Community:HowToDetectAndFixCorruptedBucketInIndexerClustering pre6.5

From Splunk Wiki

How to find corrupted buckets in Indexer Clustering Peers

  This procedure addresses the corrupted bucket issue in Indexer Clustering before v6.5. In pre-v6.5 releases, 'fsck scan' reported unsearchable buckets as corrupted buckets. 
  As a result, running 'fsck repair' on everything it reported also made unsearchable buckets searchable, and took many hours to complete.
  
  NOTE:  From v6.5.x?? onward, this procedure is not needed. Simply run "fsck scan" to identify corrupted buckets. 
   1. @CM (Cluster Master), enable maintenance mode 
   2. @CP (Cluster Peer), list the Searchable buckets and save the list to an output file
       Example of exported/output file name: 
           list_searchable_buckets_<Name_of_ClusterPeer>_<date_and_time>.out
       # In SplunkWeb, you can run the following search to make a list of buckets
       # Then, export the output 
       #  - Please replace <Cluster_Peer_ServerName> with a proper Cluster Peer's name
       #  - You can run this Splunk search from Cluster Master
       | dbinspect splunk_server=<Cluster_Peer_ServerName> index=* | table bucketId path
       | join bucketId [ rest splunk_server=<Cluster_Peer_ServerName> /services/cluster/slave/buckets filter=search_state=Searchable 
          | rex field=title "^(?<repl_index>[^\~]+)"
          | search repl_index="*"
          | rename title AS bucketId   
          | fields bucketId search_state 
          | regex search_state="^Searchable$" 
          | table bucketId ] | table path
          
       # Example of Splunk CLI at Cluster Peer
       # Please try this search as one line
       ./bin/splunk search "| dbinspect splunk_server=<Cluster_Peer_ServerName> index=* | table bucketId path | join bucketId [ rest splunk_server=<Cluster_Peer_ServerName> /services/cluster/slave/buckets filter=search_state=Searchable | rex field=title \"^(?<repl_index>[^\~]+)\" | search repl_index=\"*\" | rename title AS bucketId | fields bucketId search_state | regex search_state=\"^Searchable$\" | table bucketId ] | table path" > list_searchable_buckets_`hostname`_`date +"%Y%m%d_%H%M"`.out
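Step 6 compares this file line by line, so any header row or trailing padding in the exported output would prevent matches. The following is a minimal sketch of a cleanup pass. It assumes that bucket paths are absolute (start with `/`) and that the export may contain extra lines or whitespace. The function name `clean_bucket_list` is hypothetical and not part of Splunk.

```shell
#!/bin/bash
# Hypothetical helper (not a Splunk command): keep only lines that look like
# absolute paths, strip leading/trailing whitespace, and print them sorted
# and de-duplicated. Assumes bucket paths start with "/".
clean_bucket_list() {
    # $1: file exported in step 2
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$1" | grep '^/' | sort -u
}
```

For example, `clean_bucket_list list_searchable_buckets_clusterpeer01_20170215_1005.out` prints the cleaned list to standard output, leaving the exported file unchanged.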
   3. @CP, Stop Splunk 
        - Please stop Splunk soon after step 2 so that the state of the buckets (Searchable/Unsearchable) does not change
   4. @CP, Check for corrupted buckets (index data) with the "fsck scan" command
          # $SPLUNK_HOME/bin/splunk fsck scan --all-buckets-all-indexes --v 2>&1 | tee fsck_scan_`hostname`_`date +"%Y%m%d_%H%M"`.out
               => Requires Splunk to be stopped
               => This may take hours to finish. 
               => This checks the integrity of the index files and rawdata (journal.gz)
               => NOTE: The fsck command in pre-v6.5 reports "Unsearchable" buckets as corrupted buckets (Fixed in v6.5)
               => (Option) Use "--all-buckets-one-index --index-name=<name>" instead of "--all-buckets-all-indexes" if you want to check a specific index.
                   $SPLUNK_HOME/bin/splunk check-rawdata-format -allindexes 2>&1 | tee spl_check_rawdata_`hostname`_`date +"%Y%m%d_%H%M"`.out
 
   5. @CP, List all buckets that need to be fixed
          # grep -B 1 "Corruption" <output_file_created_in the_previous_step> | grep -oP "(?<=bucket=')[^\']+" > fsck_corrupted_buckets_`hostname`.out
          -  Example
          # grep -B 1 "Corruption" fsck_scan_clusterpeer01_20170215_1032.out | grep -oP "(?<=bucket=')[^\']+" > fsck_corrupted_buckets_`hostname`.out
   6. Make a list of corrupted buckets by finding the buckets common to the output files of Step 2 and Step 5  
       - Sort both files, then compare them
       Example:
           - Sort the output file of step 2 ( list_searchable_buckets_clusterpeer01_20170215_1005.out )
           # sort list_searchable_buckets_clusterpeer01_20170215_1005.out -o list_searchable_buckets_clusterpeer01_20170215_1005.out
           - Sort the output file of step 5 ( fsck_corrupted_buckets_clusterpeer01.out )
           # sort fsck_corrupted_buckets_clusterpeer01.out -o fsck_corrupted_buckets_clusterpeer01.out
           - Keep only the common buckets ( Searchable but corrupted buckets )
           # comm -12 fsck_corrupted_buckets_clusterpeer01.out list_searchable_buckets_clusterpeer01_20170215_1005.out > list_corrupted_buckets_clusterpeer01.out
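The comparison in step 6 can also be wrapped in a small function that sorts temporary copies, so the original output files are not rewritten. This is a sketch only. The function name `common_buckets` is hypothetical, and it assumes `mktemp` is available on the Cluster Peer.

```shell
#!/bin/bash
# Hypothetical helper: print the lines that appear in BOTH files.
# comm needs sorted input, so sorted copies are written to temporary
# files and the originals are left untouched.
common_buckets() {
    # $1: list of searchable bucket paths (step 2)
    # $2: list of buckets flagged by fsck (step 5)
    _a=$(mktemp) || return 1
    _b=$(mktemp) || { rm -f "$_a"; return 1; }
    sort -u "$1" > "$_a"
    sort -u "$2" > "$_b"
    comm -12 "$_a" "$_b"
    _rc=$?
    rm -f "$_a" "$_b"
    return "$_rc"
}
```

Usage would look like `common_buckets list_searchable_buckets_clusterpeer01_20170215_1005.out fsck_corrupted_buckets_clusterpeer01.out > list_corrupted_buckets_clusterpeer01.out`.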
   7. @CP, run "fsck repair" on each corrupted bucket
       WARNING: This may take hours if there are many buckets to fix
         $SPLUNK_HOME/bin/splunk fsck repair --one-bucket --bucket-path=${BUCKET}
         ./bin/splunk fsck repair --one-bucket --include-hots --bucket-path=db_1430843723_1430786206_13_ --log-to--splunkd-log --ignore-read-error
       - Example script
       # Run this script from $SPLUNK_HOME directory
       # Usage: 
       # /bin/bash splunk_fsck_buckets.sh <a_file_contains_list_of_bucket_paths>
       # Example:
       # /bin/bash splunk_fsck_buckets.sh list_corrupted_buckets_clusterpeer01.out
       #-----------------------------------
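       #!/bin/bash
       # Assumes the current directory is the $SPLUNK_HOME directory
       SPLUNK_HOME=$(pwd)
       FILENAME="${1}"
       if [ -z "${FILENAME}" ]
       then
         echo "No file name provided"
         exit 1
       else
           # Read one bucket path per line and repair each bucket in turn
           while IFS= read -r BUCKET
           do
               # Skip blank lines
               [ -z "${BUCKET}" ] && continue
               # Redirect stdin so the command cannot consume the rest of the list
               "$SPLUNK_HOME/bin/splunk" fsck repair --one-bucket --bucket-path="${BUCKET}" < /dev/null
           done < "${FILENAME}"
       fi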
       #-----------------------------------
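Before starting a repair run that may take hours, it may help to confirm that every listed path still exists on the peer. The following is a minimal sketch of such a pre-check. The function name `existing_buckets` is hypothetical, and it assumes each line of the list is a bucket directory path.

```shell
#!/bin/bash
# Hypothetical pre-check: print only the listed paths that exist as
# directories; report the others on stderr so they can be reviewed
# before the repair script is run.
existing_buckets() {
    # $1: file with one bucket path per line
    while IFS= read -r bucket || [ -n "$bucket" ]
    do
        # Skip blank lines
        [ -z "$bucket" ] && continue
        if [ -d "$bucket" ]
        then
            printf '%s\n' "$bucket"
        else
            printf 'missing: %s\n' "$bucket" >&2
        fi
    done < "$1"
}
```

The standard output can be redirected to a new list file and passed to the repair script, while the "missing" messages stay on the terminal.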
   8. @CP, Start Splunk
   9. @CM, disable maintenance mode