Community:HowToDetectAndFixCorruptedBucketInIndexerClustering pre6.5
From Splunk Wiki
How to find corrupted buckets in Indexer Clustering Peers
This procedure covers corrupted buckets in Indexer Clustering on versions before 6.5. In pre-v6.5, 'fsck scan' reported unsearchable buckets as corrupted buckets. As a result, running 'fsck repair' on everything it reported also rebuilt unsearchable buckets into searchable ones and took many hours to complete. NOTE: After v6.5.x??, you do not need to follow this procedure. You can simply run "fsck scan" to identify corrupted buckets.
1. @CM (Cluster Master), enable maintenance mode
2. @CP (Cluster Peer), make a list of Searchable buckets and save it to an output file
Example of exported/output file name: list_searchable_buckets_<Name_of_ClusterPeer>_<date_and_time>.out

# In SplunkWeb, you can run the following search to make a list of buckets
# Then, export the output
# - Please replace <Cluster_Peer_ServerName> with the proper Cluster Peer's name
# - You can run this Splunk search from the Cluster Master
| dbinspect splunk_server=<Cluster_Peer_ServerName> index=*
| table bucketId path
| join bucketId
    [ | rest splunk_server=<Cluster_Peer_ServerName> /services/cluster/slave/buckets filter=search_state=Searchable
      | rex field=title "^(?<repl_index>[^\~]+)"
      | search repl_index="*"
      | rename title AS bucketId
      | fields bucketId search_state
      | regex search_state="^Searchable$"
      | table bucketId ]
| table path

# Example of Splunk CLI at the Cluster Peer
# Please run this search as one line
./bin/splunk search "| dbinspect splunk_server=<Cluster_Peer_ServerName> index=* | table bucketId path | join bucketId [ | rest splunk_server=<Cluster_Peer_ServerName> /services/cluster/slave/buckets filter=search_state=Searchable | rex field=title \"^(?<repl_index>[^\~]+)\" | search repl_index=\"*\" | rename title AS bucketId | fields bucketId search_state | regex search_state=\"^Searchable$\" | table bucketId ] | table path" > list_searchable_buckets_`hostname`_`date +"%Y%m%d_%H%M"`.out
3. @CP, Stop Splunk
- Please stop Splunk soon after step 2 so that the state of the buckets (Searchable/Unsearchable) does not change
4. @CP, Check for corrupted buckets (index data) with the "fsck scan" command
# $SPLUNK_HOME/bin/splunk fsck scan --all-buckets-all-indexes --v 2>&1 | tee fsck_scan_`hostname`_`date +"%Y%m%d_%H%M"`.out
=> Requires Splunk to be stopped
=> This may take hours to finish.
=> This checks the integrity of index files and rawdata (journal.gz)
=> NOTE: The fsck command in pre-v6.5 reports "Unsearchable" buckets as corrupted buckets (Fixed in v6.5)
=> (Option) Use "--all-buckets-one-index --index-name=<name>" instead of "--all-buckets-all-indexes" if you want to check a specific index.
# $SPLUNK_HOME/bin/splunk check-rawdata-format -allindexes 2>&1 | tee spl_check_rawdata_`hostname`_`date +"%Y%m%d_%H%M"`.out
5. @CP, Check and list all buckets that need to be fixed
# grep -B 1 "Corruption" <output_file_created_in_the_previous_step> | grep -oP "(?<=bucket=')[^\']+" > fsck_corrupted_buckets_`hostname`.out
- Example
# grep -B 1 "Corruption" fsck_scan_clusterpeer01_20170215_1032.out | grep -oP "(?<=bucket=')[^\']+" > fsck_corrupted_buckets_`hostname`.out
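The extraction in step 5 can be tried out on a small stand-in file before pointing it at a multi-hour scan log. The sketch below is only an illustration: the two-line log layout is invented to exercise the grep patterns from step 5, and real "fsck scan" output may look different. It assumes GNU grep with -P support, as the step itself does.

```shell
#!/bin/bash
# Sketch of the step 5 extraction on a HYPOTHETICAL scan log.
# The log lines below are made up; only the grep pipeline comes from the procedure.
workdir=$(mktemp -d)

cat > "$workdir/fsck_scan_sample.out" <<'EOF'
SAMPLE bucket='/opt/splunk/var/lib/splunk/defaultdb/db/db_1_2_3' checked
SAMPLE Corruption reported
SAMPLE bucket='/opt/splunk/var/lib/splunk/defaultdb/db/db_4_5_6' checked
SAMPLE no problem reported
EOF

# -B 1 keeps the line before each "Corruption" match, which is where this
# sample puts the bucket path; -oP prints only the text after bucket='
grep -B 1 "Corruption" "$workdir/fsck_scan_sample.out" \
  | grep -oP "(?<=bucket=')[^\']+" > "$workdir/fsck_corrupted_buckets_sample.out"

cat "$workdir/fsck_corrupted_buckets_sample.out"
```

Only the bucket whose line sits directly before a "Corruption" line ends up in the output file; the second bucket is left out.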
6. Make a list of corrupted buckets by finding the buckets common to the output files of Step 2 and Step 5
- sort the files and compare them
Example:
- sort the output file of step 2 ( list_searchable_buckets_clusterpeer01_20170215_1005.out )
# sort list_searchable_buckets_clusterpeer01_20170215_1005.out -o list_searchable_buckets_clusterpeer01_20170215_1005.out
- sort the output file of step 5 ( fsck_corrupted_buckets_clusterpeer01.out )
# sort fsck_corrupted_buckets_clusterpeer01.out -o fsck_corrupted_buckets_clusterpeer01.out
- find only the common buckets ( Searchable but corrupted buckets )
# comm -12 fsck_corrupted_buckets_clusterpeer01.out list_searchable_buckets_clusterpeer01_20170215_1005.out > list_corrupted_buckets_clusterpeer01.out
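The sort-and-compare in step 6 is a set intersection of two path lists. The following is a minimal sketch on made-up data, assuming both files hold one bucket path per line; every file name and bucket path in it is hypothetical.

```shell
#!/bin/bash
# Sketch of step 6 on made-up data: file names and bucket paths are hypothetical.
workdir=$(mktemp -d)

# Stand-in for the step 2 output (paths of Searchable buckets)
printf '%s\n' \
  /idx/main/db/db_30_20_3 \
  /idx/main/db/db_10_05_1 \
  /idx/main/db/db_20_10_2 > "$workdir/searchable.out"

# Stand-in for the step 5 output (paths that fsck flagged)
printf '%s\n' \
  /idx/main/db/db_20_10_2 \
  /idx/main/db/rb_40_30_4 \
  /idx/main/db/db_10_05_1 > "$workdir/flagged.out"

# comm needs both inputs sorted with the same collation; LC_ALL=C pins it
export LC_ALL=C
sort -u "$workdir/searchable.out" -o "$workdir/searchable.out"
sort -u "$workdir/flagged.out" -o "$workdir/flagged.out"

# -12 hides lines unique to either file, leaving only the common lines
comm -12 "$workdir/flagged.out" "$workdir/searchable.out" > "$workdir/corrupted.out"
cat "$workdir/corrupted.out"
```

The flagged-but-unsearchable path (rb_40_30_4 here) drops out, which is the point of the comparison on pre-6.5 peers.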
7. @CP, Run fsck repair for the corrupted buckets
WARNING: This may take hours if there are many buckets to be fixed
# $SPLUNK_HOME/bin/splunk fsck repair --one-bucket --bucket-path=${BUCKET}
# ./bin/splunk fsck repair --one-bucket --include-hots --bucket-path=db_1430843723_1430786206_13_ --log-to--splunkd-log --ignore-read-error
- Example script
#-----------------------------------
#!/bin/bash
# Run this script from the $SPLUNK_HOME directory
# Usage:
#   /bin/bash splunk_fsck_buckets.sh <a_file_containing_a_list_of_bucket_paths>
# Example:
#   /bin/bash splunk_fsck_buckets.sh list_corrupted_buckets_clusterpeer01.out
SPLUNK_HOME=$(pwd)
FILENAME="${1}"
if [ -z "${FILENAME}" ]
then
    echo "No file name provided"
    exit 1
fi
# Read one bucket path per line and skip blank lines
while IFS= read -r BUCKET
do
    [ -z "${BUCKET}" ] && continue
    # </dev/null keeps splunk from consuming the rest of the list on stdin
    "${SPLUNK_HOME}/bin/splunk" fsck repair --one-bucket --bucket-path="${BUCKET}" < /dev/null
done < "${FILENAME}"
#-----------------------------------
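Since each repair can be slow, it may be worth checking the bucket list before starting the loop. The sketch below shows one way to do that; filter_existing_buckets is a hypothetical helper written for this illustration, not a Splunk command, and the demo paths are made up.

```shell
#!/bin/bash
# Sketch: keep only list entries that still exist as directories.
# filter_existing_buckets is a hypothetical helper, not part of Splunk.
filter_existing_buckets() {
    local list_file="$1" bucket
    while IFS= read -r bucket; do
        [ -z "$bucket" ] && continue              # skip blank lines
        if [ -d "$bucket" ]; then
            printf '%s\n' "$bucket"               # usable entry goes to stdout
        else
            printf 'skipping missing bucket: %s\n' "$bucket" >&2
        fi
    done < "$list_file"
}

# Demo on a throwaway directory tree (made-up bucket names)
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/db_20_10_2" "$demo_dir/db_10_05_1"
printf '%s\n' "$demo_dir/db_20_10_2" "" "$demo_dir/db_gone" "$demo_dir/db_10_05_1" \
  > "$demo_dir/list.out"

filter_existing_buckets "$demo_dir/list.out" > "$demo_dir/list_checked.out"
cat "$demo_dir/list_checked.out"
```

The checked list can then be passed to the repair script in place of the original one.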
8. @CP, Start Splunk
9. @CM, disable maintenance mode