From Splunk Wiki
Post-Crash Fsck Repair
If Splunk suffers an unclean shutdown (power loss, hardware failure, OS failure, sysadmin goes postal, etc.), some buckets can be left in a bad state where not all data is searchable.
If this happens, on splunk startup (in 4.2.2 and later) you will get a message such as:
Splunk has detected an unclean shutdown. The database should be checked in order to ensure correct search results, but this may take a very long time, depending on your system.
If you would like to check/repair the database, stop Splunk and run: splunk fsck --all --repair
When it says "a very long time", estimate 20-30 minutes to repair a 10GB bucket. So if you don't need Splunk to be continuously available (you have backup nodes, etc.), you can follow this simple sequence:
# splunk stop
# splunk fsck --all --repair
# splunk start
In v6.2, the syntax has changed slightly.
# splunk fsck repair --all-buckets-all-indexes
Tricky Recovery with Reduced Downtime
With current technology, we can only safely rebuild buckets that Splunk is not actively searching. To rebuild buckets without keeping your Splunk instance down, you need to take those buckets away from Splunk.
Which buckets are unhappy?
To determine which buckets are not happy, run the following command while Splunk is down:
# splunk fsck --all
This will not actually modify any data, but it will emit output such as:
bucket=/home/jrodman/p4/splunk/branches/hammer/built/var/lib/splunk/_internaldb/db/db_1309313285_1309228812_31 NEEDS REPAIR: count mismatch tsidx=128428 source-metadata=128425
SUMMARY: We have detected 1 buckets (2464688 bytes of compressed rawdata) need rebuilding. Depending on the speed of your server, this may take from 0 to 1 minutes. You can use the --repair option to fix
This says it found one problematic bucket, and gives its exact directory path. If there were multiple broken buckets, each directory would be listed.
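If there are many broken buckets, it helps to collect their paths into a file for the steps that follow. A minimal sketch, assuming the fsck output was saved to a file and using the "bucket=" and "NEEDS REPAIR" markers from the sample output above:

```shell
# Hypothetical helper: print the directory path from each
# "NEEDS REPAIR" line of saved fsck output.
list_broken_buckets() {
    grep 'NEEDS REPAIR' "$1" | sed 's/^bucket=//; s/ NEEDS REPAIR.*//'
}
# e.g.:  splunk fsck --all > fsck.out
#        list_broken_buckets fsck.out > broken_buckets.txt
```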
Move the buckets out of the way
Here, I create a sibling directory in the index for doing the repair, a location that splunkd will ignore. Any location on the same filesystem is convenient, but be sure you can remember which buckets belong to which indexes.
# mkdir /home/jrodman/p4/splunk/branches/hammer/built/var/lib/splunk/_internaldb/repair_dir
# mv /home/jrodman/p4/splunk/branches/hammer/built/var/lib/splunk/_internaldb/db/db_1309313285_1309228812_31 /home/jrodman/p4/splunk/branches/hammer/built/var/lib/splunk/_internaldb/repair_dir
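With several broken buckets, the same move can be done in a loop. A sketch, assuming the broken bucket paths are listed one per line in a file (the list-file name and directory layout are assumptions to adjust for your system):

```shell
# Hypothetical helper: move each listed bucket into a sibling
# repair_dir inside its index, on the same filesystem.
move_buckets_aside() {
    list=$1; repair_dir=$2
    mkdir -p "$repair_dir"
    while IFS= read -r bucket; do
        mv "$bucket" "$repair_dir/"
    done < "$list"
}
# e.g.: move_buckets_aside broken_buckets.txt \
#       /home/jrodman/.../var/lib/splunk/_internaldb/repair_dir
```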
Start splunk, and repair the buckets
These steps can be done in either order. We can now start Splunk again; it will not be able to see, use, or search these buckets at all until we finish repairing them. Splunk can continue to work while we fix the buckets.
To start splunk again:
# splunk start
For each broken bucket:
# splunk rebuild <bucket_directory>
# splunk rebuild /home/jrodman/p4/splunk/branches/hammer/built/var/lib/splunk/_internaldb/repair_dir/db_1309313285_1309228812_31
Rebuild is not very chatty; expect no output for a long time.
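The per-bucket rebuild can also be looped over everything in repair_dir. A sketch; the splunk binary path is passed in as a parameter (an assumption, so the loop can be exercised anywhere), and db_* matches the bucket naming seen above:

```shell
# Hypothetical helper: run "splunk rebuild" on every bucket directory
# that was moved into repair_dir.
rebuild_all() {
    splunk_cmd=$1; repair_dir=$2
    for bucket in "$repair_dir"/db_*; do
        [ -d "$bucket" ] || continue
        "$splunk_cmd" rebuild "$bucket"
    done
}
# e.g.: rebuild_all /home/jrodman/.../bin/splunk \
#       /home/jrodman/.../var/lib/splunk/_internaldb/repair_dir
```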
Reinsert the buckets into the index
The simple, safe way is to stop Splunk and then move the repaired buckets back into the directories they came from.
The tricky, hacky way is to atomically put them back in place: mv them from a location on the same filesystem into the index directory they came from. This works live, although Splunk may take some time to notice them and incorporate them into all the statistics and numbers again.
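The live reinsert relies on mv within one filesystem being an atomic rename, so splunkd sees each bucket appear whole or not at all. A sketch, with directory names assumed to match the example layout above:

```shell
# Hypothetical helper: move every repaired bucket from repair_dir
# back into the index's db directory (same filesystem = atomic rename).
reinsert_buckets() {
    repair_dir=$1; db_dir=$2
    for bucket in "$repair_dir"/db_*; do
        [ -d "$bucket" ] || continue
        mv "$bucket" "$db_dir/"
    done
}
# e.g.: reinsert_buckets .../repair_dir .../db
```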