Community:PostCrashFsckRepair

Post-Crash Fsck Repair

If Splunk suffers an unclean shutdown (power loss, hardware failure, OS failure, sysadmin goes postal, etc.), then some buckets can be left in a bad state where not all data is searchable.

If this happens, on splunk startup (in 4.2.2 and later) you will get a message such as:

Splunk has detected an unclean shutdown.  The database should be checked in
order to ensure correct search results, but this may take a very long time,
depending on your system.
If you would like to check/repair the database, stop Splunk and run:
  splunk fsck --all --repair

Simple Recovery

When it says "a very long time", estimate roughly 20-30 minutes per 10 GB bucket repaired. So if you don't need Splunk to be continuously available (for example, you have backup nodes), you can follow the simple sequence:

# splunk stop
# splunk fsck --all --repair
# splunk start

In v6.2, the syntax has changed slightly:

# splunk fsck repair --all-buckets-all-indexes

Tricky Recovery with Reduced Downtime

With current technology, we can only safely rebuild buckets that Splunk is not actively searching. So, to rebuild buckets without keeping your Splunk instance down, you need to take those buckets out of Splunk's view.

Which buckets are unhappy?

To determine which buckets are unhappy, run the following command while splunk is stopped:

# splunk fsck --all

This will not actually modify any data, but it will emit output such as:

bucket=/home/jrodman/p4/splunk/branches/hammer/built/var/lib/splunk/_internaldb/db/db_1309313285_1309228812_31 NEEDS REPAIR: count mismatch tsidx=128428 source-metadata=128425

SUMMARY: We have detected 1 buckets (2464688 bytes of compressed rawdata) need rebuilding. Depending on the speed of your server, this may take from 0 to 1 minutes. You can use the --repair option to fix


It's saying it found one problem bucket, and it gives the exact directory path. If there were multiple bad buckets, multiple directories would be listed.
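If the list is long, you can capture the fsck output and pull out just the affected bucket paths. This is only a sketch based on the output format shown above (the exact wording may differ between versions), and /tmp/fsck_output.txt is just an example location:

# splunk fsck --all 2>&1 | tee /tmp/fsck_output.txt
# grep 'NEEDS REPAIR' /tmp/fsck_output.txt | sed 's/^bucket=//; s/ NEEDS REPAIR.*//'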

Move the buckets out of the way

Here, I create a sibling directory inside the index to do the repair in, a location that splunkd will ignore. Any location on the same filesystem will do, but be sure to keep track of which buckets belong to which indexes.

# mkdir /home/jrodman/p4/splunk/branches/hammer/built/var/lib/splunk/_internaldb/repair_dir
# mv /home/jrodman/p4/splunk/branches/hammer/built/var/lib/splunk/_internaldb/db/db_1309313285_1309228812_31 /home/jrodman/p4/splunk/branches/hammer/built/var/lib/splunk/_internaldb/repair_dir
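If the fsck output flagged several buckets in this index, a small shell loop can move them all at once. This is only a sketch: it reuses the /tmp/fsck_output.txt file captured above, and it assumes every flagged bucket belongs to this same index. If buckets from several indexes are affected, handle each index separately so you can put every bucket back where it came from.

  REPAIR_DIR=/home/jrodman/p4/splunk/branches/hammer/built/var/lib/splunk/_internaldb/repair_dir
  grep 'NEEDS REPAIR' /tmp/fsck_output.txt \
      | sed 's/^bucket=//; s/ NEEDS REPAIR.*//' \
      | while read -r bucket; do
            mv "$bucket" "$REPAIR_DIR"/
        done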


Start splunk, and repair the buckets

These steps can be done in either order. We can now start splunk again; it will not be able to see, use, or search the moved buckets at all until we finish repairing them and put them back. Splunk can continue to work while we fix the buckets.

To start splunk again:

# splunk start

For each broken bucket:

# splunk rebuild <bucket_directory>

for example:

# splunk rebuild /home/jrodman/p4/splunk/branches/hammer/built/var/lib/splunk/_internaldb/repair_dir/db_1309313285_1309228812_31

Rebuild is not very chatty; expect no output for a long time.
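If you moved several buckets into repair_dir, a loop over that directory saves some typing. This is just a sketch; adjust the path to your own repair directory. It relies on bucket directories following the db_* naming seen in the examples above:

  for bucket in /home/jrodman/p4/splunk/branches/hammer/built/var/lib/splunk/_internaldb/repair_dir/db_*; do
      splunk rebuild "$bucket"
  done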

Reinsert the buckets into the index

The simple, safe way is to stop splunk, move the repaired buckets back into the directories they came from, and then start splunk again.

The tricky, hacky way is to put them back in place atomically. This means mv'ing them from a location on the same filesystem into the index directory they came from. This works while splunk is live, although splunk may take some time to notice the buckets and incorporate them into all the statistics and counts again.
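For the simple approach, the whole sequence looks roughly like this, reusing the example bucket from above (a sketch; substitute your own paths):

# splunk stop
# mv /home/jrodman/p4/splunk/branches/hammer/built/var/lib/splunk/_internaldb/repair_dir/db_1309313285_1309228812_31 /home/jrodman/p4/splunk/branches/hammer/built/var/lib/splunk/_internaldb/db/
# splunk start

For the live variant, skip the stop and start and just run the mv while splunk is running; because repair_dir sits on the same filesystem as the index, the move is atomic.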
