Community:Modifying indexed data via export and import

Splunk ships with some tools which can take buckets from an index, export them to csv format, and then take that csv and construct a new bucket from it. Together these give you the ability to modify indexed data or prune "deleted" data from the index, but doing so still involves a good deal of elbow grease and a fair amount of performance overhead.
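
For orientation, both tools are run via "splunk cmd", following the invocation pattern used in the script further down: exporttool takes a bucket directory, an output file, and -csv; importtool takes a destination bucket directory and an input file. The bucket path and csv filename below are made-up examples:

splunk cmd exporttool /opt/splunk/var/lib/splunk/defaultdb/db/db_1262300400_1262210400_12 /tmp/bucket12.csv -csv
# ... modify /tmp/bucket12.csv however you need ...
splunk cmd importtool /tmp/new_bucket /tmp/bucket12.csv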

Below is a script that I used to perform this cycle, with an intervening filter written in Python. View it as a very, very rough example of how you could go about it. The steps are basically:

  1. Export the data to csv with exporttool
  2. Futz with the csv
  3. Feed the data into a bucket with importtool (steps 1-3 run as one streaming pipeline; see the sketch after this list)
  4. Rename the bucket manually
  5. Swap the original and new bucket in the live index (tricksy!)
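
Steps 1 through 3 can be strung together as one streaming pipeline, with the filter reading csv on stdin and writing csv on stdout. A rough sketch, where your-filter.py stands in for whatever filter you write (the script below uses my windows-time-fixer.py, which is not included here) and the bucket paths are placeholders:

splunk cmd exporttool /path/to/index/db/db_1262300400_1262210400_12 /dev/stdout -csv \
    | python your-filter.py \
    | splunk cmd importtool /tmp/new_bucket /dev/stdin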

If you undertake this sort of thing, do a proof of concept with non-sensitive data, and ideally also test it by creating a new index containing your test buckets, before going whole hog.

The core ideas are:

  • run exporttool and importtool in a streaming fashion, with some code of your design which modifies the csv as you need (throw away unwanted data, change errant data, etc.)
  • remove the original bucket atomically, e.g. with a rename to a location on the same filesystem
  • add the new bucket atomically, e.g. with a rename from a location on the same filesystem
  • do not have both buckets (with the same ID) present in the index at the same time (see the sketch after this list)
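
Concretely, the last three bullets come down to two renames plus flagging the index metadata for rebuild, which is what the script below does. The new bucket gets named db_<newest>_<oldest>_<id>, keeping the original bucket's id (the script recomputes the timestamps from the new bucket's tsidx files); the paths and timestamps here are illustrative only:

mv /path/to/index/db/db_1262300400_1262210400_12 /tmp/              # original bucket leaves the index atomically
mv /tmp/new_bucket /path/to/index/db/db_1262300400_1262210400_12    # new bucket takes its place atomically
touch /path/to/index/db/meta.dirty                                  # request a metadata rebuild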

At some future point we will make all of this much more automated, but currently the performance cost of export and import is too high for this to be something you fire off casually. You may find that running both at once soaks up over 3GB of RAM.

rodman@joshbook:~> less /Users/jrodman/rewrite_buckets.sh
#!/bin/bash
BUCKET_TMPDIR=/tmp
if [ x"${SPLUNK_HOME}" = x ]; then
    SPLUNK_HOME=/opt/splunk
fi

. $SPLUNK_HOME/setSplunkEnv

if [ x"${SPLUNK_BIN}" = x ]; then
    SPLUNK_BIN=splunk
fi

EXPORT_CMD="$SPLUNK_BIN cmd exporttool"
PYTHON="python"
IMPORT_CMD="$SPLUNK_BIN cmd importtool"

year_end_file=$BUCKET_TMPDIR/year-2009-end
touch -t 200912311100 $year_end_file

for bucket in "$@"; do
    if [ "$bucket" -ot "$year_end_file" ]; then
        echo >&2 "$bucket older than year 2009-2010 rollover, skipping"
        continue
    fi
    bucket_dir=$(dirname $bucket)
    bucket_name=$(basename $bucket)

    # Export the bucket, pipe the csv through the filter, and import the result into a scratch bucket.
    NEW_BUCKET=$BUCKET_TMPDIR/new_bucket
    $EXPORT_CMD $bucket /dev/stdout -csv | $PYTHON ~/windows-time-fixer.py | $IMPORT_CMD $NEW_BUCKET /dev/stdin
    # Work out what the new bucket must be called: db_<newest>_<oldest>_<id>, where the
    # timestamps are parsed from the new bucket's tsidx file names and the id is kept
    # from the original bucket.
    bucket_id=$(echo $bucket | sed 's/.*_//')
    (cd $NEW_BUCKET; ls *.tsidx | sed 's/-[0-9]\+\.tsidx$//' | sed 's/-/ /') | {
        global_low=0
        global_high=0
        while read high low; do
            if [ $global_high -eq 0 ] || [ $high -gt $global_high ]; then
                global_high=$high
            fi
            if [ $global_low -eq 0 ] || [ $low -lt $global_low ]; then
                global_low=$low
            fi
        done
        REAL_BUCKET_NAME=db_${global_high}_${global_low}_${bucket_id}
        # Swap: move the original bucket out of the index, then move the new one in.
        # Both renames stay within a filesystem, so each is atomic.
        if [ -d $bucket ]; then
            mv $bucket $BUCKET_TMPDIR
        else
            echo >&2 bucket $bucket vanished while processing.. inserting new one and hoping for the best
        fi
        mv $NEW_BUCKET $bucket_dir/$REAL_BUCKET_NAME
    }
    # request metadata rebuild
    touch $bucket_dir/meta.dirty
    # rm -rf $BUCKET_TMPDIR/$bucket_name # delete old one
done

rm $year_end_file