Community:Modifying indexed data via export and import
From Splunk Wiki
Splunk includes tools that can export the buckets of an index to csv format and build a new bucket from that csv. Together they make it possible to modify indexed data or prune "deleted" data from the index, but doing so still takes a good deal of elbow grease and carries a fair amount of performance overhead.
Attached is a script that I used to perform this cycle, with an intervening filter written in Python. Treat it as a very rough example of how you could go about it. The steps are basically:
- Export the data to csv with exporttool
- Futz with the csv
- Feed the data into a bucket with importtool
- Rename the bucket manually
- Swap the original and new bucket in the live index (tricksy!)
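The "futz with the csv" step can be any program that reads csv on stdin and writes csv on stdout. Below is a minimal sketch of such a filter in bash. The function name `drop_matching_lines` and the `password=` pattern are hypothetical, and the sketch assumes one event per line; events with embedded newlines inside quoted fields would need a real csv parser instead.

```shell
#!/bin/bash
# Hypothetical filter stage: copy stdin to stdout, dropping every line that
# matches the given extended regular expression.
# Assumption: each event occupies exactly one line of the exported csv.
drop_matching_lines() {
    local pattern=$1
    # grep -v exits 1 when no line survives; for a filter that is not an
    # error, so only a real grep failure (status 2 or higher) is reported.
    grep -v -E -- "$pattern" || [ $? -eq 1 ]
}

# Intended use between the two tools (bucket paths are placeholders):
#   splunk cmd exporttool "$bucket" /dev/stdout -csv \
#       | drop_matching_lines 'password=' \
#       | splunk cmd importtool "$new_bucket" /dev/stdin
```

Because the filter streams, nothing has to be written to disk between export and import.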
If you undertake this sort of thing, do a proof of concept with nonsensitive data first, and consider testing it further by creating a new index from your test buckets before going whole hog.
The core ideas are:
- Run exporttool and importtool in a streaming fashion, with some code of your own design in between that modifies the csv as you need (throw away unwanted data, change errant data, etc.)
- Remove the original bucket atomically, e.g. with a rename to a location on the same filesystem
- Add the new bucket atomically, e.g. with a rename from a location on the same filesystem
- Do not have both buckets (with the same ID) present in the index at the same time
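The swap described in the last three points could be sketched roughly as follows. The helper names `mount_point_of` and `swap_bucket` and the idea of a separate directory for retired buckets are illustrative assumptions, not part of any Splunk tooling. The same-filesystem check compares mount points as reported by `df -P`, which assumes mount points without whitespace.

```shell
#!/bin/bash
# Print the mount point that holds a path (last column of POSIX df output).
# Assumption: mount points contain no whitespace.
mount_point_of() {
    df -P "$1" | awk 'NR == 2 { print $NF }'
}

# Hypothetical helper: swap_bucket OLD_BUCKET NEW_BUCKET RETIRE_DIR
# NEW_BUCKET must already carry its final directory name.
swap_bucket() {
    local old=$1 new=$2 retire_dir=$3
    local index_dir index_mp
    index_dir=$(dirname "$old")
    index_mp=$(mount_point_of "$index_dir")

    # A rename is only atomic within one filesystem, so refuse to cross mounts.
    if [ "$(mount_point_of "$new")" != "$index_mp" ] ||
       [ "$(mount_point_of "$retire_dir")" != "$index_mp" ]; then
        echo >&2 "staging directories are not on the same filesystem as $index_dir"
        return 1
    fi

    # Take the original out first so the two buckets never coexist in the index.
    mv "$old" "$retire_dir/" || return 1
    mv "$new" "$index_dir/$(basename "$new")"
}
```

Moving the old bucket out before moving the new one in leaves a brief window with neither bucket present, which is the safer of the two possible orderings given the rule above.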
At a future point we will be making all of this much more automated, but currently the performance cost of export and import is too high to make it easy to fire off. You may find that running them both at once soaks up over 3GB of RAM.
rodman@joshbook:~> less /Users/jrodman/rewrite_buckets.sh
#!/bin/bash
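# Scratch area for the new bucket and for displaced originals.
# For the renames below to be atomic it must be on the same filesystem as the index.
BUCKET_TMPDIR=/tmp
if [ -z "${SPLUNK_HOME}" ]; then
    SPLUNK_HOME=/opt/splunk
fi
. "$SPLUNK_HOME/setSplunkEnv"
if [ -z "${SPLUNK_BIN}" ]; then
    SPLUNK_BIN=splunk
fi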
EXPORT_CMD="$SPLUNK_BIN cmd exporttool"
PYTHON="python"
IMPORT_CMD="$SPLUNK_BIN cmd importtool"
# Reference file stamped 2009-12-31 11:00; buckets last modified before it are skipped
year_end_file=$BUCKET_TMPDIR/year-2009-end
touch -t 200912311100 "$year_end_file"
for bucket in "$@"; do
    if [ "$bucket" -ot "$year_end_file" ]; then
        echo >&2 "$bucket older than year 2009-2010 rollover, skipping"
        continue
    fi
    bucket_dir=$(dirname "$bucket")
    bucket_name=$(basename "$bucket")
    NEW_BUCKET=$BUCKET_TMPDIR/new_bucket
    # Stream: export to csv, run the filter, import into a fresh bucket
    $EXPORT_CMD "$bucket" /dev/stdout -csv | $PYTHON ~/windows-time-fixer.py | $IMPORT_CMD "$NEW_BUCKET" /dev/stdin
    # The bucket id is the last _-separated field of the bucket directory name
    bucket_id=$(echo "$bucket_name" | sed 's/.*_//')
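    # Strip the trailing -<number>.tsidx from each file name, leaving "high low" pairs;
    # [0-9][0-9]* is used instead of [0-9]\+ so that non-GNU sed also accepts it
    (cd "$NEW_BUCKET" && ls *.tsidx | sed 's/-[0-9][0-9]*\.tsidx$//' | sed 's/-/ /') | {
        global_low=0
        global_high=0
        while read high low; do
            if [ "$global_high" -eq 0 ] || [ "$high" -gt "$global_high" ]; then
                global_high=$high
            fi
            if [ "$global_low" -eq 0 ] || [ "$low" -lt "$global_low" ]; then
                global_low=$low
            fi
        done
        REAL_BUCKET_NAME=db_${global_high}_${global_low}_${bucket_id}
        # Move the original out before moving the new bucket in,
        # so the two never sit in the index at the same time
        if [ -d "$bucket" ]; then
            mv "$bucket" "$BUCKET_TMPDIR"
        else
            echo >&2 "bucket $bucket vanished while processing; inserting new one and hoping for the best"
        fi
        mv "$NEW_BUCKET" "$bucket_dir/$REAL_BUCKET_NAME"
    }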
    # request metadata rebuild
    touch "$bucket_dir/meta.dirty"
    # rm -rf "$BUCKET_TMPDIR/$bucket_name" # delete old one
done
rm "$year_end_file"
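
The bucket renaming step in the script derives the new directory name from the tsidx file names inside the freshly imported bucket. As a rough, standalone sketch of that calculation: the function name `bucket_name_from_tsidx` is hypothetical, and it assumes, as the script above does, that each tsidx file name has the form high-low-number.tsidx with the two time bounds first.

```shell
#!/bin/bash
# Hypothetical helper: read tsidx file names on stdin, one per line, and print
# the bucket directory name db_<high>_<low>_<id> that spans all of them.
# Assumption: file names look like <high>-<low>-<number>.tsidx.
bucket_name_from_tsidx() {
    local bucket_id=$1 high=0 low=0 latest earliest rest
    while IFS=- read -r latest earliest rest; do
        # Keep the largest upper bound seen so far.
        if [ "$high" -eq 0 ] || [ "$latest" -gt "$high" ]; then high=$latest; fi
        # Keep the smallest lower bound seen so far.
        if [ "$low" -eq 0 ] || [ "$earliest" -lt "$low" ]; then low=$earliest; fi
    done
    if [ "$high" -eq 0 ]; then
        echo >&2 "no tsidx files found"
        return 1
    fi
    echo "db_${high}_${low}_${bucket_id}"
}

# Example: (cd "$NEW_BUCKET" && ls *.tsidx) | bucket_name_from_tsidx "$bucket_id"
```

Failing on empty input avoids producing a bucket named db_0_0_<id> when the import wrote no tsidx files.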