Community:Summary Indexing Back Fill

From Splunk Wiki

Backfill a summary index with archived data

This simple Python script backfills a summary index with archived data. It breaks a search into intervals and feeds each interval to the summary index through the CLI. It can optionally use the dispatch API to summarize large amounts of data.

Note: This document refers to 3.x versions of Splunk. For information about managing backfill in Splunk 4.x, see "Manage summary index gaps and overlaps" in the Knowledge Manager manual. For a general overview of 4.x summary index functionality, see "Use summary indexing for increased reporting efficiency" in the Knowledge Manager manual.

Script Usage

To use the script, change the variables in the top section of the script to suit your needs.

Important Notes

  • Make sure you keep the addinfo and collect commands at the end of your search string.
  • Don't include any time modifiers; these are added automatically.
  • The intervalInMins variable should match your planned collection interval, which helps keep the data in your summary index consistent.
    • If you summarize data daily, use intervalInMins=1440; if hourly, intervalInMins=60; and so on.
  • The intervalInMins variable should account for the max results limit.
    • For example, if you have 100,000 events in 1 hour and maxResults = 10,000, set intervalInMins = 6 (or lower).
  • Enable the useDispatch variable if it is not possible to work within the max results limit. The dispatch option gives you much more flexibility in setting intervalInMins.
    • maxOut is not the same as maxResults -- maxOut controls the number of output events, not the number of events processed for the summarization (which is unlimited when using dispatch).
  • Run splunk login before running the script to authenticate to Splunk.
  • Ideally, source $SPLUNK_HOME/bin/setSplunkEnv first to set the Splunk environment correctly.
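
The max results sizing rule in the notes above is simple arithmetic. The sketch below uses a hypothetical helper, max_interval_mins, which is not part of the script, and assumes events arrive at a roughly even rate across the hour.

```python
def max_interval_mins(events_per_hour, max_results):
    """Largest whole-minute interval expected to stay within max_results.

    Hypothetical helper; assumes events are spread evenly across the hour.
    A result of 0 means even a one-minute interval is expected to exceed
    the limit, which is the case the notes suggest useDispatch for.
    """
    if events_per_hour <= 0:
        raise ValueError("events_per_hour must be positive")
    # minutes = max_results / (events per minute), kept in integer math
    return (max_results * 60) // events_per_hour


# The example from the notes: 100,000 events per hour, maxResults = 10,000
print(max_interval_mins(100000, 10000))  # 6
```

Real event rates are rarely uniform, so a value below the computed one leaves headroom for bursts.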

import os,datetime

# Purpose: Execute summary index searches on archived data via Splunk CLI.  This version
#          provides the option to use the dispatch API instead of the vanilla search command.
#          The dispatch API can be used in cases where it is simply not possible to work
#          within the max results limit.

#---------- change these variables ----------

splunkSearch = "sourcetype=foo | stats count by host | addinfo | collect index=summary"

startDate = "04/13/2008"
startTime = "00:00:00"
endDate = "04/17/2008"
endTime = "00:00:00"

intervalInMins = 10

# default maxresults for CLI searches is 100
maxResults = 50000

# enable dispatch API when maxResults is simply too small
# set maxOut as appropriate, but the default 100 should be ok
useDispatch = True
maxOut = 100

#---------- begin script ----------

# break down the start/end date and time

startDateTokens = startDate.split("/")
startMonth = int(startDateTokens[0])
startDay = int(startDateTokens[1])
startYear = int(startDateTokens[2])

startTimeTokens = startTime.split(':')
startHour = int(startTimeTokens[0])
startMin = int(startTimeTokens[1])
startSec = int(startTimeTokens[2])

endDateTokens = endDate.split("/")
endMonth = int(endDateTokens[0])
endDay = int(endDateTokens[1])
endYear = int(endDateTokens[2])

endTimeTokens = endTime.split(':')
endHour = int(endTimeTokens[0])
endMin = int(endTimeTokens[1])
endSec = int(endTimeTokens[2])

# initialize start and end dates/times

startDate = datetime.datetime(startYear,startMonth,startDay,startHour,startMin,startSec)
endDate = datetime.datetime(startYear,startMonth,startDay,startHour,startMin,startSec)
endDate += datetime.timedelta(minutes=int(intervalInMins))
finishLineDate = datetime.datetime(endYear,endMonth,endDay,endHour,endMin,endSec)

# generate and run splunk search commands via CLI

i = 0

while (startDate < finishLineDate):

  # if near the finish line, set endDate = finishLineDate
  if (endDate >= finishLineDate):
    endDate = finishLineDate

  # convert date/time format to MM/DD/YYYY:HH:mm:ss
  startTime = startDate.strftime("%m/%d/%Y:%H:%M:%S")
  endTime = endDate.strftime("%m/%d/%Y:%H:%M:%S")

  searchCmd = "starttime=\"" + startTime + "\" endtime=\"" + endTime + "\" " + splunkSearch

  # run it! use the dispatch API if enabled, otherwise the plain search command
  if (bool(useDispatch)):
    searchCLI = "splunk dispatch \"" + searchCmd + "\" -maxout " + str(maxOut)
  else:
    searchCLI = "splunk search \"" + searchCmd + "\" -maxresults " + str(maxResults)
  print("Executing [" + searchCLI + "]")
  result = str.split(os.popen(searchCLI).read())
  print(result)

  # increment start and end dates by intervalInMins
  startDate += datetime.timedelta(minutes=int(intervalInMins))
  endDate += datetime.timedelta(minutes=int(intervalInMins))

  # track number of searches run
  i += 1

print("Done running " + str(i) + " searches!")
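
The interval logic in the loop above can be checked on its own, without running any searches. This is a minimal sketch, assuming a hypothetical generator named search_windows that is not part of the original script. It yields the same starttime/endtime strings the script builds and clamps the last window to the finish line.

```python
import datetime


def search_windows(start, finish, interval_mins):
    """Yield (starttime, endtime) strings in MM/DD/YYYY:HH:MM:SS format.

    Hypothetical helper mirroring the script's loop; it only computes
    the time windows and does not call the Splunk CLI.
    """
    if interval_mins <= 0:
        raise ValueError("interval_mins must be positive")
    step = datetime.timedelta(minutes=interval_mins)
    fmt = "%m/%d/%Y:%H:%M:%S"
    while start < finish:
        # the final window stops at the finish line instead of overshooting it
        end = min(start + step, finish)
        yield start.strftime(fmt), end.strftime(fmt)
        start += step


# Same range and interval as the variables at the top of the script
windows = list(search_windows(datetime.datetime(2008, 4, 13),
                              datetime.datetime(2008, 4, 17), 10))
print(len(windows))   # 576
print(windows[0])     # ('04/13/2008:00:00:00', '04/13/2008:00:10:00')
print(windows[-1])    # ('04/16/2008:23:50:00', '04/17/2008:00:00:00')
```

Printing the windows first is a cheap way to confirm the date range and interval before committing to a long backfill run.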


For the scenario on the summary indexing page, the following settings backfill the first 14 days of August 2008 for the "Do Not Click - Summary Index - Firewall Daily Summary Source IP" search.

splunkSearch = "eventtype=firewall | stats count by src_ip | sort count desc | head 200 | addinfo | collect addtime index=summary marker=report=firewall_daily_summary_src_ip"

startDate = "08/01/2008"
startTime = "00:00:00"
endDate = "08/15/2008"
endTime = "00:00:00"

intervalInMins = 1440
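
As a quick sanity check on these settings, the number of searches the script will run can be worked out from the date range and the interval. The sketch below is illustrative only; the variable names are assumptions and not part of the script.

```python
import datetime
import math

# Values taken from the settings above
start = datetime.datetime(2008, 8, 1, 0, 0, 0)
finish = datetime.datetime(2008, 8, 15, 0, 0, 0)
interval = datetime.timedelta(minutes=1440)

# One search per interval; a partial final interval still needs a search
searches = math.ceil((finish - start) / interval)
print(searches)  # 14
```

Because endDate is midnight on 08/15, the last window ends at the close of 08/14, which gives one daily search for each of the first 14 days.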
