From Splunk Wiki
How summary indexing can help you
Summary indexing lets you handle large volumes of data efficiently by reducing the data into smaller subsets, working on those individually, and finally collating the results to produce a final result.
An example of where summary indexing is commonly used is with large volumes of firewall or web access log data. Imagine that you need to report on monthly firewall traffic (e.g. top sources, top destinations, top services). Assuming your environment generates around 10 million firewall events a day, the report at the end of the month would have to process roughly 280M to 310M events, so running a monthly report could take a long time.
How can summary indexing help you?
The simplest way is to run a daily search that summarizes the firewall data and stores the result in a separate index (summary). You could store it in the same index as the raw data, but retention periods normally differ for summary data.
For example, schedule the saved search Do Not Click - Summary Index - Firewall Daily Summary Source IP to run every day at 02:00am and collect this summary data:
starthoursago=26 endhoursago=2 eventtype=firewall | stats count by src_ip | sort count desc | head 200
The actual search executed by the system would look something like this (split into multiple lines to ease reading):
search starthoursago=26 endhoursago=2 eventtype=firewall
| stats count by src_ip
| sort count desc
| head 200
| addinfo
| collect addtime index="summary" marker="info_search_name=\"Do Not Click - Summary Index - Firewall Daily Summary Source IP\",report=\"firewall_daily_summary_src_ip\""
Notice that several additional commands were automatically added by the system and there was also an additional definition on the saved search page. An extra field (report) was added with a value (firewall_daily_summary_src_ip) as well. This field will help differentiate between multiple sets on the summary index for later reporting purposes.
This would reduce the daily amount of data from 10M events to 200 events containing the top 200 src_ip addresses. Now at the end of each month the system will only need to deal with 5600 to 6200 events to calculate the monthly top 20. The search for that would be:
startmonthsago=1 index=summary report=firewall_daily_summary_src_ip | stats sum(count) by src_ip | sort sum(count) desc | head 20
This search would produce the desired results fairly quickly.
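The two-stage flow above can be sketched outside of Splunk. The following Python is only an illustration under stated assumptions: the function names (daily_summary, monthly_top) and the sample events are hypothetical, and the code mimics the shape of the two searches rather than anything Splunk does internally.

```python
from collections import Counter

def daily_summary(events, limit=200):
    """Mimic `stats count by src_ip | sort count desc | head 200` for one day."""
    counts = Counter(e["src_ip"] for e in events)
    return [{"src_ip": ip, "count": n} for ip, n in counts.most_common(limit)]

def monthly_top(summary_rows, limit=20):
    """Mimic `stats sum(count) by src_ip | sort sum(count) desc | head 20`."""
    totals = Counter()
    for row in summary_rows:
        totals[row["src_ip"]] += row["count"]
    return totals.most_common(limit)

# Hypothetical sample data: two days of raw firewall events.
day1 = [{"src_ip": "10.0.0.1"}] * 3 + [{"src_ip": "10.0.0.2"}] * 2
day2 = [{"src_ip": "10.0.0.2"}] * 4 + [{"src_ip": "10.0.0.3"}]

# Each daily run appends its rows to the (simulated) summary index.
summary_index = daily_summary(day1) + daily_summary(day2)
top = monthly_top(summary_index)

# Upper bound on summary rows per month at 200 rows a day (28- and 31-day months).
rows_per_month = [200 * days for days in (28, 31)]
```

Here the monthly step only reads the rows in summary_index and never touches the raw events, which is where the saving comes from.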
You can already see a benefit: not only can we quickly report on last month's data, but we can also do it on a rolling basis for the last 30 days. The benefit becomes even clearer if you need weekly, quarterly and annual reports, which is a very common scenario. If needed, you could also run another summary collection on the summary data itself (e.g. monthly summaries of daily data).
Similar searches would be required for top destinations (dst_ip) and top services (dst_port). Ideally you would want to schedule all of these at least an hour apart to reduce the load on the system. The configured searches for this would be:
Every day at 03:00am run the saved search: Do Not Click - Summary Index - Firewall Daily Summary Destination IP - extra fields report=firewall_daily_summary_dst_ip
starthoursago=27 endhoursago=3 eventtype=firewall | stats count by dst_ip | sort count desc | head 200
Every day at 04:00am run the saved search: Do Not Click - Summary Index - Firewall Daily Summary Destination Port - extra fields report=firewall_daily_summary_dst_port
starthoursago=28 endhoursago=4 eventtype=firewall | stats count by dst_port | sort count desc | head 200
In some cases it might be possible to combine the three searches into a single search. This would improve performance, but it can compromise the accuracy of your final results, because the top results are then selected per combination of fields rather than per field. If you combine the searches, it is highly advisable to increase the number of results collected. For example:
Every day at 02:00am run the saved search Do Not Click - Summary Index - Firewall Daily Summary - extra fields report=firewall_daily_summary
starthoursago=26 endhoursago=2 eventtype=firewall | stats count by src_ip, dst_ip, dst_port | sort count desc | head 2000
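To see why the combined search can lose accuracy, consider the following Python sketch. It assumes hypothetical events and hypothetical helper names (top_sources_direct, top_sources_from_combined): a source that spreads its traffic over many destinations has many small (src_ip, dst_ip, dst_port) rows, and a head that is too small discards most of them.

```python
from collections import Counter

# Hypothetical events as (src_ip, dst_ip, dst_port) tuples.
events = []
# "scanner" talks to 50 different destinations, 10 events each (500 events).
events += [("scanner", f"host{i}", 80) for i in range(50) for _ in range(10)]
# "backup" talks to a single destination (100 events).
events += [("backup", "nas", 445)] * 100

def top_sources_direct(events, limit):
    """Top sources computed from the raw events."""
    return Counter(src for src, _, _ in events).most_common(limit)

def top_sources_from_combined(events, head, limit):
    """Top sources computed from a combined summary truncated to `head` rows."""
    combined = Counter(events).most_common(head)
    totals = Counter()
    for (src, _, _), n in combined:
        totals[src] += n
    return totals.most_common(limit)

direct = top_sources_direct(events, 1)
truncated = top_sources_from_combined(events, head=5, limit=1)
widened = top_sources_from_combined(events, head=51, limit=1)
```

With head=5 the summary reports "backup" as the top source, although "scanner" generated five times as many events. Widening the head to cover all 51 combinations restores the correct answer, which is the reason for collecting 2000 rows instead of 200 in the combined search.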
Important: The searches were named Do Not Click - * on purpose. They shouldn't be shared or added to any dashboards: clicking or running these searches from the UI will pollute your summary index, which will produce incorrect results.
The values used above are for illustration purposes. With 10 million events a day, you should probably run the summary collection searches more often so that each run has fewer events to deal with.
Also, to deal with large volumes you should use the CLI dispatch command instead of the search command. In this case you will need to schedule the searches through the OS cron facility, and you are responsible for the full search command, including the addinfo and collect commands.
- Always calculate a larger range for each individual subset than what you expect to have for the final results (e.g. if calculating a daily top 10, calculate hourly top 100)
- When calculating the subsets, shift the time window by at least 5-10 minutes to allow for late delivery of data (e.g. if you run the collection at 10 minutes past the hour, collect from 70 minutes ago to 10 minutes ago)
- When calculating the totals, remember to use sum() instead of count(): each summary event already carries a count, so count() would only count summary events
- If your subset searches take too long to run, consider running the collection more often to reduce the amount of data, and run them through dispatch
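Two of the points above, the shifted time window and sum() versus count(), can be illustrated with a short Python sketch. The helper name collection_window, the sample timestamp and the sample summary rows are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

def collection_window(run_time, span_minutes=60, lag_minutes=10):
    """Return (earliest, latest) for a run that lags behind real time."""
    latest = run_time - timedelta(minutes=lag_minutes)
    earliest = latest - timedelta(minutes=span_minutes)
    return earliest, latest

# A run at 03:10 covers 70 to 10 minutes ago, i.e. 02:00 to 03:00.
earliest, latest = collection_window(datetime(2024, 1, 1, 3, 10))

# Hypothetical hourly summary rows, shaped like `stats count by src_ip` output.
summary_rows = [
    {"hour": 0, "src_ip": "a", "count": 120},
    {"hour": 0, "src_ip": "b", "count": 5},
    {"hour": 1, "src_ip": "a", "count": 80},
    {"hour": 1, "src_ip": "b", "count": 7},
]

# Counting rows (like count()) only tells you how many summary rows exist.
rows_per_ip = Counter(r["src_ip"] for r in summary_rows)

# Summing the stored counts (like sum(count)) recovers the original totals.
events_per_ip = Counter()
for r in summary_rows:
    events_per_ip[r["src_ip"]] += r["count"]
```

Counting rows gives 2 for both addresses, which says nothing about traffic, while summing the stored counts gives 200 and 12.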
Caveats & Issues
- One of the main issues with summary indexing is that the system will only have summary data from the point when you start collection (e.g. with a daily collection, the first day's data will be for the day before you start collecting). At times it is important to backfill the summary index with older data. A script exists to deal with these situations; details can be found at Backfill a summary index with archived data.
- Using distinct counts with summary indexing will most likely produce skewed results. This happens any time a value crosses the summary window boundary (e.g. the value is seen at 1:59 and 2:00 when the hour is the boundary), because it is then counted once in each window.
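The distinct count caveat can be reproduced with a few lines of Python. The events below are hypothetical: one address is seen at 1:59 and again at 2:00, on either side of an hourly boundary.

```python
# Hypothetical events as (minute_of_day, src_ip); 119 is 1:59 and 120 is 2:00.
events = [(119, "10.0.0.1"), (120, "10.0.0.1"), (130, "10.0.0.2")]

# Distinct count computed directly over the raw events.
true_distinct = len({ip for _, ip in events})

# Distinct count computed per hourly summary window.
hourly_distinct = {}
for minute, ip in events:
    hourly_distinct.setdefault(minute // 60, set()).add(ip)

# Adding up the per-window distinct counts counts 10.0.0.1 twice.
summed_distinct = sum(len(ips) for ips in hourly_distinct.values())
```

The raw data contains 2 distinct addresses, but summing the hourly distinct counts yields 3, because the summary rows no longer record which values were behind each count.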