Troubleshooting Indexed Data Volume

So data is flowing into Splunk. Where's it coming from? What's chatty today? Who just blew the doors off our indexing volume?

The first thing you should check is how many apps are running. Have you installed the Unix app? That can index a lot of data really quickly because it runs lots of scripted inputs. What about other apps, or other inputs? Where did you (or someone else) tell Splunk to get data from? The searches below will help you figure this out.

There are a few tools available to answer these questions.

What is available

License usage report view

Starting in 6.0, Splunk provides a consolidated resource for questions related to your license capacity and indexed volume: the License Usage Report View (LURV). LURV gives insight into your daily Splunk indexing volume and any license warnings, and provides a comprehensive view of the last 30 days of your Splunk usage with multiple reporting options. LURV on your license master is your first stop for licensing questions.

The dashboard is logically divided into two parts: one displays information about today's license usage and any warning information in the current rolling window; the other shows historical license usage during the past 30 days.

For every panel in LURV, you can click "Open in search" at the bottom left of the panel to interact with the search.

Find LURV in System > Licensing > Usage report. Read more about LURV in the Splunk docs.

If you're still on Splunk 4.3-5.x, much of this instrumentation is available as two views in the Splunk on Splunk App. Find the views in Indexing > Metrics > License usage on SoS version 3.1+.

license_usage.log

This log is available on the Splunk license master instance only. The license master logs indexed event volume every minute, based on the information the slaves send to it. Each slave maintains a table of how much it has indexed, in chunks of time. Typically that chunk of time is 1 minute, but the chunk may grow if the slave cannot contact the master -- Splunk only resets the chunk when the table is sent to the master. The table is keyed by (source, sourcetype, host) tuples; if that table grows to exceed 1000 entries (2000 on 6.0), Splunk squashes the host and source keys. So, if you have more than 1000 distinct tuples, you will find no value for the h(ost) and s(ource) fields. Splunk never squashes st (sourcetype) in the log. (See squash_threshold in server.conf.)
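
If you want to check whether squashing is happening in your deployment, here is a rough sketch (an assumption-laden example: it requires 4.3+ where the type field exists, and it assumes squashed events carry empty or missing h and s values as described above):

index=_internal source=*license_usage.log* type=Usage
| where isnull(h) OR h="" OR isnull(s) OR s=""
| stats sum(b) AS squashed_bytes by st

If this returns rows, per-host and per-source detail has been lost for those source types and you may want to look at squash_threshold.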

Evolution:

before 4.2

no license_usage.log

in 4.2.*

license_usage.log is available; license_audit.log is no longer relevant. If you're using the Splunk 4.2 license master/slaves for license management, you can no longer use license_audit.log to calculate indexed data volume. Instead, you need to use license_usage.log to check total indexed volume. The built-in "Deployment Monitor" app uses license_usage.log for indexed data volume, and metrics.log for forwarders' connectivity, etc.

Splunk 4.2 introduced a new license management feature: the license master and slaves schema. If you simply upgraded from 4.1 to 4.2 and did not configure anything for license management, license_audit.log should still log the same information as before. However, once you enable license slaves or create your own license pools, the license master starts logging usage information to license_usage.log, and license_audit.log stops logging license events. On the slaves there is no license_audit.log, because only the master tracks total indexed volumes and violations.

Also, keep in mind that license_usage.log will not log indexed volume for indexes that do not count against the license, such as _internal or summary indexes. You can check metrics.log for these non-license indexes, on both pre-4.2 and 4.2 (see the example below).
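
For instance, a sketch for eyeballing the internal and summary index volume from metrics.log (the group and field names here are the same ones used by the per_index_thruput searches later on this page):

index=_internal group="per_index_thruput" (series="_*" OR series="summary")
| eval mb=kb/1024
| timechart span=1d sum(mb) by series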

since 4.3.*

In 4.3 the license_usage.log file contains more detail, differentiated by *type*.

  • type=Usage => is the equivalent of 4.2
  • type=RolloverSummary => is the summary for the previous day for each license-slave (replaces the tedious daily sum of all the volumes). It is calculated at midnight and refers to the previous day.
  • type=SlaveWarnSummary => counts the number of violations per slave.

Always specify one type in your searches; otherwise the volume calculation ( sum(b) ) will be incorrect (volume is counted twice per day).
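
For example, a minimal daily total with the type filter in place (just a sketch; the searches further down add breakdowns by pool, source type, host, and so on):

index=_internal source=*license_usage.log* type=Usage
| timechart span=1d sum(b) AS volume_bytes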

4.3.1

Introduced the setting squash_threshold in server.conf (default value of 1000). Notes from the spec file:

    squash_threshold = 1000
    * Advanced setting.  Periodically the indexer must report to license manager the
    data indexed broken down by source,sourcetype,host.  If the number of distinct
    source,sourcetype/host tuples grows over the squash_threshold, we squash the
    host/source values and only report a breakdown by sourcetype.  This is to
    prevent explosions in memory + license_usage.log lines.  Set this with care or 
    after consulting a Splunk Support engineer, it is an advanced parameter.

6.0

  • The default for the squash_threshold setting in server.conf moves from 1000 to 2000
  • index was added to the tuple tracked in license_usage.log (in addition to source, source type, and host), and it is "guaranteed" in the same way source type is guaranteed (an example override follows the spec excerpt below). Notes from the spec file:
    squash_threshold = <positive integer>
    * Advanced setting.  Periodically the indexer must report to license manager the
    data indexed broken down by source, sourcetype, host, and index.  If the number of distinct
    (source,sourcetype,host,index) tuples grows over the squash_threshold, we squash the
    {host,source} values and only report a breakdown by {sourcetype,index}.  This is to
    prevent explosions in memory + license_usage.log lines.  Set this only
    after consulting a Splunk Support engineer.
    * Default: 2000
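
If you need the per-host/per-source breakdown for more distinct tuples, you can raise the threshold on the indexers. A minimal sketch, assuming the setting lives in the [general] stanza of server.conf and using an illustrative value; heed the spec's advice and consult Splunk Support before changing it:

# $SPLUNK_HOME/etc/system/local/server.conf on each indexer (stanza is an assumption; check server.conf.spec for your version)
[general]
squash_threshold = 5000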

license_audit.log

This log is useful only for pre-4.2. Please read the topic above about the differences in log files between pre- and post-4.2. This log records daily license usage and the quota-exceeded violation count, and it logs this information right after midnight.

metrics.log

This log maintains metrics for internal queues, internal processors, etc. It does not distinguish between data that counts against the license and data that does not. The per_index_thruput group in metrics.log collects only the ten busiest series by default. So, if you have more than ten indexes, you will need to edit the maxseries attribute in the [metrics] stanza of $SPLUNK_HOME/etc/system/local/limits.conf (an example override follows the spec excerpt below).
However, this can affect your indexing performance to some extent.
In 4.2, you can also change the interval attribute from the default of 30 seconds. You should increase the interval when you increase maxseries.

[metrics]
maxseries = <integer>
* The number of series to include in the per_x_thruput reports in metrics.log.
* Defaults to 10.

interval = <integer>
* Number of seconds between logging splunkd metrics to metrics.log.
* Minimum of 10.
* Defaults to 30.
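
For example, to track 50 series at a 60-second interval (a sketch; the values are illustrative, pick ones that fit your environment):

# $SPLUNK_HOME/etc/system/local/limits.conf
[metrics]
maxseries = 50
interval = 60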

REST/Endpoint on the license-master

If you are on the license master, searches using the "| rest" command can return current details on the usage.
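
For instance, a quick look at the pools (a minimal sketch; the fuller searches later on this page add filtering and unit conversion):

| rest splunk_server=local /services/licenser/pools
| table title used_bytes quota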

Data volume seen by the license code

You may care about this for licensing concerns. Or you may just want to sanity-check what quantity of data the licensing code is seeing. You can review some information in the Manager portion of the Splunk 4.0.x interface, or you can run a search on the _internal index to see a pretty chart, etc.

Splunk 6.0 or more recent

Same as on 4.3, with a new field "idx" (the index) that gives more detail.

  • detail per index:
index=_internal source=*license_usage.log* type=Usage 
| timechart span=1d sum(b) AS volume_b by idx

Splunk 4.3 or more recent

Similar to 4.2: the value is in bytes. In addition, a daily summary was added (faster to calculate). The field added to differentiate the event types is "type". Remarks:

  1. If you prefer more detail, you can break the data down per pool ("pool"), source ("s"), host ("h"), source type ("st"), or indexer ("i").
  2. Remember that the fields "s" and "h" can be empty if the licenser squashed them (see squash_threshold in server.conf).
  3. For details, if you have more than 10 items, switch to stats instead of timechart and narrow the time range.
  4. The indexer field (i) is the GUID of the server; you can retrieve it by looking in $SPLUNK_HOME/etc/instances.cfg.
  • sum per day per pool for the previous days:
host=mylicensemasterhost index=_internal source=*license_usage.log* type=RolloverSummary 
| bucket _time span=1d | stats sum(b) AS volume by _time pool
  • detail per pool:
index=_internal source=*license_usage.log* type=Usage 
| timechart span=1d sum(b) AS volume_b by pool 
  • detail per source type:
index=_internal source=*license_usage.log* type=Usage 
| eval s=if(s=="","unknown",s)
| eval h=if(h=="","unknown",h)
| timechart span=1d sum(b) AS volume_b by st 
  • detail per source:
index=_internal source=*license_usage.log* type=Usage 
| eval s=if(s=="","unknown",s)
| eval h=if(h=="","unknown",h)
| timechart span=1d sum(b) AS volume_b by s 
  • detail per host:
index=_internal source=*license_usage.log* type=Usage 
| eval s=if(s=="","unknown",s)
| eval h=if(h=="","unknown",h)
| timechart span=1d sum(b) AS volume_b by h
  • detail per indexer:
index=_internal source=*license_usage.log* type=Usage 
| timechart span=1d sum(b) AS volume_b by i 

To correlate several fields, use this example:

  • detail per indexer and pool, with the name of the indexer (has to run on the license master):
index=_internal source=*license_usage.log* type=Usage 
|  bucket _time span=1d 
|  stats sum(b) AS volume_bytes by _time host pool i 
| eval volume_GB=round(volume_bytes/1024/1024/1024,2) 
| rename i AS indexer_GUID 
| JOIN indexer_GUID [ | REST /services/licenser/slaves | table title label | rename title AS indexer_GUID | rename label AS indexer_name]

current usage REST/endpoint

On the license master, the REST endpoint can provide the day's current usage faster.
  • Today's License Usage per Pool:
| rest /services/licenser/pools 
| rename title AS Pool 
| search [rest splunk_server=local /services/licenser/groups | search is_active=1 | eval stack_id=stack_ids | fields stack_id] 
| eval Used=round(used_bytes/1024/1024/1024, 3) 
| eval Total=round(quota/1024/1024/1024, 3) 
| fields Pool Used Total
  • Today's Percentage of Daily License Quota Used per Pool
| rest /services/licenser/pools 
| rename title AS Pool 
| search [rest splunk_server=local /services/licenser/groups | search is_active=1 | eval stack_id=stack_ids | fields stack_id] 
| eval "% used"=round(used_bytes/quota*100,2) 
| fields Pool "% used"
You may also need to use
| rest /services/licenser/licenses
to get the real quota (rather than just seeing a value of MAX)
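
A sketch for pulling the actual stack quotas from that endpoint (the field names used here -- label, quota, stack_id -- are assumptions based on typical output; adjust to what | rest returns on your version):

| rest splunk_server=local /services/licenser/licenses
| eval quota_GB=round(quota/1024/1024/1024,3)
| table label stack_id quota_GB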

Splunk 4.2

See the remarks for 4.3 and later, except that the field "type" had not yet been introduced, so you do not need to specify it.

Splunk pre-4.2

index=_internal todaysBytesIndexed LicenseManager-Audit source=*license_audit.log* 
| eval Daily_Indexing_Volume_in_MBs = todaysBytesIndexed/1024/1024 
| timechart avg(Daily_Indexing_Volume_in_MBs) by host

This is snarfed from: http://www.splunk.com/base/Documentation/latest/Installation/AboutSplunklicenses#View_your_license_and_usage_details

You can also review the license_audit.log file itself in your Splunk installation, if you need history longer than 28 days. If undertaking this, you may find the following unix-platform incantation useful, which creates a more readable variation of the file.

 cat license_audit.log |awk '{ printf("%s\n",substr($0,0,(index($0,"]["))-1)) }' > readable-license-audit.log 


Daily volume by host: This will capture the daily percentage license volume used.

 index=_internal todaysBytesIndexed LicenseManager-Audit source=*license_audit.log* 
| eval Daily_Indexing_Volume_in_MBs = todaysBytesIndexed/1024/1024
| bucket _time span=1d 
| stats avg(Daily_Indexing_Volume_in_MBs) AS UsageMB first(licenseSize) AS LicenseSize by _time host 
| eval UsagePercent=UsageMB/LicenseSize*100 | eval UsagePercent=round(UsagePercent, 2) 
|  table _time host LicenseSize UsageMB UsagePercent 

Quick metrics summary information by host, source, source type, and index

Okay, so there's a problem with the data volume... it's higher than you expected, or higher than you were planning for. Or you just want to get a better picture of where the data is coming from in bulk. The metrics.log data already has totals for this on a reasonable interval, so we can mine it.

You can get some reports from the GUI, in the Search app > Status > Index activity > Indexing volume. In 6.0 this has been replaced by the more complete License Usage Report View, under Settings > Licensing.

Or you can run searches, which are more precise and flexible. The Splunk Metrics Reports page has searches for this purpose in the section 'How much was indexed'. For example:

host=myindexershost index=_internal group="per_host_thruput" 
| eval mb=kb/1024
| timechart span=1d sum(mb) by series
host=myindexershost index=_internal group="per_source_thruput" NOT series="*splunk/var/log*" 
| eval mb=kb/1024
| timechart span=1d sum(mb) by series
host=myindexershost index=_internal group="per_sourcetype_thruput" NOT series="splunk*" 
| eval mb=kb/1024
| timechart span=1d sum(mb) by series

These searches provide a sampling of the top producers by different categories. The default sampling size is 10, so if, for example, you expect to receive 20 source types, this will not be a complete picture of the data, but it will have the 10 busiest for each sub-minute time window. Thus, these searches give you a quick picture of what's going on generally, but not a to-the-byte accurate value.

To see how much data Splunk has actually written to your various indexes, use this search (indexes that do not count against license volume are excluded):

host=myindexershost index=_internal group="per_index_thruput" NOT series="_*" NOT series="history" NOT series="summary" 
| eval mb=kb/1024
| timechart span=1d sum(mb) by series

Manually Counting event sizes over a time range

Roughly, you can run a search that looks at all (or some) data over a range of index-time (_indextime) values, counting up the size of the actual events. For example, where the endpoints START_TIME and END_TIME are numbers in seconds since the start of the Unix epoch, the search would be

_indextime>START_TIME _indextime<END_TIME 
| eval event_size=len(_raw) 
| stats sum(event_size)

This is a *slow and expensive search*, but when you really need to know, it can be valuable. It *must* be run across a time range that can contain all possible events that were indexed in that window -- that is, regardless of how regular the event timestamps are. Typically this means it must be run over all time. The stats computation, as well as the initial filters, can of course be adjusted to look at the problem more closely.
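
A variation of the same idea, if you want to see which source types contributed most over that index-time window (same START_TIME/END_TIME placeholders as above; still slow, for the same reasons):

_indextime>START_TIME _indextime<END_TIME
| eval event_size=len(_raw)
| stats sum(event_size) AS bytes by sourcetype
| sort - bytes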

Set up a scheduled search to alert you if a license violation occurs

First off, learn how to set up a daily scheduled search with an email alert trigger here. You can then use one of the search strings below as the basis for your alert. The license_audit.log-based one, for example, only returns results if the violation counter has incremented, and it checks every host separately (handy if you have more than one indexer):

Since Splunk 4.2 using a REST API call

This alert calls the REST API to get the current license usage for the day, then compares it to a fixed limit.

Example in savedsearches.conf, run every hour; set it up on the license master only. Update the email address and the WHERE condition to your needs.

[license_usage_alert]
action.email = 1
action.email.inline = 1
action.email.reportServerEnabled = 0
action.email.sendresults = 1
action.email.to = mywonderful@opsteam.domain.org
alert.digest_mode = True
alert.suppress = 0
alert.suppress.period = 5s
alert.track = 1
auto_summarize.dispatch.earliest_time = -1d@h
counttype = number of events
cron_schedule = 0 * * * *
dispatch.earliest_time = -1h@h
dispatch.latest_time = @h
displayview = flashtimeline
enableSched = 1
quantity = 0
relation = greater than
request.ui_dispatch_view = flashtimeline
search = | rest /services/licenser/pools | where title="auto_generated_pool_enterprise" | eval used_GB=used_bytes/1024/1024/1024  | table title used_bytes used_GB | WHERE used_GB > 100

Since Splunk 4.2 using splunkd.log or license_usage.log

On the license master, using splunkd.log or license_usage.log:

Simple alert: schedule this search each day on the license master; you want an email every day this event is recorded.

 index=_internal source="*splunkd.log" "Indexing quota exceeded"

Detailed alert on the volume used the previous day:

  • Specify your license pool name and your pool size.
  • Search over the previous day (earliest=-1d@d latest=@d). On 4.3 and later, consider adding type=Usage to the search to avoid double counting (see the remark about type above).
 index=_internal source=*license_usage* pool="$mypoolname$" | eval GB=b/1024/1024/1024 | stats sum(GB) by pool | where 'sum(GB)' > $mypoolsize$

If you want, you can also schedule searches running in the middle of the day to send you warnings if the pool usage is already high.

  • Specify your license pool name and your alert volume.
  • Search over the current day until now (earliest=@d latest=now); the same remark about type=Usage applies.
index=_internal source=*license_usage* pool="$mypoolname$" | eval GB=b/1024/1024/1024 | stats sum(GB) by pool | where 'sum(GB)' > $myalertvolume$

Example of an alert that checks the total daily usage and also checks the volume over 4-hour periods with another condition (update it for your needs):

 index=_internal source=*license_usage.log* type=Usage pool="auto_generated_pool_enterprise" 
 | eval GB=b/1024/1024/1024 
 | bucket _time span=4h
 | stats sum(GB) AS usageGB by _time pool
 | eval licenseGB=250
 | eval alertPercent=25 
 | eval alertDailyPercent=80 
 | streamstats sum(usageGB) AS usageDailyGB 
 | WHERE ( usageDailyGB>((alertDailyPercent*licenseGB)/100 ) ) OR  ( usageGB>((alertPercent*licenseGB)/100 ) )

On Splunk (pre-4.2 and after), using the violations counter in license_audit.log

index=_internal source=*license_audit.log LicenseManager-Audit  
| streamstats current=f global=f window=1 first(quotaExceededCount) as next_quotaExceededCount by host 
| eval quotadiff = next_quotaExceededCount - quotaExceededCount 
| search quotadiff>0

Example in savedsearches.conf:

[new violation alert]
action.email = 1
action.email.sendresults = 1
action.email.to = admin@XXXXXXXXXX.com
counttype = number of events
cron_schedule = 0 1 * * *
dispatch.earliest_time = -24h@h
dispatch.latest_time = now
displayview = flashtimeline
enableSched = 1
quantity = 1
relation = rises by
request.ui_dispatch_view = flashtimeline
search = index=_internal source=*license_audit.log LicenseManager-Audit | streamstats current=f global=f window=1 first(quotaExceededCount) as next_quotaExceededCount by host | eval quotadiff = next_quotaExceededCount - quotaExceededCount | search quotadiff>0