From Splunk Wiki
|The information in this topic has moved to the official Splunk documentation. See About metrics.log in the Splunk Enterprise Troubleshooting Manual.|
Metrics.log contains a variety of introspection information for reviewing Splunk's behavior. Let's look at what's available:
Metrics is a periodic report, emitted roughly every 30 seconds, of recent activity across various Splunk subsystems.
Structure of the lines
01-27-2010 15:43:54.913 INFO Metrics - group=pipeline, name=parsing, processor=utf8, cpu_seconds=0.000000, executes=66, cumulative_hits=301958
First, the boilerplate: the timestamp, the severity (which is INFO for metrics events), and then the kind of event, "Metrics".
Next comes the group, which indicates what kind of metrics data the line carries. There are a few types:
Pipeline messages are reports on the Splunk pipelines, which are the strung-together pieces of 'machinery' that process and manipulate events flowing into and out of the Splunk system. You can see how many times data reached a given machine in the pipeline (executes), and how much CPU time each machine used (cpu_seconds).
Plotting totals of cpu_seconds by processor can show you where the CPU time is going in indexing activity. Looking at the numbers for executes can give you an idea of dataflow. For example, if you see:
group=pipeline, name=merging, processor=aggregator, ..., executes=998
group=pipeline, name=merging, processor=readerin, ..., executes=698
group=pipeline, name=merging, processor=regexreplacement, ..., executes=698
group=pipeline, name=merging, processor=sendout, ..., executes=698
then it's pretty clear that a large portion of your items aren't making it past the aggregator. This might indicate that many of your events are multiline and are being combined in the aggregator before being passed along.
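To illustrate, here is a minimal sketch of tallying executes per processor from pipeline lines. The parsing helper and the sample lines are illustrative assumptions, not an official Splunk tool; real metrics.log lines carry more fields, which the key=value regex tolerates.

```python
# Tally "executes" per pipeline processor from metrics.log-style lines.
# parse_metrics and the sample lines below are illustrative assumptions.
import re
from collections import defaultdict

def parse_metrics(line):
    """Extract key=value pairs from a metrics line into a dict."""
    return dict(re.findall(r'(\w+)=("[^"]*"|[^,\s]+)', line))

lines = [
    "group=pipeline, name=merging, processor=aggregator, executes=998",
    "group=pipeline, name=merging, processor=readerin, executes=698",
    "group=pipeline, name=merging, processor=regexreplacement, executes=698",
    "group=pipeline, name=merging, processor=sendout, executes=698",
]

executes = defaultdict(int)
for line in lines:
    kv = parse_metrics(line)
    if kv.get("group") == "pipeline":
        executes[kv["processor"]] += int(kv["executes"])

# A processor with far more executes than the downstream processors
# (here, aggregator at 998 vs sendout at 698) suggests events are being
# combined (or dropped) at that stage.
print(executes["aggregator"] - executes["sendout"])  # → 300
```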
Queue lines look as follows
... group=queue, name=parsingqueue, max_size=1000, filled_count=0, empty_count=8, current_size=0, largest_size=2, smallest_size=0
Most of these values aren't interesting. current_size, especially considered in aggregate across events, can indicate which portions of Splunk indexing are the bottlenecks. If current_size remains near zero, then the Splunk indexing system probably isn't being taxed in any way. If the queues remain near 1000, then more data is being fed into the system (at that time) than it can process in total.
Sometimes you will see messages such as
... group=queue, name=parsingqueue, blocked!!=true, max_size=1000, filled_count=0, empty_count=8, current_size=0, largest_size=2, smallest_size=0
This message contains the 'blocked' string, indicating that it was full, and someone tried to add more, and couldn't. A queue will become unblocked as soon as the code pulling items out of it pulls an item. Many blocked queue messages in a sequence indicate that data is not flowing at all for some reason. A few scattered blocked messages indicate that flow control is operating, and is normal for a busy indexer.
If you want to look at the queue data in aggregate, graphing the average of current_size is probably a good starting point.
There are queues in place for data going into the parsing pipeline and for data between parsing and indexing. Each networking output also has its own queue, which can be useful for determining whether data is being sent promptly, or whether some network or receiving-system limitation is in the way.
Generally, filled_count and empty_count cannot be productively used for inferences.
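As a starting point for the aggregate view suggested above, here is a sketch that averages current_size per queue and counts blocked!! samples. The sample lines are made up for illustration; a real analysis would read many intervals from metrics.log.

```python
# Average current_size per queue, and count blocked!! samples.
# Sample lines are illustrative assumptions, not from a real log.
import re
from collections import defaultdict

def parse_metrics(line):
    # Keys can contain '!' (e.g. blocked!!), so allow it in the key pattern.
    return dict(re.findall(r'([\w!]+)=("[^"]*"|[^,\s]+)', line))

lines = [
    "group=queue, name=parsingqueue, max_size=1000, current_size=0",
    "group=queue, name=parsingqueue, max_size=1000, current_size=120",
    "group=queue, name=parsingqueue, blocked!!=true, max_size=1000, current_size=1000",
]

sizes = defaultdict(list)
blocked = defaultdict(int)
for line in lines:
    kv = parse_metrics(line)
    if kv.get("group") == "queue":
        sizes[kv["name"]].append(int(kv["current_size"]))
        if kv.get("blocked!!") == "true":
            blocked[kv["name"]] += 1

for name, vals in sizes.items():
    avg = sum(vals) / len(vals)
    # Averages near max_size point at the bottleneck stage;
    # averages near zero mean the stage is keeping up.
    print(f"{name}: avg current_size={avg:.1f}, blocked samples={blocked[name]}")
```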
Thruput [sic] messages
Thruput lines (thruput being Splunk's spelling of the English word throughput) come in a few flavors.
Firstly, thruput is measured in the indexing pipeline. If your data is not reaching this pipeline for some reason, it will not appear in this data.
First there is a catchall line, that looks like this:
... group=thruput, name=index_thruput, instantaneous_kbps=0.287598, instantaneous_eps=1.000000, average_kbps=0.270838, total_k_processed=74197, load_average=1.345703
This is the best line to look at when working on performance tuning or evaluating indexing load, etc. It is trying to capture the total indexing data load. On my box, there is basically nothing to speak of.
- Note: In thruput lingo, 'kbps' does not mean kilobits per second; it means kilobytes per second. The industry-standard way to write this would be something like KBps, kBps, or KB/s.
- instantaneous_kbps over time is probably the most useful figure to look at in aggregate, but I've rarely been interested in this line.
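Since the kilobytes-vs-kilobits naming trips people up, here is a minimal parsing sketch using the catchall line from above; the regex helper is an illustrative assumption.

```python
# Read the index_thruput catchall line. Splunk's "kbps" fields are
# kilobytes per second, so multiply by 8 to get kilobits per second.
import re

line = ("group=thruput, name=index_thruput, instantaneous_kbps=0.287598, "
        "instantaneous_eps=1.000000, average_kbps=0.270838, "
        "total_k_processed=74197, load_average=1.345703")

kv = dict(re.findall(r'(\w+)=([^,\s]+)', line))
kilobytes_per_sec = float(kv["average_kbps"])
kilobits_per_sec = kilobytes_per_sec * 8  # "kbps" here is KB/s, not kbit/s
print(f"{kilobytes_per_sec:.3f} KB/s = {kilobits_per_sec:.3f} kbit/s")
```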
Following the catchall are a variety of breakouts of the indexing thruput, lines such as:
... group=per_host_thruput, series="joshbook.splunk.com", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
... group=per_index_thruput, series="_internal", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
... group=per_source_thruput, series="/applications/splunk4/var/log/splunk/metrics.log", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
... group=per_sourcetype_thruput, series="splunkd", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
Aside: in 4.1 and earlier, these lines had fewer elements, lacking the ev, avg_age, and max_age attributes:
... group=per_host_thruput, series="joshbook.splunk.com", kbps=0.261530, eps=1.774194, kb=8.107422
Here the data load is broken out by host, index, source, and source type. This can be useful for answering two questions:
- which data categories are busy?
- when were my data categories busy?
The series value says which host, or index, etc. The kb value indicates the number of kilobytes processed since the last sample. Graphing kb in aggregate can be informative. The summary indexing status dashboard uses this data, for example.
- ev is a simple total count of events during the sampling period.
- kbps, as before, indicates kilobytes per second as averaged over the sampling period.
avg_age and max_age refer to the time difference between the moment an event was seen by the thruput processor in the indexing queue and the time the event occurred (or, more properly, the time that Splunk decided the event occurred).
- max_age is the largest difference between the current time and the perceived time of the events coming through the thruput processor.
- avg_age is the average difference between the current time and the perceived time of the events coming through the thruput processor.
- Note: the per_x_thruput categories are not complete. By default they show the ten busiest of each type, for each sampling window. If you have 2000 active forwarders, you cannot expect to see the majority of them in this data. The sampling quantity can be adjusted, but doing so increases the chattiness of metrics.log, and thus the indexing load and the size of the _internal index. The value is adjustable in limits.conf, under [metrics]: maxseries = num
I recommend ignoring the 'eps' value, as it has accuracy issues.
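The per-series graphing described above can be sketched as follows. The parsing helper and sample lines are illustrative assumptions; remember that only the busiest series (default 10, per the limits.conf note above) appear in each sampling window, so totals for quiet series are incomplete.

```python
# Total indexed kilobytes per series from per_*_thruput lines.
# Sample lines are illustrative assumptions.
import re
from collections import defaultdict

def parse_metrics(line):
    return dict(re.findall(r'(\w+)=("[^"]*"|[^,\s]+)', line))

lines = [
    'group=per_sourcetype_thruput, series="splunkd", kb=8.107422, ev=2606',
    'group=per_sourcetype_thruput, series="access_combined", kb=120.5, ev=900',
    'group=per_sourcetype_thruput, series="splunkd", kb=9.2, ev=2700',
]

kb_by_series = defaultdict(float)
for line in lines:
    kv = parse_metrics(line)
    if kv.get("group", "").endswith("_thruput") and "series" in kv:
        # series values are quoted in the log; strip the quotes.
        kb_by_series[kv["series"].strip('"')] += float(kv["kb"])

print(kb_by_series["splunkd"])  # ≈ 17.307
```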
udpin_connections lines are essentially metering on UDP input.
group=udpin_connections, 2514, sourcePort=2514, _udp_bps=0.00, _udp_kbps=0.00, _udp_avg_thruput=0.00, _udp_kprocessed=0.00, _udp_eps=0.00
Most of these are relatively self-explanatory:
- bps: bytes per second
- kbps: kilobytes per second
- eps: events (packets) per second
- _udp_kprocessed is a running total of kilobytes processed since udp input processor start (typically since splunk start, but if reconfigured may reset).
No information is available on _udp_avg_thruput at this time.
Be aware that at high packet rates it's quite possible to max out the operating system's ability, let alone Splunk's, to handle UDP packets. This data may be useful to determine whether any data is coming in at all, and at what times it rises. There is no guarantee that all packets sent to this port will be received and thus metered.
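Because _udp_kprocessed is a running total, the difference between consecutive samples shows whether any UDP data arrived in an interval. A minimal sketch, with made-up timestamps and values:

```python
# _udp_kprocessed is a running total of kilobytes, so consecutive
# differences show per-interval arrival. Values below are made up.
samples = [
    (0, 100.0),    # (seconds, _udp_kprocessed)
    (30, 100.0),   # no growth: nothing received this interval
    (60, 148.5),   # growth: ~48.5 KB received
]

deltas = [after[1] - before[1] for before, after in zip(samples, samples[1:])]
for (t0, _), (t1, _), d in zip(samples, samples[1:], deltas):
    status = "idle" if d == 0 else f"received {d:.1f} KB"
    print(f"{t0}-{t1}s: {status}")
```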
mpool lines represent memory used by the splunk indexer code only (not any other pipeline components). This information is probably not useful to anyone other than splunk developers.
group=mpool, max_used_interval=4557, max_used=53878, avg_rsv=180, capacity=268435456, used=0
- max_used_interval represents the number of bytes (in this example a bit over 4KB) used during the reporting interval (since the last output).
- max_used represents the maximum amount of memory, in bytes, in use at any time during the component's lifetime (most likely since splunk start). In this example the high water mark was around 53KB.
- avg_rsv is the average size of a memory allocation across the runtime of the system.
- capacity is the ceiling on memory use for the indexer.
- used is the current use of memory by the indexer.
In this case we can see that some memory is sometimes in use, although at the time of the sample none was in use, and that the use is generally low.
map, name=pipelineinputchannel messages
These messages are primarily debug information over the splunk internal cache of processing state and configuration data for a given datastream (host/source/sourcetype).
group=map, name=pipelineinputchannel, current_size=29, inactive_channels=4, new_channels=0, removed_channels=0, reclaimed_channels=0, timedout_channels=0, abandoned_channels=0
- current_size is the number of total channels loaded in the system at the end of the sampling period.
- inactive_channels is the number of channels which do not have any entries in any pipeline referring to them (typically for recently seen data, but not for data which is currently being processed) at the end of the sampling period.
- new_channels is the number of channels created during the sampling period, which means that new data streams arrived, or a data stream which had been aged out appeared again.
- removed_channels is the number of channels which were destroyed during the sampling period, which means that enough pressure existed to push these channels out of the set (there were too many other new data streams).
- reclaimed_channels is the number of channels which were repurposed during the sampling period. This will happen for reasons similar to new_channels, based on the size of the utilization, etc.
- timedout_channels is the number of channels which became unused for long enough to be considered stale, so their information was culled (typically that file/data source hasn't produced data for some time).
- abandoned_channels is the number of channels which were terminated via Splunk forwarding: the forwarder stopped communicating with the indexer, so the indexer shut them down.
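The fields above can be combined into a simple per-interval churn figure; high churn relative to current_size suggests the channel cache is cycling through many short-lived host/source/sourcetype streams. This is an illustrative sketch, using the sample line from above:

```python
# Per-interval channel churn from a map/pipelineinputchannel line.
# "churn" as new + removed + reclaimed is an illustrative metric,
# not an official Splunk figure.
import re

line = ("group=map, name=pipelineinputchannel, current_size=29, "
        "inactive_channels=4, new_channels=0, removed_channels=0, "
        "reclaimed_channels=0, timedout_channels=0, abandoned_channels=0")

kv = dict(re.findall(r'(\w+)=([^,\s]+)', line))
churn = (int(kv["new_channels"]) + int(kv["removed_channels"])
         + int(kv["reclaimed_channels"]))
print(f"channels={kv['current_size']} churn={churn}")
```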