Community:TroubleshootingBlockedQueues
From Splunk Wiki
Troubleshooting blocked queues
Warning: this is a very rough first take, and may have inaccuracies.
This is really step 2. Step 1 is identifying that you have blocked queues, whether from another document, from support, or by manual investigation.
Queue status is reported in metrics.log, which is indexed into index=_internal. A blocked queue status line looks like this (4.x):
07-13-2010 11:42:22.534 INFO Metrics - group=queue, name=parsingqueue, blocked=true, max_size=1000, filled_count=0, empty_count=0, current_size=1000, largest_size=0, smallest_size=1000
or, in 3.x:
06-24-2008 09:20:28.278 INFO Metrics - group=queue, name=indexqueue, blocked!!=true, max_size=1000, filled_count=21, empty_count=44987, current_size=1000, largest_size=1000, smallest_size=1
The only real difference is the odd '!!' after 'blocked'. Lines for queues that are not blocked do not say 'blocked=false'; the string is simply absent.
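As a rough illustration of the format described above, the following Python sketch picks the queue name out of a metrics.log line and reports whether it is marked blocked, accepting both the 4.x and 3.x spellings. The function name parse_queue_line is made up for this example, and the patterns are based only on the two sample lines shown here, so other versions may need adjustment.

```python
import re

# Matches "blocked=true" (4.x) and "blocked!!=true" (3.x).
BLOCKED_RE = re.compile(r"\bblocked(?:!!)?=true\b")
NAME_RE = re.compile(r"\bname=(\w+)")


def parse_queue_line(line):
    """Return (queue_name, is_blocked), or None if this is not a queue metrics line.

    Hypothetical helper: it assumes the line layout of the samples above.
    """
    if "group=queue" not in line:
        return None
    match = NAME_RE.search(line)
    if match is None:
        return None
    # Unblocked lines carry no "blocked" field at all, so absence means not blocked.
    return match.group(1), BLOCKED_RE.search(line) is not None
```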
- Are your blocked queues a problem?
If queues are blocked moderately frequently, say 50% of the time, but have interspersed lines when they are not blocked, then your system is working, but is perhaps not getting the work done in real time. You should investigate whether the system is falling behind: search for recent data from various forwarders/sources and see how old the newest events are.
If queues are blocked almost always or always (99%, 100%), then something is wrong, and data is not flowing as you would want.
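To put a number on "how often", one option is to count blocked versus total status lines per queue. The Python sketch below does that over an iterable of metrics.log lines. The helper names and the labels are invented for this example, and the 0.99 cut-off is only the rough figure used above, not a Splunk-defined threshold.

```python
import re
from collections import defaultdict

QUEUE_RE = re.compile(r"group=queue, name=(\w+)")
# Accepts both the 4.x ("blocked=true") and 3.x ("blocked!!=true") forms.
BLOCKED_RE = re.compile(r"\bblocked(?:!!)?=true\b")


def blocked_fraction(lines):
    """Map each queue name to the fraction of its status lines marked blocked."""
    totals = defaultdict(int)
    blocked = defaultdict(int)
    for line in lines:
        match = QUEUE_RE.search(line)
        if match is None:
            continue  # not a queue status line
        name = match.group(1)
        totals[name] += 1
        if BLOCKED_RE.search(line):
            blocked[name] += 1
    return {name: blocked[name] / totals[name] for name in totals}


def assess(fraction, always_threshold=0.99):
    """Label a blocked fraction; the threshold is an assumption, tune as needed."""
    if fraction >= always_threshold:
        return "stuck"          # blocked (almost) always: something is wrong
    if fraction > 0:
        return "intermittent"   # sometimes blocked: check whether data is falling behind
    return "ok"
```

For example, feeding it the lines of a metrics.log file (opened with open()) and calling assess() on each value gives a quick per-queue summary.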
- Maybe indexing is just much too slow?
If queues are blocked nearly always, but not always, and data is arriving in the index but falling further and further behind, then you have an indexing performance problem rather than a no-indexing problem. See Community:TroubleshootingIndexingPerformance.
- Is the disk full?
If Splunk thinks the disk is full, you will get a message in the Splunk UI saying so: "Indexing has paused".
Splunk 4.1+ checks the space available on the filesystem for each index location (warm/cold). The default minFreeSpace (server.conf) value is 2GB.
If space is exhausted, the answer may be to adjust data retention (lower maximum size!) or to allocate more storage.
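To check free space by hand on the filesystems that hold your indexes, a small Python sketch like the following can help. It only approximates the comparison described above and is not how Splunk itself performs the check. The 2 GB default mirrors the figure quoted here, and the function names are made up for this example.

```python
import shutil

# Assumption: 2 GB mirrors the default minFreeSpace quoted above; set this to
# whatever your server.conf actually specifies.
DEFAULT_MIN_FREE_BYTES = 2 * 1024 ** 3


def free_bytes(path):
    """Free bytes on the filesystem that contains path."""
    return shutil.disk_usage(path).free


def has_min_free_space(path, min_free_bytes=DEFAULT_MIN_FREE_BYTES):
    """True if the filesystem holding path has at least min_free_bytes free."""
    return free_bytes(path) >= min_free_bytes


# Example (the path is a placeholder, substitute your own index locations):
#   has_min_free_space("/path/to/index/db")
```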
- Is Splunk trying to forward to a system that is not accepting the data?
Review outputs.conf, and investigate the receiving systems listed there. By default, Splunk will block when forwarding if the receiving side is not accepting the data. If you do not care whether the receiving side gets a complete record, you can reconfigure the output to drop events instead (e.g., dropEventsOnQueueFull = 30).
- Corollary: are you forwarding to yourself?
This largely happened with early 4.0.x deployment servers, where administrators configured deployment clients to forward to the deployment server. It was easy to deploy the forwarding app to the local server as well, so that it ended up forwarding to itself.
This causes blockage because no data can ever exit the system, so the queues fill and stay full.
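One way to spot this is to compare the forwarding targets in outputs.conf with the names of the local machine. The Python sketch below takes the comma-separated value of a server setting and flags entries that point back at the local host on its receiving port. The function names are invented for this example, the set of local names is deliberately minimal (it does not resolve DNS aliases or other interface addresses), and the port in the usage note is illustrative.

```python
import socket


def split_servers(server_value):
    """Split a value such as 'idx1:9997, idx2:9997' into (host, port) pairs."""
    pairs = []
    for entry in server_value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep:  # no port given
            host, port = entry, ""
        pairs.append((host.lower(), port))
    return pairs


def local_names():
    """A minimal set of names for this machine; extend with aliases and IPs as needed."""
    return {"localhost", "127.0.0.1", socket.gethostname().lower()}


def self_targets(server_value, receiving_port, names=None):
    """Return the 'host:port' targets that point at this machine's receiving port."""
    names = local_names() if names is None else {n.lower() for n in names}
    return [
        f"{host}:{port}"
        for host, port in split_servers(server_value)
        if host in names and port == str(receiving_port)
    ]
```

For instance, self_targets("idx1:9997, localhost:9997", 9997) would report localhost:9997 as a target worth a second look.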
- Do you have custom coldToFrozen (archival) scripts which are not working?
Failing archival scripts will prevent Splunk from removing data from the indexes. Splunk looks at the return code from the script to know whether it succeeded. If the script returns failure, Splunk will retry the archival at a later time (30 seconds or so). If the script consistently fails, Splunk cannot remove data from the indexes, which will eventually cause the disk to fill.
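Because the return code is what Splunk acts on, an archival script should exit with zero only when the copy really succeeded. The Python sketch below shows that shape. It assumes the script is invoked with the bucket directory as its single argument, and the archive location and function names are placeholders for this example rather than anything Splunk ships.

```python
import os
import shutil
import sys

# Hypothetical archive location; change it to suit your environment.
ARCHIVE_DIR = "/path/to/frozen_archive"


def archive_bucket(bucket_path, archive_dir):
    """Copy one bucket directory into archive_dir and return the destination path."""
    if not os.path.isdir(bucket_path):
        raise ValueError("not a bucket directory: %s" % bucket_path)
    os.makedirs(archive_dir, exist_ok=True)
    name = os.path.basename(os.path.normpath(bucket_path))
    dest = os.path.join(archive_dir, name)
    shutil.copytree(bucket_path, dest)  # raises if dest already exists
    return dest


def main(argv, archive_dir=ARCHIVE_DIR):
    """Return 0 on success and 1 on any failure, so the caller can retry later."""
    if len(argv) != 2:
        print("usage: %s <bucket_path>" % argv[0], file=sys.stderr)
        return 1
    try:
        archive_bucket(argv[1], archive_dir)
    except Exception as exc:
        # Report the reason on stderr; a silent failure is hard to diagnose.
        print("archive failed: %s" % exc, file=sys.stderr)
        return 1
    return 0
```

When used as a standalone script, finish the file with sys.exit(main(sys.argv)) so that the result becomes the process exit code. Running the script by hand against a test directory and checking the exit code is a quick way to confirm whether a failing script is what keeps the disk full.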