Community:TroubleshootingBlockedQueues

Troubleshooting blocked queues

Warning: this is a very rough first take, and may have inaccuracies.

This is really step 2; step 1 is identifying that you have blocked queues, whether from another document, from Splunk support, or by manual investigation.

Queue status is reported in metrics.log, which is indexed into index=_internal. A blocked-queue status line looks like this in 4.x:

07-13-2010 11:42:22.534 INFO Metrics - group=queue, name=parsingqueue, blocked=true, max_size=1000, filled_count=0, empty_count=0, current_size=1000, largest_size=0, smallest_size=1000

or, in 3.x:

06-24-2008 09:20:28.278 INFO Metrics - group=queue, name=indexqueue, blocked!!=true, max_size=1000, filled_count=21, empty_count=44987, current_size=1000, largest_size=1000, smallest_size=1

The only real difference is the odd '!!' after 'blocked'. Lines do not say 'blocked=false' when a queue is not blocked; the string is simply absent.
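
If you would rather scan the file on disk than search index=_internal, a minimal sketch along these lines will spot the blocked lines in either format. This is not Splunk tooling; the log path assumes a default /opt/splunk install.

 #!/usr/bin/env python3
 """Print blocked-queue lines from metrics.log (sketch, assumed paths)."""
 import re

 METRICS_LOG = "/opt/splunk/var/log/splunk/metrics.log"  # adjust for your $SPLUNK_HOME

 # Matches the 4.x form ("blocked=true") and the 3.x form ("blocked!!=true").
 blocked_re = re.compile(r"group=queue, name=(\w+), blocked(!!)?=true")

 with open(METRICS_LOG) as f:
     for line in f:
         m = blocked_re.search(line)
         if m:
             print(f"queue {m.group(1)} reported blocked: {line.strip()}")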

  • Are your blocked queues a problem?
   If queues are blocked moderately often, say 50% of the time, but
   there are interspersed lines where they are not blocked, then your
   system is working, but perhaps not keeping up in real time.  You
   should investigate whether the system is falling behind: search for
   recent data from various forwarders/sources.  A quick way to
   quantify how often each queue is blocked is sketched after this
   list.
   If queues are blocked always or almost always (99%, 100%), then
   something is wrong, and data is not flowing as you would want.
  • Maybe indexing is just much too slow?
   If queues are blocked not always but nearly always, and data is
   arriving in the index but falling further and further behind, then
   you have an indexing performance problem rather than a no-indexing
   problem.  See Community:TroubleshootingIndexingPerformance.
  • Is the disk full?
   If Splunk thinks the disk is full, you will get a message in the
   Splunk UI saying so: "Indexing has paused".
   Splunk 4.1+ checks the space available on the filesystem of each
   index location (warm/cold).  The default minFreeSpace value
   (server.conf) is 2GB.
   If space is exhausted, the answer may be to adjust data retention
   (lower the maximum index size) or to allocate more storage.  A
   free-space check along these lines is sketched after this list.
  • Is Splunk trying to forward to a system that is not accepting the data?
   Review outputs.conf, and investigate those receiving systems; a
   simple reachability check is sketched after this list.  By default
   Splunk will block when forwarding if the receiving side is not
   accepting the data.  If you do not care whether the receiving side
   gets a complete copy of the data, you can reconfigure the output to
   drop events instead (e.g., dropEventsOnQueueFull = 30, in seconds).
  • Corollary: are you forwarding to yourself?
   This largely happened with early 4.0.x deployment servers, where
   deployment clients were configured to forward to the deployment
   server and it was easy to deploy the forwarding app to the
   deployment server itself as well, so it forwarded to itself.
   This causes blockage because no data can ever exit the system, so
   the queues fill.
  • Do you have custom coldToFrozen (archival) scripts which are not working?
   Failing archival scripts will prevent Splunk from removing data
   from the indexes.  Splunk uses the script's return code to know
   whether it succeeded.  If the script returns failure, Splunk will
   retry the archival later (after 30 seconds or so).  If the script
   consistently fails, Splunk cannot remove data from the indexes,
   which will eventually cause the disk to fill.  A skeleton that gets
   the exit-code contract right is sketched after this list.
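
For the first bullet ("Are your blocked queues a problem?"), a rough way to quantify how often each queue is blocked is to count metrics.log samples per queue. This is only a sketch; the log path is an assumption and the 95% threshold is an arbitrary illustration.

 #!/usr/bin/env python3
 """Estimate what fraction of metrics.log samples report each queue as blocked."""
 import re
 from collections import Counter

 METRICS_LOG = "/opt/splunk/var/log/splunk/metrics.log"  # adjust for your install

 name_re = re.compile(r"group=queue, name=(\w+)")
 samples, blocked = Counter(), Counter()

 with open(METRICS_LOG) as f:
     for line in f:
         m = name_re.search(line)
         if not m:
             continue
         queue = m.group(1)
         samples[queue] += 1
         if "blocked" in line:  # the field is absent entirely when not blocked
             blocked[queue] += 1

 for queue in sorted(samples):
     pct = 100.0 * blocked[queue] / samples[queue]
     flag = "  <-- investigate" if pct > 95 else ""
     print(f"{queue:20s} blocked {blocked[queue]}/{samples[queue]} samples ({pct:.0f}%){flag}")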
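
For the "Is the disk full?" bullet, you can compare the free space on each index location against a minFreeSpace-style threshold yourself. The paths below are hypothetical; take your real homePath/coldPath values from indexes.conf.

 #!/usr/bin/env python3
 """Check free space on index locations against a 2GB minFreeSpace-like threshold."""
 import shutil

 MIN_FREE_BYTES = 2 * 1024**3  # the 2GB default mentioned above

 # Hypothetical index locations -- take yours from indexes.conf.
 index_paths = [
     "/opt/splunk/var/lib/splunk/defaultdb/db",       # hot/warm
     "/opt/splunk/var/lib/splunk/defaultdb/colddb",   # cold
 ]

 for path in index_paths:
     free = shutil.disk_usage(path).free
     status = "OK" if free >= MIN_FREE_BYTES else "below minFreeSpace -- indexing will pause"
     print(f"{path}: {free / 1024**3:.1f} GB free ({status})")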
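
For the forwarding bullet, a first sanity check on the receiving systems is simply whether the forwarder can open a TCP connection to them. The hosts below are placeholders; copy the real ones from the server= entries in outputs.conf (9997 is the conventional receiving port).

 #!/usr/bin/env python3
 """Check that the receivers listed in outputs.conf accept TCP connections."""
 import socket

 # Placeholder receivers -- use the server= list from outputs.conf.
 receivers = [("indexer1.example.com", 9997), ("indexer2.example.com", 9997)]

 for host, port in receivers:
     try:
         with socket.create_connection((host, port), timeout=5):
             print(f"{host}:{port} accepted a TCP connection")
     except OSError as err:
         print(f"{host}:{port} unreachable: {err}  -- forwarder queues will back up")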
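
For the archival bullet, the important part is the exit-code contract: exit 0 only when the bucket really was archived, non-zero otherwise. The skeleton below assumes Splunk passes the bucket directory as the script's first argument (check the docs for your version); the archive destination is an example.

 #!/usr/bin/env python3
 """coldToFrozen-style skeleton that reports success/failure via its exit code."""
 import os
 import shutil
 import sys

 ARCHIVE_DIR = "/archive/splunk_frozen"  # hypothetical destination

 def main():
     if len(sys.argv) < 2:
         sys.exit("usage: coldToFrozenExample.py <bucket_dir>")
     bucket = sys.argv[1]
     dest = os.path.join(ARCHIVE_DIR, os.path.basename(bucket))
     try:
         shutil.copytree(bucket, dest)
     except OSError as err:
         # Non-zero exit tells Splunk the archive failed; it will retry later
         # and will not delete the bucket, so persistent failures fill the disk.
         print(f"archiving {bucket} failed: {err}", file=sys.stderr)
         sys.exit(1)
     sys.exit(0)  # success: Splunk may now remove the bucket from the index

 if __name__ == "__main__":
     main()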