Community:SplunkTechNotes

From Splunk Wiki

Splunk Tech Notes

SPL-48446

Problem

Filtering events into realtime searches can slow indexing.

Symptoms

  • Indexing may be slower than expected for the hardware. In particular, the indexing rate may have *decreased* as Splunk adoption grew within the organization, even though CPU, I/O, and memory are not exhausted on the indexers.
  • Inspecting splunkd by thread shows one thread pinned at nearly 100% of one CPU core, while most other splunkd threads are relatively idle. Further investigation of this thread (pstack, etc.) shows it spending most of its time in RTFilter.
  • A significant number of realtime search processes are running (see Detail below for what counts as significant).

Affected systems

  • Platforms/Arch: All
  • Versions: 4.1 - 4.3.1
  • Role: Indexers

Remedies

  • Run fewer realtime searches.
  • Upgrade to Splunk 4.3.2+.
  • On 4.3.2+, try to ensure your realtime alerts include one or more required terms from this list: host, source, sourcetype, index.
  • Use systems with faster CPU cores or more indexers.

Detail

Assuming that all modern CPU cores are *roughly* similar, Splunk versions 4.3.1 and earlier appear to behave as follows:

  • 40 concurrent realtime searches reduce indexing rates to hundreds of kilobytes per second.
  • 20 concurrent realtime searches limit indexing rates to roughly single-digit megabytes per second.


Some optimizations ship in 4.3.2. In my testing, 4.3.2 is roughly 30% faster on this one code path across the board, even without acceleration.

We also have some basic acceleration starting with 4.3.2, which means that searches that specify one or more of the following fields:

  • host
  • source
  • sourcetype
  • index

will be accelerated by being able to more or less ignore events from other hosts, sources, etc.

To be clear, as of 4.3.2 the behavior is:

  • Accelerated: host=my_host (192.168.5.0 OR foobar)
  • Not accelerated: (192.168.5.0 OR host=my_host)
  • Accelerated: index=security
  • Not accelerated: index=customer_*
  • Accelerated: sourcetype=foo source=bar
  • Not accelerated: sourcetype=foo OR source=bar

Basically, the optimizer has to be able to prove that a single term is required for all events.
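The "required term" rule can be illustrated with a toy check. This is only a sketch of the idea, not Splunk's optimizer: the function names are hypothetical, the tokenizer ignores quoting, and it assumes that a wildcarded value does not qualify (as the index=customer_* example suggests). A term counts as required only if it sits at the top level of the search, outside any parentheses, and is not joined to a neighbor by OR or negated by NOT.

```python
import re

# Fields that the note lists as eligible for acceleration.
ACCELERATED_FIELDS = {"host", "source", "sourcetype", "index"}


def top_level_items(search):
    """Return top-level tokens, collapsing each parenthesized group to a placeholder."""
    items, depth = [], 0
    for tok in re.findall(r"[()]|[^\s()]+", search):
        if tok == "(":
            if depth == 0:
                items.append("(group)")
            depth += 1
        elif tok == ")":
            depth -= 1
        elif depth == 0:
            items.append(tok)
    return items


def looks_accelerated(search):
    """Toy test: is some host/source/sourcetype/index term provably required?"""
    items = top_level_items(search)
    for i, item in enumerate(items):
        # A term next to OR (or after NOT) is not required for every event.
        if i > 0 and items[i - 1] in ("OR", "NOT"):
            continue
        if i + 1 < len(items) and items[i + 1] == "OR":
            continue
        field, sep, value = item.partition("=")
        if sep and field in ACCELERATED_FIELDS and value and "*" not in value:
            return True
    return False


for s in ("host=my_host (192.168.5.0 OR foobar)",
          "(192.168.5.0 OR host=my_host)",
          "sourcetype=foo OR source=bar"):
    print(looks_accelerated(s), s)
```

Run against the six example searches in this note, the sketch agrees with the accelerated / not accelerated labels, but real searches with quoting or nested logic would need a proper parser.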

There is a significant speedup for each realtime search we can eliminate per event (90% of the work for that search is removed). There is a very significant speedup when we can eliminate all realtime searches per event (well over 99% of the total work is removed).

SPL-44773

Problem

Splunk file input (Tailing) will abandon reading files for certain classes of file access errors, until restart.

Symptoms

  • Splunk may read successfully from a log file for a time, then cease to do so.
  • A clear error, such as Splunk lacking permission to read a log file, is not resolved by fixing the permissions; Splunk must be restarted.
  • Reviewing splunkd.log will show an error stating that the file in question will be ignored.


Affected systems

  • Platforms/Arch: All
  • Versions: 4.1 - 4.3.2
  • Role: All -- anywhere log files are being read

Remedies

  • Workarounds are highly dependent upon the specific problem.
  • Upgrade to Splunk 4.3.3+.

Detail

Splunk simply had no retry logic for certain errors; the functionality had not yet been built.

With 4.3.3+, the behavior is now as follows:

Errors result in a retry delay of one-half of one second, doubling with repeated errors up to a bit over half an hour.

The same restated in more detail:

The algorithm specifies the behavior when subsequent attempts to access the file result in some sort of failure. There is no attempt to correlate failure categories. Permission denied, inability to get a checksum from a truncated file, inability to read due to filesystem failure, byte range locks, or other more exotic problems will all be handled the same way by the backoff timer algorithm.

A non-failing access will reset to the default non-error behavior (not detailing this in full here).

A failure will result in a delay of at least half a second before we re-attempt to access the file. All of these values are minimums, because the tailing system may well have more work than it can accomplish.

Consecutive failures in accessing or handling the same pathname will double the interval each time, up to 0.5 s × 2^12 (equivalently 1 s × 2^11), which is 2048 seconds, or 34 minutes and 8 seconds. If more errors are encountered, the timer will remain at 34 minutes and 8 seconds.
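The backoff schedule described above can be sketched as follows. This is a minimal illustration, not Splunk's code: the function and constant names are hypothetical, and it assumes the first failure maps to the half-second delay, with each further consecutive failure doubling it until the 2048-second cap. Returning 0.0 after a success is just a placeholder for the default non-error behavior, which this note does not detail.

```python
BASE_DELAY_S = 0.5   # minimum delay after the first failure
MAX_DOUBLINGS = 12   # 0.5 s * 2**12 = 2048 s, the cap


def retry_delay(consecutive_failures):
    """Minimum delay in seconds before the next attempt to access the file."""
    if consecutive_failures <= 0:
        # Placeholder: a non-failing access resets to default behavior.
        return 0.0
    doublings = min(consecutive_failures - 1, MAX_DOUBLINGS)
    return BASE_DELAY_S * 2 ** doublings


for n in (1, 2, 3, 13, 14):
    minutes, seconds = divmod(int(retry_delay(n)), 60)
    print(f"failure {n}: {retry_delay(n)} s ({minutes} min {seconds} s)")
```

Under these assumptions the cap is reached on the 13th consecutive failure, and every later failure waits the same 34 minutes and 8 seconds.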
