Community:PerformanceTroubleshooting
From Splunk Wiki
Splunk Performance
~~~~~~~~~~~~~~~~~~
Primary inputs:
* Characterization:
- what actions in splunk are performing poorly, and especially what
in contrast is not performing poorly
* Timing:
- Are the problems constant, or fluctuating, with any identifiable
patterns
* Load:
- Are Disk throughput, memory, or CPU time exhausted?
------
I prefer to start with system performance. Most problems are much
more comprehensible when the system constraints are understood.
This means looking at:
- general load on active splunk box
- changes in load / activity during slow operation (search, ui
action, etc)
Memory use:
Linux: free, ps, vmstat
Solaris: vmstat, ps
Windows: Task manager, perfmon
If the system has more than 80% of the memory in use by programs , or
significant swap usage, then this problem becomes the priority.
example: Linux:
jrodman@support05:~$ free
total used free shared buffers cached
Mem: 2062080 1865568 196512 0 146588 866392
-/+ buffers/cache: 852588 1209492
Swap: 1951888 24 1951864
This system has only 852 megabytes in use (second line) out of 2GB,
which is a healthy ratio. You want the majority of your memory in use
by buffers/cache.
We also see almost no swap is in use. Moderate swap use is not a
problem (perhaps some tens of megabytes). Frequently changing swap
usage, or high usage (more than a few hundred megabytes) may indicate
swapping, which will kill all performance for any purpose.
Identification of swapping:
Unix: vmstat <interval>
Interval is a number in seconds
jrodman@support05:~$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 24 131608 173708 894048 0 0 1 26 0 7 0 0 99 0
0 0 24 131648 173708 894048 0 0 0 0 90 585 0 0 100 0
0 0 24 131516 173708 894048 0 0 0 3 86 490 1 0 99 0
0 0 24 131532 173708 894056 0 0 0 151 73 476 0 0 99 0
The useful data here is the swap colums, si for swap-in, so for
swap-out, This system is not swapping. A system under significant
memory pressure will often have stairstepping where there is no swap
activity for a time, and then a large chunk of swap activity. A
system with constant swap activity is one that is thrashing, and will
not perform in any reasonable way until this is resolved.
solaris --
There's a nice document about memory troubleshooting on Solaris here:
http://www.princeton.edu/~unix/Solaris/troubleshoot/ram.html
They recommend vmstat or sar -g to review "scan rate" for indication of
memory overload, and vmstat -p to review paging activity.
Disk load:
You can evaluate disk performance on an unused system with Bonnie++,
but for a live system, this isn't useful. Unix, use iostat with an
polling interval in seconds:
jrodman@support05:~$ iostat 5
Linux 2.6.26-2-amd64 (support05) 01/07/2010 _x86_64_ (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.32 0.05 0.18 0.03 0.00 99.41
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 2.04 2.49 103.80 6303251 262885952
avg-cpu: %user %nice %system %iowait %steal %idle
0.10 0.00 0.20 0.10 0.00 99.59
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 2.00 4.80 33.60 24 168
This system is entirely idle. However, evaluating the blocks read and
written per second can quickly give you a hint if you are anywhere
near capacity. 5000 blocks/s written is nothing. 50,000 blocks/s
could be easily exhausting your bandwidth.
During searches, you should expect tps (transactions per second) to
surge to high numbers. If this number never rises above 300, then
probably your system simply doesn't have enough seek performance.
500 would be mediocre, 800-1000 starts to become quite impressive.
CPU load:
This is the hardest to evalute, since CPU cores are segmented.
Searches can be bounded by a single core performance while other
resources remain idle. The number of CPUs used by indexing will grow
as the data volume grows, from around 2 at lower loads upwards to
around 5-6 at higher volume loads (think 100GB/day).
Linux: For a rough impression, you can use 'top' with multi-cpu view,
(press the 1 key),
Solaris: prstat
------------------------
What aspects of splunk are using what sort of system load?
For a rough quick look, splunkd cpu use is primarily
indexing-related. splunk-optimize processes are also
indexing-related.
The majority of search cpu use occurs in short-lived splunk-search
processes.
User interface related activity occurs in the python process named
mrsparkle, eg:
jrodman 7504 0.2 0.8 126584 32216 ?? S 11:47AM 1:11.12 python -O /Applications/splunk4/lib/python2.6/site-packages/splunk/appserver/mrsparkle/root.py start
If you need to break down system i/o load by process (splunk-search vs
splunkd), iotop may be useful:
http://guichaz.free.fr/iotop/
You can use the built-in splunk metrics.logs to evaluate how much cpu
time splunk believes it is using for various performance purposes.
This is the cpu_seconds value from metrics.log, eg:
Splunk > index=_internal source=*metrics.log group=pipeline | timechart sum(cpu_seconds) by processor
This breaks down the cpu accounting by individual widget in our data
processing. You could also break it down by more rough categories:
Splunk > index=_internal source=*metrics.log group=pipeline | timechart sum(cpu_seconds) by name
In some environments regex may be dominant, in others, parsing, while
in others the indexing pipeline may be the majority of time.
You can get a plot of all the cpu activity with a fairly comparable
values like so:
index="_internal" source="*metrics.log" group=pipeline NOT sendout |timechart span=1m sum(cpu_seconds)
Now we can compare these values to the per-cpu maximum of 60. If the
chart hovers around 240, you have approximately 4 cores in use by
indexing data processing inside splunk. If it's more around 30, you
have only half a core in use by these things.
-----------------------
For indexing problems:
Review KBps values (per_sourcetype_thruput, per_index_thruput) in
metrics log to measure improvements, make sure you're making it
better!
Explicitly declare single-line formats with
SHOULD_LINEMERGE=false
Explicitly declare time format with
TIME_FORMAT, TIME_PREFIX, and MAX_TIMESTAMP_LOOKAHEAD
If the majority of cputime is spent in indexing, not parsing or regex,
it may help to add an indexing thread in indexes.conf
Aside: indexing speed is negatively affected by poorly ordered event
streams. Clean data is faster, and good housekeeping anyway (your
search results will be more meaningful). Keep systems on the same
clock, set up parsing for ill-defined sourcetypes.
----------------
For search performance:
Two avenues:
- compare/contrast speed of various searches
(some searches are slow, some searches are fast)
- look at system behavior when running trouble searches
Common causes of slow searches:
- slow disk
- extremely large OR lists (1 OR 2 OR 3 OR 4 ..... 5000)
these can be caused by tags
- complex subsearches (effectively end up being large OR lists)
- phrase searches, eg Splunk > 1.2.3.4
Search tuning:
- be more specific:
- Add sourcetypes, hosts, source values where possible
- add strings which will only exist in your desired events
- narrow the time range as appropriate
- cut the data at a quantity of events, eg *
Slow: Splunk > *
Fast: Splunk > * | head 1000
- Avoid NOT expressions when possible.
Avoid: (NOT host=d NOT host=e)
Prefer: (host=a OR host=b OR host=c)
- In some cases, when there is no natural way to describe the data
efficiently with the splunk engine, but it can be identified at
input time, parse-time ("indexed") fields may be appropriate
Investigation:
- Try to determine which phase of the search is taking what amounts
of time. If your search involves a number of invocations of
rename, eval, stats, subsearches, etc. Break the search down piece
by piece to figure out which parts are taking what amounts of time.
- Run 'vmstat 5' and 'iostat 5' prior to running your search, let
them run while search is executing, until completion
- Try search from the command line, especially using the time
command, eg:
time splunk search 'my search terms .....'
- In the UI from a slow search you can click on
Actions -> Get link to results...
This will give you the search ID, such as sid=1262914985.83
You can find the directory containing the actual data about the search at:
$SPLUNK_HOME/var/run/splunk/dispatch/<search_id>
eg
/opt/splunk/var/run/splunk/dispatch/1262914985.83
These directories are auto-deleted after ten minutes, to keep one,
create an empty file, 'save' in the directory, or copy it out.
The most interesting thing is usually the search.log. It may contain
errors or timing information in extreme cases. You can also see the
number of 'results' that splunk-search pulled off disk before further
filtration.
Manually running a search:
If you capture the exact commandline used to launch splunk-search
when you run a UI search,
(while true; do ps aux |grep '[s]earch term'; done)
you can run these yourself. This will enable you to run them under
investigation tools such as gdb, strace, valgrind, whatever you
wish. It is convenient to choose the search-id yourself for this
purpose, which will cause the dispatch dir to be created
with the same name.
Anatomy of a search:
From a performance perspective, here are the major pieces of a
search. This can sometimes help to pin down what types of things
may or may not be slow:
1 - Your search is expanded according to its explicitly stated tags
and eventtypes. The full version can be viewed in the dispatch
search.log (above).
Steps 2-4 are quite seek intensive. They are typically bound on disk
head movement, or I/Os per second.
2 - The disk-access code that walks the indexes traverses the
appropriate buckets for the time range.
3 - For each bucket, test are made for whether the desired
sourcetypes, sources, and hosts are present in the bucket at all
4 - assuming the bucket cannot be eliminated on these criteria, the
index files are opened (all of them, in turn, that match the time
range)
4a - the start of the index has a dictionary of indexed terms, a
"lexicon". We identify the terms that are appropriate (such as for
wildcards.
4a - for all identified terms, we identify all the event IDs which
match for each term
4c - set logic is used to determine which event IDs are desired
for the expressed search. eg for 'host=a OR host=b', all events for
either host are desired, while for 'host=a AND host=b', only events
for both are desired.
5 - for all desired events, the appropriate rawdata files are
accesssed. These are typically compressed, so this can be quite CPU
intensive. There is also a moderately significant amount of
sequential reads here.
6 - Various automatic processing steps occur, eg
6a - eventtypes are identified, which involves applying search
expressions against all the resturned invents.
6b - automatic key-value extraction is applied, where sequences like
'a=4' are identified and provided as fields for matching, charting,
etc.
7 - n -- the rest of the search steps are highly dependent upon the
search expression.
Finally: for the UI only, the search text is parsed into 'segments' so
that the UI can visually present clickable items that can be searched,
as a usability aid.
Search performance remedies:
Disk seek limited
- Use 64 bit. This is always a win, in any event
- If your data is sparse (buried in a larger volume of unrelated
data) it may help to store it in its own index
Disk bandwidth exhaustion:
- In this case you will simply have to reduce the load on the I/O
subsystem. Possible remedies:
- Add more index nodes
- Move non-splunk tasks off the system
- Check for extremely large numbers of source, sourcetypes, or
host values (tens of thousands). They can cause unnecessary
I/O.
CPU limited:
- If splunk-search seems CPU bound, you may want to evaluate storing
your rawdata in uncompressed form. This will use a great deal more
disk space, but will not incur the decompression CPU cost at search
time.
- If you have very large events, especially with a great deal of
punctuation. Try disabling search-time segmentation (raw) in the
UI.
- If your events have large numbers of apparent Key-value pairs, like
hostname=foo access=much food=hotdogs
and you do not care about the vast majority of the values, you may
wish to disable automatic key-value extraction with KV_MODE=none in
props.conf
Initial rev, Jrodman 10:40, 12 January 2010 (PST) please do modify etc.