From Splunk Wiki
Hardware tuning factors
Splunk can benefit from certain hardware configurations, maximizing performance for different aspects of the Splunk technology. This topic reviews a variety of factors and offers suggestions on how to size your hardware for Splunk.
High Level Guidelines
Generally speaking, large-scale IT search deployments present unique challenges to the volume computing hardware available from vendors today. Many of these challenges center on I/O, with hardware, software, system architecture, and operating system all playing a part in determining a given configuration's suitability for use with Splunk. Your mileage will vary with the guidelines below; please contact Splunk for recommendations specific to your environment.
Some high level guidelines (for Splunk version 4.0 and later):
- Up to 25 GB/day: 4 CPU cores at 2.5 GHz per core, 4 GB RAM. Add one core per active user above 2.
- Up to 100 GB/day: 8 CPU cores at 2.5-3 GHz per core, 8 GB RAM. Add one core per active user above 4. (Note: a 64-bit OS and 800 IO/s disks are recommended.)
- Up to 300 GB/day: 12 CPU cores at 2.5-3 GHz per core, 12 GB RAM, distributed among two or more boxes. Add one core per active user above 6. (Note: a 64-bit OS and 800 IO/s disks are necessary to keep IO from becoming the bottleneck.)
- Up to 1 TB/day: 32 CPU cores at 2.5-3 GHz per core, 32 GB RAM, distributed among four or more boxes, plus one dedicated search box sized at 8 cores and 4 GB RAM. Add one core per active user above 8, and an additional search box, with load-balanced requests, if there are more than 16 active users. (Note: a 64-bit OS and 1200 IO/s disks are necessary to keep IO from becoming the bottleneck.)
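The tiers above can be sketched as a small lookup script. This is a minimal sketch, not an official sizing tool; the daily_gb value is an assumed example input, and the tier strings simply restate the guideline table.

```shell
#!/bin/sh
# Hypothetical tier picker restating the guideline table above.
# daily_gb is the expected raw indexing volume per day, in GB (assumed input).
daily_gb=120

if [ "$daily_gb" -le 25 ]; then
  tier="4 cores, 4 GB RAM"
elif [ "$daily_gb" -le 100 ]; then
  tier="8 cores, 8 GB RAM, 64-bit OS, 800 IO/s disks"
elif [ "$daily_gb" -le 300 ]; then
  tier="12 cores, 12 GB RAM, two or more boxes"
else
  tier="32 cores, 32 GB RAM, four or more boxes plus a dedicated search box"
fi
echo "$tier"
```

With the example value of 120 GB/day, this lands in the 300 GB/day tier; remember to add cores for active users per the notes above.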
Splunk is naturally demanding of the disk subsystems that it works with. Both index and search operations benefit from a disk subsystem that is designed with an eye to the types of operations that Splunk performs.
- Capacity: Provision up to 50% of the raw data size you intend to store. For standard syslog data this is closer to 35%, and it can be tuned down to 12% with lower indexing density. To set up retention policies, consult these instructions.
- Architecture: RAID configurations that stripe will yield significantly superior performance to parity-based RAID. That is, RAID 0, 10, 01, and 0+1 will give the best performance, while RAID 5 will offer the worst.
In Splunk, indexed data can be located on different partitions and still be searchable. If you do use separate partitions, the most common arrangement is to keep the most recent data on the local machine (with disks that read and write fast) and older data on a separate disk array (with slower but more reliable disks for longer-term storage). See the documentation for more information.
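As a rough worked example of the capacity guideline above: multiply daily raw volume by the on-disk ratio and the retention period. The 50% and 35% ratios come from this topic; the 100 GB/day volume and 90-day retention are assumed example values.

```shell
#!/bin/sh
# Rough index storage estimate: daily raw volume x on-disk ratio x retention.
# daily_gb and retention_days are assumed example inputs.
daily_gb=100
ratio_pct=50        # ~50% of raw size on disk; ~35% for standard syslog
retention_days=90

disk_gb=$(( daily_gb * ratio_pct * retention_days / 100 ))
echo "Provision roughly ${disk_gb} GB of index storage"
```

For 100 GB/day at 50% over 90 days, that works out to about 4.5 TB, before any RAID overhead.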
- Performance: "High IO/s typically means both faster indexing in general and faster searching of rare, temporally incoherent events (if you’re searching for a rare term, like a name, that occurs once an hour or once a day). On average, we’ve seen indexing speeds increase by about 66% going from a 7200 RPM SATA RAID to a 15K RPM SCSI RAID. We’ve seen comparable performance from SCSI and SAS RAIDs, provided they’re 15K RPM." (from Erik's blog post)
Measuring the number of discrete I/O operations per second is a good benchmark of how well a given disk subsystem could perform with Splunk. Most common 7200 RPM SATA disks deliver about 100 IO/s, whereas 15K RPM FC, SAS, and U320 SCSI technologies can yield significantly higher performance, near 800 IO/s or more. To perform a benchmark you can use bonnie++, freely available at http://www.coker.com.au/bonnie++/. It needs to be compiled on the target system. Once compiled, you can run the following command for each volume you want to benchmark:
$ bonnie++ -d [/your volume] -s [twice your system RAM in MB] -u root:root -qfb
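Once the benchmark completes, you can compare the measured random-seek rate against the guideline figures above. A minimal sketch follows; the measured_iops value here is a made-up example that you would replace with the figure bonnie++ reports for your volume.

```shell
#!/bin/sh
# Compare a measured random-seek rate (IO/s) against the 800 IO/s guideline.
# measured_iops is a placeholder; substitute the value bonnie++ reports.
measured_iops=650
required_iops=800

if [ "$measured_iops" -ge "$required_iops" ]; then
  verdict="meets the guideline"
else
  verdict="below the guideline; expect IO to bottleneck indexing"
fi
echo "Disk subsystem $verdict"
```

Run the benchmark on each volume Splunk will use, since a fast system disk tells you nothing about the array holding the index.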
The general pattern of Splunk I/O is a mix of sequential reads and sequential writes for indexing, plus minor activity such as zero-byte file creation and file renames, which should not matter except in very-high-latency situations. Searches, meanwhile, proceed in phases: large numbers of seeks and small reads, then larger reads mixed with decompression and computation. Overall, the I/O pattern benefits from caching, low latency, and sufficient bandwidth.
Indexing is a disk I/O operation that represents a large number of small, discrete writes, paired with more small reads and writes at index optimization time. As such, large numbers of high performance disk drives in directly attached configurations with high-bandwidth interfaces are preferable when maximum index performance is required.
Each search will run in a separate process, so you will benefit from additional CPUs for each concurrent search.
Search time is also dominated by IO/s, especially when infrequently accessed data is involved. When searching relatively recent data, or even pulling large (~10,000 event) chunks from greater groups of event data, an individual disk is less likely to be a bottleneck, since each read call to the disk subsystem pulls larger chunks of data. In this case the storage interface is much more critical.
However, when searching for rare terms, like a name that may occur once an hour or once a day, each read call taxes an individual disk more. In these cases, higher-performance individual disks pay massive dividends; in some cases an 8x speedup can be realized by using faster disks.
Gigabit networking is recommended for Splunk servers wherever possible. For all media types, ensure that duplex and mode are negotiated properly and use configurations to force duplex and mode if necessary to ensure predictable connectivity to the Splunk deployment.
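On Linux, for example, negotiated speed and duplex can be inspected and, if necessary, forced with ethtool. This is a sketch, not a prescription: the interface name eth0 is an assumption about your environment, and forcing settings should only be done when autonegotiation is demonstrably misbehaving.

```shell
# Check the currently negotiated speed and duplex (interface name assumed):
ethtool eth0

# Force gigabit, full duplex if autonegotiation misbehaves (requires root).
# The switch port must be configured to match, or the link will not come up.
ethtool -s eth0 speed 1000 duplex full autoneg off
```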