From Splunk Wiki
Splunk tuning factors
Splunk's core competency is indexing and searching any type of IT data with speed and efficiency. This versatility can present challenges to both new and seasoned users of Splunk when attempting to identify factors that can affect performance. This section reviews a variety of factors and offers suggestions on how to tune Splunk for a given deployment.
Segmentation is how Splunk identifies items to index in your IT data that aren't key/value pairs or fields. These indexed items, or segments, are, along with fields, the building blocks inside IT data that search capabilities are built upon. Tuning segmentation can improve indexing performance by lowering the total processing required to index each line of IT data and by increasing the potential for effective compression.
Major and minor segments
Splunk maintains two concepts of segments, called major and minor segments.
- Major segments are words, phrases or terms in your data that are surrounded by breaking characters, such as a blank space.
- Minor segments are breaks within a major segment.
For example, the IP address 192.168.1.254 would be indexed entirely as a major segment and then broken up at each period into minor segments such as 192 and 168.
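The major/minor distinction can be illustrated with a short Python sketch. This is not Splunk's implementation: the breaker sets and the `segment` function are assumptions chosen for illustration, and Splunk's real breaker lists are configurable and longer.

```python
import re

# Assumed breaker sets for illustration only; not Splunk's actual defaults.
MAJOR_BREAKERS = r"[\s,;]+"
MINOR_BREAKERS = r"[.:/=@_-]+"

def segment(line):
    """Return (major_segments, minor_segments) for one line of data."""
    majors = [tok for tok in re.split(MAJOR_BREAKERS, line) if tok]
    minors = []
    for major in majors:
        parts = [p for p in re.split(MINOR_BREAKERS, major) if p]
        # Only record minor segments when the major segment actually splits.
        if len(parts) > 1:
            minors.extend(parts)
    return majors, minors

majors, minors = segment("denied 192.168.1.254")
print(majors)  # ['denied', '192.168.1.254']
print(minors)  # ['192', '168', '1', '254']
```

In this sketch the word "denied" yields no minor segments, while the IP address yields four, which is why data rich in punctuation-delimited values costs more to segment.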
Segmentation and data sets
The effect of segmentation on indexing and data storage performance depends directly on the data set in use.
- Highly homogeneous data sets that contain mostly major segments and few minor segments index faster and compress better. Examples of these data sets are access and authentication logs, where a small number of total outcomes occur (permitted, denied) and are delivered in a very similar format with little variance in the information provided per log line.
- Data sets with greater levels of entropy contain more major and minor segments, requiring more processing and data storage. Examples of these data sets are proxy server and transaction logs, where large numbers of users perform a variety of different actions, each of which may produce very different information per log line.
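The compression side of this difference can be demonstrated with a small Python experiment. It uses the standard `zlib` module as a stand-in compressor and synthetic log lines; both are assumptions for illustration and say nothing about the compression Splunk itself applies.

```python
import random
import zlib

def compression_ratio(lines):
    """Compressed size divided by raw size; lower means better compression."""
    raw = "\n".join(lines).encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

rng = random.Random(0)  # fixed seed so the comparison is repeatable

# Homogeneous data: authentication-style lines with two possible outcomes.
auth_lines = [
    f"sshd login user=admin result={rng.choice(['permitted', 'denied'])}"
    for _ in range(2000)
]

def random_token(n):
    """Random lowercase/digit string standing in for a user name or URL path."""
    return "".join(rng.choice("abcdefghijklmnopqrstuvwxyz0123456789") for _ in range(n))

# High-entropy data: proxy-style lines with random users, paths, and sizes.
proxy_lines = [
    f"proxy user={random_token(8)} url=/{random_token(12)} bytes={rng.randint(0, 10**6)}"
    for _ in range(2000)
]

auth_ratio = compression_ratio(auth_lines)
proxy_ratio = compression_ratio(proxy_lines)
print(f"auth ratio:  {auth_ratio:.3f}")
print(f"proxy ratio: {proxy_ratio:.3f}")
```

The authentication-style lines compress to a much smaller fraction of their raw size than the proxy-style lines, mirroring the storage difference described above.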
You can completely disable segmentation, which allows for maximum indexing performance and storage efficiency. This comes at the expense of search convenience and search speed. With segmentation disabled, you can search using the regex search directive (which provides full regular expression search capabilities), search using information indexed in search fields, or combine the two.
Note: Searches that involve regex take longer to execute due to the processing required to find regular expressions in IT data.
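The trade-off can be sketched in Python by comparing a lookup in a prebuilt term index with a regular expression scan over raw events. The sample events, the breaker pattern, and the function names are hypothetical; this is a conceptual model, not how Splunk stores or searches data.

```python
import re
from collections import defaultdict

events = [
    "user=alice action=login result=permitted",
    "user=bob action=login result=denied",
    "user=alice action=logout result=permitted",
]

# With segmentation: build a term -> event-number index once, at index time.
index = defaultdict(set)
for event_no, event in enumerate(events):
    for term in re.split(r"[\s=]+", event):
        index[term].add(event_no)

def indexed_search(term):
    """Look the term up directly; no scan of the raw events is needed."""
    return sorted(index.get(term, set()))

def regex_search(pattern):
    """No index available: every raw event must be scanned with the regex."""
    compiled = re.compile(pattern)
    return [n for n, event in enumerate(events) if compiled.search(event)]

print(indexed_search("denied"))          # [1]
print(regex_search(r"result=den\w+"))    # [1]
```

Both searches find the same event, but the regex version touches every event on every search, while the indexed version paid that cost once when the data was indexed.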
Splunk can automatically extract the source host from a given piece of IT data, which is useful in situations where data is being aggregated before arriving at Splunk to be indexed.
Splunk can also identify timestamps in any given piece of IT data from a variety of formats, which helps not only with pre-aggregated data but also with data sources that embed their timestamps in non-standard formats.
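Recognizing timestamps in several formats can be sketched in Python as trying a list of candidate formats in turn. The format list and the `extract_timestamp` helper are hypothetical, and the sketch assumes the timestamp text has already been isolated from the rest of the line; Splunk's own recognition logic is not shown here.

```python
from datetime import datetime

# Hypothetical list of formats to try, in order.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",   # 2008-03-14 09:26:53
    "%d/%b/%Y:%H:%M:%S",   # 14/Mar/2008:09:26:53 (web access log style)
    "%Y%m%dT%H%M%S",       # 20080314T092653
]

def extract_timestamp(text):
    """Return the first datetime any candidate format parses, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    return None

print(extract_timestamp("2008-03-14 09:26:53"))
print(extract_timestamp("20080314T092653"))
print(extract_timestamp("not a timestamp"))
```

The first two calls yield the same moment in time, and the last returns None, which a real pipeline would need to handle, for example by falling back to the time the data was received.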
Search convenience and data storage
The combination of indexing options you select ultimately defines how convenient it is to search your IT data. Any combination of the above options is supported and can be implemented on a per source or source type basis. This lets you minimize the index overhead associated with data that is not searched frequently, while making commonly searched data more convenient for users.
A good example of how this can be used to optimize a Splunk deployment is IT policy compliance. Splunk can be used to search proxy server and transaction logs for user access monitoring and user activity search, while also serving as a central repository for other types of IT data, such as system logs, that must be retained but may be of less interest to a compliance administrator.
To maintain maximum convenience and allow saved searches to run quickly and efficiently, the maximum amount of segmentation should be applied to the proxy server and transaction logs, which would be configured as discrete sourcetypes. Additional search fields may also be desirable to quickly identify key/value pairs of interest. System logs, also a discrete sourcetype, could have segmentation disabled, given that they are simply being aggregated and stored to adhere to the IT control or mandate.
If you're having problems with odd data being presented from your otherwise "normal" sources, such as incorrect times being reported from a firewall log, ensure the sourcetype is correctly set. Edit %homedir%/splunk/opt/etc/system/local/inputs.conf to view and edit your inputs. See the Wiki page on input types; Splunk recognizes many of them by default, and the page includes samples of what your data may look like to help you match up the type.