From Splunk Wiki
Watch this video: Segmentation and Index size. Stephen Sorkin, Manager, Search and Indexing, discusses the general concepts of segmentation in the context of index size.
What effects should I expect to see?
Inner segmentation, the most space-efficient policy, will typically yield a 50% reduction in index disk usage and up to a 50% boost in search speed for some searches (specifically, searches for rare terms).
How do I measure the effects of changing to a particular segmentation policy?
To get the most accurate empirical measurement, use a fresh Splunk instance and index the same data first under one segmentation policy, then under the other. You can clear the database after each run with the CLI command './splunk clean eventdata'. Be sure to use a large data set (at least 20GB) to get an accurate measurement. To compare results, check the size of the defaultdb directory (usually $SPLUNK_HOME/var/lib/splunk/defaultdb).
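To compare the two runs programmatically, a small helper can total the bytes under the defaultdb directory after each pass. This is a generic sketch (nothing here is Splunk-specific); substitute your own $SPLUNK_HOME path:

```python
import os

def dir_size_bytes(path):
    """Recursively sum the sizes of all regular files under `path`.

    Roughly equivalent to `du -sb <path>`; point it at
    $SPLUNK_HOME/var/lib/splunk/defaultdb after each indexing run.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):  # skip dangling symlinks
                total += os.path.getsize(full)
    return total
```

The ratio of the two totals is the disk-space effect of the segmentation change.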
If you've already switched segmentation settings and want to examine the effects retroactively, you can compare the database buckets created before and after the change. Use the date-modified property of the bucket directories, which reside in defaultdb/db (they are named db_xxxxxxxx_xxxxxxxx_x). Although each bucket rolls at a fixed maximum size (the default for 32-bit systems is 700MB and for 64-bit systems it's 10GB), the number of events contained in each will vary. To find out how many events are in a bucket, examine that bucket's Sourcetypes.data file: sum the values in the third column, ignoring the first row. You can use the following Splunk search as a utility:

| file <full path to Sourcetypes.data file> | rex "::.*?(?<event_count>\d+)" | stats sum(event_count)

Assuming events are, on average, the same length, the ratio of event counts between the two buckets is the improvement.
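The same sum can be done outside Splunk with a short Python sketch. It assumes only the layout described above -- the first row is skipped, and the third whitespace-separated column of every remaining row holds that sourcetype's event count:

```python
def bucket_event_count(path):
    """Sum the event counts in a bucket's Sourcetypes.data file.

    Assumes the layout described in the article: skip the first row,
    then total the third whitespace-separated column of each line.
    """
    total = 0
    with open(path) as f:
        for i, line in enumerate(f):
            if i == 0:
                continue  # first row is not a data row
            cols = line.split()
            if len(cols) >= 3:
                total += int(cols[2])
    return total
```

Run it on a Sourcetypes.data file from a pre-change bucket and one from a post-change bucket, then take the ratio of the two counts.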
Another method, perhaps easier to perform, is to examine the composition of your indexes with the idxprobe search command. Searching for "| idxprobe bucket defaultdb" returns detailed information about each bucket in your main datastore. The idea is to compare the size of the raw data with the size of the tsidx (time series index) files. Use the following search:

| idxprobe bucket defaultdb | eval tsidx_per_raw = tsidx_disk / (raw_disk * 10.0)

(Multiplying by 10 is a rough conversion to uncompressed data volume, reflecting average gzip compression rates.) The result is the number of tsidx bytes generated per byte of uncompressed raw data. The smaller the number the better, although there is no "normal" range. On my instance of Splunk -- with syslog data under full segmentation -- I see numbers between 0.2 and 0.4 tsidx bytes per uncompressed raw byte. Compare the results for a few buckets from before and after the segmentation change to find the improvement.
Accurately measuring search improvements requires a controlled test with a sufficiently large data set. With a data set larger than 20GB, index under one segmentation policy and then the other. After each run, ask Splunk to search for a few seconds and count how many events are returned. You will see the most dramatic improvement when searching for a rare term, like a specific IP address or username; however, the data set must be large enough for that term to appear hundreds of thousands of times, otherwise Splunk will have no difficulty returning all results within a couple of seconds. Use the dispatch command on the command line:

./splunk dispatch "<unique term> | stats count" -maxtime 3 -auth <username>:<password>
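Because both searches are cut off after the same wall-clock budget (-maxtime 3 above), the ratio of the two event counts approximates the search-speed improvement. A trivial sketch, assuming identical searches and time budgets:

```python
def time_bounded_speedup(events_before, events_after):
    """Approximate search speedup from two time-bounded runs.

    Both counts must come from the same search, cut off after the
    same wall-clock budget (e.g. -maxtime 3); the ratio of events
    processed in equal time estimates the speed improvement.
    """
    return events_after / events_before
```

For instance, 100,000 events before the change and 250,000 after suggests roughly a 2.5x speedup for that search.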