From Splunk Wiki
Understanding how "buckets" work
This topic explains the concept of "buckets" and goes into detail about how they are managed, as well as how they affect your Splunk search performance.
As Splunk evolves...
The issues with "bad buckets" described in this topic apply to versions of Splunk before 4.0; in Splunk 4.0 and later, the bucketing logic is significantly more accommodating of wide date ranges within incoming data. The underlying logic of bucketing and how data moves through Splunk is still valid for all versions of Splunk.
Another thing to note is that starting with Splunk 4.0, you can have multiple hot buckets. Because of this, Splunk is much more resistant to some of the "bucket spread" issues discussed below. However, it is not completely immune to them, and it is still possible to induce any of the problems discussed below under certain conditions.
How buckets work
As described in this topic about backing up, Splunk places your indexed data in directories, also referred to as "buckets", as the data moves through its lifecycle in Splunk. When data is first indexed, it goes into
db-hot; then, according to your data policy definitions, it moves into the warm bucket(s), then cold, and finally "frozen" (which by default means it is deleted).
Each time db-hot is rolled to warm, it creates a new directory (known as a warm bucket), named to indicate the time range of the events in that bucket, like this:
db_[newest_time]_[oldest_time]_[ID] (you can ignore the ID for the purposes of this discussion). The times are expressed in UTC epoch time (in seconds).
Rolling to warm occurs automatically when the specified bucket size is reached, so the buckets are all typically the same size unless you have rolled manually at some point.
By default, your buckets are located in
$SPLUNK_HOME/var/lib/splunk/defaultdb/db. You should see db-hot there, and any warm buckets you have.
By default, Splunk sets the bucket size to 10GB on 64-bit systems and 750MB on 32-bit systems.
Caution: Do not increase the default setting on a 32-bit system or you risk crashing Splunk.
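In indexes.conf terms, bucket size is controlled per index by the maxDataSize setting. The following is only a sketch; the "auto" keywords exist in later Splunk versions, so verify the setting names against the documentation for your version:

```ini
# indexes.conf -- illustrative sketch, not a drop-in config; verify
# setting names and values for your Splunk version before using.
[main]
# Size in MB at which a hot bucket rolls to warm.  In later versions,
# "auto" (~750MB) and "auto_high_volume" (~10GB on 64-bit systems)
# select the defaults described above.
maxDataSize = auto_high_volume
```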
Example of a standard bucket setup
So, let's say you have a default 64-bit Splunk instance that typically indexes 20GB of raw data an hour, which tends to result in about 10GB of indexed data on disk per hour. (Refer to this topic about estimating your index size for more information.) Because the default bucket size for your system is 10GB, you will roll db-hot to a warm bucket about once an hour, starting a new db-hot each time.
It's October 10th, 2008, and you're indexing data live as it comes in, and your timestamps all reflect this correctly. In this situation, a sample of your warm buckets reflects no overlap in time; when the 10GB of data from about 9am-10am is rolled from hot to warm, it makes a bucket called
db_1223658000_1223654401_2835, then the data from 10-11am is rolled at 11 and makes a bucket called
db_1223661600_1223658001_2836, and so on. (Of course, in a real environment, the UTC times wouldn't be so precisely at the top and bottom of the hour, but this is just an example. I've also made up bogus IDs for each bucket, which you can continue to ignore.)
Each bucket contains about an hour's worth of indexed data, and each bucket that is rolled contains only data that is newer than the bucket that was rolled before it. The "bucket spread" of each bucket is just one hour, give or take.
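For illustration, here's a small sketch (not part of Splunk itself) that splits one of these warm-bucket directory names back into its parts and converts the epoch times to UTC. Note that the example epochs above correspond to US Pacific local times, so 10am local appears as 17:00 UTC:

```python
from datetime import datetime, timezone

def parse_bucket_name(name):
    """Split a db_[newest_time]_[oldest_time]_[ID] directory name
    into its newest/oldest epoch times and bucket ID."""
    _, newest, oldest, bucket_id = name.split("_")
    return int(newest), int(oldest), bucket_id

newest, oldest, bucket_id = parse_bucket_name("db_1223658000_1223654401_2835")
# 1223658000 -> 2008-10-10 17:00:00 UTC, i.e. 10am US Pacific (UTC-7 in October)
print(datetime.fromtimestamp(newest, tz=timezone.utc).isoformat())
print(datetime.fromtimestamp(oldest, tz=timezone.utc).isoformat())
```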
For clarity, here's what the span of your buckets looks like in a more human-readable format at the moment in time of 11:30am on October 10th, 2008:
- db-hot: contains data from right now going back to about 11am, October 10th, 2008
- warm buckets:
- 10.10.2008.11:00:00_10.10.2008.10:00:01_[ID] (10-11am)
- 10.10.2008.10:00:00_10.10.2008.09:00:01_[ID] (9-10am)
and so on.
Searching the data for a few terms yields results quickly, almost always within a couple of seconds.
Example of a problematic bucket setup with "data from the future"
In this situation, you again have a default 64-bit Splunk instance that typically indexes 20GB of data an hour, like the standard setup. As before, db-hot rolls to a new warm bucket every time it reaches 10GB in size.
It's once again 10/10/08, around 11:30am.
However, in this situation, you have some wacky data source that is giving you bogus timestamps that are months in the future, or maybe even years. For the purposes of this example, let's make it 3 months in the future. This affects the date ranges of your buckets a lot: each bucket's range now starts at the time the previous bucket was rolled and extends to January 10th, 2009. So, the buckets are now (using our simpler notation for clarity):
- db-hot: contains data from January 10th, 2009 going back to about 11:00am on 10/10/08.
- warm buckets:
- 01.10.2009.11:00:00_10.10.2008.10:00:01_[ID] (rolled at 11am)
- 01.10.2009.10:00:00_10.10.2008.09:00:01_[ID] (rolled at 10am)
and so on.
This is undesirable for reasons explained later in this topic, but the point to take away from this example is that each bucket contains (bogus) data from 3 months in the future and this is reflected in their naming. The "bucket spread" for these buckets is 3 months.
Searching the data for a simple term can take several minutes, sometimes longer.
How "bucket spread" affects search performance
When you search for something in your indexed data, Splunk gives you the results of your search in reverse chronological order: we assume you want information about what's happening most recently first, with older results arriving later. Splunk first looks in the hot bucket, then the warm buckets, then cold. The frozen db is never searched.
If you search for "fflanda" in your index, Splunk looks to see if it's in db-hot first. If it is, Splunk then looks at the timestamp of the event in which "fflanda" was found, and at the range of time covered by db-hot. Based on that, Splunk decides whether to show you that result right away, or to look in the warm buckets to see if there are any results more recent than it. It will look in every warm bucket (and then in every cold one) that has a range that includes the timestamp of the event in which "fflanda" was found.
In the case of the first example, the "standard" bucket setup, Splunk immediately knows that there are no results for "fflanda" more recent than the one it found in db-hot, and begins returning your results right away.
However, in the second example's case, Splunk will look in every warm bucket, because any of them might contain a more recent result than *right now*: the "bucket spread" extends into the future, so finding an event more recent than *right now* is possible. Splunk will therefore wait to display any of your search results until it has finished searching every bucket that could yield such a result.
This is how the "spread" in time of the data in your buckets affects search performance. How you 'tune' your buckets can make a big difference to your search experience.
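The decision described above can be sketched roughly as follows. This is a simplified model for illustration only, not Splunk's actual implementation; bucket ranges are modeled as (newest, oldest) epoch pairs, mirroring the db_[newest_time]_[oldest_time]_[ID] naming:

```python
def must_search_first(found_time, buckets):
    """Return the buckets whose time range could still contain an event
    newer than a match already found, i.e. whose newest time is at or
    past found_time.  Results can't be streamed until these buckets
    have been searched."""
    return [(n, o) for (n, o) in buckets if n >= found_time]

# Standard setup: hourly buckets with a one-hour spread each
hourly = [(1223661600, 1223658001), (1223658000, 1223654401)]
# "Future data" setup: every bucket's newest time is ~3 months ahead
future = [(1231581600, 1223658001), (1231578000, 1223654401)]

found = 1223662000  # timestamp of a match found in db-hot
print(len(must_search_first(found, hourly)))  # 0 -> stream results immediately
print(len(must_search_first(found, future)))  # 2 -> search every bucket first
```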
Another example, this time with historical data
Once again, it's October 10th, 2008, around 11:30am. Your 64-bit Splunk instance is merrily chugging away at your 20GB an hour of incoming data, resulting in 10GB an hour of indexed data. However, today, that 20GB an hour also includes the contents of a very large zip archive of historical data, ranging back to 2005. This data gets indexed right along with your live incoming data, and as a result, the bucket spread is all over the place; the zip archive contains hundreds of individual log files, each in turn containing hundreds of thousands of events, all being indexed in no particular order.
This time, your searches for current data (stuff that's happening right around now) don't take long, but getting results for anything from the older data can take a long time. You've probably figured out why: the bucket spread is skewed by the historical data, and since you're indexing your *right now* data at the same time, many of the buckets span up to 3 years (2005-2008).
If you have to load up a bunch of archive data, Splunk recommends that you create a separate index for it and route it there explicitly. Refer to this topic in the Administration Guide for information on doing this. You can specify a regex to force all data with timestamps older (or newer) than a given time range to be placed in an alternate index.
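As a rough sketch of what that routing looks like (the source path and index name here are hypothetical; see the Administration Guide topic for the exact mechanism and syntax in your version):

```ini
# props.conf -- match the archive files by source path (hypothetical path)
[source::/data/archive/*.log]
TRANSFORMS-route_archive = route_to_archive

# transforms.conf -- send every matching event to a separate index
[route_to_archive]
REGEX = .
DEST_KEY = _MetaData:Index
FORMAT = archive
```

With the historical data isolated in its own index, the buckets in your main index keep their narrow spread, and searches over current data stay fast.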
When to "tune" bucket sizes
In most cases, you don't really need to change the default settings. However, if you have specific business requirements for archiving or aging out data, it may be necessary to change the policy (using the information in this topic in the Administration Guide). For example, if you are indexing a very low volume of data (less than, say, 10GB a week) and need to back up or otherwise archive your indexed data more frequently than that, you can set a policy that rolls and/or archives your data based on its age instead of on the amount of data in a given bucket. If you do set your policy based on data age, keep in mind that a given bucket will only be aged out by policy based on the *most recent* data it contains.
For help with custom bucket configuration, contact Splunk Professional Services.