From Splunk Wiki
Splunk Bucket Retention, Timestamps, and You
This document is about:
Splunk Enterprise, Splunk Free, and Splunk Light, and potentially, though to a mostly irrelevant degree, Hunk. When I say Splunk here, I mean the index management functionality of all of these products.
Splunk data retention has a few layers of configuration: bucket counts, index max size, index max time, total size per index in hot/warm, total size per index in cold, as well as aggregate storage controls at the level of the optional volume feature. All of that is discussed in the Splunk documentation about data size controls.
But how exactly does all that logic engage?
How Splunk ranks buckets
When a rule says a storage bin is over capacity, or when Splunk wants to check if a chunk of data (a bucket) has expired, what are the decision points we use?
A bucket, for index management purposes, is a chunk of data with known maximal endpoints. We have a pile of events in a directory, and we know the directory goes from, for example, 02:05 last Wednesday at the earliest until 05:30 last Thursday at the latest, covering a span of more than a day.
To do some ASCII art, we have
02:05 wed|---------------------------|05:30 thu
           ** *  *****    *   *  **      <-event times
From the perspective of considering and managing the bucket, Splunk doesn't actually know where in time the events inside it fall, just that they're within the two time endpoints.
Since the rules of bucket management will lead to data being discarded, and it would be silly to throw away data that still meets the time expectations, we basically simplify bucket management to the Latest Time, or lt: the timestamp of the event in the bucket that is closest to now.
To repeat: Splunk manages buckets in accordance with the latest event contained in the bucket, the event closest to the future.
Examples of bucket ranking at work?
- When an index has reached its maximum size, which bucket gets frozen (typically deleted)?
- The bucket with the oldest leading edge: the oldest newest-event time.
For example, here are my buckets:
|-----------|        1
   |----|            2
        |----------| 3
Which bucket gets dropped? #2, because we can see that it is the bucket with the provably oldest event set. Buckets 1 and 3 both contain more recent events.
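The ranking rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not Splunk's implementation: the Bucket class, the next_to_freeze function, and the timestamps are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bucket:
    name: str
    earliest: int  # timestamp of the oldest event in the bucket
    latest: int    # timestamp of the newest event in the bucket ("lt")


def next_to_freeze(buckets):
    """Pick the bucket whose newest event is oldest; earliest is ignored."""
    return min(buckets, key=lambda b: b.latest)


# Made-up times, laid out like the three-bucket picture above.
buckets = [
    Bucket("1", earliest=0, latest=120),
    Bucket("2", earliest=30, latest=80),
    Bucket("3", earliest=80, latest=190),
]

print(next_to_freeze(buckets).name)  # prints 2
```

Note that bucket 1 holds the single oldest event (earliest=0), yet it is not the one chosen, because only the latest time is compared.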
- When an index has a maximum retention time, how is the retention time applied?
- When the bucket passes entirely out of the time limit, it is frozen.
        |<- freeze time
  |----------|     1
          |------| 2
 |----|            3
|--------|         4
        |<- freeze time
Bucket 3 will be frozen next time Splunk checks for freeze work (usually about once every 30 seconds), because it is entirely outside the retention time. Buckets 1, 2, and 4 have some data still inside the retention time. Of those, 4 will be frozen first.
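As a rough sketch of the time-based check, assuming a bucket is compared against "now minus the retention period" using only its latest time: the is_expired helper and all the numbers here are hypothetical, and the exact comparison Splunk performs (strict or not) is not something this example claims to reproduce.

```python
def is_expired(latest, now, frozen_time_period_secs):
    """A bucket qualifies only when its newest event is past the limit."""
    return latest < now - frozen_time_period_secs


now = 1000
retention = 600  # stands in for frozenTimePeriodInSecs; the cutoff is t=400

# name -> (earliest, latest), made-up numbers
buckets = {
    "1": (100, 520),
    "2": (450, 800),
    "3": (50, 300),
    "4": (0, 420),
}

expired = [n for n, (_, lt) in buckets.items() if is_expired(lt, now, retention)]
remaining = sorted(
    (n for n in buckets if n not in expired),
    key=lambda n: buckets[n][1],
)
print(expired)    # ['3']
print(remaining)  # ['4', '1', '2']
```

Bucket 4 contains the oldest event of all, but it survives this pass because its newest event is still inside the window. It is simply next in line.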
- When the entire volume in which all of my indexes were configured to be stored fills up, which bucket gets moved to cold or frozen out?
- The oldest bucket based on the newest edge, out of all indexes, just like for the single index case.
Managing overall index retention (favoring some indexes over others) can be done by combining settings, enforcing a limit on some indexes within a volume while allowing others to use the remaining space.
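The volume-wide case can be sketched the same way. The function below is a simplified model, not Splunk code: evict_until_fits, the index names, and the sizes are invented for illustration, and "evicted" stands for whatever the volume rule does with the bucket (roll to cold or freeze).

```python
def evict_until_fits(buckets, max_volume_mb):
    """buckets: list of (index, bucket_id, latest, size_mb) tuples.

    Returns the buckets to move out, oldest latest-time first,
    until the total size is within max_volume_mb.
    """
    total = sum(size for _, _, _, size in buckets)
    evicted = []
    for bucket in sorted(buckets, key=lambda b: b[2]):
        if total <= max_volume_mb:
            break
        evicted.append(bucket)
        total -= bucket[3]
    return evicted


volume = [
    ("web", 1, 500, 40),
    ("web", 2, 900, 40),
    ("firewall", 1, 300, 40),
    ("firewall", 2, 700, 40),
]
evicted = evict_until_fits(volume, max_volume_mb=100)
print([(idx, bid) for idx, bid, _, _ in evicted])
# [('firewall', 1), ('web', 1)]
```

The index a bucket belongs to plays no part in the ordering, which is why per-index limits are needed if some indexes should be favored.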
Corner cases of bucket retention
What happens when we add data from years ago to Splunk?
Data we already had:
                                     |<--now
        |----| |-----| |----| |----|
3 year old logs that we load into the same index:
|----| |---| |---|
What happens if disk space is nearly used up, or this new data fills the configured space? Splunk immediately begins dropping that old data. That's what the system is configured to do. If you want to keep the old data, you need more space or a more permissive maximum age configuration.
Future data? What is that?
Usually future events are the result of bugs, misconfiguration, or other problems. Maybe the logging application went crazy and started writing out 9999 for the year. Maybe the network time protocol daemon has a bug and accidentally sent the system clock into next month. Maybe your Splunk was incorrectly configured to read the month as the day and the day as the month for the data (silly ambiguous slash formats). Or sometimes it's as simple as your servers not being well synchronized in time, so one is running ten minutes ahead of the other.
Nearly always, future data is something you don't want, and Splunk has some rules to try to avoid wrong guesses when relying on automatic timestamp handling. However, if the configuration says the timestamp format means the data is coming in 2 days in the future, Splunk tends to guess you'd prefer to have the data available in some form, so often it's indexed at that time.
Now we get future data...
                         |<--now
|-------|                          .. various normal buckets
     |-----|
          |------|
                             |---| and a future bucket
What's going to happen with retention? If its latest event is 3 days in the future, the future bucket will be kept 3 days longer than any of the other data. This might be undesirable. In some explicit configurations you can ask Splunk to permit data even years into the future, which will typically cause it to be retained years longer than your other data.
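The extra retention is just arithmetic on the latest time. A small sketch, where freeze_due, the 90-day retention, and the "now" value are all assumptions chosen for illustration:

```python
DAY = 86400


def freeze_due(latest, frozen_time_period_secs):
    """Earliest moment the time-based rule can freeze a bucket."""
    return latest + frozen_time_period_secs


now = 1_700_000_000      # arbitrary "now", epoch seconds
retention = 90 * DAY     # hypothetical frozenTimePeriodInSecs

normal_due = freeze_due(now, retention)            # newest event is "now"
future_due = freeze_due(now + 3 * DAY, retention)  # newest event 3 days ahead

print((future_due - normal_due) // DAY)  # prints 3
```

The same reasoning applies to size-based limits: a bucket with a future latest time sorts after every normal bucket, so it is the last candidate to be dropped.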
What to do about future data?
If the amount of data is small, you may want to ignore it. If it's large, and interfering with space allocation, you may want to manually remove it; for example, in the simplest non-clustered cases, stop Splunk and remove the affected bucket directories.
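To find candidate directories, one option is to read the times out of the bucket directory names. The sketch below assumes warm and cold bucket directories are named db_<latest>_<earliest>_<id> (with an extra suffix on clustered systems); check that assumption against your own index directories before relying on it. The future_buckets helper and the sample names are hypothetical, and the function only reports, it deletes nothing.

```python
import re

# Assumed layout: db_<latest>_<earliest>_<id>, optionally followed by _<guid>
BUCKET_DIR = re.compile(r"^db_(\d+)_(\d+)_(\d+)(?:_[0-9A-Fa-f-]+)?$")


def future_buckets(dir_names, now):
    """Return (name, latest) for bucket directories whose latest time is after now."""
    found = []
    for name in dir_names:
        m = BUCKET_DIR.match(name)
        if m and int(m.group(1)) > now:
            found.append((name, int(m.group(1))))
    return found


names = [
    "db_1699990000_1699900000_12",
    "db_1700300000_1699995000_13",   # latest event is after "now"
    "hot_v1_14",                     # does not match the assumed pattern; skipped
]
print(future_buckets(names, now=1_700_000_000))
# [('db_1700300000_1699995000_13', 1700300000)]
```

In practice the names would come from listing the index's bucket directory, and "now" from the current epoch time.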
Usually you want to stop getting more of it.
Typically this involves fixing the logging application, or fixing the configurations used to handle the data so that it gets the timestamps correct. Most frequently this simply means setting up a TIME_FORMAT & TIME_PREFIX for the sourcetype to explicitly control how the timestamp is determined. Sometimes it means a brute-force of DATETIME_CONFIG = CURRENT to simply disable all timestamp detection, using the time-of-processing of the data as the time.
If you have the very unusual goal of actually storing significant future-time data in Splunk, you should probably consider using an independent index to store this data, and ensuring it has maximum size configurations in place to prevent it from dominating your disk space.
What if the default time-retention logic is not exact enough? What about my data compliance requirements?
If you have very strict requirements about when data must become non-available in your Splunk system, you will have to use some form of workaround.
Enforce Max bucket spans
If you give Splunk a maximum time span (maxHotSpanSecs) that a single bucket is permitted to cover, then you can set your time to expire data (frozenTimePeriodInSecs) earlier than the requirement by the size of that max span. For example, if you set the maximum bucket timespan to 2 days, and set frozenTimePeriodInSecs to 2 days less than the requirement, the 2-day bucket spans will fully expire in time to go away.
- Be aware that the precise value of 86400 (1 day) is dangerous in some versions of Splunk. Fully explaining the behavior and goals and why you don't want it isn't worth your time. Use a value one second off if you want to use such a value.
- Tinkering with maxHotSpanSecs is tricky to get right. If you alter the configuration far away from the default, you may end up with a rapid accumulation of buckets. Basically we need enough buckets of enough timespan size to cover the range of incoming data times. If tinkering with this, keep a close eye on bucket size and bucket count accumulation rate after setting the system up. For smaller bucket spans, you may find it necessary to increase the number of concurrent hot buckets (maxHotBuckets) to cover the incoming data.
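The max-span arithmetic can be worked through as follows. This is a sketch under assumptions: the 30-day requirement is a made-up example, frozen_period_for is a hypothetical helper, and the delay until the next freeze check is ignored.

```python
DAY = 86400


def frozen_period_for(requirement_secs, max_hot_span_secs):
    """frozenTimePeriodInSecs that leaves room for a full bucket span."""
    if max_hot_span_secs >= requirement_secs:
        raise ValueError("bucket span must be shorter than the requirement")
    return requirement_secs - max_hot_span_secs


requirement = 30 * DAY   # hypothetical: data must be gone after 30 days
max_span = 2 * DAY       # maxHotSpanSecs from the example above

frozen = frozen_period_for(requirement, max_span)
print(frozen // DAY)     # prints 28

# Worst case: a bucket is frozen when its newest event is `frozen` old,
# and its oldest event can be up to `max_span` older than that.
print((frozen + max_span) // DAY)  # prints 30
```

So with these numbers, the oldest event in any bucket is at most 30 days old when the bucket becomes eligible for freezing.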
The |delete search command can be used by users to whom you assign the can_delete role (by default no user has it). Events that are fed into the delete command are marked unsearchable and will not be returned by future searches. By running such a search frequently to erase events past your permitted time window, you may be able to meet the compliance needs.
Be aware that |delete will cause Splunk search not to return the data (it will be practically unavailable), but the text of the events still exists in Splunk files on the filesystem, so depending upon the letter of your requirements, this may or may not be sufficient.