From Splunk Wiki
This is meant as a basic introduction to Splunk's file data acquisition strategy. Later documents or work may discuss optimizations or workarounds.
This is primarily about the input of uncompressed logfiles; compressed logfiles (.gz, .zip) are handled similarly, but not identically.
Splunk's monitor inputs (and sinkhole inputs) try to achieve the following goals:
- Acquire all data from the specified locations, i.e.
- specified files
- all files located in specified directories
- Never index the same data twice, even when files are renamed, etc.
- When a file is replaced (via rename, or delete and recreate), notice this and acquire the new data.
To achieve these goals, Splunk assumes normal logging behavior:
- Log files are modified by adding text to them, not rewriting existing text.
- Log files with different contents will have different headers (initial bytes)
- Log files will be at least 256 bytes in size
Hashing and recognizing files
Splunk recognizes files not by filename, but by contents. This is because renaming log files is a normal practice. Since it would be impractical to compare all text of all files against each other, Splunk uses the time-honored approach of hashing.
The algorithm, with many simplifications, is:
First: Hash the first 256 bytes of the file.
New hash / new file:
- If it is a new file which has not been seen before, index the contents of the file.
- Hash the last 256 bytes of the file.
- Store the Hash of the start, the end, the location of the end, close the file.
Seen hash / known file:
- Seek to the offset where we last stopped reading for this hash.
- Check the hash at this location; if it does not match, treat the file as a new file.
- If both hashes match, this is the same file as before:
- Read any additional new data that was added to the file.
- Store the new file offset with a new offset hash, and close the file.
You can inspect this data yourself, if interested, in the fishbucket.
The net result is that new data will be located and indexed, while old data will not be re-indexed.