Community:Troubleshooting Monitor Inputs
From Splunk Wiki
Before you embark on enabling DEBUG mode, run through easier checks first. Look particularly in the Diagnose problems section. Also, read through all the troubleshooting suggestions before trying anything, especially on a production system!
- Identify the file(s) that are having problems and where they live. Investigations should take place on the Splunk instance that is reading the source files.
- Enable some debug settings on the Splunk instance that is reading the problem file(s). These will tell us how often Splunk is actually checking the problem files. You can enable these DEBUG settings via the Splunk UI by going to 'Manager » System settings » System logging'. Search for the processor names in the list below using the text box and change the logging level to DEBUG. These changes will not persist if you restart the Splunk instance. If you want to have the processors in DEBUG on startup, then you have to edit the
- For Splunk 6.3 and later
- maybe ?? category.FileInputTracker=DEBUG
- For Splunk 4.1 and later
- maybe ?? category.FileInputTracker=DEBUG
- For Splunk 3.x, early 4.0 versions and LightWeight Forwarders (no UI) you can do this by editing a file - In $SPLUNK_HOME/etc/log.cfg, add these lines to the first group of 'category.' entries. There will already be an entry for 'FileInputTracker', please comment that out so that there's no conflict between them -
- Restart Splunk. Now every time Splunk checks the inputs file, it will be recorded in $SPLUNK_HOME/var/log/splunk/splunkd.log. Remember to change these settings back when you are finished investigating.
- For Splunk 4.0.5 and later you can enable these settings dynamically from the Manager page. Simply click on 'System Logging' and then find the processors listed above (FileInputTracker etc.) and set the logging level to DEBUG. These settings are not persistent and will revert to default level when Splunk is restarted.
- For Splunk 6.3 and later
- In Splunk 4.1 and later, you can access a debugging endpoint that shows status information about monitored files. https://your-splunk-server:8089/services/admin/inputstatus/TailingProcessor:FileStatus Note that this may change in the future, and is a somewhat rough prototype of the monitor behavior inspection.
Basic configuration checks
- Can the forwarder ping the indexer(s)?
- Is the port you are using for forwarding open? A
** Are you using the same port on the forwarder (outputs.conf) and the indexers (inputs.conf)?
- Do your configuration files have errors? Use splunk btool to check.
- Use splunk list monitor on the forwarder to see which files are being monitored
Trace the behavior for a file you are interested in
Say you know that the file /var/log/foobar should be indexed, and you're not seeing data for it. Scan through your debug data for this file. Use grep, or search index=_internal for the name.
- Does the filename appear at all?
- Message about ignoring non-whitelisted file?
- Fix patterns
- Message about file being binary?
- Check to see that the file is not binary! Splunk indexes text files, not database stores, jpgs, etc.
- Set the CHARSET explicitly if not UTF8, ascii, or similar in props.conf
- If splunkd insists it is binary anyway, and is wrong, set NO_BINARY_CHECK in props.conf
- Some files appear and are working, but not your file
- This may be a scaling issue. See below.
- Permission denied
- Fix permissions
- Modtime has not changed message
- check modtime using operating system tools
- If file is being updated, but modtime is not, you have a windows bug. Use ALWAYS_OPEN_FILE flag in input stanza as workaround.
Similar file messages
- In the Splunk UI, run the following search - index=_internal "FileInputTracker" *<path_to_file>* - You may see events like this come back:
05-01-2009 12:18:11.444 WARN FileInputTracker - Stored seekPtr longer than file: <path_to_file>. 05-01-2009 12:18:11.444 WARN FileInputTracker - Read a similar file to this that was longer. Perhaps you have two files that have the same content.
This means that you may have multiple files that contain the same content, or that Splunk believes contain the same content. Splunk 3.x and 4.x use the first 256 bytes of the file to distinguish, so boiler plate headers (such as log files in XML format) can cause this problem.
To work around, you can add the following setting to the appropriate
monitor stanza in
$SPLUNK_HOME/etc/system/local/inputs.conf, or whever your
input is defined.
[monitor://<input_path>] crcSalt = <SOURCE>
NOTE: This setting is case-sensitive and should be applied exactly as it is typed here. i.e. Do not substitute anything for <SOURCE>
This essentially forces splunk to believe that each differently named file is a different input. This is not appropriate when monitoring files which are renamed (rolled).
05-01-2009 12:18:11.444 WARN FileInputTracker - Hit EOF while computing CRC: 0/256. 05-01-2009 12:18:11.444 WARN FileInputTracker - Couldn't get CRC for file: <path_to_file> 05-01-2009 12:18:11.444 DEBUG FileInputTracker - We've read a newer version (<path_to_file>) of this file (<path_to_another_file>). Skipping.
- This means that Splunk cannot get a correct read on the file to calculate the CRC.
- The common reason is the file is shorter than 256 bytes. Splunk does not support sub-256 byte files well.
- It's possible that some read errors may produce this.
Use your local filesystem/operating system resources to determine which case you are in, or work with support to investigate.
Slow / lagging input
- If you can see a signifcant time-lag between file reads, check to see what Splunk is doing in between, is it reading other files? Are those files static or live?
- If have many live files and you want to continue monitoring them but prioritze a specific file, use the dedicatedFD setting in inputs.conf (not supported on Windows OS)
- If you have many stale/static files and Splunk is just wasting time checking them, you can either move them aside so that Splunk no longer sees them.
Files that cease to be indexed but don't close
If you see many entries of the form:
06-04-2009 09:31:25.003 DEBUG selectProcessor - EOF '<path_to_file>' found. File has been written to in the last '5' seconds. Will keep open.
But none of the form:
06-04-2009 09:31:25.003 INFO FileInputTracker - Computing CRC for seekPtr=55de367d filename=<path_to_file>
Then your file is being kept open, but no new data is arriving. The second type of message shows Splunk is advancing through the file, while the first shows that Splunk believes the file is 'active' via the modification time.
Some problems that can cause this are:
- Time skew between servers, where the files are timestamped in the future
- Bugs where programs truncate their logs (log4j will do this on windows in some cases)
Files are being re-read - duplicate indexing
- Generally speaking, the first step is to carefully verify that the data is not present multiple times in the source data (eg two log files with mostly similar contents). Carefully ruling this scenario out will save a great deal of time, as it is the most common cause of "duplicate" events.
If Splunk is re-indexing a file, you will end up with duplicate data in your index and possibly license violations. How can you recognize this? Easy, the FileInputTracker will tell you.
As it reads a file, Splunk will keep track of it's place using the 'seekPtr'. When it starts reading from the beginning of that file, the seekPtr will be 8000. AFter that, every time the file is read and new data is indexed, the seekPtr should increase. If it resets to 8000 at any point, that means that Splunk has started from the beginning again - perfectly valid in the case of a file being rolled. When that file hasn't been rolled however, it means data is getting re-indexed.
Using the DEBUG settings above, identify the file in question and track the 'seekPtr' value in the FileInputTracker events in splunkd.log. If you think the file is in fact getting re-indexed, something is likely changing it in order to make Splunk think it's a new file
Causes of reindexing:
- File contents (especially the first 256 bytes) are modified in-place. This shouldn't happen for log files (they're supposed to be a record).
- The CHECK_METHOD for the files was set to entire_md5 or modtime. This forces the files to be reindexed.
- Some sourcetypes like 'text_file' intentionally set the CHECK_METHOD because it is desired to index the complete file each time.
- You have multiple input stanzas which are monitoring the same file content, via tricky means -- such as one traversing a symbolic link, and one not, or linux bind mounts, or duplicate network mounts, etc.
- You have incorrectly turned on the 'crcSalt' feature for files that are renamed (such as rolling files). Do not do this.
With very large numbers of files, you might not get the performance you want out of file monitors. Situation occur where monitor pulls in the data correctly, but:
- It might take a while to get started (post-restart) on the files you are interested in
- It may seem to keep some files working, but not get around to other files, especially if you have many active files.
Splunk file discovery is "breadth first". The nature of each input is discovered first. If it is a direct file input, it will be identified essentially right away. After this, the top level contents of each directory monitor is discoved. Then each first-level subdirectory, then second, etc. Thus a deep subdirectory will come last.
Post-startup you can evaluate the progress of file discovery using 'splunk list monitor'. If this list is very slow to build, this is the nature of your problem.
Using the Debug input, the behavior of TailingProcessor is the file discovery process and its output is relevant to this type of issue.
- Monitor a smaller quantity of files
- Remove stale files
- Use more focused subdirectory inputs
- Use Splunk 3.4.12 as a forwarder and adjust tailing_proc_speed
File list traversal
Splunk keeps only so many files open at a time (default, 100). If you have files that are written to more than once every 5 seconds, this table should be expanded.
Identifying this problem involves reviewing the selectProcessor lines from the generated debug, to see when files are closed ("will reopen later"). If the files are consistently closed in a timely manner, this is not the problem. However, if the files do not seem to get moved out of the table, there are different possible remedies.
- In time skew situations, fix the time skew between the file storage location and splunkd location
- Increase the file table size, max_fd in limits.conf [inputproc] to a larger value, eg 256, or larger. May need to adjust ulimit -n
- Decrease time_before_close to cause splunk to give up on files more quickly
Log file behavior
If we don't know what's going on with log files, but there is trouble, we may want to know more about what's happening on the system.
If we want to know, for example, when the files are being created, how long they live, if permissions are changing at odd times, we may just want to know when the third party applications are doing these things.
We may want to use strace or truss to monitor the third party app, if it is a specific program we are interested in. strace can take a -e file,desc option to filter for file and file descriptor events.
If we want to know about all changes to files in a directory, on Linux we can install inotify-tools and run `inotifywait -mr --timefmt '%d/%m/%y %H:%M' --format '%T %w %f %e' /path/to/monitor` to generate a trail of which files are being modified and how. This could give clues about changed permissions, etc.
Solaris, Linux, Windows, and other platforms also have auditing interfaces, that can allow you to find out every single filesystem action, if you have them installed, and configured. This is a nontrivial undertaking, but they are good tools if you have them available.