From Splunk Wiki
1. How Splunk processes data through pipelines and processes
This page describes how data travels through Splunk's pipelines and processes to be indexed.
Please keep in mind that if the official Splunk documentation does not explain an attribute or feature described here, the description may be inaccurate, or the feature may not be fully supported or QA'd. Or the diagram may simply be wrong.
When we think about the life cycle of log events in Splunk, we can think about how data is collected (the Input stage), how it is parsed and ingested into the Splunk database (the Indexing stage), and how it is kept in the database (hot -> warm -> cold -> frozen). In Splunk's documentation and presentations, the Input and Indexing stages are often explained under the topic of Getting Data In.
Splunk processes data through pipelines. A pipeline is a thread, and each pipeline consists of multiple functions called processors. There is a queue between pipelines. With these pipelines and queues, index time event processing is parallelized.
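The thread-per-pipeline idea can be sketched in a few lines of Python. This is only an illustration of the concept, not Splunk's implementation: the names `run_pipeline`, `run_chain`, and `SENTINEL` are hypothetical, and the last queue is left unbounded to stand in for the disk.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker, not a Splunk concept


def run_pipeline(processors, in_q, out_q):
    """One pipeline: take an item from the upstream queue, apply each
    processor in order, and hand the result to the downstream queue."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)  # pass the shutdown signal downstream
            break
        for processor in processors:
            item = processor(item)
        out_q.put(item)


def run_chain(events, stages, maxsize=100):
    """Connect the stages with bounded queues and run one thread per pipeline."""
    # One bounded queue in front of each pipeline, plus an unbounded final
    # queue that stands in for the disk so the last pipeline never blocks.
    queues = [queue.Queue(maxsize=maxsize) for _ in stages] + [queue.Queue()]
    threads = [
        threading.Thread(target=run_pipeline, args=(procs, queues[i], queues[i + 1]))
        for i, procs in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for event in events:
        queues[0].put(event)  # blocks while the first queue is full
    queues[0].put(SENTINEL)
    for t in threads:
        t.join()
    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    return results
```

For example, `run_chain([" a ", "b "], [[str.strip], [str.upper]])` runs two pipelines in parallel, each with a single processor. Because the queues are bounded, a slow pipeline makes the queue in front of it fill up, which in turn blocks the pipeline upstream of it.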
This flow chart is helpful for understanding which configuration applies at which processing stage (input, parsing, routing/filtering, or indexing). For troubleshooting, it also helps to understand which processors or queues are affected when a queue is filling up or when a processor's CPU time is very high. For real troubleshooting, we used to recommend the Splunk on Splunk (SoS) app [Download page]. Unfortunately, SoS development stopped at v6.2. The good news is that a built-in management feature, the Monitoring Console, inherited the original idea of the SoS app and greatly expanded its monitoring and management features.
For a high-level overview of the data pipeline (consolidated to highlight the components relevant for planning a Splunk deployment: input, parsing, indexing, and search), see How data moves through Splunk in the Distributed Deployment Manual.
For more about configuration for UFs, heavy forwarders, indexers, and search heads, please also see "Where do I configure my Splunk settings?" in this community wiki and Splunk Doc: Configuration and pipeline.
2. Brief Diagram - Pipelines and Queues
Data in Splunk moves through the data pipeline in phases. Input data originates from inputs such as files and network feeds. As it moves through the pipeline, processors transform the data into searchable events that encapsulate knowledge.
The following figure shows how input data traverses event-processing pipelines (which are the containers for processors) at index-time. Upstream from each processor is a queue for data to be processed.
The next figure is a different view of how input data traverses the pipelines, combined with the bucket life cycle. It shows the concepts of hot, warm, cold, and frozen buckets. How data is stored in buckets and in the database is another good topic to learn.
- Pipeline : A thread. Splunk creates a thread for each pipeline. Multiple pipelines run in parallel.
- Processor: A function that runs within a pipeline.
- Queue : Memory space to store data between pipelines.
What Pipelines do...
- Input : Reads data in from the source. Source-wide keys, such as source, sourcetype, and host, are annotated here. The output of these pipelines is sent to the parsingQueue.
- Parsing : UTF-8 decoding, line breaking, and header parsing are done here. This is the first place where the data stream is split into single-line events. Note that on a UF/LWF, this parsing pipeline does NOT do parsing jobs.
- Merging : Line merging for multi-line events and time extraction for each event are done here.
- Typing : Regex replacement and punct extraction are done here.
- IndexPipe: Tcpout to another Splunk instance, syslog output, and indexing are done here. In addition, this pipeline is responsible for bytequota, block signing, and indexing metrics such as thruput.
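To make the Parsing, Merging, and Typing stages more concrete, here is a toy Python sketch. It is not how Splunk implements these processors. The function names are hypothetical, the timestamp pattern (events start with YYYY-MM-DD) is an assumption made for this example only, and `punct` is a simplified stand-in for the real punct extraction.

```python
import re

# Assumption for this sketch only: every event starts with YYYY-MM-DD.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}")


def break_lines(stream):
    """Parsing stage (toy): split the raw stream into single lines."""
    return [line for line in stream.split("\n") if line]


def merge_lines(lines):
    """Merging stage (toy): a line without a leading timestamp is
    appended to the previous event, forming a multi-line event."""
    events = []
    for line in lines:
        if TIMESTAMP.match(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events


def punct(event):
    """Typing stage (toy): keep only the punctuation of the first line,
    with spaces shown as underscores."""
    first_line = event.split("\n", 1)[0]
    return re.sub(r"\w", "", first_line).replace(" ", "_")


stream = "2024-01-01 ERROR boom\n  at foo()\n2024-01-02 INFO ok\n"
events = merge_lines(break_lines(stream))
# events[0] is the two-line ERROR event, events[1] is the single-line INFO event
```

The order matters: line breaking has to run before line merging, because merging works on the single lines that the parsing stage produced.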
Main queues and processors for indexing events
[inputs] -> parsingQueue -> [utf8 processor, line breaker, header parsing] -> aggQueue -> [date parsing and line merging] -> typingQueue -> [regex replacement, punct:: addition] -> indexQueue -> [tcp output, syslog output, http output, block signing, indexing, indexing metrics] -> Disk
* The NullQueue can be connected from any queueoutput processor through configuration in outputs.conf.
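For troubleshooting, the chain above can be written down as a small lookup table. The sketch below is a hypothetical helper, not a Splunk tool: `PIPELINE_CHAIN` is transcribed from the chain above, while the function names, the fill ratios, and the 0.9 threshold are assumptions. It relies on one common reading of a backed-up chain: the most downstream queue that is full is the one whose readers are likely the bottleneck.

```python
# Each queue paired with the processors that read from it,
# transcribed from the chain above.
PIPELINE_CHAIN = [
    ("parsingQueue", ["utf8 processor", "line breaker", "header parsing"]),
    ("aggQueue", ["date parsing", "line merging"]),
    ("typingQueue", ["regex replacement", "punct:: addition"]),
    ("indexQueue", ["tcp output", "syslog output", "http output",
                    "block signing", "indexing", "indexing metrics"]),
]


def suspects_for_full_queue(queue_name):
    """Return the processors at or downstream of the given queue, nearest first."""
    names = [name for name, _ in PIPELINE_CHAIN]
    start = names.index(queue_name)  # raises ValueError for an unknown queue
    return [p for _, procs in PIPELINE_CHAIN[start:] for p in procs]


def first_blocked_queue(fill_ratios, threshold=0.9):
    """Given {queue name: fill ratio}, return the most downstream queue
    at or above the threshold, or None if no queue is that full."""
    for name, _ in reversed(PIPELINE_CHAIN):
        if fill_ratios.get(name, 0.0) >= threshold:
            return name
    return None
```

For example, if parsingQueue and aggQueue are both nearly full but typingQueue is not, `first_blocked_queue` returns "aggQueue", which points at date parsing and line merging as the first processors to check.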