From Splunk Wiki
1. How Splunk processes data through pipelines and processes
This page describes how data travels through Splunk's pipelines and processors on its way to being indexed.
Splunk processes data through pipelines. A pipeline is a thread, and each pipeline consists of multiple functions called processors. Between pipelines are queues. With these pipelines and queues, index-time event processing is parallelized.
This flow-chart information helps you understand which configuration belongs in which processing stage (input, parsing, routing/filtering, or indexing). It is also useful for troubleshooting: it shows which processors or queues are affected when a queue fills up or when a processor consumes excessive CPU time. For hands-on troubleshooting, we recommend the Splunk on Splunk (SoS) app. [Download page]
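As an illustrative sketch (not part of this page's original content), queue fill levels can also be inspected directly from metrics.log with a search along these lines; the field names `current_size_kb` and `max_size_kb` come from the standard `group=queue` metrics lines:

```
index=_internal source=*metrics.log* group=queue
| eval fill_pct = round(current_size_kb / max_size_kb * 100, 1)
| timechart perc90(fill_pct) by name
```

If one queue (for example, indexQueue) is consistently near 100% while the queues upstream of it are also full, the bottleneck is usually downstream of that queue.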
For a high-level overview of the data pipeline (consolidated to highlight the components relevant for planning a Splunk deployment: input, parsing, indexing, and search), see How data moves through Splunk in the Distributed Deployment Manual.
For more about configuration on universal forwarders, heavy forwarders, indexers, and search heads, please also see "Where do I configure my Splunk settings?" in this community wiki.
2. Brief Diagram - Pipelines and Queues
Data in Splunk moves through the data pipeline in phases. Input data originates from inputs such as files and network feeds. As it moves through the pipeline, processors transform the data into searchable events that encapsulate knowledge.
The following figure shows how input data traverses event-processing pipelines (which are the containers for processors) at index-time. Upstream from each processor is a queue for data to be processed.
- Pipeline: A thread. Splunk creates a thread for each pipeline, and multiple pipelines run in parallel.
- Processor: A processing function within a pipeline.
- Queue: Memory space that stores data between pipelines.
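The queues between pipelines are sized in server.conf. A minimal sketch, assuming the commonly documented `[queue=<name>]` stanza (verify the setting name against your Splunk version's server.conf spec):

```
# server.conf (illustrative; parsingQueue's default maxSize is small)
[queue=parsingQueue]
maxSize = 6MB
```

Raising a queue's size only buys buffering time; a persistently full queue points to a downstream bottleneck, not an undersized queue.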
What Pipelines do...
- Input: Inputs data from a source. Source-wide keys, such as source, sourcetype, and host, are annotated here. The output of these pipelines is sent to the parsingQueue.
- Parsing: UTF-8 decoding, line breaking, and header parsing are done here. This is the first place the data stream is split into single-line events. Note that on a UF/LWF, this parsing pipeline does NOT do the parsing work.
- Merging: Line merging for multi-line events and time extraction for each event are done here.
- Typing: Regex replacement and punct. extraction are done here.
- IndexPipe: TCP output to another Splunk instance, syslog output, and indexing are done here. In addition, this pipeline is responsible for the byte quota, block signing, and indexing metrics such as throughput.
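The pipeline stages above map onto props.conf settings. A minimal sketch for a hypothetical sourcetype `my_app` (the stanza name and regexes are assumptions for illustration):

```
# props.conf
[my_app]
# Parsing pipeline: how the raw stream is broken into lines
LINE_BREAKER = ([\r\n]+)
# Merging pipeline: multi-line merging and timestamp extraction
SHOULD_LINEMERGE = false
TIME_PREFIX = ^\[
MAX_TIMESTAMP_LOOKAHEAD = 25
# Typing pipeline: regex replacement via a transforms.conf stanza
TRANSFORMS-anon = mask-ip
```

Because a UF does not run the parsing work, settings like these must live on the heavy forwarder or indexer that first parses the data.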
Main queues and processors for indexing events
[inputs] -> parsingQueue -> [utf8 processor, line breaker, header parsing] -> aggQueue -> [date parsing and line merging] -> typingQueue -> [regex replacement, punct:: addition] -> indexQueue -> [tcp output, syslog output, http output, block signing, indexing, indexing metrics] -> Disk
* nullQueue can be connected from any queue-output processor through configuration in outputs.conf.
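The commonly documented way to route events to nullQueue is a props.conf/transforms.conf pair applied at parse time. A minimal sketch, assuming a hypothetical sourcetype `my_app` and a filter that drops DEBUG-level events:

```
# props.conf
[my_app]
TRANSFORMS-drop_debug = drop_debug_events

# transforms.conf
[drop_debug_events]
REGEX = DEBUG
DEST_KEY = queue
FORMAT = nullQueue
```

Events matching the regex are diverted to nullQueue and discarded instead of continuing to the indexQueue.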