Community:Plotting a linear trendline
From Splunk Wiki
Did you know you can do a linear trendline (regression) within Splunk? While Splunk already comes with a built-in search command for doing trendlines based on moving averages (see the "trendline" command), you can also do more complex computations, such as a linear regression using search commands such as 'eventstats' and 'eval'.
Additionally, you can use search macros to mask the complexity of the underlying search.
Step 1: Creating the macro
1. Create a search macro named "lineartrend(2)" via the Splunk Manager user interface. Official documentation on search macros are here:
2. The macro definition should be:
eventstats count as numevents sum($x$) as sumX sum($y$) as sumY sum(eval($x$*$y$)) as sumXY sum(eval($x$*$x$)) as sumX2 sum(eval($y$*$y$)) as sumY2 | eval slope=((numevents*sumXY)-(sumX*sumY))/((numevents*sumX2)-(sumX*sumX)) | eval yintercept= (sumY-(slope*sumX))/numevents | eval newY=(yintercept + (slope*$x$)) | eval R=((numevents*sumXY) - (sumX*sumY))/sqrt(((numevents*sumX2)-(sumX*sumX))* ((numevents*sumY2)-(sumY*sumY))) | eval R2=R*R
Note: When copy/pasting the macro definition, make sure you paste it into Splunk as a single line. You will get errors otherwise.
3. Macro arguments should be:
x, y
Step 2: Plotting the trendline
The macro `lineartrend` gets passed two arguments, x and y values from each event, and creates the following fields for each event:
- slope - the slope of the trendline,
- yintercept - the y-intercept of the trendline
- R - coefficient of correlation
- R2- R-squared, the coefficent of determination
- newY - the trendline values
For interested readers, the calculation was borrowed from this web site: http://easycalculation.com/statistics/learn-regression.php
You can quickly test this on time series data with the Y-value of the chart being simple event counts. Run the following search from the Search App, and then click "Build Report":
<search>sourcetype=my_data | timechart count as yvalue | `lineartrend(_time,yvalue)` | timechart sum(yvalue) sum(newY)</search>
The first timechart generates the sample source data (a plot of event counts over time) which is then piped to the macro 'lineartrend' which generates the new y axis values of the trend. The second timechart stanza plots both the original values and the trendline on a chart, which you can visualize by clicking "Build report"
Step 3: Getting the correlation coefficient (R) or coefficient of determination (R-squared)
The above macro also calculates R and R-square for the above chart automatically. To display R-squared only (for example, to put it below your chart on a dashboard), run the following search:
sourcetype=my_data | timechart count as yvalue | `lineartrend(_time,yvalue)` | stats first(R2)
Sample dashboard
Enjoy!
Gaurav 19:41, 7 December 2009 (PST)