Friday, November 10, 2017

Detecting Data Feed Issues with Splunk - Part I

by Tony Lee

As a Splunk admin, you don’t always control the devices that generate your data. As a result, you may only have control of the data once it reaches Splunk. But what happens when that data stops being sent to Splunk? How long does it take anyone to notice and how much data is lost in the meantime?

We have seen many customers struggle with monitoring and detecting data feed issues, so we figured we would share some of the challenges along with a few possible methods for detecting and alerting on them.

Challenges

Before we discuss the solution, we want to highlight a few challenges to consider when trying to detect data feed issues:
1) It requires searching over a massive amount of data, so searches in high-volume environments may take a while to return.  We have you covered.
2) A complete loss of traffic may not be required; a partial loss may be enough to warrant alerting.  We still have you covered.
3) There may be legitimate reductions in data (for example, weekends) that produce false alarms, so the reduction percentage may need to be adjusted.  Yes, we still have you covered.

Constructing a solution

Given these challenges, we want to walk you through the solution we developed (if you are short on time, skip straight to the final solution below). This solution can be adapted to monitor indexes, sources, or sourcetypes, depending on what makes the most sense for you. If each of your data sources goes into its own index, then index makes the most sense. If multiple data feeds share indexes but are distinguished by source or sourcetype, then it may make more sense to monitor by source or sourcetype. To make that change, just change every instance of "index" (except for the first index=*) to "sourcetype" in the searches below. Our example syntax shows index monitoring, while the screenshots show sourcetype monitoring; this approach is very flexible.
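As a quick illustration of that substitution (just a sketch; adjust it to your environment), the Step 1 search further below would become the following when monitoring by sourcetype:

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by sourcetype, _time span=1d | timechart useother=false limit=0 span=1d count by sourcetype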

The first challenge to consider in our searches is the massive amount of data we need to search.  We could use traditional searches such as index=*, but those searches can take an extremely long time to complete, even in smaller environments.  For this reason we use the tstats command.  In one fairly large environment, it was able to search through 3,663,760,230 events from two days' worth of traffic in just 28.526 seconds.
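If you want a quick feel for tstats before building the full searches, a minimal sketch that simply counts events per index over the same two-day window looks like this (not part of the final solution, just a warm-up):

| tstats count where earliest=-2d@d latest=-0d@d index=* by index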

The first solution we arrived at was the following:

Step 1)  View data sources and traffic:

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index


Figure 1:  Viewing your traffic

Step 2)  Transpose the data:

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index |  eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename "row 1" AS TwoDaysAgo, "row 2" AS Yesterday


Figure 2:  Transposing the data to get the columns where we need them.

Step 3)  Alert Trigger for dead data source (Yesterday=0):

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index |  eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename "row 1" AS TwoDaysAgo, "row 2" AS Yesterday | where Yesterday=0


Figure 3:  Detecting a complete loss in traffic.  May not be the best solution.

The problem with this approach is that it would not detect a partial loss of traffic.  Even if only one event was sent, you would not receive an alert.  Thus, we changed the search to detect a percentage of drop-off instead.


Final solution:  Alert for a percentage of drop-off (the example below alerts on a reduction of more than 25%):

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index |  eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename column AS DataSource, "row 1" AS TwoDaysAgo, "row 2" AS Yesterday | eval PercentageDiff=((Yesterday/TwoDaysAgo)*100) | where PercentageDiff<75

Figure 4:  Final solution to detect a percentage of decline in traffic
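To make the math concrete (the numbers here are purely hypothetical): if an index logged 1,000 events two days ago and only 600 yesterday, PercentageDiff would be (600/1000)*100 = 60, which is less than 75, so that row would appear in the results.  A drop from 1,000 to 800 yields 80 and would not trigger.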

Caveats

The solution above should get most people where they need to be.  However, depending on your environment, you may need to make some adjustments, such as the percentage of traffic reduction; that is a simple change of the 75 in the search above.  We have also included some additional caveats that we have encountered:
1) There may be legitimate indexes with low event counts, or even naturally occurring days with zero events.  Use "index!=<name>" after index=* in the | tstats command to ignore these indexes (see the sketch after this list).
2) Reminder:  You may send multiple data feeds into a single index and separate them by sourcetype instead.  No problem; just change the searches above to use sourcetype instead of index.
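As a sketch of those adjustments (the excluded index names below are placeholders, and the 50 is an arbitrary example threshold for a reduction of more than 50%), the final search might become:

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* index!=scratch index!=lab by index, _time span=1d | timechart useother=false limit=0 span=1d count by index | eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename column AS DataSource, "row 1" AS TwoDaysAgo, "row 2" AS Yesterday | eval PercentageDiff=((Yesterday/TwoDaysAgo)*100) | where PercentageDiff<50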

Conclusion

The final step is to click the "Save As" button and select "Alert".  The alert can be scheduled to run daily and trigger when the number of results is greater than 0.  There may be a better way to monitor for data feed loss, and we would love to hear it!  There is most likely a way to use the _internal logs, since Splunk logs information about itself.  😉  If you have that solution, please feel free to share it in the comments section.  As you know, with Splunk there is always more than one way to solve a problem.
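If you want a starting point for the _internal approach, one possibility (just a sketch; it assumes the default metrics.log per_index_thruput data with its series and kb fields, and it comes with its own caveats, so verify it in your environment) is to compare indexed kilobytes per index per day:

index=_internal source=*metrics.log* group=per_index_thruput earliest=-2d@d latest=-0d@d | timechart useother=false limit=0 span=1d sum(kb) by series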
