A technology that can process billions of logs per hour with natural language processing (NLP) and automatically find exceedingly rare, operationally relevant messages, without noise. Sounds too good to be true, right? That is exactly what I thought when our engineering team described the new “Zero Day Log” ML capability they were building.
My concern was not the ability to process vast volumes of logs at scale. Augtera has been doing that for years. Rather, I did not think it was possible to accurately find true “unknown unknowns” that are operationally relevant without producing 10x false positives. Companies have been trying to do this for years. In addition, I failed to appreciate how valuable it is to an organization to find the “unknown unknowns”.
The saying “the proof is in the pudding” applies here (side note: the saying has nothing to do with the creamy dessert; it refers to a meat dish for which it was apparently difficult to tell whether it had been properly cooked). Fast forward about 9 months. Our Zero Day Log capability is in production with our customers, and the results have been eye-opening.
When we first enabled the feature across our customer base, I expected to see dozens of Zero Day Log anomalies per customer per day after the typical learning period. Instead, it was quiet. Over the course of a week, we would typically get just a couple of anomalies per customer. What I did not expect was that when a device did begin to have issues, we would often get several anomalies in succession, or how accurate they would be at both predicting a future failure and finding the root cause. Many of the findings I investigated with our customers led to a software bug they were unaware they were hitting, or pinpointed an issue they had been troubleshooting without having determined the root cause.
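To make the idea of a "rare message after a learning period" concrete, here is a minimal sketch. It is not Augtera's method, which this post describes only as NLP-based ML; it swaps in a much simpler technique, counting masked message templates. The masking rules, the `RareTemplateDetector` class, and its method names are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical masking rules: replace variable fields so that messages
# differing only in their values collapse to the same template.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def template_of(message: str) -> str:
    """Reduce a raw log message to a template by masking variable fields."""
    for pattern, token in _MASKS:
        message = pattern.sub(token, message)
    return message


class RareTemplateDetector:
    """Flags a message whose template has never been seen before."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def learn(self, messages) -> None:
        """Learning period: count templates without raising anomalies."""
        for message in messages:
            self.counts[template_of(message)] += 1

    def observe(self, message: str) -> bool:
        """Return True if this template is new, then record it."""
        template = template_of(message)
        is_new = self.counts[template] == 0
        self.counts[template] += 1
        return is_new
```

In this simplified model, a template is flagged only the first time it appears; once recorded, later occurrences are no longer rare, so follow-up detection has to come from a separate mechanism.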
A recent example from a customer highlights the predictive nature of the capability. What is not shown on this screen is that these were the only zero-day log anomalies on this day. As a data point, Augtera collects ~6M syslog messages per day from this customer. On 12/6 a succession of log anomalies was generated. After 12/6 the same type of logs sporadically continued to be generated (no longer rare). In this case the switch was not broken and continued to function. However, on 12/13 the OS melted down, and over the course of an hour various applications began to fail (AAA, BGP, PIM, etc.); eventually the switch rebooted, forcing an unplanned outage. See the picture below for the details.
Customer Learnings:
- Augtera zero-day log anomalies are now sent to a dedicated Slack channel for the SRE team to investigate and work on
- New log classifiers are built to detect future messages with a pattern similar to the rare log messages (see next paragraph)
- A support case was opened with the vendor; an OS bug was identified; a new OS was identified, and an upgrade plan was started
Zero Day Log finds the “unknown unknowns”, but what happens the next time that message is seen? It is technically not “rare” anymore. Augtera log classification solves this problem. It is another LogAI feature and it enables workflows for when “known unknowns” are detected. The workflow could be as simple as logging it as an event, sending a notification, or running a set of commands to automatically remediate it. More on this capability will be in a future blog post!
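As a rough illustration of such a workflow, the sketch below pairs a pattern with an action, so that a message that was once a rare finding triggers a defined response the next time it appears. This is not the Augtera classifier API; `LogClassifier`, `record_event`, `notify`, `classify`, and the sample message in the usage note are hypothetical, and a plain regular expression stands in for whatever matching the product actually uses.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LogClassifier:
    """Hypothetical classifier: a named pattern plus the action to run on a match."""
    name: str
    pattern: re.Pattern
    action: Callable[[str, str], str]


def record_event(name: str, message: str) -> str:
    # Simplest workflow: just log the match as an event.
    return f"event:{name}"


def notify(name: str, message: str) -> str:
    # Stand-in for sending a notification (for example, to a chat channel).
    return f"notify:{name}"


def classify(message: str, classifiers: List[LogClassifier]) -> List[str]:
    """Run every classifier whose pattern matches and collect the action results."""
    results = []
    for classifier in classifiers:
        if classifier.pattern.search(message):
            results.append(classifier.action(classifier.name, message))
    return results
```

For example, a classifier built as `LogClassifier("heap-pressure", re.compile(r"heap usage critical"), notify)` would return `["notify:heap-pressure"]` for an invented message such as `"svc_mgr: heap usage critical for aaa"`. An automated remediation step would be one more action function with the same signature.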
To schedule a 30-minute discussion with an engineer on how Augtera Network AI can help with your network challenges, please contact us. Thanks for reading our blog.