What if you could apply AI and machine learning to prevent failures in your network and IT infrastructure operations, eliminate noise, and dramatically reduce the time to root cause and remediate failures? What if I told you that you could do that today, at scale, with software that is deployed in production by very, very large-scale enterprises and providers? That’s what Augtera does.
I had the privilege of sharing my thoughts and perspectives alongside other industry leaders from VMware by Broadcom, IBM, and Nokia in the 2024 AI in Networking Report – Pipe Dreams and AI Realities, edited by AvidThink: https://nextgeninfra.io/2024-ai-networking/
This is part one of a two-part blog series that transcribes my thoughts expressed in that report.
At Augtera, we transform network operations by applying AI and machine learning to operational data, starting with the network and then extending to servers and infrastructure, in the data center, in service provider networks, and in large-scale LLM infrastructure.
If you look at the last several decades of network operations, they have been riddled with noise. More often than not, network operators learn of issues only after things break and application owners complain. There is more and more data in the cloud, in the data center, in the WAN, and in the 5G RAN, and there are more and more layers in this infrastructure. At the same time, operators facing this massive, massive stack of data are left with siloed tools to mine it.
That’s where Augtera saw an opportunity several years back: to take this data and apply real-time AI and machine learning algorithms to surface insights. This was long before the current LLM wave in the industry. We apply largely unsupervised algorithms to find anomalies and misbehaviors and to predict issues. We then apply unsupervised algorithms again to learn the topology, discover the network model and its connectivity, and automatically correlate events and anomalies, which eliminates noise and cuts down tickets.
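To make the two steps above concrete, here is a minimal sketch in Python. It is an illustration only, not Augtera's algorithms: it assumes a simple robust z-score (median and MAD over a trailing window) for label-free anomaly detection, and it groups anomalous devices into incidents when they are connected in the topology. The function names, the window size, the threshold, and the device names are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def mad_anomalies(values, window=20, threshold=6.0, min_scale=1e-9):
    """Return indices that deviate strongly from the trailing window.

    Robust z-score: |x - median| / (1.4826 * MAD), computed only from the
    previous `window` samples, so no labels or training set are required.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        med = median(history)
        mad = median(abs(v - med) for v in history)
        scale = max(1.4826 * mad, min_scale)  # guard against a flat series
        if abs(values[i] - med) / scale > threshold:
            flagged.append(i)
    return flagged


def correlate(anomalous_devices, links):
    """Group anomalous devices into incidents using topology adjacency.

    `links` is an iterable of (device_a, device_b) pairs. Anomalous devices
    reachable from one another through anomalous neighbors form one incident.
    """
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    remaining = set(anomalous_devices)
    incidents = []
    while remaining:
        start = remaining.pop()
        group, stack = {start}, [start]
        while stack:
            node = stack.pop()
            # The intersection builds a new set, so mutating `remaining` is safe.
            for peer in neighbors[node] & remaining:
                remaining.discard(peer)
                group.add(peer)
                stack.append(peer)
        incidents.append(sorted(group))
    return sorted(incidents)


# Hypothetical data: a steady error counter with one spike, and a tiny fabric.
series = [10.0, 11.0] * 10 + [50.0, 10.0]
links = [("leaf1", "spine1"), ("leaf2", "spine1"), ("leaf3", "spine2")]
incidents = correlate(["leaf1", "spine1", "leaf3"], links)
```

In this toy example, three anomalous devices collapse into two incidents, which is the noise-reduction effect in miniature: one ticket per connected group rather than one per alarm.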
Our customers have seen this in the data center, on the service provider side, and on the government side, at very large scale: 15,000 switches and routers in one data center case, and several thousand service provider edge and P routers in the case of Orange, which has issued a public press release with us. We’ve been able to cut tickets by 70 to 90%, cut the time to detect issues from 47 minutes to 1 minute, and cut the overall time to remediate and mitigate by 60% or more in some cases.
And all this is done by a software-only platform. It can be deployed fully on-premises, it can be air-gapped, or it can be consumed as SaaS, depending on your comfort level. It scales horizontally, with a microservice architecture that takes data from all parts of your infrastructure (switches, routers, servers, GPUs, firewalls, load balancers) and can also connect to your data lakes, for example, Prometheus and Splunk.
We have very large-scale implementations built on telemetry that has existed in networks for decades, such as SNMP and syslog, and the platform is also designed for big data and streaming telemetry with OpenConfig and gRPC/gNMI. We are agentless; we rely on the existing network and IT infrastructure and integrate with it. At the same time, we do have an agent that adds value on Linux-based systems, for example SONiC, servers, or VMs. We can take in data as JSON logs or metrics, through Kafka, and via Redfish from the servers, so there are many, many ways to integrate data.
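As a rough illustration of what bringing such different sources together can involve, the sketch below maps a syslog line and a JSON metric onto one common record. It is an assumption-laden example, not the platform's ingestion code: the `Event` type, the function names, and the JSON field names (`device`, `metric`, `value`) are hypothetical. The only protocol fact it relies on is that the syslog `<PRI>` value encodes facility * 8 + severity.

```python
import json
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str   # which pipeline produced the record: "syslog" or "json"
    device: str
    name: str
    value: float


# Syslog messages start with a priority header such as "<187>".
_PRI = re.compile(r"^<(\d{1,3})>")


def from_syslog(device, line):
    """Turn a syslog line into an Event carrying its severity (0 to 7)."""
    match = _PRI.match(line)
    if match is None:
        raise ValueError("missing <PRI> header")
    severity = int(match.group(1)) % 8  # PRI = facility * 8 + severity
    return Event("syslog", device, "severity", float(severity))


def from_json_metric(payload):
    """Turn a JSON metric document (hypothetical schema) into an Event."""
    doc = json.loads(payload)
    return Event("json", doc["device"], doc["metric"], float(doc["value"]))


log_event = from_syslog(
    "leaf1",
    "<187>Oct  3 12:00:00 leaf1 %LINK-3-UPDOWN: Interface Ethernet1, changed state to down",
)
metric_event = from_json_metric('{"device": "leaf1", "metric": "ifInErrors", "value": 3}')
```

Once every source lands in the same shape, the downstream detection and correlation logic does not need to know whether a data point arrived over syslog, Kafka, or a metrics API.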
Now, all of this comes into our AI and machine learning engine, which applies many purpose-built algorithms. I should pause here. When we started Augtera, we looked at open source. We obtained production data, which took us a while, and we tried applying open-source algorithms to it; the result was really nothing but noise. So we went down the path of building our own algorithms, and that took us a long time. It was a hard road, but it led to results with much higher fidelity and much more actionability.
So, what we have is many algorithms for finding misbehaviors across literally hundreds of metrics in the TCP layer, in the optical layer, and in the HTTP layer, correlating all of that by learning the topology and the model of the network, and then driving actions by creating either tickets or workflows. Some of our customers are beginning to remediate issues automatically. For example, if there are CRC errors, optical issues, or link flaps, you can shut down the interface automatically. In large environments with multipath, that’s better than having traffic transit a port or an interface that is actually seeing issues.
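The remediation decision described above can be sketched as a small policy check. This is a hypothetical example rather than any customer's actual workflow: the `InterfaceHealth` fields, the `should_shut` function, and every threshold are assumptions chosen for illustration. The point it captures is that an interface is only shut when it is degraded and the multipath fabric still offers a healthy alternative.

```python
from dataclasses import dataclass


@dataclass
class InterfaceHealth:
    name: str
    crc_errors_per_min: float
    flaps_last_hour: int
    healthy_parallel_paths: int  # other healthy paths to the same destination


def should_shut(iface, crc_limit=100.0, flap_limit=5, min_paths=1):
    """Decide whether to shut an interface automatically.

    All limits are illustrative defaults, not recommended values.
    """
    degraded = (
        iface.crc_errors_per_min > crc_limit
        or iface.flaps_last_hour > flap_limit
    )
    has_alternative = iface.healthy_parallel_paths >= min_paths
    return degraded and has_alternative


noisy_with_backup = InterfaceHealth("Ethernet1", 450.0, 0, 3)
noisy_last_path = InterfaceHealth("Ethernet2", 450.0, 0, 0)
clean = InterfaceHealth("Ethernet3", 0.0, 0, 3)
decisions = [should_shut(i) for i in (noisy_with_backup, noisy_last_path, clean)]
```

Here only the first interface would be shut: the second is degraded but is the last remaining path, so a sensible policy would raise a ticket instead of removing capacity.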
Part 2 coming soon:
What are the top use cases for AIOps in Networking?