Orange has integrated the Augtera Network AI platform into its NOC tools to leverage AI/ML in daily network operations. This will reduce Network Operations Center alarms by 70% and prevent failures.

What are the Top Use Cases for AIOps in Networking?

I had the privilege of sharing my thoughts and perspectives alongside other industry leaders from VMware by Broadcom, IBM, and Nokia in the 2024 AI in Networking Report – Pipe Dreams and AI Realities, edited by AvidThink: https://nextgeninfra.io/2024-ai-networking/

This is a continuation of the blog series that transcribes my thoughts expressed in that report. Please read part 1 here.

Our solutions cut across all segments of the infrastructure: hybrid cloud, data center, SD-WAN, service provider backbone, 5G environments, campus, and so on. They range from observability (and I want to draw a distinction between observability, which is really more sophisticated analytics, and plain visibility) to AIOps. So, we have a lot of use cases around prevention.

For example, one very common and popular use case, and a difficult one, is our ability to take packet and flow data, find TCP retransmits in that data, and do application-aware AIOps for the infrastructure. We can find application misbehavior using TCP retransmits and correlate it with congestion on specific links in the network, or drive mean time to innocence. Network operators can then very quickly tell the server and application people, “The network is not to blame; it’s running clean. The problem is somewhere else.” We do that by bringing many things together: TCP data, queue data, link data, all of it.
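As a rough illustration of this kind of correlation (a sketch, not Augtera's actual implementation; the record shapes, thresholds, and names below are invented for the example), the essence of mean time to innocence is a join: flows with excess retransmits are checked against congestion on the links their path traverses, in the same time window.

```python
def correlate_retransmits(retransmits, congestion, flow_paths,
                          retx_threshold=10, depth_threshold=80):
    """Join per-flow TCP retransmit counts with per-link congestion samples
    that fall in the same time window.

    retransmits: iterable of (window, flow, retransmit_count)
    congestion:  iterable of (window, link, queue_depth)
    flow_paths:  dict mapping flow -> list of links the flow traverses

    Returns (blamed, innocent): flows whose retransmits line up with a
    congested link on their path, and flows whose path was clean, so the
    network can be declared innocent for them.
    """
    congested = {(w, link) for (w, link, depth) in congestion
                 if depth >= depth_threshold}
    blamed, innocent = set(), set()
    for (w, flow, count) in retransmits:
        if count < retx_threshold:
            continue  # no application misbehavior in this window
        if any((w, link) in congested for link in flow_paths.get(flow, ())):
            blamed.add(flow)    # congestion on the flow's path: network issue
        else:
            innocent.add(flow)  # retransmits but a clean path: look elsewhere
    return blamed, innocent
```

In production this join would be fed by streaming packet/flow telemetry and queue-depth data rather than tuples, but the correlation logic is the core of the "the network is running clean" verdict.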

Another common use case, which is getting more and more traction now, especially with LLM infrastructure being deployed in the data center, is optical misbehavior. There is increasing use of optics on servers, and as optics misbehave, we are often able to detect these signals days in advance, allowing operators to conduct maintenance before optics break and start impacting applications. 
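A minimal sketch of how a "days in advance" warning can work. The assumption here (mine, not a description of Augtera's models) is a slowly degrading optical receive power that a least-squares fit can project forward to the day it crosses an alarm threshold:

```python
def days_until_threshold(samples, threshold_dbm):
    """Fit a least-squares line to (day, rx_power_dbm) samples and project
    when the trend will cross the alarm threshold.

    Returns the number of days from the last sample until the projected
    crossing, or None if the optic is stable or improving.
    """
    if len(samples) < 2:
        return None
    xs = [d for d, _ in samples]
    ys = [p for _, p in samples]
    n = len(samples)
    mx, my = sum(xs) / n, sum(ys) / n
    denom = sum((x - mx) ** 2 for x in xs)
    if denom == 0:
        return None
    slope = sum((x - mx) * (y - my) for x, y in samples) / denom
    if slope >= 0:
        return None  # power flat or recovering: no early warning needed
    intercept = my - slope * mx
    cross_day = (threshold_dbm - intercept) / slope
    return max(0.0, cross_day - xs[-1])
```

Real detection would also have to handle noise, step changes, and seasonal effects; the point of the sketch is only that a trending metric lets maintenance happen before the optic actually breaks.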

Another use case is post-change verification, especially subtle changes. You know, you apply a BGP change that starts to draw in extra traffic, results in traffic bursts and congestion in the network, and maybe has a completely unforeseen impact on a firewall that has not been updated. We can correlate all of this automatically—literally hundreds of different events and anomalies—and create a single ticket with the root cause being the config change. 

Another example in the LLM infrastructure is correlating job completion and iteration time with GPU utilization, server NIC flaps, and so on. For example, if a particular job is taking a long time on a GPU, is the GPU underutilized? If yes, is that because of NIC flaps, or is there congestion in the network that is resulting in long tail latency between two specific GPUs? We can bring all of that together with our GenAI Ethernet cluster solution.
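The chain of questions in that paragraph is essentially a decision procedure. A toy version (the thresholds are invented, and a real system would correlate time-aligned telemetry streams rather than single numbers):

```python
def triage_slow_job(iter_time_s, baseline_s, gpu_util,
                    nic_flaps, p99_latency_us, latency_baseline_us):
    """Walk the reasoning from the text: slow iterations with an
    underutilized GPU point at the fabric; then distinguish NIC flaps
    from long-tail network latency."""
    if iter_time_s <= 1.2 * baseline_s:
        return "healthy"             # iteration time within normal range
    if gpu_util >= 0.8:
        return "compute-bound"       # GPU is busy: not a network problem
    if nic_flaps > 0:
        return "nic-flaps"           # link instability on the server NIC
    if p99_latency_us > 2 * latency_baseline_us:
        return "network-tail-latency"  # congestion between specific GPUs
    return "unknown"
```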

These examples I’m giving are production examples, deployed and proven at scale. We also have synthetics as part of our solution, where we can deploy our agent to generate probes to other agents across servers. We’ve had customers where we are able to do mean time to innocence across OVM infrastructure and switches. For example, in one large environment, we were able to pinpoint that a specific hypervisor upgrade on one vendor is what resulted in the application latency, and we avoided finger-pointing between the switch team and the server team. That was done by using a combination of our synthetic probes along with network telemetry.
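One way to sketch that synthetic-probe reasoning (data shapes and thresholds invented here for illustration): if every slow probe pair shares a single endpoint, suspect that host, for example an upgraded hypervisor; if slowness spreads across pairs with no common endpoint, suspect the fabric instead.

```python
def localize(rtt_ms, baseline_ms=1.0, factor=3.0):
    """Given full-mesh probe RTTs keyed by (src_host, dst_host), find hosts
    that appear in every slow pair.

    Returns ("host", [hosts]) when one common endpoint explains all slow
    pairs, ("network", []) when none does, or None if nothing is slow.
    """
    slow = [pair for pair, rtt in rtt_ms.items()
            if rtt > factor * baseline_ms]
    if not slow:
        return None
    common = set(slow[0])
    for pair in slow[1:]:
        common &= set(pair)  # intersect endpoints across all slow pairs
    return ("host", sorted(common)) if common else ("network", [])
```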

So, I’m here to tell you that network AIOps is real, it’s in production, and it’s been done at scale. Customers like Orange have come out publicly talking about the significant ROI they’re seeing. And I’m also here to tell you that LLMs are a very small piece of a very large puzzle. The algorithms we have created run unsupervised on logs. We apply natural language processing to find rare logs. For example, you might have a memory corruption log, one out of 100 million logs. We find that using NLP and unsupervised machine learning, not using LLMs. We do similar things on optical data and TCP retransmit data, where we learn from the pattern of the metrics. LLMs have a role to play; they can be very useful for taking in data in the public domain and using it to recommend remediation. They can be useful for human-language query interfaces, and those are capabilities we’re working on. But just think of them as one piece of a very large puzzle of technologies and capabilities you need, which we have been working on long before people talked about LLMs.
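A toy version of rare-log detection with unsupervised templating and no LLM anywhere (the masking rules and rarity cutoff are my invention for the sketch, not Augtera's NLP pipeline): collapse each log line to a template, count template frequencies, and surface lines whose template almost never occurs.

```python
import re
from collections import Counter

def template(line):
    """Crude log templating: mask hex values and numbers so variants of the
    same message collapse into one template."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line

def rare_logs(lines, max_fraction=0.001):
    """Unsupervised rarity: return lines whose template accounts for at most
    `max_fraction` of the stream (e.g. one memory-corruption log buried in
    millions of routine messages)."""
    counts = Counter(template(l) for l in lines)
    total = len(lines)
    return [l for l in lines if counts[template(l)] / total <= max_fraction]
```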

So, I would leave you with one final thought: AIOps in the network is real. According to Gartner, 10% of enterprises were using it in 2023, and 90% are projected to use it over the next few years. We are seeing that significant ramp in interest, and in the very late-stage conversations we are having with many enterprises and providers. So, really, if you’re wondering whether it is mature: it is. And we would love to talk to you about how we can help you transform your operations using network AIOps, and then apply it more broadly to your infrastructure, going all the way to the servers and the applications.

Click this link to learn more about Augtera Networks’ Network AI.