Orange integrates the Augtera Network AI platform into its NOC tools to leverage AI/ML in daily network operations. This will reduce Network Operations Center alarms by 70% and prevent failures.

Autonomous Networks & Network AI

There is likely a long journey ahead to fully autonomous networks, as Augtera Networks CEO Rahul Aggarwal recently commented in a Packet Pushers Tech Bytes podcast.

However, the focus and results of Network AI on automating network operations have implications for autonomous networks.

Autonomous Networks and Performance Expectations

Arguably, no idea is more embedded in the Internet psyche than the end-to-end principle: the idea that characteristics such as reliability and security should be guaranteed by the communicating end nodes. Yet, when the application team comes knocking on the network operations door, they are not there to discuss the end-to-end principle or their own responsibility in guaranteeing great application experiences. They are there to discuss the network’s responsibility. Such is life.

Performance expectations may well be the issue that ultimately challenges the end-to-end principle. If all anyone cared about was availability, with a few delays here and there being ok, then having many (real) path options is probably good enough. However, in a world where performance does matter, dense topologies, as we see in Data Centers today, create their own challenges. For example, which one of the many paths between A and B is experiencing intermittent optical errors, fabric congestion, flapping, or anomalies?

In a world where performance and great customer / application experience matter, as they do in a cloud service and digital transformation world, autonomous networks also become that much harder to achieve. Any IP network with large (real) path optionality is practically autonomous by default, without having to do much at all [1]. However, throw in a few latency and packet loss constraints, and all of a sudden autonomous networks take on a different flavor.

Automating Network Operations & Autonomous Networks

At Augtera Networks, we are focused first on Automating Network Operations: the chain of events from data ingestion, through anomaly detection, noise elimination, operations policy, mitigation, and remediation if possible, to the automated creation of a trouble ticket that often allocates a skilled human resource to determine if a cable, transceiver, switch/router, or other piece of physical hardware must be replaced. However, while the outcomes are network operations centric (faster response, fewer incidents, improved KPIs), how these play into long-term network automation should be clear as well.

Above, when I said an IP network is essentially autonomous by default, I was saying that in a world where finding an available path in the topology is sufficient, the control plane is sufficient [2]. However, if the goal is to find a path in the topology with desired performance, the control plane may not be sufficient. All that other stuff outside the control plane suddenly becomes important. For example, is a control plane going to analyze streaming flow data to determine which applications might be experiencing performance issues as a result of TCP anomalies caused by fabric congestion? If not, then there is a universe of network operations data that is going to play an important role in Autonomous Networks.

High-Fidelity Signals Are Also Important for Autonomous Networks

There are numerous ways in which latency and packet loss could be measured and integrated into the control plane. However, the reality today is that networking teams are showing significant interest in agents that take latency and packet loss measurements and report them to a management/operations system. That is just the foundational step.

For example, say an operator deploys agents and collects data. Now what? How is the operator going to determine if the latency is ok? A common latency threshold across all links does not really work, even though one might be acceptable for packet loss. Links and paths vary in distance, hops, and other variables. There needs to be a high-fidelity way of determining that there is, or is not, a latency issue. High fidelity includes eliminating false positives.
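To make the threshold point concrete, here is a minimal sketch in Python. It assumes each path is judged against its own history, using a median plus a multiple of the median absolute deviation. The path names, the sample values, and the `k` and `floor_ms` settings are all illustrative assumptions, and this is not a description of how any particular product decides.

```python
from statistics import median

def latency_anomalies(history, latest, k=5.0, floor_ms=0.1):
    """Flag paths whose latest latency departs from that path's own baseline.

    history: dict of path name -> list of past latency samples (ms)
    latest:  dict of path name -> most recent latency sample (ms)
    k and floor_ms are illustrative tuning values, not recommendations.
    """
    flagged = {}
    for path, samples in history.items():
        base = median(samples)
        # Median absolute deviation: a spread estimate that resists outliers.
        mad = median(abs(s - base) for s in samples)
        # floor_ms keeps the limit sane when a path's history is almost flat.
        limit = base + k * max(mad, floor_ms)
        if latest[path] > limit:
            flagged[path] = (latest[path], round(limit, 2))
    return flagged

# Hypothetical measurements for two very different paths.
history = {
    "rack-a->rack-b": [0.20, 0.22, 0.21, 0.19, 0.20],  # same data hall
    "nyc->lon":       [70.1, 69.8, 70.4, 70.0, 70.2],  # transatlantic
}
latest = {"rack-a->rack-b": 5.0, "nyc->lon": 70.5}

print(latency_anomalies(history, latest))
```

With these made-up numbers, only the in-rack path is flagged. A single network-wide threshold of, say, 10 ms would have done the opposite: missed the 5 ms in-rack sample and alarmed permanently on the healthy transatlantic path.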

High-fidelity signals are incredibly important to network operations teams. However, they will arguably be even more important to autonomous networks. To operations teams, high fidelity means less alert fatigue, more confidence that action is needed, and the ability to focus limited resources on “real” incidents.

However, consider autonomous networks. Imagine a network decision function receiving low-fidelity signals from outside the control plane, and potentially multiple low-fidelity signals from different data sources, analysis systems, and tools. Dante’s Inferno within a Divine Comedy! Perhaps that is why auto-mitigation and auto-remediation are still in their early stages in many networks. Receiving high-fidelity signals is job one for all future network automation.

Prevention of Future Incidents

If high-fidelity signals are job one, then job two is preventing future incidents. Tools vendors can help network operations teams respond faster, but over the long run, that may just mean running faster on the same hamster wheel. What network operations teams need is a different hamster wheel, one that reduces the total number of incidents, “today”, and in the future.

If as an industry we get good at responding very fast, to an ever-increasing number of incidents, then I am not sure that is a good scenario for autonomous networks. My intuition is that network operations, for sure, and probably autonomous networks as well, need a path to preventing future incidents. We tend to believe that control processes have all the CPU and Memory they could ever need. Every now and then they surprise us.

Sustainable Outcomes

Lastly, the conditions for creating good application experiences need to be sustainable. A word like sustainable is overloaded with many different inferences, and it is no accident I chose that word. However, in a purely operational sense I refer to the long-term maintenance of good conditions through incident prevention, planning, architecture, and the like.

Conclusion

At Augtera Networks we are achieving high-fidelity signals with classifiers for well-known anomaly signatures, Machine Learning algorithms that reduce false positives, topology-based correlation that is eliminating the noise from related alerts, customer-defined policy that specifies what is operationally relevant to the customer, and more. We are also finding rare log messages that can precede outages, and detecting gray failures before they become significant incidents.

Network AI also moves the ball forward for Autonomous Networks

AI/ML is going to make significant contributions to fidelity, prevention, and sustainability. The first significant beneficiaries may be network operations, as we think of network operations today. The ultimate win though, may be for autonomous networks.

Notes: 1 – Notwithstanding appropriate design/architecture constructs to keep the control plane humming, etc.
2 – Assuming there is sufficient underlying physical path diversity, etc.

Where is the “Network” in AIOps? 

Where is the “network” in AIOps? Nowhere in most AIOps discussions. Case in point: this page summarizing AIOps descriptions. What quickly becomes apparent is that the term emerged from a set of problems in compute/applications software architectures: microservices, cloud compute, etc. These shifts “broke” traditional IT operations, requiring a new approach.

The figure below depicts two columns: common AIOps capabilities, and areas that differentiate AIOps implementations. A great deal of industry discussion focuses on the first column, the new technology approaches. While there can be great differentiation in the first column, what defines the class a product/service is in comes equally from the second column: what is being modeled, who is the consumer, what kinds of anomalies are being detected, domain-specific algorithms, specific data types, and more.

The Network in AIOps is defined more by the second column than the first column - the actual pain points being addressed.

Figure 1. Two ways to think about AIOps, the technology and the implementation focus.

If you compare ITOps AIOps solutions to Augtera’s Network AIOps solutions through the lens of the first column, it may seem like there are a number of similarities, especially without an in-depth technology discussion. If you compare the average ITOps AIOps solution to Network AIOps through the lens of the second column, significant differences appear immediately.

It may be practical to label product categories by a central technological approach, for example AI/ML. However, operations teams do not buy technology; they buy answers to their pain points. As a result, an Application & Compute AIOps product/service is very different from a Network AIOps product/service.

Up until recently, there has been no “network” in AIOps. In designing, developing, and deploying the first AI purpose built for networking, Augtera Networks has changed that.

Where is the “network” in AIOps? To provide just a few examples, ask the following questions:

  • Which AIOps solutions model all layers of the network from the physical layer to the TCP layer and above? 

  • Which AIOps solutions detect optical signal degradation in time to replace the optics or automatically move traffic somewhere else before a failure occurs? 

  • Which AIOps solutions understand BGP state machines and how that impacts anomaly detection and incident mitigation? 

  • Which AIOps solutions assess the impact of data center fabric congestion on application experience? 

Many similar use-case centric questions could be asked. The point is clear. Networking is different, and it requires a purpose-built approach.

Where AI Shines in Networking

A child takes two examples of a plane to get it. A machine, perhaps 10,000. There have also been public failures in applying AI, from a chatbot called Tay to potentially deadly recommendations for treating cancer. What exactly is AI good for? Quite a bit, it turns out, and specifically for network operations, where AI is now shining.

AI shines in networking with purpose-built algorithms, models, and constructs

When AI Does Not Shine

Whether the things machines do are artificial “human” intelligence is a debate on its own. Machines do amazing things humans cannot do. However, it is when they try to do things humans do well that they can sometimes have unpredictable and tragic results – especially when there is significant uncertainty.

Consider healthcare. What modern medicine knows about the human body is remarkable. At the same time, what is not known remains extensive. There is enormous uncertainty in many aspects of diagnosis and treatment. As a father of children with the statistical probability of a genetic disease that arises from any one of thousands of different gene variations / combinations, I have spent many decades contemplating the non-binary nature of some aspects of medicine. 

When IBM’s Watson, which had been custom-built to win a quiz show, took on cancer treatment recommendations, rather than the narrower credibility demonstrations recommended by medical experts working for IBM, the results were sadly predictable, by humans at least, if not AI. When Microsoft’s Tay tripped over social norms, without knowing that some rules in society are inviolate and others are relative to who the speaker is*, the results were disastrous as well. (* Machines do not yet know who they are.)

Other AI fails have been in areas such as facial recognition showing racial bias, investing, AI-driven devices being out of control and a hazard to those around them, and more. 

All the above give reason for those working in AI to stay humble about what AI can and cannot do. Start with narrow cases, and look to where AI does well. Some of this is part of the learning process and some of it is the nature of the problem.

Automating Rules 

The implementation and execution of rules may once have been considered AI; today, though, not so much. However, they are extremely important. Rules represent learned knowledge. The manifestation of that knowledge is software code executing what a skilled practitioner has learned through experience. Care must be taken to make rules easy to define and apply in real-time, and there may be other challenges as well. However, rules can provide high-fidelity signals.

AI Shines in Understanding Network Patterns 

Augtera has developed 9+ purpose-built algorithms, because not all networks are the same, and not all patterns within a single network are the same either. Some of the good results Augtera is getting today have come from past failures.

For example, when Augtera first started detecting the change in rate of Syslog messages, it led to very noisy anomaly detection. Change is not necessarily the same as operationally relevant. Systems that simply alarm on something being “abnormal” can generate noise. Following that failure, Augtera developed an algorithm that understands more than a simple variation; it understands message “bursts”. Noise has now been reduced considerably from previous approaches.
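The difference between alarming on any change and alarming on a burst can be illustrated with a small Python sketch. This is not Augtera's algorithm, which is not described here. It is a simplified stand-in that assumes a burst means the per-interval message count stays well above the window's median for several consecutive intervals. The counts and the `ratio`, `factor`, and `min_len` parameters are made up for illustration.

```python
from statistics import median

def naive_change_alerts(counts, ratio=2.0):
    """Alert whenever an interval's count at least doubles versus the previous one."""
    return [i for i in range(1, len(counts))
            if counts[i] >= ratio * max(counts[i - 1], 1)]

def burst_alerts(counts, factor=5.0, min_len=3):
    """Return (start, end) index pairs where counts stay above factor x median
    for at least min_len consecutive intervals."""
    baseline = max(median(counts), 1)
    bursts, start = [], None
    for i, c in enumerate(counts + [0]):  # trailing 0 closes a run at the end
        if c > factor * baseline:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                bursts.append((start, i - 1))
            start = None
    return bursts

# Hypothetical Syslog message counts per minute from one device.
per_minute = [4, 9, 3, 8, 4, 10, 5, 60, 75, 80, 66, 5, 11, 4]

print(naive_change_alerts(per_minute))  # [1, 3, 5, 7, 12]
print(burst_alerts(per_minute))         # [(7, 10)]
```

On this sample the change-based check fires five times, four of them on ordinary jitter, while the burst check reports the one sustained event. A real system would also need a baseline that is not contaminated by the burst itself, which this sketch ignores.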

Areas that have worked well include combining specific patterns, an extensive multi-layer, topology-aware model of the network, and the network experience of both Augtera and its customers. Optical degradation preceding future failure, link anomalies, BGP anomalies, traffic anomalies, semantically new syslog messages, and fabric congestion impact on application performance are just some of the areas where Augtera’s Network AI is producing high-fidelity results.  

AI Shines in Auto-Mitigation and Remediation 

The future will no doubt see significant innovation in auto-mitigation and remediation. Today, network operations teams are taking baby steps, developing trust in automation. However, even those baby steps are extremely valuable. There is a set of anomalies which have the same mundane resolution. Experienced network operations people know what the problems look like, and what the time-proven response is. Responses to these anomalies can be automated, relieving operations teams from having to do the same thing, manually. AI, and specifically Network AIOps will shine in this area of networking. This is far from the complex decision making that many might assume is the realm of AI. However, compelling value is realized, and it is a start down the road to the future. 

Conclusion 

The history of AI humbles us all, or at least it should. At Augtera, we are not spending hundreds of millions of dollars to build the ultimate brain that makes all decisions in a network. We are instead taking on narrow problems where machine learning, natural language processing, correlation, and other AI/ML techniques can make a compelling contribution and where AI can shine in networking: identifying the root source of an incident, detecting degrading performance in optics, reducing noise by using algorithmic detection of anomalies rather than thresholds, and where appropriate, integrating customer / industry experience with classifiers and policy. The road ahead is long with many unpredictable and remarkable innovations to come. Even if AI in networking has not yet achieved that lofty, and today overly ambitious, goal of a single brain making all decisions in the network, it is still producing game-changing results: cutting incident queues by a factor of 10, increasing the time between incidents by a factor of 4, reducing mean-time-to-mitigation by over 50%, and more. Staying humble and staying narrow is producing compelling results, today.

Structure in Unstructured Logs

There is growing talk about the need for structured logs. Proponents promote benefits such as ease of querying. Today, there are many sources of unstructured data that are a wealth of valuable information to network operations teams. In this blog we discuss some of the ways Augtera’s Network AI finds and uses structure, in unstructured logs. 

Unstructured Logging Challenges 

In the canonical programming meme, the first thing programmers do is output an unstructured message “Hello World”. As their journey continues, they often lean on logging, whether to a console/terminal window or a logging subsystem. Sometimes for debugging, sometimes to record an error, and other times as an audit trail. While there is debate about when and how logging should be used, the reality is, it is used often by programmers across all areas of IT. 

The benefit of the humble “print” command is that it creates human-readable messages, and this is the ultimate power – humans can read the message and often quickly get a sense of what is going on, or at least where to start exploring what is going on. However, machines, not commonly having the same language skills as humans, find unstructured logs more difficult to parse. Many believe that unstructured logs present so many problems that there needs to be a shift to structured logging. The argument is that structured logging makes querying, indexing, and integration into standard visualization tools easier. A quick Google Search on “structured logs” delivers many results: vendors talking about their structured logging capabilities, Google itself providing JSON structured logging in Cloud Logging, structured logging in Kubernetes, and more.

Rare log messages and Metric extraction are two examples of finding structure in unstructured logs

While there are many benefits of structured data, unstructured logging is pervasive today. In networking, Syslog is a common example. Equipment software often writes to Syslog. Messages for similar events differ from equipment supplier to equipment supplier, and even from programmer to programmer within the same equipment supplier, thus complicating Syslog analysis by machines.

Extracting Metrics from Unstructured Logs 

Augtera Network AI comes with a sophisticated metric analysis and visualization capability. At the heart of our metric analysis are purpose-built networking algorithms that differentiate our anomaly detection from threshold-only systems. We detect complete failures, and gray failures, without the false positives & false negatives of other approaches. As a result, significant value can be realized by customers when metrics embedded in log messages are transformed to metrics and routed to metric processing.

There are numerous solutions on the market that define rules for transforming log messages to metrics, so this blog will not spend much time discussing rules-based approaches. However, Augtera’s Network AI has this capability. 
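For readers who have not seen a rules-based transformation, a minimal Python sketch follows. The message formats, metric names, and the `log_to_metric` helper are hypothetical and invented for illustration; they are not taken from any vendor's Syslog output or from Augtera's product.

```python
import re

# Hypothetical rules: each pairs a metric name with a pattern that captures
# the object the metric belongs to and the numeric value in the message.
RULES = [
    ("optical_rx_power_dbm",
     re.compile(r"Interface (?P<obj>\S+) rx power (?P<value>-?\d+(?:\.\d+)?) dBm")),
    ("fan_speed_rpm",
     re.compile(r"Fan (?P<obj>\S+) speed (?P<value>\d+) rpm")),
]

def log_to_metric(host, message):
    """Return a metric record for the first matching rule, or None."""
    for name, pattern in RULES:
        match = pattern.search(message)
        if match:
            return {
                "metric": name,
                "host": host,
                "object": match.group("obj"),
                "value": float(match.group("value")),
            }
    return None

print(log_to_metric("leaf1", "Interface Ethernet1/1 rx power -7.45 dBm"))
print(log_to_metric("leaf1", "User admin logged in"))  # no rule matches
```

The extracted record can then be handed to whatever metric pipeline is in use. The maintenance cost of this approach is visible even here: every new message format needs its own pattern.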

Natural Language Processing of Unstructured Log Messages 

On April 27th, Augtera announced Zero Day Anomalies for Syslog. The important elements of that announcement were: 

  • Real-Time natural language processing (NLP) for billions of log messages per day, without queuing or dropping messages 
  • The ability to detect rare / not seen before log messages and flag them as potential anomalies, the first time they are seen 

The ability to detect truly new messages is not based on simple text matching, because the same message can vary in nuanced ways. Detection is based on understanding semantics in a message and knowing whether the semantics are new. This capability was developed because network failures are sometimes preceded by Syslog messages that have never been seen before. In addition, many long-standing anomalies simply go unnoticed. 

This approach is essentially creating, or rather revealing, structure that is typically not understood by machines that just see log messages as a sequence of ASCII codes. Sure, it is not the same kind of structure as a standard with well-defined and adhered-to fields, but it is structure that has already led to one compelling capability, detection of rare/new messages, and will have other applications in the future. It is seeing structure without defining and maintaining rules.
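A rough feel for why plain text matching falls short can be had from the Python sketch below. It uses template masking, a far simpler technique than the semantic NLP described above, and is offered only as an assumption-laden illustration: variable tokens such as IP addresses, interface names, and numbers are replaced with placeholders, and a message counts as new only if its masked form has not been seen before. The masking patterns and sample messages are hypothetical.

```python
import re

# Illustrative masks, applied in order. Real logs need many more patterns.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b[A-Za-z]+\d+(?:/\d+)+\b"), "<IF>"),  # e.g. Ethernet1/7
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def template(message):
    """Replace variable tokens so messages that differ only in values collapse."""
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message

class NoveltyDetector:
    """Remembers masked templates and reports whether a message is a first sighting."""

    def __init__(self):
        self.seen = set()

    def is_new(self, message):
        key = template(message)
        if key in self.seen:
            return False
        self.seen.add(key)
        return True

detector = NoveltyDetector()
messages = [
    "BGP neighbor 10.0.0.1 Down after 30 seconds",
    "BGP neighbor 10.0.0.2 Down after 45 seconds",  # same event, new values
    "Parity error detected on Ethernet1/7 asic 2",  # never seen before
]
print([detector.is_new(m) for m in messages])  # [True, False, True]
```

Exact string comparison would have called all three messages new. Masking handles differences in values, but it would still treat two differently worded messages with the same meaning as unrelated, which is the gap a semantic approach is meant to close.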

Conclusion 

There are many approaches to finding structure in unstructured logs. Augtera’s Network AI supports rules-based approaches and NLP-based approaches. The former can have great precision but comes with the overhead of maintenance. The latter requires less maintenance and is yielding new capabilities that rules-based systems do not, for example, the ability to identify rare/never-seen-before messages. In a world where unstructured logs are still pervasive, both approaches are being used by Augtera customers.

Faster, Better, Fewer – Network AI Transformation

Network AI Introduction

Augtera’s Network AI transforms network operations, incident response, and incident management by delivering on three fundamental value propositions: faster resolution of incidents, better insights on incidents, and fewer overall incidents.

SEE More

Whether Network AI is ingesting syslog only, or ingesting all network operations data, it sees anomalies other tools do not.

“50% of anomalies detected by Augtera were not detected by existing tools.”

Arnaud Plouhinec, Head of Automation and Data/IA Program, Orange International

Network AI does not just see anomalies that have already occurred; it sees gray failures such as increasing degradation that will become failures, and also rare/never-before-seen log messages that often precede a future failure.

ELIMINATE Noise

Alert fatigue is causing missed anomalies. Operationally irrelevant incidents have skilled staff chasing the wrong problems. Noisy signals make meaningful automation impossible.

While many talk of eliminating noise, one good proof point is a customer allowing the automated creation of trouble tickets. Trouble tickets allocate skilled resources. As a result, there is zero tolerance for noise being injected into ticketing systems.

In the case study “Fortune 500 Enterprise Transforms Data Center Network Operations” we wrote of a Fortune 500 Enterprise that had developed such trust in Augtera’s solution that it now has Network AI auto-create trouble tickets. Some network operations teams want a slightly better mouse trap. Some operations teams want to transform. The latter automate from ingestion to ticket creation.

Network AI eliminates noisy threshold approaches, suppresses maintenance alarms at an interface granularity, and prevents duplicate incident records. Network AI also gives customers the ability to create policy that defines what they consider operationally relevant, as opposed to what a tool vendor does.
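Two of these ideas, suppressing alarms for interfaces under maintenance and avoiding duplicate incident records, can be sketched in a few lines of Python. The alert fields, the `filter_alerts` function, and the sample data are assumptions made for illustration and do not describe how Network AI is implemented.

```python
def filter_alerts(alerts, maintenance, open_incidents=None):
    """Decide which alerts should open a new incident record.

    alerts:         list of dicts with device, interface, kind, and time keys
    maintenance:    set of (device, interface) pairs currently under maintenance
    open_incidents: dict of already-open incidents, keyed by (device, interface, kind)
    """
    open_incidents = {} if open_incidents is None else open_incidents
    created = []
    for alert in alerts:
        # Suppress at interface granularity: other interfaces on the same
        # device are unaffected by this maintenance window.
        if (alert["device"], alert["interface"]) in maintenance:
            continue
        key = (alert["device"], alert["interface"], alert["kind"])
        if key in open_incidents:
            # Same problem seen again: update the record, do not duplicate it.
            open_incidents[key]["count"] += 1
            continue
        open_incidents[key] = {"first_seen": alert["time"], "count": 1}
        created.append(key)
    return created, open_incidents

# Hypothetical alert stream.
alerts = [
    {"device": "leaf1", "interface": "Eth1/1", "kind": "crc_errors", "time": 100},
    {"device": "leaf1", "interface": "Eth1/1", "kind": "crc_errors", "time": 160},
    {"device": "leaf2", "interface": "Eth1/3", "kind": "link_flap", "time": 170},
    {"device": "leaf1", "interface": "Eth1/2", "kind": "crc_errors", "time": 180},
]
maintenance = {("leaf2", "Eth1/3")}

created, incidents = filter_alerts(alerts, maintenance)
print(created)
```

Four alerts become two incident records: the repeat on leaf1 Eth1/1 is folded into the existing record, and the flap on the interface under maintenance is dropped.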

Read more on noise elimination.

IDENTIFY Root

Network operations teams can spend considerable time just getting siloed roles, looking at siloed data, to agree on where to start incident triage.

Augtera’s Network AI models the network and correlates data in an industry-leading way: multi-layer, topology-aware, and covering all network constructs. When a low-level anomaly is creating higher-layer alarms, Network AI can focus attention on where operationally relevant action is required. Incident records point skilled staff in the right direction.
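The idea of a low-level anomaly explaining higher-layer alarms can be illustrated with a small dependency graph in Python. This is a deliberately simplified sketch: the object names and the `DEPENDS_ON` map are hypothetical, and it assumes an alarmed object is a root candidate only when nothing it depends on, directly or indirectly, is also alarmed.

```python
# Hypothetical multi-layer model: each object lists what it directly depends on.
DEPENDS_ON = {
    "app:checkout-latency": ["bgp:leaf1-spine1"],
    "bgp:leaf1-spine1": ["if:leaf1:Eth1/1"],
    "if:leaf1:Eth1/1": ["optic:leaf1:Eth1/1"],
    "bgp:leaf2-spine1": ["if:leaf2:Eth1/1"],
}

def has_alarmed_dependency(obj, alarmed, depends_on):
    """True if anything obj depends on, directly or transitively, is alarmed."""
    stack = list(depends_on.get(obj, []))
    visited = set()
    while stack:
        dep = stack.pop()
        if dep in visited:
            continue
        visited.add(dep)
        if dep in alarmed:
            return True
        stack.extend(depends_on.get(dep, []))
    return False

def root_candidates(alarmed, depends_on):
    """Alarmed objects that are not explained by a lower-layer alarm."""
    return sorted(obj for obj in alarmed
                  if not has_alarmed_dependency(obj, alarmed, depends_on))

alarmed = {
    "optic:leaf1:Eth1/1",
    "if:leaf1:Eth1/1",
    "bgp:leaf1-spine1",
    "app:checkout-latency",
    "bgp:leaf2-spine1",
}
print(root_candidates(alarmed, DEPENDS_ON))
# ['bgp:leaf2-spine1', 'optic:leaf1:Eth1/1']
```

Five alarms reduce to two places to look: the degrading optic on leaf1, which explains the interface, BGP, and application alarms above it, and the unrelated BGP session on leaf2.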

AUTOMATE response

There will always be incidents that require investigation by skilled resources. However, there are many commonly occurring anomalies to which operations teams routinely, and mundanely, respond in the same way. These are ripe for automation.

Augtera Networks is actively engaged with customers on incident response: auto-mitigation and where possible, auto-remediation. Use cases auto-mitigated include:

  • Optical anomaly
  • BGP gray failure
  • Border router congestion

PREVENT Failures

When failures occur, they must be dealt with quickly, so customer and application-team experiences are not impacted. Even better though, is if the incidents never occurred.

With increasing data volumes and network complexity, resolving incidents quicker is merely running faster on the hamster wheel. What network operations teams ultimately need is a new hamster wheel.

“We quickly figured out we cannot be in the incident response business, or incident management business. We have to be in the business of incident prevention.”

Pranatap Lahiri, VP of Network and Data Center Engineering, EBAY, AI in Networking, ONUG 2021

Augtera Network AI enables network operations teams to act on gray failures, to see rare syslog messages that precede failures, equipment checks, and more.

Network AI reduces reaction time and reduces incident load, transforming incident response and management.

Prevention is the leading edge of Network AIOps, and essential so network operations teams can not only respond faster but also respond to fewer incidents.

LEARN Collectively With Augtera Network AI

Network AI generates high-fidelity signals in numerous ways including classifiers for known anomaly signatures. Augtera maintains a library of classifiers collected from across the customer base and provided to every Augtera customer.

Network operations teams have limited resources. The rate of anomaly signature learning can be slow, and the implementation of rules to catch anomalies even slower if the operations team must request code changes.

Augtera’s classifier library allows all customers to accelerate the creation of high-fidelity signals by learning from others.

Network AI classifier and incident response feedback loop

As operations teams learn new anomaly signatures from incident investigation, they can quickly create and activate new classifiers in their own implementation, and the same classifier can then be made available to other network operations teams.

After new classifiers are created and distributed to all customers, Network AI sees even more than it did before, resulting in an increasing number of high-fidelity signals. A virtuous circle that benefits all customers.

Network AI Conclusion

Today’s networks have too many tools, too much noise, too few high-fidelity insights, and too many incidents. Whether making incremental change or pursuing transformation, network operations teams are transforming by responding FASTER, with BETTER insights, on FEWER incidents. Network operations teams can no longer simply run faster on the same hamster wheel; they need a new one.
