Critical to Eliminate Alert Storms
Eliminating network alert storms is critical because they disrupt Network Operations in multiple ways. The most obvious impact is that they make it difficult to determine exactly what is going on and where the root of the problem is. A more insidious impact comes from alert storms caused by maintenance events. When network operations staff get used to ignoring alert storms caused by maintenance, they sometimes come to ignore every alert storm, assuming by pattern matching that it too was caused by a maintenance event. Lastly, alert storms that operations systems do not handle well can lead to skilled staff separately chasing multiple related aspects of the same incident.
Eliminating Alert Storms Caused by Equipment Failures
The failure of a single piece of equipment can lead to 10, 30, 50, or even 100 alerts. Networks are all about relationships, and the impact of those relationships becomes clear when something fails: many alerts are raised for the same root incident.
A device goes away…I am going to get 50 to 100 other tickets related to link downs on every link going in and out of that device and every high-level protocol that is going in and out of that device. So we would have a switch die and we would get 100 tickets.
Rick Casarez, Technical Director, Network Software Engineering & Operations at eBay, ONUG Spring 2022,
“Evolving Network Operations Through the Power of ML”
Understanding which alerts are related to a single incident requires understanding the relationships in a network, which span multiple layers and are formed by many different protocols. This is not a trivial problem to solve. It requires a robust network model that is multi-vendor, multi-layer, and topology-aware. Optimally, the network model is auto-discovered.
With a robust network model as the foundation, AI/ML can then use auto-correlation to group related events, anomalies, and other information under a single high-fidelity operations notification, automated trouble ticket, and/or automation system signal. The same network model can also be used to determine the root of an incident.
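To make the idea of topology-based grouping concrete, here is a minimal sketch in Python. It is not Augtera's implementation and involves no ML: it assumes a hypothetical toy network model (the `LINKS` list), hypothetical alert fields (`device`, `interface`, `type`), and a single hand-written rule, namely that a link-down alert is folded into the incident of a failed device when the far end of that link is the failed device.

```python
from collections import defaultdict

# Hypothetical toy network model: each link joins two (device, interface) endpoints.
LINKS = [
    (("sw1", "eth1"), ("r1", "ge0")),
    (("sw1", "eth2"), ("r2", "ge0")),
    (("r1", "ge1"), ("r2", "ge1")),
]


def build_peer_map(links):
    """Map every (device, interface) endpoint to the endpoint at the other end."""
    peers = {}
    for end_a, end_b in links:
        peers[end_a] = end_b
        peers[end_b] = end_a
    return peers


def correlate(alerts, links):
    """Group alerts into incidents keyed by the suspected root device.

    Assumed rule (illustrative only): a device that reports 'device_down'
    is a root. An alert on an interface whose link peer sits on a root
    device is folded into that root's incident. Any other alert is filed
    under its own device.
    """
    peers = build_peer_map(links)
    roots = {a["device"] for a in alerts if a["type"] == "device_down"}
    incidents = defaultdict(list)
    for alert in alerts:
        key = alert["device"]
        if key not in roots:
            peer = peers.get((alert["device"], alert.get("interface")))
            if peer is not None and peer[0] in roots:
                key = peer[0]  # the far end of this link is the failed device
        incidents[key].append(alert)
    return dict(incidents)


alerts = [
    {"device": "sw1", "type": "device_down"},
    {"device": "r1", "interface": "ge0", "type": "link_down"},
    {"device": "r2", "interface": "ge0", "type": "link_down"},
    {"device": "r2", "interface": "ge1", "type": "link_down"},
]
incidents = correlate(alerts, LINKS)
for root, grouped in incidents.items():
    print(root, len(grouped))
```

In this example the two link-down alerts facing `sw1` are grouped with the `sw1` failure, while the alert on the `r1`-`r2` link stays a separate incident because neither end of that link is a failed device. A real network model would also cover higher-layer protocol relationships, which this sketch leaves out.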
Eliminating Alert Storms Caused by Maintenance
Network Operations teams initiate maintenance, so they don’t need a storm of alerts telling them about something they started. Suppressing these alerts, however, is not simple within the way network alerting is typically designed.
One approach that works well is to have the Network Operations teams advise tools when a maintenance event is going to be initiated. This can sometimes be done by reusing some aspects of existing event protocols, but the most effective long-term approach is for operations tools to have a meta-information API to which Network Operations teams can send notifications.
Importantly, maintenance notifications should work at a sufficiently fine granularity. For example, it should be possible to suppress maintenance alerts on one interface without suppressing alerts for other interfaces on the same equipment.
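As an illustration of interface-level suppression, the following Python sketch models a maintenance notification as a time window scoped to either a whole device or a single interface. The `MaintenanceWindow` class, its fields, and the alert dictionary layout are all hypothetical names chosen for this example; they do not describe any particular product's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class MaintenanceWindow:
    """Hypothetical maintenance notification sent by the operations team."""
    device: str
    start: datetime
    end: datetime
    interface: Optional[str] = None  # None means the whole device is in maintenance

    def covers(self, device, interface, when):
        """True if an alert on (device, interface) at time `when` falls in this window."""
        if device != self.device:
            return False
        if self.interface is not None and interface != self.interface:
            return False
        return self.start <= when <= self.end


def is_suppressed(alert, windows):
    """An alert is suppressed if any declared maintenance window covers it."""
    return any(
        w.covers(alert["device"], alert.get("interface"), alert["time"])
        for w in windows
    )


t0 = datetime(2024, 1, 1, 2, 0)
windows = [MaintenanceWindow("sw1", t0, t0 + timedelta(hours=1), interface="eth1")]

during = t0 + timedelta(minutes=10)
on_eth1 = {"device": "sw1", "interface": "eth1", "time": during}
on_eth2 = {"device": "sw1", "interface": "eth2", "time": during}

print(is_suppressed(on_eth1, windows))  # alert on the interface under maintenance
print(is_suppressed(on_eth2, windows))  # alert on a different interface of the same device
```

Scoping the window to `eth1` suppresses only alerts from that interface, so a genuine failure on `eth2` during the same maintenance period would still be raised. Leaving `interface` unset would suppress alerts for the whole device instead.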
Conclusion
Augtera Networks customers are seeing a transformation through the elimination of alert storms. Trouble tickets are being reduced by 90% as related events and anomalies are grouped under a single incident and maintenance alerts are being suppressed. In addition, Augtera customers are defining policy that expresses the severity of alert types and different parts of the network.
The net result for customers is a transformative reduction in reactive workloads, and the ability to focus on what is operationally relevant. To learn more about how Network AI is transforming Network Operations, read: Network AI Platform.