The Augtera platform takes a holistic approach to policy-based noise elimination throughout the entire pipeline, from ingestion to alerting, ticketing, and automation.
Note: Even after policy-based noise elimination, detailed data is kept in a historical datastore for later triage / historical analysis. The amount of historical data retained is customer-determined.
Selecting the Strongest Signals
False positive and false negative noise reduction
There are some metrics for which a set threshold can work, for example, where a network operations team has a network-wide policy. One such scenario would be a policy that 1% packet loss on any link in the network is an anomaly. The Augtera platform can support thresholds for these scenarios.
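A fixed policy of this kind is simple to express. The following minimal Python sketch shows the 1% packet-loss check from the example above. The `LinkSample` type, the link names, and the counter fields are hypothetical illustrations, not part of the Augtera platform's interface.

```python
from dataclasses import dataclass

LOSS_THRESHOLD_PCT = 1.0  # network-wide policy from the text: 1% packet loss


@dataclass
class LinkSample:
    link: str  # hypothetical link identifier
    sent: int  # packets sent during the measurement interval
    lost: int  # packets lost during the measurement interval


def loss_pct(sample: LinkSample) -> float:
    """Packet loss as a percentage; 0.0 when nothing was sent."""
    return 100.0 * sample.lost / sample.sent if sample.sent else 0.0


def threshold_anomalies(samples, threshold_pct=LOSS_THRESHOLD_PCT):
    """Return the links whose loss meets or exceeds the policy threshold."""
    return [s.link for s in samples if loss_pct(s) >= threshold_pct]


samples = [
    LinkSample("leaf1-spine1", sent=10_000, lost=20),   # 0.2% loss
    LinkSample("leaf2-spine1", sent=10_000, lost=150),  # 1.5% loss
]
flagged = threshold_anomalies(samples)
```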
However, for most metrics, thresholds are noisy. Set too low, a threshold creates many false positives and generates too many alarms. Set too high, it creates many false negatives: relevant anomalies are not detected, and neither are gray failures.
It is difficult to apply heuristics for many metrics across all instances of network objects. For example, the optimal latency varies from link to link. The alternative to using heuristics is constant, time-consuming tweaking of object-by-object thresholds, which is not practical at scale and would result in significant noise.
For these reasons, Augtera uses machine learning models that learn patterns in real time with network-specialized algorithms. These models are not generic open source or academic models; they use production-tested networking algorithms.
The result of Augtera’s 9+ proprietary machine learning algorithms is self-learning / maintenance-free, operationally relevant alerts across hundreds of metrics.
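To illustrate why a per-object learned baseline behaves differently from one fixed threshold, here is a toy Python sketch. It uses a generic robust statistic (median and median absolute deviation), chosen only for illustration; it is not one of Augtera's proprietary algorithms, and the link histories and the factor `k` are made-up assumptions.

```python
from statistics import median


def is_anomalous(history, value, k=5.0):
    """Flag a value that deviates from this object's own median by more than k MADs."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    # Guard against a zero MAD when the history is perfectly flat.
    spread = mad if mad > 0 else 1e-9
    return abs(value - med) / spread > k


# Hypothetical latency histories (milliseconds) for two very different links.
metro_ms = [2.0, 2.1, 1.9, 2.0, 2.2, 2.1]
transoceanic_ms = [80.0, 81.0, 79.5, 80.5, 80.2, 79.8]

# A fixed 50 ms threshold would miss the first case and alarm on the second.
metro_alert = is_anomalous(metro_ms, 10.0)           # 10 ms on a ~2 ms link
ocean_alert = is_anomalous(transoceanic_ms, 81.0)    # 81 ms on a ~80 ms link
```

With each link judged against its own history, the 10 ms reading on the metro link is flagged while the 81 ms reading on the transoceanic link is not.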
Autocorrelation noise elimination
Augtera’s comprehensive multi-layer topology network model can dramatically reduce the number of alerts and tickets by correlating all related anomalies and events to a single incident. For individual device failures, some customers are seeing a 100-to-1 reduction in alerts. Across all alerts and tickets, customers are seeing a reduction of 10 times or more.
For more information on Multilayer topology aware correlation, see the Augtera Platform page.
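The idea of collapsing related alerts into one incident can be sketched in Python as follows. This is a deliberately simplified illustration that groups alerting objects by connectivity in a flat adjacency list; the object names and the `correlate` function are hypothetical, and the real multi-layer model is certainly richer than this.

```python
from collections import defaultdict


def correlate(alerts, links):
    """Group alerting objects into incidents.

    Two alerting objects land in the same incident when a chain of
    topology adjacencies connects them through other alerting objects.
    """
    alerting = set(alerts)
    adjacency = defaultdict(set)
    for a, b in links:
        if a in alerting and b in alerting:
            adjacency[a].add(b)
            adjacency[b].add(a)

    incidents, seen = [], set()
    for start in sorted(alerting):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk over alerting neighbors
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        incidents.append(sorted(group))
    return incidents


# Hypothetical topology: a device, its interfaces, and the far-end ports.
links = [
    ("spine1", "spine1:eth1"),
    ("spine1", "spine1:eth2"),
    ("spine1:eth1", "leaf1:eth49"),
    ("leaf9", "leaf9:eth3"),
]
alerts = ["spine1", "spine1:eth1", "spine1:eth2", "leaf1:eth49", "leaf9:eth3"]
incidents = correlate(alerts, links)
```

Here five alerts reduce to two incidents: one for the spine1 failure and its neighbors, and one for the unrelated port on leaf9.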
Classification of known anomaly symptoms
Customers often have their own experience with symptoms of a specific anomaly. In addition, when the Augtera platform is installed, it comes with a library of learnings from all customers. The definition of these anomaly symptoms is referred to by the Augtera platform as “classification”.
Classifications can be added dynamically, without change to the running software code, and can be for any anomaly, whether network-based or not. For example, a classification could be for a hardware device/component of an operating system.
Classifications produce strong signals because they describe well-known anomalies based on customer experiences.
Macro Pattern elimination of lower level noise
Patterns can be recognized across data center PODs, a group of switches, or the entire fabric. Alerts are generated when patterns change in significant and persistent ways. Macro pattern recognition can alleviate noise from lower levels, allowing operators to focus attention on critical issues.
Gray failure detection enables proactive action
Not all failures occur suddenly. Some occur after a period of increasing degradation. Visibility into degradation helps network operations teams eliminate future incidents before they occur. Trouble tickets can also be created with a code/severity that indicates proactive action is required, so that they do not obscure urgent issues but still receive scheduled attention when resources are available.
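One simple way to picture degradation detection is a trend test on a metric that has not yet crossed any hard limit. The Python sketch below fits a least-squares slope to recent samples; the metric, the `min_slope` and `hard_limit` values, and the function names are assumptions made for illustration, not a description of how the Augtera platform detects gray failures.

```python
def slope(values):
    """Least-squares slope of values against their sample index (needs >= 2 samples)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_degrading(values, min_slope, hard_limit):
    """True when the metric trends upward but is still below the hard limit."""
    return slope(values) >= min_slope and max(values) < hard_limit


# Hypothetical hourly CRC error counts on one interface.
crc_errors_per_hour = [0, 1, 1, 3, 4, 6, 9]
proactive = is_degrading(crc_errors_per_hour, min_slope=0.5, hard_limit=100)
steady = is_degrading([2, 2, 2, 2], min_slope=0.5, hard_limit=100)
```

A result like `proactive` could then map to a lower-urgency ticket code, in line with the proactive ticketing described above.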
Finite state machine for control plane changes
The platform detects actual BGP and other control plane flaps after eliminating transient changes in the control plane session state. This reduces noise through an understanding of complex network behavior.
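A minimal Python sketch of this kind of state machine is shown below. It assumes time-ordered `(timestamp, bgp_state)` events and a hypothetical `hold_s` parameter: a session that leaves Established and returns within `hold_s` seconds is treated as transient, while a longer absence counts as a flap. The three internal states and the parameter are illustrative assumptions, not Augtera's implementation.

```python
from enum import Enum


class SessionState(Enum):
    UP = "up"
    PENDING = "pending"  # left Established, outage not yet confirmed
    DOWN = "down"        # outage confirmed, counted as a flap


def count_flaps(events, hold_s=5.0):
    """Count flaps in time-ordered (timestamp, bgp_state) events.

    Confirmation is evaluated when the next event arrives; a production
    system would use a timer instead. The session is assumed to start UP.
    """
    state, left_at, flaps = SessionState.UP, None, 0
    for ts, bgp_state in events:
        established = bgp_state == "Established"
        if state is SessionState.PENDING and ts - left_at >= hold_s:
            state = SessionState.DOWN
            flaps += 1
        if state is SessionState.UP and not established:
            state, left_at = SessionState.PENDING, ts
        elif state is not SessionState.UP and established:
            state = SessionState.UP
    return flaps


events = [
    (0, "Established"),
    (10, "Idle"),          # brief drop...
    (11, "Established"),   # ...recovered after 1 s: transient, ignored
    (50, "Idle"),
    (52, "Active"),
    (70, "Established"),   # away for 20 s: counted as a flap
]
flaps = count_flaps(events)
```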
Policy-based Noise Elimination
Policy-based correlation
While the Augtera platform is production proven at hyperscale, across hundreds of metrics, and billions of data points per day, not all Network Operations teams are able to procure the IT resources required to support that amount of processing. This is despite the Augtera implementation being 5-10 times more efficient than other network operations tools attempting less analysis.
Additionally, network operations teams may choose to test the Augtera platform on a small number of targeted anomalies, or a subset of network objects, before moving to an expanded deployment.
Both goals can be accommodated with Augtera Spaces, which allow customers to limit analysis, and specifically correlation, to what the customer defines as being operationally relevant.
Policy-based noise elimination of notifications
Consoles, collaboration environments, ticketing systems, automation systems and more are notified with high-quality, operationally relevant, strong signals. However, operations teams may still decide to limit notification, or UI/UX displays, to a subset of identified incidents or data.
Customers define their preferences through Augtera Views.
Irrelevant network object noise elimination
Customers can also define what is operationally relevant through meta-data that Augtera can ingest. As an example, an operations team may decide they do not want an alert for failures occurring on Top-Of-Rack (TOR) switch server ports, while still needing to see alerts for TOR uplinks. Many different constructions such as this can be achieved through meta-data.
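The TOR example can be sketched as a metadata-driven filter. In the Python below, the metadata keys (`device_role`, `port_role`), the alert shape, and the object names are hypothetical; the actual metadata schema ingested by the platform is not described here.

```python
def is_relevant(alert, metadata):
    """Drop alerts on TOR server-facing ports; keep everything else."""
    meta = metadata.get(alert["object"], {})
    is_tor_server_port = (
        meta.get("device_role") == "tor" and meta.get("port_role") == "server"
    )
    return not is_tor_server_port


# Hypothetical metadata keyed by network object name.
metadata = {
    "tor1:eth1": {"device_role": "tor", "port_role": "server"},
    "tor1:eth49": {"device_role": "tor", "port_role": "uplink"},
}
alerts = [
    {"object": "tor1:eth1", "msg": "link down"},
    {"object": "tor1:eth49", "msg": "link down"},
]
kept = [a["object"] for a in alerts if is_relevant(a, metadata)]
```

Only the uplink alert survives; objects with no metadata are kept by default in this sketch.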
Policy-based maintenance noise elimination
For multiple reasons, maintenance can create significant noise, and large alert storms often arise from maintenance events. This happens so often that network operations teams can, over time, become indifferent to all large storms, whether or not they are created by maintenance events.
The Augtera platform learns which devices, interfaces, and other network objects are in maintenance, and suppresses alerts for them. On a single device, alerts from one interface can be suppressed while alerts are still generated for other interfaces.
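The per-object granularity can be illustrated with a small Python sketch. It assumes maintenance state is already known as a mapping from object name to a `(start, end)` window and that interfaces are named `device:interface`; both are assumptions for illustration, and how the platform actually learns maintenance state is not shown.

```python
def in_maintenance(obj, ts, windows):
    """True if obj, or the device it belongs to, is in a maintenance window at ts.

    A window on a device covers all of its interfaces; a window on one
    interface covers only that interface.
    """
    device = obj.split(":", 1)[0]
    for key in (obj, device):
        if key in windows:
            start, end = windows[key]
            if start <= ts < end:
                return True
    return False


# Hypothetical windows: one interface on leaf1, and all of spine2.
windows = {"leaf1:eth3": (100, 200), "spine2": (0, 500)}

suppressed = in_maintenance("leaf1:eth3", 150, windows)     # interface under maintenance
still_alerts = in_maintenance("leaf1:eth4", 150, windows)   # sibling interface is not
```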
Policy-based duplicate incident noise elimination
The Augtera platform can detect when an incident has already been notified and suppress duplicate notifications. This applies to all notification channels including ticketing.
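A minimal Python sketch of duplicate suppression follows. It assumes each incident has a stable identifier and that deduplication is tracked per notification channel; the `Notifier` class, the incident identifier, and the channel names are hypothetical.

```python
class Notifier:
    """Sends each (incident, channel) pair at most once."""

    def __init__(self):
        self._sent = set()
        self.delivered = []  # stands in for real ticketing/chat integrations

    def notify(self, incident_id, channel):
        """Return True if a notification was sent, False if it was a duplicate."""
        key = (incident_id, channel)
        if key in self._sent:
            return False
        self._sent.add(key)
        self.delivered.append(key)
        return True


notifier = Notifier()
results = [
    notifier.notify("INC-1", "ticketing"),  # first notification: sent
    notifier.notify("INC-1", "ticketing"),  # same incident, same channel: suppressed
    notifier.notify("INC-1", "chat"),       # same incident, other channel: sent
]
```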
Policy-based severity suppression
Customers can choose to suppress notification based on severity.
Policy-based definition of Transient Anomalies
The Augtera Platform can be configured with an “Anomaly Duration” period, to ensure that notification / visibility is limited to persistent anomalies.
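The effect of such a duration setting can be sketched in Python as a persistence filter. The sample format and the `min_duration_s` parameter are assumptions made for illustration; they are not the platform's configuration syntax.

```python
def persistent_anomalies(samples, min_duration_s):
    """Return start times of anomalous runs lasting at least min_duration_s.

    samples is a time-ordered list of (timestamp, is_anomalous) pairs.
    Each run is reported once, as soon as it has persisted long enough.
    """
    starts, run_start, reported = [], None, False
    for ts, anomalous in samples:
        if not anomalous:
            run_start, reported = None, False  # run ended; reset
            continue
        if run_start is None:
            run_start = ts
        if not reported and ts - run_start >= min_duration_s:
            starts.append(run_start)
            reported = True
    return starts


samples = [
    (0, True), (30, False),               # transient blip: ignored
    (60, True), (90, True), (120, True),  # persists for 60 s: reported
    (150, False),
]
notified = persistent_anomalies(samples, min_duration_s=60)
```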
Conclusion
Machine learning algorithms, correlation, and proactive action are the core of noise elimination. In addition, customers use policy to further define what is operationally relevant to them.