Data Center Solution

Introduction

The foundation of the Data Center Solution is the Network AI platform which ingests a large scope of data types with high-performance, scale, and efficiency prior to automating anomaly detection, noise elimination, incident root identification, trouble ticket creation and notification.

In addition to the many platform advantages, Augtera Networks has spent several years, working closely with Data Center clients to address a range of pain points, constructs, equipment types and more.

The result is the industry’s leading Data Center Network Operations Tool.

Purpose Built for the Data Center

The Data Center Network AIOps solution results from three years of close development with network operations teams overseeing some of the largest data center networks. Based on Augtera’s Network AIOps platform, the solution addresses key Data Center pain points and use cases, in addition to supporting required APIs, ITSM integrations, vendor integrations, equipment types, data types and constructs. For more information on these aspects of the solution, see Augtera Networks Data Center Solution Brief.

Augtera Networks Data Center Network AIOps Solution considerations

Data center network and service architectures have changed dramatically over the last decade resulting in the emergence of new complexities that cannot be managed with old solutions. The Augtera platform transforms manual, reactive, and noisy to automated, proactive, and relevant for the following domains:

Applications
Hybrid Cloud
Underlays and Overlays
Switches & routers
Firewalls and Load Balancers

Augtera’s platform is production proven at scale, with customers realizing the following benefits:

90%+ reduction in Mean Time to Detect (MTTD)
50%+ reduction in Mean Time to Mitigation (MTTM)
40%+ reduction in Mean Time to Repair (MTTR)
4x+ improvement of Mean Time Between Incidents

Anomalies never seen before are now visible, noise is eliminated, mitigation is occurring rapidly, there are less incidents that need to be ticketed, and more time between incidents. Data center operations is being transformed from manual, reactive, and noisy to automated, proactive, and relevant.

Data Center Challenges

Modern data center architectures comprise Layer 3 Clos fabrics, often built using BGP. This provides equal cost multipath (ECMP) between top-of-rack (ToR) switches that can utilize all links for east-west traffic between any two switches. This is highly beneficial for scale, resiliency and optimal bandwidth utilization. At the same time it creates operational challenges that stem from “lots of links”.

The Landscape and challenges described here are not limited to hyper scale data centers. Data centers of all sizes have similar challenges.

Application uptime and performance is becoming even more critical to business.
Increasing failure points across control and data plane
Heterogeneous Environments
Hard to detect failures are common
Legacy tools and processes are manual, reactive, and noisy

Driven by SaaS, Hybrid Cloud, digital transformation, and other trends, application uptime and performance are becoming even more critical to business. Networking teams are accountable to the business. Operation teams need to plan for business and application continuity.

There are multiple drivers for increasing failure points including dense multipath architectures that create operational challenges from “Lots of Links”, cloud Interconnects from the data center to the cloud, and virtual networks in the public cloud.

Heterogeneous Environments
- Multi-vendor implementations
- Increasing use of disaggregation & white labeled hardware
Hard to detect failures are common
- Transient control plane and hardware failures
- Grey failures / brownouts
- Long-lived failures
Legacy tools and processes are manual, reactive, and noisy
- Significant time to detect, mitigate, and repair operational issues.
- Application owners complain of problems before operations teams’ act

Foundation Capabilities

There are several capabilities that are foundational / common to the data center solution, independent of the domain (underlay, overlay, hybrid cloud,…), including:

Noise elimination
Automated notifications and ticketing
Autocorrelation
Proactively detecting gray failures

Augtera Founder and CEO, Rahul Aggarwal, Tech Field Day / Networking Field Day 28, May, 2022

Noise Elimination

The Augtera platform takes a holistic approach to eliminating noise. Noise elimination is not the result of one technology or one approach. It is the result of understanding noise throughout the entire pipeline from ingestion to ticketing.

There are two major categories of noise elimination capabilities:

Selecting the strongest signals from a sea of noise
Policy-driven reduction of analysis and notification

Learn more at Noise Elimination

Automated Notifications and Ticketing

Machine learning anomalies and insights detected by Augtera are automatically notified to Slack, Syslog, Kafka or ticketed to Service Now.

High fidelity proactive alerts & tickets that help transform operations to proactive from reactive
De-duplication aware to suppress duplicates while notifying
Duplicates are added as events against existing ticket state in Service Now
Auto-correlated events are added as events against the same Service Now ticket
Ticket life cycle aware

Data Center Use Cases & Auto-Mitigation / Remediation

Over three years of development, many Data Center Network Operations use cases have been addressed.

Augtera Networks Data Center Network AIOps Solution use cases

A couple of examples include:

Fabric/server latency & packet loss
Fabric congestion impact on application flows

Fabric / Server, Hybrid Cloud Latency & Packet Loss

In the Data Center, the network does not end at ToR (Top of Rack) switches, it extends all the way to servers. Detecting changes in latency and packet loss within the fabric, and all the way to servers is critical today. However, while other solutions stop at measurements, the Augtera Data Center Solution inputs the measurements, into Machine Learning (ML) based anomaly detection to find operationally relevant anomalies in latency and loss. The solution further inputs these anomalies along with other data sources, and performs multi-layer, topology- aware auto-correlation. The Augtera Data Center Solution uses Machine Learning (ML) models instead of static thresholds to reduce anomaly detection noise. Less noise, better insights. Augtera Networks latency and packet loss anomaly detection is also supported for Hybrid / Multi-Cloud.

Fabric Congestion Impact on Application Flows

It is one thing to know or suspect there is fabric congestion, it is another to know if that congestion is impacting applications. The Augtera Data Center Solution can determine if specific applications are experiencing performance degradation. It is critical today that network operations teams be able provide answers to application teams on the impact or innocence of the network with respect to Data Center / Cloud-based applications.

Auto-mitigation / remediation

Augtera Networks has worked with customers on a growing number of auto-mitigation / auto-remediation including:

Optical anomaly
BGP gray failure
Border router congestion

Data Center Autocorrelation

Multi-layer topology aware autocorrelation automatically correlates operationally relevant events and Augtera generated anomalies across the BGP and layer 2 control plane, IP and VXLAN data plane, synthetic probe anomalies, system and environmental degradation anomalies and other grey failures. The output of autocorrelation is “Augtera Incidents”.

This has several benefits:

Reduction of 25-75% in the number of incidents / tickets that NOC needs to manage
High fidelity context for root cause analysis, mitigation, and remediation. For example, Figure X shows an Augtera incident that auto-correlates 8 link and BGP flaps on 4 switches and identifies the mis-behaving switch that they are all connected to, even when the misbehaving switch does not generate any alerts.

Data center solution network topology example.

Proactively Detect Grey Failures

Grey failures or brown outs refer to problems that are brewing and will eventually cause an outage. However, their effect is not immediately obvious. Applications may notice them and complain but they can be difficult to detect, trouble shoot and root cause. For example, intermittent packet drops can be the result of hardware issues, fiber issues, optical problems, or software issues.

Augtera anomaly detection algorithms on hundreds of metrics can detect significant pattern changes and detect control plane, data plane or hardware grey failures hours or days before the current reactive approaches.
Zero-day syslog anomalies can find the very first occurrence of an important syslog that indicates a grey failure. e.g., ASIC parity error.

Data Center Switching Fabric and Servers

Proactive Detection of Latency, Loss and Microbursts

Augtera agent can be deployed on hosts, leaf’s, and spines and enabled for synthetic probes
Probes continually measure loss and latency to each other
Augtera automatically learns normal loss/latency patterns between pair of devices and generates an anomaly when there is a degradation without thresholds or any other configuration
Anomalies can be overlayed on topology or viewed in heatmaps.
Augtera leverages peak buffer depth metrics from telemetry streaming, automatically learns the normal pattern for every queue and generates anomalies that detect microbursts

Data Center solution synthetic agents can detect loss and latency.

Proactively Detect Environmental and Optical Degradation

Early detection of temperature, fan speed, power, voltage and optical degradation via ML Anomaly Detection.

Early detection of router temperature fluctuations.

Traffic and Flow Mapping and Anomalies

Operators can map the path taken by a flow as well as the anomalies along the path in real-time or at a specific time interval in the past to determine the root cause of application issues
Anomaly detection on traffic utilization as an aggregate across a POD, fabric or data center interconnects to proactively detect unexpected major changes in traffic
Anomaly detection on TCP flags to detect SYN floods, unstable TCP sessions, etc.
Ad-hoc analytics to detect ECMP polarization on specific switches

Data center solution for traffic utilization includes support for ECMP

Application Performance and Availability

Enables a customer to proactively determine increase in latency or loss across the data center fabric due to packet drops in the fabric and further determine the application flows responsible for the congestion.

Actionable Insights for Data Center, SE Director, John Heintz, Networking Field Day 28, May 2022

Augtera agents need to be installed on either the fabric leaf switches or servers. The solution encompasses:

Anomaly detection on probe latency and loss between agents
Anomaly detection on packet drops on fabric interfaces
Auto-correlation of probe anomalies and interface anomalies
Auto-notification / ticketing of Augtera incidents that contain both Probe and interface anomalies and detect Fabric Congestion
sFlow enablement on all leaf switches (for a 2 stage Clos) or spine switches with the Augtera stack configured as the sFlow collector

The above results in proactive notifications of increase in latency / loss across the fabric correlated with packet drops in the fabric. sFlow analysis enables an operator to find the flows transiting the interface on which packets are dropped at the time of the drops.

TCP Retransmit Anomalies and Impacted Flows

Augtera collects sFlow telemetry from the fabric. Augtera ML is configured to model the volume of TCP retransmits on a switch or interface and identifies when there is an operationally relevant pattern change. Operator can then use Augtera analytics to identify the impacted flows. Augtera can also automatically identify if there are fabric interfaces that are responsible for the retransmissions or prove that the network is innocent.

Data Center Solution TCP Sack anomalies can find application impacting fabric congestion.

Firewall and Load Balancer Anomalies and Observability

Augtera integrates with a wide range of firewall and load balancer vendors and ingests metric, log and flow data.

Augtera can automatically detect a spike in rejected flows as well as identify the specific flows
Zero Day syslog anomalies can detect the first occurrence of rare syslog messages that identify firewall or load balancer issues
Traffic and flow anomalies described above are applicable to firewall and load balancer infrastructure

Hybrid Cloud

Hybrid Cloud Traffic, Flow, Latency and Loss Anomalies

Several of the solutions described above are applicable to Hybrid Cloud:

Aggregate traffic anomalies on data center to public cloud interconnects
Flow observability, analytics and anomalies based on sFlow and IPFIX streamed from data center border routers or from VPC flow logs in the public cloud
Deployment of Augtera agents on public cloud VMs and data center servers and switches to detect hybrid cloud loss and latency degradation
Integration with server and VM metrics and logs in the data center as well as with public cloud VM metrics and logs

NetOps / DevOps Friendly APIs

There are numerous short and long-term drivers for APIs. In the short-term, an operations team may not be able to provide direct access to a data source for a Network Operations tool. Sometimes the export of data to a file is explored, and maybe the only option. However, this is far from optimal. It is not dynamic for one. In these cases, Augtera provides its own APIs and support for other APIs. In the long-term, the transition to NetOps and DevOps-friendly Network Operations is the main driver for supporting and providing APIs.

Augtera Networks Data Center Network AIOps Solution API examples

Augtera Networks is always exploring with customers what APIs to support and what APIs to provide for the Data Center Solution. Two APIs that have been powerful in Data Center Solutions are:

Meta-Information API
Topology API

There are many powerful uses of Meta-Information, including data / alert enrichment. Of particular concern to Network Operations teams is communicating to Network Operations tools when to suppress alarms. For example, a Network Operations team may know that it is performing maintenance, so it does not need alarms for impacted elements.

Augtera believes strongly that auto-correlation needs to be topology-aware to effectively identify the root of an incident, which is why topology auto-discovery is provided in the base platform. However, there are times that Network Operations teams cannot provide use of the protocols needed to do auto-discovery. For these occasions, using an API is a good alternative.

ITSM Integrations

All Network Operations teams have a “job to be done”, and that job cannot be done unless tools are integrated into / aware-of processes and workflows. Any tool vendor can say they support an API. The “rubber meets the road” on the question of whether a tool is enabling the network operations team to get their job done.

There are numerous ways in which Augtera has worked with Data Center Network Operations teams to support workflows and processes.

ServiceNow tickets are often the trigger for scheduling skilled engineers to act on an incident. As skilled engineers are a scarce resource, it is critical that no noise is injected into ServiceNow. The Augtera Networks Data Center Solution eliminates noise from anomaly detection and ingestion of multiple data sources, suppresses maintenance incidents, de-duplicates, and provides customer policy definition that indicates what is operationally relevant and what is high or low priority.

As Network Operations teams seek to be more collaborative, tools like Slack have been adopted. Augtera Networks provides alerts / visualizations within Slack.

Data Center Relevant Equipment Types and Multi-Vendor

Augtera Networks is inherently differentiated from single-vendor / single-vendor-oriented solutions by broad multi-vendor support. While equipment vendors are often motivated to provide better support for their own equipment, Augtera Networks is motivated to provide robust multi-vendor support.

The Augtera Data Center Solution supports any network object that uses standard interfaces. Some of the vendors / technologies that Augtera has production proven support for data center and hybrid cloud operations include:

Data Center Data Types

Network AIOps begins with ingesting the data needed to develop the insights and actions that transform Network Operations. Any Data Center Solution must support the needed data.

Augtera Networks has found SNMP and Syslog is fundamental to most solutions. In fact, some solutions can be provided with Syslog only. For the Data Center Solution, sFLOW and IPFIX are important for anomaly detection involving flows and application performance degradation. For some SD-WAN offerings, IPFIX is the only flow data supported.

OpenConfig and gRPC/gNMI are emerging approaches. While implementations are still maturing in many cases, support is critical for some Data Center equipment and customer preferences. VPC Flow Log support extends the Data Center Solution to cloud environments.

Synthetic agents for latency and loss are critical because other telemetry data does not provide this information. Kafka is an increasingly important transport for many data types.

The Augtera Data Center Solution also supports the Mirror on Drop (MOD) Broadcom chip capability enabled on Dell Enterprise SONiC, in addition to Dell Enterprise SONiC streaming telemetry.

Data Center Constructs

Every networking environment has potentially different constructs that need to be supported, because of network architecture and design considerations inherent to that environment.

Augtera Networks Data Center Network AIOps Solution constructs

For Data Center environments, there are numerous constructs that need to be supported. Some already mentioned in this paper include physical, optical, Ethernet, underlays, overlays, and SD-WAN. The Data Center has other unique constructs like Top of Rack, spine switches, PODs, L3 Clos Fabrics, and Data Center Interconnects (DCIs).

Why does this matter? What if a network operations team decides they want to understand anomalies at the POD aggregate level? Then the POD construct must be supported by the solution. What if a network operations team decides links from a ToR to a server are not high priority from a notification perspective, but ToR uplinks are? Then the ToR construct must be supported and understood.

If the constructs important to data center environments are not understood and integrated into the solution, then important noise reduction, notification, and analysis policies cannot be realized.

VXLAN and EVPN is common in Data Centers

Deep observability and proactive insights are provided into the EVPN/VXLAN data plane and control plane:

ECMP aware inner packet tracing across the fabric to determine the path taken by application packets encapsulated in VXLAN
Proactive detection of EVPN control plane routing anomalies
Proactive detection of hardware memory exhaustion for EVPN routes

Topology Auto-Discovery and Time Machine Visualization

Data Center layer 2, layer 3, VXLAN and EVPN topology is automatically discovered and visualized across the fabric and servers:

Operator provided metadata including the role of devices (e.g., leaf, spine), data center PODs and fabrics augments the auto-discovered topology
Hierarchical zoom-in and zoom-out visualization enabling operators to visualize thousands of switches and routers
Time machine-based visualization of metrics, events, and anomalies on the topology

Time machine visualization in data center solution.

Conclusion

Data center performance and availability management is one of the most challenging areas of network operations today. Current generation monitoring and observability tools are manual, reactive, and noisy. Augtera Data Center customers are transferring their operations to being automated, proactive, and operationally relevant, dramatically improving their KPIs.

For more information: