Data Center Solution

Introduction

Data center network and service architectures have changed dramatically over the last decade, creating new complexities that older solutions cannot manage. The Augtera platform transforms operations from manual, reactive, and noisy to automated, proactive, and relevant across the following domains:

  • Applications
  • Hybrid Cloud
  • Underlays and Overlays
  • Switches and Routers
  • Firewalls and Load Balancers

Augtera’s platform is production proven at scale, with customers realizing the following benefits:

  • 90%+ reduction in Mean Time to Detect (MTTD)
  • 50%+ reduction in Mean Time to Mitigation (MTTM)
  • 40%+ reduction in Mean Time to Repair (MTTR)
  • 4x+ improvement of Mean Time Between Incidents

Anomalies never seen before are now visible, noise is eliminated, mitigation occurs rapidly, fewer incidents need to be ticketed, and there is more time between incidents. Data center operations are being transformed from manual, reactive, and noisy to automated, proactive, and relevant.

The Augtera Data Center Solution supports any network object that uses standard interfaces, and Augtera has production-proven support for a range of vendors and technologies used in data center and hybrid cloud operations.

Problem

Modern data center architectures comprise Layer 3 Clos fabrics, often built using BGP. This provides equal-cost multipath (ECMP) between top-of-rack (TOR) switches, so east-west traffic between any two switches can use all available links. This is highly beneficial for scale, resiliency, and bandwidth utilization. At the same time, it creates operational challenges that stem from “lots of links”.
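
As a rough illustration of where “lots of links” comes from, the sketch below counts leaf-to-spine links in a 2-stage Clos and picks an ECMP next hop by hashing a flow's 5-tuple. The function names, addresses, and the use of SHA-256 are assumptions made for illustration only; real switches use vendor-specific hardware hash functions.

```python
import hashlib


def leaf_spine_link_count(leaves, spines):
    """In a 2-stage Clos, every leaf connects to every spine."""
    return leaves * spines


def ecmp_next_hop(flow, next_hops):
    """Pick one next hop by hashing the flow 5-tuple (illustrative hash, not a vendor algorithm)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical fabric: 32 leaves, 4 spines.
spines = ["spine1", "spine2", "spine3", "spine4"]
# Hypothetical flow: (src IP, dst IP, protocol, src port, dst port).
flow = ("10.0.1.10", "10.0.2.20", 6, 49152, 443)

chosen = ecmp_next_hop(flow, spines)
print(leaf_spine_link_count(32, 4), "leaf-spine links; flow pinned to", chosen)
```

Because the hash is deterministic, every packet of a flow follows the same link, while different flows spread across all spines. Each of those links is a potential failure point that operations teams need to watch.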

The landscape and challenges described here are not limited to hyperscale data centers. Data centers of all sizes face similar challenges.

  • Application uptime and performance are becoming even more critical to business
  • Increasing failure points across control and data planes
  • Heterogeneous Environments
  • Hard to detect failures are common
  • Legacy tools and processes are manual, reactive, and noisy

Driven by SaaS, hybrid cloud, digital transformation, and other trends, application uptime and performance are becoming even more critical to business. Networking teams are accountable to the business. Operations teams need to plan for business and application continuity.

There are multiple drivers for increasing failure points, including dense multipath architectures that create operational challenges from “lots of links”, cloud interconnects from the data center to the cloud, and virtual networks in the public cloud.

  • Heterogeneous Environments
    • Multi-vendor implementations
    • Increasing use of disaggregation and white-label hardware
  • Hard to detect failures are common
    • Transient control plane and hardware failures
    • Grey failures / brownouts
    • Long-lived failures
  • Legacy tools and processes are manual, reactive, and noisy
    • Significant time to detect, mitigate, and repair operational issues
    • Application owners complain of problems before operations teams act

Solution

Data center operators need a different approach, one that:

  • Is specialized to data center networking
  • Understands new infrastructure and service architectures
  • Transforms manual, reactive, and noisy to automated, proactive, and relevant

Augtera Founder and CEO, Rahul Aggarwal, Tech Field Day / Networking Field Day 28, May 2022

Foundation Capabilities

Several capabilities are foundational to the data center solution, independent of the domain (underlay, overlay, hybrid cloud,…), including:

  • Noise elimination
  • Automated notifications and ticketing
  • Autocorrelation
  • Proactively detecting grey failures

Noise Elimination

The Augtera platform takes a holistic approach to eliminating noise. Noise elimination is not the result of one technology or one approach. It is the result of understanding noise throughout the entire pipeline from ingestion to ticketing.

There are two major categories of noise elimination capabilities:

  • Selecting the strongest signals from a sea of noise
  • Policy-driven reduction of analysis and notification

Learn more at Noise Elimination

Automated Notifications and Ticketing

Machine learning anomalies and insights detected by Augtera are automatically sent as notifications to Slack, Syslog, or Kafka, or ticketed in ServiceNow.

  • High-fidelity proactive alerts and tickets that help transform operations from reactive to proactive
  • De-duplication aware, so duplicates are suppressed while notifying
  • Duplicates are added as events against existing ticket state in ServiceNow
  • Auto-correlated events are added as events against the same ServiceNow ticket
  • Ticket life cycle aware
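
The de-duplication and life cycle behavior described in this list can be sketched as follows. This is a minimal in-memory model, not the ServiceNow API or Augtera's implementation; the `Ticket` and `TicketBook` classes and the de-duplication key format are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    key: str
    state: str = "open"
    events: list = field(default_factory=list)


class TicketBook:
    """Hypothetical ticket store keyed by a de-duplication key."""

    def __init__(self):
        self.tickets = {}  # de-duplication key -> most recent Ticket

    def notify(self, key, event):
        """Append to an open ticket for this key, or create a new one."""
        ticket = self.tickets.get(key)
        if ticket is not None and ticket.state == "open":
            ticket.events.append(event)  # duplicate: no new ticket
            return "appended"
        self.tickets[key] = Ticket(key=key, events=[event])
        return "created"

    def close(self, key):
        self.tickets[key].state = "closed"


book = TicketBook()
key = "leaf1:eth1:packet-drops"  # hypothetical key format
first = book.notify(key, "drop anomaly at 10:00")
second = book.notify(key, "drop anomaly at 10:05")
book.close(key)
third = book.notify(key, "drop anomaly at 14:30")
print(first, second, third)  # created appended created
```

The third notification opens a new ticket because the earlier one was closed, which is one simple reading of “ticket life cycle aware”.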

Autocorrelation

Multi-layer, topology-aware autocorrelation automatically correlates operationally relevant events and Augtera-generated anomalies across the BGP and Layer 2 control plane, the IP and VXLAN data plane, synthetic probe anomalies, system and environmental degradation anomalies, and other grey failures. The output of autocorrelation is “Augtera Incidents”.

This has several benefits:

  • Reduction of 25-75% in the number of incidents / tickets that the NOC needs to manage
  • High-fidelity context for root cause analysis, mitigation, and remediation. For example, Figure X shows an Augtera incident that auto-correlates 8 link and BGP flaps on 4 switches and identifies the misbehaving switch that they are all connected to, even when the misbehaving switch does not generate any alerts.
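
The idea behind the example in the last bullet, finding a silent device that many alerting links share, can be sketched with a simple neighbor count. This is an assumption-laden illustration, not Augtera's correlation algorithm; the topology map, device names, and `common_neighbor` function are hypothetical.

```python
from collections import Counter


def common_neighbor(events, topology):
    """Return (neighbor, count) for the neighbor shared by the most events."""
    counts = Counter()
    for device, interface, _kind in events:
        neighbor = topology.get((device, interface))
        if neighbor is not None:
            counts[neighbor] += 1
    if not counts:
        return None, 0
    return counts.most_common(1)[0]


# Hypothetical topology: (device, interface) -> neighbor device.
topology = {}
for n in range(1, 5):
    topology[(f"leaf{n}", "eth1")] = "spine1"
    topology[(f"leaf{n}", "eth2")] = "spine2"

# 8 link and BGP flaps on 4 leaves, all on the interface facing spine2,
# plus one unrelated event. spine2 itself reports nothing.
events = []
for n in range(1, 5):
    events.append((f"leaf{n}", "eth2", "link-flap"))
    events.append((f"leaf{n}", "eth2", "bgp-flap"))
events.append(("leaf1", "eth1", "link-flap"))

suspect, count = common_neighbor(events, topology)
print(suspect, count)  # spine2 8
```

Grouping the eight events under one suspect turns eight potential tickets into a single incident with a likely culprit attached.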

Proactively Detect Grey Failures

Grey failures, or brownouts, are problems that are brewing and will eventually cause an outage, but whose effect is not immediately obvious. Applications may notice them and complain, but they can be difficult to detect, troubleshoot, and root-cause. For example, intermittent packet drops can be the result of hardware issues, fiber issues, optical problems, or software issues.

  • Augtera anomaly detection algorithms, applied to hundreds of metrics, can identify significant pattern changes and detect control plane, data plane, or hardware grey failures hours or days before current reactive approaches.
  • Zero-day syslog anomalies can find the very first occurrence of an important syslog message that indicates a grey failure, e.g., an ASIC parity error.
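
One simple way to approximate first-occurrence detection is to reduce each syslog message to a template and flag templates never seen before. The sketch below assumes that masking numbers and hex values is enough to form a template, and it flags every new template without judging importance; the class name and sample messages are hypothetical.

```python
import re

# Mask hex values and digit runs so messages that differ only in
# variable fields collapse to the same template.
_VARIABLE = re.compile(r"0x[0-9a-fA-F]+|\d+")


def template(message):
    return _VARIABLE.sub("<*>", message)


class FirstSeenDetector:
    """Flags the first occurrence of each message template."""

    def __init__(self):
        self.seen = set()

    def observe(self, message):
        key = template(message)
        if key in self.seen:
            return False
        self.seen.add(key)
        return True


detector = FirstSeenDetector()
results = [
    detector.observe("Interface Ethernet1 changed state to down"),
    detector.observe("Interface Ethernet2 changed state to down"),
    detector.observe("ASIC parity error detected at 0x3fa2"),
    detector.observe("ASIC parity error detected at 0x11bc"),
]
print(results)  # [True, False, True, False]
```

In practice the history of seen templates would be learned over a long baseline period, so that routine messages are already known when the rare one arrives.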

Application Performance and Availability

This capability enables a customer to proactively detect an increase in latency or loss across the data center fabric caused by packet drops in the fabric, and to determine the application flows responsible for the congestion.

Actionable Insights for Data Center, SE Director, John Heintz, Networking Field Day 28, May 2022

Augtera agents need to be installed on either the fabric leaf switches or servers. The solution encompasses:

  • Anomaly detection on probe latency and loss between agents
  • Anomaly detection on packet drops on fabric interfaces
  • Auto-correlation of probe anomalies and interface anomalies
  • Auto-notification / ticketing of Augtera incidents that contain both probe and interface anomalies and detect fabric congestion
  • sFlow enablement on all leaf switches (for a 2-stage Clos) or spine switches, with the Augtera stack configured as the sFlow collector

The above results in proactive notifications of increases in latency / loss across the fabric, correlated with packet drops in the fabric. sFlow analysis enables an operator to find the flows that were transiting the interface on which packets were dropped at the time of the drops.
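
Finding the flows on an interface at the time of the drops amounts to filtering flow samples by device, interface, and time window, then ranking by volume. The sketch below assumes samples have already been decoded into dictionaries with hypothetical field names; it does not parse sFlow datagrams or scale by the sampling rate.

```python
from collections import defaultdict


def flows_during_drops(samples, device, interface, start, end):
    """Rank flows by sampled bytes on one interface within [start, end]."""
    totals = defaultdict(int)
    for s in samples:
        if (s["device"] == device and s["interface"] == interface
                and start <= s["ts"] <= end):
            totals[s["flow"]] += s["bytes"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical decoded samples (timestamps in seconds).
samples = [
    {"ts": 100, "device": "leaf1", "interface": "eth1", "flow": "flowA", "bytes": 1500},
    {"ts": 105, "device": "leaf1", "interface": "eth1", "flow": "flowB", "bytes": 9000},
    {"ts": 107, "device": "leaf1", "interface": "eth1", "flow": "flowA", "bytes": 1500},
    {"ts": 300, "device": "leaf1", "interface": "eth1", "flow": "flowC", "bytes": 9000},
    {"ts": 104, "device": "leaf1", "interface": "eth2", "flow": "flowD", "bytes": 9000},
]

# Drops were observed on leaf1 eth1 between t=100 and t=110.
top_flows = flows_during_drops(samples, "leaf1", "eth1", 100, 110)
print(top_flows)  # [('flowB', 9000), ('flowA', 3000)]
```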

TCP Retransmit Anomalies and Impacted Flows

Augtera collects sFlow telemetry from the fabric. Augtera ML is configured to model the volume of TCP retransmits on a switch or interface and identify when there is an operationally relevant pattern change. The operator can then use Augtera analytics to identify the impacted flows. Augtera can also automatically identify whether fabric interfaces are responsible for the retransmissions or prove that the network is innocent.
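
The source does not describe the model used for retransmit volume. As a stand-in, the sketch below flags pattern changes with a rolling median and median absolute deviation (MAD), a common robust baseline. The `window` and `k` parameters and the sample counts are assumptions for illustration.

```python
from statistics import median


def mad_anomalies(series, window=12, k=5.0):
    """Return indexes whose value deviates from the rolling median by more than k * MAD."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        center = median(history)
        mad = median(abs(x - center) for x in history)
        scale = max(mad, 1.0)  # floor so a flat history does not flag tiny changes
        if abs(series[i] - center) > k * scale:
            anomalies.append(i)
    return anomalies


# Hypothetical per-interval TCP retransmit counts on one interface.
retransmits = [10, 12, 11, 9, 10, 13, 11, 10, 12, 11, 10, 12, 11, 95, 12]
spikes = mad_anomalies(retransmits)
print(spikes)  # [13]
```

The median-based baseline is barely moved by the spike itself, so the interval after the spike is not flagged.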

Hybrid Cloud

Hybrid Cloud Traffic, Flow, Latency and Loss Anomalies

Several of the solutions described above are applicable to Hybrid Cloud:

  • Aggregate traffic anomalies on data center to public cloud interconnects
  • Flow observability, analytics and anomalies based on sFlow and IPFIX streamed from data center border routers or from VPC flow logs in the public cloud
  • Deployment of Augtera agents on public cloud VMs and data center servers and switches to detect hybrid cloud loss and latency degradation 
  • Integration with server and VM metrics and logs in the data center as well as with public cloud VM metrics and logs 

Data Center Switching Fabric and Servers

Proactive Detection of Latency, Loss and Microbursts

  • Augtera agents can be deployed on hosts, leaves, and spines and enabled for synthetic probes
  • Agents continually probe each other to measure loss and latency
  • Augtera automatically learns normal loss/latency patterns between each pair of devices and generates an anomaly when there is a degradation, without thresholds or any other configuration
  • Anomalies can be overlaid on the topology or viewed in heatmaps
  • Augtera leverages peak buffer depth metrics from telemetry streaming, automatically learns the normal pattern for every queue, and generates anomalies that detect microbursts
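
Learning a separate baseline for each device pair, rather than applying one fixed threshold, can be sketched with an online mean and variance (Welford's method) per pair. This is a simplification and not Augtera's model: the `warmup` and `sigmas` parameters are assumptions that a real self-tuning system would not expose, and the device names and latencies are hypothetical.

```python
import math


class PairBaseline:
    """Learns a per-pair latency baseline and flags large upward deviations."""

    def __init__(self, warmup=20, sigmas=4.0):
        self.warmup = warmup
        self.sigmas = sigmas
        self.stats = {}  # (src, dst) -> (count, mean, m2)

    def observe(self, src, dst, latency_ms):
        count, mean, m2 = self.stats.get((src, dst), (0, 0.0, 0.0))
        anomalous = False
        if count >= self.warmup:
            std = math.sqrt(m2 / (count - 1))
            anomalous = latency_ms > mean + self.sigmas * max(std, 1e-6)
        if not anomalous:
            # Welford update; anomalous samples are kept out of the baseline.
            count += 1
            delta = latency_ms - mean
            mean += delta / count
            m2 += delta * (latency_ms - mean)
            self.stats[(src, dst)] = (count, mean, m2)
        return anomalous


baseline = PairBaseline()
for i in range(20):
    baseline.observe("leaf1", "leaf2", 0.50 if i % 2 == 0 else 0.52)
    baseline.observe("leaf1", "leaf3", 4.9 if i % 2 == 0 else 5.1)

normal = baseline.observe("leaf1", "leaf2", 0.51)
degraded = baseline.observe("leaf1", "leaf2", 5.0)
far_pair = baseline.observe("leaf1", "leaf3", 5.0)
print(normal, degraded, far_pair)  # False True False
```

The same 5.0 ms reading is a degradation for one pair and normal for another, which is why a single static threshold across the fabric tends to be either noisy or blind.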

Proactively Detect Environmental and Optical Degradation

Early detection of temperature, fan speed, power, voltage and optical degradation via ML Anomaly Detection.

Traffic and Flow Mapping and Anomalies

  • Operators can map the path taken by a flow as well as the anomalies along the path in real-time or at a specific time interval in the past to determine the root cause of application issues
  • Anomaly detection on traffic utilization as an aggregate across a POD, fabric or data center interconnects to proactively detect unexpected major changes in traffic
  • Anomaly detection on TCP flags to detect SYN floods, unstable TCP sessions, etc.
  • Ad-hoc analytics to detect ECMP polarization on specific switches
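
For the ECMP polarization bullet, one simple ad-hoc check is to compare the busiest ECMP member link against an even split of the traffic. The ratio below is an illustrative metric chosen for this sketch, not one defined by the source; the interface names and byte counts are hypothetical.

```python
def ecmp_imbalance(bytes_per_link):
    """Ratio of the busiest link's bytes to an even share (1.0 means perfectly balanced)."""
    total = sum(bytes_per_link.values())
    if total == 0:
        return 0.0
    fair_share = total / len(bytes_per_link)
    return max(bytes_per_link.values()) / fair_share


balanced = ecmp_imbalance({"eth1": 100, "eth2": 100, "eth3": 100, "eth4": 100})
polarized = ecmp_imbalance({"eth1": 370, "eth2": 10, "eth3": 10, "eth4": 10})
print(balanced, polarized)
```

A ratio close to the number of member links means nearly all traffic is riding a single uplink, which is the signature of polarization.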

Underlays and Overlays

While the section above on switches and routers deals with underlay support, this section focuses on overlay support.

VXLAN and EVPN

Deep observability and proactive insights are provided into the EVPN/VXLAN data plane and control plane:

  • ECMP-aware inner packet tracing across the fabric to determine the path taken by application packets encapsulated in VXLAN
  • Proactive detection of EVPN control plane routing anomalies 
  • Proactive detection of hardware memory exhaustion for EVPN routes

Topology Auto-Discovery and Time Machine Visualization

Data Center layer 2, layer 3, VXLAN and EVPN topology is automatically discovered and visualized across the fabric and servers:

  • Operator provided metadata including the role of devices (e.g., leaf, spine), data center PODs and fabrics augments the auto-discovered topology
  • Hierarchical zoom-in and zoom-out visualization enabling operators to visualize thousands of switches and routers
  • Time machine-based visualization of metrics, events, and anomalies on the topology

Firewall and Load Balancer Anomalies and Observability

Augtera integrates with a wide range of firewall and load balancer vendors and ingests metric, log and flow data. 

  • Augtera can automatically detect a spike in rejected flows as well as identify the specific flows
  • Zero-day syslog anomalies can detect the first occurrence of rare syslog messages that identify firewall or load balancer issues
  • Traffic and flow anomalies described above are applicable to firewall and load balancer infrastructure

Conclusion

Data center performance and availability management is one of the most challenging areas of network operations today. Current-generation monitoring and observability tools are manual, reactive, and noisy. Augtera Data Center customers are transforming their operations to be automated, proactive, and operationally relevant, dramatically improving their KPIs.
