How Augtera Detected the AWS Outage 30 Minutes Before AWS Posted It

Recently, two high-profile AWS outages disrupted service for many organizations. Such outages often have direct and indirect downstream effects on services that are not hosted in AWS themselves but rely on third-party services (e.g., Slack) that use AWS.

A few months ago, Augtera released a new service: real-time multi-cloud observability that provides actionable insights. In this blog we explain how this service detected the recent AWS outage and notified operations teams 30 minutes before AWS posted it, and how it provides valuable outage context for organizations that use multiple regions and public cloud providers.

First, a little about how it works. Augtera is the first ML platform for networks, covering data center, private cloud, SD-WAN, WAN, hybrid cloud and public cloud. It automatically learns patterns in network telemetry and detects when a pattern becomes abnormal. The ML can be applied to many different metrics and distributions (e.g., traffic, syslog text, temperature). In this example, Augtera is measuring round-trip time (RTT) and packet loss using agents located across the public clouds. Each Augtera agent runs on a virtual machine in a cloud provider and sends synthetic probes to every other agent; the Augtera service consumes the resulting probe data.
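To make the full-mesh probing idea concrete, here is a minimal Python sketch of how probe results between agents could be reduced to a loss ratio and a mean RTT per sender/receiver pair. It is an illustration only, not Augtera's implementation: the agent names, the `ProbeResult` type, and the `mesh_pairs` and `summarize` helpers are all hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class ProbeResult:
    """One synthetic probe (hypothetical record layout)."""
    source: str
    destination: str
    rtt_ms: Optional[float]  # None means the probe was lost


def mesh_pairs(agents: List[str]) -> List[Tuple[str, str]]:
    """Every ordered (source, destination) pair, excluding self-probes."""
    return list(permutations(agents, 2))


def summarize(results: List[ProbeResult]) -> Tuple[float, Optional[float]]:
    """Loss ratio and mean RTT for one pair over one measurement interval."""
    sent = len(results)
    answered = [r.rtt_ms for r in results if r.rtt_ms is not None]
    loss_ratio = (sent - len(answered)) / sent if sent else 0.0
    mean_rtt = mean(answered) if answered else None
    return loss_ratio, mean_rtt


# Hypothetical agent names, one per cloud location.
agents = ["aws-us-west-2a", "azure-west-us-2", "gcp-us-east1"]
pairs = mesh_pairs(agents)
```

With N agents the mesh contains N × (N − 1) directed pairs, so each direction of a path is measured separately.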

Prior to the outages, Augtera was continually learning the normal pattern of loss and latency between the various cloud providers, regions, and availability zones. Figure 1 shows time series data of the loss ratio for several hours before and after the degradation in the AWS US-West-2 region (at present, Augtera agents are not deployed in AWS US-West-1). At approximately 3:15 UTC loss increased dramatically, and at approximately 4:00 UTC it mostly recovered. Notice that not all US-West-2 availability zones were impacted equally.

Augtera ML had already learned the normal loss between these two cloud regions and produced anomalies as soon as the pattern became abnormal, as seen in Figure 2. The pattern was very abnormal for this outage, but in many other cases the deviation is not as obvious. Being able to detect subtle but operationally relevant changes in loss or latency is often the difference between proactively addressing an application or network issue and reacting only after users report it.
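The general idea of learning a baseline and flagging deviations can be sketched with a simple robust statistic. The Python example below is not Augtera's ML; it swaps in a plain trailing-window median and median absolute deviation (MAD) check, and the function name, the `window`, `threshold`, and `floor` parameters, and the sample values are all assumptions made for illustration.

```python
from statistics import median
from typing import List


def detect_anomalies(series: List[float], window: int = 12,
                     threshold: float = 5.0, floor: float = 0.001) -> List[int]:
    """Return indices whose value deviates strongly from the trailing baseline.

    The baseline is the median of the previous `window` samples, and the
    deviation is scaled by the median absolute deviation (MAD) of that window.
    `floor` avoids dividing by zero when the history is perfectly flat.
    """
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        base = median(history)
        mad = median([abs(x - base) for x in history])
        scale = max(mad, floor)
        if abs(series[i] - base) / scale > threshold:
            anomalies.append(i)
    return anomalies


# Made-up loss ratios: a flat baseline followed by a sharp rise and a recovery.
loss_ratio = [0.001] * 12 + [0.30, 0.25, 0.002]
flagged = detect_anomalies(loss_ratio)
```

Because the baseline is a median, a few anomalous samples inside the window do not immediately shift it, so the recovery sample at the end is not flagged. A production system would also need to handle seasonality and slower drifts, which this sketch ignores.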

As mentioned earlier in the blog, Augtera had agents deployed across several public cloud providers and regions. In the heatmap, each column represents the probe data source (sender) and each row represents the probe data receiver (destination). A darker shade means more anomalies were detected in that time period. A few things stand out when looking at the heatmap:

  1. Probe data from GCP East was not impacted, indicating they are likely peering directly.
  2. All Microsoft Azure regions were impacted significantly.
  3. AWS-to-AWS sites were not as impacted. This might have thrown off some monitoring systems that organizations use.
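A heatmap like the one described above is essentially a count of anomalies per sender/receiver pair. As a rough Python sketch, assuming anomalies arrive as hypothetical `(source, destination)` tuples and using made-up site labels, the underlying matrix could be built like this, with columns as senders and rows as receivers:

```python
from collections import Counter
from typing import List, Tuple


def anomaly_matrix(events: List[Tuple[str, str]],
                   sites: List[str]) -> List[List[int]]:
    """Count anomalies per pair.

    Row index = receiver (destination), column index = sender (source),
    both following the order of `sites`.
    """
    counts = Counter(events)  # keyed by (source, destination)
    return [[counts[(src, dst)] for src in sites] for dst in sites]


# Hypothetical site labels and anomaly events for one time window.
sites = ["aws-west", "azure-west", "gcp-east"]
events = [
    ("azure-west", "aws-west"),
    ("azure-west", "aws-west"),
    ("aws-west", "azure-west"),
]
matrix = anomaly_matrix(events, sites)
```

Each cell can then be mapped to a color shade, with larger counts drawn darker.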

I hope you enjoyed this blog. The Augtera platform is available as SaaS or on-premises. The agent can be deployed in your network (cloud, data center, branch, etc.) for full end-to-end observability, and Augtera can also consume telemetry from your existing sources. To learn more about the Augtera platform and how machine learning can improve your network operations, click the contact us link.