This blog will describe a real example of how an organization uses Augtera machine learning to proactively detect environmental issues before they adversely impact service.
Machine learning and AI are not technologies that typically come to mind when you think of monitoring environmental conditions in a facility. They should be, however, and this blog will highlight why. Augtera is reinventing the way organizations operate their networks. Augtera machine learning and AI enable organizations to proactively identify conditions where failure may soon follow. In this blog, we examine a real-world example of how Augtera machine learning prevented a facility outage, and describe the shortcomings of traditional monitoring systems.
Like many operators, this one faces a challenge: each facility has different environmental and component characteristics, resulting in different metric distributions. This makes identifying problematic conditions difficult, because a static threshold that works for site A would not necessarily work for site B. Augtera solves this challenge by learning the pattern for each metric (e.g. temperature, fan speed) in each location, independently for each component. Figure 1 below shows temperature readings for the same device type (broadband router) in different facilities. Notice that it is not just the top-line temperature that is significantly different, but also how fast the temperature rises or falls at different times of the day. These are the unique patterns that Augtera learns.
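To make the site A versus site B problem concrete, here is a minimal sketch in Python. The readings, the `z_score` helper, and the 3-sigma limit are all hypothetical values chosen for illustration, and the per-site z-score is a simple stand-in for a learned baseline, not Augtera's actual method.

```python
from statistics import mean, stdev

# Hypothetical temperature readings (°F) for the same device type at two sites.
site_a = [68, 69, 70, 71, 70, 69, 68, 70]
site_b = [84, 86, 88, 90, 89, 87, 85, 86]

STATIC_THRESHOLD_F = 100  # one flat threshold shared by every site


def z_score(history, reading):
    """Number of standard deviations a reading sits from this site's own history."""
    return (reading - mean(history)) / stdev(history)


def static_alarm(reading):
    """Flat-threshold check: identical for every site."""
    return reading >= STATIC_THRESHOLD_F


def baseline_alarm(history, reading, limit=3.0):
    """Per-site check: flag readings far from what this site normally sees."""
    return abs(z_score(history, reading)) > limit


new_reading = 82
print("static threshold:", static_alarm(new_reading))
print("site A baseline: ", baseline_alarm(site_a, new_reading))
print("site B baseline: ", baseline_alarm(site_b, new_reading))
```

In this toy data, a reading of 82°F is well below the flat threshold and within site B's usual range, yet it is far outside anything site A has recorded. Only the per-site check flags it, and only for site A.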
Augtera Network AI builds a model for each metric on each object using dynamic online learning. Augtera is able to detect when there is an abnormal pattern change (i.e., an anomaly) and generate notifications to the operations teams. In the real example below, Augtera started recording anomaly notifications at 12:00, the first warning that something was not as it should be. Anomaly warnings continued as the facility got progressively warmer.
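The idea of one continuously updated model per metric per object can be sketched as follows. This is an illustrative assumption, not Augtera's algorithm: it keeps an exponentially weighted mean and variance for each (site, component, metric, hour-of-day) key, and the class names, `alpha`, `z_limit`, `warmup`, and `min_std` values are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class OnlineBaseline:
    """Exponentially weighted running mean and variance for one key."""
    mean: float = 0.0
    var: float = 0.0
    count: int = 0

    def update(self, x, alpha=0.1):
        if self.count == 0:
            self.mean = x
        else:
            delta = x - self.mean
            self.mean += alpha * delta
            self.var = (1 - alpha) * (self.var + alpha * delta * delta)
        self.count += 1


class AnomalyDetector:
    """One independent baseline per (site, component, metric, hour of day)."""

    def __init__(self, z_limit=4.0, warmup=5, min_std=1.0):
        self.models = defaultdict(OnlineBaseline)
        self.z_limit = z_limit
        self.warmup = warmup    # observations required before flagging
        self.min_std = min_std  # floor so a very flat history is not over-sensitive

    def observe(self, site, component, metric, hour, value):
        model = self.models[(site, component, metric, hour)]
        is_anomaly = False
        if model.count >= self.warmup:
            std = max(model.var ** 0.5, self.min_std)
            is_anomaly = abs(value - model.mean) / std > self.z_limit
        if not is_anomaly:
            # Learn only from readings that look normal, so a fault
            # does not get absorbed into the baseline.
            model.update(value)
        return is_anomaly
```

Each call to `observe` both scores the new reading against that key's history and, if the reading looks normal, folds it into the model. Keying on hour of day is one simple way to capture the daily rise and fall described above; a production system would likely use a richer seasonal model.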
Figure 2 shows temperature data in time blocks of 6 hours. If you looked only at that time series, it might not seem that there is a major pattern deviation or change. But when the full history of this metric for this component is taken into account, as Augtera ML does, we can see that the readings were significantly different. There was also an indication 2 days prior, when temperature readings fluctuated and then corrected.
When the operator received the Augtera notification, the NOC dispatched a technician to the location, as it is rural and unmanned. Upon arrival, the technician confirmed that the cooling system had malfunctioned and requested an AC technician. Before the AC technician arrived and eventually fixed the cooling system, the operator was able to stabilize the airflow using temporary cooling.
This particular operator has an NMS tool, which also reads temperature data for these devices. The existing tool, however, had a flat alarm threshold of 100°F set for all locations, chosen to reduce alert noise given the challenges described earlier. The tool would eventually have sent an alarm, but by then the facility would likely have experienced equipment shutdown or even damage due to overheating. By contrast, Augtera collects approximately 15K temperature readings per week for this organization, which equates to approximately 2500 predictions (what the expected temperature should be in a given time period). Of those predictions, .00004% are considered anomalies, which for this operator is about 3 alerts a week. Proactive notification from Augtera enabled the operator to prevent an outage by getting ahead of the cooling issue before it adversely impacted service.
Quote from the customer: “that’s awesome because there was something wrong and we didn’t alarm yet (with our existing NMS tool)”.