What is an Operationally Relevant Network Anomaly?

The Oxford Dictionary defines an anomaly as “a thing, situation, etc., that is different from what is normal or expected”.

To apply this definition to networks, we first need to identify the networking constructs for which knowing what is not normal or expected has business and operational significance. We then need to determine, on an ongoing basis, the normal behavior of these networking constructs and what deviates from it.

Networking Constructs – Infrastructure and Flows

Given the multi-layer, multi-domain, multi-cloud nature of modern networks, together with the increasing adoption of software-defined networking and unpredictable internet circuits, there is a wide range of networking constructs.

We define a networking construct as a physical or virtual network infrastructure element, or a flow. A construct can be as granular as possible, or it can represent an aggregation that is dynamically defined by an operator.

Each networking construct has live instances. Consider the two simplest networking constructs that an operator might be interested in for one device: (1) aggregate traffic across all interfaces on that device, and (2) traffic on each interface of that device. Case (1) has one instance, while case (2) has as many instances as the device has interfaces.
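The relationship between a construct and its instances can be sketched in a few lines of Python. This is only an illustration under assumed names: the Instance class, the construct names and the device "router-1" with its interfaces are hypothetical and not taken from any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    construct: str  # hypothetical construct name
    key: str        # identifies one live instance of that construct


def enumerate_instances(device, interfaces):
    """Return the live instances of the two constructs for one device."""
    # Construct (1): aggregate traffic on the device, always one instance.
    aggregate = [Instance("device_aggregate_traffic", device)]
    # Construct (2): per-interface traffic, one instance per interface.
    per_interface = [
        Instance("interface_traffic", f"{device}:{name}") for name in interfaces
    ]
    return {
        "device_aggregate_traffic": aggregate,
        "interface_traffic": per_interface,
    }


instances = enumerate_instances("router-1", ["eth0", "eth1", "eth2"])
for construct, members in instances.items():
    print(construct, len(members))
```

With three interfaces, the first construct has one instance and the second has three.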

Network infrastructure constructs include switches, routers, firewalls, load balancers, servers, VMs, containers, hardware components, physical and virtual interfaces, data center interconnects, hybrid cloud interconnects, SD-WAN tunnels, cloud virtual network components, and so on. Some of these constructs may be more important to network operations teams than others. Instances of some of these constructs are readily discoverable from network protocols, while others require a fairly flexible operator definition or intent. For example, an operator may define a networking construct for the aggregate traffic exiting a data center.

Flow constructs include individual flows as well as flexible, operator-defined aggregations, such as the set of flows that map to an application, all flows on each VM or DMZ, all flows to a specific port, or all flows with TCP retransmits.
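One way to picture an operator-defined flow aggregation is as a function that maps each flow to the instance it belongs to, or to nothing if the flow is out of scope. The sketch below assumes a simplified, hypothetical Flow record; real flow records carry many more fields.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    # Hypothetical, simplified flow record used only for illustration.
    src_vm: str
    dst_port: int
    byte_count: int
    tcp_retransmits: int


def aggregate(flows, key_fn):
    """Sum bytes per instance; key_fn returns an instance key or None to skip."""
    totals = defaultdict(int)
    for flow in flows:
        key = key_fn(flow)
        if key is not None:
            totals[key] += flow.byte_count
    return dict(totals)


flows = [
    Flow("vm-a", 443, 1200, 0),
    Flow("vm-a", 53, 80, 0),
    Flow("vm-b", 443, 900, 3),
]

# All flows on each VM: one instance per VM.
per_vm = aggregate(flows, lambda f: f.src_vm)
# All flows to a specific port: a single instance.
to_443 = aggregate(flows, lambda f: "port-443" if f.dst_port == 443 else None)
# All flows with TCP retransmits: a single instance.
with_retx = aggregate(
    flows, lambda f: "retransmits" if f.tcp_retransmits > 0 else None
)

print(per_vm)     # {'vm-a': 1280, 'vm-b': 900}
print(to_443)     # {'port-443': 2100}
print(with_retx)  # {'retransmits': 900}
```

Each key function defines one construct, and each distinct key it returns is a live instance of that construct.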

The good news is that instances of these networking constructs expose a wide range of metrics and events from which their normal behavior can be determined. These include environmental, data plane, hardware, control plane, packet and system metrics, as well as network events, alarms and logs, learned via a very wide range of standards-based and vendor-proprietary protocols.

It is important to note that different instances of the same type of networking construct (e.g., an interface or a hybrid cloud interconnect) may have different normal behavior.
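A small numeric sketch makes this concrete. It scores the same reading against the history of two interfaces using a plain z-score, which is chosen here only as a simple stand-in for "distance from normal" and is not the learning method this post describes. The interface names and traffic values are made up.

```python
from statistics import mean, stdev


def deviation_score(history, value):
    """Number of standard deviations between value and the instance's history."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return 0.0 if value == mu else float("inf")
    return abs(value - mu) / sigma


# Hypothetical traffic history (Mbps) for two instances of the same construct.
history = {
    "core-uplink": [900, 950, 920, 940, 930],
    "mgmt-port": [5, 6, 5, 7, 6],
}

observed = 930  # the same reading seen on both interfaces
scores = {name: deviation_score(h, observed) for name, h in history.items()}

for name, score in scores.items():
    print(f"{name}: {score:.1f} standard deviations from its own normal")
```

The reading is well within the usual range for the core uplink but far outside it for the management port, so normal behavior has to be tracked per instance rather than per construct type.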

Hence, a Network Anomaly refers to what is not normal or expected on a specific instance of a networking construct.

Machine Learning Based Network Anomaly

Determining the normal behavior of the instances of these networking constructs requires learning from a very wide range of distributions, taking into account time of day and seasonality, and looking at multiple metrics and events together. Operators cannot be expected to define what is not normal: doing so places a significant burden on them, and in many cases they may not know what is normal. In addition, whatever is reported as not normal or expected needs to be high fidelity, actionable and usable by operations teams.
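As a minimal sketch of why time of day matters, the example below learns a separate mean and standard deviation for each hour from past samples and then judges a new value against the baseline for its hour. The per-hour mean and deviation model, the factor k and the sample values are all assumptions chosen for illustration; they are far simpler than the multi-metric learning described above.

```python
from collections import defaultdict
from statistics import mean, stdev


def learn_hourly_baseline(samples):
    """samples: iterable of (hour_of_day, value). Returns {hour: (mean, stdev)}."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    # Only hours with at least two samples have a defined standard deviation.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_unusual(baseline, hour, value, k=3.0):
    """True if value is more than k standard deviations from that hour's mean."""
    mu, sigma = baseline[hour]
    return abs(value - mu) > k * max(sigma, 1e-9)


# Hypothetical traffic samples (Mbps) for one instance at 03:00 and 14:00.
samples = [(3, 10), (3, 12), (3, 11), (14, 800), (14, 820), (14, 790)]
baseline = learn_hourly_baseline(samples)

print(is_unusual(baseline, 14, 810))  # False: typical for mid-afternoon
print(is_unusual(baseline, 3, 810))   # True: far above the 03:00 baseline
```

The same value is ordinary at one hour and unexpected at another, which a single static threshold on this instance could not express.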

Threshold-based or rule-based techniques fall well short of the above goals. Machine Learning could be used if the challenge of autonomous training at scale is overcome. In fact, several Machine Learning techniques are not a good fit either, as they are not autonomous and impose a significant operator tax.

However, Machine Learning as a discipline is very well suited to be a key building block in addressing the above goals, as it enables learning from patterns at scale.

These observations lead us to the following definition of a network anomaly. 

A Network Anomaly refers to an actionable, unexpected or abnormal behavior on a specific instance of a networking construct, where the normal behavior is autonomously learned using Machine Learning.

In future posts, we will continue to elaborate on the concepts introduced in this post. To learn more about Augtera’s innovation and use cases around AI-based detection of operationally relevant network anomalies, please contact us.