In the last few weeks I have been astounded by the number of Augtera enterprise customers who are in the process of rolling out data center clusters for generative AI workloads. The hype around enterprises using Large Language Models (LLMs) on premises to build conversational applications on their proprietary data has a kernel of reality.
Some enterprises are building these clusters using InfiniBand. Others are using Ethernet. This post will not discuss the pros and cons of the two approaches; suffice it to say that there are compelling reasons for using Ethernet. However, as enterprises adopt Ethernet they need to do so with their eyes open and understand the pitfalls and operational gaps they will have to address. The cost of unpredictable performance, congestion, slow reaction to failures, traffic polarization, and lack of visibility can be very high in AI workload environments built from expensive components.
This is because AI workload applications that employ LLMs have tight coupling between the GPUs. The vast majority of the GPUs being deployed rely on Remote Direct Memory Access (RDMA), which, when Ethernet is the fabric, runs over the RDMA over Converged Ethernet (RoCE) protocol. Latency caused by queueing delays or dropped packets can adversely affect the completion times of LLM jobs.
Below we discuss some of the key operational challenges that operators need to pay attention to while deploying Ethernet clusters for AI workloads.
End-to-End Visibility
End-to-end visibility is always desirable in data centers. It is even more so for LLM workloads, where the performance of the network fabric, GPUs, and DPUs that comprise the complete system is tightly coupled. Network failures and congestion can increase the long tail latency of GPU-to-GPU flows; a misbehaving DPU or a software issue on a GPU can have a ripple effect on the entire cluster; misbehaving optics on a NIC can slow down the entire cluster. As an operator you not only need visibility into the metrics, logs, flows, and topology of these components, but you also need to proactively and quickly identify misbehaviors and understand the correlations between them, for troubleshooting, for establishing “mean time to innocence”, and for validation.
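To make the correlation point concrete, here is a minimal sketch of joining a component-level symptom with topology to scope its blast radius. The topology, device names, and symptom feed are all hypothetical; the point is that metrics, logs, and flows only become actionable once they are tied together.

```python
# Minimal sketch: correlate a symptom from optics telemetry with topology to
# identify which GPU servers (and therefore which GPU-to-GPU flows) it can affect.
# All names and values below are hypothetical, for illustration only.

# Hypothetical topology: which GPU servers sit behind which leaf switch.
leaf_to_servers = {
    "leaf-01": ["gpu-srv-01", "gpu-srv-02"],
    "leaf-02": ["gpu-srv-03", "gpu-srv-04"],
}

# Hypothetical symptom: degraded receive power on a leaf uplink's optics.
symptom = {"device": "leaf-02", "interface": "Ethernet12",
           "metric": "rx_power_dbm", "value": -14.2}

affected_servers = leaf_to_servers.get(symptom["device"], [])
print(f"Degraded optics on {symptom['device']}/{symptom['interface']} "
      f"may slow GPU-to-GPU flows involving: {', '.join(affected_servers)}")
```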
Flow Completion Times
It is critical to measure the completion times of GPU-to-GPU flows, to identify outliers, and to do so on an ongoing, “always on” basis. These completion times determine the overall performance of the cluster. Flow completion time is the most important metric for measuring long tail latency, for assessing the impact of tuning any component of the cluster, and for benchmarking the performance of different vendor solutions.
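As a rough illustration, the sketch below summarizes a batch of flow completion times and flags long-tail outliers. It assumes the completion times (in milliseconds) have already been collected from flow telemetry; the data and the outlier threshold are purely illustrative.

```python
# Minimal sketch: summarize GPU-to-GPU flow completion times (FCTs) and flag
# long-tail outliers. Input data is hypothetical/synthetic.
import statistics


def percentile(samples, p):
    """Nearest-rank percentile; p in [0, 100]."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]


def summarize_fct(fct_ms):
    p50 = percentile(fct_ms, 50)
    # A simple, illustrative definition of a long-tail outlier: a flow that
    # takes more than five times the median to complete.
    outliers = [t for t in fct_ms if t > 5 * p50]
    return {
        "p50_ms": p50,
        "p99_ms": percentile(fct_ms, 99),
        "p99.9_ms": percentile(fct_ms, 99.9),
        "mean_ms": statistics.mean(fct_ms),
        "outlier_count": len(outliers),
    }


# Synthetic example: most flows finish quickly, a few are stragglers.
fct_samples = [12.0] * 950 + [14.0] * 40 + [95.0] * 10
print(summarize_fct(fct_samples))
```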
Traffic Polarization
Traffic in AI workload environments is dominated by a relatively small number of elephant flows. The performance of the network fabric depends on how fairly these flows are load balanced across the available links. Polarization of traffic onto a few links can cause congestion and significantly reduce the performance of the cluster. Vendors have come up with various per-packet load balancing solutions to reduce traffic polarization. Ongoing measurement of link utilization and real-time detection of unfair load balancing are imperative to benchmark and measure the fairness of load balancing.
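One simple way to quantify fairness is Jain's fairness index over link utilizations, sketched below. The utilization numbers are hypothetical; in practice they would come from interface counters polled across the fabric.

```python
# Minimal sketch: quantify how evenly traffic is spread across a group of
# links (e.g., the uplinks of a leaf) using Jain's fairness index.

def jains_fairness(utilizations):
    """Returns a value in (0, 1]; 1.0 means perfectly even load balancing."""
    n = len(utilizations)
    total = sum(utilizations)
    if n == 0 or total == 0:
        return 1.0
    return total ** 2 / (n * sum(u * u for u in utilizations))


balanced = [48, 52, 50, 49]   # hypothetical: traffic spread evenly across 4 uplinks
polarized = [95, 88, 5, 8]    # hypothetical: most traffic pinned to 2 of the 4 uplinks

print(f"balanced fairness:  {jains_fairness(balanced):.2f}")   # ~1.00
print(f"polarized fairness: {jains_fairness(polarized):.2f}")  # ~0.57
```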
Predictable Performance
AI workloads perform best when cluster health and performance are predictable. This makes it even more important for operators to identify and mitigate data plane, control plane, software, and system issues and grey failures as quickly and proactively as possible, and to leverage automation where they can. This is one area where Network AI is a very good fit: it can dramatically improve the predictability and operation of the network fabric and the AI cluster by applying AIOps to metric, log, topology, and flow data. In addition, operators need to think about how to benchmark the end-to-end latency of the cluster on an ongoing basis and whether synthetic probes can help.
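For a sense of what a synthetic probe can look like, here is a minimal sketch that sends small UDP probes to an echo responder on a remote node and records round-trip times. The responder address and port are hypothetical, and a real deployment would probe along the same paths and QoS class that the RDMA traffic uses.

```python
# Minimal sketch of a synthetic latency probe. Assumes a UDP echo responder
# is running on the target node (hypothetical address/port).
import socket
import time


def probe_rtt(target, port=9999, count=10, timeout_s=1.0):
    rtts_ms = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        for seq in range(count):
            payload = f"probe-{seq}".encode()
            start = time.perf_counter()
            sock.sendto(payload, (target, port))
            try:
                sock.recvfrom(2048)
            except socket.timeout:
                continue  # treat as loss; no RTT sample recorded
            rtts_ms.append((time.perf_counter() - start) * 1000.0)
    return rtts_ms


# Hypothetical usage: rtts = probe_rtt("10.0.0.42")
```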
Congestion, Long Tail Latency and Utilization
There is an interplay between congestion, the long tail latency of flows, and the utilization of the network fabric. Solutions from various vendors address congestion, in particular incast congestion, in different ways. Explicit Congestion Notification (ECN) as used by RoCE congestion control, Priority-based Flow Control (PFC), and vendor-specific in-band telemetry approaches are some of the mechanisms that mitigate congestion. Irrespective of the congestion avoidance mechanisms used, operators need visibility into queue utilization across the fabric, per-hop latency, when congestion mitigation is triggered, and how congestion mitigation impacts flow latency and fabric utilization. This visibility is also needed to optimize the congestion mitigation configuration across the cluster.
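As a rough example of turning raw counters into congestion signals, the sketch below derives an ECN marking ratio and a PFC pause share from two successive per-port counter snapshots. The counter names and values are hypothetical; in practice they would be polled via SNMP, gNMI, or vendor streaming telemetry.

```python
# Minimal sketch: derive congestion signals from two successive per-port
# counter snapshots taken one polling interval apart. Counter names and
# values are hypothetical.

def congestion_signals(prev, curr, interval_s):
    """Compute ECN mark ratio and PFC pause share over one polling interval."""
    pkts = curr["tx_packets"] - prev["tx_packets"]
    ecn_marked = curr["ecn_marked_packets"] - prev["ecn_marked_packets"]
    pause_us = curr["pfc_pause_duration_us"] - prev["pfc_pause_duration_us"]
    return {
        "ecn_mark_ratio": ecn_marked / pkts if pkts else 0.0,
        "pfc_pause_share": (pause_us / 1e6) / interval_s,  # fraction of interval paused
    }


prev = {"tx_packets": 1_000_000, "ecn_marked_packets": 2_000, "pfc_pause_duration_us": 0}
curr = {"tx_packets": 1_900_000, "ecn_marked_packets": 47_000, "pfc_pause_duration_us": 180_000}
print(congestion_signals(prev, curr, interval_s=10))
# e.g. {'ecn_mark_ratio': 0.05, 'pfc_pause_share': 0.018}
```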
The Augtera Network AI platform already addresses a number of the challenges described in this post, and we are working on solutions for others. If you would like to discuss these challenges and potential solutions further, we would love to hear from you.