
Tackling data center latency for AI: Why intra-DC microseconds matter as much as WAN milliseconds

Nishant Nishant
Solving data center latency is crucial for optimizing large-scale AI training and inference.

KEY TAKEAWAYS:
  • WAN latency isn’t the only bottleneck
  • Intra-data-center latency drives AI efficiency
  • Distributed AI amplifies microsecond delays

When people in information technology talk about latency, their minds usually jump straight to the wide area network (WAN), the internet, Multiprotocol Label Switching (MPLS), long haul fiber, and the “ping time” between users and cloud regions.

In many traditional web workloads, it is the obvious bottleneck. In artificial intelligence (AI)-heavy environments that rely on massive parallelism, though, that focus is only half the story. Inside the data center itself, latency in the microsecond to millisecond range can be just as critical, because it directly governs how effectively expensive accelerators and distributed services can work together.

Modern AI systems are deeply distributed. A single inference call or training step is no longer a simple function running on one server; it is a coordinated exchange between graphics processing units (GPUs), central processing units (CPUs), high-performance storage, parameter servers, feature stores and microservices. These elements may sit on different racks and traverse multiple switches and links to communicate. In that world, intra-data-center latency—how long it takes packets to move between components and how predictable that time is—has a direct impact on throughput, job completion time and ultimately cost.

Understanding WAN vs. data center latency in AI workloads

It helps to separate latency into two domains.

The first is the WAN path between the end user and the data center. That latency is shaped by distance, routing and congestion on public networks. You can mitigate it with better peering, regional deployment and edge caching, but there is a hard floor set by physics; you cannot make light travel faster across continents.

The second domain is the latency inside the data center, the sequence of hops between servers, racks and storage systems; the queuing delays at switches and network interface cards (NICs); and the software overhead in the host stack. This latency is usually orders of magnitude smaller, microseconds to low milliseconds, but it sits in the critical inner loop of AI workloads where it plays a disproportionate role.

Latency budgets for real-time and near-real-time AI applications

For real-time and near-real-time AI applications, this dual domain perspective is essential. Suppose you have a 200-millisecond end-to-end response time budget for an interactive large language model, recommendation engine or fraud detection service. WAN and access networks might realistically consume 60 to 120 milliseconds of that budget by the time a request reaches your front-door load-balancer and the response is sent back. What remains must cover everything else: request parsing, microservice calls, feature retrieval, model inference and post processing. In that context, shaving even a few milliseconds from east–west traffic and eliminating worst-case outliers inside the data center can be the difference between consistently hitting service level objectives (SLOs) and having to over-provision hardware to cope with variability.
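The arithmetic behind that budget is simple enough to sketch. A toy calculation, using the illustrative figures above rather than measurements:

```python
# Rough end-to-end latency budget for an interactive AI service,
# using the illustrative figures from the text (not measurements).

TOTAL_BUDGET_MS = 200     # end-to-end response-time SLO
WAN_RANGE_MS = (60, 120)  # plausible WAN + access-network consumption

remaining = {wan: TOTAL_BUDGET_MS - wan for wan in WAN_RANGE_MS}
for wan, left in remaining.items():
    print(f"WAN takes {wan} ms -> {left} ms left for parsing, "
          "microservice calls, feature retrieval and inference")
```

On a bad WAN day, less than half the budget remains for everything inside the data center, which is why internal milliseconds matter.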

The effect is even stronger for large-scale training and tightly coupled inference. Distributed training frameworks rely on repeated synchronization steps, such as all-reduce operations across many GPUs, to keep model parameters consistent. The time for each synchronization is gated by the slowest path in the network. If tail latency for a small fraction of packets increases, overall step time increases, and when that cost is multiplied across millions of steps, training jobs can run significantly longer. In practical terms, intra-data-center latency translates directly into GPU utilization and hence into cost, which is why operators increasingly treat internal latency as a first-class design constraint.
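A toy simulation makes the tail-latency effect concrete. All latency figures and probabilities below are assumptions for illustration only:

```python
import random

# Toy model: each training step waits for the slowest of N parallel
# transfers, so rare slow packets set the pace for the whole step.
random.seed(0)

N_WORKERS = 512
STEPS = 1000
BASE_US = 50        # typical transfer latency in microseconds (assumed)
TAIL_US = 2000      # occasional congested-path latency (assumed)
TAIL_PROB = 0.001   # 0.1% of transfers hit the slow path (assumed)

def step_time():
    # the all-reduce completes only when the slowest worker finishes
    return max(
        TAIL_US if random.random() < TAIL_PROB else BASE_US
        for _ in range(N_WORKERS)
    )

times = [step_time() for _ in range(STEPS)]
avg = sum(times) / len(times)
print(f"mean step time: {avg:.0f} us (ideal would be {BASE_US} us)")
```

Even though only 0.1 percent of transfers are slow, with 512 workers a large fraction of steps contain at least one straggler, so the mean step time lands far above the ideal.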

Typical latency ranges in AI data centers and WAN networks

To anchor the discussion, it is useful to examine the different latency regimes in an AI system and what each affects.

Table 1: Latency domains in AI systems and their primary impact

Domain | Typical latency range (one way or round trip time) | Main causes | Primary impact
User to cloud over WAN | 50-150 ms round trip time | Geographic distance, public internet routing, congestion | End-user responsiveness and perceived "snappiness"
Data center fabric (rack to rack) | 5-200 microseconds one way | Switch hops, serialization delay, queuing in Clos or fat-tree fabrics | GPU and CPU coordination, training step time
Within a rack (server to server) | 1-10 microseconds one way | Top-of-rack switching, short copper or optical links | Microservice calls, parameter server access
GPU-local interconnect | Less than 1-3 microseconds one way | On-board or in-chassis links, dedicated GPU fabrics | Tightly coupled tensor operations and model parallelism

This is why ‘small’ data center latency numbers loom large.

How data center network topology affects AI latency and performance

One of the most powerful levers is the physical network topology. Traditional three-tier architectures, with access, aggregation and core layers, introduce more hops and oversubscription than is ideal for large AI clusters. In response, operators are moving to high-radix, low-diameter topologies such as fat trees and Clos fabrics that minimize the number of switch hops between any two endpoints. The goal is short, predictable paths so both average and tail latency stay low.

At the link level, 100 gigabit per second (100G) links are giving way to 400G and 800G, with work underway on even higher speed Ethernet and purpose-built AI fabrics. Optical links are being pushed deeper into racks to reduce signal integrity issues and retransmissions. These upgrades do more than add bandwidth; they also reduce serialization delay and help queues drain quickly. Within racks, high bandwidth, low latency interconnects such as NVIDIA NVLink shorten GPU-to-GPU paths for tightly coupled operations and offload traffic from the general data center network.
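The serialization-delay point is easy to quantify: the time to clock one frame's bits onto the wire is simply frame size divided by link rate. A back-of-envelope sketch, assuming a jumbo frame for illustration:

```python
# Serialization delay = frame_bits / link_rate.
# Faster links put the same frame on the wire sooner, so queues drain
# more quickly even when total traffic volume is unchanged.

FRAME_BYTES = 9000  # a jumbo Ethernet frame (illustrative choice)

delay_ns = {}
for gbps in (100, 400, 800):
    delay_ns[gbps] = FRAME_BYTES * 8 / (gbps * 1e9) * 1e9
    print(f"{gbps}G link: {delay_ns[gbps]:.0f} ns "
          f"per {FRAME_BYTES}-byte frame")
```

Moving from 100G to 800G cuts per-frame serialization delay eightfold, which compounds across every hop a packet traverses.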

Reducing AI latency with RDMA, RoCE, SmartNICs and DPUs

On top of the physical fabric, transport protocols and network interface card (NIC) design strongly influence latency. Traditional Transmission Control Protocol and Internet Protocol (TCP/IP) stacks involve multiple layers of software processing, interrupt handling, context switches and kernel networking code, each adding microseconds that become noticeable when millions of messages are involved.

To cut this overhead, many AI data centers use remote direct memory access (RDMA)-based approaches such as RDMA over Converged Ethernet (RoCE) and InfiniBand. These allow data to move directly between memory regions on different hosts, bypassing much of the CPU and kernel stack and yielding much lower and more stable latency. Smart network interface cards (SmartNICs) and data processing units (DPUs) push this further by offloading networking, encryption and storage protocols onto dedicated hardware. That frees CPUs from per-packet work and reduces jitter caused by variable host load, which is crucial for controlling tail latency.
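A first-order model shows why per-message overhead matters at AI message volumes. The overhead figures below are assumptions chosen for illustration, not benchmarks of any particular stack:

```python
# First-order model of per-message host overhead (assumed figures):
# a kernel TCP path pays interrupt/context-switch costs per message
# that a kernel-bypass RDMA path largely avoids.

KERNEL_OVERHEAD_US = 15.0  # assumed per-message kernel-stack cost
RDMA_OVERHEAD_US = 1.5     # assumed per-message kernel-bypass cost
MESSAGES = 1_000_000       # messages in one workload run (assumed)

tcp_total_s = MESSAGES * KERNEL_OVERHEAD_US / 1e6
rdma_total_s = MESSAGES * RDMA_OVERHEAD_US / 1e6
print(f"kernel stack: {tcp_total_s:.1f} s of cumulative overhead")
print(f"RDMA path:    {rdma_total_s:.1f} s of cumulative overhead")
```

Microseconds per message that look negligible in isolation add up to whole seconds of accelerator idle time once millions of messages are in flight.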

Specialized GPU-aware communication libraries also help by minimizing software overhead in collective operations and point-to-point messaging. They are tuned to the underlying fabric and often schedule communication patterns to respect topology, reducing cross rack and cross-tier traffic where possible.

Managing congestion, QoS and traffic engineering for AI fabrics

Even with a good topology and fast links, contention for shared resources can introduce queuing delay. AI workloads tend to mix large “elephant” flows for parameter synchronization with many small “mice” flows, such as control messages and remote procedure calls (RPCs). If they blindly share queues, mice can sit behind elephants and see inflated latency.

Operators address this by separating traffic into classes, applying quality-of-service (QoS) policies that reserve capacity for latency-sensitive traffic, and using congestion-control algorithms tuned for AI patterns rather than generic enterprise workloads. Traffic engineering, such as spreading hot flows across paths and keeping training jobs within well-connected pods, further reduces hotspots and long-tail behavior.
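The core idea of separating mice from elephants can be sketched as strict-priority queueing. A minimal illustration using Python's heap as the queue (flow names are hypothetical):

```python
import heapq

# Sketch of strict-priority queueing: small latency-sensitive "mice"
# (priority 0) are dequeued ahead of bulk "elephant" transfers
# (priority 1), so they never wait behind large flows.

queue = []
seq = 0  # tie-breaker keeps FIFO order within a priority class

def enqueue(priority, flow):
    global seq
    heapq.heappush(queue, (priority, seq, flow))
    seq += 1

enqueue(1, "elephant: 1 GB parameter sync")
enqueue(0, "mouse: RPC control message")
enqueue(1, "elephant: checkpoint write")
enqueue(0, "mouse: health probe")

order = [heapq.heappop(queue)[2] for _ in range(len(queue))]
print(order)  # mice drain first, then elephants in arrival order
```

Real switch schedulers implement this in hardware with multiple traffic classes, but the ordering principle is the same.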

Mitigation for data center latency takes several forms.

Table 2: Common intra-data-center latency problems and mitigation strategies

Latency issue | Typical symptoms | Example mitigation
Tail latency from incast and congestion | Training runs finish late; GPUs idle waiting for synchronization | Congestion-aware routing, explicit congestion notification (ECN), traffic classes, priority queues
Host networking stack overhead on CPUs | High p99 latency under load; variable response times | Remote direct memory access (RDMA), RoCE, SmartNICs and data processing units (DPUs)
Topology and placement imbalance | Some jobs see much higher latency than others; "noisy" racks | Topology-aware schedulers; placing tightly coupled GPUs and services in the same pod or rack
Deep microservice call graphs and excessive hops | Large gap between median and p99 latency for user requests | Flattening call graphs, co-locating services, batching and parallelizing RPCs
Shared fabrics with mixed, non-AI background traffic | Latency spikes during backup or batch windows | Dedicated AI fabrics or virtual network slices; strict quality of service (QoS) policies

AI workload placement, scheduling and microservice design for low latency

Where workloads run and how they interact also shape latency. If GPUs in a tightly coupled job are scattered across distant parts of the fabric, each step pays avoidable delay. If they are collocated in the same rack or pod, and their supporting services are placed nearby, hop counts and contention fall. Modern schedulers and orchestrators therefore factor network topology into placement instead of treating the data center as a flat pool of resources.
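Topology-aware placement can be sketched as a scoring problem: among candidate slots, prefer the one that minimizes the worst pairwise hop distance within a job. The rack names and hop counts below are hypothetical:

```python
import itertools

# Toy topology-aware placement: choose the candidate rack assignment
# that minimizes the worst pairwise hop distance for a two-worker job.
# Hop counts between racks are illustrative assumptions.
HOPS = {("r1", "r1"): 1, ("r1", "r2"): 3, ("r1", "r3"): 5,
        ("r2", "r2"): 1, ("r2", "r3"): 3, ("r3", "r3"): 1}

def dist(a, b):
    return HOPS.get((a, b)) or HOPS[(b, a)]

def worst_hops(racks):
    # the slowest pair gates synchronous communication
    return max(dist(a, b) for a, b in itertools.combinations(racks, 2))

candidates = [("r1", "r1"), ("r1", "r3"), ("r2", "r3")]
best = min(candidates, key=worst_hops)
print(best)  # the same-rack placement wins
```

Production schedulers score far larger jobs against live fabric state, but the principle of minimizing worst-case path length is the same.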

At the software level, designers tune batching, concurrency and caching. Large batches can improve throughput but may increase synchronization cost; small batches can reduce queuing delay but underutilize GPUs. The “right” answer depends on the actual latency distribution inside the cluster. Caching frequently accessed features or embeddings close to the compute layer removes cross-cluster round trips and shrinks the critical path. Re-architecting deep microservice call graphs, flattening them, parallelizing independent calls and collocating chatty services can significantly cut 99th-percentile (p99) latency without hardware changes.
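The batching trade-off described above can be modeled in a few lines. All timing constants are illustrative assumptions, not measurements of any real system:

```python
# Toy model of the batching trade-off: larger batches amortize fixed
# per-call overhead (higher throughput) but each request also waits
# for the batch to fill (higher latency).

FIXED_US = 500        # fixed per-inference-call overhead (assumed)
PER_ITEM_US = 50      # marginal compute per request (assumed)
ARRIVAL_GAP_US = 100  # mean gap between arriving requests (assumed)

results = []
for batch in (1, 8, 32):
    compute = FIXED_US + batch * PER_ITEM_US
    waiting = (batch - 1) * ARRIVAL_GAP_US  # time to fill the batch
    latency_us = waiting + compute
    throughput = batch / compute * 1e6      # requests per second
    results.append((batch, latency_us, throughput))
    print(f"batch={batch:2d}: ~{latency_us} us latency, "
          f"~{throughput:.0f} req/s")
```

Under these assumptions both latency and throughput rise with batch size, which is exactly why the "right" batch depends on the cluster's measured latency distribution and the service's SLO.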

Monitoring p99 tail latency and improving GPU utilization in AI clusters

None of these techniques works well without observability. AI-driven data centers invest in fine-grained telemetry that records full latency distributions, not just averages, across the fabric. Per-hop and per-queue metrics, packet traces and real-time alerts for outliers allow operators to find and fix problems before they derail training jobs or user-facing SLOs.

Special attention is paid to tail latency, such as the 95th percentile (p95), 99th percentile (p99) or 99.9th percentile (p999). In distributed AI workloads, these outliers, not the mean, define the step time, because the system must wait for the slowest participant. Engineering teams therefore focus on eliminating sources of variability: overloaded hosts, noisy neighbors, misconfigured queues, and microbursts on the network.
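Computing those percentiles from raw samples is straightforward; the synthetic latency distribution below (mostly fast, with rare spikes) is an assumption for illustration:

```python
import random

# Tail percentiles from a latency sample: in a synchronous step, the
# slowest participant (the tail) sets the pace, so p99/p999 matter
# far more than the mean.
random.seed(1)

# synthetic one-way latencies in microseconds: 99% tightly clustered,
# 1% congestion spikes (distribution is an illustrative assumption)
samples = sorted(
    random.gauss(50, 5) if random.random() < 0.99
    else random.uniform(500, 2000)
    for _ in range(10_000)
)

def pct(p):
    return samples[min(len(samples) - 1, int(p / 100 * len(samples)))]

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {pct(p):.0f} us")
```

The median barely moves while p999 lands orders of magnitude higher, which is why dashboards that track only averages hide exactly the behavior that stalls distributed steps.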

The business implications are clear. Many organizations find that GPU utilization is well below expectations, with a significant portion of the gap traced to network and storage bottlenecks that leave accelerators idle. In effect, the network imposes a “tax” on GPU investments: you pay for peak-spec hardware but cannot keep it busy. By tightening intra-data-center latency and reducing tail behavior, operators report higher effective utilization, lower training costs, and faster time to model.

Turning low-latency AI infrastructure into a competitive advantage

Stepping back, WAN latency and data center latency play different roles in the end-to-end story. WAN latency sets the baseline for how quickly a user can reach your service; you can optimize it only so far before physics dominates. Intra-data-center latency is where you still have significant architectural control and where each saved millisecond is amplified through better GPU utilization, higher throughput and more predictable real-time behavior.

Treating internal latency as a core design parameter, rather than an implementation detail, is how AI-driven organizations turn their infrastructure into a competitive advantage.
