Article · May 18, 2026 · 4 min read

OpenAI’s MRC spec makes AI networking a platform fight

OpenAI and major chip partners released MRC, an Ethernet-based protocol meant to keep giant AI training clusters running when networks fail.

Wezebo
OpenAI has published a new AI supercomputer networking specification called Multipath Reliable Connection, or MRC. The pitch is simple: frontier-model training is now so large that network failures are no longer edge cases. They are part of the job.

The spec was developed with AMD, Broadcom, Intel, Microsoft and Nvidia, then released through the Open Compute Project. OpenAI says MRC is already running across its largest Nvidia GB200 supercomputers and has been used in training work for recent ChatGPT and Codex models.

The bottleneck moved below the model

The most interesting part is not that MRC exists. It is that OpenAI is treating the network as a first-class piece of the AI stack.

Modern training jobs spread one model across thousands of GPUs. Those GPUs often move in lockstep, which means one slow transfer can stall the whole run. OpenAI says a training step can involve millions of data transfers, and that routine issues such as link flaps, congestion, switch failures and packet loss become much more damaging at that scale.
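
To see why lockstep makes rare glitches matter, here is a minimal sketch. It assumes every transfer in a step is independent and has the same small chance of being slow; the one-in-a-million figure is purely illustrative and does not come from OpenAI.

```python
def step_time(transfer_times_ms):
    """A synchronous step finishes only when its slowest transfer finishes."""
    return max(transfer_times_ms)


def prob_any_slow(num_transfers, slow_prob):
    """Chance that at least one of num_transfers independent transfers is slow.

    slow_prob is a hypothetical per-transfer probability, not a measured value.
    """
    return 1.0 - (1.0 - slow_prob) ** num_transfers


# One slow transfer sets the pace for the whole step.
print(step_time([1.0, 1.0, 50.0]))  # 50.0

# With a million transfers per step, a one-in-a-million hiccup
# shows up in roughly 63% of steps.
print(round(prob_any_slow(1_000_000, 1e-6), 3))  # 0.632
```

The point of the toy model is the shape of the curve, not the exact numbers: as transfers per step grow, "rare" stops meaning "unlikely to hit this step."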

MRC tries to reduce that fragility by spreading a single transfer across many network paths, routing around failures quickly and simplifying the control plane. OpenAI says a conventional fabric can take seconds or tens of seconds to stabilize after some failures. With MRC, the company says, traffic can be routed around failures in microseconds.
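
The multipath idea can be illustrated with a toy model. The sketch below assigns the packets of one transfer round-robin across whichever paths are currently healthy. The round-robin policy, the function name `spray` and the path labels are assumptions made for illustration; the article does not describe how MRC actually selects paths.

```python
def spray(num_packets, paths, failed=frozenset()):
    """Assign packet sequence numbers round-robin to healthy paths.

    Returns a dict mapping each healthy path to the packets it carries.
    Round-robin is a stand-in policy, not MRC's documented behavior.
    """
    healthy = [p for p in paths if p not in failed]
    if not healthy:
        raise RuntimeError("no healthy path available")
    assignment = {p: [] for p in healthy}
    for seq in range(num_packets):
        assignment[healthy[seq % len(healthy)]].append(seq)
    return assignment


paths = ["p0", "p1", "p2", "p3"]

# All paths healthy: the transfer is split evenly.
print(spray(8, paths))

# One path fails: the sender simply stops using it and
# the remaining paths absorb the traffic.
print(spray(8, paths, failed={"p1"}))
```

In this model, reacting to a failure is a local decision by the sender (drop the path from the list) and does not wait for the whole fabric to recompute routes, which is the intuition behind the faster recovery OpenAI describes.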

Why chip vendors are involved

This is not just a software tweak that sits above the data center hardware. MRC is designed for the newest 800 Gb/s network interfaces and extends RDMA over Converged Ethernet (RoCE), a common way to move data directly between machines with low overhead.

AMD’s post frames the problem around predictable training at 10-trillion-parameter scale and says MRC adds packet-spray load balancing, selective retransmission and network-signaled congestion control. In plain English: use more paths, resend less wasted data, and react to congestion before it turns into a cluster-wide slowdown.
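
The "resend less wasted data" part is easy to quantify with a small comparison. The sketch below contrasts selective retransmission with a simple baseline that resends everything from the first lost packet onward (often called go-back-N). The baseline is chosen here for contrast; the article does not say what MRC is measured against, and the packet counts are made up.

```python
def go_back_n_resend(total_packets, lost):
    """Baseline: resend every packet from the first lost one onward."""
    if not lost:
        return []
    return list(range(min(lost), total_packets))


def selective_resend(total_packets, lost):
    """Selective retransmission: resend only the packets actually lost."""
    return sorted(p for p in lost if 0 <= p < total_packets)


lost = {10, 500}  # hypothetical losses in a 1,000-packet transfer

print(len(go_back_n_resend(1000, lost)))  # 990
print(len(selective_resend(1000, lost)))  # 2
```

Selective retransmission also fits naturally with multipath delivery, where packets routinely arrive out of order and an out-of-order arrival alone should not trigger a bulk resend.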

That explains the partner list. If OpenAI wants this to become infrastructure instead of a one-off internal system, it needs hardware vendors, cloud operators and network silicon makers to implement it consistently. The OCP release is a signal that OpenAI wants MRC to look more like an ecosystem spec than a private optimization.

The infrastructure race gets less visible

For users, MRC will not show up as a new ChatGPT button. The impact is indirect: fewer wasted GPU cycles, more reliable training runs and possibly lower infrastructure cost per useful model update. Those gains matter because compute is still one of the hardest constraints in AI.

For cloud and chip companies, the stakes are more direct. AI infrastructure competition is moving beyond who can buy the most GPUs. Networking, power, memory, scheduling and fault recovery increasingly decide how much useful work those GPUs can do.

OpenAI also says the MRC topology can help fully connect about 131,000 GPUs with two tiers of switches by splitting high-speed interfaces into multiple smaller links. If that kind of design spreads, buyers will care less about headline accelerator counts and more about whether the whole system can stay productive under failure.
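
The arithmetic behind a figure of that size can be reconstructed under some assumptions. In a standard non-blocking two-tier (leaf-spine) design, each leaf switch uses half its ports for endpoints and half for uplinks, so a switch with k ports supports k × k/2 endpoints. The sketch below assumes a 51.2 Tb/s switch and 100 Gb/s links; neither number appears in the article, which only says high-speed interfaces are split into multiple smaller links.

```python
def ports_after_split(switch_gbps, link_gbps):
    """Number of ports a switch exposes at a given per-link speed."""
    return switch_gbps // link_gbps


def two_tier_capacity(switch_ports):
    """Max endpoints in a non-blocking two-tier leaf-spine fabric.

    Each leaf: half its ports face endpoints, half face spines.
    Each spine reaches every leaf over one link, so there can be
    at most switch_ports leaves.
    """
    endpoints_per_leaf = switch_ports // 2
    max_leaves = switch_ports
    return endpoints_per_leaf * max_leaves


SWITCH_GBPS = 51_200  # assumed switch capacity, not from the article

# Ports used at full 800 Gb/s: few fat links, small fabric.
print(two_tier_capacity(ports_after_split(SWITCH_GBPS, 800)))  # 2048

# Same switch split into 100 Gb/s links: many thin links, large fabric.
print(two_tier_capacity(ports_after_split(SWITCH_GBPS, 100)))  # 131072
```

Under those assumptions the result, 131,072, lines up with the "about 131,000" figure. Because capacity grows with the square of the port count, an 8x split in link speed yields a 64x larger two-tier fabric.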

What to watch next

The open question is adoption. A published spec is not the same as broad interoperability, and large training clusters are conservative systems because downtime is expensive. The next test is whether cloud providers and accelerator vendors ship MRC support as part of normal AI infrastructure, not just bespoke OpenAI builds.

If they do, AI scaling gets a little less dependent on brute-force spending and a little more dependent on network engineering. That is less flashy than a new model release, but it may matter just as much for who can train the next one.