Unlocking large scale AI training networks with MRC (Multipath Reliable Connection)
Summary
OpenAI and its partners (AMD, Broadcom, Intel, Microsoft, and NVIDIA) developed MRC (Multipath Reliable Connection), a networking protocol that improves data transfer speed and reliability in the supercomputer clusters used to train AI models. MRC addresses key challenges of large-scale AI training in three ways: it reduces network congestion through adaptive packet spraying (distributing data across multiple paths), it provides redundancy so that the network tolerates failures, and it uses static source routing (predetermined paths that bypass failed connections) so that training jobs do not crash when network failures occur.
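The packet-spraying idea in the summary can be illustrated with a minimal sketch. It is not MRC's actual algorithm: the Path class, the congestion score, and the inverse-congestion weighting are all assumptions made for illustration. It shows only the general technique of spreading packets over several healthy paths while favoring the less congested ones.

```python
import random
from dataclasses import dataclass


@dataclass
class Path:
    """Hypothetical model of one network path (not an MRC data structure)."""
    path_id: int
    healthy: bool = True
    congestion: float = 0.0  # assumed score: 0.0 means idle, higher means busier


def spray(packets, paths, rng):
    """Assign each packet to a healthy path, favoring less congested paths."""
    usable = [p for p in paths if p.healthy]
    if not usable:
        raise RuntimeError("no healthy path available")
    # Assumed weighting: a path's share shrinks as its congestion score grows.
    weights = [1.0 / (1.0 + p.congestion) for p in usable]
    assignment = {}
    for pkt in packets:
        chosen = rng.choices(usable, weights=weights, k=1)[0]
        assignment[pkt] = chosen.path_id
    return assignment


# Usage: path 1 has failed, path 2 is congested, so path 0 carries most packets.
paths = [Path(0), Path(1, healthy=False), Path(2, congestion=3.0)]
assignment = spray(range(100), paths, random.Random(0))
```

Because failed paths are filtered out before any packet is assigned, a single failure reduces available capacity but does not stall the transfer, which is the property the summary attributes to multipath redundancy.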
Solution / Mitigation
MRC has been released through the Open Compute Project (OCP) as an open standard for the industry to use. The specification extends RDMA over Converged Ethernet (RoCE, a hardware-accelerated data transfer standard) and incorporates SRv6-based source routing to support large-scale AI networking fabrics.
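As a rough illustration of source routing over predetermined paths, the sketch below models each path as an ordered list of IPv6 segment identifiers and picks the first path that avoids every failed segment. The addresses (taken from the IPv6 documentation prefix 2001:db8::/32), the names SPINE_A, SPINE_B, and DEST, and the select_path helper are hypothetical. The sketch does not reflect the MRC wire format or the SRv6 header encoding.

```python
from ipaddress import IPv6Address

# Hypothetical segment identifiers from the IPv6 documentation prefix.
SPINE_A = IPv6Address("2001:db8::a")
SPINE_B = IPv6Address("2001:db8::b")
DEST = IPv6Address("2001:db8::100")

# Predetermined (static) paths: each is the ordered list of segments to visit.
STATIC_PATHS = [
    [SPINE_A, DEST],
    [SPINE_B, DEST],
]


def select_path(paths, failed_segments):
    """Return the first predetermined path that avoids every failed segment."""
    for path in paths:
        if not any(segment in failed_segments for segment in path):
            return path
    raise RuntimeError("every predetermined path crosses a failed segment")


# Usage: with SPINE_A marked as failed, the sender falls back to the SPINE_B path.
active_path = select_path(STATIC_PATHS, {SPINE_A})
```

In this sketch the sender, not the intermediate switches, decides the route, so recovering from a failure only requires the sender to pick a different precomputed segment list.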
Original source: https://openai.com/index/mrc-supercomputer-networking
First tracked: May 6, 2026 at 08:00 AM