Supercomputer networking to accelerate large-scale AI training
Frontier model training depends on reliable supercomputer networks that can move data quickly between GPUs. To make this faster and more efficient, OpenAI has partnered with AMD, Broadcom, Intel, Microsoft, and NVIDIA to develop MRC (Multipath Reliable Connection): a novel protocol that improves GPU networking performance and resilience in large training clusters. We released MRC today through the Open Compute Project (OCP) so that the broader industry can use it. With m...
Read more at openai.com