NVIDIA Ethernet networking accelerates the world's largest AI supercomputer, built by xAI
October 28, 2024 - NVIDIA announced that xAI's Colossus supercomputer cluster in Memphis, Tennessee, has reached a scale of 100,000 NVIDIA® Hopper GPUs. The cluster uses the NVIDIA Spectrum-X™ Ethernet networking platform for its RDMA (Remote Direct Memory Access) network. The platform is designed to deliver superior performance in multi-tenant, hyperscale AI factories.
Colossus is the world's largest AI supercomputer and is currently being used to train xAI's Grok family of large language models, with chatbots offered as a feature for X Premium subscribers. xAI is doubling the size of Colossus to 200,000 NVIDIA Hopper GPUs.
xAI and NVIDIA built the supporting facilities and this state-of-the-art supercomputer in just 122 days, and it took only 19 days from the installation of the first rack to the start of training. Systems of this scale typically take many months or even years to build.
While training very large models such as Grok, Colossus has achieved unprecedented network performance: across all three tiers of the network fabric, the system has seen no application latency degradation or packet loss due to flow collisions. With Spectrum-X's advanced congestion control, it has sustained 95% data throughput.
This level of performance cannot be achieved at scale with standard Ethernet, which delivers only 60% data throughput when thousands of flows collide.
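To put the two throughput figures side by side, the short sketch below converts them into usable bandwidth on a single link. The 400 Gb/s line rate is a hypothetical value chosen only for illustration and does not come from the announcement; the 95% and 60% fractions are the ones quoted above.

```python
# Hypothetical per-link line rate, chosen only for illustration
# (this number is an assumption, not a figure from the announcement).
LINE_RATE_GBPS = 400.0

# Sustained throughput fractions quoted in the text.
SPECTRUM_X_FRACTION = 0.95
STANDARD_ETHERNET_FRACTION = 0.60


def effective_gbps(line_rate_gbps: float, fraction: float) -> float:
    """Return usable bandwidth for a link at a given sustained throughput fraction."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return line_rate_gbps * fraction


spectrum_x = effective_gbps(LINE_RATE_GBPS, SPECTRUM_X_FRACTION)
standard = effective_gbps(LINE_RATE_GBPS, STANDARD_ETHERNET_FRACTION)
ratio = spectrum_x / standard

print(f"Spectrum-X:        {spectrum_x:.0f} Gb/s usable")
print(f"Standard Ethernet: {standard:.0f} Gb/s usable")
print(f"Ratio:             {ratio:.2f}x")
```

Whatever line rate is assumed, the ratio depends only on the two quoted fractions: 0.95 / 0.60, or roughly 1.58 times the usable bandwidth.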
"AI is becoming mission-critical, with increased demands on performance, security, scalability, and cost-effectiveness," said Gilad Shainer, senior vice president of networking at NVIDIA. "The NVIDIA Spectrum-X Ethernet networking platform is designed to accelerate the development, deployment, and time to market of AI solutions by enabling innovators like xAI to process, analyze, and execute AI workloads faster."
Elon Musk said on X: "Colossus is the most powerful training system in the world. The xAI team, NVIDIA, and many of our partners and suppliers have done a great job."
"xAI has built the world's largest and most powerful supercomputer," said a spokesperson for xAI. "With NVIDIA Hopper GPUs and Spectrum-X, we've been able to push the boundaries of AI model training at massive scale, creating a super-accelerated and optimized AI factory based on the Ethernet standard."
At the heart of the Spectrum-X platform is the Spectrum SN5600 Ethernet switch, which is based on the Spectrum-4 switch ASIC and supports port speeds of up to 800Gb/s. xAI uses an end-to-end solution that pairs Spectrum-X SN5600 switches with NVIDIA BlueField-3® SuperNICs to deliver unprecedented performance.
Spectrum-X Ethernet networking for AI provides advanced features that deliver efficient, scalable bandwidth with low latency and short tail latency, capabilities that were previously exclusive to InfiniBand. These features include adaptive routing with NVIDIA Direct Data Placement (DDP) technology, congestion control, and enhanced AI fabric visibility and performance isolation, all of which are key requirements for multi-tenant generative AI clouds and large enterprise environments.