On May 28, at the COMPUTEX conference in Taipei, NVIDIA announced a slew of new hardware and networking tools, many of which focus on enabling artificial intelligence. The new lineup includes the DGX GH200, a 1 exaflop class of AI supercomputer; more than 100 system configuration options designed to help enterprises accommodate AI and high-performance computing needs; a modular reference architecture for accelerated servers; and a cloud networking platform built around Ethernet-based AI clouds.
The announcements, along with the first public talk co-founder and CEO Jensen Huang has given since the start of the COVID-19 pandemic, helped propel NVIDIA toward the coveted $1 trillion market capitalization, which would make it the first chipmaker to join the ranks of tech giants like Microsoft and Apple.
What makes the DGX GH200 for AI supercomputers different?
NVIDIA’s new class of AI supercomputers leverages GH200 Grace Hopper superchips and the NVIDIA NVLink switch system interconnect to run generative AI language applications, recommender systems (machine learning engines that predict how a user might rate a product or piece of content), and data analytics workloads (Figure A). It is the first product to pair the high-performance chips with the new interconnect.
NVIDIA will initially offer the DGX GH200 to Google Cloud, Meta and Microsoft. Next, it plans to offer the DGX GH200 design as a model for cloud service providers and other hyperscalers. It should be available at the end of 2023.
The DGX GH200 is intended to enable organizations to run AI from their own data centers. Each unit contains 256 GH200 superchips, delivering 1 exaflop of performance and 144 terabytes of shared memory.
NVIDIA explained in the announcement that the NVLink switching system allows the GH200 chips to bypass a conventional CPU-to-GPU PCIe connection, increasing bandwidth while reducing power consumption.
Mark Lohmeyer, vice president of compute at Google Cloud, pointed out in an NVIDIA press release that the new Hopper chips and NVLink switch system can “address key AI bottlenecks at scale.”
“Training large AI models has traditionally been a resource- and time-intensive task,” said Girish Bablani, vice president of Azure infrastructure at Microsoft, in NVIDIA’s press release. “The ability for DGX GH200 to work with terabyte-sized datasets would allow developers to conduct advanced research on a larger scale and at accelerated speeds.”
NVIDIA will also keep supercomputing capability for itself; the company plans to work on its own supercomputer called Helios, powered by four DGX GH200 systems.
Alternatives to NVIDIA supercomputing chips
Few companies or customers need the AI and supercomputing speeds that NVIDIA’s Grace Hopper chips enable. NVIDIA’s main rival is AMD, which produces the Instinct MI300. That chip includes both CPU and GPU cores and is expected to power the 2-exaflop El Capitan supercomputer.
Intel offers the Falcon Shores chip, but the company recently announced that it won’t combine a CPU and a GPU after all. Instead, Intel has changed the chip’s roadmap to focus on AI and high-performance computing without including processor cores.
Enterprise library supports AI deployments
Another new service, the NVIDIA AI Enterprise Library, is designed to help organizations access the software layer of new AI offerings. It includes more than 100 frameworks, pre-trained models and development tools. These frameworks are appropriate for the development and deployment of production AI, including generative AI, computer vision, voice AI, and others.
On-demand support from NVIDIA AI experts will be available to help organizations deploy and scale AI projects. The service can help deploy AI on data center platforms from VMware and Red Hat or on NVIDIA-certified systems.
SEE: Are ChatGPT or Google Bard good for your business?
Faster networking for AI in the cloud
NVIDIA wants to help accelerate Ethernet-based AI clouds with the Spectrum-X accelerated networking platform (Figure B).
“NVIDIA Spectrum-X is a new class of Ethernet networking that removes barriers for next-generation AI workloads that have the potential to transform entire industries,” said Gilad Shainer, senior vice president of networking at NVIDIA, in a press release.
Spectrum-X can support AI clouds with 256 200 Gbps ports connected by a single switch or 16,000 ports in a two-tier spine-leaf topology.
To do this, Spectrum-X uses Spectrum-4, a 51 Tbps Ethernet switch built specifically for AI networks. Advanced RoCE extensions that tie together Spectrum-4 switches, BlueField-3 DPUs and NVIDIA LinkX optics create an end-to-end 400GbE network optimized for AI clouds, NVIDIA said.
Spectrum-X and its associated products (Spectrum-4 switches, BlueField-3 DPUs and LinkX 400G optics) are available now, including ecosystem integration with Dell Technologies, Lenovo and Supermicro.
MGX server specification coming soon
In other news about accelerated performance in data centers, NVIDIA has released the MGX server specification. It is a modular reference architecture for system manufacturers working on AI and high performance computing.
“We created MGX to help organizations get started with enterprise AI,” said Kaustubh Sanghani, vice president of GPU products at NVIDIA, in a press release.
Manufacturers will be able to specify their GPU, DPU, and CPU preferences in the initial base system architecture. MGX is compatible with current and future NVIDIA server form factors, including 1U, 2U, and 4U (air-cooled or liquid-cooled).
SoftBank is currently working on building a data center network in Japan that will use GH200 Superchips and MGX systems for 5G services and generative AI applications.
QCT and Supermicro have adopted MGX and will bring systems based on it to market in August.
What will change in data center management?
For businesses, adding high-performance computing or AI to data centers will require changes to physical infrastructure designs and systems. Whether and how much to change depends on each organization’s situation. Joe Reele, vice president of solution architects at Schneider Electric, said many large organizations are already preparing their data centers for AI and machine learning.
“Power density and heat dissipation are driving this transition,” Reele said in an email to TechRepublic. “Additionally, how the computing kit is designed for AI/ML in white space is also a determining factor when considering the need for things like shorter cables and clustering.”
Enterprise-owned data center operators should decide, based on their business priorities, whether replacing servers and upgrading IT equipment to support generative AI workloads makes sense for them, Reele said.
“Yes, the new servers will be more efficient and more capable in terms of computing power, but operators need to consider things like compute usage, carbon emissions, and of course space, power and cooling. While some operators may need to adjust their server infrastructure strategies, many won’t need to make these massive updates in the near term,” he said.
Other NVIDIA news at COMPUTEX
NVIDIA announced a variety of other new AI-based products and services:
- WPP and NVIDIA Omniverse have teamed up to announce a new marketing engine. The content engine will be able to generate videos and images for advertising.
- An intelligent manufacturing platform, Metropolis for Factories, can create and manage custom quality control systems.
- The Avatar Cloud Engine (ACE) for Games is a foundry service for video game developers. It allows in-game characters to use AI for speech generation and animation.