GPU market leader Nvidia holds several GPU Technology Conferences (GTC) annually around the globe. It seems every show has some sort of major announcement where the company is pushing the limits of GPU computing and creating more options for customers. For example, at GTC San Jose, the company announced its NVSwitch architecture, which connects up to 16 GPUs over a single fabric, creating one massive, virtual GPU. This week at GTC Taiwan, it announced its HGX-2 server platform, which is a reference architecture enabling other server manufacturers to build their own systems. The DGX-2 server announced at GTC San Jose is built on the HGX-2 architecture.
Network World’s Marc Ferranti did a great job of covering the specifics of the announcement in this post, including the server partners that will build their own products using the reference architecture. I wanted to drill down a little deeper on the importance of the HGX-2 and the benefits it brings.
HGX-2 gets its horsepower from NVSwitch
In his post, Ferranti mentioned that the HGX-2 leverages the NVSwitch interconnect fabric. NVSwitch is a significant leap forward for GPU computing, and without it, the speeds Nvidia is getting could not be achieved. As fast as PCI bus speeds have gotten, they are far too slow to feed multiple GPUs. By linking the GPUs into a single, virtual GPU, HGX-2 delivers 2 petaflops in a single server.
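As a rough sanity check on the 2-petaflop figure, the arithmetic can be sketched as below. The per-GPU number is an assumption that does not come from this post: it takes each of the 16 GPUs to be a Tesla V100 rated at roughly 125 teraflops of peak mixed-precision Tensor Core throughput. The function name is illustrative only.

```python
# Back-of-the-envelope check of the "2 petaflops in a single server" claim.
# ASSUMPTION: 125 TFLOPS per GPU is the peak mixed-precision Tensor Core
# rating of a Tesla V100; the article itself does not state a per-GPU figure.
GPUS_PER_FABRIC = 16            # NVSwitch connects up to 16 GPUs (from the article)
ASSUMED_TFLOPS_PER_GPU = 125.0  # assumed peak rating, not a measured value


def aggregate_petaflops(gpu_count, tflops_per_gpu):
    """Return the summed peak throughput in petaflops (1 PFLOPS = 1000 TFLOPS)."""
    return gpu_count * tflops_per_gpu / 1000.0


print(aggregate_petaflops(GPUS_PER_FABRIC, ASSUMED_TFLOPS_PER_GPU))
```

This is a peak figure obtained by simple addition. Real workloads would land below it, and how far below depends heavily on how well the interconnect keeps all 16 GPUs busy, which is the point of NVSwitch.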
Server partners have flexibility in platform design using the HGX-2 base
AI and HPC architectures also vary from data center to data center. HGX-2 is a base building block that enables the server ecosystem partners to build a full server platform that meets the unique requirements of their customers. As an example, some hyperscale customers prefer to have PCIe and networking cables in the back of the server, while others prefer them in the front. Servers can be powered through a power bus bar for the rack or through an individual power supply in each server. The approach Nvidia is taking lets it do what it does best, which is to deliver market-leading performance from GPU subsystems, while allowing the server manufacturers to focus on system-level design, power, cooling and mechanicals. This should lead to faster innovation and new systems being developed to meet the constantly changing needs of the AI and machine-learning industries.
The image below shows Nvidia’s server architecture for high-performance AI and HPC workloads.
With this design, the CPU host node and GPU server platforms connect using PCIe cables. This lets the GPU and CPU operate at different speeds and refresh at their own pace. Because the architecture is disaggregated, the CPUs and GPUs can be upgraded independently. Another benefit worth noting is that the four PCIe x16 connections provide plenty of bandwidth to continually feed the GPUs. Many data scientists have told me that one of the biggest issues with machine learning and AI is not being able to feed the GPUs fast enough to keep them working.
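To put a rough number on "plenty of bandwidth," the sketch below estimates the theoretical throughput of four x16 links. The PCIe generation is not stated in this post, so the code assumes PCIe 3.0 (8 GT/s per lane with 128b/130b encoding). The function name and defaults are illustrative.

```python
# Rough theoretical bandwidth of the host-to-GPU PCIe links.
# ASSUMPTION: PCIe 3.0 signalling (8 GT/s per lane, 128b/130b line encoding).
# The article only says "four PCIe x16 connections" and gives no generation.
LINKS = 4
LANES_PER_LINK = 16


def pcie_link_gbytes_per_s(lanes, gtransfers_per_s=8.0, encoding_efficiency=128 / 130):
    """Theoretical one-direction bandwidth of a PCIe link in GB/s.

    Each transfer carries one bit per lane; dividing by 8 converts bits to bytes.
    Protocol overhead (packet headers, flow control) is not modelled.
    """
    return lanes * gtransfers_per_s * encoding_efficiency / 8.0


per_link = pcie_link_gbytes_per_s(LANES_PER_LINK)
total = LINKS * per_link
print(f"per x16 link: {per_link:.2f} GB/s, four links: {total:.2f} GB/s")
```

Under these assumptions, each x16 link carries a little under 16 GB/s per direction and the four links together about 63 GB/s, before protocol overhead. A different PCIe generation would change the result proportionally.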
HGX-2 also useful for HPC workloads for ultimate flexibility
Another interesting element of HGX-2 is that it can be used for HPC workloads as well as AI. The platform supports FP64 and FP32 (double- and single-precision floating-point formats) for scientific computing, modeling and simulations, while also supporting the lower-precision FP16 and INT8 formats used for AI training and inferencing. Typically, covering both kinds of workload would require investments in multiple platforms, driving costs through the roof. The ability to do both on a single platform means greater flexibility and a lower cost to get started with AI initiatives.
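The difference between these formats is easiest to see by storing the same number in each one. The sketch below uses only Python's standard `struct` module to round-trip a value through 16-, 32- and 64-bit IEEE 754 encodings and report the rounding error. It illustrates the number formats themselves, not HGX-2 hardware behavior, and the helper names are made up for this example. INT8 is left out because it is an integer format.

```python
import struct

# struct format codes for IEEE 754 half, single and double precision.
FORMATS = {"FP16": "<e", "FP32": "<f", "FP64": "<d"}


def round_trip(value, fmt):
    """Encode a Python float in the given binary format and decode it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def rounding_errors(value):
    """Absolute error introduced by storing `value` in each format."""
    return {name: abs(round_trip(value, fmt) - value) for name, fmt in FORMATS.items()}


errors = rounding_errors(0.1)
for name, err in errors.items():
    print(f"{name}: error {err:.3e}")
```

The FP64 error comes out as zero only because a Python float is already a 64-bit double. The errors for FP32 and FP16 grow as bits are removed, which is why simulations tend to favor FP64 and FP32, while neural network training and inferencing can often tolerate FP16 or INT8 in exchange for speed.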
Nvidia currently has a big head start on the industry
At the end of his post, Ferranti commented that Nvidia’s lead in the market is destined to face increasing competition and mentioned Intel and Xilinx as possible competitors. Logically, it makes sense that Nvidia would see more competition, and that may happen, but it’s unlikely to come from any of its existing competitors. What makes Nvidia unique today isn’t its GPUs alone, although they’re obviously very good. It’s the entire stack, from the silicon to software to hardware platforms and developer ecosystem. None of the other GPU manufacturers has an ecosystem and stack that’s even close to Nvidia’s. People thought the same thing about Intel when the PC industry was booming, and it took decades before another vendor challenged it. I believe Nvidia will have a similar decade-long run where it is as important to AI computing as Intel was to PC computing.