by Zeus Kerravala

Nvidia’s HGX-2 brings flexibility to GPU computing

News Analysis

May 31, 2018

Nvidia takes the covers off its HGX-2 server platform that unifies HPC and A.I. computing.

Credit: Getty Images

GPU market leader Nvidia holds several GPU Technology Conferences (GTC) annually around the globe. Nearly every show brings a major announcement in which the company pushes the limits of GPU computing and gives customers more options. For example, at GTC San Jose, the company announced its NVSwitch architecture, which connects up to 16 GPUs over a single fabric, creating one massive, virtual GPU. This week at GTC Taiwan, it announced its HGX-2 server platform, a reference architecture that enables other server manufacturers to build their own systems. The DGX-2 server announced at GTC San Jose is built on the HGX-2 architecture.

Network World’s Marc Ferranti did a great job of covering the specifics of the announcement in this post, including the server partners that will build their own products using the reference architecture. I wanted to drill down a little deeper on the importance of the HGX-2 and the benefits it brings.

HGX-2 gets its horsepower from NVSwitch

In his post, Ferranti mentioned that the HGX-2 leverages the NVSwitch interconnect fabric. NVSwitch is a significant leap forward for GPU computing, and without it, the speeds Nvidia is getting could not be achieved. As fast as PCIe bus speeds have gotten, they are far too slow to feed multiple GPUs. By using NVSwitch to create a single, virtual GPU, HGX-2 delivers 2 petaflops in a single server.
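To put the headline number in per-GPU terms, here is a minimal arithmetic sketch. It uses only the two figures quoted in this article (2 petaflops, up to 16 GPUs on the NVSwitch fabric) and assumes the total is a peak figure split evenly across the GPUs; the function name is made up for illustration.

```python
# Figures quoted in the article; nothing here is measured.
TOTAL_PETAFLOPS = 2.0   # aggregate HGX-2 throughput cited above
GPU_COUNT = 16          # GPUs joined by NVSwitch, per the article


def per_gpu_teraflops(total_petaflops: float, gpu_count: int) -> float:
    """Split an aggregate petaflop figure evenly across GPUs, in teraflops."""
    return total_petaflops * 1000.0 / gpu_count


if __name__ == "__main__":
    share = per_gpu_teraflops(TOTAL_PETAFLOPS, GPU_COUNT)
    print(f"{share:.0f} teraflops per GPU")
```

Under those assumptions, each GPU would need to contribute 125 teraflops at peak to reach the 2-petaflop total.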

Server partners have flexibility in platform design using the HGX-2 base

With A.I. and HPC, architectures also vary from data center to data center. HGX-2 is a base building block that enables the server ecosystem partners to build a full server platform that meets the unique requirements of their customers. For example, some hyperscale customers prefer to have PCIe and networking cables in the back of the server, while others prefer them in the front. Likewise, servers can be powered from a power bus bar for the rack or from an individual power supply in each server. The approach lets Nvidia do what it does best, which is deliver market-leading performance from GPU subsystems, while allowing the server manufacturers to focus on system-level design, power, cooling and mechanicals. This should lead to faster innovation and new systems being developed to meet the constantly changing needs of the A.I. and machine-learning industries.

The image below shows Nvidia’s server architecture for high-performance A.I. and HPC workloads.

With this design, the CPU host node and GPU server platforms connect using PCIe cables. This disaggregated architecture lets the CPUs and GPUs operate at different speeds and be refreshed and upgraded independently, each at its own pace. Another benefit worth noting is that the four PCIe x16 connections provide plenty of bandwidth to continually feed the GPUs. Many data scientists have told me that one of the biggest issues with machine learning and A.I. is not being able to feed the GPUs fast enough to keep them working.
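For a rough sense of what four x16 links can carry, here is a back-of-the-envelope sketch. The article does not state the PCIe generation, so the code assumes PCIe 3.0 signalling (8 GT/s per lane with 128b/130b encoding). The results are theoretical peaks in one direction, not measured throughput, and the function names are made up for illustration.

```python
# Assumption: PCIe 3.0 signalling (8 GT/s per lane, 128b/130b encoding).
# The article does not say which PCIe generation the platform uses.
GT_PER_SEC_PER_LANE = 8.0
ENCODING_EFFICIENCY = 128 / 130   # payload bits per transferred bit
LANES_PER_LINK = 16               # an x16 connection
LINKS = 4                         # "four PCIe x16 connections" in the article


def link_bandwidth_gbytes(lanes: int = LANES_PER_LINK) -> float:
    """Theoretical peak bandwidth of one link, one direction, in GB/s."""
    gigabits = GT_PER_SEC_PER_LANE * ENCODING_EFFICIENCY * lanes
    return gigabits / 8  # bits to bytes


def aggregate_bandwidth_gbytes(links: int = LINKS) -> float:
    """Theoretical peak across all host-to-GPU links, one direction, in GB/s."""
    return links * link_bandwidth_gbytes()


if __name__ == "__main__":
    print(f"per x16 link: {link_bandwidth_gbytes():.2f} GB/s")
    print(f"four links:   {aggregate_bandwidth_gbytes():.2f} GB/s")
```

Under the PCIe 3.0 assumption this works out to roughly 15.75 GB/s per link and about 63 GB/s across the four links in each direction. Real-world throughput would be lower once protocol overhead is included.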

HGX-2 also useful for HPC workloads for ultimate flexibility

Another interesting element of HGX-2 is that it can be used for HPC workloads as well as A.I. The platform supports FP64 and FP32 (double- and single-precision floating-point formats) for scientific computing, modeling and simulations, while also supporting the lower-precision FP16 and INT8 formats used for A.I. training and inferencing. Typically, covering both would require investments in multiple platforms, driving costs through the roof. The ability to do both on a single platform means greater flexibility and a lower cost to get started with A.I. initiatives.
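To make the precision trade-off concrete, here is a small sketch that runs on an ordinary CPU using only Python’s standard `struct` module. It stores the same value in IEEE 754 half, single and double precision and reports how much rounding error each format introduces. It illustrates the number formats only and says nothing about how the HGX-2 hardware implements them; the helper names are made up for illustration.

```python
import struct

# struct format codes for IEEE 754 half, single and double precision
# (little-endian, standard sizes).
FORMATS = {"FP16": "<e", "FP32": "<f", "FP64": "<d"}


def round_trip(value: float, code: str) -> float:
    """Store a Python float in the given format and read it back."""
    return struct.unpack(code, struct.pack(code, value))[0]


def rounding_error(value: float, code: str) -> float:
    """Absolute error introduced by storing value in the given format."""
    return abs(round_trip(value, code) - value)


if __name__ == "__main__":
    sample = 0.1  # not exactly representable in binary floating point
    for name, code in FORMATS.items():
        size_bits = struct.calcsize(code) * 8
        print(f"{name}: {size_bits} bits, "
              f"stored as {round_trip(sample, code)!r}, "
              f"error {rounding_error(sample, code):.3e}")
```

The error shrinks as the format widens, which is why simulation and modeling codes tend to want FP64 or FP32, while many A.I. workloads tolerate FP16 (or 8-bit integers, INT8) in exchange for higher throughput.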

Nvidia currently has a big head start on the industry

At the end of his post, Ferranti commented that Nvidia’s lead in the market is destined to face increasing competition and mentioned Intel and Xilinx as possible competitors. Logically, it makes sense that Nvidia would see more competition, and that may happen, but it’s unlikely to come from any of its existing competitors. What makes Nvidia unique today isn’t just its GPUs, which are obviously very good; it’s the entire stack, from the silicon to software to hardware platforms and developer ecosystem. None of the other GPU manufacturers has an ecosystem and stack that’s even close to Nvidia’s. People thought the same thing about Intel when the PC industry was booming, and it took decades before another vendor challenged it. I believe Nvidia will have a similar decade-long run where it is as important to A.I. computing as Intel was to PC computing.

Zeus Kerravala is the founder and principal analyst with ZK Research, and provides a mix of tactical advice to help his clients in the current business climate and long-term strategic advice. Kerravala provides research and advice to end-user IT and network managers, vendors of IT hardware, software and services and the financial community looking to invest in the companies that he covers.

Prior to ZK Research, Kerravala spent 10 years as an analyst at Yankee Group. Earlier in his career, he held a number of technical roles, including as VP of IT and Deputy CIO.

Kerravala holds a Bachelor of Science in Physics and Mathematics from the University of Victoria in British Columbia, Canada.

He currently resides in Acton, Massachusetts.
