Nvidia’s Hopper, its latest generation of GPU design, showed up in the MLPerf benchmark tests of neural-network hardware. The chip posted admirable scores, with a single-chip system besting some systems that used multiple chips of the older variety, the A100.


Increasingly, the trend in machine learning forms of artificial intelligence is toward larger and larger neural networks. The biggest neural nets, such as Google’s Pathways Language Model, as measured by their parameters, or “weights,” are clocking in at over half a trillion weights, and every additional weight increases the computing power used.

How is that increasing size to be dealt with? With more powerful chips, on the one hand, but also by putting some of the software on a diet. 

On Thursday, the latest benchmark test of how fast a neural network can be run to make predictions was presented by MLCommons, the consortium that runs the MLPerf tests. The reported results featured some important milestones, including the first-ever benchmark results for Nvidia’s “Hopper” GPU, unveiled in March.

At the same time, Chinese cloud giant Alibaba submitted the first-ever reported results for an entire cluster of computers acting as a single machine, blowing away other submissions in terms of the total throughput that could be achieved. 

And a startup, Neural Magic, showed how it was able to use “pruning,” a means of cutting away parts of a neural network, to achieve a slimmer piece of software that can perform just about as well as a normal program would but with less computing power needed.

“We’re all training these embarrassingly brute-force, dense models,” said Michael Goin, product engineering lead for Neural Magic, in an interview with ZDNet, referring to giant neural nets such as Pathways. “We all know there has to be a better way.”

The benchmark tests, called Inference 2.1, represent one half of the machine learning approach to AI, when a trained neural network is fed new data and has to produce conclusions as its output. The benchmarks measure how fast a computer can produce an answer for a number of tasks, including ImageNet, where the challenge is for the neural network to apply one of several labels to a photo describing the object in the photo, such as a cat or dog.

Chip and system makers compete to see how well they can do on measures such as the number of photos processed in a single second, or how low they can get latency, the total round-trip time for a request to be sent to the computer and a prediction to be returned. 
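To make the two measures concrete, here is a minimal sketch in Python of how throughput and latency might be recorded for a stream of requests. It is not the MLPerf test harness; `fake_model` and `benchmark` are hypothetical names, and the "model" is a trivial stand-in rather than a real neural network.

```python
import statistics
import time

def fake_model(image):
    """Hypothetical stand-in for a trained network: returns a label for an input."""
    return "cat" if sum(image) % 2 == 0 else "dog"

def benchmark(model, inputs):
    """Time each request; return (throughput, median latency, 99th-percentile latency)."""
    latencies = []
    start = time.perf_counter()
    for x in inputs:
        t0 = time.perf_counter()
        model(x)                                   # one request, one prediction
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    throughput = len(inputs) / elapsed             # predictions per second
    median = statistics.median(latencies)
    p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
    return throughput, median, p99

inputs = [[i, i + 1, i + 2] for i in range(1000)]  # made-up "photos"
throughput, median, p99 = benchmark(fake_model, inputs)
print(f"{throughput:.0f} samples/s, median {median * 1e6:.1f} us, p99 {p99 * 1e6:.1f} us")
```

Throughput rewards a machine that finishes many requests overall, while the tail latency shows how long the slowest requests had to wait, which is why vendors tend to report both.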

In addition, some vendors submit test results showing how much energy their machines consume, an increasingly important element as datacenters become larger and larger, consuming vast amounts of power.

The other half of the problem, training a neural network, is covered in another suite of benchmark results that MLCommons reports separately, with the latest round being in June.

The Inference 2.1 report follows a previous round of inference benchmarks in April. This time around, the reported results pertained only to computer systems operating in datacenters and the “edge,” a term that has come to encompass a variety of computer systems other than traditional data center machines. One spreadsheet is posted for the datacenter results, another for the edge.

The latest report did not include results for the ultra-low-power devices known as TinyML and for mobile computers, which had been lumped in with data center in the April report. 

In all, the benchmarks received 5,300 submissions by the chip makers and partners, and startups such as Neural Magic. That was almost forty percent more than in the last round, reported in April. 

As in the past, Nvidia took top marks for speeding up inference in numerous tasks. Nvidia’s A100 GPU dominated the number of submissions, as is often the case, being integrated with processors from Intel and Advanced Micro Devices in systems built by a gaggle of partners, including Alibaba, ASUSTeK, Microsoft Azure, Biren, Dell, Fujitsu, GIGABYTE, H3C, Hewlett Packard Enterprise, Inspur, Intel, Krai, Lenovo, OctoML, SAPEON, and Supermicro.

Two entries were submitted by Nvidia itself with the Hopper GPU, designated “H100,” in the datacenter segment of the results. One system was accompanied by an AMD EPYC CPU as the host processor, and another was accompanied by an Intel Xeon CPU. 

In both cases, it’s noteworthy that the Hopper GPU, despite being a single chip, scored very high marks, in many cases outperforming systems with two, four or eight A100 chips.

The Hopper GPU is expected to be commercially available later this year. Nvidia said it expects in early 2023 to make available its forthcoming “Grace” CPU chip, which will compete with Intel and AMD CPUs, and that part will be a companion chip to Hopper in systems. 

Alongside Nvidia, mobile chip giant Qualcomm showed off new results for its Cloud AI 100 chip, a novel accelerator built for machine learning tasks. Qualcomm added new system partners this round, including Dell, Hewlett Packard Enterprise, and Lenovo, and increased the number of total submissions using its chip.

While the bake-off between chip makers and system makers tends to dominate headlines, an increasing number of clever researchers show up in MLPerf with novel approaches that can get more performance out of the same hardware. 

Past examples have included OctoML, the startup that is trying to bring the rigor of DevOps to running machine learning.

This time around, an interesting approach was offered by four-year-old, venture-backed startup Neural Magic. The company’s technology comes in part from research by founder Nir Shavit, a scholar at MIT. 

The work points to a possible breakthrough in slimming down the computing needed by a neural network. 

Neural Magic’s technology trains a neural network and finds which weights can be left unused. It then sets those weights to a zero value, so they are not processed by the computer chip. 

That approach, called pruning, is akin to removing the unwanted branches of a tree. It is also, however, part of a broader trend in deep learning going back decades known as “sparsity.” In sparse approaches to machine learning, some data and some parts of programs can be deemed unnecessary for practical purposes.
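A rough illustration of the idea: the Python sketch below zeroes out the smallest weights in a tiny made-up weight matrix. The article does not say how Neural Magic decides which weights to drop; picking the weights with the smallest absolute value ("magnitude pruning") is a common, simple criterion assumed here purely for illustration, and `magnitude_prune` is a hypothetical helper, not the company's API.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value.

    Assumption: 'small magnitude' means 'safe to drop'. Real tools typically
    prune gradually during training and check accuracy along the way.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(flat) * sparsity)       # how many weights to remove
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = flat[n_prune - 1]             # largest magnitude that gets dropped
    pruned, removed = [], 0
    for row in weights:
        new_row = []
        for w in row:
            # The counter keeps ties at the threshold from over-pruning.
            if abs(w) <= threshold and removed < n_prune:
                new_row.append(0.0)
                removed += 1
            else:
                new_row.append(w)
        pruned.append(new_row)
    return pruned

weights = [[0.8, -0.05, 0.3],
           [0.01, -0.9, 0.02]]                # made-up values
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

With half the weights removed, the three largest connections survive and the rest become zeros that a sparsity-aware runtime can skip.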


Neural Magic figures out ways to drop neural network weights, the tensor structures that account for much of a neural net’s memory and bandwidth needs. The original network of many-to-many connected layers is pruned until only some connections remain, while the others are zeroed out. The pruning approach is part of a larger principle in machine learning known as sparsity.

Neural Magic

Another technique, called quantization, converts some numbers to simpler representations. For example, a 32-bit floating point number can be compressed into an 8-bit scalar value, which is easier to compute. 
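As a sketch of what that conversion can look like, the Python below maps a handful of floating-point values to 8-bit integers using a single shared scale factor. This symmetric, one-scale-per-tensor scheme is an assumption chosen for simplicity; the article does not describe the scheme Neural Magic uses, and `quantize_int8` and `dequantize` are hypothetical names.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.02, 1.0]            # made-up 32-bit float weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
print(restored)
```

Each value now fits in one byte instead of four, at the cost of a rounding error no larger than half the scale factor per weight.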

The Neural Magic technology acts as a kind of conversion tool that a data scientist can use to automatically find the parts of their neural network that can be safely discarded without sacrificing accuracy. 

The benefit, according to Neural Magic’s product engineering lead, is not only a reduction in how many calculations a processor has to crunch but also in how often a CPU has to go off the chip to external memory, such as DRAM, which slows everything down.

“You remove 90% of the parameters and you remove 90% of the FLOPs you need,” said Goin of Neural Magic, referring to floating-point operations, the individual calculations whose count is a standard measure of how much computing a neural network requires.
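Goin’s arithmetic can be checked with a back-of-the-envelope count for a single fully connected layer. The Python sketch below assumes an idealized runtime that skips every zeroed weight entirely; the layer size is made up, the function names are hypothetical, and real-world speedups depend on how well the software and hardware exploit the zeros.

```python
def dense_layer_flops(n_inputs, n_outputs):
    """One multiply and one add per weight in a fully connected layer."""
    return 2 * n_inputs * n_outputs

def sparse_layer_flops(n_inputs, n_outputs, sparsity):
    """Same count, but only for the weights that survive pruning."""
    kept = round(n_inputs * n_outputs * (1 - sparsity))
    return 2 * kept

dense = dense_layer_flops(1024, 1024)         # hypothetical 1024x1024 layer
sparse = sparse_layer_flops(1024, 1024, 0.9)  # 90% of weights pruned
print(dense, sparse, f"{sparse / dense:.1%} of the dense work remains")
```

Under these assumptions, pruning 90% of the weights leaves roughly a tenth of the multiply-and-add work for that layer.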

In addition, “It’s very easy for CPUs to get memory bandwidth-limited,” Goin said. “Moving large tensors requires a lot of memory bandwidth, which CPUs are bad at.” Tensors are the structures that organize the values of neural network weights and that have to be retained in memory.