NVIDIA Ampere A100 has 54 billion transistors, world’s largest 7nm chip

Not long ago, Intel's Raja Koduri claimed that the Xe HP "Ponte Vecchio" silicon was the "big daddy" of Xe GPUs, and the "largest chip co-developed in India," larger than the 35 billion-transistor Xilinix VU19P FPGA co-developed in the country. It turns out that NVIDIA is in the mood for setting records. The "Ampere" A100 silicon has 54 billion transistors crammed into a single 7 nm die (not counting transistor counts of the HBM2E memory stacks).

NVIDIA claims a 20 times boost in both AI inference and single-precision (FP32) performance over its "Volta" based predecessor, the Tesla V100. The chip also offers a 2.5X gain in FP64 performance over "Volta." NVIDIA has also invented a new number format for AI compute, called TF32 (tensor float 32). TF32 uses 10-bit mantissa of FP16, and the 8-bit exponent of FP32, resulting in a new, efficient format. NVIDIA attributes its 20x performance gains over "Volta" to this. The 3rd generation tensor core introduced with Ampere supports FP64 natively. Another key design focus for NVIDIA is to leverage the "sparsity" phenomenon in neural nets, to reduce their size, and improve performance.

A new HPC-relevant feature being introduced with A100 is multi-instance GPU, which allows multiple complex applications to run on the same GPU without sharing resources such as memory bandwidth. The user can now partition a physical A100 into up to 7 virtual GPUs of varying specs, and ensure that an application running on one of the vGPUs doesn't eat into the resources of the other. As for real-world performance, NVIDIA claims that the A100 beat the V100 by a factor of 7 at BERT.

The DGX-A100 system crams 5 petaflops of compute peformance onto a single "graphics card" (a single node), and starts at $199,000 a piece.