With the rise of large language models and the desire to run them more cheaply and efficiently, the concept of quantization has gained a lot of traction. By representing the weights of the LLM with data types that use fewer bits, you reduce the GPU memory needed to load the model and the memory bandwidth required by operations like matrix multiplication.
As a quick refresher, floating-point numbers consist of a single sign bit, a number of exponent bits (E), and a number of mantissa bits (M). If $e$ is the value of the exponent bits (after subtracting the format's bias) and $m$ is the integer value of the mantissa bits, then for normal numbers the value represented is
$f = (-1)^{sign} \cdot 2^e \cdot \left(1 + \frac{m}{2^M}\right)$
(Subnormal numbers, where all exponent bits are zero, drop the implicit leading 1.)
Intuitively, the exponent determines the rough scale of the number, and the mantissa determines the precise value within that scale.
| float32 representation. Source: Wikipedia. |
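To make the formula concrete, here's a short Python sketch (the helper name is mine, purely for illustration) that unpacks a float32 into the three fields shown above and reconstructs its value using the formula for normal numbers:

```python
import struct

def decode_float32(x: float):
    """Split a float32 into its sign, exponent, and mantissa fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # E = 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # M = 23 bits
    # Reconstruct a normal number: implicit leading 1 on the mantissa.
    value = (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2**23)
    return sign, exponent, mantissa, value

print(decode_float32(-6.25))  # (1, 129, 4718592, -6.25)
```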
The high-precision format for LLMs is float32, which is your standard float data type in C, and has E = 8, M = 23 for 32 bits total. For quantizing down to 16 bits, it's become popular to use bfloat16, a newer format developed specifically for machine learning. The bfloat16 format uses E = 8, M = 7, versus traditional float16, which uses E = 5, M = 10. This is because capturing a wider range of scales (e.g. for large gradients or activations) is more important in ML than high precision.
| bfloat16 representation. Source: Wikipedia. |
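To see the tradeoff in action, here's a small comparison. It's a sketch that assumes the ml_dtypes package (which provides a NumPy-compatible bfloat16 type) is installed:

```python
import numpy as np
from ml_dtypes import bfloat16  # assumption: ml_dtypes is installed

big = np.float32(3.0e38)       # near the top of float32's range
print(np.float16(big))         # inf      -- float16 (E = 5) overflows past 65504
print(bfloat16(big))           # ~3.0e38  -- bfloat16 (E = 8) keeps float32's range

fine = np.float32(1 + 2**-10)  # 1.0009765625
print(np.float16(fine))        # 1.001    -- float16 (M = 10) still distinguishes it from 1.0
print(bfloat16(fine))          # 1.0      -- bfloat16 (M = 7) rounds the difference away
```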
Quantization can go further: 8-bit and even 4-bit formats are common now. By the time you get down to 4 bits, it's no longer obvious that anything will work -- after all, 4 bits can only represent 16 values! To understand how this can work, let's look in particular at NVFP4, NVIDIA's own data type, which is targeted specifically at maintaining model accuracy.
It's a bit of a misnomer to call NVFP4 a data type at all, as it's not a standalone representation like traditional floating-point formats. It consists of a standard 4-bit floating-point value (E = 2, M = 1) used in conjunction with per-block scaling factors and a per-tensor scaling factor. You can think of it as a generalization of the floating-point concept across multiple values instead of within a single one.
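For concreteness, the 16 bit patterns of that E = 2, M = 1 format decode to just these values (a quick sketch assuming the standard exponent bias of 1; the two zero encodings collapse to a single value):

```python
def e2m1_values():
    """Enumerate every value representable in the 4-bit E2M1 format."""
    values = []
    for sign in (1, -1):
        for exp_bits in range(4):     # E = 2 -> exponent codes 0..3, bias 1
            for man_bit in range(2):  # M = 1 -> mantissa bit 0 or 1
                if exp_bits == 0:
                    v = sign * (man_bit / 2)  # subnormal: no implicit leading 1
                else:
                    v = sign * (1 + man_bit / 2) * 2.0 ** (exp_bits - 1)
                values.append(v)
    return values

print(sorted(set(e2m1_values())))
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Every weight has to land on one of these grid points, which is why the scaling factors described next do so much of the heavy lifting.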
| NVFP4 representation. Source: NVIDIA. |
There is a single 32-bit tensor scaling factor that determines the "global scale" of the tensor, and then an 8-bit scaling factor for every 16 values in the tensor. These scaling factors mitigate the 4-bit format's limited range, and NVIDIA chose the E and M values of the scaling factors to most accurately reconstruct the true values in practice (as measured by mean squared error). NVIDIA has built hardware support for NVFP4 into its Blackwell architecture, so those GPUs natively understand how to manipulate tensors in the NVFP4 format.
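Here's a rough NumPy sketch of how such two-level scaling could work. This is my own illustration, not NVIDIA's actual algorithm or kernels; it assumes E4M3 for the 8-bit block scales (via the ml_dtypes package), round-to-nearest onto the E2M1 grid, and illustrative function names:

```python
import numpy as np
from ml_dtypes import float8_e4m3fn  # assumption: ml_dtypes is installed

# Non-negative values representable in E2M1 (see the earlier enumeration).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_blockwise(x, block_size=16):
    """Two-level scaling sketch: per-tensor FP32 scale + per-block FP8 scale + FP4 values."""
    # Per-tensor scale chosen so the per-block scales fit in E4M3's range (max 448).
    tensor_scale = np.float32(max(np.abs(x).max(), 1e-12) / (448.0 * 6.0))

    blocks = x.reshape(-1, block_size).astype(np.float32)
    block_amax = np.maximum(np.abs(blocks).max(axis=1), 1e-12)
    # One 8-bit (E4M3) scale per block of 16 values.
    block_scales = (block_amax / (6.0 * tensor_scale)).astype(float8_e4m3fn)

    # Divide out both scales, then snap each value to the nearest E2M1 grid point.
    scaled = blocks / (block_scales.astype(np.float32)[:, None] * tensor_scale)
    nearest = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    quantized = np.sign(scaled) * E2M1_GRID[nearest]
    return tensor_scale, block_scales, quantized

def dequantize_blockwise(tensor_scale, block_scales, quantized):
    return quantized * block_scales.astype(np.float32)[:, None] * tensor_scale

# Round-trip a random weight vector and look at the reconstruction error.
w = np.random.randn(64).astype(np.float32)
w_hat = dequantize_blockwise(*quantize_blockwise(w)).reshape(-1)
print(np.mean((w - w_hat) ** 2))  # small but nonzero mean squared error
```

Dividing by the block scale maps each group of 16 values into the E2M1 grid's range, while the FP32 tensor scale keeps the FP8 block scales themselves from overflowing.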
One thing I've glossed over is how the quantization actually happens, which isn't trivial. Another reason bfloat16 is preferred over float16 is that quantizing from float32 to bfloat16 is easy: they have the same range (the same number of exponent bits), so conversion just drops mantissa bits. But when you quantize to 8 or 4 bits, that's never true, so you have to apply scaling as described in this blog post. That post covers techniques for post-training quantization that choose quantized weights to maintain model accuracy, but I won't be covering them here. Instead, this post is a setup for my next post on quantization-aware distillation, an alternative approach from a paper NVIDIA recently published -- stay tuned!