Inside LLM.int8()
Short summary of the paper
Problem identified:
LLMs require a significant amount of GPU memory to run inference (make predictions on new data), which severely limits where they can be deployed. Standard quantization techniques, which convert values from one data type to a lower-precision one, degrade model performance significantly, especially for models with more than 6.7B parameters.
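To see why this degradation happens, here is a minimal sketch (not the paper's code; the tensor values are made up for illustration) of naive absmax int8 quantization, which uses a single scaling factor for the whole tensor:

```python
import torch

def absmax_quantize(x: torch.Tensor):
    # One scaling factor for the entire tensor, set by its largest magnitude.
    scale = 127.0 / x.abs().max()
    x_int8 = (x * scale).round().clamp(-127, 127).to(torch.int8)
    return x_int8, scale

def dequantize(x_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_int8.float() / scale

# A single large value dominates the scale: the small values round to (almost) zero.
x = torch.tensor([0.1, -0.2, 0.3, 60.0])
x_q, scale = absmax_quantize(x)
print(dequantize(x_q, scale))  # tensor([0.0000, 0.0000, 0.4724, 60.0000])
```

The one extreme entry stretches the scale so far that the smaller values lose nearly all of their precision after rounding.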
Core Finding and Solution:
The root cause of this problem is the presence of 'outliers': large-magnitude, systematic features that emerge in transformer models as they scale. These outliers disrupt standard quantization methods, which generally use a single scaling factor for all the values in a tensor, so one extreme value crushes the resolution of everything else. The solution proposed in the paper applies vector-wise quantization to the vast majority of the model's weights, while identifying the outlier feature dimensions, isolating them, and computing their matrix multiplication in 16-bit precision; more than 99.9% of values are still processed in 8-bit.
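A rough sketch of this mixed-precision decomposition, assuming the paper's outlier threshold of 6.0 and simulating the int8 arithmetic in floating point (the actual implementation uses dedicated GPU kernels in the bitsandbytes library):

```python
import torch

def llm_int8_matmul_sketch(x: torch.Tensor, w: torch.Tensor, threshold: float = 6.0):
    # 1. Outlier feature dimensions: columns of x containing any |value| > threshold.
    outlier = (x.abs() > threshold).any(dim=0)

    # 2. The outlier slice stays in higher precision (fp16 on GPU; plain float here).
    out_hi = x[:, outlier] @ w[outlier, :]

    # 3. Everything else: vector-wise quantization, one scale per row of x
    #    and one per column of w, then an int8 matmul (simulated in float).
    x_r, w_r = x[:, ~outlier], w[~outlier, :]
    sx = 127.0 / x_r.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    sw = 127.0 / w_r.abs().amax(dim=0, keepdim=True).clamp(min=1e-8)
    x_q = (x_r * sx).round().clamp(-127, 127)
    w_q = (w_r * sw).round().clamp(-127, 127)
    out_lo = (x_q @ w_q) / (sx * sw)  # dequantize with the outer product of scales

    return out_hi + out_lo

# Hidden states with one large-magnitude outlier feature dimension.
x = torch.randn(4, 8); x[:, 2] += 20.0
w = torch.randn(8, 3)
err = (llm_int8_matmul_sketch(x, w) - x @ w).abs().max()
print(err)  # small error despite the outlier column
```

Because the few outlier dimensions never pass through the int8 path, they no longer distort the scaling factors, and the per-row/per-column scales keep the quantization error of the remaining values small.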