Researchers Unveil 'Linearity Theorem' for LLM Quantization; New HIGGS Method Outperforms NF4, Improves Accuracy-Compression Trade-offs

Pushing the Limits of Large Language Model Quantization via the Linearity Theorem

Abstract: Quantizing large language models has become a standard way to reduce their memory and computational costs. Typically, existing methods focus on breaking down the problem into individual layer-wise sub-problems, and minimizing per-layer error, measured via various metrics. Yet, this approach currently lacks theoretical justification and the metrics employed may be sub-optimal. In this paper, we present a "linearity theorem" establishing a direct relationship ...
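To make the layer-wise setup in the abstract concrete, here is a minimal sketch of quantizing each layer independently and measuring one possible per-layer error metric. Everything in it is an illustrative assumption rather than the paper's method: the function names `quantize_rtn` and `relative_mse`, the choice of uniform round-to-nearest quantization, the relative mean-squared-error metric, and the random Gaussian "layers" standing in for real model weights. It is not an implementation of HIGGS or NF4.

```python
import random


def quantize_rtn(weights, bits):
    """Uniform round-to-nearest quantization over one layer's min..max range.

    Hypothetical helper for illustration; not the HIGGS or NF4 scheme.
    """
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1  # number of steps between lo and hi
    if hi == lo:
        return list(weights)  # constant layer: nothing to quantize
    scale = (hi - lo) / levels
    return [lo + round((w - lo) / scale) * scale for w in weights]


def relative_mse(weights, quantized):
    """Squared quantization error divided by the squared norm of the weights."""
    num = sum((w - q) ** 2 for w, q in zip(weights, quantized))
    den = sum(w * w for w in weights)
    return num / den


# Assumed toy data: three "layers" of standard-normal weights.
rng = random.Random(0)
layers = {f"layer_{i}": [rng.gauss(0.0, 1.0) for _ in range(1024)] for i in range(3)}

# Each layer is treated as its own sub-problem, as the abstract describes.
errors_4bit = {name: relative_mse(w, quantize_rtn(w, 4)) for name, w in layers.items()}
errors_8bit = {name: relative_mse(w, quantize_rtn(w, 8)) for name, w in layers.items()}

for name in layers:
    print(name, errors_4bit[name], errors_8bit[name])
```

In this toy setting each layer gets its own error number, and a higher bit width lowers it. How such per-layer errors relate to the quality of the whole model is the question the abstract says the theorem addresses, and this sketch does not attempt to answer it.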