News Score: Score the News, Sort the News, Rewrite the Headlines

EfficientQAT, a new LLM quantization algorithm, makes a 2-bit INT Llama-2-70B outperform an FP Llama-2-13B while using less memory.

Recent work on LLM quantization has focused on vector quantization, such as AQLM and QuIP#, to achieve precise 2-bit quantization. However, vector quantization introduces additional deployment challenges. EfficientQAT instead pushes the limits of uniform (INT) quantization, making INT quantization achieve performance comparable to vector quantization. Specifically, EfficientQAT obtains a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than 3% accuracy degradation compared to the full...
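To illustrate what uniform (INT) quantization means here, below is a minimal sketch using per-tensor min-max scaling with a scale and zero-point. This is only a generic illustration of the INT quantization format the paragraph refers to, not the EfficientQAT training procedure itself; the function names and the min-max calibration choice are assumptions for the example.

```python
import numpy as np

def uniform_quantize(w, bits=2):
    """Map float weights onto 2**bits uniformly spaced integer levels.

    Uses per-tensor min-max calibration (an assumption for this sketch);
    real quantization-aware training would learn these parameters.
    """
    qmax = 2**bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / qmax          # step size between levels
    zero_point = np.round(-w_min / scale)   # integer offset so 0.0 is representable
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the integer codes."""
    return (q.astype(np.float32) - zero_point) * scale

# Round-trip a small weight tensor through 2-bit uniform quantization.
rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)
q, s, z = uniform_quantize(w, bits=2)
w_hat = dequantize(q, s, z)
```

With only 4 levels (2 bits), the reconstruction error per weight is bounded by half the step size, which is why closing the gap to full precision at 2 bits is considered hard.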

Read more at reddit.com