News Score: Score the News, Sort the News, Rewrite the Headlines

EfficientQAT, a new LLM quantization algorithm, makes a 2-bit INT Llama-2-70B outperform an FP Llama-2-13B while using less memory.

Recent work on LLM quantization has focused on vector quantization, such as AQLM and QuIP#, to achieve precise 2-bit quantization. However, vector quantization introduces additional deployment challenges. EfficientQAT instead pushes the limits of uniform (INT) quantization, making INT quantization achieve performance comparable to vector quantization. Specifically, EfficientQAT obtains a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than 3% accuracy degradation compared to the full...
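To illustrate what uniform (INT) quantization means here, below is a minimal sketch using per-tensor min-max scaling with a scale and zero-point. This is only a generic illustration of the INT quantization format the paragraph refers to, not the EfficientQAT training procedure itself; the function names and the min-max calibration choice are assumptions for the example.

```python
import numpy as np

def uniform_quantize(w, bits=2):
    """Map float weights onto 2**bits uniformly spaced integer levels.

    Uses per-tensor min-max calibration (an assumption for this sketch);
    real quantization-aware training would learn these parameters.
    """
    qmax = 2**bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / qmax          # step size between levels
    zero_point = np.round(-w_min / scale)   # integer offset so 0.0 is representable
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the integer codes."""
    return (q.astype(np.float32) - zero_point) * scale

# Round-trip a small weight tensor through 2-bit uniform quantization.
rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)
q, s, z = uniform_quantize(w, bits=2)
w_hat = dequantize(q, s, z)
```

With only 4 levels (2 bits), the reconstruction error per weight is bounded by half the step size, which is why closing the gap to full precision at 2 bits is considered hard.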

Read more at reddit.com