SplitQuantV2 Algorithm Enhances LLM Quantization Without GPUs, Matches FP32 Performance in 2 Minutes on CPU

SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs

Abstract: The quantization of large language models (LLMs) is crucial for deploying them on devices with limited computational resources. While advanced quantization algorithms offer improved performance compared to the basic linear quantization, they typically require high-end graphics processing units (GPUs), are often restricted to specific deep neural network (DNN) frameworks, and require calibration datasets. This limitation poses challenges for using such algori...