GitHub - dipampaul17/KVSplit: Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.
🚀 KVSplit
Differentiated KV Cache Quantization for Apple Silicon
📌 Overview
Run longer context windows and larger LLMs on your Mac by quantizing the keys and the values in the attention mechanism's KV cache at different precisions. KVSplit lets you:
Reduce KV cache memory usage by up to 72% with minimal quality loss
Run 2-3x longer contexts in the same memory budget
Maintain or improve inference speed compared to FP16
Optimize for Apple Silicon with full Metal support
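The memory figures above can be sanity-checked with some quick arithmetic. The sketch below is not KVSplit's own code. It assumes block-quantized storage in which every 32 elements share one 16-bit scale, which gives 8.5 effective bits per element for 8-bit and 4.5 for 4-bit (the layout of llama.cpp's `q8_0` and `q4_0` formats). This excerpt does not state which formats KVSplit uses, so treat those constants as an assumption. The model shape and the helper names `kv_cache_bytes` and `reduction_vs_fp16` are hypothetical, chosen only for illustration.

```python
# Effective bits per cached element.
# ASSUMPTION: quantized blocks of 32 elements share one 16-bit scale.
BLOCK = 32
FP16_BITS = 16.0
Q8_BITS = 8 + 16 / BLOCK  # 8.5 effective bits
Q4_BITS = 4 + 16 / BLOCK  # 4.5 effective bits


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, key_bits, value_bits):
    """Bytes needed to cache keys and values for n_tokens positions."""
    elems_per_tensor = n_layers * n_kv_heads * head_dim * n_tokens
    return elems_per_tensor * (key_bits + value_bits) / 8


def reduction_vs_fp16(key_bits, value_bits):
    """Fraction of KV cache memory saved relative to FP16 keys and values."""
    return 1 - (key_bits + value_bits) / (2 * FP16_BITS)


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=22, n_kv_heads=4, head_dim=64, n_tokens=8192)

configs = [
    ("FP16", FP16_BITS, FP16_BITS),
    ("K8V4", Q8_BITS, Q4_BITS),  # 8-bit keys, 4-bit values
    ("K4V4", Q4_BITS, Q4_BITS),  # 4-bit keys, 4-bit values
]

for name, key_bits, value_bits in configs:
    mib = kv_cache_bytes(**shape, key_bits=key_bits, value_bits=value_bits) / 2**20
    saved = reduction_vs_fp16(key_bits, value_bits)
    print(f"{name}: {mib:.1f} MiB, {saved:.1%} smaller than FP16")
```

Under these assumptions, 8-bit keys with 4-bit values save about 59% and 4-bit keys with 4-bit values save about 72%, in line with the percentages quoted in this README. The same ratio bounds how much longer a context fits in a fixed cache budget.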
Key Findings
Configuration...