New KVSplit Tool Slashes LLM Memory Use by 59% on Apple Silicon Macs; Enables Longer Contexts with Minimal Quality Loss

GitHub - dipampaul17/KVSplit: Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.
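The 59% figure can be sanity-checked with a little arithmetic. The sketch below is not taken from the KVSplit code. It assumes the cache uses block-wise formats in the style of ggml's `q8_0` and `q4_0`, where each block of 32 values stores one 16-bit scale, and that keys and values occupy equal space at FP16. The names `BITS_PER_ELEMENT` and `kv_cache_reduction` are hypothetical and exist only for this illustration.

```python
# Effective bits per cached element.
# ASSUMPTION: ggml-style block quantization, 32 values per block,
# each block carrying one 16-bit scale factor.
BLOCK_SIZE = 32
SCALE_BITS = 16

BITS_PER_ELEMENT = {
    "fp16": 16.0,
    "q8": (BLOCK_SIZE * 8 + SCALE_BITS) / BLOCK_SIZE,  # 8.5 bits
    "q4": (BLOCK_SIZE * 4 + SCALE_BITS) / BLOCK_SIZE,  # 4.5 bits
}


def kv_cache_reduction(key_fmt: str, value_fmt: str) -> float:
    """Fraction of KV cache memory saved relative to an FP16 cache.

    ASSUMPTION: keys and values hold the same number of elements.
    """
    baseline = 2 * BITS_PER_ELEMENT["fp16"]
    quantized = BITS_PER_ELEMENT[key_fmt] + BITS_PER_ELEMENT[value_fmt]
    return 1.0 - quantized / baseline


if __name__ == "__main__":
    for keys, values in [("q8", "q8"), ("q8", "q4"), ("q4", "q4")]:
        saved = kv_cache_reduction(keys, values)
        print(f"keys={keys} values={values}: {saved:.1%} smaller than FP16")
```

Under these assumptions, 8-bit keys with 4-bit values come out at roughly 59% smaller than FP16, and 4-bit for both at roughly 72%, which lines up with the two figures quoted on this page. The saving applies to the KV cache only; model weights are unaffected.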

🚀 KVSplit
Differentiated KV Cache Quantization for Apple Silicon

📌 Overview

Run larger context windows and heavier LLMs on your Mac by applying different quantization precision to keys vs values in the attention mechanism's KV cache. KVSplit enables you to:

- Reduce memory usage by up to 72% with minimal quality loss
- Run 2-3x longer contexts in the same memory budget
- Maintain or improve inference speed compared to FP16
- Optimize for Apple Silicon with full Metal support

Key Findings

Configuration...