Speculative KV coding: losslessly compressing KV cache by up to ~4× using a predictor model
The size of LLM context grows by the day. KV caching is what makes running
those long contexts affordable: it spends memory to save compute, so the model
doesn’t redo prefill work it has already done. But as agentic workflows push
contexts ever longer, storing and moving the cache starts to dominate
everything. To get to the next order of magnitude of LLM capability, we need
it to be smaller.
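To get a feel for why the cache dominates at long context, here is a rough back-of-the-envelope sketch of its size. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for illustration, not one taken from this post, and the formula assumes a standard attention layout with one key and one value vector per layer, per KV head, per token.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size for one sequence, in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    All parameters are illustrative assumptions, not measured figures.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


# Hypothetical model shape, for illustration only.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=128_000)
print(f"{size / 2**30:.3f} GiB per 128k-token sequence")
```

Under these assumptions the cache costs 128 KiB per token, so a single 128k-token sequence already occupies several GiB, and the cost scales linearly with context length and with the number of concurrent sequences.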
You can make it smaller lossily. TurboQuant is a
(somewhat controversial: performance exploration, accusations of...
Read more at fergusfinn.com