Speculative KV coding losslessly compresses the LLM KV cache by up to ~4× using a cheaper predictor model; an entropy coder reconstructs the cache exactly, avoiding the quality loss of lossy methods.

Speculative KV coding: losslessly compressing KV cache by up to ~4× using a predictor model

The size of LLM context grows by the day. KV caching is what makes running those long contexts affordable: it spends memory to save compute, so the model doesn’t re-prefill work it has already done. But as agentic workflows push contexts ever longer, storing and moving the cache starts to dominate everything. To get to the next order of magnitude of LLM capability, we need the cache to be smaller. You can make it smaller lossily. TurboQuant is a (somewhat controversial Performance exploration, accusations of...