News Score: Score the News, Sort the News, Rewrite the Headlines

Speculative KV coding: losslessly compressing KV cache by up to ~4× using a predictor model

The size of LLM context grows by the day. KV caching is what makes running those long contexts affordable: it spends memory to save compute, so the model doesn’t re-prefill work it has already done. But as agentic workflows push contexts ever longer, storing and moving the cache starts to dominate everything. To get to the next order of magnitude of LLM capability, we need the cache to be smaller. You can make it smaller lossily. TurboQuant is a (somewhat controversial Performance exploration, accusations of...

Read more at fergusfinn.com
