Google's TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x
Even if you don’t know much about the inner workings of generative AI models, you probably know they need a lot of memory. That appetite is part of why it’s currently almost impossible to buy a measly stick of RAM without getting fleeced. Google Research recently revealed TurboQuant, a compression algorithm that reduces the memory footprint of large language models (LLMs) while boosting inference speed and maintaining accuracy.
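For a rough sense of where that kind of saving comes from, here is a minimal NumPy sketch of generic round-to-nearest quantization: it squeezes an fp16 tensor (a hypothetical stand-in for cached model state, with made-up shapes) down to 4-bit codes plus a per-row scale and offset, cutting its size by roughly 4x; fewer bits per value or cheaper metadata is how schemes push toward figures like the 6x in the headline. To be clear, this is not TurboQuant’s actual algorithm, just an illustration of the general technique.

```python
import numpy as np

def quantize_int4(x: np.ndarray):
    """Quantize each row of an fp16 array to 4-bit codes plus a scale and offset."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    # 4 bits give 16 levels; clamp the scale to avoid dividing by zero.
    scale = np.maximum((hi - lo) / 15.0, np.float16(1e-4))
    codes = np.round((x - lo) / scale).astype(np.uint8)  # integers in [0, 15]
    return codes, scale.astype(np.float16), lo.astype(np.float16)

def dequantize(codes, scale, lo):
    return codes.astype(np.float16) * scale + lo

# Hypothetical cached tensor: (layers, tokens, width) in fp16.
kv = np.random.randn(8, 512, 1024).astype(np.float16)
codes, scale, lo = quantize_int4(kv)

fp16_bytes = kv.nbytes
# 4-bit codes pack two values per byte; add the per-row fp16 scales and offsets.
int4_bytes = codes.size // 2 + scale.nbytes + lo.nbytes
print(f"compression ratio: {fp16_bytes / int4_bytes:.1f}x")  # ~4x
recon = dequantize(codes, scale, lo)
print(f"max abs error: {np.abs(recon - kv).max():.3f}")
```

The trade-off this toy version makes explicit is accuracy versus bits: the maximum reconstruction error is about half a quantization step, so the real research problem is keeping that error from degrading model output while driving the bit count down.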
TurboQuant is aimed at reducing the size of the key-value cache, which Google likens to a “dig...
Read more at arstechnica.com