
Introducing quantized Llama models with increased speed and a reduced memory footprint

At Connect 2024 last month, we open sourced Llama 3.2 1B and 3B, our smallest models yet, to address the demand for on-device and edge deployments. Since their release, we've seen not just how the community has adopted our lightweight models, but also how grassroots developers are quantizing them to save on capacity and memory footprint, often at some cost to performance and accuracy. As we've shared before, we want to make it easier for more developers to build with Llama, without needing significant...
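
The post does not include code, but as a rough illustration of the kind of post-training quantization the community has been applying to these checkpoints, here is a minimal sketch using Hugging Face transformers with bitsandbytes 4-bit loading. The model id and quantization settings are assumptions for illustration, not Meta's official quantization recipe.

```python
# Minimal sketch: load Llama 3.2 1B with 4-bit weights via transformers +
# bitsandbytes. This is an example of community-style quantization, not
# Meta's official scheme; the model id and settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # assumed HF repo id (gated; requires access)

# NF4 4-bit weights with bfloat16 compute: a large memory saving,
# at some cost in accuracy relative to the full bf16 checkpoint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

prompt = "Summarize why on-device models benefit from quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```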

Read more at ai.meta.com
