2:4 Sparse Llama: Smaller Models for Efficient GPU Inference
Nov 25, 2024
Authors: Alexandre Marques (Manager of Machine Learning Research), Mark Kurtz (CTO, Neural Magic), Dan Alistarh (Principal Research Scientist), and Shubhra Pandit (Senior Machine Learning Researcher)
A Sparse Summary
Sparse Foundation Model: The first sparse, highly accurate foundation model built on top of Meta’s Llama 3.1 8B with 98% recovery on Open LLM Leaderboard v1 and full recovery across fine-tuning tasks, including math, coding, and chat.
Hardware-Accelerated Sparsity: Features a 2:4 sparsity...
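To make the 2:4 pattern concrete: in a 2:4-sparse weight matrix, at most two of every four consecutive weights are non-zero, which is the structured pattern that recent NVIDIA GPUs can accelerate in hardware. The sketch below is a minimal, hypothetical check of that property in PyTorch; the helper name and toy tensor are illustrative assumptions, not Neural Magic's implementation.

```python
import torch

def satisfies_2_4_sparsity(weight: torch.Tensor) -> bool:
    """Return True if every group of 4 consecutive values along the last
    dimension has at most 2 non-zero entries (the 2:4 sparsity pattern).
    Assumes the last dimension is divisible by 4."""
    groups = weight.reshape(-1, 4)                 # view weights in groups of four
    nonzero_per_group = (groups != 0).sum(dim=1)   # count kept weights per group
    return bool((nonzero_per_group <= 2).all())

# Toy 2x8 weight matrix: each group of four keeps exactly two values.
w = torch.tensor([
    [0.5, 0.0, 0.0, -0.3,   0.0, 1.2, 0.7, 0.0],
    [0.0, 0.9, -0.1, 0.0,   0.4, 0.0, 0.0, 0.8],
])
print(satisfies_2_4_sparsity(w))  # True
```

Because the pattern guarantees 50% of weights are zero in a fixed layout, the non-zero values and their positions can be stored compactly and consumed directly by sparse Tensor Cores, which is what makes this form of sparsity practical to accelerate on GPUs.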