Cerebras Inference now 3x faster: Llama3.1-70B breaks 2,100 tokens/s
Today we’re announcing the biggest update to Cerebras Inference since launch. Cerebras Inference now runs Llama 3.1-70B at an astounding 2,100 tokens per second – a 3x performance boost over the prior release. For context, this performance is:
16x faster than the fastest GPU solution
8x faster than GPUs running Llama 3.1-3B, a model 23x smaller
Equivalent to a full GPU generation leap (A100 to H100) delivered in a single software release
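The comparisons above can be checked with back-of-the-envelope arithmetic. This sketch derives the implied baseline throughputs from the stated multipliers; only the 2,100 tokens/s figure and the 3x/16x/8x ratios come from the announcement, and the derived values are approximations, not measured numbers.

```python
# Headline figure from the announcement
cerebras_70b_tps = 2100  # tokens/s for Llama 3.1-70B on Cerebras Inference

# Implied baselines, derived from the stated multipliers (approximate)
prior_release_tps = cerebras_70b_tps / 3    # 3x boost -> prior release ran ~700 tok/s
fastest_gpu_70b_tps = cerebras_70b_tps / 16  # 16x faster -> fastest GPU ~131 tok/s on 70B
gpu_3b_tps = cerebras_70b_tps / 8            # 8x faster -> GPUs ~262 tok/s on the 3B model

# Model-size ratio behind the "23x smaller" claim
size_ratio = 70 / 3  # ~23.3

print(round(prior_release_tps), round(fastest_gpu_70b_tps),
      round(gpu_3b_tps), round(size_ratio))
```

Note that even against a model roughly 23x smaller running on GPUs, the implied GPU throughput (~262 tok/s) remains well below the 2,100 tok/s figure for the 70B model.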
Fast inference is the key to unlocking the next generati...
Read more at cerebras.ai