Cerebras Inference now 3x faster: Llama3.1-70B breaks 2,100 tokens/s
Today we’re announcing the biggest update to Cerebras Inference since launch. Cerebras Inference now runs Llama 3.1-70B at an astounding 2,100 tokens per second – a 3x performance boost over the prior release. For context, this performance is:
16x faster than the fastest GPU solution
8x faster than GPUs running Llama 3.1-3B, a model 23x smaller
Equivalent to a full GPU generation leap (A100 to H100) delivered in a single software release
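The comparisons above can be checked with back-of-the-envelope arithmetic. This sketch derives the implied baseline throughputs from the stated multipliers; only the 2,100 tokens/s figure and the 3x/16x/8x ratios come from the announcement, and the derived values are approximations, not measured numbers.

```python
# Headline figure from the announcement
cerebras_70b_tps = 2100  # tokens/s for Llama 3.1-70B on Cerebras Inference

# Implied baselines, derived from the stated multipliers (approximate)
prior_release_tps = cerebras_70b_tps / 3    # 3x boost -> prior release ran ~700 tok/s
fastest_gpu_70b_tps = cerebras_70b_tps / 16  # 16x faster -> fastest GPU ~131 tok/s on 70B
gpu_3b_tps = cerebras_70b_tps / 8            # 8x faster -> GPUs ~262 tok/s on the 3B model

# Model-size ratio behind the "23x smaller" claim
size_ratio = 70 / 3  # ~23.3

print(round(prior_release_tps), round(fastest_gpu_70b_tps),
      round(gpu_3b_tps), round(size_ratio))
```

Note that even against a model roughly 23x smaller running on GPUs, the implied GPU throughput (~262 tok/s) remains well below the 2,100 tok/s figure for the 70B model.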
Fast inference is the key to unlocking the next generati...
Read more at cerebras.ai