LLM inference speed of light
15 Mar 2024
While working on calm, a minimal fast CUDA implementation of transformer-based language model inference built from scratch, a critical consideration was establishing the "speed of light" for the inference process, that is, the theoretical performance limit, and measuring progress relative to it. In this post we'll cover this theoretical limit and its implications.
If you’re interested in more derivation and some graphs, this notebook does the same modeling in Python.
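As a taste of that modeling, here is a minimal back-of-the-envelope sketch of the core idea: during single-sequence decoding, every model weight must be read from memory at least once per generated token, so memory bandwidth alone puts a hard lower bound on per-token latency. The specific numbers below (a 7B-parameter model in fp16 and roughly 1008 GB/s of bandwidth, the peak spec of an RTX 4090) are illustrative assumptions, not figures taken from this post.

```python
# Back-of-the-envelope lower bound on per-token decoding latency.
# Assumptions (illustrative): 7B parameters, fp16 weights (2 bytes each),
# ~1008 GB/s peak memory bandwidth (an RTX 4090-class GPU).
params = 7e9
bytes_per_weight = 2               # fp16
model_bytes = params * bytes_per_weight

bandwidth = 1008e9                 # bytes/s, peak memory bandwidth

# Each generated token requires streaming all weights through the GPU's
# memory system at least once, so bandwidth bounds the latency from below.
min_seconds_per_token = model_bytes / bandwidth
max_tokens_per_second = 1 / min_seconds_per_token

print(f"lower bound: {min_seconds_per_token * 1e3:.2f} ms/token")
print(f"upper bound: {max_tokens_per_second:.1f} tokens/s")
```

Any measured throughput can then be compared against this bound: the closer an implementation gets, the less headroom remains, which is exactly the "progress relative to the speed of light" framing above.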
Inference mechanics
When a language...
Read more at zeux.io