LLM inference speed of light
15 Mar 2024
While working on calm, a minimal fast CUDA implementation of transformer-based language model inference built from scratch, a critical consideration was establishing the "speed of light" for the inference process, that is, the theoretical performance limit, and measuring progress relative to it. In this post we'll cover this theoretical limit and its implications.
If you’re interested in more derivation and some graphs, this notebook does the same modeling in Python.
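As a taste of that modeling, here is a minimal back-of-the-envelope sketch of the core idea: during single-sequence decoding, every model weight must be read from memory at least once per generated token, so memory bandwidth alone puts a hard lower bound on per-token latency. The specific numbers below (a 7B-parameter model in fp16 and roughly 1008 GB/s of bandwidth, the peak spec of an RTX 4090) are illustrative assumptions, not figures taken from this post.

```python
# Back-of-the-envelope lower bound on per-token decoding latency.
# Assumptions (illustrative): 7B parameters, fp16 weights (2 bytes each),
# ~1008 GB/s peak memory bandwidth (an RTX 4090-class GPU).
params = 7e9
bytes_per_weight = 2               # fp16
model_bytes = params * bytes_per_weight

bandwidth = 1008e9                 # bytes/s, peak memory bandwidth

# Each generated token requires streaming all weights through the GPU's
# memory system at least once, so bandwidth bounds the latency from below.
min_seconds_per_token = model_bytes / bandwidth
max_tokens_per_second = 1 / min_seconds_per_token

print(f"lower bound: {min_seconds_per_token * 1e3:.2f} ms/token")
print(f"upper bound: {max_tokens_per_second:.1f} tokens/s")
```

Any measured throughput can then be compared against this bound: the closer an implementation gets, the less headroom remains, which is exactly the "progress relative to the speed of light" framing above.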
Inference mechanics
When a language...
Read more at zeux.io