Fast LLM Inference From Scratch
Pushing single-GPU inference throughput to the edge without libraries
Source code for this article on GitHub.
This post is about building an LLM inference engine using C++ and CUDA from scratch without libraries.
Why? In doing so, we can learn about the full stack of LLM inference - which is becoming increasingly important, especially as inference compute becomes a new axis along which AI models scale and models are increasingly deployed locally to devices on the edge - from CUDA kernels...
Read more at andrewkchan.dev