News Score: Score the News, Sort the News, Rewrite the Headlines

GitHub - karpathy/llm.c: LLM training in simple, raw C/CUDA

llm.c: LLM training in simple, raw C/CUDA. There is no need for 245MB of PyTorch or 107MB of CPython. For example, training GPT-2 (CPU, fp32) is ~1,000 lines of clean code in a single file. It compiles and runs instantly, and it exactly matches the PyTorch reference implementation. I chose GPT-2 as the first working example because it is the grand-daddy of LLMs, the first time the modern stack was put together. Currently, I am working on a direct CUDA implementation, which will be significantly fast...

