"AI Researcher Andrej Karpathy's llm.c Project Achieves Multi-GPU Training in bfloat16, Running 7% Faster than PyTorch Nightly and 46% Faster than PyTorch Stable Release 2.3.0"

Andrej Karpathy on X (https://t.co/wnZ3woPvKZ):

Day 24 of llm.c: we now do multi-GPU training, in bfloat16, with flash attention, directly in ~3000 lines of C/CUDA, and it is FAST! We're running ~7% faster than PyTorch nightly, with no asterisks, i.e. this baseline includes all modern & standard bells-and-whistles: mixed precision training, torch compile and flash attention, and manually padding vocab. (Previous comparisons included asterisks like *only inference, or *only fp32 etc.) Compared to the current PyTorch stable rele...