Andrej Karpathy on X:
Highly amusing update, ~18 hours later:
llm.c is now down to 26.2ms/iteration, exactly matching PyTorch (tf32 forward pass). We discovered a bug where we incorrectly called cuBLAS in fp32 mathmode 🤦‍♂️. And ademeure contributed a more optimized softmax kernel for very long rows (50,257 elements per row, in the last logits layer).
But the fun doesn’t stop because we still have a lot of tricks up the sleeve. Our attention kernel is naive attention, not flash attention, and materializes...