Let's reproduce GPT-2 (1.6B): one 8XH100 node, 24 hours, $672, in llm.c
In this post we are reproducing GPT-2 in llm.c. This is "the GPT-2", the full 1558M-parameter version that was introduced in OpenAI's blog post "Better Language Models and Their Implications" on February 14, 2019. llm.c does so directly in C/CUDA (~5,000 lines of code in total), without the typical training stack that would involve the Python interpreter and a significantly more complex deep learning library such as PyTorch/JAX or huggingface/transformers. In 2019, training GPT-2 was an involv...
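As a quick sanity check on the headline numbers (the implied rate below is derived arithmetic, not a price quoted in the post): $672 for 24 hours on a single 8XH100 node works out to

$$\frac{\$672}{24\ \text{h} \times 8\ \text{GPUs}} = \$3.50\ \text{per H100 per hour},$$

i.e. the cost figure assumes roughly $28/hour for the whole node.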