Large language model data pipelines and Common Crawl (WARC/WAT/WET)
Érik Desmazières, “La Bibliothèque de Babel”, 1997.
We have been training language models (LMs) for years, yet it is paradoxically difficult to find good resources on the data pipelines commonly used to build their training datasets. Perhaps this is because we often take for granted that these datasets exist (or at least existed, since replicating them is becoming increasingly difficult). However, one must consider the numerous decisions involved in creating such pipelines, as it...
Read more at blog.christianperone.com