GitHub - mlfoundations/MINT-1T: MINT-1T: A one trillion token multimodal interleaved dataset.
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
Paper | Dataset | Blog Post
🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with one trillion text tokens and 3.4 billion images, a ~10x scale-up over existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and arXiv papers.
We release all subsets of MINT-1T, including:
🌐 HTML Data
📚 PDF Data
We provide shards of MINT-1T PDFs for each CommonCraw...