News Score: Score the News, Sort the News, Rewrite the Headlines

GitHub - mlfoundations/MINT-1T: MINT-1T: A one trillion token multimodal interleaved dataset.

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with one trillion text tokens and 3.4 billion images, a ~10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. We release all subsets of MINT-1T, including:

🌐 HTML Data
📚 PDF Data

We provide shards of MINT-1T PDFs for each CommonCraw...

