PleIAs/YouTube-Commons · Datasets at Hugging Face
📺 YouTube-Commons 📺
YouTube-Commons is a collection of audio transcripts of 2,063,066 videos shared on YouTube under a CC-BY license.
Content
The collection comprises 15,112,121 original and automatically translated transcripts from 2,063,066 videos (411,432 individual channels).
In total, this represents nearly 30 billion words (29,721,837,256).
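To put these counts in proportion, the short sketch below derives a few averages from them. It uses only the four figures quoted above; the variable names are illustrative and are not fields of the dataset. The word total covers translated transcripts as well as originals, so the per-transcript average mixes both.

```python
# Counts quoted in the dataset description (not read from the dataset itself).
videos = 2_063_066
channels = 411_432
transcripts = 15_112_121      # original plus automatically translated
words = 29_721_837_256        # total across all transcripts

# Simple ratios; these are averages only and say nothing about the spread.
transcripts_per_video = transcripts / videos
videos_per_channel = videos / channels
words_per_transcript = words / transcripts

print(f"transcripts per video: {transcripts_per_video:.2f}")
print(f"videos per channel:    {videos_per_channel:.2f}")
print(f"words per transcript:  {words_per_transcript:.0f}")
```

On average, then, each video has roughly seven transcripts (the original plus translations), each channel contributes about five videos, and a transcript runs to a little under two thousand words.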
All the videos were shared on YouTube under a CC-BY license: the dataset provides all the necessary provenance information, including the title, link, ...