Harvard Library Releases 242B-Token Dataset of Public Domain Books for AI Training, Enhancing LLM Development

Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability

Authors: Matteo Cargnelutti, Catherine Brobston, John Hess, Jack Cushman, Kristi Mukk, Aristana Scourtas, Kyle Courtney, Greg Leppert, Amanda Watson, Martha Whitehead, Jonathan Zittrain

Abstract: Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their qual...