"Open-Sourced AI Tool, LLaMA-3, Successfully Recaptions 1.3 Billion Web Images, Enhancing Training of Advanced Vision-Language Models"

What If We Recaption Billions of Web Images with LLaMA-3?

Authors: Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie

Abstract: Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations ...