Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
Abstract: Are n-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we showcase their value in both text analysis and improving neural LLMs. We do so by modernizing n-gram LMs in two ways. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens. This is the largest n-gram LM ever built. Second, existing n-gram LMs use a small n, which hinders their performance; we instead allow n to be arbitrarily large.
Read more at arxiv.org
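As a rough sketch of the unbounded-n idea -- not the paper's actual implementation, which must serve counts over trillions of tokens -- the toy Python below backs off to the longest suffix of the context that appears in the training data and reads the next-token distribution off raw counts. The function name and the brute-force scan are hypothetical, for illustration only.

```python
# Minimal sketch of the unbounded-n ("infini-gram") idea: rather than fixing n,
# back off to the longest suffix of the context that occurs in the training
# data, and estimate the next-token distribution from raw counts. A real
# engine would need an index to answer count queries over trillions of tokens.
from collections import Counter

def infgram_next_token_probs(corpus: list[str], context: list[str]) -> dict[str, float]:
    """P(next token | longest suffix of `context` found in `corpus`)."""
    for start in range(len(context) + 1):
        suffix = context[start:]  # longest suffixes first; [] falls back to unigram
        counts = Counter(
            corpus[i + len(suffix)]
            for i in range(len(corpus) - len(suffix))
            if corpus[i : i + len(suffix)] == suffix
        )
        if counts:  # longest suffix with at least one continuation in the corpus
            total = sum(counts.values())
            return {tok: c / total for tok, c in counts.items()}
    return {}

corpus = "the cat sat on the mat . the cat sat on the rug .".split()
print(infgram_next_token_probs(corpus, "the cat sat on the".split()))
# -> {'mat': 0.5, 'rug': 0.5}: the full 5-token context matches twice
```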