Researchers Decode LLM Behavior: N-gram Rules Match 79% of Predictions on TinyStories, 68% on Wikipedia

Understanding Transformers via N-gram Statistics

Abstract: Transformer-based large language models (LLMs) display extreme proficiency with language, yet a precise understanding of how they work remains elusive. One way of demystifying transformer predictions would be to describe how they depend on their context in terms of simple template functions. This paper takes a first step in this direction by considering families of functions (i.e., rules) formed out of simple N-gram-based statistics of the training data. By st...
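
To make the idea of an N-gram rule concrete, here is a minimal Python sketch. It is not the paper's rule family: it assumes a simple longest-suffix backoff rule and a top-1 agreement score, and the names `build_ngram_counts`, `suffix_rule_predict`, and `top1_agreement` are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def build_ngram_counts(tokens, max_n):
    """Count next-token occurrences for every context of length 0..max_n-1.

    Assumption: `tokens` is a flat list of training tokens; a real corpus
    would be tokenized with the model's own tokenizer.
    """
    counts = defaultdict(Counter)
    for i, nxt in enumerate(tokens):
        for k in range(max_n):
            if i - k < 0:
                break
            context = tuple(tokens[i - k:i])  # the k tokens preceding position i
            counts[context][nxt] += 1
    return counts


def suffix_rule_predict(counts, context, max_n):
    """Illustrative rule: use the longest context suffix seen in training
    and return its most frequent next token."""
    for k in range(min(max_n - 1, len(context)), -1, -1):
        suffix = tuple(context[len(context) - k:])
        if suffix in counts:
            return counts[suffix].most_common(1)[0][0]
    return None  # only reached if the training data was empty


def top1_agreement(rule_preds, model_preds):
    """Fraction of positions where the rule's top-1 token equals the model's."""
    matches = sum(r == m for r, m in zip(rule_preds, model_preds))
    return matches / len(model_preds)


# Toy usage: the "model" predictions here are made up for the example.
train = "the cat sat on the mat the cat ran".split()
counts = build_ngram_counts(train, max_n=3)
rule_preds = [suffix_rule_predict(counts, ctx, 3) for ctx in (["sat", "on"], ["dog"])]
score = top1_agreement(rule_preds, ["the", "a"])
```

In this toy setup, a figure like the headline's 79% would correspond to `top1_agreement` computed over held-out positions, with `model_preds` taken from the transformer's actual top-1 outputs rather than invented values.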