
Mechanics of Next Token Prediction with Self-Attention

Abstract: Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: What does a single self-attention layer learn from next-token prediction? We show that training self-attention with gradient descent learns an automaton which...
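As a rough illustration of the setup the abstract describes (not the authors' code), below is a minimal sketch of a single self-attention layer trained with next-token prediction; the vocabulary size, embedding dimension, and batch shapes are arbitrary assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OneLayerSelfAttention(nn.Module):
    # A single self-attention layer with a linear readout, mirroring the
    # "single self-attention layer" setting studied in the abstract.
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, vocab_size, bias=False)

    def forward(self, tokens):
        x = self.embed(tokens)                                  # (batch, seq, dim)
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.shape[-1] ** 0.5
        # Causal mask: each position attends only to itself and earlier tokens.
        mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
        attn = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
        return self.out(attn @ self.v(x))                       # next-token logits

# Next-token prediction objective: at every position, predict the following token.
model = OneLayerSelfAttention()
tokens = torch.randint(0, 100, (8, 16))                         # toy batch of sequences
logits = model(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, 100), tokens[:, 1:].reshape(-1))
loss.backward()                                                  # a gradient-descent step would follow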

Read more at arxiv.org
