
Mechanics of Next Token Prediction with Self-Attention

Abstract: Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: What does a single self-attention layer learn from next-token prediction? We show that training self-attention with gradient descent learns an automaton which...
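As a rough illustration of the setup the abstract describes (not the authors' code), below is a minimal sketch of a single self-attention layer trained with next-token prediction; the vocabulary size, embedding dimension, and batch shapes are arbitrary assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OneLayerSelfAttention(nn.Module):
    # A single self-attention layer with a linear readout, mirroring the
    # "single self-attention layer" setting studied in the abstract.
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, vocab_size, bias=False)

    def forward(self, tokens):
        x = self.embed(tokens)                                  # (batch, seq, dim)
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.shape[-1] ** 0.5
        # Causal mask: each position attends only to itself and earlier tokens.
        mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
        attn = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
        return self.out(attn @ self.v(x))                       # next-token logits

# Next-token prediction objective: at every position, predict the following token.
model = OneLayerSelfAttention()
tokens = torch.randint(0, 100, (8, 16))                         # toy batch of sequences
logits = model(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, 100), tokens[:, 1:].reshape(-1))
loss.backward()                                                  # a gradient-descent step would follow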

Read more at arxiv.org
