News Score: Score the News, Sort the News, Rewrite the Headlines

Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models

Abstract: The disconnect between tokenizer creation and model training in language models has been known to allow for certain inputs, such as the infamous SolidGoldMagikarp token, to induce unwanted behaviour. Although such 'glitch tokens', which are present in the tokenizer vocabulary but nearly or fully absent in training, have been observed across a variety of different models, a consistent way of identifying them has been missing. We present a comprehensive analysis of Large Langu...

Read more at arxiv.org
