News Score: Score the News, Sort the News, Rewrite the Headlines

Integer tokenization is insane

After spending a lot of time with language models, I have come to the conclusion that tokenization in general is insane, and it is a miracle that language models learn anything at all. To drill down into one specific example of silliness that has been bothering me recently, let’s look at how the GPT2 tokenizer (which, as far as I know, is also used for GPT3) tokenizes integers. The tokenization of integers is the most basic element of learning and representing mathematical facts and ultimately all...
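To get a feel for why digit chunking comes out arbitrary, here is a toy byte-pair-encoding (BPE) sketch — not the actual GPT2 merge table, just the same merge-the-most-frequent-pair idea on a made-up digit corpus. Which digit pairs get merged depends entirely on corpus frequencies, so integers with the same digits can end up split into different chunks:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges from a list of strings, starting from characters."""
    sequences = [list(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        sequences = [merge_pair(seq, best) for seq in sequences]
    return merges

def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def tokenize(s, merges):
    """Apply learned merges in order to a new string."""
    seq = list(s)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq

# Hypothetical corpus: lots of 19xx years, so "99" and then "199" get merged.
merges = train_bpe(["1999", "1998", "1997", "2000", "2021"], num_merges=2)
print(merges)                      # [('9', '9'), ('1', '99')]
print(tokenize("1999", merges))    # ['199', '9']   — three digits fused, one left over
print(tokenize("9919", merges))    # ['99', '1', '9'] — same digits, different chunks
```

The punchline is that "1999" and "9919" contain the exact same digits but tokenize into differently shaped chunks, purely because of which substrings happened to be frequent during training — the model then has to learn arithmetic over these inconsistent pieces.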

Read more at beren.io
