News Score: Score the News, Sort the News, Rewrite the Headlines

GPT-4o’s Chinese token-training data is polluted by spam and porn websites

Soon after OpenAI released GPT-4o on Monday, May 13, some Chinese speakers started to notice that something seemed off about this newest version of the chatbot: the tokens it uses to parse text were full of spam and porn phrases. On May 14, Tianle Cai, a PhD student at Princeton University studying inference efficiency in large language models like those that power such chatbots, accessed GPT-4o’s public token library and pulled a list of the 100 longest Chinese tokens the model uses to parse a...

Read more at technologyreview.com

© News Score  score the news, sort the news, rewrite the headlines