Full Unicode Search at 50× ICU Speed with AVX‑512
This article is about the ugliest but potentially most useful piece of open-source software I’ve written this year.
It’s messy because UTF-8 is messy.
The world’s most widely used text encoding standard was introduced in 1989.
It now covers more than 1 million characters across most writing systems in use, so it’s not exactly trivial to work with.
The example above contains multiple confusable characters: the German Eszett variants 'ß' (U+00DF, bytes 0xC3 0x9F) and 'ẞ' (U+1E9E, bytes 0xE1 0xBA 0x9E), the Kelvin si...
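To make the Eszett case concrete, here is a minimal Python sketch, not taken from the library itself, that prints the code point, UTF-8 bytes, and case-folded form of each variant. The helper name `describe` is hypothetical, and the sketch relies only on Python's built-in `str.encode` and `str.casefold` (Python 3.9 or newer).

```python
def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, UTF-8 bytes in hex, case-folded form) for one character."""
    code_point = f"U+{ord(ch):04X}"
    utf8_bytes = ch.encode("utf-8").hex(" ").upper()
    return code_point, utf8_bytes, ch.casefold()


# Lowercase and capital Eszett have different byte lengths in UTF-8,
# yet Unicode full case folding maps both of them to the same two-letter string.
for ch in ("ß", "ẞ"):
    print(ch, *describe(ch))
```

Both characters fold to the same string despite being encoded in two and three bytes respectively, so a case-insensitive search cannot assume that a match has the same byte length as the needle.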
Read more at ashvardanian.com