Classic ML to cope with Dumb LLM Judges
In previous posts I used a local LLM to choose which of two products was more relevant to a search query (see this GitHub repo). Using the human labels in WANDS, Wayfair’s open e-commerce search dataset, as a baseline, I measure how often the LLM’s preference between two products matches the human raters. If the agreement is high enough, I can use my laptop as the search relevance judge. That judge can then guide search quality tuning and iteration without an expensive OpenAI bill.
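To make the measurement concrete, here is a minimal sketch of the agreement check, not the code from the repo. Everything in it is an assumption for illustration: the grades are made-up numbers (higher means more relevant) rather than real WANDS labels, and `stub_judge` is a hypothetical stand-in for the call to the local LLM.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical human grades keyed by (query, product_id); higher = more relevant.
# These are invented for illustration and are NOT real WANDS data.
human_grades: Dict[Tuple[str, str], int] = {
    ("leather sofa", "p1"): 2,
    ("leather sofa", "p2"): 0,
    ("leather sofa", "p3"): 1,
}

def human_preference(query: str, lhs: str, rhs: str,
                     grades: Dict[Tuple[str, str], int]) -> Optional[str]:
    """Return the product the human raters prefer, or None when grades tie."""
    left, right = grades[(query, lhs)], grades[(query, rhs)]
    if left == right:
        return None
    return lhs if left > right else rhs

def agreement(pairs: List[Tuple[str, str, str]],
              judge: Callable[[str, str, str], str],
              grades: Dict[Tuple[str, str], int]) -> float:
    """Fraction of non-tied pairs where the judge picks the human-preferred product."""
    matches = 0
    total = 0
    for query, lhs, rhs in pairs:
        expected = human_preference(query, lhs, rhs, grades)
        if expected is None:
            continue  # humans have no preference, so there is nothing to agree with
        total += 1
        if judge(query, lhs, rhs) == expected:
            matches += 1
    return matches / total if total else 0.0

def stub_judge(query: str, lhs: str, rhs: str) -> str:
    """Hypothetical placeholder for the local LLM: always picks the first product."""
    return lhs

pairs = [
    ("leather sofa", "p1", "p2"),
    ("leather sofa", "p2", "p3"),
    ("leather sofa", "p1", "p3"),
]
score = agreement(pairs, stub_judge, human_grades)
print(f"agreement with human raters: {score:.2f}")
```

Swapping `stub_judge` for a function that prompts the local model would give the number I actually care about: how often the laptop judge sides with the human raters.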
My goal, not so much to replace ot...
Read more at softwaredoug.com