LLM Judges Are Unreliable — The Collective Intelligence Project
How Positional Preferences, Order Effects, and Prompt Sensitivity Undermine Reliability in AI Judgments
By James Padolsey
Beyond their everyday chat capabilities, Large Language Models are increasingly being used to make decisions in sensitive domains like hiring, health, law, and civic engagement. The exact mechanics of how we use these models in such scenarios are vital. There are many ways to have LLMs make decisions, including A/B decision-making, ranking, classification, "panels" of judges, e...
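One way to see why these mechanics matter is a minimal sketch of the A/B setup and its best-known failure mode, positional preference. The `biased_judge` function below is a hypothetical stand-in for a real LLM call (not any actual API); it deliberately favors whichever option appears first, and a simple order-swap check exposes the inconsistency:

```python
# Hypothetical sketch: detecting positional bias in an A/B LLM judge.
# `biased_judge` is a stand-in for a real model call; it deliberately
# prefers whichever option is shown first, to illustrate the failure mode.

def biased_judge(option_a: str, option_b: str) -> str:
    """Toy judge that always picks the first-listed option."""
    return "A"

def position_consistent(judge, a: str, b: str) -> bool:
    """Run the comparison twice with the order swapped.

    A reliable judge should prefer the same underlying option
    regardless of presentation order."""
    first = judge(a, b)    # "A" here means the judge picked `a`
    second = judge(b, a)   # "A" here means the judge picked `b`
    picked_first = a if first == "A" else b
    picked_second = b if second == "A" else a
    return picked_first == picked_second

# The toy judge's verdict flips with the ordering, so the check fails.
print(position_consistent(biased_judge, "candidate 1", "candidate 2"))  # prints False
```

Order-swapped repetition like this is a cheap consistency probe: it cannot prove a judge is reliable, but a single flipped verdict is enough to show it is not.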
Read more at cip.org