LLM Judges Are Unreliable — The Collective Intelligence Project
How Positional Preferences, Order Effects, and Prompt Sensitivity Undermine Reliability in AI Judgments
By James Padolsey
Beyond their everyday chat capabilities, Large Language Models are increasingly being used to make decisions in sensitive domains like hiring, health, law, and civic engagement. The exact mechanics of how we use these models in such scenarios are vital. There are many ways to have LLMs make decisions, including A/B decision-making, ranking, classification, "panels" of judges, e...
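One way to see why these mechanics matter is a minimal sketch of the A/B setup and its best-known failure mode, positional preference. The `biased_judge` function below is a hypothetical stand-in for a real LLM call (not any actual API); it deliberately favors whichever option appears first, and a simple order-swap check exposes the inconsistency:

```python
# Hypothetical sketch: detecting positional bias in an A/B LLM judge.
# `biased_judge` is a stand-in for a real model call; it deliberately
# prefers whichever option is shown first, to illustrate the failure mode.

def biased_judge(option_a: str, option_b: str) -> str:
    """Toy judge that always picks the first-listed option."""
    return "A"

def position_consistent(judge, a: str, b: str) -> bool:
    """Run the comparison twice with the order swapped.

    A reliable judge should prefer the same underlying option
    regardless of presentation order."""
    first = judge(a, b)    # "A" here means the judge picked `a`
    second = judge(b, a)   # "A" here means the judge picked `b`
    picked_first = a if first == "A" else b
    picked_second = b if second == "A" else a
    return picked_first == picked_second

# The toy judge's verdict flips with the ordering, so the check fails.
print(position_consistent(biased_judge, "candidate 1", "candidate 2"))  # prints False
```

Order-swapped repetition like this is a cheap consistency probe: it cannot prove a judge is reliable, but a single flipped verdict is enough to show it is not.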
Read more at cip.org