Researchers Develop Arbitrus: AI Classifier Outperforms LLMs in Legal Consistency Test, Challenges Future of Judicial Decision-Making

LLMs are Bad Judges. So use Our Classifier Instead.

68 Pages
Posted: 8 Jul 2025
Last revised: 8 Jul 2025
Date Written: June 30, 2025

Abstract

Large Language Models suffer from prompt variance, meaning they’ll give you totally different legal answers depending on how you phrase your question. Jonathan Choi demonstrated this recently when he asked ChatGPT five legal questions, each rephrased 2,000 times, and watched as the bot spat out different answers every time. When you tell somebody that AI is going to replace the judge, the lawyer, and the le...