News Score: Score the News, Sort the News, Rewrite the Headlines

Task-Specific LLM Evals that Do & Don't Work

If you’ve run off-the-shelf evals for your tasks, you may have found that most don’t work. They barely correlate with application-specific performance and aren’t discriminative enough to use in production. As a result, we could spend weeks and still not have evals that reliably measure how we’re doing on our tasks. To save us some time, I’m sharing some evals I’ve found useful. The goal is to spend less time figuring out evals so we can spend more time shipping to users. We’ll focus on simple, c...

Read more at eugeneyan.com
