Task-Specific LLM Evals that Do & Don't Work
If you’ve run off-the-shelf evals on your tasks, you may have found that most don’t work. They barely correlate with application-specific performance and aren’t discriminative enough to use in production. As a result, we can spend weeks and still not have evals that reliably measure how we’re doing on our tasks.
To save us some time, I’m sharing some evals I’ve found useful. The goal is to spend less time figuring out evals so we can spend more time shipping to users. We’ll focus on simple, c...
Read more at eugeneyan.com