Evaluating LLMs for my personal use case
It’s great that AI can win maths Olympiads, but that’s not what I’m doing. I mostly ask basic Rust, Python, Linux and life questions. So I did my own evaluation.
I gathered 130 real prompts from my bash history (I use the command-line tool llm).
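Pulling the prompts out of a history file could look roughly like the Python sketch below. It is only an illustration under some assumptions: that each prompt was typed as the last, quoted argument of an `llm` command, that a real prompt contains at least one space (which filters out subcommands), and that the history is a plain-text file with one command per line. The function name `extract_llm_prompts` is made up for this example.

```python
import shlex


def extract_llm_prompts(history_lines):
    """Return the prompts passed to `llm`, de-duplicated, in first-seen order.

    Assumption: the prompt is the last shell token on the line and
    contains whitespace (so bare subcommands are ignored).
    """
    prompts = []
    seen = set()
    for raw in history_lines:
        line = raw.strip()
        # Skip blanks and the "#<epoch>" timestamp lines bash can write.
        if not line or line.startswith("#"):
            continue
        try:
            tokens = shlex.split(line)
        except ValueError:
            # Unbalanced quotes or similar; not parseable as one command.
            continue
        if len(tokens) < 2 or tokens[0] != "llm":
            continue
        prompt = tokens[-1]
        # Heuristic: a quoted prompt has spaces, a subcommand or flag does not.
        if " " not in prompt or prompt in seen:
            continue
        seen.add(prompt)
        prompts.append(prompt)
    return prompts
```

To try it on a real history, read the file into lines first, for example with `Path.home().joinpath(".bash_history").read_text(errors="replace").splitlines()`, and pass the result to the function. Piped input, multi-line commands and heredocs are not handled.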
I had Qwen3 235B Thinking and Gemini 2.5 Pro group them into categories. Both came up with very similar ones. Broadly, with an example of each:
Programming - “Write a bash script to ..”
Sysadmin - “With curl how do I ..”
Technical explanations - “Explain underlay netw...
Read more at darkcoding.net