GitHub - Danau5tin/terminal-bench-rl: GRPO training code which scales to 32xH100s for long horizon terminal/coding tasks. Base agent is now the top Qwen3 agent on Stanford's TerminalBench leaderboard.
🤓 Terminal-Bench-RL: Training Long-Horizon Terminal Agents with Reinforcement Learning
TL;DR:
I successfully built stable RL training infrastructure that scales to 32x H100 GPUs across 4 bare-metal nodes for training long-horizon terminal-based coding agents.
In doing so, I developed Terminal-Agent-Qwen3-32b, which became the highest-scoring Qwen3 agent on terminal-bench WITHOUT training (currently under submission).
Unfortunately I am too GPU poor to train a SOTA coding agent 😅 (estimated £30k-...