Developer Scales RL Training to 32 H100 GPUs, Creates Top-Performing Qwen3 Agent for Stanford's TerminalBench

GitHub - Danau5tin/terminal-bench-rl: GRPO training code which scales to 32xH100s for long horizon terminal/coding tasks. Base agent is now the top Qwen3 agent on Stanford's TerminalBench leaderboard.

🤓 Terminal-Bench-RL: Training Long-Horizon Terminal Agents with Reinforcement Learning

TL;DR: I built stable RL training infrastructure that scales to 32x H100 GPUs across 4 bare metal nodes for training long-horizon terminal-based coding agents. In doing so, I developed Terminal-Agent-Qwen3-32b, which became the highest-scoring Qwen3 agent on terminal-bench WITHOUT training (currently under submission).

Unfortunately I am too GPU poor to train a SOTA coding agent 😅 (estimated £30k-...