Researchers Unveil Vending-Bench: New AI Test Reveals LLMs Struggle with Long-Term Business Management

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

Abstract: While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle d...