News Score: Score the News, Sort the News, Rewrite the Headlines

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

Abstract: While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle d...

Read more at arxiv.org
