Researchers Uncover Critical Flaws in SWE-Bench Coding Dataset: 63.75% of Successful Patches Deemed Problematic, LLM Performance Drops Significantly

SWE-Bench+: Enhanced Coding Benchmark for LLMs

Abstract: Large Language Models (LLMs) in Software Engineering (SE) can offer assistance for coding. To facilitate a rigorous evaluation of LLMs in practical coding contexts, Carlos et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues and their corresponding pull requests, collected from 12 widely used Python repositories. Several impressive LLM-based toolkits have recently been developed and evaluated on this dataset. However, a systemati...