LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Abstract: Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning. [...]
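As a rough illustration only, not the authors' implementation, the sketch below shows what "autonomous multistage reasoning" with a VLM could look like: the model is queried once per stage and each stage's output is fed back as context for the next. The `vlm_generate` helper, the stage names, and the prompts are all hypothetical placeholders; the actual stages and training procedure of LLaVA-o1 are described in the full paper.

```python
# Minimal sketch of stage-by-stage (multistage) reasoning with a VLM.
# `vlm_generate` is a hypothetical helper standing in for any VLM inference
# call (image + prompt -> text); the stage names and prompts are illustrative
# and are not taken from the LLaVA-o1 paper.
from typing import Callable

STAGES = [
    ("summary", "Briefly restate what the question asks."),
    ("visual", "Describe the parts of the image relevant to the question."),
    ("reasoning", "Reason step by step from the description to an answer."),
    ("conclusion", "State the final answer concisely."),
]

def multistage_answer(
    image: bytes,
    question: str,
    vlm_generate: Callable[[bytes, str], str],  # hypothetical VLM call
) -> str:
    """Query the VLM once per stage, feeding earlier outputs back as context."""
    context = f"Question: {question}"
    answer = ""
    for name, instruction in STAGES:
        prompt = f"{context}\n\n[{name}] {instruction}"
        answer = vlm_generate(image, prompt)
        # Accumulate this stage's output so later stages can build on it.
        context += f"\n\n[{name}]\n{answer}"
    return answer  # output of the final (conclusion) stage
```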