Apple Unveils FastVLM: Efficient Vision-Language Model Beats LLaVA-OneVision-0.5B with 85x Faster Time-to-First-Token and a 3.4x Smaller Vision Encoder for High-Res Images

GitHub - apple/ml-fastvlm: This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025

FastVLM: Efficient Vision Encoding for Vision Language Models

This is the official repository of FastVLM: Efficient Vision Encoding for Vision Language Models. (CVPR 2025)

Highlights

We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.

Our smallest variant outperforms LLaVA-OneVision-0.5B with 85x faster Time-to-First-Token (TTFT) and 3.4x smaller vision encoder.

Our larger variants using Qwen2-7B...
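Since the headline figure is a Time-to-First-Token ratio, a small sketch of how such a number could be measured may help. This is not code from the FastVLM repository: `time_to_first_token`, `fake_stream`, and `speedup` are hypothetical names, and the stand-in model assumes that TTFT is the sum of vision encoding time and language-model prefill time, simulated here with sleeps.

```python
import time
from typing import Callable, Iterable, Tuple


def time_to_first_token(generate: Callable[[], Iterable[str]]) -> Tuple[float, str]:
    """Return (seconds elapsed until the first token arrives, that token)."""
    start = time.perf_counter()
    first = next(iter(generate()))  # blocks until the first token is produced
    return time.perf_counter() - start, first


def fake_stream(encode_s: float, prefill_s: float) -> Callable[[], Iterable[str]]:
    """Hypothetical stand-in for a streaming vision-language model.

    Assumption: the wait before the first token is vision encoding
    followed by LLM prefill. Both are simulated with sleeps.
    """
    def generate():
        time.sleep(encode_s)   # simulated vision encoder latency
        time.sleep(prefill_s)  # simulated LLM prefill latency
        yield "first"
        yield "second"
    return generate


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


if __name__ == "__main__":
    slow, _ = time_to_first_token(fake_stream(0.20, 0.05))
    fast, _ = time_to_first_token(fake_stream(0.01, 0.01))
    print(f"slow TTFT: {slow:.3f}s, fast TTFT: {fast:.3f}s, "
          f"speedup: {speedup(slow, fast):.1f}x")
```

In a real measurement the stand-in would be replaced by an actual streaming generation call, with both models timed on the same hardware, image resolution, and prompt so that the ratio is comparable.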