OmniVision-968M: World's Smallest Vision Language Model Cuts Tokens 9x, Enhances Accuracy for Edge Devices

OmniVision is a compact, sub-billion-parameter (968M) multimodal model that processes both visual and text inputs and is optimized for edge devices. Building on LLaVA's architecture, it features:

9x Token Reduction: Reduces image tokens from 729 to 81, cutting latency and computational cost.

Enhanced Accuracy: Reduces hallucinations using DPO training from trustworthy data.

Demo

(OmniVision generated description for an image with multiple objects)

(OmniVision generated des...
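
To make the 729-to-81 token figure concrete, here is a minimal sketch of one way a 9x reduction could work: merging every 9 consecutive image tokens into a single, wider token by concatenation. The text above states only the token counts, so the grouping strategy, the `group_tokens` helper, and the tiny `HIDDEN_SIZE` are all assumptions for illustration, not the model's confirmed implementation.

```python
def group_tokens(tokens, group_size):
    """Merge each run of `group_size` consecutive token vectors into one
    vector by concatenating them (hypothetical helper, for illustration)."""
    if len(tokens) % group_size != 0:
        raise ValueError("token count must be divisible by group_size")
    merged = []
    for start in range(0, len(tokens), group_size):
        combined = []
        for vec in tokens[start:start + group_size]:
            combined.extend(vec)  # concatenate along the feature dimension
        merged.append(combined)
    return merged


HIDDEN_SIZE = 4  # assumed toy embedding width; real models use far larger values

# 729 dummy image tokens, each a vector of HIDDEN_SIZE floats.
image_tokens = [[float(i)] * HIDDEN_SIZE for i in range(729)]

reduced = group_tokens(image_tokens, group_size=9)

print(len(image_tokens), "->", len(reduced))             # sequence length before and after
print("reduction factor:", len(image_tokens) // len(reduced))
print("width per token:", len(reduced[0]))               # each merged token is 9x wider
```

In this sketch the sequence the language model must attend over shrinks by a factor of 9 while each token grows 9x wider, so no values are discarded. Whether the actual model concatenates, pools, or uses another scheme is not specified in the text above.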