Chameleon Outshines Major Models in Text and Image Tasks, Marking Significant Progress in Unified Modeling of Full Multimodal Documents

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Abstract: We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text gen...
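To make "early-fusion, token-based" concrete, here is a minimal sketch of how text tokens and discrete image tokens might share one vocabulary and one flat sequence. It is an illustration under stated assumptions, not Chameleon's actual tokenizer: the vocabulary sizes, the `BOI`/`EOI` sentinel ids, and the `fuse`/`split` helper names are all hypothetical, and the image codes are assumed to come from some separate image quantizer that is not shown.

```python
from typing import List, Sequence, Tuple

# Hypothetical sizes, chosen only for illustration.
TEXT_VOCAB_SIZE = 1000     # text ids occupy [0, 1000)
IMAGE_CODEBOOK_SIZE = 512  # image codes are shifted into [1000, 1512)

# Hypothetical sentinel ids that mark where an image span begins and ends.
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # 1512
EOI = BOI + 1                                # 1513

Segment = Tuple[str, Sequence[int]]


def fuse(segments: Sequence[Segment]) -> List[int]:
    """Flatten interleaved text/image segments into one token sequence."""
    tokens: List[int] = []
    for kind, ids in segments:
        if kind == "text":
            if any(not 0 <= i < TEXT_VOCAB_SIZE for i in ids):
                raise ValueError("text id out of range")
            tokens.extend(ids)
        elif kind == "image":
            if any(not 0 <= i < IMAGE_CODEBOOK_SIZE for i in ids):
                raise ValueError("image code out of range")
            tokens.append(BOI)
            # Shift image codes so they do not collide with text ids.
            tokens.extend(TEXT_VOCAB_SIZE + i for i in ids)
            tokens.append(EOI)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return tokens


def split(tokens: Sequence[int]) -> List[Segment]:
    """Invert fuse(): recover the text and image segments."""
    segments: List[Segment] = []
    text_buf: List[int] = []
    image_buf = None  # None while outside an image span
    for t in tokens:
        if t == BOI:
            if image_buf is not None:
                raise ValueError("nested image span")
            if text_buf:
                segments.append(("text", text_buf))
                text_buf = []
            image_buf = []
        elif t == EOI:
            if image_buf is None:
                raise ValueError("end-of-image without begin-of-image")
            segments.append(("image", image_buf))
            image_buf = None
        elif image_buf is not None:
            if not TEXT_VOCAB_SIZE <= t < BOI:
                raise ValueError("non-image token inside image span")
            image_buf.append(t - TEXT_VOCAB_SIZE)
        else:
            if not 0 <= t < TEXT_VOCAB_SIZE:
                raise ValueError("image token outside image span")
            text_buf.append(t)
    if image_buf is not None:
        raise ValueError("unterminated image span")
    if text_buf:
        segments.append(("text", text_buf))
    return segments


doc = [("text", [5, 7]), ("image", [0, 3]), ("text", [9])]
fused = fuse(doc)
print(fused)
print(split(fused) == doc)
```

Because both modalities end up as ids in a single sequence, one autoregressive model could in principle read and emit text and image tokens in any order, which is the property the abstract describes.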