Extract-0: 7B-parameter model outperforms GPT-4 in document extraction, using LoRA fine-tuning, GRPO reinforcement learning, and 280K synthetic examples

Extract-0: A Specialized Language Model for Document Information Extraction

Abstract: This paper presents Extract-0, a 7-billion-parameter language model specifically optimized for document information extraction that achieves performance exceeding models with parameter counts several orders of magnitude larger. Through a novel combination of synthetic data generation, supervised fine-tuning with Low-Rank Adaptation (LoRA), and reinforcement learning via Group Relative Policy Optimization (GRPO), Extract-0 achieves a mean reward of 0.573 on a...
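
To make the GRPO step named in the abstract more concrete, here is a minimal sketch of the group-relative advantage computation that gives the method its name: several completions are sampled for the same prompt, each is scored, and each score is normalized against its own group. The function name, the `eps` parameter, the use of the population standard deviation, and the sample reward values are all assumptions for illustration; the excerpt does not show the paper's actual reward function or training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar score per completion sampled for the same
    prompt. Population standard deviation is an assumption here; some
    implementations use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled extractions of one document.
example_rewards = [0.2, 0.4, 0.6, 0.8]
example_advantages = group_relative_advantages(example_rewards)
```

Completions that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate learned value model is needed to provide a baseline.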