TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
Abstract: Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. The problem arises primarily from their dependence on a fixed number of parameters within linear projections: when architectural modifications (e.g., to channel dimensions) are introduced, the entire model typically requires retraining from scratch.
Read more at arxiv.org
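The scaling problem the abstract describes comes from the weight matrix of a standard linear projection being hard-coded to the model's channel dimensions, so enlarging the model invalidates existing weights. The title suggests the alternative: treat the parameters themselves as tokens that input tokens attend to, so capacity grows by appending parameter tokens rather than reshaping matrices. Below is a minimal sketch of that idea, assuming a plain softmax cross-attention between inputs and learnable key/value parameter tokens; all names here (TokenizedLinear, param_keys, grow) are illustrative, and the paper's actual attention normalization and initialization may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenizedLinear(nn.Module):
    """Sketch: replace a fixed (d_in x d_out) weight matrix with attention
    over learnable parameter tokens. The token count can grow while d_in
    and d_out, and hence the rest of the network, stay unchanged."""

    def __init__(self, d_in: int, d_out: int, num_param_tokens: int):
        super().__init__()
        # Learnable parameter tokens: keys match the input width,
        # values produce the output width.
        self.param_keys = nn.Parameter(torch.randn(num_param_tokens, d_in))
        self.param_values = nn.Parameter(torch.randn(num_param_tokens, d_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in). Input tokens act as queries;
        # parameter tokens act as keys and values.
        scores = x @ self.param_keys.T / self.param_keys.shape[-1] ** 0.5
        attn = F.softmax(scores, dim=-1)          # (batch, seq, num_param_tokens)
        return attn @ self.param_values            # (batch, seq, d_out)

    def grow(self, extra_tokens: int) -> None:
        # Scale the layer by appending zero-initialized parameter tokens;
        # the existing tokens are reused instead of retrained from scratch.
        self.param_keys = nn.Parameter(torch.cat(
            [self.param_keys.data,
             torch.zeros(extra_tokens, self.param_keys.shape[1])]))
        self.param_values = nn.Parameter(torch.cat(
            [self.param_values.data,
             torch.zeros(extra_tokens, self.param_values.shape[1])]))


# Usage: output shapes are unaffected by growing the parameter set.
layer = TokenizedLinear(d_in=512, d_out=512, num_param_tokens=1024)
y = layer(torch.randn(2, 16, 512))   # (2, 16, 512)
layer.grow(512)                      # add capacity; old tokens preserved
```

Note the design consequence this sketch is meant to highlight: because the input/output widths never change, a grown layer can be dropped back into the same network, which is exactly the retraining-from-scratch cost the abstract argues fixed linear projections incur.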