Researchers run 35B-parameter AI model on 8GB laptop GPU at 21 tokens/sec using Rotary GPU technique, aiming to make large language models accessible under hardware constraints

Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory

Abstract: Large language models have achieved remarkable capabilities through scaling, and this paper does not challenge that. It instead investigates a different question: once large models already exist, can they become more accessible to environments with substantially smaller hardware resources? The motivation came from deployment concerns rather than architecture research. Many organizations operate under hardware, budget, security, or closed-network constraints ...