"PowerInfer-2 Revolutionizes Smartphone Capabilities: High-Speed Inference of Large Language Models with Up to 29.2x Speed Increase, Lower Memory Use"

PowerInfer-2: Fast Large Language Model Inference on a Smartphone

Abstract: This paper introduces PowerInfer-2, a framework designed for high-speed inference of Large Language Models (LLMs) on smartphones, particularly effective for models whose sizes exceed the device's memory capacity. The key insight of PowerInfer-2 is to utilize the heterogeneous computation, memory, and I/O resources in smartphones by decomposing traditional matrix computations into fine-grained neuron cluster computations. Specifically, PowerInfer-2 features a...
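
The decomposition idea in the abstract can be illustrated with a small sketch. It assumes that a "neuron" corresponds to one row of a weight matrix and that a "neuron cluster" is a contiguous group of rows; the excerpt does not say how PowerInfer-2 actually forms or schedules its clusters, so the function names and the fixed `cluster_size` here are hypothetical. The sketch only shows that a matrix-vector product can be split into independent per-cluster pieces whose results match the dense computation.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Dense matrix-vector product: one output value per row (neuron)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def split_into_clusters(num_neurons: int, cluster_size: int) -> List[range]:
    """Partition neuron (row) indices into contiguous clusters.

    Assumption for illustration only: clusters are fixed-size row ranges.
    """
    return [
        range(start, min(start + cluster_size, num_neurons))
        for start in range(0, num_neurons, cluster_size)
    ]


def clustered_matvec(weights: Matrix, x: Vector, cluster_size: int) -> Vector:
    """Compute the same product one neuron cluster at a time.

    Each cluster is an independent unit of work, so it could in principle
    be assigned to a different processor or fetched from storage on demand.
    """
    out = [0.0] * len(weights)
    for cluster in split_into_clusters(len(weights), cluster_size):
        rows = [weights[i] for i in cluster]
        for i, value in zip(cluster, matvec(rows, x)):
            out[i] = value
    return out


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
x = [1.0, -1.0]
dense = matvec(W, x)
clustered = clustered_matvec(W, x, cluster_size=2)
```

Because each output row depends only on its own weights and the shared input vector, the clustered result is identical to the dense one; the granularity of the clusters changes only how the work can be distributed, not the answer.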