"Adaptive N-gram Parallel Decoding: A Breakthrough Technology Accelerating Large Language Models by Up to 3.67x Without Extra Resource Consumption"

Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding

Abstract: While Large Language Models (LLMs) have shown remarkable abilities, they are hindered by significant resource consumption and considerable latency due to autoregressive processing. In this study, we introduce Adaptive N-gram Parallel Decoding (ANPD), an innovative and lossless approach that accelerates inference by allowing the simultaneous generation of multiple tokens. ANPD incorporates a two-stage approach: it begins with a rapid drafting phase that emplo...