Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding
Abstract: While Large Language Models (LLMs) have shown remarkable abilities, they are hindered by significant resource consumption and considerable latency due to autoregressive processing. In this study, we introduce Adaptive N-gram Parallel Decoding (ANPD), an innovative and lossless approach that accelerates inference by allowing the simultaneous generation of multiple tokens. ANPD incorporates a two-stage approach: it begins with a rapid drafting phase that emplo...
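The abstract is cut off before it describes the two stages in detail, so the following is only a generic draft-then-verify sketch, not the paper's ANPD algorithm. It assumes greedy decoding, a bigram table built from the tokens seen so far as the drafter, and a toy deterministic function standing in for the LLM; all names (`draft`, `draft_and_verify_decode`, `toy_model`) are hypothetical. It illustrates why such a scheme can be lossless: every token that is kept is one the model itself would have produced.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]


def draft(tokens: Sequence[Token], k: int) -> List[Token]:
    """Propose up to k tokens by chaining a bigram table built from `tokens`."""
    table: Dict[Token, Token] = {}
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev] = nxt  # keep the most recent continuation of `prev`
    proposal: List[Token] = []
    last = tokens[-1]
    for _ in range(k):
        if last not in table:
            break
        last = table[last]
        proposal.append(last)
    return proposal


def draft_and_verify_decode(
    model: NextTokenFn, prompt: Sequence[Token], max_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Return (tokens, passes). One pass stands for one verification step.

    A real LLM would score all drafted positions in a single batched
    forward pass; this toy version calls `model` once per position.
    """
    tokens = list(prompt)
    target_len = len(prompt) + max_new
    passes = 0
    while len(tokens) < target_len:
        passes += 1
        accepted: List[Token] = []
        for guess in draft(tokens, k):
            truth = model(tokens + accepted)
            if guess != truth:
                accepted.append(truth)  # replace the first wrong guess
                break
            accepted.append(guess)
        else:
            # Every guess matched (or none was drafted): take one more token.
            accepted.append(model(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:target_len], passes


def greedy_decode(model: NextTokenFn, prompt: Sequence[Token], max_new: int) -> List[Token]:
    """Plain one-token-at-a-time baseline."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens


def toy_model(tokens: Sequence[Token]) -> Token:
    """Stand-in for an LLM: the next token is (last + 1) mod 5."""
    return (tokens[-1] + 1) % 5


if __name__ == "__main__":
    prompt = [0, 1, 2, 3, 4, 0]
    fast, passes = draft_and_verify_decode(toy_model, prompt, max_new=10)
    slow = greedy_decode(toy_model, prompt, max_new=10)
    print(fast == slow, passes)
```

On repetitive input like the toy prompt, most drafted tokens are accepted, so the number of verification passes is smaller than the number of generated tokens, while the output stays identical to the baseline.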
Read more at arxiv.org