SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator
Abstract
Large Language Models (LLMs) have exhibited exceptional performance across
a spectrum of natural language processing tasks. However, their substantial
sizes pose considerable challenges, particularly in terms of computational
demands and inference speed, owing to the quadratic complexity of
self-attention. In this work,
we have identified a noteworthy pattern: certain seemingly meaningless
special tokens (i.e., separators) contribute disproportionately to attention
scores compared to semantically meaningful tokens. This observation suggests
that the information of each segment can be condensed into the separator
that follows it; SepLLM builds on this insight, accelerating LLMs by
compressing one segment into one separator.
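As a rough, unofficial sketch of this idea (not the authors' released implementation), the snippet below builds a separator-aware sparse attention mask in which each query attends only to a few initial sink tokens, every separator token seen so far, and a local window of recent neighbors. The function name `sepllm_style_mask`, the separator token ids, and the `n_initial`/`window` parameters are illustrative assumptions, not values from the paper.

```python
import torch

def sepllm_style_mask(token_ids: torch.Tensor, sep_ids: torch.Tensor,
                      n_initial: int = 4, window: int = 64) -> torch.Tensor:
    """Boolean (seq, seq) mask: True where the query may attend to the key."""
    seq = token_ids.shape[0]
    pos = torch.arange(seq)
    causal = pos[None, :] <= pos[:, None]                # key index <= query index
    is_sep = torch.isin(token_ids, sep_ids)              # keys that are separator tokens
    keep_initial = (pos < n_initial)[None, :]            # attention-sink prefix
    keep_local = (pos[:, None] - pos[None, :]) < window  # recent-neighbor window
    keep = keep_initial | is_sep[None, :] | keep_local
    return causal & keep

# Example with hypothetical separator ids (e.g., ".", ",", "\n" in some tokenizer).
ids = torch.randint(0, 1000, (16,))
mask = sepllm_style_mask(ids, sep_ids=torch.tensor([13, 11, 198]))
print(mask.shape, mask.sum().item())
```

Under such a mask, segment tokens outside the local window become unreachable once their separator has absorbed their context, so their KV-cache entries can be dropped; this is one plausible reading of "compressing one segment into one separator."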
Read more at sepllm.github.io