Researchers propose diffusion language models for text embeddings, outperforming LLM-based embedding models by 20% on long-document retrieval

Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective

Abstract: Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting di...
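
The contrast between unidirectional and bidirectional attention can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the paper's implementation: the helper names `causal_mask`, `bidirectional_mask`, and `visible_fraction` and the four-token example are hypothetical. It builds the two attention masks and measures how much of the sequence each token is allowed to see.

```python
# Illustrative sketch only: helper names and the toy sentence are assumptions,
# not taken from the paper.

def causal_mask(n):
    """Unidirectional mask: token i may attend to token j only when j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional mask: every token may attend to every other token."""
    return [[True] * n for _ in range(n)]


def visible_fraction(mask):
    """Fraction of (query, key) pairs that the mask allows."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)


tokens = ["diffusion", "models", "embed", "text"]
n = len(tokens)

causal = causal_mask(n)
full = bidirectional_mask(n)

# Under the causal mask the first token sees only itself, so its
# representation cannot reflect anything that comes later in the document.
for token, row in zip(tokens, causal):
    print(f"{token:>10}: sees {sum(row)} of {n} tokens (causal)")

print("causal coverage:       ", visible_fraction(causal))
print("bidirectional coverage:", visible_fraction(full))
```

In this toy setting only the last token has seen the whole input under the causal mask, which is one way to picture the mismatch with embedding tasks that need a representation of the entire text.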