Google researchers have introduced a method for scaling Transformer-based large language models (LLMs) to handle infinitely long inputs with bounded memory and computation.
The paper, titled "Leave No Context Behind," introduces the approach, known as Infini-attention, which incorporates a compressive memory into the vanilla attention mechanism and combines masked local attention and long-term linear attention within a single Transformer block.
This modification to the Transformer attention layer supports continual pre-training and fine-tuning, facilitating the natural extension of existing LLMs to process infinitely long contexts.
Infini-attention reuses key, value, and query states from standard attention computations for long-term memory consolidation and retrieval. Instead of discarding old key-value (KV) states, the approach stores them in compressive memory and retrieves values using attention query states for processing subsequent sequences. The final contextual output is computed by combining long-term memory-retrieved values with local attention contexts.
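The mechanism described above can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's implementation: it assumes the linear-attention-style memory update (memory accumulates `σ(K)ᵀV` with a running normalization term) and a scalar sigmoid gate `beta` for mixing memory retrieval with local attention; all function and variable names are illustrative.

```python
import numpy as np

def elu_plus_one(x):
    # A non-negative feature map (ELU + 1), commonly used in linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def infini_attention_step(Q, K, V, M, z, beta):
    """One segment of an Infini-attention-style layer (single head, no batch).

    Q, K, V : (seg_len, d) query/key/value states for the current segment
    M       : (d, d) compressive memory carried over from earlier segments
    z       : (d,) normalization term accumulated alongside the memory
    beta    : scalar gate mixing long-term retrieval with local attention
    """
    d = Q.shape[-1]
    sq, sk = elu_plus_one(Q), elu_plus_one(K)

    # Retrieve long-term values from compressive memory using the query states
    A_mem = (sq @ M) / (sq @ z)[:, None]

    # Standard causal (masked) local attention within the segment
    scores = (Q @ K.T) / np.sqrt(d)
    causal_mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(causal_mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    A_local = weights @ V

    # Learned gate combines memory-retrieved values with the local context
    g = 1.0 / (1.0 + np.exp(-beta))
    out = g * A_mem + (1.0 - g) * A_local

    # Instead of discarding old KV states, fold them into the memory
    M_new = M + sk.T @ V
    z_new = z + sk.sum(axis=0)
    return out, M_new, z_new
```

Processing a long input then amounts to iterating this step over consecutive segments, threading `(M, z)` through so each segment can attend over everything seen so far at constant memory cost.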
Experimental results demonstrate that this approach surpasses baseline models on long-context language modelling benchmarks, achieving a …