Google researchers have introduced a method for scaling Transformer-based large language models (LLMs) to handle infinitely long inputs with bounded memory and computation.
The paper, titled "Leave No Context Behind," introduces the approach, known as Infini-attention, which incorporates a compressive memory into the vanilla attention mechanism and combines masked local attention and long-term linear attention within a single Transformer block.
This modification to the Transformer attention layer supports continual pre-training and fine-tuning, facilitating the natural extension of existing LLMs to process infinitely long contexts.
Infini-attention reuses key, value, and query states from standard attention computations for long-term memory consolidation and retrieval. Instead of discarding old key-value (KV) states, the approach stores them in compressive memory and retrieves values using attention query states for processing subsequent sequences. The final contextual output is computed by combining long-term memory-retrieved values with local attention contexts.
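The mechanism described above can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's implementation: it assumes the linear-attention-style memory update (memory accumulates `σ(K)ᵀV` with a running normalization term) and a scalar sigmoid gate `beta` for mixing memory retrieval with local attention; all function and variable names are illustrative.

```python
import numpy as np

def elu_plus_one(x):
    # A non-negative feature map (ELU + 1), commonly used in linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def infini_attention_step(Q, K, V, M, z, beta):
    """One segment of an Infini-attention-style layer (single head, no batch).

    Q, K, V : (seg_len, d) query/key/value states for the current segment
    M       : (d, d) compressive memory carried over from earlier segments
    z       : (d,) normalization term accumulated alongside the memory
    beta    : scalar gate mixing long-term retrieval with local attention
    """
    d = Q.shape[-1]
    sq, sk = elu_plus_one(Q), elu_plus_one(K)

    # Retrieve long-term values from compressive memory using the query states
    A_mem = (sq @ M) / (sq @ z)[:, None]

    # Standard causal (masked) local attention within the segment
    scores = (Q @ K.T) / np.sqrt(d)
    causal_mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(causal_mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    A_local = weights @ V

    # Learned gate combines memory-retrieved values with the local context
    g = 1.0 / (1.0 + np.exp(-beta))
    out = g * A_mem + (1.0 - g) * A_local

    # Instead of discarding old KV states, fold them into the memory
    M_new = M + sk.T @ V
    z_new = z + sk.sum(axis=0)
    return out, M_new, z_new
```

Processing a long input then amounts to iterating this step over consecutive segments, threading `(M, z)` through so each segment can attend over everything seen so far at constant memory cost.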
Experimental results demonstrate that this approach surpasses baseline models on long-context language modelling benchmarks, achieving a …