IndexCache, a new sparse-attention optimizer, delivers up to 1.82x faster inference on long-context AI models

News Feed 28/03/2026 at 14:50

Processing 200,000 tokens through a large language model is expensive and slow: the longer the context, the faster the costs spiral. Researchers at Tsinghua University and Z.ai have built a technique called IndexCache that cuts up to 75% of the redundant computation in sparse-attention models, delivering up to 1.82x faster time-to-first-token and 1…
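
The teaser doesn't describe IndexCache's actual mechanism, but its core claim, caching to skip redundant work in sparse attention, maps onto a familiar pattern: block-sparse attention where each query block attends only to its top-k key blocks, and the selected block indices are cached so later passes skip re-scoring every block. The sketch below illustrates that general pattern only; `BlockIndexCache`, `get_or_select`, the mean-pooled block scoring, and the `block`/`top_k` parameters are all assumptions for illustration, not IndexCache's published method.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class BlockIndexCache:
    """Hypothetical cache of top-k key-block indices per query block,
    so repeated passes skip re-scoring every key block."""
    def __init__(self):
        self._cache = {}

    def get_or_select(self, qblock_id, q_mean, k_means, top_k):
        if qblock_id not in self._cache:
            # Cheap proxy: score key-block means against the query-block mean.
            scores = k_means @ q_mean
            self._cache[qblock_id] = np.argsort(scores)[-top_k:]
        return self._cache[qblock_id]

def sparse_attention(Q, K, V, block=64, top_k=4, cache=None):
    """Block-sparse attention: each query block attends only to its
    top-k key blocks, reusing cached block indices when available.
    Assumes len(Q) is a multiple of `block`; causal masking omitted."""
    n, d = Q.shape
    cache = cache if cache is not None else BlockIndexCache()
    k_means = np.stack([K[s:s + block].mean(axis=0) for s in range(0, n, block)])
    out = np.zeros_like(V)
    for qb, qs in enumerate(range(0, n, block)):
        q = Q[qs:qs + block]
        idx = cache.get_or_select(qb, q.mean(axis=0), k_means, top_k)
        ks = np.concatenate([K[b * block:(b + 1) * block] for b in idx])
        vs = np.concatenate([V[b * block:(b + 1) * block] for b in idx])
        out[qs:qs + block] = softmax(q @ ks.T / np.sqrt(d)) @ vs
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
cache = BlockIndexCache()
out_first = sparse_attention(Q, K, V, cache=cache)  # scores blocks, fills cache
out_again = sparse_attention(Q, K, V, cache=cache)  # reuses cached indices
```

If the savings come from anywhere in a scheme like this, it is the cached index selection: without it, every pass re-scores all key blocks for every query block, which is exactly the kind of repeated work the article describes IndexCache eliminating.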

Read full article →

Source: VentureBeat | Date: Fri, 27 Mar 2026 19:00:00 GMT