APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding

¹Carnegie Mellon University  ²NVIDIA
{xinyuya2, tqchen, beidic}@andrew.cmu.edu

Introduction

Context-augmented generation (CAG) techniques, including retrieval-augmented generation (RAG) and in-context learning (ICL), require efficiently combining multiple contexts to generate responses to user queries. Directly feeding these contexts in sequentially introduces considerable computational cost, since the combined selection of contexts must be re-encoded for every request. To address this, we explore the promising potential of parallel encoding, which pre-computes and caches the KV states of each context independently. This approach allows cached states to be loaded directly during inference, and it accommodates more contexts by reusing positions across contexts. However, directly applying parallel encoding results in a significant performance drop due to misaligned attention distributions. To enable effective and efficient CAG, we propose Adaptive Parallel Encoding (APE), which introduces a shared prefix, attention temperature, and scaling factor to align the distribution of parallel encoding with that of sequential encoding. Results on RAG and ICL tasks demonstrate that APE preserves 98% and 93% of sequential encoding performance on the same inputs, while outperforming parallel encoding by 3.6% and 7.9%, respectively. It also scales to many-shot CAG settings, effectively encoding hundreds of contexts in parallel. Efficiency evaluation shows that APE achieves an end-to-end 4.5× speedup by reducing prefilling time 28× for a 128K-length context. For the detailed techniques of APE, please refer to Part 2.
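
To make this workflow concrete, below is a minimal sketch of the caching pattern, assuming a toy single-head stand-in for the model's key/value projections; the helper names (`encode_context_kv`, `kv_cache`) are illustrative and not part of the APE codebase. Each context is encoded once in isolation, its KV states are cached, and a request simply loads and concatenates the cached states instead of prefilling the combined contexts.

```python
import torch

torch.manual_seed(0)
D = 64  # toy head dimension

# Toy stand-in for a model's per-layer K/V projections.
W_k = torch.randn(D, D) / D**0.5
W_v = torch.randn(D, D) / D**0.5

def encode_context_kv(token_embs: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Encode one context independently and return its (K, V) states.

    With parallel encoding, every context is encoded as if it directly
    followed the shared prefix, so its cached states never depend on which
    other contexts appear in a request (positions are reused across chunks).
    """
    return token_embs @ W_k, token_embs @ W_v

# Offline: pre-compute and cache KV states for each context chunk once.
contexts = {f"doc_{i}": torch.randn(128, D) for i in range(4)}  # toy "documents"
kv_cache = {name: encode_context_kv(embs) for name, embs in contexts.items()}

# Online: a request selects some contexts; we only *load* their cached states
# and concatenate them -- no prefill over the combined sequence is needed.
selected = ["doc_1", "doc_3"]
K = torch.cat([kv_cache[name][0] for name in selected], dim=0)
V = torch.cat([kv_cache[name][1] for name in selected], dim=0)
print(K.shape, V.shape)  # (256, 64) each -- ready for decoding without re-encoding
```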

Context-Augmented Generation and APE

CAG leverages additional contexts to improve the quality of LLM responses to queries. Sequential encoding prefills the selected context chunks as one long sequence at inference time, suffering from high latency due to on-the-fly re-encoding and low accuracy due to limits on chunk size and count. Parallel encoding offers an alternative that pre-caches more and longer contexts and shares the same positions across them, but it leads to worse performance. We propose Adaptive Parallel Encoding (APE) to re-align the attention weight distribution of parallel encoding with that of sequential encoding via three steps: a shared prefix, attention temperature, and a scaling factor. APE accommodates more and longer contexts without re-encoding or fine-tuning, leading to fast and accurate CAG systems in practice.
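
The alignment step can be pictured as a small change to how attention from different sources is combined at decode time. The sketch below is one reading of that idea rather than the released implementation: the query attends to the shared prefix as usual, while logits from each parallel-encoded context are sharpened by an attention temperature `T` and re-weighted by a scaling factor `s`, with a single normalization shared across all sources. The function name `ape_attention` and the default values of `T` and `s` are placeholders, not the paper's tuned settings.

```python
import torch

def ape_attention(q, k_prefix, v_prefix, ctx_kvs, T=0.9, s=0.9):
    """Combine attention over a shared prefix and parallel-encoded contexts.

    q:        (d,)      query vector for the current decoding step
    k_prefix: (n_p, d)  keys of the shared prefix (sequentially encoded)
    v_prefix: (n_p, d)  values of the shared prefix
    ctx_kvs:  list of (K_i, V_i), one per independently encoded context
    T:        attention temperature applied to context logits (T < 1 sharpens)
    s:        scaling factor re-weighting the contexts' total attention mass
    (T and s here are placeholder defaults, not values from the paper.)
    """
    d = q.shape[-1]
    # Unnormalized softmax terms for the shared prefix (standard attention).
    w_prefix = torch.exp(q @ k_prefix.T / d**0.5)        # (n_p,)
    num = w_prefix @ v_prefix
    den = w_prefix.sum()
    # Each context contributes temperature-sharpened, rescaled weights.
    for K_i, V_i in ctx_kvs:
        w_i = s * torch.exp(q @ K_i.T / (T * d**0.5))    # (n_i,)
        num = num + w_i @ V_i
        den = den + w_i.sum()
    return num / den  # shared normalization across prefix and all contexts

# Toy usage
torch.manual_seed(0)
d = 64
q = torch.randn(d)
k_p, v_p = torch.randn(8, d), torch.randn(8, d)
ctxs = [(torch.randn(32, d), torch.randn(32, d)) for _ in range(3)]
print(ape_attention(q, k_p, v_p, ctxs).shape)  # torch.Size([64])
```

Because all sources share one denominator, the combination behaves like a single softmax over the concatenated (adjusted) logits, which is what keeps parallel encoding consistent with the sequential attention it replaces.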

How Does APE Accelerate Inference?

APE reduces the prefilling time to nearly zero.


APE increases the cache hit rate among requests.

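One way to see why the hit rate improves: under sequential encoding, a chunk's KV states depend on everything that precedes it, so a cache entry can only be reused when the entire prefix matches; with per-chunk (APE-style) encoding, an entry is keyed by the chunk alone and is reusable by any request that retrieves it. The comparison below is a simplified illustration with made-up requests, not a measurement from the paper.

```python
# Simplified illustration: cache keys under sequential vs. parallel encoding.
requests = [
    ["doc_A", "doc_B"],
    ["doc_B", "doc_C"],
    ["doc_A", "doc_C", "doc_B"],
]

def hit_rate(make_key) -> float:
    cache, hits, total = set(), 0, 0
    for docs in requests:
        for i, doc in enumerate(docs):
            key = make_key(docs, i)
            hits += key in cache
            total += 1
            cache.add(key)
    return hits / total

# Sequential encoding: a chunk's KV states depend on its full prefix,
# so the cache key must include every chunk that precedes it.
seq_rate = hit_rate(lambda docs, i: tuple(docs[: i + 1]))
# Parallel (APE) encoding: each chunk is encoded independently after the
# shared prefix, so the chunk id alone identifies its cached KV states.
par_rate = hit_rate(lambda docs, i: docs[i])

print(f"sequential hit rate: {seq_rate:.0%}")  # 14% -- only identical prefixes reuse
print(f"parallel   hit rate: {par_rate:.0%}")  # 57% -- any repeated chunk is reusable
```

This is also why per-chunk caches are attractive for serving systems: the same document cache can be shared across users and requests, whereas a sequential cache only helps requests with an identical prefix.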

How Does APE Maintain Performance?

APE maintains 98% accuracy on ChatRAG-Bench.


APE improves RAG performance on LongBench with more contexts.


APE keeps 93% accuracy on ICL tasks requiring advanced abilities.


APE handles hundreds of contexts in parallel without degradation.


Conclusion and Future Work

This work explores the potential of parallel encoding in CAG scenarios: it can pre-cache KV states for fast inference and reuse positions for long contexts, but it leads to worse performance when applied directly. To address this issue, we propose APE, a training-free method that enables accurate, fast, and long-context CAG systems. APE achieves this by aligning the attention weight distribution of parallel encoding with that of sequential encoding via three steps: a shared prefix, attention temperature, and a scaling factor. APE improves both accuracy and efficiency on RAG and ICL tasks while scaling to process hundreds of chunks in parallel. Future research directions include automating hyperparameter selection for diverse inputs, developing APE-cache serving systems, and extending APE to multimodal scenarios.

BibTeX

@inproceedings{yang2025ape,
  title={APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding},
  author={Yang, Xinyu and Chen, Tianqi and Chen, Beidi},
  booktitle={The Thirteenth International Conference on Learning Representations (ICLR)},
  year={2025}
}