Recompute only the tokens that actually carry information to the answer — selected by query-conditioned attention under inference-consistent RoPE.
Block-diagonal. Each chunk attends only within itself.
Patch the gap. ~15% of tokens recomputed under the full mask.
The reference. Every token attends to all earlier tokens.
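The gap between the block-diagonal mask and the full causal reference can be made concrete with a small NumPy sketch. The chunk sizes are hypothetical and the function names are illustrative only; this is not the authors' code.

```python
import numpy as np

def causal_mask(n):
    """Full causal mask: token i may attend to every token j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def block_diagonal_mask(chunk_lens):
    """Chunk-local causal mask: token i attends only to earlier tokens of its own chunk."""
    n = sum(chunk_lens)
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for length in chunk_lens:
        end = start + length
        mask[start:end, start:end] = np.tril(np.ones((length, length), dtype=bool))
        start = end
    return mask

chunk_lens = [4, 3, 5]                  # hypothetical chunk sizes
full = causal_mask(sum(chunk_lens))     # the reference
local = block_diagonal_mask(chunk_lens) # what chunk-wise caching actually computed
gap = full & ~local                     # cross-chunk links missing from the cache
missing_fraction = gap.sum() / full.sum()
print(f"{gap.sum()} of {full.sum()} causal links are missing ({missing_fraction:.0%})")
```

Recomputing a selected token under the full mask restores that token's row of `gap`, which is why the choice of tokens matters.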
Neither captures whether a token is structurally positioned to influence downstream decoding under the global causal mask.
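As a rough illustration of query-conditioned selection, the sketch below scores each context token by the attention mass it receives from the query tokens and keeps the top 15%. It assumes a single head and a single layer, uses random vectors in place of real activations, and the function name `select_tokens_to_recompute` is hypothetical; the actual selection rule may differ.

```python
import numpy as np

def select_tokens_to_recompute(q, k, ratio=0.15):
    """Pick context tokens that receive the most attention from the query tokens.

    q: (num_query_tokens, d) query vectors
    k: (num_context_tokens, d) context key vectors
    Returns the sorted indices of the selected context tokens.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    importance = probs.sum(axis=0)                 # attention mass per context token
    budget = max(1, int(round(ratio * k.shape[0])))
    top = np.argpartition(importance, -budget)[-budget:]
    return np.sort(top)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64))      # stand-in for query-token activations
k = rng.standard_normal((200, 64))    # stand-in for cached context keys
selected = select_tokens_to_recompute(q, k)
print(len(selected), "of", k.shape[0], "tokens selected")
```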
Each cached chunk was encoded locally. Where we re-place it on the absolute index decides whether positions collide — or match what the model saw at prefill.
| RoPE config | Placement | 2WikiMQA | MuSiQue | HotpotQA | NarrativeQA |
|---|---|---|---|---|---|
| HL–HP | both at head | 0.4455 | 0.2871 | 0.5529 | 0.1481 |
| TL–TP | both at tail | 0.4458 | 0.2970 | 0.5693 | 0.1923 |
| HL–TP | split range | 0.4722 | 0.3072 | 0.5651 | 0.2106 |
| GLOBAL | inference-consistent | 0.5019 | 0.3386 | 0.5954 | 0.2288 |
Qwen3-14B · passage-split · F1, higher is better. GLOBAL wins on every benchmark.
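The position-collision issue can be sketched in a few lines. The example assumes that "inference-consistent" means each chunk is moved to the absolute indices it would occupy in one contiguous prefill, and it uses the interleaved-pair RoPE convention; real implementations may pair dimensions differently. Because RoPE rotations compose additively, a key cached at a local position can be moved to its global position by rotating it by the offset, without re-encoding.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate x of shape (n, d), d even, by the RoPE angles of the given positions."""
    n, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)                 # (d/2,)
    angles = np.asarray(positions)[:, None] * inv_freq[None, :]  # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

chunk_lens = [4, 3, 5]                                        # hypothetical chunk sizes
offsets = np.concatenate([[0], np.cumsum(chunk_lens)[:-1]])   # start index of each chunk
local_pos = np.concatenate([np.arange(n) for n in chunk_lens])
global_pos = np.concatenate([o + np.arange(n) for o, n in zip(offsets, chunk_lens)])

rng = np.random.default_rng(0)
keys = rng.standard_normal((sum(chunk_lens), 8))   # stand-in for un-rotated keys
cached = rope(keys, local_pos)                     # each chunk encoded from position 0
replaced = rope(cached, global_pos - local_pos)    # shift cached keys by their offset
reference = rope(keys, global_pos)                 # what a contiguous prefill would see

print("local positions collide:", len(set(local_pos.tolist())) < len(local_pos))
print("shifted cache matches prefill:", np.allclose(replaced, reference))
```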
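| Method | 2Wiki (fixed chunk · 2048) | MuSiQue (fixed chunk · 2048) | Hotpot (fixed chunk · 2048) | NarrQA (fixed chunk · 2048) | 2Wiki (passage split) | MuSiQue (passage split) | Hotpot (passage split) | NarrQA (passage split) |
|---|---|---|---|---|---|---|---|---|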
| Qwen3-14B | ||||||||
| Baseline | 0.5161 | 0.3718 | 0.5922 | 0.1654 | 0.5161 | 0.3718 | 0.5922 | 0.1654 |
| No Recompute | 0.3948 | 0.1342 | 0.4633 | 0.1137 | 0.1162 | 0.1012 | 0.3059 | 0.2078 |
| Ours | 0.5089 | 0.3384 | 0.5967 | 0.2110 | 0.5019 | 0.3386 | 0.5954 | 0.2288 |
| Ours + Reorder | 0.4773 | 0.2872 | 0.5053 | 0.2251 | 0.5058 | 0.3285 | 0.5972 | 0.2310 |
| CacheBlend | 0.4417 | 0.2611 | 0.5352 | 0.2170 | 0.4330 | 0.2765 | 0.5738 | 0.2197 |
| EPIC | 0.4321 | 0.2368 | 0.5284 | 0.1999 | 0.3697 | 0.2480 | 0.5443 | 0.2291 |
| Llama-3.1-8B | ||||||||
| Baseline | 0.4588 | 0.3285 | 0.5410 | 0.1862 | 0.4588 | 0.3285 | 0.5410 | 0.1862 |
| No Recompute | 0.3969 | 0.2523 | 0.4671 | 0.2639 | 0.3066 | 0.2462 | 0.4253 | 0.2810 |
| Ours | 0.4635 | 0.3104 | 0.5150 | 0.2891 | 0.4208 | 0.2996 | 0.5123 | 0.3141 |
| Ours + Reorder | 0.4455 | 0.3044 | 0.5053 | 0.2957 | 0.4417 | 0.2793 | 0.5202 | 0.3243 |
| CacheBlend | 0.4131 | 0.2823 | 0.4720 | 0.2685 | 0.3976 | 0.2708 | 0.4872 | 0.3095 |
| EPIC | 0.4087 | 0.2638 | 0.4755 | 0.2701 | 0.3885 | 0.2852 | 0.4898 | 0.3102 |
| GLM-4-9B | ||||||||
| Baseline | 0.5253 | 0.3946 | 0.6003 | 0.3264 | 0.5253 | 0.3946 | 0.6003 | 0.3264 |
| No Recompute | 0.4370 | 0.2833 | 0.5024 | 0.2758 | 0.3523 | 0.2474 | 0.4388 | 0.3181 |
| Ours | 0.5064 | 0.3688 | 0.5739 | 0.3239 | 0.4890 | 0.3758 | 0.5614 | 0.3100 |
| Ours + Reorder | 0.5176 | 0.3786 | 0.5820 | 0.3140 | 0.4666 | 0.3635 | 0.5567 | 0.3180 |
| CacheBlend | 0.4226 | 0.2624 | 0.5177 | 0.2970 | 0.3757 | 0.3188 | 0.5164 | 0.3179 |
| EPIC | 0.4401 | 0.2902 | 0.5362 | 0.2962 | 0.4521 | 0.3091 | 0.5481 | 0.3142 |
Ours is best or second-best on every benchmark · largest gains on multi-hop reasoning. Baseline & No Recompute are references, not competitors.
| Method | RealWorldQA (0–1) | ChartQA (%) | OCRBench (/1000) | HRBench4K (0–1) | InfoVQA (%) |
|---|---|---|---|---|---|
| k = 0 · no chunk-wise recomputation | |||||
| Baseline | 0.7059 | 83.08 | 878 | 0.7488 | 83.07 |
| k = 2 · two visual chunks | |||||
| No Recompute | 0.6745 | 71.32 | 839 | 0.7275 | 71.64 |
| Ours | 0.6810 | 73.48 | 842 | 0.7263 | 73.07 |
| CacheBlend | 0.6758 | 71.72 | 845 | 0.7238 | 72.00 |
| EPIC | 0.6745 | 70.92 | 836 | 0.7250 | 71.51 |
| k = 4 · four visual chunks | |||||
| No Recompute | 0.6588 | 62.00 | 781 | 0.6888 | 57.88 |
| Ours | 0.6667 | 65.68 | 802 | 0.6913 | 62.23 |
| CacheBlend | 0.6562 | 62.68 | 786 | 0.6863 | 58.56 |
| EPIC | 0.6549 | 62.48 | 785 | 0.6725 | 57.82 |
Higher is better · OCRBench scored /1000, ChartQA & InfoVQA in %, others 0–1. Gains widen with k — biggest wins on ChartQA & InfoVQA, which bind dispersed visual elements to text.
Assume no cache is available and the context is computed on the fly. With sequence-parallel prefill on 4× H100 GPUs, we recompute only the information-critical tokens.
Communicate only the selected subset, not the whole KV cache.
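A back-of-the-envelope calculation shows why sending only the selected subset helps. The model shape, context length, and fp16 storage below are hypothetical values chosen for illustration, and the 15% ratio is taken from the recomputation budget quoted earlier; none of this is a measured result.

```python
def kv_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to hold keys and values for num_tokens across all layers."""
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Hypothetical model shape, not tied to any model in the tables above.
cfg = dict(num_layers=40, num_kv_heads=8, head_dim=128)
context_tokens = 32_768   # hypothetical context length
ratio = 0.15              # fraction of tokens selected for recomputation

full = kv_bytes(context_tokens, **cfg)
selected = kv_bytes(int(context_tokens * ratio), **cfg)

print(f"whole KV cache:  {full / 2**30:.2f} GiB")
print(f"selected subset: {selected / 2**30:.2f} GiB")
```

Under these assumptions the traffic scales linearly with the selection ratio, so the subset costs about 15% of a full KV exchange.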
@article{teng2026infoflowkv,
title = {InfoFlow KV: Information-Flow-Aware KV Recomputation for Long Context},
author = {Teng, Xin and Zhang, Canyu and Zheng, Shaoyi and
Zhuo, Danyang and Zhou, Tianyi and Wang, Shengjie},
journal = {arXiv preprint arXiv:2603.05353},
year = {2026}
}