ICML 2026 · arXiv:2603.05353

InfoFlowKV

Recompute only the tokens that actually carry information to the answer — selected by query-conditioned attention under inference-consistent RoPE.

Faster prefill @ 32K
vs. Ring Attention
Zero training
Plug-and-play
¹ New York University · ² Duke University · ³ MBZUAI · * equal contribution
01 — The setting

Cache each document once — but combining caches breaks

1. Positions collide: every chunk restarts at 0
2. Cross-document attention never computed
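The position collision can be made concrete with a few lines of Python. This is a minimal sketch, not the paper's code: the chunk lengths are hypothetical, and `local_positions` / `global_positions` are illustrative helper names introduced here.

```python
def local_positions(chunk_lengths):
    """Position ids when each chunk is prefilled on its own (every chunk starts at 0)."""
    return [list(range(n)) for n in chunk_lengths]


def global_positions(chunk_lengths):
    """Position ids the same chunks would receive inside one concatenated prompt."""
    positions, offset = [], 0
    for n in chunk_lengths:
        positions.append(list(range(offset, offset + n)))
        offset += n
    return positions


chunk_lengths = [4, 3, 5]  # hypothetical token counts for three cached documents
local = local_positions(chunk_lengths)
glob = global_positions(chunk_lengths)

# Count how many tokens share a position id with an earlier token.
flat_local = [p for chunk in local for p in chunk]
collisions = len(flat_local) - len(set(flat_local))

print("local :", local)
print("global:", glob)
print("colliding tokens under local indexing:", collisions)
```

Under local indexing, several tokens from different documents claim the same position id, while the global layout gives every token a unique slot.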

02 — Existing approaches

Recompute a small subset to patch back cross-doc attention

A · Cached

Block-diagonal. Each chunk attends only within itself.

B · Recompute

Patch the gap. ~15% of tokens recomputed under the full mask.

C · Full causal

The reference. Every token attends to all earlier tokens.
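The three panels above correspond to three attention masks. The sketch below builds each one for a toy sequence and counts the allowed (query, key) pairs. The chunk lengths and the recomputed indices are hypothetical, and `patched_mask` is an illustrative name, not a function from any released code.

```python
def full_causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_diagonal_causal_mask(chunk_lengths):
    """Causal attention restricted to tokens of the same chunk (panel A)."""
    chunk_id = [c for c, n in enumerate(chunk_lengths) for _ in range(n)]
    n = len(chunk_id)
    return [[j <= i and chunk_id[i] == chunk_id[j] for j in range(n)]
            for i in range(n)]


def patched_mask(chunk_lengths, recomputed):
    """Cached rows stay block-diagonal; recomputed rows get the full causal row (panel B)."""
    cached = block_diagonal_causal_mask(chunk_lengths)
    full = full_causal_mask(len(cached))
    return [full[i] if i in recomputed else cached[i] for i in range(len(cached))]


def count_pairs(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


chunk_lengths = [3, 3, 2]   # hypothetical: three cached chunks, 8 tokens total
recomputed = {5, 7}         # hypothetical selection of tokens to recompute

cached_pairs = count_pairs(block_diagonal_causal_mask(chunk_lengths))
patched_pairs = count_pairs(patched_mask(chunk_lengths, recomputed))
full_pairs = count_pairs(full_causal_mask(sum(chunk_lengths)))

print(cached_pairs, patched_pairs, full_pairs)
```

The gap between the cached and full counts is the cross-document attention that was never computed; recomputing a row restores that row only, which is why the choice of rows matters.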

EPIC · attention sinks

Always recomputes attention-sink tokens.
Fixed set, blind to the query.

CacheBlend · shallow deviation

Picks tokens whose shallow-layer KV drifts most.
Early-layer drift ≠ true influence on the answer.

The gap

Neither captures whether a token is structurally positioned to influence downstream decoding under the global causal mask.


03 — InfoFlow KV

Recompute only the tokens that carry information to the answer
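As a rough illustration of query-conditioned selection, the sketch below scores each context token by the attention mass it receives from the query tokens and keeps the top 15%. The scoring rule is an assumption made for illustration: this page only says that selection uses query-conditioned attention, so the exact InfoFlow signal may differ. The logits and the `select_tokens` helper are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(query_to_context_logits, budget=0.15):
    """Return sorted indices of context tokens to recompute.

    query_to_context_logits[q][t] is a hypothetical attention logit from
    query token q to context token t. Each context token is scored by the
    attention weight it receives, summed over all query tokens (an assumed
    scoring rule, not necessarily the paper's).
    """
    n_ctx = len(query_to_context_logits[0])
    scores = [0.0] * n_ctx
    for row in query_to_context_logits:
        for t, weight in enumerate(softmax(row)):
            scores[t] += weight
    k = max(1, math.ceil(budget * n_ctx))
    ranked = sorted(range(n_ctx), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


# Two query tokens attending over ten context tokens (made-up logits).
logits = [
    [0, 0, 5, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 4, 0, 0],
]
selected = select_tokens(logits, budget=0.15)
print(selected)
```

Unlike a fixed set of sink tokens, the selected indices change whenever the query changes, which is the property the comparison above turns on.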


04 — InfoFlow KV · Reorder

For independent evidence, reorder before recomputing


05 — Why geometry matters

How we place RoPE positions changes which tokens get selected

Each cached chunk was encoded locally. Where we re-place it on the absolute index decides whether positions collide — or match what the model saw at prefill.
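The reason placement matters follows from a standard RoPE property: rotations compose, and the attention score depends only on the relative distance between query and key. The sketch below checks both facts numerically. It is a minimal pure-Python RoPE with interleaved pairs and base 10000 (one common convention; real models may pair dimensions differently), and all vectors and positions are hypothetical.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply rotary position embedding to an even-length vector at `position`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)   # frequency of the pair starting at i
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


key = [0.3, -1.2, 0.8, 0.5]      # hypothetical key vector (head_dim = 4)
query = [1.0, 0.4, -0.7, 0.2]    # hypothetical query vector
local_pos, chunk_offset, query_pos = 3, 100, 110

cached_key = rope(key, local_pos)                     # stored by per-chunk prefill
replaced_key = rope(cached_key, chunk_offset)         # cached key shifted to its global slot
reference_key = rope(key, local_pos + chunk_offset)   # what full prefill would produce

q = rope(query, query_pos)
score_global = dot(q, replaced_key)    # relative distance 110 - 103 = 7
score_collide = dot(q, cached_key)     # relative distance 110 - 3 = 107
score_rel7 = dot(rope(query, 7), rope(key, 0))

print(score_global, score_rel7, score_collide)
```

Shifting a cached key by its chunk offset reproduces the key that full prefill would have produced, whereas leaving it at its local index makes the query see it at the wrong distance.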

HL–HP · head local · head prompt · high-freq · prompt near
TL–TP · tail local · tail prompt · low-freq · prompt near
HL–TP · head local · tail prompt · split freq · prompt far
GLOBAL · true absolute indices (adopted) · contiguous index · inference-aligned

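| RoPE config | 2WikiMQA | MuSiQue | HotpotQA | NarrativeQA |
| --- | --- | --- | --- | --- |
| HL–HP (both at head) | 0.4455 | 0.2871 | 0.5529 | 0.1481 |
| TL–TP (both at tail) | 0.4458 | 0.2970 | 0.5693 | 0.1923 |
| HL–TP (split range) | 0.4722 | 0.3072 | 0.5651 | 0.2106 |
| GLOBAL (inference-consistent) | 0.5019 | 0.3386 | 0.5954 | 0.2288 |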

Qwen · passage-split · F1, higher is better. GLOBAL wins on every benchmark.


06 — See it select

Which tokens get recomputed?

Prompt: Who directed the film adapted from the novel that won the 1985 Booker Prize?
Legend: chunk 1 · chunk 2 · chunk 3 · chunk 4 · recomputed

07 — Results

Consistent gains at a 15% budget

LongBench F1 · 15% budget — bold cyan = best, shaded = second (among recompute methods)
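Columns: Fixed = fixed chunk · 2048; Passage = passage split.

| Model | Method | Fixed 2Wiki | Fixed MuSiQue | Fixed Hotpot | Fixed NarrQA | Passage 2Wiki | Passage MuSiQue | Passage Hotpot | Passage NarrQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-14B | Baseline | 0.5161 | 0.3718 | 0.5922 | 0.1654 | 0.5161 | 0.3718 | 0.5922 | 0.1654 |
| Qwen3-14B | No Recompute | 0.3948 | 0.1342 | 0.4633 | 0.1137 | 0.1162 | 0.1012 | 0.3059 | 0.2078 |
| Qwen3-14B | Ours | 0.5089 | 0.3384 | 0.5967 | 0.2110 | 0.5019 | 0.3386 | 0.5954 | 0.2288 |
| Qwen3-14B | Ours + Reorder | 0.4773 | 0.2872 | 0.5053 | 0.2251 | 0.5058 | 0.3285 | 0.5972 | 0.2310 |
| Qwen3-14B | CacheBlend | 0.4417 | 0.2611 | 0.5352 | 0.2170 | 0.4330 | 0.2765 | 0.5738 | 0.2197 |
| Qwen3-14B | EPIC | 0.4321 | 0.2368 | 0.5284 | 0.1999 | 0.3697 | 0.2480 | 0.5443 | 0.2291 |
| Llama-3.1-8B | Baseline | 0.4588 | 0.3285 | 0.5410 | 0.1862 | 0.4588 | 0.3285 | 0.5410 | 0.1862 |
| Llama-3.1-8B | No Recompute | 0.3969 | 0.2523 | 0.4671 | 0.2639 | 0.3066 | 0.2462 | 0.4253 | 0.2810 |
| Llama-3.1-8B | Ours | 0.4635 | 0.3104 | 0.5150 | 0.2891 | 0.4208 | 0.2996 | 0.5123 | 0.3141 |
| Llama-3.1-8B | Ours + Reorder | 0.4455 | 0.3044 | 0.5053 | 0.2957 | 0.4417 | 0.2793 | 0.5202 | 0.3243 |
| Llama-3.1-8B | CacheBlend | 0.4131 | 0.2823 | 0.4720 | 0.2685 | 0.3976 | 0.2708 | 0.4872 | 0.3095 |
| Llama-3.1-8B | EPIC | 0.4087 | 0.2638 | 0.4755 | 0.2701 | 0.3885 | 0.2852 | 0.4898 | 0.3102 |
| GLM-4-9B | Baseline | 0.5253 | 0.3946 | 0.6003 | 0.3264 | 0.5253 | 0.3946 | 0.6003 | 0.3264 |
| GLM-4-9B | No Recompute | 0.4370 | 0.2833 | 0.5024 | 0.2758 | 0.3523 | 0.2474 | 0.4388 | 0.3181 |
| GLM-4-9B | Ours | 0.5064 | 0.3688 | 0.5739 | 0.3239 | 0.4890 | 0.3758 | 0.5614 | 0.3100 |
| GLM-4-9B | Ours + Reorder | 0.5176 | 0.3786 | 0.5820 | 0.3140 | 0.4666 | 0.3635 | 0.5567 | 0.3180 |
| GLM-4-9B | CacheBlend | 0.4226 | 0.2624 | 0.5177 | 0.2970 | 0.3757 | 0.3188 | 0.5164 | 0.3179 |
| GLM-4-9B | EPIC | 0.4401 | 0.2902 | 0.5362 | 0.2962 | 0.4521 | 0.3091 | 0.5481 | 0.3142 |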

Ours is best or second-best on every benchmark · largest gains on multi-hop reasoning. Baseline & No Recompute are references, not competitors.

Qwen3-VL-8B · five multimodal QA benchmarks · larger k = more visual chunks, stronger mismatch
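| Setting | Method | RealWorldQA | ChartQA | OCRBench | HRBench4K | InfoVQA |
| --- | --- | --- | --- | --- | --- | --- |
| k = 0 · no chunk-wise recomputation | Baseline | 0.7059 | 83.08 | 878 | 0.7488 | 83.07 |
| k = 2 · two visual chunks | No Recompute | 0.6745 | 71.32 | 839 | 0.7275 | 71.64 |
| k = 2 · two visual chunks | Ours | 0.6810 | 73.48 | 842 | 0.7263 | 73.07 |
| k = 2 · two visual chunks | CacheBlend | 0.6758 | 71.72 | 845 | 0.7238 | 72.00 |
| k = 2 · two visual chunks | EPIC | 0.6745 | 70.92 | 836 | 0.7250 | 71.51 |
| k = 4 · four visual chunks | No Recompute | 0.6588 | 62.00 | 781 | 0.6888 | 57.88 |
| k = 4 · four visual chunks | Ours | 0.6667 | 65.68 | 802 | 0.6913 | 62.23 |
| k = 4 · four visual chunks | CacheBlend | 0.6562 | 62.68 | 786 | 0.6863 | 58.56 |
| k = 4 · four visual chunks | EPIC | 0.6549 | 62.48 | 785 | 0.6725 | 57.82 |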

Higher is better · OCRBench scored /1000, ChartQA & InfoVQA in %, others 0–1. Gains widen with k — biggest wins on ChartQA & InfoVQA, which bind dispersed visual elements to text.


08 — Efficiency

Faster than ring attention — and more accurate

Assume no cache is available and the context is computed on the fly. In sequence-parallel prefill on 4× H100, we recompute only the information-critical tokens.

| Context length | Single-GPU | Ring Attn | Ours | Speedup vs. single-GPU prefill |
| --- | --- | --- | --- | --- |
| 8,192 tokens | 567 | 248 | 232 | 2.44× |
| 16,384 tokens | 1286 | 708 | 428 | 3.01× |
| 32,768 tokens | 3190 | 2350 | 914 | 3.49× |

@ 32K vs. Ring Attention: 2.57×
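The speedups above are ratios of the reported prefill timings, which a short script can check. The timing values are copied from this page (no unit is stated here, so none is assumed); the `speedups` helper is an illustrative name.

```python
# Prefill timings copied from the figures above, keyed by context length in tokens.
timings = {
    8192: {"single_gpu": 567, "ring": 248, "ours": 232},
    16384: {"single_gpu": 1286, "ring": 708, "ours": 428},
    32768: {"single_gpu": 3190, "ring": 2350, "ours": 914},
}


def speedups(row):
    """Speedup of 'ours' over each reference: reference time divided by our time."""
    return {
        "vs_single_gpu": row["single_gpu"] / row["ours"],
        "vs_ring": row["ring"] / row["ours"],
    }


for tokens, row in timings.items():
    s = speedups(row)
    print(f"{tokens:>6} tokens: {s['vs_single_gpu']:.2f}x vs single-GPU, "
          f"{s['vs_ring']:.2f}x vs ring attention")
```

The ratios agree with the reported 2.44×, 3.01×, 3.49×, and 2.57× to within rounding of the displayed timings, and they show the margin over ring attention growing with context length.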

Communicate only the selected subset, not the whole KV cache.
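A back-of-the-envelope calculation shows why sending only the selected subset helps. The model dimensions, fp16 storage, and the 15% selection ratio below are hypothetical values chosen for illustration; this section does not state the configuration used in the efficiency experiment.

```python
def kv_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Size of the key and value tensors for `tokens` tokens across all layers."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


# Hypothetical model shape and context length (assumptions, not from the source).
layers, kv_heads, head_dim = 32, 8, 128
context_tokens = 32768
budget = 0.15  # assumed fraction of tokens selected

selected_tokens = int(budget * context_tokens)
full = kv_bytes(context_tokens, layers, kv_heads, head_dim)
subset = kv_bytes(selected_tokens, layers, kv_heads, head_dim)

print(f"full KV cache  : {full / 2**30:.2f} GiB")
print(f"selected subset: {subset / 2**30:.2f} GiB ({selected_tokens} tokens)")
```

Because KV size is linear in the number of tokens, the payload shrinks in direct proportion to the selection budget, independent of the model shape.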

F1 vs. Ring Attention · same SP setting
2WikiMQA: 48.9 → 51.5 (+2.55)
HotpotQA: 56.5 → 57.8 (+1.30)
MuSiQue: 32.1 → 33.5 (+1.44)

09 — Contributions

Select better, recompute less

01. InfoFlow: a simple attention-norm signal

02. A reorder strategy from RoPE-geometry analysis

03. Validated on LLM & VLM: best or second on every benchmark

04. An on-the-fly efficiency win: up to 3.49× at 32K


10 — Cite

BibTeX

@article{teng2026infoflowkv,
  title   = {InfoFlow KV: Information-Flow-Aware KV Recomputation for Long Context},
  author  = {Teng, Xin and Zhang, Canyu and Zheng, Shaoyi and
             Zhuo, Danyang and Zhou, Tianyi and Wang, Shengjie},
  journal = {arXiv preprint arXiv:2603.05353},
  year    = {2026}
}