InfoFlow KV — Information-Flow-Aware KV Recomputation for Long Context

01 — The setting

Cache each document once — but combining caches breaks

1. Positions collide: every chunk restarts at 0

2. Cross-document attention is never computed
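A minimal sketch of the position collision, assuming three cached chunks with hypothetical lengths 4, 3 and 5. The helper names `local_positions` and `global_positions` are illustrative and do not come from the paper.

```python
from itertools import accumulate

def local_positions(chunk_lengths):
    """Position ids as stored when each chunk is cached on its own."""
    return [list(range(n)) for n in chunk_lengths]

def global_positions(chunk_lengths):
    """Position ids the same tokens would get in one concatenated prefill."""
    offsets = [0] + list(accumulate(chunk_lengths))[:-1]
    return [list(range(off, off + n)) for off, n in zip(offsets, chunk_lengths)]

chunk_lengths = [4, 3, 5]  # hypothetical chunk sizes
local = local_positions(chunk_lengths)
glob = global_positions(chunk_lengths)

# Every independently cached chunk claims position 0.
collisions = sum(1 for chunk in local if chunk[0] == 0)

print("local :", local)
print("global:", glob)
print("chunks starting at 0:", collisions)
```

In the concatenated layout each index is used exactly once, which is what a single prefill over all documents would have produced.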

02 — Existing approaches

Recompute a small subset to patch back cross-doc attention

A · Cached

Block-diagonal. Each chunk attends only within itself.

B · Recompute

Patch the gap. ~15% of tokens recomputed under the full mask.

C · Full causal

The reference. Every token attends to all earlier tokens.
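The three panels can be sketched as boolean attention masks. This is an illustration on a toy 12-token context, assuming hypothetical chunk lengths and a hand-picked pair of recomputed tokens; it only models which query/key pairs are visible, not the KV values themselves.

```python
def block_diagonal_mask(chunk_lengths):
    """Panel A: mask[i][j] is True when token i may attend to token j."""
    n = sum(chunk_lengths)
    chunk_of = [c for c, length in enumerate(chunk_lengths) for _ in range(length)]
    return [[j <= i and chunk_of[i] == chunk_of[j] for j in range(n)]
            for i in range(n)]

def full_causal_mask(n):
    """Panel C: every token attends to itself and all earlier tokens."""
    return [[j <= i for j in range(n)] for i in range(n)]

def patch_rows(cached, reference, selected):
    """Panel B: rows of recomputed tokens are taken from the full causal mask."""
    return [reference[i] if i in selected else cached[i]
            for i in range(len(cached))]

def count_true(mask):
    return sum(sum(row) for row in mask)

chunk_lengths = [4, 3, 5]                      # hypothetical chunk sizes
cached = block_diagonal_mask(chunk_lengths)
reference = full_causal_mask(sum(chunk_lengths))
patched = patch_rows(cached, reference, selected={6, 11})  # 2 of 12 tokens

missing = count_true(reference) - count_true(cached)
print("attention pairs missing from the cached mask:", missing)
print("pairs visible after patching two rows:", count_true(patched))
```

Which rows to patch is the whole question: the budget fixes how many rows can be restored, and the selection rule decides which ones.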

EPIC

attention sinks

Always recomputes attention-sink tokens

Fixed set — blind to the query.

CacheBlend

shallow deviation

Picks tokens whose shallow-layer KV drifts most

Early-layer drift ≠ true influence on the answer.

The gap

Neither captures whether a token is structurally positioned to influence downstream decoding under the global causal mask.
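The contrast between the two selection styles can be sketched as follows. This is a simplified illustration, not either system's implementation: it assumes attention sinks sit at the first tokens of each chunk, and the `deviation` values are made-up numbers standing in for a shallow-layer KV drift measurement.

```python
def sink_selection(chunk_lengths, sinks_per_chunk):
    """Fixed rule: the first few tokens of every chunk. Note: no query argument."""
    selected, start = [], 0
    for n in chunk_lengths:
        selected.extend(range(start, start + min(sinks_per_chunk, n)))
        start += n
    return selected

def deviation_selection(deviation, budget):
    """Score rule: keep the `budget` fraction of tokens with the largest drift."""
    k = max(1, round(budget * len(deviation)))
    ranked = sorted(range(len(deviation)), key=lambda i: deviation[i], reverse=True)
    return sorted(ranked[:k])

chunk_lengths = [4, 3, 5]  # hypothetical chunk sizes
# Hypothetical per-token drift values, one per token of the 12-token context.
deviation = [0.1, 0.2, 0.05, 0.3, 0.9, 0.1, 0.2, 0.7, 0.1, 0.05, 0.2, 0.15]

fixed = sink_selection(chunk_lengths, sinks_per_chunk=1)
drifted = deviation_selection(deviation, budget=0.15)
print("fixed sink set   :", fixed)
print("largest-drift set:", drifted)
```

Neither function looks at where a token sits relative to the prompt under the global causal mask, which is the gap described above.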

03 — InfoFlow KV

Recompute only the tokens that carry information to the answer

[Interactive figure: the InfoFlow KV pipeline, shown in four stages]
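This page does not spell out the scoring rule, so the sketch below is only one plausible reading of an attention-norm signal and should not be taken as the paper's formula. It assumes a hypothetical score equal to the attention weight a context token receives from the prompt, multiplied by the norm of that token's value vector, followed by a top-k cut at the recompute budget. All vectors are toy data.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_norm_scores(prompt_queries, keys, values):
    """Hypothetical score: attention mass from the prompt times the value norm."""
    dim = len(keys[0])
    scores = [0.0] * len(keys)
    for q in prompt_queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        for j, weight in enumerate(softmax(logits)):
            value_norm = math.sqrt(sum(v * v for v in values[j]))
            scores[j] += weight * value_norm
    return scores

def select_top(scores, budget):
    """Indices of the top `budget` fraction of tokens, in document order."""
    k = max(1, round(budget * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Toy context of four tokens (2-D keys and values) and a one-token prompt.
keys = [[1.0, 0.0], [0.0, 1.0], [4.0, 0.0], [0.0, 0.5]]
values = [[1.0, 0.0], [1.0, 1.0], [2.0, 0.0], [0.1, 0.1]]
prompt_queries = [[2.0, 0.0]]

scores = attention_norm_scores(prompt_queries, keys, values)
selected = select_top(scores, budget=0.25)
print("scores  :", [round(s, 3) for s in scores])
print("selected:", selected)
```

Because the prompt comes last, it can attend to every context token, so the score is computed under the same global causal mask the model uses at decode time.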

04 — InfoFlow KV · Reorder

For independent evidence, reorder before recomputing

[Interactive figure: the reorder procedure, shown in four stages]

05 — Why geometry matters

How we place RoPE positions changes which tokens get selected

Each cached chunk was encoded locally. Where we re-place it on the absolute index decides whether positions collide — or match what the model saw at prefill.
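Why placement matters follows from a standard RoPE property: the query/key score depends only on the offset between the two positions. The sketch below uses a single 2-D frequency pair with an arbitrary `theta` and made-up vectors; real models use many pairs with different frequencies.

```python
import math

def rope_rotate(vec, position, theta=0.5):
    """Rotate a 2-D vector by position * theta (one RoPE frequency pair)."""
    angle = position * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def rope_score(q, q_pos, k, k_pos):
    """Dot product of the rotated query and rotated key."""
    qr, kr = rope_rotate(q, q_pos), rope_rotate(k, k_pos)
    return qr[0] * kr[0] + qr[1] * kr[1]

q, k = (1.0, 0.5), (0.3, -0.8)  # toy query and key

# Same offset (3), different absolute indices: the score is unchanged.
same_offset_a = rope_score(q, 10, k, 7)
same_offset_b = rope_score(q, 110, k, 107)

# Key left at its local index 7 while the query sits at global index 110:
# the offset is now 103 instead of 3, so the score changes.
collided = rope_score(q, 110, k, 7)

print(round(same_offset_a, 6), round(same_offset_b, 6), round(collided, 6))
```

A chunk left at its local indices therefore presents the prompt with offsets the model would not have seen in a single contiguous prefill.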

HL–HP

head local · head prompt

high-freq · prompt near

TL–TP

tail local · tail prompt

low-freq · prompt near

HL–TP

head local · tail prompt

split freq · prompt far

GLOBAL

true absolute indices · adopted

contiguous index · inference-aligned

RoPE config	Placement	2WikiMQA	MuSiQue	HotpotQA	NarrativeQA
HL–HP	both at head	0.4455	0.2871	0.5529	0.1481
TL–TP	both at tail	0.4458	0.2970	0.5693	0.1923
HL–TP	split range	0.4722	0.3072	0.5651	0.2106
GLOBAL	inference-consistent	0.5019	0.3386	0.5954	0.2288

Qwen3-14B · passage-split · F1, higher is better. GLOBAL wins on every benchmark.

06 — See it select

Which tokens get recomputed?

Selection strategy

Prompt › Who directed the film adapted from the novel that won the 1985 Booker Prize?

Legend: chunk 1 · chunk 2 · chunk 3 · chunk 4 · recomputed

07 — Results

Consistent gains at a 15% budget

LongBench F1 · 15% budget · best and second-best are ranked among recompute methods only

Method	Fixed chunk · 2048				Passage split
Method	2Wiki	MuSiQue	Hotpot	NarrQA	2Wiki	MuSiQue	Hotpot	NarrQA
Qwen3-14B
Baseline	0.5161	0.3718	0.5922	0.1654	0.5161	0.3718	0.5922	0.1654
No Recompute	0.3948	0.1342	0.4633	0.1137	0.1162	0.1012	0.3059	0.2078
Ours	0.5089	0.3384	0.5967	0.2110	0.5019	0.3386	0.5954	0.2288
Ours + Reorder	0.4773	0.2872	0.5053	0.2251	0.5058	0.3285	0.5972	0.2310
CacheBlend	0.4417	0.2611	0.5352	0.2170	0.4330	0.2765	0.5738	0.2197
EPIC	0.4321	0.2368	0.5284	0.1999	0.3697	0.2480	0.5443	0.2291
Llama-3.1-8B
Baseline	0.4588	0.3285	0.5410	0.1862	0.4588	0.3285	0.5410	0.1862
No Recompute	0.3969	0.2523	0.4671	0.2639	0.3066	0.2462	0.4253	0.2810
Ours	0.4635	0.3104	0.5150	0.2891	0.4208	0.2996	0.5123	0.3141
Ours + Reorder	0.4455	0.3044	0.5053	0.2957	0.4417	0.2793	0.5202	0.3243
CacheBlend	0.4131	0.2823	0.4720	0.2685	0.3976	0.2708	0.4872	0.3095
EPIC	0.4087	0.2638	0.4755	0.2701	0.3885	0.2852	0.4898	0.3102
GLM-4-9B
Baseline	0.5253	0.3946	0.6003	0.3264	0.5253	0.3946	0.6003	0.3264
No Recompute	0.4370	0.2833	0.5024	0.2758	0.3523	0.2474	0.4388	0.3181
Ours	0.5064	0.3688	0.5739	0.3239	0.4890	0.3758	0.5614	0.3100
Ours + Reorder	0.5176	0.3786	0.5820	0.3140	0.4666	0.3635	0.5567	0.3180
CacheBlend	0.4226	0.2624	0.5177	0.2970	0.3757	0.3188	0.5164	0.3179
EPIC	0.4401	0.2902	0.5362	0.2962	0.4521	0.3091	0.5481	0.3142

Ours or Ours + Reorder is the best recompute method in every column · largest gains on multi-hop reasoning. Baseline & No Recompute are references, not competitors.
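One way to read the table is to rank only the four recompute methods in each column. The sketch below does this for the Qwen3-14B rows; the numbers are copied from the table, while the column labels and the helper `best_per_column` are illustrative names.

```python
# F1 scores copied from the Qwen3-14B rows of the table above.
COLUMNS = ["fixed/2Wiki", "fixed/MuSiQue", "fixed/Hotpot", "fixed/NarrQA",
           "split/2Wiki", "split/MuSiQue", "split/Hotpot", "split/NarrQA"]

RECOMPUTE_METHODS = {
    "Ours":           [0.5089, 0.3384, 0.5967, 0.2110, 0.5019, 0.3386, 0.5954, 0.2288],
    "Ours + Reorder": [0.4773, 0.2872, 0.5053, 0.2251, 0.5058, 0.3285, 0.5972, 0.2310],
    "CacheBlend":     [0.4417, 0.2611, 0.5352, 0.2170, 0.4330, 0.2765, 0.5738, 0.2197],
    "EPIC":           [0.4321, 0.2368, 0.5284, 0.1999, 0.3697, 0.2480, 0.5443, 0.2291],
}

def best_per_column(scores):
    """Name of the highest-scoring method in each column."""
    n_cols = len(next(iter(scores.values())))
    return [max(scores, key=lambda m: scores[m][c]) for c in range(n_cols)]

winners = best_per_column(RECOMPUTE_METHODS)
for column, winner in zip(COLUMNS, winners):
    print(f"{column:14s} {winner}")
```

The same check can be repeated for the Llama-3.1-8B and GLM-4-9B rows by swapping in their values.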

Qwen3-VL-8B · five multimodal QA benchmarks · larger k = more visual chunks, stronger mismatch

Method	RealWorldQA	ChartQA	OCRBench	HRBench4K	InfoVQA
k = 0 · no chunk-wise recomputation
Baseline	0.7059	83.08	878	0.7488	83.07
k = 2 · two visual chunks
No Recompute	0.6745	71.32	839	0.7275	71.64
Ours	0.6810	73.48	842	0.7263	73.07
CacheBlend	0.6758	71.72	845	0.7238	72.00
EPIC	0.6745	70.92	836	0.7250	71.51
k = 4 · four visual chunks
No Recompute	0.6588	62.00	781	0.6888	57.88
Ours	0.6667	65.68	802	0.6913	62.23
CacheBlend	0.6562	62.68	786	0.6863	58.56
EPIC	0.6549	62.48	785	0.6725	57.82

Higher is better · OCRBench scored /1000, ChartQA & InfoVQA in %, others 0–1. Gains widen with k — biggest wins on ChartQA & InfoVQA, which bind dispersed visual elements to text.
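The widening gap can be checked directly from the table. The sketch below copies the No Recompute and Ours scores for ChartQA and InfoVQA and computes the gain at each k; `gain_over_no_recompute` is an illustrative helper name.

```python
# (No Recompute, Ours) scores in %, copied from the table above, keyed by k.
SCORES = {
    "ChartQA": {2: (71.32, 73.48), 4: (62.00, 65.68)},
    "InfoVQA": {2: (71.64, 73.07), 4: (57.88, 62.23)},
}

def gain_over_no_recompute(benchmark, k):
    """Percentage-point gain of Ours over No Recompute."""
    no_recompute, ours = SCORES[benchmark][k]
    return round(ours - no_recompute, 2)

gains = {b: {k: gain_over_no_recompute(b, k) for k in (2, 4)} for b in SCORES}
for benchmark, by_k in gains.items():
    print(benchmark, by_k)
```

On both benchmarks the gain at k = 4 is larger than at k = 2.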

08 — Efficiency

Faster than ring attention — and more accurate

Setting: no cache is available and the context is computed on the fly. With sequence-parallel prefill on 4× H100, we recompute only the information-critical tokens.

Context	Single-GPU	Ring Attn	Ours	Ours vs. single-GPU prefill
8,192 tokens	567	248	232	2.44×
16,384 tokens	1286	708	428	3.01×
32,768 tokens	3190	2350	914	3.49×

@ 32K vs. ring attention: 2.57×

Communicate only the selected subset, not the whole KV cache.
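The reported ratios can be recomputed from the values shown in the chart. The page does not state the unit of those values, so the sketch treats them as plain numbers; ratios recomputed from rounded values may differ from the reported ones in the last digit.

```python
# Values copied from the chart above; their unit is not stated on the page.
TIMINGS = {
    8192:  {"single_gpu": 567,  "ring": 248,  "ours": 232},
    16384: {"single_gpu": 1286, "ring": 708,  "ours": 428},
    32768: {"single_gpu": 3190, "ring": 2350, "ours": 914},
}

def speedup(reference, ours):
    """How many times faster `ours` is than `reference`."""
    return reference / ours

for tokens, t in TIMINGS.items():
    vs_single = speedup(t["single_gpu"], t["ours"])
    vs_ring = speedup(t["ring"], t["ours"])
    print(f"{tokens:>6} tokens: {vs_single:.2f}x vs single-GPU, {vs_ring:.2f}x vs ring")
```

The advantage over ring attention grows with context length, from a small margin at 8,192 tokens to the 2.57× reported at 32,768 tokens.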

F1 vs. ring attention · same SP setting

Benchmark	Ring Attn	Ours	Δ F1
2WikiMQA	48.9	51.5	+2.55
HotpotQA	56.5	57.8	+1.30
MuSiQue	32.1	33.5	+1.44

09 — Contributions

Select better, recompute less

01

InfoFlow — a simple attention-norm signal

02

A reorder strategy from RoPE-geometry analysis

03

Validated on LLM & VLM — best or second on every benchmark

04

An on-the-fly efficiency win — up to 3.49× at 32K

10 — Cite

BibTeX

@article{teng2026infoflowkv,
  title   = {InfoFlow KV: Information-Flow-Aware KV Recomputation for Long Context},
  author  = {Teng, Xin and Zhang, Canyu and Zheng, Shaoyi and
             Zhuo, Danyang and Zhou, Tianyi and Wang, Shengjie},
  journal = {arXiv preprint arXiv:2603.05353},
  year    = {2026}
}