DualPath lifts throughput as RDMA eases KV-cache I/O

WikiBit 2026-02-27 16:01

DualPath inference: dual load paths relieve KV-cache I/O bottlenecks

A new paper introduces DualPath inference, a system that nearly doubles an agent's throughput by tackling the KV-cache bottleneck in multi-round agentic LLM workloads.

According to DeepSeek's arXiv paper (https://arxiv.org/abs/2602.21548?utm_source=openai), DualPath adds a second load path: storage loads into the decode engine, which then uses RDMA (Remote Direct Memory Access) to transfer KV data to the prefill engine. The report indicates this rebalances bandwidth and relieves the KV-cache I/O bottleneck, delivering up to ~1.87× throughput in offline tests and ~1.96× on average in online service, without breaching latency SLOs. Peak gains assume very high cache reuse, with KV-cache hit rates around or above 95%.
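The two paths described above can be sketched as a simple routing decision. This is a minimal illustration, not the paper's actual scheduler: the function name and the bandwidth figures in the usage lines are assumptions for the example.

```python
# Illustrative sketch of a dual-path KV-cache load decision.
# Path A: storage -> prefill engine directly.
# Path B: storage -> decode engine -> (RDMA) -> prefill engine,
#         bottlenecked by the slower of its two hops.
# All names and numbers here are assumptions; the real system's
# scheduling is more sophisticated.

def choose_kv_path(storage_to_prefill_gbps: float,
                   storage_to_decode_gbps: float,
                   rdma_decode_to_prefill_gbps: float) -> str:
    """Pick whichever path delivers KV blocks to the prefill engine faster."""
    path_a = storage_to_prefill_gbps
    path_b = min(storage_to_decode_gbps, rdma_decode_to_prefill_gbps)
    return "direct" if path_a >= path_b else "via-decode-rdma"

# When the direct storage-to-prefill link is saturated, the second
# path through the decode engine's NIC wins:
print(choose_kv_path(5.0, 20.0, 50.0))
print(choose_kv_path(40.0, 20.0, 50.0))
```

The point of the second path is exactly this kind of rebalancing: when the direct link is the bottleneck, spare decode-side NIC and RDMA bandwidth can carry the load instead.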

Why it matters: higher throughput without breaking latency SLOs

For online inference, throughput gains are only meaningful if Time to First Token and token-to-token latency remain stable. The evaluations emphasize preserving SLOs while increasing aggregate tokens served.
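That "throughput only counts if latency holds" criterion amounts to a simple gate. The thresholds below (500 ms TTFT, 50 ms per output token) are illustrative assumptions, not values from the paper:

```python
# Minimal SLO gate: a throughput gain is only admissible if both
# latency metrics stay within budget.
# The 500 ms / 50 ms thresholds are illustrative assumptions.

def meets_slo(ttft_ms: float, tpot_ms: float,
              ttft_slo_ms: float = 500.0, tpot_slo_ms: float = 50.0) -> bool:
    """True if Time to First Token and time-per-output-token
    both stay within their service-level objectives."""
    return ttft_ms <= ttft_slo_ms and tpot_ms <= tpot_slo_ms

print(meets_slo(420.0, 38.0))   # within both budgets
print(meets_slo(420.0, 65.0))   # inter-token latency over budget
```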

“Computation is cheap; data movement is expensive,” said Jeff Dean.

Agentic, multi‑turn workloads that repeatedly draw from past context benefit most, because DualPath reduces stalls when fetching KV cache from external storage. This shifts the limiting factor away from storage I/O toward better‑balanced compute and network use.

In production, the headline result is higher tokens‑per‑second per cluster without measurable TTFT regression in the reported tests. That combination supports steadier user experience while raising capacity.

Organizations should still validate under their own mixes, as realized gains depend on cache reuse patterns, sequence lengths, and interconnect quality.

Deployment checklist, hardware needs, and when DualPath helps less

RDMA-capable interconnect, robust storage bandwidth, and ≥95% KV-cache hit rates

A practical rollout expects an RDMA‑capable interconnect, solid storage throughput to feed caches, and very high reuse so KV‑cache hit rates approach the ~95% mark cited in evaluations. Decode‑engine NIC capacity should be provisioned to absorb the added transfer path.

Lower cache reuse or weaker networking may reduce realized gains

Workloads with sparse history reuse, fragmented sessions, or weaker networking will see smaller uplift. Absent robust RDMA, added transfers can shift bottlenecks rather than remove them.
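A crude back-of-envelope model makes the sensitivity to cache reuse concrete. The ~1.87× ceiling and ~95% hit-rate figure come from the reported evaluations, but this linear interpolation is entirely an illustrative assumption, not the paper's performance model:

```python
# Toy model of realized speedup vs. KV-cache hit rate.
# Assumption (ours, not the paper's): gains scale roughly linearly
# with hit rate, reaching the reported ~1.87x ceiling near the
# ~95% hit rate cited in the evaluations.

def expected_speedup(hit_rate: float, max_speedup: float = 1.87) -> float:
    """Interpolate from 1.0x (no reuse, nothing to offload)
    up to max_speedup at a 95% hit rate."""
    frac = min(hit_rate / 0.95, 1.0)
    return 1.0 + (max_speedup - 1.0) * frac

for hr in (0.50, 0.80, 0.95):
    print(f"hit_rate={hr:.0%} -> ~{expected_speedup(hr):.2f}x")
```

Even under this optimistic linear assumption, a workload with 50% reuse captures only about half the headline gain, which is why validating against your own traffic matters.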

At the time of this writing, NVIDIA (NVDA) traded near 186.18 in overnight action after a 5.49% decline to 184.89 at the close, based on data from Nasdaq.

FAQ about DualPath inference

How much throughput improvement does DualPath deliver in online vs offline inference workloads?

Reported gains reached about 1.87× in offline tests and roughly 1.96× on average in online service, while adhering to stated service-level objectives in the paper's evaluations.

Does DualPath affect Time to First Token (TTFT) and token-to-token latency under real production load?

The reported evaluations indicate TTFT and token-to-token latency remained stable under load, with throughput increasing via DualPath's second transfer path and RDMA-assisted balancing of bandwidth.

