Trial
Why?
- Large LLM-style workloads separate naturally into a batched, high-throughput "prefill" phase (context encoding) and a latency-sensitive "decode" phase (autoregressive token generation); a minimal sketch of this split follows the list.
- Disaggregation lets us specialize hardware: dense GPU farms for prefill throughput and low-latency nodes (small GPUs, CPUs, or inference accelerators) for decode, improving utilization and cost-efficiency.
- Network and protocol choices (RDMA/Ultra Ethernet, efficient RPC, context-caching strategies) are critical to keeping end-to-end latency predictable.
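
A minimal sketch of the prefill/decode split described above, using an in-process dictionary as a stand-in for a shared KV-cache store; class and field names are illustrative, not an existing API:

```python
from dataclasses import dataclass
import time
import uuid


@dataclass
class PrefillResult:
    """Handle to an encoded context produced by the prefill pool."""
    context_id: str      # key under which the KV cache is stored
    prompt_tokens: int   # number of tokens encoded during prefill
    created_at: float    # timestamp, useful later for cache expiry


class PrefillWorker:
    """Throughput-oriented stage: encodes prompts into cached contexts."""

    def __init__(self, cache: dict):
        self.cache = cache  # stands in for a shared KV-cache store

    def prefill(self, prompt: str) -> PrefillResult:
        context_id = uuid.uuid4().hex
        kv_cache = {"tokens": prompt.split()}  # placeholder for the real forward pass
        self.cache[context_id] = kv_cache
        return PrefillResult(context_id, len(kv_cache["tokens"]), time.time())


class DecodeWorker:
    """Latency-oriented stage: generates tokens against a cached context."""

    def __init__(self, cache: dict):
        self.cache = cache

    def decode(self, handle: PrefillResult, max_new_tokens: int) -> list[str]:
        _kv_cache = self.cache[handle.context_id]  # would seed the autoregressive loop
        return [f"tok{i}" for i in range(max_new_tokens)]  # placeholder generation


if __name__ == "__main__":
    shared_cache: dict = {}
    handle = PrefillWorker(shared_cache).prefill("the quick brown fox")
    print(DecodeWorker(shared_cache).decode(handle, max_new_tokens=4))
```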
What?
- Prototype disaggregated serving architectures: pooled prefill services that export encoded contexts to dedicated decode servers via low-latency protocols (a toy wire format is sketched after this list).
- Standardize the transfer protocol, context serialization, request coalescing, and backpressure signaling to reduce tail latency and jitter (see the coalescing sketch below).
- Benchmark end-to-end latency, throughput, and cost across placement strategies (a minimal harness follows the list); integrate disaggregation patterns into GPU orchestration and autoscaling policies.
- Track operational concerns: context cache invalidation (a TTL-based sketch closes the section), auth/tenancy, and cost allocation across prefill/decode pools.
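
A toy wire format for the context export mentioned in the first bullet, assuming a fixed-size header (context id, tenant id, payload length) in front of a compressed KV-cache payload; the field widths and compression choice are placeholders, not a proposed standard:

```python
import struct
import zlib

# Fixed header: 16-byte context id, 16-byte tenant id, 4-byte payload length.
# Ids longer than 16 bytes are truncated by struct; shorter ones are null-padded.
HEADER = struct.Struct("!16s16sI")


def pack_context(context_id: bytes, tenant_id: bytes, kv_payload: bytes) -> bytes:
    """Serialize one context-transfer message; payload is compressed for the wire."""
    body = zlib.compress(kv_payload)
    return HEADER.pack(context_id, tenant_id, len(body)) + body


def unpack_context(message: bytes) -> tuple[bytes, bytes, bytes]:
    """Inverse of pack_context; validates the declared payload length."""
    context_id, tenant_id, length = HEADER.unpack_from(message)
    body = message[HEADER.size:]
    if len(body) != length:
        raise ValueError("truncated context payload")
    return context_id.rstrip(b"\x00"), tenant_id.rstrip(b"\x00"), zlib.decompress(body)


if __name__ == "__main__":
    msg = pack_context(b"ctx-123", b"tenant-a", b"\x00" * 4096)
    cid, tid, payload = unpack_context(msg)
    print(cid, tid, len(payload), "bytes on wire:", len(msg))
```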
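A sketch of request coalescing with an explicit backpressure signal, as referenced in the second bullet: a bounded queue rejects new work instead of growing without limit, and the decode loop drains it in small batches. The queue depth and batch size here are arbitrary:

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DecodeRequest:
    context_id: str
    max_new_tokens: int


class CoalescingQueue:
    def __init__(self, max_depth: int = 64, batch_size: int = 8):
        self.max_depth = max_depth    # admission limit: beyond this, shed load
        self.batch_size = batch_size  # how many requests one decode step serves
        self._pending: deque[DecodeRequest] = deque()

    def submit(self, request: DecodeRequest) -> bool:
        """Return False (backpressure signal) instead of growing the queue unboundedly."""
        if len(self._pending) >= self.max_depth:
            return False
        self._pending.append(request)
        return True

    def next_batch(self) -> list[DecodeRequest]:
        """Coalesce up to batch_size pending requests into one decode step."""
        batch = []
        while self._pending and len(batch) < self.batch_size:
            batch.append(self._pending.popleft())
        return batch


if __name__ == "__main__":
    q = CoalescingQueue(max_depth=4, batch_size=2)
    accepted = [q.submit(DecodeRequest(f"ctx-{i}", 16)) for i in range(6)]
    print(accepted)        # last two rejected -> caller should retry or shed load
    print(q.next_batch())  # first coalesced batch of two requests
```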
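A minimal harness for the benchmarking bullet: it times each request end to end and reports p50/p99 latency plus throughput. The run_request callable is a stand-in for a real client hitting a given placement strategy:

```python
import statistics
import time
from typing import Callable


def benchmark(run_request: Callable[[], None], n_requests: int = 200) -> dict:
    """Run n_requests sequentially and summarize latency and throughput."""
    latencies = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        run_request()
        latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - start
    latencies.sort()
    return {
        "p50_ms": 1000 * statistics.median(latencies),
        "p99_ms": 1000 * latencies[int(0.99 * (len(latencies) - 1))],
        "throughput_rps": n_requests / wall,
    }


if __name__ == "__main__":
    # Simulated request with a fixed service time; replace with a real call.
    print(benchmark(lambda: time.sleep(0.002)))
```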
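For the cache-invalidation concern, a TTL-based sketch in which stale contexts are evicted lazily on access and a tenant's entries can be invalidated explicitly; the TTL value and the tenant-prefix keying scheme are assumptions:

```python
import time
from typing import Optional


class ContextCache:
    """TTL-based context cache so decode nodes never reuse an expired context."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._entries: dict[str, tuple[float, bytes]] = {}  # id -> (stored_at, payload)

    def put(self, context_id: str, payload: bytes) -> None:
        self._entries[context_id] = (time.monotonic(), payload)

    def get(self, context_id: str) -> Optional[bytes]:
        """Return the payload if fresh, otherwise invalidate lazily and miss."""
        entry = self._entries.get(context_id)
        if entry is None:
            return None
        stored_at, payload = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._entries[context_id]  # lazy invalidation on access
            return None
        return payload

    def invalidate_prefix(self, prefix: str) -> int:
        """Explicit invalidation, e.g. when a tenant's contexts are revoked."""
        stale = [cid for cid in self._entries if cid.startswith(prefix)]
        for cid in stale:
            del self._entries[cid]
        return len(stale)


if __name__ == "__main__":
    cache = ContextCache(ttl_seconds=0.05)
    cache.put("tenant-a/ctx-1", b"kv-bytes")
    print(cache.get("tenant-a/ctx-1") is not None)  # True: still fresh
    time.sleep(0.1)
    print(cache.get("tenant-a/ctx-1"))              # None: expired and evicted
```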