Trial
Why?
- Large LLM-style workloads separate naturally into a batched, high-throughput "prefill" phase (context encoding) and a latency-sensitive "decode" phase (autoregressive token generation); a minimal sketch of this split follows the list.
- Disaggregation lets us specialize hardware: dense GPU farms for prefill throughput and low-latency nodes (small GPUs, CPUs, or inference accelerators) for decode, improving utilization and cost-efficiency.
- Network and protocol choices (RDMA/Ultra Ethernet, efficient RPC, context-caching strategies) are critical to keeping end-to-end latency predictable.
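
A minimal sketch of the prefill/decode split described above, using an in-process dictionary as a stand-in for a shared KV-cache store; class and field names are illustrative, not an existing API:

```python
from dataclasses import dataclass
import time
import uuid


@dataclass
class PrefillResult:
    """Handle to an encoded context produced by the prefill pool."""
    context_id: str      # key under which the KV cache is stored
    prompt_tokens: int   # number of tokens encoded during prefill
    created_at: float    # timestamp, useful later for cache expiry


class PrefillWorker:
    """Throughput-oriented stage: encodes prompts into cached contexts."""

    def __init__(self, cache: dict):
        self.cache = cache  # stands in for a shared KV-cache store

    def prefill(self, prompt: str) -> PrefillResult:
        context_id = uuid.uuid4().hex
        kv_cache = {"tokens": prompt.split()}  # placeholder for the real forward pass
        self.cache[context_id] = kv_cache
        return PrefillResult(context_id, len(kv_cache["tokens"]), time.time())


class DecodeWorker:
    """Latency-oriented stage: generates tokens against a cached context."""

    def __init__(self, cache: dict):
        self.cache = cache

    def decode(self, handle: PrefillResult, max_new_tokens: int) -> list[str]:
        _kv_cache = self.cache[handle.context_id]  # would seed the autoregressive loop
        return [f"tok{i}" for i in range(max_new_tokens)]  # placeholder generation


if __name__ == "__main__":
    shared_cache: dict = {}
    handle = PrefillWorker(shared_cache).prefill("the quick brown fox")
    print(DecodeWorker(shared_cache).decode(handle, max_new_tokens=4))
```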
What?
- Prototype disaggregated serving architectures: pooled prefill services that export encoded contexts to dedicated decode servers via low-latency protocols (a toy wire format is sketched after this list).
- Standardize the transfer protocol, context serialization, request coalescing, and backpressure signaling to reduce tail latency and jitter (see the coalescing sketch below).
- Benchmark end-to-end latency, throughput, and cost across placement strategies (a minimal harness follows the list); integrate disaggregation patterns into GPU orchestration and autoscaling policies.
- Track operational concerns: context cache invalidation (a TTL-based sketch closes the section), auth/tenancy, and cost allocation across prefill/decode pools.
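
A toy wire format for the context export mentioned in the first bullet, assuming a fixed-size header (context id, tenant id, payload length) in front of a compressed KV-cache payload; the field widths and compression choice are placeholders, not a proposed standard:

```python
import struct
import zlib

# Fixed header: 16-byte context id, 16-byte tenant id, 4-byte payload length.
# Ids longer than 16 bytes are truncated by struct; shorter ones are null-padded.
HEADER = struct.Struct("!16s16sI")


def pack_context(context_id: bytes, tenant_id: bytes, kv_payload: bytes) -> bytes:
    """Serialize one context-transfer message; payload is compressed for the wire."""
    body = zlib.compress(kv_payload)
    return HEADER.pack(context_id, tenant_id, len(body)) + body


def unpack_context(message: bytes) -> tuple[bytes, bytes, bytes]:
    """Inverse of pack_context; validates the declared payload length."""
    context_id, tenant_id, length = HEADER.unpack_from(message)
    body = message[HEADER.size:]
    if len(body) != length:
        raise ValueError("truncated context payload")
    return context_id.rstrip(b"\x00"), tenant_id.rstrip(b"\x00"), zlib.decompress(body)


if __name__ == "__main__":
    msg = pack_context(b"ctx-123", b"tenant-a", b"\x00" * 4096)
    cid, tid, payload = unpack_context(msg)
    print(cid, tid, len(payload), "bytes on wire:", len(msg))
```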
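A sketch of request coalescing with an explicit backpressure signal, as referenced in the second bullet: a bounded queue rejects new work instead of growing without limit, and the decode loop drains it in small batches. The queue depth and batch size here are arbitrary:

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DecodeRequest:
    context_id: str
    max_new_tokens: int


class CoalescingQueue:
    def __init__(self, max_depth: int = 64, batch_size: int = 8):
        self.max_depth = max_depth    # admission limit: beyond this, shed load
        self.batch_size = batch_size  # how many requests one decode step serves
        self._pending: deque[DecodeRequest] = deque()

    def submit(self, request: DecodeRequest) -> bool:
        """Return False (backpressure signal) instead of growing the queue unboundedly."""
        if len(self._pending) >= self.max_depth:
            return False
        self._pending.append(request)
        return True

    def next_batch(self) -> list[DecodeRequest]:
        """Coalesce up to batch_size pending requests into one decode step."""
        batch = []
        while self._pending and len(batch) < self.batch_size:
            batch.append(self._pending.popleft())
        return batch


if __name__ == "__main__":
    q = CoalescingQueue(max_depth=4, batch_size=2)
    accepted = [q.submit(DecodeRequest(f"ctx-{i}", 16)) for i in range(6)]
    print(accepted)        # last two rejected -> caller should retry or shed load
    print(q.next_batch())  # first coalesced batch of two requests
```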
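A minimal harness for the benchmarking bullet: it times each request end to end and reports p50/p99 latency plus throughput. The run_request callable is a stand-in for a real client hitting a given placement strategy:

```python
import statistics
import time
from typing import Callable


def benchmark(run_request: Callable[[], None], n_requests: int = 200) -> dict:
    """Run n_requests sequentially and summarize latency and throughput."""
    latencies = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        run_request()
        latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - start
    latencies.sort()
    return {
        "p50_ms": 1000 * statistics.median(latencies),
        "p99_ms": 1000 * latencies[int(0.99 * (len(latencies) - 1))],
        "throughput_rps": n_requests / wall,
    }


if __name__ == "__main__":
    # Simulated request with a fixed service time; replace with a real call.
    print(benchmark(lambda: time.sleep(0.002)))
```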
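For the cache-invalidation concern, a TTL-based sketch in which stale contexts are evicted lazily on access and a tenant's entries can be invalidated explicitly; the TTL value and the tenant-prefix keying scheme are assumptions:

```python
import time
from typing import Optional


class ContextCache:
    """TTL-based context cache so decode nodes never reuse an expired context."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._entries: dict[str, tuple[float, bytes]] = {}  # id -> (stored_at, payload)

    def put(self, context_id: str, payload: bytes) -> None:
        self._entries[context_id] = (time.monotonic(), payload)

    def get(self, context_id: str) -> Optional[bytes]:
        """Return the payload if fresh, otherwise invalidate lazily and miss."""
        entry = self._entries.get(context_id)
        if entry is None:
            return None
        stored_at, payload = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._entries[context_id]  # lazy invalidation on access
            return None
        return payload

    def invalidate_prefix(self, prefix: str) -> int:
        """Explicit invalidation, e.g. when a tenant's contexts are revoked."""
        stale = [cid for cid in self._entries if cid.startswith(prefix)]
        for cid in stale:
            del self._entries[cid]
        return len(stale)


if __name__ == "__main__":
    cache = ContextCache(ttl_seconds=0.05)
    cache.put("tenant-a/ctx-1", b"kv-bytes")
    print(cache.get("tenant-a/ctx-1") is not None)  # True: still fresh
    time.sleep(0.1)
    print(cache.get("tenant-a/ctx-1"))              # None: expired and evicted
```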