Midokura Technology Radar
Trial

Why?

  • Large LLM-style inference workloads separate naturally into a batched, high-throughput "prefill" phase (context encoding) and a low-latency "decode" phase (autoregressive token generation); see the sketch after this list.
  • Disaggregation lets us specialize hardware: dense GPU farms for prefill throughput, and low-latency nodes (small GPUs, CPUs, or inference accelerators) for decode, improving utilization and cost-efficiency.
  • Network and protocol choices (RDMA/UltraEthernet, efficient RPC, context-caching strategies) are critical to keep end-to-end latency predictable.
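A minimal sketch of the prefill/decode split, assuming a toy model interface; the names `KVCache`, `prefill`, and `decode_step` are illustrative and not taken from any specific framework:

```python
from dataclasses import dataclass

@dataclass
class KVCache:
    """Encoded context handed from the prefill pool to a decode server."""
    tokens: list[int]

def prefill(prompt_tokens: list[int]) -> KVCache:
    # Prefill: one batched, compute-bound forward pass over the whole prompt.
    # A real implementation would run the transformer and keep per-layer K/V here.
    return KVCache(tokens=list(prompt_tokens))

def decode_step(cache: KVCache) -> int:
    # Decode: one latency-bound forward pass per generated token,
    # reusing and extending the cached context.
    next_token = hash(tuple(cache.tokens)) % 50_000  # stand-in for real sampling
    cache.tokens.append(next_token)
    return next_token

cache = prefill([101, 2023, 2003, 102])              # prefill runs once per request
generated = [decode_step(cache) for _ in range(8)]   # decode runs token by token
print(generated)
```

The point is structural: prefill runs once per request and is batch-friendly, while decode runs once per generated token and dominates interactive latency, which is what makes separate hardware pools attractive.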

What?

  • Prototype disaggregated serving architectures: pooled prefill services that export encoded contexts (in practice, KV caches) to dedicated decode servers via low-latency protocols; a handoff sketch follows this list.
  • Standardize protocols, context serialization, request coalescing, and backpressure to reduce tail latency and jitter.
  • Benchmark end-to-end latency, throughput, and cost across placement strategies (a toy harness follows this list); integrate disaggregation patterns into GPU orchestration and autoscaling policies.
  • Track operational concerns: context cache invalidation, auth/tenancy, and cost allocation across prefill/decode pools; a cache sketch follows this list.
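A hedged sketch of the prefill-to-decode handoff with context serialization and queue-based backpressure; the bounded-queue size and pickle-based serialization are assumptions for illustration, not a proposed wire protocol:

```python
import pickle
import queue
import threading

HANDOFF = queue.Queue(maxsize=32)  # bounded queue = simplest backpressure primitive

def prefill_worker(prompts):
    for req_id, prompt in prompts:
        ctx = {"req_id": req_id, "tokens": prompt}  # stand-in for a serialized KV cache
        payload = pickle.dumps(ctx)                 # context serialization (assumed format)
        HANDOFF.put(payload)                        # blocks when the decode pool lags

def decode_worker():
    while True:
        payload = HANDOFF.get()
        ctx = pickle.loads(payload)
        # ... autoregressive decode using ctx would happen here ...
        HANDOFF.task_done()

threading.Thread(target=decode_worker, daemon=True).start()
prefill_worker([(i, [1, 2, 3]) for i in range(100)])
HANDOFF.join()  # wait until the decode pool has drained all handoffs
```

In a real deployment the in-process queue would be replaced by an RPC layer with explicit flow control over the RDMA/UltraEthernet-class transports noted above; the queue only illustrates where backpressure has to live.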
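A toy benchmarking harness in the same spirit; all timings, strategy names, and the handoff cost below are made-up parameters to show the shape of the comparison, not measured results:

```python
import random
import statistics

def simulate_request(prefill_ms: float, decode_ms_per_token: float, tokens: int) -> float:
    # End-to-end latency model: one prefill pass plus per-token decode steps.
    return prefill_ms + decode_ms_per_token * tokens

STRATEGIES = {
    "colocated":     dict(prefill_ms=40.0, decode_ms_per_token=9.0),
    "disaggregated": dict(prefill_ms=25.0, decode_ms_per_token=6.0),
}
HANDOFF_MS = 4.0  # assumed network cost of shipping the encoded context

for name, params in STRATEGIES.items():
    lat = []
    for _ in range(1_000):
        tokens = random.randint(16, 256)
        e2e = simulate_request(**params, tokens=tokens)
        if name == "disaggregated":
            e2e += HANDOFF_MS  # disaggregation pays the handoff once per request
        lat.append(e2e)
    lat.sort()
    p50, p99 = lat[len(lat) // 2], lat[int(len(lat) * 0.99)]
    print(f"{name:13s} p50={p50:7.1f}ms  p99={p99:7.1f}ms  mean={statistics.mean(lat):7.1f}ms")
```

Real numbers would come from load tests against actual placements; what carries over is the comparison shape: per-strategy latency distributions with the handoff cost made explicit.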
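Finally, a sketch of a tenant-scoped context cache with TTL-based invalidation; the key scheme, TTL default, and method names are assumptions, not a prescribed design:

```python
import time

class ContextCache:
    def __init__(self, ttl_s: float = 60.0):
        self.ttl_s = ttl_s
        self._store: dict[tuple[str, str], tuple[float, bytes]] = {}

    def _key(self, tenant: str, ctx_id: str) -> tuple[str, str]:
        # Tenant in the key prevents cross-tenant reads and makes
        # per-tenant cost allocation a key-prefix aggregation.
        return (tenant, ctx_id)

    def put(self, tenant: str, ctx_id: str, payload: bytes) -> None:
        self._store[self._key(tenant, ctx_id)] = (time.monotonic(), payload)

    def get(self, tenant: str, ctx_id: str) -> bytes | None:
        entry = self._store.get(self._key(tenant, ctx_id))
        if entry is None:
            return None
        created, payload = entry
        if time.monotonic() - created > self.ttl_s:  # TTL invalidation on read
            del self._store[self._key(tenant, ctx_id)]
            return None
        return payload

    def invalidate_tenant(self, tenant: str) -> None:
        # e.g. on model rollout or tenant credential rotation
        for k in [k for k in self._store if k[0] == tenant]:
            del self._store[k]
```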