Midokura Technology Radar

RDMA & High-performance Networking

networking · rdma · infiniband · team:mido/infra
Trial

Why?

  • Large-scale model training and parallel simulations require low-latency, high-bandwidth interconnects between GPUs.
  • RDMA transports such as InfiniBand and RoCEv2 move data directly between the memory of different hosts without involving the kernel or CPU on the data path, which reduces communication overhead and improves scaling efficiency.
  • Emerging post-RoCEv2 protocols and fabrics (e.g., Ultra Ethernet, RoCE extensions, and proprietary RDMA-like stacks) are gaining traction because they promise better determinism, richer telemetry, and Ethernet-native deployment models.

What?

  • Standardize on supported network fabrics for distributed training clusters.
  • Track and evaluate post-RoCEv2 protocols and fabrics (e.g., Ultra Ethernet): benchmark performance, and assess interoperability and vendor ecosystem maturity.
  • Validate RDMA capabilities, NUMA/topology effects, and software stack readiness.
  • Explore DPU/SmartNIC offloads for network & security functions in AI clusters.
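
As a starting point for the RDMA and NUMA/topology validation above, the following is a minimal sketch, not a definitive tool. It assumes a Linux host where the kernel RDMA stack exposes devices under /sys/class/infiniband, and it reports each device's NUMA node plus the state, rate, and link layer of each port. The function names (read_text, list_rdma_devices) and the sysfs_root parameter are illustrative choices, not part of any existing Midokura tooling.

```python
from pathlib import Path


def read_text(path: Path, default: str = "unknown") -> str:
    """Return the stripped contents of a sysfs file, or a default if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return default


def list_rdma_devices(sysfs_root: str = "/sys/class/infiniband") -> list[dict]:
    """Collect the NUMA node and per-port link details for each RDMA device.

    sysfs_root is a parameter (an assumption of this sketch) so the function
    can be pointed at a copy of the tree for testing.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # RDMA stack not loaded, or not a Linux host

    devices = []
    for dev in sorted(root.iterdir()):
        ports = []
        ports_dir = dev / "ports"
        if ports_dir.is_dir():
            for port in sorted(ports_dir.iterdir()):
                ports.append({
                    "port": port.name,
                    "state": read_text(port / "state"),            # e.g. "4: ACTIVE"
                    "rate": read_text(port / "rate"),              # e.g. "100 Gb/sec (4X EDR)"
                    "link_layer": read_text(port / "link_layer"),  # "InfiniBand" or "Ethernet"
                })
        devices.append({
            "name": dev.name,
            # "-1" means the platform did not report a NUMA node
            "numa_node": read_text(dev / "device" / "numa_node"),
            "ports": ports,
        })
    return devices


if __name__ == "__main__":
    for d in list_rdma_devices():
        print(f"{d['name']}: NUMA node {d['numa_node']}")
        for p in d["ports"]:
            print(f"  port {p['port']}: {p['state']}, {p['rate']}, {p['link_layer']}")
```

Comparing the reported NUMA node with the node of the GPUs that share traffic with each NIC is one simple way to spot cross-socket placements before running bandwidth benchmarks. A link layer of "Ethernet" indicates a RoCE-capable port rather than native InfiniBand.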