Midokura Technology Radar

Distributed Training (Data & Model Parallelism)

training, distributed, deepspeed, megatron, team:mido/infra
Trial

Why?

  • Model sizes are growing rapidly; single-GPU training is no longer sufficient for many state-of-the-art models.
  • Efficient scaling (data & model parallelism) is required to reduce training time and cost.

What?

  • Invest in distributed-training frameworks (PyTorch DDP, DeepSpeed, Megatron-LM, Fully Sharded Data Parallel); a minimal data-parallel sketch follows this list.
  • Build reference pipelines for sharding, checkpointing, and failure recovery at scale (see the checkpoint/resume sketch below).
  • Measure end-to-end trade-offs (throughput, convergence, cost) and standardize best practices; a rough throughput harness is also sketched below.
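
To make the first bullet concrete, here is a minimal data-parallel training sketch with PyTorch DDP. It assumes a launch via `torchrun --nproc_per_node=<gpus>`; the linear model and random dataset are hypothetical placeholders, not a recommended workload, and DeepSpeed, Megatron-LM, or FSDP would wrap the same loop with their own APIs.

```python
# Minimal PyTorch DDP sketch (one process per GPU, launched with torchrun).
# Model and dataset below are placeholders for illustration only.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])        # set by torchrun
    torch.cuda.set_device(local_rank)

    # Placeholder model/data; replace with the real workload.
    model = torch.nn.Linear(1024, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])       # gradients all-reduced across ranks
    dataset = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset)             # each rank sees a disjoint shard
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                      # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                           # DDP overlaps all-reduce with backward
            optimizer.step()
        if rank == 0:
            print(f"epoch {epoch} done, last loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```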
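For the reference-pipeline bullet, a minimal checkpoint/resume sketch on top of the DDP setup above. The `ckpt.pt` path, the save interval, and the shape of the saved state are assumptions; a production pipeline would more likely use sharded or asynchronous checkpointing (e.g. torch.distributed.checkpoint or DeepSpeed's checkpointing).

```python
# Checkpoint/resume sketch for the DDP setup above; `ckpt.pt` is an
# illustrative path, not a prescribed layout.
import os
import torch
import torch.distributed as dist

CKPT_PATH = "ckpt.pt"

def save_checkpoint(model, optimizer, epoch):
    # Only rank 0 writes; DDP replicas hold identical weights.
    if dist.get_rank() == 0:
        torch.save(
            {
                "model": model.module.state_dict(),   # unwrap the DDP container
                "optimizer": optimizer.state_dict(),
                "epoch": epoch,
            },
            CKPT_PATH,
        )
    dist.barrier()                                    # keep ranks in lockstep

def load_checkpoint(model, optimizer):
    # Resume after a failure: every rank loads the same file onto its own GPU.
    if not os.path.exists(CKPT_PATH):
        return 0                                      # nothing to resume from
    map_location = {"cuda:0": f"cuda:{int(os.environ['LOCAL_RANK'])}"}
    state = torch.load(CKPT_PATH, map_location=map_location)
    model.module.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1                         # next epoch to run
```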
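For the measurement bullet, a rough throughput harness. `step_fn(batch)` is a hypothetical callable that runs one forward/backward/optimizer step on the current rank; convergence and cost require longer runs and billing data and are out of scope for this sketch.

```python
# Rough samples/sec measurement around an existing training step.
import itertools
import time
import torch
import torch.distributed as dist

def measure_throughput(step_fn, loader, warmup=5, steps=20):
    # Cycle the loader so short per-rank shards do not run out mid-measurement.
    batches = itertools.cycle(loader)
    for _ in range(warmup):
        step_fn(next(batches))                 # warm up kernels and allocator
    torch.cuda.synchronize()
    start = time.perf_counter()
    n_samples = 0
    for _ in range(steps):
        batch = next(batches)
        step_fn(batch)
        n_samples += batch[0].size(0)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    per_rank = n_samples / elapsed
    # Approximate global samples/sec; assumes all ranks run the same load.
    return per_rank * dist.get_world_size()
```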