Midokura Technology Radar

GPU Observability & Profiling

observability profiling gpu team:mido/infra

Feb 2026

Trial

Why?

Troubleshooting performance, utilization, and correctness of GPU workloads requires specialized telemetry and profiling tools.
Better observability makes it easier to raise GPU utilization and to resolve issues faster in both training and inference.

What?

Adopt tooling for GPU telemetry and profiling: NVIDIA Nsight, DCGM, the PyTorch and TensorBoard profilers, and Prometheus exporters.
Establish logging and tracing practices that correlate GPU metrics with jobs, datasets, and deployments.
Build runbooks for common performance issues and capacity bottlenecks.