DisTrO from Nous Research marks a breakthrough in neural network training, reducing the inter-GPU communication requirements by tens of thousands of times. This technology makes it possible to train large-scale AI models on commodity hardware, removing the need for expensive server-grade interconnect solutions.
Training large-scale neural networks typically involves sharing gradients between all accelerators, which necessitates specialized, high-speed interconnects. To address this, we introduce DisTrO, a family of architecture-agnostic and network-agnostic distributed optimizers that reduce inter-GPU communication requirements by four to five orders of magnitude without relying on amortized analysis, enabling low-latency training of large neural networks over slow internet connections and heterogeneous networking hardware. In this preliminary report we present the first empirical evidence that DisTrO-AdamW matches standard AdamW+All-Reduce in convergence rate while massively reducing the bandwidth required during pre-training of a 1.2B-parameter LLM. When used with Distributed Data Parallelism, DisTrO may enable future large-scale foundation model training to bypass the need for high-speed interconnects entirely.
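To put the claimed four-to-five-order-of-magnitude reduction in perspective, a back-of-envelope sketch of per-step gradient traffic for a 1.2B-parameter model is shown below. The byte counts and the reduction factor applied here are illustrative assumptions for sizing purposes, not measurements from the report, and the helper function is hypothetical:

```python
# Back-of-envelope estimate of per-step gradient traffic for a 1.2B-parameter
# model under standard all-reduce, and what a 10^4x reduction would imply.
# All figures are illustrative assumptions, not measurements from the report.

PARAMS = 1.2e9          # parameter count of the pre-trained LLM
BYTES_PER_GRAD = 2      # assuming fp16 gradients

def per_step_traffic_gb(reduction_factor: float = 1.0) -> float:
    """Bytes each worker must exchange per optimizer step, in GB."""
    return PARAMS * BYTES_PER_GRAD / reduction_factor / 1e9

naive = per_step_traffic_gb()            # full-gradient all-reduce
reduced = per_step_traffic_gb(1e4)       # four orders of magnitude fewer bytes

print(f"full gradient all-reduce: {naive:.2f} GB/step")
print(f"with 1e4x reduction:      {reduced * 1e3:.2f} MB/step")
```

At these assumed sizes, each step drops from gigabytes of traffic (feasible only over datacenter interconnects) to a fraction of a megabyte, which is well within reach of an ordinary internet connection.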