The Rise of Decentralized Training

Special thanks to Harry Grieve & Ben Fielding for the helpful feedback and review

Over the last year, we’ve witnessed an explosion of new startups, along with more established players, entering distributed and decentralized training. The space is growing rapidly. This essay breaks down the concepts behind decentralized training and explores how this category, and the startups within it, might evolve.

Distributed → Decentralized Training

First things first: this is a very difficult problem space. Driving real innovation across this sector is a multi-layered challenge. This is also why the biggest and brightest investors in the space completely discounted it until recently. There's a lot of research showing the promise and viability of distributed training. However, it's clear that bringing this research to production is not a trivial task. Moreover, distributed training is only the first step.

The true leap forward occurs when we achieve full decentralization of the various layers that underpin distributed training at scale. That means distributed training enabled by trustless systems, rather than by a single central coordinator that could fail, limit access, or go rogue. Over time, decentralized training should provide an easier pathway to permissionless access (on both the supply and demand sides), higher degrees of consistency and fault tolerance, and shared ownership over the very primitives driving this innovation space. I, along with some builders in the category, see a world where decentralized training and fine-tuning networks become baked into the machine learning stack, similar to TCP/IP in the early internet: an open protocol to power machine intelligence.

So, where are we on this journey? To put it clearly, we're early. Some of the seminal research in distributed training originated from a Google beanbag, with Arthur Douillard and DeepMind making some of the earliest arguments for distributed training (DiLoCo, DiPaCo). More recently, startup teams like Gensyn, Nous Research, and Prime Intellect have been leading the charge with additional research that further proves out the distributed training thesis. Still, much of that research (and its implementations) is basic and needs a lot of refinement. I expect this to change dramatically in the next 12 months, and the early signs are promising.

Decentralized Training in Action

Prime Intellect recently launched the training run of a 10-billion-parameter model, leveraging their own framework for distributed, data-parallel training. This marks one of the first times an open protocol has iterated on DiLoCo's core ideas in production. The achievement is further proof that this form of distributed training works in the real world, outside of academia and outside of big tech too.

In August, Nous Research published their preliminary report on DisTrO, their own take on distributed data-parallel training. Once again, this is an early indication of where the trend is headed: not only is distributed training possible, but it's being optimized by a few credible and capable teams.

Note that DisTrO also uses a form of data parallelism in its implementation, most likely inspired by DiLoCo, similar to Prime Intellect.

Understanding Parallelism Techniques

Parallelism techniques are foundational to distributed training. These methods are crucial for scaling machine learning models across multiple devices and nodes: they make training faster and more efficient, and they dramatically reduce the communication requirements that are the status quo in traditional training.

Data Parallelism

Data parallelism is the most straightforward and widely used parallelism technique. It has been researched and proven extensively in the context of federated learning (FL). Federated learning trains separate models across a number of distributed devices, sharding the dataset across devices and aggregating the results. In FL, data parallelism allows heterogeneous and distributed sets of devices to train without sharing the data itself between the different peers within the training network.
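To make the federated pattern concrete, here is a minimal sketch in plain Python, in the spirit of federated averaging. Everything in it is an assumption for illustration: the toy model y = w * x, the three hypothetical clients, and the helper names `local_update` and `federated_round`. It is not the method used by any of the teams mentioned in this essay. The point to notice is that only the parameter `w` travels between the clients and the server; the raw data never leaves a client.

```python
# Toy federated averaging on a 1-D linear model y = w * x (illustrative only).

def local_update(w, client_data, lr=0.01, local_steps=5):
    """Run a few gradient steps on one client's private (x, y) pairs."""
    for _ in range(local_steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in client_data) / len(client_data)
        w -= lr * grad
    return w

def federated_round(w_global, clients):
    """One round: each client trains locally, the server averages the results.

    The average is weighted by how many samples each client holds.
    """
    total = sum(len(c) for c in clients)
    updates = [(local_update(w_global, c), len(c)) for c in clients]
    return sum(w * n for w, n in updates) / total

# Hypothetical clients with different amounts of data, all drawn from y = 2x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)],
]

w = 0.0
for _ in range(50):
    w = federated_round(w, clients)

print(f"learned w = {w:.6f}")  # should approach the true slope of 2
```

Weighting by sample count is one common design choice; a real system would also have to deal with stragglers, dropped clients, and untrusted updates, which this sketch ignores.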

In data parallelism, the entire model is replicated across multiple devices, and each device processes a different subset of the data. After processing, the gradients from each device are aggregated and used to update the model parameters, either synchronously or asynchronously.
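The synchronous case can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not a production framework: the model is a single parameter `w`, the "workers" are just list entries processed in a loop, and `all_reduce_mean` is a hypothetical stand-in for the collective operation a real cluster would run over the network.

```python
# Synchronous data parallelism on a toy 1-D linear model y = w * x.
# Every worker holds a full replica of w and sees only its own shard of the data.

def shard_gradient(w, shard):
    """Mean-squared-error gradient with respect to w over one shard of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for a collective all-reduce: average one value per worker."""
    return sum(values) / len(values)

def train(shards, w=0.0, lr=0.01, steps=200):
    for _ in range(steps):
        # On real hardware these gradients are computed in parallel, one per device.
        local_grads = [shard_gradient(w, s) for s in shards]
        # Every replica applies the same averaged gradient, so replicas stay in sync.
        w -= lr * all_reduce_mean(local_grads)
    return w

# Hypothetical dataset generated by y = 3x, split into 4 equal-sized shards.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
num_workers = 4
shards = [data[i::num_workers] for i in range(num_workers)]

w_final = train(shards)
print(f"learned w = {w_final:.6f}")  # should approach the true slope of 3
```

Because the shards are the same size, the averaged gradient equals the gradient over the full dataset, so this loop follows the same trajectory as single-device training. The cost is one gradient exchange per step, which is the communication overhead that approaches like DiLoCo aim to reduce.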