1/2 We extended our previous work in the data-parallel regime (where every node holds a full copy of the model) to the model-parallel regime. Aside from the original swarm paper, this is the first work to address the scenario where the model itself is sharded across devices.