Introducing CheckFree: a fault-tolerant method for decentralised training, with no checkpoints or redundant compute. Up to 1.6x faster than existing methods, with no loss in convergence. We’re open sourcing it today.
Fault tolerance is critical in decentralised training, as nodes are unreliable and prone to failure. Recent works have proposed various recovery methods, but they still require redundant computation or checkpointing, which adds time and compute overhead.
How it works: CheckFree instead recovers a failed stage by initialising it with the average of its neighbouring stages’ weights. This approximates the lost weights efficiently, with minimal effect on convergence. Blog:
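A minimal sketch of the idea, assuming each pipeline stage is a torch module with an identical architecture, so parameters align one-to-one (the function and module names here are hypothetical, not the released API):

```python
import torch

def recover_failed_stage(prev_stage, next_stage):
    """Rebuild a lost stage's weights by averaging the weights of its
    two neighbouring stages. Assumes all stages share the same
    architecture, so state_dict keys and shapes match."""
    prev_params = prev_stage.state_dict()
    next_params = next_stage.state_dict()
    recovered = {}
    for name, prev_w in prev_params.items():
        # Element-wise average of the corresponding neighbour weights
        recovered[name] = 0.5 * (prev_w + next_params[name])
    return recovered

# Usage: load the averaged weights into a freshly initialised stage,
# then resume training without rolling back to a checkpoint.
# new_stage.load_state_dict(recover_failed_stage(stage_before, stage_after))
```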
This unlocks:
– Up to 1.6x faster training than conventional checkpointing
– Up to 1.2x faster than using redundant compute
– No additional memory or compute required
We’re excited to open source it today, as a key building block for decentralised training. Blog: Paper: Code: