Introducing CheckFree
A fault-tolerant method for decentralised training, with no checkpoints or redundant compute.
Up to 1.6x faster than existing methods, with no convergence loss.
We’re open sourcing it today.

Fault tolerance is critical in decentralised training, as nodes are unreliable and prone to failure.
Recent work has proposed various recovery methods, but they still rely on redundant computation or checkpointing, which adds time and compute overhead.
How it works
CheckFree instead recovers a failed stage by averaging the weights of its neighbouring stages.
This provides an efficient way to approximate the lost weights, with minimal effect on convergence.
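To make the idea concrete, here is a minimal sketch of neighbour averaging, not the released implementation. It assumes every pipeline stage holds parameters with identical names and shapes, that the failed stage has a neighbour on both sides, and that the average is a plain unweighted mean; the name `recover_stage` and the dict-of-lists weight layout are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical layout: parameter name -> flat list of values for one stage.
Weights = Dict[str, List[float]]


def recover_stage(prev_stage: Weights, next_stage: Weights) -> Weights:
    """Approximate a lost stage as the element-wise mean of its two neighbours.

    Assumption: both neighbours share the same parameter names and sizes.
    """
    if prev_stage.keys() != next_stage.keys():
        raise ValueError("neighbouring stages must share parameter names")
    recovered: Weights = {}
    for name in prev_stage:
        a, b = prev_stage[name], next_stage[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Unweighted mean of the two neighbours (an assumption of this sketch).
        recovered[name] = [(x + y) / 2.0 for x, y in zip(a, b)]
    return recovered


# Toy three-stage pipeline in which the middle stage has been lost.
stages = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {},  # failed stage: weights gone
    {"w": [3.0, 6.0], "b": [1.0]},
]
failed = 1
stages[failed] = recover_stage(stages[failed - 1], stages[failed + 1])
print(stages[failed])
```

Nothing is stored or recomputed here: the only inputs are weights the neighbouring nodes already hold. The sketch does not cover the first or last stage, which have only one neighbour.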
This unlocks:
– Up to 1.6x faster training time than conventional checkpointing
– Up to 1.2x faster than using redundant compute
– No additional memory or compute required
We’re open sourcing it today, as a key building block for decentralised training.
Blog:
Paper:
Code: