nvidia-resiliency-ext
Python
★ 299
updated 5d ago
NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to failures and interruptions.
No plain-English explanation yet — one is being written right now. Check back in a minute.