mzl
|
8d53ac0cd3
Add MPICH Multinode Runner (#2839)
|
1 year ago |
Ma, Guokai
|
98cc35b6a8
Abstract accelerator (step 3) (#2677)
|
1 year ago |
Jeff Rasley
|
a091bc223c
[launcher] fail gracefully if hostname -i doesn't work as expected (#2631)
|
1 year ago |
mzl
|
11f5daba5e
add enable_each_rank_log to deepspeed/launcher/runner.py (#2571)
|
1 year ago |
Jeff Rasley
|
8c56c25d84
[launcher] parse hostfile via regex and added error checks (#2626)
|
1 year ago |
savitamittal1
|
ffb6d98762
Added MLFLOW environment variables for logging metrics within trainig… (#2477)
|
1 year ago |
Cheng Li
|
8da0238b7a
rollback ds config changes (#2395)
|
2 years ago |
Dashiell Stander
|
3db0b5e2de
Add SLURM Multinode Runner (#2404)
|
2 years ago |
Arpan Jain
|
1ed5aa96a8
Elastic Training support in DeepSpeed (#2153) (#2156)
|
2 years ago |
trajep
|
e669aaf55b
Trajepl/nebula ckpt engine (#2085)
|
2 years ago |
liamcli
|
380d32f980
[launcher] add option to bypass ssh check (#1957)
|
2 years ago |
Jeff Rasley
|
a773996d97
[launcher] validate passwordless-ssh works when using hostfile launching (#1832)
|
2 years ago |
liamcli
|
dac9056e13
Improve how runner parses env var file (#1747)
|
2 years ago |
Jeff Rasley
|
9351266f78
Multi-node save pid support + allow sparse-attn extra (#1728)
|
2 years ago |
Jeff Rasley
|
171316fc83
launcher save pid + require manual triton install for sparse-attn (#1727)
|
2 years ago |
Jeff Rasley
|
2d51f6171b
preserve cuda visible devices order (#1712)
|
2 years ago |
liamcli
|
fead387f78
support module and no python args for launcher (#1690)
|
2 years ago |
Stas Bekman
|
e3c2d7b16f
[launcher/runner] respect CUDA_VISIBLE_DEVICES for a single node (#960)
|
2 years ago |
Cheng Li
|
9caa74e577
Autotuning (#1554)
|
2 years ago |
Chunyang Wen
|
df5b0884c7
Unify use f str (#1511)
|
3 years ago |
Alex Hedges
|
be789b1665
Fix many typos (#1423)
|
3 years ago |
Jeff Rasley
|
9e0dab402d
add option to force multi-node launcher mode (#977)
|
3 years ago |
Jeff Rasley
|
72a30c1eab
revert zero-inf change to launcher
|
3 years ago |
Jeff Rasley
|
0d4a54a04d
ZeRO-Infinity (#976)
|
3 years ago |
Takuya Makino
|
e6999ebd16
Delete check of pdsh (#941)
|
3 years ago |
Takuya Makino
|
ce14cf1af6
Add space in help string (#926)
|
3 years ago |
Stas Bekman
|
24335d49ce
[runner/launch] propagate the error (#854)
|
3 years ago |
Jeff Rasley
|
2e6692c8ad
Fix regression in runner (#843)
|
3 years ago |
Samyam Rajbhandari
|
599258f979
ZeRO 3 Offload (#834)
|
3 years ago |
Jeff Rasley
|
6217a6c243
skip empty lines in hostfile (#669)
|
3 years ago |