Guo Yejun
|
3432c740e9
deepspeed/launcher/launch.py: add option '--enable_each_rank_log logdir' (#2409)
|
2 years ago |
Arpan Jain
|
1ed5aa96a8
Elastic Training support in DeepSpeed (#2153) (#2156)
|
2 years ago |
Jerry Mannil
|
66d29b0a6c
Graceful exit on failures for multi-node runs (#2008)
|
2 years ago |
trajep
|
e669aaf55b
Trajepl/nebula ckpt engine (#2085)
|
2 years ago |
Ammar Ahmad Awan
|
36ad3119d5
DeepSpeed comm backend v1 (#1985)
|
2 years ago |
Jeff Rasley
|
9351266f78
Multi-node save pid support + allow sparse-attn extra (#1728)
|
2 years ago |
liamcli
|
fead387f78
support module and no python args for launcher (#1690)
|
2 years ago |
Mikhail Druzhinin
|
499800caa8
Fix return code on error (#1540)
|
2 years ago |
Chunyang Wen
|
cf1f16016f
Use fstr in launcher (#1521)
|
3 years ago |
Alex Hedges
|
be789b1665
Fix many typos (#1423)
|
3 years ago |
Stas Bekman
|
4f1d827c52
[launcher] look ma, no more zombies (#714)
|
3 years ago |
Jeff Rasley
|
7435b2f10a
Ability to initialize distributed backend outside deepspeed runtime (#608)
|
3 years ago |
Jeff Rasley
|
c5a449f9a3
Update launcher to set local rank environ variable (#597)
|
3 years ago |
Ammar Ahmad Awan
|
01726ce2b8
Add 1-bit Adam support to DeepSpeed (#380)
|
4 years ago |
Jeff Rasley
|
e5bbc2e559
Sparse attn + ops/runtime refactor + v0.3.0 (#343)
|
4 years ago |