shiyuan680 | 3f875d9519 | add device config env for the accelerator (#5396) | 6 months ago
ByronHsu | 9500ab7d47 | [minor] Improve logging for multiprocesses (#5004) | 9 months ago
Byungsoo Oh | b7f463ddeb | Fix local rank mismatch for heterogeneous nodes (#3409) | 1 year ago
Ma, Guokai | 1f72082fc0 | [CPU] Support Intel CPU inference (#3041) | 1 year ago
digger-yu | 254663a28c | fix spelling error with deepspeed/runtime/ (#3509) | 1 year ago
Ma, Guokai | 0b5252bbd3 | [CPU support] Optionally bind each rank to different cores on host (#2881) | 1 year ago
Michael Wyatt | b361c72761 | Update DeepSpeed copyright license to Apache 2.0 (#3111) | 1 year ago
Jeff Rasley | 91d63e0228 | update formatter version and style settings (#3098) | 1 year ago
Guo Yejun | 3432c740e9 | deepspeed/launcher/launch.py: add option '--enable_each_rank_log logdir' (#2409) | 2 years ago
Arpan Jain | 1ed5aa96a8 | Elastic Training support in DeepSpeed (#2153) (#2156) | 2 years ago
Jerry Mannil | 66d29b0a6c | Graceful exit on failures for multi-node runs (#2008) | 2 years ago
trajep | e669aaf55b | Trajepl/nebula ckpt engine (#2085) | 2 years ago
Ammar Ahmad Awan | 36ad3119d5 | DeepSpeed comm backend v1 (#1985) | 2 years ago
Jeff Rasley | 9351266f78 | Multi-node save pid support + allow sparse-attn extra (#1728) | 2 years ago
liamcli | fead387f78 | support module and no python args for launcher (#1690) | 2 years ago
Mikhail Druzhinin | 499800caa8 | Fix return code on error (#1540) | 2 years ago
Chunyang Wen | cf1f16016f | Use fstr in launcher (#1521) | 3 years ago
Alex Hedges | be789b1665 | Fix many typos (#1423) | 3 years ago
Stas Bekman | 4f1d827c52 | [launcher] look ma, no more zombies (#714) | 3 years ago
Jeff Rasley | 7435b2f10a | Ability to initialize distributed backend outside deepspeed runtime (#608) | 3 years ago
Jeff Rasley | c5a449f9a3 | Update launcher to set local rank environ variable (#597) | 3 years ago
Ammar Ahmad Awan | 01726ce2b8 | Add 1-bit Adam support to DeepSpeed (#380) | 4 years ago
Jeff Rasley | e5bbc2e559 | Sparse attn + ops/runtime refactor + v0.3.0 (#343) | 4 years ago