Ma, Guokai
|
1bc3b78423
[CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) (#3919)
|
1 year ago |
Michael Wyatt
|
a1effc9170
different port ranges for xdist workers (#3975)
|
1 year ago |
Michael Wyatt
|
aef6c65ce3
Reduce Unit Test Times (Part 3) (#3850)
|
1 year ago |
Ramya Ramineni
|
d24629f4fd
[ROCm] Enable TestCUDABackward::test_backward unit tests (#3849)
|
1 year ago |
Ma, Guokai
|
1f72082fc0
[CPU] Support Intel CPU inference (#3041)
|
1 year ago |
Michael Wyatt
|
b361c72761
Update DeepSpeed copyright license to Apache 2.0 (#3111)
|
1 year ago |
Jeff Rasley
|
91d63e0228
update formatter version and style settings (#3098)
|
1 year ago |
Ma, Guokai
|
090d49e79f
pre-commit check for torch.cuda in code (#2981)
|
1 year ago |
Ma, Guokai
|
0acf7e9c48
[RFC] add device abstraction to allow other device than CUDA be used (#2221)
|
1 year ago |
Jeff Rasley
|
da84e60d98
add missing license info to top of all source code (#2889)
|
1 year ago |
Heyang Qin
|
7e77cf710a
Check device count before running dist tests (#2799)
|
1 year ago |
Michael Wyatt
|
349f845b83
Handle hanged tests in CI (#2808)
|
1 year ago |
Olatunji Ruwase
|
3f210c9715
CUDA optional deepspeed ops (#2507)
|
1 year ago |
Michael Wyatt
|
ff42743865
Refactor remaining distributed tests (#2216)
|
2 years ago |
Jeff Rasley
|
7d8ad45d6a
Fix regression w. dist_init_required (#2225)
|
2 years ago |
Michael Wyatt
|
ac9951985f
Refactor Distributed Tests (#2180)
|
2 years ago |
Michael Wyatt
|
1a71e77dc2
Fix for distributed tests on pytorch>=1.12 (#2141)
|
2 years ago |
Alex Hedges
|
316c4a43e0
Add flake8 to pre-commit checks (#2051)
|
2 years ago |
Ammar Ahmad Awan
|
36ad3119d5
DeepSpeed comm backend v1 (#1985)
|
2 years ago |
Jeff Rasley
|
c3c8d5dd93
AMD support (#1430)
|
2 years ago |
Jeff Rasley
|
7f58853c2e
[testing] 3x faster unit tests (#1636)
|
2 years ago |
Jeff Rasley
|
0af15b985d
[unit tests] allow unique port for tests
|
2 years ago |
Jeff Rasley
|
0457bb1cb6
Add assert to ensure we don't skip unsupported grad dtypes (#1418)
|
3 years ago |
Jeff Rasley
|
6996bb0159
Sparse attn triton v1.0 support + torch1.8 test runner (#1374)
|
3 years ago |
Jeff Rasley
|
2e2dd861f3
Dist testing backend fixes, etc. (#708)
|
3 years ago |
Jeff Rasley
|
7435b2f10a
Ability to initialize distributed backend outside deepspeed runtime (#608)
|
3 years ago |
Jeff Rasley
|
08c96a1bc6
ZeRO-1 tune max-elems + bug fix (#532)
|
3 years ago |
Shaden Smith
|
65c2f974d8
Pipeline parallel training engine. (#392)
|
4 years ago |
Jeff Rasley
|
f2ac7eafd5
ZeRO-2 (#217)
|
4 years ago |
Shaden Smith
|
438aa01773
Enables NCCL backend in @distributed_test (#13)
|
4 years ago |