Commit History

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Ma, Guokai | 1bc3b78423 | [CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) (#3919) | 1 year ago |
| Michael Wyatt | a1effc9170 | different port ranges for xdist workers (#3975) | 1 year ago |
| Michael Wyatt | aef6c65ce3 | Reduce Unit Test Times (Part 3) (#3850) | 1 year ago |
| Ramya Ramineni | d24629f4fd | [ROCm] Enable TestCUDABackward::test_backward unit tests (#3849) | 1 year ago |
| Ma, Guokai | 1f72082fc0 | [CPU] Support Intel CPU inference (#3041) | 1 year ago |
| Michael Wyatt | b361c72761 | Update DeepSpeed copyright license to Apache 2.0 (#3111) | 1 year ago |
| Jeff Rasley | 91d63e0228 | update formatter version and style settings (#3098) | 1 year ago |
| Ma, Guokai | 090d49e79f | pre-commit check for torch.cuda in code (#2981) | 1 year ago |
| Ma, Guokai | 0acf7e9c48 | [RFC] add device abstraction to allow other device than CUDA be used (#2221) | 1 year ago |
| Jeff Rasley | da84e60d98 | add missing license info to top of all source code (#2889) | 1 year ago |
| Heyang Qin | 7e77cf710a | Check device count before running dist tests (#2799) | 1 year ago |
| Michael Wyatt | 349f845b83 | Handle hanged tests in CI (#2808) | 1 year ago |
| Olatunji Ruwase | 3f210c9715 | CUDA optional deepspeed ops (#2507) | 1 year ago |
| Michael Wyatt | ff42743865 | Refactor remaining distributed tests (#2216) | 2 years ago |
| Jeff Rasley | 7d8ad45d6a | Fix regression w. dist_init_required (#2225) | 2 years ago |
| Michael Wyatt | ac9951985f | Refactor Distributed Tests (#2180) | 2 years ago |
| Michael Wyatt | 1a71e77dc2 | Fix for distributed tests on pytorch>=1.12 (#2141) | 2 years ago |
| Alex Hedges | 316c4a43e0 | Add flake8 to pre-commit checks (#2051) | 2 years ago |
| Ammar Ahmad Awan | 36ad3119d5 | DeepSpeed comm backend v1 (#1985) | 2 years ago |
| Jeff Rasley | c3c8d5dd93 | AMD support (#1430) | 2 years ago |
| Jeff Rasley | 7f58853c2e | [testing] 3x faster unit tests (#1636) | 2 years ago |
| Jeff Rasley | 0af15b985d | [unit tests] allow unique port for tests | 2 years ago |
| Jeff Rasley | 0457bb1cb6 | Add assert to ensure we don't skip unsupported grad dtypes (#1418) | 3 years ago |
| Jeff Rasley | 6996bb0159 | Sparse attn triton v1.0 support + torch1.8 test runner (#1374) | 3 years ago |
| Jeff Rasley | 2e2dd861f3 | Dist testing backend fixes, etc. (#708) | 3 years ago |
| Jeff Rasley | 7435b2f10a | Ability to initialize distributed backend outside deepspeed runtime (#608) | 3 years ago |
| Jeff Rasley | 08c96a1bc6 | ZeRO-1 tune max-elems + bug fix (#532) | 3 years ago |
| Shaden Smith | 65c2f974d8 | Pipeline parallel training engine. (#392) | 4 years ago |
| Jeff Rasley | f2ac7eafd5 | ZeRO-2 (#217) | 4 years ago |
| Shaden Smith | 438aa01773 | Enables NCCL backend in @distributed_test (#13) | 4 years ago |