Commit History

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Ma, Guokai | c08e69f212 | Make op builder detection adapt to accelerator change (#5206) | 7 months ago |
| Zhihao Lin | 01af3e1ddf | Enhance the robustness of `module_state_dict` (#4587) | 11 months ago |
| Alexander Jipa | 2ded2ff0be | checking process_group before merging bucket ranges (#3521) (#3577) | 1 year ago |
| Olatunji Ruwase | dd8df20fe0 | zero3 checkpoint frozen params (#3205) | 1 year ago |
| Michael Wyatt | b361c72761 | Update DeepSpeed copyright license to Apache 2.0 (#3111) | 1 year ago |
| Jeff Rasley | 91d63e0228 | update formatter version and style settings (#3098) | 1 year ago |
| Ma, Guokai | 0acf7e9c48 | [RFC] add device abstraction to allow other device than CUDA be used (#2221) | 1 year ago |
| Jeff Rasley | da84e60d98 | add missing license info to top of all source code (#2889) | 1 year ago |
| Alexander Jipa | cfead55132 | fixes #2389 (#2411) | 2 years ago |
| Michael Wyatt | ff42743865 | Refactor remaining distributed tests (#2216) | 2 years ago |
| Ammar Ahmad Awan | 36ad3119d5 | DeepSpeed comm backend v1 (#1985) | 2 years ago |
| Ammar Ahmad Awan | c0af6d90f7 | Refactor MoE and Groups API to simplify model creation and mangement (#1798) | 2 years ago |
| Jeff Rasley | 3293cf72a0 | [ZeRO] Default disable elastic ckpt in stage 1+2 and reduce CPU memory overhead during ckpt load (#1525) | 2 years ago |
| Jeff Rasley | 2332cb31a7 | Enables ZeRO-3 inference (#1514) | 2 years ago |
| Jeff Rasley | 6996bb0159 | Sparse attn triton v1.0 support + torch1.8 test runner (#1374) | 3 years ago |
| Hari Prasad | c0b27fb019 | Added drop_last to DeepSpeedDataLoader (#1321) | 3 years ago |
| Ammar Ahmad Awan | f28432441b | DeepSpeed MoE (#1310) | 3 years ago |
| Conglong Li | b2b34ae342 | Curriculum learning (#1307) | 3 years ago |
| Jeff Rasley | adc21a4dfd | ZeRO-1 empty grads fix + tests (#1273) | 3 years ago |
| hamlet | d0b61f1810 | Add find_unused_parameters option to DeepSpeedEngine (#945) | 3 years ago |
| Jeff Rasley | 2e2dd861f3 | Dist testing backend fixes, etc. (#708) | 3 years ago |
| Olatunji Ruwase | e6ac731136 | Support initialization with dict configuration (#632) | 3 years ago |
| Olatunji Ruwase | 6021b70288 | Support non-tensor state in checkpoint (#548) | 3 years ago |
| Olatunji Ruwase | 0178e6cc22 | Fix unbalanced gradients bug in ZeRO-2 gradient accumulation (#545) | 3 years ago |
| Olatunji Ruwase | be1147c08a | PLD release (#513) | 4 years ago |
| Shaden Smith | 65c2f974d8 | Pipeline parallel training engine. (#392) | 4 years ago |
| Jeff Rasley | 376818ef9d | Empty grad fix (#291) | 4 years ago |
| Olatunji Ruwase | 607814feb9 | Fix bug in fp32 optimizer state loading (#289) | 4 years ago |
| Calogero Zarbo | 43f27332c2 | Add "zero_allow_untested_optimizer" option in conf file (#173) | 4 years ago |
| Jeff Rasley | 001abe2362 | Refactor simple model test, fix pythonpath issue (#96) | 4 years ago |