Commit History

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Jeff Rasley | a59455fa73 | fix for DS_ENV issue (#4992) | 9 months ago |
| minchao | 6d7b44a838 | [NPU] load EXPORT_ENV based on different accelerators to support multi-node training on other devices (#4830) | 10 months ago |
| Alienfeel | 65b7727758 | Fix 4649 (#4650) | 10 months ago |
| Michael Wyatt | d37fc25d56 | Refactor launcher user arg parsing (#4824) | 10 months ago |
| Omar Elayan | 00e7dc5e51 | Fix for when prompt contains an odd num of apostrophes (#4660) | 11 months ago |
| Yudi Zhang | e75c285ad7 | fix user args parsing of string with spaces on runner (#4265) | 1 year ago |
| Hiromasa | 8145b5e41f | added port argument for ssh (#4117) | 1 year ago |
| Logan Adams | 1a29573946 | Handling for SIGTERM as well (#4160) | 1 year ago |
| Michael Wyatt | 0cc2d6ff25 | Fix user arg parsing in single node deployment (#4007) | 1 year ago |
| Logan Adams | 6580a2db17 | Allow user to select name of .deepspeed_env (#4006) | 1 year ago |
| digger yu | ce535945e6 | fix: change ==NONE to is (#3923) | 1 year ago |
| Abhilash Majumder | 26b3e73298 | single node pdsh sigkill (#3730) | 1 year ago |
| Logan Adams | d8aaa58122 | Fix incorrectly formatted f string (#3698) | 1 year ago |
| Jeff Rasley | 49a73549b9 | AISC launcher fixes (#3637) | 1 year ago |
| Ma, Guokai | 1f72082fc0 | [CPU] Support Intel CPU inference (#3041) | 1 year ago |
| Michael Wyatt | 2f8d384e8b | print default values (#3347) | 1 year ago |
| Ma, Guokai | 0b5252bbd3 | [CPU support] Optionally bind each rank to different cores on host (#2881) | 1 year ago |
| Michael Wyatt | b361c72761 | Update DeepSpeed copyright license to Apache 2.0 (#3111) | 1 year ago |
| Jeff Rasley | 91d63e0228 | update formatter version and style settings (#3098) | 1 year ago |
| mzl | 8d53ac0cd3 | Add MPICH Multinode Runner (#2839) | 1 year ago |
| Ma, Guokai | 98cc35b6a8 | Abstract accelerator (step 3) (#2677) | 1 year ago |
| Jeff Rasley | a091bc223c | [launcher] fail gracefully if hostname -i doesn't work as expected (#2631) | 1 year ago |
| mzl | 11f5daba5e | add enable_each_rank_log to deepspeed/launcher/runner.py (#2571) | 1 year ago |
| Jeff Rasley | 8c56c25d84 | [launcher] parse hostfile via regex and added error checks (#2626) | 1 year ago |
| savitamittal1 | ffb6d98762 | Added MLFLOW environment variables for logging metrics within trainig… (#2477) | 1 year ago |
| Cheng Li | 8da0238b7a | rollback ds config changes (#2395) | 2 years ago |
| Dashiell Stander | 3db0b5e2de | Add SLURM Multinode Runner (#2404) | 2 years ago |
| Arpan Jain | 1ed5aa96a8 | Elastic Training support in DeepSpeed (#2153) (#2156) | 2 years ago |
| trajep | e669aaf55b | Trajepl/nebula ckpt engine (#2085) | 2 years ago |
| liamcli | 380d32f980 | [launcher] add option to bypass ssh check (#1957) | 2 years ago |