Commit History

Author SHA1 Message Commit Date
  mzl 8d53ac0cd3 Add MPICH Multinode Runner (#2839) 1 year ago
  Ma, Guokai 98cc35b6a8 Abstract accelerator (step 3) (#2677) 1 year ago
  Jeff Rasley a091bc223c [launcher] fail gracefully if hostname -i doesn't work as expected (#2631) 1 year ago
  mzl 11f5daba5e add enable_each_rank_log to deepspeed/launcher/runner.py (#2571) 1 year ago
  Jeff Rasley 8c56c25d84 [launcher] parse hostfile via regex and added error checks (#2626) 1 year ago
  savitamittal1 ffb6d98762 Added MLFLOW environment variables for logging metrics within trainig… (#2477) 1 year ago
  Cheng Li 8da0238b7a rollback ds config changes (#2395) 2 years ago
  Dashiell Stander 3db0b5e2de Add SLURM Multinode Runner (#2404) 2 years ago
  Arpan Jain 1ed5aa96a8 Elastic Training support in DeepSpeed (#2153) (#2156) 2 years ago
  trajep e669aaf55b Trajepl/nebula ckpt engine (#2085) 2 years ago
  liamcli 380d32f980 [launcher] add option to bypass ssh check (#1957) 2 years ago
  Jeff Rasley a773996d97 [launcher] validate passwordless-ssh works when using hostfile launching (#1832) 2 years ago
  liamcli dac9056e13 Improve how runner parses env var file (#1747) 2 years ago
  Jeff Rasley 9351266f78 Multi-node save pid support + allow sparse-attn extra (#1728) 2 years ago
  Jeff Rasley 171316fc83 launcher save pid + require manual triton install for sparse-attn (#1727) 2 years ago
  Jeff Rasley 2d51f6171b preserve cuda visible devices order (#1712) 2 years ago
  liamcli fead387f78 support module and no python args for launcher (#1690) 2 years ago
  Stas Bekman e3c2d7b16f [launcher/runner] respect CUDA_VISIBLE_DEVICES for a single node (#960) 2 years ago
  Cheng Li 9caa74e577 Autotuning (#1554) 2 years ago
  Chunyang Wen df5b0884c7 Unify use f str (#1511) 3 years ago
  Alex Hedges be789b1665 Fix many typos (#1423) 3 years ago
  Jeff Rasley 9e0dab402d add option to force multi-node launcher mode (#977) 3 years ago
  Jeff Rasley 72a30c1eab revert zero-inf change to launcher 3 years ago
  Jeff Rasley 0d4a54a04d ZeRO-Infinity (#976) 3 years ago
  Takuya Makino e6999ebd16 Delete check of pdsh (#941) 3 years ago
  Takuya Makino ce14cf1af6 Add space in help string (#926) 3 years ago
  Stas Bekman 24335d49ce [runner/launch] propagate the error (#854) 3 years ago
  Jeff Rasley 2e6692c8ad Fix regression in runner (#843) 3 years ago
  Samyam Rajbhandari 599258f979 ZeRO 3 Offload (#834) 3 years ago
  Jeff Rasley 6217a6c243 skip empty lines in hostfile (#669) 3 years ago