Zhen Zhang 8a63754bce save_non_zero_checkpoint on first partition group (#3787) 1 year ago
autotuning b361c72761 Update DeepSpeed copyright license to Apache 2.0 (#3111) 1 year ago
checkpoint 8a63754bce save_non_zero_checkpoint on first partition group (#3787) 1 year ago
comm 1bc3b78423 [CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) (#3919) 1 year ago
compression 6b2365e4fa Re-enable elastic training for torch 2+ (#4010) 1 year ago
elasticity 7290aace9b [CPU] Skip CPU support unimplemented error (#3633) 1 year ago
hybrid_engine 7290aace9b [CPU] Skip CPU support unimplemented error (#3633) 1 year ago
inference 76953a37b7 fix opt-350m shard loading issue in AutoTP (#3600) 1 year ago
launcher 1f72082fc0 [CPU] Support Intel CPU inference (#3041) 1 year ago
model_parallelism 6b2365e4fa Re-enable elastic training for torch 2+ (#4010) 1 year ago
moe 6b2365e4fa Re-enable elastic training for torch 2+ (#4010) 1 year ago
monitor b361c72761 Update DeepSpeed copyright license to Apache 2.0 (#3111) 1 year ago
ops 7290aace9b [CPU] Skip CPU support unimplemented error (#3633) 1 year ago
pipe 7ddc3b01dd Fix pipeline module evaluation when contiguous activation checkpointing is enabled (#3005) 1 year ago
profiling 6b2365e4fa Re-enable elastic training for torch 2+ (#4010) 1 year ago
runtime 7f90ef4bdd Multiple zero stage 3 related fixes (#3886) 1 year ago
utils b361c72761 Update DeepSpeed copyright license to Apache 2.0 (#3111) 1 year ago
__init__.py b361c72761 Update DeepSpeed copyright license to Apache 2.0 (#3111) 1 year ago
alexnet_model.py aef6c65ce3 Reduce Unit Test Times (Part 3) (#3850) 1 year ago
common.py 1bc3b78423 [CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) (#3919) 1 year ago
ds_batch_config.json ff42743865 Refactor remaining distributed tests (#2216) 2 years ago
gpt2-merges.txt ff42743865 Refactor remaining distributed tests (#2216) 2 years ago
gpt2-vocab.json ff42743865 Refactor remaining distributed tests (#2216) 2 years ago
megatron_model.py 4b35833379 Revert "Update megatron GPT2Model" 1 year ago
modeling.py b361c72761 Update DeepSpeed copyright license to Apache 2.0 (#3111) 1 year ago
modelingpreln.py b361c72761 Update DeepSpeed copyright license to Apache 2.0 (#3111) 1 year ago
multi_output_model.py b361c72761 Update DeepSpeed copyright license to Apache 2.0 (#3111) 1 year ago
simple_model.py 2ded2ff0be checking process_group before merging bucket ranges (#3521) (#3577) 1 year ago
util.py 7b850d3d04 Re-enable skipped unit tests (#3939) 1 year ago