Tunji Ruwase
|
3d5ea1430b
Respect memory pinning config
|
1 year ago |
Joe Mayer
|
8a8683d343
Fix Issue 4083 (#4084)
|
1 year ago |
leiwen83
|
1e0c39c6bf
enable pipeline checkpoint loading mode (#3629)
|
1 year ago |
Zhen Zhang
|
8a63754bce
save_non_zero_checkpoint on first partition group (#3787)
|
1 year ago |
Olatunji Ruwase
|
7f90ef4bdd
Multiple zero stage 3 related fixes (#3886)
|
1 year ago |
Alexander Jipa
|
b354c28b76
polishing timers and log_dist (#3996)
|
1 year ago |
Joe Mayer
|
eeab613ab8
Monitored Loss Calculations (#4030)
|
1 year ago |
Olatunji Ruwase
|
0a0819b785
Option to exclude frozen weights for checkpoint save (#3953)
|
1 year ago |
Joe Mayer
|
8afcda2ac9
ZeRO Gradient Accumulation Dtype. (#2847)
|
1 year ago |
Adrian Wälchli
|
fb9aebbf25
Fix checkpoint conversion when model layers share weights (#3825)
|
1 year ago |
digger yu
|
ce535945e6
fix: change ==NONE to is (#3923)
|
1 year ago |
Xingjian Shi
|
d81dfdabcc
Fix LoRA Fuse/Unfuse in Hybrid Engine (#3563)
|
1 year ago |
kisseternity
|
1b888399dc
Add an api in deepspeed engine for adjusting micro batch size during training (#3773)
|
1 year ago |
Joe Mayer
|
5eb2598623
Requires grad checking. (#3789)
|
1 year ago |
Cheng Li
|
c80855b543
Bug Fixes for autotuner and flops profiler (#1880)
|
1 year ago |
Heyang Qin
|
d18aa2c79c
ZeRO++ (#3784)
|
1 year ago |
Jeff Rasley
|
80ccaf9c7a
revert PR #3611 (#3786)
|
1 year ago |
mzl
|
5a5340d03b
remove UtilsBuilder load, use torch (un)flatten ops (#3728)
|
1 year ago |
Zhen Zhang
|
c88af21432
[MiCS] [Fix] saving and loading model checkpoint logic for MiCS sharding (#3440)
|
1 year ago |
Guo Yejun
|
460bec4679
flops_profiler: add option recompute_fwd_factor for the case of activation recompute (#3362)
|
1 year ago |
digger yu
|
cd4e473ee6
fix typo with deepspeed/ (#3547)
|
1 year ago |
Olatunji Ruwase
|
d39c311fc6
DS init should not broadcast or move zero.Init models (#3611)
|
1 year ago |
Joe Mayer
|
4d269c6e4d
Changing monitor loss to aggregate loss over gradient accumulation steps (#3428)
|
1 year ago |
digger-yu
|
254663a28c
fix spelling error with deepspeed/runtime/ (#3509)
|
1 year ago |
Tian, Feng
|
6938c449de
Add snip_momentum structured pruning which can support higher sparse ratio with minor accuracy loss (#3300)
|
1 year ago |
Joe Mayer
|
d3550dc88a
Adagrad support in ZeRO (#3401)
|
1 year ago |
Stas Bekman
|
77ebf760f3
[zero_to_fp32] fix shared param recovery (#3407)
|
1 year ago |
Zhen Zhang
|
2e99f6edf6
[DRAFT] Tentative implementation of MiCS (#2964)
|
1 year ago |
Alexander Jipa
|
d56268f375
fixing default communication_data_type for bfloat16_enabled and docs (#3370)
|
1 year ago |
Michael Wyatt
|
ad168a6954
Fix for dist not being initialized when constructing main config (#3324)
|
1 year ago |