harygo2
|
0fc19b6a32
Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device() (#5464)
|
5 months ago |
inkcherry
|
0896503e2f
Fix a convergence issues in TP topology caused by incorrect grad_norm. (#5411)
|
6 months ago |
Logan Adams
|
6dcced1d5c
Cleanup required_torch_version code and references. (#5370)
|
6 months ago |
Masahiro Tanaka
|
c56a4b9e0d
Improve universal checkpoint (#5289)
|
6 months ago |
inkcherry
|
e5dd5501c1
support bf16_optimizer moe expert parallel training and moe EP grad_scale/grad_norm fix (#5259)
|
6 months ago |
Moshe Island
|
8ad187d84f
Universal ckp fixes (#4588)
|
11 months ago |
Jackmin801
|
2f73b834b5
change default set_to_none in zero_grad methods (#4438)
|
1 year ago |
Alexander Jipa
|
b354c28b76
polishing timers and log_dist (#3996)
|
1 year ago |
Logan Adams
|
6b2365e4fa
Re-enable elastic training for torch 2+ (#4010)
|
1 year ago |
Michael Wyatt
|
b361c72761
Update DeepSpeed copyright license to Apache 2.0 (#3111)
|
1 year ago |
Jeff Rasley
|
91d63e0228
update formatter version and style settings (#3098)
|
1 year ago |
Ma, Guokai
|
98cc35b6a8
Abstract accelerator (step 3) (#2677)
|
1 year ago |
loadams
|
34a11688c4
Change zero_grad() argument to match pytorch (#2741)
|
1 year ago |
JackieWu
|
323c266cfe
[Bug Fixed] use torch.cuda.is_available() (#2661)
|
1 year ago |
Alex Hedges
|
316c4a43e0
Add flake8 to pre-commit checks (#2051)
|
2 years ago |
Karim Foda
|
735406e536
fix import errors (#2026)
|
2 years ago |
Ammar Ahmad Awan
|
36ad3119d5
DeepSpeed comm backend v1 (#1985)
|
2 years ago |
Jeff Rasley
|
50893458d6
Fairseq support (#1915)
|
2 years ago |
Olatunji Ruwase
|
56c5223868
bf16+pipeline parallelism (#1801)
|
2 years ago |
Ammar Ahmad Awan
|
c0af6d90f7
Refactor MoE and Groups API to simplify model creation and mangement (#1798)
|
2 years ago |
Olatunji Ruwase
|
135a625619
Move param_shapes to model files (#1732)
|
2 years ago |
Jeff Rasley
|
e46d808a1b
MoE inference + PR-MoE model support (#1705)
|
2 years ago |
Jeff Rasley
|
3293cf72a0
[ZeRO] Default disable elastic ckpt in stage 1+2 and reduce CPU memory overhead during ckpt load (#1525)
|
2 years ago |
Jeff Rasley
|
e2fdd254ed
Big science related changes (#1407)
|
3 years ago |
Ammar Ahmad Awan
|
f28432441b
DeepSpeed MoE (#1310)
|
3 years ago |
Reza Yazdani
|
ed3de0c21b
Quantization + inference release (#1091)
|
3 years ago |
Conglong Li
|
67a48aaa89
1-bit LAMB optimizer (#970)
|
3 years ago |
Stas Bekman
|
29853c3eed
less scary overflow notice (#833)
|
3 years ago |
Shaden Smith
|
f5cce75e70
Overflow fix (#416)
|
4 years ago |
Shaden Smith
|
65c2f974d8
Pipeline parallel training engine. (#392)
|
4 years ago |