Mayank Mishra
|
f1d2a15b50
better eval sampler (#2907)
|
1 year ago |
Olatunji Ruwase
|
81b4d5db06
Make z3 respect comm dtype (#2807)
|
1 year ago |
Conglong Li
|
7c99def0f0
Data efficiency library update (#2866)
|
1 year ago |
Stas Bekman
|
d323abd80f
remove outdated comment (#2786)
|
1 year ago |
Logan Adams
|
86477538a6
Fix hardcoded instances to fp16 in optimizer creation log messages to the correct dtype. (#2743)
|
1 year ago |
Bing Xie
|
8d3b42c230
Bing/formatting correction (#2764)
|
1 year ago |
Jeff Rasley
|
a60e31a7f2
[zero] remove misleading dtype log (#2732)
|
1 year ago |
Dashiell Stander
|
d4bfae415d
Fix autotuning so that it records Floating Point Operations per second, not microsecond (#2711)
|
1 year ago |
Ma, Guokai
|
98cc35b6a8
Abstract accelerator (step 3) (#2677)
|
1 year ago |
Joe Mayer
|
4be8df721a
fixing optimizer sanity check (#2742)
|
1 year ago |
Ammar Ahmad Awan
|
867da307d0
Inference Refactor (replace_with_policy, model_implementations) (#2554)
|
1 year ago |
Joe Mayer
|
8d87c89e42
BF16 optimizer for BF16+ZeRO Stage 1 (#2706)
|
1 year ago |
Jeff Rasley
|
e4ba722297
non-MoE stage 1 requires CG disabled (#2703)
|
1 year ago |
Alexander Jipa
|
0f0e38c520
fixes #2498 (#2603)
|
1 year ago |
Conglong Li
|
ef869377e9
DeepSpeed Data Efficiency Library (#2585)
|
1 year ago |
Ma, Guokai
|
06938835eb
Support fp32 gradaccum for bf16 model (#2566)
|
1 year ago |
Cheng Li
|
abe4fc6b55
encoded ds config into command line argument when launching child processes in autotuning (#2524)
|
1 year ago |
ShijieZZZZ
|
340fc0cf19
Report progress at gradient accumulation boundary (#2553)
|
1 year ago |
Joe Mayer
|
21c2802964
Adding Gradient Accumulation Data Type Config (#2512)
|
1 year ago |
Olatunji Ruwase
|
ee39187d8f
Make bf16_optimizer work for non pipeline (#2470)
|
1 year ago |
Joe Mayer
|
7d113633e4
Fix Bug #2319 (#2438)
|
2 years ago |
Adam Moody
|
b8fb9c3f1a
parallelize writing of layer checkpoint files across data parallel instances (#1419)
|
2 years ago |
Olatunji Ruwase
|
799120e7e4
Universal checkpoint for zero stage 1 (#2284)
|
2 years ago |
Joe Mayer
|
906b4a025f
Fixing bug 2361 (#2410)
|
2 years ago |
Alexander Jipa
|
cfead55132
fixes #2389 (#2411)
|
2 years ago |
Matt Smith
|
b609a29412
fix an exception when recursively casting dicts to fp16 (#2370)
|
2 years ago |
叶志晟
|
80f94c10c5
fix #2240: wrong time unit in flops_profiler (#2241)
|
2 years ago |
Olatunji Ruwase
|
cb5e05fe55
Correctly detect CPU optimizer usage (#2257)
|
2 years ago |
Siddharth Singh
|
b288cf1b9b
Enable contiguous gradients with Z1+MoE (#2250)
|
2 years ago |
Olatunji Ruwase
|
217338beb6
Refactor dist tests: Checkpointing (#2202)
|
2 years ago |