2020-05-19-bert-record.md 1.5 KB


layout: single title: "The Fastest and Most Efficient BERT Training through Optimized Transformer Kernels" excerpt: "" categories: news new_post: true

date: 2020-05-19 00:00:00

We introduce new technology to accelerate single GPU performance via kernel optimizations. These optimizations not only create a strong foundation for scaling out large models, but also improve the single GPU performance of highly tuned and moderately sized models like BERT by more than 30%, reaching a staggering performance of 66 teraflops per V100 GPU, which is 52% of the hardware peak. Using optimized transformer kernels as the building block, DeepSpeed achieves the fastest BERT training record: 44 minutes on 1,024 NVIDIA V100 GPUs, compared with the best published result of 67 minutes on the same number and generation of GPUs.