
# DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

## Introduction

Training large models with long sequences is becoming increasingly important across the board, from generative AI to models for scientific discovery. On the generative AI side, conversational AI, long document summarization, and video generation all require reasoning over long contexts in the spatial and temporal domains. For example, multimodal foundation models, such as those that process speech, images, and waveforms concurrently, require long-context reasoning over high-dimensional inputs with extremely long sequences. Similarly, chapter- and book-level summarization (estimated at tens to hundreds of thousands of words) is of great importance in conversational AI and abstractive summarization tasks.

Long sequence length is equally critical for AI for science, opening doors to a better understanding of structural biology, health care, climate and weather forecasting, and large molecular simulation. For instance, by adapting large language models to gene sequences, we can create language models that learn the evolutionary patterns of genomes using simple alphabets and extremely long sequences (the human genome has 6.4 billion letters). In health care, diagnostic predictive models conditioned on a patient's entire care record require contexts of extremely long sequences.

Despite the emerging importance of long sequence length for both generative AI and AI for science, existing large model training systems and their underlying parallelism technologies (data, tensor, pipeline, and sequence parallelism) are limited in their ability to support efficient long-sequence training. Two challenges with existing parallelism approaches come to the fore. First, existing approaches such as data, tensor, and pipeline parallelism cannot address scaling along the sequence dimension. Second, existing sequence parallelism approaches are not effective because of memory-communication inefficiencies. Furthermore, existing approaches have limited usability, requiring intrusive and error-prone code refactoring.

In this release, we are proud to introduce DeepSpeed-Ulysses (or Ulysses, a very long novel), a simple, portable, and effective methodology for enabling highly efficient and scalable LLM training with extremely long sequence lengths.

DeepSpeed-Ulysses partitions individual samples along the sequence dimension among the participating GPUs. Then, right before the attention computation, it employs an all-to-all communication collective on the partitioned queries, keys, and values such that each GPU receives the full sequence but only for a non-overlapping subset of the attention heads. This allows the participating GPUs to compute attention for different attention heads in parallel. Finally, DeepSpeed-Ulysses employs another all-to-all to gather the results along the attention heads while re-partitioning along the sequence dimension.
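
Conceptually, each all-to-all trades a shard along the sequence dimension for a shard along the head dimension. Below is a minimal PyTorch sketch of that resharding pattern, not the DeepSpeed implementation; the tensor shapes, the process group handle, and the omission of the batch dimension are simplifying assumptions for illustration.

```python
import torch
import torch.distributed as dist

def seq_shard_to_head_shard(x: torch.Tensor, group) -> torch.Tensor:
    """All-to-all that turns a [N/P, H, d] sequence shard (all H heads)
    into a [N, H/P, d] head shard (full sequence, H/P heads)."""
    P = dist.get_world_size(group)
    s, H, d = x.shape                                            # s = N/P local sequence length
    # Group the H heads into P blocks; block i is destined for rank i.
    x = x.reshape(s, P, H // P, d).transpose(0, 1).contiguous()  # [P, s, H/P, d]
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)                  # exchange blocks along dim 0
    # Received block j holds rank j's sequence positions for our head block;
    # concatenating the blocks in rank order restores the full sequence.
    return out.reshape(P * s, H // P, d)                         # [N, H/P, d]

def head_shard_to_seq_shard(x: torch.Tensor, group) -> torch.Tensor:
    """Inverse all-to-all: [N, H/P, d] back to [N/P, H, d] after attention."""
    P = dist.get_world_size(group)
    N, hp, d = x.shape
    x = x.reshape(P, N // P, hp, d).contiguous()                 # block k = rank k's sequence shard
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)                  # out[k] = our sequence shard of head block k
    return out.transpose(0, 1).contiguous().reshape(N // P, P * hp, d)  # [N/P, H, d]
```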

The key properties of DeepSpeed-Ulysses and its implementation released with this blog are as follows:

- 4x larger sequence lengths than existing systems, while enabling training with sequences of over a million tokens.

- Communication reduction of over 10x compared to existing systems, resulting in throughput improvements of up to 2.5x, and sustained throughput of over 175 TFlops/GPU (over 54% of hardware peak).

- Fully general and implementation-agnostic attention: DeepSpeed sequence parallelism supports dense as well as sparse attention, and it works with efficient attention implementations such as FlashAttention v2.

- Support for massive model training: DeepSpeed sequence parallelism works together with ZeRO-3 to support not only large sequence lengths but also massive model sizes.

- Easy to use and portable, requiring minimal code changes to existing training frameworks.

In the subsequent sections, we provide a detailed discussion of the DeepSpeed-Ulysses core design, a communication complexity analysis, an experimental evaluation with comparison to existing work, and highlights of usability along with a usage guide.

## Core Design of DeepSpeed-Ulysses

*Figure 1: DeepSpeed sequence parallelism (DeepSpeed-Ulysses) design*

Figure 1 shows the core design of DeepSpeed-Ulysses. As in the standard transformer architecture, the design operates on input sequences of length N partitioned across the P available devices. Each local N/P partition is projected into query (Q), key (K), and value (V) embeddings. Next, the (QKV) embeddings are gathered into global QKV through highly optimized all-to-all collectives between the participating compute devices. The all-to-all collective is followed by the per-head attention computation in the form:

$$\text{Output context} = \text{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

After the attention computation, another all-to-all collective transforms the output context tensor of the attention computation back to sequence (N/P) parallelism for the subsequent operators (MLP MatMul, layer norm, etc.) in the remaining modules of the transformer layer block.
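
For reference, the per-head computation above is ordinary scaled dot-product attention. A minimal PyTorch version is shown below; under DeepSpeed-Ulysses each device would run this over the full sequence length N but only its H/P heads (the [heads, N, d] layout and the absence of masking are simplifying assumptions).

```python
import math
import torch

def per_head_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: [heads_local, N, d] on one device (heads_local = H/P under Ulysses)."""
    d = q.shape[-1]
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d)  # [heads_local, N, N]
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)                               # [heads_local, N, d]
```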

## Significant Communication Volume Reduction

What distinguishes DeepSpeed-Ulysses from other existing long-sequence approaches is its much smaller aggregate communication volume and its overall better scalability with increasing degree of sequence parallelism, as demonstrated by the communication volume analysis below:

On modern clusters with intra-node NVSwitch interconnect and inter-node fat-tree IB topology, the communication volume transmitted per link for an all-to-all of an aggregate message of size M over P GPUs is M/P. For a transformer model with hidden size h, sequence length N, and parallelism degree P, DeepSpeed sequence parallelism performs an all-to-all for the QKV projections with an aggregate message size of 3Nh before the attention computation, and another all-to-all for the output context projection with a size of Nh, for each transformer layer. Therefore, DeepSpeed sequence parallelism incurs an aggregate communication volume per link of 4Nh/P (i.e., a complexity of O(N/P)). Note that this communication volume is constant when both N and P are increased proportionally.

In contrast, existing approaches like Megatron-LM incur a communication volume that increases linearly with N regardless of P, resulting in a communication complexity of O(N). For instance, Megatron-LM performs two all-gathers with a message volume of Nh and two reduce-scatters with a volume of Nh for each transformer layer. However, the cost of each all-gather and reduce-scatter of size M remains M when P >> 1, instead of M/P. Therefore, Megatron-LM sequence parallelism incurs a communication volume per link of 4Nh, which is P times larger than that of DeepSpeed sequence parallelism. This allows DeepSpeed sequence parallelism to enable training with extremely long sequences while achieving significantly higher training efficiency compared to existing approaches. Our evaluation results match this analysis.
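
To make the comparison concrete, here is a small back-of-the-envelope helper that encodes the analysis above; the example numbers are illustrative only, not benchmark results.

```python
def per_link_volume_per_layer(N: int, h: int, P: int) -> dict:
    """Approximate per-link communication volume (in elements) per transformer layer."""
    # DeepSpeed-Ulysses: one all-to-all on QKV (aggregate 3Nh) plus one on the output
    # context (Nh); an all-to-all of aggregate size M moves roughly M/P per link.
    ulysses = 4 * N * h / P
    # Megatron-LM sequence parallelism: two all-gathers and two reduce-scatters of
    # size Nh each; for P >> 1 each moves roughly M per link.
    megatron = 4 * N * h
    return {"ulysses": ulysses, "megatron": megatron}

# Example: N = 262,144 tokens, hidden size h = 8,192, P = 64 GPUs
vols = per_link_volume_per_layer(262_144, 8_192, 64)
print(vols["megatron"] / vols["ulysses"])  # ratio equals P, i.e., 64x less traffic for Ulysses
```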

## Additional Highlights of DeepSpeed-Ulysses

### An Attention-Agnostic Solution

The DeepSpeed implementation of the distributed attention module is general enough to support any attention: e.g., self-attention, cross-attention, and causal attention, in both their dense and sparse variants, as well as various optimized kernels that support long sequences at the local attention level, such as different versions of FlashAttention.

The generality of DeepSpeed-Ulysses stems from the modular nature of its core design: an attention-centric sequence parallelism design. Prior to the attention computation, parallelism is along the sequence dimension (N/P partitions); during the attention computation, parallelism is along the head dimension, with full attention computed per head but over fewer heads per device. Thus, the attention computation can be replaced with any type of attention mechanism, e.g., dense attention or various forms of sparse attention.
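
As a sketch of what this modularity looks like in practice, the wrapper shown in the usability section below can take any local attention module as its inner kernel. The module below uses PyTorch's built-in scaled dot-product attention purely as an illustrative stand-in; the exact tensor layout expected by the wrapper depends on your model and on the DistributedAttention scatter/gather settings.

```python
import torch
import torch.nn.functional as F
from deepspeed.sequence.layer import DistributedAttention

class LocalDenseAttention(torch.nn.Module):
    """Illustrative local attention: full sequence, local subset of heads, any kernel."""
    def forward(self, q, k, v):
        # q, k, v: [batch, heads_local, seq, head_dim]; this could equally be a
        # block-sparse kernel or FlashAttention; Ulysses is agnostic to the choice.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Hypothetical wiring (the process-group helper is assumed to exist in your code base):
# dist_attn = DistributedAttention(LocalDenseAttention(), get_sequence_parallel_group())
```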

### Training Bigger Models with Longer Sequences through ZeRO-3 Integration

While DeepSpeed sequence parallelism reduces the activation memory when training with longer sequences, it does not impact the memory consumed by the model states. Therefore, to support long sequence training with large language models, DeepSpeed sequence parallelism is integrated with ZeRO-3.

The Zero Redundancy Optimizer Stage 3 (ZeRO-3) is a memory optimization technique for training large models. Unlike classic data-parallel training of neural networks, where model states are replicated across data-parallel ranks, ZeRO-3 optimizes memory usage by partitioning the model states across data-parallel ranks. With sequence parallelism, however, the training data can be considered along both the batch (sample) and sequence dimensions, and the associated parallel groups can be combined to form a larger group for ZeRO partitioning.

Therefore, we extend ZeRO-3 partitioning to the combination of data-parallel and sequence-parallel ranks. In other words, in DeepSpeed sequence parallelism, ZeRO partitions model states across both the sequence- and data-parallel groups and collects per-rank partitions (via all-gather) when they are needed. Similarly, gradients are reduced across both data- and sequence-parallel ranks for the parameter update. ZeRO allows for huge memory savings in both the sequence and data dimensions and enables scaling not just to large sequence lengths but also to large models.
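
In practice, enabling this combination is largely a configuration choice. Below is a hedged sketch of a DeepSpeed configuration that turns on ZeRO stage 3 alongside sequence parallelism; the field values are placeholders, and the exact knobs (e.g., how the sequence-parallel group is created) depend on your training script and the Megatron-DeepSpeed integration you use.

```python
import deepspeed

# Illustrative DeepSpeed configuration: ZeRO-3 partitions model states across the
# combined data-parallel and sequence-parallel ranks.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}

# model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```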

## Evaluation

We evaluate DeepSpeed-Ulysses on GPT, a foundation model for many NLP tasks, on up to 64 A100 GPUs with 40GB of memory each. Our evaluation is four-fold: i) sequence length scalability, ii) throughput for dense attention and comparison with existing systems, iii) throughput for sparse attention and comparison with existing systems, and iv) a convergence study of DeepSpeed sequence parallelism. We discuss and present evaluations from each of these categories next.

### Sequence Length Scalability

The first set of experiments is strong scaling of the sequence length up to 1 million tokens on a 1.2 billion parameter GPT model. The results of this evaluation are shown in Figure 2. DeepSpeed sequence parallelism allows increasing the sequence length linearly with the number of GPUs and maintains similar computational throughput across different sequence lengths at the appropriate GPU counts.

*Figure 2: DeepSpeed sequence parallelism strong scalability evaluation at different sequence lengths and GPU counts.*

### Dense Attention Evaluation

Next, we evaluate DeepSpeed sequence parallelism on a 30 billion parameter dense attention model and benchmark it against Megatron-LM sequence parallelism on 64 A100 GPUs. The results of these evaluations are shown in Figure 3.

We compare DeepSpeed sequence parallelism with Megatron-LM for a 30B model running various sequence lengths. For our evaluation, we chose the sequence parallelism degree and global batch size that produced the best performance (measured as throughput or TFLOPs) for both DeepSpeed sequence parallelism and Megatron-LM; we call these the optimal (batch size, sequence length) configurations. For DeepSpeed sequence parallelism, we always use a ZeRO parallelism degree of 64.

Figure 3 shows that DeepSpeed sequence parallelism consistently outperforms Megatron-LM for the sequence lengths that can be run on both. In addition, DeepSpeed sequence parallelism can run longer sequences than Megatron-LM. The performance advantages of DeepSpeed sequence parallelism are twofold: (1) DeepSpeed sequence parallelism in combination with ZeRO-3 fits more samples than Megatron-LM because of its memory optimizations, leading to higher throughput; (2) DeepSpeed sequence parallelism benefits from efficient all-to-all communication relative to the all-gather communication applied in Megatron-LM sequence parallelism.

*Figure 3: Evaluation of DeepSpeed and Megatron LM sequence parallelism on 30B parameter model with dense attention.*

### Sparse Attention Evaluation

Similarly, we evaluate DeepSpeed sequence parallelism on a 30 billion parameter sparse attention model and benchmark it against Megatron-LM sequence parallelism. The results of our evaluation are shown in Figure 4. We observe trends with sparse attention similar to those of the dense attention experiments: DeepSpeed sequence parallelism achieves more than 2x the throughput of Megatron-LM. In terms of memory savings, DeepSpeed sequence parallelism leveraging ZeRO-3 scales to 4x longer sequence lengths than Megatron-LM.

DeepSpeed sequence parallelism outperforms Megatron-LM for the sequence lengths that can be run on both. In fact, the current DeepSpeed throughput is bottlenecked by the local sparse attention implementation, and as a result DeepSpeed throughput decreases as the sequence length increases. We expect this performance gap between DeepSpeed and Megatron-LM to widen further for larger sequence lengths as we improve the performance of the local sparse attention implementation in the future.

*Figure 4: Evaluation of DeepSpeed and Megatron LM sequence parallelism on 30B parameter model with block sparse attention.*

### Convergence Study

Lastly, Figure 5 shows the convergence of a 1.3 billion parameter GPT model at a 32K sequence length on 8 A100 GPUs, with the sequence parallelism degree set to 4 for both DeepSpeed and Megatron-LM sequence parallelism. For DeepSpeed sequence parallelism, we evaluate convergence with different ZeRO stages. DeepSpeed sequence parallelism is a purely system-level optimization technique that enables the training of long-sequence Transformer models; thus, there is no (negative) impact on the quality of the trained models. This assertion is validated through the experiments shown in Figure 5.

*Figure 5: Convergence evaluation of DeepSpeed sequence parallelism with different ZeRO memory optimization stages.*

## DeepSpeed-Ulysses Software Accessibility

DeepSpeed-Ulysses can be easily integrated into your code with just a few lines of simple code changes. Here is an example of how to enable it:

```python
from deepspeed.sequence.layer import DistributedAttention

# Replace the original self-attention (attn) with DeepSpeed-Ulysses's self-attention
dist_attn = DistributedAttention(attn, get_sequence_parallel_group())
```
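
The wrapped module is then called in place of the original attention inside the transformer layer's forward pass. Assuming the usual query/key/value calling convention (the tensor names and layout below are illustrative):

```python
# q, k, v are the locally projected, sequence-parallel shards; dist_attn performs the
# two all-to-all reshardings internally and returns a sequence-parallel output.
context = dist_attn(q, k, v)
```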

Compared to other libraries that support sequence parallelism, such as Megatron-LM, DeepSpeed-Ulysses does not require model refactoring. DeepSpeed-Ulysses has been fully integrated and tested with the Megatron-DeepSpeed code repository. This means that if you are already using this repository for training large language models, you can seamlessly benefit from DeepSpeed-Ulysses to train models with massive sequence lengths.

## Release: Try DeepSpeed-Ulysses Today

We are excited to release DeepSpeed-Ulysses, accessible through the DeepSpeed GitHub repository. A detailed tutorial on usage is available on the DeepSpeed tutorial page.

We welcome contributions and collaboration as we push forward together on what is possible when long context windows are no longer a limitation. DeepSpeed-Ulysses is part of the bigger DeepSpeed ecosystem of large-scale AI training and inference. For more details on all DeepSpeed technologies and innovations, please visit our website and follow us on X, formerly Twitter (English, Japanese), and Chinese Zhihu.

We are open to collaborations with universities, research labs, and companies. For such requests (and other requests unsuitable for GitHub), please email us directly at deepspeed-info@microsoft.com. If you like our work, please "Star" our repo.