DeepSpeed-Chat, publicly released on GitHub, is a general system framework for RLHF training that enables easy, fast, affordable, and scalable training of ChatGPT-style models. The detailed performance and capabilities of DeepSpeed-Chat have been published in our blog post and arXiv paper.
We are happy to share that today we are improving DeepSpeed-Chat along three areas: i) system support for Llama/Llama-2 family of models, ii) system features for improved efficiency and accessibility, and iii) stability and software enhancements.
We introduce system support for training Llama and Llama-2 models in DeepSpeed-Chat, enabling various optimizations and features including the Hybrid Engine, the ZeRO family of optimizations, Low-Rank Adaptation (LoRA) support, and full integration into the three-stage DeepSpeed-Chat RLHF pipeline. By leveraging the Hybrid Engine, we speed up the experience generation phase for Llama-2-7B and Llama-2-13B models by up to 7.1x.
New System Features for Improved Efficiency and Accessibility
Mixed Precision ZeRO++ (MixZ++). This is an extended set of optimization strategies built upon ZeRO++, tailored to reduce memory usage and improve training/inference efficiency for RLHF training with LoRA. Like its ZeRO and ZeRO++ siblings, MixZ++ partitions model parameters across GPUs to reduce footprint and gathers them with quantized communication only when needed. Our evaluation indicates MixZ++ increases training throughput by up to 3.3x for the Llama-2-70B model running on 128 V100 GPUs.
ZeRO-Offload. It is an optimization that offloads optimizer memory and computation from the GPU to the host CPU, enabling larger models to be trained with fewer GPU resources. After training stability fixes and testing, we have enabled this feature across all three stages of the DeepSpeed-Chat RLHF training pipeline. ZeRO-Offload reduces the minimum number of GPUs required to train large models by up to 16x.
Stability and Software Enhancements
DeepSpeed-Chat contains a rich set of features for training across many different platforms and scenarios. Composing these features in a systematic way and ensuring both system stability and decent training convergence is critical for the usability of the framework. Thus, in addition to new features in DeepSpeed-Chat, many system stability and training convergence issues have been fixed both in DeepSpeed-Chat (client code) and DeepSpeed (runtime). These improvements have been thoroughly tested using the OPT model family for end-to-end training. Furthermore, end-to-end testing, characterization scripts, and several instrumentation features like TensorBoard support are now also available. To try out these latest features and software improvements, please use DeepSpeed release v0.10.2 and the latest DeepSpeed-Chat in DeepSpeedExamples.
Finally, to ensure the long-term health of the DeepSpeed-Chat training framework, PyTests were added for testing Step 3 of the RLHF training pipeline and are run on a nightly basis through a newly developed GitHub Actions workflow.
We now dive into the details of our new features, training stability, and software improvements.
The DeepSpeed-Chat training framework now provides system support for the Llama and Llama-2 models across all three stages of training. To support this, we encountered a spectrum of issues, spanning from minor runtime errors to intricate performance-related challenges. In particular, the Llama model architecture, which deviates from the standard Transformer block, was incompatible with DeepSpeed's inference kernels and the DeepSpeed container policy used by the Hybrid Engine. Addressing these hurdles necessitated extensive modifications across our DeepSpeed-Chat pipeline and the DeepSpeed runtime, including code to support the ZeRO family of optimizations and their interaction with optimized inference kernels in the Hybrid Engine. We have resolved these challenges to ensure that DeepSpeed-Chat can support Llama and Llama-2 and provide our users with the best possible experience. The details can be seen in several PRs that have been merged in our codebases.
The following key optimizations in DeepSpeed are now fully integrated for Llama and Llama-2 models:
Users looking to try the new Llama and Llama-2 model support can get started by using the newly added Llama scripts.

| Step Number | Scripts |
| --- | --- |
| 1 | Llama-2 Step 1 Scripts |
| 2 | Llama-2 Step 2 Scripts |
| 3 | Llama-2 Step 3 Scripts |
Note: While all the system aspects of Llama and Llama-2 support have been extensively tested, there are no guarantees about training convergence, and achieving convergence may require hyper-parameter tuning.
We highlight the performance benefits of the Hybrid Engine for Llama-2 models on NVIDIA A100 and V100 GPUs in this section. Improved performance for larger models like Llama-2-70B and reduced resource requirements via ZeRO-Offload are discussed in the next section.
Using A100 GPUs, we achieve 7.1x faster generation for Llama-2-7B and 5.4x faster generation for Llama-2-13B with DeepSpeed-Chat Hybrid Engine compared to DeepSpeed-Chat without Hybrid Engine (baseline) as shown in Figure 1.
Using V100 GPUs, we achieve 4x faster generation for Llama-2-7B and 2.1x faster generation for Llama-2-13B with DeepSpeed-Chat Hybrid Engine compared to DeepSpeed-Chat without Hybrid Engine (baseline) as shown in Figure 2.
We now dive into the details of two new features we are introducing today: 1) Mixed Precision ZeRO++ (MixZ++) and 2) ZeRO-Offload. Both these features offer unique benefits for DeepSpeed-Chat users. MixZ++ provides up to 3.3x better throughput for LoRA-enabled training and ZeRO-Offload reduces the minimum number of GPUs required to train by up to 16x.
Mixed Precision ZeRO++ (MixZ++) is an extended set of optimization strategies built upon ZeRO and ZeRO++ tailored to reduce memory usage and improve training/inference efficiency for RLHF training with LoRA.
Similar to ZeRO, MixZ++ partitions model parameters across GPUs to reduce footprint and gathers them only when needed. In addition, similar to ZeRO++, MixZ++ allows for hierarchical partitioning and quantized communication. The hierarchical partitioning allows all the parameters to be stored within a node when possible so that the communication happens within a node, where communication bandwidth is significantly higher than communicating across nodes. The communication overhead is further reduced by quantizing the weights before gathering them.
Finally, unlike ZeRO++ where parameters are always stored in fp16/bf16, and quantized/dequantized before and after communication, MixZ++ can persistently store the frozen weights in Low-Rank Adaptation (LoRA) in lower-precision, significantly reducing the communication overhead, eliminating quantization overhead, and supporting larger batch sizes that enable better efficiency.
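As a rough illustration of the quantized-communication idea, the sketch below (a NumPy simplification written for this post, not the MixZ++ implementation; the function names are ours) quantizes a weight shard to int8 with per-row scales before it would be gathered, then dequantizes it on the receiving side:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-row quantization: one fp32 scale per row,
    # values mapped into [-127, 127] and stored as int8.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # Reconstruct an fp32 approximation after the (simulated) gather.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)      # int8 payload: 4x smaller than fp32
w_hat = dequantize_int8(q, scale)
```

The communicated payload shrinks by 4x versus fp32 (2x versus fp16) at the cost of a bounded per-element rounding error of at most half a quantization step.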
A comprehensive exploration of technical details can be accessed through our ZeRO++ blog, MixZ++ tutorial, and paper.
State-of-the-art approaches like QLoRA focus on combining multiple techniques, such as quantization of LoRA weights, new datatypes like NF4, and memory-management/offload techniques like paged optimizers, to enable finetuning of large models on a single GPU. MixZ++ is our approach to enabling large model training powered by quantization, designed to scale to a large number of GPUs while remaining simple and compatible with existing technologies like ZeRO-Offload and the DeepSpeed Hybrid Engine.
MixZ++ has the following highlights:
To assess the effectiveness of MixZ++ for LoRA-enabled training, we carried out a series of RLHF training experiments (Step 3) using the Llama-2-70B model. These experiments were conducted on hardware configurations featuring 64 and 128 V100 GPUs. A visual representation of the experiment results is shown in the following figure:
Specifically, our results showcase a 2x increase in training throughput when utilizing 64 GPUs with MixZ++, compared to the ZeRO-3 baseline. Furthermore, when scaling up to 128 GPUs, the speedup effect becomes even more pronounced, with a substantial 3.3x improvement in training throughput. These outcomes underscore the potential of MixZ++ as a powerful tool for improving training efficiency in large-scale GPU settings.
To try this feature, please refer to the MixZ++ tutorial.
ZeRO-Offload powers unprecedented model sizes by leveraging the full memory capacity of a system, concurrently exploiting all heterogeneous memory. Modern GPU clusters have 2-3x more CPU memory than GPU memory. ZeRO-Offload capitalizes on this disparity and offloads optimizer memory and computation from the GPU to the host CPU, enabling larger models to be trained with fewer GPU resources without being bottlenecked by the CPU's lower bandwidth. ZeRO-Offload allows training of large models on up to 16x fewer GPUs as we can see in Figure 4.
ZeRO-Offload was disabled in the initial release of DeepSpeed-Chat due to training instability observed when it was used with Hybrid Engine and LoRA. After improvements to Hybrid Engine and LoRA, as well as extensive testing of all feature configurations for ZeRO Stage 2 and ZeRO Stage 3, this feature can now be enabled across all three steps of the DeepSpeed-Chat training framework. Please note that configuring ZeRO-Offload with ZeRO Stage 2 and Hybrid Engine with LoRA disabled is currently unsupported due to observed training instability.
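As a concrete sketch, ZeRO-Offload is enabled through the DeepSpeed configuration; the fragment below (batch size and precision settings are illustrative) offloads optimizer state and computation to the host CPU under ZeRO Stage 2:

```python
# Illustrative DeepSpeed config fragment enabling optimizer offload to CPU.
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",      # keep optimizer state in host memory
            "pin_memory": True,   # pinned host memory for faster transfers
        },
    },
}
```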
*DS commit: [f036f00c](https://github.com/microsoft/DeepSpeed/tree/f036f00c37), DSE commit: [81a8521f](https://github.com/microsoft/DeepSpeedExamples/tree/81a8521f05).*
A wide range of issues have been addressed in the DeepSpeed runtime and the DeepSpeed-Chat pipeline. These fixes enable advanced features such as Hybrid Engine, LoRA, and ZeRO-Offload to run across all training steps of the DeepSpeed-Chat pipeline and improve training stability and convergence.
*DS commit: [f036f00c](https://github.com/microsoft/DeepSpeed/tree/f036f00c37), DSE commit: [81a8521f](https://github.com/microsoft/DeepSpeedExamples/tree/81a8521f05).*
Figure 6 above shows the training convergence across all supported DeepSpeed-Chat configurations. This data was collected using 16 V100 NVIDIA GPUs, the AdamG012/chat-opt-1.3b-sft-deepspeed OPT model as the actor, the AdamG012/chat-opt-350m-reward-deepspeed OPT model as the critic, and the following DeepSpeed and DeepSpeedExamples repository commits: DS commit: f036f00c, DSE commit: 81a8521f.
We now dive into the details of all the fixes across different areas.
In this section we discuss the functionality and training stability fixes in the DeepSpeed-Chat pipeline.
Training Stability:
To improve the training stability in Step 3, several different areas of training were tuned and changed. To start, the Kullback-Leibler (KL) divergence used in the Proximal Policy Optimization (PPO) trainer was slightly tuned to reduce divergence between the new and reference policies and improve the reward score. Next, the sequence generation function in the PPO trainer (`_generate_sequence()`) removed the specification of a `min_length` in the Actor model's `generate()` call, which means generated sequences won't be artificially lengthened, allowing for the possibility of sequence generation collapse, i.e., when training convergence is extremely poor. A minor off-by-one error was also fixed in the PPO trainer's reward computation function (`compute_rewards()`). Finally, the PPO trainer's RLHF training function was updated to zero out the reward and value after the end of a conversation to prevent incorrect `advantages` and `returns`.
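The zeroing of reward and value after the end of a conversation can be sketched as follows (a simplified NumPy illustration written for this post, not the DeepSpeed-Chat code; `ends[i]` is assumed to hold the index one past the last real token of sequence `i`):

```python
import numpy as np

def mask_after_conversation_end(rewards, values, ends):
    # Zero per-token rewards and value estimates past each sequence's end
    # so that advantage/return computation ignores padding positions.
    rewards, values = rewards.copy(), values.copy()
    for i, end in enumerate(ends):
        rewards[i, end:] = 0.0
        values[i, end:] = 0.0
    return rewards, values

rewards = np.ones((2, 6), dtype=np.float32)
values = np.full((2, 6), 0.5, dtype=np.float32)
masked_r, masked_v = mask_after_conversation_end(rewards, values, ends=[4, 6])
```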
PR #633 - DS Chat Step 3 - Add separate Lora Adam optimizer group
The LoRA feature is supported across all three training steps of the DeepSpeed-Chat framework. Prior to this stability effort, there was no distinction between the overall learning rate and the LoRA learning rate, i.e., the LoRA learning rate was set to whatever the overall learning rate was. This led to instability in training convergence, as can be seen in Figure 7 below, which shows the reward score across training steps for various Step 3 configurations:
<img src="../assets/images/sweep_before_lora_fix.png" width="650">
*Figure 7: Before the fix, the sweep across all ZeRO-2 cases without a separate LoRA learning rate shows training instability when LoRA is used.*
To address this training convergence issue, when creating the optimizer grouped parameters, the LoRA `lora_right_weight` and `lora_left_weight` parameters were explicitly separated out and given their own LoRA-specific learning rate. After this change, a dramatic improvement in stability was observed, as shown in the figure below:
<img src="../assets/images/sweep_after_lora_fix.png" width="650">
*Figure 8: After creating a separate LoRA learning rate, the sweep across all ZeRO-2 cases shows proper convergence.*
The next fix details the addition of separate LoRA learning rate arguments.
A separate LoRA learning rate argument can now be provided in each of the three training steps, with Step 3 having individual LoRA learning rates for the Actor and Critic models.
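The grouping can be sketched as below (a simplified stand-in for the actual DeepSpeed-Chat helper; it works on any `named_parameters()`-style iterable of `(name, param)` pairs):

```python
def create_optimizer_grouped_parameters(named_parameters, lr, lora_lr):
    # Put the LoRA low-rank factors in their own parameter group so they
    # get a dedicated learning rate; everything else keeps the base rate.
    base_params, lora_params = [], []
    for name, param in named_parameters:
        if not getattr(param, "requires_grad", True):
            continue
        if "lora_right_weight" in name or "lora_left_weight" in name:
            lora_params.append(param)
        else:
            base_params.append(param)
    return [
        {"params": base_params, "lr": lr},
        {"params": lora_params, "lr": lora_lr},
    ]
```

The returned groups can be passed directly to an optimizer constructor (e.g. `torch.optim.AdamW(groups)`), which applies each group's own learning rate.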
Bug Fixes:
PR #636 - DS Chat Step 3 - Fix Zero Stage 3

Step 3 training with ZeRO Stage 3 would hang when `world_size > 1`. When observing the state of each rank, one rank would still be in the sequence generation phase (`self._generate_sequence()`), while the other rank had already progressed to the `self.actor_model()` call. This ZeRO Stage 3 desynchronization, due to misaligned token generation between the GPUs, can normally be automatically detected and accounted for in the HuggingFace Transformers library via `synced_gpus`. However, due to the nature of the DeepSpeed-Chat pipeline and the lifetime of the corresponding model configuration objects, this automatic detection code was not triggered. To resolve this, when invoking the `generate()` function, the `synced_gpus` argument is now explicitly passed and set to `True` when ZeRO Stage 3 is being used.

A related fix allows Step 3 training to run with the combination of gradient checkpointing and LoRA-only parameter optimization, a previously unsupported training case. With the addition of the `enable_input_require_grads` model utility function in the HuggingFace Transformers library, which enables gradients for the input embeddings, gradient checkpointing and optimization of only the LoRA parameters is made possible.
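The `synced_gpus` fix can be sketched as follows (the wrapper function is ours and hypothetical; `synced_gpus` itself is a real argument of the HuggingFace `generate()` API):

```python
def generate_sequence(actor_model, inputs, max_length, zero_stage):
    # Under ZeRO Stage 3 every rank must step generate() in lockstep, even
    # after its own sequence has finished; synced_gpus enforces that and
    # prevents the cross-rank hang described above.
    return actor_model.generate(
        **inputs,
        max_length=max_length,
        synced_gpus=(zero_stage == 3),
    )
```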
An external contributor helped in resolving an argument parsing issue.
An external contributor fixed the passing of an uninitialized parameter that was hardcoded earlier.
In this section we discuss several fixes in the Hybrid Engine.
PR #3563 - Fix LoRA Fuse/Unfuse in Hybrid Engine
PR #3883 - Extend HE-Lora test with Z3 support + Fix/add guard in HE for Z3
In this section we discuss several fixes in support of the ZeRO Stage 3 feature.
PR #3819 - Fix race condition in GatheredParameters

A race condition in the `GatheredParameters` context, which resulted in various `'status': 'INFLIGHT'` issues, was fixed by removing duplicate input parameters that were being passed from the Hybrid Engine.

PR #3884 - Separate ZeRO3 InflightParamRegistry for train and eval

The `InflightParamRegistry` was updated to use a separate `InflightParamRegistry` for training and evaluation, fixing an issue where leftover parameters in flight were causing inflight parameter errors. These fixes, along with related fixes in the Hybrid Engine, enabled the use of the ZeRO-Offload feature in the DeepSpeed-Chat training pipeline.

PR #3928 - Remove the param.ds_tensor from print
To improve the characterization, ease of debug, and maintainability of the DeepSpeed-Chat framework, several areas of software improvements have been completed. Characterization scripts were added to enable systematic composition of features, instrumentation was added to improve insight into the behavior of training, and a testing CI workflow was added to improve the maintainability of the DeepSpeed-Chat training framework.
The DeepSpeed-Chat training framework provides a rich set of features (Hybrid Engine, ZeRO, LoRA, etc.) that can be composed in many different combinations, depending on the scenario. The interactions between the features are often complex and composing them in a systematic way for characterization is useful for understanding their behavior. To support such use cases, characterization scripts have been added to run sweeps of Steps 1, 2, and 3 training for various combinations of features. The scripts default to OPT but can be modified to run with Llama. Please see the READMEs in the following folders for more details:
For example, the Step 3 characterization script sweeps across various training features:

| Feature | Values |
| --- | --- |
| ZeRO Stage | 2, 3 |
| Hybrid Engine | True, False |
| ZeRO-Offload | True, False |
| LoRA | True, False |
It can be run as follows:
```shell
DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning$ bash training_scripts/opt/single_node/sweep/run_step3_sweep.sh
```
The training log for each combination of features will be stored in a folder whose name is formatted as `z${z}_he_${he}_offload_${offload}_lora_${lora}`.
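For reference, the sweep enumerates the same matrix as the feature table above; the sketch below (Python written for this post; the actual sweep is a bash script, and the exact True/False casing of the values is an assumption) reproduces the folder-name convention:

```python
from itertools import product

def log_folder(z, he, offload, lora):
    # Mirrors the z${z}_he_${he}_offload_${offload}_lora_${lora} pattern.
    return f"z{z}_he_{he}_offload_{offload}_lora_{lora}"

runs = [log_folder(z, he, off, lora)
        for z, he, off, lora in product((2, 3), ("true", "false"),
                                        ("true", "false"), ("true", "false"))]
# 16 feature combinations in total
```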
Related PRs:
To gain better insight into DeepSpeed-Chat training, new instrumentation features were added across all three steps of DeepSpeed-Chat and can be enabled via arguments to each step's `main.py`.
| Argument | Description | Step(s) |
| --- | --- | --- |
| `--print_loss` | Print loss during each step | 1 |
| `--enable_tensorboard` | Enable TensorBoard logging at the model Runtime Engine level | 1,2,3 |
| | Enable TensorBoard logging at the Training Pipeline level | 3 |
| `--tensorboard_path` | Path to write TensorBoard log | 1,2,3 |
| `--print_answers` | Print actor model prompt and answers during training across all ranks | 3 |
TensorBoard logging can be enabled in each of the three training steps, with some slight nuances in Step 3. To start, for each training step, the `enable_tensorboard` argument can be used to enable a TensorBoard monitor at the Runtime Engine level (see documentation) and is reflected in the corresponding model training configuration:

```python
"tensorboard": {
    "enabled": enable_tensorboard,
    "output_path": f"{tb_path}/ds_tensorboard_logs/",
    "job_name": f"{tb_name}_tensorboard"
}
```
In Step 3, TensorBoard logging can additionally be enabled at the Training Pipeline level, which captures metrics such as `reward`, `actor_loss`, `actor_loss_sum`, `critic_loss`, and `critic_loss_sum`.

As part of the DeepSpeed team's commitment to maintaining the DeepSpeed-Chat training framework, continuous integration PyTest testing has been added for Step 3 RLHF training in a new GitHub Actions workflow.
The workflow is run on a nightly basis across a 16-case test matrix (see table below), and uses the facebook/opt-125m model for both the actor and critic.
| Parameter | Values |
| --- | --- |
| ZeRO Stage | 2, 3 |
| Hybrid Engine | True, False |
| ZeRO-Offload | True, False |
| LoRA | True, False |
Each configuration (16 total) runs through a limited number of Step 3 non-overflow training steps (i.e. steps where neither actor nor critic overflow) and saves the actor/critic models. Assertions are used to check if the training pipeline executed correctly and if the actor and critic models were saved properly.
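The shape of such a test can be sketched as follows (a simplification written for this post with hypothetical checkpoint paths, not the actual workflow code):

```python
import itertools
import os

def step3_test_matrix():
    # 2 x 2 x 2 x 2 = 16 configurations, matching the table above.
    return list(itertools.product((2, 3), (True, False),
                                  (True, False), (True, False)))

def check_saved_models(output_dir):
    # After the limited training run, assert both models were saved.
    # The "pytorch_model.bin" filename is an assumption for illustration.
    for role in ("actor", "critic"):
        ckpt = os.path.join(output_dir, role, "pytorch_model.bin")
        assert os.path.isfile(ckpt), f"missing checkpoint: {ckpt}"
```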
We are very excited to share this DeepSpeed-Chat feature and stability release.
To get started, please visit our GitHub page for DeepSpeed-Chat: GitHub Landing Page
We will continue to improve DeepSpeed-Chat with your feedback and support. Our roadmap shows currently supported features as well as those planned for the future.
DeepSpeed-Chat is part of the bigger DeepSpeed ecosystem, comprising a multitude of deep learning systems and modeling technologies. To learn more,
DeepSpeed welcomes your contributions! We encourage you to report issues, contribute PRs, and join discussions on the DeepSpeed GitHub page. Please see our contributing guide for more details. We are open to collaborations with universities, research labs, and companies, such as those working together on deep learning research, applying DeepSpeed to empower real-world AI models and applications, and so on. For such requests (and other requests unsuitable for GitHub), please email deepspeed-info@microsoft.com directly.