---
title: "Mixture of Experts"
---

DeepSpeed v0.5 introduces new support for training Mixture of Experts (MoE) models. MoE models are an emerging class of sparsely activated models that have sublinear compute costs with respect to their parameters. For example, the [Switch Transformer](https://arxiv.org/abs/2101.03961) consists of over 1.6 trillion parameters, while the compute required to train it is approximately equal to that of a 10 billion-parameter dense model. This increase in model size offers tremendous accuracy gains for a constant compute budget.

For more details on results and further discussion, please see our press release: [DeepSpeed powers 8x larger MoE model training with high performance]({{ site.press_release_v5 }}).

## Getting started with a simple MoE example

**Note:** DeepSpeed MoE requires PyTorch 1.8 or above.
{: .notice--info}

As a simple starting point we will show how to apply DeepSpeed MoE to a cifar10 example. Please refer to our [cifar10 example](https://github.com/microsoft/DeepSpeedExamples/tree/master/cifar) going forward. If you are adding MoE to an existing model, you can use the snippets below to guide you.

### Expert groups initialization

DeepSpeed MoE supports five different forms of parallelism, and it exploits both GPU and CPU memory. Its flexible design enables users to mix different types of prevalent parallelism techniques, as shown in the table below.

| Short Name    | Flexible Parallelism Configurations | Benefit                                                                      |
| ------------- | ----------------------------------- | ---------------------------------------------------------------------------- |
| E             | Expert                              | Scales the model size by increasing the number of experts                    |
| E + D         | Expert + Data                       | Accelerates training throughput by scaling to multiple data parallel groups  |
| E + Z         | Expert + ZeRO-powered data          | Partitions the nonexpert parameters to support larger base models            |
| E + D + M     | Expert + Data + Model               | Supports massive hidden sizes and even larger base models than E+Z           |
| E + D + Z     | Expert + Data + ZeRO-powered data   | Supports massive hidden sizes and even larger base models than E+Z           |
| E + Z-Off + M | Expert + ZeRO-Offload + Model       | Leverages both GPU and CPU memory for large MoE models on limited # of GPUs  |

To support different forms of parallelism, we create a notion of DeepSpeed process groups that resides in ```deepspeed.utils.groups.py```. For most cases, the model training code needs to initialize these groups by calling

```python
deepspeed.utils.groups.initialize(ep_size="desired expert-parallel world size")
```

The GPUs (or ranks) participating in an expert-parallel group will distribute the total number of experts specified by the model training code argument `num_experts`.

### MoE layer API

The `hidden_size` is the input dimension of a particular layer, and the output dimension is the same. This could lead to some changes to your model definition, especially for vision/convolutional models, because the input/output dimensions don't always match. E.g., in the CIFAR-10 example, we modify the third fully connected layer to add the MoE layer. Since the MoE layer's output dimension now equals its input dimension, we need to add an additional fully connected layer whose input dimension is equal to the output dimension of the MoE layer.

Original model config

```python
self.fc3 = nn.Linear(84, 10)
```

Updated with MoE Layers

```python
self.fc3 = nn.Linear(84, 84)
self.fc3 = deepspeed.moe.layer.MoE(hidden_size=84, expert=self.fc3, num_experts=args.num_experts, ...)
self.fc4 = nn.Linear(84, 10)
```
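The MoE layer returns a tuple rather than a single tensor (the layer output plus auxiliary values such as the gating loss), so the model's `forward` method needs a matching change. Below is a minimal sketch of the updated forward pass, assuming the standard convolutional and pooling layers from the cifar10 example; only the first element of the MoE layer's return value is kept.

```python
import torch.nn.functional as F

def forward(self, x):
    x = self.pool(F.relu(self.conv1(x)))
    x = self.pool(F.relu(self.conv2(x)))
    x = x.view(-1, 16 * 5 * 5)
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)[0]  # the MoE layer returns a tuple; element 0 is the layer output
    x = self.fc4(x)     # the extra linear layer maps back to the 10 CIFAR-10 classes
    return x
```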
### An Example Scenario

Given a total number of GPUs in our world size and a subset of those GPUs in our expert-parallel world, as follows,

```python
WORLD_SIZE = 4
EP_WORLD_SIZE = 2
EXPERTS = 8
```

the user code needs to initialize the groups as follows.

```python
groups.initialize(ep_size=EP_WORLD_SIZE)
```

After that, the model code needs to use the deepspeed.moe.layer.MoE API as follows.

```python
self.experts = deepspeed.moe.layer.MoE(hidden_size=input_dim, expert=ExpertModule(), num_experts=EXPERTS)
```

With the above two lines of code, the DeepSpeed runtime will be set to train an MoE model with a total of 8 experts on 4 GPUs, with 4 experts per GPU. We call this the E + D mode, as described earlier in the table.

For more advanced use cases of the groups API, including interoperability with a Megatron-style mpu object, watch this space!

```python
import torch
import deepspeed
import deepspeed.utils.groups as groups
from deepspeed.moe.layer import MoE

WORLD_SIZE = 4
EP_WORLD_SIZE = 2
EXPERTS = 8

groups.initialize(ep_size=EP_WORLD_SIZE)

fc3 = torch.nn.Linear(84, 84)
fc3 = MoE(hidden_size=84, expert=fc3, num_experts=EXPERTS, k=1)
fc4 = torch.nn.Linear(84, 10)
```

For a runnable end-to-end example, please see the [cifar10 example](https://github.com/microsoft/DeepSpeedExamples/tree/master/cifar).

### Combining ZeRO-Offload and DeepSpeed MoE for very large models

To use MoE layers in DeepSpeed, we rely on two parameter groups that are passed to an optimizer. A concrete example of creating such groups is available in the [cifar10 example](https://github.com/microsoft/DeepSpeedExamples/tree/master/cifar). The relevant function that creates these param groups is as follows.

```python
def create_moe_param_groups(model):
    from deepspeed.moe.utils import is_moe_param

    # Non-expert parameters: handled by the regular (ZeRO-partitioned) optimizer path.
    params_with_weight_decay = {'params': [], 'name': 'weight_decay_params'}
    # Expert parameters: marked with 'moe': True so DeepSpeed treats them separately.
    moe_params_with_weight_decay = {
        'params': [],
        'moe': True,
        'name': 'weight_decay_moe_params'
    }

    for module_ in model.modules():
        moe_params_with_weight_decay['params'].extend([
            p for n, p in list(module_._parameters.items())
            if p is not None and is_moe_param(p)
        ])
        params_with_weight_decay['params'].extend([
            p for n, p in list(module_._parameters.items())
            if p is not None and not is_moe_param(p)
        ])

    return params_with_weight_decay, moe_params_with_weight_decay
```

The above param groups can then be fed to the ZeRO stage-2 optimizer as follows.

```python
net = Net()
parameters = create_moe_param_groups(net)

model_engine, optimizer, trainloader, __ = deepspeed.initialize(
    args=args, model=net, model_parameters=parameters, training_data=trainset)
```

We are working on automating this functionality in the DeepSpeed ZeRO optimizer so the model training code can be simplified further.

To run the [cifar10 example](https://github.com/microsoft/DeepSpeedExamples/tree/master/cifar) with ZeRO-Offload (stage 2) and MoE, set the following flags in ds_config.

```json
  "zero_optimization": {
      "stage": 2,
      "allgather_partitions": true,
      "reduce_scatter": true,
      "allgather_bucket_size": 50000000,
      "reduce_bucket_size": 50000000,
      "overlap_comm": true,
      "contiguous_gradients": true,
      "cpu_offload": true
  }
```
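With this config in place, the training loop itself is the same as for any other DeepSpeed model; the engine returned by `deepspeed.initialize` handles both the expert and non-expert gradients. Below is a minimal sketch, assuming the `model_engine` and `trainloader` created above and a standard cross-entropy loss (the epoch count is illustrative).

```python
import torch

criterion = torch.nn.CrossEntropyLoss()

for epoch in range(2):  # number of epochs is illustrative
    for inputs, labels in trainloader:
        # If fp16 is enabled in ds_config, inputs may need to be cast to half precision first.
        inputs = inputs.to(model_engine.local_rank)
        labels = labels.to(model_engine.local_rank)

        outputs = model_engine(inputs)      # forward pass through the MoE model
        loss = criterion(outputs, labels)

        model_engine.backward(loss)         # engine-managed backward pass
        model_engine.step()                 # optimizer step (ZeRO-Offload aware)
```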
An additional optimization to save memory when training extremely large models on a limited number of GPUs has also been introduced. Please enable it by adding the following flag to the fp16 section of ds_config.

```json
  "fp16": {
      "enabled": true,
      "fp16_master_weights_and_grads": true
  }
```

## Random Token Selection

We have devised a new technique called “Random Token Selection” that greatly improves convergence. Random Token Selection addresses the problem of biased token selection in MoE model training. Our upcoming paper describes this technique and its results in detail. This feature is already part of the DeepSpeed runtime and is enabled by default, so users can take advantage of it without any config flags or command-line arguments.

## Advanced MoE usage

Watch this space! We plan to add more interesting and detailed examples of using DeepSpeed MoE in the coming weeks.