Authors: Devang Patel (@devang_patel), Wei Zuo (@weizuo), Yu Shi (@yu3shi2), Kenichi Kumatani (@kekumata), Mengchen Liu (@MengchenLiu), Robert Gmyr (@rogmyr) and Kshama Pawar (@kshama-msft)
The Mixture of Experts technique, frequently referred to as MoE, is gaining traction in the NLP community. It is primarily used to scale Transformer models without incurring high computational resource costs. In this post, we discuss how ORT MoE, an MoE implementation from the ONNX Runtime team, is used to scale networks and improve quality in Speech and Vision models in addition to NLP models. Using this technique, an Automatic Speech Recognition model improves quality by 16.3% and language models improve sample efficiency by about 10x. We briefly describe the MoE architecture first. ORT MoE features and implementation details are discussed in the second half of this post.
The Mixture of Experts technique typically uses subcomponents such as Experts and Gating Functions in a Transformer block, as shown in Figure 1.
Figure 1: MoE components
Experts are at the heart of the Mixture of Experts technique. Usually a standard feedforward neural network sublayer is used as an expert, but that is not required. Fundamentally, the Mixture of Experts technique deploys many experts to increase the model parameter count and thereby achieve better model quality. However, at runtime only a small subset of these experts is used to process any given input token. This allows data scientists to keep the FLOP budget under control while increasing the model size. A gating function selects this small subset of experts at runtime; Top1 and Top2 algorithms are popular choices. The MixtureOfExperts sublayer consists of the gating function, the experts and the communication collectives needed to synchronize experts across multiple shards. This sublayer is the core of an MoE implementation. Finally, a Transformer layer is constructed from the MixtureOfExperts sublayer and other components such as a multi-headed attention layer.
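To make the gating idea concrete, here is a minimal single-process sketch in plain Python. It is not the ORT MoE implementation: the names `top_k_gate` and `moe_forward`, the toy experts, and the gate weights are all hypothetical, and the sketch assumes the gate is a linear scoring layer followed by a softmax, with expert outputs weighted by the raw gate probability.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(gate_logits, k=1):
    """Return [(expert_index, probability), ...] for the k highest-scoring experts."""
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

def moe_forward(token, gate_weights, experts, k=1):
    """Route one token through its top-k experts only; other experts are never called."""
    # Assumed gate: one linear score per expert.
    gate_logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    output = [0.0] * len(token)
    used = []
    for index, weight in top_k_gate(gate_logits, k):
        used.append(index)
        expert_out = experts[index](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, used

# Four toy "experts": each simply scales the token by a different factor.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
# Hypothetical gate weights, one row of scores per expert.
gate_weights = [[0.1, 0.0], [0.0, 0.2], [1.0, 1.0], [-1.0, 0.5]]
token = [1.0, 2.0]

output, used = moe_forward(token, gate_weights, experts, k=1)
print(used)  # [2]
```

With `k=1` only one of the four experts runs for this token, which is why the parameter count can grow with the number of experts while the per-token compute stays roughly flat.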
We discuss scenarios where MoE improves scalability and model quality for large scale Transformer based models within Microsoft below.
We are able to increase the quality of the Automatic Speech Recognition model by 16.3% by applying the MoE technique. We used the ORT MoE implementation to introduce expert layers into Transformer networks for multi-lingual speech recognition tasks. To train the multi-lingual network, we used approximately 75,000 hours of data from 10 languages. The baseline network consists of a Transformer encoder with 18 blocks and a decoder with 6 blocks, where each block contains an 8-head attention network followed by a feed-forward network (FFN) with pre-normalization. MoE is applied to every second FFN layer; see here for details.
The targeted scenario was to provide a teacher model to improve the accuracy of student models through semi-supervised learning such as knowledge distillation.
The table below shows the word error rates (WERs) obtained with the baseline and MoE networks with 24, 72 and 120 experts. The WER is shown for each language as well as overall, where the overall WER is an average weighted by the number of words in each language. The MoE network improves recognition accuracy, and the results indicate that the MoE network is a promising approach for the teacher model. Because only a subset of experts is active for each token, the MoE architecture also keeps the computational cost of training and inference below that of a dense network with the same number of parameters.
Figure 2a: Word Error Rates for the baseline and MoE networks against #experts
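The overall WER described above is a word-count-weighted average of the per-language WERs. The following short sketch shows that computation; the function name `overall_wer` and the two example languages with their numbers are purely hypothetical and are not taken from the table.

```python
def overall_wer(per_language):
    """per_language maps a language name to (wer_percent, word_count)."""
    total_words = sum(words for _, words in per_language.values())
    weighted = sum(wer * words for wer, words in per_language.values())
    return weighted / total_words

# Hypothetical numbers, for illustration only.
example = {"lang_a": (10.0, 3000), "lang_b": (20.0, 1000)}
print(overall_wer(example))  # 12.5
```

Languages with more evaluation words pull the overall figure toward their own WER, which is why the overall value here sits closer to 10 than to 20.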
We have also investigated whether the MoE teacher model can be compressed without loss of recognition accuracy. For this study, we developed Teacher-Student (TS) training with Knowledge Distillation (KD). To make the student model closer to the teacher model, we initialize the student model with the teacher’s weights. To initialize the student’s FFN, we select one of the teacher’s experts based on how frequently each expert is used during inference on a validation dataset. The student model is updated with the KD loss and then fine-tuned with supervised learning. The details are described here. The table shows the WERs obtained with two teacher models with 24 and 72 experts, as well as the student model distilled from each teacher model. As a baseline, the table also shows the WER of the student model trained from scratch without a teacher model. The table shows that TS training provides better accuracy than training from scratch, whether the teacher model is small or large. It also shows that when the teacher model is large, the student model requires more capacity to reach accuracy comparable to the teacher model.
Figure 2b: Word Error Rates for the MoE teacher and student dense networks
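The student initialization step can be sketched as follows. This is an illustration under an assumption: the post only says the expert is chosen "based on the frequency used", and the sketch assumes this means picking the most frequently selected expert. The names `most_used_expert`, `init_student_ffn`, and the placeholder weight lists are hypothetical.

```python
from collections import Counter

def most_used_expert(routing_log):
    """routing_log holds one expert index per token routed on the validation set."""
    counts = Counter(routing_log)
    expert_index, _ = counts.most_common(1)[0]
    return expert_index

def init_student_ffn(teacher_experts, routing_log):
    """Copy the weights of the most-used teacher expert into a dense student FFN."""
    chosen = most_used_expert(routing_log)
    student_ffn = {name: list(values) for name, values in teacher_experts[chosen].items()}
    return chosen, student_ffn

# Placeholder weights for three teacher experts in one MoE layer.
teacher_experts = [
    {"w1": [0.1, 0.2], "w2": [0.3]},
    {"w1": [1.1, 1.2], "w2": [1.3]},
    {"w1": [2.1, 2.2], "w2": [2.3]},
]
routing_log = [0, 2, 2, 1, 2, 0]  # gate decisions recorded during validation

chosen, student_ffn = init_student_ffn(teacher_experts, routing_log)
print(chosen)  # 2
```

The copied weights are only a starting point; per the text, the student is then trained with the KD loss and fine-tuned with supervised learning.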
We evaluated the MoE technique using the ORT MoE implementation to research variants of language models for machine translation. Two sets of multilingual data are used, consisting of 20 and 50 languages respectively. Both models have the Transformer encoder-decoder architecture. We use a base-sized model and multi-task training for 20 languages, and a large-sized model and single-task (machine translation) training for 50 languages. In the MoE models, MoE layers, each containing one top-1 gate and multiple feedforward experts, are inserted in every other block of the Transformer stack. All non-MoE hyperparameters in the MoE models are kept the same as in the corresponding dense model.
The baseline dense model for 20 languages has 243M parameters. Expert scaling is investigated by adding 32, 64, 128, and 256 experts separately. The resulting model sizes are 1.6B, 2.9B, 5.6B, and 11.1B parameters. Experimental results show that the more experts a model has, the higher its BLEU score and hence the better its translation quality.
Figure 3: BLEU scores with expert scaling for 20 languages
In the 50-language experiment, the baseline dense model has 761M parameters. A 64-expert MoE model with 10.3B parameters is trained and achieves a similar BLEU score in 1/10 of the training steps, indicating about 10x sample efficiency compared to the non-MoE baseline.
Figure 4: Sample efficiency of 64 experts for 50 languages
A foundation model in Vision is trained on broad data at scale and can be adapted (e.g. fine-tuned) to a wide range of downstream tasks. Foundation models have become promising due to their impressive performance and generalization capabilities. They are being rapidly integrated and deployed into real-world AI systems for tasks such as image retrieval, image classification, object detection, visual question answering and action recognition. We find that ORT MoE provides an efficient approach to scaling up a vision foundation model so that it has the capacity to effectively digest broader data.
We evaluate the model performance on a set of zero-shot classification tasks. The model is a CLIP Vision model (Learning Transferable Visual Models From Natural Language Supervision, arXiv:2103.00020) that learns a matching between image-text pairs. As shown in the table, with more experts the model has a larger capacity and achieves better performance on various tasks.
Figure 5: Evaluation results between base model and MoE models on vision tasks.
We further demonstrate the sample efficiency of MoE by comparing the base model with MoE models with 8 and 32 experts. We achieve roughly 2x sample efficiency using the ORT MoE implementation.
Figure 6: Data efficiency comparison of MoE on vision task
ORT MoE implementation provides a wide variety of features to support real world workloads discussed above.
# Create a standard data parallel grid where each rank holds an entire copy of the model.
dgrid = DistributionGrid(data_parallel_group_size = <no_of_ranks_available>)
# Create an expert parallel grid where experts are evenly distributed among the available ranks.
dgrid = DistributionGrid(expert_parallel_group_size = <no_of_ranks_available>)
We have employed various techniques to improve the computational efficiency and reduce the memory consumption of MoE models. These techniques increase training throughput and allow larger batch sizes. We introduce three major techniques here: a sparse-to-dense mask in the gating function, a Fast Gating Indexing algorithm, and Merged Experts.
Figure 7: ORT MoE performance improvement on the CLIP Vision model (NDV4 cluster)
Unlike the traditional transformer structure (or neural network), the routing logic of MoE breaks the locality of input tokens: instead of being only consumed by a single or a small set of GPUs (e.g., model parallelism), the tokens on each GPU are routed globally to experts on other GPUs. This gives flexibility to how we distribute the experts and provides opportunities to explore distribution strategies with different communication patterns and the associated trade-offs. As different workloads may require different distributions for best performance, it is important to support a rich set of distribution strategies. In this section we discuss some of the distribution strategies we used with ORT MoE.
Expert Parallelism (EP):
In this scheme, experts are evenly divided across ranks. Alltoall collectives are inserted before and after the experts to route the input and output tokens to the correct position, according to the gating function. The non-expert parameters are replicated across ranks and synchronized using an AllReduce collective. This is the distribution strategy that was introduced in the GShard paper.
This strategy gives a simple, well load-balanced distribution of MoE and has been widely used in different models. With this distribution, the performance of Alltoall is a critical factor in overall throughput.
Figure 8: Expert Parallelism as described in GShard paper
Data scientists have deployed multiple replicas of the Expert Parallel distribution, known as Expert Parallel Replica, to increase training throughput when a larger number of GPUs is available. Under this strategy, as in traditional data parallel training, experts from each EP replica are synchronized by an additional AllReduce collective.
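The token routing in Expert Parallelism can be simulated in a single process. The sketch below is only an illustration of the communication pattern, not the ORT MoE code: ranks are modeled as list indices, `all_to_all` is a hypothetical stand-in for the real collective, the gate decisions are given as input, and expert capacity and padding are ignored.

```python
def all_to_all(send):
    """Simulated Alltoall: the bucket send[src][dst] arrives as recv[dst][src]."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]

def expert_parallel_step(tokens_per_rank, assignments_per_rank, experts, num_ranks):
    experts_per_rank = len(experts) // num_ranks  # experts are evenly divided
    # 1. Each rank buckets its tokens by the rank that owns the chosen expert.
    send = [[[] for _ in range(num_ranks)] for _ in range(num_ranks)]
    for rank in range(num_ranks):
        pairs = zip(tokens_per_rank[rank], assignments_per_rank[rank])
        for pos, (tok, exp) in enumerate(pairs):
            send[rank][exp // experts_per_rank].append((pos, exp, tok))
    # 2. First Alltoall: tokens travel to the ranks holding their experts.
    recv = all_to_all(send)
    # 3. Every rank applies its local experts to the tokens it received.
    processed = [
        [[(pos, experts[exp](tok)) for pos, exp, tok in bucket] for bucket in recv[rank]]
        for rank in range(num_ranks)
    ]
    # 4. Second Alltoall: results travel back to the originating ranks.
    returned = all_to_all(processed)
    # 5. Each rank restores the original token order.
    outputs = []
    for rank in range(num_ranks):
        out = [None] * len(tokens_per_rank[rank])
        for bucket in returned[rank]:
            for pos, value in bucket:
                out[pos] = value
        outputs.append(out)
    return outputs

# Four toy experts on two ranks: experts 0-1 live on rank 0, experts 2-3 on rank 1.
experts = [lambda x, e=e: x + 100 * e for e in range(4)]
tokens_per_rank = [[1, 2, 3], [4, 5]]
assignments_per_rank = [[0, 3, 2], [1, 1]]  # gate decisions, assumed given

print(expert_parallel_step(tokens_per_rank, assignments_per_rank, experts, 2))
# [[1, 302, 203], [104, 105]]
```

Every token crosses the network twice, once in each Alltoall, so the collective sits on the critical path of each MoE layer.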
Expert Slicing:
EP has a few drawbacks: (1) To maintain load balance, the number of experts must be at least the number of GPUs and divisible by it. (2) Padding of tokens may be needed when the number of tokens routed to an expert is smaller than the expert capacity, which wastes memory and computation. (3) Each expert has a fixed capacity for the number of tokens it accepts; tokens exceeding that capacity are dropped, which hurts learning efficiency.
To avoid these drawbacks, we provide another distribution technique, called expert slicing. Here each expert is evenly sharded across the GPUs. An all-gather collects the inputs from all ranks onto each rank, and an allreduce after the experts combines the partial results into the correct output. Each rank then uses a narrow-like operation to select the outputs that belong to it. Note that since each rank has all inputs before the gating, the loss computation and routing decision are made globally, which increases learning efficiency. Since each rank computes partial results for all experts, no tokens are dropped and no padding is needed. However, there is a memory tradeoff: each rank takes inputs from all ranks, which increases the size of the activations.
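Drawbacks (2) and (3) can be made concrete with a small counting example. The helper `capacity_stats` and the routing decisions below are hypothetical; the sketch assumes every expert has the same fixed capacity.

```python
def capacity_stats(assignments, num_experts, capacity):
    """Count tokens per expert, tokens dropped, and padding slots left unused."""
    counts = [0] * num_experts
    for expert in assignments:
        counts[expert] += 1
    dropped = sum(max(0, c - capacity) for c in counts)  # overflow beyond capacity
    padded = sum(max(0, capacity - c) for c in counts)   # unused slots filled with padding
    return counts, dropped, padded

# Eight tokens, four experts, capacity of two tokens per expert.
assignments = [0, 0, 0, 0, 0, 1, 2, 2]
counts, dropped, padded = capacity_stats(assignments, num_experts=4, capacity=2)
print(counts, dropped, padded)  # [5, 1, 2, 0] 3 3
```

In this unbalanced case three tokens are dropped at expert 0 while three padding slots are wasted at experts 1 and 3, even though the total capacity equals the number of tokens.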
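The following single-process sketch illustrates why summing the partial results of sliced experts reproduces the dense output. It rests on assumptions the post does not confirm: a single expert of the form `w2 · relu(w1 · x)` with no gating, sharding along the hidden dimension, and list operations standing in for the all-gather, allreduce and narrow steps. All names are hypothetical.

```python
def ffn(x, w1, w2):
    """Toy feed-forward expert: hidden = relu(w1 @ x), output = w2 @ hidden."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def slice_expert(w1, w2, num_ranks):
    """Split the hidden dimension evenly: rows of w1 and the matching columns of w2."""
    step = len(w1) // num_ranks
    return [
        (w1[r * step:(r + 1) * step], [row[r * step:(r + 1) * step] for row in w2])
        for r in range(num_ranks)
    ]

def expert_slicing_step(tokens_per_rank, shards):
    # All-gather: every rank sees the tokens of all ranks.
    gathered = [tok for rank_tokens in tokens_per_rank for tok in rank_tokens]
    # Each rank computes a partial result for every token with its own shard.
    partials = [[ffn(tok, w1, w2) for tok in gathered] for w1, w2 in shards]
    # Allreduce: sum the partial outputs across ranks, element by element.
    reduced = [[sum(vals) for vals in zip(*per_token)] for per_token in zip(*partials)]
    # Narrow: each rank keeps only the outputs of its own tokens.
    outputs, start = [], 0
    for rank_tokens in tokens_per_rank:
        outputs.append(reduced[start:start + len(rank_tokens)])
        start += len(rank_tokens)
    return outputs

w1 = [[1.0, -1.0], [0.5, 2.0], [-2.0, 1.0], [1.0, 1.0]]  # hidden size 4, input size 2
w2 = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 1.0, 0.5]]       # output size 2
tokens_per_rank = [[[1.0, 2.0]], [[3.0, -1.0], [0.0, 1.0]]]

sliced = expert_slicing_step(tokens_per_rank, slice_expert(w1, w2, num_ranks=2))
dense = [[ffn(tok, w1, w2) for tok in rank_tokens] for rank_tokens in tokens_per_rank]
print(sliced == dense)
```

Because the ReLU acts on each hidden unit independently, each rank can evaluate its slice of hidden units in isolation, and the sum over ranks matches the dense result. The `gathered` list also shows the memory tradeoff: every rank holds the tokens of all ranks.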
One essential promise of MoE is that the model size scales with the number of GPUs used, while keeping the parameter size, computation, and memory consumption per GPU constant. Ideally, we would like the model to scale as a linear function of the number of GPUs. However, due to additional communication cost (which also increases with the number of GPUs), only sublinear scaling may be feasible today.
To evaluate the scalability of our ORT MoE implementation, we deployed the CLIP Vision MoE model discussed above using the CC2M dataset on 1 to 64 GPUs, and on 128 GPUs, on an Azure NDv4 cluster (https://docs.microsoft.com/en-us/azure/virtual-machines/nda100-v4-series), using expert parallelism. For each setting, we keep the model size per GPU constant and scale up the number of experts along with the number of GPUs.
Figure 9: Scaling Result
Figure 10: Model size for each data point
Figure 9 records the scaling result. The x-axis is the number of nodes used; the y-axis is the throughput. The orange line represents “perfect scaling”, computed as the throughput of the model on a single node multiplied by the number of nodes used, so it is linear. The blue line is the actual throughput. Figure 10 records the model size for each data point, where we can see that the parameter size scales almost linearly.
From the result we can see that the MoE model maintains good scalability: on 128 GPUs, it preserves 65% of the theoretical value. One reason is that the Azure NDv4 cluster provides a high-bandwidth cross-node network (InfiniBand with GPUDirect RDMA, 8 x 200 Gigabit HDR). We do observe that the gap between the actual and theoretical throughput widens as the number of nodes increases. This is expected, since the communication cost grows with the number of nodes.
The ZeRO technique enables data scientists to scale PyTorch models efficiently by providing memory savings. DeepSpeed ZeRO and FairScale FSDP are common independent implementations of the ZeRO technique. ZeRO can be applied to an MoE model that uses ORT MoE. Because ZeRO reduces memory consumption, it can allow data scientists to increase the batch size. We were able to increase the batch size of a CLIP model by applying ZeRO through the FairScale FSDP implementation; Figure 11 shows the result. We applied FSDP to 2 replicas of the CLIP Vision model running on 8 GPUs. In this configuration, expert parameters are sharded into two partitions and non-expert parameters are sharded into eight partitions. At a batch size of 36, the model uses 36% less peak memory with FSDP. Without FSDP, the maximum batch size per GPU is 36; with FSDP, it increases to 48.
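The scaling efficiency quoted above is the measured throughput divided by the "perfect scaling" value. A minimal sketch of that calculation follows; the function name and the throughput numbers are hypothetical and were chosen only so that the result matches the 65% figure.

```python
def scaling_efficiency(single_unit_throughput, measured_throughput, num_units):
    """Fraction of perfect (linear) scaling that is actually achieved."""
    ideal = single_unit_throughput * num_units
    return measured_throughput / ideal

# Hypothetical throughputs in samples per second; not measurements from this post.
efficiency = scaling_efficiency(100.0, 8320.0, 128)
print(f"{efficiency:.0%}")  # 65%
```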
Figure 11: ZeRO helps improve batch size for the CLIP Model
We are able to train the same CLIP Vision MoE model configuration on NVIDIA A100 GPUs as well as AMD MI100 GPUs, and we improve model accuracy and efficiency using ORT MoE on both platforms.
We have implemented MixtureOfExperts as a standard torch.nn.Module. The implementation allows seamless integration with PyTorch packages. ORTModule provides a performance boost to standard PyTorch modules. An ORT MoE layer can simply be wrapped inside ORTModule while constructing the model to gain this performance boost. A model can also be wrapped entirely inside ORTModule. See the ORTModule documentation for more information.
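import torch.distributed as dist
from torch_ort import ORTModule  # assumes the torch-ort package is installed
# DistributionGrid and TransformerMoEEncoderLayer come from the ORT MoE package (imports omitted here).
dgrid = DistributionGrid(expert_parallel_group_size = dist.get_world_size())
encoder = TransformerMoEEncoderLayer(256, 4, nexperts=8, distribution_grid = dgrid)
encoder = ORTModule(encoder)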
We are able to scale various language, speech and vision models using the Mixture of Experts technique by incorporating ORT MoE. We will continue to optimize the ORT MoE implementation to improve training throughput and explore new distribution strategies. This will enable data scientists to scale models and further improve the quality of the services built upon these models. We invite data scientists to explore the MoE technique using the flexible ORT MoE implementation here.