Scaling Speech, Language and Vision Models with Mixture of Experts Technique
Published Apr 29 2022 03:38 PM

Authors: Devang Patel (@devang_patel), Wei Zuo (@weizuo), Yu Shi (@yu3shi2), Kenichi Kumatani (@kekumata), Mengchen Liu (@MengchenLiu), Robert Gmyr (@rogmyr) and Kshama Pawar (@kshama-msft)



The Mixture of Experts technique, frequently referred to as MoE, is gaining traction in the NLP community. It is primarily used to scale Transformer models without incurring high computational resource costs. In this post, we discuss how ORT MoE, an MoE implementation from the ONNX Runtime team, is used to scale networks and improve quality in Speech and Vision models in addition to NLP models. Using this technique, an Automatic Speech Recognition model improves quality by 16.3% and language models improve sample efficiency by 10x. We first briefly describe the MoE architecture; ORT MoE features and implementation details are discussed in the second half of this post.


The Mixture of Experts technique typically uses subcomponents such as Experts and Gating Functions in a Transformer block, as shown in Figure 1.




Figure 1: MoE components


Experts are at the heart of the Mixture of Experts technique. Usually, a standard feedforward neural network sublayer is used as an expert, but that is not required. Fundamentally, the Mixture of Experts technique deploys many experts to increase the model parameter count and thereby achieve better model quality. However, at runtime only a small subset of these experts is used to process the given input tokens. This allows data scientists to keep the FLOP budget under control while increasing the model size. A gating function selects this small subset of experts at runtime; Top1 and Top2 algorithms are popular choices for gating functions. The MixtureOfExperts sublayer consists of the gating function, the experts, and the communication collectives needed to synchronize experts across multiple shards. This sublayer is the core of an MoE implementation. Finally, a Transformer layer is constructed from the MixtureOfExperts sublayer and other components such as a multi-headed attention layer.
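To make the roles of the gating function and the experts concrete, here is a minimal, framework-free sketch of top-k routing. It is not the ORT MoE implementation: the function names (`top_k_gate`, `moe_forward`), the toy scalar "experts", and the choice to renormalize the selected gate weights are all assumptions made for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(token_logits, k=1):
    """For each token, return [(expert_index, weight), ...] for its k most probable experts."""
    routes = []
    for logits in token_logits:
        probs = softmax(logits)
        top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
        # Assumption: renormalize the selected probabilities so the weights sum to 1.
        norm = sum(probs[e] for e in top)
        routes.append([(e, probs[e] / norm) for e in top])
    return routes

def moe_forward(tokens, token_logits, experts, k=1):
    """Only the k selected experts run for each token; the others are skipped."""
    outputs = []
    for x, route in zip(tokens, top_k_gate(token_logits, k)):
        outputs.append(sum(w * experts[e](x) for e, w in route))
    return outputs

# Four toy "experts" acting on scalars (a real expert would be a feedforward sublayer).
experts = [lambda x: x + 1.0, lambda x: x * 2.0, lambda x: -x, lambda x: x * x]
tokens = [1.0, 2.0, 3.0]
logits = [[0.1, 2.0, 0.3, 0.0], [3.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.5, 4.0]]

top1_outputs = moe_forward(tokens, logits, experts, k=1)
top2_routes = top_k_gate(logits, k=2)
```

With top-1 gating each token invokes exactly one expert, so adding experts grows the parameter count while the per-token compute stays roughly flat.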


MoE Acceleration Scenarios

Below, we discuss scenarios within Microsoft where MoE improves scalability and model quality for large-scale Transformer-based models.

Improving quality of Automatic Speech Recognition model using MoE

We are able to increase the quality of the Automatic Speech Recognition model by 16.3% by applying the MoE technique. We used the ORT MoE implementation to introduce expert layers in Transformer networks for multi-lingual speech recognition tasks. To train the multi-lingual network, we used approximately 75,000 hours of data from 10 languages. The baseline network consists of a Transformer encoder with 18 blocks and a decoder with 6 blocks, where each block contains an 8-head attention network followed by a feed-forward network (FFN) with pre-normalization. MoE is applied to every second FFN layer; see here for details.


The targeted scenario was to provide a teacher model to improve the accuracy of student models through semi-supervised learning such as knowledge distillation.


The table below shows the word error rates (WERs) obtained with the baseline and with MoE networks with 24, 72 and 120 experts. The WER is shown for each language as well as overall, where the overall WER is the average weighted by the number of words. The MoE network improves recognition accuracy, and the results indicate that the MoE network is a promising approach as the teacher model. Because only a small subset of experts is active for each input, the MoE architecture also keeps the computational cost of training and inference low relative to its parameter count.
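As a small illustration of the word-weighted overall WER mentioned above, the sketch below shows that weighting each language's WER by its word count is the same as dividing total errors by total words. The language names and counts are hypothetical and are not taken from the experiments.

```python
def overall_wer(per_language):
    """per_language maps language -> (word_errors, reference_words).

    Weighting each language's WER by its number of reference words reduces to
    total errors divided by total words.
    """
    total_errors = sum(errors for errors, _ in per_language.values())
    total_words = sum(words for _, words in per_language.values())
    return total_errors / total_words

# Hypothetical counts for two languages of very different size.
stats = {"lang_a": (50, 1000), "lang_b": (30, 200)}

per_language_wer = {lang: errors / words for lang, (errors, words) in stats.items()}
weighted = overall_wer(stats)
unweighted = sum(per_language_wer.values()) / len(per_language_wer)
```

Here the large language dominates the weighted figure, which is why it differs from a plain average of the per-language WERs.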




Figure 2a: Word Error Rates for the baseline and MoE networks against #experts

We have also investigated whether the MoE teacher model can be compressed without loss of recognition accuracy. For this study, we developed Teacher-Student (TS) training with Knowledge Distillation (KD). To bring the student model closer to the teacher model, we initialize the student model with the teacher’s weights. To initialize the student’s FFN, we select one of the teacher’s experts based on how frequently it is used during inference on a validation dataset. The student model is updated with the KD loss and then fine-tuned with supervised learning. The details are described here. The table shows the WERs obtained with two teacher models with 24 and 72 experts, as well as the student model distilled from each teacher model. As a baseline, the table also shows the WER of the student model trained from scratch without a teacher model. It is clear from the table that TS training provides better accuracy than training from scratch, whether the teacher model is small or large. It is also clear that when the teacher model is large, the student model requires more capacity to reach accuracy comparable to the teacher model.
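The expert-selection step for student initialization can be sketched as follows. This is only an illustration of "pick the expert used most often on a validation set": the routing trace, the placeholder weight names, and the helper `most_used_expert` are hypothetical, and the actual selection procedure is described in the linked details.

```python
from collections import Counter

def most_used_expert(routing_decisions):
    """Return the expert index selected most often, plus the full usage counts."""
    counts = Counter(routing_decisions)
    expert, _ = counts.most_common(1)[0]
    return expert, counts

# Hypothetical trace: the expert chosen by the gate for each validation token
# at one MoE layer of the teacher.
validation_routing = [2, 0, 2, 1, 2, 3, 0, 2]

# Placeholder teacher weights, one entry per expert in that layer.
teacher_expert_weights = {0: "ffn_0", 1: "ffn_1", 2: "ffn_2", 3: "ffn_3"}

chosen, usage = most_used_expert(validation_routing)
# The student's dense FFN for this layer starts from the chosen expert's weights.
student_ffn_init = teacher_expert_weights[chosen]
```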




Figure 2b: Word Error Rates for the MoE teacher and student dense networks


Improving sample efficiency by 10x for a 10B parameter Language model using MoE

We evaluated the MoE technique, using the ORT MoE implementation, on research variants of language models for machine translation. Two multilingual datasets are used, consisting of 20 and 50 languages respectively. Both models have the Transformer encoder-decoder architecture. We use a base-sized model and multi-task training for 20 languages, and a large-sized model and single-task (machine translation) training for 50 languages. In the MoE models, MoE layers, each containing one top-1 gate and multiple feedforward experts, are injected in every other block of the Transformer stack. All non-MoE hyperparameters in the MoE models are kept the same as in the corresponding dense model.


The baseline dense model for 20 languages has 243M parameters. Expert scaling is investigated by adding 32, 64, 128, and 256 experts separately. The resulting model sizes are 1.6B, 2.9B, 5.6B, and 11.1B parameters. Experimental results show that the more experts there are, the higher the BLEU score and hence the better the translation quality.
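A quick back-of-the-envelope check on these numbers: dividing the growth in model size by the number of experts added gives the approximate parameter cost of one more expert (summed over all MoE layers). The sizes below are the rounded figures quoted above, so the result is only approximate.

```python
# Reported (rounded) total parameter counts for each expert count.
moe_params = {32: 1.6e9, 64: 2.9e9, 128: 5.6e9, 256: 11.1e9}

expert_counts = sorted(moe_params)
params_per_added_expert = {}
for fewer, more in zip(expert_counts, expert_counts[1:]):
    growth = moe_params[more] - moe_params[fewer]
    params_per_added_expert[(fewer, more)] = growth / (more - fewer)

for (fewer, more), value in params_per_added_expert.items():
    print(f"{fewer} -> {more} experts: about {value / 1e6:.1f}M parameters per added expert")
```

Each step comes out at roughly 41M to 43M parameters per added expert, which is consistent with the model size growing almost linearly in the number of experts.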




Figure 3: BLEU scores with expert scaling for 20 languages


In the 50-language experiment, the baseline dense model has 761M parameters. A 64-expert MoE model with 10.3B parameters is trained and achieves a similar BLEU score in 1/10 of the training steps, indicating about 10x sample efficiency compared to the non-MoE baseline.




Figure 4: Sample efficiency of 64 experts for 50 languages


Improving accuracy and efficiency of a Vision Foundation Model using MoE

A foundation model in Vision is trained on broad data at scale and can be adapted (e.g. fine-tuned) to a wide range of downstream tasks. Foundation models have become promising due to their impressive performance and generalization capabilities. They are being quickly integrated and deployed into real-world AI systems for tasks such as image retrieval, image classification, object detection, visual question answering and action recognition. We find that ORT MoE provides an efficient approach to scaling up a vision foundation model so that it has the capacity to effectively digest broader data.



We evaluate the model performance via a set of zero-shot classification tasks. The model is a CLIP Vision model ([2103.00020] Learning Transferable Visual Models From Natural Language Supervision) that learns a matching between image-text pairs. As shown in the table, with more experts the model has a larger capacity and achieves better performance on various tasks.




Figure 5: Evaluation results between base model and MoE models on vision tasks.


Sample efficiency

We further demonstrate the sample efficiency of MoE by comparing the base model with MoE models with 8 and 32 experts. We achieve approximately 2x sample efficiency using the ORT MoE implementation.





Figure 6: Data efficiency comparison of MoE on vision task


ORT MoE Deep Dive

The ORT MoE implementation provides a wide variety of features to support the real-world workloads discussed above.

  • Precision: ORT MoE’s core component, MixtureOfExperts, supports FP32 as well as FP16 precision. This flexible implementation enables mixed precision model training through external packages such as NVIDIA Apex and PyTorch Automatic Mixed Precision (AMP).
  • Variable Length Inputs: MixtureOfExperts uses a pair of AllToAll collectives in each layer. Typically, standard AllToAll implementations require the input tensor size to match on all ranks. ORT MoE is used in multi-lingual models where the input size on each rank can vary from batch to batch, and within a given batch the ranks may not all have the same input size. To accommodate this, ORT MoE provides a modified AllToAll implementation that supports variable length input tensors.
  • DistributionGrid: ORT MoE provides DistributionGrid to create and manage process groups for various distributed training configurations, taking the underlying cluster topology into account. The DistributionGrid takes the number of ranks available for the given distribution strategy as input and handles the complexity of creating and placing torch.distributed process groups.



# Create a standard data parallel grid where each rank holds an entire copy of the model.
dgrid = DistributionGrid(data_parallel_group_size = <no_of_ranks_available>)

# Create an expert parallel grid where experts are evenly distributed among the available ranks.
dgrid = DistributionGrid(expert_parallel_group_size = <no_of_ranks_available>)



  • Checkpointing: The DistributionGrid also provides an interface to map experts for checkpoint saving and loading. Based on this interface, utility functions are provided to query and translate the model state dictionary. The utilities can translate the parameter mapping from one model distribution strategy to another while hiding the complexity of expert parameter placement from data scientists.
  • Loss Functions: ORT MoE provides multiple loss functions for flexible MoE experimentation. The `load_balancing_loss` is the auxiliary loss defined in the Switch Transformers paper, which seeks uniform routing of the batch of samples across the experts. The `sparsity_l1_loss` and `mean_importance_loss` are introduced in the SpeechMoE paper. The former encourages sparsity of the router activation while the latter leads to a balanced routing strategy. The `z_loss` belongs to the spherical loss family in multi-class classification and is proposed in this paper. In ORT MoE, it is a variant of the original z-loss following Google’s TensorFlow MoE. The `ideal_load_balancing_loss` is a straight-through estimator of the ideal load balance, which is not differentiable.
  • Gating Utilities: ORT MoE also provides gating utilities for balancing exploration. For top-1 gating, `switch_jitter` adds multiplicative jitter noise to the incoming representations and `switch_dropout` adds input dropout to the incoming representations. For both top-1 and top-2 gating, `logits_gumbel` uses the Gumbel-max trick to sample from the softmax distribution. `token_drop_type` offers three methods for dropping samples when the expert capacity is insufficient: `cut` cuts the sequence tail, `random` drops samples at random, and `routing_weight` drops samples according to the routing weights. For top-2 gating, a 2nd-place loss can be added by specifying the ratio using `second_place_loss_ratio`.
  • Gate Metrics: To help interpret and visualize gate and expert behavior, ORT MoE provides some useful gate metrics for logging. `gate_entropy` computes the average entropy of the router probability distribution. `gate_probability` computes the average probability of the selected expert over all samples, which is a good metric for monitoring expert specialty. `gate_routed` computes the fraction of samples being routed. This usually indicates how difficult it is to balance the expert loads over the given samples for a proper capacity factor, or whether the capacity is not big enough. Per-layer values of these metrics can be obtained by using the utility function `get_moe_loss()` and setting `layer_level=True`. There are also two metrics for expert usage. `expert_fraction` indicates the fraction of samples dispatched to each expert before the capacity constraint is applied. `expert_routed_fraction` indicates the fraction of routed samples sent to each expert for computation.
  • Base Layer Routing: A new balanced routing of experts that improves the quality of large MoE models, as described in the BASE Layers paper.
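To give a feel for two of the quantities in the list above, the sketch below computes the Switch Transformers style auxiliary load-balancing loss (number of experts times the sum over experts of dispatch fraction times mean router probability) and the average router entropy. These are independent, stdlib-only illustrations; the function names are chosen here and are not the ORT MoE functions, whose exact scaling and options may differ.

```python
import math

def switch_load_balancing_loss(router_probs):
    """router_probs[t][e] is the router probability of expert e for token t."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    dispatch_fraction = [0.0] * num_experts  # share of tokens whose top-1 expert is e
    mean_probability = [0.0] * num_experts   # mean router probability assigned to e
    for probs in router_probs:
        top = max(range(num_experts), key=lambda e: probs[e])
        dispatch_fraction[top] += 1.0 / num_tokens
        for e in range(num_experts):
            mean_probability[e] += probs[e] / num_tokens
    return num_experts * sum(f * p for f, p in zip(dispatch_fraction, mean_probability))

def average_gate_entropy(router_probs):
    """Mean entropy (in nats) of the per-token router distributions."""
    total = 0.0
    for probs in router_probs:
        total -= sum(q * math.log(q) for q in probs if q > 0.0)
    return total / len(router_probs)

balanced = [[0.9, 0.1], [0.1, 0.9]]   # tokens split evenly across two experts
collapsed = [[0.9, 0.1], [0.9, 0.1]]  # every token prefers expert 0

balanced_loss = switch_load_balancing_loss(balanced)
collapsed_loss = switch_load_balancing_loss(collapsed)
uniform_entropy = average_gate_entropy([[0.5, 0.5]])
```

The loss reaches its minimum of 1.0 under uniform routing and grows as routing collapses onto a few experts, which is the behavior the auxiliary term is meant to penalize.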


ORT MoE Performance Improvements

We have employed various techniques to improve computational efficiency and reduce memory consumption of MoE models. These techniques increase training throughput and batch size. We introduce three major techniques here: a sparse-to-dense mask in the gating function, a fast gating indexing algorithm, and merged experts.

  • In MoE implementations, the matrix used to store routing information has a size of O(S^2), where S is the token length. ORT MoE instead uses a dense matrix of size O(S) to record the routing information, which allowed a 29.49% throughput increase in Vision models.
  • ORT MoE uses a sorting-based algorithm, instead of a typical cumulative sum operator, to implement fast indexing in the gating function. The new algorithm increases throughput by 42.17%.
  • In a typical expert module there are two matrix-multiplication (MM) kernels separated by an activation function. In scenarios where multiple experts are placed on a single GPU, each expert module is executed sequentially in a loop. This increases kernel launch overhead, and GPU utilization remains low for smaller expert sizes. Since all expert computations are independent of each other, they can be executed in parallel. ORT MoE implements MergedFFNExpert, which uses batched GEMM instead of MM to execute multiple expert computations in a single kernel launch. Merged experts improve throughput by 53.87% for Vision models.



Figure 7: ORT MoE performance improvement on the CLIP Vision model (NDV4 cluster)


  • ORT MoE integrates with Tutel computation optimization features and gives users the option to take advantage of Tutel.
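The gating index in question is, for each token, its slot within the buffer of the expert it was routed to. The post does not spell out the sorting-based algorithm, so the sketch below is only one plausible formulation, written in plain Python to show that a stable sort yields the same slots as the cumulative-sum approach over a one-hot mask. Both function names are hypothetical.

```python
def locations_by_cumsum(expert_ids, num_experts):
    """Reference approach: running (cumulative) sum over a tokens x experts one-hot mask."""
    mask = [[1 if e == j else 0 for j in range(num_experts)] for e in expert_ids]
    running = [0] * num_experts
    locations = []
    for row, e in zip(mask, expert_ids):
        running = [r + m for r, m in zip(running, row)]
        locations.append(running[e] - 1)
    return locations

def locations_by_sort(expert_ids, num_experts):
    """Sort-based approach: a stable sort groups tokens by expert in arrival order."""
    order = sorted(range(len(expert_ids)), key=lambda t: expert_ids[t])  # sorted() is stable
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    starts = [0] * num_experts  # offset of each expert's group in the sorted order
    for j in range(1, num_experts):
        starts[j] = starts[j - 1] + counts[j - 1]
    locations = [0] * len(expert_ids)
    for rank, t in enumerate(order):
        locations[t] = rank - starts[expert_ids[t]]
    return locations

expert_ids = [2, 0, 2, 1, 0, 2]  # top-1 expert chosen for each of six tokens
slots_cumsum = locations_by_cumsum(expert_ids, num_experts=3)
slots_sort = locations_by_sort(expert_ids, num_experts=3)
```

The sort-based version never materializes the tokens x experts mask, which is the kind of saving that makes a difference when the number of experts is large.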


Distribution Strategies supported by ORT MoE

Unlike the traditional transformer structure (or neural network), the routing logic of MoE breaks the locality of input tokens: instead of being only consumed by a single or a small set of GPUs (e.g., model parallelism), the tokens on each GPU are routed globally to experts on other GPUs. This gives flexibility to how we distribute the experts and provides opportunities to explore distribution strategies with different communication patterns and the associated trade-offs. As different workloads may require different distributions for best performance, it is important to support a rich set of distribution strategies. In this section we discuss some of the distribution strategies we used with ORT MoE.


Expert Parallelism (EP):

In this scheme, experts are evenly divided across ranks. AllToAll collectives are inserted before and after the experts to route the input and output tokens to the correct position, according to the gating function. The non-expert parameters are replicated across ranks and synchronized using an AllReduce collective. This is the distribution strategy that was introduced in the GShard paper.


This strategy provides a simple, well load-balanced distribution of MoE and has been widely used in different models. With this distribution, AllToAll performance is a critical factor in throughput.
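The AllToAll exchange can be simulated in a single process to show why variable-length inputs need extra care. In the sketch below, ranks are modeled as list indices and every rank may send a different number of tokens to every other rank. Exchanging the chunk sizes first, so that each receiver knows how much data to expect, is a common way to handle this and is an assumption here; the post does not describe how ORT MoE's modified AllToAll works internally.

```python
def all_to_all(send):
    """send[src][dst] is what rank src sends to rank dst; returns recv[dst][src]."""
    world_size = len(send)
    return [[send[src][dst] for src in range(world_size)] for dst in range(world_size)]

def variable_length_all_to_all(send_chunks):
    # Step 1: fixed-size exchange of chunk lengths (one integer per peer).
    send_sizes = [[len(chunk) for chunk in row] for row in send_chunks]
    recv_sizes = all_to_all(send_sizes)
    # Step 2: exchange the payloads; every rank already knows what to expect.
    recv_chunks = all_to_all(send_chunks)
    for dst, row in enumerate(recv_chunks):
        assert [len(chunk) for chunk in row] == recv_sizes[dst]
    return recv_chunks, recv_sizes

# Two ranks, one expert per rank. Rank 0 routes one token to the local expert and two
# tokens to the expert on rank 1; rank 1 routes a single token to its own expert.
send_chunks = [
    [["a"], ["b", "c"]],  # from rank 0: to rank 0, to rank 1
    [[], ["d"]],          # from rank 1: to rank 0, to rank 1
]
recv_chunks, recv_sizes = variable_length_all_to_all(send_chunks)
```

After the experts run, a second AllToAll with the sizes transposed returns each token's output to the rank it came from.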




Figure 8: Expert Parallelism as described in the GShard paper


Data scientists have deployed multiple replicas of the Expert Parallel distribution, known as Expert Parallel Replica, to increase training throughput when a larger number of GPUs is available. Under this strategy, as in traditional data parallel training, experts from each EP replica are synchronized by an additional AllReduce collective.


Expert Slicing:

EP has a few drawbacks: (1) To maintain load balance, the number of experts must be at least as large as, and divisible by, the number of GPUs. (2) Tokens may need to be padded when the number of tokens routed to an expert is smaller than the expert capacity, which wastes memory and computation. (3) Each expert has a fixed capacity for the number of tokens it accepts; tokens exceeding that capacity are dropped, which affects learning efficiency.


To avoid these drawbacks, we provide another distribution technique, called expert slicing. Here each expert is evenly sharded across the GPUs. An all-gather collects the inputs from all ranks on each rank, and an allreduce is used after the experts to compute the output correctly. Each rank then uses a narrow-like operation to select the outputs that belong to it. Note that since each rank has all inputs before the gating, the loss computation and routing decision are made globally, which increases learning efficiency. Since each rank computes partial results for all experts, there is no token dropping and no padding is needed. However, there is a memory tradeoff: each rank takes inputs from all ranks, which increases the size of the activations.
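The all-gather, partial compute, allreduce and narrow sequence can be checked on a toy example. The sketch below simulates the ranks in one process with plain Python lists and a single ReLU feedforward expert whose hidden dimension is split across ranks; gating is omitted. The slicing axis, the helper names and the weights are assumptions for illustration and are not taken from the ORT MoE code.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def ffn(x, w1, w2):
    """Dense expert: x @ w1 -> ReLU -> @ w2."""
    return matmul(relu(matmul(x, w1)), w2)

def slice_expert(w1, w2, world_size, rank):
    """Give each rank an equal slice of the hidden dimension (columns of w1, rows of w2)."""
    step = len(w2) // world_size
    lo, hi = rank * step, (rank + 1) * step
    return [row[lo:hi] for row in w1], w2[lo:hi]

def expert_slicing_forward(inputs_per_rank, w1, w2):
    world_size = len(inputs_per_rank)
    # all-gather: every rank sees the tokens of all ranks
    gathered = [tok for rank_inputs in inputs_per_rank for tok in rank_inputs]
    # each rank computes a partial output for all tokens using its expert shard
    partials = [ffn(gathered, *slice_expert(w1, w2, world_size, r)) for r in range(world_size)]
    # allreduce: sum the partial outputs element-wise
    reduced = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
    # narrow: each rank keeps only the rows for its own tokens
    outputs, start = [], 0
    for rank_inputs in inputs_per_rank:
        outputs.append(reduced[start:start + len(rank_inputs)])
        start += len(rank_inputs)
    return outputs

w1 = [[1.0, -1.0, 0.5, 2.0], [0.0, 1.0, -0.5, 1.0]]       # model dim 2 -> hidden dim 4
w2 = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 1.0]]   # hidden dim 4 -> model dim 2
inputs_per_rank = [[[1.0, 2.0]], [[-1.0, 0.5], [0.0, 3.0]]]  # ranks hold 1 and 2 tokens

sliced_outputs = expert_slicing_forward(inputs_per_rank, w1, w2)
dense_outputs = [ffn(x, w1, w2) for x in inputs_per_rank]
```

Because the activation is applied element-wise along the hidden dimension, summing the per-shard results reproduces the dense expert output, and the uneven token counts per rank need neither padding nor dropping.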


Model scaling with ORT MoE

One essential promise of MoE is that the model size scales with the number of GPUs used, while keeping the parameter size, computation, and memory consumption per GPU constant. Ideally, we would like the model to scale as a linear function of the number of GPUs. However, due to additional communication cost (which also increases with the number of GPUs), only sublinear scaling may be feasible today.


To evaluate the scalability of our ORT MoE implementation, we deployed the CLIP Vision MoE model discussed above, using the CC2M dataset, on 1 to 64 GPUs and on 128 GPUs of an Azure NDv4 cluster, using expert parallelism. For each setting, we keep the model size per GPU constant and scale up the number of experts along with the number of GPUs.




Figure 9: Scaling Result




Figure 10: Model size for each data point


Figure 9 records the scaling result. The x-axis is the number of nodes used; the y-axis is the throughput. The orange line represents “perfect scaling”, computed as the product of the single-node throughput and the number of nodes used, which is linear scaling. The blue line is the actual throughput. Figure 10 records the model size for each data point and shows that the parameter size scales almost linearly.


From the result we can see that the MoE model maintains good scalability: on 128 GPUs, it preserves 65% of the theoretical value. One reason is that the Azure NDv4 cluster provides a high-bandwidth cross-node network (InfiniBand with GPUDirect RDMA, 8 x 200 Gigabit HDR). We do observe that the gap between the actual and theoretical throughput widens as the number of nodes increases. This is expected, since the communication cost grows with the number of nodes.
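The "fraction of theoretical value" above is simply measured throughput divided by the perfect-scaling throughput. The sketch below shows the arithmetic; the throughput numbers are hypothetical and chosen only so that the result matches the 65% figure quoted in the text.

```python
def scaling_efficiency(single_unit_throughput, units, measured_throughput):
    """Measured throughput as a fraction of perfect (linear) scaling."""
    perfect = single_unit_throughput * units
    return measured_throughput / perfect

# Hypothetical throughputs in samples per second, for illustration only.
efficiency_at_128 = scaling_efficiency(
    single_unit_throughput=100.0, units=128, measured_throughput=8320.0
)
print(f"{efficiency_at_128:.0%} of perfect scaling")
```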


ZeRO integration

The ZeRO technique enables data scientists to scale PyTorch models efficiently by providing memory savings. DeepSpeed ZeRO and FairScale FSDP are two common independent implementations of the ZeRO technique. ZeRO can be applied to an MoE model that uses ORT MoE. It reduces memory consumption, which can enable data scientists to increase the batch size. We were able to increase the batch size of a CLIP model by applying ZeRO using the FairScale FSDP implementation. Figure 11 shows the result. We applied FSDP to 2 replicas of the CLIP Vision model running on 8 GPUs. In this configuration, expert parameters are sharded into two partitions and non-expert parameters are sharded into eight partitions. At a batch size of 36, the model with FSDP uses 36% less peak memory. Without FSDP, the maximum batch size per GPU is 36, while with FSDP it increases to 48.
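One way to read the partition counts above: with 2 expert parallel replicas on 8 GPUs, each expert exists in 2 copies (one per replica) while non-expert parameters exist in 8 copies, so FSDP can shard them 2 ways and 8 ways respectively. The sketch below turns that reading into per-GPU parameter counts. The interpretation and the parameter counts are assumptions for illustration, and the calculation covers parameters only, not activations, gradients or optimizer state.

```python
def params_per_gpu(non_expert_params, expert_params_total, world_size, ep_replicas, use_fsdp):
    """Parameters resident on one GPU under expert parallel replicas, with or without sharding."""
    ep_group_size = world_size // ep_replicas           # GPUs that split the experts of one replica
    local_experts = expert_params_total / ep_group_size
    if not use_fsdp:
        return non_expert_params + local_experts
    # Shard each parameter across the ranks that hold an identical copy of it.
    return non_expert_params / world_size + local_experts / ep_replicas

# Hypothetical parameter counts.
config = dict(non_expert_params=80e6, expert_params_total=320e6, world_size=8, ep_replicas=2)
without_fsdp = params_per_gpu(use_fsdp=False, **config)
with_fsdp = params_per_gpu(use_fsdp=True, **config)
```

The memory freed this way is what allows the larger per-GPU batch size reported above.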




Figure 11: ZeRO helps improve batch size for the CLIP Model


Ongoing Effort


Training on AMD GPUs

We are able to train the same CLIP Vision MoE model configuration on NVIDIA A100 GPUs as well as AMD MI100 GPUs, and we are able to improve model accuracy and efficiency using ORT MoE on both platforms.


ORTModule integration

We have implemented MixtureOfExperts as a standard torch.nn.Module. This allows seamless integration with PyTorch packages. ORTModule provides a performance boost to standard PyTorch modules. An ORT MoE layer can simply be wrapped inside ORTModule while constructing the model to gain the performance boost. A model can also be wrapped entirely inside ORTModule. See the ORTModule documentation for more information.



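import torch.distributed as dist
from onnxruntime.training.ortmodule import ORTModule
# DistributionGrid and TransformerMoEEncoderLayer are provided by the ORT MoE package.
dgrid = DistributionGrid(expert_parallel_group_size = dist.get_world_size())
encoder = TransformerMoEEncoderLayer(256, 4, nexperts=8, distribution_grid = dgrid)
encoder = ORTModule(encoder)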



Looking Forward

We are able to scale various language, speech and vision models using the Mixture of Experts technique by incorporating ORT MoE. We will continue to optimize the ORT MoE implementation to improve training throughput and to explore new distribution strategies. This will enable data scientists to scale models further and improve the quality of the services built upon these models. We invite data scientists to explore the MoE technique using the flexible ORT MoE implementation here.

Last update: May 02 2022 01:21 PM