AI - Azure AI services Blog

Discover the New Multi-Lingual, High-Quality Phi-3.5 SLMs

AdinaTru
Microsoft
Aug 22, 2024

The Phi-3 model collection is the latest in Microsoft's family of Small Language Models (SLMs). The models are designed to be highly capable and cost-effective, outperforming models of similar and larger sizes across various benchmarks in language, reasoning, coding, and math. The availability of Phi-3 models expands the selection of high-quality models for Azure customers, offering more practical choices as they compose and build generative AI applications. Since the launch in April 2024, we have received a great deal of valuable feedback from our customers and community members on areas for improvement in the Phi-3 models. Today, we are proud to announce Phi-3.5-mini, Phi-3.5-vision, and a new member of the Phi family, Phi-3.5-MoE, a Mixture-of-Experts (MoE) model. Phi-3.5-mini enhances multi-lingual support with a 128K context length. Phi-3.5-vision improves multi-frame image understanding and reasoning while also boosting performance on single-image benchmarks. Phi-3.5-MoE, featuring 16 experts and 6.6B active parameters, provides high performance, reduced latency, multi-lingual support, and robust safety measures, outperforming larger models while upholding the Phi models' efficacy.

 

Figure: Phi-3.5 quality vs. size among SLMs

Phi-3.5-MoE: Mixture-of-Experts

Phi-3.5-MoE is the latest addition to the Phi model family. It comprises 16 experts, each containing 3.8B parameters. With a total model size of 42B parameters, it activates 6.6B parameters when using two experts. This MoE model outperforms a similarly sized dense model in terms of quality and performance. It supports over 20 languages. Like its Phi-3 counterparts, the MoE model employs a robust safety post-training strategy, using a mix of open-source and proprietary synthetic instruction and preference datasets. This post-training process combines Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), utilizing both human-labeled and synthetic datasets. These include datasets focused on helpfulness and harmlessness, as well as multiple safety categories. Phi-3.5-MoE also supports a context length of up to 128K, enabling it to handle numerous long-context tasks.
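The idea of activating only two of the 16 experts can be illustrated with a minimal top-2 routing sketch. This is a generic Mixture-of-Experts gating example, not the actual Phi-3.5-MoE router, whose details the post does not describe. The toy experts, the random router logits, and the choice to renormalize weights over the selected experts are all assumptions made for illustration.

```python
import math
import random

NUM_EXPERTS = 16  # number of experts stated in the post
TOP_K = 2         # the post describes using two experts at a time


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=TOP_K):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Assumption: weights are renormalized over the selected experts only,
    so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token, experts, router_logits):
    """Run only the selected experts and mix their outputs by routing weight."""
    return sum(w * experts[i](token) for i, w in route_top_k(router_logits))


# Hypothetical toy experts: expert i simply scales its input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
selected = route_top_k(logits)
output = moe_layer(1.0, experts, logits)
print("selected experts:", [i for i, _ in selected])
```

Only the two selected experts are evaluated for the token. This is why the number of active parameters per token can be much smaller than the total model size.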

 

To understand the model quality, we compare Phi-3.5-MoE with a set of models over a variety of benchmarks as shown in Table 1:

Table 1: Phi-3.5-MoE Model Quality

 

We take a closer look at different categories of public benchmark datasets in the table below:

Table 2: Phi-3.5-MoE Model Quality on various capabilities

 

Phi-3.5-MoE, with only 6.6B active parameters, achieves a level of language understanding and math similar to that of much larger models. Moreover, the model outperforms bigger models in reasoning capability and provides good capacity for fine-tuning on various tasks. Table 3 highlights the multi-lingual capability of Phi-3.5-MoE on the multi-lingual MMLU, MEGA, and multi-lingual MMLU-pro datasets. Overall, we observed that even with just 6.6B active parameters, the model is very competitive on multi-lingual tasks in comparison to other models with many more active parameters.

 

Multi-lingual Capability

Table 3: Phi-3.5-MoE Multi-lingual Benchmark

 

The table below shows multi-lingual MMLU scores in some of the supported languages.

Table 4: Phi-3.5-MoE Multi-lingual MMLU Benchmark

 

Phi-3.5-mini

The Phi-3.5-mini model has undergone further pre-training using multi-lingual synthetic and high-quality filtered data. This was followed by a series of post-training steps, which included Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). These processes used a combination of human-labeled, synthetic, and translated datasets.

 

Model Quality

When diving into the capabilities of language models, it's crucial to understand how they stack up against one another. That's why we put Phi-3.5-mini to the test alongside a selection of recent top-performing larger models, using our internal benchmark platform. As a high-level overview, Table 5 provides a snapshot of the model quality on key benchmarks. Despite its compact size of just 3.8B parameters, this efficient model not only matches but often surpasses the performance of larger models.

 

Table 5: Phi-3.5-mini Model Quality

 

Multi-lingual Capability

Phi-3.5-mini is our latest 3.8B model update. The model used additional continual pre-training and post-training data, leading to substantial gains in multi-lingual capability, multi-turn conversation quality, and reasoning capability. The model has been trained on a selected set of languages: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, and Ukrainian.

 

Table 6 below highlights the multi-lingual capability of Phi-3.5-mini, reporting the average language-specific scores on the multi-lingual MMLU, MGSM, MEGA, and multi-lingual MMLU-pro datasets.

Table 6: Phi-3.5-mini Multi-Lingual Quality

 

Table 7 below shows Multi-lingual MMLU scores for some of the supported languages.

Table 7: Phi-3.5-mini Multi-Lingual MMLU Quality on selected set of languages

 

Phi-3.5-mini shows significant improvement over Phi-3-mini in multi-lingual support. Arabic, Dutch, Finnish, Polish, Thai, and Ukrainian received the largest boost from the new Phi version, with a 25-50% improvement in performance. Putting these results into wider perspective, Phi-3.5-mini shows top performance for any sub-8B model, in English as well as in several other languages. It is worth noting that the model uses a 32K vocabulary and is optimized for the higher-resource languages listed above; it is not recommended for low-resource languages without further fine-tuning.

 

Long Context

Phi-3.5-mini, with support for a 128K context length, excels in tasks like summarizing long documents or meeting transcripts, long document-based QA, and information retrieval. Phi-3.5-mini performs better than the Gemma-2 family, which supports only an 8K context length. Additionally, Phi-3.5-mini is highly competitive with much larger open-weight models such as Llama-3.1-8B-instruct, Mistral-7B-instruct-v0.3, and Mistral-Nemo-12B-instruct-2407. Table 8 lists various long-context benchmarks.

 

Ruler: a retrieval-based benchmark for long context understanding

RepoQA: a benchmark for long context code understanding

Table 8: Phi-3.5-mini Long Context Benchmark

 

With only 3.8B parameters, a 128K context length, and multi-lingual support, Phi-3.5-mini-instruct is the only model in this category. It is worth noting that we opted to support more languages while maintaining English performance on various tasks. Due to limited model capacity, the model's English knowledge may be better than its knowledge of other languages. For multi-lingual, knowledge-intensive tasks, we recommend using the model in a RAG setup.
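To make the RAG recommendation concrete, here is a minimal sketch of the retrieval and prompt-building step. It is not tied to any specific Azure or Phi API. The word-overlap scorer, the function names, and the prompt wording are hypothetical choices for illustration; a real multi-lingual setup would more likely use an embedding-based retriever.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; \\w+ is Unicode-aware in Python 3."""
    return re.findall(r"\w+", text.lower())


def overlap_score(query, passage):
    """Toy relevance score: number of word occurrences shared by query and passage."""
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    return sum(min(q[w], p[w]) for w in q)


def retrieve(query, passages, top_n=2):
    """Return the top_n passages ranked by the toy overlap score."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:top_n]


def build_prompt(query, passages, top_n=2):
    """Place the retrieved passages in the prompt so the model can ground its answer."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, top_n))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Small in-memory "knowledge base" built from statements in this post.
passages = [
    "Phi-3.5-mini supports a 128K context length.",
    "Phi-3.5-MoE comprises 16 experts.",
    "Phi-3.5-vision handles multi-frame image input.",
]
query = "What context length does Phi-3.5-mini support?"
prompt = build_prompt(query, passages, top_n=1)
print(prompt)
```

The resulting prompt would then be sent to the model. Because the relevant facts arrive in the context, the answer depends less on how much knowledge of a given language the model itself holds.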

 

 

Phi-3.5-vision with Multi-frame Input

Phi-3.5-vision introduces cutting-edge capabilities for multi-frame image understanding and reasoning, developed based on invaluable customer feedback. This innovation empowers detailed image comparison, multi-image summarization/storytelling, and video summarization, offering a wide array of applications across various scenarios.

 

For example, see the model output for summarization of multiple slides:

 

Phi-3.5-vision model output for slide summarization

 

Remarkably, Phi-3.5-vision has demonstrated significant performance improvements in numerous single-image benchmarks. For example, it boosted the MMMU performance from 40.4 to 43.0 and improved the MMBench performance from 80.5 to 81.9. Additionally, the document understanding benchmark TextVQA saw an increase from 70.9 to 72.0.

 

The following tables show detailed comparison results on two well-known multi-image/video benchmarks. It is worth noting that Phi-3.5-vision is not optimized for multi-lingual use cases, and we advise against using it in multi-lingual scenarios without further fine-tuning.

Table 9: Phi-3.5-vision Tasks Benchmark

 

Table 10: Phi-3.5-vision VideoMME Benchmark

 

 

Safety

The Phi-3 family of models was developed in accordance with the Microsoft Responsible AI Standard, which is a company-wide set of requirements based on the following six principles: accountability, transparency, fairness, reliability and safety, privacy and security, and inclusiveness. As with the previous Phi-3 models, a multi-faceted safety evaluation and safety post-training approach was adopted, with additional measures taken to account for the multi-lingual capabilities of this release. Our approach to safety training and evaluations, including testing across multiple languages and risk categories, is outlined in the Phi-3 Safety Post-Training Paper. While the Phi-3 models benefit from this approach, developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural and linguistic context.

 

Optimized Variants

ONNX Runtime provides optimized inference for the Phi family of models. You can optimize Phi-3.5-mini on various hardware targets today using this example. Stay tuned for updated ONNX variants of the latest Phi-3.5 models in the coming weeks.

 

More Predictable Outputs

We are bringing Guidance to the Phi-3.5-mini serverless endpoint offering in Azure AI Studio to make outputs more predictable by defining a structure tailored to an application. With Guidance, you can eliminate expensive retries and can, for example, constrain the model to select from pre-defined lists (e.g., medical codes), restrict outputs to direct quotes from provided context, or follow any regex. Guidance steers the model token by token in the inference stack, reducing cost and latency by 30-50%, which makes it a unique and valuable add-on to the Phi-3.5-mini serverless endpoint.
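The following minimal sketch illustrates the general idea of steering a model step by step so that its output must come from a pre-defined list. It is not the Guidance library's API or the serverless endpoint's interface. The scoring function is a hypothetical stand-in for a model, the option labels are made up, and the sketch works on characters rather than real tokens. It also assumes that no option is a prefix of another option.

```python
def allowed_next_chars(prefix, options):
    """Characters that keep `prefix` a prefix of at least one allowed option."""
    return {
        opt[len(prefix)]
        for opt in options
        if opt.startswith(prefix) and len(opt) > len(prefix)
    }


def constrained_decode(score_fn, options):
    """Greedy decoding, one character at a time, restricted to `options`.

    `score_fn(prefix, ch)` is a hypothetical stand-in for how strongly a
    model prefers to emit `ch` after `prefix`. Characters that would leave
    the allowed set are never considered, so no retry is needed.
    """
    prefix = ""
    while prefix not in options:
        candidates = sorted(allowed_next_chars(prefix, options))
        prefix += max(candidates, key=lambda ch: score_fn(prefix, ch))
    return prefix


def toy_score(prefix, ch):
    """Pretend the unconstrained model would like to write "B27"."""
    wanted = "B27"
    pos = len(prefix)
    return 1.0 if pos < len(wanted) and wanted[pos] == ch else 0.0


# Made-up labels standing in for a pre-defined list such as medical codes.
options = ["A10", "B20", "B25"]
result = constrained_decode(toy_score, options)
print(result)
```

Although the toy model prefers the invalid string "B27", the constrained loop can only produce one of the listed options, so the output is valid on the first attempt.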

 

 

Closing Remarks

Phi-3.5-mini has emerged as a unique offering in the LLM landscape, with only 3.8B parameters, a substantial 128K context length, and multi-lingual support. Phi-3.5-mini represents a milestone in creating efficient, multi-lingual models, striking a delicate balance between broad language support and focused performance in English. Given the small model capacity, users may observe that the density of English knowledge in the model surpasses that of other languages. For multi-lingual, knowledge-intensive tasks, it's advisable to use Phi-3.5-mini within a Retrieval-Augmented Generation (RAG) setup. This configuration can significantly enhance the model's performance across different languages by leveraging external data sources, thereby mitigating the language-specific limitations imposed by its compact architecture.

 

Phi-3.5-MoE, featuring 16 small experts, delivers high-quality performance and reduced latency, and supports a 128K context length and multiple languages with strong safety measures. It surpasses larger models and can be customized for various applications through fine-tuning, all while maintaining efficiency with 6.6B active parameters.

 

Phi-3.5-vision introduces advancements in multi-frame image understanding and reasoning while also improving single-image benchmark performance.

 

The Phi-3.5 model family provides cost-effective, high-capability options for the open-source community and Azure customers, pushing the boundaries of small language models and generative AI.

Updated Aug 23, 2024
Version 3.0