The NLP Recipes Team
Natural Language Processing (NLP) systems ease interactions between computers and humans through natural language. They are used in a variety of scenarios and industries, from personal assistants like Cortana, to language translation applications, to call centers responding to specific users’ requests. In recent years, NLP has seen significant growth in both quality and usability. Through new deep learning methods and state-of-the-art (SOTA) Deep Neural Network (DNN) algorithms, businesses are able to adopt Artificial Intelligence solutions to meet their customers’ needs. Unfortunately, finding the right algorithm for a given scenario and language remains a challenge. To help researchers and data scientists find the best fit for the problem at hand, Microsoft is open-sourcing the Microsoft NLP Recipes repository, which contains best practices for building and evaluating NLP systems across multiple tasks and languages.
Specifically, our goals are to provide information for anyone who wants to:
- Learn about the newest algorithms and topics in NLP
- Develop and deploy NLP systems efficiently, with shorter development times
- Bring SOTA algorithms to production with Azure Machine Learning
Easing the Process for Data Scientists
Over the years, the NLP community has moved toward neural network architectures for language modeling and away from more traditional approaches such as conditional random fields (CRFs) and Hidden Markov Models (HMMs). Since 2017, pre-trained language models, most of them built on the “Transformer” architecture (BERT, GPT-2, XLNet, and RoBERTa, alongside the LSTM-based ELMo), have become the dominant choice within the NLP community. These architectures lead multi-task benchmarks such as GLUE as well as single-task benchmarks (e.g. text classification and named entity recognition) because they allow a pre-trained language model to be adapted to different downstream tasks. In addition, these pre-trained models are available with support for 100+ languages out of the box. The following table lists the models currently implemented in the repository, across different tasks and languages.
Category | Applications | Methods | Languages |
---|---|---|---|
Text Classification | Topic Classification | BERT, XLNet, RoBERTa, DistilBERT | en, hi, ar |
Named Entity Recognition | Wikipedia NER | BERT | en |
Entailment | MultiNLI Natural Language Inference | BERT, XLNet | en |
Question Answering | SQuAD | BiDAF, BERT, XLNet, DistilBERT | en |
Sentence Similarity | STS Benchmark | BERT, GenSen | en |
Embeddings | Custom Embeddings Training | Word2Vec, fastText, GloVe | en |
Annotation | Text Annotation | Doccano | en |
Model Explainability | DNN Layer Explanation | DUUDNM (Guan et al.) | en |
To address these issues, the examples and utilities in the Microsoft NLP Recipes repository are built with the following goals in mind:
- Walk through NLP scenarios: These are common scenarios that are popular within the research community. The repository provides walkthrough examples for each scenario that show how one can get started with custom modeling and sample datasets.
- Ease use of SOTA algorithms: The utility functions include easy-to-use wrappers of SOTA algorithms that dominate popular benchmarks like GLUE and SQuAD, and provide an easy way to switch between them. With contributions from the community, we expect the latest algorithms to be included as they make their way up the leaderboards. This gives users easy access to the latest algorithms and reduces the friction when a new model is added.
- Global Language Support: Open source technologies like BERT, XLNet, and other transformer-based models support 100+ languages and allow implementation of all the NLP scenarios across these languages. Datasets are hard to find, though, so we provide example notebooks and sample datasets in non-English languages such as Hindi, Arabic, and Chinese that show how to implement the NLP scenarios.
- Ease the use of common datasets: The repository has documentation and provides utility functions for common academic datasets such as the Microsoft Research Paraphrase Corpus and the Multi-Genre NLI (MultiNLI) Corpus. A list of the datasets is available in the repository.
- Azure Machine Learning service support: The GitHub repository provides best practices for how to train, test, optimize, and deploy models on Azure using the Azure Machine Learning (Azure ML) service. Azure ML can be used extensively across the notebooks for tasks relating to AI model development, such as:
  - Accessing datastores
  - Scaling up and out on Azure Machine Learning Compute
  - Automated Machine Learning and hyperparameter tuning for model selection
  - Tracking experiments and monitoring metrics to enhance the model creation process
  - Distributed training of models on clusters of many nodes and GPUs
  - Deploying trained models as web services to Azure Container Instances and Azure Kubernetes Service
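As a rough illustration of the "easy way to switch between them" goal above, the sketch below shows one way a shared wrapper interface can make the model a one-line choice. Everything in it is hypothetical: `TextClassifier`, `MODEL_REGISTRY`, `build_classifier`, and the two toy classifiers are names invented here and are not the repository's API. The toy classifiers stand in for wrappers around pre-trained models such as BERT or XLNet so that the example runs without any downloads.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Protocol


class TextClassifier(Protocol):
    """Interface that every (hypothetical) model wrapper is assumed to share."""

    def fit(self, texts: List[str], labels: List[str]) -> None: ...
    def predict(self, texts: List[str]) -> List[str]: ...


class MajorityClassifier:
    """Toy stand-in: always predicts the most frequent training label."""

    def fit(self, texts: List[str], labels: List[str]) -> None:
        self._label = Counter(labels).most_common(1)[0][0]

    def predict(self, texts: List[str]) -> List[str]:
        return [self._label for _ in texts]


class KeywordClassifier:
    """Toy stand-in: predicts the label whose training words overlap most."""

    def fit(self, texts: List[str], labels: List[str]) -> None:
        self._words: Dict[str, Counter] = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self._words[label].update(text.lower().split())

    def predict(self, texts: List[str]) -> List[str]:
        def best_label(text: str) -> str:
            tokens = text.lower().split()
            # Score each label by how often its training words occur in the text.
            return max(
                self._words,
                key=lambda label: sum(self._words[label][t] for t in tokens),
            )

        return [best_label(text) for text in texts]


# Assumption: a real registry would map names such as "bert" or "xlnet"
# to wrappers around pre-trained models; toy models keep the sketch offline.
MODEL_REGISTRY: Dict[str, Callable[[], TextClassifier]] = {
    "majority": MajorityClassifier,
    "keyword": KeywordClassifier,
}


def build_classifier(name: str) -> TextClassifier:
    """Look up a wrapper by name so callers never import a model class directly."""
    if name not in MODEL_REGISTRY:
        raise ValueError(f"unknown model {name!r}; choose from {sorted(MODEL_REGISTRY)}")
    return MODEL_REGISTRY[name]()


TRAIN_TEXTS = [
    "the team won the match",
    "the striker scored a goal",
    "stocks fell sharply today",
    "the bank raised interest rates",
    "markets rallied after the report",
]
TRAIN_LABELS = ["sports", "sports", "business", "business", "business"]


def run(name: str) -> List[str]:
    """Train the named model and classify one held-out sentence."""
    model = build_classifier(name)
    model.fit(TRAIN_TEXTS, TRAIN_LABELS)
    return model.predict(["the goal decided the match"])


if __name__ == "__main__":
    for model_name in MODEL_REGISTRY:
        print(model_name, run(model_name))
```

Because the training and prediction code only depends on the shared interface, swapping one model for another changes a single string, and adding a new model means adding one registry entry.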
The NLP repository is meant to be accessible to anyone interested in building NLP solutions easily across a wide range of tasks and languages. Contributions from the community are always welcome and help keep the repository up to date with the latest state-of-the-art methods.
Learn More
Utilize the GitHub repository for your own NLP systems
Try out an example of Text Classification using Transformer models
Try out an example of Question Answering using BiDAF and Azure Machine Learning
Learn more about the Azure Machine Learning service
Get started with a free trial of Azure Machine Learning service