Azure AI Foundry Blog

Build custom NLP solutions with AzureML AutoML NLP

Former Employee

Oct 12, 2022

Introduction

Since the publication of the BERT paper [1], pretrained deep neural networks based on the Transformer architecture [2] have become the state of the art for Natural Language Processing (NLP) tasks. These models have helped Machine Learning professionals in research, academia, and industry alike. Many of the biggest technology companies have devoted enormous resources to further improving these models in performance and scale, while many others have adapted them to their own use cases.

AzureML (Azure Machine Learning) AutoML (Automated Machine Learning) was among the earliest adopters of foundational deep neural network models for NLP tasks like classification, starting at the beginning of 2020 [3]. We have been building on it ever since.

We’re now taking a step further and are excited to announce the General Availability of AutoML NLP, an end-to-end deep learning solution for text data within AzureML.

AutoML NLP solves NLP problems like text classification and named entity recognition (NER) and provides the following capabilities:

Large pool of Pretrained Text Deep Neural Network (DNN) models (currently in preview)
Ability to tune hyperparameters (currently in preview) on these models to help achieve high scores
Data awareness that taps into input dataset characteristics and subtleties
Native support for 104 languages
Optimizations for near-linear scale on very large data sizes and clusters
Seamless ML Ops and production deployment on AzureML endpoints

Scenarios Supported

AutoML NLP currently offers three scenarios: Multiclass Classification, Multilabel Classification, and NER.

Multiclass Classification (including Binary Classification)

This task helps classify each datapoint/sample into exactly one class from a total of two or more classes.

Multilabel Classification

This task helps classify each datapoint/sample into any number of classes, including all classes or no classes, from a total of two or more classes.

Named Entity Recognition (NER)

This task classifies each token into exactly one entity class. Tokens that belong to the same multi-token chunk are assigned the same base entity class, which is expressed through the special label format discussed next.

We expect the NER input data to follow the CoNLL format, with the dataset provided as text files. Within these files, every input text example is split into multiple lines, where each line contains one word followed by the label/category for that word, and every input example is followed by an empty line.

The labels in the NER data should adhere to the IOB2 (Inside-Outside-Beginning) tagging format [4]. In this format, the prefix on each label indicates the token's position within a chunk. Every chunk begins with a token prefixed with “B-”, and all subsequent tokens in the same chunk are prefixed with “I-”. For example, in the name of an individual, the first name is tagged “B-” and the last name “I-”, both with the same entity class. Tokens that do not belong to any entity class are labeled “O”.
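As a rough illustration of the format described above, the following sketch parses CoNLL-style text and checks the IOB2 rule that an “I-” label must continue a chunk of the same class. It is not the validation that AutoML NLP performs. The function names, the whitespace separator between word and label, and the PER/LOC/ORG classes in the sample are assumptions made for the example.

```python
from typing import List, Tuple

Example = List[Tuple[str, str]]  # one input example as (token, label) pairs


def parse_conll(text: str) -> List[Example]:
    """Split CoNLL-style text into examples; a blank line ends an example."""
    examples: List[Example] = []
    current: Example = []
    for line in text.splitlines():
        if not line.strip():
            if current:
                examples.append(current)
                current = []
            continue
        # Assumption: the label is the last whitespace-separated field.
        token, label = line.rsplit(maxsplit=1)
        current.append((token, label))
    if current:
        examples.append(current)
    return examples


def iob2_errors(example: Example) -> List[str]:
    """Return one message per IOB2 violation found in a single example."""
    errors: List[str] = []
    previous = "O"
    for token, label in example:
        if label != "O" and label[:2] not in ("B-", "I-"):
            errors.append(f"{token}: unknown label {label}")
        elif label.startswith("I-"):
            entity_class = label[2:]
            # "I-X" is only valid directly after "B-X" or "I-X".
            if previous not in ("B-" + entity_class, "I-" + entity_class):
                errors.append(f"{token}: {label} does not continue a chunk")
        previous = label
    return errors


# Hypothetical sample data: two examples separated by an empty line.
SAMPLE = """\
Hello O
John B-PER
Smith I-PER
from O
Seattle B-LOC

Contoso B-ORG
ships O
"""

examples = parse_conll(SAMPLE)
for example in examples:
    print(example, iob2_errors(example))
```

A check like this can be run locally before submitting an experiment to catch labels such as an “I-” tag that starts a chunk.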

For more details and examples, see “Set up AutoML for NLP” in the Azure Machine Learning documentation on Microsoft Docs.

Leveraging Pretrained Language Models

Large language models, such as BERT, RoBERTa, XLNet, Turing-NLG, and GPT-3, are pretrained on large training corpora. When finetuned for a specific task, they draw on the knowledge gained during pretraining, and therefore require only a small amount of labeled data and a few epochs to achieve good results. Collecting and labeling data is often challenging and expensive, so these models can be an ideal option for users who have limited training data.

The larger the model, the higher the number of trainable parameters, and the greater its ability to store knowledge from the pretraining corpus. However, a larger model also increases GPU memory requirements, training time, and inference latency.

The diagram below captures the finetuning results for several pretrained models, comparing accuracy and normalized training time with respect to the size of the training dataset. We use the popular multiclass dataset AG News [5]. The training time of each model is normalized by dividing it by the training time of the bert-base-cased model.

Sweeping over many Models and Hyperparameter combinations

We empower our customers to select from a wide array of powerful pretrained text DNN models for finetuning. We currently support 15 pretrained models including:

autoencoding models (like BERT) and autoregressive models (like XLNet)
multilingual models like XLM-RoBERTa and BERT-multilingual
large models like RoBERTa-large and BERT-large for achieving higher scores
base models (like BERT-base and RoBERTa-base) and distilled models (like distil-BERT and distil-RoBERTa) for faster training and lower GPU memory consumption

For all models, AutoML uses intelligent default hyperparameters that produce good results for almost all use cases. For users who want more control, we provide the ability to override these default hyperparameters and their corresponding ranges, so that users can apply their domain knowledge to get better finetuning results. Hyperparameters such as batch-size, gradient-accumulation-steps and epochs are commonly used and can affect training time and GPU memory usage, in addition to overall model performance. Other hyperparameters, such as learning-rate, weight-decay, warmup-ratio and lr-scheduler-type, are also available for tuning, but the training results are quite sensitive to them. We have found the AutoML defaults to work best for most scenarios, so we recommend using them and customizing only if they do not produce satisfactory results.
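To see how the commonly tuned hyperparameters interact, the sketch below derives the effective batch size and the number of optimizer updates from batch-size, gradient-accumulation-steps, epochs and warmup-ratio. It is plain arithmetic rather than AutoML code. The function name, the treatment of batch size as a per-GPU value, and the rounding choices are assumptions for illustration.

```python
import math


def training_schedule(num_examples, batch_size, gradient_accumulation_steps,
                      epochs, warmup_ratio, num_gpus=1):
    """Derive update counts from commonly tuned hyperparameters (illustrative)."""
    # Examples consumed per optimizer update across all GPUs.
    # Assumption: batch_size is the per-GPU batch size.
    effective_batch_size = batch_size * gradient_accumulation_steps * num_gpus
    # Assumption: the final, partial batch of an epoch still triggers an update.
    updates_per_epoch = math.ceil(num_examples / effective_batch_size)
    total_updates = updates_per_epoch * epochs
    # Assumption: warmup length is the ratio of total updates, rounded.
    warmup_updates = round(total_updates * warmup_ratio)
    return {
        "effective_batch_size": effective_batch_size,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": total_updates,
        "warmup_updates": warmup_updates,
    }


# Hypothetical settings: 10,000 training examples on 4 GPUs.
plan = training_schedule(
    num_examples=10_000,
    batch_size=32,
    gradient_accumulation_steps=2,
    epochs=3,
    warmup_ratio=0.1,
    num_gpus=4,
)
print(plan)
```

With these made-up settings, the effective batch size is 256, which gives 120 optimizer updates in total, 12 of them in warmup. Raising gradient-accumulation-steps increases the effective batch size without increasing per-GPU memory use, at the cost of fewer updates per epoch.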

The model sweeping feature also offers early termination, which automatically ends the finetuning runs for poorly performing models. Several early termination policies are supported, along with customizable evaluation intervals [6]. The overall goal is to improve computational efficiency: achieving the best results while using compute resources judiciously.
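One common style of early termination is a bandit-style rule with a slack factor: a run is stopped when its metric falls too far below the best run seen so far. The sketch below shows that idea for a metric that is maximized. It is a simplified illustration and not the AzureML implementation. The function names, the run names and the interim scores are hypothetical, and the interval check is an assumed reading of "evaluation interval".

```python
def is_evaluation_point(interval, evaluation_interval):
    """Assumption: the policy is applied every `evaluation_interval` intervals."""
    return interval % evaluation_interval == 0


def bandit_cutoff(best_metric, slack_factor):
    """Lowest acceptable value of a maximized metric under a slack factor."""
    return best_metric / (1 + slack_factor)


def runs_to_terminate(metrics_by_run, slack_factor):
    """Return the names of runs whose metric is below the cutoff, sorted."""
    best = max(metrics_by_run.values())
    cutoff = bandit_cutoff(best, slack_factor)
    return sorted(name for name, metric in metrics_by_run.items()
                  if metric < cutoff)


# Hypothetical interim accuracies reported by three sweep runs.
interim_accuracy = {"run_a": 0.90, "run_b": 0.92, "run_c": 0.80}

stopped = []
if is_evaluation_point(interval=4, evaluation_interval=2):
    stopped = runs_to_terminate(interim_accuracy, slack_factor=0.1)
print(stopped)
```

With a slack factor of 0.1 the cutoff is about 0.836, so only run_c would be stopped here. A larger slack factor is more lenient and lets more runs continue.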

Model Sweeping and Hyperparameter tuning capabilities are currently in Public Preview and will be made Generally Available soon.

In the next few subsections, we’ll describe several features that we’ve enabled in AutoML NLP for improving performance and efficiency.

Custom Features

One size does not fit all, even though the pretrained text DNN models are capable of solving a wide variety of tasks for many kinds of datasets. The dataset’s characteristics provide important signals that can help improve finetuning results. For example, adjusting the model’s sequence length can offer a significant boost in scores for datasets with long texts, while reducing memory and time requirements for datasets with short texts. Similarly, when datasets have more than one text column, AutoML uses the text data from all columns.
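To make the idea of data awareness concrete, here is a hypothetical heuristic that combines several text columns into one input and picks a sequence length from the distribution of text lengths. It is only a sketch of the general technique. AutoML NLP's actual logic is not described in this post, and the function names, the 95th percentile, the limit of 512, and the use of whitespace splitting as a stand-in for a model tokenizer are all assumptions.

```python
import math


def combine_text_columns(row, text_columns, separator=". "):
    """Join the non-empty text columns of one row into a single string."""
    return separator.join(str(row[c]) for c in text_columns if row.get(c))


def suggest_sequence_length(texts, percentile=0.95, model_limit=512, multiple=32):
    """Pick a length covering `percentile` of texts, capped at `model_limit`."""
    if not texts:
        raise ValueError("texts must not be empty")
    # Whitespace word counts approximate token counts for this sketch.
    lengths = sorted(len(text.split()) for text in texts)
    index = min(len(lengths) - 1, math.ceil(percentile * len(lengths)) - 1)
    # Round up to a multiple, then clamp to the range [multiple, model_limit].
    rounded = math.ceil(lengths[index] / multiple) * multiple
    return max(multiple, min(model_limit, rounded))


# Hypothetical rows with two text columns.
rows = [
    {"title": "Hi", "body": "short text"},
    {"title": "Update", "body": "a slightly longer body of text"},
]
texts = [combine_text_columns(row, ["title", "body"]) for row in rows]
print(texts[0])
print(suggest_sequence_length(texts))
```

Short texts end up with a small sequence length, which saves memory and time, while very long texts are capped at the assumed model limit.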

Multilingual Support

AutoML NLP natively supports 104 languages. Customers are required to provide the dataset language parameter when they submit an experiment, and the model best suited to that language is used. Additionally, with our model sweeping functionality, users can select powerful multilingual models like bert-base-multilingual, xlm-roberta-base and xlm-roberta-large to achieve near state of the art (SOTA) performance on datasets in a variety of languages.

Distributed Training

AutoML is tuned to work best with GPU SKUs that have high-efficiency InfiniBand interconnects, the latest libraries for data parallelism, and innovations from Microsoft Research. Together, these enable robust training on multi-GPU or multi-node AzureML compute clusters with near-linear scaling. This functionality is available for all the NLP tasks that we support.

Here is an example of the speedups achieved through distributed training with NC24rs_v3 virtual machines, each of which has 4 V100 GPUs. We measure scaling in terms of strong scaling [7], defined in high-performance computing as the speedup in training time obtained for the same problem size by varying the number of processors.
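The strong scaling definition above can be written as a short calculation: speedup is the baseline training time divided by the training time on more GPUs, and efficiency is that speedup divided by the factor by which the GPU count grew. In the sketch below, the function name is an assumption and the timings are made-up placeholders, not measurements from our experiments.

```python
def strong_scaling(training_time_by_gpus):
    """Map GPU count to (speedup, efficiency) relative to the smallest count."""
    base_gpus = min(training_time_by_gpus)
    base_time = training_time_by_gpus[base_gpus]
    results = {}
    for gpus, time_taken in sorted(training_time_by_gpus.items()):
        speedup = base_time / time_taken
        # Perfect (linear) scaling gives an efficiency of 1.0.
        efficiency = speedup / (gpus / base_gpus)
        results[gpus] = (speedup, efficiency)
    return results


# Hypothetical training times in minutes for the same dataset and model.
timings = {4: 100.0, 8: 52.0, 16: 27.0}

scaling = strong_scaling(timings)
for gpus, (speedup, efficiency) in scaling.items():
    print(f"{gpus} GPUs: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Efficiency values close to 100% correspond to the near-linear scaling described in this section.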

Now that we’ve introduced our features and capabilities, we’ll describe the evaluation and deployment phases before sharing answers to our anticipated frequently asked questions.

Evaluation and Metrics

It’s important to evaluate the performance of the finetuned model on unseen data. As part of the finetuning/training run, we ask users to provide a hold-out validation dataset, which is used to evaluate/score the trained model. A wide variety of metrics are provided for each of the three scenarios, and some metrics, such as accuracy, precision, recall and F1, are common to all scenarios.

For our multilabel text classification scenario, we also offer a thresholding feature. We provide a metrics.csv file as part of the finetuning run to help users understand the impact of varying the threshold (applied to predicted probabilities) on metrics like precision and recall. A smaller threshold value allows more labels per sample on average and hence increases the chance of false positives (useful when high recall is desirable). A larger threshold value allows fewer labels and hence increases the chance of false negatives (useful when high precision is desirable). The user can apply this capability when running inference with the finetuned model on the test dataset or when testing the deployed model.
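The precision and recall trade-off described above can be reproduced on a toy example. The sketch below computes micro-averaged precision and recall at two thresholds. It does not reproduce the contents or layout of metrics.csv. The function name, the labels, the probabilities and the choice of "greater than or equal to" for the comparison are assumptions for illustration.

```python
def precision_recall_at(threshold, probabilities, true_labels):
    """Micro-averaged precision and recall for multilabel predictions."""
    tp = fp = fn = 0
    for probs, truth in zip(probabilities, true_labels):
        for label, probability in probs.items():
            # Assumption: a label is predicted when probability >= threshold.
            predicted = probability >= threshold
            actual = label in truth
            if predicted and actual:
                tp += 1
            elif predicted and not actual:
                fp += 1
            elif actual and not predicted:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical predicted probabilities and true label sets for three samples.
probabilities = [
    {"sports": 0.9, "politics": 0.4, "tech": 0.1},
    {"sports": 0.2, "politics": 0.7, "tech": 0.6},
    {"sports": 0.3, "politics": 0.2, "tech": 0.8},
]
true_labels = [{"sports"}, {"politics", "tech"}, {"tech", "sports"}]

low = precision_recall_at(0.25, probabilities, true_labels)
high = precision_recall_at(0.75, probabilities, true_labels)
print("threshold 0.25:", low)
print("threshold 0.75:", high)
```

On this toy data, the low threshold recovers every true label but admits one false positive, while the high threshold makes no false positive predictions but misses three of the five true labels.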

Deployment

AutoML NLP is natively integrated within AzureML, enabling users to use all AzureML workflows with AutoML NLP. Once you train a model, you can register it and deploy it to REST endpoints like any other AzureML model. You can use either the UI or the SDK to deploy the model.

Architecture

The following diagram explains the high-level architecture of AutoML NLP.

FAQ

I am a Machine Learning (ML) professional, but do not have time/expertise to conduct research in NLP. How do I even know which pretrained model to use?

AutoML NLP finds the model that works best for your task and training data. However, if you would prefer to specify a list of models from what we currently support, you can leverage the model sweeping feature of AutoML NLP.

Data scientists spend a lot of time cleaning and processing data. Do I need to perform any preprocessing on my text data?

AutoML NLP expects the data to comply with the format specified for a particular task. We explain the format, with examples, in our documentation. In addition to data validation checks, AutoML NLP also checks for data pitfalls, and either warns the user or fails the run. Sometimes the data may comply with the specified format but still contain hidden problems that do not surface at runtime and instead result in misleading scores. We have checks in place to avoid many such issues.

What if my data is in a non-English language, or if it uses multiple languages?

AutoML NLP natively supports a variety of languages. The user only needs to provide the three-letter ISO code corresponding to the dataset’s language, and we will do the rest. Additionally, users can specify multilingual capable models and leverage model sweeping.

How do I evaluate the finetuned model? How do I know which metrics to use for evaluation?

AutoML NLP is integrated with AzureML’s rich set of metrics, available to the user via the UI. The ultimate choice of metric rests with the user and their business use case, although we can share some general guidelines. While accuracy is a well understood metric for classification tasks, it is of little value for NER. In many NER datasets, most tokens do not correspond to any entity class, yet these tokens are still counted when computing accuracy, which makes accuracy an unreliable metric for NER. We recommend using F1 score, precision and recall for NER, because in our implementation they are computed at the entity level.
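The gap between token accuracy and entity-level scores is easy to demonstrate on a small example. The sketch below extracts chunks from IOB2 label sequences and counts a predicted entity as correct only if its span and class both match exactly. This exact-match rule, the function names and the sample labels are assumptions for illustration, and the code is not the AutoML NLP implementation.

```python
def extract_entities(labels):
    """Return the set of (start, end, entity_class) chunks in an IOB2 sequence."""
    entities = set()
    start, entity_class = None, None
    # A trailing "O" sentinel closes a chunk that runs to the end.
    for i, label in enumerate(list(labels) + ["O"]):
        continues = label.startswith("I-") and entity_class == label[2:]
        if start is not None and not continues:
            entities.add((start, i, entity_class))
            start, entity_class = None, None
        if label.startswith("B-"):
            start, entity_class = i, label[2:]
    return entities


def entity_scores(gold, predicted):
    """Exact-match entity-level precision, recall and F1."""
    gold_entities = extract_entities(gold)
    predicted_entities = extract_entities(predicted)
    tp = len(gold_entities & predicted_entities)
    precision = tp / len(predicted_entities) if predicted_entities else 0.0
    recall = tp / len(gold_entities) if gold_entities else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical labels: the model truncates one entity and misses the other.
gold = ["O", "O", "B-PER", "I-PER", "O", "O", "O", "B-LOC", "O", "O"]
pred = ["O", "O", "B-PER", "O", "O", "O", "O", "O", "O", "O"]

accuracy = token_accuracy(gold, pred)
scores = entity_scores(gold, pred)
print("token accuracy:", accuracy)
print("entity precision, recall, F1:", scores)
```

Here the token accuracy is 0.8 even though neither entity was recognized correctly, so the entity-level precision, recall and F1 are all 0.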

For classification tasks, when datasets are imbalanced, metrics like AUC (area under the curve) are more informative, since accuracy can be misleading when one class dominates.