Fine-tuning a transformer model for question natural language inference (QNLI) with Azure SQL

Published Jul 26 2022 05:03 PM


Figure 1: An example of QNLI. The task of the model is to determine whether the sentence contains the information required to answer the question.



Question natural language inference (QNLI) is the task of determining whether a paragraph of text contains the information needed to answer a question. There are many real-world applications for this task. For example, imagine a prospective customer who wants to find product reviews that address a particular concern before purchasing the product.

We find that many organizations store this kind of data in a SQL database. When training a text classifier to perform QNLI, they export the data into a format suitable for training a deep learning model. This approach has at least two downsides.

  • The overhead of point-in-time exports makes frequent retraining complicated.
  • The size of the dataset can easily exhaust the memory on the compute host used for model training.
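
Both downsides can be avoided by reading rows from the database in small chunks while training. The following is a minimal sketch of that idea. It uses Python's built-in sqlite3 module as a stand-in for Azure SQL so that it runs anywhere; the table name qnli, its columns, and the helper names make_demo_db and stream_rows are assumptions made for illustration and are not taken from the repository.

```python
import sqlite3
from typing import Iterator, List, Tuple

Row = Tuple[str, str, int]


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `qnli` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE qnli ("
        "id INTEGER PRIMARY KEY, question TEXT, sentence TEXT, label INTEGER)"
    )
    rows = [(i, f"question {i}?", f"sentence {i}.", i % 2) for i in range(10)]
    conn.executemany("INSERT INTO qnli VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def stream_rows(conn: sqlite3.Connection, chunk_size: int = 4) -> Iterator[List[Row]]:
    """Yield rows in chunks so that only `chunk_size` rows are held at a time."""
    cursor = conn.execute("SELECT question, sentence, label FROM qnli ORDER BY id")
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:  # an empty list means the result set is exhausted
            break
        yield chunk


conn = make_demo_db()
chunk_sizes = [len(chunk) for chunk in stream_rows(conn, chunk_size=4)]
print(chunk_sizes)  # prints [4, 4, 2]
```

Because the query runs against the live table, every training run sees the current data without a separate export step.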

We therefore created a training script and a PyTorch dataset class definition that allow users to fine-tune a model while accessing data directly in a SQL database. We intentionally aligned our work with the Hugging Face fine-tuning tutorial to enable developers and data scientists to adapt the code to their own needs (we welcome contributions to the open-source repo).


Text Classification

Question natural language inference (QNLI) is a variant of text classification. We want to provide a brief overview of other text classification tasks and their applications. While our scripts and dataset definitions are geared towards QNLI, they can easily be modified to meet the requirements of these other text classification tasks.

Text classification is the task of assigning a label to a sentence, paragraph, or pairs of those. The most common use-case is probably classifying a paragraph. For example, one may want to perform sentiment analysis of a product review: How satisfied is a customer with the product?

Other use cases that involve classifying a single sentence or paragraph include spam filtering, NSFW (not safe for work) classification, topic labeling (e.g., sports vs. political news), detecting irony or sarcasm, language detection, and judging grammatical correctness. One may say that, whichever categories a stakeholder is interested in, a language model can be trained on the corresponding classification task.


Natural Language Inference (NLI)

There are variants of text classification that may not come to mind immediately. One common variant is called Natural Language Inference (NLI). Here, the task for the model is to classify the relationship between pairs of text. For example, one sentence may provide context (a.k.a. the premise) and the second sentence may provide a question (a.k.a. the hypothesis). Given the premise, the model classifies the hypothesis as true (entailment), false (contradiction), or neutral (there is no logical relationship between the premise and the hypothesis).

There are three popular variants of NLI tasks: Multi-Genre NLI, Question NLI, and Winograd NLI. The General Language Understanding Evaluation (GLUE) benchmark dataset is helpful for understanding these task variants. As we go through the three variants of NLI, we invite you to explore this dataset to get a better understanding of each one.


Multi-Genre NLI (MNLI)

This subset of the GLUE dataset contains data for training a model for general-purpose (multi-genre) inference. That is, does the premise entail the hypothesis?

Two examples:

| premise (string) | hypothesis (string) | label (class label) |
|---|---|---|
| Conceptually cream skimming has two basic dimensions - product and geography. | Product and geography are what make cream skimming work. | 1 (neutral) |
| How do you know? All this is their information again. | This information belongs to them. | 0 (entailment) |
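
To make the label column concrete, the two MNLI rows above can be represented as a small data structure. This is an illustrative sketch only: the class name NliExample is hypothetical, and the id for contradiction (2) does not appear in the rows above but follows the GLUE MNLI convention.

```python
from dataclasses import dataclass

# GLUE MNLI label ids. Ids 0 and 1 appear in the examples above;
# id 2 (contradiction) is added here from the GLUE MNLI convention.
MNLI_LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}


@dataclass(frozen=True)
class NliExample:
    """One premise/hypothesis pair with its integer class label."""

    premise: str
    hypothesis: str
    label: int

    @property
    def label_name(self) -> str:
        return MNLI_LABELS[self.label]


examples = [
    NliExample(
        premise="Conceptually cream skimming has two basic dimensions - product and geography.",
        hypothesis="Product and geography are what make cream skimming work.",
        label=1,
    ),
    NliExample(
        premise="How do you know? All this is their information again.",
        hypothesis="This information belongs to them.",
        label=0,
    ),
]

label_names = [example.label_name for example in examples]
print(label_names)  # prints ['neutral', 'entailment']
```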


You can explore the MNLI subset of the dataset here:


Question NLI (QNLI)

The task of QNLI is to determine whether one group of sentences contains the information required to answer the question posed in the other group of sentences. This is the task we will be fine-tuning our model on.

| question (string) | sentence (string) | label (class label) |
|---|---|---|
| What two things does Popper argue Tarski's theory involves in an evaluation of truth? | He bases this interpretation on the fact that examples such as the one described above refer to two things: assertions and the facts to which they refer. | 0 (entailment) |
| What famous palace is located in London? | London contains four World Heritage Sites: the Tower of London; Kew Gardens; the site comprising the Palace of Westminster, Westminster Abbey, and St Margaret's Church; and the historic settlement of Greenwich (in which the Royal Observatory, Greenwich marks the Prime Meridian, 0° longitude, and GMT). | 1 (not entailment) |
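
For a transformer classifier, the question and the sentence are typically combined into one input sequence. The toy sketch below illustrates the BERT-style layout ([CLS] question [SEP] sentence [SEP]), with whitespace splitting in place of a real subword tokenizer. In practice, the tokenizer that ships with the chosen model builds this input, and its special tokens may differ. The function name encode_pair is hypothetical, and the example sentence is shortened from the row above.

```python
from typing import List, Tuple


def encode_pair(question: str, sentence: str) -> Tuple[List[str], List[int]]:
    """Pack a question/sentence pair into one token sequence (toy version)."""
    q_tokens = question.split()
    s_tokens = sentence.split()
    tokens = ["[CLS]", *q_tokens, "[SEP]", *s_tokens, "[SEP]"]
    # Segment 0 covers [CLS], the question, and the first [SEP];
    # segment 1 covers the sentence and the final [SEP].
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(s_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = encode_pair(
    "What famous palace is located in London?",
    "London contains four World Heritage Sites.",
)
print(len(tokens), len(segment_ids))  # prints 16 16
```

The classifier then predicts a single label (entailment or not entailment) for the whole packed sequence.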


You can explore the QNLI subset here:


Winograd NLI (WNLI)

This task aims to be a test of machine intelligence. It uses the Winograd Schema Challenge (WSC) format proposed by Hector Levesque. The task is to resolve an ambiguous pronoun in a sentence. Levesque argued that this requires the use of knowledge or common-sense reasoning (see Wikipedia for more detailed coverage of this fascinating topic).

| sentence1 (string) | sentence2 (string) | label (class label) |
|---|---|---|
| John couldn't see the stage with Billy in front of him because he is so short. | John is so short. | 0 (entailment) |
| The police arrested all of the gang members. They were trying to stop the drug trade in the neighborhood. | The police were trying to stop the drug trade in the neighborhood. | 1 (not entailment) |


You can explore the WNLI subset here:


Model training with SQL

If you look at the code repository, you will find that the training script may look just as you would expect. The main difference can be found in the dataset definition. This was a deliberate design choice because it allows you to quickly adapt the training script to your needs. Put another way, when training your model, you don't have to keep in mind that the data are in fact loaded from a SQL database.

For the implementation of the dataset class, we chose to subclass IterableDataset. Quoting the PyTorch documentation: “An iterable-style dataset is an instance of a subclass of IterableDataset that implements the __iter__() protocol, and represents an iterable over data samples. This type of datasets is particularly suitable for cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data.” This contrasts with the standard Dataset class, which implements a method “__getitem__(index)” for selecting a sample from the dataset by its index.
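
A minimal sketch of what such an iterable-style dataset could look like is shown below. It is not the class from the repository: the name SqlQnliDataset, the qnli table and its columns are assumptions, and sqlite3 again stands in for Azure SQL so that the example is self-contained. The fallback base class only exists so the sketch also runs where PyTorch is not installed.

```python
import os
import sqlite3
import tempfile
from typing import Dict, Iterator, Union

try:
    from torch.utils.data import IterableDataset
except ImportError:
    class IterableDataset:  # minimal fallback so the sketch runs without PyTorch
        pass

Sample = Dict[str, Union[str, int]]


class SqlQnliDataset(IterableDataset):
    """Hypothetical iterable-style dataset that reads rows lazily from SQL."""

    def __init__(self, db_path: str, chunk_size: int = 256) -> None:
        super().__init__()
        self.db_path = db_path
        self.chunk_size = chunk_size

    def __iter__(self) -> Iterator[Sample]:
        # Open the connection lazily so each iteration gets its own connection.
        conn = sqlite3.connect(self.db_path)
        try:
            cursor = conn.execute(
                "SELECT question, sentence, label FROM qnli ORDER BY id"
            )
            while True:
                rows = cursor.fetchmany(self.chunk_size)
                if not rows:
                    break
                for question, sentence, label in rows:
                    yield {"question": question, "sentence": sentence, "label": label}
        finally:
            conn.close()


def build_demo_db(path: str, n_rows: int = 5) -> None:
    """Create a small demo table with made-up rows."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE qnli ("
        "id INTEGER PRIMARY KEY, question TEXT, sentence TEXT, label INTEGER)"
    )
    rows = [(i, f"question {i}?", f"sentence {i}.", i % 2) for i in range(n_rows)]
    conn.executemany("INSERT INTO qnli VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()


with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "demo.db")
    build_demo_db(db_path)
    samples = list(SqlQnliDataset(db_path, chunk_size=2))

print(len(samples))  # prints 5
```

There is no __getitem__ here: samples are produced strictly in the order the query returns them. When such a dataset is wrapped in a PyTorch DataLoader with more than one worker, each worker would need its own slice of the rows to avoid yielding duplicates.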

Perhaps a downside of this approach is that our class cannot easily be combined with PyTorch samplers. We are again quoting the PyTorch documentation: “For iterable-style datasets, data loading order is entirely controlled by the user-defined iterable. This allows easier implementations of chunk-reading and dynamic batch size (e.g., by yielding a batched sample at each time).”

However, as the documentation states, “data loading order is entirely controlled by the user-defined iterable”. We will demonstrate how to exert this control in a future post.
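
One common way to control the order of a stream without random access is a shuffle buffer. The sketch below is only one possible approach and not necessarily the one the authors have in mind; the function name shuffle_buffer and its parameters are hypothetical. It keeps a fixed number of samples in memory and, for each new sample, emits a randomly chosen buffered one.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def shuffle_buffer(
    stream: Iterable[T], buffer_size: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most `buffer_size` items."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer before emitting anything
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]      # emit a random buffered item ...
        buffer[index] = item     # ... and put the new item in its place
    rng.shuffle(buffer)          # flush what is left in random order
    yield from buffer


shuffled = list(shuffle_buffer(range(10), buffer_size=4, seed=0))
print(sorted(shuffled) == list(range(10)))  # prints True
```

A larger buffer gives an order closer to a full shuffle at the cost of more memory; a buffer of size 1 leaves the original order unchanged.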



We hope that you will find this blog post and the code repository useful. We strove to enable you to fine-tune a text classification model by following a few simple steps. We welcome your feedback on which adaptations to the dataset definition you would like to see covered in the future. We also welcome contributions to the open-source repository.



This blog post is heavily influenced by the Hugging Face introduction to text classification:

Last update: Jul 26 2022 04:58 PM