In a previous blog post, we described how to fine-tune a pretrained Hugging Face transformer model for text classification at scale. We used a PyTorch dataset class to pull data directly from a SQL database. Compared with loading the data into the memory of the host machine at the beginning of training, the advantages of this approach are clear. First, SQL databases are almost limitless in size. That is, we don’t have to worry about whether the dataset will fit into the memory of the compute host used and optimized for ML model training. Second, because rows of data are pulled only when they are needed, model training and evaluation can commence almost instantly. In fact, we find that we can sometimes successfully train a model using only a small fraction of a large dataset, further highlighting the inefficiency of copying all data to the training node at the beginning of training.
In this blog post, we go one step further: Instead of fine-tuning the classification layer of a very large language model, we make a large upfront investment: We store the activations of the last hidden layer (i.e., the sentence embeddings) in a SQL database. For each use-case or ML experiment, instead of using sentences as input, we then use these stored activations as input to rapidly train a tiny classification model.
Consider the following use-cases (list not exhaustive):
The most common situation where recycling stored embeddings is very practical is when a model needs to be retrained, either simply because one wants to try different hyperparameters (e.g., learning rate or weight decay) or because a decision was made to add additional categories to the target variable.
Imagine you are part of an organization that has massive amounts of free-text data (e.g., product reviews, support tickets, customer complaints, or doctor’s notes). You and your colleagues would like to include it in different analytic workflows. For example, you may be interested in product reviews that focus on quality aspects of products, so you want to train a binary classifier that identifies those rows in the data which pertain to product quality. A colleague of yours may be interested in something completely different, perhaps categorizing product reviews based on product category (e.g., clothing, sporting goods, electronics).
In this blog post, we will describe how you can achieve these goals in a tiny fraction of the time and cost you would incur by following standard operating procedures for fine-tuning a very large language model.
In this step, we process all sentences and store the activation in the last hidden layer of our very large language model in a SQL table. This activation is typically referred to as a sentence embedding (even though it is not limited to sentences but can also be applied to paragraphs of text). Quoting Wikipedia: “Sentence embedding is the collective name for a set of techniques in natural language processing (NLP) where sentences are mapped to vectors of real numbers”. The elements in these vectors describe the location of a sentence in high-dimensional semantic space. The size of embedding vectors varies widely between different models. For the popular BERT base model, the vector has 768 elements. For GPT-3, the embedding size is much larger: 12,288.
To retrieve these embeddings from our model, without having to leave the comfort of the Hugging Face library, we apply a simple trick: We remove the classification layer from the BERT model and replace it with the Identity operator. We can then store the model output as a JSON string in a SQL table.
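The trick boils down to two small helpers, sketched below. The function names are ours, and the stand-in module used here is just for illustration; the Hugging Face usage is shown in the trailing comment (those identifiers are from the `transformers` library itself):

```python
import json

import torch
from torch import nn


def strip_classifier(model: nn.Module) -> nn.Module:
    """Replace the classification head with Identity, so the forward
    pass returns the pooled last-hidden-layer activation, not logits."""
    model.classifier = nn.Identity()
    return model


def embedding_to_json(embedding: torch.Tensor) -> str:
    """Serialize a single embedding tensor to a JSON string for SQL storage."""
    return json.dumps(embedding.squeeze(0).tolist())


# With Hugging Face transformers, the same two helpers apply directly:
#
#   from transformers import AutoModelForSequenceClassification, AutoTokenizer
#   model = strip_classifier(
#       AutoModelForSequenceClassification.from_pretrained("bert-base-uncased"))
#   tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
#   inputs = tokenizer("a product review", return_tensors="pt")
#   with torch.no_grad():
#       json_row = embedding_to_json(model(**inputs).logits)
```

The JSON string can then be written to the embeddings column of the SQL table alongside the row's label, using whatever database driver the rest of the pipeline already uses.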
During training, we then simply load these stored embeddings and feed them to a tiny classification model. This classification model is absolutely tiny, with an input layer, one hidden layer, and the output layer. The size of the input layer corresponds to the size of the sentence embeddings. The size of the output layer corresponds to the number of categories in our dataset. The size of the hidden layer is a free parameter. We recommend using either the same size as the output layer, or half that size. The model (aka. multi-layer perceptron) can be written up in 16 lines of code (or fewer).
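A minimal sketch of such a model in PyTorch is below. The class name and the ReLU activation are our assumptions; the layer sizes follow the text (input = embedding size, output = number of categories, hidden = half the output size by default):

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Multi-layer perceptron: embedding in, category scores out."""

    def __init__(self, embedding_size, n_categories, hidden_size=None):
        super().__init__()
        # The hidden size is a free parameter; default to half the output size.
        hidden_size = hidden_size or max(1, n_categories // 2)
        self.net = nn.Sequential(
            nn.Linear(embedding_size, hidden_size),
            nn.ReLU(),  # activation choice is an assumption, not from the post
            nn.Linear(hidden_size, n_categories),
        )

    def forward(self, x):
        return self.net(x)
```

For example, `TinyClassifier(768, 14)` covers the dbpedia_14 setup discussed later in the post: 768-dimensional BERT-base embeddings in, 14 category scores out, with a hidden layer of 7 units.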
Compare the size of this model (5,495 parameters) to the number of parameters in the language models themselves. The relatively small BERT base model has 110M parameters. The following figure (borrowed from the blog post “Large Language Models: A New Moore's Law?”) shows the size of other language models.
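As a sanity check, the 5,495 figure falls out of the layer sizes directly (768-dimensional embeddings, a hidden layer of 7 units, 14 output categories, counting weights and biases):

```python
# Weights + biases for each fully connected layer of the tiny model.
input_to_hidden = 768 * 7 + 7     # 5,383 parameters
hidden_to_output = 7 * 14 + 14    # 112 parameters
total = input_to_hidden + hidden_to_output
print(total)  # 5495
```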
That is, by re-using stored embeddings during training, instead of processing each sentence with the entire language model, we save millions or billions of floating-point operations.
How much money do you save during model training? Recycling sentence embeddings has the overwhelming advantage that model training can be done without a GPU. To keep it simple, let’s say model training takes an hour (which is not unreasonable). The below table shows how much model training would cost in Azure. The first row shows the cost of training our tiny model on a VM without a GPU. Compared with training on the CPU of a D2s v3, fine-tuning with an A100 GPU on an NC24 would cost approximately 36 times as much!
Cost per hour (pay as you go)
The below figure shows the test accuracy of our tiny model on the dbpedia_14 dataset. We didn’t do much hyperparameter tuning: we used a learning rate of 1e-3, a weight decay of 1e-5, the Adam optimizer with amsgrad enabled, and a linear learning rate schedule over 1,000 steps. The size of the hidden layer was half the number of categories.
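Those settings can be sketched in PyTorch as follows. The model here is a placeholder with the sizes used above, and `LinearLR` is our stand-in for the linear schedule mentioned (the post does not name the exact scheduler implementation):

```python
import torch
from torch import nn

# Placeholder tiny model: 768-dim embeddings -> 7 hidden units -> 14 classes.
model = nn.Sequential(nn.Linear(768, 7), nn.ReLU(), nn.Linear(7, 14))

# Adam with amsgrad, lr 1e-3, weight decay 1e-5, as reported in the text.
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, weight_decay=1e-5, amsgrad=True
)

# Linear decay of the learning rate over 1,000 steps.
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0, total_iters=1000
)
```

In the training loop, `scheduler.step()` is called once per optimization step after `optimizer.step()`.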