In a previous blog post, we described how to fine-tune a pretrained Hugging Face transformer model for text classification at scale. We used a PyTorch dataset class that pulls data directly from a SQL database. The advantages of this approach over loading the entire dataset into the memory of the host machine at the beginning of training are clear. First, SQL databases are practically limitless in size, so we don’t have to worry about whether the dataset will fit into the memory of the compute host that is provisioned and optimized for ML model training. Second, because rows of data are pulled only when they are needed, model training and evaluation can begin almost instantly. In fact, we find that we can sometimes train a model successfully using only a small fraction of a large dataset, which further highlights the inefficiency of copying all data to the training node up front.
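As a refresher, a map-style PyTorch dataset that pulls individual rows from a SQL database might look like the following minimal sketch. The connection string, table name, and column names are placeholders, and we use pyodbc purely for illustration; see the repository linked at the end of this post for the actual implementation.

```python
import pyodbc
import torch
from torch.utils.data import Dataset


class SQLSentenceDataset(Dataset):
    """Pulls one (text, label) row per __getitem__ instead of loading everything into memory."""

    def __init__(self, connection_string: str, table: str = "sentences"):
        self.conn = pyodbc.connect(connection_string)
        self.table = table
        # Cache the row count once so __len__ stays cheap.
        cursor = self.conn.cursor()
        cursor.execute(f"SELECT COUNT(*) FROM {self.table}")
        self.n_rows = cursor.fetchone()[0]

    def __len__(self):
        return self.n_rows

    def __getitem__(self, idx):
        # Assumes an integer key column `id` running from 0 to n_rows - 1.
        cursor = self.conn.cursor()
        cursor.execute(f"SELECT text, label FROM {self.table} WHERE id = ?", idx)
        text, label = cursor.fetchone()
        return text, torch.tensor(label)
```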
In this blog post, we go one step further: instead of fine-tuning the classification layer of a very large language model, we make a single upfront investment and store the activation of the last hidden layer (also known as the sentence embedding) of every sentence in a SQL database. For each use case or ML experiment, we then use these stored activations, rather than the raw sentences, as input to rapidly train a tiny classification model.
Consider the following use cases (the list is not exhaustive):
In this blog post, we will describe how you can achieve your goals in a tiny fraction of the time and cost you would incur if you followed standard operating procedures for fine-tuning a very large language model.
In this step, we process all sentences and store the activation of the last hidden layer of our very large language model in a SQL table. This activation is typically referred to as a sentence embedding (even though the technique is not limited to sentences and can also be applied to paragraphs of text). Quoting Wikipedia: “Sentence embedding is the collective name for a set of techniques in natural language processing (NLP) where sentences are mapped to vectors of real numbers”. The elements of these vectors describe the location of a sentence in a high-dimensional semantic space. The size of the embedding vector varies widely between models: for the popular BERT base model, the vector has 768 elements; for GPT-3, the embedding size is much larger, at 12,288.
To retrieve these embeddings from our model without leaving the comfort of the Hugging Face library, we apply a simple trick: we remove the classification layer from the BERT model and replace it with the Identity operator. The model output is then the sentence embedding itself, which we can store as a JSON string in a SQL table.
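A minimal sketch of this trick, assuming BERT base and the Hugging Face transformers library (the model name and the SQL column layout are illustrative, not the exact code from our repository):

```python
import json

import torch
from torch import nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # illustrative choice of model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Replace the classification head with the identity operator, so that the
# "logits" returned by the model are simply the pooled sentence embedding
# (768 values for BERT base).
model.classifier = nn.Identity()
model.eval()


@torch.no_grad()
def embed(sentence: str) -> str:
    """Return the sentence embedding serialized as a JSON string."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    embedding = model(**inputs).logits.squeeze(0)
    return json.dumps(embedding.tolist())


# The returned JSON string can be written to a text column of the SQL table,
# e.g. with: INSERT INTO embeddings (id, embedding) VALUES (?, ?)
```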
During training, we simply load these stored embeddings and feed them to a tiny classification model. This model really is tiny: an input layer, one hidden layer, and an output layer. The size of the input layer corresponds to the size of the sentence embeddings, and the size of the output layer corresponds to the number of categories in our dataset. The size of the hidden layer is a free parameter; we recommend using either the same size as the output layer or half that size. The resulting model (a multi-layer perceptron) can be written in 16 lines of code or fewer, as sketched below.
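Here is one way to write such a multi-layer perceptron in PyTorch. The layer sizes below (768-dimensional BERT base embeddings, 14 dbpedia_14 categories, and a hidden layer of half that size) are the ones discussed in this post; the choice of ReLU as the activation is ours and may differ from the repository.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Input layer -> one hidden layer -> output layer."""

    def __init__(self, embedding_size: int = 768, n_classes: int = 14):
        super().__init__()
        hidden_size = n_classes // 2  # half the number of categories
        self.net = nn.Sequential(
            nn.Linear(embedding_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```

With these defaults, the model has 768 × 7 + 7 + 7 × 14 + 14 = 5,495 parameters.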
Compare the size of this model (5,495 parameters) to the number of parameters in today’s language models: even the relatively small BERT base model has 110M parameters. The following figure (borrowed from the blog post “Large Language Models: A New Moore's Law?”) shows the sizes of other language models.
That is, by reusing stored embeddings during training, instead of processing each sentence with the entire language model, we save millions or even billions of floating-point operations.
How much money does this save during model training? The overwhelming advantage of recycling sentence embeddings is that model training can be done without a GPU. To keep things simple, let’s say model training takes one hour (which is not unreasonable). The table below shows how much that hour of training would cost in Azure. The first row shows the cost of training our tiny model on a VM without a GPU: compared to that D2s v3, fine-tuning the full model on an A100 GPU in an NC24 VM would cost approximately 36 times as much ($3.67 vs. $0.10 per hour).
| VM | GPU | Cost per hour (pay as you go) |
| --- | --- | --- |
| D2s v3 | N/A | $0.10 |
| NC6 v1 | K80 | $0.90 |
| NC6 v2 | P100 | $2.07 |
| NC6 v3 | V100 | $3.06 |
| NC24 | A100 | $3.67 |
The figure below shows the test accuracy of our tiny model on the dbpedia_14 dataset. We didn’t do much hyperparameter tuning and used a learning rate of 1e-3, weight decay of 1e-5, the Adam optimizer with amsgrad enabled, and a linear learning rate schedule over 1,000 steps. The size of the hidden layer was half the number of categories.
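For reference, those settings translate into roughly the following training setup. The warmup step count, batch size, and the dummy data standing in for the DataLoader over stored embeddings are assumptions on our part.

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = TinyClassifier()  # the multi-layer perceptron sketched above
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, weight_decay=1e-5, amsgrad=True
)
# Linear learning rate schedule over 1,000 steps (no warmup assumed here).
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=1000
)
loss_fn = torch.nn.CrossEntropyLoss()

# Dummy batches stand in for a DataLoader over the stored embeddings.
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(
        torch.randn(64, 768), torch.randint(0, 14, (64,))
    ),
    batch_size=16,
)

for embeddings, labels in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(embeddings), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()
```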
As always, we invite you to try this for yourself, using our open-source GitHub repository: https://github.com/Azure/elenchus