Introduction
Embeddings represent unstructured data in a dense vector space. An embedding is a vector of floating-point numbers, such that the distance between two embeddings in the vector space correlates with the semantic similarity between the two inputs in their original format (e.g., text or images). When text is embedded, meaning is encoded so that texts that are closer together in the vector space are expected to have similar meanings. A large number of embedding models support such text representations, and benchmarks like MTEB help compare their performance.
One of the pitfalls of embedding models is that they may not adequately represent the underlying data. This can happen in the following scenarios:
- Out-of-Domain Text: If the text is highly technical or niche and the model hasn't been trained on similar data, the resulting embeddings might not accurately capture the specialized context or jargon. For example, this is likely to happen when new terminology is coined in science or technology.
- Ambiguity: Text with ambiguous meanings can lead to embeddings that don't clearly represent any of the possible interpretations. For example, “getting a match for him was too difficult” could refer to sports or marriage depending on context; the disambiguated sentences “getting a match for him to play was too difficult” and “getting a match for him to marry was too difficult” still give a cosine similarity of 0.77 with ada embeddings.
- Sarcasm or Irony: Detecting sarcasm or irony requires a deep understanding of context and tone, which can be challenging for models, leading to embeddings that take the text at face value. For example, “Oh, I just love getting stuck in traffic” and “traffic is really bad” may show a lower similarity score than expected.
- Cultural Nuances: Subtle cultural references or idioms might not be well-represented in embeddings if the model lacks sufficient exposure to the culture in question.
- Short Texts / Concepts: Very short texts, like one-word inputs, might not provide enough information for the model to generate meaningful embeddings. In some cases, short texts name concepts that the model may not pick up: for example, “time value of money” and “money value of time” generate embeddings with a cosine similarity of 0.73, whereas their expanded descriptions “increase in value of money over time due to interest earned” and “charging money for time spent on a task” have a cosine similarity of 0.37 (see the comparison sketch after this list).
- Non-Standard Language: Text containing a lot of slang, misspellings, or grammatical errors might result in less accurate embeddings.
- Rare Words: If the text contains rare words or neologisms, the embeddings might not capture their meaning accurately if those words were not present in the training data.
- Context with RAG: If the text contains organization- or entity-specific content, embeddings may not be truly representative, since this content will not have been part of the model's training data. Examples include abbreviations and organization-specific definitions of generic terms.
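Several of the examples above quote cosine similarities between pairs of texts. A minimal sketch of how such a comparison can be computed (assuming an Azure OpenAI deployment of text-embedding-ada-002; the key and endpoint are placeholders):
import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="*****",  # placeholder
    api_version="2023-05-15",
    azure_endpoint="https://***.openai.azure.com/",  # placeholder
)

def embed(text: str, model: str = "text-embedding-ada-002") -> np.ndarray:
    # request an embedding for a single text
    return np.array(client.embeddings.create(input=[text], model=model).data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embed("time value of money"), embed("money value of time")))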
Custom Embeddings
Retrieval Augmented Generation (RAG) is an architecture that augments the capabilities of a Large Language Model (LLM) like ChatGPT with an information retrieval system that provides grounding data. Since embeddings are the key to retrieving relevant content in RAG, the pitfalls discussed above matter. One way to improve these representations is to modify the embedding vector values to overcome those challenges. The steps below describe an implementation of contrastive learning for fine-tuning the embeddings.
Step 1: We take sample training documents and split them into chunks. We then generate positive and negative sentences with reference to the chunks. For a given corpus, these samples can be produced by sampling chunks and using an LLM to create positive and negative examples.
Step 2: We next generate embedding representations of the positive and negative examples with embedding models such as “text-embedding-ada-002” or “text-embedding-3-small”.
As an alternative to generating the examples, we can use a labelled corpus such as MS MARCO.
Step 3: We train a shallow neural network that takes the embeddings of the sentence pairs as input, with a loss function based on the difference between the known similarity from the labelled data set (positive / negative sentences) and the similarity predicted by the model. Since the data is labelled, the similarity for positive pairs should be high and the similarity for negative pairs should be low.
Step 4: As the model trains on the positive and negative sentence pairs, the trainable weights are nudged to reduce the loss. This way, we can generate more context-aware embeddings for a custom corpus.
Step 5: To embed an unseen chunk, we first generate its embedding with the base embedding model and then pass it through the trained model to obtain the fine-tuned embedding, which gives a more context-aware representation.
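Conceptually, the trained model is a projection matrix applied on top of the base embedding; a minimal sketch of the idea (names are illustrative, the full implementation follows below):
import torch

# W is the trainable matrix learned in Steps 3-4: training nudges W so that
# cos(e1 @ W, e2 @ W) moves toward the +1 / -1 label of each sentence pair.
def customize_embedding(base_embedding: list, W: torch.Tensor) -> torch.Tensor:
    e = torch.tensor(base_embedding).float()
    return e @ W  # context-aware embedding for an unseen chunk (Step 5)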
Generating training data:
To obtain labelled data, we can use available labelled corpora, e.g. the MS MARCO passage ranking dataset or the Stanford Natural Language Inference (SNLI) dataset. An alternative approach is to generate similar and dissimilar labelled data with GPT-4: we iteratively provide a user query (along with a chunk of a document as reference, if needed) and prompt the model to generate a positive and a hard negative document for that query. Prompt templates are available for this, and a sketch is shown below.
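A minimal sketch of such generation, assuming the standard OpenAI client with OPENAI_API_KEY set (swap in AzureOpenAI if needed); the prompt wording and helper name are illustrative:
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_pair(chunk: str, model: str = "gpt-4") -> dict:
    # ask the model for one positive and one hard negative sentence for the chunk
    prompt = (
        "Given the reference text below, return a JSON object with two keys: "
        "'positive' (a sentence clearly supported by the text) and "
        "'hard_negative' (a sentence on the same topic that contradicts or is "
        "unsupported by the text).\n\nReference text:\n" + chunk
    )
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    # assumes the model replies with valid JSON
    return json.loads(response.choices[0].message.content)

pair = generate_pair("Children smiling and waving at camera")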
Example sentences with positive labels:
sentence1 | sentence2
A person on a horse jumps over a broken down airplane. | A person is outdoors, on a horse.
Children smiling and waving at camera | There are children present
A boy is jumping on skateboard in the middle of a red bridge. | The boy does a skateboarding trick.
Two blond women are hugging one another. | There are women showing affection.
A few people in a restaurant setting, one of them is drinking orange juice. | The diners are at a restaurant.
Example sentences with hard negative labels:
sentence1 | sentence2
A person on a horse jumps over a broken down airplane. | A person is at a diner, ordering an omelette.
Children smiling and waving at camera | The kids are frowning
A boy is jumping on skateboard in the middle of a red bridge. | The boy skates down the sidewalk.
An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. | A boy flips a burger.
Two blond women are hugging one another. | The women are sleeping.
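The example pairs above follow the SNLI pattern (entailment as positive, contradiction as hard negative). A minimal sketch of assembling such pairs into the DataFrame used by the implementation below, assuming the Hugging Face datasets library:
import pandas as pd
from datasets import load_dataset

# SNLI labels: 0 = entailment, 1 = neutral, 2 = contradiction
snli = load_dataset("snli", split="train").filter(lambda x: x["label"] in (0, 2))
snli = snli.shuffle(seed=42).select(range(1000))  # small sample for illustration

df = pd.DataFrame(
    {
        "text_1": snli["premise"],
        "text_2": snli["hypothesis"],
        # map entailment to +1 (similar) and contradiction to -1 (dissimilar)
        "label": [1 if l == 0 else -1 for l in snli["label"]],
    }
)
# mark a train/test split, matching the "dataset" column used by optimize_matrix below
df["dataset"] = ["train" if i < 800 else "test" for i in range(len(df))]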
Implementation
Generating embedding representations for the labelled sentence pairs:
from typing import List
import pickle
import pandas as pd
from openai import AzureOpenAI

def get_embedding(text: str, model="text-embedding-3-small", **kwargs) -> List[float]:
    # replace newlines, which can negatively affect performance
    text = text.replace("\n", " ")
    client = AzureOpenAI(
        api_key="*****",
        api_version="2023-05-15",
        azure_endpoint="https://***.openai.azure.com/",
    )
    response = client.embeddings.create(input=[text], model=model, **kwargs)
    return response.data[0].embedding

# assumed defaults for the cache; adjust to your setup
default_embedding_engine = "text-embedding-3-small"
embedding_cache_path = "embedding_cache.pkl"
embedding_cache = {}

# this function will get embeddings from the cache and save them there afterward
def get_embedding_with_cache(
    text: str,
    engine: str = default_embedding_engine,
    embedding_cache: dict = embedding_cache,
    embedding_cache_path: str = embedding_cache_path,
) -> list:
    if (text, engine) not in embedding_cache.keys():
        # if not in cache, call API to get embedding
        embedding_cache[(text, engine)] = get_embedding(text, engine)
        # save embeddings cache to disk after each update
        with open(embedding_cache_path, "wb") as embedding_cache_file:
            pickle.dump(embedding_cache, embedding_cache_file)
    return embedding_cache[(text, engine)]

# create a column of embeddings for each text column (df holds text_1, text_2, label)
for column in ["text_1", "text_2"]:
    df[f"{column}_embedding"] = df[column].apply(get_embedding_with_cache)
We generate a trainable matrix that can be used to customize the embeddings.
import numpy as np
import torch

def cosine_similarity(a, b) -> float:
    # cosine similarity between two vectors (helper; not defined in the original snippet)
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_multiplied_by_matrix(
    embedding: List[float], matrix: torch.tensor
) -> np.array:
    embedding_tensor = torch.tensor(embedding).float()
    modified_embedding = embedding_tensor @ matrix
    modified_embedding = modified_embedding.detach().numpy()
    return modified_embedding

# compute custom embeddings and new cosine similarities
def apply_matrix_to_embeddings_dataframe(matrix: torch.tensor, df: pd.DataFrame):
    for column in ["text_1_embedding", "text_2_embedding"]:
        df[f"{column}_custom"] = df[column].apply(
            lambda x: embedding_multiplied_by_matrix(x, matrix)
        )
    df["cosine_similarity_custom"] = df.apply(
        lambda row: cosine_similarity(
            row["text_1_embedding_custom"], row["text_2_embedding_custom"]
        ),
        axis=1,
    )
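The training loop below also uses a helper accuracy_and_se that is not shown here; a minimal sketch (assuming labels are +1/-1 and accuracy is taken at the best cosine-similarity threshold, with a binomial standard error) could be:
def accuracy_and_se(cosine_similarity, labeled_similarity):
    # sweep thresholds and keep the best classification accuracy
    best_accuracy = 0.0
    for threshold_thousandths in range(-1000, 1000, 1):
        threshold = threshold_thousandths / 1000
        correct = sum(
            (1 if cs > threshold else -1) == ls
            for cs, ls in zip(cosine_similarity, labeled_similarity)
        )
        best_accuracy = max(best_accuracy, correct / len(labeled_similarity))
    n = len(labeled_similarity)
    standard_error = (best_accuracy * (1 - best_accuracy) / n) ** 0.5
    return best_accuracy, standard_error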
import random
from typing import Tuple

def optimize_matrix(
    modified_embedding_length: int = 2048,  # in my brief experimentation, bigger was better (2048 is length of babbage encoding)
    batch_size: int = 100,
    max_epochs: int = 100,
    learning_rate: float = 100.0,  # seemed to work best when similar to batch size - feel free to try a range of values
    dropout_fraction: float = 0.0,  # in my testing, dropout helped by a couple percentage points (definitely not necessary)
    df: pd.DataFrame = df,
    print_progress: bool = True,
    save_results: bool = True,
) -> pd.DataFrame:
    """Return a results DataFrame (including the optimized matrices) from minimizing loss on training data."""
    run_id = random.randint(0, 2 ** 31 - 1)  # (range is arbitrary)

    # convert from dataframe to torch tensors
    # e is for embedding, s for similarity label
    def tensors_from_dataframe(
        df: pd.DataFrame,
        embedding_column_1: str,
        embedding_column_2: str,
        similarity_label_column: str,
    ) -> Tuple[torch.tensor]:
        e1 = np.stack(np.array(df[embedding_column_1].values))
        e2 = np.stack(np.array(df[embedding_column_2].values))
        s = np.stack(np.array(df[similarity_label_column].astype("float").values))
        e1 = torch.from_numpy(e1).float()
        e2 = torch.from_numpy(e2).float()
        s = torch.from_numpy(s).float()
        return e1, e2, s

    e1_train, e2_train, s_train = tensors_from_dataframe(
        df[df["dataset"] == "train"], "text_1_embedding", "text_2_embedding", "label"
    )
    e1_test, e2_test, s_test = tensors_from_dataframe(
        df[df["dataset"] == "test"], "text_1_embedding", "text_2_embedding", "label"
    )

    # create dataset and loader
    dataset = torch.utils.data.TensorDataset(e1_train, e2_train, s_train)
    train_loader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, shuffle=True
    )

    # define model (similarity of projected embeddings)
    def model(embedding_1, embedding_2, matrix, dropout_fraction=dropout_fraction):
        e1 = torch.nn.functional.dropout(embedding_1, p=dropout_fraction)
        e2 = torch.nn.functional.dropout(embedding_2, p=dropout_fraction)
        modified_embedding_1 = e1 @ matrix  # @ is matrix multiplication
        modified_embedding_2 = e2 @ matrix
        similarity = torch.nn.functional.cosine_similarity(
            modified_embedding_1, modified_embedding_2
        )
        return similarity

    # define loss function to minimize
    def mse_loss(predictions, targets):
        difference = predictions - targets
        return torch.sum(difference * difference) / difference.numel()

    # initialize projection matrix
    embedding_length = len(df["text_1_embedding"].values[0])
    matrix = torch.randn(
        embedding_length, modified_embedding_length, requires_grad=True
    )

    epochs, types, losses, accuracies, matrices = [], [], [], [], []
    for epoch in range(1, 1 + max_epochs):
        # iterate through training dataloader
        for a, b, actual_similarity in train_loader:
            # generate prediction
            predicted_similarity = model(a, b, matrix)
            # get loss and perform backpropagation
            loss = mse_loss(predicted_similarity, actual_similarity)
            loss.backward()
            # update the weights
            with torch.no_grad():
                matrix -= matrix.grad * learning_rate
                # set gradients to zero
                matrix.grad.zero_()

        # calculate test loss
        test_predictions = model(e1_test, e2_test, matrix)
        test_loss = mse_loss(test_predictions, s_test)

        # compute custom embeddings and new cosine similarities
        apply_matrix_to_embeddings_dataframe(matrix, df)

        # calculate test accuracy
        for dataset in ["train", "test"]:
            data = df[df["dataset"] == dataset]
            a, se = accuracy_and_se(data["cosine_similarity_custom"], data["label"])

            # record results of each epoch
            epochs.append(epoch)
            types.append(dataset)
            losses.append(loss.item() if dataset == "train" else test_loss.item())
            accuracies.append(a)
            matrices.append(matrix.detach().numpy())

            # optionally print accuracies
            if print_progress is True:
                print(
                    f"Epoch {epoch}/{max_epochs}: {dataset} accuracy: {a:0.1%} ± {1.96 * se:0.1%}"
                )

    data = pd.DataFrame(
        {"epoch": epochs, "type": types, "loss": losses, "accuracy": accuracies}
    )
    data["run_id"] = run_id
    data["modified_embedding_length"] = modified_embedding_length
    data["batch_size"] = batch_size
    data["max_epochs"] = max_epochs
    data["learning_rate"] = learning_rate
    data["dropout_fraction"] = dropout_fraction
    data["matrix"] = matrices  # saving every single matrix can get big; feel free to delete/change
    if save_results is True:
        data.to_csv(f"{run_id}_optimization_results.csv", index=False)

    return data
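A hypothetical training run and selection of the best-performing matrix (the hyperparameter values are illustrative):
# try a few hyperparameter combinations and keep the results
results = []
for batch_size, learning_rate in [(10, 10), (100, 100), (1000, 1000)]:
    result = optimize_matrix(
        batch_size=batch_size,
        learning_rate=learning_rate,
        max_epochs=30,
        save_results=False,
    )
    results.append(result)

# pick the matrix from the run/epoch with the highest test accuracy
runs_df = pd.concat(results)
best_run = runs_df[runs_df["type"] == "test"].sort_values(by="accuracy", ascending=False).iloc[0]
best_matrix = best_run["matrix"]
apply_matrix_to_embeddings_dataframe(torch.tensor(best_matrix).float(), df)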
A comparison of the distribution of cosine similarities from the default text embeddings and from the customized embeddings on a sample dataset shows that the custom embeddings perform better. The sample rows below show reduced similarity compared to the default embeddings (label -1 indicates a dissimilar pair), indicating that the custom embeddings differentiate better.
text_1 | text_2 | label | cosine_similarity | cosine_similarity_custom
The man plays guitar | someone is playing an instrument | -1 | 0.58 | 0.52
Lady wearing a yellow top is sitting on a chair | a woman on a yellow shirt is on the floor. | -1 | 0.52 | 0.486
Children playing a game. | The guys are playing a game. | -1 | 0.52 | 0.45
LlamaIndex Implementation:
LlamaIndex provides a simplified implementation with the following steps: generating a corpus, generating synthetic queries, running the embedding fine-tuning, and evaluating the results.
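The train_nodes and val_nodes used below come from the corpus-generation step; a minimal sketch of producing them (the file names and chunking settings are assumptions):
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

def load_corpus(files):
    # parse documents into nodes (chunks) for QA-pair generation
    docs = SimpleDirectoryReader(input_files=files).load_data()
    return SentenceSplitter(chunk_size=512).get_nodes_from_documents(docs)

train_nodes = load_corpus(["train_docs.pdf"])  # hypothetical file names
val_nodes = load_corpus(["val_docs.pdf"])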
# Generate synthetic queries
import os

from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
from llama_index.llms.openai import OpenAI

OPENAI_API_TOKEN = "sk-"
os.environ["OPENAI_API_KEY"] = OPENAI_API_TOKEN

# generate question/context pairs from the parsed nodes
train_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"), nodes=train_nodes
)
val_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"), nodes=val_nodes
)

train_dataset.save_json("train_dataset.json")
val_dataset.save_json("val_dataset.json")
# Run Embedding Finetuning
from llama_index.finetuning import SentenceTransformersFinetuneEngine

# fine-tune a local sentence-transformers model on the synthetic QA pairs
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",
    model_output_path="test_model",
    val_dataset=val_dataset,
)
finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model()
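The fine-tuned model can then be plugged back into LlamaIndex for retrieval and evaluation; a minimal usage sketch (the query text and top_k are arbitrary):
from llama_index.core import Settings, VectorStoreIndex

# use the fine-tuned embeddings for indexing and retrieval
Settings.embed_model = embed_model
index = VectorStoreIndex(val_nodes)
retriever = index.as_retriever(similarity_top_k=2)
retrieved = retriever.retrieve("What does the document say about payment terms?")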
Sentence Transformers (SBERT):
Augmented SBERT offers the following approach for extending sentence transformer models to custom datasets that have no positive / negative pair annotations (a sketch follows the list):
- Train a cross-encoder (BERT) from scratch on a source dataset; a labelled dataset like the STS benchmark can be used.
- Use this cross-encoder (BERT) to label your target dataset, i.e. the unlabeled sentence pairs.
- Finally, train a bi-encoder (SBERT) on the labelled target dataset.
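A minimal sketch with the sentence-transformers library (for brevity, a cross-encoder already trained on the STS benchmark is loaded rather than trained from scratch, and the target-domain pairs are hypothetical):
from sentence_transformers import CrossEncoder, InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# 1. Cross-encoder trained on a labelled source dataset (STS benchmark)
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")

# 2. Pseudo-label unlabeled sentence pairs from the target domain
target_pairs = [
    ("time value of money", "increase in value of money over time due to interest earned"),
    ("time value of money", "charging money for time spent on a task"),
]
scores = cross_encoder.predict(target_pairs)
silver_examples = [
    InputExample(texts=[s1, s2], label=float(score))
    for (s1, s2), score in zip(target_pairs, scores)
]

# 3. Train a bi-encoder (SBERT) on the pseudo-labelled target data
bi_encoder = SentenceTransformer("bert-base-uncased")
train_loader = DataLoader(silver_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model=bi_encoder)
bi_encoder.fit(train_objectives=[(train_loader, train_loss)], epochs=1)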
References:
- OpenAI Cookbook: Customizing_embeddings.ipynb (github.com/openai/openai-cookbook)
- Custom Embeddings, LlamaIndex 0.9.24 documentation
- run-llama/finetune-embedding: Fine-Tuning Embedding for RAG with Synthetic Data (github.com)
- Fine-Tuning Embeddings for RAG with Synthetic Data, Jerry Liu, LlamaIndex Blog
- Augmented SBERT, Sentence Transformers documentation
- FlagEmbedding/examples/finetune (github.com/FlagOpen/FlagEmbedding)