Introduction
Journey 3 in our 5-part RAG Time developer series covers how to optimize your vector index for large-scale AI applications. I’m Mike, a program manager on the Azure AI Search team. Read the second post of this series and access all videos and resources in our GitHub repo.
Journey 3 covers various methods to optimize your vector index for large-scale RAG including:
- Vector storage and storage optimization
- Vector Compression (Scalar/Binary Quantization)
- Vector Truncation (MRL)
- Quality improvements for optimized vectors
Vector storage and storage optimization
A vector is an array of numbers generated by an embedding model, where the number of items corresponds to the number of dimensions. Azure AI Search supports a range of data types, from a 32-bit single-precision floating-point array down to packed binary, which holds a single bit per dimension.
| Description | Size | Data type |
| --- | --- | --- |
| Single precision floating point | 32 bits | Edm.Single |
| Half precision floating point | 16 bits | Edm.Half |
| Short int | 16 bits | Edm.Int16 |
| Signed byte | 8 bits | Edm.SByte |
| Packed binary | 1 bit packed in bytes | Edm.Byte, packedBinary=true |
The raw size of a single vector is the number of dimensions times the size of the data type.
For example, a vector with 3,072 dimensions stored as single-precision floating point is 3,072 × 4 bytes, or roughly 12 KB.
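As a quick sanity check, here is a minimal Python sketch (plain arithmetic, no SDK required) that computes the raw per-vector size for each supported data type, assuming a 3,072-dimension embedding:

```python
# Raw vector size = number of dimensions x bits per dimension / 8.
DIMENSIONS = 3072

bits_per_dimension = {
    "Edm.Single (float32)": 32,
    "Edm.Half (float16)": 16,
    "Edm.Int16": 16,
    "Edm.SByte (int8)": 8,
    "Edm.Byte (packed binary)": 1,
}

for data_type, bits in bits_per_dimension.items():
    size_bytes = DIMENSIONS * bits // 8
    print(f"{data_type:<26} {size_bytes:>6,} bytes per vector")

# Edm.Single (float32)       12,288 bytes per vector (~12 KB)
# ...
# Edm.Byte (packed binary)      384 bytes per vector
```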
By default, Azure AI Search stores three copies of vector data uploaded to the service:
- Vector index: Loaded into memory for fast approximate nearest neighbor (ANN) search. This copy is used to generate the initial set of candidate results.
- Full-precision vectors: Used for rescoring the candidate result set, which significantly improves the final quality of results for a highly optimized vector index.
- Source vectors: Used for retrieval and to support merge updates for other fields in the document.
You can choose not to store the full-precision and/or source vectors to reduce the total storage consumed.
Let’s say you have 1 million vectors with no optimizations applied. If each vector is about 12 KB, the vector index size will be around 11.4 GB, with total storage consumed on disk of roughly 34 GB. That’s the starting point for your optimizations. The vector index size is a finite resource, and you’ll want to make the most of it: it is limited by what the service can fit into memory, which is much lower than the total storage available. The more you optimize, the more vectors you can fit into memory for faster vector retrieval. This is why optimization is so important!
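To make those numbers concrete, here is a back-of-the-envelope calculation in Python. The 3x multiplier is only an approximation of the three copies described above; actual on-disk usage depends on your index configuration:

```python
DIMENSIONS = 3072
BYTES_PER_DIMENSION = 4       # float32 (Edm.Single)
NUM_VECTORS = 1_000_000
GIB = 1024 ** 3

vector_bytes = DIMENSIONS * BYTES_PER_DIMENSION   # ~12 KB per vector
index_bytes = vector_bytes * NUM_VECTORS          # vector index held in memory
total_bytes = index_bytes * 3                     # index + full-precision + source copies (approx.)

print(f"Vector index size:     {index_bytes / GIB:.1f} GiB")   # ~11.4 GiB
print(f"Approx. total storage: {total_bytes / GIB:.1f} GiB")   # ~34.3 GiB
```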
Next, we’ll explore how you can optimize your vector index in Azure AI Search using vector compression and truncation.
Vector Compression (Scalar/Binary Quantization)
Scalar and binary quantization compress the values output by your embedding model into a narrower data type. Scalar quantization reduces each 32-bit or 16-bit value to an 8-bit value, which shrinks the vector index by up to 75%.
Scalar quantization works by identifying a range of numbers (typically the observed minimum and maximum), dividing that range into 256 numbered bins, and rounding the value of each vector dimension to the number of the nearest bin. Binary quantization goes further, reducing each dimension to a single bit packed into bytes, which results in a vector index that is up to 97% smaller.
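The numpy sketch below illustrates the idea behind both techniques. It is a simplified, standalone illustration of the math, not the service's actual implementation, and the helper names are made up for this post:

```python
import numpy as np

def scalar_quantize(vectors: np.ndarray) -> np.ndarray:
    """Map float values into 256 bins (uint8) using the observed min/max range."""
    lo, hi = vectors.min(), vectors.max()
    bins = np.round((vectors - lo) / (hi - lo) * 255)
    return bins.astype(np.uint8)        # 4 bytes -> 1 byte per dimension (~75% smaller)

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """Reduce each dimension to a single bit and pack 8 bits into each byte."""
    bits = (vectors > 0).astype(np.uint8)   # simple sign threshold, for illustration only
    return np.packbits(bits, axis=-1)       # 32 bits -> 1 bit per dimension (~97% smaller)

embeddings = np.random.randn(1_000, 3072).astype(np.float32)
print(embeddings.nbytes)                    # 12,288,000 bytes
print(scalar_quantize(embeddings).nbytes)   #  3,072,000 bytes
print(binary_quantize(embeddings).nbytes)   #    384,000 bytes
```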
Vector Truncation (MRL)
If the model used to generate your vector embeddings has been trained to support Matryoshka Representation Learning (MRL), you can also truncate dimensions to reduce the size of the vectors. MRL produces embeddings in which the leading dimensions carry most of the semantic information, so the trailing dimensions can be truncated without a significant loss in quality.
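Conceptually, truncation is just slicing off the trailing dimensions and re-normalizing. A minimal numpy sketch follows; the 1,024-dimension target is an illustrative choice, not a recommendation:

```python
import numpy as np

def truncate_mrl(vectors: np.ndarray, target_dims: int = 1024) -> np.ndarray:
    """Keep the leading dimensions of an MRL-trained embedding and re-normalize."""
    truncated = vectors[:, :target_dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

embeddings = np.random.randn(1_000, 3072).astype(np.float32)
print(truncate_mrl(embeddings).shape)   # (1000, 1024): a 3x smaller vector
```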
To achieve the maximum vector index size reduction, we recommend using binary quantization together with MRL and setting the stored property to false. You keep a vector copy for rescoring, but not for retrieval. Referring to the previous example, with all the optimization techniques we’ve discussed, 1 million vectors from OpenAI's text-embedding-3-large model can be reduced from ~11 GB down to 122 MB, a 96x decrease in the size of the vector index loaded into memory.
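The arithmetic behind that 96x figure works out if we assume the 3,072-dimension embeddings are truncated to 1,024 dimensions before binary quantization (the truncation target isn't stated above, so treat that value as an assumption that happens to match the published numbers):

```python
NUM_VECTORS = 1_000_000
MIB = 1024 ** 2

full_precision_bytes = 3072 * 4     # 12,288 bytes per vector (float32)
optimized_bytes = 1024 // 8         # 128 bytes per vector (1 bit/dim after MRL truncation)

print(full_precision_bytes / optimized_bytes)    # 96.0x smaller per vector
print(optimized_bytes * NUM_VECTORS / MIB)       # ~122 MiB vector index for 1M vectors
```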
Quality improvements for optimized vectors
This all sounds great for cost efficiency and performance, but reducing the precision and truncating dimensions can significantly impact quality. Can we still achieve high-quality results with all this optimization? That's where oversampling and rescoring come in. Oversampling pulls a larger set of candidate results than requested from the compressed vector index; those candidates are then rescored using the full-precision vectors stored on disk, and the best matches are returned.
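Here is a small numpy sketch of the oversample-then-rescore pattern. It redefines the binary_quantize helper from the earlier sketch so the snippet stands alone, and it is a conceptual illustration rather than how the service implements it:

```python
import numpy as np

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    return np.packbits((vectors > 0).astype(np.uint8), axis=-1)

def search_with_rescoring(query, docs_full, docs_binary, k=10, oversampling=2.0):
    """Candidate pass over compressed vectors, then rescore the candidates in full precision."""
    # 1. Oversampled candidate pass: Hamming distance over the packed binary vectors.
    query_binary = binary_quantize(query[None, :])
    hamming = np.unpackbits(docs_binary ^ query_binary, axis=-1).sum(axis=1)
    candidates = np.argsort(hamming)[: int(k * oversampling)]

    # 2. Rescoring pass: exact cosine similarity against the full-precision vectors.
    sims = docs_full[candidates] @ query / (
        np.linalg.norm(docs_full[candidates], axis=1) * np.linalg.norm(query)
    )
    return candidates[np.argsort(-sims)[:k]]

docs = np.random.randn(10_000, 3072).astype(np.float32)
query = np.random.randn(3072).astype(np.float32)
print(search_with_rescoring(query, docs, binary_quantize(docs)))   # top-10 document indices
```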
Combining binary quantization, MRL, and 2x oversampling with rescoring gives you the most cost-effective vector scale with very little loss of result quality.
As Matt mentioned in Journey 2, you can also improve your results using hybrid search to retrieve results based on both vector and keyword matches. The semantic ranker re-ranks results using our highly optimized semantic models. Finally, query rewriting makes each query more effective by adding relevant search terms for more refined results.
Try these optimizations for yourself
Now it's time to see these optimizations in code and their effects on a real index. You can find the notebook Pamela discussed in the session in our GitHub repo. Following along, you will create multiple vector indexes in your Azure AI Search service from a large vector dataset on HuggingFace, each configured with a different optimization strategy, and see firsthand the effect on storage, index size, and result quality. You'll see that retrieval quality remains high, especially when oversampling and rescoring are enabled.
Next Steps
Ready to explore further? Check out these resources, which can all be found in our centralized GitHub repo.
- Watch Journey 3
- RAG Time GitHub Repo (Hands-on notebooks, documentation, and detailed guides to kick-start your RAG journey)
- Azure AI Search vector optimization documentation
- Azure AI Foundry