Azure AI Search: Cut Vector Costs Up To 92.5% with New Compression Techniques

fsunavala-msft
Apr 17, 2025

TLDR: Key learnings from our compression technique evaluation

  • Cost savings: Up to 92.5% reduction in monthly costs
  • Storage efficiency: Vector index size reduced by up to 99%
  • Speed improvement: Query response times up to 33% faster with compressed vectors
  • Quality maintained: Many compression configurations maintain 99-100% of baseline relevance quality

At scale, the cost of storing and querying large, high-dimensional vector indexes can balloon. The common trade-off? Either pay a premium to maintain top-tier search quality or sacrifice user experience to limit expenses.

With Azure AI Search, you no longer have to choose. Through testing, we have identified ways to reduce system costs without compromising retrieval quality.

Our experiments show:

  • 92.5% reduction in cost when using our most aggressive compression configurations

  • Query speed improves by up to 33% with compressed indexes

  • In many scenarios, quality remains at or near the baseline if you preserve originals and leverage rescoring.

This post walks through experiments that applied these compression techniques and measured their impact on storage footprint, cost, query speed, and relevance quality.

We will share experimental data, a decision framework, and practical recommendations to implement scalable knowledge retrieval applications:

  1. Why Compression Matters
  2. The Experiment Setup
  3. Data Overview: Cost, Speed, Quality
  4. Understanding the Technology
  5. Implementation Example in Azure AI Search
  6. Choosing the Right Compression Strategy

Why Compression Matters

The Business Value

  • Reduce operating costs: Cut the hefty storage costs of high-dimensional vector embeddings as you scale your solutions.
  • Maintain quality: Users still expect search results to be as relevant and accurate as before.
  • Improve speed: In many cases, compressed vectors yield faster query responses due to smaller index sizes and lower compute overhead.

The Experiment Setup

The Mental Model: Cost, Speed, Quality

When evaluating compression approaches, we measure three dimensions:

  • Cost: How much you spend each month for index storage and the necessary compute resources (partitions, SKUs).
  • Speed: How quickly your queries return results—often measured by latency percentile distributions (p50, p90, p99).
  • Quality: How relevant your search results are, measured by NDCG@k (Normalized Discounted Cumulative Gain at rank k).

Note: There are many ways to evaluate retrieval systems. For simplicity, we decided to focus on an industry standard, NDCG.
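To make the metric concrete, here is a minimal sketch of how NDCG@k can be computed from a ranked list of graded relevance judgments. This is the standard formulation; evaluation harnesses may use slightly different gain or discount variants.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: each result's relevance grade is
    # discounted by the log2 of its (1-based) rank position.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of the ideal (descending-relevance) ordering,
    # so 1.0 means a perfect ranking of the available relevant documents.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance grades of the top results returned for one hypothetical query:
print(round(ndcg_at_k([3, 2, 0, 1, 0], k=10), 3))  # 0.985
```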

Our experiments consisted of various configurations of the following compression techniques:

  • Scalar Quantization (SQ): Maps each floating-point value to a lower-precision scale (int8).
  • Binary Quantization (BQ): Maps floats to binary values for maximum compression.
  • Matryoshka Representation Learning (MRL): For certain tests, we leveraged embeddings trained with Matryoshka Representation Learning, which allowed us to truncate vectors from 3072 dimensions to 768 dimensions without significant loss of semantic information. See our MRL support for quantization blog for more details.
  • Oversampling (defaultOversampling): Retrieves extra candidates using compressed vectors, then rescores them with original vectors to boost recall.
  • Preserve vs. Discard Originals:
    • PreserveOriginals allows second-pass rescoring with full-precision vectors.
    • DiscardOriginals maximizes storage savings by removing the original uncompressed vectors entirely.

These compression configurations were tested on an open source dataset:

  • Dataset: mteb/msmarco with 8.8M vectors
  • Embeddings: Generated using OpenAI text-embedding-3-large with 3072 dimensions (a sketch of generating these follows the note below).
  • Indexing Algorithm: HNSW-based vector indexing (the default in Azure AI Search).
  • MRL: For certain tests, we used Matryoshka Representation Learning to reduce dimensions from 3072 to 768.
  • Cost data shown below reflects the minimum SKU and partitions required under current (post-Nov 2024) Azure pricing.

Note: We have conducted similar experiments on other MTEB retrieval datasets and observed consistent results across them. This blog focuses on results from the largest dataset that best simulates real-world production scenarios. Ultimately, your exact results may vary depending on your own data and use cases. 
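As a reference point for reproducing this setup, here is a minimal sketch of generating MRL-capable embeddings with text-embedding-3-large via the Azure OpenAI Python client. The endpoint, key, and deployment name are placeholders; you could equally call the OpenAI API directly.

```python
from openai import AzureOpenAI  # pip install openai

client = AzureOpenAI(
    api_key="<your-api-key>",  # placeholder
    api_version="2024-06-01",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

response = client.embeddings.create(
    model="text-embedding-3-large",  # your deployment name
    input=["Azure AI Search supports vector compression."],
    dimensions=3072,  # text-embedding-3 models are MRL-trained, so you can
                      # also request fewer dimensions (e.g., 768) directly
)
vector = response.data[0].embedding
print(len(vector))  # 3072
```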

Data Overview: Cost, Speed, Quality

Below is a condensed summary of our experiments. Notice how compression can deliver massive storage reductions with minimal or no quality loss. (Full details, including all intermediate configurations, can be found in extended tables in the Appendix).

Cost & Storage Comparison

These tests measured cost and index size across various compression methods.

Key Insights:

  • Moving from No Compression (109 GB) to BQ can reduce vector data size by 96% (~4 GB), though the actual index compression ratio may vary depending on supporting data structures, the M parameter in HNSW, and other index configuration settings.
  • Combining MRL + BQ can cut index size by 99% (109 GB → ~1 GB).
  • Cost can drop from $1,000/month on S1 to $75/month on Basic—a 92.5% reduction—when you discard original vectors.

| Method | Vector Index (GB) | Disk Storage (GB) | Approx. Cost per Month | % Savings | SKU | Min Partitions Required |
|---|---|---|---|---|---|---|
| No Compression | 109.13 | 112.29 | $1,000 | - | S1 | 4 |
| SQ (wo/rescoring) | 27.65 | 139.45 | $250 | 75% | S1 | 1 |
| SQ (w/rescoring) | 27.65 | 139.45 | $250 | 75% | S1 | 1 |
| BQ (w/rescoring) | 3.88 | 115.68 | $250 | 75% | S1 | 1 |
| BQ + discardOriginals | 3.88 | 7.04 | $75 | 92.5% | Basic | 1 |
| MRL + BQ (w/rescoring) | 1.33 | 113.13 | $250 | 75% | S1 | 1 |
| MRL + BQ + discardOriginals | 1.33 | 4.49 | $75 | 92.5% | Basic | 1 |

Table 1


Speed (Latency)

We measured query speed in milliseconds (ms) and compared each configuration's latency percentiles to the No Compression baseline.

Most compression methods improve on or match baseline latency at the p50, p90, and p99 percentiles.

Key Insight:

  • Compressed vectors typically yield faster searches due to more efficient processing.
  • You may only see performance benefits on indexes that have >10K vectors.

| Method | p50 | p90 | p99 | Relative to Baseline |
|---|---|---|---|---|
| No Compression (baseline) | 1.00 | 1.00 | 1.00 | - |
| SQ (wo/rescoring) + discardOriginals | 0.74 | 0.71 | 0.69 | ~30% faster |
| BQ (wo/rescoring) + discardOriginals | 0.72 | 0.69 | 0.67 | ~33% faster |

Table 2

Interpreting Table 2:

  • Values are normalized to the No Compression baseline (where 1.00 = baseline latency).
  • Lower numbers mean faster performance (e.g., 0.70 means 30% faster than baseline).
  • This is server-side latency, excluding network latency.
  • Most compression configurations show improved performance, with BQ+discardOriginals providing the best speed improvement.

Quality

We measured NDCG@10 and then compared each configuration’s score to the uncompressed baseline (0.40219).

Key Insight:

  • With rescoring (i.e., preserveOriginals), most configurations maintain full baseline quality (NDCG@10 ≈ 1.00).
  • SQ configurations maintain excellent quality (99-100% of baseline) even when discarding originals.
  • BQ with discardOriginals shows a small quality drop (96% of baseline), while MRL + BQ with discardOriginals shows a more noticeable drop (92% of baseline).

 

| Method | NDCG@10 Score | Relative to Baseline |
|---|---|---|
| No Compression (baseline) | 0.40219 | 1.00 |
| SQ (w/rescoring) + preserveOriginals | 0.40249 | 1.00 |
| SQ (wo/rescoring) + discardOriginals | 0.39999 | 0.99 |
| BQ (w/rescoring) + preserveOriginals | 0.40259 | 1.00 |
| BQ (wo/rescoring) + preserveOriginals | 0.39287 | 0.98 |
| BQ (w/rescoring) + discardOriginals | 0.39181 | 0.97 |
| BQ (wo/rescoring) + discardOriginals | 0.38733 | 0.96 |

Table 3

Interpreting Table 3:

  • NDCG@10: This metric measures how well your search system ranks relevant results within the top 10 positions, with higher scores indicating better performance.
  • The "relative" column shows how each compression method performs compared to uncompressed vectors, with 1.00 being identical quality and 0.96 meaning a slight degradation.

Note: This selection highlights quality impact across key configurations. With rescoring and preserveOriginals, quality remains at baseline levels across all compression methods. BQ with rescoring and discardOriginals—our newest feature—maintains 97% of baseline quality while providing significant storage savings. If you're using Semantic Ranking (SR), quality drops become even less noticeable, making aggressive compression options like MRL + BQ + discardOriginals viable even for quality-sensitive applications. Read more about our experiments combining compression, MRL, and SR in our latest Applied AI Blog post announcing Semantic Ranker updates.

Understanding the Technology

Scalar Quantization (SQ)

  • How It Works: Converts floating-point components to lower-precision numbers such as int8 (sketched below).
  • Pros: Typically yields ~75%+ index reduction with almost no quality loss (when rescoring), though results may vary depending on the specific embedding model used.
  • Cons: Doesn't compress as aggressively as BQ.
  • Best For: Balanced cost savings with minimal overhead. E.g., "SQ + preserveOriginals + rescoring" offers a 75% reduction in vector storage with no quality loss, though exact results depend on your embedding model.
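To build intuition for what quantization does (this is an illustrative sketch, not Azure AI Search's internal implementation), a minimal int8 scalar quantizer maps each float component onto a linear scale between an observed minimum and maximum:

```python
import numpy as np

def scalar_quantize(vectors: np.ndarray):
    # Learn a linear mapping from the observed float range onto int8.
    lo, hi = float(vectors.min()), float(vectors.max())
    scale = (hi - lo) / 255.0
    codes = np.round((vectors - lo) / scale - 128.0).astype(np.int8)
    return codes, lo, scale

def scalar_dequantize(codes, lo, scale):
    # Approximate reconstruction; the rounding error is the quality cost.
    return (codes.astype(np.float32) + 128.0) * scale + lo

vecs = np.random.randn(1000, 3072).astype(np.float32)
codes, lo, scale = scalar_quantize(vecs)
print(vecs.nbytes // codes.nbytes)  # 4: float32 -> int8 is a 4x reduction
```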

Binary Quantization (BQ)

  • How It Works: Floats become binary representations (1s and 0s), achieving the largest compression ratio, often >90% (sketched below). Binary quantization performs better with higher-dimensional vectors, as information loss during quantization becomes proportionally less significant.
  • Pros: Drastically reduces partition/SKU requirements, delivering maximum storage savings.
  • Cons: Slightly higher risk of quality drop if you discard originals, though it's often modest (92–96% of baseline).
  • Best For: Maximum cost savings. "BQ + discardOriginals" yields the best combination of lower partition count (hence lower cost) and faster search.
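Conceptually (again an illustrative sketch, not the service's internals), binary quantization keeps only the sign of each component, so first-pass candidate ranking reduces to cheap bitwise operations:

```python
import numpy as np

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    # One bit per dimension: 1 where the component is positive.
    # 3072 float32 values (12,288 bytes) become 384 bytes.
    return np.packbits(vectors > 0, axis=-1)

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    # Ranking over binary codes uses Hamming distance (XOR + popcount).
    return int(np.unpackbits(a ^ b).sum())

q = np.random.randn(3072).astype(np.float32)
doc = np.random.randn(3072).astype(np.float32)
print(hamming_distance(binary_quantize(q), binary_quantize(doc)))
```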

Matryoshka Representation Learning (MRL)

  • How It Works: Reduces dimensions by truncating the embedding vector (e.g., from 3072 → 768), as sketched below.
  • Pros: Combining MRL with BQ/SQ delivers the highest compression—sometimes <2% of the original size.
  • Cons: Requires using embeddings specifically designed for dimension truncation.
  • Best For: Use when you have or can adopt MRL-ready vector embedding models and want to minimize storage.

Tip: Our data indicates it's better to apply SQ or BQ first, and then use MRL if further compression is needed, rather than using MRL alone as a starting point without quantization.
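Because MRL-trained embeddings concentrate the most important information in the leading dimensions, truncation is just slicing and re-normalizing; combined with BQ, the savings compound. Here is a minimal sketch (in Azure AI Search itself, setting truncationDimension on the compression configuration does this for you):

```python
import numpy as np

def mrl_truncate(vector: np.ndarray, dims: int = 768) -> np.ndarray:
    # Keep the leading dimensions and re-normalize for cosine similarity.
    truncated = vector[:dims]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(3072).astype(np.float32)
small = mrl_truncate(full)       # 3072 -> 768 dims (4x smaller)
packed = np.packbits(small > 0)  # + BQ: 768 bits = 96 bytes per vector
print(full.nbytes, small.nbytes, packed.nbytes)  # 12288 3072 96
```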

Rescoring: Preserve vs. Discard Originals

  • PreserveOriginals: Keeps full-precision vectors for optional oversampling and second-pass reranking, preserving nearly baseline quality.
  • DiscardOriginals*: Provides maximum storage gains but prevents full-precision rescoring. Expect a small drop in retrieval accuracy (NDCG@10 ~0.92–0.96).

*As of the 2025-03-01-preview API version, binary quantization supports discardOriginals with rescoring. In this scenario, rescoring is calculated as the dot product of the full-precision query and the binary-quantized data in the index. A sketch of the oversample-then-rescore pattern follows.
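The pattern can be sketched as: fetch k × oversampling candidates using the compressed representation, then re-rank just those candidates with a higher-precision score and keep the top k. This is illustrative only; Azure AI Search performs the equivalent steps internally when rescoring is enabled.

```python
import numpy as np

def search_with_rescoring(query, binary_index, full_vectors, k=10, oversampling=2.0):
    # First pass: cheap Hamming ranking over binary codes, fetching extra
    # candidates (k * oversampling) to absorb quantization error.
    q_bits = np.packbits(query > 0)
    hamming = np.unpackbits(binary_index ^ q_bits, axis=-1).sum(axis=-1)
    candidates = np.argsort(hamming)[: int(k * oversampling)]

    # Second pass: rescore only those candidates with a full-precision dot
    # product (preserveOriginals case; with BQ + discardOriginals, the service
    # instead scores the full-precision query against the binary codes).
    scores = full_vectors[candidates] @ query
    return candidates[np.argsort(-scores)][:k]

docs = np.random.randn(10_000, 768).astype(np.float32)
bindex = np.packbits(docs > 0, axis=-1)
query = np.random.randn(768).astype(np.float32)
print(search_with_rescoring(query, bindex, docs))
```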

Implementation Example in Azure AI Search

When creating or updating your Search Index, you can specify compression in your VectorSearchProfile. For instance:

{
  "name": "myVectorSearchProfile",
  "algorithm": "myHnswConfig", // your HNSW settings
  "compression": "myCompressionConfig",
  "vectorizer": "myVectorizer"
}

Then define a compression kind with rescoring options. For instance:

{
  "compressions": [
    {
      "name": "myCompressionConfig",
      "kind": "binaryQuantization",
      "rescoringOptions": {
        "defaultOversampling": 2.0, // Start with 2x and increase as needed based on your quality requirements
        "enableRescoring": true,
        "rescoreStorageMethod": "discardOriginals"
      },
      "truncationDimension": null  // or 768 if using MRL-compatible embeddings
    }
  ]
}
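Putting the fragments together, here is a minimal sketch that creates an index with this compression configuration through the REST API, using the 2025-03-01-preview version mentioned earlier. The service URL, admin key, and field names are placeholders, and a vectorizer is omitted for brevity:

```python
import requests

SERVICE = "https://<your-service>.search.windows.net"  # placeholder
API_KEY = "<your-admin-key>"                           # placeholder

index = {
    "name": "compressed-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {
            "name": "embedding",
            "type": "Collection(Edm.Single)",
            "dimensions": 3072,
            "searchable": True,
            "vectorSearchProfile": "myVectorSearchProfile",
        },
    ],
    "vectorSearch": {
        "algorithms": [{"name": "myHnswConfig", "kind": "hnsw"}],
        "profiles": [
            {
                "name": "myVectorSearchProfile",
                "algorithm": "myHnswConfig",
                "compression": "myCompressionConfig",
            }
        ],
        "compressions": [
            {
                "name": "myCompressionConfig",
                "kind": "binaryQuantization",
                "rescoringOptions": {
                    "defaultOversampling": 2.0,
                    "enableRescoring": True,
                    "rescoreStorageMethod": "discardOriginals",
                },
                "truncationDimension": None,  # or 768 with MRL embeddings
            }
        ],
    },
}

resp = requests.put(
    f"{SERVICE}/indexes/{index['name']}",
    params={"api-version": "2025-03-01-preview"},
    headers={"api-key": API_KEY, "Content-Type": "application/json"},
    json=index,
)
resp.raise_for_status()
```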

Choosing the Right Compression Strategy

When selecting your compression approach, consider these key factors (a helper sketch encoding them follows the list):

  1. Budget constraints: If minimizing cost is your primary concern, BQ + discardOriginals offers the most dramatic savings (up to 92.5%) compared to No Compression.
  2. Quality sensitivity: If maintaining maximum quality while still opting for compression is critical:
    • Use preserveOriginals with rescoring for virtually no quality loss (any compression method).
    • If storage costs are still a concern, SQ (wo/rescoring) + discardOriginals offers an excellent compromise (99% quality retention).
  3. Speed requirements: Most compression methods improve speed, and BQ + discardOriginals offers the best performance (up to 33% faster). Remember that if you opt for rescoring, higher oversampling factors will also increase query latency.
  4. Embedding models: If you're using or can adopt MRL-compatible embeddings such as OpenAI's text-embedding-3-large or Cohere's Embed 4, combining MRL with BQ offers the most extreme compression while maintaining acceptable quality.
  5. Semantic Ranking impact: If you're using Semantic Ranking (SR) in your search pipeline:
    • You can employ more aggressive compression configurations with minimal quality impact.
    • SR effectively compensates for minor relevance losses from compression, allowing you to maximize cost savings without compromising user experience.
    • This is especially valuable for large-scale applications where storage costs would otherwise be prohibitive.
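To make these rules of thumb concrete, here is a small helper that encodes them (a sketch based on the tables in this post; calibrate it against your own benchmarks):

```python
def recommend_compression(priority: str, mrl_embeddings: bool = False) -> str:
    # Rules of thumb distilled from Tables A1-A3.
    if priority == "cost":
        # ~92.5% cheaper; ~97% quality (BQ), ~92% with MRL, per Table A3.
        base = "BQ + discardOriginals (w/rescoring)"
        return ("MRL + " + base) if mrl_embeddings else base
    if priority == "quality":
        # preserveOriginals + rescoring holds ~100% of baseline NDCG@10.
        return "SQ or BQ + preserveOriginals (w/rescoring)"
    if priority == "speed":
        # ~33% faster at p50-p99, per Table A2.
        return "BQ (wo/rescoring) + discardOriginals"
    raise ValueError("priority must be 'cost', 'quality', or 'speed'")

print(recommend_compression("cost", mrl_embeddings=True))
# MRL + BQ + discardOriginals (w/rescoring)
```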

With Azure AI Search compression, you can drastically reduce your storage footprint—and therefore cost—while maintaining fast queries and high (or near-baseline) relevance.

For more details, check out our latest documentation on compression.

Ready to take action?

  • Benchmark your current vector index size, cost, and search quality.
  • Choose the compression approach that fits your Cost/Speed/Quality priorities.
  • Test in a sandbox environment, measure the impact, and verify relevance vs. baseline.
  • Roll out to production once you confirm the desired outcomes.

Check out this Python notebook on how you can compare storage sizes across different vector compression configurations.
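If you just want a quick comparison without the notebook, the index statistics endpoint reports both total storage and vector index size. A minimal sketch (service URL, key, and index names are placeholders; the vectorIndexSize field is returned on recent API versions):

```python
import requests

SERVICE = "https://<your-service>.search.windows.net"  # placeholder
API_KEY = "<your-admin-key>"                           # placeholder

def index_stats(name: str) -> dict:
    resp = requests.get(
        f"{SERVICE}/indexes/{name}/stats",
        params={"api-version": "2024-07-01"},
        headers={"api-key": API_KEY},
    )
    resp.raise_for_status()
    return resp.json()

# Compare an uncompressed index against a compressed one (hypothetical names).
for name in ["baseline-index", "compressed-index"]:
    stats = index_stats(name)
    print(name, stats["storageSize"], stats.get("vectorIndexSize"))
```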

Don't be afraid to scale your GenAI solutions! Azure AI Search has you covered! 

We welcome your feedback and questions. Drop a comment below or visit https://feedback.azure.com to share your ideas!

Appendix: Extended Configuration Tables

Table A1: Complete Index Configuration and Cost Comparison

| Compression Method | Vector Index Size (GB) | Disk Storage (GB) | Cost per Month (Services Created After Nov 18, 2024) | Best SKU | Min Partitions |
|---|---|---|---|---|---|
| No Compression | 109.13 | 112.29 | $1,000 | S1 | 4 |
| SQ (w/rescoring) | 27.65 | 139.45 | $250 | S1 | 1 |
| SQ (wo/rescoring) | 27.65 | 139.45 | $250 | S1 | 1 |
| SQ + discardOriginals | 27.65 | 30.80 | $250 | S1 | 1 |
| BQ (w/rescoring) | 3.88 | 115.68 | $250 | S1 | 1 |
| BQ (wo/rescoring) | 3.88 | 115.68 | $250 | S1 | 1 |
| BQ + discardOriginals (w/rescoring) | 3.88 | 7.04 | $75 | Basic | 1 |
| BQ + discardOriginals (wo/rescoring) | 3.88 | 7.04 | $75 | Basic | 1 |
| MRL + SQ (w/rescoring) | 7.27 | 119.07 | $250 | S1 | 1 |
| MRL + SQ (wo/rescoring) | 7.27 | 119.07 | $250 | S1 | 1 |
| MRL + SQ + discardOriginals | 7.27 | 10.43 | $150 | Basic | 2 |
| MRL + BQ (w/rescoring) | 1.33 | 113.13 | $250 | S1 | 1 |
| MRL + BQ (wo/rescoring) | 1.33 | 113.13 | $250 | S1 | 1 |
| MRL + BQ + discardOriginals (w/rescoring) | 1.33 | 4.49 | $75 | Basic | 1 |
| MRL + BQ + discardOriginals (wo/rescoring) | 1.33 | 4.49 | $75 | Basic | 1 |

Summary of Table A1: This comprehensive table shows all tested configurations and their impact on storage and cost. The most aggressive compression methods (MRL + BQ + discardOriginals) reduce the vector index size from 109 GB to just 1.33 GB—a 99% reduction. This allows the index to fit on a Basic SKU with a single partition, reducing monthly costs by 92.5% compared to the uncompressed baseline.

Table A2: Complete Performance Comparison (Relative Latency)

| Compression Method | p50 | p90 | p99 |
|---|---|---|---|
| No Compression | 1.00 | 1.00 | 1.00 |
| SQ (w/rescoring) + preserveOriginals | 0.87 | 0.85 | 0.84 |
| SQ (wo/rescoring) + preserveOriginals | 0.84 | 0.80 | 0.79 |
| SQ (wo/rescoring) + discardOriginals | 0.74 | 0.71 | 0.69 |
| BQ (w/rescoring) + preserveOriginals | 1.22 | 2.23 | 2.59 |
| BQ (wo/rescoring) + preserveOriginals | 0.82 | 0.79 | 0.80 |
| BQ (w/rescoring) + discardOriginals | 0.76 | 0.73 | 0.71 |
| BQ (wo/rescoring) + discardOriginals | 0.72 | 0.69 | 0.67 |
| MRL + SQ (w/rescoring) + preserveOriginals | 0.96 | 1.10 | 1.20 |
| MRL + SQ (wo/rescoring) + preserveOriginals | 0.82 | 0.80 | 0.82 |
| MRL + SQ (wo/rescoring) + discardOriginals | 0.73 | 0.70 | 0.68 |
| MRL + BQ (w/rescoring) + preserveOriginals | 0.80 | 0.77 | 0.76 |
| MRL + BQ (wo/rescoring) + preserveOriginals | 0.76 | 0.73 | 0.71 |
| MRL + BQ (w/rescoring) + discardOriginals | 0.75 | 0.72 | 0.70 |
| MRL + BQ (wo/rescoring) + discardOriginals | 0.72 | 0.69 | 0.67 |

Summary of Table A2: Most compression methods (except BQ with rescoring and preserveOriginals) improve query latency across all percentiles. The most consistent performance improvements come from configurations that discard originals, which show around 30% faster response times. This is likely due to the reduced memory footprint and simplified query processing without needing to access the original vectors.

 

Table A3: Complete Relevance Quality Comparison (NDCG@10)

| Compression Method | NDCG@10 | Relative NDCG@10 |
|---|---|---|
| No Compression | 0.40219 | 1.00 |
| SQ (w/rescoring) + preserveOriginals | 0.40249 | 1.00 |
| SQ (wo/rescoring) + preserveOriginals | 0.40188 | 1.00 |
| SQ (wo/rescoring) + discardOriginals | 0.39999 | 0.99 |
| BQ (w/rescoring) + preserveOriginals | 0.40259 | 1.00 |
| BQ (wo/rescoring) + preserveOriginals | 0.39287 | 0.98 |
| BQ (w/rescoring) + discardOriginals | 0.39181 | 0.97 |
| BQ (wo/rescoring) + discardOriginals | 0.38733 | 0.96 |
| MRL + SQ (w/rescoring) + preserveOriginals | 0.40224 | 1.00 |
| MRL + SQ (wo/rescoring) + preserveOriginals | 0.39793 | 0.99 |
| MRL + SQ (wo/rescoring) + discardOriginals | 0.39375 | 0.98 |
| MRL + BQ (w/rescoring) + preserveOriginals | 0.40024 | 1.00 |
| MRL + BQ (wo/rescoring) + preserveOriginals | 0.35704 | 0.89 |
| MRL + BQ (w/rescoring) + discardOriginals | 0.37192 | 0.92 |
| MRL + BQ (wo/rescoring) + discardOriginals | 0.35314 | 0.88 |

Summary of Table A3: This table provides a complete quality comparison across all tested configurations. The data shows that preserving originals with rescoring maintains full baseline quality (NDCG@10 ≈ 1.00) regardless of compression method. SQ configurations show minimal quality impact even when discarding originals (99% of baseline), while BQ configurations show a modest drop (96-97%). The most aggressive compression (MRL+BQ+discardOriginals) still maintains 92% of baseline quality with rescoring enabled, which may be acceptable for many use cases, especially when combined with Semantic Ranking.

Note: All configurations use HNSW defaults. MRL tests use dimension=768. Configurations with rescoring use oversampling=10. The dataset used is mteb/msmarco with 8.8M vectors.

 
