
Raising the bar for RAG excellence: query rewriting and new semantic ranker

Nov 19, 2024

Introducing an improved generative query engine with query rewriting and a next-generation reranking model

By Alec Berntson, Alina Stoica Beck, Amaia Salvador Aguilera, Farzad Sunavala, Thibault Gisselbrecht and Xianshun Chen

 

We are announcing two major updates that raise the bar again for RAG retrieval: a new capability, query rewriting, and a new model for semantic ranker. These enhancements set new performance standards for relevance and latency across many benchmarks.

  • Generative Query Rewriting (QR)
    • Powered by a fine-tuned Small Language Model (SLM) optimized for low latency.
    • Creates up to 10 query transformations to improve recall.
    • Improves search relevance, especially for term-based indexes and when recall is low (+4 points NDCG@3).
    • Included at no additional cost with semantic ranker queries.
  • New semantic ranker (SR)
    • A cross-encoder model that reranks the top 50 results.
    • Delivers significant improvements in relevance and performance (tested on over 90 datasets across 19 languages; up to +22 points NDCG@3 improvement when combined with QR).
    • Up to 2.3 times lower latency than the previous model.
    • Includes refreshed models for answers, captions, and highlights.
    • Requires zero configuration changes – automatic upgrade effective November 19, 2024.

Last year, we showed that vector search alone is not sufficient for adequate relevance; combining hybrid search with reranking proved to be the most successful retrieval stack for delivering high quality results. Since then, using a reranker and hybrid search in your RAG app has become table stakes.

This year, we have expanded and improved our query pipeline performance, with model training and optimizations based on insights from in-production RAG applications and billions of queries a day. With the addition of query rewriting and a new model for semantic ranker, our retrieval stack has surpassed our previous leading results from a year ago.

Performance Benchmarks

Performance Summary

Background

The query stack in Azure AI Search follows a pattern that's often used in sophisticated search systems, where there are two main layers of execution: recall (L1) and ranking (L2). A more detailed explanation can be found in section 1 of our previous post, but in short, the L1 recalls documents from customer indexes using either text, vector, or hybrid (a combination of the two) representations. Ranking, using semantic ranker, reorders the top 50 documents from the L1 to put the best content first (e.g. as input to an LLM for RAG). Query rewriting is a new step that runs before the L1. It produces up to 10 different rewrites, which are sent together to the L1 to improve its recall.
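To make that flow concrete, here is a minimal sketch of the two-layer pipeline. The helper functions (rewrite_query, l1_search, semantic_rerank) are hypothetical stand-ins for stages the managed service runs internally:

```python
from typing import Callable, Dict, List

def retrieve_for_rag(
    query: str,
    rewrite_query: Callable[[str], List[str]],         # QR: SLM producing up to 10 rewrites
    l1_search: Callable[[List[str]], List[Dict]],       # L1: text / vector / hybrid recall
    semantic_rerank: Callable[[str, List[Dict]], List[Dict]],  # L2: cross-encoder reranker
    top_k: int = 3,
) -> List[Dict]:
    """Minimal sketch of the QR -> L1 -> L2 flow; the real stages run inside the service."""
    # 1) Query rewriting: expand the original query before recall.
    queries = [query] + rewrite_query(query)
    # 2) L1 recall: the rewrites are issued together and produce one candidate set.
    candidates = l1_search(queries)[:50]
    # 3) L2 ranking: the cross-encoder reorders the top candidates against the
    #    original query so the best chunks come first (e.g. for the LLM prompt).
    return semantic_rerank(query, candidates)[:top_k]
```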

Summary

We tested Azure AI Search with the new QR + SR models on a multitude of datasets spanning many different use cases. The insights below compare hybrid L1 with QR and the new SR against hybrid L1 alone. While the two models can be used with text-only or vector-only L1s, hybrid L1 is a strong baseline that usually offers the best L1 recall.

For customers with typical content:

  • Enabling QR + SR on an existing index yields an impressive +22 point NDCG@3 gain over hybrid search (text-embedding-3-large, 3072 dimensions) alone. This is almost twice the improvement delivered by the previous production ranker (which achieved +13 points over hybrid L1).
  • QR + SR lets you achieve results comparable to text-embedding-3-large at 3072 dimensions when using an older embedding model such as text-embedding-ada-002, or a BQ (binary quantization) compressed embedding model such as text-embedding-3-large at 256 dimensions.
  • QR + SR relevance improvements are consistent across content segments and languages, as shown by numerous tests on representative content. One evaluation dataset alone has 70 representative indexes spanning typical customer content types and languages.
  • QR and SR are trained to support over 50 languages and benchmarked here against 19.
  • QR and SR are both heavily optimized for latency; SR can rerank a batch of fifty 2048 token documents in an average of 158ms, while QR can generate ten rewrites for a 32 token query in 147ms.
  • QR is extremely useful across the board for text-based L1s (up to +4 points) and for hybrid L1s where retrieval quality tends to be poor or where queries are misspelled or contain text and numeric combinations (up to +6 points).

Retrieval Performance Across Different Search Configurations and Datasets

Table 1 compares the performance of different search configurations, including the new semantic ranker and query rewriting, on many different datasets.

Summary Metrics (NDCG)

| Embedding model | Search configuration | BEIR (open source) | Miracl (open source) | Customer (internal) | MIML (internal) | Support (internal) | Average | Delta* |
|---|---|---|---|---|---|---|---|---|
| n/a | Text only | 41.6 | 49.5 | 43.5 | 46.1 | 38.4 | 42.6 | |
| text-embedding-ada-002 | Vector | 47.3 | 57.3 | 38.0 | 51.9 | 45.5 | 45.1 | |
| text-embedding-ada-002 | Hybrid | 50.2 | 59.1 | 45.3 | 53.1 | 46.9 | 48.4 | |
| text-embedding-ada-002 | Hybrid + SR (legacy) | 51.3 | 72.0 | 62.1 | 67.7 | 59.7 | 63.1 | +14.7 |
| text-embedding-ada-002 | Hybrid + SR (new) | 55.0 | 78.0 | 68.2 | 75.0 | 69.9 | 71.1 | +22.7 |
| text-embedding-ada-002 | Hybrid + QR + SR (new) | 54.9 | 78.1 | 68.1 | 75.5 | 71.5 | 71.7 | +23.3 |
| text-embedding-3-large 256-dim BQ | Vector | 47.5 | 73.4 | 37.1 | 53.4 | 31.0 | 40.5 | |
| text-embedding-3-large 256-dim BQ | Hybrid | 49.6 | 69.5 | 45.7 | 55.7 | 43.5 | 48.3 | |
| text-embedding-3-large 256-dim BQ | Hybrid + SR (legacy) | 51.5 | 71.8 | 62.4 | 68.5 | 57.9 | 62.9 | +14.6 |
| text-embedding-3-large 256-dim BQ | Hybrid + SR (new) | 55.3 | 79.0 | 68.5 | 75.9 | 68.1 | 70.8 | +22.5 |
| text-embedding-3-large 256-dim BQ | Hybrid + QR + SR (new) | 55.2 | 78.9 | 68.4 | 76.2 | 70.6 | 71.7 | +23.4 |
| text-embedding-3-large 3072-dim | Vector | 54.6 | 78.1 | 42.6 | 60.3 | 35.0 | 46.0 | |
| text-embedding-3-large 3072-dim | Hybrid | 53.0 | 70.5 | 48.5 | 58.2 | 46.9 | 51.2 | |
| text-embedding-3-large 3072-dim | Hybrid + SR (legacy) | 52.1 | 71.8 | 63.3 | 69.5 | 59.3 | 64.0 | +12.8 |
| text-embedding-3-large 3072-dim | Hybrid + SR (new) | 56.3 | 79.3 | 69.4 | 77.2 | 70.3 | 72.3 | +21.1 |
| text-embedding-3-large 3072-dim | Hybrid + QR + SR (new) | 55.9 | 79.4 | 69.3 | 77.2 | 72.0 | 72.8 | +21.6 |

Table 1: Summary metrics across benchmarks. BEIR and Miracl use NDCG@10; internal datasets use NDCG@3. * Delta is against the hybrid L1 baseline. The Customer, MIML, and Support datasets are described below.

 

Benchmark selection and design

QR + SR delivers substantial improvements over the previous stack. Academic benchmarks are shown for transparency, but it is important to look at the queries and documents included in them to ensure they align with production scenarios. We have sought to improve our measurement along several dimensions:

  • Retrieval benchmarks typically use very small documents (e.g. 99.8% of BEIR documents have at most 1024 tokens). We use documents from real RAG scenarios that range from 10s to 100s of pages, requiring chunking.
  • Retrieval benchmarks are typically sparsely labeled (e.g. only one or two labeled documents per query) at low granularity (e.g. binary “good/bad” labels). This does not provide high measurement resolution and does not give credit to alternative documents that contain the same information. We labeled an average of 160 query-document chunk pairs per query with 5 levels of granularity.
  • Most retrieval benchmarks are based on Wikipedia documents or highly processed text segments from other sources. We use real documents that require ingestion and processing of complex formatting and may have OCR and parsing artifacts.

We used three sources of documents for our internal benchmarks:

  • One ("Customer") with several document sets provided with permission from Azure customers. Queries are both from real users and synthetically created. These document sets range from several hundred documents to several million.
  • One ("Support") is sourced from hundreds of thousands of public support/knowledge base articles where the same documents were present in many different languages. We tested against 9 of these languages. All queries are from real users.
  • One (Multi-industry, Multi-language – "MIML") is a collection of 70 indexes that we created from publicly available documents. These were filtered and clustered to represent typical documents from 10 customer segments and 7 languages. Each segment language pair has approximately 1000 documents and 1500 queries (real user and synthetically generated). There are about 100k total queries and 70k total documents.

The synthetic queries were generated by prompting several different large language models in more than ten different ways. Numerous quality control steps were followed to filter malformed or non-representative queries. We further reviewed these queries with real customers. The resulting queries are of many types, from keyword and web search-like, to complex questions.

We use benchmark-provided labels to produce metrics for public datasets (BEIR and Miracl). For our internal benchmarks, we calibrated a large language model (GPT-4o) prompt against thousands of high-quality human labels. This prompt took in a query-document pair and generated a relevance score on a 5-point scale. For every internal benchmark experiment, we scored the top 50 documents to ensure there were no measurement holes. Public datasets have many holes and we left them as-is to enable external comparisons.
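Purely as an illustration of the shape of such a judge (not the calibrated production prompt), the scoring step can be sketched like this, with call_llm standing in for whichever GPT-4o client you use:

```python
import re

# Hypothetical judge prompt; the production prompt was calibrated against
# thousands of high-quality human labels.
JUDGE_PROMPT = """You are grading search relevance.
Query: {query}
Document chunk: {chunk}
Rate how well the chunk answers the query on a 5-point scale:
0 = irrelevant, 1 = marginal, 2 = partially relevant, 3 = relevant, 4 = perfect.
Answer with a single digit."""

def judge_relevance(query: str, chunk: str, call_llm) -> int:
    """Score one query-document pair on a 0-4 scale with an LLM judge.

    call_llm is a hypothetical callable (prompt -> completion string)
    wrapping GPT-4o or a similar model.
    """
    completion = call_llm(JUDGE_PROMPT.format(query=query, chunk=chunk))
    match = re.search(r"[0-4]", completion)
    return int(match.group()) if match else 0
```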

For each query, we used all the data that was labeled for the different methods to compute the ideal ranking of documents for that query. We compared the ranking provided by each method to this ideal ranking to compute NDCG@3. This makes the different methods and search configurations comparable to each other.
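For reference, this is the standard NDCG@k computation; the sketch below uses linear gains, which is an assumption about the exact gain function:

```python
import math
from typing import List

def dcg(gains: List[float], k: int) -> float:
    # Discounted cumulative gain over the top-k positions.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains: List[float], all_labeled_gains: List[float], k: int = 3) -> float:
    """NDCG@k: DCG of a method's ranking divided by DCG of the ideal ranking.

    ranked_gains are the relevance labels (e.g. 0-4) of the documents in the
    order the method returned them; all_labeled_gains are the labels of every
    document judged for the query (pooled across methods), used to build the
    ideal ranking.
    """
    ideal = dcg(sorted(all_labeled_gains, reverse=True), k)
    return dcg(ranked_gains, k) / ideal if ideal > 0 else 0.0

# Example: a method returning labels [3, 4, 2] for a query whose labeled pool
# contains [4, 4, 3, 2, 1, 0] scores below 1.0 because the ideal order is [4, 4, 3].
```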

Datasets that closely align with real customer challenges allow us to produce models that are robust when they encounter real-world complexity. The improvements we see across all these datasets demonstrate this.

Search configurations

Azure AI Search enables several different L1 configurations: text-based, vector-based, and hybrid, which combines results from both using Reciprocal Rank Fusion (RRF). QR + SR can be composed with any of these options. In this post we focused on hybrid search because it is a high baseline; we observed even larger relevance improvements when using vector-only or text-only L1s.
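For intuition, here is a minimal sketch of RRF fusion of two ranked lists; the constant k=60 is the commonly cited default, not a statement about the service's internal parameters:

```python
from collections import defaultdict
from typing import Dict, List

def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: merge several ranked lists of document ids.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked highly by either retriever rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hybrid L1 sketch: fuse a text (BM25-style) ranking with a vector ranking.
hybrid = rrf_fuse([["doc2", "doc7", "doc1"], ["doc7", "doc3", "doc2"]])
```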

We broke documents into 1024-token chunks (no overlap was used in this analysis) because we observe customers commonly using this size. 1024-token chunks are on the larger side but still preserve the option to combine multiple chunks for LLM consumption and keep multi-turn context windows manageable.
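A simple way to reproduce that chunking setup in your own ingestion pipeline is fixed-size token windows with no overlap; the tokenizer choice below (cl100k_base via tiktoken) is an assumption for illustration:

```python
import tiktoken
from typing import List

def chunk_by_tokens(text: str, max_tokens: int = 1024) -> List[str]:
    """Split a document into fixed-size token chunks with no overlap,
    mirroring the 1024-token, zero-overlap setup used in this analysis."""
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer choice is illustrative
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```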

Result Insights

Insight: QR+SR expands embedding model flexibility and improves quality

Embedding models present an index maintenance challenge: when a new model is released, all documents need to be re-embedded to get the gains. It's also likely that you'll want to use Binary Quantization (BQ) to reduce the storage cost of vectors. Older embedding models (text-embedding-ada-002) and embedding models using BQ (text-embedding-3-large at 256 dimensions) do not achieve the same performance (-2.8 and -2.9 points respectively) in a hybrid search setup as a modern, full-dimension embedding model (text-embedding-3-large at 3072 dimensions).

 

| Embedding model | Dimensions | Vector-only | Hybrid | Hybrid + QR + SR |
|---|---|---|---|---|
| text-embedding-ada-002 | 1536 | 45.1 | 48.4 | 71.7 |
| text-embedding-3-large | 256 (BQ) | 40.5 | 48.3 | 71.7 |
| text-embedding-3-large | 3072 | 46.0 | 51.2 | 72.8 |

Table 2: NDCG@3 comparison of different embedding models shows that Hybrid + QR + SR offers robust relevance.

 

Our data in Table 2 shows that QR + SR can significantly improve relevance over L1. By promoting the best document chunks to the top of the recall set, it provides substantial relevance gains over a best-in-class embedding model and can bring older or heavily compressed embedding models to within a point of the best search configuration. This allows you to save on re-embedding costs or use storage-optimized techniques with minimal relevance penalty.
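For readers unfamiliar with BQ, here is a minimal illustration of the idea (keep only the sign of each dimension and pack the bits); the service's exact quantization scheme is not assumed here:

```python
import numpy as np

def binary_quantize(embedding: np.ndarray) -> np.ndarray:
    """Illustrative binary quantization: keep the sign of each dimension and
    pack 8 dimensions per byte, shrinking a float32 vector roughly 32x."""
    bits = (embedding > 0).astype(np.uint8)  # 1 if positive, else 0
    return np.packbits(bits)                 # 8 dimensions -> 1 byte

vec = np.random.randn(3072).astype(np.float32)  # e.g. a text-embedding-3-large vector
packed = binary_quantize(vec)                   # 3072 floats (12 KB) -> 384 bytes
```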

 

Insight: QR+SR gives large relevance gains across Language/Content Segments

Table 3 shows the NDCG@3 gain (absolute increase) of enabling QR + SR over traditional hybrid search in the MIML dataset. text-embedding-3-large at 256 dimensions with BQ is the embedding model used for this example. Each segment-language pair represents about 1500 queries searching against 1000 documents split into 1024-token chunks. The relevance improvement is consistently substantial across all content segments and languages (average +21 points, with a range from +12 to +28 points). This gives confidence that QR + SR should be enabled by default on RAG applications to deliver the best retrieved content.

 

| Content segment | German | English | Spanish | French | Japanese | Chinese (simplified) | Chinese (traditional) | Average |
|---|---|---|---|---|---|---|---|---|
| Accounting & Tax Services | +22.0 | +25.2 | +24.4 | +22.6 | +21.5 | +20.8 | +17.3 | +22.0 |
| Banking | +21.8 | +24.5 | +23.6 | +22.5 | +22.9 | +19.6 | +19.1 | +22.0 |
| Government Administration | +23.1 | +27.5 | +27.8 | +23.9 | +22.3 | +19.1 | +20.4 | +23.5 |
| Healthcare Administration | +20.1 | +24.7 | +21.7 | +19.5 | +20.3 | +15.5 | +16.8 | +19.8 |
| Healthcare Research | +19.3 | +21.1 | +21.6 | +17.6 | +17.0 | +14.5 | +12.2 | +17.6 |
| Human Resources | +17.0 | +25.1 | +20.6 | +17.5 | +21.6 | +12.0 | +12.9 | +18.1 |
| Industrial and Manufacturing | +21.4 | +20.4 | +20.6 | +18.4 | +19.5 | +16.1 | +20.1 | +19.5 |
| Insurance | +19.9 | +25.0 | +22.7 | +23.0 | +17.2 | +15.8 | +18.3 | +20.3 |
| Legal | +22.8 | +26.8 | +24.9 | +21.0 | +17.9 | +18.9 | +16.8 | +21.3 |
| Product Manuals | +20.9 | +23.5 | +21.9 | +17.8 | +18.3 | +27.4 | +18.2 | +21.2 |
| Average | +20.8 | +23.8 | +23.0 | +20.4 | +19.9 | +18.0 | +17.2 | +20.7 |

Table 3: NDCG@3 gain (absolute increase) from enabling QR+SR over the hybrid search baseline. The last row and column are averages.

 

Insight: QR+SR gives large relevance gains across Query Types

We classified the MIML and Support queries into categories to give insight into how the search stack performs in different scenarios. Tables 4 and 5 aggregate the queries of each type across languages and content segments and show a consistent story: enabling QR + SR provides a large benefit in all scenarios.

A few observations:

  • In workloads where query types are diverse, the different L1s (text-based, vector-based, and hybrid) each have strengths and weaknesses. Text-based search is still the best for queries looking for exact strings (e.g. product IDs); it far outperforms vector retrieval for codewords. If you try to balance performance by combining vector retrieval with text-based retrieval (hybrid search), you might still observe a loss in relevance compared to text-only search, as shown by the Support dataset. However, adding QR+SR recovers from this deficit and significantly outperforms any L1-only configuration.
  • 'Hard' queries (those from across all categories where hybrid search with text-embedding-3-large at 3072 dimensions is not able to find at least 5 reasonably relevant candidates in the top 50) greatly benefit from QR. The Support dataset (all real queries) sees a 4-point gain from QR in this category.
  • QR is also most beneficial for keyword-like queries (keyword, codeword) and misspellings. These are generally the hardest (lowest L1 relevance) classes of queries.

 

| Query type | Text | Vector | Hybrid | Hybrid + SR | Hybrid + QR + SR | Delta vs Hybrid |
|---|---|---|---|---|---|---|
| acronym | 40.4 | 31.4 | 45.8 | 68.5 | 69.9 | +24.1 |
| codeword | 36.2 | 17.4 | 32.3 | 58.6 | 64.4 | +32.1 |
| concept | 28.3 | 41.3 | 44.8 | 64.3 | 65.1 | +20.3 |
| fact | 28.6 | 42.0 | 45.4 | 64.8 | 65.0 | +19.6 |
| keyword | 40.9 | 33.9 | 47.5 | 71.4 | 73.3 | +25.8 |
| misspellings | 23.5 | 31.9 | 38.4 | 60.8 | 63.2 | +24.8 |
| question | 28.6 | 42.6 | 47.0 | 67.1 | 68.4 | +21.4 |
| web-like | 37.0 | 39.1 | 49.3 | 71.5 | 72.9 | +23.6 |
| easy | 43.3 | 38.1 | 52.2 | 75.6 | 76.6 | +24.4 |
| hard | 23.5 | 25.6 | 30.8 | 54.3 | 58.2 | +27.4 |

Table 4: Support benchmark by category (NDCG@3). text-embedding-3-large (3072 dimensions) used for all vector and hybrid L1s.

 

| Query type | Text | Vector | Hybrid | Hybrid + SR | Hybrid + QR + SR | Delta vs Hybrid |
|---|---|---|---|---|---|---|
| acronym | 55.9 | 61.8 | 63.3 | 79.6 | 79.8 | +16.5 |
| codeword | 58.6 | 53.4 | 60.0 | 80.5 | 80.8 | +20.8 |
| concept | 44.3 | 63.8 | 59.3 | 76.6 | 76.8 | +17.5 |
| fact | 48.9 | 58.0 | 58.5 | 80.3 | 80.4 | +21.9 |
| keyword | 47.6 | 61.3 | 59.5 | 76.9 | 76.9 | +17.4 |
| misspellings | 38.9 | 57.5 | 54.6 | 71.9 | 72.1 | +17.5 |
| question | 45.4 | 60.3 | 57.8 | 78.0 | 78.1 | +20.3 |
| web-like | 47.6 | 64.9 | 60.4 | 78.3 | 78.4 | +18.0 |
| easy | 44.7 | 63.8 | 60.4 | 77.5 | 77.0 | +16.6 |
| hard | 46.5 | 56.2 | 55.3 | 76.2 | 76.8 | +21.5 |

Table 5: MIML benchmark by category (NDCG@3). text-embedding-3-large (3072 dimensions) used for all vector and hybrid L1s.

 

Insight: Query rewriting greatly improves text-based relevance

We can see that QR is particularly helpful when the L1 is text-only. On the MIML and Support datasets, QR improves relevance by 2 to 4 more points with a text-based L1 than with a hybrid L1. A large number of customers have text-only indexes, and this is a great way to substantially improve relevance when adding a vector representation (to get a hybrid L1) is not feasible.

This QR implementation creates new words and phrases to represent the original query; these new terms expand the match footprint, which increases recall. This is somewhat less effective for embedding-based representations because semantic matching is already part of vector retrieval.
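A toy example of that match-footprint effect for a term-based L1: the union of terms across the original query and its rewrites is what the text index can now match on. The query and rewrites below are invented for illustration only:

```python
from typing import List, Set

def expanded_terms(original: str, rewrites: List[str]) -> Set[str]:
    """Union of terms a term-based (BM25-style) L1 can match after rewriting.
    Purely illustrative of why rewrites lift recall for text indexes."""
    terms = set(original.lower().split())
    for rewrite in rewrites:
        terms |= set(rewrite.lower().split())
    return terms

# A document that says "re-register your multi-factor authentication device"
# but never uses the literal words "reset MFA" can now match on term overlap.
print(expanded_terms("reset MFA", ["re-register multi-factor authentication device"]))
```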

| Dataset | L1 type | L1 | L1 + SR | L1 + QR + SR | QR gain over SR |
|---|---|---|---|---|---|
| Support | Text | 38.4 | 61.7 | 65.2 | +3.5 |
| Support | Hybrid | 46.9 | 70.3 | 72.0 | +1.7 |
| MIML | Text | 46.1 | 66.3 | 70.3 | +4.0 |
| MIML | Hybrid | 58.2 | 77.2 | 77.2 | 0.0 |

Table 6: Impact of QR on Text vs Hybrid Search (NDCG@3).

Model Creation and Optimization

Model training

We listened to customer feedback and observed industry trends over the past year since our last model update. Several strategies were used to train QR and a new SR:

  • Simplify the internal stack: We simplified internal components so that we could co-train the various capabilities of SR: ranking, captions, highlights and QnA. This resulted in better performance as a single model.
  • Increase the SR input context: We observed the use of larger queries and larger chunk sizes, so we trained an SR model that natively supports up to 2048 tokens. We determined this size offers a balance of relevance, runtime latency, and cost. Longer chunks are condensed with an extractive summarization algorithm so that they fit.
  • Utilize the latest training techniques: We used flash attention to enable larger batch sizes and faster training cycles for SR. Because QR is an SLM, we trained the model on completion tokens only (Supervised Fine-tuning Trainer); see the sketch after this list.
  • High quality training data: We constructed training data to emulate typical workloads using several sources of data across customer industries and following common index ingestion strategies. A wide range of languages (>50) and content formats was included. For SR we varied the size of chunks used; similarly, for QR we used many different sized queries.
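As a generic sketch of completion-only supervised fine-tuning (this illustrates the technique, not Microsoft's training code): the prompt tokens' labels are masked so the loss is computed only on the completion, i.e. the generated rewrites.

```python
import torch

IGNORE_INDEX = -100  # label value ignored by torch.nn.CrossEntropyLoss

def build_sft_example(prompt_ids: list, completion_ids: list) -> dict:
    """Completion-only SFT example: the model sees prompt + completion,
    but the loss is computed only over the completion tokens."""
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
    }
```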

 

Latency optimization

Latency is important for production environments, especially for RAG applications where retrieval and LLMs must work together for each response. We used an internal package that bundles CUDA-level post-training optimizations in our runtime stack to enable much faster inference for the same payload/model size. This allows our new model to run up to 2.3 times faster than the previous one, as shown in Table 7.

Typical incremental latency for SR (milliseconds)

| Document length (tokens) | Previous SR model | New SR model |
|---|---|---|
| 128 | 44 | 50 |
| 256 | 116 | 59 |
| 512 | 182 | 78 |
| 1024 | 210 | 113 |
| 2048 | 243 | 158 |

Table 7: Average model execution latency in a test environment to rank 50 documents (all with the same specified token length). All SR features enabled (captions, highlights, answers).

 

This is also true for QR (Table 8), where we fine-tuned and optimized an SLM with over 1 billion parameters to run within a tight latency budget.

Typical incremental latency for QR (milliseconds)

| Query length (tokens) | QR model |
|---|---|
| 95 | 116 |
| 16 | 144 |
| 32 | 147 |
| 48 | 167 |

Table 8: Average model execution latency in test environment to generate 10 rewrites for queries with the specified token length.

Getting started with SR + QR in Azure AI Search

The system we evaluated in this work is available in production for developers to use in their own applications as part of the Azure AI Search service. You can learn more about Azure AI Search in the service documentation, or jump directly to the documentation for semantic ranker or query rewriting.
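As a minimal sketch of what such a query can look like over the REST API (field names follow our understanding of the 2024-11-01-preview version that introduced query rewriting; treat the exact request shape, service URL, index name, and semantic configuration name as assumptions and check the documentation):

```python
import requests

# Assumed placeholders: service name, index name, API key, and a semantic
# configuration named "default" defined on the index.
SERVICE = "https://<your-service>.search.windows.net"
URL = f"{SERVICE}/indexes/<your-index>/docs/search?api-version=2024-11-01-preview"

body = {
    "search": "how do I rotate my storage account keys",
    "queryType": "semantic",                # enables the semantic ranker (L2)
    "semanticConfiguration": "default",
    "queryRewrites": "generative|count-5",  # generative query rewriting (QR)
    "queryLanguage": "en-US",               # required when queryRewrites is set
    "top": 3,
    "captions": "extractive",
    "answers": "extractive|count-1",
    # A hybrid query would additionally pass the query embedding in "vectorQueries".
}

response = requests.post(URL, json=body, headers={"api-key": "<your-query-key>"})
for doc in response.json().get("value", []):
    print(doc.get("@search.rerankerScore"), doc.get("@search.captions"))
```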

Appendix

External Dataset description

Query types description (one query can have multiple types)

| Query type | Description | Example |
|---|---|---|
| Web-like queries | Shortened queries similar to those commonly entered into Google or Bing | “Best retrieval concept queries” |
| Keyword | Short queries that consist of only important words | “semantic ranker” |
| Misspelling | Queries with misspellings, typos or transpositions of letters | “Ho w mny documents are samantically r4nked” |
| Acronym | Queries that contain an acronym | “when were ATMs invented” |
| Codeword | Queries that contain an identifier | “time of arrival for flight AF852” |
| Question | Queries that ask a question | “who wrote Brothers Karamazov” |
| Fact seeking queries | Questions asking for simple facts with some given conditions, such as stock prices on a certain date and a director’s recent movies in a certain genre | “what were Microsoft earnings in 2020” |
| Concept seeking queries | Abstract questions that require multiple sentences to answer | “Why should I use semantic search to rank results?” |

 

Updated Nov 21, 2024