Azure AI Services
Arizona Department of Transportation Innovates with Azure AI Vision
The Arizona Department of Transportation (ADOT) is committed to providing safe and efficient transportation services to the residents of Arizona. With a focus on innovation and customer service, ADOT’s Motor Vehicle Division (MVD) continually seeks new ways to enhance its services and improve the overall experience for its residents. The challenge ADOT MVD had a tough challenge to ensure the security and authenticity of transactions, especially those involving sensitive information. Every day, the department needs to verify thousands of customers seeking to use its online services to perform activities like updating customer information including addresses, renewing vehicle registrations, ordering replacement driver licenses, and ordering driver and vehicle records. Traditional methods of identity verification, such as manual checks and physical presence, were not only time-consuming and error-prone, but didn’t provide any confidence that the department was dealing with the right customer in remote interactions, such as online using its web portal. With high daily demand and stringent security requirements, the department recognized the need to enhance its digital presence and improve customer engagement. Facial verification technology has been a longstanding method for verifying a user's identity on-device and online account login for its convenience and efficiency. However, challenges are increasing as malicious actors persist in their attempts to manipulate and deceive the system through various spoofing techniques. The solution To address these challenges, the ADOT turned to Azure AI Vision Face API (also known as Azure Face Service), with Liveness Detection. This technology leverages advanced machine learning algorithms to verify the identity of individuals in real time. The Liveness Detection feature aims to verify that the system engages with a physically present, living individual during the verification process. This is achieved by differentiating between a real (live) and fake (spoof) representation which may include photographs, videos, masks, or other means to mimic a real person. By using facial verification and liveness detection, the system can determine whether the person in front of the camera is a live human being and not a photograph or a video. This cutting-edge technology has transformed the way the department operates to make it more efficient, secure, and reliable. Implementation and collaboration The department worked closely with Microsoft's team to ensure a seamless integration of the technology. "We were extremely excited to partner with Microsoft to use their passive liveness verification and facial verification all in one step," said Grant Hawkes, a contracted partner with the department’s Motor Vehicle Modernization (MvM) Project and its Lead Foundation Architect. "The Microsoft engineers were super receptive and super helpful. They would actually tweak the software a little bit for our use case, making our lives much easier. We have this wonderful working relationship with Microsoft, and they were extremely open with us, extremely receptive to ideas and whatever else it took. And we've only seen the ease of use get better and better and better.” Key benefits ADOT MVD has realized numerous benefits from the adoption of Azure AI Vision face liveness and verification functionality: Enhanced security—The technology has significantly reduced the risk of identity theft and fraud. 
By verifying the identity of individuals in real time, the department can ensure that only authorized individuals can access sensitive information and complete transactions. Improved efficiency—The system has streamlined the verification process, reducing the time required for identity checks. This allows the department to offer some services online that were previously only able to be done in office, such as driver license renewals and title transfers. Accessibility—The technology has made the process more inclusive and user-friendly, so it’s easier for individuals with disabilities or the elderly to complete transactions. Cost-effective—The Azure AI Vision face technology works seamlessly across different devices, including laptops and smartphones, without requiring expensive hardware, and fits into ADOT’s existing budget. Verifying mobile driver's licenses (mDLs) is one of the most significant applications of this technology. Arizona was one of the first states to offer ISO 18013-5 compliant mDLs, allowing residents to store their driver's licenses on their mobile devices, making it more convenient and secure. Another notable application is electronic transfer of vehicle titles. Residents can now transfer vehicle titles electronically, eliminating the need for physical presence and paperwork. This will make the process much easier for citizens, while also making it more efficient and secure, reducing the risk of fraud. On-demand authentication ADOT MVD has also developed an innovative solution called on-demand authentication (ODA). This allows residents to verify their identity remotely using their mobile devices. When a resident calls ADOT MVD’s call center, they receive a text message with a link to verify their identity. The system uses Azure AI Vision to perform facial verification and liveness detection, ensuring that the person on the other end of the call is who they claim to be. "This technology has been key in mitigating fraud by increasing our confidence that we're working with the right person," said Grant Hawkes. "The whole process takes maybe a few seconds and is user-friendly for both the call center representative and the customer." Future plans The success of Azure AI Vision has prompted ADOT to explore further applications, and other state agencies are now looking at adopting the technology as well. "We see this growing and growing," said Grant Hawkes. "We're working to roll this technology out to more and more departments within the state as part of a unified identity solution. We see the value in this technology and what can be done with it." The ADOT’s adoption of Azure AI Vision Face liveness and verification functionality has transformed the way the department operates. By enhancing security, improving efficiency, and making services more accessible, the technology has brought significant benefits to both the department and the residents of Arizona. As the department continues to innovate and expand the use of this technology, it sets a benchmark for other states and organizations to follow. Our commitment to Trustworthy AI Organizations across industries are leveraging Azure AI and Copilot capabilities to drive growth, increase productivity, and create value-added experiences. We’re committed to helping organizations use and build AI that is trustworthy, meaning it is secure, private, and safe. 
We bring best practices and learnings from decades of researching and building AI products at scale to provide industry-leading commitments and capabilities that span our three pillars of security, privacy, and safety. Trustworthy AI is only possible when you combine our commitments, such as our Secure Future Initiative and our Responsible AI principles, with our product capabilities to unlock AI transformation with confidence.
Get started:
Learn more about Azure AI Vision.
Learn more about Face Liveness Detection, a milestone in identity verification.
See how face detection works. Try it now.
Read about Enhancing Azure AI Vision Face API with Liveness Detection.
Learn how Microsoft empowers responsible AI practices.
Why Azure AI Is Retail’s Secret Sauce
Executive Summary
Leading RCG enterprises are standardizing on Azure AI—specifically Azure OpenAI Service, Azure Machine Learning, Azure AI Search, and Azure AI Vision—to increase digital-channel conversion, sharpen demand forecasts, automate store execution, and accelerate product innovation. Documented results include up to 30 percent uplift in search conversion, 10 percent reduction in stock-outs, and multimillion-dollar productivity gains. This roadmap consolidates field data from CarMax, Kroger, Coca-Cola, Estée Lauder, PepsiCo, and Microsoft reference architectures to guide board-level investment and technology planning.
1 Strategic Value of Azure AI
Azure AI delivers state-of-the-art language (GPT-4o, GPT-4.1), reasoning (o1, o3, o4-mini), and multimodal (Phi-3 Vision) models through Azure OpenAI Service while unifying machine-learning, search, and vision APIs under one security, compliance, and Responsible AI framework. Coca-Cola validated Azure’s enterprise scale with a $1.1 billion, five-year agreement covering generative AI across marketing, product R&D, and customer service (Microsoft press release; Reuters).
2 Customer-Experience Transformation
2.1 AI-Enhanced Search & Recommendations
Microsoft’s Two-Stage AI-Enhanced Search pattern—vector search in Azure AI Search followed by GPT reranking—has lifted search conversion by up to 30 percent in production pilots (Tech Community blog). CarMax uses Azure OpenAI to generate concise summaries for millions of vehicle reviews, improving SEO performance and reducing editorial cycles from weeks to hours (Microsoft customer story).
2.2 Conversational Commerce
The GPT-4o real-time speech endpoint supports multilingual voice interaction with end-to-end latencies below 300 ms—ideal for kiosks, drive-thrus, and voice-enabled customer support (Azure AI dev blog).
3 Supply-Chain & Merchandising Excellence
Azure Machine Learning AutoML for Time-Series automates feature engineering, hyper-parameter tuning, and back-testing for SKU-level forecasts (AutoML tutorial; methodology guide). PepsiCo reported lower inventory buffers and improved promotional accuracy during its U.S. pilot and is scaling globally (PepsiCo case study). In February 2025 Microsoft published an agentic systems blueprint that layers GPT agents on top of forecast outputs to generate replenishment quantities and route optimizations, compressing decision cycles in complex supply chains (Microsoft industry blog).
4 Smart-Store Operations
Kroger’s EDGE shelves integrate Azure AI Vision and IoT to display dynamic pricing, trigger low-stock alerts, and cut curb-side-picking times (Microsoft Transform article). At Microsoft Build 2024, Azure AI Vision introduced multimodal models that detect out-of-stocks, planogram gaps, and label errors via a single API call (Build 2024 Vision blog). NOTE: organizations using the preview Shelf Product Recognition API should migrate to Custom Vision ahead of the 31 March 2025 retirement date (retirement notice). The new Azure AI Vision v3.2 release adds higher-accuracy tagging and multilingual embeddings for 102 languages (v3.2 release notes).
5 Marketing & Product Innovation
Estée Lauder and Microsoft established an AI Innovation Lab that uses Azure OpenAI to accelerate concept development and campaign localization across 20 prestige brands (Estée Lauder press release). Coca-Cola applies the same foundation models to generate ad copy, packaging text, and flavor concepts, maximizing reuse of trained embeddings across departments.
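To ground these generative scenarios in code, here is a minimal sketch of the kind of Azure OpenAI call that sits behind use cases like CarMax’s review summaries or Coca-Cola’s ad-copy drafts. It uses the openai Python package’s AzureOpenAI client; the endpoint, API key, deployment name, and sample reviews are placeholder assumptions, not values from the engagements described here.

```python
# Minimal illustration of an Azure OpenAI call for review summarization or ad-copy drafting.
# Endpoint, key, deployment name, and sample reviews are placeholder assumptions.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com",
    api_key="<aoai-api-key>",
    api_version="2024-06-01",
)

reviews = [
    "Great fuel economy and a quiet ride, but the infotainment system feels dated.",
    "Spacious interior; brakes felt soft during the test drive.",
]

response = client.chat.completions.create(
    model="gpt-4o",  # your chat model deployment name
    messages=[
        {"role": "system", "content": "You summarize vehicle reviews into one concise, balanced paragraph."},
        {"role": "user", "content": "\n".join(reviews)},
    ],
)
print(response.choices[0].message.content)
```

In production, calls like this would typically sit behind the prompt-versioning and evaluation workflows described next.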
Azure AI Studio provides prompt versioning, automated evaluation, and CI/CD pipelines for generative-AI applications, reducing time-to-production for retail creative teams (Azure AI Studio blog).
6 Governance & Architecture
The open-source Responsible AI Toolbox bundles dashboards for fairness, interpretability, counterfactual analysis, and error inspection, enabling documented risk mitigation for language, vision, and tabular models (Responsible AI overview). Microsoft’s Retail Data Solutions Reference Architecture describes how to land POS, loyalty, and supply-chain data into Microsoft Fabric or Synapse Lakehouses and expose it to Azure AI services through governed semantic models (architecture guide).
7 Implementation Roadmap
Phase | Key Activities | Azure AI Services & Assets
0 – Foundation (Weeks 0-2) | Align business goals, assess data, deploy landing zone | Azure Landing Zone; Retail Data Architecture
1 – Pilot (Weeks 3-6) | Build one measurable use case (e.g., AI Search or AutoML forecasting) in Azure AI Studio | Azure AI Search; Azure OpenAI; Azure ML AutoML
2 – Industrialize (Months 2-6) | Integrate with commerce/ERP; add Responsible AI monitoring; CI/CD automation | Responsible AI Toolbox
3 – Scale Portfolio (Months 6-12) | Extend to smart-store vision, generative marketing, and agentic supply chain | Azure AI Vision; agentic systems pattern
Pilots typically achieve < 6-week time-to-value and 3–7 percentage-point operating-margin improvement when search conversion gains, inventory precision, and store-associate efficiency are combined (see CarMax, PepsiCo, and Kroger sources above).
8 Key Takeaways for Executives
Unified Platform: Generative, predictive, and vision workloads run under one governance model and SLA.
Proven Financial Impact: Field results confirm double-digit revenue uplift and meaningful OPEX savings.
Future-Proof Investments: Continuous model refresh (GPT-4.1, o3, o4-mini) and clear migration guidance protect ROI.
Built-in Governance: Responsible AI tooling accelerates compliance and audit readiness.
Structured Scale Path: A phased roadmap de-risks experimentation and enables enterprise deployment within 12 months.
Bottom line: Azure AI provides the technical depth, operational maturity, and economic model required to deploy AI at scale across RCG value chains—delivering quantifiable growth and efficiency without introducing multi-vendor complexity.
Building an Interactive Feedback Review Agent with Azure AI Search and Haystack
By Khye Wei (Azure AI Search) & Amna Mubashar (Haystack) We’re excited to announce the integration of Haystack with Azure AI Search! To demonstrate its capabilities, we’ll walk you through building an interactive review agent to efficiently retrieve and analyze customer reviews. By combining Azure AI Search’s hybrid retrieval with Haystack’s flexible pipeline architecture, this agent provides deeper insights through sentiment analysis and intelligent summarization tools. Why Use Azure AI Search with Haystack? Azure AI Search offers an enterprise-grade retrieval system with battle-tested AI search technology, built for high performance GenAI applications at any scale: Hybrid Search: Combining keyword-based BM25 and vector-based searches with reciprocal rank fusion (RRF). Semantic Ranking: Enhancing retrieval results using deep learning models. Scalability: Supporting high-performance GenAI applications. Secure, Enterprise-ready Foundation: Powering interactive experiences at scale on a trusted foundation. Haystack complements Azure AI Search by providing an end-to-end framework that enables: Modular Architecture: Easily swap or configure components like document retrieval, language models, and pipelines to build customized AI applications. Flexible Pipeline Design: Adapt pipelines to various data flows and use cases. Scalable and Reproducible: Ensure consistent performance across deployments with reliable and scalable pipelines. Tools and Agentic Pipelines: Build sophisticated pipelines that let AI models interact with external functions and structured tools. Indexing and Retrieval with Azure AI Search and Haystack In this blog, we’ll demonstrate how to create an end-to-end pipeline that combines Haystack with Azure AI Search to process customer reviews. By enabling semantic search and leveraging Haystack Tools for interactive sentiment analysis and summarization, you can quickly uncover deeper insights on your data. You can find the full working example and code in the linked recipe from our cookbook. We’ll use an open-source customer reviews dataset from Kaggle (link to the dataset). The process includes: Converting the dataset to Haystack Documents and preparing it using Haystack preprocessors. Indexing the documents using AzureAISearchDocumentStore, with semantic search enabled. Building a query pipeline that leverages Azure AI Search’s hybrid retrieval. Creating an interactive review assistant that uses a custom sentiment analysis tool to provide insights. 
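If you prefer to follow the walkthrough without downloading the Kaggle file, a small synthetic CSV with the same column layout is enough to exercise the pipeline. The column names below are inferred from the preprocessing code in this post; the rows themselves are invented.

```python
# Optional: a tiny synthetic stand-in for the Kaggle reviews file, so the walkthrough can be
# followed end to end without the download. Column names mirror what the code below expects
# (review, review-label, year, month, date, store_location); the values are made up.
import pandas as pd

sample = pd.DataFrame(
    {
        "review": [
            "Great quality jacket, fits perfectly and shipping was fast.",
            "The color faded after one wash and support never replied.",
        ],
        "review-label": [5, 2],                      # renamed to 'rating' later
        "year": ["2023 00:00:00", "2023 00:00:00"],  # parsed with format '%Y %H:%M:%S'
        "month": [3, 4],
        "date": ["2023-03-12", "2023-04-02"],
        "store_location": ["Seattle", "Austin"],
    }
)
sample.to_csv("sample_reviews.csv", index=False, encoding="latin1")  # use this path below
```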
Data Preparation First, we read the data and convert it to JSON for efficient indexing: import pandas as pd from json import loads path = "<path to dataset file>" df = pd.read_csv(path, encoding='latin1',nrows=200) df.rename(columns={'review-label': 'rating'}, inplace=True) df['year'] = pd.to_datetime(df['year'], format='%Y %H:%M:%S').dt.year # Convert DataFrame to JSON json_data = {"reviews": loads(df.to_json(orient="records"))} Next, we use Haystack’s JSONConverter to extract reviews as Haystack Documents, choosing which columns will be stored as metadata: from haystack.components.converters import JSONConverter from haystack.dataclasses import ByteStream from json import dumps converter = JSONConverter( jq_schema=".reviews[]", content_key="review", extra_meta_fields={"store_location", "date", "month", "year", "rating"} ) source = ByteStream.from_string(dumps(json_data)) documents = converter.run(sources=[source])['documents'] We apply Haystack’s DocumentCleaner to ensure the text is ASCII-only and remove any unwanted characters: from haystack.components.preprocessors import DocumentCleaner cleaner = DocumentCleaner(ascii_only=True, remove_regex="i12i12i12") cleaned_documents = cleaner.run(documents=documents) With the data loaded and cleaned, we’re ready to move on to indexing these documents in Azure AI Search. Indexing Documents with Azure AI Search Initialize the AzureAISearchDocumentStore with the desired metadata fields and semantic configuration enabled. from azure.search.documents.indexes.models import ( SemanticConfiguration, SemanticField, SemanticPrioritizedFields, SemanticSearch ) from haystack_integrations.document_stores.azure_ai_search import AzureAISearchDocumentStore semantic_config = SemanticConfiguration( name="my-semantic-config", prioritized_fields=SemanticPrioritizedFields( content_fields=[SemanticField(field_name="content")] ) ) semantic_search = SemanticSearch(configurations=[semantic_config]) document_store = AzureAISearchDocumentStore( index_name="customer-reviews-analysis", azure_endpoint="https://your-search-service.search.windows.net", api_key="YOUR_AZURE_API_KEY", embedding_dimension=1536, metadata_fields={"month": int, "year": int, "rating": int, "store_location": str}, semantic_search=semantic_search ) Now, we build an indexing pipeline that uses AzureAIDocumentEmbeddder to generate document embeddings and store them in the index. These embeddings are essential for hybrid retrieval, which combines vector retrieval with semantic search. from haystack import Pipeline from haystack.components.embedders import AzureOpenAIDocumentEmbedder from haystack.components.writers import DocumentWriter indexing_pipeline = Pipeline() indexing_pipeline.add_component("document_embedder", AzureOpenAIDocumentEmbedder()) indexing_pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="doc_writer") indexing_pipeline.connect("document_embedder", "doc_writer") indexing_pipeline.run({"document_embedder": {"documents": cleaned_documents["documents"]}}) Querying with Hybrid Retrieval After indexing, we can query our data with a hybrid retrieval pipeline that uses the AzureAISearchHybridRetriever. The query is embedded with the AzureOpenAITextEmbedder and the embeddings are passed to the retriever along with the semantic configuration. 
from haystack_integrations.components.retrievers.azure_ai_search import AzureAISearchHybridRetriever from haystack.components.embedders import AzureOpenAITextEmbedder query_pipeline = Pipeline() query_pipeline.add_component("text_embedder", AzureOpenAITextEmbedder()) query_pipeline.add_component("retriever", AzureAISearchHybridRetriever( document_store=document_store, query_type="semantic", semantic_configuration_name="my-semantic-config" )) query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding") query = "Which reviews are positive?" result = query_pipeline.run({"text_embedder": {"text": query}, "retriever": {"query": query}}) print(result["retriever"]["documents"]) Building an Interactive Feedback Review Agent with Tools After retrieving the relevant documents, we can create an interactive agent-based workflow using Haystack Tools. Our feedback review agent includes two specialized tools: An Aspect-Based Sentiment Analysis (ABSA) Tool (review_analysis): This tool examines specific aspects of reviews—such as product quality, shipping, customer service, and pricing—using the VADER sentiment analyzer. It computes sentiment scores, normalizes them to a 1–5 scale, and compares them against the original user-provided rating. Finally, it generates a visualization that highlights these comparisons. A Summarization Tool (review_summarization): This tool leverages Latent Semantic Analysis (LSA) to extract key sentences from each review, enabling quick scanning of main ideas and recurring themes to generate summaries. The agent decides which tool to invoke based on user requests, making the review analysis process more flexible, automated, and intuitive. In the code sample below, we demonstrate how to configure these tools, integrate them into a Haystack pipeline, and create the above agentic review assistant. Configuring Sentiment Analysis Tool from haystack.tools import Tool from haystack.components.tools import ToolInvoker from typing import Dict, List from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer def analyze_sentiment(reviews: List[Dict]) -> Dict: """ Function that performs aspect-based sentiment analysis. For each review that mentions keywords related to a specific topic, the function computes sentiment scores using VADER and categorizes the sentiment as 'positive', 'negative', or 'neutral'. It also normalizes the compound score to a 1-5 scale and then displays a bar chart comparing the normalized analyzer rating to the original review rating. 
""" topics = { "product_quality": [], "shipping": [], "customer_service": [], "pricing": [] } # Define keywords for each topic keywords = { "product_quality": ["quality", "material", "design", "fit", "size", "color", "style"], "shipping": ["shipping", "delivery", "arrived"], "customer_service": ["service", "support", "help"], "pricing": ["price", "cost", "expensive", "cheap"] } # Store the sentiment distribution based on ratings sentiments = {"positive": 0, "negative": 0, "neutral": 0} for review in reviews: rating = review.get("rating", 3) if rating >= 4: sentiments["positive"] += 1 elif rating <= 2: sentiments["negative"] += 1 else: sentiments["neutral"] += 1 # Initialize the VADER sentiment analyzer analyzer = SentimentIntensityAnalyzer() for review in reviews: text = review.get("review", "").lower() for topic, words in keywords.items(): if any(word in text for word in words): # Compute sentiment scores using VADER sentiment_scores = analyzer.polarity_scores(text) compound = sentiment_scores['compound'] # Normalize compound score from [-1, 1] to [1, 5] normalized_score = ((compound + 1) / 2) * 4 + 1 if compound >= 0.05: sentiment_label = 'positive' elif compound <= -0.05: sentiment_label = 'negative' else: sentiment_label = 'neutral' # Append the review along with its sentiment analysis result topics[topic].append({ "review": review, "sentiment": { "analyzer_rating": normalized_score, "label": sentiment_label } }) # Create the aspect-based sentiment analysis tool sentiment_tool = Tool( name="review_analysis", description="Aspect based sentiment analysis tool that compares the sentiment of reviews by analyzer and rating", function=analyze_sentiment, parameters={ "type": "object", "properties": { "reviews": { "type": "array", "items": { "type": "object", "properties": { "review": {"type": "string"}, "rating": {"type": "integer"}, "date": {"type": "string"} } } }, }, "required": ["reviews"] } ) Configuring Summarization Tool from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer from sumy.summarizers.lsa import LsaSummarizer from typing import Dict, List def summarize_reviews(reviews: List[Dict]) -> Dict: """ Summarize the reviews by extracting key sentences. """ summaries = [] summarizer = LsaSummarizer() for review in reviews: text = review.get("review", "") parser = PlaintextParser.from_string(text, Tokenizer("english")) summary = summarizer(parser.document, 2) # Extract 2 sentences, adjust as needed summary_text = " ".join(str(sentence) for sentence in summary) summaries.append({"review": text, "summary": summary_text}) return {"summaries": summaries} # Create the text summarization tool summarization_tool = Tool( name="review_summarization", description="Tool to summarize customer reviews by extracting key sentences.", function=summarize_reviews, parameters={ "type": "object", "properties": { "reviews": { "type": "array", "items": { "type": "object", "properties": { "review": {"type": "string"}, "rating": {"type": "integer"}, "date": {"type": "string"} } } }, }, "required": ["reviews"] } ) Creating Interactive Review Agent Using Haystack’s chat architecture, we register both tools—review_analysis and review_summarization—within AzureOpenAIChatGenerator. The agent then autonomously decides which tool to invoke based on the user’s query. If someone asks for a summary, it calls the review_summarization tool; if they want deeper, aspect-based sentiment insights, it calls the review_analysis tool. 
from haystack.dataclasses import ChatMessage from haystack.components.generators.chat import AzureOpenAIChatGenerator def create_review_agent(): """Creates an interactive review analysis assistant""" chat_generator = AzureOpenAIChatGenerator( tools=[sentiment_tool, summarization_tool] ) system_message = ChatMessage.from_system( """ You are a customer review analysis expert. Your task is to perform aspect based sentiment analysis on customer reviews. You can use two tools to get insights: - review_analysis: to get the sentiment of reviews by analyzer and rating - review_summarization: to get the summary of reviews. Depending on the user's question, use the appropriate tool to get insights and explain them in a helpful way. """ ) return chat_generator, system_message tool_invoker = ToolInvoker(tools=[sentiment_tool, summarization_tool]) # Example of how you might set up an interactive loop chat_generator, system_message = create_review_agent() messages = [system_message] # Pseudocode for an interactive session while True: user_input = input("\n\nwaiting for input (type 'exit' or 'quit' to stop)\n: ") if user_input.lower() in ["exit", "quit"]: break messages.append(ChatMessage.from_user(user_input)) print("🧑: ", user_input) # This example references some 'retrieved_reviews' in the prompt user_prompt = ChatMessage.from_user(f""" {user_input} Here are the reviews: {{retrieved_reviews}} analysis_type: "topics" """) messages.append(user_prompt) while True: print("⌛ iterating...") replies = chat_generator.run(messages=messages)["replies"] messages.extend(replies) # Check for any tool calls if not replies[0].tool_calls: break tool_calls = replies[0].tool_calls # Print tool calls for tc in tool_calls: print("\n TOOL CALL:") print(f"\t{tc.tool_name}") # Execute the tool calls tool_messages = tool_invoker.run(messages=replies)["tool_messages"] messages.extend(tool_messages) # Print the final AI response print(f"🤖: {messages[-1].text}") Get started today Interested in building your own agentic RAG applications? Follow these steps: Explore Azure AI Search: Learn more about all the latest features. Try Azure AI Search for free - Azure AI Search | Microsoft Learn Dive into Haystack Documentation: Familiarize yourself with the framework’s capabilities and learn how to develop custom AI pipelines with agentic behavior. Try Haystack - Azure AI Search Integration: Combine the strengths of both platforms to create innovative, AI-driven search applications. Check out the integration docs. Engage with the Community: Join Haystack Discord channel and Azure developer community to share insights, ask questions, and collaborate on projects.167Views1like0CommentsAzure AI Search: Cut Vector Costs Up To 92.5% with New Compression Techniques
TLDR: Key learnings from our compression technique evaluation Cost savings: Up to 92.5% reduction in monthly costs Storage efficiency: Vector index size reduced by up to 99% Speed improvement: Query response times up to 33% faster with compressed vectors Quality maintained: Many compression configurations maintain 99-100% of baseline relevance quality At scale, the cost of storing and querying large, high-dimensional vector indexes can balloon. The common trade-off? Either pay a premium to maintain top-tier search quality or sacrifice user experience to limit expenses. With Azure AI Search, you no longer have to choose. Through testing, we have identified ways to reduce system costs without compromising retrieval performance quality. Our experiments show: 92.5% reduction in cost when using our most aggressive compression configurations Query speed improves up to 33% with compressed index sizes In many scenarios, quality remains at or near the baseline if you preserve originals and leverage rescoring. This post will walk through experiments that implemented compression techniques and measured impact on storage footprint, cost, query speed and relevance quality. We will share experimental data, a decision framework, and practical recommendations to implement scalable knowledge retrieval applications: Why Compression Matters Experiment Setup Data Overview: Cost, Speed, Quality Understanding the technology Decision framework Implementation example in Azure AI Search Why Compression Matters The Business Value Reduce operating costs: reduce hefty storage costs from high-dimensional vector embeddings as you scale your solutions. Maintain quality: Users still expect search results to be as relevant and accurate as before. Improve speed: In many cases, compressed vectors can yield faster query responses due to smaller index sizes and fewer compute overheads. The Experiment Setup The Mental Model: Cost, Speed, Quality When evaluating compression approaches, three dimensions are measured: Cost: How much you spend each month for index storage and the necessary compute resources (partitions, SKUs). Speed: How quickly your queries return results—often measured by a latency percentile distributions. Quality: How relevant your search results are, measured by NDCG@k (Normalized Discounted Cumulative Gain at rank k). Note: They are many ways to evaluate retrieval systems. For simplicity, we decided to focus on an industry standard, NDCG. Our experiments consisted of various configurations of the following compression techniques: Scalar Quantization (SQ): Maps each floating-point value to a lower-precision scale (int8). Binary Quantization (BQ): Floats mapped to binary values for maximum compression. Matryoshka Representation Learning (MRL): For certain tests, we leveraged embeddings trained with Matryoshka Representation Learning, which allowed us to truncate vectors from 3072 dimensions to 768 dimensions without significant loss of semantic information. See our MRL support for quantization blog for more details Oversampling (defaultOversampling): Retrieves extra candidates using compressed vectors, then rescore them with original vectors to boost recall. Preserve vs. Discard Originals PreserveOriginals allows second-pass rescoring with full-precision vectors. DiscardOriginals maximizes storage savings by removing the original uncompressed vectors entirely. 
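Before looking at the results, it may help to see what these techniques do mechanically. The sketch below is purely illustrative—it is not the algorithm Azure AI Search runs internally—and uses NumPy to quantize one synthetic embedding, then extrapolates raw storage for a corpus the size of the one used in the experiments below (8.8M vectors of 3,072 dimensions). Real index sizes also include HNSW graph structures and other overhead.

```python
# Illustrative sketch only: how scalar (int8) and binary quantization shrink one embedding
# vector, plus a rough storage estimate for 8.8M such vectors. Azure AI Search performs
# quantization inside the service; this is not its implementation.
import numpy as np

DIMS, NUM_VECTORS = 3072, 8_800_000

rng = np.random.default_rng(0)
vector = rng.normal(size=DIMS).astype(np.float32)      # stand-in for a real embedding

# Scalar quantization: map each float32 value onto 256 int8 buckets (4x smaller).
lo, hi = float(vector.min()), float(vector.max())
sq = np.round((vector - lo) / (hi - lo) * 255 - 128).astype(np.int8)

# Binary quantization: keep only the sign of each dimension (32x smaller).
bq = np.packbits((vector > 0).astype(np.uint8))

print(f"per-vector bytes: float32={vector.nbytes}, int8={sq.nbytes}, binary={bq.nbytes}")
# -> 12288, 3072, 384

gib = 1024 ** 3
print(
    f"raw corpus size: float32={NUM_VECTORS * vector.nbytes / gib:.0f} GiB, "
    f"int8={NUM_VECTORS * sq.nbytes / gib:.0f} GiB, binary={NUM_VECTORS * bq.nbytes / gib:.0f} GiB"
)
# -> roughly 101 GiB, 25 GiB, and 3 GiB; index overhead comes on top of these figures
```

These raw numbers line up with the order of magnitude of the vector index sizes reported in Table 1 below.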
These compression configurations were tested on an open source dataset: Dataset: mteb/msmarco with 8.8M vectors Embeddings: Generated using OpenAI text-embedding-3-large with 3072 dimensions Indexing Algorithm: HNSW-based vector indexing (the default in Azure AI Search). MRL: For certain tests, we used Matryoshka Representation Learning to reduce dimensions from 3072 to 768. Cost data shown below reflects the minimum SKU and partitions required under current (post-Nov 2024) Azure pricing Note: We have conducted similar experiments on other MTEB retrieval datasets and observed consistent results across them. This blog focuses on results from the largest dataset that best simulates real-world production scenarios. Ultimately, your exact results may vary depending on your own data and use cases. Data Overview: Cost, Speed, Quality Below is a condensed summary of our experiments. Notice how compression can deliver massive storage reductions with minimal or no quality loss. (Full details, including all intermediate configurations, can be found in extended tables in the Appendix). Cost & Storage Comparision These tests measured cost against size across various compression methods. Key Insights: Moving from No Compression (109 GB) to BQ can reduce vector data size by 96% (~4 GB), though the actual index compression ratio may vary depending on supporting data structures, the M parameter in HNSW, and other index configuration settings. Combining MRL + BQ can cut index size by 99% (109 GB → ~1 GB). Cost can drop from $1,000/month on S1 to $75/month on Basic—a 92.5% reduction—when you discard original vectors. Method Vector Index (GB) Disk Storage (GB) Approx Cost per month % Savings SKU Min Partitions required No Compression 109.13 112.29 $1,000 - S1 4 SQ (wo/rescoring) 27.65 139.45 $250 75% S1 1 SQ (w/ rescoring) 27.65 139.45 $250 75% S1 1 BQ (w/ rescoring) 3.88 115.68 $250 75% S1 1 BQ + discardOriginals 3.88 7.04 $75 92.5% Basic 1 MRL + BQ (w/rescoring) 1.33 113.13 $250 75% S1 1 MRL + BQ + discardOriginals 1.33 4.49 $75 92.5% Basic 1 Table 1 Interpreting Table 1: SKU indicates which pricing tier you need as a function of Search Units (Partitions x Replicas). For more details, see Choose a service tier for Azure AI Search. Min Partitions is how many partition units you must allocate to store the index at that SKU. Fewer partitions generally lower cost. Costs are approximate figures for a single replica in East US region, post-Nov 2024 pricing updates. Actual charges may vary. For more details see Azure AI Search Pricing. Speed (Latency) Most compression methods improve or match latency relative to the No Compression baseline at the p50, p90, and p99 percentiles. We measured query speed in milliseconds (ms) and then compared each configuration’s relative latency percentile to the No Compression baseline. Key Insight: Compressed vectors typically yield faster searches due to more efficient processing. You may only see performance benefits on indexes that have >10K vectors. Method p50 p90 p99 Relative to baseline No Compression Baseline 1.00 1.00 1.00 - SQ (wo/ rescoring) + discardOriginals 0.74 0.71 0.69 ~30% faster BQ (wo/ rescoring) + discardOriginals 0.72 0.69 0.67 ~33% faster Interpreting Table 2: Values are normalized to the No Compression baseline (where 1.00 = baseline latency). Lower numbers mean faster performance (e.g., 0.70 means 30% faster than baseline). This is server-side latency, excluding network latency. 
Most compression configurations show improved performance, with BQ+discardOriginals providing the best speed improvement. Quality We measured NDCG@10 and then compared each configuration’s score to the uncompressed baseline (0.40219). Key Insight: With rescoring (i.e., preserveOriginals), most configurations maintain full baseline quality (NDCG@10 ≈ 1.00). SQ configurations maintain excellent quality (99-100% of baseline) even when discarding originals BQ with discardOriginals shows a small quality drop (96% of baseline), while MRL+BQ with discardOriginals shows a more noticeable quality drop (92% of baseline) Method NDCG@10 score Relative to baseline No Compression: baseline 0.40219 1.00 SQ (w/ rescoring) + preserveOriginals 0.40249 1.00 SQ (wo/rescoring) + discardOriginals 0.39999 0.99 BQ (w/rescoring) + preserveOriginals 0.40259 1.00 BQ (wo/rescoring) + preserveOriginals 0.39287 0.98 BQ (w/rescoring) + discardOriginals 0.39181 0.97 BQ (wo/rescoring) + discardOriginals 0.38733 0.96 Table 3 Interpreting Table 3: NDCG@10: This metric measures how well your search system ranks relevant results within the top 10 positions, with higher scores indicating better performance. The "relative" column shows how each compression method performs compared to uncompressed vectors, with 1.00 being identical quality and 0.96 meaning a slight degradation. Note: This selection highlights quality impact across key configurations. With rescoring and preserveOriginals, quality remains at baseline levels across all compression methods. BQ with rescoring and discardOriginals—our newest feature—maintains 97% of baseline quality while providing significant storage savings. If you're using Semantic Ranking (SR), quality drops become even less noticeable, making aggressive compression options like MRL+BQ+discardOriginals viable even for quality-sensitive applications. Read more in our experiments using compression with MRL with SR here in our latest Applied AI Blog announcing Semantic Ranker updates. Understanding the Technology Scalar Quantization (SQ) How It Works: Converts floating-point components to lower-precision numbers (e.g., int8) Pros: Typically yields ~75%+ index reduction with almost no quality loss (when rescoring), though results may vary depending on the specific embedding model used. Cons: Doesn't compress as aggressively as BQ. Best For: Balanced cost savings with minimal overhead. E.g., "SQ + preserveOriginals + rescoring" offers a 75% reduction in vector storage with no quality loss, though exact results depend on your embedding model. Binary Quantization (BQ) How It Works: Floats become binary representations (1s and 0s), achieving the largest compression ratio, often >90%. Binary quantization performs better with higher-dimensional vectors, as information loss during quantization becomes proportionally less significant. Pros: Drastically reduces partition/SKU requirements, delivering maximum storage savings Cons: Slightly higher risk of quality drop if you discard originals, though it's often modest (92–96% of baseline). Best For: Maximum cost savings. "BQ + discardOriginals" yields the best combination of lower partition count (hence lower cost) and faster search. Matryoshka Representation Learning (MRL) How It Works: Dimension reduction technique by truncation (e.g., from 3072 → 768). Pros: Combining MRL with BQ/SQ delivers the highest compression—sometimes <2% of the original size. Cons: Requires using embeddings specifically designed for dimension truncation. 
Best For: Use when you have or can adopt MRL-ready vector embedding models and want to minimize storage. Tip: Our data indicates it's better to apply SQ or BQ first, and then use MRL if further compression is needed, rather than using MRL alone as a starting point without quantization. Rescoring: Preserve vs. Discard Originals PreserveOriginals: Keeps full-precision vectors for optional oversampling and second-pass reranking, preserving nearly baseline quality. *DiscardOriginals: Provides maximum storage gains but prevents full-precision rescoring. Expect a small drop in retrieval accuracy (NDCG@10 ~0.92–0.96). *As of 2025-03-01-Preview API version, binary quantization supports discardOriginals with rescoring. In this scenario, rescoring is calculated by the dot product of the full precision query and binary quantized data in the index. Implementation Example in Azure AI Search When creating or updating your Search Index, you can specify compression in your VectorSearchProfile. For instance: { "name": "myVectorSearchProfile", "algorithm": "myHnswConfig", // your HNSW settings "compression": "myCompressionConfig", "vectorizer": "myVectorizer" } Then define a compression kind with rescoring options. For instance: { "compressions": [ { "name": "myCompressionConfig", "kind": "binaryQuantization", "rescoringOptions": { "defaultOversampling": 2.0, // Start with 2x and increase as needed based on your quality requirements "enableRescoring": true, "rescoreStorageMethod": "discardOriginals" }, "truncationDimension": null // or 768 if using MRL-compatible embeddings } ] } Choosing the Right Compression Strategy When selecting your compression approach, consider these key factors: Budget constraints: If minimizing cost is your primary concern, BQ+discardOriginals offers the most dramatic savings (up to 92.5%) compared to No Compression Quality sensitivity: If maintaining maximum quality while still opting for compression is critical: Use preserveOriginals with rescoring for virtually no quality loss (any compression method) If storage costs are still a concern, SQ (wo/rescoring) + discardOriginals offers an excellent compromise (99% quality retention) Speed requirements: All compression methods improve speed, but BQ+discardOriginals offers the best performance (up to 33% faster). Remember that if you opt for rescoring, higher oversampling factors will also increase query latency. Embedding models: If you're using or can adopt MRL-compatible embeddings such as OpenAI's text-embedding-3-large or Cohere's Embed 4, combining MRL with BQ offers the most extreme compression while maintaining acceptable quality. Semantic Ranking impact: If you're using Semantic Ranking (SR) in your search pipeline: You can employ more aggressive compression configurations with minimal quality impact SR effectively compensates for any minor relevance losses from compression, allowing you to maximize cost savings without compromising user experience This is especially valuable for large-scale applications where storage costs would otherwise be prohibitive With Azure AI Search compression, you can drastically reduce your storage footprint—and therefore cost—while maintaining fast queries and high (or near-baseline) relevance. For more details, check out our latest documentation on compression. Ready to take action? Benchmark your current vector index size, cost, and search quality. Choose the compression approach that fits your Cost/Speed/Quality priorities. 
Test in a sandbox environment, measure the impact, and verify relevance vs. baseline. Roll out to production once you confirm the desired outcomes. Check out this Python notebook on how you can compare storage sizes across different vector compression configurations. Don't be afraid to scale your GenAI solutions! Azure AI Search has you covered! We welcome your feedback and questions. Drop a comment below or visit https://feedback.azure.com to share your ideas! Appendix: Extended Configuration Tables Table A1: Complete Index Configuration and Cost Comparison Compression Method Vector Index Size (GB) Disk Storage (GB) Cost per Month (Services Created after Nov 18, 2024) Best SKU Min Partitions No Compression 109.13 112.29 $1,000 S1 4 SQ (w/rescoring) 27.65 139.45 $250 S1 1 SQ (wo/rescoring) 27.65 139.45 $250 S1 1 SQ + discardOriginals 27.65 30.80 $250 S1 1 BQ (w/rescoring) 3.88 115.68 $250 S1 1 BQ (wo/rescoring) 3.88 115.68 $250 S1 1 BQ + discardOriginals (w/rescoring) 3.88 7.04 $75 Basic 1 BQ + discardOriginals (wo/rescoring) 3.88 7.04 $75 Basic 1 MRL + SQ (w/rescoring) 7.27 119.07 $250 S1 1 MRL + SQ (wo/rescoring) 7.27 119.07 $250 S1 1 MRL + SQ + discardOriginals 7.27 10.43 $150 Basic 2 MRL + BQ (w/rescoring) 1.33 113.13 $250 S1 1 MRL + BQ (wo/rescoring) 1.33 113.13 $250 S1 1 MRL + BQ + discardOriginals (w/rescoring) 1.33 4.49 $75 Basic 1 MRL + BQ + discardOriginals (wo/rescoring) 1.33 4.49 $75 Basic 1 Summary of Table A1: This comprehensive table shows all tested configurations and their impact on storage and cost. The most aggressive compression methods (MRL + BQ + discardOriginals) reduce the vector index size from 109GB to just 1.33GB—a 99% reduction. This allows the index to fit on a Basic SKU with a single partition, reducing monthly costs by 92.5% compared to the uncompressed baseline. Table A2: Complete Performance Comparison (Relative Latency) Compression Method P50 P90 P99 No Compression 1.00 1.00 1.00 SQ (w/rescoring) + preserveOriginals 0.87 0.85 0.84 SQ (wo/rescoring) + preserveOriginals 0.84 0.80 0.79 SQ (wo/rescoring) + discardOriginals 0.74 0.71 0.69 BQ (w/rescoring) + preserveOriginals 1.22 2.23 2.59 BQ (wo/rescoring) + preserveOriginals 0.82 0.79 0.80 BQ (w/rescoring) + discardOriginals 0.76 0.73 0.71 BQ (wo/rescoring) + discardOriginals 0.72 0.69 0.67 MRL + SQ (w/rescoring) + preserveOriginals 0.96 1.10 1.20 MRL + SQ (wo/rescoring) + preserveOriginals 0.82 0.80 0.82 MRL + SQ (wo/rescoring) + discardOriginals 0.73 0.70 0.68 MRL + BQ (w/rescoring) + preserveOriginals 0.80 0.77 0.76 MRL + BQ (wo/rescoring) + preserveOriginals 0.76 0.73 0.71 MRL + BQ (w/rescoring) + discardOriginals 0.75 0.72 0.70 MRL + BQ (wo/rescoring) + discardOriginals 0.72 0.69 0.67 Summary of Table A2: Most compression methods (except BQ with rescoring and preserveOriginals) improve query latency across all percentiles. The most consistent performance improvements come from configurations that discard originals, which show around 30% faster response times. This is likely due to the reduced memory footprint and simplified query processing without needing to access the original vectors. 
Table A3: Complete Relevance Quality Comparison (NDCG@10) Compression Method NDCG@10 Relative NDCG@10 No Compression 0.40219 1.00 SQ (w/rescoring) + preserveOriginals 0.40249 1.00 SQ (wo/rescoring) + preserveOriginals 0.40188 1.00 SQ (wo/rescoring) + discardOriginals 0.39999 0.99 BQ (w/rescoring) + preserveOriginals 0.40259 1.00 BQ (wo/rescoring) + preserveOriginals 0.39287 0.98 BQ (w/rescoring) + discardOriginals 0.39181 0.97 BQ (wo/rescoring) + discardOriginals 0.38733 0.96 MRL + SQ (w/rescoring) + preserveOriginals 0.40224 1.00 MRL + SQ (wo/rescoring) + preserveOriginals 0.39793 0.99 MRL + SQ (wo/rescoring) + discardOriginals 0.39375 0.98 MRL + BQ (w/rescoring) + preserveOriginals 0.40024 1.00 MRL + BQ (wo/rescoring) + preserveOriginals 0.35704 0.89 MRL + BQ (w/rescoring) + discardOriginals 0.37192 0.92 MRL + BQ (wo/rescoring) + discardOriginals 0.35314 0.88 Summary of Table A3: This table provides a complete quality comparison across all tested configurations. The data shows that preserving originals with rescoring maintains full baseline quality (NDCG@10 ≈ 1.00) regardless of compression method. SQ configurations show minimal quality impact even when discarding originals (99% of baseline), while BQ configurations show a modest drop (96-97%). The most aggressive compression (MRL+BQ+discardOriginals) still maintains 92% of baseline quality with rescoring enabled, which may be acceptable for many use cases, especially when combined with Semantic Ranking. Note: All configurations use HNSW defaults. MRL tests use dimension=768. Configurations with rescoring use oversampling=10. The dataset used is mteb/msmarco with 8.8M vectors.317Views0likes0CommentsBuilding AI-Powered Clinical Knowledge Stores with Azure AI Search
👀 Missed Session 01? Don’t worry—you can still catch up. But first, here’s what AI HLS Ignited is all about: What is AI HLS Ignited? AI HLS Ignited is a Microsoft-led technical series for healthcare innovators, solution architects, and AI engineers. Each session brings to life real-world AI solutions that are reshaping the Healthcare and Life Sciences (HLS) industry. Through live demos, architectural deep dives, and GitHub-hosted code, we equip you with the tools and knowledge to build with confidence. Session 01 Recap: In our first session, we introduced the accelerator MedIndexer - which is an indexing framework designed for the automated creation of structured knowledge bases from unstructured clinical sources. Whether you're dealing with X-rays, clinical notes, or scanned documents, MedIndexer converts these inputs into a schema-driven format optimized for Azure AI Search. This will allow your applications to leverage state-of-the-art retrieval methodologies, including vector search and re-ranking. Moreover, by applying a well-defined schema and vectorizing the data into high-dimensional representations, MedIndexer empowers AI applications to retrieve more precise and context-aware information... The result? AI systems that surface more relevant, accurate, and context-aware insights—faster. 🔍 Turning Your Unstructured Data into Value "About 80% of medical data remains unstructured and untapped after it is created (e.g., text, image, signal, etc.)" — Healthcare Informatics Research, Chungnam National University In the era of AI, the rise of AI copilots and assistants has led to a shift in how we access knowledge. But retrieving clinical data that lives in disparate formats is no trivial task. Building retrieval systems takes effort—and how you structure your knowledge store matters. It’s a cyclic, iterative, and constantly evolving process. That’s why we believe in leveraging enterprise-ready retrieval platforms like Azure AI Search—designed to power intelligent search experiences across structured and unstructured data. It serves as the foundation for building advanced retrieval systems in healthcare. However, implementing Azure AI Search alone is not enough. Mastering its capabilities and applying well-defined patterns can significantly enhance your ability to address repetitive tasks and complex retrieval scenarios. This project aims to accelerate your ability to transform raw clinical data into high-fidelity, high-value knowledge structures that can power your next-generation AI healthcare applications. 🚀 How to Get Started with MedIndexer New to Azure AI Search? Begin with our guided labs to build a strong foundation and get hands-on with the core capabilities. Already familiar with the tech? Jump ahead to the real-world use cases—learn how to build Coded Policy Knowledge Stores and X-ray Knowledge Stores. 🧪 Labs 🧪 Building Your Azure AI Search Index: 🧾 Notebook - Building your first Index Learn how to create and configure an Azure AI Search index to enable intelligent search capabilities for your applications. 🧪 Indexing Data into Azure AI Search: 🧾 Notebook - Ingest and Index Clinical Data Understand how to ingest, preprocess, and index clinical data into Azure AI Search using schema-first principles. 🧪 Retrieval Methods for Azure AI Search: 🧾 Notebook - Exploring Vector Search and Hybrid Retrieval Dive into retrieval techniques such as vector search, hybrid retrieval, and reranking to enhance the accuracy and relevance of search results. 
🧪 Evaluation Methods for Azure AI Search: 🧾 Notebook - Evaluating Search Quality and Relevance
Learn how to evaluate the performance of your search index using relevance metrics and ground truth datasets to ensure high-quality search results.
🏥 Use Cases
📝 Creating Coded Policy Knowledge Stores
In many healthcare systems, policy documents such as pre-authorization guidelines are still trapped in static, scanned PDFs. These documents are critical—they contain ICD codes, drug name coverage, and payer-specific logic—but are rarely structured or accessible in real time. To solve this, we built a pipeline that transforms these documents into intelligent, searchable knowledge stores. This diagram shows how pre-auth policy PDFs are ingested via blob storage, passed through an OCR and embedding skillset, and then indexed into Azure AI Search. The result: fast access to coded policy data for AI apps.
🧾 Notebook - Creating Coded Policies Knowledge Stores
Transform payer policies into machine-readable formats. This use case includes:
Preprocessing and cleaning PDF documents
Building custom OCR skills
Leveraging out-of-the-box Indexer capabilities and embedding skills
Enabling real-time AI-assisted querying for ICDs, payer names, drug names, and policy logic
Why it matters: This streamlines prior authorization and coding workflows for providers and payors, reducing manual effort and increasing transparency.
🩻 Creating X-ray Knowledge Stores
In radiology workflows, X-ray reports and image metadata contain valuable clinical insights—but these are often underutilized. Traditionally, they’re stored as static entries in PACS systems or loosely connected databases. The goal of this use case is to turn those X-ray reports into a searchable, intelligent asset that clinicians can explore and interact with in meaningful ways. This diagram illustrates a full retrieval pipeline where radiology reports are uploaded, enriched through foundational models, embedded, and indexed. The output powers an AI-driven web app for similarity search and decision support.
🧾 Notebook - Creating X-rays Knowledge Stores
Turn imaging reports and metadata into a searchable knowledge base. This includes:
Leveraging push APIs with a custom event-driven indexing pipeline triggered on new X-ray uploads
Generating embeddings using Microsoft Healthcare foundation models
Providing an AI-powered front-end for X-ray similarity search
Why it matters: Supports clinical decision-making by retrieving similar past cases, aiding diagnosis and treatment planning with contextual relevance.
📣 Join Us for the Next Session
Help shape the future of healthcare by sharing AI HLS Ignited with your network—and don’t miss what’s coming next!
📅 Register for the upcoming session → AI HLS Ignited Event Page
💻 Explore the code, demos, and architecture → AI HLS Ignited GitHub Repository
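As a small companion to the X-ray use case above, here is a hypothetical sketch of the push-model indexing step using the azure-search-documents Python SDK. It is not the MedIndexer implementation: the index name, field names, embedding dimension, and sample document are placeholder assumptions, and the property names reflect recent SDK versions, so check the current package documentation before relying on them.

```python
# Hypothetical sketch of push-model indexing for an X-ray knowledge store: create an index
# with a vector field, then upload a report document with a precomputed embedding.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration, SearchField, SearchFieldDataType, SearchIndex,
    SearchableField, SimpleField, VectorSearch, VectorSearchProfile,
)

endpoint, key = "https://<your-search-service>.search.windows.net", "<api-key>"

index = SearchIndex(
    name="xray-reports",                            # illustrative index name
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="report_text"),
        SearchField(
            name="report_vector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1024,          # depends on the embedding model used
            vector_search_profile_name="xray-profile",
        ),
    ],
    vector_search=VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="hnsw")],
        profiles=[VectorSearchProfile(name="xray-profile", algorithm_configuration_name="hnsw")],
    ),
)
SearchIndexClient(endpoint, AzureKeyCredential(key)).create_or_update_index(index)

# Push one document; the embedding is assumed to come from a healthcare foundation model.
search_client = SearchClient(endpoint, "xray-reports", AzureKeyCredential(key))
search_client.upload_documents([{
    "id": "study-001",
    "report_text": "Frontal chest radiograph shows no acute cardiopulmonary abnormality.",
    "report_vector": [0.01] * 1024,                 # placeholder embedding
}])
```

Evaluating Agentic AI Systems: A Deep Dive into Agentic Metrics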
In this post, we explore the latest Agentic metrics introduced in the Azure AI Evaluation library, a Python library designed to assess generative AI systems with both traditional NLP metrics (like BLEU and ROUGE) and AI-assisted evaluators (such as relevance, coherence, and safety). With the rise of agentic systems, the library now includes purpose-built evaluators for complex agent workflows. We’ll focus on three key metrics: Task Adherence, Tool Call Accuracy, and Intent Resolution—each capturing a critical dimension of an agent’s performance. To help illustrate these evaluation strategies, you can find AgenticEvals, a simple public repo that showcases these metrics in action using Semantic Kernel for the agentic/orchestration layer and Azure AI Evaluation library for the evaluation. Why We Need New Metrics for Generative & Agentic AI Generative AI systems don’t fit neatly into the evaluation mold of traditional machine learning. In classical ML tasks such as classification or regression, we rely on objective metrics – accuracy, precision, recall, F1-score, etc. – which compare predictions to a single ground truth. Generative AI, by contrast, produces open-ended outputs (free-form text, code, images) where there may be many acceptable answers and quality is subjective. Moreover, agentic systems add yet another layer of complexity. They don’t just generate output – they reason over tasks, break them into subtasks, invoke tools, make decisions, and adapt. We need to evaluate not only the final answer but the entire process. This includes how well the agent understands the user’s goal, whether it chooses the right tools, and if it follows the intended path to completion. Azure AI Evaluation: Agentic Metrics The Azure AI Evaluation library now supports metrics tailored for evaluating agentic behaviors. Let’s explore the three key metrics that help developers and researchers build more reliable AI agents. Task Adherence – Is the Agent Answering the Right Question? Task Adherence evaluates how well the agent’s final response satisfies the original user request. It goes beyond surface-level correctness and looks at whether the response is relevant, complete, and aligned with the user’s expectations. It uses an LLM-based evaluator to score adherence based on natural language prompt. This ensures that open-ended answers are assessed with contextual understanding. For example, if a user asks for a list of budget hotels in London, a response that lists luxury resorts would score poorly even if they’re technically located in London. Tool Call Accuracy – Is the Agent Using Tools Correctly? This metric focuses on the agent's procedural accuracy when invoking tools. It examines whether the right tool was selected for each step and whether inputs to the tool were appropriate and correctly formatted. The evaluator considers the context of the task and determines if the agent’s tool interactions were logically consistent and goal aligned. This helps identify subtle flaws such as correct answers that were reached via poor tool usage or unnecessary API calls. Intent Resolution – Did the Agent Understand the User’s Goal? Intent Resolution assesses whether the agent’s initial actions reflect a correct understanding of the user’s underlying need. It evaluates the alignment between the user’s input and the agent’s plan or early decisions. A high score means the agent correctly inferred what the user meant and structured its response accordingly. 
Conclusion and Next Steps
As AI agents become more autonomous and embedded in real-world workflows, robust evaluation is key to ensuring they’re acting responsibly and effectively. The new agentic metrics in Azure AI Evaluation give developers the tools to systematically assess agent behavior—not just outputs. Task Adherence, Tool Call Accuracy, and Intent Resolution offer a multi-dimensional view of performance that aligns with how modern agents operate. To try these metrics in action, check out AgenticEvals on GitHub—a simple repo demonstrating how to evaluate agent traces using these metrics.

Call to Action: Start evaluating your own agents using the Azure AI Evaluation library. The metrics are easy to integrate and can surface meaningful insights into your agent’s behavior. With the right evaluation tools, we can build more transparent, effective, and trustworthy AI systems.

General-Purpose vs Reasoning Models in Azure OpenAI
Explore the key differences between general-purpose and reasoning large language models (LLMs) using real-world examples from Azure OpenAI. Learn how to compare models like GPT-4o, o1, and their mini variants based on capabilities, latency, accuracy, and cost.

Best Practices for Mitigating Hallucinations in Large Language Models (LLMs)
Real-world AI Solutions: Lessons from the Field

Overview
This document provides practical guidance for minimizing hallucinations—instances where models produce inaccurate or fabricated content—when building applications with Azure AI services. It targets developers, architects, and MLOps teams working with LLMs in enterprise settings.

Key Outcomes
✅ Reduce hallucinations through retrieval-augmented strategies and prompt engineering
✅ Improve model output reliability, grounding, and explainability
✅ Enable robust enterprise deployment through layered safety, monitoring, and security

Understanding Hallucinations
Hallucinations come in different forms. Here are some realistic examples for each category to help clarify them:

| Type | Description | Example |
|---|---|---|
| Factual | Outputs are incorrect or made up | "Albert Einstein won the Nobel Prize in Physics in 1950." (It was 1921) |
| Temporal | Stale or outdated knowledge shown as current | "The latest iPhone model is the iPhone 12." (When iPhone 15 is current) |
| Contextual | Adds concepts that weren’t mentioned or implied | Summarizing a doc and adding "AI is dangerous" when the doc never said it |
| Linguistic | Grammatically correct but incoherent sentences | "The quantum sandwich negates bicycle logic through elegant syntax." |
| Extrinsic | Unsupported by source documents | Citing nonexistent facts in a RAG-backed chatbot |
| Intrinsic | Contradictory or self-conflicting answers | Saying both "Azure OpenAI supports fine-tuning" and "Azure OpenAI does not." |

Mitigation Strategies

1- Retrieval-Augmented Generation (RAG)
Grounding model outputs with enterprise knowledge sources like PDFs, SharePoint docs, or images.

Key Practices:

Data Preparation and Organization
- Clean and curate your data.
- Organize data into topics to improve search accuracy and prevent noise.
- Regularly audit and update grounding data to avoid outdated or biased content.

Search and Retrieval Techniques
- Explore different methods (keyword, vector, hybrid, semantic search) to find the best fit for your use case.
- Use metadata filtering (e.g., tagging by recency or source reliability) to prioritize high-quality information.
- Apply data chunking to improve retrieval efficiency and clarity.

Query Engineering and Post-Processing
- Use prompt engineering to specify which data source or section to pull from.
- Apply query transformation methods (e.g., sub-queries) for complex queries.
- Employ reranking methods to boost output quality.

2- Prompt Engineering
High-quality prompts guide LLMs to produce factual and relevant responses. Use the ICE method (a minimal code sketch follows the Structure example below):
- Instructions: Start with direct, specific asks.
- Constraints: Add boundaries like "only from retrieved docs".
- Escalation: Include fallback behaviors (e.g., “Say ‘I don’t know’ if unsure”).

Example Prompt Improvement:
❌: Summarize this document.
✅: Using only the retrieved documentation, summarize this paper in 3–5 bullet points. If any information is missing, reply with 'Insufficient data.'

Prompt Patterns That Work:

Clarity and Specificity
Write clear, unambiguous instructions to minimize misinterpretation. Use detailed prompts, e.g., "Provide only factual, verified information. If unsure, respond with 'I don't know.'"

Structure
Break down complex tasks into smaller logical subtasks for accuracy.

Example: Research Paper Analysis
❌ Bad Prompt (Too Broad, Prone to Hallucination): "Summarize this research paper and explain its implications."
✅ Better Prompt (Broken into Subtasks):
- Extract Core Information: "Summarize the key findings of the research paper in 3-5 bullet points."
- Assess Reliability: "Identify the sources of data used and assess their credibility."
- Determine Implications: "Based on the findings, explain potential real-world applications."
- Limit Speculation: "If any conclusions are uncertain, indicate that explicitly rather than making assumptions."
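To make the ICE pattern concrete, here is a minimal sketch of a grounded, constrained call through the openai Python SDK against an Azure OpenAI deployment. The endpoint, key, API version, deployment name, and the retrieved_docs variable are placeholders rather than values from this article, and the low temperature anticipates the Temperature Control guidance in the next subsection.

```python
from openai import AzureOpenAI

# Placeholders: substitute your own resource endpoint, key, and deployment name.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-06-01",
)

retrieved_docs = "...text returned by your retrieval step..."  # placeholder for RAG output

system_prompt = (
    "Summarize the retrieved documents for the user. "                                   # Instructions
    "Use only the retrieved documents; do not rely on outside knowledge or speculate. "  # Constraints
    "If the documents do not contain the answer, reply exactly: 'Insufficient data.'"    # Escalation
)

response = client.chat.completions.create(
    model="<your-deployment-name>",
    temperature=0.2,  # keep responses deterministic and focused
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": f"Retrieved documents:\n{retrieved_docs}\n\n"
                       "Summarize the key findings in 3-5 bullet points.",
        },
    ],
)
print(response.choices[0].message.content)
```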
Repetition
Repeating key instructions in a prompt can help reduce hallucinations. The way you structure the repetition matters. Here are some best practices:
- Beginning (Highly Recommended) – The start of the prompt has the most impact on how the LLM interprets the task. Place essential guidelines here, such as: "Provide only factual, verified information."
- End (For Final Confirmation or Safety Checks) – Use the end to reinforce key rules. Instead of repeating the initial instruction verbatim, word it differently to reinforce it, and keep it concise. For example: "If unsure, clearly state 'I don't know.'"

Temperature Control
Adjust temperature settings (0.1–0.4) for deterministic, focused responses.

Chain-of-Thought
Incorporate "Chain-of-Thought" instructions to encourage logical, stepwise responses. For example, to solve a math problem: "Solve this problem step-by-step. First, break it into smaller parts. Explain each step before moving to the next."

Tip: Use Azure AI Prompt Flow’s playground to test prompt variations with parameter sweeps.

3- System-Level Defenses
Mitigation isn't just prompt-side—it requires end-to-end design.
Key Recommendations:
- Content Filtering: Use Azure AI Content Safety to detect sexual, hate, violence, or self-harm content.
- Metaprompts: Define system boundaries ("You can only answer from documents retrieved").
- RBAC & Networking: Use Azure Private Link, VNETs, and Microsoft Entra ID for secure access.

4- Evaluation & Feedback Loops
Continuously evaluate outputs using both automated and human-in-the-loop feedback.
Real-World Setup:
- Labeling Teams: Review hallucination-prone cases with human-in-the-loop integrations.
- Automated Test Generation: Use LLMs to generate diverse test cases covering multiple inputs and difficulty levels. Simulate real-world queries to evaluate model accuracy.
- Evaluations Using Multiple LLMs: Cross-evaluate outputs from multiple LLMs. Use ranking and comparison to refine model performance. Be cautious—automated evaluations may miss subtle errors requiring human oversight.

Tip: Common Evaluation Metrics

| Metric | What It Measures | How to Use It |
|---|---|---|
| Relevance Score | How closely the model's response aligns with the user query and intent (0–1 scale). | Use automated LLM-based grading or semantic similarity to flag off-topic or loosely related answers. |
| Groundedness Score | Whether the output is supported by retrieved documents or source context. | Use manual review or Azure AI Evaluation tools (like RAG evaluation) to identify unsupported claims. |
| User Trust Score | Real-time feedback from users, typically collected via thumbs up/down or star ratings. | Track trends to identify low-confidence flows and prioritize them for prompt tuning or data curation. |

Tip: Use evaluation scores in combination. For example, high relevance but low groundedness often signals hallucination risks—especially in chat apps with fallback answers.
Tip: Flag any outputs where "source_confidence" < threshold and route them to a human review queue (see the sketch below).
Tip: Include “accuracy audits” as part of your CI/CD pipeline using Prompt Flow or other evaluation tools to test components.
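As a sketch of the routing tip above, the snippet below gates answers on groundedness and relevance scores and pushes low-confidence ones to a review queue. The thresholds, field names, and in-memory queue are illustrative; in practice the scores would come from your evaluators and the queue from your labeling or ticketing tool.

```python
from dataclasses import dataclass

# Illustrative thresholds; tune them against your own evaluation data.
GROUNDEDNESS_THRESHOLD = 0.7
RELEVANCE_THRESHOLD = 0.7

@dataclass
class EvaluatedAnswer:
    question: str
    answer: str
    groundedness: float  # 0-1, e.g. from an AI-assisted groundedness evaluator
    relevance: float     # 0-1, e.g. from a relevance evaluator

def route(answer: EvaluatedAnswer, review_queue: list) -> str:
    """Return 'auto_publish' or 'human_review', enqueuing flagged answers."""
    if answer.groundedness < GROUNDEDNESS_THRESHOLD or answer.relevance < RELEVANCE_THRESHOLD:
        review_queue.append(answer)
        return "human_review"
    return "auto_publish"

queue: list[EvaluatedAnswer] = []
sample = EvaluatedAnswer(
    question="What does the policy cover?",
    answer="It covers ...",
    groundedness=0.55,  # low groundedness despite high relevance -> hallucination risk
    relevance=0.9,
)
print(route(sample, queue))  # -> human_review
```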
Summary & Deployment Checklist

| Task | Tools/Methods |
|---|---|
| Curate and chunk enterprise data | Azure AI Search, data chunkers |
| Use clear, scoped, role-based prompts | Prompt engineering, prompt templates |
| Ground all outputs using RAG | Azure AI Search + Azure OpenAI |
| Automate evaluation flows | Prompt Flow + custom evaluators |
| Add safety filters and monitoring | Azure Content Safety, Monitor, Insights |
| Secure deployments with RBAC/VNET | Azure Key Vault, Entra ID, Private Link |

Additional AI Best Practices blog posts:
- Best Practices for Requesting Quota Increase for Azure OpenAI Models
- Best Practices for Leveraging Azure OpenAI in Constrained Optimization Scenarios
- Best Practices for Structured Extraction from Documents Using Azure OpenAI
- Best Practices for Using Generative AI in Automated Response Generation for Complex Decision Making
- Best Practices for Leveraging Azure OpenAI in Code Conversion Scenarios
- Kickstarting AI Agent Development with Synthetic Data: A GenAI Approach on Azure | Microsoft Community Hub

Best Practices for Requesting Quota Increase for Azure OpenAI Models
Introduction
This document outlines a set of best practices to guide users in submitting quota increase requests for Azure OpenAI models. Following these recommendations will help streamline the process, ensure proper documentation, and improve the likelihood of a successful request.

Understand the Quota and Limitations
Before submitting a quota increase request, make sure you have a clear understanding of:
- The current quota and limits for your Azure OpenAI instance.
- Your specific use case requirements, including estimated daily/weekly/monthly usage.
- The rate limits for API calls and how they affect your solution's performance.
Use the Azure portal or CLI to monitor your current usage and identify patterns that justify the need for a quota increase.

Provide a Clear and Detailed Justification
When requesting a quota increase, include a well-documented justification. This should include:
- Use Case Description – Provide an overview of how you are using Azure OpenAI services, including details about the application or platform it supports.
- User Impact – Explain how the quota increase will benefit end users or improve the solution's performance. Provide details of the Azure consumption impact, if possible.
- Current Limits vs. Required Limits – Clearly state your current quota and the requested increase. For example: current limit: 10,000 tokens per minute; requested limit: 50,000 tokens per minute.
- Supporting Data – Include quantitative data such as historical usage trends, growth projections (e.g., user base, API calls, or token consumption), and expected peak usage scenarios (e.g., seasonal demand or events).

Follow Prompt Engineering Best Practices
As a first step, ensure you are optimizing your usage of Azure OpenAI models by adhering to prompt engineering best practices:
- Plain Text Over Complex Formats – Use simple, clear prompts to minimize errors and improve model efficiency.
- JSON Format (if applicable) – If structured data is required, experiment with lightweight and efficient JSON schemas.
- Clear Instructions – Provide concise and unambiguous instructions in your prompts to reduce unnecessary token consumption.
- Iterative Refinement – Continuously refine prompts to strike a balance between response quality and token usage.

Optimize API Usage
Ensure that your solution is designed to optimize API usage:
- Batch Requests/Processes – Combine multiple smaller requests into a single larger request where possible, or use the batch endpoint option in Azure OpenAI.
- Rate-Limit Handling – Implement robust retry mechanisms and backoff strategies to handle rate-limited responses gracefully (see the sketch after this list).
- Caching Responses – Cache frequently requested results to minimize redundant API calls.
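Referencing the Rate-Limit Handling and Caching Responses items above, here is a minimal sketch of client-side resilience: exponential backoff on rate-limit errors plus a simple in-memory cache for repeated prompts. The endpoint, key, and deployment name are placeholders, and the retry counts and delays should be tuned to your actual quota.

```python
import time
from functools import lru_cache
from openai import AzureOpenAI, RateLimitError

# Placeholders: substitute your own resource endpoint, key, and deployment name.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-06-01",
)

@lru_cache(maxsize=1024)  # cache identical prompts to avoid redundant API calls
def ask(prompt: str, max_retries: int = 5) -> str:
    delay = 2.0
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="<your-deployment-name>",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt == max_retries - 1:
                raise            # give up after the final attempt
            time.sleep(delay)    # back off before retrying
            delay *= 2           # exponential backoff
    return ""
```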
Include Architectural and Operational Details
If your request involves high usage or complex architecture, provide additional details:
- System Architecture Overview – Describe the architecture of your solution, including how Azure OpenAI services integrate with other platforms (e.g., AWS, on-premises systems, etc.).
- Rate Limiting Strategy – Highlight any strategies in place to manage rate limits, such as prioritizing critical requests or implementing queuing mechanisms.
- Monitoring and Alerts – Share your approach to monitoring API usage and setting up alerts for potential issues.

Submit with Supporting Documentation
Prepare and attach any supporting materials that can strengthen your case:
- Usage Reports – Include graphs, charts, or tables showing current consumption trends.
- Excel Calculations – Provide detailed calculations related to rate limit requirements, if applicable.
- Screenshots or Diagrams – Share visuals that demonstrate your use case or architectural setup.
- Concise and Structured Format – Use bullet points or numbered lists for clarity.
- Alignment with Business Goals – Connect the request to business outcomes, such as improved customer experience or scalability for growth.
- Free of Ambiguity – Avoid vague language; provide specific details wherever possible.

Additional AI Best Practices blog posts:
- Best Practices for Leveraging Azure OpenAI in Code Conversion Scenarios
- Best Practices for Leveraging Azure OpenAI in Constrained Optimization Scenarios
- Best Practices for Structured Extraction from Documents Using Azure OpenAI

Revolutionizing Retail: Meet the Two-Stage AI-Enhanced Search
This article was written by the AI GBB Team: Samer El Housseini, Setu Chokshi, Aastha Madaan, and Ali Soliman.

If you've ever struggled with a retailer's website search—typing in something simple like "snow boots" and getting random results, e.g. garden hoses—you're not alone. Traditional search engines often miss the mark because they're stuck in an outdated world of keyword matching. Modern shoppers want more. They want searches that understand context, intent, and personal preferences. Enter the game-changer: Two-Stage AI-Enhanced Search, powered by Azure AI Search and Azure OpenAI services.

What's the Big Idea?
Several retailers and e-commerce giants in the UK and Australia are already looking to transform the customer experience with cutting-edge, AI-enabled solutions. Customers often search for products they want to give as a gift, something nice to wear for an occasion, or something for daily use that solves a problem. This demands a search system that understands the customer's intention and returns relevant results without the customer having to spend hours browsing through thousands of items. In addition, many retailers, fashion brands, and e-commerce giants want to offer a hyper-personalized search experience based on each customer's purchasing behavior, preferences, and personal style. For example, if a customer types the search phrase "find a gift for my sister who loves hiking under 100$", the search should return hiking gear and accessories that fit the customer's budget, brand preferences, and the season.

For a search system to return top results for a user's search phrase, we need to improve the relevancy of the search results, which is a complex task. We need to discover all the product searches a customer may want to perform, map them to the product categories available in the product catalogue, and recommend the most relevant products. Our solution builds on two stages, discovery with query expansion followed by recommendation, to understand the customer's search context and enhance search relevancy using advanced reasoning models. For example, if a customer types "It snowed today", the system will intelligently expand the query into search terms such as "winter gear in neutral shades" or "hand warmer for cold weather", then search product categories such as jackets and thermal leggings to build its recommendations.

How Does the Two-Stage AI Work?

Stage 1: Discovery & Query Expansion
The first step tackles the vague or lifestyle queries users often input:
- Contextual Query Expansion: When a customer says, "It snowed today," the AI doesn't merely match the keyword "snow." It understands potential purchase intent, offering winter apparel or practical cold-weather gear. The queries take into account the purchaser's buying behavior inferred from their customer profile. For example, if the user shows high purchasing power through their purchase history, the system will show them premium and luxury items.
- Automatic Filtering & Categorization: The solution identifies product categories like "Cold Weather Coats" or "Automotive" and applies relevant filters such as price, brand, or past purchasing patterns.
This ensures comprehensive coverage of products that match the user's real intent, transforming general queries into highly precise recommendations.
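As a rough illustration of this query-expansion step (not the production implementation from the repo), the sketch below asks an Azure OpenAI deployment to turn a vague lifestyle query plus a customer profile into concrete search terms, category filters, and a price cap. The prompt wording, JSON schema, profile fields, and deployment name are assumptions.

```python
import json
from openai import AzureOpenAI

# Placeholders: substitute your own resource endpoint, key, and deployment name.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-06-01",
)

def expand_query(user_query: str, customer_profile: dict) -> dict:
    """Expand a vague retail query into search terms and category filters."""
    prompt = (
        "You expand vague retail queries into concrete search terms and category filters.\n"
        f"Customer profile: {json.dumps(customer_profile)}\n"
        f"Query: {user_query}\n"
        'Respond as JSON: {"search_terms": [...], "categories": [...], "max_price": number or null}'
    )
    response = client.chat.completions.create(
        model="<your-deployment-name>",
        response_format={"type": "json_object"},  # ask for a JSON object back
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

print(expand_query("It snowed today", {"budget": "premium", "preferred_brands": ["NorthPeak"]}))
# e.g. {"search_terms": ["insulated winter coat", ...], "categories": ["Cold Weather Coats"], "max_price": null}
```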
Stage 2: Intelligent Personalization & Recommendation
Once Stage 1 generates an initial list of potential products that comprehensively addresses the user's query across product categories, Stage 2 refines it:
- Personalized Ranking: Leveraging user profiles, purchase histories, and brand affinities, the AI ranks and re-ranks products to match personal preferences.
- Contextual Storytelling: The system doesn't stop at listing items. It provides a compelling story or justification—like highlighting how a coat pairs perfectly with previously purchased boots or why a certain scarf is ideal for snowy conditions.
- Cross-selling & Upselling: By thoughtfully combining related products, the AI encourages users to add complementary items to their carts, boosting basket size and completion rates.

Why the Old Ways Aren't Enough Anymore
Traditional methods have significant drawbacks:
- Pure keyword matching leads to irrelevant results.
- The lack of personalization produces results that are generic, missing individual customer needs.
The Two-Stage AI approach demolishes these barriers by offering a dynamic, contextually aware, and highly personalized search experience.

Inside the Two-Stage System

Deep Dive into Phase 1
Phase 1 (Discovery) uses hybrid semantic search and structured filters to generate broad yet targeted product sets (see the retrieval sketch below):
- Expands vague queries into precise, contextually relevant search terms and categories.
- Uses search engine filters to dynamically manage product category selection, ensuring maximum relevance.

Deep Dive into Phase 2
Phase 2 (Recommendation) applies advanced personalization and re-ranking algorithms, crafting tailored recommendations:
- Refines the discovery set using detailed customer profiles.
- Reorders products and creates engaging narratives explaining product suitability.

Real-World Business Benefits & Impact
Retailers can expect significant business advantages:
- Higher Conversion Rates: Personalized results boost conversions by 30–50%.
- Increased Average Order Value: Intelligent product combinations naturally encourage larger purchases.
- Reduced Search Abandonment: Accurate context interpretation means customers find what they want faster, reducing frustration and bounce rates.
- Enhanced Customer Loyalty: Personalized shopping experiences foster repeat visits and brand affinity.
- Competitive Edge: Advanced AI capabilities clearly set businesses apart in the fiercely competitive retail landscape.
- Enhanced Fashion Relevancy: Retailers can provide hyper-personalized recommendations to their customers.

Easy Integration and ROI Measurement
Thanks to Azure AI Search and Azure OpenAI, implementation is straightforward. The solution easily integrates with existing e-commerce platforms, and comprehensive analytics make measuring KPIs (like conversion rates, average order values, and abandonment rates) simple. Continuous optimization is built right into the model, ensuring ongoing improvements.

Check out the technical details and get started on GitHub: retail-search-with-ai GitHub Repository
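To make the Phase 1 retrieval step concrete, here is a minimal sketch of a hybrid (keyword plus vector) query with a structured filter against Azure AI Search using the azure-search-documents SDK. The index name, field names, and filter expression are illustrative and are not taken from the retail-search-with-ai repository.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

# Placeholders: substitute your own search service, index, and query key.
search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="products",
    credential=AzureKeyCredential("<your-query-key>"),
)

def discover(expanded_term: str, embedding: list[float], category: str, max_price: float):
    """Run one hybrid query for a single expanded search term and category filter."""
    results = search_client.search(
        search_text=expanded_term,  # keyword side of the hybrid query
        vector_queries=[
            VectorizedQuery(vector=embedding, k_nearest_neighbors=50, fields="description_vector")
        ],
        filter=f"category eq '{category}' and price le {max_price}",  # structured filter
        top=25,
    )
    return [{"id": doc["id"], "name": doc["name"], "price": doc["price"]} for doc in results]
```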
The Smart Shopping Assistant transforms online shopping by tailoring product recommendations to your unique preferences and shopping style. The platform adapts its search results and product rankings based on your selected shopping persona, ensuring you discover products that truly align with your priorities. To do that, we created four distinct shopping personas that represent different consumer priorities:
- Luxury Diva: Prioritizes premium brands and high-quality products.
- Smart Saver: Focuses on value and finding the best deals.
- Tech Maven: Favors innovation and the latest technologies.
- Eco Warrior: Emphasizes sustainable and environmentally friendly options.

This approach eliminates hours of product comparison by instantly identifying items that match the user's specific preferences. The transparent reasoning ensures you understand exactly why certain products are recommended, giving you confidence in your purchasing decisions while maintaining complete control over your shopping. Beyond augmenting the reasoning over results, this can also be used to transform the website itself based on a customer's previous shopping and purchasing preferences.

Standard Product Search
The platform provides a conventional search experience where you can enter keywords (such as "headphones") to find relevant products. In standard mode, results are displayed based on traditional ranking factors without personalization.

AI-Powered Personalization
When you enable AI Reasoning mode via the toggle switch, the system activates its advanced recommendation engine. This feature:
- Dynamically reranks products based on your selected persona's preferences.
- Displays match percentage scores on each product card, indicating compatibility with the user profile.
- Shows ranking changes through visual indicators, allowing you to see how products move up or down in relevance.

Transparent Recommendation Logic
Unlike typical "black box" recommendation systems, we wanted to show how each recommendation was made, so we built transparency into why certain products are recommended:
- Product cards can be flipped to reveal detailed reasoning behind each recommendation.
- The system displays a feature-by-feature analysis of how each product attribute was evaluated.
- Quality, brand recognition, price sensitivity, and other factors are scored based on your persona's preference weights.

Evaluations
The results below benchmark the AI models against pure hybrid search (keyword + semantic), which we call the "baseline". The methodology we used is to provide a set of 60 queries to each model and then benchmark its performance versus the baseline. All models performed significantly better than pure hybrid search. Interestingly, the reasoning models produced performance results in the same range as one-shot models like GPT-4o and GPT-4.5.

A Note on System Latency
This solution is not considered "real-time" by today's e-commerce search standards. The two-stage search solution takes anywhere between 15 and 70 seconds, depending on the LLM used. This means it should be presented to end users as a separate "intelligent tool" that takes more time but ultimately produces much more targeted results. The UI should indicate and prepare end users for this, including setting the expectation that the wait is very much worth it.

Roadmap
The roadmap for this solution includes the following features, which we will be experimenting with:
- Building and loading a search index with generic products for demonstrations.
- Enhancing the two-stage process by adding a third stage that reasons further over the entire product search result set, going deeper into product features and past customer search and purchasing history. The aim of this third stage is to maximize the relevancy of the proposed products to the customer's preferences and expectations.
- Introducing a feedback loop from the end of the second stage (or third stage, if implemented) back into the input of the first stage. The objective of closing this loop is to refine the generated search terms and product category filters, leading to more targeted product results that trickle down the two-stage (or three-stage) pipeline.

Wrapping It Up
The future of retail search is intelligent, personalized, and context-aware. With Two-Stage AI-Enhanced Search, businesses can significantly improve customer satisfaction, boost sales, and build lasting brand loyalty. Ready to move beyond outdated search methods and embrace AI-driven retail innovation? Explore our GitHub Repo, watch the demo, and transform your customer journeys from uncertainty to satisfaction!

Want to Learn More?
Implementation Details: GitHub Repo
Contact: Reach out to our Strategic X-Pod—we're excited to help you elevate your retail game!