Azure OpenAI Service
Introducing Azure AI Agent Service
Introducing Azure AI Agent Service at Microsoft Ignite 2024. Discover how Azure AI Agent Service is revolutionizing the development and deployment of AI agents. This service empowers developers to build, deploy, and scale high-quality AI agents tailored to business needs within hours. With features like rapid development, extensive data connections, flexible model selection, and enterprise-grade security, Azure AI Agent Service sets a new standard in AI automation.

GenAI Search for Retail
Integrating generative AI into e-commerce search systems can significantly enhance user experience by refining query understanding and delivering more relevant results. A practical implementation involves deploying a query expansion mechanism that uses AI to interpret and broaden user search inputs.

Implementation Overview
The GenAISearchQueryExpander repository provides a .NET 8.0 application designed for this purpose. It leverages Azure Functions and Azure OpenAI to expand search queries, thereby improving the retrieval of pertinent products.

Key Features
Azure Functions Integration: Utilizes serverless computing to handle search query processing efficiently.
AI-Powered Query Expansion: Employs Azure OpenAI to generate expanded versions of user queries, capturing a broader range of relevant search terms.
HTTP Trigger Support: Allows the function to be invoked via HTTP requests, facilitating seamless integration with existing e-commerce platforms.

Setup and Deployment
Prerequisites:
.NET 8.0 SDK
Azure Account
Azure Functions Core Tools (for local development)

Example Usage
Once deployed, the function can be invoked via an HTTP POST request with a JSON payload containing the user's search query. The function processes this input, uses Azure OpenAI to generate expanded search terms, and returns the enhanced query for use in the e-commerce platform's search index. You simply call the REST API and supply the user's search query and the model name you want to use. The API then passes the query and a system prompt to the LLM, which expands the query into additional keywords that better capture the user's intent.

Use Case Examples
Fashion Retail: A customer searches for "red dress," but the AI expands this to include terms like "maroon gown," "crimson evening wear," and "burgundy cocktail dress."
Electronics Store: A search for "gaming laptop" expands to "high-performance laptop," "RTX 4060 laptop," and "16GB RAM laptop."
Home Improvement: A query for "LED light bulbs" expands to include "energy-efficient bulbs," "smart LED bulbs," and "dimmable LED lights."
Grocery Delivery: A search for "organic apples" is expanded to include "fresh Fuji apples," "organic Granny Smith apples," and "pesticide-free apples."

This can also enable brand-new experiences for users. Instead of searching for keywords, they can type in the problem they are trying to solve. For example: "I am painting my kid's bedroom" could return paints, brushes, rollers, handles, tape, and drop cloths. "What do I need for a football party?" could return all kinds of snacks, decorations, electronics, or clothing. By understanding the intent behind what the user is looking for, this solution can provide more relevant results and suggest products the user hadn't even thought about yet. By implementing this AI-driven query expansion, e-commerce platforms can significantly improve search accuracy, leading to enhanced user satisfaction and potentially increased sales.
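Although the reference implementation is a .NET 8.0 Azure Function, any client can call it over HTTP once deployed. Below is a minimal Python sketch of that call; the route name ("ExpandQuery"), payload field names, and model name are invented for illustration only, so check the repository for the actual request contract.

```
import requests

# Hypothetical endpoint and payload shape for the deployed query-expansion
# function -- adjust the route and field names to match the real contract.
FUNCTION_URL = "https://<your-function-app>.azurewebsites.net/api/ExpandQuery"

payload = {
    "query": "red dress",     # the raw search text entered by the user
    "model": "gpt-4o-mini",   # the Azure OpenAI deployment to use for expansion
}

response = requests.post(FUNCTION_URL, json=payload, timeout=30)
response.raise_for_status()

# The function returns the expanded query terms to feed into the search index.
print(response.json())
```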
Prompt Engineering for OpenAI's O1 and O3-mini Reasoning Models

This section explores how O1 and O3-mini differ from GPT-4o in input handling, reasoning capabilities, and response behavior, and outlines prompt engineering best practices to maximize their performance. Finally, we apply these best practices to a legal case analysis scenario.

Differences Between O1/O3-mini and GPT-4o

Input Structure and Context Handling
Built-in Reasoning vs. Prompted Reasoning: O1-series models have built-in chain-of-thought reasoning, meaning they internally reason through steps without needing explicit coaxing from the prompt. In contrast, GPT-4o often benefits from external instructions like "Let's think step by step" to solve complex problems, since it doesn't automatically engage in multi-step reasoning to the same extent. With O1/O3, you can present the problem directly; the model will analyze it deeply on its own.
Need for External Information: GPT-4o has a broad knowledge base and access to tools (e.g. browsing, plugins, vision) in certain deployments, which helps it handle a wide range of topics. By comparison, the O1 models have a narrower knowledge base outside their training focus. For example, O1-preview excelled at reasoning tasks but couldn't answer questions about itself due to limited knowledge context. This means when using O1/O3-mini, important background information or context should be included in the prompt if the task is outside common knowledge – do not assume the model knows niche facts. GPT-4o might already know a legal precedent or obscure detail, whereas O1 might require you to provide that text or data.
Context Length: The reasoning models come with very large context windows. O1 supports up to 128k tokens of input, and O3-mini accepts up to 200k tokens (with up to 100k tokens output), exceeding GPT-4o's context length. This allows you to feed extensive case files or datasets directly into O1/O3. For prompt engineering, structure large inputs clearly (use sections, bullet points, or headings) so the model can navigate the information. Both GPT-4o and O1 can handle long prompts, but O1/O3's higher capacity means you can include more detailed context in one go, which is useful in complex analyses.

Reasoning Capabilities and Logical Deduction
Depth of Reasoning: O1 and O3-mini are optimized for methodical, multi-step reasoning. They literally "think longer" before answering, which yields more accurate solutions on complex tasks. For instance, O1-preview solved 83% of problems on a challenging math exam (AIME), compared to GPT-4o's 13% – a testament to its superior logical deduction in specialized domains. These models internally perform chain-of-thought and even self-check their work. GPT-4o is also strong but tends to produce answers more directly; without explicit prompting, it might not analyze as exhaustively, leading to errors in very complex cases that O1 could catch.
Handling of Complex vs. Simple Tasks: Because O1-series models default to heavy reasoning, they truly shine on complex problems that have many reasoning steps (e.g. multi-faceted analyses, long proofs). In fact, on tasks requiring five or more reasoning steps, a reasoning model like O1-mini or O3 outperforms GPT-4 by a significant margin (16%+ higher accuracy). However, this also means that for very simple queries, O1 may "overthink." Research found that on straightforward tasks (fewer than 3 reasoning steps), O1's extra analytical process can become a disadvantage – it underperformed GPT-4 in a significant portion of such cases due to excessive reasoning.
GPT-4o might answer a simple question more directly and swiftly, whereas O1 might generate unnecessary analysis. The key difference is O1 is calibrated for complexity, so it may be less efficient for trivial Q&A. Logical Deduction Style: When it comes to puzzles, deductive reasoning, or step-by-step problems, GPT-4o usually requires prompt engineering to go stepwise (otherwise it might jump to an answer). O1/O3 handle logical deduction differently: they simulate an internal dialogue or scratchpad. For the user, this means O1’s final answers tend to be well-justified and less prone to logical gaps. It will have effectively done a “chain-of-thought” internally to double-check consistency. From a prompt perspective, you generally don’t need to tell O1 to explain or check its logic – it does so automatically before presenting the answer. With GPT-4o, you might include instructions like “first list the assumptions, then conclude” to ensure rigorous logic; with O1, such instructions are often redundant or even counterproductive. Response Characteristics and Output Optimization Detail and Verbosity: Because of their intensive reasoning, O1 and O3-mini often produce detailed, structured answers for complex queries. For example, O1 might break down a math solution into multiple steps or provide a rationale for each part of a strategy plan. GPT-4o, on the other hand, may give a more concise answer by default or a high-level summary, unless prompted to elaborate. In terms of prompt engineering, this means O1’s responses might be longer or more technical. You have more control over this verbosity through instructions. If you want O1 to be concise, you must explicitly tell it (just as you would GPT-4) – otherwise, it might err on the side of thoroughness. Conversely, if you want a step-by-step explanation in the output, GPT-4o might need to be told to include one, whereas O1 will happily provide one if asked (and has likely done the reasoning internally regardless). Accuracy and Self-Checking: The reasoning models exhibit a form of self-fact-checking. OpenAI notes that O1 is better at catching its mistakes during the response generation, leading to improved factual accuracy in complex responses. GPT-4o is generally accurate, but it can occasionally be confidently wrong or hallucinate facts if not guided. O1’s architecture reduces this risk by verifying details as it “thinks.” In practice, users have observed that O1 produces fewer incorrect or nonsensical answers on tricky problems, whereas GPT-4o might require prompt techniques (like asking it to critique or verify its answer) to reach the same level of confidence. This means you can often trust O1/O3 to get complex questions right with a straightforward prompt, whereas with GPT-4 you might add instructions like “check your answer for consistency with the facts above.” Still, neither model is infallible, so critical factual outputs should always be reviewed. Speed and Cost: A notable difference is that O1 models are slower and more expensive in exchange for their deeper reasoning. O1 Pro even includes a progress bar for long queries. GPT-4o tends to respond faster for typical queries. O3-mini was introduced to offer a faster, cost-efficient reasoning model – it’s much cheaper per token than O1 or GPT-4o and has lower latency. However, O3-mini is a smaller model, so while it’s strong in STEM reasoning, it might not match full O1 or GPT-4 in general knowledge or extremely complex reasoning. 
When prompt engineering for optimal response performance, you need to balance depth vs. speed: O1 might take longer to answer thoroughly. If latency is a concern and the task isn’t maximal complexity, O3-mini (or even GPT-4o) could be a better choice. OpenAI’s guidance is that GPT-4o “is still the best option for most prompts,” using O1 primarily for truly hard problems in domains like strategy, math, and coding. In short, use the right tool for the job – and if you use O1, anticipate longer responses and plan for its slower output (possibly by informing the user or adjusting system timeouts). Prompt Engineering Techniques to Maximize Performance Leveraging O1 and O3-mini effectively requires a slightly different prompting approach than GPT-4o. Below are key prompt engineering techniques and best practices to get the best results from these reasoning models: Keep Prompts Clear and Minimal Be concise and direct with your ask. Because O1 and O3 perform intensive internal reasoning, they respond best to focused questions or instructions without extraneous text. OpenAI and recent research suggest avoiding overly complex or leading prompts for these models. In practice, this means you should state the problem or task plainly and provide only necessary details. There is no need to add “fluff” or multiple rephrasing of the query. For example, instead of writing: “In this challenging puzzle, I’d like you to carefully reason through each step to reach the correct solution. Let’s break it down step by step...”, simply ask: “Solve the following puzzle [include puzzle details]. Explain your reasoning.” The model will naturally do the step-by-step thinking internally and give an explanation. Excess instructions can actually overcomplicate things – one study found that adding too much prompt context or too many examples worsened O1’s performance, essentially overwhelming its reasoning process. Tip: For complex tasks, start with a zero-shot prompt (just the task description) and only add more instruction if you find the output isn’t meeting your needs. Often, minimal prompts yield the best results with these reasoning models. Avoid Unnecessary Few-Shot Examples Traditional prompt engineering for GPT-3/4 often uses few-shot examples or demonstrations to guide the model. With O1/O3, however, less is more. The O1 series was explicitly trained to not require example-laden prompts. In fact, using multiple examples can hurt performance. Research on O1-preview and O1-mini showed that few-shot prompting consistently degraded their performance – even carefully chosen examples made them do worse than a simple prompt in many cases. The internal reasoning seems to get distracted or constrained by the examples. OpenAI’s own guidance aligns with this: they recommend limiting additional context or examples for reasoning models to avoid confusing their internal logic. Best practice: use zero-shot or at most one example if absolutely needed. If you include an example, make it highly relevant and simple. For instance, in a legal analysis prompt, you generally would not prepend a full example case analysis; instead, just ask directly about the new case. The only time you might use a demonstration is if the task format is very specific and the model isn’t following instructions – then show one brief example of the desired format. Otherwise, trust the model to figure it out from a direct query. 
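To make the zero-shot advice concrete, here is a minimal sketch of calling an O1-series deployment through Azure OpenAI with nothing but a direct task statement. The endpoint, deployment name, and API version are placeholders, and the puzzle is an arbitrary example.

```
import os
from openai import AzureOpenAI

# Placeholders: assumes an Azure OpenAI resource with an o1 (or o3-mini) deployment.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-12-01-preview",
)

# Zero-shot: state the task once, include the needed details, and stop there.
completion = client.chat.completions.create(
    model="o1",  # your deployment name
    messages=[
        {
            "role": "user",
            "content": (
                "Solve the following puzzle and explain your reasoning: a farmer must "
                "ferry a wolf, a goat, and a cabbage across a river in a boat that "
                "holds only one item at a time. List the crossings in order."
            ),
        }
    ],
)
print(completion.choices[0].message.content)
```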
Leverage System/Developer Instructions for Role and Format
Setting a clear instructional context can help steer the model's responses. With the API (or within a conversation's system message), define the model's role or style succinctly. For example, a system message might say: "You are an expert scientific researcher who explains solutions step-by-step". O1 and O3-mini respond well to such role instructions and will incorporate them in their reasoning. However, remember that they already excel at understanding complex tasks, so your instructions should focus on what kind of output you want, not how to think. Good uses of system/developer instructions include:
Defining the task scope or persona: e.g. "Act as a legal analyst" or "Solve the problem as a math teacher explaining to a student." This can influence tone and the level of detail.
Specifying the output format: If you need the answer in a structured form (bullet points, a table, JSON, etc.), explicitly say so. O1 and especially O3-mini support structured output modes and will adhere to format requests. For instance: "Provide your findings as a list of key bullet points." Given their logical nature, they tend to follow format instructions accurately, which helps maintain consistency in responses.
Setting boundaries: If you want to control verbosity or focus, you can include something like "Provide a brief conclusion after the detailed analysis" or "Only use the information given without outside assumptions." The reasoning models will respect these boundaries, and it can prevent them from going on tangents or hallucinating facts. This is important since O1 might otherwise produce a very exhaustive analysis – which is often great, but not if you explicitly need just a summary.
Ensure any guidance around tone, role, and format is included each time.

Control Verbosity and Depth Through Instructions
While O1 and O3-mini will naturally engage in deep reasoning, you have control over how much of that reasoning is reflected in the output. If you want a detailed explanation, prompt for it (e.g. "Show your step-by-step reasoning in the answer"). They won't need the nudge to do the reasoning, but they do need to be told if you want to see it. Conversely, if you find the model's answers too verbose or technical for your purposes, instruct it to be more concise or to focus only on certain aspects. For example: "In 2-3 paragraphs, summarize the analysis with only the most critical points." The models are generally obedient to such instructions about length or focus. Keep in mind that O1's default behavior is to be thorough – it's optimized for correctness over brevity – so it may err on the side of giving more details. A direct request for brevity will override this tendency in most cases. For O3-mini, OpenAI provides an additional tool to manage depth: the "reasoning effort" parameter (low, medium, high). This setting lets the model know how hard to "think." In prompt terms, if using the API or a system that exposes this feature, you can dial it up for very complex tasks (ensuring maximum reasoning, at the cost of longer answers and latency) or dial it down for simpler tasks (faster, more streamlined answers). This is essentially another way to control verbosity and thoroughness. If you don't have direct access to that parameter, you can mimic a low effort mode by explicitly saying "Give a quick answer without deep analysis" for cases where speed matters more than perfect accuracy.
Conversely, to mimic high effort, you might say "Take all necessary steps to arrive at a correct answer, even if the explanation is long." These cues align with how the model's internal setting would operate.

Ensure Accuracy in Complex Tasks
To get the most accurate responses on difficult problems, take advantage of the reasoning model's strengths in your prompt. Since O1 can self-check and even catch contradictions, you can ask it to utilize that: e.g. "Analyze all the facts and double-check your conclusion for consistency." Often it will do so unprompted, but reinforcing that instruction can signal the model to be extra careful. Interestingly, because O1 already self-fact-checks, you rarely need to prompt it with something like "verify each step" (that's more helpful for GPT-4o). Instead, focus on providing complete and unambiguous information. If the question or task has potential ambiguities, clarify them in the prompt or instruct the model to list any assumptions. This prevents the model from guessing wrongly.
Handling sources and data: If your task involves analyzing given data (like summarizing a document or computing an answer from provided numbers), make sure that data is clearly presented. O1/O3 will diligently use it. You can even break data into bullet points or a table for clarity. If the model must not hallucinate (say, in a legal context it shouldn't make up laws), explicitly state "base your answer only on the information provided and common knowledge; do not fabricate any details." The reasoning models are generally good at sticking to known facts, and such an instruction further reduces the chance of hallucination.
Iterate and verify: If the task is critical (for example, complex legal reasoning or a high-stakes engineering calculation), a prompt engineering technique is to ensemble the model's responses. This isn't a single prompt, but a strategy: you could run the query multiple times (or ask the model to consider alternative solutions) and then compare answers. O1's stochastic nature means it might explore different reasoning paths each time. By comparing outputs or asking the model to "reflect if there are alternative interpretations" in a follow-up prompt, you can increase confidence in the result. While GPT-4o also benefits from this approach, it's especially useful for O1 when absolute accuracy is paramount – essentially leveraging the model's own depth by cross-verifying.
Finally, remember that model selection is part of prompt engineering: If a question doesn't actually require O1-level reasoning, using GPT-4o might be more efficient and just as accurate. OpenAI recommends saving O1 for the hard cases and using GPT-4o for the rest. So a meta-tip: assess task complexity first. If it's simple, either prompt O1 very straightforwardly to avoid overthinking, or switch to GPT-4o. If it's complex, lean into O1's abilities with the techniques above.

How O1/O3 Handle Logical Deduction vs. GPT-4o
The way these reasoning models approach logical problems differs fundamentally from GPT-4o, and your prompt strategy should adapt accordingly:
Internal Chain-of-Thought: O1 and O3-mini effectively perform an internal dialogue or step-by-step solution as they deduce answers. GPT-4o, unless explicitly guided, might not rigorously go through each step. For example, in a logic puzzle or a math word problem, GPT-4o might give a quick answer that sounds plausible but skips some reasoning, increasing the risk of error.
O1 will automatically break the problem down, consider various angles, and only then give an answer, which is why it achieved dramatically higher scores on logic-heavy evaluations. Prompting difference: Do not prompt O1 to “show the reasoning” unless you actually want to see it. With GPT-4o, you’d use a CoT prompt (“First, think about... then ...”) to improve deduction, but with O1 this is built-in and telling it to do so externally can be redundant or even confusing. Instead, just ensure the problem is clearly stated and let O1 deductively reason it out. Handling Ambiguities: In logical deduction tasks, if there’s missing info or ambiguity, GPT-4o might make an assumption on the fly. O1 is more likely to flag the ambiguity or consider multiple possibilities because of its reflective approach. To leverage this, your prompt to O1 can directly ask: “If there are any uncertainties, state your assumptions before solving.” GPT-4 might need that nudge more. O1 might do it naturally or at least is less prone to assuming facts not given. So in comparing the two, O1’s deduction is cautious and thorough, whereas GPT-4o’s is swift and broad. Tailor your prompt accordingly – with GPT-4o, guide it to be careful; with O1, you mainly need to supply the information and let it do its thing. Step-by-Step Outputs: Sometimes you actually want the logical steps in the output (for teaching or transparency). With GPT-4o, you must explicitly request this (“please show your work”). O1 might include a structured rationale by default if the question is complex enough, but often it will present a well-reasoned answer without explicitly enumerating every step unless asked. If you want O1 to output the chain of logic, simply instruct it to — it will have no trouble doing so. In fact, O1-mini was noted to be capable of providing stepwise breakdowns (e.g., in coding problems) when prompted. Meanwhile, if you don’t want a long logical exposition from O1 (maybe you just want the final answer), you should say “Give the final answer directly” to skip the verbose explanation. Logical Rigor vs. Creativity: One more difference: GPT-4 (and 4o) has a streak of creativity and generative strength. Sometimes in logic problems, this can lead it to “imagine” scenarios or analogies, which isn’t always desired. O1 is more rigor-focused and will stick to logical analysis. If your prompt involves a scenario requiring both deduction and a bit of creativity (say, solving a mystery by piecing clues and adding a narrative), GPT-4 might handle the narrative better, while O1 will strictly focus on deduction. In prompt engineering, you might combine their strengths: use O1 to get the logical solution, then use GPT-4 to polish the presentation. If sticking to O1/O3 only, be aware that you might need to explicitly ask it for creative flourishes or more imaginative responses – they will prioritize logic and correctness by design. Key adjustment: In summary, to leverage O1/O3’s logical strengths, give them the toughest reasoning tasks as a single well-defined prompt. Let them internally grind through the logic (they’re built for it) without micromanaging their thought process. For GPT-4o, continue using classic prompt engineering (decompose the problem, ask for step-by-step reasoning, etc.) to coax out the same level of deduction. And always match the prompt style to the model – what confuses GPT-4o might be just right for O1, and vice versa, due to their different reasoning approaches. 
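As an illustration of that key adjustment, the two message lists below show the same made-up problem phrased for each model family: scaffolded for GPT-4o, direct for O1. The problem itself is arbitrary; only the prompting style matters.

```
# Classic prompt engineering for GPT-4o: decompose the task and request the steps.
gpt4o_messages = [
    {"role": "system", "content": "You are a careful analyst."},
    {
        "role": "user",
        "content": (
            "A warehouse ships 120 orders on Monday and 15% more each following day. "
            "First list your assumptions, then work through the calculation step by step, "
            "then state how many orders ship on Friday."
        ),
    },
]

# For O1/O3-mini: the same problem, asked directly -- the model reasons internally.
o1_messages = [
    {
        "role": "user",
        "content": (
            "A warehouse ships 120 orders on Monday and 15% more each following day. "
            "How many orders ship on Friday?"
        ),
    }
]
```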
Crafting Effective Prompts: Best Practices Summary
To consolidate the above into actionable guidelines, here's a checklist of best practices when prompting O1 or O3-mini:
Use Clear, Specific Instructions: Clearly state what you want the model to do or answer. Avoid irrelevant details. For complex questions, a straightforward ask often suffices (no need for elaborate role-play or multi-question prompts).
Provide Necessary Context, Omit the Rest: Include any domain information the model will need (facts of a case, data for a math problem, etc.), since the model might not have up-to-date or niche knowledge. But don't overload the prompt with unrelated text or too many examples – extra fluff can dilute the model's focus.
Minimal or No Few-Shot Examples: By default, start with zero-shot prompts. If the model misinterprets the task or format, you can add one simple example as guidance, but never add long chains of examples for O1/O3. They don't need it, and it can even degrade performance.
Set the Role or Tone if Needed: Use a system message or a brief prefix to put the model in the right mindset (e.g. "You are a senior law clerk analyzing a case."). This helps especially with tone (formal vs. casual) and ensures domain-appropriate language.
Specify Output Format: If you expect the answer in a particular structure (list, outline, JSON, etc.), tell the model explicitly. The reasoning models will follow format instructions reliably. For instance: "Give your answer as an ordered list of steps."
Control Length and Detail via Instructions: If you want a brief answer, say so ("answer in one paragraph" or "just give a yes/no with one sentence explanation"). If you want an in-depth analysis, encourage it ("provide a detailed explanation"). Don't assume the model knows your desired level of detail by default – instruct it.
Leverage O3-mini's Reasoning Effort Setting: When using O3-mini via API, choose the appropriate reasoning_effort (low/medium/high) for the task, as shown in the sketch after this list. High gives more thorough answers (good for complex legal reasoning or tough math), low gives faster, shorter answers (good for quick checks or simpler queries). This is a unique way to tune the prompt behavior for O3-mini.
Avoid Redundant "Think Step-by-Step" Prompts: Do not add phrases like "let's think this through" or chain-of-thought directives for O1/O3; the model already does this internally. Save those tokens and only use such prompts on GPT-4o, where they have impact. An exception might be if you explicitly want the model to output each step for transparency – then you can ask for that in the output, but you still don't need to tell it to actually perform reasoning.
Test and Iterate: Because these models can be sensitive to phrasing, if you don't get a good answer, try rephrasing the question or tightening the instructions. You might find that a slight change (e.g. asking a direct question vs. an open-ended prompt) yields a significantly better response. Fortunately, O1/O3's need for iteration is less than older models (they usually get complex tasks right in one go), but prompt tweaking can still help optimize clarity or format.
Validate Important Outputs: For critical use-cases, don't rely on a single prompt-answer cycle. Use follow-up prompts to ask the model to verify or justify its answer ("Are you confident in that conclusion? Explain why."), or run the prompt again to see if you get consistent results. Consistency and well-justified answers indicate the model's reasoning is solid.
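Here is a minimal sketch of the reasoning-effort setting mentioned in the checklist, using the Azure OpenAI Python client against an O3-mini deployment. The deployment name, API version, and prompt text are placeholders; note that O-series models take max_completion_tokens rather than max_tokens.

```
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-12-01-preview",  # placeholder; use a version that supports o3-mini
)

completion = client.chat.completions.create(
    model="o3-mini",               # your deployment name
    reasoning_effort="high",       # "low" for quick checks, "high" for hard problems
    max_completion_tokens=2000,    # o-series models use max_completion_tokens
    messages=[
        {
            "role": "user",
            "content": "Given the facts below, determine whether clause 4.2 was breached: ...",
        }
    ],
)
print(completion.choices[0].message.content)
```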
By following these techniques, you can harness O1 and O3-mini's full capabilities and get highly optimized responses that play to their strengths.

Applying Best Practices to a Legal Case Analysis
Finally, let's consider how these prompt engineering guidelines translate to a legal case analysis scenario (as mentioned earlier). Legal analysis is a perfect example of a complex reasoning task where O1 can be very effective, provided we craft the prompt well:
Structure the Input: Start by clearly outlining the key facts of the case and the legal questions to be answered. For example, list the background facts as bullet points or a brief paragraph, then explicitly ask the legal question: "Given the above facts, determine whether Party A is liable for breach of contract under U.S. law." Structuring the prompt this way makes it easier for the model to parse the scenario. It also ensures no crucial detail is buried or overlooked.
Provide Relevant Context or Law: If specific statutes, case precedents, or definitions are relevant, include them (or summaries of them) in the prompt. O1 doesn't have browsing and might not recall a niche law from memory, so if your analysis hinges on, say, the text of a particular law, give it to the model. For instance: "According to [Statute X excerpt], [provide text]… Apply this statute to the case." This way, the model has the necessary tools to reason accurately.
Set the Role in the System Message: A system instruction like "You are a legal analyst who explains the application of law to facts in a clear, step-by-step manner." will cue the model to produce a formal, reasoned analysis. While O1 will already attempt careful reasoning, this instruction aligns its tone and structure with what we expect in legal discourse (e.g. citing facts, applying law, drawing conclusions).
No Need for Multiple Examples: Don't supply a full example case analysis as a prompt (which you might consider doing with GPT-4o). O1 doesn't need an example to follow – it can perform the analysis from scratch. You might, however, briefly mention the desired format: "Provide your answer in an IRAC format (Issue, Rule, Analysis, Conclusion)." This format instruction gives a template without having to show a lengthy sample, and O1 will organize the output accordingly.
Control Verbosity as Needed: If you want a thorough analysis of the case, let O1 output its comprehensive reasoning. The result may be several paragraphs covering each issue in depth. If you find the output too verbose or if you specifically need a succinct brief (for example, a quick advisory opinion), instruct the model: "Keep the analysis to a few key paragraphs focusing on the core issue." This ensures you get just the main points. On the other hand, if the initial answer seems too brief or superficial, you can prompt again: "Explain in more detail, especially how you applied the law to the facts." O1 will gladly elaborate because it has already done the heavy reasoning internally.
Accuracy and Logical Consistency: Legal analysis demands accuracy in applying rules to facts. With O1, you can trust it to logically work through the problem, but it's wise to double-check any legal citations or specific claims it makes (since its training data might not have every detail). You can even add a prompt at the end like, "Double-check that all facts have been addressed and that the conclusion follows the law." Given O1's self-checking tendency, it may itself point out if something doesn't add up or if additional assumptions were needed.
This is a useful safety net in a domain where subtle distinctions matter.
Use Follow-Up Queries: In a legal scenario, it's common to have follow-up questions. For instance, if O1 gives an analysis, you might ask, "What if the contract had a different clause about termination? How would that change the analysis?" O1 can handle these iterative questions well, carrying over its reasoning. Just remember that if the interface you are working with doesn't have long-term memory beyond the current conversation context (and no browsing), each follow-up should either rely on the context provided or include any new information needed. Keep the conversation focused on the case facts at hand to prevent confusion.
By applying these best practices, your prompts will guide O1 or O3-mini to deliver high-quality legal analysis. In summary, clearly present the case, specify the task, and let the reasoning model do the heavy lifting. The result should be a well-reasoned, step-by-step legal discussion that leverages O1's logical prowess, all optimized through effective prompt construction. Using OpenAI's reasoning models in this way allows you to tap into their strength in complex problem-solving while maintaining control over the style and clarity of the output. As OpenAI's own documentation notes, the O1 series excels at deep reasoning tasks in domains like research and strategy – legal analysis similarly benefits from this capability. By understanding the differences from GPT-4o and adjusting your prompt approach accordingly, you can maximize the performance of O1 and O3-mini and obtain accurate, well-structured answers even for the most challenging reasoning tasks.
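To make the legal-analysis guidance above concrete, here is a sketch that assembles such a prompt in Python. The facts, statute excerpt, and question are invented placeholders; the resulting messages list can be sent with a chat-completions call like the earlier sketches.

```
# Illustrative only -- the facts, rule text, and question are placeholders.
case_facts = [
    "Party A agreed in writing to deliver 500 units by 1 March.",
    "Party A delivered 300 units on 10 March; Party B refused payment.",
    "The contract allows delay for documented supply shortages.",
]

statute_excerpt = (
    "Quote the governing rule here verbatim, since the model should not be "
    "asked to recall it from memory."
)

system_message = (
    "You are a legal analyst who explains the application of law to facts in a "
    "clear, step-by-step manner. Use IRAC format (Issue, Rule, Analysis, Conclusion) "
    "and rely only on the material provided."
)

user_message = (
    "Facts:\n- " + "\n- ".join(case_facts) + "\n\n"
    "Relevant rule:\n" + statute_excerpt + "\n\n"
    "Question: Given the above facts, is Party A liable for breach of contract? "
    "If any information is missing, state your assumptions before analysing."
)

messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message},
]
# Send `messages` to an o1 deployment with client.chat.completions.create(...).
```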
Use generative AI to extract structured data out of emails

One thing we regularly hear from clients is that they receive information that is key to their business, such as order requests, via email in an unstructured format, and sometimes there is structured information within the body of those emails in a variety of table formats. In today's fast-paced digital world, businesses need a way to automatically extract, structure, and integrate this information into their existing applications. Whether it's leveraging AI-powered document processing, natural language processing (NLP), or intelligent automation, the right approach can transform email-based orders into structured, actionable data. In this blog, we'll explore one such scenario where AI can be leveraged to extract information in tabular format that has been provided within an email. The emails contextually belong to a specific domain, but the tables do not have consistent headers or shapes. Sometimes the body of one email contains multiple tables.

The Problem Statement
Extract tabular information with varying table formats from emails. The typical approach to this problem involves rule-based processing, where individual tables are extracted and merged based on predefined logic. However, given the variety of email formats from hundreds or even thousands of different senders, maintaining such rule-based logic becomes increasingly complex and difficult to manage. A more optimal solution is leveraging the cognitive capabilities of generative AI, which can dynamically adapt to different table structures, column names, and formatting variations—eliminating the need for constant rule updates while improving accuracy and scalability.

To create this sample code, I used the email below with test data, containing two tables with inconsistent column names. It provides information about some upcoming trainings. Please note the difference between the column headers:

Hi there,

Regarding the upcoming trainings, this is the list:

| Event Date | Description of Event | Length | Grade |
| --- | --- | --- | --- |
| 2025-01-21 | Digital environments | 20 hours | 5 |
| 2025-03-01 | AI for Industry A | 10 hours | 3 |

and some further events in the below list:

| Date | Subject | Duration | Grade |
| --- | --- | --- | --- |
| 2025-01-21 | Digital environments 2 | 2 days | 1 |
| 2025-03-01 | AI for Industry B | 2 weeks | 4 |

These sessions are designed to be interactive and informative, so your timely participation is crucial. Please make sure to log in or arrive on time to avoid missing key insights. If you have any questions or need assistance, feel free to reach out. Looking forward to seeing you there!

Thanks,
Azadeh

These are the two tables within the email, and we need to extract one consistent table format with all the rows from these two tables.

Table 1
| Event Date | Description of Event | Length | Grade |
| --- | --- | --- | --- |
| 2025-01-21 | Digital environments | 20 hours | 5 |
| 2025-03-01 | AI for Industry A | 10 hours | 3 |

Table 2
| Date | Subject | Duration | Grade |
| --- | --- | --- | --- |
| 2025-01-21 | Digital environments 2 | 2 days | 1 |
| 2025-03-01 | AI for Industry B | 2 weeks | 4 |

To extract the tabular data into one single table in JSON format, I am using Python with the below libraries installed in my environment:
pandas
beautifulsoup4
openai
lxml

The Code
I use Azure OpenAI Service with a GPT-4o deployment. The code below is just one way of solving this type of problem and can be customized or improved to fit other similar problems. I have provided some guidelines in the user prompt about merging the tables and column-name similarity.
This sample code uses an email message saved in 'eml' format at a local path, but the email library has other capabilities to help you connect to a mailbox and retrieve emails.

```
import email
import os

import pandas as pd
from bs4 import BeautifulSoup
from openai import AzureOpenAI

endpoint = os.getenv("ENDPOINT_URL", "https://....myendpointurl....openai.azure.com/")
deployment = os.getenv("DEPLOYMENT_NAME", "gpt-4o")
subscription_key = os.getenv("AZURE_OPENAI_API_KEY", "myapikey")

# Initialize Azure OpenAI Service client with key-based authentication
client = AzureOpenAI(
    azure_endpoint=endpoint,
    api_key=subscription_key,
    api_version="2024-05-01-preview",
)


# Process email content with GPT-4o
def extract_information(email_body, client):
    # Strip any HTML so the model receives plain text
    soup = BeautifulSoup(email_body, "html.parser")
    body = soup.get_text()
    print(body)

    # Prepare the chat prompt
    chat_prompt = [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant that is expert in extracting structured data from emails.",
                }
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Extract the required information from the following email and format it as JSON and consolidate the tables using the common column names. For example the columns length and duration are the same and the columns Event and Subject are the same:\n\n{body}",
                }
            ],
        },
    ]

    messages = chat_prompt

    # Generate the completion
    completion = client.chat.completions.create(
        model=deployment,
        messages=messages,
        max_tokens=800,
        temperature=0.1,
        top_p=0.95,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
        stream=False,
    )

    return completion.choices[0].message.content


email_file_name = r'...path to your file....\Test Email with Tables.eml'

with open(email_file_name, "r") as f:
    msg = email.message_from_file(f)

# Walk the MIME parts and keep the text/plain or text/html body
email_body = ""
for part in msg.walk():
    if part.get_content_type() == "text/plain":
        email_body = part.get_payload(decode=True).decode()
    elif part.get_content_type() == "text/html":
        email_body = part.get_payload(decode=True).decode()

extracted_info = extract_information(email_body, client)
print(extracted_info)
```

The output is:

```
[
    {
        "Event": "Digital environments",
        "Date": "2025-01-21",
        "Length": "20 hours",
        "Grade": 5
    },
    {
        "Event": "AI for Industry A",
        "Date": "2025-03-01",
        "Length": "10 hours",
        "Grade": 3
    },
    {
        "Event": "Digital environments 2",
        "Date": "2025-01-21",
        "Length": "2 days",
        "Grade": 1
    },
    {
        "Event": "AI for Industry B",
        "Date": "2025-03-01",
        "Length": "2 weeks",
        "Grade": 4
    }
]
```

Key points in the code:
Read an email and extract the body.
Use a gen AI model with the right instruction prompt to complete the task.
Gen AI will follow the instructions and create a combined, consistent table.
Get the output in the right format, e.g. JSON.

I hope you find this blog post helpful, and you can apply it to your use case/domain. Or you can simply take away the idea of how to use generative AI to solve a problem, instead of building layers of custom logic.

Building an OpenAI powered Recommendation Engine
Introduction
Recommendation engines play a vital role in enhancing user experiences by providing personalized suggestions and have proven to be an effective strategy for turning engagement into business value. The technical objective of a recommendation engine is to filter and present the most relevant items from vast datasets while considering business constraints. This process includes steps like data collection, preprocessing, model training, and deployment. Advanced techniques such as embeddings and cosine similarity are used to determine the most relevant results for recommendations. This blog explores the design and implementation of a recommendation engine. It addresses the challenges faced by traditional systems and how modern approaches can overcome them, aiming to build a robust, scalable recommendation engine suitable for various domains.

Background / Problem Scenario
Traditional recommendation systems often fall short due to their reliance on basic filtering techniques and limited understanding of user behaviour, resulting in poor recommendations and user dissatisfaction. The main issue is that traditional recommendation engines struggle to analyse large datasets and understand the relationships between items, leading to a mismatch between user preferences and recommendations. Additionally, the need for real-time, personalized suggestions adds complexity. To address this, we need a recommendation engine that leverages advanced AI techniques like embeddings and cosine similarity to accurately filter relevant results. This engine should be scalable, capable of handling vast amounts of data, and able to provide quick, relevant recommendations. We have implemented a similar solution on our Microsoft Career Site, which has been scaled to provide job recommendations to internal users in over 100 countries across the globe. We have noticed a significant increase in conversion rates of 1.6 times in job applications through recommendations vs job search. This solution is not limited to a career site but can be adopted for a variety of recommendation scenarios such as e-commerce, social media, e-learning platforms, media streaming platforms, travel and hospitality, healthcare, retail and much more.

Key Features
Semantic Understanding: By using embeddings, the engine captures the semantic meaning of items, leading to more relevant recommendations.
Agility and Customizability: Customization options are available by modifying the weights.
Scalability: Azure AI Search provides scalable storage and efficient retrieval of embeddings, making the system suitable for large datasets.
Real-time Recommendations: The use of cosine similarity allows for quick computation of similarity scores, enabling real-time recommendations.
Flexibility: The system can be adapted to various domains, such as e-commerce, content streaming, and social media, by training domain-specific embedding models.

Working Principle
Raw Data Conversion: The recommendation engine converts raw data into Named Entity Recognition (NER) output using OpenAI. The NER output here is simply JSON in a pre-defined schema.
Vector Embeddings: The NER output is then converted into vector embeddings using OpenAI.
Vector Database: This is used to store embeddings and query them efficiently. We preferred to use Azure AI Search as our vector database.
User Interaction: When a user interacts with the system, their preferences are also converted into embeddings.
Cosine Similarity: In technical terms, cosine similarity measures the angle between the user's embedding and item embeddings. In simple terms, it's a technique used to generate a score that indicates how closely an item matches the given sample.
Recommendation: This process identifies the most similar items, ensuring recommendations are based on the semantic similarity of items rather than just surface-level features. This process also applies additional filters on the result based on what the user has shared via the feedback loop.

Data Flow
NER Generation: With existing structured or unstructured data, NER (Named Entity Recognition) output is generated using OpenAI with a prompt engineering approach.
Embeddings Generation: The NER output is further processed with OpenAI to generate embeddings.
Azure AI Search: Generated embeddings are then stored in Azure AI Search.
Recommendation Generation: Using vector queries and cosine similarity calculations, a set of matching results is generated. On top of that, additional filtering is applied based on the user feedback collected via the feedback loop, and the result is then served as recommendations.
Feedback Loop: Used to enhance the recommendation results based on user feedback. The feedback collected here is used to further refine the final calculated results.
Azure Premium Storage: Used for caching results to improve performance. When considering caching solutions for our recommendation engine, several factors come into play:
Redis Cache Limitations: Redis can struggle with larger response sizes, around 1.5 MB.
Cost Efficiency: Blob-based caching is often more cost-effective compared to Redis.
Document DB Constraints: The maximum response size is usually capped at a few MB, which may not be scalable for larger result datasets. Also, scaling up a document database can be costly.
Response Time Goals: Our aim is to significantly reduce response times without incurring high costs for ultra-fast API responses.
Performance Metrics: For 25 job recommendations in our pre-production environment, the response time was around 600 ms, which meets our SLA.

Considerations for Engineering Standards

Security
Disable Secrets / Local Auth: Disable local authentication and secrets/connection strings for all Azure services to enhance security and prevent unauthorized access. Use Managed Identity wherever applicable and possible.
Firewall: Consider limiting the IP range accessibility of the databases to reduce the risk of unauthorized access. You can also use a Virtual Network to restrict access.
Rate Limiting: Implement rate limiting to prevent throttling and ensure fair usage of OpenAI resources.
Encryption: Ensure all data at rest and in transit is encrypted to protect sensitive information.
Identity and Access Management (IAM): Implement strict IAM policies to control who can access what resources.
Security Audits: Regularly conduct security audits to identify and mitigate vulnerabilities.
Incident Response Plan: Develop and maintain an incident response plan to quickly address security breaches.

Quality
Comprehensive Testing for Quality NER: Set up an extensive testing environment to guarantee high-quality Named Entity Recognition (NER) outputs. With high-quality NER, the overall quality and reliability of the entire system improves significantly. In our scenario, we developed an automated tool to feed bulk datasets to generate and test the quality of NER. Manual quality testing is also required to some extent to ensure results do not exhibit any bias based on language, colour, ethnicity, etc.
Unit Testing: Make use of a unit testing framework to ensure consistent and thorough testing of all code changes.
Build Verification Testing (BVT): Perform automated BVT to ensure that the build is stable and meets the basic requirements before proceeding to more rigorous testing.

Performance
Result Caching: Implement caching mechanisms to store frequently accessed data and improve response times.
Multi-Region Load Balancing: Distribute traffic across multiple regions to enhance performance and ensure high availability.
Load Testing: Conduct load testing to evaluate system performance under high traffic conditions and identify potential bottlenecks. We used JMeter for load testing in our scenario.
Database Optimization: Optimize database queries and indexing to improve performance. Also ensure the database is appropriately scaled to cater for the required load.
Content Delivery Network (CDN): Use CDNs to reduce latency and improve load times for users globally.
Scalability Testing: Test the system's ability to scale up or down based on demand.
Resource / SKU Allocation: Efficiently allocate resources to ensure optimal performance under varying loads.

Prompt Engineering in OpenAI
OpenAI Model Selection: Extensive rounds of testing may be required to identify the optimal model for your use case. New, higher-performing models are emerging almost every quarter. Ensure a thorough validation is done before you plan to switch to a new model.
Context Awareness: Ensure your prompts consider user preferences, history, and current context for personalized recommendations if applicable. In our case, there was no such context requirement.
Clarity and Brevity: Keep prompts clear and concise to avoid user confusion and encourage quick responses.
Dynamic Adjustments: Adapt your prompts based on user feedback and changing preferences to keep recommendations relevant.
Avoid Bias: Enrich your prompts to avoid any kind of bias in results.
Feedback Loops: Implement prompts that actively seek user feedback to continually refine and improve the recommendation system.

Deployment & Release
Feature Flighting: Gradually roll out new features to a subset of users to test and gather feedback before full deployment.
Blue-Green Deployment: Use blue-green deployment strategies to minimize downtime and reduce the risk during updates.
CI/CD Pipelines: Implement Continuous Integration and Continuous Deployment pipelines to automate testing and deployment processes, ensuring faster and more reliable releases.
Rollback Strategies: Develop rollback strategies to quickly revert to a previous version in case of issues during deployment.
Infra as Code: It is recommended to use Bicep or similar infrastructure-as-code approaches for infrastructure setup.

Challenges Anticipated
AI Hallucinations: Preventing AI-generated hallucinations. These can be mitigated with appropriate prompts and rigorous testing with malicious prompts.
Quality Assurance: Maintaining rigorous quality testing protocols.
NER Extraction Accuracy: Enhancing the precision of Named Entity Recognition (NER) by improving prompts.
Data Privacy and Compliance: Upholding data privacy standards and conducting thorough reviews.

Conclusion: Why Should You Consider This Approach?
Easy Integration with Azure AI Search: One of the biggest advantages of using Azure AI Search is how easy it is to integrate. You don't need to spend a lot of time setting up complex infrastructure. Instead, you can focus on fine-tuning your recommendation algorithms.
Azure AI Search comes with built-in support for vector search, making it simpler to implement advanced recommendation systems.
Scalability: Azure AI Search is designed to handle large datasets efficiently. This means your recommendation engine can grow alongside your user base without losing performance. The platform can manage high query volumes and large-scale data indexing, ensuring your system stays responsive and reliable as it scales.
Vector-Based Search Benefits: Traditional filtering techniques often fall short in capturing the true meaning behind user preferences. Vector-based search, on the other hand, understands the semantic relationships between items, leading to more accurate and relevant recommendations. This results in a better user experience, as the suggestions are more aligned with what users are actually looking for.
Cost Efficiency: Choosing the right caching strategies, like Azure Premium Storage blob-based caching over Redis, can help you save costs while maintaining performance. This is especially important for large-scale deployments where budget management is crucial. Blob storage is a cost-effective solution for storing large amounts of data.
Real-World Impact: Implementing a recommendation engine like this can have a significant impact on user engagement and business outcomes. For instance, personalized job recommendations on the Microsoft Global Career Site have led to improved candidate engagement and a 1.6x increase in conversion rates. Delivering relevant content quickly enhances user experience and drives important business metrics like retention and conversion.

References
Introduction to Vector Embeddings
OpenAI cookbook on Vector databases
Introducing text and code embeddings

Contributors: Ashish Mathur, Jayesh Kudukken Thazhath, Ashudeep Reshi, Bipul Raman, Swadhin Nayak, Sivakamy Lakshminarayanan, Prachi Nautiyal, Priyanka Kumari, Abhishek Mishra, Satya Vamsi Gadikoyila
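As a small, self-contained illustration of the embedding-plus-cosine-similarity scoring described in this article, here is a sketch using the Azure OpenAI Python client and NumPy. The endpoint, embedding deployment name, and sample texts are placeholders; in the real system the scoring happens at scale inside Azure AI Search rather than in client code.

```
import os
import numpy as np
from openai import AzureOpenAI

# Placeholders: assumes an Azure OpenAI resource with a text-embedding deployment.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-05-01-preview",
)

def embed(text):
    # Return the embedding vector for a piece of text.
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a, b):
    # A score close to 1 means the two items are semantically similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

user_profile = embed("Data engineer interested in streaming pipelines and Spark")
items = {
    "Data Engineer - real-time analytics": embed("Build streaming pipelines with Spark and Kafka"),
    "Front-end Developer": embed("Build React components for the marketing site"),
}

# Rank candidate items by similarity to the user's profile.
ranked = sorted(items.items(), key=lambda kv: cosine_similarity(user_profile, kv[1]), reverse=True)
for name, vector in ranked:
    print(f"{cosine_similarity(user_profile, vector):.3f}  {name}")
```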
Harnessing the Power of Azure AI Foundry with AI agents, Azure AI and OpenAI: SmartWeather AI Agent

Introduction: AI is transforming how we interact with data, and one great example of this is the SmartWeather AI Agent. This AI-powered weather reporting system integrates Azure AI and Azure OpenAI's GPT-4o-mini to provide real-time weather data, sentiment analysis, and health alerts based on weather conditions. It combines weather data from OpenWeatherMap with Azure AI's natural language processing and sentiment analysis to create a seamless, personalized weather experience. Explore the code and get started with this innovative AI solution on GitHub!

What is SmartWeather AI Agent?
SmartWeather AI Agent is an AI solution that integrates multiple agents for fetching real-time weather data, analyzing sentiment, and generating health and safety alerts. It uses Azure AI and OpenAI models to provide weather insights and actionable alerts, all delivered through an interactive web interface and email notifications.

Key Technologies Behind the Project:
Azure AI & Azure OpenAI: At the core of the project, Azure AI powers sentiment analysis, while OpenAI's GPT-4o-mini handles understanding weather descriptions and generating responses.
Azure AI Foundry: Azure AI Foundry helps integrate multiple AI agents efficiently, allowing for easy deployment and management of models for different tasks like weather fetching and health advisories.
OpenWeatherMap API: Used to pull live weather data for accurate and real-time weather information.

How SmartWeather AI Agent Works:
Weather Fetching Agent: Retrieves live weather data from OpenWeatherMap.
Sentiment Analysis Agent: Analyzes the mood of weather conditions using Azure AI.
Health & Safety Alerts: Generates alerts based on weather conditions to ensure user safety.
Forecast Agent: Provides a 5-day forecast to help users plan ahead.

Benefits of AI Agents in Weather Forecasting:
Engaging User Experience: Sentiment analysis and personalized insights enhance the app's interactivity.
Actionable Insights: AI-driven alerts for health and safety keep users informed.
Real-Time Data: Accurate, up-to-date weather information powered by Azure AI and OpenAI models.

Why Use Azure AI and OpenAI?
Azure AI: Offers powerful tools for data processing and analysis, enabling seamless AI integrations.
Azure OpenAI: GPT-4o-mini enhances natural language understanding and generation for a more interactive experience.
Azure AI Foundry: Simplifies the process of deploying and managing multiple AI agents in one solution.

Conclusion: The SmartWeather AI Agent is a great example of how combining Azure AI, OpenAI, and AI Foundry can create intelligent solutions that provide deeper insights and improve user experience. By integrating real-time weather data with AI-driven sentiment analysis and health alerts, the project demonstrates the potential of AI agents in improving everyday life. Explore the code and get started with this innovative AI solution on GitHub!

Introducing the GPT-4o-Mini Audio Models: Adding More Choice to Audio-Enhanced AI Interaction
We are thrilled to announce the release of the new GPT-4o-Mini-Realtime-Preview and GPT-4o-Mini-Audio-Preview models, both now available in preview. These new models introduce advanced audio capabilities at just 25% of the cost of the GPT-4o audio models. Adding to the existing GPT-4o audio models, this expansion enhances the potential for AI applications in text- and voice-based interactions. Starting today, developers can unlock immersive, voice-driven experiences by harnessing the capabilities of all Azure OpenAI Service advanced audio models, now in public preview.

Key Benefits
Advanced Audio Capabilities: Enjoy high-quality audio interactions at a fraction of the cost of GPT-4o audio models.
Seamless Compatibility: Our new models are compatible with the existing Realtime API and Chat Completions API, ensuring smooth integration and consistent functionality across model families.
Innovative Interactions: Experience natural and intuitive voice-based capabilities, making your interactions more engaging and effective.

Detailed Features
GPT-4o-Mini-Realtime-Preview:
Real-Time Voice Interaction: Enables real-time, natural voice-based interactions for a more engaging user experience.
When to Use: Ideal for applications requiring immediate, real-time responses, such as customer service chatbots and virtual assistants.
GPT-4o-Mini-Audio-Preview:
Advanced Audio Capabilities: Provides high-quality audio interactions at a reduced cost.
When to Use: Perfect for applications requiring asynchronous audio capabilities, such as sentiment analysis of recordings and text-to-audio content creation.

Real-World Applications
The potential of our new products spans various industries, transforming how businesses operate and how users interact with technology:
Customer Service: Voice-based chatbots and virtual assistants can now handle customer inquiries more naturally and efficiently, reducing wait times and improving overall satisfaction.
Content Creation: Media producers can revolutionize their workflows by leveraging speech generation for use in video games, podcasts, and film studios.
Real-Time Translation: Industries such as healthcare and legal services can benefit from real-time audio translation, breaking down language barriers and fostering better communication in critical contexts.

Ready to get started?
Learn more about the GPT-4o-Audio-Preview model: Introducing the GPT-4o-Audio-Preview: A New Era of Audio-Enhanced AI Interaction | Microsoft Community Hub
Learn more about Azure OpenAI Service
Try it out with Azure AI Foundry

From Foundry to Fine-Tuning: Topics you Need to Know in Azure AI Services
With so many new Azure features and new ways of building applications, especially in generative AI, it can be hard to know what you need to learn and where to start with Azure AI. Whether you're a developer or an IT professional, this guide will help you understand the key features, use cases, and documentation links for each service. Let's explore how Azure AI can transform your projects and drive innovation in your organization. Stay tuned for more details!

Term | Description | Use Case | Azure Resource
Azure AI Foundry | A comprehensive platform for building, deploying, and managing AI-driven applications. | Customizing, hosting, running, and managing AI applications. | Azure AI Foundry
AI Agent | Within Azure AI Foundry, an AI agent acts as a "smart" microservice that can answer questions (RAG), perform actions, or fully automate workflows. | Automating tasks, improving efficiency, and enhancing user experiences across applications. | Link
AutoGen | An open-source framework designed for building and managing AI agents, supporting workflows with multiple agents. | Developing complex AI applications with multiple agents. | AutoGen
Multi-Agent AI | Systems where multiple AI agents collaborate to solve complex tasks. | Managing energy in smart grids, coordinating drones. | Link
Model as a Platform | A business model leveraging digital infrastructure to facilitate interactions between user groups. | Social media channels, online marketplaces, crowdsourcing websites. | Link
Azure OpenAI Service | Provides access to OpenAI's powerful language models integrated into the Azure platform. | Text generation, summarization, translation, conversational AI. | Azure OpenAI Service
Azure AI Services | A suite of APIs and services designed to add AI capabilities like image analysis, speech-to-text, and language understanding to applications. | Image analysis, speech-to-text, language understanding. | Link
Azure Machine Learning (Azure ML) | A cloud-based service for building, training, and deploying machine learning models. | Creating models to predict sales, detect fraud. | Azure Machine Learning
Azure AI Search | An AI-powered search service that enhances information retrieval and exploration. | Enterprise search, e-commerce search, knowledge mining. | Azure AI Search
Azure Bot Service | A platform for developing intelligent, enterprise-grade bots. | Creating chatbots for customer service, virtual assistants. | Azure Bot Service
Deep Learning | A subset of ML using neural networks with many layers to analyze complex data. | Image and speech recognition, natural language processing. | Link
Multimodal AI | AI that integrates and processes multiple types of data, such as text and images (covering both input and output). | Describing images, answering questions about pictures. | Azure OpenAI Service, Azure AI Services
Unimodal AI | AI that processes a single type of data, such as text or images (covering both input and output). | Writing text, recognizing objects in photos. | Azure OpenAI Service, Azure AI Services
Fine-Tuning Models | Adapting pre-trained models to specific tasks or datasets for improved performance. | Customizing models for specific industries like healthcare. | Azure AI Foundry
Model Catalog | A repository of pre-trained models available for use in AI projects. | Discovering, evaluating, fine-tuning, and deploying models. | Model Catalog
Capacity & Quotas | Limits and quotas for using Azure AI services, ensuring optimal resource allocation. | Managing resource usage and scaling AI applications. | Link
Tokens | Units of text processed by language models, affecting cost and performance. | Managing and optimizing text processing tasks. | Link
TPM (Tokens per Minute) | A measure of the rate at which tokens are processed, impacting throughput and performance. | Allocating and managing processing capacity for AI models. | Link
PTU (Provisioned Throughput Units) | Provisioned throughput lets you reserve a specific amount of throughput for a deployment. | Ensuring predictable performance for AI applications. | Link

Customize AOAI Embeddings with contrastive learning
Introduction
Embeddings generate a representation of unstructured data in a dense vector space. An embedding is a vector of floating-point numbers, such that the distance between two embeddings in the vector space correlates with the semantic similarity between the two inputs in their original format (e.g., text or image). When text is embedded, the meaning of each word is encoded so that words closer together in the vector space are expected to have similar meanings. A large number of embedding models support such text representations, and benchmarks like MTEB help compare their performance.
One of the pitfalls of embedding models is that they may not adequately represent the underlying data. This can happen in the following scenarios:
Out-of-Domain Text: If the text is highly technical or niche, and the model hasn't been trained on similar data, the resulting embeddings might not accurately capture the specialized context or jargon. For example, this is likely when new terminology is coined in science or technology.
Ambiguity: Text with ambiguous meanings can lead to embeddings that don't clearly represent any of the possible interpretations. (e.g., "getting a match for him was too difficult" could refer to sports or marriage depending on context; "getting a match for him to play was too difficult" and "getting a match for him to marry was too difficult" still give a cosine similarity of 0.77 with ada embeddings.)
Sarcasm or Irony: Detecting sarcasm or irony requires a deep understanding of context and tone, which can be challenging for AI models, leading to embeddings that take the text at face value. ("Oh, I just love getting stuck in traffic" and "traffic is really bad" may show a lower similarity score.)
Cultural Nuances: Subtle cultural references or idioms might not be well represented if the model lacks sufficient exposure to the culture in question.
Short Texts / Concepts: Very short texts, like one-word inputs, might not provide enough information for the model to generate meaningful embeddings. In some cases, short texts express concepts that the model does not pick up (e.g., "time value of money" and "money value of time" generate embeddings with cosine similarity of 0.73, whereas their expanded descriptions "increase in value of money over time due to interest earned" and "charging money for time spent on a task" have cosine similarity of 0.37).
Non-Standard Language: Text containing a lot of slang, misspellings, or grammatical errors might result in less accurate embeddings.
Rare Words: If the text contains rare words or neologisms, the embeddings might not capture their meaning accurately if those words were not present in the training data.
Context with RAG: If the text contains organization- or entity-specific content, embeddings may not be truly representative, since that content will not be part of the model's training data. Examples include abbreviations and organization-specific definitions of generic terms.

Custom Embeddings
Retrieval Augmented Generation (RAG) is an architecture that augments the capabilities of a Large Language Model (LLM) like ChatGPT by adding an information retrieval system that provides grounding data. Because embeddings are a key part of RAG for retrieving relevant content, the pitfalls above matter in practice. One way to improve these representations is to modify the embedding vector values to overcome the challenges discussed.
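Before fine-tuning anything, it can help to quantify the problem on your own phrase pairs. The snippet below is a minimal sketch rather than part of the original implementation: the endpoint, API key, API version and deployment name are placeholders, and it assumes a text-embedding-3-small deployment is available on the Azure OpenAI resource.

import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<your-api-key>",
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com/",
)

def embed(text: str, model: str = "text-embedding-3-small") -> np.ndarray:
    # return the embedding vector for a single piece of text
    return np.array(client.embeddings.create(input=[text], model=model).data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# two look-alike phrases that describe different concepts
print(cosine(embed("time value of money"), embed("money value of time")))
# a high score here suggests the base embeddings are not capturing the concept-level distinction

If pairs like these score high while their expanded descriptions score low, that is a signal the base embeddings need the kind of customization described next.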
The steps below explain an implementation of contrastive learning for fine-tuning the embeddings.
Step 1: Take sample training documents and generate chunks from them. Generate positive and negative sentences with reference to the chunks. These samples can be produced for a given corpus by sampling chunks and using an LLM to create positive and negative examples.
Step 2: Generate the embedding representations of the positive and negative examples with models that support embedding generation, such as "text-embedding-ada-002" or "text-embedding-3-small". As an alternative to generating the examples, a labelled corpus such as MS MARCO can be used.
Step 3: Train a shallow neural network that takes the embeddings of the sentence pairs as input, with a loss function based on the difference between the known similarity from the labelled data set (positive / negative sentences) and the similarity predicted by the model. Since the data is labelled, similarity should be high for positive pairs and low between positive and negative labels.
Step 4: As the sentence pairs with positive and negative examples are used to train the model, the trainable weights are nudged to reduce the loss. This yields more context-aware embeddings for a custom corpus.
Step 5: To embed an unseen chunk, generate the embedding with the embedding model and pass it through the trained model to obtain the fine-tuned, more context-aware representation (a short sketch of this query-time step appears after the comparison table further below).

Generating training data:
For labelled data, existing corpora can be used, e.g. the MS MARCO passage ranking dataset or the Stanford Natural Language Inference (SNLI) dataset. An alternative approach is to generate similar and dissimilar labelled data using GPT-4: iteratively provide a user query (along with a chunk of a document as reference if needed) and prompt the model to generate a positive and a hard-negative document in relation to that query. Prompt templates are available for this; a minimal sketch of the generation step follows the example tables below.

Example sentences with positive labels:
sentence1 | sentence2
A person on a horse jumps over a broken down airplane. | A person is outdoors, on a horse.
Children smiling and waving at camera | There are children present
A boy is jumping on skateboard in the middle of a red bridge. | The boy does a skateboarding trick.
Two blond women are hugging one another. | There are women showing affection.
A few people in a restaurant setting, one of them is drinking orange juice. | The diners are at a restaurant.

Example sentences with hard negative labels:
sentence1 | sentence2
A person on a horse jumps over a broken down airplane. | A person is at a diner, ordering an omelette.
Children smiling and waving at camera | The kids are frowning
A boy is jumping on skateboard in the middle of a red bridge. | The boy skates down the sidewalk.
An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. | A boy flips a burger.
Two blond women are hugging one another. | The women are sleeping.
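As referenced above, here is a minimal sketch of generating a positive / hard-negative pair for a query with an Azure OpenAI chat deployment. The deployment name, prompt wording and JSON contract are illustrative assumptions rather than a prescribed template; production use would add retries and validation of the model output.

import json
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<your-api-key>",
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com/",
)

def make_pair(query: str, chunk: str, deployment: str = "gpt-4o") -> dict:
    # ask the chat model for one supporting sentence and one topically similar distractor
    prompt = (
        "Given the user query and the reference passage, return JSON with two keys:\n"
        '"positive": one sentence that answers or restates the query using the passage;\n'
        '"hard_negative": one sentence that is topically similar but does NOT answer the query.\n\n'
        f"Query: {query}\nPassage: {chunk}"
    )
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return json.loads(response.choices[0].message.content)  # assumes the model returns bare JSON

pair = make_pair("leave policy for new parents", "Employees are entitled to twelve weeks of paid parental leave...")
# store (query, pair["positive"], label=1) and (query, pair["hard_negative"], label=-1) as training rows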
Implementation
Generating embedding representations for labels:

import pickle
import random
from typing import List, Tuple

import numpy as np
import pandas as pd
import torch
from openai import AzureOpenAI

default_embedding_engine = "text-embedding-3-small"  # embedding deployment name (placeholder)
embedding_cache_path = "embedding_cache.pkl"  # placeholder path for the on-disk cache
try:
    with open(embedding_cache_path, "rb") as f:
        embedding_cache = pickle.load(f)
except FileNotFoundError:
    embedding_cache = {}

def get_embedding(text: str, model="text-embedding-3-small", **kwargs) -> List[float]:
    # replace newlines, which can negatively affect performance.
    text = text.replace("\n", " ")
    client = AzureOpenAI(
        api_key="*****",
        api_version="2023-05-15",
        azure_endpoint="https://***.openai.azure.com/",
    )
    response = client.embeddings.create(input=[text], model=model, **kwargs)
    return response.data[0].embedding

# this function will get embeddings from the cache and save them there afterward
def get_embedding_with_cache(
    text: str,
    engine: str = default_embedding_engine,
    embedding_cache: dict = embedding_cache,
    embedding_cache_path: str = embedding_cache_path,
) -> list:
    if (text, engine) not in embedding_cache.keys():
        # if not in cache, call API to get embedding
        embedding_cache[(text, engine)] = get_embedding(text, engine)
        # save embeddings cache to disk after each update
        with open(embedding_cache_path, "wb") as embedding_cache_file:
            pickle.dump(embedding_cache, embedding_cache_file)
    return embedding_cache[(text, engine)]

# create column of embeddings
# (df is the labelled-pairs frame prepared in the snippet further below; load it before running this loop)
for column in ["text_1", "text_2"]:
    df[f"{column}_embedding"] = df[column].apply(get_embedding_with_cache)

We generate a trainable matrix that can be used to customize the embeddings.

def embedding_multiplied_by_matrix(
    embedding: List[float], matrix: torch.tensor
) -> np.array:
    embedding_tensor = torch.tensor(embedding).float()
    modified_embedding = embedding_tensor @ matrix
    modified_embedding = modified_embedding.detach().numpy()
    return modified_embedding

# compute custom embeddings and new cosine similarities
def apply_matrix_to_embeddings_dataframe(matrix: torch.tensor, df: pd.DataFrame):
    for column in ["text_1_embedding", "text_2_embedding"]:
        df[f"{column}_custom"] = df[column].apply(
            lambda x: embedding_multiplied_by_matrix(x, matrix)
        )
    df["cosine_similarity_custom"] = df.apply(
        lambda row: cosine_similarity(
            row["text_1_embedding_custom"], row["text_2_embedding_custom"]
        ),
        axis=1,
    )
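The snippets above and the training function below rely on three pieces that are not shown: a dataframe df of labelled pairs carrying a dataset (train/test) marker and a numeric label column, a cosine_similarity helper, and the accuracy_and_se helper called inside the training loop. A minimal sketch of all three follows; the CSV path, the 70/30 split and the threshold sweep are assumptions, not the original notebook's exact code.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def cosine_similarity(a, b) -> float:
    # plain cosine similarity between two vectors (lists or numpy arrays)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def accuracy_and_se(similarity_scores: pd.Series, labeled_similarity: pd.Series):
    # accuracy of the best similarity threshold for separating positive (1) from
    # negative (-1) pairs, plus the standard error of that accuracy
    best_accuracy = 0.0
    for threshold in np.arange(-1.0, 1.0, 0.01):
        predictions = np.where(similarity_scores > threshold, 1, -1)
        accuracy = float((predictions == labeled_similarity.values).mean())
        best_accuracy = max(best_accuracy, accuracy)
    n = len(labeled_similarity)
    standard_error = (best_accuracy * (1 - best_accuracy) / n) ** 0.5
    return best_accuracy, standard_error

# labelled pairs: columns text_1, text_2, label (1 = similar, -1 = dissimilar); path is a placeholder
df = pd.read_csv("labelled_pairs.csv")
train_df, test_df = train_test_split(df, test_size=0.3, stratify=df["label"], random_state=42)
df = pd.concat(
    [train_df.assign(dataset="train"), test_df.assign(dataset="test")], ignore_index=True
)

This preparation has to run before the "create column of embeddings" loop above, so that the embedding columns exist by the time training starts.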
def optimize_matrix(
    modified_embedding_length: int = 2048,  # bigger was better in brief experimentation (2048 is the length of the babbage encoding)
    batch_size: int = 100,
    max_epochs: int = 100,
    learning_rate: float = 100.0,  # seemed to work best when similar to batch size - feel free to try a range of values
    dropout_fraction: float = 0.0,  # dropout helped by a couple of percentage points in testing (not strictly necessary)
    df: pd.DataFrame = df,
    print_progress: bool = True,
    save_results: bool = True,
) -> torch.tensor:
    """Return matrix optimized to minimize loss on training data."""
    run_id = random.randint(0, 2**31 - 1)  # (range is arbitrary)

    # convert from dataframe to torch tensors
    # e is for embedding, s for similarity label
    def tensors_from_dataframe(
        df: pd.DataFrame,
        embedding_column_1: str,
        embedding_column_2: str,
        similarity_label_column: str,
    ) -> Tuple[torch.tensor]:
        e1 = np.stack(np.array(df[embedding_column_1].values))
        e2 = np.stack(np.array(df[embedding_column_2].values))
        s = np.stack(np.array(df[similarity_label_column].astype("float").values))

        e1 = torch.from_numpy(e1).float()
        e2 = torch.from_numpy(e2).float()
        s = torch.from_numpy(s).float()

        return e1, e2, s

    e1_train, e2_train, s_train = tensors_from_dataframe(
        df[df["dataset"] == "train"], "text_1_embedding", "text_2_embedding", "label"
    )
    e1_test, e2_test, s_test = tensors_from_dataframe(
        df[df["dataset"] == "test"], "text_1_embedding", "text_2_embedding", "label"
    )

    # create dataset and loader
    dataset = torch.utils.data.TensorDataset(e1_train, e2_train, s_train)
    train_loader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, shuffle=True
    )

    # define model (similarity of projected embeddings)
    def model(embedding_1, embedding_2, matrix, dropout_fraction=dropout_fraction):
        e1 = torch.nn.functional.dropout(embedding_1, p=dropout_fraction)
        e2 = torch.nn.functional.dropout(embedding_2, p=dropout_fraction)
        modified_embedding_1 = e1 @ matrix  # @ is matrix multiplication
        modified_embedding_2 = e2 @ matrix
        similarity = torch.nn.functional.cosine_similarity(
            modified_embedding_1, modified_embedding_2
        )
        return similarity

    # define loss function to minimize
    def mse_loss(predictions, targets):
        difference = predictions - targets
        return torch.sum(difference * difference) / difference.numel()

    # initialize projection matrix
    embedding_length = len(df["text_1_embedding"].values[0])
    matrix = torch.randn(
        embedding_length, modified_embedding_length, requires_grad=True
    )

    epochs, types, losses, accuracies, matrices = [], [], [], [], []
    for epoch in range(1, 1 + max_epochs):
        # iterate through the training dataloader
        for a, b, actual_similarity in train_loader:
            # generate prediction
            predicted_similarity = model(a, b, matrix)
            # get loss and perform backpropagation
            loss = mse_loss(predicted_similarity, actual_similarity)
            loss.backward()
            # update the weights
            with torch.no_grad():
                matrix -= matrix.grad * learning_rate
                # set gradients to zero
                matrix.grad.zero_()
        # calculate test loss
        test_predictions = model(e1_test, e2_test, matrix)
        test_loss = mse_loss(test_predictions, s_test)

        # compute custom embeddings and new cosine similarities
        apply_matrix_to_embeddings_dataframe(matrix, df)

        # calculate accuracy on the train and test splits
        for dataset in ["train", "test"]:
            data = df[df["dataset"] == dataset]
            a, se = accuracy_and_se(data["cosine_similarity_custom"], data["label"])

            # record results of each epoch
            epochs.append(epoch)
            types.append(dataset)
            losses.append(loss.item() if dataset == "train" else test_loss.item())
            accuracies.append(a)
            matrices.append(matrix.detach().numpy())

            # optionally print accuracies
            if print_progress is True:
                print(
                    f"Epoch {epoch}/{max_epochs}: {dataset} accuracy: {a:0.1%} ± {1.96 * se:0.1%}"
                )

    data = pd.DataFrame(
        {"epoch": epochs, "type": types, "loss": losses, "accuracy": accuracies}
    )
    data["run_id"] = run_id
    data["modified_embedding_length"] = modified_embedding_length
    data["batch_size"] = batch_size
    data["max_epochs"] = max_epochs
    data["learning_rate"] = learning_rate
    data["dropout_fraction"] = dropout_fraction
    data["matrix"] = matrices  # saving every single matrix can get big; feel free to delete/change
    if save_results is True:
        data.to_csv(f"{run_id}_optimization_results.csv", index=False)

    return data

Comparing the distributions of cosine similarity computed from the base embeddings and from the customized embeddings on a sample dataset shows that the customized embeddings separate the pairs better. The sample rows below show reduced similarity for dissimilar pairs (label -1) relative to the default embeddings, indicating that the custom embeddings differentiate better.

text_1 | text_2 | label | cosine_similarity | cosine_similarity_custom
The man plays guitar | someone is playing an instrument | -1 | 0.58 | 0.52
Lady wearing a yellow top is sitting on a chair | a woman on a yellow shirt is on the floor. | -1 | 0.52 | 0.486
Children playing a game. | The guys are playing a game. | -1 | 0.52 | 0.45
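Step 5 of the approach applies the learned projection to unseen text at query time. The sketch below is illustrative rather than prescriptive: the hyperparameter values are placeholders, and it simply picks the matrix from the epoch with the best test accuracy out of optimize_matrix's results.

# train and keep the projection from the best-performing epoch on the test split
results = optimize_matrix(
    batch_size=100, learning_rate=100.0, max_epochs=30, dropout_fraction=0.2, df=df, save_results=False
)
best_matrix = torch.tensor(
    results[results["type"] == "test"].sort_values("accuracy", ascending=False)["matrix"].iloc[0]
).float()

def customized_embedding(text: str):
    raw = get_embedding_with_cache(text)                     # base model embedding
    return embedding_multiplied_by_matrix(raw, best_matrix)  # project into the tuned space

# score an unseen query against a chunk in the customized space
print(cosine_similarity(
    customized_embedding("time value of money"),
    customized_embedding("increase in value of money over time due to interest earned"),
))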
LlamaIndex Implementation:
LlamaIndex offers a simplified pipeline with four steps: corpus generation, synthetic query generation, embedding fine-tuning, and evaluation of the results.

# Generate synthetic queries
import os

from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
from llama_index.llms.openai import OpenAI

OPENAI_API_TOKEN = "sk-"
os.environ["OPENAI_API_KEY"] = OPENAI_API_TOKEN

# train_nodes / val_nodes: node lists parsed from your training and validation documents
train_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"), nodes=train_nodes
)
val_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"), nodes=val_nodes
)

train_dataset.save_json("train_dataset.json")
val_dataset.save_json("val_dataset.json")

# Run embedding fine-tuning
from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",
    model_output_path="test_model",
    val_dataset=val_dataset,
)
finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model()

Sentence Transformers (SBERT):
Augmented SBERT provides the following approach to extend sentence-transformer models to custom datasets that have no positive / negative pair annotations (a sketch of this recipe follows the references below):
1. Train a cross-encoder (BERT) from scratch over a source dataset, for example the STS benchmark dataset.
2. Use this cross-encoder to label your target dataset, i.e. the unlabeled sentence pairs.
3. Finally, train a bi-encoder (SBERT) on the labeled target dataset.

References:
openai-cookbook/examples/fine-tuned_qa/ft_retrieval_augmented_generation_qdrant.ipynb at main · openai/openai-cookbook (github.com)
openai-cookbook/examples/Customizing_embeddings.ipynb at main · openai/openai-cookbook (github.com)
Custom Embeddings - LlamaIndex 🦙 0.9.24
run-llama/finetune-embedding: Fine-Tuning Embedding for RAG with Synthetic Data (github.com)
Fine-Tuning Embeddings for RAG with Synthetic Data | by Jerry Liu | LlamaIndex Blog
Augmented SBERT — Sentence Transformers documentation
FlagEmbedding/examples/finetune at master · FlagOpen/FlagEmbedding (github.com)
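As referenced in the Sentence Transformers section above, here is a minimal sketch of the Augmented SBERT recipe using the sentence-transformers library. The model names, example pairs, batch size and epoch count are placeholder choices, and step 1 is shown with a pretrained cross-encoder standing in for one trained from scratch; see the Augmented SBERT documentation in the references for the full procedure.

from sentence_transformers import InputExample, SentenceTransformer, losses
from sentence_transformers.cross_encoder import CrossEncoder
from torch.utils.data import DataLoader

# 1) cross-encoder trained on a labelled source dataset such as the STS benchmark
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")

# 2) silver-label unlabelled in-domain sentence pairs with the cross-encoder
unlabelled_pairs = [
    ("time value of money", "increase in value of money over time due to interest earned"),
    ("time value of money", "charging money for time spent on a task"),
]
scores = cross_encoder.predict(unlabelled_pairs)

# 3) train a bi-encoder (SBERT) on the silver-labelled pairs
train_examples = [
    InputExample(texts=[a, b], label=float(score))
    for (a, b), score in zip(unlabelled_pairs, scores)
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_loss = losses.CosineSimilarityLoss(bi_encoder)
bi_encoder.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=10)
bi_encoder.save("custom-domain-sbert")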