azure ai vision

39 Topics

Arizona Department of Transportation Innovates with Azure AI Vision
The Arizona Department of Transportation (ADOT) is committed to providing safe and efficient transportation services to the residents of Arizona. With a focus on innovation and customer service, ADOT’s Motor Vehicle Division (MVD) continually seeks new ways to enhance its services and improve the overall experience for its residents. The challenge ADOT MVD had a tough challenge to ensure the security and authenticity of transactions, especially those involving sensitive information. Every day, the department needs to verify thousands of customers seeking to use its online services to perform activities like updating customer information including addresses, renewing vehicle registrations, ordering replacement driver licenses, and ordering driver and vehicle records. Traditional methods of identity verification, such as manual checks and physical presence, were not only time-consuming and error-prone, but didn’t provide any confidence that the department was dealing with the right customer in remote interactions, such as online using its web portal. With high daily demand and stringent security requirements, the department recognized the need to enhance its digital presence and improve customer engagement. Facial verification technology has been a longstanding method for verifying a user's identity on-device and online account login for its convenience and efficiency. However, challenges are increasing as malicious actors persist in their attempts to manipulate and deceive the system through various spoofing techniques. The solution To address these challenges, the ADOT turned to Azure AI Vision Face API (also known as Azure Face Service), with Liveness Detection. This technology leverages advanced machine learning algorithms to verify the identity of individuals in real time. The Liveness Detection feature aims to verify that the system engages with a physically present, living individual during the verification process. This is achieved by differentiating between a real (live) and fake (spoof) representation which may include photographs, videos, masks, or other means to mimic a real person. By using facial verification and liveness detection, the system can determine whether the person in front of the camera is a live human being and not a photograph or a video. This cutting-edge technology has transformed the way the department operates to make it more efficient, secure, and reliable. Implementation and collaboration The department worked closely with Microsoft's team to ensure a seamless integration of the technology. "We were extremely excited to partner with Microsoft to use their passive liveness verification and facial verification all in one step," said Grant Hawkes, a contracted partner with the department’s Motor Vehicle Modernization (MvM) Project and its Lead Foundation Architect. "The Microsoft engineers were super receptive and super helpful. They would actually tweak the software a little bit for our use case, making our lives much easier. We have this wonderful working relationship with Microsoft, and they were extremely open with us, extremely receptive to ideas and whatever else it took. And we've only seen the ease of use get better and better and better.” Key benefits ADOT MVD has realized numerous benefits from the adoption of Azure AI Vision face liveness and verification functionality: Enhanced security—The technology has helped to reduce the risk of identity theft and fraud by enabling the verification of identities in real time, so the department can ensure that only authorized individuals can access sensitive information and complete transactions. Improved efficiency—By streamlining the verification process, the time required for identity checks has been reduced. In addition, the department is now able to offer some services online that were previously only able to be done in office, such as driver license renewals and title transfers. Accessibility—The technology has made the process easier for individuals with disabilities and the elderly to complete transactions, as they no longer have to make their way to an office for certain services. In this way, it's more inclusive and user-friendly. Cost-effective—The Azure AI Vision face technology works seamlessly across different devices, including laptops and smartphones, without requiring expensive hardware, and fits into ADOT’s existing budget. Verifying mobile driver's licenses (mDLs) is one of the most significant applications of this technology. Arizona was one of the first states to offer ISO 18013-5 compliant mDLs, allowing residents to store their driver's licenses on their mobile devices, making it more convenient and secure. Another notable application is electronic transfer of vehicle titles. Residents can now transfer vehicle titles electronically, eliminating the need for physical presence and paperwork. This will make the process much easier for citizens, while also making it more efficient and secure, reducing the risk of fraud. On-demand authentication ADOT MVD has also developed an innovative solution called on-demand authentication (ODA). This allows residents to verify their identity remotely using their mobile devices. When a resident calls ADOT MVD’s call center, they receive a text message with a link to verify their identity. The system uses Azure AI Vision to perform facial verification and liveness detection, ensuring that the person on the other end of the call is who they claim to be. "This technology has been key in mitigating fraud by increasing our confidence that we're working with the right person," said Grant Hawkes. "The whole process takes maybe a few seconds and is user-friendly for both the call center representative and the customer." Future plans The success of Azure AI Vision has prompted ADOT to explore further applications, and other state agencies are now looking at adopting the technology as well. "We see this growing and growing," said Grant Hawkes. "We're working to roll this technology out to more and more departments within the state as part of a unified identity solution. We see the value in this technology and what can be done with it." The ADOT’s adoption of Azure AI Vision Face liveness and verification functionality has transformed the way the department operates. By enhancing security, improving efficiency, and making services more accessible, the technology has brought significant benefits to both the department and the residents of Arizona. As the department continues to innovate and expand the use of this technology, it sets a benchmark for other states and organizations to follow. Our commitment to Trustworthy AI Organizations across industries are leveraging Azure AI and Copilot capabilities to drive growth, increase productivity, and create value-added experiences. We’re committed to helping organizations use and build AI that is trustworthy, meaning it is secure, private, and safe. We bring best practices and learnings from decades of researching and building AI products at scale to provide industry-leading commitments and capabilities that span our three pillars of security, privacy, and safety. Trustworthy AI is only possible when you combine our commitments, such as our Secure Future Initiative and our Responsible AI principles, with our product capabilities to unlock AI transformation with confidence. Get started: Learn more about Azure AI Vision. Learn more about Face Liveness Detection, a milestone in identity verification. See how face detection works. Try it now. Read about Enhancing Azure AI Vision Face API with Liveness Detection. Learn how Microsoft empowers responsible AI practices.
ronakchokshi
Apr 18, 2025 Place Microsoft Foundry Blog
587Views
6likes
1Comment
Real Time, Real You: Announcing General Availability of Face Liveness Detection
A Milestone in Identity Verification We are excited to announce the general availability of our face liveness detection features, a key milestone in making identity verification both seamless and secure. As deepfake technology and sophisticated spoofing attacks continue to evolve, organizations need solutions that can verify the authenticity of an individual in real time. During the preview, we listened to customer feedback, expanded capabilities, and made significant improvements to ensure that liveness detection works across three platforms and for common use cases. What’s New Since the Preview? During the preview, we introduced several features that laid the foundation for secure and seamless identity verification, including active challenge in JavaScript library. Building on that foundation, there are improvements across the board. Here’s what’s new: Feature Parity Across Platforms: Liveness detection’s active challenge is now available on both Android and iOS platforms, achieving full feature parity across all supported devices. This allows a consistent and seamless experience for both developers and end users on all three supported platforms. Easy integration: The liveness detection client SDK now requires only a single function call to start the entire flow, making it easier for developers to integrate. The SDK also includes an integrated UI flow to simplify implementation, allowing a seamless developer experience across platforms. Runtime environment safety: The liveness detection client SDK integrated safety check for untrustworthy runtime environment on both iOS and Android devices. Accuracy and Usability Improvements: We’ve delivered numerous bug fixes and enhancements to improve detection accuracy and user experience across all supported platforms. Our solution is now faster, more intuitive, and more resilient against even the most advanced spoofing techniques. These advancements help that businesses integrate liveness detection with confidence, providing both security and convenience. Security in Focus: Microsoft’s Commitment to Innovation As identity verification threats continue to evolve, general availability is the start of the journey. Microsoft is dedicated to advancing our face liveness detection technology to address evolving security challenges: Continuous Support and Innovation: Our team is actively monitoring emerging spoofing techniques. With ongoing updates and enhancements, we ensure that our liveness detection solution adapts to new challenges. Learn more about liveness detection updates. Security and Privacy by Design: Microsoft’s principles of security and privacy are built into every step. We provide robust support to assist customers in integrating and maintaining these solutions effectively. We process the data securely, respecting user privacy and complying with global regulations. By collaborating closely with our customers, we ensure that together, we build solutions that are not only innovative but also secure. Learn more about shared responsibility in liveness solutions We provide reliable, long-term solutions to help organizations stay ahead of threats. Get Start Today We’re excited for customers to experience the benefits of real-time liveness detection. Whether you’re safeguarding financial transactions, streamlining digital onboarding, or enabling secure logins, our solution can strengthen your security. Explore: Learn more about integrating liveness detection into your applications by this tutorial. Try it Out: Liveness detection is available to experience in Vision Studio Build with Confidence: Empower your organization with secure, real-time identity verification. Try our sample code to see how easy it is to get started: Azure-Samples/azure-ai-vision-sdk A Step Toward a Safer Future With a focus on real-time, reliable identity verification, we’re making identity verification smarter, faster, and safer. As we continue to improve and evolve this solution, our goal remains the same: to protect identities, build trust, and verify that the person behind the screen is really you. Start building with liveness detection today and join us on this journey toward a more secure digital world.
Jinyu_Li_2005
Jan 30, 2025 Place Microsoft Foundry Blog
1KViews
6likes
0Comments
Explore Azure AI Services: Curated list of prebuilt models and demos
Unlock the potential of AI with Azure's comprehensive suite of prebuilt models and demos. Whether you're looking to enhance speech recognition, analyze text, or process images and documents, Azure AI services offer ready-to-use solutions that make implementation effortless. Explore the diverse range of use cases and discover how these powerful tools can seamlessly integrate into your projects. Dive into the full catalogue of demos and start building smarter, AI-driven applications today.
RasikaSavant
Sep 03, 2024 Place Microsoft Foundry Blog
10KViews
5likes
1Comment
Phi-3 Vision – Catalyzing Multimodal Innovation
Microsoft's Phi-3 Vision is a new AI model that combines text and image data to deliver smart and efficient solutions. With just 4.2 billion parameters, it offers high performance and can run on devices with limited computing power. From describing images to analyzing documents, Phi-3 Vision is designed to make advanced AI accessible and practical for everyday use. Explore how this model is set to change the way we interact with AI, offering powerful capabilities in a small and efficient package.
shagrawal
Jun 18, 2024 Place Microsoft Foundry Blog
34KViews
5likes
2Comments
Announcing the General Availability of GPT-4 Turbo with Vision on Azure OpenAI Service
We are excited to announce the general availability (GA) of GPT-4 Turbo with Vision on the Azure OpenAI Service. The GA model, gpt-4-turbo-2024-04-09, is a multimodal model capable of processing both text and image inputs to generate text outputs. This model replaces the following preview models: gpt-4-1106-preview gpt-4-0125-preview gpt-4-vision-preview Our customers and partners have been utilizing GPT-4 Turbo with Vision to create new processes, enhance efficiencies, and innovate within their businesses. Applications range from retailers improving the online shopping experience, to media and entertainment companies enriching digital asset management, and various organizations deriving insights from charts and diagrams. We will be showcasing detailed case studies from these applications at the upcoming Build conference. Existing Azure OpenAI Service customers can now deploy gpt-4-turbo-2024-04-09 in Sweden Central and East US 2. For more information, please visit our model availability page. Guide to Deploying GPT-4 Turbo with Vision GA To deploy this GA model from the Studio UI, select "GPT-4" and then choose the "turbo-2024-04-09" version from the dropdown menu. The default quota for the gpt-4-turbo-2024-04-09 model will be the same as current quota for GPT-4-Turbo. See the regional quota limits. Upgrade Path from Preview to GA Models We are targeting the upgrade of deployments that utilize any of the three preview models (gpt-4-1106-preview, gpt-4-0125-preview, and gpt-4-vision-preview) and are configured for auto-update on the Azure OpenAI Service. These deployments will be upgraded to gpt-4-turbo-2024-04-09 starting on June 10th or later. We will notify all customers with these preview deployments at least two weeks before the start of the upgrades. We will publish an upgrade schedule detailing the order of regions and model versions that we will follow during the upgrades in our public documentation. Upcoming Features for image (vision) inputs: JSON Mode and Function Calling JSON mode and function calling for inference requests involving image (vision) inputs will be available in GA in the coming weeks. Please note that text-based inputs will continue to support both JSON mode and function calling. Changes to GPT-4 Vision Enhancements Enhancements such as Optical Character Recognition (OCR), object grounding, video prompts, and "Azure OpenAI Service on your data with images", that were integrated with the gpt-4-vision-preview model will not be available with the GA model. We are dedicated to enhancing our products to provide value to our customers, and are actively exploring how to best integrate these features into future offerings. To Get Started, Explore the Following Resources Learn more about What's new in Azure OpenAI Service? Learn more about GPT-4 Turbo with Vision on Azure OpenAI Service Azure Open AI Quickstart for GPT-4 Turbo with Vision Azure Open AI How-To Guide: How to use the GPT-4 Turbo with Vision model on Azure OpenAI Service GPT-4 Turbo with Vision pricing explained in detail: Text and Image tokens Apply now for access to Azure OpenAI Service  If you are a current Azure OpenAI customer and would like to add additional use cases, fill out the Azure OpenAI Additional Use Case form. Responsible AI: Transparency Note for Azure OpenAI Service
Fisayo_Feyisetan
May 02, 2024 Place Microsoft Foundry Blog
48KViews
5likes
2Comments
Intelligent Load Balancing with APIM for OpenAI: Weight-Based Routing
Weightage: There is no direct feature capablities in APIM for weightage based routing.I have tried achieve same results using custom logic with APIM policies Selection Process: Backend logic used in this policy is based on weighted selection method to choose an endpoint route for retry.endpoint with higher weights are more likely to be chosen, but each endpoints route has at least some chance of being selected. This is because the selection is based on a random number that is compared against cumulative weights, which means the selection process inherently favors routes with higher weights due to the way cumulative weights are calculated and utilized
OsamaSheikh
Apr 16, 2024 Place Microsoft Foundry Blog
13KViews
5likes
0Comments
Image Analysis OCR for Data and Content Compliance
In today’s digital world, organizations have the challenge and responsibility of ensuring a safe and secure online environment for their users, employees, and partners. The increasing volume of images and videos shared on various social and communication channels necessitates robust data and content safety and compliance measures. Optical Character Recognition (OCR) technology extracts text from images and scanned documents to make it machine-readable. This allows computers to read images’ textual content and determine the location of the text within the image. Content Compliance with Azure Image Analysis OCR With the growing multimodal capabilities of large language models (LLMs), the extraction of textual insights from images is becoming increasingly essential. Organizations must ensure that the extracted image content is both harmless and compliant. Text extraction from images with OCR facilitates identification of images containing harmful content such as profanity and hate speech. OCR-extracted text is passed to content moderation systems to classify and filter images with harmful text. The content moderation processes can be custom pipelines or leverage text and multimodal content moderation APIs offered by Azure AI Content Safety. Moderation strategies and policies should be tailored to align with the organization’s unique goals and user needs. Some organizations use OCR to moderate content on images before the images are uploaded to LLM APIs such as GPT-4V turbo with vision. The text extracted from images is processed in-house moderation systems to get ratings on different safety categories. This prevents inappropriate and malicious text input from reaching the Large Language Model. This optimizes LLM API spending while also protecting LLMs from potentially malicious user activity. Data Loss Prevention and Compliance with Image Analysis OCR OCR can also be used to help identify and protect sensitive information in images. Private data such as health records, financial information, and Personally Identifiable Information (PII) like names, social security numbers, and addresses embedded in images can be detected with the help of OCR. After OCR has extracted text from an image, the extracted text is passed to a sensitive data detector such as Azure AI PII Detection. The detector identifies and categorizes any sensitive information present in the text. This enables redaction or masking to prevent unauthorized access or sharing. Here's a code sample on how this can be accomplished with Azure AI OCR and PII Detection. Utilizing OCR for the detection of sensitive data in images, which might otherwise have gone unnoticed, ensures adherence to privacy laws and industry standards for handling private information. This helps organizations build trust with users and customers while mitigating the risk of compliance violations. Example Customers Microsoft Purview Communication Compliance uses Azure AI Image Analysis OCR to extract text from images shared in Teams chat or Exchange online emails before running the text through a compliance pipeline. This process helps surface inappropriate content and sensitive information to compliance administrators for further action. Previously risky content embedded in images was not able to be detected and was a blind spot for compliance administrators. Microsoft News also leverages Azure AI Image Analysis OCR to ensure that images embedded in news articles do not contain any inappropriate content. Get Started with Azure AI Services Vision OCR Moderate content and protect sensitive text information embedded in images using Azure AI OCR. Get started by following this code sample or our QuickStart guide. You can also use Vision Studio for a no-code try-out experience. Use Azure Document Intelligence if you are interested in OCR for documents.
dybe
Mar 21, 2024 Place Microsoft Foundry Blog
3.8KViews
5likes
0Comments
AI Automation in Azure Foundry through turnkey MCP Integration and Computer Use Agent Models
The Fashion Trends Discovery Scenario In this walkthrough, we'll explore a sample application that demonstrates the power of combining Computer Use (CUA) models with Playwright browser automation to autonomously compile trend information from the internet, while leveraging MCP integration to intelligently catalog and store insights in Azure Blob Storage. The User Experience A fashion analyst simply provides a query like "latest trends in sustainable fashion" to our command-line interface. What happens next showcases the power of agentic AI—the system requires no further human intervention to: Autonomous Web Navigation: The agent launches Pinterest, intelligently locates search interfaces, and performs targeted queries Intelligent Content Discovery: Systematically identifies and interacts with trend images, navigating to detailed pages Advanced Content Analysis: Applies computer vision to analyze fashion elements, colors, patterns, and design trends Intelligent Compilation: Consolidates findings into comprehensive, professionally formatted markdown reports Contextual Storage: Recognizes the value of preserving insights and autonomously offers cloud storage options Technical capabilities leveraged Behind this seamless experience lies a coordination of AI models: Pinterest Navigation: The CUA model visually understands Pinterest's interface layout, identifying search boxes and navigation elements with pixel-perfect precision Search Results Processing: Rather than relying on traditional DOM parsing, our agent uses visual understanding to identify trend images and calculate precise interaction coordinates Content Analysis: Each discovered trend undergoes detailed analysis using GPT-4o's advanced vision capabilities, extracting insights about fashion elements, seasonal trends, and style patterns Autonomous Decision Making: The agent contextually understands when information should be preserved and automatically engages with cloud storage systems Technology Stack Overview At the heart of this solution lies an orchestration of several AI technologies, each serving a specific purpose in creating a truly autonomous agent. The architecture used ``` ┌─────────────────────────────────────────────────────────────────┐ │ Azure AI Foundry │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ Responses API │ │ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │ │ │ │ │ CUA Model │ │ GPT-4o │ │ Built-in MCP │ │ │ │ │ │ (Interface) │ │ (Content) │ │ Client │ │ │ │ │ └─────────────┘ └─────────────┘ └─────────────────┘ │ │ │ └─────────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────┐ │ Function Calling Layer │ │ (Workflow Orchestration) │ └─────────────────────────────────────────┘ │ ▼ ┌─────────────────┐ ┌──────────────────┐ │ Playwright │◄──────────────► │ Trends Compiler │ │ Automation │ │ Engine │ └─────────────────┘ └──────────────────┘ │ ▼ ┌─────────────────────┐ │ Azure Blob │ │ Storage (MCP) │ └─────────────────────┘ ``` Azure OpenAI Responses API At the core of the agentic architecture in this solution, the Responses API provides intelligent decision-making capabilities that determine when to invoke Computer Use models for web crawling versus when to engage MCP servers for data persistence. This API serves as the brain of our agent, contextually understanding user intent and autonomously choosing the appropriate tools to fulfill complex multi-step workflows. Computer Use (CUA) Model Our specialized CUA model excels at visual understanding of web interfaces, providing precise coordinate mapping for browser interactions, layout analysis, and navigation planning. Unlike general-purpose language models, the CUA model is specifically trained to understand web page structures, identify interactive elements, and provide actionable coordinates for automated browser control. Playwright Browser Automation Acting as the hands of our agent, Playwright executes the precise actions determined by the CUA model. This robust automation framework translates AI insights into real-world browser interactions, handling everything from clicking and typing to screenshot capture and page navigation with pixel-perfect accuracy. GPT-4o Vision Model for Content Analysis While the CUA model handles interface understanding, GPT-4o provides domain-specific content reasoning. This powerful vision model analyzes fashion trends, extracts meaningful insights from images, and provides rich semantic understanding of visual content—capabilities that complement rather than overlap with the CUA model's interface-focused expertise. Model Context Protocol (MCP) Integration The application showcases the power of agentic AI through its autonomous decision-making around data persistence. The agent intelligently recognizes when compiled information needs to be stored and automatically engages with Azure Blob Storage through MCP integration, without requiring explicit user instruction for each storage operation. Unlike traditional function calling patterns where custom applications must relay MCP calls through client libraries, the Responses API includes a built-in MCP client that directly communicates with MCP servers. This eliminates the need for complex relay logic, making MCP integration as simple as defining tool configurations. Function Calling Orchestration Function calling orchestrates the complex workflow between CUA model insights and Playwright actions. Each step is verified and validated before proceeding, ensuring robust autonomous operation without human intervention throughout the entire trend discovery and analysis process. Let me walk you through the code used in the Application. Agentic Decision Making in Action Let's examine how our application demonstrates true agentic behavior through the main orchestrator in `app.py`: async def main() -> str: """Main entry point demonstrating agentic decision making.""" conversation_history = [] generated_reports = [] while True: user_query = input("Enter your query for fashion trends:-> ") # Add user input to conversation context new_user_message = { "role": "user", "content": [{"type": "input_text", "text": user_query}], } conversation_history.append(new_user_message) # The agent analyzes context and decides on appropriate actions response = ai_client.create_app_response( instructions=instructions, conversation_history=conversation_history, mcp_server_url=config.mcp_server_url, available_functions=available_functions, ) # Process autonomous function calls and MCP tool invocations for output in response.output: if output.type == "function_call": # Agent decides to compile trends function_to_call = available_functions[output.name] function_args = json.loads(output.arguments) function_response = await function_to_call(**function_args) elif output.type == "mcp_tool_call": # Agent decides to use MCP tools for storage print(f"MCP tool call: {output.name}") # MCP calls handled automatically by Responses API Key Agentic Behaviors Demonstrated: Contextual Analysis: The agent examines conversation history to understand whether the user wants trend compilation or storage operations Autonomous Tool Selection: Based on context, the agent chooses between function calls (for trend compilation) and MCP tools (for storage) State Management: The agent maintains conversation context across multiple interactions, enabling sophisticated multi-turn workflows Function Calling Orchestration: Autonomous Web Intelligence The `TrendsCompiler` class in `compiler.py` demonstrates sophisticated autonomous workflow orchestration: class TrendsCompiler: """Autonomous trends compilation with multi-step verification.""" async def compile_trends(self, user_query: str) -> str: """Main orchestration loop with autonomous step progression.""" async with LocalPlaywrightComputer() as computer: state = {"trends_compiled": False} step = 0 while not state["trends_compiled"]: try: if step == 0: # Step 1: Autonomous Pinterest navigation await self._launch_pinterest(computer) step += 1 elif step == 1: # Step 2: CUA-driven search and coordinate extraction coordinates = await self._search_and_get_coordinates( computer, user_query ) if coordinates: step += 1 elif step == 2: # Step 3: Autonomous content analysis and compilation await self._process_image_results( computer, coordinates, user_query ) markdown_report = await self._generate_markdown_report( user_query ) state["trends_compiled"] = True except Exception as e: print(f"Autonomous error handling in step {step}: {e}") state["trends_compiled"] = True return markdown_report Autonomous Operation Highlights: Self-Verifying Steps: Each step validates completion before advancing Error Recovery: Autonomous error handling without human intervention State-Driven Progression: The agent maintains its own execution state No User Prompts: Complete automation from query to final report Pinterest's Unique Challenge: Visual Coordinate Intelligence One of the most impressive demonstrations of CUA model capabilities lies in solving Pinterest's hidden URL challenge: async def _detect_search_results(self, computer) -> List[Tuple[int, int, int, int]]: """Use CUA model to extract image coordinates from search results.""" # Take screenshot for CUA analysis screenshot_bytes = await computer.screenshot() screenshot_b64 = base64.b64encode(screenshot_bytes).decode() # CUA model analyzes visual layout and identifies image boundaries prompt = """ Analyze this Pinterest search results page and identify all trend/fashion images displayed. For each image, provide the exact bounding box coordinates in the format: <click>x1,y1,x2,y2</click> Focus on the main content images, not navigation or advertisement elements. """ response = await self.ai_client.create_cua_response( prompt=prompt, screenshot_b64=screenshot_b64 ) # Extract coordinates using specialized parser coordinates = self.coordinate_parser.extract_coordinates(response.content) print(f"CUA model identified {len(coordinates)} image regions") return coordinates The Coordinate Calculation: def calculate_centers(self, coordinates: List[Tuple[int, int, int, int]]) -> List[Tuple[int, int]]: """Calculate center coordinates for precise clicking.""" centers = [] for x1, y1, x2, y2 in coordinates: center_x = (x1 + x2) // 2 center_y = (y1 + y2) // 2 centers.append((center_x, center_y)) return centers key take aways with this approach: No DOM Dependency: Pinterest's hover-based URL revelation becomes irrelevant Visual Understanding: The CUA model sees what humans see—image boundaries Pixel-Perfect Targeting: Calculated center coordinates ensure reliable clicking Robust Navigation: Works regardless of Pinterest's frontend implementation changes Model Specialization: The Right AI for the Right Job Our solution demonstrates sophisticated AI model specialization: async def _analyze_trend_page(self, computer, user_query: str) -> Dict[str, Any]: """Use GPT-4o for domain-specific content analysis.""" # Capture the detailed trend page screenshot_bytes = await computer.screenshot() screenshot_b64 = base64.b64encode(screenshot_bytes).decode() # GPT-4o analyzes fashion content semantically analysis_prompt = f""" Analyze this fashion trend page for the query: "{user_query}" Provide detailed analysis of: 1. Fashion elements and style characteristics 2. Color palettes and patterns 3. Seasonal relevance and trend timing 4. Target demographics and style categories 5. Design inspiration and cultural influences Format as structured markdown with clear sections. """ # Note: Using GPT-4o instead of CUA model for content reasoning response = await self.ai_client.create_vision_response( model=self.config.vision_model_name, # GPT-4o prompt=analysis_prompt, screenshot_b64=screenshot_b64 ) return { "analysis": response.content, "timestamp": datetime.now().isoformat(), "query_context": user_query } Model Selection Rationale: CUA Model: Perfect for understanding "Where to click" and "How to navigate" GPT-4o: Excels at "What does this mean" and "How is this relevant" Specialized Strengths: Each model operates in its domain of expertise Complementary Intelligence: Combined capabilities exceed individual model limitations Compilation and Consolidation async def _generate_markdown_report(self, user_query: str) -> str: """Consolidate all analyses into comprehensive markdown report.""" if not self.image_analyses: return "No trend data collected for analysis." # Intelligent report structuring report_sections = [ f"# Fashion Trends Analysis: {user_query}", f"*Generated on {datetime.now().strftime('%B %d, %Y')}*", "", "## Executive Summary", await self._generate_executive_summary(), "", "## Detailed Trend Analysis" ] # Process each analyzed trend with intelligent categorization for idx, analysis in enumerate(self.image_analyses, 1): trend_section = [ f"### Trend Item {idx}", analysis.get('analysis', 'No analysis available'), f"*Analysis timestamp: {analysis.get('timestamp', 'Unknown')}*", "" ] report_sections.extend(trend_section) # Add intelligent trend synthesis report_sections.extend([ "## Trend Synthesis and Insights", await self._generate_trend_synthesis(), "", "## Recommendations", await self._generate_recommendations() ]) return "\n".join(report_sections) Intelligent Compilation Features: Automatic Structuring: Creates professional report formats automatically Content Synthesis: Combines individual analyses into coherent insights Temporal Context: Maintains timestamp and query context Executive Summaries: Generates high-level insights from detailed data Autonomous Storage Intelligence Note that there is no MCP Client code that needs to be implemented here. The integration is completely turnkey, through configuration alone. # In app_client.py - MCP tool configuration def create_app_tools(self, mcp_server_url: str, available_functions: Dict[str, Any]) -> List[Dict[str, Any]]: """Configure tools with automatic MCP integration.""" tools = [ { "type": "mcp", "server_label": "azure-storage-mcp-server", "server_url": mcp_server_url, "require_approval": "never", # Autonomous operation "allowed_tools": ["create_container", "list_containers", "upload_blob"], } ] return tools # Agent instructions demonstrate contextual intelligence instructions = f""" Step1: Compile trends based on user query using computer use agent. Step2: Prompt user to store trends report in Azure Blob Storage. Use MCP Server tools to perform this action autonomously. IMPORTANT: Maintain context of previously generated reports. If user asks to store a report, use the report generated in this session. """ Turnkey MCP Integration: Direct API Calls: MCP tools called directly by Responses API No Relay Logic: No custom MCP client implementation required Autonomous Tool Selection: Agent chooses appropriate MCP tools based on context Contextual Storage: Agent understands what to store and when Demo and Code reference Here is the GitHub Repo of the Application described in this post. See a demo of this application in action: Conclusion: Entering the Age of Practical Agentic AI The Fashion Trends Compiler Agent represents Agentic AI applications that work autonomously in real-world scenarios. By combining Azure AI Foundry's turnkey MCP integration with specialized AI models and robust automation frameworks, we've created an agent that doesn't just follow instructions but intelligently navigates complex multi-step workflows with minimal human oversight. Ready to build your own agentic AI solutions? Start exploring Azure AI Foundry's MCP integration and Computer Use capabilities to create the next generation of intelligent automation.
srikantan
Jun 16, 2025 Place Microsoft Foundry Blog
1.8KViews
3likes
0Comments
From Extraction to Insight: Evolving Azure AI Content Understanding with Reasoning and Enrichment
First introduced in public preview last year, Azure AI Content Understanding enables you to convert unstructured content—documents, audio, video, text, and images—into structured data. The service is designed to support consistent, high-quality output, directed improvements, built-in enrichment, and robust pre-processing to accelerate workflows and reduce cost. A New Chapter in Content Understanding Since our launch we’ve seen customers pushing the boundaries to go beyond simple data extraction with agentic solutions fully automating decisions. This requires more than just extracting fields. For example, a healthcare insurance provider decision to pay a claim requires cross-checking against insurance policies, applicable contracts, patient’s medical history and prescription datapoints. To do this a system needs the ability to interpret information in context, perform more complex enrichments and analysis across various data sources. Beyond field extraction, this requires a custom designed workflow leveraging reasoning. In response to this demand, Content Understanding now introduces Pro mode which enables enhanced reasoning, validation, and information aggregation capabilities. These updates allow the service to aggregate and compare results across sources, enrich extracted data with context, and deliver decisions as output. While Standard mode continues to offer reliable and scalable field extraction, Pro mode extends the service to support more complex content interpretation scenarios—enabling workflows that reflect the way people naturally reason over data. With this update, Content Understanding now solves a much larger component of your data processing workflows, offering new ways to automate, streamline, and enhance decision-making based on unstructured information. Key Benefits of Pro Mode Packed with cutting-edge reasoning capabilities, Pro mode revolutionizes document analysis. Multi-Content Input Process and aggregate information across multiple content files in a single request. Pro mode can build a unified schema from distributed data sources, enabling richer insight across documents. Multi-Step Reasoning Go beyond basic extraction with a process that supports reasoning, linking, validation, and enrichment. Knowledge Base Integration Seamlessly integrate with organizational knowledge bases and domain-specific datasets to enhance field inference. This ensures outputs can reason over the task of generating the output using the context of your business. When to Use Pro Mode Pro mode, currently limited to documents, is designed for scenarios where content understanding needs to go beyond surface-level extraction—ideal for use cases that traditionally require postprocessing, human review and decision-making based on multiple data points and contextual references. Pro mode enables intelligent processing that not only extracts data, but also validates, links, and enriches it. This is especially impactful when extracted information must be cross-referenced with external datasets or internal knowledge sources to ensure accuracy, consistency, and contextual depth. Examples include: Invoice processing that reconciles against purchase orders and contract terms Healthcare claims validation using patient records and prescription history Legal document review where clauses reference related agreements or precedents Manufacturing spec checks against internal design standards and safety guidelines By automating much of the reasoning, you can focus on higher value tasks! Pro mode helps reduce manual effort, minimize errors, and accelerate time to insight—unlocking new potential for downstream applications, including those that emulate higher-order decision-making. Simplified Pricing Model Introducing a simplified pricing structure that significantly reduces costs across all content modalities compared to previous versions, making enterprise-scale deployment more affordable and predictable. Expanded Feature Coverage We are also extending capabilities across various content types: Structured Document Outputs: Improved handling of tables spanning multiple pages, recognition of selection marks, and support for additional file types like .docx, .xlsx, .pptx, .msg, .eml, .rtf, .html, .md, and .xml. Classifier API: Automatically categorize/split and route documents to appropriate processing pipelines. Video Analysis: Extract data across an entire video or break a video into chapters automatically. Enrich metadata with face identification and descriptions that include facial images. Face API Preview: Detect, recognize, and enroll faces, enabling richer user-aware applications. Check out the details about each of these capabilities here - What's New for Content Understanding. Let's hear it from our customers Customers all over the globe are using Content Understanding for its powerful one-stop solution capabilities by leveraging advance modes of reasoning, grounding and confidence scores across diverse content types. ASC: AI-based analytics in ASC’s Recording Insights platform allows customers to move to a 100% compliance review coverage of conversations across multiple channels. ASC’s integration of Content Understanding replaces a previously complex setup—where multiple separate AI services had to be manually connected—with a single multimodal solution that delivers transcription, summarization, sentiment analysis, and data extraction in one streamlined interface. This shift not only simplifies implementation and accelerates time-to-value but also received positive customer feedback for its powerful features and the quick, hands-on support from Microsoft product teams. “With the integration of Content Understanding into the ASC Recording Insights platform, ASC was able to reduce R&D effort by 30% and achieve 5 times faster results than before. This helps ASC drive customer satisfaction and stay ahead of competition.” —Tobias Fengler, Chief Engineering Officer, ASC. To learn more about ASCs integration check out From Complexity to Simplicity: The ASC and Azure AI Partnership.” Ramp: Ramp, the all-in-one financial operations platform, is exploring how Azure AI Content Understanding can help transform receipts, bills, and multi-line invoices into structured data automatically. Ramp is leveraging the pre-built invoice template and experimenting with custom extraction capabilities across various document types. These experiments are helping Ramp evaluate how to further reduce manual entry and enhance the real-time logic that powers approvals, policy checks, and reconciliation. “Content Understanding gives us a single API to parse every receipt and statement we see—then lets our own AI reason over that data in real time. It's an efficient path from image to fully reconciled expense.” — Rahul S, Head of AI, Ramp MediaKind: MK.IO’s cloud-native video platform, available on Azure Marketplace—now integrates Azure AI Content Understanding to make it easy for developers to personalize streaming experiences. With just a few lines of code, you can turn full game footage into real-time, fan-specific highlight reels using AI-driven metadata like player actions, commentary, and key moments. “Azure AI Content Understanding gives us a new level of control and flexibility—letting us generate insights instantly, personalize streams automatically, and unlock new ways to engage and monetize. It’s video, reimagined.” —Erik Ramberg, VP, MediaKind Catch the full story from MediaKind in our breakout session at Build 2025 on May 18: My Game, My Way, where we walk you through the creation of personalized highlight reels in real-time. You’ll never look at your TV in the same way again. Getting Started For more details about the latest from Content Understanding check out Reasoning on multimodal content for efficient agentic AI app building Wednesday, May 21 at 2 PM PST Build your own Content Understanding solution in the Azure AI Foundry. Pro mode will be available in the Foundry starting June 1 st 2025 Refer to our documentation and sample code on Content Understanding Explore the video series on getting started with Content Understanding
Aditi_M
May 20, 2025 Place Microsoft Foundry Blog
2KViews
2likes
0Comments
Agentic P2P Automation: Harnessing the Power of OpenAI's Responses API
The Procure-to-Pay (P2P) process is traditionally error-prone and labor-intensive, requiring someone to manually open each purchase invoice, look up contract details in a separate system, and painstakingly compare the two to identify anomalies—a task prone to oversight and inconsistency. About the sample Application The 'Agentic' characteristics demonstrated here using the Responses API are: The client application makes a single call to the Responses API that internally handles all the actions autonomously, processes the information and returns the response. In other words, the client application does not have to perform those actions itself. These actions that the Responses API uses, are Hosted tools like (file search, vision-based reasoning). Function calling is used to invoke custom action not available in the Hosted tools (i.e. calling Azure Logic App in this case). The Responses API delegates control to the client application that executes the identified Function, hands over the response to the Responses API to complete the rest of the steps in the business process Handling of state across all the tool calls and orchestrating them in the right sequence are all handled by the Responses API. It autonomously takes the output from each Tool call and uses it to prepare the request for the next one. There is no Workflow logic implemented in the code to perform these steps. It is all done through natural language instructions passed when calling the Responses API, and through the Tool actions. The P2P Anomaly Detection system follows this workflow: Processes purchase invoice images using computer vision capabilities of gpt-4o Extracts critical information like Contract ID, Supplier ID, and line items from it Retrieves corresponding contract details from an external system via Azure Logic App, through Function Calling capabilities in Responses API Performs a vector Search for the business rules in the OpenAI vector store, for detection of anomalies in Procure to Pay processes Applies the Business rules on the Invoice details and validates them against the details in the Contract data, using gpt-4o for reasoning Generates a detailed report of violations and anomalies using gpt-4o Code Walkthrough 1. Tools The Agent (i.e. the application) uses the configuration for File search, and for the Function Call to invoke the Azure Logic App. # These are the tools that will be used by the Responses API. tools_list = [ { "type": "file_search", "vector_store_ids": [config.vector_store_id], "max_num_results": 20, }, { "type": "function", "name": "retrieve_contract", "description": "fetch contract details for the given contract_id and supplier_id", "parameters": { "type": "object", "properties": { "contract_id": { "type": "string", "description": "The contract id registered for the Supplier in the System", }, "supplier_id": { "type": "string", "description": "The Supplier ID registered in the System", }, }, "required": ["contract_id", "supplier_id"], }, }, ] 2. Instructions to the Agent Unlike Chat Completions End points that use System Prompts, the Responses API uses Instructions. This contains the prompt that describes how the Agent should go about implementing the use case in its entirety. instructions=""" This is a Procure to Pay process. You will be provided with the Purchase Invoice image as input. Note that Step 3 can be performed only after Step 1 and Step 2 are completed. Step 1: As a first step, you will extract the Contract ID and Supplier ID from the Invoice and also all the line items from the Invoice in the form of a table. Step 2: You will then use the function tool to call the Logic app with the Contract ID and Supplier ID to get the contract details. Step 3: You will then use the file search tool to retrieve the business rules applicable to detection of anomalies in the Procure to Pay process. Step 4: Then, apply the retrieved business rules to match the invoice line items with the contract details fetched from the system, and detect anomalies if any. Provide the list of anomalies detected in the Invoice, and the business rules that were violated. """ 3. User input to Responses API Load the Invoice image as an encoded base64 string, and add that to user input payload. For simplicity the user input is passed as 'user_prompt' as a string literal in the code, just for demonstration purposes. user_prompt = """ here are the Purchase Invoice image(s) as input. Detect anomalies in the procure to pay process and give me a detailed report """ # read the Purchase Invoice image(s) to be sent as input to the model image_paths = ["data_files/Invoice-002.png"] def encode_image_to_base64(image_path): with open(image_path, "rb") as image_file: return base64.b64encode(image_file.read()).decode("utf-8") # Encode images base64_images = [encode_image_to_base64(image_path) for image_path in image_paths] input_messages = [ { "role": "user", "content": [ {"type": "input_text", "text": user_prompt}, *[ { "type": "input_image", "image_url": f"data:image/jpeg;base64,{base64_image}", "detail": "high", } for base64_image in base64_images ], ], } ] 4. Invoking the Responses API The single call below performs all the different steps required to complete the anomaly detection end to end. Note that all the actions like Image based reasoning over the Invoice, vector search to retrieve the Business rules, reasoning over every tool call output and preparing the input for the next tool call, all happens directly within the API, in the cloud. # The following code is to call the Responses API with the input messages and tools response = client.responses.create( model=config.model, instructions=instructions, input=input_messages, tools=tools_list, tool_choice="auto", parallel_tool_calls=False, ) tool_call = response.output[0] There is only one step, related to Function call, that needs to run the custom function locally in the Application. The Responses API response indicates that a Function Call invocation has to happen before it can complete the process. It provides the Function name and the arguments required to make that call. We then make that function call, locally in the application, to Azure Logic Apps. We get the response back from the Function call, and that that to the payload of input message to the Responses API. It then completes the rest of the steps in the workflow. # We know this needs a function call, that needs to be executed from here in the application code. # Lets get hold of the function name and arguments from the Responses API response. function_response = None function_to_call = None function_name = None # When a function call is entailed, Responses API gives us control so that we can make the call from our application. # Note that this is because function call is to run our own custom code, it is not a hosted tool that Responses API can directly access and run. if response.output[0].type == "function_call": function_name = response.output[0].name function_to_call = available_functions[function_name] function_args = json.loads(response.output[0].arguments) # Lets call the Logic app with the function arguments to get the contract details. function_response = function_to_call(**function_args) # append the response message to the input messages, and proceed with the next call to the Responses API. input_messages.append(tool_call) # append model's function call message input_messages.append({ # append result message "type": "function_call_output", "call_id": tool_call.call_id, "output": str(function_response) }) # This is the final call to the Responses API with the input messages and tools response_2 = client.responses.create( model=config.model, instructions=instructions, input=input_messages, tools=tools_list, ) print(response_2.output_text) 5. Function Call Here is the code snippet that invokes the Azure Logic App and returns the relevant contract details from the Azure SQL Database. if response.output[0].type == "function_call": function_name = response.output[0].name function_to_call = available_functions[function_name] function_args = json.loads(response.output[0].arguments) # Lets call the Logic app with the function arguments to get the contract details. function_response = function_to_call(**function_args) # append the response message to the input messages, and proceed with the next call to the Responses API. input_messages.append(tool_call) # append model's function call message input_messages.append({ # append result message "type": "function_call_output", "call_id": tool_call.call_id, "output": str(function_response) }) # This is the final call to the Responses API with the input messages and tools response_2 = client.responses.create( model=config.model, instructions=instructions, input=input_messages, tools=tools_list, ) print(response_2.output_text) Code Run outcome Here is the output from the run of the Responses API call ## ✅ Contract Line Items (Raw JSON) ```json [ { "ContractID": "CON000002", "LineID": "LINE000003", "SupplierID": "SUP0008", "ContractDate": "2022-10-19T00:00:00", "ExpirationDate": "2023-01-07T00:00:00", "TotalAmount": 66543.390625, "Currency": "USD", "Status": "Expired", "ItemID": "ITEM0040", "Quantity": 78, "UnitPrice": 136.75, "TotalPrice": 10666.5, "DeliveryDate": "2023-01-01T00:00:00", "ItemDescription": "Description for ITEM0040" }, { "ContractID": "CON000002", "LineID": "LINE000004", "SupplierID": "SUP0008", "ContractDate": "2022-10-19T00:00:00", "ExpirationDate": "2023-01-07T00:00:00", "TotalAmount": 66543.390625, "Currency": "USD", "Status": "Expired", "ItemID": "ITEM0082", "Quantity": 57, "UnitPrice": 479.8699951171875, "TotalPrice": 27352.58984375, "DeliveryDate": "2022-11-26T00:00:00", "ItemDescription": "Description for ITEM0082" }, { "ContractID": "CON000002", "LineID": "LINE000005", "SupplierID": "SUP0008", "ContractDate": "2022-10-19T00:00:00", "ExpirationDate": "2023-01-07T00:00:00", "TotalAmount": 66543.390625, "Currency": "USD", "Status": "Expired", "ItemID": "ITEM0011", "Quantity": 21, "UnitPrice": 398.0899963378906, "TotalPrice": 8359.8896484375, "DeliveryDate": "2022-11-29T00:00:00", "ItemDescription": "Description for ITEM0011" }, { "ContractID": "CON000002", "LineID": "LINE000006", "SupplierID": "SUP0008", "ContractDate": "2022-10-19T00:00:00", "ExpirationDate": "2023-01-07T00:00:00", "TotalAmount": 66543.390625, "Currency": "USD", "Status": "Expired", "ItemID": "ITEM0031", "Quantity": 47, "UnitPrice": 429.0299987792969, "TotalPrice": 20164.41015625, "DeliveryDate": "2022-12-09T00:00:00", "ItemDescription": "Description for ITEM0031" } ] ## 🧾 Extracted Details from Invoice - **Contract ID:** CON000002 - **Supplier ID:** SUP0008 - **Total Invoice Value:** $113,130.16 USD - **Invoice Date:** 2023-06-15 --- ### 📦 Invoice Line Items | Item ID | Quantity | Unit Price | Total Price | Description | |-----------|----------|------------|-------------|------------------------------| | ITEM0040 | 116 | $136.75 | $15,863.00 | Description for ITEM0040 | | ITEM0082 | 116 | $554.62 | $64,335.92 | Description for ITEM0082 | | ITEM0011 | 36 | $398.09 | $14,331.24 | Description for ITEM0011 | | ITEM0031 | 36 | $475.00 | $17,100.00 | Description for ITEM0031 | | ITEM9999 | 10 | $150.00 | $1,500.00 | Extra item not in contract | --- ## 📄 Contract Details Retrieved ### ITEM0040 - Quantity: 78 - Unit Price: $136.75 - Total Price: $10,666.50 ### ITEM0082 - Quantity: 57 - Unit Price: $479.87 - Total Price: $27,352.59 ### ITEM0011 - Quantity: 21 - Unit Price: $398.09 - Total Price: $8,359.89 ### ITEM0031 - Quantity: 47 - Unit Price: $429.03 - Total Price: $20,164.41 - **Contract Expiration:** 2023-01-07 (Status: Expired) --- ## ❗ Anomalies Detected ### 🔴 Contract Expiry - Invoice dated **2023-06-15** refers to an **expired contract** (expired on **2023-01-07**). ### 🔴 Quantity Exceeds Contract - **ITEM0040:** 116 > 78 - **ITEM0082:** 116 > 57 - **ITEM0011:** 36 > 21 - **ITEM0031:** 36 ≤ 47 (✅ within limit) ### 🔴 Price Discrepancy - **ITEM0082:** Invoiced @ $554.62 vs Contract @ $479.87 - **ITEM0031:** Invoiced @ $475.00 vs Contract @ $429.03 ### 🔴 Extra Item - **ITEM9999** not found in contract records. --- ## 🧩 Conclusion Multiple business rule violations were found: - ❌ Contract expired - ❌ Quantity overrun - ❌ Price discrepancies - ❌ Unauthorized items > **Recommended:** Detailed investigation and corrective action. References: The source code of the application used in this sample - here Read about the Responses API here Read about the availability of this API on Azure here View a video of the demonstration of this sample application below.
srikantan
Mar 24, 2025 Place Microsoft Foundry Blog
811Views
2likes
0Comments