Azure AI Vision at Microsoft Build 2024: Multimodal AI for Everyone
Published May 21, 2024
Microsoft

As Microsoft Build 2024 kicks off, we are excited to share some of the groundbreaking innovations enabling powerful vision use cases on Azure AI. We have been pushing the boundaries of multimodal AI, combining natural language processing and computer vision to create powerful, intuitive solutions for a wide range of scenarios. In this blog post, we introduce three of our latest multimodal models: GPT-4 Turbo with Vision, Phi-3-vision, and the recently released GPT-4o.

 

As we announced earlier this month, GPT-4 Turbo is now generally available through the Azure OpenAI Service. GPT-4 Turbo is a multimodal model capable of processing both text and image inputs to generate text outputs. In the months since Microsoft Ignite, where we announced the public preview of GPT-4 Turbo with Vision, several of our customers have already unlocked the potential of combining computer vision and generative AI, with use cases ranging from creating new processes and improving efficiency to driving innovation within their businesses. We have seen applications in insurance that improve process efficiency, in healthcare that increase customer safety, and across many organizations that derive insights from charts and diagrams.
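To make this concrete, here is a minimal sketch of sending an image plus a text prompt to a GPT-4 Turbo with Vision deployment through the Azure OpenAI chat completions API using the openai Python package. The endpoint, key, API version, deployment name, and image URL are placeholders rather than values from this announcement, so substitute your own resource settings.

```python
import os
from openai import AzureOpenAI  # pip install openai>=1.0

# Placeholder resource settings -- substitute your own endpoint, key, and API version.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-15-preview",
)

# A single chat turn that mixes text and an image URL in the user message.
response = client.chat.completions.create(
    model="gpt-4-turbo-vision",  # hypothetical deployment name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize the key figures in this chart."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```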

 

The GPT-4 Vision model in Microsoft Azure integrates seamlessly with Palantir’s Artificial Intelligence Platform (AIP). This has unlocked significant capabilities across numerous operational AIP workflows, ranging from due diligence processing, to insurance claim management, to improving manufacturing operations with engineering schematics. With Palantir and Microsoft technology, both back-office and operational workflows are enhanced by adaptable, controllable visual intelligence.

Anirvan Mukherjee, Head of AI/ML Solutions, Palantir Technologies, Inc.

 

We're excited about the advancements in image analysis offered by Azure AI Vision, encompassing image classification, captions, and descriptions. Our exploration of this technology's potential is broad and diverse, especially in leveraging images captured by robots within elders' homes. This allows us to analyze images for text-based descriptions and potential safety issues, making these insights available to loved ones and caregivers. With this service, we also plan to develop a "What is this?" capability that would let an elder ask a robot to classify and describe objects for learning and cognitive enrichment.

Chris Heidbrink, Senior Vice President, Innovation and AI, NTT Research

 

We are excited to see the addition of the vision capability to Azure OpenAI GPT service. TCS Industry Copilots are now enriched with Azure OpenAI GPT-4V to extract crucial insights from images and drawings via the vision enhancement feature, resulting in more efficient analysis of images.

Girish Phadke, Technology Head, Microsoft Cloud Platforms, Tata Consultancy Services, AI Cloud

 

We are also introducing Phi-3-vision, the first multimodal model in the Phi-3 family of open models, bringing together text and images. Phi-3-vision can reason over real-world images and extract and interpret text within them. It has also been optimized for chart and diagram understanding and can be used to generate insights and answer questions. Phi-3-vision builds on the language capabilities of Phi-3-mini, packing strong language and image reasoning quality into a small model. You can read more about Phi-3-vision and the Phi-3 family of open models here.
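For readers who want to try the open weights locally, the sketch below loads Phi-3-vision from Hugging Face with the transformers library and asks a question about a chart image. The model ID, the <|image_1|> placeholder, and the chat-template call follow the published model card at the time of writing; treat the exact prompt format and generation settings as assumptions to verify against the model card.

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# The <|image_1|> tag marks where the image is injected into the prompt.
messages = [{"role": "user", "content": "<|image_1|>\nWhat trend does this chart show?"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

generate_ids = model.generate(
    **inputs, max_new_tokens=300, eos_token_id=processor.tokenizer.eos_token_id
)
# Strip the prompt tokens before decoding the model's answer.
answer_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```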


Lastly, we're thrilled to announce the launch of GPT-4o, OpenAI's new flagship model, on Azure AI. This groundbreaking multimodal model integrates text, vision, and, in the future, audio capabilities, setting a new standard for generative and conversational AI experiences. GPT-4o is available now in the Azure OpenAI Service API and Azure AI Studio with support for text and images.
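Because GPT-4o is served through the same Azure OpenAI chat completions API, the GPT-4 Turbo with Vision sketch above works unchanged; only the deployment name differs. The "gpt-4o" deployment name below is again a placeholder for whatever you named your own deployment.

```python
# Reusing the AzureOpenAI client from the earlier sketch; only the deployment changes.
response = client.chat.completions.create(
    model="gpt-4o",  # hypothetical deployment name for a GPT-4o deployment
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this diagram in two sentences."},
                {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
            ],
        }
    ],
    max_tokens=200,
)
print(response.choices[0].message.content)
```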


We believe that multimodal AI will enable new and exciting use cases and experiences for everyone. We are committed to making multimodal AI accessible, affordable, and scalable. We are also committed to making multimodal AI responsible by continuing to develop our models according to the principles of responsible AI.

 

In addition to the multimodal models we're highlighting today, we want to share the preview of Facial Liveness for browsers in the Azure AI Vision Face API. Liveness detection is a key aspect of multifactor authentication (MFA), preventing bad actors from using a person's image in spoofing attacks. Learn more about liveness detection and the Face API here.
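At a high level, the liveness flow has two halves: your server creates a short-lived liveness session against the Face API and passes the returned auth token to the browser, where the Face UI SDK drives the camera check. The sketch below covers only the server-side session creation; the REST route, API version, and response fields shown are assumptions drawn from the preview documentation, so confirm them against the Face API reference before relying on them.

```python
import os
import uuid
import requests

endpoint = os.environ["FACE_API_ENDPOINT"]  # e.g. https://<resource>.cognitiveservices.azure.com
key = os.environ["FACE_API_KEY"]

# Assumed preview route for creating a liveness-detection session; verify the
# exact path and api-version against the current Face API documentation.
url = f"{endpoint}/face/v1.1-preview.1/detectLiveness/singleModal/sessions"

payload = {
    "livenessOperationMode": "Passive",
    # Ties the session to a specific end-user device for auditing and abuse prevention.
    "deviceCorrelationId": str(uuid.uuid4()),
}

resp = requests.post(url, json=payload, headers={"Ocp-Apim-Subscription-Key": key})
resp.raise_for_status()
session = resp.json()

# The auth token is handed to the browser, where the Face UI web component
# performs the actual camera-based liveness check for this session.
print(session["sessionId"], session["authToken"])
```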

 

You can also check out our sessions "Practical applications of multimodal vision AI models" and "Azure AI Vision use cases for image and video with multimodality," as well as our customer demo session from Palantir Technologies, who will show their multimodal use cases with GPT and Azure AI Vision. Thank you for your interest and support; we look forward to hearing from you and seeing what you create with multimodal models.
