pamela fox
Add speech input & output to your app with the free browser APIs
One of the amazing benefits of modern machine learning is that computers can reliably turn text into speech, or transcribe speech into text, across multiple languages and accents. We can then use those capabilities to make our web apps more accessible for anyone who has a situational, temporary, or chronic issue that makes typing difficult. That describes so many people - for example, a parent holding a squirmy toddler in their arms, an athlete with a broken arm, or an individual with Parkinson's disease.

There are two approaches we can use to add speech capabilities to our apps:

- Use the built-in browser APIs: the SpeechRecognition API and SpeechSynthesis API.
- Use a cloud-based service, like the Azure Speech API.

Which one to use? The great thing about the browser APIs is that they're free and available in most modern browsers and operating systems. The drawback is that they're often not as powerful and flexible as cloud-based services, and the speech output often sounds more robotic. There are also a few niche browser/OS combos where the built-in APIs don't work. That's why we decided to add both options to our most popular RAG chat solution, to give developers the ability to decide for themselves. However, in this post, I'm going to show you how to add speech capabilities using the free built-in browser APIs, since free APIs are often easier to get started with, and it's important to do what we can to improve the accessibility of our apps.

The GIF below shows the end result, a chat app with both speech input and output buttons:

All of the code described in this post is part of openai-chat-vision-quickstart, so you can grab the full code yourself after seeing how it works.

Speech input with SpeechRecognition API

To make it easier to add a speech input button to any app, I'm wrapping the functionality inside a custom HTML element, SpeechInputButton. First I construct the speech input button element with an instance of the SpeechRecognition API, making sure to use the browser's preferred language if one is set:

```javascript
class SpeechInputButton extends HTMLElement {
  constructor() {
    super();
    this.isRecording = false;
    const SpeechRecognition =
      window.SpeechRecognition || window.webkitSpeechRecognition;
    if (!SpeechRecognition) {
      this.dispatchEvent(
        new CustomEvent("speecherror", {
          detail: { error: "SpeechRecognition not supported" },
        })
      );
      return;
    }
    this.speechRecognition = new SpeechRecognition();
    this.speechRecognition.lang = navigator.language || navigator.userLanguage;
    this.speechRecognition.interimResults = false;
    this.speechRecognition.continuous = true;
    this.speechRecognition.maxAlternatives = 1;
  }
```

Then I define the connectedCallback() method that will be called whenever this custom element has been added to the DOM. When that happens, I define the inner HTML to render a button and attach event listeners for both mouse and keyboard events. Since we want this to be fully accessible, keyboard support is important.
```javascript
connectedCallback() {
  this.innerHTML = `
    <button class="btn btn-outline-secondary" type="button" title="Start recording (Shift + Space)">
      <i class="bi bi-mic"></i>
    </button>`;
  this.recordButton = this.querySelector('button');
  this.recordButton.addEventListener('click', () => this.toggleRecording());
  document.addEventListener('keydown', this.handleKeydown.bind(this));
}

handleKeydown(event) {
  if (event.key === 'Escape') {
    this.abortRecording();
  } else if (event.key === ' ' && event.shiftKey) { // Shift + Space
    event.preventDefault();
    this.toggleRecording();
  }
}

toggleRecording() {
  if (this.isRecording) {
    this.stopRecording();
  } else {
    this.startRecording();
  }
}
```

The majority of the code is in the startRecording function. It sets up a listener for the "result" event from the SpeechRecognition instance, which contains the transcribed text. It also sets up a listener for the "end" event, which is triggered either automatically after a few seconds of silence (in some browsers) or when the user ends the recording by clicking the button. Finally, it sets up a listener for any "error" events. Once all listeners are ready, it calls start() on the SpeechRecognition instance and styles the button to be in an active state.

```javascript
startRecording() {
  if (this.speechRecognition == null) {
    this.dispatchEvent(
      new CustomEvent("speech-input-error", {
        detail: { error: "SpeechRecognition not supported" },
      })
    );
    // Bail out early if the browser doesn't support the API
    return;
  }
  this.speechRecognition.onresult = (event) => {
    let input = "";
    for (const result of event.results) {
      input += result[0].transcript;
    }
    this.dispatchEvent(
      new CustomEvent("speech-input-result", {
        detail: { transcript: input },
      })
    );
  };
  this.speechRecognition.onend = () => {
    this.isRecording = false;
    this.renderButtonOff();
    this.dispatchEvent(new Event("speech-input-end"));
  };
  this.speechRecognition.onerror = (event) => {
    if (this.speechRecognition) {
      this.speechRecognition.stop();
      if (event.error == "no-speech") {
        this.dispatchEvent(
          new CustomEvent("speech-input-error", {
            detail: { error: "No speech was detected. Please check your system audio settings and try again." },
          })
        );
      } else if (event.error == "language-not-supported") {
        this.dispatchEvent(
          new CustomEvent("speech-input-error", {
            detail: { error: "The selected language is not supported. Please try a different language." },
          })
        );
      } else if (event.error != "aborted") {
        this.dispatchEvent(
          new CustomEvent("speech-input-error", {
            detail: { error: "An error occurred while recording. Please try again: " + event.error },
          })
        );
      }
    }
  };
  this.speechRecognition.start();
  this.isRecording = true;
  this.renderButtonOn();
}
```

If the user stops the recording using the keyboard shortcut or button click, we call stop() on the SpeechRecognition instance. At that point, anything the user said will be transcribed and become available via the "result" event.

```javascript
stopRecording() {
  if (this.speechRecognition) {
    this.speechRecognition.stop();
  }
}
```

Alternatively, if the user presses the Escape keyboard shortcut, we instead call abort() on the SpeechRecognition instance, which stops the recording and does not send any previously untranscribed speech over.

```javascript
abortRecording() {
  if (this.speechRecognition) {
    this.speechRecognition.abort();
  }
}
```
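The startRecording and onend handlers above call renderButtonOn and renderButtonOff, which aren't shown in these snippets. A minimal sketch of what they might look like, assuming a made-up CSS class name ("speech-input-active") for the pulsing animation defined in styles.css and the Bootstrap icon classes the button already uses:

```javascript
// Sketch only: the real helpers live in speech-input.js and may differ.
// "speech-input-active" is an assumed class name for the pulsing animation.
renderButtonOn() {
  this.recordButton.classList.add("speech-input-active");
  this.recordButton.querySelector("i").className = "bi bi-mic-fill";
}

renderButtonOff() {
  this.recordButton.classList.remove("speech-input-active");
  this.recordButton.querySelector("i").className = "bi bi-mic";
}
```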
Once the custom HTML element is fully defined, we register it with the desired tag name, speech-input-button:

```javascript
customElements.define("speech-input-button", SpeechInputButton);
```

To use the custom speech-input-button element in a chat application, we add it to the HTML for the chat form:

```html
<speech-input-button></speech-input-button>
<input id="message" name="message" type="text">
```

Then we attach an event listener for the custom events dispatched by the element, and we update the input text field with the transcribed text:

```javascript
const speechInputButton = document.querySelector("speech-input-button");
speechInputButton.addEventListener("speech-input-result", (event) => {
  messageInput.value += " " + event.detail.transcript.trim();
  messageInput.focus();
});
```
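The element also dispatches speech-input-error and speech-input-end events, so it's worth listening for those too. A minimal sketch, where showError stands in for however your app surfaces messages to the user (it's not a function defined in this post):

```javascript
speechInputButton.addEventListener("speech-input-error", (event) => {
  // showError is a hypothetical helper for displaying a message in your UI
  showError(event.detail.error);
});
speechInputButton.addEventListener("speech-input-end", () => {
  // Recording finished (silence or button click), so return focus to the text input
  messageInput.focus();
});
```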
You can see the full custom HTML element code in speech-input.js and the usage in index.html. There's also a fun pulsing animation for the button's active state in styles.css.

Speech output with SpeechSynthesis API

Once again, to make it easier to add a speech output button to any app, I'm wrapping the functionality inside a custom HTML element, SpeechOutputButton. When defining the custom element, we specify an observed attribute named "text", to store whatever text should be turned into speech when the button is clicked.

```javascript
class SpeechOutputButton extends HTMLElement {
  static observedAttributes = ["text"];
```

In the constructor, we check to make sure the SpeechSynthesis API is supported, and remember the browser's preferred language for later use.

```javascript
constructor() {
  super();
  this.isPlaying = false;
  const SpeechSynthesis = window.speechSynthesis || window.webkitSpeechSynthesis;
  if (!SpeechSynthesis) {
    this.dispatchEvent(
      new CustomEvent("speech-output-error", {
        detail: { error: "SpeechSynthesis not supported" },
      })
    );
    return;
  }
  this.synth = SpeechSynthesis;
  this.lngCode = navigator.language || navigator.userLanguage;
}
```

When the custom element is added to the DOM, I define the inner HTML to render a button and attach mouse and keyboard event listeners:

```javascript
connectedCallback() {
  this.innerHTML = `
    <button class="btn btn-outline-secondary" type="button">
      <i class="bi bi-volume-up"></i>
    </button>`;
  this.speechButton = this.querySelector("button");
  this.speechButton.addEventListener("click", () =>
    this.toggleSpeechOutput()
  );
  document.addEventListener('keydown', this.handleKeydown.bind(this));
}
```

The majority of the code is in the toggleSpeechOutput function. If the speech is not yet playing, it creates a new SpeechSynthesisUtterance instance, passes it the "text" attribute, and sets the language and audio properties. It attempts to use a voice that's optimal for the desired language, but falls back to "en-US" if none is found. It attaches event listeners for the start and end events, which will change the button's style to look either active or inactive. Finally, it tells the SpeechSynthesis API to speak the utterance.

```javascript
toggleSpeechOutput() {
  if (!this.isConnected) {
    return;
  }
  const text = this.getAttribute("text");
  if (this.synth != null) {
    if (this.isPlaying || text === "") {
      this.stopSpeech();
      return;
    }
    // Create a new utterance and play it.
    const utterance = new SpeechSynthesisUtterance(text);
    utterance.lang = this.lngCode;
    utterance.volume = 1;
    utterance.rate = 1;
    utterance.pitch = 1;
    let voice = this.synth
      .getVoices()
      .filter((voice) => voice.lang === this.lngCode)[0];
    if (!voice) {
      voice = this.synth
        .getVoices()
        .filter((voice) => voice.lang === "en-US")[0];
    }
    utterance.voice = voice;
    if (!utterance) {
      return;
    }
    utterance.onstart = () => {
      this.isPlaying = true;
      this.renderButtonOn();
    };
    utterance.onend = () => {
      this.isPlaying = false;
      this.renderButtonOff();
    };
    this.synth.speak(utterance);
  }
}
```

When the user no longer wants to hear the speech output, indicated either via another press of the button or by pressing the Escape key, we call cancel() from the SpeechSynthesis API.

```javascript
stopSpeech() {
  if (this.synth) {
    this.synth.cancel();
    this.isPlaying = false;
    this.renderButtonOff();
  }
}
```
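The connectedCallback above wires up a handleKeydown listener, and the Escape key is mentioned as a way to stop playback; that handler isn't shown in the post, but a sketch of it could be as simple as this (the real version in speech-output.js may differ):

```javascript
// Sketch: stop any in-progress speech when the user presses Escape.
handleKeydown(event) {
  if (event.key === "Escape") {
    this.stopSpeech();
  }
}
```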
Once the custom HTML element is fully defined, we register it with the desired tag name, speech-output-button:

```javascript
customElements.define("speech-output-button", SpeechOutputButton);
```

To use this custom speech-output-button element in a chat application, we construct it dynamically each time that we've received a full response from an LLM, and call setAttribute to pass in the text to be spoken:

```javascript
const speechOutput = document.createElement("speech-output-button");
speechOutput.setAttribute("text", answer);
messageDiv.appendChild(speechOutput);
```

You can see the full custom HTML element code in speech-output.js and the usage in index.html. This button also uses the same pulsing animation for the active state, defined in styles.css.

Acknowledgments

I want to give a huge shout-out to John Aziz for his amazing work adding speech input and output to the azure-search-openai-demo, as that was the basis for the code I shared in this blog post.

Entity extraction with Azure OpenAI Structured Outputs

📺 Tune into our live stream on this topic on December 3rd!

Have you ever wanted to extract some details from a large block of text, like to figure out the topics of a blog post or the location of a news article? In the past, I've had to use specialized models and domain-specific packages for entity extraction. But now, we can do entity extraction with large language models and get equally impressive results. 🎉

When we use the OpenAI gpt-4o model along with the structured outputs mode, we can define a schema for the details we'd like to extract and get a response that conforms to that schema. Here's the most basic example from the Azure OpenAI tutorial about structured outputs:

```python
class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]


completion = client.beta.chat.completions.parse(
    model="MODEL_DEPLOYMENT_NAME",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)

output = completion.choices[0].message.parsed
```

The code first defines the CalendarEvent class, an instance of a Pydantic model. Then it sends a request to the GPT model specifying a response_format of CalendarEvent. The parsed output will contain a name, date, and participants.

We can even go a step further and turn the parsed output into a CalendarEvent instance, using the Pydantic model_validate method:

```python
event = CalendarEvent.model_validate(output)
```

With this structured outputs capability, it's easier than ever to use GPT models for "entity extraction" tasks: give it some data, tell it what sorts of entities to extract from that data, and constrain it as needed.
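The snippets in this post assume a client object has already been created. As a rough sketch, here's one way to construct it against Azure OpenAI with an API key; the environment variable names are placeholders, and the linked samples may authenticate differently (for example, with keyless Microsoft Entra ID auth):

```python
import os

import openai

# Placeholder environment variable names - adjust to your own configuration.
# Structured outputs requires a recent API version and a gpt-4o model version
# that supports it.
client = openai.AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-08-01-preview",
)
```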
Extracting from GitHub READMEs

Let's see an example of a way that I actually used structured outputs, to help me summarize the submissions that we got to a recent hackathon. I can feed the README of a repository to the GPT model and ask it to extract key details like the project title and technologies used.

First I define the Pydantic models:

```python
class Language(str, Enum):
    JAVASCRIPT = "JavaScript"
    PYTHON = "Python"
    DOTNET = ".NET"


class Framework(str, Enum):
    LANGCHAIN = "Langchain"
    SEMANTICKERNEL = "Semantic Kernel"
    LLAMAINDEX = "Llamaindex"
    AUTOGEN = "Autogen"
    SPRINGBOOT = "Spring Boot"
    PROMPTY = "Prompty"


class RepoOverview(BaseModel):
    name: str
    summary: str = Field(..., description="A 1-2 sentence description of the project")
    languages: list[Language]
    frameworks: list[Framework]
```

In the code above, I asked for a list of a Python enum, which will constrain the model to return only options matching that list. I could have also asked for a list[str] to give it more flexibility, but I wanted to constrain it in this case. I also annotated the summary field with a description, using the Pydantic Field class, so that I could specify the length of the description. Without that annotation, the descriptions are often much longer. We can use that description whenever we want to give additional guidance to the model about a field.

Next, I fetch the GitHub README, storing it as a string:

```python
url = "https://api.github.com/repos/shank250/CareerCanvas-msft-raghack/contents/README.md"
response = requests.get(url)
readme_content = base64.b64decode(response.json()["content"]).decode("utf-8")
```

Finally, I send off the request and convert the result into a RepoOverview instance:

```python
completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {
            "role": "system",
            "content": "Extract info from the GitHub issue markdown about this hack submission.",
        },
        {"role": "user", "content": readme_content},
    ],
    response_format=RepoOverview,
)
output = completion.choices[0].message.parsed
repo_overview = RepoOverview.model_validate(output)
```

You can see the full code in extract_github_repo.py. That gives back an object like this one:

```python
RepoOverview(
    name='Job Finder Chatbot with RAG',
    description='This project is a chatbot application aimed at helping users find job opportunities and get relevant answers to questions about job roles, leveraging Retrieval-Augmented Generation (RAG) for personalized recommendations and answers.',
    languages=[<Language.JAVASCRIPT: 'JavaScript'>],
    azure_services=[<AzureService.AISEARCH: 'AI Search'>, <AzureService.POSTGRESQL: 'PostgreSQL'>],
    frameworks=[<Framework.SPRINGBOOT: 'Spring Boot'>]
)
```

Extracting from PDFs

I talk to many customers who want to extract details from PDFs, like locations and dates, often to store as metadata in their RAG search index. The first step is to extract the PDF as text, and we have a few options: a hosted service like Azure Document Intelligence, or a local Python package like pymupdf. For this example, I'm using the latter, as I wanted to try out their specialized pymupdf4llm package that converts the PDF to LLM-friendly markdown.

First I load in a PDF of an order receipt and convert it to markdown:

```python
md_text = pymupdf4llm.to_markdown("example_receipt.pdf")
```

Then I define the Pydantic models for a receipt:

```python
class Item(BaseModel):
    product: str
    price: float
    quantity: int


class Receipt(BaseModel):
    total: float
    shipping: float
    payment_method: str
    items: list[Item]
    order_number: int
```

In this example, I'm using a nested Pydantic model Item for each item in the receipt, so that I can get detailed information about each item.

And then, as before, I send the text off to the GPT model and convert the response back to a Receipt instance:

```python
completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {"role": "system", "content": "Extract the information from the receipt"},
        {"role": "user", "content": md_text},
    ],
    response_format=Receipt,
)
output = completion.choices[0].message.parsed
receipt = Receipt.model_validate(output)
```

You can see the full code in extract_pdf_receipt.py.
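Because the result is a plain Pydantic object, it's straightforward to add a quick plausibility check on the extraction before trusting it. For example, a small sketch that compares the reported total against the sum of the extracted line items (taxes and discounts ignored for simplicity):

```python
# Sanity-check the extracted receipt: do the line items plus shipping
# roughly add up to the reported total?
items_total = sum(item.price * item.quantity for item in receipt.items)
if abs(items_total + receipt.shipping - receipt.total) > 0.01:
    print(
        f"Extraction may be off: items ({items_total:.2f}) + shipping "
        f"({receipt.shipping:.2f}) != total ({receipt.total:.2f})"
    )
```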
Extracting from images

Since the gpt-4o model is also a multimodal model, it can accept both images and text. That means we can send it an image and ask it for a structured output that extracts details from that image. Pretty darn cool!

First I load in a local image as a base64-encoded data URI:

```python
def open_image_as_base64(filename):
    with open(filename, "rb") as image_file:
        image_data = image_file.read()
    image_base64 = base64.b64encode(image_data).decode("utf-8")
    return f"data:image/png;base64,{image_base64}"


image_url = open_image_as_base64("example_graph_treecover.png")
```

For this example, my image is a graph, so I'm going to have it extract details about the graph. Here are the Pydantic models:

```python
class Graph(BaseModel):
    title: str
    description: str = Field(..., description="1 sentence description of the graph")
    x_axis: str
    y_axis: str
    legend: list[str]
```

Then I send off the base64 image URI to the GPT model, inside an "image_url" type message, and convert the response back to a Graph object:

```python
completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {"role": "system", "content": "Extract the information from the graph"},
        {
            "role": "user",
            "content": [
                {"image_url": {"url": image_url}, "type": "image_url"},
            ],
        },
    ],
    response_format=Graph,
)
output = completion.choices[0].message.parsed
graph = Graph.model_validate(output)
```
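The same parse-and-validate pattern appears in each of the examples above, so in your own code you might factor it into a small helper. Here's one possible sketch (illustrative only, not part of the linked repository):

```python
import os

from pydantic import BaseModel


def extract(client, content, response_model: type[BaseModel], system_prompt: str):
    """Send text (or an image message list) to the model and return a parsed instance."""
    completion = client.beta.chat.completions.parse(
        model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": content},
        ],
        response_format=response_model,
    )
    return response_model.model_validate(completion.choices[0].message.parsed)


# For example: receipt = extract(client, md_text, Receipt, "Extract the information from the receipt")
```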
More examples

You can use this same general approach for entity extraction across many file types, as long as they can be represented in either text or image form. See more examples in my azure-openai-entity-extraction repository. As always, remember that large language models are probabilistic next-word-predictors that won't always get things right, so definitely evaluate the accuracy of the outputs before you use this approach for a business-critical task.

RAG Deep Dive: 10-part live stream series

Our most popular RAG solution for Azure has now been deployed thousands of times by developers using it across myriad domains, like meeting transcripts, research papers, HR documents, and industry manuals. Based on feedback from the community (and often, thanks to pull requests from the community!), we've added the most hotly requested features: support for multiple document types, chat history with Cosmos DB, user accounts and login, data access control, multimodal media ingestion, private deployment, and more.

This open-source RAG solution is powerful, but it can be intimidating to dive into the code yourself, especially now that it has so many optional features. That's why we're putting on a 10-part live series in January and February, diving deep into the solution and showing you all the ways you can use it. Register for the whole series on Reactor, or scroll down to learn about each session and register for individual sessions. We look forward to seeing you in the live chat and hearing how you're using the RAG solution for your own domain. See you in the streams! 👋🏻

The RAG solution for Azure
13 January, 2025 | 11:30 PM UTC | 3:30 PM PT
Register for the stream on Reactor

Join us for the kick-off session, where we'll do a live demo of the RAG solution and explain how it all works. We'll step through the RAG flow from Azure AI Search to Azure OpenAI, deploy the app to Azure, and discuss the Azure architecture.

Customizing our RAG solution
15 January, 2025 | 11:30 PM UTC | 3:30 PM PT
Register for the stream on Reactor

In our second session, we'll show you how to customize the RAG solution for your own domain - adding your own data, modifying the prompts, and personalizing the UI. Plus, we'll give you tips for local development for faster feature iteration.

Optimal retrieval with Azure AI Search
20 January, 2025 | 11:30 PM UTC | 3:30 PM PT
Register for the stream on Reactor

Our RAG solution uses Azure AI Search to find matching documents, using state-of-the-art retrieval mechanisms. We'll dive into the mechanics of vector embeddings, hybrid search with RRF, and semantic ranking. We'll also discuss the data ingestion process, highlighting the differences between manual ingestion and integrated vectorization.

Multimedia data ingestion
22 January, 2025 | 11:30 PM UTC | 3:30 PM PT
Register for the stream on Reactor

Do your documents contain images or charts? Our RAG solution has two different approaches to handling multimedia documents, and we'll dive into both in this session. The first approach works purely at ingestion time, replacing media in the documents with LLM-generated descriptions. The second approach stores images of the media alongside vector embeddings of those images, and sends both text and images to a multimodal LLM for question answering. Learn about both approaches so that you can decide what to use for your app.

User login and data access control
27 January, 2025 | 11:30 PM UTC | 3:30 PM PT
Register for the stream on Reactor

In our RAG flow, the app first searches a knowledge base for relevant matches to a user's query, then sends the results to the LLM along with the original question. What if you have documents that should only be accessed by a subset of your users, like a group or a single user? Then you need data access controls to ensure that document visibility is respected during the RAG flow.
In this session, we'll show an approach using Azure AI Search with data access controls to only search the documents that can be seen by the logged-in user. We'll also demonstrate a feature for user-uploaded documents that uses data access controls along with Azure Data Lake Storage Gen2.

Storing chat history
29 January, 2025 | 11:30 PM UTC | 3:30 PM PT
Register for the stream on Reactor

Learn how we store chat history using either IndexedDB for client-side storage or Azure Cosmos DB for persistent storage. We'll discuss the API architecture and data schema choices, doing both a live demo of the app and a walkthrough of the code.

Adding speech input and output
3 February, 2025 | 11:30 PM UTC | 3:30 PM PT
Register for the stream on Reactor

Our RAG solution includes optional features for speech input and output, powered either by the free browser APIs or by the powerful Azure Speech API. We also offer a tight integration with the VoiceRAG solution, for those of you who want a real-time voice interface. Learn about all the ways you can add speech to your RAG chat in this session!

Private deployment
5 February, 2025 | 11:30 PM UTC | 3:30 PM PT
Register for the stream on Reactor

To ensure that the RAG app can only be accessed within your enterprise network, you can deploy it to an Azure virtual network with private endpoints for each Azure service used. In this session, we'll show how to deploy the app to a virtual network that includes AI Search, OpenAI, Document Intelligence, and Blob storage. Then we'll log in to the virtual network using Azure Bastion with a virtual machine, to demonstrate that we can access the RAG app from inside the network, and only inside the network.

Evaluating RAG answer quality
10 February, 2025 | 11:30 PM UTC | 3:30 PM PT
Register for the stream on Reactor

How can you be sure that the RAG chat app's answers are accurate, clear, and well formatted? Evaluation! In this session, we'll show you how to generate synthetic data and run bulk evaluations on your RAG app, using the azure-ai-evaluation SDK. Learn about GPT metrics like groundedness and fluency, and custom metrics like citation matching. Plus, discover how you can run evaluations in CI/CD, to easily verify that new changes don't introduce quality regressions.

Monitoring and tracing LLM calls
12 February, 2025 | 11:30 PM UTC | 3:30 PM PT
Register for the stream on Reactor

When your RAG app is in production, observability is crucial. You need to know about performance issues, runtime errors, and LLM-specific issues like Content Safety filter violations. In this session, learn how to use Azure Monitor along with the OpenTelemetry SDKs to monitor the RAG application.