Knowledge mining is a technique for extracting insights from structured and unstructured data. In this context, Azure Search is the standard Microsoft knowledge mining service: it uses AI to create metadata about images, relational databases, and textual data, providing a web-like search experience. Audio is a data type that matters to companies in every industry, because recordings contain customer and business information. In this article you will learn how to combine these two separate worlds into a single search solution.
We can turn Azure Search into Azure Cognitive Search by adding an enrichment pipeline that creates metadata about images and text files. The pipeline pushes that metadata into a search index, which contains pointers to the original data at its physical location on Azure. The supported data types are Microsoft Office documents, PDF, JSON, XML, and image formats such as PNG and JPG. Images may contain a scene, which will be tagged and described, or text, which will be extracted. The text in the images may also be handwritten, as in a form or meeting notes, both common situations in business.
But audio data isn’t supported, and a company may also want to search the content of podcasts, meeting recordings, call center phone calls, and so on. One possible solution is a custom skill: after the regular Cognitive Search pipeline for your standard supported documents, a second skillset is necessary. It would extract only physical metadata from the file, and the custom skill would do the job. But today, August 2019, this approach has limitations on processing time, file sizes, and more. It will be addressed in a future blog post.
Another possible solution to this challenge has two steps, pre-processing and a link-through. It allows users not only to search the company’s recordings but also to have the “click-to-open” experience, as in Microsoft’s JFK or Azure Search Accelerator demos. The diagram below shows how it works.
Figure 1: The traditional Cognitive Search diagram, with the addition, in grey, of the required audio files pre-processing step
In this first step, we need to extract textual data from the audio files before the Azure Cognitive Search enrichment pipeline runs. The process feeds the pipeline with JSON data containing the transcriptions of the audio files. This content is merged with the content of the other files just before the pipeline’s AI analysis, such as entity or key phrase extraction.
The same idea appears in the speech support for a bot, with the help of the Microsoft Speech SDK. The SDK converts the audio to text, and then you can integrate your bot with other Azure Cognitive Services, like LUIS for language understanding, Text Analytics for information extraction, or QnA Maker for knowledge bases. All of these cognitive services are REST APIs that expect textual input.
To implement this pre-processing, we will use the Microsoft Speech to Text API, a cognitive service that does the transformation we need to mine knowledge from business audio. This API offers a range of capabilities you can embed into your apps to support various transcription scenarios, including conversation transcription, speech transcription, and custom speech transcription. The location of the audio files is passed as a parameter of the API call, and the service accesses the files directly.
The code below lets you submit your audio files to the API and get the transcriptions back as JSON files, each named after the original recording file. This naming is key to allowing the application to offer the “click-to-open” experience, which you will see in the next step of the solution. To access this code and the sample files, click here for a GitHub lab that has all the details necessary to accomplish this task.
Figure 2: The script code to call the REST API for your audio files and get the transcription back.
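The naming convention described above can be sketched in a few lines of Python. This is only an illustration, not the lab's script: the function names and the `audio_file` and `content` field names are hypothetical, and the transcription text is assumed to have already been returned by the Speech to Text API.

```python
import json
from pathlib import Path


def transcription_path(audio_path, out_dir):
    """Return the JSON path that mirrors the recording's file name."""
    # "recordings/call-001.wav" -> "<out_dir>/call-001.json"
    return Path(out_dir) / (Path(audio_path).stem + ".json")


def save_transcription(audio_path, text, out_dir):
    """Write one transcription as a JSON file named after its recording.

    The field names below are assumptions made for this sketch; use
    whatever fields your index definition expects.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = transcription_path(audio_path, out_dir)
    document = {
        "audio_file": Path(audio_path).name,  # original recording name
        "content": text,                      # transcription text
    }
    target.write_text(
        json.dumps(document, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return target
```

Because the JSON file keeps the recording's base name, the application can later map a search hit on the transcription back to the audio file it came from.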
For enterprise scenarios, you need to use Batch Transcription. It is ideal for call centers, which typically accumulate thousands of hours of audio, and it makes it easy to transcribe large volumes of recordings. By using the dedicated REST API, you can point to audio files with a shared access signature (SAS) URI and asynchronously receive transcriptions. It requires the S0 (Standard) tier of the Speech Service and lets you process file types like WAV and MP3, without data volume limitations.
The Cognitive Services team published a great blog post on how to use the Text Analytics API on call center recordings. They used a similar architecture, focusing on Power BI dashboards for sentiment analysis and key phrases instead of Cognitive Search. An interesting approach is to combine both designs to create the best solution for your needs.
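As a rough sketch of what a Batch Transcription submission might look like, the Python below builds the HTTP request without sending it. The endpoint path and the `contentUrls`, `locale`, and `displayName` fields are assumptions based on the v3.0 shape of the REST API; the API version available to you may use different names, so check the reference before relying on them. The function name and default values are hypothetical.

```python
import json
import urllib.request


def build_batch_request(region, subscription_key, sas_urls,
                        locale="en-US", name="call-center-batch"):
    """Build (but do not send) a Batch Transcription creation request.

    Assumption: the URL and body follow the v3.0 REST API shape.
    """
    url = (f"https://{region}.api.cognitive.microsoft.com"
           "/speechtotext/v3.0/transcriptions")
    body = {
        "contentUrls": list(sas_urls),  # SAS URIs of the audio files
        "locale": locale,
        "displayName": name,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` would start the job. Since the service works asynchronously, the caller would then poll for the results rather than wait on the initial call.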
Now that you have the transcriptions, if you upload the JSON data to the same location as the rest of your dataset, it will go through all the AI processes you defined for your dataset. The created metadata will be available for search in an inverted index, the data structure behind Azure Search’s incredible performance. But the enrichment pipeline will index the JSON file, not the original audio. That means that, in a click-to-open solution, the user will read the JSON file text instead of listening to the original recording. Here are the required actions to fix it:
Figure 3: The body definition of your document update API call.
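One way to picture such a document update is the sketch below, which builds a merge body that points an indexed transcription at its original recording. The `@search.action` value `merge` and the top-level `value` array are part of the Azure Search document indexing API; the key field `id` and the field `audio_url` are hypothetical names and would have to match your own index definition.

```python
import json


def build_merge_body(doc_key, audio_url, key_field="id", url_field="audio_url"):
    """Build the body of a document update (merge) call.

    "merge" updates only the listed fields of an existing document and
    leaves the enriched metadata untouched. The field names are
    assumptions made for this sketch.
    """
    return {
        "value": [
            {
                "@search.action": "merge",
                key_field: doc_key,    # key of the indexed JSON document
                url_field: audio_url,  # link to the original recording
            }
        ]
    }


if __name__ == "__main__":
    body = build_merge_body(
        "call-001",
        "https://example.blob.core.windows.net/audio/call-001.wav",
    )
    # This JSON would be POSTed to the index's docs/index endpoint.
    print(json.dumps(body, indent=2))
```

With the link stored on the document, the application can open the recording from a search result instead of showing the transcription file.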
This blog post shows how powerful pre-processing is for knowledge mining scenarios. The same process used here for audio files can be applied, along with the Video Indexer API, to video files. You extend the impact and reach of the knowledge mining solution to new data types that the original product can’t yet handle. The same data enrichment is applied to audio and video, allowing a complete search experience for users. For industry-specific vocabularies, you may want to use custom language models, making your transcriptions even more effective. Stay tuned: this blog channel has a long list of AI solutions to be published.