Azure Cognitive Services: Speech API's [Azure AI Applied Services : Part 3]

Kruti_Mehta

Microsoft

Jun 21, 2022

Overview

This is a follow-up blog to

Which AI am I ? [Azure AI Applied Services : Part 1]
Azure Cognitive Services: Vision API's [Azure AI Applied Services : Part 2]

In this blog we discuss in detail the applications of the Speech API services, with the help of flow charts and graphs. It helps to be clear up front about what you wish to achieve by analyzing speech or text.

Azure Cognitive Services provides Speech APIs and Language APIs, which often overlap in the functionality they offer.

Speech APIs - Assist in spoken language transformations

Language APIs - Understand conversations and unstructured text

The key differentiator when choosing between the two is the intent of your use case. If you only wish to transform the format, either in real time or in batches, it's recommended to go with the Speech services approach. If you wish to gain deeper insights through detailed analysis of either spoken or written language (Transform+Analyze+Filter), it's recommended to go with the Language services approach.

Speech APIs

Speech APIs should be leveraged if you wish to do basic transformation between text-to-speech, speech-to-text and speech-to-speech (transformation), with basic recognition of language, intent, keywords and speakers. Language support varies by Speech service functionality. The following tables summarize language support for the speech-to-text, text-to-speech, speech translation, and speaker recognition service offerings.

To use the Speech service, create one of the following resource types in your Azure subscription:

A Speech resource - choose this resource type if you only plan to use the Speech service, or if you want to manage access and billing for the resource separately from other services.
A Cognitive Services resource - choose this resource type if you plan to use the Speech service in combination with other cognitive services, and you want to manage access and billing for these services together.

The Speech service includes the following application programming interfaces (APIs):

Speech-To-Text - used to transcribe speech from an audio source to text format.
Text-To-Speech - used to generate spoken audio from a text source.
Speech-To-Speech (Speech Translation) - used to translate speech in one language to text or speech in another.

1) Speech-To-Text APIs

Speech-to-text from the Speech service, also known as speech recognition, enables real-time and batch (offline) transcription of audio streams into text.

The base model may not be sufficient if the audio contains ambient noise or includes a lot of industry and domain-specific jargon. In these cases, you can create and train custom speech models with acoustic, language, and pronunciation data.

The recognized words are typically converted to text, which you can use for various purposes, such as:

Providing closed captions for recorded or live videos
Creating a transcript of a phone call or meeting
Automated note dictation
Determining intended user input for further processing
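
To make the speech-to-text flow concrete, here is a minimal sketch that uses only the Python standard library against the REST API for short audio. The endpoint path, header names and response fields reflect my understanding of that REST API and should be checked against the current documentation; the function names are hypothetical, and the audio is assumed to be a short 16 kHz PCM WAV clip.

```python
import json
import urllib.parse
import urllib.request


def build_stt_request(wav_bytes: bytes, key: str, region: str,
                      language: str = "en-US") -> urllib.request.Request:
    """Build (but do not send) a short-audio speech-to-text request."""
    query = urllib.parse.urlencode({"language": language, "format": "simple"})
    # Assumed endpoint pattern for the short-audio REST API.
    url = (f"https://{region}.stt.speech.microsoft.com"
           f"/speech/recognition/conversation/cognitiveservices/v1?{query}")
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # key of the Speech or Cognitive Services resource
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=wav_bytes, headers=headers, method="POST")


def extract_transcript(response_body: str) -> str:
    """Pull the recognized text out of a 'simple' format JSON response."""
    payload = json.loads(response_body)
    status = payload.get("RecognitionStatus")
    if status != "Success":
        raise RuntimeError(f"Recognition failed: {status}")
    return payload["DisplayText"]


def transcribe_wav(path: str, key: str, region: str) -> str:
    """Send a WAV file to the service and return the transcript (network call)."""
    with open(path, "rb") as audio_file:
        request = build_stt_request(audio_file.read(), key, region)
    with urllib.request.urlopen(request) as response:
        return extract_transcript(response.read().decode("utf-8"))
```

Calling transcribe_wav("meeting.wav", key, region) with your own resource key and region would return the text for a short clip. Longer recordings or live captioning are better served by the Speech SDK or batch transcription.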

2) Text-To-Speech APIs

Text-to-speech from the Speech service, also known as speech synthesis, enables your applications, tools, or devices to convert text into human-like synthesized speech. It's powered by deep neural networks. Use human-like prebuilt neural voices out of the box, or create a custom neural voice that's unique to your product or brand. You can use the Speech Synthesis Markup Language (SSML) to fine-tune the pitch, pronunciation, speaking rate, volume, and more.
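
As an illustration of what SSML fine-tuning looks like, the sketch below builds a small SSML document with the Python standard library. The helper name build_ssml is hypothetical, and en-US-JennyNeural is used only as an example of a prebuilt neural voice; substitute any voice available to your resource.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # W3C SSML namespace


def build_ssml(text: str, voice: str = "en-US-JennyNeural", lang: str = "en-US",
               rate: str = "medium", pitch: str = "medium",
               volume: str = "medium") -> str:
    """Return an SSML string that wraps `text` in voice and prosody settings."""
    speak = ET.Element("speak", {"version": "1.0", "xmlns": SSML_NS, "xml:lang": lang})
    voice_el = ET.SubElement(speak, "voice", {"name": voice})
    # <prosody> carries the speaking rate, pitch and volume adjustments.
    prosody = ET.SubElement(voice_el, "prosody",
                            {"rate": rate, "pitch": pitch, "volume": volume})
    prosody.text = text  # ElementTree escapes &, < and > when serializing
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("The 9:15 train departs from platform 4.", rate="slow", volume="loud")
print(ssml)
```

Building the document with an XML library rather than string concatenation keeps user-supplied text from breaking the markup.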

You can use the output of speech synthesis for many purposes, including:

Generating spoken responses to user input.
Creating voice menus for telephone systems.
Reading email or text messages aloud in hands-free scenarios.
Broadcasting announcements in public locations, such as railway stations or airports.
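
For scenarios like these, the synthesized audio is usually saved or streamed by the calling application. The following is a minimal sketch of requesting audio through the text-to-speech REST endpoint with the standard library only. The endpoint pattern, header names and output format value are my understanding of that API and should be verified against the documentation; the function names and the User-Agent value are hypothetical.

```python
import urllib.request


def build_tts_request(ssml: str, key: str, region: str,
                      output_format: str = "riff-24khz-16bit-mono-pcm"
                      ) -> urllib.request.Request:
    """Build (but do not send) a text-to-speech request for an SSML document."""
    # Assumed endpoint pattern for the text-to-speech REST API.
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,  # WAV container, 24 kHz mono PCM
        "User-Agent": "speech-blog-sketch",         # placeholder application name
    }
    return urllib.request.Request(url, data=ssml.encode("utf-8"),
                                  headers=headers, method="POST")


def synthesize_to_file(ssml: str, key: str, region: str, out_path: str) -> None:
    """Send the SSML to the service and write the returned audio (network call)."""
    request = build_tts_request(ssml, key, region)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

A telephone menu prompt, for example, could be generated once with synthesize_to_file and then replayed from the saved WAV file.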

3) Speech-To-Speech APIs

The speech translation service enables real-time, multi-language speech-to-speech and speech-to-text translation of audio streams. Interim transcription and translation results are returned as speech is detected, and the final results can be converted into synthesized speech, so recognized speech can be translated and then synthesized in a different language (speech-to-speech). To use Speech Translation you must specify the target translation languages: at least one is required, but multiple targets are supported.
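
The target-language requirement can be sketched with the Speech SDK for Python (the azure-cognitiveservices-speech package, assumed to be installed). The helper names normalize_targets and translate_wav are hypothetical, and the SDK is imported inside the function so the validation helper can be used on its own. This sketch translates a single utterance from a WAV file and does not cover interim results or synthesis of the translated text.

```python
from typing import Dict, Iterable, List


def normalize_targets(targets: Iterable[str]) -> List[str]:
    """Strip, de-duplicate and validate the target language codes."""
    cleaned: List[str] = []
    for target in targets:
        code = target.strip()
        if code and code not in cleaned:
            cleaned.append(code)
    if not cleaned:
        raise ValueError("Speech translation needs at least one target language")
    return cleaned


def translate_wav(path: str, key: str, region: str,
                  source_language: str = "en-US",
                  targets: Iterable[str] = ("fr",)) -> Dict[str, str]:
    """Translate one utterance from a WAV file into each target language."""
    target_codes = normalize_targets(targets)
    # Imported lazily: requires the azure-cognitiveservices-speech package.
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.translation.SpeechTranslationConfig(subscription=key, region=region)
    config.speech_recognition_language = source_language
    for code in target_codes:
        config.add_target_language(code)  # one call per target language

    audio = speechsdk.audio.AudioConfig(filename=path)
    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=config, audio_config=audio)
    result = recognizer.recognize_once()
    if result.reason != speechsdk.ResultReason.TranslatedSpeech:
        raise RuntimeError(f"Translation failed: {result.reason}")
    return dict(result.translations)  # maps language code to translated text
```

For example, translate_wav("greeting.wav", key, region, targets=["fr", "de"]) would return a dictionary with one translated string per target language.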