This week we’re releasing raw media audio access as a public preview feature in the Calling JavaScript SDK for web browsers (1.6.1-beta). In this blog post we’ll discuss potential uses of this feature and set you up with an illuminating sample that translates audio in real time. Azure Communication Services real-time voice and video calling can be used in a wide range of scenarios:
In all of these scenarios, your app may benefit from a wide ecosystem of cloud services that analyze audio streams in real-time and provide useful data. Two Microsoft examples are Nuance and Azure Cognitive Services. Nuance can monitor virtual medical appointments and help clinicians with post-visit notes. And Azure Cognitive Services can provide real-time transcription and translation capabilities.
When you use Azure Communication Services, you enjoy a secure, auto-scaling communication platform with complete control over your data. Raw media access is a practical realization of this promise: it lets your web clients pull audio from a call and inject audio into a call, all in real time, through a socket-like interface. Raw media access is just one of several ways to connect to the Azure Communication Services calling data plane with the client SDKs. You can also use the service SDKs, get phone numbers from Azure or bring your own carrier, connect Power Virtual Agents, and connect Teams.
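In code, the two building blocks look roughly like this (a minimal sketch that uses the same beta APIs the sample below relies on; `call` is assumed to be an established call object from the Calling SDK):

import { LocalAudioStream } from '@azure/communication-calling';

// Pull audio out of the call: read the remote participants' audio stream.
const remoteAudio = call.remoteAudioStreams[0];

// Inject audio into the call: route a Web Audio destination node in as a LocalAudioStream.
const audioCtx = new AudioContext();
const destinationNode = audioCtx.createMediaStreamDestination();
call.startAudio(new LocalAudioStream(destinationNode));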
You can learn more about raw media access by checking out the concept and quickstart documentation. But the rest of this post will discuss a sample we built using Azure Cognitive Services.
Imagine an end user in a call who only understands French, while everyone else on the call is speaking English. We can do two things to help this French-speaking user communicate with the group:
1. Translate the incoming English audio from the call into French and play it to the user.
2. Translate the user’s spoken French into English and inject the synthesized audio into the call.
This flow is diagrammed below.
To get started, you’re going to want to get basic Web Calling up and running. We’ve built this Cognitive Services sample as a branch of the web calling tutorial sample hosted on GitHub:
git clone https://github.com/Azure-Samples/communication-services-web-calling-tutorial.git
Navigate to the project folder, then install and start the application on your local machine:
npm install && npm run start
Once your tutorial sample application is up and running, stop it and check out the speech translation branch:
git checkout azure_communication_services_cognitive_services_speech_to_speech_translation
Install dependencies and run this branch with translation:
npm install && npm run start
Your app will need an Azure Cognitive Services Speech resource for translation. Follow these directions to create an Azure Speech resource and find its key and region. Then add your Azure Speech key and region to the `config.json` file:
{
    "connectionString": "REPLACE_WITH_CONNECTION_STRING",
    "cognitiveServiceSpeechKey": "REPLACE_WITH_COGNITIVE_SPEECH_KEY",
    "cognitiveServicesRegion": "REPLACE_WITH_COGNITIVE_SPEECH_REGION"
}
Azure Cognitive Services supports 40+ languages for speech-to-speech translation. We’re going to pick French and English with the code below. This is all in MakeCall.js.
// To start a translation, we need to set up the speech translation config.
let speechConfigOutput = SpeechTranslationConfig.fromSubscription(window.TutorialSpeechConfig.key, window.TutorialSpeechConfig.region);
// Set up the spoken language, target language, synthesis language, voice name, and output format in speechConfigOutput.
// Supported languages are listed at https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/language-support
speechConfigOutput.speechRecognitionLanguage = this.state.spokenLanguage;
speechConfigOutput.addTargetLanguage(configObj.lang);
speechConfigOutput.speechSynthesisLanguage = configObj.lang;
speechConfigOutput.voiceName = configObj.voiceName;
speechConfigOutput.speechSynthesisVoiceName = configObj.voiceName;
speechConfigOutput.speechSynthesisOutputFormat = SpeechSynthesisOutputFormat.Ogg48Khz16BitMonoOpus;
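For illustration only (these values are not taken from the sample): to recognize French speech and synthesize an English translation, the fields referenced above might hold values like the following, where the specific neural voice is just one possible choice.

// Hypothetical example values, not part of the sample branch.
const spokenLanguage = 'fr-FR';      // what this.state.spokenLanguage might hold
const configObj = {
    lang: 'en',                      // translation target language
    voiceName: 'en-US-JennyNeural'   // Azure neural voice used to synthesize the translation
};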
There are two key real-time operations in the sample. In the first flow, we take the incoming audio from the ACS call via the available `remoteAudioStreams` and connect it to a TranslationRecognizer, which translates that audio into the selected language.
/************************************************/
/* Speaker Translation */
/************************************************/
// Get the remote audio stream (what would normally play through the speaker) from the call.
const rawAudioSenderStream = this.call.remoteAudioStreams[0];
const audioConfig = AudioConfig.fromStreamInput(rawAudioSenderStream);
const audioCtx = new AudioContext();
this.recognizerForInput = new TranslationRecognizer(translationConfig, audioConfig);
// The recognizerForInput object exposes a synthesizing event.
// The event fires several times and provides a mechanism to retrieve the synthesized audio from the translation recognition result.
this.recognizerForInput.synthesizing = (s, e) => {
    if (e.result.audio && audioCtx) {
        var source = audioCtx.createBufferSource();
        audioCtx.decodeAudioData(e.result.audio, (newBuffer) => {
            source.buffer = newBuffer;
            source.connect(audioCtx.destination);
            source.start(0);
        });
    }
};
// Mute the call’s speaker output so the user hears only the translated audio, not the original speech.
this.call.tsCall.muteSpeaker();
// Start the continuous recognition/translation operation.
this.recognizerForInput.startContinuousRecognitionAsync();
The next flow takes the local user’s microphone input, translates that audio using another TranslationRecognizer, and injects the synthesized result into the ACS call using a `LocalAudioStream`:
/************************************************/
/* Microphone Translation */
/************************************************/
// Capture audio from the default microphone.
let audioConfigInput = AudioConfig.fromDefaultMicrophoneInput();
// Create a recognizer that translates the microphone audio using the speech config set up earlier.
this.recognizerForInput = new TranslationRecognizer(speechConfigOutput, audioConfigInput);
// Create a Web Audio destination node, wrap it in a LocalAudioStream, and inject it into the call.
let soundContext = new AudioContext();
const destinationStream = soundContext.createMediaStreamDestination();
const localAudioStream = new LocalAudioStream(destinationStream);
this.call.startAudio(localAudioStream);
// The recognizerForInput object exposes a synthesizing event.
// The event fires several times and provides a mechanism to retrieve the synthesized audio from the translation recognition result.
this.recognizerForInput.synthesizing = (s, e) => {
    var audioSize = e.result.audio === undefined ? 0 : e.result.audio.byteLength;
    if (e.result.audio && soundContext) {
        var source = soundContext.createBufferSource();
        soundContext.decodeAudioData(e.result.audio, function (newBuffer) {
            source.buffer = newBuffer;
            source.connect(destinationStream);
            source.start(0);
        });
    }
};
// Start the continuous recognition/translation operation.
this.recognizerForInput.startContinuousRecognitionAsync();
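The snippets above don’t show teardown. When the call ends, you’ll likely want to stop the recognizers and release the audio resources; a minimal cleanup sketch (assuming you keep references to the recognizer and AudioContext created above) might look like this:

// Hypothetical cleanup, not part of the sample branch: stop continuous translation,
// release the recognizer, and close the Web Audio context.
this.recognizerForInput.stopContinuousRecognitionAsync(
    () => {
        this.recognizerForInput.close();
    },
    (err) => {
        console.error('Failed to stop recognition:', err);
    }
);
soundContext.close();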
We hope this new feature will enable productive extensions and experimentation in your voice calling applications. And yes – we are working on enabling this same raw media access capability for the Calling iOS and Android SDKs. Follow this blog and hit up our GitHub page to get the latest updates from the Azure Communication Services team!