Satin: Microsoft’s latest AI-powered audio codec for real-time communications
Published Feb 17 2021 08:00 AM
Microsoft

Jigar Dani, Principal PM Manager, Microsoft
Sriram Srinivasan, Principal Software Engineering Manager, Microsoft

 

Over a decade ago, Skype invented the Silk audio codec to transmit speech over the internet, and it catalyzed the voice over internet protocol (VoIP) industry. The primary codec used in VoIP at the time was G.722, which required 64 kbps to transmit wideband (16 kHz) speech. Silk, on the other hand, offered wideband quality starting at just 14 kbps. Additionally, Silk was an adaptive variable bitrate codec that seamlessly switched from delivering narrowband (8 kHz) speech at an ultra-low bandwidth of 6 kbps to offering near-transparent speech quality at higher bitrates. This was critical for the dial-up and limited broadband internet available at that time, and Silk served us well as the default codec for Skype and Microsoft Teams. Silk is also the basis of the voice mode in the Opus codec, one of the default WebRTC codecs.


As we enter a new decade, users can now choose from several high-end connectivity alternatives such as high-speed broadband, optical fiber, and 5G. Yet, large segments of Microsoft’s user base are still limited to low cable internet speeds or slower 3G and 4G cellular networks. They often experience situations with over 50% packet loss and sporadic loss of coverage when moving between cell towers, commuting, or switching between network types. Network availability can even be unpredictable in their homes where many share bandwidth with others who are working and learning remotely. After all these years, it turns out that utilization of available bitrate is every bit as important today as it was in the dial-up world. Any bitrate savings can be used to provide additional resiliency and improve experiences on other workloads like modern video or content sharing.


Our challenge is to deliver a virtual voice experience that’s as good as talking in person even over ultra-low bandwidth and in highly constrained network conditions. To truly serve our customers, we know they need to be able to communicate and collaborate on the go, on all device types, over any network, in every environment.


That’s why we’re excited to share the details of our new AI-powered audio codec named Satin. Satin can deliver super wideband speech starting at a bitrate of 6 kbps, and full-band stereo music starting at a bitrate of 17 kbps, with progressively higher quality at higher bitrates. Satin has been designed to provide great audio quality even under high packet loss. In addition, its great quality at low bitrates allows us to use more of the available bandwidth to provide better resiliency to packet loss. We have also recently improved our redundancy algorithms to provide better protection under burst loss. Here is the net effect of our improved resiliency algorithms and the new Satin codec (please use your favorite headset to hear the two audio files).

 

Silk at 6 kbps, burst packet loss (additional 6 kbps for redundancy):

Satin at 6 kbps, burst packet loss, improved redundancy algorithm (additional 6 kbps for redundancy):

 

Our team built this new codec by combining decades of algorithmic experience and advanced machine learning techniques. Let’s take a deeper dive into how Satin works.


What’s narrowband, wideband, and super wideband voice?
Our ear can generally perceive sounds that range in frequency from 20 Hz to 20 kHz. When dealing with discrete time signals, we need to sample the audio waveform at a minimum of twice the highest frequency we wish to reproduce. This is generally why CD-quality music is sampled at 44.1 kHz (44100 samples per second) or 48 kHz. Early telephony systems used a sampling rate of 8 kHz and could reproduce frequencies up to 4 kHz (in practice up to 3.4 kHz), which was considered sufficient at the time for speech communication. While a lower sampling rate implies fewer bits per second to transmit over the wire, it resulted in the all too familiar tinny voice quality over the phone as the higher vocal frequencies present in natural speech could not be reproduced. VoIP solutions, which were no longer limited by the narrowband telephony infrastructure, introduced us to the magic of wideband speech (reproduce up to 8 kHz, sampled at 16 kHz) and users were immediately able to appreciate the crisper, more natural and intelligible sound.
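The sampling rule above is simple arithmetic, so here is a minimal Python sketch of it. The function names are ours, chosen for illustration. The rates are the ones mentioned in the text.

```python
# Sampling rates mentioned in the text, in Hz.
BANDS = [
    ("narrowband telephony", 8_000),
    ("wideband", 16_000),
    ("CD-quality music", 44_100),
    ("full band", 48_000),
]


def nyquist_limit_hz(sample_rate_hz: float) -> float:
    """Highest frequency that can be represented at a given sampling rate."""
    return sample_rate_hz / 2


def min_sample_rate_hz(max_frequency_hz: float) -> float:
    """Minimum sampling rate needed to reproduce a given highest frequency."""
    return 2 * max_frequency_hz


if __name__ == "__main__":
    for label, rate in BANDS:
        limit = nyquist_limit_hz(rate)
        print(f"{label}: sampled at {rate} Hz, reproduces up to {limit:.0f} Hz")
    # Covering the full 20 kHz range of hearing needs at least 40 kHz,
    # which is why 44.1 kHz and 48 kHz are common choices for music.
    print(min_sample_rate_hz(20_000))
```

In practice the usable bandwidth is a little below the theoretical limit (3.4 kHz rather than 4 kHz for telephony, as noted above), because real anti-aliasing filters need some room to roll off.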


Silk took this a step further with the introduction of super wideband voice, capturing frequencies up to 12 kHz, sampled at 24 kHz (energy drops off rapidly at frequencies above 12 kHz for human voice). As mentioned earlier, higher sampling rates imply a higher bitrate. Satin re-defines super wideband to cover frequencies up to 16 kHz (sampled at 32 kHz) for greater clarity and sibilance, and its efficient compression enables super wideband voice at 6 kbps.
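To get a feel for how much compression 6 kbps implies at a 32 kHz sampling rate, here is a rough back-of-the-envelope sketch. It assumes uncompressed 16-bit mono PCM as the reference and a 20 ms packet duration. Both are common conventions that we picked for illustration, not figures stated for Satin.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int = 16, channels: int = 1) -> float:
    """Bitrate of uncompressed PCM audio in kbps (16-bit mono is an assumption)."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def compression_ratio(sample_rate_hz: int, coded_kbps: float, bits_per_sample: int = 16) -> float:
    """How many times smaller the coded stream is than the raw PCM stream."""
    return pcm_bitrate_kbps(sample_rate_hz, bits_per_sample) / coded_kbps


def payload_bytes_per_frame(coded_kbps: float, frame_ms: float = 20) -> float:
    """Codec payload per packet, assuming one frame of frame_ms per packet."""
    return coded_kbps * 1000 * frame_ms / 1000 / 8


if __name__ == "__main__":
    raw = pcm_bitrate_kbps(32_000)
    ratio = compression_ratio(32_000, 6)
    print(f"Raw 32 kHz PCM: {raw:.0f} kbps, ratio at 6 kbps: {ratio:.1f}x")
    print(f"Payload per 20 ms frame at 6 kbps: {payload_bytes_per_frame(6):.0f} bytes")
```

Under these assumptions, a 6 kbps stream carries roughly 85 times fewer bits than the raw signal, and each packet has only about 15 bytes of codec payload to work with.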

Frequency components of the sound /t/ in the word “suit.” There is a significant amount of energy well beyond the narrowband cutoff of 4 kHz and even the wideband cutoff of 8 kHz. Preserving energy in the higher spectral components results in more natural sounding speech.

 

Listen to these two samples below on your headphones. The Satin super wideband speech sample sounds a lot more natural and intelligible, much like what you hear when you are talking to someone in person.

 

Silk narrowband at 6 kbps:

Satin super wideband at 6 kbps:

 

How do you achieve super wideband at 6 kbps?
To achieve super wideband quality at 6 kbps, Satin uses a deep understanding of speech production, modelling and psychoacoustics to extract and encode a sparse representation of the signal. To further reduce the required bitrate, Satin only encodes and transmits certain parameters in the lower frequency bands. At the decoder, Satin uses deep neural networks to estimate the high band parameters from the received low band parameters, and a minimal amount of side information sent over the wire.
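The data flow at the decoder can be pictured with a toy sketch. This is not Satin's model: the parameter counts, layer sizes, and function names below are hypothetical, and the weights are random rather than trained. It only shows the shape of the idea, where received low-band parameters plus a little side information go in and estimated high-band parameters come out.

```python
import math
import random

# All sizes are hypothetical and chosen only for illustration.
N_LOW = 8      # low-band parameters received per frame
N_SIDE = 2     # side-information values received per frame
N_HIGH = 4     # high-band parameters estimated at the decoder
N_HIDDEN = 16  # hidden units in the toy network


def make_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases, standing in for a trained layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activation=None):
    """Fully connected layer: activation(W @ x + b)."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out


def estimate_high_band(low_band, side_info, model):
    """Estimate high-band parameters from low-band parameters and side information."""
    hidden_layer, output_layer = model
    features = list(low_band) + list(side_info)
    hidden = dense(features, hidden_layer, activation=math.tanh)
    return dense(hidden, output_layer)


rng = random.Random(0)
model = (make_layer(N_LOW + N_SIDE, N_HIDDEN, rng), make_layer(N_HIDDEN, N_HIGH, rng))
low_band = [rng.random() for _ in range(N_LOW)]
side_info = [rng.random() for _ in range(N_SIDE)]
high_band = estimate_high_band(low_band, side_info, model)
print(f"{N_LOW + N_SIDE} received values -> {len(high_band)} estimated high-band values")
```

The bitrate saving comes from the fact that only the low-band parameters and the side information cross the network. The high-band values are computed locally at the receiver.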


While this approach solved the primary challenge of reproducing super wideband voice at ultra-low bitrates, it introduced a new challenge of computational complexity. The analysis of the input speech signal to extract a low dimensional representation is computationally intensive. Real-time inference on deep neural networks adds even more complexity. To solve this, the team focused on both algorithmic optimizations and techniques like loop vectorization beyond what the compiler could achieve. This achieved a nearly 40% reduction in computational complexity and allowed us to run on all our users’ devices.

Satin Quality.png

 

As with all new features, we A/B tested Satin before rolling it out widely, both to ensure there were no regressions and to quantify the positive impact for our users. The A/B tests showed a statistically significant increase in call duration for Satin compared to Silk at these low bitrates. Offline, crowdsourced subjective tests to evaluate codec quality at 6 kbps showed the mean opinion score (MOS) rating of Satin to be 1.7 MOS higher than that of Silk.


How resilient is Satin to packet loss?
The majority of calls are on Wi-Fi and mobile networks, where packet loss is common and can adversely affect call quality. Satin is uniquely positioned to compensate for this. Unlike most other voice codecs, Satin encodes each packet independently, so the effect of losing one packet does not affect the quality of subsequent packets. The codec is also designed to facilitate high quality packet loss concealment in an internal parametric domain. These features help Satin seamlessly handle random losses where one or two packets are lost at a time.
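A toy model helps show why independent packets matter. In the sketch below, `recovery_frames` is a hypothetical knob for how many frames after a loss are still degraded because the decoder state is stale. Setting it to 0 models independently coded packets. Real codecs behave in more nuanced ways, so treat this as an illustration only.

```python
def affected_frames(lost, total, recovery_frames):
    """Frames that are either lost or decoded with stale decoder state.

    lost:            indices of lost packets (one frame per packet)
    total:           number of frames in the call segment
    recovery_frames: hypothetical number of frames after each loss that are
                     still degraded; 0 models independently coded packets
    """
    affected = set()
    for i in lost:
        for j in range(i, min(total, i + 1 + recovery_frames)):
            affected.add(j)
    return sorted(affected)


if __name__ == "__main__":
    lost = [3, 10]
    print("independent packets:", affected_frames(lost, 20, recovery_frames=0))
    print("inter-packet dependency:", affected_frames(lost, 20, recovery_frames=2))
```

With independent packets, the damage is confined to the packets that were actually lost, which leaves packet loss concealment with a much smaller gap to fill.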


Another type of packet loss, which is even more detrimental to perceived quality, is when several packets are lost in a burst. Here, Satin’s ability to deliver great audio at a low rate of 6 kbps provides the flexibility to use some of the available bitrate to add redundancy and forward error correction to quickly recover from these situations. Satin does this without compromising overall audio quality.
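To illustrate how redundancy helps with bursts, here is a minimal sketch of a generic piggyback scheme, in which each packet carries its own frame plus a copy of an earlier frame. This is not Satin's actual redundancy algorithm, which is not described here. The `offset` parameter and the function names are assumptions made for illustration.

```python
def recoverable_frames(total, lost_packets, offset):
    """Frames available at the receiver.

    Packet n carries frame n plus a redundant copy of frame n - offset.
    """
    received = set(range(total)) - set(lost_packets)
    frames = set()
    for n in received:
        frames.add(n)               # primary frame
        if n - offset >= 0:
            frames.add(n - offset)  # redundant copy of an earlier frame
    return frames


def missing_frames(total, lost_packets, offset):
    """Frames that cannot be recovered and must be concealed instead."""
    return sorted(set(range(total)) - recoverable_frames(total, lost_packets, offset))


if __name__ == "__main__":
    burst = [5, 6, 7]  # three consecutive packets lost
    for offset in (1, 3):
        print(f"offset {offset}: missing frames {missing_frames(20, burst, offset)}")
```

In this scheme, a redundant copy that trails by at least the burst length lets the receiver recover every frame in the burst, at the cost of waiting longer for the recovered audio. With a 6 kbps primary stream, a full redundant copy adds another 6 kbps, which matches the audio samples earlier in this post.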

 

Satin is already being used for all Teams and Skype two-party calls and will roll out for Teams meetings soon. It currently operates in wideband voice mode within a bitrate range of 6 – 36 kbps and will be extended to support full-band stereo music at a maximum sampling rate of 48 kHz in the near future. We are very excited for you to try this new codec and let us know what you think.

 

Subscribe to Teams Engineering Blog RSS feed to stay in touch with the latest innovations from Teams.


Want to work on the team that builds bleeding-edge AI technology? See AI Jobs in M365 Intelligent Conversations and Communications Cloud Team

15 Comments
Iron Contributor

Thank you for sharing.

Deleted
Not applicable

This article makes me feel 'old' because back in the late 70's I worked on the "Early telephony systems" referenced above ;)  Kudos to the person who proposed Satin as the codec name!

 
Copper Contributor

Is there also a statement with regard to latency? We very often experience people starting to talk simultaneously. Will latency go down significantly?

Copper Contributor

@jigardani san
Does Satin support PSTN calls, including Direct Routing and Calling Plan?

Microsoft

@Oliver74 Thank you for your feedback. The codec change itself does not improve latency. Have you observed a specific pattern when people are talking over each other? Specific ISPs, devices, constrained network conditions?

@Akihiro_Kawamura Satin is not currently supported for PSTN calls. 

Iron Contributor

Will Teams Direct Routing SBC vendors be expected to support this codec too... ?

Copper Contributor

Is the codec available for Teams on Linux? 

 

How do we know when the codec is used? 

Copper Contributor

Will Azure Communication Services get the Satin codec as well?

Copper Contributor

I hope Skype implements this for PSTN calls as well. I have paid subscriptions, and I can call people fine over all other services, but when I call phones via Skype, people always complain that they cannot hear me. I don't get why any other service - PSTN or not - is fine, but my favorite communication tools, for which I pay a hefty fee, have the worst PSTN calling quality!

Microsoft

@billybraga Yes, it will also work on Linux. Use of the Satin codec is transparent to the user.

Brass Contributor

My shoppe is cutting over to Teams only in advance of retirement of Skype for Business Online. Will Satin be enabled for Teams Meetings before July when the plug is pulled on SFBO?

Copper Contributor

Well done, a very clear and concise writeup. I've been looking forward to the Satin codec for some time now.

Copper Contributor

Good report with some detail. But what "AI" (maybe with some cloud smell) is really presented here? 14 kbps (10 years ago) was already a really low bitrate. Now it has been cut in half. But as we all know, lost data points have different consequences during transmission. Over wireless links in particular, time jitter and hopping between nodes affect audio quality a lot. Those are unintended effects. So a conversational channel has two kinds of requirements: wide bandwidth in the frequency domain and low jitter in the time domain. Jitter and network blocking really matter. For movie transmission, there is more room on the video side. I am also not sure how much power the codec consumes. On a mobile device, overly complicated computation means less talk time because of the battery.

Brass Contributor

Thanks for sharing. What's the maximum packet loss that Satin can tolerate?

Copper Contributor

It is 11/2021, and some of our Teams Meetings are still using the SILKWide and not SATIN Audio Codec here.
It appears to be dependent on who started the recurring meeting (their bandwidth, computing capacity, .. ?)

Last update: Aug 24 2022 05:02 PM