A tenet of our ongoing efforts to improve the audio and video experiences in Microsoft Teams is situational optimization – understanding specific use cases and environments and enabling Teams to perform at its peak in those scenarios. One such scenario is to transmit live or pre-recorded music content during a Teams meeting or call. High-fidelity music mode and automatic music detection are new Teams features that optimize for music, to deliver clear sound at frequencies that extend beyond the normal range for speech.
Communication apps are frequently designed for meetings or one-on-one conversations in which most of the audio signals are speech. Transmitting high-quality speech at the lowest possible bitrate typically requires the use of high-efficiency speech codecs. While these codecs are suitable for their primary purpose, they can significantly limit the fidelity of non-speech signals. High-fidelity music mode in Teams offers superior sound clarity for a wide range of audio content including music, medical signals, and speech.
Superior speech quality in Teams
Traditional PSTN (Public Switched Telephone Network) landlines transmit speech in the frequency range from 300Hz to 3.4kHz. This narrow band makes it hard to hear the difference between sounds such as “S” and “F”. By contrast, speech codecs used in today’s telecommunication applications are typically designed for wideband audio, covering a frequency range of 60Hz to 8kHz, which significantly improves the intelligibility of speech compared to traditional phone calls over PSTN.
To carry speech signals with a bandwidth of 8kHz, the raw signal must be sampled at 16kHz (twice the bandwidth) with 16 bits per sample, which requires 256kbps to transmit uncompressed. A highly efficient speech codec can transmit speech at 16kbps or less. Recent efficiency improvements to the Teams audio codec make it possible to deliver quality sound at bitrates as low as 6kbps with minimal audible distortion.
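The arithmetic behind these figures can be checked with a short sketch. The function names here are illustrative only, and a single (mono) channel is assumed, which is what makes the 256kbps figure work out.

```python
def pcm_bitrate_bps(sample_rate_hz, bits_per_sample, channels=1):
    """Bitrate of uncompressed PCM audio in bits per second.

    channels=1 (mono) is an assumption; the article does not state it.
    """
    return sample_rate_hz * bits_per_sample * channels


def compression_ratio(raw_bps, coded_bps):
    """How many times smaller the coded stream is than the raw one."""
    return raw_bps / coded_bps


if __name__ == "__main__":
    # Wideband speech: 8kHz bandwidth needs a 16kHz sampling rate.
    raw = pcm_bitrate_bps(sample_rate_hz=16_000, bits_per_sample=16)
    print(f"raw wideband speech: {raw / 1000:.0f} kbps")

    # Codec bitrates quoted in the text above.
    for coded_kbps in (16, 6):
        ratio = compression_ratio(raw, coded_kbps * 1000)
        print(f"{coded_kbps} kbps codec: about {ratio:.1f}x smaller than raw")
```

Under these assumptions, a 16kbps codec carries the signal in one sixteenth of the raw bitrate, and a 6kbps codec in roughly one forty-third.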
Take audio beyond speech quality with High-fidelity music mode
High-efficiency codecs depend on speech model parameters that characterize the vocal tract and pitch of the speaker. Such models do not work well for non-speech signals such as music. As users share a growing variety of audio signals, including music lessons, songs played through other applications, or medical signals during a virtual appointment with a physician, it is increasingly important to provide high-fidelity options for transmitting audio other than speech.
High-fidelity music mode addresses the need to share these types of content in Teams by transmitting audio signals with a 32kHz sampling rate (16kHz bandwidth) at 128kbps, preserving fidelity while reducing the bitrate by 4x compared to lossless encoding. The optimized experience in Teams applies to signals captured by microphones as well as audio played while sharing an application or desktop. The result is significantly improved audio quality of music and other non-speech signals in Teams calls and meetings.
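As a rough check of the numbers above, the sketch below derives the audio bandwidth, the uncompressed bitrate, and the reduction factor for this mode. The `AudioMode` class is hypothetical, and the 16-bit mono sample format is an assumption; it is the format under which the quoted 4x reduction is consistent with a 128kbps stream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioMode:
    """Hypothetical description of an audio mode, for illustration only."""
    name: str
    sample_rate_hz: int
    coded_bitrate_bps: int
    bits_per_sample: int = 16  # assumption, not stated in the article
    channels: int = 1          # assumption, not stated in the article

    @property
    def bandwidth_hz(self):
        # Nyquist: representable bandwidth is half the sampling rate.
        return self.sample_rate_hz / 2

    @property
    def raw_bitrate_bps(self):
        return self.sample_rate_hz * self.bits_per_sample * self.channels

    @property
    def reduction_factor(self):
        return self.raw_bitrate_bps / self.coded_bitrate_bps

    @property
    def megabytes_per_minute(self):
        # Coded bits per minute, converted to bytes, then to megabytes.
        return self.coded_bitrate_bps * 60 / 8 / 1_000_000


music = AudioMode("High-fidelity music mode", 32_000, 128_000)

if __name__ == "__main__":
    print(f"{music.name}: {music.bandwidth_hz / 1000:.0f} kHz bandwidth")
    print(f"raw {music.raw_bitrate_bps / 1000:.0f} kbps, "
          f"coded {music.coded_bitrate_bps / 1000:.0f} kbps, "
          f"{music.reduction_factor:.0f}x reduction")
    print(f"about {music.megabytes_per_minute:.2f} MB of audio per minute")
```

The per-minute figure counts only the coded audio payload. It ignores packet headers and any other transport overhead.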
The following examples contrast music transmitted using the speech codec versus the High-fidelity music mode.
New automatic music detection prompts Teams users to enable High-fidelity music mode
Machine-learning-based noise suppression is now enabled by default for most Teams customers. This noise suppression treats any non-speech signal picked up by the microphone as noise to be suppressed. To avoid unintentionally suppressing music, Teams features new automatic music detection, which notifies users whenever music is recognized (see below). This gives users the choice to enable High-fidelity music mode when music is a desired signal, such as in a guitar lesson, or to continue suppressing unwanted music, such as ambient sound in a coffee shop.
Detecting music accurately involved training a deep neural network on more than 1,000,000 audio clips containing speech and music. We then evaluated this model on an independent test set of 1,000 additional audio clips crowd-sourced from a wide range of contributors. This approach ensured a variety of recording conditions, such as different microphones and room acoustics. To simulate music lessons, we asked contributors to play different instruments, such as piano, guitar, violin, and trumpet, and to play background music from a wide variety of genres including rock, pop, country, R&B, jazz, classical, and others.
Since we didn’t want the user notification to appear when no music is present, we set a very strict requirement of 0.1% false positives (i.e., speech or noise classified as music). Even so, we were able to detect more than 81% of all music clips in our test set, significantly outperforming all published research in this field. Another important requirement was for this machine learning model to run in the Teams client across devices, to preserve a great experience for all users. More details on our approach can be found in this research paper. Automatic music detection is expected to be generally available in the coming months.
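The trade-off between a strict false positive limit and the share of music detected can be illustrated with a minimal sketch. It is not the Teams model or its evaluation code: the function names are hypothetical, the classifier is assumed to output a music score between 0 and 1, and the scores in the demo are synthetic.

```python
import random


def pick_threshold(non_music_scores, max_false_positive_rate):
    """Lowest threshold that keeps false positives within the limit.

    A clip is flagged as music when its score is strictly greater than
    the returned threshold.
    """
    ranked = sorted(non_music_scores, reverse=True)
    # Number of non-music clips we can afford to flag by mistake.
    # The small epsilon guards against floating-point round-down.
    allowed = int(max_false_positive_rate * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def fraction_flagged(scores, threshold):
    """Share of clips whose score exceeds the threshold."""
    return sum(score > threshold for score in scores) / len(scores)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic scores only: non-music skews low, music skews high.
    non_music = [rng.betavariate(1, 8) for _ in range(10_000)]
    music = [rng.betavariate(8, 1) for _ in range(1_000)]

    threshold = pick_threshold(non_music, max_false_positive_rate=0.001)
    print(f"threshold: {threshold:.3f}")
    print(f"false positive rate: {fraction_flagged(non_music, threshold):.4f}")
    print(f"music detected: {fraction_flagged(music, threshold):.3f}")
```

Tightening the false positive limit pushes the threshold up, so fewer music clips clear it. That is why a detection rate above 81% at a 0.1% limit is a demanding target.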
Each day, millions of users across the globe choose Teams to communicate across work, school, and home, with innovative features that enable new customer experiences. Automatic music detection and High-fidelity music mode are examples of how Teams uses machine learning and AI to optimize user experiences in real time, delivering improved audio and video quality without taxing your organization’s network.
Stay tuned to this blog to learn about new Teams features designed to improve the quality of your calls and meetings.