The genesis of our AI language learning product is quite a story. It all started during a discussion about the lack of customizable learning tools for reading practice. My friend Fan (Benjamin) Wang, an Online Course Designer & Technical Writer at Shanbay China, shared his ideas on how to create one with the Azure API, and we thought it would be worthwhile to turn the prototype into a product. So I formed a team with Yanxin (Jeff) Luo (Embedded Software Engineer), Yankun (Alex) Meng (Duke CS & ECE), and Vivian Yang (Duke Fuqua Quantitative Management) at the Duke Generative AI Hackathon to make it happen. Our approach and dedication culminated in a significant achievement: “TalkwithMe” was honored as the winner of the Beginner Track at the Duke Generative AI Hackathon 2023, a testament to our team's hard work and the potential of our project.
Our product “TalkwithMe” is an innovative browser extension that revolutionizes language learning by enabling users to practice pronunciation with their favorite scripts, providing instant feedback. This AI-driven tool, developed to enrich the language-learning landscape, leverages Microsoft Azure’s advanced speech synthesis models. Our journey in creating “TalkwithMe” involved selecting the most effective AI services and integrating Text-to-Speech (TTS) and Automatic Pronunciation Assessment (APA) functionalities. The challenge was to ensure a seamless user experience, which necessitated a user-friendly interface and smooth backend-frontend integration. Our commitment to solving these challenges has been pivotal in bringing this unique language-learning solution to life.
The general pipeline for our Minimum-Viable-Product (MVP) is divided into two independent processes: Text-to-Speech (TTS) and Automatic Pronunciation Assessment (APA). Both are done with the help of Microsoft Azure.
1. Text-to-Speech (TTS)
The first step in the text-to-speech pipeline is obtaining the user's input text, which is provided either through a text box or as a .txt file. This input is the content that will be converted into synthetic speech.
Once the input is acquired, we convert it into an audio blob using the Microsoft Speech Synthesizer class (updated in September 2023). This API takes the input text and returns an audio blob, which is essentially a binary representation of the audio data.
Within the Microsoft Speech Synthesizer class, we used specific configurations to tailor the synthesized speech to the user's requirements. The language setting depends on the speaker; in this case it is set to "en-US", so the generated speech is in US English. The choice of voice also matters. We specified "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)", which we believe to be the most natural and realistic voice provided.
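To make the language and voice settings concrete, here is a minimal sketch of how the same two choices can be written as SSML, a markup format the Azure Speech service accepts for synthesis requests. This is an illustration, not our exact code: the helper name `build_ssml` is hypothetical, and the short voice name `en-US-JennyNeural` is assumed to refer to the same voice as the long name quoted above.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, lang: str = "en-US", voice: str = "en-US-JennyNeural") -> str:
    """Wrap plain text in a minimal SSML document for a single voice.

    `build_ssml` is a hypothetical helper; the default voice short name is an
    assumption standing in for the long voice name used in the post.
    """
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        # escape() protects characters such as & and < in user-supplied text
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )


ssml = build_ssml("Tom & Jerry say hello.")
print(ssml)
```

Escaping the user's text matters here because scripts pasted into a text box can easily contain characters that would otherwise break the markup.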
After the text-to-speech conversion is successfully executed, the resulting audio blob is returned to the user interface by creating an audio bar element in JavaScript. The audio blob contains the synthesized speech, and users are provided with the capability to play and stop it at their discretion.
2. Automatic Pronunciation Assessment (APA)
This pipeline begins with user input in the form of audio, which is captured and processed using the Media Capture and Streams API in JavaScript. The main objective here is to assess the user's pronunciation based on the reference text they provide.
To initiate the pronunciation assessment pipeline, we create a WAV blob from the user's input audio. This allows us to efficiently work with the audio data. The audio data and reference text are then extracted from the user's request.
Once we have the necessary data, we proceed to assess the pronunciation. The audio data is read in manageable chunks; we chose 1024 bytes, a commonly used size. These chunks are then sent to the Azure Cognitive Speech API, configured with Pronunciation Assessment settings. The key configurations include:
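As an aside on the chunked read described above, the following is a minimal sketch of splitting audio data into 1024-byte blocks. The function name `iter_chunks` is hypothetical, and the in-memory buffer of zero bytes only stands in for a real uploaded WAV file.

```python
import io
from typing import BinaryIO, Iterator

CHUNK_SIZE = 1024  # bytes per read, matching the size mentioned in the text


def iter_chunks(stream: BinaryIO, size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield blocks of at most `size` bytes until the stream is exhausted."""
    while True:
        chunk = stream.read(size)
        if not chunk:  # an empty read means end of stream
            break
        yield chunk


# Stand-in for uploaded audio: 2500 zero bytes held in memory.
audio = io.BytesIO(bytes(2500))
sizes = [len(chunk) for chunk in iter_chunks(audio)]
print(sizes)  # [1024, 1024, 452]
```

Reading in fixed-size blocks keeps memory use flat regardless of recording length, and the final block is simply shorter than the rest.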
The result of the assessment is provided in the form of JSON data, which includes both Word Level and Phoneme Level evaluation. The scoring process involves comparing the spoken phonemes from the user's audio input with the expected phonemes from the reference text and computing a confidence score on how well it matches.
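To illustrate what working with word-level and phoneme-level results can look like, here is a small sketch that walks such a JSON document. The sample payload is hand-written for illustration and is not real service output; its field layout (`NBest`, `Words`, `Phonemes`, `PronunciationAssessment`, `AccuracyScore`) is an assumption modeled on the general shape of Azure's results, and the helper names are hypothetical.

```python
import json

# Hand-written sample for illustration only; not real service output.
SAMPLE = """
{
  "RecognitionStatus": "Success",
  "NBest": [{
    "Words": [
      {"Word": "hello",
       "PronunciationAssessment": {"AccuracyScore": 95.0},
       "Phonemes": [
         {"Phoneme": "h", "PronunciationAssessment": {"AccuracyScore": 90.0}},
         {"Phoneme": "ow", "PronunciationAssessment": {"AccuracyScore": 100.0}}
       ]},
      {"Word": "world",
       "PronunciationAssessment": {"AccuracyScore": 60.0},
       "Phonemes": []}
    ]
  }]
}
"""


def word_scores(result_json: str) -> list:
    """Return one dict per word with its score and per-phoneme scores."""
    best = json.loads(result_json)["NBest"][0]  # top recognition hypothesis
    rows = []
    for word in best["Words"]:
        phonemes = [
            (p["Phoneme"], p["PronunciationAssessment"]["AccuracyScore"])
            for p in word.get("Phonemes", [])
        ]
        rows.append({
            "word": word["Word"],
            "accuracy": word["PronunciationAssessment"]["AccuracyScore"],
            "phonemes": phonemes,
        })
    return rows


def weakest_word(rows: list) -> str:
    """Pick the word with the lowest accuracy score."""
    return min(rows, key=lambda r: r["accuracy"])["word"]


rows = word_scores(SAMPLE)
print(weakest_word(rows))  # world
```

Surfacing the lowest-scoring word is one simple way to turn the raw scores into feedback a learner can act on.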
The four most significant features are extracted from the JSON results and stored persistently in a CSV format. These features are presented to the user, allowing them to track their pronunciation progress effectively. The four primary features include:
To see these features in action and understand how they enhance the language learning experience, you can view our product walkthrough video.
One of the initial hurdles was conducting a comprehensive literature review and industry research to identify the best available tools that could be utilized to deliver fast and accurate results. We looked into new papers in NLP from renowned universities about TTS and well-developed AI services from OpenAI, AWS, and Google, and we eventually decided to use Microsoft Azure for our task due to the naturalness of its speech synthesis models and easy-to-configure parameters.
A significant challenge we overcame was figuring out how to combine the Text-to-Speech (TTS) and Automatic Pronunciation Assessment (APA) pipelines into a single user experience that was both smooth and comfortable. This involved careful configuration and integration of various components so that users could move from generating synthetic speech to assessing their pronunciation with minimal friction.
Other challenges were programming and implementation tasks, such as how to persist user data in CSV format effectively, how to send requests, and how to combine JavaScript (frontend) and Python (backend) without bugs. This involved carefully coordinating the two components so that they operated in sync. In terms of user interface (UI) design, we needed an interface that was intuitive and easy to use but also visually appealing. We opted for a Duke-themed color scheme that struck a balance between aesthetics and user-friendliness. These design choices enhanced the overall usability of our system.
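For the CSV persistence task mentioned above, a minimal sketch using Python's standard `csv` module might look like the following. The helper `append_row`, the file name, the column names, and the sample values are all hypothetical, chosen only to show the append-with-header pattern rather than our actual schema.

```python
import csv
import os
import tempfile


def append_row(path: str, fieldnames: list, row: dict) -> None:
    """Append one row, writing the header first if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Hypothetical columns and values, for illustration only.
FIELDS = ["reference_text", "accuracy", "fluency", "completeness", "pronunciation"]
path = os.path.join(tempfile.mkdtemp(), "progress.csv")

append_row(path, FIELDS, {"reference_text": "hello world", "accuracy": 92,
                          "fluency": 88, "completeness": 100, "pronunciation": 90})
append_row(path, FIELDS, {"reference_text": "good morning", "accuracy": 75,
                          "fluency": 80, "completeness": 100, "pronunciation": 79})

with open(path, newline="", encoding="utf-8") as f:
    history = list(csv.DictReader(f))
print(len(history))  # 2
```

Opening the file in append mode with `newline=""` keeps earlier attempts intact and avoids blank lines between rows on Windows.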
For the demo, we integrated several APIs and SDKs provided by Microsoft Azure and served them through a Flask/Python3 backend. This integration allowed us to create a seamless user experience, combining Text-to-Speech (TTS) and Automatic Pronunciation Assessment (APA) functionalities.
We successfully fine-tuned the parameters and configurations of our models to tailor them to our specific requirements. For the TTS component, we paid careful attention to language settings, ensuring that the generated speech matched the user's preferences. In the Pronunciation Assessment Model, we made key configurations such as granularity and miscue to improve the accuracy of assessments. Additionally, we developed a visually appealing and fully interactive user interface using vanilla JavaScript, HTML, and CSS. This interface facilitated a user-friendly experience, allowing users to interact easily with the system.
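As an illustration of the granularity and miscue settings mentioned above, the sketch below builds the assessment parameters in the form the Azure Speech REST API for short audio expects: a JSON object, base64-encoded and sent in the `Pronunciation-Assessment` request header. Whether our backend used this REST form or the SDK's equivalent configuration object is not covered here, so treat the helper name and the chosen default values as assumptions.

```python
import base64
import json


def pronunciation_assessment_header(reference_text: str,
                                    granularity: str = "Phoneme",
                                    enable_miscue: bool = True) -> str:
    """Return the base64-encoded JSON value for the Pronunciation-Assessment header.

    The helper name and default values are assumptions for illustration.
    """
    params = {
        "ReferenceText": reference_text,   # the script the user reads aloud
        "GradingSystem": "HundredMark",    # scores on a 0-100 scale
        "Granularity": granularity,        # Phoneme, Word, or FullText
        "Dimension": "Comprehensive",      # return all score dimensions
        "EnableMiscue": enable_miscue,     # flag omitted or inserted words
    }
    raw = json.dumps(params).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


header_value = pronunciation_assessment_header("Hello world")
decoded = json.loads(base64.b64decode(header_value))
print(decoded["Granularity"], decoded["EnableMiscue"])  # Phoneme True
```

Phoneme granularity gives the most detailed feedback, while enabling miscue lets the service mark words the speaker skipped or added relative to the reference text.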
We delivered a practical and intuitive working MVP designed for people who wish to practice their speaking skills extensively. Our efforts were also recognized at the hackathon, where we received excellent feedback from our peers. This positive reception is a testament to the potential of our system and its effectiveness in addressing the needs of speakers looking to enhance their pronunciation and speaking abilities.
After analyzing the graphs “Size of Global E-learning market from 2019 to 2026”, “Revenues generated by Duolingo from 2019 to 2022”, and “Language learning apps awareness and usage in the U.S. from 2019 to 2022”, we drew the following interpretations:
Our journey with “TalkwithMe” was an enlightening experience. We learned the importance of comprehensive market research, the need for a pedagogically sound approach, and the intricacies of combining different AI technologies. These insights have enriched our understanding of creating impactful educational tools.
Our future product development strategy encompasses the following key objectives, designed to enhance user experience and engagement:
If you're as passionate about language mastery as we are, discover more about AI and language learning with Microsoft Learn Modules/Docs that inspired us. We'd love to dive deeper into a conversation about our product. Join us in shaping the future of language education!