Introduction
As part of our third-year group project in the EE Department of Imperial College London, and under the amazing guidance of Lee_Stott and nitya, our team devised an end-to-end framework for developing and deploying GenAI retail copilots using state-of-the-art models from the Hugging Face platform, integrated with a back-end architecture that leverages Microsoft Azure infrastructure. A key feature of this project is that Contoso Chat can be adapted easily to any retail setting simply by switching out the datasets and connecting the large language models best suited to your application.
To aid future developers in optimal model selection, a detailed evaluation of the performance of several popular LLMs on the inference/chat completion task is presented at the end of this document. In addition, our team has incorporated new features that enhance the accessibility of the copilot's user interface through audio input and output.
About Us
This blog gives a brief overview of our project. To see more information and our code, visit the following:
GitHub Page | https://github.com/Microsoft-Contoso-Group-Project
Website | https://microsoft-contoso-group-project.github.io/website/
We are a team of 6 undergraduates in the Electronic and Information Engineering programme at Imperial College London.
Member (from left) | LinkedIn | Role
Sebastian Tan | linkedin.com/in/sebastian-tan-b485a5223 | Front End (Text-to-Speech)
Jim Zhu | linkedin.com/in/yonghui-jim-zhu-b687b5208 | Evaluation, Back-End Hugging Face Integration
Pinqian Jin | linkedin.com/in/pinqian-jin-7090b4237 | Front End (Speech-to-Text)
Yiru Chen | linkedin.com/in/yiru-chen-85b750227 | Evaluation
Alex Saul | linkedin.com/in/alex-saul | Back-End Hugging Face Integration
Zachary Elliot | linkedin.com/in/zacharygelliott | Back-End Hugging Face Integration
Project Overview and Goals
The objectives of the project are detailed below:
Objective I. Integrate Hugging Face into the Contoso Chat app
- Create a framework for developers to seamlessly use open-source models from Hugging Face with the Contoso Chat architecture, which currently relies on costly OpenAI models on Azure AI.
- Ensure developers can easily substitute the Embedding Model, Chat Completion Model, and Evaluation Model with Hugging Face alternatives.
Objective II. Improve the User Interface Experience
- Introduce a microphone feature to the chat interface, allowing users to vocalize their prompts to the chatbot.
- Implement a voice response feature that audibly relays chat responses to the user, complementing the text output.
Objective III. Evaluation
- Conduct thorough testing to confirm the application framework operates as intended.
- Establish a comprehensive, automated evaluation framework that lets developers using this product assess the performance of substituted models for their specific task.
- Assess the performance of various free Hugging Face models to provide model recommendations and guidance for developers.
The overall software architecture uses the Retrieval Augmented Generation (RAG) model:
- User Input: User enters prompt in either text or audio format.
- Speech-to-Text Model: For audio inputs, a Whisper model from Hugging Face transcribes the spoken words into text.
- Embedding Model: Transforms text prompts into vector representations for further processing.
- AI Search: Utilises the vectors generated by the embedding model to search the database for semantically similar document chunks.
- Cosmos DB: Stores catalogues, product information, and customer data, grounding the generated responses.
- Chat Completion Model: Generates final responses and product recommendations based on the refined prompt from the AI Search & Cosmos DB stage.
- Text-to-Speech Model: Converts text responses into audio using the ElevenLabs model.
- Response to User: Delivers the response to the initial prompt in both text and audio format.
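To make this flow concrete, the sketch below chains the embedding, retrieval, and chat completion stages in Python. It is only an illustration: the token, embedding endpoint, and example question are placeholders, and the retrieval and chat completion steps are stubs standing in for the Azure AI Search, Cosmos DB, and prompt flow calls described in the sections that follow.

# A simplified sketch of the RAG flow above; everything here is a placeholder,
# not the project's actual configuration.
import requests

HF_KEY = "hf_xxxxxxxx"  # placeholder Hugging Face access token
EMBED_API = "https://api-inference.huggingface.co/models/BAAI/bge-large-en-v1.5"

def embed(question: str) -> list:
    # Embedding Model: turn the text prompt into a vector representation.
    r = requests.post(EMBED_API, headers={"Authorization": f"Bearer {HF_KEY}"},
                      json={"inputs": question})
    r.raise_for_status()
    return r.json()

def retrieve(vector: list) -> list:
    # AI Search + Cosmos DB: look up semantically similar catalogue chunks.
    # Stubbed here; the real app queries Azure AI Search and Cosmos DB.
    return ["<retrieved product and customer documents>"]

def complete(question: str, context: list) -> str:
    # Chat Completion Model: answer the grounded prompt (also stubbed here).
    return f"<recommendation for {question!r} grounded in {len(context)} documents>"

question = "Which tent would you recommend for snowy conditions?"
print(complete(question, retrieve(embed(question))))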
Technical Details
The following code references files provided in Version 1 of the Contoso Chat repo, which utilises Azure Promptflow (rather than the Prompty assets used in Version 2). The specific branch used in this project's development can be found here.
Hugging Face Integration
To access open-source LLMs and embedding models on Hugging Face, a personal access token must first be obtained from Hugging Face. Developers can then navigate to the model selection page, which lists the available models to choose from. The API base is the Inference API endpoint where Hugging Face hosts the model, and it can be obtained from the individual model card. Hugging Face Chat contains the latest models, allowing developers to try them out before adding them to the prompt flow or application.
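Before wiring a model into the prompt flow, the token and the model's API base can be exercised directly as a quick sanity check. The snippet below is a minimal example; the token value, model ID, and prompt are placeholders rather than the project's configuration.

import requests

HF_KEY = "hf_xxxxxxxx"  # personal access token from your Hugging Face settings
# The API base, as listed on the model card, follows this pattern:
api_base = "https://api-inference.huggingface.co/models/microsoft/Phi-3-mini-4k-instruct"

response = requests.post(
    api_base,
    headers={"Authorization": f"Bearer {HF_KEY}"},
    json={"inputs": "Suggest three items for a winter camping trip."},
)
response.raise_for_status()
print(response.json())  # e.g. [{"generated_text": "..."}] for text-generation models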
The prompt flow uses the Serverless Connection for both large language models (LLMs) and embeddings, so non-OpenAI models can be added in a similar way by passing in the access token for the model's endpoint and the address of the serverless API. The request to the model and its response are initiated and parsed automatically. These methods work for most models on Hugging Face, such as the Llama and Phi series, which use the same communication protocol as GPT-3.5 and GPT-4; however, models such as Mistral that use a different communication protocol will cause a template error, and the framework is still being developed to support more models.
Adding Connections
Similar to the LLM model integration, new endpoints must be added in the create-connection.ipynb notebook so that connections can be established to the models of choice. As seen in the example code block below, the script is easily adaptable; in this case it creates connections for two of the bge embedding models, bge-small and bge-large.
from promptflow import PFClient
from promptflow._sdk.entities._connection import ServerlessConnection

pf = PFClient()  # prompt flow client (created earlier in the notebook)
HF_KEY = "XXXXXXXX"  # Hugging Face personal access token

# {connection name: api_base}
HF_endpoints = {
    "bge-small": "https://api-inference.huggingface.co/models/BAAI/bge-small-en",
    "bge-large": "https://api-inference.huggingface.co/models/BAAI/bge-large-en-v1.5",
}

for name, end_point in HF_endpoints.items():
    connection = ServerlessConnection(name=name, api_key=HF_KEY, api_base=end_point)
    print(f"Creating connection {connection.name}...")
    result = pf.connections.create_or_update(connection)
    print(result)
Promptflow Adaptation
To adapt the pre-existing prompt flow to integrate different embedding models from Hugging Face, the initial question-embedding node must be replaced by a new node that utilises a Python script, custom_embedding.py. An example node is defined below, where the Python script is given as the source path and the connection is the Hugging Face connection established earlier.
- name: custom_question_embedding
  type: python
  source:
    type: code
    path: custom_embedding.py
  inputs:
    connection: bge-large
    question: ${inputs.question}
    deployment_name: bge-large-en-v1.5
With the new node defined in the prompt flow YAML file, custom_embedding.py receives the question string along with the connection and deployment name, queries the Hugging Face API, and returns the embedding model's response for the given question.
import requests
from typing import List
from promptflow.core import tool
from promptflow.connections import ServerlessConnection


@tool
def custom_question_embedding(question: str, connection: ServerlessConnection, deployment_name: str) -> List[float]:
    # Endpoint and token come from the serverless connection created earlier.
    API_URL = connection.configs["api_base"]
    headers = {"Authorization": f"Bearer {connection.secrets['api_key']}"}

    def query(payload):
        response = requests.post(API_URL, headers=headers, json=payload)
        response.raise_for_status()
        return response.json()

    # Send the question to the embedding model and return the vector.
    output = query({"inputs": question})
    return output
Evaluation
Source: here
Traditional LLM evaluation methods usually involve manually writing reference summaries and comparing AI-generated results to those written by humans. However, expert human-written ground truths for summarization are hard to obtain and hard to compare against generated summaries automatically. In this project, the evaluation workflow therefore implements reference-free automatic evaluation of abstractive summarization, with a high-performance base model (GPT-4) acting as the evaluation model. Four dimensions are measured: fluency, coherence, consistency, and relevance.
Metric | Description
Coherence | The collective quality of all sentences in the summary.
Consistency | Factual alignment between the summary and the source document.
Fluency | The quality of individual sentences of the summary.
Relevance | The selection of the most important content from the source document.
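To illustrate how these four dimensions can be scored without human-written references, the sketch below poses a simple 1-5 rubric to the base model for each metric. It is a hypothetical illustration using the OpenAI Python client: the metric definitions come from the table above, while the model name, prompt wording, and client choice are assumptions rather than the project's exact evaluation flow.

# A hypothetical rubric-based scorer for the four reference-free metrics.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

METRICS = {
    "coherence": "the collective quality of all sentences in the answer",
    "consistency": "factual alignment between the answer and the retrieved context",
    "fluency": "the quality of the individual sentences",
    "relevance": "selection of the most important content from the context",
}

def score_answer(question: str, context: str, answer: str) -> dict:
    scores = {}
    for metric, definition in METRICS.items():
        prompt = (
            f"Rate the ANSWER from 1 to 5 for {metric} ({definition}). "
            f"Reply with a single integer.\n\n"
            f"QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"
        )
        completion = client.chat.completions.create(
            model="gpt-4",  # placeholder for the base evaluation model
            messages=[{"role": "user", "content": prompt}],
        )
        scores[metric] = int(completion.choices[0].message.content.strip())
    return scores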
Automatic Evaluation
Since there are over 700,000 free open-source models on Hugging Face, the aim of auto evaluation is to automate the process of selecting the most suitable model for the application. The auto-evaluation script loops through the specified chat models, model parameters such as top_p (which controls the randomness of the LLM's response), the embedding models used for vector similarity search in the framework, and different prompt flow templates.
At the end, it stores all the runs and logs in a summary table and calculates a weighted sum of the four evaluation metrics. For Contoso Chat, groundedness and relevance were judged more important than the other metrics, so more weight was placed on these two areas when calculating the weighted sum. The top-k models and parameters are then returned within dynamic HTML pages that show the results of the runs, allowing developers to interact with the results and compare them across different models.
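The sketch below shows the general shape of such a loop and the weighted ranking. The model lists, weights, and the run_and_evaluate stub are illustrative placeholders; the project's script launches prompt flow runs for each configuration and records the results in the summary table and HTML pages described above.

# A simplified sketch of the auto-evaluation loop with placeholder values.
from itertools import product

chat_models = ["Meta-Llama-3-70B-Instruct", "Phi-3-mini-4k-instruct"]
embedding_models = ["bge-small", "bge-large"]
top_p_values = [0.5, 0.9]

# Heavier weights on consistency/groundedness and relevance, as described above.
WEIGHTS = {"coherence": 0.2, "consistency": 0.3, "fluency": 0.2, "relevance": 0.3}

def run_and_evaluate(chat_model: str, embedding: str, top_p: float) -> dict:
    # Stub: in the real framework this submits a prompt flow run with the chosen
    # configuration, evaluates it, and returns the averaged metric scores.
    return {"coherence": 4.0, "consistency": 4.0, "fluency": 4.0, "relevance": 4.0}

results = []
for chat_model, embedding, top_p in product(chat_models, embedding_models, top_p_values):
    metrics = run_and_evaluate(chat_model, embedding, top_p)
    weighted = sum(WEIGHTS[m] * metrics[m] for m in WEIGHTS)
    results.append((weighted, chat_model, embedding, top_p))

# Keep the top-k configurations for the developer to inspect in the HTML report.
top_k = sorted(results, reverse=True)[:5]
print(top_k)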
Front End
One of the goals of our project is to introduce two features to the chat interface:
1. Audio input of prompt. This involves allowing the user to record the prompt they wish to ask in the form of speech, and then automatically sending the audio prompt for processing.
2. Audio output of response. Once the response is returned through the API, it should be read out as audio to the user.
These features were introduced to improve the accessibility and ease of use of the chat interface. The overall workflow for the new front-end design, encompassing the audio input and output features, is shown in the flow chart above.
Speech-to-Text (Audio Input)
To implement this, two main steps are necessary. First, an audio input function needs to be added to the user interface to collect audio input from users. The second step is to employ an automatic speech recognition model to transcribe the audio input into text, as back-end models only accept text inputs.
Hosting Platform | Model Name | Pros | Cons
Hugging Face | SALMONN | Multilingual support | No free access, high cost
Hugging Face | Wav2Vec2-large-xlsr-53 | High accuracy | Complicated to implement, needs fine-tuning; no free access, high cost
Hugging Face | Whisper large v3 | Free access on Hugging Face; multilingual support; high accuracy | High latency of the serverless API on Hugging Face
There are numerous automatic speech recognition models available, each with its own strengths and weaknesses; some of the widely used models are listed in the table above. Since the primary goal of our project is to reduce development cost and make the front end feasible for a broader user base, models with low cost and multilingual support are preferred. The Whisper large v3 model developed by OpenAI was therefore chosen for its high accuracy, multilingual support, and free availability on Hugging Face.
Audio recording was implemented in the React front end using the browser's MediaRecorder API, as shown in the code below. In this step, audio inputs are collected, stored as audio blobs, and sent to a function called sendAudio() for further processing.
useEffect(() => {
  if (recording && !mediaRecorderRef.current) {
    // initialise the media recorder
    navigator.mediaDevices.getUserMedia({ audio: true })
      .then(stream => {
        const mediaRecorder = new MediaRecorder(stream);
        let audioChunks: Blob[] = [];
        // when audio data becomes available
        mediaRecorder.ondataavailable = event => {
          if (event.data.size > 0) {
            audioChunks.push(event.data);
          }
        };
        // when recording starts
        mediaRecorder.onstart = () => {
          audioChunks = [];
        };
        // when recording stops
        mediaRecorder.onstop = () => {
          const audioBlob = new Blob(audioChunks, { type: audioChunks[0].type });
          // send to the back end for processing
          sendAudio(audioBlob);
        };
        // store the recorder and start capturing (not shown in the original excerpt)
        mediaRecorderRef.current = mediaRecorder;
        mediaRecorder.start();
      });
  }
});
The sendAudio function then sends the audio input to the server as a waveform audio file through an HTTP request.
After the server receives the audio file, it forwards it to the Whisper model hosted on Hugging Face through another HTTP request, using the Hugging Face Inference API. Once transcription is finished, the model returns the text as JSON, which is then sent back to the client side.
// API route: receives the recorded audio and forwards it to the Whisper model
// via the Hugging Face Inference API.
export async function POST(req: NextRequest) {
  if (!process.env.HUGGING_FACE_API_KEY) {
    return Response.json({ error: 'Hugging Face API key not set' }, { status: 500 });
  }

  // read the uploaded audio file from the multipart form data
  const formData = await req.formData();
  const file = formData.get("file") as Blob;
  const buffer = Buffer.from(await file.arrayBuffer());

  // forward the raw audio bytes to the Whisper inference endpoint
  const response = await fetch(HUGGING_FACE_API_URL, {
    method: "POST",
    headers: {
      'Authorization': "Bearer " + HF_api_key,
      //'Content-Type': "application/octet-stream", // to send a binary file
    },
    body: buffer,
    duplex: 'half',
  } as ExtendedRequestInit);

  // return the transcription JSON to the client
  const responseData = await response.json();
  return Response.json(responseData);
}
Text-to-Speech (Audio Output)
ElevenLabs' Turbo v2 model was a perfect fit due to its low latency, generating audio quickly and providing a more seamless user experience. Moreover, ElevenLabs allows users to convert up to 10,000 words of text into speech before needing to subscribe to a paid plan to use the API. This pay-as-you-go structure is well suited to developers in a sandbox scenario who wish to plug and play with different models. A drawback of the ElevenLabs model is the inability to fine-tune it, but since text-to-speech is a relatively static task with little variation across industries and applications, this trade-off was considered acceptable.
Hosting Platform | Model Name | Pros | Cons
Hugging Face | metavoice-1B-v0.1 | Emotional speech rhythm and tone in English; no hallucinations; well-documented fine-tuning steps | Serverless API unavailable, so the model must be deployed on Azure, increasing costs
Hugging Face | WhisperSpeech | Available as a PyTorch library, easy to deploy; able to fine-tune the model | Serverless API unavailable, so the model must be deployed on Azure, increasing costs
ElevenLabs (selected) | Turbo v2 | Optimised for low-latency processing, measured at 400 ms; free to use (up to 10,000 tokens); already hosted on ElevenLabs servers, simply accessed via API | Unable to fine-tune the model
The chat interface connects to the ElevenLabs platform through an API. An HTTP POST request is sent to ElevenLabs from the chat interface, and the generated audio is returned as a blob, which is then parsed into an HTML audio object.
The getElevenLabsResponse function is called each time a new response arrives from the back end: the text response is sent to ElevenLabs via the defined API route, and the returned audio blob is loaded into an HTML audio element and played for the user.
const getElevenLabsResponse = async (text: string) => {
  // send the chat response text to the text-to-speech API route
  const response = await fetch("/api/chat/text-to-speech", {
    method: "POST",
    headers: {
      "Content-Type": "application/json"
    },
    body: JSON.stringify({
      message: text,
      voice: "Rachel"
    })
  });

  // the route returns the generated audio as a blob
  const data = await response.blob();
  return data;
};
const audioRef = useRef<HTMLAudioElement>(null);

// call getElevenLabsResponse to read out the chat response
getElevenLabsResponse(responseTurn.message).then((botVoiceResponse) => {
  const reader = new FileReader();
  reader.readAsDataURL(botVoiceResponse);
  reader.onload = () => {
    if (audioRef.current) {
      // Pass the file to the <audio> element's src attribute.
      audioRef.current.src = reader.result as string;
      // Immediately play the audio file.
      audioRef.current.play();
    }
  };
});
Results and Outcomes
In this project, three text-generation models and two embedding models were evaluated, and the best one was picked for the Contoso Chat application. Prior to the Hugging Face integration, GPT-3.5 and text-embedding-ada-002 were used for text generation and embedding respectively.
Meta_llama3_instruct_70B and Phi_3_mini_4k_instruct from Hugging Face were evaluated as text-generation models. They were tested with different top_p values and embedding models to find their best performance. The best four combinations for each model are shown in the spider diagrams below (p0.9 means top_p = 0.9; ada002 means text-embedding-ada-002; bge-large-en-v1.5 means the 'bge-large-en-v1.5' embedding model with 1024 dimensions).
It is clear that GPT-3.5 outperforms Phi3 and Meta Llama3 across all the evaluation metrics. Between the two Hugging Face models, Meta Llama3 is stronger at groundedness, while Phi3 performs better at fluency and coherence. This indicates that Phi3 produces texts with better writing quality, whereas Meta Llama3's texts are more closely aligned with the original documents. Developers can choose different Hugging Face models based on the requirements of their specific projects and verify the choice during actual development and testing for better results.
The Evaluation Model May be Biased!
Meta Llama3 70b and Phi3 4k were tested as evaluation base models instead of GPT-4 to further investigate the difference in results between GPT-3.5 and the other models. The results showed a clear bias: all the Meta Llama models received higher marks when evaluated by Meta Llama3 70b, while all the Phi models scored better when Phi3 4k was used as the base model. It can therefore be conjectured that GPT-3.5's high scores may be partly due to GPT-4 being used as the evaluation model. In the future, the evaluation process needs to be revised to minimise this bias.
Acknowledgement & Conclusion
For the six of us, this project has been a fun and truly illuminating learning experience. Our heartfelt thanks go out to Lee_Stott and nitya for their superb mentoring and their generosity with their time. Our weekly standups could not have been more enlightening and fun, thanks to the immense knowledge and experience they shared with us throughout this six-week project. Our deepest gratitude also goes to our internal supervisor at Imperial, Sonali Parbhoo, for her guidance and support during the project.
We hope that developers will be able to use some of the techniques we have presented in this project to drive down their development costs and gain access to the huge variety of models available on Hugging Face. In addition, we hope our auto evaluation feature will aid developers in making sense of different models efficiently and the front-end audio input/output tools will be useful in improving the user experience of the chat interface.
Call to Action
Thank you for reading! Feel free to check out our GitHub page for more information.
We encourage everyone to explore the huge variety of Microsoft Samples on their page to learn more about the vast infrastructure and services provided by Microsoft Azure.