In this blog post, guest blogger Martin Bald, Sr. Manager DevRel and Community at Microsoft Partner Wallaroo.AI, covers the practices and considerations Data Scientists and Data Engineers use to monitor LLMaaS deployments, mitigating hallucinations and bias so models produce accurate and reliable output for consumers.
Introduction
The emergence of GenAI and associated services such as ChatGPT and Gemini is putting pressure on enterprises to implement GenAI/LLM solutions quickly so they are not left behind competitively in the race toward broad enterprise GenAI adoption. This urgency has trickled down to the technology teams in these enterprises, who are under pressure to rapidly create and implement GenAI/LLM-enabled products and solutions.
One low-barrier-to-entry GenAI/LLM option for enterprise technology teams is the Managed Inference Endpoint, also known as MaaS/LLMaaS (Model as a Service or LLM as a Service). MaaS/LLMaaS offerings are cloud-hosted services designed to simplify deploying and scaling LLMs for inference. The appeal for enterprises looking to get LLMs into production is that LLMaaS provides production-ready infrastructure that takes care of all the deployment and scaling complexity for them.
While LLMaaS is a hands-off solution, an effective and reliable production LLM still depends on the accuracy of the output it generates for its consumers. All models decay over time and require model governance to ensure they continue performing optimally for their intended use case.
In the case of LLMs, model decay or drift can manifest as hallucinations and bias, leading to loss of trust, integrity, satisfaction, and compliance, and potentially to legal fallout. Proactively monitoring models for hallucinations and bias is one of the challenges that prevents enterprises from launching LLMs effectively and reliably in production.
Enterprises that have offloaded LLM deployment and inference to LLMaaS can still retain control to mitigate hallucinations and bias through model monitoring and validation techniques such as Retrieval-Augmented Generation (RAG), LLM guardrails, and “LLM as a judge” to monitor and control the output of LLM applications.
While those solutions may be effective, can they be implemented efficiently alongside LLMaaS endpoints? Does the complexity of implementing these solutions end up alienating Data scientists and/or ML Engineers whose skills are needed to monitor and manage LLMs?
Best Practices for Managed Inference Endpoints Performance
There are a couple of methods Data Scientists can use to mitigate the hallucination and bias challenge.
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is one method that helps LLMs to produce more accurate and relevant outputs, effectively overcoming some of the limitations inherent in their training data. RAG not only enhances the reliability of the generated content but also ensures that the information is up-to-date, which is crucial for enhancing user trust and delivering accurate responses while adapting to constantly changing information.
RAG is also a good low-cost alternative to fine-tuning the model. Fine-tuning is expensive because of its intensive resource consumption, and it produces diminishing returns for accuracy compared to RAG.
RAG improves the accuracy and reliability of LLMs by allowing the model to reference an authoritative knowledge base outside of its training data sources before generating a response.
This gives the model an up-to-date, authoritative source that can quickly incorporate the latest data and provide accurate, current responses for end users.
The RAG LLM process takes the following steps (sketched in code after the list):
- Input text first passes through the feature extractor model that outputs the embedding. This is a list of floats that the RAG LLM uses to query the database for its context.
- Both the embedding and the original input are passed to the RAG LLM.
- The RAG LLM queries the vector-indexed database for the context from which to build its response. As discussed above, this context helps prevent hallucinations by providing guidelines the RAG LLM uses to construct its response.
- Once finished, the response is submitted as the generated text back to the application.
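For illustration, this flow can be sketched roughly as follows. The embedder, vector_index, and llm objects are hypothetical placeholders rather than Wallaroo's or any specific library's API; in the Wallaroo example below, these steps run inside the deployed pipeline.

# Conceptual sketch of the RAG flow above (hypothetical objects, not a specific API).
def rag_respond(user_input: str, embedder, vector_index, llm) -> str:
    # Step 1: the feature extractor produces the embedding (a list of floats).
    embedding = embedder.encode(user_input)
    # Steps 2-3: query the vector-indexed database for context to ground the response.
    context_docs = vector_index.query(embedding, top_k=3)
    context = "\n".join(doc.text for doc in context_docs)
    # Step 4: the RAG LLM builds its response from the context plus the original input.
    prompt = f"Context:\n{context}\n\nQuestion: {user_input}\nAnswer:"
    return llm.generate(prompt)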
An example of RAG in action
Inference requests are submitted either as pandas DataFrames or Apache Arrow tables. The following example submits a pandas DataFrame with a query asking for an action movie suggestion. The response is returned as a pandas DataFrame, from which we extract the generated text.
import pandas as pd

# `pipeline` is the deployed Wallaroo RAG pipeline described above.
data = pd.DataFrame({"text": ["Suggest me an action movie, including its name"]})
result = pipeline.infer(data, timeout=10000)
result['out.generated_text'].values[0]
This results in the following output text.
1. "The Battle of Algiers" (1966) - This film follows the story of the National Liberation Front (FLN) fighters during the Algerian Revolution, and their struggle against French colonial rule.
2. "The Goodfather" (1977) - A mobster's rise to power is threatened by his weaknesses, including his loyalty to his family and his own moral code.
3. "Dog Day Afternoon" (1975) - A desperate bank clerk turns to a life of crime when he can't pay his bills, but things spiral out of control.
Learn More: Retrieval-Augmented Generation LLMs with Wallaroo
Wallaroo LLM Listeners™
There may be certain use cases or compliance and regulatory rules that restrict the use of RAG. In such scenarios, LLM accuracy and integrity can still be maintained through the validation and monitoring capabilities of Wallaroo LLM Listeners™.
With the shift to LLMs, we worked with our customers to develop the concept of an LLM Listener: a set of models that we build and offer off the shelf, which can be customized to detect and monitor behaviors such as toxicity, harmful language, and so on.
For example, you may want to generate an alert for poor-quality responses immediately, or even autocorrect that behavior in-line. Listeners can also run offline when you want to do further analysis on LLM interactions, which is especially useful in a more controlled environment. You can also apply them in a RAG setting, layering these validation and monitoring steps on top of RAG to further improve the generated text output.
Wallaroo LLM Listeners™ can also be orchestrated to generate real-time monitoring reports and metrics, so you can understand how your LLM is behaving and ensure it is effective in production, which helps shorten time to value for the business. You can also iterate on the LLM Listener while keeping the endpoint static: everything behind the endpoint stays fluid, allowing AI teams to iterate quickly on the LLMs without impacting the bottom line, whether that is business reputation, revenue, costs, customer satisfaction, or ROI.
Fig-1
The Wallaroo LLM Listener™ approach illustrated above in Fig-1 is implemented as follows:
1: Input text from application and corresponding generated text.
2: The input is processed by your LLM inference endpoint.
3: Wallaroo will log the interactions between the LLM inference endpoint and your users in the inference results logs. Data Scientists can see the input text and corresponding generated text from there.
4: The inference results logs can be monitored by a suite of listener models which can be anything from standard processes to other NLP models that are monitoring these outputs inline or offline. Think of them as things like sentiment analyzers or even full systems that check against some ground truth.
5: The LLM Listeners score your LLM interactions on a variety of factors and can be used to generate automated reporting and alerts when behavior changes over time or some of these scores start to fall out of acceptable ranges (see the sketch below).
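To make steps 4 and 5 concrete, here is a minimal, purely illustrative sketch of an offline listener pass over the inference results logs. The toxicity model, column names, and thresholds are assumptions for illustration only, not the Wallaroo Listener API.

import pandas as pd

# Illustrative offline listener pass (hypothetical names, not the Wallaroo API).
TOXICITY_ALERT_THRESHOLD = 0.8   # assumed per-response toxicity cutoff
ALERT_RATE = 0.05                # assumed acceptable share of flagged responses

def score_logged_interactions(logs: pd.DataFrame, toxicity_model) -> pd.DataFrame:
    # Score each generated text captured in the inference results logs.
    logs["toxicity_score"] = logs["out.generated_text"].apply(toxicity_model.score)
    return logs

def should_alert(scored_logs: pd.DataFrame) -> bool:
    # Alert when the share of flagged responses drifts out of the acceptable range.
    flagged_rate = (scored_logs["toxicity_score"] > TOXICITY_ALERT_THRESHOLD).mean()
    return flagged_rate > ALERT_RATE

In Wallaroo, this kind of scoring is performed by the Listener models themselves over the inference results logs, in-line or offline, as in the example that follows.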
In the example below, an inference is performed by submitting an Apache Arrow table to the deployed LLM and LLM Validation Listener, and the results are displayed. Apache Arrow tables provide a low-latency method of data transmission and inference.
Input text:
text = "Please summarize this text: Simplify production AI for seamless self-checkout or cashierless experiences at scale, enabling any retail store to offer a modern shopping journey. We reduce the technical overhead and complexity for delivering a checkout experience that’s easy and efficient no matter where your stores are located. Eliminate Checkout Delays: Easy and fast model deployment for a smooth self-checkout process, allowing customers to enjoy faster, hassle-free shopping experiences. Drive Operational Efficiencies: Simplifying the process of scaling AI-driven self-checkout solutions to multiple retail locations ensuring uniform customer experiences no matter the location of the store while reducing in-store labor costs. Continuous Improvement: Enabling integrated data insights for informing self-checkout improvements across various locations, ensuring the best customer experience, regardless of where they shop."
input_data = pa.Table.from_pydict({"text" : [text]})
pipeline.infer(input_data, timeout=600)
pyarrow.Table
time: timestamp[ms]
in.text: string not null
out.generated_text: string not null
out.score: float not null
check_failures: int8
----
time: [[2024-05-23 20:08:00.423]]
in.text: [["Please summarize this text: Simplify production AI for seamless self-checkout or cashierless experiences at scale, enabling any retail store to offer a modern shopping journey. We reduce the technical overhead and complexity for delivering a checkout experience that’s easy and efficient no matter where your stores are located.Eliminate Checkout Delays: Easy and fast model deployment for a smooth self-checkout process, allowing customers to enjoy faster, hassle-free shopping experiences. Drive Operational Efficiencies: Simplifying the process of scaling AI-driven self-checkout solutions to multiple retail locations ensuring uniform customer experiences no matter the location of the store while reducing in-store labor costs. Continuous Improvement: Enabling integrated data insights for informing self-checkout improvements across various locations, ensuring the best customer experience, regardless of where they shop."]]
out.generated_text: [[" Here's a summary of the text:
This AI technology simplifies and streamlines self-checkout processes for retail stores, allowing them to offer efficient and modern shopping experiences at scale. It reduces technical complexity and makes it easy to deploy AI-driven self-checkout solutions across multiple locations. The system eliminates checkout delays, drives operational efficiencies by reducing labor costs, and enables continuous improvement through data insights, ensuring a consistent customer experience regardless of location."]]
out.score: [[0.837221]]
check_failures: [[0]]
The following fields are output from the inference:
- out.generated_text: The LLM’s generated text.
- out.score: The quality score.
In addition, Wallaroo LLM Listeners™ can be deployed in-line alongside the LLM, giving them the ability to suppress outputs that violate set thresholds before they are ever returned to the user.
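As a rough illustration of that idea, the check below applies a quality-score threshold to an inference result returned as a pandas DataFrame (as in the earlier RAG example). The threshold value and helper function are hypothetical; in practice the suppression is handled by the in-line Listener itself.

# Illustrative threshold check (hypothetical helper, not the Wallaroo API).
SCORE_THRESHOLD = 0.75  # assumed minimum acceptable quality score

def guarded_response(result_df) -> str:
    # Return the generated text only if its validation score clears the threshold.
    score = float(result_df["out.score"].values[0])
    if score < SCORE_THRESHOLD:
        # Suppress the low-scoring output and return a safe fallback instead.
        return "Sorry, I can't provide a reliable answer to that request."
    return result_df["out.generated_text"].values[0]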
Learn More: LLM Validation with Wallaroo LLM Listeners™
Conclusion
We have seen that Managed Inference Endpoints may not always be the happy path to GenAI/LLM nirvana for enterprises. Lack of control over model governance limits an organization's ability to build industrial-grade practices and operations and to maximize the return on its investment in LLMs.
With Wallaroo, technology teams can take control over hallucination and bias behavior back in house by implementing methods such as RAG and Wallaroo LLM Listeners™, ensuring that production LLMs stay up to date, reliable, robust, and effective through monitoring metrics and alerts. Using RAG and Wallaroo LLM Listeners™ helps mitigate issues such as toxicity and obscenity, avoiding risk and producing accurate and relevant generated outputs.
Technology teams that would like to extend this control to data security and privacy can meet their requirements regardless of where the model needs to run by using Wallaroo in their private Azure tenant with custom and on-prem LLMs.
Wallaroo enables these technology teams to get up and running quickly with custom and on-prem LLMs, on their existing infrastructure, with a unified framework to package and deploy custom on-prem LLMs directly on their Azure infrastructure. In an upcoming blog, we will lay out some important considerations when deploying custom on-prem LLMs on your own infrastructure to ensure optimal inference performance.
Learn More
- Wallaroo on Azure Marketplace