Startups at Microsoft

Optimizing Inference Performance for “On-Prem” LLMs

Christopher_Tearpak
Dec 18, 2024

In this blog, guest blogger Martin Bald, Sr. Manager DevRel and Community at Microsoft Partner Wallaroo.AI, covers how technology teams can take back control of their data security and privacy, without compromising on performance, when launching custom private/on-prem LLMs in production. We will showcase how LLM performance optimization engines such as Llama.cpp and vLLM can be integrated and deployed with LLMs in Wallaroo, and highlight how Wallaroo's simplified Autoscaling and Dynamic Batching configurations natively help meet performance optimization requirements.

Introduction

In a previous blog we addressed how Data Scientists and Data Engineers on technology teams can achieve effective model monitoring of LLMaaS and gain control over their LLM needs through RAG and Wallaroo LLM Listeners™ to mitigate hallucinations and bias and generate accurate, reliable outputs.

AI technology teams can extend control over their LLMs from model governance to other LLMOps aspects by deploying custom private/on-prem LLMs. Open source LLMs that AI teams customize and deploy directly within their private environments are commonly referred to as "on-prem" LLMs.

On-prem LLMs can offer a great deal of privacy and security over MaaS. However, they tend to present challenges related to performance and infrastructure cost that may prevent LLM adoption and scale within the enterprise: model performance must now be optimized on finite infrastructure.

With custom private/on-prem LLMs, technology teams face the challenge of meeting consistent inference latency and inference throughput goals. Production LLMs can place a burden on existing finite infrastructure resources, resulting in subpar inference performance. Poor inference performance can be prohibitive for an organization looking to take advantage of custom private/on-prem LLMs and can also delay the time to value of the LLM solution.

In addition to offering a unified framework for managing and monitoring LLMs, Wallaroo enables enterprises working with private/on-prem LLMs to optimize performance on existing infrastructure while simplifying the deployment and monitoring of those LLMs.

Custom LLM Performance Optimization

Llama.cpp and vLLM are two versatile and innovative frameworks for optimizing LLM inference. Let's look at how these frameworks, integrated within Wallaroo, can help technology teams achieve optimal inference performance for custom LLMs on-prem.

Llama.cpp

Llama.cpp is known for its portability and efficiency, designed to run optimally on CPUs and GPUs without requiring specialized hardware. It is a lightweight framework, which makes it ideal for technology teams launching LLMs on smaller devices and local on-prem machines, such as edge use cases.

It is also a versatile framework that provides extensive customization options, allowing technology teams to fine-tune various parameters to suit specific needs.
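
For instance, a team targeting a CPU-only edge box can tune the context window, thread count, and GPU offload directly when constructing the model with llama-cpp-python. The values below are illustrative, not a recommendation:

from llama_cpp import Llama

# Illustrative CPU-only configuration for a small edge deployment.
llm = Llama(
    model_path="model.Q5_K_M.gguf",  # placeholder path to a quantized GGUF model
    n_ctx=2048,         # context window size
    n_threads=8,        # CPU threads used for inference
    n_gpu_layers=0,     # keep every layer on the CPU
)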

These two capabilities deliver control and versatility for custom LLMs on-prem that technology teams don’t enjoy with managed inference endpoints.

Let’s look in detail at how a team would deploy a custom LLM into their on-prem infrastructure and optimize the resources available to them using Llama.cpp and vLLM with Wallaroo.

To begin with, both Llama.cpp and vLLM are deployed in Wallaroo using the Bring Your Own Predict (BYOP) framework. The BYOP framework allows organizations to use pre-defined Python templates and supporting libraries to build and configure custom inference pipelines that can be auto-packaged natively in Wallaroo. This means that teams have added control over inference microservices creation for any type of use case, while using a native method and with a great deal of infrastructure abstraction.

Deploying Llama.cpp with the Wallaroo BYOP framework requires llama-cpp-python. This example uses Llama 3 70B Instruct Q5_K_M for testing and deploying Llama.cpp.

Llama.cpp BYOP Implementation Details

1.) To run llama-cpp-python on GPU, llama-cpp-python is installed with CUDA enabled using the subprocess library, straight from the Python BYOP code:

import subprocess
import sys

pip_command = (
    f'CMAKE_ARGS="-DLLAMA_CUDA=on" {sys.executable} -m pip install llama-cpp-python'
)
subprocess.check_call(pip_command, shell=True)

2.) The model is loaded via the BYOP’s _load_model method, which configures the context window and offloads all of the model’s layers to the GPU:

from llama_cpp import Llama  # available once the llama-cpp-python install above completes

def _load_model(self, model_path):
    llm = Llama(
        model_path=f"{model_path}/artifacts/Meta-Llama-3-70B-Instruct.Q5_K_M.gguf",
        n_ctx=4096,          # context window size
        n_gpu_layers=-1,     # offload all layers to the GPU
        logits_all=True,
    )
    return llm

3.) Because the chosen model is an instruct variant, the prompt is constructed as a chat-style message list:

messages = [
    {
        "role": "system",
        "content": "You are a generic chatbot, try to answer questions the best you can.",
    },
    {
        "role": "user",
        "content": prompt,
    },
]
result = self.model.create_chat_completion(
    messages=messages, max_tokens=1024, stop=["<|eot_id|>"]
)
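
create_chat_completion returns an OpenAI-style response dictionary, so the generated text can then be pulled out of the first choice, for example:

# The completion follows the OpenAI chat format: the generated text is in the
# first choice's message content.
generated_text = result["choices"][0]["message"]["content"]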

4.) The deployment configuration sets what resources are allocated for the Llama.cpp LLM’s use. For this example, the Llama.cpp LLM is allocated 8 CPUs, 10 Gi RAM, and 1 GPU.

deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(model, 8) \
    .sidekick_memory(model, '10Gi') \
    .sidekick_gpus(model, 1) \
    .deployment_label("wallaroo.ai/accelerator:a100") \
    .build()
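
Once the deployment configuration is built, the BYOP-packaged LLM is deployed to a Wallaroo pipeline. The sketch below is illustrative; the pipeline name is a placeholder and wl is assumed to be the Wallaroo client instance:

# Build a pipeline, attach the BYOP-packaged LLM, and deploy it with the
# resource allocation defined above.
pipeline = wl.build_pipeline("llamacpp-llm-pipeline")
pipeline.add_model_step(model)
pipeline.deploy(deployment_config=deployment_config)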

vLLM

In contrast to Llama.cpp, vLLM focuses on ease of use and performance, offering a more streamlined experience with fewer customization requirements. vLLM leverages GPU acceleration to achieve higher performance, making it more suitable for environments with access to powerful GPUs.

vLLM delivers the following competitive features:

  1. Ease of use: One of vLLM’s primary design decisions is user-friendliness, making it accessible to technology teams with different levels of expertise. vLLM provides a straightforward setup and configuration process for quick development.
  2. High performance: vLLM is optimized for high performance, leveraging advanced techniques such as:
    1. PagedAttention to maximize inference speed.
    2. Tensor Parallelism to distribute computations efficiently across multiple GPUs (see the sketch after this list). This results in faster responses and higher throughput, making it a strong choice for demanding applications.
  3. Scalability: vLLM is built with scalability in mind, deploying any LLM on a single GPU or across multiple GPUs. This makes it suitable for both small-scale and large-scale deployments.
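
As a quick illustration of the multi-GPU point above, sharding a model across GPUs in vLLM is a single constructor argument; PagedAttention is applied automatically by vLLM’s engine. The model name and GPU count below are placeholders:

from vllm import LLM

# Illustrative only: distribute the model across 2 GPUs with tensor parallelism.
llm = LLM(
    model="Meta-Llama-3-8B-Instruct",   # placeholder model name/path
    tensor_parallel_size=2,             # number of GPUs to shard across
)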

This vLLM BYOP implementation example uses Llama 3 8B Instruct.

1.) To run vLLM on CUDA, vLLM is installed using the subprocess library, straight from the Python BYOP code:

import subprocess
import sys

pip_command = (
    f'{sys.executable} -m pip install https://github.com/vllm-project/vllm/releases/download/v0.5.2/vllm-0.5.2+cu118-cp38-cp38-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118'
)

subprocess.check_call(pip_command, shell=True)

2.) The model is loaded via the BYOP’s _load_model method, pointing at the downloaded model weights:

from vllm import LLM  # available once the vLLM install above completes

def _load_model(self, model_path):
    # Point vLLM at the model weights packaged in the BYOP artifacts directory.
    llm = LLM(
        model=f"{model_path}/artifacts/Meta-Llama-3-8B-Instruct/"
    )

    return llm
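
At inference time, the BYOP predict step can then call vLLM’s generate API on the loaded model. The sketch below is illustrative; the prompt handling and sampling settings are assumptions, not taken from the original example:

from vllm import SamplingParams

# Illustrative generation call inside the BYOP predict step.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)
outputs = self.model.generate([prompt], sampling_params)

# Each RequestOutput carries one or more completions; take the first.
generated_text = outputs[0].outputs[0].text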

3.) The deployment configuration sets what resources are allocated for the vLLM’s use. For this example, the vLLM is allocated 4 CPUs, 10 Gi RAM, and 1 GPU.

deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '10Gi') \
    .sidekick_gpus(model, 1) \
    .deployment_label("wallaroo.ai/accelerator:a100") \
    .build()

 

We can see from both examples that Llama.cpp brings portability and efficiency, designed to run optimally on CPUs and GPUs without specialized hardware, while vLLM brings user-friendliness, rapid inference speeds, and high throughput, making it an excellent choice for projects that prioritize speed and performance.

Technology teams have the flexibility, versatility, and control to optimize deployment of custom LLM models to their limited infrastructure across CPUs and GPUs using the Llama.cpp and vLLM frameworks.

Inference Latency and Throughput Optimizations

The primary performance challenge technology teams face when launching custom LLMs on-prem is optimizing inference latency and throughput at scale, all within the bounds of a pre-defined “infrastructure budget”. Teams can take control of these performance metrics in Wallaroo by implementing configurations for Autoscaling and Dynamic Batching.

Autoscaling

Autoscaling reduces latency for LLM inference requests by automatically adding resources when needed and scaling them back down, without manual intervention, using pre-configured scaling triggers based on the size of the incoming request queue.

Autoscale triggers provide LLMs with greater flexibility by:

  • Increasing resources to LLMs based on scale up triggers. This decreases inference latency when more requests come in, then spins idle resources back down to save on costs.
  • Controlling the allocation of resources by configuring autoscaling windows to prevent sudden or volatile resource spikes and drops.

Autoscaling batch and real-time inference requests helps reduce unnecessary infrastructure and operations costs by automatically adjusting computational resources to handle fluctuating AI inference traffic.

Autoscaling ensures that the LLM infrastructure can dynamically scale up or down to meet the current load, without latency issues or over-provisioning valuable resources, and without manual intervention. It efficiently handles fluctuating workloads by automatically increasing or decreasing computational resources (such as CPU, GPU, or memory) based on current demand. This means that LLM production inference operates at optimal cost to the business instead of burning through expensive resources unnecessarily, which in turn positively impacts time to value.

The following example shows a deployment configuration made for CPUs, where the data scientist sets the LLM resource allocations and the behavior that triggers autoscaling for the LLM.

Resource Allocation

  • Sets resources for the LLM deployment deployment_cpu_based with the following allocations:
    • Replica autoscale: 0 to 5.
    • Wallaroo engine per replica: 1 CPU, 2 Gi RAM.
    • LLM per replica: 30 CPUs, 10 Gi RAM.
    • scale_up_queue_depth: 1
    • scale_down_queue_depth: 1

Behavior

  • Wallaroo engine and LLM scaling is 1:1.
  • Scaling up occurs when the queue depth exceeds the scale_up_queue_depth threshold over a 300 second window.
  • Scaling down is triggered when the 5 minute queue average falls below the scale_down_queue_depth.

This is implemented with the following code:

deployment_cpu_based = wallaroo.DeploymentConfigBuilder() \
    .replica_autoscale_min_max(minimum=0, maximum=5) \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(llm, 30) \
    .sidekick_memory(llm, '10Gi') \
    .scale_up_queue_depth(1) \
    .scale_down_queue_depth(1) \
    .build()

Learn more, including how to set the autoscale trigger for GPU scenarios, in the documentation: Autoscaling LLMs with Wallaroo.

Dynamic Batching

Dynamic Batching helps optimize inference throughput by aggregating multiple incoming requests into a single batch, which is then processed as one unit. This not only improves inference throughput but also optimizes utilization of limited resources, helping to avoid additional cloud or hardware costs.

When multiple inference requests are sent from one or multiple clients, a Dynamic Batching Configuration accumulates those inference requests as one “batch” and processes them at once. This increases efficiency and inference performance by using resources on one accumulated batch rather than starting and stopping for each individual request. Once complete, the individual inference results are returned to each client.

The benefits of Dynamic Batching are multi-fold: higher throughput, improved hardware utilization, and reduced latency for both batch and real-time inference workloads, all leading to cost efficiency and efficient infrastructure utilization.

In Wallaroo, Dynamic Batching of inferences is triggered when either the max batch delay or the batch size target is met. When either condition is met, the collected inference requests are processed as a single batch.

When Dynamic Batching is implemented, the following occurs:

 

  • Inference requests are processed in FIFO (First In First Out) order. Inference requests containing batched inputs are not split to accommodate dynamic batching.
  • Inference results are returned back to the original clients.
  • Inference result logs store results in the order the inferences were processed and batched.
  • Dynamic Batching Configurations and target latency are honored and are not impacted by Wallaroo pipeline deployment autoscaling configurations.

 

Dynamic batching in Wallaroo can optionally be configured when uploading a new LLM or retrieving an existing one.

 

The Dynamic Batch Config takes the following parameters.

  • Maximum batch delay:
    • Set the maximum batch delay in milliseconds.
  • Batch size target:
    • Set the target batch size; must be greater than zero.
  • Batch size limit:
    • Set the batch size limit; must be greater than zero. This controls the maximum batch size.

For example:

dynamic_batch_config = wallaroo.dynamic_batching_config.DynamicBatchingConfig() \
                       .max_batch_delay_ms(5) \
                       .batch_size_target(1) \
                       .batch_size_limit(1) \
                       .build()

The following demonstrates applying the dynamic batch config at LLM upload.

llm_model = (
    wl.upload_model(
        model_name,
        model_file_name,
        framework=framework,
        input_schema=input_schema,
        output_schema=output_schema,
    )
    .configure(
        input_schema=input_schema,
        output_schema=output_schema,
        dynamic_batching_config=dynamic_batch_config,
    )
)
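
Once the model is uploaded with its dynamic batch configuration, it is deployed like any other Wallaroo model and clients call the pipeline as usual; batching happens behind the scenes. The sketch below is illustrative: the pipeline name, the deployment_config variable, and the "text" input column are placeholders that would need to match the deployment configuration and input schema used above.

import pandas as pd

# Deploy the LLM with its dynamic batch configuration in place.
pipeline = wl.build_pipeline("llm-dynamic-batch-pipeline")
pipeline.add_model_step(llm_model)
pipeline.deploy(deployment_config=deployment_config)

# Clients send requests exactly as before; Wallaroo accumulates concurrent
# requests into a batch and returns each client its own result.
result = pipeline.infer(pd.DataFrame({"text": ["Summarize the quarterly report."]}))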

Learn more: Dynamic Batching for LLMs with Wallaroo

Conclusion

While Managed Inference Endpoints (MaaS) can appear to offer an easy, seamless way for enterprises to deploy LLM inference services, AI technology teams lack control over data security and privacy, and in certain cases over cost.

In scenarios where these concerns prevent enterprises from launching and scaling LLMs in production, AI teams may choose to customize and deploy open source LLMs directly within their private environments, commonly referred to as “on-prem” LLMs. On-prem LLMs can offer a great deal of privacy and security over MaaS, but they tend to present challenges related to performance and infrastructure cost that may prevent LLM adoption and scale within the enterprise.

Deploying custom LLMs on-prem with Wallaroo helps AI technology teams take back control of these factors. While launching custom LLMs brings new challenges in the form of optimizing inference latency and throughput, these challenges can be overcome efficiently and effectively in Wallaroo by using the Llama.cpp and vLLM frameworks in conjunction with Autoscaling and Dynamic Batching.

With control over performance optimization on existing resources in Wallaroo, enterprise technology teams can operate with confidence and launch extensible, efficient custom LLM solutions in production.

The Wallaroo AI Inference Platform enables enterprise technology teams to quickly and efficiently operationalize custom LLMs at scale across the end-to-end LLM lifecycle, from deployment through ongoing model management with full governance and observability, while delivering optimal performance across Ampere, x64, and GPU architectures in the cloud, on-prem, and at the edge.

Learn More

Updated Dec 18, 2024
Version 1.0