In today's digitally interconnected landscape, language models stand at the forefront of technological innovation, reshaping the way we engage with various platforms and applications. These sophisticated algorithms have become indispensable tools in tasks ranging from text generation to natural language processing, driving efficiency and productivity across diverse sectors.
Yet, the reliance on cloud-based solutions presents a notable obstacle in certain contexts. In environments characterized by limited internet connectivity or stringent data privacy regulations, accessing cloud services may prove impractical or even impossible. This dependency on external servers introduces latency issues, security concerns, and operational challenges that hinder the seamless integration of language models into everyday workflows.
Enter the solution: running language models offline. By bringing the computational power of sophisticated models like Phi-2, Phi-3, and Whisper directly to mobile devices, this approach circumvents the constraints of cloud reliance, empowering users to leverage advanced language processing capabilities irrespective of connectivity status.
In this blog, we delve into the significance of enabling offline capabilities for LLMs and explore the practicalities of running SLMs on mobile devices, offering insights into the transformative potential of this technology.
In a typical Large Language Model (LLM) deployment scenario, the LLM is hosted on public cloud infrastructure such as Microsoft Azure, using tools like Azure Machine Learning, and exposed as an API endpoint. This API serves as the interface through which external applications, such as web applications and mobile apps on Android and iOS devices, interact with the LLM to perform natural language processing tasks. When a user initiates a request through the mobile app, the app sends the input data to the API endpoint, specifying the desired task, such as text generation or sentiment analysis.
The API processes the request, utilizing the LLM to perform the required task, and returns the result to the mobile app. This architecture enables seamless integration of LLM capabilities into mobile applications, allowing users to leverage advanced language processing functionalities directly from their devices while offloading the computational burden to the cloud infrastructure.
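For illustration, here is a minimal sketch of this request/response pattern from an Android client; the endpoint URL, header name, and JSON schema are placeholders rather than a specific Azure contract:

import java.net.HttpURLConnection
import java.net.URL

// Sketch of the cloud pattern described above: the app posts the user's prompt
// to a hosted LLM endpoint and reads back the generated text.
fun callCloudLlm(prompt: String, endpoint: String, apiKey: String): String {
    val connection = URL(endpoint).openConnection() as HttpURLConnection
    return try {
        connection.requestMethod = "POST"
        connection.doOutput = true
        connection.setRequestProperty("Content-Type", "application/json")
        connection.setRequestProperty("api-key", apiKey)  // placeholder auth header

        // Send the task and prompt as a simple JSON payload (illustrative schema;
        // a real client should JSON-escape the prompt).
        val body = """{"task": "text-generation", "prompt": "$prompt"}"""
        connection.outputStream.use { it.write(body.toByteArray()) }

        // The service runs the model in the cloud and returns the generated text.
        connection.inputStream.bufferedReader().use { it.readText() }
    } finally {
        connection.disconnect()
    }
}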
To overcome the limitations of relying on internet connectivity, the optimal solution is to run Large Language Models (LLMs) on-device, offline. This gives users the flexibility to interact with their safety copilot even in remote locations or places where internet isn't available, such as basements or underground facilities, while safeguarding privacy. By deploying LLMs directly on users' devices, such as mobile phones and tablets, we eliminate the need for continuous internet access and the associated back-and-forth communication with remote servers. This approach empowers users to access their safety copilot anytime, anywhere, without dependency on network connectivity.
What are Small Language Models (SLMs)?
Small Language Models (SLMs) represent a focused subset of artificial intelligence tailored for specific enterprise needs within Natural Language Processing (NLP). Unlike their larger counterparts like GPT-4, SLMs prioritize efficiency and precision over sheer computational power. They are trained on domain-specific datasets, enabling them to navigate industry-specific terminologies and nuances with accuracy. In contrast to Large Language Models (LLMs), which may lack customization for enterprise contexts, SLMs offer targeted, actionable insights while minimizing inaccuracies and the risk of generating irrelevant information. SLMs are characterized by their compact architecture, lower computational demands, and enhanced security features, making them cost-effective and adaptable for real-time applications like chatbots. Overall, SLMs provide tailored efficiency, enhanced security, and lower latency, addressing specific business needs effectively while offering a promising alternative to the broader capabilities of LLMs.
Small Language Models (SLMs) offer enterprises control and customization, efficient resource usage, effective performance, swift training and inference, and resource-efficient deployment. They scale easily, adapt to specific domains, facilitate rapid prototyping, enhance security, and provide transparency. Their scope and limitations are clearly understood, and their cost efficiency makes them an attractive option for businesses seeking AI capabilities without extensive resource investment.
Why is running SLMs offline at the edge a challenge?
Running Small Language Models (SLMs) offline on mobile phones enhances privacy, reduces latency, and broadens access. Users can interact with LLM-based applications, receive critical information, and perform tasks even in offline environments, ensuring accessibility and control over personal data. Real-time performance and independence from centralized infrastructure unlock new opportunities for innovation in mobile computing, offering a seamless and responsive user experience. However, running SLMs offline on mobile phones presents several challenges due to the constraints of mobile hardware and the complexities of running LLM tasks: limited memory and storage for multi-gigabyte model weights, constrained compute and thermal budgets that throttle sustained inference, battery drain from token-by-token generation, and accuracy trade-offs introduced by aggressive quantization.
How to deploy SLMs on a Mobile Device?
Deploying SLMs on mobile devices involves the integration of MediaPipe and WebAssembly technologies to optimize performance and efficiency. MediaPipe, known for enabling on-device ML capabilities, provides a robust framework for running SLMs entirely on mobile devices, eliminating the need for constant network connectivity and for offloading computation to remote servers. With the experimental MediaPipe LLM Inference API, developers can seamlessly integrate popular SLMs like Gemma, Phi-2, Falcon, and Stable LM into their mobile applications. This breakthrough is facilitated by a series of optimizations across the on-device stack, including new operations, quantization techniques, caching mechanisms, and weight sharing strategies. MediaPipe leverages WebAssembly (Wasm) to further enhance the deployment of SLMs on mobile devices.
Wasm's compact binary format and compatibility with multiple programming languages ensure efficient execution of non-JavaScript code within the mobile environment. By time-slicing GPU access and remaining platform neutral, Wasm optimizes GPU usage and facilitates deployment across diverse hardware environments, enhancing the performance of SLMs on mobile devices. Additionally, advances such as the WebAssembly System Interface – Neural Networks (WASI-NN) standard extend Wasm's capabilities, promising a future where it plays a pivotal role in democratizing access to AI-grade compute power on mobile devices. Through the combined use of MediaPipe and WebAssembly, developers can deploy SLMs on mobile devices with high efficiency and performance, enabling on-device AI applications across various platforms.
MediaPipe's LLM Inference API empowers you to harness SLMs directly on Android devices. With this framework, you can execute tasks like text generation, natural language information retrieval, and document summarization without relying on external servers. It offers seamless integration with multiple text-to-text SLMs, enabling you to leverage cutting-edge generative AI models within your Android applications, with support for popular SLMs like Phi-2, Gemma, Falcon-RW-1B, and StableLM-3B.
The LLM Inference API uses the `com.google.mediapipe:tasks-genai` library. Add this dependency to the `build.gradle` file of your Android app:
dependencies {
implementation 'com.google.mediapipe:tasks-genai:0.10.11'
}
The model conversion process requires the MediaPipe PyPI package. The conversion script is available in all MediaPipe packages after 0.10.11.
Install and import the dependencies with the following:
$ python3 -m pip install mediapipe
Use the `genai.converter` library to convert the model:
import mediapipe as mp
from mediapipe.tasks.python.genai import converter

def phi2_convert_config(backend):
    input_ckpt = '/content/phi-2'
    vocab_model_file = '/content/phi-2/'
    output_dir = '/content/intermediate/phi-2/'
    output_tflite_file = f'/content/converted_models/phi2_{backend}.bin'
    return converter.ConversionConfig(
        input_ckpt=input_ckpt, ckpt_format='safetensors', model_type='PHI_2',
        backend=backend, output_dir=output_dir, combine_file_only=False,
        vocab_model_file=vocab_model_file, output_tflite_file=output_tflite_file)

# Run the conversion (CPU backend shown here).
config = phi2_convert_config('cpu')
converter.convert_checkpoint(config)
| Parameter | Description | Accepted Values |
|---|---|---|
| `input_ckpt` | The path to the `model.safetensors` or `pytorch.bin` file. Note that the safetensors format is sometimes sharded into multiple files, e.g. `model-00001-of-00003.safetensors`, `model-00002-of-00003.safetensors`; you can specify a file pattern like `model*.safetensors`. | PATH |
| `ckpt_format` | The model file format. | {"safetensors", "pytorch"} |
| `model_type` | The SLM being converted. | {"PHI_2", "FALCON_RW_1B", "STABLELM_4E1T_3B", "GEMMA_2B"} |
| `backend` | The processor (delegate) used to run the model. | {"cpu", "gpu"} |
| `output_dir` | The path to the output directory that hosts the per-layer weight files. | PATH |
| `output_tflite_file` | The path to the output file, for example "model_cpu.bin" or "model_gpu.bin". This file is only compatible with the LLM Inference API and cannot be used as a general `tflite` file. | PATH |
| `vocab_model_file` | The path to the directory that stores the `tokenizer.json` and `tokenizer_config.json` files. For Gemma, point to the single `tokenizer.model` file. | PATH |
Push the converted model (the `output_tflite_file` produced above) to the Android device.
$ adb shell rm -r /data/local/tmp/llm/ # Remove any previously loaded models
$ adb shell mkdir -p /data/local/tmp/llm/
$ adb push model.bin /data/local/tmp/llm/model_phi2.bin
The MediaPipe LLM Inference API uses the `createFromOptions()` function to set up the task. The `createFromOptions()` function accepts values for the configuration options; for more information, see the configuration options table below.
The following code initializes the task using basic configuration options:
// Set the configuration options for the LLM Inference task
val options = LlmInferenceOptions.builder()
    .setModelPath("/data/local/.../")
    .setMaxTokens(1000)
    .setTopK(40)
    .setTemperature(0.8f)
    .setRandomSeed(101)
    .build()

// Create an instance of the LLM Inference task
llmInference = LlmInference.createFromOptions(context, options)
Use the following configuration options to set up an Android app:
| Option Name | Description | Value Range | Default Value |
|---|---|---|---|
| `modelPath` | The path to where the model is stored within the project directory. | PATH | N/A |
| `maxTokens` | The maximum number of tokens (input tokens + output tokens) the model handles. | Integer | 512 |
| `topK` | The number of tokens the model considers at each step of generation. Limits predictions to the top k most-probable tokens. When setting `topK`, you must also set a value for `randomSeed`. | Integer | 40 |
| `temperature` | The amount of randomness introduced during generation. A higher temperature results in more creativity in the generated text, while a lower temperature produces more predictable generation. When setting `temperature`, you must also set a value for `randomSeed`. | Float | 0.8 |
| `randomSeed` | The random seed used during text generation. | Integer | 0 |
| `resultListener` | Sets the result listener to receive the results asynchronously. Only applicable when using the async generation method. | N/A | N/A |
| `errorListener` | Sets an optional error listener. | N/A | N/A |
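The `resultListener` is demonstrated in the streaming example further below; the `errorListener` attaches the same way. Here is a minimal sketch, assuming the options builder exposes a `setErrorListener` method corresponding to the option in the table (the log tag is a placeholder):

import android.util.Log
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch: attaching the optional error listener to the inference options.
val optionsWithErrorHandling = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/.../")   // same (elided) model path as above
    .setErrorListener { e ->
        // Receives errors raised during asynchronous generation.
        Log.e("LlmDemo", "LLM inference error", e)
    }
    .build()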
The LLM Inference API accepts a text prompt as input:
val inputPrompt = "Compose an email to remind Brett of lunch plans at noon on Saturday."
Use the `generateResponse()` method to generate a text response to the input text provided in the previous section (`inputPrompt`). This produces a single generated response.
val result = llmInference.generateResponse(inputPrompt)
logger.atInfo().log("result: $result")
To stream the response, use the `generateResponseAsync()` method.
val options = LlmInference.LlmInferenceOptions.builder()
...
.setResultListener { partialResult, done ->
logger.atInfo().log("partial result: $partialResult")
}
.build()
llmInference.generateResponseAsync(inputPrompt)
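Because the listener receives the response in chunks, a common pattern is to append each `partialResult` to a buffer and act once `done` is true. A minimal sketch of that pattern follows; the buffer and the `showResponse` callback are illustrative, not part of the API:

import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Placeholder for whatever the app does with the completed response.
fun showResponse(text: String) { /* update the UI with the full response */ }

// Collect streamed chunks into the complete response text.
val responseBuilder = StringBuilder()

val streamingOptions = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/.../")
    .setResultListener { partialResult, done ->
        responseBuilder.append(partialResult)
        if (done) {
            showResponse(responseBuilder.toString())
        }
    }
    .build()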
The LLM Inference API returns a `LlmInferenceResult`, which includes the generated response text.
Here's a draft you can use:
Subject: Lunch on Saturday Reminder
Hi Brett,
Just a quick reminder about our lunch plans this Saturday at noon.
Let me know if that still works for you.
Looking forward to it!
Best,
[Your Name]