We are pleased to announce SynapseML v0.11, a new version of our open-source distributed machine learning library that simplifies and accelerates the development of scalable AI. In this release, we are excited to introduce many new features from the past year of developments well as many bug fixes and improvements. Though this post will give a high-level overview of the most salient new additions, curious readers can check out the full release notes for all of the new additions.
OpenAI Language Models and Embeddings
A new release wouldn’t be complete without joining the large language model (LLM) hype train and SynapseML v0.11 features a variety of new features that make large-scale LLM usage simple and easy. In particular, SynapseML v0.11 introduces three new APIs for working with foundation models: `OpenAIPrompt`, ` OpenAIEmbedding`, and `OpenAIChatCompletion`. The `OpenAIPrompt` API makes it easy to construct complex LLM prompts from columns of your dataframe. Here’s a quick example of translating a dataframe column called “Description” into emojis.
from synapse.ml.cognitive.openai import OpenAIPrompt
emoji_template = """
Translate the following into emojis
Word: {Description}
Emoji: """
results = (OpenAIPrompt()
.setPromptTemplate(emoji_template)
.setErrorCol("error")
.setOutputCol("Emoji")
.transform(inputs))
This code will automatically look for a database column called “Description” and prompt your LLM (ChatGPT, GPT-3, GPT-4) with the created prompts. Our new OpenAI embedding classes make it easy to embed large tables of sentences quickly and easily from your Apache Spark clusters. To learn more, see our docs on using OpenAI embeddings API and the SynapseML KNN model to create an LLM-based vector search engine directly on your spark cluster. Finally, the new OpenAIChatCompletion transformer allows users to submit large quantities of chat-based prompts to ChatGPT, enabling parallel inference of thousands of conversations at a time. We hope you find the new OpenAI integrations useful for building your next intelligent application.
Simple Deep Learning
SynapseML v0.11 introduces a new Simple deep learning package that allows for the training of custom text and deep vision classifiers with only a few lines of code. This package integrates the power of distributed deep network training with PytorchLightning with the simple and easy APIs of SynapseML. The new API allows users to fine-tune visual foundation models from torchvision as well as a variety of state-of-the-art text backbones from HuggingFace.
Here’s a quick example showing how to fine-tune custom vision networks:
from synapse.ml.dl import DeepVisionClassifier
train_df = spark.createDataframe([
("PATH_TO_IMAGE_1.jpg", 1),
("PATH_TO_IMAGE_2.jpg", 2)
], ["image", "label"])
deep_vision_classifier = DeepVisionClassifier(
backbone="resnet50",
num_classes=2,
batch_size=16,
epochs=2,
)
deep_vision_model = deep_vision_classifier.fit(train_df)
Keep an eye out with upcoming new releases of SynapseML featuring additional simple deep-learning algorithms that will make it easier than ever to train and deploy models at scale.
LightGBM v2
LightGBM is one of the most used features of SynapseML and we heard your feedback on better performance! SynapseML v0.11 introduces a completely refactored integration between LightGBM and Spark, called LightGBM v2. This integration aims for high performance by introducing a variety of new streaming APIs in the core LightGBM library to enable fast and memory-efficient data sharing between spark and LightGBM. In particular, the new “Streaming execution mode” has a >10x lower memory footprint than earlier versions of SynapseML yielding fewer memory issues and faster training. Best of all, you can use the new mode by just passing a single extra flag to your existing LightGBM models in SynapseML.
ONNX Model Hub
SynapseML supports a variety of new deep learning integrations with the ONNX runtime for fast, hardware-accelerated inference in all of the SynapseML languages (Scala, Java, Python, R, and .NET). In version 0.11 we add support for the new ONNX model hub, which is an open collection of state-of-the-art pre-trained ONNX models that can be quickly downloaded and embedded into spark pipelines. This allowed us to completely deprecate and remove our old dependence on the CNTK deep learning library.
To learn more about how you can embed deep networks into Spark pipelines, check out our ONNX episode in the new SynapseML video series:
Causal Learning
SynapseML v0.11 introduces a new package for causal learning that can help businesses and policymakers make more informed decisions. When trying to understand the impact of a “treatment” or intervention on an outcome, traditional approaches like correlation analysis or prediction models fall short as they do not necessarily establish causation. Causal inference aims to overcome these shortcomings by bridging the gap between prediction and decision-making. SynapseML's causal learning package implements a technique called "Double machine learning", which allows us to estimate treatment effects without data from controlled experiments. Unlike regression-based approaches, this approach can model non-linear relationships between confounders, treatment, and outcome. Users can run the DoubleMLEstimator using a simple code snippet like the one below:
from pyspark.ml.classification import LogisticRegression
from synapse.ml.causal import DoubleMLEstimator
dml = (DoubleMLEstimator()
.setTreatmentCol("Treatment")
.setTreatmentModel(LogisticRegression())
.setOutcomeCol("Outcome")
.setOutcomeModel(LogisticRegression())
.setMaxIter(20))
dmlModel = dml.fit(dataset)
For more information, be sure to check out Dylan Wang's guided tour of the DoubleMLEstimator on the SynapseML video series:
Vowpal Wabbit v2
Finally, SynapseML v0.11 introduces Vowpal Wabbit v2, the second-generation integration between the Vowpal Wabbit (VW) online optimization library and Apache Spark. With this update, users can work with Vowpal wabbit data directly using the new “VowpalWabbitGeneric” model. This makes working with Spark easier for existing VW users. This more direct integration also adds support for new cost functions and use cases including “multi-class” and “cost-sensitive one against all” problems. The update also introduces a new progressive validation strategy and a new Contextual Bandit Offline policy evaluation notebook to demonstrate how to evaluate VW models on large datasets.
Conclusion
In conclusion, we are thrilled to share the new SynapseML library with you with you and hope you will find that it simplifies your distributed machine learning pipelines. This blog only covered the highlights, so be sure to check out the full release notes for all the updates and new features. Whether you are working with large language models, training custom classifiers, or performing causal inference, SynapseML makes it easier and faster to develop and deploy machine learning models at scale.