Bing accelerates model training with LightGBM and Azure Machine Learning
Published Nov 14 2019

This post was co-authored by @Dipankar-Ray, Chris Lauren, @David_Aronchick, Kaarthik Sivashanmugam, Barry Li, and Rangan Majumder

We are pleased to announce that Bing now uses Azure Machine Learning pipelines with LightGBM to train the machine learning models for its core search relevance ranking engine. The combination of Azure Machine Learning and LightGBM has led to significant improvements in speed and reliability: Bing engineers now regularly run distributed jobs on up to 100 nodes using more than 13 TB of data per run. We are also announcing built-in support for LightGBM as an Estimator in the Azure Machine Learning SDK, so everyone can benefit from the same improvements we have demonstrated on Bing's workloads.

To learn more and get started with distributed training using LightGBM in Azure Machine Learning, see our new sample Jupyter notebook.

Gradient Boosted Decision Trees and Search

While Deep Learning has gotten a lot of attention in the news over the last few years, Gradient Boosted Decision Trees (GBDTs) are the hidden workhorse of the modern Internet. They are used everywhere: any service that does click prediction, document ranking, or relevance scoring almost certainly relies on GBDTs in an essential way.

The reason for this is straightforward: GBDTs consistently produce best-in-class results in situations where the input data is heterogeneous, or where you don't have a lot of labeled training data. Even in Kaggle competitions, where one would expect to see a plethora of different algorithms, the winning entries are very often powered by GBDTs.

In the last decade, joint work by Bing and Microsoft Research on ranking problems played a significant role in the rise of GBDTs; in fact, their papers directly inspired the GBDT implementations in LightGBM, XGBoost, and CatBoost. In 2017, Guolin Ke of Microsoft Research released LightGBM as an open-source project; it's a fast, efficient, and powerful implementation with an active user community and developer base.
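
To give a flavor of the library before turning to the distributed setting, here is a minimal local training sketch on synthetic data (this example is ours, not part of the Azure ML integration; it assumes the lightgbm and scikit-learn packages are installed):

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data, purely for illustration
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

train_set = lgb.Dataset(X_train, label=y_train)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)

params = {"objective": "binary", "metric": "auc", "num_leaves": 63}
booster = lgb.train(params, train_set, num_boost_round=100, valid_sets=[val_set])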

Enabling productivity with Azure Machine Learning

Bing saw an opportunity to both accelerate training and reduce management overhead by using LightGBM with the Azure Machine Learning service. Further, Azure Machine Learning pipelines allowed us to orchestrate multi-step workflows that prepare data, then train and evaluate models across multiple types of compute targets. These pipelines helped Bing engineers easily refine and develop their data preparation, pre-processing, and post-processing tasks.
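
As a rough illustration of the shape of such a workflow (the step names, scripts, and cluster name below are hypothetical, not Bing's actual pipeline), a two-step pipeline can be assembled like this:

from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()  # load workspace settings from a local config.json
cpu_cluster = ws.compute_targets["cpu-cluster"]  # an existing AmlCompute cluster (hypothetical name)

# Two illustrative steps; each runs a script on a compute target
prep_step = PythonScriptStep(name="prepare data",
                             script_name="prep.py",          # hypothetical script
                             source_directory="./scripts",
                             compute_target=cpu_cluster)
train_step = PythonScriptStep(name="train and evaluate",
                              script_name="train.py",        # hypothetical script
                              source_directory="./scripts",
                              compute_target=cpu_cluster)
train_step.run_after(prep_step)  # make training wait for data preparation

pipeline = Pipeline(workspace=ws, steps=[train_step])
run = Experiment(ws, "ranking-pipeline").submit(pipeline)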

We used dynamic resizing with AmlCompute so that we pay nothing for clusters while they sit idle. With dynamic compute pools, Bing engineers can submit more training jobs and have the cluster automatically grow to support the additional workload; once the jobs complete, the cluster shrinks back down to save money. Dynamic resizing thus lets our teams manage their compute budget over the year, paying only for the compute they actually use. As a result, Bing can get out of the business of managing compute infrastructure and focus on its core goal of building top-quality search technologies.
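
Creating an autoscaling cluster is a one-time setup step. Here is a minimal sketch, with an illustrative VM size and cluster name; setting min_nodes=0 lets the pool scale all the way down to zero nodes, so an idle cluster costs nothing:

from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

# min_nodes=0 scales the cluster to zero nodes (and zero cost) when idle;
# max_nodes caps how far it can grow under load.
config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D16S_V3",
                                               min_nodes=0,
                                               max_nodes=100,
                                               idle_seconds_before_scaledown=1800)
cpu_cluster = ComputeTarget.create(ws, "cpu-cluster", config)
cpu_cluster.wait_for_completion(show_output=True)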

Get started with LightGBM in Azure Machine Learning

After completing this project, we immediately realized that our customers would also benefit from running distributed training using LightGBM in Azure Machine Learning. This unlocks the power of GBDTs on large datasets, enabling training on LibSVM-format datasets larger than 10 TB.

We implemented LightGBM as an Estimator class, which lets LightGBM users seamlessly access the experiment-management features of the Azure Machine Learning SDK as well as the Azure portal. With the Estimator class, submitting a run is as easy as the following:

# Note: Workspace and Experiment live in azureml.core. The modules exposing
# the LightGBM estimator and the Mpi configuration depend on your SDK version;
# see the sample notebook linked above for the exact imports.
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()  # load workspace details from a local config.json
# scripts_folder (the directory containing train.conf) and cpu_cluster (an
# existing AmlCompute cluster) are assumed to be set up as in the notebook.

# Training and validation files, in a format LightGBM understands (e.g., LibSVM)
training_data_list = ["binary0.train", "binary1.train"]
validation_data_list = ["binary0.test", "binary1.test"]

lgbm = LightGBM(source_directory=scripts_folder,
                compute_target=cpu_cluster,
                distributed_training=Mpi(),    # LightGBM's MPI-based parallel training
                node_count=2,                  # number of nodes to distribute across
                lightgbm_config='train.conf',  # LightGBM configuration file
                data=training_data_list,
                valid=validation_data_list
               )

experiment = Experiment(ws, name='lightgbm-estimator-test')
run = experiment.submit(lgbm, tags={"test public docker image": None})
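
The lightgbm_config argument names an ordinary LightGBM configuration file inside source_directory. As an illustration only (these parameter values are examples, not tuned settings), such a train.conf might contain:

task = train
objective = binary
# data-parallel tree learner for distributed runs
tree_learner = data
num_trees = 100
learning_rate = 0.1
num_leaves = 63

After submitting, you can stream the run's logs locally while it executes:

run.wait_for_completion(show_output=True)  # stream logs until the job finishes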

Gradient Boosted Decision Trees are among the most effective and versatile algorithms in the data science toolkit, so we're very excited to offer distributed GBDT training to the general community, and we're eager to see what it unlocks for our users. Check out our sample Jupyter notebook demonstrating how to use distributed LightGBM training in Azure Machine Learning today!
