First published on MSDN on Apr 19, 2017
Introduction
With the launch of Microsoft R Server 9.1 , many optimizations and new features were delivered to our users. One key feature is interoperability between Microsoft R Server and sparklyr.
sparklyr, a package by RStudio, is an R interface to Apache Spark. It allows users to utilize Spark as the backend for dplyr, one of the most popular data manipulation packages. sparklyr also provides interfaces to Spark packages, lets users query data in Spark using SQL, and supports developing extensions in R through an interface to the full Spark API. Another key feature is that it lets users call Spark's integrated machine learning algorithms directly from within R. For H2O users, the Microsoft R Server sparklyr interop can be used to convert sparklyr data frames to H2O data frames. This allows data imported from Microsoft R Server to be used with H2O modeling and data partitioning algorithms via the rsparkling package. (To learn more about dplyr, please visit its CRAN site here.)
Microsoft R Server and sparklyr can now be used in tandem within a single Spark session. With this, data scientists and solution engineers can use all of Microsoft R Server's advanced machine learning algorithms on data prepared using the dplyr grammar.
Using Microsoft R Server with sparklyr
Prerequisites:
- A Hadoop cluster with Spark and a valid installation of Microsoft R Server
- Microsoft R Server configured for use with Hadoop and Spark (instructions here)
- Microsoft R Server SampleData loaded into HDFS
- gcc and g++ installed on the edge node where the example will be run
- Write permissions to the R Library Directory
- Read/Write permissions to HDFS directory /user/RevoShare
- An internet connection or the ability to download and manually install sparklyr and h2o
Note: If you are unfamiliar with using Microsoft R Server with Spark, please see here.
Note: To load SampleData into HDFS, please use one of the following approaches:
Microsoft R Server Functions:
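For example, here is a minimal sketch using RevoScaleR's Hadoop helper functions; the HDFS target path /share/SampleData and the choice of AirlineDemoSmall.csv are illustrative, not taken from the original post:

# Copy sample data shipped with Microsoft R Server into HDFS.
# rxGetOption("sampleDataDir") resolves the local sample-data directory.
library(RevoScaleR)
rxHadoopMakeDir("/share/SampleData")          # illustrative HDFS target directory
rxHadoopCopyFromLocal(
  file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv"),
  "/share/SampleData")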
Shell Script:
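Equivalently, a sketch using the hadoop command line; the local path to the RevoScaleR SampleData directory is illustrative and depends on your installation:

# Copy the Microsoft R Server sample data into HDFS with the hadoop CLI.
# Adjust the local SampleData path to match your MRS R library location.
hadoop fs -mkdir -p /share/SampleData
hadoop fs -copyFromLocal /usr/lib64/microsoft-r/*/lib64/R/library/RevoScaleR/SampleData/* /share/SampleData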
Installation of sparklyr package
If you are on HDI, please install the sparklyr package in the following way:
options(repos = "https://mran.microsoft.com/snapshot/2017-05-01")
install.packages("sparklyr")
If you are not on HDI, please install in the following way:
install.packages("sparklyr")
Example One: Load and Partition Data in sparklyr, Train and Predict in MRS
In this example, we will:
- Create a connection to Spark using rxSparkConnect(), specifying the sparklyr interop so that sparklyr and its interfaces connect through the same Spark session.
- Call rxGetSparklyrConnection() on the compute context to get a sparklyr connection object.
- Use dplyr to load mtcars into a Spark DataFrame via the sparklyr connection object.
- Partition the data in-Spark into a training and scoring set using dplyr.
- After partitioning, register the training set DataFrame in Spark as a Hive table.
- Train a model in ScaleR using rxLinMod() on an RxHiveData() object.
- Use the trained model to run a toy prediction with rxPredict() on the test partition.
- After prediction, compute the root mean squared error to assess accuracy.
Note: We will use the standard R dataset mtcars for this example; for more information, please see here.
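As a rough illustration of these steps, the following sketch uses the documented interop functions (rxSparkConnect(), rxGetSparklyrConnection(), RxHiveData(), rxLinMod(), rxPredict()) together with sparklyr's copy_to(), sdf_partition(), and sdf_register(); the table and variable names, split ratio, and model formula are illustrative rather than taken from the original sample:

library(sparklyr)
library(dplyr)

cc <- rxSparkConnect(interop = "sparklyr")      # Spark compute context with sparklyr interop enabled
sc <- rxGetSparklyrConnection(cc)               # the sparklyr connection behind the compute context

mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)             # load mtcars into Spark via dplyr

partitions <- mtcars_tbl %>%
  sdf_partition(training = 0.7, test = 0.3, seed = 1099)        # in-Spark train/test split

sdf_register(partitions$training, "mtcars_training")            # expose the training split as a table

model <- rxLinMod(mpg ~ wt + cyl,
                  data = RxHiveData(table = "mtcars_training")) # train with ScaleR on the Spark table

test_df <- as.data.frame(collect(partitions$test))              # bring the test split to the edge node
rxSetComputeContext("local")                                    # score the small local frame locally
pred <- rxPredict(model, data = test_df, writeModelVars = TRUE)
sqrt(mean((pred$mpg_Pred - pred$mpg)^2))                        # root mean squared error

rxSparkDisconnect(cc)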
Sample Code
Sample Output, Comments Removed
Example Two: Load Data with MRS, Partition and Train a Model with sparklyr
In this example, we will:
- Create a connection to Spark using rxSparkConnect(), specifying the sparklyr interop so that sparklyr and its interfaces connect through the same Spark session.
- Call rxGetSparklyrConnection() on the compute context to get a sparklyr connection object.
- Use Microsoft R Server to load data from many sources
- Partition the data in-Spark into a training and scoring set using dplyr.
- After partitioning, register the training set DataFrame in Spark as a Hive table.
- Train a model using sparklyr to call Spark ML algorithms
- Take a summary of the trained model to see estimates and errors
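A minimal sketch of this workflow is below. It uses rxImport() to read the AirlineDemoSmall.csv sample from the local Microsoft R Server sample-data directory (one of the many source types rxImport() supports) and sparklyr's ml_linear_regression(); the column names, row limit, split ratio, and formula are illustrative:

library(sparklyr)
library(dplyr)

# Load data with Microsoft R Server; rxImport also reads XDF, SAS, SPSS, ODBC, and more.
# missingValueString = "M" handles the sample's missing-value coding; numRows keeps the sketch small.
airline_df <- rxImport(file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv"),
                       missingValueString = "M", numRows = 10000)

cc <- rxSparkConnect(interop = "sparklyr")
sc <- rxGetSparklyrConnection(cc)

airline_tbl <- copy_to(sc, airline_df, "airline", overwrite = TRUE) %>%   # push the data into Spark
  filter(!is.na(ArrDelay))                                                # drop rows with missing delay

partitions <- airline_tbl %>%
  sdf_partition(training = 0.7, test = 0.3, seed = 123)                   # in-Spark split with sparklyr

sdf_register(partitions$training, "airline_training")                     # register the training split

fit <- ml_linear_regression(partitions$training, ArrDelay ~ CRSDepTime)   # Spark ML via sparklyr
summary(fit)                                                              # estimates and standard errors

rxSparkDisconnect(cc)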
Sample Code
Sample Output, Comments Removed
Example Three: Connect to Spark and Load Data with MRS, Cache Data with dplyr, Train a Model and Predict with H2O, Gather Metrics with MRS
In this example, we will:
- Create a connection to Spark using rxSparkConnect(), specifying the sparklyr interop so that sparklyr and its interfaces connect through the same Spark session.
- Call rxGetSparklyrConnection() on the compute context to get a sparklyr connection object.
- Use Microsoft R Server to load training and test data from HDFS
- Represent the data as Hive tables and cache the tables in Spark
- Convert the data to H2O frames for analysis
- Train a model using H2O's built-in GLM algorithm
- Print the model data
- Run a prediction on the test data with h2o.predict
- Compute the ROC and Area Under the Curve (AUC) to see how the model performed
Note: In the call to rxSparkConnect(), we define numExecutors, executorCores, and executorMem. These are the minimum requirements to run this example; allocating less memory to the Spark application may cause a hang on the call to as_h2o_frame().
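A minimal sketch of this workflow is below. The HDFS file names (mortDefaultTrain.csv and mortDefaultTest.csv under /share/SampleData), the Hive table names, the resource values passed to rxSparkConnect(), and the model columns (the mortgage-default sample's creditScore, yearsEmploy, ccDebt, and default) are illustrative placeholders, as is the assumption that rxDataStep() can write the HDFS text sources out to Hive tables in this compute context; rsparkling may also need options(rsparkling.sparklingwater.version = ...) set to match your Spark version before connecting:

library(sparklyr)
library(dplyr)
library(rsparkling)   # bridge between sparklyr and H2O
library(h2o)

cc <- rxSparkConnect(interop = "sparklyr",
                     numExecutors = 4, executorCores = 8, executorMem = "4g")
sc <- rxGetSparklyrConnection(cc)

# Load training and test data from HDFS with Microsoft R Server and land them in Hive
hdfs <- RxHdfsFileSystem()
rxDataStep(RxTextData("/share/SampleData/mortDefaultTrain.csv", fileSystem = hdfs),
           RxHiveData(table = "mort_train"), overwrite = TRUE)
rxDataStep(RxTextData("/share/SampleData/mortDefaultTest.csv", fileSystem = hdfs),
           RxHiveData(table = "mort_test"), overwrite = TRUE)

# Reference the Hive tables through dplyr and cache them in Spark
tbl_cache(sc, "mort_train")
tbl_cache(sc, "mort_test")
train_tbl <- tbl(sc, "mort_train")
test_tbl  <- tbl(sc, "mort_test")

# Convert the Spark DataFrames to H2O frames via rsparkling
train_h2o <- as_h2o_frame(sc, train_tbl)
test_h2o  <- as_h2o_frame(sc, test_tbl)
train_h2o$default <- as.factor(train_h2o$default)   # binomial GLM needs a factor response
test_h2o$default  <- as.factor(test_h2o$default)

# Train an H2O GLM, print it, and predict on the test frame
fit <- h2o.glm(x = c("creditScore", "yearsEmploy", "ccDebt"), y = "default",
               training_frame = train_h2o, family = "binomial")
print(fit)
pred <- h2o.predict(fit, test_h2o)

# Gather metrics with Microsoft R Server: ROC and AUC via rxRoc() and rxAuc()
scored <- as.data.frame(h2o.cbind(test_h2o["default"], pred["p1"]))
scored$default <- as.numeric(as.character(scored$default))   # back to 0/1 for rxRoc
rxSetComputeContext("local")                                 # evaluate the small local frame locally
roc <- rxRoc(actualVarName = "default", predVarNames = "p1", data = scored)
rxAuc(roc)

rxSparkDisconnect(cc)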
Sample Code
Sample Output, Comments Removed
Conclusion
The ability to use both Microsoft R Server and sparklyr from within one Spark session will allow Microsoft R Server users to quickly and seamlessly utilize features provided by sparklyr within their solutions.
-----
Author: Kirill Glushko, Premal Shah
For a comprehensive view of all the capabilities in Microsoft R Server 9.1, refer to this blog.