First published on MSDN on Apr 12, 2017
, allows you to run arbitrary R functions in a distributed fashion, using available nodes (computers) or available cores (the maximum of which is the sum over all available nodes of the processing cores on each node). The rxExec approach exemplifies the traditional high-performance computing approach: when using rxExec, you largely control how the computational tasks are distributed and you are responsible for any aggregation and final processing of results.
release of Microsoft R Server 9.1
, we have a new function called rxExecBy()
rxExecBy(inData, keys, func, funcParams = NULL, filterFunc = NULL)
which can be used to partition input data source by keys and apply user defined function on individual partitions. If input data source is already partitioned, we can apply user defined function on partitions directly. A typical family of use cases that is a good fit for this implementation is building a customized machine learning model or models per data partition. rxExecBy() is currently supported in local, localpar, SQL Server and Spark Compute Contexts. The data source objects that can be passed to rxExecBy() using the inData parameter is dependent on the currently active compute context and its supported data source objects. Here is a quick Compute Context Data Source flowchart to understand the different supported scenarios for rxExecBy() :
[caption id="attachment_1005" align="aligncenter" width="861"]
rxExecBy() ComputeContext - DataSource Support FlowChart[/caption]
Now let us look at code samples for all possible scenarios in the above chart:
SPARK COMPUTE CONTEXT
In this example, we will build a linear model on AirlineDemoSmall Data using lm() function, groupby key DayOfWeek.
NOTE: 'filterFunc' parameter is currently not supported in RxSpark compute context and is ignored for computation.
SQL COMPUTE CONTEXT
In this example, we will use airlinedemosmall table data, groupby key DayOfWeek and count the number of rows.
LOCAL (OR) LOCAL PARALLEL COMPUTE CONTEXT
In this example, we will build a linear model on Claims Data using rxLinMod() function, groupby multiple keys (age, type) and filtering out 60+ age and type A.