First published on MSDN on Jan 20, 2017
Microsoft R Server supports four kinds of R transformations: transformFunc, transforms (a list of transform expressions), rowSelection (a logical expression), and in-line expressions in formulas. This article focuses on how to use "transforms" and "transformFunc" to do variable transformation. All of the following examples use an RxSqlServerData source from the test database of Microsoft R Server as the input data of the function rxDataStep, and run in the SQL compute context. (Please note: transforms, transformFunc, and their related parameters can be used in all compute contexts, including Teradata, Hadoop, and Spark.) To understand how data transformation works in RevoScaleR, let's go through some concepts first:
- Lexical scoping: when R evaluates an expression, it first looks for each object in the local environment. If the object is not found by name there, R searches the enclosing environment of the local environment; if it is not there either, R searches the enclosing environment of that environment, and so on.
- Dynamic scoping: looking up variables in the calling environment rather than in the enclosing environment.
- Calling environment: the environment from which the function was called. parent.frame() returns the calling environment.
- Enclosing environment: the environment in which the function was created; it is used for lexical scoping. Every function has one and only one enclosing environment. parent.env() returns the enclosing environment of a given environment, and environment(f) returns the enclosing environment of a function f.
For more information about scoping and environments in R, see the references at the end of this article.
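The difference between the two lookup rules can be seen in a small base R sketch. It does not use RevoScaleR, and all names in it (scopeDemoValue, lookupLexical, lookupDynamic, caller) are made up for illustration.

```r
# The same name is bound in two places: where the functions are created
# and inside the function that calls them.
scopeDemoValue <- "enclosing"

lookupLexical <- function() {
  # Lexical scoping: not found locally, so R searches the enclosing
  # environment of lookupLexical (the place where it was created).
  scopeDemoValue
}

lookupDynamic <- function() {
  # Dynamic scoping: explicitly search the calling environment instead.
  get("scopeDemoValue", envir = parent.frame())
}

caller <- function() {
  scopeDemoValue <- "calling"
  c(lexical = lookupLexical(), dynamic = lookupDynamic())
}

scopingResult <- caller()
print(scopingResult)
```

The lexical lookup ignores the value set inside caller, while the lookup through parent.frame() picks it up.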
1. Using transforms
In the rx functions of RevoScaleR, the transforms argument takes an expression of the form list(name = expression, ...) representing the first round of variable transformations. Each expression returns a vector. Within the expression you can change the content or data type of a vector, or remove it.
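As a rough illustration of the list(name = expression, ...) form, the base R sketch below evaluates such a list against one in-memory chunk of data. It only imitates the idea and is not how RevoScaleR is implemented; the column names (delay, day, isWeekend) and the chunk itself are hypothetical. In RevoScaleR the same list would be supplied as the transforms argument of rxDataStep.

```r
# Hypothetical chunk of data: a named list of equal-length vectors.
chunkData <- list(delay = c(5.4, 12.9, -3.2), day = c(1L, 6L, 7L))

# A transforms-style expression: list(name = expression, ...).
transformsExpr <- quote(list(
  delay     = as.integer(delay),     # change the data type of a variable
  isWeekend = day %in% c(6L, 7L)     # create a new variable
))

# Evaluate the expressions with the chunk's variables in scope,
# then write the results back into the chunk by name.
evaluated <- eval(transformsExpr, envir = chunkData)
transformedChunk <- chunkData
transformedChunk[names(evaluated)] <- evaluated

str(transformedChunk)
```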
2. Using transformFunc
The transformFunc argument differs from transforms: it is an R function whose first argument and return value are named R lists with equal-length vector elements. The output list of transformFunc can contain modified elements or newly named elements. It is the recommended way to do variable transformation.
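The shape of such a function can be sketched in base R without RevoScaleR. The function name addDelayFlags and the variables delay, day, and isLate are hypothetical; in RevoScaleR a function of this shape would be passed as the transformFunc argument.

```r
# A transformFunc-style function: the first argument is a named list of
# equal-length vectors, and the return value is a list of the same kind.
addDelayFlags <- function(dataList) {
  dataList$delay  <- as.integer(dataList$delay)  # modify an existing element
  dataList$isLate <- dataList$delay > 10L        # add a newly named element
  dataList
}

# Apply it by hand to one hypothetical chunk of data.
flightChunk  <- list(delay = c(5.4, 12.9, -3.2), day = c(1L, 6L, 7L))
flaggedChunk <- addDelayFlags(flightChunk)

str(flaggedChunk)
```

Every element of the returned list keeps the same length, which is the property the argument description above asks for.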
3. Using transforms with UDF and "unknown" variable in UDF
transforms can also be defined in the function call using the expression function. When you use a UDF in transforms, you need to pass the UDF to the remote side through the transformObjects argument, because the transforms expression is evaluated on the server side.
If the UDF refers to an "unknown" variable, you also need to specify that variable in transformObjects, which passes the object into the calling environment.
To access the "unknown" variable in the UDF, you have to use dynamic scoping so that R looks up the variable in the calling environment; otherwise, following lexical scoping, R looks it up in the enclosing environment of the UDF.
Note: here, the R expression constant <- get("constant", parent.frame()) performs the dynamic scoping.
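The effect of that get() call can be reproduced in base R. In the sketch below, remoteEnv is a hypothetical stand-in for the environment that receives the objects listed in transformObjects, and both UDF names are made up; it illustrates the lookup rules only and does not call RevoScaleR.

```r
# UDF that relies on lexical scoping only: `constant` is searched in the
# function's enclosing environment, where it does not exist.
addConstantLexical <- function(x) x + constant

# UDF that uses dynamic scoping, as in the note above.
addConstantDynamic <- function(x) {
  constant <- get("constant", parent.frame())
  x + constant
}

# Hypothetical stand-in for the environment holding the passed objects.
remoteEnv <- new.env()
remoteEnv$constant <- 100
remoteEnv$addConstantLexical <- addConstantLexical
remoteEnv$addConstantDynamic <- addConstantDynamic

# Evaluating the call inside remoteEnv makes remoteEnv the calling environment.
dynamicResult <- eval(quote(addConstantDynamic(c(1, 2, 3))), envir = remoteEnv)

lexicalResult <- tryCatch(
  eval(quote(addConstantLexical(c(1, 2, 3))), envir = remoteEnv),
  error = function(e) "object not found"
)
```

Only the version that reads from parent.frame() sees the value of constant; the purely lexical version fails because its enclosing environment never received the object.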
4. Using transformFunc with UDF and "unknown" variable in UDF
Unlike transforms, transformFunc is evaluated on the client side, so you do not need to pass the UDF name to the server side. In addition, the objects specified by transformObjects are passed to the enclosing environment of the transformation function, so you do not need dynamic scoping when you use transformFunc to do variable transformation.
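A base R sketch of this situation follows. The assignment to environment() is a hypothetical stand-in for what the text says happens to the objects from transformObjects; the names addConstantFunc, objectsEnv, value, and shifted are illustrative only.

```r
# transformFunc-style function that refers to `constant` directly.
addConstantFunc <- function(dataList) {
  # Plain lexical lookup: no get() or parent.frame() is needed.
  dataList$shifted <- dataList$value + constant
  dataList
}

# Stand-in for the described behaviour: the passed objects end up in the
# enclosing environment of the transformation function.
objectsEnv <- list2env(list(constant = 100), parent = globalenv())
environment(addConstantFunc) <- objectsEnv

shiftedChunk <- addConstantFunc(list(value = c(1, 2, 3)))
```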
5. Using transformEnvir with transforms
transformEnvir is a user-defined environment. It is used as the parent environment of the transformation functions and contains the data specified by transformObjects. If the transform functions reference multiple objects, you can bind those objects to a user-defined environment and pass just that environment to the remote side through transformEnvir, instead of listing all the objects in transformObjects.
However, when using transforms to do variable transformation, you should set the user-defined environment as the enclosing environment of the transformation function; otherwise R cannot find the "unknown" variable and the function in either the calling or the enclosing environment.
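The pattern of binding everything to one environment and making it the enclosing environment of the UDF might look like the base R sketch below. All names (userEnv, scaleFactor, scaleAndShift, value, scaled) are hypothetical, and the eval() call merely imitates evaluating a transforms expression; in RevoScaleR the environment would be supplied as transformEnvir.

```r
# User-defined environment holding everything the transformation needs.
userEnv <- new.env()
userEnv$constant    <- 100
userEnv$scaleFactor <- 2

scaleAndShift <- function(x) x * scaleFactor + constant

# Make userEnv the enclosing environment of the UDF. Without this line the
# UDF would search the environment where it was defined, which contains
# neither scaleFactor nor constant.
environment(scaleAndShift) <- userEnv
userEnv$scaleAndShift <- scaleAndShift

# Evaluate a transforms-style expression against a hypothetical chunk, with
# userEnv as the fallback for names that are not variables in the data.
chunk <- list(value = c(1, 2, 3))
envirResult <- eval(quote(list(scaled = scaleAndShift(value))),
                    envir = chunk, enclos = userEnv)
```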
6. Using transformEnvir with transformFunc
When you use transformEnvir with transformFunc, the user-defined environment specified in transformEnvir is passed to the remote side. All the variables and functions bound to this user-defined environment are available in both the calling and the enclosing environment, so you do not need to set up the enclosing environment of the R transformation function in your R script.
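To make the contrast with the previous section concrete, here is a base R sketch in which the script never touches the environment of the transformation function. The helper applyWithEnvir is hypothetical: it only imitates the behaviour described above and is not a RevoScaleR function. The names sharedEnv, twice, and shiftDoubled are likewise illustrative.

```r
# User-defined environment with a variable and a helper function.
sharedEnv <- new.env()
sharedEnv$constant <- 100
sharedEnv$twice    <- function(x) 2 * x

# The transformation function is written without setting its environment.
shiftDoubled <- function(dataList) {
  dataList$result <- twice(dataList$value) + constant
  dataList
}

# Hypothetical helper imitating the described behaviour: the supplied
# environment becomes the enclosing environment before the function runs.
applyWithEnvir <- function(transformFunc, transformEnvir, dataList) {
  environment(transformFunc) <- transformEnvir
  transformFunc(dataList)
}

envirFuncResult <- applyWithEnvir(shiftDoubled, sharedEnv,
                                  list(value = c(1, 2, 3)))
```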
Summary
transformFunc is the recommended way to do variable transformation; for details on how to use it, see rxTransform. Although transforms can also be used for variable transformation, the two arguments differ in how R scoping and environments apply to them and in where they are evaluated.
REFERENCE
Lexical Scope and Function Closures in R
Environments
rxDataStep
rxTransform