RxTextData to perform analytics after potentially uploading the text data to HDFS. With the Microsoft R Server 9.0 release, the Spark compute context supports Hive and Parquet data sources, so you can work with them directly. This post works through an example that uses a Hive data source; Parquet will be covered in a future post.
hadoop fs -mkdir -p /share/SampleData
hadoop fs -copyFromLocal /usr/lib64/microsoft-r/3.3/lib64/R/library/RevoScaleR/SampleData/* /share/SampleData/
hadoop fs -ls /share/SampleData
--queue <queue_name> parameter):
spark-shell --master yarn
RxHiveData to get summary information.
> rxSummary(~., hive_data)
Call:
rxSummary(formula = ~., data = hive_data)
Summary Statistics Results for: ~.
Data: hive_data (RxSparkData Data Source)
Number of valid observations: 6e+05
 Name       Mean     StdDev    Min        Max        ValidObs MissingObs
 arrdelay   11.31794 40.688536 -86.000000 1490.00000 582628   17372
 crsdeptime 13.48227  4.697566   0.016667   23.98333 600000       0
Category Counts for dayofweek
Number of categories: 7
Number of valid observations: 6e+05
Number of missing observations: 0
dayofweek Counts
Monday 97975
Tuesday 77725
Wednesday 78875
Thursday 81304
Friday 82987
Saturday 86159
Sunday 94975
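For readers reconstructing the steps, here is a minimal sketch of how a summary like the one above might be produced. It assumes Microsoft R Server 9.0 or later with a reachable Spark cluster. The helper name `summarize_hive_table` and the table name in the usage comment are illustrative assumptions, not taken from the original post; the weekday levels are copied from the output above.

```r
# Weekday labels as they appear in the summary output.
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Hypothetical helper: point RxHiveData at a Hive table and summarize it.
# Needs RevoScaleR (Microsoft R Server 9.0+) and a Spark cluster to run.
summarize_hive_table <- function(table_name) {
  # Start a Spark session; this also makes Spark the active compute context.
  cc <- RevoScaleR::rxSparkConnect()
  on.exit(RevoScaleR::rxSparkDisconnect(cc), add = TRUE)

  # Declare dayofweek as a factor so it is reported as category counts.
  hive_data <- RevoScaleR::RxHiveData(
    query = paste("select * from", table_name),
    colInfo = list(dayofweek = list(type = "factor", levels = day_levels))
  )

  RevoScaleR::rxSummary(~., data = hive_data)
}

# Usage (the table name is an assumption):
# summarize_hive_table("AirlineDemoSmall")
```

Wrapping the calls in a function keeps the script safe to source on a machine without a cluster; nothing touches Spark until the helper is called.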
> rxGetVarInfo(xdfOutput)
Var 1: arrdelay, Type: integer, Low/High: (-86, 1490)
Var 2: crsdeptime, Type: numeric, Storage: float32, Low/High: (0.0167, 23.9833)
Var 3: dayofweek
7 factor levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Var 4: arrdelay15, Type: logical, Low/High: (0, 1)
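The variable listing above suggests the Hive data was written to an XDF file with an extra logical column, arrdelay15. A sketch of one way to do that with rxDataStep follows. The helper name `write_delay_xdf`, the HDFS path in the usage comment, and the 15-minute threshold (inferred only from the column name) are all assumptions.

```r
# Hypothetical helper: copy a Hive data source into an XDF file on HDFS,
# adding a logical flag for delayed arrivals.
write_delay_xdf <- function(hive_data, xdf_path) {
  xdf_output <- RevoScaleR::RxXdfData(
    xdf_path,
    fileSystem = RevoScaleR::RxHdfsFileSystem()
  )

  RevoScaleR::rxDataStep(
    inData = hive_data,
    outFile = xdf_output,
    # Assumption: "15" in the column name means more than 15 minutes late.
    transforms = list(arrdelay15 = arrdelay > 15),
    overwrite = TRUE
  )

  # Return the data source so it can be passed to rxGetVarInfo and friends.
  xdf_output
}

# Usage (path is an assumption):
# xdfOutput <- write_delay_xdf(hive_data, "/share/AirlineDemoSmallXdf")
# rxGetVarInfo(xdfOutput)
```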
> head(myData)
arrdelay dayofweek
1 285 Monday
2 284 Tuesday
3 281 Tuesday
4 278 Wednesday
5 288 Wednesday
6 294 Wednesday
> logitObj
Logistic Regression Results for: arrdelay15 ~ dayofweek + crsdeptime
Data: hive_data (RxSparkData Data Source)
Dependent variable(s): arrdelay15
Total independent variables: 9 (Including number dropped: 1)
Number of valid observations: 582628
Number of missing observations: 17372
Coefficients:
arrdelay15
(Intercept) -2.01814346
dayofweek=Monday 0.06295299
dayofweek=Tuesday -0.09538265
dayofweek=Wednesday -0.12945236
dayofweek=Thursday -0.19226847
dayofweek=Friday 0.26043331
dayofweek=Saturday 0.01939645
dayofweek=Sunday Dropped
crsdeptime 0.06846911
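To make the coefficients above easier to read, the following base R sketch turns them into predicted probabilities. The numbers are copied from the output; the function name `predict_delay_prob` and the two example flights are illustrative. It assumes crsdeptime is the scheduled departure time in decimal hours, as the 0.0167 to 23.9833 range suggests.

```r
# Coefficients copied from the logistic regression output above.
# Sunday is the dropped (reference) level, so its effect is 0.
intercept       <- -2.01814346
crsdeptime_coef <-  0.06846911
day_effect <- c(
  Monday    =  0.06295299,
  Tuesday   = -0.09538265,
  Wednesday = -0.12945236,
  Thursday  = -0.19226847,
  Friday    =  0.26043331,
  Saturday  =  0.01939645,
  Sunday    =  0
)

# Predicted probability that arrdelay15 is TRUE for a given day and hour.
predict_delay_prob <- function(day, dep_hour) {
  log_odds <- intercept + day_effect[[day]] + crsdeptime_coef * dep_hour
  plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))
}

p_friday_5pm   <- predict_delay_prob("Friday", 17)
p_thursday_8am <- predict_delay_prob("Thursday", 8)

# Multiplicative change in the odds of a delay per extra hour of departure time.
odds_ratio_per_hour <- exp(crsdeptime_coef)
```

Under this model a Friday 5 pm departure has a delay probability of about 0.36, a Thursday 8 am departure about 0.16, and each later hour raises the odds of a delay by roughly 7 percent.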