First published on MSDN on Feb 15, 2017
Introduction
Our projects often require exporting huge amounts of data (several gigabytes), and the conventional solution, write.csv, can test anyone’s patience with the time it takes.
In this blog, we will learn by doing. We make use of a package that is not widely known, but serves the purpose really well.
Package Feather
In the words of the Revolution Analytics blog, the feather package is:
“A collaboration of Wes McKinney and Hadley Wickham, to create a standard data file format that can be used for data exchange by and between R, Python and any other software that implements its open-source format.”
When we export data with feather, it is stored in a binary file, which makes it more compact (a 10-digit integer takes just 4 bytes, instead of the 10 ASCII characters required by a CSV file). There’s no need to convert numbers to text and back again, which speeds up both reading and writing. Additionally, feather is a column-oriented file format, which matches R’s internal representation of data frames.
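The size argument can be checked in base R. The following is a minimal sketch, not part of the original benchmark: it encodes one 10-digit integer as binary and as text and counts the bytes, then shows that a data frame is stored as a list of column vectors. The variable names are illustrative only.

```r
# A 10-digit integer that still fits in R's 32-bit integer type.
x <- 1234567890L

# Binary encoding: writeBin() with a raw vector as target returns the bytes.
binary_bytes <- length(writeBin(x, raw()))

# Text encoding: one ASCII byte per digit, as a CSV file would store it.
text_bytes <- length(charToRaw(as.character(x)))

cat("binary:", binary_bytes, "bytes; text:", text_bytes, "bytes\n")

# A data frame is a list holding one vector per column (column-oriented).
df <- data.frame(id = 1:3, value = c(0.5, 1.5, 2.5))
print(is.list(df))
print(names(df))
```

The text form also needs a separator and has to be parsed back into a number on import, which is the conversion cost a binary format avoids.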
Code
With the primary goal of reducing export time in R, I created a random dataset of 25,000,000 rows and 3 columns and exported it with both methods, comparing the time each one takes to write the data to a CSV file or to a binary feather file.
Here’s the sample code I used:
####################################
install.packages("data.table")
install.packages("stringi")
install.packages("feather")
library(feather)
library(data.table)
library(stringi)
num <- 10000      # number of distinct random strings
size <- 25000000  # total number of rows
path0 <- "D:\\dataset101.feather"
path1 <- "D:\\dataset102.csv"
#######Generating Random DataSet##########
dataset <- data.table(col1 = rep(stri_rand_strings(num, 10), size / num),
                      col2 = rep(1:(size / num), each = num),
                      col3 = rnorm(size))
#######Comparing Methods to Export#########
#1 Using 'feather'
print(system.time(write_feather(dataset, path0)))
#2 Using 'write.csv'
print(system.time(write.csv(dataset, path1)))
#####################################
Output:
#1 Using 'feather'
   user  system elapsed
   1.86    1.21    6.11
#2 Using 'write.csv'
   user  system elapsed
 437.80    6.64  452.89
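To try the comparison without waiting several minutes, the same measurement can be run on a much smaller dataset. This is a hedged sketch rather than the code used for the timings above: `time_export` is a hypothetical helper, the row count and temporary file paths are assumptions, and the feather step is skipped if the package is not installed. The last lines work out the speed-up implied by the reported elapsed times.

```r
# Hypothetical helper: time one export call and return elapsed seconds.
time_export <- function(write_fun, data, path) {
  unname(system.time(write_fun(data, path))["elapsed"])
}

set.seed(1)
n <- 100000  # assumption: far smaller than the 25,000,000 rows in the post
small <- data.frame(col1 = sample(letters, n, replace = TRUE),
                    col2 = seq_len(n),
                    col3 = rnorm(n),
                    stringsAsFactors = FALSE)

csv_path <- tempfile(fileext = ".csv")
csv_time <- time_export(write.csv, small, csv_path)
cat("write.csv elapsed:", csv_time, "s\n")

# feather is optional here: skip this step if the package is not installed.
if (requireNamespace("feather", quietly = TRUE)) {
  feather_path <- tempfile(fileext = ".feather")
  feather_time <- time_export(feather::write_feather, small, feather_path)
  cat("write_feather elapsed:", feather_time, "s\n")
}

# Speed-up implied by the elapsed times reported in the post.
reported_speedup <- 452.89 / 6.11
cat("reported speed-up: about", round(reported_speedup), "times\n")
```

Timings on a small dataset will vary between machines and runs, so the ratio seen here need not match the one from the full 25,000,000-row test.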
Conclusion
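Here, we have seen that the feather package is one of the most efficient ways to export large datasets from R: in this test it wrote 25,000,000 rows in about 6 seconds, compared with about 453 seconds for write.csv. The same file format can also be used to import data.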
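Importing was not timed in this post, but a round trip is easy to sketch. The example below assumes the feather package is installed (it is skipped otherwise) and uses a tiny illustrative data frame and a temporary file rather than the benchmark dataset.

```r
if (requireNamespace("feather", quietly = TRUE)) {
  df <- data.frame(col1 = c("a", "b", "c"),
                   col2 = 1:3,
                   col3 = c(0.1, 0.2, 0.3),
                   stringsAsFactors = FALSE)
  path <- tempfile(fileext = ".feather")

  # Export, then import the same file again.
  feather::write_feather(df, path)
  back <- feather::read_feather(path)

  # Inspect the column classes and compare the values with the original.
  print(sapply(back, class))
  print(all.equal(as.data.frame(back), df))
}
```

Because the file stores typed columns, the import step does not need to guess column types from text the way a CSV reader does.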
In the next blog, we will look at a few other options to do the same, and compare them with feather.
The bigmemory package also works well with R but comes with a limitation: it can import and export datasets of only one data type. It is designed to work on matrices, and matrices in R support only one type of data.
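The single-type restriction of matrices can be seen in base R without bigmemory. A small sketch with illustrative values:

```r
# Mixing numbers and strings in a matrix coerces every cell to character.
m <- cbind(id = 1:3, name = c("a", "b", "c"))
print(typeof(m))

# A data frame keeps a separate type for each column.
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
print(sapply(df, typeof))
```

A dataset like the one used in this post, with a string column next to numeric columns, therefore cannot be held in a single matrix without losing its numeric types.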
For more information on types of data structures in R, please refer to this link.
Blog Author
Prashant Babber,
ASSC Consultant, Data Insights, MACH,
IGD