
NST773 Data Mining and Statistical Learning II

Topic 3
Handling Large Data in R
Xiaogang Su, Ph.D.
School of Nursing
University of Alabama at Birmingham
xgsu@uab.edu

INTRODUCTION

Data mining and statistical learning routinely involve very large datasets. Nevertheless, one major
drawback of R in its original design is its limited ability to deal with large data sets. This can be
exemplified by two simple examples. The first illustrates a problem with system memory limitations.
Rather than processing data in chunks, R holds all currently available objects in virtual memory, but an
ordinary computer has only 2-4 GB of memory available, depending on the operating system (32- or
64-bit).
> x <- rep(0, 2^31-1)
Error: cannot allocate vector of length 2147483647

The second example concerns the maximum representable value for different types of numbers. The
largest representable integer in R is 2^31 - 1.
> as.integer(2^31-1)
[1] 2147483647

> as.integer(2^31)
[1] NA
Warning message:
NAs introduced by coercion
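
Two base R utilities, object.size() and gc(), are handy for diagnosing such memory problems; the
following is a minimal sketch (the vector size is arbitrary):

> x <- rnorm(1e6)                       # roughly 8 MB of double-precision values
> print(object.size(x), units = "Mb")   # memory taken by this one object
> rm(x); gc()                           # free the object and report total memory use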

How to solve this inherent problem has attracted tremendous effort from many R users, culminating
when participants were challenged to analyze the complete airline on-time performance data in the
2009 JSM Data Expo (http://stat-computing.org/dataexpo/2009/the-data.html). The data set is
approximately 11 GB in total (about 120 million rows and 29 columns). We shall use this data set for
illustration; more details are given below.

The history of large data set analysis in R is rather short. In this topic, I provide a concise overview of
R packages for handling massive data and illustrate some of them. Throughout the course, most
statistical learning methods are demonstrated on moderately sized data for the sake of convenience.
But bear in mind that you may want to use the packages covered in this topic when your own projects
call for the analysis of huge data sets.


A BRIEF OVERVIEW

The earliest approach suggested for handling large data in R resorts to SQL. The shortcoming is that
you have to learn a different programming language, although SQL is easy to pick up. The R Special
Interest Group on Databases has developed a number of packages that provide an R interface to
commonly used relational database management systems (RDBMS) such as MySQL (RMySQL),
PostgreSQL (RPgSQL), and Oracle (ROracle). These packages use the S4 classes and generics defined
in the DBI package and have the advantage of offering much better database functionality, inherited
via the use of a true database management system. However, this benefit comes at the cost of having
to install and use third-party software. While installing an RDBMS may not be an issue (many systems
have one preinstalled, and the RSQLite package comes bundled with the source for its RDBMS), the
need for the RDBMS and knowledge of structured query language (SQL) nevertheless adds some
overhead. This overhead may be an impediment for users who need a database only for simpler
applications.

One popular package along these lines is RMySQL. This package allows you to connect R to a
MySQL server. MySQL, which is claimed to be “the world’s most popular open source database,” is a
mid-size, multi-platform RDBMS popular in the open source community. Its advantages include high
performance, open-source licensing, and free use for non-commercial purposes. To install this package
properly, you need to download both the MySQL server and RMySQL; refer to Appendix A for the
specific steps. The book “Data Mining with R: Learning with Case Studies” by Luis Torgo (2003),
available from
http://www.liaad.up.pt/~ltorgo/DataMiningWithR/book.html,
is exclusively based on RMySQL.
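
As a minimal sketch of a typical RMySQL session (the server, database name, and credentials below
are placeholders and assume a running local MySQL server):

library(RMySQL)
con <- dbConnect(MySQL(), user = "user", password = "pass",
                 dbname = "test", host = "localhost")   # placeholder credentials
dbWriteTable(con, "mtcars", mtcars, overwrite = TRUE)   # push a small data frame to the server
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS mpg FROM mtcars GROUP BY cyl")
dbDisconnect(con)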

Another very popular SQL database engine is SQLite (http://sqlite.org/). You may want to check out
the RSQLite package, which embeds the SQLite database engine in R and provides an interface
compliant with the DBI package. Note that the source for the RDBMS is bundled within the RSQLite
package, so no additional third-party software needs to be downloaded.
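
A self-contained sketch using RSQLite and DBI (the ":memory:" database below is transient and used
only for illustration):

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), ":memory:")   # transient in-memory SQLite database
dbWriteTable(con, "mtcars", mtcars)
dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")
dbDisconnect(con)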

The sqldf package provides an easy way of working with moderately large datasets without having to
install an SQL database and load the data into it. Under the hood, sqldf uses the SQLite database
engine. For illustration, check out a presentation by Soren Hojsgaard:
http://gbi.agrsci.dk/~shd/misc/Rdocs/R-largedata.pdf.
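
A short, hedged sketch of sqldf (the commented line assumes an uncompressed 2008.csv from the
airline data; read.csv.sql() pulls only the rows selected by the SQL filter into R):

library(sqldf)
## Query a data frame that is already in the workspace
sqldf("SELECT cyl, AVG(mpg) AS mpg FROM mtcars GROUP BY cyl")
## Filter a large csv file at read time; only January flights reach R
# flights <- read.csv.sql("2008.csv", sql = "SELECT * FROM file WHERE Month = 1")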

At the same time, several R packages allow direct interaction with large datasets within R. The idea is
to avoid loading the whole dataset into memory. Here is a list of these packages:

I. The filehash package: http://cran.r-project.org/package=filehash by Roger D. Peng
II. The ff package: http://ff.r-forge.r-project.org/ by Daniel Adler et al.
III. The bigmemory package: http://www.bigmemory.org by Michael J. Kane and John W. Emerson.

R PACKAGE – filehash

The first package for solving the large data problem, filehash, is contributed by Roger Peng. The
rationale of filehash is to dump a large data set or object onto the hard drive and to assign an
environment name to the dumped object. You can then access the database through the assigned
environment. The whole procedure avoids using memory to hold the large object.

Some basic steps in filehash are illustrated as follows. First, databases are created using the
dbCreate function and must be initialized (via dbInit) in order to be accessed. The dbInit
function returns an S4 object inheriting from class “filehash”.

dbCreate("mydb")
db <- dbInit("mydb")

The primary interface to filehash databases consists of the functions dbFetch, dbInsert,
dbExists, dbList, and dbDelete. These functions are all generic—specific methods exist for
each type of database backend. They all take as their first argument an object of class “filehash”. To
insert some data into the database we can simply call dbInsert. We can then retrieve those data
values with dbFetch.

dbInsert(db, "a", rnorm(100))


value <- dbFetch(db, "a")
mean(value)

The function dbList lists all of the keys that are available in the database, dbExists tests to see if
a given key is in the database, and dbDelete deletes a key-value pair from the database.

dbInsert(db, "b", 123)


dbDelete(db, "a")
dbList(db)
dbExists(db, "a")

Another very useful command is dbLoad(), which works just like attach(). The objects are attached,
but remain stored on the local hard disk. We may also access the objects in the filehash database using
the standard R subset and accessor functions $, [[, and [.

db$a <- rnorm(100, 1)
mean(db$a)
mean(db[["a"]])
db$b <- rnorm(100, 2)
dbList(db)
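
For instance, a brief sketch of dbLoad(), assuming the keys "a" and "b" created above are still in the
database:

dbLoad(db)    # make the stored keys available by name, like attach()
mean(a)       # "a" is fetched lazily from disk, not held in memory
mean(b)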

Finally, there is a method for the with() generic function, which operates much like using with() on
lists or environments. For example, the following statement computes both stored means in one call:

with(db, c(a = mean(a), b = mean(b)))

When a database is initialized using the default “DB1” format, a file connection is opened for reading
and writing to the database file on disk. This file connection remains open until the database is closed
via dbDisconnect or the database object in R is removed. Since there is a hard limit on the number of
file connections that can be open at once, some protection is needed to make sure that file connections
are closed properly.
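
For example, when you are done with the database you can release its connection explicitly:

dbDisconnect(db)    # close the file connection held by the "DB1" database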


There are a few other utilities included with the filehash package. Two of the utilities,
dumpObjects and dumpImage, are analogues of save and save.image in R. Rather than save
objects to an R workspace, dumpObjects saves the given objects to a “filehash” database so that in
the future, individual objects can be reloaded if desired. Similarly, dumpImage saves the entire
workspace to a “filehash” database. The function dumpList takes a list and creates a “filehash”
database with values from the list. The list must have a non-empty name for every element in order
for dumpList to succeed. dumpDF creates a “filehash” database from a data frame where each
column of the data frame is an element in the database. Essentially, dumpDF converts the data frame
to a list and then calls dumpList.
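
A short sketch of these utilities (the object names and values here are arbitrary):

x <- rnorm(1000)
y <- runif(1000)
dumpObjects(x, y, dbName = "db02")    # like save(), but into a filehash database
dumpList(list(m = mean(x), s = sd(x)), dbName = "db03")    # every element must be named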

Here is an example of how to use dumpDF.

dumpDF(read.table("large.dat", header = TRUE), dbName = "db01")
env01 <- db2env(db = "db01")

The first argument of dumpDF() is a data object. The data are read inside the call to dumpDF(), so R
never keeps a full copy of them in memory; if you instead assigned the output of read.table() to an
object first, memory would hold a copy of the data, which is undesirable. The large data set
"large.dat" can now be accessed through the environment env01 with the with() function, using the
variable names directly. Suppose we want to regress "y" on "x". Using with(), we can fit the model or
compute summary statistics as usual:

fit <- with(env01, lm(y ~ x))
with(env01, mean(y))
with(env01, y[1] <- 2)

PACKAGE ff

The second package is ff, with which I have had little experience so far. The manual can be found at
http://cran.r-project.org/web/packages/ff/ff.pdf, and a few presentations are available at
http://ff.r-forge.r-project.org/.
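
As a hedged sketch of the basic ff idea (read.csv.ffdf() is ff's disk-backed counterpart of read.csv();
the uncompressed airline file 2008.csv is used here only as a placeholder):

library(ff)
## Read a large csv into a disk-backed ffdf object, in chunks of 100,000 rows
airline <- read.csv.ffdf(file = "2008.csv", header = TRUE, next.rows = 100000)
dim(airline)                              # dimensions known without loading the data into RAM
mean(airline$ArrDelay[], na.rm = TRUE)    # [] materializes a single column in memory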

PACKAGE bigmemory

The bigmemory package (http://www.bigmemory.org/) is the latest release along similar lines.
Although it might be faster than filehash and ff according to its authors, it seems to me that they are
trying to rewrite functions in R. It is used together with its companion packages biganalytics,
bigtabulate, synchronicity, and bigalgebra, as well as the biglm package. The newest release of this
package has been commercialized as RevoScaleR™ by Revolution Analytics. Here is an illustrative
example of using this package. The code is credited to Allan Engelhardt (http://www.cybaea.net/).
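
Before the full airline example, here is a minimal sketch of the core idea: a file-backed big.matrix
lives on disk while R only holds a small descriptor (the file names below are placeholders).

library(bigmemory)
x <- big.matrix(nrow = 1e6, ncol = 3, type = "double",
                backingfile = "demo.bin", descriptorfile = "demo.des")
x[1, ] <- c(1, 2, 3)    # element access uses ordinary matrix syntax
x[1, ]
## In a later R session the same data can be re-attached without re-reading:
## x <- attach.big.matrix("demo.des")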

Preparing the Airline Data from 2009 JSM Data Expo

Note that you might want to try using the nrows= parameter to read.csv and adjust as needed
for your computer. Namely,

d <- read.csv("2008.csv.bz2", nrows = 1e5)

which helps you to get over any memory or performance issues at this stage of data preprocessing.

## Install the packages we will use
install.packages("bigmemory",
                 dependencies = c("Depends", "Suggests", "Enhances"))

## Data sets are downloaded from the Data Expo '09 web site at
## http://stat-computing.org/dataexpo/2009/the-data.html
for (year in 1987:2008) {
  file.name <- paste(year, "csv.bz2", sep = ".")
  if ( !file.exists(file.name) ) {
    url.text <- paste("http://stat-computing.org/dataexpo/2009/",
                      year, ".csv.bz2", sep = "")
    cat("Downloading missing data file ", file.name, "\n", sep = "")
    download.file(url.text, file.name)
  }
}

## Read a sample file to get the column names and types
d <- read.csv("2008.csv.bz2")
integer.columns <- sapply(d, is.integer)
factor.columns  <- sapply(d, is.factor)
factor.levels   <- lapply(d[, factor.columns], levels)
n.rows <- 0L

## Process each file, determining the factor levels
## TODO: Combine with next loop
for (year in 1987:2008) {
  file.name <- paste(year, "csv.bz2", sep = ".")
  cat("Processing ", file.name, "\n", sep = "")
  d <- read.csv(file.name)
  n.rows <- n.rows + NROW(d)
  new.levels <- lapply(d[, factor.columns], levels)
  for ( i in seq(1, length(factor.levels)) ) {
    ## keep each level only once across years
    factor.levels[[i]] <- unique(c(factor.levels[[i]], new.levels[[i]]))
  }
  rm(d)
}
save(integer.columns, factor.columns, factor.levels, file = "factors.RData")

## Now convert all factors to integers so we can create a big.matrix of the data
col.classes <- rep("integer", length(integer.columns))
col.classes[factor.columns] <- "character"
cols  <- which(factor.columns)
first <- TRUE
csv.file <- "airlines.csv"    # write the combined integer-only data to this file
csv.con  <- file(csv.file, open = "w")

for (year in 1987:2008) {
  file.name <- paste(year, "csv.bz2", sep = ".")
  cat("Processing ", file.name, "\n", sep = "")
  d <- read.csv(file.name, colClasses = col.classes)
  ## Convert the strings to integers
  for ( i in seq(1, length(factor.levels)) ) {
    col <- cols[i]
    d[, col] <- match(d[, col], factor.levels[[i]])
  }
  write.table(d, file = csv.con, sep = ",",
              row.names = FALSE, col.names = first)
  first <- FALSE
}
close(csv.con)

## Now convert to a big.matrix
library("bigmemory")
backing.file    <- "airlines.bin"
descriptor.file <- "airlines.des"
data <- read.big.matrix(csv.file, header = TRUE,
                        type = "integer",
                        backingfile = backing.file,
                        descriptorfile = descriptor.file,
                        extraCols = c("age"))

Some Exploratory Data Analysis (EDA)

## bigScale.R - Replicate the analysis from http://bit.ly/aTFXeN with normal R
## http://info.revolutionanalytics.com/bigdata.html
## See big.R for the preprocessing of the data

## Load required libraries
library("biglm")
library("bigmemory")
library("biganalytics")
library("bigtabulate")

## Use parallel processing if available
## (multicore is for "anything-but-Windows" platforms)
if ( require("multicore") ) {
  library("doMC")
  registerDoMC()
} else {
  warning("Consider registering a multi-core 'foreach' processor.")
}

day.names <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
               "Saturday", "Sunday")

## Attach to the data
descriptor.file <- "airlines.des"
data <- attach.big.matrix(dget(descriptor.file))


## Replicate Table 5 in the Revolutions document
t.5 <- bigtabulate(data,
                   ccols = "DayOfWeek",
                   summary.cols = "ArrDelay", summary.na.rm = TRUE)
## Pretty-print the output
stat.names <- dimnames(t.5$summary[[1]])[2][[1]]
t.5.p <- cbind(matrix(unlist(t.5$summary), byrow = TRUE,
                      nrow = length(t.5$summary),
                      ncol = length(stat.names),
                      dimnames = list(day.names, stat.names)),
               ValidObs = t.5$table)
print(t.5.p)
# min max mean sd NAs ValidObs
# Monday -1410 1879 6.669515 30.17812 385262 18136111
# Tuesday -1426 2137 5.960421 29.06076 417965 18061938
# Wednesday -1405 2598 7.091502 30.37856 405286 18103222
# Thursday -1395 2453 8.945047 32.30101 400077 18083800
# Friday -1437 1808 9.606953 33.07271 384009 18091338
# Saturday -1280 1942 4.187419 28.29972 298328 15915382
# Sunday -1295 2461 6.525040 31.11353 296602 17143178

## Figure 1
plot(t.5.p[, "mean"], type = "l", ylab="Average arrival delay")
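
The script loads biglm but never calls it; as a hedged sketch (assuming the biglm.big.matrix() wrapper
that biganalytics provides for big.matrix objects), a chunked linear regression of arrival delay on two
ordinary columns of the airline data could look like:

library(biganalytics)
## Fit the linear model in chunks so the big.matrix never has to fit in RAM
fit <- biglm.big.matrix(ArrDelay ~ DepDelay + Distance, data = data)
summary(fit)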

REFERENCES

DuBois, P. (2000). MySQL. Indianapolis, IN: New Riders.

Torgo, L. (2003). Data Mining with R: Learning with Case Studies. CRC Press. Available from
http://www.liaad.up.pt/~ltorgo/DataMiningWithR/.


Appendix A:

Installation of RMySQL

The following content is taken from http://biostat.mc.vanderbilt.edu/wiki/Main/RMySQL.

RMySQL is a database interface and MySQL driver for R. This version complies with the database
interface definition as implemented in the package DBI 0.2-2. The latest version will always be
available at http://cran.r-project.org/web/packages/RMySQL/index.html. Until it has been updated,
you can get the source for RMySQL 0.7-5 here:
RMySQL_0.7-5.tar.gz (MD5: 793810dd6d91a45dc9c0680cd98cdab7)

Installing the RMySQL Source Package:

1. Download Rtools from here: http://www.murdoch-sutherland.com/Rtools/, making sure to
install the correct version for your R version.

2. Install a MySQL client library from http://www.mysql.com or http://dev.mysql.com. If you
already installed a MySQL server, you may want to re-run the install to ensure that you also
installed the client header and library files. Note that Xampp doesn't include these.

3. Edit or create the file Renviron.site and add the variable MYSQL_HOME which contains the
location of your MySQL install. The file typically isn't created when installing R, so you may
need to create it yourself. You will want to place it under the /etc directory in your R Home
area. If you don't know where that is, you can issue R.home() at your R prompt. You will be
adding a variable named MYSQL_HOME in variable=value syntax. Here's an example:
Location of Renviron.site: C:/PROGRA~1/R/R-2.11~1.0/etc/Renviron.site
Content is:
MYSQL_HOME=C:/PROGRA~1/MySQL/MYSQLS~1.0/

4. Restart R and execute install.packages('RMySQL', type='source') at the R prompt.

As long as you followed the above steps correctly, RMySQL will install cleanly and you will be able
to load and use it immediately. If for some reason it fails, the first place to look is whether the
MYSQL_HOME environment variable is set: issue Sys.getenv('MYSQL_HOME') at the R prompt. If
it is empty, re-check your Renviron.site file and make sure you have named it correctly, placed it in
the correct directory, named MYSQL_HOME correctly, and given it the correct value. Also notice that
the value is written in 8.3 (short file name) notation; you can find that value for your directory path by
issuing dir /x at the Windows command prompt.
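
As a quick check from the R prompt (Sys.setenv() here is only a session-level fallback, not a
replacement for the Renviron.site entry; the path below is a placeholder):

Sys.getenv("MYSQL_HOME")    # should print your MySQL installation path
## If it is empty, set it for the current session only and retry the install
Sys.setenv(MYSQL_HOME = "C:/PROGRA~1/MySQL/MYSQLS~1.0/")
install.packages("RMySQL", type = "source")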
