
Introduction to R

(a tutorial)

Gerardo Soto-Campos, PhD


Modeling Health
www.modelinghealth.com

Modeling Health, LLC

PART I
The Basics

What is R?
R is a free, well-maintained, and beautiful language for statistical programming and graphics.
Actually, it is more than that. It is the way statistics will be done in the future.
It can be used to build whole modeling and simulation environments.
R is the free "good twin" of S, the language developed at Bell Labs by Becker, Chambers, and Wilks, and of S-PLUS, its expensive commercial implementation.
One day, when you know R well, read Chambers' book (Chambers 2008).

Will R take over the data world by 2015? True or false?

R. A. Muenchen, in the blog r4stats.com, predicts that R will overtake SAS by 2015:
http://r4stats.com/articles/popularity/
(Figure: plot of the number of papers published using R, Stata, SAS, SPSS, and S-PLUS)


How to get R
Download it from the R project site:
http://www.r-project.org/
It runs on Windows, Mac, and Linux.

Mexico has two mirrors:
ITAM: http://cran.itam.mx/
Colegio de Postgraduados de Texcoco: http://www.est.colpos.mx/R-mirror/
* If we work hard, there could be one more in Pachuca, Hgo.


The standard R GUI: decent, but there are better options


Better integrated development environments (IDEs) for running R: RStudio, Eclipse, Emacs, Revolution Analytics

RStudio
http://rstudio.org/ (easy to configure, one of my favorites)
Eclipse
http://eclipse.org/ (requires the StatET plug-in; the favorite among computer nerds; also supports Java, Python, and C++; not too easy to configure)
Emacs
http://www.gnu.org/software/emacs/ (an old and ugly classic; I neither use it nor like it)
Revolution Analytics R
http://www.revolutionanalytics.com/ (a commercial version of R with excellent packages for handling BIG data, smart people, lots of webcasts, and great service; free for academics)

Two more things


The bioinformatics community has its own software project written in R, called Bioconductor: http://www.bioconductor.org/
Bioconductor is specific to genomics and can be run in the Amazon cloud.
Bioconductor can also be run as an R package called from RStudio using standard R.

Revolution Analytics R
Revolution Analytics has a standard version of R called Revolution R. It is free and can be run from RStudio or the standard R GUI.
Revolution Enterprise costs money; it has a great GUI and excellent packages. It can be customized to run in the cloud.

Most of this deck will deal with RStudio
www.rstudio.org
Download and install it for Linux, Mac, or Windows.
On Windows it is extremely easy to configure.
On Linux it is not so easy to configure: ask CARLOS.


A session on R in RStudio
Open RStudio.
In the RStudio menu go to Project -> New Project.
Choose New Directory.
I chose C:/Users/jerry/R2.13.0/Rdemo/primerOnR/tutorialOnR


Parts of the R session in RStudio

Console
It is the workhorse of RStudio.
Use it to test things you don't mind losing.
In general it is not a good idea to develop code in the console.
R Scripts
Go to File -> New -> R Script -> Save As -> aScriptName.R
I call mine Rtutorial.R
Comments start with the symbol #


R objects and data structures


Everything in R is an object, in the sense of object-oriented programming.

Data Types
Double
Integer
Complex
Logical
Factors
Character
Missing data

Data Structures
Numbers
Vectors
Matrices
Lists
data.frames
Strings

IMPORTANT
R is case sensitive


R Help
In the console enter help(aDataStructure), for example:
help(vector)
help(matrix)
help(factor)

Useful reference card for R:
http://cran.r-project.org/doc/contrib/Short-refcard.pdf


Working Directory, Image, and History

Creating a project in RStudio sets the working directory automatically.
To see it, type in the console:
getwd()
To set the working directory from the console:
setwd("aDirectoryPath")

Image
A binary file containing all the objects defined in an R session
save.image(".RTutorialImage") # at the end of the session
load(".RTutorialImage") # at the beginning of the session
History
savehistory(".RTutorialHistory") # at the end of the session
loadhistory(".RTutorialHistory") # at the start of the session
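
A minimal sketch of the save/load round trip, assuming a throwaway file created with tempfile() so that nothing is written to the real working directory. The object name counts is hypothetical; save() stores only the named objects, whereas save.image() stores the whole workspace.

```r
# Hypothetical object, used only for illustration
counts <- c(a = 1, b = 2)

# A temporary file stands in for ".RTutorialImage"
img <- tempfile(fileext = ".RData")

save(counts, file = img)   # write the object to disk
rm(counts)                 # remove it from the session
load(img)                  # restore it under its original name
counts
```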


Vectors
In the script type:
vec1 <- c("a", "b", "C") # a vector of characters
Click the Run arrow in RStudio.
The function c() concatenates objects in R.
In the console try:
class(vec1)
length(vec1)
Operations
Vec2 <- c(3, 4, 5, 7)
Vec3 <- c(88, 39.8, 1, 9.0)
Vec2 + Vec3 # adds entry by entry
Vec2 * Vec3 # multiplies entry by entry
Vec4 <- Vec3[Vec3 > 3] # subsets Vec3: keeps the entries greater than 3
Vec3[3] <- 9898.5 # changes the 3rd entry in Vec3
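
The subsetting step is easier to follow when the logical vector is made explicit. The sketch below reuses the same numbers under the hypothetical names x, y, and keep.

```r
# Same values as in the slide, under hypothetical names
x <- c(3, 4, 5, 7)
y <- c(88, 39.8, 1, 9.0)

x + y            # element-wise sum
x * y            # element-wise product

keep <- y > 3    # logical vector: TRUE where the condition holds
keep
y[keep]          # same result as y[y > 3]

y[3] <- 9898.5   # replace the third entry in place
length(y)        # the length does not change
```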


Matrices: the algebra of matrices carries over

mat1 <- matrix(10:15, nrow = 3, ncol = 2); mat1
mat2 <- matrix(1:4, nrow = 2, ncol = 2); mat2
Product
mat1 %*% mat2
dim(mat1); dim(mat2); dim(mat1 %*% mat2)
Invoke rows, columns, or matrix elements
mat1[1, ] # first row
mat1[, 1] # first column
mat1[3, 2] # element in row 3, column 2
Form matrices from vectors
Mat3 <- cbind(Vec2, Vec3)
Mat4 <- rbind(Vec2, Vec3)
Subsetting
mat1[c(1, 3), 2] # elements 1 and 3 in column 2 of mat1
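
A short sketch contrasting the matrix product %*% with the element-wise product *, which is a common source of confusion. The names A, B, and P are hypothetical; the values are the same as in mat1 and mat2.

```r
A <- matrix(10:15, nrow = 3, ncol = 2)  # filled column by column
B <- matrix(1:4, nrow = 2, ncol = 2)

P <- A %*% B     # matrix product: (3 x 2) times (2 x 2) gives 3 x 2
dim(P)
P

A * A            # element-wise product, not the matrix product
t(A)             # transpose: 2 x 3
```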


Factors
Factors are useful variables in R that take on a limited number of values, called levels.
Useful for statistical modeling
Example
data <- c(1,2,2,3,1,2,3,3,1,2,3,3,1); as.factor(data)
 [1] 1 2 2 3 1 2 3 3 1 2 3 3 1
Levels: 1 2 3
summary(as.factor(data)) # counts of each level
1 2 3
4 4 5
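
Numeric codes are often easier to read once the levels carry labels. A minimal sketch, assuming the hypothetical labels "low", "medium", and "high" for the codes 1, 2, and 3:

```r
codes <- c(1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1)

# The labels are hypothetical, chosen only for illustration
f <- factor(codes, levels = c(1, 2, 3), labels = c("low", "medium", "high"))

levels(f)   # the set of allowed values
table(f)    # counts per level, same numbers as summary(f)
```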


Lists
This is a very versatile and important data structure.
Very useful in modeling and simulation
Technically, a list is a recursive vector.
You can save any object in a list. Example:
agent <- list(name = "John Smith", gender = "male", age = 56)
agent$name # alternatively: agent[[1]]
agent$gender # alternatively: agent[[2]]
agent$age # alternatively: agent[[3]]
Changing parameters:
agent$age <- 77
Try agent[[3]]
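
A sketch of two points that the example leaves implicit: a list can hold components of different kinds, and single brackets return a sub-list while double brackets return the element itself. The component visits is hypothetical.

```r
agent <- list(name = "John Smith", gender = "male", age = 56)

agent$visits <- c(2, 5, 1)             # hypothetical new component: a numeric vector
agent[["age"]] <- agent[["age"]] + 1   # update a component by name

names(agent)
class(agent["age"])     # "list": single brackets keep the list wrapper
class(agent[["age"]])   # "numeric": double brackets extract the element
str(agent)              # compact overview of the whole structure
```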


data.frames
A list whose components are equal-length vectors, one per column (see Matloff)
Useful structures for manipulating information with objects from different classes.
Example
df1 <- data.frame(customers = c("cust1", "cust2", "cust3"), ages = c(50, 55, 60), bankSavings = c(1000, 3000, 1500))
df1
dim(df1)
df1[1, ]; df1[2, ]; df1[, 3] # first row; second row; third column
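
Columns can also be reached by name, and rows can be filtered with a logical condition. A minimal sketch on the same data; the threshold 52 and the name older are arbitrary choices for illustration.

```r
df1 <- data.frame(customers = c("cust1", "cust2", "cust3"),
                  ages = c(50, 55, 60),
                  bankSavings = c(1000, 3000, 1500))

df1$ages                        # one column, by name
older <- df1[df1$ages > 52, ]   # rows where the condition holds, all columns
older

mean(df1$bankSavings)           # functions apply to a column like to any vector
str(df1)                        # one line per column, with its class
```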


Strings
Knowing how to handle character strings is useful when programming.
Commands
grep("character", c("Knowing", "how", "to", "handle", "character", "strings", "is", "useful")) # index of the matching element
paste("Knowing how to handle", "character strings", "is useful when programming") # joins strings
substr("Knowing how to handle character strings is useful when programming", 1, 7) # characters 1 to 7
nchar("Knowing how to handle character strings is useful when programming") # number of characters
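
The same four commands, sketched on a short word vector so that the results are easy to check by eye. The names words and sentence are hypothetical.

```r
words <- c("Knowing", "how", "to", "handle", "character", "strings")

grep("character", words)                  # position of the match in the vector
sentence <- paste(words, collapse = " ")  # glue the words into one string
sentence
substr(sentence, 1, 7)                    # characters 1 through 7
nchar(words)                              # length of each word
```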


Dates and Times


as.Date('1960-7-21')
as.Date('1/30/2012',format='%m/%d/%Y')
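
Once strings are converted with as.Date(), dates support arithmetic and reformatting. A minimal sketch with the two dates above, under the hypothetical names d1 and d2:

```r
d1 <- as.Date("1960-07-21")
d2 <- as.Date("1/30/2012", format = "%m/%d/%Y")

d2 - d1                   # difference between the two dates, in days
d2 + 30                   # the date 30 days later
format(d2, "%Y-%m-%d")    # back to a string, in ISO format
d1 < d2                   # dates can be compared
```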


Input I
To do anything interesting with R we need to know how to read data into it and write data out of it.

Reading SAS transport (XPT) files
Example with data from the National Health and Nutrition Examination Survey 2003-2004 (NHANES 03-04)
http://www.cdc.gov/nchs/nhanes/nhanes2003-2004/nhanes03_04.htm
Information on: Demographics; Dietary; Examination; Laboratory
For Demographics the list of available variables is:
http://www.cdc.gov/nchs/nhanes/nhanes2003-2004/vardemo_c.htm
The data come in a format called xpt.
xpt is an export file from the SAS software, which is what the NHANES people used.
We need to install the package foreign:
install.packages("foreign") or use the RStudio Packages tab


Input II
Save demo_c.xpt locally
library(foreign) # provides read.xport()
demo0304 <- read.xport("C:/modeling Health/NHANES/NH_0304/Demographics/demo_c.xpt")[, c(1:8, 28:31)]
This reads columns 1-8 and 28-31 of the demo_c.xpt file.
(Figure: the head of the file)


Input III
CSV
From a local directory
read.csv("directoryLocation/fileName.csv", header = TRUE)
From a URL
csvFile <- read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")


Output
With the package foreign
Let us use the object csvFile defined in the previous slide.
STATA
write.dta(csvFile, "dtaFile.dta") # writes a .dta file in the working directory
SAS
write.foreign(csvFile, "sasFile.dat", "sasFile.sas", package = "SAS") # writes a data file (.dat) and a SAS program (.sas) that reads it, both in the working directory
CSV
write.csv(csvFile, "csvFile.csv") # writes a CSV file in the working directory
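
A quick way to gain confidence in reading and writing is a round trip: write a small data.frame to CSV and read it back. The sketch below uses a hypothetical toy table and a temporary directory, so it needs neither the HELP data nor an internet connection.

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

path <- file.path(tempdir(), "toy.csv")
write.csv(toy, path, row.names = FALSE)   # row.names = FALSE avoids an extra index column

back <- read.csv(path, header = TRUE)
back
all.equal(toy, back)                      # checks that nothing changed on the way
```

Without row.names = FALSE, write.csv() adds a column of row names that comes back as an extra variable when the file is read again.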


PART II
Programming Tools


for () loops
Syntax
for (j in 1:N) {statements}
EXAMPLE
Vec5 <- rep(NA, 27); Vec5 # vector of length 27 with only NA entries
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[25] NA NA NA
for (j in 1:27) {Vec5[j] <- j}
Vec5 # each entry now holds its own index
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
[25] 25 26 27
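
Many loops of this kind can be replaced by a single vectorized expression. A minimal sketch comparing the two styles; squaring the index is an arbitrary choice for illustration, and the names n, v, and v2 are hypothetical.

```r
n <- 27

# Loop version: fill a pre-allocated vector one entry at a time
v <- rep(NA, n)
for (j in seq_len(n)) {
  v[j] <- j^2
}

# Vectorized version: the same computation in one expression
v2 <- (1:n)^2

identical(v, v2)   # both approaches give the same vector
```

seq_len(n) is used instead of 1:n because it behaves sensibly when n is 0, where 1:n would count down from 1 to 0.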


Conditionals: if () {} else {}

Read the csvFile from the previous section
csvFile <- read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
csvFile1 <- as.matrix(csvFile) # convert data.frame to matrix
up <- dim(csvFile1)[1] # number of rows in csvFile1

# we will correct the variable substance in the object csvFile1
for (j in 1:up) {
  if (csvFile1[j, 87] == "cocaine" || csvFile1[j, 87] == "heroin") {
    csvFile1[j, 87] <- "complexMolecule"
  }
}
csvFile1 <- as.data.frame(csvFile1) # reconvert matrix to data.frame
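
The same recoding can be sketched on a small hypothetical table, once with if/else inside a loop and once with logical indexing. The data.frame toy and its values are made up for illustration and do not come from the HELP data.

```r
# Hypothetical toy data standing in for the substance column
toy <- data.frame(id = 1:5,
                  substance = c("alcohol", "cocaine", "heroin", "alcohol", "cocaine"),
                  stringsAsFactors = FALSE)
vec <- toy   # a copy for the vectorized version

# Loop with if/else, as in the slide
for (j in seq_len(nrow(toy))) {
  if (toy$substance[j] == "cocaine" || toy$substance[j] == "heroin") {
    toy$substance[j] <- "complexMolecule"
  } else {
    # any other value is left unchanged
  }
}

# Vectorized alternative: select the matching rows and assign once
vec$substance[vec$substance %in% c("cocaine", "heroin")] <- "complexMolecule"

identical(toy$substance, vec$substance)
```

Working on the data.frame directly avoids the conversion to a matrix and back, which turns every column into character.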


Programming functions in R
Syntax
functionName <- function(var1, var2, ..., varN) {
  functionStatements # the variables can be simple or complex objects, e.g. arrays
}
Simple function: raise a number to the 3rd power
cube <- function(x) {
  x^3
}
Not so simple function: read the data.frame csvFile and compute the % of people who are black, Hispanic, other, and white
raceDist <- function(dataFrame) { # declaration
  denom <- length(dataFrame[, 1]) # number of people in the data.frame
  black <- subset(dataFrame, dataFrame[, 88] == "black") # sub-frame with only African Americans
  cat("% African Americans = ", 100 * (dim(black)[1] / denom), "\n") # display result in console
  hispanic <- subset(dataFrame, dataFrame[, 88] == "hispanic") # sub-frame with only Hispanics
  cat("% Hispanic = ", 100 * (dim(hispanic)[1] / denom), "\n")
  other <- subset(dataFrame, dataFrame[, 88] == "other") # all other race-ethnicity
  cat("% Other = ", 100 * (dim(other)[1] / denom), "\n")
  white <- subset(dataFrame, dataFrame[, 88] == "white") # only whites
  cat("% Caucasian = ", 100 * (dim(white)[1] / denom), "\n")
}
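
A more compact variant of the same idea, sketched with table() and prop.table() so that the categories are not hard-coded. The function name raceDist2, the column name racegrp, and the toy data are hypothetical; this is an alternative sketch, not the function used in the deck.

```r
# Returns the percentage of rows in each category of the chosen column
raceDist2 <- function(dataFrame, column) {
  counts <- table(dataFrame[[column]])   # number of rows per category
  round(100 * prop.table(counts), 1)     # convert counts to percentages
}

# Hypothetical toy data, used only for illustration
toy <- data.frame(racegrp = c("black", "white", "white", "hispanic",
                              "other", "black", "white", "white"))

pct <- raceDist2(toy, "racegrp")
pct
```

Because the function returns the percentages instead of printing them with cat(), the result can be stored and reused in later computations.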


Useful Tools for R programming

fix(functionName) # opens an editor to change the function
Example: fix(raceDist)
debug(functionName) # walks through the code of a function line by line
After invoking debug(), the next call to the function opens the browser.
Browser commands:
n runs the next line in browser mode.
c continues the function until the end.
Q quits browser mode.
Example
debug(raceDist)
In the console try: raceDist(csvFile)
IMPORTANT: after debugging, run undebug(raceDist)


PART III

III.1 Data Analysis in R


III.2 Data Modeling in R


Data analysis in R. 1
Data analysis is an extensive field.
R provides a large number of functions to perform analysis:
mean(), sd(), summary(), max(), min(), etc.

In practice, we will use the package rattle, written and maintained by Dr. Graham Williams.
install.packages("rattle")
This step is straightforward, but it involves multiple packages.
Once rattle is installed, load it with library(rattle) and invoke it in the console as: rattle()


Data Analysis in R. 2
Load rattle and invoke it in the console.
We will analyze the data from csvFile.
The data get loaded into the session when the R image gets loaded:
load(".RTutorialImage")
(Figure: snapshot of rattle's interface)


Reading data into rattle


Make sure the object csvFile is in your session, i.e. type objects() in the console and check.
Select the R Dataset option.
In the box Data Name, locate csvFile.
Click Execute. The following opens:


Exploring data distributions


Go to the Explore tab in rattle.
As an example, I selected age as the variable to explore.
The boxplot, histogram, cumulative, and Benford's law plots appear to the right.
More variables could be used.


Multiple regression model

Variables:
i1 = average drinks per day
age = age of individual
substance = alcohol/cocaine/heroin
Model: how does i1 depend on age and substance?

                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        18.2545     5.7713   3.163  0.00171 **
age                 0.3341     0.1441   2.319  0.02106 *
substancecocaine  -17.5678     2.6301  -6.679 1.10e-10 ***
substanceheroin   -20.0985     2.7334  -7.353 1.71e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.18 on 313 degrees of freedom
Multiple R-squared: 0.2183, Adjusted R-squared: 0.2108
F-statistic: 29.14 on 3 and 313 DF, p-value: < 2.2e-16

==== ANOVA ====

Analysis of Variance Table

Response: i1
           Df Sum Sq Mean Sq F value    Pr(>F)
age         1   7669  7669.4  20.847 7.159e-06 ***
substance   2  24490 12245.2  33.285 7.825e-14 ***
Residuals 313 115149   367.9
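
The output above comes from rattle. A model of the same form can also be fit from the console with lm(). The sketch below uses simulated data, not the HELP data, and the coefficients used to simulate it are arbitrary, so the numbers it prints will not match the table. The names toy, drinks, and fit are hypothetical.

```r
set.seed(1)   # makes the simulated data reproducible
n <- 200

# Simulated stand-in for the age and substance variables
toy <- data.frame(
  age = round(runif(n, 20, 60)),
  substance = sample(c("alcohol", "cocaine", "heroin"), n, replace = TRUE)
)

# Simulated outcome: depends on age and on whether the substance is alcohol
toy$drinks <- 18 + 0.3 * toy$age - 18 * (toy$substance != "alcohol") + rnorm(n, sd = 5)

fit <- lm(drinks ~ age + substance, data = toy)   # outcome ~ predictors
summary(fit)   # coefficient table, R-squared, F-statistic
anova(fit)     # analysis of variance table
```

lm() treats substance as a factor and uses its first level, here alcohol, as the reference category. This is why the coefficient table has rows for cocaine and heroin only.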


Rattle gives the option to build other models (beyond our scope):
Trees
Random Forests
Neural Networks


R has worldwide support

A plethora of developers with different specialties maintain and improve R.
Packages with multiple applications have to undergo a code peer review from the R project: http://www.r-project.org/index.html
CRAN supports


Applications of R in Biomedical Sciences


There is no time to treat this topic in detail.
The reader is encouraged to check it on her/his own:
http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/en/latest/src/biomedicalstats.html#


PART IV

Graphs in R


X-Y plotting
Use the csvFile data
plot(csvFile$age, csvFile$mcs, xlab = "Age (y)", ylab = "MCS", main = "Plot of the Mental Composite Score Versus Age in the HELP Study")
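
A self-contained sketch of the same kind of scatter plot, written to a temporary PDF so it runs without the HELP data or a graphics window. The data are simulated, and the names age, score, and out are hypothetical.

```r
set.seed(42)   # makes the simulated data reproducible

# Simulated stand-ins for age and a composite score
age <- round(runif(50, 20, 60))
score <- 50 - 0.1 * age + rnorm(50, sd = 8)

out <- tempfile(fileext = ".pdf")
pdf(out)                                  # open a PDF graphics device
plot(age, score,
     xlab = "Age (y)", ylab = "Score",
     main = "Simulated score versus age")
abline(lm(score ~ age), lty = 2)          # add a dashed least-squares line
invisible(dev.off())                      # close the device so the file is written
```

In an interactive RStudio session the pdf() and dev.off() calls can be dropped, and the plot appears in the Plots pane.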


3D plots
library(rgl) # load the rgl package
plot3d(csvFile$age, csvFile$mcs, csvFile$cesd, xlab = "Age (y)", ylab = "MCS", zlab = "CESD", main = "Plot of the Mental Composite Score Versus Age in the HELP Study")


PART V: ADVANCED TOPICS IN R (FOR THE FUTURE)

Object Oriented Programming in R


Parallel Programming
R Programming in the Cloud
Numerical Simulations in R


A personal suggestion/opinion
Start easy and build up in complexity.
Learn R with Lam's book, then move to more advanced works, e.g. use Chambers' book when you are VERY comfortable with R.
Pick any interesting problem you want to solve, then go ahead and do it in R. Examples:
Financial time series
Stochastic processes
Pharmacoeconomics
Physiological modeling, etc.
Get familiar with the relevant R packages at CRAN:
http://cran.r-project.org/web/packages/available_packages_by_name.html
Subscribe to R-bloggers: http://www.r-bloggers.com/
Form an R users meetup group.
Don't study R; use it and have fun with it.


Bibliography 1
Braun WJ and Murdoch DJ, A First Course in Statistical Programming with R (Cambridge UP, Cambridge UK, 2007)
Excellent book for beginners.

Chambers JM (2008) Software for Data Analysis: http://www.e-reading.org.ua/bookreader.php/137398/Software_for_Data_Analysis_-_Programming_with_R.pdf
This is an advanced book!!!

Coghlan A (2012) A Little Book of R for Biomedical Statistics: https://media.readthedocs.org/pdf/a-little-book-of-r-for-biomedical-statistics/latest/a-little-book-of-r-for-biomedical-statistics.pdf

Journal of Statistical Software, http://www.jstatsoft.org/
Excellent source of papers on diverse statistical applications. Many of the papers describe R packages.

Kleinman K and Horton NJ, SAS and R (CRC Press, New York NY, 2010)
Great book with lots of examples. Excellent if you come to R from SAS.

Lam L (2010) An Introduction to R: http://cran.r-project.org/doc/contrib/Lam-IntroductionToR_LHL.pdf
This is the best intro book on R I know.

Maindonald J and Braun WJ, Data Analysis and Graphics Using R (Cambridge UP, Cambridge UK, 2010) Third Edition.
An excellent reference. Not for the beginner; it assumes a lot of background in R. Also, the notation is very compact.


Bibliography 2
Matloff N, The Art of R Programming (No Starch Press, San Francisco, 2011)
One of my favorite references, covering both elementary and up-to-date applications in R.
The R Journal, http://journal.r-project.org/
Papers available from repository since 2001.
Spector P, Data Manipulation with R (Springer, 2008)
Nice little book with good practical examples on special topics.
Torgo L, Data Mining with R (CRC Press, Boca Raton, FL, 2011)
Good selected topics on data mining with R.


Bibliography 3
Venables WN and Smith DM (2012) An Introduction to R: http://cran.r-project.org/doc/manuals/R-intro.pdf
The classic reference. To learn R I prefer Lam's work, but use what fits you best.
Williams G, Data Mining with Rattle and R (Springer, 2011)
Excellent book on data mining. Also, the rattle interface is very good.
Witten IH, Frank E, and Hall M, Data Mining: Practical Machine Learning Tools and Techniques (MK, Amsterdam, 2011) 3rd Ed.
