(a tutorial)
PART I
The Basics
What is R?
R is an inexpensive, well maintained, free, and
(Chambers 2008)
Modeling Health, LLC
The blog r4stats.com predicts R overtaking SAS by 2015:
http://r4stats.com/articles/popularity/
Plot of # of papers
published in R, Stata,
SAS, SPSS, and SPlus
How to get R
Download it from the R project site:
http://www.r-project.org/
It works on any platform: Windows, Mac, and Linux.
written in R called
Bioconductor: http://www.bioconductor.org/
Bioconductor is specific to genomics, and can be used to run in
Revolution Analytics R
Revolution Analytics has a standard version of R, called
Revolution R. It is free and can be run from RStudio or the
standard R GUI.
Revolution Enterprise costs money; it has a great GUI and
excellent packages. It can be customized to run in the cloud.
A session on R in RStudio
Open RStudio
In the RStudio menu go to Project -> New Project
Data Structures
Numbers
Vectors
Matrices
Lists
Data.frames
Strings
IMPORTANT
R is case sensitive
R Help
In the console enter:
help(aDataStructure):
help(vector)
help(matrix)
help(factor)
for R:
http://cran.r-project.org/doc/contrib/Short-refcard.pdf
Working directory
In the console type
getwd()
To set the working directory from the console:
setwd(aDirectoryPath)
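A minimal sketch of checking and changing the working directory. The folder used here, tempdir(), is only a stand-in for a real project folder, and the names oldDir and newDir are made up for the illustration.

```r
# Remember the current working directory so it can be restored later
oldDir <- getwd()

# tempdir() is only a stand-in for a real project folder (an assumption)
setwd(tempdir())
newDir <- getwd()
print(newDir)

# Go back to where we started
setwd(oldDir)
```

Restoring the old directory at the end keeps the rest of the session unaffected.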
Image
Binary containing all the objects defined in an R session
save.image(".RTutorialImage") #at the end of the session
load(".RTutorialImage") # at the beginning of the session
History
savehistory(".RTutorialHistory") #at the end of session
loadhistory(".RTutorialHistory") #at the start of session
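A small sketch of saving objects to a binary file and restoring them. It uses save() and load(), the functions that a workspace image is built on; the file name and the object x are hypothetical, and the file lives in a temporary folder. History files are left out because they only work in an interactive session.

```r
# Hypothetical file name in a temporary folder, used only for this sketch
imageFile <- file.path(tempdir(), "RTutorialImage.RData")

x <- c(1, 2, 3)                                  # an object worth keeping

# Save every object in the current environment, as at the end of a session
save(list = ls(all.names = TRUE), file = imageFile)

rm(x)                                            # simulate a fresh session
load(imageFile)                                  # as at the start of a session
print(x)
```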
Vectors
In the script type:
vec1<-c("a","b","C") # a vector of characters
Click the Run right arrow in RStudio
The function c() concatenates objects in R
In the Console try:
class(vec1)
length(vec1)
Operations
Vec2<-c(3,4,5,7)
Vec3<-c(88,39.8,1,9.0)
Vec2+Vec3
# adds entry by entry
Vec2*Vec3
# multiplies entry by entry
Vec4<-Vec3[Vec3>3] # subsets Vec3: keeps the entries greater than 3
Vec3[3]<-9898.5
# changes 3rd entry in Vec3
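The operations above, collected in one runnable sketch. The names vecSum and vecProd are added here only to hold the results.

```r
Vec2 <- c(3, 4, 5, 7)
Vec3 <- c(88, 39.8, 1, 9.0)

vecSum  <- Vec2 + Vec3   # adds entry by entry
vecProd <- Vec2 * Vec3   # multiplies entry by entry

Vec4 <- Vec3[Vec3 > 3]   # keeps only the entries of Vec3 greater than 3
Vec3[3] <- 9898.5        # changes the 3rd entry of Vec3

print(vecSum)
print(vecProd)
print(Vec4)
print(Vec3)
```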
#element
Form matrices from vectors
Mat3<-cbind(Vec2,Vec3)
Mat4<-rbind(Vec2,Vec3)
Subsetting
mat1[c(1,3),2] # elements 1 and 3 in column 2 of mat1
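A short sketch of building matrices from vectors and subsetting them. The definition of mat1 is not shown here, so the 4 x 3 matrix below is a hypothetical stand-in.

```r
Vec2 <- c(3, 4, 5, 7)
Vec3 <- c(88, 39.8, 1, 9.0)

Mat3 <- cbind(Vec2, Vec3)   # vectors become columns: 4 rows, 2 columns
Mat4 <- rbind(Vec2, Vec3)   # vectors become rows: 2 rows, 4 columns

# Hypothetical stand-in for mat1, filled column by column with 1..12
mat1 <- matrix(1:12, nrow = 4, ncol = 3)

picked <- mat1[c(1, 3), 2]  # elements 1 and 3 in column 2 of mat1
print(dim(Mat3))
print(dim(Mat4))
print(picked)
```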
Factors
Factors are useful variables in R taking on a limited
[1] 1 2 2 3 1 2 3 3 1 2 3 3 1
Levels: 1 2 3
summary(as.factor(data)) # summary of data as factors
1 2 3
4 4 5
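A sketch that reproduces the factor output above. The values in data are copied from the printed factor; the name dataFactor is introduced only for the illustration.

```r
# The 13 values shown in the printed factor above
data <- c(1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1)

dataFactor <- as.factor(data)   # turn the numbers into categories
print(levels(dataFactor))       # the distinct categories
print(summary(dataFactor))      # how many times each level occurs
```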
Lists
This is a very versatile and important data structure
Very useful in modeling and simulation
Technically a list is a recursive vector.
You can save any object in a list. Example:
agent<-list(name="John Smith",gender="male",age=56)
agent$name #alternatively:
agent[[1]]
agent$gender
agent[[2]]
agent$age
agent[[3]]
Changing parameters:
agent$age<-77
Try agent[[3]]
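A runnable sketch of the list example. The extra component visits is hypothetical, added only to show that a list can hold objects of different types and lengths.

```r
agent <- list(name = "John Smith", gender = "male", age = 56)

# Two equivalent ways of reading a component
print(agent$name)
print(agent[[1]])

# Changing a component, then reading it back by position
agent$age <- 77
print(agent[[3]])

# Hypothetical extra component: a numeric vector stored inside the list
agent$visits <- c(2, 5, 1)
print(names(agent))
```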
data.frames
A list whose components are column vectors of equal length, one per column (see Matloff)
Useful structures to manipulate information with objects
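A minimal sketch of a data.frame. The table people and its values are made up for the illustration.

```r
# Each argument becomes one column; all columns have the same length
people <- data.frame(
  name   = c("Ann", "Bob", "Cara"),
  age    = c(34, 56, 41),
  gender = c("female", "male", "female"),
  stringsAsFactors = FALSE
)

print(is.list(people))            # a data.frame is a list underneath
print(people$age)                 # one column, as a vector
older <- people[people$age > 40, ]  # rows where age is above 40
print(older)
```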
Strings
Knowing how to handle character strings is useful when
programming.
Commands
grep("character",c("Knowing"," how"," to"," handle"," character","
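A few common string commands in one sketch. The vector words is an illustrative stand-in and is not a reconstruction of the truncated command above.

```r
# Illustrative vector of words (an assumption, not the original input)
words <- c("Knowing", "how", "to", "handle", "character", "strings")

hit      <- grep("character", words)      # position of the matching element
lengths  <- nchar(words)                  # number of characters per word
sentence <- paste(words, collapse = " ")  # glue the words into one string
firstBit <- substr(words[1], 1, 4)        # characters 1 to 4 of the first word
swapped  <- sub("character", "text", words[5])  # replace a pattern

print(hit)
print(lengths)
print(sentence)
print(firstBit)
print(swapped)
```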
Input I
To do anything interesting with R we need to know how to read
04 (NHANES 03-04)
http://www.cdc.gov/nchs/nhanes/nhanes2003-2004/nhanes03_04.htm
Information on: Demographics; Dietary; Examination; Laboratory
For Demographics the list of available variables is:
http://www.cdc.gov/nchs/nhanes/nhanes2003-2004/vardemo_c.htm
The data comes in a format called xpt
xpt is an export file from the SAS software, which is what the NHANES
people used.
We need to install the package foreign
install.packages("foreign") or use RStudio Packages tab
Input II
Save demo_c.xpt locally
library(foreign) # provides read.xport
demo0304<-read.xport("C:/modeling Health/NHANES/NH_0304/Demographics/demo_c.xpt")[,c(1:8,28:31)]
This reads columns 1-8 and 28-31
Input III
CSV
From Local Directory
read.csv("directoryLocation/fileName.csv",header=TRUE)
From URL
csvFile<-read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
Output
With the package foreign
Let us use the file csvFile defined in the previous slide
STATA
write.dta(csvFile,"dtaFile.dta") # writes a dta file in the working directory
SAS binary
write.foreign(csvFile,"sasFile.dat","sasFile.sas",package="SAS") # writes a .dat data file and a .sas script that reads it, in the working directory
CSV
write.csv(csvFile,"csvFile.csv") # writes a CSV file in the working directory
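A sketch of a CSV round trip: write a data.frame to disk and read it back. The data.frame scores and the file name are hypothetical, and the file goes to a temporary folder rather than the working directory.

```r
# Hypothetical small data set
scores <- data.frame(id = 1:3, age = c(34, 56, 41))

# Hypothetical file in a temporary folder
csvPath <- file.path(tempdir(), "scores.csv")

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(scores, csvPath, row.names = FALSE)

scoresBack <- read.csv(csvPath, header = TRUE)
print(scoresBack)
```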
PART II
Programming Tools
for () loops
Syntax
for (j in 1:N) {statements}
EXAMPLE
Vec5<-rep(NA,27); Vec5 # vector of length 27 with only NA entries
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA
[25] NA NA NA
for (j in 1:27){Vec5[j]<-j}; Vec5 # fills entry j with the value j, then prints
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
[25] 25 26 27
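The loop above as a runnable sketch, next to a loop-free version of the same thing. The names N, Vec6, and total are introduced only for the illustration.

```r
N <- 27
Vec5 <- rep(NA, N)          # vector of length 27 with only NA entries

for (j in seq_len(N)) {     # seq_len(N) is 1, 2, ..., N
  Vec5[j] <- j
}

Vec6 <- seq_len(N)          # same result without a loop

# A loop can also run over the values of a vector directly
total <- 0
for (value in Vec5) {
  total <- total + value
}
print(Vec5)
print(total)
```

In R the vectorized form is usually shorter and faster; the loop is shown because the syntax carries over to cases that cannot be vectorized.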
if(csvFile1[j,87]=="cocaine" || csvFile1[j,87]=="heroin")
{csvFile1[j,87]<-"complexMolecule"} # recode the entry in row j
}
csvFile1<-as.data.frame(csvFile1) #reconvert matrix to data.frame
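A self-contained sketch of the same idea: an if() inside a for() loop that recodes some entries. The data.frame drugs and its column names are hypothetical; the HELP data is not used here.

```r
# Hypothetical stand-in for the data
drugs <- data.frame(
  id        = 1:4,
  substance = c("alcohol", "cocaine", "heroin", "alcohol"),
  stringsAsFactors = FALSE
)
drugs2 <- drugs   # copy used for the loop-free version below

# Loop version: visit each row and recode when the condition holds
for (j in seq_len(nrow(drugs))) {
  if (drugs[j, "substance"] == "cocaine" || drugs[j, "substance"] == "heroin") {
    drugs[j, "substance"] <- "complexMolecule"
  }
}

# Loop-free version: select all matching rows at once
drugs2$substance[drugs2$substance %in% c("cocaine", "heroin")] <- "complexMolecule"

print(drugs)
```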
Programming functions in R
Syntax
functionName<-function(var1,var2,...,varN) { functionStatements
# the variables can be simple or complex objects, e.g. arrays
}
Simple Function: raise a number to the 3rd power
cube<-function(x){
x^3 }
Not so simple function: read the data.frame csvFile, and compute the % of people
who are: black, Hispanic, other, and white
raceDist<-function(dataFrame){ #declaration
white<-subset(dataFrame,dataFrame[,88]=="white") #onlyWhites
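A hedged sketch of both kinds of function. The percentage function below is not the raceDist() of the slide: it takes the column name as a second argument (an assumption), counts with table() instead of subset(), and runs on a made-up data.frame.

```r
# Simple function: raise a number to the 3rd power
cube <- function(x) {
  x^3
}
print(cube(3))

# Sketch of a percentage function; 'column' is a hypothetical argument
groupPercent <- function(dataFrame, column) {
  counts <- table(dataFrame[[column]])   # how many rows per category
  100 * counts / sum(counts)             # convert counts to percentages
}

# Made-up data, only for the illustration
people <- data.frame(racegrp = c("white", "black", "white", "hispanic"))
percents <- groupPercent(people, "racegrp")
print(percents)
```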
by line
After invoking debug() the browser command appears
Commands:
PART III
Data analysis in R. 1
This is an extensive field in data analysis.
R provides a large number of functions to perform
analysis:
mean(), sd(), summary(), max(), min(), etc.
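A quick sketch of these functions on a made-up vector of ages.

```r
# Hypothetical ages, only for the illustration
ages <- c(23, 35, 41, 29, 52, 35)

print(mean(ages))      # arithmetic mean
print(sd(ages))        # sample standard deviation
print(max(ages))       # largest value
print(min(ages))       # smallest value
print(summary(ages))   # minimum, quartiles, median, mean, maximum
```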
Data Analysis in R. 2
Load rattle(), and
csvFile is in your session, i.e. type objects() in the console and check.
Select the R Dataset option.
In the box Data Name, locate csvFile.
Click Execute. The following opens:
rattle
As an example, I selected age as a variable to explore.
The boxplot, histogram, cumulative, and Benford's plots appear to the right.
More variables could be used.
Variables:
i1=average drinks per day
age=age of individual
substance=alcohol/cocaine/heroin
Model: how does i1 depend on age and substance
Response: i1
           Df Sum Sq Mean Sq F value    Pr(>F)
age         1   7669  7669.4  20.847 7.159e-06 ***
substance   2  24490 12245.2  33.285 7.825e-14 ***
Residuals 313 115149   367.9
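A sketch of how a table of this shape can be produced with lm() and anova(). The data below is simulated, so the numbers will not match the table above; the data.frame sim, its coefficients, and the seed are all assumptions made for the illustration.

```r
set.seed(1)   # makes the simulated data reproducible
n <- 90

# Simulated stand-in for the real data
sim <- data.frame(
  age       = round(runif(n, 20, 60)),
  substance = factor(rep(c("alcohol", "cocaine", "heroin"), each = n / 3))
)
# Simulated response: depends on age and on the substance group, plus noise
sim$i1 <- 5 + 0.2 * sim$age +
  ifelse(sim$substance == "alcohol", 15, 0) + rnorm(n, sd = 10)

# Linear model of i1 on age and substance, then its ANOVA table
fit <- lm(i1 ~ age + substance, data = sim)
aovTable <- anova(fit)
print(aovTable)
```

The rows of the table follow the order of the terms in the formula, with the residuals last.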
and improve R
Packages with multiple applications have to undergo a
code peer review from the R project: http://www.r-project.org/index.html
CRAN supports
PART IV
Graphs in R
X-Y plotting
Use the csvFile data
plot(csvFile$age,csvFile$mcs,xlab="Age (y)",ylab="MCS",main="Plot of the Mental Composite Score Versus age in the HELP study")
3D plots
library(rgl) #load rgl package
plot3d(csvFile$age,csvFile$mcs,csvFile$cesd,xlab="Age (y)",ylab="MCS",zlab="CESD",main="Plot of the Mental Composite Score Versus age in the HELP study")
A personal suggestion/opinion
Start easy and build up in complexity
Learn R with Lam's book, then move to more advanced works, e.g. use
Chambers' book when you are VERY comfortable with R
Pick up any interesting problem you want to solve, then go
Bibliography 1
Braun WJ and Murdoch DJ, A first course in statistical programming with R (Cambridge UP, Cambridge
UK, 2007)
reading.org.ua/bookreader.php/137398/Software_for_Data_Analysis_-_Programming_with_R.pdf
Excellent source of papers on diverse statistical applications. Many of the papers are R packages.
Kleinman K and Horton NJ, SAS and R (CRC Press, New York NY, 2010)
Great book with lots of examples. Excellent if you come to R from SAS.
Maindonald J and Braun WJ, Data Analysis and Graphics Using R (Cambridge UP, Cambridge UK, 2010) Third Edition.
An excellent reference. Not for the beginner, it assumes a lot of background in R. Also the notation is very compact.
Bibliography 2
Matloff N, The Art of R Programming (No Starch Press, San Francisco, 2011)
One of my favorite references, covering both elementary and up-to-date
applications in R.
The R Journal, http://journal.r-project.org/
Papers available from repository since 2001.
Spector P, Data Manipulation with R (Springer, 2008)
Nice little book with good practical examples on special topics.
Torgo L, Data Mining with R (CRC Press, Boca Raton, FL, 2011)
Good selected topics on data mining with R.
Bibliography 3
Venables WN and Smith DM (2012) An Introduction to R:
http://cran.r-project.org/doc/manuals/R-intro.pdf
The classic reference. To learn R I prefer Lam's work, but use what
fits you best.
Williams, G, Data Mining with Rattle and R (Springer, 2011)
Excellent book on data mining. Also the rattle IDE is very good.
Witten IH, Frank E, and Hall M, Data Mining: Practical Machine
Learning Tools and Techniques (MK, Amsterdam, 2011) 3rd Ed.