start();
TAB works for autocompletion.
Installing packages:
chooseCRANmirror()   # choose a CRAN mirror, e.g. Spain (Madrid), which was entry 59 in the list
Full signature:
install.packages(pkgs, lib, repos = getOption("repos"), contriburl = contrib.url(repos, type), method, available = NULL, destdir = NULL, dependencies = NA, type = getOption("pkgType"), configure.args = getOption("configure.args"), configure.vars = getOption("configure.vars"), clean = FALSE, Ncpus = getOption("Ncpus", 1L), libs_only = FALSE, INSTALL_opts, ...)
The important calls:
install.packages("tree"); install.packages("e1071"); install.packages("class"); install.packages("igraph"); install.packages("MVA"); install.packages("MASS"); install.packages("arules"); install.packages("boost"); install.packages("caTools"); install.packages("Matrix"); install.packages("lattice");
# datasets and stats ship with R itself and do not need to be installed
install.packages(c("pkg1", "pkg2"))
install.packages("Rcmdr", dependencies = TRUE)
Or, on Windows: menu Packages -> Install package(s) -> select the packages and click OK
downloaded 106 Kb
package 'tree' successfully unpacked and MD5 sums checked
Important packages and their help pages:
help("tree") ==> classification or regression trees.
help("datasets") ==> data frames for exercises.
data(iris); ls(); iris[1:5,]
Package e1071 includes svm, naiveBayes, ...
> install.packages("e1071")
Packages for machine learning:
For classification: tree in tree, svm in e1071, knn in class, lda in MASS, adaboost in boost
For clustering: kmeans in stats
Other useful packages: caTools, kernlab, mlbench, cluster
REMEMBER TO LOAD THE PACKAGES YOU ARE GOING TO USE. This can be done from code with: library(packagename); ==> library("e1071");
demo("graphics") Example of the system's drawing routines.
?plot Help for the plot command.
Commands are separated by a semicolon (;) or by a newline. If a command is not syntactically complete at the end of a line, R shows a continuation prompt, for example +
Executing commands from a file and redirecting output:
source("órdenes.R")
sink("resultado.lis")
The second command sends the rest of the output to the operating-system file resultado.lis in the working directory, instead of to the screen. The command
> sink()
returns the output to the screen again. If you use absolute file names rather than relative ones, the results are stored in them regardless of the working directory.
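As a minimal sketch of output redirection with sink(), the following writes printed output to a temporary file and reads it back. The use of tempfile() and the variable name out are illustrative choices, not part of the original notes.

```r
out <- tempfile(fileext = ".lis")  # hypothetical output file in a temporary directory
sink(out)        # from now on, printed output goes to the file
print(1:3)
sink()           # restore output to the screen
readLines(out)   # the file now holds the text that print() produced
```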
objects() can be used to obtain the names of the objects stored in R. This function is equivalent to ls(). To remove objects use rm, for example:
> rm(x, y, z, tinta, chatarra, temporal, barra)
To display an object, simply type its name. E.g.:
> x <- 3; x ==> [1] 3 (a vector with 1 element)
Objects created during an R session can be stored in a file for later use. At the end of the session, R asks whether you want to do so. If you answer yes, all objects are stored in the file `.RData' in the working directory. It is advisable to use a different working directory for each problem you analyse with R, since it is very common to create objects with names such as x and y.
The simplest structure is the vector, an ordered collection of numbers. To create a vector, say x, consisting of the five numbers 10.4, 5.6, 3.1, 6.4 and 21.7, use the command
> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
or: assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))
This also works: c(10.4, 5.6, 3.1, 6.4, 21.7) -> x
If an expression is used as a command by itself, its value is printed and then lost. So the command
> 1/x
simply prints the reciprocals of the five values above on the screen (the value of x is, of course, not modified). If you then make the assignment
> y <- c(x, 0, x)
it creates a vector y with 11 elements, consisting of two copies of x with a zero between them.
v <- 2*x + y + 1
generates a new vector v of length 11, built by adding, element by element, the vector 2*x repeated 2.2 times, the vector y, and the number 1 repeated 11 times.
The elementary arithmetic operators are the usual +, -, *, / and ^ for raising to a power. The well-known functions log, exp, sin, cos, tan and sqrt are also available. There are many more functions, among them: max and min, which select the largest and smallest element of a vector; range, whose value is the vector of length two c(min(x), max(x)); length(x), the number of elements of x; sum(x), the sum of all the elements of x; and prod(x), their product.
Two statistical functions are mean(x), which computes the mean, i.e. sum(x)/length(x), and var(x), which computes the sample variance, i.e. sum((x-mean(x))^2)/(length(x)-1)
To sort a vector use sort(x), which returns a vector of the same size as x with the elements in increasing order. There are also order() and sort.list(), which produce the permutation of the vector that corresponds to the sorted order.
Note that max and min select the largest and smallest value of their arguments, even when these are several vectors. The parallel functions pmax and pmin return a vector (of the same length as the longest argument) that contains, in each element, the largest or smallest element at that position among all the input vectors.
To work with complex numbers you must give the complex part explicitly. Thus sqrt(-17) returns NaN and a warning, but sqrt(-17+0i) correctly computes the square root of this complex number.
Sequences:
y <- c(1:10) generates the elements 1,2,3,4,5,6,7,8,9,10
or: y <- seq(1,10)
or: y <- c(seq(1,10))
z <- seq(-5, 5, by=.2) generates the vector c(-5.0, -4.8, -4.6, ..., 4.6, 4.8, 5.0) and stores it in z
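The vector operations described above can be tried directly. This is a small sketch using the same x and y; suppressWarnings() is added here only because R warns when the shorter vector is not recycled a whole number of times.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
y <- c(x, 0, x)                          # 11 elements
v <- suppressWarnings(2*x + y + 1)       # 2*x (length 5) is recycled to length 11
length(v)

range(x)                                 # same as c(min(x), max(x))
var(x)                                   # sample variance ...
sum((x - mean(x))^2) / (length(x) - 1)   # ... computed by hand

x[order(x)]                              # order() gives the permutation that sorts x
pmax(x, 6)                               # element-wise maximum of x and 6
sqrt(-17+0i)                             # complex square root
z <- seq(-5, 5, by = 0.2)                # sequence from -5 to 5 in steps of 0.2
```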
muestra <- sample(1:150, 120) draws 120 random elements, without replacement, from the values 1 to 150.
Booleans:
> temp <- x > 13
temp <- 5==5; temp ==> [1] TRUE (TRUE=1, FALSE=0)
The logical operators are < (less than), <= (less than or equal), > (greater than), >= (greater than or equal), == (equal), and != (not equal). In addition, if c1 and c2 are logical expressions, then c1&c2 is their intersection ("conjunction"), c1|c2 is their union ("disjunction") and !c1 is the negation of c1.
Strings:
a <- "HOLA"; a ==> [1] "HOLA"
The function paste() joins all the character vectors supplied to it, element by element, into character strings:
> labs <- paste(c("X","Y"), 1:10, sep="")
stores in labs the character vector c("X1", "Y2", "X3", "Y4", "X5", "Y6", "X7", "Y8", "X9", "Y10")
Remember that since c("X", "Y") has only two elements, it has to be repeated 5 times to match the length of the vector 1:10.
Index vectors: a vector of character strings. This option can only be used if the vector has a names attribute identifying its components, in which case it behaves much like indexing with positive integers.
> fruta <- c(5, 10, 1, 20)
> names(fruta) <- c("naranja", "plátano", "manzana", "pera")
> postre <- fruta[c("manzana","naranja")]
The advantage here is that names are often easier to remember than numeric indices. This option is especially useful when working with the "data frame" structure described later.
> alfa <- alfa[2 * 1:5]
turns it into an object of length 5 made up of the elements in the even positions of the original object.
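A short sketch of the different kinds of index vector may help here. The starting vector alfa <- 1:10 and the shortened fruta vector are assumptions made for illustration.

```r
alfa <- 1:10                 # assumed starting vector
alfa[2 * 1:5]                # elements in the even positions
alfa[alfa > 5]               # logical index vector: elements greater than 5
alfa[-(1:3)]                 # negative indices exclude positions 1 to 3

fruta <- c(naranja = 5, manzana = 1, pera = 20)  # named vector
fruta[c("manzana", "naranja")]                   # index by name
```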
Getting Started with R
See also the introduction to R. There's also an R wiki (http://rwiki.sciviews.org/doku.php) with lots of useful information, tips etc.
The online help pages can be accessed as follows:
help.start()            # Load HTML help pages into browser
help(package)           # List help page for "package"
?package                # Shortform for "help(package)"
help.search("keyword")  # Search help pages for "keyword"
?help                   # For more options
help(package=base)      # List tasks in package "base"
q()                     # Quit R
Some extra packages are not loaded by default, but can be loaded as follows:
library("package")
library()                # List available packages to load
detach("package:pkg")    # Unload the loaded package "pkg"
library(help="package")  # list package contents
Commands can be separated by a semicolon (;) or a newline.
All characters after # are treated as comments (even on the same line as commands).
Variables are assigned using <- (although = also works):
x <- 4.5
y <- 2*x^2   # Use "<<-" for global assignment
Vectors
x <- c(1, 4, 5, 8)   # Concatenate values & assign to x
y <- 2*x^2           # Evaluate expression for each element in x
> y                  # Typing the name of a variable evaluates it
[1] 2 32 50 128
If x is a vector, y will be a vector of the same length. The vector arithmetic in R means that an expression involving vectors of the same length (e.g. x+y, in the above example) will be evaluated element by element, without the need to loop explicitly over the values:
> x+y
[1] 3 36 55 136
> 1:10               # Same as seq(1, 10)
[1] 1 2 3 4 5 6 7 8 9 10
> a <- 1:10
a[-1]                # Print whole vector *except* the first element
a[-c(1, 5)]          # Print whole vector except elements 1 & 5
a[1:5]               # Print elements 1 to 5
a[length(a)]         # Print last element, for any length of vector
head(a, n=3)         # Print the first n elements of a
tail(a, n=3)         # Print the last n elements of a
Typing the name of a function lists its contents; typing a variable evaluates it, or you can use print(x).
A great feature of R functions is that they offer a lot of control to the user, but provide sensible hidden defaults. A good example is the histogram function, which works very well without any explicit control: First, generate a set of 10,000 Gaussian distributed random numbers:
data <- rnorm(1e4)   # Gaussian distributed numbers with mean=0 & sigma=1
hist(data)           # Plots a histogram, with sensible bin choices by default
See ?hist for the full range of control parameters available, e.g. to ask for about 7 bins (a single number given to breaks is treated as a suggestion only):
hist(data, breaks=7)
Data Input/Output
Assuming you have a file (file.dat) containing the following data, with either spaces or tabs between fields:
r x y
1 4.2 14.2
2 2.4 64.8
3 8.76 63.4
4 5.9 32.2
5 3.4 89.6
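The step that reads this file into R is not shown here. A minimal sketch, assuming the data frame is called inp and its columns are renamed to radius, temperature and time (the names that appear in the output further down), could look like this. The temporary file stands in for file.dat so that the example is self-contained.

```r
# Write the example data to a temporary file (stand-in for file.dat)
tmp <- tempfile(fileext = ".dat")
writeLines(c("r x y",
             "1 4.2 14.2",
             "2 2.4 64.8",
             "3 8.76 63.4",
             "4 5.9 32.2",
             "5 3.4 89.6"), tmp)

inp <- read.table(tmp, header = TRUE)                 # first line holds the column names
colnames(inp) <- c("radius", "temperature", "time")   # assumed renaming
inp$r    # partial matching: unambiguous, so this returns the radius column
inp$t    # ambiguous between temperature and time, so this returns NULL
```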
Note that you can refer to R objects, names etc. using the first few letters of the full name, provided that is unambiguous, e.g.:
> inp$r # Print the Radius column
but note what happens if the information is ambiguous or if the column doesn't exist:
> inp$t        # Could match "inp$temperature" or "inp$time"
NULL
> inp$wibble   # "wibble" column doesn't exist
NULL
Writing data out to a file (if no filename is specified (the default), the output is written to the console):
> write.table(inp, quote=F, row.names=F, col.names=T)
radius temperature time
1 4.2 14.2
2 2.4 64.8
3 8.76 63.4
4 5.9 32.2
5 3.4 89.6
By default the columns are separated by whitespace, but you can change this with the sep= option (see ?write.table for details), e.g. use a : with a tab either side:
> write.table(inp, quote=F, row.names=F, col.names=T, sep="\t:\t")
radius	:	temperature	:	time
1	:	4.2	:	14.2
2	:	2.4	:	64.8
3	:	8.76	:	63.4
4	:	5.9	:	32.2
5	:	3.4	:	89.6
Saving data
?save
save(inp, t, file="data.RData")
load(file="data.RData")   # read back in the saved data:
# - this will load the saved R objects into the current session, with
#   the same properties, i.e. all the variables will have the same names
#   and contents
Note that when you quit R (by typing q()), it asks if you want to save the workspace image. If you answer yes (y), it writes two files to the current directory, called .RData and .Rhistory. The former contains the contents of the saved session (i.e. all the R objects in memory) and the latter is a list of all the command history (it's just a simple text file you can look at). At any time you can save the history of commands using:
savehistory(file="my.Rhistory")
loadhistory(file="my.Rhistory")   # load history file into current R session
Plotting data
Useful functions:
?plot        # Help page for plot command
?par         # Help page for graphics parameter control
?Devices     # or "?device" (R can output in postscript, PDF, bitmap, PNG, JPEG and more formats)
dev.list()   # list graphics devices
colours()    # or "colors()" List all available colours
?plotmath    # Help page for writing maths expressions in R
Tip: to generate png, jpeg etc. files, I find it's best to create pdf versions in R, then convert them in gimp (having specified strong antialiasing, if asked, when loading the pdf file into gimp), otherwise the figures are "blocky" (i.e. suffer from aliasing). To create an output file copy of a plot for printing or including in a document etc.
dev.copy2pdf(file="myplot.pdf")   # } Copy contents of current graphics device
dev.copy2eps(file="myplot.eps")   # } to a PDF or Postscript file
dev.copy()
Plot graph, but don't show the points (type="n") & plot errors:
plot(x, y, log="xy", type="n")
errorbar(x, xlo, xup, y, ylo, yup)   # errorbar() is not part of base R; it must come from an add-on package or be defined by the user
?plot gives more info on plotting options (as does ?par). To use different line colours, styles, thicknesses & errorbar types for different points:
errorbar(x, xlo, xup, y, ylo, yup, col=rainbow(length(x)), lty=1:5, type=c("b", "d", "c", "x"), lwd=1:3)
Plot chi-squared probability vs. reduced chi-squared for 10 and then 100 degrees of freedom
dof <- 10
curve(pchisq(x*dof, df=dof), from=0.1, to=3, xlab="Reduced chi-squared", ylab="Chi-sq distribution function")
abline(v=1, lty=2, col="blue")   # Plot dashed line for reduced chisq=1
dof <- 100
curve(pchisq(x*dof, df=dof), from=0.1, to=3, xlab="Reduced chi-squared", ylab="Chi-sq distribution function")
abline(v=1, lty=2, col="blue")   # Plot dashed line for reduced chisq=1
Analysis
Basic stuff
x <- rnorm(100)    # Create 100 Gaussian-distributed random numbers
hist(x)            # Plot histogram of x
summary(x)         # Some basic statistical info
d <- density(x)    # Compute kernel density estimate
d                  # Print info on density analysis ("?density" for details)
plot(d)
plot(density(x))
Calculate standard deviation of 100 Gaussian-distributed random numbers (NB default sd is 1; can specify with rnorm(100, sd=3) or whatever)
sd(rnorm(100))
Now, let's run N=5 Monte Carlo simulations of the above test:
replicate(5, sd(rnorm(100)))
Obviously 5 is rather small for a number of MC sims (repeat the above command & see how the answer fluctuates). Let's try 10,000 instead:
mean(replicate(1e4, sd(rnorm(100))))
With a sample size of 100, the measured sd closely matches the actual value used to create the random data (sd=1). However, see what happens when the sample size decreases:
mean(replicate(1e4, sd(rnorm(10))))   # almost a 3% bias low
mean(replicate(1e4, sd(rnorm(3))))    # >10% bias low
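The size of this bias can also be checked against the standard analytic result for Gaussian data, where the expected value of the sample standard deviation is c4(n) times the true sigma. The helper name c4 is chosen here for illustration; it is not an R built-in.

```r
# Expected value of sd() for n Gaussian samples with true sigma = 1
c4 <- function(n) sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)

c4(c(100, 10, 3))        # expected mean of sd(rnorm(n)) for each sample size
1 - c4(c(100, 10, 3))    # fractional bias low
```

The values for n=10 and n=3 correspond to the "almost 3%" and ">10%" biases seen in the simulations.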
You can see the bias more clearly with a histogram or density plot:
a <- replicate(1e4, sd(rnorm(3)))
hist(a, breaks=100, probability=T)   # Specify lots of bins
lines(density(a), col="red")         # Overlay kernel density estimate
abline(v=1, col="blue", lwd=3)       # Mark true value with thick blue line
This bias is important to consider when estimating the velocity dispersion of a stellar or galaxy system with a small number of tracers. The bias is even worse for the robust sd estimator mad():
mean(replicate(1e4, mad(rnorm(10))))   # ~9% bias low
mean(replicate(1e4, mad(rnorm(3))))    # >30% bias low
However, in the presence of outliers, mad() is robust compared to sd() as seen with the following demonstration. First, create a function to add a 5 sigma outlier to a Gaussian random distribution:
outlier <- function(N, mean=0, sd=1) {
  a <- rnorm(N, mean=mean, sd=sd)
  a[1] <- mean+5*sd   # Replace 1st value with 5 sigma outlier
  return(a)
}
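The demonstration itself is not shown here. One way it might be run, as a sketch, is to compare the average sd() and mad() over many simulated samples that each contain the outlier. The sample size of 100, the 1000 repetitions and the seed are arbitrary choices made for this example.

```r
outlier <- function(N, mean=0, sd=1) {
  a <- rnorm(N, mean=mean, sd=sd)
  a[1] <- mean + 5*sd              # replace 1st value with a 5 sigma outlier
  return(a)
}

set.seed(1)                                        # reproducible random numbers
sims.sd  <- replicate(1e3, sd(outlier(100)))       # classical estimator
sims.mad <- replicate(1e3, mad(outlier(100)))      # robust estimator
mean(sims.sd)     # pulled above the true value of 1 by the outlier
mean(sims.mad)    # stays close to the true value of 1
```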
ks.test(x, y)
Correlation tests
Load in some data (see section on data frames, below):
dfile <- "http://www.sr.bham.ac.uk/~ajrs/papers/sanderson06/table1.txt"
A <- read.table(dfile, header=T, sep="|")
cor.test(A$z, A$Tx)                      # Specify X & Y data separately
cor.test(~ z + Tx, data=A)               # Alternative format when using a data frame ("A")
cor.test(A$z, A$Tx, method="spearman")   # } Alternative methods
cor.test(A$z, A$Tx, method="kendall")    # }
# - note that "sf" is a function:
sf(1.5)                           # Evaluate spline function at x=1.5
plot(x,y); curve(sf(x), add=T)    # Plot data points & spline curve
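The fragment above refers to a spline function sf whose definition is not shown. One way such a function could be created is with splinefun() from the stats package; the data below are made up for illustration.

```r
x <- 1:10
y <- sin(x)              # illustrative data, not from the original page
sf <- splinefun(x, y)    # returns a function that interpolates the points
sf(1.5)                  # evaluate the spline between two data points
sf(x) - y                # zero (to rounding) at the data points themselves
```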
Regression
(See also the tutorial at Penn State's Center for Astrostatistics, which is more thorough.)
Load in some data (see section on data frames, below; this dataset is described in more detail in the section on factors, below):
dfile <- "http://www.sr.bham.ac.uk/~ajrs/papers/sanderson06/table1.txt"
A <- read.table(dfile, header=T, sep="|")
names(A)                    # Print column names in data frame
m <- lm(Tx ~ z, data=A)     # Fit linear model
summary(m)                  # Show best-fit parameters & errors
Plot residuals:
plot(A$z, resid(m))
Since R was developed with experimental (rather than observational) sciences in mind, it is not geared up to handle errors in both X and Y directions. Therefore the existing algorithms all perform regression in one direction only. However, unweighted orthogonal regression is equivalent to principal components analysis.
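As a rough sketch of that equivalence, the slope of an unweighted orthogonal regression line can be read off the first principal component returned by prcomp(). The simulated data and the variable names below are assumptions for illustration only.

```r
set.seed(42)                          # reproducible random numbers
x <- rnorm(200)
y <- 2 * x + rnorm(200, sd = 0.5)     # simulated data with true slope 2

pc <- prcomp(cbind(x, y))             # PCA on the centred (unscaled) data
# The first principal component points along the orthogonal regression line
slope <- pc$rotation["y", "PC1"] / pc$rotation["x", "PC1"]
intercept <- mean(y) - slope * mean(x)   # the line passes through the centroid

c(intercept, slope)                   # orthogonal regression fit
coef(lm(y ~ x))                       # ordinary (Y on X) regression, for comparison
```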
Data Frames
Create a data frame:
A <- data.frame(a=1:3, b=4:6, c=7:9)
colnames(A)   # List column names for A
rownames(A)   # List row names (will just be numbers if not specified)
# ("&" = logical AND; "|" = logical OR) # Delete a column from a data frame
However, in a situation where this operation might be occurring iteratively, you could end up with multiple, repeated instances of B:
A <- data.frame(A, B)
# To recover the original data, use:
A <- A[, 1:3]
Note that if column names are specified, they must match in all data.frames for rbind to work:
rbind(A, B)                   # Fails!
colnames(B) <- colnames(A)    # Use same column names for B as for A
rbind(A, B)                   # Works
Working with data frames
You can use attach(A) and detach(A) to add/remove a data frame to/from the current search path, allowing you to use the column names as if they were standalone vectors. However, this approach is usually best avoided, since it can cause confusion if any objects exist with the same name as a column of an attached data frame. It's better to use with to achieve the same effect within a controlled setting.
The with function enables you to work with column names without having to prefix the data frame name, as follows:
A <- data.frame(a=1:20, b=rnorm(20))
with(A, a^2 + 2*b)
This principle is used in the transform function, to allow you to construct new columns in the data frame using references to the other column names:
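A minimal sketch of transform, reusing the same data frame layout as above; the new column name c and the seed are chosen here for illustration.

```r
set.seed(1)                                 # reproducible random numbers
A <- data.frame(a = 1:20, b = rnorm(20))
A <- transform(A, c = a^2 + 2*b)            # new column built from existing columns
head(A, n = 3)                              # show the first few rows
```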
Similarly, the subset function works in the same way to enable easy filtering of data frames:
subset(A, b >= 0 & a%%2 == 0)   # Returns rows with positive b & even numbered a
You can also easily perform database join operations, using merge:
B <- data.frame(a=1:7, x=runif(7))
merge(A, B)   # Return rows with same "a", combining unique columns from A & B
Functions
Basics
cat                         # Type function name without brackets to list contents
args(cat)                   # Return arguments of any function
body(cat)                   # Return main body of function
formals(fun)                # Get or set the formal arguments of a function
debug(fun); undebug(fun)    # Set or unset the debugging flag on a function
Now create a simple sigma clipping function (you can paste the following straight into R):
sigma.clip <- function(vec, nclip=3, N.max=5) {
  mean <- mean(vec); sigma <- sd(vec)
  clip.lo <- mean - (nclip*sigma)
  clip.up <- mean + (nclip*sigma)
  vec <- vec[vec < clip.up & vec > clip.lo]   # Remove outliers
  if ( N.max > 0 ) {
    N.max <- N.max - 1
    # Note the use of recursion here (i.e. the function calls itself):
    vec <- Recall(vec, nclip=nclip, N.max=N.max)
  }
  return(vec)
}
# Only the main population remains! # } Compare numbers of # } values before & after clipping
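The code that exercised sigma.clip is not shown here. A possible test, sketched under the assumption of a main Gaussian population contaminated by a small group of distant points, is given below; the sample sizes, the offset of 20 and the seed are arbitrary.

```r
sigma.clip <- function(vec, nclip=3, N.max=5) {
  mean <- mean(vec); sigma <- sd(vec)
  clip.lo <- mean - (nclip*sigma)
  clip.up <- mean + (nclip*sigma)
  vec <- vec[vec < clip.up & vec > clip.lo]    # remove outliers
  if ( N.max > 0 ) {
    N.max <- N.max - 1
    vec <- Recall(vec, nclip=nclip, N.max=N.max)   # recursive call
  }
  return(vec)
}

set.seed(1)                                   # reproducible random numbers
x <- c(rnorm(1000), rnorm(10, mean = 20))     # main population plus 10 outliers
clipped <- sigma.clip(x)
range(clipped)                                # the points near 20 are gone
length(x); length(clipped)                    # numbers of values before & after clipping
```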
Factors
Factors are a vector type which treats the values as character strings which form part of a base set of values (called "levels"). The result of plotting a factor is to produce a frequency histogram of the number of entries in each level (see ?factor for more info).
Basics
a <- rpois(100, 3)   # Create 100 Poisson distributed random numbers (lambda=3)
hist(a)              # Plot frequency histogram of data
a[0]                 # List type of vector ("integer" here, since rpois returns integers)
b <- as.factor(a)    # Change type to factor
List type of vector b ("factor"): also lists the "levels", i.e. all the different types of value:
b[0]
Now, plot b, to get a barchart showing the frequency of entries in each level:
plot(b)
table(b)   # a tabular summary of the same information
A more practical example. Read in table of data (from Sanderson et al. 2006), which refers to a sample of 20 clusters of galaxies with Chandra X-ray data:
file <- "http://www.sr.bham.ac.uk/~ajrs/papers/sanderson06/table1.txt"
A <- read.table(file, header=T, sep="|")
In R versions before 4.0.0, read.table() treats non-numeric values as factors by default; from R 4.0.0 onwards they are read as character strings unless you pass stringsAsFactors=TRUE. If the conversion to factors is unwanted, use as.is=T; you can also specify the type of all the input columns explicitly -- see ?read.table.
A[1,]            # Print 1st row of data frame (will also show column names)
plot(A$det)      # Plot the numbers of each detector type
plot(A$cctype)   # Plot the numbers of cool-core and non-cool core clusters
The det column is the detector used for the observation of each cluster: S for ACIS-S and I for ACIS-I. The cctype denotes the cool core type: CC for cool core, NCC for non-cool core. Now, plot one factor against another, to get a plot showing the number of overlaps between the two categories (the cool-core clusters are almost all observed with ACIS-S):
plot(cctype ~ det, data=A, xlab="ACIS Detector", ylab="Cool-core type")
You can easily explore the properties of the different categories, by plotting a factor against a numeric variable. The result is to show a boxplot (see ?boxplot) of the numeric variable for each category. For example, compare the mean temperature of cool-core & non-cool core clusters:
plot(kT ~ cctype, data=A)
and compare the redshifts of clusters observed with ACIS-S & ACIS-I:
plot(z ~ det, data=A)
You can define your own factor categories by partitioning a numeric vector of data into discrete bins using cut(), as shown here:
Converting a numeric vector into factors
Split cluster redshift range into 2 bins of equal width (nearer & further clusters), and assign each z point accordingly; create new column of data frame with the bin assignments (A$z.bin):
A$z.bin <- cut(A$z, 2)
A$z.bin         # Show levels & bin assignments
plot(A$z.bin)   # Show bar chart
Plot mean temperatures of nearer & further clusters. Since this is essentially a flux-limited sample of galaxy clusters, the more distant clusters (higher redshift) are hotter (and hence more X-ray luminous - an example of the common selection effect known as Malmquist bias):
plot(Tx ~ z.bin, data=A)
plot(Tx ~ z, data=A)   # More clearly seen!
Now, let's say you wanted to colour-code the different CC types. First, create a vector of colours by copying the cctype column (you can add this as a column to the data frame, or keep it as a separate vector):
colour <- A$cctype
Now change the type of the vector to character (i.e. not a factor):
col <- as.character(colour)
col[0]    # Confirm change of data type
col       # } Print values and
colour    # } note the difference
http://www.sr.bham.ac.uk/~ajrs/R/r-access_data.html
Data structures in R
All R objects have a type or mode, as well as a class, which can be determined with typeof, mode & class.
Vector
Vectors are the basic structure and come in the following atomic modes (data types): numeric, integer, character, logical, complex, raw. These modes have corresponding functions which test if an object is of that mode (is.*, e.g. is.numeric) and convert an object to that mode (as.*, e.g. as.character). You can assemble and combine vectors using the often-used function c. Note that vectors must consist of values of the same data type:
c(1, "a", TRUE)      # all values coerced to character
list(1, "a", TRUE)   # preserves different types (see below)
Factors
Factors encode categorical data, and are an extremely useful and efficient way of handling categories with multiple entries. Note that R versions before 4.0.0 coerce character data to a factor type by default in many places (e.g. when using read.table). Also have is.factor & as.factor.
chars <- strsplit("the cat sat on the mat", "")[[1]]   # create vector of characters
chars <- factor(chars)       # convert from character to a factor
levels(chars)                # show factor levels (i.e. different letters here)
plot(chars)                  # show barchart of factor level frequencies
levels(chars)[1] <- "_"      # replace whitespaces with underscores
paste(chars, collapse="")    # collapse to a single character string
One thing to watch out for with factors is converting them to numeric mode. Factors are actually stored as a vector of integers, referring to the element number of the factor levels. In the following example, there are 3 levels ("100", "200" & "300"), which are represented as characters, and the numeric values of the factor comprise the integers 1-3, referring to the elements of the vector of levels.
Xvector <- c(1, 2, 2, 3, 3, 3) * 100
Xfactor <- factor(Xvector)
levels(Xfactor)                              # show levels, which are "100" "200" "300"
as.numeric(Xfactor)                          # reports "1 2 2 3 3 3" - the elements of the levels vector
x <- as.numeric(levels(Xfactor)[Xfactor])    # retrieve actual numeric values
identical(x, Xvector)                        # same as original numeric vector
Matrix/arrays
Matrices are 2-dimensional arrays, which are themselves generalisations of a vector to more than 1 dimension. Also have is.matrix, is.array & as.matrix, as.array.
M <- matrix(1:12, nrow=3)           # create a matrix with 3 rows & 4 columns
dim(M)                              # show dimensions
M[2, 3]                             # print element in 2nd row & 3rd column
2 * matrix(rep(1, 12), nrow=3)      # multiply every element by a constant
A <- array(1:12, dim=c(2, 2, 3))    # create a 3d array
A[1, 2, 1]                          # print single element
A[1, , ]                            # print a matrix subset
Arrays are actually stored in a 1 dimensional structure, so you can still access their elements with a single subscript:
A[5]
List
Lists are used to store data of any type or dimensions in a free-form structure. Also have is.list & as.list
l <- list(functions=c(mean, median), chars=month.abb, numbers=rnorm(7))
l$chars          # print "chars" element
l[2]             # print 2nd element *as a single-item list*
l[[2]]           # print element as a *vector*
l["chars"]       # } compare and
l[["chars"]]     # } contrast
Data frame
Data frames are widely used in R to store data in a variety of formats with related entries in each row and different attributes in each column, much like a table or spreadsheet. A data frame is essentially a special type of list and elements of data frames can be accessed in exactly the same way as for a list. Also have is.data.frame & as.data.frame
A <- data.frame(a=LETTERS[1:4], b=1:4, c=c(T, T, F, T))
sapply(A, class)            # show data types for each column
A$b^2                       # perform arithmetic on a numeric column as a vector
dim(A); nrow(A); ncol(A)    # show dimensions of data frame (rows, columns)
A[1, ]                      # print first row
A[, 2]                      # print 2nd column
as.list(A)                  # convert to a list
# Note that matrices must contain data of the same type, so the following
# command converts all the values to character format:
as.matrix(A)                # convert to a matrix
Data frames can have both row and column names (default row names are the row number). This is the same as having a named vector, as seen in the following example:
# create separate, named vectors of data:
planets.mass <- c("Mercury"=0.33, "Venus"=4.87, "Earth"=5.98, "Mars"=0.64, "Jupiter"=1899, "Saturn"=569, "Uranus"=87, "Neptune"=102, "Pluto"=0.13) * 1e24
planets.semimajoraxis <- c("Mercury"=57.9, "Venus"=108, "Earth"=150, "Mars"=228, "Jupiter"=778, "Saturn"=1430, "Uranus"=2870, "Neptune"=4500, "Pluto"=5900) * 1e9
# Now create a data frame:
planets <- data.frame(mass=planets.mass, semimajoraxis=planets.semimajoraxis)
planets["Earth", ]                          # show all data for the Earth
planets["Mars", "mass"]                     # show the mass of Mars; same as planets[4, 1]
rownames(planets) <- paste("planet", 1:9)   # change row names
dimnames(planets); rownames(planets); colnames(planets)   # show info
Excluding columns from a data frame is also very easy, and can be done by reference to the column number or name:
A <- transform(planets, dummy = 1:nrow(planets))   # add an extra column
A[, -3]                                 # exclude extra column by number
A[, -c(2:3)]                            # exclude multiple columns by number
subset(A, select = -dummy)              # exclude extra column by name
subset(A, select = -c(dummy, mass))     # exclude multiple columns by name
Data input/output in R
For a basic introduction, see getting started. See also the R Data Import/Export manual.
R recognises a variety of formats for reading in data. For tabular data, the basic command read.table offers a powerful range of options; it also underlies the shortform commands read.csv and read.delim, for reading in comma-separated values (e.g. output from a spreadsheet) and tab-delimited data, respectively. Similarly, the command write.table is used to output tabular format data. For fixed-width format data, use read.fwf. A more powerful method is to read data directly into a vector or list, using scan.
The following are useful functions for reading and writing a variety of data types. See their respective help pages for details.
source : read in R commands from a file *ideal for loading pre-written chunks of code*
save ; load : write / read R objects to / from a file (see below) *ideal for storing R data*
scan : basic core function to read in data into a list/vector
read.table ; write.table : generic table-format data
read.csv : comma-separated values data (e.g. exported from spreadsheet)
read.fwf : fixed-width format data
read.fortran : fixed-format data files using Fortran-style format specifications
read.DIF : Data Interchange Format (DIF) for data frames from single spreadsheets
read.dcf : Debian Control File format
read.ftable ; write.ftable : flat contingency tables
readBin ; writeBin : binary data
readChar ; writeChar : character strings
readLines ; writeLines : lines
write : write data to a file
dump : write text representation of an object
dget ; dput : read or recreate an ASCII representation of an R object
At any time you can save the history of commands using: savehistory(file="my.Rhistory") and you can load such commands using:
loadhistory(file="my.Rhistory")
ls & objects list the objects currently defined.
apropos finds objects with names containing the specified string, e.g.
apropos("max")
[1] "cummax"    "max"       "max.col"   "pmax"      "pmax.int"  "promax"
[7] "varimax"   "which.max"
One of the most important aspects of computing with data is the ability to manipulate it, to enable subsequent analysis and visualization. R offers a wide range of tools for this purpose. Note that the plyr package provides an even more powerful and convenient means of manipulating and processing data, which I hope to describe in later updates to this page.
http://www.sr.bham.ac.uk/~ajrs/R/r-manipulate_data.html
First create a data frame, then remove a column and create a new one:
A <- data.frame(a=LETTERS[1:5], b=1:5, c=rnorm(5))
A$d <- NULL   # to delete column "d"
A$e <- 1:5    # add in a new column "e"
Now create a second data frame (the last column is simply a random mix of 1 & 2). Note the use of the same column names to see what happens when A & B are joined together:
set.seed(123)   # allow reproducible random numbers
B <- data.frame(a=letters[1:5], b=sample(1:2, size=5, replace=TRUE))
Note that in R versions before 4.0.0 the non-numeric columns of both data frames are treated as factors (unless you use stringsAsFactors=FALSE when using data.frame); from R 4.0.0 they stay as character unless stringsAsFactors=TRUE is given:
> sapply(A, class)
        a         b         c         e
 "factor" "integer" "numeric" "integer"
> sapply(B, class)
        a         b
 "factor" "integer"
To join them together, you could use c, but the result will be a list:
c(A, B)   # creates a list
> class(c(A, B))
[1] "list"
You can either convert this list to a data frame, or else use data.frame:
AB1 <- as.data.frame(c(A, B))
AB2 <- data.frame(A, B)
> identical( AB1, AB2 )
[1] TRUE
colnames(AB2) colnames(AB3)
the identical column names for A & B are rendered unambiguous when using as.data.frame(c(A, B)), by appending .1 to the 2nd data frame column names. It does this using make.unique, which is useful if you need to generate unique elements, given a vector containing duplicated character strings.
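As a small illustration of make.unique, the following applies it to a vector holding the combined column names of A and B from above.

```r
nm <- c("a", "b", "c", "e", "a", "b")   # column names of A followed by those of B
make.unique(nm)                         # repeated names get a numeric suffix
```

The repeated names come back as "a.1" and "b.1", while the first occurrences are left unchanged.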
do.call
do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it. It is an extremely useful function, which can be used to join together data frames stored in a list, for example:
l <- list(first=A[1, ], second=A[2, ], rest=A[-c(1:2), ])
do.call(rbind, l)
This task cannot be performed by calling c or rbind directly on the list, without losing the 2 dimensional structure of the data stored within each component of the list.
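As a minimal sketch of the difference, the following uses three small made-up data frames (the list name pieces and its contents are assumptions for illustration only). do.call(rbind, pieces) behaves like rbind(pieces$first, pieces$second, pieces$third), whereas rbind(pieces) treats the list as a single argument:

```r
# Hypothetical example data: three small data frames held in a list
pieces <- list(
  first  = data.frame(id = 1:2, value = c(0.5, 1.5)),
  second = data.frame(id = 3:4, value = c(2.5, 3.5)),
  third  = data.frame(id = 5L,  value = 4.5)
)

# Each list element is passed to rbind as a separate argument
combined <- do.call(rbind, pieces)
dim(combined)               # 5 rows, 2 columns
is.data.frame(combined)     # TRUE

# Passing the list itself gives a 1 x 3 matrix of list elements instead
wrong <- rbind(pieces)
is.data.frame(wrong)        # FALSE
```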
- add columns to df: c, data.frame(A, newdf)
- subset, transform, "[", "[[" methods
- rbind, cbind
match identifies common elements between 2 vectors and returns the positions in the second vector of these matching elements in the order they appear in the first vector.
match(c("B", "E"), LETTERS)
match(c("B", "3", "E"), LETTERS)   # returns NA if no corresponding match
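A common use of match is as a lookup: the returned positions can index another vector of the same length as the second argument. The sketch below uses a made-up lookup table (lookup, wanted and the labels are assumptions for illustration):

```r
# Hypothetical lookup table
lookup <- data.frame(code  = c("A", "B", "C"),
                     label = c("alpha", "beta", "gamma"),
                     stringsAsFactors = FALSE)
wanted <- c("C", "A", "Z")

pos <- match(wanted, lookup$code)   # positions in lookup$code: 3 1 NA
labels <- lookup$label[pos]         # "gamma" "alpha" NA
found <- wanted %in% lookup$code    # TRUE TRUE FALSE (same as !is.na(pos))
```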
To reorder the rows of a data frame according to the contents of one of its columns you just need to use order to specify the row order of the data frame:
A <- data.frame(a=sample(LETTERS[1:5]), b=sample(1:5))
A[order(A$a), ]   # } compare
A[order(A$b), ]   # }
Reshaping data
First create some multi-column data:
set.seed(123)   # allow reproducible random numbers
A <- data.frame(a=letters[1:3], x=rnorm(3), y=runif(3))
There is also a function reshape which converts between so-called long and wide format data (i.e. columns stacked below each other vs. columns arranged beside each other). However, the documentation for reshape is remarkably opaque! A much more convenient function is melt from the excellent reshape package:
install.packages("reshape")
require(reshape)   # load the package that provides "melt"
melt(A)            # retains column "a", unlike "stack(A)"
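For comparison, here is a sketch of the same wide-to-long conversion using only the base function reshape. The output column names "variable" and "value" are chosen here to resemble melt's output and are an assumption, not something reshape requires:

```r
set.seed(123)   # allow reproducible random numbers
A <- data.frame(a = letters[1:3], x = rnorm(3), y = runif(3))

long <- reshape(A, direction = "long",
                varying = c("x", "y"),   # columns to stack
                v.names = "value",       # name of the stacked value column
                timevar = "variable",    # name of the column identifying the source
                times   = c("x", "y"),   # labels written into "variable"
                idvar   = "a")           # column identifying each original row
long
```

Each of the 3 original rows now appears twice, once per stacked column.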
To truncate data above and below some thresholds (e.g. set all values below zero to zero and above 1 to 1):
x2 <- pmax(pmin(x, 1), 0) # uses nifty parallel maximum & minimum functions
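A small worked case, with made-up input values, shows the element-by-element behaviour: pmin caps each value at 1, then pmax raises anything below 0 up to 0.

```r
x <- c(-0.5, 0.2, 1.7, 0.9)    # hypothetical input values
capped <- pmin(x, 1)           # -0.5  0.2  1.0  0.9
x2 <- pmax(capped, 0)          #  0.0  0.2  1.0  0.9
```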
Miscellaneous commands
If a data frame contains any missing values (NA), you can exclude the corresponding entire row:
A$y[4:9] <- A$x[2] <- NA
> na.omit(A)
            x         y
1  -0.5604756 0.8895393
3   1.5587083 0.6405068
10 -0.4456620 0.1471136
To flatten a list into a named vector, use unlist:
> unlist(list(a=1, b=2:5, c=6))
 a b1 b2 b3 b4  c
 1  2  3  4  5  6
When dealing with long format data, where a vector of values has an associated grouping vector, you can use split to pull out separate list entries for each group:
A <- data.frame(group=LETTERS[rep(1:3, 1:3)], x=rnorm(6))   # 3 groups: "A", "B", "C"
a <- split(A$x, A$group)
> a
$A
[1] 0.6849361
$B
[1] -0.3200564 -1.3115224
$C
[1] -0.5996083 -0.1294107  0.8867361
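Once the values are split into a list, a summary can be computed per group with sapply. This is a minimal sketch using its own random data (the seed and the object names groups and group.means are assumptions):

```r
set.seed(1)   # allow reproducible random numbers
A <- data.frame(group = rep(c("A", "B", "C"), times = 1:3), x = rnorm(6))

groups <- split(A$x, A$group)          # one list entry per group
group.sizes <- lengths(groups)         # number of values in each group
group.means <- sapply(groups, mean)    # mean of each group

# tapply gives the same means without the intermediate list
tapply(A$x, A$group, mean)
```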
http://www.sr.bham.ac.uk/~ajrs/R/r-show_data.html
Before you do anything else, it is important to understand the structure of your data and that of any objects derived from it.
A <- data.frame(a=LETTERS[1:10], x=1:10)
class(A)           # "data.frame"
sapply(A, class)   # show classes of all columns
typeof(A)          # "list"
names(A)           # show list components
dim(A)             # dimensions of object, if any
head(A)            # extract first few (default 6) parts
tail(A, 1)         # extract last row
head(1:10, -1)     # extract everything except the last element
It is sometimes useful to work with a smaller version of a large data frame, by creating a representative subset of the data, via random sampling:
A.small <- A[sample(nrow(A), 4), ] # select 4 rows at random
which.min & which.max return the element number of the lowest/highest value:
set.seed(123)   # allow reproducible random numbers
x <- sample(10)
> which.max(x)
[1] 7
> x[which.max(x)]
[1] 10
This can be used in a data frame to extract the corresponding row containing the min/max value of one of the columns:
A <- data.frame(x=rnorm(10), y=runif(10))
A[which.min(A$x), ]
#--Alternatively:
subset(A, x == min(x))
Other summaries:
x <- rnorm(100)
fivenum(x)   # Tukey's five number summary, used to construct a boxplot:
boxplot(x)   # see ?boxplot.stats for more details
stem(x)      # A stem-and-leaf plot
Matrix summaries:
A <- matrix(rnorm(50), nrow=10)   # create 10x5 random number matrix
colSums(A); rowSums(A); colMeans(A); rowMeans(A)   # self-explanatory
max.col(A)   # maximum position for each row of a matrix, same as:
which.max(A[1,]); which.max(A[2,])   # etc.
Tables
Load some data on a sample of 20 galaxy clusters with a categorical classification status (cctype) indicating whether there is a cool core or not and a factor (det) specifying which of two detectors was used to make the X-ray observation of the cluster:
file <- "http://www.sr.bham.ac.uk/~ajrs/papers/sanderson09/sanderson09_table2.txt"
a <- read.table(file, header=TRUE, sep="|")
#
table(a$cctype)                              # count numbers in each cctype category
table(a$cctype, a$det)                       # 2-way table
xtabs(~ cctype + det, data=a)                # alternative (formula) syntax
addmargins(xtabs(~ cctype + det, data=a))    # add row/col summary (default is sum)
prop.table(xtabs(~ cctype + det, data=a))    # show counts as proportions of total
There is marginal evidence (p=0.07) of an interaction: clusters observed with ACIS-S are more likely to have a cool core than not.
#--Show mean values of a few quantities, for each cctype:
aggregate(. ~ cctype, data=a[c("cctype", "z", "kT", "Z", "S01", "index")], mean)
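If the remote table is unavailable, the same commands can be tried on a small stand-in data frame. The sketch below uses entirely made-up values (the data frame d and its numbers are assumptions, not the real cluster data); only the column names cctype, det and kT follow the original:

```r
# Hypothetical stand-in for the cluster table (values are made up)
d <- data.frame(
  cctype = c("CC", "CC", "non-CC", "CC", "non-CC", "non-CC"),
  det    = c("S",  "I",  "I",      "S",  "I",      "S"),
  kT     = c(3.1,  4.2,  6.5,      2.8,  7.0,      5.9)
)

tab <- xtabs(~ cctype + det, data = d)     # 2-way table of counts
addmargins(tab)                            # add row/column totals
row.props <- prop.table(tab, margin = 1)   # proportions within each cctype (rows sum to 1)
means <- aggregate(kT ~ cctype, data = d, FUN = mean)   # mean kT per cctype
```

Using margin = 1 in prop.table gives proportions within each row, rather than proportions of the grand total as in the default call.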
Base graphics
http://www.sr.bham.ac.uk/~ajrs/R/r-plot_data.html
For a basic introduction, see the "getting started" page here. Base graphics are very flexible and allow a great deal of customisation, with many individual functions available. However, they lack a coherent underlying framework and, for visualizing highly structured data, are outclassed by lattice and ggplot2. Quick reference info:
demo("graphics")   # Demonstration of graphics in R
?plot              # Help page for main plot function
?par               # Help page for changing graphical parameters
?layout            # Help page on plot arrangement
example("pch")     # Point style examples
colours()          # List pre-defined named colours
?plotmath          # Help page on plotting maths symbols
demo(plotmath)
Plot symbols and colours can be specified as vectors, to allow individual specification for each point. R uses recycling of vectors in this situation to determine the attributes for each point, i.e. if the length of the vector is less than the number of points, the vector is repeated and concatenated to match the number required. Single plot symbol (see "?points" for more) and colour (type "colours()" or "colors()" for the full list of predefined colours):
plot(x, y, pch=2, col="red")                       # Hollow triangles
plot(x, y, pch=c(3, 20), col=c("red", "blue"))     # Blue dots; red "+" signs
plot(x, y, pch=1:20)                               # Different symbol for each point
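The recycling rule can be made explicit with rep_len, which repeats a vector to a given length. This sketch uses made-up data (x and y here are assumptions) to show which attributes each point ends up with:

```r
x <- 1:6; y <- x^2             # hypothetical data: 6 points
pch <- c(3, 20)
col <- c("red", "blue")

# What recycling amounts to: each vector repeated to the number of points
pch.used <- rep_len(pch, length(x))   # 3 20 3 20 3 20
col.used <- rep_len(col, length(x))   # "red" "blue" "red" "blue" "red" "blue"

plot(x, y, pch = pch, col = col)      # odd points: red "+"; even points: blue dots
```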
Label axes:
plot(x, y, xlab="Some data", ylab="Wibble")
Axis limits are controlled by xlim and ylim, which are vectors of the minimum and maximum values, respectively. Specify axis limits:
plot(x, y, xlim=c(11, 12), ylim=c(0, 150))
To view the graphical layout, the following will show the borders of the sub-panels and the number identifying each one:
layout(matrix(1:4, 2, 2))   # Specify layout for 4 panels
layout.show(4)              # Show the 4 panels, for the defined layout
layout.show(2)              # Try specifying just 2 instead
The heights and widths arguments to layout are vectors of relative heights and widths of the matrix rows and columns, respectively.
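As a rough illustration of relative sizes, the sketch below makes the left column twice as wide as the right and the bottom row twice as tall as the top (the particular ratios are arbitrary choices):

```r
# 2x2 grid, filled column by column: panels 1 & 2 on the left, 3 & 4 on the right
n <- layout(matrix(1:4, 2, 2), widths = c(2, 1), heights = c(1, 2))
layout.show(n)   # "layout" returns the number of panels defined
layout(1)        # restore a single panel
```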
curve provides the function to be plotted with a vector of x-axis values called x with which to calculate the corresponding y-axis data. If the argument of your function is not called x (e.g. r) , then you need to use the following syntax: curve(myfun(r=x)). The following example illustrates this with a plot of several blackbody curves. First, define a function for the Planck blackbody law to calculate the radiation intensity as a function of wavelength (lambda, in microns) and temperature (Temp, in Kelvin):
blackbody <- function(lambda, Temp=1e3) {
  h <- 6.626068e-34; c <- 3e8; kb <- 1.3806503e-23   # constants
  lambda <- lambda * 1e-6                            # Convert from microns to metres
  ( 2*pi*h*c^2 ) / ( lambda^5*( exp( (h*c)/(lambda*kb*Temp) ) - 1 ) )
}
Now plot the curve for the default temperature of 1000K, with some axis labels:
main <- "Planck blackbody curves"
xlab <- expression(paste(Wavelength, " (", mu, "m)"))
ylab <- expression(paste(Intensity, " ", (W/m^3)))
col <- c("blue", "orange", "red")
lty <- 1:3
curve(blackbody(lambda=x), from=1, to=15, main=main, xlab=xlab, ylab=ylab, col=col[1])
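The col and lty vectors suggest further curves at other temperatures, but those commands are not present in these notes. One possible way to overlay them is with add=TRUE; the temperatures 900 K and 800 K below are assumptions chosen for illustration:

```r
blackbody <- function(lambda, Temp = 1e3) {
  h <- 6.626068e-34; c <- 3e8; kb <- 1.3806503e-23   # constants
  lambda <- lambda * 1e-6                            # convert from microns to metres
  (2*pi*h*c^2) / (lambda^5 * (exp((h*c)/(lambda*kb*Temp)) - 1))
}

temps <- c(1000, 900, 800)           # assumed temperatures (K)
col <- c("blue", "orange", "red")
lty <- 1:3

# First curve sets up the axes; the others are added on top
curve(blackbody(lambda = x, Temp = temps[1]), from = 1, to = 15,
      col = col[1], lty = lty[1],
      xlab = "Wavelength (microns)", ylab = "Intensity")
for (i in 2:3) {
  curve(blackbody(lambda = x, Temp = temps[i]), add = TRUE, col = col[i], lty = lty[i])
}
legend("topright", legend = paste(temps, "K"), col = col, lty = lty)
```

The hottest curve is drawn first because it has the largest intensities, so its axis range accommodates the cooler curves.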
Print a copy to a PDF file (the resulting plot can be viewed here):
dev.copy2pdf(file="blackbody.pdf") # Also "dev.copy2eps"
Now left click near one or more points and the element number of that point will be printed at the bottom, left, top or right of the point, depending on which side of it you clicked. Right click inside the axes to finish, and the element numbers of the points identified will be printed, as for locator. This is more useful if you have named points, in which case identify can print the name instead of the element number, for example:
names(x) <- LETTERS[1:length(x)]
plot(x, y)
identify(x, y, labels=names(x))   # don't forget right click to finish!
Lattice graphics
Lattice is an excellent package for visualizing multivariate data, which is essentially a port of the S software trellis display to R. While it lacks the flexibility and extensibility of ggplot2, it nevertheless represents a great set of routines for quickly displaying complex data with ease. This makes it ideal for use in exploratory data analysis; you can find out more by reading the excellent book Lattice: Multivariate Data Visualization with R by Deepayan Sarkar. As a first example of using lattice, assemble some data (from this book) on the masses (in kg) and semi-major axis lengths (in metres) of the planets, and make a dotplot of the former:
planets.mass <- c("Mercury"=0.33, "Venus"=4.87, "Earth"=5.98, "Mars"=0.64, "Jupiter"=1899, "Saturn"=569, "Uranus"=87, "Neptune"=102, "Pluto"=0.13) * 1e24
planets.semimajoraxis <- c("Mercury"=57.9, "Venus"=108, "Earth"=150, "Mars"=228, "Jupiter"=778, "Saturn"=1430, "Uranus"=2870, "Neptune"=4500, "Pluto"=5900) * 1e9
require(lattice)   # ensure package is loaded
dotplot(sort(log10(planets.mass)), xlab="log10 mass (kg)")
Now to demonstrate the multivariate capabilities, assemble the data in a data frame and create a categorical variable giant, which identifies the 4 most massive planets:
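A <- data.frame(sma=planets.semimajoraxis, mass=planets.mass)
A$name <- rownames(A)
#--The definition of "giant" is missing from these notes; one assumed version,
#  which flags the 4 most massive planets (all above 1e25 kg):
A$giant <- factor(A$mass > 1e25, labels=c("small", "giant"))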
Lattice can now separately handle the different categories, either by using groups, to use different plotting symbols etc. within the same panel, e.g.:
dotplot(reorder(name, sma) ~ log10(sma), data=A, xlab="log10 semi-major axis (m)", groups=giant, auto.key=TRUE)
...or by conditioning on a categorical variable, to plot separate panels for each dataset:
dotplot(reorder(name, sma) ~ log10(sma) | giant, data=A, xlab="log10 semi-major axis (m)", auto.key=TRUE)
You can also easily plot linear regression models (from lm) for each group category, using the type argument:
xyplot(sma ~ mass, data=A, groups=giant, scales=list(log=TRUE),
       type=c("g", "p", "r"), auto.key=list(lines=TRUE))
#
#---Other "type" arguments:
#    "g" = show gridlines
#    "p" = points
#    "l" = lines (join the dots)
#    "r" = linear regression model
#    "smooth" = locally-weighted regression using "loess"
#
Lattice offers a very quick route to visualize a set of properties conditioned on one or more factors. For example, to show boxplots of 4 different quantities in separate panels, with each panel comparing values in different categories:
file <- "http://www.sr.bham.ac.uk/~ajrs/papers/sanderson09/sanderson09_table2.txt"
a <- read.table(file, header=TRUE, sep="|")
#--This plot is actually saved as an R object "p" (for use below) and with the
#  outer "(" & ")" the result is also printed (i.e. plotted in this case,
#  since "printing" a lattice object draws the plot):
( p <- bwplot( z + kT + Z + index ~ cctype, data=a, outer=TRUE, scales="free", ylab="") )
Another excellent feature of lattice is the ability to span plots over multiple pages, using the layout argument (which is a vector specifying the required number of columns, rows & pages for the plot panels). This is great if you are plotting a large number of panels and want to dump them onto separate pages of a PDF document, say. Following on from the previous example (saved as the lattice object p):
devAskNewPage(TRUE)            # force prompt between each page
update(p, layout=c(2, 1, 2))   # 2 cols; 1 row; 2 pages
devAskNewPage(FALSE)           # restore default
You can see examples of a timeseries and dotplot created with lattice, together with the R code that produced them in the R gallery page.
While R is best known as an environment for statistical computing, it is also a great tool for numerical analysis (optimization, integration, interpolation, matrix operations, differential equations etc). Here is a flavour of the capabilities that R offers in analysing data. For a basic introduction, see the analysis section of the getting started page. For a more thorough overview of regression using R in an astronomical context, see this tutorial here.
Numerical Analysis in R
http://www.sr.bham.ac.uk/~ajrs/R/r-analyse_data.html
Parameter optimization
To find the minimum value of a function within some interval, use optimize (optimise is a valid alias):
fun <- function(x) x^2 + x - 1                 # create function
curve(fun, xlim=c(-2, 1))                      # plot function
( res <- optimize(fun, interval=c(-10, 10)) )
points(res$minimum, res$objective)             # plot point at minimum
Now, let's say you want to find the x value of the function at which the y value equals some number (1.234, say):
#--Define an auxiliary minimizing function:
fun.aux <- function(x, target) ( fun(x) - target )^2
( res <- optimize(fun.aux, interval=c(-10, 10), target=1.234) )
fun(res$minimum)   # close enough
Of course, there are 2 solutions in this case, as seen by plotting the function to be minimized:
curve(fun.aux(x, target=1.234), xlim=c(-3, 2))
points(res$minimum, res$objective)
We can get the other solution by giving a skewed search interval (see ?optimize for how the start point is determined):
res2 <- optimize(fun.aux, interval=c(-10, 100), target=1.234)   # force higher start value
points(res2$minimum, res2$objective)   # plot other minimum
#--Show target values plotted with original function:
curve(fun, xlim=c(-3, 2))
abline(h=1.234, lty=2)
abline(v=c(res$minimum, res2$minimum), lty=3)
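Because fun is a quadratic, the two solutions can be cross-checked analytically: fun(x) = target is equivalent to x^2 + x - (1 + target) = 0, which the quadratic formula solves directly. A short sketch of that check (the names disc and roots are introduced here for illustration):

```r
fun <- function(x) x^2 + x - 1
target <- 1.234

# Quadratic formula for x^2 + x - (1 + target) = 0
disc  <- sqrt(1 + 4 * (1 + target))
roots <- (-1 + c(-1, 1) * disc) / 2
roots        # one negative and one positive solution
fun(roots)   # both equal the target value
```

These values should agree with res$minimum and res2$minimum to within the tolerance used by optimize.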
For more general-purpose optimization, use nlm, optim or nlminb (which I've found to be the most robust). The CRAN task views webpage has a very thorough overview of R packages relating to optimization.
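As a minimal sketch of the multi-parameter case, the following minimizes the Rosenbrock function, a standard test problem whose minimum lies at c(1, 1), with both optim and nlminb. The choice of test function, start point and the BFGS method are assumptions made for illustration:

```r
# Rosenbrock "banana" function: minimum value 0 at p = c(1, 1)
rosen <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
start <- c(-1.2, 1)

fit.optim  <- optim(start, rosen, method = "BFGS")
fit.nlminb <- nlminb(start, rosen)

fit.optim$par      # best-fit parameters from optim
fit.nlminb$par     # best-fit parameters from nlminb
fit.nlminb$convergence   # 0 indicates successful convergence
```

Note the different argument names: optim takes (par, fn) and returns $value, while nlminb takes (start, objective) and returns $objective.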
integrate will evaluate the function over the specified range (lower to upper) by passing a vector of x values within that range to the function being integrated. Note that any other arguments to fun must also be specified, as extra arguments to integrate, and that the order of the arguments of fun does not matter, provided all arguments are supplied in this way, apart from the one being integrated over:
fun2 <- function(A, b, x) A*x^b   # "x" doesn't have to be the first argument
integrate(fun2, lower=0, upper=10, A=1, b=2)   # "A" & "b" are given explicitly
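For this particular case the result can be checked by hand, since the integral of x^2 from 0 to 10 is 10^3/3. The object returned by integrate is a list, with the estimate in $value and an error bound in $abs.error:

```r
fun2 <- function(A, b, x) A * x^b
res <- integrate(fun2, lower = 0, upper = 10, A = 1, b = 2)

res$value       # numerical estimate
10^3 / 3        # analytic result for comparison
res$abs.error   # estimate of the absolute error
```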
Now, let's say you wanted to integrate this function for a series of values of b:
bvals <- seq(0, 2, by=0.2)   # create vector of b values
fun2.int <- function(b) integrate(fun2, lower=0, upper=10, A=1, b=b)$value
fun2.int(bvals[1])   # works for a single value of b
fun2.int(bvals)      # FAILS for a vector of values of b
To make it work, you need to force vectorization of the function, so it can cycle piecewise through the elements of the vector and evaluate the function for each one:
fun2.intV <- Vectorize(fun2.int, "b")   # Vectorize "fun2.int" over "b"
fun2.intV(bvals)                        # returns a vector of values
To compute symbolically the derivative of a simple expression, use D (see ?deriv for more info):
> D(expression(sin(x)^2 - exp(x^2)), "x")   # differentiate with respect to "x"
2 * (cos(x) * sin(x)) - exp(x^2) * (2 * x)
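The expression returned by D can be evaluated at a given point with eval. As a sanity check, the sketch below compares it with a central finite-difference estimate (the evaluation point x0 = 0.5 and step h are arbitrary choices):

```r
dfdx <- D(expression(sin(x)^2 - exp(x^2)), "x")   # symbolic derivative
x0 <- 0.5
exact <- eval(dfdx, list(x = x0))                 # evaluate derivative at x0

# Central finite-difference approximation for comparison
f <- function(x) sin(x)^2 - exp(x^2)
h <- 1e-6
numeric.est <- (f(x0 + h) - f(x0 - h)) / (2 * h)

c(exact = exact, numeric = numeric.est)
```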
To solve differential equations, use the deSolve package. You can read a helpful introduction to the deSolve package in Volume 2/2 of the R journal.
install.packages("deSolve")
library("deSolve")
library(help="deSolve")   # see information on package
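As a very small sketch of what using deSolve looks like, the following solves the exponential decay equation dy/dt = -k*y with the package's ode function and compares the result with the exact solution exp(-k*t). The equation, the rate k = 0.7 and the time grid are assumptions chosen for illustration, and the code is skipped if the package is not installed:

```r
err <- NA
if (requireNamespace("deSolve", quietly = TRUE)) {
  # The derivative function must return a list whose first element is dy/dt
  decay <- function(t, y, parms) list(-parms[["k"]] * y)

  times <- seq(0, 5, by = 0.5)
  out <- deSolve::ode(y = c(y = 1), times = times, func = decay, parms = c(k = 0.7))

  # Largest deviation from the exact solution y(t) = exp(-k * t)
  err <- max(abs(out[, "y"] - exp(-0.7 * times)))
}
err
```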
Interpolating data
An example of spline interpolation:
fun <- function(x) sqrt(3) * sin(2*pi*x)   # function to generate some data
x <- seq(0, 1, length=20)
set.seed(123)                              # allow reproducible random numbers
y <- jitter(fun(x), factor=20)             # add a small amount of random noise
plot(y ~ x)                                # plot noisy data
lines(spline(x, y))                        # add splined data
Why not also add the best-fit sine curve predicted from a linear regression with lm:
lines(x, predict(lm( y ~ sin(2*pi*x))), col="red")
Note that, by default, the predicted values are evaluated at the (X) positions of the raw data. This means that you can end up with rather coarse curves, as seen above. To get round this, you need to work with functions for the splines, which can be supplied with more finely-spaced X values for plotting:
fun.spline <- splinefun(x, y)
fun.smooth <- function(xx, ...) predict(smooth.spline(x, y), x=xx, ...)$y
plot(y ~ x)
curve(fun.spline, add=TRUE)          # } "curve" uses n=101 points by default
curve(fun.smooth, add=TRUE, lty=2)   # } at which to evaluate the function
#--And add a smoother best-fit sine curve:
fun.sine <- function(X) predict(lm( y ~ sin(2*pi*x)), newdata=list(x=X))
curve(fun.sine, add=TRUE, col="red")
#--Finally, just for completeness, plot the original function:
curve(fun, add=TRUE, col="blue")
A wider range of splines is available in the package of the same name, accessed via library("splines"), including B splines, natural splines etc.
Matrix operations
t transposes a matrix; %*% is the usual (inner) matrix multiplication; diag returns the diagonal matrix and upper.tri & lower.tri return logical arrays indicating which elements belong to the upper/lower triangles. To evaluate the classic five numbers from linear least squares regression (sum(x), sum(x^2), sum(y), sum(y^2), sum(x*y)), using matrices:
N <- 10
x <- 1:N; y <- 10:19         # create some X & Y data
M <- cbind(n=1, x, y)        # combine into a matrix
M2 <- t(M) %*% M             # matrix multiplication with transposed version
res <- M2[! lower.tri(M2)]
#--length of x/y & famous five numbers:
identical(res, c(N, sum(x), sum(x^2), sum(y), sum(x*y), sum(y^2)))
Matrix crossproduct:
x <- rnorm(1e7)
system.time( a1 <- drop(crossprod(x)) )
system.time( a2 <- sum(x^2) )   # -> matrix version faster
identical(a1, a2)   # check the answer is the same
all.equal(a1, a2)   # safer check: allows for floating-point rounding differences
You can also use solve to solve a system of equations, eigen to calculate eigenvalues and eigenvectors of a matrix, as well as svd to compute the singular-value decomposition of a rectangular matrix. There is also a dedicated Matrix package, which can handle sparse and dense matrices; see library(help="Matrix") for information on it.
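A short sketch of these three functions on a small made-up matrix (the matrix A and vector b below are assumptions for illustration), with a check on each result:

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # small symmetric example matrix
b <- c(1, 2)

x <- solve(A, b)    # solve the linear system A %*% x = b
A %*% x             # reproduces b

e <- eigen(A)       # eigenvalues in e$values, eigenvectors in columns of e$vectors
sum(e$values)       # equals the trace of A

s <- svd(A)         # A = U %*% diag(d) %*% t(V)
A.rebuilt <- s$u %*% diag(s$d) %*% t(s$v)
```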
Statistical Analysis in R
Work in progress...This will be just a very brief taster of some of the many things that R can do in the way of statistical analysis, but right now consists only of a guide to do fast bootstrap resampling of regression parameter errors.
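The demonstration that follows uses a data frame A, a fitted model m and a function mystat whose definitions are not present in these notes. A plausible reconstruction, under the assumption that mystat simply refits the linear model to the resampled rows, is sketched here. The data are made up, so the numbers will not match the output quoted in the original:

```r
#--Assumed setup (not from the original notes): noisy straight-line data
set.seed(123)   # allow reproducible random numbers
N <- 20
A <- data.frame(x = 1:N)
A$y <- 0.3 + A$x + rnorm(N, sd = 1)

m <- lm(y ~ x, data = A)   # best-fit linear model

#--Statistic for "boot": refit the model to the rows picked out by "indices"
mystat <- function(data, indices) coef(lm(y ~ x, data = data[indices, ]))

mystat(A, 1:nrow(A))       # using every row once reproduces coef(m)
```

The two-argument form (data, indices) is what boot expects of its statistic function for ordinary nonparametric resampling.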
#--Demonstrate function:
> mystat(A, 1:nrow(A))   # same as "coef(m)"
(Intercept)           x
  0.3100925   0.9839554
> set.seed(123)   # allow reproducible random numbers
> mystat(A, sample(nrow(A), replace=TRUE))   # result for a single resample
(Intercept)           x
  0.6143148   0.9416338
#--Run full set of "N.boot" bootstrap resamples:
N.boot <- 500
require(boot)   # load boot library
set.seed(123)   # allow reproducible random numbers
b <- boot(A, mystat, R=N.boot)
#--Plot results of bootstrapping:
plot(b, index=1)   # intercept
plot(b, index=2)   # slope
# see "?plot.boot" for details
#--Now compare the standard errors on the model parameters from
#  the bootstrap resampling with those from the normal summary method:
> b   # print results
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = A, statistic = mystat, R = N.boot)
Bootstrap Statistics :
     original         bias
t1* 0.3100925  0.0139150827
t2* 0.9839554 -0.0007816746
## NB the bias is the difference between the mean of the N.boot
## resample parameter values and the original best-fit model
## parameter values, i.e. "apply(b$t, 2, mean) - coef(m)"
#--Now show the standard errors computed (see "?summary.lm"):
> coef(summary(m))
             Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 0.3100925 0.46199915  0.6711972 5.106173e-01
x           0.9839554 0.03856694 25.5129205 1.389147e-15
#--Or, quantify the difference by comparing the ratios
#  (written as "sd(b$t)" in the original; current versions of R need "apply"
#  to get one standard deviation per column):
> apply(b$t, 2, sd) / coef(summary(m))[, 2]   # close to 1 in each case
(Intercept)           x
  0.9686099   1.0318067
A quicker version of the above example, using the method described in Section 4.3.1 from Chambers & Hastie, 1992 (see References section in ?lm for book details). This method exploits the fact that lm does a certain amount of initial processing prior to the actual regression (using lm.fit), and this represents a substantial overhead that need only be performed once (and need not be replicated during each bootstrap resampling iteration).
# following Section 4.3.1 from Chambers & Hastie (1992):
m <- lm(y ~ x, A, x=TRUE, y=TRUE)   # "x=y=TRUE" returns extra data for lm.fit
mystat.fast <- function(dummy, i, model) coef(lm.fit(model$x[i, ], model$y[i]))
require(boot)
set.seed(123)   # allow reproducible random numbers
system.time(slow <- boot(A, mystat, 1e4))
set.seed(123)
system.time(fast <- boot(A, mystat.fast, 1e4, model=m))   # roughly 10x faster
#--Check results are identical:
slow
fast
#--Formal check if two objects are identical:
#  the only differences reported are in components 6 & 8, which are due to the
#  different mystat function & name (i.e. "mystat" vs. "mystat.fast"), which is
#  stored in the object returned by "boot":
identical(slow, fast)   # not completely identical
all.equal(slow, fast)   # only differences due to different mystat functions
Note that there is no point having a very large number of bootstrap samples compared to the number of fitted values (i.e. the number of rows in the data frame), since the latter ultimately becomes the limiting factor in the accuracy of the recovered parameter error estimates.
An example using non-linear regression (nls)
#--Create some non-linear data:
N <- 20
set.seed(123)
#--Create some data:
B <- data.frame(x=1:N, y=(4 * log10(1:N)) + rnorm(N, mean=2, sd=0.2))
plot(y ~ x, B)   # plot the data
#--Fit the non-linear model
m <- nls(y ~ a * log10(x) + b, data=B, start=list(a=1, b=1))
lines(B$x, fitted(m), lty=2)   # plot best-fit model values as a dashed line
#--A better way of plotting the best-fit model (as a smooth curve):
curve(predict(m, newdata=data.frame(x=x)), add=TRUE)
summary(m)   # summarise the best-fit parameters and their errors etc.
#--There is no equivalent possible for the fast version of "mystat"
#  for nls, so set up the basic function to calculate bootstrapped fit
#  (using the data frame passed in as "A", rather than the global "B"):
mystat <- function(A, indices) {
  m <- nls(y ~ a * log10(x) + b, data=A[indices, ], start=list(a=1, b=1))
  return(coef(m))
}
#--Run full set of "N.boot" bootstrap resamples:
N.boot <- 500
set.seed(123)
require(boot)
b <- boot(B, mystat, R=N.boot)
#--Plot results of bootstrapping:
#  (NB note significant non-normal distribution of values;
#  i.e. right panel quantile-quantile plot values deviate from a
#--Now compare the standard errors on the model parameters from
#  the bootstrap resampling with those from the normal summary method:
> b   # print results of bootstrap
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = B, statistic = mystat, R = N.boot)
Bootstrap Statistics :
    original        bias
t1* 3.986804 -0.02582234
t2* 2.040456  0.02679389
> summary(m)   # print standard errors (see "?summary.nls")
Formula: y ~ a * log10(x) + b
Parameters:
  Estimate Std. Error t value Pr(>|t|)
a   3.9868     0.1299   30.70  < 2e-16 ***
b   2.0405     0.1275   16.01 4.33e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1998 on 18 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 1.234e-07
#--Or, quantify the difference by comparing the ratios
#  (written as "sd(b$t)" in the original; current versions of R need "apply"
#  to get one standard deviation per column):
> apply(b$t, 2, sd) / coef(summary(m))[, 2]   # close to 1 in each case
       a        b
1.048229 1.045583
For further information, you can find out more about how to access, manipulate, summarise, plot and analyse data using R.