
Starting R from the console: md c:\trabajo -> cd c:\trabajo -> R. To quit: q(). Help: help() or help.start().
TAB autocompletion works.

Or use Save -> destination path and file name.

Installing packages:
chooseCRANmirror("")   # Spain (Madrid) => 59

install.packages(pkgs, lib, repos = getOption("repos"),
                 contriburl = contrib.url(repos, type),
                 method, available = NULL, destdir = NULL,
                 dependencies = NA, type = getOption("pkgType"),
                 configure.args = getOption("configure.args"),
                 configure.vars = getOption("configure.vars"),
                 clean = FALSE, Ncpus = getOption("Ncpus", 1L),
                 libs_only = FALSE, INSTALL_opts, ...)

Important ones:
install.packages("tree"); install.packages("e1071"); install.packages("datasets")
install.packages("class"); install.packages("igraph"); install.packages("MVA")
install.packages("MASS"); install.packages("arules"); install.packages("boost")
install.packages("stats"); install.packages("caTools"); install.packages("Matrix")
install.packages("lattice")
install.packages(c("pkg1", "pkg2"))              # several at once
install.packages("Rcmdr", dependencies = TRUE)   # with dependencies

On Windows you can also use the menu: Packages -> Install package(s) -> select and OK. A successful install reports something like:
downloaded 106 Kb. package 'tree' successfully unpacked and MD5 sums checked

Important packages and their help pages:
help("tree")      ==> classification or regression trees
help("datasets")  ==> example data frames for the exercises: data(iris); ls(); iris[1:5,]
The e1071 package includes svm, naiveBayes, etc.: install.packages("e1071")

Packages for machine learning:
For classification: tree in tree, svm in e1071, knn in class, lda in MASS, adaboost in boost
For clustering: kmeans in stats
Other useful packages: caTools, kernlab, mlbench, cluster

REMEMBER TO LOAD THE PACKAGES YOU ARE GOING TO USE. This can be done from code with library(packagename), e.g. library("e1071").
demo("graphics") Ejemplo de rutinas de dibujo del sistema. ?plot() ayuda del comando plot

Commands are separated by a semicolon (;) or by a newline. If a command is not syntactically complete at the end of a line, R shows a continuation prompt, e.g. +. Executing commands from a file and redirecting the output: source("ordenes.R") runs the commands stored in that file; sink("resultado.lis") sends all subsequent output to the operating-system file resultado.lis in the working directory instead of to the screen. The command > sink() returns the output to the screen. If you use absolute file names instead of relative ones, the results are stored in those files regardless of the working directory.
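A minimal sketch of the source/sink workflow described above (it assumes a script file ordenes.R exists in the working directory):
sink("resultado.lis")   # divert subsequent output to a file
source("ordenes.R")     # run the commands stored in the file
print(summary(1:10))    # this output goes to resultado.lis, not the screen
sink()                  # restore output to the screen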

objects() can be used to obtain the names of the objects stored in R; this function is equivalent to ls(). To delete objects use rm, for example: > rm(x, y, z, tinta, chatarra, temporal, barra). To display an object, just evaluate its name, e.g.: > x <- 3; x ==> [1] 3 (a vector of one element). The objects created during an R session can be stored in a file for later use; when you end the session, R asks whether you want to do so, and if you answer yes, all objects are saved in the file `.RData' in the working directory. It is advisable to use a different working directory for each problem you analyse with R, since it is very common to create objects named x and y, for example. The simplest structure is the vector, an ordered collection of numbers. To create a vector x consisting of five numbers, say 10.4, 5.6, 3.1, 6.4 and 21.7, use the command > x <- c(10.4, 5.6, 3.1, 6.4, 21.7) or assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7)); c(10.4, 5.6, 3.1, 6.4, 21.7) -> x would also work.

If an expression is used as a command by itself, its value is printed and then lost. Thus the command > 1/x simply prints the reciprocals of the five values on the screen (the value of x is, of course, not modified). If you then make the assignment > y <- c(x, 0, x), you create a vector y with 11 elements, consisting of two copies of x with a zero between them.

v <- 2*x + y + 1 generates a new vector v of length 11, built by adding, element by element, the vector 2*x repeated 2.2 times, the vector y, and the number 1 repeated 11 times. The elementary arithmetic operators are the usual +, -, *, / and ^ for raising to a power. The well-known functions log, exp, sin, cos, tan and sqrt are also available. There are many more functions, among them: max and min, which select the largest and smallest element of a vector; range, whose value is the length-two vector c(min(x), max(x)); length(x), the number of elements of x; sum(x), the sum of all the elements of x; and prod(x), their product. Two statistical functions are mean(x), which computes the mean, i.e. sum(x)/length(x), and var(x), which computes the sample variance, i.e. sum((x-mean(x))^2)/(length(x)-1). To sort a vector you have the function sort(x), which returns a vector of the same size as x with the elements in increasing order; there are also order() and sort.list(), which produce the permutation of the vector that corresponds to the ordering. Note that max and min select the largest and smallest value of their arguments, even when these are several vectors. The parallel functions pmax and pmin return a vector (of the same length as the longest argument) containing in each position the largest and smallest element at that position among all the input vectors. To work with complex numbers you must give the complex part explicitly: sqrt(-17) returns NaN and a warning, but sqrt(-17+0i) correctly computes the square root of this complex number.
Sequences: y <- c(1:10) generates the elements 1,2,3,4,5,6,7,8,9,10, as do y <- seq(1,10) and y <- c(seq(1,10)). z <- seq(-5, 5, by=.2) generates the vector c(-5.0, -4.8, -4.6, ..., 4.6, 4.8, 5.0) and stores it in z.
muestra <- sample(1:150, 120)   # draws 120 random values from 1 to 150 (without replacement)
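For example, the parallel maximum and the complex square root mentioned above:
pmax(c(1, 5, 3), c(4, 2, 6))   # [1] 4 5 6 - element-wise maximum
sqrt(-17)                      # NaN, with a warning
sqrt(-17+0i)                   # [1] 0+4.123106i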

Booleans: > temp <- x > 13; temp <- 5==5; temp ==> [1] TRUE (true=1, false=0). The logical operators are < (less than), <= (less than or equal), > (greater than), >= (greater than or equal), == (equal) and != (not equal). In addition, if c1 and c2 are logical expressions, then c1&c2 is their intersection ("conjunction"), c1|c2 is their union ("disjunction") and !c1 is the negation of c1.
Strings: a <- "HOLA"; a ==> [1] "HOLA"

The paste() function joins all the character vectors supplied to it into a single character string: > labs <- paste(c("X","Y"), 1:10, sep="") stores in labs the character vector c("X1", "Y2", "X3", "Y4", "X5", "Y6", "X7", "Y8", "X9", "Y10"). Note that since c("X", "Y") has only two elements, it must be repeated 5 times to reach the length of the vector 1:10.

A character vector. This option can only be used if the vector has the names attribute to identify its components, in which case it behaves like point 2 above. > fruta <- c(5, 10, 1, 20); > names(fruta) <- c("naranja", "platano", "manzana", "pera"); > postre <- fruta[c("manzana","naranja")]. The advantage here is that names are often easier to remember than numeric indices. This option is especially useful when dealing with the "data frame" structure, described later.

> alfa <- alfa[2 * 1:5] turns it into an object of length 5 made up of the even-positioned elements of the original object.

Getting Started with R
See also the introduction to R here. There's also an R wiki (http://rwiki.sciviews.org/doku.php) with lots of useful information, tips etc.
The online help pages can be accessed as follows:
help.start()             # Load HTML help pages into browser
help(package)            # List help page for "package"
?package                 # Shortform for "help(package)"
help.search("keyword")   # Search help pages for "keyword"
?help                    # For more options
help(package=base)       # List tasks in package "base"
q()                      # Quit R

Some extra packages are not loaded by default, but can be loaded as follows:
library("package") library() detach("package:pkg") library(help="package") # List available packages to load # Unload the loaded package "pkg" # list package contents

Commands can be separated by a semicolon (;) or a newline. All characters after # are treated as comments (even on the same line as commands). Variables are assigned using <- (although = also works):
x <- 4.5
y <- 2*x^2   # Use "<<-" for global assignment

Vectors
x <- c(1, 4, 5, 8)   # Concatenate values & assign to x
y <- 2*x^2           # Evaluate expression for each element in x
> y                  # Typing the name of a variable evaluates it
[1]   2  32  50 128

If x is a vector, y will be a vector of the same length. The vector arithmetic in R means that an expression involving vectors of the same length (e.g. x+y, in the above example) will be evaluated piecewise for corresponding elements, without the need to loop explicitly over the values:
> x+y
[1]   3  36  55 136
> 1:10         # Same as seq(1, 10)
 [1]  1  2  3  4  5  6  7  8  9 10
> a <- 1:10
a[-1]          # Print whole vector *except* the first element
a[-c(1, 5)]    # Print whole vector except elements 1 & 5
a[1:5]         # Print elements 1 to 5
a[length(a)]   # Print last element, for any length of vector
head(a, n=3)   # Print the first n elements of a
tail(a, n=3)   # Print the last n elements of a

Demonstrations of certain topics are available:


demo(graphics)   # Run graphics demonstration
demo()           # List topics for which demos are available
?demo            # List help for "demo" function

Typing the name of a function lists its contents; typing a variable evaluates it, or you can use print(x).

A great feature of R functions is that they offer a lot of control to the user, but provide sensible hidden defaults. A good example is the histogram function, which works very well without any explicit control: First, generate a set of 10,000 Gaussian distributed random numbers:
data <- rnorm(1e4)   # Gaussian distributed numbers with mean=0 & sigma=1
hist(data)           # Plots a histogram, with sensible bin choices by default

See ?hist for the full range of control parameters available, e.g. to specify 7 bins:
hist(data, breaks=7)

Data Input/Output
See also here. Assuming you have a file (file.dat) containing the following data, with either spaces or tabs between fields:
r   x     y
1   4.2   14.2
2   2.4   64.8
3   8.76  63.4
4   5.9   32.2
5   3.4   89.6

Now read this file into R:


> inp <- read.table("file.dat", header=T) # Read data into data frame > inp # Print contents of inp > inp[0] # Print the type of object that inp is: NULL data frame with 5 rows > colnames(inp) # Print the column names for inp: [1] "r" "x" "y" # This is simply a vector that can easily be changed: > colnames(inp) <- c("radius", "temperature", "time") > colnames(inp) [1] "radius" "temperature" "time"

Note that you can refer to R objects, names etc. using the first few letters of the full name, provided that is unambiguous, e.g.:
> inp$r # Print the Radius column

but note what happens if the information is ambiguous or if the column doesn't exist:
> inp$t        # Could match "inp$temperature" or "inp$time"
NULL
> inp$wibble   # "wibble" column doesn't exist
NULL

Alternatively, you can refer to the columns by number:


> inp[[1]]   # Print data as a vector (use "[[" & "]]")
[1] 1 2 3 4 5
> inp[1]     # Print data as a data.frame (only use "[" and "]")

Writing data out to a file (if no filename is specified, which is the default, the output is written to the console):
> write.table(inp, quote=F, row.names=F, col.names=T)
radius temperature time
1 4.2 14.2
2 2.4 64.8
3 8.76 63.4
4 5.9 32.2
5 3.4 89.6

By default the columns are separated by whitespace, but you can change this with the sep= option (see ?write.table for details), e.g. use a : with a tab either side:
> write.table(inp, quote=F, row.names=F, col.names=T, sep="\t:\t")
radius  :  temperature  :  time
1       :  4.2          :  14.2
2       :  2.4          :  64.8
3       :  8.76         :  63.4
4       :  5.9          :  32.2
5       :  3.4          :  89.6

Other types of connection


?connections    # Help page on opening/closing connections
write.table()   # Write data to file
# You can also refer to URLs for files, e.g.
a <- read.table("http://www.sr.bham.ac.uk/~ajrs/R/datasets/file.dat", header=T)

Loading R commands from a file


source("commands.R") # Or, to load from a path specified by an environment variable # "$ENV_VAR/commands.R" source(paste(Sys.getenv("ENV_VAR"),"/commands.R",sep=""))

Saving data
?save
save(inp, t, file="data.RData")
# read back in the saved data:
load(file="data.RData")
# - this will load the saved R objects into the current session, with the
#   same properties, i.e. all the variables will have the same names and contents

Note that when you quit R (by typing q()), it asks if you want to save the workspace image; if you specify yes (y), it writes out two files to the current directory, called .RData and .Rhistory. The former contains the contents of the saved session (i.e. all the R objects in memory) and the latter is a list of all the command history (it's just a simple text file you can look at). At any time you can save the history of commands using:
savehistory(file="my.Rhistory") loadhistory(file="my.Rhistory") session # load history file into current R

Plotting data
See also here. Useful functions:
?plot        # Help page for plot command
?par         # Help page for graphics parameter control
?Devices     # or "?device" (R can output in postscript, PDF, bitmap, PNG, JPEG and more formats)
dev.list()   # list graphics devices
colours()    # or "colors()"; list all available colours
?plotmath    # Help page for writing maths expressions in R

Tip: to generate png, jpeg etc. files, I find it's best to create pdf versions in R, then convert them in gimp (having specified strong antialiasing, if asked, when loading the pdf file into gimp), otherwise the figures are "blocky" (i.e. suffer from aliasing). To create an output file copy of a plot for printing or including in a document etc.:
dev.copy2pdf(file="myplot.pdf") device dev.copy2eps(file="myplot.eps") dev.copy() # } Copy contents of current graphics # } to a PDF or Postscript file

Adding more datasets to a plot:


x <- 1:10; y <- x^2
z <- 0.9*x^2
plot(x, y)                      # Plot original data
points(x, z, pch="+")           # Add new data, with different symbols
lines(x, y)                     # Add a solid line for original data
lines(x, z, col="red", lty=2)   # Add a red dashed line for new data
curve(1.1*x^2, add=T, lty=3, col="blue")   # Plot a function as a curve
text(2, 60, "An annotation")    # Write text on plot
abline(h=50)                    # Add a horizontal line
abline(v=3, col="orange")       # Add a vertical line

Error bars (click here for source code)


source("http://www.sr.bham.ac.uk/~ajrs/R/scripts/errorbar.R") code # Create some data to plot: x <- seq(1,10); y <- x^2 xlo <- 0.9*x; xup <- 1.08*x; ylo <- 0.85*y; yup <- 1.2*y # source

Plot graph, but don't show the points (type="n") & plot errors:
plot(x, y, log="xy", type="n") errorbar(x, xlo, xup, y, ylo, yup)

?plot gives more info on plotting options (as does ?par). To use different line colours, styles, thicknesses & errorbar types for different points:
errorbar(x, xlo, xup, y, ylo, yup, col=rainbow(length(x)), lty=1:5, type=c("b", "d", "c", "x"), lwd=1:3)

Plotting a function or equation


curve(x^2, from=1, to=100)             # Plot a function over a given range
curve(x^2, from=1, to=100, log="xy")   # With log-log axes

Plot chi-squared probability vs. reduced chi-squared for 10 and then 100 degrees of freedom
dof <- 10
curve(pchisq(x*dof, df=dof), from=0.1, to=3,
      xlab="Reduced chi-squared", ylab="Chi-sq distribution function")
abline(v=1, lty=2, col="blue")   # Plot dashed line for reduced chisq=1
dof <- 100
curve(pchisq(x*dof, df=dof), from=0.1, to=3,
      xlab="Reduced chi-squared", ylab="Chi-sq distribution function")
abline(v=1, lty=2, col="blue")   # Plot dashed line for reduced chisq=1

Define & plot custom function:


myfun <- function(x) x^3 + log(x)*sin(x)
#--Plot dotted (lty=3) red line, with 3x normal line width (lwd=3):
curve(myfun, from=1, to=10, lty=3, col="red", lwd=3)
#--Add a legend, inset slightly from top left of plot:
legend("topleft", inset=0.05, "My function", lty=3, lwd=3, col="red")

Plotting maths symbols:


demo(plotmath)   # Shows a demonstration of many examples
?plotmath        # Manual page with more information

Analysis
Basic stuff
x <- rnorm(100)    # Create 100 Gaussian-distributed random numbers
hist(x)            # Plot histogram of x
summary(x)         # Some basic statistical info
d <- density(x)    # Compute kernel density estimate
d                  # Print info on density analysis ("?density" for details)
plot(d)            # Plot density curve for x
plot(density(x))   # Plot density curve directly

Plot histogram in terms of probability & also overlay density plot:


hist(x, probability=T)
lines(density(x), col="red")   # overlay smoothed density curve
#--Standard stuff:
mean(x); median(x); max(x); min(x); sum(x); weighted.mean(x)
sd(x)    # Standard deviation
var(x)   # Variance
mad(x)   # median absolute deviation (robust sd)

Testing the robustness of statistical estimators


?replicate # Repeatedly evaluate an expression

Calculate standard deviation of 100 Gaussian-distributed random numbers (NB default sd is 1; can specify with rnorm(100, sd=3) or whatever)
sd(rnorm(100))

Now, let's run N=5 Monte Carlo simulations of the above test:

replicate(5, sd(rnorm(100)))

and then compute the mean of the N=5 values:


mean(replicate(5, sd(rnorm(100))))

Obviously 5 is rather small for a number of MC sims (repeat the above command & see how the answer fluctuates). Let's try 10,000 instead:
mean(replicate(1e4, sd(rnorm(100))))

With a sample size of 100, the measured sd closely matches the actual value used to create the random data (sd=1). However, see what happens when the sample size decreases:
mean(replicate(1e4, sd(rnorm(10))))   # almost a 3% bias low
mean(replicate(1e4, sd(rnorm(3))))    # >10% bias low

You can see the bias more clearly with a histogram or density plot:
a <- replicate(1e4, sd(rnorm(3)))
hist(a, breaks=100, probability=T)   # Specify lots of bins
lines(density(a), col="red")         # Overlay kernel density estimate
abline(v=1, col="blue", lwd=3)       # Mark true value with thick blue line

This bias is important to consider when estimating the velocity dispersion of a stellar or galaxy system with a small number of tracers. The bias is even worse for the robust sd estimator mad():
mean(replicate(1e4, mad(rnorm(10))))   # ~9% bias low
mean(replicate(1e4, mad(rnorm(3))))    # >30% bias low

However, in the presence of outliers, mad() is robust compared to sd() as seen with the following demonstration. First, create a function to add a 5 sigma outlier to a Gaussian random distribution:
outlier <- function(N, mean=0, sd=1) {
  a <- rnorm(N, mean=mean, sd=sd)
  a[1] <- mean + 5*sd   # Replace 1st value with 5 sigma outlier
  return(a)
}

Now compare the performance of mad() vs. sd():


mean(replicate(1e4, mad(outlier(10)))) mean(replicate(1e4, sd(outlier(10))))

You can also compare the median vs. mean:


mean(replicate(1e4, median(outlier(10)))) mean(replicate(1e4, mean(outlier(10))))

...and without the outlier:


mean(replicate(1e4, median(rnorm(10)))) mean(replicate(1e4, mean(rnorm(10))))

Kolmogorov-Smirnov test, taken from manual page (?ks.test)


x <- rnorm(50)   # 50 Gaussian random numbers
y <- runif(30)   # 30 uniform random numbers
# Do x and y come from the same distribution?

ks.test(x, y)

Correlation tests
Load in some data (see section on data frames, below):
dfile <- "http://www.sr.bham.ac.uk/~ajrs/papers/sanderson06/table1.txt" A <- read.table(dfile, header=T, sep="|") cor.test(A$z, A$Tx) # Specify X & Y data separately cor.test(~ z + Tx, data=A) # Alternative format when using a data frame ("A") cor.test(A$z, A$Tx, method="spearman") # } Alternative methods cor.test(A$z, A$Tx, method="kendall") # }

Spline interpolation of data


x <- 1:20
y <- jitter(x^2, factor=20)   # Add some noise to vector
sf <- splinefun(x, y)         # Perform cubic spline interpolation

# - note that "sf" is a function: sf(1.5) # Evaluate spline function at x=1.5 plot(x,y); curve(sf(x), add=T) # Plot data points & spline curve

Scatter plot smoothing


#--Using above data frame ("A"): plot(Tx ~ z, data=A) #--Return (X & Y values of locally-weighted polynomial regression lowess(A$z, A$Tx) #--Plot smoothed data as line on graph: lines(lowess(A$z, A$Tx), col="red")

Numerical integration of a function (see Functions section below)


fun <- function(x, norm, index) norm*x^index   # Create simple function
# See "?integrate" for details. Note that:
# 1) the names of arguments in "fun" must not match those of arguments
#    in "integrate()" itself.
# 2) fun must be able to return a vector, if supplied with a vector
#    (i.e. not just a single value)
i <- integrate(fun, lower=1, upper=10, norm=0.5, index=2)
> i
166.5 with absolute error < 1.8e-12
> i[0]       # Note that integrate() returns a list
list()
> names(i)   # Show named components in list
[1] "value"        "abs.error"    "subdivisions" "message"      "call"
> i$value    # If you want the integral value alone
[1] 166.5

Regression (see also the tutorial at Penn State's Center for Astrostatistics here, which is more thorough.)
Load in some data (see section on data frames, below; this dataset is described in more detail in the section on factors, below):
dfile <- "http://www.sr.bham.ac.uk/~ajrs/papers/sanderson06/table1.txt" A <- read.table(dfile, header=T, sep="|") names(A) # Print column names in data frame m <- lm(Tx ~ z, data=A) # Fit linear model summary(m) # Show best-fit parameters & errors

Specify 4 plots on same page, in 2x2 configuration:


layout(matrix(1:4, nrow=2, ncol=2))
layout.show(4)   # Show plot layout for 4 plots
plot(m)          # Shows useful diagnostics of the fit (4 plots)

Plot data & best-fit model:


plot(Tx ~ z, A)
abline(m, col="blue")   # add best-fit straight line model

Plot residuals:
plot(A$z, resid(m))

Something fancier: show data, model & residuals in 2 panels:


layout(matrix(c(1,1,2)), heights=c(1,1))   # Specify layout for panels
layout.show(2)                             # Show plot layout for 2 plots
plot(Tx ~ z, data=A)
abline(m, col="blue")
plot(A$z, resid(m), main="Residuals", col="blue", pch=20)

Nonlinear least-squares estimates of the parameters of a nonlinear model:


?nls
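A minimal hedged sketch of nls in action (the data, the model form y ~ a*x^b and the starting values are made up for illustration; nls itself is base R):
set.seed(1)
x <- 1:20
y <- 2.5*x^1.3 + rnorm(20)   # fake data from a known power law, plus noise
fit <- nls(y ~ a*x^b, start=list(a=1, b=1))   # starting guesses are required
summary(fit)   # best-fit parameters & errors (should be close to a=2.5, b=1.3)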

Since R was developed with experimental (rather than observational) sciences in mind, it is not geared up to handle errors in both X and Y directions. Therefore the existing algorithms all perform regression in one direction only. However, unweighted orthogonal regression is equivalent to principal components analysis.
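That equivalence can be sketched with base R's prcomp (a hedged illustration, not from the original text; it assumes the data frame A with columns z and Tx loaded above):
d <- na.omit(A[c("z", "Tx")])   # complete cases of the two variables
p <- prcomp(d)                  # PCA; centres the data by default
slope <- p$rotation["Tx", "PC1"] / p$rotation["z", "PC1"]   # 1st principal axis
int <- mean(d$Tx) - slope*mean(d$z)   # the line passes through the centroid
plot(Tx ~ z, data=d)
abline(int, slope, lty=3)       # orthogonal (total least-squares) line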

Data Frames
See also here. Create a data frame:
A <- data.frame(a=1:3, b=4:6, c=7:9)
colnames(A)   # List column names for A
rownames(A)   # List row names (will just be numbers if not specified)

Accessing data frames


A$a A[c("a", "c")] A[2] A[[2]] A[,2] A[2,] A[1, ] A[2:3, ] A[, -1] A[0] A[[1]][0] A[A$a%%2 == 1, ] A[A$b<6 & A$c>7, ] constraints A[A$b<6 | A$c>7, ] A$a <- NULL # # # # # # # # # # # # # Print the column named "a" Print the columns listed in the square brackets Print column 2 of A (type=data frame) Print column 2 of A (type=vector) Print column 2 of A (type=vector) Print row 2 of A Print the 1st row of data frame A Print rows 2 to 3 Print everything except the 1st column of A Print type of object (data frame) (numeric, i.e. vector) Print rows of A where column "a" has an odd number Print rows of A according to a combination of

# ("&" = logical AND; "|" = logical OR) # Delete a column from a data frame

To retrieve a named column, where the column name is itself a variable:


name <- "b" A[name] A[[name]] # Produces a data frame # Produces a vector

Manipulating data frames


B <- data.frame(d=11:13, e=14:16, f=17:19)
A + 1   # Perform arithmetic on each element of A
A * B   # Operates on each element, if A & B have same dimensions
A[order(A$a, decreasing=T), ]   # Sort A according to a specified column
t(A)    # Transpose A (swap rows for columns & vice versa)
# NB the type of t(A) is no longer a data frame, but you can make it so:
as.data.frame(t(A))

To join data frame B to A (side by side), you could use:


A <- data.frame(A, B) # Can also use "cbind(A, B)"

However, in a situation where this operation might be occurring iteratively, you could end up with multiple, repeated instances of B:
A <- data.frame(A, B)
# To recover the original data, use:
A <- A[, 1:3]

A better way might be to use:


A[colnames(B)] <- B
# - this will replace any existing named columns from B that are already
#   present in A

To append one data frame below another (instead of side-by-side), use:


rbind(A, A, A)

Note that if column names are specified, they must match in all data.frames for rbind to work:
rbind(A, B)   # Fails!
colnames(B) <- colnames(A)   # Use same column names for B as for A
rbind(A, B)   # Works

Working with data frames
You can use attach(A) and detach(A) to add/remove a data frame to/from the current search path, allowing you to use the column names as if they were standalone vectors. However, this approach is usually best avoided, since it can cause confusion if any objects exist with the same name as a column of an attached data frame. It's better to use with to achieve the same effect within a controlled setting. The with function enables you to work with column names without having to prefix the data frame name, as follows:
A <- data.frame(a=1:20, b=rnorm(20)) with(A, a^2 + 2*b)

This principle is used in the transform function, to allow you to construct new columns in the data frame using references to the other column names:

transform(A, c=a^2 + 2*b)   # Add new column to data frame

Similarly, the subset function works in the same way to enable easy filtering of data frames:
subset(A, b >= 0 & a%%2 == 0)   # Returns rows with positive b & even numbered a

You can also easily perform database join operations, using merge:
B <- data.frame(a=1:7, x=runif(7))
merge(A, B)   # Return rows with same "a", combining unique columns from A & B

Functions
See also here.
Basics
cat            # Type function name without brackets to list contents
args(cat)      # Return arguments of any function
body(cat)      # Return main body of function
formals(fun)   # Get or set the formal arguments of a function
debug(fun); undebug(fun)   # Set or unset the debugging flag on a function

Create your own function


> fun <- function(x, a, b, c) (a*x^2) + (b*x^2) + c
> fun(3, 1, 2, 3)
[1] 30
> fun(5, 1, 2, 3)
[1] 78

A more complicated example of a function. First, create some data:


set.seed(123)   # allow reproducible random numbers
a <- rnorm(1000, mean=10, sd=1)    # 1000 Gaussian random numbers
b <- rnorm(100, mean=50, sd=15)    # smaller population of higher numbers
x <- c(a, b)    # Combine datasets
hist(x)         # Shows outlier population clearly
sd(x)           # Strongly biased by outliers
mad(x)          # Robustly estimates sd of main sample
mean(x)         # biased
median(x)       # robust

Now create a simple sigma clipping function (you can paste the following straight into R):
sigma.clip <- function(vec, nclip=3, N.max=5) {
  mean <- mean(vec); sigma <- sd(vec)
  clip.lo <- mean - (nclip*sigma)
  clip.up <- mean + (nclip*sigma)
  vec <- vec[vec < clip.up & vec > clip.lo]   # Remove outliers
  if ( N.max > 0 ) {
    N.max <- N.max - 1
    # Note the use of recursion here (i.e. the function calls itself):
    vec <- Recall(vec, nclip=nclip, N.max=N.max)
  }
  return(vec)
}

Now apply the function to the test dataset:

new <- sigma.clip(x)
hist(new)     # Only the main population remains!
length(new)   # } Compare numbers of
length(x)     # } values before & after clipping

Factors
See also here. Factors are a vector type which treats the values as character strings which form part of a base set of values (called "levels"). The result of plotting a factor is to produce a frequency histogram of the number of entries in each level (see ?factor for more info).
Basics
a <- rpois(100, 3)   # Create 100 Poisson distributed random numbers (lambda=3)
hist(a)              # Plot frequency histogram of data
a[0]                 # List type of vector ("numeric")
b <- as.factor(a)    # Change type to factor

List type of vector b ("factor"): also lists the "levels", i.e. all the different types of value:
b[0]

Now, plot b, to get a barchart showing the frequency of entries in each level:
plot(b)
table(b)   # a tabular summary of the same information

A more practical example. Read in table of data (from Sanderson et al. 2006), which refers to a sample of 20 clusters of galaxies with Chandra X-ray data:
file <- "http://www.sr.bham.ac.uk/~ajrs/papers/sanderson06/table1.txt" A <- read.table(file, header=T, sep="|")

By default, read.table() will treat non-numeric values as factors. Sometimes this is annoying, in which case use as.is=T; you can specify the type of all the input columns explicitly -- see ?read.table.
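For example (a hedged sketch; the column types passed to colClasses are illustrative for a three-column file):
A <- read.table(file, header=T, sep="|", as.is=TRUE)   # keep strings as character
# or declare every column type explicitly:
# read.table(file, header=T, colClasses=c("character", "numeric", "integer"))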
A[1,]            # Print 1st row of data frame (will also show column names)
plot(A$det)      # Plot the numbers of each detector type
plot(A$cctype)   # Plot the numbers of cool-core and non-cool core clusters

The det column is the detector used for the observation of each cluster: S for ACIS-S and I for ACIS-I. The cctype denotes the cool core type: CC for cool core, NCC for non-cool core. Now, plot one factor against another, to get a plot showing the number of overlaps between the two categories (the cool-core clusters are almost all observed with ACIS-S):
plot(cctype ~ det, data=A, xlab="ACIS Detector", ylab="Cool-core type")

You can easily explore the properties of the different categories by plotting a factor against a numeric variable. The result is to show a boxplot (see ?boxplot) of the numeric variable for each category. For example, compare the mean temperature of cool-core & non-cool core clusters:
plot(kT ~ cctype, data=A)

and compare the redshifts of clusters observed with ACIS-S & ACIS-I:
plot(z ~ det, data=A)

You can define your own factor categories by partitioning a numeric vector of data into discrete bins using cut(), as shown here.
Converting a numeric vector into factors
Split cluster redshift range into 2 bins of equal width (nearer & further clusters), and assign each z point accordingly; create a new column of the data frame with the bin assignments (A$z.bin):
A$z.bin <- cut(A$z, 2)
A$z.bin         # Show levels & bin assignments
plot(A$z.bin)   # Show bar chart

Plot mean temperatures of nearer & further clusters. Since this is essentially a flux-limited sample of galaxy clusters, the more distant clusters (higher redshift) are hotter (and hence more X-ray luminous - an example of the common selection effect known as Malmquist bias):
plot(Tx ~ z.bin, data=A)
plot(Tx ~ z, data=A)   # More clearly seen!

Check if redshifts of (non) cool-core clusters differ:


plot(cctype ~ z.bin, data=A) # Equally mixed

Now, let's say you wanted to colour-code the different CC types. First, create a vector of colours by copying the cctype column (you can add this as a column to the data frame, or keep it as a separate vector):
colour <- A$cctype

Now all you need to do is change the names of the levels:


levels(colour)   # Show existing levels
levels(colour) <- c("blue", "red")   # specify blue=CC, red=NCC
colour[0]        # Show data type (& levels, since it's a factor)

Now change the type of the vector to character (i.e. not a factor):
col <- as.character(colour)
col[0]   # Confirm change of data type
col      # } Print values and
colour   # } note the difference

Now plot mean temperature vs. redshift, colour coded by CC type:


plot(Tx ~ z, A, col=col)

Why not add a plot legend:


legend(x="topleft", legend=c(levels(A$cctype)), text.col=levels(colour))

Now for something a little bit fancier:


plot(Tx ~ z, A, col=col, xlab="Redshift", ylab="Temperature (keV)")
legend(x="topleft", c(levels(A$cctype)), col=levels(colour),
       text.col=levels(colour), inset=0.02, pch=1)

http://www.sr.bham.ac.uk/~ajrs/R/r-access_data.html

Data structures in R
All R objects have a type or mode, as well as a class, which can be determined with typeof, mode & class.
Vector
Vectors are the basic structure and come in the following atomic modes (data types): numeric, integer, character, logical, complex, raw. These modes have corresponding functions which test if an object is of that mode (is, e.g. is.numeric) and convert an object to that mode (as, e.g. as.character). You can assemble and combine vectors using the often-used function c. Note that vectors must consist of values of the same data type:
c(1, "a", TRUE) list(1, "a", TRUE) # all values coerced to character # preserves different types (see below)

Factors
Factors encode categorical data, and are an extremely useful and efficient way of handling categories with multiple entries. Note that R often coerces character data to a factor type by default (e.g. when using read.table). Also have is.factor & as.factor.
chars <- strsplit("the cat characters chars <- factor(chars) levels(chars) here) plot(chars) levels(chars)[1] <- "_" paste(chars, collapse="") sat on the mat", "")[[1]] # create vector of # convert from character to a factor # show factor levels (i.e. different letters # show barchart of factor level frequencies # replace whitespaces with underscores # collapse to a single character string

One thing to watch out for with factors is converting them to numeric mode. Factors are actually stored as a list of integers, referring to the element number of the factor levels. In the following example, there are 3 levels ("100", "200" & "300"), which are represented as characters, and the numeric values of the factor comprise the integers 1-3, referring to the elements of the vector of levels.
Xvector <- c(1, 2, 2, 3, 3, 3) * 100
Xfactor <- factor(Xvector)
levels(Xfactor)       # show levels, which are "100" "200" "300"
as.numeric(Xfactor)   # reports "1 2 2 3 3 3" - the elements of the levels vector
x <- as.numeric(levels(Xfactor)[Xfactor])   # retrieve actual numeric values
identical(x, Xvector)   # same as original numeric vector

Matrix/arrays
Matrices are 2-dimensional arrays, which are themselves generalisations of a vector to more than 1 dimension. Also have is.matrix, is.array & as.matrix, as.array.
M <- matrix(1:12, nrow=3)   # create a matrix with 3 rows & 4 columns
dim(M)                      # show dimensions
M[2, 3]                     # print element in 2nd row & 3rd column
2 * matrix(rep(1, 12), nrow=3)   # multiply every element by a constant

A <- array(1:12, dim=c(2, 2, 3))   # create a 3d array
A[1, 2, 1]   # print single element
A[1, , ]     # print a matrix subset

Arrays are actually stored in a 1 dimensional structure, so you can still access their elements with a single subscript:
A[5]

List
Lists are used to store data of any type or dimensions in a free-form structure. Also have is.list & as.list.
l <- list(functions=c(mean, median), chars=month.abb, numbers=rnorm(7))
l$chars        # print "chars" element
l[2]           # print 2nd element *as a single-item list*
l[[2]]         # print element as a *vector*
l["chars"]     # } compare and
l[["chars"]]   # } contrast

To assemble a list cumulatively, e.g. in a loop:


l <- as.list(NULL)   # create empty list
for ( i in 1:3 ) l[i] <- LETTERS[i]

Data frame
Data frames are widely used in R to store data in a variety of formats with related entries in each row and different attributes in each column, much like a table or spreadsheet. A data frame is essentially a special type of list, and elements of data frames can be accessed in exactly the same way as for a list. Also have is.data.frame & as.data.frame.
A <- data.frame(a=LETTERS[1:4], b=1:4, c=c(T, T, F, T))
sapply(A, class)   # show data types for each column
A$b^2              # perform arithmetic on a column as a vector
dim(A); nrow(A); ncol(A)   # show dimensions of data frame (rows, columns)
A[1, ]             # print first row
A[, 2]             # print 2nd column
as.list(A)         # convert to a list
# Note that matrices must contain data of the same type, so the following
# command converts all the values to character format:
as.matrix(A)       # convert to a matrix

Data frames can have both row and column names (default row names are the row number). This is the same as having a named vector, as seen in the following example:
# create separate, named vectors of data:
planets.mass <- c("Mercury"=0.33, "Venus"=4.87, "Earth"=5.98, "Mars"=0.64,
                  "Jupiter"=1899, "Saturn"=569, "Uranus"=87, "Neptune"=102,
                  "Pluto"=0.13) * 1e24
planets.semimajoraxis <- c("Mercury"=57.9, "Venus"=108, "Earth"=150, "Mars"=228,
                           "Jupiter"=778, "Saturn"=1430, "Uranus"=2870,
                           "Neptune"=4500, "Pluto"=5900) * 1e9
# Now create a data frame:
planets <- data.frame(mass=planets.mass, semimajoraxis=planets.semimajoraxis)
planets["Earth", ]   # show all data for the Earth

planets["Mars", "mass"] # show the mass of Mars; same as planets[4, 1] rownames(planets) <- paste("planet", 1:9) # change row names dimnames(planets); rownames(planets); colnames(planets) # show info

Working with data frames is very easy:


subset(planets, mass > mean(mass))
subset(planets, mass > 1e24 & semimajoraxis < 1e12)
# Adding new columns to the data frame:
planets <- transform(planets, log10mass = log10(mass),
                     wibble = mass * semimajoraxis)
# You can access the columns without including the data frame name, using with:
with(planets, mass^2 + 3 * semimajoraxis)
# which is more convenient than:
planets$mass^2 + 3 * planets$semimajoraxis
# Similarly, you can often access column data within other functions,
# e.g. plotting with a data frame:
plot(semimajoraxis ~ mass, data=planets, log="xy")

Excluding columns from a data frame is also very easy, and can be done by reference to the column number or name:
A <- transform(planets, dummy = 1:nrow(planets))   # add an extra column
A[, -3]        # exclude extra column by number
A[, -c(2:3)]   # exclude multiple columns by number
subset(A, select = -dummy)            # exclude extra column by name
subset(A, select = -c(dummy, mass))   # exclude multiple columns by name

Data input/output in R
For a basic introduction, see getting started. See also the R Data Import/Export manual. R recognises a variety of formats for reading in data. For tabular data, the basic command read.table offers a powerful range of options; it is also used by the shortform commands read.csv and read.delim, for reading in comma-separated values (e.g. output from a spreadsheet) and tab-delimited format data, respectively. Similarly, the command write.table is used to output tabular format data. For fixed-width format data, use read.fwf. A more powerful method is to read data directly into a vector or list, using scan. The following are useful functions for reading and writing a variety of data types; see their respective help pages for details.
source : read in R commands from a file *ideal for loading pre-written chunks of code*
save ; load : write / read R objects to / from a file (see below) *ideal for storing R data*
scan : basic core function to read in data into a list/vector
read.table ; write.table : generic table-format data
read.csv : comma-separated values data (e.g. exported from spreadsheet)
read.fwf : fixed-width format data

read.fortran : fixed-format data files using Fortran-style format specifications
read.DIF : Data Interchange Format (DIF) for data frames from single spreadsheets
read.dcf : Debian Control File format
read.ftable ; write.ftable : flat contingency tables
readBin ; writeBin : binary data
readChar ; writeChar : character strings
readLines ; writeLines : lines
write : write data to a file
dump : write text representation of an object
dget ; dput : read or recreate an ASCII representation of an R object
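A small hedged sketch tying a few of these functions together (the file and object names are illustrative):
f <- tempfile(fileext=".csv")
write.csv(data.frame(x=1:3, y=c(2.5, 4.1, 3.3)), file=f, row.names=FALSE)
d <- read.csv(f)     # read the table back into a data frame
dput(d, file=f)      # write an ASCII representation of the object
d2 <- dget(f)        # recreate the object
identical(d, d2)     # should be TRUE
unlink(f)            # tidy up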

Other packages for R data input / output


There are a number of separate packages for reading and writing data in different formats. The following are some common examples; see the R Data Import/Export manual for more information.
library(help="foreign")   # Minitab, S, SAS, SPSS, Stata, Systat, dBase, Octave formats
RODBC package : for database sources supporting an ODBC interface
gdata package : various tools, e.g. read.xls for reading data from Excel
xtable package : export tables to LaTeX or HTML

Entering & editing data within R


data.entry ; de : convenient GUI tools for entering data
edit : use text editor to modify an R object
fix : invoke edit to change & overwrite an R object

Saving & loading R objects


save writes an external representation of R objects to the specified file; these can then be loaded back into R using load, e.g.
a <- 1:10; b <- a^2
save(a, b, file="mydata.RData")
rm(a, b)                      # Remove (delete) objects
load("mydata.RData")          # Load data into R
tmp <- load("mydata.RData")
tmp                           # Lists names of objects in file
[1] "a" "b"

At any time you can save the history of commands using savehistory(file="my.Rhistory"), and you can load such commands back with loadhistory(file="my.Rhistory").
ls & objects list the objects currently defined; apropos finds objects with names containing the specified string, e.g.
apropos("max") [1] "cummax" [7] "varimax" "max" "max.col" "which.max" "pmax" "pmax.int" "promax"

One of the most important aspects of computing with data is the ability to manipulate it, to enable subsequent analysis and visualization. R offers a wide range of tools for this purpose. Note that the plyr package provides an even more powerful and convenient means of manipulating and processing data, which I hope to describe in later updates to this page.

Add and remove data

http://www.sr.bham.ac.uk/~ajrs/R/r-manipulate_data.html

First create a data frame, then remove a column and create a new one:
A <- data.frame(a=LETTERS[1:5], b=1:5, c=rnorm(5))
A$d <- NULL   # to delete column "d"
A$e <- 1:5    # add in a new column "e"

Now create a second data frame (the last column is simply a random mix of 1 & 2). Note the use of the same column names to see what happens when A & B are joined together:
set.seed(123)   # allow reproducible random numbers
B <- data.frame(a=letters[1:5], b=sample(1:2, size=5, replace=TRUE))

Note that the non-numeric columns of both data frames are treated as factors (unless you use stringsAsFactors=FALSE when using data.frame):
> sapply(A, class)
        a         b         c         e
 "factor" "integer" "numeric" "integer"
> sapply(B, class)
        a         b
 "factor" "integer"

To join them together, you could use c, but the result will be a list:
c(A, B)   # creates a list
> class(c(A, B))
[1] "list"

You can either convert this list to a data frame, or else use data.frame:
AB1 <- as.data.frame(c(A, B))
AB2 <- data.frame(A, B)
> identical( AB1, AB2 )
[1] TRUE

Compare this to what happens when using cbind:


AB3 <- cbind(A, B)
colnames(AB2)   # } note the
colnames(AB3)   # } difference

The identical column names for A & B are rendered unambiguous when using as.data.frame(c(A, B)), by appending .1 to the 2nd data frame's column names. It does this using make.unique, which is useful if you need to generate unique elements, given a vector containing duplicated character strings.
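A quick illustration of make.unique:
make.unique(c("a", "b", "a", "a"))   # gives "a" "b" "a.1" "a.2"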

do.call
do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it. It is extremely useful, and can be used, for example, to join together data frames stored in a list:
l <- list(first=A[1, ], second=A[2, ], rest=A[-c(1:2), ])
do.call(rbind, l)

This task cannot be performed using c or rbind, without losing the 2 dimensional structure of the data stored within each component of the list.

Joining data frames


merge is used to perform a database join operation to merge together rows of 2 data frames which share common entries in one or more columns:
A <- data.frame(letter=LETTERS[1:5], a=1:5)
B <- data.frame(letter=LETTERS[sample(10)], x=runif(10))
merge(A, B)             # Return rows with same "letter", combining unique columns from A & B
merge(A, B, all=TRUE)   # see how non-overlapping "letter" values are handled

match identifies common elements between 2 vectors and returns the positions in the second vector of these matching elements in the order they appear in the first vector.
match(c("B", "E"), LETTERS) match(c("B", "3", "E"), LETTERS) # returns "NA" if no corresponding match

Using the above example, merge is equivalent to:


B[match(A$letter, B$letter), ]   # same as "merge(A, B)" but with row names from "B"

Some other examples:


x <- rep(LETTERS[1:3], each=3)
match(LETTERS[1:3], x)   # "match" *only* returns position of *first* match
match(x, LETTERS[1:3])   # match returns a vector as long as its first argument

A more intuitive version of match is %in%:


x[x %in% "B"] <- "b" #--Alternatively: x[grep("C", x)] <- "c" # change elements of x equal to "B" # change elements of x equal to "C"

On a related theme, the following set operators are also useful:


intersect(1:5, 3:8) union(1:5, 3:8)

and to identify or remove duplicate entries from a vector:


x <- c(1:5, 3:8)
duplicated(x)          # logical vector
which(duplicated(x))   # return duplicate element numbers
unique(x)              # same as "x[! duplicated(x)]"

Rearranging data structures


To sort a vector:
a <- sample(1:10)
sort(a)
sort(a, decreasing=TRUE)   # reverse order
order(a)      # the element numbers of "a" in order of the values of "a"
a[order(a)]   # same as "sort(a)"

To reorder the rows of a data frame according to the contents of one of its columns you just need to use order to specify the row order of the data frame:
A <- data.frame(a=sample(LETTERS[1:5]), b=sample(1:5))
A[order(A$a), ]   # } compare
A[order(A$b), ]   # }

To transpose the rows and columns of a matrix:


A <- matrix(1:6, nrow=3)
t(A)

Since t returns a matrix, the equivalent for a data frame is as follows:


#--Create a data frame with column *and* row names:
B <- data.frame(a=1:3, b=LETTERS[1:3], row.names=c("one", "two", "three"))
> as.data.frame(t(B))
  one two three
a   1   2     3
b   A   B     C

A more general tool for restructuring an array is aperm:


A <- array(1:12, dim=c(2, 2, 3))   # create a 3d array
aperm(A, perm=1:3)          # return original structure
aperm(A, perm=c(1, 3, 2))   # swap 2nd & 3rd dimensions

Reshaping data
First create some multi-column data:
set.seed(123)   # allow reproducible random numbers
A <- data.frame(a=letters[1:3], x=rnorm(3), y=runif(3))

  a          x         y
1 a -0.5604756 0.5281055
2 b -0.2301775 0.8924190
3 c  1.5587083 0.5514350

Now stack the columns:


> stack(A)
      values ind
1 -0.5604756   x
2 -0.2301775   x
3  1.5587083   x
4  0.5281055   y
5  0.8924190   y
6  0.5514350   y
# NB, the "ind" column is now a factor:
> class(stack(A)$ind)
[1] "factor"

But note that the column a is lost in the stacking:


> unstack(stack(A))
           x         y
1 -0.5604756 0.5281055
2 -0.2301775 0.8924190
3  1.5587083 0.5514350

There is also a function reshape which converts between so-called long and wide format data (i.e. columns stacked below each other vs. columns arranged beside each other). However, the documentation for reshape is remarkably opaque! A much more convenient function is melt from the excellent reshape package:
install.packages("reshape") require(melt) melt(A) # retains column "a", unlike "stack(A)"

Truncating and rounding data


Create a set of Gaussian-distributed random numbers:
set.seed(123)          # allow reproducible random numbers
x <- rnorm(20, sd=2)   # default mean is zero
round(x)               # round to nearest integer
round(x, 1)            # round to 1 decimal place
format(x, digits=1)    # format to 1 d.p. (and convert to character)
trunc(x)               # truncate towards zero
x[round(x) != trunc(x)]   # elements of x between N+0.5 and N+1, for integer N
floor(x)               # round down to nearest integer
ceiling(x)             # round up to nearest integer

Show the floor and ceiling values around each point:


i <- seq(along=x)       # vector of x element numbers
plot(i, x)              # same as "plot(x)"
abline(h=-4:3, lty=2)   # add dashed lines to mark the integers
segments(i, floor(x), i, ceiling(x))   # plot floor/ceiling values

To truncate data above and below some thresholds (e.g. set all values below zero to zero and above 1 to 1):
x2 <- pmax(pmin(x, 1), 0) # uses nifty parallel maximum & minimum functions

The result can be visualised as follows:


plot(x, pch=3)             # plot original data as "+" symbols
abline(h=c(0, 1), lty=2)   # show thresholds as dashed lines
points(x2)                 # show thresholded data as default hollow points
elms <- x2 %in% c(0, 1)    # elements of x2 which have been thresholded
points(i[elms], x2[elms], pch=19)   # highlight thresholded points

Miscellaneous commands
If a data frame contains any missing values (NA), you can exclude the corresponding entire row:
A$y[4:9] <- A$x[2] <- NA
> na.omit(A)
            x         y
1  -0.5604756 0.8895393
3   1.5587083 0.6405068
10 -0.4456620 0.1471136
> unlist(list(a=1, b=2:5, c=6))
 a b1 b2 b3 b4  c
 1  2  3  4  5  6

When dealing with long format data, where a vector of values has an associated grouping vector, you can use split to pull out separate list entries for each group:
A <- data.frame(group=LETTERS[rep(1:3, 1:3)], x=rnorm(6))   # 3 groups: "A", "B", "C"
a <- split(A$x, A$group)
> a
$A
[1] 0.6849361
$B
[1] -0.3200564 -1.3115224
$C
[1] -0.5996083 -0.1294107  0.8867361

You can reverse the splitting with unsplit:


unsplit(a, f=A$group)
unname(unlist(a))   # same result

View data structure

http://www.sr.bham.ac.uk/~ajrs/R/r-show_data.html

Before you do anything else, it is important to understand the structure of your data and that of any objects derived from it.
A <- data.frame(a=LETTERS[1:10], x=1:10)
class(A)           # "data.frame"
sapply(A, class)   # show classes of all columns
typeof(A)          # "list"
names(A)           # show list components
dim(A)             # dimensions of object, if any
head(A)            # extract first few (default 6) parts
tail(A, 1)         # extract last row
head(1:10, -1)     # extract everything except the last element
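A handy companion not listed above is str, which prints a compact summary of any object's structure:
str(A)   # compactly display the internal structure of an R object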

It is sometimes useful to work with a smaller version of a large data frame, by creating a representative subset of the data, via random sampling:
A.small <- A[sample(nrow(A), 4), ] # select 4 rows at random

Basic numerical summaries


Generate and summarise some random numbers:
a <- rnorm(50)
summary(a)           # gives min, max, mean, median, 1st & 3rd quartiles
min(a); max(a)       # }
range(a)             # } self-explanatory
mean(a); median(a)   # }
sd(a); mad(a)        # standard deviation, median absolute deviation
IQR(a)               # interquartile range
quantile(a)          # quartiles (by default)
quantile(a, c(1, 3)/4)   # specific percentiles (25% & 75% in this case)

Data frame summaries:


A <- data.frame(a=rnorm(10), b=rpois(10, lambda=10))
summary(A)         # summarise data frame
apply(A, 1, mean)  # calculate row means
apply(A, 2, mean)  # calculate column means: same as "mean(A)"

which.min & which.max return the element number of the lowest/highest value:
set.seed(123)   # allow reproducible random numbers
x <- sample(10)
> which.max(x)
[1] 7
> x[which.max(x)]
[1] 10

This can be used in a data frame to extract the corresponding row containing the min/max value of one of the columns:
A <- data.frame(x=rnorm(10), y=runif(10))
A[which.min(A$x), ]
#--Alternatively:
subset(A, x == min(x))

Other summaries:
x <- rnorm(100)
fivenum(x)   # Tukey's five number summary, used to construct a boxplot
boxplot(x)   # see ?boxplot.stats for more details
stem(x)      # A stem-and-leaf plot

Matrix summaries:
A <- matrix(rnorm(50), nrow=10)   # create 10x5 random number matrix
colSums(A); rowSums(A); colMeans(A); rowMeans(A)   # self-explanatory
max.col(A)   # maximum position for each row of a matrix, same as:
which.max(A[1,]); which.max(A[2,])   # etc.

Tables
Load some data on a sample of 20 galaxy clusters with a categorical classification status (cctype) indicating whether there is a cool core or not and a factor (det) specifying which of two detectors was used to make the X-ray observation of the cluster:
file <"http://www.sr.bham.ac.uk/~ajrs/papers/sanderson09/sanderson09_table2.txt" a <- read.table(file, header=TRUE, sep="|") # table(a$cctype) # count numbers in each cctype category table(a$cctype, a$det) # 2-way table xtabs(~ cctype + det, data=a) # alternative (formula) syntax addmargins(xtabs(~ cctype + det, data=a)) # add row/col summary (default is sum) prop.table(xtabs(~ cctype + det, data=a)) # show counts as proportions of total

To test whether the input factors are independent of each other:


chisq.test(xtabs(~ det + cctype, data=a), simulate.p.value=TRUE)

There is marginal evidence (p=0.07) of an interaction: clusters observed with ACIS-S are more likely to have a cool core than not.

Calculate aggregate statistics


Calculate numerical summaries for subsets of a data frame (using above dataset):
> aggregate(kT ~ cctype, data=a, FUN=mean)
  cctype       kT
1     CC 5.121111
2 non-CC 6.146364

# mean cluster redshift of each cctype for each detector:
> aggregate(z ~ cctype + det, data=a, FUN=mean)
  cctype det          z
1     CC   I 0.06070000
2 non-CC   I 0.05137500
3     CC   S 0.04105714
4 non-CC   S 0.03636667

#--Show mean values of a few quantities, for each cctype:
aggregate(. ~ cctype, data=a[c("cctype", "z", "kT", "Z", "S01", "index")], mean)

You can also apply multi-number summaries:


> aggregate(index ~ cctype, data=a, FUN=range)
  cctype index.1 index.2
1     CC   0.714   1.120
2 non-CC   0.283   0.944

Base graphics

http://www.sr.bham.ac.uk/~ajrs/R/r-plot_data.html

For a basic introduction, see the "getting started" page here. Base graphics are very flexible and allow a great deal of customisation, with many individual functions available. However, they lack a coherent underlying framework and, for visualizing highly structured data, are outclassed by lattice and ggplot2. Quick reference info:
demo("graphics") ?plot ?par ?layout example("pch") colours() ?plotmath demo(plotmath) # # # # # # # Demonstration of graphics in R Help page for main plot function Help page for changing graphical parameters Help page on plot arrangement Point style examples List pre-defined named colours Help page on plotting maths symbols

Useful plotting functions:


lines, points, abline, curve, text, rug, legend
segments, arrows, polygon
locator, identify   # For interacting with plots

Create some data for plotting:


x <- 10 + (1:20)/10
y <- x^2 + rnorm(length(x))        # Add Gaussian random number
plot(x, y)
curve(x^2, add=TRUE, lty=2)        # Add dashed line showing y=x^2
plot(x, y, type="l", col="blue")   # Plot as blue line (try 'type="o"')
plot(x, y, type="l", log="xy")     # Plot as line with log X & Y axes
abline(v=11, lty=3)                # Add vertical dotted line
text(11.5, 120, "Hello")           # Add annotation
legend("topleft", inset=0.05, "data", pch=1, col="blue", bty="n")   # Add a legend

Different point styles:


plot(x, y, pch=2, col="red") # Hollow triangles plot(1:10, rep(1, 10), pch=LETTERS) # Can also use any character example("pch") # Show point style examples

Plot symbols and colours can be specified as vectors, to allow individual specification for each point. R uses recycling of vectors in this situation to determine the attributes for each point, i.e. if the length of the vector is less than the number of points, the vector is repeated and concatenated to match the number required. Single plot symbol (see "?points" for more) and colour (type "colours()" or "colors()" for the full list of predefined colours):
plot(x, y, pch=2, col="red") # Hollow triangles plot(x, y, pch=c(3, 20), col=c("red", "blue")) # Blue dots; red "+" signs plot(x, y, pch=1:20) # Different symbol for each point

Create vector of contiguous colours in a rainbow palette:


col <- rainbow(length(x))
plot(x, y, col=col)

Label axes:
plot(x, y, xlab="Some data", ylab="Wibble")

Axis limits are controlled by xlim and ylim, which are vectors of the minimum and maximum values, respectively. Specify axis limits:
plot(x, y, xlim=c(11, 12), ylim=c(0, 150))

Changing the plot layout


The basic idea behind the R function layout is to divide the plotting device into a series of rows and columns specified by a matrix. The matrix itself is composed of values referring to the plot number, generally just 1,2,3...etc., but can feature repetition. Show simple 2x1 matrix:
matrix(1:2)                     # 2x1
matrix(1:4)                     # 4x1
matrix(1:4, 2, 2)               # 2x2
matrix(1:6, 3, 2)               # 3x2 ordered by columns
matrix(1:6, 3, 2, byrow=TRUE)   # 3x2 ordered by rows

To view the graphical layout, the following will show the borders of the sub-panels and the number identifying each one:
layout(matrix(1:4, 2, 2))   # Specify layout for 4 panels
layout.show(4)              # Show the defined layout
layout.show(2)              # Try specifying just 2 instead

Now fill the layout with 4 plots:


x <- 1:10
plot(x, x)
plot(x, x^2)
plot(x, sqrt(x))
plot(x, log10(x))
curve(log10, add=TRUE)   # Adds to last panel plotted

The heights and widths arguments to layout are vectors of relative heights and widths of the matrix rows and columns, respectively.

Specifying panels of different sizes:


layout(matrix(1:4, 2, 2), heights=c(2, 1)); layout.show(4)
replicate(4, plot(x, x))   # Repeat plot 4 times

Plotting a function or equation


The function curve allows you to plot equations or complex functions, either on their own, or added to an existing plot (with add=T). Plot some analytic expressions:
curve(x^2)
curve(x^1)                   # "curve(x)" fails! (can also use "curve(I(x))")
curve(x^2+log10(x)-sin(x))   # Can use arithmetic
curve(dnorm)                 # Normal distribution for mean=0, standard deviation=1
curve(x^3-x+1, from=-10, to=10, lty=2)   # Specify range & use dashed line

Plot a function, with specified arguments:


curve(dnorm(x, mean=1, sd=2), from=-10, to=10)

curve provides the function to be plotted with a vector of x-axis values called x, from which to calculate the corresponding y-axis data. If the argument of your function is not called x (e.g. r), then you need to use the following syntax: curve(myfun(r=x)). For instance, a minimal sketch (myfun here is a hypothetical placeholder):
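myfun <- function(r) exp(-r)      # hypothetical function whose argument is "r", not "x"
curve(myfun(r=x), from=0, to=5)   # wrap it so that curve can supply "x"

The following example illustrates this with a plot of several blackbody curves. First, define a function for the Planck blackbody law to calculate the radiation intensity as a function of wavelength (lambda, in microns) and temperature (Temp, in Kelvin):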
blackbody <- function(lambda, Temp=1e3) {
  h <- 6.626068e-34; c <- 3e8; kb <- 1.3806503e-23   # physical constants
  lambda <- lambda * 1e-6                            # convert from microns to metres
  ( 2*pi*h*c^2 ) / ( lambda^5*( exp( (h*c)/(lambda*kb*Temp) ) - 1 ) )
}

Now plot the curve for the default temperature of 1000K, with some axis labels:
main <- "Planck blackbody curves" xlab <- expression(paste(Wavelength, " (", mu, "m)")) ylab <- expression(paste(Intensity, " ", (W/m^3))) col <- c("blue", "orange", "red") lty <- 1:3 curve(blackbody(lambda=x), from=1, to=15, main=main, xlab=xlab, ylab=ylab, col=col[1])

Finally, add 2 more curves for 900K and 800K:


curve(blackbody(lambda=x, Temp=900), add=TRUE, col=col[2], lty=lty[2])
curve(blackbody(lambda=x, Temp=800), add=TRUE, col=col[3], lty=lty[3])
legtext <- paste(c(1000, 900, 800), "K", sep="")
legend("topright", inset=0.05, legend=legtext, lty=lty, col=col, text.col=col)

Print a copy to a PDF file:
dev.copy2pdf(file="blackbody.pdf") # Also "dev.copy2eps"

Interacting with the plot


To find out the coordinates at a particular position on a graph, type:

locator()

then left-click with the mouse any number of times within the axes, and right-click to end; the R prompt will then return, and a list will be printed with the X and Y coordinates of the positions clicked. You can retain this information by repeating the above, but with

A <- locator()

the coordinates will then be stored in A$x and A$y. To identify a particular point in a plot, use "identify", e.g.:
x <- 1:10; y <- x^2
plot(x, y)
identify(x, y)

Now left-click near one or more points, and the element number of that point will be printed at the bottom, left, top or right of the point, depending on which side of it you clicked. Right-click inside the axes to finish, and the element numbers of the points identified will be printed, as for locator. This is more useful if you have named points, in which case identify can print the name instead of the element number, for example:
names(x) <- LETTERS[1:length(x)]
plot(x, y)
identify(x, y, labels=names(x))   # don't forget to right-click to finish!

Lattice graphics
Lattice is an excellent package for visualizing multivariate data; it is essentially a port to R of the S trellis display software. While it lacks the flexibility and extensibility of ggplot2, it nevertheless provides a great set of routines for quickly displaying complex data with ease, which makes it ideal for exploratory data analysis; you can find out more in the excellent book Lattice: Multivariate Data Visualization with R by Deepayan Sarkar. For some examples of using lattice, first assemble some data on the masses (in kg) and semi-major axis lengths (in metres) of the planets, and make a dotplot of the former:
planets.mass <- c("Mercury"=0.33, "Venus"=4.87, "Earth"=5.98, "Mars"=0.64,
                  "Jupiter"=1899, "Saturn"=569, "Uranus"=87, "Neptune"=102,
                  "Pluto"=0.13) * 1e24
planets.semimajoraxis <- c("Mercury"=57.9, "Venus"=108, "Earth"=150, "Mars"=228,
                           "Jupiter"=778, "Saturn"=1430, "Uranus"=2870,
                           "Neptune"=4500, "Pluto"=5900) * 1e9
require(lattice)   # ensure package is loaded
dotplot(sort(log10(planets.mass)), xlab="log10 mass (kg)")

A histogram and a kernel-smoothed density plot of the semi-major axes:


histogram(log10(planets.semimajoraxis))
densityplot(log10(planets.semimajoraxis))   # shows raw data as "jittered" points

Now to demonstrate the multivariate capabilities, assemble the data in a data frame and create a categorical variable giant, which identifies the 4 most massive planets:
A <- data.frame(sma=planets.semimajoraxis, mass=planets.mass)
A$name <- rownames(A)

A$giant <- ifelse(A$mass>1e25, "Giant", "Not giant")

Lattice can now handle the different categories separately, either by using groups, to use different plotting symbols etc. within the same panel, e.g.:
dotplot(reorder(name, sma) ~ log10(sma), data=A, xlab="log10 semi-major axis (m)", groups=giant, auto.key=TRUE)

...or by conditioning on a categorical variable, to plot separate panels for each dataset:
dotplot(reorder(name, sma) ~ log10(sma) | giant, data=A, xlab="log10 semi-major axis (m)", auto.key=TRUE)

You can also easily plot linear regression models (from lm) for each group category, using the type argument:
xyplot(sma ~ mass, data=A, groups=giant, scales=list(log=TRUE),
       type=c("g", "p", "r"), auto.key=list(lines=TRUE))
#--Other "type" arguments:
#  "g" = show gridlines
#  "p" = points
#  "l" = lines (join the dots)
#  "r" = linear regression model
#  "smooth" = locally-weighted regression using "loess"

Lattice offers a very quick route to visualize a set of properties conditioned on one or more factors. For example, to show boxplots of 4 different quantities in separate panels, with each panel comparing values in different categories:
file <"http://www.sr.bham.ac.uk/~ajrs/papers/sanderson09/sanderson09_table2.txt" a <- read.table(file, header=TRUE, sep="|") #--This plot is actually saved as an R object "p" (for use below) and with the # outer "(" & ")" the result is also printed (i.e. plotted in this case, # since "printing" a lattice object draws the plot): ( p <- bwplot( z + kT + Z + index ~ cctype, data=a, outer=TRUE, scales="free", ylab="") )

Another excellent feature of lattice is the ability to span plots over multiple pages, using the layout argument (which is a vector specifying the required number of columns, rows & pages for the plot panels). This is great if you are plotting a large number of panels and want to dump them onto separate pages of a PDF document, say. Following on from the previous example (saved as the lattice object p):
devAskNewPage(TRUE)            # force prompt between each page
update(p, layout=c(2, 1, 2))   # 2 cols; 1 row; 2 pages
devAskNewPage(FALSE)           # restore default

You can see examples of a time series plot and a dotplot created with lattice, together with the R code that produced them, on the R gallery page.

Numerical Analysis in R

http://www.sr.bham.ac.uk/~ajrs/R/r-analyse_data.html

While R is best known as an environment for statistical computing, it is also a great tool for numerical analysis (optimization, integration, interpolation, matrix operations, differential equations etc.). Here is a flavour of the capabilities that R offers for analysing data. For a basic introduction, see the analysis section of the getting started page. For a more thorough overview of regression using R in an astronomical context, see this tutorial.

Parameter optimization
To find the minimum value of a function within some interval, use optimize (optimise is a valid alias):
fun <- function(x) x^2 + x - 1                  # create function
curve(fun, xlim=c(-2, 1))                       # plot function
( res <- optimize(fun, interval=c(-10, 10)) )
points(res$minimum, res$objective)              # plot point at minimum
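Incidentally, optimize searches for a minimum by default; to find a maximum instead, set maximum=TRUE (a quick sketch using the standard normal density):

optimize(dnorm, interval=c(-5, 5), maximum=TRUE)   # peak of the normal density, at x=0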

Now, let's say you want to find the x value of the function at which the y value equals some number (1.234, say):
#--Define an auxiliary minimizing function:
fun.aux <- function(x, target) ( fun(x) - target )^2
( res <- optimize(fun.aux, interval=c(-10, 10), target=1.234) )
fun(res$minimum)   # close enough

Of course, there are 2 solutions in this case, as seen by plotting the function to be minimized:
curve(fun.aux(x, target=1.234), xlim=c(-3, 2))
points(res$minimum, res$objective)

We can get the other solution by giving a skewed search interval (see ?optimize for how the start point is determined):
res2 <- optimize(fun.aux, interval=c(-10, 100), target=1.234)   # force higher start value
points(res2$minimum, res2$objective)                            # plot other minimum
#--Show target values plotted with original function:
curve(fun, xlim=c(-3, 2))
abline(h=1.234, lty=2)
abline(v=c(res$minimum, res2$minimum), lty=3)

For more general-purpose optimization, use nlm, optim or nlminb (which I've found to be the most robust). The CRAN task views webpage has a very thorough overview of R packages relating to optimization.
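As a minimal sketch of optim (the Rosenbrock test function is just an illustrative choice here):

rosen <- function(p) (1 - p[1])^2 + 100*(p[2] - p[1]^2)^2   # Rosenbrock test function of 2 variables
optim(c(-1, 1), rosen)                  # Nelder-Mead simplex (the default method)
optim(c(-1, 1), rosen, method="BFGS")   # quasi-Newton alternative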

Integration, differentiation and differential equations


To integrate a function numerically, use integrate (note that the function must be able to accept and return a vector):
fun <- function(x, const) x^2 + 2*x + const
> integrate(fun, lower=0, upper=10, const=1)
443.3333 with absolute error < 4.9e-12

integrate will evaluate the function over the specified range (lower to upper) by passing a vector of these values to the function being integrated. Note that any other arguments to fun must also be specified, as extra arguments to integrate, and that the order of the arguments of fun does not matter, provided all arguments apart from the one being integrated over are supplied in this way:
fun2 <- function(A, b, x) A*x^b                # "x" doesn't have to be the first argument
integrate(fun2, lower=0, upper=10, A=1, b=2)   # "A" & "b" are given explicitly

Now, let's say you wanted to integrate this function for a series of values of b:
bvals <- seq(0, 2, by=0.2)   # create vector of b values
fun2.int <- function(b) integrate(fun2, lower=0, upper=10, A=1, b=b)$value
fun2.int(bvals[1])           # works for a single value of b
fun2.int(bvals)              # FAILS for a vector of values of b

To make it work, you need to force vectorization of the function, so that it can cycle through the elements of the vector and evaluate the function for each one:
fun2.intV <- Vectorize(fun2.int, "b")   # Vectorize "fun2.int" over "b"
fun2.intV(bvals)                        # returns a vector of values
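Note that integrate also accepts infinite limits, e.g. to integrate the standard normal density over the whole real line:

integrate(dnorm, lower=-Inf, upper=Inf)   # should return 1, with a small absolute error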

To compute symbolically the derivative of a simple expression, use D (see ?deriv for more info):
> D(expression(sin(x)^2 - exp(x^2)), "x")   # differentiate with respect to "x"
2 * (cos(x) * sin(x)) - exp(x^2) * (2 * x)
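Relatedly, deriv can build a function that returns both the value of an expression and its gradient; a brief sketch:

g <- deriv(~ sin(x)^2, "x", function.arg=TRUE)   # result carries a "gradient" attribute
g(pi/4)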

To solve differential equations, use the deSolve package. You can read a helpful introduction to deSolve in Volume 2/2 of The R Journal.
install.packages("deSolve") library("deSolve") library(help="deSolve") # see information on package

Interpolating data
An example of spline interpolation:
fun <- function(x) sqrt(3) * sin(2*pi*x)   # function to generate some data
x <- seq(0, 1, length=20)
set.seed(123)                              # allow reproducible random numbers
y <- jitter(fun(x), factor=20)             # add a small amount of random noise
plot(y ~ x)                                # plot noisy data
lines(spline(x, y))                        # add splined data

Now, compare with the prediction from a smoothing spline:


f <- smooth.spline(x, y)
lines(predict(f), lty=2)

Why not also add the best-fit sine curve predicted from a linear regression with lm:
lines(x, predict(lm( y ~ sin(2*pi*x))), col="red")

Note that, by default, the predicted values are evaluated at the (X) positions of the raw data. This means that you can end up with rather coarse curves, as seen above. To get round this, you need to work with functions for the splines, which can be supplied with more finely-spaced X values for plotting:
fun.spline <- splinefun(x, y)
fun.smooth <- function(xx, ...) predict(smooth.spline(x, y), x=xx, ...)$y
plot(y ~ x)
curve(fun.spline, add=TRUE)          # } "curve" uses n=101 points by default
curve(fun.smooth, add=TRUE, lty=2)   # } at which to evaluate the function

#--And add a smoother best-fit sine curve:
fun.sine <- function(X) predict(lm( y ~ sin(2*pi*x) ), newdata=list(x=X))
curve(fun.sine, add=TRUE, col="red")

#--Finally, just for completeness, plot the original function:
curve(fun, add=TRUE, col="blue")

A wider range of splines is available in the splines package, accessed via library("splines"), including B-splines, natural splines etc.
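For simple piecewise-linear interpolation, approxfun works analogously to splinefun; a sketch, reusing the x and y vectors from above:

fun.lin <- approxfun(x, y)              # piecewise-linear interpolating function
curve(fun.lin, add=TRUE, col="green")   # add to the existing plot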

Matrix operations
t transposes a matrix; %*% is the usual (inner) matrix multiplication; diag extracts a matrix diagonal (or creates a diagonal matrix), and upper.tri & lower.tri return logical arrays indicating which elements belong to the upper/lower triangles. To evaluate the classic "famous five" numbers from linear least-squares regression (sum(x), sum(x^2), sum(y), sum(y^2), sum(x*y)) using matrices:
N <- 10                      # }
x <- 1:N; y <- 10:19         # } create some X & Y data
M <- cbind(n=1, x, y)        # combine into a matrix
M2 <- t(M) %*% M             # matrix multiplication with transposed version
res <- M2[! lower.tri(M2)]   # length of x/y & famous five numbers:
identical(res, c(N, sum(x), sum(x^2), sum(y), sum(x*y), sum(y^2)))

To calculate the outer product of 2 arrays, use outer:


x <- seq(-1, 1, length=200)
A <- outer(x, x, function(x, y) cos(x)^2 - sin(y)^2)
require(lattice)
levelplot(A)

You can also use this to generate a grid of values:


outer(LETTERS[1:3], 1:5, paste, sep="")

Matrix crossproduct:
x <- rnorm(1e7)
system.time( a1 <- drop(crossprod(x)) )
system.time( a2 <- sum(x^2) )             # -> matrix version faster
identical(a1, a2)                         # check the answer is the same

You can also use solve to solve a system of linear equations, eigen to calculate eigenvalues and eigenvectors of a matrix, and svd to compute the singular-value decomposition of a rectangular matrix. There is also a dedicated Matrix package, which can handle sparse and dense matrices, accessible via library(help="Matrix").
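For instance, a minimal sketch of these three functions on a small symmetric matrix:

S <- matrix(c(2, 1, 1, 3), 2, 2)   # 2x2 symmetric matrix
rhs <- c(1, 2)
solve(S, rhs)     # solve the linear system S x = rhs
eigen(S)$values   # eigenvalues of S
svd(S)$d          # singular values of S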

Statistical Analysis in R
Work in progress... This will be just a very brief taster of some of the many things that R can do in the way of statistical analysis; right now it consists only of a guide to fast bootstrap resampling of regression parameter errors.

Fast bootstrap resampling to estimate regression parameter errors


Bootstrap resampling is a very useful method to determine parameter error estimates. This section makes use of the boot R package, which can be loaded with library(boot) or require(boot). A demonstration using simple linear regression:
set.seed(123)                                  # allow reproducible random numbers
N <- 20
A <- data.frame(x=1:N, y=rnorm(N, mean=1:N))   # create some data
plot(y ~ x, A)                                 # plot the data
m <- lm(y ~ x, A)                              # fit linear model
abline(m)                                      # add best-fit model to plot as a line
summary(m)                                     # show model best-fit values & standard errors

#--Create a simple function to return the best-fit coefficients for the model
#  fitted to a subset of the original data ("A"), given a vector of row
#  numbers for the data frame ("indices"). "indices" will be the same length
#  as "nrow(A)", and will be supplied by the "boot" function, using random
#  sampling with replacement (i.e. "sample(nrow(A), replace=TRUE)"):
mystat <- function(A, indices) {
  m <- lm(y ~ x, A[indices, ])
  return(coef(m))
}

#--Demonstrate function:
> mystat(A, 1:nrow(A))   # same as "coef(m)"
(Intercept)           x
  0.3100925   0.9839554
> set.seed(123)          # allow reproducible random numbers
> mystat(A, sample(nrow(A), replace=TRUE))   # result for a single resample
(Intercept)           x
  0.6143148   0.9416338

#--Run full set of "N.boot" bootstrap resamples:
N.boot <- 500
require(boot)    # load boot library
set.seed(123)    # allow reproducible random numbers
b <- boot(A, mystat, R=N.boot)

#--Plot results of bootstrapping (see "?plot.boot" for details):
plot(b, index=1)   # intercept
plot(b, index=2)   # slope

#--Now compare the standard errors on the model parameters from
#  the bootstrap resampling with those from the normal summary method:
> b   # print results

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = A, statistic = mystat, R = N.boot)

Bootstrap Statistics :
     original          bias    std. error
t1* 0.3100925  0.0139150827  0.44749693
t2* 0.9839554 -0.0007816746  0.03979363

## NB the bias is the difference between the mean of the N.boot
## resample parameter values and the original best-fit model
## parameter values, i.e. "apply(b$t, 2, mean) - coef(m)"

#--Now show the standard errors computed by lm (see "?summary.lm"):
> coef(summary(m))
             Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 0.3100925 0.46199915  0.6711972 5.106173e-01
x           0.9839554 0.03856694 25.5129205 1.389147e-15

#--Or, quantify the difference by comparing the ratios
#  (using column-wise sd, since "sd()" no longer accepts a matrix):
> apply(b$t, 2, sd) / coef(summary(m))[, 2]   # close to 1 in each case
(Intercept)           x
  0.9686099   1.0318067
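You can also derive confidence intervals for the parameters from the same boot object, using boot.ci; a brief sketch (index selects the coefficient):

boot.ci(b, type=c("norm", "perc"), index=2)   # normal & percentile intervals for the slope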

A quicker version of the above example, using the method described in Section 4.3.1 from Chambers & Hastie, 1992 (see References section in ?lm for book details). This method exploits the fact that lm does a certain amount of initial processing prior to the actual regression (using lm.fit), and this represents a substantial overhead that need only be performed once (and need not be replicated during each bootstrap resampling iteration).
#--Following Section 4.3.1 from Chambers & Hastie (1992):
m <- lm(y ~ x, A, x=TRUE, y=TRUE)   # "x=TRUE, y=TRUE" returns extra data for lm.fit

mystat.fast <- function(dummy, i, model) coef(lm.fit(model$x[i, ], model$y[i]))
require(boot)
set.seed(123)   # allow reproducible random numbers
system.time(slow <- boot(A, mystat, 1e4))
set.seed(123)
system.time(fast <- boot(A, mystat.fast, 1e4, model=m))   # roughly 10x faster

#--Check results look the same:
slow
fast

#--Formal check of whether the two objects are identical: the only
#  differences reported are in components 6 & 8, which are due to the
#  different statistic function & name (i.e. "mystat" vs. "mystat.fast")
#  stored in the object returned by "boot":
identical(slow, fast)   # not completely identical
all.equal(slow, fast)   # only differences due to different mystat functions

Note that there is little point in having a very large number of bootstrap samples compared to the number of fitted values (i.e. the number of rows in the data frame), since the latter ultimately becomes the limiting factor in the accuracy of the recovered parameter error estimates.

An example using non-linear regression (nls)
#--Create some non-linear data:
N <- 20
set.seed(123)
B <- data.frame(x=1:N, y=(4 * log10(1:N)) + rnorm(N, mean=2, sd=0.2))
plot(y ~ x, B)   # plot the data

#--Fit the non-linear model:
m <- nls(y ~ a * log10(x) + b, data=B, start=list(a=1, b=1))
lines(B$x, fitted(m), lty=2)   # plot best-fit model values as a dashed line

#--A better way of plotting the best-fit model (as a smooth curve):
curve(predict(m, newdata=data.frame(x=x)), add=TRUE)
summary(m)   # summarise the best-fit parameters and their errors etc.

#--There is no equivalent of the fast version of "mystat" possible for
#  nls, so set up the basic function to calculate the bootstrapped fit:
mystat <- function(A, indices) {
  m <- nls(y ~ a * log10(x) + b, data=A[indices, ], start=list(a=1, b=1))
  return(coef(m))
}

#--Run full set of "N.boot" bootstrap resamples:
N.boot <- 500
set.seed(123)
require(boot)
b <- boot(B, mystat, R=N.boot)

#--Plot results of bootstrapping:
#  (NB note the significantly non-normal distribution of values, i.e. the
#  quantile-quantile plot values in the right panel deviate from a straight line)
plot(b, index=1)   # parameter "a"
plot(b, index=2)   # parameter "b"

#--Now compare the standard errors on the model parameters from
#  the bootstrap resampling with those from the normal summary method:
> b   # print results of bootstrap

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = B, statistic = mystat, R = N.boot)

Bootstrap Statistics :
    original        bias    std. error
t1* 3.986804 -0.02582234    0.1361342
t2* 2.040456  0.02679389    0.1332876

> summary(m)   # print standard errors (see "?summary.nls")

Formula: y ~ a * log10(x) + b

Parameters:
  Estimate Std. Error t value Pr(>|t|)
a   3.9868     0.1299   30.70  < 2e-16 ***
b   2.0405     0.1275   16.01 4.33e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1998 on 18 degrees of freedom

Number of iterations to convergence: 1
Achieved convergence tolerance: 1.234e-07

#--Or, quantify the difference by comparing the ratios
#  (using column-wise sd, since "sd()" no longer accepts a matrix):
apply(b$t, 2, sd) / coef(summary(m))[, 2]   # close to 1 in each case
       a        b
1.048229 1.045583

For further information, you can find out more about how to access, manipulate, summarise, plot and analyse data using R.
