ST1232 Tutorial 1

Introduction to Exploratory Data Analysis and Statistical Analysis with R-through-Excel

1. INTRODUCTION
Microsoft Excel is the most common software for managing data and for performing
simple statistical analysis. However, it lacks the sophistication that dedicated
statistical software such as SPSS, SAS or R offers for detailed statistical
analysis. This worksheet introduces an add-on package for Excel, known as RExcel,
which provides a user-friendly interface through Excel to the analytical
capability of R. Most of the material in this instruction sheet has been adapted
from the following reference:
Heiberger RM and Neuwirth E (2009) R through Excel. Springer, New York.
2. INSTALLING REXCEL
To install RExcel, you will need an internet connection. Follow the steps below:
(i) direct your internet browser to http://rcom.univie.ac.at;
(ii) click on the Download tab at the top of the page;
(iii) click on the download link to RAndFriends, and download the file to a
temporary location;
(iv) you will require Administrator access to your computer in order for Excel and
R to communicate;
(v) make sure you have closed Excel and any previous version of R;
(vi) double-click on the downloaded file (i.e. RAndFriendsSetup.exe). This
installation may take up to 15 minutes, and it will install R with RExcel and R
Commander. During the installation:
    (a) check the box that says R through Excel book demo;
    (b) do not install Rggobi when given the option;
    (c) do not install Notepad++ and NppToR when given the option.
(vii) RAndFriends installer subsequently will download statconnDCOM from the
internet, and you should choose the default setup when asked.

[Troubleshooting]
Occasionally, the installer is unable to locate Excel and this affects the
communication between Excel and R. To resolve this:
(a) start R by double-clicking on the R icon;
(b) at the R prompt, enter
    > library(RExcelInstaller)
    > installRExcel()

Verify the installation by double-clicking on the RExcel2007 icon on your desktop,
and going to the Add-Ins tab at the top of the screen. You should be able to see
the series of drop-down menus for performing various statistical analyses.

3. GETTING STARTED
We will explore the use of RExcel through a series of guided worked examples,
which will include data management, basic statistical analysis and exploratory data
analysis. Double-click on the RExcel2007 icon on your desktop to launch both R
and Excel.
3.1 Height data for children
Three groups of 10 children each have been identified in a survey of childhood
puberty, and the height data for the 30 children are shown below.
Group 1   Group 2   Group 3
   93       105       100
   95       107       101
  101       110       103
  103       110       107
  108       115       111
  111       118       113
  114       120       115
  115       120       115
  115       123       118
  117       126       125

(a) Enter the data into Excel or SPSS such that each row contains the data for
a unique individual (thus, you should have the data in thirty rows and two
columns, one for the height data and one to indicate which group the child
was from).
(b) Produce numerical summary statistics for the height of all children,
irrespective of the groupings. Interpret your results.
(c) Explore the data, stratified by the groupings. Interpret your results.
(d) Produce informative figures that will aid the understanding of the dataset. Did
you produce a histogram? Was it useful for understanding the distribution of
the height data?
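
For readers who would like to see R commands behind tasks (a) to (d), the steps can be sketched at the R prompt as follows. This is only a minimal sketch in base R, not the RExcel menu route the worksheet has in mind; the data frame name `children` and the column names `height` and `group` are illustrative choices, not names prescribed by the worksheet.

```r
# Heights for the three groups, copied from the table in Section 3.1.
height <- c(93, 95, 101, 103, 108, 111, 114, 115, 115, 117,    # Group 1
            105, 107, 110, 110, 115, 118, 120, 120, 123, 126,  # Group 2
            100, 101, 103, 107, 111, 113, 115, 115, 118, 125)  # Group 3

# (a) One row per child: 30 rows and two columns (height and group).
#     `children`, `height` and `group` are illustrative names.
children <- data.frame(
  height = height,
  group  = factor(rep(1:3, each = 10), labels = paste("Group", 1:3))
)

# (b) Numerical summaries for all children, ignoring the grouping.
overall_summary <- summary(children$height)
overall_sd <- sd(children$height)
print(overall_summary)
print(overall_sd)

# (c) The same kind of summary, stratified by group.
group_means <- tapply(children$height, children$group, mean)
group_summaries <- tapply(children$height, children$group, summary)
print(group_summaries)

# (d) Figures: a histogram of all heights and side-by-side boxplots by group.
hist(children$height, main = "Height of all children", xlab = "Height")
boxplot(height ~ group, data = children, ylab = "Height")
```

With only 30 observations the histogram is coarse, so comparing it with the grouped boxplots is one way to approach the questions posed in (d).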

3.2 Mathematical ability and omega 3 consumption
The mathematics.xls dataset contains data from an artificial study into the
effect of omega 3 consumption on the marks in the mock Secondary 4 exams in
Additional Mathematics at 3 schools. The dataset can be downloaded from
http://www.statistics.nus.edu.sg/~statyy/ST1232/bin/mathematics.xls.
(a) Produce numerical summaries for the variables marks_before,
marks_after and omega3, as well as other graphical summaries that may
be useful for understanding these variables. Specifically, consider the
histograms of these variables and comment on their empirical distributions.
From these analyses, can you identify any issues with the existing dataset?

(b) Through the use of appropriate figures, identify the problematic data and
remove them from further analyses.
(c) Through the use of an appropriate graphical summary, explore whether
there is any graphical evidence to suggest that the mathematics scores
before starting the omega 3 treatment differ significantly between males
(coded with sex = 1) and females (coded with sex = 2).
(d) Produce a scatterplot of the scores before and after the omega 3 treatment,
and comment on the relationship between the two variables.
(e) Calculate the empirical correlation between the scores before and after
the omega 3 trial (explore how to do this in either RExcel or SPSS yourself; try
looking through the drop-down menus).
(f) Introduce a new variable, which is defined as the difference in the scores
before and after the omega 3 trial (explore how to do this in either RExcel or
SPSS). Produce a scatterplot of the difference in scores with omega 3
consumption. Comment on the figure obtained.
(g) Produce a cross-tabulation table of school against sex, including the
frequencies of the school by sex. Comment qualitatively on whether there
exists any difference in the frequencies of the students from the different
schools between the different genders.
(h) Finally, investigate graphically with the use of a boxplot whether there is any
evidence to suggest a difference in daily omega 3 consumption between
the schools. Is this plot informative?
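
The tasks in this section can also be sketched directly in base R. The sketch below is hedged in several ways. The data frame `maths` holds made-up placeholder values so that the code runs on its own; they are not taken from mathematics.xls and should be replaced by the real data. The 0 to 100 range used to flag problematic marks is an assumption, as are the name `marks_diff` and the choice of "after minus before" for the difference. Only the column names marks_before, marks_after, omega3, sex and school come from the worksheet.

```r
# PLACEHOLDER data: made-up values so the sketch is self-contained.
# These are NOT the contents of mathematics.xls.
maths <- data.frame(
  school       = c(1, 1, 2, 2, 3, 3, 1, 2),
  sex          = c(1, 2, 1, 2, 1, 2, 2, 1),           # 1 = male, 2 = female
  omega3       = c(0.5, 1.0, 1.5, 0.8, 1.2, 0.3, 0.9, 1.1),
  marks_before = c(55, 62, 48, 70, 66, 58, 61, 52),
  marks_after  = c(60, 65, 55, 72, 70, 59, 640, 57)   # 640 mimics an entry error
)

# Numerical summaries and a histogram for the variables of interest.
print(summary(maths[, c("marks_before", "marks_after", "omega3")]))
hist(maths$marks_after, main = "marks_after", xlab = "Marks")

# ASSUMPTION: valid marks lie between 0 and 100; rows outside are dropped.
valid <- with(maths, marks_before >= 0 & marks_before <= 100 &
                     marks_after  >= 0 & marks_after  <= 100)
clean <- maths[valid, ]

# Scores before treatment, compared between the sexes with a boxplot.
boxplot(marks_before ~ sex, data = clean,
        names = c("Male", "Female"), ylab = "marks_before")

# Scatterplot and empirical (Pearson) correlation of before vs after scores.
plot(clean$marks_before, clean$marks_after,
     xlab = "marks_before", ylab = "marks_after")
r_before_after <- cor(clean$marks_before, clean$marks_after)
print(r_before_after)

# New variable (hypothetical name): change in score, plotted against omega3.
clean$marks_diff <- clean$marks_after - clean$marks_before
plot(clean$omega3, clean$marks_diff, xlab = "omega3", ylab = "marks_diff")

# Cross-tabulation of school against sex.
school_by_sex <- table(school = clean$school, sex = clean$sex)
print(school_by_sex)

# Daily omega 3 consumption compared across schools with a boxplot.
boxplot(omega3 ~ school, data = clean, xlab = "school", ylab = "omega3")
```

Filtering into a separate data frame (`clean`) rather than overwriting `maths` keeps the original rows available, which makes it easier to report which observations were excluded and why.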
