
Chapter 13 Quick Overview of Correlation and Linear Regression

When comparing two population means using a t-test in Chapter 10, if we conclude that
there is a difference in means, we are also concluding that there is a relationship between
Y (or X), a quantitative variable, and the populations, a qualitative variable. For
example, if we test whether the average number of hours studied differs between Business
and Engineering students, we would also be testing whether there is a relationship
between hours studied and a student's major (and if we reject H0, we would be
concluding that there is such a relationship). In this example, hours studied is
quantitative and a student's major is qualitative. So testing differences in means is
equivalent to testing for a relationship between a quantitative and a qualitative variable.
When testing whether there is a difference in two population proportions of successes, we are
also testing whether there is a relationship between two qualitative variables. For
example, if we test whether the proportion of students taking business differs at two
different universities, we would also be testing whether there is a relationship between
two qualitative variables, business major and university (and if we reject H0, we would
be concluding that there is such a relationship). What's left is to look at the relationship
between two quantitative variables.
In this chapter, we are interested in studying the linear relationship between two
quantitative variables (e.g. between hours studied and marks on an exam). To study such
a relationship, we either use correlation analysis or simple linear regression.
Correlation Analysis
We use correlation analysis when we believe that there is a linear relationship between
two quantitative random variables but we do not want to assume that one of these
variables is affected or caused by the other variable. (For example, we believe that there
is a linear relationship between the number of HD TVs sold and the number of cell
phones sold but we do not want to assume that the number of HD TVs sold is affected by
the number of cell phones sold, or vice versa.) And, we use correlation analysis when we
want to estimate the strength of this linear relationship and possibly test whether a linear
relationship (or a positive linear relationship or a negative linear relationship) actually
exists by taking a sample from the appropriate population. For example, suppose we
gathered data on the number of each product sold at six outlets of a large electronics
chain during one weekend in September and the following data resulted:
Store   HD TVs   cell phones
A       12       27
B       19       37
C       36       65
D       16       24
E       22       39
F       27       42
Here, we will arbitrarily use the variable X to represent the number of HD TVs sold and
Y to represent the number of cell phones sold.

But, before we measure the strength of this linear relationship, we first should determine
whether a linear relationship is the appropriate relationship. We can use a scatter diagram
to see whether or not a linear relationship is a reasonable assumption to make. The
scatter diagram for this example would look like:

[Scatter diagram: cell phones sold (Y) versus HD TVs sold (X) for the six stores]

From this scatter diagram, we can see that a linear relationship is reasonable in this case
and, although the relationship is not a perfect linear relationship, common sense would
tell us that the linear relationship is strong. But, what do we mean by strong?
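
(As an aside, if you want to draw this scatter diagram yourself, here is a minimal Python sketch using matplotlib. The variable names are my own, not part of the chapter.)

```python
import matplotlib.pyplot as plt

hd_tvs = [12, 19, 36, 16, 22, 27]       # X: HD TVs sold at stores A-F
cell_phones = [27, 37, 65, 24, 39, 42]  # Y: cell phones sold

plt.scatter(hd_tvs, cell_phones)
plt.xlabel("HD TVs sold (X)")
plt.ylabel("Cell phones sold (Y)")
plt.title("Scatter diagram of sales at six stores")
plt.show()
```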
We measure strength by calculating a correlation coefficient. If we had the entire
population of data, the symbol for this coefficient is ρ. If we only have sample data, as is
the case here, the coefficient is symbolized by r, where r is the best estimator of ρ if X and
Y have a bivariate normal distribution. One way to express the formulas for these
coefficients is:
$$\rho = \frac{\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{N}(x_i - \mu_x)^2 \sum_{i=1}^{N}(y_i - \mu_y)^2}} \qquad r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
To simplify the formula for r and other formulas to follow, we will use the symbols SSxx,
SSyy and SSxy (where SS stands for sum of squares) where
$$SS_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad SS_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 \qquad SS_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

These symbols simplify r to read

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$$

No matter whether we have the entire population data set or just a sample data set, the
following applies:

$$-1 \le \rho \le 1 \qquad \text{or} \qquad -1 \le r \le 1$$

- A coefficient of -1 indicates a perfect negative linear relationship between X and Y
- A coefficient of +1 indicates a perfect positive linear relationship between X and Y
- A coefficient of 0 indicates no linear relationship between X and Y
- The closer the coefficient is to -1 or +1, the stronger the linear relationship

In our example, the relationship is positive (i.e., X and Y tend to move in the same
direction) and it appears to be very strong, although not perfect. So, when we calculate r,
as we will do below, there should be no surprise if r's value is fairly high.
To calculate r, we perform the following calculations by setting up the following table
(note: we first must calculate $\bar{x}$ and $\bar{y}$ before completing most of the calculations):

$$\bar{x} = \frac{\sum x_i}{n} = \frac{132}{6} = 22 \qquad \bar{y} = \frac{\sum y_i}{n} = \frac{234}{6} = 39$$

Store    x     y     xi - x̄   yi - ȳ   (xi - x̄)²   (yi - ȳ)²   (xi - x̄)(yi - ȳ)
___________________________________________________________________________________
A        12    27    -10       -12       100          144          120
B        19    37    -3        -2        9            4            6
C        36    65    14        26        196          676          364
D        16    24    -6        -15       36           225          90
E        22    39    0         0         0            0            0
F        27    42    5         3         25           9            15
___________________________________________________________________________________
totals   132   234                       366          1058         595

Using this table, we see that:

$$SS_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = 366 \qquad SS_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = 1058 \qquad SS_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = 595$$

and,

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{595}{\sqrt{366 \times 1058}} = .9562$$

As expected, the correlation coefficient is high and positive, indicating a strong
positive linear relationship between the number of HD TVs sold and the number of cell
phones sold in our sample.
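
For readers who want to check these hand calculations with software, here is a minimal Python sketch that mirrors the SS formulas above (the variable names are my own):

```python
import math

x = [12, 19, 36, 16, 22, 27]  # HD TVs sold
y = [27, 37, 65, 24, 39, 42]  # cell phones sold
n = len(x)

x_bar = sum(x) / n  # 22
y_bar = sum(y) / n  # 39

# Sums of squares, mirroring SSxx, SSyy and SSxy
ss_xx = sum((xi - x_bar) ** 2 for xi in x)                        # 366
ss_yy = sum((yi - y_bar) ** 2 for yi in y)                        # 1058
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 595

r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(round(r, 4))  # 0.9562
```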
The value of r is the best estimate of ρ. Previously, when estimating, we also calculated
the confidence interval for the parameter of interest using the relationship between the
best estimator and the parameter. The relationship between r and ρ is not that simple
unless ρ is assumed to equal zero, and thus we will not attempt to calculate this interval.
Instead, we will proceed to test hypotheses concerning ρ. The obvious question arises:
if the relationship is not that simple, how can we use r to test ρ?
It so happens that if $H_0: \rho = 0$, the relationship, if H0 is assumed true, is quite simple. In
fact, the relationship can be expressed as:
$$t_{calc} = r\sqrt{\frac{n-2}{1-r^2}}$$

where $t_{calc}$ has a t-distribution with $n - 2$ degrees of freedom.

Another obvious question arises: how can this be a relationship between r and ρ when ρ
is not part of this expression? It is part of this expression, but since its value is 0, you
can't see it.
Using the same reasoning as before, we can set up the following hypothesis tests for ρ:
Test statistic: $t_{calc} = r\sqrt{\dfrac{n-2}{1-r^2}}$

If $H_0: \rho \ge 0$ vs. $H_1: \rho < 0$, then $R: t_{calc} < -t_{\alpha}$ (test for a negative linear relationship)

If $H_0: \rho = 0$ vs. $H_1: \rho \ne 0$, then $R: t_{calc} < -t_{\alpha/2}$ or $t_{calc} > t_{\alpha/2}$ (test for a linear relationship)

If $H_0: \rho \le 0$ vs. $H_1: \rho > 0$, then $R: t_{calc} > t_{\alpha}$ (test for a positive linear relationship)

Continuing our example, at α = .05, is there sufficient evidence to indicate that there is a
positive linear relationship between the number of HD TVs sold and the number of cell
phones sold?

$H_0: \rho \le 0$
$H_1: \rho > 0$
$\alpha = .05$

Decision rule: with $n - 2 = 4$ degrees of freedom, $R: t_{calc} > t_{.05} = 2.132$

Test statistic:

$$t_{calc} = r\sqrt{\frac{n-2}{1-r^2}} = .9562\sqrt{\frac{6-2}{1-.9562^2}} = 6.533$$

As 6.533 > 2.132, we reject H0 at α = .05.

There is a positive linear relationship between the number of HD TVs sold and the
number of cell phones sold.
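
As a software check on this test, here is a short Python sketch; scipy.stats.t.ppf gives the critical value (this assumes SciPy is available and reuses r from the sketch above):

```python
import math
from scipy import stats

r, n = 0.9562, 6  # correlation and sample size from the example

t_calc = r * math.sqrt((n - 2) / (1 - r ** 2))  # about 6.53 (6.533 above, with rounding)
t_crit = stats.t.ppf(0.95, df=n - 2)            # 2.132 for alpha = .05, 4 df

print(t_calc > t_crit)  # True, so we reject H0
```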
In summary, correlation coefficients measure the strength of the linear relationship
between two quantitative variables when we are not willing to assume that one variable
is affected by the other, i.e., that there is a cause-and-effect relationship between the two
variables.
If we want to assume a cause-effect relationship between two quantitative variables that
is linear, we use simple linear regression.
Simple Linear Regression
With simple linear regression, we believe that the random variable Y (the dependent
variable) is affected by or depends on the variable X (the independent variable, which is
not random) and that this true relationship takes the form of an imperfect linear
relationship which can be expressed by the equation:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where, if you remember your high school math,
$y = \beta_0 + \beta_1 x$ is the equation of a straight line with β0 being the y-intercept and β1
being the slope.
The term ε represents the fact that the relationship is not perfect and that y randomly
fluctuates around this line.
While the objective of correlation analysis was to estimate the true strength of the linear
relationship and possibly to test hypotheses concerning this strength, the objective of
regression analysis goes beyond that. In this course, we will use simple linear regression to:

- find the best estimate of the true linear relationship
- measure the strength of the linear relationship in the sample
- make inferences concerning β1 and possibly β0
- predict an individual's value of Y for a given value of X
- estimate the mean of Y for a given value of X
Finding the best estimate of the true linear relationship must be accomplished before we
can proceed with any of the other objectives.
We can express the estimated relationship by the straight line:

$$\hat{y}_i = b_0 + b_1 x_i$$

or, by the equation:

$$y_i = b_0 + b_1 x_i + e_i$$

where $e_i$ is the difference between the actual value of y and the predicted value of y, or
$e_i = y_i - \hat{y}_i$ ($e_i$ is usually referred to as the residual).

an example
Many companies try to improve their sales by sending out the same emails to individuals'
email addresses, some sending out several per day, day after day after day. But, does this
strategy actually increase sales? A study was undertaken in which 8 email addresses were
randomly selected from a list of previous customers' emails, and each address was sent a
specific number of emails per day over a one-month period. The dollar sales which resulted
are recorded below:
Customer   Daily emails   $ Sales
1          2              70
2          4              30
3          6              80
4          8              20
5          10             110
6          12             100
7          14             54
8          16             120

Before we proceed to use regression analysis to estimate the true linear relationship, we
first have to determine which of the two variables is X and which is Y. We have to ask
ourselves which variable depends on the other. Common sense would tell us that sales
depends on the number of emails sent, resulting in
Y = $ Sales and X = number of emails sent

Once we have decided on X and Y, we create a scatter diagram to see whether a linear
relationship seems to be a reasonable assumption. The scatter diagram for this example
is:

[Scatter diagram: $ sales (Y) versus daily emails (X) for the eight customers]

Although the relationship seems rather weak, there is no reason to believe that the
relationship is not linear. Therefore, we will assume the relationship is linear and proceed
to find the best estimate of this relationship. But, before we do so, we must first make the
following assumptions:

- a linear relationship is the appropriate relationship
- the errors, or ε's, are normally distributed with a mean of zero
- the errors have a constant variance
- the errors are independent of one another

If these assumptions are valid, the method of ordinary least squares or the method that
minimizes
$$\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \qquad \text{or} \qquad \sum_{i=1}^{n} e_i^2$$

will give us the best estimate of the true linear relationship. Using this method, we end
up with the following formulas:
$$b_1 = \frac{SS_{xy}}{SS_{xx}} \qquad \text{and} \qquad b_0 = \bar{y} - b_1\bar{x}$$
Where SSxy and SSxx have the same formulas as before or


$$SS_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \qquad \text{and} \qquad SS_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$$

To determine their values, and SSyy which we will need later on, we first create the
following table (calculating $\bar{x}$ and $\bar{y}$ before being able to complete the table). Using the
formula for sample means, we easily determine that:

$$\bar{x} = \frac{72}{8} = 9 \qquad \text{and} \qquad \bar{y} = \frac{584}{8} = 73$$
Customer   x    y     xi - x̄   yi - ȳ   (xi - x̄)²   (yi - ȳ)²   (xi - x̄)(yi - ȳ)
_____________________________________________________________________________________
1          2    70    -7        -3        49           9            21
2          4    30    -5        -43       25           1849         215
3          6    80    -3        7         9            49           -21
4          8    20    -1        -53       1            2809         53
5          10   110   1         37        1            1369         37
6          12   100   3         27        9            729          81
7          14   54    5         -19       25           361          -95
8          16   120   7         47        49           2209         329
_____________________________________________________________________________________
totals     72   584                       168          9384         620

Using this table:

$$SS_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = 620 \qquad SS_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = 168$$

and

$$b_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{620}{168} = 3.6905 \qquad b_0 = \bar{y} - b_1\bar{x} = 73 - 3.6905(9) = 39.7855$$

or,

$$\hat{y}_i = 39.7855 + 3.6905\, x_i$$
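
A minimal Python sketch of these least-squares calculations (again, the names are my own, not part of the chapter):

```python
x = [2, 4, 6, 8, 10, 12, 14, 16]         # daily emails
y = [70, 30, 80, 20, 110, 100, 54, 120]  # $ sales
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n  # 9 and 73

ss_xx = sum((xi - x_bar) ** 2 for xi in x)                        # 168
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 620

b1 = ss_xy / ss_xx       # 3.6905
b0 = y_bar - b1 * x_bar  # 39.7857 (the text's 39.7855 comes from rounding b1 first)
print(f"y_hat = {b0:.4f} + {b1:.4f} x")
```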
(If we were just interested in a mathematical interpretation of this line, our interpretation
would be that the y-intercept (i.e., the value of ŷ when x = 0) is 39.7855 and that ŷ
increases by 3.6905 for each unit increase in x.)
At this point, we could draw this line on the scatter diagram. To do so manually, we need
to find two points along this line and use a ruler to connect them. One of these
points could be the y-intercept, and another could be the point $(\bar{x}, \bar{y})$, because the
regression line always passes through this point. Using Excel, the scatter diagram would
now look like:

[Scatter diagram: $ sales versus daily emails with the fitted regression line drawn from x = 2 to x = 16]

You will notice that the regression line drawn by Excel (which can also be added to the
scatter diagram on request) does not go as far as the y-axis. It starts at x = 2 and stops at
x = 16, which happen to be the lowest and highest values of X in our sample. There is a
logical reason for this.
We drew a scatter diagram to see whether or not a linear relationship seems reasonable.
We can only make this observation over the range of X in our sample. So, in this case,
we observe that the relationship seems reasonable when X is somewhere between a value
of 2 and a value of 16. Outside this range of values, we have no information to either
support or contradict this linearity assumption. Therefore, to be safe, any observations
we make or conclusions we reach should apply only over the range of the x's in our
sample. So, at this point in our analysis, we can say that there appears to be a fairly weak
positive linear relationship between sales and the number of emails when the number of
emails ranges between 2 and 16. But, how strong does the linear relationship appear to
be in our sample? To answer this question, we calculate the coefficient of determination,
symbolized by R².
Coefficient of Determination
To measure the strength of the linear relationship between X and Y, we do something
similar to what we did in ANOVA. We observe that the values of Y are not all the same,
and we try to explain what causes these values to vary. Here, we identify two such causes:
Y varies because it is linearly related to X (and X varies), and Y varies for reasons which
we cannot explain (i.e., the residuals or errors). As with ANOVA, we can create the
following expression:

Total variation in Y = variation in Y due to its linear relationship with X + unexplained variation

or,

SST = SSR + SSE
where,
$$SST = \sum_{i=1}^{n}(y_i - \bar{y})^2 \qquad \text{(same formula as } SS_{yy} \text{ and similar to ANOVA's SST)}$$

$$SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \qquad \text{and} \qquad SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$


Looking at these SSs:

- if all values of Y were equal, SST, SSR and SSE would all equal zero
- if there were no linear relationship (the weakest type of linear relationship) between X and
  Y in our sample (i.e., b1 = 0), SSR would equal zero and SST = SSE
- if there were a perfect linear relationship (the strongest type of linear relationship)
  between X and Y in our sample, SSE would equal zero and SST = SSR

So, we want R² to reflect some measure of strength. We do this using the relationship

$$R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST}$$

By using this relationship, we observe that:

- $0 \le R^2 \le 1$
- R² = 0 if there is no linear relationship in our sample
- R² = 1 if there is a perfect linear relationship in our sample
- The closer R² is to 1, the stronger the linear relationship

In addition to making these observations, we can define R² as the proportion of the total
variation in Y in our sample which can be explained by its linear relationship to X.
At this point, we really don't want to calculate SSR and SSE if we can avoid it. Luckily
for us, R² equals r², or

$$R^2 = \frac{SS_{xy}^2}{SS_{xx}\,SS_{yy}}$$
Since we have already calculated SSxy (SSxy = 620) and SSxx (SSxx = 168), the only
calculation that is missing is SSyy, and

$$SS_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = 9384$$

With these three SSs,

$$R^2 = \frac{SS_{xy}^2}{SS_{xx}\,SS_{yy}} = \frac{620^2}{(168)(9384)} = .2438$$
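
In the Python sketch, this is one extra line (reusing SSxy = 620 and SSxx = 168 from before):

```python
y = [70, 30, 80, 20, 110, 100, 54, 120]
y_bar = sum(y) / len(y)  # 73

ss_yy = sum((yi - y_bar) ** 2 for yi in y)  # 9384
ss_xy, ss_xx = 620, 168                     # from the earlier sketch

r_squared = ss_xy ** 2 / (ss_xx * ss_yy)
print(round(r_squared, 4))  # 0.2438
```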

Or, 24.38% of the variation in sales in our sample can be explained by its linear
relationship to the number of emails received. Since R² is relatively small, our initial
observation that the relationship was weak is supported.
Simply because the relationship in our sample is weak does not rule out the possibility
that there is still a linear relationship between X and Y in the population. Conversely, if the
relationship in our sample were strong, it would not rule out the possibility that there is
no linear relationship between X and Y in the population. To make inferences about the
population's linear relationship, we usually make inferences concerning only β1, since β1
tells us whether or not a linear relationship exists in the population.
Inferences concerning β1
Common sense should tell us that b1 is the best estimator of β1, and to use this estimator to
make these inferences we need to know its relationship with β1. If our previously stated
assumptions for regression analysis are valid, the relationship between b1 and β1
can be expressed as:

$$t = \frac{b_1 - \beta_1}{s_{b_1}}$$

where t has a t-distribution with $n - 2$ degrees of freedom.


This formula seems simple enough, except that calculating $s_{b_1}$ is somewhat complicated.
Here are the formulas that we need (with one of them being SSE):

$$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \qquad s_{yx} = \sqrt{\frac{SSE}{n-2}} \qquad s_{b_1} = \frac{s_{yx}}{\sqrt{SS_{xx}}}$$
We already have SSxx (SSxx = 168), so the only real issue is calculating SSE. Using the
formula, we would have to calculate $\hat{y}_i$ for each of the observations in our sample, which
could be rather tedious. So, let's see if there is a shortcut we can use.
$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{SSE}{SS_{yy}}$$

or,

$$SSE = SS_{yy}(1 - R^2) = SS_{yy}\left(1 - \frac{SS_{xy}^2}{SS_{xx}\,SS_{yy}}\right) = SS_{yy} - \frac{SS_{xy}^2}{SS_{xx}}$$

where all SSs have been previously calculated.

In our example,

$$SSE = SS_{yy} - \frac{SS_{xy}^2}{SS_{xx}} = 9384 - \frac{620^2}{168} = 7095.905$$

$$s_{yx} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{7095.905}{6}} = 34.3897$$

$$s_{b_1} = \frac{s_{yx}}{\sqrt{SS_{xx}}} = \frac{34.3897}{\sqrt{168}} = 2.653$$

Note: We could have used $SSE = SS_{yy}(1 - R^2)$ to calculate SSE, but there is a greater
chance of rounding errors and more chance of making a mistake (if we made a mistake
calculating R²).
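
These three quantities are equally easy to check in a short Python sketch (plugging in the SS values computed earlier):

```python
import math

ss_yy, ss_xy, ss_xx, n = 9384, 620, 168, 8  # from the earlier calculations

sse = ss_yy - ss_xy ** 2 / ss_xx  # 7095.905
s_yx = math.sqrt(sse / (n - 2))   # 34.3897
s_b1 = s_yx / math.sqrt(ss_xx)    # 2.653
print(round(sse, 3), round(s_yx, 4), round(s_b1, 3))
```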

Knowing the relationship between b1 and β1 and knowing how to do the preliminary
calculations, we can use the same reasoning as before and the same mathematical
manipulations to either estimate β1 or test hypotheses concerning β1.
Estimating β1

- The best estimate of β1 is the value of b1 calculated from our sample.
- The confidence interval for β1 is determined as follows:

$$P(-t_{\alpha/2} \le t \le t_{\alpha/2}) = 1-\alpha \quad\Rightarrow\quad P\left(-t_{\alpha/2} \le \frac{b_1 - \beta_1}{s_{b_1}} \le t_{\alpha/2}\right) = 1-\alpha$$

Manipulating to isolate β1, we end up with the confidence interval

$$b_1 \pm t_{\alpha/2}\, s_{b_1}$$

(which should be no surprise!)

Hypothesis testing of β1
As with all other test statistics, the test statistic for β1 is simply the relationship between
b1 and β1 assuming H0 is true. So, our test statistic for testing β1 if $H_0: \beta_1 = \beta_1^0$ is

$$t_{calc} = \frac{b_1 - \beta_1^0}{s_{b_1}}$$

with $n - 2$ degrees of freedom.

And, depending on H1, the rejection regions would look like the following:

If $H_0: \beta_1 \ge \beta_1^0$ vs. $H_1: \beta_1 < \beta_1^0$, then $R: t_{calc} < -t_{\alpha}$

If $H_0: \beta_1 = \beta_1^0$ vs. $H_1: \beta_1 \ne \beta_1^0$, then $R: t_{calc} < -t_{\alpha/2}$ or $t_{calc} > t_{\alpha/2}$

If $H_0: \beta_1 \le \beta_1^0$ vs. $H_1: \beta_1 > \beta_1^0$, then $R: t_{calc} > t_{\alpha}$

If the value of $\beta_1^0 = 0$, which is the most common value for testing β1, we would be
testing for:

- a negative linear relationship if $H_1: \beta_1 < 0$
- a linear relationship if $H_1: \beta_1 \ne 0$
- a positive linear relationship if $H_1: \beta_1 > 0$

In our example, let's calculate the 95% confidence interval for β1 and then test at α = .05
whether there is sufficient evidence to indicate that the more emails received, the greater the
sales.
estimating β1
In our example, $b_1 = 3.6905$, $s_{b_1} = 2.653$ and, with 6 degrees of freedom, $t_{\alpha/2} = 2.447$

$$b_1 \pm t_{\alpha/2}\, s_{b_1} = 3.6905 \pm 2.447(2.653) = 3.6905 \pm 6.4919 = (-2.8014,\ 10.1824)$$

We are 95% confident that sales will change between -$2.8014 and $10.1824 for each
additional email, with the best estimate being a change of $3.6905, as long as the number
of emails ranges between 2 and 16.
Note: This wide confidence interval doesn't give the company much information about
how the number of emails affects sales.

Hypothesis testing of β1
Here we are testing for a positive linear relationship, resulting in:

$H_0: \beta_1 \le 0$
$H_1: \beta_1 > 0$
$\alpha = .05$

Decision rule: with $n - 2 = 6$ degrees of freedom, $R: t_{calc} > t_{.05} = 1.943$

Test statistic:

$$t_{calc} = \frac{b_1 - \beta_1^0}{s_{b_1}} = \frac{3.6905 - 0}{2.653} = 1.391$$

As 1.391 < 1.943, we don't reject H0 at α = .05.

There is insufficient evidence that increasing the number of emails increases sales
when the number of emails ranges from 2 to 16.
Based on this test, the company may decide to limit the number of emails it sends its
customers, especially if there is a cost associated with sending each email.
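
If SciPy is available, scipy.stats.linregress reproduces the slope and its standard error, and hence both the confidence interval and the t-test, in a few lines (a sketch for checking the hand work, not the chapter's required method):

```python
from scipy import stats

x = [2, 4, 6, 8, 10, 12, 14, 16]
y = [70, 30, 80, 20, 110, 100, 54, 120]

res = stats.linregress(x, y)  # least-squares fit; res.stderr is s_b1
df = len(x) - 2

t_calc = res.slope / res.stderr             # 1.391
t_crit = stats.t.ppf(0.95, df)              # 1.943 (one-tail, alpha = .05)
half = stats.t.ppf(0.975, df) * res.stderr  # half-width of the 95% CI

print(t_calc > t_crit)                     # False: don't reject H0
print(res.slope - half, res.slope + half)  # about (-2.80, 10.18)
```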
We still have two more things we want to do to complete our analysis but, before we do
so, here is the basic output that we would get when we run regression analysis using
Excel:
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.4938
R Square            0.2438
Adjusted R Square   0.1178
Standard Error      34.3897
Observations        8

ANOVA
             df   SS          MS          F        Significance F
Regression   1    2288.0952   2288.0952   1.9347   0.2136
Residual     6    7095.9048   1182.6508
Total        7    9384

                Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept       39.7857        26.7962          1.4848   0.1881    -25.7823    105.3537
no. of emails   3.6905         2.6532           1.3909   0.2136    -2.8017     10.1827

You can notice some of the values (give or take some rounding errors) we calculated
using the appropriate formulas:

b1 = 3.6905          (Coefficients, no. of emails)
b0 = 39.7857         (Coefficients, Intercept)
R² = .2438           (R Square)
syx = 34.3897        (Standard Error)
SSE = 7095.9048      (SS, Residual; SSR = 2288.0952 and SST = 9384 are also given)
[-2.8017, 10.1827]   (Lower 95% and Upper 95%, no. of emails)
sb1 = 2.6532         (Standard Error, no. of emails)
t = 1.3909           (t Stat for no. of emails if $H_1: \beta_1 \ne 0$)

Notes:

- The output also gives us the p-value for no. of emails (0.2136), but Excel
  calculates a two-tail p-value. So, the p-value for this one-tail test is half the
  quoted P-value, or p-value = .1068. (It is half of the quoted value only when the
  sign of b1 or t supports H1 being true; i.e., since we are testing for a positive linear
  relationship, we would expect b1 or t to be positive, and it is positive in this case.
  If the sign is not what we would expect if H1 were true, the p-value would be
  greater than .5.)
- The quoted t Stat and the accompanying p-value only apply when $H_0: \beta_1 = 0$.
  For values other than 0, the quoted t Stat and p-value do not apply.
- The output also gives us an ANOVA table, similar to the ANOVA table from the
  previous chapter. As before, the table determines the total variation in Y and
  partitions it into the variation in Y which can be explained (the Regression part)
  and the variation which cannot be explained (the Residual part). Working through
  this table, we end up with a value of F (1.9347) and a p-value or Significance F
  (0.2136). You will note that Significance F = 0.2136, which is the same value as
  the P-value for no. of emails. This is not a coincidence. This ANOVA table
  (through the F-statistic) allows us to test whether the regression model is useful, or
  whether the independent variables in the model have an effect on Y. That is, it
  allows us to test:

  $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$   (model is useless)
  $H_1:$ not all of the above $\beta_j$'s $= 0$   (model is useful)

  Here, since there is only one independent variable, the model being useful would
  mean that the no. of emails affects sales, or $H_1: \beta_1 \ne 0$. If we tested this
  hypothesis using our t-test, we would still end up with a test statistic of
  1.3909, and the p-value of this test would be 0.2136. In fact, if we were to square
  the value of t, we would end up with the value of F (give or take some rounding
  error) in the ANOVA table. With only one independent variable, the ANOVA
  table really doesn't give us much information. Its use will be more evident when
  we have regression models with more than one independent variable.
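
For comparison, the statsmodels library (assuming it is installed) produces a summary very similar to Excel's output, including the ANOVA F and the coefficient table:

```python
import statsmodels.api as sm

x = [2, 4, 6, 8, 10, 12, 14, 16]
y = [70, 30, 80, 20, 110, 100, 54, 120]

X = sm.add_constant(x)      # adds the intercept column
model = sm.OLS(y, X).fit()  # ordinary least squares
print(model.summary())      # R-squared 0.244, F 1.935, slope 3.6905, etc.
```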

What we have left in our analysis is:

- predicting a value of Y for a given value of X (e.g., if we send a customer 6 emails,
  what would we predict the sales to be?)
- estimating the mean of Y for a given value of X (e.g., what would be the average sales
  if all customers received 6 emails?)

In both cases, the best estimate would be:

$$\hat{y}_i = b_0 + b_1 x_i \qquad \text{where } x_i \text{ is the value of X of interest}$$

Without deriving the confidence interval formulas, once we have this best estimate, the
confidence intervals become:

predicting a value of Y:

$$\hat{y}_i \pm t_{\alpha/2}\, s_{yx} \sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_{xx}}}$$

estimating the mean of Y:

$$\hat{y}_i \pm t_{\alpha/2}\, s_{yx} \sqrt{\frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_{xx}}}$$

where, in both cases, t has $n - 2$ degrees of freedom.

Continuing with our example, we want to predict, with 95% confidence, sales for an
individual who receives 6 emails daily.

$$\hat{y}_i = b_0 + b_1 x_i = 39.7855 + 3.6905(6) = 61.93$$

$$\hat{y}_i \pm t_{\alpha/2}\, s_{yx}\sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_{xx}}} = 61.93 \pm 2.447(34.3897)\sqrt{1 + \frac{1}{8} + \frac{(6-9)^2}{168}} = 61.93 \pm 91.36$$

$$(-29.43,\ 153.29)$$

We are 95% confident that the sales for an individual who receives 6 emails daily
will lie between -$29.43 and $153.29, with the best estimate being $61.93.
Note: It is obvious that sales cannot be negative, so we can quote the lower limit for sales
as $0.00 without being criticized for doing so.

We also want to estimate, with 95% confidence, mean sales for all individuals who
receive 6 emails daily.

$$\hat{y}_i \pm t_{\alpha/2}\, s_{yx}\sqrt{\frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_{xx}}} = 61.93 \pm 2.447(34.3897)\sqrt{\frac{1}{8} + \frac{(6-9)^2}{168}} = 61.93 \pm 35.56$$

$$(26.37,\ 97.49)$$

We are 95% confident that mean sales for all individuals who receive 6 emails daily
will lie between $26.37 and $97.49, with the best estimate being $61.93.
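
Both intervals can be sketched in Python from the quantities already computed (the formulas mirror the two displayed above):

```python
import math
from scipy import stats

b0, b1 = 39.7857, 3.6905      # fitted coefficients
s_yx, ss_xx = 34.3897, 168.0  # from the earlier calculations
n, x_bar = 8, 9.0

x_new = 6
y_hat = b0 + b1 * x_new                # about 61.93
t_crit = stats.t.ppf(0.975, df=n - 2)  # 2.447

common = 1 / n + (x_new - x_bar) ** 2 / ss_xx
pred_half = t_crit * s_yx * math.sqrt(1 + common)  # about 91.36, for an individual Y
mean_half = t_crit * s_yx * math.sqrt(common)      # about 35.56, for the mean of Y

print(f"individual Y: ({y_hat - pred_half:.2f}, {y_hat + pred_half:.2f})")
print(f"mean of Y:    ({y_hat - mean_half:.2f}, {y_hat + mean_half:.2f})")
```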
Observations:

- Both of the above confidence intervals are quite wide, making them of little use in any
  decision making by the company.
- Although both confidence intervals are wide, the confidence interval for the mean is
  narrower than the confidence interval for an individual Y.
- The widths of both confidence intervals would be smaller if:
  - we lowered the level of confidence
  - syx were smaller (i.e., the values of Y fit the regression line better)
  - we took a larger sample
  - xi were closer to x̄
  - SSxx were larger (i.e., the values of X in our sample were more spread out)

Note: As with previous analyses, we should only predict Y or estimate the mean
value of Y for values of X within the range of the x's in our sample.
When we began to discuss regression analysis, we made some assumptions about the true
linear relationship (a linear relationship is the appropriate relationship; ε is normally
distributed with a constant variance; the ε's are independent of one another). A scatter
diagram, with the regression line added to it, would give us some indication of whether a
linear relationship is appropriate, whether the ei's have no pattern to them (indicating
that the ε's are independent) and whether the ei's are equally spread out no matter the
value of X (indicating constant variance). A histogram of the ei's would give us some
indication as to whether ε is normally distributed or not. There are formal tests for
some of these assumptions, but they are beyond an introductory course in statistics.
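
A quick residual check along these lines can be sketched in Python (reusing the fitted coefficients; the plots are a suggestion for this informal check, not a required part of the chapter):

```python
import matplotlib.pyplot as plt

x = [2, 4, 6, 8, 10, 12, 14, 16]
y = [70, 30, 80, 20, 110, 100, 54, 120]
b0, b1 = 39.7857, 3.6905

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Residuals vs X: look for no pattern and roughly equal spread
plt.scatter(x, residuals)
plt.axhline(0)
plt.xlabel("daily emails (X)")
plt.ylabel("residual e_i")
plt.show()

# Histogram of residuals: look for rough normality
plt.hist(residuals, bins=5)
plt.show()
```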
What's next?
Chapter 14 will allow us to incorporate more than one independent variable in a
regression model, allowing us to create models which are more appropriate in the real
world and allowing us to expand on the testing capabilities of regression analysis.
