
Lecture 04

Regression & Correlation

Copyright 2017, A. Marshall

Bivariate Statistics
>So far, we have been dealing with
statistics of individual variables
>Now we will look at statistics that
describe the relationship between
pairs of variables


Interactions
Sometimes two variables appear
related:
>smoking and lung cancers
>height and weight
>years of education and income
>engine size and gas mileage
>GMAT scores and MBA GPA
>house size and price

Interactions
>Some of these variables would appear
to be positively related & others
negatively
>If these were related, we would
expect to be able to derive a linear
relationship:

y = a + bx
>where b is the slope and a is the intercept


Linear Relationships
>We will be deriving linear relationships
from bivariate (two-variable) data
>Our symbols for the true regression
relationship will be:

yᵢ = β₀ + β₁xᵢ + εᵢ   or   ŷᵢ = β₀ + β₁xᵢ
>where β₁ is the slope, β₀ is the intercept, and εᵢ is the error term

Notation
yᵢ : Individual observations of the dependent variable, typically not on the line

ŷᵢ : The predicted values of the dependent variable, always on the line

eᵢ = yᵢ − ŷᵢ : Individual error terms, the difference between the individual observations and the values predicted by the linear equation

Understanding the Two Forms


yᵢ = β₀ + β₁xᵢ + εᵢ   or   ŷᵢ = β₀ + β₁xᵢ
>The first relationship describes the line in
terms of the actual data points, none of which
will be on the line, hence the error term
>The second relationship describes the line
itself, based on the predicted values of y
given x, hence no error term
>When using the equation of the line
empirically, the error term is not actually used

Estimating a Line
>The symbols for the estimated
linear relationship are:

ŷ = b₀ + b₁x
>b₁ is our estimate of the slope, β₁
>b₀ is our estimate of the intercept, β₀


Example
>Consider the following example
comparing the returns of Consolidated
Moose Pasture stock (CMP) and the
TSX 300 Index
>The next slide shows 25 monthly
returns


Example Data


Example
>From the data, it appears that a
positive relationship may exist
Most of the time when the TSX is up, CMP
is up
Likewise, when the TSX is down, CMP is
down most of the time
Sometimes, they move in opposite
directions

>Let's graph this data



Graph Of Data


Example Summary Statistics


> The data do appear to be positively related
> Let's derive some summary statistics about
these data:


Observations
>Both have means of zero and standard
deviations just under 3
The data were deliberately created to have
means of zero for both variables to
demonstrate the deviations on the next
three slides

>However, each data point does not have
simply one deviation from the mean; it
deviates from both means
>Consider Points A, B, C and D on the next
graph

Graph of Data


Implications
>When points in the upper right and
lower left quadrants dominate, then
the sums of the products of the
deviations will be positive
>When points in the lower right and
upper left quadrants dominate, then
the sums of the products of the
deviations will be negative


An Important Observation
>The sums of the products of the
deviations will give us the appropriate
sign of the slope of our relationship


Covariance
Population:

COV(X, Y) = σ_XY = [ Σᵢ (xᵢ − μ_X)(yᵢ − μ_Y) ] / N

Sample:

cov(X, Y) = s_XY = [ Σᵢ (xᵢ − x̄)(yᵢ − ȳ) ] / (n − 1)
           = [ Σᵢ xᵢyᵢ − (Σxᵢ)(Σyᵢ)/n ] / (n − 1)

Note the notation!

Covariance
>Excel:
=COVARIANCE.P(range1,range2)
Population Covariance

=COVARIANCE.S(range1,range2)
Sample Covariance

>Excel 2007 and earlier:


=COVAR(range1,range2)
Returns POPULATION covariance
Need to multiply by n/(n-1) to convert to
sample covariance
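As a cross-check on these Excel functions, both covariance formulas can be sketched in plain Python (function names here are our own, not from any library):

```python
def cov_population(xs, ys):
    """Population covariance: mean product of deviations (divide by N),
    matching =COVARIANCE.P / =COVAR."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def cov_sample(xs, ys):
    """Sample covariance (divide by n - 1), matching =COVARIANCE.S;
    equivalently, the population value times n/(n - 1)."""
    n = len(xs)
    return cov_population(xs, ys) * n / (n - 1)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(cov_population(xs, ys))  # 2.5
print(cov_sample(xs, ys))      # 3.333...
```

Note the n/(n − 1) conversion is exactly the adjustment described above for the old =COVAR function.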

Using Covariance
>In the same units as Variance (if both
variables are in the same unit), i.e.
units squared
>Very useful in Finance for measuring
portfolio risk
>Unfortunately, it is hard to interpret
for two reasons:
What does the magnitude/size imply?
The units are confusing

A More Useful Statistic


>We can simultaneously adjust for both
of these shortcomings by dividing the
covariance by the two relevant
standard deviations
>This operation
Removes the impact of size & scale
Eliminates the units


Correlation
>Correlation measures the sensitivity of
one variable to another, but ignoring
magnitude
>Range: -1 to 1
+1: Implies perfect positive co-movement
-1: Implies perfect negative co-movement
0: No linear relationship


Calculating Correlation

Population: ρ_XY = σ_XY / (σ_X σ_Y)
Sample: r_XY = s_XY / (s_X s_Y)
Excel: =CORREL(range1,range2)
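A minimal Python sketch of the sample correlation (the n − 1 factors in s_XY, s_X, and s_Y cancel, so they are omitted; the function name is our own):

```python
def correlation(xs, ys):
    """Sample correlation r = s_xy / (s_x * s_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Perfectly co-moving data hit the endpoints of the -1 to +1 range
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```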

Regression Analysis


Regression Analysis
>A statistical technique for determining
the best fit line through a series of
data


Error
>No line can hit all, or even most, of the
points. The amount we miss by is called
ERROR
>As we have learned, error does not
mean mistake! It simply means the
inevitable missing that will happen
when we generalize, or try to describe
things with models
>When we looked at the mean and
variance, we called the errors deviations

What Regression Does


>Regression finds the line that minimizes
the amount of error, or deviation from the
line
The mean is the statistic that has the
minimum total of squared deviations

>Likewise, the regression line is the unique


line that minimizes the total of the
squared errors.
>The Statistical term for what gets
minimized is Sum of Squared Errors or
SSE
We'll come back to this measure later
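A small sketch of this minimization property (helper names are our own): the least-squares line has a strictly lower SSE than any line obtained by nudging either coefficient.

```python
def fit(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    return my - b1 * mx, b1

def sse(xs, ys, b0, b1):
    """Sum of squared errors of the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit(xs, ys)
best = sse(xs, ys, b0, b1)

# Perturbing either coefficient can only increase the SSE
assert best < sse(xs, ys, b0 + 0.1, b1)
assert best < sse(xs, ys, b0 - 0.1, b1)
assert best < sse(xs, ys, b0, b1 + 0.1)
assert best < sse(xs, ys, b0, b1 - 0.1)
```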

Example
>Suppose we are examining the sale
prices of compact cars sold by rental
agencies and that we have the
following summary statistics:


Summary Statistics
> Our best estimate of the average price
would be $5,411
> Using the Empirical Rule, a 95%
Confidence Interval would be $5,411 ±
(2)(255), or $5,411 ± 510, or $4,901 to
$5,921


Something Missing?
>Clearly, looking at this data in such a
simplistic way ignores a key factor:
the mileage of the vehicle


Price vs. Mileage


Importance of the Factor


>After looking at the scatter graph, you
would be inclined to revise your
estimate depending on the mileage
25,000 km about $5,700 - $5,900
45,000 km about $5,100 - $5,300


Key Summary Statistics


Estimating the Regression Line

>From Graph

Right-click on a data point


Choose Add Trendline > Linear
Options tab: check Display equation on
chart

>Excel Functions
=INTERCEPT(known_y's,known_x's)
=SLOPE(known_y's,known_x's)

>Regression tool in Data Analysis


>Using Formulas Manually

From Graph


Excel Functions

=INTERCEPT(B3:B102,A3:A102)
=SLOPE(B3:B102,A3:A102)


From Graph & Functions


>While these methods are a quick way
to get the equation, they do not
provide you with any statistics to use
to assess the model
>The Regression tool in Data Analysis
provides a full range of statistics


The Regression Tool


>Office 2007 & later:
Data Analysis will appear at far right on
ribbon bar
Choose Regression from the dialogue box
menu.


Switch to Excel
File CarPrice.xls
Tab Raw Data


Excel Regression Output


Stripped Down Output


Interpretation
>Our estimated relationship is
>Price = $6,533 - 0.031(km)
Every 1000 km reduces the price by an
average of $31
What does the $6,533 mean?
Careful! It is outside the data range!


A Useful Formula
b₁ = cov(x, y) / var(x)
>The estimate of the slope coefficient
is the ratio of the covariance between
the dependent and independent
variables and the variance of the
independent variable


Estimating Slope
b₁ = cov(x, y) / var(x) = −1,356,256 / 43,528,690 = −0.03115…

>We get the same slope as the regression
output using the ratio of the covariance to
the variance of the independent variable
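A quick check of this ratio with the summary numbers above, in plain Python:

```python
cov_xy = -1356256    # cov(x, y) from the summary statistics
var_x = 43528690     # var(x) from the summary statistics

b1 = cov_xy / var_x  # slope estimate
print(round(b1, 6))  # -0.031158
```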


Estimating the Intercept


>We now have the slope, but need the
intercept
>Every estimated regression line goes through
the point (x-bar, y-bar), the joint mean
>Therefore:

b₀ = ȳ − b₁x̄ = 5411.41 − (−0.03115…)(36,009.45) = 6533.383
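The same arithmetic in Python, reusing the slope found on the previous slide:

```python
y_bar = 5411.41           # mean price
x_bar = 36009.45          # mean mileage
b1 = -1356256 / 43528690  # slope, cov(x, y) / var(x)

# The line passes through the joint mean, so solve for the intercept
b0 = y_bar - b1 * x_bar
print(round(b0, 3))       # 6533.383
```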


Manual Calculation
> To calculate regression manually, you will
need the following summary statistics

Σx, Σy, Σx², Σy², Σxy

For Manual Calculation

(Note: Data are both in 1000s to keep squared values small)


Formulae - I

b₁ = [ Σxy − (Σx)(Σy)/n ] / [ Σx² − (Σx)²/n ]

b₀ = ȳ − b₁x̄

Formulae - II
b₁ = cov(x, y) / var(x),   b₀ = ȳ − b₁x̄

cov(x, y) = [ Σxy − (Σx)(Σy)/n ] / (n − 1)

var(x) = [ Σx² − (Σx)²/n ] / (n − 1)


Car Price Example Calculations


b₁ = [ Σxy − (Σx)(Σy)/n ] / [ Σx² − (Σx)²/n ]
   = [ 19,351.92 − (3600.945)(541.141)/100 ] / [ 133,977.3892 − (3600.945)²/100 ]
   = −134.2693 / 4309.3403
   = −0.031158

x̄ = Σx/n = 3600.945/100 = 36.00945
ȳ = Σy/n = 541.141/100 = 5.41141

b₀ = ȳ − b₁x̄ = 5.41141 − (−0.031158)(36.00945) = 6.5334
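The manual calculation, reproduced in Python from the summary sums (data in 1000s, as noted earlier):

```python
n = 100
sum_x = 3600.945        # sum of x
sum_y = 541.141         # sum of y
sum_xy = 19351.92       # sum of x*y
sum_x2 = 133977.3892    # sum of x squared

b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
b0 = sum_y / n - b1 * sum_x / n   # b0 = y-bar - b1 * x-bar

print(round(b1, 6))  # -0.031158
print(round(b0, 4))  # 6.5334
```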

Assessing the Model


Assessing and Testing


>You will learn the statistics for
assessing and testing the quality of
the model in Business Analytics II
>We must ask, "Is the model logical?"
>We will look at the R2, (R-Squared)
AKA Coefficient of Determination


R²: Coefficient of Determination

>The R² (R-squared) tells us the proportion
of the variability in our dependent variable
that is explained by variability in the
independent variable
>It is the square of the correlation
coefficient for simple regression


Car Price Example


Manual Calculations

R² = [cov(x, y)]² / (s_x² s_y²) = (−1.35626)² / [(43.52869)(0.06499)] = 0.65023 = 65.023%

Minor difference due to rounding
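The same check in plain Python (covariance and variances in 1000s units, as on the slide):

```python
cov_xy = -1.35626   # cov(x, y)
var_x = 43.52869    # s_x squared
var_y = 0.06499     # s_y squared

r_squared = cov_xy ** 2 / (var_x * var_y)
print(r_squared)    # about 0.6502, i.e. roughly 65%
```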

Car Price Example: Quality


>Logical: Price falls as mileage
increases, and by a plausible amount.
>The R-squared: 0.65: 65% of the
variation in price is explained by
variation in mileage


CMP-TSX Example


TSX-CMP Regression Output (Abridged)


Interpreting the Output


The TSX-CMP Example


>Cov(TSX,CMP) = 4.875
>Var(TSX) = 7.25
>b1 = 4.875/7.25 = 0.6724


For Next Class


>Do questions from Chapter 4
>Read Chapter 6
>Assignment Posted, Quiz (?) for next
week
>Midterm, Sunday, February 12, covering
Weeks 1-4, Chapters 1-5

YOU LEARN STATISTICS


BY DOING STATISTICS
