
QMT412 Pn. Sanizah's Notes 02/05/2013

CHAPTER 5
CORRELATION AND REGRESSION

Introduction
Correlation and Regression
Scatter Plot/Diagram
Coefficient of Correlation
Simple Linear Regression


sanizah@tmsk.uitm.edu.my
Learning objectives
Explain the concept of correlation
Calculate Pearson's correlation coefficient and
interpret the results
Calculate Spearman's rank correlation for
qualitative and quantitative data and interpret the
results
Determine the regression equation for a set of data and
interpret the equation
Use the regression equation to forecast
Introduction
Correlation: Do you have a relationship?
(Between two quantitative variables, x & y)
If you have a relationship:
1) What is the direction? (+ or -)
2) What is the strength? (r: -1 to +1)

# Correlation measures a LINEAR relationship.

If you have a significant correlation:
How well can you predict a subject's y-score if
you know their x-score?
Correlation & Regression
Regression and correlation are two concepts used to
describe the relationship between variables.

Correlation is a statistical method used to
determine if a relationship between variables
exists.

Regression is the statistical method used to
describe the nature of the relationship between
variables - that is, positive or negative, linear or
nonlinear.
Independent and Dependent Variable
In this chapter, we want to study the relationship
between 2 variables only.
Independent variable - x
Dependent variable - y

For example:
Expenditure (x) and Revenue (y)
Price (x) and sales (y)
Number of days absent (x) and CGPA (y)
Age of a person (x) and his/her blood pressure (y)

Independent and Dependent Variable
Independent variable (x)
Also called the predictor, explanatory or manipulated variable
The variable in regression that can be controlled or manipulated
Dependent variable (y)
Also called the response variable
The variable that cannot be controlled or manipulated
Independent (x) vs. Dependent (y)
Independent (x):
Intentionally manipulated
Controlled
Varies at a known rate
Cause
Dependent (y):
Intentionally left alone
Measured
Varies at an unknown rate
Effect
Example: What affects a student's arrival to class?
Variables:
Type of School
FSPPP, Business School, FSKM
Type of Student
Gender? CGPA?
Class Time
Morning, Afternoon, Evening
Mode of Transportation
Motorcycle, Car, UiTM bus
Scatter Plot (scatter diagram)
A scatter plot is used to show the relationship
between two variables.

The scatter plot is a visual way to describe the nature of
the relationship between the independent
variable (x) and the dependent variable (y).

Interpreting scatter plots:
Positive linear relationship
Negative linear relationship
Nonlinear relationship
No relationship

Scatter Plot Examples
[Figure: four scatter plots of y against x. Panel titles: Linear relationships, Nonlinear (Curvilinear) relationships. Row labels: Positive, Negative.]
Scatter Plot Examples (continued)
[Figure: four scatter plots of y against x. Panel titles: Strong relationships, Weak relationships.]
Scatter Plot Examples (continued)
[Figure: two scatter plots of y against x. Panel title: No relationship.]
Example 1 (pg. 134)
Draw a scatter diagram for the following data and state
the type of relationship between the variables.

x    1    3    5    7    9   13   17
y    0    5   11   14   19   22   30
Correlation Coefficient

The correlation coefficient measures the strength and direction of
a LINEAR relationship between a pair of random variables.

The POPULATION correlation coefficient $\rho$ (rho) measures the
strength of the association between the variables.

The sample correlation coefficient, r or $\rho_s$, is an estimate of $\rho$
and is used to measure the strength of the linear relationship
in the sample observations.
Correlation Coefficient
r or $\rho_s$ indicates:
strength of relationship (strong, weak, or none)
direction of relationship
positive (direct): variables move in the same direction
negative (inverse): variables move in opposite directions
r ranges in value from -1.0 to +1.0.
[Figure: scale for r marked at -1.0, -0.8, -0.5, 0.0, +0.5, +0.8, +1.0. The value 0.0 is labelled "No Relationship"; moving outward from 0.0 on both the negative and the positive side, the labels read Weak, Moderate, Strong, Very Strong. The end points are labelled -ve Perfect (-1.0) and +ve Perfect (+1.0).]
Do Variables Relate to One Another?
Is teachers' pay related to performance?
Is exercise related to illness?
Is CO2 related to global warming?
Is TV viewing related to shoe size?
Is shoe size related to height?
Is height related to IQ?
Is the number of cigarettes smoked per day related to
lung capacity?
Positive
Negative
Positive
Zero
Positive correlation
Two variables move in the same direction.
Negative correlation
Two variables tend to move in opposite directions.
Methods for Calculating Correlation Coefficient, r or $\rho_s$

Pearson Product-Moment Correlation Coefficient
Spearman Rank Correlation Coefficient
Pearson Coefficient of Correlation
Both variables must be quantitative and normally
distributed.
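Calculation for r:

$$ r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}} $$

or

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}} $$
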
Example 2
Refer to Example 1. Compute Pearson coefficient
of correlation and interpret the result.


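$\sum x$ = ________      $\sum x^2$ = ________
$\sum y$ = ________      $\sum y^2$ = ________
$\sum xy$ = ________     n = ________

$$ r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}} $$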
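
The blanks in Example 2 can be checked with a short script. This is only a sketch: it applies the second (n-multiplied) form of the Pearson formula, $r = (n\sum xy - \sum x \sum y) / \sqrt{(n\sum x^2 - (\sum x)^2)(n\sum y^2 - (\sum y)^2)}$, to the Example 1 data, and the function name `pearson_r` is an illustrative choice, not something from the notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient from the raw sums (hypothetical helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Data from Example 1
x = [1, 3, 5, 7, 9, 13, 17]
y = [0, 5, 11, 14, 19, 22, 30]

r = pearson_r(x, y)
print("n =", len(x), " sum x =", sum(x), " sum y =", sum(y))
print("r =", round(r, 3))
```

For this data r comes out at about 0.985, which points to a very strong positive linear relationship between x and y.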

The Spearman rank correlation coefficient
Spearman's rank correlation coefficient is a measure of association
between two variables that are at least of ordinal scale (suitable for
qualitative data).
It can also be applied to quantitative data, but the variables must first
be ranked; the coefficient is then calculated from these rankings.

$$ \rho_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$

where:
d = difference between two ranks
n = number of pairs of observations

NOTE: Be careful with tied observations
How to calculate Spearman's rank correlation coefficient?
1. List each set of scores in a column.
2. Rank the two sets of scores.
3. Place the appropriate rank beside each score.
4. Head a column d and determine the difference in rank for
each pair of scores.
(Note: Sum of the d column should always be 0)
5. Square each number in the d column and sum the
values ($\sum d^2$).
6. Use the formula to calculate the correlation coefficient.
Refer Example 5 pg. 140

Five students A, B, C, D, E are ranked in two subjects, statistics and
computer programming, with the following results.
Calculate Spearman's rank correlation coefficient.

Student   Statistics   Computer   d    $d^2$
A         1            3
B         2            1
C         3            4
D         4            2
E         5            5

$$ \rho_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$
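
As a check on Example 5, the d and $d^2$ columns can be filled in programmatically. The sketch below assumes the ranks are exactly as tabulated (no ties) and uses the formula 1 - 6 * (sum of d^2) / (n(n^2 - 1)); the variable names are illustrative only.

```python
# Ranks from Example 5 (students A to E)
students = ["A", "B", "C", "D", "E"]
rank_statistics = [1, 2, 3, 4, 5]
rank_computer = [3, 1, 4, 2, 5]

# Step 4: difference in rank for each pair
d = [s - c for s, c in zip(rank_statistics, rank_computer)]
# Step 5: square the differences and sum them
d_squared = [v * v for v in d]
sum_d2 = sum(d_squared)

# Step 6: apply the Spearman formula
n = len(students)
rho_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

for name, di, di2 in zip(students, d, d_squared):
    print(name, di, di2)
print("sum of d =", sum(d), " sum of d^2 =", sum_d2, " rho_s =", rho_s)
```

The d column sums to 0, as the procedure says it should, and the coefficient is 0.5, a moderate positive association between the two rankings.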

Refer Example 6 pg. 141

x     y     Rank of x, $R_x$   Rank of y, $R_y$   $d = R_x - R_y$   $d^2$
6.0   80
6.2   80
6.5   78
6.8   75
7.0   70
7.2   60
7.5   60
7.8   55
8.0   50
8.2   48
8.4   45
8.7   40
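
Example 6 contains tied y values (80 twice, 60 twice), which is the situation the earlier note warns about. The sketch below assumes the common convention of giving tied observations the average of the ranks they would otherwise occupy; the notes do not say which convention the textbook uses, so treat that as an assumption. The helper `average_ranks` is hypothetical, and ranks are assigned from smallest (rank 1) upward for both variables.

```python
def average_ranks(values):
    """Rank from smallest (rank 1) upward; ties share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

# Data from Example 6
x = [6.0, 6.2, 6.5, 6.8, 7.0, 7.2, 7.5, 7.8, 8.0, 8.2, 8.4, 8.7]
y = [80, 80, 78, 75, 70, 60, 60, 55, 50, 48, 45, 40]

rank_x = average_ranks(x)
rank_y = average_ranks(y)
d = [rx - ry for rx, ry in zip(rank_x, rank_y)]
sum_d2 = sum(v * v for v in d)
n = len(x)
rho_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print("rank of y:", rank_y)
print("sum of d^2 =", sum_d2, " rho_s =", round(rho_s, 4))
```

Under these assumptions the coefficient is about -0.99, a very strong negative association. With ties present, the $\sum d^2$ formula is only an approximation, so a textbook answer that treats ties differently may differ slightly.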
The Regression Line
Regression indicates the degree to which the variation in one
variable, Y, is related to or can be explained by the variation in
another variable, X.
Once you know there is a significant linear correlation, you
can write an equation describing the relationship between
the x and y variables.
This equation is called the line of regression or least squares
line.
The equation of a line may be written as:

$$ y = a + bx $$

where b is the slope of the line and a is the y-intercept.

Regression line
Creates a line of best fit running through the data

Analyzes the relationship between the two quantitative
variables, X and Y

$$ y = a + bx $$

where y is the dependent variable and x is the independent variable.

a (intercept):
if x = 0 is in the range, then a is the mean of the distribution
of the response y when x = 0;
if x = 0 is not in the range, then a has no practical
interpretation

b (slope):
change in the mean of the distribution of the response
produced by a unit change in x

The Least Squares Regression Line
The values of a and b in the regression line y = a + bx
can be calculated by using the least squares method (or
method of least squares), given by the following formulas:

$$ b = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} $$

$$ a = \bar{y} - b\bar{x} = \frac{\sum y}{n} - b\,\frac{\sum x}{n} $$

Example 3: Application

Absences (x)   Final Grade (y)
8              78
2              92
5              90
12             58
15             43
9              74
6              81

[Figure: scatter plot of Final Grade (vertical axis, 40 to 95) against Absences (horizontal axis, 0 to 16).]
Calculate a and b.
Write the equation of the line of regression with
x = number of absences and y = final grade.

        x     y     xy    $x^2$   $y^2$
1       8    78    624     64    6084
2       2    92    184      4    8464
3       5    90    450     25    8100
4      12    58    696    144    3364
5      15    43    645    225    1849
6       9    74    666     81    5476
7       6    81    486     36    6561
Total  57   516   3751    579   39898

The line of regression is: ________
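
The column totals and the resulting a and b for Example 3 can be reproduced with a few lines of code. This is a minimal sketch of the least squares formulas b = (sum xy - (sum x)(sum y)/n) / (sum x^2 - (sum x)^2/n) and a = (sum y)/n - b (sum x)/n; the function name `least_squares_line` is an illustrative choice.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + bx using the sums formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n
    return a, b

# Data from Example 3
absences = [8, 2, 5, 12, 15, 9, 6]
grades = [78, 92, 90, 58, 43, 74, 81]

a, b = least_squares_line(absences, grades)
print("sum x =", sum(absences), " sum y =", sum(grades))
print("b =", round(b, 3), " a =", round(a, 3))
```

The slope is about -3.924 and the intercept about 105.67, so each additional absence is associated with a drop of roughly 3.9 marks in the mean final grade. The intercept can differ in the third decimal place depending on whether b is rounded before a is computed.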
The Line of Regression
[Figure: plot of Final Grade (vertical axis, 40 to 95) against Absences (horizontal axis, 0 to 16).]
The line of regression is: y = -3.924x + 105.667
Note that the point $(\bar{x}, \bar{y})$ = (8.143, 73.714) is on the line.
Predicting y Values
The regression line can be used to predict values of y
for values of x falling within the range of the data.
The regression equation for number of times absent and final
grade is:

y = -3.924x + 105.667

Use this equation to predict the expected grade for a student with

(a) 3 absences (b) 12 absences

(a) y = -3.924(3) + 105.667 = 93.895
(b) y = -3.924(12) + 105.667 = 58.579
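
The restriction to x values inside the range of the data can be made explicit in code. The sketch below uses the rounded coefficients from the notes (intercept 105.667, slope -3.924) and the observed range of absences in Example 3 (2 to 15); the function `predict_grade` and its refusal to extrapolate are illustrative design choices, not part of the notes.

```python
INTERCEPT = 105.667   # a, as rounded in the notes
SLOPE = -3.924        # b, as rounded in the notes
X_MIN, X_MAX = 2, 15  # smallest and largest number of absences in Example 3

def predict_grade(absences):
    """Predict the final grade, refusing to extrapolate beyond the data."""
    if not X_MIN <= absences <= X_MAX:
        raise ValueError("x is outside the range of the data; prediction not reliable")
    return INTERCEPT + SLOPE * absences

grade_a = predict_grade(3)    # (a) 3 absences
grade_b = predict_grade(12)   # (b) 12 absences
print(round(grade_a, 3), round(grade_b, 3))
```

Calling `predict_grade(20)` raises an error in this sketch, since 20 absences lies outside the observed data and the linear pattern may not hold there.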
Coefficient of Determination
The coefficient of determination, $r^2$, measures the
strength of the association and is the ratio of explained
variation in y to the total variation in y.

$$ r^2 = \frac{\text{explained variation}}{\text{total variation}} = (\text{correlation coefficient})^2 $$

Interpretation: proportion of the variation in
y that is explained by the variation in x

Recall Example 3
The correlation coefficient of number of times absent and final
grade is r = -0.975. The coefficient of determination is
$r^2 = (-0.975)^2 = 0.9506$.
Interpretation: About 95.06% of the variation in final grades can be
explained by the number of times a student is absent.
Note: The other 4.94% is unexplained and can be due to sampling
error or other variables such as intelligence, amount of time
studied, etc.
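
To see that the ratio of explained to total variation matches the squared correlation coefficient, both can be computed from the Example 3 data. This sketch uses deviation-from-the-mean sums, which are algebraically equivalent to the raw-sum formulas in the notes; all variable names are illustrative.

```python
from math import sqrt

# Data from Example 3
absences = [8, 2, 5, 12, 15, 9, 6]
grades = [78, 92, 90, 58, 43, 74, 81]

n = len(absences)
mean_x = sum(absences) / n
mean_y = sum(grades) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(absences, grades))
s_xx = sum((x - mean_x) ** 2 for x in absences)
s_yy = sum((y - mean_y) ** 2 for y in grades)   # total variation in y

# Least squares line and the fitted values it produces
b = s_xy / s_xx
a = mean_y - b * mean_x
fitted = [a + b * x for x in absences]

explained = sum((f - mean_y) ** 2 for f in fitted)  # explained variation
r = s_xy / sqrt(s_xx * s_yy)
r_squared = explained / s_yy

print("r =", round(r, 3), " r^2 =", round(r_squared, 4))
```

Without intermediate rounding, r is about -0.975 and $r^2$ is about 0.950. The figure 0.9506 arises from squaring the already rounded value 0.975.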
