
Simple Linear Regression and Correlation

Presented by :
Eng. Heba El-Haddad

Introduction
Regression analysis is a statistical tool for analyzing the relationships
between variables.
For example:
A college guidance counselor has just administered a vocational
aptitude test to 1000 entering freshmen. She is interested in knowing
whether there is a relationship between the math aptitude scores and
the business aptitude scores.

To determine the relationship between the math aptitude scores and the
business aptitude scores, we compute a number that measures
the relationship between these two sets of scores.
This number is called the correlation coefficient.
A Digression into History

The English mathematician Sir Francis Galton (1822–1911) introduced the
concept of correlation.

The precise mathematical measure of correlation as we use it today was
formulated by Karl Pearson (1857–1936).
Correlation Coefficient
To determine the correlation between the math aptitude and business
aptitude scores, we can analyze the situation pictorially by using a
scatter diagram.

The math score represents the independent variable and is denoted by x.

The business score represents the dependent variable and is denoted by y.

Different Aptitude Scores Received by Ten Students

Student   Math       Business   Language   Music
          aptitude   aptitude   aptitude   aptitude
A         52         48         26         22
B         49         49         53         23
C         26         27         48         57
D         28         24         31         54
E         63         59         67         13
F         44         40         75         20
G         70         72         31          9
H         32         31         22         50
I         49         50         11         17
J         51         49         19         24
Scatter Diagram
Analyze the situation pictorially by using a scatter diagram.
To create a scatter diagram by using Minitab:
• Click on Graph > Scatterplot.
• Click on "With Regression", then "OK" in the first dialog box.
• In the second dialog box, select the business score into the first box
in the Y column, and the math score into the first box in the X column.
Note: each point on the plot represents one student's pair of scores
(the figure labels the point for Student F).
Scatter Diagram
From the scatter diagram:
1. The dots form an approximately straight line. This means that
there is a linear relation between the two variables.

2. The line runs from the lower left to the upper right. This means
that there is a positive correlation.

[Figure: positive linear correlation, line running from lower left to upper right]


Other types of scatter diagram

[Figures: negative linear correlation; no correlation]
The Correlation Coefficient
Once we have determined that there is a linear relation between two
variables, we can measure the strength of this relation by using the
coefficient of linear correlation developed by Karl Pearson.

The coefficient of linear correlation is given by

$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} $$

where
x = label for one of the variables
y = label for the other variable
n = number of pairs of scores
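
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `pearson_r` is an assumption, and the data are the math and business aptitude scores of the ten students.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Coefficient of linear correlation, computed from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Math (x) and business (y) aptitude scores for students A to J.
math_scores = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]
business_scores = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]

r = pearson_r(math_scores, business_scores)
print(round(r, 4))
```

The function raises a ZeroDivisionError if either variable is constant, since r is undefined in that case.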
Possibilities of the r value
The correlation coefficient always has a value in the range -1 ≤ r ≤ +1.

r = 1            Perfect positive correlation
r close to 1     Strong positive correlation
r > 0            Positive correlation
r = 0            No correlation
r < 0            Negative correlation
r close to -1    Strong negative correlation
r = -1           Perfect negative correlation
Example on the correlation coefficient

Math           Business
aptitude "x"   aptitude "y"   x²       y²       xy
52             48             2704     2304     2496
49             49             2401     2401     2401
26             27              676      729      702
28             24              784      576      672
63             59             3969     3481     3717
44             40             1936     1600     1760
70             72             4900     5184     5040
32             31             1024      961      992
49             50             2401     2500     2450
51             49             2601     2401     2499
Sum: 464       449            23396    22137    22729

$$ r = \frac{10(22729) - (464)(449)}{\sqrt{10(23396) - (464)^2}\,\sqrt{10(22137) - (449)^2}} = 0.986747791 $$

Thus, the coefficient of correlation is 0.9867. Since this value is
close to +1, we say that there is a high degree of positive
correlation.
The sensitivity of the correlation coefficient
The correlation coefficient is unaffected by adding or subtracting a
number to either x or y or both, or by multiplying either by a positive
number. This holds even if x is coded in one way and y in another.
Here x is coded as x * 29 and y as y / 20:

Math aptitude   Business aptitude
"x * 29"        "y / 20"            x²          y²         xy
1508            2.4                 2274064     5.76       3619.2
1421            2.45                2019241     6.0025     3481.45
754             1.35                 568516     1.8225     1017.9
812             1.2                  659344     1.44        974.4
1827            2.95                3337929     8.7025     5389.65
1276            2                   1628176     4          2552
2030            3.6                 4120900     12.96      7308
928             1.55                 861184     2.4025     1438.4
1421            2.5                 2019241     6.25       3552.5
1479            2.45                2187441     6.0025     3623.55
Sum: 13456      22.45               19676036    55.3425    32957.05

r = 0.986747791
The sensitivity of the correlation coefficient
Adding or subtracting a number to x, to y, or to both also leaves the
correlation coefficient unchanged. Here x is coded as x + 29 and y as
y - 38:

Math aptitude   Business aptitude
"x + 29"        "y - 38"            x²       y²      xy
81              10                  6561     100      810
78              11                  6084     121      858
55              -11                 3025     121     -605
57              -14                 3249     196     -798
92              21                  8464     441     1932
73              2                   5329       4      146
99              34                  9801    1156     3366
61              -7                  3721      49     -427
78              12                  6084     144      936
80              11                  6400     121      880
Sum: 754        69                  58718    2453    7098

r = 0.986747791
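
The two recoding examples can be checked numerically. The sketch below is an illustration only: the helper `pearson_r` is a name introduced here, not something from the slides. It recomputes r after scaling and after shifting the scores. It also adds one case the slides do not show, multiplying by a negative number, which flips the sign of r.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Coefficient of linear correlation from the raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / (sqrt(n * sx2 - sx ** 2) * sqrt(n * sy2 - sy ** 2))

x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude

r_original = pearson_r(x, y)
# Coding by multiplication and division (x * 29, y / 20).
r_scaled = pearson_r([v * 29 for v in x], [v / 20 for v in y])
# Coding by addition and subtraction (x + 29, y - 38).
r_shifted = pearson_r([v + 29 for v in x], [v - 38 for v in y])
# Extra case, not in the slides: a negative multiplier reverses the sign.
r_negated = pearson_r([-v for v in x], y)

print(round(r_original, 6), round(r_scaled, 6), round(r_shifted, 6), round(r_negated, 6))
```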


The Reliability of r
When r is computed, we may get a strong correlation, positive or
negative, that is due purely to chance and not to some relation that
exists between x and y.

Business       Music
aptitude "x"   aptitude "y"   x²       y²       xy
48             22             2304      484     1056
49             23             2401      529     1127
27             57              729     3249     1539
24             54              576     2916     1296
59             13             3481      169      767
40             20             1600      400      800
72              9             5184       81      648
31             50              961     2500     1550
50             17             2500      289      850
49             24             2401      576     1176
Sum: 449       289            22137    11193    10809

r = -0.914447
Amount of snow    No. of hours
in inches "x"     studied "y"     x²    y²    xy
1                 2                1     4     2
4                 6               16    36    24
2                 3                4     9     6
6                 4               36    16    24
3                 4                9    16    12
Sum: 16           19              66    81    68

r = 0.63

The value of r in this case is 0.63, but we cannot conclude that if it
snows in the U.S.A. then the students in Egypt study more!
A chart has been constructed that allows us to determine the significance of a particular
value of the correlation coefficient.

1. Compute the value of r.

2. Look in the chart for the critical value rα corresponding to the given n,
where n is the number of pairs of scores.

3. The value of r is not statistically significant if it lies between -rα and +rα for
that value of n.
Coefficient Correlation Chart
Consider the correlation between the amount of snow in the U.S.A. and
the hours studied by the students in Egypt.

Assume α = 0.025

1. r = 0.63, n = 5

2. From the table, r0.025 = 0.878, so the non-significant range is from
-0.878 to +0.878.
3. Since the value r = 0.63 lies between -0.878 and +0.878,
we conclude that the correlation may be due purely to chance.
Coefficient Correlation Chart
Consider the correlation between the math aptitude and business
aptitude scores.

Assume α = 0.025

1. r = 0.986747791, n = 10

2. From the table, r0.025 = 0.632, so the non-significant range is from
-0.632 to +0.632.
3. Since the value of r is greater than +0.632,
we conclude that there is a definite positive
correlation between the math aptitude
score and the business aptitude score.
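
The chart lookup can be approximated in code. The sketch below rests on an assumption that does not come from the slides: that the chart values follow from the Student t critical value through r = t / sqrt(t² + n - 2), with n - 2 degrees of freedom. The t values are hard-coded for the two sample sizes used here, and the names `critical_r` and `is_significant` are hypothetical.

```python
from math import sqrt

# Student t critical values with 0.025 in each tail, keyed by
# degrees of freedom (n - 2). Only the two cases used here are listed.
T_CRITICAL = {3: 3.182, 8: 2.306}

def critical_r(n):
    """Critical correlation for n pairs, derived from the t critical value."""
    df = n - 2
    t = T_CRITICAL[df]
    return t / sqrt(t * t + df)

def is_significant(r, n):
    """True when r lies outside the interval from -r_alpha to +r_alpha."""
    return abs(r) > critical_r(n)

for label, r, n in [("snow vs. hours studied", 0.63, 5),
                    ("math vs. business aptitude", 0.986747791, 10)]:
    print(label, round(critical_r(n), 3), is_significant(r, n))
```

Under this assumption the critical values come out at about 0.878 for n = 5 and 0.632 for n = 10, so only the math and business correlation is flagged as significant.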
The correlation coefficient merely determines whether two
variables are related; it does not specify how.
Linear Regression
Once we have determined that there is a linear correlation between two
variables, linear regression is used to predict the value of
one variable (the dependent variable y) on the basis of the other
variable (the independent variable x).

To predict the value of y

A - From the scatter diagram

Which line has the best “fit” to the data?

B - Least Squares Method

A Digression into History

The statistical method of least squares was developed by the French
mathematician Adrien-Marie Legendre (1752–1833).
The Method of Least Squares
B - Least Squares Method
The regression line is the line whose equation minimizes the sum of the
squared vertical deviations between the observed and predicted values.

[Figure: regression line through the data, with the differences between the observed and predicted values marked]
The Method of Least Squares
The equation of the estimated regression line is

$$ \hat{y} = b_0 + b_1 x $$

where

$$ b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}, \qquad b_0 = \frac{1}{n}\left(\sum y - b_1 \sum x\right) $$

and n is the number of pairs of scores.

The Prediction of y value
Suppose the counselor is interested in predicting how well a student will
do on the business aptitude test if she knows the student's score on
the math aptitude test.

Math           Business
aptitude "x"   aptitude "y"   x²       xy
52             48             2704     2496
49             49             2401     2401
26             27              676      702
28             24              784      672
63             59             3969     3717
44             40             1936     1760
70             72             4900     5040
32             31             1024      992
49             50             2401     2450
51             49             2601     2499
Sum: 464       449            23396    22729

$$ b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{10(22729) - (464)(449)}{10(23396) - (464)^2} = 1.01553 $$

$$ b_0 = \frac{1}{n}\left(\sum y - b_1 \sum x\right) = \frac{1}{10}\left(449 - 1.01553 \times 464\right) = -2.221 $$

The regression equation is ŷ = -2.221 + 1.01553 x

For example, at x = 50:
ŷ = -2.221 + 1.01553 * 50 = 48.56
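
The least squares computation and the prediction at x = 50 can be reproduced with a short script. This is a sketch under the same sum formulas; the function name `fit_line` is an assumption introduced for illustration.

```python
def fit_line(xs, ys):
    """Least squares intercept b0 and slope b1 from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1

math_scores = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]
business_scores = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]

b0, b1 = fit_line(math_scores, business_scores)
predicted_at_50 = b0 + b1 * 50   # predicted business score for a math score of 50

print(round(b0, 3), round(b1, 5), round(predicted_at_50, 2))
```

Because the script keeps full precision in b0 and b1, its results can differ from the hand calculation in the last decimal place.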
Alternative way to compute b1 and b0
The coefficients b1 and b0 for the least squares line

$$ \hat{y} = b_0 + b_1 x $$

are calculated as

$$ b_1 = \frac{S_{xy}}{S_x^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} $$

1. Compute the average of the x values and the average of the y values.
2. Compute the sample variance of the x values, Sx² (the square of the
sample standard deviation Sx).
3. Compute the sample covariance of the n data points, Sxy.
Alternative way to compute b1 and b0

x̄ = 464/10 = 46.4        ȳ = 449/10 = 44.9

Sx² = 1866.4/9 = 207.378

Sxy = 1895.4/9 = 210.6

b1 = 210.6 / 207.378 = 1.01554

b0 = 44.9 - (1.01554 * 46.4) = -2.221

ŷ = -2.221 + 1.01554 x
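
The same coefficients can be obtained from the means, the sample variance and the sample covariance. The sketch below uses Python's standard `statistics` module for the mean and the sample variance (divisor n - 1) and computes the covariance by hand. The variable names are assumptions chosen to mirror the symbols above.

```python
from statistics import mean, variance

x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x2 = variance(x)                              # sample variance of x
# Sample covariance: sum of products of deviations divided by n - 1.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

b1 = s_xy / s_x2
b0 = y_bar - b1 * x_bar

print(x_bar, y_bar, round(s_x2, 3), round(s_xy, 1), round(b1, 5), round(b0, 3))
```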
STANDARD ERROR OF THE ESTIMATE
Predicted value

At x = 50: ŷ = -2.221 + 1.01553 * 50 = 48.56

We cannot expect such a prediction to be exact.
For each x there is a corresponding population of y values.

The relationship between X and Y is a straight-line (linear)
relationship, y = α + βx.
The values of the independent variable X are assumed fixed (not random);
the only randomness in the values of Y comes from the error term ε.

[Figure: the line y = α + βx, with the predicted value 48.56 marked on the Y axis at X = 50]
STANDARD ERROR OF THE ESTIMATE
The mean of the corresponding y values lies on some straight line whose
equation we do not know but which is of the form:

$$ \mu_{y|x} = \beta_0 + \beta_1 x $$

[Figure: identical normal distributions of errors, all centered on the regression line]
STANDARD ERROR OF THE ESTIMATE
The error terms (the vertical distances between the predicted y values
and the true population values) are normally distributed with mean 0
and the same standard deviation σ.

The sum of the squared differences between the observed and predicted
values is called the error sum of squares:

$$ SSE = \sum (y - \hat{y})^2 $$

The value of σ can be estimated from the sample data by computing
the standard error of the estimate, also called the residual standard
deviation:

$$ S_e = \sqrt{\frac{SSE}{n - 2}} $$

If Se is zero, all the points fall on the regression line.
STANDARD ERROR OF THE ESTIMATE

Math           Business
aptitude "x"   aptitude "y"   ŷ           y - ŷ       (y - ŷ)²
52             48             50.58656    -2.58656    6.69029263
49             49             47.53997     1.46003    2.1316876
26             27             24.18278     2.81722    7.93672853
28             24             26.21384    -2.21384    4.90108755
63             59             61.75739    -2.75739    7.60319961
44             40             42.46232    -2.46232    6.06301978
70             72             68.8661      3.1339     9.82132921
32             31             30.27596     0.72404    0.52423392
49             50             47.53997     2.46003    6.0517476
51             49             49.57103    -0.57103    0.32607526
Sum: 464       449                                    SSE = 52.0494017

$$ S_e = \sqrt{\frac{52.049}{10 - 2}} = 2.55 $$
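
The table can be regenerated with a few lines of code. The sketch below assumes the rounded coefficients quoted in the slides (b0 = -2.221, b1 = 1.01553) so that the fitted values match the printed ones. The variable names are illustrative.

```python
from math import sqrt

math_scores = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]
business_scores = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]

# Rounded regression coefficients as quoted in the slides.
b0, b1 = -2.221, 1.01553

fitted = [b0 + b1 * x for x in math_scores]
residuals = [y - y_hat for y, y_hat in zip(business_scores, fitted)]

sse = sum(e ** 2 for e in residuals)    # error sum of squares
n = len(math_scores)
s_e = sqrt(sse / (n - 2))               # standard error of the estimate

for x, y, y_hat, e in zip(math_scores, business_scores, fitted, residuals):
    print(x, y, round(y_hat, 5), round(e, 5), round(e ** 2, 5))
print("SSE =", round(sse, 4), " Se =", round(s_e, 2))
```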
Hypothesis Tests About The Regression
If no linear relationship exists between the two variables, we would expect the
regression line to be horizontal, that is, to have a slope of zero. In that case

ŷ = b0 + b1 x   reduces to   ŷ = b0

To determine whether x can be used as a predictor of y, we carry out a test
of hypothesis.
Hypothesis Tests About The Regression
1. State the hypotheses, where β1 is the slope of the population regression line:
H0: β1 = 0
H1: β1 ≠ 0
2. Compute the value of the test statistic:

$$ t = \frac{b_1}{S_e / \sqrt{\sum x^2 - (\sum x)^2 / n}} $$

3. Find n - 2, which is the number of degrees of freedom for the t-distribution.
4. Find the appropriate critical value ± tα/2 by using the Student t table.
5. If the value of the test statistic falls in the rejection region (two-tailed test),
reject H0. Otherwise do not reject H0.
6. State the conclusion.
Hypothesis Tests About The Regression
For the math aptitude and business aptitude scores example:
b1 = 1.01553, n = 10, Se = 2.55, ∑x = 464, ∑x² = 23396, α = 0.05
1. State the hypotheses:
H0: β1 = 0
H1: β1 ≠ 0
2. Compute the value of the test statistic:

$$ t = \frac{1.01553}{2.55 / \sqrt{23396 - 464^2/10}} = \frac{1.01553}{2.55 / \sqrt{1866.4}} \approx 17.2 $$

3. n - 2 = 10 - 2 = 8
4. ± tα/2 = ± 2.306
5. The value of the test statistic falls in the rejection region (17.2 > 2.306), so we reject H0.
6. Conclude that the math scores can be used to predict the values of the business scores.
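
The arithmetic of the test can be checked from the inputs listed for this example. The sketch below assumes the test statistic has the form t = b1 / (Se / sqrt(∑x² - (∑x)²/n)), which is the form the worked numbers follow. The critical value 2.306 is taken from the slides rather than computed. Evaluated this way, the statistic comes out near 17.2.

```python
from math import sqrt

# Inputs as listed for the math / business aptitude example.
b1 = 1.01553
n = 10
s_e = 2.55
sum_x = 464
sum_x2 = 23396
t_critical = 2.306          # two-tailed, alpha = 0.05, n - 2 = 8 degrees of freedom

s_xx = sum_x2 - sum_x ** 2 / n          # sum of squared deviations of x
standard_error_b1 = s_e / sqrt(s_xx)    # standard error of the slope
t_stat = b1 / standard_error_b1

reject_h0 = abs(t_stat) > t_critical
print(round(s_xx, 1), round(t_stat, 1), reject_h0)
```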
The confidence interval for b1

Where
tα/2 is the t-distribution value obtained from the table using n - 2
degrees of freedom.
Se is the standard error of the estimate.
ŷp is the predicted value of y corresponding to x = xp.
Minitab

Regression Analysis

Minitab
• Type the data into C1 and C2 in the data window, and
label the columns; we will call C1 "math aptitude" and C2
"business aptitude".
Minitab
To find the equation of the regression line:
• Click on Stat > Regression > Regression.
• Select C2 "business aptitude" into the "Response:" box
and C1 "math aptitude" into the "Predictors:" box.
• Then click "OK".
Minitab
• The resulting output appears in the session window.

• The output can be trimmed by deleting the unwanted information.
Minitab
To compute the linear correlation coefficient:
• Click on Stat > Basic Statistics > Correlation, then select "math
aptitude" and "business aptitude" into the "Variables:" box.
• Click "OK".
Minitab
To predict the business aptitude score when the math aptitude score is 50:
• Label C6 "new math score", then enter 50 in it (the value of the math
score for which we want a prediction).
• Click on Stat > Regression > Regression.
• Select "business aptitude" into the "Response:" box and "math aptitude"
into the "Predictors:" box.
• Go to "Options" and select "new math score" into the box called
"Prediction intervals for new observations:".
• Click on the "Fits" box under "Storage" to check it.
• Click "OK" to get back to the Regression dialog box.
• Click on "Results", then click on the top button, "Display nothing".
• Click "OK" in the Results dialog box and in the Regression dialog box.
The result is displayed in C7.
Residual Analysis

Model Adequacy Checking

Regression Model:
Y = β0 + β1 X + ε
Assumptions:
1. The relationship between the response and the predictor is linear.
2. The ε's are normally distributed.
3. The noise term ε has zero mean.
4. All ε's have the same variance σ².
5. The ε's are uncorrelated between observations.
6. The ε's are independent of the predictors.
Residual analysis is used for detecting departures from these
assumptions.
Residual Analysis
Definition of Residuals
Residuals are estimates of the experimental error.
Mathematically, the residual for a specific predictor value is the difference between the
observed response value y and the predicted response value ŷ:

$$ \hat{\varepsilon}_i = y_i - \hat{y}_i $$

where:
yi = actual observation at Xi
ŷi = predicted value from the fitted equation ŷi = b0 + b1 Xi

Residual Plots
Residual plotting is a very effective way to investigate the adequacy of the fit of a regression model and to check the underlying
assumptions.
Residual Analysis
The residual plots are:
Histogram of the residuals
Checks the normality of the residuals.
A symmetric, bell-shaped histogram that is evenly distributed around zero indicates that the
normality assumption is likely to be true. If the histogram indicates that the random error is not
normally distributed, it suggests that the model's underlying assumptions may have been
violated.

Sample sizes of residuals are generally small (<50) because experiments have limited treatment
combinations, so a histogram may not be the best choice for judging the distribution of residuals.
A more sensitive graph is the normal probability plot.
Residual Analysis
The normal probability plot
The normal probability plot should produce an
approximately straight line if the points come from a
normal distribution.

[Figure: normal plot of residuals (Design-Expert plot, response: Life); x-axis: Residual, y-axis: Normal % probability]
Residual Analysis
• Residuals plotted against the fitted values
Checks the error variance.
This plot should show points scattered randomly about 0, regardless of the size of
the fitted value.
A residual plot whose spread increases suggests that the error variance increases with the
independent variable, while one whose spread decreases indicates that the error
variance decreases with the independent variable. Neither of these is a constant-variance
pattern.
They therefore indicate that the assumption of constant variance is not likely to be true and that the regression
is not a good one.
On the other hand, a horizontal-band pattern suggests that the variance of the residuals is constant.
Residual Analysis
• Residuals against run-order sequence or time
Checking the process drift
The Residual vs. Order of the Data plot can be used to check for drift of the variance during
the experimental process when the data are time-ordered. If the residuals are randomly distributed
around zero, there is no drift in the process.

Checking independence of the error term
The Residual vs. Order of the Data plot also reflects any correlation between the error term and
time. Systematic patterns around zero, such as cycles or long runs of the same sign, indicate that the error terms are not independent.
Residual Analysis
Outlier
An outlier is a single observation or a group of observations that is markedly different from the bulk of the data or from the pattern set by the majority of the observations.
The presence of one or more outliers can seriously distort the analysis of variance.

Outliers can be checked by examining the standardized residuals

$$ d_i = \frac{e_i}{\sqrt{MSE}} $$

where:
di = standardized residual
ei = residual
MSE = mean square error

The standardized residuals should be approximately normal with mean zero and
unit variance.
A standardized residual more than 3 or 4 standard deviations from zero is a potential
outlier.
Residual Analysis
Example:
The regression equation: ŷ = -2.221 + 1.01553 x

Math           Business                   Residual
aptitude "x"   aptitude "y"   ŷ           y - ŷ
52             48             50.58656    -2.58656
49             49             47.53997     1.46003
26             27             24.18278     2.81722
28             24             26.21384    -2.21384
63             59             61.75739    -2.75739
44             40             42.46232    -2.46232
70             72             68.8661      3.1339
32             31             30.27596     0.72404
49             50             47.53997     2.46003
51             49             49.57103    -0.57103
Sum: 464       449
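
The residuals in this example can be standardized and screened for outliers in a few lines. The sketch below makes two assumptions: that each residual is divided by the square root of MSE, and that MSE is SSE / (n - 2). The cutoff of 3 follows the rule of thumb for potential outliers; the variable names are illustrative.

```python
from math import sqrt

math_scores = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]
business_scores = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]
b0, b1 = -2.221, 1.01553    # rounded coefficients of the regression equation

residuals = [y - (b0 + b1 * x) for x, y in zip(math_scores, business_scores)]

n = len(residuals)
mse = sum(e ** 2 for e in residuals) / (n - 2)     # assumed: MSE = SSE / (n - 2)
standardized = [e / sqrt(mse) for e in residuals]

# Flag observations whose standardized residual exceeds 3 in absolute value.
potential_outliers = [i for i, d in enumerate(standardized) if abs(d) > 3]

for x, e, d in zip(math_scores, residuals, standardized):
    print(x, round(e, 5), round(d, 3))
print("potential outliers (row indices):", potential_outliers)
```

For these ten students every standardized residual is well inside ±3, so no observation is flagged.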

Residual Analysis
Using Minitab
• Stat > Regression > Regression > Graphs
Thank you
