and Correlation
Presented by :
Eng. Heba El-Haddad
Introduction
Regression analysis is a statistical tool for analyzing the relationships
between variables.
For example:
A college guidance counselor has just administered a vocational
aptitude test to 1000 entering freshmen. She is interested in knowing
whether there is a relationship between the math aptitude scores and
the business aptitude scores.
To determine the relationship between the math aptitude scores and the
business aptitude scores, we have to compute a number that measures
the relationship between these two sets of scores.
This number is called the correlation coefficient.
A Digression into History
Karl Pearson
Correlation Coefficient
Table: Different Aptitude Scores Received by Ten Students (columns: Student, Math aptitude, Business aptitude, Language aptitude, Music aptitude)
To determine the correlation between the math aptitude and the business aptitude, we can use:
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$$
Where
x = label for one of the variables
y = label for the other variable
n = number of pairs of scores
Possibilities of the r value
The correlation coefficient always has a value in the range -1 ≤ r ≤ +1.
No correlation: r = 0
Math aptitude "x"   Business aptitude "y"   x²      y²      xy
52                  48                      2704    2304    2496
49                  49                      2401    2401    2401
26                  27                      676     729     702
28                  24                      784     576     672
63                  59                      3969    3481    3717
44                  40                      1936    1600    1760
70                  72                      4900    5184    5040
32                  31                      1024    961     992
49                  50                      2401    2500    2450
51                  49                      2601    2401    2499
Total: 464          449                     23396   22137   22729
With n = 10:
$$r = \frac{10(22{,}729) - (464)(449)}{\sqrt{10(23{,}396) - (464)^2}\,\sqrt{10(22{,}137) - (449)^2}}$$
r = 0.986747791
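The calculation for the math and business aptitude scores can be checked with a few lines of Python. This is a minimal sketch that uses only the standard library; the function name `pearson_r` and the variable names are illustrative, and the data are the ten (x, y) pairs of the worked example.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient from the raw sums, as in the formula for r."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Ten pairs of scores from the worked example.
math_scores = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]
business_scores = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]

r = pearson_r(math_scores, business_scores)
print(round(r, 4))
```

Working from the raw sums mirrors the hand calculation, so each intermediate quantity can be compared with the totals row of the table.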
x     y     x²      y²    xy
61    -7    3721    49    -427
72    9     5184    81    648
r = -0.914447
Amount of snow in inches "X"   No. of hours studied "Y"   X²    Y²    XY
1                              2                          1     4     2
2                              3                          4     9     6
6                              4                          36    16    24
3                              4                          9     16    12
4                              6                          16    36    24   (row recovered from the column totals)
Total: 16                      19                         66    81    68
r = 0.63
The value of r in this case is 0.63, but we cannot conclude that the students in Egypt study more!
A chart has been constructed that allows us to determine the significance of a particular
value of the correlation coefficient.
1. Compute r for the n pairs of scores.
2. Look in the chart for the appropriate r-value corresponding to the given n,
where n is the number of pairs of scores.
Assume α = 0.025
1. r = 0.63, n = 5
Assume α = 0.025
1. r = 0.986747791, n = 10
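The chart lookup for these two cases can be approximated in code. The sketch below rests on two assumptions that are not stated in the slides: that the chart is the usual table of critical values of r, which is equivalent to a t-test with n - 2 degrees of freedom, and that α = 0.025 is the area in each tail. The two t values are standard t-table entries; the names `T_CRITICAL_0025`, `critical_r` and `is_significant` are illustrative.

```python
import math

# Standard t-table values for an upper-tail area of 0.025,
# keyed by degrees of freedom (n - 2). Only the two cases needed here.
T_CRITICAL_0025 = {3: 3.182, 8: 2.306}

def critical_r(n, t_table=T_CRITICAL_0025):
    """Smallest |r| that is significant for n pairs of scores."""
    df = n - 2
    t = t_table[df]
    return t / math.sqrt(t * t + df)

def is_significant(r, n):
    """True if |r| exceeds the critical value for n pairs."""
    return abs(r) > critical_r(n)

for r, n in [(0.63, 5), (0.986747791, 10)]:
    print(n, round(critical_r(n), 3), is_significant(r, n))
```

Under these assumptions, r = 0.63 with only five pairs does not exceed the critical value, while r = 0.986747791 with ten pairs does.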
Adrien-Marie Legendre
The Method of Least Squares
B- Least Squares Method
Regression line: the line whose equation minimizes the sum of the squared vertical deviations, that is, the differences between the observed and predicted values.
The equation of the estimated regression line is
$$\hat{y} = b_0 + b_1 x$$
Where
$$b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}, \qquad b_0 = \frac{1}{n}\left(\sum y - b_1 \sum x\right)$$
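For the math and business aptitude scores (n = 10, ∑x = 464, ∑y = 449, ∑x² = 23,396, ∑xy = 22,729):
$$b_1 = \frac{10(22{,}729) - (464)(449)}{10(23{,}396) - (464)^2} \approx 1.01553$$
$$b_0 = \frac{1}{10}\left(449 - 1.01553 \times 464\right) = -2.221$$
The estimated regression line is ŷ = -2.221 + 1.01553 x
For example, at x = 50:
ŷ = -2.221 + 1.01553 × 50 = 48.56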
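The same coefficients and the prediction at x = 50 can be reproduced in Python. This is a minimal sketch using only built-in functions; the function name `least_squares_line` is illustrative, and the data are the ten pairs of math and business aptitude scores.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the line y-hat = b0 + b1 * x, from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1

math_scores = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]
business_scores = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]

b0, b1 = least_squares_line(math_scores, business_scores)
y_hat_50 = b0 + b1 * 50  # predicted business score for a math score of 50
print(round(b0, 3), round(b1, 5), round(y_hat_50, 2))
```

Because the code keeps full precision for b1, the last digit of the slope can differ slightly from the hand-rounded value, while the prediction agrees to two decimals.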
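Alternative way to compute b1 and b0
The coefficients b1 and b0 for the least squares line can also be computed from the sample variance of x and the sample covariance of x and y:
$$b_1 = \frac{\operatorname{cov}(x, y)}{s_x^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
$$s_x^2 = \frac{1866.4}{9} = 207.378$$
$$\operatorname{cov}(x, y) = \frac{1895.4}{9} = 210.6$$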
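The two quotients 1866.4/9 and 1895.4/9 are consistent with the sample variance of x and the sample covariance of x and y for the ten pairs of scores. The sketch below assumes that reading and that b1 is their ratio; the function name `line_from_covariance` is illustrative.

```python
from statistics import mean

def line_from_covariance(xs, ys):
    """Return (var_x, cov_xy, b0, b1) using b1 = cov(x, y) / var(x)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # Sums of squared deviations and cross-deviations about the means.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = s_xx / (n - 1)
    cov_xy = s_xy / (n - 1)
    b1 = cov_xy / var_x
    b0 = y_bar - b1 * x_bar
    return var_x, cov_xy, b0, b1

math_scores = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]
business_scores = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]

var_x, cov_xy, b0, b1 = line_from_covariance(math_scores, business_scores)
print(round(var_x, 3), round(cov_xy, 1), round(b0, 3), round(b1, 5))
```

Both routes give the same line, since the factor n - 1 cancels in the ratio.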
At x = 50: ŷ = -2.221 + 1.01553 × 50 = 48.56
We cannot expect such a prediction to be accurate.
For each x there is a corresponding population of y values.
The relationship between X and Y is a straight-line (linear) relationship: y = α + βx
The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term.
[Figure: the line y = α + βx, with the predicted value 48.56 marked at X = 50]
STANDARD ERROR OF ESTIMATE
The mean of the corresponding y values lies on some straight line whose equation we do not know but which is of the form:
$$\mu_{y|x} = \beta_0 + \beta_1 x$$
[Figure: identical normal distributions of errors, all centered on the regression line]
The error terms (the vertical distances between the predicted y values and the true population values) are normally distributed with mean 0 and the same standard deviation σ.
The sum of the squared differences between the observed values y and the predicted values ŷ = b0 + b1 x is called the error sum of squares:
$$SSE = \sum (y - \hat{y})^2$$
In the interval estimate for y at x = xp:
tα/2 represents the t-distribution value obtained from the table using n-2
degrees of freedom.
Se is the standard error of estimate.
ŷp is the predicted value of y corresponding to x = xp.
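A short Python sketch can tie these quantities together for the aptitude data. The interval formula itself did not survive in the slides, so the code assumes the standard textbook forms: Se = sqrt(SSE / (n - 2)) and a prediction interval ŷp ± tα/2 · Se · sqrt(1 + 1/n + (xp - x̄)² / ∑(x - x̄)²). The value t = 2.306 (8 degrees of freedom, 0.025 in each tail) is a standard t-table entry, and choosing that α is an assumption. All function names are illustrative.

```python
import math

math_scores = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]
business_scores = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]

def fit_line(xs, ys):
    """Least squares estimates (b0, b1)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    return (sum_y - b1 * sum_x) / n, b1

def standard_error_of_estimate(xs, ys, b0, b1):
    """Return (SSE, Se), assuming Se = sqrt(SSE / (n - 2))."""
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return sse, math.sqrt(sse / (len(xs) - 2))

def prediction_interval(xs, ys, x_p, t_value):
    """Assumed standard prediction interval for a new y at x = x_p."""
    n = len(xs)
    b0, b1 = fit_line(xs, ys)
    _, se = standard_error_of_estimate(xs, ys, b0, b1)
    x_bar = sum(xs) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    half_width = t_value * se * math.sqrt(1 + 1 / n + (x_p - x_bar) ** 2 / s_xx)
    y_p = b0 + b1 * x_p
    return y_p - half_width, y_p + half_width

b0, b1 = fit_line(math_scores, business_scores)
sse, se = standard_error_of_estimate(math_scores, business_scores, b0, b1)
lower, upper = prediction_interval(math_scores, business_scores, 50, t_value=2.306)
print(round(sse, 3), round(se, 3), round(lower, 2), round(upper, 2))
```

The width of the interval makes concrete why a single predicted value such as 48.56 should not be read as exact.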
Minitab
Regression Analysis
Type the data into C1 and C2 in the data window, and
label the columns; we will call C1 "math aptitude" and C2
"business aptitude."
To find the equation of the regression line, click on
Stat > Regression > Regression.
Select C2 "business aptitude" into the "Response:" box
and C1 "math aptitude" into the "Predictors:" box,
then click "OK".
The resulting output:
To compute the linear correlation coefficient,
click on Stat > Basic Statistics > Correlation, then select "math
aptitude" and "business aptitude" into the "Variables:" box and
click "OK".
To predict the business aptitude score when the math aptitude score is 50:
Give C6 the label "new math score", then enter 50 in it (the value of the math
score for which we want a prediction).
Click on Stat > Regression > Regression.
Select "business aptitude" into the "Response:" box and "math aptitude"
into the "Predictors:" box.
Now go to "Options" and select "new math score" into the box called
"Prediction intervals for new observations:".
Click on the "Fits" box under "Storage" to check it.
Click "OK" to get back to the Regression dialog box.
Click on "Results", then click on the top button, "Display nothing."
Click "OK" on the Results dialog box and on the Regression dialog box.
The result is displayed in C7.
Residual Analysis
Regression Model:
Y = β0 + β1X + ε
Assumptions:
1. The relationship between the response and the predictors is linear.
2. The ε's are normally distributed.
3. The noise term ε has zero mean.
4. All ε's have the same variance σ².
5. The ε's are uncorrelated between observations.
6. The ε's are independent of the predictors.
Residual analysis is used for detecting departures from
assumptions.
Definition of Residuals
Residuals are estimates of experimental error.
Mathematically, the residual for a specific predictor value is the difference between the
response value y and the predicted response value ŷ:
$$\hat{\varepsilon}_i = y_i - \hat{y}_i$$
Where:
yi = actual observation at Xi
ŷi = predicted value from the equation ŷi = b0 + b1Xi
Residual Plots
Residual plotting is a very effective way to investigate the adequacy of the fit of a regression model and to check the underlying
assumptions.
The residual plots are:
Histogram and normal probability plot of the residuals
Check for normality
A symmetric bell-shaped histogram which is evenly distributed around zero indicates that the
normality assumption is likely to be true. If the histogram indicates that the random error is not
normally distributed, it suggests that the model's underlying assumptions may have been
violated.
Sample sizes of residuals are generally small (<50) because experiments have limited treatment
combinations, so a histogram is not the best choice for judging the distribution of residuals.
A more sensitive graph is the normal probability plot.
The normal probability plot
The normal probability plot should produce an
approximately straight line if the points come from a
normal distribution.
[Figure: normal plot of residuals (Design-Expert plot, response "Life"); x-axis: Residual, y-axis: Normal % probability]
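The coordinates of a normal probability plot can be computed without plotting software. The sketch below is one common construction, not necessarily the one used by Design-Expert or Minitab: it sorts the residuals and pairs each with a standard normal quantile, using the plotting position (i - 0.375) / (n + 0.25), which is an assumption. The residuals are the ten values of y - ŷ for the aptitude example; the function name `normal_plot_points` is illustrative.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Return (normal score, residual) pairs for a normal probability plot."""
    n = len(residuals)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, e in enumerate(sorted(residuals), start=1):
        p = (i - 0.375) / (n + 0.25)  # assumed plotting position
        points.append((standard_normal.inv_cdf(p), e))
    return points

# Residuals y - y-hat for the ten students in the aptitude example.
residuals = [-2.58656, 1.46003, 2.81722, -2.21384, -2.75739,
             -2.46232, 3.1339, 0.72404, 2.46003, -0.57103]

points = normal_plot_points(residuals)
for score, e in points:
    print(round(score, 3), e)
```

If the points fall close to a straight line when residuals are plotted against the normal scores, the normality assumption is plausible.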
Residuals plotted against the fitted values
Check for constant error variance
This plot should produce a distribution of points scattered randomly about 0, regardless of the size of
the fitted value.
A residual plot which has an increasing trend suggests that the error variance increases with the
independent variable, while a plot that reveals a decreasing trend indicates that the error
variance decreases with the independent variable. Neither of these is a constant-variance
pattern.
Therefore they indicate that the assumption of constant variance is not likely to be true and the regression
is not a good one.
On the other hand, a horizontal-band pattern suggests that the variance of the residuals is constant.
Residuals against run-order sequence or time
Check for process drift
The Residuals vs. Order of the Data plot can be used to check for drift of the variance during
the experimental process when the data are time-ordered. If the residuals are randomly distributed
around zero, there is no drift in the process.
Outlier
An outlier is a single observation or a group of observations that is markedly different from the bulk of the data or from the pattern set by the majority of the observations.
The presence of one or more outliers can seriously distort the analysis of variance.
The check for outliers may be made by examining the standardized residuals:
$$d_i = \frac{e_i}{\sqrt{MSE}}$$
Where:
di = standardized residual
ei = residual
MSE = mean square error (the estimate of the error variance)
The standardized residuals should be approximately normal with mean zero and
unit variance.
A residual bigger than 3 or 4 standard deviations from zero is a potential
outlier.
Example:
The regression equation: ŷ = -2.221 + 1.01553 X
Math aptitude "x"   Business aptitude "y"   ŷ          Residual y - ŷ
52                  48                      50.58656   -2.58656
49                  49                      47.53997   1.46003
26                  27                      24.18278   2.81722
28                  24                      26.21384   -2.21384
63                  59                      61.75739   -2.75739
44                  40                      42.46232   -2.46232
70                  72                      68.8661    3.1339
32                  31                      30.27596   0.72404
49                  50                      47.53997   2.46003
51                  49                      49.57103   -0.57103
Total: 464          449
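The residual column and the outlier check can be reproduced with a short script. This sketch uses the rounded coefficients of the regression equation (-2.221 and 1.01553) so that the residuals match the table, and it assumes MSE = SSE / (n - 2) for a line with two estimated coefficients. The names `standardized_residuals` and `threshold` are illustrative.

```python
import math

math_scores = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]
business_scores = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]

B0, B1 = -2.221, 1.01553  # rounded coefficients of the regression equation

def standardized_residuals(xs, ys, b0, b1):
    """Return (residuals, standardized residuals d_i = e_i / sqrt(MSE))."""
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    # Assumption: MSE = SSE / (n - 2) for simple linear regression.
    mse = sum(e * e for e in residuals) / (len(xs) - 2)
    root_mse = math.sqrt(mse)
    return residuals, [e / root_mse for e in residuals]

residuals, d = standardized_residuals(math_scores, business_scores, B0, B1)

threshold = 3  # flag residuals more than 3 standard deviations from zero
outliers = [(x, y) for x, y, d_i in zip(math_scores, business_scores, d)
            if abs(d_i) > threshold]

for x, y, e, d_i in zip(math_scores, business_scores, residuals, d):
    print(x, y, round(e, 5), round(d_i, 3))
print("potential outliers:", outliers)
```

For these ten students every standardized residual is well inside the 3 standard deviation limit, so no observation is flagged.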
Using Minitab
Stat > Regression > Regression > Graphs
Thank you