A scatter diagram can give us two types of information. Visually, we can look for patterns that indicate that the variables are related. Then, if the variables are related, we can see what kind of line, or estimating equation, describes this relationship.
Later on we will study correlation analysis to determine the degree to which the variables are related. It tells us how well the estimating equation actually describes the relationship.
In regression we look for a causal relationship between the variables,
i.e. how a change in the independent variable causes the dependent variable to change.
Deterministic and Probabilistic Relations or Models
A formula that relates quantities in the real world is called a model. Recall that in physics we have studied that if a body is moving with an initial velocity u and uniform acceleration a, the velocity after time t is given by:
v = u + at
This is a model for uniformly accelerated motion. It has the property that when a value of t is substituted in the above equation, the value of v is determined without any error. Such models are called deterministic models. An important example of a deterministic model is the relationship between the Celsius and Fahrenheit scales, F = 32 + (9/5)C. Other examples of such models are Boyle's law, Newton's law of gravitation, Ohm's law, etc.
Consider another example: an investigation of the relationship between the yield of potatoes y and the level of fertilizer application x. An investigator divides a field into eight plots of equal size and equal fertility and applies a different amount of fertilizer to each. The yield of potatoes (in kg) and the fertilizer application (in kg) were recorded for each plot. The data are given below:
Fertilizer applied (x):   1    1.5   2    2.5   3    3.5   4    4.5
Yield of potatoes (y):    25   31    27   28    36   35    32   34
Suppose the investigator believes that the relation between y and x is given exactly by:
y = 22 + 2.5x
If this were true, we would obtain the exact value of the yield y for any given value of x. Thus when x = 1, the yield must be:
y = 22 + 2.5(1) = 24.5
But the observed yield is 25, so there is an error of 24.5 - 25.0 = -0.5. Hence no deterministic model can be constructed to represent this experiment. A model that allows for this type of error is known as a probabilistic model. The deterministic relation is then modified to include both a deterministic component and a random error component:
$Y_i = a + bX_i + \varepsilon_i$, where the $\varepsilon_i$ are the unknown random errors.
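The gap between the conjectured deterministic relation and the observations can be tabulated for all eight plots. The following is a minimal sketch; the function name `conjectured_yield` is an illustrative choice, not part of the original notes.

```python
# Potato data from the text (fertilizer x and yield y, both in kg).
x = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5]
y = [25, 31, 27, 28, 36, 35, 32, 34]

def conjectured_yield(fertilizer):
    """The investigator's conjectured deterministic relation y = 22 + 2.5x."""
    return 22 + 2.5 * fertilizer

# Error = predicted - observed, the sign convention used in the text.
errors = [conjectured_yield(xi) - yi for xi, yi in zip(x, y)]
for xi, yi, ei in zip(x, y, errors):
    print(f"x = {xi:>3}: observed {yi}, predicted {conjectured_yield(xi):.2f}, error {ei:+.2f}")

# Count the plots where the deterministic relation misses the observation.
nonzero = sum(1 for e in errors if e != 0)
print(f"{nonzero} of {len(x)} plots deviate from the conjectured line")
```

Since most of the errors are non-zero, the relation cannot hold exactly, which is why a random error term is added to the model.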
Regression Model
There are many statistical investigations in which the main objective is to determine whether a
relationship exists between two or more variables. If such a relationship can be expressed by a
mathematical formula, we will then be able to use it for the purpose of making predictions. The reliability
of any prediction will, of course, depend on the strength of the relationship between the variables included
in the formula.
A mathematical equation that allows us to predict values of one dependent variable from known
values of one or more independent variables is called a regression equation. Today the term regression is
applied to all types of prediction problems and does not necessarily imply a regression towards the
population mean.
Linear Regression
We consider here the problem of estimating or predicting the value of a dependent variable Y on the basis of a known measurement of an independent and frequently controlled variable X. The variable to be estimated or predicted is termed the dependent variable or response variable, and the variable on the basis of which the dependent variable is estimated is called the independent variable, the regressor or the predictor.
For example, if we want to estimate the heights of children on the basis of their ages, the heights would be the dependent variable and the ages would be the independent variable. In estimating the yield of a crop on the basis of the amount of fertilizer used, the yield would be the dependent variable and the amount of fertilizer the independent variable.
Scatter Diagram
Let us consider the distribution of statistics grades corresponding to intelligence test scores of 50,
55, 65 and 70. The statistics grades for a sample of 12 freshmen having these intelligence test scores are
presented in the following table
Using the regression line in the figure, we can predict a statistics grade of 88 for a student whose intelligence test score is 60. However, we would be extremely fortunate if a student with an intelligence test score of 60 made a statistics grade of exactly 88. In fact, the original data in the table show that three students with this intelligence test score received grades of 85, 90 and 94. We must, therefore, interpret the predicted statistics grade of 88 as an average or expected value for all students taking the course who have an intelligence test score of 60.
Many possible regression lines could be fitted to the sample data, but we choose that particular
line which best fits that data. The best regression line is obtained by estimating the regression parameters
by the most commonly used method of least squares.
Estimation of a Straight Line using the Method of Least Squares
The basic linear relationship between the dependent variable $Y_i$ and the value $X_i$ is
$$Y_i = a + bX_i + \varepsilon_i$$
where a and b are the unknown population parameters (b is also called the coefficient of regression), the $Y_i$ are the observed values and the $\varepsilon_i$ are the error components.
We write the equation for the estimating line as
$$\hat{Y}_i = a + bX_i$$
The line will have a good fit if it minimizes the error between the estimated points on the line and
the actual observed points that were used to draw it.
One way to measure the error of the estimating line is to sum all the individual differences or
errors, between the estimated points and the observed points. In the figure we have two estimated lines
that have been fitted to the same set of three data points. Two very different lines have been drawn to
describe the relationship between the two variables.
We have calculated the individual differences between the corresponding $Y$ and $\hat{Y}$, and then found the sum of these differences. The total error in both cases is zero, which would suggest that both lines describe the data equally well. Thus we must conclude that summing the individual differences is not a reliable way to judge the goodness of fit of an estimating line: positive and negative errors cancel each other in the sum.
The sum of the absolute values of the errors seems a better criterion for a good fit, but it does not stress the magnitude of the larger errors. The sum of the squares of the errors both magnifies the larger errors and removes the cancelling effect of positive and negative values. So finally we look for the estimating line that minimizes the sum of the squares of the errors. This is called the method of least squares.
Method of Least Squares
The method of least squares determines the values of the unknown parameters that minimize the
sum of squares of the errors where errors are defined as the difference between observed values and the
corresponding predicted or estimated values.
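The sum of squares is denoted by
$$S(a,b) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - a - bX_i)^2$$
To minimize S(a,b), we set its first partial derivatives with respect to a and b equal to zero. Therefore
$$\frac{\partial S(a,b)}{\partial a} = 2\sum_{i=1}^{n} (Y_i - a - bX_i)(-1) = 0$$
$$\frac{\partial S(a,b)}{\partial b} = 2\sum_{i=1}^{n} (Y_i - a - bX_i)(-X_i) = 0$$
By simplifying, we have the normal equations
$$\sum Y_i = na + b\sum X_i$$
$$\sum X_iY_i = a\sum X_i + b\sum X_i^2$$
By solving, we have
$$a = \frac{(\sum Y)(\sum X^2) - (\sum X)(\sum XY)}{n\sum X^2 - (\sum X)^2}, \qquad b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$$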
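The closed-form solution of the normal equations translates directly into a few lines of code. The sketch below applies it to the potato data given earlier; the function name `fit_line` and the variable names are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX, using the
    closed-form solution of the two normal equations."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2   # zero only if all X values are equal
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b

# Potato data from the text.
fertilizer = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5]
potato_yield = [25, 31, 27, 28, 36, 35, 32, 34]

a, b = fit_line(fertilizer, potato_yield)
print(f"a = {a:.4f}, b = {b:.4f}")

# The first normal equation should hold up to floating-point rounding.
n = len(fertilizer)
first_equation_gap = sum(potato_yield) - (n * a + b * sum(fertilizer))
```

The fitted intercept and slope differ from the conjectured values 22 and 2.5, as expected when the line is chosen to minimize the squared errors.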
If the variable X is taken as the dependent variable, then the least squares line is given by
X = c + dY
and the normal equations are
$$\sum X = nc + d\sum Y$$
$$\sum XY = c\sum Y + d\sum Y^2$$
By solving simultaneously, we have the values of c and d
$$c = \frac{(\sum X)(\sum Y^2) - (\sum Y)(\sum XY)}{n\sum Y^2 - (\sum Y)^2}, \qquad d = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum Y^2 - (\sum Y)^2}$$
Example (1)
Fit a least squares line of regression to the following data, taking X as the independent variable.
Vehicle no.:                                                1   2   3   4
Life in years (X):                                          5   3   3   1
Repair expenses during last year, in hundreds of $ (Y):     7   7   6   4
Solution
(i) The equation of the least squares line is
Y = a + bX
and the normal equations are
$$\sum Y = na + b\sum X$$
$$\sum XY = a\sum X + b\sum X^2$$
From the given data we have n = 4
  X     Y     XY    X^2
  5     7     35    25
  3     7     21    9
  3     6     18    9
  1     4     4     1
$\sum X = 12$, $\sum Y = 24$, $\sum XY = 78$, $\sum X^2 = 44$
Substituting into the normal equations:
24 = 4a + 12b
78 = 12a + 44b
By solving simultaneously, we get
a = 3.75, b = 0.75
We can also find the values of a and b using the formulas
$$a = \frac{(\sum Y)(\sum X^2) - (\sum X)(\sum XY)}{n\sum X^2 - (\sum X)^2}, \qquad b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$$
Hence the required straight line is
Y = 3.75 + 0.75 X
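As a cross-check on this example, the sketch below fits both the regression of Y on X and, by swapping the roles of the variables, the regression of X on Y. The helper `fit_line` is an illustrative name, not part of the original notes.

```python
def fit_line(us, vs):
    """Least-squares line v = intercept + slope * u; returns (intercept, slope)."""
    n = len(us)
    su, sv = sum(us), sum(vs)
    suv = sum(u * v for u, v in zip(us, vs))
    suu = sum(u * u for u in us)
    denom = n * suu - su ** 2
    return (sv * suu - su * suv) / denom, (n * suv - su * sv) / denom

life = [5, 3, 3, 1]      # X, life in years
repair = [7, 7, 6, 4]    # Y, repair expenses in hundreds of $

a, b = fit_line(life, repair)   # Y on X
c, d = fit_line(repair, life)   # X on Y: roles of the variables swapped
print(f"Y on X: Y = {a} + {b} X")
print(f"X on Y: X = {c} + {d} Y")
```

Note that the line of X on Y is not obtained by solving Y = 3.75 + 0.75X for X; the two regressions minimize errors in different directions and in general give different lines.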
Example (2)
The following data give, for each of 10 randomly selected areas, the number of oil stoves and the annual consumption of oil in barrels. Fit a regression equation of annual oil consumption on the number of stoves.
No. of stoves (x):              27    32    38    42    48    54    60    67    73    79
Annual consumption of oil (y):  142   170   200   194   224   256   261   270   304   349
Solution
The necessary calculations are given below
  x      y       xy       x^2
  27     142     3834     729
  32     170     5440     1024
  38     200     7600     1444
  42     194     8148     1764
  48     224     10752    2304
  54     256     13824    2916
  60     261     15660    3600
  67     270     18090    4489
  73     304     22192    5329
  79     349     27571    6241
Total    520     2370     133111   29840
Using the formulas
$$a = \frac{(\sum Y)(\sum X^2) - (\sum X)(\sum XY)}{n\sum X^2 - (\sum X)^2}, \qquad b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$$
we have a = 53.6814 and b = 3.5254,
so the linear regression equation is: $\hat{y} = 53.68 + 3.53x$
The positive value of b shows that for an increase of one stove, the oil consumption increases by 3.53 barrels on average.
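The coefficients can be recomputed from the column totals alone. The short sketch below does so and also checks a useful property of the least squares line, namely that it passes through the point of means; the variable names are illustrative.

```python
# Column totals from the calculation table of Example (2).
n = 10
sum_x, sum_y = 520, 2370
sum_xy, sum_x2 = 133111, 29840

denom = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denom
b = (n * sum_xy - sum_x * sum_y) / denom
print(f"a = {a:.4f}, b = {b:.4f}")

# The fitted line passes through the point of means (x-bar, y-bar).
x_bar, y_bar = sum_x / n, sum_y / n
gap_at_means = (a + b * x_bar) - y_bar
```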
Example (3)
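Consider the data from the experiment on potatoes. We want to fit a line of regression to the data using the method of least squares.
The necessary calculations are given in the table below, where $\hat{y}$ is the fitted value and $e = y - \hat{y}$ is the residual:
  x      y     xy      x^2     y^2    y-hat    e
  1      25    25.0    1.00    625    26.83   -1.83
  1.5    31    46.5    2.25    961    28.02    2.98
  2      27    54.0    4.00    729    29.21   -2.21
  2.5    28    70.0    6.25    784    30.40   -2.40
  3      36    108.0   9.00    1296   31.60    4.40
  3.5    35    122.5   12.25   1225   32.79    2.21
  4      32    128.0   16.00   1024   33.98   -1.98
  4.5    34    153.0   20.25   1156   35.17   -1.17
Total    22    248     707.0   71.00  7800
From the totals, $b = \frac{8(707) - 22(248)}{8(71) - 22^2} = \frac{200}{84} = 2.38$ and $a = \bar{y} - b\bar{x} = 31 - 2.381(2.75) = 24.45$, so the fitted line is $\hat{y} = 24.45 + 2.38x$.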
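The fitted values and residuals in the table can be reproduced with the sketch below, which also confirms that the residuals of a least squares line sum to zero (up to rounding). Variable names are illustrative.

```python
x = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5]
y = [25, 31, 27, 28, 36, 35, 32, 34]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

# Slope and intercept from the normal equations.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(f"y-hat = {a:.2f} + {b:.2f} x")
for xi, yi, fi, ei in zip(x, y, fitted, residuals):
    print(f"{xi:>4} {yi:>3} {fi:6.2f} {ei:6.2f}")

# Sum of squared errors: the quantity that least squares minimizes.
sse = sum(e * e for e in residuals)
print(f"sum of residuals = {sum(residuals):.6f}, SSE = {sse:.3f}")
```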
Example (4)
(1) Given below are the data on the thermal energy generated in Pakistan, 1981-94. The energy generation is in billion kWh.
Year               1981   1982   1983   1984   1985   1986   1987
Energy generated   4.2    5.2    5.1    5.2    6.5    7.3    8.4
Year               1988   1989   1990   1991   1992   1993   1994
Energy generated   10.8   11.9   14.5   16.1   19.4   19.7   23.0
Fit a straight line to the data. Find the residuals. Plot the residuals and comment on your result.
(2) The following are the numbers of computers installed in the labs at UET in successive periods. Fit a linear regression equation of the number of computers on years and give the annual rate of installation.
Year:                         2001-2003   2003-2005   2005-2007   2007-2009   2009-2011
No. of computers installed:   139         144         150         154         158
Note
Whenever the independent variable is a time factor, the values assigned to the periods 2001-2003, 2003-2005, ... may be taken as 1, 2, 3, ...
Example (5)
Fit a parabola $y = ax^2 + bx + c$ in the least squares sense to the data
X:   10   12   15   23   20
Y:   14   17   23   25   21
Solution
The normal equations for the curve are
$$\sum Y = a\sum X^2 + b\sum X + 5c$$
$$\sum XY = a\sum X^3 + b\sum X^2 + c\sum X$$
$$\sum X^2Y = a\sum X^4 + b\sum X^3 + c\sum X^2$$
Table:
  X     Y     X^2    X^3     X^4      XY     X^2 Y
  10    14    100    1000    10000    140    1400
  12    17    144    1728    20736    204    2448
  15    23    225    3375    50625    345    5175
  23    25    529    12167   279841   575    13225
  20    21    400    8000    160000   420    8400
$\sum X = 80$, $\sum Y = 100$, $\sum X^2 = 1398$, $\sum X^3 = 26270$, $\sum X^4 = 521202$, $\sum XY = 1684$, $\sum X^2Y = 30648$
Substituting the values obtained from the table into the normal equations, we have
100 = 1398 a + 80 b + 5 c
1684 = 26270 a + 1398 b + 80 c
30648 = 521202 a + 26270 b + 1398 c
On solving, a = -0.07, b = 3.01, c = -8.73
Hence the required equation is
$$Y = -0.07X^2 + 3.01X - 8.73$$
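The three normal equations can be solved mechanically. The sketch below builds them from the data and solves the 3 x 3 system by Cramer's rule; the helper names `det3` and `fit_parabola` are illustrative assumptions, and any other linear solver would serve equally well.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_parabola(xs, ys):
    """Return (a, b, c) for y = a x^2 + b x + c by solving the three
    normal equations with Cramer's rule."""
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))
    # Rows follow the normal equations; unknowns are ordered (a, b, c).
    lhs = [[s2, s1, n], [s3, s2, s1], [s4, s3, s2]]
    rhs = [sy, sxy, sx2y]
    d = det3(lhs)
    solution = []
    for col in range(3):
        m = [row[:] for row in lhs]
        for r in range(3):
            m[r][col] = rhs[r]   # replace one column by the right-hand side
        solution.append(det3(m) / d)
    return tuple(solution)

X = [10, 12, 15, 23, 20]
Y = [14, 17, 23, 25, 21]
a, b, c = fit_parabola(X, Y)
print(f"a = {a:.4f}, b = {b:.4f}, c = {c:.4f}")
```

The coefficients are sensitive to rounding: rounding a before back-substituting into the other equations shifts b and c noticeably, so it is safer to carry full precision until the end.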
Example (6)
Use the method of least squares to determine the constants a and b such that $Y = ae^{bX}$ fits the following data.
X:   0.0    0.5    1.0    1.5    2.0     2.5
Y:   0.10   0.45   2.15   9.15   40.35   180.75
Solution
The curve to be fitted is $Y = ae^{bX}$, or $y = A + BX$, where $y = \log_{10} Y$, $A = \log_{10} a$ and $B = b\log_{10} e$.
The normal equations are
$$\sum y = 6A + B\sum X$$
$$\sum Xy = A\sum X + B\sum X^2$$
Table construction:
  X      Y        y = log10 Y   X^2     Xy
  0      0.10     -1            0       0
  0.5    0.45     -0.3468       0.25    -0.1734
  1.0    2.15     0.3324        1.0     0.3324
  1.5    9.15     0.9614        2.25    1.4421
  2.0    40.35    1.6058        4.0     3.2116
  2.5    180.75   2.2571        6.25    5.6428
$\sum X = 7.5$, $\sum y = 3.8099$, $\sum X^2 = 13.75$, $\sum Xy = 10.4555$
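One way to carry the calculation through from these totals is sketched below: solve the two normal equations for A and B, then transform back to a and b. This is an illustration under the log-linear setup stated above, with illustrative variable names; note that it minimizes squared errors in $\log_{10} Y$, not in Y itself.

```python
import math

X = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
Y = [0.10, 0.45, 2.15, 9.15, 40.35, 180.75]

# Linearise: y = log10(Y) = A + B X, with A = log10(a) and B = b * log10(e).
y = [math.log10(v) for v in Y]
n = len(X)
sum_x, sum_y = sum(X), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(X, y))
sum_x2 = sum(xi * xi for xi in X)

# Solve the two normal equations for A and B.
B = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
A = (sum_y - B * sum_x) / n

# Transform back to the original parameters.
a = 10 ** A
b = B / math.log10(math.e)
print(f"A = {A:.4f}, B = {B:.4f}")
print(f"a = {a:.4f}, b = {b:.4f}")
```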
  X     Y       y = log10 Y   X^2   Xy
  2     8.3     0.9191        4     1.8382
  3     15.4    1.1872        9     3.5616
  4     33.1    1.5198        16    6.0792
  5     65.2    1.8142        25    9.0710
  6     127.4   2.1052        36    12.6312
$\sum X = 20$, $\sum y = 7.5455$, $\sum X^2 = 90$, $\sum Xy = 33.1812$
an independent variable.
Example (9)
Given the following sets of values:
X 6.5 5.3 8.6 1.2 4.2 2.9 1.1 3.9
Y 3.2 2.7 4.5 1.0 2.0 1.7 0.6 1.9
(b) Compute the least squares regression equation for Y values on X values.
(c) Compute the least squares regression equation for X values on Y values.
Example (10)
For each of the following data, determine the estimated regression equation Y = a + bX:
(a) Y = 20, X = 10, XY = 1000, X2 = 2000, n = 10.
(b) X = 528, Y = 11720, XY = 193640, X2 = 11440, n = 32
Example (11)
For the following set of data: (AIOU)
  X     Y     XY    X^2
  5     7     35    25
  3     7     21    9
  3     6     18    9
  1     4     4     1
$\sum X = 12$, $\sum Y = 24$, $\sum XY = 78$, $\sum X^2 = 44$
Y = 3.75 + 0.75 X
Example (13)
Cost accountants often estimate overhead based on the level of production. At the Standard Knitting Co., they have collected information on overhead expenses and units produced at different plants, and want to estimate a regression equation to predict future overhead. (AIOU)
Overhead 191 170 272 155 280 173 234 116 153 178
Units 40 42 53 35 56 39 48 30 37 40
The coefficient of determination, which measures the proportion of variability in the values of the dependent variable (Y) associated with its linear relation with the independent variable (X), is defined by:
$$r^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{\sum(\hat{Y} - \bar{Y})^2}{\sum(Y - \bar{Y})^2} = 1 - \frac{\sum(Y - \hat{Y})^2}{\sum(Y - \bar{Y})^2}$$
Alternate formula for the coefficient of determination:
$$r^2 = \frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}$$
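The three expressions for $r^2$ can be checked against each other numerically. The sketch below does this for the potato data used earlier in these notes; the variable names are illustrative.

```python
x = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5]
y = [25, 31, 27, 28, 36, 35, 32, 34]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
sum_y2 = sum(yi * yi for yi in y)
y_bar = sum_y / n

# Least-squares line Y-hat = a + bX.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = y_bar - b * sum_x / n
fitted = [a + b * xi for xi in x]

total = sum((yi - y_bar) ** 2 for yi in y)
explained = sum((fi - y_bar) ** 2 for fi in fitted)
unexplained = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

r2_explained = explained / total            # explained / total variation
r2_residual = 1 - unexplained / total       # 1 - unexplained / total
r2_shortcut = (a * sum_y + b * sum_xy - n * y_bar ** 2) / (sum_y2 - n * y_bar ** 2)
print(round(r2_explained, 4), round(r2_residual, 4), round(r2_shortcut, 4))
```

All three values agree, so roughly half of the variation in yield is associated with the linear relation to fertilizer in this data set.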
The correlation coefficient r assumes values that range from +1 for a perfect positive linear relationship to -1 for a perfect negative linear relationship; r = 0 indicates no linear relationship between X and Y.
It is important to note that r = 0 does not mean that there is no relationship at all. For example, if all the observed values lie exactly on a circle, there is a perfect non-linear relationship between the variables.
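The circle case is easy to verify. In the sketch below, twelve points chosen (as an illustrative assumption) to lie exactly on a circle of radius 5 give r = 0, while three points on a straight line give r = 1; `pearson_r` is an illustrative helper name.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation coefficient."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = sxy - sx * sy / n
    den = math.sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))
    return num / den

# Twelve integer points lying exactly on the circle x^2 + y^2 = 25.
circle = [(5, 0), (4, 3), (3, 4), (0, 5), (-3, 4), (-4, 3),
          (-5, 0), (-4, -3), (-3, -4), (0, -5), (3, -4), (4, -3)]
cx = [p[0] for p in circle]
cy = [p[1] for p in circle]
r_circle = pearson_r(cx, cy)

# Three points on the straight line y = 2x for comparison.
r_line = pearson_r([1, 2, 3], [2, 4, 6])
print(f"circle: r = {r_circle}, line: r = {r_line}")
```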
Rank Correlation
Sometimes the actual measurements of individuals or objects are either not available, or accurate assessment is not possible. They are then arranged in order according to some characteristic of interest. Such an ordered arrangement is called a ranking, and the order given to an individual or object is called its rank. The correlation between two such sets of rankings is called rank correlation.
Suppose we have n pairs of ranks, $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$, obtained by ranking two data sets with respect to some characteristic. Since both the $x_i$ and the $y_i$ are the first n natural numbers, we have
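$$\sum x_i = \sum y_i = 1 + 2 + \cdots + n = \frac{n(n+1)}{2}$$
$$\sum x_i^2 = \sum y_i^2 = 1^2 + 2^2 + \cdots + n^2 = \frac{n(n+1)(2n+1)}{6}$$
$$\sum (x_i - \bar{x})^2 = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = \frac{n(n+1)(2n+1)}{6} - \frac{n(n+1)^2}{4} = \frac{n(n^2-1)}{12}$$
Let $d_i = x_i - y_i$
Then
$$\sum d_i^2 = \sum (x_i - y_i)^2 = \sum x_i^2 + \sum y_i^2 - 2\sum x_iy_i = \frac{n(n+1)(2n+1)}{6} + \frac{n(n+1)(2n+1)}{6} - 2\sum x_iy_i$$
so that
$$\sum x_iy_i = \frac{n(n+1)(2n+1)}{6} - \frac{1}{2}\sum d_i^2$$
The product moment coefficient of correlation is:
$$r = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sqrt{\left[\sum X^2 - (\sum X)^2/n\right]\left[\sum Y^2 - (\sum Y)^2/n\right]}}$$
By substitution we have
$$r = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$$
This coefficient also ranges from -1 to +1.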
Note
If two objects or observations are tied (have the same value), say for fourth and fifth place, then both are given the mean of ranks 4 and 5, i.e. 4.5.
This situation is illustrated in the following example.
Example (15)
The following table shows the number of hours studied (X) by a random sample of ten students
and their grades in examination (Y):
X: 8 5 11 13 10 5 18 15 2 8
Y: 56 44 79 72 70 54 94 85 33 65
Calculate Spearman's rank correlation coefficient.
Solution
We rank the X values by giving rank 1 to the highest value 18, rank 2 to 15, rank 3 to 13, rank 4 to 11, rank 5 to 10, rank 6.5 (the mean of ranks 6 and 7) to both 8s, rank 8.5 (the mean of ranks 8 and 9) to both 5s, and rank 10 to 2. Similarly, we rank the values of Y by giving rank 1 to the highest value 94, rank 2 to 85, rank 3 to 79, ..., and rank 10 to 33, which is the smallest.
The calculations are given in the table below:
  X     Y     Rank of X   Rank of Y   d_i     d_i^2
  8     56    6.5         7           -0.5    0.25
  5     44    8.5         9           -0.5    0.25
  11    79    4           3            1.0    1
  13    72    3           4           -1.0    1
  10    70    5           5            0.0    0
  5     54    8.5         8            0.5    0.25
  18    94    1           1            0.0    0
  15    85    2           2            0.0    0
  2     33    10          10           0.0    0
  8     65    6.5         6            0.5    0.25
$\sum d_i^2 = 3$
The value of n is 10.
Hence
$$r = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6(3)}{10(10^2-1)} = 0.98$$
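The ranking with ties and the resulting coefficient can be reproduced with the sketch below. The function names `average_ranks` and `spearman` are illustrative; the code uses the shortcut formula from the derivation above, which with tied ranks is the approximation these notes adopt rather than the product moment correlation of the ranks.

```python
def average_ranks(values):
    """Rank values with 1 = largest; tied values share the mean of the
    ranks they would otherwise occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def spearman(xs, ys):
    """Return (r, sum of d^2) using r = 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2

hours = [8, 5, 11, 13, 10, 5, 18, 15, 2, 8]
grades = [56, 44, 79, 72, 70, 54, 94, 85, 33, 65]
r_s, sum_d2 = spearman(hours, grades)
print(average_ranks(hours))
print(f"sum of d^2 = {sum_d2}, r = {r_s:.2f}")
```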
Compare this value with the correlation coefficient for the original values.
Example (16)
Ten competitors in a beauty contest are ranked by three judges in the following order
1st Judge 1 6 5 10 3 2 4 9 7 8
2nd Judge 3 5 8 4 7 10 2 1 6 9
3rd Judge 6 4 9 8 1 2 3 10 5 7
Use the rank correlation coefficient to discuss which pair of judges have the nearest approach to
common tastes in beauty.
Example (17)
The body mass index (BMI) for an individual is found as follows. Using your weight in pounds and your height in inches, multiply your weight by 705, divide the result by your height, and divide again by your height. The desirable body mass index varies between 19 and 25. The table gives the body mass index and the age for 20 individuals. Find the coefficient of rank correlation for the data shown in the table.
Example (18)
The table gives the percent of calories from fat and the micrograms of lead per deciliter of blood for a sample of preschoolers. Find the coefficient of rank correlation for the data shown in the table.
Assignment (2)
Given the following data:
X1: 41 31 26 43 21 33 41 31 46 31 36 32 38 27 35 40
X2: 62 52 50 56 51 52 63 50 47 40 56 54 60 57 57 58
X3: 41 33 33 38 35 36 43 33 37 33 33 31 30 37 35 31
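$$r_{13} = \frac{n\sum X_1X_3 - \sum X_1\sum X_3}{\sqrt{\left[n\sum X_1^2 - (\sum X_1)^2\right]\left[n\sum X_3^2 - (\sum X_3)^2\right]}}$$
$$r_{23} = \frac{n\sum X_2X_3 - \sum X_2\sum X_3}{\sqrt{\left[n\sum X_2^2 - (\sum X_2)^2\right]\left[n\sum X_3^2 - (\sum X_3)^2\right]}}$$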
(d) The correlation coefficient between X1 and X2 when the effect of X3 is held constant.