CORRELATION AND REGRESSION ANALYSIS
Structure
8.0 Objectives
8.1 Introduction
8.2 Correlation
8.2.1 Concept
8.2.2 Correlation and Independence
8.2.3 Nonsense Correlation
8.3 Regression
8.3.1 Concept
8.3.2 Correlation and Regression
8.3.3 Simple Regression
8.3.4 Multiple Regression
8.5 Let Us Sum Up
8.6 Exercises
8.7 Key Words
8.8 Some Useful Books
8.9 Answers or Hints to Check Your Progress
8.10 Answers or Hints to Exercises
8.0 OBJECTIVES
8.2 CORRELATION
8.2.1 Concept
In the introduction to this chapter, we have already referred to a mathematical model
of some real-world observable economic phenomenon. In general, a model consists
of some functional relationships, some equations, some identities and some constraints.
Once a model like this is formulated, the next issue is to examine how this model
works in the real-world situation, for example, in India. This is what is known as the
estimation of an econometric model. It may be mentioned here that Lawrence Klein
did some pioneering work in the formulation and estimation of such models. In fact,
many complex econometric models consisting of hundreds of functions, equations,
identities and constraints have been constructed and estimated for different economies
of the world, including India, by using empirical data.
The estimation of such complete macro-econometric models, however, involves
certain issues that are beyond our scope. As a result, we shall abstract from such
kind of a model and focus on a single equation economic relationship and consider
its empirical verification. For example, in the Keynesian model of income determination,
the consumption function plays a pivotal role. The essence of this relationship is that
consumption depends on income. We may specify a simple consumption function in
the form of a linear equation with two constraints: one, the autonomous part of
consumption being positive and two, the marginal propensity to consume being more
than zero but less than one.

Thus, our consumption equation is

C = a + bY, with a > 0 and 0 < b < 1

where C is consumption, Y is income, a is autonomous consumption and b is the
marginal propensity to consume.
This kind of a single equation and its estimation is commonly known as the regression
model in the econometric literature. It may be mentioned here that such a single-
equation regression model need not be a part of any econometric model and can be
a mathematical formulation of some independently observed economic phenomenon.
Any scientific inquiry has to be conducted systematically, and economic inquiry is
no exception. In the case of our regression model involving consumption and income,
for example, a preliminary step may be to examine whether, in the real-world situation,
there exists any relationship between consumption and income at all. This is precisely
what we attempt with the help of the concept of correlation. Thus, at the moment,
we are not concerned with the issue of dependence of consumption on income or
vice-versa. We are simply interested in the possible co-movement of the two
variables. We shall focus on the difference between correlation and regression later.
Correlation can be defined as a quantitative measure of the degree or strength of
relationship that may exist between two variables. You are already familiar with the
concept of Karl Pearson's coefficient of correlation. If X and Y are two variables,
we know that this correlation coefficient is given by the ratio of the covariance
between X and Y to the product of the standard deviation of X and that of Y. In
symbols:

r = Cov(X, Y) / (σ_X σ_Y)
The symbols have their usual meaning. Here, the covariance in the numerator is important.
This, in fact, gives a measure of the simultaneous change in the two variables. It is
divided by the product of the standard deviations of X and Y to make the measure free
of any unit, in order to facilitate a comparison between more than one set of
bi-variate data which may be expressed in different units. It may be noted here that
this measure of the correlation coefficient is independent of a shift in the origin and a
change of scale. The correlation coefficient lies between +1 and -1. In symbols:

-1 ≤ r ≤ +1
If the two variables tend to move in the same direction, the correlation coefficient is
positive. In the event of the two variables tending to move in the opposite directions,
the correlation coefficient assumes a negative value. In the case of a perfect
correlation, the correlation coefficient is either +1 or -1, which is almost
impossible in economics. When there does not seem to be any relationship between
the two variables on the basis of the available data, the correlation coefficient may
assume a value equal to zero.
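The computation above can be sketched in a few lines of code. The income and consumption figures below are purely hypothetical, made up for illustration; the function simply implements the ratio of the covariance to the product of the standard deviations, and the last two lines check the unit-free property mentioned above.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient: Cov(X, Y) / (sd(X) * sd(Y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / n)
    return cov / (sx * sy)

income = [100, 120, 140, 160, 180]       # hypothetical data
consumption = [80, 95, 105, 125, 135]

r = pearson_r(income, consumption)
print(round(r, 4))

# Invariance: shifting the origin and changing the scale of a variable
# leave the correlation coefficient unchanged.
income_rescaled = [(xi - 50) / 10 for xi in income]
print(round(pearson_r(income_rescaled, consumption), 4))
```

Both printed values coincide, illustrating that r depends only on the co-movement of the two variables, not on the units in which they are measured.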
It should be noted here that Karl Pearson's correlation coefficient measures linear
correlation between two variables. This means that there exists a proportional
relationship between the two variables, i.e., the two variables change in a fixed
proportion. For example, we may find that the correlation coefficient between
disposable income and personal consumption expenditure in India, on the basis of some
national income data, is 0.7. It only means that there is a fairly strong positive linear
association between income and consumption. We again stress here that
at the moment we are not commenting on whether income is the independent variable
and consumption is the dependent variable or it is the other way round.
It is important here to comment on what is known as the coefficient of determination. Although
it is numerically equal to the square of the correlation coefficient, conceptually it is quite
different from the correlation coefficient. We shall discuss this concept in detail in the
next unit.
Example 8.1

If three uncorrelated variables X1, X2 and X3 have the same standard deviation, find the
correlation coefficient between X1 + X2 and X2 + X3.

Solution: Let σ be the common standard deviation. Since the variables are uncorrelated,
Cov(Xi, Xj) = 0 for i ≠ j. Therefore,

Cov(X1 + X2, X2 + X3) = Cov(X1, X2) + Cov(X1, X3) + Var(X2) + Cov(X2, X3) = σ²

Var(X1 + X2) = Var(X1) + Var(X2) = 2σ², and similarly Var(X2 + X3) = 2σ².

Hence, r = σ² / (√(2σ²) √(2σ²)) = σ² / 2σ² = 1/2.
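A quick simulation can corroborate the value of 1/2. In this sketch, three independent (and hence uncorrelated) standard normal samples stand in for X1, X2 and X3; the sample correlation of X1 + X2 and X2 + X3 settles near 0.5 as the sample grows.

```python
import random

random.seed(0)
n = 200_000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]

u = [a + b for a, b in zip(x1, x2)]   # X1 + X2
v = [b + c for b, c in zip(x2, x3)]   # X2 + X3

def pearson_r(x, y):
    """Sample correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return cov / (vx * vy) ** 0.5

print(round(pearson_r(u, v), 3))   # close to 0.5
```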
2) Explain why two independent variables have zero correlation, while the converse
is not true.
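The second half of the question can be illustrated with a tiny numerical example. Below, Y is an exact function of X, so the two are clearly not independent; yet the symmetry of the values of X makes the covariance, and hence the correlation, exactly zero.

```python
# X takes the values -2, -1, 0, 1, 2 with equal weight; Y = X^2 is completely
# determined by X, yet Cov(X, Y) = 0 because the distribution of X is symmetric.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

mx = sum(x) / len(x)          # 0
my = sum(y) / len(y)          # 2
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)
print(cov)   # 0.0 -> zero correlation, though Y depends on X exactly
```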
8.3 REGRESSION
8.3.1 Concept
The term regression literally means a backward movement. Francis Galton first used
the term in the late nineteenth century. He studied the relationship between the height
of parents and that of children. Galton observed that although tall parents had tall
children and, similarly, short parents had short children in a statistical sense, in
general the children's height tended towards an average value. In other words, the
children's height moved backward or regressed to the average. However, the
term regression in statistics now has nothing to do with its earlier connotation of a
backward movement.
Regression analysis can be described as the study of the dependence of one variable
on another or more variables. In other words, we can use it for examining the
relationship that may exist among certain variables. For example, we may be interested
in issues like how the aggregate demand for money depends upon the aggregate
income level in an economy. We may employ the regression technique to examine this.
Here, the aggregate demand for money is called the dependent variable and the aggregate
income level is called the independent variable. Consequently, we have a simple
demand for money function. In this context, we present the following table to show
some of the terms that are also used in the literature in place of dependent variable
and independent variable.
Table 8.1: Classifying Terms for Variables in Regression Analysis

Dependent Variable      Independent Variable
Explained Variable      Explanatory Variable
Regressand              Regressor
Predictand              Predictor
Endogenous Variable     Exogenous Variable
Controlled Variable     Control Variable
Target Variable         Control Variable
Response Variable       Stimulus Variable

Source: Maddala (2002) and Gujarati (2003).
It is now important to clarify that the terms dependent and independent do not
necessarily imply a causal connection between the two types of variables. Thus,
regression analysis per se is not really concerned with causality analysis. A causal
connection has to be established first by some theory that is outside the parlance of
the regression analysis. In our earlier example of the consumption function and the present
example of the demand for money function, we have theories like the Keynesian income
hypothesis and the transaction demand for money. On the basis of such theories, perhaps
we can employ the regression technique to get some preliminary idea of some causal
connection involving certain variables. In fact, causality study is now a highly specialized
branch of econometrics and goes far beyond the scope of the ordinary regression
analysis.
A major purpose of regression analysis is to predict the value of one variable given
the value of another or more variables. Thus, we may be interested in predicting the
aggregate demand for money from a given value of aggregate income.
We should be clear that, by virtue of the very nature of economics and other branches
of social science, the concern is a statistical relationship involving some variables
rather than an exact mathematical relationship as we may obtain in natural science.
Consequently, if we are able to establish some kind of a relationship between an
independent variable X and a dependent variable Y, it can be expected to give us
some sort of an average value of Y for a given value of X. This kind of a relationship is
known as a statistical or stochastic relationship. The regression method is essentially
concerned with the analysis of such kind of a stochastic relationship.
From the above discussion, it should be clear that in our context, the dependent
variable is assumed to be stochastic or random. In contrast, the independent variables
are taken to be non-stochastic or non-random. However, we must mention here that
at an advanced level, even the independent variables are assumed to be stochastic.
In the next unit, we shall discuss the stochastic nature of the regression analysis in
detail.
If a regression relationship has just one independent variable, it is called a two
variable or simple regression. On the other hand, if we have more than one
independent variable in it, then it is a multiple regression.
8.3.2 Correlation and Regression

Earlier we made a reference to the conceptual difference between correlation and
regression. We may discuss it here. In regression analysis, we examine the nature of
the relationship between the dependent and the independent variables. Here, as
stated earlier, we try to estimate the average value of one variable from the given
values of other variables. In correlation, on the other hand, our focus is on the
measurement of the strength of such a relationship. Consequently, in regression, we
classify the variables in two classes of dependent and independent variables. In
correlation, the treatment of the variables is rather symmetric; we do not have such
kind of a classification. Finally, in regression, at our level, we take the dependent
variable as random or stochastic and the independent variables as non-random or
fixed. In correlation, in contrast, all the variables are implicitly taken to be random in
nature.
8.3.3 Simple Regression

Here, we are focusing on just one independent variable. The first thing that we have
to do is to specify the relationship between X and Y. Let us assume that there is a
linear relationship between the two variables like:

Y = a + bX

If we plot the observed pairs of values of X and Y, we obtain a scatter-plot.

[Fig. 8.1: Scatter-plot of Y against X]
A visual inspection of the scatter-plot makes it clear that for different values of X,
the corresponding values of Y are not aligned on a straight line. As we have mentioned
earlier, in regression, we are concerned with an inexact or statistical relationship,
and this is the consequence of such a relationship. Now, the constants a and b are
respectively the intercept and slope of the straight line described by the above-
mentioned linear equation, and several straight lines with different pairs of the values
(a, b) can be passed through the above scatter. Our concern is the choice of a
particular pair as the estimates of a and b for the regression equation under
consideration. Obviously, this calls for an objective criterion.
Such a criterion is provided by the method of least squares. The philosophy behind
the least squares method is that we should fit a straight line through the scatter-
plot in such a manner that the vertical differences between the observed values of Y
and the corresponding values obtained from the straight line for different values of
X, called errors, are minimum. The line fitted in such a fashion is called the regression
line. The values of a and b obtained from the regression line are taken to be the
estimates of the intercept and slope (regression coefficient) of the regression equation.
The values of Y obtained from the regression line are called the estimated values of Y. A
stylized scatter-plot with a straight line fitted in it is presented below:

[Fig.: Scatter-plot with fitted regression line]
The method of least squares requires that we should choose our a and b in such a
manner that the sum of the squares of the vertical differences between the actual
or observed values of Y and the ones obtained from the straight line is minimum.
Putting it mathematically, we minimize

S = Σ(Y - a - bX)²

Setting the partial derivatives of S with respect to a and b equal to zero, we obtain
the two normal equations:

ΣY = na + bΣX

and

ΣXY = aΣX + bΣX²

After solving the two equations simultaneously, we obtain the least square estimates

b̂ = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)²

and

â = Ȳ - b̂X̄
The equation Ŷ = â + b̂X is in fact called the regression of Y on X; the slope b̂ of
this equation is termed the regression coefficient of Y on X. It is also denoted by
b_YX. A glance at the expression of the regression coefficient of Y on X makes it
quite clear that the above expression can also be written as

b̂ = Cov(X, Y) / Var(X) = r (σ_Y / σ_X)
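The least squares formulas above translate directly into code. The sketch below computes b̂ and â from first principles, using the output and profit data given in Exercise 4 of Section 8.6.

```python
def ols_y_on_x(x, y):
    """Least squares estimates for Y = a + bX:
    b = sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
    a = mean(Y) - b * mean(X)
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

# Data from Exercise 4: output (X) and profit per unit (Y)
output = [5, 7, 9, 11, 13, 15]
profit = [1.70, 2.40, 2.80, 3.40, 3.70, 4.40]

a, b = ols_y_on_x(output, profit)
print(round(a, 2), round(b, 3))   # prints 0.5 0.257
```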
Reverse Regression

Suppose, in another regression relationship, X acts as the dependent variable and Y
as the independent variable. Then that relationship is called the regression of X on Y.
Here, we should definitely avoid the temptation of expressing X in terms of Y from
the regression equation of Y on X to obtain that of X on Y and trying to mechanically
extract the least square estimates of its constants from the already known values of
â and b̂. The regression of X on Y is in fact intrinsically different from that of Y on X.
Geometrically speaking, in the regression of X on Y, we minimize the sum of the squares
of the horizontal distances, as against the minimization of the sum of the squares of
the vertical distances in Y on X, for obtaining the least square estimates. If our
regression equation of X on Y is given by

X = a' + b'Y
By applying the usual minimization procedure, we obtain the following two normal
equations:

ΣX = na' + b'ΣY

and

ΣXY = a'ΣY + b'ΣY²

We can simultaneously solve these two equations to get the least square estimates

b̂' = Σ(X - X̄)(Y - Ȳ) / Σ(Y - Ȳ)²

and

â' = X̄ - b̂'Ȳ
The slope b' of the regression of X on Y is called the regression coefficient of X
on Y. It measures the rate of change of X with respect to Y. In order to distinguish it
clearly from the regression coefficient of Y on X, we also use the symbol b_XY for it.
Putting in the values of â' and b̂', the regression equation of X on Y can be written as

X̂ = X̄ + b̂'(Y - Ȳ)
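To see that the regression of X on Y is genuinely different from an algebraic inversion of the regression of Y on X, compare the two slopes numerically. The data are the same illustrative output-profit pairs used earlier; both slopes share the numerator Σ(X - X̄)(Y - Ȳ) but have different denominators.

```python
def regression_slopes(x, y):
    """Return (b_yx, b_xy): the slopes of Y on X and of X on Y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / syy

x = [5, 7, 9, 11, 13, 15]
y = [1.7, 2.4, 2.8, 3.4, 3.7, 4.4]

b_yx, b_xy = regression_slopes(x, y)
print(round(b_yx, 4), round(b_xy, 4))

# Mechanically inverting the regression of Y on X would give the slope
# 1 / b_yx, which differs from the least squares slope b_xy unless the
# correlation is perfect (r^2 = 1).
print(round(1 / b_yx, 4), "!=", round(b_xy, 4))
```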
Properties

Let us now briefly consider some of the properties of regression.

1) The product of the two regression coefficients is always equal to the square of
the correlation coefficient:

b_YX · b_XY = r²
2) The two regression coefficients have the same sign. In fact, the sign of the two
coefficients depends upon the sign of the correlation coefficient. Since the standard
deviations of both X and Y are, by definition, positive, if the correlation coefficient is
positive, both the regression coefficients are positive and, similarly, if the correlation
coefficient happens to be negative, both the regression coefficients become negative.
3) The two regression lines always intersect each other at the point (X̄, Ȳ).
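These properties are easy to verify numerically. With any made-up data set (the one below is hypothetical), the product of the two regression coefficients reproduces r², and both fitted lines pass through the point of means.

```python
import math

x = [2, 4, 5, 7, 9]          # hypothetical data for illustration
y = [3, 5, 4, 8, 10]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

b_yx, b_xy = sxy / sxx, sxy / syy
r = sxy / math.sqrt(sxx * syy)

# Property 1: the product of the regression coefficients equals r^2.
print(math.isclose(b_yx * b_xy, r ** 2))   # True

# Property 3: both fitted lines pass through (mean(X), mean(Y)).
a_yx = my - b_yx * mx          # Y on X:  Y = a_yx + b_yx * X
a_xy = mx - b_xy * my          # X on Y:  X = a_xy + b_xy * Y
print(math.isclose(a_yx + b_yx * mx, my),
      math.isclose(a_xy + b_xy * my, mx))   # True True
```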
Let Y be yield and X be rainfall. So, for estimating the yield, we have to run the
regression of Y on X, and for the purpose of estimating the rainfall, we have to use
the regression of X on Y.
..................................................................................................................

3) What is the distinction between Time Series Data and Cross Section Data?

..................................................................................................................
8.5 LET US SUM UP

Regression models occupy a central place in empirical economic analysis. These
models are essentially based on the concepts of correlation and regression. Correlation
is a quantitative measure of the strength of the linear relationship that may exist
among some variables. The existence of a high degree of correlation, however, is
not necessarily the evidence of a meaningful relationship. It only suggests that the
data are not inconsistent with the possibility of such kind of a relationship. Regression,
on the other hand, focuses on the direction of a linear relationship. Here, one is
concerned with the dependence of one variable on other variables. Regression, in
itself, does not suggest any causal relationship. Correlation and regression are both
concerned with a statistical or stochastic relationship, as against a mathematical or an
exact relationship. In the conventional regression analysis, the dependent variable is
treated as stochastic or random, whereas the independent variables are taken to
be non-stochastic in nature. The constants of a regression equation are estimated
from the empirical observations by using the least squares technique. In a two variable
regression equation, there is one dependent variable and one independent variable.
The slope coefficient of a regression equation is called the regression coefficient. It
measures the rate of change of the dependent variable with respect to the independent
variable. The distinction between the concept of direct regression and that of
reverse regression is crucial in regression analysis. Sometimes, by running both
kinds of regression, important insight can be gained in empirical economic
analysis. In multiple regression, there are at least two independent variables.
Finally, in regression analysis, three types of data, namely, time series, cross section
and pooled, can be used.
8.6 EXERCISES
1) Prove that the correlation coefficient lies between -1 and +1.
2) Show that the correlation coefficient is unaffected by a shift in the origin and a
change of scale.
3) For the regression equation of Y on X, derive the least square estimators of the
parameters. Try and work out the same for the regression equation of X on Y.
4) From the following data, derive that regression equation which you consider to
be economically more meaningful. Give justification for your choice.

Output          5     7     9     11    13    15
Profit per unit 1.70  2.40  2.80  3.40  3.70  4.40
5) To study the effect of rain on the yield of wheat, the following results were obtained:

                        Mean    Standard Deviation
Yield in kg per acre    800     12
Rainfall in inches      50      2

The correlation coefficient is 0.80.
Estimate the yield when rainfall is 80 inches.
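For this kind of problem, the regression of yield (Y) on rainfall (X) can be set up from the summary statistics alone, since b_YX = r(σ_Y/σ_X) and the regression line passes through the point of means. A minimal sketch:

```python
# Summary statistics given in Exercise 5
mean_y, sd_y = 800, 12      # yield in kg per acre
mean_x, sd_x = 50, 2        # rainfall in inches
r = 0.80

b_yx = r * sd_y / sd_x                        # regression coefficient of Y on X
estimate = mean_y + b_yx * (80 - mean_x)      # predicted yield at 80 inches
print(round(estimate))   # 944
```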
8.7 KEY WORDS

Coefficient of Determination : It is equal to the square of the correlation coefficient.

Correlation : It is a quantitative measure of the strength of the
relationship that may exist among certain variables.

Cross Section Data : In cross section data, we have observations for a
variable for different units at the same point of time.

Econometrics : It is described as the application of statistical tools in
the quantitative analysis of economic phenomena.

Mathematical Model : The mathematical form of some economic theory is
what is generally called a mathematical model.

Method of Least Squares : It is the method of estimating the parameters of a
regression equation in such a fashion that the sum of
the squares of the differences between the actual
or observed values of the dependent variable
and their estimated values from the regression
equation is minimum.

Multiple Regression : It is a regression equation with more than one
independent variable.

Nonsense Correlation : The presence of correlation between two
variables when there does not exist any meaningful
relationship between them is known as nonsense
correlation.

Pooled Data : In pooled data, we have time series observations
for various cross sectional units. Here, we combine
the element of time series with that of cross section
data.

Regression Equation : It is the equation that specifies the relationship
between the dependent and the independent
variables for the purpose of estimating the constants
or the parameters of the equation with the help of
empirical data on the variables.

Regression : It is a statistical analysis of the nature of the
relationship between the dependent and the
independent variables.

Reverse Regression : It is an independent estimation of a new regression
equation when the independent variable of the original
equation is changed into the dependent variable and
the dependent variable of the original equation is
changed into the independent variable.

Time Series Data : It is a series of the values of a variable obtained at
different points of time.

Two Variable Regression : It is a regression equation with one independent
variable.
8.10 ANSWERS OR HINTS TO EXERCISES

3) Do yourself.
4) Y = 0.257X + 0.50
5) 944 kg per acre.