
MULTIPLE REGRESSION

ANALYSIS
By: Amisha
Pragya
Harshit
Pratishta
Shivangi
Pooja
Akash
DISCLAIMER!!
› All facts, ideas and thoughts mentioned throughout the
ppt are definitely not our own. They have been picked
from google baba and various books with long
unpronounceable names. So if you want to blame
anyone, BLAME THE MATH GENIUSES who are putting us
through this ordeal.
› Also this presentation will be long, so GIDDY UP!!!!!!!!
WHAT IS REGRESSION?
›Correlation: change in one variable corresponding with change in another
variable.
›Regression is a statistical measurement used in social science, finance,
investing and other disciplines that attempts to determine the strength of the
relationship between one dependent variable (usually denoted by Y) and a
series of other changing variables (known as independent variables).
›The term regression was first used by Francis Galton.
›The dictionary meaning of the word regression is ‘stepping back’ or ‘going
back’.
›Regression is the measure of the average relationship between two or more
variables in terms of the original units of the data.

›Regression analysis is used to:
- understand the relationship between variables.
- predict the value of one variable based on another variable.
TYPES OF REGRESSION
› SIMPLE REGRESSION :
- The criterion or dependent variable is a function of a single independent variable or predictor.
- i.e. one DV and one IV.
- e.g., the regression of examination marks of a candidate in mathematics on his or her numerical aptitude test score.
› MULTIPLE REGRESSION :
- The criterion is a function of two or more predictors.
- i.e. one DV and many IVs.
- e.g., the regression of mathematics marks of an examinee on his or her numerical aptitude and abstract reasoning test scores.
TYPES OF REGRESSION (Cont.)
› CANONICAL REGRESSION:
› It is a way of inferring information from cross-covariance matrices.
› i.e. many DVs and many IVs.
LINEAR REGRESSION
› It deals with making predictions about a variable on the basis of its relationship with other variables.
› The model we fit in regression is linear:
Outcome = model + error
This general equation is replaced by the equation of a line, i.e.
Y(pre) = (A + BX) + E
Where Y(pre) = the predicted value of the dependent variable Y
A = value of the intercept (where the line crosses the Y axis, i.e. when X = 0)
B = raw-score (unstandardised) regression coefficient, the slope of the line
X = value of the independent variable
E = error score
[Figure: regression line for Y(pre) = (A + BX) + E, with slope B]
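A minimal numeric sketch of this line-fitting idea in Python (the data, names and values below are hypothetical, used only for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)                      # hypothetical predictor scores
y = 3.0 + 2.0 * x + rng.normal(size=50)      # hypothetical criterion scores with error

b, a = np.polyfit(x, y, deg=1)               # B (slope) and A (intercept) of the best-fitting line
y_pre = a + b * x                            # predicted values, Y(pre) = A + BX
e = y - y_pre                                # error scores, E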
MULTIPLE REGRESSION
› Multiple regression analysis is a powerful technique used for predicting the
unknown value of a variable from the known value of two or more variables.
› The variable whose value is to be predicted is known as the dependent variable, also called the criterion variable.
› The variables whose known values are used for prediction are known as independent (explanatory) variables, also called predictors.
Multiple regression allows us to:
› Use several variables at once to explain the variation in a continuous dependent
variable.
› Isolate the unique effect of one variable on the continuous dependent variable
while taking into consideration that other variables are affecting it too.
› It expresses the criterion variable as a function of the predictor variables.
› The predictor variables can be “fixed” treatment variables or classification variables.
› For example:
• Null Hypothesis: There is no relationship between education of respondents and the number of children in families.
• Null Hypothesis: There is no relationship between family income and the number of children in families.
For Example :
› Research Hypothesis : As education of respondents increases,
the number of children in families will decline (negative
relationship).
› Research Hypothesis : As family income of respondents
increases, the number of children in families will decline
(negative relationship).
PURPOSE:
› Describe: Develop a model to describe the relationship between
the explanatory variables and the response variable.
› Predict: Use a set of sample data to make predictions.
› Confirm: theories are often developed about individual variables; the analysis can confirm which variables, or combinations of variables, need to be included in the model.
Unstandardized regression equation
● The regression equation for raw scores is essentially an expansion of the simple linear regression equation. The equation is as follows:

Y(pre) = a + b1X1 + b2X2 + ... + bnXn

a = constant/intercept value
bn = partial regression coefficients
Xn = predictor variables
Standardized Regression Equation
● Based on the standard score.
● The weights of the IVs in the unstandardized regression equation are in different units.
● In order to standardize those units, we use standard scores, which are free of the original units.
● Standard score: Z = (X - X̄) / S.D.
● Y(pre) = β1X1 + β2X2 + β3X3 + ... + βnXn (all scores in standard-score form)
● Here, a one-unit (one standard deviation) change in an IV is associated with a change of β standard scores in the DV.
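A minimal Python sketch (hypothetical data; statsmodels and scipy assumed available) showing how the raw-score coefficients and the standardized β weights are obtained and how they relate:

import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(loc=50, scale=10, size=(100, 3))              # three hypothetical predictors
y = 4 + 0.6 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=100)

# Raw-score (unstandardized) equation: Y(pre) = a + b1X1 + b2X2 + b3X3
raw_fit = sm.OLS(y, sm.add_constant(X)).fit()
a, b = raw_fit.params[0], raw_fit.params[1:]

# Standardized equation: z-score the criterion and predictors, then the weights are the betas
std_fit = sm.OLS(stats.zscore(y), stats.zscore(X)).fit()
betas = std_fit.params                                       # same as b * X.std(axis=0) / y.std()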
PARTIAL CORRELATION
AND

SEMI-PARTIAL
CORRELATION
Partial Correlation
 A direct procedure for controlling the effect of a third variable on a correlation is partial correlation.
 It allows the researcher to measure the relationship between two variables while eliminating or holding constant the effect of the third variable.
With three variables X, Y and Z, it is possible to compute three individual Pearson correlations:
 rXY, meaning the correlation between X and Y.
 rXZ, meaning the correlation between X and Z.
 rYZ, meaning the correlation between Y and Z.
Formula Of Partial Correlation

rXY.Z = (rXY - rYZ·rXZ) / √[(1 - rXZ²)(1 - rYZ²)]
Semi Partial Correlation

Semi-partial correlation, also known as part correlation, is the statistical process of partialling out or eliminating the effect of a third variable from either one of the two variables, but NOT from both.
Cont.
The formulas for computing semi-partial correlation are given below:

r1(2.3) = (r12 - r23·r13) / √(1 - r23²)

And,

r2(1.3) = (r12 - r23·r13) / √(1 - r13²)
Cont.
Where,
r12 = correlation coefficient between variables 1 and 2
r13 = correlation coefficient between variables 1 and 3
r23 = correlation coefficient between variables 2 and 3
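A small Python sketch computing the partial and both semi-partial coefficients directly from the three pairwise correlations (the numeric values below are hypothetical, chosen only to illustrate the formulas):

import math

r12, r13, r23 = 0.50, 0.40, 0.30    # hypothetical pairwise Pearson correlations

# Partial correlation of variables 1 and 2, controlling variable 3 in BOTH
r12_3 = (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))

# Semi-partial (part) correlations: variable 3 is removed from only ONE variable
r1_23 = (r12 - r13 * r23) / math.sqrt(1 - r23**2)    # 3 partialled out of variable 2
r2_13 = (r12 - r13 * r23) / math.sqrt(1 - r13**2)    # 3 partialled out of variable 1

print(r12_3, r1_23, r2_13)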
Multiple correlation coefficient
› The multiple correlation coefficient denotes the correlation of one variable with multiple other variables taken together, and is denoted by “R”.

Coefficient of Determination (R²)
› R² gives the proportion of variance in the criterion accounted for by the predictors together. For one criterion (variable 1) and two predictors (variables 2 and 3):

R²(1.23) = (r12² + r13² - 2·r12·r13·r23) / (1 - r23²)

where
r12 = correlation coefficient between variables 1 and 2
r13 = correlation coefficient between variables 1 and 3
r23 = correlation coefficient between variables 2 and 3
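A quick Python check of this two-predictor formula against a fitted model (hypothetical data; statsmodels assumed available):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x2 = rng.normal(size=200)                                  # predictor (variable 2)
x3 = 0.4 * x2 + rng.normal(size=200)                       # predictor (variable 3)
y = 1 + 0.7 * x2 + 0.5 * x3 + rng.normal(size=200)         # criterion (variable 1)

r12, r13, r23 = np.corrcoef(y, x2)[0, 1], np.corrcoef(y, x3)[0, 1], np.corrcoef(x2, x3)[0, 1]
r2_formula = (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)

r2_model = sm.OLS(y, sm.add_constant(np.column_stack([x2, x3]))).fit().rsquared
print(round(r2_formula, 4), round(r2_model, 4))            # the two values agree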


STRATEGIES FOR
REGRESSION EQUATION
BUILDING
1. STANDARD METHOD
THE STANDARD REGRESSION METHOD
It is also known as the ‘simultaneous’ or direct method.
All the predictors are entered into the equation in a single “step”.
But the single step is not at all simple and unitary.
To compute the weighting coefficients, the predictors must be individually evaluated.
Each predictor is evaluated as though it had entered last.
 This is done to determine the prediction work it does over and above the rest of the variables.
In this way the method focuses on each predictor’s unique contribution.
Let’s say the dependent variable is Y and there are 3 predictor variables X1, X2 and X3.
For instance, X1 is evaluated controlling for X2 & X3, X2 controlling for X1 & X3, and X3 controlling for X1 & X2.
The total variance explained is the sum of the variance
accounted for by the orthogonal components of IVs plus the
shared variance.
The standard method provides a full model solution in that
all the predictors are part of it.
[Figure: Venn diagram of the variance shared between Y and the predictors X1, X2, X3; regions a–e mark the unique and overlapping portions]
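A small Python illustration of ‘entered as though last’ (hypothetical data and names): each predictor’s unique contribution can be read as the drop in R² when that predictor is removed from the full model.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 3))                                            # hypothetical X1, X2, X3
y = 0.8 * X[:, 0] + 0.4 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(size=120)

def r_squared(y, X):
    return sm.OLS(y, sm.add_constant(X)).fit().rsquared

r2_full = r_squared(y, X)
for j in range(X.shape[1]):
    r2_without = r_squared(y, np.delete(X, j, axis=1))                   # model with predictor j left out
    print(f"unique contribution of X{j + 1}: {r2_full - r2_without:.3f}")   # squared semi-partial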
2. STATISTICAL METHOD
STATISTICAL OR STEP METHOD
› Order of entry of variables is based solely on statistical
criteria.
› Model is constructed by using one variable at a time,
rather than all at once.
› The primary goal is to build a model with only the important predictors.
› Methods used are forward method, backward method and
step-wise method.
 FORWARD METHOD
› Independent variables are added to the equation one at a time.
› At each step, the variable adding the most predictive power is entered.
› In the next step, the remaining variable with the highest partial correlation is entered, provided that partial correlation is statistically significant.
› Once a variable is entered into the model, it remains in the model.
› The process ends when adding a variable results in a non-significant contribution (a rough sketch of this procedure follows below).
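A rough Python sketch of this forward procedure (hypothetical data; using the p-value of the newly added term as the entry criterion is an assumption of this sketch, chosen as one common way to operationalize ‘most predictive power’):

import numpy as np
import statsmodels.api as sm

def forward_select(y, X, alpha=0.05):
    remaining = list(range(X.shape[1]))        # candidate predictor columns
    selected = []                              # predictors already in the model
    while remaining:
        pvals = {}
        for j in remaining:
            fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            pvals[j] = fit.pvalues[-1]         # p-value of the newly added term
        best = min(pvals, key=pvals.get)       # candidate adding the most predictive power
        if pvals[best] < alpha:                # enter it only if its contribution is significant
            selected.append(best)
            remaining.remove(best)
        else:
            break                              # stop: no remaining addition is significant
    return selected

rng = np.random.default_rng(7)
X = rng.normal(size=(150, 4))                                      # hypothetical predictors
y = 1.0 + 0.9 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=150)     # hypothetical criterion
print(forward_select(y, X))                                        # typically selects columns 0 and 2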
 BACKWARD METHOD
› In the first step, all predictors are added into the
equation, regardless of their contribution.
› Then, all the variables in it are examined.
› The significant predictors are retained, and the non-significant predictor whose loss would least decrease R² is removed.
› The removal process continues, until only significant
predictors remain in the equation.
 BACKWARD VERSUS FORWARD METHOD
› Backward regression does not always produce the same model as forward regression.
› Predictors need to meet a more stringent criterion to enter the equation in the forward method, as compared to being retained in the equation in the backward method.
› Stringency is defined statistically by the alpha or probability level associated with entry or removal.
› The alpha level governing the entry decision is usually the traditional 0.05 level. For removal, it drops down to 0.10.
› Variables whose probability falls between the entry and removal criteria (e.g., between 0.05 and 0.10) can therefore end up retained by the backward method but never entered by the forward method.
 STEPWISE METHOD
› It is essentially a composite of the forward and backward
methods.
› It begins with the forward method, including variables whose addition increases R² significantly, and stops adding variables when an addition results in a non-significant increase in R².
› The stepwise method then evaluates the unique contribution of each variable already entered in the equation, and removes any variable whose unique contribution is not significant.
› Thus, from this point onward the backward method steps in.
 STEPWISE METHOD (Cont.)
[Figure: diagram of the stepwise procedure showing predictors Q and R and the DV]
3. SEQUENTIAL
REGRESSION METHOD
THE SEQUENTIAL REGRESSION METHOD
 Also known as the researcher-controlled regression method, covariance analysis, hierarchical analysis, & block-entry analysis.
The researcher-controlled regression methods are really variations
on a theme.
It is the researchers who specify the order of entry of predictors
into the equation.
The main issue that the researcher faces is determining how many variables to instruct the program to enter into the equation at any one time.
ASSUMPTIONS
Assumptions
 Normality of distribution of errors.

○ It is assumed that the residuals in the model are random, normally distributed variables with a mean of 0.
○ This assumption simply means that the differences between the model and the observed data are most frequently zero or very close to zero, and that differences much greater than zero happen only occasionally.

○ Assessed using:
■ Graphical methods:
● Q-Q plots
● Histograms
■ Tests for error normality:
● Kolmogorov-Smirnov test
● Shapiro-Wilk test
● Ryan-Joiner test
● Anderson-Darling test
○ By testing for error normality, we hope to fail to reject the null
hypothesis, and assume that the errors are distributed normally.
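A short Python sketch of this testing route (hypothetical data; scipy and statsmodels assumed available): fit the model, then run a Shapiro-Wilk test on its residuals.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 2))                             # hypothetical predictors
y = 1 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=80)    # hypothetical criterion

residuals = sm.OLS(y, sm.add_constant(X)).fit().resid
w, p = stats.shapiro(residuals)                          # Shapiro-Wilk test on the residuals
print(p)                                                 # large p-value: fail to reject normality of errors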
 Assumption of linearity.

● Multiple regression requires that the relationship between the independent and dependent variables be linear.
○ Assessed using:
■ Graphical method:
● Scatterplots of observed vs. predicted values or residuals vs. predicted values.
○ Fig. 1 shows a non-linear or curvilinear relationship.
○ Fig. 2 shows a linear relationship.
 Homogeneity of variation assumption (Homoscedasticity):

○ The multiple regression model assumes that there is constant error variance across all levels of the predictor variables.
■ Assessed using:
● Scatterplots:
○ If the data turn out to be cone shaped, as shown in figures 3 & 4, the assumption of homogeneity is violated. Such a plot is said to be heteroscedastic.
● Statistical method:
○ Levene’s test (a small sketch follows below):
■ The objective is to test the null hypothesis that the error variance is constant (equal) across groups.
■ As with the normality tests considered in the previous section, we hope to fail to reject the null hypothesis, as this would mean the variance is constant.
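A small Python sketch of one way to run Levene’s test on regression residuals (hypothetical data; splitting the cases into low/high groups by predicted value is an assumption of this sketch, not a fixed rule):

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))                            # hypothetical predictors
y = 3 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)   # hypothetical criterion

fit = sm.OLS(y, sm.add_constant(X)).fit()
residuals, predicted = fit.resid, fit.fittedvalues

low = residuals[predicted <= np.median(predicted)]       # cases with smaller predicted values
high = residuals[predicted > np.median(predicted)]       # cases with larger predicted values
stat, p = stats.levene(low, high)
print(p)                                                 # large p-value: fail to reject equal variances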
 Independence of errors:

○ Multiple regression assumes that the errors are uncorrelated with each other, i.e., there is little or no autocorrelation of the errors in our data.
■ Assessed through:
● Durbin-Watson test:
○ This tests the null hypothesis that the residuals are not linearly auto-correlated.
○ While d can assume values between 0 and 4, values around 2 indicate no autocorrelation.
○ As a rule of thumb, values of 1.5 < d < 2.5 show that there is no autocorrelation in the data.
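A brief Python sketch (hypothetical data; statsmodels assumed available) computing the Durbin-Watson d statistic on the model residuals:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 2))                             # hypothetical predictors
y = 2 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=60)    # hypothetical criterion

residuals = sm.OLS(y, sm.add_constant(X)).fit().resid
d = durbin_watson(residuals)                             # values near 2 suggest no autocorrelation
print(d)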
DATA REQUIREMENTS
● Criterion variable must be on a continuous scale, i.e., criterion
variable cannot be dichotomous or categorical.

● Predictor variables should ideally be on a ratio or interval scale, but if a predictor is nominal:
○ If it has only 2 categories (dichotomous), we can enter it directly.
○ If it has more than 2 categories, we use dummy coding (see the sketch after this list).

● Sample size should be large; as a general rule of thumb, the larger the sample size, the better the accuracy of the result.
○ At an absolute minimum, there should be 5 times as many participants as predictor variables.
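A minimal Python sketch of dummy coding a nominal predictor with more than two categories (the variable name and categories below are hypothetical; pandas is assumed available):

import pandas as pd

df = pd.DataFrame({"marital_status": ["single", "married", "divorced", "married", "single"]})

# One 0/1 dummy column per category, dropping one category as the reference group
dummies = pd.get_dummies(df["marital_status"], prefix="ms", drop_first=True)
df_coded = pd.concat([df, dummies], axis=1)              # these dummy columns enter the regression
print(df_coded)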
TABULATION AND INTERPRETATION OF MULTIPLE REGRESSION ANALYSIS
TABULATION AND INTERPRETATION OF MULTIPLE REGRESSION ANALYSIS
Ways to do regression analysis:
1) Standard: we enter all the IVs simultaneously and then interpret their contribution to predicting the criterion variable altogether.
2) Stepwise: we enter variables one by one.
• An addition is significant when it results in a significant increase in R².
• It is non-significant when it does not increase R².
• Any variable whose unique contribution is not significant is removed.
STANDARD WAY:
TABLE 1 Output of standard multiple regression analysis
TABLE 2:
INTERPRETATION:
• Combined variance explained: 38.6%
• Examination of B and β values (regression coefficients): depression, hostility and psychoticism correlate positively and anxiety correlates negatively; but because the standardized coefficients show very small contributions for anxiety and interpersonal sensitivity, their correlations are neglected.
• Level of significance: at p < 0.05, depression, hostility and psychoticism are found to be significant.
• Regression equation:
• In raw-score form:
Suc Cog = 24.630 + 1.203 IS + 5.368 Dp - 0.625 Anx + 2.284 Hos + 4.349 Psh
• In standard-score form:
Suc Cog = 0.063 IS + 0.316 Dp + 0.035 Anx + 0.142 Hos + 0.216 Psh
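To show how the raw-score equation above would be used for prediction, a tiny Python sketch plugging in one respondent’s scores (the predictor values below are hypothetical, not taken from the study):

# Coefficients from the raw-score equation reported above
a = 24.630
b = {"IS": 1.203, "Dp": 5.368, "Anx": -0.625, "Hos": 2.284, "Psh": 4.349}

scores = {"IS": 10, "Dp": 7, "Anx": 12, "Hos": 5, "Psh": 6}      # hypothetical raw scores

suc_cog_pred = a + sum(b[k] * scores[k] for k in b)              # predicted Suc Cog value
print(round(suc_cog_pred, 2))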
STEPWISE REGRESSION:
• Combined variance explained: 38.5%
• Contribution to prediction: depression predicts the most, then psychoticism, and lastly hostility (all correlate positively).
• All are significant.
• Equation:
• Raw-score form:
Suc Cog = 24.883 + 5.621 Dp + 4.518 Psh + 2.275 Hos
• Standard-score form:
Suc Cog = 0.331 Dp + 0.214 Psh + 0.142 Hos
ADVANTAGES AND
DISADVANTAGES
ADVANTAGES
 Ability to determine the relative influence of one or more predictor variables on the criterion value.
 It provides a functional relationship between two or more related variables.
 It provides a measure of the errors of estimates made through the regression line.
 Ability to identify outliers.
 This technique is widely used in day-to-day life, e.g., birth rate, death rate, tax rate, etc.
DISADVANTAGES
It is assumed that the cause-and-effect relationship between the variables remains unchanged, and this may lead to erroneous and misleading results.
Limited data may lead to misleading results.
It involves a very lengthy and complicated procedure of calculation and analysis.
It cannot be used in the case of qualitative phenomena like honesty, crime, etc.
ISSUES
 Adding more IVs to a multiple regression procedure does not mean the regression will be “better” or offer better prediction; in fact it can make things worse. This is called “OVERFITTING”.
The addition of more IVs creates more relationships among them, so the IVs are not only potentially related to the outcome variable, they are also potentially related to each other. When this happens, it is called “multicollinearity” (a screening sketch follows below).
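A minimal Python sketch of screening predictors for multicollinearity with variance inflation factors (VIFs); the data are hypothetical, and the rule of thumb that VIFs well above about 10 signal a problem is a common convention rather than a fixed cutoff:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(8)
x1 = rng.normal(size=80)
x2 = x1 + rng.normal(scale=0.1, size=80)                 # nearly collinear with x1 on purpose
x3 = rng.normal(size=80)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]   # skip the constant
print(vifs)                                              # the first two VIFs are large, flagging the collinearity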

 Outliers: an observation that is substantially different from the others can make a large difference in the results of a regression analysis.
 Missing values.
CONCLUSION
• In most research problems where regression analysis is applied, more than one independent variable is needed in the regression model.

• The complexity of most scientific mechanisms is such that, in order to be able to predict an important response, a multiple regression model is often the best approach for real-world problems.

• Multiple regression is used for predictive purposes, such as estimating from a series of entrance tests how successful various job applicants might be.
› When this model is linear in its coefficients, it is called multiple linear regression.
› Similar to simple linear regression, multiple regression
also generates two variations of the prediction equation,
one in raw score form and the other in standardised form.
› In MRA the variable that is being predicted is called the criterion, while the ones through which the prediction is done are called predictors.
The multiple regression equation is produced in both raw-score and standardized-score form, as given below. As we have already discussed, these equations provide us with a predicted value of Y (the criterion) from the observed scores of the IVs.

Raw score:
Y(pre) = a + b1X1 + b2X2 + ... + bnXn

Standardized score:
Y(pre) = β1X1 + β2X2 + ... + βnXn (all scores in standard-score form)
• The goal of multiple regression is to produce a model in the
form of a linear equation that identifies the best weighted
combination of independent variables in the study to
optimally predict the criterion variable.

Its computation procedure conforms to the ordinary least squares solution; the solution or model describes a line for which the sum of the squared differences between the predictions we make with the model and the actual observed values (the prediction errors) is as small as possible.

The model can thus be thought of as representing the function that minimizes the sum of the squared errors.
