
MULTIPLE LINEAR REGRESSION Model

If on each experimental unit we measure one response variable and a number of
predictor variables, and each of these predictor variables individually is assumed
to be linearly related to the response variable, we have a case of multiple linear
regression.

The model for multiple linear regression with predictor variables, X1 , X2 , X3 , . . . Xk


and response variable, Y, with observations of all variables being taken at each of
n experimental units, can be written as:

yi = β0 + β1 x1i + β2 x2i + . . . + βk xki + εi ,    i = 1, 2, . . . , n


εi ∼ IIDN(0, σ²)

The least squares estimates of the parameters β0 , β1 , . . . , βk are again found by


minimizing the sum of squares of the residuals.
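
As an illustration, a minimal Python sketch of this calculation is given below. It assumes only that numpy is available; the data values are hypothetical and stand in for measurements on n = 5 experimental units.

import numpy as np

# Hypothetical data: two predictor variables observed on n = 5 experimental units.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1])

# Design matrix: a column of ones for beta0 followed by the predictor columns.
X = np.column_stack([np.ones_like(x1), x1, x2])

# The least squares estimates minimise the sum of squares of the residuals.
beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)        # estimates of beta0, beta1, beta2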

A variable will contribute to the observed variability of the response variable if its
coefficient in the regression equation is non–zero and so we test the null hypothesis
H0 : β1 = . . . = βk = 0 against the alternative H1 : βi ≠ 0 for some i = 1, 2, . . . , k.

The appropriate analysis of variance table is given below.

Source of      Degrees of     Sums of              Mean        F            P
Variation      Freedom        Squares              Squares
Regression     k              Σi β̂i SXi Y          RegMS       RegMS/RMS    p
Residual       n − k − 1      †                    RMS
Total          n − 1          SY Y
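
As a computational check on this table, the sketch below (numpy and scipy assumed; overall_f_test is an illustrative name, not a library routine) computes the overall F statistic and its p-value using the identity TSS = RegSS + RSS.

import numpy as np
from scipy import stats

def overall_f_test(X, y):
    """Overall F test of H0: beta1 = ... = betak = 0.

    X is the n x (k+1) design matrix whose first column is ones; y is the response.
    """
    n, p = X.shape                           # p = k + 1 parameters
    k = p - 1
    beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)    # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
    reg_ms = (tss - rss) / k                 # regression mean square
    rms = rss / (n - k - 1)                  # residual mean square
    f_stat = reg_ms / rms
    p_value = stats.f.sf(f_stat, k, n - k - 1)
    return f_stat, p_value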

Rejecting the null hypothesis H0 : β1 = . . . = βk = 0 does not necessarily mean that


all of the regression coefficients are non–zero. It is also possible to test if each of the
individual coefficients in the multiple regression model is non–zero and determine
the importance of each of the predictor variables in explaining the variability in the
response variable.

A test of H0 : βi = 0 against H1 : βi ≠ 0 can be performed using a test statistic
which has a t–distribution with n − k − 1 degrees of freedom, large absolute values
of the test statistic leading to a rejection of the hypothesis that the coefficient is
equal to zero.
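
A sketch of this t test in Python is given below (numpy and scipy assumed; coefficient_t_tests is an illustrative name). The standard errors are taken from the diagonal of σ̂²(XᵀX)⁻¹, which is the standard theory rather than anything specific to this document.

import numpy as np
from scipy import stats

def coefficient_t_tests(X, y):
    """t statistics and two-sided p-values for H0: beta_i = 0, one per coefficient.

    X is the n x (k+1) design matrix whose first column is ones; y is the response.
    """
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    df = n - p                                   # residual degrees of freedom, n - k - 1
    s2 = np.sum((y - X @ beta_hat) ** 2) / df    # residual mean square, estimates sigma^2
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    t = beta_hat / se
    p_values = 2 * stats.t.sf(np.abs(t), df)
    return t, p_values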

MULTIPLE LINEAR REGRESSION Sequential Sums of Squares

An indication of the importance of a predictor variable in a multiple linear


regression is given by the sequential sum of squares for the predictor variable. In
fitting the multiple linear regression model

yi = β0 + β1 x1i + β2 x2i + . . . + βk xki + εi

the regression sum of squares can be partitioned into components which measure the
contributions to the reduction in the error variability due to each of the predictors.

If the multiple linear regression is fitted using Minitab, it produces a table of


sequential sums of squares. The first line of the sequential sums of squares table
gives the reduction in the error sum of squares due to fitting the X1 term given that
the intercept term β0 is already in the model. This can be written SS(β1 |β0 ). It is
the regression sum of squares in the simple linear regression with the one predictor
variable X1 . The second line of the table is the reduction in the error sum of
squares due to fitting X2 given that β0 and β1 are already in the model. This can
be denoted by SS(β2 |β0 , β1 ), and is the difference between the error sum of squares
in the simple linear regression with the one predictor variable X1 and the error sum
of squares in the multiple linear regression with the two predictor variables X1 and
X2 . The next line in the table would be SS(β3 |β0 , β1 , β2 ), and so on.

These sequential sums of squares are obtained by entering the predictor variables
into Minitab in the order X1 , X2 , X3 and so on. If a sequential sum of squares such
as SS(β2 |β0 , β3 ) were wanted, the predictor variables would have to be entered into
Minitab starting with X3 and followed by X2 .
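
The same sequential sums of squares can be obtained outside Minitab by fitting a sequence of nested models; the sketch below (numpy assumed; residual_ss and sequential_ss are illustrative names) records the reduction in the error sum of squares as each predictor is added in turn.

import numpy as np

def residual_ss(X, y):
    """Residual sum of squares from the least squares fit of y on the columns of X."""
    beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta_hat) ** 2)

def sequential_ss(y, *predictors):
    """Sequential sums of squares for predictors entered in the order given."""
    n = len(y)
    X = np.ones((n, 1))                        # intercept-only model, beta0
    previous = residual_ss(X, y)
    reductions = []
    for x in predictors:
        X = np.column_stack([X, x])            # add the next predictor
        current = residual_ss(X, y)
        reductions.append(previous - current)  # e.g. SS(beta2 | beta0, beta1)
        previous = current
    return reductions

Calling sequential_ss(y, x1, x2) gives SS(β1 |β0 ) and SS(β2 |β0 , β1 ), while sequential_ss(y, x3, x2) would give SS(β3 |β0 ) and SS(β2 |β0 , β3 ), mirroring the order of entry described above.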

A variable Xj makes a significant contribution to the explanation of the variability
in the response variable, given the other variables already included in the model,
if its sequential mean square divided by the error mean square produces a
significant value for an F1,n−k−1 variable.

MULTIPLE LINEAR REGRESSION Example

Nurses are given an aptitude test on entrance to nursing school and the scores are
recorded (X1 ). They are given a hospital final examination (X2 ) just prior to a
State Board Examination (Y ). It is desired to find a relationship between the State
Board result and the previous marks. The following data are available.

Y X1 X2

450 82 87
468 88 88
457 89 84
505 74 89
495 99 90
525 75 91
525 80 92
540 89 93
525 86 94
530 67 94

If the values of the response variable are stored in c1 and the values of the two
predictor variables are in columns c2 and c3 of the worksheet, the regression
coefficients, the analysis of variance and the residual plots are obtained using the
menu system with the following steps.

Stat > Regression > Regression


Responses: [c1]
Predictors: [c2 c3]
Graphs
    Residuals for Plots: Standardized
    Residual Plots: Individual plots
        √ Normal plot of residuals
        √ Residuals versus fits

For this example, the simple linear regressions of Y on X1 and Y on X2 are also
obtained.
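
For readers working outside Minitab, a Python sketch of the same fits is given below. It assumes the statsmodels package is available; up to rounding, the estimates should agree with the Minitab output that follows.

import numpy as np
import statsmodels.api as sm

# Data from the table above.
y  = np.array([450, 468, 457, 505, 495, 525, 525, 540, 525, 530], dtype=float)
x1 = np.array([ 82,  88,  89,  74,  99,  75,  80,  89,  86,  67], dtype=float)
x2 = np.array([ 87,  88,  84,  89,  90,  91,  92,  93,  94,  94], dtype=float)

# Multiple regression of Y on X1 and X2; add_constant supplies the intercept term.
multiple_fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(multiple_fit.summary())

# Simple linear regressions of Y on X1 and of Y on X2.
print(sm.OLS(y, sm.add_constant(x1)).fit().summary())
print(sm.OLS(y, sm.add_constant(x2)).fit().summary())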


Regression Analysis: Y versus X1

The regression equation is


Y = 602 - 1.20 X1

Predictor Coef SE Coef T P


Constant 601.86 98.53 6.11 0.000
X1 -1.205 1.182 -1.02 0.338

S = 32.8568 R-Sq = 11.5% R-Sq(adj) = 0.4%

Analysis of Variance

Source DF SS MS F P
Regression 1 1121 1121 1.04 0.338
Residual Error 8 8637 1080
Total 9 9758

Regression Analysis: Y versus X2

The regression equation is


Y = - 326 + 9.18 X2

Predictor Coef SE Coef T P


Constant -326.4 134.3 -2.43 0.041
X2 9.184 1.488 6.17 0.000

S = 14.5532 R-Sq = 82.6% R-Sq(adj) = 80.5%

Analysis of Variance

Source DF SS MS F P
Regression 1 8063.6 8063.6 38.07 0.000
Residual Error 8 1694.4 211.8
Total 9 9758.0


MTB > Regress 'Y' 2 'X1' 'X2';


SUBC> Constant;
SUBC> Brief 2.

Regression Analysis: Y versus X1, X2

The regression equation is


Y = - 285 - 0.256 X1 + 8.97 X2

Predictor Coef SE Coef T P


Constant -285.5 169.3 -1.69 0.136
X1 -0.2557 0.5788 -0.44 0.672
X2 8.965 1.646 5.45 0.001

S = 15.3455 R-Sq = 83.1% R-Sq(adj) = 78.3%

Analysis of Variance

Source DF SS MS F P
Regression 2 8109.6 4054.8 17.22 0.002
Residual Error 7 1648.4 235.5
Total 9 9758.0

Source DF Seq SS
X1 1 1121.4
X2 1 6988.2


MTB > Regress 'Y' 2 'X2' 'X1';


SUBC> Constant;
SUBC> Brief 2.

Regression Analysis: Y versus X2, X1

The regression equation is


Y = - 285 + 8.97 X2 - 0.256 X1

Predictor Coef SE Coef T P


Constant -285.5 169.3 -1.69 0.136
X2 8.965 1.646 5.45 0.001
X1 -0.2557 0.5788 -0.44 0.672

S = 15.3455 R-Sq = 83.1% R-Sq(adj) = 78.3%

Analysis of Variance

Source DF SS MS F P
Regression 2 8109.6 4054.8 17.22 0.002
Residual Error 7 1648.4 235.5
Total 9 9758.0

Source DF Seq SS
X2 1 8063.6
X1 1 46.0

In the regression of Y on the two predictors X1 and X2 , the overall regression
is significant (p=0.002), so the coefficients are not both zero. If the coefficients are
considered separately, we see that the coefficient of X2 is non–zero (p= 0.001) but
the coefficient of X1 is not significantly different from zero (p=0.672). If the variable
X2 is included in the model, the addition of X1 does not significantly reduce the
variability in the response.

(SeqMS/RMS = 46.0/235.5 = 0.195 ≯ F1,7;0.05 = 5.59)
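
This comparison can be checked directly; a small sketch using scipy (assumed available) is:

from scipy import stats

seq_ms, rms = 46.0, 235.5
f_stat = seq_ms / rms                       # about 0.195
f_crit = stats.f.ppf(0.95, dfn=1, dfd=7)    # about 5.59
print(f_stat < f_crit)                      # True: X1 adds little once X2 is fitted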

MULTIPLE LINEAR REGRESSION Coefficient of Determination

In simple linear regression, the total sum of squares (TSS) is a measure of the
variability in the response (Y) with no account of the predictor (X). The residual
sum of squares (RSS) measures the variability in the response when the predictor
is used and the reduction in variability is given by TSS - RSS = RegSS.

A measure of the effect of the predictor in reducing the variation in the response is
the reduction in the variation in Y as a proportion of the total variation, that is,

R² = RegSS/TSS = 1 − RSS/TSS.

R² is the proportionate reduction in variation due to using the predictor or the


proportion of variation explained by the predictor.

As in simple linear regression, the coefficient of determination, R², in a


multiple linear regression is calculated as the ratio of the regression sum of squares
to the total sum of squares.

If any terms are added to the model, the value of R² will increase (or at least not decrease) even if these
terms do not aid in the prediction of the response variable.

The adjusted coefficient of determination, given by

R²adj = 1 − [(n − 1)/(n − (p + 1))] (RSS/TSS)
      = 1 − [(n − 1)/(n − (p + 1))] (1 − R²),

where p is the number of predictor variables in the model, does not necessarily
increase if more terms are added to the model, and in comparing models the one
with the largest R²adj is usually chosen.
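
Both quantities are easy to compute from the fitted values; a minimal Python sketch (numpy assumed; r_squared is an illustrative name) is:

import numpy as np

def r_squared(y, fitted, p):
    """Coefficient of determination and its adjusted version.

    y: observed responses; fitted: fitted values from the regression;
    p: the number of predictor variables in the model.
    """
    y = np.asarray(y, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    n = len(y)
    rss = np.sum((y - fitted) ** 2)          # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (n - 1) / (n - (p + 1)) * (1.0 - r2)
    return r2, r2_adj

For the nurses example, R² = 1 − 1648.4/9758.0 ≈ 0.831 and R²adj ≈ 0.783, agreeing with the Minitab output.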

MULTIPLE LINEAR REGRESSION Multicollinearity

When dependence exists amongst the predictor variables in a regression model,


multicollinearity is said to exist amongst these predictor variables. Multicollinearity
can have two main effects on the regression model.

First, when a predictor variable is added to the regression model, and the predictor
variable is related to the predictor variables already in the model, the least squares
estimates of the regression parameters will change. That is, the least squares
estimates of the regression parameters depend upon which predictors have been
included in the model, and a physical interpretation of a regression parameter
becomes uncertain, as a unit change in one predictor variable while holding the other
predictor variables constant is not possible.

Secondly, the significance of any predictor variable with respect to the response
variable is determined by the value of its corresponding t–statistic. When multi-
collinearity exists, predictor variables (which are correlated) contribute redundant
information and this can lead to a reduction in the value of the t–statistic obtained
by fitting the regression using the full set of correlated predictor variables compared
with the values of the t–statistics obtained by fitting a regression with a subset of
the predictor variables. This can cause some of the predictor variables to appear
less important in the regression model and in extreme cases every predictor variable
can appear non–significant whereas some of the predictor variables do accurately
predict the response.

The reduction in the values of the t–statistics is caused by an increase in the
variability of the regression coefficient estimates. In the regression model

Y = β0 + β1 X1 + β2 X2 + ε

let the correlation between X1 and X2 be denoted by r12 . Then, the variance of β̂i
is given by

V(β̂i) = σ² (1/(1 − r12²)) (1/SXi Xi)    i = 1, 2.

More generally, in the regression model


Y = β0 + β1 X1 + β2 X2 + . . . + βk Xk + ε

the variance of β̂i is given by

V(β̂i) = σ² (1/(1 − Ri²)) (1/SXi Xi)    i = 1, 2, . . . , k

where Ri² is the coefficient of determination in the regression of Xi on the other
k − 1 predictors in the model.


The variance inflation factor for βi is defined by

VIFi = 1/(1 − Ri²)

and is only equal to one if Xi is uncorrelated with the other predictors.

Multicollinearity is said to have an undue influence on the estimates of the


parameters if the largest VIF is greater than 10.

In Minitab, variance inflation factors can be obtained by requesting a display of


variance inflation factors under options in the regression menu.
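
Equivalently, the variance inflation factors can be computed directly from the definition above. The Python sketch below (numpy assumed; variance_inflation_factors is an illustrative name) regresses each predictor on the remaining ones and applies VIFi = 1/(1 − Ri²).

import numpy as np

def variance_inflation_factors(X):
    """VIF_i = 1 / (1 - R_i^2) for each column of the predictor matrix X.

    X contains the predictor variables only (no column of ones); R_i^2 is the
    coefficient of determination when column i is regressed on the other columns.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for i in range(k):
        xi = X[:, i]
        others = np.delete(X, i, axis=1)
        Z = np.column_stack([np.ones(n), others])   # intercept plus the other predictors
        beta, _, _, _ = np.linalg.lstsq(Z, xi, rcond=None)
        rss = np.sum((xi - Z @ beta) ** 2)
        tss = np.sum((xi - xi.mean()) ** 2)
        r2_i = 1.0 - rss / tss
        vifs.append(1.0 / (1.0 - r2_i))
    return vifs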
