
MAS316/Math352 Regression Analysis

Chapter 4: Multiple Linear Regression (I)

Multiple Linear Regression is used to explore the possible relationship between
one response variable and one or more predictor variables. Obviously, Simple
Linear Regression is a special case of Multiple Linear Regression.

1 Multiple Linear Regression Models


Consider

$Y$: response variable, $X_1, \ldots, X_p$: predictor variables.

A sample of $n$ observations is observed in the form $(y_i, x_{i1}, \ldots, x_{ip})$ for the $i$th
observation:

$$\begin{array}{c|cccc}
Y & X_1 & X_2 & \cdots & X_p \\
\hline
y_1 & x_{11} & x_{12} & \cdots & x_{1p} \\
y_2 & x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & & \vdots \\
y_n & x_{n1} & x_{n2} & \cdots & x_{np}
\end{array}$$

DEFINITION 1 A multiple linear regression model (MLR) assumes:

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad (1.1)$$

where

$X_1, \ldots, X_p$ are non-random variables,

$\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. with $\varepsilon_i \sim N(0, \sigma^2)$.

Hence $y_1, \ldots, y_n$ are independent and normally distributed.

Equivalently, in matrix form,

$$\mathbf{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \sigma^2 I_n),$$

where

$$\mathbf{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} \quad \text{and} \quad
\boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$
Notations:

$\beta_0, \beta_1, \ldots, \beta_p$: unknown regression coefficients;

$\beta_j$ is associated with the effect of $X_j$ on $Y$, and is also called the coefficient of $X_j$;

$X$: $n \times (p+1)$ design matrix with known entries;

Assume the columns of $X$ are linearly independent, so that $(X'X)^{-1}$ exists;

$\mathbf{Y} \sim N_n(X\boldsymbol{\beta}, \sigma^2 I_n)$.

Note: The word linear in a linear model refers to the property that $E[\mathbf{Y}]$ is
linear in the regression coefficients $\beta_0, \ldots, \beta_p$, but not necessarily in each
explanatory variable.

EXAMPLE 1 Linear models:

(i) SLR: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$;

(ii) Polynomial regression:

$$y_i = \beta_0 + \beta_1 z_i + \beta_2 z_i^2 + \varepsilon_i,$$

predictor variables: $x_{i1} = z_i$, $x_{i2} = z_i^2$. Then the above model becomes
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$.

(iii) Interaction effects:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \varepsilon_i,$$

predictor variable: $x_{i3} = x_{i1} x_{i2}$.

(iv) $y_i = \beta_0 + \beta_1 \sin 2z_{i1} + \beta_2 \log z_{i2} + \varepsilon_i$,

predictor variables: $x_{i1} = \sin 2z_{i1}$, $x_{i2} = \log z_{i2}$.

We can use linear regression models to deal with almost any "function" of a
predictor variable, as the R sketch below illustrates.
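In R, models like (i)-(iv) are specified through the formula interface of lm(). The following is an illustrative sketch only; it assumes a data frame dat with columns y, z, x1 and x2 (hypothetical names, not from these notes).

# Illustrative sketch: assumes a data frame `dat` with columns
# y, z, x1, x2 (hypothetical names, not from the notes).
fit_slr  <- lm(y ~ x1, data = dat)              # (i) SLR
fit_poly <- lm(y ~ z + I(z^2), data = dat)      # (ii) I() protects z^2 from formula syntax
fit_int  <- lm(y ~ x1 * x2, data = dat)         # (iii) expands to x1 + x2 + x1:x2
fit_trig <- lm(y ~ sin(2 * z) + log(x2), data = dat)  # (iv) transformed predictors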
EXAMPLE 2 NOT linear models:

(i) $y_i = \exp\{\beta_0 + \beta_1 x_i + \varepsilon_i\}$;

(ii) $y_i = 1/(\beta_0 + \beta_1 x_i + \beta_2 x_i^2) + \varepsilon_i$.

2 Estimation of Regression Coefficients

In the multiple regression model, the least squares (LS) estimates $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p$
of $\beta_0, \beta_1, \ldots, \beta_p$ are chosen to minimize

$$\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2. \qquad (2.1)$$

Using calculus, the $p+1$ first-order conditions are obtained as follows:

$$\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}) = 0$$
$$\sum_{i=1}^n x_{i1}(y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}) = 0$$
$$\vdots$$
$$\sum_{i=1}^n x_{ip}(y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}) = 0. \qquad (2.2)$$

Since the above expressions are cumbersome, we would like a neater solution. To
this end, by (1.1) rewrite (2.1) as

$$\sum_{i=1}^n \varepsilon_i^2 = (\varepsilon_1\ \varepsilon_2\ \cdots\ \varepsilon_n) \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} = \boldsymbol{\varepsilon}'\boldsymbol{\varepsilon} = (\mathbf{Y} - X\boldsymbol{\beta})'(\mathbf{Y} - X\boldsymbol{\beta}). \qquad (2.3)$$

To minimize (2.3) we take the derivative with respect to the vector $\boldsymbol{\beta}$; the
expression is a quadratic function of $\boldsymbol{\beta}$. Using the chain rule we find

$$\frac{d}{d\boldsymbol{\beta}} \left[ (\mathbf{Y} - X\boldsymbol{\beta})'(\mathbf{Y} - X\boldsymbol{\beta}) \right] = -2X'(\mathbf{Y} - X\boldsymbol{\beta}).$$

Setting this equal to zero and solving for $\boldsymbol{\beta}$, we obtain the normal equations

$$2X'X\boldsymbol{\beta} - 2X'\mathbf{Y} = \mathbf{0}.$$

It follows that

$$\hat{\boldsymbol{\beta}} = (X'X)^{-1}X'\mathbf{Y}. \qquad (2.4)$$
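As a sanity check on (2.4), the following sketch (not part of the original notes) computes the LS estimates directly from the normal equations and compares them with R's lm() on simulated data.

# Sketch: LS estimates via the normal equations, on simulated data.
set.seed(1)
n <- 50
x1 <- runif(n); x2 <- runif(n)
y <- 1 + 2*x1 - 3*x2 + rnorm(n)

X <- cbind(1, x1, x2)                             # n x (p+1) design matrix
beta_hat <- solve(crossprod(X), crossprod(X, y))  # (X'X)^{-1} X'Y

beta_hat
coef(lm(y ~ x1 + x2))                             # should agree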

3 Fitted values and residuals

Denote the fitted value of $y_i$ by $\hat y_i$ and the residual by $e_i = y_i - \hat y_i$. Let

$$\hat{\mathbf{Y}} = \begin{pmatrix} \hat y_1 \\ \hat y_2 \\ \vdots \\ \hat y_n \end{pmatrix} = \begin{pmatrix} \hat\beta_0 + \hat\beta_1 x_{11} + \cdots + \hat\beta_p x_{1p} \\ \hat\beta_0 + \hat\beta_1 x_{21} + \cdots + \hat\beta_p x_{2p} \\ \vdots \\ \hat\beta_0 + \hat\beta_1 x_{n1} + \cdots + \hat\beta_p x_{np} \end{pmatrix}, \qquad \mathbf{e} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}.$$

The fitted values are then represented by

$$\hat{\mathbf{Y}} = X\hat{\boldsymbol{\beta}} \qquad (3.1)$$

and the residuals by

$$\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - X\hat{\boldsymbol{\beta}}. \qquad (3.2)$$

DEFINITION 2 The $n \times n$ matrix $H = X(X'X)^{-1}X'$ is commonly known as the
hat matrix, which provides important information about the influence of the $X$
values on the linear model fit.

Some properties of the hat matrix $H$:

$HX = X$,

$H^2 = H$,

$H' = H$,

$(I - H)^2 = I - H$,

$(I - H)' = I - H$.

In terms of the hat matrix $H$, we may further write (3.1) and (3.2) as

$$\hat{\mathbf{Y}} = H\mathbf{Y} \qquad (3.3)$$

and

$$\mathbf{e} = (I - H)\mathbf{Y}. \qquad (3.4)$$
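Continuing the simulated sketch from Section 2 (not from the original notes), the hat matrix and identities (3.3)-(3.4) can be verified numerically:

# Sketch: hat matrix H = X (X'X)^{-1} X' for the simulated data above.
H <- X %*% solve(crossprod(X)) %*% t(X)

max(abs(H %*% H - H))                   # ~ 0: H^2 = H (idempotent)
max(abs(t(H) - H))                      # ~ 0: H' = H (symmetric)
y_hat <- H %*% y                        # fitted values, (3.3)
e <- (diag(n) - H) %*% y                # residuals, (3.4)
max(abs(e - resid(lm(y ~ x1 + x2))))    # ~ 0: agrees with lm()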

4 An example

A data set taken from an environmental study, which measured four variables
for 30 consecutive days, contains 30 observations (rows) and 4 variables
(columns) and is shown in Table 1:

ozone (Y): surface concentration of ozone in New York, in parts per million;

radiation (X1): solar radiation;

temperature (X2): observed temperature, in degrees Fahrenheit;

wind (X3): wind speed, in miles per hour.

Table 1: Environmental data
ozone radiation temperature wind
3.45 190.00 67.00 7.40
3.30 118.00 72.00 8.00
2.29 149.00 74.00 12.60
2.62 313.00 62.00 11.50
2.84 299.00 65.00 8.60
2.67 99.00 59.00 13.80
2.00 19.00 61.00 20.10
2.52 256.00 69.00 9.70
2.22 290.00 66.00 9.20
2.41 274.00 68.00 10.90
2.62 65.00 58.00 13.20
2.41 334.00 64.00 11.50
3.24 307.00 66.00 12.00
1.82 78.00 57.00 18.40
3.11 322.00 68.00 11.50
2.22 44.00 62.00 9.70
1.00 8.00 59.00 9.70
2.22 320.00 73.00 16.60
1.59 25.00 61.00 9.70
3.17 92.00 61.00 12.00
2.84 13.00 67.00 12.00
3.56 252.00 81.00 14.90
4.86 223.00 79.00 5.70
3.33 279.00 76.00 7.40
3.07 127.00 82.00 9.70
4.14 291.00 90.00 13.80
3.39 323.00 87.00 11.50
2.84 148.00 82.00 8.00
2.76 191.00 77.00 14.90
3.33 284.00 72.00 20.70

Figure 1 displays a scatter plot matrix of the data, providing visual information
about pairwise relationships among the 4 variables.

To study the variation of ozone concentration (Y), we fit an MLR model to
the data using 3 explanatory variables: radiation (X1), temperature (X2) and
wind (X3).

The $30 \times 4$ design matrix $X$ equals

$$X = \begin{pmatrix} 1 & 190.00 & 67.00 & 7.40 \\ 1 & 118.00 & 72.00 & 8.00 \\ 1 & 149.00 & 74.00 & 12.60 \\ \vdots & \vdots & \vdots & \vdots \end{pmatrix}.$$
[Scatter plot matrix omitted: pairwise panels for ozone, radiation, temperature and wind.]

Figure 1: Scatter plot matrix for the environmental data

Then the LSE is

$$\hat{\boldsymbol{\beta}} = (X'X)^{-1}X'\mathbf{Y} = \begin{pmatrix} -0.295271027 \\ 0.001305816 \\ 0.045605744 \\ -0.027843496 \end{pmatrix}.$$

According to the signs of the estimates, it seems that ozone concentration
increases as radiation and temperature increase, but decreases with wind.
Statistical tests have to be conducted if we wish to have a more rigorous study.
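The estimate can be reproduced in R with the matrix formula (2.4). The sketch below assumes the ozone data frame constructed in the R code at the end of this chapter (columns y, x1, x2, x3):

# Sketch: reproduce the LSE via (2.4), assuming the `ozone` data frame
# built in the R code at the end of this chapter.
X <- model.matrix(~ x1 + x2 + x3, data = ozone)   # 30 x 4 design matrix
solve(crossprod(X), crossprod(X, ozone$y))        # should match the LSE above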

5 Model checks for adequacy

5.1 ANOVA table

Denote by $\mathbf{1} = (1, \ldots, 1)'$ an $n$-dimensional column vector; then $A = n^{-1}\mathbf{1}\mathbf{1}'$ is
an $n \times n$ matrix with all entries equal to $1/n$. Hence, formulas for the sums of
squares are given as follows:

$$S_{yy} = \sum (y_i - \bar y)^2 = \sum y_i^2 - n\bar y^2 = \mathbf{Y}'\mathbf{Y} - \mathbf{Y}'A\mathbf{Y}$$
$$SSE = \sum (y_i - \hat y_i)^2 = \mathbf{e}'\mathbf{e} = \mathbf{Y}'(I_n - H)^2\mathbf{Y} = \mathbf{Y}'(I_n - H)\mathbf{Y}$$
$$SSR = \sum (\hat y_i - \bar y)^2 = \sum \hat y_i^2 - n\bar y^2 = \mathbf{Y}'H^2\mathbf{Y} - \mathbf{Y}'A\mathbf{Y} = \mathbf{Y}'(H - A)\mathbf{Y},$$

where we use (3.3) and (3.4) and the properties of the hat matrix $H$. From the
above matrix expressions for the sums of squares one may easily verify the following
partition:

$$S_{yy} = SSR + SSE.$$
The table below shows the analysis of variance results; a numerical check of the partition follows the table.

Source        df         SS                      MS       F            p-value
--------------------------------------------------------------------------------
Regression    p          SSR = Σ(ŷᵢ − ȳ)²        MSReg    MSReg / s²
Residual      n − p − 1  SSE = Σ(yᵢ − ŷᵢ)²       s²
--------------------------------------------------------------------------------
Total         n − 1      Syy = Σ(yᵢ − ȳ)²

where

$$MSReg = \frac{SSR}{df_R} = \frac{SSR}{p}.$$

SSR measures how effective the variables $X_1, \ldots, X_p$ are, collectively, at
explaining the variation in the response $Y$.
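Continuing the simulated sketch from Section 3 (not from the original notes), the partition can be checked numerically:

# Sketch: verify Syy = SSR + SSE using A = (1/n) 11' and the hat matrix H
# from the Section 3 sketch.
A <- matrix(1/n, n, n)
Syy <- drop(t(y) %*% (diag(n) - A) %*% y)   # Y'Y - Y'AY = Y'(I - A)Y
SSE <- drop(t(y) %*% (diag(n) - H) %*% y)
SSR <- drop(t(y) %*% (H - A) %*% y)
all.equal(Syy, SSR + SSE)                   # TRUE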

5.2 F test

It can be proved that $E(s^2) = \sigma^2$, as for SLR. The expectation of MSReg is $\sigma^2$
plus a quantity that is nonnegative. For instance, when $p = 2$, we have

$$E(MSReg) = \sigma^2 + \frac{1}{2}\left[ \beta_1^2 \sum_{i=1}^n (x_{i1} - \bar x_1)^2 + \beta_2^2 \sum_{i=1}^n (x_{i2} - \bar x_2)^2 + 2\beta_1\beta_2 \sum_{i=1}^n (x_{i1} - \bar x_1)(x_{i2} - \bar x_2) \right].$$

Note that if both $\beta_1$ and $\beta_2$ are equal to zero, then $E(MSReg) = \sigma^2$; otherwise
$E(MSReg) > \sigma^2$.

This suggests that a comparison of MSReg and $s^2$ is useful for testing whether
there is a regression relation between the response variable $y$ and the predictor
variables $x_1, \ldots, x_p$.

The formal F test for the significance of the MLR is equivalent to testing

$$H_0: \beta_1 = \cdots = \beta_p = 0 \quad \text{vs} \quad H_1: \text{at least one } \beta_j \neq 0,$$

with test statistic

$$F = \frac{MSReg}{s^2} \sim F(p,\ n - p - 1) \text{ under } H_0.$$

Rejection rule: reject $H_0$ iff the p-value $P(F_{p,\,n-p-1} > F)$ is smaller than $\alpha$,
or equivalently, iff $F > F_{p,\,n-p-1}^{(\alpha)}$.

The theoretical justification of the above F distribution is as follows. To this end
we first introduce Cochran's Theorem.

THEOREM 1 Let $\mathbf{Y} \sim N_n(\mathbf{0}, \sigma^2 I)$. Suppose that $A_1, \ldots, A_m$ are symmetric $n \times n$
matrices such that

(i) $\sum_{i=1}^m A_i = I$ (identity matrix), and

(ii) $A_i^2 = A_i$ for all $i$.

Then we have that

$$\mathbf{Y}'A_1\mathbf{Y},\ \mathbf{Y}'A_2\mathbf{Y},\ \ldots,\ \mathbf{Y}'A_m\mathbf{Y} \text{ are independent random variables},$$

and

$$\mathbf{Y}'A_i\mathbf{Y} \sim \sigma^2 \chi^2_{r_i}, \quad \text{for } i = 1, \ldots, m,$$

where $r_i = \mathrm{tr}(A_i)$, the trace of $A_i$.

For the matrices involved in SSE and SSR one may verify that

$A + (I_n - H) + (H - A) = I_n$;

$A^2 = A$, $(I_n - H)^2 = I_n - H$ and $(H - A)^2 = H - A$;

$\mathrm{tr}(A) = 1$, $\mathrm{tr}(I_n - H) = n - p - 1$ and $\mathrm{tr}(H - A) = p$.

Then, by Cochran's Theorem, we have that SSE and SSR are independent, and

$$\frac{SSR}{\sigma^2} \sim \chi^2_p \quad \text{and} \quad \frac{SSE}{\sigma^2} \sim \chi^2_{n-p-1}.$$
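These conditions are easy to check numerically for the simulated design of the earlier sketches (not from the original notes; there $p = 2$ and $n = 50$):

# Sketch: check the Cochran conditions for the simulated design (p = 2).
sum(diag(A))                                # 1
sum(diag(diag(n) - H))                      # n - p - 1 = 47
sum(diag(H - A))                            # p = 2
max(abs((H - A) %*% (H - A) - (H - A)))     # ~ 0: H - A is idempotent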

5.3 R² statistic

R² statistic:

$$R^2 = \frac{SSR}{S_{yy}}.$$

Adding more variables always makes $R^2$ increase, because SSE never becomes
larger when predictor variables are added, while $S_{yy}$ stays the same for a given
set of responses; so $R^2$ alone cannot be used to determine whether a variable
should be added.

Adjusted R² statistic:

$$R_a^2 = 1 - \frac{SSE/(n - p - 1)}{S_{yy}/(n - 1)}.$$

The sums of squares are adjusted by their degrees of freedom. Note that:

$R_a^2$ is mainly used to compare the performance of the fitted model across
different data sets, especially those with different sample sizes.

Generally, for a good model, $R^2$ should be neither too small (< 60%) nor too
large (> 95%).
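Both statistics are reported by summary() on an lm fit, and can be recomputed from the sums of squares (continuing the simulated sketch, not from the original notes):

# Sketch: R^2 and adjusted R^2 by formula and from summary().
p <- 2
R2 <- SSR / Syy
R2a <- 1 - (SSE / (n - p - 1)) / (Syy / (n - 1))

fit <- lm(y ~ x1 + x2)
c(R2, summary(fit)$r.squared)               # should agree
c(R2a, summary(fit)$adj.r.squared)          # should agree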
EXAMPLE 3 (environmental data of Section 4, cont'd)

ANOVA table:

Source        df     SS        MS        F        p-value
----------------------------------------------------------
Regression     3     7.6685    2.5562    7.3244
Residual      26     9.0738    0.3490
----------------------------------------------------------
Total         29    16.7423

Computing with software,

$$\text{p-value} = P(F_{3,26} > 7.3244) = 0.001026,$$

which is very small, so the MLR is highly significant.

For a size-$\alpha$ test, we read the critical values from standard statistical tables
or computer software. For example,

α     Critical value $F_{3,26}^{(\alpha)}$
5%    2.9752
1%    4.6365

In either case, the F-ratio 7.3244 is bigger than the critical value and we
should reject $H_0$ at the stated significance level.

The coefficient of determination is $R^2 = 7.6685/16.7423 = 0.4580$, which
shows that the MLR does not give a good fit (even though it is highly
significant!)
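The p-value and critical values quoted above can be reproduced with R's F-distribution functions:

# Reproduce the quantities in Example 3.
pf(7.3244, df1 = 3, df2 = 26, lower.tail = FALSE)   # 0.001026
qf(0.95, df1 = 3, df2 = 26)                         # 2.9752 (5% critical value)
qf(0.99, df1 = 3, df2 = 26)                         # 4.6365 (1% critical value)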

6 Inference for Regression Coefficients

Tests of model significance give no indication of which particular variable(s) are
important, so we need to consider tests for individual regression coefficients.

Since $\mathbf{Y} \sim N(X\boldsymbol{\beta}, \sigma^2 I)$, we have $\hat{\boldsymbol{\beta}} \sim N_{p+1}(\boldsymbol{\beta}, \sigma^2 (X'X)^{-1})$. It follows that

$$\hat\beta_i \sim N(\beta_i,\ \sigma^2 (X'X)^{-1}_{ii}),$$

where $(X'X)^{-1}_{ii}$ denotes the $i$th diagonal element of $(X'X)^{-1}$. Note that $\sigma^2$ is
unknown in practice. As in SLR, one may verify that $s^2 = SSE/(n - p - 1) = MSE$
is an unbiased estimator of $\sigma^2$. Moreover we have

$$\frac{\hat\beta_i - \beta_i}{\text{s.e.}(\hat\beta_i)} \sim t_{n-p-1},$$

where $\text{s.e.}(\hat\beta_i) = s\sqrt{(X'X)^{-1}_{ii}}$.

To test

$$H_0: \beta_i = 0 \quad \text{vs} \quad H_1: \beta_i \neq 0,$$

the test statistic is

$$t^* = \frac{\hat\beta_i}{\text{s.e.}(\hat\beta_i)}$$

and the decision rule is

$$\text{if } |t^*| > t_{n-p-1,\,\alpha/2}, \text{ reject } H_0.$$

The confidence interval is

$$\left( \hat\beta_i - t_{n-p-1,\,\alpha/2}\, \text{s.e.}(\hat\beta_i),\ \hat\beta_i + t_{n-p-1,\,\alpha/2}\, \text{s.e.}(\hat\beta_i) \right).$$
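All of these quantities can be computed from an lm fit. In the sketch below (simulated data again, not from the original notes), vcov() returns $s^2(X'X)^{-1}$, so its diagonal gives the squared standard errors.

# Sketch: coefficient t statistics and confidence intervals by hand.
fit <- lm(y ~ x1 + x2)                    # simulated data from earlier sketches
se <- sqrt(diag(vcov(fit)))               # s.e.(beta_hat_i) = s * sqrt[(X'X)^{-1}_ii]
t_star <- coef(fit) / se                  # t statistics, as in summary(fit)
crit <- qt(0.975, df = fit$df.residual)   # t_{n-p-1, alpha/2} with alpha = 0.05
cbind(lower = coef(fit) - crit * se,
      upper = coef(fit) + crit * se)      # matches confint(fit)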

EXAMPLE 4 (environmental data of Section 4, cont'd)

Suppose we want to see if temperature can be removed from the full model, i.e.
to test $H_0: \beta_2 = 0$ against $H_1: \beta_2 \neq 0$.

Note that

$$\hat\beta_2 = 0.04561 \quad \text{and} \quad s = \sqrt{0.3490} = 0.5908.$$

Calculate

$$(X'X)^{-1} = \begin{pmatrix}
2.8559225951 & 0.0006173356 & 0.0353506868 & 0.0409097295 \\
0.0006173356 & 0.0000033414 & 0.0000180726 & 0.0000000141 \\
0.0353506868 & 0.0000180726 & 0.0005338541 & 0.0001439104 \\
0.0409097295 & 0.0000000141 & 0.0001439104 & 0.0026139210
\end{pmatrix},$$

so that

$$\text{s.e.}(\hat\beta_2) = s\sqrt{0.0005338541} = 0.01365.$$
The 95% confidence interval for $\beta_2$ is

$$[\,0.04561 - 0.01365(2.0555),\ 0.04561 + 0.01365(2.0555)\,] = [\,0.01755,\ 0.07366\,],$$

where $t_{26}^{(0.025)} = 2.0555$.

The point zero falls outside the above interval, so we reject $H_0$ at the 5%
significance level. It seems that including temperature in the model significantly
improves our fit.

R code for Scatterplot

Step 1: Read the data into R and produce the scatter plot matrix.

ozone_raw <- read.table("C:/ozone.txt", header=FALSE)

ozone <- data.frame(y=ozone_raw$V1, x1=ozone_raw$V2,
                    x2=ozone_raw$V3, x3=ozone_raw$V4)

plot(ozone)

R code for fitting Multiple Regression

> mlr<-lm(y~x1+x2+x3,data=ozone)
> summary(mlr)

Output for multiple regression model

Call:
lm(formula = y ~ x1 + x2 + x3, data = ozone)

Residuals:
Min 1Q Median 3Q Max
-1.13583 -0.39280 0.00007 0.39270 1.41993

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.295271   0.998348  -0.296  0.76976
x1           0.001306   0.001080   1.209  0.23746
x2           0.045606   0.013650   3.341  0.00253 **
x3          -0.027843   0.030203  -0.922  0.36507
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5908 on 26 degrees of freedom
Multiple R-squared:  0.458,    Adjusted R-squared:  0.3955
F-statistic: 7.324 on 3 and 26 DF,  p-value: 0.001026

R code for ANOVA table

> anova(mlr)

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value   Pr(>F)
x1         1 3.1469  3.1469  9.0172 0.005845 **
x2         1 4.2250  4.2250 12.1062 0.001787 **
x3         1 0.2966  0.2966  0.8498 0.365073
Residuals 26 9.0738  0.3490
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R code for confidence interval of $\beta_2$

> confint(mlr, "x2", level=0.95)
        2.5 %    97.5 %
x2 0.01754858 0.0736629
