
Multiple linear regression involves one dependent variable and more than one independent variable. The equation that describes the multiple linear regression model is given below:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

$y$ is the dependent variable (response) and $x_1, x_2, \ldots, x_k$ are the independent variables. These independent variables are used to predict the dependent variable.

$\beta_0, \beta_1, \ldots, \beta_k$ are the regression coefficients (also called model parameters). These regression coefficients are estimated from observed sample data.

The term $\varepsilon$ (pronounced epsilon) is the random error.

Sasadhar Bera, IIM Ranchi

Suppose that n observations are collected for the response variable (y) and that there are k independent variables in the regression model.

y      x1     x2     ...    xj     ...    xk
y1     x11    x12    ...    x1j    ...    x1k
y2     x21    x22    ...    x2j    ...    x2k
.      .      .             .             .
yi     xi1    xi2    ...    xij    ...    xik
.      .      .             .             .
yn     xn1    xn2    ...    xnj    ...    xnk

i = 1, 2, ..., n (observations); j = 1, 2, ..., k (independent variables)


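The scalar notation of the regression model:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_j x_{ij} + \cdots + \beta_k x_{ik} + \varepsilon_i$$

where $i = 1, 2, \ldots, n$ indexes the observations, $j = 1, 2, \ldots, k$ indexes the independent variables, and k = number of independent variables.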


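The matrix notation of the regression model:

$$y_{n \times 1} = X_{n \times (k+1)} \, \beta_{(k+1) \times 1} + \varepsilon_{n \times 1}$$

n = total number of observations, k = total number of independent variables, and $\beta$ is the vector of model parameters.

$$
y = \begin{bmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix}
1 & x_{11} & \cdots & x_{1j} & \cdots & x_{1k} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_{i1} & \cdots & x_{ij} & \cdots & x_{ik} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_{n1} & \cdots & x_{nj} & \cdots & x_{nk}
\end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_k \end{bmatrix}, \quad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_i \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$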
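As an illustration of the matrix layout, the sketch below assembles a response vector and a design matrix in NumPy. The data values and the names `x1`, `x2`, `y` and `X` are hypothetical and chosen only for this example (n = 6, k = 2); the point is the leading column of ones that pairs with the intercept.

```python
import numpy as np

# Hypothetical data, not from the source: n = 6 observations, k = 2 regressors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([6.1, 6.9, 13.2, 12.8, 20.1, 19.0])

n = y.shape[0]

# The leading column of ones multiplies beta_0, the intercept.
X = np.column_stack([np.ones(n), x1, x2])
k = X.shape[1] - 1  # number of independent variables

print("y shape:", y.shape)  # n observations
print("X shape:", X.shape)  # n rows, k + 1 columns
```

Each row of `X` corresponds to one observation and each column after the first to one independent variable.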

The error in a regression model is the difference between the actual and the predicted value. It may be positive or negative.

The error is also known as the residual. The value predicted by the regression equation is called the fitted value or fit.

The sum of the squared differences between the actual and predicted values is known as the sum of squares of error. The least squares method minimizes the sum of squares of error to find the best-fitting plane.

Note that the regressor variables in a linear regression model are assumed to be non-random, that is, their values are fixed.


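In matrix notation, the regression equation is:

$$y = X\beta + \varepsilon$$

The least squares estimate $\hat{\beta}$ is the value of $\beta$ that minimizes

$$L = \sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon^T \varepsilon = (y - X\beta)^T (y - X\beta)$$

Setting the derivative with respect to $\beta$ to zero at $\hat{\beta}$:

$$\left. \frac{\partial L}{\partial \beta} \right|_{\hat{\beta}} = -2 X^T y + 2 X^T X \hat{\beta} = 0$$

This gives the normal equations $X^T X \hat{\beta} = X^T y$, and hence $\hat{\beta} = (X^T X)^{-1} X^T y$ when $X^T X$ is invertible.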
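The least squares solution can be sketched numerically as follows. This is a minimal illustration, assuming $X^T X$ is invertible; the helper name `fit_ols` and the noise-free data (generated from hypothetical coefficients 1, 2 and 3) are made up for the example so that the recovered estimate can be checked by eye.

```python
import numpy as np

def fit_ols(X, y):
    """Solve the normal equations (X^T X) beta = X^T y for beta."""
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical regressors (n = 6, k = 2).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
X = np.column_stack([np.ones(6), x1, x2])

# Noise-free response built from assumed coefficients beta = (1, 2, 3),
# so least squares should recover them up to rounding.
y = 1.0 + 2.0 * x1 + 3.0 * x2

beta_hat = fit_ols(X, y)
print(beta_hat)
```

In practice a routine such as `np.linalg.lstsq` is usually preferred over the normal equations when the columns of X are nearly collinear, since it is numerically more stable.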


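For the i-th observation, the fitted value is

$$\hat{y}_i = X_i \hat{\beta}$$

where $X_i$ is the i-th row of X, and the residual is

$$e_i = y_i - \hat{y}_i$$

The error variance is estimated by the mean square error:

$$\hat{\sigma}^2 = MSE = \frac{\sum_{i=1}^{n} e_i^2}{n - k - 1}$$

where n is the number of observations and k is the number of regressors.

$$\text{Variance}(\hat{\beta}) = MSE \, (X^T X)^{-1}$$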
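A short sketch of these quantities, using hypothetical data (n = 6, k = 2) that is not taken from the source. The variable names `e`, `mse`, `cov_beta` and `se_beta` are illustrative only.

```python
import numpy as np

# Hypothetical data, not from the source.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([6.1, 6.9, 13.2, 12.8, 20.1, 19.0])
X = np.column_stack([np.ones(len(y)), x1, x2])

n, p = X.shape          # p = k + 1 model parameters
k = p - 1

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat    # fitted values
e = y - y_hat           # residuals

# Mean square error: residual sum of squares over n - k - 1 degrees of freedom.
mse = (e @ e) / (n - k - 1)

# Estimated covariance matrix of beta_hat and the standard errors on its diagonal.
cov_beta = mse * np.linalg.inv(X.T @ X)
se_beta = np.sqrt(np.diag(cov_beta))

print("MSE:", mse)
print("standard errors:", se_beta)
```

A useful sanity check is that the residuals are orthogonal to every column of X, which follows directly from the normal equations.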


The test for significance of regression is a test to determine if there is a linear relationship between the response variable and the regressor variables.

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$

$$H_1: \text{at least one } \beta_j \text{ is not zero}$$

The test procedure involves an analysis of variance (ANOVA) partitioning of the total sum of squares into a sum of squares due to regression and a sum of squares due to error (or residual).

Total number of model parameters = p = number of regression coefficients = (k + 1)


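ANOVA table

Source of Variation   DF          SS     MS                        FCal
Regression            k           SSR    SSR / k = MSR             MSR / MSE
Residual error        n - k - 1   SSE    SSE / (n - k - 1) = MSE
Total                 n - 1       TSS

$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat{\beta}^T X^T y - \frac{\left( \sum_{i=1}^{n} y_i \right)^2}{n}$$

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = y^T y - \hat{\beta}^T X^T y$$

$$TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$$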
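The ANOVA quantities can be computed directly from a fitted model. The sketch below uses hypothetical data (n = 6, k = 2) and illustrative variable names; it computes the calculated F statistic only, and the comparison against a critical value from an F table is left out.

```python
import numpy as np

# Hypothetical data, not from the source.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([6.1, 6.9, 13.2, 12.8, 20.1, 19.0])
X = np.column_stack([np.ones(len(y)), x1, x2])

n, p = X.shape
k = p - 1

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
y_bar = y.mean()

ssr = np.sum((y_hat - y_bar) ** 2)  # sum of squares due to regression
sse = np.sum((y - y_hat) ** 2)      # sum of squares due to error
tss = np.sum((y - y_bar) ** 2)      # total sum of squares

msr = ssr / k
mse = sse / (n - k - 1)
f_cal = msr / mse

print(f"{'Source':<16}{'DF':>4}{'SS':>12}{'MS':>12}{'FCal':>12}")
print(f"{'Regression':<16}{k:>4}{ssr:>12.4f}{msr:>12.4f}{f_cal:>12.4f}")
print(f"{'Residual error':<16}{n - k - 1:>4}{sse:>12.4f}{mse:>12.4f}")
print(f"{'Total':<16}{n - 1:>4}{tss:>12.4f}")
```

Because the model includes an intercept, the regression and error sums of squares add up to the total sum of squares.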


Test on individual regression coefficients

Adding an unimportant variable to the model can actually increase the mean square error, thereby decreasing the usefulness of the model.

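The hypotheses for testing the significance of any individual regression coefficient, say $\beta_j$, are

$$H_0: \beta_j = 0$$

$$H_1: \beta_j \neq 0$$

The test statistic is

$$T_{cal} = \frac{\hat{\beta}_j}{\sqrt{\hat{\sigma}^2 C_{jj}}}$$

where $C_{jj}$ is the diagonal element of $(X^T X)^{-1}$ corresponding to $\hat{\beta}_j$. Reject $H_0$ if $|T_{cal}| > t_{\alpha/2,\,(n-k-1)}$.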
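A minimal sketch of the individual coefficient test, again on hypothetical data (n = 6, k = 2). The critical value `t_crit` is entered by hand from a standard t table for alpha = 0.05 and n - k - 1 = 3 degrees of freedom; for other settings it would need to be looked up again.

```python
import numpy as np

# Hypothetical data, not from the source.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([6.1, 6.9, 13.2, 12.8, 20.1, 19.0])
X = np.column_stack([np.ones(len(y)), x1, x2])

n, p = X.shape
k = p - 1

C = np.linalg.inv(X.T @ X)
beta_hat = C @ X.T @ y
e = y - X @ beta_hat
sigma2_hat = (e @ e) / (n - k - 1)  # MSE

# One t statistic per coefficient, using the matching diagonal element of C.
t_cal = beta_hat / np.sqrt(sigma2_hat * np.diag(C))

t_crit = 3.182  # t_{0.025, 3}, taken from a t table (assumed alpha = 0.05)
reject = np.abs(t_cal) > t_crit

for j in range(p):
    print(f"beta_{j}: estimate={beta_hat[j]:.4f}, T_cal={t_cal[j]:.4f}, reject H0={reject[j]}")
```

The first entry tests the intercept; entries 1 to k test the coefficients of the independent variables.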


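In matrix notation, the regression equation is:

$$y = X\beta + \varepsilon, \quad \text{where } \varepsilon \sim \text{Normal}(0, \sigma^2)$$

Mean response:

$$\mu_y = E(y) = E(X\beta) + E(\varepsilon) = X\beta + 0$$

Estimated mean response at a point $x_0$:

$$\hat{\mu}_{y|x_0} = \hat{E}(y \mid x_0) = x_0^T \hat{\beta}$$

$$\text{var}(\hat{\mu}_{y|x_0}) = \sigma^2 \, x_0^T (X^T X)^{-1} x_0$$

A $100(1-\alpha)\%$ confidence interval on the mean response at $x_0$:

$$\hat{\mu}_{y|x_0} \pm t_{\alpha/2,\,(n-p)} \sqrt{\hat{\sigma}^2 \, x_0^T (X^T X)^{-1} x_0}$$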
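The interval for the mean response can be sketched as below. The data, the point `x0` and the hand-entered critical value `t_crit` (from a t table, alpha = 0.05 with n - p = 3 degrees of freedom) are all assumptions made for this example.

```python
import numpy as np

# Hypothetical data, not from the source.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([6.1, 6.9, 13.2, 12.8, 20.1, 19.0])
X = np.column_stack([np.ones(len(y)), x1, x2])

n, p = X.shape

C = np.linalg.inv(X.T @ X)
beta_hat = C @ X.T @ y
e = y - X @ beta_hat
sigma2_hat = (e @ e) / (n - p)

# Point of interest; the leading 1 matches the intercept column of X.
x0 = np.array([1.0, 3.5, 3.5])

mu_hat = x0 @ beta_hat                       # estimated mean response at x0
se_mu = np.sqrt(sigma2_hat * (x0 @ C @ x0))  # its estimated standard error

t_crit = 3.182  # t_{0.025, 3}, taken from a t table (assumed alpha = 0.05)
lower = mu_hat - t_crit * se_mu
upper = mu_hat + t_crit * se_mu

print(f"mean response at x0: {mu_hat:.4f}")
print(f"95% confidence interval: ({lower:.4f}, {upper:.4f})")
```

The interval is narrowest when `x0` is close to the centre of the observed regressor values and widens as `x0` moves away from it.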


Coefficient of multiple determination:

$$R^2 = \frac{SSR}{TSS} = \frac{TSS - SSE}{TSS} = 1 - \frac{SSE}{TSS}$$

SSR : Sum of squares due to regression

SSE : Sum of squares due to error

TSS : Total sum of squares

$R^2$ is the proportion of the variability in the dependent variable explained by the regressor variables.

$R^2$ measures the goodness of the linear fit. The better the linear fit, the closer $R^2$ is to 1.
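The two equivalent forms of the coefficient of multiple determination can be checked numerically. The data below is hypothetical (n = 6, k = 2) and the variable names are illustrative.

```python
import numpy as np

# Hypothetical data, not from the source.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([6.1, 6.9, 13.2, 12.8, 20.1, 19.0])
X = np.column_stack([np.ones(len(y)), x1, x2])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
y_bar = y.mean()

ssr = np.sum((y_hat - y_bar) ** 2)
sse = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y_bar) ** 2)

r2_from_ssr = ssr / tss        # explained share of the total variation
r2_from_sse = 1.0 - sse / tss  # one minus the unexplained share

print("R^2 (SSR / TSS):    ", r2_from_ssr)
print("R^2 (1 - SSE / TSS):", r2_from_sse)
```

The two values agree because SSR + SSE = TSS for a model with an intercept.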


The major drawback of using the coefficient of multiple determination ($R^2$) is that adding a predictor variable to the model never decreases $R^2$ (and almost always increases it), regardless of whether the additional variable is significant or not. To avoid this, regression model builders prefer to use the adjusted $R^2$ statistic.

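$$R^2_{adj} = 1 - \frac{SSE / (n - p)}{TSS / (n - 1)} = 1 - \frac{(n - 1)}{(n - p)} (1 - R^2)$$

Unlike $R^2$, the adjusted $R^2$ does not necessarily increase when additional variables are added to the model. When $R^2$ and adjusted $R^2$ differ dramatically, there is a good chance that non-significant terms have been included in the model.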
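To illustrate the difference between the two statistics, the sketch below fits a model with two regressors and then refits it with an extra, arbitrary column. The helper name `r2_and_adjusted`, the data and the extra column `x3` are all hypothetical.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2) for a least squares fit of y on X.

    X is assumed to already contain the column of ones, so p = X.shape[1].
    """
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat
    sse = e @ e
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - sse / tss
    r2_adj = 1.0 - (sse / (n - p)) / (tss / (n - 1))
    return r2, r2_adj

# Hypothetical data, not from the source.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
x3 = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])  # arbitrary extra regressor
y = np.array([6.1, 6.9, 13.2, 12.8, 20.1, 19.0])

X_small = np.column_stack([np.ones(len(y)), x1, x2])
X_big = np.column_stack([X_small, x3])

r2_small, adj_small = r2_and_adjusted(X_small, y)
r2_big, adj_big = r2_and_adjusted(X_big, y)

print(f"k = 2: R^2 = {r2_small:.4f}, adjusted R^2 = {adj_small:.4f}")
print(f"k = 3: R^2 = {r2_big:.4f}, adjusted R^2 = {adj_big:.4f}")
```

The plain R^2 of the larger model cannot be lower than that of the smaller one, whereas the adjusted value can move in either direction because of the (n - 1) / (n - p) penalty.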


