
Stat102B/Sanchez

Review of Multiple regression with R


(Will be needed for lecture on principal components regression)
Introduction
It is hard to imagine that any dependent variable is related to only one independent variable.
Usually, there are many more variables that affect a dependent variable. For example, the
amount of crime in a city does not depend only on the number of convicted murderers in that city,
but also on other socio-demographic variables such as the average number of years of schooling
among its youth, the median income, the minimum wage, and so on. We need models that allow us
to determine quantitatively the effect of all these variables on the dependent variable. This is where
a linear model with multiple independent variables becomes very useful. The linear model with several
independent variables is called multiple regression, as opposed to simple regression (one independent variable).
Basically, let

Y = the sale price of a house / 1000

X1 = Taxes (local, school, county) / 1000
X2 = Number of baths
X3 = Lot size (sq ft x 1000)
X4 = Living space (sq ft x 1000)
X5 = Number of garage stalls
X6 = Number of rooms
X7 = Number of bedrooms
X8 = Age of the home (years)
X9 = Number of fireplaces

For all homes out there in the US, we can assume that for all i = 1, \dots, N, where N is the size of the
population, the population model is

Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_9 X_{i9} + e_i \qquad (1)

where each e_i is such that:

the e_i are independent and identically distributed (iid) random variables,
E(e_i) = 0,
Var(e_i) = \sigma^2 (homoscedasticity property: all e_i have the same variance),
Cov(e_i, e_j) = 0 for i \neq j (no serial correlation property),
and each e_i is normally distributed.

Assume further that there is no multicollinearity:

Cov(X_{il}, X_{im}) = 0 for l \neq m, with l = 1, 2, \dots, k and m = 1, 2, \dots, k.

As with the simple regression model, those assumptions about the e’s imply the following
assumptions about the Y’s.
E(Y_i) = E(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_9 X_{i9} + e_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_9 X_{i9}
Var(Y_i) = Var(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_9 X_{i9} + e_i) = Var(e_i) = \sigma^2
Cov(Y_i, Y_j) = 0 for i \neq j

And, of course, if the e’s are normally distributed, so are the Y’s.
 1
Stat102B/Sanchez

The first task for us is to obtain a random sample of houses whose information about those variables
allows us to say something about the population of all houses.

Once we obtain the data, the tasks in front of us are:


(a) Estimating the model parameters (the \beta's and \sigma^2)
(b) Properties of the estimators (of the \beta's and \sigma^2)
(c) Drawing inferences about them, or about linear combinations of them, with confidence intervals
and tests of hypotheses
(d) Assessing how good the model is for a sample of data
(e) Prediction
(f) Estimation of E(Y|X)
(g) Model selection (this one is new): choosing the best model is the process of selecting the
best subset of independent variables.

As you can see, those are the same tasks as in the simple regression model with only one
independent variable. But now, the computations will be a little more cumbersome because we have
many more variables. Matrix algebra makes it easy for us to do the job.
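
To see why matrix algebra does the heavy lifting, recall that the least squares estimates solve
\hat{\beta} = (X'X)^{-1} X'y, where X is the design matrix (a column of 1's followed by the columns of
independent variables) and y is the vector of responses. A minimal sketch in R, using small made-up
numbers rather than the housing data, just to show the mechanics:

set.seed(1)
X = cbind(1, matrix(rnorm(20), nrow = 10))   # design matrix: column of 1's plus two predictors
y = X %*% c(2, 0.5, -1) + rnorm(10)          # response generated with known coefficients
betahat = solve(t(X) %*% X) %*% t(X) %*% y   # the least squares estimates (X'X)^{-1} X'y
betahat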

Notice that E(Y|X) is now a plane (a hyperplane, with nine independent variables), not a line. The
intercept parameter \beta_0 is the intercept of the plane.

1 Estimating the model parameters using a sample of data and the least squares
method applied to this sample. 
Suppose we observe a random sample of n = 24 houses and measure, for each of them, the values of the
independent variables and the dependent variable assumed in the model. Here are a few lines of
the data:

y x1 x2 x3 x4 x5 x6 x7 x8 x9
25.9 4.9176 1.0 3.4720 0.9980 1.0 7 4 42 0
29.5 5.0208 1.0 3.5310 1.5000 2.0 7 4 62 0
…… ………. …… ……. …….. …… …. ….. …… ….
38.9 8.3607 1.5 9.1500 1.7770 2.0 8 4 48 1
36.9 8.1400 1.0 8.0000 1.5040 2.0 7 3 3 0
45.8 9.1416 1.5 7.3262 1.8310 1.5 8 4 31 0

With R, estimating the model parameters is very easy. First, I read the data and then use the lm
command. But now we will have to do some work and extract the model parameters and put them in
a nice table with their names and all. The data codebook and manual are very important when
dealing with many variables.

houses=read.table("http://www.stat.ucla.edu/~jsanchez/data/montproperty.txt",sep="\t",header=T)
pairs(houses) #do a pairwise plot to study correlation between pairs of variables.
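
# Since the model assumes no multicollinearity, it also helps to quantify the pairwise
# correlations numerically; a one-line sketch (rounding to 2 decimals is just a readability choice):
round(cor(houses), 2)   # correlation matrix of all the variables in the data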


# Now extract the variables from the R object houses.
# Give meaningful names to the variables, not x and y.

price=houses$y
tax=houses$x1
baths=houses$x2
lotsize=houses$x3
livespace=houses$x4
garage=houses$x5
rooms=houses$x6
bedrooms=houses$x7
age=houses$x8
fireplace=houses$x9

# And now fit the model and view the results.

model1=lm(price~tax+baths+lotsize+livespace+garage+rooms+bedrooms+age+fireplace)

summary(model1)

The results will look something like this in R. You do not report raw output like this; you edit it
into something suitable for presentation to others.

Call:
lm(formula = price ~ tax + baths + lotsize + livespace + garage +
rooms + bedrooms + age + fireplace)

Residuals:
Min 1Q Median 3Q Max
-3.71960 -1.95575 -0.04503 1.62733 4.25259

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.92765 5.91285 2.525 0.0243 *
tax 1.92472 1.02990 1.869 0.0827 .
baths 7.00053 4.30037 1.628 0.1258
lotsize 0.14918 0.49039 0.304 0.7654
livespace 2.72281 4.35955 0.625 0.5423
garage 2.00668 1.37351 1.461 0.1661
rooms -0.41012 2.37854 -0.172 0.8656
bedrooms -1.40324 3.39554 -0.413 0.6857
age -0.03715 0.06672 -0.557 0.5865
fireplace 1.55945 1.93750 0.805 0.4343
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 2.949 on 14 degrees of freedom


Multiple R-Squared: 0.8531, Adjusted R-squared: 0.7587
F-statistic: 9.037 on 9 and 14 DF, p-value: 0.0001850
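
From this output you can build the "nice table" of estimates mentioned earlier, and get confidence
intervals for the \beta's (task (c)); a short sketch (rounding to 3 decimals is just a readability choice):

coeftable = round(coef(summary(model1)), 3)  # columns: Estimate, Std. Error, t value, Pr(>|t|)
coeftable
confint(model1, level = 0.95)                # 95% confidence intervals for the beta's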

> anova(model1)
Analysis of Variance Table

Response: price
Df Sum Sq Mean Sq F value Pr(>F)
tax 1 636.16 636.16 73.1525 6.238e-07 ***
baths 1 29.18 29.18 3.3551 0.08836 .
lotsize 1 4.71 4.71 0.5416 0.47391
livespace 1 0.03 0.03 0.0032 0.95537
garage 1 8.78 8.78 1.0091 0.33216
rooms 1 13.03 13.03 1.4982 0.24115
bedrooms 1 9.14 9.14 1.0515 0.32254
age 1 0.64 0.64 0.0741 0.78943
fireplace 1 5.63 5.63 0.6478 0.43435
Residuals 14 121.75 8.70
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
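
Note that anova() reports sequential (Type I) sums of squares, so each row depends on the order in
which the terms enter the formula. A quick sketch to see this (the reordered formula below is just
one arbitrary choice):

model1b = lm(price ~ fireplace + age + bedrooms + rooms + garage +
             livespace + lotsize + baths + tax)
anova(model1b)   # the per-variable rows change; the Residuals row does not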

# You then analyze the residuals


resid=model1$residuals                    # raw residuals (note: this name masks R's resid() function)
sdresid=(resid-mean(resid))/sd(resid)     # standardized residuals
fittedmodel1=model1$fitted                # fitted values (yhat)
plot(sdresid~fittedmodel1,xlab="yhat",ylab="standardized residuals", main="residual plot")
# Same plot with the y-axis fixed at (-3, 3) so unusual residuals stand out
plot(sdresid~fittedmodel1,xlab="yhat",ylab="standardized residuals", main="residual plot",
     ylim=c(-3,3))
abline(h=0)   # reference line at zero
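
# The residual plot checks the constant-variance assumption; to check the normality
# assumption on the e's, one common sketch is a normal quantile plot:
qqnorm(sdresid, main="Normal Q-Q plot of standardized residuals")
qqline(sdresid)   # points near this line are consistent with normality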

Let’s briefly mention what these things are and take notes. Write the fitted model and let’s talk
about the parameters and some of the inference. Do you see any problem with the data?
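
For tasks (e) and (f), prediction and estimation of E(Y|X), predict() can be used on the fitted
model. A minimal sketch, where the new house below is hypothetical (its values are made up for
illustration):

newhouse = data.frame(tax=5.5, baths=1.5, lotsize=5.0, livespace=1.5,
                      garage=2.0, rooms=7, bedrooms=3, age=30, fireplace=1)
predict(model1, newdata=newhouse, interval="prediction")   # task (e): predict a new Y
predict(model1, newdata=newhouse, interval="confidence")   # task (f): estimate E(Y|X)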
