• Linear Model
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \varepsilon$
ANOVA Table
> anova(gfit)
Response: disburse
          Df Sum Sq Mean Sq  F value    Pr(>F)
income     1 635639  635639 809.0628 < 2.2e-16 ***
food       1   7651    7651   9.7391  0.003182 **
house      1  21618   21618  27.5158 4.268e-06 ***
fulyt      1     46      46   0.0583  0.810301
numem      1   2632    2632   3.3496  0.074002 .
Residuals 44  34569     786
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

INTERPRETING PARAMETER ESTIMATES
• Naïve interpretation: “A unit change in X1 will produce a change of β1 in the response.”
• β1 is the effect of X1 when all other (specified) predictors are held constant.
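The "held constant" reading of a coefficient can be checked numerically. The sketch below uses simulated data with hypothetical variable names (x1, x2, y), not the DIS data from the slides: two observations that differ by one unit in x1 and share the same x2 have predictions that differ by exactly the estimated coefficient of x1.

```r
# Simulated data; the names x1, x2, y are hypothetical, not the DIS data.
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + 2 * x1 - 1 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# Two observations that differ by one unit in x1 and share the same x2.
new  <- data.frame(x1 = c(0, 1), x2 = c(0.5, 0.5))
pred <- predict(fit, newdata = new)

# The difference in predictions equals the estimated coefficient of x1.
effect_x1 <- unname(pred[2] - pred[1])
beta1_hat <- unname(coef(fit)["x1"])

# Sequential ANOVA table, the same kind of output as anova(gfit).
tab <- anova(fit)
print(c(effect_x1 = effect_x1, beta1_hat = beta1_hat))
print(tab)
```

The equality holds only because x2 is the same in both rows; if x2 moved together with x1, the change in the prediction would mix both coefficients.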
CONFIDENCE INTERVALS (CI) FOR β
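The interval formula from this slide did not survive extraction. As a sketch, assuming the usual t-based interval for a linear model coefficient, estimate ± t(1 − α/2, n − p) × standard error, the built-in confint() can be compared against a manual computation. The data and variable names below are simulated and hypothetical.

```r
# Simulated data; names are hypothetical.
set.seed(2)
n  <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 2 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# Built-in 95% confidence intervals for all coefficients.
ci_builtin <- confint(fit, level = 0.95)

# Manual version: estimate +/- t quantile * standard error,
# with n - p residual degrees of freedom.
est   <- coef(summary(fit))[, "Estimate"]
se    <- coef(summary(fit))[, "Std. Error"]
tcrit <- qt(0.975, df = df.residual(fit))
ci_manual <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)

print(ci_builtin)
print(ci_manual)
```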
> vif(object, …)
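The slides do not say which package supplies vif(), so the sketch below computes variance inflation factors in base R instead, assuming the standard definition VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the other predictors. The data are simulated, with x2 deliberately built to be correlated with x1.

```r
# Simulated predictors; x2 is deliberately correlated with x1.
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the others.
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: diagonal of the inverse correlation matrix.
vif_cor <- diag(solve(cor(X)))

print(round(vif_manual, 2))
```

A VIF near 1 indicates little collinearity; large values flag predictors whose coefficients are poorly determined because of the other predictors.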
VARIABLE SELECTION
• Intended to select the “best” subset of predictors
• To explain the data in the simplest way: redundant predictors are removed.

CRITERION-BASED PROCEDURES
• Fit the 2^p possible models and choose the best one according to some criterion.
• Also known as the “all-possible-regressions procedure”

Criteria for Comparing Regression Models
1. Akaike Information Criterion (AIC) and Bayes Information Criterion (BIC)
AIC = -2 log-likelihood + 2p
BIC = -2 log-likelihood + p log(n)
where:
-2 log-likelihood = n log(SSE/n) (up to an additive constant), also known as the deviance
Rule: Lowest AIC or BIC is the best model
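The all-possible-regressions idea can be sketched in base R by enumerating every subset of predictors and scoring each fit with the AIC and BIC formulas above. The data and names are simulated and hypothetical; x3 is pure noise by construction. Here p counts all estimated coefficients, intercept included, which is an assumption about the slides' convention.

```r
# Simulated data; x3 is pure noise by construction.
set.seed(4)
n <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 + 3 * dat$x1 - 2 * dat$x2 + rnorm(n)

preds <- c("x1", "x2", "x3")
# All 2^p subsets of the predictors, one row of TRUE/FALSE flags per subset.
grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(preds)))

results <- do.call(rbind, lapply(seq_len(nrow(grid)), function(i) {
  keep <- preds[unlist(grid[i, ])]
  rhs  <- if (length(keep)) keep else "1"     # "1" gives the intercept-only model
  fit  <- lm(reformulate(rhs, response = "y"), data = dat)
  sse  <- sum(resid(fit)^2)
  p    <- length(coef(fit))                   # parameters, intercept included
  data.frame(model = paste(rhs, collapse = " + "),
             p     = p,
             AIC   = n * log(sse / n) + 2 * p,
             BIC   = n * log(sse / n) + p * log(n))
}))

best_aic <- as.character(results$model[which.min(results$AIC)])
best_bic <- as.character(results$model[which.min(results$BIC)])
print(results[order(results$AIC), ])
```

These values match what extractAIC() reports for lm fits. R's AIC() adds constants that do not depend on the model, so its numbers differ but the ranking within one data set is the same.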
CRITERION-BASED PROCEDURES
Criteria for Comparing Regression Models
2. Adjusted R2 or Ra2:
$R_a^2 = 1 - \frac{n-1}{n-p} \cdot \frac{SSE}{SST} = 1 - \frac{MSE}{SST/(n-1)}$
Rule: The model with the highest adjusted R2 value is considered best.
3. Mallows' Cp Statistic:
$C_p = \frac{SSE_p}{MSE} + 2p - n$
- MSE is from the model with all predictors, and SSE_p is from the model with p parameters.
- When a model with p parameters fits well, Cp is close to p (for the model with all predictors, Cp = p exactly). Thus a model with bad fit will have Cp much bigger than p.
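Both criteria can be computed directly from their formulas. The sketch below uses simulated data with hypothetical names, checks the hand-computed adjusted R2 against summary(), and evaluates Cp for a submodel and for the full model. As before, p counts all coefficients including the intercept.

```r
# Simulated data; only x1 truly affects y.
set.seed(5)
n <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = dat)   # model with all predictors
sub  <- lm(y ~ x1, data = dat)             # candidate submodel

sse <- function(fit) sum(resid(fit)^2)
sst <- sum((dat$y - mean(dat$y))^2)
mse_full <- sse(full) / df.residual(full)  # MSE from the full model

# Adjusted R^2 = 1 - (n - 1) / (n - p) * SSE / SST
adj_r2 <- function(fit) {
  p <- length(coef(fit))
  1 - (n - 1) / (n - p) * sse(fit) / sst
}

# Mallows' Cp = SSE_p / MSE + 2p - n
mallows_cp <- function(fit) {
  p <- length(coef(fit))
  sse(fit) / mse_full + 2 * p - n
}

print(c(adj_r2_sub = adj_r2(sub), adj_r2_full = adj_r2(full)))
print(c(cp_sub = mallows_cp(sub), cp_full = mallows_cp(full)))
```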
BACKWARD ELIMINATION
1. Fit a model with all predictors in the model
2. Remove the predictor with the highest p-value greater than the significance level αcrit
3. Refit the model and repeat Step 2.
4. Stop when all the p-values < αcrit
• This approach is known as the “saturated” model approach, where we saturate the model with all terms and remove those that are insignificant relative to the presence of the others.

FORWARD SELECTION
1. Fit the intercept-only (null) model, without variables in the model
2. For all predictors not in the model, check their p-values if they are added to the model. Choose the one with the lowest p-value less than αcrit.
3. Continue until no new predictors can be added.
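The two p-value procedures above can be sketched as small base R loops. This is an illustrative implementation, not a library routine: the function names backward_eliminate and forward_select are hypothetical, the data are simulated, and the code assumes numeric predictors so that each term has exactly one coefficient and one p-value.

```r
# Simulated data; x3 and x4 are noise by construction.
set.seed(6)
n <- 80
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# Backward elimination: start full, drop the largest p-value above alpha.
backward_eliminate <- function(data, response, predictors, alpha = 0.05) {
  repeat {
    rhs <- if (length(predictors)) predictors else "1"
    fit <- lm(reformulate(rhs, response = response), data = data)
    if (!length(predictors)) return(fit)         # nothing left to test
    pvals <- coef(summary(fit))[predictors, "Pr(>|t|)"]
    worst <- which.max(pvals)
    if (pvals[worst] <= alpha) return(fit)       # all remaining terms significant
    predictors <- predictors[-worst]             # drop and refit
  }
}

# Forward selection: start null, add the smallest p-value below alpha.
forward_select <- function(data, response, candidates, alpha = 0.05) {
  chosen <- character(0)
  repeat {
    remaining <- setdiff(candidates, chosen)
    if (!length(remaining)) break
    # p-value each remaining candidate would have if added to the model
    pvals <- sapply(remaining, function(v) {
      fit <- lm(reformulate(c(chosen, v), response = response), data = data)
      coef(summary(fit))[v, "Pr(>|t|)"]
    })
    if (min(pvals) >= alpha) break               # no candidate qualifies
    chosen <- c(chosen, remaining[which.min(pvals)])
  }
  rhs <- if (length(chosen)) chosen else "1"
  lm(reformulate(rhs, response = response), data = data)
}

bw <- backward_eliminate(dat, "y", c("x1", "x2", "x3", "x4"))
fw <- forward_select(dat, "y", c("x1", "x2", "x3", "x4"))
print(names(coef(bw)))
print(names(coef(fw)))
```

The two procedures need not end at the same model in general, since each one only examines a single path through the space of subsets.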
STEPWISE REGRESSION
• Combination of backward elimination and forward selection
• At each stage a variable may be added or removed, and there are several variations on exactly how this is done.
• Some drawbacks
1. Possible to miss the “optimal model”
2. The procedures are not directly linked to final objectives of prediction or explanation and may not help solve the problem
3. Tends to pick smaller models than desirable for prediction or explanation.

regsubsets()
• One of the functions in package ‘leaps’, which is used for regression subset selection.
> regsubsets(x, data, method = c("exhaustive", "backward", "forward", "seqrep"), force.in=n[,n...])
# x: design matrix or model formula
# method: method of variable selection such as exhaustive search, forward, backward or sequential replacement. (Note: for a small number of independent variables, the exhaustive method is recommended)
# force.in: specifies one or more variables to be included in all models.
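The add-or-remove idea of stepwise regression can be tried without extra packages using base R's step(), shown here as a stand-in for the functions named in the slides. Note that step() chooses moves by AIC rather than by p-values, so it is one of the "several variations" and not necessarily the one the slides had in mind. Data and names are simulated and hypothetical.

```r
# Simulated data; x3 and x4 are noise by construction.
set.seed(7)
n <- 80
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

null_fit <- lm(y ~ 1, data = dat)
full_fit <- lm(y ~ x1 + x2 + x3 + x4, data = dat)

# Start from the null model; at each stage a term may be added or dropped,
# whichever lowers AIC the most. trace = 0 suppresses the step-by-step log.
sel <- step(null_fit,
            scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4),
            direction = "both", trace = 0)

print(formula(sel))
print(extractAIC(sel))
```

Setting trace = 1 prints each candidate move with its AIC, which makes it easier to see where the search could miss the optimal model.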
Manual Variable Selection
• Choose term to add to the new model
newmodel2 <- update(newmodel, . ~ . + food)
AIC(newmodel2)

Final Model
gfit2 <- lm(disburse ~ income + food + house + numem, data = DIS)
summary(gfit2)
[Figure: residual (ei) plot patterns. Panels: Linear, Non-Linear, Constant Variance, Increasing trend, Decreasing trend]
RESIDUAL ANALYSIS
• Non-Independence of Errors
• Outlier detection
[Figure: residual (ei) plots; one panel is labeled "Independence of Errors"]
> plot(gfit2$fit,
gfit2$res,ylab="Residuals",
xlab="Fitted",
main="Residual Plot")
> abline(h=0)
RESIDUAL PLOT VS. PREDICTORS
> par(mfrow=c(2,2))
> plot(DIS$income, gfit2$res,
ylab="Residuals", xlab="Income")
> plot(DIS$house, gfit2$res,
ylab="Residuals", xlab="House")
> plot(DIS$food, gfit2$res,
ylab="Residuals", xlab="Food")
> plot(DIS$numem, gfit2$res,
ylab="Residuals", xlab="No. of family members")
> par(mfrow=c(1,1))
• Illustration
> cook <- cooks.distance(gfit2)
> plot(cook,ylab="Cooks distances",
xlim=c(0,60))
> OBS <- rownames(DIS)
> identify(1:50,cook,OBS)
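Because identify() needs an interactive graphics device, a non-interactive way to list influential observations can be handy. The sketch below uses simulated data with one planted high-leverage point and flags observations whose Cook's distance exceeds 4/n. That cutoff is a common rule of thumb and an assumption here; it does not come from the slides.

```r
# Simulated data with one planted influential observation (the last one).
set.seed(8)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[n] <- 8      # far from the other x values (high leverage)
y[n] <- -20    # far from the line y = 1 + 2x (large residual)
fit <- lm(y ~ x)

cook <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption, not from the slides).
cutoff  <- 4 / n
flagged <- which(cook > cutoff)
most_influential <- which.max(cook)

print(flagged)
print(most_influential)
```

Refitting the model without the flagged rows and comparing coefficients is a simple way to see how much those observations drive the fit.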