
Name: Ang Wei Cang ST5202 Mid-term Project Matric: A0024060B

** Most computations were done in R. Please refer to the Appendix for the code. **

Q1) Let X represent the ACT score and Y the GPA score. Assuming a linear model is
appropriate, we want to fit a model of the form $Y = aX + b$. Using R, we obtain the
following least squares linear model with the estimates for a and b:
$\hat{Y} = 0.03883 X + 2.11405$

The ANOVA table for this linear model is shown below:
Analysis of Variance Table
Response: GPA
Source       Df   Sum sq            Mean sq                      F value             P value
Regression     1  SSR  = 3.5878     MSR = 3.5878                 MSR/MSE = 9.2402    P(F(1,118) > 9.2402) = 0.002917
Error        118  SSE  = 45.818     MSE = 45.818/118 = 0.3883
Total        119  SSTO = 49.4058


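We shall perform the F test of whether $\beta_1 = 0$ or $\beta_1 \neq 0$ at the 5% level of
significance, where $\beta_1$ denotes the slope (a in the model above).
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$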

F* statistic = MSR / MSE = 9.2402
Critical value $F(0.95; 1, 118)$ = 3.921478
Since the test statistic is larger than the critical value, we conclude that we have
sufficient evidence to reject $H_0$, i.e. the data support the conclusion that
$\beta_1 \neq 0$ (there is a relationship between ACT score and GPA score).
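
The F test above can be re-traced from the ANOVA table alone. The following is a minimal sketch, assuming only the sums of squares and degrees of freedom reported in the table; the variable names (ssr, sse, f_stat and so on) are illustrative and do not come from the Appendix.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table in Q1
ssr    <- 3.5878   # regression sum of squares
sse    <- 45.818   # error sum of squares
df_reg <- 1
df_err <- 118

# Mean squares and the F* statistic
msr    <- ssr / df_reg
mse    <- sse / df_err
f_stat <- msr / mse

# Critical value at the 5% level and the p-value of the observed statistic
f_crit  <- qf(0.95, df_reg, df_err)
p_value <- pf(f_stat, df_reg, df_err, lower.tail = FALSE)

# Decision rule: reject H0 when F* exceeds the critical value
reject_h0 <- f_stat > f_crit
cat("F* =", round(f_stat, 4), " critical value =", round(f_crit, 6),
    " p-value =", signif(p_value, 4), " reject H0:", reject_h0, "\n")
```

Because the inputs are the rounded table entries, the recomputed F* may differ from the R output in the last decimal place.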

The 95% prediction interval of GPA score when ACT score is 25, using R code, is (1.84562,
4.323835).
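
For reference, the interval returned by R corresponds to the standard prediction interval for a single new observation in simple linear regression. The sketch below states that formula with the quantities known from Q1 (n = 120, MSE = 0.3883, $X_h = 25$); the sample mean $\bar{X}$ and the sum of squares of X are not reported in this section and are therefore left symbolic.

```latex
\hat{Y}_h \pm t\!\left(1 - \tfrac{\alpha}{2};\, n - 2\right)
\sqrt{\mathrm{MSE}\left[\,1 + \frac{1}{n}
      + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\right]},
\qquad
\hat{Y}_h = 2.11405 + 0.03883 \times 25 \approx 3.0848,
\quad n = 120,\ \alpha = 0.05,\ \mathrm{MSE} = 0.3883 .
```

The fitted value computed from the rounded coefficients (3.0848) agrees with the value 3.084727 reported by R up to rounding.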

Q2) Diagnostics
First we draw a scatter plot to get a visual idea of the relationship between ACT score
and GPA score. The plot also includes the least squares line found in Q1.













There is no strong indication of a linear relationship between ACT score and GPA score.
We shall next look at the residual scatter plot and the residual boxplot.










We observe no clear pattern among the residuals, which suggests that a linear
relationship between ACT score and GPA score is plausible. However, we also
observe a few outliers, in particular the circled ones.

Next we check the normality assumption. We first look at the Q-Q plot and then
perform the Shapiro-Wilk test.













From the Q-Q plot, the residuals do not appear to be normally distributed, especially
in the tails.
The Shapiro-Wilk test gives the following results.
Shapiro-Wilk normality test
data: residual
W = 0.9525, p-value = 0.0003304
Since the p-value of 0.0003304 is less than 0.05, we reject the null hypothesis that the
residuals are normally distributed at the 5% level of significance.

Next, we check the assumptions that the error term has zero expectation and constant
variance. The zero-mean assumption is checked through the sample mean of the residuals,
which is $7.175 \times 10^{-5} \approx 0$. The constant-variance assumption is examined
through a plot of the absolute values of the residuals against ACT score.













In general, the absolute values of the residuals look fairly similar, except for the two
outliers indicated in red. Apart from these outliers, it is therefore reasonable to assume
that the remaining data points have constant error variance.

However, a visual plot may not be an accurate indicator of whether the error variances
are the same. Since the sample mean of the ACT score is 24.75, we split the data into
two groups: one with ACT score above 25 (i.e. 26 or more) and the other with the
remaining points. We then conduct an F test (which relies on the normality assumption)
to test whether the true error variances of the two groups are the same.
$H_0: \sigma_1^2 = \sigma_2^2$
$H_1: \sigma_1^2 \neq \sigma_2^2$
Level of significance = 5%
F test to compare two variances
data: residual[score >= 26] and residual[score < 26]
F = 1.5987, num df = 54, denom df = 64, p-value = 0.07202
alternative hypothesis: true ratio of variances is not equal to 1

From the F test, we have p-value = 0.07202 > 0.05. Hence we do not reject the null
hypothesis that the two groups of data have the same variance.
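
The two-sided p-value quoted above can be recovered from the variance ratio and the degrees of freedom. The following is a minimal sketch, assuming the ratio 1.598705 reported in the Appendix output and the usual two-sided F-test rule of doubling the smaller tail probability; the variable names are illustrative.

```r
# Variance ratio and degrees of freedom taken from the var.test output
f_ratio <- 1.598705
df_num  <- 54   # group with ACT score >= 26
df_den  <- 64   # group with ACT score < 26

# Two-sided p-value: twice the smaller of the two tail probabilities
p_lower <- pf(f_ratio, df_num, df_den)
p_value <- 2 * min(p_lower, 1 - p_lower)

# 95% confidence interval for the true ratio of variances
ci <- c(lower = f_ratio / qf(0.975, df_num, df_den),
        upper = f_ratio / qf(0.025, df_num, df_den))

cat("p-value =", signif(p_value, 4), "\n")
print(ci)
```

The interval contains 1, which is consistent with not rejecting equal variances at the 5% level.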

We shall verify the assumption that the residuals are uncorrelated among themselves.
This can be checked with the Durbin-Watson test, which tests the following hypotheses
on the autocorrelation $\rho$ of the error terms (the alternative is one-sided, matching
the R output below):
$H_0: \rho = 0$
$H_1: \rho > 0$
Level of significance = 5%
Durbin-Watson test
data: linear.reg
DW = 1.8307, p-value = 0.1758
alternative hypothesis: true autocorrelation is greater than 0

Since the p-value is 0.1758 > 0.05, we do not reject the null hypothesis at the 5% level of
significance. Hence there is no significant evidence of autocorrelation, which is
consistent with the residuals being uncorrelated (independent under the normality
assumption).
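
As a reminder of what the statistic measures, the Durbin-Watson statistic is built from successive differences of the residuals $e_t$ taken in data order. The sketch below gives the standard definition and its usual large-sample approximation in terms of the lag-1 sample autocorrelation $r$; the value of $r$ shown is only what that approximation implies for DW = 1.8307, not a quantity computed in the Appendix.

```latex
DW = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}
   \approx 2\,(1 - r),
\qquad
r \approx 1 - \frac{1.8307}{2} \approx 0.085 .
```

A value of DW close to 2 therefore corresponds to a lag-1 autocorrelation close to 0.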

Last but not least, we want to check whether the residuals are uncorrelated with the
predictor variable. This can be checked from the plot of residuals against ACT score
above: except for the outliers, there are no signs of correlation between them.
Alternatively, the sample correlation between the residuals and X is -2.070691e-05 $\approx 0$.

Remedies
From the earlier analysis, we have identified the following areas that need remedy
before we can fit a linear model:
a) Presence of outliers
b) Evidence that distribution of response variable is not normal.

Therefore I propose to remove the outliers from the data set before fitting a linear
regression model. This assumes that there are valid reasons to remove them
(e.g. an error in the recorded ACT score).

Approach
Suspecting that the indication of non-normality is due to the outliers, we first removed
the two outliers shown in the earlier diagram, i.e. the data pairs (29, 0.5000) and (31, 1.486),
refitted the model, and performed the Shapiro-Wilk test again on the new residuals, with the
following result:
Shapiro-Wilk normality test
data: residual2
W = 0.989, p-value = 0.4627
Hence we do not reject the null hypothesis that the residuals are normally distributed
at the 5% level of significance. This addresses the non-normality observed earlier, and
we can now proceed to fit a linear model.


Model Fitting (outliers removed)
Using R, we obtain the following result:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.90216 0.28349 6.710 7.51e-10 ***
score2 0.04900 0.01133 4.327 3.22e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5457 on 116 degrees of freedom
Multiple R-squared: 0.139, Adjusted R-squared: 0.1315
F-statistic: 18.72 on 1 and 116 DF, p-value: 3.224e-05

Hence the least squares estimated linear model is $\hat{Y}_i = 1.90216 + 0.049 X_i$, where Y
represents GPA score and X represents ACT score.

Next we have the ANOVA table for our linear model.
Analysis of Variance Table
Response: gpa2
Df Sum Sq Mean Sq F value p value
score2 1 5.574 5.5737 18.720 3.224e-05 ***
Residuals 116 34.537 0.2977
Total 117 40.111
Using the ANOVA table, we perform the F test of whether $\beta_1 = 0$ or $\beta_1 \neq 0$ at
the 5% level of significance.
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$

F* statistic = MSR / MSE = 5.574 / 0.2977 = 18.724 (R reports 18.720 from the unrounded mean squares)
Critical value $F(0.95; 1, 116)$ = 3.922879
Since the test statistic is larger than the critical value, we conclude that we have
sufficient evidence to reject $H_0$ at the 5% level of significance, i.e. the data support
the conclusion that $\beta_1 \neq 0$.

Notice that the R-squared value of 0.139 is low (though much higher than the one in Q1,
0.073), which means that the majority of the variation in the response variable is not
explained by the linear model we just fitted. The scatter plot after removing the outliers
(shown below) likewise gives no clear indication of a strong linear relationship between
ACT score and GPA score.












Transformations of the ACT score, namely $\log(X)$ and $\sqrt{X}$, were also tried, but the
resulting R-squared values are approximately the same, at 0.137 and 0.1386 respectively.
Therefore one possible reason for the low R-squared value is that predictor variables
which could explain more of the variation in the response variable are missing.
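
The R-squared values quoted for the untransformed models can be recovered from the two ANOVA tables as the ratio SSR / SSTO. The following is a minimal sketch, assuming only the sums of squares reported in those tables; the helper r_squared is illustrative and not part of the Appendix code.

```r
# R-squared as the share of total variation explained by the regression
r_squared <- function(ssr, ssto) ssr / ssto

# Q1: full data set (SSR = 3.5878, SSTO = 49.4058)
r2_q1 <- r_squared(3.5878, 49.4058)

# Q2: outliers removed (SSR = 5.5737, SSE = 34.537, so SSTO = SSR + SSE)
r2_q2 <- r_squared(5.5737, 5.5737 + 34.537)

cat("R-squared, Q1:", round(r2_q1, 5), "\n")
cat("R-squared, Q2:", round(r2_q2, 4), "\n")
```

In both cases well under a fifth of the variation in GPA is attributed to ACT score.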

The 95% prediction interval for GPA score at an ACT score of 25, using R, is
(2.041846, 4.212510). Compared with the prediction interval in Q1, (1.84562, 4.323835),
the interval in Q2 is tighter, which makes it a more precise prediction interval. This is
because removing the outliers reduced the MSE relative to the model in Q1, and hence
reduced the variability of the prediction.
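
The link between the smaller MSE and the tighter interval can be checked numerically. The following is a minimal sketch, assuming only the interval endpoints and the residual standard errors reported by R for the two fits; the variable names are illustrative.

```r
# 95% prediction intervals at ACT score 25, as reported by R
pi_q1 <- c(lwr = 1.84562,  upr = 4.323835)   # full data set
pi_q2 <- c(lwr = 2.041846, upr = 4.212510)   # outliers removed

# Interval widths
width_q1 <- unname(pi_q1["upr"] - pi_q1["lwr"])
width_q2 <- unname(pi_q2["upr"] - pi_q2["lwr"])

# Residual standard errors (square roots of the MSEs) of the two fits
s_q1 <- 0.6231
s_q2 <- 0.5457

# The width shrinks by roughly the same factor as the residual standard error
width_ratio <- width_q2 / width_q1
se_ratio    <- s_q2 / s_q1

cat("width Q1 =", width_q1, " width Q2 =", width_q2, "\n")
cat("width ratio =", round(width_ratio, 4),
    " residual SE ratio =", round(se_ratio, 4), "\n")
```

The two ratios nearly coincide, which supports the statement that the reduction in MSE accounts for almost all of the narrowing.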


Summary
We began with diagnostics on the given data to find out why the linear model fitted in
Q1 is poor. We identified two aspects of the data that need attention before fitting a
linear model: the presence of outliers, which can be spotted from the box plot, and
non-normality, indicated by the Q-Q plot and the Shapiro-Wilk test.

Suspecting that the outliers were distorting the results, we removed them under the
assumption that there are valid reasons to do so. We then rechecked normality, and the
Shapiro-Wilk test no longer rejected it, so we can reasonably adopt the normality
assumption before proceeding with model fitting.

After removing the outliers, we obtain a tighter 95% prediction interval than before the
remedy, because the MSE of the modified data set is smaller. However, even though the
R-squared value increased from 0.073 to 0.139, we cannot conclude that the new linear
model fits the data well. One possibility is that relevant predictor variables are missing
from the model, so that ACT score alone cannot explain the majority of the variation
observed in GPA score.

To improve the regression analysis, one may investigate whether removing the outliers is
legitimate, and also model the data set using more sophisticated models.

Appendix
Q1)
> data1<-read.table("e:\\CH01PR19.txt",header=T)
> gpa<-data1$GPA
> score<-data1$Score
> linear.reg<-lm(gpa~score)
> summary(linear.reg)

Call:
lm(formula = gpa ~ score)

Residuals:
Min 1Q Median 3Q Max
-2.74004 -0.33827 0.04062 0.44064 1.22737

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.11405 0.32089 6.588 1.30e-09 ***
score 0.03883 0.01277 3.040 0.00292 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6231 on 118 degrees of freedom
Multiple R-squared: 0.07262, Adjusted R-squared: 0.06476
F-statistic: 9.24 on 1 and 118 DF, p-value: 0.002917

> anova(linear.reg)

Analysis of Variance Table

Response: gpa
Df Sum Sq Mean Sq F value Pr(>F)
score 1 3.588 3.5878 9.2402 0.002917 **
Residuals 118 45.818 0.3883
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> newdata<-data.frame(score=25)
> predict(linear.reg, newdata, interval="predict")
fit lwr upr
3.084727 1.84562 4.323835

Q2)
> plot(score, gpa, main="Scatterplot GPA against ACT",
+ xlab="ACT score ", ylab="GPA score", pch=19)
> abline(linear.reg)

> fit<-2.11405+score*0.03883
> residual<-gpa-fit
> plot(score,residual)
> abline(h=0)

> plot(score,abs(residual))

> qqnorm(residual,ylab="residuals")
> qqline(residual)

> shapiro.test(residual)
Shapiro-Wilk normality test
data: residual
W = 0.9525, p-value = 0.0003304

> var.test(residual[score>=26],residual[score<26])

F test to compare two variances

data: residual[score >= 26] and residual[score < 26]
F = 1.5987, num df = 54, denom df = 64, p-value = 0.07202

alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.9585866 2.6970009
sample estimates:
ratio of variances
1.598705
> library(lmtest)
> dwtest(linear.reg)

Durbin-Watson test

data: linear.reg
DW = 1.8307, p-value = 0.1758
alternative hypothesis: true autocorrelation is greater than 0
> cor(residual,score)
[1] -2.070691e-05

REMEDIES
> gpa2<-gpa[-c(9,115)]
> score2<-score[-c(9,115)]
> linear.reg2<-lm(gpa2~score2)
> summary(linear.reg2)

Call:
lm(formula = gpa2 ~ score2)


Residuals:
Min 1Q Median 3Q Max
-1.28618 -0.34068 -0.00068 0.42657 1.29683

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.90216 0.28349 6.710 7.51e-10 ***
score2 0.04900 0.01133 4.327 3.22e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5457 on 116 degrees of freedom
Multiple R-squared: 0.139, Adjusted R-squared: 0.1315
F-statistic: 18.72 on 1 and 116 DF, p-value: 3.224e-05

> residual2<-gpa2-(1.90216+0.049*score2)
> shapiro.test(residual2)

Shapiro-Wilk normality test

data: residual2
W = 0.989, p-value = 0.4627





> anova(linear.reg2)
Analysis of Variance Table

Response: gpa2
Df Sum Sq Mean Sq F value Pr(>F)
score2 1 5.574 5.5737 18.720 3.224e-05 ***
Residuals 116 34.537 0.2977
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> rtscore2<-sqrt(score2)
> summary(lm(gpa2~rtscore2))

Call:
lm(formula = gpa2 ~ rtscore2)
Residuals:
Min 1Q Median 3Q Max
-1.29575 -0.34442 -0.00584 0.42810 1.34925
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7487 0.5487 1.365 0.175
rtscore2 0.4776 0.1105 4.321 3.3e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5458 on 116 degrees of freedom
Multiple R-squared: 0.1386, Adjusted R-squared: 0.1312

F-statistic: 18.67 on 1 and 116 DF, p-value: 3.3e-05
> logscore2<-log(score2)
> summary(lm(gpa2~logscore2))
Call:
lm(formula = gpa2 ~ logscore2)
Residuals:
Min 1Q Median 3Q Max
-1.30480 -0.34648 -0.01188 0.42967 1.40383

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5439 0.8527 -0.638 0.525
logscore2 1.1463 0.2671 4.292 3.69e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5463 on 116 degrees of freedom
Multiple R-squared: 0.137, Adjusted R-squared: 0.1296
F-statistic: 18.42 on 1 and 116 DF, p-value: 3.691e-05

> newdata2<-data.frame(score2=25)
> predict(linear.reg2, newdata2, interval="predict")
fit lwr upr
1 3.127178 2.041846 4.212510
