
Bailey Wang

Professor Chen
STA 108
Project 1

1. Graph 1: The raw data.

The data does not appear to follow a simple linear relationship, so the linearity assumption is not satisfied. The spread of the points also changes in width across the graph, so the equal-variance assumption is not satisfied either. Overall, Graph 1 resembles a reciprocal function. Thus, it fails both the linearity check and the equal-variance check.

2. Graph 1's appearance is similar to a reciprocal function, so the reciprocal transformation was applied to the x-variable; the data then resembled a logarithmic curve. Because the plot still showed curvature after transforming x, the y-variable was transformed with the logarithm as well, producing Graph 2. See Code 2. The diagnostic plots of the resulting fit are extremely similar to those of a logarithmic model, so the data was changed only by applying the logarithm to the x- and y-variables.

Graph 2: The data transformed using the log function.


3.
Graph 3: The transformed data with fitted line

Graph 4: The Box-Cox plot after transforming the data with the log function.

The fitted line passes close to the center of the data, and the spread of the points around it appears roughly constant across the entire graph, indicating equal variance. The Box-Cox plot also shows that the transformation is close to optimal.
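The Box-Cox reading above can be sketched in a small self-contained example. This uses simulated stand-in data, not the project's UN data: the response is built so that its log is linear in the predictor, so the profile log-likelihood should peak near lambda = 0 (the log transformation), while a peak near lambda = 1 would mean no further transformation is needed.

```r
## Minimal Box-Cox sketch on simulated stand-in data (not the UN data).
library(MASS)  # provides boxcox()
set.seed(1)
xs <- runif(184, 1, 10)
ys <- exp(0.3 + 0.2 * xs + rnorm(184, sd = 0.3))  # log(ys) is linear in xs
bc <- boxcox(lm(ys ~ xs), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]  # lands near 0 here, pointing to log()
```

Reading the peak of the returned profile log-likelihood is the same check made on Graph 4, where a near-optimal value indicates the log transformation already suffices.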

a. The transformed data appears in Code 2.

Least squares estimates:
beta1hat2 = 0.2374
beta0hat2 = 2.876

R-squared:
Rsq2 = 0.5813

b.
lm(formula = y2 ~ x2)

Residuals:
     Min       1Q   Median       3Q      Max
-0.92398 -0.16996  0.03671  0.20633  0.86331

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.87607    0.11715   24.55   <2e-16 ***
x2           0.23749    0.01494   15.90   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3377 on 182 degrees of freedom
Multiple R-squared: 0.5813, Adjusted R-squared: 0.579
F-statistic: 252.7 on 1 and 182 DF, p-value: < 2.2e-16
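As a quick consistency check on the summary above (a sketch using only the values reported there): in simple linear regression the overall F-statistic equals the square of the slope's t-value, and R-squared can be recovered from that t-value and the error degrees of freedom.

```r
## Consistency check on the reported lm() summary (values copied from above).
t_slope <- 0.23749 / 0.01494        # slope estimate / its standard error
F_stat  <- t_slope^2                # equals the F-statistic on 1 and 182 df
R2      <- F_stat / (F_stat + 182)  # R^2 = t^2 / (t^2 + (n - 2))
round(c(t_slope, F_stat, R2), 4)    # close to the reported 15.90, 252.7, 0.5813
```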
c.
Betamatrix =
2.8760719
0.2374852
Rsqmatrix = 0.5813
See Code 6 for the matrix manipulation.

4. Graph 5: The diagnostic plots. (The model uses log(y) + 10, following the Box-Cox check; adding 10 only shifts the response up by 10 so that all values are positive.)

The Residuals vs Fitted plot shows no strong pattern, so the relationship can be considered linear. The Scale-Location plot shows that the variance is close to equal. The Normal Q-Q plot shows that both tails are heavy.

5. H0: B1 = 0 vs. Ha: B1 =/= 0
Test statistic:
t* = beta1hat / se{beta1hat} = 15.896
alpha = .05
Critical value: t(n-2, 1-alpha/2) = 1.973
|15.896| > 1.973
See Code 3.

We reject the null hypothesis at the .05 significance level. Thus, we have enough evidence to conclude that there is a linear association between the transformed variables.
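The test above can be sketched directly from the reported values (slope 0.23749, standard error 0.01494, n = 184):

```r
## Sketch of the question 5 test using the reported estimates.
n      <- 184
tstar  <- 0.23749 / 0.01494          # observed test statistic, about 15.9
tcrit  <- qt(1 - 0.05/2, df = n - 2) # two-sided critical value, about 1.973
reject <- abs(tstar) > tcrit         # TRUE: reject H0 at the .05 level
```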

For question 6-8, the code used is in Code 4.


6. The 99% confidence interval is (1.515, 1.882).

7. Graph 6: The confidence bands.

8. The 99% prediction interval is (0.655, 3.917).
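Since the intervals in questions 6 and 8 were computed on the log(y) + 10 scale, each endpoint is mapped back to the fertility scale by undoing the shift and then the log. A minimal sketch (the input values here are illustrative, not taken from the report):

```r
## Back-transform an interval endpoint from the log(y) + 10 scale
## to the original fertility scale: undo the +10 shift, then the log.
back <- function(endpoint) exp(endpoint - 10)
back(10)  # a transformed-scale value of 10 maps back to a fertility of 1
```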

9. The Residuals vs Fitted plot shows that the transformed data is reasonably linear. However, the Normal Q-Q plot shows heavy tails, so the data is considered skewed. The width of the residual band appears constant throughout, confirming equal variance. The reference line in the Scale-Location plot is fairly flat, apart from a slight bump. Cook's distance in the Residuals vs Leverage plot identifies no outliers. Even though the transformed data shows a linear relationship, the heavy tails indicate that the original data is strongly skewed. Thus, a simple linear regression model could not be used directly; the data had to be transformed before it could be analyzed.

Code 1:
setwd("~/Desktop/RStudio")
data=read.table("UN.txt", header=T)

x=data[,3]
y=data[,2]

plot(x,y, ylab = "Fertility", xlab = "PPgdp", main = "Fertility VS PPgdp")

## least squares estimates by hand
n=dim(data)[1]
xbar=mean(x)
ybar=mean(y)
Sxx=sum((x-xbar)^2)
Sxy=sum((x-xbar)*(y-ybar))
beta1hat=Sxy/Sxx
beta0hat=ybar-beta1hat*xbar
yhat=beta0hat+beta1hat*x
e=y-yhat
SSE=sum(e^2)
MSE=SSE/(n-2)
sigmahat=sqrt(MSE)
abline(beta0hat,beta1hat)

## coefficient of determination
SSTO=sum((y-ybar)^2)
SSR=sum((yhat-ybar)^2)
Rsq=SSR/SSTO

model1=lm(y~x)

Code 2:
library(MASS)   ## needed for boxcox()
x2=log(1/x)
y2=log(y)
y3=y2+10        ## boxcox() requires a positive response
model=lm(y3~x2)
par(mfrow=c(2,2))
plot(model)
boxcox(model)

plot(x2,y2, ylab = "Fertility", xlab = "PPgdp", main = "Fertility VS PPgdp")

## least squares fit of the shifted response y3 on x2
x2bar=mean(x2)
y2bar=mean(y3)
Sxx2=sum((x2-x2bar)^2)
Sxy2=sum((x2-x2bar)*(y3-y2bar))
beta1hat2=Sxy2/Sxx2
beta0hat2=y2bar-beta1hat2*x2bar
y2hat=beta0hat2+beta1hat2*x2
e2=y3-y2hat     ## residuals of the shifted fit (was y2-y2hat, which was off by 10)
SSE2=sum(e2^2)
MSE2=SSE2/(n-2)
sigmahat2=sqrt(MSE2)
abline(beta0hat2-10,beta1hat2)   ## shift the line back down to the y2 scale of the plot

SSTO2=sum((y3-y2bar)^2)
SSR2=sum((y2hat-y2bar)^2)
Rsq2=SSR2/SSTO2

## standardized residuals and normal Q-Q check
v2=1-1/n-(x2-x2bar)^2/Sxx2
eSR=e2/(sigmahat2*sqrt(v2))
qqnorm(e2)
qqline(e2,lty=3)

Code 3:
Alpha=.05
SEbeta1hat=sigmahat2*sqrt(1/Sxx2)
tbeta1hat=beta1hat2/SEbeta1hat
qt(1-Alpha/2,n-2)

Code 4:
xh=20000
xj=25000
x3=log(1/xh)
x4=log(1/xj)
alpha=.01

## 99% confidence interval for the mean response at a given x (transformed scale)
fCI=function(x){
  cilb3=(beta0hat2+beta1hat2*x)-qt(1-alpha/2,n-2)*sigmahat2*sqrt(1/n+((x-x2bar)^2)/Sxx2)
  ciub3=(beta0hat2+beta1hat2*x)+qt(1-alpha/2,n-2)*sigmahat2*sqrt(1/n+((x-x2bar)^2)/Sxx2)
  return(c(cilb3,ciub3))
}
ci3=fCI(x3)
cilbt=exp(ci3[1]-10)   ## back-transform: undo the +10 shift, then the log
ciubt=exp(ci3[2]-10)

## 99% prediction interval for a new response at a given x (transformed scale);
## the prediction standard error needs sqrt(), which was missing
fPI=function(x){
  pilb3=(beta0hat2+beta1hat2*x)-qt(1-alpha/2,n-2)*sigmahat2*sqrt(1+1/n+((x-x2bar)^2)/Sxx2)
  piub3=(beta0hat2+beta1hat2*x)+qt(1-alpha/2,n-2)*sigmahat2*sqrt(1+1/n+((x-x2bar)^2)/Sxx2)
  return(c(pilb3,piub3))
}
pi3=fPI(x4)
pilbt=exp(pi3[1]-10)
piubt=exp(pi3[2]-10)

## Working-Hotelling simultaneous confidence bands
Alpha=.05
w=sqrt(2*qf(1-Alpha,2,n-2))
yhat2=function(x){
  beta0hat2+beta1hat2*x
}
SEyhat=function(x){
  sqrt(MSE2*(1/n+((x-x2bar)^2)/Sxx2))
}

cbands=function(xv){
  d=length(xv)
  CIs=matrix(0,d,2)
  colnames(CIs)=c("lower","upper")
  for(i in 1:d){
    CIs[i,]=yhat2(xv[i])+c(-1,1)*w*SEyhat(xv[i])
  }
  return(as.data.frame(CIs))
}
range=seq(from=min(x),to=max(x),length.out=184)
bands=cbands(log(1/range))   ## evaluate on the transformed scale, since x2=log(1/x)
plot(x,y,ylab="Fertility", xlab = "PPgdp", main = "Fertility VS PPgdp",cex=.5)
abline(model1)               ## raw-scale least squares line from Code 1
points(range,exp(bands$lower-10),col="red",cex=.3)   ## back-transform to the fertility scale
points(range,exp(bands$upper-10),col="blue",cex=.3)
Code 6:
xmatrix=cbind(rep(1,n),x2)    ## design matrix with an intercept column
ymatrix2=cbind(y2)
betamatrix2=cbind(beta0hat2,beta1hat2)
Betahatmatrix=solve(t(xmatrix)%*%xmatrix)%*%t(xmatrix)%*%ymatrix2   ## (X'X)^(-1) X'y

yhatmatrix=xmatrix%*%Betahatmatrix
ybarmatrix=colSums(ymatrix2)/n

SSRmatrix=colSums((yhatmatrix-ybarmatrix)^2)
SSTOmatrix=colSums((ymatrix2-ybarmatrix)^2)
Rsqmatrix=SSRmatrix/SSTOmatrix
