Regression Analysis
1. Create boxplots for both X and Y. Are there any outliers?
No outliers identified. See boxplot below.
[Figure: Boxplot of X, Y]
2. Make a Scatterplot with Regression. Does there appear to be a linear relationship?
What two points appear to be potential outliers?
Yes, there appears to be a linear relationship. Two points, row 18 (X=9, Y=6) and
row 19 (X=5, Y=9), are possible outliers.
[Figure: Scatterplot of Y vs X]
3. Check for outliers using the semi-studentized method. Are there any outliers? What are
the absolute values of the semi-studentized residuals you identified in question 2?
No, all semi-studentized residuals have an absolute value less than four. For the two
points from question 2, the absolute semi-studentized residuals are 1.97819 (row 18) and
2.06186 (row 19).
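The check above can be sketched in a few lines. This is a minimal illustration, assuming the usual definition of a semi-studentized residual (the residual divided by the square root of MSE) and the cutoff of four used in the answer. The data below are hypothetical and are not the assignment's data set, and the function name is chosen here for illustration.

```python
import math

def semi_studentized_residuals(x, y):
    """Return e_i / sqrt(MSE) for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                  # least-squares slope
    b0 = y_bar - b1 * x_bar         # least-squares intercept
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - 2)
    return [e / math.sqrt(mse) for e in residuals]

# Hypothetical data for illustration only (NOT the assignment's data set).
x_demo = [1, 2, 3, 4, 5, 6]
y_demo = [1.2, 1.9, 3.2, 3.8, 5.1, 9.0]

scores = semi_studentized_residuals(x_demo, y_demo)
# Rows whose absolute semi-studentized residual exceeds the cutoff of four.
flagged = [i for i, s in enumerate(scores) if abs(s) > 4]
```

With the assignment's own X and Y columns in place of the demo lists, `flagged` would list any rows exceeding the cutoff.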
4. Do a check of normality by using a probability plot of the residuals. Include: a) the null
and alternative hypotheses, b) the p-value of the test, c) your decision based on a 0.05
level of significance, and d) Minitab copy of your plot.
a) Ho: The residuals come from a normal distribution
Ha: The residuals do not come from a normal distribution
b) p-value is 0.942
c) Since p-value is greater than 0.05 we fail to reject Ho and will conclude the assumption
of normality is plausible.
d)
[Figure: Probability Plot of RESI1, Normal - 95% CI (Percent vs RESI1). Mean -2.13163E-15,
StDev 1.556, N 20, AD 0.158, P-Value 0.942]
5. Do a check of equal variances by performing a Modified Levene Test. Include: a) the
null and alternative hypotheses, b) the p-value of the test, c) your decision based on a 0.05
level of significance, and d) Minitab copy of your plot.
a) Ho: The variances are equal
Ha: The variances are not equal
b) The p-value is 0.533. NOTE: Remember that Levene's test is more robust against
violations of normality than the F-test, making the Levene test a better overall test of equal
variances. The only condition for the Levene test is that the variable being tested is continuous.
c) Since the p-value is greater than 0.05 we fail to reject Ho and conclude that the
assumption of equal variances is plausible.
d)
[Figure: Test for Equal Variances for RESI1, with 95% Bonferroni Confidence Intervals for
StDevs by group (0, 1). F-Test: Test Statistic 0.83, P-Value 0.861. Levene's Test: Test
Statistic 0.40, P-Value 0.533]
6. Perform a Lack of Fit Test to check if linear regression function is appropriate. Include:
a) the null and alternative hypotheses, b) the correct F-statistic and p-value of the test, c)
your decision based on a 0.05 level of significance, and d) Minitab copy of your ANOVA
output.
a) Ho: The linear regression function is appropriate
Ha: The linear regression function is not appropriate
b) F-statistic is 0.95 and p-value is 0.507
c) Since the p-value is greater than 0.05 we fail to reject Ho and conclude it is plausible
that the linear regression function is appropriate.
d)
Analysis of Variance
Source           DF       SS      MS      F      P
Regression        1   70.769  70.769  27.67  0.000
Residual Error   18   46.031   2.557
  Lack of Fit     7   17.364   2.481   0.95  0.507
  Pure Error     11   28.667   2.606
Total            19  116.800
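The F-statistics in the ANOVA output can be re-derived from the sums of squares and degrees of freedom. The sketch below only repeats that arithmetic with the values printed in the table; the variable names are chosen here for illustration, and the p-values are not recomputed.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table above.
ss_regression, df_regression = 70.769, 1
ss_error, df_error = 46.031, 18
ss_lack_of_fit, df_lack_of_fit = 17.364, 7
ss_pure_error, df_pure_error = 28.667, 11

# Mean squares are sums of squares divided by their degrees of freedom.
mse = ss_error / df_error
ms_lack_of_fit = ss_lack_of_fit / df_lack_of_fit
ms_pure_error = ss_pure_error / df_pure_error

# Regression F uses MSE; the lack-of-fit F uses the pure-error mean square.
f_regression = (ss_regression / df_regression) / mse
f_lack_of_fit = ms_lack_of_fit / ms_pure_error

print(round(f_regression, 2), round(f_lack_of_fit, 2))
```

The lack-of-fit and pure-error rows partition the residual error row, which is why their sums of squares and degrees of freedom add up to 46.031 and 18.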
7. Even though you may not have found any assumption violations perform a Box-Cox
analysis on Y to see if any transformation is suggested. Include the a) estimated and
rounded lambda values, b) the interpretation of this value, and c) the Box-Cox plot. NOTE:
This can only be done using Minitab Version 15 or higher i.e. student version 14 does
not contain Box-Cox program.
a) The estimated lambda is 1.07 and the rounded lambda is 1.00.
b) The rounded value implies raising Y to the power 1.00, which means no
transformation is necessary.
c)
[Figure: Box-Cox Plot of Y (StDev vs Lambda, using 95.0% confidence). Estimate 1.07,
Lower CL 0.34, Upper CL 1.89, Rounded Value 1.00]
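To see why a rounded lambda of 1.00 means "no transformation", here is a minimal sketch of the Box-Cox power transform. It assumes the common textbook form (y^lambda - 1)/lambda, with log(y) at lambda = 0; Minitab's internal scaling may differ, and the sample values are hypothetical.

```python
import math

def box_cox(y, lam):
    """Box-Cox power transform of a positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)          # limiting case as lambda -> 0
    return (y ** lam - 1) / lam

# Hypothetical y values. With lambda = 1 the transform is just y - 1:
# a constant shift that leaves the shape of the data unchanged.
shifted = [box_cox(y, 1.0) for y in (2.0, 5.0, 9.0)]
```

Because a shift does not change the spread or shape of Y, fitting the regression to the transformed values would give the same slope and residuals.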
8. Find Bonferroni joint confidence intervals for β0 and β1 with a 90% family confidence
level and include your interpretation of these intervals. You can use the Minitab output to
find s{b0} and s{b1}.
With sample size, n, of 20 the degrees of freedom are n-2 or 18. Since we are interested in two
joint intervals, β0 and β1, g is equal to 2 for our Bonferroni correction. Using the equations
$b_0 \pm B\,s\{b_0\}$ and $b_1 \pm B\,s\{b_1\}$, where $B = t(1 - \alpha/4;\ n - 2)$
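As a worked value for the Bonferroni multiplier, the quantities stated above (90% family confidence, so α = 0.10; g = 2 intervals; n = 20) can be substituted directly. The numerical t quantile is taken from a standard t table rather than from the Minitab output.

```latex
B = t\!\left(1 - \frac{\alpha}{2g};\ n - 2\right)
  = t\!\left(1 - \frac{0.10}{4};\ 18\right)
  = t(0.975;\ 18) \approx 2.101
```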
$b = r\,(s_y / s_x)$. When the x and y
variables have the same spread, the correlation equals the slope.
c) r^2 = (0.5)(0.5) = 0.25; The sum of squared errors is 25% less when we use the
regression equation instead of the mean of y.
7. SAT regression toward mean:
a) y = 250 + 0.5(800) = 650
b) The predicted y value will be 0.5 standard deviations above the mean, for every one
standard deviation above the mean that x is. Here, x = 800 is three standard deviations
above the mean; so the predicted y value is 0.5(3) = 1.5 standard deviations above the
mean.
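The relationship in part (b) can be checked numerically. This is a sketch under stated assumptions: the source gives only the prediction equation, the slope of 0.5 and the fact that x = 800 is three standard deviations above the mean. The mean of 500 and standard deviation of 100 for both x and y are assumed here because they are consistent with those facts.

```python
def predicted_z(r, z_x):
    """Predicted z-score of y, given the correlation r and the z-score of x."""
    return r * z_x

# Prediction equation from part (a).
y_hat = 250 + 0.5 * 800

# ASSUMPTION for illustration: mean 500 and standard deviation 100 for both
# x and y (not stated in the solution, but consistent with it).
mean, sd = 500.0, 100.0
z_x = (800 - mean) / sd             # how many SDs x lies above its mean
z_y = predicted_z(0.5, z_x)         # predicted y in SD units above its mean
```

Under these assumptions, `mean + z_y * sd` reproduces the prediction of 650 from part (a): the predicted y is only half as far from its mean, in standard deviation units, as x is from its mean.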
8. GPAs and TV watching:
a) The correlation of -0.353 (rounds to -0.35) indicates that there is a negative relation
between the two variables. The more one watches television, the lower his or her college
GPA tends to be. The proportional reduction in error of 0.125 (rounds to 0.13) indicates
that the sum of squared errors is 13% less when we use the regression equation instead
of the mean of y.
b) We would expect that student to be (2)(0.505) = 1.01 standard deviations above the
mean on high school GPA. With regression to the mean, the predicted y is relatively
closer to its mean than x is to its mean.
9. t-score?:
a) df = n − 2 = 25 − 2 = 23
b) -2.069 (rounds to -2.07) and 2.069 (rounds to 2.07)
c) We'd use 2.07
10. More boys are bad?:
a) The negative slope indicates a negative association between life length and number of
sons. Having more sons is bad.
b) i) Assumptions: Assume randomization, linear trend with normal conditional distribution
for y and the same standard deviation at different values of x.
ii) Hypotheses: The null hypothesis that the variables are independent is H0: β = 0. The
two-sided alternative hypothesis of dependence is Ha: β ≠ 0.
iii) Test statistic: t = b/se = -0.65/0.29 = -2.241.
iv) P-value: The P-value is 0.026.
v) Conclusion: If H0 were true that the population slope β = 0, it would be unusual to get a
sample slope at least as far from 0 as b = -0.65. In fact, the probability would be 0.026.
The P-value gives very strong evidence that an association exists between number of
sons and life length.
c) The 95% confidence interval is $b \pm t_{.025}(se) = -0.651 \pm 1.966(0.29)$. The confidence
interval is (-1.220, -0.080) which rounds to (-1.2, -0.1). The plausible values for the true
population slope range from -1.2 to -0.1. It is not plausible that the true slope is 0.
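The test statistic and confidence interval for a slope follow the same two formulas each time, t = b/se and b ± t(se). A small sketch, using the rounded slope b = -0.65 from the test-statistic step and the critical value 1.966 from the solution (the function name is chosen here for illustration):

```python
def slope_inference(b, se, t_crit):
    """Return the t statistic and the confidence interval b +/- t_crit * se."""
    t_stat = b / se
    margin = t_crit * se
    return t_stat, (b - margin, b + margin)

# Values from the solution: slope, standard error, and the t critical value.
t_stat, (lower, upper) = slope_inference(b=-0.65, se=0.29, t_crit=1.966)

# Both ends of the interval are negative, so 0 is not a plausible slope.
excludes_zero = lower < 0 and upper < 0
```

The same function applies to any other slope test by passing that problem's estimate, standard error and critical value.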
11. Student GPAs:
a) i) Assumptions: Assume randomization, linear trend with normal conditional distribution
for y and the same standard deviation at different values of x.
ii) Hypotheses: The null hypothesis that the variables are independent is H0: β = 0. The
two-sided alternative hypothesis of dependence is Ha: β ≠ 0.
iii) Test statistic: t = b/se = 0.6369/0.1442 = 4.42 (or just look at the printout for the test
statistic).
iv) P-value: The P-value is 0.000.
v) Conclusion: If H0 were true that the population slope β = 0, it would be very unusual
(the probability would be almost 0) to get a sample slope at least as far from 0 as b =
0.6369. The P-value is below the significance level of 0.05, so we can reject the null
hypothesis. We have very strong evidence that an association exists between high
school and college GPA.
b) The 95% confidence interval is $b \pm t_{.025}(se) = 0.6369 \pm 2.002(0.1442)$.
The confidence interval is (0.348, 0.926) which rounds to (0.3, 0.9). Zero is not a
plausible value for this slope; as was concluded in the significance test, it is not plausible that
there is no association.
12. Predicting house prices:
a) The residual df, 98, equals n − 2; therefore, the sample size was 100.
b) The sample predicted mean selling price was y = 9.2 + 77.0(1.53) = 127.010, or
$127,010.
c) The estimated residual standard deviation of y is the square root of the MS Error, 1349.
The square root of 1349 is 36.729.
d) The prediction interval is: ŷ ± 2s or 127.010 ± 2(36.729); (53.552, 200.468) which rounds
to (53.6, 200.5).
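Parts (b) through (d) chain together, so the arithmetic is easy to reproduce. The sketch below uses only the numbers quoted in the solution (intercept 9.2, slope 77.0, x = 1.53, MS Error 1349) and the approximate ŷ ± 2s form of the prediction interval.

```python
import math

# Part (b): predicted mean selling price (in thousands of dollars).
y_hat = 9.2 + 77.0 * 1.53

# Part (c): residual standard deviation is the square root of MS Error.
s = math.sqrt(1349)

# Part (d): approximate 95% prediction interval, y_hat +/- 2s.
interval = (y_hat - 2 * s, y_hat + 2 * s)
```

Carrying the unrounded s through gives endpoints that agree with the solution's (53.6, 200.5) after rounding to one decimal place.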
13. Predicting clothes purchases:
a) The value under Fit, 448, is the predicted amount spent on clothes in the past year for
those in the 12th grade of school.
b) The 95% confidence interval of (427, 469) is the range of plausible values for the
population mean of dollars spent on clothes for 12th grade students in the school.
c) The 95% prediction interval of (101, 795) is the range of plausible values for the
individual observations (dollars spent on clothes) for all the 12th grade students at the
school.
14. Savings grow exponentially:
a) y = αβ^x = (100)(1.10)^1 = 110
b) y = αβ^x = (100)(1.10)^5 = 161.05
c) y = αβ^x = (100)(1.10)^x
d) The first year after which you'll have more than $200 is the 8th. y = αβ^x = (100)(1.10)^8 =
214.36
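The answer to part (d) can be confirmed by stepping through the years until the balance passes the target. A minimal sketch, using the starting amount of 100 and the growth factor of 1.10 from the solution (function names are chosen here for illustration):

```python
def balance(years, principal=100.0, growth=1.10):
    """Balance after a number of years of compound growth."""
    return principal * growth ** years

def first_year_exceeding(target, principal=100.0, growth=1.10):
    """Smallest whole number of years after which the balance exceeds target."""
    year = 0
    while balance(year, principal, growth) <= target:
        year += 1
    return year

year_over_200 = first_year_exceeding(200)
```

After seven years the balance is still below $200 (about 194.87), which is why the eighth year is the first to exceed it.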
15. U.S. population growth:
a) y = 68.33 × 1.1418^0 = 68.33 million. y = 68.33 × 1.1418^11 = 293.83 million
b) 1.1418 is the multiplicative effect on y for a one-unit increase in x.
c) This suggests a very good fit of the data to the model. The high correlation indicates a linear
relation between the log of the y values and the x values.
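Parts (a) through (c) can be tied together in a short sketch: the fitted values come from the exponential equation, and taking logs turns the constant multiplicative effect into a constant additive step, which is the linear relation mentioned in part (c). The coefficients 68.33 and 1.1418 are from the solution; the function name is chosen here for illustration.

```python
import math

def predicted_population(x, a=68.33, b=1.1418):
    """Predicted population (millions) from the exponential model a * b**x."""
    return a * b ** x

start = predicted_population(0)
later = predicted_population(11)

# On the log scale each one-unit increase in x adds the same amount, log(b),
# so log(y) is a straight-line function of x.
log_steps = [
    math.log(predicted_population(x + 1)) - math.log(predicted_population(x))
    for x in range(11)
]
```

Every entry of `log_steps` equals log(1.1418), the log of the multiplicative effect from part (b).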