Regression Analysis
1. Create boxplots for both X and Y. Are there any outliers?
No outliers identified. See boxplot below.
[Figure: Boxplot of X, Y]
2. Make a Scatterplot with Regression. Does there appear to be a linear relationship?
What two points appear to be potential outliers?
Yes, there appears to be a linear relationship. Two points, row 18 (X=9, Y=6) and
row 19 (X=5, Y=9), are possible outliers.
[Figure: Scatterplot of Y vs X]
3. Check for outliers using the semi-studentized method. Are there any outliers? What are
the absolute values of the semi-studentized residuals you identified in question 2?
No, because all semi-studentized residuals have an absolute value less than four. For the two
points identified in question 2, the absolute semi-studentized residuals are 1.97819 (row 18)
and 2.06186 (row 19).
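A semi-studentized residual is the raw residual divided by the square root of MSE. The following Python sketch reproduces the two values approximately. It assumes the rounded coefficients (1.377 and 0.8652) and S = 1.59914 that appear in the Minitab output quoted elsewhere in this solution, so the last digits differ slightly from Minitab's; the function name is illustrative only.

```python
# Semi-studentized residual: e_i / sqrt(MSE).
# Assumed inputs: rounded values taken from the Minitab output in this solution.
B0, B1 = 1.377, 0.8652   # fitted intercept and slope
S = 1.59914              # sqrt(MSE), reported as S in the output

def semi_studentized(x, y):
    """Return the residual of the point (x, y) divided by sqrt(MSE)."""
    fitted = B0 + B1 * x
    return (y - fitted) / S

row18 = semi_studentized(9, 6)   # roughly -1.98
row19 = semi_studentized(5, 9)   # roughly  2.06
print(abs(row18), abs(row19))
```

Both absolute values are well below the cutoff of four used in the answer.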
4. Do a check of normality by using a probability plot of the residuals. Include: a) the null
and alternative hypotheses, b) the p-value of the test, c) your decision based on a 0.05
level of significance, and d) Minitab copy of your plot.
a) Ho: The residuals come from a normal distribution
Ha: The residuals do not come from a normal distribution
b) p-value is 0.942
c) Since p-value is greater than 0.05 we fail to reject Ho and will conclude the assumption
of normality is plausible.
d) [Figure: Probability Plot of RESI1 (Normal, 95% CI); x-axis RESI1, y-axis Percent. Mean -2.13163E-15, StDev 1.556, N 20, AD 0.158, P-Value 0.942]
5. Do a check of equal variances by performing a Modified Levene Test. Include: a) the
null and alternative hypotheses, b) the p-value of the test, c) your decision based on a 0.05
level of significance, and d) Minitab copy of your plot.
a) Ho: The variances are equal
Ha: The variances are not equal
b) The p-value is 0.533. NOTE: Remember that Levene's test is more robust against
violations of normality than the F-test, making it a better overall test of equal
variances. The only condition for Levene's test is that the variable being tested is continuous.
c) Since the p-value is greater than 0.05 we conclude that the assumption of equal
variances is plausible.
d) [Figure: Test for Equal Variances for RESI1, by group (0, 1), with 95% Bonferroni Confidence Intervals for StDevs. F-Test: Test Statistic 0.83, P-Value 0.861. Levene's Test: Test Statistic 0.40, P-Value 0.533]
6. Perform a Lack of Fit Test to check if linear regression function is appropriate. Include:
a) the null and alternative hypotheses, b) the correct F-statistic and p-value of the test, c)
your decision based on a 0.05 level of significance, and d) Minitab copy of your ANOVA
output.
a) Ho: The linear regression function is appropriate
Ha: The linear regression function is not appropriate
b) F-statistic is 0.95 and p-value is 0.507
c) Since the p-value is greater than 0.05, we fail to reject Ho and conclude it is plausible
that the linear regression function is appropriate.
d)
Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   70.769  70.769  27.67  0.000
Residual Error  18   46.031   2.557
  Lack of Fit    7   17.364   2.481   0.95  0.507
  Pure Error    11   28.667   2.606
Total           19  116.800
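As a check on the table, the lack-of-fit F-statistic is the ratio of the Lack of Fit mean square to the Pure Error mean square. A minimal Python sketch using only the SS and DF values printed above (variable names are illustrative):

```python
# Lack-of-fit F statistic = MS(Lack of Fit) / MS(Pure Error),
# computed from the SS and DF columns of the ANOVA table.
ss_lack_of_fit, df_lack_of_fit = 17.364, 7
ss_pure_error, df_pure_error = 28.667, 11

ms_lack_of_fit = ss_lack_of_fit / df_lack_of_fit
ms_pure_error = ss_pure_error / df_pure_error
f_lack_of_fit = ms_lack_of_fit / ms_pure_error

print(round(ms_lack_of_fit, 3), round(ms_pure_error, 3), round(f_lack_of_fit, 2))
```

The two sums of squares also add up to the Residual Error SS of 46.031, which is how the residual is partitioned in this test.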
7. Even though you may not have found any assumption violations, perform a Box-Cox
analysis on Y to see if any transformation is suggested. Include a) the estimated and
rounded lambda values, b) the interpretation of this value, and c) the Box-Cox plot. NOTE:
This can only be done using Minitab Version 15 or higher, i.e., student version 14 does
not contain the Box-Cox program.
a) The estimated lambda is 1.07 and the rounded lambda is 1.00.
b) The rounded value implies raising Y to the power of 1.00, which means no
transformation is necessary.
c) [Figure: Box-Cox Plot of Y; x-axis Lambda, y-axis StDev. Estimate 1.07, Lower CL 0.34, Upper CL 1.89, Rounded Value 1.00 (using 95.0% confidence)]
8. Find Bonferroni joint confidence intervals for $\beta_0$ and $\beta_1$ with a 90% family
confidence level and include your interpretation of these intervals. You can use the
Minitab output to find $s\{b_0\}$ and $s\{b_1\}$.
With a sample size of n = 20, the degrees of freedom are n - 2 = 18. Since we are interested in
two joint intervals, for $\beta_0$ and $\beta_1$, g is equal to 2 for our Bonferroni correction. The
intervals are $b_0 \pm B\,s\{b_0\}$ and $b_1 \pm B\,s\{b_1\}$, where $B = t(1 - \alpha/4;\ n - 2)$.
From the t-table, the Bonferroni multiplier for 18 DF and $1 - \alpha/4 = 0.975$ (alpha of 0.10)
is 2.101. Plugging into the equations:
For $\beta_0$: 1.377 +/- 2.101*0.8442 = 1.377 +/- 1.774, so -0.397 <= $\beta_0$ <= 3.151
For $\beta_1$: 0.8652 +/- 2.101*0.1645 = 0.8652 +/- 0.3456, so 0.5196 <= $\beta_1$ <= 1.2108
Interpretation: We are 90% confident that both intervals contain the true intercept and slope.
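The interval arithmetic can be sketched in Python as follows. The multiplier 2.101 is taken from the t-table as quoted in the answer rather than computed, and the helper name `bonferroni_interval` is purely illustrative.

```python
# Bonferroni joint intervals: estimate +/- B * standard error.
# B = 2.101 is the t-table value quoted in the answer (taken as given here).
B = 2.101

def bonferroni_interval(estimate, std_error, multiplier=B):
    """Return (lower, upper) for estimate +/- multiplier * std_error."""
    margin = multiplier * std_error
    return estimate - margin, estimate + margin

intercept_ci = bonferroni_interval(1.377, 0.8442)   # interval for the intercept
slope_ci = bonferroni_interval(0.8652, 0.1645)      # interval for the slope
print(intercept_ci, slope_ci)
```

Because the intercept interval contains 0 and the slope interval does not, the same output also shows at a glance which parameter is plausibly zero at the 90% family level.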
9. Use Minitab to find Bonferroni simultaneous confidence intervals for new X
observations of 0 and 10 using a 95% family confidence level. Include the output and your
interpretation of these intervals. Follow-up question 1: What is the interpretation of the
level of confidence for the confidence intervals in the output? Follow-up question 2: Can
you think of a reason why these new X values might not be reliable? Follow-up question 3:
Show mathematically how one would use the Minitab output to get the simultaneous level
of confidence for new observations.
Interpretation: We are 95 percent confident that both of the following intervals are correct:
the reading achievement stanine for a reading readiness stanine of 0 is between -0.687 and
3.441, and the reading achievement stanine for a reading readiness stanine of 10 is between
7.706 and 12.351.
Predicted Values for New Observations
New Obs     Fit  SE Fit         97.5% CI            97.5% PI
      1   1.377   0.844  (-0.687,  3.441)  (-3.044,  5.798)
      2  10.029   0.950  ( 7.706, 12.351)  ( 5.481, 14.576) X
X denotes a point that is an outlier in the predictors.

Values of Predictors for New Observations
New Obs     X
      1   0.0
      2  10.0
Follow-up 1: The 97.5% level of confidence is how confident we are in any ONE of the intervals
being correct.
Follow-up 2: The x-values used in this analysis ranged from 1 to 9, so applying the regression
equation to values outside this range, such as 0 and 10, is extrapolation and may not be
reliable.
Follow-up 3: The 97.5% level of confidence in the output corresponds to $1 - \alpha/g = 0.975$.
For this problem we are interested in two simultaneous intervals, so g = 2. Solving for alpha
gives $\alpha/g = 0.025$, so $\alpha = 0.05$, which is a 95% simultaneous level of confidence.
NOTE: Software by default already uses $\alpha/2$ when constructing two-sided confidence
intervals, which is why we solve this equation with $\alpha/g$ and not $\alpha/2g$. If one were to
use $\alpha/2g$ with the level of confidence in the output, one would divide by 2 twice.
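The algebra in Follow-up 3 can be written out as a short Python sketch. It only restates the arithmetic above; the variable names are illustrative.

```python
# Each interval in the Minitab output is at level 1 - alpha/g.
per_interval_level = 0.975   # level shown in the output
g = 2                        # number of simultaneous intervals

alpha = g * (1 - per_interval_level)   # family-wise alpha
family_level = 1 - alpha               # simultaneous confidence level
print(round(alpha, 3), round(family_level, 3))
```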
10. What is the value and interpretation of the coefficient of determination? Using the
output and correct values show two ways this value can be calculated.
From the output the coefficient of determination, or R-squared, is 60.6% meaning that 60.6
percent of the variation in reading achievement stanines can be explained by reading readiness
stanines.
S = 1.59914   R-Sq = 60.6%   R-Sq(adj) = 58.4%
Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   70.769  70.769  27.67  0.000
Residual Error  18   46.031   2.557
Total           19  116.800
Two possible methods for calculating R-squared are:
1) (SSR/SST)*100% = (70.769/116.8)*100% = 60.6%
2) [1 - (SSE/SST)]*100% = [1 - (46.031/116.8)]*100% = 60.6%
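Both routes to R-squared can be checked with a few lines of Python, using the sums of squares from the ANOVA output (variable names are illustrative):

```python
# R-squared computed two ways from the ANOVA sums of squares.
ssr, sse, sst = 70.769, 46.031, 116.800

r_sq_from_ssr = ssr / sst * 100          # method 1: SSR / SST
r_sq_from_sse = (1 - sse / sst) * 100    # method 2: 1 - SSE / SST
print(round(r_sq_from_ssr, 1), round(r_sq_from_sse, 1))
```

The two methods agree because SSR + SSE = SST.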
11. From our in-class example of Sales-Advertising, the test results were as follows: the
intercept had T = -0.16 and a p-value of 0.885; the slope test had T = 3.66 and a p-value of
0.035; and the ANOVA test had F = 13.66 and a p-value of 0.035. Use Minitab to find these
p-values by going to Calc > Probability Distributions and selecting either T or F as
appropriate. Then select the radio button for Cumulative Probability, enter the appropriate
degrees of freedom for the test, click the radio button for Input Constant, and enter the
value of the test statistic in the text box. Click OK. From the output, show how
one gets from this output to the p-value. Include a copy of the Minitab output for each
test.
Test of Intercept: From the output we would take 0.441524 and multiply by two to get 0.883
which is approximately 0.885 due to rounding.
Cumulative Distribution Function
Student's t distribution with 3 DF
x P( X <= x )
-0.16 0.441524
Test of Slope: From output we would subtract 0.982377 from 1 and then double this result
getting 0.017623*2 = 0.035246 which is approximately 0.035
Cumulative Distribution Function
Student's t distribution with 3 DF
x P( X <= x )
3.66 0.982377
F-Test: From this output we would simply subtract 0.965626 from 1 to get 0.034374, which is
close to the reported 0.035.
Cumulative Distribution Function
F distribution with 1 DF in numerator and 3 DF in denominator
x P( X <= x )
13.66 0.965626
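The three conversions from cumulative probability to p-value follow two rules: a two-sided t test doubles the smaller tail area, and an F test uses the upper tail only. A minimal Python sketch, using the cumulative probabilities printed in the Minitab output (function names are illustrative):

```python
def two_sided_p_from_t_cdf(cdf_value):
    """Two-sided p-value: double the smaller of the two tail areas."""
    return 2 * min(cdf_value, 1 - cdf_value)

def upper_tail_p_from_f_cdf(cdf_value):
    """F tests are upper-tailed: p = 1 - P(X <= x)."""
    return 1 - cdf_value

p_intercept = two_sided_p_from_t_cdf(0.441524)   # t = -0.16, 3 DF
p_slope = two_sided_p_from_t_cdf(0.982377)       # t = 3.66, 3 DF
p_anova = upper_tail_p_from_f_cdf(0.965626)      # F = 13.66, (1, 3) DF
print(round(p_intercept, 3), round(p_slope, 3), round(p_anova, 3))
```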
Regression Analysis
1. Car mileage and weight:
a) The response variable is mileage, and the explanatory variable is weight.
b) $\hat{y} = 45.6 - 0.0052x$; the y-intercept is 45.6 and the slope is -0.0052.
c) For each 1000-pound increase in the weight of the vehicle, the predicted mileage will
decrease by 5.2 miles per gallon.
d) The y-intercept is the predicted miles per gallon for a car that weighs 0 pounds. This is far
outside the range of the car weights in this database and, therefore, does not have
contextual meaning for these data.
2. Children of working females:
a) $\hat{y} = 5.2 - 0.044(28) = 3.968$ (rounds to 4.0)
b) $\hat{y} = 5.2 - 0.044(91) = 1.196$ (rounds to 1.2)
c) $y - \hat{y} = 2.3 - 1.196 = 1.104$ (rounds to 1.1)
d) The y-intercept indicates that for nations with no female economic activity, the predicted
fertility rate is 5.2. As x increases from 0 to 100, the predicted fertility rate decreases from
5.2 to 0.8.
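The predictions and the residual in this problem can be reproduced with a short Python sketch. It assumes the prediction equation with intercept 5.2 and slope -0.044 used in the answers and the observed fertility rate of 2.3 from part c; the function name is illustrative.

```python
def predicted_fertility(x):
    """Predicted fertility rate for female economic activity x."""
    return 5.2 - 0.044 * x

y_hat_28 = predicted_fertility(28)   # part a
y_hat_91 = predicted_fertility(91)   # part b
residual_91 = 2.3 - y_hat_91         # part c: observed minus predicted
print(round(y_hat_28, 3), round(y_hat_91, 3), round(residual_91, 3))
```

Evaluating the same function at x = 0 and x = 100 gives the endpoints 5.2 and 0.8 quoted in part d.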
3. Dollars and thousands of dollars:
Slope when income is in dollars: 1.50/1000 = 0.0015
4. When can you compare slopes?:
a) For a $1000 increase in GDP, the predicted percentage using cell phones increases by
2.62, and the predicted percentage using the Internet increases by 1.55.
b) Because the slope relating GDP to cell phone use is larger than the slope relating GDP to
Internet use, an increase in GDP would have a slightly greater impact on the percentage
using cell phones than on the percentage using the Internet.
5. Weight, height, and fat:
a) (i) Percentage of body fat and body mass index have the strongest association.
(ii) Height and body mass index have the weakest association.
b) There is a fairly strong, positive association between height and weight. As one goes up,
the other tends to go up.
c) $r^2$ = (0.553)(0.553) = 0.306 (rounds to 0.31). $r^2$ summarizes the reduction in the sum of
squared errors in predicting y using the regression line instead of using the mean of y. In
this case, the sum of squared errors is 31% less when we use the regression equation.
d) None of these results would differ if height and weight were instead measured with metric
units.
6. Verbal and Math SAT:
a) y = 250 + 0.5(500) = 500. Generally, at the x-value equal to its mean, the predicted
value of y is equal to its mean.
b) We can find the correlation as follows: $r = b\left(\frac{s_x}{s_y}\right) = 0.5\left(\frac{100}{100}\right) = 0.5$.
When the x and y variables have the same spread, the correlation equals the slope.
c) $r^2$ = (0.5)(0.5) = 0.25; the sum of squared errors is 25% less when we use the
regression equation instead of the mean of y.
7. SAT regression toward mean:
a) y = 250 + 0.5(800) = 650
b) The predicted y value will be 0.5 standard deviations above the mean, for every one
standard deviation above the mean that x is. Here, x = 800 is three standard deviations
above the mean; so the predicted y value is 0.5(3) = 1.5 standard deviations above the
mean.
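The SAT calculations in problems 6 and 7 can be sketched in Python. The sketch assumes, as the answers imply, that both scores have mean 500 and standard deviation 100; the names are illustrative.

```python
# Assumed from the answers: both variables have mean 500 and SD 100.
MEAN, SD = 500, 100
INTERCEPT, SLOPE = 250, 0.5

def predict_y(x):
    """Predicted y from the regression line y-hat = 250 + 0.5x."""
    return INTERCEPT + SLOPE * x

r = SLOPE * (SD / SD)                # r = b * (s_x / s_y); equal spreads give r = b
z_x = (800 - MEAN) / SD              # x = 800 in standard deviations above the mean
z_y = (predict_y(800) - MEAN) / SD   # predicted y in standard deviations above the mean
print(predict_y(500), predict_y(800), r, z_x, z_y)
```

The predicted z-score equals r times the z-score of x, which is the regression-toward-the-mean effect described in problem 7.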
8. GPAs and TV watching:
a) The correlation of -0.353 (rounds to -0.35) indicates that there is a negative relation
between the two variables. The more one watches television, the lower his or her college
GPA tends to be. The proportional reduction in error of 0.125 (rounds to 0.13) indicates
that the sum of squared errors is 13% less when we use the regression equation instead
of the mean of y.
b) We would expect that student to be (2)(0.505) = 1.01 standard deviations above the
mean on high school GPA. With regression to the mean, the predicted y is relatively
closer to its mean than x is to its mean.
9. t-score?:
a) df = n - 2 = 25 - 2 = 23
b) -2.069 (rounds to -2.07) and 2.069 (rounds to 2.07)
c) We'd use 2.07
10. More boys are bad?:
a) The negative slope indicates a negative association between life length and number of
sons. Having more sons is bad.
b) i) Assumptions: Assume randomization, linear trend with normal conditional distribution
for y and the same standard deviation at different values of x.
ii) Hypotheses: The null hypothesis that the variables are independent is $H_0: \beta = 0$. The
two-sided alternative hypothesis of dependence is $H_a: \beta \neq 0$.
iii) Test statistic: t = b/se = -0.65/0.29 = -2.241.
iv) P-value: The P-value is 0.026.
v) Conclusion: If $H_0$ were true that the population slope $\beta = 0$, it would be unusual to get a
sample slope at least as far from 0 as b = -0.65. In fact, the probability would be 0.026.
The P-value gives very strong evidence that an association exists between number of
sons and life length.
c) The 95% confidence interval is $b \pm t_{.025}(se) = -0.651 \pm 1.966(0.29)$. The confidence
interval is (-1.220, -0.080), which rounds to (-1.2, -0.1). The plausible values for the true
population slope range from -1.2 to -0.1. It is not plausible that the true slope is 0.
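The test statistic and interval for the slope can be reproduced with a few lines of Python. The sketch takes b = -0.65, se = 0.29, and the t value 1.966 from the answer as given (none is computed from data), and the variable names are illustrative.

```python
# Slope inference from summary values quoted in the answer.
b, se = -0.65, 0.29
t_crit = 1.966   # t_.025 value quoted in the answer, taken as given

t_stat = b / se                               # test statistic for H0: slope = 0
ci = (b - t_crit * se, b + t_crit * se)       # 95% confidence interval
zero_is_plausible = ci[0] <= 0 <= ci[1]       # does the interval contain 0?
print(round(t_stat, 3), tuple(round(v, 3) for v in ci), zero_is_plausible)
```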
11. Student GPAs:
a) i) Assumptions: Assume randomization, linear trend with normal conditional distribution
for y and the same standard deviation at different values of x.
ii) Hypotheses: The null hypothesis that the variables are independent is $H_0: \beta = 0$. The
two-sided alternative hypothesis of dependence is $H_a: \beta \neq 0$.
iii) Test statistic: t = b/se = 0.6369/0.1442 = 4.42 (or just look at the printout for the test
statistic).
iv) P-value: The P-value is 0.000.
v) Conclusion: If $H_0$ were true that the population slope $\beta = 0$, it would be very unusual
(the probability would be almost 0) to get a sample slope at least as far from 0 as b =
0.6369. The P-value is below the significance level of 0.05, and we can reject the null
hypothesis. We have very strong evidence that an association exists between high
school and college GPA.
b) The 95% confidence interval is $b \pm t_{.025}(se) = 0.6369 \pm 2.002(0.1442)$.
The confidence interval is (0.348, 0.926), which rounds to (0.3, 0.9). Zero is not a
plausible value for this slope; as was concluded in the significance test, it is not plausible that
there is no association.
12. Predicting house prices:
a) The residual df, 98, equals n - 2; therefore, the sample size was 100.
b) The sample predicted mean selling price was $\hat{y}$ = 9.2 + 77.0(1.53) = 127.010, or
$127,010.
c) The estimated residual standard deviation of y is the square root of the MS Error, 1349.
The square root of 1349 is 36.729.
d) The prediction interval is $\hat{y} \pm 2s$, or 127.010 +/- 2(36.729), which is (53.552, 200.468)
and rounds to (53.6, 200.5).
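Parts b through d can be checked with a short Python sketch. It uses the rough "y-hat plus or minus 2s" prediction interval from the answer, and it rounds s to three decimals as the answer does so that the endpoints match; the variable names are illustrative.

```python
import math

# Predicted price (in thousands of dollars) at x = 1.53, from the answer's equation.
y_hat = 9.2 + 77.0 * 1.53

# Residual standard deviation: square root of MS Error, rounded as in the answer.
s = round(math.sqrt(1349), 3)

# Approximate prediction interval: y-hat +/- 2s.
interval = (y_hat - 2 * s, y_hat + 2 * s)
print(round(y_hat, 3), s, tuple(round(v, 3) for v in interval))
```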
13. Predicting clothes purchases:
a) The value under Fit, 448, is the predicted amount spent on clothes in the past year for
those in the 12th grade of school.
b) The 95% confidence interval of (427, 469) is the range of plausible values for the
population mean of dollars spent on clothes for 12th grade students in the school.
c) The 95% prediction interval of (101, 795) is the range of plausible values for the
individual observations (dollars spent on clothes) for all the 12th grade students at the
school.
14. Savings grow exponentially:
a) $y = \alpha\beta^x = (100)(1.10)^1 = 110$
b) $y = \alpha\beta^x = (100)(1.10)^5 = 161.05$
c) $y = \alpha\beta^x = (100)(1.10)^x$
d) The first year after which you'll have more than $200 is the 8th: $y = \alpha\beta^x = (100)(1.10)^8 = 214.36$
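The exponential growth values, and the search for the first year above $200 in part d, can be sketched in Python. The starting amount 100 and growth factor 1.10 come from the answers; the function names are illustrative.

```python
def balance(years, principal=100.0, growth=1.10):
    """Savings after the given number of years of 10% annual growth."""
    return principal * growth ** years

def first_year_exceeding(target):
    """Smallest whole number of years after which the balance exceeds target."""
    year = 0
    while balance(year) <= target:
        year += 1
    return year

print(round(balance(1), 2), round(balance(5), 2), first_year_exceeding(200))
```

After 7 years the balance is still below 200, which is why the 8th year is the first to qualify.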
15. U.S. population growth:
a) $\hat{y} = 68.33 \times 1.1418^{0} = 68.33$ million; $\hat{y} = 68.33 \times 1.1418^{11} = 293.83$ million
b) 1.1418 is the multiplicative effect on y for a one-unit increase in x.
c) This suggests a very good fit of data to model. The high correlation indicates a linear
relation between the log of the y values and the x values.
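A brief Python sketch of the fitted exponential model, using the coefficients 68.33 and 1.1418 from part a (the function name is illustrative), shows both the predictions and the multiplicative effect described in part b:

```python
def predicted_population(x):
    """Predicted population in millions from y-hat = 68.33 * 1.1418**x."""
    return 68.33 * 1.1418 ** x

pop_start = predicted_population(0)
pop_x11 = predicted_population(11)

# Multiplicative effect: each one-unit increase in x multiplies y-hat by 1.1418.
ratio = predicted_population(6) / predicted_population(5)
print(round(pop_start, 2), round(pop_x11, 2), round(ratio, 4))
```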