26 Answers

Mix and Match


1. i 2. c 3. a 4. g 5. b 6. e 7. j 8. f 9. d 10. h

True/False
11. True. It's only possible confounding; the lurking variable must also be related to the response.
12. False. It's another name for using regression to patch up for the absence of randomization, but it's not the same thing.
13. True.
14. True.
15. True.
16. True.
17. False. The purpose of an interaction is to allow the slopes to differ; without an interaction, the slopes match.
18. True.
19. False. We'd have to do this for every conceivable lurking factor, and we've not measured them all. Confounding is always possible without randomization.
20. True.
21. False. It is helpful if the sizes of the two groups are similar, but this is not assumed by the model.

22. False. Use comparison boxplots of the residuals grouped by the categorical variable.

Think About It
23. Is this data from a randomized experiment? If not, do we know that the sales agents sell comparable products that produce similar revenue streams? Do we know the costs for the agents in the two groups are comparable, with similar supporting budgets, such as comparable levels of advertising and internal staff support? Without such balance, there are many sources of confounding that could explain the differences that we see in the figure. The lurking factor might also explain the slight difference in variation that we see in the summary.

24. The relevant lurking factor that ought to come to mind is inflation in the cost of products bought by this firm. If the prices of these purchases have risen 10% over the year, then this should be taken into account. Similarly, has the nature of the business changed over this time period? Perhaps the invoices in the 2006 year are for more expensive types of purchases or in larger quantities than those bought in 2005.

25. We combine them in order to compare the intercepts and compare the slopes. The multiple regression that combines them includes one coefficient that is the difference in the intercepts and another that is the difference between the slopes. These both come with standard errors, and hence allow us to test whether the observed differences (which are the same with either approach) are statistically significant. (A sketch of how such a model might be set up in software appears at the end of this section.)

26. The assumption of equal error variance in the two SRMs. You can have the SRM work in each subset, but with different error variances. When combined, the difference in error variances violates the similar variances condition of the MRM. To check this condition, we should look at the residuals grouped by the dummy variable.

27. In general, one should always try an interaction unless you have strong reason to know that the slopes are parallel in the two groups. In this context, it seems clear that the model needs an interaction. Union labor in auto plants makes more than nonunion labor, and the slope is the estimate for the cost per hour of labor. We'd expect it to be higher in the union shop.

28. The intercept is the start-up time and the slope is the time per unit (fixed and variable costs, respectively). If the robots function as before, then the main change will be the reduction in the intercept (smaller fixed costs). One might also expect less variation after the change; it might have been the case that start-up times varied widely, seeing that they ran on for 20 hours. The slopes should be about the same, though we'd still check for an interaction.

29. a) The intercept is the mean salary for Group = 0, namely the women ($140,467). The slope is the difference in salaries, with men marginally making $3,644 more than women overall (i.e., ignoring the effect of managerial grade level).
b) These match (almost). The slope in the simple regression is the difference in mean salaries, so regression assigns this estimate almost the same level of significance as that found in the two-sample t-test.
c) That the variances in the two groups are the same. The regression approach is comparable to a two-sample t-test that requires equal variances. The t-test introduced in Chapter 18 does not require this assumption.

30. a) The intercept is the average number of mailings (about 30) for companies that were not aware. The slope is the difference on average between those that were not aware and those that are (12 more for those that are aware).
b) On average, those that were aware sent statistically significantly more packages in the next month.
c) The difference in variation could be due to a lurking variable. The larger variation could be due to the role of the hours variable. If a wider range of hours were given to those that are aware, then this could explain the visible differences in variation. That is, a lurking variable could spread out those that were aware more than those that were not.

31. a) About 2. Focus on the green points. At x = 0, the average seems to be about 0. At x = 4, the average of these is near 8, and 8/4 = 2. A similar calculation applies to the red points and gives a similar slope near 2.
b) The slope will be much flatter, closer to zero. It seems like it might be positive, but it will be considerably less than 2.

32. a) The intercept is the set-up time, the time to configure the robots used in the assembly.
b) The slope is the minutes per item produced, basically the rate at which the process churns out items once started.
c) Need the interaction. The two fits evidently cross over near about 40 to 50 units.

33. a) Yes, the fits appear parallel because the coefficient of the interaction (D * x), which measures the difference in the slopes of the two groups, is not statistically significant (its t-statistic is within 1 of zero).
b) Remove the interaction term to reduce the collinearity and force the slopes to be precisely parallel.

34. a) The slope for D tells you the difference in the fits of the two equations when Units = 0. In the context of this problem, the slope 52.82 means that the green employees (those with training) take about 50 minutes longer to get the production line set up for producing units. Once they get it set up, however, they are more efficient, with smaller time costs for additional units.
b) We need to find the point at which the two regression lines cross (they are not parallel in this example). As you can see in the following figure that shows the fits, the two lines cross near 45 units:

[Figure: the two fitted lines, Minutes versus Units (roughly 20 to 110 units); the fits cross near 45 units.]

That's probably good enough in practice, but if we want to be thorough, we need to find the number of units such that
26.783 + 2.062 units = (26.783 + 52.816) + (2.062 - 1.277) units
Since the baseline terms are common to both sides, we can drop them and solve for units:
0 = 52.816 - 1.277 units, so units = 52.816/1.277 ≈ 41.36
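The same arithmetic can be checked in software. Here is a minimal sketch, assuming Python with pandas and statsmodels; the file and column names (Minutes, Units, Group) are hypothetical, not taken from the exercise data. It fits one regression with a dummy variable and an interaction, as described in Questions 25, 33, and 34, and then locates the crossover point from the coefficients.

```python
# Minimal sketch: one regression with a dummy (Group) and an interaction,
# then the point where the two implied fits cross.
# File and column names are hypothetical, not from the exercise data.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("production_times.csv")  # hypothetical file name

# Group is coded 0/1; Group:Units is the interaction (difference in slopes).
fit = smf.ols("Minutes ~ Units + Group + Group:Units", data=df).fit()
print(fit.summary())

b = fit.params
# Dummy coefficient = difference in intercepts; interaction = difference in slopes.
# The two fits cross where dummy + interaction * units equals zero.
crossover = -b["Group"] / b["Group:Units"]
print(f"fitted lines cross near {crossover:.1f} units")  # compare with the hand calculation (about 41)
```

Dropping the interaction term from the formula forces parallel fits, which is what the answers above do whenever the interaction is not statistically significant.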

You Do It
35. Emerald diamonds
(a) In order to be a confounding variable, the weight has to be related to the price (we know this is true from previous study of these data, and common sense) and the weight has to be related to the group indicator. That is, diamonds of one clarity grade have to have different weights than those of the other. If the two groups have comparable weights, then the effect of weight is balanced between the two. A two-sample comparison of weight by clarity shows that the average weight is almost the same in the two groups. Weight is unlikely to be a confounding effect in this analysis.

Level   Number   Mean       Std Dev
VS1     90       0.413556   0.054408
VVS1    54       0.408148   0.053661

(b) The two-sample t-test finds a statistically significant difference, with VVS1 costing on average about $110 more than VS1 diamonds.

VS1-VVS1, allowing unequal variances
Difference     -112.30     t Ratio      -2.88504
Std Err Dif      38.93     DF          103.4548
Upper CL Dif    -35.11     Prob > |t|    0.0048
Lower CL Dif   -189.50     Prob > t      0.9976
Confidence        0.95     Prob < t      0.0024
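The two-sample comparisons in these exercises allow unequal variances (a Welch-type t-test). For readers working in software, a minimal sketch of that computation follows; the arrays are simulated stand-ins, not the actual diamond prices.

```python
# Hedged sketch of a two-sample t-test allowing unequal variances (Welch).
# The arrays are simulated stand-ins for the VS1 and VVS1 prices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
vs1 = rng.normal(1000, 200, size=90)    # 90 VS1 diamonds (made-up prices)
vvs1 = rng.normal(1110, 200, size=54)   # 54 VVS1 diamonds (made-up prices)

t_stat, p_value = stats.ttest_ind(vs1, vvs1, equal_var=False)  # Welch version
print(f"t = {t_stat:.2f}, two-sided p = {p_value:.4f}")
```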


(c) Because the interaction is not statistically significant, we'll remove it and refit the model without this term. Evidently, the cost of either type of diamond rises at the same rate with weight.

Term              Estimate    Std Error   t Ratio   Prob>|t|
Intercept         -52.53705   131.9049    -0.40     0.6910
Weight (carats)   2863.4963   316.2582     9.05     <.0001
Clarity            214.1266   215.9869     0.99     0.3232
Clarity * Weight  -211.5379   522.1921    -0.41     0.6860

Without the interaction, the fits are parallel and the estimated effect for clarity is statistically significant.

R2  0.497376
se  161.8489
n   144

Term             Estimate     Std Error   t Ratio   Prob>|t|
Intercept        -20.44887    105.1595    -0.19     0.8461
Weight (carats)  2785.9054    250.9129    11.10     <.0001
Clarity           127.36823    27.89249    4.57     <.0001

Based on the fit of this multiple regression, we see that for diamonds of comparable weight, those of clarity VVS1 cost on average about $127 more than those of clarity VS1.
(d) From the two-sample comparison, the 95% confidence interval for the mean difference in price is $35 to $190 more for VVS1 diamonds. The estimated mean difference from the multiple regression is
127.36823 - 2 * 27.89249 to 127.36823 + 2 * 27.89249, or about $72 to $183.
The regression interval is shorter because it removes the variation in price due to weight, providing a more precise estimate. There's no confounding, however, because the weights are comparable in the two groups. Hence the estimated average differences in price ($112 vs. $127) are comparable. (A software sketch of this interval appears after the figure below.)
(e) The two groups have similar variances, but the variance increases with the price. Thus, the multiple regression does not meet the similar variances condition. The prices of diamonds become more variable as they get larger. We've seen this one before.
[Figure: residuals versus predicted Price ($); the residuals spread out as the predicted price increases.]
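Returning to part (d): the regression interval quoted above can also be read directly from a fitted model rather than computed by hand with the plus-or-minus 2 SE shortcut. A hypothetical sketch follows; the file and column names (Price, Weight, Clarity01, with Clarity01 coded 1 for VVS1) are assumptions, not taken from the exercise data.

```python
# Hedged sketch: the interval for the clarity effect, adjusting for weight.
# File and column names are assumptions, not taken from the exercise data.
import pandas as pd
import statsmodels.formula.api as smf

diamonds = pd.read_csv("emerald_diamonds.csv")
fit = smf.ols("Price ~ Weight + Clarity01", data=diamonds).fit()

print(fit.conf_int().loc["Clarity01"])      # exact t-based 95% interval
est, se = fit.params["Clarity01"], fit.bse["Clarity01"]
print(est - 2 * se, est + 2 * se)           # the quick +/- 2 SE version used above
```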


36. Convenience shopping
(a) Previous analysis of this data has shown that volume of gasoline is related to sales in the convenience store. So volume meets one of the conditions for a confounding variable: it's related to the response. To be a confounder, it also has to differ between the two groups. In this data, the following summary shows that Site 1 is busier. Gasoline sales (traffic) confound the comparison of the sales. Site 1 sells about 20% more gasoline, a statistically significant amount.

Site 1-Site 2, allowing unequal variances
Difference      659.251    t Ratio      13.81833
Std Err Dif      47.708    DF          506.2273
Upper CL Dif    752.982    Prob > |t|    0.0000
Lower CL Dif    565.520    Prob > t      0.0000
Confidence        0.95     Prob < t      1.0000

(b) A two-sample t-test finds a statistically significant difference in sales, with Site 1 selling on average about $700 more than Site 2.

Site 1-Site 2, allowing unequal variances
Difference      727.208    t Ratio      28.71723
Std Err Dif      25.323    DF          551.4839
Upper CL Dif    776.949    Prob > |t|    0.0000
Lower CL Dif    677.466    Prob > t      0.0000
Confidence        0.95     Prob < t      1.0000

(c) The initial analysis finds no statistically significant interaction, so we'll remove this term and refit the model. Evidently, gasoline sales produce comparable sales in both convenience stores.

Term              Estimate    Std Error   t Ratio   Prob>|t|
Intercept         688.06922   84.96066    8.10      <.0001
Volume (Gallons)  0.2997907   0.031264    9.59      <.0001
Dummy             460.14764   113.4244    4.06      <.0001
Dummy * Volume    0.0208022   0.038283    0.54      0.5871

Without the interaction, the estimated model is

R2  0.735122
se  243.5976
n   568

Term              Estimate    Std Error   t Ratio   Prob>|t|
Intercept         650.91601   50.40009    12.91     <.0001
Volume (Gallons)  0.313664    0.018032    17.39     <.0001
Dummy             520.42454   23.64755    22.01     <.0001

Adjusted for volume, the estimated difference in sales remains statistically significant, but falls to about $520.
(d) The initial two-sample comparison estimates the mean difference in daily sales as $677 to $777 more at Site 1. Adjusted for gasoline sales, the multiple regression puts this difference at
520.42454 - 2 * 23.64755 to 520.42454 + 2 * 23.64755, or about $473 to $567.


The estimated range from the multiple regression is shorter because the regression removes the variation from the response due to variation in gasoline sales. The bigger difference, however, is the shift of about $200. When adjusted for differences in traffic volume, Site 1 is still doing better, but not so much as suggested by the initial comparison. (e) The model meets the similar variances condition. In this example, we can identify both groups in the plot of the residuals on the fitted values. Color-coding makes the boxplots unnecessary in this case, but it would probably be best to do both.
[Figure: residuals versus predicted Sales (Dollars), with the two sites color-coded; the variation is similar across the fitted values.]

(f) Yes. By pooling, the slope is inflated, making it look as though gasoline sales have a bigger impact on sales in the convenience store. This simple regression suggests that each gallon of gas sold generates $0.51 in convenience store sales. In fact, the slope at either location is only $0.31/gallon. (A small simulated illustration of this pooling effect appears after the output below.)
[Figure: pooled scatterplot of Sales (Dollars) versus Volume (Gallons) with a single fitted line.]

Term              Estimate    Std Error   t Ratio   Prob>|t|
Intercept         310.44032   65.31151    4.75      <.0001
Volume (Gallons)  0.5131557   0.021225    24.18     <.0001
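To see how pooling two sites with different intercepts can inflate a common slope, here is a small simulated illustration; the numbers are made up and only loosely patterned on the estimates above, not taken from the data.

```python
# Simulated illustration of the pooling effect in part (f): two sites share a
# slope of about 0.31 but have different intercepts, and the busier site also
# sells more gas. Fitting one line to the pooled data inflates the slope.
import numpy as np

rng = np.random.default_rng(0)
n = 300
vol1 = rng.normal(3500, 400, n)          # Site 1: higher volume (made-up numbers)
vol2 = rng.normal(2800, 400, n)          # Site 2: lower volume
sales1 = 650 + 520 + 0.31 * vol1 + rng.normal(0, 240, n)
sales2 = 650 + 0.31 * vol2 + rng.normal(0, 240, n)

vol = np.concatenate([vol1, vol2])
sales = np.concatenate([sales1, sales2])
pooled_slope = np.polyfit(vol, sales, 1)[0]
within_slope = np.polyfit(vol1, sales1, 1)[0]
print(f"within-site slope ~ {within_slope:.2f}, pooled slope ~ {pooled_slope:.2f}")
```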

37. Download
a) The file size is related to the transmission time. To be a confounding variable, it must also be different in the two groups. As shown in the two-sample comparison, the file sizes are paired in the two groups. Because of this balance, the file size cannot be a confounding variable. It's the same in both samples.

Level   Number   Mean      Std Dev
MS      40       56.9500   25.7014
NP      40       56.9500   25.7014

b) The two-sample t-test finds a very statistically significant difference in the performance of the software from the two vendors. On average, the software labeled MS transfers files in about 5.5 fewer seconds. (The variance is substantially larger for the files sent using the NP software.)

MS-NP, allowing unequal variances
Difference     -5.5350    t Ratio     -2.52682
Std Err Dif     2.1905    DF          58.79005
Upper CL Dif   -1.1515    Prob > |t|   0.0142
Lower CL Dif   -9.9185    Prob > t     0.9929
Confidence      0.95      Prob < t     0.0071

c) The interaction in the model is statistically significant, meaning that the two types of software have different rates of transfer (different MB per second).

R2  0.752229
se  5.138168
n   80

Term                      Estimate    Std Error   t Ratio   Prob>|t|
Intercept                 4.8929786   1.995934    2.45      0.0165
File Size (MB)            0.4037229   0.032012    12.61     <.0001
Vendor Dummy              4.7633694   2.822677    1.69      0.0956
Vendor Dummy * File Size  -0.180832   0.045272    -3.99     0.0001

As shown in this plot of the fit of this model (different intercepts and slopes in the two groups), the transfer times using MS (in red) become progressively shorter than those obtained with the software labeled NP. The small difference in the intercepts (the coefficient of the dummy variable is not statistically significant) happens because both send small files quickly. The difference emerges only when the files get larger.
[Figure: the two fitted lines, Transfer Time (sec) versus File Size (MB), for the MS and NP software.]

d) The two-sample comparison finds an average difference of 5.5 seconds (range 1 to 10 seconds), with MS transferring files faster. The analysis of covariance also identifies MS as faster, but shows that the gap becomes progressively wider as the file size increases. NP transfers files (once started) at a rate of about 0.4 sec/MB, compared to roughly 0.22 sec/MB for MS (the 0.40 slope less the 0.18 interaction). The mean of the two-sample comparison is an average gap ignoring the size of the files.
e) No. You can see hints of a problem in the color-coded plot of residuals on fitted values (with MS shown in red). Similarly, the boxplots of residuals show different variances.
[Figures: residuals versus predicted Transfer Time (sec), color-coded by vendor, and comparison boxplots of the residuals by vendor; the NP residuals are more variable.]

38. Production costs
a) Material costs could be a confounding variable because they are related to the average cost per unit. The material costs per unit are, however, very similar in the two plants. Hence, material costs per unit are not going to confound the comparison using the two-sample test.

NEW-OLD, allowing unequal variances
Difference     -0.22241    t Ratio     -1.28719
Std Err Dif     0.17279    DF         156.2608
Upper CL Dif    0.11889    Prob > |t|   0.1999
Lower CL Dif   -0.56372    Prob > t     0.9000
Confidence      0.95       Prob < t     0.1000

b) The average cost per unit is slightly lower in the new plant, by about $1.10, but the difference found by the two-sample test is not statistically significant.

NEW-OLD, allowing unequal variances
Difference     -1.1133    t Ratio     -0.877
Std Err Dif     1.2694    DF         166.1142
Upper CL Dif    1.3930    Prob > |t|   0.3818
Lower CL Dif   -3.6195    Prob > t     0.8091
Confidence      0.95      Prob < t     0.1909

c) Neither the interaction nor the dummy variable is statistically significant in the model with both.

Term                    Estimate    Std Error   t Ratio   Prob>|t|
Intercept               32.83758    1.650956    19.89     <.0001
Material Cost ($/unit)  2.9437822   0.609098    4.83      <.0001
Plant Dummy             1.1116877   2.776285    0.40      0.6893
Dummy * Mat Cost/Unit   -0.716403   1.094844    -0.65     0.5137

After removing the interaction term, the effect of the plant dummy alone remains not statistically significant.

Term                    Estimate    Std Error   t Ratio   Prob>|t|
Intercept               33.372891   1.431873    23.31     <.0001
Material Cost ($/unit)  2.722051    0.505381    5.39      <.0001
Plant Dummy             -0.507856   1.255816    -0.40     0.6864

The final model is basically the original simple regression.
d) In both analyses, we do not find a difference between the plants. Both approaches reach comparable conclusions since the material cost is not confounded between the two plants; plus, there's so much variation that we find little difference between the t-test interval and that from regression. The t-test indicates that the new plant costs run between $0.55 less and $0.12 more than the old plant. The regression gives a wider range,
-0.507856 - 2 * 1.255816 to -0.507856 + 2 * 1.255816, or about -$3.00 to $2.00.
e) The model with a dummy variable meets the assumptions. There's no indication of a problem in the color-coded plot of residuals on the fitted values, and the comparison boxplots agree. The residuals are slightly more variable in the old plant, but not by so much as to indicate a problem (neither box is twice the length of the other).
[Figure: comparison boxplots of the residuals (Average Cost, $/unit) for the NEW and OLD plants.]

39. Home prices
a) There's a clear difference, with a much steeper slope (higher fixed costs) for the data for Realtor B (shown as green crosses here).

[Figure: scatterplot of Price/Sq Ft versus 1/Sq Ft for the two realtors, with fitted lines.]

b) The model requires both a dummy variable and an interaction.

R2  0.762904
se  0.037308
n   36

Term            Estimate    Std Error   t Ratio   Prob>|t|
Intercept       0.155721    0.019713    7.90      <.0001
1/Sq Ft         57.923342   31.2019     1.86      0.0726
Realtor Dummy   -0.176852   0.061062    -2.90     0.0068
Dummy * 1/SqFt  568.5921    110.8419    5.13      <.0001

c) The data for Realtor B are much less variable around the fitted line than the data for Realtor A; the residuals do not meet the similar variances condition. You don't need the boxplots to see the problem in this example if you've got the points colored.
[Figures: residuals versus predicted Price/Sq Ft, and comparison boxplots of the residuals by realtor; the residuals for Realtor B are much less variable.]

d) The estimates are fine to interpret even with the evident lack of similar variances, as these reproduce the fitted equations for the separate groups. The intercept, about $156/SqFt, is the estimated variable cost for Realtor A homes. The fixed costs for this realtor (the slope for 1/SqFt) run about $58,000. For Realtor B, the estimated intercept is near zero (0.156 - 0.177), suggesting no variable costs! Instead, the prices for this realtor seem to be all fixed costs, with an estimate near 58 + 569 = $627,000 regardless of the size of the home!
e) No, you cannot use these estimates of variation. The formula for the SE of a regression slope depends on the single estimate se of residual variation, and that is inappropriate in this analysis. We need separate estimates of the error variance for the two realtors. We can interpret the fit, but not use the tools for inference.
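One way to act on part (e) in software is simply to fit the regression separately within each group, so that each realtor gets its own estimate of residual variation. A hypothetical sketch follows; the file and column names are assumptions, not taken from the exercise data.

```python
# Hedged sketch for part (e): with unequal error variances, fit each realtor
# separately so each group gets its own estimate of residual variation.
# File and column names (PricePerSqFt, InvSqFt, Realtor) are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

homes = pd.read_csv("home_prices.csv")            # hypothetical file name
for realtor, grp in homes.groupby("Realtor"):
    fit = smf.ols("PricePerSqFt ~ InvSqFt", data=grp).fit()
    resid_se = fit.mse_resid ** 0.5               # group-specific residual SE
    print(realtor, fit.params.round(4).to_dict(), "se =", round(resid_se, 4))
```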

40. Leases
a) The locations are very distinct, with those in the city (shown as red dots) costing more than those in the suburbs (green crosses).
[Figure: scatterplot of Cost per Sq Foot versus 1/Sq Feet, color-coded by location.]

b) The fitted model uses both the dummy variable and interaction. Both appear statistically significant, though we need to check the conditions before going further with inference.

R2  0.615452
se  1.092205
n   223

Term            Estimate    Std Error   t Ratio   Prob>|t|
Intercept       15.817545   0.117467    134.66    <.0001
1/Sq Feet       1911.3659   512.8437    3.73      0.0002
Location Dummy  1.5369136   0.1874      8.20      <.0001
Dummy * 1/SqFt  5145.1802   835.996     6.15      <.0001

c) Certainly, the original plot looks straight enough. The plot of residuals shows slightly higher variation in the city. Some (shown to the right, with large estimated values) are rather expensive. The boxplots show that the variances are by and large comparable. A normal quantile plot of the residuals combined for both locations shows that the data are also nearly normal (not shown here).

[Figures: residuals versus predicted Cost per Sq Foot, and comparison boxplots of the residuals for City and Suburbs.]

d) In the baseline model (the suburbs, coded 0 by the dummy variable), the variable costs are about $15.82 per square foot, with about $1,900 in fixed costs. For the city, the variable costs are higher by about $1.54, with higher fixed costs (about $5,150 more).
e) Yes: because this model meets the conditions for the MRM, we can build confidence intervals and tests. For example, we can estimate that the premium for locating in the city runs roughly
1.5369 - 2 * 0.1874 to 1.5369 + 2 * 0.1874, or about $1.16 to $1.91 per square foot,
more than a comparable location in the suburbs.

41. R&D expenses
(a) The two years look very similar, with the colors evenly mixed. A simple regression fit to both years seems reasonable.
[Figure: scatterplot of Log R&D Expense versus Log Assets for both years, with the colors evenly mixed.]

R2  0.807597
se  0.896963
n   985

Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   -1.192587   0.062477    -19.09    <.0001
Log Assets  0.7954859   0.012384    64.23     0.0000

(b) The residuals from the multiple regression show some skewness noted previously for 2004. Both years have comparable variances, however, and share this problem. As you can tell from the normal quantile plot, the combined data are not nearly normal, but since we are working with the slopes (which are averages) we can continue on thanks to the CLT. There's a more serious problem, however, not seen in these plots: do you really think that the two data values from AMD or Intel, for example, are independent of each other? Or does it seem more likely that the data are dependent? We're voting for dependent, calling into question any notion of using the usual formulas for standard errors.
[Figures: residuals versus predicted Log R&D Expense, comparison boxplots of the residuals by year (2003, 2004), and a normal quantile plot of the combined residuals.]

(c) Here's the summary of the multiple regression. Neither added variable is statistically significant, and the R2 has hardly moved from the simple regression.

R2  0.8077
se  0.897636
n   985

Term                Estimate    Std Error   t Ratio   Prob>|t|
Intercept           -1.184021   0.091473    -12.94    <.0001
Log Assets          0.7900453   0.017898    44.14     <.0001
Year Dummy          -0.016828   0.125321    -0.13     0.8932
Dummy * Log Assets  0.0110461   0.024828    0.44      0.6565


The incremental F test that measures the change in R2 that comes with adding two explanatory variables is
F = (0.8077 - 0.807597)/(1 - 0.8077) * (985 - 1 - 3)/2 ≈ 0.26
which is not statistically significant. This agrees with the visual impression conveyed by the original scatterplot: the relationship appears to be the same in both years.
(d) Overall, a common regression model captures the relationship. The elasticity of R&D expenses with respect to assets is about 0.8: on average, each 1% increase in assets comes with a 0.8% increase in R&D expenses. I have serious questions, however, about the independence of the residuals in the two years, since I have a pair of measurements on each company. It's hard to think of these as independent.

42. Cars
a) The color-coded scatterplot shows that cars from European companies (red dots) tend to be more expensive, given their HP, than cars from domestic manufacturers (green crosses).
[Figure: scatterplot of Log 10 Price versus Log 10 HP, color-coded by origin, with the pooled fitted line.]

As a result, the simple regression shown splits the difference between the two, compromising the slope and intercept to blend the two into one.

R2  0.767004
se  0.104323
n   132

Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  1.2125378   0.157681    7.69      <.0001
Log 10 HP  1.406791    0.068004    20.69     <.0001

b) The model appears fine but for some minor warts. Here is a summary of the fit of the model.

R2  0.869241
se  0.078761
n   132

Term             Estimate    Std Error   t Ratio   Prob>|t|
Intercept        1.5383054   0.158665    9.70      <.0001
Log 10 HP        1.2423271   0.068972    18.01     <.0001
Import Dummy     -0.271731   0.245108    -1.11     0.2697
Import * Log HP  0.1777922   0.105291    1.69      0.0937

The initial scatterplot of the data seems straight enough in both groups, and the plot of the residuals on fitted values suggests no problems, though perhaps a slight increase in variation as the car prices increase. The boxplots indicate similar variances, and the normal quantile plot confirms that the data are nearly normal (albeit with outliers such as the exotic Panoz on the high end of the scale and the cheap-for-its-power Ford Cobra on the low end).
[Figures: residuals versus predicted Log 10 Price, comparison boxplots of the residuals for Europe and US, and a normal quantile plot of the combined residuals.]

(c) The incremental F-test gives the value
F = (0.869241 - 0.767004)/(1 - 0.869241) * (132 - 1 - 3)/2 ≈ 50 >> 4
The incremental F shows that an increase in R2 from 77% to 87% by the addition of two explanatory variables is highly statistically significant.
(d) The estimates of the MRM show that neither coefficient is statistically significant taken separately. That's collinearity at work! The VIFs for these estimates are larger than 300! Because of the collinearity, neither one appears statistically significant taken individually. As a pair, however, the combination brings a statistically significant improvement to the fit of the model.

43. Movies
a) Adult movies (red dots) appear to have consistently higher subsequent sales at a given box-office gross than family movies. The fits to the two groups look linear (on this log scale) with a fringe of outliers. A common simple regression splits the difference between the two groups. Here's the simple regression.
[Figure: scatterplot of Log 10 Subsequent Purchase versus Log 10 Gross, color-coded by audience, with the pooled fitted line.]

R2  0.648668
se  0.253298
n   224

Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     -1.305742   0.063479    -20.57    <.0001
Log 10 Gross  0.8420019   0.04159     20.25     <.0001

(b) The following results summarize fitting the multiple regression with a dummy variable and interaction.

R2  0.75236
se  0.213623
n   224

Term               Estimate    Std Error   t Ratio   Prob>|t|
Intercept          -1.344678   0.104526    -12.86    <.0001
Log 10 Gross       0.7394797   0.063621    11.62     <.0001
Audience Dummy     -0.070228   0.122375    -0.57     0.5666
Dummy * Log Gross  0.2358524   0.076836    3.07      0.0024

The initial scatterplot appears straight enough within groups, and the plot of residuals on fitted values shows no deviations from the conditions. The comparison boxplots show that the variability is consistent in the two groups. The normal quantile plot confirms that the combined residuals are nearly normal, though a bit skewed (toward the left and smaller values). A subset of movies (kids' movies, it seems) earn quite a bit less in subsequent sales for their level of box-office success. Call it the Barney effect: parents who endured these in theatres didn't want these movies in the house.

[Figures: residuals versus predicted Log 10 Subsequent Purchase, comparison boxplots of the residuals for Adult and Family audiences, and a normal quantile plot of the combined residuals.]

(c) The incremental F-test uses the change in R2 to measure the statistical significance of adding the two predictors (dummy and interaction). The test statistic is
F = (0.75236 - 0.648668)/(1 - 0.75236) * (224 - 1 - 3)/2 ≈ 46
which is very statistically significant. We'd reject H0 that the added variables both have slope zero.
(d) The interaction is highly statistically significant, but the slope for the dummy is not. Only one predictor seems useful. The F-test reaches a more impressive view of the value of adding these two predictors because it is not concerned about the substantial collinearity between them. The VIFs for these explanatory variables are almost 15, reducing the size of the shown t-statistics for each by a factor of about 4.
(e) The estimates show that as the box-office gross increases, movies intended for adult audiences sell statistically significantly better. Each 1% increase in the box-office gross for a family movie fetches about a 0.74% increase in after-market sales. For adult movies, the elasticity jumps to about 0.74 + 0.24 = 0.98%. As the success at the box office grows, the gap opens up, with adult movies doing better.

44. Hiring
(a) The black and white view shows a cluster of points separated from the main body of the data. These all joined an existing office (red). The simple regression fit to all of the data has a smaller slope than seems appropriate for employees in a new office (green).
[Figure: scatterplot of Log Profit versus Log Accounts, color-coded by office type, with the pooled fitted line.]

R2  0.176184
se  0.717014
n   464

Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     8.9444533   0.100374    89.11     <.0001
Log Accounts  0.2903952   0.029215    9.94      <.0001

(b) These tables summarize the multiple regression with the dummy variable and interaction. Both are statistically significant.

R2  0.29671
se  0.663929
n   464

Term                   Estimate    Std Error   t Ratio   Prob>|t|
Intercept              8.8514041   0.147548    59.99     <.0001
Log Accounts           0.2612048   0.035875    7.28      <.0001
Office Dummy           -0.897375   0.224139    -4.00     <.0001
Office * Log Accounts  0.432733    0.068611    6.31      <.0001

The plot of the residuals on fitted values shows skewness, with the negative range reaching far more than twice as far as the range above zero. The residuals in the two groups have similar variances. The normal quantile plot of all the residuals together shows these to be nearly normal, though somewhat skewed as noticed. Hence, plots of the data do not suggest a problem. We would question the assumption of independence if we learned that some of these employees worked in the same office or collaborated in some way.

[Figures: residuals versus predicted Log Profit, comparison boxplots of the residuals for Existing and New offices, and a normal quantile plot of the combined residuals.]

(c) The incremental F-test judges the change in R2 to be statistically significant, as you would guess since both estimates are statistically significant by wide margins and the sample size is rather large (n = 464):
F = (0.29671 - 0.176184)/(1 - 0.29671) * (464 - 1 - 3)/2 ≈ 39.4
(A small code sketch of this computation appears after the figure below.)
(d) These agree strongly in this example. Part of the reason for the agreement is that both the slope and the intercept differ in the two groups. Also, there's less collinearity than in many cases (such as the other exercises). The VIFs are about 10: large, but not devastating.
(e) The following plot shows the fits implied by the multiple regression. The statistically significant interaction suggests that a one-approach-for-all placement procedure is not going to be the best solution. Hires that are able to generate lots of new accounts appear to do much better in new offices. Hires that do not open so many accounts appear more suited to starting work in an existing office. The crossover point in the two fits occurs where the log of accounts is approximately the ratio of the coefficient of the dummy to the interaction, 0.897/0.433 ≈ 2.07, or about exp(2.07) ≈ 7.9, say 8 accounts.


[Figure: the two fitted lines, Log Profit versus Log Accounts, for new and existing offices; the fits cross near 8 accounts (log accounts about 2).]
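As noted in part (c), here is a minimal sketch of the incremental (partial) F computation used in Exercises 41 through 44, written as a small function; the function name is ours, and the values plugged in are the R2 statistics quoted above for this exercise.

```python
# Incremental F test for adding q explanatory variables to a regression:
# compare the R2 of the full model with the R2 of the reduced model.
def incremental_f(r2_full: float, r2_reduced: float, n: int, k_full: int, q: int) -> float:
    """n = sample size, k_full = explanatory variables in the full model,
    q = number of variables added (here, the dummy and the interaction)."""
    return ((r2_full - r2_reduced) / q) / ((1 - r2_full) / (n - k_full - 1))

# Exercise 44(c): the full model has 3 explanatory variables, 2 of them added.
print(round(incremental_f(0.29671, 0.176184, n=464, k_full=3, q=2), 1))  # about 39.4
```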

45. Promotion
(a) A simple regression that combines the data from both locations makes a serious mistake, one that vastly overstates the effect/benefit of detailing. By fitting one line to both groups, rather than within each, the higher sales in Boston (red dots) inflate the slope.
[Figure: scatterplot of Market Share versus Detail Voice for both cities, with the pooled fitted line.]

R2  0.310936
se  0.038406
n   78

Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     0.0917039   0.015423    5.95      <.0001
Detail Voice  1.0825305   0.184853    5.86      <.0001

(b) The scatterplot suggests parallel fits in the two locations, with a common slope for detailing, and this model meets the MRM conditions. First, we check the interaction and find that it's not statistically significant. With a dummy variable (Boston is 1, Portland 0), the fitted model with the interaction gives a much better fit to the data.

R2  0.97145
se  0.007923
n   78

Term               Estimate    Std Error   t Ratio   Prob>|t|
Intercept          0.1212103   0.004774    25.39     <.0001
Detail Voice       0.1798054   0.067579    2.66      0.0096
City Dummy         0.0899672   0.007311    12.31     <.0001
Dummy * Detailing  -0.048412   0.089441    -0.54     0.5899

[Figure: residuals versus predicted Market Share for the model with the interaction.]

The residuals also do not show substantial tracking over time, with the DW statistic for both locations being reasonably close to 2. If we plot the residuals from one location on those from the other, there's no association here either. (A small sketch of the DW check appears after the plot below.)
[Figure: Boston residuals plotted against Portland residuals; no association is visible.]
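The Durbin-Watson check mentioned above can be computed directly from residuals in time order; a minimal sketch, using simulated white noise in place of the actual residuals:

```python
# Hedged sketch of the time-series check mentioned above: the Durbin-Watson
# statistic is near 2 when residuals show little tracking over time.
# The residuals here are simulated white noise purely for illustration.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
resid = rng.normal(size=39)   # stand-in for one location's residuals in time order
print(f"Durbin-Watson = {durbin_watson(resid):.2f}")   # should come out near 2
```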

Because the interaction term is not statistically significant, we'll omit it and continue (the model continues to meet the usual conditions). Here's the summary of the model without the interaction, forcing parallel slopes in the two locations.

Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     0.1230925   0.003255    37.81     <.0001
Detail Voice  0.1521676   0.04406     3.45      0.0009
City Dummy    0.0861738   0.002073    41.57     <.0001

(c) The effect for detailing has fallen from 1.08 down to 0.15, with a range of
0.152167 - 2 * 0.04406 to 0.152167 + 2 * 0.04406 = 0.064047 to 0.240287,
which rounds to 0.06 to 0.24. Rather than a 1% gain in market share with each 1% increase in detailing voice, the model estimates a far smaller return on this promotion. By ignoring the effects of the two groups, the analyst inflated the effect of promotion.

46. iTunes
(a) The scatterplot makes it clear that you need to distinguish the formats. There's a clear interaction, with the AAC files (red dots) occupying much less space than the AIFF files (green +) for a given time duration. Makes you wonder why anyone would prefer the AIFF format unless it sounds a lot better.
[Figure: scatterplot of Megabytes (MB) versus Time (seconds) for the two formats; the AIFF files take far more space for a given duration.]

(b) The estimated model with a dummy variable (1 for AIFF and 0 for AAC) is darn near perfect, with an R2 that's about off the charts. The only error appears to be when a song does not quite fill the allocated space. These t-statistics are about as large as they come unless you have a data set with millions of cases.

R2  0.999996
se  0.043304
n   596

Term            Estimate    Std Error   t Ratio   Prob>|t|
Intercept       0.0110338   0.007804    1.41      0.1579
Time (seconds)  0.0154106   0.000026    603.76    0.0000
Format Dummy    0.0715807   0.009385    7.63      <.0001
Dummy * Time    0.152824    0.000032    4761.9    0.0000

The residuals for the AIFF format spread to the right in this plot because they have bigger fitted sizes. The other files, encoded using AAC, are by comparison uniformly smaller. There's no problem in this plot. With such a large difference in the size of the files, it's perhaps not surprising that the AIFF residuals have more variation (not that it's big, mind you!). With such a good fit, these differences are only an issue if we need to predict one group or the other very accurately. We do better for the AAC files.


[Figures: residuals versus predicted Megabytes (MB) and comparison boxplots of the residuals for the AAC and AIFF formats.]

Because of these differences in variation, the standard errors of the slopes are probably not precise. Just the same, whatever effect this violation of the similar variances condition has on the SEs, it's not enough to change those t-statistics so much that the estimates of the slopes for the compression rates would stop being statistically significant.
(c) The estimates show that songs recorded using the AAC format take about 0.01541 megabytes per second of recording time. Those recorded using the AIFF format require about 0.1528 MB of additional space per second (more than 10 times the space used by AAC). Moreover, the fixed space needed by AAC (regardless of the length of the song) is about 0.011 MB, whereas AIFF requires an additional 0.072 MB to get started.
(d) Because the errors do not meet the similar variances condition and the fits are so good that we don't need to borrow strength, let's just fit two separate regression lines, one for each format. We already know the fit for AAC, but now we also get the appropriate se for the errors. We also discover, now that we can get more details, that the data have a slight kink and are skewed, definitely not normal. No prediction interval for these!
AAC: Megabytes (MB) = 0.0110338 + 0.0154106 Time (seconds), se = 0.02115
[Figures: AAC residuals versus Time (seconds) and their normal quantile plot; the residuals are skewed, with a slight kink.]

For the songs stored in AIFF format, the equation is
AIFF: Megabytes (MB) = 0.0826145 + 0.1682346 Time (seconds), se = 0.0484
The SD of these residuals is about twice that for the songs coded using the AAC procedure. These residuals seem more typical and symmetric about zero, but the distribution does not tail off as one would expect for a normal distribution. They appear uniformly distributed.
[Figures: AIFF residuals versus Time (seconds) and their normal quantile plot; the residuals look roughly uniform rather than normal.]

Where does this leave us? We can come within 0.10 MB, guaranteed, for an AIFF format song; none of the residuals is larger than that. So we'd say the song would take 0.0826145 + 0.1682346 * 240 = 40.4589185 ± 0.1 MB for the AIFF format, with about 100% coverage. For the AAC format, we get a much smaller estimate, but it's not so easy to set the range. Perhaps we might be able to use a range like this: 0.0110338 + 0.0154106 * 240 = 3.7095778 MB, which might overestimate by 0.06 or underestimate by 0.04, about half the size of the interval for the other format.
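A quick check of the arithmetic in this closing paragraph, using the two fitted equations quoted above (a back-of-the-envelope prediction, not a formal prediction interval):

```python
# Predicted file sizes for a 240-second song from the two separate fits above.
seconds = 240
aiff_mb = 0.0826145 + 0.1682346 * seconds   # about 40.46 MB, good to roughly +/- 0.1 MB
aac_mb = 0.0110338 + 0.0154106 * seconds    # about 3.71 MB, with a rougher range
print(f"AIFF: {aiff_mb:.2f} MB, AAC: {aac_mb:.2f} MB")
```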
