
Multiple Regression Practice Problems

Stat 112

1. When, in 1982, average Scholastic Achievement Test (SAT) scores were first published on a state-by-state basis in the United States, the huge variation in the scores was a source of great pride for some states and of consternation for others. Average scores ranged from a low of 790 (out of a possible 1,600) in South Carolina to a high of 1,088 in Iowa. Two researchers set out to determine how certain variables are associated with state SAT differences.[1] The variable SAT is the average total SAT (verbal + quantitative) score in the state, and the two explanatory variables considered are the following:

Takers: percentage of the total eligible students (high school seniors) in the state who took the exam
Expend: total state expenditure on secondary schools, expressed in hundreds of dollars per student

Output from a multiple regression analysis is shown below.


Response SAT

Whole Model

[Figure: Actual by Predicted Plot. SAT Actual versus SAT Predicted; P<.0001, RSq=0.81, RMSE=31.937]

Summary of Fit
RSquare                       0.808786
RSquare Adj                   0.800472
Root Mean Square Error        31.93721
Mean of Response              948.449
Observations (or Sum Wgts)    49

Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio    Prob > F
Model        2         198456.79        99228.4    97.2841      <.0001
Error       46          46919.33         1020.0
C. Total    48         245376.12

Parameter Estimates
Term         Estimate     Std Error    t Ratio    Prob>|t|
Intercept    932.41448    22.16843       42.06      <.0001
EXPEND       4.2985226    1.025343        4.19      0.0001
TAKERS       -3.07411     0.2206        -13.94      <.0001

Effect Tests
Source    Nparm    DF    Sum of Squares     F Ratio    Prob > F
EXPEND        1     1          17926.44     17.5752      0.0001
TAKERS        1     1         198071.21    194.1902      <.0001

[Figure: Residual by Predicted Plot. SAT Residual versus SAT Predicted]

[1] B. Powell and L.C. Steelman, "Variations in State SAT Performance: Meaningful or Misleading?", Harvard Educational Review 54(4), 1984: 389-412.

For questions (a)-(e), assume the ideal multiple linear regression model holds.

(a) For Pennsylvania, SAT=885, TAKERS=50 and EXPEND=27.98. What would you predict Pennsylvania's average SAT score to be based on knowing its TAKERS and EXPEND, but not knowing its SAT? What is the residual for Pennsylvania?

(b) Is there strong evidence that the multiple regression model provides better predictions of SAT than just using the sample mean of SAT to predict SAT? Use a test at the .05 level to justify your answer.

(c) Find an approximate 95% confidence interval for the coefficient on TAKERS.

(d) Is there strong evidence that total state expenditure (EXPEND) helps to predict a state's average SAT score once TAKERS has been taken into account? Use a test at the .05 level to justify your answer.

(e) The two states with the largest Cook's distances are Alaska and South Carolina, with Cook's distances of 2.06 and 0.18 respectively and leverages of 0.44 and 0.09 respectively. For each state (Alaska, South Carolina), state whether it would be justified to delete the state from the analysis, reporting that we omitted it and that our conclusions hold only for a reduced range of the explanatory variables that excludes that state's values.

(f) Suppose we want to use either TAKERS or Log(TAKERS) in the multiple regression. Based on the information below, which of these two forms would you choose? Explain.
Bivariate Fit of SAT By TAKERS

[Figure: scatterplot of SAT versus TAKERS, with the Linear Fit and the Transformed Fit to Log overlaid]

Linear Fit
SAT = 1020.3062 - 2.7599621 TAKERS

Summary of Fit
RSquare                       0.735838
RSquare Adj                   0.730335
Root Mean Square Error        36.79525
Mean of Response              947.94
Observations (or Sum Wgts)    50

Analysis of Variance
Source      DF    Sum of Squares    Mean Square     F Ratio    Prob > F
Model        1         181024.09         181024    133.7066      <.0001
Error       48          64986.73           1354
C. Total    49         246010.82

Parameter Estimates
Term         Estimate     Std Error    t Ratio    Prob>|t|
Intercept    1020.3062    8.139082      125.36      <.0001
TAKERS       -2.759962    0.238686      -11.56      <.0001

[Figure: Residual Plot for Linear Fit. Residual versus TAKERS]

Transformed Fit to Log
SAT = 1112.2477 - 59.018822 Log(TAKERS)

Summary of Fit
RSquare                       0.810762
RSquare Adj                   0.80682
Root Mean Square Error        31.14298
Mean of Response              947.94
Observations (or Sum Wgts)    50

Analysis of Variance
Source      DF    Sum of Squares    Mean Square     F Ratio    Prob > F
Model        1         199456.33         199456    205.6494      <.0001
Error       48          46554.49            970
C. Total    49         246010.82

Parameter Estimates
Term           Estimate     Std Error    t Ratio    Prob>|t|
Intercept      1112.2477    12.27496       90.61      <.0001
Log(TAKERS)    -59.01882    4.11554       -14.34      <.0001

[Figure: Residual Plot for Transformed Fit to Log. Residual versus TAKERS]
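
For parts (a) and (c) of Problem 1, the arithmetic can be sketched in Python. This is only one way to set it up. The coefficients are copied from the Parameter Estimates table of the multiple regression. The names `predict_sat`, `pa_pred`, `pa_resid` and `ci_takers` are illustrative and not part of the original handout. The interval uses the "estimate plus or minus 2 standard errors" rule of thumb, which is an assumption about what "approximate" means here.

```python
# Coefficients copied from the Parameter Estimates table of the SAT regression.
INTERCEPT = 932.41448
B_EXPEND = 4.2985226
B_TAKERS = -3.07411
SE_TAKERS = 0.2206


def predict_sat(takers, expend):
    """Fitted mean SAT for a state with the given TAKERS and EXPEND."""
    return INTERCEPT + B_EXPEND * expend + B_TAKERS * takers


# Pennsylvania: SAT=885, TAKERS=50, EXPEND=27.98 (values given in part (a)).
pa_pred = predict_sat(takers=50, expend=27.98)
pa_resid = 885 - pa_pred  # residual = actual - predicted

# Approximate 95% interval: estimate +/- 2 standard errors. This rule of
# thumb stands in for the exact t quantile with 46 error degrees of freedom.
ci_takers = (B_TAKERS - 2 * SE_TAKERS, B_TAKERS + 2 * SE_TAKERS)

print(f"predicted SAT: {pa_pred:.2f}, residual: {pa_resid:.2f}")
print(f"approx. 95% CI for TAKERS: ({ci_takers[0]:.3f}, {ci_takers[1]:.3f})")
```

A negative residual means the state scored below what the model predicts from its TAKERS and EXPEND values.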

2. The number of car accidents on a particular stretch of highway seems to be related to the number of vehicles that travel over it and the speed at which they are traveling. A city alderman has decided to ask the county sheriff to provide him with statistics covering the last few years, with the intention of examining these data statistically so that he can (if possible) introduce new speed laws that will reduce traffic accidents. Using the number of accidents as the response variable, he obtains estimates of the number of cars passing along a stretch of road (expressed as a deviation from the mean number of cars passing along the stretch of road) and their average speeds (in miles per hour, expressed as a deviation from the mean average speed) for 60 randomly selected days.

(a) JMP output from simple linear regressions of (i) Accidents on Speed and (ii) Cars on Speed is shown below. Would you expect the estimated coefficient on Speed to increase, decrease or stay the same in a multiple linear regression of Accidents on Speed and Cars, as compared to the estimated coefficient on Speed in the simple linear regression of Accidents on Speed? Justify your answer using the omitted variable bias formula.

Response Accidents

Summary of Fit
RSquare                       0.021001
RSquare Adj                   0.004122
Root Mean Square Error        2.430355
Mean of Response              7.033333
Observations (or Sum Wgts)    60

Parameter Estimates
Term         Estimate      Std Error    t Ratio    Prob>|t|
Intercept    -8.018052     13.49733       -0.59      0.5548
Speed        0.2508495     0.224888        1.12      0.2693

Response Cars

Summary of Fit
RSquare                       0.003515
RSquare Adj                   -0.01367
Root Mean Square Error        1.222004
Mean of Response              9.935
Observations (or Sum Wgts)    60

Parameter Estimates
Term         Estimate      Std Error    t Ratio    Prob>|t|
Intercept    13.003931     6.786575        1.92      0.0603
Speed        -0.051147     0.113076       -0.45      0.6527
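
The omitted variable bias formula named in part (a) says that the simple-regression slope on Speed equals the multiple-regression slope on Speed plus the multiple-regression coefficient on Cars times the slope from regressing Cars on Speed. The sketch below rearranges that identity using the two slopes printed above. The multiple-regression coefficient on Cars is not given in this output, so the values passed to the hypothetical helper `multiple_slope_speed` are placeholders chosen only to show how its sign drives the direction of the change.

```python
# Slopes copied from the two simple regressions above.
SIMPLE_SLOPE_SPEED = 0.2508495       # Accidents on Speed
AUX_SLOPE_CARS_ON_SPEED = -0.051147  # Cars on Speed


def multiple_slope_speed(coef_cars):
    """Speed coefficient implied for the regression of Accidents on Speed
    and Cars, given a value for the Cars coefficient in that regression.

    Rearranged omitted variable bias identity:
        simple = multiple + coef_cars * aux  =>  multiple = simple - coef_cars * aux
    """
    return SIMPLE_SLOPE_SPEED - coef_cars * AUX_SLOPE_CARS_ON_SPEED


# HYPOTHETICAL Cars coefficients, used only to illustrate the direction.
for coef_cars in (-1.0, 0.0, 1.0):
    implied = multiple_slope_speed(coef_cars)
    print(f"coef on Cars = {coef_cars:+.1f} -> implied coef on Speed = {implied:.4f}")
```

Because the auxiliary slope is negative, a positive Cars coefficient pushes the implied Speed coefficient above the simple-regression value, and a negative one pushes it below.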

(b) JMP output from a multiple linear regression of Accidents on Cars, Speed and Cars*Speed is shown below. Is there strong evidence of an interaction between Cars and Speed? Justify your answer using a test at the .05 level.
Response Accidents

Summary of Fit
RSquare                       0.743622
RSquare Adj                   0.729887
Root Mean Square Error        1.265725
Mean of Response              7.033333
Observations (or Sum Wgts)    60

Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio    Prob > F
Model        3         260.21801        86.7393    54.1424      <.0001
Error       56          89.71533         1.6021
C. Total    59         349.93333

Parameter Estimates
Term          Estimate     Std Error    t Ratio    Prob>|t|
Intercept     7.1405117    0.163638       43.64      <.0001
Cars          0.4158119    0.136049        3.06      0.0034
Speed         0.0644162    0.118519        0.54      0.5889
Cars*Speed    1.0763228    0.087791       12.26      <.0001
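
Because the fitted model above contains a Cars*Speed term, the change in predicted accidents per 1 MPH change in Speed is not constant: it depends on the value of Cars. The sketch below computes that slope and the interaction t ratio from the printed estimates. It assumes Cars and Speed enter the product exactly as named in the table (as deviations from their means, per the problem statement). The Cars values passed in are hypothetical, chosen only to contrast a below-average and an above-average traffic day.

```python
# Estimates copied from the Parameter Estimates table of the interaction model.
B_SPEED = 0.0644162
B_INTERACTION = 1.0763228
SE_INTERACTION = 0.087791


def speed_effect(cars):
    """Change in predicted Accidents per 1 MPH increase in Speed, at a given Cars."""
    return B_SPEED + B_INTERACTION * cars


def change_in_accidents(cars, delta_speed=-5.0):
    """Predicted change in Accidents when Speed changes by delta_speed MPH."""
    return speed_effect(cars) * delta_speed


# t ratio for the interaction term: estimate divided by its standard error.
t_interaction = B_INTERACTION / SE_INTERACTION

# HYPOTHETICAL Cars values (deviations from the mean), for illustration only.
for cars in (-1.0, 1.0):
    print(f"Cars = {cars:+.1f}: change for a 5 MPH decrease = {change_in_accidents(cars):.2f}")
print(f"interaction t ratio = {t_interaction:.2f}")
```

With a positive interaction coefficient, the slope on Speed grows as Cars increases, so the same 5 MPH decrease changes predicted accidents by a different amount on light-traffic and heavy-traffic days.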

(c) The alderman proposes decreasing the speed limit by 5 MPH. The number of cars on the road is higher on average on weekdays than on weekends. Assuming that the average number of cars will not be changed by decreasing the speed limit and that there are no confounding variables, would you expect the decrease in the speed limit to have a larger impact on the number of accidents during the weekends or the weekdays?

3. Car designers have been experimenting with ways to improve gas mileage for many years. An important element in this research is the way in which a car's speed affects how quickly fuel is burned. Competitions whose objective is to drive the farthest on the smallest amount of gas have determined that low speeds and high speeds are inefficient. Designers would like to know which speed burns gas most efficiently. As an experiment, 50 identical cars are driven at different speeds and the gas mileage is measured.

(a) JMP output from a simple linear regression model of Mileage on Speed is shown below. Comment on the regression diagnostics: the residual plot, the histogram of the residuals and the boxplot of the Cook's distances. If you see any problems, suggest what you would do next in the analysis to try to address those problems.
Bivariate Fit of Mileage By Speed

[Figure: scatterplot of Mileage versus Speed, with the Linear Fit overlaid]

Linear Fit
Mileage = 23.266776 - 0.0012701 Speed

Summary of Fit
RSquare                       0.000028
RSquare Adj                   -0.02081
Root Mean Square Error        7.102586
Mean of Response              23.202
Observations (or Sum Wgts)    50

Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio    Prob > F
Model        1            0.0672         0.0672     0.0013      0.9710
Error       48         2421.4426        50.4467
C. Total    49         2421.5098

Parameter Estimates
Term         Estimate     Std Error    t Ratio    Prob>|t|
Intercept    23.266776    2.039431       11.41      <.0001
Speed        -0.00127     0.034802       -0.04      0.9710

[Figure: residual plot. Residual versus Speed]

[Figure: Distributions, Residual Mileage (histogram of the residuals)]

[Figure: Distributions, Cook's D Influence Mileage (boxplot of the Cook's distances)]

(b) JMP output for a quadratic regression of mileage on speed and speed squared is shown below. Is there strong evidence that the quadratic regression provides better predictions of mileage based on speed than the simple linear regression? Justify your answer using a test at the .05 level.

Response Mileage

Whole Model

[Figure: Actual by Predicted Plot. Mileage Actual versus Mileage Predicted; P<.0001, RSq=0.71, RMSE=3.8637]

Summary of Fit
RSquare                       0.710249
RSquare Adj                   0.697919
Root Mean Square Error        3.863732
Mean of Response              23.202
Observations (or Sum Wgts)    50

Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio    Prob > F
Model        2         1719.8740        859.937    57.6040      <.0001
Error       47          701.6358         14.928
C. Total    49         2421.5098

Parameter Estimates
Term             Estimate     Std Error    t Ratio    Prob>|t|
Intercept        9.3413673    1.70707         5.47      <.0001
Speed            0.8021188    0.077207       10.39      <.0001
Speed Squared    -0.007876    0.000734      -10.73      <.0001

[Figure: Residual by Predicted Plot. Mileage Residual versus Mileage Predicted]

[Figure: Speed Leverage Plot. Mileage Leverage Residuals versus Speed Leverage, P<.0001]

[Figure: Speed Squared Leverage Plot. Mileage Leverage Residuals versus Speed Squared Leverage, P<.0001]

(c) Suppose you are low on gas. At which speed does the quadratic regression model suggest it is best to drive: 20 MPH, 50 MPH or 70 MPH? Justify your answer.
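
One way to approach part (c) is to evaluate the fitted quadratic at each candidate speed and compare the predicted mileages. The sketch below does this with the coefficients from the quadratic regression's Parameter Estimates table. The helper `predicted_mileage` and the variable names are illustrative. The peak of the fitted parabola, `-B1 / (2 * B2)`, is included as a cross-check and is not something the handout asks for.

```python
# Coefficients copied from the quadratic regression's Parameter Estimates table.
B0 = 9.3413673   # Intercept
B1 = 0.8021188   # Speed
B2 = -0.007876   # Speed Squared


def predicted_mileage(speed):
    """Fitted mean mileage at a given speed (MPH) under the quadratic model."""
    return B0 + B1 * speed + B2 * speed ** 2


candidates = [20, 50, 70]
for speed in candidates:
    print(f"{speed} MPH -> predicted mileage {predicted_mileage(speed):.2f}")

# Candidate with the highest predicted mileage.
best = max(candidates, key=predicted_mileage)

# Because B2 < 0 the parabola opens downward, so its vertex is a maximum.
peak_speed = -B1 / (2 * B2)
print(f"best candidate: {best} MPH; fitted peak near {peak_speed:.1f} MPH")
```

These are point predictions only. A fuller answer would also consider the uncertainty in the fitted curve.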