UNIVERSITY OF CAPE TOWN
DEPARTMENT OF STATISTICAL SCIENCES
STA2020F: BUSINESS STATISTICS
Test 2 - memo

Internal examiners: Hannah Gerber
Date: 24 April 2011
Time: 1 hour 30 minutes
Total number of questions: 1
Total number of pages: 13 (6 + 1 + 6)
Total marks: 50

Instructions: Answer all questions in the answer book(s) provided. The appropriate tables and formulae have been provided.

1. [...] and for the stated values [...] mg CO. (2)

2. H0: all regression coefficients are 0 vs. H1: at least one regression coefficient is not 0. We have a p-value which is less than any reasonable level of significance. We therefore conclude that we have sufficient evidence to say that at least one of the explanatory variables is linearly related to the response variable, CO content. The model is valid. (4)
   Note: the F-stat value has been removed; it is 78.9838.

3. H0: the nicotine coefficient is 0 vs. H1: the nicotine coefficient is not 0. We have a p-value which is much greater than any reasonable level of significance. We therefore fail to reject H0 and have insufficient evidence to say that the nicotine variable is significant in the regression model. (4)

4. The nicotine coefficient is -2.6317, which means that on average, for every additional mg of nicotine, there is 2.6317 mg less CO content, assuming all other variables remain constant. It is worth noting, though, that the nicotine variable is very highly correlated with the tar variable (0.9766), which strictly speaking means we shouldn't interpret this variable, since multicollinearity is a concern. (3)

5. The CI about the expected value: We expect the true mean CO content associated with [...], [...] and [...] to occur in the interval [...], with 95% confidence. (3)
   The PI: For a single observation at [...], [...] and [...], we expect the true CO content to have a value between -0.3348 and 19.0533, with 95% confidence. (2)
   Notes: do not penalize if [...] was used in the interpretation (due to a typo), and be lenient on the level of confidence since it wasn't stated in the question.

6. It is 0.9259, and this means that there is a very strong linear relationship between the nicotine and CO content. (2)
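As a rough check on the prediction interval reported above, the sketch below recovers the point prediction and the standard error implied by the two limits. It assumes the interval is the usual symmetric t-interval at 95% confidence with 21 residual degrees of freedom (the value given for c) among the missing values) and uses the tabulated critical value t(0.025; 21) = 2.0796. The variable names are illustrative and do not come from the memo.

```python
# Prediction-interval limits reported in the memo (mg CO).
pi_lower, pi_upper = -0.3348, 19.0533

# Assumption: symmetric 95% t-interval with 21 residual degrees of freedom.
T_CRIT_21 = 2.0796  # tabulated t(0.025; 21)

# A symmetric interval is centred on the point prediction.
point_prediction = (pi_lower + pi_upper) / 2
half_width = (pi_upper - pi_lower) / 2

# half_width = t_crit * se_prediction, so the standard error follows directly.
se_prediction = half_width / T_CRIT_21

print(f"point prediction: {point_prediction:.4f} mg CO")
print(f"half-width: {half_width:.4f}")
print(f"implied standard error of prediction: {se_prediction:.4f}")
```

The same decomposition applies to the confidence interval about the expected value, which shares the centre but has a smaller standard error because it excludes the variance of a single observation.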

7. Missing values (actual values, allow for rounding):
   a) 0.9070
   b) 3
   c) 21
   d) 2.0901 (no mark allocated since provided on memo)
   e) 78.9838
   f) 3.4618
   g) 3.9736
   h) -10.7433
   i) 5.4800
   j) 0.9735
   (8)

8. Consider the all subsets regression:
   a) Any one of the following combinations with the following reason:
      [...], [...] and [...]: model has the highest [...]
      OR [...] and [...]: simplest model with a relatively high [...]
      OR [...] and [...]: simplest model with a relatively high [...]
      OR [...]: simplest model with a relatively high [...]
      (2)
   b) [...]: simplest model with the highest [...]. (2)
   c) The R² measure increases automatically as the number of variables increases. The adjusted R² accounts for the number of variables in the regression model, giving us a better indication of whether or not a new variable truly contributes to explaining the variability in the response variable. (2)

9. The [...] variable, since it has the highest correlation to the CO content (0.9575). (2)

10. The [...] variable, since it has the highest p-value (0.9735). (2)

11. Note: Need to clearly indicate that the assumptions are associated with the random ERROR (not the residuals). If this is not clear, mark the assumption portion of the question out of 3; full marks can still be obtained for the assessments.
   a) Errors are normally distributed (one of q-q plot, histogram, chi-square test, Lilliefors test, K-S test, Shapiro-Wilk test)
   b) Errors have an expected value (or mean) of 0 (one of residual plot or t-test about the mean)
   c) Errors have a constant but unknown variance (residual plot)
   d) Errors are independent of each other (Durbin-Watson test)
   (8)

12. Some concern about heteroscedasticity, as the variability seems to become larger as the predicted responses increase. (2)

13. Errors seem to be relatively normally distributed, as there is a straight line through the origin. (2)
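The missing values in question 7 can be cross-checked against one another. The sketch below is an illustration only. It assumes that b) and c) are the regression and residual degrees of freedom, that e) is the F statistic, and that a) is the adjusted R²; the memo does not label which cell each letter fills. Under those assumptions, R² can be recovered from F and the degrees of freedom, and the adjusted R² computed from it should reproduce a).

```python
# Values from question 7 of the memo. Which table cell each letter fills is an
# assumption here: b) regression df, c) residual df, e) F statistic,
# a) adjusted R-squared.
df_regression = 3
df_residual = 21
f_stat = 78.9838
reported_adj_r2 = 0.9070

# Sample size implied by the degrees of freedom: n = k + (n - k - 1) + 1.
n = df_regression + df_residual + 1

# R^2 recovered by rearranging F = (R^2 / k) / ((1 - R^2) / (n - k - 1)).
r2 = (df_regression * f_stat) / (df_regression * f_stat + df_residual)

# Adjusted R^2 penalises the number of explanatory variables.
adj_r2 = 1 - (1 - r2) * (n - 1) / df_residual

print(f"n = {n}, R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")
print(f"matches reported a): {abs(adj_r2 - reported_adj_r2) < 1e-4}")
```

The gap between the two measures is the point made in 8 c): R² cannot fall when a variable is added, whereas the adjusted R² rises only if the new variable explains enough extra variability to offset the lost degree of freedom.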