
Analysis of Continuous Data

Homework 3

This homework consists of a SAS exercise, some small questions and a theoretical exercise. For the SAS exercise, it is again important that you write your results in a report format. The small questions only require brief answers. Note that this homework should be completed individually, and the marks that you receive for it contribute to your final score for Analysis of Continuous Data.

A SAS Exercise

The dataset caschoolsimp.xlsx contains a random sample of California elementary school districts. The data consist of test scores (Y: testscr) and class sizes (X: stratio). The test score is a districtwide average of reading and math scores on the Stanford achievement test, a test used by school districts in the USA. The student-teacher ratio, i.e. the total number of students in the district divided by the number of teachers, is used as a measure of the (overall) class size in the district. Policy makers are interested in whether reducing class size, for instance by hiring more teachers, improves students' education. Skeptics worry that reducing class size will increase costs without producing substantial benefits (more information on the research conducted on this subject can be found at http://www.ed.gov/pubs/ReducingClass/index.html). The aim of this homework is to study the association between the two variables by fitting a simple linear regression model $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$. Answer the following questions:

1. Give the parameter estimates, their standard errors and their 95% confidence intervals.
2. Give a clear and useful interpretation of the estimated regression coefficient.
3. Calculate a 95% confidence interval of the regression coefficient and interpret the interval.
4. Perform a two-sided statistical hypothesis test to test the hypothesis that the regression coefficient is zero (use $\alpha = 0.05$).
5. Assess the assumptions underlying the linear regression model (scope of the model, study of outliers and residuals, linearity of the curve, constancy of the variance, lack of fit, ...), and give a detailed discussion of the model quality.
6. Write an executive summary containing your main conclusions from your statistical analysis (max. 1/2 page).
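The quantities asked for in questions 1, 3 and 4 can be sketched as follows. The exercise itself is to be carried out in SAS; this is only a language-neutral illustration in Python of what the estimates, standard errors, confidence interval and test statistic are. The function name `simple_ols` is hypothetical, the demo data are synthetic values made up for illustration (they are not taken from caschoolsimp.xlsx), and the critical value uses a normal quantile as a large-sample stand-in for the t quantile with n - 2 degrees of freedom.

```python
import math
import random
from statistics import NormalDist


def simple_ols(x, y, alpha=0.05):
    """Least-squares fit of y = b0 + b1 * x with SEs, a CI and a test for b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / sxx               # slope estimate
    b0 = y_bar - b1 * x_bar      # intercept estimate

    # Residual mean square with n - 2 degrees of freedom
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - 2)

    se_b1 = math.sqrt(mse / sxx)
    se_b0 = math.sqrt(mse * (1 / n + x_bar ** 2 / sxx))

    # Assumption: normal quantile used instead of the t quantile (large n only)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    test_stat = b1 / se_b1       # tests H0: slope = 0
    p_value = 2 * (1 - NormalDist().cdf(abs(test_stat)))

    return {
        "b0": b0, "b1": b1, "se_b0": se_b0, "se_b1": se_b1,
        "ci_b1": (b1 - z * se_b1, b1 + z * se_b1),
        "test_stat": test_stat, "p_value": p_value,
    }


# Synthetic demo data: all numbers below are made up for illustration
rng = random.Random(0)
x_demo = [rng.uniform(14, 26) for _ in range(400)]
y_demo = [700 - 2 * xi + rng.gauss(0, 15) for xi in x_demo]
print(simple_ols(x_demo, y_demo))
```

For small samples the normal quantile is too narrow, so the t-based interval reported by the statistical software should be preferred over this approximation.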

Some Small Questions

Answer the following questions briefly. Although the questions are closely related to the simulation exercise, you should not include the answers in your report. Write them on a separate sheet of paper.

1. Give the advantages and disadvantages of a one-sided and a two-sided hypothesis test. When is it appropriate to do a one-sided test?
2. What are the minimal conditions under which the regression parameter estimators are unbiased?
3. In the data analysis in the previous section, would your results still be valid if the error in the regression model was not normal? Explain.
4. What is the interpretation of a conditional expectation, $E(Y \mid X = x)$? (You may assume that X and Y are not independent.)
5. Can your findings be used to address the question of the policy makers? Give a short discussion of the potential issues.
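Questions 2 and 3 above can be explored empirically with a small simulation. The sketch below is one possible setup, not a prescribed method: it repeatedly generates data from a straight line with mean-zero but clearly non-normal (right-skewed) errors and averages the least-squares slope over the replications. The function names, the true slope of 2.0, the sample size and the error distribution are all assumptions chosen for illustration. It addresses only the unbiasedness of the slope estimator, not the validity of confidence intervals or tests.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def mean_slope_under_skewed_errors(true_slope=2.0, n=30, reps=2000, seed=1):
    """Average OLS slope over many datasets with non-normal errors."""
    rng = random.Random(seed)
    x = list(range(n))  # fixed design points 0, 1, ..., n - 1
    total = 0.0
    for _ in range(reps):
        # expovariate(1.0) has mean 1, so subtracting 1 gives
        # mean-zero, right-skewed errors
        y = [1.0 + true_slope * xi + (rng.expovariate(1.0) - 1.0) for xi in x]
        total += ols_slope(x, y)
    return total / reps


print(mean_slope_under_skewed_errors())
```

If the average stays close to the true slope, that is consistent with the estimator being unbiased whenever the errors have conditional mean zero, whatever their shape.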

A Theoretical Exercise

In a clinical study, physicians are interested in the effect of a new treatment on blood pressure. They observe patients who use the new treatment and patients who do not. In addition, a certain gene is known to have an extreme influence on blood pressure. In this exercise, we aim to find out what happens when the treatment effect is estimated with and without accounting for the gene effect. Suppose that the blood pressure is modeled correctly by the following underlying model:

$Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i} x_{2i} + \varepsilon_i$ (1)

where $\varepsilon_i \sim N(0, \sigma^2)$ ($i = 1, \ldots, n$). $Y_i$ is the blood pressure for individual $i$, $x_{1i}$ indicates whether person $i$ uses the new treatment ($x_{1i} = 1$) or not ($x_{1i} = 0$), and $x_{2i} = 1$ ($x_{2i} = 0$) indicates the presence (absence) of the gene. Assume that $P(X_{2i} = 1 \mid X_{1i} = 0) = q_0$ and $P(X_{2i} = 1 \mid X_{1i} = 1) = q_1$. Assume that the outcomes are obtained from a randomized study.

1. Model (1) represents the true data-generating model, but this is of course unknown to the statistician. Suppose we analyze the data with the model

$Y_i = \gamma_0 + \gamma_1 x_{1i} + \gamma_2 x_{2i} + \varepsilon_i$, (2)

where $\varepsilon_i \sim N(0, \sigma^2)$ ($i = 1, \ldots, n$). Write the parameters $\gamma_1$ and $\gamma_2$ as a function of the parameters of Model (1), and interpret these parameters.

2. Suppose now that we do not know that there is a confounding effect of the gene. In this case, we want to analyse the data with the simple model

$Y_i = \delta_0 + \delta_1 x_{1i} + \varepsilon_i$, (3)

where $\varepsilon_i \sim N(0, \sigma^2)$ ($i = 1, \ldots, n$). Write the parameter $\delta_1$ as a function of the parameters of Model (1), and interpret this parameter.

3. What happens if $\beta_3 = 0$ in Model (1)? Give a discussion for both Models (2) and (3).
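A numerical check can help when working on part 2. With a single binary regressor, the slope of Model (3) targets the difference between the mean outcome of treated and untreated patients, so it can be computed exactly by averaging Model (1) over the gene indicator. The sketch below does this under that reading. The function names and the coefficient values in the demo are hypothetical and chosen only for illustration; `b0` to `b3` stand for the four coefficients of Model (1), and `q0`, `q1` for the gene probabilities defined in the text.

```python
def mean_outcome(x1, b0, b1, b2, b3, q0, q1):
    """E[Y | X1 = x1] under Model (1), averaging over the gene indicator X2."""
    q = q1 if x1 == 1 else q0                      # P(X2 = 1 | X1 = x1)
    mean_without_gene = b0 + b1 * x1               # x2 = 0
    mean_with_gene = b0 + b1 * x1 + b2 + b3 * x1   # x2 = 1
    return (1 - q) * mean_without_gene + q * mean_with_gene


def naive_treatment_slope(b0, b1, b2, b3, q0, q1):
    """Population slope of Model (3): treated mean minus untreated mean."""
    treated = mean_outcome(1, b0, b1, b2, b3, q0, q1)
    untreated = mean_outcome(0, b0, b1, b2, b3, q0, q1)
    return treated - untreated


# Hypothetical coefficient values, for illustration only
params = dict(b0=120.0, b1=-5.0, b2=20.0, b3=0.0)
print(naive_treatment_slope(q0=0.3, q1=0.3, **params))  # gene balanced across arms
print(naive_treatment_slope(q0=0.2, q1=0.5, **params))  # gene more common among treated
```

Varying `q0`, `q1` and `b3` in this function is a quick way to compare a hand derivation against exact values before writing up the answer.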
