
Data to Decisions

Final examination

You have two hours. Good luck.

Notes:

1. Download D2Dfinal2016.xlsx (the exam file hereafter) from my webpage.

2. Assume the CLT applies throughout. Sample means, in particular, are roughly
normally distributed with the SE implied by the CLT.

3. Samples are large enough in all questions that t distributions can be well
approximated by the standard normal distribution.

Name:

Problem 1 (10 points)

According to the Centers for Disease Control, the percentage of adults 20 years of age
and over in the United States who are overweight is 70%. One city's council wants
to know if the proportion of overweight citizens in their city is different from this
known national proportion. They take a random sample of 150 adults 20 years of
age or older in their city and find that 98 are classified as overweight. Given these
data, calculate and report the two-sided p-value associated with the hypothesis that
the fraction of overweight people among adults 20 years of age and over is 70% in this city.
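
One way to set up this calculation is sketched below. It assumes, in line with Note 2,
that the sample proportion is roughly normal and that the SE is computed under the
hypothesized proportion of 70%. The variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

# Figures taken from the problem statement.
n, overweight, p0 = 150, 98, 0.70

p_hat = overweight / n                 # sample proportion
se = sqrt(p0 * (1 - p0) / n)           # SE under the hypothesized proportion (assumption)
z = (p_hat - p0) / se                  # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value

print(round(z, 3), round(p_value, 3))
```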

Problem 2 (15 points)

Tab Bikes 1 in the exam file contains the results of recent attempts to sell a bike to
10,000 customers who differ in marital status (1 for married, 0 otherwise), gender (1 for
male), yearly income and age in years.

1. Estimate a linear probability model of the bike buying decision with all of the four
customer characteristics as explanatory variables. What is the t-statistic associated
with the coefficient on income in this regression?

2. According to this model, what is the probability that a single female of age 40 and
whose yearly income is 40,000 will buy a bike if approached?

3. According to the model, how many customers of that type (single female of age
40 and whose yearly income is 40,000) must the seller approach to have better than
a 50% chance of selling at least one bike?
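
The logic of the third question can be sketched as follows, assuming purchases are
independent across customers and each customer of that type buys with the same
probability p. The function name and the value p = 0.1 are hypothetical placeholders;
the actual p comes from the regression in the first two questions.

```python
def min_contacts(p, target=0.5):
    """Smallest n such that P(at least one sale) = 1 - (1 - p)**n exceeds target."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    n = 1
    while 1 - (1 - p) ** n <= target:   # strict 'better than' requirement
        n += 1
    return n

# Placeholder probability, NOT the exam answer.
print(min_contacts(0.1))
```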

Problem 3 (10 points)

The one-sided hypothesis that mean income among a target population of customers
is lower than $50,000 was tested using a simple mean test based on a sample whose
mean income is $50,500 and standard deviation is $5,712. The test produced a
p-value of roughly 4%. Approximately how large is the sample?
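
A possible way to back out the sample size is sketched below. It assumes the test
statistic is z = (sample mean - 50,000) / (s / sqrt(n)) and that the p-value is the
upper-tail probability of the standard normal at z, consistent with Notes 2 and 3.

```python
from statistics import NormalDist

# Figures taken from the problem statement.
x_bar, mu0, s, p_value = 50_500, 50_000, 5_712, 0.04

z = NormalDist().inv_cdf(1 - p_value)    # z-score with upper-tail area equal to the p-value
n = (z * s / (x_bar - mu0)) ** 2         # solve z = (x_bar - mu0) / (s / sqrt(n)) for n

print(round(z, 2), round(n))
```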

Problem 4 (15 points)

Yet another company is trying to sell bikes but only has income and age information
about its customers. A probit regression of purchase on age (in years) and income
(in thousands of dollars) on historical data has given the following coefficients:

Variable    Coefficient
Constant    -0.3843
Age         -0.0080
Income       0.0013

It wants to test the lift it will get from this model by applying it to the data shown
in Tab Bikes 2 in the exam file. If it were to rank the customers in this data set
according to the likelihood of purchase given the model above and approach the top
5% of customers, what lift rate would it get compared to a naive model? (Recall that
the lift rate is the number of positive discoveries when using the model divided by the
expected number of positive discoveries under the naive model.)
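
The lift definition above can be sketched in code as follows. The coefficients come
from the table in this problem; the customer records, the function names, and the
rounding rule for the size of the top group are assumptions made for illustration
and are not taken from the exam file.

```python
from statistics import NormalDist

# Coefficients from the probit table in this problem.
CONST, B_AGE, B_INCOME = -0.3843, -0.0080, 0.0013

def purchase_prob(age, income_thousands):
    """Probit probability: Phi(const + b_age * age + b_income * income)."""
    return NormalDist().cdf(CONST + B_AGE * age + B_INCOME * income_thousands)

def lift(customers, top_share=0.05):
    """customers: list of (age, income_thousands, bought) with bought in {0, 1}."""
    ranked = sorted(customers, key=lambda c: purchase_prob(c[0], c[1]), reverse=True)
    k = max(1, round(top_share * len(ranked)))        # size of the approached group
    hits = sum(c[2] for c in ranked[:k])              # buyers found using the model
    base_rate = sum(c[2] for c in customers) / len(customers)
    return hits / (k * base_rate)                     # naive model expects k * base_rate

# Hypothetical records, NOT the exam data.
sample = [(25, 100, 1), (30, 80, 1), (60, 20, 0), (70, 10, 0)]
print(lift(sample, top_share=0.5))
```

Because the normal CDF is increasing, ranking by predicted probability is the same as
ranking by the linear index itself.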

Problem 5 (8 points)

Tab Satisfaction in the exam file contains satisfaction scores from a representative
sample of students by gender. Can you reject the hypothesis that mean satisfaction
is the same for both genders with 95% confidence?

Problem 6 (15 points)

Tab Infant Mortality in the exam file contains GDP per capita (in US$) and infant
mortality data (infant deaths per 1,000 live births) for all countries for which data
were available in 1998. You want to forecast infant mortality. You are choosing
between two models: 1) a regression of ln(Infant Mortality) on ln(GDP per capita),
and 2) a naive model that predicts that every country's infant mortality rate is the
average rate in the training sample. Use the first 20 observations of the data set
(Cuba to Iran) as your testing sample while the remaining part of the sample is your
training sample. Estimate each of the two models on the training sample. Compute
and report the RMSE for infant mortality on the testing data for both models.
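
The comparison could be organized along the lines of the sketch below. The numbers
are hypothetical and are not the exam data. The sketch also assumes that log
predictions are converted back to levels with a plain exponential, with no
retransformation correction, so that both RMSEs are measured in deaths per 1,000
live births.

```python
from math import exp, log, sqrt

def fit_loglog(gdp, mortality):
    """OLS of ln(mortality) on ln(gdp); returns (intercept, slope)."""
    x = [log(g) for g in gdp]
    y = [log(m) for m in mortality]
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical training and testing data, NOT the exam data.
train_gdp, train_im = [100, 400, 1600, 6400], [100, 50, 25, 12.5]
test_gdp, test_im = [900, 2500], [1000 / 30, 20]

a, b = fit_loglog(train_gdp, train_im)
model_pred = [exp(a + b * log(g)) for g in test_gdp]          # back to levels
naive_pred = [sum(train_im) / len(train_im)] * len(test_im)   # training-sample average

model_rmse = rmse(test_im, model_pred)
naive_rmse = rmse(test_im, naive_pred)
print(model_rmse, naive_rmse)
```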

Problem 7 (12 points)

Tab Real GDP in the exam file has post-1985 real GDP on a quarterly basis in
the United States. Use an AR(2) model of real GDP growth (a regression of current
growth rate on last quarter's growth rate and the growth rate from two quarters ago)
to forecast the level of GDP in the second quarter of 2017. As part of your answer,
report the coefficients of your AR(2) regression.
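
Once the AR(2) coefficients are estimated, the forecasting step can be sketched as
below. The function name, the coefficient values, and the starting figures are all
hypothetical placeholders. Growth is assumed to be the simple quarterly rate
(level / previous level - 1), although log differences would be a reasonable
alternative. The number of steps depends on the last quarter available in the tab.

```python
def forecast_level(last_level, g_prev, g_prev2, const, b1, b2, steps):
    """Iterate the AR(2) on growth and compound the level forward `steps` quarters."""
    level = last_level
    for _ in range(steps):
        g_next = const + b1 * g_prev + b2 * g_prev2   # AR(2) forecast of growth
        level *= 1 + g_next                           # turn growth into a level
        g_prev, g_prev2 = g_next, g_prev              # shift the lags forward
    return level

# Placeholder inputs, NOT estimates from the exam data.
print(forecast_level(100.0, 0.01, 0.01, const=0.0, b1=0.5, b2=0.5, steps=2))
```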

Problem 8 (15 points)

Tab Housing in the exam file has data for home sale prices in a given US city
together with some characteristics of those houses. Using the first 100 observations
as your testing sample while the remainder of your data is a training sample, compare
the forecasting performance of two models for predicting whether a house has air
conditioning (i.e. airco=yes) or not. Model 1 is a naive model that assumes that the
likelihood of having air conditioning is the same for all houses. Model 2 is a probit
regression of airco on lot size. As part of your answer, report the coefficients of the
probit. Also report the log likelihood for the two models measured on the testing
sample.
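
The log likelihood on the testing sample can be computed as in the sketch below,
assuming airco is coded 1 for yes and 0 for no. The probit coefficients, the naive
probability, and the data are hypothetical placeholders. Estimating the probit
itself is left to the regression tool of your choice.

```python
from math import log
from statistics import NormalDist

def log_likelihood(outcomes, probs):
    """Sum of ln(p) for outcomes equal to 1 and ln(1 - p) for outcomes equal to 0."""
    return sum(log(p) if y == 1 else log(1 - p) for y, p in zip(outcomes, probs))

def probit_probs(lot_sizes, const, slope):
    """Predicted probabilities Phi(const + slope * lot size)."""
    return [NormalDist().cdf(const + slope * x) for x in lot_sizes]

# Hypothetical testing data and parameters, NOT the exam data.
test_airco = [1, 0, 1, 0]
test_lot = [9000, 3000, 7000, 4000]
naive_p = 0.3                                   # share with airco in the training sample
ll_naive = log_likelihood(test_airco, [naive_p] * len(test_airco))
ll_probit = log_likelihood(test_airco, probit_probs(test_lot, -1.5, 0.0002))
print(ll_naive, ll_probit)
```

The model with the higher (less negative) log likelihood on the testing sample
assigns more probability to what actually happened.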
