People who have no education tend to smoke more often than would be expected if smoking and
education were independent of each other.
In contrast, high proportional deviations come from people who have long educations.
People with long educations tend to smoke less often than expected if smoking and
educational attainment were independent.
This could be explained by more educated people being more self-disciplined, so that
people who want to quit smoking are more likely to succeed if they have a longer
education.
In conclusion, we applied a chi-squared contingency table test to examine the dependence between
education and smoking. All of the assumptions were met except for the constant probability
assumption, which limits the applicability of the results. The χ² test did show
that there is dependence between education and smoking. The contingency tables show that
non-educated people smoke more often than expected and long-educated people smoke
less often than expected if smoking and education were independent of each other.
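The comparison of observed and expected counts described above can be sketched in a few lines of Python. This is only an illustration: the counts below are hypothetical and are not the data used in this report, and the helper names `expected_counts` and `chi_squared` are made up for the example.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared(observed):
    """Return the chi-squared statistic, its degrees of freedom and the expected counts."""
    expected = expected_counts(observed)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof, expected


# HYPOTHETICAL counts (not from the report).
# Rows: education level; columns: (smoker, non-smoker).
observed = [
    [30, 20],  # no education
    [40, 60],  # medium education
    [10, 40],  # long education
]

stat, dof, expected = chi_squared(observed)
print(f"chi-squared = {stat:.2f} with {dof} degrees of freedom")
for obs_row, exp_row in zip(observed, expected):
    # A positive deviation means more smokers than independence would predict.
    print(f"smokers observed {obs_row[0]}, expected {exp_row[0]:.1f}")
```

In this made-up table the group without education has more smokers than expected and the long-education group has fewer, which is the same pattern of deviations that the conclusion above describes.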
Question 10.5
Before estimating the regression model above, we have to check several assumptions in order to
evaluate the reliability and validity of the model's results. These assumptions are:
The zero conditional mean assumption does not seem to be violated, as the residuals are
distributed around 0.
5) Homoscedasticity: V[ε|X] = σ²
The homoscedasticity assumption is not violated either, because the residuals form a tunnel-shaped
band of roughly constant width across the different predicted values.
6) Normally distributed residuals: ε|X ~ N(0, σ²)
The residuals have more density close to the mean, so the normality assumption is not violated.
In conclusion, the assumptions are not violated. Linearity between income and age of schooling is
questionable, but the departure does not seem very extreme. The no-multicollinearity assumption is not
violated except for the interactions. As a result we continue with the estimation of the model.

Estimation of the model:

$$\widehat{\text{C\_LogIncome}} = 3.3696515 + 0.0239288 \cdot \text{B\_YearsOfSchooling} + 0.0643261 \cdot \text{D\_Age} - 0.273718 \cdot \text{E\_Female} - 0.008899 \cdot \text{H\_Smoker} - 0.008106 \cdot \text{D\_Age} \cdot \text{E\_Female} - 0.038363 \cdot \text{D\_Age} \cdot \text{H\_Smoker}$$
Based on the JMP output, the model is significant and explains 39.9% of the variation in Y. However,
neither the interaction of age and female nor the smoker main effect is significant, so it is possible
to reduce the model further. We therefore drop the second-order term, the interaction of age and female.
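Estimation of the new model:

$$\widehat{\text{C\_LogIncome}} = 3.3673338 + 0.0244622 \cdot \text{B\_YearsOfSchooling} + 0.0641827 \cdot \text{D\_Age} - 0.273806 \cdot \text{E\_Female} - 0.010066 \cdot \text{H\_Smoker} - 0.037696 \cdot \text{D\_Age} \cdot \text{H\_Smoker}$$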
Based on the JMP output, the new model is equally significant and explains the same share of the
variation (39.9%). One variable, H_Smoker, is insignificant, but we cannot remove it from the model,
because the interaction of smoking and age is significant: smoking changes the effect of age on
income, and vice versa.
*All interpretations of a variable assume that all the other variables are held constant.
The model states that for every additional year of education a person is expected to earn
2.44% more, which makes sense because a person becomes qualified for more jobs as he gets
more education.
For every additional year of age a person is expected to earn 6.42% more, which makes sense because
as a person gets older he acquires more skills and more experience, which leads to higher productivity
and a higher wage. This interpretation only holds for people aged 18 to 39, since the dataset includes
only these people.
Females are expected to earn 27.38% less than males, which can be explained by gender
discrimination and by personal and social lifestyle preferences that differ between genders and affect income.
Smokers earn around 1.01% less than non-smokers. This can be explained by the fact that people with
higher education earn more and that smoking is correlated with low education.
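The percentages above read each coefficient times 100 as a percentage change. A minimal sketch of that conversion follows, together with the exact conversion that would apply if C_LogIncome is the natural logarithm of income. That is an assumption, since the base of the logarithm is not stated here, and the function names are illustrative only.

```python
import math


def pct_effect_approx(b):
    """Approximate % change in income per unit change: 100 * b."""
    return 100 * b


def pct_effect_exact(b):
    """Exact % change, ASSUMING the dependent variable is the natural log of income."""
    return 100 * (math.exp(b) - 1)


# Coefficients of the reduced model reported above.
coefs = {
    "B_YearsOfSchooling": 0.0244622,
    "D_Age": 0.0641827,
    "E_Female": -0.273806,
}

for name, b in coefs.items():
    print(f"{name}: approx {pct_effect_approx(b):.2f}%, exact {pct_effect_exact(b):.2f}%")
```

For small coefficients the two conversions almost coincide. For a larger coefficient such as E_Female they differ noticeably, so under the natural-log assumption the gender gap would be somewhat smaller than the approximate reading suggests.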
The interaction of age and smoking says that, as a person gets older, the effect of smoking on income
gets smaller until about 29 years of age; from then on, the effect of smoking on income starts to grow
again.
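The age-dependent effect of smoking can be sketched as a function of age. This assumes that the interaction term is centred at the sample mean of age, which the written equation does not show. The centre used below, `AGE_CENTER = 29.0`, is a hypothetical value chosen only because the text mentions a turning point near 29; the true mean age is not given in this section.

```python
B_SMOKER = -0.010066        # main effect of H_Smoker (reduced model)
B_AGE_X_SMOKER = -0.037696  # coefficient of the age * smoker interaction

AGE_CENTER = 29.0  # HYPOTHETICAL centring value, not taken from the report


def smoking_effect(age, age_center=AGE_CENTER):
    """Effect of being a smoker on C_LogIncome at a given age (centred interaction assumed)."""
    return B_SMOKER + B_AGE_X_SMOKER * (age - age_center)


def zero_effect_age(age_center=AGE_CENTER):
    """Age at which the estimated effect of smoking changes sign."""
    return age_center - B_SMOKER / B_AGE_X_SMOKER


for age in (20, 25, 29, 35, 39):
    print(f"age {age}: effect of smoking on log income = {smoking_effect(age):+.4f}")
print(f"effect changes sign at age {zero_effect_age():.2f}")
```

Under these assumptions the absolute effect shrinks as age approaches the centring value and grows again beyond it, which matches the verbal interpretation above.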
Task c)
Based on the JMP output, at the 95% confidence level, the effect of one year of schooling on income lies
between 1.08104% and 4.20406%.
Manually:

$$S_{b_1} = \frac{S_\varepsilon}{\sqrt{(n-1)\,s_x^2}}$$

$$B_1 \pm t_{\alpha/2;\,n}\, S_{b_1} = 0.0244622 \pm 1.964 \cdot 0.007842 = 0.0244622 \pm 0.015405$$
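The manual interval can be checked with a short calculation. The sketch below simply reuses the estimate, standard error and critical t-value quoted in the manual computation; the function name `confidence_interval` is illustrative.

```python
def confidence_interval(estimate, std_error, t_critical):
    """Two-sided confidence interval: estimate +/- t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin


# Values taken from the manual calculation above.
lower, upper = confidence_interval(estimate=0.0244622, std_error=0.007842, t_critical=1.964)
print(f"95% CI for the schooling coefficient: [{lower:.6f}, {upper:.6f}]")
print(f"as percentages: [{100 * lower:.3f}%, {100 * upper:.3f}%]")
```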
Task d)
5) Homoscedasticity: V[ε|X] = σ²
The homoscedasticity assumption is not violated, because the residuals have the same standard
deviation regardless of the predicted value.
6) Normally distributed residuals: ε|X ~ N(0, σ²)
The residuals have more density close to the mean, so the normality assumption is not violated.
Based on the JMP output, there is no multicollinearity between the 2 variables, so this assumption is
not violated.
9) Outliers
In conclusion, the zero conditional mean assumption is slightly violated while the rest of the
assumptions are not. When drawing any conclusions it is therefore important to take possible
under- or over-estimation into consideration.

Estimated model:

$$\widehat{\text{C\_LogIncome}} = 4.9105963 + 0.0523107 \cdot \text{B\_YearsOfSchooling} - 0.274003 \cdot \text{E\_Female}$$
According to this model, one extra year of schooling is expected to increase income by 5.23%, while
being female is expected to decrease income by 27.40%. Based on the JMP output, the
model is significant and all of its variables are significant. However, the adjusted RSquare is much lower
than in the previous model (9.44% against 39.9%), which means that this model does not explain as much
of the variation as the previous model.
To verify that this model is worse than the previous one, we have to test the models against each other:
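Unrestricted model:

$$\widehat{\text{C\_LogIncome}} = 3.3673338 + 0.0244622 \cdot \text{B\_YearsOfSchooling} + 0.0641827 \cdot \text{D\_Age} - 0.273806 \cdot \text{E\_Female} - 0.010066 \cdot \text{H\_Smoker} - 0.037696 \cdot \text{D\_Age} \cdot \text{H\_Smoker}$$

Restricted model:

$$\widehat{\text{C\_LogIncome}} = 4.9105963 + 0.0523107 \cdot \text{B\_YearsOfSchooling} - 0.274003 \cdot \text{E\_Female}$$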
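We therefore have to test whether the variables left out of the restricted model are jointly significant:

$$H_0: \beta_2 = \beta_4 = \beta_5 = 0$$

$$H_1: \text{at least one of } \beta_2, \beta_4, \beta_5 \text{ is different from } 0$$

Test statistic:

$$f_{obs} = \frac{(R^2_{UR} - R^2_{R})/q}{(1 - R^2_{UR})/(n-k-1)} \sim F_{q,\,n-k-1}$$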
$$f_{obs} = \frac{(0.4044 - 0.09768)/3}{(1 - 0.4044)/(553 - 5 - 1)} = \frac{0.10224}{0.0010888} \approx 93.90$$

Critical value: $F_{q,\,n-k-1;\,0.05} = F_{3,\,547;\,0.05} \approx 2.62$

Because $f_{obs}$ exceeds the critical value and $P(F > 93.90) \approx 0$, we reject $H_0$ in favour of $H_1$:
at least one of the three coefficients differs from zero, so the unrestricted model explains significantly
more than the restricted model. We therefore include the 3 variables in our final model: age, smoking
habits and the interaction of age and smoking habits.
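The test statistic for comparing the restricted and unrestricted models can be evaluated directly from the two R² values. The sketch below is a plain implementation of the formula given above, with q = 3 restrictions, n = 553 observations and k = 5 regressors in the unrestricted model as used in the calculation; the function name `partial_f` is illustrative.

```python
def partial_f(r2_unrestricted, r2_restricted, q, n, k):
    """F statistic for q joint restrictions, computed from the R-squared of both models.

    q: number of coefficients set to zero in the restricted model
    n: number of observations
    k: number of regressors in the unrestricted model
    """
    numerator = (r2_unrestricted - r2_restricted) / q
    denominator = (1 - r2_unrestricted) / (n - k - 1)
    return numerator / denominator


# R-squared values and sample size taken from the calculation above.
f_obs = partial_f(r2_unrestricted=0.4044, r2_restricted=0.09768, q=3, n=553, k=5)
print(f"f_obs = {f_obs:.2f} with (3, 547) degrees of freedom")
```

The numerator divides the gain in R² by the number of restrictions q, so the statistic is compared with an F distribution that has q numerator degrees of freedom.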
Task e)
The β2 estimate in the model from task c) is the effect of gender. In the previous task we found that the
unrestricted model, which includes more variables, is significantly better. This means that the
unrestricted model is much closer to the true model than the one in task c), and that the model
from task c) omits variables (D_age, H_Smoker and the interaction of age and smoking).
Taking this into consideration, the estimated main effect of gender might be biased. To
determine how large the bias is, we can simply compare the estimates of the gender effect from both
models.
Unrestricted model: -0.273806
Restricted model: -0.274003
Difference: 0.000197 = 0.0197%, which is very small. This can be explained by the multicollinearity table from
the previous exercises, which shows that the correlation between the omitted variables and gender is very
small, so the bias is also expected to be very small.
Question 10.6
Task a)
Based on the Bivariate Fit of A_Income By D_age, it can be seen that the relationship between age
and income is not completely linear. There is some degree of polynomial relationship. On top of that,
the given JMP output shows that the polynomial term is significant (p-value < 0.0001; t-ratio -5.29).
The polynomial model will adjust for the exponential growth in income versus age, because income is
expected to grow exponentially with age. Furthermore, a centered polynomial
relationship would adjust for the fact that from a certain age people start to earn less as they are