
Based on this table, the biggest deviations come from people who have no education. They smoke more often than would be expected if smoking and education were independent of each other.

In contrast, high proportional deviations come from people who have long educations. They smoke less often than would be expected if smoking and educational attainment were independent.
One possible explanation is that more educated people are more self-disciplined, so people who want to quit smoking are more likely to succeed if they have a longer education.
In conclusion, we applied a chi-squared contingency table test to examine the dependence between education and smoking. All of our assumptions were met except for the constant probability assumption, which limits the applicability of the results. The χ² test showed that education and smoking are dependent. The contingency table shows that people with no education smoke more often than expected, and people with long educations smoke less often than expected, if smoking and education were independent of each other.
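The comparison of observed and expected counts described above can be sketched in a few lines. This is a minimal illustration only: the counts below are hypothetical and are not the assignment's data, and the function name `chi_squared` is made up for this example.

```python
def chi_squared(observed):
    """Return (statistic, expected counts, degrees of freedom) for a 2-D table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, expected, df


# Hypothetical counts (NOT the assignment's data).
# Rows: education level; columns: (smoker, non-smoker).
observed = [
    [30, 20],  # no education
    [40, 60],  # short education
    [20, 80],  # long education
]
stat, expected, df = chi_squared(observed)

# Deviation of observed from expected smokers in the "no education" row
deviation_no_education = observed[0][0] - expected[0][0]
print(round(stat, 2), df, deviation_no_education)
```

A positive deviation in a cell means that group appears more often than independence would predict, which is how the statements about people with no education and people with long educations are read off the table.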

Question 10.5

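True regression model:

$$\text{C\_LogIncome} = \beta_0 + \beta_1 \cdot \text{B\_YearsOfSchooling} + \beta_2 \cdot \text{D\_Age} + \beta_3 \cdot \text{E\_Female} + \beta_4 \cdot \text{H\_Smoker} + \beta_5 \cdot \text{D\_Age} \cdot \text{E\_Female} + \beta_6 \cdot \text{D\_Age} \cdot \text{H\_Smoker} + \varepsilon$$

Estimated regression model:

$$\widehat{\text{C\_LogIncome}} = b_0 + b_1 \cdot \text{B\_YearsOfSchooling} + b_2 \cdot \text{D\_Age} + b_3 \cdot \text{E\_Female} + b_4 \cdot \text{H\_Smoker} + b_5 \cdot \text{D\_Age} \cdot \text{E\_Female} + b_6 \cdot \text{D\_Age} \cdot \text{H\_Smoker}$$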

To estimate the regression model above, we have to check several assumptions in order to evaluate the reliability and validity of the model's results. These assumptions are:

1) Simple Random Selection (SRS) – Refer back to previous assignments


2) Trustworthiness – Refer to previous assignments.
3) Linearity – The linearity of the relationship between income and the independent variables can be evaluated subjectively by plotting residuals (JMP Output below).
Years of schooling does not seem entirely linear: many values lie above the predicted values near 10 years and 16 years of education. Other than that, everything else seems linear.

4) Zero conditional mean 𝐸(𝜀|𝑋) = 0

The zero conditional mean assumption does not seem to be violated, as the residuals are distributed around 0.
5) Homoscedasticity 𝑉[𝜀|𝑋] = 𝜎²
The homoscedasticity assumption is not violated either, because the residuals form a tunnel shape (a band of roughly constant width) across the different predicted values.
6) Normally distributed residuals 𝑒|𝑋~𝑁𝐷
The residuals have more density close to the mean, so the normality assumption does not seem to be violated.


7) Independence between errors
As long as the data were gathered with the SRS method, there should not be any dependence between errors.
8) Multicollinearity
There is a very strong correlation between the age × smoking interaction and smoking, as well as between the age × female interaction and female. This follows from the coding: males are coded 0, so the interaction is 0 for a male of any age, while females are coded 1, so the interaction for a female equals her age. The same goes for the age × smoker interaction.
9) Outliers

There does not seem to be any extreme outliers.
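The strong correlation between a 0/1 variable and its interaction with age, noted under multicollinearity, can be reproduced on a small made-up sample. This is a sketch under stated assumptions: the ages and the balanced female/male split are hypothetical, and centering both variables before multiplying is shown only as one common way to reduce this correlation, not as what was done in the JMP analysis.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical sample: one woman (1) and one man (0) at every age 18..39
ages = list(range(18, 40)) * 2
female = [1] * 22 + [0] * 22

# Raw interaction: 0 for every man, equal to age for every woman
raw_interaction = [a * f for a, f in zip(ages, female)]

# Centered interaction: subtract the means before multiplying
mean_age, mean_female = mean(ages), mean(female)
centered_interaction = [
    (a - mean_age) * (f - mean_female) for a, f in zip(ages, female)
]

r_raw = pearson(female, raw_interaction)
r_centered = pearson(female, centered_interaction)
print(round(r_raw, 3), round(r_centered, 3))
```

In this balanced toy sample the raw interaction is almost collinear with the dummy, while the centered version is uncorrelated with it. With real, unbalanced data the centered correlation would generally be small rather than exactly zero.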

In conclusion, the assumptions are not violated. The linearity between income and years of schooling is questionable, but the deviation does not seem very extreme. Multicollinearity is not a problem except for the interactions. As a result, we continue with the estimation of the model.

Estimation of the model:

$$\widehat{\text{C\_LogIncome}} = 3.3696515 + 0.0239288 \cdot \text{B\_YearsOfSchooling} + 0.0643261 \cdot \text{D\_Age} - 0.273718 \cdot \text{E\_Female} - 0.008899 \cdot \text{H\_Smoker} - 0.008106 \cdot \text{D\_Age} \cdot \text{E\_Female} - 0.038363 \cdot \text{D\_Age} \cdot \text{H\_Smoker}$$

Based on the JMP Output, the model is significant and explains 39.9% of the variation in Y. However, the interaction of age and female is not significant, and neither is smoker, so it is possible to reduce the model further.
We therefore drop the second-order variable, the interaction of age and female.
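Estimation of the new model:

$$\widehat{\text{C\_LogIncome}} = 3.3673338 + 0.0244622 \cdot \text{B\_YearsOfSchooling} + 0.0641827 \cdot \text{D\_Age} - 0.273806 \cdot \text{E\_Female} - 0.010066 \cdot \text{H\_Smoker} - 0.037696 \cdot \text{D\_Age} \cdot \text{H\_Smoker}$$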

Based on the JMP Output, the new model is just as significant and explains just as much (39.9%). However, one variable (H_Smoker) is insignificant. We cannot remove it from the model, because the interaction of smoking and age is significant, which means that smoking changes the effect of age on income and vice versa.

*All interpretations of a variable assume that all the other variables are held constant. This model states that for every additional year of education, a person is expected to earn approximately 2.45% more, which makes sense because a person becomes qualified for more jobs as they get more education.

For every additional year of age, a person is expected to earn approximately 6.42% more. This makes sense because as a person gets older they acquire more skills and more experience, which leads to higher productivity and a higher wage. The interpretation only holds for people aged 18 to 39, since the dataset includes only these people.

Females are expected to earn about 27.38% less than males, which can be explained by gender discrimination and by personal and social differences in lifestyle between genders that affect income.

Smokers earn around 1.01% less than non-smokers, although this main effect is not significant on its own. The sign can be explained by the fact that people with higher education earn more, and smoking is correlated with low education.

The interaction of age and smoking says that as a person gets older, the effect of smoking on income gets smaller until the age of 29. From then on, the effect of smoking on income starts to grow again.
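The percentages quoted above read each coefficient times 100 as a percentage change, which is an approximation that works well for small coefficients. Assuming C_LogIncome is the natural logarithm of income (an assumption, the text does not state the base), the exact change implied by a coefficient b is 100·(exp(b) − 1). A minimal sketch using the coefficients of the reduced model; the helper name `pct_change` is made up for this example:

```python
import math


def pct_change(coef):
    """Exact percentage change in income implied by a log-income coefficient."""
    return 100 * (math.exp(coef) - 1)


# Coefficients taken from the reduced model above
coefs = {
    "B_YearsOfSchooling": 0.0244622,
    "E_Female": -0.273806,
    "H_Smoker": -0.010066,
}

approx = {name: 100 * b for name, b in coefs.items()}       # coefficient * 100
exact = {name: pct_change(b) for name, b in coefs.items()}  # 100 * (e^b - 1)

for name in coefs:
    print(f"{name}: approx {approx[name]:.2f}%, exact {exact[name]:.2f}%")
```

For schooling and smoking the two readings nearly coincide. For the female coefficient the gap is noticeable (roughly 24% rather than 27% lower income), because the approximation deteriorates as the coefficient moves away from zero.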

Task c
Based on the JMP Output, at the 95% confidence level, the effect of one year of schooling on income lies between 1.08104% and 4.20406%.

Manually:

$$S_{b_1} = \frac{S_\varepsilon}{\sqrt{(n-1)\,s_x^2}} = 0.007842$$

$$b_1 \pm t_{\alpha/2;\,n}\, S_{b_1} = 0.0244622 \pm 1.964 \cdot 0.007842 = 0.0244622 \pm 0.015405$$

Lower 95% limit: 0.0244622 − 0.015405 = 0.0090573

Upper 95% limit: 0.0244622 + 0.015405 = 0.039867
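The manual interval can be re-checked with plain arithmetic. The sketch below uses only the three numbers quoted in the manual calculation (estimate, standard error, critical t value); the variable names are chosen for this example, and the last digits can differ slightly from the hand calculation because of rounding in the inputs.

```python
# Values quoted in the manual calculation above
b1 = 0.0244622     # estimated coefficient of years of schooling
se_b1 = 0.007842   # standard error of b1
t_crit = 1.964     # two-sided 95% critical t value used in the text

margin = t_crit * se_b1
lower = b1 - margin
upper = b1 + margin

print(f"margin = {margin:.6f}")
print(f"95% interval: [{lower:.6f}, {upper:.6f}]")
```

Both limits come out below 0.05, that is, an effect of roughly 0.9% to 4.0% per year of schooling on the log-income scale.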

Task d)

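Model:

$$\text{C\_LogIncome} = \beta_0 + \beta_1 \cdot \text{B\_YearsOfSchooling} + \beta_2 \cdot \text{E\_Female} + \varepsilon$$

Estimated model:

$$\widehat{\text{C\_LogIncome}} = b_0 + b_1 \cdot \text{B\_YearsOfSchooling} + b_2 \cdot \text{E\_Female}$$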

1) Simple Random Selection (SRS) – Refer back to previous assignments


2) Trustworthiness – Refer to previous assignments.
3) Linearity – Refer to the previous exercise (the same variables were analysed there)

4) Zero conditional mean 𝐸(𝜀|𝑋) = 0


For some predicted values, the conditional mean of the errors seems to deviate from 0, so this assumption appears to be violated.

5) Homoscedasticity 𝑉[𝜀|𝑋] = 𝜎²
The homoscedasticity assumption is not violated, because the residuals have the same standard deviation regardless of the predicted value.
6) Normally distributed residuals 𝑒|𝑋~𝑁𝐷
The residuals have more density close to the mean, so the normality assumption does not seem to be violated.


7) Independence between errors
As long as the data were gathered with the SRS method, there should not be any dependence between errors.
8) Multicollinearity

Based on the JMP output, there is no multicollinearity between the two variables, so this assumption is not violated.
9) Outliers

There does not seem to be any extreme outliers.

In conclusion, the zero conditional mean assumption is slightly violated, while the rest of the assumptions are not. Any conclusions should therefore take possible under- or overestimation into account.

Estimated model:

$$\widehat{\text{C\_LogIncome}} = 4.9105963 + 0.0523107 \cdot \text{B\_YearsOfSchooling} - 0.274003 \cdot \text{E\_Female}$$
According to this model, one extra year of schooling is expected to increase income by 5.23%, while being female is expected to decrease income by 27.40%. Based on the JMP Output, the model is significant and all of its variables are significant. However, the adjusted RSquare is much lower than in the previous model (9.44% against 39.9%), which means that this model does not explain as much of the variation as the previous model.

To verify that this model is worse than the previous one, we have to test the models against each other:
Unrestricted model:

$$\widehat{\text{C\_LogIncome}} = 3.3673338 + 0.0244622 \cdot \text{B\_YearsOfSchooling} + 0.0641827 \cdot \text{D\_Age} - 0.273806 \cdot \text{E\_Female} - 0.010066 \cdot \text{H\_Smoker} - 0.037696 \cdot \text{D\_Age} \cdot \text{H\_Smoker}$$

Restricted model:

$$\widehat{\text{C\_LogIncome}} = 4.9105963 + 0.0523107 \cdot \text{B\_YearsOfSchooling} - 0.274003 \cdot \text{E\_Female}$$

We therefore have to test whether the variables omitted from the restricted model (Age, Smoker, and Age × Smoker, numbered as in the unrestricted model) are jointly significant:

$$H_0: \beta_2 = \beta_4 = \beta_5 = 0$$

$$H_1: \text{at least one of them is different from } 0$$

Test statistic:

$$f_{obs} = \frac{(R^2_{UR} - R^2_{R})/q}{(1 - R^2_{UR})/(n-k-1)} \sim F_{q,\,n-k-1}$$

$$f_{obs} = \frac{(0.4044 - 0.09768)/3}{(1 - 0.4044)/(553 - 5 - 1)} = \frac{0.10224}{0.0010888} = 93.90$$

$$F_{q,\,n-k-1;\,0.05} = F_{3,\,547;\,0.05} \approx 2.62$$

$$P(F_{3,\,547} > 93.90) \approx 0$$

Because $f_{obs}$ exceeds the critical value and the p-value is approximately 0, we reject $H_0$ and accept $H_1$: the unrestricted model explains significantly more than the restricted model. We therefore include the three variables in our final model: age, smoking habits, and the interaction of age and smoking habits.
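The test statistic can be recomputed directly from the two R² values, dividing the R² difference by the number of restrictions q before forming the ratio. This is a minimal sketch: the function name `partial_f` is made up for this example, and the inputs are the figures quoted in the text (R² of 0.4044 and 0.09768, n = 553, k = 5 predictors in the unrestricted model, q = 3 restrictions).

```python
def partial_f(r2_unrestricted, r2_restricted, q, n, k):
    """F statistic for testing q restrictions on a model with k predictors."""
    numerator = (r2_unrestricted - r2_restricted) / q
    denominator = (1 - r2_unrestricted) / (n - k - 1)
    return numerator / denominator


f_obs = partial_f(0.4044, 0.09768, q=3, n=553, k=5)
print(round(f_obs, 2))
```

The resulting value is compared with the critical value of the F distribution with q and n − k − 1 degrees of freedom. Since the statistic is far above any conventional critical value, the rejection of the null hypothesis does not hinge on the exact cut-off.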

Task e)

The 𝑏2 estimate in the model from task c) is the effect of gender. In the previous task we found that the unrestricted model, which includes more variables, is significantly better. That means the unrestricted model is much closer to the true model than the one in task c), so the model from task c) omits variables (D_Age, H_Smoker and the interaction of age and smoking).

Taking this into consideration, the estimated main effect of gender might be biased. To determine how large the bias is, we can compare the estimates of gender from both models.
Unrestricted model: -0.273806
Restricted model: -0.274003
Difference: 0.000197, or about 0.02 percentage points, which is very small. This can be explained by the correlation table from the previous exercises, which shows that the correlation between the omitted variables and gender is very small, so the bias is also expected to be very small.
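The link between a small correlation and a small bias can be made explicit with the textbook omitted-variable result for the simplest case of one included and one omitted regressor. The symbols $x_1$, $x_2$, $\tilde{b}_1$ and $\delta_1$ are introduced here for illustration only and do not appear in the assignment:

```latex
\begin{aligned}
\text{True model:}\quad & y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \\
\text{Model fitted without } x_2\text{:}\quad & \hat{y} = \tilde{b}_0 + \tilde{b}_1 x_1 \\
\text{Expected estimate:}\quad & E[\tilde{b}_1] = \beta_1 + \beta_2\,\delta_1 \\
\text{where}\quad & \delta_1 = \text{slope from regressing } x_2 \text{ on } x_1
\end{aligned}
```

The bias term $\beta_2 \delta_1$ vanishes when the omitted variable is uncorrelated with the included one ($\delta_1 = 0$), even if the omitted variable itself matters ($\beta_2 \neq 0$). This matches the pattern described above: the omitted variables are significant, yet the gender estimate barely changes.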

Question 10.6
Task a)

Based on the Bivariate Fit of A_Income By D_Age, it can be seen that the relationship between age and income is not completely linear. There is some degree of polynomial relationship. On top of that, the JMP output shows that the polynomial term is significant (p-value < 0.0001; t-ratio -5.29).

The polynomial model will adjust for the exponential growth in income versus age, because the increase of income against age is expected to grow exponentially. Furthermore, a centered polynomial relationship would adjust for the fact that from a certain age people start to earn less as they are
