Assignment 1. Using any data you wish, examine and write up a four or more variable causal
model by estimating multiple regression equations. Present the structural equations and the
path diagram. Interpret the regression coefficients (focus on the usual unstandardized
coefficients; it is not required to interpret the standardized coefficients nor to decompose any
zero-order relationships into direct, indirect, and noncausal [spurious] effects). Test for
first-order interactions (and, if you think necessary, any theoretically compelling higher order
interactions). Examine possible multicollinearity (in the correlations matrix) and provide some
analysis of residuals (i.e., heteroskedasticity, as shown below), and, as needed, outliers if you
have a small data set.
Step 1: Theory, Path Diagram, and Recoding
1) Develop a simple theoretical relationship between four (or more) variables and
present it in a flow chart.
For this assignment, we examine the relationship between four (or more) variables, i.e.
three (or more) independent variables and one dependent variable. In this example, we
estimate the following relationships.
[Path diagram. Variables: Race (X1, White); Education (X2); Income (X3); Party ID (X4,
Republican); Attitude regarding the responsibility of government for poverty alleviation
(Y, Not its responsibility). Paths are labeled 1a, 1b, 1c, 1d, 2a, 2b, 2c, 3a, 3b and 4a.]
To do this we selected the following variables from GSS 2006: “race” (white, black,
other), “educ” (highest year of school completed), “income06” (total family income),
“partyid” (political party affiliation) and “helppoor” (self-placement on a five-point scale
that goes from “I strongly agree the government should improve living standards” to “I
strongly agree that people should take care of themselves”) (see footnote 1).
2) Obtain the frequency distribution for the original five variables (check the
missing values)
. tab race, miss
race of |
respondent | Freq. Percent Cum.
------------+-----------------------------------
white | 3,284 72.82 72.82
black | 634 14.06 86.87
other | 592 13.13 100.00
------------+-----------------------------------
Total | 4,510 100.00
highest |
year of |
school |
completed | Freq. Percent Cum.
------------+-----------------------------------
0 | 22 0.49 0.49
1 | 4 0.09 0.58
2 | 28 0.62 1.20
3 | 13 0.29 1.49
4 | 11 0.24 1.73
5 | 23 0.51 2.24
6 | 69 1.53 3.77
7 | 32 0.71 4.48
8 | 85 1.88 6.36
9 | 127 2.82 9.18
10 | 152 3.37 12.55
11 | 215 4.77 17.32
12 | 1,204 26.70 44.01
13 | 422 9.36 53.37
14 | 628 13.92 67.29
15 | 212 4.70 72.00
16 | 687 15.23 87.23
17 | 167 3.70 90.93
18 | 208 4.61 95.54
19 | 78 1.73 97.27
20 | 112 2.48 99.76
dk | 2 0.04 99.80
. | 9 0.20 100.00
------------+-----------------------------------
Total | 4,510 100.00

Footnote 1: I'd like to talk with you about issues some people tell us are important. Please look at CARD BC. Some people think
that the government in Washington should do everything possible to improve the standard of living of all poor
Americans; they are at Point 1 on this card. Other people think it is not the government's responsibility, and that each
person should take care of himself; they are at Point 5.

total family |
income | Freq. Percent Cum.
-------------------+-----------------------------------
under $1 000 | 43 0.95 0.95
$1 000 to 2 999 | 38 0.84 1.80
$3 000 to 3 999 | 29 0.64 2.44
$4 000 to 4 999 | 27 0.60 3.04
$5 000 to 5 999 | 40 0.89 3.92
$6 000 to 6 999 | 45 1.00 4.92
$7 000 to 7 999 | 48 1.06 5.99
$8 000 to 9 999 | 83 1.84 7.83
$10000 to 12499 | 142 3.15 10.98
$12500 to 14999 | 145 3.22 14.19
$15000 to 17499 | 126 2.79 16.98
$17500 to 19999 | 102 2.26 19.25
$20000 to 22499 | 157 3.48 22.73
$22500 to 24999 | 125 2.77 25.50
$25000 to 29999 | 212 4.70 30.20
$30000 to 34999 | 231 5.12 35.32
$35000 to 39999 | 217 4.81 40.13
$40000 to 49999 | 394 8.74 48.87
$50000 to 59999 | 332 7.36 56.23
$60000 to 74999 | 360 7.98 64.21
$75000 to $89999 | 284 6.30 70.51
$90000 to $109999 | 229 5.08 75.59
$110000 to $129999 | 162 3.59 79.18
$130000 to $149999 | 89 1.97 81.15
$150000 or over | 213 4.72 85.88
refused | 442 9.80 95.68
dk | 195 4.32 100.00
-------------------+-----------------------------------
Total | 4,510 100.00
political party |
affiliation | Freq. Percent Cum.
-------------------+-----------------------------------
strong democrat | 700 15.52 15.52
not str democrat | 736 16.32 31.84
ind,near dem | 527 11.69 43.53
independent | 997 22.11 65.63
ind,near rep | 327 7.25 72.88
not str republican | 637 14.12 87.01
strong republican | 495 10.98 97.98
other party | 65 1.44 99.42
. | 26 0.58 100.00
-------------------+-----------------------------------
Total | 4,510 100.00
should govt |
improve standard |
of living? | Freq. Percent Cum.
-------------------+-----------------------------------
govt action | 369 8.18 8.18
2 | 204 4.52 12.71
agree with both | 915 20.29 32.99
4 | 261 5.79 38.78
people help selves | 209 4.63 43.41
dk | 30 0.67 44.08
. | 2,522 55.92 100.00
-------------------+-----------------------------------
Total | 4,510 100.00
3) Recode the variables if necessary and obtain the frequency distribution of the
recoded variables.
You are advised not to collapse categories of any variable unless you have compelling
reason to do so. Recode so that the values start from “0” while retaining the original
number of categories. This makes it easier to interpret the regression results, that is, to
interpret the constant when the variables take their lowest value, 0.
Race
We recode this variable by reversing the order of the categories so that the larger value (1) is
assigned to whites, because we believe being white will have a positive impact on the
dependent variable. We also combine “other” and “black” into a non-white category, which
is coded “0”.
RECODE of |
race (race |
of |
respondent) | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,226 27.18 27.18
1 | 3,284 72.82 100.00
------------+-----------------------------------
Total | 4,510 100.00
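The recode command for race is not shown above. A minimal sketch, assuming the original variable is coded 1 = white, 2 = black, 3 = other (the order in which the categories appear in the frequency table), might look like this:

```stata
* Assumption: race is coded 1 = white, 2 = black, 3 = other.
* White stays 1; black and other are combined into 0 (non-white).
recode race (1=1) (2/3=0), gen(RACE)
tab RACE
```

If in doubt about the numeric codes, “tab race, nolabel” shows them without value labels.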
. tab EDUC
RECODE of |
educ |
(highest |
year of |
school |
completed) | Freq. Percent Cum.
------------+-----------------------------------
0 | 22 0.49 0.49
1 | 4 0.09 0.58
2 | 28 0.62 1.20
3 | 13 0.29 1.49
4 | 11 0.24 1.73
5 | 23 0.51 2.24
6 | 69 1.53 3.78
7 | 32 0.71 4.49
8 | 85 1.89 6.38
9 | 127 2.82 9.20
10 | 152 3.38 12.58
11 | 215 4.78 17.36
12 | 1,204 26.76 44.12
13 | 422 9.38 53.50
14 | 628 13.96 67.46
15 | 212 4.71 72.17
16 | 687 15.27 87.44
17 | 167 3.71 91.15
18 | 208 4.62 95.78
19 | 78 1.73 97.51
20 | 112 2.49 100.00
------------+-----------------------------------
Total | 4,499 100.00
. recode income06 (1=0)(2=1)(3=2)(4=3)(5=4)(6=5)(7=6)(8=7)(9=8)(10=9)
> (11=10)(12=11)(13=12)(14=13)(15=14)(16=15)(17=16)(18=17)(19=18)(20=19)
> (21=20)(22=21)(23=22)(24=23)(25=24)(26/98=.), gen(INCOM)
(4510 differences between income06 and INCOM)
. tab INCOM
RECODE of |
income06 |
(total |
family |
income) | Freq. Percent Cum.
------------+-----------------------------------
0 | 43 1.11 1.11
1 | 38 0.98 2.09
2 | 29 0.75 2.84
3 | 27 0.70 3.54
4 | 40 1.03 4.57
5 | 45 1.16 5.73
6 | 48 1.24 6.97
7 | 83 2.14 9.11
8 | 142 3.67 12.78
9 | 145 3.74 16.52
10 | 126 3.25 19.78
11 | 102 2.63 22.41
12 | 157 4.05 26.47
13 | 125 3.23 29.69
14 | 212 5.47 35.17
15 | 231 5.96 41.13
16 | 217 5.60 46.73
17 | 394 10.17 56.91
18 | 332 8.57 65.48
19 | 360 9.30 74.77
20 | 284 7.33 82.11
21 | 229 5.91 88.02
22 | 162 4.18 92.20
23 | 89 2.30 94.50
24 | 213 5.50 100.00
------------+-----------------------------------
Total | 3,873 100.00
RECODE of |
partyid |
(political |
party |
affiliation |
) | Freq. Percent Cum.
------------+-----------------------------------
0 | 700 15.84 15.84
1 | 736 16.66 32.50
2 | 527 11.93 44.42
3 | 997 22.56 66.98
4 | 327 7.40 74.38
5 | 637 14.42 88.80
6 | 495 11.20 100.00
------------+-----------------------------------
Total | 4,419 100.00
. tab GOVRES
RECODE of |
helppoor |
(should |
govt |
improve |
standard of |
living?) | Freq. Percent Cum.
------------+-----------------------------------
0 | 369 18.85 18.85
1 | 204 10.42 29.26
2 | 915 46.73 76.00
3 | 261 13.33 89.33
4 | 209 10.67 100.00
------------+-----------------------------------
Total | 1,958 100.00
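The recode behind GOVRES is likewise not shown. One way to obtain a scale that runs from 0 to 4 is sketched below. It assumes that helppoor is stored as 1 through 5 (Point 1 to Point 5 on the card) and that every other code should be treated as missing:

```stata
* Assumption: helppoor is coded 1-5; "dk" and any other code become missing.
recode helppoor (1=0) (2=1) (3=2) (4=3) (5=4) (else=.), gen(GOVRES)
tab GOVRES
```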
4) Filter observations with missing value on any variables in the model, if you are
estimating a set of equations and want all equations based on the same cases.
When you estimate a regression, Stata automatically drops observations with missing values in any of
the variables included in the model. But when you estimate more than one
regression, different observations may be dropped because different variables are
included in the different models. To make sure that exactly the same sample is used in all
regressions, you have to follow one of the following two methods.
Method #1 Drop the cases with a missing value in any of the five variables
Normally, dropping observations with a missing value in any of the newly recoded
variables is not recommended, because once you drop them you cannot recover them. But,
for the sake of simplicity in this exercise, you can choose this method.
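The commands for Method #1 are not reproduced here. A minimal sketch, assuming the five recoded variables carry the names used in the rest of this handout, would be:

```stata
* Drop every case that is missing on at least one of the five variables.
* missing() returns 1 if any of its arguments is missing.
* Caution: dropped cases are gone unless you reload the data set.
drop if missing(RACE, EDUC, INCOM, REPUBLICAN, GOVRES)
count
```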
Method #2 Mark the cases with no missing values
This second method is highly recommended for real data analysis. The first command,
“mark newvariable”, creates a new variable named newvariable that equals 1 for all
cases. The second command, “markout newvariable variablelist”, changes the value of
newvariable from 1 to 0 for the cases in which any of the variables in
variablelist (in this case RACE, EDUC, INCOM, REPUBLICAN and GOVRES) is
missing. Here we name the newvariable “nomiss”.
. mark nomiss
. markout nomiss RACE EDUC INCOM REPUBLICAN GOVRES
Then include “if nomiss==1” at the end of the regression commands you estimate.
Observations will be used in the estimation only if they have no missing values in any of
the variables that are used in this analysis. We will use method #2 in this handout.
Examples are shown below.
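. corr RACE EDUC INCOM REPUBLICAN GOVRES if nomiss==1
           | RACE EDUC INCOM REPUBLICAN GOVRES
-----------+---------------------------------------------
RACE | 1.0000
EDUC | 0.1997 1.0000
INCOM | 0.2259 0.3931 1.0000
REPUBLICAN | 0.2566 -0.0004 0.1191 1.0000
GOVRES | 0.2043 0.1178 0.1889 0.2716 1.0000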
1) Regress X2 on X1
. reg EDUC RACE if nomiss==1, beta
------------------------------------------------------------------------------
EDUC | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
RACE | 1.44842 .1744958 8.30 0.000 .1996871
_cons | 12.40138 .149854 82.76 0.000 .
------------------------------------------------------------------------------
2) Regress X3 on X1 and X2
. reg INCOM RACE EDUC if nomiss==1, beta
------------------------------------------------------------------------------
INCOM | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
RACE | 1.966103 .2911319 6.75 0.000 .1535411
EDUC | .6397802 .0401371 15.94 0.000 .3624045
_cons | 5.483277 .5547762 9.88 0.000 .
------------------------------------------------------------------------------
3) Regress X4 on X1, X2 and X3
. reg REPUBLICAN RACE EDUC INCOM if nomiss==1, beta
------------------------------------------------------------------------------
REPUBLICAN | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
RACE | 1.130313 .1093783 10.33 0.000 .2523994
EDUC | -.0549386 .0159755 -3.44 0.001 -.0889839
INCOM | .0339408 .0091024 3.73 0.000 .0970496
_cons | 2.219035 .2115915 10.49 0.000 .
------------------------------------------------------------------------------
4) Regress Y on X1, X2, X3 and X4
. reg GOVRES RACE EDUC INCOM REPUBLICAN if nomiss==1, beta
------------------------------------------------------------------------------
GOVRES | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
RACE | .2903471 .0658539 4.41 0.000 .1089254
EDUC | .0183701 .0093559 1.96 0.050 .0499881
INCOM | .0244208 .0053341 4.58 0.000 .1173151
REPUBLICAN | .1367426 .014336 9.54 0.000 .2297342
_cons | .6329858 .1275091 4.96 0.000 .
------------------------------------------------------------------------------
EDUC = 12.40138 + 1.44842 *RACE
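The remaining structural equations can be read off the unstandardized coefficients (Coef.) in the regression tables above in the same way as the EDUC equation. Written out, with no values other than those reported in the output:

```latex
\begin{align*}
\mathrm{INCOM} &= 5.483277 + 1.966103\,\mathrm{RACE} + 0.6397802\,\mathrm{EDUC} \\
\mathrm{REPUBLICAN} &= 2.219035 + 1.130313\,\mathrm{RACE} - 0.0549386\,\mathrm{EDUC} + 0.0339408\,\mathrm{INCOM} \\
\mathrm{GOVRES} &= 0.6329858 + 0.2903471\,\mathrm{RACE} + 0.0183701\,\mathrm{EDUC} + 0.0244208\,\mathrm{INCOM} + 0.1367426\,\mathrm{REPUBLICAN}
\end{align*}
```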
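[Path diagram with standardized coefficients (Beta):
Race (X1) -> Education (X2): .1997
Race (X1) -> Income (X3): .1535
Race (X1) -> Party ID (X4): .2524
Race (X1) -> Government's Responsibility (Y): .1089
Education (X2) -> Income (X3): .3624
Education (X2) -> Party ID (X4): -.0890
Education (X2) -> Government's Responsibility (Y): .0500
Income (X3) -> Party ID (X4): .0970
Income (X3) -> Government's Responsibility (Y): .1173
Party ID (X4) -> Government's Responsibility (Y): .2297]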
We can decompose the total effect of each of the independent variables on the dependent
variable. Calculate the decomposition tables for each of X1, X2, X3, X4 and Y,
according to the following rules. Do not round when doing the calculations. (You may
round when presenting the final result.)
Decomposition of Effects for X2 (EDUC)
Variables | Total Effects | Direct Effects | Indirect Effects | Calculation of Indirect Effects | Spurious Effects
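The rules and the filled-in table did not survive in this copy. As an illustration only, the sketch below decomposes the effect of X2 (EDUC) on Y (GOVRES) using the usual path-analysis conventions, which are assumed here rather than taken from the handout: the direct effect is the Beta of X2 in the equation for Y, each indirect effect is the product of the Betas along a compound path, and the spurious component is the zero-order correlation minus the total effect. All numbers are copied from the correlation matrix and regression output above; the scalar names are hypothetical.

```stata
* Betas copied from the regression output (not rounded)
* EDUC -> INCOM, EDUC -> REPUBLICAN, INCOM -> REPUBLICAN
scalar b_x3_x2 = .3624045
scalar b_x4_x2 = -.0889839
scalar b_x4_x3 = .0970496
* EDUC -> GOVRES (direct), INCOM -> GOVRES, REPUBLICAN -> GOVRES
scalar b_y_x2 = .0499881
scalar b_y_x3 = .1173151
scalar b_y_x4 = .2297342
* Zero-order correlation between EDUC and GOVRES
scalar r_y_x2 = .1178

* Indirect effects: product of the Betas along each compound path
scalar via_x3    = scalar(b_x3_x2) * scalar(b_y_x3)
scalar via_x4    = scalar(b_x4_x2) * scalar(b_y_x4)
scalar via_x3_x4 = scalar(b_x3_x2) * scalar(b_x4_x3) * scalar(b_y_x4)

scalar indirect_x2 = scalar(via_x3) + scalar(via_x4) + scalar(via_x3_x4)
scalar total_x2    = scalar(b_y_x2) + scalar(indirect_x2)
scalar spurious_x2 = scalar(r_y_x2) - scalar(total_x2)

display "Direct effect   = " scalar(b_y_x2)
display "Indirect effect = " scalar(indirect_x2)
display "Total effect    = " scalar(total_x2)
display "Spurious effect = " scalar(spurious_x2)
```

The same pattern applies to the other variables; only the set of compound paths changes.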
1) X1 and X2
2) X1 and X3
3) X1 and X4
4) X2 and X3
5) X2 and X4
6) X3 and X4
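Each pair in the list corresponds to one product term. A sketch of how the six interaction variables could be generated is given below. The variable names are the ones that appear in the regression output that follows; which product each name stands for is inferred from the names themselves.

```stata
* One product term per pair of independent variables
gen raceduc    = RACE*EDUC
gen racINCOM   = RACE*INCOM
gen racrepub   = RACE*REPUBLICAN
gen eduINCOM   = EDUC*INCOM
gen edurepub   = EDUC*REPUBLICAN
gen incomrepub = INCOM*REPUBLICAN
```

A product is missing whenever either of its components is missing, so these terms do not add cases to the analysis.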
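. reg GOVRES RACE EDUC INCOM REPUBLICAN raceduc racINCOM racrepub edurepub eduINCOM incomrepub, beta
------------------------------------------------------------------------------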
GOVRES | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
RACE | .9902724 .2931196 3.38 0.001 .3715064
EDUC | .0239648 .0287538 0.83 0.405 .0652123
INCOM | .0287287 .0229921 1.25 0.212 .1380098
REPUBLICAN | -.1350117 .0698544 -1.93 0.053 -.2268262
raceduc | -.0352751 .0202833 -1.74 0.082 -.1978869
racINCOM | -.0131167 .011881 -1.10 0.270 -.094953
racrepub | -.032165 .0367194 -0.88 0.381 -.0600898
edurepub | .013037 .0051297 2.54 0.011 .3244292
eduINCOM | -.0011666 .0015857 -0.74 0.462 -.1041171
incomrepub | .0075223 .0027743 2.71 0.007 .2464588
_cons | .7505942 .368533 2.04 0.042 .
------------------------------------------------------------------------------
C) Estimate the regression without the insignificant interaction terms, i.e., with
only the significant interactions.
In this section, run the regression retaining only those interaction terms that turned out to be
statistically significant in the previous section. For this example, we will see if omitting
raceduc, racINCOM, racrepub and eduINCOM makes the fit of the regression
significantly different.
. reg GOVRES RACE EDUC INCOM REPUBLICAN edurepub incomrepub, beta
Model | 280.290217 6 46.7150361 Prob > F = 0.0000
Residual | 2004.41297 1654 1.2118579 R-squared = 0.1227
-------------+------------------------------ Adj R-squared = 0.1195
Total | 2284.70319 1660 1.37632722 Root MSE = 1.1008
------------------------------------------------------------------------------
GOVRES | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
RACE | .2845684 .0655477 4.34 0.000 .1067575
EDUC | -.0122706 .0164101 -0.75 0.455 -.0333903
INCOM | .0059682 .0091095 0.66 0.512 .0286707
REPUBLICAN | -.1310843 .0678247 -1.93 0.053 -.220228
edurepub | .0115822 .0050109 2.31 0.021 .2882262
incomrepub | .0067193 .0026649 2.52 0.012 .2201474
_cons | 1.348882 .2181173 6.18 0.000 .
------------------------------------------------------------------------------
The Chow test tests whether a regression equation with interaction terms explains a
significantly greater amount of variance than a regression equation without interaction
terms. The null hypothesis is that the difference between the explained variance of the two
equations is zero in the population.
Rejecting the null hypothesis tells you that the interaction terms bring additional
explanatory power in a statistically significant way. The Chow test formula produces an
F-statistic to be compared to the critical values in the F table. If the F-value
of the Chow test exceeds the critical value found in the table, you can reject the null
hypothesis.
We will compare the following three regression models using Chow Test.
Model ➀: The Regression Model with no interaction terms.
GOVRES = .6329858 +.2903471 *RACE + .0183701* EDUC +.0244208 *INCOM + .1367426* REPUBLICAN
K = Number of original independent variables + 1 (for the constant)
M = Number of interaction terms added to the original equation
N = Number of observations
$R^2_K$ = $R^2$ for the original regression equation (with no interaction terms)
$R^2_{K+M}$ = $R^2$ for the regression equation with interaction terms
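The formula itself is missing from this copy. The standard incremental F statistic that is consistent with the definitions given here, with M denoting the number of added interaction terms, is:

```latex
F_{(M,\; N-K-M)} = \frac{\left(R^2_{K+M} - R^2_{K}\right) / M}{\left(1 - R^2_{K+M}\right) / (N - K - M)}
```

The numerator degrees of freedom are M and the denominator degrees of freedom are N-K-M.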
The critical value of the F statistic with df1=6 (degrees of freedom of the numerator, M),
df2=1650 (degrees of freedom of the denominator, N-K-M = 1661-5-6), and α = 0.05 is 2.09. Since
4.67>2.09, we reject the null hypothesis that the equation with six interaction terms and
the equation with no interaction terms explain the same amount of variance.
(Remember: rejecting the null hypothesis tells you that the interaction terms bring
additional explanatory power in a statistically significant way.)
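The hand calculation can also be scripted. The sketch below is one possible way to do it from a do-file, not the handout's own code: it takes the two R-squared values from Stata's stored results and uses invFtail() for the critical value. The scalar names are hypothetical.

```stata
* Original model: no interaction terms
quietly reg GOVRES RACE EDUC INCOM REPUBLICAN if nomiss==1
scalar r2_k   = e(r2)
scalar k_base = e(df_m) + 1

* Model with all six interaction terms
quietly reg GOVRES RACE EDUC INCOM REPUBLICAN raceduc racINCOM racrepub ///
    edurepub eduINCOM incomrepub if nomiss==1
scalar r2_km   = e(r2)
scalar n_obs   = e(N)
scalar m_added = e(df_m) + 1 - scalar(k_base)
scalar df2     = scalar(n_obs) - scalar(k_base) - scalar(m_added)

* Incremental F statistic and the 5% critical value
scalar chow_F = ((scalar(r2_km) - scalar(r2_k)) / scalar(m_added)) / ///
    ((1 - scalar(r2_km)) / scalar(df2))
display "F(" scalar(m_added) ", " scalar(df2) ") = " scalar(chow_F)
display "Critical value at alpha = 0.05: " invFtail(scalar(m_added), scalar(df2), 0.05)
```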
The equivalent of the Chow test can be done in Stata by typing the “test” command
right after executing the regression command. See the following example. Compare this
F-statistic with the hand-calculated one in the previous section. (Small differences may
result from rounding.)
(Output omitted)
. test raceduc racINCOM racrepub edurepub eduINCOM incomrepub
( 1) raceduc = 0
( 2) racINCOM = 0
( 3) racrepub = 0
( 4) edurepub = 0
( 5) eduINCOM = 0
( 6) incomrepub = 0
F( 6, 1650) = 4.65
Prob > F = 0.0001
The estimated F-value with (6, 1650) degrees of freedom is 4.65. Since 4.65 exceeds the
critical value of 2.09, we reject the null hypothesis that the equation with six interaction
terms and the equation with no interaction terms explain the same amount of variance.
***YOU CAN REPEAT VARIATIONS OF THIS TEST FOR SUBSETS OF THE
INTERACTION TERMS TO HELP DETERMINE WHICH ONES TO KEEP***
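For instance, to check whether the four interaction terms that were individually insignificant can be dropped together, one could re-estimate the full model and test only those four. This is a sketch of one such variation to run from a do-file, not output from the handout:

```stata
quietly reg GOVRES RACE EDUC INCOM REPUBLICAN raceduc racINCOM racrepub ///
    edurepub eduINCOM incomrepub if nomiss==1
* H0: the four coefficients are jointly equal to zero
test raceduc racINCOM racrepub eduINCOM
```

If Prob > F is above 0.05, the null hypothesis is not rejected, which would support leaving these four terms out.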
The “predict” command applies to the regression estimated right before typing it into the
command window. In this section, we will give new names for the predicted values and
estimated residuals after executing the “predict” command. When you type “predict
newvariable” without adding any option, you will obtain the predicted values of your
dependent variable. In this exercise we have named this newvariable, yhat.
So, right after the estimated regression equation that you are focusing on:
. predict yhat
(option xb assumed; fitted values)
This command can also be used to obtain residuals and standardized residuals, as
shown below, by adding the options “, resid” and “, rstandard”, respectively, to
“predict newvariable”. We have named the variable that contains the residuals
“e” and the variable containing the standardized residuals “std_e”.
. predict e, resid
. predict std_e, rstandard
The command for obtaining the histogram is “hist” followed by the variable name. The
command “qnorm” followed by the variable name will give you a normal probability
plot of this variable. The option “saving(name for graph, replace)” saves the
generated image to the working directory. For example, we named the file containing
the histogram of the standardized residuals “Histogram_std_e”.
Histogram
To save and display the histogram
. hist std_e, saving(Histogram_std_e, replace)
(bin=32, start=-2.2421422, width=.15336815)
(note: file Histogram_std_e.gph not found)
(file Histogram_std_e.gph saved)
[Figure: histogram of the standardized residuals. x axis: Standardized residuals; y axis: Density.]
[Figure: normal probability plot of the standardized residuals. x axis: Inverse Normal.]
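The command that produced the normal probability plot is not shown. Following the same pattern as the histogram, it might be written as follows; the file name Qnorm_std_e is a hypothetical choice:

```stata
* Normal probability plot of the standardized residuals, saved to disk
qnorm std_e, saving(Qnorm_std_e, replace)
```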
The unstandardized residuals should be on the y axis and the independent variables
should be on the x axis.
RACE
. graph twoway scatter e RACE, saving(e_RACE, replace)
(file e_RACE.gph saved)
[Figure: scatterplot of Residuals (y axis) against RECODE of race (race of respondent) (x axis).]
EDUC
. graph twoway scatter e EDUC, saving(e_EDUC, replace)
(file e_EDUC.gph saved)
[Figure: scatterplot of Residuals (y axis) against RECODE of educ (highest year of school completed) (x axis).]
INCOM
. graph twoway scatter e INCOM, saving(e_INCOM, replace)
[Figure: scatterplot of Residuals (y axis) against RECODE of income06 (total family income) (x axis).]
REPUBLICAN
. graph twoway scatter e REPUBLICAN, saving(e_REPUBLICAN, replace)
[Figure: scatterplot of Residuals (y axis) against RECODE of partyid (political party affiliation) (x axis).]
The squared residuals will be named “squared_e” and the variable containing the
absolute values of the residuals will be named “absolute_e”. The function
“abs(variable)” gives you the absolute value of variable.
. gen squared_e=e*e
. gen absolute_e= abs(e)
f) Examine the means of the residual (in absolute values, not shown here) and
squared residuals by different categories of independent variables. Why?
This can be done using the command “tab variableA, sum(variableB)”.
However, in order to make this analysis clearer, we collapse the independent variables
into fewer categories. In this example, we collapse the independent variables that have
many categories, EDUC, INCOM and REPUBLICAN, into three categories each and
leave RACE (and GOVRES) intact.
. recode EDUC (0/12=0)(13/16=1)(17/20=2), gen(EDUC2)
(1655 differences between EDUC and EDUC2)
. tab RACE, summ(squared_e)
RECODE of |
race (race |
of | Summary of squared_e
respondent) | Mean Std. Dev. Freq.
------------+------------------------------------
0 | 1.3321228 1.6271939 436
1 | 1.1807221 1.5715731 1225
------------+------------------------------------
Total | 1.2204636 1.5872673 1661
. tab EDUC2, summ(squared_e)
RECODE of |
EDUC |
(RECODE of |
educ |
(highest |
year of |
school | Summary of squared_e
completed)) | Mean Std. Dev. Freq.
------------+------------------------------------
0 | 1.4125591 1.7571281 687
1 | 1.1320202 1.4776772 737
2 | .9386629 1.3135448 237
------------+------------------------------------
Total | 1.2204636 1.5872673 1661
. tab INCOM2, summ(squared_e)
RECODE of |
INCOM |
(RECODE of |
income06 |
(total |
family | Summary of squared_e
income)) | Mean Std. Dev. Freq.
------------+------------------------------------
0 | 1.5703851 1.8339021 337
1 | 1.1768049 1.5589693 1020
2 | .97904393 1.3033466 304
------------+------------------------------------
Total | 1.2204636 1.5872673 1661
. tab REPUBLICAN, summ(squared_e)
RECODE of |
partyid |
(political |
party |
affiliation | Summary of squared_e
) | Mean Std. Dev. Freq.
------------+------------------------------------
0 | 1.3544698 1.754 250
1 | 1.1022653 1.5308627 283
2 | 1.2853162 1.6759844 189
3 | 1.4062153 1.6491756 350
4 | 1.2662294 1.6828067 127
5 | .99803069 1.3768972 273
6 | 1.101894 1.398698 189
------------+------------------------------------
Total | 1.2204636 1.5872673 1661
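The same tabulations can be repeated for the absolute residuals, which are not shown above. A compact sketch, assuming EDUC2 and INCOM2 have already been created as described:

```stata
* Mean absolute residual within each category of each independent variable
foreach v of varlist RACE EDUC2 INCOM2 REPUBLICAN {
    tab `v', summ(absolute_e)
}
```

If the mean of the squared (or absolute) residuals changes systematically across categories, as the squared residuals appear to do for EDUC2 and INCOM2 above, the error variance is probably not constant.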
Sample do-file
*Obtain the frequency distribution
** or,
mark nomiss
markout nomiss RACE EDUC INCOM REPUBLICAN GOVRES
*Obtain correlations
corr RACE EDUC INCOM REPUBLICAN GOVRES if nomiss== 1
**Regressions
reg EDUC RACE if nomiss==1, beta
reg GOVRES RACE EDUC INCOM REPUBLICAN raceduc racINCOM racrepub edurepub ///
    eduINCOM incomrepub, beta
**Chow Test
reg GOVRES RACE EDUC INCOM REPUBLICAN raceduc racINCOM racrepub edurepub ///
    eduINCOM incomrepub, beta
predict yhat
predict e, resid
predict std_e, rstandard
**Examine the means of the residual (in absolute values, not shown here) and squared residuals by different categories of independent variables.
**First collapse indep. var. into fewer categories.