
RELATIONSHIPS WITH INTERVAL-LEVEL DEPENDENT VARIABLES:

Scatterplots and Correlation Coefficients


1. A relationship between variables is a pattern of association between the values of
two (or more) variables. If two variables are related, then certain values of one variable
are more common among cases with some values of the second variable than among cases
with other values. If
changes in the first variable tend to lead to changes in the second variable, then the first
(the cause) is called the independent variable and the second (the effect) is called the
dependent variable.
SCATTERPLOTS
2. One way to show a relationship between two interval-level variables is with a
scatterplot. In a scatterplot, every case appears as a point on a graph. The horizontal
axis shows the values of the independent variable (typically called X), and the vertical
axis shows the values of the dependent variable (typically called Y). The point for a
particular observation is directly above its value for the independent variable and at a
height that represents its value for the dependent variable.
3. If above-average values of the independent variable tend to be associated with
above-average values of the dependent variable, then the relationship is positive. As the
value of the X-variable rises, the value of the Y-variable also tends to rise.
4. If above-average values of the independent variable tend to be associated with
below-average values of the dependent variable, the relationship is negative. As the value
of the X-variable rises, the value of the Y-variable tends to fall.
5. Relationships can also be more complex. Values of the dependent variable may be
highest (or lowest) for intermediate values of the independent variable, or the
relationship may be positive for certain ranges of the independent variable but flat or
negative for other values. If the relationship can be described by a curved line, it is
called curvilinear.
6. The stronger a relationship, the more obvious it will be in a scatterplot. In a perfect
linear relationship, all the points in the graph will fit on one straight line. In a strong
linear relationship, the points will all seem to be grouped fairly close to a straight line.
The weaker the linear relationship, the farther points will tend to be from the line and
the harder it will be to see where the line is. When there is no relationship between two
variables, points will seem to be scattered randomly around the graph with no pattern
(like a chocolate chip cookie).
7. Whether a relationship is weak or strong has nothing to do with whether it is positive
or negative. You can have strong negative relationships and weak positive ones.


The following three graphs show the relationships between federal salaries in 1994 and
the employee's grade, years of education, and years of federal service, for a random
sample of 1,000 employees:

[Scatterplot A: salary in dollars by grade, with the median spline and fitted values overlaid]

A. The relationship between grade and salary is strong: salaries at each grade level are
bunched tightly together, so if we know a person's grade, we have a pretty good idea what
his/her salary is. The relationship is curvilinear rather than linear, however. The
straight line is below all observed salaries for those at the lowest and highest grades and
above almost all observed salaries in the middle grades. The median spline connects
the median salary at each grade level; it suggests that a fairly smooth curve can capture
the relationship between grade and salary.
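A graph like this one can be drawn with Stata's twoway command. The lines below are a
minimal sketch, not the command that produced the graphs here; it assumes variables
named sal and grade as in the output later in this handout (mspline draws the median
spline and lfit the fitted straight line):

* Sketch: overlay the raw scatter, a median spline, and a fitted (OLS) line.
twoway (scatter sal grade) (mspline sal grade) (lfit sal grade), ///
    ytitle("salary in dollars") xtitle("grade")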
B. In the next graph, the relationship between experience and salary is much weaker:
at every level of federal service there is a wide range of observed salaries. The dispersion
is so wide that it is hard to see the pattern. The median spline fluctuates but with a general
upward trend (the drop at the highest levels of experience is based on only a few cases).
In this case, the straight line does not seem to seriously misstate the relationship.

[Scatterplot B: salary in dollars by years of federal service (0 to 40), with the median spline and fitted values overlaid]

C. Because there are so few values of education and so many observations at each level
of education, the pattern is hard to see from the scatterplot alone. Again, the median
spline and the straight line seem to tell pretty much the same story, but the low return to
starting but not finishing college may be important.

[Scatterplot C: salary in dollars by years of education (10 to 20), with the median spline and fitted values overlaid]

CORRELATION
8. The CORRELATION COEFFICIENT (r) gives a numerical measure of the
strength and direction of a linear relationship between two continuous variables.
r has a maximum value of 1 for a perfect positive relationship, a minimum value
of -1 for a perfect negative relationship, and a value of 0 for no relationship
whatsoever. The closer to zero the correlation coefficient is, the weaker the linear
relationship. The further r is from zero in either direction (that is, the higher its
absolute value), the stronger the relationship.
The sign of the correlation coefficient shows the direction of the relationship. A
positive r indicates that above-average values of X tend to be associated with
above-average values of Y; that is, as X increases, Y also tends to increase. A
negative correlation coefficient shows that above-average values of X tend to be
associated with below-average values of Y; that is, as X increases, Y tends to
decrease.
9. Table 1 is a correlation matrix for our sample.
pwcorr shows the correlation coefficient for each pair of variables below the
diagonal. (The correlation coefficients in the main diagonal of the correlation
matrix are all 1.000's, indicating that every variable is perfectly correlated with
itself.)
The obs subcommand tells Stata to print the number of observations the
correlation coefficient is based on. (Because there are missing values for sal but
not the other variables in this data set, the correlations with sal are based on 995
observations while the other correlations are based on all 1,000 observations.)
The sig subcommand tells Stata to print the p-value for each correlation (the
probability of getting a correlation coefficient this strong in the sample if there
were no relationship between the variables in the population.)
The star(.01) subcommand tells Stata to put an asterisk next to each correlation
coefficient that is significant at the .01 level. (You can choose any significance
level.) In this large sample, all but two of the correlations are significant at the .01
level.

Table 1. Correlation Matrix


. pwcorr sal grade edyrs yos age male minority, obs sig star(.01)
             |      sal    grade    edyrs      yos      age     male minority
-------------+----------------------------------------------------------------
         sal |   1.0000
             |
             |      995
             |
       grade |   0.9111*  1.0000
             |   0.0000
             |      995     1000
             |
       edyrs |   0.5921*  0.6064*  1.0000
             |   0.0000   0.0000
             |      995     1000     1000
             |
         yos |   0.4022*  0.3064*  0.0070   1.0000
             |   0.0000   0.0000   0.8263
             |      995     1000     1000     1000
             |
         age |   0.2902*  0.1855*  0.0772   0.6214*  1.0000
             |   0.0000   0.0000   0.0146   0.0000
             |      995     1000     1000     1000     1000
             |
        male |   0.3613*  0.3520*  0.3065*  0.0823*  0.0937*  1.0000
             |   0.0000   0.0000   0.0000   0.0092   0.0030
             |      995     1000     1000     1000     1000     1000
             |
    minority |  -0.2327* -0.2333* -0.1525* -0.1310* -0.1544* -0.1226*  1.0000
             |   0.0000   0.0000   0.0000   0.0000   0.0000   0.0001
             |      995     1000     1000     1000     1000     1000     1000

10. You can rank-order the strength of the relationships by the absolute value of their
correlation coefficients.
Salary (sal) is most strongly correlated with grade (.911), followed by edyrs (.592),
yos (.402), male (.361), and minority (-.233). (Note that the correlation with
minority is weakest because of its absolute value, not because it is negative.)
The correlation between sal and grade is so close to 1 that we can be confident
that grade level is a good predictor of salary. Fairly strong correlations suggest
that education and federal experience have important impacts on salary.
Years of federal service (yos) is most strongly correlated with age (.621), followed
by sal (.402), grade (.306), minority (-.131), male (.082), and edyrs (.007). Note that
levels of education and federal experience are almost completely unrelated in this
sample: knowing how much education one has will be of virtually no use in
predicting how much federal service one has.
11. You interpret the directions of the relationships by looking at the signs of the
correlation coefficients. Positive signs mean that above-average values of the first
variable are associated with above-average values of the second variable. Negative
signs mean that above-average values of the first variable are associated with
below-average values of the second variable.
The correlation coefficients of sal are positive with grade, edyrs, and yos.
Thus, people with higher salaries tend to have higher grades, more education,
and more federal experience than people with lower salaries.
The magnitude of a correlation coefficient is less meaningful with dummy variables (such
as male and minority) than with interval-level variables (such as edyrs or
yos). Still, the direction of the relationships is quite clear. Positive coefficients
mean that those with the 1 values on the dummy variable have, on average,
higher values on the other variable than those with the 0 values. Negative
coefficients mean that the 1's have lower average values on the other variable
than the 0's.
Because the male coefficients with the interval-level variables are all
positive, men (male=1) have higher mean grades and salaries and more
years of education and federal service, on average, than women (male=0).
The difference is stronger for grades and salaries than for years of
education and federal service.
Because the minority coefficients are all negative, minorities
(minority=1) have lower mean grades and salaries and fewer years of
education and federal service, on average, than nonminorities. The
difference is stronger for grades and salaries than for years of education
and federal service.
The correlation between male and minority is negative. Therefore, high
values on the first tend to be associated with low values on the second. In
other words, men are more likely than women to be nonminorities, and
minorities are more likely than nonminorities to be women.
12. In general, the closer the points on a scatterplot are to a straight line, the higher the
correlation coefficient will be.
13. Even a perfect curvilinear relationship (one that follows a curved line rather than
a straight line) may have a correlation coefficient very close to zero. It is therefore
important to graph relationships to make sure that the relationship is really linear and
not curvilinear. Otherwise, the correlation coefficient will be misleading, and you may
need to change your regression analysis.
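As a quick illustration with made-up data (not the federal sample), a perfectly U-shaped
relationship produces a correlation coefficient of essentially zero even though Y is
completely determined by X; a minimal sketch:

* Hypothetical illustration: a perfect curvilinear relationship with r near 0.
clear
set obs 41
gen x = _n - 21        // x runs from -20 to 20
gen y = x^2            // y is exactly determined by x, but not linearly
pwcorr y x             // the correlation coefficient is essentially zero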

Stata Note:
14. Stata has two commands for creating correlation matrices, pwcorr and correlate.
pwcorr drops observations pairwise and bases each correlation on all the observations
available for that pair of variables. corr drops observations listwise: if an observation
has a missing value on any of the variables in the correlation matrix, that observation is
dropped from all the correlations. In the pwcorr table, all the correlations that include
sal have 995 observations and all correlations that exclude it have 1,000 observations
(because 5 observations have missing salaries). In the corr output, Stata tells us in
parentheses that there are 995 observations. All correlations in the first column are
identical in the two tables, but all others differ slightly between the tables, because
corr drops the observations with missing salaries from every correlation.
Corr will calculate descriptive statistics with the mean subcommand. It will
not, however, calculate significance levels or attach asterisks.
Table 2. Correlation Matrix
. corr sal grade edyrs yos age male minority, mean
(obs=995)
    Variable |       Mean   Std. Dev.        Min        Max
-------------+------------------------------------------------
         sal |   40784.47    17683.05      15054     116529
       grade |    9.60804     3.37125          3         16
       edyrs |   14.37085    2.264714         10         20
         yos |    14.8201    8.869657          1         41
         age |   43.94171    10.21719         20         79
        male |   .5135678    .5000672          0          1
    minority |   .2713568    .4448834          0          1

             |      sal    grade    edyrs      yos      age     male minority
-------------+----------------------------------------------------------------
         sal |   1.0000
       grade |   0.9111   1.0000
       edyrs |   0.5921   0.6061   1.0000
         yos |   0.4022   0.3064   0.0053   1.0000
         age |   0.2902   0.1813   0.0724   0.6217   1.0000
        male |   0.3613   0.3523   0.3051   0.0780   0.0899   1.0000
    minority |  -0.2327  -0.2329  -0.1539  -0.1268  -0.1534  -0.1206   1.0000

BIVARIATE REGRESSION
1. The purpose of statistics is to simplify, to summarize data in ways that make
patterns easier to see and understand. If the relationship between two interval level
variables is essentially linear (as shown in a scatterplot, for instance), the relationship
can best be described by the equation of a line:

y-hat = a + bX
In this equation, the dependent variable (Y) is a function of the independent variable
(X). For any given value of X, the best guess for Y (y-hat, the expected value of the
dependent variable) is a + (b times that value of X). In this equation, a and b are
constants that will be calculated for you by the computer, X is a variable whose value
you can choose, and y-hat is a variable whose value is determined by the equation.
TABLE 1. reg sal grade
      Source |       SS       df       MS              Number of obs =     995
-------------+------------------------------           F(  1,   993) = 4852.86
       Model |  2.5802e+11     1  2.5802e+11           Prob > F      =  0.0000
    Residual |  5.2796e+10   993  53168190.5           R-squared     =  0.8301
-------------+------------------------------           Adj R-squared =  0.8300
       Total |  3.1081e+11   994   312690124           Root MSE      =  7291.7

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   4779.042    68.60279    69.662   0.000     4644.418    4913.665
       _cons |  -5132.755    698.4975    -7.348   0.000    -6503.455   -3762.054
------------------------------------------------------------------------------

SAL-hat = -5133 + 4779 GRADE


sal (salary in dollars) is the dependent variable, grade is the independent
variable, the y-intercept or constant is -5133, and the regression coefficient is
4779.
TABLE 2. reg sal age
      Source |       SS       df       MS              Number of obs =     995
-------------+------------------------------           F(  1,   993) =   91.32
       Model |  2.6176e+10     1  2.6176e+10           Prob > F      =  0.0000
    Residual |  2.8464e+11   993   286644059           R-squared     =  0.0842
-------------+------------------------------           Adj R-squared =  0.0833
       Total |  3.1081e+11   994   312690124           Root MSE      =   16931

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   502.2623    52.55897     9.556   0.000     399.1229    605.4017
       _cons |    18714.2    2371.079     7.893   0.000      14061.3    23367.11
------------------------------------------------------------------------------

SAL-hat = 18714 + 502 AGE



sal is the dependent variable, age is the independent variable, the y-intercept or
constant is 18,714, and the regression coefficient is 502.
2. In a bivariate regression with an interval-level independent variable, the Y-intercept
(a) is the point where the regression line crosses the Y-axis. It is also the
expected value of Y when X equals zero. This interpretation is only meaningful when X
can have the value zero and when we have values of X close to zero in our data set.
Otherwise, think of the Y-intercept as an anchor for the regression line -- a point the line
needs to pass through.
Our first y-intercept indicates that the expected salary for employees at grade 0 is
-$5,133, but there is no grade 0 and the lowest grade in this sample is 3. We
should just treat this y-intercept as one point on the regression line, but not one
that is very relevant.
Our second y-intercept indicates that the expected salary of newborns is $18,714.
Again, this number has little meaning in itself.
3. The slope (b), also called the regression coefficient or slope coefficient, is the
expected change in Y for a one-unit increase in X (or the change in the expected value of
Y for a one-unit increase in X). (When describing a particular regression coefficient,
always explain it in terms of the actual units that are used for measuring X and Y rather
than saying a one-unit increase in X.) A positive slope means a positive relationship
(higher values of X tend to be associated with higher values of Y), while a negative slope
reflects a negative relationship (higher values of X tend to be associated with lower
values of Y). The higher the absolute value of b (its numerical value, ignoring its sign),
the steeper the line and the more Y responds to changes in X.
Our first regression coefficient suggests that as grade rises by one, expected
salary rises by $4,779. That is, on average, employees earn $4,779 more than
people one grade below them and $4,779 less than people one grade above them.
Our second regression coefficient indicates that expected salary rises $502 with
each additional year of age. Employees, on average, earn $502 more than people
who are one year younger and $502 less than people who are one year older.
The slope is clearly steeper with grade than with age as the independent
variable. That is, expected salary clearly increases more as grade rises by one
than as age rises by one.
4. y-hat is the predicted value of the dependent variable when the independent
variable has a particular value. That is, it is an estimate of the mean value of Y for all
observations where X has that particular value. It is also our best guess of the value of
an individual Y when we know the value of X.

5. We can calculate the expected value of Y (y-hat) by filling in the chosen value of X,
multiplying it by b (the regression, or slope, coefficient) and adding the product to a
(the Y-intercept). Note that in these calculations, a and b are always the same, whereas
X is whatever value we choose and y-hat varies, depending on the value of X we have
chosen. In other words, a and b are constants, while X and y-hat are variables.
The expected salary of an employee in grade 4 is -5133 + 4779(4) = $13,983.
For an employee in grade 7, the expected salary is -5133 + 4779(7) = $28,320.
The expected salary of an employee in grade 10 is -5133 + 4779(10) = $42,657.
Notice that each three-grade increase (from 4 to 7, and from 7 to 10) raises the
expected salary by $14,337 (from $13,983 to $28,320, to $42,657). Three grades
times $4779 per grade totals $14,337 (3 x 4779 = 14,337).
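These calculations can be checked with Stata's display command; a minimal sketch
using the rounded coefficients from Table 1:

* Expected salaries from SAL-hat = -5133 + 4779 GRADE:
display -5133 + 4779*4      // 13983: expected salary at grade 4
display -5133 + 4779*7      // 28320: expected salary at grade 7
display -5133 + 4779*10     // 42657: expected salary at grade 10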
A Special Case: Dummy Variables
6. Regression analysis allows two types of independent variables: interval level variables
and dummy variables. A dichotomous or dummy variable has only two possible
values, typically 1 and 0.
7. With a dummy independent variable, the bivariate regression equation is still:
y-hat = a + bX
but this is no longer the equation for a line, because X has only two possible values,
which means that there are only two possible values of y-hat (rather than infinitely many
possible values as there are in a regression line).
8. Because there are only two possible values for y-hat, the expected values of the
dependent variable turn out to be the sample means.
TABLE 3. reg sal male
      Source |       SS       df       MS              Number of obs =     995
-------------+------------------------------           F(  1,   993) =  149.10
       Model |  4.0577e+10     1  4.0577e+10           Prob > F      =  0.0000
    Residual |  2.7024e+11   993   272142282           R-squared     =  0.1305
-------------+------------------------------           Adj R-squared =  0.1297
       Total |  3.1081e+11   994   312690124           Root MSE      =   16497

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   12776.64    1046.349    12.211   0.000     10723.33    14829.95
       _cons |    34222.8    749.8516    45.639   0.000     32751.32    35694.27
------------------------------------------------------------------------------

. predict salhat
(option xb assumed; fitted values)


sal-hat = 34223 + 12777 male

(Table 3)

The expected salary of a woman (male=0) is 34223 + 12777(0) = $34,223.


The expected salary of a man (male=1) is 34223 + 12777(1) = $47,000.
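The same arithmetic can be done with display; a minimal sketch using the rounded
Table 3 coefficients:

* Expected salaries from sal-hat = 34223 + 12777 male:
display 34223 + 12777*0     // 34223: expected salary of a woman (male=0)
display 34223 + 12777*1     // 47000: expected salary of a man (male=1)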
10. When there is only one dichotomous independent variable, the constant or Y-intercept
(a) is the mean value of the dependent variable for the reference group (the
group that has the value 0 for the independent variable). The regression coefficient (b)
is the difference in the mean values of Y between the named group (those who have the
value X=1) and the reference group.
Thus, $34,223 is the mean salary of women in the sample, and $12,777 is the
difference in mean salaries between men and women. That is, on average, men
earn $12,777 more than women.
11. Dummy variables can be coded in either direction and regression will still give us the
same expected values. Suppose we recode male into a new variable called female
(which is coded 1 for women and 0 for men--that is, whenever male=1, female=0;
whenever male=0, female=1).
. gen female = 1 - male
TABLE 4. reg sal female
      Source |       SS       df       MS              Number of obs =     995
-------------+------------------------------           F(  1,   993) =  149.10
       Model |  4.0577e+10     1  4.0577e+10           Prob > F      =  0.0000
    Residual |  2.7024e+11   993   272142282           R-squared     =  0.1305
-------------+------------------------------           Adj R-squared =  0.1297
       Total |  3.1081e+11   994   312690124           Root MSE      =   16497

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -12776.64    1046.349   -12.211   0.000    -14829.95   -10723.33
       _cons |   46999.44    729.7726    64.403   0.000     45567.36    48431.51
------------------------------------------------------------------------------

. predict fsalhat
(option xb assumed; fitted values)

sal-hat = 46999 - 12777 female


When we use female as the independent variable, the y-intercept (46999) is the
mean salary of men (the reference group, for whom female=0); the regression
coefficient (-12777) is the difference between the mean salaries of women and
men (the female mean minus the male mean); and a+b (46999 - 12777 = 34222)
is the mean salary of women (female=1).
The expected salary of a man (female=0) is 46999 - 12777(0) = $46,999.
The expected salary of a woman (female=1) is 46,999 - 12777(1) = $34,222.

Thus, the y-intercept ($46,999) is the mean salary of men in the sample, and
-$12,777 is the difference in mean salaries between women and men. That is, on
average, women earn $12,777 less than men.
13. The regression coefficient has the same numerical value in both regressions, but the
sign changes, because the regression coefficient is the female mean minus the male
mean in Table 4 and the male mean minus the female mean in Table 3. The y-intercept
(a) in Table 4 equals a+b in Table 3, because both represent the mean salary of men.
The y-intercept (a) in Table 3 equals a+b in Table 4, because both represent the mean
salary of women. The two regression equations look different, but they provide the
same information about the mean salaries of men and women.
TABLE 5. ttest sal, by(male)
Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  female |     484     34222.8    623.8963    13725.72    32996.91    35448.68
    male |     511    46999.44     829.325    18747.15    45370.12    48628.75
---------+--------------------------------------------------------------------
combined |     995    40784.47    560.5903    17683.05    39684.39    41884.54
---------+--------------------------------------------------------------------
    diff |           -12776.64    1046.349               -14829.95   -10723.33
------------------------------------------------------------------------------
Degrees of freedom: 993

                 Ho: mean(female) - mean(male) = diff = 0

     Ha: diff < 0               Ha: diff ~= 0               Ha: diff > 0
       t = -12.2107               t = -12.2107                t = -12.2107
   P < t =   0.0000           P > |t| =  0.0000            P > t =  1.0000

14. A difference-of-means test (ttest) shows the same mean salaries for men and
women, and the same difference of means, as the regression. When we also assume that
men's and women's salaries have the same variances (the homoskedasticity assumption
of OLS regression), the difference of means (-12776.64) has the same estimated
standard error (1046.349) and the same t-statistic (-12.2107) as the regression. (Equal
variances is the default assumption of the ttest command. If you want to allow for
unequal variances, add unequal as an option after the comma.)
15. Stata Note: The predict subcommand after regress (and many other estimation
commands) creates new variables based on the estimation procedures, typically
expected values or residuals. After the regressions in Tables 3 and 4, I created two new
variables, salhat and fsalhat. In the regress command, the default is the expected
value [note that Stata tells us (option xb assumed; fitted values)]. Therefore, salhat is
the expected salary generated by Table 3 and fsalhat is the expected salary generated
by Table 4.
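For instance, a short do-file sketch (the new variable names salfit and salres here are
hypothetical, not ones used elsewhere in this handout):

* After a regression, predict can save either fitted values or residuals.
reg sal male
predict salfit, xb           // fitted (expected) values, the default
predict salres, residuals    // residuals: actual sal minus the fitted value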
The table command allows us to calculate means and other statistics for variable Y for
each value of variable X. In the following command, for instance, X = male and
Y = sal, salhat, and fsalhat. cont stands for contents, m for mean, and sd for standard
deviation.
Note again that the expected salaries for men (or women) are the same in both
regressions and that they equal the sample mean. Note too that the standard
deviation of sal is $13,725 for women and $18,747 for men, but that the standard
deviation of salhat is 0 for both men and women. That is, actual salaries of
women vary around the mean salary of $34,222.80 but every woman has the
same expected salary according to this model: $34,222.80.
TABLE 6. table male, cont(m sal m salhat m fsal sd sal sd salhat)

---------------------------------------------------------------------------
1= male,  |
0=female  |  mean(sal)  mean(salhat)  mean(fsal~t)    sd(sal)   sd(salhat)
----------+----------------------------------------------------------------
   female |    34222.8       34222.8       34222.8   13725.72            0
     male |   46999.44      46999.44      46999.44   18747.15            0
---------------------------------------------------------------------------

16. Adding both male and female to the same equation would not accomplish
anything, since neither variable provides any information not already given by the other
(once we know that a person is a woman, we know she is not a man). We already know
the mean salaries of both men and women from either of the previous regressions: what
would putting both variables in the model accomplish?
Including an independent variable whose value can be exactly predicted from
knowledge of the values of other independent variables is called perfect
multicollinearity.
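If you try it anyway, Stata protects you from the perfect multicollinearity. A sketch,
assuming female has already been generated as above:

* Stata automatically omits one of the two perfectly collinear dummies
* and notes the omission in the output.
reg sal male female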

DICHOTOMOUS DEPENDENT VARIABLES


1. Many times the variable that we are trying to explain or predict is not interval level,
but dichotomous. For instance, with the federal personnel data, we may want to explain
why an employee did or did not get a promotion or did or did not leave the federal
service.
2. Next semester, we will learn much better techniques for working with dummy
dependent variables, but linear regression analysis is also an option for dichotomous
variables. We call it the linear probability model (LPM). In Table 1, the dependent
variable is whether the employee was a supervisor or manager; those who are, are coded
1 and everyone without supervisory authority is coded 0. The independent variable is
male, coded 1 for men and 0 for women. As in the other regression analyses with
dummy independent variables, the y-intercept represents the expected value of the
dependent variable for the reference group (in this case, the expected value of supmgr
is .1070 for women) and the regression coefficient is the expected difference in the
values of the dependent variable between the named group and the reference group; in this
case, the expected value of supmgr is .0873 higher for men.
Table 1. OLS Bivariate Regression
. reg supmgr male
      Source |       SS       df       MS              Number of obs =    1122
-------------+------------------------------           F(  1,  1120) =   16.95
       Model |   2.1399287     1   2.1399287           Prob > F      =  0.0000
    Residual |  141.404635  1120  .126254138           R-squared     =  0.0149
-------------+------------------------------           Adj R-squared =  0.0140
       Total |  143.544563  1121  .128050458           Root MSE      =  .35532

------------------------------------------------------------------------------
      supmgr |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |    .087344    .0212157      4.12   0.000     .0457171    .1289709
       _cons |   .1069519    .0150017      7.13   0.000     .0775172    .1363865
------------------------------------------------------------------------------

3. As Table 2 shows, the expected values from the regression analysis are actually
probabilities. The probability that a woman was a supervisor was 10.70 percent (.1070),
and the probability that a man was a supervisor was 19.43 percent, 8.73 percentage
points higher.
Table 2. Crosstabs, Probability of Being a Supervisor by Gender
. tab supmgr male, col
supervisor |  sex dummy (1=male,
 /manager, |      0=female)
     0=no) |         0          1 |     Total
-----------+----------------------+-----------
         0 |       501        452 |       953
           |     89.30      80.57 |     84.94
-----------+----------------------+-----------
         1 |        60        109 |       169
           |     10.70      19.43 |     15.06
-----------+----------------------+-----------

4. In fact, it may be better to convert supmgr from a (0,1) dummy variable to a (0,100)
variable. Since this is a linear transformation of the dependent variable, it does
not meaningfully change the regression. The y-intercept and regression coefficient (and
many of the other numbers in the table) are multiplied by 100. Many others don't
change.
The constant now represents the percentage of the reference group (women) who
have supervisory authority (10.70%), and the regression coefficient shows that
men are 8.73 percentage points more likely than women to have supervisory
authority.

Notice that when I describe a difference of percentages based on subtraction, I
talk about a percentage-point rather than a percent difference.
The distinction matters because men are nearly twice as likely as women to be supervisors.
Since (19.43/10.70)*100 = 182, men are actually 82 percent more likely than
women to be supervisors. That is, we use percent difference when we calculate
the difference by division and percentage-point difference when we calculate the
difference by subtraction.
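Both kinds of differences can be computed directly; a minimal sketch with display:

* Percentage-point difference (subtraction) vs. percent difference (division):
display 19.43 - 10.70            // 8.73 percentage points
display (19.43/10.70 - 1)*100    // roughly 82 percent more likely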
Table 3. Table 1 revised
. gen SupMgr = supmgr * 100
. reg SupMgr male
      Source |       SS       df       MS              Number of obs =    1122
-------------+------------------------------           F(  1,  1120) =   16.95
       Model |   21399.287     1   21399.287           Prob > F      =  0.0000
    Residual |  1414046.35  1120  1262.54138           R-squared     =  0.0149
-------------+------------------------------           Adj R-squared =  0.0140
       Total |  1435445.63  1121  1280.50458           Root MSE      =  35.532

------------------------------------------------------------------------------
      SupMgr |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   8.734403    2.121565      4.12   0.000     4.571713    12.89709
       _cons |   10.69519    1.500173      7.13   0.000     7.751721    13.63865
------------------------------------------------------------------------------

. di (19.43/10.70)*100
181.58879

5. With an interval-level independent variable (and a dummy dependent variable), the


expected values are on a line and each expected value is the probability that a person at
a particular grade is a supervisor.
The y-intercept is the predicted probability of being a supervisor at grade 0. The
value of -21.5 doesn't make any sense, both because you cannot have negative
probabilities and because there is no grade zero.
The regression coefficient indicates that the probability of being a supervisor
rises 3.69 percentage points with each one-level rise in grade.


Table 4. reg SupMgr grade


      Source |       SS       df       MS              Number of obs =    1122
-------------+------------------------------           F(  1,  1120) =  128.24
       Model |  147474.816     1  147474.816           Prob > F      =  0.0000
    Residual |  1287970.82  1120  1149.97394           R-squared     =  0.1027
-------------+------------------------------           Adj R-squared =  0.1019
       Total |  1435445.63  1121  1280.50458           Root MSE      =  33.911

------------------------------------------------------------------------------
      SupMgr |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   3.693635    .3261663     11.32   0.000      3.05367    4.333601
       _cons |  -21.51843    3.385197     -6.36   0.000    -28.16047   -14.87639
------------------------------------------------------------------------------

6. The graph makes clear two problems with the linear probability model. First,
because the dependent variable has only two possible values, a scatterplot doesn't show
you much; this also means that a line will never fit the data all that well. Second, the
linear probability model will sometimes yield predicted probabilities less than 0 or
greater than 100%, neither of which makes any sense.
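The negative predictions are easy to see by plugging low grades into the Table 4
equation; a sketch with the rounded coefficients:

* Predicted "probabilities" (in percent) from SupMgr-hat = -21.5 + 3.69 grade:
display -21.5 + 3.69*3      // about -10.4: an impossible negative probability
display -21.5 + 3.69*15     // about 33.9: a sensible value at a high grade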
. predict gradeprob
(option xb assumed; fitted values)
. twoway (scatter SupMgr grade) (connect gradeprob grade), xlabel(3(3)15)

Figure 1. Predicted probabilities by grade
[Scatterplot of SupMgr (0 or 100) and fitted values against grade in 2001]

MULTIPLE REGRESSION ANALYSIS


1. Multiple regression analysis has an interval-level (or sometimes dummy) dependent
variable and two or more independent variables, which can be either dichotomous or
interval level. If an effect has multiple causes, multiple regression allows us to predict
values of Y more accurately than bivariate regression does. Multiple regression also
helps isolate the direct effect of a single independent variable on the dependent variable,
once the effects of the other independent variables are controlled.
2. The equation for a multiple regression is similar to that of a linear regression,
although it no longer describes a two-dimensional line:
Y-hat = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk
Y-hat is the expected value of Y,
X1 through Xk are the independent variables,
b0 is the y-intercept, and
b1 through bk are the regression coefficients (also called partial slope
coefficients).
3. To determine the expected value of Y, insert the actual values of X1 through Xk into
the equation, multiply, and add.
The y-intercept is still the expected value of the dependent variable when all of
the independent variables equal zero (though the y-intercept will seldom have a
practical meaning; it usually does not make sense for all independent variables to
equal zero).
The partial slope coefficients show the expected change in Y from a one-unit
increase in Xj, holding all the other Xs constant. That is, it is the change in the
expected value of Y when Xj changes but none of the other independent variables
do. It usually does not matter at what values you hold the other variables
constant.
Multiple Regression Type I:
Several Dummy Variables Representing One Real Independent Variable
4. With a nominal level independent variable with more than two values (like race), we
must first recode it into several dichotomous variables.
We cannot directly use a nominal-level variable as an independent variable,
because its coefficient would show the expected change in Y from a one-unit
increase in Xj (holding all the other Xs constant), but it doesn't make sense to
talk about a one-unit increase in Xj: values of a nominal-level variable are just
names; they do not show order (which values are higher or lower).

We include one less dichotomous variable in the regression than the number of
values in the original variable. The group that does not have a dichotomous
variable of its own is the reference group.
In Tables 1 and 2, for instance, sal is the dependent variable (Y) and the
independent variable (race) has been converted into five dichotomous variables
(asian, black, hispanic, amerind, and white). For example, asian=1 for
Asians and asian=0 for everyone else.
(1)  sal-hat = 35,624 - 3,244 asian - 7,777 black - 5,220 hispanic - 11,109 amerind

(2)  sal-hat = 32,380 - 4,533 black - 1,976 hispanic - 7,865 amerind + 3,244 white

. use "E:\statDATA\opm\opm91.dta", clear    /* notice that I have changed datasets */
. gen asian = (race==1)                     // notice the different use of single and double
. gen black = (race==2)                     // equal signs
. gen hispanic = (race==3)
. gen amerind = (race==4)
. gen white = (race==5)

TABLE 1. reg sal asian black hispanic amerind

      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  4,  3479) =   44.11
       Model |  3.5396e+10     4  8.8491e+09           Prob > F      =  0.0000
    Residual |  6.9799e+11  3479   200628771           R-squared     =  0.0483
-------------+------------------------------           Adj R-squared =  0.0472
       Total |  7.3338e+11  3483   210560961           Root MSE      =   14164

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       asian |   -3244.08    1423.083    -2.280   0.023    -6034.242   -453.9175
       black |  -7777.392    636.0846   -12.227   0.000    -9024.529   -6530.255
    hispanic |  -5220.197    1250.381    -4.175   0.000    -7671.752   -2768.643
     amerind |   -11109.1    2314.523    -4.800   0.000    -15647.06   -6571.143
       _cons |   35624.42    278.0532   128.121   0.000     35079.26    36169.58
------------------------------------------------------------------------------

TABLE 2. reg sal black hispanic amerind white

      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  4,  3479) =   44.11
       Model |  3.5396e+10     4  8.8491e+09           Prob > F      =  0.0000
    Residual |  6.9799e+11  3479   200628771           R-squared     =  0.0483
-------------+------------------------------           Adj R-squared =  0.0472
       Total |  7.3338e+11  3483   210560961           Root MSE      =   14164

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       black |  -4533.312    1508.357    -3.005   0.003    -7490.667   -1575.957
    hispanic |  -1976.118    1853.103    -1.066   0.286    -5609.397    1657.162
     amerind |  -7865.024    2688.412    -2.926   0.003    -13136.05   -2594.001
       white |    3244.08    1423.083     2.280   0.023     453.9175    6034.242
       _cons |   32380.34    1395.655    23.201   0.000     29643.95    35116.72
------------------------------------------------------------------------------


5. The y-intercept (b0) is the expected value of Y when all Xs equal 0. In multiple
regression where all the Xs are dummy variables measuring a single real independent
variable (like race), the y-intercept represents the mean value of Y for the reference
group (the group that has the value 0 for all independent variables).
In Table 1, the variable white has been left out of the equation, so the reference
group is nonHispanic whites. The y-intercept ($35,624) is the mean salary of
nonHispanic whites in the sample.
In Table 2, the variable asian has been left out of the equation, so the y-intercept
($32,380) is the mean salary of Asians (the reference group).
6. Each of the regression coefficients (b1 through bk) is the expected change in Y from a
one-unit increase in Xj, holding constant all other Xs. In multiple regression where all
the Xs are measuring a single real independent variable, the regression coefficients
represent the difference in mean values of Y between the group represented by the
dichotomous variable (X1 through Xk) and the reference group.
7. Remember that there is only one possible one-unit increase in any X in this equation,
from 0 to 1. Remember also that all the Xs in this equation are mutually exclusive; that
is, because of the way the variables were created, a person could not be both black and
Hispanic. Thus, only one X at a time can equal 1. For one X to change from 0 to 1 while
all other Xs keep their same values, all Xs must equal 0 before the change, which means
that we are changing from the reference group to another group.
8. Each regression coefficient is an expected difference in the value of the dependent
variable between the named group and the reference group, so you must know what the
reference group is. Any of the values of the original variable can be chosen as the
reference group and the regression equation will still give the same expected values of Y
for each of the groups. However, the y-intercept and all of the regression coefficients
will change.
In Table 1, with whites as the reference group, the coefficient on asian (-3,244) is
the mean salary of Asians minus the mean salary of whites. That is, the mean
salary of Asians is $3,244 less than the mean salary of white nonHispanics. The
coefficient on hispanic (-5,220) is the mean salary of Hispanics minus the mean
salary of nonHispanic whites. In other words, the mean salary of Hispanics is
$5,220 less than the mean salary of nonHispanic whites. The mean salary of
Asians is $32,380 (35624 - 3244), and the mean salary of Hispanics is $30,404
(35624 - 5220).
In Table 2, asian is the reference group. The constant ($32,380) is the mean
salary of Asians; the coefficient on white (+3244) again shows that the mean salary
of whites is $3,244 higher than the mean salary of Asians; and the coefficient on
hispanic (-1976) shows that Hispanics' mean salary is $1,976 lower than Asians'
mean salary. The coefficient on white is the white nonHispanic mean minus the
Asian mean (35624 - 32380 = 3244) and the coefficient on hispanic is the
Hispanic mean minus the Asian mean (30404 - 32380 = -1976).
9. If we included as many dummy variables in the regression equation as there were
values in the original variable, we would have perfect multicollinearity (like putting
both male and female in the same regression), because the value of the last dummy
variable can be predicted perfectly from the values of the other variables. (For instance,
if asian=0, black=0, hispanic=0, and amerind=0, then we know that white=1.) If
we included all five dummy variables, there would be no reference group, so the
y-intercept would have no meaning. We would have six constants (numbers) to represent
five sample means, one more than we need.
10. These two regression equations are not the equations for lines. Each generates only
five points, and both generate the same five points (the five expected salaries of the five
race/ethnic groups).
11. When the number of expected values generated by a regression equation exactly
equals the number of constants in the equation (the y-intercept plus the regression
coefficients), the expected values of the dependent variable are sample means. (This can
only happen when all of the independent variables are dummy variables.)
In Tables 1 and 2, there are five constants (the y-intercept plus four regression
coefficients) and five expected values. Table 3 confirms these values really are
the group means.
TABLE 3. table race, contents(mean sal count sal)

----------------+-----------------------
           race |  mean(sal)     N(sal)
----------------+-----------------------
          asian |   32380.34        103
          black |   27847.03        613
       hispanic |   30404.22        135
american indian |   24515.32         38
          white |   35624.42      2,595
----------------+-----------------------
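As a check, the group means in Table 3 can be recovered from the constants of
equation (1); a sketch with display:

* Expected salaries from equation (1) equal the group means in Table 3:
display 35624 - 3244     // 32380: Asians
display 35624 - 7777     // 27847: blacks
display 35624 - 5220     // 30404: Hispanics
display 35624 - 11109    // 24515: American Indians
display 35624            // 35624: whites (the reference group)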

12. Notice that R2, F*, SSR, SSE, and indeed all the statistics above the coefficients are
identical in Tables 1 and 2, further indicating that the two regressions are the same.
They give us identical information but in different forms.
Multiple Regression Type II:
One Dichotomous and One Interval-level Independent Variable
13. The basic formula for a multiple regression equation with one dichotomous
independent variable and one interval-level independent variable is:
Y-hat = b0 + b1Dummy + b2X2

14. In Table 4, for instance, the dependent variable is sal and there are two
independent variables, one dichotomous (male) and one interval level (edyrs).
(4)  SAL-hat = -14,367 + 7983 male + 3076 edyrs

TABLE 4. reg sal male edyrs


      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  2,  3481) = 1033.48
       Model |  2.7323e+11     2  1.3662e+11           Prob > F      =  0.0000
    Residual |  4.6015e+11  3481   132189887           R-squared     =  0.3726
-------------+------------------------------           Adj R-squared =  0.3722
       Total |  7.3338e+11  3483   210560961           Root MSE      =   11497

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   7983.294    415.2969    19.223   0.000     7169.044    8797.544
       edyrs |   3075.956    96.11011    32.004   0.000     2887.518    3264.394
       _cons |  -14367.21    1335.065   -10.761   0.000     -16984.8   -11749.62
------------------------------------------------------------------------------

. predict mesalhat
(option xb assumed; fitted values)

15. This equation actually gives the formulas for two lines (not one), one for the men
and one for the women.
For the women, male = 0, so the equation simplifies to:
(4a) SAL-hat = -14,367 + 3076 edyrs
For the men, male = 1, so the equation simplifies to:
(4b) SAL-hat = (-14,367 + 7983) + 3076 edyrs = -6384 + 3076 edyrs
The two lines have the same slope (expected salary rises by $3,076 for each
additional year of education) but different y-intercepts (-14,367 for women, and
-6384 for men).
16. The coefficient on the dummy variable is sometimes called the intercept-shift
coefficient, since it shifts the y-intercept but does not affect the slope. In this case, it
means that a man's expected salary is $7,983 higher than the expected salary of a
woman with the same number of years of education.
A woman with 12 years of education has an expected salary of $22,545 [-14367 +
3076(12)], while the expected salary of men is $7,983 higher [-14367 + 7983 +
3076(12) = $30,528].
A woman with 16 years of education has an expected salary of $34,849 [-14367 +
3076(16)], while the expected salary of men is $7,983 higher [-14367 + 7983 +
3076(16) = $42,832].

When we hold edyrs constant statistically (at any level of education) and
compare equally educated men and women, the sex difference in expected salary
is $7,983.
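The expected salaries in these examples can be reproduced with display; a minimal
sketch using the rounded Table 4 coefficients:

* SAL-hat = -14367 + 7983 male + 3076 edyrs:
display -14367 + 7983*0 + 3076*12    // 22545: woman with 12 years of education
display -14367 + 7983*1 + 3076*12    // 30528: man with 12 years of education
display -14367 + 7983*0 + 3076*16    // 34849: woman with 16 years of education
display -14367 + 7983*1 + 3076*16    // 42832: man with 16 years of education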


17. Graph 1 shows expected salaries generated from Table 4. Note the two parallel lines,
the top for the men, the bottom for the women.


Graph 1. twoway (scatter mesalhat edyrs if male==0, msymbol(triangle) mcolor(pink)
msize(medlarge)) (scatter mesalhat edyrs if male==1, msymbol(square) mcolor(blue)
msize(medlarge)), ytitle(Expected Salary)

18. More generally, in equations of this type, the y-intercept (b0) is the y-intercept of the
regression line for the reference group. The coefficient on the interval-level variable
(b2) is the slope of both lines. The coefficient on the dummy variable (b1) is the vertical
distance between the two lines, the expected difference in the value of the dependent
variable between the named group and the reference group when both have the same
value of the interval-level variable.
For the reference group, the equation simplifies to:
Y-hat = b0 + b1(Dummy=0) + b2X2 = b0 + b2X2
For the named group, the equation simplifies to:
Y-hat = b0 + b1(Dummy=1) + b2X2 = (b0 + b1) + b2X2


19. Notice that the coefficient on male is different in Table 4 ($7,983) than in a
bivariate regression ($12,582 in Table 5). That is because the coefficients have different
meanings. In Table 5, the male coefficient is the difference between the mean salaries
of men and women. In Table 4, with edyrs held constant, the male coefficient is the
expected difference in salary between men and women with the same level of education.
(5)  SAL-hat = $27,423 + $12,582 male

TABLE 5. reg sal male


      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  1,  3482) =  805.85
       Model |  1.3783e+11     1  1.3783e+11           Prob > F      =  0.0000
    Residual |  5.9555e+11  3482   171037759           R-squared     =  0.1879
-------------+------------------------------           Adj R-squared =  0.1877
       Total |  7.3338e+11  3483   210560961           Root MSE      =   13078

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |    12581.9    443.2203    28.387   0.000      11712.9    13450.89
       _cons |   27422.93    316.4478    86.659   0.000     26802.48    28043.37
------------------------------------------------------------------------------

19. Notice also that the edyrs coefficient differs between Table 4 ($3,076) and a
bivariate regression ($3,715 in Table 6). Again, the coefficients have different meanings.
Table 6 tells us that a person with an additional year of education is expected to earn
$3,715 more than a person without that extra year of education. Table 4 tells us that a
person with an additional year of education is expected to earn $3,076 more than a
person of the same sex without that extra year of education.
(6)  SAL-hat = -19,469 + 3715 edyrs

TABLE 6. reg sal edyrs


      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  1,  3482) = 1534.97
       Model |  2.2438e+11     1  2.2438e+11           Prob > F      =  0.0000
    Residual |  5.0900e+11  3482   146180591           R-squared     =  0.3060
-------------+------------------------------           Adj R-squared =  0.3058
       Total |  7.3338e+11  3483   210560961           Root MSE      =   12091

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   3715.173    94.82633    39.179   0.000     3529.253    3901.094
       _cons |  -19469.25    1375.916   -14.150   0.000    -22166.94   -16771.57
------------------------------------------------------------------------------


Multiple Regression Type III:


Two Dichotomous Independent Variables (Different Concepts)
20. In Table 7, salary is the dependent variable and male and minority are two
independent variables measuring separate concepts. In Table 1, with asian, black,
hispanic, and amerind as the independent variables, a single observation could have
the value 1 on only one of the independent variables. Here, however, both male and
minority can have the value 1 at the same time.
(7)  SAL-hat = 28,946 + 11,863 male - 4531 minority

TABLE 7. reg sal male minority


      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  2,  3481) =  451.17
       Model |  1.5097e+11     2  7.5486e+10           Prob > F      =  0.0000
    Residual |  5.8241e+11  3481   167311399           R-squared     =  0.2059
-------------+------------------------------           Adj R-squared =  0.2054
       Total |  7.3338e+11  3483   210560961           Root MSE      =   12935

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   11862.86    445.8096    26.610   0.000     10988.79    12736.93
    minority |  -4530.764    511.2052    -8.863   0.000    -5533.057   -3528.472
       _cons |   28945.56    357.0325    81.073   0.000     28245.55    29645.57
------------------------------------------------------------------------------

. predict mmsalhat
(option xb assumed; fitted values)

21. This regression does not generate a line but only four points: the expected salaries
for minority men, minority women, non-minority men, and non-minority women.
(Thus, the expected salaries will not be sample means, because there are four expected
salaries but only three constants, the y-intercept plus the two regression coefficients.)
The means are shown in Table 7A and in the final column below.
                                            Expected         Mean
SAL-hat(MM) = 28946 + 11863 - 4531 =         $36,278      [$34,194]
SAL-hat(MF) = 28946 +     0 - 4531 =         $24,415      [$25,558]
SAL-hat(NM) = 28946 + 11863 -    0 =         $40,809      [$41,258]
SAL-hat(NF) = 28946 +     0 -    0 =         $28,946      [$28,367]
TABLE 7A. table male minority, c(m sal m mmsal)

-----------------------------------
          |        minority
     male |         0           1
----------+------------------------
    women |  28366.74    25558.32
          |  28945.56     24414.8
          |
      men |  41257.69    34193.91
          |  40808.42    36277.66
-----------------------------------


22. The Y-intercept is the expected salary of the reference group: non-minority women
(the group that has the value 0 for both male and minority).
23. The coefficient on male is the expected difference in salary between men and
women of the same minority status. Minority men have an expected salary
$11,863 higher than minority women's (36,278 - 24,415), and non-minority men
have an expected salary $11,863 higher than non-minority women's (40,809 - 28,946).
The coefficient on minority shows an expected difference in salary of $4,531
between minorities and non-minorities of the same sex. The difference in the
expected salaries between minority and non-minority men (36,278 - 40,809)
equals the difference in the expected salaries between minority and non-minority
women (24,415 - 28,946): minorities make $4,531 less than non-minorities of the
same sex.

24. The expected values are not sample means because the structure of the model forces
the difference in expected salaries to be the same between minority men and women
and between non-minority men and women. It also forces the difference in expected
salaries to be the same between minority and non-minority men and between minority
and non-minority women.
In the sample, the difference in mean salaries between minority men and
women ($34,194 - 25,558 = 8,636) is much smaller than the difference in mean
salaries between non-minority men and women ($41,258 - 28,367 = 12,891).
25. Notice again that the coefficients on both male and minority are different in
bivariate than in multiple regression, because bivariate and multiple regression
coefficients have different meanings.
The male coefficient in bivariate regression is the difference in mean salaries
between men and women ($12,582). The male coefficient in this multiple
regression (11,863) is the difference in expected salaries between men and
women of the same minority status.
The minority coefficient in bivariate regression shows that the mean salary of
minorities is $7,006 less than the mean salary of non-minorities (Table 8). The
minority coefficient in this multiple regression (Table 7) shows that the
expected salary of minorities is $4,531 less than the expected salary of
non-minorities of the same sex.
(8)  SAL-hat = 35624 - 7006 minority


TABLE 8. reg sal minority

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
    minority |  -7006.248    551.3495   -12.707   0.000    -8087.248   -5925.247
       _cons |   35624.42    278.5087   127.911   0.000     35078.36    36170.48
------------------------------------------------------------------------------

Multiple Regression Type IV:


Two Continuous Independent Variables
26. When both independent variables are continuous, the regression equation does not
generate a line but a plane. Take the following two examples:
(9)   SAL-hat = -31103 + 770 yos + 3798 edyrs

(10)  SAL-hat = 22010 + 707 yos + 52 age

TABLE 9. reg sal edyrs yos


      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  2,  3481) = 1846.51
       Model |  3.7753e+11     2  1.8876e+11           Prob > F      =  0.0000
    Residual |  3.5585e+11  3481   102227661           R-squared     =  0.5148
-------------+------------------------------           Adj R-squared =  0.5145
       Total |  7.3338e+11  3483   210560961           Root MSE      =   10111

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   3798.288    79.32818    47.881   0.000     3642.754    3953.823
         yos |   769.8755    19.89076    38.705   0.000     730.8767    808.8742
       _cons |  -31103.07     1189.23   -26.154   0.000    -33434.73   -28771.41
------------------------------------------------------------------------------

TABLE 10. reg sal yos age

      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  2,  3481) =  424.66
       Model |  1.4384e+11     2  7.1920e+10           Prob > F      =  0.0000
    Residual |  5.8954e+11  3481   169360573           R-squared     =  0.1961
-------------+------------------------------           Adj R-squared =  0.1957
       Total |  7.3338e+11  3483   210560961           Root MSE      =   13014

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         yos |   707.2954    31.55209    22.417   0.000     645.4329    769.1578
         age |   51.95367     26.0536     1.994   0.046     .8717993    103.0355
       _cons |   22010.35    962.1955    22.875   0.000     20123.83    23896.87
------------------------------------------------------------------------------

27. In Table 9, a college graduate (edyrs=16) with 10 years of service has an expected
salary of $37,365. A college graduate with 11 years of service has an expected salary of
$38,135, an increase of $770. Someone with 17 years of education and 11 years of
service has an expected salary of $41,933, an increase of $3,798.


Thus, the yos coefficient represents the expected change in salary from an
additional year of service, holding level of education constant.
The edyrs coefficient represents the expected change in salary from an
additional year of education, holding level of seniority constant.
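These predictions are easy to verify by hand. As a quick check (a sketch only, using Stata's display command as a calculator and the rounded coefficients from Table 9):
. display -31103 + 770*10 + 3798*16
37365
. display -31103 + 770*11 + 3798*16
38135
. display -31103 + 770*11 + 3798*17
41933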
28. In Table 10, a 40-year-old with 10 years of service has an expected salary of $31,160.
A 40-year-old with 11 years of service has an expected salary of $31,867, an increase of
$707. A 41-year-old with 11 years of service has an expected salary of $31,919, an
increase of $52.
Thus, the yos coefficient represents the expected change in salary from an
additional year of service, holding age constant.
The age coefficient represents the expected change in salary from an additional
year of age, holding level of seniority constant.
29. The y-intercept (-31103) in Table 9 is the expected salary of an employee with zero
years of service and zero years of education. (Such an employee would need to pay
mightily for the pleasure of serving the government.) The y-intercept (22010) in Table
10 is the expected salary of a newborn baby with zero years of service. More practically,
both y-intercepts are simply points the regression planes pass through.
30. In Table 9, the yos coefficient represents the expected change in salary from an
additional year of service, holding level of education constant. Thus, we can imagine a
separate regression line at each level of education.
If edyrs=12,  sal-hat = (-31103 + 45576 (= 12*3798)) + 770 yos = 14,473 + 770 yos
If edyrs=13,  sal-hat = 18,271 + 770 yos
If edyrs=14,  sal-hat = 22,069 + 770 yos

Each additional year of education would raise the y-intercept by $3798, but the
slope would remain 770.
31. On the other hand, the edyrs coefficient represents the expected change in salary
from an additional year of education, holding level of seniority constant. Thus, we can
imagine a separate regression line at each level of seniority. With one year of service,
the regression line becomes:
sal-hat = (-31103 + 770) + 3798 edyrs = -30333 + 3798 edyrs

Each additional year of service would raise the y-intercept by $770, but the slope
would remain $3,798.


32. If each independent variable were truly continuous, we would have one regression
line at 12.01 years of education, and another at 12.02 years of education, and another at
12.03 years of education. There would be infinitely many regression lines with yos as
the independent variable, each with a slope of $770. At the same time, there would be
infinitely many regression lines with edyrs as the independent variable, each with a
slope of $3,798. These lines essentially cross-hatch to create a plane. Through any given
point on that plane, two regression lines pass: one holding years of education constant
and the other holding years of service constant.
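One way to see the two slopes at any point on the plane is to pull them from the stored
coefficients after estimation. A minimal sketch, assuming the same data and variable
names as Table 9 (planehat is just an illustrative name for the fitted values):
. quietly regress sal edyrs yos
. predict planehat, xb
* moving along the plane with edyrs fixed, planehat changes by _b[yos] per year of service
. display _b[yos]
* moving along the plane with yos fixed, planehat changes by _b[edyrs] per year of education
. display _b[edyrs]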
Multiple Regression Type V:
Combining All Variable Types
33. We can also combine all types of independent variables into a single regression
model of almost any level of complexity. Table 13 includes almost all the independent
variables we have used in this chapter, with the exception of minority, which would
cause a problem of perfect multicollinearity with the asian, black, hispanic, and
amerind variables.
34. All coefficients in this equation represent the expected change in salary for a one-unit
increase in the independent variable, holding all of the other independent variables
constant. There are three interval-level independent variables (years of education, years
of federal service, and age) and five dummy variables (male, asian, black, hispanic,
and amerind). Four of those dummy variables (asian, black, hispanic, and amerind)
represent a single concept (race or ethnicity) and have a single reference group
(non-minorities). The dummy variable male represents gender by itself, with women as the
reference group.
The coefficient on male shows that men's expected salary is $6,034 higher than
the expected salary of women of the same race or ethnicity who have the same
years of education, the same years of federal service, and the same age.
The coefficient on asian shows that Asians' expected salary is $2,086 less than the
expected salary of non-minorities of the same sex who have the same years of
education, the same years of federal service, and the same age.
The coefficient on edyrs shows that, holding federal experience, age, sex, and
race/ethnicity constant, expected salary rises by $3,233 with each additional year
of education.
The coefficient on yos shows that, holding education, age, sex, and race/ethnicity
constant, expected salary rises by $731 with each additional year of federal
service.


The y-intercept, in theory, represents the expected salary of a newborn, female
non-minority with no education and no federal experience. Because this makes
no sense, the y-intercept merely represents a point that the regression
hyperplane passes through.
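Predictions from this model work the same way as before: plug a profile into the
equation. A quick illustration with display, using the Table 13 coefficients (the profile
here, a 40-year-old non-minority man with 16 years of education and 10 years of federal
service, is just an example):
. display -24435.23 + 3232.53*16 + 730.5837*10 - 10.80357*40 + 6034.097
* roughly $40,193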
TABLE 13. reg sal edyrs yos age male asian black hispanic amerind

      Source |         SS       df          MS          Number of obs =    3484
  -----------+------------------------------           F(  8,  3475) =  559.60
       Model | 4.1289e+11        8  5.1611e+10          Prob > F      =  0.0000
    Residual | 3.2050e+11     3475  92229060.6          R-squared     =  0.5630
  -----------+------------------------------           Adj R-squared =  0.5620
       Total | 7.3338e+11     3483   210560961          Root MSE      =  9603.6

         sal |      Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
  -----------+------------------------------------------------------------------
       edyrs |    3232.53    80.98997   39.913    0.000      3073.737    3391.323
         yos |   730.5837    23.41642   31.200    0.000      684.6723     776.495
         age |  -10.80357     19.4365   -0.556    0.578     -48.91168    27.30455
        male |   6034.097    355.1608   16.990    0.000      5337.752    6730.442
       asian |  -2086.068    967.2924   -2.157    0.031     -3982.586   -189.5492
       black |  -2615.829     445.444   -5.872    0.000     -3489.188   -1742.471
    hispanic |  -992.3598    851.6376   -1.165    0.244      -2662.12    677.4007
     amerind |  -6272.744    1572.172   -3.990    0.000     -9355.218    -3190.27
       _cons |  -24435.23    1358.659  -17.985    0.000     -27099.08   -21771.38

Summary
35. In multiple regression analysis, we use two or more dichotomous or interval-level
independent variables to explain or predict a typically interval-level dependent variable.
Multiple regression gives us more accurate estimates of the impact of the independent
variables by subtracting out the effects of the other independent variables.
36. The y-intercept is still the expected value of the dependent variable when all of the
independent variables equal zero, though the y-intercept will seldom have a practical
meaning, because it usually does not make sense for all independent variables to equal
zero.
When all of the independent variables are dummy variables, whether they all
represent a single concept or a variety of concepts, the y-intercept represents the
expected value of the dependent variable for the grand reference group. The
grand reference group would be the group that has the value 0 on all of the
dummy variables.
When one or more of the independent variables are interval level, the y-intercept
will have a practical meaning only when it makes sense for all of the interval-level
independent variables to have the value 0 at the same time. Otherwise, the
y-intercept is of little interest for anything other than prediction purposes.


37. The regression coefficients show the expected change in Y from a one-unit increase
in Xj, holding all the other Xs constant. That is, Xj changes but none of the other
independent variables do. It usually does not matter at what values you hold the other
variables constant.
When the independent variable is interval level, we generally interpret its
coefficient as an expected change in the dependent variable from a one-unit
increase in the independent variable, holding all the other independent variables
constant.
When the independent variable is a dummy variable, we generally interpret its
coefficient as the expected difference in the dependent variable between the
named group and the reference group, holding all of the other independent
variables constant.
Remember that when you have a set of dummy variables representing a single
concept, the coefficient on any one of those dummy variables represents an
expected difference between that group and the reference group: the other
dummy variables in the set are held constant at 0, while the remaining
independent variables are merely held constant at whatever values they take.
Thus, in Table 13, the coefficient on black is the difference in expected
salary between blacks and whites of the same sex with the same levels of
education, federal experience, and age, and not the expected difference
between blacks and comparable nonblacks.
MULTIPLE REGRESSION WITH DICHOTOMOUS DEPENDENT
VARIABLES
1. We can also use multiple regression with dummy dependent variables, but we run
into greater danger of predicted probabilities above 100% or below 0% (which don't
make sense). Linear regression with dummy dependent variables works best if most
predicted probabilities are between 20% and 80%; in that case, the linear probability
model works almost as well as the more appropriate logit and probit models. That
means that this is not a very good example, since the highest probability is only 41%,
15% of the expected values are negative, and half of the expected values are below 17%.
In other words, don't take the following interpretations all that seriously:
In Table 1, the dependent variable SupMgr is coded 100 for supervisors and
managers and 0 for everyone else. Grade has the greatest impact on the
probability of being a supervisor: as grade rises by one standard deviation
(holding the other variables in the model constant), the probability of being a
supervisor rises by .32 standard deviation. As grade rises by one, the probability
of being a supervisor rises by 3.67 percentage points.


Federal experience is the second most important predictor. Holding grade and
the other variables constant, as federal service rises by one year, the probability of
being a supervisor rises by 0.40 percentage point.
The other independent variables have much smaller impacts (and their
coefficients are not statistically significant). Men are 3.5 percentage points more
likely than comparably experienced women in the same grade to be supervisors,
but blacks, Latinos, and American Indians are all more likely to be supervisors
than comparable whites (by 2.2, 7.6, and 7.0 percentage points, respectively).
The probability of being a supervisor is expected to fall by 0.7 percentage point
with an additional year of education (holding grade and experience constant).
Table 1. OLS Multiple Regression

. gen black = bf+bm
. gen latino = hf + hm
. gen asian = af+am
. gen indian = naf + nam

. reg SupMgr male black latino asian indian grade yos edyrs, beta

      Source |         SS       df          MS          Number of obs =    1120
  -----------+------------------------------           F(  8,  1111) =   18.84
       Model | 171424.409        8  21428.0511          Prob > F      =  0.0000
    Residual | 1263566.66     1111  1137.32373          R-squared     =  0.1195
  -----------+------------------------------           Adj R-squared =  0.1131
       Total | 1434991.07     1119  1282.38702          Root MSE      =  33.724

      SupMgr |      Coef.   Std. Err.        t    P>|t|        Beta
  -----------+-------------------------------------------------------
        male |   3.548425    2.175325     1.63    0.103    .0495666
       black |    2.19585    2.637708     0.83    0.405    .0253976
      latino |   7.565312    4.479238     1.69    0.092    .0486969
       asian |  -.7832162     4.94979    -0.16    0.874   -.0045188
      indian |   7.033295    6.926935     1.02    0.310     .029027
       grade |   3.673106    .4238377     8.67    0.000    .3186431
         yos |   .3984984    .1193163     3.34    0.001    .1012499
       edyrs |    -.72297    .5713085    -1.27    0.206   -.0444971
       _cons |   -20.5427    7.562459    -2.72    0.007           .

. predict multprob
(option xb assumed; fitted values)
(2 missing values generated)
. gen multprobneg = multprob<0 if multprob<.
(2 missing values generated)


. sum multprob, detail

                          Fitted values
    -------------------------------------------------------------
          Percentiles      Smallest
     1%    -10.29248      -14.20026
     5%    -5.999979      -13.33042
    10%    -2.804629      -12.96324       Obs              1120
    25%     4.942193      -12.62322       Sum of Wgt.      1120
    50%     17.33166                      Mean         15.08929
                            Largest       Std. Dev.    12.37717
    75%     24.76756       40.88074
    90%     29.89252       41.23924       Variance     153.1943
    95%     33.06671       41.73301       Skewness     -.249739
    99%      38.6378       41.93656       Kurtosis     2.114017

. tab multprobneg

    multprobneg |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        954       85.18       85.18
              1 |        166       14.82      100.00
    ------------+-----------------------------------
          Total |      1,120      100.00

2. The following example works much better. I'm using a November 2009 survey
conducted by the Pew Research Center. The dependent variable is coded 100 if the
respondent said that torture of suspected terrorists was sometimes or often justified, 0 if
the respondent said that it was rarely or never justified. Party identification is coded 1
("Democrat") to 5 ("Republican"), with independents coded 2 or 4 if they lean toward one
party and 3 if they are pure independents. Conservatism is coded from 1 ("Very liberal")
to 5 ("Very conservative").
. d party5 conserv5 xborn catholic jewish othrelig norelig xattend ///
    edyrs agex male black latino asian othmin

              storage  display    value
variable name   type   format     label     variable label
------------------------------------------------------------------------------
party5          byte   %16.0g     party5    Party identification (1-5)
conserv5        byte   %17.0g     conserv5  Conservatism (1-5)
xborn           byte   %9.0g      YesNo     Born-again or evangelical Christian
catholic        byte   %9.0g      YesNo     Catholic
jewish          byte   %9.0g      YesNo     Jewish
othrelig        byte   %9.0g      YesNo     Other religion
norelig         byte   %9.0g      YesNo     No religious affiliation
xattend         byte   %9.0g      YesNo     Attends church nearly every week
edyrs           byte   %9.0g                Education (years)
agex            byte   %9.0g                Age
male            byte   %9.0g      malel     Male
black           byte   %9.0g      YesNo     African American
latino          byte   %9.0g      YesNo     Latino
asian           byte   %9.0g      YesNo     Asian American
othmin          byte   %9.0g      YesNo     Other/Mixed


. sum party5 conserv5 xborn catholic jewish othrelig norelig xattend ///
    edyrs agex male black latino asian othmin

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+-------------------------------------------------------
      party5 |    1921    2.918792    1.674461          1          5
    conserv5 |    1865    3.275603    1.003305          1          5
       xborn |    2000        .326    .4688645          0          1
    catholic |    1925    .2290909    .4203571          0          1
      jewish |    1925    .0202597    .1409241          0          1
-------------+-------------------------------------------------------
    othrelig |    1925     .025974    .1590991          0          1
     norelig |    1925    .1490909    .3562706          0          1
     xattend |    2000        .403    .4906234          0          1
       edyrs |    1955    14.08542    2.612929          6         18
        agex |    1950    52.78051    17.82933         18         97
-------------+-------------------------------------------------------
        male |    2000        .477    .4995956          0          1
       black |    1942     .092173    .2893445          0          1
      latino |    1942    .0803296    .2718727          0          1
       asian |    1942    .0200824    .1403183          0          1
      othmin |    1942    .0298661    .1702617          0          1

3. Although several of the independent variables have reasonably strong effects and
many coefficients are statistically significant, all but four of the predicted probabilities
are between 0 and 100, and about 90% are between 20% and 80%. Most of the
following interpretations will be very close to what we would get from logit or probit.
Ideology and party identification have the strongest impacts, based on the beta
weights. Holding the other variables constant, each one-point rise on the
five-point conservatism scale raises the expected probability of saying that torture is
at least sometimes justified by 10.1 percentage points, and each one-point rise on
the five-point Republicanism scale raises that probability by 4.8 percentage
points.
Holding the other variables constant, those who attend religious services weekly
are 10.1 percentage points less likely to say torture is at least sometimes justified.
Holding religious attendance constant, evangelical Protestants, Catholics, and
Jews are (insignificantly) more likely to approve of torture than mainline
Protestants (the reference group for the religion dummy variables). Those who
belong to other religions and those with no religious affiliation are 20 and 15
percentage points less likely than mainline Protestants to agree (though note that
the non-religious are extremely unlikely to be attending church weekly, so they
are only 5 points less likely than mainline Protestants who are only occasional
church-goers).
Support for torture drops with both education (by 1.3 percentage points per year)
and age (by 0.2 percentage point per year), but neither effect is terribly strong.
Men are 6 percentage points more likely than comparable women to think torture
can be justified. Racial differences are small: compared to whites who are the
same on all the other independent variables, Asians and blacks are 7 and 2
percentage points more likely to think torture can sometimes be justified, and
Latinos and other minorities are 2.5 and 5.5 points less likely to agree,
respectively.
. reg Torture party5 conserv5 xborn catholic jewish othrelig norelig ///
    xattend edyrs agex male black latino asian othmin, beta

      Source |         SS       df          MS          Number of obs =     892
  -----------+------------------------------           F( 15,   876) =    9.42
       Model |  308184.48       15   20545.632          Prob > F      =  0.0000
    Residual | 1911041.98      876  2181.55477          R-squared     =  0.1389
  -----------+------------------------------           Adj R-squared =  0.1241
       Total | 2219226.46      891  2490.71432          Root MSE      =  46.707

     Torture |      Coef.   Std. Err.        t    P>|t|        Beta
  -----------+-------------------------------------------------------
      party5 |   4.818727    1.136174     4.24    0.000    .1623514
    conserv5 |   10.13744    1.876484     5.40    0.000    .2019606
       xborn |   2.183997    3.983522     0.55    0.584    .0206524
    catholic |   4.066398    4.191228     0.97    0.332    .0344157
      jewish |   12.72483    10.79395     1.18    0.239    .0386801
    othrelig |  -20.18361    9.888169    -2.04    0.042   -.0693275
     norelig |  -15.33975    5.255957    -2.92    0.004   -.1085136
     xattend |  -10.10115    3.554534    -2.84    0.005   -.0998145
       edyrs |  -1.299308    .6548084    -1.98    0.048   -.0659095
        agex |  -.2001906    .0953088    -2.10    0.036   -.0696003
        male |   6.084354    3.178386     1.91    0.056    .0609835
       black |   2.210531    6.160191     0.36    0.720    .0122986
      latino |  -2.475507    6.492728    -0.38    0.703   -.0128081
       asian |   6.998053    11.33263     0.62    0.537    .0207715
      othmin |  -5.483082    9.127359    -0.60    0.548    -.019168
       _cons |   37.10918    13.08889     2.84    0.005           .

. predict tprob
(option xb assumed; fitted values)
(223 missing values generated)

. sum tprob, detail

                          Fitted values
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     8.543769      -.0747188
     5%     19.98055      -.0747188
    10%     27.68358       .5258531       Obs              1777
    25%     41.31963       1.326616       Sum of Wgt.      1777
    50%     55.60023                      Mean         53.70753
                            Largest       Std. Dev.    18.64381
    75%      67.7653       97.24033
    90%     76.42638       98.96882       Variance     347.5916
    95%     80.83438       100.5629       Skewness    -.3460393
    99%      89.7094       102.8457       Kurtosis     2.658592


MEASURING THE STRENGTH OF RELATIONSHIPS


1. We measure the strength of relationships in regression analysis in many different
ways.
Regression Coefficients
2. The regression coefficient (b) tells you that as x increases by one unit, y-hat changes
by b units. In general, the larger the regression coefficient, the stronger the
relationship, at least when we are working with comparable independent variables.
In Table 1, a one-year increase in education raises expected salary by $3715. In
Table 2, a one-year increase in federal experience raises expected salary by $744.
Thus, an additional year of education raises expected salary more than an
additional year of experience.
In Table 3, the mean salary of men is $12,582 higher than the mean salary of
women. In Table 4, the mean salary of whites is $7,006 higher than the mean
salary of others. Thus, gender appears to affect salary more than race.
TABLE 1. reg sal edyrs, beta

      Source |         SS       df          MS          Number of obs =    3484
  -----------+------------------------------           F(  1,  3482) = 1534.97
       Model | 2.2438e+11        1  2.2438e+11          Prob > F      =  0.0000
    Residual | 5.0900e+11     3482   146180591          R-squared     =  0.3060
  -----------+------------------------------           Adj R-squared =  0.3058
       Total | 7.3338e+11     3483   210560961          Root MSE      =   12091

         sal |      Coef.   Std. Err.        t    P>|t|        Beta
  -----------+-------------------------------------------------------
       edyrs |   3715.173    94.82633    39.18    0.000    .5531327
       _cons |  -19469.25    1375.916   -14.15    0.000           .

TABLE 2. reg sal yos, beta

      Source |         SS       df          MS          Number of obs =    3484
  -----------+------------------------------           F(  1,  3482) =  844.61
       Model | 1.4317e+11        1  1.4317e+11          Prob > F      =  0.0000
    Residual | 5.9022e+11     3482   169505344          R-squared     =  0.1952
  -----------+------------------------------           Adj R-squared =  0.1950
       Total | 7.3338e+11     3483   210560961          Root MSE      =   13019

         sal |      Coef.   Std. Err.        t    P>|t|        Beta
  -----------+-------------------------------------------------------
         yos |   744.0948    25.60352    29.06    0.000    .4418294
       _cons |   23745.03    411.3754    57.72    0.000           .


TABLE 3. reg sal male, beta

      Source |         SS       df          MS          Number of obs =    3484
  -----------+------------------------------           F(  1,  3482) =  805.85
       Model | 1.3783e+11        1  1.3783e+11          Prob > F      =  0.0000
    Residual | 5.9555e+11     3482   171037759          R-squared     =  0.1879
  -----------+------------------------------           Adj R-squared =  0.1877
       Total | 7.3338e+11     3483   210560961          Root MSE      =   13078

         sal |      Coef.   Std. Err.        t    P>|t|        Beta
  -----------+-------------------------------------------------------
        male |    12581.9    443.2203    28.39    0.000    .4335176
       _cons |   27422.93    316.4478    86.66    0.000           .

TABLE 4. reg sal white, beta

      Source |         SS       df          MS          Number of obs =    3484
  -----------+------------------------------           F(  1,  3482) =  161.48
       Model | 3.2504e+10        1  3.2504e+10          Prob > F      =  0.0000
    Residual | 7.0088e+11     3482   201286672          R-squared     =  0.0443
  -----------+------------------------------           Adj R-squared =  0.0440
       Total | 7.3338e+11     3483   210560961          Root MSE      =   14188

         sal |      Coef.   Std. Err.        t    P>|t|        Beta
  -----------+-------------------------------------------------------
       white |   7006.248    551.3495    12.71    0.000    .2105234
       _cons |   28618.17    475.8353    60.14    0.000           .

3. The problem is that regression coefficients are expressed in the units of measurement
of the X variables, and those may be completely non-comparable.
How do you compare the expected impact of an additional year of education to
the expected impact of being male rather than female? The sex variable has only
two possible values, while level of education has a much wider range. Being male
rather than female raises expected salary more than three, but less than four,
additional years of education.
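A quick check of that comparison with display, using the coefficients from Tables 1 and 3:
. display 3*3715.173
11145.519
. display 4*3715.173
14860.692
* the male coefficient, $12,582, falls between these two amounts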
4. Also, units of measurement that appear comparable may be less similar than they
appear.
The vast majority of federal employees have between 12 and 16 years of
education, but their years of federal experience range from 0 to over 30. The
standard deviation of federal experience is four times as large as the standard
deviation of education. Although one additional year of education is clearly more
profitable than one additional year of federal experience, federal experience
might have nearly as much impact on the distribution of salaries in the federal
service, because it varies more.


Linear Transformations of Variables


5. A linear transformation of a variable takes the form: z1 = a + b*z. That is, you
transform an existing variable into a new variable by adding or subtracting a constant
and/or multiplying or dividing by another constant.
6. A linear transformation of the independent and/or dependent variable(s) will change
the y-intercept and/or regression coefficient(s), but will not meaningfully change the
regression.
In Table 2A, I multiply years of federal service by 12 to create months of public
service. The coefficient on mos is 1/12th the coefficient on yos in Table 2,
because a month is only 1/12th as large as a year. Neither the y-intercept nor any
of the material in the top half of the table changes.
In Table 2B, I divide sal by 1,000 to measure salary in thousands of dollars. The
y-intercept, the regression coefficients, and their standard errors are only
1/1000th as large as in Table 2A.
In Table 1A, I measure education as years beyond high school by subtracting 12
from years of education: a college graduate (edyrs = 16) has hsedyrs = 4; a
high school dropout with 10 years of education has hsedyrs = -2. The y-intercept
is much larger than in Table 1, because it now represents the expected
salary of someone with 12 years of education instead of no education, but the
regression coefficient, the expected impact of another year of education, does not
change.
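These claims can be verified directly from the coefficients in Tables 1, 2, 2A, and 1A below; for example:
. display 744.0948/12
62.0079
* the mos coefficient in Table 2A is the yos coefficient from Table 2 divided by 12
. display -19469.25 + 12*3715.173
25112.826
* the y-intercept in Table 1A is the Table 1 prediction for someone with exactly 12 years of education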
. gen mos = yos*12

TABLE 2A. reg sal mos, beta

      Source |         SS       df          MS          Number of obs =    3484
  -----------+------------------------------           F(  1,  3482) =  844.61
       Model | 1.4317e+11        1  1.4317e+11          Prob > F      =  0.0000
    Residual | 5.9022e+11     3482   169505344          R-squared     =  0.1952
  -----------+------------------------------           Adj R-squared =  0.1950
       Total | 7.3338e+11     3483   210560961          Root MSE      =   13019

         sal |      Coef.   Std. Err.        t    P>|t|        Beta
  -----------+-------------------------------------------------------
         mos |    62.0079    2.133627    29.06    0.000    .4418294
       _cons |   23745.03    411.3754    57.72    0.000           .

. gen salthou = sal/1000

TABLE 2B. reg salthou mos, beta

      Source |         SS       df          MS          Number of obs =    3484
  -----------+------------------------------           F(  1,  3482) =  844.61
       Model | 143166.219        1  143166.219          Prob > F      =  0.0000
    Residual | 590217.612     3482  169.505345          R-squared     =  0.1952
  -----------+------------------------------           Adj R-squared =  0.1950
       Total | 733383.831     3483  210.560962          Root MSE      =  13.019

     salthou |      Coef.   Std. Err.        t    P>|t|        Beta
  -----------+-------------------------------------------------------
         mos |   .0620079    .0021336    29.06    0.000    .4418294
       _cons |   23.74503    .4113754    57.72    0.000           .

. gen hsedyrs = edyrs-12

TABLE 1a. reg sal hsedyrs, beta

      Source |         SS       df          MS          Number of obs =    3484
  -----------+------------------------------           F(  1,  3482) = 1534.97
       Model | 2.2438e+11        1  2.2438e+11          Prob > F      =  0.0000
    Residual | 5.0900e+11     3482   146180591          R-squared     =  0.3060
  -----------+------------------------------           Adj R-squared =  0.3058
       Total | 7.3338e+11     3483   210560961          Root MSE      =   12091

         sal |      Coef.   Std. Err.        t    P>|t|        Beta
  -----------+-------------------------------------------------------
     hsedyrs |   3715.173    94.82633    39.18    0.000    .5531327
       _cons |   25112.83    302.5534    83.00    0.000           .

Correlation Coefficients
6. The mathematical formulas for calculating regression and correlation coefficients are
very strongly related.
b1 = Σ(Xi-Xbar)(Yi-Ybar) / Σ(Xi-Xbar)²
   = [Σ(Xi-Xbar)(Yi-Ybar)/(N-1)] / [Σ(Xi-Xbar)²/(N-1)]
   = Cov(X,Y) / s²x

r = Σ(Xi-Xbar)(Yi-Ybar) / √[Σ(Xi-Xbar)² Σ(Yi-Ybar)²]
  = [Σ(Xi-Xbar)(Yi-Ybar)/(N-1)] / {√[Σ(Xi-Xbar)² Σ(Yi-Ybar)²]/(N-1)}
  = Cov(X,Y) / (sx sy)

7. The numerator for both is the covariance of X and Y. The sign of both coefficients
depends only on whether the covariance of X and Y is positive or negative, as variances
and standard deviations are always positive.
If above-average values of X tend to be associated with above-average values of Y
(and if below-average values of X tend to be associated with below-average values
of Y), the covariance is positive.
If above-average values of X tend to be associated with below-average values of Y
(and if below-average values of X tend to be associated with above-average values
of Y), the covariance is negative.
8. The regression coefficient (b1) will tend to be higher, (1) the higher the correlation
between X and Y, and (2) the larger the variation in Y relative to the variation in X.


9. The correlation coefficient (r) and the slope or regression coefficient (b1) give us the
same information in bivariate regression, but in different forms. In fact, b1 can be
calculated from r and vice versa.
b1 = r [sy/sx]
r = b1 [sx/sy]
In Table 5, the correlation between edyrs and sal is 0.5531, the standard
deviation of sal is $14510.72, and the standard deviation of edyrs is 2.160425. In
Table 1, the coefficient on edyrs is 3715.173.
b1 = r [sy/sx] = .5531*(14510.72/2.160425)= (.5531*14510.72)/2.160425 = $3715
r = b1 [sx/sy] = 3715.173*(2.160425/14510.72) = (3715.173*2.160425)/14510.72 =
.5531
The correlation coefficient tells us that as edyrs increases by one standard
deviation, sal-hat increases by .5531 standard deviations. A one standard
deviation increase in edyrs (2.160425 years) raises expected sal by .5531
standard deviation (.5531*14510.72= $8,025.88). To calculate the slope, we can
divide the change in sal-hat ($8,025.88) by the change in edyrs (2.160425) to
get $3715.
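The same conversions can be checked with display (the standard deviations come from Table 5):
. display .5531327*(14510.72/2.160425)
* approximately 3715, the edyrs coefficient in Table 1
. display 3715.173*(2.160425/14510.72)
* approximately .5531, the correlation between edyrs and sal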
10. The regression coefficient (b1) is expressed in the units of measurement of x and y.
It tells you that as x increases by one unit, y changes by b units (b can be positive or
negative). As noted above, when you change the units of measurement (from years to
months, for instance, or from dollars to cents), the slope will also change.
11. In calculating r, the deviations are standardized by dividing them by their standard
deviations (turning them into z-scores), so r is not affected by units of measurement.
Changing the units of measurement has no impact on the value of r. We can interpret
the correlation coefficient as meaning that, as x increases by one standard
deviation (of x), y changes by r standard deviations (of y).
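Another way to see that r is unit-free: turn both variables into z-scores and rerun the
regression; the slope of the standardized regression is r itself. A minimal sketch (zsal and
zedyrs are just illustrative names; egen's std() function creates the z-scores, and in
practice you would standardize over the same estimation sample):
. egen zsal = std(sal)
. egen zedyrs = std(edyrs)
. regress zsal zedyrs
* the slope is approximately .5531, the correlation between sal and edyrs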
TABLE 5. corr sal edyrs yos male white salthou mos hsedyrs, m
(obs=3484)

    Variable |       Mean    Std. Dev.         Min         Max
-------------+--------------------------------------------------
         sal |   33836.66     14510.72       12385       95300
       edyrs |   14.34816     2.160425           6          20
         yos |   13.56228      8.61619           1          46
        male |   .5097589     .4999765           0           1
       white |   .7448335     .4360173           0           1
     salthou |   33.83666     14.51072      12.385        95.3
         mos |   162.7474     103.3943          12         552
     hsedyrs |   2.348163     2.160425          -6           8

             |      sal    edyrs      yos     male    white  salthou      mos  hsedyrs
-------------+--------------------------------------------------------------------------
         sal |   1.0000
       edyrs |   0.5531   1.0000
         yos |   0.4418  -0.0271   1.0000
        male |   0.4335   0.3460   0.1035   1.0000
       white |   0.2105   0.1422   0.0718   0.1820   1.0000
     salthou |   1.0000   0.5531   0.4418   0.4335   0.2105   1.0000
         mos |   0.4418  -0.0271   1.0000   0.1035   0.0718   0.4418   1.0000
     hsedyrs |   0.5531   1.0000  -0.0271   0.3460   0.1422   0.5531  -0.0271   1.0000

The mean and standard deviation of mos are exactly 12 times as large as those of
yos, and mos and yos are perfectly correlated with each other (r=1.0000), and
both are correlated identically with sal and edyrs. Both correlation coefficients
with sal indicate that as federal experience (measured either as yos or mos)
increases by one standard deviation, expected salary increases .4418 standard
deviations.
The mean and standard deviation of salthou are exactly 1/1,000th of those of sal.
sal and salthou are perfectly correlated with each other and correlated identically
with all other variables. As federal experience (measured either as yos or mos)
increases by one standard deviation, expected salary (measured in dollars or
thousands of dollars) increases .4418 standard deviations.
Standardized Coefficients
12. The standardized coefficient or beta-weight is another unit-less measure of
the relationship between X and Y. It also shows the expected change in Y
(measured in standard deviations of Y) from a one-standard-deviation
increase in X. Changing the units of measurement for either X or Y will not change
the correlation coefficient or the standardized coefficient.
In Tables 2, 2A, and 2B, the regression coefficients change as federal experience
is measured in years and then in months, and salary is measured in dollars and
then in thousands of dollars, but the beta-weight remains .4418294. As federal
experience rises by one standard deviation, expected salary rises by .4418294
standard deviation.
13. In a bivariate regression, the correlation coefficient (r) and the standardized
coefficient (beta-weight) are identical.
The beta-weights in Tables 2, 2A, and 2B (.4418294) match the relevant
correlation coefficients (0.4418) in Table 5, except for rounding.
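The beta-weight can also be recovered by hand from the unstandardized coefficient, since
beta = b*(sx/sy). Using the Table 2 coefficient and the standard deviations reported in Table 5:
. display 744.0948*(8.61619/14510.72)
* approximately .4418, the beta-weight reported in Tables 2, 2A, and 2B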
14. The correlation coefficient and the standardized coefficient are measures of the
strength and direction of a relationship. In a bivariate regression, both fall between -1
and +1. The closer they are to 0, the weaker the relationship. The closer they are to
either -1 or +1, the stronger the relationship. The sign of the coefficient indicates
whether the relationship is negative or positive.
15. In multiple regression, standardized coefficients are even more useful, because they
allow us to compare the relative importance of the independent variables by comparing
the absolute size of their standardized coefficients. (The sign of the coefficient makes no
difference to its importance.)
Unfortunately, beta-weights do not work nearly as effectively when the model
uses more than one variable to capture the effect of a single concept: for instance,
sets of dummy variables, squared terms, and interaction terms.
The Coefficient of Determination
16. The coefficient of determination (R2) shows what percentage of the variation in
the dependent variable can be explained by the regression equation.
17. Understanding R2 requires a basic understanding of three types of variation in the
dependent variable. Total variation (called the total sum of squares (SST)) is the
total sum of squared deviations from the mean:
SST = Σ(yi - ybar)²
18. SST is the same in all regressions with the same dependent variable and the same
observations, because SST is a measure of the variation in the dependent variable and
does not depend on which independent variables are included in the regression
equation.
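Because summarize stores both pieces of this calculation, SST can be computed directly
from the returned results (a small sketch using r(Var) and r(N)):
. quietly summarize sal
. display r(Var)*(r(N)-1)
* approximately 7.3338e+11, the Total SS that appears in every regression of sal on this sample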
TABLE 6. reg sal grade

      Source |         SS       df          MS          Number of obs =    3484
  -----------+------------------------------           F(  1,  3482) =19197.11
       Model | 6.2078e+11        1  6.2078e+11          Prob > F      =  0.0000
    Residual | 1.1260e+11     3482  32337418.4          R-squared     =  0.8465
  -----------+------------------------------           Adj R-squared =  0.8464
       Total | 7.3338e+11     3483   210560961          Root MSE      =  5686.6

         sal |      Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
  -----------+------------------------------------------------------------------
       grade |   3973.691    28.67981  138.554    0.000      3917.461    4029.922
       _cons |  -2266.448    277.8113   -8.158    0.000     -2811.138   -1721.759

. predict salgrd if sal<.
(option xb assumed; fitted values)
. predict resgrd, res, if sal<.


TABLE 7. reg sal age

      Source |         SS       df          MS          Number of obs =    3484
  -----------+------------------------------           F(  1,  3482) =  303.14
       Model | 5.8734e+10        1  5.8734e+10          Prob > F      =  0.0000
    Residual | 6.7465e+11     3482   193753491          R-squared     =  0.0801
  -----------+------------------------------           Adj R-squared =  0.0798
       Total | 7.3338e+11     3483   210560961          Root MSE      =   13920

         sal |      Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
  -----------+------------------------------------------------------------------
         age |   393.5437    22.60332   17.411    0.000      349.2266    437.8608
       _cons |   16916.09    1000.042   16.915    0.000      14955.36    18876.82

. predict salage if sal<.
(option xb assumed; fitted values)
. predict resage, res, if sal<.

TABLE 7.5. sum sal salgrd resgrd salage resage

    Variable |   Obs        Mean    Std. Dev.         Min         Max
-------------+---------------------------------------------------------
         sal |  3484    33836.66    14510.72        12385       95300
      salgrd |  3484    33836.66    13350.39     5680.935    61312.62
      resgrd |  3484   -.0000153    5685.784    -10328.16    33987.38
      salage |  3484    33836.66     4106.47     24393.42    46825.41
      resage |  3484    5.03e-06    13917.54    -25919.88    61855.08

Whether grade or age is the independent variable, SST for sal is 7.3338*10^11.
This is also the variance of sal (14510.72² = 210,560,995) times n-1 (3483).
19. The following section relies on a random sample of 10 employees from opm94.dta.
Tables 8 and 9 present a list of the values and summary statistics. The mean salary is
$43,105, with a standard deviation of $15,476.81, implying a variance of
239,531,647.7761 (by squaring) and total variation of 2,155,784,829.9849 (by
multiplying times 9).
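A quick arithmetic check of those figures:
. display 15476.81^2
* approximately 239,531,648, the variance of salary in this 10-person sample
. display 9*15476.81^2
* approximately 2,155,784,830, the total variation (SST), which Stata reports as 2.1558e+09 in Tables 10 and 11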
TABLE 8. list sal grade age

             sal   grade   age
    1.     24857       6    38
    2.     27120       7    31
    3.     26595       7    39
    4.     38129      11    33
    5.     38129      11    35
    6.     47209      11    53
    7.     47081      12    37
    8.     52693      13    30
    9.     54746      13    46
   10.     74491      15    48

TABLE 9. sum sal grade age

    Variable |   Obs        Mean    Std. Dev.       Min       Max
-------------+-----------------------------------------------------
         sal |    10       43105    15476.81      24857     74491
       grade |    10        10.6    2.988868          6        15
         age |    10          39    7.659417         30        53


20. Graphs 1 and 2 show scatterplots of the individual salaries around their mean
($43,105), arranged by the grades and the ages of the employees, respectively. In both
graphs the variation around the mean is identical. (Think of the total variation as the
vertical distances between the points and the line representing the mean, squared and
summed.)
The horizontal line across the middle of the graphs is the mean salary ($43,105).
The vertical lines between the points and the line represent the deviations
between the observed values of sal and the mean value of sal. If we square these
deviations and add them up, they sum to SST.
Because these are the same 10 salaries in both graphs, the vertical differences
from the mean are identical in the two graphs. The vertical lines are simply
rearranged, because in Graph 1 they are arranged by people's grade levels and in
Graph 2 they are arranged by people's ages. There is a set amount of variation in
the salaries of these ten people, regardless of which variable we look at as the
independent variable.
This shows up as well in the identical Total Sum of Squares of 2,155,800,000 in
Tables 10 and 11.
TABLE 10. reg sal grade

      Source |         SS       df          MS          Number of obs =      10
  -----------+------------------------------           F(  1,     8) =   66.07
       Model | 1.9229e+09        1  1.9229e+09          Prob > F      =  0.0000
    Residual |  232842372        8  29105296.5          R-squared     =  0.8920
  -----------+------------------------------           Adj R-squared =  0.8785
       Total | 2.1558e+09        9   239531777          Root MSE      =  5394.9

         sal |      Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
  -----------+------------------------------------------------------------------
       grade |   4890.522    601.6695     8.13    0.000       3503.07    6277.975
       _cons |  -8734.537    6601.935    -1.32    0.222     -23958.63    6489.552

. predict gsalhat
. predict grdresid, res

TABLE 11. reg sal age

      Source |         SS       df          MS          Number of obs =      10
  -----------+------------------------------           F(  1,     8) =    2.53
       Model |  518168200        1   518168200          Prob > F      =  0.1503
    Residual | 1.6376e+09        8   204702224          R-squared     =  0.2404
  -----------+------------------------------           Adj R-squared =  0.1454
       Total | 2.1558e+09        9   239531777          Root MSE      =   14307

         sal |      Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
  -----------+------------------------------------------------------------------
         age |   990.6458    622.6505     1.59    0.150     -445.1887     2426.48
       _cons |   4469.813    24701.26     0.18    0.861     -52491.39    61431.02

. predict asalhat
. predict ageresid, res


Graph 1. Observed and Mean Salaries, by Grade
[Scatterplot of salary in dollars against grade; the horizontal line labeled meansal marks the mean salary of $43,105.]

Graph 2. Observed and Mean Salaries, by Age
[Scatterplot of salary in dollars against age; the horizontal line labeled meansal marks the mean salary of $43,105.]

21. Imagine that the only thing we knew about each person was that he or she was a member of a group whose mean
salary was $43,105. In that case, our best guess for each person would be $43,105, but
this guess would cause us to make very large errors in prediction.
To calculate the total error, we subtract our guess from the actual salary in each
case, square the differences, and sum them. Our answer equals SST.
22. Next, imagine that you had the regression results and knew the grade levels of each
of the people in the sample. Now your best guess would be to calculate the expected
value of salary for each person based on the regression equation.
These expected values are represented in Graphs 3 through 6 as the triangles on
the regression line immediately above or below the observed values of salary.
23. These expected values generate two new types of variation, which we call
explained and unexplained variation. Explained variation is the variation in the
dependent variable that is predicted by the regression equation. It is the variation
between the regression line and the mean.
In Graphs 3 and 4, the explained (or predicted) deviations are the vertical
distances between the expected values of salary (on the regression line) and
the mean salary. If we square those deviations and add them up, they sum to the
explained variation.
24. Unfortunately, explained variation has several different names, some in conflict. I
call it SSR (the Regression Sum of Squares), Wooldridge calls it SSE (Explained
Sum of Squares), and Stata calls it the Model Sum of Squares. My SSR (and
Wooldridge's SSE) is the sum of the squared deviations between the predicted (or
expected) values of Y and the mean value of Y:
SSR = Σ(Y-hati - Ybar)²
Notice that because y and y-hat both have the same mean, SSR is also the total
variation in the expected value of the dependent variable.
25. Unexplained variation is variation that is not predicted by the regression equation.
It is the variation around the regression line.
In Graphs 5 and 6, the unexplained deviations are the vertical differences
between the observed values of salary and the expected values of salary on
the regression line. If we square those deviations and add them up, they sum to
the unexplained variation.
26. Unfortunately, unexplained variation also has several different names, some in
conflict. I call it SSE (the Error Sum of Squares) but Wooldridge and Stata call it
SSR (the Residual Sum of Squares). It is the amount of variation in the dependent
variable that is not predicted (or explained) by the regression equation. My SSE (and
their SSR) is the sum of the squared errors in prediction: the sum of the squared
deviations between the individual (observed) values of Y and the predicted (or expected)
values of Y:
SSE = Σ(Yi - Y-hati)²
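With the fitted values and residuals saved after Table 10 (gsalhat and grdresid), both
sums can be computed directly. A sketch (ssr_i and sse_i are just illustrative names;
43105 is the mean salary in this sample):
. generate ssr_i = (gsalhat - 43105)^2
. quietly summarize ssr_i
. display r(sum)
* approximately 1.9229e+09, the Model SS in Table 10
. generate sse_i = grdresid^2
. quietly summarize sse_i
. display r(sum)
* approximately 232,842,372, the Residual SS in Table 10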
27. Changing the independent variable does not affect the total variation in the
dependent variable (SST), which remains the same as long as the dependent variable
and the observations are the same. The amount of explained and unexplained
variation depends on which independent variable we use, however. The better we can
predict the dependent variable based on the values of the independent variable(s), the
higher the explained variation and the lower the unexplained variation.
28. Assume that we want to predict salaries based on the regression results and the
individual grades and ages of the people in our sample.
TABLE 12. sum sal meansal gsal asal grdres ageres

    Variable |   Obs        Mean    Std. Dev.         Min         Max
-------------+---------------------------------------------------------
         sal |    10       43105    15476.81       24857       74491
     meansal |    10       43105           0       43105       43105
     gsalhat |    10       43105    14617.13      20608.6     64623.3
     asalhat |    10       43105     7587.77    34189.19    56974.04
    grdresid |    10   -.0000557    5086.391   -6932.209    9867.701
    ageresid |    10   -.0000183    13489.16   -17257.35    22470.19

The summarize output in Table 12 indicates that the means of sal and of both
expected salaries (gsalhat, from the grade regression, and asalhat, from the age
regression) are $43,105. Sal has a standard deviation of $15,477, a minimum of
$24,857, and a maximum of $74,491. The variance is 239,531,648 (= 15,476.81²).
The grade regression yields expected salaries as high as $64,623 and as low as
$20,609, a range of $44,000. The standard deviation of gsalhat is $14,617
(94% as high as that for sal), implying a variance of 213,660,490 (89% as high as
that for sal).
The age regression, on the other hand, yields expected salaries only as low as
$34,189 and as high as $56,974, a range of about $22,800. The standard
deviation of asalhat is $7588, only 49% as high as that for sal, and the variance
of 57,574,254 is only 24% as high as that for sal.
In short, we are predicting much less variation in salary based on age than we did
based on grade level.


Graph 3. Expected and Mean Salaries, by Grade
[Scatterplot of the fitted values from the grade regression against grade, together with the meansal line.]

Graph 4. Expected and Mean Salaries, by Age
[Scatterplot of the fitted values from the age regression against age, together with the meansal line.]

Graph 5. Observed and Expected Salaries, by Grade
[Scatterplot of salary in dollars against grade, together with the fitted values from the grade regression.]

Graph 6. Observed and Expected Salaries, by Age
[Scatterplot of salary in dollars against age, together with the fitted values from the age regression.]

On the other hand, the residuals represent our errors in prediction, the variation
around the regression line. With either grade or age as the independent
variable, the residual naturally has a mean of zero (except for rounding error),
but the residuals from the grade equation have a standard deviation of only
$5086, while those from the age equation have a standard deviation of $13,489.
There is much more variation around the age regression line. The variance of
the grade residuals is 25,871,373 (11% of the variance in sal), while that of the
age residuals is 181,957,438, about seven times as high and roughly 76% of the
variance in sal.
The variance of the expected values from the grade regression (213,660,490)
plus the variance of its residuals (25,871,373) is 239,531,863, which differs from
the variance of sal (239,531,648) only due to rounding error. Likewise, the
variances of the expected values and residuals from the age equation also add up
to the variance of sal. Regression divides the total variation in the dependent
variable into a portion that can be explained by the independent variable(s) and a
portion that is not explained by the regression.
Going back to the graphs, the vertical distances between the regression line (the
expected values) and the mean are much larger in Graph 3 (based on grade)
than in Graph 4 (based on age).
Our errors in guessing salary are much higher when we base our guesses on age
than when we base them on grade. Graph 6 shows much bigger deviations
between the observed values of salary and the expected values based on age
than the deviations between the observed and expected values based on
grade in Graph 5. That is, the observed values are much further from the
regression line in Graph 6 than in Graph 5. Thus, the unexplained variation
in salary is much higher for age than for grade.
30. The explained variation in Y (SSR) plus the unexplained variation in Y (SSE)
equals the total variation in Y (SST):
SSR + SSE = SST
31. The proportion of the total variation that can be explained by the independent
variable is called the coefficient of determination or R2 (R-squared):
R2 = SSR/SST
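Using the sums of squares reported in Tables 10 and 11:
. display 1.9229e+09/2.1558e+09
* approximately .8920, the R-squared of the grade regression (Table 10)
. display 518168200/2.1558e+09
* approximately .2404, the R-squared of the age regression (Table 11)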
32. If the explanation were perfect (that is, if we could predict people's salaries perfectly
based on their grade levels), then all the observed salaries would be on the regression
line. There would be no unexplained variation between the regression line and the
observed values, and the explained variation between the regression line and the mean
would equal the total variation between the observed values and the mean. SSR would
equal SST, so SSR/SST=R2 would equal 1, which is the maximum possible value of R2.


33. If knowing a person's age were of no value at all in predicting a person's salary, then
we would still use the mean salary as our best guess for each person's salary even if we
knew how old they were. In that case, the expected value would not change as X (age)
increased (so b would be 0), the predicted values would equal the mean, and the
regression equation would simplify to:
Y-hat = a + 0*X = a = Ybar
The regression line would simply be the horizontal line representing the mean salary.
There would be no variation between the regression line and the mean, so there would
be no explained variation. SSR would equal 0, and SSR/SST=R2 would equal 0, its
minimum possible value.
34. If the independent variable has any impact on the dependent variable, so that
expected values vary at least a little, then SSR will be greater than 0 but almost always
less than SST, and R2 will have a value between 0 and 1.
35. R-squared is a measure of association. It represents the proportion of the
variation in the dependent variable that is explained by the independent
variables. R2 is always a positive number and therefore (unlike the correlation
coefficient) does not show the direction of the relationship. The higher the value of R2,
the stronger the relationship between X and Y.
36. R2 is also a proportionate reduction in error (PRE) measure. If we had to
predict individual salaries without knowing either grade or age, our best guesses would
be the mean salary. We could measure our error in prediction as the sum of squared
deviations of observed values from the mean (SST). Once we know the value of grade,
we can give more accurate predictions by guessing the expected values of salary
calculated from the regression equation. Our measure of prediction error would then be
the sum of squared errors between the observed values and the regression line (SSE).
We would have reduced our error by SST - SSE (= SSR), which as a proportion of the
original error is (SST-SSE)/SST = SSR/SST = R2.
37. Because SSR is the variation in y-hat, R2 = SSR/SST also shows the variation in
y-hat as a proportion of the variation in y. If R2 = .25, for instance, there is 25% as much
variation in the expected value of y as there is in y.


Correlation Coefficients and Coefficients of Determination


38. In a bivariate regression, the coefficient of determination (R2) is also the
square of the correlation coefficient (r). R2 is always positive and always has a
lower absolute value than r (unless r=0 or r=1 or r=-1). The higher the value of R2, the
stronger the linear relationship between X and Y. R2 does not show the direction of the
relationship, but it gives a more realistic picture of the strength of linear relationships
than r does, because R2 has an intuitive meaning (the proportion of the variation in Y
that is explained by X).
39. The coefficient of determination is also the square of the correlation between
y and y-hat. I repeat Table 1, then use the predict command to generate the expected
values of sal, based on level of education. These expected values of salary are perfectly
correlated with edyrs, because they are simply a linear transformation of edyrs. The
correlation between sal and edsalhat therefore equals the correlation between sal and
edyrs; their squared correlations are therefore equal as well.
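For instance, squaring the correlation between sal and edyrs reproduces the R-squared in Table 1:
. display .5531327^2
* approximately .3060, the R-squared reported in Table 1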
TABLE 1 (repeated). reg sal edyrs, beta

      Source |         SS       df          MS          Number of obs =    3484
  -----------+------------------------------           F(  1,  3482) = 1534.97
       Model | 2.2438e+11        1  2.2438e+11          Prob > F      =  0.0000
    Residual | 5.0900e+11     3482   146180591          R-squared     =  0.3060
  -----------+------------------------------           Adj R-squared =  0.3058
       Total | 7.3338e+11     3483   210560961          Root MSE      =   12091

         sal |      Coef.   Std. Err.        t    P>|t|        Beta
  -----------+-------------------------------------------------------
       edyrs |   3715.173    94.82633    39.18    0.000    .5531327
       _cons |  -19469.25    1375.916   -14.15    0.000           .

. predict edsalhat
(option xb assumed; fitted values)

. corr sal ed*, m
(obs=3484)

    Variable |       Mean    Std. Dev.         Min         Max
-------------+--------------------------------------------------
         sal |   33836.66     14510.72       12385       95300
       edyrs |   14.34816     2.160425           6          20
    edsalhat |   33836.66     8026.353    2821.788    54834.21

             |      sal    edyrs  edsalhat
-------------+------------------------------
         sal |   1.0000
       edyrs |   0.5531   1.0000
    edsalhat |   0.5531   1.0000    1.0000

40. The finding that the coefficient of determination equals the square of the
correlation between y and y-hat holds for multiple as well as bivariate regression.
Strength in Multiple Regression

41. In multiple regression, R2 is no longer the square of a single bivariate correlation
coefficient, but it still indicates the proportion of the variation in the dependent variable
that can be explained by the set of independent variables. R2 = SSR/SST = explained
variation divided by total variation.
Total variation is a measure of dispersion in the dependent variable and does not
vary from regression to regression, as long as the dependent variable and the
sample size remain the same.
Explained variation is the amount of variation in the dependent variable that is
predicted by guessing the value of the dependent variable to be its expected value
in the regression model.
42. You cannot simply sum the coefficients of determination from several bivariate
regressions in order to calculate the coefficient of determination for multiple regression,
because some of the variation in the dependent variable can be explained by more than
one of the independent variables.
In Table C, education explains 30.6% of the variation in salary by itself. In Table
B, sex alone explains 18.8% of the variation in salary. Yet together, in Table A,
they explain only 37.3% (much less than the sum of 30.6 + 18.8).
TABLE A. reg sal male edyrs

      Source |         SS       df          MS          Number of obs =    3484
  -----------+------------------------------           F(  2,  3481) = 1033.48
       Model | 2.7323e+11        2  1.3662e+11          Prob > F      =  0.0000
    Residual | 4.6015e+11     3481   132189887          R-squared     =  0.3726
  -----------+------------------------------           Adj R-squared =  0.3722
       Total | 7.3338e+11     3483   210560961          Root MSE      =   11497

         sal |      Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
  -----------+------------------------------------------------------------------
        male |   7983.294    415.2969   19.223    0.000      7169.044    8797.544
       edyrs |   3075.956    96.11011   32.004    0.000      2887.518    3264.394
       _cons |  -14367.21    1335.065  -10.761    0.000      -16984.8   -11749.62

TABLE B. reg sal male

      Source |         SS       df          MS          Number of obs =    3484
  -----------+------------------------------           F(  1,  3482) =  805.85
       Model | 1.3783e+11        1  1.3783e+11          Prob > F      =  0.0000
    Residual | 5.9555e+11     3482   171037759          R-squared     =  0.1879
  -----------+------------------------------           Adj R-squared =  0.1877
       Total | 7.3338e+11     3483   210560961          Root MSE      =   13078

         sal |      Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
  -----------+------------------------------------------------------------------
        male |    12581.9    443.2203   28.387    0.000       11712.9    13450.89
       _cons |   27422.93    316.4478   86.659    0.000      26802.48    28043.37

. predict salmres, res
(27 missing values generated)
TABLE C. reg sal edyrs

      Source |         SS       df          MS          Number of obs =    3484
  -----------+------------------------------           F(  1,  3482) = 1534.97
       Model | 2.2438e+11        1  2.2438e+11          Prob > F      =  0.0000
    Residual | 5.0900e+11     3482   146180591          R-squared     =  0.3060
  -----------+------------------------------           Adj R-squared =  0.3058
       Total | 7.3338e+11     3483   210560961          Root MSE      =   12091

         sal |      Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
  -----------+------------------------------------------------------------------
       edyrs |   3715.173    94.82633   39.179    0.000      3529.253    3901.094
       _cons |  -19469.25    1375.916  -14.150    0.000     -22166.94   -16771.57

. predict saleres, res
(27 missing values generated)

43. The reason is that edyrs and male are fairly strongly positively correlated (gender
explains 12% of the variation in education, Table D). That is, men tend to have more
education than women. Some of the reason that men make more than women is
because they are more educated. Some of the reason that more educated workers earn
more than less educated workers is because they are more likely to be men. In multiple
regression, the coefficient on edyrs shows the effect of an additional year of education
holding sex constant, and the coefficient on male shows the expected difference in salary
between men and women with the same amount of education, but there is additional
variation in salary that doesn't fit into either of these two categories. We call this jointly
explained variation.
TABLE D. reg edyrs male if sal<.

      Source |          SS      df           MS
       Model |  1946.00556       1   1946.00556
    Residual |  14310.6727    3482   4.10990025
       Total |  16256.6782    3483   4.66743561

      Number of obs = 3484    F(1, 3482) = 473.49     Prob > F = 0.0000
      R-squared = 0.1197      Adj R-squared = 0.1195  Root MSE = 2.0273

       edyrs |      Coef.   Std. Err.        t    P>|t|    [95% Conf. Interval]
        male |   1.495016    .0687052   21.760    0.000     1.360309    1.629722
       _cons |   13.58607    .0490537  276.963    0.000     13.48989    13.68224

. predict edmres, res

44. In the following Ballantine, each circle represents the variation for one of the
variables. The overlaps show that each variable shares some variation with each of the
other variables, that is, that each variable partly explains the variation in each of the
other variables.

[Ballantine (Venn) diagram: three overlapping circles representing the variation in sal, edyrs, and male.]

45. We can determine how much of the variation in the dependent variable is "uniquely"
explained by one independent variable in a multiple regression by subtracting the
coefficient of determination for an equation excluding that independent variable from
the coefficient of determination for the same equation including that independent
variable.
To calculate how much of the variation in salary is uniquely explained by each
variable in a regression including edyrs and male, subtract the R2 for each
bivariate regression from the R2 for the multiple regression. The multiple R2 is
.373, the R2 for male alone is .188, and the R2 for edyrs alone is .306.
The multiple R2 (.373) minus the R2 for edyrs without male (.306) shows that
male uniquely explains 6.7% of the variation in salary in the multiple regression.
The multiple R2 (.373) minus the R2 for male without edyrs (.188) shows that
edyrs uniquely explains 18.5% of the variation in salary in the multiple
regression.
The explained variation that is not uniquely explained by edyrs or male
(.373 - .067 - .185 = .121) is jointly explained by both variables.
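The same subtractions can be scripted rather than done by hand. A sketch of a do-file (assuming the data used for Tables A-C are in memory; r2_both, r2_edyrs, and r2_male are scratch scalar names used only here, and e(r2) is the R-squared that reg saves):

quietly reg sal male edyrs
scalar r2_both = e(r2)
quietly reg sal edyrs
scalar r2_edyrs = e(r2)
quietly reg sal male
scalar r2_male = e(r2)
display "male uniquely explains  " r2_both - r2_edyrs
display "edyrs uniquely explains " r2_both - r2_male
display "jointly explained       " r2_edyrs + r2_male - r2_both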
46. The amount of variation in the dependent variable that can be uniquely explained by
a particular independent variable depends on what other variables are in the model.
Edyrs uniquely explains 20.0% (.563 - .363) of the variation in
salary in Table 12, substantially less than the 30.6% that edyrs explains by itself
in Table C.
TABLE 12. reg sal male edyrs yos age asian black hispanic amerind, beta

      Source |          SS      df           MS
       Model |  4.1289e+11       8   5.1611e+10
    Residual |  3.2050e+11    3475   92229060.6
       Total |  7.3338e+11    3483    210560961

      Number of obs = 3484    F(8, 3475) = 559.60     Prob > F = 0.0000
      R-squared = 0.5630      Adj R-squared = 0.5620  Root MSE = 9603.6

         sal |      Coef.   Std. Err.        t    P>|t|        Beta
        male |   6034.097    355.1608    16.99    0.000    .2079088
       edyrs |    3232.53    80.98997    39.91    0.000    .4812745
         yos |   730.5837    23.41642    31.20    0.000    .4338067
         age |  -10.80357     19.4365    -0.56    0.578   -.0077688
       asian |  -2086.068    967.2924    -2.16    0.031   -.0243537
       black |  -2615.829     445.444    -5.87    0.000   -.0686517
    hispanic |  -992.3598    851.6376    -1.17    0.244   -.0132005
     amerind |  -6272.744    1572.172    -3.99    0.000   -.0449058
       _cons |  -24435.23    1358.659   -17.98    0.000           .
TABLE 13. reg sal edyrs yos age asian black hispanic amerind, beta

      Source |          SS      df           MS
       Model |  3.8627e+11       7   5.5181e+10
    Residual |  3.4712e+11    3476   99861358.9
       Total |  7.3338e+11    3483    210560961

      Number of obs = 3484    F(7, 3476) = 552.57     Prob > F = 0.0000
      R-squared = 0.5267      Adj R-squared = 0.5257  Root MSE = 9993.1

         sal |      Coef.   Std. Err.        t    P>|t|        Beta
       edyrs |   3687.901    79.52582    46.37    0.000    .5490723
         yos |   754.9943    24.32016    31.04    0.000    .4483013
         age |   7.539605    20.19351     0.37    0.709    .0054217
       asian |  -2390.004    1006.348    -2.37    0.018     -.027902
       black |  -3825.463    457.5498    -8.36    0.000   -.1003982
    hispanic |  -741.8552    886.0426    -0.84    0.402   -.0098682
     amerind |  -6756.783    1635.662    -4.13    0.000     -.048371
       _cons |  -28795.38     1388.31   -20.74    0.000           .


TABLE 14. reg sal male yos age asian black hispanic amerind, beta

      Source |          SS      df           MS
       Model |  2.6596e+11       7   3.7995e+10
    Residual |  4.6742e+11    3476    134470428
       Total |  7.3338e+11    3483    210560961

      Number of obs = 3484    F(7, 3476) = 282.55     Prob > F = 0.0000
      R-squared = 0.3627      Adj R-squared = 0.3614  Root MSE = 11596

         sal |      Coef.   Std. Err.        t    P>|t|        Beta
        male |   10725.29    404.6845    26.50    0.000     .369547
         yos |   682.1735    28.23688    24.16    0.000    .4050617
         age |  -19.12451    23.46781    -0.81    0.415   -.0137524
       asian |  -1179.992    1167.663    -1.01    0.312   -.0137758
       black |  -4126.243    535.9197    -7.70    0.000   -.1082921
    hispanic |  -3127.464    1026.304    -3.05    0.002   -.0416018
     amerind |  -9311.796    1896.137    -4.91    0.000    -.066662
       _cons |   20923.42     899.102    23.27    0.000           .

46A. The amount of variation uniquely explained by each variable is an alternative
method of assessing relative strength. By comparing Table 15 to the following
regressions, each of which drops one variable or one set of variables, we can see that:
Sex uniquely explains 3.6% of the variation in salary.
Education uniquely explains 20.0%.
Federal experience and age combined uniquely explain 18.0%.
Race/ethnicity uniquely explains only 0.6%.
TABLE 15. reg sal male edyrs yos age asian black hispanic amerind, beta

      Source |          SS      df           MS
       Model |  4.1289e+11       8   5.1611e+10
    Residual |  3.2050e+11    3475   92229060.6
       Total |  7.3338e+11    3483    210560961

      Number of obs = 3484    F(8, 3475) = 559.60     Prob > F = 0.0000
      R-squared = 0.5630      Adj R-squared = 0.5620  Root MSE = 9603.6

         sal |      Coef.   Std. Err.        t    P>|t|        Beta
        male |   6034.097    355.1608    16.99    0.000    .2079088
       edyrs |    3232.53    80.98997    39.91    0.000    .4812745
         yos |   730.5837    23.41642    31.20    0.000    .4338067
         age |  -10.80357     19.4365    -0.56    0.578   -.0077688
       asian |  -2086.068    967.2924    -2.16    0.031   -.0243537
       black |  -2615.829     445.444    -5.87    0.000   -.0686517
    hispanic |  -992.3598    851.6376    -1.17    0.244   -.0132005
     amerind |  -6272.744    1572.172    -3.99    0.000   -.0449058
       _cons |  -24435.23    1358.659   -17.98    0.000           .
. reg sal edyrs yos age asian black hispanic amerind, beta

      Source |          SS      df           MS
       Model |  3.8627e+11       7   5.5181e+10
    Residual |  3.4712e+11    3476   99861358.9
       Total |  7.3338e+11    3483    210560961

      Number of obs = 3484    F(7, 3476) = 552.57     Prob > F = 0.0000
      R-squared = 0.5267      Adj R-squared = 0.5257  Root MSE = 9993.1

         sal |      Coef.   Std. Err.        t    P>|t|        Beta
       edyrs |   3687.901    79.52582    46.37    0.000    .5490723
         yos |   754.9943    24.32016    31.04    0.000    .4483013
         age |   7.539605    20.19351     0.37    0.709    .0054217
       asian |  -2390.004    1006.348    -2.37    0.018     -.027902
       black |  -3825.463    457.5498    -8.36    0.000   -.1003982
    hispanic |  -741.8552    886.0426    -0.84    0.402   -.0098682
     amerind |  -6756.783    1635.662    -4.13    0.000     -.048371
       _cons |  -28795.38     1388.31   -20.74    0.000           .
. reg sal male yos age asian black hispanic amerind, beta

      Source |          SS      df           MS
       Model |  2.6596e+11       7   3.7995e+10
    Residual |  4.6742e+11    3476    134470428
       Total |  7.3338e+11    3483    210560961

      Number of obs = 3484    F(7, 3476) = 282.55     Prob > F = 0.0000
      R-squared = 0.3627      Adj R-squared = 0.3614  Root MSE = 11596

         sal |      Coef.   Std. Err.        t    P>|t|        Beta
        male |   10725.29    404.6845    26.50    0.000     .369547
         yos |   682.1735    28.23688    24.16    0.000    .4050617
         age |  -19.12451    23.46781    -0.81    0.415   -.0137524
       asian |  -1179.992    1167.663    -1.01    0.312   -.0137758
       black |  -4126.243    535.9197    -7.70    0.000   -.1082921
    hispanic |  -3127.464    1026.304    -3.05    0.002   -.0416018
     amerind |  -9311.796    1896.137    -4.91    0.000    -.066662
       _cons |   20923.42     899.102    23.27    0.000           .
. reg sal male edyrs asian black hispanic amerind, beta

      Source |          SS      df           MS
       Model |  2.8056e+11       6   4.6760e+10
    Residual |  4.5282e+11    3477    130233422
       Total |  7.3338e+11    3483    210560961

      Number of obs = 3484    F(6, 3477) = 359.05     Prob > F = 0.0000
      R-squared = 0.3826      Adj R-squared = 0.3815  Root MSE = 11412

         sal |      Coef.   Std. Err.        t    P>|t|        Beta
        male |   7566.204    418.5967    18.08    0.000    .2606986
       edyrs |   3014.761    95.99118    31.41    0.000     .448852
       asian |  -3730.235    1147.052    -3.25    0.001   -.0435484
       black |  -3109.697    526.0432    -5.91    0.000   -.0816132
    hispanic |  -3024.445     1009.19    -3.00    0.003   -.0402314
     amerind |  -6260.943    1868.175    -3.35    0.001   -.0448213
       _cons |  -12433.66    1355.152    -9.18    0.000           .


. reg sal male edyrs yos age, beta

      Source |          SS      df           MS
       Model |  4.0823e+11       4   1.0206e+11
    Residual |  3.2516e+11    3479   93462355.9
       Total |  7.3338e+11    3483    210560961

      Number of obs = 3484    F(4, 3479) = 1091.96    Prob > F = 0.0000
      R-squared = 0.5566      Adj R-squared = 0.5561  Root MSE = 9667.6

         sal |      Coef.   Std. Err.        t    P>|t|        Beta
        male |   6376.485    352.6905    18.08    0.000    .2197061
       edyrs |   3283.336    80.99837    40.54    0.000    .4888387
         yos |   728.8589    23.49308    31.02    0.000    .4327826
         age |  -1.074011    19.41358    -0.06    0.956   -.0007723
       _cons |  -26362.47    1328.456   -19.84    0.000           .

47. Adding an extra independent variable will never lower R2 (unless missing data
decrease the sample size). SST (total variation) will not change and SSR (explained
variation) will not decrease. Whatever variation was already explained by other
variables in the model will still be explained. Because adding extra independent
variables can increase R2 even when those variables are not related to the dependent
variable in the population, R2 can be a biased estimate of the true strength of the
relationship in the population. We therefore use the adjusted R2 when we do multiple
regression analysis.
The adjusted R2 adjusts for both the number of independent variables and the
number of observations in the data set. Adjusted R2 is always lower than R2, but
it is only slightly lower in large data sets. Using adjusted R2 is most important
when you are working with small data sets.
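The adjustment itself is the standard one that Stata reports: adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. A quick sketch applying it to Table 12 (n = 3484, k = 8, R2 = .5630):

. display "Adj R-squared = " 1 - (1 - .5630)*(3484 - 1)/(3484 - 8 - 1)    // about .5620, matching Table 12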
Using Uniquely Explained Variation to Understand
Changing Regression Coefficients

[Ballantine diagram: overlapping circles for sal, male, and edyrs, with the shared areas labeled A, B, and C.]

48. The Ballantine diagram can help us understand why regression coefficients change
when additional variables are added to the model. Area A is variation in salary that can
be explained by sex but not by education; it represents variation in salary due to men
earning more than equally educated women. Area B is variation in salary that can be
explained by education but not by sex; it represents variation in salary due to more
educated workers earning more than less educated workers of the same sex. Area C is
variation in salary that could potentially be explained by either education or sex; it
represents variation in salary due to more educated men earning more than less
educated women.
49. If we run bivariate regressions, Area C will be attributed to whichever independent
variable is included in the regression, and that variation will be used in calculating the
regression coefficient; but in multiple regression, Area C can only be counted once, so R2
for edyrs and male together is smaller than the sum of the separate R2's for edyrs and
male, and Area C is not used in computing either regression coefficient. The
regression coefficient for male is based entirely on Area A. The regression coefficient
for edyrs is based entirely on Area B.
In bivariate salary regressions comparing more educated men to less educated
women, too much of the salary advantage will be attributed to sex if education is
left out of the equation, and too much will be attributed to education if sex is left
out of the equation. Part of the reason that men earn more than women is that
they are more educated, and part of the reason that more educated employees
earn more than less educated ones is that they are more likely to be men, but
bivariate regressions won't show that.
50. Consider the following scatterplots, where the blue circles represent men and the
pink triangles represent women. The men have higher levels of education and salary
than do the women. In Graph A, if we ignore sex and treat each point simply as a point
(the bivariate case), our regression line is much steeper than in Graph B, where we draw
separate lines for the men and women (the multiple regression case). Expected salary
increases $6200 with each year of education in the bivariate case, $4700 per year of
education among people of the same sex in the multiple regression case.
. reg sal edyrs

      Source |          SS      df           MS
       Model |  4.8406e+09       1   4.8406e+09
    Residual |  3.7702e+09      30    125674505
       Total |  8.6108e+09      31    277767435

      Number of obs = 32      F(1, 30) = 38.52        Prob > F = 0.0000
      R-squared = 0.5621      Adj R-squared = 0.5476  Root MSE = 11210

         sal |      Coef.   Std. Err.        t    P>|t|    [95% Conf. Interval]
       edyrs |    6201.23    999.2033     6.21    0.000     4160.585    8241.876
       _cons |  -15517.16    15056.59    -1.03    0.311    -46266.81     15232.5
. reg sal edyrs male

      Source |          SS      df           MS
       Model |  7.4022e+09       2   3.7011e+09
    Residual |  1.2085e+09      29   41674103.1
       Total |  8.6108e+09      31    277767435

      Number of obs = 32      F(2, 29) = 88.81        Prob > F = 0.0000
      R-squared = 0.8596      Adj R-squared = 0.8500  Root MSE = 6455.5

         sal |      Coef.   Std. Err.        t    P>|t|    [95% Conf. Interval]
       edyrs |    4669.09    607.6711     7.68    0.000     3426.263    5911.917
        male |   19047.72    2429.479     7.84    0.000     14078.87    24016.56
       _cons |  -3345.157    8808.237    -0.38    0.707    -21360.03    14669.71

Likewise, those two lines are closer together ($19,000 apart) than are the mean salaries
for the men and women ($25,000 apart).
. reg sal male

      Source |          SS      df           MS
       Model |  4.9419e+09       1   4.9419e+09
    Residual |  3.6689e+09      30    122296008
       Total |  8.6108e+09      31    277767435

      Number of obs = 32      F(1, 30) = 40.41        Prob > F = 0.0000
      R-squared = 0.5739      Adj R-squared = 0.5597  Root MSE = 11059

         sal |      Coef.   Std. Err.        t    P>|t|    [95% Conf. Interval]
        male |   25050.83    3940.768     6.36    0.000     17002.71    33098.95
       _cons |   63022.63    2955.576    21.32    0.000     56986.53    69058.72

Graph A. [Scatterplot of sal (roughly $40,000-$120,000) against edyrs (12-18 years), with a single fitted line for all 32 observations.]

Graph B. [The same scatterplot with separate fitted lines for the men and the women.]

51. Imagine that sex and education were unrelated in the federal service, that is, that
men and women had the same levels of education. The Ballantine would show that the
variation each variable shared with sal was completely separate. If we added both
variables to the same regression equation, there would be no jointly explained variation,
so the combined R2 would equal the sum of the two separate R2's. None of the reason
that men earned more than women would be due to educational differences, so the
coefficients on male and edyrs would not change in the multiple regression. The
scatterplot would show that men and women have the same educational distributions,
so ignoring the sex of the observations would not bias the coefficient on edyrs, and the
male-female regression lines would be as far apart as the male-female means.
52. One way to see what multiple regression is doing with Area C is to think of multiple
regression as regressing residuals on residuals.
Table A (p. 49) showed the coefficients on male and edyrs when both are in the
sal regression. Tables B and C showed quite different coefficients on male and
edyrs in bivariate regressions. After both tables, however, I saved the residuals:
the variation in sal that cannot be explained by male, which I called salmres,
and the variation in sal that cannot be explained by edyrs, which I called
saleres.
The expected values from a bivariate regression are perfectly correlated
with the independent variable, and both the expected values and the
independent variable are completely uncorrelated with the residuals.
In Table D, I regressed edyrs on male, and again saved the residuals, the
variation in edyrs that is not correlated with male, as edmres.
In Table E, I regress the residuals from Table B on the residuals from Table D.
That is, I am regressing the variation in salary that is not shared with sex on the
variation in education that is not shared with sex. The coefficient on edmres is
identical to that on edyrs in Table A. (The y-intercept is 0, because the means of
the independent and dependent variables are both 0.)
TABLE E. reg salmres edmres

      Source |          SS      df           MS
       Model |  1.3540e+11       1   1.3540e+11
    Residual |  4.6015e+11    3482    132151921
       Total |  5.9555e+11    3483    170988650

      Number of obs = 3484    F(1, 3482) = 1024.58    Prob > F = 0.0000
      R-squared = 0.2274      Adj R-squared = 0.2271  Root MSE = 11496

     salmres |      Coef.   Std. Err.        t    P>|t|    [95% Conf. Interval]
      edmres |   3075.956    96.09631   32.009    0.000     2887.545    3264.367
       _cons |   .0000983    194.7591    0.000    1.000    -381.8534    381.8536

In terms of the Ballantine, I am removing all the variation any variable shares
with male, so the coefficient on edyrs (now transformed as edmres) is based
only on the variation uniquely shared between sal and edyrs.

Table F nonsensically regresses male on edyrs to get the variation in gender
that is correlated with education. (Note that R2 is identical in Tables D and F.) I
save the residuals, the variation in sex that is unrelated to variation in education,
and call them maleres.
TABLE F. reg male edyrs if sal<.

      Source |          SS      df           MS
       Model |  104.223331       1   104.223331
    Residual |  766.444867    3482   .220116274
       Total |  870.668197    3483   .249976514

      Number of obs = 3484    F(1, 3482) = 473.49     Prob > F = 0.0000
      R-squared = 0.1197      Adj R-squared = 0.1195  Root MSE = .46917

        male |      Coef.   Std. Err.        t    P>|t|    [95% Conf. Interval]
       edyrs |   .0800694    .0036797   21.760    0.000     .0728548    .0872839
       _cons |  -.6390899    .0533916  -11.970    0.000     -.743772   -.5344078

. predict maleres, res

Table G again regresses residuals on residuals, finding the relationship between
salary and gender once we have removed all the variation shared with education.
The coefficient on maleres is identical to that on male in Table A.
TABLE G. reg saleres maleres

      Source |          SS      df           MS
       Model |  4.8848e+10       1   4.8848e+10
    Residual |  4.6015e+11    3482    132151924
       Total |  5.0900e+11    3483    146138622

      Number of obs = 3484    F(1, 3482) = 369.63     Prob > F = 0.0000
      R-squared = 0.0960      Adj R-squared = 0.0957  Root MSE = 11496

     saleres |      Coef.   Std. Err.        t    P>|t|    [95% Conf. Interval]
     maleres |   7983.294    415.2372   19.226    0.000     7169.161    8797.427
       _cons |  -.0000177    194.7591    0.000    1.000    -381.8535    381.8534

Again, the coefficient on male in multiple regression (in Table A) is based on
variation that male shares only with sal, not with edyrs.
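Pulled together, the whole sequence is just a handful of commands. A sketch of a do-file (assuming the residual variables have not already been created in the current session):

quietly reg sal male
predict salmres, res         // salary variation not explained by sex (Table B)
quietly reg sal edyrs
predict saleres, res         // salary variation not explained by education (Table C)
quietly reg edyrs male if sal<.
predict edmres, res          // education variation not explained by sex (Table D)
quietly reg male edyrs if sal<.
predict maleres, res         // "sex" variation not explained by education (Table F)
reg salmres edmres           // reproduces the edyrs coefficient from Table A (Table E)
reg saleres maleres          // reproduces the male coefficient from Table A (Table G)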
62. Coefficients on the same variable may differ dramatically between bivariate and
multiple regression if the independent variables are correlated. If two independent
variables are not related at all, or if either is not related to the dependent variable,
however, their coefficients will be essentially the same in either bivariate or multiple
regression.


Computer Assignment 4
1. Load OPM2001.dta.
2. Create race variables asian, black, hispanic, and indian (or amerind).
3. Regress sal on the same variables as in Table 12, but request standardized
coefficients (beta-weights) instead of confidence intervals. Also calculate summary
statistics for the same variables. (That is, use summarize.)
A. Interpret the beta-weights for the four variables that appear to have the most
impact on salary.
B. Discuss the relative strength of the regression coefficients, using both
standardized and unstandardized coefficients.
C. Why does the relative strength of yos rise relative to that of edyrs when we
use beta-weights? Why is the male beta-weight so much larger than the indian beta-weight? Hint: use the summary statistics.
4. Re-run that regression, using lsal (the natural logarithm of salary) rather than sal as
the dependent variable. Do the regression coefficients or the beta-weights change
more? Why do you think that is? (Try correlating sal and lsal.)
5. Re-run the regression from question 3 four times. First, drop edyrs. Second, add
edyrs back in and drop yos. Third, add yos back in and drop male. Fourth, add
male back in and drop black. How much of the variation in sal does each variable
uniquely explain? How do the relative amounts match up with the beta-weights?
6. Replicate Tables A through G, but using grade rather than sal as the dependent
variable. Explain what you are doing, in your own words, step by step. Explain how
Tables E and G are related to Table A.
7. Type the following two lines into a do-file, then run them, to get a 10% sample of the
data set. The first line ensures that we all get the same sample.
set seed 123456
sample 10

8. Regress grade on edyrs. Save the predicted values (call them grdhat) and the
residuals (call them grdres). Run a correlation matrix on all four variables (grade
edyrs grdhat grdres). Be sure to get the summary statistics (using , means).
A. How is grdhat related to the other three variables?
B. How is grdres related to the other three variables?
C. Square the correlation coefficients between grade and both grdhat and
grdres, then sum them. What do you get?


D. Interpret the correlation coefficient between edyrs and grade using the
words standard deviation. How does the correlation coefficient relate to R2 in the
regression?
E. Use the correlation coefficient between edyrs and grade and their standard
deviations to calculate the regression coefficient in this bivariate regression. Check your
answer against the regression output. If we were to regress edyrs on grade instead
(that is, make grade the independent variable), what would the regression coefficient
on grade be?
F. From their standard deviations, calculate the variances and total variations of
grade and grdhat. How do your answers relate to SST and SSR? Is there a way to
calculate SSE from the summary statistics (without using grade or grdhat)?


CAUSAL MODELS
1. We are often interested in relationships between variables because we believe the
relationships are causal. That is, we believe that if we had the power to change the
value of the independent variable, that would in turn change the value of the dependent
variable. For instance, we suspect that higher levels of education and higher levels of
experience tend to lead to higher salaries. If so, by increasing the educational levels of
disadvantaged groups, we can increase their salaries as well. If women earn less than
men because they have less work experience, then as women's work experience rises
relative to men's, their salaries will become more similar to men's.

2. In order to establish that an independent variable has a causal impact on a dependent
variable, we typically need to demonstrate three things: (1) temporal sequence, (2)
correlation, and (3) nonspurious correlation.

3. Suppose we suspect that people value height and treat tall people better than short
people in a variety of ways, including paying them more. How would we demonstrate
that height has a causal impact on salary?
4. First, we would have to show that height preceded salary in time. Causes cannot
come after effects. In this case, the argument seems easy because most people reached
their full height before they began to work full time and certainly before they reached
their current salary.

5. If we tried to argue that weight has a causal impact on salary (that heavy people earn
more because they are heavy), on the other hand, we would need to answer the
counter-argument that people become heavy because they earn high salaries (and can
therefore afford to eat more rich food).
6. The question of temporal sequence is especially difficult in studying relationships
between attitudes. Are people satisfied with their jobs because they are satisfied with
their pay, or are they satisfied with their pay because they are satisfied with their jobs?
Do people trust their supervisors because they have good communications with them,
or do they have good communications with them because they trust them? If we
cannot confidently answer such questions, the simple causal models discussed here will
not be appropriate.
7. Second, we must show that height and salary are correlated, i.e., we must show that
tall people earn more than short people.
8. Third, we must show that the correlation between height and salary is not spurious.
That is, we must show that the reason height and salary are related is not because they
are both the effects of some other cause. For instance, men are taller than women and
men earn more than women. If sex is the cause in each of these relationships, then at
least part of the correlation between height and salary is spurious. That is, at least part
of the reason tall people earn more than short people is that more of the tall people are
men and are earning higher salaries because of their sex (rather than their height).
9. To show that at least part of the relationship between height and salary is
nonspurious, we must show that tall men earn more than short men and that tall
women earn more than short women. To show that, we hold sex constant by
comparing men to men and women to women. If tall men earn more than short women,
it could be because of their height, because of their sex, or because of other differences
between them. If tall men earn more than short men, it cannot be because of their sex,
because the tall and short people we are comparing are all men. It could still be because
of other differences between the two groups, however, so we cannot be positive that
there is a causal impact until we eliminate other possible causes of the correlation
between height and salary.
Dealing with Spurious Correlations
through Randomization and Statistical Controls

10. In experimental designs, researchers eliminate other possible causes through
randomization. In a study of the impact of taking aspirin on preventing heart
disease, a variety of other causes might explain differences in heart attack rates:
genetics, cholesterol levels, diet, smoking, exercise, and a whole variety of other factors
that the researchers suspected or did not suspect. To avoid spurious correlations
between heart attacks and aspirin consumption, the researchers randomly assigned
large numbers of men to the treatment group (the ones who took the aspirin) and the
control group (the ones who took the placebos, e.g., a sugar pill that looks like an
aspirin). In very large random samples, the standard errors for the sample means and
proportions of all variables are very small. Researchers could therefore be confident
that both the treatment and control groups were very similar to the population from
which they were drawn and that the groups did not differ much from each other. If the
treatment group had lower heart attack rates than the control group, it was not because
they had lower cholesterol levels or because they exercised more -- randomization
ensured that the two groups had very similar mean cholesterol levels and exercise rates.
Randomization controls for other possible causes by ensuring that the two groups do not
differ much on any of the possible causes.

11. Unfortunately for testing our hypothesis, we cannot randomly assign babies to be
short or tall and see whether the ones who turn out tall earn more than the ones who
turn out short. In social science research, variables are often "naturally" related to each
other.

As a group, children raised in two-parent households have wealthier families, go
to better schools, and live in safer neighborhoods than children raised in
single-parent households. If children raised in two-parent households differ from
children raised in one-parent households, it could be due to the differences in
parenting, or to any of these other three differences (and many others
systematically related to them).
12. To establish a causal relationship between family structure and success in school, for
instance, we must statistically control for all other factors that could be causing
differences in school success. That is, we must compare children from single-parent and
dual-parent households who are otherwise the same: equally wealthy households,
equally safe neighborhoods, equally good schools, etc. Because critics can almost always
come up with other possible causes that the researchers have not controlled for, proving
causation is nearly impossible with nonexperimental data. Statistical controls often
allow us to make a persuasive case for causation, however, if we can eliminate major
alternative explanations for a relationship.

Other Purposes of Causal Modeling


13. Multivariate analysis also allows us to understand the mechanisms through which
one variable affects another.

In a 1% sample of federal personnel records for 2001, men's average grade is 1.73
higher than women's (Table 1). Is this the result of "pure" discrimination by the
federal civil service -- unequal treatment of clearly equal individuals? Or do
differences in life and career patterns of men and women partly account for the
grade differences?
TABLE 1. reg grade male, beta

      Source |          SS      df           MS
       Model |  8437.15239       1   8437.15239
    Residual |  102621.766   11210   9.15448402
       Total |  111058.918   11211    9.9062455

      Number of obs = 11212   F(1, 11210) = 921.64    Prob > F = 0.0000
      R-squared = 0.0760      Adj R-squared = 0.0759  Root MSE = 3.0256

       grade |      Coef.   Std. Err.        t    P>|t|        Beta
        male |   1.734969    .0571493    30.36    0.000    .2756267
       _cons |    9.05683    .0405114   223.56    0.000           .

14. Due to past discrimination, socialization, family responsibilities, and a variety of
other reasons, men and women in the federal service may differ in systematic ways. We
have already seen that more educated and experienced individuals earn more than less
educated and newer employees. If the men in the civil service are more educated and
experienced than the women, men's mean salary could be substantially higher than
women's, even if men and women with equal education and seniority received identical
treatment.
15. If this argument is correct, education and experience serve as intervening
variables in the relationship between sex and salary. An intervening variable is caused
by the original independent variable and, in turn, causes the original dependent
variable. If educational and experience advantages are the only reason that men earn
more than women, then eliminating those advantages would cause men and women to
have equal salaries.
16. We can statistically control for men's educational and experience advantages by
comparing men and women who have equal levels of education and experience. If men's
educational and experience advantages are the only reason men earn more than women
(this is called a total interpretation), then equally educated and experienced men and
women will have equal average salaries. If men's educational and experience advantages
are part of the reason men earn more than women (this is called a partial
interpretation), then men will still earn more than equally educated and experienced
women, but the difference between the expected salaries of comparably educated men
and women will be smaller than the difference between the mean salaries of men and
women.

Developing a Causal Model


17. We develop causal models using theory, logic, and knowledge. A causal model
maps out our hypotheses about the inter-relationships of variables. We test our causal
model with statistics.

18. The simplest causal model simply says that X causes Y. X is to the left, Y is to the
right, and an arrow points from X (the cause) to Y (the effect). We frequently put a +
(plus) or - (minus) sign above the arrow when we feel confident that we know whether
the relationship is positive or negative. Many models also include an arrow from an
error term (e), which represents the effects of all other variables that influence Y, plus
the measurement error. When that arrow is not included, it is implied.

[Path diagram: X -> Y, with an arrow from the error term e also pointing to Y.]

Example 1

19. Suppose we are interested in the causal impact of education on federal grade level,
that is, we want to know how much an additional year of education (edyrs) raises one's
expected grade level. We can start out with a bivariate causal model, but this has several
problems.

[Path diagram: edyrs -> grade.]

20. First, we know that other variables influence one's grade level. Grade levels tend to
rise with federal experience (yos), for instance. Neglecting yos may not be a problem if
education and federal experience are not correlated. In that case, the edyrs coefficient
will be unbiased if these are the only two causal variables, though its standard error will
be larger than necessary because the unexplained variation will remain unnecessarily
large. We can add yos to the model as a separate causal variable, either with no link to
edyrs or with a double-headed arrow allowing unexplained correlation between them.

[Path diagram: edyrs -> grade and yos -> grade, with a curved double-headed arrow between edyrs and yos.]

21. A more serious problem is that a variety of individual characteristics may influence
how much education federal employees have. Men typically have more education than
women, at least in the federal civil service, which employs a large number of
less-educated women in clerical positions. Whites tend to have more education than
minorities for a variety of societal reasons. And educational levels in this country rose
over the past century, so older workers are likely to have less education than younger
ones (though older workers have also had more years to further their education, and a
fair number of federal employees do continue their education while employed).

22. Neglecting to include these variables in the model will have more serious effects
than failing to include yos. First, we are virtually certain that educational levels are
related to gender and race/ethnicity, so excluding them is likely to have a major impact
on the edyrs coefficient. Second, gender and race/ethnicity are likely to have
independent impacts on grade through sexual or racial discrimination or through their
impacts on other variables. Even more importantly, gender, race, and age are all
logically prior to education - they can influence how much education a person has, but
one's educational level has no impact on one's gender, race, or age. Thus, sex, race, and
age are potential antecedent variables, causes of both education and grade level.

23. Antecedent variables have a causal impact on both the original independent
variable and the original dependent variable.

If we exclude race and sex from the model, then the edyrs coefficient is likely to
overstate the causal impact of education on grade level. Men and whites have
higher grades than women and minorities of the same educational level. If we
exclude these variables from the model, the edyrs coefficient will incorporate
this positive correlation and overstate the impact of education.

Thus, part of the relationship between education and grade level in a bivariate
model is spurious rather than causal. That is, part of the reason that
better-educated employees have higher grades than less-educated employees, on
average, is that a higher percentage of the better-educated employees are white
males and a higher percentage of the less-educated employees are women and
minorities. Thus, the bivariate edyrs coefficient will overstate the impact of an
additional year of education on a person's expected grade level because it includes
these racial and sexual differences in grade level that are not the effect of
education.

24. To incorporate these variables (but not yos) into our causal model, we place male,
white, and age (the causes) to the left of edyrs (the effect), and draw arrows from the
causes to the effects, adding positive or negative signs to the arrows in line with our
hypotheses.

25. Each of these arrows represents a hypothesis.

The positive arrow from male to edyrs hypothesizes that men have more
education than women of the same age and minority status.
The positive arrow from white to edyrs hypothesizes that whites have more
education than minorities of the same age and sex.
The negative arrow from age to edyrs hypothesizes that older employees have
less education than younger employees of the same sex and minority status.
The positive arrow from male to grade hypothesizes that men have higher
grades than women of the same age, minority status, and level of education.
The positive arrow from white to grade hypothesizes that whites have higher
grades than minorities of the same age, sex, and level of education.
The positive arrow from age to grade hypothesizes that older employees have
higher grades than younger employees of the same sex, minority status, and level
of education.
The positive arrow from edyrs to grade hypothesizes that better-educated
employees have higher grades than less-educated employees of the same age, sex,
and minority status. This arrow is the nonspurious (direct) effect of edyrs on
grade.


26. Our model also includes a double-headed curved arrow between age, white, and
male, which implies that values of those variables may be correlated but in a noncausal
way. Sex does not cause age, nor age sex, for instance, but the men in the federal service
may be older or younger than the women.
27. Edyrs and grade are both endogenous variables in this model, because both
are determined by other variables in the model. Male, white, and age are exogenous
variables, because they are not determined by other variables in the model.

This usage is somewhat different from Wooldridge's, but reasonably consistent
with it. Both edyrs and grade (the endogenous variables) have error arrows
pointing to them - in other words, both are correlated with at least one error
term. In theory, the exogenous variables are uncorrelated with the error terms.
Example 2

28. Suppose we want to understand why men have higher grades than women in the
federal service. Our simplest causal model simply states that sex has a causal impact on
grade level. The arrow leads from male (the cause) to grade (the effect). The positive
sign shows that men have higher grades (high values of the male variable (1=men) are
associated with high values of the grade variable (high grades)).

[Path diagram: male -> grade, with a + sign on the arrow.]
29. Let's consider two intervening variables. (Intervening variables are caused
by the original independent variable and cause the original dependent variable.) We
expect men to have more education (edyrs) and experience (yos) than women, and we
expect both advantages to lead to higher grades.

30. The positive arrows from male (cause) to edyrs and yos (effects) hypothesize that
men have more education and experience than women. The positive arrow from edyrs
(cause) to grade (effect) hypothesizes that more educated employees have higher
grades than less educated people of the same sex and length of service. The positive
arrow from yos (cause) to grade (effect) hypothesizes that more experienced
employees have higher grades than less experienced employees of the same sex and level
of education.

31. The positive arrow from male to grade in 29 represents a different hypothesis than
the positive arrow from male to grade in 28. That arrow (#28) means simply that men
have higher mean grades than women. This arrow (#29) means that men have higher
grades than women who have the same levels of education and experience. This arrow
is the direct effect of sex on grade, holding education and federal experience constant,
while that arrow (#28) represents the total effect of sex on grade.
32. If sex had no direct effect on grade once we control for education and experience
(that is, if men and women with the same amount of education and experience had the
same grades), then the entire effect of sex on grade could be attributed to educational
and experience differences between men and women. This would not mean that sex had
no effect on grade, simply that we had identified the indirect effects, the mechanisms
through which sex affects grade. We would have provided a total interpretation of
the effect of sex on grade.

33. If a substantial direct effect remains after controlling for education and experience,
it means that male-female grade differences are not due simply to differences in
education and experience. Other factors (perhaps "pure" discrimination) are also
responsible.
34. In this model, edyrs, yos, and grade are all endogenous variables because all
are influenced by male. Only male is an exogenous variable in this model, because it
is the only variable that is not caused or influenced by any other variable in the model.
Example 3

35. It makes more sense to combine our two models into one. We allow a (noncausal)
correlation among age, white, and male. We expect the latter two variables to have a
positive impact on edyrs, and age may have a negative impact, holding race and sex
constant. Older employees should clearly have more experience than younger ones, men
probably have more federal experience than women of the same age, due to women's
greater likelihood of having taken time out of the labor market to raise children, and
whites may have more yos than minorities of the same age and sex, because whites tend
to spend less of their careers unemployed. In addition, we hypothesize that all five
variables have positive direct effects on grade. We express all our hypotheses in the
following path diagram.

Testing the Models Using Path Analysis


36. Let's begin by testing a simpler model. Our previous work suggests that male-female
educational differences are the most important of these three factors in
explaining men's higher grades. A simplified model shows our hypotheses that men
have more education than women, that more educated employees have higher grades
than less educated employees of the same sex, and that men have higher grades than
women with the same level of education.

[Path diagram: male -> edyrs -> grade, plus a direct arrow from male to grade, all with + signs.]
37. Table 2 shows that, in our sample, the average woman has 13.94 years of education
and that the average man has an additional 1.17 years of education. The standardized
coefficient (beta-weight) on male is .258.
TABLE 2. reg edyrs male, beta

      Source |          SS      df           MS
       Model |  3866.80683       1   3866.80683
    Residual |  54300.3623   11210    4.8439217
       Total |  58167.1691   11211   5.18840149

      Number of obs = 11212   F(1, 11210) = 798.28    Prob > F = 0.0000
      R-squared = 0.0665      Adj R-squared = 0.0664  Root MSE = 2.2009

       edyrs |      Coef.   Std. Err.        t    P>|t|        Beta
        male |   1.174546    .0415712    28.25    0.000    .2578323
       _cons |    13.9394    .0294686   473.03    0.000           .

38. Table 3 shows that, in this sample, the expected grade of men is .90 higher than the
expected grade of women with the same level of education and that expected grade rises
.707 per additional year of education, holding sex constant. (The standardized
coefficients are .144 and .512, respectively.)
TABLE 3. reg grade male edyrs, beta

      Source |          SS      df           MS
       Model |   35562.193       2   17781.0965
    Residual |  75496.7253   11209    6.7353667
       Total |  111058.918   11211    9.9062455

      Number of obs = 11212   F(2, 11209) = 2639.96   Prob > F = 0.0000
      R-squared = 0.3202      Adj R-squared = 0.3201  Root MSE = 2.5953

       grade |      Coef.   Std. Err.        t    P>|t|        Beta
        male |   .9048244    .0507355    17.83    0.000    .1437453
       edyrs |   .7067793    .0111373    63.46    0.000    .5115005
       _cons |  -.7952527    .1590885    -5.00    0.000           .

39. Notice that the expected difference in grade between men and women with the same
level of education (.90) is smaller than the difference between the mean grades of men
and women (1.73). Our causal model led us to expect that. Part of the reason men have
higher grades than women is that they have more years of education, on average. When
we separate out the indirect effect of men's grade advantage through education, the
remaining direct effect of sex on grade is smaller than the total effect.
40. The total effect of sex on grade is 1.73 and the direct effect of sex on grade is .90,
implying that the indirect effect of sex on grade is .83 (1.73 - .90). This implies that
educational differences account for .83 of the grade gap between men and women, or
about 48% (.83/1.73) of the male grade advantage.
Notice that men have, on average, 1.17 more years of education than women.
Each additional year of education has a direct effect of .707. If we multiply the
two coefficients by each other (1.17*.707 = .83), we get the indirect effect of sex
on grade.
41. It is more common in path analysis (another name for this causal modeling) to
attach standardized coefficients (beta-weights) to each arrow. When we multiply the
two standardized coefficients on the indirect path (male->edyrs->grade), the
product is .132 (.258*.512). This indirect path plus the standardized coefficient for the
direct path adds up to the standardized coefficient for the total effect (.132+.144=.276,
the initial beta-weight for male->grade in Table 1).
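A few display lines verify the decomposition, using the coefficients reported in Tables 1-3 (a quick sketch, not part of the original output):

. display "indirect effect   = " 1.174546*.7067793                    // about .83
. display "total effect      = " .9048244 + 1.174546*.7067793         // about 1.73, the male coefficient in Table 1
. display "indirect (beta)   = " .2578323*.5115005                    // about .132
. display "total (beta)      = " .1437453 + .2578323*.5115005         // about .276, the male beta-weight in Table 1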

42. The total effect of an additional year of education on expected grade is .758
(beta-weight = .549; Table 4) in a bivariate regression. Holding sex constant, the direct
effect of education falls to .707 (in Table 3); its beta-weight also shrinks (to .512). Our
causal model led us to expect that much of the apparent impact of edyrs in a bivariate
regression is actually spurious, resulting from the fact that more educated employees
are more likely than less educated ones to be men.
TABLE 4. reg grade edyrs, beta

      Source |          SS      df           MS
       Model |   33419.965       1    33419.965
    Residual |  77638.9533   11210   6.92586559
       Total |  111058.918   11211    9.9062455

      Number of obs = 11212   F(1, 11210) = 4825.38   Prob > F = 0.0000
      R-squared = 0.3009      Adj R-squared = 0.3009  Root MSE = 2.6317

       grade |      Coef.   Std. Err.        t    P>|t|        Beta
       edyrs |    .757991    .0109118    69.46    0.000    .5485627
       _cons |  -1.084666    .1604811    -6.76    0.000           .

43. In our model, there are two paths from edyrs to grade, the direct effect from
edyrs to grade and the spurious effect from edyrs back to male to grade. (We can
recognize that this latter path is spurious rather than indirect because we are going
backwards on the arrow from male to edyrs.) The spurious effect plus the direct
effect add up to the total effect. Since both the spurious and direct effects are
positive, the direct effect must be smaller than the total effect.

The spurious effect (.051 = .758 - .707) comprises about 7% of the total effect
(.051/.758).
We can also see this by multiplying the beta-weight on male->edyrs (.258)
times the beta-weight on male->grade (.144) to get the spurious effect of .037.
The spurious effect (.037) plus the direct effect (.512) adds up to the bivariate
standardized coefficient on edyrs in the grade equation (.549). Again, .037/.549
= .067, about 7% of the total effect.
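The same kind of check works for the spurious component, again using the coefficients from Tables 1-4 (a sketch, not part of the original output):

. display "spurious (unstandardized) = " .757991 - .7067793             // about .051
. display "spurious (beta)           = " .2578323*.1437453              // about .037
. display "direct + spurious (beta)  = " .5115005 + .2578323*.1437453   // about .549, the bivariate edyrs beta in Table 4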
44. Because our causal model indicates that male causes edyrs, the problem of a
biased coefficient in a bivariate regression is more serious using edyrs rather than
male as the independent variable. Edyrs is an intervening variable in our model, a
mechanism through which sex affects grade. The total effect of sex on grade is that
men's mean grade is 1.73 higher than women's mean grade. If we want to isolate the
impact of sexual discrimination by the federal civil service, we should focus on the
direct effect of sex on grade after controlling for education (and other legitimate
factors). But, if we recognize that sex affects grades through additional mechanisms
besides sexual discrimination by the civil service, then the bivariate regression in
Table 1 provides a reasonable estimate of the causal impact of sex on federal grades.
45. On the other hand, male is an antecedent variable to the edyrs->grade
relationship. Some of the relationship between education and grade is spurious; that
is, some of the apparent impact of edyrs on grade in the bivariate regression is not the
causal impact of edyrs at all. Instead, it is a reflection of the fact that both
better-educated and higher-grade employees are more likely to be men than less-educated
and lower-grade employees, and better-educated employees benefit from other grade
advantages that men have over women. Failing to separate out that spurious portion
gives a misleading estimate of the effect of education on grade. Very few of us have
plans to change our sex, so if we want to know what impact we can expect from
additional education, we want to know the impact of education holding sex constant.
Example 4

46. Next, we test the model that men earn more than women partly because the men
have more federal experience than the women and that more experienced employees
earn more than less experienced ones. We hypothesize that men still earn more than
women when they have the same amount of federal experience.

[Path diagram: male -> yos -> grade, plus a direct arrow from male to grade, all with + signs.]

47. Although it used to be true that the average male federal employee had worked for
the government longer than the average female federal employee, this is apparently no
longer true. In the 2001 sample, the women actually have .03 more years of service than
the men, though the difference is far from statistically significant (Table 5). Since there
is essentially no relationship between male and yos, controlling for yos only changes
the male coefficient from 1.735 (Table 1) to 1.738 (Table 7), and controlling for male
only changes the yos coefficient from .0893 (Table 6) to .0895 (Table 7). Thus, there is
no indirect effect of sex on grade through federal experience, and none of the total effect
of yos on grade is spurious (the result of sex affecting both experience and grade).
TABLE 5. reg yos male, beta

      Source |          SS      df           MS
       Model |  2.57853613       1   2.57853613
    Residual |  933524.887   11210   83.2760827
       Total |  933527.465   11211   83.2688846

      Number of obs = 11212   F(1, 11210) = 0.03      Prob > F = 0.8603
      R-squared = 0.0000      Adj R-squared = -0.0001 Root MSE = 9.1256

         yos |      Coef.   Std. Err.        t    P>|t|        Beta
        male |  -.0303306     .172367    -0.18    0.860    -.001662
       _cons |   17.28361    .1221858   141.45    0.000           .

TABLE 6. reg grade yos, beta

      Source |          SS      df           MS
       Model |  7447.56626       1   7447.56626
    Residual |  103611.352   11210   9.24276111
       Total |  111058.918   11211    9.9062455

      Number of obs = 11212   F(1, 11210) = 805.77    Prob > F = 0.0000
      R-squared = 0.0671      Adj R-squared = 0.0670  Root MSE = 3.0402

       grade |      Coef.   Std. Err.        t    P>|t|        Beta
         yos |    .089319    .0031466    28.39    0.000    .2589587
       _cons |   8.386255    .0614555   136.46    0.000           .
TABLE 7. reg grade male yos, beta

      Source |          SS      df           MS
       Model |  15911.1112       2   7955.55562
    Residual |  95147.8071   11209   8.48851879
       Total |  111058.918   11211    9.9062455

      Number of obs = 11212   F(2, 11209) = 937.21    Prob > F = 0.0000
      R-squared = 0.1433      Adj R-squared = 0.1431  Root MSE = 2.9135

       grade |      Coef.   Std. Err.        t    P>|t|        Beta
        male |   1.737683    .0550314    31.58    0.000    .2760578
         yos |   .0894772    .0030155    29.67    0.000    .2594175
       _cons |   7.510341    .0651005   115.37    0.000           .

48. Thus, educational differences between men and women explain much of the male-female grade difference (.83, or about 48%), but differences in experience do not.
Our measure of the male-female grade difference is the coefficient on male in a
bivariate regression. To see how well other variables explain the male-female
grade difference, we see how much the male coefficient shrinks when we add
those variables to the regression equation. This coefficient is the direct effect of
sex on grade. It is the amount of the sex difference in grades that cannot be
explained or interpreted by other variables in the regression model.
When we find indirect or spurious paths from the independent variable to the
dependent variable, we are interpreting or explaining part of the relationship of
the variables through the mechanisms (intervening variables) or joint causes
(antecedent variables). The larger the indirect or spurious effect we can
identify, the better we understand the bivariate relationship; in this case, the
better we understand why men make more than women.
Testing the Full Model
52. Now, we want to test the full model. We run one regression for each endogenous
variable, using it as the dependent variable and including all the variables that have
arrows pointing to it as independent variables. In this case, since edyrs, yos, and
grade are all endogenous variables, we need three regressions. We also need a
correlation matrix of the exogenous variables (age, male, and white).
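For reference, the Stata commands for these regressions and the correlation matrix (they produce Tables 8 through 11 below) are:

. reg edyrs male white age, beta
. reg yos male white age, beta
. reg grade male white age edyrs yos, beta
. pwcorr grade male white age edyrs yos, sig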

TABLE 8. reg edyrs male white age, beta

      Source |       SS       df       MS              Number of obs =   11212
-------------+------------------------------           F(  3, 11208) =  318.85
       Model |  4573.90303     3  1524.63434           Prob > F      =  0.0000
    Residual |  53593.2661 11208  4.78169754           R-squared     =  0.0786
-------------+------------------------------           Adj R-squared =  0.0784
       Total |  58167.1691 11211  5.18840149           Root MSE      =  2.1867

------------------------------------------------------------------------------
       edyrs |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
        male |   1.092329    .041971    26.03   0.000                 .2397843
       white |   .5543583   .0456685    12.14   0.000                  .112233
         age |   -.004195   .0022093    -1.90   0.058                -.0173355
       _cons |   13.79261   .1071627   128.71   0.000                        .
------------------------------------------------------------------------------

53. Men have about 1.09 more years of education than women of the same minority
status and age in this sample, and whites have about .55 more years of education than
minorities of the same sex and age (Table 8). There is also a very weak negative
relationship between age and education in this sample (an employee is expected to have
.04 fewer years of education than another employee of the same sex and minority status
who is 10 years younger). This age effect is only marginally significant in a one-tailed
test despite a sample of over 11,000, so it can probably be ignored.
TABLE 9. reg yos male white age, beta

      Source |       SS       df       MS              Number of obs =   11212
-------------+------------------------------           F(  3, 11208) = 1866.66
       Model |  311027.082     3  103675.694           Prob > F      =  0.0000
    Residual |  622500.383 11208  55.5407194           R-squared     =  0.3332
-------------+------------------------------           Adj R-squared =  0.3330
       Total |  933527.465 11211  83.2688846           Root MSE      =  7.4526

------------------------------------------------------------------------------
         yos |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
        male |  -.7669799   .1430421    -5.36   0.000                -.0420268
       white |   .2580153   .1556438     1.66   0.097                 .0130392
         age |   .5594478   .0075294    74.30   0.000                 .5770845
       _cons |  -8.692257   .3652232   -23.80   0.000                        .
------------------------------------------------------------------------------

54. Age clearly has the strongest impact on federal experience - an employee is
expected to have .56 more years of federal service than another employee of the same
sex and minority status who is one year younger (Table 9). Men, on average, have .77
fewer years of service than women of the same age and minority status, and whites have
.26 more years of service than minorities of the same sex and age - though this effect is
only marginally significant and can probably be ignored.
TABLE 10. reg grade male white age edyrs yos, beta

      Source |       SS       df       MS              Number of obs =   11212
-------------+------------------------------           F(  5, 11206) = 1608.18
       Model |    46397.80     5     9279.56           Prob > F      =  0.0000
    Residual |  64661.1183 11206  5.77022294           R-squared     =  0.4178
-------------+------------------------------           Adj R-squared =  0.4175
       Total |  111058.918 11211   9.9062455           Root MSE      =  2.4021

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
        male |   .8347822   .0474971    17.58   0.000                  .132618
       white |    .509818    .050513    10.09   0.000                 .0746977
         age |  -.0278426    .002967    -9.38   0.000                -.0832677
       edyrs |   .7304735   .0104176    70.12   0.000                 .5286481
         yos |   .1174303   .0030567    38.42   0.000                 .3404607
       _cons |   -2.18331   .1858223   -11.75   0.000                        .
------------------------------------------------------------------------------

55. Each of the independent variables has a highly significant direct effect on expected
grade holding the other variables constant (Table 10). Education and federal experience
have the strongest impacts, based on the standardized coefficients. Men and whites
have higher expected grades than comparable women and minorities. The direct effect
of age is negative, holding the other variables constant.
TABLE 11. pwcorr grade male white age edyrs yos, sig

             |    grade     male    white      age    edyrs      yos
-------------+------------------------------------------------------
       grade |   1.0000
             |
        male |   0.2756   1.0000
             |   0.0000
             |
       white |   0.1915   0.1710   1.0000
             |   0.0000   0.0000
             |
         age |   0.1350   0.0661   0.1066   1.0000
             |   0.0000   0.0000   0.0000
             |
       edyrs |   0.5486   0.2578   0.1514   0.0105   1.0000
             |   0.0000   0.0000   0.0000   0.2675
             |
         yos |   0.2590  -0.0017   0.0674   0.5757  -0.0726   1.0000
             |   0.0000   0.8603   0.0000   0.0000   0.0000

56. The exogenous variables are significantly, though fairly weakly, intercorrelated
(Table 11). Men, on average, are older and more likely to be white than women are, and
whites are older than minorities in the federal service.
57. Notice that the total effect of age on grade is positive (r = .1350), even though its
direct effect is negative. Age has its positive impact on grade through yos - the
indirect effect (measured in standard deviations) is .5771*.3405 = .1965, stronger than
the total effect. Thus, older employees tend to be in higher grades than younger
employees on average, but older employees tend to be in lower grades than younger
employees with the same length of federal service.
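The same product can be checked directly in Stata with the unrounded beta-weights from Tables 9 and 10:

. display .5770845 * .3404607

which returns about .1965, the indirect effect of age on grade through yos in standard-deviation terms.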
58. The direct effect of male on grade (.1326) is only half the size of its total effect
(.276, from Table 1). The primary reason is the indirect effect through education. Men
tend to have 1.09 more years of education than comparably aged women of the same
minority status, and each year of education has a direct effect of .73 grades, so men's
educational advantage seems to raise their mean grade by .80. Figured in terms of beta-
weights, the indirect effect through education (.2398*.5286 = .127) is nearly half the
total effect.
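Both versions of this calculation can be verified with the unrounded coefficients from Tables 8 and 10:

. display 1.092329 * .7304735
. display .2397843 * .5286481

The first product is about .80 (the education path in raw grades); the second is about .127 (the same path in beta-weight terms).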
59. On the other hand, there is very little evidence that much of the bivariate impact of
education on grade is spurious. The direct effect (.529) is nearly as large as the total
effect (.549, from Table 4).


Homework
1. Load tolerate3.dta.

2. Restrict the sample to whites: keep if white==1

3. The key variables are:

tolerate     a 16-point political tolerance scale. 0 means that the respondent
             preferred to deny a civil liberty (teaching college, giving a public
             speech, having an advocacy book in the public library) to a member
             of a group whose ideas some consider dangerous (atheists,
             communists, homosexuals, militarists, and racists) in every case. A
             value of 15 means the respondent preferred to allow the civil liberty
             in each case.
conserv      a 7-point ideological scale, where 0 means "extremely liberal" and 6
             means "extremely conservative."
educ         years of education
age          years of age
male         a dummy variable coded 1 for men, 0 for women

4. Run a correlation matrix for all five variables and have Stata calculate descriptive
statistics.

5. Run the following regressions:

reg tolerate conserv, beta


reg conserv age, beta
reg tolerate conserv age, beta
reg conserv male, beta
reg tolerate conserv male, beta
reg conserv age male, beta
reg tolerate conserv age male, beta
reg conserv educ, beta
reg educ age male, beta
reg conserv educ age male, beta
reg tolerate conserv educ age male, beta

6. I hypothesize that political tolerance declines with conservatism, partly because older
people tend to be both more conservative and less politically tolerant than younger
people. Draw a causal model with positive and negative signs indicating the expected
directions of the relationships. Is age an antecedent or intervening variable in this
model? Why? Do you expect the impact of conserv on tolerate to get weaker or
stronger once we add age? Why?
7. Now re-draw the model twice, filling in the appropriate regression coefficients in one
and the appropriate beta-weights and/or correlation coefficients in the other. Calculate
total, direct, indirect, and spurious effects of conserv on tolerate. Do the direct,
indirect, and spurious effects add up to the total effect? Calculate total, direct, indirect,
and spurious effects of age on tolerate. Do the direct, indirect, and spurious effects
add up to the total effect? Show me.
8. In the U.S., men tend to be more conservative but more politically tolerant than
women. Draw a three-variable causal model (with tolerate, conserv, and male).
Should adding male to the bivariate regression equation strengthen or weaken the
coefficient of conserv on tolerate? Calculate total, direct, indirect, and spurious
effects of conserv on tolerate. Do the direct, indirect, and spurious effects add up to
the total effect? Do the direct, indirect, and spurious effects of male add up to the total
effect?
9. Explain why the coefficient on conserv changes from model to model as we add and
take out different variables.


Lecture 3A. The Elaboration Model


(A somewhat different take on the issues, with lots of repetition)
1. The elaboration model adds control variables to understand a bivariate
relationship better. Sometimes we are interested in the way X influences Y and propose
one or more intervening variables as possible mechanisms. Sometimes we suspect that
the relationship is not causal at all, but that both X and Y are responding to some third,
antecedent variable. (Here we are arguing that the relationship between X and Y is
spurious, that it appears to be causal but really is not.)
2. We use the same approach to test both interpretations (intervening variable
models) and explanations (antecedent variable models). It is often useful to draw
causal diagrams to specify what we think the causal links are. We draw arrows from
causes to effects. To test our causal models, we generate new contingency tables to see
whether the hypothesized links exist. The intervening variable model argues that X
causes Z and that Z in turn causes Y. The antecedent variable model argues that Z
causes both X and Y. We use simple contingency tables to see whether X is related to Z
and whether Z is related to Y in the ways we hypothesized.

3. We repeat the regression analysis of Y on X, but this time controlling for Z (or a
set of Z's). That is, we examine the relationship between X and Y when Z is held
constant. Whether we consider Z to be an intervening or an antecedent variable, our
expectations are the same. If the explanation or interpretation is total, the multiple
regression coefficient should fall to zero. If the explanation or interpretation is
worthless, the multiple regression coefficient should be equal to the bivariate
regression coefficient. If the explanation or interpretation is partial, the multiple
regression coefficient should be smaller than the bivariate regression coefficient but
should have the same sign as that coefficient. In general, the closer to zero the multiple
regression coefficient is, the stronger the interpretation or explanation. The closer the
multiple regression coefficient is to the bivariate regression coefficient, the weaker
the interpretation or explanation.
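A minimal Stata sketch of this comparison, using the federal grade example developed below (the bivariate coefficient on male is the relationship to be explained; the coefficient after adding the control is what remains unexplained):

. reg grade male            /* bivariate regression of Y on X */
. reg grade male edyrs      /* the same regression controlling for Z */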
In 1995, the mean grade of men was 2.46 higher than the mean grade of women.
Men were both more educated and more experienced than women in the federal
service at that time. Controlling for education drops the male coefficient from
2.46 to 1.49, meaning that nearly a full grade of the difference in mean grades
(about 40%) can be attributed to educational differences. (The 1.49 grade
difference remains unexplained.) In contrast, controlling for federal experience
only lowers the male coefficient from 2.46 to 2.32, implying that only .14 of the
grade gap was due to experience differences. Adding yos to a model that already
includes edyrs further lowers the male coefficient from 1.49 to 1.28, implying
that experience differences explain a 0.2 grade difference between equally
educated men and women.


Note that explaining grade differences is an entirely different issue than
explaining variation in grade levels. Edyrs uniquely explains about 22% of the
variation in grade in a model that includes male, but it explains about 40% of
the difference in mean grades between men and women.
. reg grade male

      Source |       SS       df       MS              Number of obs =   13204
-------------+------------------------------           F(  1, 13202) = 2135.38
       Model |  19887.3349     1  19887.3349           Prob > F      =  0.0000
    Residual |  122953.495 13202  9.31324764           R-squared     =  0.1392
-------------+------------------------------           Adj R-squared =  0.1392
       Total |   142840.83 13203  10.8188162           Root MSE      =  3.0518

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   2.455733   .0531427    46.21   0.000     2.351566      2.5599
       _cons |   8.325618   .0381649   218.15   0.000     8.250809    8.400426
------------------------------------------------------------------------------

. reg grade male edyrs

      Source |       SS       df       MS              Number of obs =   13199
-------------+------------------------------           F(  2, 13196) = 3646.33
       Model |  50815.2337     2  25407.6168           Prob > F      =  0.0000
    Residual |  91949.6578 13196  6.96799468           R-squared     =  0.3559
-------------+------------------------------           Adj R-squared =  0.3558
       Total |  142764.891 13198   10.817161           Root MSE      =  2.6397

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   1.493481   .0481884    30.99   0.000     1.399025    1.587937
       edyrs |   .7202805    .010809    66.64   0.000     .6990933    .7414677
       _cons |  -1.569048   .1521207   -10.31   0.000    -1.867227    -1.27087
------------------------------------------------------------------------------

. reg grade male yos

      Source |       SS       df       MS              Number of obs =   13204
-------------+------------------------------           F(  2, 13201) = 1765.02
       Model |  30137.6717     2  15068.8359           Prob > F      =  0.0000
    Residual |  112703.159 13201   8.5374713           R-squared     =  0.2110
-------------+------------------------------           Adj R-squared =  0.2109
       Total |   142840.83 13203  10.8188162           Root MSE      =  2.9219

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |    2.31822   .0510358    45.42   0.000     2.218183    2.418258
         yos |   .1036783   .0029921    34.65   0.000     .0978132    .1095433
       _cons |   6.826369   .0566337   120.54   0.000     6.715359    6.937379
------------------------------------------------------------------------------

. reg grade male edyrs yos

      Source |       SS       df       MS              Number of obs =   13199
-------------+------------------------------           F(  3, 13195) = 3601.90
       Model |  64276.2722     3  21425.4241           Prob > F      =  0.0000
    Residual |  78488.6192 13195  5.94836068           R-squared     =  0.4502
-------------+------------------------------           Adj R-squared =  0.4501
       Total |  142764.891 13198   10.817161           Root MSE      =  2.4389

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |    1.28282    .044743    28.67   0.000     1.195117    1.370523
       edyrs |   .7593479   .0100206    75.78   0.000     .7397061    .7789897
         yos |    .119236   .0025065    47.57   0.000     .1143229    .1241491
       _cons |  -3.829959   .1483689   -25.81   0.000    -4.120784   -3.539135
------------------------------------------------------------------------------

4. There is no way to tell from the regression analysis whether Z is intervening or
antecedent, because the regression analysis will give the same results in either case. We
must know before the analysis whether Z is intervening or antecedent, and our
argument for that must be logical, based on the time-order of the relationship between
X and Z. If the third variable (Z) precedes the original independent variable (X) in time
or in logic (if Z causes X), then Z must be an antecedent variable. If X causes Z, then Z
must be an intervening variable.
5. Third variables are not always intervening or antecedent. One other possibility is that
a third variable may simply have an independent effect. There is no causal link
between one's sex and one's minority status: both are determined at birth. Because both
sex and minority status appear to affect grade level, and because women in the federal
service were more likely than men to be minorities, controlling for minority status
lowers the male coefficient from 2.46 to 2.27, suggesting that 0.2 of the male-female
grade difference was due to race rather than sex differences.
. reg grade male minority

      Source |       SS       df       MS              Number of obs =   13203
-------------+------------------------------           F(  2, 13200) = 1306.39
       Model |  23598.4413     2  11799.2206           Prob > F      =  0.0000
    Residual |  119221.299 13200  9.03191662           R-squared     =  0.1652
-------------+------------------------------           Adj R-squared =  0.1651
       Total |  142819.741 13202  10.8180382           Root MSE      =  3.0053

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   2.271358    .053114    42.76   0.000     2.167247    2.375469
    minority |  -1.208898   .0595582   -20.30   0.000    -1.325641   -1.092155
       _cons |   8.751417   .0430322   203.37   0.000     8.667068    8.835767
------------------------------------------------------------------------------

6. A third variable may also be a suppressor variable. Suppressor variables hide the
strength of the original relationship between X and Y.
For instance, the mean grade of federal employees in Washington, DC, in 1995
was 1.91 higher than the mean grade of federal employees in the rest of the
country. This was in spite of the fact that a higher percentage of employees in DC
are minorities and that the expected grade of minorities was 1.78 lower than that
of nonminorities in the same location.

Minority status is a suppressor variable for the relationship between DC location
and grade. Once minority is controlled, the grade advantage to being in DC rises
from 1.91 to 2.17. DC's "disadvantage" in racial composition partly masks the
grade advantage employees in DC experience.
Suppressor variables do not help us understand the original relationship - the
relationship to explain gets stronger rather than weaker when we control for a
suppressor variable. In this case, once we have controlled for minority status, we
need to explain why employees in DC are in positions 2.17 (rather than only 1.91)
grades higher than those in the rest of the country, on average.
. reg grade dc

      Source |       SS       df       MS              Number of obs =   13206
-------------+------------------------------           F(  1, 13204) =  401.29
       Model |  4214.04682     1  4214.04682           Prob > F      =  0.0000
    Residual |  138658.404 13204  10.5012423           R-squared     =  0.0295
-------------+------------------------------           Adj R-squared =  0.0294
       Total |   142872.45 13205  10.8195722           Root MSE      =  3.2406

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          dc |   1.912008   .0954466    20.03   0.000     1.724919    2.099097
       _cons |   9.406957   .0296688   317.07   0.000     9.348802    9.465112
------------------------------------------------------------------------------

. reg grade dc minority

      Source |       SS       df       MS              Number of obs =   13203
-------------+------------------------------           F(  2, 13200) =  630.67
       Model |  12457.0076     2  6228.50379           Prob > F      =  0.0000
    Residual |  130362.733 13200  9.87596463           R-squared     =  0.0872
-------------+------------------------------           Adj R-squared =  0.0871
       Total |  142819.741 13202  10.8180382           Root MSE      =  3.1426

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          dc |   2.169589   .0929936    23.33   0.000     1.987308     2.35187
    minority |  -1.781577   .0616525   -28.90   0.000    -1.902424   -1.660729
       _cons |   9.869826   .0329189   299.82   0.000       9.8053    9.934352
------------------------------------------------------------------------------

7. A third variable can also be a distorter variable (also called a lurking variable).

Distorter variables make the independent variable appear to have the opposite impact
on the dependent variable than the one it really has.
For instance, suppose that the civil service discriminated against men with
beards, paying them less than men without beards. A simple relationship
between (presence/absence of a beard) and (grade) would show people with
beards to be in higher grades than people without beards. This is because all the
people with beards are men and only half the people without beards are men, and
because an all-male group will almost always have higher grades than a sexually
integrated group. If we failed to control for sex (the distorter variable), we would
conclude that there was discrimination in favor of beards. Once we controlled
for sex, we would see that the discrimination was in favor of men, but that within
the male group there was discrimination against bearded men.
We can apply this logic to the status of veterans in the federal service. Veterans
receive preferential treatment in hiring, promotions, and lay-offs, but it's not
clear how much impact veteran status has on grade level. A bivariate regression
shows that the mean grade of veterans was 0.7 grade higher than that of
nonveterans in 1995, but once we control for sex, veteran status becomes a grade
disadvantage rather advantage - veterans tended to be 0.8 grade lower than
nonveterans of their same sex. Veterans appear to have held higher grades
because they were men rather than because they were veterans .
Distorter variables more than fully explain the original relationship. In fact, they
leave us with an opposite relationship to explain. Instead of trying to explain why
veterans are o. 7 grade higher than nonveterans, we need to explain why they are
0.8 grade lower.
. reg grade vet

      Source |       SS       df       MS              Number of obs =   13206
-------------+------------------------------           F(  1, 13204) =  109.65
       Model |  1176.69357     1  1176.69357           Prob > F      =  0.0000
    Residual |  141695.757 13204  10.7312751           R-squared     =  0.0082
-------------+------------------------------           Adj R-squared =  0.0082
       Total |   142872.45 13205  10.8195722           Root MSE      =  3.2759

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         vet |   .6896726   .0658623    10.47   0.000      .560573    .8187721
       _cons |   9.419518   .0329087   286.23   0.000     9.355012    9.484023
------------------------------------------------------------------------------

. reg grade vet male

      Source |       SS       df       MS              Number of obs =   13204
-------------+------------------------------           F(  2, 13201) = 1141.49
       Model |  21060.7174     2  10530.3587           Prob > F      =  0.0000
    Residual |  121780.113 13201  9.22506726           R-squared     =  0.1474
-------------+------------------------------           Adj R-squared =  0.1473
       Total |   142840.83 13203  10.8188162           Root MSE      =  3.0373

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         vet |   -.775137   .0687295   -11.28   0.000    -.9098568   -.6404172
        male |   2.763784    .059527    46.43   0.000     2.647102    2.880465
       _cons |   8.360289    .038108   219.38   0.000     8.285592    8.434986
------------------------------------------------------------------------------

The Logic of Control Variables


8. Why do regression coefficients change when we control for a third variable? If two
variables (X and Z) are strongly related to each other and to Y, then a bivariate
regression of Y on X will secretly include the relationships between X and Z and between
Z and Y.
9. Consider a one-on-one basketball game between myself and Magic Johnson. As he is
the far superior player, he beats me by 50 points. To make the game more even, we play
again with him playing barefoot and me wearing shoes. Since I now have an advantage
on the third variable (basketball shoes), he beats me by only 40 points. On the other
hand, if I play barefoot and he wears shoes, he beats me by 60 points. Which of these
three point-spreads is the best measure of how much better a basketball player he is
than me? Ideally, we want the two of us to play under identical conditions to get the
best measure of the difference in our abilities. If he wears shoes and I don't, the 60-point spread overstates how much better he is. If I wear shoes and he doesn't, the 40-point spread understates the difference. I need to control the third variable (shoes) to
get the best measure of the difference between the two players.
10. As a group, veterans are better educated than nonveterans. When I compare all
veterans to all nonveterans, the groups differ in at least two respects: the first group has
advantages from veterans' preference and from higher educational levels. If I simply
compare veterans to nonveterans, it is as if we compare me playing barefoot to Magic
playing with shoes: we will overstate the advantage directly attributable to veterans'
preference. We want to control for that intervening variable, to allow both veterans and
nonveterans to play wearing shoes. When we compare veterans and nonveterans with
the same amount of education, we have statistically eliminated the educational
advantage of the veterans and will get a clearer measure of the true advantage to having
veterans' preference. When we take away an advantage of the winning player or group,
the winning team will win by less (partial interpretation) or tie (total interpretation) or
lose (distorter variable).
11. DC workers have higher grades than workers outside DC. A lower percentage of
federal workers inside than outside DC are white. Since whites are generally in higher
grades than nonwhites, the nonDC workforce has a grade advantage in its racial
composition. In this case, the losing team (those outside DC) have the advantage on the
third variable (race). (This is comparable to me playing with shoes and Magic playing
barefoot; he still beats me, but not by as much as if we were equal on footwear.) When
we eliminate their advantage statistically, the grades of those outside DC will fall relative
to workers inside DC (of the same minority status), and the relationship between DC
and grade gets stronger (the multiple regression coefficient is higher than the bivariate
regression coefficient).

12. Note that controlling for a third variable will not affect the relationship between X
and Y unless Z is related to both X and Y. If the X-groups do not differ on Z or if
changes in the value of Z have no effect on Y, then controlling for Z will not change the
regression coefficient. Sex could explain the veteran-grade relationship only if veterans
were more likely than nonveterans to be men AND if men earned higher grades than
women. Education could interpret the relationship only if veterans had higher
educational levels than nonveterans AND if people with more education earned higher
grades. Therefore, controlling for a third variable which is unrelated to one of the
variables will not alter the strength of the original relationship.
13. Likewise, a third variable (Z) can only explain or interpret the relationship between
X and Y if the X-group that has the advantage on Y also has the advantage on Z.
Controlling for a variable on which the losing X-group has the advantage will increase
the absolute value of the regression coefficient and leave us with more (not less) to
explain or interpret.


MORE ADVANCED TECHNIQUES


Interaction Terms
1. Frequently, we expect some independent variables to have different effects for
different groups. For instance, we may think that additional education raises men's
salaries more than it raises women's salaries or that being a minority lowers women's
salaries more than it lowers men's salaries. The way to test these hypotheses is by using
interaction terms.
2. We previously examined the relationship between salary, sex, and education:
(1)

SAL-hat = -14,367 + 7983 male + 3076 edyrs

TABLE 1. reg sal male edyrs

      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  2,  3481) = 1033.48
       Model |  2.7323e+11     2  1.3662e+11           Prob > F      =  0.0000
    Residual |  4.6015e+11  3481   132189887           R-squared     =  0.3726
-------------+------------------------------           Adj R-squared =  0.3722
       Total |  7.3338e+11  3483   210560961           Root MSE      =   11497

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   7983.294   415.2969    19.22   0.000     7169.044    8797.544
       edyrs |   3075.956   96.11011    32.00   0.000     2887.518    3264.394
       _cons |  -14367.21   1335.065   -10.76   0.000     -16984.8   -11749.62
------------------------------------------------------------------------------

3. This equation necessarily gives the formulas for two parallel lines, one for the men
and one for the women.
For women, male =0, so the equation simplifies to:
(1a)    SAL-hat = -14,367 + 3076 edyrs

For men, male=1, so the equation simplifies to:

(1b)    SAL-hat = (-14,367 + 7983) + 3076 edyrs
                = -6384 + 3076 edyrs

The two lines have the same slope (expected salary rises by $3,076 for each
additional year of education) but different y-intercepts (-14,367 for women, and
-6384 for men). By assumption, education raises salary as rapidly for women as
for men ($3,076 per year), and men's salaries are expected to be higher than
women's by the same amount ($7,983) at every level of education.


4. Note, however, that salary rising as rapidly with education for women as for men is
an assumption built into the model rather than a conclusion that we can draw from the
results. A regression model with one dummy independent variable and one interval-level independent variable always yields two parallel lines.
5. Suppose we want to consider the possibility that additional education raises men's
salaries more than it raises women's. This implies that the salary gap between men and
women will be larger among the more-educated than among the less-educated.
6. To test this possibility, we create an interaction term between male and edyrs by
multiplying them times each other. The new variable maledyrs always has value 0 for
women (because male x edyrs = 0 when male=0), and maledyrs = edyrs for men
(because we are always multiplying edyrs times 1 for the men).
7. When we add maledyrs to the regression above, we get the following results:
(2)

Sal-hat = -7474 - 4388 male + 2568.6 edyrs + 870.6 maledyrs

. gen maledyrs = male * edyrs


TABLE 2. reg sal male edyrs maledyrs

      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  3,  3480) =  699.45
       Model |  2.7587e+11     3  9.1956e+10           Prob > F      =  0.0000
    Residual |  4.5752e+11  3480   131469990           R-squared     =  0.3762
-------------+------------------------------           Adj R-squared =  0.3756
       Total |  7.3338e+11  3483   210560961           Root MSE      =   11466

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |  -4388.044   2792.978    -1.57   0.116    -9864.085    1087.998
       edyrs |   2568.554    148.393    17.31   0.000     2277.608      2859.5
    maledyrs |   870.6214   194.3802     4.48   0.000     489.5107    1251.732
       _cons |  -7473.612   2035.078    -3.67   0.000    -11463.68   -3483.545
------------------------------------------------------------------------------

. predict intsalhat
(option xb assumed; fitted values)

8. This again is the equation for two lines, but the lines are no longer parallel.
For women, both male and maledyrs are always 0, so women's regression line
is just:
(2a)    Sal-hat = -7474 + 2568.6 edyrs

For men, male always equals 1, so the male coefficient can be added to the
y-intercept, and maledyrs always equals edyrs, so their coefficients can be added.

(2b)    Sal-hat = (-7474 - 4388) + (2568.6 + 870.6) edyrs
                = -11,862 + 3439 edyrs

The y-intercept is lower for the men's line than for the women's line, but the slope
is steeper for the men.
9. If we actually run separate regressions for men and women, we get these same
regressions.
. bysort male: reg sal edyrs

_______________________________________________________________________________
-> male = women

      Source |       SS       df       MS              Number of obs =    1708
-------------+------------------------------           F(  1,  1706) =  430.98
       Model |  3.9389e+10     1  3.9389e+10           Prob > F      =  0.0000
    Residual |  1.5592e+11  1706  91394759.7           R-squared     =  0.2017
-------------+------------------------------           Adj R-squared =  0.2012
       Total |  1.9531e+11  1707   114416311           Root MSE      =  9560.1

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   2568.554    123.726    20.76   0.000     2325.883    2811.224
       _cons |  -7473.612   1696.791    -4.40   0.000    -10801.62   -4145.602
------------------------------------------------------------------------------

_______________________________________________________________________________
-> male = men

      Source |       SS       df       MS              Number of obs =    1776
-------------+------------------------------           F(  1,  1774) =  580.26
       Model |  9.8649e+10     1  9.8649e+10           Prob > F      =  0.0000
    Residual |  3.0160e+11  1774   170009078           R-squared     =  0.2465
-------------+------------------------------           Adj R-squared =  0.2460
       Total |  4.0024e+11  1775   225490048           Root MSE      =   13039

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   3439.175   142.7725    24.09   0.000     3159.155    3719.195
       _cons |  -11861.66   2175.279    -5.45   0.000    -16128.04   -7595.276
------------------------------------------------------------------------------

10. Let's calculate the expected salaries of men and women high school and college
graduates (with 12 and 16 years of education).

Sal-hat(F12) = -7474 + 2569(12)   = $23,354
Sal-hat(M12) = -11862 + 3440(12)  = $29,418
Sal-hat(F16) = -7474 + 2569(16)   = $33,630
Sal-hat(M16) = -11862 + 3440(16)  = $43,178

The salary difference between men and women is higher at 16 years of education
($43,178 - $33,630 = $9,548) than at 12 years of education ($29,418 - $23,354 =
$6,064).


The additional four years of education from high school to college graduation
increased the expected earnings of men by $13,760 and only increased the
expected earnings of women by $10,276; the difference is 4 * 870.6.
The salary gap between men and women grew by $3,484 (9548 - 6064) over this
four-year period, that is, by $871 per additional year of education.
11. Let's use these patterns to interpret Table 2.
The y-intercept is the expected salary of women with zero years of education.
(The y-intercept in 2a (the women's line) is -7474.)
The coefficient on male is the expected difference in salary between men and
women with zero years of education. (Note: we are not just holding education
constant; we are holding it constant at zero). The y-intercept in 2b (the men's
line) is -11862, which is 4388 lower than the women's intercept of -7474.
The coefficient on edyrs is the increase in the expected salary of women
associated with a one-year increase in education. This is the slope of the women's
line (2a).
The coefficient on maledyrs is the difference between the edyrs coefficients in
the male and female equations. (The men's edyrs coefficient of 3440 is 871
higher than the women's edyrs coefficient of 2569.) In this case, the maledyrs
coefficient tells us that an additional year of education raises the expected
salaries of men by $871 more than it raises the expected salary of women. It also
tells us that the expected salary difference between men and women rises $871
with each additional year of education.
The coefficient on the interaction term is a difference of differences. Expected
salaries rise by $871 more per year of education for men than for women (for the
named group than for the reference group). The difference between the expected
salaries of men and women rises $871 with each additional year of education.
12. Although the women's line is higher than the men's line by $4,388 at zero years of
education, the men gain on the women by $871 per year of education. How long will it
take the men to catch up with the women? $4388/$871 = 5.0, so the women's line is
above the men's line up to 5 years of education. Above that level of education, the men's
line is consistently higher than the women's and rising faster. Since no one in our data
set has five or fewer years of education, men are expected to have higher salaries than
women at all relevant levels of education.


If we graph the expected values from Table 2, we get two diverging lines. Since
the lines cross at 5 years of education and the lowest level of education in the data
set is 6 years, we do not see them crossing.

GRAPH 1. twoway (scatter intsalhat edyrs if male==0, msymbol(triangle) mcolor(pink)
    msize(medlarge)) (scatter intsalhat edyrs if male==1, msymbol(square) mcolor(blue)
    msize(medlarge))

[Graph: fitted values plotted against years of education completed, with separate
markers for women and men]

13. In the general case, X1 is an interval level variable, D is a dummy variable, and D*X1
is an interaction term. The interaction term equals X1 when D=1 and it equals 0 when
D=0.
Y-hat = b0 + b1X1 + b2D + b3(D*X1)
14. This equation produces two lines. Unlike the equation without the interaction term,
these two lines are not parallel (unless b3 = 0). Using the distributive property, we can
rearrange this equation to:
Y-hat = (b0 + b2D) + (b1 + b3D)*X1
The equation for the reference group (D=0) simplifies to:
Y-hat = b0 + b1X1
The equation for the named group (D=1) simplifies to:
Y-hat = (b0 + b2) + (b1 + b3)*X1
36

15. When D=0, the y-intercept is b0, the slope is b1, and the regression simplifies to
Y-hat=b0 + b1X1. When D=1, the y-intercept becomes (b0 + b2), and the slope becomes
(b1 + b3). Thus, b0 is the y-intercept for the reference group, b1 is the slope for the
reference group, b2 is the difference in y-intercepts between the named group and the
reference group, and b3 is the difference in slopes between the named group and the
reference group.
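In Stata, the general recipe follows the maledyrs example above. The names y, x1, and d here are placeholders rather than variables in the data set:

. gen dx1 = d * x1          /* interaction of the dummy with the interval-level variable */
. reg y x1 d dx1            /* b3 in the notation above is the coefficient on dx1 */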
Interactions of Two Dummy Independent Variables
16. We can also create interaction terms by multiplying two dummy independent
variables times each other. The new interaction term is also a dummy variable. It
equals 1 only if both of the other dummy variables equal one.
17. Suppose we are interested in understanding double discrimination against
minority women. One way to interpret double discrimination is simply that minority
women earn less than both minority men and non-minority women. Another way to
interpret double discrimination, however, is that minority women face all the
discrimination faced by minority men plus all the discrimination faced by non-minority
women, so that the salary gap between minority and non-minority women is as big as or
bigger than the salary gap between minority and non-minority men.
18. To understand this in more depth, we need to create an interaction term
(minmale) by multiplying minority times male. Our interaction term minmale=1
only for minority men, when both minority and male are 1. If either minority=0 (the
employee is a non-minority) or male=0 (the employee is a woman), then minmale=0.
19. Previously, we regressed salary (the dependent variable) on male and minority
(the independent variables):
(3)

SAL-hat = 28,946 + 11,863 male - 4531 minority

. reg sal male minority

      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  2,  3481) =  451.17
       Model |  1.5097e+11     2  7.5486e+10           Prob > F      =  0.0000
    Residual |  5.8241e+11  3481   167311399           R-squared     =  0.2059
-------------+------------------------------           Adj R-squared =  0.2054
       Total |  7.3338e+11  3483   210560961           Root MSE      =   12935

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   11862.86   445.8096    26.61   0.000     10988.79    12736.94
    minority |  -4530.764   511.2052    -8.86   0.000    -5533.057   -3528.472
       _cons |   28945.56   357.0325    81.07   0.000     28245.55    29645.57
------------------------------------------------------------------------------

20. This regression generates only four values: the expected salaries for minority men,
minority women, non-minority men, and non-minority women. These expected salaries


are not mean salaries, because there are four expected salaries but only three constants,
the y-intercept plus the two regression coefficients.
21. The coefficient on male is the expected difference in salary between men and
women of the same minority status. The coefficient on minority shows an expected
difference in salary of $4,531 between minorities and non-minorities of the same sex.
22. The expected difference in salaries between men and women is the same for both
minorities and non-minorities. The expected difference in salaries between minorities
and non-minorities is the same for both men and women.
Again, this is an assumption built into the model rather than a conclusion we can
draw from the results. With this regression model, there is no way to discover
whether the salary disadvantage to being female is larger or smaller for
minorities than for non-minorities.
23. When we add the interaction term minmale, the regression becomes:
(4)

sal-hat = 28,367 + 12,891 male - 2,808 minority - 4,255 minmale

. gen minmale=minority * male


TABLE 4. reg sal male minority minmale

      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  3,  3480) =  307.73
       Model |  1.5377e+11     3  5.1255e+10           Prob > F      =  0.0000
    Residual |  5.7962e+11  3480   166556850           R-squared     =  0.2097
-------------+------------------------------           Adj R-squared =  0.2090
       Total |  7.3338e+11  3483   210560961           Root MSE      =   12906

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   12890.95   510.7617    25.24   0.000     11889.53    13892.37
    minority |  -2808.422   661.0932    -4.25   0.000    -4104.592   -1512.252
     minmale |  -4255.363   1039.133    -4.10   0.000    -6292.736   -2217.991
       _cons |   28366.74   383.2434    74.02   0.000     27615.34    29118.15
------------------------------------------------------------------------------

24. First, let's calculate the expected salaries. [Note: there are now four expected values
and four constants (the y-intercept plus the three regression coefficients), so the
expected salaries are mean salaries.]

SAL-hat(NF) = 28,367                           = $28,367
SAL-hat(MF) = 28,367 - 2,808                   = $25,559
SAL-hat(NM) = 28,367 + 12,891                  = $41,258
SAL-hat(MM) = 28,367 + 12,891 - 2,808 - 4,255  = $34,195

Non-minority males earn $12,891 more (the coefficient on male) than non-minority females (41,258 - 28,367).
Minority males earn $8,636 more than minority females (34195 - 25559).
The salary advantage to being male is $4,255 less (the coefficient on minmale)
for minorities than for non-minorities (8636 - 12891).
Female minorities earn $2,808 less (the coefficient on minority) than female
non-minorities (25559-28367).
Male minorities earn $7,063 less than male non-minorities (34195 - 41258).
The salary advantage to being minority is $4,255 less (the coefficient on
minmale) for men than for women (-7063 - (-2808)).
The following table shows that these are, in fact, the sample means.
TABLE 4A. table male minority, c(mean sal)

-----------------------------------------------------
           |          minority
      male |        0          1    difference (1-0)
-----------+-----------------------------------------
     women | 28366.74   25558.32            -2808.42
       men | 41257.69   34193.91            -7063.78
diff (1-0) | 12890.95    8635.59            -4255.36
-----------------------------------------------------

25. Now, let's interpret the coefficients.


The y-intercept (28367) is the mean salary of the reference group, non-minority
females.
The coefficient on male (12891) is the difference between the mean salaries of
male and female non-minorities ($41,258 - $28,367). We are holding minority
constant at zero. You can also see this in the first column of Table 4A.
The coefficient on minority (-2808) is the difference between the mean salaries
of minority and non-minority females ($25,559 - $28,367). We are holding
male constant at zero. You can also see this in the first row of Table 4A.
The coefficient on minmale (-4255) is the additional salary disadvantage to
being minority for males than for females (7063 - 2808). It also shows that the
salary advantage to being male is $4,255 less for minorities than for non-minorities (8636 - 12891). You can also see this in the final column and the final
row of Table 4A.
The coefficient on the interaction term is a difference of differences. The
difference between the mean salaries of minorities and non-minorities is
$4255 lower for men than for women (for the named group than for the
reference group). The difference between the mean salaries of men and
women is $4255 lower for minorities than for non-minorities (for the
named group than for the reference group).
Curvilinear Relationships
27. So far, we have looked only at linear relationships, but many relationships between
dependent and independent variables are curvilinear rather than linear. Salaries, for
instance, generally rise faster earlier than later in people's careers. There are several
ways to incorporate curvilinear relationships into regressions by altering the
independent or dependent variable (or both). We will look at one example of each.
A. Squared terms
28. The most common way to model a curvilinear relationship is to include the squared
value of the independent variable in the model. Below, both yos (federal experience
measured in years) and yossq (yos squared) are included in the model:
(5)

SAL-hat = 22,356 + 1002 yos - 8.19 yossq

. gen yossq=yos^2
TABLE 5. reg sal yos yossq

      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  2,  3481) =  428.06
       Model |  1.4477e+11     2  7.2383e+10           Prob > F      =  0.0000
    Residual |  5.8862e+11  3481   169094355           R-squared     =  0.1974
-------------+------------------------------           Adj R-squared =  0.1969
       Total |  7.3338e+11  3483   210560961           Root MSE      =   13004

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         yos |   1002.409   87.77899    11.42   0.000     830.3056    1174.513
       yossq |  -8.191972   2.663002    -3.08   0.002    -13.41318   -2.970768
       _cons |   22356.48   610.3814    36.63   0.000     21159.74    23553.23
------------------------------------------------------------------------------

29. We cannot interpret the coefficients as the effect of increasing yos while holding
yossq constant, because we cannot increase one without increasing the other. Yossq
increases much more rapidly than yos, so as seniority increases, the relative importance
of its negative coefficient grows. Thus, the first year of service increases expected salary
more than do later years of service. Look at a few examples:
sal-hat (0)  = 22356                            = $22,356
sal-hat (1)  = 22356 + 1002 (1) - 8.19 (1)      = $23,350
sal-hat (2)  = 22356 + 1002 (2) - 8.19 (4)      = $24,327
sal-hat (3)  = 22356 + 1002 (3) - 8.19 (9)      = $25,288
sal-hat (10) = 22356 + 1002 (10) - 8.19 (100)   = $31,557
sal-hat (11) = 22356 + 1002 (11) - 8.19 (121)   = $32,387
sal-hat (20) = 22356 + 1002 (20) - 8.19 (400)   = $39,120
sal-hat (21) = 22356 + 1002 (21) - 8.19 (441)   = $39,786
30. The first year of service increases expected salary by $994. The second increases
expected salary by $977, the third by $961, the eleventh year by $830, the twenty-first
by $666, and the thirty-first by $502.
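These year-to-year gains follow from equation (5): the increase from year t-1 to year t of service is 1002 - 8.19(2t - 1). A quick check with the rounded coefficients:

. display 1002 - 8.19*(2*1 - 1)       /* first year of service: about $994 */
. display 1002 - 8.19*(2*21 - 1)      /* twenty-first year: about $666 */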
Graph 1. Salary as Quadratic and Linear Function of Experience

. twoway (scatter salyos yos, msymbol(triangle) mcolor(pink) msize(medlarge))
    (scatter salyossq yos, msymbol(square) mcolor(blue) msize(medlarge))

[Graph: fitted values plotted against years of federal service]

31. The coefficients on both the linear and the squared terms can be either positive or
negative.
If both coefficients are positive and the values of X have to be positive, the
expected value of the dependent variable increases at an increasing rate (as the
independent variable rises, the expected value grows at a faster rate).
If both coefficients are negative and the values of X have to be positive, the
expected value of the dependent variable decreases at an increasing rate (as the
independent variable rises, the expected value shrinks at a faster rate).


If the linear coefficient is positive and the squared coefficient is negative (as in
this example), the expected value of the dependent variable increases at a
decreasing rate (as the independent variable rises, the expected value grows at a
slower rate), eventually peaks, then declines at an increasing rate.
If the linear coefficient is negative and the squared coefficient is positive, the
expected value of the dependent variable decreases at a decreasing rate (as the
independent variable rises, the expected value shrinks at a slower rate),
eventually bottoms out, then rises at an increasing rate.
32. The general form of this equation is:
Y-hat = b0 + b1 X + b2 X^2
33. From the calculus, we determine that the value of X that gives the maximum (or
minimum) expected value for Y is -b1 / 2b2 .
d y-hat/d X = b1 + 2 b2 X = 0
2 b2 X = -b1
X = -b1/2b2
In the yos case, the expected value of salary reaches its peak when yos =
-1002/(2*(-8.19)) = 61. Thus, increased federal experience appears to have a
positive impact on salary up through 61 years of service. Since no one in the data
set has anywhere near that much experience, experience has a positive impact on
salary throughout its relevant range.
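The same turning point can be computed from the unrounded coefficients in Table 5:

. display -1002.409/(2*(-8.191972))

which returns about 61.2 years of service.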
34. On the other hand, when we add both age and agesq to a salary equation, the
regression is:
(6)

sal-hat = -11,192 + 1756 age - 15.6 agesq

. gen agesq=age^2
TABLE 6. reg sal age agesq

      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  2,  3481) =  191.63
       Model |  7.2737e+10     2  3.6368e+10           Prob > F      =  0.0000
    Residual |  6.6065e+11  3481   189786622           R-squared     =  0.0992
-------------+------------------------------           Adj R-squared =  0.0987
       Total |  7.3338e+11  3483   210560961           Root MSE      =   13776

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1755.878   160.1742    10.96   0.000     1441.833    2069.923
       agesq |  -15.56401   1.811977    -8.59   0.000    -19.11666   -12.01137
       _cons |  -11192.18   3418.794    -3.27   0.001    -17895.22   -4489.132
------------------------------------------------------------------------------

. predict agesalhat
(option xb assumed; fitted values)

. scatter agesalhat age

[Graph: fitted values plotted against age at last birthday]

Expected salary increases with age, but each additional year of age raises
expected salary less than the year before. Expected salary reaches its peak when
age = -1756/(2*(-15.6)) = 56. This equation suggests that federal employees in their
60s earn less, on average, than employees in their mid-50s. The regression line is
curved like an upside-down bowl. Expected salary increases up to age 56, but at a
slower and slower rate. After age 56, expected salary starts to decline, very slowly
at first, but then faster and faster.
Remember, these are cross-sectional data. They cover 3,484 people at
many different ages. The data do not follow individuals over time. This
equation does not say that an individual's salary is expected to decrease as
he or she passes age 56. Instead, it suggests that employees in their 60s
who are currently working for the federal government tend to be earning
less than other employees who are in their mid-50s.
Also, regression lines tend to be better measures of relationships near their
centers than at their ends. In this sample, 89% of employees are 56 or under.
Because we have forced Stata to generate a parabola by the form of the model
(with X and X^2 in the model), it may overstate the downturn at higher ages.

To see the actual pattern in the data, I use the collapse command. In this case,
it calculates the mean salary for each value of age, allowing me to graph the
means. [Note that the collapse command completely transforms the data set.
Do not save the collapsed data set under the old data set name.] Because the one
71-year-old employee made over $80,000, I dropped him from the graph, which
would have been distorted by leaving him in. Note that the pattern becomes
quite erratic after the mid-50s (partly because sample sizes drop). There does
seem to be a downward trend, but the clearest point is that mean salary rises
much faster in the 20s than later in life.

. collapse sal, by(age)
. twoway (line sal age if age<70)

[Graph: (mean) sal plotted against age at last birthday]

B. The Natural Logarithm of the Dependent Variable


35. The natural logarithm is used frequently in regression analysis. The natural
logarithm (ln) of a number is the power to which the number e must be raised to equal
the original number. ln e^3 = 3, and ln e^457.3 = 457.3.
The number e has the approximate value 2.71828. It is the limit, as m
approaches infinity, of (1 + (1/m))^m. You don't really need to understand this.
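A quick way to see these definitions in Stata (display simply evaluates the expressions):

. display ln(exp(3))        /* returns 3: ln undoes exponentiation */
. display exp(1)            /* returns e, about 2.7182818 */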
36. When we convert the dependent variable to its natural logarithm, the regression
coefficients roughly represent the proportional change in the dependent variable from a
one-unit increase in the independent variable, holding constant the other variables in
the model.
37. When we graph the mean salary at each grade level against grade level, we see that
the relationship between the two variables is curvilinear: mean salary rises much more
in dollars at high than at low grades:

. bysort grade: egen meansal=mean(sal)
. twoway (line meansal grade)

[Graph: meansal plotted against gs-level in 1991]

38. We could try to capture that curvilinearity by adding a squared term, but it also
seems plausible that a one-grade increase in grade brings a constant proportional
increase in salary. Table 7 regresses the natural logarithm of sal (lsal) on grade. It
suggests that a one-grade increase in grade is associated with about a 12 percent
increase in salary.
(7)

lsal-hat = 9.269 + .118 grade

TABLE 7. reg lsal grade

      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  1,  3482) =47713.86
       Model |  549.569981     1  549.569981           Prob > F      =  0.0000
    Residual |  40.1058003  3482  .011518036           R-squared     =  0.9320
-------------+------------------------------           Adj R-squared =  0.9320
       Total |  589.675782  3483  .169301114           Root MSE      =  .10732

------------------------------------------------------------------------------
        lsal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |    .118232   .0005413   218.44   0.000     .1171708    .1192932
       _cons |   9.269124   .0052431  1767.88   0.000     9.258844    9.279404
------------------------------------------------------------------------------

. predict lsalhat
(option xb assumed; fitted values)

39. Stata easily generates the expected value of the dependent variable by using the
predict command immediately after doing a regression. In this case, I name the
expected value of the natural logarithm of salary lsalhat. I now need to translate the
natural logarithm back into dollars using the exp( ) command. This exponential
function reverses the natural logarithm; that is, it raises e to the power in the
parentheses. Salhatl is the expected salary generated from the regression with lsal as
the dependent variable.
For comparison purposes, I also regressed sal on grade and grade-squared and
saved the predicted values as gsalhat.
. gen salhatl = exp(lsalhat)
. gen gradesq = grade^2
TABLE 8. reg sal grade gradesq

      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  2,  3481) =22902.23
       Model |  6.8159e+11     2  3.4079e+11           Prob > F      =  0.0000
    Residual |  5.1798e+10  3481  14880327.4           R-squared     =  0.9294
-------------+------------------------------           Adj R-squared =  0.9293
       Total |  7.3338e+11  3483   210560961           Root MSE      =  3857.5

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |  -3972.823   125.8298   -31.57   0.000    -4219.531   -3726.116
     gradesq |   444.0829   6.947312    63.92   0.000     430.4617    457.7041
       _cons |   28263.04    513.444    55.05   0.000     27256.35    29269.72
------------------------------------------------------------------------------

. predict gsalhat
(option xb assumed; fitted values)

. twoway (line meansal grade, lcolor(black) lpattern(solid) lwidth(medium)) (line
    salhatl grade, lcolor(blue) lpattern(dot) lwidth(thick)) (line gsalhat grade,
    lcolor(red) lpattern(dash) lwidth(medium))

[Graph: meansal, salhatl, and gsalhat plotted against gs-level in 1991]

40. The graph above shows how meansal, salhatl, and gsalhat rise with grade. All
three rise at an accelerating pace with grade. Salhatl is very similar to meansal at the
lower grades but doesn't rise as quickly as it should at the top grades. Gsalhat does a better job of
matching meansal above grade 12, but it does a much worse job at the lower grades,
suggesting that salaries tend to drop between grades 2 and 6.
Exponentiated form when working with natural logarithms
41. When we use the natural logarithm as the dependent variable, we tend to think of
the regression coefficients as showing proportional changes. In Table 9, for instance, we
would be tempted to say that each year of education raises expected salary by about 9.1
percent (holding the other variables constant) and that Asian females tend to earn about
20.5 percent less than comparable white males.
42. These rough equivalencies hold best when a one-unit increase in X is a pretty small
increase. With dummy independent variables or with independent variables with
limited numbers of values (like edyrs), this equivalency may be quite inaccurate.


TABLE 9. reg lsal af am bf bm hf hm naf nam wf edyrs yos yossq age agesq

      Source |       SS       df       MS              Number of obs =   13089
-------------+------------------------------           F( 14, 13074) = 1094.12
       Model |  1160.01333    14  82.8580953           Prob > F      =  0.0000
    Residual |  990.096517 13074  .075730191           R-squared     =  0.5395
-------------+------------------------------           Adj R-squared =  0.5390
       Total |  2150.10985 13088  .164281009           Root MSE      =  .27519

------------------------------------------------------------------------------
        lsal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          af |  -.2054899    .020221   -10.16   0.000    -.2451261   -.1658537
          am |  -.0114719   .0187763    -0.61   0.541    -.0482761    .0253323
          bf |  -.2274759   .0081664   -27.86   0.000    -.2434832   -.2114686
          bm |  -.1339513   .0107938   -12.41   0.000    -.1551087   -.1127939
          hf |  -.2117787   .0169264   -12.51   0.000    -.2449569   -.1786005
          hm |  -.0493306   .0161798    -3.05   0.002    -.0810453   -.0176158
         naf |  -.2831127   .0237617   -11.91   0.000     -.329689   -.2365363
         nam |  -.1706531   .0326943    -5.22   0.000    -.2347387   -.1065675
          wf |  -.1594855   .0059426   -26.84   0.000    -.1711338   -.1478371
       edyrs |   .0913524   .0011515    79.34   0.000     .0890954    .0936094
         yos |   .0285126   .0011412    24.98   0.000     .0262757    .0307495
       yossq |    -.00024   .0000314    -7.64   0.000    -.0003016   -.0001784
         age |   .0100905   .0020093     5.02   0.000      .006152     .014029
       agesq |  -.0001331   .0000219    -6.09   0.000     -.000176   -.0000902
       _cons |   8.804098   .0451209   195.12   0.000     8.715655    8.892542
------------------------------------------------------------------------------

43. The correct interpretation is that the expected percentage change is
(exp(b) - 1)*100,
where exp(b) is the exponentiated form (or anti-log) of the regression coefficient.
44. To the best of my knowledge, Stata will not go that far for you, but it will
exponentiate your coefficients. Add eform(string) as an option (after the comma). In
parentheses, put the heading you want Stata to put at the top of the coefficient column.
Here, I use eform(exp(b)), and exp(b) shows up as the heading of the coefficient column.
. reg lsal af am bf bm hf hm naf nam wf edyrs yos yossq age agesq, eform(exp(b))

      Source |       SS       df       MS              Number of obs =   13089
-------------+------------------------------           F( 14, 13074) = 1094.12
       Model |  1160.01333    14  82.8580953           Prob > F      =  0.0000
    Residual |  990.096517 13074  .075730191           R-squared     =  0.5395
-------------+------------------------------           Adj R-squared =  0.5390
       Total |  2150.10985 13088  .164281009           Root MSE      =  .27519

------------------------------------------------------------------------------
        lsal |     exp(b)   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          af |   .8142483   .0164649   -10.16   0.000     .7826059    .8471702
          am |   .9885936   .0185621    -0.61   0.541     .9528706    1.025656
          bf |   .7965416   .0065049   -27.86   0.000     .7838926    .8093947
          bm |   .8746327   .0094406   -12.41   0.000     .8563221    .8933347
          hf |   .8091438   .0136959   -12.51   0.000     .7827383      .83644
          hm |   .9518664    .015401    -3.05   0.002     .9221519    .9825385
         naf |   .7534349   .0179029   -11.91   0.000     .7191474    .7893572
         nam |    .843114    .027565    -5.22   0.000     .7907774    .8989144
          wf |   .8525824   .0050665   -26.84   0.000     .8427088    .8625716
       edyrs |   1.095655   .0012616    79.34   0.000     1.093185    1.098131
         yos |   1.028923   .0011742    24.98   0.000     1.026624    1.031227
       yossq |     .99976   .0000314    -7.64   0.000     .9996984    .9998216
         age |   1.010142   .0020297     5.02   0.000     1.006171    1.014128
       agesq |   .9998669   .0000219    -6.09   0.000      .999824    .9999098
------------------------------------------------------------------------------

45. Multiply these coefficients by 100 in your head and state that Asian females tend
to earn 81.4 percent as much as comparable white men, and that employees tend to earn
109.6 percent as much as comparable employees with one less year of education (that is,
they earn 9.6 percent more).
Notice that my previous interpretations were not very accurate. Asian females
tend to earn 18.6 percent less than comparable white men (not 20.5 percent less),
and each additional year of education tends to raise expected earnings by 9.6
percent (not 9.1 percent).
46. In general, the naive interpretation will overstate the effect of variables with
negative coefficients and understate the impact of variables with positive coefficients.
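If you would rather let Stata do this arithmetic, one possibility (a sketch of my own, not part of the original handout) is to use display with the stored coefficients immediately after the reg lsal regression; _b[varname] holds the coefficient on varname:
. * Exact percentage effects from the log-salary regression (my addition)
. * Each additional year of education: about 9.6 percent more
. display (exp(_b[edyrs]) - 1)*100
. * Asian females relative to comparable white males: about 18.6 percent less
. display (exp(_b[af]) - 1)*100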


Using Outreg
1. One of the advantages of Stata is that geeks tend to like it and write special programs
to make it even better and let you download those programs for free. One such program
is outreg, which converts standard Stata output into a format more appropriate for a
journal. Outreg is already on the FRC server. To download it onto your computer, type
net search outreg on the command line, then click on the top blue link that appears.
2. As with all Stata commands, if you type help [commandname], Stata will give you
the basics on how to use the command.
3. Here, after running your regression, type outreg using [filename]. The defaults
are that outreg prints the regression coefficients and the t-statistics, and puts asterisks
after the t-statistics to show statistical significance. If you want something else, put the
options you want after a comma. I prefer the asterisks on the coefficients and I
sometimes prefer standard errors to t-statistics. You can also control the number of
significant digits in the final table.
4. You can also put regressions side by side by using the append option. In the
following example, I ran separate regressions for whites, blacks, and Asians.
. use "OPM2001.dta", clear
. keep if race==2|race==5|race==1
(88 observations deleted)
. reg grade edyrs male yos age if race==5

      Source |       SS       df       MS              Number of obs =     736
-------------+------------------------------           F(  4,   731) =  108.47
       Model |  2393.49824     4  598.374559           Prob > F      =  0.0000
    Residual |  4032.58328   731   5.5165298           R-squared     =  0.3725
-------------+------------------------------           Adj R-squared =  0.3690
       Total |  6426.08152   735  8.74296806           Root MSE      =  2.3487

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   .6806109   .0394708    17.24   0.000     .6031212    .7581006
        male |   .8722048     .18061     4.83   0.000     .5176286    1.226781
         yos |   .1024334   .0115354     8.88   0.000     .0797869      .12508
         age |  -.0196672   .0111235    -1.77   0.077     -.041505    .0021706
       _cons |  -1.027372   .7273584    -1.41   0.158    -2.455332    .4005888
------------------------------------------------------------------------------

. outreg using "wbaregs1", coef se

. reg grade edyrs male yos age if race==2
(output deleted for space)

. outreg using "wbaregs1", append coef se

. reg grade edyrs male yos age if race==1
(output deleted for space)

. outreg using "wbaregs1", append coef se

5. I then open the wbaregs1 file, giving me something like this:


                                      (1)            (2)            (3)
                                grade in 2001  grade in 2001  grade in 2001
years of education (2001)           0.681**        0.900**        1.025**
                                   (0.039)        (0.091)        (0.157)
sex dummy (1=male, 0=female)        0.872**        0.496          0.963
                                   (0.181)        (0.384)        (0.634)
years of federal service (2001)     0.102**        0.104**        0.143**
                                   (0.012)        (0.022)        (0.046)
years of age (2001)                -0.020         -0.017         -0.023
                                   (0.011)        (0.022)        (0.038)
Constant                           -1.027         -4.741**       -6.866*
                                   (0.727)        (1.410)        (3.075)
Observations                          736            246             50
R-squared                            0.37           0.35           0.58
Standard errors in parentheses
* significant at 5%; ** significant at 1%
6. With a change of font, new tabs, and a little editing, I get the following:
Table 1. Salary Regressions by Race

                               Whites        Blacks        Asians

Male                           0.872**       0.496         0.963
                              (0.181)       (0.384)       (0.634)
Years of Education             0.681**       0.900**       1.025**
                              (0.039)       (0.091)       (0.157)
Years of Federal Service       0.102**       0.104**       0.143**
                              (0.012)       (0.022)       (0.046)
Age                           -0.020        -0.017        -0.023
                              (0.011)       (0.022)       (0.038)
Constant                      -1.027        -4.741**      -6.866*
                              (0.727)       (1.410)       (3.075)

Observations                     736           246            50
R-squared                       0.37          0.35          0.58

Standard errors in parentheses
* significant at 5%; ** significant at 1%
7. Not only is this usually much easier than re-typing the tables yourself, but it ensures
that you will not type the wrong numbers.
8. outreg will also convert coefficients into their exponentiated form.

9. Stata has also built in something similar to outreg, the estimates command. You
first store the estimates (here I store them as A), then you can display them, typically
using the table command. Type help estimates for an explanation.
. estimates store A

. estimates table A

--------------------------
    Variable |     A
-------------+------------
          af | -.20548988
          am | -.01147195
          bf | -.22747591
          bm |  -.1339513
          hf | -.21177867
          hm | -.04933056
         naf | -.28311266
         nam | -.17065313
          wf | -.15948545
       edyrs |   .0913524
         yos |   .0285126
       yossq |    -.00024
         age |  .01009051
       agesq |  -.0001331
       _cons |  8.8040984
--------------------------

. est t A, eform

--------------------------
    Variable |     A
-------------+------------
          af |  .81424833
          am |  .98859361
          bf |  .79654161
          bm |  .87463266
          hf |  .80914377
          hm |  .95186643
         naf |   .7534349
         nam |  .84311398
          wf |  .85258237
       edyrs |   1.095655
         yos |   1.028923
       yossq |  .99976003
         age |  1.0101416
       agesq |  .99986691
       _cons |  6661.4893
--------------------------

10. This form makes it easy to cut and paste the coefficients into Excel. You can then
exponentiate (if necessary), subtract 1, and multiply by 100. I haven't figured out an
easy way to get this back into outreg or estimates table, but it again cuts down the
probability you'll make a simple math error in putting your table together.
Outreg web-site:
http://www.kellogg.northwestern.edu/rc/stata-outreg.htm


Homework 4
Interaction Terms and Curvilinear Relationships
1. Run homework 4.do in the E:\paus9121\do-files folder.
You need to rename the folder in which you place the outreg file.
Make sure you understand every command and comment.
2. Open Table1.out with Word. Convert it into a nice, pretty table with two columns
lined up with decimal tabs.
If you do not see a ruler across the top of the page (new Word), click on the
button above the scroll bar on the right side of the screen. It should appear.
Highlight the names of the dependent variables. Click the button at the extreme
left of the ruler until it shows a Center tab set. Click on the ruler in the
locations where you want the Center tabs. You may need to cut and paste a little
with the dependent variables' names/labels (putting them on two lines?) to make
them line up properly.
Highlight the names of the independent variables and all the coefficients. Click
the button at the extreme left of the ruler until it shows a Decimal tab set.
Click on the ruler in the locations where you want the Decimal tabs. These
should line up nicely with the column headings.
Add a title for the table. Change the font to something prettier. You may want to
change some of the type to bold face. Make a few other minor changes until you
think the table looks good. Save it as a Word file.
3. Open Table2.out with Word. Convert it into a nice, pretty table with three columns
lined up with decimal tabs. Save it as a Word file.
4. Based on the first column of Table 2, interpret the impact of sex, race/ethnicity,
education, federal experience, and age on federal salaries in 2001. Be sure that your
interpretations make clear that you are working from multiple rather than bivariate
regression. Be sure that you don't interpret the yos coefficient in isolation from the
yossq coefficient; you must interpret them in tandem. Calculate the value of yos at
which expected salary reaches its maximum, holding the other variables constant. Is
that in the range of the data? How should that influence your interpretation of the
impact of federal experience?


5. Based on the final two columns of Table 2, interpret the impact of sex, race/ethnicity,
education, federal experience, and age on federal salaries in 2001. Consider all the same
questions raised in 4.
6. Now do some more computer work. Regress sal on asian, black, hispanic,
amerind, edyrs, yos, and age, separately by sex. That is, run two separate
regressions, one for men and one for women. Save them in an outreg file. Convert it
into an easily readable Word table. (Hint: What should the headings on the columns
be?)
7. Create interaction terms between male and asian, black, hispanic, amerind,
edyrs, yos, and age. Run a regression with sal as the dependent variable and with
asian, black, hispanic, amerind, edyrs, yos, age, male, and all the interaction
terms you just created as the independent variables.
8. Interpret all the coefficients from the regression with the interaction terms. Refer to
the outreg table from question 6 repeatedly. Make sure you understand how each of the
coefficients in the interaction model is related to the coefficients in the outreg table.
9. Create interaction terms between black and male, edyrs, yos, and age. Restrict
the sample to blacks and whites with an if statement at the end of the regression
command. Regress sal on male, edyrs, yos, age, black, and the four interaction
terms. Interpret the coefficients and the y-intercept.
10. Based on the previous regression, what would the equations have been if you had
regressed sal on male, edyrs, yos, and age separately for blacks and for whites?
That is, calculate what the regressions would have been, based on the model with the
interaction terms. Show at least some of your work. Check your answers by running
those regressions.


ESTIMATION AND INFERENCE IN BIVARIATE REGRESSION


1. Let's begin with a deterministic or functional linear relationship between an
independent variable X and a dependent variable Y:
yi = β0 + β1 xi
Then, once we know the value of X, we know what the value of Y is. If X=0, then Y = β0.
(β0 is frequently called the y-intercept.) As X increases by 1, Y changes by β1. (β1 is typically
called the slope, the change in Y as X increases by 1.)
2. Most relationships of policy interest are not deterministic but stochastic; that is, they
include an important random element: the effects of a myriad of small influences and
chance. To account for that, we add a disturbance term εi, which we'll treat as a
random variable. This makes the relationship between X and Y
yi = β0 + β1 xi + εi
Our dependent variable has a systematic component and a stochastic component. It is a
function of both the independent variable and the error term. We can no longer predict yi
exactly. Instead, at every value of our independent variable, yi has a conditional
probability distribution.
3. Classical Assumptions
A. We have properly specified the relationship. X has an important causal impact
on Y, and no other variable does. We have the correct functional form of the
relationship, e.g., E(Y) is linearly related to X and not to lnX or X² or 1/X.
B. X varies. (It does not have a single value.)
C. xi is a known constant (rather than a variable) or is at least fixed in repeated
samples. In other words, we're going to assume that we chose how many
observations we would have at each value of X.
Later, we will weaken this assumption to the independence of X and the
error term: Cov(xi, εi) = E(xi εi) = 0.
D. E(εi) = E(εi | X) = 0. That is, the expected value or conditional mean of εi is
0 at every value of the independent variable.
Because xi is a constant, E(xi εi) = xi E(εi) = 0.
E. Var(εi) = Var(εi | X) = σ². The probability distribution of εi is the same at
every value of X: it has a mean of 0 and a variance of σ². This is called the
homoskedasticity assumption. It means that the spread of observed values
around the conditional mean is the same at every value of X. In contrast, when the
error terms are heteroskedastic, the spread of observed values is wider at some
values of X than at others.
F. Cov(εi, εj) = 0 for all i ≠ j. The error or disturbance terms are uncorrelated
from observation to observation. This is the assumption of nonautocorrelation.
(Wooldridge does not add this assumption until we hit time-series analysis.)
G. Sometimes we assume that the error term is normally distributed: εi ~ N(0, σ²).
4. Some Implications for the Conditional Distribution of Y
E(yi|xi) = E(β0 + β1 xi + εi) = E(β0) + E(β1 xi) + E(εi) = β0 + β1 xi
Each conditional probability distribution has its own mean or expected value and its own
variance and standard deviation. If our regression model is correct, and the relationship
between salary and grade level is exactly linear, then all the conditional means of Y will be
on a straight line. Our regression equation gives the conditional means.
Var(yi|xi) = E[yi - E(yi)]²
           = E[(β0 + β1 xi + εi) - (β0 + β1 xi)]²
           = E(εi²)
           = σ²
The individual observations on Y will be distributed around those conditional means;
that is, yi|xi ~ (β0 + β1 xi, σ²). β0 + β1 xi is a constant and does not affect the variance
[var(a+Y) = var(Y)].
If εi ~ N(0, σ²), then yi|xi ~ N(β0 + β1 xi, σ²).
If the error term is normally distributed, then the conditional distribution of Y is also
normal.
5. Estimating the Regression Coefficients
How do we estimate the population regression equation based on this model and
sample data? We want to choose the line that provides the best fit to the sample data. The
sample regression line is:
yi = b0 + b1 xi + ei
and
ŷi = b0 + b1 xi
Thus,
(yi - ŷi) = (b0 + b1 xi + ei) - (b0 + b1 xi) = ei
b0 and b1 are the sample parameter estimates: the y-intercept and the slope coefficient,
respectively. ŷi is the value of Y predicted by the sample regression equation for a given
value of X. ei is the residual, the difference between the observed and expected values of Y.
Least Squares Principle
ei is the key to finding a fit between the sample regression line and the sample data.
In general, the smaller the ei's, the better the fit, but there are many possible regression
lines that all have Σei = 0. One definition of the sample line that fits the data best is the
line that minimizes the sum of the squared errors (Σei²). How do we minimize the sum of
the squared errors? First, note:
ei = (yi - ŷi) = yi - (b0 + b1 xi) = yi - b0 - b1 xi
Thus,
Σei² = Σ(yi - b0 - b1 xi)²
     = Σ(yi² - 2yi b0 - 2yi b1 xi + b0² + 2b0 b1 xi + b1² xi²)
To find the values of b0 and b1 that minimize Σei², we need to take the partial
derivatives of Σei² with respect to b0 and b1. Essentially, derivatives find the slope of a
curve at a particular point; they find the instantaneous rate of change in the dependent
variable with a very small change in whatever we are differentiating against. For our
purposes, we only need to look at three situations.
First, with a horizontal line, Y = a, that is, there is no change in E(Y) as X changes,
dY/dX = 0. (dY/dX means that we are differentiating (taking the derivative of) Y
with respect to X.) The slope is 0.
Second, when Y and X are linearly related, Y = a + bX, the slope is constant (b).
No matter what value X has, when it increases by 1, Y changes by b. dY/dX = b.
Third, when there is a quadratic relationship between Y and X, Y = a + bX + cX²,
the line is curved and has a different slope at different values of X. dY/dX = b + 2cX.
To find the value of X at which Y reaches its maximum (or minimum), we set the
derivative equal to 0: b + 2cX = 0. Thus, 2cX = -b, and X = -b/2c at its
maximum (or minimum). For instance, if Y = 100 + 50X - 5X², then dY/dX = 50 - 10X. If
50 - 10X = 0, then X = (-50)/(-10) = 5, and Y reaches its maximum when X=5.
To be sure that we have a maximum rather than a minimum, we take the second
derivative. The second derivative is essentially the rate of change in the slope, its
acceleration or deceleration rate. d²Y/dX² is the derivative of the first derivative
with respect to X. Thus, in the general case of the quadratic relationship, d²Y/dX²
= 2c. If 2c is positive, the slope is increasing at this point, and we have achieved a
minimum. If 2c is negative, the slope is decreasing at this point, and we have
achieved a maximum.
In our example, d²Y/dX² = -10 and we have achieved a maximum. As a simple
check, if X=5, Y = 100 + 250 - 125 = 225. If X=4, Y = 100 + 200 - 80 = 220 (which is
less than our hypothesized maximum). If X=6, Y = 100 + 300 - 180 = 220 (which is
less than our hypothesized maximum). Thus, X=5 does seem to maximize Y.
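As an aside, the same X = -b/(2c) formula can be applied to the quadratic in years of federal service from Table 9; this is my own illustration, not part of the original notes. With b = .0285126 on yos and c = -.00024 on yossq:
. * -b/(2c) for the yos quadratic in Table 9 (my own check)
. display -.0285126/(2*(-.00024))
That works out to roughly 59 years of federal service, almost certainly beyond the range of the data, so within the observed range the estimated effect of yos on log salary is positive but flattening.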
To return to our problem, we want to choose the values of b0 and b1 that minimize the sum
of the squared deviations between the observed and expected values of Y. It is important
to remember that once we gather our data, the Xs and Ys are known constants, and the b's
are what change. To find the values of b0 and b1 that minimize Σei², we need to take the
partial derivatives of Σei² with respect to b0 and b1.
S = Σei² = Σ(yi - b0 - b1 xi)²
         = Σ(yi² - 2yi b0 - 2yi b1 xi + b0² + 2b0 b1 xi + b1² xi²)
         = Σyi² - 2b0 Σyi - 2b1 Σyi xi + Nb0² + 2b0 b1 Σxi + b1² Σxi²
∂S/∂b0 = -2Σyi + 2Nb0 + 2b1 Σxi = 0
∂S/∂b1 = -2Σyi xi + 2b0 Σxi + 2b1 Σxi² = 0

These can be reorganized into the normal equations:
Σyi    = Nb0 + b1 Σxi
Σyi xi = b0 Σxi + b1 Σxi²
We can also reorganize to solve for the sample regression coefficients:
Nb0 = Σyi - b1 Σxi
b0  = Σyi/N - b1 Σxi/N = ȳ - b1 x̄
We substitute this value for b0 into the second normal equation:
b1 Σxi² = Σyi xi - b0 Σxi
        = Σyi xi - (ȳ - b1 x̄) Σxi
        = Σyi xi - ȳ Σxi + b1 x̄ Σxi
b1 Σxi² - b1 x̄ Σxi = Σyi xi - (Σyi/N) Σxi
b1 (Σxi² - Σxi Σxi/N) = Σyi xi - Σyi Σxi/N
b1 = (Σyi xi - Σyi Σxi/N) / (Σxi² - (Σxi)²/N)
which, still surprisingly to me, converts to:
b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
and to:
b1 = [Σ(xi - x̄)yi - ȳ Σ(xi - x̄)] / Σ(xi - x̄)²
   = Σ(xi - x̄)yi / Σ(xi - x̄)²
and to:
b1 = (Σyi xi - N ȳ x̄) / (Σxi² - N x̄²)

Implications:
1. The statistics are computed entirely from sample data (there are no unknown
parameters). Thus far, however, they are only point estimators. We will calculate
confidence intervals later.
2. The sample regression line passes through the means of X and Y.
(If b0 = ȳ - b1 x̄, then ȳ = b0 + b1 x̄.)
3. The mean of the expected values of Y equals the mean of the observed values of
Y. (ŷi = b0 + b1 xi = (ȳ - b1 x̄) + b1 xi = ȳ + b1 (xi - x̄). Hence, Σŷ = Σy +
b1 Σ(xi - x̄) = Σy (because Σ(xi - x̄) = 0). If the sums of the expected and
observed values are the same, so are the means.)
4. The sum (and, therefore, the mean) of the residuals is zero. See the first normal
equation. Hence, Σyi = Nb0 + b1 Σxi, which repeats point 2. (The sample
regression line passes through the means of X and Y; ȳ = b0 + b1 x̄.)
5. The sum of the squared residuals is a minimum for this set of observations. No
alternative values of b0 and b1 will give a lower sum of the squared residuals. This
is what the normal equations show.


6. The residuals are not correlated with the independent variable: Σxi ei = 0. To
see this, look back at the second normal equation.
Σxi ei = Σxi (yi - b0 - b1 xi)
       = Σyi xi - b0 Σxi - b1 Σxi²
       = 0
[Remember that setting ∂S/∂b1 = -2Σyi xi + 2b0 Σxi + 2b1 Σxi² equal to 0 implies
Σyi xi - b0 Σxi - b1 Σxi² = 0.]
7. The residuals are not correlated with the predicted values of the dependent
variable.
Σŷi ei = Σ(b0 + b1 xi) ei = b0 Σei + b1 Σxi ei = 0.
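To make the deviation formula for b1 concrete, here is a small sketch of my own (not part of the original notes) that computes b1 directly and can be compared with what regress reports on the same data; summarize leaves a variable's mean in r(mean) and its sum in r(sum):
. * Sketch (my addition): b1 = sum of (xi - xbar)(yi - ybar) over sum of (xi - xbar)^2
. clear
. set obs 20
. gen x = int((_n-1)/2)
. gen y = 10 + 3*x + 5*invnorm(uniform())
. quietly summarize x
. gen devx = x - r(mean)
. quietly summarize y
. gen devy = y - r(mean)
. gen crossprod = devx*devy
. gen devx2 = devx^2
. quietly summarize crossprod
. scalar sxy = r(sum)
. quietly summarize devx2
. scalar sxx = r(sum)
. display "b1 by formula = " sxy/sxx
. regress y x
The two values of b1 should match (up to rounding).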

Best Linear Unbiased Estimator (BLUE)
To show that the least squares estimators are the best linear unbiased estimators of
the population regression coefficients, we must show that they are unbiased, linear
combinations of the observed values, and the most efficient linear unbiased estimators.
If a statistic is an unbiased estimator of a parameter, then its expected value is the
parameter. Thus, we first need to show that the expected value of the sample regression
coefficient equals the population slope parameter:
E(b1) = E[Σ(xi - x̄)yi / Σ(xi - x̄)²]
      = E[Σ(xi - x̄)(β0 + β1 xi + εi) / Σ(xi - x̄)²]
      = [β0 Σ(xi - x̄) + β1 Σ(xi - x̄)xi + Σ(xi - x̄)E(εi)] / Σ(xi - x̄)²
      = [0 + β1 Σ(xi - x̄)xi + 0] / Σ(xi - x̄)²
      = β1 Σ(xi - x̄)xi / Σ(xi - x̄)²
      = β1 Σ(xi - x̄)² / Σ(xi - x̄)²
      = β1
A linear estimator is a linear combination (weighted average) of the sample data. It
follows the form:
c = Σai yi
where the ai are constants and the yi are variables.

We therefore show that the sample regression coefficient is a linear combination of the
values of the dependent variable:
b1 = Σ(xi - x̄)yi / Σ(xi - x̄)²
   = Σ[(xi - x̄)/Σ(xi - x̄)²] yi
   = Σki yi,   where ki = (xi - x̄)/Σ(xi - x̄)²
(Remember that the xi are known constants.) In sum, the sample regression coefficient
(b1) is a linear combination (or weighted average) of the yi.
Note that:
Σki  = Σ(xi - x̄)/Σ(xi - x̄)² = 0
Σki² = Σ(xi - x̄)²/[Σ(xi - x̄)²]² = 1/Σ(xi - x̄)²
Σki (xi - x̄) = Σki xi - x̄ Σki
             = Σki xi
             = Σ(xi - x̄)xi / Σ(xi - x̄)²
             = 1

This makes the proof of unbiasedness easier:
b1 = Σki yi
   = Σki (β0 + β1 xi + εi)
   = β0 Σki + β1 Σki xi + Σki εi
   = β1 + Σki εi
E(b1) = β1 + Σki E(εi) = β1

An unbiased linear estimator is best if it is most efficient, that is, if it has a smaller
variance than any other linear unbiased estimator. We first find the variance of the sample
regression coefficient:
Var(b1) = E[b1 - E(b1)]²
        = E[b1 - β1]²
        = E[Σki εi]²
        = E[k1²ε1² + k2²ε2² + k3²ε3² + ... + 2k1k2 ε1ε2 + 2k1k3 ε1ε3 + ...]
        = k1² E(ε1²) + k2² E(ε2²) + k3² E(ε3²) + ... + 2k1k2 E(ε1ε2) + 2k1k3 E(ε1ε3) + ...
        = k1² σ² + k2² σ² + k3² σ² + ...
        = σ² Σki²
        = σ² / Σ(xi - x̄)²
b1 is an unbiased linear estimator with a variance of σ²/Σ(xi - x̄)². To show that b1 is
the best linear unbiased estimator of β1, we must show that all other linear unbiased
estimators of β1 have larger variances. We therefore define another linear unbiased
estimator of β1 (b1*) with different unknown weights:
b1* = Σki* yi,   where ki* = ki + ci = (xi - x̄)/Σ(xi - x̄)² + ci
    = b1 + Σci yi
E(b1*) = Σki* E(yi)
       = Σki* (β0 + β1 xi)
       = β0 Σki* + β1 Σki* xi
       = β1 ONLY IF Σki* = 0 and Σki* xi = 1
E(b1*) = E(b1) + Σci E(yi)
       = β1 + Σci (β0 + β1 xi)
So b1* is unbiased only if Σci = 0 (and Σci xi = 0).
var(b1*) = var(Σki* yi) = var[Σ((xi - x̄)/Σ(xi - x̄)² + ci) yi]
         = Σ[(xi - x̄)/Σ(xi - x̄)² + ci]² var(yi)
         = σ² Σ[(xi - x̄)/Σ(xi - x̄)² + ci]²
         = σ² Σ[((xi - x̄)/Σ(xi - x̄)²)² + 2((xi - x̄)/Σ(xi - x̄)²)ci + ci²]
         = σ² / Σ(xi - x̄)² + 2σ² Σ(xi - x̄)ci / Σ(xi - x̄)² + σ² Σci²
[but Σci = 0 and Σci xi = 0 if b1* is unbiased, so the middle term drops out]
         = σ² / Σ(xi - x̄)² + σ² Σci²
         = var(b1) + σ² Σci²
         > var(b1)        [because Σci² > 0 if b1* ≠ b1]
Therefore, there is no linear unbiased estimator of β1 that has a smaller variance than b1.
[Indeed, if the population is normally distributed (as assumed), there is no unbiased
estimator of β1 (linear or nonlinear) that has a smaller variance than b1. Notice, however,
that we do not need to assume that εi (or yi) is normally distributed in order to establish
that b1 is BLUE.]

MONTE CARLO for the Sampling Distribution


of the Sample Regression Coefficient for a Population
Meeting the Classical Assumptions
1. A key insight about the sample regression coefficient is that it is a variable. Although
we typically only see a single value of the regression coefficient, it is one of an almost
infinite number of regression coefficients we could have gotten from an almost infinite
number of samples of the same size from the same population.
2. This Monte Carlo exercise first creates a population that meets the classical
regression assumptions. Then it simulates 10,000 samples from that population;
calculates 10,000 y-intercepts and regression coefficients; and presents descriptive
statistics on them.
3. First, I create a sample of 100,000 observations, using several useful tricks in Stata.
clear creates a blank/clear data set.
set memory 50000 gives me enough space to run the simulation.
set obs 100000 gives me 100,000 blank lines.

4. Next, I create a homoskedastic, non-autocorrelated, normally distributed error term.
uniform() creates a uniform distribution that ranges from 0 to 1.
gen uni = uniform() creates a new variable (uni) that has a uniform distribution
that ranges from 0 to 1, a mean very close to .50, a minimum very close to 0 and a
maximum very close to 1.
invnorm(argument) turns a probability into a z-score. Thus, invnorm(.8560432)
creates a z-score of 1.06271, because P(z<1.06271) = .8560432; and invnorm(.1651841)
creates a z-score of -.9733723, because P(z<-.9733723) = .1651841. Since uni has values
restricted to the 0-to-1 range, it works well as a probability.
gen z = invnorm(uni) creates a new, normally distributed variable (z) with a mean
of 0 and a standard deviation of 1.
gen error = 5*invnorm(uni) multiplies a z-score times 5 to create a normally
distributed error term (error) with a mean of 0 and a standard deviation of 5.
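In current versions of Stata (release 10 and later), the same kind of error term can be drawn in one step with rnormal(); this is a sketch of my own, not part of the original do-file:
. * Newer-Stata alternative: draw a normal(0, 5) error term directly (my addition)
. clear
. set obs 100000
. gen error = rnormal(0, 5)
The uniform()/invnorm() route shown above produces the same kind of variable and has the advantage of working in older versions.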

5. To show that this error term is normally distributed, I run several useful procedures:
summarize produces summary statistics on all variables, showing that their means
and standard deviations approach the theoretically correct values.
ci generates 95 percent confidence intervals for the population means of the
variables. Here, the population means of .50, 0, and 0 are inside the confidence
intervals.
graph error, bin(50) creates a histogram with 50 bars. (The default, without the
bin(50), is 5 bars.) This shows the bell-shaped curve we expect.
sum error, detail produces more detailed summary statistics for error (sum is
short for summarize; Stata allows abbreviations). In particular, we are interested in the
skewness and kurtosis. A normal distribution has a skewness of 0 (showing that it is
symmetrical) and a kurtosis (a measure of the size of its tails) of 3. In this case, the
skewness of .0111418 and the kurtosis of 2.992836 are very close to their expected values.
sktest error tests the null hypothesis that error has a normal distribution. This
actually does three separate tests and provides three prob-values. First, the probability
that the sample skewness would be as far from 0 as .0111418 if the population were
normally distributed is 0.150, somewhat unusual but not unlikely enough to cause us to
reject the null. Second, the probability that the sample kurtosis would be as far from 3
as 2.992836 is pretty high (0.651), even if the population is normally distributed. Third,
sktest provides a joint test that the sample skewness would be as far from 0 as .0111418
and the sample kurtosis would be as far from 3 as 2.992836 if the population were
normally distributed; the test-statistic has a chi-square distribution with 2 degrees of
freedom if the population is normally distributed. Here, we would have a 0.3210
probability of getting a test-statistic as high as 2.27 if the null is true, far from enough
evidence to cause us to reject the null.
. clear
. set memory 50000
(50000k)
. set obs 100000
obs was 0, now 100000
. gen uni = uniform()
. gen z = invnorm(uni)
. gen error = 5*invnorm(uni)
. summarize

    Variable |       Obs        Mean    Std. Dev.        Min         Max
-------------+-----------------------------------------------------------
         uni |    100000    .4987242    .2885097    .0000419    .9999713
           z |    100000   -.0033498    .9983583   -3.933437    4.023499
       error |    100000   -.0167488    4.991791   -19.66718     20.1175

. ci

    Variable |       Obs        Mean    Std. Err.    [95% Conf. Interval]
-------------+------------------------------------------------------------
         uni |    100000    .4987242    .0009123      .496936    .5005124
           z |    100000   -.0033498    .0031571    -.0095376    .0028381
       error |    100000   -.0167488    .0157854     -.047688    .0141904
. graph error, bin(50)

[Histogram of error with 50 bins: fraction on the vertical axis (peak around .064), error on the horizontal axis from about -19.67 to 20.12.]

. sum error, detail

                            error
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -11.61443      -19.66718
 5%    -8.172531      -19.58148
10%     -6.40689      -19.18961       Obs           100000
25%    -3.397988       -19.0022       Sum of Wgt.   100000

50%    -.0268324                      Mean        -.0167488
                        Largest       Std. Dev.    4.991791
75%     3.331517         19.125
90%     6.396607       19.18238       Variance     24.91798
95%     8.214046       19.72038       Skewness     .0111418
99%     11.59188        20.1175       Kurtosis     2.992836

. sktest error

                   Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)   Prob>chi2
-------------+--------------------------------------------------------
       error |      0.150          0.651          2.27        0.3210

6. Next I create X as a set of known constants.


The Stata function _n assigns numbers to each observation; here I put that
number in a variable named id. Id has a range from 1 to 100,000.
The int(argument) function returns the integer obtained by truncating its argument;
that is, it drops everything after the decimal point. Here, I first subtract 1 from id
(giving it a range from 0 to 99,999), then divide it by 10,000 (giving it a range from 0 to
9.9999), then drop everything after the decimal point (giving it a range from 0 to 9).
There are now 10,000 observations on X at each integer value from 0 to 9.
. gen id = _n
. gen x = int((id-1)/10000)
. sum id x

    Variable |       Obs        Mean    Std. Dev.        Min         Max
-------------+-----------------------------------------------------------
          id |    100000     50000.5    28867.66           1      100000
           x |    100000         4.5    2.872296           0           9
. tab x

          x |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     10000       10.00       10.00
          1 |     10000       10.00       20.00
          2 |     10000       10.00       30.00
          3 |     10000       10.00       40.00
          4 |     10000       10.00       50.00
          5 |     10000       10.00       60.00
          6 |     10000       10.00       70.00
          7 |     10000       10.00       80.00
          8 |     10000       10.00       90.00
          9 |     10000       10.00      100.00
------------+-----------------------------------
      Total |    100000      100.00

7. Next I create Y as a function of the deterministic expected value (10 + 3*x) and a
stochastic disturbance term (error).

. gen y = 10 + 3*x + error

8. Then I run a regression to show that the pattern in a sample of size 100,000 nearly
matches the pattern in the population.
The sample y-intercept of 9.97275 nearly matches the population y-intercept (10).
The sample regression coefficient of 3.002334 nearly matches the population
coefficient (3).
The estimated standard deviation and variance of the error term (4.9918 and
24.9181849, respectively) are very close to the true values of 5 and 25.


. regress y x

      Source |       SS       df       MS              Number of obs =  100000
-------------+------------------------------           F(  1, 99998) =       .
       Model |  7436560.16     1  7436560.16           Prob > F      =  0.0000
    Residual |  2491768.66 99998  24.9181849           R-squared     =  0.7490
-------------+------------------------------           Adj R-squared =  0.7490
       Total |  9928328.82 99999   99.284281           Root MSE      =  4.9918

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   3.002334   .0054958   546.30   0.000     2.991563    3.013106
       _cons |   9.972746   .0293396   339.91   0.000     9.915241    10.03025
------------------------------------------------------------------------------

9. What does the population look like?


As the error terms are drawn independently of each other and of X, the error
term is not correlated with either X or the lagged error term. (Here, the correlations are
not significant despite a sample of 100,000.)
The Stata subscript [_n-1] attached to the end of a variable name creates the
lagged value of that variable. _n is the observation number and _n-1 is the number of the
previous observation, so error[_n-1] is the value of the error term for the previous
observation.
. gen lagerror = error[_n-1]
(1 missing value generated)
. pwcorr x error lagerr, sig obs

             |        x    error lagerror
-------------+---------------------------
           x |   1.0000
             |
             |  1.0e+05
             |
       error |   0.0013   1.0000
             |   0.6710
             |  1.0e+05  1.0e+05
             |
    lagerror |   0.0014   0.0011   1.0000
             |   0.6688   0.7166
             |    99999    99999    99999
             |

At every value of X, the error term has a normal distribution with a mean of 0 and
a standard deviation of 5.
At every value of X, the expected value of Y is 10+3x. When X=0, the expected
value of Y is 10. When X=5, the expected value of Y is 25.
At every value of X, Y has a normal probability distribution with a mean equal to
the expected value of Y and a standard deviation equal to the standard deviation of the
error term.

The error term is homoskedastic. At every value of X, the standard deviation of
the error term (and of Y) is the same (5).
. table x, content(mean error sd error mean y sd y)

--------------------------------------------------------------
        x | mean(error)   sd(error)     mean(y)        sd(y)
----------+---------------------------------------------------
        0 |   -.0135516    5.001425    9.986448     5.001425
        1 |    .0239855    4.980974    13.02399     4.980974
        2 |    -.092391    4.973914    15.90761     4.973914
        3 |    .0314353    5.015356    19.03144     5.015356
        4 |   -.0103497    4.999057    21.98965     4.999057
        5 |   -.0699126    5.007846    24.93009     5.007846
        6 |    -.050905    5.031581    27.94909     5.031581
        7 |   -.0489067    4.984283    30.95109     4.984283
        8 |    .0238433    4.971429    34.02384     4.971429
        9 |    .0392646     4.95183    37.03926      4.95183
--------------------------------------------------------------

. sktest error if x==3

                   Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)   Prob>chi2
-------------+--------------------------------------------------------
       error |      0.400          0.822          0.76        0.6832

. sktest y if x==7

                   Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)   Prob>chi2
-------------+--------------------------------------------------------
           y |      0.209          0.988          1.58        0.4548

10. Now, let's run a simulation on a much smaller sample, one with 20 observations. I
take some shortcuts, using variations on the previous commands, to create two
observations at each integer value of X from 0 to 9 and to create Y as a function of the
deterministic expected value (10 + 3*x) and a stochastic disturbance term
(5*invnorm(uniform())).
11. I then regress Y on X in three separate simulations. Note that X is fixed in repeated
samples, that is, X has the same values in all three samples. Y, however, varies
substantially from sample to sample due to the stochastic term.
12. Econometric theory tells us that the expected values of the y-intercept, the
regression coefficient, and the standard deviation of the error are all equal to the
population parameters -- 10, 3, and 5, respectively.
The expected variance of the regression coefficient is the population variance (25)
divided by the total variation in X. One approach to figuring that total variation
is to square the standard deviation of X (2.946898) to get the variance of X
(8.6842078), then multiply that by N-1 (19) to get the total variation (165).
Thus, the variance of the sampling distribution for the sample regression
coefficient is .151515 (25/165) and the standard error of the sampling distribution
is .389, the square root of the variance.
The actual values of the statistics are in the ballpark, but they vary substantially
from sample to sample. One regression coefficient (3.027408) is very close to the
population parameter, but the other values are 1.878299 and 3.382154.
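The same arithmetic can be done with display; a quick sketch of my own using the numbers above:
. * Total variation in X: (sd of X)^2 times N-1 (about 165)  (my addition)
. display 2.946898^2*19
. * Variance of the sampling distribution of b1: 25/165 (about .1515)
. display 25/(2.946898^2*19)
. * Standard error of the sampling distribution: the square root (about .389)
. display sqrt(25/(2.946898^2*19))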
. clear
. set obs 20
. gen x = int((_n-1)/2)
. tab x

          x |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         2       10.00       10.00
          1 |         2       10.00       20.00
          2 |         2       10.00       30.00
          3 |         2       10.00       40.00
          4 |         2       10.00       50.00
          5 |         2       10.00       60.00
          6 |         2       10.00       70.00
          7 |         2       10.00       80.00
          8 |         2       10.00       90.00
          9 |         2       10.00      100.00
------------+-----------------------------------
      Total |        20      100.00
. sum x

    Variable |       Obs        Mean    Std. Dev.        Min         Max
-------------+-----------------------------------------------------------
           x |        20         4.5    2.946898           0           9

. gen y = 10 + 3*x + 5*invnorm(uniform())


. reg y x

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =   54.72
       Model |  1512.25782     1  1512.25782           Prob > F      =  0.0000
    Residual |  497.476161    18  27.6375645           R-squared     =  0.7525
-------------+------------------------------           Adj R-squared =  0.7387
       Total |  2009.73398    19  105.775473           Root MSE      =  5.2571

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   3.027408   .4092681     7.40   0.000     2.167568    3.887248
       _cons |   8.559516   2.184894     3.92   0.001     3.969224    13.14981
------------------------------------------------------------------------------

. clear
. set obs 20
. gen x = int((_n-1)/2)
. gen y = 10 + 3*x + 5*invnorm(uniform())
. reg y x

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =   18.43
       Model |  582.121236     1  582.121236           Prob > F      =  0.0004
    Residual |  568.616827    18  31.5898237           R-squared     =  0.5059
-------------+------------------------------           Adj R-squared =  0.4784
       Total |  1150.73806    19  60.5651612           Root MSE      =  5.6205

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.878299    .437554     4.29   0.000     .9590323    2.797566
       _cons |   14.18862   2.335899     6.07   0.000     9.281077    19.09616
------------------------------------------------------------------------------

. clear
. set obs 20
. gen x = int((_n-1)/2)
. gen y = 10 + 3*x + 5*invnorm(uniform())
. reg y x

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =   83.43
       Model |  1887.42945     1  1887.42945           Prob > F      =  0.0000
    Residual |  407.213665    18  22.6229814           R-squared     =  0.8225
-------------+------------------------------           Adj R-squared =  0.8127
       Total |  2294.64311    19   120.77069           Root MSE      =  4.7564

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   3.382154   .3702823     9.13   0.000      2.60422    4.160088
       _cons |   7.769109   1.976767     3.93   0.001     3.616077    11.92214
------------------------------------------------------------------------------

13. Expected values tell us more about sampling distributions than about samples,
however. We don't expect any sample regression coefficient to equal the population
parameter; what we expect is that, if we draw a huge number of samples from the same
population, the mean regression coefficient should equal the population parameter.
14. Stata has Monte Carlo procedures that allow us to test this idea. First, we define a
new program. In this case, I call it bivar20. Lines 6 through 10 match the five lines we
used to create the previous three samples (substituting drop _all for clear). The
program starts with program define and ends with end. I'm running version 6.0 instead
of Stata 10, because that's how I first wrote it up, and the ways to save statistics have
changed a little since then. Line 11 tells Stata to save the y-intercept, its standard error,
the regression coefficient, and its standard error. Lines 2-5 tell it to call them a, se_a, b,
and se_b, respectively.


. program define bivar20
  1.        version 6.0
  2.        if "`1'" == "?" {
  3.                global S_1 "a se_a b se_b"
  4.                exit
  5.        }
  6.        drop _all
  7.        set obs 20
  8.        gen x = int((_n-1)/2)
  9.        gen y = 10 + 3*x + (invnorm(uniform())*5)
 10.        regress y x
 11.        post `1' _b[_cons] _se[_cons] _b[x] _se[x]
 12.        drop y
 13. end

15. Next we run 10,000 iterations of this program, creating 10,000 different samples of
size 20, which give 10,000 different y-intercepts, 10,000 different regression
coefficients, and 10,000 different estimated standard errors on the regression
coefficients. I save those statistics.
. simul bivar20, reps(10000)
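For readers working in a current version of Stata, the same Monte Carlo can be written with an rclass program and the simulate command. This is a sketch of my own (the name bivar20new and the use of rnormal() are my substitutions), not the original version-6 code:
program define bivar20new, rclass
        drop _all
        set obs 20
        gen x = int((_n-1)/2)
        gen y = 10 + 3*x + 5*rnormal()
        regress y x
        return scalar a    = _b[_cons]
        return scalar se_a = _se[_cons]
        return scalar b    = _b[x]
        return scalar se_b = _se[x]
end

simulate a=r(a) se_a=r(se_a) b=r(b) se_b=r(se_b), reps(10000): bivar20new
Either version leaves you with a data set of 10,000 y-intercepts, regression coefficients, and standard errors.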

16. We now have 10,000 y-intercepts, 10,000 regression coefficients, and 10,000
standard errors for each. The mean of the 10,000 sample y-intercepts is 9.990219 and
the mean of the 10,000 sample regression coefficients is 3.00021; that is, both nearly
equal the population parameters. The standard deviation of the 10,000 sample
regression coefficients is .3887085; that is, it nearly equals the standard error of the
sampling distribution for the sample regression coefficient.
. sum

    Variable |       Obs        Mean    Std. Dev.        Min         Max
-------------+-----------------------------------------------------------
           a |     10000    9.990219    2.070014    2.457855    18.03333
        se_a |     10000    2.052933    .3427251     .774426    3.302449
           b |     10000     3.00021    .3887085    1.412235    4.498831
        se_b |     10000    .3845496    .0641983    .1450633    .6186054

17. Further, the distribution of the sample regression coefficient is quite plausibly
normal. The skewness is -.035087 (very close to 0) and the kurtosis is 2.979169 (very
close to 3). The sktest finds no reason to reject the null hypothesis of normality.
. sum b, detail

                              b
-------------------------------------------------------------
      Percentiles      Smallest
 1%     2.08046        1.412235
 5%    2.363813        1.480688
10%    2.502738        1.642873       Obs            10000
25%     2.73879        1.671909       Sum of Wgt.    10000

50%    3.000436                       Mean          3.00021
                        Largest       Std. Dev.    .3887085
75%    3.261861        4.232152
90%    3.501068        4.259563       Variance     .1510943
95%    3.642922         4.32136       Skewness     -.035087
99%    3.886923        4.498831       Kurtosis     2.979169


. sktest b

                   Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)   Prob>chi2
-------------+--------------------------------------------------------
           b |      0.152          0.694          2.21        0.3319

18. If the distribution is normal, 95% of the sample regression coefficients should be
within 1.96 standard errors of the population parameter. I create a new dummy variable
within196, which is coded 1 when the sample regression coefficient (b) is within 1.96
standard errors of 3. Consistent with our expectations, 94.95% of the sample regression
coefficients are within 1.96 standard errors of the population parameter.
. gen within196 = (b>3-1.96*.38925) & (b<3+1.96*.38925)
. tab within196
  within196 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |       505        5.05        5.05
          1 |      9495       94.95      100.00
------------+-----------------------------------
      Total |     10000      100.00

19. If the distribution is normal, then if we standardize the sample regression coefficient
by subtracting the population parameter (3) and dividing by the true standard error,
the standardized regression coefficient should have a standard normal (or z)
distribution; that is, it should be normal, with a mean of 0 and a standard deviation of
1. I create a new variable zofb following that formula and a new dummy variable
(zisin) which is coded 1 when zofb is between -1.96 and +1.96. As we already
know from within196, 94.95% of the sample regression coefficients are within 1.96
standard errors of the population parameter.
. gen zofb = (b-3)/.38925
. gen zisin = ( zofb>-1.96) & ( zofb<1.96)
. tab zisin
      zisin |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |       505        5.05        5.05
          1 |      9495       94.95      100.00
------------+-----------------------------------
      Total |     10000      100.00

20. If the distribution is normal and we standardize the sample regression coefficient by
subtracting the population parameter (3) and dividing by the estimated standard
error, the standardized regression coefficient should have a t distribution with 18
degrees of freedom (the sample size minus one degree of freedom each for the
y-intercept and the regression coefficient).
I create a new variable tofb following that formula. Notice that in zofb I divide
by a parameter (.38925), but in calculating tofb I divide by a variable (se_b).
Each sample has its own estimated standard error, but there is only one true
standard error for this sampling distribution.
21. At the .05 level, the critical values of a t-distribution with 18 degrees of freedom are
-2.100922 and 2.100922. That is, 95% of the sample regression coefficients should be
within 2.100922 estimated standard errors of the population parameter. I create a
new dummy variable (tisin) which is coded 1 when tofb is between -2.100922 and
2.100922.
tofb, like zofb, has a mean very close to 0, but the standard deviation of tofb is
larger. Likewise, the minimum and maximum are much more extreme for tofb
than for zofb. Despite this, however, and in line with expectations, 94.98% of
these 10,000 sample regression coefficients are within 2.100922 estimated
standard errors of the population parameter (3).
. display invttail(18,.025)
2.100922
. gen tofb = (b-3)/ se_b
. gen tisin = ( tofb>-2.100922) & ( tofb<2.100922)
. sum z* t*

    Variable |       Obs        Mean    Std. Dev.        Min         Max
-------------+-----------------------------------------------------------
        zofb |     10000    .0005384    .9986089   -4.079037    3.850561
       zisin |     10000       .9495    .2189853           0           1
        tofb |     10000   -.0001648    1.056928   -4.808284    5.031728
       tisin |     10000       .9498    .2183683           0           1
. tab tisin

      tisin |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |       502        5.02        5.02
          1 |      9498       94.98      100.00
------------+-----------------------------------
      Total |     10000      100.00

22. In sum, the Monte Carlo results are consistent with theory. In samples of size 20,
the sample statistics vary fairly widely from the population parameter, but the sample
regression coefficients come from a sampling distribution that is normal, with a mean
equal to the population parameter and a standard error equal to the square root of (the
variance of the disturbance term divided by the total variation in the independent
variable). As expected, 95% of the sample regression coefficients were within 1.96 true
standard errors of the population parameter. Also, 95% of the sample regression
coefficients were within 2.100922 estimated standard errors of the population
parameter (because when we standardize with the estimated rather than the true
standard error, the statistic has a t- rather than a z-distribution; and because the critical
value of a t-distribution with 18 df is 2.100922).


INFERENTIAL STATISTICS FOR REGRESSION ANALYSIS:


Hypothesis Tests and Confidence Intervals
1. Typically, a data set does not interest us for itself but for what it can tell us about the
population from which the data set is drawn. Inferential statistics allow us to
generalize from a simple random sample to the population from which the sample was
drawn. In regression analysis, we use hypothesis tests primarily to establish that
relationships exist in the population, by ruling out the possibility that the population
parameter is 0. We use confidence intervals to estimate the size of the population
parameter, that is, to estimate the strength of the relationship in the population.
2. Both rely on our knowledge of the sampling distribution for the sample regression
coefficient.
3. We have established that, under the classical assumptions, the OLS bivariate
regression estimator b1 is unbiased [E(b1) = β1] with a variance of σ²/Σ(xi - x̄)². We
have further established that b1 is the best linear, unbiased estimator of β1.
If the error term is normally distributed, b1 ~ N(β1, σ²/Σ(xi - x̄)²).
Using the central limit theorem, as the sample size increases, b1 is asymptotically
normally distributed.
4. That is, the sample regression coefficients are sample statistics with sampling
distributions. In general, given the classical assumptions and either a large sample size
or a normally distributed error term, the sampling distribution for the sample
regression coefficient is (1) normal (or at least asymptotically normal), with (2) a mean
equal to the population parameter [E(b1) = β1] and (3) a standard error equal to the
square root of the statistic's variance: σ_b1 = √[σ²/Σ(xi - x̄)²].
6. That means that (b1 - β1)/σ_b1 has a standard normal (z) distribution. That is, if we
subtract the population parameter from the sample regression coefficient, then divide
by the true standard error of the sampling distribution, this new standardized
regression coefficient is normally distributed with a mean of 0 and a variance of 1.
7. Further, we know how to calculate an unbiased estimator of the variance of the
sampling distribution by substituting the mean squared error (MSE) for the population
variance: σ̂_b1 = √[MSE/Σ(xi - x̄)²]. Following the logic we used with the
sample mean, we could show that (b1 - β1)/σ̂_b1 has a t distribution with N - k - 1
degrees of freedom (where k is the number of independent variables in the model). That
is, if we subtract the population parameter from the sample regression coefficient, then
divide by the estimated standard error of the sampling distribution, this new
standardized regression coefficient has a t-distribution, just as a similarly
standardized sample mean has a t-distribution. This means that inference for the
population regression parameter is a simple variation on inference for the population
mean.
8. In hypothesis tests, we substitute a hypothesized value (typically, zero) for β1, then
determine whether the sample regression coefficient b1 is so far from zero (measured in
estimated standard errors) that it is not plausible that β1 = 0. If so, we can confidently
state whether β1 is above or below zero; that is, we can confidently state whether the
relationship in the population is positive or negative.
9. In confidence intervals, we take advantage of the fact that the sample regression
coefficient has a known probability of falling within a set number of estimated standard
errors of the population parameter, and that whenever the sample regression coefficient
is within that number of estimated standard errors of the population parameter, the
population parameter is also within that number of estimated standard errors of the
sample regression coefficient. With large samples, for instance, there is a 95%
probability that the sample regression coefficient will fall within 1.96 estimated standard
errors of the population parameter; therefore, we can be 95% confident that the
population parameter is within 1.96 estimated standard errors of the sample regression
coefficient.
The Hypothesis Test for the Population Parameter
17. The hypothesis test is designed to establish whether the sample provides enough
evidence to conclude that the population parameter differs from a particular number.
We are typically interested in establishing that a relationship between X and Y exists in
the population. We perform a hypothesis test hoping to convincingly reject the null
hypothesis that no relationship exists in the population. That is, we want to reject the
possibility that changes in X are not associated with changes in Y and that the
population parameter is zero; that is, we want to conclude that β1 ≠ 0.
STEP ONE. State the research and null hypotheses. The null hypothesis is that β1
(beta-one, the population slope parameter) equals some particular number (typically
0). The research or alternative hypothesis can take three general forms: the population
regression parameter is (a) greater than, (b) less than, or (c) not equal to that number.
H0: β1 = 0
H1: β1 < 0   or   β1 > 0   or   β1 ≠ 0

STEP TWO. Remind yourself that if the null hypothesis is true, the sampling
distribution for (b1 - 0)/σ̂_b1 (which we will nickname t*) is a t-distribution with N-k-1
degrees of freedom.
STEP THREE. Decide how willing you are to conclude that a true null hypothesis is
false. Then find the decision rule that gives you exactly that probability of rejecting a
true null hypothesis. You choose from three possible decision rules, depending on the
form of the research hypothesis.
(a) If H1 is that the population regression parameter is greater than some number (β1 > 0),
then reject H0 if t* > t_α and accept H0 if t* < t_α.
(b) If H1 is that the population regression parameter is less than some number (β1 < 0),
then reject H0 if t* < -t_α and accept H0 if t* > -t_α.
(c) If H1 is that the population regression parameter differs from some number (β1 ≠ 0),
then reject H0 if t* < -t_α/2 or if t* > t_α/2, and accept H0 if -t_α/2 < t* < +t_α/2.
STEP FOUR. Calculate the test-statistic (t*).
t* = (b1 - 0)/σ̂_b1
STEP FIVE. Compare the test-statistic to the decision rule. Decide whether you have
sufficient evidence to reject the null hypothesis. State your conclusion in substantive
terms.
Logic behind the Hypothesis Test
18. Imagine we want to know whether the mean salary of all male federal white-collar
employees was more than the mean salary of all female federal white-collar employees
in 1991. In a random sample of 3484 employees, the mean salary of women was
$27,423 and the mean salary of men was $12,582 higher. The estimated standard error
for the difference of mean salaries was $443.22. Is this sufficient information to draw
the conclusion that the mean salary for the population of male federal white-collar
employees was more than the mean salary for the population of female federal
white-collar employees in 1991?
TABLE 6. reg sal male

      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  1,  3482) =  805.85
       Model |  1.3783e+11     1  1.3783e+11           Prob > F      =  0.0000
    Residual |  5.9555e+11  3482   171037759           R-squared     =  0.1879
-------------+------------------------------           Adj R-squared =  0.1877
       Total |  7.3338e+11  3483   210560961           Root MSE      =   13078

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |    12581.9   443.2203   28.387   0.000      11712.9    13450.89
       _cons |   27422.93   316.4478   86.659   0.000     26802.48    28043.37
------------------------------------------------------------------------------

19. A hypothesis test tests the plausibility that a population parameter has a particular
value. In this case, we imagine three possibilities about the difference of mean salaries
between men and women in the population of federal employees: β1 > 0, β1 < 0, and β1 = 0.
The only definite value suggested for β1 is 0, because the other possibilities are just that
β1 is bigger or smaller than 0, not that β1 equals some particular number.
20. We therefore assume that the population regression parameter equals 0. To
assume does not mean to believe or to conclude. An assumption is a starting place. We
are saying, "Let's assume that β1 = 0 and see what that implies for the sampling
distribution of b1." That, in turn, leads to implications for the standardized sample
regression coefficient if β1 = 0. We decide how willing we are to make a certain type of
mistake (deciding that β1 differs from 0 when they are really equal). We combine the
willingness and the implications to come up with a decision rule that gives us a known
probability of saying that β1 doesn't equal 0 when it really does. Then we follow our rule
in drawing a conclusion about the population.
STEP ONE:
21. First, formulate the RESEARCH or ALTERNATIVE HYPOTHESIS (H1). It
states that the population regression parameter is (a) greater than, (b) less than, or (c)
not equal to some number. We almost always think H1 is true. Since we know that
men earn more than women, we hypothesize that β1 is greater than 0: H1: β1 > 0
22. Next, state the NULL HYPOTHESIS (H0), which is that the population
regression parameter equals the value that the alternative hypothesis claimed it was
different from. The null hypothesis always contains an equals sign when stated
symbolically, although sometimes the research hypothesis is such that the null
hypothesis implies "greater than or equal to" or "less than or equal to." In this case the
null hypothesis is that the population regression parameter equals 0: H0: β1 = 0
STEP TWO:
23. Given a properly specified model meeting the classical assumptions, plus either a
normally distributed error term or a sufficiently large sample, the sampling distribution
for b1 (the sample regression coefficient) is:
a) normal
b) E(b1) = β1 (the mean of the sampling distribution for b1 equals the population
regression parameter)
c) σb1 = σ / √Σ(xi - xbar)² (the standard error of the sampling distribution for b1
equals the population standard deviation divided by the square root of the
variation in X).
24. We also know (whether H0 is true or false) that when we standardize the normally
distributed statistic b1 by subtracting the mean of its sampling distribution and dividing

by the true standard error [(b1 - β1)/σb1], the standardized sample regression coefficient has a
standard normal (z) distribution.
25. We also know (whether H0 is true or false) that when we standardize the normally
distributed statistic b1 by subtracting the mean of its sampling distribution and dividing
by its estimated standard error [(b1 - β1)/σ̂b1], the standardized sample regression
coefficient has a t-distribution with N-k-1 degrees of freedom, where k=number
of independent variables.
26. If (and only if) the null hypothesis is true, the population regression
parameter is 0. If (and only if) the null hypothesis is true, we can standardize the
sample regression coefficient by subtracting the hypothesized population regression
parameter (0) and dividing by the estimated standard error [σ̂b1 = s / √Σ(xi - xbar)² =
Root MSE / √SSTx]. We call this new statistic t* [t* = (b1 - 0)/σ̂b1].
t* is the standardized sample regression coefficient if and only if the null
hypothesis is true, and t* has a t-distribution with N-k-1 degrees of freedom if
and only if the null hypothesis is true.
27. If H0 is false, the sampling distribution for b1 is still normal, and its estimated
standard error (σ̂b1) is still s/√Σ(xi - xbar)², but the mean of the sampling
distribution is not 0. If H0 is false, t* does not have a t-distribution because t*
is not the standardized value of b1. t* is the standardized value of b1 only if
β1 = 0.
If H0 is false, we do not know what the sampling distribution of t* is.
28. Knowing the sampling distribution for t* if H0 is true allows us to predict what value
t* will have if H0 is true.
In this case, our sample is large enough that we can use the t-distribution with
infinite degrees of freedom. According to this t-distribution, we know, for
instance, that there is a 95 percent chance that t* will be less than 1.645, a 95
percent chance that t* will be greater than -1.645, and a 95 percent chance that t*
will be between -1.96 and +1.96 if H0 is true.
If H0 is false, we do not know the sampling distribution for t*; therefore, we do
not know how likely it is that t* will be in any of those intervals if H0 is false.

STEP THREE:
29. Naturally, the sample regression coefficient will not exactly equal the hypothesized
population regression parameter (0) even if the null hypothesis is true; so t* will not
exactly equal 0 (which it would if the sample regression coefficient were 0.0000). The
sampling distribution will tell us, however, how likely it would be to get a t* as far from
zero as the t* we actually got (and a sample regression coefficient as far from 0 as the
sample regression coefficient we actually got) if the null hypothesis were true.
30. If we got a t* that would be very unlikely to get from a t-distribution, then we
could conclude that the sampling distribution for t* was not a t-distribution. That
would imply that the population regression parameter did not equal the hypothesized
population regression parameter (0), and we would reject the null hypothesis as false.
31. If we got a t* that would be plausible to get from a t-distribution, then we could
conclude that the sampling distribution for t* could be a t-distribution, so the
population regression parameter could equal the hypothesized population regression
parameter. We would therefore tentatively accept that the null hypothesis could be
true.
32. Any time we make inferences from samples to populations, we take some chance of
making an error. If we reject the null hypothesis and H0 is actually true, we make a
Type I error. If we tentatively accept the null hypothesis and H0 is actually false, we
make a Type II error. In most cases, decreasing the probability of making a Type I
error increases the probability of making a Type II error, and vice versa.
33. Since we are assuming that the null hypothesis is true, we control the probability of
making a Type I error (rejecting a true null hypothesis). The significance level (α,
alpha) is our statement of how willing we are to reject a true null hypothesis. The
most common significance levels are .05 and .01. Choosing a smaller alpha decreases
our chance of making a Type I error, but increases our chance of a Type II error.
34. We choose a decision rule that allows us to reject a true null hypothesis with a
fixed probability (alpha). That is, in repeated tests with different samples from a
population in which the null hypothesis is true, we will nonetheless reject the null
hypothesis a fixed proportion of the time (alpha).
35. If the null hypothesis is true, the sampling distribution for t* is a t-distribution
with N-k-1 degrees of freedom. Therefore, we can divide that t-distribution into two
regions (the acceptance region and the rejection region). We create the rejection
region so that it has exactly an alpha probability of including t* if H0 is true. We then
reject the null hypothesis whenever t* falls in the rejection region. There is a (1-α)
probability that t* will fall inside the acceptance region if H0 is true. If we accept the
null hypothesis whenever t* falls into the acceptance region and reject H0 whenever t*
falls into the rejection region, we have a (1-α) probability of making the right decision and
an α probability of making a Type I error if H0 is true.
36. If H0 is false, t* does not have a t-distribution and we do not know the probability
that t* will fall into the acceptance or rejection region.
In general, however, our probability of rejecting H0 (which would be the right
decision in this case) is higher than α and our probability of accepting H0 (which
would be the Type II error in this case) is less than 1-α.
37. The value of t that separates the acceptance region from the rejection region is
called the critical value. We choose the critical value(s) of t based on the form of the
alternative hypothesis.
38. If H1 is only that the population regression parameter is different from (not equal
to) 0, then we will reject the null hypothesis if b1 is much higher or much lower than 0.
We therefore need two rejection regions. To take an α (e.g., 5%) chance of rejecting a
true null hypothesis, we reject H0 if the probability of getting a t* at least as high as the
one we got is only α/2 (2.5%) or if the probability of getting a t* at least as low as the
one we got is only α/2 (2.5%) if H0 is true (that is, if t* really has a t-distribution). The
critical value that puts α/2 in the upper tail is called tα/2 and the critical value that puts
α/2 in the lower tail is called -tα/2.
Our decision rule, then, is to reject H0 if t* > tα/2 or if t* < -tα/2 and to accept H0 if
-tα/2 < t* < tα/2. (With α = .05, we reject H0 if t* < -1.96 or if t* > +1.96 and accept H0 if
-1.96 < t* < +1.96.) This is called a two-tailed test because you can reject the null
hypothesis if the t* is too far into either tail of the assumed sampling distribution.
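As a quick sketch (not part of the original example), these critical values can be reproduced
with Stata's invttail() function; the 3482 degrees of freedom are simply borrowed from the
salary example below for illustration:
. di invttail(3482, .025)     // two-tailed critical value for alpha = .05 (about 1.96 with a large sample)
. di invttail(3482, .05)      // one-tailed critical value for alpha = .05 (about 1.645)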
40. If we are confident before we look at the sample data that β1 is greater than 0, then
we can be confident that b1 will also be larger than 0 and that t* will be positive. There is
no reason to waste half of our alpha on the negative side of the t-distribution. By putting
all of the rejection region into the upper tail of the t-distribution, we lower the critical
value from tα/2 to tα (when α = .05, this is the difference between 1.96 and 1.645).
41. This does not change the probability of rejecting H0 if H0 is true (that probability is
still α), but it does make it easier to reject H0 if H0 is false. If β1 > 0, then we expect t* to
be a positive number. Every t* that is greater than 1.96 (the critical value in the two-tailed
test) is also greater than 1.645 (the critical value in the one-tailed test), but the opposite
is not true. Therefore, a wider range of positive t* values allows us to reject H0 in a one-tailed test.
(For instance, with a .05 significance level, we can reject H0 with a t* of 1.85 in a one-tailed
test but not in a two-tailed test.)
42. If we are confident (before looking at the sample data) that β1 < 0, then we should be
confident that b1 < 0 and that t* < 0. Therefore, we put the entire rejection region into the
lower tail of the t-distribution and reject H0 if t* < -tα (rather than -tα/2). This one-tailed
test gives us the same α probability of rejecting a true H0 as a two-tailed test would, but
it gives us a better chance of rejecting a false H0.
43. The form of the research hypothesis should always be chosen before we look at the
sample data to avoid the temptation to choose a one-tailed test based on whether the
sample regression coefficient is positive or negative. Remember, even if the null
hypothesis is true, there is virtually no chance that the sample statistic will be equal to 0.
Instead, there is a 50% chance b1 will be larger and a 50% chance b1 will be smaller than
β1. If you wait until you see whether b1 is positive or negative before deciding what tail
to put your alpha into, you will never put your α in the wrong tail.
In reality, you are putting α into both tails and doubling your likelihood of
making a Type I error--that is, alpha is twice as large as you think it is. (If you
put your .05 into the upper or lower tail after looking at the sample regression
coefficient, you will only accept H0 if -1.645 < t* < 1.645 (with infinite degrees of
freedom), which really gives you an α of .10.)
44. Thus, STEP THREE is to choose a significance level and find the appropriate
decision rule. The smaller the alpha we choose, the less chance we have of rejecting the
null hypothesis (whether H0 is true or false). There are three possible decision rules,
depending on the form of the research hypothesis.
If H1 is that β1 > 0 (that is, that the population regression parameter is greater than
some number), reject H0 if t* > tα; accept H0 if t* < tα.
If H1 is that β1 < 0, reject H0 if t* < -tα; accept H0 if t* > -tα.
If H1 is that β1 ≠ 0, reject H0 if t* < -tα/2 or if t* > +tα/2; accept H0 if -tα/2 < t* < tα/2.
45. In this example, because we are virtually certain that men earn more than women
before looking at the data, we choose a .01 significance level and a one-tailed test:
Reject H0 if t* > 2.326; accept H0 if t* < 2.326.
STEP FOUR:
46. Calculate the test-statistic (t*). In this case, we take the sample regression
coefficient (12581.9) minus the hypothesized population regression parameter (0),
divided by the estimated standard error (443.22). t* = 12581.9/443.22 = 28.387. In
other words, the sample regression coefficient is more than 28 estimated standard
errors above 0.
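A minimal sketch of this arithmetic in Stata, typing in the Table 6 estimates by hand:
. di (12581.9 - 0)/443.2203   // t* = (b1 - 0)/estimated standard error, about 28.39
. di invttail(3482, .01)      // one-tailed critical value for alpha = .01, about 2.33 (essentially the 2.326 used above)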

STEP FIVE:
47. Compare the test-statistic to the critical value. Decide whether you have sufficient
evidence to reject the null hypothesis. If you reject H0, it is because you have gotten a
result that was very unlikely if H0 were true; therefore, you are confident when you
reject the null hypothesis. If you accept H0, it is because t* is somewhere in the
acceptance region and the null hypothesis is plausible. You do not have convincing
evidence that H0 is true; you simply lack convincing evidence that it is false. When you
accept H0, you do so tentatively and unconfidently. Several authors recommend that
you do not "accept" H0; you merely "fail to reject" it. That is, you stick with your initial
assumption rather than reaching a real conclusion that H0 is true. Be sure to state
your conclusion in substantive terms.
48. In this case, t* = 28.387 and tα = 2.326. Since t* > tα, we reject H0 and conclude that
β1 > 0, that is, that the mean salary for all male federal white-collar employees was
greater than the mean salary for all female federal white-collar employees in 1991. We
are quite confident of this conclusion, because the probability of getting a value further
from 0 than 28 from a t-distribution with infinite degrees of freedom is less than .0005.
Therefore, it's quite unlikely that t* came from a t-distribution with infinite degrees of
freedom or that β1 = 0 (because if H0 were true and β1 did equal 0, then t* would have
come from a t-distribution with ∞ df).
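If we want the actual tail probability rather than a comparison with a critical value, the
ttail() function used later in these notes gives it directly. A sketch with the Table 6 numbers
(the value is so small that Stata reports it as essentially zero):
. di ttail(3482, 28.387)      // P(t > 28.387) with 3482 df if H0 is true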

Confidence Intervals for the Population Regression Parameter


49. To determine the strength of the relationship in the population, we calculate a
confidence interval for the population parameter (β1).
50. Given the classical assumptions, the sample regression coefficient from OLS is the
best linear unbiased estimator of the population parameter. It is, therefore, our one
best guess at the value of the population parameter: b1 is our best point estimate of β1.
53. Unfortunately, P(b1 = β1) = 0; the sample statistic essentially never equals
the population parameter, so our point estimate will never be exactly right. An interval
estimate contains a range of possible values within which the population parameter is
likely to fall, with a known probability.
54. The confidence level is the probability that β1 actually falls within the confidence
interval.
We typically calculate 95 percent confidence intervals, though any size confidence
level is possible. This means that we are typically 95 percent confident that the
population parameter is inside the confidence interval.
If we wanted to be more confident that the population parameter was actually
inside the confidence interval (that is, if we wanted to raise the confidence level),
we would need a wider interval.
On the other hand, if we were willing to accept more than a five percent chance
that the population parameter was outside the confidence interval (lower the
confidence level), we could use a narrower interval.
There is a tradeoff between the precision of our estimate and our
probability of being correct.
55. The 95% confidence interval itself stretches from the lower confidence limit
(1.96 estimated standard errors below the point estimate, in a large sample) to the upper
confidence limit (1.96 estimated standard errors above the point estimate).
56. The general logic here goes back to the sampling distribution of the sample
regression coefficient. Because the sample regression coefficient has a normal
probability distribution, there is a 95 percent probability that the sample regression
coefficient will fall within 1.96 standard errors of the mean of its sampling distribution.
Since the mean of the sampling distribution is the population parameter, there is a 95
percent probability that the sample regression coefficient will be within 1.96 standard
errors of the population parameter.
57. Whenever the sample regression coefficient is within 1.96 standard errors of the
population parameter, the population parameter is also within 1.96 standard errors of
the sample regression coefficient.
Whenever I am within 20 feet of you, you are within 20 feet of me. (Remember
that 1.96 standard errors can be interpreted as a measure of distance.)
58. If there is a 95 percent chance that the sample regression coefficient will be within
1.96 standard errors of the population parameter, there is also a 95 percent chance that
the population parameter will be within 1.96 standard errors of the sample regression
coefficient.
59. Thus, we can construct a confidence interval that begins 1.96 estimated standard
errors below our sample regression coefficient and ends 1.96 estimated standard errors
above our sample regression coefficient. If our sample regression coefficient was
actually within 1.96 estimated standard errors of the population parameter (and there
was a 95 percent chance that this was true), then the population parameter is also inside
this confidence interval.
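A minimal sketch of how such an interval is built, again typing in the Table 6 estimates by hand:
. di 12581.9 - invttail(3482, .025)*443.2203   // lower 95% confidence limit, about 11713
. di 12581.9 + invttail(3482, .025)*443.2203   // upper 95% confidence limit, about 13451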

Interpreting Hypothesis Tests and Confidence Intervals


60. The point of the hypothesis test is typically to establish that X and Y are related in
the population. To show this, we attempt to show that it is highly unlikely that 1 (the
population parameter) could equal zero. (If 1=0, then a change in X has no impact on
the expected value of Y.)
Note: beta, the population parameter, is completely different from the betaweight, which is the standardized coefficient. You will never see the population
parameter printed in your regression output.
61. Because b1 is a normally distributed statistic with a mean of β1, there is a 95%
probability that the sample regression coefficient (b1) will fall within 1.96 standard
errors of the population parameter (β1).
62. If β1 = 0, there is a 95% probability that the sample regression coefficient (b1) will fall
within 1.96 standard errors of 0.
If the sample regression coefficient is more than 1.96 estimated standard errors
above 0, we will confidently conclude that the population parameter (and
therefore the relationship in the population) is positive, that is, that β1 > 0.
If the sample regression coefficient is more than 1.96 estimated standard errors
below 0, we confidently conclude that X and Y are negatively related in the
population.
If the sample regression coefficient is within 1.96 estimated standard
errors of 0, we are unable to confidently conclude that the population
parameter is either positive or negative. We will say that we tentatively accept
the null hypothesis that the population parameter is 0.
What this means is that we're saying it is plausible that X and Y are not
related in the population, but it is also plausible that X and Y are either
positively or negatively related in the population. In other words, if we
accept the null hypothesis, we learn virtually nothing from the hypothesis
test.
63. Stata indicates how many estimated standard errors the sample regression
coefficient is from 0 in the "t" column. It calculates this number by dividing the sample
regression coefficient (in the "Coef." column) by the estimated standard error (in the
"Std. Err." column).

64. In the P>|t| column, Stata prints the probability of getting a sample regression
coefficient as far from zero as the sample regression coefficient we actually got if the
null hypothesis is true. This provides a short-cut for the hypothesis test:
(a) If we are doing a two-tailed hypothesis test and P>|t| < alpha, reject H0. We
have gotten a rare enough occurrence if the null is true that we can confidently
reject the null.
(b) If we are doing a one-tailed hypothesis test and t* is in the correct tail and
P>|t|/2 < alpha, reject H0.
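One hedged sketch of this short-cut using Stata's stored results (it assumes the salary data set
used in the tables is in memory; _b, _se, and e(df_r) are results Stata stores after regress):
. reg sal male
. di 2*ttail(e(df_r), abs(_b[male]/_se[male]))   // the two-tailed p-value reported in the P>|t| column
. di ttail(e(df_r), abs(_b[male]/_se[male]))     // halve it for a one-tailed test (with t* in the correct tail)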
Interpretation Examples:
In Table 5, the sample regression coefficient shows that as years of education
rises by one, expected salary rises by $3715 in the sample.
TABLE 5. reg sal edyrs

      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  1,  3482) = 1534.97
       Model |  2.2438e+11     1  2.2438e+11           Prob > F      =  0.0000
    Residual |  5.0900e+11  3482   146180591           R-squared     =  0.3060
-------------+------------------------------           Adj R-squared =  0.3058
       Total |  7.3338e+11  3483   210560961           Root MSE      =   12091

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   3715.173   94.82633    39.179   0.000     3529.253   3901.094
       _cons |  -19469.25   1375.916   -14.150   0.000    -22166.94  -16771.57
------------------------------------------------------------------------------

The sample regression coefficient ($3715) divided by the estimated standard
error of $94.83 yields a t* of 39.18. In other words, the sample regression
coefficient is 39 estimated standard errors above 0. This is an incredibly unlikely
occurrence if the population regression parameter is 0. In a normal distribution,
95 percent of the observations fall within about two standard errors of the mean
(in this case, the hypothesized population parameter, 0), 99.7 percent fall within
three standard errors of the mean, and almost none are more than four standard
errors from the mean.
The P>|t| column shows that there are fewer than 5 chances in 10,000 of getting a
sample regression coefficient more than 39 estimated standard errors from 0 if
the population parameter is 0. (If there were more than 5 chances in
10,000, Stata would round up to .001.)
In fact, if we run "display ttail(3482,39.179)", Stata tells us that the probability
of getting a sample regression coefficient at least 39.179 estimated standard errors
above the population parameter is 8.679e-279, that is, a decimal point and an 8
with 278 zeros between them: approximately one chance in a bezillion.

We therefore conclude that salary and education are positively related in
the population.
The confidence interval tells us more. We are 95 percent confident that, in the
population, expected salary rose somewhere between $3529 and $3901 with one
additional year of education.
In Table 6, the sample regression coefficient shows that the mean salary of men
is $12,582 higher than the mean salary of women in the sample. The t* of
28.39 means that the sample regression coefficient of $12,582 is 28.39 estimated
standard errors above 0, again an extremely unlikely event if there were no
relationship in the population (that is, if the population parameter were 0).
[12,581.9/443.22 = 28.39]
We therefore conclude that the mean salary of men is higher than the
mean salary of women in the population.
Indeed, we are 95 percent confident that the mean salary of men is between
$11,713 and $13,451 higher than the mean salary of women in the population of
federal white-collar workers in 1991.
TABLE 6. reg sal male

      Source |       SS       df       MS              Number of obs =    3484
-------------+------------------------------           F(  1,  3482) =  805.85
       Model |  1.3783e+11     1  1.3783e+11           Prob > F      =  0.0000
    Residual |  5.9555e+11  3482   171037759           R-squared     =  0.1879
-------------+------------------------------           Adj R-squared =  0.1877
       Total |  7.3338e+11  3483   210560961           Root MSE      =   13078

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |    12581.9   443.2203    28.387   0.000      11712.9   13450.89
       _cons |   27422.93   316.4478    86.659   0.000     26802.48   28043.37
------------------------------------------------------------------------------

66. Remember that the point of hypothesis tests is to draw conclusions about the
population, not the sample. Therefore, your conclusions should always include the
words "in the population" or other words that make clear that your conclusion is about
the population.
67. Notice too that the hypothesis test does not allow us to draw any conclusions about
the size of the population parameters (e.g., how big the salary advantage of men is). We
only draw conclusions about the sign of the coefficient in the population (e.g., we
conclude that men make more than comparable women in the population).
68. t* is a ratio of the sample regression coefficient to the estimated standard error.
Anything that increases the absolute value of the sample regression coefficient or
shrinks the estimated standard error will cause the absolute value of t* to get larger.
The stronger the relationship in the population, the higher the sample regression
coefficient will tend to be. With strong relationships in the population, t* tends
to be high and we tend to conclude that they are statistically significant, that
is, that the relationships exist in the population.
The larger the sample size, the smaller the standard error will tend to be. This
will tend to enlarge t* and make it more likely that you will conclude that a
relationship exists in the population if a relationship really does exist in the
population. [If the two variables are not related in the population, increasing the
sample size will have no impact on the expected size of t*.]
Tables 5A and 6A repeat the models from Tables 5 and 6, but tested on a random
10% sample of the original data set. The regression coefficients and coefficients
of determination are very similar for Tables 5 and 5A (b1 = 3715 and 3889 and R2
= .306 and .315, respectively), though Table 6A has a substantially higher
regression coefficient (16256 vs. 12582) and R2 (.285 vs. .188).
The standard errors are about three times larger in Tables 5A and 6A than in
Tables 5 and 6, because the samples are only one-tenth as large. The result is
that t* drops substantially for Tables 5A and 6A: by two-thirds for Table 5A
(because the regression coefficient did not change much) and by more than half
in Table 6A (in spite of the fact that the regression coefficient is substantially
larger). Likewise, the 95% confidence intervals are about three times wider (see
the sketch just below).
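A back-of-the-envelope check, a sketch using only the standard errors printed in the tables:
with one-tenth the sample, the standard errors should grow by roughly the square root of 10.
. di sqrt(10)              // about 3.16
. di 308.6837/94.82633     // edyrs: Table 5A s.e. / Table 5 s.e., about 3.26
. di 1383.079/443.2203     // male: Table 6A s.e. / Table 6 s.e., about 3.12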
TABLE 5A. reg sal edyrs

      Source |       SS       df       MS              Number of obs =     348
-------------+------------------------------           F(  1,   346) =  158.78
       Model |  2.5314e+10     1  2.5314e+10           Prob > F      =  0.0000
    Residual |  5.5163e+10   346   159430148           R-squared     =  0.3146
-------------+------------------------------           Adj R-squared =  0.3126
       Total |  8.0477e+10   347   231922992           Root MSE      =   12627

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |    3889.67   308.6837     12.60   0.000     3282.537   4496.803
       _cons |  -22450.19   4487.339     -5.00   0.000    -31276.09    -13624.3
------------------------------------------------------------------------------

TABLE 6A. reg sal male

      Source |       SS       df       MS              Number of obs =     348
-------------+------------------------------           F(  1,   346) =  138.15
       Model |  2.2963e+10     1  2.2963e+10           Prob > F      =  0.0000
    Residual |  5.7514e+10   346   166225158           R-squared     =  0.2853
-------------+------------------------------           Adj R-squared =  0.2833
       Total |  8.0477e+10   347   231922992           Root MSE      =   12893

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   16256.11   1383.079     11.75   0.000     13535.81   18976.41
       _cons |   25038.71   994.7037     25.17   0.000     23082.29   26995.14
------------------------------------------------------------------------------

Remember that tests of statistical significance depend on both the strength of
the relationship and on the sample size. If your sample is large, you are likely to
find that your regression coefficient is statistically significant, even if the
relationship is very weak. Especially with very large samples, pay more attention
to the strength than to the statistical significance of relationships.
69. The results of confidence intervals and hypothesis tests are consistent. Whenever a
95% confidence interval includes the hypothesized population parameter, a two-tailed
hypothesis test with a significance level of .05 will result in accepting the null. Whenever
a 95% confidence interval does not include the hypothesized population parameter,
you will be able to reject the null hypothesis at the .05 significance level in a two-tailed
test.
Likewise, whenever the P>|t| is less than .05, both confidence limits have the
same sign and the confidence interval does not include 0. Whenever P>|t| is
greater than .05, the lower confidence limit is negative and the upper
confidence limit is positive.
70. In most circumstances, confidence intervals tell us more than hypothesis tests.
Typically, a hypothesis test can only tell us that the parameter is positive or
negative (and then only if we reject H0).
If the entire confidence interval is positive or negative (which it must be if we are
able to reject the null hypothesis in a two-tailed test), a confidence interval will
also tell us that.
Because a 95% confidence interval tells us two numbers that the parameter is
between (with 95% likelihood), it helps us know whether the parameter is a lot or
a little different from zero.
The width of the confidence interval tells us how well we know the value of the
parameter.
If 0 is inside the confidence interval, the width tells us whether that happened
because the population parameter really is close to 0 or because our technique
and data are weak.
71. Because a 95% confidence interval stretches from 1.96 estimated standard errors
below the sample regression coefficient up to 1.96 estimated standard errors above the
sample regression coefficient, the width of the confidence interval depends on the size of
the estimated standard error. In general, the larger the sample size, the smaller the
standard error, and the narrower the confidence interval. Larger samples give more
precise estimates at the same confidence level, but at a cost of increased effort,
time, and money.

Using Monte Carlo to Understand Inferential Statistics


72. This Monte Carlo repeats and builds on the simulation from the first lecture. The
program bivar20 creates a data set with 20 observations, two observations each for
every integer value of X from 0 through 9, and values of Y generated by the following
process:
yi = β0 + β1 xi + εi
where β0 = 10, β1 = 3, εi ~ N(0, σ²), and σ² = 25.
Stata saves the y-intercept, its standard error, the regression coefficient, and its
standard error, calling them a, se_a, b, and se_b, respectively.
. program define bivar20
  1.    version 6.0
  2.    if "`1'" == "?" {
  3.            global S_1 "a se_a b se_b"
  4.            exit
  5.    }
  6.    drop _all
  7.    set obs 20
  8.    gen x = int((_n-1)/2)
  9.    gen y = 10 + 3*x + (invnorm(uniform())*5)
 10.    regress y x
 11.    post `1' _b[_cons] _se[_cons] _b[x] _se[x]
 12.    drop y
 13. end

73. This time I ran 100,000 iterations of this program, creating 100,000 different
samples of size 20; 100,000 different y-intercepts; and 100,000 different regression
coefficients.
. simul bivar20, reps(100000)

74. The means of the 100,000 y-intercepts and regression coefficients are very close to
the parameter values, though individual values of the statistics vary widely. Sample
y-intercepts have values as low as .54 and as high as 18.55; sample regression coefficients
range from 1.38 to 4.67.
. sum

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+--------------------------------------------------------
           a |  100000    9.996483    2.081225   .5414817   18.54827
        se_a |  100000    2.049386    .3440887   .7698553   3.597694
           b |  100000     3.00094    .3902561    1.37501   4.674745
        se_b |  100000    .3838852    .0644537   .1442071   .6739098
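As a check on the simulation (a sketch based on the bivar20 design, where each integer 0-9
appears twice and σ = 5), the true standard error of b1 is σ/√Σ(xi - xbar)² = 5/√165, which is
very close to both the observed standard deviation of b (.3903) and the mean estimated
standard error (.3839):
. di 5/sqrt(165)           // true standard error of b1 for this design, about .389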

75. Because there are only 20 observations and the model estimates the y-intercept and
the regression coefficient, (b1 - β1)/σ̂b1 has a t-distribution with 18 degrees of freedom
(N-k-1 = 20-1-1 = 18). I use Stata to calculate the critical values of t with 18 df:
. di invttail(18,.025)
2.100922
. di invttail(18,.05)
1.7340636
. di invttail(18,.005)
2.8784405

That is,
P(-2.100922 < t18df < 2.100922) = .95;
P(-1.7340636 < t18df < 1.7340636) = .90; and
P(-2.8784405 < t18df < 2.8784405) = .99.

76. First, I test the true null hypothesis that β1 = 3. For each of the 100,000 samples, I
create a test-statistic (tstar) by subtracting 3 from the sample regression coefficient and
dividing by the estimated standard error from the sample. I then create a dummy
variable (rejectt) coded 1 whenever the absolute value of tstar is greater than 2.100922
and coded 0 any time the absolute value of tstar is less than 2.100922; that is, rejectt
is 1 whenever the sample gives us enough evidence to reject the null hypothesis.
. gen tstar = (b-3)/se_b
. gen rejectt = abs(tstar)>2.100922

Because this null hypothesis is true, tstar should have a t-distribution with 18
degrees of freedom. It should have a mean very near 0, a variance very near
df/(df-2) = 18/16 = 1.125, and a standard error very near the square root of 1.125
= 1.061. Because 95% of the t-distribution with 18 degrees of freedom falls
between -2.100922 and +2.100922, we should accept the null 95% of the time,
and rejectt should equal 1 only 5% of the time.

. sum tstar rejectt

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+--------------------------------------------------------
       tstar |  100000    .0027405    1.062106  -5.692799   5.865121
     rejectt |  100000      .05099    .2199784          0          1

As expected, tstar has a mean very near 0 (.0027405) and a standard error very
near 1.061 (1.062106). Rejectt=1 only 5.099% of the time in 100,000 trials.
That is, 95% of the time in these 100,000 trials we made the correct decision to
accept the null hypothesis. The other 5% of the time we made a Type I error: we
rejected the null hypothesis even though the null was true.
77. Next, I test the false null hypothesis that β1 = 3.3. For each of the 100,000 samples,
the new test-statistic (tstar2) is created by subtracting 3.3 (rather than the true 3.0)
from the sample regression coefficient and dividing by the estimated standard error
from the sample. The dummy variable (rejectt2) is coded 1 any time the absolute value
of tstar2 is greater than 2.100922; that is, whenever the sample gives us enough
evidence to reject this false null hypothesis.
. gen tstar2 = (b-3.3)/se_b
. gen rejectt2 = abs(tstar2)>2.100922

Because this null hypothesis is false, tstar2 does not have a t-distribution with 18
degrees of freedom and tstar2 should be in the rejection region more than 5% of
the time.
. sum tstar2 rejectt2

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+--------------------------------------------------------
      tstar2 |  100000   -.8020985    1.071328  -6.913829   4.718794
    rejectt2 |  100000      .11244    .3159086          0          1

In line with expectations, the mean value of tstar2 in 100,000 samples is -.802,
though its standard error (1.071328) is not much different from that of tstar. We
now reject the (false) null hypothesis 11.244% of the time in these 100,000 trials.
This time, rejecting the null hypothesis is the correct decision. We do this 11% of
the time, more than twice as often as we did when the null was true.
Unfortunately, we are making the wrong decision 89% of the time, a Type II
error: accepting (or failing to reject) the false null hypothesis.
78. Next, I test three more extreme false null hypotheses, that β1 = 3.6 or 3.9 or 4.2.
This gives me three new test-statistics. The critical values of the t-test remain the same,
because my test-statistic would still have a t-distribution with 18 degrees of freedom if
my null hypotheses had been true.

. gen tstar3 = (b-3.6)/se_b
. gen rejectt3 = abs(tstar3)>2.100922
. gen tstar4 = (b-3.9)/se_b
. gen rejectt4 = abs(tstar4)>2.100922
. gen tstar5 = (b-4.2)/se_b
. gen rejectt5 = abs(tstar5)>2.100922
. sum tstar3 rejectt3 tstar4 rejectt4 tstar5 rejectt5

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+--------------------------------------------------------
      tstar3 |  100000   -1.606938    1.099386  -8.595816   3.572467
    rejectt3 |  100000      .30916    .4621496          0          1
      tstar4 |  100000   -2.411777    1.144898  -10.33981   2.450168
    rejectt4 |  100000      .58979    .4918741          0          1
      tstar5 |  100000   -3.216616    1.205887  -12.08381   1.360463
    rejectt5 |  100000      .82969    .3759068          0          1

79. As the hypothesized value of β1 gets further and further from its true value (3), the
mean of the sampling distribution for the test-statistic gets further and further from 0
(-1.606938 to -2.411777 to -3.216616) and the standard error gets larger and larger.
Consequently, the proportion of the time we reject the (false) null hypothesis rises (from
.30916 to .58979 to .82969). We are still only making the correct decision 83% of the time
at the most extreme value of β1 tested. Even when we hypothesize that β1 = 4.2, we
incorrectly accept (fail to reject) that false null hypothesis 17% of the time.
Impact of Larger Samples
80. Now, imagine that I double the sample size, drawing twice as many observations at
each value of X. The variance of b1 should be cut in half, because we are doubling the total
variation in X. The standard error is cut by about 30% (we divide the original standard
error by the square root of 2).
81. Because the sample size has doubled, the degrees of freedom rises to 38 and the
critical value of t falls to 2.0243941 and the acceptance region gets a little narrower.
82. Neither the change in the variance nor the change in the decision rule should have any
effect on our probability of rejecting the true null hypothesis that β1 = 3, because we chose
to take a 5% chance of rejecting the true null hypothesis.
83. However, when the null hypothesis is β1 = 3.3 but the true value of β1 = 3, the
expected value of (b1 - 3.3) is E(b1) - 3.3 = β1 - 3.3 = -.3. To get t*, we divide b1 - 3.3 by
the estimated standard error of the regression coefficient. The smaller that estimated
standard error gets, the higher the absolute value of the expected value of t*. That is, we
should be rejecting equally false null hypotheses more frequently with larger sample
sizes and smaller standard errors.
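A sketch of that arithmetic for the doubled design (each integer 0-9 now appears four times,
so the variation in X is 330):
. di 5/sqrt(330)           // true standard error of b1 with 40 observations, about .275
. di -.3/(5/sqrt(330))     // rough expected value of t* under the null that β1 = 3.3, about -1.09
This is in the neighborhood of the mean tstar2 of -1.115 reported below; the simulated statistic
divides by the estimated rather than the true standard error, so the match is only approximate.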

84. Again, I run 100,000 iterations, generating 100,000 sets of sample statistics. Again, I
create t-statistics to test five different null hypotheses that $1 has values from 3 to 4.2.
The tstar* variables show the values of the test-statistics; the rejectt* variables are coded
1 if the test-statistic leads us to reject the null, 0 if it leads us to accept the null.
85. The mean values of the 100,000 y-intercepts and regression coefficients are very close
to the population parameters of 10 and 3. The sample statistics vary somewhat less in
samples of size 40 than in samples of size 20: the standard deviation of b, for instance, has
dropped from .3902561 to .2752373, a decline of almost exactly 30%, as predicted.
86. Tstar, the test-statistic for the true null hypothesis, has a mean very close to 0
(-.0033935) and a standard deviation (error) of 1.027468 (down from 1.062106 when the
sample size was 20). In these 100,000 t-tests, we would have rejected the true null
hypothesis 4.955% of the time.
87. Because the standard error of b is smaller, deviations of a given size are larger when
measured in standard errors. Thus, tstar2 shows that the mean regression coefficient was
.8020985 estimated standard errors below 3.3 in samples of size 20, but the mean
regression coefficient is 1.115398 estimated standard errors below 3.3 in samples of size
40. The result is that we reject this false null hypothesis 18.7% of the time in samples of
size 40, compared to 11.2% of the time in samples of size 20.
88. Again, our probability of rejecting the false null hypothesis increases as the
hypothesized parameter gets further from the true parameter. We reject the null 18.7% of
the time when the null is that β1 = 3.3, 56.8% of the time when the null is that β1 = 3.6,
89.0% of the time when the null is that β1 = 3.9, and 98.9% of the time when the null is
that β1 = 4.2. Each of these probabilities is higher in samples of size 40 than in samples
of size 20.
. program define bivar40
  1.    version 6.0
  2.    if "`1'" == "?" {
  3.            global S_1 "a se_a b se_b"
  4.            exit
  5.    }
  6.    drop _all
  7.    set obs 40
  8.    gen x = int((_n-1)/4)
  9.    gen y = 10 + 3*x + (invnorm(uniform())*5)
 10.    regress y x
 11.    post `1' _b[_cons] _se[_cons] _b[x] _se[x]
 12.    drop y
 13. end
. clear
. set memory 50000
. di invttail(38,.025)
2.0243941
.

. simul bivar40, reps(100000)


.
.
.
.
.
.
.
.
.
.

gen
gen
gen
gen
gen
gen
gen
gen
gen
gen

tstar = (b-3)/se_b
rejectt = abs(tstar)>2.0243941
tstar2 = (b-3.3)/se_b
rejectt2 = abs(tstar2)>2.0243941
tstar3 = (b-3.6)/se_b
rejectt3 = abs(tstar3)>2.0243941
tstar4 = (b-3.9)/se_b
rejectt4 = abs(tstar4)>2.0243941
tstar5 = (b-4.2)/se_b
rejectt5 = abs(tstar5)>2.0243941

. summarize

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+--------------------------------------------------------
           a |  100000     10.0076    1.470638   2.375289   16.63236
        se_a |  100000    1.459819    .1678445    .761266   2.188804
           b |  100000    2.999041    .2752373   1.674333   4.239202
        se_b |  100000    .2734492    .0314402   .1425982   .4100005
       tstar |  100000   -.0033935    1.027468  -5.566206   5.542623
     rejectt |  100000      .04955    .2170144          0          1
      tstar2 |  100000   -1.115398    1.035621  -6.825844   4.046178
    rejectt2 |  100000      .18737    .3902102          0          1
      tstar3 |  100000   -2.227402     1.06014  -8.085484   2.549733
    rejectt3 |  100000      .56752    .4954226          0          1
      tstar4 |  100000   -3.339406    1.099932  -9.474916   1.337444
    rejectt4 |  100000      .89046    .3123171          0          1
      tstar5 |  100000    -4.45141    1.153417  -11.48481   .1470314
    rejectt5 |  100000      .98931    .1028389          0          1

Confidence Intervals
89. In the Monte Carlo for samples of size 20, I also generated 90%, 95%, and 99%
confidence intervals.
In each case, I begin by generating the lower confidence limit, which is 1.73, 2.10,
or 2.88 estimated standard errors below the sample regression coefficient. I then
generate the upper confidence limit, which is the same number of estimated
standard errors above the sample regression coefficient. I calculate the width of
the confidence interval as the difference between the upper and lower confidence
limits.
As a test of the confidence intervals, I generate dummy variables coded 1 if the
true parameter (3) is inside the confidence interval (between the upper and lower
confidence limits). I can do this only because I created the population and
therefore know what the population parameter is, something I would never
know in the real world.
If the confidence intervals work, 90% of the 90% confidence intervals should
include the population parameter, 95% of the 95% confidence intervals should
include 3, and 99% of the 99% confidence intervals should include 3. Naturally,
the intervals will need to get wider to have a higher probability of including the
population parameter.
. gen lolimt90 = b - 1.7340636*se_b
. gen uplimt90 = b + 1.7340636*se_b
. gen widtht90 = uplimt90 - lolimt90
. gen int90 = (lolimt90 < 3 & 3 < uplimt90)

. gen lolimt95 = b - 2.100922*se_b
. gen uplimt95 = b + 2.100922*se_b
. gen widtht95 = uplimt95 - lolimt95
. gen int95 = (lolimt95 < 3 & 3 < uplimt95)

. gen lolimt99 = b - 2.8784405*se_b
. gen uplimt99 = b + 2.8784405*se_b
. gen widtht99 = uplimt99 - lolimt99
. gen int99 = (lolimt99 < 3 & 3 < uplimt99)
. sum lolimt90 - int99

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+--------------------------------------------------------
    lolimt90 |  100000    2.335258    .4060002   .6590092   4.097146
    uplimt90 |  100000    3.666621    .4058906   1.992182   5.385108
    widtht90 |  100000    1.331363    .2235337   .5001287   2.337205
       int90 |  100000      .89915    .3011315          0          1
    lolimt95 |  100000    2.194427    .4131467    .505619   3.996149
    uplimt95 |  100000    3.807453    .4130163   2.092206   5.546538
    widtht95 |  100000    1.613026    .2708245   .6059356   2.831664
       int95 |  100000      .94901    .2199784          0          1
    lolimt99 |  100000    1.895949    .4321963   .1805244   3.782095
    uplimt99 |  100000    4.105931    .4320254   2.304197   5.888671
    widtht99 |  100000    2.209982    .3710524    .830183   3.879619
       int99 |  100000      .99023    .0983598          0          1

90. The 100,000 confidence intervals of each size vary widely. The mean lower
confidence limit of the 90% confidence interval, for instance, is 2.335, but the lower
confidence limit has a standard deviation of .406; it has values as low as .659 and as
high as 4.097. The upper confidence limit ranges from 1.99 to 5.38, with a mean of 3.67.
The average width of the 90% confidence interval is 1.33, but one 90% confidence
interval is only .50 wide and one is 2.34 wide.
91. Despite all that variation, 89.915% of these 100,000 90% confidence intervals
include 3, the population parameter. Similarly, 94.901% of the 95% confidence intervals
and 99.023% of the 99% confidence intervals include the population parameter. In
other words, confidence intervals work.
92. The higher the confidence level, the wider the confidence interval. The mean width
rises from 1.33 to 1.61 to 2.21 as the confidence level rises from 90% to 95% to 99%. We
pay for higher confidence that our interval includes the parameter by sacrificing
precision: we need to accept wider intervals to have greater confidence in their
accuracy.
Lecture 9. ESTIMATION AND INFERENCE IN MULTIPLE REGRESSION


1. Let's begin with a simple, properly specified population regression function with two
independent variables that meets all the classical assumptions:
Yi = β0 + β1 X1i + β2 X2i + ei
2. If we apply OLS to this equation, one way to picture the calculations is:
b0 = Ybar - b1 Xbar1 - b2 Xbar2
b1 = Σ (X1i - Xhat1i)(Yi - Ybar) / Σ (X1i - Xhat1i)²
b2 = Σ (X2i - Xhat2i)(Yi - Ybar) / Σ (X2i - Xhat2i)²
where Xhat1i is the expected value of X1i from a regression of X1 on X2
and Xhat2i is the expected value of X2i from a regression of X2 on X1
3. That is, we can think of the process as regressing each of the independent variables
on the other independent variable(s), and then regressing the dependent variable (Y) on
the residuals from those equations. The idea is what we saw in the Ballentines. With
three overlapping circles representing the variation in grade, male, and edyrs, the
regression coefficients are based on variation that grade shares with only one of the
independent variables. The variation that is jointly shared among all three variables is
discarded for purposes of calculating the regression coefficients (but not R2, F*, or most
other statistics), because it is not clear whether we should attribute that shared variation
to male or to edyrs. That is, when more-educated men have higher grades than less-educated women, it's not clear to what extent we should attribute that to sex and to what
extent to education, so we look at the relationship between grades and education within
each sex and the relationship between grades and sex at each level of education.
In Table 1, I regressed grade on male and edyrs. The coefficients indicate that
men tend to be 1.24 grades higher than equally educated women, and that more-educated employees tend to be .82 grade higher than people of their same sex
with one less year of education. The two variables jointly explain 39.8% of the
variation in grade. Estimates of the standard errors of the regression
coefficients are based on an SSR of 6828 and an MSE of 6.849.
Table 1. reg grade male edyrs

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  2,   997) =  329.92
       Model |  4519.21516     2  2259.60758           Prob > F      =  0.0000
    Residual |  6828.38084   997  6.84892762           R-squared     =  0.3983
-------------+------------------------------           Adj R-squared =  0.3970
       Total |   11347.596   999   11.358955           Root MSE      =   2.617

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   1.235889   .1739361     7.105   0.000     .8945662   1.577212
       edyrs |   .8193167   .0384311    21.319   0.000     .7439016   .8947318
       _cons |  -2.801079   .5380099    -5.206   0.000    -3.856841  -1.745317
------------------------------------------------------------------------------

In Table 2, I regressed grade on male and saved the residuals. The male
coefficient is substantially larger than in Table 1 (the mean grade of men was
2.37 higher than the mean grade of women) and the estimated standard error was
larger (.200 rather than .174). Because male alone could only explain 12.4% of
the variation in grade, edyrs must have been responsible for the other 27.4%
explained in Table 1 (a unique explained sum of squares of 3113). The residuals
represent the variation in grade that cannot be explained by male: they indicate
how many grades each person was above or below the mean grade of people of their sex.
Table 2. reg grade male

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  1,   998) =  141.18
       Model |   1406.3436     1   1406.3436           Prob > F      =  0.0000
    Residual |   9941.2524   998  9.96117475           R-squared     =  0.1239
-------------+------------------------------           Adj R-squared =  0.1231
       Total |   11347.596   999   11.358955           Root MSE      =  3.1561

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   2.372471   .1996689    11.882   0.000     1.980652    2.76429
       _cons |   8.387295   .1428714    58.705   0.000     8.106932   8.667658
------------------------------------------------------------------------------

. predict gmres, res

In Table 3, I regressed edyrs on male and saved the residuals. The male
coefficient showed that the mean educational level of men was 1.39 years higher
than the mean education of women. The residuals represent the variation in
education that cannot be explained by gender: they indicate how many years
each person was above or below the mean years of education of people of his/her
sex.
Table 3. reg edyrs male

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  1,   998) =  103.48
       Model |  480.825378     1  480.825378           Prob > F      =  0.0000
    Residual |  4637.21862   998  4.64651165           R-squared     =  0.0939
-------------+------------------------------           Adj R-squared =  0.0930
       Total |    5118.044   999  5.12316717           Root MSE      =  2.1556

------------------------------------------------------------------------------
       edyrs |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   1.387231   .1363699    10.173   0.000     1.119626   1.654836
       _cons |   13.65574   .0975784   139.946   0.000     13.46426   13.84722
------------------------------------------------------------------------------

. predict emres, res

In Table 4, I regressed the residuals from Table 2 on the residuals from Table 3:
the differences between an individual's grade and the mean grade for his/her sex
on the differences between an individual's schooling and the mean schooling for
his/her sex. Because we have removed the effects of sex on each variable, the
regression coefficient shows how a person's grade (relative to the mean for
his/her sex) is related to that person's level of education (relative to the mean for
his/her sex). The coefficient on the residuals of edyrs on male is identical to the
coefficient on edyrs in Table 1.
Table 4. reg gmres emres

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  1,   998) =  454.96
       Model |  3112.87154     1  3112.87154           Prob > F      =  0.0000
    Residual |  6828.38084   998  6.84206497           R-squared     =  0.3131
-------------+------------------------------           Adj R-squared =  0.3124
       Total |  9941.25239   999  9.95120359           Root MSE      =  2.6157

------------------------------------------------------------------------------
       gmres |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       emres |   .8193167   .0384118    21.330   0.000     .7439395   .8946939
       _cons |   8.40e-09   .0827168     0.000   1.000    -.1623187   .1623188
------------------------------------------------------------------------------

The total variation in the dependent variable in Table 4 (9941) is the variation in
grade that is not attributable to sex (the SSE in Table 2). The unexplained
variation in Table 4 (SSE = 6828) is equal to that in Table 1 (the multiple
regression). The MSEs (the estimates of the variance of the error term) are also
equal. The estimated standard error of the regression coefficient (.038) is the
square root of MSE (called "Root MSE" in the output) divided by the square root
of the variation in the residuals from Table 3, that is, the variation in edyrs that
cannot be attributed to male. (Notice in the correlation matrix that the residuals
from edyrs on male have zero correlation with male.) The estimated standard error
of the regression coefficient is also identical to that on edyrs in Table 1: the
estimated standard error of the regression coefficient in multiple regression is the
square root of MSE (based on all the independent variables) divided by the square
root of the unique variation in X, the variation that it does not share with any of
the other independent variables (the variation in the residuals from regressing X
on all the other independent variables).
Tables 5 through 7 repeat the same story, regressing grade and male on edyrs,
saving the residuals, then regressing the residuals on the residuals. Again, using
only the variation that each of these variables does not share with edyrs yields
the same regression coefficient, standard error, and MSE as the multiple
regression. Thus, as shown in the Ballentine, OLS focuses on the unique
variation in the independent variables.

Table 5. reg grade edyrs

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  1,   998) =  580.57
       Model |  4173.43327     1  4173.43327           Prob > F      =  0.0000
    Residual |  7174.16273   998  7.18853981           R-squared     =  0.3678
-------------+------------------------------           Adj R-squared =  0.3671
       Total |   11347.596   999   11.358955           Root MSE      =  2.6811

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   .9030145   .0374773    24.095   0.000     .8294712   .9765579
       _cons |  -3.370706   .5450339    -6.184   0.000     -4.44025  -2.301163
------------------------------------------------------------------------------

. predict geres, res
Table 6. reg male edyrs

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  1,   998) =  103.48
       Model |   23.473246     1   23.473246           Prob > F      =  0.0000
    Residual |  226.382754   998  .226836427           R-squared     =  0.0939
-------------+------------------------------           Adj R-squared =  0.0930
       Total |     249.856   999  .250106106           Root MSE      =  .47627

------------------------------------------------------------------------------
        male |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   .0677227   .0066574    10.173   0.000     .0546586   .0807868
       _cons |   -.460905   .0968188    -4.760   0.000    -.6508967  -.2709133
------------------------------------------------------------------------------

. predict meres, res
Table 7. reg geres meres

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  1,   998) =   50.54
       Model |  345.781897     1  345.781897           Prob > F      =  0.0000
    Residual |  6828.38094   998  6.84206507           R-squared     =  0.0482
-------------+------------------------------           Adj R-squared =  0.0472
       Total |  7174.16283   999  7.18134418           Root MSE      =  2.6157

------------------------------------------------------------------------------
       geres |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meres |   1.235889   .1738489     7.109   0.000     .8947377    1.57704
       _cons |   9.06e-09   .0827168     0.000   1.000    -.1623187   .1623188
------------------------------------------------------------------------------

4. The true and estimated variances of the regression coefficients are:

σ²b1 = σ²/[Σ(x1i - x̄1)²(1 - r²)] = σ²/Σ(x1i - x̂1i)²
σ²b2 = σ²/[Σ(x2i - x̄2)²(1 - r²)] = σ²/Σ(x2i - x̂2i)²
σ̂²b1 = MSE/[Σ(x1i - x̄1)²(1 - r²)] = MSE/Σ(x1i - x̂1i)²
σ̂²b2 = MSE/[Σ(x2i - x̄2)²(1 - r²)] = MSE/Σ(x2i - x̂2i)²
where σ² is the population variance of the disturbance term, r² is the square of
the correlation coefficient between x1 and x2, and x̂1 is the expected value of x1
from regressing x1 on x2.
Note that this can be generalized to any number of independent variables. In that
case, r² is the coefficient of determination from an auxiliary regression, that
is, a regression of one independent variable on all the other independent
variables.
5. The MSE (mean squared error) is an unbiased estimator of the variance of the
disturbance term. The estimated variance of each regression coefficient is the MSE
divided by the unique variation in that independent variable, the variation that cannot
be explained by the other independent variable(s).
In Table 8, the sample variance of edyrs is 5.123, so the variation in edyrs is
5.123(999) or 5118. Since the correlation between edyrs and male is .307, male
can explain .307² (or .094) of the variation in edyrs, and the other 90.6% of the
variation is unexplained: .906(5.123)(999) = 4636. Notice that the variation in
the unstandardized residual from edyrs on male is 4637, different only due to
rounding error.
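A one-line arithmetic check of these figures, using the rounded values quoted above (the
results should come out to roughly .094 and 4636):
. display .307^2
. display (1 - .307^2)*5.123*999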
We can also calculate the unexplained variation in the independent variables
directly. Meres and emres represent deviations in male and edyrs around
their expected values (from regressing them on the other independent variable).
The summarize command produces the means (0) and standard deviations. I
calculated the variances by squaring the standard deviations and the variation by
multiplying the variances times 999 (N-1).
. sum meres emres

    Variable |     Obs        Mean    Std. Dev.    Variance    Variation
-------------+-----------------------------------------------------------
       meres |    1000    1.37e-09     .476035     0.226609    226.38271
       emres |    1000   -9.60e-09    2.154498     4.641862    4637.2198
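The same variance and variation can also be pulled from Stata's saved results instead of
being computed by hand; a minimal sketch (summarize leaves r(sd) and r(N) behind for the
last variable listed):
. quietly summarize emres
. display "variance  = " r(sd)^2
. display "variation = " r(sd)^2*(r(N)-1)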
6. Will the estimated standard error of a regression coefficient change as additional
variables are entered into the model? The answer depends on both the additional
variation in Y uniquely explained by the new independent variable (which will lower the
MSE, thus lowering the estimated standard error) and the correlation between the new
independent variable and the original independent variable (the stronger the
correlation, the less unexplained variation will remain in the original independent
variable and the higher the estimated standard error will be).
Again, this can be generalized to any number of independent variables. In that
case, what matters is how the new independent variable affects both the MSE and
the coefficient of determination for the auxiliary regression.
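A convenient way to see the auxiliary-regression side of this in Stata is the variance
inflation factor, which equals 1/(1 - R²) from each auxiliary regression. This is a sketch
rather than part of the original handout; estat vif is a standard post-estimation command
after regress:
. quietly regress sal edyrs male
. estat vif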
INFERENTIAL STATISTICS FOR REGRESSION ANALYSIS:
Hypothesis Tests and Confidence Intervals
1. Typically, a data set does not interest us for itself but for what it can tell us about the
population from which the data set is drawn. Inferential statistics allow us to
generalize from a simple random sample to the population from which the sample was
drawn. In regression analysis, we use hypothesis tests primarily to establish that
relationships exist in the population, by ruling out the possibility that the population
parameter is 0. We use confidence intervals to estimate the size of the population
parameter, that is, to estimate the strength of the relationship in the population.
2. Both rely on our knowledge of the sampling distribution for the sample regression
coefficient.
3. We have established that, under the classical assumptions, the OLS bivariate
regression estimator b1 is unbiased [E(b1) = β1] with a variance of σ²/Σ(xi - x̄)². We
have further established that b1 is the best linear, unbiased estimator of β1.
If the error term is normally distributed, b1 ~ N(β1, σ²/Σ(xi - x̄)²).
Using the central limit theorem, as the sample size increases, b1 is asymptotically
normally distributed.
4. We have taken on faith that, under the classical assumptions, the OLS multiple
regression estimator bk is unbiased [E(bk) = βk] with a variance of σ²/Σ(xki - x̂ki)²
and that bk is the best linear, unbiased estimator of βk.
If the error term is normally distributed, bk ~ N(βk, σ²/Σ(xki - x̂ki)²).
Using the central limit theorem, as the sample size increases, bk is asymptotically
normally distributed.
5. That is, the sample regression coefficients are sample statistics with sampling
distributions. In general, given the classical assumptions and either a large sample size
or a normally distributed error term, the sampling distribution for the sample
regression coefficient is (1) normal (or at least asymptotically normal), with (2) a mean
equal to the population parameter [E(bk) = βk] and (3) a standard error equal to the
square root of the statistic's variance: σbk = √[σ²/Σ(xki - x̂ki)²].
6. That means that (bk - βk)/σbk has a standard normal (z) distribution. That is, if we
subtract the population parameter from the sample regression coefficient, then divide
by the true standard error of the sampling distribution, this new standardized
regression coefficient is normally distributed with a mean of 0 and a variance of 1.
7. Further, we know how to calculate an unbiased estimator of the variance of the
sampling distribution by substituting the mean squared error (MSE) for the population
variance: σ̂bk = √[MSE/Σ(xki - x̂ki)²]. Following the logic we used with the
sample mean, we could show that (bk - βk)/σ̂bk has a t-distribution with N - k - 1
degrees of freedom (where k is the number of independent variables in the model). That
is, if we subtract the population parameter from the sample regression coefficient, then
divide by the estimated standard error of the sampling distribution, this new
standardized regression coefficient has a t-distribution, just as a similarly
standardized sample mean has a t-distribution. This means that inference for the
population regression parameter is a simple variation on inference for the population
mean.
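Stata's invttail() function returns these critical values directly. A small sketch showing
how the cutoff depends on the degrees of freedom (values are approximate):
. display invttail(8, .025)
. display invttail(998, .025)
The first is about 2.31; the second is essentially the normal value of 1.96.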
8. In hypothesis tests, we substitute a hypothesized value (typically, zero) for βk, then
determine whether the sample regression coefficient bk is so far from zero (measured in
estimated standard errors) that it is not plausible that βk = 0. If so, we can confidently
state whether βk is above or below zero, that is, we can confidently state whether the
relationship in the population is positive or negative.
9. In confidence intervals, we take advantage of the fact that the sample regression
coefficient has a known probability of falling within a set number of estimated standard
errors of the population parameter and that whenever the sample regression coefficient
is within that number of estimated standard errors of the population parameter, the
population parameter is also within that number of estimated standard errors of the
sample regression coefficient. With large samples, for instance, there is a 95%
probability that the sample regression coefficient will fall within 1.96 estimated standard
errors of the population parameter; therefore, we can be 95% confident that the
population parameter is within 1.96 estimated standard errors of the sample regression
coefficient.
The Sampling Distribution for the Sample Regression Coefficients
10. Assume that we have the following population regression function for a
multiple regression for the population of Agency X:
E(Y) = β1 + β2 X2 + β3 X3 + ... + βn Xn
where E(Y) is the expected value of Y and β1, β2, β3, etc. are population
regression coefficients.
11. With bivariate regression, the equation for the population regression function
simplifies to:
E(Y) = β1 + β2 X2
In particular, assume that the equation for Agency X is:
E(salary) = -2300 + 4000 grade
which we will estimate with the sample regression equation:
salary-hat = b1 + b2 grade
12. The values that the y-intercept (b1) and the regression coefficient (b2) will take on
depend on the particular sample we get.
Table 1 shows a sample regression equation for a sample of 15,000 employees of
Agency X. The expected salary = -1952 + 3997 grade, which is pretty similar to
the population regression line.
TABLE 1. reg sal grade

      Source |       SS       df       MS                  Number of obs =   15000
-------------+------------------------------               F(  1, 14998) =19710.14
       Model |  4.4723e+12     1  4.4723e+12               Prob > F      =  0.0000
    Residual |  3.4031e+12 14998   226904982               R-squared     =  0.5679
-------------+------------------------------               Adj R-squared =  0.5679
       Total |  7.8754e+12 14999   525064944               Root MSE      =   15063

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   3996.574   28.46709   140.393   0.000      3940.775    4052.373
       _cons |  -1952.741   258.8262    -7.545   0.000     -2460.072   -1445.411
------------------------------------------------------------------------------
Table 2, however, based on a random sample of only 10 employees from Agency
X, says that the expected salary = 3082 + 3297 grade, and Table 3, based on a
different random sample of 10 employees, shows that the expected salary = -17543
+ 4890 grade. These two regression equations are substantially different from
each other and from the population regression equation.
TABLE 2. reg sal grade

      Source |       SS       df       MS                  Number of obs =      10
-------------+------------------------------               F(  1,     8) =   19.29
       Model |  2.1626e+09     1  2.1626e+09               Prob > F      =  0.0023
    Residual |   896742665     8   112092833               R-squared     =  0.7069
-------------+------------------------------               Adj R-squared =  0.6702
       Total |  3.0594e+09     9   339927948               Root MSE      =   10587

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |     3297.4   750.7088     4.392   0.002      1566.262    5028.537
       _cons |   3081.728    7473.23     0.412   0.691     -14151.57    20315.03
------------------------------------------------------------------------------
TABLE 3. reg sal grade

      Source |       SS       df       MS                  Number of obs =      10
-------------+------------------------------               F(  1,     8) =   38.71
       Model |  5.8363e+09     1  5.8363e+09               Prob > F      =  0.0003
    Residual |  1.2061e+09     8   150758839               R-squared     =  0.8287
-------------+------------------------------               Adj R-squared =  0.8073
       Total |  7.0424e+09     9   782485374               Root MSE      =   12278

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   4889.729   785.8823     6.222   0.000      3077.482    6701.977
       _cons |  -17542.61   8554.934    -2.051   0.074     -37270.32    2185.107
------------------------------------------------------------------------------
12. The y-intercept (b1 ) and the regression coefficient (b2) both have their own
sampling distributions. A sampling distribution is a probability distribution for
a sample statistic. It allows us to make predictions about the likely value of that
statistic. We discuss sampling or probability distributions in terms of their shapes, their
means, and their standard errors. (The standard error is a fancy name for the
standard deviation of a sampling distribution.)
13. Sampling distributions are theoretical probability distributions that have been
described and proven to exist by statisticians and mathematicians. Although you will
often deal with full data for a sample, you will never see a true sampling distribution
except in a table, graph, or theoretical model.
14. It is possible to imagine using empirical data to test the predictions of the sampling
distribution, however. Imagine drawing several million random samples of the same
size from a single population and calculating several million sample regressions. If you
graphed the distribution of the sample regression coefficients in a histogram, your graph
would be bell-shaped, with its high point at the population regression coefficient (the
mean of the sampling distribution). About two-thirds of the distribution would fall
within one standard error of the population coefficient, about 95 percent would fall
within two standard errors, and about 99.7 percent would fall within three standard
errors.
15. According to the central limit theorem, if the original population regression
function meets certain assumptions, we know three things about the sampling
distribution for the sample regression coefficient:
(1) the sampling distribution is normal.
Normal distributions are symmetrical, single-peaked, bell-shaped density
curves. All normal distributions have the same overall shape: only their
means and standard deviations change. If you know the mean and
standard deviation of a normally distributed variable, you can describe
the distribution exactly. The mean determines where the center of the
distribution is and the standard deviation determines how spread out it is.
In ANY normal distribution, no matter what its mean or standard deviation, 95%
of the cases are within two (actually 1.96) standard deviations of the mean.
(2) the mean of the sampling distribution for the sample regression
coefficient is the population regression coefficient.
This means that b2 is an unbiased estimator of β2; that is, b2 is equally
likely to be larger or smaller than β2. (The normal distribution is
symmetrical around its mean.)
However, P(b2 = β2) = 0, because the sampling distribution for b2 is
continuous. In other words, the sample regression coefficient does not
equal the population regression coefficient.
(3) the standard error of the sampling distribution for the sample
regression coefficient depends on both the strength of the relationship in the
population and the amount of variation in the independent variable.
The standard error is a fancy name for the standard deviation of a
sampling distribution. How close or far the sample coefficient is from the
population coefficient depends on the standard error of the sampling
distribution for b2.
The greater the variation in the independent variable, due to either a
larger sample size or greater dispersion of the values of the independent
variable, the smaller the standard error of the sampling distribution will
be. In general, the larger the sample size, the smaller the standard error
will be and the closer the sample coefficient will tend to be to the
population coefficient.
16. Table 4 and Figure 1 show an approximation to the sampling distributions for the
sample y-intercept and regression coefficient for random samples of size 10 from Agency
X. I ran 10,000 separate regressions on 10,000 samples of size 10. In each sample, there
was one person at each grade from grade 4 to grade 13.
The mean y-intercept for the 10,000 samples is -2219, about $80 above the true
y-intercept of -2300. The mean regression coefficient for the 10,000 samples was
$3998, about $2 below the true regression coefficient of $4000.
Histograms show that both sample statistics were roughly normally distributed in
these 10,000 samples, and the sktest reveals no evidence to the contrary.
Figure 1. Histogram of Sample Regression Coefficients
[Histogram: fraction of samples (vertical axis, 0 to .0622) by sample regression
coefficient b (horizontal axis, 0 to 8000).]
The standard deviation of the regression coefficient (1661) indicates how much
the sample regression coefficient varied from sample to sample. That is, this is
an estimate of the standard error of the sampling distribution for the sample
regression coefficient. The true standard error is the square root of [the variance
of the regression error term (225,000,000=15,0002 ) divided by the variation in
grade (82.5)], which is 1651.4456. That is, this estimate is very similar to the
true standard error. In these small samples, the sample regression coefficient
varied between -1970 and 10407, with about 95% falling between 2336 and 5660.
The sktest cannot reject normality for either a or b.
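The output below comes from Stata's older simul command and the author's agencyx
program, which is not shown. As a purely illustrative sketch, the same experiment could be
written in current Stata with simulate and a short do-file program that draws one sample
from the population described above (salary = -2300 + 4000 grade plus a normal error with
standard deviation 15,000, one employee at each grade from 4 to 13):
program define onedraw, rclass
    drop _all
    set obs 10
    generate grade = _n + 3
    generate salary = -2300 + 4000*grade + rnormal(0, 15000)
    quietly regress salary grade
    return scalar a = _b[_cons]
    return scalar b = _b[grade]
end
simulate a=r(a) b=r(b), reps(10000) seed(12345): onedraw
summarize a b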
. simul agencyx, reps(10000)
. summarize

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+----------------------------------------------------------
           a |   10000   -2219.793    16417.03  -70596.82   55389.37
        se_a |   10000    15914.79    4037.404   3097.845   33231.89
           b |   10000    3998.426    1661.099  -1967.949   10406.69
        se_b |   10000    1603.551     406.803   312.1343   3348.397

. graph b, bin(50) normal
. sktest a b

                  Skewness/Kurtosis tests for Normality
                                                      ------- joint ------
    Variable | Pr(Skewness)  Pr(Kurtosis)  adj chi-sq(2)   Pr(chi-sq)
-------------+-------------------------------------------------------------
           a |        0.942         0.574           0.32       0.8512
           b |        0.668         0.377           0.97       0.6170
The Hypothesis Test for the Population Regression Coefficient
17. The hypothesis test is designed to establish whether the sample provides enough
evidence to conclude that the population regression coefficient differs from a particular
number. We are typically interested in establishing that a relationship between X and Y
exists in the population. We perform a hypothesis test hoping to convincingly
reject the null hypothesis that no relationship exists in the population. That is, we want
to reject the possibility that changes in X are not associated with changes in Y and that
the population regression coefficient is zero, that is, to conclude that β1 ≠ 0.
STEP ONE. State the research and null hypotheses. The null hypothesis is that β1
(beta-one, the population regression parameter) equals some particular number
(typically 0). The research or alternative hypothesis can take three general forms: the
population regression parameter is (a) greater than, (b) less than, or (c) not equal to that
number (0).
H0: β1 = 0
H1: β1 < 0  or  β1 > 0  or  β1 ≠ 0
STEP TWO. State that if the null hypothesis is true, the sampling distribution for
(b1 - 0)/σ̂b1 (which we will nickname t*) is a t-distribution with N - k - 1 degrees of
freedom.
STEP THREE. Decide how large a chance you are willing to take of concluding that a
true null hypothesis is false. Then find the decision rule that gives you exactly that
probability of rejecting a true null hypothesis. You choose from three possible decision
rules, depending on the form of the research hypothesis.
(a) If H1 is that the population regression parameter is greater than some number (β1 > 0),
then reject H0 if t* > tα and accept H0 if t* < tα.
(b) If H1 is that the population regression parameter is less than some number (β1 < 0),
then reject H0 if t* < -tα and accept H0 if t* > -tα.
(c) If H1 is that the population regression parameter differs from some number (β1 ≠ 0),
then reject H0 if t* < -tα/2 or if t* > tα/2 and accept H0 if -tα/2 < t* < +tα/2.
STEP FOUR. Calculate the test-statistic (t*).
t* = (b1 - 0)/σ̂b1
STEP FIVE. Compare the test-statistic to the decision rule. Decide whether you have
sufficient evidence to reject the null hypothesis. State your conclusion in substantive
terms.
Logic behind the Hypothesis Test
18. Imagine we want to know whether the mean salary of all male federal white-collar
employees was more than the mean salary of all female federal white-collar employees
in 1991. In a random sample of 3,484 employees, the mean salary of women was
$27,423 and the mean salary of men was $12,582 higher. The estimated standard error
for the difference of mean salaries was $443.22. Is this sufficient information to draw
the conclusion that the mean salary of all male federal white-collar employees was more
than the mean salary of all female federal white-collar employees in 1991?
TABLE 6. reg sal male

      Source |       SS       df       MS                  Number of obs =    3484
-------------+------------------------------               F(  1,  3482) =  805.85
       Model |  1.3783e+11     1  1.3783e+11               Prob > F      =  0.0000
    Residual |  5.9555e+11  3482   171037759               R-squared     =  0.1879
-------------+------------------------------               Adj R-squared =  0.1877
       Total |  7.3338e+11  3483   210560961               Root MSE      =   13078

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |    12581.9   443.2203    28.387   0.000       11712.9    13450.89
       _cons |   27422.93   316.4478    86.659   0.000      26802.48    28043.37
------------------------------------------------------------------------------
19. A hypothesis test tests the plausibility that a population parameter has a particular
value. In this case, we imagine three possibilities about the difference of mean salaries
between men and women in the population of federal employees: β1 > 0, β1 < 0, and β1 = 0.
The only definite value suggested for β1 is 0, because the other possibilities are just that
β1 is bigger or smaller than 0, not that β1 equals some particular number.
20. We therefore assume that the population regression parameter equals 0. To
assume does not mean to believe or to conclude. An assumption is a starting place. We
are saying, "Let's assume that β1 = 0 and see what that implies for the sampling
distribution of b1." That, in turn, leads to implications for the standardized sample
regression coefficient if β1 = 0. We decide how willing we are to make a certain type of
mistake (deciding that β1 differs from 0 when they are really equal). We combine the
willingness and the implications to come up with a decision rule that gives us a known
probability of saying that β1 doesn't equal 0 when it really does. Then we follow our rule
in drawing a conclusion about the population.
STEP ONE:
21. First, formulate the RESEARCH or ALTERNATIVE HYPOTHESIS (H1). It
states that the population regression parameter is (a) greater than, (b) less than, or (c)
not equal to some number. We almost always think H1 is true. Since we know that
men earn more than women, we hypothesize that β1 is greater than 0: H1: β1 > 0
22. Next, state the NULL HYPOTHESIS (H0), which is that the population
regression parameter equals the value that the alternative hypothesis claimed it was
different from. The null hypothesis always contains an equals sign when stated
symbolically, although sometimes the research hypothesis is such that the null
hypothesis implies "greater than or equal to" or "less than or equal to." In this case the
null hypothesis is that the population regression parameter equals 0: H0: β1 = 0
STEP TWO:
23. Given a properly specified model meeting the classical assumptions, plus either a
normally distributed error term or a sufficiently large sample, the sampling distribution
for b1 (the sample regression coefficient) is:
a) normal
b) E(b1) = β1 (the mean of the sampling distribution for b1 equals the population
regression parameter)
c) σb1 = σ/√Σ(xi - x̄)² (the standard error of the sampling distribution for b1
equals the population standard deviation divided by the square root of the
variation in X).
24. We also know (whether H0 is true or false) that when we standardize the normally
distributed statistic b1 by subtracting the mean of its sampling distribution and dividing
by the true standard error [(b1 - β1)/σb1], the standardized sample regression coefficient
has a standard normal (z) distribution.
25. We also know (whether H0 is true or false) that when we standardize the normally
distributed statistic b1 by subtracting the mean of its sampling distribution and dividing
by its estimated standard error [(b1 - β1)/σ̂b1], the standardized sample regression
coefficient has a t-distribution with N - k - 1 degrees of freedom, where k = number
of independent variables.
26. If (and only if) the null hypothesis is true, the population regression
parameter is 0. If (and only if) the null hypothesis is true, we can standardize the
sample regression coefficient by subtracting the hypothesized population regression
parameter (0) and dividing by the estimated standard error [σ̂b1 = s/√Σ(xi - x̄)²].
We call this new statistic t* [t* = (b1 - 0)/σ̂b1]. t* is the standardized sample regression
coefficient if and only if the null hypothesis is true, and t* has a t-distribution with
N - k - 1 degrees of freedom if and only if the null hypothesis is true.
27. If H0 is false, the sampling distribution for b1 is still normal, and its estimated
standard error (σ̂b1) is still s/√Σ(xi - x̄)², but the mean of the sampling
distribution is not 0. If H0 is false, t* does not have a t-distribution because t*
is not the standardized value of b1. t* is the standardized value of b1 only if
β1 = 0. If H0 is false, we do not know what the sampling distribution of t* is.
28. Knowing the sampling distribution for t* if H0 is true allows us to predict what value
t* will have if H0 is true.
In this case, our sample is large enough that we can use the t-distribution with
infinite degrees of freedom. According to this t-distribution, we know, for
instance, that there is a 95 percent chance that t* will be less than 1.645, a 95
percent chance that t* will be greater than -1.645, and a 95 percent chance that t*
will be between -1.96 and +1.96 if H0 is true.
If H0 is false, we do not know the sampling distribution for t*; therefore, we do
not know how likely it is that t* will be in any of those intervals if H0 is false.
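These probabilities can be verified with Stata's ttail() function (the same function used
later in these notes), substituting a very large number of degrees of freedom for infinity;
a sketch:
. display ttail(100000, 1.645)
. display 1 - 2*ttail(100000, 1.96)
The first display should show about .05 and the second about .95.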
STEP THREE:
29. Naturally, the sample regression coefficient will not exactly equal the hypothesized
population regression parameter (0) even if the null hypothesis is true; so t* will not
exactly equal 0 (which it would if the sample regression coefficient were 0.0000). The
sampling distribution will tell us, however, how likely it would be to get a t* as far from
zero as the t* we actually got (and a sample regression coefficient as far from 0 as the
sample regression coefficient we actually got) if the null hypothesis were true.
30. If we got a t* that would be very unlikely to get from a t-distribution, then we
could conclude that the sampling distribution for t* was not a t-distribution. That
would imply that the population regression parameter did not equal the hypothesized
population regression parameter (0), and we would reject the null hypothesis as false.
31. If we got a t* that would be plausible to get from a t-distribution, then we could
conclude that the sampling distribution for t* could be a t-distribution, so the
population regression parameter could equal the hypothesized population regression
parameter. We would therefore tentatively accept that the null hypothesis could be
true.
32. Any time we make inferences from samples to populations, we take some chance of
making an error. If we reject the null hypothesis and H0 is actually true, we make a
Type I error. If we tentatively accept the null hypothesis and H0 is actually false, we
make a Type II error. In most cases, decreasing the probability of making a Type I
error increases the probability of making a Type II error, and vice versa.
33. Since we are assuming that the null hypothesis is true, we control the probability of
making a Type I error (rejecting a true null hypothesis). The significance level (α,
alpha) is our statement of how willing we are to reject a true null hypothesis. The
most common significance levels are .05 and .01. Choosing a smaller alpha decreases
our chance of making a Type I error, but increases our chance of a Type II error.
34. We choose a decision rule that allows us to reject a true null hypothesis with a
fixed probability (alpha). That is, in repeated tests with different samples from a
population in which the null hypothesis is true, we will nonetheless reject the null
hypothesis a fixed proportion of the time (alpha).
35. If the null hypothesis is true, the sampling distribution for t* is a t-distribution
with N - k - 1 degrees of freedom. Therefore, we can divide that t-distribution into two
regions (the acceptance region and the rejection region). We create the rejection
region so that it has exactly an alpha probability of including t* if H0 is true. We then
reject the null hypothesis whenever t* falls in the rejection region. There is a (1 - α)
probability that t* will fall inside the acceptance region if H0 is true. If we accept the
null hypothesis whenever t* falls into the acceptance region and reject H0 whenever t*
falls into the rejection region, we have a (1 - α) probability of making the right decision
and an α probability of making a Type I error if H0 is true.
36. If H0 is false, t* does not have a t-distribution and we do not know the probability
that t* will fall into the acceptance or rejection region.
37. The value of t that separates the acceptance region from the rejection region is
called the critical value. We choose the critical value(s) of t based on the form of the
alternative hypothesis.
38. If H1 is only that the population regression parameter is different from (not equal
to) 0, then we will reject the null hypothesis if b1 is much higher or much lower than 0.
We therefore need two rejection regions. To take an α (e.g., 5%) chance of rejecting a
true null hypothesis, we reject H0 if the probability of getting a t* at least as high as the
one we got is only α/2 (2.5%) or if the probability of getting a t* at least as low as the
one we got is only α/2 (2.5%) if H0 is true (that is, if t* really has a t-distribution). The
critical value that puts α/2 in the upper tail is called tα/2 and the critical value that puts
α/2 in the lower tail is called -tα/2.
39. Our decision rule, then, is to reject H0 if t* > tα/2 or if t* < -tα/2 and to accept H0 if
-tα/2 < t* < tα/2. (With α = .05, we reject H0 if t* < -1.96 or if t* > +1.96 and accept H0 if
-1.96 < t* < +1.96.) This is called a two-tailed test because you can reject the null
hypothesis if t* is too far into either tail of the assumed sampling distribution.
40. If we are confident before we look at the sample data that β1 is greater than 0, then
we can be confident that b1 will also be larger than 0 and that t* will be positive. There is
no reason to waste half of our alpha in the negative side of the t-distribution. By putting
all of the rejection region into the upper tail of the t-distribution, we lower the critical
value from tα/2 to tα (when α = .05, this is the difference between 1.96 and 1.645).
41. This does not change the probability of rejecting H0 if H0 is true (that probability is
still α), but it does make it easier to reject H0 if H0 is false. If β1 > 0, then we expect t* to
be a positive number. Every t* that is greater than 1.96 (the critical value in the two-tailed
test) is also greater than 1.645 (the critical value in the one-tailed test), but the opposite
is not true. Therefore, more positive values of t* (but no negative values of t*) allow us
to reject H0 in a one-tailed test. (For instance, with a .05 significance level, we can reject
H0 with a t* of 1.85 in a one-tailed test but not in a two-tailed test.)
42. If we are confident (before looking at the data) that β1 < 0, then we should be
confident that b1 < 0 and that t* < 0. Therefore, we put the entire rejection region into the
lower tail of the t-distribution and reject H0 if t* < -tα (rather than -tα/2). This one-tailed
test gives us the same α probability of rejecting a true H0 as a two-tailed test would, but
it gives us a better chance of rejecting a false H0.
43. The form of the research hypothesis should always be chosen BEFORE looking at
the sample data, to avoid the temptation to choose a one-tailed test based on whether the
sample regression coefficient is larger or smaller than the hypothesized population
regression parameter. Remember, even if the null hypothesis is true, there is virtually
no chance that the sample statistic will be equal to the population parameter. Instead,
there is a 50% chance b1 will be larger and a 50% chance b1 will be smaller than β1. If
you wait until you see whether b1 is larger or smaller than β1 before deciding what tail to
put your alpha into, you will never put your α in the wrong tail. In reality, you are
putting α into both tails and doubling your likelihood of making a Type I error--that is,
alpha is twice as large as you think it is. (If you put your .05 into the upper or lower tail
after looking at the sample regression coefficient, you will only accept H0 if -1.645 < t*
< 1.645 (with infinite degrees of freedom), which really gives you an α of .10.)
44. Thus, STEP THREE is to choose a significance level and find the appropriate
decision rule. The smaller the alpha we choose, the less chance we have of rejecting the
null hypothesis (whether H0 is true or false). There are three possible decision rules,
depending on the form of the research hypothesis.
If H1 is that β1 > 0 (that is, that the population regression parameter is greater than
some number), reject H0 if t* > tα; accept H0 if t* < tα.
If H1 is that β1 < 0, reject H0 if t* < -tα; accept H0 if t* > -tα.
If H1 is that β1 ≠ 0, reject H0 if t* < -tα/2 or if t* > +tα/2; accept H0 if -tα/2 < t* < tα/2.
45. In this example, because we are virtually certain that men earn more than women
before looking at the data, we choose a .01 significance level and a one-tailed test:
Reject H0 if t* > 2.326
Accept H0 if t* < 2.326
STEP FOUR:
46. Calculate the test-statistic (t*). In this case, we take the sample regression
coefficient (12581.9) minus the hypothesized population regression parameter (0),
divided by the estimated standard error (443.22). t* = 12581.9/443.22 = 28.387. In
other words, the sample regression coefficient is more than 28 estimated standard
errors above 0.
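A sketch of the same computation in Stata; note that invttail uses the actual 3482 degrees
of freedom, so its critical value (about 2.33) is slightly larger than the 2.326 taken from
the infinite-degrees-of-freedom table in the decision rule:
. display 12581.9/443.2203
. display invttail(3482, .01)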
STEP FIVE:
47. Compare the test-statistic to the critical value. Decide whether you have sufficient
evidence to reject the null hypothesis. If you reject H0, it is because you have gotten a
result that was very unlikely if H0 were true; therefore, you are confident when you
reject the null hypothesis. If you accept H0, it is because t* is somewhere in the
acceptance region and the null hypothesis is plausible. You do not have convincing
evidence that H0 is true; you simply lack convincing evidence that it is false. When you
accept H0, you do so tentatively and unconfidently. Several authors recommend that
you do not "accept" H0, you merely "fail to reject" it. That is, you stick with your initial
assumption rather than reaching a real conclusion that H0 is true. Be sure to state
your conclusion in substantive terms.
48. In this case, t* = 28.387 and tα = 2.326. Since t* > tα, we reject H0 and conclude that
β1 > 0, that is, that the mean salary for all male federal white-collar employees was
greater than the mean salary for all female federal white-collar employees in 1991. We
are quite confident of this conclusion, because the probability of getting a value further
from 0 than 28 from a t-distribution with infinite degrees of freedom is less than .0005.
Therefore, it's quite unlikely that t* came from a t-distribution with infinite degrees of
freedom or that β1 = 0 (because if H0 were true and β1 did equal 0, then t* would have
come from a t-distribution with infinite degrees of freedom).
Confidence Intervals for the Population Regression Parameter
49. To determine the strength of the relationship in the population, we calculate a
confidence interval for the population regression coefficient (β1).
50. Generally, a sample statistic is our best point estimate of a population parameter.
That is, if we need to take our one best guess at the value of a population parameter
(e.g., the population regression coefficient), it is typically a sample statistic (e.g., the
sample regression coefficient). There are three desirable characteristics that a statistic
should have to be a good estimator for a population parameter:
a) The estimator should be UNBIASED. It should be no more likely to be larger
than the population parameter than it is to be smaller. If we took 100,000
random samples from the same population, the mean of the 100,000 sample
statistics should be equal to the population parameter. [The mean of the
sampling distribution should equal the population parameter.]
b) The estimator should be EFFICIENT. That is, it should be close to the
population parameter. If we took several random samples from the same
population, the statistic should not vary too much from one sample to another.
[The standard error of its sampling distribution should be small.]
c) The estimator should be CONSISTENT. The larger the sample taken from a
given population, the closer the sample statistic should generally be to the
population parameter. For instance, the sample mean for 500 people is usually
closer to the population mean than is the sample mean for 50 people. [As the
sample size increases, the standard error of the sampling distribution approaches
zero.]
51. The sample regression coefficient is an unbiased, efficient, and consistent
estimator of the population regression coefficient.
b1 is an unbiased estimator of β1 because the sampling distribution for b1 is
normal with a mean of β1. That means that (1) in repeated simple random
samples from the same population, the mean of the sample regression
coefficients would equal the population regression coefficient, and (2) there is a
50-50 chance that any particular sample regression coefficient would be larger
(or smaller) than the population regression coefficient.
b1 is an efficient estimator of β1 because there is no other estimator of β1 that
has a smaller standard error. (The OLS estimator is the best linear unbiased
estimator of β1.)
b1 is a consistent estimator of β1 because it is unbiased even in small samples
and its variance shrinks to zero as the sample size rises to infinity (its variance is
the variance of the error term (a constant) divided by the total variation in X, which
rises to infinity as the sample size rises to infinity). As the sample size increases,
the standard error of the sampling distribution shrinks, and the sample
regression coefficient tends to fall within a tighter interval around the population
parameter. As the sample size goes to infinity, the standard error goes to zero.
52. Therefore, the sample regression coefficient is our one best guess at the value of the
population regression coefficient. b1 is our best point estimate of β1.
53. The problem is that P(b1 = β1) = 0 -- the sample statistic never equals the
population parameter -- so our point estimate will never be right. An interval
estimate contains a range of possible values within which the population regression
coefficient is likely to fall, with a known probability.
54. The confidence level is the probability that β1 actually falls within the confidence
interval.
We typically calculate 95 percent confidence intervals, though any size confidence
level is possible. This means that we are typically 95 percent confident that the
population regression coefficient is inside the confidence interval.
If we wanted to be more confident that the population regression coefficient was
actually inside the confidence interval (that is, if we wanted to raise the
confidence level), we would need a wider interval.
On the other hand, if we were willing to accept more than a five percent chance
that the population regression coefficient was outside the confidence interval
(lower the confidence level), we could use a narrower interval.
There is a tradeoff between the precision of our estimate and our
probability of being correct.
55. The 95% confidence interval itself stretches from the lower confidence limit
(about two estimated standard errors below the point estimate) to the upper
confidence limit (about two estimated standard errors above the point estimate).
56. The general logic here goes back to the sampling distribution of the sample
regression coefficient. Because the sample regression coefficient has a normal
probability distribution, there is a 95 percent probability that the sample regression
coefficient will fall within two standard errors of the mean of its sampling distribution.
Since the mean of the sampling distribution is the population regression coefficient,
there is a 95 percent probability that the sample regression coefficient will be within two
standard errors of the population regression coefficient.
57. Whenever the sample regression coefficient is within two standard errors of the
population regression coefficient, the population regression coefficient is also within
two standard errors of the sample regression coefficient.
Whenever I am within 20 feet of you, you are within 20 feet of me. (Remember
that two standard errors can be interpreted as a measure of distance.)
58. If there is a 95 percent chance that the sample regression coefficient will be within
two standard errors of the population regression coefficient, there is also a 95 percent
chance that the population regression coefficient will be within two standard errors of
the sample regression coefficient.
59. Thus, we can construct a confidence interval that begins two estimated standard
errors below our sample regression coefficient and ends two estimated standard errors
above our sample regression coefficient. If our sample regression coefficient was
actually within two estimated standard errors of the population regression coefficient
(and there was a 95 percent chance that this was true), then the population regression
coefficient is also inside this confidence interval.
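A sketch of that arithmetic for the male coefficient in Table 6, where 443.2203 is the
estimated standard error and invttail(3482, .025) supplies the "about two" multiplier:
. display 12581.9 - invttail(3482, .025)*443.2203
. display 12581.9 + invttail(3482, .025)*443.2203
The results, roughly 11713 and 13451, match the confidence limits Stata prints in Table 6.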
Interpreting Hypothesis Tests and Confidence Intervals
60. The point of the hypothesis test is typically to establish that X and Y are related in
the population. To show this, we attempt to show that it is highly unlikely that β1 (the
population regression coefficient) could equal zero. (If β1 = 0, then a change in X has
no impact on the expected value of Y.)
Note: beta, the population regression coefficient, is completely different from the
beta-weight, which is the standardized coefficient. You will never see the
population regression coefficient printed in your regression output.
61. Because b1 is a normally distributed statistic with a mean of β1, we can be 95%
confident that the sample regression coefficient (b1) will fall within about two standard
errors of the population regression coefficient (β1).
62. If β1 = 0, then we can be 95% confident that the sample regression coefficient (b1)
will fall within two standard errors of 0.
If the sample regression coefficient is more than two estimated standard errors
above 0, we will confidently conclude that the population regression coefficient
(and therefore the relationship in the population) is positive, that is, that β1 > 0.
If the sample regression coefficient is more than two estimated standard errors
below 0, we confidently conclude that X and Y are negatively related in the
population.
If the sample regression coefficient is within two estimated standard errors
of 0, we are unable to confidently conclude that the population regression
coefficient is either positive or negative. We will say that we tentatively accept
the null hypothesis that the population regression coefficient is 0.
What this means is that we're saying it is plausible that X and Y are not
related in the population, but it is also plausible that X and Y are either
positively or negatively related in the population. In other words, if we
accept the null hypothesis, we learn virtually nothing from the hypothesis
test.
63. Stata indicates how many estimated standard errors the sample regression
coefficient is from 0 in the t column. It calculates this number by dividing the sample
regression coefficient (in the Coef. column) by the estimated standard error (in the
Std. Err. column).
64. In the P>|t| column, Stata prints the probability of getting a sample regression
coefficient as far from zero as the sample regression coefficient we actually got if the
null hypothesis is true. This provides a short-cut for the hypothesis test:
(a) If we are doing a two-tailed hypothesis test and P>|t| < alpha, reject H0. We
have gotten a rare enough occurrence if the null is true that we can confidently
reject the null.
(b) If we are doing a one-tailed hypothesis test and t* is in the correct tail and
P>|t|/2 < alpha, reject H0.
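The P>|t| value is simply twice the area beyond |t*| in the upper tail of the t-distribution.
A sketch using the male coefficient from Table 6 (the result is so small that the output
shows it as 0.000):
. display 2*ttail(3482, 28.387)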
Interpretation Examples:
In Table 5, the sample regression coefficient shows that as years of education
rises by one, expected salary rises by $3715 in the sample.
The sample regression coefficient ($3715) divided by the estimated standard
error of $94.83 yields a t* of 39.18. In other words, the sample regression
coefficient is 39 estimated standard errors above 0. This is an incredibly unlikely
occurrence if the population regression coefficient is 0. In a normal sampling
distribution, 95 percent of sample coefficients fall within two standard errors of
the mean (here, the population regression coefficient, which under the null
hypothesis is zero), 99.7 percent fall within three standard errors of the mean,
and almost none are more than four standard errors from the mean.
The P>|t| column shows that there is less than 5 chances in 10,000 of getting a
sample regression coefficient more than 39 estimated standard errors from 0 if
the population parameter is 0. (If there were more than 5 chances in
10,000, Stata would round up to .001.)
In fact, if we run . display ttail(3482,39.179), Stata tells us that the probability
of getting a sample regression coefficient at least 39.179 estimated standard errors
from the population parameter is 8.679e-279, that is, a decimal point followed by
278 zeros and then the digits 8679: approximately one chance in a bezillion.
We therefore conclude that salary and education are positively related in
the population.
The confidence interval tells us more. We are 95 percent confident that, in the
population, expected salary rose somewhere between $3529 and $3901 with each
additional year of education.
TABLE 5. reg sal edyrs

      Source |       SS       df       MS                  Number of obs =    3484
-------------+------------------------------               F(  1,  3482) = 1534.97
       Model |  2.2438e+11     1  2.2438e+11               Prob > F      =  0.0000
    Residual |  5.0900e+11  3482   146180591               R-squared     =  0.3060
-------------+------------------------------               Adj R-squared =  0.3058
       Total |  7.3338e+11  3483   210560961               Root MSE      =   12091

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   3715.173   94.82633    39.179   0.000      3529.253    3901.094
       _cons |  -19469.25   1375.916   -14.150   0.000     -22166.94   -16771.57
------------------------------------------------------------------------------
In Table 6, the sample regression coefficient shows that the mean salary of men
is $12,582 higher than the mean salary of women in the sample. The t* of
28.39 means that the sample regression coefficient of $12,582 is 28.39 estimated
standard errors above 0, again an extremely unlikely event if there were no
relationship in the population (that is, if the population regression coefficient
were 0). [12,581.9/443.22 = 28.39]
We therefore conclude that the mean salary of men is higher than the
mean salary of women in the population.
Indeed, we are 95 percent confident that the mean salary of men is between
$11,713 and $13,451 higher than the mean salary of women in the population of
federal white-collar workers in 1991.
TABLE 6. reg sal male

      Source |       SS       df       MS                  Number of obs =    3484
-------------+------------------------------               F(  1,  3482) =  805.85
       Model |  1.3783e+11     1  1.3783e+11               Prob > F      =  0.0000
    Residual |  5.9555e+11  3482   171037759               R-squared     =  0.1879
-------------+------------------------------               Adj R-squared =  0.1877
       Total |  7.3338e+11  3483   210560961               Root MSE      =   13078

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |    12581.9   443.2203    28.387   0.000       11712.9    13450.89
       _cons |   27422.93   316.4478    86.659   0.000      26802.48    28043.37
------------------------------------------------------------------------------
In Table 7, the sample regression coefficients show that, in the sample, men, on
average, earn $7983 more than women with the same amount of education and
that, as years of education rises by one, expected salary rises by $3076, holding
sex constant. The t*s of 19.22 and 32.00 indicate that both sample regression
coefficients are many estimated standard errors above 0.
We therefore conclude that men make more than comparably educated
women in the population and that more educated workers earn more
than less educated workers of the same sex in the population.
We are also 95 percent confident that men, on average, make between $7169 and
$8798 more than comparably educated women in the population and that
expected salary rises between $2888 and $3264 per year of education among
workers of the same sex in the population.
TABLE 7. reg sal male edyrs

      Source |       SS       df       MS                  Number of obs =    3484
-------------+------------------------------               F(  2,  3481) = 1033.48
       Model |  2.7323e+11     2  1.3662e+11               Prob > F      =  0.0000
    Residual |  4.6015e+11  3481   132189887               R-squared     =  0.3726
-------------+------------------------------               Adj R-squared =  0.3722
       Total |  7.3338e+11  3483   210560961               Root MSE      =   11497

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   7983.294   415.2969    19.223   0.000      7169.044    8797.544
       edyrs |   3075.956   96.11011    32.004   0.000      2887.518    3264.394
       _cons |  -14367.21   1335.065   -10.761   0.000      -16984.8   -11749.62
------------------------------------------------------------------------------
In Table 8, whites are the reference group and the sample regression coefficients
show that each group earns substantially less (by specific amounts) than whites,
on average, in the sample. For instance, the mean salary of Asians is $3244
less than the mean salary of whites in the sample.
All of the sample regression coefficients are more than two estimated standard
errors below zero (see the t-statistic column). We confidently conclude that the
mean salary of each group is lower than the mean salary of whites in the
population.
All the P>|t| values are less than .05, so we confidently reject the null that the
population mean salary for any group equals the population mean salary for
whites. The Asian-white difference is small enough that we could get it about
2.3% of the time even if there were no difference in the population. For all other
groups, the differences are clearly significant at the .001 level.
We are also 95% confident, for instance, that the mean salary of Asians is
between $6034 and $454 less than the mean salary of whites in the
population.
TABLE 8. reg sal asian black hispanic amerind

      Source |       SS       df       MS                  Number of obs =    3484
-------------+------------------------------               F(  4,  3479) =   44.11
       Model |  3.5396e+10     4  8.8491e+09               Prob > F      =  0.0000
    Residual |  6.9799e+11  3479   200628771               R-squared     =  0.0483
-------------+------------------------------               Adj R-squared =  0.0472
       Total |  7.3338e+11  3483   210560961               Root MSE      =   14164

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       asian |   -3244.08   1423.083    -2.280   0.023     -6034.242   -453.9175
       black |  -7777.392   636.0846   -12.227   0.000     -9024.529   -6530.255
    hispanic |  -5220.197   1250.381    -4.175   0.000     -7671.752   -2768.643
     amerind |   -11109.1   2314.523    -4.800   0.000     -15647.06   -6571.143
       _cons |   35624.42   278.0532   128.121   0.000      35079.26    36169.58
------------------------------------------------------------------------------
In Table 9, Hispanics are now the reference group. The sample regression
coefficients show that, in the sample, Asians and whites earned more than
Hispanics while blacks and American Indians earned less than Hispanics, but only
two of the sample regression coefficients are more than two estimated standard
errors away from 0.
Likewise, the P>|t| values are far below .05 for the amerind and white
coefficients, but they are greater than .05 for the asian and black coefficients.
We therefore conclude that the mean salary of whites is higher than the
mean salary of Hispanics in the population; and that the mean salary of
American Indians is lower than the mean salary of Hispanics in the
population.
Because the sample regression coefficients on asian and black are both
within two estimated standard errors of 0, we cannot confidently state
whether Asians or blacks earn more or less than Hispanics in the
population.
We are 95 percent confident that, in the population, the mean salary of Asians is
between $1657 less and $5609 more than the mean salary of Hispanics, that the
mean salary of blacks is between $5197 less and $83 more than the mean salary
of Hispanics, that the mean salary of American Indians is between $10,989 and
$789 less than the mean salary of Hispanics, and that the mean salary of whites is
between $2769 and $7672 more than the mean salary of Hispanics.
TABLE 9. reg sal asian black amerind white

      Source |       SS       df       MS                  Number of obs =    3484
-------------+------------------------------               F(  4,  3479) =   44.11
       Model |  3.5396e+10     4  8.8491e+09               Prob > F      =  0.0000
    Residual |  6.9799e+11  3479   200628771               R-squared     =  0.0483
-------------+------------------------------               Adj R-squared =  0.0472
       Total |  7.3338e+11  3483   210560961               Root MSE      =   14164

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       asian |   1976.118   1853.103     1.066   0.286     -1657.162    5609.397
       black |  -2557.194   1346.636    -1.899   0.058     -5197.471    83.08236
     amerind |  -5888.906   2601.124    -2.264   0.024     -10988.79   -789.0241
       white |   5220.197   1250.381     4.175   0.000      2768.643    7671.752
       _cons |   30404.22   1219.073    24.940   0.000      28014.05    32794.39
------------------------------------------------------------------------------
In Table 10, the y-intercept represents the expected salary (in the sample) of a
newborn female nonminority with no education and no federal experience. The
sample regression coefficients show that, holding the other variables constant,
expected salaries rise with education and federal service, but drop slightly with
age, in the sample. Men earned $6034 more than comparable women (those of
the same race or ethnicity with the same levels of education, federal service, and
age) in the sample, and Asians, blacks, Hispanics, and American Indians all
earned less than comparable whites in the sample.
Most of the regression coefficients are statistically significant; that is,
they are more than two estimated standard errors away from 0, so that we
can be confident of the sign of the population regression coefficient.
Thus, we are very confident that expected salary rises with education and
federal service, holding the other variables constant, in the population.
We are also confident that men earn more than comparable women and
that Asians, blacks, and American Indians earn less than comparable
whites in the population.
However, because the t*s for age and hispanic are so close to zero, we
cannot say whether, holding the other variables constant, expected salary
rises or falls with age in the population, or whether Hispanics make
more or less than comparable whites in the population.
TABLE 10. reg sal edyrs yos age male asian black hispanic amerind

      Source |       SS       df       MS                  Number of obs =    3484
-------------+------------------------------               F(  8,  3475) =  559.60
       Model |  4.1289e+11     8  5.1611e+10               Prob > F      =  0.0000
    Residual |  3.2050e+11  3475  92229060.6               R-squared     =  0.5630
-------------+------------------------------               Adj R-squared =  0.5620
       Total |  7.3338e+11  3483   210560961               Root MSE      =  9603.6

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |    3232.53   80.98997    39.913   0.000      3073.737    3391.323
         yos |   730.5837   23.41642    31.200   0.000      684.6723     776.495
         age |  -10.80357    19.4365    -0.556   0.578      -48.9068    27.30455
        male |   6034.097   355.1608    16.990   0.000      5337.752    6730.442
       asian |  -2086.068   967.2924    -2.157   0.031     -3982.586   -189.5492
       black |  -2615.829    445.444    -5.872   0.000     -3489.188   -1742.471
    hispanic |  -992.3598   851.6376    -1.165   0.244      -2662.12    677.4007
     amerind |  -6272.744   1572.172    -3.990   0.000     -9355.218    -3190.27
       _cons |  -24435.23   1358.659   -17.985   0.000     -27099.08   -21771.38
------------------------------------------------------------------------------
65. Notice that our conclusions make explicit that these are multiple regression
coefficients. That is, we conclude that men make more than women of the same race or
ethnicity with the same levels of education, federal service, and age in the population,
not that men make more than women in the population.
66. Remember that the point of hypothesis tests is to draw conclusions about the
population, not the sample. Therefore, your conclusions should always include the
words "in the population" or other words that make clear that your conclusion is about
the population.
67. Notice too that the hypothesis test does not allow us to draw any conclusions about
the size of the population regression coefficients (e.g., how big the salary advantage of
men is). We only draw conclusions about the sign of the coefficient in the
population (e.g., we conclude that men make more than comparable women in the
population).
68. t* is a ratio of the sample regression coefficient to the estimated standard error.
Anything that increases the absolute value of the sample regression coefficient or
shrinks the estimated standard error will cause the absolute value of t* to get larger.
The stronger the relationship in the population, the higher the sample regression
coefficient will tend to be. With strong relationships in the population, t* tends
to be high and we tend to conclude that they are statistically significant, that
is, that the relationships exist in the population.
The larger the sample size, the smaller the standard error will tend to be. This
will tend to enlarge t* and make it more likely that you will conclude that a
relationship exists in the population if a relationship really does exist in the
population. [If the two variables are not related in the population, increasing the
sample size will have no impact on the expected size of t*.]
Table 11 presents exactly the same model as Table 10, but tested on a
sample of size 349 rather than 3484. The coefficients of determination are
very similar for the two tables (R² = .563 for Table 10 and .567 for Table
11). The sample regression coefficients are all of the same order of
magnitude in both tables (except for that on hispanic).
However, all the standard errors are substantially larger in Table 11 than in
Table 10; all the coefficients that are statistically significant in Table 10
have t-statistics with substantially smaller absolute values in Table 11; and
the asian and amerind coefficients are no longer statistically significant.
Notice also that the 95% confidence intervals are much wider.
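A rough arithmetic check of the sample-size effect: standard errors tend to shrink with the square root of the sample size, so moving from 349 cases to 3484 should cut them by a factor of about three (a minimal sketch in do-file form; the square-root rule is only approximate because the two samples also differ by chance):

display sqrt(3484/349)          // = 3.16, the predicted ratio of standard errors
display 73.92502/23.41642       // = 3.16, the actual ratio of the yos standard errors (Table 11 vs. Table 10)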
TABLE 11. reg sal edyrs yos age male asian black hispanic amerind

      Source |       SS         df       MS               Number of obs =     349
-------------+------------------------------              F(  8,   340) =   55.65
       Model |  4.1937e+10      8  5.2422e+09              Prob > F      =  0.0000
    Residual |  3.2028e+10    340  94198725.0              R-squared     =  0.5670
-------------+------------------------------              Adj R-squared =  0.5568
       Total |  7.3965e+10    348   212543002              Root MSE      =  9705.6

------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   2974.979   239.7324    12.410   0.000     2503.434    3446.525
         yos |   733.7829   73.92502     9.926   0.000     588.3749    879.1909
         age |   5.573991   59.68211     0.093   0.926    -111.8187    122.9666
        male |   5684.518   110.1123     5.116   0.000     3499.173    7869.863
       asian |  -2272.854   2669.492    -0.851   0.395    -7523.653    2977.946
       black |  -3032.695   1445.627    -2.098   0.037    -5876.194   -189.1968
    hispanic |  -5497.107       2787    -1.972   0.049    -10979.04   -15.17374
     amerind |  -5677.169    4919.08    -1.154   0.249    -15352.83    3998.494
       _cons |  -21021.62   4160.298    -5.053   0.000    -29204.78   -12838.46
------------------------------------------------------------------------------

Remember that tests of statistical significance depend on both the strength of
the relationship and on the sample size. If your sample is large, you are likely to
find that at least one of your relationships is statistically significant, even if the
relationships are very weak. With very large samples, pay more attention to the
strength of relationships than you do to their statistical significance.
69. The results of confidence intervals and hypothesis tests are consistent. Whenever a
95% confidence interval includes the hypothesized population parameter, a two-tailed
hypothesis test with a significance level of .05 will result in accepting the null. Whenever
a 95% confidence interval does not include the hypothesized population parameter,
you will be able to reject the null hypothesis at the .05 significance level in a two-tailed
test. [Please note that the following paragraphs were pasted in from
somewhere else, so that table numbers and statistics do not match, but the
points are still valid.]
In Table 6, we are 95 percent confident that the expected salary of men is
between $10,969 and $12,727 higher than expected salary of women of the same
race or ethnicity in the population. Because both the upper and lower
confidence limits are above 0, we are very confident that men earn more than
women of the same race or ethnicity in the population. t* on male is 26.425;
that is, the sample regression coefficient is more than 26 estimated standard
errors above 0, so the hypothesis test also tells us that the expected salary of men
is higher than expected salary of women of the same race or ethnicity in the
population.
Also in Table 3, we are 95 percent confident that the expected salary of Asians is
between $1233 less and $5400 more than expected salary of Hispanics (the
reference group) of the same sex in the population. In this case, the lower
confidence limits is negative (-$1233), while the upper confidence limit is positive
($5400), so the confidence interval includes 0. Thus, we are unsure whether, in
the population, Asians earn more or less than Hispanics of the same sex.
Likewise, t* is only 1.232, far from statistically significant. The hypothesis test
also does not allow us to conclude confidently whether Asians earn more or less
than Hispanics of the same sex in the population.


Look down the P>|t| column. Whenever P>|t| is greater than .05, the lower
confidence limit is negative and the upper confidence limit is positive. Whenever
the P>|t| is less than .05, both confidence limits have the same sign.
In all four tables, the P>|t| values for the male coefficient are less than
.05, and both the upper and lower confidence limits are positive.
In Table 4, where the reference group is whites, the P>|t| values for the
asian, black, and indian regression coefficients are all less than .05, and
the upper and lower confidence limits for the regression coefficients are all
negative.
In Table 3, on the other hand, where the reference group is Hispanics, the
P>|t| values for the asian and black regression coefficients are greater
than .05. The lower confidence limits are negative, and the upper
confidence limits are positive.
70. In most circumstances, confidence intervals tell us more than hypothesis tests.
Typically, a hypothesis test can only tell us that the parameter is positive or
negative (and then only if we reject H0).
If the entire confidence interval is positive or negative (which it must be if we are
able to reject the null hypothesis in a two-tailed test), a confidence interval will
also tell us that.
Because a 95% confidence interval tells us two numbers that the parameter is
between (with 95% likelihood), it helps us know whether the parameter is a lot or
a little different from zero.
The width of the confidence interval tells us how well we know the value of the
parameter.
If 0 is inside the confidence interval, the width tells us whether that happened
because the population regression coefficient really is close to 0 or because our
technique and data are weak.
71. Because a confidence interval stretches from about two estimated standard errors
below the sample regression coefficient up to about two estimated standard errors above
the sample regression coefficient, the width of the confidence interval depends on the
size of the estimated standard error. In general, the larger the sample size, the smaller
the standard error, and the narrower the confidence interval. Larger samples give
more precise estimates at the same confidence level, but at a cost of increased
effort, time, and money.


Table 3A repeats the model of Table 3, but with a sample size of 500 rather than
3484. All of the standard errors are substantially larger, and all of the confidence
intervals are substantially wider.
In Table 3, the male regression coefficient is $11,847 and its standard error is
$448. We are 95 percent confident that the expected salary of men is between
$10,989 and $12,727 higher than expected salary of women of the same ethnicity
in the population.
In Table 3A, however, the male regression coefficient is very similar ($11,617),
but its standard error ($1190) is more than twice as large. We are 95 percent
confident that the expected salary of men is between $9,279 and $13,955 higher
than expected salary of women of the same ethnicity in the population. This
confidence interval is more than twice as wide ($4,676, compared to $1,740 in
Table 3).
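As an arithmetic check of the "about two estimated standard errors" rule in point 71, the male row of Table 10 can be reproduced almost exactly (a minimal sketch in do-file form; the exact limits use the t critical value for 3475 degrees of freedom rather than 1.96):

display 6034.097 - 1.96*355.1608    // = 5338.0, versus the reported lower limit of 5337.752
display 6034.097 + 1.96*355.1608    // = 6730.2, versus the reported upper limit of 6730.442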
Hypothesis Testing II: The F-Test
72. The population regression line for a multiple regression has the following form:
E(Y) = β1 + β2X2 + β3X3 + ... + βkXk
where E(Y) is the expected value of Y and β1, β2, β3, etc. are the population regression
coefficients.
73. We sometimes want to know whether the dependent variable is related to any of the
independent variables in the population. To answer this question, we perform an F-test,
a test of the null hypothesis that all the population regression coefficients equal zero,
that is, that changing the value of any of the independent variables will not affect the
expected value of the dependent variable:
H0: β2 = β3 = β4 = . . . = βk = 0
The alternative hypothesis then is that at least one of the population regression
coefficients does not equal zero, that is, that the dependent variable is related to at least
one of the independent variables in the population.
74. To perform the F-test, we need to remind ourselves about the three types of
variation we discussed earlier.
The total variation in Y (SST = Σ(yi - ybar)²) is the sum of the squared deviations
between the observed values of Y and the mean value of Y. In Table 12, SST =
11347.596.
The explained variation in Y (SSR = Σ(yhat_i - ybar)²) is the sum of the squared
deviations between the expected values of Y (those predicted by the regression
equation) and the mean value of Y. In Table 12, SSR = 4173.43327.
The unexplained variation in Y (SSE = Σ(yi - yhat_i)²) is the sum of the squared
deviations between the observed values of Y and the expected values of Y. In
Table 12, SSE = 7174.16273.
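These three quantities always satisfy the decomposition SST = SSR + SSE, and the Table 12 figures do so exactly (a quick check in do-file form):

display 4173.43327 + 7174.16273     // = 11347.596, the SST reported in Table 12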
TABLE 12. reg grade edyrs

      Source |       SS       df       MS                 Number of obs =    1000
-------------+------------------------------              F(  1,   998) =  580.57
       Model |  4173.43327     1  4173.43327               Prob > F      =  0.0000
    Residual |  7174.16273   998  7.18853981               R-squared     =  0.3678
-------------+------------------------------              Adj R-squared =  0.3671
       Total |   11347.596   999   11.358955               Root MSE      =  2.6811

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   .9030145   .0374773    24.09   0.000     .8294712    .9765579
       _cons |  -3.370706   .5450339    -6.18   0.000     -4.44025   -2.301163
------------------------------------------------------------------------------

75. The unexplained variation is also the variation in the residuals (Σei²). The sum of
the squared residuals divided by its degrees of freedom (N-k-1) is an unbiased estimator
of the population variance of the error term. This is called the mean squared error or
MSE.
In Table 12, MSE = 7174.16273 / 998 = 7.18853981. If we take the square root
of the MSE, we get our best estimator of the standard deviation of the error term.
Stata calls this the Root MSE. (Other programs call it the standard error of the
estimate.) Root MSE = √7.18853981 = 2.6811.
76. The explained variation is also the variation in the expected values of the dependent
variable (Σ(yhat_i - ybar)²). If there is no relationship between X and Y in the population,
the explained variation is still expected to differ from 0 in the sample. In fact, the
explained variation is expected to be the estimated variance of the error term times the
number of independent variables. If the null hypothesis is true, SSR divided by its
degrees of freedom (k) is also an unbiased estimator of the population variance of the
error term. This is sometimes called the regression mean square or MSR.
In Table 12, MSR = 4173.43327 / 1 = 4173.43327.
77. If the null is true, SSR/σ² and SSE/σ² have chi-square distributions, with k and N-k-1
degrees of freedom, respectively, and the ratio of two independent chi-square statistics,
each divided by its degrees of freedom, is an F-statistic. So, if the null is true,
MSR / MSE has an F-distribution, with k numerator and N-k-1 denominator degrees of
freedom.

78. We can test the null hypothesis that all the Xs are unrelated to Y in the population
by testing the possibility that F* = MSR/MSE has an F-distribution, with k numerator
and N-k-1 denominator degrees of freedom.
In Table 12, F* = MSR/MSE = 4173.43327/7.18853981 = 580.57.
The line "F(  1,   998) =  580.57" shows that Stata recognizes that F* has an
F-distribution with 1 numerator and 998 denominator degrees of freedom if the
null is true.
The line "Prob > F = 0.0000" shows that the probability of getting an F-statistic
as high as 580.57 from an F-distribution with 1 numerator and 998 denominator
degrees of freedom is less than 5 chances in 100,000. Indeed, the
. display Ftail(1,998,580.57)
command yields an answer of 1.783e-101, indicating that there is less than 1
chance in a bezillion of getting an F-statistic this high from such a distribution.
One way to think of this is that the amount of variation accounted for by edyrs is
about 581 times as large as would be expected by chance alone.
Clearly, we reject the null hypothesis and conclude that grade is related to edyrs
in the population.
79. In a bivariate regression, the null hypothesis that all the population regression
coefficients equal zero is identical to the null hypothesis that the single population
regression coefficient equals zero. Thus, in bivariate regression, the F-test and the t-test
are the same: (t*)² = F*, and P>|t| = Prob > F.
In Table 12, √F* = √580.57 = 24.09 = t*.
In Table 13, F* = 39.6459394/11.5632047 = 3.43, and the square root of F* is
1.85 = t*. In addition, Prob > F = 0.0671 and P>|t| = 0.067.
One way to think of this is that in bivariate regression, both the t-test and the F-test
are testing the same null hypothesis, that, in the population, Y is not related to the
only independent variable. It makes sense that the two tests would be consistent
with each other.
TABLE 13. reg grade age

      Source |       SS       df       MS                 Number of obs =     100
-------------+------------------------------              F(  1,    98) =    3.43
       Model |  39.6459394     1  39.6459394               Prob > F      =  0.0671
    Residual |  1133.19406    98  11.5632047               R-squared     =  0.0338
-------------+------------------------------              Adj R-squared =  0.0239
       Total |     1172.84    99  11.8468687               Root MSE      =  3.4005

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0599824   .0323939     1.85   0.067    -.0043023     .124267
       _cons |   6.776389   1.488661     4.55   0.000      3.82219    9.730589
------------------------------------------------------------------------------

80. In multiple regression, the F-test and the t-test have separate purposes.
The F-test tests the regression model as a whole to determine whether the
dependent variable is related to any of the independent variables. If you cannot
reject the null hypothesis that the dependent variable is not related to any of the
independent variables, there is no point in doing any t-tests. (In a multiple
regression with 20 independent variables and a significance level of .05, we
would expect the 20 t-tests to show one of the relationships to be statistically
significant even if none of the relationships existed in the population.
Remember, we will reject a true null hypothesis 1 time in 20 at the .05 level;
see the quick calculation at the end of this point.)
The t-test tests the null hypothesis that a particular regression coefficient equals
zero. If you can reject the null hypothesis in a t-test, you conclude that a
particular relationship is positive (or negative) in the population, after controlling
for all the other independent variables in the model.
Remember that the t-test can only establish that a relationship exists in the
population, and whether the relationship is positive or negative. The t-test
cannot determine how strong that relationship is.
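To put a number on the parenthetical warning above (a minimal sketch in do-file form; it treats the 20 t-tests as independent, which is an assumption made only for illustration):

display 20*.05          // = 1, the expected number of false rejections among 20 tests at the .05 level
display 1 - .95^20      // = .64, the chance of at least one false rejection if no relationships exist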
81. In Table 14, a sample of 200 personnel records allows us to reject the null that
grade and race are unrelated in the population, but only allows us to conclude that the
mean grades of blacks and whites differ in the population. We cannot confidently
conclude that the mean grade of any other group differs from the mean grade of whites.
TABLE 14. reg grade asian black hispanic indian

      Source |       SS       df       MS                 Number of obs =     200
-------------+------------------------------              F(  4,   195) =    3.92
       Model |  175.522846     4  43.8807116               Prob > F      =  0.0044
    Residual |  2182.69715   195  11.1933187               R-squared     =  0.0744
-------------+------------------------------              Adj R-squared =  0.0554
       Total |     2358.22   199  11.8503518               Root MSE      =  3.3456

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       asian |  -.7202797   1.522149    -0.47   0.637    -3.722269    2.281709
       black |  -2.378174    .610603    -3.89   0.000    -3.582408   -1.173941
    hispanic |   -1.22028      1.2155   -1.00   0.317    -3.617493    1.176934
      indian |  -1.053613   1.394212    -0.76   0.451    -3.803284    1.696058
       _cons |    9.72028   .2797766    34.74   0.000     9.168503    10.27206
------------------------------------------------------------------------------


82. The F-statistic has an F-distribution if the null hypothesis is true. Stata prints
a Prob > F for the F-statistic, which is the probability of getting an F-statistic as high as
the one we got if the null hypothesis were true. Whenever Prob > F is less than .05, we
confidently reject the null hypothesis and conclude that the dependent variable is
related to at least one of the independent variables in the population. The smaller the
value of Prob > F, the more confident we can be that at least one relationship exists in
the population.
In Table 5, F=1534.971 and Prob > F =.000. These values would be nearly
impossible to get if salary were not related to education in the population. We
therefore very confidently conclude that salary is related to education in the
population.
In Table 6, F = 806 and Prob > F = .000, so we confidently conclude that salary
is related to sex in the population.
In Table 7, F = 1033 and Prob > F = .000, so we confidently conclude that salary
is related to sex or education or both in the population.
Note that this is not a very specific conclusion. The F-test does not tell us
which of the independent variables salary is related to, much less whether
the relationship(s) is (are) positive or negative. We will need t-tests for
that.
In Table 8, F = 44 and Prob > F = .000, so we confidently conclude that salary is
related to race in the population, that is, that at least one population mean
salary differs from at least one other population mean salary. The F-test does not
tell us which groups differ.
In Table 9, all that has changed is the reference group. Therefore the F-statistic
and the Prob > F are identical to those in Table 8 and we draw the same
conclusion.
In this case, the purpose of the F-test is a little clearer. The F-statistic is
highly significant, even though only one of the t*s is clearly significant. Notice
that all of the t*s (except the one on white) are much smaller than the t*s in Table
8, but only because we have changed the reference group. The
relationship between race and salary is equally strong in Tables 8 and 9.
Situations can occur when none of the t*s is statistically significant (perhaps
due to a change of reference group), but the dependent variable is strongly
related to at least one of the independent variables in the population. An
F-test can usually catch this.


This holds true for Tables 1 and 2, where (39.179)² = 1534.971 and (28.387)² = 805.847,
but it does not apply in any of the multiple regressions.
83. Sample sizes affect F-statistics just as they affect t-statistics. The larger the sample
size, the higher t- and F-statistics will tend to be, holding constant the strength of the
relationship in the population. [If the variables are not related in the population,
increasing the sample size will not have any effect on rejecting the null hypothesis in
either the t-test or the F-test.]
Tables 10 and 11 test the same model on samples of size 3484 and 349,
respectively. The F-statistic is 560 in Table 10 and 56 in Table 11. Although in
both cases we easily reject the null hypothesis that none of the independent
variables is related to the dependent variable in the population, that would not
necessarily be the case with weaker relationships or even smaller sample sizes.
THE GENERAL F-TEST
84. The F-test for the statistical significance of the model as a whole is just one example
of the more general F-test. For instance, we can test whether adding an additional
variable will lead to a statistically significant improvement in our model (or, from the
opposite perspective, whether dropping one of the variables in the model will
significantly decrease its explanatory power) using either a t-test or an F-test.
85. The test statistic in this case will be the ratio of (the variation explained by the set of
independent variables divided by the number of independent variables in the set) to the
MSE for the full model, the average unexplained variation per degree of freedom. This is
frequently written:
F* = (change in explained variation / change in # of independent variables)
        / (unexplained variation in the full model / (N - k - 1))
   = [(SSR_FULL - SSR_REDUCED) / (df_FULL - df_REDUCED)] / (SSE_FULL / (N - k - 1))
where df_FULL and df_REDUCED are the model degrees of freedom (the number of
independent variables) in the full and reduced models, and N - k - 1 is the residual
degrees of freedom of the full model.
86. Let's say that I want to test whether, in the population of federal employees in 1991,
men had significantly higher grades than women with the same level of education, federal
experience, and age. One possibility is simply to run a model including all of those
variables and run a t-test on the coefficient on male. If the male coefficient is statistically
significant, I conclude that, in the population, men have higher expected grades than
women with the same levels of education, federal experience, and age.
In Table 15, the coefficient on male is 1.56, its estimated standard error is .086, and
its t-statistic is 18.01. Since the sample regression coefficient is 18 estimated
standard errors above 0, we confidently conclude that the coefficient on male is
positive in the population, that is, that men have higher expected grades than
women with the same level of education, federal experience, and age, in the
population.


Table 15. reg grade male edyrs yos yossq age agesq

      Source |       SS       df       MS                 Number of obs =    3511
-------------+------------------------------              F(  6,  3504) =  592.23
       Model |  20142.5137     6  3357.08562               Prob > F      =  0.0000
    Residual |  19862.4945  3504  5.66852013               R-squared     =  0.5035
-------------+------------------------------              Adj R-squared =  0.5026
       Total |  40005.0083  3510  11.3974383               Root MSE      =  2.3809

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   1.557188    .086478    18.01   0.000     1.387636    1.726741
       edyrs |   .7645529   .0198185    38.58   0.000      .725696    .8034098
         yos |   .2226296   .0177929    12.51   0.000     .1877441    .2575151
       yossq |   -.002671   .0005189    -5.15   0.000    -.0036883   -.0016536
         age |   .1095003   .0306457     3.57   0.000     .0494151    .1695856
       agesq |  -.0014168   .0003393    -4.18   0.000     -.002082   -.0007516
       _cons |  -6.931486   .6649265   -10.42   0.000    -8.235169   -5.627804
------------------------------------------------------------------------------

TABLE 15A. test male

 ( 1)  male = 0.0

       F(  1,  3504) =   324.24
            Prob > F =   0.0000

87. Another method to achieve the same result is to use an F-test on the male
coefficient. Stata allows an easy F-test after running a multiple regression, simply by
typing the word "test" and the name(s) of the variable(s) whose coefficient(s) we want to
test. In this case, the value of the F-statistic is 324.24, which is highly significant. We
conclude that, in the population, men and women have different expected grades, even
holding education, federal experience, and age constant. Because the coefficient on
male is positive in the sample, we can conclude that it is also positive in the population.
88. To see where this F-statistic came from, run a second regression with the same
variables but excluding male. The explained variation in the restricted model (Table
16) is 18304.5364, which is 1837.977 lower than the explained variation in the full model
(20,142.5137 in Table 15). According to the MSE of the full model, there is an expected
error variation of 5.66852013 associated with each degree of freedom. If we divide the
increase in explained variation (1837.977) by the expected variation associated with the
loss of one degree of freedom (5.66852013), we get 324.24 -- that is, we explained 324
times as much variation as would be expected by chance alone. With 1 numerator
degree of freedom and 3504 denominator degrees of freedom, this F-statistic is
significant at the .0000 level.
Note that the square root of the F-statistic is 18.01, the value of t* on the male
coefficient. As in bivariate regression, the F-statistic is t* squared.
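The same arithmetic can be pulled from Stata's stored results instead of copied by hand (a minimal sketch in do-file form; it assumes a version of Stata in which regress saves e(mss), e(df_m), and e(rmse), and it uses the full and restricted models of Tables 15 and 16):

quietly regress grade male edyrs yos yossq age agesq      // full model (Table 15)
scalar ssr_full = e(mss)                                   // explained variation, full model
scalar dfm_full = e(df_m)                                  // model degrees of freedom, full model
scalar mse_full = e(rmse)^2                                // MSE of the full model
quietly regress grade edyrs yos yossq age agesq            // restricted model (Table 16)
display ((ssr_full - e(mss))/(dfm_full - e(df_m)))/mse_full    // = 324.24, the F reported by "test male"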


Table 16. reg grade edyrs yos yossq age agesq

      Source |       SS       df       MS                 Number of obs =    3511
-------------+------------------------------              F(  5,  3505) =  591.30
       Model |  18304.5364     5  3660.90727               Prob > F      =  0.0000
    Residual |  21700.4719  3505  6.19129013               R-squared     =  0.4576
-------------+------------------------------              Adj R-squared =  0.4568
       Total |  40005.0083  3510  11.3974383               Root MSE      =  2.4882

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   .8866101    .019463    45.55   0.000     .8484501    .9247702
         yos |   .2179719   .0185933    11.72   0.000     .1815171    .2544267
       yossq |  -.0023805    .000542    -4.39   0.000    -.0034433   -.0013178
         age |   .1408453    .031976     4.40   0.000     .0781519    .2035387
       agesq |  -.0016958   .0003542    -4.79   0.000    -.0023903   -.0010014
       _cons |   -8.70404    .687254   -12.66   0.000     -10.0515   -7.356581
------------------------------------------------------------------------------

89. The general F-test is much more useful for testing whether a group of variables
should be added to (or dropped from) a model. For instance, assume that we want to
determine whether we should add race to the model from Table 15 but run on the 1994
rather than 1991 data. Since race is a nominal-level variable with five values, we'll
represent it with four dummy variables (asian, black, hispanic, and indian). Table
17 shows that only black is statistically significant at the .05 level in two-tailed t-tests.
Can we safely drop the other three race variables? In the same model, the age
coefficient is not significant at the .05 level, though the agesq coefficient is. Since they
have different signs, is it plausible that age is unrelated to grade in the population, once
we control for sex, education, experience, and race?
Table 17A tests the null that the coefficients on asian, hispanic, and indian are
all zero in the population. F* = 1.98. We would get an F* that high about 11.5%
of the time if Asians, Hispanics, and Indians had the same expected grades as
whites of the same sex, age, education, and experience in the population. We
cannot confidently reject the null hypothesis.
To calculate that yourself, subtract the SSR in Table 18 (the model without those
three variables) from the SSR in Table 17, divide by three, then divide by the MSE
from Table 17:
[(5735.75006 - 5702.06192)/ 3]/ 5.67426283 = 1.9790024 = F*

If the null is true, this F* has an F-distribution with 3 numerator and 989
denominator degrees of freedom. F* is consistent with this null hypothesis.
. di Ftail(3,989,1.9790024)
.11545493

On the other hand, there is virtually no chance of getting the result we got if
neither age nor agesq is related to grade in the population after controlling for
these other variables. F* = 9.92; you would get a number that high only 1 time in
10,000 from an F-distribution with 2 numerator and 989 degrees of freedom.
TABLE 17. reg grade male edyrs yos yossq age agesq asian black hispanic indian

      Source |       SS       df       MS                 Number of obs =    1000
-------------+------------------------------              F( 10,   989) =  101.08
       Model |  5735.75006    10  573.575006               Prob > F      =  0.0000
    Residual |  5611.84594   989  5.67426283               R-squared     =  0.5055
-------------+------------------------------              Adj R-squared =  0.5005
       Total |   11347.596   999   11.358955               Root MSE      =  2.3821

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   1.014296   .1604387     6.32   0.000     .6994565    1.329135
       edyrs |   .8049051   .0355829    22.62   0.000     .7350785    .8747318
         yos |   .1550753   .0347998     4.46   0.000     .0867853    .2233652
       yossq |  -.0008291   .0009535    -0.87   0.385    -.0027001     .001042
         age |   .1101846   .0566858     1.94   0.052    -.0010536    .2214228
       agesq |   -.001577   .0006092    -2.59   0.010    -.0027724   -.0003816
       asian |   .4366862    .442195     0.99   0.324    -.4310619    1.304434
       black |  -.9842708   .2062853    -4.77   0.000    -1.389078   -.5794636
    hispanic |  -.6466498   .3541598    -1.83   0.068    -1.341641    .0483411
      indian |    -.72856   .5871941    -1.24   0.215    -1.880849    .4237294
       _cons |  -5.959507   1.251647    -4.76   0.000    -8.415695   -3.503319
------------------------------------------------------------------------------

TABLE 17A. test asian hispanic indian

 ( 1)  asian = 0.0
 ( 2)  hispanic = 0.0
 ( 3)  indian = 0.0

       F(  3,   989) =     1.98
            Prob > F =   0.1155

TABLE 17B. test age agesq

 ( 1)  age = 0.0
 ( 2)  agesq = 0.0

       F(  2,   989) =     9.92
            Prob > F =   0.0001


TABLE 18. reg grade male edyrs yos yossq age agesq black

      Source |       SS       df       MS                 Number of obs =    1000
-------------+------------------------------              F(  7,   992) =  143.13
       Model |  5702.06192     7  814.580275               Prob > F      =  0.0000
    Residual |  5645.53408   992  5.69106258               R-squared     =  0.5025
-------------+------------------------------              Adj R-squared =  0.4990
       Total |   11347.596   999   11.358955               Root MSE      =  2.3856

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .9902678   .1601148     6.18   0.000     .6760652     1.30447
       edyrs |   .8114833   .0355227    22.84   0.000      .741775    .8811915
         yos |   .1537874   .0346798     4.43   0.000     .0857332    .2218416
       yossq |  -.0008427   .0009522    -0.88   0.376    -.0027113    .0010259
         age |   .1170318   .0566712     2.07   0.039     .0058226    .2282411
       agesq |  -.0016298   .0006094    -2.67   0.008    -.0028257   -.0004339
       black |  -.9410356   .2036949    -4.62   0.000    -1.340758   -.5413132
       _cons |  -6.250049   1.246704    -5.01   0.000    -8.696528     -3.80357
------------------------------------------------------------------------------


Lecture 10. SPECIFICATION ERROR


1. A quick review of the multiple regression model. The population regression equation
below follows all of the classical assumptions.
Yi = β0 + β1X1i + β2X2i + ... + βkXki + εi
A. We have properly specified the relationship. Only the Xs included have
important causal impacts on Y, and no other variable does. We have the correct
functional form of the relationship; e.g., E(Y) is linearly related to these Xs and
not to lnX or X² or 1/X.
B. Each X has unique variation; its value cannot be perfectly predicted by the other
Xs. There is no perfect multicollinearity.
C. The xi's are known constants (rather than variables) or are at least fixed in
repeated samples. Alternatively, we can weaken this assumption to the
independence of the Xs and the error term: Cov(x1i, εi) = Cov(x2i, εi) = Cov(xki, εi)
= E(xki εi) = 0.
D. E(εi) = E(εi | X) = 0. That is, the expected value or conditional mean of εi is
0 at every value (and every combination of values) of the independent variables.
Because xi is a constant, E(xi εi) = xi E(εi) = 0.
E. Var(εi) = Var(εi | X) = σ². The probability distribution of εi is the same at
every value of X: it has a mean of 0 and a variance of σ². This is called the
homoskedasticity assumption.
F. Cov(εi, εj) = 0 for all i ≠ j. The error or disturbance terms are uncorrelated
from observation to observation. This is the assumption of nonautocorrelation.
G. Sometimes we assume that the error term is normally distributed: εi ~ N(0, σ²).

2. The ordinary least squares (OLS) estimators come from the following model, and the
true and estimated variances of the regression coefficients are:
Yhat_i = b0 + b1X1i + b2X2i + ... + bkXki
σ²(bj) = σ² / [Σ(xji - xbarj)²(1 - R²j)] = σ² / Σ(xji - xhatji)²
σhat²(bj) = MSE / [Σ(xji - xbarj)²(1 - R²j)] = MSE / Σ(xji - xhatji)²
where R²j is the R² from regressing Xj on the other independent variables and xhatji is
the value of Xj predicted by that regression.
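The second form of each expression can be illustrated directly: the standard error Stata reports for b1 can be rebuilt from an auxiliary regression of X1 on the other Xs (a minimal sketch in do-file form, assuming variables named y1, x1, and x2 like those created in the Monte Carlo below and a Stata that saves e(rmse) and e(rss) after regress):

quietly regress y1 x1 x2            // the model of interest
scalar rootmse = e(rmse)            // estimate of the error standard deviation
quietly regress x1 x2               // auxiliary regression: unique variation in x1
display rootmse / sqrt(e(rss))      // reproduces the reported std. err. of b1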
3. These OLS estimators are the best linear unbiased estimators of the population
parameters. Indeed, if the error term is normally distributed, the OLS estimators
(which are also the maximum likelihood estimators) are the best of all unbiased
estimators (not just the linear ones) of the population parameters.
4. What happens when we violate the first assumption, a properly specified model? We
can potentially either (1) leave out an important independent variable (or include it in
the wrong form) or (2) include an irrelevant independent variable. The effects will be
different, with much more serious consequences in case (1).
5. If we incorrectly leave out an important independent variable, the OLS estimator of βj
(call it b*j) will be biased. That is, E(b*j) ≠ βj. The OLS estimator b*j will remain
biased even in infinitely large samples.
6. In the causal model lectures, we learned that when we left out an intervening or
antecedent variable, the indirect or spurious effect was merged into the total or direct
effect. If we think of the direct effect of xj as its direct effect in a properly specified
model, then b*j will include unwanted indirect or spurious effects and give us a biased
estimator for the true causal impact.
7. The estimated standard error of b*j could be larger or smaller than the estimated
standard error of bj, depending on both the additional variation in Y that could be
uniquely explained by adding the omitted variable to the model and the correlation
between that omitted variable and Xj.
8. Suppose, on the other hand, that we add an irrelevant variable (X3) to the correctly
specified population regression function from 1. above, so that we mistakenly test the
model:
Yi = β0 + β1X1i + β2X2i + β3X3i + ei
9. OLS is still BLUE; that is, OLS still gives the best linear unbiased estimator of the
population regression function. In this case, all the OLS estimators of the population
regression coefficients are unbiased, including b3, which is an unbiased estimator of
β3 (= 0).
10. On the other hand, since adding X3 won't meaningfully lower MSE (since X3 has no
causal impact on Y), adding X3 will almost certainly raise the standard errors (both true
and estimated) of the other coefficients, unless X3 is not correlated with either of the
other independent variables. If X3 is strongly correlated with either or both of the other
independent variables, their standard errors may increase substantially.
11. When the standard error of a regression coefficient is larger, (1) the coefficient will
vary more from sample to sample, (2) the confidence interval for the coefficient will be
wider, and (3) the expected value of t* will be smaller, making it more difficult to reject
the null hypothesis of no impact.


Monte Carlo
12. The following Monte Carlo demonstrates these points for a sample of 100,000. I
create the variable z as a way of ensuring that my Xs are correlated with each other. X1
through X4 are all created to be normally distributed variables with means of 500; X4
has a standard deviation of 100; X1 through X3 have variances of 100² + .8²·100², where
100² is the variance of z. X1 through X3 are all correlated at about r = .62 with z and
about r = .39 with each other (Table 1). X4 is created independently, so it has
correlations of essentially 0 with the other Xs.
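The construction can be checked against Table 1 (a quick sketch in do-file form; these are the theoretical values implied by the data-generating commands below, so the sample figures match them only approximately):

display sqrt(100^2 + .8^2*100^2)                      // = 128.1, the intended s.d. of x1-x3 (Table 1 shows about 128)
display .8*100^2 / (100*sqrt(100^2 + .8^2*100^2))     // = .62, the intended correlation of x1-x3 with z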
13. Y1 = 300 + .5 X1 + 1.5 X2 + εi, where εi ~ N(0, 200²)
14. In a properly specified OLS regression (Table 2), y1hat = 305 + .49 X1 + 1.50 X2.
The 95% confidence intervals for all three estimators include the population
parameters.
15. Tables 3 and 4 leave out one of the independent variables that have a causal impact
on Y1. Not only do the sample regression coefficients differ markedly from the
population parameters, the 95% confidence intervals do not come close to including
them.
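The size of the bias in Table 3 is just what the omitted-variable algebra predicts: the coefficient on X1 alone equals the full-model coefficient on X1 plus the full-model coefficient on X2 times the slope from regressing X2 on X1 (a quick check in do-file form, using the rounded correlation and standard deviations from Table 1, so the match is approximate):

display .4935053 + 1.496095*(.3897*128.0845/127.7231)   // = 1.078, the coefficient on x1 reported in Table 3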
16. Tables 5 and 6 include irrelevant variables, X3 and X4. The key difference between
these examples is that X3 is correlated with the other two independent variables, but
X4 is not. In both cases, b1 and b2 remain unbiased estimators of the population
parameters. Their values are barely changed from the properly specified model (Table
2), and their 95% confidence intervals easily include the true values of the population
parameters. In addition, b3 and b4 are unbiased estimators of their population
parameters. Neither coefficient is significantly different from 0, despite a sample of
100,000.
17. Notice that the estimated standard errors for b1 and b2 are larger in Table 5 than in
Table 6, because X1 and X2 are correlated with X3 but not with X4. Although the
differences appear trivial with 100,000 observations, the effects would be substantial
with 100 observations.
. clear
. set memory 10000
(10000k)
. set obs 100000
obs was 0, now 100000
. gen z = 100*invnorm(uniform())
. gen x1=500+ .8*z + 100*invnorm(uniform())
. gen x2=500+ .8*z + 100*invnorm(uniform())
. gen x3=500+ .8*z + 100*invnorm(uniform())
. gen x4=500+ 100*invnorm(uniform())


. gen y1 = 300 + .5*x1 + 1.5*x2 + 200*invnorm(uniform())


TABLE 1. corr, m
(obs=100000)

    Variable |      Mean    Std. Dev.        Min         Max
-------------+------------------------------------------------
           z |  .1594882     99.87905  -408.6707    453.0629
          x1 |  499.9751     127.7231  -22.60779    1106.381
          x2 |  499.9765     128.0845  -52.52368    1156.617
          x3 |  499.6491      128.393   -135.057    1103.369
          x4 |  499.9021     99.64905   31.59486    922.9247
          y1 |  1300.037     300.4487   59.76282    2696.168

             |      z       x1       x2       x3       x4       y1
-------------+------------------------------------------------------
           z |  1.0000
          x1 |  0.6222   1.0000
          x2 |  0.6256   0.3897   1.0000
          x3 |  0.6237   0.3900   0.3909   1.0000
          x4 |  0.0047   0.0066   0.0037   0.0059   1.0000
          y1 |  0.5283   0.4584   0.7196   0.3295   0.0054   1.0000
TABLE 2. reg y1 x1 x2

      Source |       SS         df       MS               Number of obs =  100000
-------------+------------------------------              F(  2, 99997) =62381.71
       Model |  5.0108e+09      2  2.5054e+09              Prob > F      =  0.0000
    Residual |  4.0161e+09  99997  40162.1184              R-squared     =  0.5551
-------------+------------------------------              Adj R-squared =  0.5551
       Total |  9.0269e+09  99999  90269.4482              Root MSE      =  200.40

------------------------------------------------------------------------------
          y1 |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .4935053   .0053878     91.60   0.000     .4829453    .5040653
          x2 |   1.496095   .0053726    278.47   0.000     1.485564    1.506625
       _cons |    305.284   3.038694    100.47   0.000     299.3282    311.2398
------------------------------------------------------------------------------
TABLE 3. reg y1 x1

      Source |       SS         df       MS               Number of obs =  100000
-------------+------------------------------              F(  1, 99998) =26595.55
       Model |  1.8964e+09      1  1.8964e+09              Prob > F      =  0.0000
    Residual |  7.1304e+09  99998  71305.8022              R-squared     =  0.2101
-------------+------------------------------              Adj R-squared =  0.2101
       Total |  9.0269e+09  99999  90269.4482              Root MSE      =  267.03

------------------------------------------------------------------------------
          y1 |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   1.078201   .0066114    163.08   0.000     1.065243    1.091159
       _cons |    760.963   3.411702    223.04   0.000     754.2761    767.6499
------------------------------------------------------------------------------


TABLE 4. reg y1 x2

      Source |       SS         df       MS               Number of obs =  100000
-------------+------------------------------              F(  1, 99998) =       .
       Model |  4.6738e+09      1  4.6738e+09              Prob > F      =  0.0000
    Residual |  4.3531e+09  99998  43531.3803              R-squared     =  0.5178
-------------+------------------------------              Adj R-squared =  0.5178
       Total |  9.0269e+09  99999  90269.4482              Root MSE      =  208.64

------------------------------------------------------------------------------
          y1 |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x2 |   1.687877   .0051512    327.67   0.000      1.67778    1.697973
       _cons |   456.1378   2.658638    171.57   0.000     450.9269    461.3487
------------------------------------------------------------------------------
TABLE 5. reg y1 x1 x2 x3

      Source |       SS         df       MS               Number of obs =  100000
-------------+------------------------------              F(  3, 99996) =41587.95
       Model |  5.0108e+09      3  1.6703e+09              Prob > F      =  0.0000
    Residual |  4.0161e+09  99996  40162.2212              R-squared     =  0.5551
-------------+------------------------------              Adj R-squared =  0.5551
       Total |  9.0269e+09  99999  90269.4482              Root MSE      =  200.41

------------------------------------------------------------------------------
          y1 |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |    .494863    .005613     88.16   0.000     .4838616    .5058645
          x2 |   1.497455   .0055994    267.43   0.000     1.486481     1.50843
          x3 |  -.0048194   .0055868     -0.86   0.388    -.0157694    .0061307
       _cons |   306.3328   3.272885     93.60   0.000     299.9179    312.7476
------------------------------------------------------------------------------
TABLE 6. reg y1 x1 x2 x4

      Source |       SS         df       MS               Number of obs =  100000
-------------+------------------------------              F(  3, 99996) =41587.85
       Model |  5.0108e+09      3  1.6703e+09              Prob > F      =  0.0000
    Residual |  4.0161e+09  99996  40162.2743              R-squared     =  0.5551
-------------+------------------------------              Adj R-squared =  0.5551
       Total |  9.0269e+09  99999  90269.4482              Root MSE      =  200.41

------------------------------------------------------------------------------
          y1 |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .4934817   .0053879     91.59   0.000     .4829215    .5040419
          x2 |    1.49609   .0053726    278.47   0.000     1.485559     1.50662
          x4 |   .0049748   .0063599      0.78   0.434    -.0074905      .01744
       _cons |   302.8114   4.384686     69.06   0.000     294.2175    311.4054
------------------------------------------------------------------------------


Dealing with Unobserved and Mis-Measured Explanatory Variables


Unobserved Variables
1. When an important independent variable is omitted from the model, OLS estimators
of all population parameters will be biased, to the extent that they are correlated with
the omitted variable (OV).
Wooldridge focuses on the effect of omitting ability from an earnings equation.
To the extent that this unmeasured ability is correlated with the variables in the
model, their coefficients, especially that of education, will be biased. If ability
and education are positively related, and if both have a positive impact on
earnings, then the coefficient on education will tend to overstate the true causal
impact of education.
2. One solution to the OV problem is to use a proxy in its place, e.g., IQ as a proxy for a
broader ability measure. Suppose:
y = β0 + β1X1 + β2X2 + β3X3* + ε
but we cannot measure X3*. Instead, we have X3, where X3* = δ0 + δ1X3 + u3. For X3 to be
a good proxy for X3*, we require (1) a fairly strong correlation between the two variables;
(2) no correlation between ε and any of the variables (including both X3* and X3);
and (3) no correlation between u3 and any of the measured variables (X1, X2, and X3).
The latter means that neither X1 nor X2 can have a partial correlation with X3* after
controlling for X3.
In general, these conditions/assumptions are not testable, as they all involve
unobserved variables and/or disturbances.
3. If X3 is a good proxy for X3*, neither its coefficient nor the y-intercept will be unbiased
(for β3 and β0, respectively), but b1 and b2 will be unbiased for β1 and β2.
4. In his example, he uses two different proxies for unmeasured ability (IQ and score on
the Knowledge of the World of Work test) both separately and together. As expected,
the coefficient on educ shrinks once these proxies for ability are included in the model.
Table 1
. use "E:\statDATA\wage2.dta", clear
. reg lwage educ exper tenure married south urban black

      Source |       SS       df       MS                 Number of obs =     935
-------------+------------------------------              F(  7,   927) =   44.75
       Model |  41.8377677     7  5.97682396               Prob > F      =  0.0000
    Residual |  123.818527   927  .133569069               R-squared     =  0.2526
-------------+------------------------------              Adj R-squared =  0.2469
       Total |  165.656294   934  .177362199               Root MSE      =  .36547

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0654307   .0062504    10.468   0.000     .0531642    .0776973
       exper |    .014043   .0031852     4.409   0.000      .007792     .020294
      tenure |   .0117473    .002453     4.789   0.000     .0069333    .0165613
     married |   .1994171   .0390502     5.107   0.000     .1227802    .2760541
       south |  -.0909036   .0262485    -3.463   0.001     -.142417   -.0393903
       urban |   .1839121   .0269583     6.822   0.000     .1310056    .2368185
       black |  -.1883499   .0376666    -5.000   0.000    -.2622717   -.1144282
       _cons |   5.395497    .113225    47.653   0.000      5.17329    5.617704
------------------------------------------------------------------------------

Table 2
. reg lwage educ exper tenure married south urban black iq

      Source |       SS       df       MS                 Number of obs =     935
-------------+------------------------------              F(  8,   926) =   41.27
       Model |  43.5360229     8  5.44200287               Prob > F      =  0.0000
    Residual |  122.120271   926  .131879343               R-squared     =  0.2628
-------------+------------------------------              Adj R-squared =  0.2564
       Total |  165.656294   934  .177362199               Root MSE      =  .36315

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0544106   .0069285     7.853   0.000     .0408133     .068008
       exper |   .0141458   .0031651     4.469   0.000     .0079342    .0203575
      tenure |   .0113951   .0024394     4.671   0.000     .0066077    .0161825
     married |   .1997644   .0388025     5.148   0.000     .1236134    .2759154
       south |  -.0801695   .0262529    -3.054   0.002    -.1316916   -.0286473
       urban |   .1819463   .0267929     6.791   0.000     .1293645    .2345281
       black |  -.1431253   .0394925    -3.624   0.000    -.2206304   -.0656202
          iq |   .0035591   .0009918     3.589   0.000     .0016127    .0055056
       _cons |   5.176439   .1280006    40.441   0.000     4.925234    5.427644
------------------------------------------------------------------------------

Table 3
. reg lwage educ exper tenure married south urban black kww

      Source |       SS       df       MS                 Number of obs =     935
-------------+------------------------------              F(  8,   926) =   40.39
       Model |  42.8510762     8  5.35638452               Prob > F      =  0.0000
    Residual |  122.805218   926  .132619026               R-squared     =  0.2587
-------------+------------------------------              Adj R-squared =  0.2523
       Total |  165.656294   934  .177362199               Root MSE      =  .36417

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0576277    .006838     8.428   0.000     .0442079    .0710475
       exper |   .0122284    .003241     3.773   0.000     .0058678     .018589
      tenure |    .011072   .0024564     4.507   0.000     .0062512    .0158927
     married |   .1894612   .0390774     4.848   0.000     .1127707    .2661517
       south |  -.0916006   .0261562    -3.502   0.000     -.142933   -.0402683
       urban |   .1755452   .0270323     6.494   0.000     .1224936    .2285969
       black |  -.1642666   .0385304    -4.263   0.000    -.2398837   -.0886495
         kww |   .0050275   .0018188     2.764   0.006     .0014581     .008597
       _cons |   5.358797   .1136002    47.172   0.000     5.135853    5.581741
------------------------------------------------------------------------------

Table 4
. reg lwage educ exper tenure married south urban black iq kww

      Source |       SS       df       MS                 Number of obs =     935
-------------+------------------------------              F(  9,   925) =   37.28
       Model |  44.0968017     9  4.89964463               Prob > F      =  0.0000
    Residual |  121.559493   925  .131415668               R-squared     =  0.2662
-------------+------------------------------              Adj R-squared =  0.2591
       Total |  165.656294   934  .177362199               Root MSE      =  .36251

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0498375    .007262     6.863   0.000     .0355856    .0640893
       exper |   .0127522   .0032308     3.947   0.000     .0064117    .0190927
      tenure |   .0109248   .0024457     4.467   0.000      .006125    .0157246
     married |   .1921449   .0389094     4.938   0.000     .1157839    .2685059
       south |  -.0820295   .0262222    -3.128   0.002    -.1334913   -.0305676
       urban |   .1758226   .0269095     6.534   0.000     .1230118    .2286334
       black |  -.1303995   .0399014    -3.268   0.001    -.2087073   -.0520917
          iq |   .0031183   .0010128     3.079   0.002     .0011306    .0051059
         kww |    .003826   .0018521     2.066   0.039     .0001911    .0074608
       _cons |   5.175643    .127776    40.506   0.000     4.924879    5.426408
------------------------------------------------------------------------------

5. If, however, the unmeasured ability measure is related to the other independent
variables, even after controlling for X3, then all OLS estimators will be biased even after
including the proxy.
If:      X3* = δ0 + δ1X1 + δ2X2 + δ3X3 + u
then:    y = (β0 + β3δ0) + (β1 + β3δ1)X1 + (β2 + β3δ2)X2 + β3δ3X3 + ε + β3u
and all the estimators are biased.


6. In a precursor to the panel-data, fixed-effects models he introduces later, Wooldridge
notes that a lagged dependent variable can serve as a proxy for all the time-invariant
unobserved effects associated with the unit of analysis.
Note that you need something approaching panel data for this solution to work;
this will be easier to achieve with governmental units than with individuals, for
instance. You can go back and get a prior crime rate for the cities in your sample,
but you can't get new information on respondents to the General Social Survey.
MEASUREMENT ERRORS
6. Under the assumptions of the classical normal linear regression (CNLR) model, we
have not only specified the model and the distribution of the error term correctly, we
have measured all variables perfectly.

7. Measurement error in the dependent variable has only minor effects. Assuming that
the measurement error has an expected value of zero, is normally distributed, and is not
correlated with the disturbance term in the population regression model, the
measurement error can simply be combined with the disturbance term as a source of
error in the model. The standard error of the estimate and the estimated standard error
of the regression coefficient become larger, but OLS estimates of both the regression
coefficients and standard errors remain unbiased. (If the measurement error has a
nonzero mean, this will affect the intercept term but not the regression coefficient.)
Table 1 presents a Monte Carlo simulation of the CNLR model: Y = 10 + 3X + ε, where
ε ~ N(0,100). With random samples of 30 (including 3 cases each where X has
each integer value from 0 to 9), 1000 sample regressions yield 1000 regression
coefficients whose mean value is 2.994. The standard error of that mean is .0197,
so we are 95% confident that the true mean of the sampling distribution for b is
between 2.955 and 3.033.
. program define bivariat
  1.        version 6.0
  2.        if "`1'" == "?" {
  3.                global S_1 "a se_a b se_b"
  4.                exit
  5.        }
  6.        drop _all
  7.        set obs 30
  8.        gen x = int((_n-1)/3)
  9.        gen y = 10 + 3*x +(invnorm(uniform())*10)
 10.        regress y x
 11.        post `1' _b[_cons] _se[_cons] _b[x] _se[x]
 12.        drop y
 13. end

. simul bivariat, reps(1000)
. gen var_a= se_a^2
. gen var_b= se_b^2
. ci

    Variable |   Obs        Mean    Std. Err.     [95% Conf. Interval]
-------------+---------------------------------------------------------
           a |  1000    10.01293    .1060017      9.804922    10.22095
        se_a |  1000    3.369747      .01437      3.341548    3.397946
           b |  1000    2.994029    .0196964      2.955378     3.03268
        se_b |  1000    .6312114    .0026917      .6259293    .6364935
       var_a |  1000    11.56148    .0980973      11.36898    11.75398
       var_b |  1000    .4056661     .003442      .3989117    .4124205

Table 2 presents a Monte Carlo where (as above) the properly measured variable
Ystar = 10 + 3X + ε, where ε ~ N(0,100), but we only observe the imperfectly
measured variable Y, which equals Ystar plus normally distributed measurement
error with a mean of 0 and a variance of 25. [Y = Ystar + u, where u ~ N(0,25).]
With random samples of 30, the 1000 sample regressions yield 1000 regression
coefficients whose mean value is 3.018. The standard error of that mean is .0231,
meaning that we are 95% confident the true mean of the sampling distribution is
between 2.973 and 3.063. Thus, the regression coefficient is still unbiased, but
note that the average standard error of the regression coefficient has increased
from .631 to .708.
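That increase is about what the two error variances imply: the measurement error adds a variance of 25 to the disturbance variance of 100, and standard errors scale with the square root of the total error variance (a rough check in do-file form using the averages from the two ci tables):

display sqrt((100 + 25)/100) * .631      // = .71, close to the average se_b of .708 in Table 2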
. program define bivarmeas
  1.        version 6.0
  2.        if "`1'" == "?" {
  3.                global S_1 "a se_a b se_b"
  4.                exit
  5.        }
  6.        drop _all
  7.        set obs 30
  8.        gen x = int((_n-1)/3)
  9.        gen ystar = 10 + 3*x +(invnorm(uniform())*10)
 10.        gen y = ystar +(invnorm(uniform())*5)
 11.        regress y x
 12.        post `1' _b[_cons] _se[_cons] _b[x] _se[x]
 13.        drop y
 14. end

. simul bivarmeas, reps(1000)
. gen var_a= se_a^2
. gen var_b= se_b^2
. ci

    Variable |   Obs        Mean    Std. Err.     [95% Conf. Interval]
-------------+---------------------------------------------------------
           a |  1000    9.807771    .1241758      9.564096    10.05145
        se_a |  1000    3.780097    .0156544      3.749377    3.810816
           b |  1000    3.018103    .0230986      2.972776    3.063431
        se_b |  1000     .708077    .0029323      .7023227    .7138312
       var_a |  1000    14.53394    .1192872      14.29986    14.76803
       var_b |  1000     .509963    .0041855      .5017495    .5181764

Measurement error in the independent variables


8. Measurement error in the independent variable has much more serious implications.
The mis-measured independent variable is correlated with the new disturbance term.
This means that the OLS estimator of the population regression coefficient is biased
and inconsistent. Large sample sizes cannot solve the problem: the estimated
regression coefficient is asymptotically biased toward zero. That is, the effect of any
independent variable will tend to be under-estimated, even in infinitely large samples, if
that variable is mis-measured.
In Table 3, the measurement error is in the independent variable X. The causal
model for Y is still the same (Y = 10 + 3Xstar + ε, where ε ~ N(0,100)), but we are
now going to regress Y on the mis-measured X instead of the true Xstar, where the
measurement error is distributed N(0,25). The coefficient on X is consistently below
the true coefficient on Xstar, with a mean of .756 in 1000 trials. Thus, if we were
trying to assess the impact of Xstar on Y but were using a proxy with substantial
measurement error (X), we would consistently under-estimate the impact of Xstar.
Note that we are 95% confident that the true mean of the sampling distribution
for this regression coefficient is between .73 and .78, very far from the
population parameter.

. program define bivarmeas2
  1.        version 6.0
  2.        if "`1'" == "?" {
  3.                global S_1 "a se_a b se_b"
  4.                exit
  5.        }
  6.        drop _all
  7.        set obs 30
  8.        gen xstar = int((_n-1)/3)
  9.        gen y = 10 + 3*xstar +(invnorm(uniform())*10)
 10.        gen x = xstar +(invnorm(uniform())*5)
 11.        regress y x
 12.        post `1' _b[_cons] _se[_cons] _b[x] _se[x]
 13.        drop y
 14. end

. simul bivarmeas2, reps(1000)
. gen var_a= se_a^2
. gen var_b= se_b^2
. ci

    Variable |   Obs        Mean    Std. Err.     [95% Conf. Interval]
-------------+---------------------------------------------------------
           a |  1000    20.10469     .084072      19.93971    20.26967
        se_a |  1000    2.963227    .0156303      2.932555    2.993899
           b |  1000     .755585    .0125709      .7309167    .7802533
        se_b |  1000    .4114577    .0025966      .4063623    .4165532
       var_a |  1000    9.024774    .0966603      8.835093    9.214454
       var_b |  1000    .1760332    .0022808      .1715575    .1805089

9. When the sample size is large and β = 0, the OLS estimator of β is approximately
normally distributed with a mean of 0, and the estimated standard error is consistent.
Thus, with a large sample, you can validly perform a t-test of the null hypothesis that
β = 0 even when there is measurement error in X. You cannot compute valid confidence
intervals, however, or validly test any other null hypothesis (e.g., that β = 1).
10. Measurement error is generally not a problem for OLS, unless the measurement
error is correlated with one or more independent variables. When the measurement
error is in the dependent variable, for instance, it can be combined with the disturbance
term in the regression, leading to larger standard errors and smaller R2 values, but
unbiased estimators that are the most efficient available, given the measurement error.
11. Wooldridge also points out one possible situation in which measurement error in the
independent variable also does not lead to biased estimators.
The basic model is:
y = β0 + β1X1* + ε
but we measure the true independent variable (X1*) with error, so that X1 = X1* + e, and
e = X1 - X1*. If we assume that E(eX1) = 0, so that the measurement error is uncorrelated
with the observed (mis-measured) variable [and the disturbance term (ε) is
uncorrelated with both the properly and incorrectly measured independent variables (X1
and X1*)], then
y = β0 + β1X1 + ε - β1e
and OLS yields consistent estimators.
12. The more likely case (the classical errors-in-variables assumption) is that the
measurement error is uncorrelated with the true independent variable. Then the
covariance between the measurement error and the mis-measured variable equals the
variance of the measurement error:

	Cov(X1, e) = E(X1e) = E([X1* + e]e) = E(X1*e) + E(e²) = E(e²) = σ²_e

13. By assumption, E(X1*ε) = 0, so

	E[X1(ε - β1e)] = E[(X1* + e)(ε - β1e)]
	              = E[X1*ε] - β1E[X1*e] + E[eε] - β1E[e²]
	              = -β1σ²_e

and OLS estimators are not merely biased but inconsistent. The probability limit of b1 is

	plim(b1) = β1 - [β1σ²_e / (σ²_X* + σ²_e)] = β1[σ²_X* / (σ²_X* + σ²_e)],

which is biased toward zero, because σ²_X* / (σ²_X* + σ²_e) < 1, and it is more biased toward
zero the larger the variance of the measurement error relative to the variance of the
(correctly measured) independent variable.
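	As a rough check on the simulation above (my arithmetic, not part of the original
	notes): xstar takes the values 0 through 9, three times each, so its variance is
	about 8.25 (8.53 using the sample formula), and the measurement error has
	variance 25. The attenuation formula then predicts

	plim(b1) ≈ 3 × 8.25/(8.25 + 25) ≈ 0.74 to 0.76,

	consistent with the mean coefficient of .756 observed across the 1,000 trials.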
14. With multiple regression, the independent variable with measurement error is still
attenuated (biased toward zero), but in general OLS estimators for all other coefficients
are also biased and inconsistent, and the direction is typically not clear.
15. Frequently, measurement error is correlated with both the true and the mismeasured variable. Wooldridge cites the example of the number of times a person has
smoked marijuana in the past month. Generally, those who do not smoke at all will
answer correctly (unless they are lying to look cool), but the more frequently the person
smokes, the more unreliable his memory is likely to be and the larger the measurement
error. Thus, there will be very little measurement error both for those who do not
smoke at all and for those who report that they do not smoke at all, and much greater
error both for those who smoke a lot and for those who report smoking a lot. In either
case, OLS estimators are inconsistent.


Using Instrumental Variables to Cope with Omitted Variables


16. Wooldridge lists four options for coping with the OV problem:
	(1) Ignore the problem and run OLS. This is acceptable if you can state the
	direction of the bias of the key parameters and if the coefficient is statistically
	significant despite being biased toward zero. When the coefficient is biased
	toward zero and not statistically significant, however, we're unclear whether it's
	the bias or the weakness of the impact that causes the lack of statistical
	significance. In complex models, it may be difficult to determine the direction of
	the bias.
	(2) Use a proxy variable: a good choice if a suitable proxy variable is available,
	which it frequently is not.
	(3) Use lagged dependent variables, fixed-effects, or first-difference models to
	cope with time-invariant unobserved variables. Unfortunately, lagged values are
	frequently not available, and this does not correct for time-varying unobserved
	variables.
	(4) Use one or more instrumental variables.
17. Assume that the correct model is

	y = β0 + β1X1 + β2X2 + ε,

but that X2 is not available. Then β2X2 gets transferred to the error term, and the OLS
estimator b1 will be biased for β1 if X1 is correlated with X2.
18. If there is a variable Z that is uncorrelated with u = β2X2 + ε but correlated with X1,
Z can serve as an instrument for X1. Note that the second condition can be tested
empirically, but the former is an assumption that must be justified logically and
generally is not testable.
19. A reduced form equation includes only exogenous variables, those that are not
correlated with any of the disturbances in the model. In this case,

	X1 = π0 + π1Z + e
	y  = π3 + π4Z + v

	We can test the second assumption with a standard t-test for H0: π1 = 0, but we
	cannot test the assumption that E(Zv) = 0.


In the following example, OLS yields a coefficient of .109 on education, implying
that one additional year of education raises expected wages by nearly 11%.
Table 5
. use "E:\statDATA\mroz.dta", clear
. reg lwage educ
      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  1,   426) =   56.93
       Model |  26.3264237     1  26.3264237           Prob > F      =  0.0000
    Residual |  197.001028   426  .462443727           R-squared     =  0.1179
-------------+------------------------------           Adj R-squared =  0.1158
       Total |  223.327451   427  .523015108           Root MSE      =  .68003

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1086487   .0143998    7.545   0.000      .0803451    .1369523
       _cons |  -.1851969   .1852259   -1.000   0.318     -.5492674    .1788735
------------------------------------------------------------------------------

20. It seems reasonable to believe, however, that education is correlated with
unmeasured ability, inflating the coefficient on education beyond its causal impact.
Wooldridge proposes father's education as an instrument for own education. This
requires (1) that one's own and one's father's education be empirically related, but (2)
that father's education not be correlated with the error term in the wage-on-education
model; that is, we must assume, among other things, that one's father's education is
not correlated with one's own unmeasured ability. We test the first requirement with
the following regression; own and father's education are moderately strongly related
[R² = .173], and the coefficient on fatheduc is highly significant.
Table 6
. reg educ fatheduc if lwage<.
* Note that the if lwage<. is necessary to keep the number of obs the same.
      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  1,   426) =   88.84
       Model |  384.841983     1  384.841983           Prob > F      =  0.0000
    Residual |  1845.35428   426  4.33181756           R-squared     =  0.1726
-------------+------------------------------           Adj R-squared =  0.1706
       Total |  2230.19626   427  5.22294206           Root MSE      =  2.0813

------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    fatheduc |   .2694416   .0285863    9.426   0.000      .2132538    .3256295
       _cons |   10.23705   .2759363   37.099   0.000      9.694685    10.77942
------------------------------------------------------------------------------

. predict educhat


21. Stata uses the procedure ivreg to conduct two-stage least squares or instrumental
variables regression. The command lists the dependent variable followed by the
independent variables, with the endogenous variables listed in parentheses, followed by
an equals sign and their instrumental variables. Here: ivreg lwage
(educ=fatheduc).
Table 7
. ivreg lwage ( educ= fatheduc)
Instrumental variables (2SLS) regression
      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  1,   426) =    2.84
       Model |  20.8673618     1  20.8673618           Prob > F      =  0.0929
    Residual |  202.460089   426  .475258426           R-squared     =  0.0934
-------------+------------------------------           Adj R-squared =  0.0913
       Total |  223.327451   427  .523015108           Root MSE      =  .68939

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0591735   .0351418    1.684   0.093     -.0098994    .1282463
       _cons |   .4411035   .4461018    0.989   0.323     -.4357311    1.317938
------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   fatheduc
------------------------------------------------------------------------------

22. The coefficient on educ has fallen from .109 in OLS (Table 5) to .059 in 2SLS (Table
7), and the standard error has more than doubled, from .014 to .035.
	The regression in Table 6 is the first stage. We can get the ivreg coefficients if we
	save the predicted values of education (educhat) from Table 6, which removes
	the variation in education that is related to the disturbance term in the first
	regression. Note that this means that there is only about 17% as much variation
	in educhat (the model sum of squares in Table 6, 384.8) as in educ (the total sum
	of squares, 2230.2). This will naturally inflate the standard error.
Table 8
. reg lwage educhat if lwage<.
      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  1,   426) =    2.59
       Model |  1.34752393     1  1.34752393           Prob > F      =  0.1086
    Residual |  221.979927   426  .521079642           R-squared     =  0.0060
-------------+------------------------------           Adj R-squared =  0.0037
       Total |  223.327451   427  .523015108           Root MSE      =  .72186

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     educhat |   .0591735   .0367969    1.608   0.109     -.0131525    .1314995
       _cons |   .4411035   .4671121    0.944   0.346     -.4770279    1.359235
------------------------------------------------------------------------------


23. Note that if we regress lwage on fatheduc, we get the same R², but the coefficient
on fatheduc is very different from that on educ. Instrumental variables are not
proxies. But it is instructive to divide the coefficient on fatheduc in Table 9 (.0159) by
the coefficient on fatheduc in Table 6 (.269): this yields our IV coefficient on educ of
.059. In this simple instrumental variable model, we are regressing our dependent
variable (lwage) on a linear transformation of our instrumental variable. By
assumption, regressing the dependent variable on the instrumental variable solves our
problem of a correlation between educ and the disturbance term, and the instrumental
variable framework essentially translates the coefficient on fatheduc in Table 9 into the
appropriate units for educ.
Table 9
. reg lwage fatheduc

      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  1,   426) =    2.59
       Model |  1.34752424     1  1.34752424           Prob > F      =  0.1086
    Residual |  221.979927   426  .521079641           R-squared     =  0.0060
-------------+------------------------------           Adj R-squared =  0.0037
       Total |  223.327451   427  .523015108           Root MSE      =  .72186

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    fatheduc |   .0159438   .0099146    1.608   0.109     -.0035438    .0354314
       _cons |   1.046865   .0957031   10.939   0.000      .8587564    1.234974
------------------------------------------------------------------------------

24. Another way to see what is going on is to divide the covariance between
fatheduc and lwage by the covariance between fatheduc and educ
(.1979/3.345 = .059). Since, in bivariate regression, the regression coefficient is the
covariance of the independent and dependent variables divided by the variance of the
independent variable, and since the independent variable is the same in Tables 6 and 9,
dividing the covariances is the same as dividing the regression coefficients.
. corr lwage educ fatheduc, cov
(obs=428)
             |    lwage      educ  fatheduc
-------------+------------------------------
       lwage |  .523015
        educ |  .567466   5.22294
    fatheduc |  .197932   3.34495   12.4144
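	To make the link explicit (my restatement of the point above, using the
	covariances just shown): in the bivariate case the IV estimator is the ratio of
	sample covariances,

	b_IV = Cov(fatheduc, lwage) / Cov(fatheduc, educ) = .197932 / 3.34495 ≈ .0592,

	which matches the 2SLS coefficient on educ in Table 7.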

25. One cost of instrumental variable regression is that the standard errors grow
substantially, because the unique variation in the independent variable that the
estimator can use shrinks substantially. If OLS is plausibly consistent [if we do not
have convincing evidence of correlation between the independent variable and the
disturbance term, leading to bias and inconsistency], it is clearly more efficient than IV
regression.


26. One way to test the consistency of the OLS estimator is to use the Hausman test.
The Hausman test has the general framework of comparing two estimators, the more
efficient of which is consistent only under more restrictive assumptions. In this case,
instrumental variable regression is consistent whether educ is correlated with the
disturbance term or not, but OLS is only consistent in the absence of that correlation. If
both estimators are consistent, they should generate similar estimates. The Hausman
test determines whether they generate significantly different results. The null
hypothesis is that the more efficient estimator (OLS in this case) is consistent because
educ is not correlated with the disturbance term. In this case, we cannot reject the null
hypothesis that the differences in the coefficients are not systematic, so we can stick
with the OLS estimator. [Note: to run this test, type hausman, save after running
the ivreg, then type the following command after the OLS regression on the same
model (Table 5, in this case).]
. hausman, constant sigmamore
                 ---- Coefficients ----
             |      (b)          (B)          (b-B)     sqrt(diag(V_b-V_B))
             |     Prior        Current     Difference        S.E.
-------------+--------------------------------------------------------------
        educ |   .0591735      .1086487     -.0494752       .0315324
       _cons |   .4411035     -.1851969      .6263004       .3991644
-------------+--------------------------------------------------------------
               b = less efficient estimates obtained previously from ivreg.
               B = more efficient estimates obtained from regress.

    Test:  Ho:  difference in coefficients not systematic

           chi2( 1) = (b-B)'[(V_b-V_B)^(-1)](b-B) = 2.46
           Prob>chi2 = 0.1166
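	For readers working in a more recent version of Stata, the same comparison can
	be run by storing the two sets of estimates and passing their names to hausman.
	This is my sketch of the newer syntax, not part of the original transcript:

	. ivregress 2sls lwage (educ = fatheduc)    // IV estimates: consistent with or without endogeneity
	. estimates store iv
	. regress lwage educ                        // OLS estimates: efficient if educ is exogenous
	. estimates store ols
	. hausman iv ols, constant sigmamore        // compare the coefficient vectors, as in the output above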

27. Another way to get essentially the same result is to run the first stage of the
two-stage least squares model, save the residuals, and re-run the simple OLS, but this time
include those residuals as an extra independent variable. If the coefficient on the
residuals is significant, this is strong evidence that OLS is not consistent. Note that the
prob-value for the t-statistic essentially equals that for the chi-square statistic in the
Hausman test.
. reg educ fatheduc if lwage<.

      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  1,   426) =   88.84
       Model |  384.841983     1  384.841983           Prob > F      =  0.0000
    Residual |  1845.35428   426  4.33181756           R-squared     =  0.1726
-------------+------------------------------           Adj R-squared =  0.1706
       Total |  2230.19626   427  5.22294206           Root MSE      =  2.0813

------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    fatheduc |   .2694416   .0285863    9.426   0.000      .2132538    .3256295
       _cons |   10.23705   .2759363   37.099   0.000      9.694685    10.77942
------------------------------------------------------------------------------


. predict res, r

. reg lwage educ res

      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  2,   425) =   29.80
       Model |  27.4648913     2  13.7324457           Prob > F      =  0.0000
    Residual |   195.86256   425  .460853082           R-squared     =  0.1230
-------------+------------------------------           Adj R-squared =  0.1189
       Total |  223.327451   427  .523015108           Root MSE      =  .67886

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0591735   .0346051    1.710   0.088      -.008845    .1271919
         res |   .0597931   .0380427    1.572   0.117     -.0149823    .1345684
       _cons |   .4411035    .439289    1.004   0.316     -.4223459    1.304553
------------------------------------------------------------------------------

Note, by the way, that if we regress lwage on the original educ plus the
instrumental variable fatheduc, the P>|t| is the same for fatheduc as for res.
. reg lwage educ fatheduc
      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  2,   425) =   29.80
       Model |  27.4648914     2  13.7324457           Prob > F      =  0.0000
    Residual |   195.86256   425  .460853082           R-squared     =  0.1230
-------------+------------------------------           Adj R-squared =  0.1189
       Total |  223.327451   427  .523015108           Root MSE      =  .67886

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1189665   .0158031     7.53   0.000      .0879046    .1500284
    fatheduc |  -.0161107   .0102503    -1.57   0.117     -.0362583    .0040368
       _cons |  -.1710012   .1851275    -0.92   0.356     -.5348807    .1928784
------------------------------------------------------------------------------

28. Read carefully Wooldridge's section on the properties of IV with poor instrumental
variables. Even with small correlations between the instrumental variable and the error
term, IV regression can have a large asymptotic bias. The bias of the IV estimator is the
ratio of the correlations of Z with ε and with X, times the ratio of the standard deviation
of the disturbance term to the standard deviation of X. The bias of the OLS estimator is
the correlation of X and ε times the same ratio of standard deviations. As Wooldridge
notes, IV is preferred to OLS on asymptotic bias grounds when
|Corr(Z,u)| / |Corr(Z,X)| < |Corr(X,u)|.
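	Written out (my transcription of the standard asymptotic-bias expressions, not
	additional material from the notes):

	plim(b1,IV)  = β1 + [Corr(Z,u) / Corr(Z,X)] × (σ_u / σ_X)
	plim(b1,OLS) = β1 + Corr(X,u) × (σ_u / σ_X)

	so the inequality above simply compares the two bias terms in absolute value.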


Errors-in-Variables Problems
29. Instrumental variables also provide potential solutions to measurement error in an
independent variable. Essentially, we need an alternative measure of the same variable,
one that is also mis-measured, but whose measurement error is uncorrelated with the
measurement error in the original measure.
Identification
30. To identify an instrumental variables model for a bivariate regression, you need an
instrumental variable Z that is correlated with X but not correlated with ε.
31. In the case of multiple regression, you need an instrumental variable Z that is
correlated with the endogenous variable X, even after controlling for all the exogenous
X, but that is not correlated with ε.
32. If you have more than one endogenous independent variable, you need at least as
many exogenous (instrumental) variables Z that are correlated with the endogenous
variables X, even after controlling for all the exogenous X, but that are not correlated
with ε. This is called the order condition. You must argue that all of your
instrumental variables have no direct impact on your dependent variable when the
model is properly specified, all of the causal variables are included, and all of the causal
variables are measured accurately.
33. Your instrumental variables must also meet a rank condition, which we will
discuss in more detail when we talk about structural equations. The rank refers to the
rank of a matrix. In this case, it means at least that each of the instrumental variables
has a significant impact on at least one of the endogenous variables, after controlling for
all the other variables in the model, and that each of the endogenous variables has at
least one instrument that is significantly related to it, after controlling for all the other
variables in the model.
Testing for Endogeneity
34. Here's another example of the Hausman test. In the MROZ data set, we regress own
education on mother's and father's education and the other variables in the wage
equation (experience and experience-squared). We save the residuals and add them to
the OLS regression and test the statistical significance of their coefficient. Wooldridge
feels that the prob-value is low enough to cause concern about endogeneity (though this
seems fairly marginal to me, given a sample of size 428). Note that the coefficients are
the same in the second regression and the IV regression.


. reg educ fatheduc motheduc exper expersq if wage<.


      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  4,   423) =   28.36
       Model |  471.620998     4   117.90525           Prob > F      =  0.0000
    Residual |  1758.57526   423  4.15738833           R-squared     =  0.2115
-------------+------------------------------           Adj R-squared =  0.2040
       Total |  2230.19626   427  5.22294206           Root MSE      =   2.039

------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    fatheduc |   .1895484   .0337565    5.615   0.000      .1231971    .2558997
    motheduc |    .157597   .0358941    4.391   0.000       .087044    .2281501
       exper |   .0452254   .0402507    1.124   0.262     -.0338909    .1243417
     expersq |  -.0010091   .0012033   -0.839   0.402     -.0033744    .0013562
       _cons |    9.10264   .4265614   21.340   0.000      8.264196    9.941084
------------------------------------------------------------------------------

. predict res, r
. reg lwage educ exper expersq res
      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  4,   423) =   20.50
       Model |   36.257316     4  9.06432899           Prob > F      =  0.0000
    Residual |  187.070135   423  .442246183           R-squared     =  0.1624
-------------+------------------------------           Adj R-squared =  0.1544
       Total |  223.327451   427  .523015108           Root MSE      =  .66502

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0613966   .0309849    1.981   0.048       .000493    .1223003
       exper |   .0441704   .0132394    3.336   0.001      .0181471    .0701937
     expersq |   -.000899   .0003959   -2.271   0.024     -.0016772   -.0001208
         res |   .0581666   .0348073    1.671   0.095     -.0102501    .1265834
       _cons |   .0481003   .3945753    0.122   0.903     -.7274721    .8236727
------------------------------------------------------------------------------

. ivreg lwage (educ= motheduc fatheduc) exper expersq
Instrumental variables (2SLS) regression
      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  3,   424) =    8.14
       Model |  30.3074295     3  10.1024765           Prob > F      =  0.0000
    Residual |  193.020022   424    .4552359           R-squared     =  0.1357
-------------+------------------------------           Adj R-squared =  0.1296
       Total |  223.327451   427  .523015108           Root MSE      =  .67471

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0613966   .0314367    1.953   0.051     -.0003945    .1231878
       exper |   .0441704   .0134325    3.288   0.001      .0177679    .0705729
     expersq |   -.000899   .0004017   -2.238   0.026     -.0016885   -.0001094
       _cons |   .0481003   .4003281    0.120   0.904     -.7387744     .834975
------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   motheduc fatheduc + exper expersq
------------------------------------------------------------------------------


Testing Overidentifying Restrictions


35. If we have more than one instrument for our endogenous variable, our model is
overidentified. If we had used motheduc (mother's education) instead of fatheduc as
our instrumental variable, the 2SLS coefficient on education would have been .039
instead of .059.
. ivreg lwage (educ= motheduc)
Instrumental variables (2SLS) regression
      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  1,   426) =    1.02
       Model |  15.3676157     1  15.3676157           Prob > F      =  0.3138
    Residual |  207.959836   426  .488168628           R-squared     =  0.0688
-------------+------------------------------           Adj R-squared =  0.0666
       Total |  223.327451   427  .523015108           Root MSE      =  .69869

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0385499   .0382279     1.01   0.314     -.0365888    .1136887
       _cons |   .7021743   .4850991     1.45   0.148     -.2513114     1.65566
------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   motheduc
------------------------------------------------------------------------------

36. The 2SLS model using both as instrumental variables computes a weighted average
of these estimates and should provide a better joint instrument than a single
instrumental variable, provided that both IVs meet the assumptions of the model: that
each has unique correlation with the endogenous variable and no correlation with the
disturbance term. We have already seen how to test the first assumption. Having a
second IV allows us to test the second.
37. Save the residuals from the 2SLS regression with both instrumental variables, then
regress them on the exogenous variables. Do a Lagrange Multiplier test: multiply R²
by the sample size and compare the product to the critical value of the chi-square
statistic, with degrees of freedom equal to the number of instruments minus the number
of endogenous variables (one, in this case). If the product is statistically significant,
reject the null hypothesis that the instrumental variables are all uncorrelated with the
disturbance term.
	This won't work with a just-identified model (e.g., one endogenous independent
	variable and one instrumental variable), because the R² for this model will be
	.0000. By its very nature, OLS yields residuals that are uncorrelated with all the
	independent variables; when we have a single instrumental variable, IV
	regression essentially uses a linear transformation of that IV in place of the
	endogenous variable, so the residuals have to be uncorrelated with it. With
	multiple instrumental variables, the expected value of the endogenous variable is
	not a perfect linear transformation of either instrumental variable, and the
	residual can be correlated with both.


. ivreg lwage (educ= motheduc fatheduc) exper expersq


Instrumental variables (2SLS) regression
      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  3,   424) =    8.14
       Model |  30.3074295     3  10.1024765           Prob > F      =  0.0000
    Residual |  193.020022   424    .4552359           R-squared     =  0.1357
-------------+------------------------------           Adj R-squared =  0.1296
       Total |  223.327451   427  .523015108           Root MSE      =  .67471

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0613966   .0314367    1.953   0.051     -.0003945    .1231878
       exper |   .0441704   .0134325    3.288   0.001      .0177679    .0705729
     expersq |   -.000899   .0004017   -2.238   0.026     -.0016885   -.0001094
       _cons |   .0481003   .4003281    0.120   0.904     -.7387744     .834975
------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   motheduc fatheduc + exper expersq
------------------------------------------------------------------------------

. predict res2, r
. reg res2 motheduc fatheduc exper expersq
      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  4,   423) =    0.09
       Model |  .170502982     4  .042625746           Prob > F      =  0.9845
    Residual |  192.849518   423  .455909026           R-squared     =  0.0009
-------------+------------------------------           Adj R-squared = -0.0086
       Total |  193.020021   427   .45203752           Root MSE      =  .67521

------------------------------------------------------------------------------
        res2 |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    motheduc |  -.0066065   .0118864   -0.556   0.579     -.0299704    .0167573
    fatheduc |   .0057823   .0111786    0.517   0.605     -.0161902    .0277547
       exper |  -.0000183   .0133291   -0.001   0.999     -.0262179    .0261813
     expersq |   7.34e-07   .0003985    0.002   0.999     -.0007825     .000784
       _cons |   .0109641   .1412571    0.078   0.938     -.2666892    .2886173
------------------------------------------------------------------------------

428 * .0009 = .386 < 3.84

In this case, however, the residuals are uncorrelated with the instrumental
variables, holding the other exogenous variables constant. The LM test-statistic
is only .386, well below the critical value of 3.84. It is plausible that both
instrumental variables fulfill the necessary assumptions.
SIMULTANEOUS EQUATION SYSTEMS
1. An integrated structure of regression equations includes at least two endogenous
variables that have simultaneous or reciprocal impacts on each other.
2. In an integrated structure, the endogenous independent variables will generally be
correlated with the error terms in the same equation. Thus, ordinary least squares will
lead to biased and inconsistent estimates of the regression coefficients for the structural
equations.
Note that we will typically assume that all the error terms are uncorrelated with
all the exogenous variables but will not assume that the error terms are
uncorrelated across equations.
3. Through substitution we can re-express the structural equations as reduced form
equations. Reduced form equations only include exogenous independent variables.
Because the exogenous variables are uncorrelated with all the error terms (even
the transformed ones, which are linear transformations of the original error
terms), OLS will give unbiased and consistent estimates of the parameters for the
reduced form equations.
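	A minimal two-equation illustration (my own, not one of the attached examples):
	suppose the structural equations are

	y1 = α1 + β1·y2 + γ1·x1 + u1
	y2 = α2 + β2·y1 + γ2·x2 + u2

	Substituting each equation into the other and solving (assuming β1β2 ≠ 1) gives
	reduced forms in which each endogenous variable depends only on the exogenous
	variables:

	y1 = π10 + π11·x1 + π12·x2 + v1
	y2 = π20 + π21·x1 + π22·x2 + v2

	where, for example, π12 = β1γ2/(1 - β1β2) and v1 is a linear combination of u1 and u2.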
Identification
4. Imagine that you have a perfectly drawn, infinite sample so that you can estimate the
reduced form equations precisely using OLS. (Even with this sample, OLS estimates of
the structural equations will be biased and inconsistent.) With perfect knowledge of the
reduced form coefficients, could you calculate the structural coefficients? This is the
essence of the identification problem.
Look at the examples on the attached pages.
5. With infinitely large samples, structural equations are either identified or underidentified. If there is more than one way to calculate a structural coefficient, all will
yield the same answers.
6. In the finite samples we deal with, however, identified equations can be just or
exactly identified (if there is exactly one way to calculate each structural coefficient) or
over-identified (if there is more than one way to calculate at least one structural
coefficient). With finite sample data, those methods will yield different answers.
7. There are two conditions/tests for establishing identification. A necessary condition
is that the number of excluded exogenous variables in an equation must be at least as
large as the number of included endogenous variables. This is called the order
condition.
8. The necessary and sufficient condition for identification is the rank condition. It's
a little tricky if you don't understand matrix algebra.


Lecture 11. MULTICOLLINEARITY


1. We frequently run into the problem that independent variables that we know have
causal impacts on our dependent variable turn out to have statistically insignificant
coefficients in our regressions. There are a number of possible explanations:
A. We were wrong. The variable does not have a causal impact, at least
given the way we have structured our model. For instance, we may expect that
older workers will earn more than younger workers because they have more
experience, but then we control for experience in our regression. In that case,
we need to have a good causal argument for why older workers should earn more
than younger workers with the same amount of experience.
B. The sample may be too small to yield significant results for the size
of the effect. Smaller sample sizes mean less variation in X and larger standard
errors for the regression coefficient. If the variable's population regression
coefficient is small relative to the standard error in the sample, the probability of
getting a significant t-statistic will be small.
C. X may be too strongly correlated with the other independent
variables to yield significant results for the size of the effect. High
multicollinearity means less unexplained variation in X and larger standard
errors for the regression coefficient. If the population regression coefficient (the
direct effect of X, holding the other variables constant) is small relative to its
standard error with the other variables in the model, the probability of getting a
significant t-statistic will be small.
2. Independent variables are almost always inter-correlated in our samples, because
they are almost always inter-correlated in the world. In many cases, we specifically
model inter-correlations among the variables. We view multicollinearity as a problem
primarily when it leads to insignificant coefficients, but whether we get insignificant
coefficients depends on a variety of factors: the strength of the relationship in the
population, the sample size, the amount of variation in X in the sample, the amount of
unique variation in X in the sample, and the specific sample that we got.
3. Even in the presence of high multicollinearity, if the population regression function
meets the classical assumptions, OLS estimates are BLUE. Multicollinearity does not
affect R2 or the F-statistic, so it generally does not affect the predictive power of the
model or the statistical significance of the model as a whole. Multicollinearity shows up
primarily in the standard errors of the regression coefficients, and therefore in the width
of the confidence intervals and the size of the t-statistics.
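	One way to see this (a standard result, stated in my notation rather than taken
	from the notes): in multiple regression the sampling variance of a coefficient can
	be written

	Var(b_j) = σ² / [SST_j (1 - R²_j)],

	where SST_j is the total variation in X_j and R²_j is the R² from regressing X_j on
	the other independent variables. Multicollinearity raises R²_j, which shrinks the
	denominator and inflates the standard error without changing the overall fit of
	the model.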


Indicators
4. There are a number of indicators and measures of possible multicollinearity in a data
set.
A. The R2 and F-statistic are high but the t-statistics are insignificant.
In the following sample of 100 observations from opm94, the model explains
almost 60% of the variation in grade and the F* of 13.21 is significant at the
.0001 level, but only the edyrs and male coefficients are significant at the .05
level. This suggests that multicollinearity among the other variables may be
leading to their insignificant coefficients.
. fit grade edyrs yos yossq age agesq male asian black hispanic indian
      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F( 10,    89) =   13.21
       Model |  755.825898    10  75.5825898           Prob > F      =  0.0000
    Residual |  509.414102    89  5.72375395           R-squared     =  0.5974
-------------+------------------------------           Adj R-squared =  0.5521
       Total |     1265.24    99   12.780202           Root MSE      =  2.3924

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   .8996764    .121531     7.40   0.000      .6581969    1.141156
         yos |   .2445191   .1324762     1.85   0.068     -.0187082    .5077465
       yossq |  -.0035776   .0035326    -1.01   0.314     -.0105969    .0034416
         age |  -.0458705   .1769288    -0.26   0.796     -.3974243    .3056834
       agesq |   .0001219   .0017918     0.07   0.946     -.0034383    .0036821
        male |   1.464755   .5540132     2.64   0.010      .3639422    2.565567
       asian |   1.110076   1.464366     0.76   0.450     -1.799588     4.01974
       black |  -.8721685   .6817244    -1.28   0.204      -2.22674    .4824034
    hispanic |  -.1294638   1.127623    -0.11   0.909     -2.370026    2.111099
      indian |  -.5799867   1.275864    -0.45   0.651     -3.115102    1.955128
       _cons |  -4.907165   4.034378    -1.22   0.227     -12.92339    3.109059
------------------------------------------------------------------------------

B. The independent variables have high correlation coefficients. In
this case federal experience is very highly correlated with federal experience
squared (r = .9675), age is very highly correlated with age squared (r = .9885),
and all four variables are highly inter-correlated (the minimum r is .6304). In
addition, gender and education are moderately positively related (r = .4470).
. corr yos yossq age agesq male edyrs
(obs=100)
             |      yos    yossq      age    agesq     male    edyrs
-------------+--------------------------------------------------------
         yos |   1.0000
       yossq |   0.9675   1.0000
         age |   0.7207   0.6590   1.0000
       agesq |   0.6797   0.6304   0.9885   1.0000
        male |   0.0462   0.0877   0.0195   0.0333   1.0000
       edyrs |   0.0737   0.0813   0.1079   0.1134   0.4470   1.0000


C. When we regress one independent variable on the other
independent variables (called an auxiliary regression), we get a high
R². For instance, the other independent variables can explain 98.45% of the
variation in age and 95.9% of the variation in yos. Since the standard error for a
regression coefficient is based on the unique variation in the independent
variable (variation that cannot be explained by the other independent variables),
the standard errors on both variables are likely to be large, making it difficult to
reject the null hypothesis that the variable has no impact on the dependent
variable in the population.
. reg age edyrs yos yossq agesq male asian black hispanic indian
      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  9,    90) =  636.01
       Model |  11629.1547     9  1292.12831           Prob > F      =  0.0000
    Residual |  182.845253    90  2.03161392           R-squared     =  0.9845
-------------+------------------------------           Adj R-squared =  0.9830
       Total |    11812.00    99  119.313131           Root MSE      =  1.4253

------------------------------------------------------------------------------
         age |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   .0150993   .0723873     0.21   0.835     -.1287106    .1589093
         yos |   .3524597   .0696343     5.06   0.000      .2141189    .4908005
       yossq |  -.0072478   .0019611    -3.70   0.000     -.0111438   -.0033518
       agesq |   .0099471   .0002003    49.66   0.000      .0095492     .010345
        male |  -.1643717   .3296105    -0.50   0.619     -.8192005    .4904571
       asian |  -.6735527   .8695345    -0.77   0.441     -2.401035    1.053929
       black |  -.1996105    .405607    -0.49   0.624      -1.00542    .6061985
    hispanic |   .1615997   .6715901     0.24   0.810     -1.172631    1.495831
      indian |   .2998521   .7594668     0.39   0.694     -1.208961    1.808665
       _cons |   20.37101   1.079945    18.86   0.000      18.22551    22.51651
------------------------------------------------------------------------------

. reg yos age edyrs yossq agesq male asian black hispanic indian
      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  9,    90) =  235.62
       Model |  7684.60904     9  853.845449           Prob > F      =  0.0000
    Residual |  326.140958    90  3.62378842           R-squared     =  0.9593
-------------+------------------------------           Adj R-squared =  0.9552
       Total |     8010.75    99  80.9166667           Root MSE      =  1.9036

------------------------------------------------------------------------------
         yos |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .6286821   .1242067     5.06   0.000      .3819237    .8754404
       edyrs |    .035934   .0966261     0.37   0.711     -.1560306    .2278987
       yossq |    .025345   .0008738    29.01   0.000      .0236091    .0270809
       agesq |  -.0054778   .0013035    -4.20   0.000     -.0080674   -.0028881
        male |   -.347554   .4392948    -0.79   0.431      -1.22029    .5251817
       asian |  -.8014129   1.162107    -0.69   0.492     -3.110141    1.507315
       black |   .6080221    .538638     1.13   0.262     -.4620763     1.67812
    hispanic |   .3179352    .896606     0.35   0.724     -1.463329    2.099199
      indian |   .7710833   1.011926     0.76   0.448     -1.239285    2.781452
       _cons |   -9.58836   3.046831    -3.15   0.002     -15.64142     -3.5353
------------------------------------------------------------------------------


D. Low tolerance and high VIF (variance inflation factor). Tolerance is
1 - R², where R² is for the auxiliary regression. (Stata displays tolerance as 1/VIF.)
VIF is 1/(1 - R²). The auxiliary regressions can explain 98.45% of the variation in
age and 95.93% of the variation in yos. Thus, tolerance for age is 1 - .9845 or
.0155 and VIF is 1/.0155 or 64.52; tolerance for yos is 1 - .9593 or .0407 and
VIF is 1/.0407 or 24.57.
One rule of thumb is that you have high multicollinearity if VIF > 10 (that is, if R²
in the auxiliary regression is greater than .90).
[Run the vif command immediately after the regression of interest (not after the
auxiliary regression), which you can run with either the fit or the regress
command.]
. vif
    Variable |      VIF       1/VIF
-------------+-----------------------
         age |    64.60    0.015480
       agesq |    56.10    0.017826
         yos |    24.56    0.040713
       yossq |    19.63    0.050953
        male |     1.34    0.748630
       edyrs |     1.32    0.756735
      indian |     1.09    0.915674
       black |     1.09    0.916355
       asian |     1.09    0.917252
    hispanic |     1.06    0.947675
-------------+-----------------------
    Mean VIF |    17.19
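	As a quick check (my own sketch, not part of the original transcript), any single
	VIF can be reproduced from its auxiliary regression using the R² that regress
	leaves behind in e(r2):

	. quietly regress age edyrs yos yossq agesq male asian black hispanic indian
	. display "VIF for age = " 1/(1 - e(r2))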

Solutions
5. The question is not whether multicollinearity exists but how strong it is. Even if it's
very strong, it's not clear what you should do about it. OLS estimates are BLUE even in
the presence of high multicollinearity, so the preferred solution of theoretical
statisticians and econometricians is to do nothing. Simply accept that your data are not
strong enough to answer all the questions you would like to put to them.
6. Most of the time, however, you will feel a strong desire to do something about it. A
common situation is that either X1 or X2 (or both) will have a statistically significant,
even strong, coefficient if the other variable is left out of the model, but that neither will
be statistically significant if both are in the model. Possible solutions include:
A. Assign a particular value to one of the coefficients or to the ratio of
the coefficients (based on theory or prior research). This generally means that
we only have to estimate one of the coefficients and that it will be statistically
significant. The solution is rarely practical, however, because we don't know the
effect of the independent variables, especially if we're doing exploratory research. If
we assign the wrong value to a coefficient or ratio, all the coefficients will be
biased.

B. Drop one of the variables. The other variable then becomes significant
and the model looks better. In general, this should be done only on theoretical
grounds (but, in practice, theory will frequently be weak). In practice,
researchers tend to let the data choose the model, typically dropping the variable
with the smaller t-statistic. This is dangerous. If the originally specified
model was correct, the new coefficients will be biased. Remember that accepting
the null hypothesis that the population coefficient equals 0 does not mean that
the null hypothesis is true, simply that you have insufficient evidence to reject it.
The extreme form of letting the data choose the model is stepwise
regression. In this case, you let the computer choose from a number of
possible independent variables. The computer first enters the
independent variable that is most strongly correlated with the dependent
variable. If the t-statistic is high enough, the computer then chooses the
residualized independent variable (the residuals from regressing that
independent variable on the first independent variable) that is most
strongly correlated with the dependent variable. If its t-statistic is high
enough, the computer then chooses the residualized independent variable
(based on regression on the first two independent variables) that is most
strongly correlated with the dependent variable, etc.
Forward Selection Stepwise Regression
. sw reg grade edyrs yos yossq age agesq male asian black hispanic indian, pe(.15)
begin with empty model
p = 0.0000 <  0.1500  adding  edyrs
p = 0.0007 <  0.1500  adding  yos
p = 0.0018 <  0.1500  adding  male
p = 0.1051 <  0.1500  adding  agesq
p = 0.1452 <  0.1500  adding  hispanic
p = 0.1322 <  0.1500  adding  black

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  6,    93) =   19.65
       Model |  782.677028     6  130.446171           Prob > F      =  0.0000
    Residual |  617.322972    93  6.63788142           R-squared     =  0.5591
-------------+------------------------------           Adj R-squared =  0.5306
       Total |     1400.00    99  14.1414141           Root MSE      =  2.5764

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   .9100092   .1209027     7.53   0.000      .6699204    1.150098
         yos |   .1404958   .0366222     3.84   0.000      .0677714    .2132203
        male |   2.029557   .5580349     3.64   0.000      .9214104    3.137704
       agesq |  -.0006585   .0003413    -1.93   0.057     -.0013363    .0000193
    hispanic |   -2.14599    1.25157    -1.71   0.090     -4.631361    .3393798
       black |  -.9516112   .6264658    -1.52   0.132     -2.195648    .2924258
       _cons |  -5.183749   1.739216    -2.98   0.004     -8.637488   -1.730011
------------------------------------------------------------------------------

In alternative versions, stepwise checks the statistical significance of all
independent variables already in the model each time before it adds a new one,
sometimes dropping variables that have become insignificant. In other versions,
stepwise starts with all the independent variables included in the model, then
drops them one by one, based on how close their t-statistics are to 0. In this case,
forward and backward selection yield the same final regression, but this will not
necessarily be true.
Backward Selection Stepwise Regression
. sw reg grade edyrs yos yossq age agesq male asian black hispanic indian, pr(.15)
begin with full model
p = 0.9698 >= 0.1500 removing age
p = 0.5113 >= 0.1500 removing yossq
p = 0.3400 >= 0.1500 removing indian
p = 0.2877 >= 0.1500 removing asian
      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  6,    93) =   19.65
       Model |  782.677028     6  130.446171           Prob > F      =  0.0000
    Residual |  617.322972    93  6.63788142           R-squared     =  0.5591
-------------+------------------------------           Adj R-squared =  0.5306
       Total |     1400.00    99  14.1414141           Root MSE      =  2.5764

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   .9100092   .1209027     7.53   0.000      .6699204    1.150098
         yos |   .1404958   .0366222     3.84   0.000      .0677714    .2132203
    hispanic |   -2.14599    1.25157    -1.71   0.090     -4.631361    .3393798
       black |  -.9516112   .6264658    -1.52   0.132     -2.195648    .2924258
       agesq |  -.0006585   .0003413    -1.93   0.057     -.0013363    .0000193
        male |   2.029557   .5580349     3.64   0.000      .9214104    3.137704
       _cons |  -5.183749   1.739216    -2.98   0.004     -8.637488   -1.730011
------------------------------------------------------------------------------

The danger here is that you're relying too much on chance relationships among
variables in your sample. Remember that all sample statistics have sampling
distributions, and that relationships in the sample may be stronger or weaker
than relationships in the population. Stepwise can easily include variables in your
regression that are not related in the population and discard variables that are
related in the population. If you're using a .05 significance level and testing 20
independent variables that are not related to the dependent variable in the
population, on average one of them will have a statistically significant relationship
with the dependent variable in the sample.
In a Monte Carlo simulation, I created 120 observations on one normally
distributed dependent variable and 50 independently distributed, normal
independent variables (tables 10 and 11A-D). Because I created the variables, I
know that they are not related in the population. Nonetheless, in five
experiments (with 120 observations created the same way each time), stepwise
included at least one independent variable four times out of five, even though
none of the independent variables are related to the dependent variable in the
population. In the first experiment, six regression coefficients were statistically
significant at the .05 level, and those six variables jointly explained 24 percent of
the variation in the dependent variable. In other words, stepwise was extremely
misleading in this case.

clear
set memory 10000
set matsize 250
set obs 120
gen y=invnorm(uniform())*100
for new x1-x50: gen X= invnorm(uniform())*100
sw reg y x1-x50, pe(.05)
1. sw reg y x1-x50, pe(.05)
begin with empty model
p = 0.0194 <  0.0500  adding  x34
p = 0.0106 <  0.0500  adding  x49
p = 0.0219 <  0.0500  adding  x8
p = 0.0235 <  0.0500  adding  x3
p = 0.0289 <  0.0500  adding  x15
p = 0.0432 <  0.0500  adding  x45

      Source |       SS       df       MS              Number of obs =     120
-------------+------------------------------           F(  6,   113) =    5.87
       Model |  279743.811     6  46623.9686           Prob > F      =  0.0000
    Residual |  897471.084   113  7942.22198           R-squared     =  0.2376
-------------+------------------------------           Adj R-squared =  0.1972
       Total |  1177214.90   119  9892.56215           Root MSE      =  89.119

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         x34 |  -.2729371   .0860706    -3.17   0.002     -.4434585   -.1024156
         x49 |  -.3674921   .0964063    -3.81   0.000     -.5584904   -.1764938
          x8 |  -.2380103   .0807125    -2.95   0.004     -.3979164   -.0781042
          x3 |   .2201066   .0861434     2.56   0.012      .0494409    .3907722
         x15 |   .1879747   .0781927     2.40   0.018       .033061    .3428885
         x45 |  -.1755879    .085878    -2.04   0.043     -.3457277   -.0054481
       _cons |  -9.996866   8.376517    -1.19   0.235     -26.59226    6.598524
------------------------------------------------------------------------------

2. sw reg y x1-x50, pe(.05)
NOTE THAT STEPWISE DID NOT PICK UP ANY SIGNIFICANT
VARIABLES IN THIS EXAMPLE.
begin with empty model
p >= 0.0500 for all terms in model

      Source |       SS       df       MS              Number of obs =     120
-------------+------------------------------           F(  0,   119) =    0.00
       Model |        0.00     0           .           Prob > F      =       .
    Residual |  1167835.87   119  9813.74676           R-squared     =  0.0000
-------------+------------------------------           Adj R-squared =  0.0000
       Total |  1167835.87   119  9813.74676           Root MSE      =  99.064

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   12.22178   9.043297     1.35   0.179     -5.684853    30.12841
------------------------------------------------------------------------------

3. sw reg y x1-x50, pe(.05)
begin with empty model
p = 0.0099 <  0.0500  adding  x20
p = 0.0415 <  0.0500  adding  x15

      Source |       SS       df       MS              Number of obs =     120
-------------+------------------------------           F(  2,   117) =    5.66
       Model |  83857.4193     2  41928.7097           Prob > F      =  0.0045
    Residual |  867466.685   117  7414.24517           R-squared     =  0.0881
-------------+------------------------------           Adj R-squared =  0.0726
       Total |  951324.104   119   7994.3202           Root MSE      =  86.106

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         x20 |  -.2378268   .0786196    -3.03   0.003     -.3935289   -.0821247
         x15 |  -.1518291   .0736524    -2.06   0.041     -.2976937   -.0059644
       _cons |    5.45497   7.861522     0.69   0.489     -10.11436     21.0243
------------------------------------------------------------------------------

4. sw reg y x1-x50, pe(.05)
begin with empty model
p = 0.0140 <  0.0500  adding  x15
p = 0.0433 <  0.0500  adding  x13
p = 0.0380 <  0.0500  adding  x24
p = 0.0446 <  0.0500  adding  x14
p = 0.0483 <  0.0500  adding  x23
p = 0.0312 <  0.0500  adding  x34

      Source |       SS       df       MS              Number of obs =     120
-------------+------------------------------           F(  6,   113) =    4.98
       Model |  246545.787     6  41090.9646           Prob > F      =  0.0001
    Residual |  932815.824   113  8255.00729           R-squared     =  0.2091
-------------+------------------------------           Adj R-squared =  0.1671
       Total |  1179361.61   119  9910.60177           Root MSE      =  90.857

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         x15 |  -.2177895   .0831663    -2.62   0.010     -.3825569   -.0530221
         x13 |  -.2050253   .0938609    -2.18   0.031     -.3909807   -.0190698
         x24 |  -.1927952   .0853278    -2.26   0.026     -.3618449   -.0237455
         x14 |  -.1743014   .0807068    -2.16   0.033     -.3341961   -.0144067
         x23 |   .1917715    .084333     2.27   0.025      .0246926    .3588505
         x34 |   .1869151    .085667     2.18   0.031      .0171934    .3566368
       _cons |  -.7589077    8.79647    -0.09   0.931      -18.1863    16.66849
------------------------------------------------------------------------------

5. sw reg y x1-x50, pe(.05)
begin with empty model
p = 0.0006 <  0.0500  adding  x18
p = 0.0352 <  0.0500  adding  x28
p = 0.0122 <  0.0500  adding  x12
p = 0.0323 <  0.0500  adding  x39

      Source |       SS       df       MS              Number of obs =     120
-------------+------------------------------           F(  4,   115) =    7.53
       Model |  196417.359     4  49104.3398           Prob > F      =  0.0000
    Residual |  749823.919   115  6520.20799           R-squared     =  0.2076
-------------+------------------------------           Adj R-squared =  0.1800
       Total |  946241.278   119  7951.60738           Root MSE      =  80.748

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         x18 |    .284116   .0796954     3.57   0.001      .1262548    .4419771
         x28 |   .2269257   .0822566     2.76   0.007      .0639913    .3898601
         x12 |   .1949221   .0694085     2.81   0.006      .0574371     .332407
         x39 |  -.1801371   .0831205    -2.17   0.032     -.3447828   -.0154914
       _cons |  -5.407107   7.647917    -0.71   0.481     -20.55616    9.741943
------------------------------------------------------------------------------


In the following odd experiment, I ran forward and backward stepwise on the
same sample of 120. Forward selection yielded five statistically significant
coefficients and explained 21% of the variation in the dependent variable.
Backward selection yielded ten statistically significant coefficients and explained
33% of the variation in the dependent variable.
A. sw reg y x1-x50, pe(.05) (pe means forward selection)
begin with empty model
p = 0.0148 <  0.0500  adding  x28
p = 0.0236 <  0.0500  adding  x11
p = 0.0211 <  0.0500  adding  x41
p = 0.0236 <  0.0500  adding  x43
p = 0.0200 <  0.0500  adding  x5

      Source |       SS       df       MS              Number of obs =     120
-------------+------------------------------           F(  5,   114) =    5.99
       Model |  224269.179     5  44853.8359           Prob > F      =  0.0001
    Residual |  854329.395   114   7494.1175           R-squared     =  0.2079
-------------+------------------------------           Adj R-squared =  0.1732
       Total |  1078598.57   119  9063.85356           Root MSE      =  86.569

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         x28 |  -.1547886    .071566    -2.16   0.033     -.2965604   -.0130169
         x11 |  -.2495149   .0877035    -2.84   0.005     -.4232549   -.0757749
         x41 |  -.2313524     .08885    -2.60   0.010     -.4073634   -.0553413
         x43 |  -.2067182    .074505    -2.77   0.006      -.354312   -.0591244
          x5 |  -.1978574   .0838427    -2.36   0.020     -.3639491   -.0317656
       _cons |   2.769697    8.11214     0.34   0.733     -13.30039    18.83978
------------------------------------------------------------------------------

B. sw reg y x1-x50, pr(.05) (pr means backward selection)
      Source |       SS       df       MS              Number of obs =     120
-------------+------------------------------           F( 10,   109) =    5.33
       Model |  354375.565    10  35437.5565           Prob > F      =  0.0000
    Residual |  724223.009   109  6644.24779           R-squared     =  0.3286
-------------+------------------------------           Adj R-squared =  0.2670
       Total |  1078598.57   119  9063.85356           Root MSE      =  81.512

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .1751702   .0734842     2.38   0.019       .029527    .3208134
          x2 |  -.1799849   .0850599    -2.12   0.037     -.3485709   -.0113989
         x28 |  -.1652647   .0676473    -2.44   0.016     -.2993395   -.0311899
         x45 |   .1769283   .0796461     2.22   0.028      .0190723    .3347843
          x5 |  -.2110569   .0799876    -2.64   0.010     -.3695897    -.052524
         x43 |  -.1780254    .072789    -2.45   0.016     -.3222908     -.03376
         x11 |  -.1965396   .0840563    -2.34   0.021     -.3631364   -.0299427
         x27 |  -.1471263   .0674265    -2.18   0.031     -.2807635   -.0134891
          x9 |  -.1795693    .080686    -2.23   0.028     -.3394864   -.0196522
         x41 |  -.1898003   .0862399    -2.20   0.030     -.3607249   -.0188757
       _cons |   3.815572   7.796542     0.49   0.626     -11.63692    19.26806
------------------------------------------------------------------------------


C. Create a new variable that is a combination of X1 and X2. With a
large number of independent variables, this would typically be done with
principal components analysis or factor analysis. These are methods for
finding commonalities in sets of variables and can be quite useful, but the
meaning of the new variable and of its coefficient will usually be pretty unclear.
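	A minimal sketch of this approach in Stata, using the variable names from the
	example above (the component name tenure_pc is mine, and this is an
	illustration rather than a recommendation from the notes):

	. pca age agesq yos yossq           // extract principal components of the collinear block
	. predict tenure_pc, score          // save the first component as a single index
	. reg grade edyrs male tenure_pc    // use the component in place of the four raw variables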
D. Get a bigger or better data set (one with more unexplained variation in
the independent variables). This leads to smaller standard errors for the
coefficients. When we repeat the original regression on the full opm94.dta data
set, the explanatory power of the model is somewhat lower (50.6% instead of
59.7%) and the Variance Inflation Factors (VIF) are comparable, but the standard
errors are now much smaller, so most of the independent variables are now
statistically significant and the 95% confidence intervals are much narrower.
. fit grade edyrs yos yossq age agesq male asian black hispanic indian
      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F( 10,   989) =  101.08
       Model |  5735.75006    10  573.575006           Prob > F      =  0.0000
    Residual |  5611.84594   989  5.67426283           R-squared     =  0.5055
-------------+------------------------------           Adj R-squared =  0.5005
       Total |   11347.596   999   11.358955           Root MSE      =  2.3821

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       edyrs |   .8049051   .0355829    22.62   0.000      .7350785    .8747318
         yos |   .1550753   .0347998     4.46   0.000      .0867853    .2233652
       yossq |  -.0008291   .0009535    -0.87   0.385     -.0027001     .001042
         age |   .1101846   .0566858     1.94   0.052     -.0010536    .2214228
       agesq |   -.001577   .0006092    -2.59   0.010     -.0027724   -.0003816
        male |   1.014296   .1604387     6.32   0.000      .6994565    1.329135
       asian |   .4366862    .442195     0.99   0.324     -.4310619    1.304434
       black |  -.9842708   .2062853    -4.77   0.000     -1.389078   -.5794636
    hispanic |  -.6466498   .3541598    -1.83   0.068     -1.341641    .0483411
      indian |    -.72856   .5871941    -1.24   0.215     -1.880849    .4237294
       _cons |  -5.959507   1.251647    -4.76   0.000     -8.415695   -3.503319
------------------------------------------------------------------------------

. vif
Variable |
VIF
1/VIF
-------------+---------------------age |
59.32
0.016858
agesq |
56.40
0.017731
yos |
16.81
0.059491
yossq |
15.51
0.064458
edyrs |
1.14
0.875635
male |
1.13
0.882270
black |
1.08
0.923593
asian |
1.04
0.966043
hispanic |
1.03
0.970810
indian |
1.02
0.984790
-------------+---------------------Mean VIF |
15.45

199

Lecture 12. HETEROSKEDASTICITY
1. A key assumption of the classical linear regression model is that the error term is
homoskedastic. That is, the variance of the error term is the same at all values of X. In
cross-sectional analysis, this assumption is frequently false, particularly if the
observations differ substantially in size. When the variance of the error terms is
different for different observations, the error is said to be heteroskedastic.
People with low incomes, for instance, tend to be much more restricted in their
spending patterns than high-income people, who will typically have greater
variation in their consumption and saving patterns. Large cities will tend to vary
more than small cities on a variety of measures.
2. Even in the face of heteroskedasticity, OLS regression yields unbiased estimators of
the population regression coefficients, but OLS estimators are not BLUE (Best Linear
Unbiased Estimators). In the presence of heteroskedasticity, there are alternative
unbiased estimators that have smaller standard errors.
3. The more important problem is that OLS yields biased estimators of the variances
and standard errors of the regression coefficients. With biased estimators of the
standard errors, neither the t-tests nor the confidence intervals can be trusted.
4. In the following Monte Carlo, we use 100 observations, where
(1) X ranges from 1 to 10,
(2) E(Y) = 10 + 3X, and
(3) Var(Y|X) = X².
. program define hetero100
  1.    version 6.0
  2.    if "`1'" == "?" {
  3.        global S_1 "a se_a b_ols se_b"
  4.        exit
  5.    }
  6.    drop _all
  7.    set obs 100
  8.    gen x = 1 + int((_n-1)/10)
  9.    gen y = 10 + 3*x + (invnorm(uniform())*x)
 10.    regress y x
 11.    post `1' _b[_cons] _se[_cons] _b[x] _se[x]
 12.    drop y
 13. end
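The program above uses the Stata 6 simul/post syntax. On newer versions of Stata, a rough equivalent sketch using the current simulate command might look like this (the program name hetero100m and the use of rnormal() in place of invnorm(uniform()) are my substitutions, not part of the original handout):

. program define hetero100m, rclass
        drop _all
        set obs 100
        gen x = 1 + int((_n-1)/10)
        gen y = 10 + 3*x + rnormal()*x    // error standard deviation equal to x
        regress y x
        return scalar a    = _b[_cons]
        return scalar se_a = _se[_cons]
        return scalar b    = _b[x]
        return scalar se_b = _se[x]
  end

. simulate a=r(a) se_a=r(se_a) b_ols=r(b) se_b=r(se_b), reps(10000): hetero100m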

In one simulation of this experiment, a scatterplot of Y and X shows a classic
pattern of heteroskedasticity: the expected value of Y rises as X rises, but the
spread of Y around its expected value fans out as X increases.

[Scatterplot from one simulation: y (axis ticks 0 to 60) plotted against x (0 to 10)]
5. We first run OLS 10,000 times. The mean value of the sample regression coefficient
is 2.999959 (consistent with OLS yielding unbiased estimators), and its standard error
is about .2342457. Unfortunately, we reject the true null hypothesis 7.19% of the time at
the .05 level; the 90% confidence interval only includes the population regression
coefficient 87.07% of the time; the 95% confidence interval, only 92.81% of the time;
and the 99% confidence interval, only 98.13% of the time.
. simul hetero100, reps(10000)

. gen tstar = (b-3)/se_b
. gen rejectt = abs(tstar)>1.9845
. gen lolimt90 = b - 1.6606*se_b
. gen uplimt90 = b + 1.6606*se_b
. gen widtht90 = uplimt90 - lolimt90
. gen int90 = (lolimt90 < 3 & 3 < uplimt90)
. gen lolimt95 = b - 1.9845*se_b
. gen uplimt95 = b + 1.9845*se_b
. gen widtht95 = uplimt95 - lolimt95
. gen int95 = (lolimt95 < 3 & 3 < uplimt95)
. gen lolimt99 = b - 2.6269*se_b
. gen uplimt99 = b + 2.6269*se_b
. gen widtht99 = uplimt99 - lolimt99
. gen int99 = (lolimt99 < 3 & 3 < uplimt99)
. sum

TABLE 1. MONTE CARLO USING OLS WITH HETEROSKEDASTIC ERROR

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+-------------------------------------------------------
           a |   10000    10.00553    .9104162   6.916804   13.54396
        se_a |   10000    1.332366     .124157   .8831902   1.841925
       b_ols |   10000    2.999959    .2342457   2.095613   3.799754
        se_b |   10000    .2147302    .0200097    .142339   .2968531
       tstar |   10000    .0007942    1.101198  -4.151646     4.1451
     rejectt |   10000       .0719    .2583352          0          1
    lolimt90 |   10000    2.643378    .2368226   1.733871   3.464789
    uplimt90 |   10000     3.35654    .2363585   2.457355   4.162679
    widtht90 |   10000     .713162    .0664563   .4727364   .9859085
       int90 |   10000       .8707    .3355485          0          1
    lolimt95 |   10000    2.573827    .2378636   1.663313   3.402322
    uplimt95 |   10000    3.426091    .2373112   2.527913   4.245828
    widtht95 |   10000    .8522643    .0794185   .5649433    1.17821
       int95 |   10000       .9281    .2583352          0          1
    lolimt99 |   10000    2.435884    .2404323   1.523373    3.27843
    uplimt99 |   10000    3.564034    .2397086   2.667852   4.410738
    widtht99 |   10000     1.12815     .105127   .7478204   1.559607
       int99 |   10000       .9813    .1354701          0          1
6. In the second experiment, we run OLS 10,000 times, but use robust standard errors.
The estimates of the y-intercept and regression coefficient are identical to the OLS
estimates and have the same true standard errors. However, the mean of the estimated
standard error of the sample regression coefficient is a bit larger (.2293816, instead of
.2147302) and inferential statistics work somewhat better. We reject the true null
hypothesis 5.89% of the time (instead of 7.19%) at the .05 level; the 90% confidence
interval includes the population regression coefficient 88.94% of the time (instead of
87.07%); the 95% confidence interval, 94.11% (instead of 92.81%) of the time; and the
99% confidence interval, 98.63% of the time (only slightly higher than 98.13%).
. program define het100rob
  1.    version 6.0
  2.    if "`1'" == "?" {
  3.        global S_1 "a se_a b_ols se_b"
  4.        exit
  5.    }
  6.    drop _all
  7.    set obs 100
  8.    gen x = 1 + int((_n-1)/10)
  9.    gen y = 10 + 3*x + (invnorm(uniform())*x)
 10.    regress y x, robust
 11.    post `1' _b[_cons] _se[_cons] _b[x] _se[x]
 12.    drop y
 13. end
. simul het100rob, reps(10000)
. sum

TABLE 2. MONTE CARLO USING OLS AND ROBUST STANDARD ERRORS WITH HETEROSKEDASTIC ERROR

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+-------------------------------------------------------
           a |   10000    10.00553    .9104162   6.916804   13.54396
        se_a |   10000     .898447     .106302   .5412205   1.363116
       b_ols |   10000    2.999959    .2342457   2.095613   3.799754
        se_b |   10000    .2293816    .0317044   .1194888   .3569466
       tstar |   10000      .00148    1.046484  -3.986354   3.920514
     rejectt |   10000       .0589    .2354492          0          1
    lolimt90 |   10000    2.619048    .2405877   1.708083   3.452858
    uplimt90 |   10000     3.38087    .2395898   2.483142   4.253922
    widtht90 |   10000     .761822    .1052965   .3968458   1.185491
       int90 |   10000       .8894    .3136518          0          1
    lolimt95 |   10000    2.544751    .2431377   1.632495   3.387671
    uplimt95 |   10000    3.455167    .2419572    2.55873   4.354867
    widtht95 |   10000    .9104154    .1258346   .4742508   1.416721
       int95 |   10000       .9411    .2354492          0          1
    lolimt99 |   10000    2.397397    .2493718   1.465004   3.259036
    uplimt99 |   10000    3.602521    .2478473   2.708645   4.555075
    widtht99 |   10000    1.205125    .1665684   .6277699   1.875326
       int99 |   10000       .9863    .1162483          0          1
7. In the third experiment, we run WLS (weighted least squares, explained below)
10,000 times. The WLS estimator appears unbiased (its mean value is 3.001162), but its
standard error is smaller, at about .1709413. The mean of the estimated standard error
of the sample regression coefficient is .1619884. The inferential statistics work
somewhat better. We reject the true null hypothesis 6.18% of the time at the .05 level;
the 90% confidence interval includes the population regression coefficient 88.39% of
the time; the 95% confidence interval, 93.82% of the time; and the 99% confidence
interval, 98.61% of the time. This is in spite of the fact that the confidence intervals are
much narrower than with OLS or OLS with robust standard errors.
. program define het100wls
  1.    version 6.0
  2.    if "`1'" == "?" {
  3.        global S_1 "a se_a b_wls se_b"
  4.        exit
  5.    }
  6.    drop _all
  7.    set obs 100
  8.    gen x = 1 + int((_n-1)/10)
  9.    gen invx = 1/x
 10.    gen y = 10 + 3*x + (invnorm(uniform())*x)
 11.    regress y x [weight = invx]
 12.    post `1' _b[_cons] _se[_cons] _b[x] _se[x]
 13.    drop y
 14. end

. simul het100wls, reps(10000)
. sum

TABLE 3. MONTE CARLO USING WEIGHTED LEAST SQUARES WITH HETEROSKEDASTIC ERROR

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+-------------------------------------------------------
           a |   10000    9.998913     .471167   8.102437    11.7595
        se_a |   10000     .701953    .0566042   .4867501   .9315131
       b_wls |   10000    3.001162    .1709413   2.334404   3.615796
        se_b |   10000    .1619884    .0130625   .1123264   .2149636
       tstar |   10000    .0082073      1.0641   -4.32574   3.882328
     rejectt |   10000       .0618     .240804          0          1
    lolimt90 |   10000    2.732164    .1725293   2.033539     3.3524
    uplimt90 |   10000     3.27016    .1720945   2.589919   3.879193
    widtht90 |   10000     .537996     .043383   .3730588   .7139373
       int90 |   10000       .8839    .3203608          0          1
    lolimt95 |   10000    2.679696    .1731543   1.974783   3.301024
    uplimt95 |   10000    3.322628    .1726365   2.639757   3.930568
    widtht95 |   10000    .6429321    .0518449   .4458237   .8531904
       int95 |   10000       .9382     .240804          0          1
    lolimt99 |   10000    2.575635    .1746908    1.85825    3.19913
    uplimt99 |   10000     3.42669     .174011   2.738602   4.045607
    widtht99 |   10000    .8510549    .0686276   .5901408   1.129376
       int99 |   10000       .9861    .1170819          0          1
Solution 1: Use Robust Standard Errors
8. One way to cope with heteroskedastic error terms is to use OLS regression but
correct the estimated standard errors so that they are no longer biased. Wooldridge
shows the formula. Stata makes it quite easy. At the end of the regression command,
you simply add a comma and the word "robust."
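For example (a generic sketch; y, x1, x2, and x3 are placeholders):

. regress y x1 x2 x3, robust
. regress y x1 x2 x3, vce(robust)    // equivalent syntax in newer versions of Stata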
In Table 4, we are trying to determine what factors influence how much federal
experience employees have. Clearly, age will be the most important factor, but
we expect that more educated employees tend to have less federal experience
than less educated employees of the same age, because better-educated
employees wait longer to enter the federal service.
There may also be race and sex differences in federal experience levels among
equally educated employees of the same age. Women might have less experience
than men of the same age due to time out of the labor market to raise children,
and minorities might have less federal experience than comparable whites
because of longer periods of unemployment.
. use "D:\statDATA\opm\opm94.dta"
TABLE 4. OLS REGRESSION
. reg yos age agesq edyrs asian black hispanic indian male
Source |
SS
df
MS
-------------+-----------------------------Model | 32780.5392
8
4097.5674
Residual | 45979.3608
991 46.3969332
-------------+-----------------------------Total |
78759.90
999 78.8387387

Number of obs
F( 8,
991)
Prob > F
R-squared
Adj R-squared
Root MSE

=
=
=
=
=
=

1000
88.32
0.0000
0.4162
0.4115
6.8115

-----------------------------------------------------------------------------yos |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------age |
1.285941
.1483667
8.67
0.000
.9947925
1.57709
agesq | -.0083291
.0016313
-5.11
0.000
-.0115302
-.0051279
edyrs | -.2288689
.1013174
-2.26
0.024
-.4276901
-.0300477
asian | -5.444608
1.251583
-4.35
0.000
-7.900665
-2.98855
black | -.0654091
.5896458
-0.11
0.912
-1.222507
1.091689
hispanic | -.0188062
1.012675
-0.02
0.985
-2.006039
1.968427
indian |
.2895073
1.675736
0.17
0.863
-2.998891
3.577906
male |
.5828604
.4583841
1.27
0.204
-.3166545
1.482375
_cons | -21.56324
3.485198
-6.19
0.000
-28.40245
-14.72402
-----------------------------------------------------------------------------. test
( 1)
( 2)
( 3)
( 4)

black hispanic indian male


black = 0.0
hispanic = 0.0
indian = 0.0
male = 0.0
F( 4,
991) =
0.43
Prob > F =
0.7881

Most of our expectations are upheld by the data. Federal experience rises with
age, but at a decreasing rate (peaking at about age 77), and each additional
year of education lowers expected federal experience by about one-quarter of a
year, holding age, race, and sex constant.
Surprisingly to me, Asians have five fewer expected years of service than whites of
the same age and educational level, a finding that is quite significant statistically.
On the other hand, there is no particular evidence that blacks, Hispanics, or
American Indians differ in experience levels from comparable whites, or that
women have less federal experience than comparably educated and aged men.
An F-test (F* = .43) is consistent with the null hypothesis that none of these race
or sex differences (aside from the Asian-white difference) exists in the
population, once age and education are controlled.
9. It is extremely likely that the error term is heteroskedastic in this case. There can be
very little variation in the federal experience levels of 20-year-old federal employees but
very substantial variation in federal experience levels of 60-year-old federal employees.
Although the OLS estimators for the regression coefficients are unbiased, we cannot
trust the t-statistics or confidence intervals because the estimators of the standard
errors are biased.
Therefore, we ask Stata to calculate robust standard errors. In Table 5, the
regression coefficients are identical to those in Table 4 but the robust standard
errors are all slightly different, meaning that the t-statistics and confidence
intervals also change. None of the changes is large. The t-statistic for age rises
the most, but all of the coefficients that were significant in the first model are also
significant in the second, and those that were insignificant in Table 4 remain
insignificant in Table 5.
TABLE 5. OLS REGRESSION WITH ROBUST STANDARD ERRORS

. reg yos age agesq edyrs asian black hispanic indian male, robust

Regression with robust standard errors                 Number of obs =    1000
                                                       F(  8,   991) =  118.90
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.4162
                                                       Root MSE      =  6.8115

------------------------------------------------------------------------------
             |               Robust
         yos |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.285941   .1301363     9.88   0.000     1.030567    1.541316
       agesq |  -.0083291   .0015295    -5.45   0.000    -.0113306   -.0053276
       edyrs |  -.2288689   .1048981    -2.18   0.029    -.4347168   -.0230211
       asian |  -5.444608    1.12463    -4.84   0.000    -7.651536   -3.237679
       black |  -.0654091    .587002    -0.11   0.911    -1.217319    1.086501
    hispanic |  -.0188062   .8995956    -0.02   0.983    -1.784137    1.746525
      indian |   .2895073    1.64156     0.18   0.860    -2.931825     3.51084
        male |   .5828604   .4626792     1.26   0.208     -.325083    1.490804
       _cons |  -21.56324   2.898479    -7.44   0.000     -27.2511   -15.87538
------------------------------------------------------------------------------
Notice that although the R² is the same in both tables, F* has risen from 88.32 in
Table 4 to 118.90 in Table 5. In other words, Stata is calculating
heteroskedasticity-robust F-statistics.
. test black hispanic indian male

 ( 1)  black = 0.0
 ( 2)  hispanic = 0.0
 ( 3)  indian = 0.0
 ( 4)  male = 0.0

       F(  4,   991) =    0.42
            Prob > F =    0.795
In addition, when I perform an F-test to see whether I can drop black,
hispanic, indian, and male, F* (.42) is slightly different from the F* reported
after Table 4 (.43). Again, Stata has calculated a heteroskedasticity-robust
F-statistic. You don't need to learn Wooldridge's formula to calculate a
heteroskedasticity-robust F-statistic, since Stata does it for you.
Testing for Heteroskedasticity
10. Probably the simplest test for heteroskedasticity is to graph the residuals against the
independent variable or variables that you suspect are responsible for the
heteroskedasticity. Typically, if heteroskedasticity is present, you will observe a fanning
out of the residuals as the independent variable increases or decreases.
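A sketch of how such a plot can be produced right after the regression in Table 4 (res is an illustrative variable name; scatter and rvpplot assume a reasonably recent version of Stata):

. predict res, resid     // save the residuals
. scatter res age        // plot them against the suspect independent variable
. rvpplot age            // or use the built-in residual-versus-predictor plot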

Residuals

17.0649

-20.042
79

19
age

206

11. Stata allows a simple, formal test for heteroskedasticity, which is a modified version
of the Breusch-Pagan test. You simply follow the regression output with the hettest
command.
This comes in two forms:
First, you can name the variables that you suspect are responsible for the
heteroskedasticity.
Second, you can run the hettest command without naming any variables,
in which case hettest uses the expected value of the dependent variable as
the independent variable.
After the regression in Table 4, I ran the following two commands. In both cases,
I am easily able to reject the null hypothesis that the error term has a constant
variance.
1. hettest age agesq edyrs asian black hispanic indian male

   Cook-Weisberg test for heteroskedasticity using variables specified
        Ho: Constant variance
        chi2(8)      =   189.58
        Prob > chi2  =   0.0000

2. hettest

   Cook-Weisberg test for heteroskedasticity using fitted values of yos
        Ho: Constant variance
        chi2(1)      =   177.99
        Prob > chi2  =   0.0000
To see what Stata is doing:

1. Run the OLS regression, as in Table 4.

2. Calculate the maximum likelihood estimator of the variance of the error term
(the sum of the squared residuals divided by the number of observations, rather
than by the number of degrees of freedom). In this case, SSE (the error or
residual sum of squares) is 45979.3608, so the maximum likelihood estimator of
the variance is 45979.3608/1000 = 45.9793608.

3. (a) Save the residuals from the Table 4 regression, (b) square them, and
(c) divide them by the maximum likelihood estimator of the variance.
. predict res, res
. gen ressq = res^2
. gen bpressq = ressq/45.9793608
. predict yoshat          [note: we will use this one later]

4. Regress this standardized squared residual on all the independent variables.
TABLE 6. EXAMPLE OF BREUSCH-PAGAN TEST FOR HETEROSKEDASTICITY

. reg bpressq age agesq edyrs asian black hispanic indian male

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  8,   991) =   31.76
       Model |  379.160711     8  47.3950889           Prob > F      =  0.0000
    Residual |  1478.88219   991    1.492313           R-squared     =  0.2041
-------------+------------------------------           Adj R-squared =  0.1976
       Total |   1858.0429   999   1.8599028           Root MSE      =  1.2216

------------------------------------------------------------------------------
     bpressq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0888103   .0266086     3.34   0.001     .0365947     .141026
       agesq |  -.0003261   .0002926    -1.11   0.265    -.0009002     .000248
       edyrs |  -.0617095   .0181706    -3.40   0.001    -.0973668   -.0260522
       asian |  -.0881263   .2244631    -0.39   0.695    -.5286039    .3523513
       black |   .0978591   .1057491     0.93   0.355    -.1096587    .3053769
    hispanic |  -.0695503   .1816165    -0.38   0.702    -.4259473    .2868467
      indian |   .0592777   .3005321     0.20   0.844    -.5304747    .6490301
        male |   .0596678   .0822082     0.73   0.468    -.1016542    .2209898
       _cons |  -1.393309   .6250471    -2.23   0.026    -2.619876   -.1667407
------------------------------------------------------------------------------
5. If the null hypothesis of homoskedastic errors is true, SSR (the regression
or model sum of squares) divided by 2 will have (asymptotically) a chi-square
distribution with k degrees of freedom (the number of independent variables in
this regression).

     379.160711 / 2 = 189.5803555     (the same value as from the first hettest)

Since the critical value of chi-square with 8 degrees of freedom at the .01 level
is 20.09, we confidently reject the null hypothesis of homoskedasticity. The
prob-value of the hettest (.0000) indicates that we can be even more confident
than that.
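The arithmetic and the corresponding p-value can also be checked directly in Stata (a sketch; the scalar name bp is illustrative):

. scalar bp = 379.160711/2
. display bp                   // 189.58...
. display chi2tail(8, bp)      // upper-tail p-value, effectively zero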
Alternative steps 4 and 5:

4A. Save the predicted value of the dependent variable from the regression in
Table 4. Regress bpressq on the expected value of the dependent variable.

In a bivariate regression, y-hat is a linear transformation of x, so regressing
bpressq on either x or y-hat will yield the same SSR, F*, R², etc. In multiple
regression, y-hat is a function of all the x's and may be able to capture most
of the effect of the x's on the variances of the error terms.
TABLE 7. EXAMPLE OF BREUSCH-PAGAN TEST FOR HETEROSKEDASTICITY, VARIATION 1

. reg bpressq yoshat

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  1,   998) =  236.53
       Model |  355.989682     1  355.989682           Prob > F      =  0.0000
    Residual |  1502.05322   998  1.50506334           R-squared     =  0.1916
-------------+------------------------------           Adj R-squared =  0.1908
       Total |   1858.0429   999   1.8599028           Root MSE      =  1.2268

------------------------------------------------------------------------------
     bpressq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      yoshat |   .1042103   .0067759    15.38   0.000     .0909136     .117507
       _cons |  -.5433546   .1075896    -5.05   0.000    -.7544823   -.3322269
------------------------------------------------------------------------------
5A. If the null hypothesis of homoskedastic errors is true, SSR divided by 2
will have (asymptotically) a chi-square distribution with 1 degree of freedom
(the number of independent variables in this regression).

     355.989682 / 2 = 177.994841     (the same value as from the second hettest)

Since the critical value of chi-square with 1 degree of freedom at the .01 level
is 6.63, we confidently reject the null hypothesis of homoskedasticity. The
prob-value of the second hettest (.0000) indicates that we can be even more
confident than that.
F-Test
12. Wooldridge notes that you can also do an F-test of the null hypothesis that the
independent variables are not related to the variances of the error terms in the
population. His formula yields the same F* as is reported in these regressions. There is
no need to calculate F* by hand. Since the independent variables or the expected value
of the dependent variable explains about 20% of the variation in the variances, it is not
surprising that both regressions yield F* values (31.76, 236.53) that are highly
significant. There is little question that heteroskedasticity exists in this example.
Solution 2: Use Weighted Least Squares
13. First, let me show that OLS does not yield BLUE estimators.
Assume the bivariate case, where Yi = β0 + β1Xi + εi and E(εi) = 0 but
var(εi) = E(εi²) = σi².

Set wi = 1/σi, Yi* = Yi/σi, Xi* = Xi/σi, and εi* = εi/σi.

Then Yi/σi = β0/σi + β1Xi/σi + εi/σi, so

     Yi* = β0 wi + β1 Xi* + εi*,    or equivalently    wi Yi = wi β0 + wi β1 Xi + wi εi.

Now εi* meets the classical assumptions:

     E(εi*) = 0;
     cov(εi*, εj*) = 0; and
     var(εi*) = var(εi/σi) = var(εi)/σi² = 1.

Therefore, OLS on the transformed equation is BLUE (and OLS on the untransformed
equation is not BLUE). This is known as Weighted Least Squares (WLS) or
Generalized Least Squares (GLS).

The normal equations become

     Σ wi yi    = b0* Σ wi    + b1* Σ wi xi
     Σ wi yi xi = b0* Σ wi xi + b1* Σ wi xi²

We can also reorganize to solve for the sample regression coefficients:

     b0* Σ wi = Σ wi yi - b1* Σ wi xi
     b0*      = Σ wi yi / Σ wi - b1* Σ wi xi / Σ wi
              = y-tilde - b1* x-tilde
              = (weighted mean of y) - b1* (weighted mean of x)

We substitute this value for b0* into the second normal equation and get:

     b1* = Σ wi (xi - x-tilde)(yi - y-tilde) / Σ wi (xi - x-tilde)²
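To make the algebra concrete, here is a sketch of the transformed-equation approach applied to the Monte Carlo design earlier in this lecture, where the error standard deviation equals X (variable names are illustrative). Dividing Y = β0 + β1X + ε through by σi = x gives Y/x = β1 + β0(1/x) + ε*, so in the transformed regression the constant estimates the original slope and the coefficient on 1/x estimates the original intercept:

. gen ystar = y/x
. gen invx = 1/x
. regress ystar invx     // _cons estimates beta1; the coefficient on invx estimates beta0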

14. In order to use WLS or GLS, we need to know σi². In the Monte Carlo
experiment at the beginning of this lecture, I knew that the standard deviation
of the error term equaled X (because I created it to be that), so I weighted by
the inverse of X. The variance of the WLS estimator was smaller than the
variance of the OLS estimator.
15. In practice, we won't know what the variance of the error term is, but if we
can estimate it reasonably well, we can use Feasible or Estimated Generalized
Least Squares (FGLS or EGLS), which is consistent and asymptotically efficient
(though not exactly unbiased in finite samples).
16. In some cases, we will be confident that the error variance is a function of one
independent variable, and we may try weighting by the inverse (or some other
transformation) of that variable. In this case, we expect the variance of the error term to
rise with age. We therefore want to weight the observations for young employees more
heavily than the observations for older employees, since the variances will be smaller for
the former and the observed values will tend to be closer to the expected values for these
employees. One way to do that is to calculate the inverse of age and weight by that.

TABLE 8. EXAMPLE OF ESTIMATED WEIGHTED LEAST SQUARES REGRESSION

. gen invage = 1/age

. reg yos age agesq edyrs asian black hispanic indian male [weight=invage]
(analytic weights assumed)
(sum of wgt is   2.4135e+01)

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  8,   991) =  107.87
       Model |  34192.6652     8  4274.08315           Prob > F      =  0.0000
    Residual |  39266.9573   991  39.6235694           R-squared     =  0.4655
-------------+------------------------------           Adj R-squared =  0.4611
       Total |  73459.6225   999  73.5331556           Root MSE      =  6.2947

------------------------------------------------------------------------------
         yos |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.121714   .1372766     8.17   0.000     .8523279      1.3911
       agesq |  -.0064797   .0015731    -4.12   0.000    -.0095666   -.0033928
       edyrs |  -.3014223   .0946211    -3.19   0.001     -.487103   -.1157416
       asian |  -4.524488   1.140007    -3.97   0.000    -6.761592   -2.287383
       black |  -.0676801   .5333632    -0.13   0.899    -1.114331    .9789708
    hispanic |  -.0509101   .9059695    -0.06   0.955    -1.828749    1.726929
      indian |  -.4263532   1.505922    -0.28   0.777    -3.381516    2.528809
        male |   .4515991   .4196691     1.08   0.282     -.371943    1.275141
       _cons |  -17.01646   3.090109    -5.51   0.000    -23.08036   -10.95255
------------------------------------------------------------------------------

. test black hispanic indian male

 ( 1)  black = 0.0
 ( 2)  hispanic = 0.0
 ( 3)  indian = 0.0
 ( 4)  male = 0.0

       F(  4,   991) =    0.32
            Prob > F =    0.8629
17. The results are pretty similar to OLS. The significant coefficients have the same
signs and comparable sizes. There is clearer evidence that more educated employees
have less federal experience, on average, than less educated employees of the same age,
race, and sex. Otherwise, the statistical significance of the coefficients has not changed.
We still fail to reject the null that there are no sex differences in federal experience (and
no race differences other than the Asian-white difference), holding age and education
constant.
18. In the more general case, we will probably want to estimate the error variances. Our
estimator of the error variance will generally be the squared residual, but the problem
with regressing the squared residuals on the independent variables is that the regression
may easily yield some negative expected values, which make no sense when our
dependent variable is the variance (or the squared residual). One way around this is to:
(a) use the natural logarithm of the squared residual as our dependent variable,
(b) save the expected values from the regression,
(c) convert the expected values into expected variances using the exponential
function, and
(d) weight the regression by the inverse of the square root of the expected
variance. (Use [weight = weighting variable].)
TABLE 9. EXAMPLE OF ESTIMATED WEIGHTED LEAST SQUARES REGRESSION, SEVERAL
VARIABLES RESPONSIBLE FOR HETEROSKEDASTICITY: STEP ONE

. gen lressq = ln(ressq)

. reg lressq age agesq edyrs asian black hispanic indian male

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  8,   991) =   27.95
       Model |  849.923883     8  106.240485           Prob > F      =  0.0000
    Residual |  3767.11578   991  3.80132773           R-squared     =  0.1841
-------------+------------------------------           Adj R-squared =  0.1775
       Total |  4617.03966   999  4.62166133           Root MSE      =  1.9497

------------------------------------------------------------------------------
      lressq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .2676579   .0424678     6.30   0.000     .1843209     .350995
       agesq |  -.0020077   .0004669    -4.30   0.000     -.002924   -.0010914
       edyrs |  -.0400239   .0290006    -1.38   0.168    -.0969335    .0168857
       asian |   .1503519   .3582471     0.42   0.675    -.5526581    .8533619
       black |   .0435915   .1687774     0.26   0.796    -.2876106    .3747935
    hispanic |   .4331816   .2898631     1.49   0.135    -.1356343    1.001998
      indian |  -.0103776   .4796546    -0.02   0.983    -.9516328    .9308777
        male |  -.1198815   .1312057    -0.91   0.361    -.3773543    .1375913
       _cons |  -4.522415   .9975862    -4.53   0.000    -6.480039   -2.564791
------------------------------------------------------------------------------

. predict lreshat
(option xb assumed; fitted values)
TABLE 10. SECOND EXAMPLE OF ESTIMATED WEIGHTED LEAST SQUARES REGRESSION: STEP TWO

. gen reswt = 1/sqrt(exp(lreshat))

. reg yos age agesq edyrs asian black hispanic indian male [weight=reswt]
(analytic weights assumed)
(sum of wgt is   3.1445e+02)

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  8,   991) =  132.85
       Model |  35075.8752     8   4384.4844           Prob > F      =  0.0000
    Residual |  32707.3501   991  33.0043896           R-squared     =  0.5175
-------------+------------------------------           Adj R-squared =  0.5136
       Total |  67783.2253   999  67.8510763           Root MSE      =  5.7449

------------------------------------------------------------------------------
         yos |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.004368   .1204562     8.34   0.000     .7679891    1.240746
       agesq |  -.0051254   .0014179    -3.61   0.000    -.0079079   -.0023429
       edyrs |  -.3582387   .0869424    -4.12   0.000     -.528851   -.1876264
       asian |  -3.540333   1.048392    -3.38   0.001    -5.597657   -1.483009
       black |  -.0492726   .4808413    -0.10   0.918    -.9928567    .8943116
    hispanic |  -.1407248   .8765589    -0.16   0.872    -1.860849      1.5794
      indian |  -.8058337   1.319386    -0.61   0.541    -3.394946    1.783278
        male |   .3899457   .3809188     1.02   0.306    -.3575544    1.137446
       _cons |  -13.81212   2.645322    -5.22   0.000     -19.0032   -8.621046
------------------------------------------------------------------------------

. test black hispanic indian male

 ( 1)  black = 0.0
 ( 2)  hispanic = 0.0
 ( 3)  indian = 0.0
 ( 4)  male = 0.0

       F(  4,   991) =    0.36
            Prob > F =    0.8382
19. In this case, the error variance is related almost strictly to age. (An F-test of the
education, race, and sex variables [not shown] is not statistically significant.) The final
model is pretty similar to Table 4, except that we find a substantially stronger negative
effect of education on federal experience (an additional year of education lowers
expected federal experience by .36 year instead of .23 year, holding the other variables
constant) and that the Asian-white difference in expected experience levels among
equally educated employees of the same age is 3.5 instead of 5.4 years. This Asian-white difference is nearly as significant as in Table 4, while the effect of education is
more clearly significant. We still fail to reject the null that there are no sex differences
in federal experience (and no race differences other than the Asian-white difference),
holding age and education constant.
Conclusion
20. Heteroskedasticity introduces two problems for OLS. OLS estimators of the
regression coefficients are not the most efficient linear estimators (as they are under the
classical linear regression model), and OLS estimators of the variances of the regression
coefficients are biased.
21. Biased estimators of the variances of the regression coefficients are generally a more
serious problem, since they invalidate the logic of the t- and F-tests and of the
confidence intervals. The easiest way to deal with this problem is by using
heteroskedasticity-robust standard errors, an easy option with Stata.
22. Using robust standard errors is a low-cost strategy and probably is worth doing any
time you suspect you may have heteroskedastic errors. Estimated weighted least
squares (feasible generalized least squares) is more trouble and may yield worse
estimators than OLS if your estimates of the weights are not very good, so it's more
important to be sure you have a problem with heteroskedasticity before trying WLS.
23. The simplest test for heteroskedasticity is a visual test, but Stata allows a simple,
formal test (hettest) that is a variation on the Breusch-Pagan test.
24. If you have serious problems with heteroskedasticity, it may make sense to try to
correct for it. First, you need to model the sources of the heteroskedasticity. Then, you
either divide through by the estimated standard deviations of the error terms or use the
weight command.
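As a compact recap, the estimated-weights sequence used in Tables 9 and 10 (with the opm94 variable names) boils down to the following commands:

. regress yos age agesq edyrs asian black hispanic indian male
. predict res, resid
. gen ressq = res^2
. gen lressq = ln(ressq)
. regress lressq age agesq edyrs asian black hispanic indian male
. predict lreshat
. gen reswt = 1/sqrt(exp(lreshat))
. regress yos age agesq edyrs asian black hispanic indian male [weight=reswt]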
