

Session 11

Multiple Linear Regression



Multiple Linear Regression
The Stepwise Method
Practical session 11
Additional practical session 11

SESSION 11: Multiple Linear Regression


Multiple Linear Regression

Simple Linear Regression explains one Dependent variable in terms of a
single Independent (or Explanatory) variable. When we want to 'explain' a
Dependent variable in terms of two or more Independent variables, we use
Multiple Linear Regression.

Just as in Simple Linear Regression, Least Squares is used to estimate the
Coefficients (the Constant and the Bs) of the Independent variables in the
now more general equation:

Dependent Variable = Constant (or B0) + B1 * Independent Variable 1
                                       + B2 * Independent Variable 2 + ...

Using the GSS91t data (H:\My Documents\spss data\Gss91t.sav), we will
investigate the effect of the respondent's age (AGE), sex (SEX), education
(EDUC) and spouse's education (SPEDUC) on the Occupational Prestige
score (PRESTG80). Firstly, we will produce Scatter Plots and Correlations of
the numerical variables (i.e. not SEX).



Figures 11.1(i) to 11.1(iv): Scatter Plots and Correlations of PRESTG80, EDUC,
SPEDUC and AGE
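
If you prefer to work in a Syntax window rather than through the menus, commands
along the following lines (a sketch, not taken verbatim from the session) produce
equivalent output for the Gss91t.sav variables:

    * Correlations between the numerical variables (SEX is excluded).
    CORRELATIONS
      /VARIABLES=prestg80 educ speduc age
      /PRINT=TWOTAIL NOSIG.

    * One of the Scatter Plots; repeat for the other pairs of variables.
    GRAPH
      /SCATTERPLOT(BIVAR)=educ WITH prestg80.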


We cannot see any unusual patterns in the Scatter Plots that would indicate
relationships other than Linear ones might be present. The Correlations
indicate that there are significant Linear Relationships between Occupational
Prestige and the two education variables, but not age. However, there are
also significant Correlations between what will be our three continuous
Independent variables (EDUC, SPEDUC and AGE). How will this affect our
Multiple Regression?

We follow the same procedure as Simple Linear Regression; we select:

Analyze
Regression
Linear.

In the Linear Regression dialogue box, we choose PRESTG80 as our
Dependent variable, and EDUC, SPEDUC, AGE and SEX (not a continuous
variable, but, as it is a binary variable, we can use it if we interpret the results
with care) as the Independent variables. At present, we do not change any
of the options from their default settings, and we click on OK. As with Simple
Linear Regression, four tables are sent to the Output Viewer. These are
seen below in Figures 11.2(i) to 11.2(iv).
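
The same model can also be fitted from a Syntax window; a minimal sketch, assuming
the default Enter method and the variables chosen above, is:

    REGRESSION
      /STATISTICS COEFF OUTS R ANOVA
      /DEPENDENT prestg80
      /METHOD=ENTER educ speduc age sex.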




Variables Entered/Removed(b)

  Model   Variables Entered                           Variables Removed   Method
  1       sex Respondent's Sex, speduc Highest Year   .                   Enter
          School Completed, Spouse, age Age of
          Respondent, educ Highest Year of School
          Completed(a)

  a. All requested variables entered.
  b. Dependent Variable: prestg80 R's Occupational Prestige Score (1980)


Figure 11.2(i)



Model Summary

  Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
  1       .576(a)   .332       .328                10.755

  a. Predictors: (Constant), sex Respondent's Sex, speduc Highest Year School
     Completed, Spouse, age Age of Respondent, educ Highest Year of School Completed


Figure 11.2(ii)

ANOVA(b)

  Model 1        Sum of Squares   df    Mean Square   F        Sig.
  Regression         43186.164      4     10796.541   93.341   .000(a)
  Residual           86982.092    752       115.668
  Total             130168.256    756

  a. Predictors: (Constant), sex Respondent's Sex, speduc Highest Year School
     Completed, Spouse, age Age of Respondent, educ Highest Year of School Completed
  b. Dependent Variable: prestg80 R's Occupational Prestige Score (1980)


Figure 11.2(iii)

Coefficients(a)

                                           Unstandardized Coefficients   Standardized
  Model 1                                  B         Std. Error          Beta           t         Sig.
  (Constant)                                5.307    3.001                               1.769    .077
  educ Highest Year of School Completed     2.477     .170               .554           14.562    .000
  speduc Highest Year School                 .236     .164               .054            1.441    .150
    Completed, Spouse
  age Age of Respondent                      .123     .027               .145            4.643    .000
  sex Respondent's Sex                     -1.646     .789              -.063           -2.086    .037

  a. Dependent Variable: prestg80 R's Occupational Prestige Score (1980)


Figure 11.2(iv)

The first table, Variables Entered/Removed, simply lists the Explanatory
variables that have been entered into the model. Since we used the default
method of Enter, all the variables in our Independent list have been inserted,
and none have been removed.

The Model Summary table tells us how well we are explaining the
Dependent variable, PRESTG80, in terms of the variables we have entered
into the model; the figures here are sometimes called the Goodness of Fit
statistics.

The figure in the column headed R is the absolute value of the Correlation
Coefficient between the values predicted by the current model and the
observed values of the Dependent variable.

We can use the fourth table, Coefficients, to write out the equation for
PRESTG80 in terms of a constant, EDUC, SPEDUC, AGE and SEX, just as
we did for Simple Linear Regression in Session 10. If we then estimate
PRESTG80 from this equation for each respondent, using their values of the
Explanatory variables, and perform a Correlation calculation between these
Predicted values and the actual observed values of PRESTG80, we will have
a measure of how strongly related they are. This is the value in column R of
the Model Summary table.

Simply put, the closer this value is to 1, the more accurate our prediction; if
it is close to zero, our model is not explaining what is happening very well.

The figure in the column headed R Square is the proportion of variability in
the Dependent variable that can be explained by changes in the values of the
Independent variables. This can also be calculated from the figures in the
third table, ANOVA (the Analysis Of Variance); it is the Regression Sum of
Squares divided by the Total Sum of Squares. The higher this proportion,
the better the model is fitting to the data.
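
As a check, using the figures in Figures 11.2(iii) and 11.2(ii):

    R Square = Regression Sum of Squares / Total Sum of Squares
             = 43186.164 / 130168.256
             = 0.332 (to three decimal places)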

The ANOVA table also indicates whether there is a significant Linear
Relationship between the Dependent variable and the combination of the
Explanatory variables; an F-Test is used to test the Null Hypothesis that
there is no Linear Relationship. We can see in our example that, with a
Significance value (Sig.) of less than 0.05, we have evidence that there is a
significant Linear Relationship.
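
The F statistic itself is the ratio of the two Mean Squares in Figure 11.2(iii),

    F = 10796.541 / 115.668 = 93.34 (the 93.341 in the table, up to rounding),

which SPSS compares with the F distribution on 4 and 752 degrees of freedom to
obtain the Significance value.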

In the fourth table, Coefficients, we have the figures that will be used in our
equation. All four Explanatory variables have been entered, but should they
all be there? Looking at the final two columns, headed t and Sig., we can see
that the Significance level for the variable SPEDUC is more than 0.05. This
indicates that, when the other variables (a constant, EDUC, AGE and SEX)
are used to explain the variability in PRESTG80, using SPEDUC as well
doesn't help to explain it any better; the Coefficient of SPEDUC is not
significantly different from zero. It is not needed in the model.

Recall that, when we looked at the Correlation Coefficients before fitting this
model, EDUC and SPEDUC were both significantly Correlated with
PRESTG80, but EDUC had the stronger relationship (0.520 compared to
0.355). In addition, the Correlation between EDUC and SPEDUC themselves,
0.619, was stronger still. We should not be surprised, therefore, that the
Multiple Linear Regression indicates that once EDUC is used to explain
PRESTG80, there is no need to use SPEDUC as well.

On the other hand, AGE was not significantly Correlated with PRESTG80, but
was significantly Correlated with both education variables. We find that it
appears as a significant effect when combined with these variables in the
Multiple Linear Regression.


The Stepwise Method

We now want to remove the insignificant variable SPEDUC, as its presence in
the model affects the coefficients of the other variables. There are several
ways of doing this, using the various different Methods.

The effect of using the various Methods for fitting a model is summarised as
follows:

Method Effect

Enter Enters all variables in the list in a single step.

Forward Enters the variables in the list one by one (the
order determined by the significance in the
model) until no more can be entered.

Backward Enters all the variables in the list in a single
step, then removes the insignificant variables
one by one (the order determined by the
significance in the model) until no more can be
removed.

Stepwise A combination of the Forward and Backward
procedures.

Remove Removes all the variables in the list from the
model. A model using one of the above
procedures must be fitted first (in Block 1)
before this method is used in a later Block.

The Stepwise Method is a very useful way of fitting the model, since most of
the work is done for you. We will now use this Method to perform the Multiple
Linear Regression.

We recall the Linear Regression dialogue box, and change the Method to
Stepwise by clicking on the down arrow next to the Method area to see the
list of options (Figure 11.3). Click on Stepwise to make the change, then
click OK.


Figure 11.3
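
A rough Syntax-window equivalent is sketched below; the CRITERIA subcommand
spells out the default entry and removal thresholds discussed below:

    REGRESSION
      /STATISTICS COEFF OUTS R ANOVA
      /CRITERIA=PIN(.05) POUT(.10)
      /DEPENDENT prestg80
      /METHOD=STEPWISE educ speduc age sex.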

We now have five tables sent to the Output Viewer (Figures 11.4 (i) to
11.4(v)), with much more information in each than was the case when the
Enter option was used.
Variables Entered/Removed(a)

  Model   Variables Entered             Variables Removed   Method
  1       educ Highest Year of School   .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050,
          Completed                                         Probability-of-F-to-remove >= .100).
  2       age Age of Respondent         .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050,
                                                            Probability-of-F-to-remove >= .100).
  3       sex Respondent's Sex          .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050,
                                                            Probability-of-F-to-remove >= .100).

  a. Dependent Variable: prestg80 R's Occupational Prestige Score (1980)


Figure 11.4(i)

Each table has three sets of output, one for each model that was fitted. The
Variables Entered/Removed table tells us that at the first stage the variable
EDUC was entered into the model; then AGE was added; finally, SEX was
added to the previous two. At no point was any variable removed because it
had become insignificant in the model, and SPEDUC was never found to
satisfy the criteria for inclusion in the model.

The inclusion threshold is set at 0.05 (5%) and the removal threshold at 0.1
(10%); these can be changed if required by clicking on the Options button in
the Linear Regression dialogue box.

Model Summary

  Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
  1       .553(a)   .306       .305                10.940
  2       .571(b)   .326       .324                10.786
  3       .574(c)   .330       .327                10.763

  a. Predictors: (Constant), educ Highest Year of School Completed
  b. Predictors: (Constant), educ Highest Year of School Completed, age Age of Respondent
  c. Predictors: (Constant), educ Highest Year of School Completed, age Age of Respondent,
     sex Respondent's Sex


Figure 11.4(ii)

In the Model Summary table, we can see how R and R Square increase with
each adjustment to the variables included in the model, indicating a better fit.
The ANOVA table also shows that the combination of variables in each model
has a significant Linear Relationship with PRESTG80.

ANOVA(d)

  Model              Sum of Squares   df    Mean Square   F         Sig.
  1   Regression         39799.799      1     39799.799   332.515   .000(a)
      Residual           90368.457    755       119.693
      Total             130168.256    756
  2   Regression         42455.767      2     21227.884   182.481   .000(b)
      Residual           87712.489    754       116.330
      Total             130168.256    756
  3   Regression         42945.906      3     14315.302   123.586   .000(c)
      Residual           87222.350    753       115.833
      Total             130168.256    756

  a. Predictors: (Constant), educ Highest Year of School Completed
  b. Predictors: (Constant), educ Highest Year of School Completed, age Age of Respondent
  c. Predictors: (Constant), educ Highest Year of School Completed, age Age of Respondent,
     sex Respondent's Sex
  d. Dependent Variable: prestg80 R's Occupational Prestige Score (1980)


Figure 11.4(iii)
Coefficients(a)

                                                 Unstandardized Coefficients   Standardized
  Model                                          B         Std. Error          Beta           t         Sig.
  1   (Constant)                                 11.585    1.829                               6.335    .000
      educ Highest Year of School Completed       2.474     .136               .553           18.235    .000
  2   (Constant)                                  3.627    2.454                               1.478    .140
      educ Highest Year of School Completed       2.639     .138               .590           19.104    .000
      age Age of Respondent                        .125     .026               .148            4.778    .000
  3   (Constant)                                  6.646    2.855                               2.328    .020
      educ Highest Year of School Completed       2.620     .138               .586           18.963    .000
      age Age of Respondent                        .119     .026               .140            4.511    .000
      sex Respondent's Sex                        -1.624    .789              -.062           -2.057    .040

  a. Dependent Variable: prestg80 R's Occupational Prestige Score (1980)


Figure 11.4(iv)

Excluded Variables(d)

                                                                          Partial       Collinearity Statistics
  Model                                      Beta In    t        Sig.     Correlation   Tolerance
  1   speduc Highest Year School              .032(a)    .846    .398      .031          .629
        Completed, Spouse
      age Age of Respondent                   .148(a)   4.778    .000      .171          .938
      sex Respondent's Sex                   -.078(a)  -2.578    .010     -.093          .999
  2   speduc Highest Year School              .053(b)   1.398    .163      .051          .622
        Completed, Spouse
      sex Respondent's Sex                   -.062(b)  -2.057    .040     -.075          .984
  3   speduc Highest Year School              .054(c)   1.441    .150      .052          .621
        Completed, Spouse

  a. Predictors in the Model: (Constant), educ Highest Year of School Completed
  b. Predictors in the Model: (Constant), educ Highest Year of School Completed, age Age
     of Respondent
  c. Predictors in the Model: (Constant), educ Highest Year of School Completed, age Age
     of Respondent, sex Respondent's Sex
  d. Dependent Variable: prestg80 R's Occupational Prestige Score (1980)


Figure 11.4(v)

Looking at the table of Excluded Variables, we can see how, at each stage,
the next variable to be included in the model is selected. At the first model
stage, it is AGE that is most significant, and so in the Coefficients table we
find AGE in the second model stage. Both EDUC and AGE remain significant
in the model, so no variables are chosen for removal. This leaves SPEDUC
and SEX as the excluded variables at the second model stage, and now SEX
is the more significant of the two; it is inserted in the third model stage. Once
again, all variables in the model remain significant, but the excluded variable,
SPEDUC, has not satisfied the 5% inclusion criterion (the significance is 0.150)
and so SPSS exits the procedure, having found the best fitting model from
these variables.

Our final model is therefore displayed as Model 3 in the Coefficients table.
We translate these figures into the following equation:

PRESTG80 = 6.646 + (2.62 * EDUC) + (0.119 * AGE) - (1.624 * SEX)

So, for example, for a male (SEX equal to 1) aged 40 with 12 years of
education, we estimate the Occupational Prestige score as:

6.646 + (2.62 * 12) + (0.119 * 40) - (1.624 * 1) = 41.222

and for a similar female:

6.646 + (2.62 * 12) + (0.119 * 40) - (1.624 * 2) = 39.598
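
SPSS can store these fitted values for every case itself (the Save button in the
Linear Regression dialogue box offers Unstandardized Predicted Values), or they can
be computed by hand in a Syntax window. A sketch, using the Model 3 coefficients
(PRED3 is our own name for the new variable, not one created by SPSS):

    * Fitted Occupational Prestige score from Model 3.
    COMPUTE pred3 = 6.646 + 2.620*educ + 0.119*age - 1.624*sex.
    EXECUTE.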


In general, for this model:

Females score, on average, 1.624 points lower than males of the same
age and years of education.

For two people of the same gender and the same years of education, the
older person scores, on average, 0.119 points more on the Prestige scale
for each extra year of age.

For two people of the same gender and the same age, the person with
more years of education scores, on average, 2.62 points more on the
Prestige scale for each extra year of education.


Practical session 11

Open the STATLAB data (H:\My Documents\spss data\Statlaba.sav)

1. A Child's Weight and Physical Characteristics

Use the methods from this session to investigate the relationship between the
weight of the child at age 10 (CTW) and some physical characteristics (a possible
syntax starting point is sketched after the variable list):

CBW child's weight at birth
CTH child's height at age 10
SEX child's gender (coded 1 for girls, 2 for boys).
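
One possible starting point, if you wish to work from a Syntax window (a sketch
only; you may equally well use the dialogue boxes):

    REGRESSION
      /CRITERIA=PIN(.05) POUT(.10)
      /DEPENDENT ctw
      /METHOD=STEPWISE cbw cth sex.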

2. A Child's Weight and Hereditary Characteristics

Repeat Question 1, but instead of CBW and CTH use the following
explanatory variables:

FTH Father's height
FTW Father's weight
MTH Mother's height
MTW Mother's weight
SEX child's gender (coded 1 for girls, 2 for boys).


Save your output as exer11.spo


Additional practical session 11

Open the GSS91t data (H:\My Documents\spss data\Gss91t.sav)

1. Educational Relationships

a) Investigate the Linear Relationships between the following variables using
Correlations:

EDUC Education of respondent
MAEDUC Education of respondent's mother
PAEDUC Education of respondent's father
SPEDUC Education of respondent's spouse

b) Using Linear Regression, investigate the influence of education and
parental education on the choice of marriage partner (Dependent variable
SPEDUC). Use the variable SEX to distinguish between any gender effects.

c) It is thought that the size of the family might affect educational attainment.
Investigate this using EDUC and SIBS (the number of siblings) in a Linear
Regression.

d) Also investigate whether the education of the parents (MAEDUC and
PAEDUC) affects the family size (SIBS).

e) How does the result of d) influence your interpretation of c)? Are you
perhaps finding a spurious effect? Test whether SIBS still has a significant
effect on EDUC when MAEDUC and PAEDUC are included in the model.

2. Average Years of Education

Compute a new variable PARED = (MAEDUC + PAEDUC) / 2, being the
average years of education of the parents. By including PARED, MAEDUC
and PAEDUC in a Stepwise Linear Regression, investigate which is the better
predictor of EDUC: the separate measures or the combined measure.
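
A sketch of possible syntax for this question (PARED is the variable name suggested
above; the Stepwise criteria shown are the defaults):

    * Average years of parental education.
    COMPUTE pared = (maeduc + paeduc) / 2.
    EXECUTE.

    REGRESSION
      /CRITERIA=PIN(.05) POUT(.10)
      /DEPENDENT educ
      /METHOD=STEPWISE pared maeduc paeduc.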
