Quantitative Methods for Business
COMM5005
Lecture 12

Business School
J. Watson, Semester 1 2015

In this final lecture we will

- look at how to interpret the Excel output for a multiple regression
- see how to separate the trend and cyclical components in time series data
- look at forecasting
- see how to calculate index numbers

Readings

For today's topics, Berenson et al.

- Ch 12.7
- Ch 13.1-13.4, 13.6
- Ch 14.1-14.4
- Ch 14.9

1. Multiple regression example

We will now look at an example of a multiple regression where we have an employee's superannuation account balance as the dependent variable. There is data from 20 employees.

We are trying to estimate the relationship that the independent variables years in workforce, gender and current salary have with the superannuation balance.

Gender is a qualitative variable. We represent it by using a dummy variable which has a value of 1 for a male and 0 for a female employee.

superannuation data

Years in     Gender    Salary    Superannuation
Workforce              $'000     Balance $'000
25           1         50.6      117.9
31           1         75.2      417.1
37           1         48.3      156.2
37           1         52.3      202.9
40           1         106.2     506.2
30           1         61.3      255
32           1         52.6      179.8
26           1         48.9      82.6
29           1         42.6      47.3
36           1         89.5      488.5
28           1         33.1      70.5
29           1         35.6      120.1
10           0         31.2      15.6
15           0         33.9      8.9
30           0         49.7      124.7
28           0         69.3      223.4
35           0         86.4      301.6
17           0         28.1      52.8
25           0         46.2      67.9
22           0         50.7      89.5
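The regression coefficients for these data can be reproduced by ordinary least squares. The following is a minimal sketch, assuming the 20 observations are as listed in the table above and solving the normal equations directly. The helper names `solve` and `fit_ols` are illustrative and are not part of the lecture, which uses Excel.

```python
# Sketch: ordinary least squares for the superannuation data via the normal
# equations (X'X) b = X'y, solved by Gaussian elimination. Standard library only.

# (years in workforce, gender dummy, salary $'000, superannuation balance $'000)
DATA = [
    (25, 1, 50.6, 117.9), (31, 1, 75.2, 417.1), (37, 1, 48.3, 156.2),
    (37, 1, 52.3, 202.9), (40, 1, 106.2, 506.2), (30, 1, 61.3, 255.0),
    (32, 1, 52.6, 179.8), (26, 1, 48.9, 82.6), (29, 1, 42.6, 47.3),
    (36, 1, 89.5, 488.5), (28, 1, 33.1, 70.5), (29, 1, 35.6, 120.1),
    (10, 0, 31.2, 15.6), (15, 0, 33.9, 8.9), (30, 0, 49.7, 124.7),
    (28, 0, 69.3, 223.4), (35, 0, 86.4, 301.6), (17, 0, 28.1, 52.8),
    (25, 0, 46.2, 67.9), (22, 0, 50.7, 89.5),
]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(data):
    """Return [b0, b1, b2, b3] for balance ~ years + gender + salary."""
    rows = [(1.0, yrs, g, sal) for yrs, g, sal, _ in data]  # leading 1.0 = intercept
    y = [bal for *_, bal in data]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


coefficients = fit_ols(DATA)
print([round(b, 4) for b in coefficients])
```

The four printed values are the intercept and the coefficients for years in workforce, gender and salary, in that order, and can be compared with the Coefficients column of the Excel output.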

Excel output

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.951399308
R Square             0.905160643
Adjusted R Square    0.887378264
Standard Error       50.11767261
Observations         20

ANOVA
              df    SS             MS          F           Significance F
Regression    3     383564.8798    127855      50.90211    2.09087E-08
Residual      16    40188.49773    2511.781
Total         19    423753.3775

                        Coefficients     Standard Error    t Stat       P-value     Lower 95%       Upper 95%
Intercept               -196.0738799     45.52498098       -4.306951    0.000543    -292.582528     -99.56523
Years in Workforce      -0.604845409     2.633748738       -0.229652    0.821272    -6.18814328     4.978452
Gender                  59.58681943      29.84990611       1.996215     0.06322     -3.6921543      122.8658
Current Salary $'000    6.480588884      0.799454349       8.106265     4.67E-07    4.785821383     8.175356

Estimate 1

So a male employee with a current salary of $50,000 who has worked for 30 years is estimated to have a superannuation balance of

\[ \hat{Y} = -196.07 - 0.6048(30) + 59.5868(1) + 6.4806(50) = 169.397 \]

or $169,397. (The total is calculated from the unrounded Excel coefficients.)

Regression equation

The sample regression equation is now

\[ \hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} \]

We can no longer draw this as a line because it is in four-dimensional space. From the Excel output we see that the equation should be

\[ \hat{Y}_i = -196.07 - 0.6048 X_{1i} + 59.5868 X_{2i} + 6.4806 X_{3i} \]

where
\( \hat{Y}_i \) = predicted superannuation balance ($'000)
\( X_{1i} \) = years in workforce
\( X_{2i} \) = gender
\( X_{3i} \) = salary ($'000)

Estimate 2

However, a female employee with the same salary and period in the workforce is estimated to have a superannuation balance of

\[ \hat{Y} = -196.07 - 0.6048(30) + 0 + 6.4806(50) = 109.810 \]

or only $109,810.
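The two estimates can be checked with a few lines of code. This is only a sketch: the function name `predict_balance` is illustrative, and the coefficients are the unrounded values from the Excel output.

```python
# Coefficients as reported in the Excel output (unrounded).
B0 = -196.0738799        # intercept
B_YEARS = -0.604845409   # years in workforce
B_GENDER = 59.58681943   # gender dummy (1 = male, 0 = female)
B_SALARY = 6.480588884   # current salary, $'000


def predict_balance(years, gender, salary_thousands):
    """Predicted superannuation balance in $'000."""
    return B0 + B_YEARS * years + B_GENDER * gender + B_SALARY * salary_thousands


male = predict_balance(30, 1, 50)
female = predict_balance(30, 0, 50)
print(f"male:   {male:.3f}")
print(f"female: {female:.3f}")
print(f"gap:    {male - female:.3f}")
```

With years and salary held equal, the gap between the two predictions is the gender coefficient itself, which is how a dummy variable coefficient is read.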

Interpretation of coefficients

Each b coefficient can be seen as a partial derivative, i.e. the rate at which the super balance changes if all other variables are kept constant.

So we can interpret \( b_3 = 6.4806 \) to mean that the estimated superannuation account balance will increase by 6.4806 thousand dollars for every extra thousand dollars of current salary, keeping gender and working years constant.

residual plots

We should check the residuals against \( \hat{Y}_i \) plus (separately) against each of the independent variables. These plots are shown below and on the next slide.

Residuals versus predicted Y

[Figure: residuals (vertical axis) plotted against predicted super (horizontal axis). Check either side of the 0 level.]

Residuals

Observation    Predicted Superannuation Balance    Residuals
1              176.3096018                         -58.40960183
2              332.1030159                         84.99698407
3              154.1461025                         2.053897502
4              180.068458                          22.83154197
5              527.5576627                         -21.35766266
6              242.6276759                         12.37232415
7              185.0368617                         -5.236861742
8              164.6877553                         -82.08775532
9              122.0455091                         -74.74550913
10             421.7512099                         66.74879007
11             61.08476014                         9.415239862
12             76.68138694                         43.41861306
13             0.072039187                         15.52796081
14             14.54540213                         -5.645402131
15             107.8660254                         16.83397463
16             236.0952583                         -12.69525831
17             342.6794104                         -41.07941037
18             -24.25170421                        77.05170421
19             88.20819132                         -20.30819132
20             119.1853775                         -29.68537752

residuals versus X variables

[Figure: three residual plots, titled "Years in Workforce Residual Plot", "Current Salary $'000 Residual Plot" and "Gender Residual Plot", each with Residuals on the vertical axis.]
Residuals normal?

We also check to see if the residuals are normally distributed. The histogram here, created from the data on slide 10, suggests that they are approximately normal.

[Figure: Histogram of the residuals. Frequency (vertical axis, 0 to 8) by residual bin (from -100 to 100, then More).]

2. R Square

The R Square value tells us that 90.5% of the variation in superannuation balances is explained by this model.

Remember though that if we were comparing alternative models with different numbers of variables it would be preferable to use the adjusted R Square, which tells us that 88.7% of the variation in superannuation balances is explained.

3. Which variables are significant?

Look at the t Stat and P-value columns. Will we reject the hypotheses that \( \beta_0 \) and \( \beta_3 \) are zero using two-tail tests and a 1% level of significance?

                        Coefficients     Standard Error    t Stat      P-value
Intercept               -196.0738799     45.52498098       -4.30695    0.000543
Years in Workforce      -0.604845409     2.633748738       -0.22965    0.821272
Gender                  59.58681943      29.84990611       1.996215    0.06322
Current Salary $'000    6.480588884      0.799454349       8.106265    4.67E-07

Which variables are significant (cont.)

Yes, we can reject \( \beta_0 = 0 \) and \( \beta_3 = 0 \). The reason is that for the Intercept the P-value is 0.000543 < 0.01, and for Current Salary the P-value is \( 4.67 \times 10^{-7} < 0.01 \).

However, we can only reject the hypothesis that the gender coefficient \( \beta_2 = 0 \) using a two-tail test at the 10% level, since its P-value is 0.06322.

We are not able to reject the hypothesis that the years in the workforce coefficient \( \beta_1 = 0 \) even at the 10% level.

4. Confidence intervals for βs

Excel also gives us another way of analysing the beta coefficients. We can find a confidence interval for each population coefficient. The endpoints of these are shown in the columns Lower 95% and Upper 95%.

                        Lower 95%       Upper 95%
Intercept               -292.582528     -99.5652
Years in Workforce      -6.18814328     4.978452
Gender                  -3.6921543      122.8658
Current Salary $'000    4.785821383     8.175356

We can also check these to see which intervals contain zero and which do not.

5. Analysis of Variance

As mentioned in the last lecture, the ANOVA section of the output refers to analysis of variance and shows sums of squares due to the regression (SSR), residual or errors (SSE) and total sum of squares (SST).

We have already seen how these have been used to give the \( R^2 \) value.

They are also used to calculate the F value, which is a measure of the overall significance of the regression model.
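The link between the ANOVA sums of squares and the Regression Statistics block can be sketched as follows. The figures are copied from the Excel output; the variable names are illustrative. The formulas used are the standard ones: R Square = SSR/SST, adjusted R Square = 1 - (1 - R Square)(n - 1)/(n - k - 1), and the standard error is the square root of MSE.

```python
import math

# Values read from the ANOVA section of the Excel output.
SSR = 383564.8798   # sum of squares due to the regression
SSE = 40188.49773   # residual (error) sum of squares
N = 20              # observations
K = 3               # independent variables

sst = SSR + SSE                                   # total sum of squares
r_square = SSR / sst
adj_r_square = 1 - (1 - r_square) * (N - 1) / (N - K - 1)
standard_error = math.sqrt(SSE / (N - K - 1))     # square root of MSE

print(f"SST               = {sst:.4f}")
print(f"R Square          = {r_square:.6f}")
print(f"Adjusted R Square = {adj_r_square:.6f}")
print(f"Standard Error    = {standard_error:.5f}")
```

The adjusted figure is lower than R Square because it penalises the three independent variables relative to the 20 observations.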

F test

For a regression with only one independent variable the significance level of F is the same as the p-value for the slope's t test. In a multiple regression the F value is used to perform a joint test of the regression coefficients, i.e. that

\[ H_0: \beta_1 = \beta_2 = \dots = \beta_K = 0 \]

against the alternative

\( H_1 \): At least one of the coefficients \( \beta_1, \beta_2, \dots, \beta_K \) is non-zero

where the test statistic is

\[ F = \frac{SSR / k}{SSE / (n - k - 1)} = \frac{MSR}{MSE} \]

Here k = K = 3 independent variables and n = 20 observations. From the Excel output we see F = 50.90211.
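As a check on the F formula, the statistic can be recomputed from the ANOVA sums of squares. This is a sketch using the figures in the Excel output; `f_statistic` is an illustrative name.

```python
def f_statistic(ssr, sse, n, k):
    """Overall F statistic: MSR / MSE with k and n - k - 1 degrees of freedom."""
    msr = ssr / k              # mean square due to the regression
    mse = sse / (n - k - 1)    # mean square error
    return msr / mse


# SSR and SSE from the ANOVA table; n = 20 observations, k = 3 variables.
f_value = f_statistic(383564.8798, 40188.49773, n=20, k=3)
print(f"F = {f_value:.5f}")
```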

F tables

The F tables are at Table E4 in your textbook. Four separate tables are given for alpha levels of 0.05, 0.025, 0.01 and 0.005 in the upper tail.

The F distribution depends on degrees of freedom for both the numerator and denominator.

If we use a 1% level of significance, with 3 degrees of freedom in the numerator and 16 in the denominator, the critical value is \( F_{0.01,3,16} = 5.29 \). Since F = 50.90211 > 5.29 we reject the null hypothesis.

F test conclusion

Instead of using tables you can simply read off the Significance F relating to F = 50.90211 and compare it with the desired level of significance. The given value is 2.09087E-08, or \( 2.09087 \times 10^{-8} < 0.01 \). At the 1% level we would reject

\[ H_0: \beta_1 = \beta_2 = \dots = \beta_K = 0 \]

Conclusion: there is a linear relationship between at least one of the variables and the superannuation balance.

6. Time Series and Forecasting

Time series data shows values for a variable or set of variables over time.

Graphs are usually drawn with time on the horizontal axis and lines connecting the data points. Often patterns can be determined, such as seasonal variations and long term trends.

What do you notice about temperatures here?

[Figure: time series graph of temperatures]
Source: Bureau of Meteorology
http://www.bom.gov.au/cgi-bin/climate/change/timeseries.cgi

Compare this graph

[Figure: Bureau of Meteorology time series graph]
Source: Bureau of Meteorology http://www.bom.gov.au/cgi-bin/climate/change/timeseries.cgi

Also compare this one

[Figure: Bureau of Meteorology time series graph]
Source: Bureau of Meteorology http://www.bom.gov.au/cgi-bin/climate/change/timeseries.cgi

Trend and Seasonal component

Trend
The trend is the continuous long term movement in a variable over a period of time. If a linear relationship is appropriate, the most widely used technique for isolating the trend is to use a linear regression with

\[ \hat{Y}_t = b_0 + b_1 t \]

where t, the independent variable, is time.

Seasonal Component
Many economic variables fluctuate on a regular basis throughout a defined period, usually a year. This may be due to agricultural growing and harvest cycles, holiday periods such as Christmas, or the time when school leavers join the workforce. If only annual data were used the model would not need to include a seasonal component.

Cyclical and irregular components

Cyclical Variation
Business cycles with upswings, peaks, contractions and troughs will produce a wavelike effect on time series over a relatively long period.

Irregular fluctuations
These will often be caused by natural disasters such as floods, cyclones and tsunamis, or man-made disruptions such as wars and elections.

additive model

A time series can be decomposed into several components. In an additive model these are usually regarded as the Trend (T), the Seasonal component (S), Cyclical variations (C) and some Irregular or random fluctuations (I). Therefore the additive model can be written as

\[ Y_t = T_t + S_t + C_t + I_t \]

multiplicative model

The additive model suffers from the assumption that the components are independent of each other. A more realistic alternative is the multiplicative model, which can be written as

\[ Y_t = T_t \times S_t \times C_t \times I_t \]

Here only T is expressed in original units and the other terms are proportions.

additive example

If we were to develop a monthly time series model of the sales value of swimwear in a retail store with the additive model, we might find that at time t = 12

Y = $2,000 + $3,000 - $500 + $50 = $4,550

The large positive seasonal component ($3,000) reflects that swimwear sales are strongly seasonal and are high in month 12 (December). The negative value for the cyclical component might be the result of a downswing in the business cycle.

Petrol example

We will use a petrol price example to see how to separate out the trend and (day of week) cyclical components from a set of data and to use this to forecast future prices.

       Week 1    Week 2    Week 3    Week 4
Sun    1.20      1.23      1.25      1.28
Mon    1.18      1.22      1.22      1.26
Tue    1.16      1.21      1.20      1.25
Wed    1.18      1.21      1.32      1.26
Thu    1.27      1.30      1.32      1.35
Fri    1.25      1.28      1.30      1.34
Sat    1.24      1.26      1.29      1.30

Step 1

We need to find the trend line, so will arrange the prices in a single column and number the days from Sunday in Week 1 as shown below, then perform a regression with price as the dependent variable and time as the independent one.

Days     Price (Sun to Sat)
1-7      1.20  1.18  1.16  1.18  1.27  1.25  1.24
8-14     1.23  1.22  1.21  1.21  1.30  1.28  1.26
15-21    1.25  1.22  1.20  1.32  1.32  1.30  1.29
22-28    1.28  1.26  1.25  1.26  1.35  1.34  1.30

The result is the equation

\[ \hat{Y}_t = 1.1917 + 0.0043t \]

Trend line

The plot below shows how the daily price fluctuates in a fairly regular pattern around this trend line.

[Figure: Day Line Fit Plot. Price and Predicted Price (vertical axis, 1.10 to 1.40) against Day (horizontal axis, 0 to 30).]

Step 2 multiplicative model

\[ Y_t = T_t \times C_t \times I_t \]

Assume we have a multiplicative model with a cyclical component but no seasonal one. We will try to adjust the prices to remove the cyclical (day of week) component.

Copy the predicted prices from the regression output onto the data page. Divide the price for each time period by the corresponding predicted price. Thus we have an estimate

\[ \frac{Y_t}{T_t} = C_t \times I_t \]

Next find the average (price/predicted price) for each day of the week. For the four Sundays we find the average for observations 1, 8, 15 and 22. We calculate seven averages in all, one for each day, using the four weeks of observations.

These values become the multiplicative cyclical (i.e. daily) index. We can check that the index numbers add to 7. In this case they are very close to 7 so will not need to be adjusted.
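Step 1 and the index calculation in Step 2 can be sketched in code. This is a minimal illustration assuming the 28 prices from the petrol table, with day 1 a Sunday. The names `linear_trend` and `daily_index` are illustrative; the lecture itself does these steps in Excel.

```python
# Petrol prices for days 1..28, starting on a Sunday.
PRICES = [
    1.20, 1.18, 1.16, 1.18, 1.27, 1.25, 1.24,   # week 1, Sun..Sat
    1.23, 1.22, 1.21, 1.21, 1.30, 1.28, 1.26,   # week 2
    1.25, 1.22, 1.20, 1.32, 1.32, 1.30, 1.29,   # week 3
    1.28, 1.26, 1.25, 1.26, 1.35, 1.34, 1.30,   # week 4
]
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]


def linear_trend(y):
    """Least-squares intercept and slope of y against t = 1, 2, ..., n."""
    n = len(y)
    t = range(1, n + 1)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope


b0, b1 = linear_trend(PRICES)

# Price / predicted price for every day: the estimate of C_t x I_t.
ratios = [p / (b0 + b1 * t) for t, p in enumerate(PRICES, start=1)]

# Average the ratios for each day of the week (every 7th observation).
daily_index = {
    day: sum(ratios[i::7]) / len(ratios[i::7]) for i, day in enumerate(DAYS)
}

print(f"trend: Y = {b0:.4f} + {b1:.4f} t")
for day, value in daily_index.items():
    print(f"{day}: {value:.6f}")
print(f"sum of index numbers: {sum(daily_index.values()):.6f}")
```

Averaging across the four weeks is what separates the regular day-of-week effect from the irregular component, which is assumed to average out.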

The adjusted series is then found by dividing each daily price by the index for the corresponding day. For example, to find the adjusted price for Day 1 (a Sunday), as we see in the table below,

adjusted price = 1.20 / 0.998808 = 1.201

adjusting first 12 days shown

Day    Price    Predicted Price    Price/predicted price    Adjusted series
1      1.20     1.196009852        1.003336216              1.201432
2      1.18     1.200353038        0.983044124              1.204932
3      1.16     1.204696223        0.962898345              1.203508
4      1.18     1.209039409        0.975981421              1.191687
5      1.27     1.213382594        1.046660802              1.220385
6      1.25     1.21772578         1.026503685              1.221727
7      1.24     1.222068966        1.014672686              1.23502
8      1.23     1.226412151        1.002925484              1.22591
9      1.22     1.230755337        0.99126119               1.241043
10     1.21     1.235098522        0.979678931              1.256623
11     1.21     1.239441708        0.976245992              1.239778
12     1.30     1.243784893        1.045196808              1.242045

Day      Average (price/pred price)
Sun      0.998808306
Mon      0.979308748
Tue      0.963849185
Wed      0.990193085
Thu      1.040655105
Fri      1.023141684
Sat      1.00403247
total    6.999988584

days 13-28

Day    Price    Predicted Price    Price/predicted price    Adjusted series
13     1.28     1.248128079        1.025535778              1.246951
14     1.26     1.252471264        1.006011104              1.24178
15     1.25     1.25681445         0.994577998              1.245844
16     1.22     1.261157635        0.967365193              1.241043
17     1.20     1.265500821        0.948241186              1.246237
18     1.32     1.269844007        1.03949776               1.352485
19     1.32     1.274187192        1.035954535              1.261154
20     1.30     1.278530378        1.016792423              1.266435
21     1.29     1.282873563        1.005555058              1.271346
22     1.28     1.287216749        0.994393525              1.275744
23     1.26     1.291559934        0.975564483              1.281733
24     1.25     1.29590312         0.964578278              1.298164
25     1.26     1.300246305        0.969047168              1.291008
26     1.35     1.304589491        1.034808274              1.289816
27     1.34     1.308932677        1.023734852              1.305402
28     1.30     1.313275862        0.989891033              1.281201

Note: the adjusted values listed for days 8 to 28 equal the price divided by the price/predicted price ratio of the matching day in the first seven rows (for example 1.23 / 1.003336 = 1.22591 for day 8), not by the averaged index. Dividing by the index as described above would give 1.23 / 0.998808 = 1.2315 for day 8.

forecasting

If, instead of adjusting observed prices, we wish to make a forecast from our trend line, we need to include the cyclical component.

If we wish to forecast the price at day 35 we would substitute t = 35 into the regression equation and multiply the result by the index relating to a Saturday.

Thus the forecast of \( Y_{35} \) is

1.0040 × (1.1917 + 0.0043(35)) = 1.3476
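The multiplicative adjustment and forecast can be sketched as below. The trend coefficients are the rounded values quoted in the lecture, and the daily index values are copied from the Average (price/pred price) column; the function names are illustrative.

```python
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

# Multiplicative daily index: average of price / predicted price for each day.
MULT_INDEX = {
    "Sun": 0.998808306, "Mon": 0.979308748, "Tue": 0.963849185,
    "Wed": 0.990193085, "Thu": 1.040655105, "Fri": 1.023141684,
    "Sat": 1.00403247,
}


def weekday(t):
    """Day of week for day number t, where day 1 is a Sunday."""
    return DAYS[(t - 1) % 7]


def trend(t):
    """Trend line with the rounded coefficients quoted in the lecture."""
    return 1.1917 + 0.0043 * t


def adjusted_price(price, t):
    """Remove the day-of-week effect from an observed price."""
    return price / MULT_INDEX[weekday(t)]


def multiplicative_forecast(t):
    """Trend estimate scaled by the index for the relevant day."""
    return trend(t) * MULT_INDEX[weekday(t)]


print(weekday(35), round(multiplicative_forecast(35), 4))
print(round(adjusted_price(1.20, 1), 6))
```

Adjusting divides by the index while forecasting multiplies by it, since a forecast has to put the day-of-week effect back in.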

Step 2 - additive model

\[ Y_t = T_t + C_t + I_t \]

After copying the predicted prices to the data page, this time they should be subtracted from the observed prices to give \( C_t + I_t \). Then the average (price - predicted price) for each day of the week should be found, giving a daily index.

forecasting

The daily adjusted series is found by subtracting the index for the relevant day from the observed price.

When forecasting with the additive model it should be assumed that I = 0 and the index should be added to the estimate from the trend.

As day 36 is a Sunday, the forecast of \( Y_{36} \) would be

1.1917 + 0.0043(36) + (-0.0016) = 1.3449
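The additive adjustment and forecast follow the same pattern, with subtraction and addition in place of division and multiplication. In this sketch the daily index values are the Average (price - pred. price) figures from this lecture's additive table, the trend uses the rounded coefficients, and the function names are illustrative.

```python
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

# Additive daily index: average of (price - predicted price) for each day.
ADD_INDEX = {
    "Sun": -0.0016133, "Mon": -0.025956486, "Tue": -0.045299672,
    "Wed": -0.012142857, "Thu": 0.051013957, "Fri": 0.029170772,
    "Sat": 0.004827586,
}


def weekday(t):
    """Day of week for day number t, where day 1 is a Sunday."""
    return DAYS[(t - 1) % 7]


def trend(t):
    """Trend line with the rounded coefficients quoted in the lecture."""
    return 1.1917 + 0.0043 * t


def additive_adjusted(price, t):
    """Observed price minus the index for the relevant day."""
    return price - ADD_INDEX[weekday(t)]


def additive_forecast(t):
    """Trend estimate plus the daily index, taking I = 0."""
    return trend(t) + ADD_INDEX[weekday(t)]


print(weekday(36), round(additive_forecast(36), 4))
print(round(additive_adjusted(1.20, 1), 6))
```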

adjusting first 18 days shown

Day    Price    Predicted Price    Price - predicted price    Adjusted series
1      1.20     1.196009852        0.003990148                1.201613
2      1.18     1.200353038        -0.020353038               1.205956
3      1.16     1.204696223        -0.044696223               1.2053
4      1.18     1.209039409        -0.029039409               1.192143
5      1.27     1.213382594        0.056617406                1.218986
6      1.25     1.21772578         0.03227422                 1.220829
7      1.24     1.222068966        0.017931034                1.235172
8      1.23     1.226412151        0.003587849                1.231613
9      1.22     1.230755337        -0.010755337               1.245956
10     1.21     1.235098522        -0.025098522               1.2553
11     1.21     1.239441708        -0.029441708               1.222143
12     1.30     1.243784893        0.056215107                1.248986
13     1.28     1.248128079        0.031871921                1.250829
14     1.26     1.252471264        0.007528736                1.255172
15     1.25     1.25681445         -0.00681445                1.251613
16     1.22     1.261157635        -0.041157635               1.245956
17     1.20     1.265500821        -0.065500821               1.2453
18     1.32     1.269844007        0.050155993                1.332143

Day      Average (price - pred. price)
Sun      -0.0016133
Mon      -0.025956486
Tue      -0.045299672
Wed      -0.012142857
Thu      0.051013957
Fri      0.029170772
Sat      0.004827586
total    -3.88578E-16

7. Price indices

What do the CPI, S&P/ASX 200 and the Hang Seng all have in common? They are indices.

A simple price index looks at only one item, e.g. the price of 1 kg of navel oranges in 2012 compared with the base year of 2000:

\[ \frac{\text{2012 price}}{\text{2000 price}} \times 100 = \frac{2.99}{1.60} \times 100 = 186.875 \]

simple aggregate index

A composite index is made up of changes in a number of items.

A simple aggregate index can be found by finding the sum of current prices x 100 divided by the sum of base year prices, or

\[ \frac{\sum p_n}{\sum p_0} \times 100 \]

where \( p_n \) are the current prices and \( p_0 \) the base year prices.

deficiencies

- the main deficiency of a simple aggregate index is that it takes no account of quantities purchased.
- if prices are quoted in different units, e.g. per mushroom instead of per kilo, the index will be affected and give a different result.
- a large price for one item may dominate

Example 1

Construct a simple aggregate index for the following basket of food items

Item                   2000    2012
Zucchini/kg            3.99    5.99
Mushrooms/kg           6.50    7.99
Pink Lady Apples/kg    3.99    5.99
Navel Oranges/kg       1.60    2.99

Weighted indices

Weighted indices allow greater importance to be given to items for which greater quantities are sold or consumed.

Laspeyres index uses base period quantities (\( q_0 \)) as weights.

It can be used to compare prices between other periods.

\[ \text{Laspeyres index} = \frac{\sum p_n q_0}{\sum p_0 q_0} \times 100 \]
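The simple aggregate index and the Laspeyres index can be sketched together. The prices are those in Example 1 and the base period (2000) quantities are those given in this lecture's Example 2. The names `BASKET`, `simple_aggregate_index` and `laspeyres_index` are illustrative.

```python
# item: (2000 price p0, 2012 price pn, 2000 quantity q0)
BASKET = {
    "Zucchini/kg": (3.99, 5.99, 3.2),
    "Mushrooms/kg": (6.50, 7.99, 1.2),
    "Pink Lady Apples/kg": (3.99, 5.99, 5.2),
    "Navel Oranges/kg": (1.60, 2.99, 6.2),
}


def simple_aggregate_index(basket):
    """Sum of current prices over sum of base year prices, times 100."""
    current = sum(pn for _, pn, _ in basket.values())
    base = sum(p0 for p0, _, _ in basket.values())
    return 100 * current / base


def laspeyres_index(basket):
    """Current and base prices both weighted by base period quantities."""
    current = sum(pn * q0 for _, pn, q0 in basket.values())
    base = sum(p0 * q0 for p0, _, q0 in basket.values())
    return 100 * current / base


simple = simple_aggregate_index(BASKET)
laspeyres = laspeyres_index(BASKET)
print(f"simple aggregate index: {simple:.2f}")
print(f"Laspeyres index:        {laspeyres:.2f}")
```

The two results differ because the Laspeyres index gives more weight to items bought in larger quantities, such as the oranges and apples in this basket.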

Example 2 - Laspeyres Index 2012

Calculate the Laspeyres index for 2012 from the prices (p) and quantities (q) below.

Item                   2000 p    2012 p    2000 q    2012 q
Zucchini/kg            3.99      5.99      3.2       4.3
Mushrooms/kg           6.50      7.99      1.2       1.5
Pink Lady Apples/kg    3.99      5.99      5.2       5.6
Navel Oranges/kg       1.60      2.99      6.2       7.0

The Paasche index uses the current period quantities but has some practical problems, such as obtaining quantity data for every period.

\[ \text{Paasche index} = \frac{\sum p_n q_n}{\sum p_0 q_n} \times 100 \]

Notices

Don't forget to complete CATEI evaluations for this course on myUNSW. They are carried out anonymously and will help us plan for future changes in the course.

We will hand out the assignments which have been corrected so far during Week 12 tutorials. If there are any we have not managed to mark by Thursday, the rest will be handed out during the Week 13 tutorials or can be picked up from Judith's office after that date.

Stuvac consultations

Judith's consultation times up to the exam will be slightly different:

- Week 13 as normal: Tuesday 2-4, Thursday 4-5
- Tuesday June 9, 2-4
- Thursday June 11, 2-3
- Monday June 15, 2-4
- Thursday June 18, 2-3
- Tuesday June 23, 2-4
- Thursday June 25, 2-3
- and by appointment

Exam
The exam will consist of two parts:
Part A: 16 multiple choice questions on both
maths and statistics, each worth 1 mark. Use a
pencil to mark answers and your personal
details.
Part B: 3 written problems with two of the three
based on the statistics section of the course.
They are not all of equal marks so plan your time
carefully.
Total marks for the exam: 50
Please bring an approved calculator, textbook,
notes, pencil, pen, ruler, eraser. No tables will
be supplied (you can use the textbook ones).

And last of all.


Don't forget that the final Regression eLearning
tutorial will run in week 13 to help with your
revision.
Thanks for your participation in COMM5005.
We hope you have learned some useful skills
and that your efforts are rewarded with good
results.
