
Statistics, part 5

Outline:
Simple linear regression
Multiple linear regression

We take the least square line

[Scatter plot: ANN.RISK (y-axis, 10 to 70) against AV.CRED (x-axis, 20 to 100), with the least square line drawn through the points]

What is the story behind this line?

The linear model

The simple linear probabilistic model
y = β0 + β1 x + ε
y = dependent variable
x = independent variable
E(y) = β0 + β1 x is the deterministic component
ε = random error
β0 = y-intercept of line
β1 = slope of line

Model: y = β0 + β1 x + ε

Line of means: E(y) = β0 + β1 x

[Figure: for the observed value yi at xi, the random error is
εi = yi − (β0 + β1 xi),
the vertical distance between the observed value yi and the mean value β0 + β1 xi on the line of means]

Suppose we can describe the relation
between x and y by a straight line. Then we
can make statistical inferences about x and y,
e.g. for some specific x we can predict
(with some confidence) the outcome of y.
What steps are needed to construct a
probabilistic straight line (= linear) model,
and what do we assume?

Executing a simple linear regression

(to investigate the relation between x and y):
1. assume straight line model can be used
2. estimate β0 and β1 with least square line
using a sample
3. test or make confidence interval to check
validity of coefficients
4. check whether complete model is useful
5. use the model for predictions

1. assume straight line model can be used

Example: Let y be supply of some good and let
x be the price of this good.
We assume that the straight line model can be
used to describe the relation between y and x:
y = β0 + β1 x + ε
Hence the deterministic component, which is the
mean of y, is linear (straight line)

2. estimate β0 and β1 with least square line
using a sample
Example: Suppose we have the following data:
(x, y) = (0.5, 2), (x, y) = (1, 1), (x, y) = (1.5, 3)

[Scatter plot of the three points, x from 0.5 to 1.5, y from 1 to 3]

Find straight line that fits best to these points:

ŷ = β̂0 + β̂1 x

Problem: If ŷ = β̂0 + β̂1 x is the least square line,
how do we find the coefficients β̂0 and β̂1?
We have explicit formulas:
β̂1 = Cov(x, y) / s²ₓ
β̂0 = ȳ − β̂1 x̄
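As a check on these formulas, a short pure-Python computation for the three sample points (no regression library needed):

```python
# Least square coefficients for the sample (0.5, 2), (1, 1), (1.5, 3),
# using beta1_hat = Cov(x, y) / s_x^2 and beta0_hat = y_bar - beta1_hat * x_bar.
xs = [0.5, 1.0, 1.5]
ys = [2.0, 1.0, 3.0]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sample covariance and sample variance (both with divisor n - 1, so it cancels)
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)

beta1_hat = cov_xy / var_x             # -> 1.0
beta0_hat = y_bar - beta1_hat * x_bar  # -> 1.0
print(beta0_hat, beta1_hat)            # the least square line is y_hat = 1 + x
```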

Using Excel (analyse/regression/linear):

              Coefficients  Standard Error  t Stat  P-value  Lower 95%    Upper 95%
Intercept     1             1.870828693     0.5345  0.6875   -22.7710306  24.7710306
X Variable 1  1             1.732050808     0.5774  0.6667   -21.0076979  23.0076979

Hence ŷ = 1 + x

3. test or make confidence interval to check
validity of coefficients
Note that we can conclude that there is
no relationship between x and y if β1 = 0

[Figure: a horizontal line in the (x, y)-plane; a change of x does not affect y]

a. Using test to check validity slope

Hypothesis

H0 : β1 = 0
Ha : β1 ≠ 0

Test with a level of significance of α = 0.05

Excel yields (analyse/regression/linear)

              Coefficients  Standard Error  t Stat  P-value  Lower 95%    Upper 95%
Intercept     1             1.870828693     0.5345  0.6875   -22.7710306  24.7710306
X Variable 1  1             1.732050808     0.5774  0.6667   -21.0076979  23.0076979

Since the p-value is larger than the significance level α = 0.05,
we can conclude that the sample that supports this test
does not contain enough evidence to reject H0.
Hence, we may have β1 = 0, so based on this sample
we may conclude that there is NO relation
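The t statistic and p-value from the Excel output can be reproduced by hand: t = β̂1 / s_{β̂1}, and with n − 2 = 1 degree of freedom the t distribution is the standard Cauchy distribution, whose CDF has a closed form. A sketch in plain Python:

```python
import math

# Slope estimate and its standard error (sqrt(3), printed by Excel as 1.732051)
beta1_hat = 1.0
se_beta1 = math.sqrt(3)

t_stat = beta1_hat / se_beta1   # ~0.5774, as in the Excel output

# With df = n - 2 = 1 the t distribution is the standard Cauchy,
# whose CDF is F(t) = 0.5 + atan(t)/pi, so the two-sided p-value is:
p_value = 2 * (1 - (0.5 + math.atan(t_stat) / math.pi))

print(round(t_stat, 4), round(p_value, 4))   # 0.5774 0.6667
```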

b. Using confidence interval to check validity slope

A 100(1 − α)% confidence interval for β1 is:

( β̂1 − t_{n−2; α/2} · s_{β̂1} ,  β̂1 + t_{n−2; α/2} · s_{β̂1} )

Forget the formula, but observe that its structure is similar
to the confidence intervals for the mean and the population proportion
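Plugging in the numbers for this sample (df = n − 2 = 1, so the critical value t_{1; 0.025} is the 97.5% quantile of the standard Cauchy distribution, which has a closed form), a sketch:

```python
import math

beta1_hat = 1.0             # slope estimate
se_beta1 = math.sqrt(3)     # its standard error, ~1.732051
alpha = 0.05

# For df = 1 the t distribution is the standard Cauchy, whose quantile
# function is F^{-1}(q) = tan(pi * (q - 0.5)); t_{1; 0.025} ~ 12.7062
t_crit = math.tan(math.pi * (1 - alpha / 2 - 0.5))

lower = beta1_hat - t_crit * se_beta1
upper = beta1_hat + t_crit * se_beta1
print(round(lower, 2), round(upper, 2))   # -21.01 23.01
```

Up to rounding this matches the interval (-21.0077, 23.0077) in the Excel output.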

Excel yields

              Coefficients  Standard Error  t Stat  P-value  Lower 95%    Upper 95%
Intercept     1             1.870828693     0.5345  0.6875   -22.7710306  24.7710306
X Variable 1  1             1.732050808     0.5774  0.6667   -21.0076979  23.0076979

So, confidence interval (-21.0077, 23.0077)

Conclusion is that 0 is contained in the
95% confidence interval. Hence, β1 can
take the value 0, which implies that
there may be no relation between x and y

4. Check whether complete model is useful

Checks the strength of the linear relationship:
Coefficient of determination

R² = (SSyy − SSE) / SSyy

where

SSyy = Σ(yi − ȳ)²   and   SSE = Σ(yi − ŷi)²

(= explained variation in y / total variation in y)

4. Check whether complete model is useful

Checks the strength of the linear relationship:
Coefficient of determination

R² = (SSyy − SSE) / SSyy  (= explained variation in y / total variation in y)

Properties:
a. 0 ≤ R² ≤ 1
b. 100(R²)% of the sample
variation in y can be
explained by using x to
predict y in a linear model

explanation of property b.
If (SSyy − SSE) / SSyy = 0, then SSyy = SSE,
then x contributes no information about y,
since the observed points are at the same distance
from the line y = ȳ as from the least square line.
If (SSyy − SSE) / SSyy = 1, we have SSE = 0,
hence all observed points are on the
least square line
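For the three sample points with least square line ŷ = 1 + x, this computation gives the R² reported by Excel; a pure-Python check:

```python
# Coefficient of determination for the sample (0.5, 2), (1, 1), (1.5, 3)
# with least square line y_hat = 1 + x.
xs = [0.5, 1.0, 1.5]
ys = [2.0, 1.0, 3.0]
y_bar = sum(ys) / len(ys)                  # 2.0

ss_yy = sum((y - y_bar) ** 2 for y in ys)  # total variation: 2.0
sse = sum((y - (1 + x)) ** 2 for x, y in zip(xs, ys))  # unexplained: 1.5

r_squared = (ss_yy - sse) / ss_yy
print(r_squared)   # 0.25, matching the Excel output
```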

Excel yields:

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.5
R Square           0.25
Adjusted R Square  -0.5
Standard Error     1.224745
Observations       3

ANOVA
            df   SS    MS    F         Significance F
Regression  1    0.5   0.5   0.333333  0.666667
Residual    1    1.5   1.5
Total       2    2

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept     1             1.870829        0.534522  0.687494  -22.771    24.77103   -22.771      24.77103
X Variable 1  1             1.732051        0.57735   0.666667  -21.0077   23.0077    -21.0077     23.0077

Hence, only 25 % of the sample variation in y can
be explained by using x in the linear model

Simple linear regression: a complete example

y = price of house
x = area in square feet
Sample:

   x       y           x       y
  2306   145541       1753   129900
  2677   179900       3206   235000
  2324   149000       2474   129900
  1447   113900       2933   199500
  3333   189000       3987   319000
  3004   184500       2598   185500
  4142   339717       4934   375000
  2923   228000       2253   169000
  2902   209000       2998   185900
  1847   133000       2791   189800
  2148   168000       2865   192000
  2819   205000       4417   379900

Step 1. We assume a linear relation between
x and y. Hence, we use the model
y = β0 + β1 x + ε
Step 2. Determine least square line

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept     -39001.1      18237.94        -2.13846  0.043834  -76824.4   -1177.92
X Variable 1  84.98698      6.095676        13.94217  2.12E-12  72.3453    97.62865

ŷ = -39001.1 + 84.987 x
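The same coefficients can be reproduced with any ordinary least squares routine, e.g. numpy's polyfit; a sketch using the 24 sample points above:

```python
import numpy as np

# Areas in square feet (x) and selling prices (y), the 24 sample points
x = np.array([2306, 2677, 2324, 1447, 3333, 3004, 4142, 2923, 2902, 1847,
              2148, 2819, 1753, 3206, 2474, 2933, 3987, 2598, 4934, 2253,
              2998, 2791, 2865, 4417], dtype=float)
y = np.array([145541, 179900, 149000, 113900, 189000, 184500, 339717,
              228000, 209000, 133000, 168000, 205000, 129900, 235000,
              129900, 199500, 319000, 185500, 375000, 169000, 185900,
              189800, 192000, 379900], dtype=float)

slope, intercept = np.polyfit(x, y, 1)   # degree-1 least squares fit
print(round(intercept, 1), round(slope, 3))
```

With the sample transcribed exactly, this reproduces the Excel coefficients -39001.1 and 84.987 up to rounding.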

Step 3. Is there a linear relation between x and y?

a. use test (with significance level of 5 %)
Hypothesis
H0 : β1 = 0
Ha : β1 ≠ 0

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept     -39001.1      18237.94        -2.13846  0.043834  -76824.4   -1177.92
X Variable 1  84.98698      6.095676        13.94217  2.12E-12  72.3453    97.62865

The p-value is smaller than 0.05, so based on this
sample we can reject H0

b. use 95 % confidence interval of β1

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept     -39001.1      18237.94        -2.13846  0.043834  -76824.4   -1177.92
X Variable 1  84.98698      6.095676        13.94217  2.12E-12  72.3453    97.62865

95% CI is (72.345, 97.629)

Because 0 is not in the confidence interval, we can conclude
that there is a linear relation between x and y

4. Check whether complete model is useful

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.947802
R Square           0.898329
Adjusted R Square  0.893708
Standard Error     24383.43
Observations       24

Hence, 89.8 % of the sample variation in y can
be explained by the linear model

Step 5. Use model to make estimations or predictions

Question:
Predict the selling price of a home
with an area of 3000 square feet. Use a
95 % confidence interval (prediction interval)

C.I. for estimation and prediction

Property: the prediction interval is smallest at x = x̄

SPSS gives, for x = 3000, the 95 % prediction interval (PI):
(164326.0100, 267593.5720)

Interpretation PI (prediction interval):
we have confidence of 95% that the selling price of
a house with area 3000 square feet will be
in between $164326 and $267593
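A prediction interval can also be computed by hand from the regression quantities, using the standard formula ŷ0 ± t_{n−2; α/2} · s · sqrt(1 + 1/n + (x0 − x̄)² / SSxx). A sketch with numpy, with the critical t value taken from a table (an assumption: SPSS computes it internally):

```python
import numpy as np

# Areas (x) and prices (y), the 24 sample points from the example
x = np.array([2306, 2677, 2324, 1447, 3333, 3004, 4142, 2923, 2902, 1847,
              2148, 2819, 1753, 3206, 2474, 2933, 3987, 2598, 4934, 2253,
              2998, 2791, 2865, 4417], dtype=float)
y = np.array([145541, 179900, 149000, 113900, 189000, 184500, 339717,
              228000, 209000, 133000, 168000, 205000, 129900, 235000,
              129900, 199500, 319000, 185500, 375000, 169000, 185900,
              189800, 192000, 379900], dtype=float)

slope, intercept = np.polyfit(x, y, 1)
n = len(x)
residuals = y - (intercept + slope * x)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))   # standard error of the regression
ss_xx = np.sum((x - x.mean()) ** 2)

x0 = 3000.0
y0_hat = intercept + slope * x0                 # point prediction
t_crit = 2.0739                                 # tabulated t value, df = 22, alpha/2 = 0.025
half_width = t_crit * s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / ss_xx)
lo, hi = y0_hat - half_width, y0_hat + half_width
print(lo, hi)
```

With exactly transcribed data this comes out very close to the SPSS interval (164326, 267594).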

Application of simple linear regression:

1. Capital Asset Pricing Model (revisited)

[Scatter plot of excess share return against excess market return,
with fitted line y = 0.7553x + 0.0026]

Excess share = α + β · Excess market + ε

(capm.xlsx)

excess market   excess company

   0.08            0.09
   0.06            0.03
   0.02            0.02
   0.02           -0.02
  -0.03           -0.02
   0.01            0.01
  -0.01           -0.02
  -0.03           -0.02
   0.00            0.00
  -0.03           -0.01
  -0.01            0.00
   0.04            0.03
   0.01            0.03
  -0.02            0.00
SUMMARY OUTPUT

Regression Statistics
Multiple R         0.846514
R Square           0.716586
Adjusted R Square  0.692968
Standard Error     0.016815
Observations       14

ANOVA
            df   SS        MS        F         Significance F
Regression  1    0.008579  0.008579  30.34084  0.000134465
Residual    12   0.003393  0.000283
Total       13   0.011971

              Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%  Lower 90.0%   Upper 90.0%
Intercept     0.002637      0.004621        0.570526  0.578848  -0.007432405  0.0127056  -0.005599936  0.0108731
X Variable 1  0.755344      0.13713         5.508252  0.000134  0.456564681   1.0541242  0.510940035   0.9997488

[Scatter plot with fitted line y = 0.7553x + 0.0026]
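The β of the share can be reproduced directly from the 14 excess-return pairs; a short check with numpy:

```python
import numpy as np

# Excess market return (x) and excess company return (y), 14 pairs
x = np.array([0.08, 0.06, 0.02, 0.02, -0.03, 0.01, -0.01,
              -0.03, 0.00, -0.03, -0.01, 0.04, 0.01, -0.02])
y = np.array([0.09, 0.03, 0.02, -0.02, -0.02, 0.01, -0.02,
              -0.02, 0.00, -0.01, 0.00, 0.03, 0.03, 0.00])

beta, alpha = np.polyfit(x, y, 1)       # least squares: y = beta * x + alpha
print(round(beta, 4), round(alpha, 4))  # 0.7553 0.0026
```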

Application of simple linear regression:

2. The money market rate (time series)

R_{t+1} = μ + ρ(R_t − μ) + ε_{t+1}
        = c + ρ R_t + ε_{t+1}   with   c = μ(1 − ρ)        (1)

(longtermrate.xlsx)

period   interest rate        y = R_{t+1}   x = R_t
1        7.229                7.725         7.229
2        7.725                7.671         7.725
3        7.671                8.037         7.671
4        8.037                7.516         8.037
5        7.516                6.996         7.516
6        6.996                6.719         6.996
7        6.719                7.056         6.719
8        7.056                7.243         7.056
9        7.243                7.109         7.243

ρ = 0.9075
c = 0.5922

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.905166
R Square           0.819326
Adjusted R Square  0.816817
Standard Error     0.501022
Observations       74

R_{t+1} = c + ρ R_t + ε_{t+1}

ANOVA
            df   SS        MS        F         Significance F
Regression  1    81.9613   81.9613   326.5086  1.83E-28
Residual    72   18.07368  0.251023
Total       73   100.035

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 90.0%  Upper 90.0%
Intercept     0.592209      0.338067        1.751747  0.084075  -0.08172   1.266134   0.028889     1.155528
X Variable 1  0.907522      0.050224        18.06955  1.83E-28  0.807403   1.007641   0.823834     0.99121

Substitution in c = μ(1 − ρ) gives:
0.5922 = μ(1 − 0.9075)
Hence μ = 6.4038
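Solving for the long-run mean μ with the unrounded Excel coefficients is a one-line check:

```python
# Long-run mean of the AR(1) model: from c = mu * (1 - rho) it follows
# that mu = c / (1 - rho). Unrounded coefficients from the Excel output:
c = 0.592209
rho = 0.907522
mu = c / (1 - rho)
print(round(mu, 4))   # 6.4038
```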

Application of simple linear regression:

3. Difference in income (dummy variable)
Difference in income male and female
Let y denote the income, and introduce dummy variable x
representing male if x = 1 and female if x = 0
Regression model: y = β0 + β1 x + ε
Observe that interpretation of slope β1 changes!!
β1 = μ0 − μ1 with μ0 the mean income of men and
μ1 the mean income of women
Test mean income men higher than mean income women
H0 : μ0 − μ1 = 0
Ha : μ0 − μ1 > 0
(incomedifference2012.xlsx)

year   average income   average income
       single man       single woman
       (1 000 euro)     (1 000 euro)
2000   15.4             13.9
2001   17.2             15.1
2002   17.6             15.8
2003   17.4             16
2004   17.6             16.2
2005   17.8             16.5
2006   18.9             17.1
2007   19.7             17.7
2008   20.2             18
2009   20.1             18.2
2010   20               18.1

SUMMARY OUTPUT

 y (income)   x (man = 1 / woman = 0)
 15.4         1
 17.2         1
 17.6         1
 17.4         1
 17.6         1
 17.8         1
 18.9         1
 19.7         1
 20.2         1
 20.1         1
 20           1
 13.9         0
 15.1         0
 15.8         0
 16           0
 16.2         0
 16.5         0
 17.1         0
 17.7         0
 18           0
 18.2         0
 18.1         0

Regression Statistics
Multiple R         0.53318
R Square           0.284281
Adjusted R Square  0.248495
Standard Error     1.459919
Observations       22

ANOVA
            df   SS        MS        F         Significance F
Regression  1    16.93136  16.93136  7.943911  0.010614
Residual    20   42.62727  2.131364
Total       21   59.55864

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept     16.6          0.440182        37.71166  4.67E-20  15.6818    17.5182    15.6818      17.5182
X Variable 1  1.754545      0.622512        2.818495  0.010614  0.456009   3.053082   0.456009     3.053082

so μ0 − μ1 > 0, so μ0 > μ1

(significance level of 0.05)

Conclusion: mean income single men > mean income single women
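The dummy-variable interpretation can be verified directly: for a single dummy regressor, the least squares intercept is the mean income of the women (x = 0) and the slope is the difference of the two group means. A pure-Python check on the 22 observations:

```python
# Average incomes (x 1000 euro): single men (x = 1) and single women (x = 0)
men = [15.4, 17.2, 17.6, 17.4, 17.6, 17.8, 18.9, 19.7, 20.2, 20.1, 20]
women = [13.9, 15.1, 15.8, 16, 16.2, 16.5, 17.1, 17.7, 18, 18.2, 18.1]

mean_men = sum(men) / len(men)        # mu_0
mean_women = sum(women) / len(women)  # mu_1

# For one dummy regressor the least squares fit reduces to:
intercept = mean_women                # 16.6, as in the Excel output
slope = mean_men - mean_women         # 1.754545..., as in the Excel output
print(round(intercept, 4), round(slope, 6))   # 16.6 1.754545
```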

Executing a multiple linear regression

(to investigate the relation between x1, x2, ..., xk
and y):
1. assume multiple regression model can be used
2. estimate β0, β1, ..., βk with least square
method using a sample
3. test or make confidence intervals to check
contribution independent variables
4. use other relationship tests
5. if model is ok, use it for predictions

Multiple regression: a complete example

y = revenue sales
x1 = number of employees
x2 = exp. R&D

    y      x1      x2
 10012   50.24   1072
   326    1.44     20
 13376   64.71   1354
 13767   49.14   1199
   662    3.61     26
   857    2.84    503
  1259    7.89     64
 18842   82.3    1634
  6763   26.8    4239
 16681   45.2    5269
  7094   35      3383
 10021   43.8    3472
  5142   28      1621
  5104   20.1    2098
  7039   37      2006

Step 1. We assume a linear relation between
x1, x2 and y. Hence, we use the model
y = β0 + β1 x1 + β2 x2 + ε
Step 2. Determine least square plane

              Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%
Intercept     -754.711      895.5027        -0.84278  0.415835  -2705.843455  1196.422214
X Variable 1  217.493       21.96107        9.903567  3.98E-07  169.6438844   265.3420219
X Variable 2  0.713124      0.327711        2.176075  0.050246  -0.000897191  1.427144982

ŷ = -754.711 + 217.493 x1 + 0.713 x2
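The least square plane can be reproduced with any multiple-regression routine; a sketch with numpy's least squares solver on the 15 sample rows:

```python
import numpy as np

# Revenue (y), number of employees (x1) and R&D expenses (x2), 15 rows
y = np.array([10012, 326, 13376, 13767, 662, 857, 1259, 18842,
              6763, 16681, 7094, 10021, 5142, 5104, 7039], dtype=float)
x1 = np.array([50.24, 1.44, 64.71, 49.14, 3.61, 2.84, 7.89, 82.3,
               26.8, 45.2, 35, 43.8, 28, 20.1, 37])
x2 = np.array([1072, 20, 1354, 1199, 26, 503, 64, 1634,
               4239, 5269, 3383, 3472, 1621, 2098, 2006], dtype=float)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 3))   # intercept, then the x1 and x2 coefficients
```

If the data are transcribed exactly, this reproduces (-754.711, 217.493, 0.713) up to rounding.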

Step 3. Is there a linear relation between


x1, x2 and y?
F test (complete model)
t test (on coefficients)
coefficient of determination R2

3a. Is the complete model valid?

Hypothesis
H0 : β1 = β2 = 0
Ha : at least one of the parameters is not zero
use F-test (with significance level of 5 %)

ANOVA
            df   SS        MS        F         Significance F
Regression  2    4.52E+08  2.26E+08  68.24643  2.78519E-07
Residual    12   39727837  3310653
Total       14   4.92E+08

Because the p-value is smaller than 0.05, we reject H0

Step 3a showed that model is useful

Now do some t-tests on
the individual coefficients of the variables
H0 : β1 = 0        G0 : β2 = 0
Ha : β1 ≠ 0        Ga : β2 ≠ 0
Test at a level of significance of 0.05

              Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%
Intercept     -754.711      895.5027        -0.84278  0.415835  -2705.843455  1196.422214
X Variable 1  217.493       21.96107        9.903567  3.98E-07  169.6438844   265.3420219
X Variable 2  0.713124      0.327711        2.176075  0.050246  -0.000897191  1.427144982

For x1 the p-value (3.98E-07) is smaller than 0.05:
reject H0, hence x1 contributes to the
explanation of y.
For x2 the p-value (0.050246) is larger than 0.05, so
we cannot reject G0. So, it is not clear whether
x2 contributes to the explanation of y.

How much of the variability of y is explained
by the model?

Regression Statistics
Multiple R         0.958743
R Square           0.919188
Adjusted R Square  0.905719
Standard Error     1819.52
Observations       15

R² = 0.9192
Hence, 91.92 % of sample variation in y can
be explained by the linear model

Step 4. use model to estimate and predict

Question: What are the revenues for a
company that has 50 employees (x1)
and R&D expenses of 1000 million dollar (x2)?

We use SPSS, which reports for every row the 95 % confidence
interval (lci, uci) for the mean response and the 95 % prediction
interval (lpi, upi); the last row is the new point x1 = 50, x2 = 1000:

    y     x1    x2      lci       uci       lpi       upi
 10012    50   1072   9398.11  12475.08   6684.15  15189.05
   326     1     20  -2334.43   1479.91  -4826.54   3972.03
 13376    65   1354  12322.22  16247.43   9861.22  18708.42
 13767    49   1199   9332.70  12243.15   6564.88  15010.97
   662     4     26  -1801.68   1899.64  -4326.10   4424.06
   857     3    503  -1532.50   1975.84  -4113.48   4556.82
  1259     8     64   -735.34   2749.24  -3323.40   5337.30
 18842    82   1634  15688.44  20931.96  13557.30  23063.10
  6763    27   4239   6000.59  10193.46   3612.45  12581.61
 16681    45   5269  10328.59  15338.24   8144.01  17522.83
  7094    35   3383   7799.05  10741.02   5041.54  13498.54
 10021    44   3472   9764.18  12730.70   7014.66  15480.23
  5142    28   1621   5438.21   7543.91   2389.24  10592.88
  5104    20   2098   3870.05   6356.00    958.34   9267.71
  7039    37   2006   7684.96   9761.15   4624.99  12821.11
     -    50   1000   9272.79  12393.32   6572.67  15093.44

95% PI for the new point: (6572.67, 15093.44)
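The point prediction behind that interval is just the fitted least square plane evaluated at (x1, x2) = (50, 1000); a quick check:

```python
# Fitted least square plane from the Excel output
b0, b1, b2 = -754.711, 217.493, 0.713124

# Point prediction for x1 = 50 employees and x2 = 1000 in R&D expenses
y_hat = b0 + b1 * 50 + b2 * 1000
print(round(y_hat, 2))   # 10833.06, the midpoint of the 95% PI above
```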
