
Chapter 8.

SIMPLE LINEAR REGRESSION

8.1 Introduction

Regression analysis is the branch of statistics that deals with the relationship between two or more
variables. The simplest relationship between x and y is the linear relationship:

y = β0 + β1∙x

β0 : the y-intercept (the value of y when x = 0)
β1 : dy/dx (the slope of the line)

Definition: The model equation

Y = β0+ β1∙x + ε

is the equation that relates the random response variable Y to the predictor value x. The
random variable ε is referred to as the random error term, and ε ~ N(0, σ²).

The line defined by the equation y = β0 + β1∙x is referred to as the population (or true)
regression line. For a given value of x, the expected value of the variable Y is

E[Y] = E[β0 + β1∙x + ε] = β0 + β1∙x + E[ε] = β0 + β1∙x

Thus, the population regression line is the line defining the mean of Y (μY) for a given
value of x. The variance of Y is

σY² = Var[β0 + β1∙x + ε] = Var[ε] = σ²

Thus, for any given predictor value x, Y ~ N(β0 + β1∙x, σ²).
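This distributional claim is easy to check by simulation: for a fixed x, repeated draws of Y scatter normally around the population line. A minimal sketch (the parameter values β0 = 2, β1 = 0.5, σ = 1 are illustrative, not from the text):

```python
import random

# Simulating the model Y = β0 + β1·x + ε with ε ~ N(0, σ²).
# Parameter values below are illustrative only.
beta0, beta1, sigma = 2.0, 0.5, 1.0
rng = random.Random(0)

x = 4.0
ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for _ in range(10_000)]

mean_y = sum(ys) / len(ys)  # should approach E[Y] = β0 + β1·x = 4.0
var_y = sum((y - mean_y) ** 2 for y in ys) / (len(ys) - 1)  # ≈ σ² = 1.0
print(round(mean_y, 2), round(var_y, 2))
```

With 10,000 draws the sample mean and variance land close to the theoretical values β0 + β1∙x and σ².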

8.2 Estimation of Model Parameters

Step 1. Provide a scatter plot for the sample set (xi , yi)
Step 2. If a linear trend is observed, estimate parameters of the model employing the
principle of least squares:

For a given sample set (xi , yi) of size n, the deviation of yi from the line y = β0 + β1∙x is

yi − (β0 + β1∙xi)

The sum of squared deviations is

f(β0, β1) = Σi=1..n (yi − β0 − β1∙xi)²
The point estimates β̂0 and β̂1 that minimize f(β0, β1) are called the least-squares
estimates of β0 and β1, respectively. The estimated regression line (or least-squares
line) is

y = β̂0 + β̂1∙x

Thus,

∂f/∂β̂0 = −2 Σ (yi − β̂0 − β̂1∙xi) = 0

∂f/∂β̂1 = −2 Σ (yi − β̂0 − β̂1∙xi)∙xi = 0

Hence,

β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = SXY / SXX

β̂0 = (Σyi − β̂1∙Σxi) / n = ȳ − β̂1∙x̄

The computational formulae for SXY and SXX are:

SXY = Σxiyi − (Σxi)(Σyi)/n
SXX = Σxi² − (Σxi)²/n

The computations of β̂0 and β̂1 require only the summary statistics Σxi, Σyi, Σxi², and Σxiyi.
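These formulae translate directly into code. A minimal sketch computing β̂0 and β̂1 from the raw sums (the function name and the tiny data set are illustrative):

```python
# Least-squares estimates from the summary statistics
# Σx, Σy, Σx², Σxy, exactly as in the computational formulae above.
def least_squares(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sx * sx / n              # Sxx
    sxy = sum(x * y for x, y in zip(xs, ys)) - sx * sy / n  # Sxy
    b1 = sxy / sxx            # slope:     β̂1 = Sxy / Sxx
    b0 = (sy - b1 * sx) / n   # intercept: β̂0 = ȳ − β̂1·x̄
    return b0, b1

# Data lying exactly on y = 1 + 2x is recovered exactly:
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # → 1.0 2.0
```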

Definition: The predicted values ŷ1, ŷ2, …, ŷn are obtained through successive substitutions
of x1, x2, …, xn into the estimated regression line:

ŷi = β̂0 + β̂1∙xi

The residuals are the differences y1 − ŷ1, y2 − ŷ2, …, yn − ŷn.

Definition: The error sum of squares, denoted by SSE, is

SSE = Σ(yi − ŷi)² = Σyi² − β̂0∙Σyi − β̂1∙Σxiyi

The estimate of σ² is

s² = SSE / (n − 2)

Definition: The total sum of squares, denoted by SST, is

SST = SYY = Σ(yi − ȳ)² = Σyi² − (Σyi)²/n

Definition: The coefficient of determination, denoted by r², is given by:

r² = 1 − SSE/SST

r² is interpreted as the proportion of observed y variation that can be explained by the
simple linear regression model. The value of r² can vary in the range [0, 1]. The higher
the value of r², the more successful the simple linear regression model is in explaining
the variation.

Definition: The sample correlation coefficient for the n sample pairs (x1, y1), (x2, y2), …,
(xn, yn), is

r = Sxy / √(Sxx∙Syy)

 The value of r is independent of the units of x and y.
 −1.0 ≤ r ≤ 1.0
 r = 1.0 iff yi = b0 + b1∙xi with b1 > 0
 (r)² = r²
 r ≈ 0 is evidence of a lack of linear relationship.
0.0 ≤ |r| ≤ ~0.5: weak relationship (0.0 ≤ r² ≤ ~0.3)
~0.8 ≤ |r| ≤ 1.0: strong relationship (~0.6 ≤ r² ≤ 1.0)

Step 3. Compute r or r² to determine the strength of the relationship.

Step 4. Provide a residuals plot to check whether the yi − ŷi values are consistent with
the model assumption ε ~ N(0, σ²).

8.3 Regression with Transformed Variables

Relationship          Transformation              Linear Form

y = α∙e^(βx)          y′ = ln(y)                  y′ = ln(α) + βx
y = α∙x^β             y′ = log(y), x′ = log(x)    y′ = log(α) + βx′
y = α + β∙log(x)      x′ = log(x)                 y = α + βx′
y = α + β/x           x′ = 1/x                    y = α + βx′
AN EXAMPLE FOR SIMPLE REGRESSION ANALYSIS
DATA
i xi yi xi² yi² xi∙yi
1 200 114.8 40000 13179 22960
2 511 115.7 261121 13386 59123
3 543 140 294849 19600 76020
4 758 194.3 574564 37752 147279
5 814 90.6 662596 8208 73748
6 897 217.4 804609 47263 195008
7 975 228.5 950625 52212 222788
8 1183 201.3 1399489 40522 238138
9 1261 202.8 1590121 41128 255731
10 1344 318.6 1806336 101506 428198
11 1338 186.8 1790244 34894 249938
12 1571 302.4 2468041 91446 475070
13 1637 346.1 2679769 119785 566566
14 1702 373.4 2896804 139428 635527
15 1766 342.4 3118756 117238 604678
16 1757 382.6 3087049 146383 672228
17 1912 394.1 3655744 155315 753519
18 2077 439.9 4313929 193512 913672
19 2131 429 4541161 184041 914199
20 2236 482.6 4999696 232903 1079094
21 2550 395.4 6502500 156341 1008270
22 2491 466.9 6205081 217996 1163048
23 2522 461.4 6360484 212890 1163651
24 2528 531.2 6390784 282173 1342874
25 2644 569.7 6990736 324558 1506287
26 2879 619.9 8288641 384276 1784692
27 3005 726.8 9030025 528238 2184034
28 3112 646.6 9684544 418092 2012219
29 3217 643.6 10349089 414221 2070461
30 3318 590.6 11009124 348808 1959611
Σ 54879 11155.4 122746511 5077294 24778631

Scatter Plot
[Scatter plot of yi versus xi; x from 0 to 3500, y from 0 to 800, showing an increasing linear trend.]

Calculations:

Σxi = 54879    Σyi = 11155.4    Σxi² = 122746511    Σyi² = 5077294    Σxiyi = 24778631
(all sums over i = 1, …, 30)

x̄ = 54879/30 = 1829.3        ȳ = 11155.4/30 = 371.8

Sxx = Σxi² − (Σxi)²/n = 122746511 − 54879²/30 = 2.2356×10⁷

Syy = Σyi² − (Σyi)²/n = 5077294 − 11155.4²/30 = 9.2920×10⁵

Sxy = Σxiyi − (Σxi)(Σyi)/n = 24778631 − 54879∙11155.4/30 = 4.3721×10⁶

β̂1 = Sxy/Sxx = 4.3721×10⁶ / 2.2356×10⁷ = 0.19556

β̂0 = ȳ − β̂1∙x̄ = 371.8 − 0.19556∙1829.3 = 14.1

Estimated regression line: y = 14.1 + 0.196x

SSE = Σyi² − β̂0∙Σyi − β̂1∙Σxiyi = 5077294 − 14.1∙11155.4 − 0.19556∙24778631 = 74187

SST = Syy = 9.2920×10⁵

Coefficient of determination: r² = 1 − SSE/SST = 1 − 74187/(9.2920×10⁵) = 0.92

Sample correlation coefficient: r = √r² = 0.96 (alternatively, r = Sxy/√(Sxx∙Syy))
(conclusion: strong linear relationship between x and y)

s² = SSE/(n − 2) = 74187/28 = 2649.5        s = √2649.5 = 51.5

1  x  x 2 1 x  18292
sŶ  s    51.5  
n S xx 30 2.2356  10 7

The 95% confidence interval for  Y x* :

ˆ 0  ˆ 1 x*  t / 2 ,n  2  sŶ  14.1  0.196 x*  2.048  51.5 


1

x  1829 2
30 2.2356  10 7
[Plot of the sample points, the least-squares line, and the 95% CI bands; x from 0 to 3500, y from 0 to 800.]

Least-squares line: y = 14.1 + 0.196x        r = 0.96

PREDICTED VALUES AND RESIDUALS
i xi yi ŷi Residual: yi − ŷi sŶ
1 200 114.8 53.2 61.6 20.073
2 511 115.7 114.0 1.7 17.155
3 543 140 120.3 19.7 16.864
4 758 194.3 162.3 32.0 14.978
5 814 90.6 173.3 -82.7 14.508
6 897 217.4 189.5 27.9 13.832
7 975 228.5 204.8 23.7 13.222
8 1183 201.3 245.5 -44.2 11.740
9 1261 202.8 260.7 -57.9 11.251
10 1344 318.6 276.9 41.7 10.781
11 1338 186.8 275.8 -89.0 10.813
12 1571 302.4 321.3 -18.9 9.809
13 1637 346.1 334.2 11.9 9.628
14 1702 373.4 347.0 26.4 9.499
15 1766 342.4 359.5 -17.1 9.423
16 1757 382.6 357.7 24.9 9.431
17 1912 394.1 388.0 6.1 9.441
18 2077 439.9 420.3 19.6 9.777
19 2131 429 430.8 -1.8 9.955
20 2236 482.6 451.4 31.2 10.388
21 2550 395.4 512.8 -117.4 12.242
22 2491 466.9 501.3 -34.4 11.841
23 2522 461.4 507.3 -45.9 12.049
24 2528 531.2 508.5 22.7 12.090
25 2644 569.7 531.2 38.5 12.922
26 2879 619.9 577.1 42.8 14.795
27 3005 726.8 601.8 125.0 15.879
28 3112 646.6 622.7 23.9 16.832
29 3217 643.6 643.2 0.4 17.792
30 3318 590.6 663.0 -72.4 18.734
Residuals Plot
[Plot of the residuals yi − ŷi versus x; x from 0 to 3500, residuals from −150 to 150.]

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.959249948
R Square 0.920160462
Adjusted R Square 0.91730905
Standard Error 51.47349033
Observations 30

ANOVA
df SS MS F Significance F
Regression 1 855009.2689 855009.2689 322.7034339 6.64289E-17
Residual 28 74186.56579 2649.520207
Total 29 929195.8347

             Coefficients  Standard Error  t Stat       P-value      Lower 95%     Upper 95%
Intercept    14.1047458    22.020494       0.640528129  0.527037012  -31.00224201  59.2117336
X Variable 1 0.19556219    0.01088637      17.96394817  6.64289E-17  0.17326245    0.21786193
