
Chapter 17

Simple Linear Regression and Correlation

Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc.

17.1

Linear Regression Analysis


Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
Dependent variable: denoted Y
Independent variables: denoted X1, X2, ..., Xk
If we only have ONE independent variable, the model is

y = β0 + β1x + ε

which is referred to as simple linear regression. We would be interested in estimating β0 and β1 from the data we collect.

17.2

Linear Regression Analysis


Variables:
X = Independent Variable (we provide this)
Y = Dependent Variable (we observe this)
Parameters:
β0 = Y-Intercept
β1 = Slope
ε ~ Normal Random Variable (μ = 0, σ = ???) [Noise]

17.3

Effect of Larger Values of σ

House Price = 25,000 + 75(Size) + ε

[Figure: house price vs. house size, intercept at 25K$, contrasting lower vs. higher variability around the line]

Same square footage, but different price points (e.g. décor options, cabinet upgrades, lot location).
17.4

Theoretical Linear Model


17.5

1. Building the Model: Collect Data

Test 2 Grade = β0 + β1·(Test 1 Grade) + ε
From Data:
Estimate β0
Estimate β1
Estimate σ

17.6

Linear Regression Analysis

[Figures: "Plot of Fitted Model" scatterplots of Test 2 vs. Test 1 scores and Test B2 vs. Test B1 scores, each with the fitted regression line]

17.7

Correlation Analysis: −1 < ρ < 1

If we are interested only in determining whether a relationship exists, we employ correlation analysis.
Example: students' height and weight.

[Figures: four "Plot of Height vs Weight" scatterplots]

17.8

Correlation Analysis: −1 < ρ < 1

If the correlation coefficient is close to +1, that means you have a strong positive relationship.
If the correlation coefficient is close to −1, that means you have a strong negative relationship.
If the correlation coefficient is close to 0, that means you have no correlation.
WE HAVE THE ABILITY TO TEST THE HYPOTHESIS
H0: ρ = 0
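As a sketch, the correlation coefficient and this test can be computed directly; the height/weight numbers below are made up for illustration, not the text's data:

```python
import numpy as np

# Hypothetical height (ft) / weight (lb) sample -- illustration only
weight = np.array([120.0, 140, 155, 170, 185, 200, 215, 230])
height = np.array([5.2, 5.4, 5.5, 5.8, 5.9, 6.1, 6.2, 6.4])
n = len(weight)

r = np.corrcoef(weight, height)[0, 1]        # sample correlation coefficient

# Test H0: rho = 0 with t = r * sqrt((n - 2) / (1 - r^2)), df = n - 2
t = r * np.sqrt((n - 2) / (1 - r**2))
print(f"r = {r:.3f}, t = {t:.2f}")           # |t| > t_crit  =>  reject H0
```

Here r is close to +1 and t lands far beyond the two-tail critical value t(.025, df=6) = 2.447, so we would reject H0: ρ = 0.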

17.9

Regression: Model Types (X = size of house, Y = cost of house)

Deterministic Model: an equation or set of equations that allow us to fully determine the value of the dependent variable from the values of the independent variables.
y = $25,000 + ($75/ft²)·x
Area of a circle: A = π·r²
Probabilistic Model: a method used to capture the randomness that is part of a real-life process.
y = 25,000 + 75x + ε
E.g. do all houses of the same size (measured in square feet) sell for exactly the same price?
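The distinction can be seen by simulating the probabilistic model; the 25,000 and 75 come from the slide, while the noise standard deviation of $8,000 is an assumed value for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

size = 2000                                      # five houses, all 2000 sq ft
eps = rng.normal(loc=0.0, scale=8000.0, size=5)  # assumed sigma = $8,000
price = 25_000 + 75 * size + eps                 # y = 25,000 + 75x + epsilon

print(price)   # same square footage, five different selling prices
```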

17.10

Simple Linear Regression Model


Meaning of β0 and β1:
β1 > 0 [positive slope]; β1 < 0 [negative slope]
β1 = slope (= rise/run)
β0 = y-intercept

[Figure: a line over the x-axis annotated with rise, run, slope, and y-intercept]

17.11

Which line has the best fit to the data?

[Figure: scatterplot with three candidate lines, each marked "?"]

17.12

Estimating the Coefficients


In much the same way we base estimates of μ on x̄, we estimate β0 with b0 and β1 with b1, the y-intercept and slope (respectively) of the least squares or regression line given by:

ŷ = b0 + b1x

(This is an application of the least squares method and it produces a straight line that minimizes the sum of the squared differences between the points and the line.)

17.13

Least Squares Line


s
ce
n
ere

diff
d
these differences
are
u
are called
sq line
e
e
th
residuals or
of nd th
um ts a
errors
s
the poin
s
ize the
m
i
en
n
i
e
??
?
m
w
e
t
om slop
ne be
i
r
l
f
e
is
or
f
m
Th
o
c
14
1
n
.
o
ati and 2
u
eq pt
e
lin terce
e
th y-in
d
i
d
ra
e
r
o
f
he 34
w
ut et .9
b
eg
w
d
i
wd
o
H
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc.

17.14

Least Squares Line [sure glad we have computers now!]

The coefficients b1 and b0 for the least squares line

ŷ = b0 + b1x

are calculated as:

b1 = s_xy / s_x²   and   b0 = ȳ − b1·x̄

where s_xy = Σ(xi − x̄)(yi − ȳ)/(n − 1) and s_x² = Σ(xi − x̄)²/(n − 1).

17.15

Least Squares Line: see if you can estimate the y-intercept and slope from this data.

Recall: Statistics turns Data into Information.

[Figure: table and scatterplot of the data points]

y = .934 + 2.114x
17.16

Least Squares Line: see if you can estimate the y-intercept and slope from this data.

[Figure: the data with the fitted least squares line]

17.17

Excel: Data Analysis - Regression


17.18

Excel: Plotted Regression Model (you will need to play around with this to get the plot to look good)

[Figure: Excel scatter chart with the fitted regression line]

17.19

Required Conditions
For these regression methods to be valid the following four conditions for the error variable (ε) must be met:
The probability distribution of ε is normal.
The mean of the distribution is 0; that is, E(ε) = 0.
The standard deviation of ε is σε, which is a constant regardless of the value of x.
The value of ε associated with any particular value of y is independent of ε associated with any other value of y.

17.20

Assessing the Model


The least squares method will always produce a straight line, even if there is no relationship between the variables, or if the relationship is something other than linear.
Hence, in addition to determining the coefficients of the least squares line, we need to assess it to see how well it fits the data. We'll see these evaluation methods now. They're based on what is called the sum of squares for error (SSE).

17.21

Sum of Squares for Error (SSE: another thing to calculate)

The sum of squares for error is calculated as:

SSE = Σ(yi − ŷi)²

and is used in the calculation of the standard error of estimate:

sε = sqrt( SSE / (n − 2) )

If sε is zero, all the points fall on the regression line.
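A sketch of both quantities on made-up data (np.polyfit stands in for the coefficient formulas from the earlier slide):

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5, 6])
y = np.array([3.1, 5.0, 7.2, 8.9, 11.1, 12.8])

b1, b0 = np.polyfit(x, y, 1)           # least squares slope and intercept
y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)         # SSE: sum of squared residuals
s_eps = np.sqrt(sse / (len(x) - 2))    # standard error of estimate
print(f"SSE = {sse:.4f}, s_eps = {s_eps:.4f}")
```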


17.22

Standard Error

If sε is small, the fit is excellent and the linear model should be used for forecasting. If sε is large, the model is poor.
But what is small and what is large?

17.23

Standard Error
Judge the value of sε by comparing it to the sample mean of the dependent variable (ȳ).
In this example,
sε = .3265 and
ȳ = 14.841
so (relatively speaking) it appears to be small, hence our linear regression model of car price as a function of odometer reading is good.

17.24

Testing the Slope (Excel output does this for you)

If no linear relationship exists between the two variables, we would expect the regression line to be horizontal, that is, to have a slope of zero.
We want to see if there is a linear relationship, i.e. we want to see if the slope (β1) is something other than zero. Our research hypothesis becomes:
H1: β1 ≠ 0
Thus the null hypothesis becomes:
H0: β1 = 0
Already discussed!

17.25

Testing the Slope


We can implement this test statistic to test our hypotheses:

t = (b1 − β1) / s_b1,   with H0: β1 = 0

where s_b1 is the standard deviation of b1, defined as:

s_b1 = sε / sqrt( (n − 1)·s_x² )

If the error variable (ε) is normally distributed, the test statistic has a Student t distribution with n − 2 degrees of freedom. The rejection region depends on whether or not we're doing a one- or two-tail test (a two-tail test is most typical).
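A sketch of the slope test on made-up data, using the s_b1 formula above; the critical value 2.447 is t(.025) for df = n − 2 = 6:

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([3.1, 5.0, 7.2, 8.9, 11.1, 12.8, 15.2, 16.9])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
s_eps = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
s_x2 = x.var(ddof=1)                        # sample variance of x
s_b1 = s_eps / np.sqrt((n - 1) * s_x2)      # standard deviation of b1

t = b1 / s_b1                               # test statistic for H0: beta1 = 0
print(f"b1 = {b1:.3f}, t = {t:.1f} (two-tail critical value: 2.447)")
```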

17.26

Example 17.4
Test to determine if the slope is significantly different from 0 (at the 5% significance level).
We want to test:
H1: β1 ≠ 0
H0: β1 = 0
(if the null hypothesis is true, no linear relationship exists)
The rejection region is:

|t| > t(α/2, n−2)

OR check the p-value.

17.27

Example 17.4 (COMPUTE)

We can compute t manually or refer to our Excel output.

We see that the t statistic for odometer (i.e. the slope, b1) is −13.49; its absolute value is greater than tCritical = 1.984. We also note that the p-value is 0.000.
There is overwhelming evidence to infer that a linear relationship between odometer reading and price exists.

17.28

Testing the Slope


Wecanalsoestimate(tosomelevelofconfidence)and
intervalfortheslopeparameter,.
Recallthatyourestimateforisb1.
Theconfidenceintervalestimatorisgivenas:

Hence:
Thatis,weestimatethattheslopecoefficientliesbetween
.0768and.0570
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc.

17.29

Coefficient of Determination
Tests thus far have shown if a linear relationship exists; it is also useful to measure the strength of the relationship. This is done by calculating the coefficient of determination, R².

R² = 1 − SSE/SST

The coefficient of determination is the square of the coefficient of correlation (r), hence R² = (r)².
(r will be computed shortly, and this is true for models with only 1 independent variable.)

17.30

Coefficient of Determination
R² has a value of .6483. This means 64.83% of the variation in the auction selling prices (y) is explained by your regression model. The remaining 35.17% is unexplained, i.e. due to error.
Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions.
In general the higher the value of R², the better the model fits the data.
R² = 1: Perfect match between the line and the data points.
R² = 0: There is no linear relationship between x and y.
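R² can be computed from SSE and SST and cross-checked against r² (made-up data for illustration):

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([3.1, 5.0, 7.2, 8.9, 11.1, 12.8, 15.2, 16.9])

b1, b0 = np.polyfit(x, y, 1)
sse = np.sum((y - (b0 + b1 * x)) ** 2)   # unexplained variation
sst = np.sum((y - y.mean()) ** 2)        # total variation in y

r_sq = 1 - sse / sst                     # coefficient of determination
r = np.corrcoef(x, y)[0, 1]              # coefficient of correlation
print(f"R^2 = {r_sq:.4f}, r^2 = {r**2:.4f}")   # equal with one predictor
```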

17.31

Remember Excel's Output

An analysis of variance (ANOVA) table for the simple linear regression model can be given by:

Source       degrees of freedom   Sums of Squares        Mean Squares        F-Statistic
Regression   1                    SSR                    MSR = SSR/1         F = MSR/MSE
Error        n − 2                SSE                    MSE = SSE/(n − 2)
Total        n − 1                Variation in y (SST)

17.32

Using the Regression Equation


We could use our regression equation:
y = 17.25 − .0669x
to predict the selling price of a car with 40 (40,000) miles on it:
y = 17.25 − .0669x = 17.25 − .0669(40) = 14.574
We call this value ($14,574) a point prediction (estimate). Chances are though the actual selling price will be different, hence we can estimate the selling price in terms of a confidence interval.

17.33

Prediction Interval
The prediction interval is used when we want to predict one particular value of the dependent variable, given a specific value of the independent variable:

ŷ ± t(α/2, n−2) · sε · sqrt( 1 + 1/n + (x_g − x̄)² / ((n − 1)·s_x²) )

(x_g is the given value of x we're interested in)

17.34

Confidence Interval Estimator for Mean of Y


Theconfidenceintervalestimatefortheexpectedvalueofy
(MeanofY)isusedwhenwewanttopredictanintervalwe
areprettysurecontainsthetrueregressionline.Inthis
case,weareestimatingthemeanofygivenavalueofx:

(Technicallythisformulaisusedforinfinitelylarge
populations.However,wecaninterpretourproblemas
attemptingtodeterminetheaveragesellingpriceofallFord
Tauruses,allwith40,000milesontheodometer)
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc.

17.35

What's the Difference?

Prediction Interval: contains the extra "1" under the square root. Used to estimate the value of one value of y (at a given x).
Confidence Interval: no "1". Used to estimate the mean value of y (at a given x).

The confidence interval estimate of the expected value of y will be narrower than the prediction interval for the same given value of x and confidence level. This is because there is less error in estimating a mean value as opposed to predicting an individual value.
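A sketch of both intervals at one x value (made-up data; x_g = 5 and the 95% critical value t(.025, 6) = 2.447 are assumptions). Note the extra 1 under the prediction interval's square root:

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([3.1, 5.0, 7.2, 8.9, 11.1, 12.8, 15.2, 16.9])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
s_eps = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
t_crit = 2.447                              # t(.025) for df = n - 2 = 6

x_g = 5.0                                   # the given value of x
y_hat = b0 + b1 * x_g
core = (x_g - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1))

pi_half = t_crit * s_eps * np.sqrt(1 + 1 / n + core)  # prediction interval
ci_half = t_crit * s_eps * np.sqrt(1 / n + core)      # CI for mean of y
print(f"PI: {y_hat:.2f} +/- {pi_half:.2f}   CI: {y_hat:.2f} +/- {ci_half:.2f}")
```

The CI half-width is always the smaller of the two, matching the slide's point.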

17.36

Regression Diagnostics
There are three conditions that are required in order to perform a regression analysis. These are:
The error variable must be normally distributed,
The error variable must have a constant variance, &
The errors must be independent of each other.
How can we diagnose violations of these conditions?
Residual Analysis, that is, examine the differences between the actual data points and those predicted by the linear equation.

17.37

Nonnormality
We can take the residuals and put them into a histogram to visually check for normality:
we're looking for a bell-shaped histogram with the mean close to zero [our old test for normality].
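A numeric sketch of that check on simulated data (the data-generating values are arbitrary): least squares residuals average to zero by construction, and the histogram counts should pile up in the middle bins when the errors are normal:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 60)
y = 2.0 + 1.5 * x + rng.normal(0.0, 1.0, size=x.size)  # simulated linear data

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

counts, edges = np.histogram(residuals, bins=7)
print("mean of residuals:", round(residuals.mean(), 9))  # essentially 0
print("histogram counts: ", counts)   # roughly bell-shaped for normal errors
```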

17.38

Heteroscedasticity
When the requirement of a constant variance is violated, we have a condition of heteroscedasticity.

We can diagnose heteroscedasticity by plotting the residuals against the predicted values of y.

17.39

Heteroscedasticity
If the variance of the error variable (ε) is not constant, then we have heteroscedasticity. Here's the plot of the residuals against the predicted value of y:

[Figure: residuals vs. predicted y]

There doesn't appear to be a change in the spread of the plotted points, therefore no heteroscedasticity.

17.40

Nonindependence of the Error Variable

If we were to observe the auction price of cars every week for, say, a year, that would constitute a time series.

When the data are time series, the errors often are correlated. Error terms that are correlated over time are said to be autocorrelated or serially correlated.
We can often detect autocorrelation by graphing the residuals against the time periods. If a pattern emerges, it is likely that the independence requirement is violated.

17.41

Nonindependence of the Error Variable

Patterns in the appearance of the residuals over time indicate that autocorrelation exists:

[Figure, left: note the runs of positive residuals, replaced by runs of negative residuals]
[Figure, right: note the oscillating behavior of the residuals around zero]

17.42

Outliers (problem worked earlier)

An outlier is an observation that is unusually small or unusually large.
E.g. our used car example had odometer readings from 19.1 to 49.2 thousand miles. Suppose we have a value of only 5,000 miles (i.e. a car driven by an old person only on Sundays): this point is an outlier.

17.43

Outliers
Possible reasons for the existence of outliers include:
There was an error in recording the value
The point should not have been included in the sample
Perhaps the observation is indeed valid.
Outliers can be easily identified from a scatter plot.
If the absolute value of the standard residual is > 2, we suspect the point may be an outlier and investigate further.
They need to be dealt with since they can easily influence the least squares line.
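That screen can be sketched as below (made-up odometer/price data in thousands, with one point deliberately pushed off the line; this uses the crude residual/sε standardization, while fully studentized residuals would also adjust for leverage):

```python
import numpy as np

# Odometer (thousands of miles) and price (thousands of $); made-up numbers,
# with the last point deliberately priced well above the linear trend.
odometer = np.array([19.1, 24.0, 28.5, 33.2, 37.8, 42.0, 46.1, 49.2, 35.0])
price = np.array([16.0, 15.5, 15.1, 14.6, 14.2, 13.8, 13.4, 13.1, 16.5])

b1, b0 = np.polyfit(odometer, price, 1)
resid = price - (b0 + b1 * odometer)
s_eps = np.sqrt(np.sum(resid**2) / (len(odometer) - 2))

std_resid = resid / s_eps                     # simple standardized residuals
suspects = np.where(np.abs(std_resid) > 2)[0]
print("suspect outlier indices:", suspects)   # flags the last point
```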

17.44

Procedure for Regression Diagnostics

1. Develop a model that has a theoretical basis.
2. Gather data for the two variables in the model.
3. Draw the scatter diagram to determine whether a linear model appears to be appropriate. Identify possible outliers.
4. Determine the regression equation.
5. Calculate the residuals and check the required conditions.
6. Assess the model's fit.
7. If the model fits the data, use the regression equation to predict a particular value of the dependent variable and/or estimate its mean.
