
Chapter 17

Simple Linear Regression and Correlation

Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc.

17.1

Linear Regression Analysis


Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
Dependent variable: denoted Y
Independent variables: denoted X1, X2, ..., Xk
If we only have ONE independent variable, the model is

y = β0 + β1x + ε

which is referred to as simple linear regression. We would be interested in estimating β0 and β1 from the data we collect.

17.2

Linear Regression Analysis


Variables:
X = Independent Variable (we provide this)
Y = Dependent Variable (we observe this)
Parameters:
β0 = Y-Intercept
β1 = Slope
ε ~ Normal Random Variable (μ = 0, σ = ???) [Noise]

17.3

Effect of Larger Values of σ

House Price = 25,000 + 75(Size) + ε

[Figure: house price vs. house size, intercept at 25K$, contrasting lower vs. higher variability around the line]

Same square footage, but different price points (e.g. décor options, cabinet upgrades, lot location).
17.4

Theoretical Linear Model


17.5

1. Building the Model: Collect Data

Test 2 Grade = β0 + β1·(Test 1 Grade) + ε
From Data:
Estimate β0
Estimate β1
Estimate σ

17.6

Linear Regression Analysis

[Figures: "Plot of Fitted Model" scatterplots of Test 2 vs. Test 1 scores and Test B2 vs. Test B1 scores, each with the fitted regression line]

17.7

Correlation Analysis: −1 < ρ < 1

If we are interested only in determining whether a relationship exists, we employ correlation analysis.
Example: students' height and weight.

[Figures: four "Plot of Height vs Weight" scatterplots]

17.8

Correlation Analysis: −1 < ρ < 1

If the correlation coefficient is close to +1, that means you have a strong positive relationship.
If the correlation coefficient is close to −1, that means you have a strong negative relationship.
If the correlation coefficient is close to 0, that means you have no correlation.
WE HAVE THE ABILITY TO TEST THE HYPOTHESIS
H0: ρ = 0
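As a sketch, the correlation coefficient and this test can be computed directly; the height/weight numbers below are made up for illustration, not the text's data:

```python
import numpy as np

# Hypothetical height (ft) / weight (lb) sample -- illustration only
weight = np.array([120.0, 140, 155, 170, 185, 200, 215, 230])
height = np.array([5.2, 5.4, 5.5, 5.8, 5.9, 6.1, 6.2, 6.4])
n = len(weight)

r = np.corrcoef(weight, height)[0, 1]        # sample correlation coefficient

# Test H0: rho = 0 with t = r * sqrt((n - 2) / (1 - r^2)), df = n - 2
t = r * np.sqrt((n - 2) / (1 - r**2))
print(f"r = {r:.3f}, t = {t:.2f}")           # |t| > t_crit  =>  reject H0
```

Here r is close to +1 and t lands far beyond the two-tail critical value t(.025, df=6) = 2.447, so we would reject H0: ρ = 0.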

17.9

Regression: Model Types (X = size of house, Y = cost of house)

Deterministic Model: an equation or set of equations that allow us to fully determine the value of the dependent variable from the values of the independent variables.
y = $25,000 + ($75/ft²)·x
Area of a circle: A = π·r²
Probabilistic Model: a method used to capture the randomness that is part of a real-life process.
y = 25,000 + 75x + ε
E.g. do all houses of the same size (measured in square feet) sell for exactly the same price?
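The distinction can be seen by simulating the probabilistic model; the 25,000 and 75 come from the slide, while the noise standard deviation of $8,000 is an assumed value for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

size = 2000                                      # five houses, all 2000 sq ft
eps = rng.normal(loc=0.0, scale=8000.0, size=5)  # assumed sigma = $8,000
price = 25_000 + 75 * size + eps                 # y = 25,000 + 75x + epsilon

print(price)   # same square footage, five different selling prices
```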

17.10

Simple Linear Regression Model


Meaning of β0 and β1:
β1 > 0 [positive slope]; β1 < 0 [negative slope]
β1 = slope (= rise/run)
β0 = y-intercept

[Figure: a line over the x-axis annotated with rise, run, slope, and y-intercept]

17.11

Which line has the best fit to the data?

[Figure: scatterplot with three candidate lines, each marked "?"]

17.12

Estimating the Coefficients


In much the same way we base estimates of μ on x̄, we estimate β0 with b0 and β1 with b1, the y-intercept and slope (respectively) of the least squares or regression line given by:

ŷ = b0 + b1x

(This is an application of the least squares method and it produces a straight line that minimizes the sum of the squared differences between the points and the line.)

17.13

Least Squares Line


s
ce
n
ere

diff
d
these differences
are
u
are called
sq line
e
e
th
residuals or
of nd th
um ts a
errors
s
the poin
s
ize the
m
i
en
n
i
e
??
?
m
w
e
t
om slop
ne be
i
r
l
f
e
is
or
f
m
Th
o
c
14
1
n
.
o
ati and 2
u
eq pt
e
lin terce
e
th y-in
d
i
d
ra
e
r
o
f
he 34
w
ut et .9
b
eg
w
d
i
wd
o
H
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc.

17.14

Least Squares Line [sure glad we have computers now!]

The coefficients b1 and b0 for the least squares line

ŷ = b0 + b1x

are calculated as:

b1 = s_xy / s_x²   and   b0 = ȳ − b1·x̄

where s_xy = Σ(xi − x̄)(yi − ȳ)/(n − 1) and s_x² = Σ(xi − x̄)²/(n − 1).

17.15

Least Squares Line: see if you can estimate the y-intercept and slope from this data.

Recall: Statistics turns Data into Information.

[Figure: table and scatterplot of the data points]

y = .934 + 2.114x
17.16

Least Squares Line: see if you can estimate the y-intercept and slope from this data.

[Figure: the data with the fitted least squares line]

17.17

Excel: Data Analysis - Regression


17.18

Excel: Plotted Regression Model (you will need to play around with this to get the plot to look good)

[Figure: Excel scatter chart with the fitted regression line]

17.19

Required Conditions
For these regression methods to be valid the following four conditions for the error variable (ε) must be met:
The probability distribution of ε is normal.
The mean of the distribution is 0; that is, E(ε) = 0.
The standard deviation of ε is σε, which is a constant regardless of the value of x.
The value of ε associated with any particular value of y is independent of ε associated with any other value of y.

17.20

Assessing the Model


The least squares method will always produce a straight line, even if there is no relationship between the variables, or if the relationship is something other than linear.
Hence, in addition to determining the coefficients of the least squares line, we need to assess it to see how well it fits the data. We'll see these evaluation methods now. They're based on what is called the sum of squares for error (SSE).

17.21

Sum of Squares for Error (SSE: another thing to calculate)

The sum of squares for error is calculated as:

SSE = Σ(yi − ŷi)²

and is used in the calculation of the standard error of estimate:

sε = sqrt( SSE / (n − 2) )

If sε is zero, all the points fall on the regression line.
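A sketch of both quantities on made-up data (np.polyfit stands in for the coefficient formulas from the earlier slide):

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5, 6])
y = np.array([3.1, 5.0, 7.2, 8.9, 11.1, 12.8])

b1, b0 = np.polyfit(x, y, 1)           # least squares slope and intercept
y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)         # SSE: sum of squared residuals
s_eps = np.sqrt(sse / (len(x) - 2))    # standard error of estimate
print(f"SSE = {sse:.4f}, s_eps = {s_eps:.4f}")
```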


17.22

Standard Error

If sε is small, the fit is excellent and the linear model should be used for forecasting. If sε is large, the model is poor.
But what is small and what is large?

17.23

Standard Error
Judge the value of sε by comparing it to the sample mean of the dependent variable (ȳ).
In this example,
sε = .3265 and
ȳ = 14.841
so (relatively speaking) it appears to be small, hence our linear regression model of car price as a function of odometer reading is good.

17.24

Testing the Slope (Excel output does this for you)

If no linear relationship exists between the two variables, we would expect the regression line to be horizontal, that is, to have a slope of zero.
We want to see if there is a linear relationship, i.e. we want to see if the slope (β1) is something other than zero. Our research hypothesis becomes:
H1: β1 ≠ 0
Thus the null hypothesis becomes:
H0: β1 = 0
Already discussed!

17.25

Testing the Slope


We can implement this test statistic to test our hypotheses:

t = (b1 − β1) / s_b1,   with H0: β1 = 0

where s_b1 is the standard deviation of b1, defined as:

s_b1 = sε / sqrt( (n − 1)·s_x² )

If the error variable (ε) is normally distributed, the test statistic has a Student t distribution with n − 2 degrees of freedom. The rejection region depends on whether or not we're doing a one- or two-tail test (a two-tail test is most typical).
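A sketch of the slope test on made-up data, using the s_b1 formula above; the critical value 2.447 is t(.025) for df = n − 2 = 6:

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([3.1, 5.0, 7.2, 8.9, 11.1, 12.8, 15.2, 16.9])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
s_eps = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
s_x2 = x.var(ddof=1)                        # sample variance of x
s_b1 = s_eps / np.sqrt((n - 1) * s_x2)      # standard deviation of b1

t = b1 / s_b1                               # test statistic for H0: beta1 = 0
print(f"b1 = {b1:.3f}, t = {t:.1f} (two-tail critical value: 2.447)")
```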

17.26

Example 17.4
Test to determine if the slope is significantly different from 0 (at the 5% significance level).
We want to test:
H1: β1 ≠ 0
H0: β1 = 0
(if the null hypothesis is true, no linear relationship exists)
The rejection region is:

|t| > t(α/2, n−2)

OR check the p-value.

17.27

Example 17.4 (COMPUTE)

We can compute t manually or refer to our Excel output.

We see that the t statistic for odometer (i.e. the slope, b1) is −13.49; its absolute value is greater than tCritical = 1.984. We also note that the p-value is 0.000.
There is overwhelming evidence to infer that a linear relationship between odometer reading and price exists.

17.28

Testing the Slope


Wecanalsoestimate(tosomelevelofconfidence)and
intervalfortheslopeparameter,.
Recallthatyourestimateforisb1.
Theconfidenceintervalestimatorisgivenas:

Hence:
Thatis,weestimatethattheslopecoefficientliesbetween
.0768and.0570
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc.

17.29

Coefficient of Determination
Tests thus far have shown if a linear relationship exists; it is also useful to measure the strength of the relationship. This is done by calculating the coefficient of determination, R².

R² = 1 − SSE/SST

The coefficient of determination is the square of the coefficient of correlation (r), hence R² = (r)².
(r will be computed shortly, and this is true for models with only 1 independent variable.)

17.30

Coefficient of Determination
R² has a value of .6483. This means 64.83% of the variation in the auction selling prices (y) is explained by your regression model. The remaining 35.17% is unexplained, i.e. due to error.
Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions.
In general the higher the value of R², the better the model fits the data.
R² = 1: Perfect match between the line and the data points.
R² = 0: There is no linear relationship between x and y.
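R² can be computed from SSE and SST and cross-checked against r² (made-up data for illustration):

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([3.1, 5.0, 7.2, 8.9, 11.1, 12.8, 15.2, 16.9])

b1, b0 = np.polyfit(x, y, 1)
sse = np.sum((y - (b0 + b1 * x)) ** 2)   # unexplained variation
sst = np.sum((y - y.mean()) ** 2)        # total variation in y

r_sq = 1 - sse / sst                     # coefficient of determination
r = np.corrcoef(x, y)[0, 1]              # coefficient of correlation
print(f"R^2 = {r_sq:.4f}, r^2 = {r**2:.4f}")   # equal with one predictor
```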

17.31

Remember Excel's Output

An analysis of variance (ANOVA) table for the simple linear regression model can be given by:

Source       degrees of freedom   Sums of Squares        Mean Squares        F-Statistic
Regression   1                    SSR                    MSR = SSR/1         F = MSR/MSE
Error        n − 2                SSE                    MSE = SSE/(n − 2)
Total        n − 1                Variation in y (SST)

17.32

Using the Regression Equation


We could use our regression equation:
y = 17.25 − .0669x
to predict the selling price of a car with 40 (40,000) miles on it:
y = 17.25 − .0669x = 17.25 − .0669(40) = 14.574
We call this value ($14,574) a point prediction (estimate). Chances are though the actual selling price will be different, hence we can estimate the selling price in terms of a confidence interval.

17.33

Prediction Interval
The prediction interval is used when we want to predict one particular value of the dependent variable, given a specific value of the independent variable:

ŷ ± t(α/2, n−2) · sε · sqrt( 1 + 1/n + (x_g − x̄)² / ((n − 1)·s_x²) )

(x_g is the given value of x we're interested in)

17.34

Confidence Interval Estimator for Mean of Y


Theconfidenceintervalestimatefortheexpectedvalueofy
(MeanofY)isusedwhenwewanttopredictanintervalwe
areprettysurecontainsthetrueregressionline.Inthis
case,weareestimatingthemeanofygivenavalueofx:

(Technicallythisformulaisusedforinfinitelylarge
populations.However,wecaninterpretourproblemas
attemptingtodeterminetheaveragesellingpriceofallFord
Tauruses,allwith40,000milesontheodometer)
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc.

17.35

What's the Difference?

Prediction Interval: contains the extra "1" under the square root. Used to estimate the value of one value of y (at a given x).
Confidence Interval: no "1". Used to estimate the mean value of y (at a given x).

The confidence interval estimate of the expected value of y will be narrower than the prediction interval for the same given value of x and confidence level. This is because there is less error in estimating a mean value as opposed to predicting an individual value.
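A sketch of both intervals at one x value (made-up data; x_g = 5 and the 95% critical value t(.025, 6) = 2.447 are assumptions). Note the extra 1 under the prediction interval's square root:

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([3.1, 5.0, 7.2, 8.9, 11.1, 12.8, 15.2, 16.9])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
s_eps = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
t_crit = 2.447                              # t(.025) for df = n - 2 = 6

x_g = 5.0                                   # the given value of x
y_hat = b0 + b1 * x_g
core = (x_g - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1))

pi_half = t_crit * s_eps * np.sqrt(1 + 1 / n + core)  # prediction interval
ci_half = t_crit * s_eps * np.sqrt(1 / n + core)      # CI for mean of y
print(f"PI: {y_hat:.2f} +/- {pi_half:.2f}   CI: {y_hat:.2f} +/- {ci_half:.2f}")
```

The CI half-width is always the smaller of the two, matching the slide's point.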

17.36

Regression Diagnostics
There are three conditions that are required in order to perform a regression analysis. These are:
The error variable must be normally distributed,
The error variable must have a constant variance, &
The errors must be independent of each other.
How can we diagnose violations of these conditions?
Residual Analysis, that is, examine the differences between the actual data points and those predicted by the linear equation.

17.37

Nonnormality
We can take the residuals and put them into a histogram to visually check for normality:
we're looking for a bell-shaped histogram with the mean close to zero [our old test for normality].
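A numeric sketch of that check on simulated data (the data-generating values are arbitrary): least squares residuals average to zero by construction, and the histogram counts should pile up in the middle bins when the errors are normal:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 60)
y = 2.0 + 1.5 * x + rng.normal(0.0, 1.0, size=x.size)  # simulated linear data

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

counts, edges = np.histogram(residuals, bins=7)
print("mean of residuals:", round(residuals.mean(), 9))  # essentially 0
print("histogram counts: ", counts)   # roughly bell-shaped for normal errors
```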

17.38

Heteroscedasticity
When the requirement of a constant variance is violated, we have a condition of heteroscedasticity.

We can diagnose heteroscedasticity by plotting the residuals against the predicted values of y.

17.39

Heteroscedasticity
If the variance of the error variable (ε) is not constant, then we have heteroscedasticity. Here's the plot of the residuals against the predicted value of y:

[Figure: residuals vs. predicted y]

There doesn't appear to be a change in the spread of the plotted points, therefore no heteroscedasticity.

17.40

Nonindependence of the Error Variable

If we were to observe the auction price of cars every week for, say, a year, that would constitute a time series.

When the data are time series, the errors often are correlated. Error terms that are correlated over time are said to be autocorrelated or serially correlated.
We can often detect autocorrelation by graphing the residuals against the time periods. If a pattern emerges, it is likely that the independence requirement is violated.

17.41

Nonindependence of the Error Variable

Patterns in the appearance of the residuals over time indicate that autocorrelation exists:

[Figure, left: note the runs of positive residuals, replaced by runs of negative residuals]
[Figure, right: note the oscillating behavior of the residuals around zero]

17.42

Outliers (problem worked earlier)

An outlier is an observation that is unusually small or unusually large.
E.g. our used car example had odometer readings from 19.1 to 49.2 thousand miles. Suppose we have a value of only 5,000 miles (i.e. a car driven by an old person only on Sundays): this point is an outlier.

17.43

Outliers
Possible reasons for the existence of outliers include:
There was an error in recording the value
The point should not have been included in the sample
Perhaps the observation is indeed valid.
Outliers can be easily identified from a scatter plot.
If the absolute value of the standard residual is > 2, we suspect the point may be an outlier and investigate further.
They need to be dealt with since they can easily influence the least squares line.
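That screen can be sketched as below (made-up odometer/price data in thousands, with one point deliberately pushed off the line; this uses the crude residual/sε standardization, while fully studentized residuals would also adjust for leverage):

```python
import numpy as np

# Odometer (thousands of miles) and price (thousands of $); made-up numbers,
# with the last point deliberately priced well above the linear trend.
odometer = np.array([19.1, 24.0, 28.5, 33.2, 37.8, 42.0, 46.1, 49.2, 35.0])
price = np.array([16.0, 15.5, 15.1, 14.6, 14.2, 13.8, 13.4, 13.1, 16.5])

b1, b0 = np.polyfit(odometer, price, 1)
resid = price - (b0 + b1 * odometer)
s_eps = np.sqrt(np.sum(resid**2) / (len(odometer) - 2))

std_resid = resid / s_eps                     # simple standardized residuals
suspects = np.where(np.abs(std_resid) > 2)[0]
print("suspect outlier indices:", suspects)   # flags the last point
```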

17.44

Procedure for Regression Diagnostics

1. Develop a model that has a theoretical basis.
2. Gather data for the two variables in the model.
3. Draw the scatter diagram to determine whether a linear model appears to be appropriate. Identify possible outliers.
4. Determine the regression equation.
5. Calculate the residuals and check the required conditions.
6. Assess the model's fit.
7. If the model fits the data, use the regression equation to predict a particular value of the dependent variable and/or estimate its mean.
