
Zeena Jarrar

Kiker 6
The Linear Regression Project

Have you ever wondered why America has been growing so big, so fast? It might just be
because of how long people are able to live. The fewer people dying, the more our population
can grow. For my project, I decided to analyze the relationship between life expectancy and
population in the United States. I chose these variables because life expectancy and
population both seem to increase as the years pass, and the two go hand in hand: the longer
people live, the more people will be alive at any one time, and therefore the larger the
population will be. This interested me because, as an American, I find it striking that our
population is rapidly expanding while available space is rapidly decreasing. Watching the
correlation between how long we can live and how many people are living in America is
shocking, and makes one wonder how long it will be until we can't hold any more people, or
what the oldest age people will be able to live to in the future.

Once I picked my variables, I identified which was the explanatory variable and which was
the response variable. Since I believed that the longer a person lives, the more the
population will grow, life expectancy is the explanatory variable and population is the
response: with every change in life expectancy, the population is expected to change as well.
After I created my scatterplot, I created boxplots for both population and life expectancy
to see if I had any outliers. Looking at the boxplots, I couldn't see any outliers, so I
decided to double-check by calculating the IQR from the output of the fivenum function in R.
I multiplied the IQR by 1.5, then added that number to my 3rd quartile and subtracted it from
my 1st quartile. I compared these two cutoffs with my maximum and minimum and found that the
upper cutoff was larger than my maximum and the lower cutoff was smaller than my minimum,
which meant I in fact had no outliers. Since I did not have any outliers in either variable,
I did not expect to have any influential points either.
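
The outlier check described above can be sketched as a small R helper. This is only an illustration: the function names `fences` and `no_outliers` are made up for this sketch, and the inputs are the five-number summaries that `fivenum()` printed in the console output for this project. Note that `fivenum()` reports hinges, which can differ slightly from the quartiles that `quantile()` returns.

```r
# Five-number summaries (min, lower hinge, median, upper hinge, max)
# copied from the fivenum() output in this project
pop  <- c(272657000, 284968955, 295516599, 306771529, 316497531)
life <- c(76.42927, 76.73659, 77.33902, 78.09024, 78.84146)

# Hypothetical helper: 1.5 * IQR cutoffs from a five-number summary
fences <- function(five) {
  iqr <- five[4] - five[2]
  c(lower = five[2] - 1.5 * iqr, upper = five[4] + 1.5 * iqr)
}

# Hypothetical helper: TRUE when the minimum and maximum both lie
# inside the cutoffs, i.e. the 1.5 * IQR rule flags no outliers
no_outliers <- function(five) {
  f <- fences(five)
  five[1] >= f[["lower"]] && five[5] <= f[["upper"]]
}

pop_fences  <- fences(pop)
life_fences <- fences(life)

print(pop_fences)
print(life_fences)
print(no_outliers(pop))
print(no_outliers(life))
```

Both calls to `no_outliers` come out `TRUE` for these summaries, which agrees with the conclusion reached by hand.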

After calculating my linear regression fit, I was given the slope, y-intercept, and
r-squared value. Using these values I wrote my least-squares regression equation, which
models my data as Yhat = 16,136,393*X - 955,286,783, with Yhat being the predicted
population and X being the life expectancy in years. My slope is 16,136,393, which means that
for every one-year increase in life expectancy the population is predicted to increase by
16,136,393 people. My y-intercept is -955,286,783, which has no practical meaning because
life expectancy can't be 0 or else no one would be alive, and it is impossible to have a
negative number of people. My value of r is .982, which indicates that the data in the
scatterplot have a strong positive linear relationship. My r-squared value was calculated to
be 0.965, which means that 96.5% of the variability in the United States population is
explained by its linear relationship with life expectancy. To test my least-squares
regression equation, I picked the point (78.09, 306771529) and calculated the predicted Y to
compare it with the real Y value. After plugging 78.09 into my equation, I got a predicted Y
of 304,808,019, which was fairly close to the real Y value. I calculated my residual by
subtracting the predicted Y value from the real Y value and got 1,963,510. Although this may
seem like an extremely large residual, in comparison to the large numbers we are using it's
not too big: it is under 1% of the real value. The residual is positive because the predicted
Y value underestimated the real value. I created a residual plot of the population residuals
against life expectancy. Rather than a random scatter, I noticed the residuals formed more of
an upside-down U shape, which suggests a straight line may not be the best model. To check, I
compared the linear regression, which had an r-squared of .965, with other fits. I tried an
exponential fit and got an r-squared of 0.959, which is large but not as large as the
r-squared for the linear regression. Lastly, I tried a logistic fit, which I assumed would
fit best because of the patterned residuals. It gave an r-squared of .982, the highest of
the three. Therefore, the logistic model fit my data better than the linear regression.
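
The prediction and residual worked out above can be reproduced in a few lines of R. This is a sketch that only reuses the slope, intercept, and r-squared reported for the linear fit; `predict_population` is a made-up name for illustration, and because the reported coefficients are rounded, the results match the hand calculation only to the nearest whole person.

```r
# Coefficients reported for the linear fit in this project
slope     <- 16136393
intercept <- -955286783

# Hypothetical helper: predicted population for a given life expectancy
predict_population <- function(life_expectancy) {
  slope * life_expectancy + intercept
}

# The test point used in the essay
observed  <- 306771529
predicted <- predict_population(78.09024)

# Residual = observed - predicted; positive means the line underestimated
residual <- observed - predicted

# r is the square root of r-squared, taken as positive because the slope is positive
r <- sqrt(0.96516)

print(round(predicted))
print(round(residual))
print(round(residual / observed * 100, 2))  # residual as a percent of the observed value
print(round(r, 3))
```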

A career that would use this type of analysis is working for the National Center for
Health Statistics, because they study the health status of the population. By looking at the
rises and falls in the graph, they can get a general idea of population and life expectancy
in the US and figure out which diseases may be affecting them. The relevance of this career
is that it helps us identify health problems affecting people, document the status of the
population's health, evaluate how well health policies and programs work, and many other
things related to our health.

From my data we can't conclude that life expectancy is directly causing population to
increase, because there could be a lurking variable that is making the population increase,
such as people in America having more kids or an increase in immigration. Since the two
variables have such a high r-squared value, we can tell they still have a very strong
correlation. Based on my results, the population must level off at some point because there
are only so many people we can fit in America.

> plot(zj$Life.expectancy, zj$Population, main = "Life Expectancy and Population in the US",
       xlab = "Life Expectancy (Years)", ylab = "Population")
> abline(lm(zj$Population ~ zj$Life.expectancy))

> linFit(zj$Life.expectancy, zj$Population)
Intercept = -955286783
Slope = 16136393
R-squared = 0.96516
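> boxplot(zj$Population, horizontal = TRUE, main = "Boxplot of Population",
          xlab = "Population")
> fivenum(zj$Population)
[1] 272657000 284968955 295516599 306771529 316497531
> # 1.5 * IQR for population
> (306771529 - 284968955) * 1.5
[1] 32703861
> # Upper cutoff: 3rd quartile + 1.5 * IQR
> 306771529 + 32703861
[1] 339475390
> # Lower cutoff: 1st quartile - 1.5 * IQR
> 284968955 - 32703861
[1] 252265094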
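> boxplot(zj$Life.expectancy, horizontal = TRUE, main = "Boxplot of Life Expectancy",
          xlab = "Life Expectancy (Years)")
> fivenum(zj$Life.expectancy)
[1] 76.42927 76.73659 77.33902 78.09024 78.84146
> # 1.5 * IQR for life expectancy
> (78.09024 - 76.73659) * 1.5
[1] 2.030475
> # Upper cutoff: 3rd quartile + 1.5 * IQR
> 78.09024 + 2.030475
[1] 80.12071
> # Lower cutoff: 1st quartile - 1.5 * IQR
> 76.73659 - 2.030475
[1] 74.70612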
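> # Predicted population at a life expectancy of 78.09024 years
> 16136393 * (78.09024) - 955286783
[1] 304808019
> # Residual: real value minus predicted value
> 306771529 - 304808019
[1] 1963510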
> Population.lm = lm(Population ~ Life.expectancy, data = zj)
> Population.res = resid(Population.lm)
> plot(zj$Life.expectancy, Population.res, ylab = "Residuals", xlab = "Life Expectancy",
       main = "Residuals in Population")
> abline(0, 0)
> expFit(zj$Life.expectancy, zj$Population)
a = 4287471
b = 1.05611
R-squared = 0.95862
> logisticFit(zj$Life.expectancy, zj$Population)
Logistic Fit
C = 329855050
a = 2.133667e+18
b = 1.77413
R-squared = 0.98206
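
The `linFit`, `expFit`, and `logisticFit` calls above are not base R functions; they appear to come from a helper script that is not shown here. As a rough sketch of how r-squared values for a linear and an exponential model could be compared using only base R, the block below fits both with `lm()`. The `x` and `y` values are made-up toy numbers, not the project's data, and the helper script may compute its r-squared differently (for example on the log scale), so this is an assumption rather than a reconstruction of those functions.

```r
# Coefficient of determination from observed and fitted values
r_squared <- function(observed, fitted) {
  1 - sum((observed - fitted)^2) / sum((observed - mean(observed))^2)
}

# Made-up toy values for illustration only (NOT the project's data)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

# Linear model: y = intercept + slope * x
lin_model <- lm(y ~ x)
lin_r2 <- r_squared(y, fitted(lin_model))

# Exponential model y = a * b^x, fitted as a straight line on log(y)
exp_model <- lm(log(y) ~ x)
exp_r2 <- r_squared(y, exp(fitted(exp_model)))

print(c(linear = lin_r2, exponential = exp_r2))
```

Computing both r-squared values on the original scale of `y` keeps the comparison fair, since each number then measures how far the fitted curve is from the same observed values.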

Works Cited
Rothwell, Charles. "About NCHS." Centers for Disease Control and Prevention. Centers for
Disease Control and Prevention, 20 Jan. 2011. Web. 18 Nov. 2015.
World Development Indicators. Rep. N.p.: n.p., n.d. World Bank. Web. 16 Nov. 2015.
