Correlation and Regression

 Correlation Coefficient (r)
 Regression Analysis
[Figure: scatterplots of Y against X; one panel marks a sample point (Xi, Yi)]
 The correlation coefficient is based on the covariance.
 For a sample, the covariance is calculated as:
$s_{xy} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{N - 1}$
 Interpretation: Covariance tells us how variation in one
variable “goes with” variation in another variable
(“covary”).
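 Two variables have no linear relationship (are uncorrelated) when their covariance = 0. Statistically independent variables have zero covariance, but zero covariance alone does not establish independence.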
 Positive relationships indicated by + value,
negative relationships by a – value.
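 Problem with covariance as a measure of association? Its size depends on the units in which X and Y are measured, so it says little about the strength of the relationship.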
 Correlation Coefficient (Pearson’s r)
 A way of standardizing the covariance.
 $r_{xy} = \frac{s_{xy}}{s_x s_y}$
 Interpretation: Measures the strength of a linear relationship.
 $-1 \le r \le 1$
 X and Y are uncorrelated (no linear relationship) iff $r_{xy} = 0$; zero correlation alone does not establish independence.
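The two formulas above can be sketched in a few lines of Python. This is a minimal illustration on made-up data; the function names and the numbers are assumptions, not part of the original slides. It shows that rescaling Y changes the covariance but not r, and that a perfectly dependent but non-linear pair (Y = X squared on symmetric X values) still has zero covariance.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance: sum of cross-deviations divided by N - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # covariance of X with itself is its variance
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

# Hypothetical data: the same relationship measured in two different units.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_rescaled = [100 * v for v in y]

cov_xy = sample_covariance(x, y)
cov_rescaled = sample_covariance(x, y_rescaled)
r_xy = pearson_r(x, y)
r_rescaled = pearson_r(x, y_rescaled)
print("covariance:", cov_xy, "after rescaling Y:", cov_rescaled)
print("r:", r_xy, "after rescaling Y:", r_rescaled)

# Hypothetical non-linear case: Y depends entirely on X, yet the covariance is zero.
x_sym = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_sq = [v * v for v in x_sym]
cov_nonlinear = sample_covariance(x_sym, y_sq)
print("covariance of X and X squared:", cov_nonlinear)
```

The covariance grows by the same factor as the units of Y, while r is unchanged, which is the sense in which r standardizes the covariance.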
 What explains variation in the generosity of
state welfare expenditures? (STATES – 55)
 What explains variation in the generosity of
state welfare expenditures? (STATES – 1570)
 738 – Poverty rate
 739 – Median Family Income
 1644 - %Clinton
 1715 – Female State Legislators
 Regression is concerned with dependence of one variable
(the dependent variable, measured at the interval/ratio
level) on one or more other variables (independent
variables, measured at the interval, ratio, ordinal or
nominal levels).
 Bivariate vs. Multivariate regression analysis
 Y used as dependent variable and X as independent variable.
 The correlation coefficient measures the
strength of a linear association between two
variables measured at the interval level
 In a scatterplot – the degree to which the points in the plot cluster around a “best-fitting” line
 The purpose of regression analysis is to
determine exactly what that line is (i.e. to
estimate the equation for the line)
 The regression line represents predicted values of Y based on the value of X
[Figure: scatterplots of Y against X; the value Yi at Xi is marked]
Yi = a + bXi
a = Intercept, or Constant = The value of Y when X = 0
b = Slope coefficient = The change (+ or -) in Y given a one unit increase in X
Yi = a + bXi + ei
Residual (ei) – for every observation, the difference between the observed value of Y and the value predicted by the regression line (“prediction errors”)
 Using statistical calculations, for any relationship between
X and Y, we can determine the best-fitting line for the
relationship
 This means finding specific values for a and b for the regression equation
Yi = a + bXi + ei
 Regression analysis finds the line that
minimizes the sum of squared residuals
 a = the expected value of Y when X=0
 b = the expected change in Y given a one
unit increase in X
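As a rough sketch of how a and b can be found, the standard closed-form least-squares solution for one independent variable is b = (sum of cross-deviations) / (sum of squared X deviations) and a = mean(Y) - b * mean(X). The data and function names below are hypothetical, chosen only for illustration. The last lines check that moving either coefficient away from the fitted value increases the sum of squared residuals.

```python
def fit_line(xs, ys):
    """Return (a, b) that minimize the sum of squared residuals for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx            # slope
    a = mean_y - b * mean_x    # intercept: the line passes through (mean X, mean Y)
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Sum over observations of (observed Y - predicted Y) squared."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_line(x, y)
best = sum_squared_residuals(x, y, a, b)

# Any other line does worse on this criterion.
worse_slope = sum_squared_residuals(x, y, a, b + 0.1)
worse_intercept = sum_squared_residuals(x, y, a + 0.1, b)
print("a =", a, "b =", b, "sum of squared residuals =", best)
print("with nudged slope:", worse_slope, "with nudged intercept:", worse_intercept)
```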
 We can calculate a predicted value for the
dependent variable for any value of X by
using the regression equation for the
regression line:
 ^
Yi = a + bXi
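[Figures: regression line diagrams marking the intercept, the slope (b) as the change in Y for one unit of X between Xi and Xj, and the residual ei between the observed and predicted Yi at Xi]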
 Research Question: Did the butterfly
ballot result in an unusual number of
votes for Pat Buchanan in the 2000
election in Palm Beach Co.?
 Did it cost Al Gore the election?
 Unit of analysis – Fla. Counties (67)
 Dependent variable (Y) – vote for Buchanan
in 2000
 Independent variable (X) – vote for
Buchanan in 1996 Republican primary
[Figure: scatterplot of Florida counties, each labeled by name. Y axis: Buchanan vote in 2000 (0 to 4000). X axis: Buchanan Vote in 1996 (0 to 15000). Palm Beach is labeled near the top of the Y axis.]
Y = 12.957 + .101(X)
Intercept (a) = 12.957; Slope (b) = .101
2000 vote = 12.957 + .101(1996 Vote)

 To generate a predicted value for Palm Beach in 2000, we can plug the appropriate X value into the regression equation and solve for Y.
 Regression equation: 2000 vote = 12.957 + .101(1996 Vote)
 In 1996, Buchanan received 8788 votes in Palm Beach. Our prediction for Palm Beach in 2000 based on this regression is thus:
12.957 + .101*8788 ≈ 903.45
(With the rounded coefficients shown, the arithmetic gives 900.55; the reported 903.45 presumably reflects unrounded coefficients.)
(What does this tell us?)
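The plug-in step can be written out as a short Python sketch. The constant and function names are illustrative assumptions. The coefficients are the rounded values printed in the regression equation, so the result lands close to, but not exactly on, the figure quoted in the text.

```python
INTERCEPT = 12.957  # a, rounded as printed in the regression equation
SLOPE = 0.101       # b, rounded as printed in the regression equation

def predict_2000_vote(vote_1996):
    """Predicted 2000 Buchanan vote for a county, given its 1996 Buchanan vote."""
    return INTERCEPT + SLOPE * vote_1996

palm_beach_1996 = 8788
predicted = predict_2000_vote(palm_beach_1996)
print("Predicted 2000 vote for Palm Beach:", round(predicted, 2))
```

Either way, the model expects roughly 900 Buchanan votes in Palm Beach, far fewer than the 3,407 actually recorded.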

[Figure: the same scatterplot, annotated. Actual vote for Buchanan in Palm Beach, 2000 election: 3,407. Predicted vote for Buchanan based on the 1996 vote (from the regression model): 903.45.]

 We can calculate the residual for any
observation by first calculating the predicted
value for Y, and then subtracting the
predicted value from the observed value of Y:
 ^
ei = Yi - Yi
 For any observation in our data, the residual
represents the “prediction error” for that
observation (based on the regression
equation)
 ^
ei = Yi - Yi
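For Palm Beach, the residual calculation is a single subtraction. The sketch below uses the actual and predicted votes given in the slides; the variable and function names are assumptions made for illustration.

```python
def residual(observed, predicted):
    """Prediction error for one observation: observed Y minus predicted Y."""
    return observed - predicted

palm_beach_actual = 3407       # observed Buchanan vote in 2000
palm_beach_predicted = 903.45  # predicted value reported for the regression model

e_palm_beach = residual(palm_beach_actual, palm_beach_predicted)
print("Palm Beach residual:", round(e_palm_beach, 2))
```

A positive residual means the observation lies above the regression line, that is, the model under-predicted; a negative residual means it over-predicted.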
[Figure: the same scatterplot, marking the Palm Beach residual as the gap between the actual and predicted vote: 3407 - 903.45 = 2503.55]

 Testing for statistical significance for the slope
 The p-value: the probability of observing a sample slope value at least as far from zero as the one we are observing in our sample IF THE NULL HYPOTHESIS IS TRUE (the null hypothesis being that the true slope is zero)
 P-values closer to zero indicate stronger evidence against the null hypothesis (.05 is usually the threshold for statistical significance)
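One way to make the definition of the p-value concrete is a permutation test. Statistical software normally reports a p-value from a t-test on the slope; the sketch below swaps in a different technique that follows the definition directly. It shuffles Y many times to break any link with X (imitating the null hypothesis) and counts how often the shuffled slope is at least as far from zero as the observed one. The data, function names, number of permutations, and seed are all hypothetical.

```python
import random

def slope(xs, ys):
    """Least-squares slope b for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    return s_xy / s_xx

def permutation_p_value(xs, ys, n_permutations=5000, seed=0):
    """Share of shuffled datasets whose slope is at least as far from zero as observed."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any X-Y link, imitating the null hypothesis
        if abs(slope(xs, shuffled)) >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical data with a strong positive relationship
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2]

p = permutation_p_value(x, y)
print("observed slope:", slope(x, y))
print("permutation p-value:", p)
```

A p-value below .05 here would lead us to reject the null hypothesis of a zero slope at the conventional threshold.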
 The R-squared = the proportion of variation
in the dependent variable (Y) explained by
the independent variable (X).
 In bivariate regression analysis it is simply the square of the correlation coefficient (r)
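The equality between R-squared and the square of r in the bivariate case can be checked numerically. The sketch below uses hypothetical data and computes R-squared as 1 minus (sum of squared residuals / total sum of squares), a standard way of writing "proportion of variation explained"; the variable names are assumptions.

```python
import math

# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)  # total variation in Y

# Fit the regression line
b = s_xy / s_xx
a = mean_y - b * mean_x

# Variation left unexplained by the line
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - ss_residual / s_yy
r = s_xy / math.sqrt(s_xx * s_yy)

print("R-squared:", r_squared)
print("r squared:", r ** 2)
```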
 Intercept (a)
 Slope (b)
 Predicted values of Y
 Residuals
 P-value for the slope
 R-squared
