Chapter 5 of the textbook introduced you to the two most widely used measures of
relationship: the Pearson product-moment correlation and the Spearman rank-order
correlation. We will be covering these statistics in this section, as well as other
measures of relationship among variables.
What is a Relationship?
Correlation coefficients are measures of the degree of relationship between two or
more variables. When we talk about a relationship, we are talking about the manner in
which the variables tend to vary together. For example, if one variable tends to
increase at the same time that another variable increases, we would say there is a
positive relationship between the two variables. If one variable tends to decrease as
another variable increases, we would say that there is a negative relationship between
the two variables. It is also possible that the variables might be unrelated to one
another, so that there is no predictable change in one variable based on knowing about
changes in the other variable.
As a child grows from an infant into a toddler into a young child, both the child's
height and weight tend to change. Those changes are not always tightly locked to one
another, but they do tend to occur together. So if we took a sample of children from a
few weeks old to 3 years old and measured the height and weight of each child, we
would likely see a positive relationship between the two.
A relationship between two variables does not necessarily mean that one variable
causes the other. When we see a relationship, there are three possible causal
interpretations. If we label the variables A and B, A could cause B, B could cause A,
or some third variable (we will call it C) could cause both A and B. With the
relationship between height and weight in children, it is likely that the general growth
of children, which increases both height and weight, accounts for the observed
correlation. It is very foolish to assume that the presence of a correlation implies a
causal relationship between the two variables. There is an extended discussion of this
issue in Chapter 7 of the text.
Scatter Plots
A relationship between two variables can be depicted in a scatter plot: a two-dimensional
graph, in which the dimensions are defined by the variables. For example, if we
wanted to create a scatter plot of our sample of 100 children for the variables of height
and weight, we would start by drawing the X and Y axes, labeling one height and the
other weight, and marking off the scales so that the range on these axes is sufficient to
handle the range of scores in our sample. Let's suppose that our first child is 27 inches
tall and 21 pounds. We would find the point on the weight axis that represents 21
pounds and the point on the height axis that represents 27 inches. Where these two
points cross, we would put a dot that represents the combination of height and weight
for that child, as shown in the figure below.
We then continue the process for all of the other children in our sample, which might
produce the scatter plot illustrated below.
It is always a good idea to produce scatter plots for the correlations that you compute
as part of your research. Most will look like the scatter plot above, suggesting a linear
relationship. Others will show a distribution that is less organized and more scattered,
suggesting a weak relationship between the variables. But on rare occasions, a scatter
plot will indicate a relationship that is not a simple linear relationship, but rather
shows a complex relationship that changes at different points in the scatter plot. The
scatter plot below illustrates a nonlinear relationship, in which Y increases
as X increases, but only up to a point; after that point, the relationship reverses
direction. Using a simple correlation coefficient for such a situation would be a
mistake, because the correlation cannot capture accurately the nature of a nonlinear
relationship.
The Pearson product-moment correlation can be computed either by hand or using
SPSS for Windows.
Spearman Rank-Order Correlation
The Spearman correlation has the same range as the Pearson correlation, and the
numbers mean the same thing. A zero correlation means that there is no relationship,
whereas correlations of +1.00 and -1.00 mean that there are perfect positive and
negative relationships, respectively. The formula for computing this correlation is
shown below. Traditionally, the lowercase r with a subscript s is used to designate the
Spearman correlation (i.e., r_s). The one term in the formula that is not familiar to you
is d, which is equal to the difference in the ranks for the two variables. This is
explained in more detail in the section that covers the manual computation of the
Spearman rank-order correlation.
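The formula itself has not survived in this copy; the standard rank-difference form, consistent with the definition of d above (with n the number of pairs), is:

```latex
r_s = 1 - \frac{6 \sum d^{2}}{n\,(n^{2} - 1)}
```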
Phi Coefficient
The Phi coefficient is an index of the degree of relationship between two variables
that are measured on a nominal scale. Because variables measured on a nominal scale
are simply classified by type, rather than measured in the more general sense, there is
no such thing as a linear relationship. Nevertheless, it is possible to see if there is a
relationship. For example, suppose you want to study the relationship between
religious background and occupation. You have a classification system for religion
that includes Catholic, Protestant, Muslim, Other, and Agnostic/Atheist. You have also
developed a classification for occupations that includes Unskilled Laborer, Skilled
Laborer, Clerical, Middle Manager, Small Business Owner, and Professional/Upper
Management. You want to see if the distribution of religious preferences differs by
occupation, which is just another way of saying that there is a relationship between
these two variables.
The Phi coefficient is not used nearly as often as the Pearson and Spearman
correlations, so we will not be devoting space here to the computational procedures;
interested students can consult advanced statistics textbooks for the details. However,
you can compute Phi easily as one of the options in the crosstabs procedure in SPSS
for Windows.
Advanced Correlational Techniques
Looking for Relationships in the Data
When there are two series of data, there are a number of statistical measures
that can be used to capture how the series move together over time.
Correlations and Covariances
The two most widely used measures of how two variables move together (or do
not) are the correlation and the covariance. For two data series, X (X_1, X_2, ...) and
Y (Y_1, Y_2, ...), the covariance provides a measure of the degree to which they move
together and is estimated by taking the product of the deviations from the mean for
each variable in each period.
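The formula image has not survived in this copy; the standard sample estimator matching that description (with \bar{X} and \bar{Y} the means of the two series, and n the number of periods; some texts divide by n rather than n - 1) is:

```latex
\operatorname{Cov}(X,Y) \;=\; \sigma_{XY} \;=\; \frac{1}{n-1}\sum_{t=1}^{n}\left(X_t-\bar{X}\right)\left(Y_t-\bar{Y}\right)
```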
The sign on the covariance indicates the type of relationship the two variables have. A
positive sign indicates that they move together and a negative sign that they move in
opposite directions. Although the covariance increases with the strength of the
relationship, it is still relatively difficult to draw judgments on the strength of the
relationship between two variables by looking at the covariance, because it is not
standardized.
The correlation is the standardized measure of the relationship between two
variables. It can be computed from the covariance:
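The equation is missing here; the standard definition, scaling the covariance by the two standard deviations, is:

```latex
\rho_{XY} \;=\; \operatorname{Corr}(X,Y) \;=\; \frac{\operatorname{Cov}(X,Y)}{\sigma_X\,\sigma_Y}
```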
The correlation can never be greater than one or less than negative one. A correlation
close to zero indicates that the two variables are unrelated. A positive correlation
indicates that the two variables move together, and the relationship is stronger as the
correlation gets closer to one. A negative correlation indicates the two variables move
in opposite directions, and that relationship gets stronger as the correlation gets
closer to negative one. Two variables that are perfectly positively correlated (ρ_XY = 1)
essentially move in perfect proportion in the same direction, whereas two variables
that are perfectly negatively correlated move in perfect proportion in opposite
directions.
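These definitions translate directly into code. The sketch below, on two made-up series, computes the sample covariance and then standardizes it into the correlation:

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance: average product of deviations from each mean."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Standardize the covariance by the two standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

# Two made-up series that tend to move together
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
```

Because the second series rises nearly in lockstep with the first, the correlation comes out close to +1, while the raw covariance by itself gives no such standardized reading.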
Regressions
A simple regression is an extension of the correlation/covariance concept. It
attempts to explain one variable, the dependent variable, using the other variable, the
independent variable.
Scatter Plots and Regression Lines
Keeping with statistical tradition, let Y be the dependent variable and X be the
independent variable. If the two variables are plotted against each other with each pair
of observations representing a point on the graph, you have a scatter plot, with Y on the
vertical axis and X on the horizontal axis. Figure A1.3 illustrates a scatter plot.
Figure A1.3: Scatter Plot of Y versus X
In a regression, we attempt to fit a straight line through the points that best fits
the data. In its simplest form, this is accomplished by finding a line that minimizes the
sum of the squared deviations of the points from the line. Consequently, it is called
an ordinary least squares (OLS) regression. When such a line is fit, two parameters
emerge: one is the point at which the line cuts through the Y axis, called the intercept
of the regression, and the other is the slope of the regression line:
Y = a + bX
The slope (b) of the regression measures both the direction and the magnitude of the
relationship between the dependent variable (Y) and the independent variable (X).
When the two variables are positively correlated, the slope will also be positive,
whereas when the two variables are negatively correlated, the slope will be negative.
The magnitude of the slope of the regression can be read as follows: for every unit
increase in the independent variable (X), the dependent variable (Y) will change
by b (the slope).
EstimatingRegressionParameters
Althoughtherearestatisticalpackagesthatallowustoinputdataandgetthe
regressionparametersasoutput,itisworthlookingathowtheyareestimatedinthe
firstplace.Theslopeoftheregressionlineisalogicalextensionofthecovariance
concept introduced in the last section. In fact, the slope is estimated using the
covariance:
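The estimator itself is missing from this copy; the standard OLS expressions, consistent with the descriptions in the surrounding text, are:

```latex
b \;=\; \frac{\operatorname{Cov}(X,Y)}{\sigma_X^{2}}, \qquad a \;=\; \bar{Y} - b\,\bar{X}
```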
The intercept (a) of the regression can be read in a number of ways. One interpretation
is that it is the value that Y will have when X is zero. Another is more straightforward
and is based on how it is calculated: it is the difference between the average value
of Y and the slope-adjusted value of X.
Regression parameters are always estimated with some error or statistical noise, partly
because the relationship between the variables is not perfect and partly because we
estimate them from samples of data. This noise is captured in a couple of statistics.
One is the R^2 of the regression, which measures the proportion of the variability in the
dependent variable (Y) that is explained by the independent variable (X). It is also a
direct function of the correlation between the variables:
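The equation is missing here; in a simple regression, R^2 is just the square of the correlation:

```latex
R^{2} \;=\; \rho_{XY}^{2}
```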
An R^2 value close to one indicates a strong relationship between the two variables,
though the relationship may be either positive or negative. Another measure of noise
in a regression is the standard error, which measures the spread around each of
the two parameters estimated: the intercept and the slope. Each parameter has an
associated standard error, which is calculated from the data.
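The formulas for these standard errors have not survived in this copy; the standard OLS versions, with s_e = \sqrt{SSE/(n-2)} the standard error of the regression, are:

```latex
SE_b \;=\; \frac{s_e}{\sqrt{\sum_t (X_t-\bar{X})^{2}}}, \qquad
SE_a \;=\; s_e\,\sqrt{\frac{1}{n} + \frac{\bar{X}^{2}}{\sum_t (X_t-\bar{X})^{2}}}
```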
If we make the additional assumption that the intercept and slope estimates are
normally distributed, the parameter estimate and the standard error can be combined
to get a t statistic that measures whether the relationship is statistically significant.
t Statistic for Intercept = a / SE_a
t Statistic for Slope = b / SE_b
For samples with more than 120 observations, a t statistic greater than 1.96 indicates
that the variable is significantly different from zero with 95% certainty, whereas a
statistic greater than 2.33 indicates the same with 99% certainty. For smaller samples,
the t statistic has to be larger to have statistical significance. [1]
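As a sketch of this large-sample rule: for large samples the t distribution is approximately normal, so the p-value of a t statistic can be approximated from the normal CDF (1.96 is the familiar two-tailed 95% cutoff; 2.33 is the one-tailed 99% cutoff):

```python
from statistics import NormalDist

def p_two_tailed(t):
    """Large-sample normal approximation to the two-tailed p-value of a t statistic."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

def p_one_tailed(t):
    """Large-sample normal approximation to the one-tailed p-value."""
    return 1 - NormalDist().cdf(abs(t))
```

For example, p_two_tailed(1.96) falls just under 0.05, and p_one_tailed(2.33) falls just under 0.01, which is where the cutoff values in the paragraph above come from.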
Using Regressions
The regression that measures the relationship between two variables becomes
a multiple regression when it is extended to include more than one independent
variable (X1, X2, X3, X4 ...) in trying to explain the dependent variable Y.
Multiple regressions are powerful tools that allow us to examine the determinants of
any variable.
Regression Assumptions and Constraints
Both the simple and multiple regressions described in this section also assume
linear relationships between the dependent and independent variables. If the
relationship is not linear, we have two choices. One is to transform the variables by
taking the square, square root, or natural log (for example) of the values and hope that
the relationship between the transformed variables is more linear. The other is to run
nonlinear regressions that attempt to fit a curve (rather than a straight
line) through the data.
There are implicit statistical assumptions behind every multiple regression that
we ignore at our own peril. For the coefficients on the individual independent
variables to make sense, the independent variables need to be uncorrelated with each
other, a condition that is often difficult to meet. When independent variables are
correlated with each other, the statistical hazard that is created is
called multicollinearity. In its presence, the coefficients on independent variables can
take on unexpected signs (positive instead of negative, for instance) and unpredictable
values. There are simple diagnostic statistics that allow us to measure how far the data
may be deviating from our ideal.
Correlation Types
Correlation is a measure of association between two variables. The variables are not designated as
dependent or independent. The two most popular correlation coefficients are Spearman's correlation
coefficient (rho) and Pearson's product-moment correlation coefficient.
When calculating a correlation coefficient for ordinal data, select Spearman's technique. For interval or
ratio-type data, use Pearson's technique.
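A minimal sketch contrasting the two techniques on made-up data. Spearman's rho is computed here by ranking the values and applying Pearson's formula to the ranks, which is equivalent to the rank-difference formula when there are no ties:

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation via deviation scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(v):
    """Rank the values 1..n (assumes no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks (no ties)."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # perfectly monotonic, but nonlinear
```

On this data Spearman's rho is exactly 1.0 (the rank order agrees perfectly), while Pearson's r is slightly below 1 because the relationship is not a straight line.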
The value of a correlation coefficient can vary from minus one to plus one. A minus one indicates a
perfect negative correlation, while a plus one indicates a perfect positive correlation. A correlation of zero
means there is no relationship between the two variables. When there is a negative correlation between
two variables, as the value of one variable increases, the value of the other variable decreases, and vice
versa. In other words, for a negative correlation, the variables work opposite each other. When there is a
positive correlation between two variables, as the value of one variable increases, the value of the other
variable also increases. The variables move together.
The standard error of a correlation coefficient is used to determine the confidence intervals around a true
correlation of zero. If your correlation coefficient falls outside of this range, then it is significantly different
than zero. The standard error can be calculated for interval or ratio-type data (i.e., only for Pearson's
product-moment correlation).
The significance (probability) of the correlation coefficient is determined from the t-statistic. The probability
of the t-statistic indicates whether the observed correlation coefficient occurred by chance if the true
correlation is zero. In other words, it asks if the correlation is significantly different than zero. When the t-statistic is calculated for Spearman's rank-difference correlation coefficient, there must be at least 30
cases before the t-distribution can be used to determine the probability. If there are fewer than 30 cases,
you must refer to a special table to find the probability of the correlation coefficient.
Example
A company wanted to know if there is a significant relationship between the total number of salespeople
and the total number of sales. They collect data for five months.
Salespeople (Variable 1)    Sales (Variable 2)
207                         6907
180                         5991
220                         6810
205                         6553
190                         6190
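As a check on this example, Pearson's r for the five pairs can be computed directly; this is a pure-Python sketch of the deviation-score formula, not SPSS output:

```python
from math import sqrt

salespeople = [207, 180, 220, 205, 190]
sales = [6907, 5991, 6810, 6553, 6190]

n = len(salespeople)
mx = sum(salespeople) / n
my = sum(sales) / n

# Deviation-score sums
sxy = sum((x - mx) * (y - my) for x, y in zip(salespeople, sales))
sxx = sum((x - mx) ** 2 for x in salespeople)
syy = sum((y - my) ** 2 for y in sales)

r = sxy / sqrt(sxx * syy)  # Pearson's r, about 0.92
```

The result is roughly 0.92: a strong positive correlation, consistent with the idea that months with more salespeople tend to be months with more sales.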
Regression
Simple regression is used to examine the relationship between one dependent and one independent
variable. After performing an analysis, the regression statistics can be used to predict the dependent
variable when the independent variable is known. Regression goes beyond correlation by adding
prediction capabilities.
People use regression on an intuitive level every day. In business, a well-dressed man is thought to be
financially successful. A mother knows that more sugar in her children's diet results in higher energy
levels. The ease of waking up in the morning often depends on how late you went to bed the night before.
Quantitative regression adds precision by developing a mathematical formula that can be used for
predictive purposes.
For example, a medical researcher might want to use body weight (independent variable) to predict the
most appropriate dose for a new drug (dependent variable). The purpose of running the regression is to
find a formula that fits the relationship between the two variables. Then you can use that formula to
predict values for the dependent variable when only the independent variable is known. A doctor could
prescribe the proper dose based on a person's body weight.
The regression line (known as the least squares line) is a plot of the expected value of the dependent
variable for all values of the independent variable. Technically, it is the line that "minimizes the squared
residuals". The regression line is the one that best fits the data on a scatterplot.
Using the regression equation, the dependent variable may be predicted from the independent variable.
The slope of the regression line (b) is defined as the rise divided by the run. The y intercept (a) is the
point on the y axis where the regression line would intercept the y axis. The slope and y intercept are
incorporated into the regression equation. The intercept is usually called the constant, and the slope is
referred to as the coefficient. Since the regression model is usually not a perfect predictor, there is also an
error term in the equation.
In the regression equation, y is always the dependent variable and x is always the independent variable.
Here are three equivalent ways to mathematically describe a linear regression model.
y = intercept + (slope x) + error
y = constant + (coefficient x) + error
y = a + bx + e
The significance of the slope of the regression line is determined from the t-statistic. Its associated
probability indicates whether the observed relationship could have occurred by chance if the true slope
were zero. Some researchers prefer to report the F-ratio instead of the t-statistic. The F-ratio is equal
to the t-statistic squared.
The t-statistic for the significance of the slope is essentially a test to determine if the regression model
(equation) is usable. If the slope is significantly different than zero, then we can use the regression model
to predict the dependent variable for any value of the independent variable.
On the other hand, take an example where the slope is zero. It has no prediction ability because for every
value of the independent variable, the prediction for the dependent variable would be the same. Knowing
the value of the independent variable would not improve our ability to predict the dependent variable.
Thus, if the slope is not significantly different than zero, don't use the model to make predictions.
The coefficient of determination (r-squared) is the square of the correlation coefficient. Its value may vary
from zero to one. It has the advantage over the correlation coefficient in that it may be interpreted directly
as the proportion of variance in the dependent variable that can be accounted for by the regression
equation. For example, an r-squared value of .49 means that 49% of the variance in the dependent
variable can be explained by the regression equation. The other 51% is unexplained.
The standard error of the estimate for regression measures the amount of variability in the points around
the regression line. It is the standard deviation of the data points as they are distributed around the
regression line. The standard error of the estimate can be used to develop confidence intervals around a
prediction.
Example
A company wants to know if there is a significant relationship between its advertising expenditures and its
sales volume. The independent variable is advertising budget and the dependent variable is sales
volume. A lag time of one month will be used because sales are expected to lag behind actual advertising
expenditures. Data was collected for a six month period. All figures are in thousands of dollars. Is there a
significant relationship between advertising budget and sales volume?
Advertising (Indep. Var.)    Sales (Depen. Var.)
4.2                          27.1
6.1                          30.4
3.9                          25.0
5.7                          29.7
7.3                          40.1
5.9                          28.8
You might make a statement in a report like this: A simple linear regression was performed on six months
of data to determine if there was a significant relationship between advertising expenditures and sales
volume. The t-statistic for the slope was significant at the .05 critical alpha level, t(4)=3.96, p=.015. Thus,
we reject the null hypothesis and conclude that there was a positive significant relationship between
advertising expenditures and sales volume. Furthermore, 79.7% of the variability in sales volume could be
explained by advertising expenditures.
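The numbers in that statement can be reproduced from the six data pairs with a short script; this is a sketch of the textbook formulas, not output from a statistics package:

```python
from math import sqrt

ads = [4.2, 6.1, 3.9, 5.7, 7.3, 5.9]          # advertising budget (independent)
sales = [27.1, 30.4, 25.0, 29.7, 40.1, 28.8]  # sales volume (dependent)

n = len(ads)
mx, my = sum(ads) / n, sum(sales) / n

# Deviation-score sums
sxx = sum((x - mx) ** 2 for x in ads)
sxy = sum((x - mx) * (y - my) for x, y in zip(ads, sales))
syy = sum((y - my) ** 2 for y in sales)

b = sxy / sxx                            # slope (coefficient)
a = my - b * mx                          # intercept (constant)
sse = syy - b * sxy                      # residual sum of squares
se_b = sqrt(sse / (n - 2)) / sqrt(sxx)   # standard error of the slope
t = b / se_b                             # t statistic, df = n - 2 = 4
r_squared = (sxy * sxy) / (sxx * syy)    # coefficient of determination
```

Running this gives a positive slope of about 3.68, t ≈ 3.96 on 4 degrees of freedom, and r-squared ≈ 0.797, matching the figures in the report statement.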