Measures of Association
To give examples of the types of business questions that may be answered by analyzing the association between two variables.
To list the common procedures for measuring association and to discuss how the measurement scale will influence the selection of statistical tests.
To discuss the concept of the simple correlation coefficient.
To understand that correlation does not prove causation.
To interpret a correlation matrix.
To explain the concept of bivariate linear regression.
To identify the intercept and slope of a regression line.
To discuss the least-squares method of regression analysis.
CHAPTER 22
Measure of association                     Measurement level
Correlation coefficient (Pearson's r)      Interval or ratio
Chi-square                                 Nominal
Spearman rank correlation                  Ordinal
Kendall's rank correlation                 Ordinal
For a given level of measurement, the appropriate procedure is the one with the fewest assumptions about the data.
The formula for calculating the correlation coefficient for two variables X and Y is:

r = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / √[Σ(Xᵢ - X̄)² Σ(Yᵢ - Ȳ)²]

where X̄ and Ȳ are the sample means of X and Y, respectively. An alternative way to express the correlation coefficient is in terms of variances and covariance:

r_yx = r_xy = σxy / √(σx² σy²)

where
σx² = variance of X
σy² = variance of Y
σxy = covariance of X and Y

with

σxy = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / n
If associated values of Xᵢ and Yᵢ differ from their means in the same direction, their covariance will be positive. Covariance will be negative if the values of Xᵢ and Yᵢ tend to deviate in opposite directions.
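The equivalence of the two formulas above can be checked with a short computational sketch. Python is not part of the chapter, and the function names here are ours, purely for illustration:

```python
def covariance(x, y):
    """Population covariance: mean cross-product of deviations from the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n

def variance(x):
    """Variance is the covariance of a variable with itself."""
    return covariance(x, x)

def correlation(x, y):
    """r: covariance standardized by the square root of the product of variances."""
    return covariance(x, y) / (variance(x) * variance(y)) ** 0.5

# Values deviating from their means in the same direction give positive covariance,
# and here a perfect positive correlation:
print(correlation([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0]))   # 1.0
# Deviations in opposite directions give negative covariance and negative r:
print(correlation([1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 8.0, 6.0, 4.0, 2.0]))   # -1.0
```

Because r is standardized, the result does not depend on the units of either variable.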
EXHIBIT 22.3  Scatter Diagrams Illustrating Correlation Patterns
[Six scatter diagrams: r = .30, low positive correlation; r = .80, high positive correlation; r = +1.0, perfect positive correlation; r = 0, no correlation; r = -.60, moderate negative correlation; r = -1.0, perfect negative correlation.]
In actuality, the simple correlation coefficient is a standardized measure of covariance. In the formula the numerator represents covariance and the denominator is
the square root of the product of the sample variances. Researchers find the correlation coefficient useful because two correlations can be compared without regard to
the amount of variation exhibited by each variable separately.
Exhibit 22.3 illustrates the correlation coefficients and scatter diagrams for several sets of data.
An Example
To illustrate the calculation of the correlation coefficient, an investigation is made to
determine if the average number of hours worked in manufacturing industries is related to unemployment. A correlation analysis on the data in Table 22.1 is used to
determine if the two variables are associated.
The correlation between the two variables is -.635, which indicates an inverse relationship. Thus when the number of hours worked is high, unemployment is low.
This makes intuitive sense. If factories are increasing output, regular workers typically work more overtime and new employees are hired (reducing the unemployment rate). Both variables are probably related to overall economic conditions.
A classic example is the correlation between teachers' salaries and the consumption of liquor over a period of years; the approximate correlation coefficient is r = .9. This high correlation does not indicate that teachers drink, nor does it indicate that the sale of liquor increases teachers' salaries. It is more likely that teachers' salaries and liquor sales covary because both are influenced by a third factor. In this example the relationship between the two variables is apparent but not real. Even though the variables are not causally related, they can be statistically related.
Researchers who examine statistical relationships must be aware that the variables may not be causally related.
TABLE 22.1  Correlation Analysis of Number of Hours Worked in Manufacturing Industries with Unemployment Rate

Unemployment   Hours
Rate (Xᵢ)      Worked (Yᵢ)   Xᵢ - X̄    (Xᵢ - X̄)²   Yᵢ - Ȳ    (Yᵢ - Ȳ)²   (Xᵢ - X̄)(Yᵢ - Ȳ)
5.5            39.6           0.51      0.2601      -0.71     0.5041      -0.3621
4.4            40.7          -0.59      0.3481       0.39     0.1521      -0.2301
4.1            40.4          -0.89      0.7921       0.09     0.0081      -0.0801
4.3            39.8          -0.69      0.4761      -0.51     0.2601       0.3519
6.8            39.2           1.81      3.2761      -1.11     1.2321      -2.0091
5.5            40.3           0.51      0.2601      -0.01     0.0001      -0.0051
5.5            39.7           0.51      0.2601      -0.61     0.3721      -0.3111
6.7            39.8           1.71      2.9241      -0.51     0.2601      -0.8721
5.5            40.4           0.51      0.2601       0.09     0.0081       0.0459
5.7            40.5           0.71      0.5041       0.19     0.0361       0.1349
5.2            40.7           0.21      0.0441       0.39     0.1521       0.0819
4.5            41.2          -0.49      0.2401       0.89     0.7921      -0.4361
3.8            41.3          -1.19      1.4161       0.99     0.9801      -1.1781
3.8            40.6          -1.19      1.4161       0.29     0.0841      -0.3451
3.6            40.7          -1.39      1.9321       0.39     0.1521      -0.5421
3.5            40.6          -1.49      2.2201       0.29     0.0841      -0.4321
4.9            39.8          -0.09      0.0081      -0.51     0.2601       0.0459
5.9            39.9           0.91      0.8281      -0.41     0.1681      -0.3731
5.6            40.6           0.61      0.3721       0.29     0.0841       0.1769

X̄ = 4.99    Ȳ = 40.31
Σ(Xᵢ - X̄)² = 17.8379
Σ(Yᵢ - Ȳ)² = 5.5899
Σ(Xᵢ - X̄)(Yᵢ - Ȳ) = -6.3389

r = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / √[Σ(Xᵢ - X̄)² Σ(Yᵢ - Ȳ)²] = -6.3389 / √99.712 = -.635
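The computations in Table 22.1 can be replicated in a few lines of code. This is a Python sketch of our own, not part of the text; the variable names are illustrative:

```python
# Unemployment rate (X) and hours worked (Y) from Table 22.1
unemployment = [5.5, 4.4, 4.1, 4.3, 6.8, 5.5, 5.5, 6.7, 5.5, 5.7,
                5.2, 4.5, 3.8, 3.8, 3.6, 3.5, 4.9, 5.9, 5.6]
hours = [39.6, 40.7, 40.4, 39.8, 39.2, 40.3, 39.7, 39.8, 40.4, 40.5,
         40.7, 41.2, 41.3, 40.6, 40.7, 40.6, 39.8, 39.9, 40.6]

n = len(unemployment)
x_bar = sum(unemployment) / n          # 4.99 (rounded)
y_bar = sum(hours) / n                 # 40.31 (rounded)

# Cross-products and sums of squared deviations, as in the table's columns
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(unemployment, hours))
sxx = sum((x - x_bar) ** 2 for x in unemployment)
syy = sum((y - y_bar) ** 2 for y in hours)

r = sxy / (sxx * syy) ** 0.5
print(round(r, 3))   # -0.635
```

The result matches the inverse relationship reported in the text: when hours worked are high, unemployment tends to be low.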
This can occur because both are caused by a third (or more) factor(s). When this is
so, the variables are said to be spuriously related.
coefficient of determination (r²)
A measure of that portion of the total variance of a variable that is accounted for by knowing the value of another variable.

r² = Explained variance / Total variance
TABLE 22.2  Correlation Matrixᵃ

Variables                        S      JS     GE     SE     OD     VI     JT     RA     TP     WL
Performance (S)                 1.00
Job satisfaction (JS)            .45b  1.00
Generalized self-esteem (GE)     .31b   .10   1.00
Specific self-esteem (SE)        .61b   .28b   .36b  1.00
Other-directedness (OD)          .05   -.03   -.44b  -.24c  1.00
Verbal intelligence (VI)        -.36b  -.13   -.14   -.11    .18d  1.00
[Coefficients in the rows for job-related tension (JT), role ambiguity (RA), territory potential (TP), and workload (WL) are illegible in the source.]

aNumbers below the diagonal are for the sample; those above the diagonal are omitted.
bp < .05.
REGRESSION ANALYSIS
intercept
An intercepted segment of a line. The point at which a regression line intersects the Y-axis.

slope
The inclination of a regression line as compared to a base line. Rise (vertical distance) over run (horizontal difference).
Regression is another technique for measuring the linear association between a dependent and an independent variable. Although regression and correlation are mathematically related, regression assumes the dependent (or criterion) variable, Y, is predictively linked to the independent (or predictor) variable, X. Regression analysis attempts to predict the values of a continuous, interval-scaled dependent variable from the specific values of the independent variable. For example, the amount of external funds required (the dependent variable) might be predicted on the basis of sales growth rates (the independent variable). Although there are numerous applications of regression analysis, forecasting sales is by far the most common.
The discussion here concerns bivariate linear regression. This form of regression investigates a straight-line relationship of the type Y = α + βX, where Y is the dependent variable, X is the independent variable, and α and β are two constants to be estimated. The symbol α represents the Y intercept and β is the slope coefficient. The slope β is the change in Y due to a corresponding change of one unit in X. The slope may also be thought of as "rise over run" (the rise in units on the Y axis divided by the run in units along the X axis). (Δ is the notation for "a change in.")
Suppose a researcher is interested in forecasting sales for a construction distributor (wholesaler) in Florida. Further, the distributor believes a reasonable association exists between sales and building permits issued by counties. Using bivariate linear regression on the data in Table 22.3, the researcher will be able to estimate sales potential (Y) in various counties based on the number of building permits (X).
For a better understanding of the data in Table 22.3, the data can be plotted on a scatter diagram (Exhibit 22.4). In the diagram the vertical axis indicates the value of the dependent variable Y and the horizontal axis indicates the value of the independent variable X. Each point in the diagram represents an observation of X and Y at a given point in time, that is, the paired values of Y and X.
The term "regression" goes back to Sir Francis Galton's "regression toward mediocrity," a phenomenon observed in studies of inheritance. "Tall men will tend to have shorter sons, and short men taller sons. The sons' heights, then, tend to 'regress to,' or 'go back to,' the mean of the population. Statistically, if we want to predict Y from X and the correlation between X and Y is zero, then our best prediction is to the mean." (Incidentally, the symbol r, used for the coefficient of correlation, was originally chosen because it stood for regression.)
The relationship between X and Y could be "eyeballed," that is, a straight line could be drawn through the points in the figure. However, such a line would be subject to human error: two researchers might draw different lines to describe the same data.
least-squares method
A mathematical technique ensuring that the regression line will best represent the linear relationship between X and Y.
TABLE 22.3  Relationship of Sales Potential to Building Permits Issued

         Dealer's Sales     Building
Dealer   Volume (000), Y    Permits, X
1         77                 86
2         79                 93
3         80                 95
4         83                104
5        101                139
6        117                180
7        129                165
8        120                147
9         97                119
10       106                132
11        99                126
12       121                156
13       103                129
14        86                 96
15        99                108
EXHIBIT 22.4  Scatter Diagram and Eyeball Forecast
[Scatter plot of dealer sales volume (vertical axis, 80 to 165) against building permits issued (horizontal axis, 85 to 195). Two hand-drawn "eyeball" lines ("my line" and "your line") illustrate that different researchers would fit different straight lines to the same data.]
Unless there is a perfect correlation between two variables, there will be a discrepancy between most of the actual scores (each dot) and the predicted scores based on the regression line. Simply stated, any straight line that is drawn will generate errors. The method of least squares uses the criterion of attempting to make the least amount of total error in prediction of Y from X. More technically, the procedure used in the least-squares method generates a straight line that minimizes the sum of squared deviations of the actual values from this predicted regression line. Using the symbol e to represent the deviations of the dots from the line, the least-squares criterion is:
Σeᵢ² is minimum

where
eᵢ = Yᵢ - Ŷᵢ = the residual for observation i

residual
The difference between the actual value of the dependent variable and the value estimated by the regression equation.

The estimated regression line is Ŷ = α̂ + β̂X; allowing for error, each observation may be written as

Y = α̂ + β̂X + e

The symbols α̂ and β̂ are used when the equation is a regression estimate of the line. Thus, to compute the estimated values of α̂ and β̂, we use the following formulas:

β̂ = [n(ΣXY) - (ΣX)(ΣY)] / [n(ΣX²) - (ΣX)²]

and

α̂ = Ȳ - β̂X̄

where
β̂ = estimated slope of the line (the "regression coefficient")
α̂ = estimated intercept of the Y axis
Ȳ = mean of the Y values
X̄ = mean of the X values
n = number of observations
TABLE 22.4  Least-Squares Computation

Dealer    Y      Y²        X      X²        XY
1         77     5,929     86     7,396     6,622
2         79     6,241     93     8,649     7,347
3         80     6,400     95     9,025     7,600
4         83     6,889     104    10,816    8,632
5         101    10,201    139    19,321    14,039
6         117    13,689    180    32,400    21,060
7         129    16,641    165    27,225    21,285
8         120    14,400    147    21,609    17,640
9         97     9,409     119    14,161    11,543
10        106    11,236    132    17,424    13,992
11        99     9,801     126    15,876    12,474
12        121    14,641    156    24,336    18,876
13        103    10,609    129    16,641    13,287
14        86     7,396     96     9,216     8,256
15        99     9,801     108    11,664    10,692

ΣY = 1,497    ΣY² = 153,283    ΣX = 1,875    ΣX² = 245,759    ΣXY = 193,345
Ȳ = 99.8      X̄ = 125
These equations may be solved by simple arithmetic (see Table 22.4). To estimate the relationship between the distributor's sales to a dealer and the number of building permits, the following manipulations are performed:

β̂ = [n(ΣXY) - (ΣX)(ΣY)] / [n(ΣX²) - (ΣX)²]
  = [15(193,345) - (1,875)(1,497)] / [15(245,759) - (1,875)²]
  = (2,900,175 - 2,806,875) / (3,686,385 - 3,515,625)
  = 93,300 / 170,760
  = .54638

α̂ = Ȳ - β̂X̄
  = 99.8 - .54638(125)
  = 99.8 - 68.3
  = 31.5
The formula Ŷ = 31.5 + .546X is the regression equation used for the prediction of the dependent variable. Suppose the wholesaler considers a new dealership in an area where the number of building permits equals 89. Sales may be forecast in this area as:

Ŷ = 31.5 + .546(89)
  = 31.5 + 48.6
  = 80.1

To draw the regression line, predicted values of Ŷ are calculated for two dealers:

Dealer 7 (actual Y value = 129):
Ŷ = 31.5 + .546(165) = 121.6

Dealer 3 (actual Y value = 80):
Ŷ = 31.5 + .546(95) = 83.4

Once the two Ŷ values have been predicted, a straight line connecting the points can be drawn.
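The least-squares estimates and the forecast can be verified directly from the Table 22.3 data. The following is a Python sketch of our own (not part of the text), implementing the β̂ and α̂ formulas given above:

```python
# Sales volume (Y, in $000) and building permits (X) from Table 22.3
sales =   [77, 79, 80, 83, 101, 117, 129, 120, 97, 106, 99, 121, 103, 86, 99]
permits = [86, 93, 95, 104, 139, 180, 165, 147, 119, 132, 126, 156, 129, 96, 108]

n = len(sales)
sum_x, sum_y = sum(permits), sum(sales)
sum_xy = sum(x * y for x, y in zip(permits, sales))
sum_x2 = sum(x * x for x in permits)

# Slope and intercept from the least-squares formulas
beta = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
alpha = sum_y / n - beta * (sum_x / n)

print(round(beta, 5), round(alpha, 1))   # 0.54638 31.5
print(round(alpha + beta * 89, 1))       # forecast for 89 permits: 80.1
```

The fitted values reproduce the hand computation: Ŷ = 31.5 + .546X, with a forecast of about 80.1 (thousand dollars) for an area with 89 building permits.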
PART VI  Data Analysis and Presentation
EXHIBIT 22.6  Scatter Diagram of Explained and Unexplained Variation
[The fitted regression line plotted through the sales/building-permit data. For Dealer 8 (actual sales = 120) the diagram shows the deviation from the mean partitioned into the portion explained by the regression (Ŷᵢ - Ȳ) and the unexplained residual (Yᵢ - Ŷᵢ).]
The deviation explained by the regression is computed using Ŷᵢ - Ȳ rather than Yᵢ - Ȳ. The smaller number, 8.2, is the deviation not explained by the regression. Thus the total deviation can be partitioned into two parts:

(Yᵢ - Ȳ)  =  (Ŷᵢ - Ȳ)  +  (Yᵢ - Ŷᵢ)
Total deviation = Deviation explained by the regression + Deviation unexplained by the regression (residual error)
where
Yᵢ = actual value of the dependent variable
Ȳ = mean of the dependent variable
Ŷᵢ = value estimated by the regression equation

For Dealer 8 the total deviation is 120 - 99.8 = 20.2, the deviation explained by the regression is 111.8 - 99.8 = 12, and the deviation unexplained by the regression is 120 - 111.8 = 8.2. If these deviations are squared and summed over all values of Yᵢ (i.e., all observations), they provide an estimate of the variation of Y explained by the regression and unexplained by the regression:

Σ(Yᵢ - Ȳ)² = Σ(Ŷᵢ - Ȳ)² + Σ(Yᵢ - Ŷᵢ)²
Total variation = Explained variation + Unexplained variation (residual)
We have thus partitioned the total sum of squares, SST, into two parts: the regression (explained) sum of squares, SSR, and the error sum of squares, SSE:

SST = SSR + SSE
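The partition of the sum of squares can be computed directly from the dealer data. The Python below is our own illustrative sketch, not part of the text; it also derives the F-value and r² that appear later in the chapter's analysis of variance summary:

```python
# Sales (Y, $000) and building permits (X) from Table 22.3
sales =   [77, 79, 80, 83, 101, 117, 129, 120, 97, 106, 99, 121, 103, 86, 99]
permits = [86, 93, 95, 104, 139, 180, 165, 147, 119, 132, 126, 156, 129, 96, 108]

n = len(sales)
y_bar = sum(sales) / n
sum_x, sum_y = sum(permits), sum(sales)
sum_xy = sum(x * y for x, y in zip(permits, sales))
sum_x2 = sum(x * x for x in permits)

# Least-squares fit, as computed earlier in the chapter
beta = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
alpha = y_bar - beta * sum_x / n
predicted = [alpha + beta * x for x in permits]

sst = sum((y - y_bar) ** 2 for y in sales)                    # total sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in predicted)              # explained by regression
sse = sum((y - yh) ** 2 for y, yh in zip(sales, predicted))   # residual (error)

f_value = (ssr / 1) / (sse / (n - 2))   # d.f.: k - 1 = 1 and n - k = 13
r_squared = ssr / sst
print(round(ssr, 2), round(sse, 2))              # 3398.49 483.91
print(round(f_value, 1), round(r_squared, 3))    # 91.3 0.875
```

Note that SSR + SSE reproduces SST (3,882.40), confirming the partition above.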
The beta coefficients of some well-known companies, as calculated by Merrill Lynch, are shown in the table below. Most stocks have betas in the range of 0.75 to 1.50; the average for all stocks is 1.0 by definition.

Stock               Beta
Apple Computer      1.60
Union Pacific       1.43
Georgia-Pacific     1.36
Mattel              1.15
General Electric    1.09
Bristol Myers       1.00
General Motors      0.94
McDonald's          0.93
IBM                 0.70
Anheuser-Busch      0.58
F-test
A procedure used to
determine if there is more
variability in the scores of one
sample than in the scores of
another sample.
TABLE 22.5  Analysis of Variance Table for Bivariate Regression

Source of Variation        Degrees of Freedom    Sum of Squares         Mean Square (Variance)
Explained by regression    k - 1                 SSR = Σ(Ŷᵢ - Ȳ)²       SSR / (k - 1)
Unexplained (error)        n - k                 SSE = Σ(Yᵢ - Ŷᵢ)²      SSE / (n - k)

where k = number of estimated parameters (constants) and n = number of observations.
summary table
A table that presents the results of a regression calculation.

TABLE 22.6  Analysis of Variance Summary Table for Regression of Sales on Building Permits

Source of Variation        Sum of Squares    d.f.    Mean Square    F-Value
Explained by regression    3,398.49           1      3,398.49       91.30
Unexplained (error)          483.91          13         37.22
Total                      3,882.40          14
For the example on sales forecasting, the analysis of variance summary table, comparing relative magnitudes of the mean squares, is presented in Table 22.6. From Table 6 in the Appendix we find that the F-value of 91.3, with 1 degree of freedom in the numerator and 13 degrees of freedom in the denominator, indicates that the regression is statistically significant. The coefficient of determination shows the proportion of the total variation explained by the regression:

r² = SSR / SST = 3398.49 / 3882.40 = .875
SUMMARY

In many situations two variables are interrelated or associated. Many bivariate statistical techniques can be used to measure association. Researchers select the appropriate technique on the basis of each variable's scale of measurement.
The correlation coefficient (r), a statistical measure of association between two variables, is the measure of the relationship of one variable to another. The correlation coefficient indicates the strength of the association between two variables and the direction of that association. It must be remembered that correlation does not prove causation.
8. A football team's season ticket sales, the percentage of games won, and the number of active alumni are given below. Are these variables associated?
        Season         Percentage of    Number of
Year    Ticket Sales   Games Won        Active Alumni
1985     4,995         40               NA
1986     8,599         54               NA
1987     8,479         55               NA
1988     8,419         58               NA
1989    10,253         63               NA
1990    12,457         75               6,315
1991    13,285         36               6,860
1992    14,177         27               8,423
1993    15,730         63               9,000
9. Are the different forms of consumer installment credit in the table below highly correlated? Explain.

Credit Card Debt Outstanding (Millions of Dollars)

        Gas       Travel and       Bank       Retail      Total       Total
Year    Cards     Entertainment    Credit     Cards       Credit      Installment
                  Cards            Cards                  Cards       Credit
1       $  939    $ 61             $   828    $ 9,400     $11,229     $ 79,428
2        1,119      76               1,312     10,200      12,707       87,745
3        1,298     110               2,639     10,900      14,947       98,105
4        1,650     122               3,792     11,500      17,064      102,064
5        1,804     132               4,490     13,925      20,351      111,295
6        1,762     164               5,408     14,763      22,097      127,332
7        1,832     191               6,838     16,395      25,256      147,437
8        1,823     238               9,281     17,933      29,275      156,124
9        1,993     273               9,501     18,002      29,769      164,955
10       1,981     238              11,351     19,052      32,622      185,489
11       2,074     284              14,262     21,082      37,702      216,572
10.
11.