
The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis

BRADLEY EFRON*

A random vector $x$ arises from one of two multivariate normal distributions differing in mean but not covariance. A training set $x_1, x_2, \ldots, x_n$ of previous cases, along with their correct assignments, is known. These can be used to estimate Fisher's discriminant by maximum likelihood and then to assign $x$ on the basis of the estimated discriminant, a method known as the normal discrimination procedure. Logistic regression does the same thing but with the estimation of Fisher's discriminant done conditionally on the observed values of $x_1, x_2, \ldots, x_n$. This article computes the asymptotic relative efficiency of the two procedures. Typically, logistic regression is shown to be between one half and two thirds as effective as normal discrimination for statistically interesting values of the parameters.

* Bradley Efron is professor, Department of Statistics, Stanford University, Stanford, Calif. 94305. The author is grateful to Gus Haggstrom of RAND Corporation for helpful comments.

Journal of the American Statistical Association, December 1975, Volume 70, Number 352, Theory and Methods Section

This content downloaded from 132.170.168.139 on Thu, 4 Apr 2013 13:37:07 PM All use subject to JSTOR Terms and Conditions

1. INTRODUCTION AND SUMMARY

Suppose that a random vector $x$ can arise from one of two $p$-dimensional normal populations differing in mean but not in covariance,

$$x \sim \mathcal{N}_p(\mu_1, \Sigma) \ \text{with prob } \pi_1, \qquad x \sim \mathcal{N}_p(\mu_0, \Sigma) \ \text{with prob } \pi_0, \tag{1.1}$$

where $\pi_1 + \pi_0 = 1$. If the parameters $\pi_1$, $\pi_0 = 1 - \pi_1$, $\mu_1$, $\mu_0$, $\Sigma$ are known, then $x$ can be assigned to a population on the basis of Fisher's "linear discriminant function" [1],

$$\lambda(x) = \beta_0 + \beta'x, \qquad \beta_0 \equiv \log\frac{\pi_1}{\pi_0} - \tfrac{1}{2}\left(\mu_1'\Sigma^{-1}\mu_1 - \mu_0'\Sigma^{-1}\mu_0\right), \qquad \beta' \equiv (\mu_1 - \mu_0)'\Sigma^{-1}. \tag{1.2}$$

The assignment is to population 1 if $\lambda(x) > 0$ and to population 0 if $\lambda(x) < 0$. This method of assignment minimizes the expected probability of misclassification, as is easily shown by applying Bayes theorem. There is no loss of generality in assuming $\Sigma$ nonsingular as we have done, since singular cases can always be made nonsingular by an appropriate reduction of dimension.

In usual practice, the parameters $\pi_1$, $\pi_0$, $\mu_1$, $\mu_0$, $\Sigma$ will be unknown to the statistician, but a training set $(y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n)$ will be available, where $y_j$ indicates which population $x_j$ comes from, so

$$y_j = 1 \ \text{with prob } \pi_1, \qquad y_j = 0 \ \text{with prob } \pi_0, \tag{1.3}$$

and, of course,

$$x_j \mid y_j \sim \mathcal{N}_p(\mu_{y_j}, \Sigma). \tag{1.4}$$

The $(y_j, x_j)$ are assumed independent of each other for $j = 1, 2, \ldots, n$. In this case, maximum likelihood estimates of the parameters are available,

$$\hat\pi_1 = n_1/n, \qquad \hat\pi_0 = n_0/n, \qquad \hat\mu_1 = \bar x_1 \equiv \sum_{y_j = 1} x_j/n_1, \qquad \hat\mu_0 = \bar x_0 \equiv \sum_{y_j = 0} x_j/n_0,$$

and

$$\hat\Sigma = \Big[\sum_{y_j = 1}(x_j - \bar x_1)(x_j - \bar x_1)' + \sum_{y_j = 0}(x_j - \bar x_0)(x_j - \bar x_0)'\Big]\Big/n, \tag{1.5}$$

where $n_1 \equiv \sum_j y_j$ and $n_0 \equiv n - n_1$ are the number of population 1 and population 0 cases observed, respectively. Substituting these into (1.2) gives a version of Anderson's [1] estimated linear discriminant function, say, $\hat\lambda(x) = \hat\beta_0 + \hat\beta'x$, and an estimated discrimination procedure which assigns a new $x$ to population 1 or 0 as $\hat\lambda(x)$ is greater than or less than zero. This will be referred to as the "normal discrimination procedure."

Bayes' theorem shows that $\lambda(x)$, as given in (1.2), is actually the a posteriori log odds ratio for Population 1 versus Population 0 having observed $x$,

$$\lambda(x_j) = \log\frac{\pi_1(x_j)}{\pi_0(x_j)}, \qquad \pi_i(x_j) \equiv \text{prob}\{y_j = i \mid x_j\}, \quad i = 1, 0. \tag{1.6}$$

To simplify notation we will also write

$$\pi_{ij} \equiv \pi_i(x_j) \quad \text{and} \quad \lambda \equiv \log(\pi_1/\pi_0). \tag{1.7}$$

Given the values $x_1, x_2, \ldots, x_n$, the $y_j$ are conditionally independent binary random variables,

$$\text{prob}\{y_j = 1 \mid x_j\} = \pi_{1j} = \frac{\exp(\beta_0 + \beta'x_j)}{1 + \exp(\beta_0 + \beta'x_j)}, \qquad \text{prob}\{y_j = 0 \mid x_j\} = \pi_{0j} = \frac{1}{1 + \exp(\beta_0 + \beta'x_j)}. \tag{1.8}$$
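As an illustration of the normal discrimination procedure, the following is a minimal sketch of the plug-in estimates (1.5) substituted into (1.2), restricted to the univariate case $p = 1$ so that no matrix inversion is needed. The function names and the small data set are hypothetical and are not taken from the article; the second function simply evaluates the logistic form (1.8).

```python
import math

def fit_normal_discriminant_1d(xs, ys):
    """Plug-in estimates (1.5) for p = 1; returns (beta0_hat, beta_hat) of (1.2)."""
    n = len(xs)
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    n1, n0 = len(x1), len(x0)
    mean1 = sum(x1) / n1
    mean0 = sum(x0) / n0
    # Pooled maximum likelihood variance: the divisor is n, as in (1.5).
    var = (sum((x - mean1) ** 2 for x in x1)
           + sum((x - mean0) ** 2 for x in x0)) / n
    beta = (mean1 - mean0) / var
    beta0 = math.log(n1 / n0) - 0.5 * (mean1 ** 2 - mean0 ** 2) / var
    return beta0, beta

def posterior_prob_1(beta0, beta, x):
    """pi_1(x) from (1.8): the logistic transform of the log odds lambda(x)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta * x)))

# Hypothetical training set: four cases from each population.
xs = [2.1, 1.4, 3.0, 2.5, -0.2, 0.3, -1.1, 0.6]
ys = [1, 1, 1, 1, 0, 0, 0, 0]
beta0_hat, beta_hat = fit_normal_discriminant_1d(xs, ys)
boundary = -beta0_hat / beta_hat  # point where the estimated log odds vanish
```

With equal group sizes, as in this toy data set, the estimated boundary falls midway between the two sample means.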

To estimate $(\beta_0, \beta)$, we can maximize the conditional likelihood

$$f_{\beta_0,\beta}(y_1, \ldots, y_n \mid x_1, \ldots, x_n) = \prod_{j=1}^n \pi_{1j}^{y_j}\pi_{0j}^{(1 - y_j)} = \prod_{j=1}^n \frac{\exp[(\beta_0 + \beta'x_j)y_j]}{1 + \exp(\beta_0 + \beta'x_j)}, \tag{1.9}$$

with respect to $(\beta_0, \beta)$. The maximizing values, call them $(\bar\beta_0, \bar\beta)$, give $\bar\lambda(x) = \bar\beta_0 + \bar\beta'x$ as an estimate of the linear discriminant function. The discrimination procedure which chooses Population 1 or 0 as $\bar\lambda(x)$ is greater than or less than zero will be referred to as the "logistic regression procedure." An excellent discussion of such procedures is given in Cox's monograph [2].

The logistic regression procedure must be less efficient than the normal discrimination procedure under model (1.1), at least asymptotically, as $n$ goes to infinity, since the latter is based on the full maximum likelihood estimator for $\lambda(x)$. This article calculates the asymptotic relative efficiencies (ARE) of the two procedures. The central result is that, under a variety of situations and measures of efficiency, the ARE is given by

$$\text{ARE} = (1 + \Delta^2\pi_1\pi_0)\, e^{-\Delta^2/8} \int_{-\infty}^{\infty} \frac{e^{-x^2/2}}{\pi_1 e^{\Delta x/2} + \pi_0 e^{-\Delta x/2}}\, \frac{dx}{(2\pi)^{1/2}}, \tag{1.10}$$

where

$$\Delta \equiv [(\mu_1 - \mu_0)'\Sigma^{-1}(\mu_1 - \mu_0)]^{1/2}, \tag{1.11}$$

the square root of the Mahalanobis distance. Following is a small tabulation of (1.10) for reasonable values of $\Delta$, with $\pi_1 = \pi_0 = \tfrac{1}{2}$ (the case most favorable to the logistic regression procedure).

Δ      0      .5     1      1.5    2      2.5    3      3.5    4
ARE    1.000  1.000  .995   .968   .899   .786   .641   .486   .343      (1.12)

Why use logistic regression at all if it is less efficient (and also more difficult to calculate)? Because it is more robust, at least theoretically, than normal discrimination. The conditional likelihood (1.9) is valid under general exponential family assumptions on the density $f(x)$ of $x$,

$$f(x) = g(\theta_1, \eta)h(x, \eta)\exp(\theta_1'x) \ \text{with prob } \pi_1, \qquad f(x) = g(\theta_0, \eta)h(x, \eta)\exp(\theta_0'x) \ \text{with prob } \pi_0, \tag{1.13}$$

where $\pi_1 + \pi_0 = 1$. Here, $\eta$ is an arbitrary nuisance parameter, like $\Sigma$ in (1.1). Equation (1.13) includes (1.1) as a special case.

Unfortunately, (1.12) shows that the statistician pays a moderately steep price for this added generality, assuming, of course, that (1.1) is actually correct. Just when good discrimination becomes possible, for $\Delta$ between 2.5 and 3.5, the ARE of the logistic procedure falls off sharply. The question of how to choose or compromise between the two procedures seems important, but no results are available at this time. Another important unanswered question is the relative efficiency under some model other than (1.1), when we are not playing ball on normal discrimination's home court.

In many situations, the sampling probabilities $\pi_1$, $\pi_0$ acting in (1.1) may be systematically distorted from their values in the population of interest. For example, if Population 1 is murder victims and Population 0 is all other people, a study conducted in a morgue would have $\pi_1$ much larger than in the whole population. Quite often $n_1$ and $n_0$ are set by the experimenter and are not random variables at all. These cases are discussed briefly in Section 5.

Technical details relating to asymptotic normality and consistency are omitted throughout the article. These gaps can be filled in by the application of standard exponential family theory, as presented, say, in [5], to (1.1). For another comparison of normal discrimination and logistic regression, the reader is referred to [4]. In that article, and also in [3], the distributions of $x$ are allowed to have discrete components.

2. EXPECTED ERROR RATE

By means of a linear transformation $\tilde x = a + Ax$, we can always reduce (1.1) to the case

$$\tilde x \sim \mathcal{N}_p((\Delta/2)e_1, I) \ \text{with prob } \pi_1, \qquad \tilde x \sim \mathcal{N}_p(-(\Delta/2)e_1, I) \ \text{with prob } \pi_0, \tag{2.1}$$

where $\pi_1 + \pi_0 = 1$ and $e_1' \equiv (1, 0, 0, \ldots, 0)$; $I$ is the $p \times p$ identity matrix; and $\Delta = ((\mu_1 - \mu_0)'\Sigma^{-1}(\mu_1 - \mu_0))^{1/2}$ as before. The boundary $B \equiv \{x: \lambda(x) = 0\}$ between Fisher's optimum decision regions for the two populations transforms to the new optimum boundary in the obvious way,

$$\tilde B \equiv \{\tilde x: \tilde\lambda(\tilde x) = 0\} = \{\tilde x: \tilde x = a + Ax,\ x \in B\}. \tag{2.2}$$

Moreover, if $x_1, x_2, \ldots, x_n$ is an iid sample from (1.1), and $\tilde x_i = a + Ax_i$, $i = 1, 2, \ldots, n$, is the transformed sample, then both estimated boundaries $\hat B \equiv \{x: \hat\lambda(x) = 0\}$ and $\bar B \equiv \{x: \bar\lambda(x) = 0\}$ also transform as in (2.2). In words, then, for both logistic regression and normal discrimination, the estimated discrimination procedure based on the transformed data is the transform of that based on the original data. All of these statements are easy to verify.

Suppose we have the regions $R_0$ and $R_1$, a partition of the $p$-dimensional space $E^p$, and we decide for population 0 or population 1 as $x$ falls into $R_0$ or $R_1$, respectively. The error rate of such a partition is the probability of misclassification under assumptions (1.1),

$$\text{Error Rate} \equiv \pi_1\,\text{prob}\{x \in R_0 \mid x \sim \mathcal{N}_p(\mu_1, \Sigma)\} + \pi_0\,\text{prob}\{x \in R_1 \mid x \sim \mathcal{N}_p(\mu_0, \Sigma)\}. \tag{2.3}$$

When the partition is chosen randomly, as it is by the logistic regression and normal discrimination procedures, error rate is a random variable.
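The reduction to the standard situation (2.1) can be made concrete. The following is a minimal sketch, for $p = 2$ only, of one possible choice of $(a, A)$: a Cholesky factor whitens $\Sigma$, and a rotation then aligns the whitened mean difference with the $x_1$ axis. The construction and the function names are assumptions made for illustration; the article does not specify how $a$ and $A$ are to be chosen, and they are not unique.

```python
import math

def standardize_2d(mu1, mu0, sigma):
    """Return (a, A, delta) so that x_tilde = a + A x maps (1.1) to (2.1), for p = 2."""
    (s11, s12), (_, s22) = sigma
    # Cholesky factor L of Sigma (Sigma = L L'), lower triangular.
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 ** 2)
    # W = L^{-1} whitens the covariance: W Sigma W' = I.
    w = [[1.0 / l11, 0.0], [-l21 / (l11 * l22), 1.0 / l22]]
    d = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    v = [w[0][0] * d[0] + w[0][1] * d[1], w[1][0] * d[0] + w[1][1] * d[1]]
    delta = math.hypot(v[0], v[1])  # Delta of (1.11)
    c, s = v[0] / delta, v[1] / delta
    # Rotation R sending v to (delta, 0); then A = R W.
    r = [[c, s], [-s, c]]
    A = [[sum(r[i][k] * w[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    # Shift so that the midpoint of the two means goes to the origin.
    mid = [(mu1[0] + mu0[0]) / 2.0, (mu1[1] + mu0[1]) / 2.0]
    a = [-(A[i][0] * mid[0] + A[i][1] * mid[1]) for i in range(2)]
    return a, A, delta

def apply_affine(a, A, x):
    """Evaluate x_tilde = a + A x."""
    return [a[i] + A[i][0] * x[0] + A[i][1] * x[1] for i in range(2)]
```

For example, with the hypothetical inputs $\mu_1 = (1, 2)'$, $\mu_0 = (0, 0)'$ and $\Sigma$ having rows $(2, .5)$ and $(.5, 1)$, the sketch gives $\Delta = 2$, and the two means map to $(1, 0)'$ and $(-1, 0)'$.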


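For either procedure, it follows from the preceding that error rate will have the same distribution under (1.1) and (2.1). Henceforth, we will work with the simpler assumptions (2.1), calling this the "standard situation" (with the basic random variable referred to as "$x$" rather than "$\tilde x$" for convenience).

For the standard situation, Fisher's linear discriminant function (1.2) becomes

$$\lambda(x) = \lambda + \Delta x_1. \tag{2.4}$$

The boundary $\lambda(x) = 0$ is the $(p - 1)$-dimensional plane orthogonal to the $x_1$ axis and intersecting it at the value

$$\tau \equiv -\lambda/\Delta. \tag{2.5}$$

In the figure, the optimal boundary is labeled $B(0, 0)$. The figure also shows another boundary, labeled $B(d\tau, d\alpha)$, intersecting the $x_1$ axis at $\tau + d\tau$, with normal vector at an angle $d\alpha$ from the $x_1$ axis. The differential notation $d\tau$ and $d\alpha$ indicates small discrepancies from optimal, which will be the case in the large sample theory. The error rate (2.3) of the regions separated by $B(d\tau, d\alpha)$ will be denoted by $ER(d\tau, d\alpha)$. Letting

$$D_1 \equiv (\Delta/2) - \tau, \qquad D_0 \equiv (\Delta/2) + \tau, \tag{2.6}$$

we see that the error rate of the optimal boundary $B(0, 0)$ is

$$ER(0, 0) = \pi_1\Phi(-D_1) + \pi_0\Phi(-D_0), \tag{2.7}$$

where

$$\Phi(z) \equiv \int_{-\infty}^{z}\varphi(t)\,dt \quad \text{and} \quad \varphi(t) \equiv (2\pi)^{-1/2}\exp(-t^2/2)$$

as usual. (We are tacitly assuming that the two regions divided by $B(d\tau, d\alpha)$ are assigned to populations 1 and 0, respectively, in the best way.)

[Figure: Optimum Boundary $\lambda(x) = 0$ in Standard Situation. Drawn along the $x_1$ axis are the optimum boundary $B(0, 0)$ and a non-optimum boundary $B(d\tau, d\alpha)$. Note: Also shown is some other boundary intersecting the $x_1$ axis at $\tau + d\tau$ and at angle $d\alpha$.]

Now, define

$$d_1 \equiv (D_1 - d\tau)\cos(d\alpha), \qquad d_0 \equiv (D_0 + d\tau)\cos(d\alpha), \tag{2.8}$$

the distances from $\mu_1$ and $\mu_0$ to $B(d\tau, d\alpha)$. Then,

$$ER(d\tau, d\alpha) = \pi_1\Phi(-d_1) + \pi_0\Phi(-d_0). \tag{2.9}$$

From the Taylor expansions,

$$\cos(d\alpha) = 1 - (d\alpha)^2/2 + \cdots$$

and

$$\Phi(-D + d\tau) = \Phi(-D) + \varphi(D)\,d\tau + D\varphi(D)(d\tau)^2/2 + \cdots,$$

we get the following lemma.

Lemma 1: Ignoring differential terms of third and higher orders,

$$ER(d\tau, d\alpha) = ER(0, 0) + (\Delta/2)\pi_1\varphi(D_1)[(d\tau)^2 + (d\alpha)^2]. \tag{2.10}$$

Equation (2.10) makes use of the fact that, by Bayes theorem, $\pi_1\varphi(D_1)/\pi_0\varphi(D_0) = 1$, or equivalently,

$$\pi_1\varphi(D_1) = \pi_0\varphi(D_0). \tag{2.11}$$

Suppose now that the boundary $B(d\tau, d\alpha)$ is given by those $x$ satisfying

$$(\lambda + d\beta_0) + (\Delta e_1 + d\beta)'x = 0, \tag{2.12}$$

$d\beta_0$ and $d\beta = (d\beta_1, d\beta_2, \ldots, d\beta_p)'$ indicating small discrepancies from the optimal linear function (2.4). Again, ignoring higher-order terms, we have

$$d\tau = \frac{1}{\Delta}\Big(-d\beta_0 + \frac{\lambda}{\Delta}\,d\beta_1\Big),$$

and so

$$(d\tau)^2 = \frac{1}{\Delta^2}\Big((d\beta_0)^2 - \frac{2\lambda}{\Delta}\,d\beta_0\,d\beta_1 + \frac{\lambda^2}{\Delta^2}(d\beta_1)^2\Big). \tag{2.13}$$

Similarly, expansion of

$$d\alpha = \arctan\big[((d\beta_2)^2 + \cdots + (d\beta_p)^2)^{1/2}/(\Delta + d\beta_1)\big]$$

gives

$$(d\alpha)^2 = ((d\beta_2)^2 + (d\beta_3)^2 + \cdots + (d\beta_p)^2)/\Delta^2. \tag{2.14}$$

Finally, suppose that under some method of estimation, the $(p + 1)$ vector of errors $(d\beta_0, d\beta)$ has a limiting normal distribution with mean vector 0 and covariance matrix $\Sigma/n$,

$$\mathcal{L}: \sqrt{n}\begin{pmatrix} d\beta_0 \\ d\beta \end{pmatrix} \to \mathcal{N}_{p+1}(0, \Sigma). \tag{2.15}$$

The differential term which appears in Lemma 1,

$$(d\tau)^2 + (d\alpha)^2 = \frac{1}{\Delta^2}\Big[(d\beta_0)^2 - \frac{2\lambda}{\Delta}\,d\beta_0\,d\beta_1 + \Big(\frac{\lambda}{\Delta}\Big)^2(d\beta_1)^2 + (d\beta_2)^2 + \cdots + (d\beta_p)^2\Big], \tag{2.16}$$

will then have the limiting distribution of $1/n$ times the normal quadratic form

$$\frac{1}{\Delta^2}\Big[z_0^2 - \frac{2\lambda}{\Delta}z_0z_1 + \Big(\frac{\lambda}{\Delta}\Big)^2z_1^2 + z_2^2 + \cdots + z_p^2\Big],$$

where $z \sim \mathcal{N}_{p+1}(0, \Sigma)$.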
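Lemma 1 can be checked numerically. The following is a minimal sketch that evaluates the exact error rate (2.9) of a perturbed boundary and the second-order approximation (2.10) in the standard situation. The function names and the perturbation sizes used in the example are illustrative assumptions, not part of the article.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def error_rate(pi1, delta, d_tau=0.0, d_alpha=0.0):
    """Exact ER(d_tau, d_alpha) from (2.8)-(2.9) in the standard situation."""
    pi0 = 1.0 - pi1
    tau = -math.log(pi1 / pi0) / delta                 # (2.5)
    D1, D0 = delta / 2.0 - tau, delta / 2.0 + tau      # (2.6)
    d1 = (D1 - d_tau) * math.cos(d_alpha)              # (2.8)
    d0 = (D0 + d_tau) * math.cos(d_alpha)
    return pi1 * Phi(-d1) + pi0 * Phi(-d0)

def lemma1_approx(pi1, delta, d_tau, d_alpha):
    """Second-order approximation (2.10) to ER(d_tau, d_alpha)."""
    pi0 = 1.0 - pi1
    tau = -math.log(pi1 / pi0) / delta
    D1 = delta / 2.0 - tau
    excess = (delta / 2.0) * pi1 * phi(D1) * (d_tau ** 2 + d_alpha ** 2)
    return error_rate(pi1, delta) + excess

exact = error_rate(0.6, 2.0, d_tau=0.05, d_alpha=0.03)
approx = lemma1_approx(0.6, 2.0, d_tau=0.05, d_alpha=0.03)
```

For small perturbations such as these, the two values should differ only by terms of third and higher order, and the perturbed boundary should never have a smaller error rate than $B(0, 0)$.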


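Assuming moments converge correctly, which turns out to be the case for the logistic regression and normal discriminant procedures, Lemma 1 gives a simple expression for the expected error rate in terms of the elements $\sigma_{ij}$ of $\Sigma$.

Theorem 1: Ignoring terms of order less than $1/n$,

$$E\{ER(d\tau, d\alpha) - ER(0, 0)\} = \frac{\pi_1\varphi(D_1)}{2\Delta n}\Big[\sigma_{00} - \frac{2\lambda}{\Delta}\sigma_{01} + \frac{\lambda^2}{\Delta^2}\sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp}\Big]. \tag{2.17}$$

The quantity $E\{ER(d\tau, d\alpha) - ER(0, 0)\}$ is a measure of our expected regret, in terms of increased error rate, when using some estimated discrimination procedure. In Section 3, we evaluate $\Sigma$ for the logistic regression procedure and the normal discriminant procedure and then use Theorem 1 to compare the two procedures.

3. ASYMPTOTIC ERROR RATES OF THE TWO PROCEDURES

First we consider the normal discriminant procedure described after (1.5).

Lemma 2: In the standard situation, the normal discriminant procedure produces estimates $(\hat\beta_0, \hat\beta') = (\lambda, \Delta e_1') + (d\hat\beta_0, d\hat\beta')$ satisfying

$$\mathcal{L}: \sqrt{n}\begin{pmatrix} d\hat\beta_0 \\ d\hat\beta \end{pmatrix} \to \mathcal{N}_{p+1}(0, \hat\Sigma), \tag{3.1}$$

where

$$\hat\Sigma = \frac{1}{\pi_1\pi_0}\begin{pmatrix} 1 + \dfrac{\Delta^2}{4} & -\dfrac{\Delta}{2}(\pi_0 - \pi_1) & 0 & \cdots & 0 \\ -\dfrac{\Delta}{2}(\pi_0 - \pi_1) & 1 + 2\Delta^2\pi_1\pi_0 & 0 & \cdots & 0 \\ 0 & 0 & 1 + \Delta^2\pi_1\pi_0 & & \\ \vdots & \vdots & & \ddots & \\ 0 & 0 & & & 1 + \Delta^2\pi_1\pi_0 \end{pmatrix}. \tag{3.2}$$

Proof: The density of a single $(y, x)$ pair under (1.3)-(1.4) is

$$f_{\lambda,\mu_1,\mu_0,\Sigma}(y, x) = \frac{\pi_y}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\big[-\tfrac{1}{2}(x - \mu_y)'\Sigma^{-1}(x - \mu_y)\big], \tag{3.3}$$

with $\lambda = \log \pi_1/\pi_0$, as before. Let us write the distinct elements of $\Sigma$ as a $p(p + 1)/2$ vector $(\sigma_{11}, \sigma_{12}, \ldots, \sigma_{1p}, \sigma_{22}, \sigma_{23}, \ldots, \sigma_{pp})$ and indicate this vector as $(\sigma^{(1)}, \sigma^{(2)})$, where

$$\sigma^{(1)} \equiv (\sigma_{11}, \sigma_{12}, \ldots, \sigma_{1p}), \qquad \sigma^{(2)} \equiv (\sigma_{22}, \sigma_{23}, \ldots, \sigma_{pp}). \tag{3.4}$$

Standard results using Fisher's information matrix then give the following asymptotic distributions for the maximum likelihood estimates in the standard situation.

$$\begin{aligned} \mathcal{L}&: \sqrt{n}(\hat\lambda - \lambda) \to \mathcal{N}_1(0, 1/(\pi_0\pi_1)), \\ \mathcal{L}&: \sqrt{n}(\hat\mu_i - \mu_i) \to \mathcal{N}_p(0, (1/\pi_i)I), \quad i = 1, 0, \\ \mathcal{L}&: \sqrt{n}(\hat\sigma^{(1)} - \sigma^{(1)}) \to \mathcal{N}_p(0, I + E_{11}). \end{aligned} \tag{3.5}$$

Moreover, $\hat\lambda$, $\hat\mu_1$, $\hat\mu_0$, $\hat\sigma^{(1)}$ and $\hat\sigma^{(2)}$ are asymptotically uncorrelated. (We do not need the limiting distribution of $\hat\sigma^{(2)}$ for the proof of Lemma 2.) Here, $\lambda \equiv \log \pi_1/\pi_0$, and $E_{11}$ is the $p \times p$ matrix having upper left element one and all others zero.

Differentiating (1.2) gives

$$\begin{aligned} \frac{\partial\beta_0}{\partial\lambda} &= 1, & \frac{\partial\beta_0}{\partial\mu_1} &= -\Sigma^{-1}\mu_1, & \frac{\partial\beta_0}{\partial\mu_0} &= \Sigma^{-1}\mu_0, \\ \frac{\partial\beta}{\partial\lambda} &= 0, & \frac{\partial\beta}{\partial\mu_{1i}} &= \Sigma^{-1}e_i, & \frac{\partial\beta}{\partial\mu_{0i}} &= -\Sigma^{-1}e_i, \end{aligned}$$

$$\frac{\partial\beta_0}{\partial\sigma_{ij}} = \frac{\mu_1'\Sigma^{-1}(E_{ij} + E_{ji})\Sigma^{-1}\mu_1 - \mu_0'\Sigma^{-1}(E_{ij} + E_{ji})\Sigma^{-1}\mu_0}{2(1 + \delta_{ij})}, \qquad \frac{\partial\beta}{\partial\sigma_{ij}} = -\frac{\Sigma^{-1}(E_{ij} + E_{ji})\Sigma^{-1}(\mu_1 - \mu_0)}{1 + \delta_{ij}}, \tag{3.6}$$

$\mu_{0i}$ indicating the $i$th component of $\mu_0$; likewise for $\mu_{1i}$, with derivatives involving vectors taken componentwise in the obvious way, and $e_i$ denoting the $i$th coordinate vector. Moreover, $\delta_{ij} = 1$ or 0 as $i = j$ or $i \neq j$, and $E_{ij}$ is the matrix with one in the $ij$th position and zero elsewhere. In the standard situation, we have the differential relationship

$$\begin{pmatrix} d\hat\beta_0 \\ d\hat\beta \end{pmatrix} = \begin{pmatrix} 1 & -\dfrac{\Delta}{2}e_1' & -\dfrac{\Delta}{2}e_1' & 0 & 0 \\ 0 & I & -I & -\Delta I & 0 \end{pmatrix}\begin{pmatrix} d\hat\lambda \\ d\hat\mu_1 \\ d\hat\mu_0 \\ d\hat\sigma^{(1)} \\ d\hat\sigma^{(2)} \end{pmatrix}. \tag{3.7}$$

Letting $M$ be the matrix on the right side of (3.7),

$$\mathcal{L}: \sqrt{n}\begin{pmatrix} d\hat\beta_0 \\ d\hat\beta \end{pmatrix} \to \mathcal{N}_{p+1}\big(0, M[\Sigma_{\hat\lambda,\hat\mu_1,\hat\mu_0,\hat\sigma^{(1)},\hat\sigma^{(2)}}]M'\big), \tag{3.8}$$

where $\Sigma_{\hat\lambda,\hat\mu_1,\hat\mu_0,\hat\sigma^{(1)},\hat\sigma^{(2)}}$ is the joint limiting covariance matrix of $(\hat\lambda, \hat\mu_1, \hat\mu_0, \hat\sigma^{(1)}, \hat\sigma^{(2)})$, as indicated by (3.5). Evaluation of (3.8) gives the result.

Next consider the logistic regression estimates defined at (1.9).

Lemma 3: In the standard situation, the logistic regression procedure produces estimates $(\bar\beta_0, \bar\beta') = (\lambda, \Delta e_1') + (d\bar\beta_0, d\bar\beta')$ satisfying

$$\mathcal{L}: \sqrt{n}\begin{pmatrix} d\bar\beta_0 \\ d\bar\beta \end{pmatrix} \to \mathcal{N}_{p+1}(0, \bar\Sigma), \tag{3.9}$$

where

$$\bar\Sigma = \frac{1}{\pi_1\pi_0}\begin{pmatrix} \dfrac{A_2}{A_0A_2 - A_1^2} & \dfrac{-A_1}{A_0A_2 - A_1^2} & 0 & \cdots & 0 \\ \dfrac{-A_1}{A_0A_2 - A_1^2} & \dfrac{A_0}{A_0A_2 - A_1^2} & 0 & \cdots & 0 \\ 0 & 0 & \dfrac{1}{A_0} & & \\ \vdots & \vdots & & \ddots & \\ 0 & 0 & & & \dfrac{1}{A_0} \end{pmatrix}, \tag{3.10}$$

$A_i \equiv A_i(\pi_1, \Delta)$ being defined by

$$A_i(\pi_1, \Delta) \equiv \int_{-\infty}^{\infty}\frac{e^{-\Delta^2/8}\,x^i\varphi(x)}{\pi_1e^{\Delta x/2} + \pi_0e^{-\Delta x/2}}\,dx, \quad i = 0, 1, 2. \tag{3.11}$$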
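The evaluation of (3.8) in the proof of Lemma 2 can be checked by direct matrix arithmetic. The following is a minimal sketch for $p = 2$, assuming the block form of $M$ in (3.7) and the diagonal limiting covariance implied by (3.5). The function names are hypothetical, and the variance entered for $\hat\sigma_{22}$ is a placeholder that cannot affect the result because the corresponding column of $M$ is zero.

```python
def lemma2_from_delta_method(pi1, delta):
    """Evaluate M * Cov * M' of (3.8) for p = 2; returns a 3 x 3 nested list."""
    pi0 = 1.0 - pi1
    # Columns of M: lambda, mu_11, mu_12, mu_01, mu_02, sigma_11, sigma_12, sigma_22.
    M = [
        [1.0, -delta / 2.0, 0.0, -delta / 2.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, -1.0, 0.0, -delta, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0, -1.0, 0.0, -delta, 0.0],
    ]
    # Diagonal of the joint limiting covariance from (3.5); the estimates are
    # asymptotically uncorrelated. The last entry is a placeholder (see lead-in).
    var = [1.0 / (pi1 * pi0), 1.0 / pi1, 1.0 / pi1, 1.0 / pi0, 1.0 / pi0,
           2.0, 1.0, 2.0]
    return [[sum(M[i][k] * var[k] * M[j][k] for k in range(8))
             for j in range(3)] for i in range(3)]

def lemma2_closed_form(pi1, delta):
    """The matrix (3.2) for p = 2."""
    pi0 = 1.0 - pi1
    k = 1.0 / (pi1 * pi0)
    off = -k * (delta / 2.0) * (pi0 - pi1)
    return [
        [k * (1.0 + delta ** 2 / 4.0), off, 0.0],
        [off, k * (1.0 + 2.0 * delta ** 2 * pi1 * pi0), 0.0],
        [0.0, 0.0, k * (1.0 + delta ** 2 * pi1 * pi0)],
    ]
```

The two functions should agree entry by entry for any $0 < \pi_1 < 1$ and $\Delta > 0$.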

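Proof: The density (1.9) can be written in exponential family form as

$$f_{\beta_0,\beta}(y_1, y_2, \ldots, y_n \mid x_1, \ldots, x_n) = \exp[(\beta_0, \beta')T - \psi(\beta_0, \beta)], \tag{3.12}$$

where

$$T \equiv \sum_{j=1}^n\begin{pmatrix} 1 \\ x_j \end{pmatrix}y_j, \qquad \psi(\beta_0, \beta) \equiv \sum_{j=1}^n\log(1 + \exp(\beta_0 + \beta'x_j)).$$

The sufficient statistic $T$ has mean vector and covariance matrix

$$E_{\beta_0,\beta}T = \sum_{j=1}^n\begin{pmatrix} 1 \\ x_j \end{pmatrix}\pi_{1j}, \qquad \text{Cov}_{\beta_0,\beta}T = \sum_{j=1}^n\pi_{1j}\pi_{0j}\begin{pmatrix} 1 \\ x_j \end{pmatrix}(1, x_j'). \tag{3.13}$$

Let $F^{(n)}$ denote the sample cdf of $x_1, x_2, \ldots, x_n$, and suppose that $\mathcal{L}: F^{(n)} \to F$ as $n \to \infty$. Then

$$\lim_{n\to\infty}\frac{1}{n}\text{Cov}_{\beta_0,\beta}T = \int_{E^p}\begin{pmatrix} 1 \\ x \end{pmatrix}(1, x')\pi_1(x)\pi_0(x)\,dF(x), \tag{3.14}$$

where $\pi_1(x) = 1 - \pi_0(x) = [1 + \exp(-(\beta_0 + \beta'x))]^{-1}$. Exponential family theory says that the mapping from the expectation vector $E_{\beta_0,\beta}T$ to the natural parameters $(\beta_0, \beta)$ has Jacobian matrix $[\text{Cov}_{\beta_0,\beta}T]^{-1}$. Therefore, the "delta method" gives

$$\lim_{n\to\infty} n\,\text{Cov}_{\beta_0,\beta}\begin{pmatrix} \bar\beta_0 \\ \bar\beta \end{pmatrix} = \Big[\int_{E^p}\begin{pmatrix} 1 \\ x \end{pmatrix}(1, x')\pi_1(x)\pi_0(x)\,dF(x)\Big]^{-1}. \tag{3.15}$$

Under the sampling scheme (2.1), $F$ will be the mixture of the normal populations $\mathcal{N}_p((\Delta/2)e_1, I)$ and $\mathcal{N}_p(-(\Delta/2)e_1, I)$ in proportions $\pi_1$, $\pi_0$. In the standard situation, $\pi_1(x) = 1 - \pi_0(x) = [1 + \exp(-(\lambda + \Delta x_1))]^{-1}$. We get

$$\lim_{n\to\infty}\frac{1}{n}\text{Cov}_{\beta_0,\beta}T = \pi_1\pi_0\begin{pmatrix} A_0 & A_1 & 0 & \cdots & 0 \\ A_1 & A_2 & 0 & \cdots & 0 \\ 0 & 0 & A_0 & & \\ \vdots & \vdots & & \ddots & \\ 0 & 0 & & & A_0 \end{pmatrix} \tag{3.16}$$

from (3.14). The covariance matrix (3.10) for $(d\bar\beta_0, d\bar\beta')$ follows from (3.15). The fact that $(\bar\beta_0, \bar\beta')$ is consistent for $(\beta_0, \beta')$ and asymptotically normal, which is the remainder of Lemma 3, is not difficult to show, given the structure (2.1). Like most of the other regularity properties, it will not be demonstrated here.

We can now compute the relative efficiency of logistic regression to normal discrimination by Theorem 1. Denote the errors for the two procedures by $(d\bar\tau, d\bar\alpha)$ and $(d\hat\tau, d\hat\alpha)$, respectively, and define the efficiency measure,

$$\text{Eff}_p(\lambda, \Delta) \equiv \lim_{n\to\infty}\frac{E\{ER(d\hat\tau, d\hat\alpha) - ER(0, 0)\}}{E\{ER(d\bar\tau, d\bar\alpha) - ER(0, 0)\}}. \tag{3.17}$$

Theorem 1, and Lemmas 2 and 3 then give

$$\text{Eff}_p(\lambda, \Delta) = (Q_1 + (p - 1)Q_2)/(Q_3 + (p - 1)Q_4), \tag{3.18}$$

where

$$\begin{aligned} Q_1 &\equiv \Big(1, \frac{\lambda}{\Delta}\Big)\begin{pmatrix} 1 + \dfrac{\Delta^2}{4} & (\pi_0 - \pi_1)\dfrac{\Delta}{2} \\ (\pi_0 - \pi_1)\dfrac{\Delta}{2} & 1 + 2\pi_1\pi_0\Delta^2 \end{pmatrix}\begin{pmatrix} 1 \\ \lambda/\Delta \end{pmatrix}, & Q_2 &\equiv 1 + \pi_1\pi_0\Delta^2, \\ Q_3 &\equiv \Big(1, \frac{\lambda}{\Delta}\Big)\frac{1}{A_0A_2 - A_1^2}\begin{pmatrix} A_2 & A_1 \\ A_1 & A_0 \end{pmatrix}\begin{pmatrix} 1 \\ \lambda/\Delta \end{pmatrix}, & Q_4 &\equiv \frac{1}{A_0}. \end{aligned} \tag{3.19}$$

Rewriting (3.18) gives a simple expression for $\text{Eff}_p(\lambda, \Delta)$ as a weighted average of the relative efficiencies when $p = 1$ and $p = \infty$.

Theorem 2: The relative efficiency of logistic regression to normal discrimination is

$$\text{Eff}_p(\lambda, \Delta) = \frac{q(\lambda, \Delta)\,\text{Eff}_1(\lambda, \Delta) + (p - 1)\,\text{Eff}_\infty(\lambda, \Delta)}{q(\lambda, \Delta) + (p - 1)}, \tag{3.20}$$

where

$$\text{Eff}_1(\lambda, \Delta) \equiv Q_1/Q_3 \quad \text{and} \quad \text{Eff}_\infty(\lambda, \Delta) \equiv Q_2/Q_4, \tag{3.21}$$

as defined in (3.19), are the relative efficiencies when $p = 1$ and $p = \infty$, respectively, and

$$q(\lambda, \Delta) \equiv Q_3/Q_4. \tag{3.22}$$

It is obvious from (3.18) that $\text{Eff}_\infty(\lambda, \Delta) = Q_2/Q_4$ really is the asymptotic efficiency as $p \to \infty$. For $p = 1$, (3.18) gives $\text{Eff}_1(\lambda, \Delta) = Q_1/Q_3$. This follows from Lemma 1 because $d\alpha$ can always be taken equal to zero when $p = 1$.

The case $\lambda = 0$ gives a particularly simple answer (since then $A_1 = 0$).

Corollary: When $\lambda = 0$, i.e., when $\pi_1 = \pi_0 = \tfrac{1}{2}$,

$$\text{Eff}_p(\lambda, \Delta) = \text{Eff}_\infty(\lambda, \Delta) = A_0(1 + \Delta^2/4), \tag{3.23}$$

for all values of $p$.

Table 1 gives numerical values for the quantities involved.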
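The quantities in Theorem 2 can be evaluated numerically. The following is a minimal sketch that computes $A_i(\pi_1, \Delta)$ from (3.11) and then $Q_1, \ldots, Q_4$, $\text{Eff}_1$, $\text{Eff}_\infty$, $q$ and $\text{Eff}_p$ from (3.19)-(3.22). The use of composite Simpson's rule, the integration range, and all function names are assumptions made for illustration; they are not the author's method of computation.

```python
import math

def A(i, pi1, delta, half_width=12.0, steps=4000):
    """A_i(pi1, delta) of (3.11) by composite Simpson's rule (an assumed quadrature)."""
    pi0 = 1.0 - pi1

    def integrand(x):
        dens = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        mix = pi1 * math.exp(delta * x / 2.0) + pi0 * math.exp(-delta * x / 2.0)
        return math.exp(-delta ** 2 / 8.0) * x ** i * dens / mix

    h = 2.0 * half_width / steps  # steps must be even for Simpson's rule
    total = integrand(-half_width) + integrand(half_width)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(-half_width + k * h)
    return total * h / 3.0

def theorem2(pi1, delta, p):
    """Q_1..Q_4 of (3.19) and the efficiencies of (3.20)-(3.22)."""
    pi0 = 1.0 - pi1
    r = math.log(pi1 / pi0) / delta  # lambda / Delta
    a0, a1, a2 = (A(i, pi1, delta) for i in range(3))
    q1 = ((1.0 + delta ** 2 / 4.0) + r * (pi0 - pi1) * delta
          + r * r * (1.0 + 2.0 * pi1 * pi0 * delta ** 2))
    q2 = 1.0 + pi1 * pi0 * delta ** 2
    q3 = (a2 + 2.0 * r * a1 + r * r * a0) / (a0 * a2 - a1 ** 2)
    q4 = 1.0 / a0
    eff_1, eff_inf, q = q1 / q3, q2 / q4, q3 / q4
    eff_p = (q * eff_1 + (p - 1) * eff_inf) / (q + (p - 1))
    return {"Q": (q1, q2, q3, q4), "eff_1": eff_1, "eff_inf": eff_inf,
            "q": q, "eff_p": eff_p}

result = theorem2(0.5, 2.0, p=3)
```

Since $\text{Eff}_\infty = Q_2/Q_4 = (1 + \Delta^2\pi_1\pi_0)A_0$ has the same form as (1.10), the value of `result["eff_inf"]` for $\pi_1 = .5$, $\Delta = 2$ can be compared with the corresponding entry of (1.12).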


Table 1. Relative Efficiencies of Logistic Regression to Normal Discrimination (a)

π1 (or π0)   Δ     Eff∞    Eff1    q       A0      A1      A2
.5           2     .899    .899    1       .450    0       .266
.6           2     .892    .906    1.024   .458    -.038   .273
.667         2     .879    .913    1.070   .465    -.067   .287
.75          2     .855    .915    1.177   .488    -.108   .319
.9           2     .801    .804    1.697   .589    -.253   .487
.95          2     .801    .706    2.233   .674    -.375   .667
.5           2.5   .786    .786    1       .307    0       .154
.6           2.5   .778    .794    1.013   .311    -.025   .158
.667         2.5   .762    .806    1.038   .319    -.044   .167
.75          2.5   .733    .819    1.096   .337    -.074   .188
.9           2.5   .660    .750    1.379   .423    -.181   .304
.95          2.5   .650    .637    1.671   .501    -.282   .441
.5           3     .641    .641    1       .197    0       .084
.6           3     .633    .649    1.008   .200    -.016   .087
.667         3     .618    .662    1.023   .206    -.027   .092
.75          3     .589    .682    1.057   .219    -.046   .104
.9           3     .511    .667    1.225   .282    -.117   .175
.95          3     .492    .588    1.400   .344    -.189   .265
.5           3.5   .486    .486    1       .120    0       .044
.6           3.5   .479    .493    1.005   .122    -.009   .045
.667         3.5   .467    .505    1.014   .125    -.016   .048
.75          3.5   .442    .526    1.035   .134    -.027   .055
.9           3.5   .370    .550    1.142   .176    -.070   .095
.95          3.5   .348    .516    1.252   .220    -.116   .147
.5           4     .343    .343    1       .069    0       .022
.6           4     .338    .348    1.003   .070    -.005   .022
.667         4     .328    .358    1.009   .072    -.009   .024
.75          4     .309    .375    1.024   .077    -.014   .027
.9           4     .252    .416    1.094   .103    -.039   .048
.95          4     .230    .416    1.168   .131    -.065   .076

(a) See (3.17), (3.19), (3.20), (3.21) for definition of terms.

4. ANGLE AND INTERCEPT ERROR

The terms "$\text{Eff}_1(\lambda, \Delta)$" and "$\text{Eff}_\infty(\lambda, \Delta)$" which appear in (3.20), Theorem 2, have another interpretation. $\text{Eff}_\infty(\lambda, \Delta)$ is the asymptotic relative efficiency of logistic regression to normal discrimination for estimating the angle of the discriminant boundary,

$$\text{Eff}_\infty(\lambda, \Delta) = \lim_{n\to\infty}\frac{\text{Var}(d\hat\alpha)}{\text{Var}(d\bar\alpha)}. \tag{4.1}$$

(See the figure and the definitions preceding (3.17).) Likewise, $\text{Eff}_1(\lambda, \Delta)$ is the asymptotic relative efficiency for estimating the intercept of the discriminant boundary,

$$\text{Eff}_1(\lambda, \Delta) = \lim_{n\to\infty}\frac{\text{Var}(d\hat\tau)}{\text{Var}(d\bar\tau)}. \tag{4.2}$$

These results follow immediately from (2.13), (2.14), (3.2), (3.10) and (3.21).

Comparing (2.14) with Lemmas 2 and 3 shows that

$$\mathcal{L}: n(d\hat\alpha)^2 \to \frac{1}{\pi_1\pi_0\Delta^2}(1 + \Delta^2\pi_1\pi_0)\chi^2_{p-1}, \qquad \mathcal{L}: n(d\bar\alpha)^2 \to \frac{1}{\pi_1\pi_0\Delta^2}\,\frac{1}{A_0}\,\chi^2_{p-1}. \tag{4.3}$$

The asymptotic relative efficiency of logistic regression to normal discrimination in terms of angular error is thus

$$\text{ARE} = (1 + \Delta^2\pi_1\pi_0)A_0 = \frac{1 + \Delta^2\pi_1\pi_0}{(2\pi)^{1/2}}\,e^{-\Delta^2/8}\int_{-\infty}^{\infty}\frac{e^{-x^2/2}}{\pi_1e^{\Delta x/2} + \pi_0e^{-\Delta x/2}}\,dx \tag{4.4}$$

in the strong sense that a sample of size $n$ using logistic regression produces asymptotically the same angular error distribution as a sample of size $\tilde n = \text{ARE}\cdot n$ using normal discrimination. From (1.12), we see that if $\lambda = 0$, $\Delta = 2.5$, for example, $n = 1{,}000$ is approximately equivalent to $\tilde n = 786$. ("$\text{Eff}_\infty$" in the table is also "ARE" as given by (4.4).)

The corresponding statement for intercept error is not true because the two matrices involved in the definition of $Q_1$ and $Q_3$, (3.19), are not proportional. We have to settle for the weaker second-moment efficiency statement (4.2). However, when $\lambda = 0$, i.e., when $\pi_1 = \pi_0 = \tfrac{1}{2}$, (2.13) and Lemmas 2 and 3 show that

$$\mathcal{L}: n(d\hat\tau)^2 \to (4/\Delta^2)(1 + \Delta^2/4)\chi^2_1, \qquad \mathcal{L}: n(d\bar\tau)^2 \to (4/\Delta^2)(1/A_0)\chi^2_1. \tag{4.5}$$

In this case, (4.4) with $\pi_1 = \pi_0 = \tfrac{1}{2}$ again gives the ARE in the strong sense of asymptotically equivalent sample sizes. Combining (4.3) and (4.5) with Lemma 1 shows that when $\lambda = 0$ (and so $D_1 = \Delta/2$),

$$\mathcal{L}: n\{ER(d\hat\tau, d\hat\alpha) - ER(0, 0)\} \to (\varphi(\Delta/2)/\Delta)(1 + \Delta^2/4)\chi^2_p, \qquad \mathcal{L}: n\{ER(d\bar\tau, d\bar\alpha) - ER(0, 0)\} \to (\varphi(\Delta/2)/\Delta)(1/A_0)\chi^2_p. \tag{4.6}$$

Thus, error rates for samples of size $n$ and $\tilde n = \text{ARE}\cdot n$ will have asymptotically equivalent distributions, with ARE given by (4.4), for $\pi_1 = \pi_0 = \tfrac{1}{2}$. This is not true for $\pi_1 \neq \tfrac{1}{2}$, but as the dimension $p$ gets large, it is. That is, error rates for the two procedures will have the same asymptotic distribution if $\tilde n = \text{ARE}\cdot n$, ARE given by (4.4), when $p \to \infty$ and $n/p \to \infty$. A simple proof of this follows from (2.16) and Lemmas 2 and 3.

The angular error, $d\alpha$, unlike the error rate, is not invariant under linear transformations. Formulas (4.1), (4.3), and (4.4) refer to a "standardized angular error" defined after we have made the linear transformations, which take the general model (1.1) into the standard situation (2.1). However, it is easy to show that (4.1) and (4.4) (but not (4.3)) also hold for the true, unstandardized, angular error. This true error will be some quadratic form in the standardized coordinates $d\beta_2, d\beta_3, \ldots, d\beta_p$, not depending on which procedure is used. The result follows, because for both procedures, $(d\beta_2, \ldots, d\beta_p)$ has a limiting normal distribution with covariance matrix proportional to the identity. (Actually (4.3) holds with "$\chi^2_{p-1}$" replaced by a certain weighted sum of independent $\chi^2_1$ variates.)

There are two good reasons to be interested in angular error. First, under the fixed sampling proportion setup of Section 5, it is the only error of interest.


Second, there is the well-known fact that minimizing $\sum_{i=1}^n[y_i - (a + b'x_i)]^2$ over all choices of the constant $a$ and vector $b$ gives $b$ equal to a scalar multiple of $\hat\beta$. (But $a$ does not equal $\hat\beta_0$.) This connects normal discrimination with ordinary least squares analysis and provides some justification, or at least rationale, for using $\hat\beta$ outside the framework (2.1).

Other efficiency comparisons between the two procedures, e.g., in estimating the slope $\|\beta\|$ of the discriminant function, can be obtained from Lemmas 2 and 3.

5. DISTORTED SAMPLING PROPORTIONS

It may happen that the true probabilities $\tilde\pi_1$ and $\tilde\pi_0$ for populations 1 and 0 are distorted in a known way to different values $\pi_1$ and $\pi_0$ by the nature of the sampling scheme employed. Letting $\lambda \equiv \log \pi_1/\pi_0$, $\tilde\lambda \equiv \log \tilde\pi_1/\tilde\pi_0$, suppose that for some known constant $c$,

$$\lambda = \tilde\lambda + c. \tag{5.1}$$

For example, experimental constraints might cause the statistician to randomly exclude from his training set nine out of ten population 0 members, in which case $c = \log 10$. The normal discrimination procedure described at (1.5) is then modified in the obvious way. A new $x$ is assigned to Population 1 or 0 as $\hat\lambda(x)$ is greater or less than $c$. The logistic regression procedure (1.9) is similarly modified.

Theorem 2 remains true as stated except for the following modification. The vector $(1, \lambda/\Delta)$ (and its transpose), which appears in the definitions of $Q_1$ and $Q_3$ in (3.19), is replaced by $(1, \tilde\lambda/\Delta)$. The constants $A_i \equiv A_i(\lambda, \Delta)$, which appear in $Q_3$, are not changed to $A_i(\tilde\lambda, \Delta)$. The proof of this is almost exactly the same as the proof of Theorem 2.

$\text{Eff}_\infty(\lambda, \Delta)$, the angular efficiency, remains unchanged, which is not surprising, since the discrimination boundary for any choice of $c$ is parallel to that for $c = 0$. Only the intercept is changed. When $\pi_1 = \pi_0 = .5$, the effect of choosing $c \neq 0$ is to reduce $\text{Eff}_1(\lambda, \Delta)$, the intercept efficiency of logistic regression compared to normal discrimination, as shown in the following tabulation.

         Δ = 2, π1 = .5                  Δ = 3, π1 = .5
c        0      ±1     ±2     ±3         0      ±1     ±2     ±3
Eff1     .899   .869   .836   .819       .641   .604   .550   .516      (5.2)

$\text{Eff}_1$ for other values of $c$, $\pi_1$, $\Delta$ can be obtained using the entries $A_0$, $A_1$, $A_2$ in the table.

Most frequently, the sample sizes $n_1$ and $n_0$ are set by the statistician and are not random variables at all. The usual procedure in this situation is to estimate only the angle, not the intercept, of the discrimination boundary. In terms of the figure, the statistician uses the data to select a family of parallel boundaries $B(\cdot, d\alpha)$. The value of the intercept $d\tau$ is chosen on a priori grounds by just guessing what $\lambda$ is, or may not be formally selected at all.

Either normal discrimination or logistic regression may be used to estimate the vector $\beta$ in (1.2). It can be shown that $d\hat\beta$ and $d\bar\beta$ still have the limiting distributions indicated in Lemmas 2 and 3, with $\pi_1$ and $\pi_0$ replaced by $n_1/n$ and $n_0/n$. In terms of angular error, the ARE (4.4) still gives the asymptotic relative efficiency of logistic regression to normal discrimination in the strong sense of Section 4. The quantities $\pi_1$, $\pi_0$ in (4.4) are replaced by $n_1/n$, $n_0/n$, where these proportions are assumed to exist and do not equal zero in the limit.

The estimates $\bar x_1$, $\bar x_0$, $\hat\Sigma$, given in (1.5), are maximum likelihood, whether $n_1$, $n_0$ are fixed or random. It follows that $\hat\beta' = (\bar x_1 - \bar x_0)'\hat\Sigma^{-1}$, which we can still call the normal discrimination estimate, is maximum likelihood in either case. Standard maximum likelihood arguments, similar to the proof of Lemma 2, show that $d\hat\beta$ is distributed as stated in Lemma 2, with $\pi_1$, $\pi_0$ replaced by $n_1/n$, $n_0/n$.

Let $T_1 \equiv \sum_{j=1}^n y_j$ be the first coordinate of $T$ in (3.12), and let $T_2$ be the remaining $p$ coordinates. Given that $T_1 = n_1$, the conditional density of $y_1, y_2, \ldots, y_n$ is an exponential family with natural parameter $\beta$ and sufficient statistic $T_2$,

$$f_\beta(y_1, y_2, \ldots, y_n \mid T_1 = n_1, x_1, x_2, \ldots, x_n) = \exp[\beta'T_2 - \psi_{n_1}(\beta)], \tag{5.3}$$

where $\psi_{n_1}(\beta)$ is chosen to make (5.3) sum to unity over all choices of $y_1, \ldots, y_n$ with $\sum_{j=1}^n y_j = n_1$. The analog of the logistic regression procedure is to select $\bar\beta$ to maximize the likelihood (5.3). A modification of the proof of Lemma 3, which will not be presented, shows that $d\bar\beta$ is distributed as stated there, with $\pi_1$, $\pi_0$ replaced by $n_1/n$, $n_0/n$.

In practice, the simplest way to apply logistic regression when $n_1$ and $n_0$ are fixed is simply to ignore this fact. The standard programs maximize (1.9) over the possible choices of $\beta_0$, $\beta$, and then present the maximizer $\bar\beta$ as the estimate of $\beta$. This method can be shown to be asymptotically equivalent to the conditional maximum likelihood estimator based on (5.3).

[Received December 1974. Revised March 1975.]

REFERENCES

[1] Anderson, T.W., An Introduction to Multivariate Statistical Analysis, New York: John Wiley & Sons, Inc., 1958.
[2] Cox, D.R., Analysis of Binary Data, London: Chapman and Hall, Ltd., 1970.
[3] Dempster, A., "Aspects of the Multinomial Logit Model," Multivariate Analysis, 3 (March 1973), 129-42.
[4] Halperin, M., Blackwelder, W.C. and Verter, J.I., "Estimation of the Multivariate Logistic Risk Function; A Comparison of the Discriminant Function and Maximum Likelihood Approaches," Journal of Chronic Diseases, 24 (January 1971), 125-58.
[5] Lehmann, E., Testing Statistical Hypotheses, New York: John Wiley & Sons, Inc., 1959.
