
630 F.2d 79

23 Fair Empl.Prac.Cas. 909,
23 Empl. Prac. Dec. P 31,154
The GUARDIANS ASSOCIATION OF the NEW YORK CITY POLICE
DEPARTMENT, INC., The Hispanic Society of the New York City
Police Department, Inc., Nydia I. Diaz, James Michael
Hidalgo, Wilfred Cebellero, Andre Lopez, Reinaldo Salgado,
Denise Santos, Deborah Holmes and Pamela Obey, individually
and on behalf of all those similarly situated, Plaintiffs-Appellees,
v.
CIVIL SERVICE COMMISSION OF the CITY OF NEW YORK, Department
of Personnel of the City of New York, and The New
York City Police Department, Defendants-Appellants.
No. 849, Docket 80-7027.

United States Court of Appeals,
Second Circuit.
Argued Feb. 6, 1980.
Decided July 31, 1980.

Peter Bienstock, New York City (M. D. Taracido, Kenneth Kimmerling,
Robert L. Becker, Puerto Rican Legal Defense & Education Fund, New
York City, on brief), for plaintiffs-appellees.
L. Kevin Sheridan, Asst. Corp. Counsel, New York City (Allen G.
Schwartz, Corp. Counsel, Judith A. Levitt, Steven M. Goldberg, Maureen
M. McCabe, New York City, on brief), for defendants-appellants.
David L. Rose, Washington, D. C. (Robert B. Fiske, Jr., U. S. Atty.,
Nancy E. Friedman, Richard N. Papper, Dennison Young, Jr., Asst. U. S.
Attys., New York City, Drew S. Days III, Asst. Atty. Gen., Steven H.
Rosenbaum, Washington, D. C., on brief), for the United States as amicus
curiae.
Ira H. Leibowitz, Barry Lasky, Garden City, N. Y., submitted a brief for
The Policewomen's Endowment Assoc., Inc., as amicus curiae.
H. Elliot Wales, New York City, submitted a brief for Seven Civil Service
Organizations as amicus curiae.
Before MANSFIELD and NEWMAN, Circuit Judges, and SIFTON, *
District Judge.
NEWMAN, Circuit Judge:

This employment discrimination suit pursuant to Title VII of the Civil Rights
Act of 1964, 42 U.S.C. § 2000e-2, once again requires this Court to venture into
the complex realm of testing and test validation. The test at issue was designed
by New York City officials and administered on June 30, 1979 to 36,797
applicants for positions on the City's police force. Plaintiffs are the Guardians
Association of the New York City Police Department, Inc., an organization of
Black police officers, the Hispanic Society of the New York City Police
Department, Inc., an organization of Hispanic police officers, and eight
individual Black or Hispanic applicants. Defendants are the New York City
Department of Personnel, which performed much of the test preparation, the
New York City Civil Service Commission, and the New York City Police
Department. The United States District Court for the Southern District of New
York (Robert L. Carter, Judge) found that use of the test unjustifiably
discriminates against Blacks and Hispanics in violation of Title VII. Guardians
Ass'n v. Civil Service Commission, 484 F.Supp. 785 (S.D.N.Y.1980). The
Court ordered a broad remedy, including a 50% minority hiring quota. We
affirm the District Court's finding that the City's specific use of the test violates
Title VII, but vacate the remedy and remand for entry of a revised decree.

I. Factual Background

The test in question, designated Exam No. 8155, was designed to select
candidates for hiring as entry-level police officers. Those who pass the exam
are selected, in rank order of their test scores, to complete the other aspects of
the hiring process: a medical examination, a physical agility test, a
psychological test, and a character investigation. These last four components of
the hiring process are scored only on a pass/fail basis. Thus, an applicant's
score on Exam No. 8155 is a major determinant of his prospects for becoming a
police officer. It is also the only feature of the process alleged to have a
discriminatory impact. Once an applicant scores high enough to be selected for
the final four hiring steps and successfully completes those steps, he or she
becomes a sworn police officer and enters the police academy for five months
of training. While successful completion of the training program is a
requirement of continuing as a police officer, the Department does not use the
training program as a selection device, but anticipates that nearly all academy
entrants will go on to active duty.

The exam was developed by a fairly elaborate two-stage process; the first stage
was an analysis of the police officer's job, and the second was construction of
the test itself. The job analysis consisted of five separate steps. First, the
Department of Personnel identified 71 tasks that police officers generally
perform, based on interviews with 49 police officers and 49 supervisors.
Second, a panel of seven officers and supervisors reviewed the list to add any
tasks that had been omitted, and to eliminate those items that were duplicative,
or too specialized to be performed by entry-level officers. The result was a
consolidated list of 42 entry-level tasks.

Third, a questionnaire was distributed to 5,600 police officers, requesting them
to rate each of the 42 tasks on the basis of its frequency of occurrence, its
importance, and the amount of time normally spent in performing it. The 2,600
responses that were received were then analyzed by computer to yield a ranking
of the 42 tasks, according to the combined rating of all the responses. In
addition, faculty members of John Jay College were asked to observe police
officers during an entire tour of duty and record the tasks that they performed;
their survey generally confirmed the identification of the 42 tasks.

In the fourth step of the job analysis, the Department of Personnel divided the
list of 42 ranked tasks into clusters of related activities. Five such clusters were
established: the arrest process, providing assistance to people, police
operations, stationhouse activities, and handling unusual and other occurrences.
The fifth step was an analysis of all five clusters, each one by a separate panel
of police officers, to identify the "knowledge, skills and abilities" required to
perform these tasks at the entry level, and to assign percentages reflecting the
relative importance of each of the identified knowledges, skills, and abilities for
the cluster as a whole. One panel listed five such qualities for its cluster, all of
which are properly characterized as "abilities" or "skills" (hereafter referred to
as "abilities"): recalling facts, filling out forms, understanding and applying
statutory definitions of crimes, understanding written instructions and applying
appropriate procedures, and human relations skills, including communication
techniques. Each of the other four panels used the first panel's list of abilities,
but developed its own percentages to express the relative importance of each
ability to the tasks within its cluster.

The second major stage in developing Exam No. 8155, the process of test
construction, consisted of four identifiable steps. First, the percentages of the
five abilities necessary to perform each of the five task clusters were multiplied
by the weightings that had been given to each task in Step 3 of the job analysis
on the basis of frequency, importance, and time spent. This yielded a general
measurement for the importance of each of the five abilities for performance of
the job of police officer. As a result of this computation, the Department of
Personnel concluded that on a test with 100 questions, 15 questions should test
for the ability to recall facts, 9 questions for filling out forms, 14 questions for
understanding and applying sections of the criminal law, 32 questions for
understanding written instructions and applying appropriate procedures, and 30
questions for human relations skills. Next, a group of eleven police officers was
selected to write multiple-choice questions that tested for the five abilities, as
they related to the 42 identified tasks. The officers wrote many of these
questions from Police Academy materials and similar sources, however,
without having access to descriptions of the five identified abilities, or the 42
ranked tasks. In the third step, Department of Personnel staff members who did
have access to the description of abilities and the ranking of tasks reviewed the
questions written by the police officers to assure that the questions were not
ambiguous, overly complex, overly specialized, or dependent on prior
knowledge. As a result of this review, some questions were discarded, others
were revised, and still others were added. Finally, the resulting questions were
subjected to a further review by a panel of six police experts, and by various
members of the Department of Personnel.
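
The weighting computation in the first step of test construction can be illustrated with a short sketch. The opinion does not report the panels' percentages or the task weightings, so every input below is hypothetical, as are the names `allocate_questions`, `cluster_weights`, and `ability_pcts`. The sketch also assumes that task weightings are summed within each cluster and that fractional question counts are rounded by the largest-remainder method; the record confirms neither assumption.

```python
# Illustrative only: all weights and percentages are hypothetical placeholders.
ABILITIES = ["recall", "forms", "law", "procedures", "human_relations"]


def allocate_questions(cluster_weights, ability_pcts, total_questions=100):
    """Combine per-cluster ability percentages with cluster weights and
    convert the result into whole-number question counts."""
    # Overall importance of each ability = sum over clusters of
    # (cluster weight) x (ability percentage within that cluster).
    importance = {a: 0.0 for a in ABILITIES}
    for cluster, weight in cluster_weights.items():
        for a in ABILITIES:
            importance[a] += weight * ability_pcts[cluster][a]

    # Scale so the importances add up to the number of questions.
    scale = total_questions / sum(importance.values())
    raw = {a: importance[a] * scale for a in ABILITIES}

    # Largest-remainder rounding: floor everything, then hand the leftover
    # questions to the abilities with the largest fractional parts.
    counts = {a: int(raw[a]) for a in ABILITIES}
    leftover = total_questions - sum(counts.values())
    by_remainder = sorted(ABILITIES, key=lambda a: raw[a] - counts[a], reverse=True)
    for a in by_remainder[:leftover]:
        counts[a] += 1
    return counts


# Hypothetical cluster weights (assumed to sum to 1).
cluster_weights = {
    "arrest": 0.30, "assistance": 0.25, "operations": 0.20,
    "stationhouse": 0.10, "unusual": 0.15,
}
# Hypothetical panel percentages (each row sums to 100).
ability_pcts = {
    "arrest":       {"recall": 20, "forms": 10, "law": 25, "procedures": 30, "human_relations": 15},
    "assistance":   {"recall": 10, "forms": 5,  "law": 5,  "procedures": 30, "human_relations": 50},
    "operations":   {"recall": 15, "forms": 10, "law": 10, "procedures": 40, "human_relations": 25},
    "stationhouse": {"recall": 15, "forms": 25, "law": 10, "procedures": 35, "human_relations": 15},
    "unusual":      {"recall": 15, "forms": 5,  "law": 15, "procedures": 30, "human_relations": 35},
}

print(allocate_questions(cluster_weights, ability_pcts))
```

Whatever inputs are supplied, the counts add up to the requested number of questions; the particular split depends entirely on the assumed weights.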

The test that resulted consisted of 100 multiple-choice questions, designed so
that the candidate could answer correctly without knowledge of any
information beyond what was provided on the test itself. The test materials
were determined by the Department of Personnel to require an eighth-grade
reading level, on the average, although the 14 questions on law required
college-level reading ability. The estimated time for completing the exam was 1
1/2 hours, but 3 1/2 hours were allowed.

The first part of the exam, designed to measure the ability to recall facts,
consisted of a page-and-a-half description of a burglary, and a series of 15
questions to be answered without referring back to the description. In the
second part, testing ability to fill out forms, the candidates were given a
simplified arrest form, and a page-long description of both a robbery and an
arrested suspect, and then asked 9 questions about the proper entries to be made
in filling out the form. Part three, intended to test ability to apply provisions of
law, consisted of 14 questions, each briefly presenting the facts of an incident,
and then requiring the candidate to identify the precise criminal offense
involved on the basis of definitions provided in the test materials.1 The
remaining 62 questions, of which 32 were intended to measure the ability to
follow appropriate procedures and 30 were intended to measure human
relations skills, consisted of general instructions as to procedures or appropriate
responses for certain types of situations, a description of a specific situation,
and then one or more questions asking the proper response to the situation
presented. Three of the questions dealing with appropriate procedures, for
example, involved the proper response to a bomb threat. Four of the questions
in the human relations section involved the proper way to deal with a person
who appears to be mentally ill.

The test was scored from zero to one hundred, with one point given for each
correct answer, and bonus points given for veterans. 2 The candidates were then
rank-ordered on the basis of their scores. Scores were generally high, with 13%
of the applicants scoring 98 or above, and fully 50% scoring 91 or above.
Because of the number of candidates taking the test and the bunching of
candidates in the upper range of scores, each point a candidate achieved made a
substantial difference in his position on the rank-ordering list. More than 2,000
applicants achieved a score at each numerical grade from 92 to 97.


The passing grade was determined in the following manner. The Police
Department first estimated that 4,000 police officers would be hired during the
four-year period for which the eligibility list resulting from Exam No. 8155
would be valid. The Department further estimated that only one out of three
applicants who passed Exam No. 8155 would successfully complete all the
remaining steps in the hiring process. Therefore, if this eligibility list was to
meet the Department's needs, 12,000 persons had to pass the exam to provide
the 4,000 needed police officers. With all this in mind, the Department simply
set the passing grade at the score achieved by the 12,000th highest scoring
candidate, which turned out to be 94. Because of the bunching phenomenon, a
large number of candidates, 2,124, achieved this same score, so that the actual
number who received a passing grade was 13,749.
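
The cutoff procedure described above can be sketched in a few lines. The function name `passing_grade` and the score distribution are illustrative assumptions: only the totals of 36,797 applicants, 13,749 passers, and 2,124 candidates at a score of 94 come from the opinion, while lumping every higher score at 95 and every lower score at 90 is a simplification made for the example.

```python
def passing_grade(scores, target_passers):
    """Return (cutoff, number_passing): the cutoff is the score of the
    target_passers-th highest scorer, and everyone tied at it also passes."""
    ranked = sorted(scores, reverse=True)
    cutoff = ranked[target_passers - 1]
    number_passing = sum(1 for s in ranked if s >= cutoff)
    return cutoff, number_passing


hires_needed = 4_000
passers_per_hire = 3                      # Department's estimate: one in three passers is hired
target = hires_needed * passers_per_hire  # 12,000 passers needed

# Simplified distribution (an assumption): scores above 94 are all recorded
# as 95, and scores below 94 are all recorded as 90.
scores = (
    [95] * (13_749 - 2_124)      # candidates who scored above 94
    + [94] * 2_124               # candidates tied at 94
    + [90] * (36_797 - 13_749)   # candidates who scored below 94
)

cutoff, passing = passing_grade(scores, target)
print(cutoff, passing)
```

Because the 12,000th candidate falls inside the group tied at 94, every member of that group passes, which is why the passing group exceeds the 12,000 target.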


Of the 36,797 applicants who took the test, 6,142 identified themselves as
Black, 5,239 identified themselves as Hispanic, 19,798 identified themselves as
White, and 4,847, or 13.2% did not identify their race. Thus, identified Blacks
constituted 16.7% of the total applicants, and 19.7% of all those who identified
their race, while the equivalent percentages for identified Hispanics were
14.2% and 16.8%, and for identified Whites, 53.8% and 64.5%.

Of those who passed the exam, i. e., scored 94 or better, 7.6% had identified
themselves as Black and 7.8% had identified themselves as Hispanic, for a
known minority population in the passing group of 15.4%, against a known
minority population in the applicant pool of 30.9%. In contrast, 66.6% of the
passing applicants had identified themselves as White, although Whites
comprised only 53.8% of the applicant pool.


Viewed in another and more revealing way, the figures show that, among those
who had identified themselves by race, the passing rate for Whites was 45.9%
compared to 17% for Blacks and 20.5% for Hispanics. The combined minority
pass rate was thus about two-fifths of the pass rate for Whites.


The Police Department accepted 415 candidates from the list in November,
1979 and planned to hire another 380 in January, 1980. Of this group of 795
candidates, 89.2% were White, 3.5% were Black, and 6.8% were Hispanic. The
selection rates (number chosen compared to number of applicants) for these
first two uses of the list were 0.5% for Blacks, 1% for Hispanics, and 3.6% for
Whites.

II. The District Court's Prior Proceedings and Decision



The plaintiffs filed their complaint in this suit, together with a motion for a
preliminary injunction, in October, 1979, before any candidates had been
accepted on the basis of the list. They charged that the intended use of the lists
by the Police Department constituted discrimination against Blacks and
Hispanics in violation of the Fourteenth Amendment, Title VII, and various
other Federal and state laws. The District Court, by consent of the parties,
consolidated the hearing on the preliminary injunction and the trial on the
merits. This proceeding was held on November 13, 14, and 15, 1979, shortly
after the Police Department's first use of the list. On January 11, 1980, three
days before the Department intended to use the list to accept a second group of
trainees, the District Court held a second hearing. That same day, the Court
issued an opinion, which was subsequently re-issued in revised form on
January 23.3


The Court's basic conclusion was that Exam No. 8155 violated Title VII. In
reaching this conclusion, the Court used the common mode of Title VII
analysis, in which the plaintiff is first required to establish a prima facie case on
the basis of disparate impact, and then the defendant is required to rebut the
plaintiff's case by proving that the disparity results from legitimate, job-related
selection procedures. The Court first found that the disparity between the
percentage of minority group members who achieved a passing score and the
percentage of minority group members in the applicant pool was sufficient to
establish a prima facie case. It based this finding of disparate impact on the
standards developed by the Supreme Court in Castaneda v. Partida, 430 U.S.
482, 97 S.Ct. 1272, 51 L.Ed.2d 498 (1977), and by the Equal Employment
Opportunity Commission (EEOC) in its Uniform Guidelines on Employee
Selection Procedures, 29 C.F.R. § 1607 (1979). Castaneda stated that, in cases
involving large samples, "if the difference between the expected value (from a
random selection) and the observed number is greater than two or three
standard deviations," a prima facie case is established. 430 U.S. at 496 n.17, 97
S.Ct. at 1281 n.17. 4 The Uniform Guidelines provide that "(a) selection rate for
any race, sex, or ethnic group which is less than four-fifths (4/5) (or eighty
percent) of the rate for the group with the highest rate will generally be
regarded by the Federal enforcement agencies as evidence of adverse impact."
§ 1607.4(D) (hereinafter Guideline sections are cited only by the subdivisions
of 29 C.F.R. § 1607). The District Court then noted that the discrepancy
between the percentage of minority group members in the applicant pool and
the percentage of minority group members who passed the test was 39 standard
deviations. The evidence also showed that the passing rate of the minority
group members was 44.3% of the passing rate of Whites, or about two-fifths.
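
The standard-deviation measure can be illustrated with the binomial model that Castaneda describes for random selection. The sketch below is an approximation under stated assumptions: it uses the rounded figures reported earlier in this opinion (13,749 passers, a known minority share of 30.9% in the applicant pool and 15.4% in the passing group), and the function name `standard_deviations` is illustrative. The District Court's own inputs are not set out in this section.

```python
import math


def standard_deviations(n_selected, pool_share, observed_share):
    """Number of binomial standard deviations between the expected and the
    observed count of group members among n_selected people, assuming each
    selection is an independent draw with probability pool_share."""
    expected = n_selected * pool_share
    observed = n_selected * observed_share
    sd = math.sqrt(n_selected * pool_share * (1 - pool_share))
    return (expected - observed) / sd


# Rounded inputs taken from the opinion's factual background.
z = standard_deviations(n_selected=13_749, pool_share=0.309, observed_share=0.154)
print(f"{z:.1f} standard deviations")
```

With these inputs the difference comes out a little above 39 standard deviations, in line with the figure the District Court noted and far beyond the "two or three" that Castaneda treats as significant.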

Having concluded that the plaintiffs had established a prima facie case, the
Court next concluded that the test was not sufficiently valid to constitute a
legitimate attempt to choose those applicants who would become better police
officers. Relying primarily on the EEOC Guidelines, the Court stated that a
determination of validity based on the content of the test would be
inappropriate, first because the test purported to measure abilities that the
accepted applicants would be trained to acquire, see Guidelines, § 14(C)(1), and
second, because the test actually measured constructs, not abilities, see id.5
Moreover, the Court concluded that the job analysis was not sufficiently precise
to satisfy the Guideline requirement even for content validation, see Guidelines
§ 14(C)(2). Since content validation was the only method of validation which
the City attempted, the Court concluded that the test was invalid, and thus an
inadequate rebuttal to the plaintiffs' prima facie case.


In fashioning relief, the District Court noted that a previous examination
administered by the New York City Police Department had been found to be in
violation of Title VII. Guardians Association v. Civil Service Commission, 431
F.Supp. 526 (S.D.N.Y.), vacated and remanded on other grounds, 562 F.2d 38
(2d Cir. 1977). Concluding that the defendants had "persisted in devising and
utilizing testing procedures that continue to discriminate against blacks and
hispanics," the District Court found that the defendants' "studied adherence to
discriminatory procedures must at this point be deemed conscious and
deliberate." 484 F.Supp. at 798. On this basis, the Court held that "affirmative
action is mandated as an interim measure either until such discrimination has
been totally eliminated or until defendants proceed to select police officers
under procedures that are in full compliance with Title VII." Id. at 799. It
enjoined the City from using Exam No. 8155, although permitting it to use the
eligibility list from that exam for purposes designated by the Court. In its order,
also issued on January 11, the Court ordered the Police Department to achieve
at least 30% minority composition of the force, a level comparable to the
percentage of minorities in the labor force of the relevant hiring area. To
achieve this goal, the Court further ordered that the defendants should "as an
interim goal appoint 50% of their entry level police officers from among
qualified black and hispanic applicants." Finally, the Court awarded the
plaintiffs attorneys' fees and costs, and retained jurisdiction "for such further
relief or other orders as may be necessary or appropriate to enforce and insure
rights to equal employment opportunity within the New York City Police
Department."

The City moved to stay the District Court's order pending consideration of its
appeal from the decision. This Court denied the motion. However, we granted a
conditional stay, in view of the City's declared need to hire new police officers6
and set an expedited schedule for the appeal.

III. The Framework of Title VII Analysis



As the District Court concluded, the accepted procedure for Title VII cases is to
require the plaintiffs to establish a prima facie case, and then to require the
defendants to rebut this showing with proof that the test was legitimately
job-related. See Albemarle Paper Co. v. Moody, 422 U.S. 405, 95 S.Ct. 2362, 45
L.Ed.2d 280 (1975); McDonnell Douglas Corp. v. Green, 411 U.S. 792, 93
S.Ct. 1817, 36 L.Ed.2d 668 (1973); Griggs v. Duke Power Co., 401 U.S. 424,
91 S.Ct. 849, 28 L.Ed.2d 158 (1971). The Court correctly concluded that a
prima facie case had been established. By any reasonable measure, including
the standard deviation rule of Castaneda, supra, or the four-fifths rule of the
EEOC Guidelines, Exam No. 8155 had a disparate racial impact.


The City argues that statistics alone, specifically a comparison of the racial
composition of the passing group to that of the applicant group, are not
sufficient to establish a prima facie case. But statistics showing a significantly
disparate racial impact have consistently been held to create a presumption of
Title VII discrimination. See International Brotherhood of Teamsters v. United
States, 431 U.S. 324, 339, 97 S.Ct. 1843, 1856, 52 L.Ed.2d 396 (1977); United
States v. City of Chicago, 549 F.2d 415, 428 (7th Cir.), cert. denied, 434 U.S.
875, 98 S.Ct. 225, 54 L.Ed.2d 155 (1977); Kirkland v. New York State
Department of Corrections, 520 F.2d 420, 425 (2d Cir. 1975), cert. denied, 429
U.S. 823, 97 S.Ct. 73, 50 L.Ed.2d 84 (1976); Vulcan Society v. Civil Service
Commission, 490 F.2d 387, 392-93 (2d Cir. 1973); Chance v. Board of
Examiners, 458 F.2d 1167, 1172-73 (2d Cir. 1972).7

The City also claims that finding a prima facie Title VII violation by state or
local governments without a showing of discriminatory intent violates the
Tenth Amendment. This view has been definitively rejected by the Seventh
Circuit in United States v. City of Chicago, 573 F.2d 416, 422-24 (7th Cir.
1978), and we agree with that analysis. Congress may enforce the Fourteenth
Amendment by legislation that prohibits practices the Amendment might not of
its own force condemn. See Katzenbach v. Morgan, 384 U.S. 641, 86 S.Ct.
1717, 16 L.Ed.2d 828 (1966).


The real issue in this case, therefore, is whether the defendants have rebutted
the plaintiffs' prima facie case by proving that their test was job-related: that the
test accurately selected applicants who would be better police officers.
Adjudication of this issue presents a more complex problem in the present case
than it has in many previous Title VII suits. Many of the previous suits involved
tests that were so artlessly constructed that they could be judged invalid without
extensive inquiry, fine distinctions, or a precise notion of where the line
between validity and invalidity was located. See, e. g., Griggs, supra, 401 U.S.
at 431, 91 S.Ct. at 853 (intelligence tests used "on the Company's judgment that
they generally would improve the overall quality of the work force"); United
States v. N. L. Industries, Inc., 479 F.2d 354, 371 (8th Cir. 1973) (test given to
one applicant "consisted of four or five mathematical problems which a
Company employee jotted down on a sheet of yellow paper"); Brito v. Zia Co.,
478 F.2d 1200, 1205-06 (10th Cir. 1973) (test based almost entirely on
subjective judgments of supervisors, not administered or scored under
controlled and standardized conditions); Vulcan Society, supra, 490 F.2d at
396-98 (no job analysis, test measured abilities that were clearly of secondary
importance to job).


Exam No. 8155, in contrast, is a "second generation" selection procedure.
Despite the various flaws in construction of the test, it is clear that some
attempt was made to develop the test with recognition of at least some of the
standards that courts had established in the first wave of Title VII cases. Aware
that the validity of the test would likely have to be demonstrated, the City
performed an extensive job analysis, consciously used Guideline concepts in
determining the qualities that were being tested for, and attempted to eliminate
extraneous variables, such as the applicant's prior knowledge, his reading level,
and his ability to complete the test in a relatively short amount of time.

Nevertheless the plaintiffs have alleged and the District Court has concluded
that the construction and use of Exam No. 8155 failed in several respects to
meet test validity standards, particularly those specified in the Guidelines.
Whether or not these deficiencies are fatal, they are plainly more substantial
than the defects deemed not to defeat validity in prior cases. Detroit Police
Officers Association v. Young, 446 F.Supp. 979, 990-91, 1007-08
(E.D.Mich.1978); Bridgeport Guardians v. Bridgeport Police Department, 431
F.Supp. 931 (D.Conn.1977); cf. Washington v. Davis, 426 U.S. 229, 248-52, 96
S.Ct. 2040, 2051-53, 48 L.Ed.2d 597 (1976) (Fourteenth Amendment case
involving some Title VII concepts). Consequently, assessment of Exam No.
8155 necessarily carries this Court into difficult areas of judging test validity.
We must determine, with some care, what the general standards are for judging
validity, and how these standards are to be applied in a specific factual
situation.


The study of employment testing, although it has necessarily been adopted by
the law as a result of Title VII and related statutes, is not primarily a legal
subject. It is part of the general field of educational and industrial psychology,
and possesses its own methodology, its own body of research, its own experts,
and its own terminology. The translation of a technical study such as this into a
set of legal principles requires a clear awareness of the limits of both testing
and law. It would be entirely inappropriate for the law to ignore what has been
learned about employment testing in assessing the validity of these tests. At the
same time, the science of testing is not as precise as physics or chemistry, nor
its conclusions as provable. While courts should draw upon the findings of
experts in the field of testing, they should not hesitate to subject these findings
to both the scrutiny of reason and the guidance of Congressional intent.


The need to modify rigid technical conclusions from the field of testing is
indicated by the view of certain testing experts, including those who testified
for the plaintiffs in this case, that there is no test that can be considered
completely valid to select candidates for any but the most rudimentary tasks. If
this view guided interpretation of Title VII, then at the current stage of the
technology of testing, no test that produces a disparate racial impact could be
used for positions such as police officers.


While this position is a conceivable one, it is supported neither by the statutory
language nor by judicial precedent. Had Congress felt that testing for virtually
all employment was invalid, it would not have made a specific exception to
Title VII for the proper use of professionally designed employment tests.8
Clearly, Congress did not intend that the standard used in interpreting Title VII
would reject every test with a disparate racial impact. Nor have the courts
permitted this to occur; although they have subjected employment testing to the
careful scrutiny required by the statute, they have found a variety of tests to be
valid, despite a disparate racial impact. See, e. g., Sims v. Sheet Metal Workers,
Local 65, 489 F.2d 1023, 1025-26 (6th Cir. 1973); Detroit Police Officers
Association v. Young, supra, 446 F.Supp. at 1007-08; Friend v. Leidinger, 446
F.Supp. 361 (E.D.Va.1977), aff'd, 588 F.2d 61 (4th Cir. 1978); United States v.
South Carolina, 445 F.Supp. 1094 (D.S.C.1977), aff'd, 434 U.S. 1026, 98 S.Ct.
756, 54 L.Ed.2d 777 (1978); Bridgeport Guardians v. Bridgeport Police
Department, supra, 431 F.Supp. at 936-39; Jackson v. Nassau County Civil
Service Commission, 424 F.Supp. 1162 (E.D.N.Y.1976); Buckner v. Goodyear
Tire & Rubber Co., 339 F.Supp. 1108, 1113-16 (N.D.Ala.1972), aff'd, 476 F.2d
1287 (5th Cir. 1973).

The danger of too rigid an application of technical testing principles is that tests
for all but the most mundane tasks would lack sufficient validity to permit their
use. At least that is the risk given the current state of the art of employment
testing. This risk can be appreciated by considering the one example even most
test critics acknowledge to have substantial, though not complete, validity. This
is a typing test given to a group of applicants for jobs as typists.9 Such a test
substantially meets all the criteria suggested by plaintiffs' experts for content
validation, but the very success of this test casts doubt on the usefulness of the
example. To begin with, typing is a task that readily yields to quantitative
measurement. The quality of a typist's job performance depends on two factors,
both of which can be captured with precision in numbers: how fast he types,
and how many errors he commits. Most jobs involve tasks whose performance
can be evaluated only in the more subjective light of judgment. Surely this is
true of nearly all the tasks required to be performed by police officers. In
addition, there is a more basic problem with the typing test example. Typing is
one of the few activities that a test-taker can perform in virtually the same
manner as he will be required to perform on the job. That is obviously an ideal
testing situation, but it is not one that is frequently available, and such
"on-the-job" testing could not possibly be done to select police officers. Yet the force of
the typing test example easily leads to one of the conclusions of the District
Court in this case: that Exam No. 8155 lacked validity because it measured
performance in an artificial classroom setting and did not necessarily indicate
who would perform well on the job. 10


Closely related to the question of the proper weight to be given to technical
conclusions of testing theory is the question of the proper weight to be given to
the EEOC Uniform Guidelines, which are largely based on these technical
conclusions. See Guidelines § 5(C). The District Court drew its methodology
from the Guidelines, concluding that the City's test was invalid because it failed
to satisfy all of the Guidelines. The Supreme Court has relied upon some of the
Guidelines in several of the leading cases, see Albemarle, supra, 422 U.S. at
431, 95 S.Ct. at 2378; Espinoza v. Farah Manufacturing Co., 414 U.S. 86, 94,
94 S.Ct. 334, 339, 38 L.Ed.2d 287 (1973); Griggs, supra, 401 U.S. at 433-34,
91 S.Ct. at 854-55, but the Court has not ruled that every deviation from any of
the Guidelines automatically results in a violation of Title VII. The Court
appears to have applied the Guidelines only to the extent that they are useful, in
the particular setting of the case under consideration, for advancing the basic
purposes of Title VII. See Espinoza, supra, 414 U.S. at 94, 94 S.Ct. at 339;
Guardians Association v. Civil Service Commission, 490 F.2d 400, 403 n.1 (2d
Cir. 1973), United States v. Georgia Power Co., 474 F.2d 906, 913 (5th Cir.
1973). To the extent that the Guidelines reflect expert, but non-judicial opinion,
they must be applied by courts with the same combination of deference and
wariness that characterizes the proper use of expert opinion in general. See
Albemarle, supra, 422 U.S. at 449, 95 S.Ct. at 2390 (Blackmun, J., concurring)
(Guidelines "have never been subjected to the test of adversary comment. Nor
are the theories on which the Guidelines are based beyond dispute.") Thus, the
Guidelines should always be considered, but they should not be regarded as
conclusive unless reason and statutory interpretation support their conclusions.
As this Court has previously stated: "If the EEOC's interpretations go beyond
congressional intent, the Guidelines must give way." Guardians Association,
supra, 490 F.2d at 403 n.1.


In addition to their force as the expression of expert opinion, the Guidelines
also possess legal force. But here too, it is necessary to keep their limits in
mind. The primary purpose of the Guidelines is to indicate the standards that
various Federal agencies, such as the EEOC, the Civil Service Commission,
and the Department of Justice are to use in enforcing Title VII and related
statutes. See Guidelines 2(A). But the fact that an agency or group of agencies
has announced the standards they will use does not convert those standards into
mandatory legal rules.

33

A second legal basis for following the Guidelines is that they represent the
"administrative interpretation of the Act by the enforcing agency," and are
"entitled to great deference" on that basis. Griggs, supra, 401 U.S. at 433-34, 91
S.Ct. at 854-55; see Albemarle, supra, 422 U.S. at 431, 95 S.Ct. at 2378.
However, the Court has also recognized that the Guidelines "are not
administrative 'regulations' promulgated pursuant to formal procedures
established by Congress." Ibid. They are entitled to deference, not obedience.
See Espinoza, supra, 414 U.S. at 94, 94 S.Ct. at 339 (1973) (Guideline rule on
discrimination against non-citizen "is no doubt entitled to great deference . . . ,
but that deference must have limits where, as here, application of the guideline
would be inconsistent with an obvious congressional intent"). Moreover, the
Court in Griggs was following the Guidelines only to make the straightforward
distinction between general intelligence tests and job-related tests; it is not at all
clear that Griggs requires observance of all the intricate details of the
Guidelines. It might be desirable for all employers to follow the more careful
practices required of the Federal Government, but there is no reason to think
that Congress intended to impose such practices, in their full rigor, when it
enacted Title VII.
34

With these considerations in mind, we turn to the validity of Exam No. 8155.

IV. The Validity of Exam No. 8155


A. Selecting the Validation Technique
35

The threshold task in determining the validity of a challenged examination is to
select the appropriate method for assessing its job-relatedness. The Guidelines
describe three techniques: content validation, construct validation, and
criterion-related validation. Guidelines 5(B), 14. The Guidelines specify
when each technique is appropriate and also specify the requirements for
successfully validating an exam by use of each technique. Defendants have
attempted to justify Exam No. 8155 by content validation, a technique
appropriate for tests that measure "knowledges, skills or abilities"
representative of the "content" of the job. Guidelines 14(C)(1). Plaintiffs
contend that construct validation must be used to assess this exam because, in
their view, the exam attempts to measure "constructs," that is, inferences about
mental processes or traits, such as "intelligence, aptitude, personality,
commonsense, judgment, leadership and spatial ability." Ibid.

36

This content-construct distinction has a significance beyond just selecting the


proper technique for validating the exam; it frequently determines who wins
the lawsuit. Content validation is generally feasible while construct validation
is frequently impossible. Even the Guidelines acknowledge that construct
validation requires "an extensive and arduous effort." Guidelines 14(D)(1).
The principal difficulty with construct validation is that it requires a technique
that includes a criterion-related study, Guidelines 14(D)(4), that is, a demonstration
from empirical data that the test successfully predicts job performance.11
Developing such data is difficult, and tests for which it is required have
frequently been declared invalid.12 As a result, a conclusion that construct
validation is required would often decide a case against a test-maker, once a
disparate racial impact has been demonstrated.
37

To determine whether defendants are entitled to use content validation, we
examine the Guidelines' criteria for that technique, but we do so bearing in
mind our cautionary approach to the Guidelines, previously expressed. The
Guidelines specify two basic conditions that must be met before content
validation may be used. First, it must appear that what the test attempts to
measure is knowledge or an ability, and not a general trait, such as intelligence.
Guidelines 14(C)(1). Second, the test must not measure knowledge or ability
that an employee will be expected to learn on the job. Ibid.; see also 5(F). The
District Court rejected content validation, concluding both that Exam No. 8155
measures constructs, not abilities, and that, even if what was tested for could be
considered abilities, they could be learned in the five-month training program.

38

In specifying how the selection of validation techniques is to be made, the
Guidelines adopt too rigid an approach, one that is inconsistent with Title VII's
endorsement of professionally developed tests. Taken literally, the Guidelines
would mean that any test for a job that included a training period is almost
inevitably doomed: if the attributes the test attempts to measure are too general,
they are likely to be regarded as constructs, in which event validation is usually
too difficult to be successful; if the attributes are fairly specific, they are likely
to be appropriate for content validation, but this too will prove unsuccessful
because the specific attributes will usually be learned in a training program or
on the job.

39

The origin of this dilemma is not any inherent defect in testing, but rather the
Guidelines' definition of "content." This definition makes too sharp a distinction
between "content" and "construct," while at the same time blurring the
distinction between the two components of "content": knowledge and ability.
The knowledge covered by the concept of "content" generally means factual
information. The abilities refer to a person's capacity to carry out a particular
function, once the necessary information is supplied. Unless the ability requires
virtually no thinking, the "ability" aspect of "content" is not closely related to
the "knowledge" aspect of "content"; instead it bears a closer relationship to a
"construct." Some researchers regard content tests as nothing more than
assessments of particular kinds of constructs, e. g., Tenopyr, Content-Construct
Confusion, 30 Personnel Psych. 47 (1977); others regard any ability that is
evidenced by observable behavior as sufficiently non-inferential to be
considered content, see Ebel, Comments on Some Problems of Employment
Testing, 30 Personnel Psych. 55 (1977). See generally Cattell, Validity and
Reliability: A Proposed More Basic Set of Concepts, 55 J.Ed.Psych. 1 (1964).
Whichever view is adopted, it would seem that abilities, at least those that
require any thinking, and constructs are simply different segments along a
continuum reflecting a person's capacity to perform various categories of tasks.
This continuum starts with precise capacities and extends to increasingly
abstract ones, from the capacity for filling out forms to the capacity for
exercising judgment.
40

Recognition that abilities and constructs are not entirely distinct leads to a
conclusion that a validation technique for purposes of determining Title VII
compliance can best be selected by a functional approach that focuses on the
nature of the job. The crucial question under Title VII is job relatedness:
whether or not the abilities being tested for are those that can be determined by
direct, verifiable observation to be required or desirable for the job. See Griggs,
supra, 401 U.S. at 431, 91 S.Ct. at 853; Vulcan Society, supra, 490 F.2d at
394-95; Chance, supra, 458 F.2d at 1177. If the job in question involves primarily
abilities that are somewhat abstract, content validation should not be rejected
simply because these abilities could be categorized as constructs. However, if
the test attempts to measure general qualities such as intelligence or
commonsense, which are no more relevant to the job in question than to any
other job, then insistence on the rigorous standards of construct validation is
needed. Since tests of this kind are often biased in favor of a person's familiarity
with the dominant culture, permitting them to be used without a showing of
predictive validity would perpetuate the effects of prior discrimination. But as
long as the abilities that the test attempts to measure are no more abstract than
necessary, that is, as long as they are the most observable abilities of
significance to the particular job in question, content validation should be
available. To lessen the risks of perpetuating cultural disadvantages, the degree
to which content validation must be demonstrated should increase as the
abilities tested for become more abstract.

41

This functional approach, which adjusts the distinction between content and
construct to the nature of the job being tested for, expands the opportunity for
both employers and courts to rely on content validation. It also avoids making a
threshold choice between content and construct validation based solely on the
nature of the quality tested for unrelated to the job, a choice that might make
content validation seem inappropriate. To base the content-construct
determination on the nature of the job, it is necessary first to analyze the job to
see if it requires abilities appropriate for content validation. Instead of choosing
between content and construct validation at the outset, as the Guidelines seem
to require, employers and courts can start the content validation inquiry and use
its results to determine both whether content validation is appropriate and
whether it has been achieved. Should the attempted content validation be found
inadequate, the reason may be that this method of validation was not
appropriate because of the pertinent job abilities revealed by the job analysis.
On the other hand, this approach will sometimes indicate that content
validation is appropriate, even though the abilities tested for could be
considered constructs.13
42

Just as lessening the severity of the Guidelines' distinction between content and
construct reduces the likelihood that a test is invalid because it measures
constructs, so sharpening the distinction between knowledge and ability, now
obscured by the Guidelines, reduces the problem that the test is invalid because
it duplicates the training period, i. e., tests for what will later be learned. Unlike
knowledge, some abilities are appropriate for testing confirmed by content
validation despite their overlap with post-selection training. A valid
measurement of some abilities can select applicants who will ultimately use
their training to perform their tasks more effectively or who will more
effectively perform similar tasks for which they have not been specifically
trained. On the other hand, content validation remains inappropriate for tests
that measure knowledge of factual information if that knowledge will be fully
acquired in a training program. Approval of such tests, without predictive
validation, risks favoring applicants with prior exposure to the information, a
course likely to discriminate against a disadvantaged minority. For example, it
would be duplicative of the Police Department's training program, and thus
invalid, to test applicants for their knowledge of the Department's arrest form.
Testing for their ability to fill out the form, however, can be expected to select
applicants who can be successfully trained to perform well at that task and
others like it.

43

Applying the approach just outlined, we conclude, at least as an initial matter,
that content validation may properly be selected as the appropriate technique
for assessing Exam No. 8155. The exam tests for three basic abilities (although
it purports to test for five): the ability to remember details, the ability to fill out
forms, and the ability to apply general principles to specific facts. This third
ability is assessed in three contexts: the application of general statements of
criminal offenses to the facts of specific events, the application of procedures
and standards to the facts of specific policing activities, and the application of
procedures and standards to the facts of specific situations involving human
relations problems. These three basic abilities are not so abstract, on their face,
as to preclude content validation, provided subsequent consideration of the job
analysis does not demonstrate that important and more concrete abilities
necessary for the job were needlessly omitted from those considered for
measurement. Though all three abilities involve some inference about mental
processes, they are based on observable behaviors and are far less abstract than
such traits as intelligence, leadership, or judgment. Moreover, testing for these
three abilities sufficiently avoids the objection that the test duplicates the
Department's training program. Though all three abilities can be trained to
some extent, the test-makers were entitled to select applicants with existing
ability so that training would both enhance their abilities and prepare them for
other tasks requiring similar talents. The vice of testing for knowledge readily
taught in the training program was totally avoided.
B. Assessing the Content Validity of Exam No. 8155
44
45

Since content validation appears to be an appropriate method for assessing
Exam No. 8155, we proceed to consider whether the use of this method
indicates that the exam has sufficient validity to select applicants for the job of
police officer. The Guidelines describe various aspects of content validation,
but do not neatly list ingredients of an adequate exam. From our study of the
Guidelines, we distill five attributes of an exam with sufficient content validity
to be used notwithstanding its disparate racial impact. The first two concern the
quality of the test's development: (1) the test-makers must have conducted a
suitable job analysis, and (2) they must have used reasonable competence in
constructing the test itself.14 The next three attributes are more in the nature of
standards that the test, as produced and used, must be shown to have met. The
basic requirement, really the essence of content validation, is (3) that the
content of the test must be related to the content of the job. In addition, (4) the
content of the test must be representative of the content of the job. Finally, the
test must be used with (5) a scoring system that usefully selects from among the
applicants those who can better perform the job. We consider each of these five
matters in turn.

The Job Analysis


46

According to the Guidelines, a job analysis involves an assessment "of the
important work behavior(s) required for successful performance and their
relative importance." 14(C)(2). The job analysis performed by the City, while
somewhat flawed as the District Court pointed out, is nonetheless adequate to
meet this standard. As far as the first part of the standard is concerned, the
work behaviors involved in being a police officer were identified by extensive
interviewing, and subjected to serious review (Job Analysis, Steps 1 and 2).
The District Court found that these work behaviors "were not delineated with
precision." 484 F.Supp. at 795. In fact, the descriptions of the 42 tasks that
ultimately appeared on the job analysis list vary considerably in the level of
precision. Some are complete and unambiguous, such as "1. Checks the
condition of personal and department equipment such as radio, patrol car,
weapons, etc."; "35. Attends training sessions." Others are more open-ended,
but do manage to fulfill their function by defining the behaviors associated with
the task, such as "3. Performs foot patrol"; "40. Controls various types of
crowds." Still others are so vague that they communicate very little real
information, such as "10. Interacts with juveniles in non-arrest situations"; "39.
Performs duties in hostage situations."
47

While greater precision might have been achieved, a complete description of
the observable tasks associated with being a police officer would be a reworded
version of the entire training manual. The Police Department's list of tasks,
despite some lapses in specificity, contains a sufficient amount of meaningful
information to satisfy the relevant requirement.

48

The second part of the Guideline standard for a job analysis requires
determination of the relative importance of the identified work behaviors. The
City performed this function by means of an extensively distributed
questionnaire, specifying the criteria to be used in ranking the 42 tasks (Job
Analysis, Step 3). The process as a whole appears to be reasonably accurate,
and neither the plaintiffs nor the District Court raised any serious objection to
it.

49

Having determined the work behaviors and established their relative
importance, the City then grouped the 42 tasks into five clusters and asked
panels of police officers to identify the knowledges, skills, or abilities necessary
to the effective performance of these tasks (Job Analysis, Steps 4 and 5). This
function was implemented in a much less satisfactory manner. Only one of the
panels identified the abilities; the other four used the list of five abilities that
the first panel had developed. This lessened the value of having five
independent panels make this complicated and subjective determination.
Moreover, no effort was made to explain the relationship between any of the
five abilities and the 42 job tasks from which they were ostensibly derived.15

50

The plaintiffs criticize the required abilities identified by the City for being
undefined. But the type of definition suggested by the Guidelines, one that
describes the abilities in terms of "observable behaviors and outcomes,"
Guidelines 15(C)(3), seems repetitive, since the work behaviors are already
defined in this way. The five identified abilities, with the possible exception of
"human relations, including communication techniques," are comprehensible
enough. Their appropriateness for measurement would have been considerably
clearer, however, if each panel had explained which tasks required which
abilities. While the Guidelines may be unnecessarily stringent in regarding the
identification of this relationship of ability to task as "essential," see ibid., such
identification does go far toward eliminating the ambiguities that are otherwise
inherent in generalized descriptions of abilities. Only if the relationship of
abilities to tasks is clearly set forth can there be confidence that the pertinent
abilities have been selected for measurement.
The Test Construction Process
51

With a job analysis of questionable sufficiency, the City then proceeded to the
test construction stage. As an initial matter, we note that Exam No. 8155 was
developed "in-house," by staff members of New York City's Police Department
and Department of Personnel; there was little input from any outside source,
and no participation by anyone specializing in test preparation. Of course, the
law should not be designed to subsidize specialists. But employment testing is a
task of sufficient difficulty to suggest that an employer dispenses with expert
assistance at his peril. Certainly, the decision to forgo such assistance should
require a Court to give the resulting test careful scrutiny. See Kirkland, supra,
520 F.2d at 425-26; Vulcan Society, supra, 490 F.2d at 395-96.

52

While the determination of how many questions should be included for each
identified ability was made by a fairly careful numerical analysis (Test
Construction, Step 1), the process of writing the questions themselves was
rather haphazard. The questions were initially framed by police officers, who
may have had expertise in identifying tasks involved in their job but were
amateurs in the art of test construction. In addition, the officers did not have
access to the job analysis material during much of the process. Finally, the
questions, although they were reviewed, were not tested on a sample
population. To be sure, a complete determination of the questions' accuracy in
measuring the identified abilities would be equivalent in its complexity to a
criterion-related study. But the City did not even perform the minimal sample
testing to ensure that the questions were comprehensible and unambiguous.

53

Not surprisingly, the test construction process did not fully succeed in meeting
even its own goal of testing for all the identified abilities. As previously
indicated, Exam No. 8155 does appear to test for the three identified abilities of
remembering details, filling out forms, and applying general principles to
specific facts. However, the fourth identified ability, human relations skill,
proved more troublesome. In deciding how to test for this ability, the City faced
a dilemma inherent in testing for all but the most mundane jobs. To be fully
representative of the job, a test should measure all the significant abilities
needed for successful job performance, yet some abilities, especially in jobs of
any complexity, are far along the construct end of the content-construct
continuum where successful validation is difficult. If a test tries to be
representative and measure all significant abilities, including those that are
clearly constructs, it risks the use of inadequate assessment devices, because the
rigorous standard for construct validation will rarely be met. On the other hand,
if the test-makers acknowledge the difficulty of satisfactorily measuring
constructs and test only for those abilities that are appropriate for content
validation, they encounter the objection that the test is not sufficiently
representative of the job.
54

Recognizing the difficulty of construct validation, yet reluctant to omit
assessment of an important characteristic of successful job performance, the
City attempted to resolve the dilemma by treating human relations skill as an
ability suitable for content validation and devoting 30 questions, nearly
one-third of the exam, to an effort to assess this ability. Mindful of an important
requirement of content validity, the City carefully avoided rewarding a
test-taker's prior knowledge and, instead, supplied in the test itself all the
information necessary to select the correct answers to the human relations
questions. Included before each group of questions was a set of appropriate
standards (essentially "do's" and "don'ts") for handling a particular type of
human relations matter. But supplying this guidance rendered the 30 questions
primarily a further assessment of a candidate's ability to apply written standards
to specific fact situations, and only slightly a measure of his talent for human
relations. Anyone with minimal analytic ability needed to apply the standards to
the various fact situations could select the one correct answer, even if his
intuitive reaction to a human relations problem might be woefully inadequate.

55

Assessing human relations skill will always be a difficult enterprise, but the
deficiency of the City's attempt does not mean that a content validation
approach is necessarily impermissible or impossible to achieve. As indicated
above, at least within the middle range of the content-construct continuum, the
distinction between content and construct should be determined functionally, in
relation to the job. If the quality measured is not unduly abstract, and if it
constitutes a significant aspect of the job, content validation of the test
component used to measure that quality should be permitted. But that
component must be designed in an extremely careful way. Test-makers will be
well advised to obtain highly qualified assistance in constructing this portion of
an exam.

56

One desirable approach would be to confront applicants with simulated real life
situations and assess the appropriateness of their volunteered response. See
Firefighters Institute for Racial Equality v. City of St. Louis, 616 F.2d 350 (8th
Cir. 1980). That technique is normally too costly for large numbers of
applicants, but might have usefulness as a testing device to be used toward the
end of the overall selection procedure, after an initially large group of
applicants has been narrowed down by the results of a written exam and a
background check. If the test component is limited to traditional pencil and
paper methods, it may be preferable to forgo any pretense of being able to
make fine differentiation among candidates' human relations skills and instead
adopt a pass/fail approach, rejecting those whose demonstrably inappropriate
responses to human relations questions mark them as unsuitable for police
work. Another possibility is to recognize that questions in this area for which
only one answer is correct are likely to be too easy, as were most of the
questions on Exam No. 8155, and therefore of little use in making selections
from among applicants. Instead, questions can be designed for which some
answers are appropriate responses and others are inappropriate. As feasible
techniques in this area evolve, employers will be expected to use them.
57

With these strengths and weaknesses of the job analysis and the test
construction in mind, we now consider how well the test, as constructed and
used, met the basic requirements of content validity.

The Direct Relationship Requirement


58

The central requirement of Title VII, relationship of test content to job content,
was sufficiently satisfied by Exam No. 8155. The job analysis procedure
provides adequate assurance that the identified tasks are in fact the tasks that a
police officer performs. While the procedure for identifying the abilities
required for those tasks was less satisfactory, the three abilities that were
actually tested for appear adequately related to most of the identified tasks. The
list of tasks confirms one's intuitive assumption that police officers are required
to fill out forms (see, e. g., "16. Processes arrests using appropriate police
department forms and notifications"), to remember facts (see, e. g., "18. Gives
testimony in court (oral and written)"), and to apply general principles to
specific fact situations (see, e. g., "26. Executes warrants.").

59

Moreover, these abilities are among the most concrete ones that can be derived
from the list; they are certainly more concrete than human relations skills,
which the test purported to measure, but did not. Two of the abilities tested for,
filling out forms and remembering facts, are as specifically stated as they could
be without resort to trivial distinctions about particular kinds of forms and facts.
The ability to apply general standards is somewhat more problematical, since it
is a relatively abstract skill that is relevant to many jobs. However, if there is
any job for which ability in applying and following rules is an especially
important requirement, it is the job of a law enforcement officer.16


The Representativeness Requirement
60

The second requirement established by the Guidelines is that the test must be a
"representative sample of the content of the job." As presented by the
Guidelines, this representativeness requirement has two different meanings. The
first is that the content of the test must be representative of the content of the
job; the second is that the procedure, or methodology, of the test must be
similar to the procedures required by the job itself. The Guidelines express this
dual requirement in the following somewhat inscrutable language: "For any
selection procedure measuring a knowledge, skill, or ability the user should
show that (a) the selection procedure measures and is a representative sample
of that knowledge, skill, or ability. . . ." Guidelines 14(C)(4) (emphasis
added).

61

Both aspects of the representativeness requirement, if interpreted rigorously,
would once again foreclose any possibility of constructing a valid test. The
United States, as amicus, argues that the requirement that the content of the
exam be representative means that all the knowledges, skills, or abilities
required for the job be tested for, each in its proper proportion. This is not even
theoretically possible, since some of the required capacities cannot be tested for
in any valid manner. Even if they could be, the task of identifying every
capacity and determining its appropriate proportion is a practical impossibility.

62

It is similarly impossible for the procedures of the test to be truly representative
of the actual job procedures. Tests, by their nature, are a controlled, simplified
version of the job activities, not the activities themselves. As a practical matter,
virtually any realistic test, except one that directly measures a physical skill,
like lifting 50-pound sacks, is likely to be a pencil and paper activity, quite
different from the job it tests for. An elaborate effort to simulate the actual
work setting would be beyond the resources of most employers, and perhaps
beyond the capacities of even the most professional test-makers.

63

More reasonable interpretations of the representativeness requirement are
appropriate in light of Title VII's basic purposes. The reason for a requirement
that the content of the exam be representative is to prevent either the use of
some minor aspect of the job as the basis for the selection procedure or the
needless elimination of some significant part of the job's requirements from the
selection process entirely; this adds a quantitative element to the qualitative
requirement, namely that the content of the test be related to the content of the job.
Thus, it is reasonable to insist that the test measure important aspects of the job,
at least those for which appropriate measurement is feasible, but not that it
measure all aspects, regardless of significance, in their exact proportions. The
reason for a requirement that the test's procedure be representative is to prevent
distorting effects that go beyond the inherent distortions present in any
measuring instrument. For example, although all pencil and paper tests are
dependent on reading, even if many aspects of the job are not, the reading level
of the test should not be pointlessly high. Similarly, the instructions should not
be overly complex, and the exam should not place candidates under excessive
time pressure unless such time pressure is an identifiable aspect of the job.
64

Exam No. 8155 meets these representativeness requirements to an adequate
degree. While it did not test for all the skills involved in being a police officer
nor adequately test for the human relations skill that the job analysis identified
as important, the ones it did measure (memory, the ability to fill out forms, and
the ability to apply rules to factual situations) are all significant aspects of
entry-level police work. To be sure, this conclusion would have been easier to reach
if the City had spelled out the relationship between the abilities that were tested
for and the job behaviors that had been identified. But the relationship is
sufficiently apparent to indicate that the City was not seizing on minor aspects
of the police officer's job as the basis for selection of candidates. The
inadequate assessment of human relations skill lessens the representativeness of
the exam and consequently lessens its degree of content validity, but this
deficiency is not fatal, especially in light of the difficulty of assessing such an
abstract ability. Though human relations skill was deemed so important as to
warrant 30 of the exam's 100 questions, the City could just as plausibly have
concluded that equally important for the job of policing are such other abstract
qualities as common sense, leadership potential, sound judgment, or ability to
resist provocation. When a police exam inadequately tests for any of these
abstract abilities, it simply recognizes the limits of the art of testing. Indeed, the
more a test concerns itself with relatively concrete abilities identified as
necessary for successful job performance, the more likely it is to achieve a
sound basis for assessment of applicants.

65

Similarly, the procedure that the test employed was not needlessly
unrepresentative of the job itself. In electing to use a pencil and paper test, the
City did not forgo any readily available and realistically feasible alternative
procedure that would have been more representative of the job. Moreover, the
risks of using a written test were substantially minimized. The reading level
necessary to understand the questions was in some cases equal to, but generally
well below, the training materials used in the Police Academy. The instructions
were clear enough, and employed an ordinary four-answer, multiple-choice
format, perhaps the most familiar standardized test technique. In addition,
ample time was allowed for taking the exam, thereby avoiding an unnecessarily
pressured situation.
66

Thus the exam is adequately related to the content of the police officer's job,
and adequately representative. The combined effect of this assessment might
support a conclusion that the exam as a whole has content validity, though it
would be a close question whether a test with the disparate racial impact of this
one can be validated when its development departs in some significant respects
even from reasonably attainable requirements of the Guidelines. However, even
if the construction of the exam passes muster, the way in which it was used to
distinguish among candidates seriously departs from the third requirement for
content validity and defeats any claim of validity for a testing process that
produces disparate racial results.

The Scoring Requirement


67

Essentially, the City used the results of the exam to compile a rank-ordering of
all the applicants, and then selected a passing score sufficient to generate the
required number of potential trainees. Neither the rank-ordering nor the passing
score conforms to even the most minimal standards for these two devices.

68

Rank-Ordering. The Guidelines provide that rank-ordering should be used only
if it can be shown that "a higher score . . . is likely to result in better job
performance." Guidelines § 14(C)(9). This requirement is reasonable and
consistent with Title VII's provision that the "results" of a test may not be "used
to discriminate." 42 U.S.C. § 2000e-2(h). If test scores do not vary directly with
job performance, ranking the candidates on the basis of their scores will not
select better employees. It is possible to read the Guidelines' standard for
rank-ordering as if the required relationship between better scores and better job
performance had to be demonstrated by a criterion-related study. However, the
EEOC's interpretation of the Guidelines disclaims such a high standard. The
relationship between higher scores and better job performance may permissibly
rest on an inference, but where, as here, the test scores reveal a disparate racial
impact, and that disparity is greater at high passing scores than at low passing
scores, the appropriateness of inferring that higher scores closely correlate with
better job performance must be closely scrutinized.

69

This close scrutiny is required because rank-ordering makes such a refined use
of the test's basic power to distinguish between those who are qualified to
perform the job and those who are not. If a test is content valid, it may be
reasonable to infer that the test scores make some useful gross distinctions
between candidates. Candidates with high scores may well be expected to
perform the job better than candidates with low scores. See Science Research
Associates, Validation: Procedures and Results (1972) (use of criterion "tails"
identifying best and worst candidates more justifiable than continuous rating).
And it may even be that within some range of scores, some incremental
improvements in scores show some positive correlation with improvements in
job performance. But neither of these propositions provides confidence for
inferring that one-point increments among those who took Exam No. 8155 are
a valid basis for making job-related hiring decisions, especially in the range of
scores between 94 and 100. The reason such a precise inference cannot be so
readily drawn is that content validity is not an all or nothing matter; it comes in
degrees. A test may have enough validity for making gross distinctions
between those qualified and unqualified for a job, yet may be totally inadequate
to yield passing grades that show positive correlation with job performance.
70

Overlooking this point, the City earnestly contends that if the appropriate
abilities were tested for, it makes eminent sense to select candidates strictly on
the basis of ranked scores, even to the extent of concluding that a candidate
scoring 98 will perform better as a police officer than a candidate scoring 97.
The frequency with which such one-point differentials are used for important
decisions in our society, both in academic assessment and civil service
employment, should not obscure their equally frequent lack of demonstrated
significance. Rank-ordering satisfies a felt need for objectivity, but it does not
necessarily select better job performers. In some circumstances the virtues of
objectivity may justify the inherent artificiality of the substantively deficient
distinctions being made. But when test scores have a disparate racial impact, an
employer violates Title VII if he uses them in ways that lack significant
relationship to job performance.

71

Permissible use of rank-ordering requires a demonstration of such substantial
test validity that it is reasonable to expect one- or two-point differences in
scores to reflect differences in job performance. Our prior conclusion that the
test itself may have had enough validity to be used does not, therefore, lead to
approval of using its results for rank-ordered selections. On the contrary, the
defects we noted in the job analysis and the test construction are substantial
enough to preclude an inference that passing scores will correlate with job
performance closely enough to justify rank-ordered selections. While we do not
criticize the City's efforts as extensively as did the District Court, we agree that
the identification of pertinent abilities, the demonstration of their relationship to
the job tasks, and the process of developing the questions, were flawed.

72

These shortcomings take on added significance when it is recognized that the
test just barely satisfied even our lenient construction of the Guidelines
requirement of procedural representativeness. As the EEOC has advised, it is
"easier" to make the inference of a relationship between higher scores and better
job performance "(t)he more closely and completely the selection procedure
approximates the important work behaviors." EEOC Questions and Answers,
supra, Q. 62. Unlike the District Court, we are not willing to reject any use of a
police exam simply because the pencil and paper procedure of the test is not a
close approximation of the job. Nor are we willing to preclude rank-ordering
because a pencil and paper procedure was used. Given the current state of the
art in employment testing, we think it would be unrealistic to condemn pencil
and paper tests. Alternative procedures have not been shown to be readily
available within the limitations of time and resources confronting most
employers. Nevertheless, we cannot ignore the Guidelines' criticism of
assessing ability to perform complex tasks by a test procedure so different from
the work setting. When the selection procedure does not closely approximate
the important job tasks, it becomes especially important to insist upon a strong
showing that other aspects of content validity have been demonstrated. And that
demonstration must be very substantial when a test procedure that does not
closely approximate the job is sought to reflect the fine gradations required for
rank-ordering. In short, while we might not agree with the District Judge that
the defects in the test preclude a finding of sufficient content validity to permit
its use, we agree that content validity has not been shown to the extent
necessary for rank-ordering.
73

In addition to inadequate demonstration of validity, the test may not be used for
rank-ordered selections because of the total absence of any evidence that the
exam possessed another vital feature-reliability, that is, the extent to which the
exam would produce consistent results if applicants repeatedly took it or
similar tests. Of course, there is no expectation that applicants will take any
given test more than once. But if an exam lacks reliability to such an extent that
results would be significantly inconsistent if the same applicants were to take it
again, that is an important indication that the test is not especially useful in
measuring their abilities. Although not explicitly mentioned in the Guidelines,
reliability is prominently identified in the APA Standards (to which the
Guidelines refer in § 5(C)) to be as basic for evaluating an exam as validity
itself. See APA Standards, supra at 48-55. Like content validity, reliability is
not an all or nothing matter. It too comes in degrees. What is required is not
perfect reliability, but rather a sufficient degree of reliability to justify the use
being made of the test results. Without some substantial demonstration of
reliability it is wholly unwarranted to make hiring decisions, with a disparate
racial impact, for thousands of applicants that turn on one-point distinctions
among their passing grades.

74

Two aspects of reliability deserve consideration in assessing the use of
rank-ordering. The first is the quality of the exam questions. The more skillfully they
have been formulated, the more likely it is that results on one question will
correlate with results on the other questions and that successive test scores
would be consistent. This will avoid the tendency of scores to vary because of
extraneous factors such as test administration. Whether this aspect of reliability
has been achieved to an extent sufficient to justify rank-ordering need not be
left to general consideration of the quality of the test construction process. A
basic demonstration of this aspect of reliability can easily be made by the
test-maker, before the test is administered to job applicants. The test-maker can
pre-test his exam by giving it twice to a sample of persons generally approximating
the characteristics of the population where the test is expected to be used for
employment selection. To avoid distortion due to recollection, the test given at
a later date to such a sample can use similar but not identical questions. Another
somewhat useful indicator of reliability is a technique known as a split-half
correlation-dividing each component of the test into equal halves and observing
how consistent were an individual's scores on each half.17 This technique can
also be used in the process of pre-testing the exam, before it is administered to
job applicants. The technique also is easily used on actual test results to provide
some minimal evidence of reliability. In this case the City offered no evidence
to demonstrate the quality of the questions used in Exam No. 8155.
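
The split-half technique described above can be illustrated with a minimal sketch. It assumes an odd/even split of the items as the way of forming "equal halves" and uses hypothetical answer sheets; neither detail comes from the record, and the opinion does not say how the City or its experts would have performed the calculation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def split_half_correlation(responses):
    """Correlate each person's score on two halves of the same test.

    responses: one list per test-taker, 1 = correct, 0 = incorrect.
    The halves are the odd- and even-numbered items (an assumed split).
    """
    first_half = [sum(r[0::2]) for r in responses]
    second_half = [sum(r[1::2]) for r in responses]
    return pearson(first_half, second_half)

# Hypothetical answer sheets: five test-takers, one eight-item component.
sample = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
r = split_half_correlation(sample)
print(f"split-half correlation: {r:.2f}")
```

A correlation near 1 would suggest that the two halves rank the same people in much the same way; a low value would suggest that scores depend heavily on which questions happened to be asked.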

75

The second aspect of reliability concerns what testing experts call the error of
measurement. See generally, H. Gulliksen, Theory of Mental Tests (1950).
This is a statistical phenomenon indicating the degree to which scores on
successive tests will be subject to inevitable random variation, no matter how
carefully the test-makers have eliminated or at least lessened the effects of
extraneous factors within their control. The error of measurement can be
calculated by use of the standard deviation concept. For any test, regardless of
how carefully it was prepared, statistical analysis, based on the normal
distribution curve, shows that there is 68% probability that successive scores
would fall within a range of one standard deviation from an actual score and a
95% probability (generally a satisfactory confidence level) that successive
scores would fall within a range of two standard deviations from the actual
score. It is also possible to estimate, again for any test, how many raw score
points above and below the applicant's actual score are within the range of one
or more standard deviations. This calculation, as explained in the margin,18
depends upon the applicant's score and the number of items on the test. Thus,
though the test-maker can never eliminate the error of measurement, he can
minimize its effect for all scores by increasing the number of questions.

76

NOTE: OPINION CONTAINS TABLE OR OTHER DATA THAT IS NOT VIEWABLE
77

The inevitable error of measurement for a test consisting of 100 items, like
Exam No. 8155, has significance in assessing the use of rank-ordering. At the
passing score of 94, one standard deviation is equivalent to a range between 2.4
points above and below 94. The range narrows as actual scores approach 100.
At 97, for example, the range is plus or minus 1.7. Thus, to have 95%
confidence that an applicant's grade has statistical reliability, grades within two
standard deviations of his grade should theoretically be treated as equivalent to
his grade, for in fact there is a 95% likelihood that each applicant at each grade
would score within such a range on successive takings of equivalent tests. This
means that the range in which a satisfactory confidence level is achieved for an
applicant who scores 94 lies between 89 and 99, and even for one who scored
97, the range extends from 94 to 100. Care must be taken not to over-emphasize
the significance of the error of measurement. Though grounded on sound
principles of statistics, it remains an estimate, and it need not prevent the usual
use of test scores that do not have a disparate racial impact. At a minimum,
however, it should serve to illustrate the risks of making hiring decisions turn
on one-point increments at scores where even a single standard deviation
covers a raw score range greater than one point.
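
The figures in this paragraph can be reproduced with a short sketch. The margin note containing the Court's formula is not reproduced in this section, so the code assumes a binomial error model, in which the error of measurement at a raw score x on an n-item test is sqrt(x(n - x) / (n - 1)). That formula is an assumption, chosen because it matches the 2.4 and 1.7 point figures quoted above; the function names are illustrative only.

```python
from math import sqrt

def error_of_measurement(score, n_items):
    """Estimated standard error in raw points (assumed binomial error model)."""
    return sqrt(score * (n_items - score) / (n_items - 1))

def confidence_band(score, n_items, n_sd=2):
    """Scores within n_sd standard errors of `score`, capped to 0..n_items."""
    sem = error_of_measurement(score, n_items)
    low = max(0.0, score - n_sd * sem)
    high = min(float(n_items), score + n_sd * sem)
    return low, high

for score in (94, 97):
    sem = error_of_measurement(score, 100)
    low, high = confidence_band(score, 100)
    print(f"score {score}: one SD = {sem:.1f} points, "
          f"two-SD band = {low:.1f} to {high:.1f}")
```

Rounded to whole points, the two-standard-deviation bands under this assumption run from 89 to 99 for a score of 94 and from 94 to 100 for a score of 97, consistent with the ranges stated in the text.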

78

The most serious implication of error of measurement for Exam No. 8155 arises
from the extraordinary extent to which high test scores were closely bunched.19
Each score from 94 to 97 was achieved by over 2,000 applicants.20 If the test
questions had sufficient differentiating power to produce a somewhat even
distribution of scores, or at least to avoid excessive bunching among the high
scores, the error of measurement would not have affected the ultimate selection
of such a significant portion of the applicants. But when 8,928 applicants,
two-thirds of all who passed, are bunched between 94 and 97, the error of
measurement makes the use of rank-ordering an extremely unreliable basis for
hiring decisions.

79

If test scores produce disparate racial results, an employer who wants to use
rank-ordering of the scores for hiring decisions faces a substantial task in
demonstrating that rank-ordering is sufficiently justified to be used. But the task
is by no means impossible. Even without resorting to a criterion-related study,
the test-maker still has several ways to increase the justification for
rank-ordering sufficiently to use it. First, he can conduct a job analysis and construct
the test with a high degree of adherence to Guideline requirements. That would
produce a much stronger showing of content validation than the City was able
to demonstrate in this case. Even content validity sufficient for rank-ordering
does not require literal compliance with every aspect of the Guidelines. But
there must be a substantial demonstration of job relatedness and
representativeness to show a sound basis for making rank-ordering hiring
decisions. Second, the test-maker can achieve an adequate degree of reliability
by careful design of the exam so that the questions will yield a satisfactory
degree of consistent results. To guard against inconsistency based on
extraneous factors, the test-maker can pre-test the exam by successive
applications to an appropriate sample or at least analyze the results of split-half
correlations. Inconsistencies revealed by these techniques can be lessened by
redesign of needlessly unreliable questions or components of the exam. To
reduce inconsistency based on random variation, the number of questions can
be increased. Of course, the size of an exam must observe realistic limits of
cost and time of administration, but in the case of Exam No. 8155, using 200
instead of 100 questions would have significantly increased reliability. Because
some error of measurement is inevitable, even an increase in the number of
questions will not eliminate all random variation. However, the effect of such
random variation can be reduced by using questions that are shown to have
significant differentiating power, so that scores are not bunched at the high end
of the scale.21
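
The effect of lengthening the exam can be sketched numerically. The code below is an illustration under an assumed binomial error model (error of measurement = sqrt(x(n - x) / (n - 1)) raw points at raw score x on n items), expressed as a percentage of the maximum score so that tests of different lengths can be compared. The formula and the function name are assumptions, not findings of the Court.

```python
from math import sqrt

def error_as_percent_of_test(fraction_correct, n_items):
    """Error of measurement as a percentage of the maximum possible score.

    Assumes a binomial error model; fraction_correct is the proportion of
    items answered correctly (for example 0.94 for 94 out of 100).
    """
    raw_score = fraction_correct * n_items
    sem_raw = sqrt(raw_score * (n_items - raw_score) / (n_items - 1))
    return 100.0 * sem_raw / n_items

for n_items in (100, 200):
    pct = error_as_percent_of_test(0.94, n_items)
    print(f"{n_items} questions: about {pct:.1f} percentage points")
```

Under this assumption, doubling the number of questions narrows the band of random variation around a 94 percent score but does not remove it, which is the point the paragraph makes about the limits of adding questions.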
80

Alternatively, the employer can acknowledge his inability to justify
rank-ordering and resort to random selection from within either the entire group that
achieves a properly determined passing score, or some segment of the passing
group shown to be appropriate. The City itself, perhaps unwittingly, has
acknowledged the reasonableness of this second alternative. Since each of the
scores between 94 and 97 was achieved by more than 2,000 candidates, and
since each training class can accommodate slightly more than 400 candidates,
the test scores provide no basis for selecting from among candidates at each of
these scoring levels. At oral argument, the City acknowledged that random
selection would be used; for example, if all candidates scoring 98 or above have
been selected, and 400 academy trainees are needed from the 2,000 candidates
scoring 97, a random drawing from among all 2,000 would be used.22 Thus,
even the City recognizes that when the test scores afford no job-related basis
for making selections from within a group that passed the test, random
selection is appropriate.
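
The random drawing described at oral argument can be sketched in a few lines. The candidate identifiers, the function name, and the seed are hypothetical; the opinion does not describe the mechanics the City would actually use.

```python
import random

def random_draw(candidate_ids, n_needed, seed=None):
    """Select n_needed candidates, each with equal chance, without repeats."""
    rng = random.Random(seed)  # a seed makes the draw reproducible for audit
    return rng.sample(candidate_ids, n_needed)

# Hypothetical identifiers for the 2,000 candidates who all scored 97.
tied_at_97 = [f"candidate-{i:04d}" for i in range(2000)]

selected = random_draw(tied_at_97, 400, seed=1980)
print(len(selected), "selected from", len(tied_at_97))
```

Every candidate at the tied score has the same one-in-five chance of selection, so the draw adds no further distinction based on test score.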

81

We do not conclude that Title VII requires random selection from among those
who pass a content valid test. In some instances rank-ordering may be shown to
be justified. But where it is not, random selection from within a group validly
determined to have passed a content valid exam is simply an available option.
See Association Against Discrimination in Employment v. City of Bridgeport,
594 F.2d 306, 313 n.19 (2d Cir. 1979). The City may prefer not to use it.
However that may be, the City cannot use rank-ordering not shown to be
job-related when test scores produce a disparate racial impact. Nor can the City
justify the use of rank-ordering by reliance on what it contends are
requirements of state law. See N.Y.Const. art. 5, § 6; N.Y.Civil Service Law
§ 61(1). Title VII explicitly relieves employers from any duty to observe a state
hiring provision "which purports to require or permit" any discriminatory
employment practice. 42 U.S.C. § 2000e-7 (1976).
82

If rank-ordering were the only unjustified use of test scores, it would be
possible to limit a Title VII remedy to the elimination of this device. In other
words, if an exam has adequate content validity and a passing score has been
adequately determined, the employer could still limit selections to those within
the group that passed, provided only that he abandons rank-ordered choices.
But in this case, the impermissible use of the test scores extends beyond
rank-ordering to the setting of the cutoff score.

83

Cutoff Score. The Guidelines state that a cutoff score "should normally be set
so as to be reasonable and consistent with normal expectations of acceptable
proficiency within the work force." Guidelines § 5(H). This also makes sense.
No matter how valid the exam, it is the cutoff score that ultimately determines
whether a person passes or fails. A cutoff score unrelated to job performance
may well lead to the rejection of applicants who were fully capable of
performing the job. When a cutoff score unrelated to job performance produces
disparate racial results, Title VII is violated. See Association Against
Discrimination, supra, 594 F.2d at 312-13; Bridgeport Guardians, Inc. v.
Bridgeport Civil Service Commission, 482 F.2d 1333, 1338 (2d Cir. 1973).
Consequently, there should generally be some independent basis for choosing
the cutoff. As with rank-ordering, a criterion-related study is not necessarily
required; the employer might establish a valid cutoff score by using a
professional estimate of the requisite ability levels, or, at the very least, by
analyzing the test results to locate a logical "break-point" in the distribution of
scores. The City offered no such basis in this case. It merely chose as many
candidates as it needed, and then set the cutoff score so that the remaining
candidates would fail.

84

If it had been shown that the exam measures ability with sufficient
differentiating power to justify rank-ordering, it would have been valid to set
the cutoff score at the point where rank-ordering filled the City's needs. The
justification would be that each incremental change in score represents an
incremental change in job-related ability, so that, for any given cutoff (even one
determined solely by hiring needs), those who passed would likely perform the
job better than those who failed. But the City can make no such claim, since it
never established a valid basis for rank-ordering.

85

Indeed, the problems of both validity and reliability, which prevent the justified
use of rank-ordering, also cast serious doubt on the justification for the cutoff
score of 94. Of all these problems, the unreliability attributable to the error of
measurement has special significance for the cutoff score. As previously noted,
the error of measurement had an especially extensive impact on the applicants
because of the bunching of scores at the high end. The bunching occurred not
only at passing scores from 94 to 97, but also at failing scores of 92 and 93,
each of which was also achieved by more than 2,000 applicants. Scores within
a range of two points above and below the passing grade of 94 were achieved
by 10,731 applicants, 29% of the total. Had the scores been evenly distributed,
1,800 applicants, only 5%, would have fallen within this range. Selecting a
cutoff score in the middle of the range in which the test scores were closely
bunched meant that the inevitable error of measurement led to a much higher
number of mistaken passes and failures than would otherwise have occurred.23
Perhaps an even distribution of scores cannot be readily achieved, but the
impact of the error of measurement could have been held to acceptable limits if
a cutoff score had been selected within some range where scores were not
closely bunched. This does not mean that every person who fails a test by a
single point necessarily has a claim for legal redress. A cutoff score, properly
selected, is not impermissible simply because there will always be some error
of measurement associated with it. But when an exam produces disparate racial
results, a cutoff score requires adequate justification and cannot be used at a
point where its unreliability has such an extensive impact as occurred in this
case.
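
The bunching arithmetic in this paragraph can be checked with a short sketch. The 36,797 figure is the number of test-takers stated earlier in the opinion; the assumption that an "even" distribution spreads applicants over the 101 possible scores from 0 to 100 is an illustrative choice, not something the opinion specifies.

```python
TOTAL_APPLICANTS = 36_797      # test-takers, as stated earlier in the opinion
NEAR_CUTOFF = 10_731           # applicants scoring 92 through 96
CUTOFF = 94
POSSIBLE_SCORES = 101          # assumed: whole-number scores 0 through 100

# Scores within two points above and below the cutoff: 92, 93, 94, 95, 96.
band = range(CUTOFF - 2, CUTOFF + 3)

actual_share = NEAR_CUTOFF / TOTAL_APPLICANTS
even_share = len(band) / POSSIBLE_SCORES
even_count = even_share * TOTAL_APPLICANTS

print(f"actual share near cutoff: {actual_share:.0%}")
print(f"share under an even distribution: {even_share:.0%}")
print(f"applicants under an even distribution: about {even_count:,.0f}")
```

Under these assumptions the even-distribution count comes out close to the opinion's round figure of 1,800, and the actual share is roughly six times larger.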

86

Primarily on the basis of Exam No. 8155's improper use of rank-ordering, and
of the cutoff score, we affirm the conclusion of the District Court that the exam
as used was invalid. Since we agree with the District Court that the exam had a
significant disparate racial impact, we hold that the City's use of the exam
violated Title VII.

V. Relief
87

The fashioning of relief in employment discrimination cases is always a
sensitive matter, especially in cases like this one where the District Court
endeavors to order some form of affirmative action, including the use of a
quota. Our task of determining whether the District Court's remedy conforms to
prevailing standards for Title VII relief has been made somewhat more difficult
than usual because the precise effect of the Court's order is not clear and, in
some respects, the order is not adequately supported by necessary findings or
sufficient evidence.

88

The District Court's order is set out in full in the margin.24 It deals with several
topics including the use of Exam No. 8155, the development and approval of a
new selection procedure, hiring in the interim until a new selection procedure
receives court approval, and long-term hiring. The City objects most
strenuously to the provisions concerning interim and long-term hiring, since
these provisions involve the use of a quota.

89

In considering the District Court's order, we find it useful to distinguish
between those aspects of the order that are designed to assure compliance with
Title VII and those aspects that provide affirmative relief as a remedy for past
discrimination. Compliance involves restricting the use of an invalid exam,
specifying procedures and standards for a new valid selection procedure, and
authorizing interim hiring that does not have a disparate racial impact.25
Affirmative relief involves interim hiring at any ratio greater than what is
necessary just to avoid a disparate racial impact and any required long-term
hiring targets or ratios. Though standards in this difficult area are only
beginning to emerge and have been a source of disagreement within this and
other courts, we distill from the case law the following general principles,
applicable to remedies for discrimination in entry-level hiring.26

90

1. As a general matter Title VII relief should at least assure compliance with
the law. When it has been established that a selection procedure has been
unlawfully used, an appropriate compliance remedy should forbid the use of
that procedure, or its disparate racial impact, and may properly assure the
establishment of a lawful new procedure. When it also appears that the
employer has discriminated prior to the use of the challenged selection
procedure, then it may also be appropriate to fashion some form of affirmative
relief, on an interim and long-term basis, to remedy past violations, see, e. g.,
Prate v. Freedman, 583 F.2d 42, 47 (2d Cir. 1978); United States v. City of
Chicago, supra, 549 F.2d at 436-37; Morrow v. Crisler, 491 F.2d 1053 (5th
Cir.) (en banc), cert. denied, 419 U.S. 895, 95 S.Ct. 173, 42 L.Ed.2d 139
(1974); Bridgeport Guardians, supra, 482 F.2d at 1340; Carter v. Gallagher,
452 F.2d 315, 331 (8th Cir. 1972) (en banc); cf. Franks v. Bowman
Transportation Co., 424 U.S. 747, 763-64, 96 S.Ct. 1251, 1263-64, 47 L.Ed.2d
444 (1976) (Title VII authorizes broad remedial relief); Albemarle, supra, 422
U.S. at 418 (same). However, the form of such affirmative relief, especially the
use of quotas, requires a most sensitive approach, see Association Against
Discrimination, supra, 594 F.2d at 310-11; Kirkland, supra, 520 F.2d at 427-28;
Patterson v. Newspaper & Mail Deliverers' Union, 514 F.2d 767, 775-76 (2d
Cir. 1975) (Feinberg, J., concurring), cert. denied, 427 U.S. 911, 96 S.Ct. 3198,
49 L.Ed.2d 1203 (1976); Bridgeport Guardians, supra, 482 F.2d at 1340;
Vulcan Society, supra, 490 F.2d at 398-99.

91

2. Initial consideration should be given to relief for the plaintiffs and those
similarly situated, that is, Black and Hispanic applicants who took Exam No.
8155. While relief in Title VII cases need not necessarily be limited to the
applicant class nor framed in specific relation to that class, their interests
obviously deserve consideration. See Castro v. Beecher, 459 F.2d 725, 736-37
(1st Cir. 1972); Carter v. Gallagher, supra, 452 F.2d at 328-31.

92

3. Interim hiring provisions, for the period prior to use of a valid selection
procedure, should be considered and formulated separately from long-term
hiring provisions.

93

4. Since interim hiring provisions, where needed to satisfy immediate personnel
requirements, are to be used prior to the development and approval of a valid
selection procedure, such provisions cannot meet Title VII standards by
demonstrated job relatedness. Therefore, one appropriate way to assure Title
VII compliance on an interim basis is to avoid a disparate racial impact. This
means selecting from among adequately qualified applicants either on a random
basis, see, e. g., Association Against Discrimination, supra, 594 F.2d at 313,
n.19, or according to some appropriately noncompensatory ratio, see, e. g.,
Kirkland, supra, 520 F.2d at 429-30; Vulcan Society, supra, 490 F.2d at
398-99, normally reflecting the minority ratio of the applicant pool or the relevant
work force.

94

5. Any use of a hiring ratio during the interim period to compensate for prior
discrimination, that is, a ratio greater than the minority percentage in the
applicant pool or the relevant work force, should be imposed only upon clear
evidence and appropriate findings of the need to redress demonstrated prior
discrimination of long standing that has had a significant impact on minority
employment. See Association Against Discrimination, supra, 594 F.2d at 312;
Patterson, supra, 514 F.2d at 776 (Feinberg, J., concurring); Vulcan Society,
supra, 490 F.2d at 398-99.

95

6. If a hiring ratio is imposed beyond the interim period in which a valid
selection procedure is developed in order to reach a required long-term target,
the justification for its use must be especially compelling.27 See Bridgeport
Guardians, supra, 482 F.2d at 1340. The prior discrimination warranting such a
remedy must either be intentional, or it must plainly appear that significant
discrimination has persisted for a substantial time. Gross disparity between
minority employment and minority percentage in the relevant work force may
imply such discrimination, especially when the minority employment is
extremely low. Otherwise, the instances, impact, and duration of prior
discrimination must be established.

96

With these principles in mind we turn to consideration of the order's provisions
for compliance remedies and for affirmative relief.

Compliance Remedies
97

Paragraph 2 of the order enjoins the use of Exam No. 8155 as a selection
procedure, except in connection with the implementation of the interim and
long-term hiring provisions. Deferring for the moment the exception
concerning the permissible use of the exam, we readily affirm the District
Court's prohibition against the unqualified use of the exam. The exam as used
violated Title VII, and it is obviously appropriate to bar its continued use,
except on an interim basis with adjustments that eliminate its disparate racial
impact and thereby avoid its unlawful effect.

98

The order prescribes four requirements for the development and approval of a
new selection procedure. We affirm the requirement, in paragraph 6 of the
order, that the City make extensive efforts in its search for a new procedure,
including consideration of "all reasonably available alternative selection
procedures" and broad consultation with appropriate professionals.

99

We also affirm the procedural requirement in paragraph 4 of the order that the
new selection device must be approved by the District Court prior to its use.
Once an exam has been adjudicated to be in violation of Title VII, it is a
reasonable remedy to require that any subsequent exam or other selection
device receive court approval prior to use. See, e. g., Bridgeport Guardians,
supra, 482 F.2d at 1339. This situation is to be contrasted with a case like
Guardians Ass'n v. Civil Service Commission, 490 F.2d 400 (2d Cir. 1973)
(Guardians I or "the '68-'70 exams case"), where the exams had not yet been
found to be invalid. In that case the City was obliged only to show the new
exam to plaintiffs and afford them an opportunity to criticize it.

100

However, we reject the District Court's principal substantive standard for
approval of the new selection procedure to the extent that it requires any new
procedure to be validated in accordance with the Guidelines and consistent with
the APA Standards. As discussed in part III of this opinion, we have concluded
that literal compliance with the Guidelines and with professional testing criteria
is not required by Title VII and can, in some instances, lead to results
inconsistent with Title VII's explicit endorsement of "any professionally
developed ability test." 42 U.S.C. 2000e-2(h). We therefore conclude that the
District Court, in determining the legality of a new selection procedure, should
not require that it must conform in all respects to the Guidelines and the APA
Standards; it will be sufficient if the new procedure conforms to the essential
purposes of Title VII. We have endeavored to outline the extent to which the
Guidelines are useful in carrying out those purposes and some of the respects in
which excessive rigidity in application of the Guidelines may undermine those
purposes. No all-encompassing formula is possible. The Guidelines remain
useful as a source of guidance, but they need not be adhered to in every detail
as if they were substantive regulations.
101 We also reject the District Court's substantive requirement, as expressed in
paragraph 6 of the order, that the new selection procedure must have "the least
adverse impact on minority applicants." This requirement appears to be an
attempt to implement the principle expressed in Albemarle that once an
employer has established the job-relatedness of a selection procedure that has a
disparate racial impact, the plaintiff may still establish a Title VII violation by
proving that "other tests or selection devices, without a similarly undesirable
racial effect, would also serve the employer's legitimate interest in 'efficient and
trustworthy workmanship.' " Albemarle Paper Co. v. Moody, supra, 422 U.S. at
425, 95 S.Ct. at 2375, quoting McDonnell Douglas Corp. v. Green, supra, 411
U.S. at 801, 93 S.Ct. at 1823. Of course, a decree may incorporate this
principle into the standard for approving any new selection procedure, but the
phrasing in paragraph 6 imposes a stricter and impermissible burden. To
comply with Title VII a new selection procedure need not have the least
adverse impact on minority applicants. That requirement would prohibit any
exam with any disparate racial impact because random selection would always
be a procedure with less adverse impact. What Albemarle contemplates, and
what the decree may require, is that a selection procedure proposed by the City
may not be used if the plaintiffs can establish the existence of an alternative
procedure with an equivalent degree of job relatedness and a lesser disparate
racial impact.
Affirmative Relief
102 Our initial concern with the provisions of the order concerning affirmative
relief arises from our uncertainty as to precisely what the District Court has
required. The District Court's January 11 order states, and its January 23
revised opinion appears to require, that the City take the affirmative action of
hiring 50% of entry-level police officers from among qualified Black and
Hispanic applicants. But the opinion and the order seem to contain different
provisions about the length of time for which this 50% minority hiring ratio is
to apply. The opinion characterizes the affirmative action required as an
"interim" measure. 484 F.Supp. at 799. In the context of Title VII testing cases,
"interim" has meant the time period between the date of a decree and the
subsequent use of a valid selection procedure. See, e. g., EEOC v. Local 638,
Sheet Metal Workers' Association, 532 F.2d 821, 829 (2d Cir. 1976); Kirkland,
supra, 520 F.2d at 423, 429-30; Vulcan Society, supra, 490 F.2d at 398. It is to
be contrasted with a long-term or permanent hiring requirement, which
specifies a minority composition of the employer's work force that must be
achieved, even if a valid testing procedure has been developed and approved
before the targeted minority composition is reached.28
103 In contrast to the District Court's opinion, its order, which is the operative
document brought here for review, appears to require numerical quotas that
continue beyond interim relief. Paragraph 3 states that the defendants "shall"
seek to achieve minority (Black and Hispanic) representation in the Police
Department "comparable to that of the minority composition of the labor force
in the relevant hiring area," a representation stated to be at least 30%. Paragraph
4 also indicates that the required 50% quota hiring29 may well last beyond the
interim that ends with approval of a valid test. Paragraph 4 prescribes 50%
minority hiring either until minority representation in the Department equals
minority representation in the relevant labor force or until the District Court has
both approved a valid selection procedure and, in addition, found that 50%
quota hiring is no longer "appropriate." That such a finding might not be made
until sometime after approval of a valid selection procedure and perhaps not
until the long-term hiring goal has been reached is indicated by the specific
reservation in Paragraph 4 of the plaintiffs' right to advocate the continued use
of hiring quotas because "the continuing effects of past discrimination have not
been eliminated."
104 In addition to creating uncertainty whether affirmative relief has been ordered
on an interim or long-term basis, the record contains inadequate findings and,
more significantly, inadequate evidence to support the hiring provisions of the
order. The District Court determined that affirmative relief was warranted
based upon conclusions concerning the prior employment practices of the
defendants and their state of mind in preparing and using Exam No. 8155. The
prior practices concern the defendants' use of an eligibility list compiled from
the results of police exams given between 1968 and 1970. The continued use of
the results of those exams after 1972, when Title VII was amended to include
municipal employers, had previously been found to violate Title VII because
the exams had a disparate racial impact and were not job-related. That
conclusion had been reached in litigation concerning the Police Department's
layoff policy, Guardians Ass'n v. Civil Service Commission, 431 F.Supp. 526
(S.D.N.Y.) (Guardians II or "the first layoff policy case"), vacated and
remanded for reconsideration, 562 F.2d 38 (2d Cir. 1977), and reaffirmed in
Guardians Ass'n v. Civil Service Commission, 466 F.Supp. 1273
(S.D.N.Y.1979) (Guardians III or "the second layoff policy case"), aff'd in part,
remanded in part, No. 79-7377, --- F.2d ---- (2d Cir. July 25, 1980).30
105 The District Court grounded its decision to impose affirmative relief on the
conclusion that the defendants designed Exam No. 8155 "either with a
deliberate intention to discriminate against blacks and hispanics or with reckless
disregard of whether the test would have that result." 484 F.Supp. at 798-99.
This serious indictment of responsible city and police administrators is
unsupported and indeed contradicted by the record. The conclusion is based
virtually exclusively on the fact that the Police Department failed to assemble a
valid eligibility list from the '68-'70 exams and failed again in 1979.31 It would
be contrary to Title VII's provision allowing the use of valid exams to hold that
once an employer tries to construct such an exam and fails, any further failure
to develop a valid exam constitutes intentional discrimination. Such a second
attempt, by itself, is evidence only of a desire to make use of a technique that
the law explicitly allows. Persistent use of exams with disparate racial effects
would support an inference of intentional discrimination if proper test
construction were not even attempted. But the record here indicates that the
City's police and personnel officials made extensive efforts to understand and
apply the Guidelines and develop a test they hoped would have the requisite
validity.32 Their failure entitles the plaintiffs to some relief, but does not justify
a remedy based upon an unwarranted inference of deliberate discrimination.
106 In the absence of intentional discrimination, affirmative relief requires some
demonstrated pattern of significant prior discrimination. There are no adequate
findings concerning such a pattern, and the record lacks sufficient evidence on
which such findings could be based. The District Court referred to the City's
prior Title VII violation in using the eligibility list resulting from the '68-'70
exams and concluded that the minority "imbalance" on the City's police force is
"directly caused by past and current discriminatory practices." 484 F.Supp. at
799. Obviously Exam No. 8155 has had no significant effect upon the current
minority proportion of the police force, because its results have been used to
select only one class for the training academy. As to the use of the eligibility
list from the '68-'70 exams, there are no findings and no evidence to indicate
the extent to which use of that list has affected the minority proportion of the
police force. The first layoff policy case provides some data as to the numbers
of Whites and minority members hired as a result of two of the '68-'70 exams,
431 F.Supp. at 552-53, Tables 4 and 6, but even with that data, the record in
this case does not disclose the minority percentage in the police department
before and after the '68-'70 exams. Nor is there any evidence of the impact of
hiring resulting from Exam No. 3014, administered in 1973, the validity of
which has not been challenged. The '68-'70 exams undoubtedly made some
contribution to the current racial imbalance of the police force, but the record
does not contain even estimates of how the hiring prior to the 1973 exam
currently affects the composition of the police force. Plaintiffs have failed to
prove that prior use of discriminatory exams has created a situation warranting
affirmative relief.
107 In the absence of proof of the specific impact of such prior discrimination, the
only probative evidence in the record is that the current minority proportion of
the police force is 12.7% compared to a relevant work force percentage of at
least 30%. That is cause for some concern, but does not reveal the flagrant
disparity shown in prior cases where long-term hiring quotas were in issue. Cf.
Association Against Discrimination, supra, 594 F.2d at 308 (minorities
constituted 0.2% of employees, 41% of population; quota vacated for
reconsideration and findings); EEOC v. Local 14, International Union of
Operating Engineers, 553 F.2d 251, 256 (2d Cir. 1977) (minorities constituted
2.8% of union members, at least 16.2% of relevant labor force; judgment
including quota vacated for further findings); Patterson, supra, 514 F.2d at 770,
772 (minorities constituted 2.45% of union and union-affected job-seekers, 30%
of relevant labor force; quota sustained); Bridgeport Guardians, supra, 482 F.2d
at 1335 (minorities constituted 3.6% of employees, 25% of population; hiring
quota sustained). If the disparity between existing minority employment and
relevant work force percentage were extreme and long-standing, that
circumstance alone might justify some affirmative relief, especially if minority
employment were low. But where, as here, the disparity is not extreme and
minority employment is not insubstantial, an affirmative hiring remedy must be
based on detailed findings, supported by evidence, that there exists a pattern of
prior discrimination warranting such relief. Cf. Association Against
Discrimination, supra, 594 F.2d at 312-13; Kirkland, supra, 520 F.2d at 427-28.
108 We therefore conclude that the affirmative hiring provisions of the order must
be set aside. This will require elimination of the affirmative hiring quota of
50%, both as interim and long-term relief, and elimination of the long-term
hiring goal of 30%, a goal that obviously could be achieved only by affirmative
hiring at a ratio above the minority percentage of the relevant work force. The
only hiring remedy justified by this record is a compliance remedy, one
designed to make sure that the City complies with the requirements of Title VII
in making appointments to the police force. Such a remedy should permit the
City, in the interim period prior to development of a new, valid selection
procedure, to use the results of Exam No. 8155 in a way that avoids any
disparate racial impact. This means selecting candidates from the eligibility list
subject to the minority proportion of either the applicant pool or the relevant
work force, a determination to be made upon remand.

109 To accomplish such interim hiring the City may assemble a minority pool and a
majority pool of qualified candidates from the eligibility list. In assembling
these pools, the City may use a cut-off score somewhat lower than 94, a score
that was originally determined by the City's manpower needs, rather than by an
independent estimate of adequate ability. Within the majority and minority
pools, the City may choose candidates, maintaining the requisite proportion of
minority candidates and taking into account those already hired as a result of
this exam. The City is not obliged to hire on an interim basis, but it should have
the option of doing so in order to meet its manpower needs.
110

We remand this case to the District Court for the entry of a revised decree
consistent with this opinion. Pending the entry of that decree, we continue in
effect the provisions of the stay order we previously entered, under which the
City is afforded the option of hiring from those who scored 94 or above on
Exam No. 8155 provided such hiring achieves a minority ratio of 33%, taking
into account those already hired as a result of this exam.
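The interim hiring arithmetic described above can be sketched as a small bookkeeping rule. This is only an illustration under stated assumptions: the one-third target and the figures of 415 prior hires and 380 further trainees come from the opinion, but the function name `interim_hires` and the count of minority officers among the prior hires (50) are hypothetical.

```python
import math
from fractions import Fraction

def interim_hires(new_hires, target_share, prior_minority, prior_total):
    """Split new hires between the minority and majority pools so that the
    cumulative minority share (prior hires included) reaches the target,
    as far as the number of new hires allows."""
    total_after = prior_total + new_hires
    # Minority hires needed overall, less those already hired.
    needed = math.ceil(target_share * total_after) - prior_minority
    minority = min(max(needed, 0), new_hires)
    return minority, new_hires - minority

# 415 and 380 are taken from the opinion; 50 is a hypothetical placeholder.
minority, majority = interim_hires(380, Fraction(1, 3), 50, 415)
print(minority, majority)
```

Using an exact `Fraction` for the target avoids floating-point rounding at the boundary where the required count is a whole number.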

111 Affirmed in part, vacated in part, and remanded.


SIFTON, District Judge, concurring:
112 I concur in the result and in the reasoning of Judge Newman's thorough opinion
with the exception of his conclusion that Exam No. 8155 adequately tested a
representative sample of the skills, knowledges and abilities required for police
work in New York City. While I join in rejecting the argument of the United
States as amicus, that the requirement of representativeness means that all the
knowledges, skills and abilities needed for police work must be tested, each in
its proper proportions, I disagree with the conclusion that an exam which
contains omissions as extensive as the present exam meets the
"representativeness requirements to an adequate degree."
113 As Judge Newman correctly points out, Exam No. 8155 failed to test for human
relations skills despite the City's job analysis which found, not surprisingly, that
such skills constitute a significant part of police work in New York City, sufficiently important to warrant devoting 30 of the exam's 100 questions to
determining whether the applicants possessed them and in what degree.
Unfortunately, as Judge Newman so well explains, the portion of the exam
which attempted to test for human relations skills in fact simply replicated other
portions of the exam which tested the applicants' abilities to remember details,
fill out forms, and apply general principles in specifically described factual
settings. In my view, an exam which does not test for a skill or ability
determined to constitute close to one-third of the talents required for success on
the job cannot be called representative in any meaningful sense.
114 The legal effect of excusing a lack of representativeness of these dimensions, as
Judge Newman's opinion appears to do, because the record does not contain
evidence that a test for the relevant skills or abilities is "readily available and
realistically feasible," appears to me, with all due respect, to shift to plaintiffs
the burden of explaining the exam's racially disparate impact. The practical
effect of validating a test for New York City police work which does not
examine for human relations skills is to leave out of the entry level employment
decision an area of qualifications in which minority groups would, one must
assume, perform well despite educational and other deprivations. The
immediate consequence of the decision is that high performance in the area of
human relations skills will not be available to alter the overall assessment of
applicants who perform less well in other areas.
115 Nor will refinements in rank-ordering or in the determination of an appropriate
cut-off score overcome the effect of an omission of these dimensions. Selection
at random or in rank order without the benefit of any assessment of the
applicants' performance in one large area of police work will reflect the same
racial disparities as exist in the pool from which random selection is made or in
the ranks established by the unrepresentative test.1

*

Of the United States District Court for the Eastern District of New York, sitting
by designation

1

One example:

37

Chris Hart and Larry Burns are walking by a big museum late one night. Hart
notices that, although the museum is closed to the public, one of the doors is
unlocked. Hart suggests that for a prank they go into the museum to see the
exhibits. Hart has a flashlight with him while Burns has an illegal gun hidden
under his clothing. Hart does not know that Burns has a gun illegally in his
possession. About five minutes after they enter the museum, they hear
footsteps and leave the museum the same way they entered. According to the
definitions given,
(A) Burns committed the crime of criminal trespass, but Hart did not
(B) Hart committed the crime of criminal trespass, but Burns did not
(C) both Hart and Burns committed the crime of criminal trespass
(D) neither Hart nor Burns committed the crime of criminal trespass.
The definition provided for criminal trespass is as follows:
The crime of criminal trespass is committed when a person knowingly enters or
remains in a building in which he has no right to be and while in the building
possesses, or knows that another person accompanying him possesses, an
explosive or a gun.
It will be noted, simply as an indication of the difficulty of the test construction
enterprise, that there is a slight ambiguity in this question. The applicant is told
that Hart is unaware that "Burns has a gun illegally in his possession," but not
whether Hart was unaware that Burns had a gun at all. This could affect the
answer, since, according to the definition given, knowledge of any gun, even of
a legal one, would render Hart's action a criminal trespass. While it would be
unreasonable to suggest that a test must be free of every possible ambiguity in
order to be acceptable, the ease with which such ambiguities can appear
emphasizes the value of confirming the test's reliability by some empirical
procedure. See Section IV, infra.
2

As a result of these bonus points, it was possible to achieve a score as high as
110. Some 482 applicants, or 1.3% of those taking the test, scored above 100

3

By the time of the November hearing, the Police Department had already
proceeded to use the list to accept 415 trainees, as described above. In order to
forestall any further hiring on the basis of the list, and to assist the City in
making alternative arrangements, the District Court informed the parties on
December 17, 1979 that the test violated Title VII, and orally enjoined the
defendants from its further use. On December 27, 1979, the City filed an order
to show cause and a motion for a stay, in which it stated that the Police
Department needed to accept an additional 380 trainees from the list on January
14, 1980. When the Court denied its motion, the City filed a petition in this
Court for a writ of mandamus. This Court granted the petition, ordering the
District Court to issue findings of fact and conclusions of law, pursuant to
Fed.R.Civ.P. 52(a), at least 48 hours before any injunction against the City was
to take effect. The District Court then held its second hearing, which was
devoted to the issue of relief. Since the Court's decision, issued that day,
preceded the January 14 action by more than 48 hours, it fulfills the
requirements of this Court's mandamus

4

The standard deviation for a particular set of data provides a measure of how
much the particular results of that data differ from the expected results. In
essence, the standard deviation is a measure of the average variance of the
sample, that is, the amount by which each item differs from the mean. The
number of standard deviations by which the actual results differ from the
expected results can be compared to the normal distribution curve, yielding the
likelihood that this difference would have been the result of chance. The
likelihood that the actual results will fall more than one standard deviation
beyond the expected results is about 32%. For more than two standard
deviations, it is about 4.6% and for more than three standard deviations, it is
about .3%. On this basis, the Supreme Court concluded in Castaneda that
when actual results fell more than three standard deviations from the expected
result (that is, a race-neutral selection), the deviation could be regarded as
caused by some factor other than chance
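The tail probabilities cited in this note can be checked with a short calculation. The sketch below assumes a normal distribution and a two-sided test; the helper name `two_sided_tail` is illustrative only.

```python
import math

def two_sided_tail(k: float) -> float:
    """Probability that a normally distributed result falls more than
    k standard deviations from its expected value, in either direction."""
    return math.erfc(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"more than {k} standard deviation(s): {two_sided_tail(k):.2%}")
```

Under these assumptions the three figures come out at roughly 32%, 4.6%, and 0.27%.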
5

A construct is generally defined as "an idea developed or 'constructed' as a
work of informed, scientific imagination; that is, it is a theoretical idea
developed to explain and to organize some aspects of existing knowledge."
American Psychological Association, Inc., Standards for Educational &
Psychological Tests 29 (1974) (hereinafter APA Standards). Neither the APA
Standards nor the Guidelines appear to include a definition of content, apart
from the concept of content validity, although the Guidelines do describe
content as involving "knowledges, skills, or abilities." Guidelines 14(C)(1)

6

The terms of this conditional stay were that, if the City wished to hire, pending
appeal, it should establish two pools of candidates, one consisting of all the
minority applicants who passed the exam, and the second consisting of all
others who passed, and select trainees from these pools in the ratio of one
minority applicant for every two others. The 415 applicants already hired were
to be counted in determining whether the new hires conformed to this 1 to 2
ratio

7

The City hypothesizes some situations in which statistics could be misleading
(e. g., if some of the candidates taking the test had not been eligible to apply for
the position) but presents no evidence to show that this occurred. To accept
such unsupported possibilities, and require the plaintiffs to refute every
circumstance that could explain the disparate impact shown by the statistics,
would create an onerous burden of proof, far in excess of the Title VII
standards as interpreted by the Supreme Court. See Dothard v. Rawlinson, 433
U.S. 321, 329-30, 97 S.Ct. 2720, 2726-27, 53 L.Ed.2d 786 (1977); Jones v.
New York City Human Resources Administration, 528 F.2d 696, 698 (2d Cir.
1976)

8

Title VII states, in relevant part:
Notwithstanding any other provision of this subchapter, it shall not be an
unlawful employment practice . . . for an employer to give and to act upon the
results of any professionally developed ability test provided that such test, its
administration or action upon the results is not designed, intended or used to
discriminate because of race, color, religion, sex or national origin.
42 U.S.C. 2000e-2(h). Cf. International Brotherhood of Teamsters, supra, 431
U.S. at 348-56, 97 S.Ct. at 1861-65 (interpreting Title VII's seniority exception,
2000e-2(h), to give operative effect to exception's full scope).
9

See EEOC, Uniform Employee Selection Guidelines: Interpretation and
Clarification (Questions and Answers) Q. 62 (1979) (hereinafter cited as EEOC
Questions and Answers); APA Standards, supra at 29; accord, Vulcan Society,
supra, 490 F.2d at 395

10

In criticizing the questions involving application of the law, the District Court
stated: "In a real situation, the officer sees activity and must determine rather
quickly whether the activity is illegal, with no definitional aids before him. He
must operate on instinct and experience." 484 F.Supp. at 797. That is true, but it
is a criticism of all testing. In any situation, there are generally at least two
steps that are necessary to produce the correct behavioral response. The first is
to know what to do, and the second is to act accordingly. Clearly the limit of
any test, no matter how well designed, is to determine whether the applicant
knows or can determine what to do. Only a probationary period can determine
if the applicant will act correctly in a real life situation

11

Such data may be obtained by studying correlations of test scores of accepted
candidates with their subsequent job performances, or correlations of the test
scores of present employees with their current job performances

12

See United States v. City of Chicago, supra, 549 F.2d at 430-32; Douglas v.
Hampton, 512 F.2d 976, 985-86 (D.C.Cir.1975); Vulcan Society, supra, 490
F.2d at 395 & n. 10; Bridgeport Guardians, Inc. v. Civil Service Commission,
354 F.Supp. 778 (D.Conn.), aff'd in part, rev'd in part, 482 F.2d 1333 (2d Cir.
1973); cf. Schmidt & Hunter, The Future of Criterion-Related Validity, 33
Personnel Psych. 41, 48 (1980) ("criterion related validity studies will
frequently, perhaps typically, be technically infeasible")
A rare example of a criterion-related study that was found acceptable is
Washington v. Davis, supra, 426 U.S. at 249-52, 251 n. 17, 96 S.Ct. at 2052-53,
2053 n. 17. This was not a Title VII case however, and the Court's use of Title
VII concepts, in dictum, to assess the validity of the test in question under the
Fourteenth Amendment does not indicate that the Court was reviewing the test
with the stringency that Title VII requires. See Guardians Ass'n v. Civil Service
Commission, 633 F.2d 232, 245-46 (2d Cir. 1980). In fact, a less
demanding standard was almost certainly being used, as is clear from the
comparison between Davis and the Court's Title VII decision the preceding
term in Albemarle.
13

There will be some tests whose character is sufficiently clear so that the
content-construct distinction can be applied at the threshold, without the need
to place the test in the context of the job it tests for. A general intelligence test,
see Griggs, supra (Wonderlic Personnel Test), will almost always need to be
assessed by construct validation, since it necessarily measures for an inferred
ability, regardless of the context. The much-vaunted typing test, in contrast, can
always be regarded as amenable to content validation. However, there are a
large number of tests, including virtually all the "second generation" tests for
jobs such as the one considered here, that will fall into the middle range

14

These considerations are particularly crucial to an exam validated on content
grounds. See APA Standards at 29: "Content validity is determined by a set of
operations, and one evaluates content validity by the thoroughness and care
with which these operations have been conducted."

15

In some instances the relationship is obvious. Plainly the ability to fill out
forms is needed for task 16, "Processes arrests using appropriate police
department forms and notifications." But it is not evident, for example, why any
of the five listed abilities are critical to task 32, "Searches for lost children,
runaways, etc." or task 19, "Guards and transports prisoners."

16

The fact that the factual subject matter of the exam questions (as opposed to
their purpose in measuring abilities) was related to the subject matter of the job
is not a major indicator of the test's validity, since the test measured abilities,
not knowledge. But it does suggest that the test has avoided the dangers
inherent in using irrelevant factual material. Such material could skew the test
for ability in directions unrelated to the job, a phenomenon that even the best
designed test might not be able to avoid. The present test avoids that problem
by ensuring that any distorting effects resulting from the subject matter of the
questions are themselves job-related

17

See APA Standards, supra at 48-50; Kuder & Richardson, The Theory of
Estimation of Test Reliability in Principles of Educational and Psychological
Measurement 95 (W. Mehrens & R. Ebel, eds. 1967); Rulon, A Simplified
Procedure for Determining the Reliability of a Test by Split Halves, in id. at
104

18

This effect can be demonstrated more precisely in psychometric terms, through
the use of the standard error concept. The standard error is the raw score
variance corresponding to a single standard deviation of the scores that would
be obtained by a test-taker on successive, equivalent tests. It can be
approximated by the quantity $\sqrt{t(n-t)/n}$,
where t is the test-taker's score and n is the number of items on the test. This
formula is derived from the general formula for a standard deviation. See Lord,
Do Tests of the Same Length Have the Same Standard Error of Measurement?,
in Principles of Educational and Psychological Measurement 192 (W. Mehrens
& R. Ebel, eds. 1967). The formula is an approximation because a particular
applicant's error of measurement as defined by Lord, supra, is a function of his
"true" score, that is, the score he would have obtained had no error been
present. However, as the number of items on the test increases, the observed
score will approach the true score, so that the approximation is permissible. In
substituting observed score for actual score Lord introduces the refinement of
reducing the denominator of his function by one to eliminate sampling bias, but
this has an insignificant effect when the numbers are rounded off to the extent
that they are in this discussion.
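The standard error approximation discussed in this note can be illustrated numerically. The sketch assumes the approximation takes the binomial form sqrt(t(n - t)/n), which fits the note's description (derived from the general standard deviation formula, with Lord's refinement reducing the denominator by one) and yields values near the 2.4 figure used in note 23. The function name and the `unbiased` flag are hypothetical.

```python
import math

def standard_error(t: int, n: int, unbiased: bool = False) -> float:
    """Approximate standard error of measurement for a raw score t on an
    n-item test, assuming the binomial form sqrt(t * (n - t) / n).
    With unbiased=True the denominator is n - 1, the refinement the note
    attributes to Lord."""
    denom = n - 1 if unbiased else n
    return math.sqrt(t * (n - t) / denom)

for score in (93, 94, 99):
    print(score, round(standard_error(score, 100), 2),
          round(standard_error(score, 100, unbiased=True), 2))
```

On a 100-item test the two variants differ only in the second decimal place, which is why the note treats the refinement as insignificant at this level of rounding.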
19

The reason this bunching occurred was that the exam was too easy. An exam
that was too difficult might have had the same effect, except that the bunching
would have occurred at the lower end of the scale. Neither excessive easiness
nor excessive difficulty is necessarily fatal, but each magnifies effects that may
make scoring arrangements unjustified

20

Set forth below are the total number of applicants who achieved each score
from 110 to 70 and the number of White and minority (Black and Hispanic-surnamed) applicants at each of these scores. The White and minority figures
do not always equal the total because some applicants were members of other
minority groups and some applicants were not identified

Score  Total Applicants  White  Minority
  110                 3      1         1
  109                 6      4         2
  108                 3      1         0
  107                 9      5         1
  106                13      5         5
  105                36     19         6
  104                90     59        16
  103               102     64        23
  102                95     53        17
  101               125     60        38
  100               823    565        96
   99              1570   1067       177
   98              1845   1372       223
   97              2238   1562       282
   96              2311   1516       390
   95              2255   1434       428
   94              2124   1307       425
   93              2024   1195       504
   92              2017   1174       504
   91              1772   1009       524
   90              1675    913       529
   89              1498    774       541
   88              1359    678       485
   87              1171    569       452
   86              1077    500       452
   85               989    468       402
   84               870    395       371
   83               813    358       362
   82               722    313       330
   81               608    246       298
   80               558    214       287
   79               497    191       245
   78               449    163       222
   77               409    157       197
   76               363    130       192
   75               323    117       156
   74               342    121       182
   73               265     89       141
   72               247     87       138
   71               213     89       102
   70               198     65       112
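As a consistency check, the score totals listed in this note can be summed and compared with two figures stated elsewhere in the notes: the 482 applicants who scored above 100 and the 4,148 who scored 93 or 94. The sketch below transcribes only the rows needed for that check; the variable names are illustrative.

```python
# Total applicants at each score, transcribed from this note's table
# (only the rows needed for the two checks).
totals = {
    110: 3, 109: 6, 108: 3, 107: 9, 106: 13,
    105: 36, 104: 90, 103: 102, 102: 95, 101: 125,
    94: 2124, 93: 2024,
}

above_100 = sum(n for score, n in totals.items() if score > 100)
cutoff_band = totals[93] + totals[94]

print(above_100)    # 482
print(cutoff_band)  # 4148
```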

21

The differentiating power of a question can be easily determined by means of
an item analysis. The City could have tested each question on a sample
population to determine its power to distinguish between different levels of
ability. The most simple item analysis would have quickly revealed that this
exam would produce a large number of closely bunched high scores. See
Englehart, A Comparison of Several Item Discrimination Indices, reprinted in
Principles of Educational and Psychological Measurement 387 (W. Mehrens &
R. Ebel eds. 1967); Findley, A Rationale for Evaluation of Item Discrimination
Statistics, reprinted in id. at 381
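A very simple item analysis of the kind this note has in mind might look like the following sketch. It is an assumption-laden illustration, not the method of the cited articles: it uses the common upper-group versus lower-group index with 27% groups, and the response data are invented.

```python
def item_analysis(responses):
    """responses: one list of 0/1 item scores per test-taker.
    Returns (difficulty, discrimination) for each item, where difficulty is
    the share answering correctly and discrimination is the difference in
    that share between the top and bottom scoring groups."""
    n_items = len(responses[0])
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, round(0.27 * len(ranked)))   # 27% groups, a common convention
    upper, lower = ranked[:k], ranked[-k:]
    results = []
    for i in range(n_items):
        difficulty = sum(r[i] for r in responses) / len(responses)
        discrimination = (sum(r[i] for r in upper) - sum(r[i] for r in lower)) / k
        results.append((difficulty, discrimination))
    return results

# Hypothetical data: four test-takers, three items.
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [1, 0, 0]]
results = item_analysis(responses)
print(results)
```

In this invented sample the first item is answered correctly by everyone, so its discrimination index is zero: an item that is too easy cannot separate stronger from weaker candidates.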

22

Actually, in this example, presumably 1,200 names would be drawn to yield the
anticipated 400 candidates who would complete the post-examination steps of
the hiring process

23

This comparison can be expressed mathematically by considering the statistical


probability that a person who passed with a score of 94 or failed with a score of
93 would have achieved that same score on successive exams. If one were to
administer successive exams and average the results, the test-taker would have
to achieve an average score no more than one-half point above or below his
original score to be regarded as achieving the same score. For a test-taker who
scored 93 or 94 on Exam No. 8155, the standard error is approximately 2.4.
This means that 2.4 points above or below 93 or 94 represents one standard
deviation. Similarly, one-half a point above or below these scores represents
.5/2.4 or .21 standard deviations. Assuming that the normal distribution curve
applies, and ignoring the effect of bonus points, it can be derived from the table
of normal curve distributions that .21 standard deviations represent a 17% level
of certainty. In other words about 17% of the test takers who scored 93 or 94
would achieve that same score, on the average, if they had taken successive
equivalent exams. Of the remainder, we can assume that half would have
scored higher and half would have scored lower. Since 94 was passing, these
figures mean that 41.5% of those who scored 93 (41.5% being half the
remainder, which is 100%-17%, or 83%), would have achieved an average
score of 94 on a series of successive equivalent tests, and thereby passed the
exam, while 41.5% of those who scored 94 would have failed.
If the test-takers had been evenly distributed by score, which is ideal, or at least
approximated an even distribution in the cutoff region, which is generally
feasible, 729 persons would have scored 93 or 94, and 41.5% of these, or 300,
would have been incorrectly placed. In fact, some 4,148 achieved these two
scores, so that at least 1,721 of the test-takers were incorrectly placed, just
counting those who achieved these two sets of scores. As one moves away
from the cutoff, the percentage of incorrect placements becomes much less. Of
those who scored 99, for example, virtually none would have failed on
successive tests, since the standard error for a score of 99 on a test with 100
questions is about 1, and 93, the highest failing grade, is thus more than three
standard deviations below 99. Similarly, of those who scored 85, only one half
of one percent would have passed on successive tests, since the standard error
for a score of 85 is 3.6, which is 2.5 standard deviations below 94. For these
scores, the test had a very low error of measurement. But relatively few of the
test-takers in Exam No. 8155 scored 85 or 99. In contrast, each score in the 92
to 97 range, where the error of measurement was greatest, was achieved by
over 2,000 test-takers. And the selected cutoff score was right in the middle of
that range.
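
The arithmetic in this footnote can be checked with a short sketch. It assumes, as the footnote does, that the normal curve applies and that bonus points are ignored. The standard errors, cutoff, and head counts are taken from the footnote; the function names are hypothetical, and the normal table is replaced by the error function from Python's standard library.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative probability of a standard normal variable up to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_same_score(std_error: float, half_width: float = 0.5) -> float:
    """Chance that an average over repeated exams stays within
    half_width points of the original score (two-tailed)."""
    z = half_width / std_error
    return 2.0 * normal_cdf(z) - 1.0


def prob_crossing_cutoff(score: float, cutoff: float, std_error: float) -> float:
    """One-tailed chance of landing on the far side of a cutoff that
    lies |cutoff - score| points away from the original score."""
    z = abs(cutoff - score) / std_error
    return 1.0 - normal_cdf(z)


# Figures taken from the footnote.
same = prob_same_score(2.4)          # the footnote rounds this to 17%
flip_rate = 0.415                    # the footnote's half of the remaining 83%
misplaced_even = 729 * flip_rate     # the footnote rounds this to 300
misplaced_actual = 4148 * flip_rate  # the footnote reports "at least 1,721"
pass_from_85 = prob_crossing_cutoff(85, 94, 3.6)  # roughly half of one percent
fail_from_99 = prob_crossing_cutoff(99, 93, 1.0)  # "virtually none"

print(f"same score: {same:.3f}")
print(f"misplaced (even spread): {misplaced_even:.0f}")
print(f"misplaced (actual): {misplaced_actual:.0f}")
print(f"85 would pass: {pass_from_85:.4f}")
print(f"99 would fail: {fail_from_99:.2e}")
```

Computed without rounding the ratio .5/2.4 to .21, the same-score probability comes out slightly below the footnote's 17%, which does not change the conclusion.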
24

Upon consideration of the evidence presented at the liability and relief stages of
this case, and consideration of the briefs and oral arguments of the parties and
amicus curiae United States of America and Policewomen's Endowment
Association of New York City, Inc., and entry of findings of fact and
conclusions of law, it is hereby ORDERED:

1

Defendants, their officers, officials, agents, employees, successors, and all
persons in active concert or participation with them or any of them are hereby
permanently enjoined from engaging in any act or practice with respect to the
selection of candidates for appointment to and training for the position in the
New York City Police Department of entry-level police officer, which act or
practice has the purpose or effect of discriminating against such persons
because of race or national origin.

2

Defendants are hereby permanently enjoined from using Examination 8155 in
any manner, except as specifically provided in paragraph 5 of this order.

3

The defendants shall seek to achieve as a long-term goal black and hispanic
(hereinafter "minority") representation in the sworn ranks of the Police
Department comparable to that of the minority composition of the labor force in
the relevant hiring area. As of 1978, the labor force of the relevant hiring area
was at least 30% black and hispanic.

4

To achieve the long-term goal set forth in paragraph 3, supra, the defendants
shall as an interim goal appoint 50% of their entry level police officers from
among qualified black and hispanic applicants. The interim hiring goal for
minorities shall remain in effect until the minority representation in the sworn
ranks of the Police Department is at least equal to the percentage of minorities
in the labor force of the relevant hiring area as described in paragraph 3, supra,
or until this court has found, after a hearing, that all proposed selection
procedures for police officer positions have been validated in accordance with
the Uniform Guidelines on Employee Selection Procedures, 28 C.F.R. 50.14,
29 C.F.R. 1607, effective September 25, 1978 ("Uniform Guidelines"), and
that no further interim goals are appropriate. Nothing herein shall preclude
plaintiffs from advocating the continuance of the interim goals on the basis that
the continuing effects of past discrimination have not been eliminated.

5

To satisfy the goals set forth above in paragraphs 3 and 4, defendants may use
an eligibility list derived from Examination No. 8155 as the pool from which it
selects police officers. At such time as that eligibility list or any future
eligibility list for the position of police officer does not contain sufficient
minority candidates to meet the interim goals set out in paragraph 4, supra, the
City shall take whatever steps are necessary to achieve the interim hiring goals.

6

Defendants shall make reasonable efforts to develop within a reasonable period
of time a procedure for the selection of candidates for the entry level position of
police officer which shall be lawful and validated in accordance with the
Uniform Guidelines or successor guidelines similarly promulgated, and which
is consistent with generally accepted psychological standards as defined by the
American Psychological Association from time to time, and which has the least
adverse impact on minority applicants. Consistent with this requirement
defendants (a) shall examine all reasonably available alternative selection
procedures on the subject of testing of police officer applicants, and (b) shall
consult with industrial psychologists, psychometricians and/or others who have
experience in the field of selection testing, and preferably who have performed
or have knowledge of analyses of the job of police officers.

7

Defendants may continue to use the current qualifications and selection criteria
for police officer positions. However, no such qualification or selection
criterion shall be a valid basis for or defense for failure to meet the interim
hiring goals set out in paragraph 4, supra, unless the court has ruled that such
qualification or selection criterion has been validated in accordance with the
Uniform Guidelines and has determined that there is no further basis for
continuing the interim goals.

8

Members of the plaintiff class shall be afforded the opportunity, at appropriate
later proceedings, to show that they are entitled to an award of back pay and
constructive seniority.

9

Plaintiffs are entitled to their court costs and reasonable attorneys' fees to date.
The amount of such costs and fees shall be set by the court after a hearing.
Costs and attorneys' fees for work done in the future shall be fixed in such
manner as the court may determine.

10

Within thirty (30) days after the entry of this order, and every six (6) months
thereafter, the defendant city shall submit to the plaintiffs the following reports:
(a) A list of its then current uniformed employees in the Police Department
showing for each person: name, address, race or national origin, police station
or other place of assignment, date of appointment, rank, and date such rank was
achieved.
(b) The total number of uniformed personnel employed by the Police
Department, by rank, race and national origin.
(c) A list of all minority applicants for all vacancies, including date of
application, names, addresses, and telephone numbers, whether the applicant
was accepted or rejected and reason(s) for rejection.
(d) The name, address, and telephone number of any minority employee
involuntarily terminated prior to the completion of the probationary period, and
reason(s) for termination.
(e) List of all hires, promotions and voluntary and involuntary terminations
showing race and national origin.

11

Defendants shall provide to plaintiffs, upon request, such other records or
documents as are necessary to monitor compliance with this order.

12

The court retains jurisdiction of this action for such further relief or other
orders as may be necessary or appropriate to enforce and insure rights to equal
employment opportunity within the New York City Police Department.
SO ORDERED.

25

An example would be an interim minority hiring ratio of 30%, where the
minority percentage of the applicant pool or the relevant work force is also
30%. In this case, the District Court's provision for interim hiring specifies a
minority ratio of a remedial nature, i. e., greater than needed simply to avoid a
disparate racial impact. We therefore consider this provision as part of
affirmative relief.

26

The formulation of remedies for discrimination in promotion of employees
requires even greater caution than is appropriate for entry-level hiring cases,
since such remedies inevitably impact adversely identifiable employees who
have committed themselves to a particular career track. See Kirkland, supra,
520 F.2d at 429; Bridgeport Guardians, supra, 482 F.2d at 1341.

27

This requirement of reaching a target should be contrasted with the far more
modest use of a target figure merely to limit the extent of interim relief. The
latter occurs when a remedy provides that the interim relief will continue until
an acceptable selection procedure is developed or until a particular target figure
for minority employment is achieved. In such a case, the target figure does not
function as an absolute requirement; it simply serves as a means of assuring that
the interim requirements will end at some point, even if development of a valid
selection procedure is unduly delayed. The use of a target figure for this
limiting purpose does not require the same compelling justification as a target
figure prescribed as an absolute requirement, since it imposes no additional
obligation on the defendant.

28

Though the District Court's opinion refers to affirmative action as an "interim"
measure, it prescribes that it will last either until "discrimination has been
totally eliminated" or until valid selection procedures are being used. (484
F.Supp. at 799). The meaning of the first phrase is not entirely clear; however,
even if it means a long-term objective of minority police employment equal to
the minority percentage in the work force, the opinion still requires no more
than an interim remedy because of the provision in the second phrase that the
affirmative action can be terminated once a valid selection procedure has been
approved.

29

The order unfortunately refers to the 50% quota as an "interim goal." It is not a
goal at all. It is a procedure that the District Court has required to be used at
least until the true interim goal (approval of a valid selection procedure) has been
reached, and perhaps until the long-term goal (minority employment percentage
equal to minority work force percentage) has been reached.

30

In 1973 the defendants gave still another exam whose validity has not been
adjudicated. See 431 F.Supp. at 545 n.36. It is interesting to note, however, that
even plaintiffs' well-known testing expert, Dr. Richard Barrett, acknowledged
on cross-examination during the first layoff policy case that he could not be
certain whether or not this exam is content valid. Id. n.37.

31

The only evidence referred to by the District Court, in addition to the invalidity
of the '68-'70 exams, is the testimony of some police officers who warned the
test makers that the test would disfavor minorities. (484 F.Supp. at 798). Such a
caution would be significant if the test makers had made no effort to satisfy
Guideline standards. But when test makers have undertaken the elaborate
process of job analysis and test construction revealed by this record, their
willingness to disregard a prediction of disparate racial impact does not indicate
that they lacked a good faith belief that their exam could nonetheless be shown
to be adequately job-related.

32

It is true that the City's previous unsuccessful attempt to design an exam on its
own renders somewhat questionable its apparent enthusiasm for another "in-house"
effort. However, the decision to produce Exam No. 8155 "in-house"
may well have been motivated by a bureaucratic preference for internal
procedures, a need to save money, a naive self-confidence, or simply a desire to
try again. None of these motives is a basis for inferring a conscious intention or
even a reckless willingness to violate the law.

Of course, an unrepresentative exam, as is the case with exams at other stages
of the qualification process for New York City police work, can be
administered on a pass/fail basis as part of a larger qualification process
provided the cut-off score represents that level of skills shown to be the level
below which an applicant is disqualified for police work. There is, in other
words, no objection on grounds of discriminatory impact to a test which
eliminates those whose skills and abilities in the areas actually tested for are so
low as to disqualify the applicant no matter how well he or she might perform
in other untested areas.
