2d 79
Attys., New York City, Drew S. Days III, Asst. Atty. Gen., Steven H.
Rosenbaum, Washington, D. C., on brief), for the United States as amicus
curiae.
Ira H. Leibowitz, Barry Lasky, Garden City, N. Y., submitted a brief for
The Policewomen's Endowment Assoc., Inc., as amicus curiae.
H. Elliot Wales, New York City, submitted a brief for Seven Civil Service
Organizations as amicus curiae.
Before MANSFIELD and NEWMAN, Circuit Judges, and SIFTON, *
District Judge.
NEWMAN, Circuit Judge:
This employment discrimination suit pursuant to Title VII of the Civil Rights
Act of 1964, 42 U.S.C. § 2000e-2, once again requires this Court to venture into
the complex realm of testing and test validation. The test at issue was designed
by New York City officials and administered on June 30, 1979 to 36,797
applicants for positions on the City's police force. Plaintiffs are the Guardians
Association of the New York City Police Department, Inc., an organization of
Black police officers, the Hispanic Society of the New York City Police
Department, Inc., an organization of Hispanic police officers, and eight
individual Black or Hispanic applicants. Defendants are the New York City
Department of Personnel, which performed much of the test preparation, the
New York City Civil Service Commission, and the New York City Police
Department. The United States District Court for the Southern District of New
York (Robert L. Carter, Judge) found that use of the test unjustifiably
discriminates against Blacks and Hispanics in violation of Title VII. Guardians
Ass'n v. Civil Service Commission, 484 F.Supp. 785 (S.D.N.Y.1980). The
Court ordered a broad remedy, including a 50% minority hiring quota. We
affirm the District Court's finding that the City's specific use of the test violates
Title VII, but vacate the remedy and remand for entry of a revised decree.
I. Factual Background
2
The test in question, designated Exam No. 8155, was designed to select
candidates for hiring as entry-level police officers. Those who pass the exam
are selected, in rank order of their test scores, to complete the other aspects of
the hiring process: a medical examination, a physical agility test, a
psychological test, and a character investigation. These last four components of
the hiring process are scored only on a pass/fail basis. Thus, an applicant's
score on Exam No. 8155 is a major determinant of his prospects for becoming a
police officer. It is also the only feature of the process alleged to have a
discriminatory impact. Once an applicant scores high enough to be selected for
the final four hiring steps and successfully completes those steps, he or she
becomes a sworn police officer and enters the police academy for five months
of training. While successful completion of the training program is a
requirement of continuing as a police officer, the Department does not use the
training program as a selection device, but anticipates that nearly all academy
entrants will go on to active duty.
3
The exam was developed by a fairly elaborate two-stage process; the first stage
was an analysis of the police officer's job, and the second was construction of
the test itself. The job analysis consisted of five separate steps. First, the
Department of Personnel identified 71 tasks that police officers generally
perform, based on interviews with 49 police officers and 49 supervisors.
Second, a panel of seven officers and supervisors reviewed the list to add any
tasks that had been omitted, and to eliminate those items that were duplicative,
or too specialized to be performed by entry-level officers. The result was a
consolidated list of 42 entry-level tasks.
In the fourth step of the job analysis, the Department of Personnel divided the
list of 42 ranked tasks into clusters of related activities. Five such clusters were
established: the arrest process, providing assistance to people, police
operations, stationhouse activities, and handling unusual and other occurrences.
The fifth step was an analysis of all five clusters, each one by a separate panel
of police officers, to identify the "knowledge, skills and abilities" required to
perform these tasks at the entry level, and to assign percentages reflecting the
relative importance of each of the identified knowledges, skills, and abilities for
the cluster as a whole. One panel listed five such qualities for its cluster, all of
which are properly characterized as "abilities" or "skills" (hereafter referred to
as "abilities"): recalling facts, filling out forms, understanding and applying
statutory definitions of crimes, understanding written instructions and applying
appropriate procedures, and human relations skills, including communication
techniques. Each of the other four panels used the first panel's list of abilities,
but developed its own percentages to express the relative importance of each
ability to the tasks within its cluster.
6
The second major stage in developing Exam No. 8155, the process of test
construction, consisted of four identifiable steps. First, the percentages of the
five abilities necessary to perform each of the five task clusters were multiplied
by the weightings that had been given to each task in Step 3 of the job analysis
on the basis of frequency, importance, and time spent. This yielded a general
measurement for the importance of each of the five abilities for performance of
the job of police officer. As a result of this computation, the Department of
Personnel concluded that on a test with 100 questions, 15 questions should test
for the ability to recall facts, 9 questions for filling out forms, 14 questions for
understanding and applying sections of the criminal law, 32 questions for
understanding written instructions and applying appropriate procedures, and 30
questions for human relations skills. Next, a group of eleven police officers was
selected to write multiple-choice questions that tested for the five abilities, as
they related to the 42 identified tasks. The officers wrote many of these
questions from Police Academy materials and similar sources, however,
without having access to descriptions of the five identified abilities, or the 42
ranked tasks. In the third step, Department of Personnel staff members who did
have access to the description of abilities and the ranking of tasks reviewed the
questions written by the police officers to assure that the questions were not
ambiguous, overly complex, overly specialized, or dependent on prior
knowledge. As a result of this review, some questions were discarded, others
were revised, and still others were added. Finally, the resulting questions were
subjected to a further review by a panel of six police experts, and by various
members of the Department of Personnel.
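The first step of test construction, weighting each ability by the importance of the task clusters that require it and then converting the result into question counts, can be sketched as follows. This is only an illustration: the opinion reports the resulting allocation (15, 9, 14, 32, and 30 questions) but not the panels' percentages or the cluster weightings, so every input number below is hypothetical, and the largest-remainder rounding rule is an assumption rather than the Department's documented method.

```python
# Hypothetical cluster weightings (the real ones came from Step 3 of the
# job analysis and are not reported in the opinion). They sum to 1.
CLUSTER_WEIGHTS = {
    "arrest": 0.30, "assistance": 0.25, "operations": 0.20,
    "stationhouse": 0.10, "unusual": 0.15,
}

# Hypothetical panel percentages: the share of each ability within a cluster.
ABILITY_SHARES = {
    "arrest":       {"recall": 0.20, "forms": 0.15, "law": 0.30, "procedures": 0.25, "human": 0.10},
    "assistance":   {"recall": 0.10, "forms": 0.05, "law": 0.05, "procedures": 0.30, "human": 0.50},
    "operations":   {"recall": 0.15, "forms": 0.05, "law": 0.10, "procedures": 0.45, "human": 0.25},
    "stationhouse": {"recall": 0.10, "forms": 0.30, "law": 0.10, "procedures": 0.35, "human": 0.15},
    "unusual":      {"recall": 0.15, "forms": 0.05, "law": 0.10, "procedures": 0.35, "human": 0.35},
}


def allocate_questions(cluster_weights, ability_shares, total_questions=100):
    """Weight each ability by cluster importance, then round to whole questions."""
    abilities = list(next(iter(ability_shares.values())))
    # Overall importance of an ability: sum over clusters of weight x share.
    raw = {
        a: sum(cluster_weights[c] * ability_shares[c][a] for c in cluster_weights)
        for a in abilities
    }
    scale = total_questions / sum(raw.values())
    exact = {a: raw[a] * scale for a in abilities}
    counts = {a: int(exact[a]) for a in abilities}  # round down first
    # Hand out the leftover questions to the largest fractional remainders.
    leftover = total_questions - sum(counts.values())
    by_remainder = sorted(abilities, key=lambda a: exact[a] - counts[a], reverse=True)
    for a in by_remainder[:leftover]:
        counts[a] += 1
    return counts


allocation = allocate_questions(CLUSTER_WEIGHTS, ABILITY_SHARES)
print(allocation)
```

Whatever the actual inputs were, the point of the step is the same: the number of questions per ability is driven by the job analysis rather than chosen by the question writers.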
The first part of the exam, designed to measure the ability to recall facts,
consisted of a page-and-a-half description of a burglary, and a series of 15
questions to be answered without referring back to the description. In the
second part, testing ability to fill out forms, the candidates were given a
simplified arrest form, and a page-long description of both a robbery and an
arrested suspect, and then asked 9 questions about the proper entries to be made
in filling out the form. Part three, intended to test ability to apply provisions of
law, consisted of 14 questions, each briefly presenting the facts of an incident,
and then requiring the candidate to identify the precise criminal offense
involved on the basis of definitions provided in the test materials.1 The
remaining 62 questions, of which 32 were intended to measure the ability to
follow appropriate procedures and 30 were intended to measure human
relations skills, consisted of general instructions as to procedures or appropriate
responses for certain types of situations, a description of a specific situation,
and then one or more questions asking the proper response to the situation
presented. Three of the questions dealing with appropriate procedures, for
example, involved the proper response to a bomb threat. Four of the questions
in the human relations section involved the proper way to deal with a person
who appears to be mentally ill.
9
The test was scored from zero to one hundred, with one point given for each
correct answer, and bonus points given for veterans. 2 The candidates were then
rank-ordered on the basis of their scores. Scores were generally high, with 13%
of the applicants scoring 98 or above, and fully 50% scoring 91 or above.
Because of the number of candidates taking the test and the bunching of
candidates in the upper range of scores, each point a candidate achieved made a
substantial difference in his position on the rank-ordering list. More than 2,000
applicants achieved a score at each numerical grade from 92 to 97.
10
The passing grade was determined in the following manner. The Police
Department first estimated that 4,000 police officers would be hired during the
four-year period for which the eligibility list resulting from Exam No. 8155
would be valid. The Department further estimated that only one out of three
applicants who passed Exam No. 8155 would successfully complete all the
remaining steps in the hiring process. Therefore, if this eligibility list was to
meet the Department's needs, 12,000 persons had to pass the exam to provide
the 4,000 needed police officers. With all this in mind, the Department simply
set the passing grade at the score achieved by the 12,000th highest scoring
candidate, which turned out to be 94. Because of the bunching phenomenon, a
large number of candidates, 2,124, achieved this same score, so that the actual
number who received a passing grade was 13,749.
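The way the passing grade was fixed, and the reason ties pushed the number of passers above the target, can be illustrated with a short sketch. The function names and the toy score list are hypothetical; only the figures of 4,000 hires and one successful candidate per three passers come from the opinion.

```python
def required_passers(hires_needed, passers_per_hire):
    """Number of candidates who must pass to yield the needed hires."""
    return hires_needed * passers_per_hire


def cutoff_and_pass_count(scores, target_passers):
    """Set the cutoff at the score of the target-th highest scorer.

    Everyone tied at the cutoff also passes, so the pass count can
    exceed the target.
    """
    ranked = sorted(scores, reverse=True)
    cutoff = ranked[min(target_passers, len(ranked)) - 1]
    passed = sum(1 for s in scores if s >= cutoff)
    return cutoff, passed


# Figures from the opinion: 4,000 hires, one in three passers completes hiring.
print(required_passers(4000, 3))  # 12000

# Hypothetical toy data showing the tie effect on a small scale.
toy_scores = [98, 97, 97, 95, 94, 94, 94, 93, 90]
toy_target = required_passers(2, 3)  # 6 passers wanted
print(cutoff_and_pass_count(toy_scores, toy_target))  # (94, 7)
```

In the toy data the sixth-highest score is 94, but three candidates share it, so seven pass rather than six. The same mechanism, on a larger scale, turned a target of 12,000 into 13,749 passers.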
11
Of the 36,797 applicants who took the test, 6,142 identified themselves as
Black, 5,239 identified themselves as Hispanic, 19,798 identified themselves as
White, and 4,847, or 13.2%, did not identify their race. Thus, identified Blacks
constituted 16.7% of the total applicants, and 19.7% of all those who identified
their race, while the equivalent percentages for identified Hispanics were
14.2% and 16.8%, and for identified Whites, 53.8% and 64.5%.
12
Of those who passed the exam, i. e., scored 94 or better, 7.6% had identified
themselves as Black and 7.8% had identified themselves as Hispanic, for a
known minority population in the passing group of 15.4%, against a known
minority population in the applicant pool of 30.9%. In contrast, 66.6% of the
passing applicants had identified themselves as White, although Whites
comprised only 53.8% of the applicant pool.
13
Viewed in another and more revealing way, the figures show that, among those
who had identified themselves by race, the passing rate for Whites was 45.9%
compared to 17% for Blacks and 20.5% for Hispanics. The combined minority
pass rate was thus about two-fifths of the pass rate for Whites.
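The "about two-fifths" figure can be checked roughly from numbers stated in the opinion. The sketch below assumes that the combined minority pass rate is the applicant-weighted average of the Black and Hispanic pass rates, which is an assumption about the method; the opinion does not show the computation, and the variable names are illustrative.

```python
# Figures reported in the opinion.
APPLICANTS = {"Black": 6142, "Hispanic": 5239}
PASS_RATE = {"White": 0.459, "Black": 0.17, "Hispanic": 0.205}


def impact_ratio(group_rate, reference_rate):
    """Ratio of a group's pass rate to the reference group's pass rate."""
    return group_rate / reference_rate


# Estimated passers in each minority group, then the pooled pass rate.
minority_passers = sum(APPLICANTS[g] * PASS_RATE[g] for g in APPLICANTS)
combined_rate = minority_passers / sum(APPLICANTS.values())
combined_ratio = impact_ratio(combined_rate, PASS_RATE["White"])

print(f"combined minority pass rate: {combined_rate:.1%}")
print(f"ratio to White pass rate:    {combined_ratio:.2f}")
```

Because the published rates are rounded, the result is only approximate, but it lands close to two-fifths.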
14
The Police Department accepted 415 candidates from the list in November,
1979 and planned to hire another 380 in January, 1980. Of this group of 795
candidates, 89.2% were White, 3.5% were Black, and 6.8% were Hispanic. The
selection rates (number chosen compared to number of applicants) for these
first two uses of the list were 0.5% for Blacks, 1% for Hispanics, and 3.6% for
Whites.
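The selection rates follow from the composition of the 795 accepted candidates and the applicant counts given earlier. A minimal sketch, assuming that each group's number chosen is its reported share of the 795 (the opinion gives percentages rather than head counts, so the counts here are estimates):

```python
ACCEPTED = 795  # 415 accepted in November 1979 plus 380 planned for January 1980

# Reported composition of the accepted group and applicant counts by group.
SHARE_OF_ACCEPTED = {"Black": 0.035, "Hispanic": 0.068, "White": 0.892}
APPLICANTS = {"Black": 6142, "Hispanic": 5239, "White": 19798}


def selection_rate(group):
    """Estimated number chosen divided by number of applicants in the group."""
    chosen = SHARE_OF_ACCEPTED[group] * ACCEPTED
    return chosen / APPLICANTS[group]


rates = {group: selection_rate(group) for group in APPLICANTS}
for group, rate in rates.items():
    print(f"{group}: {rate:.1%}")
```

Rounded to one decimal place, these estimates agree with the 0.5%, 1%, and 3.6% stated in the opinion.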
The plaintiffs filed their complaint in this suit, together with a motion for a
preliminary injunction, in October, 1979, before any candidates had been
accepted on the basis of the list. They charged that the intended use of the lists
by the Police Department constituted discrimination against Blacks and
Hispanics in violation of the Fourteenth Amendment, Title VII, and various
other Federal and state laws. The District Court, by consent of the parties,
consolidated the hearing on the preliminary injunction and the trial on the
merits. This proceeding was held on November 13, 14, and 15, 1979, shortly
after the Police Department's first use of the list. On January 11, 1980, three
days before the Department intended to use the list to accept a second group of
trainees, the District Court held a second hearing. That same day, the Court
issued an opinion, which was subsequently re-issued in revised form on
January 23.3
17
The Court's basic conclusion was that Exam No. 8155 violated Title VII. In
reaching this conclusion, the Court used the common mode of Title VII
analysis, in which the plaintiff is first required to establish a prima facie case on
the basis of disparate impact, and then the defendant is required to rebut the
plaintiff's case by proving that the disparity results from legitimate, job-related
selection procedures. The Court first found that the disparity between the
percentage of minority group members who achieved a passing score and the
percentage of minority group members in the applicant pool was sufficient to
establish a prima facie case. It based this finding of disparate impact on the
standards developed by the Supreme Court in Castaneda v. Partida, 430 U.S.
482, 97 S.Ct. 1272, 51 L.Ed.2d 498 (1977), and by the Equal Employment
Opportunity Commission (EEOC) in its Uniform Guidelines on Employee
Selection Procedures, 29 C.F.R. § 1607 (1979). Castaneda stated that, in cases
involving large samples, "if the difference between the expected value (from a
random selection) and the observed number is greater than two or three
standard deviations," a prima facie case is established. 430 U.S. at 496 n.17, 97
S.Ct. at 1281 n.17. 4 The Uniform Guidelines provide that "(a) selection rate for
any race, sex, or ethnic group which is less than four-fifths (4/5) (or eighty
percent) of the rate for the group with the highest rate will generally be
regarded by the Federal enforcement agencies as evidence of adverse impact."
§ 1607.4(D) (hereinafter Guideline sections are cited only by the subdivisions
of 29 C.F.R. § 1607). The District Court then noted that the discrepancy
between the percentage of minority group members in the applicant pool and
the percentage of minority group members who passed the test was 39 standard
deviations. The evidence also showed that the passing rate of the minority
group members was 44.3% of the passing rate of Whites, or about two-fifths.
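The Castaneda standard-deviation measure can be reproduced approximately from figures reported in the opinion. The sketch assumes a binomial model in which each of the 13,749 passers is drawn at random from a pool whose known minority share is 30.9%, and compares the expected number of minority passers with the 15.4% observed. The District Court's exact inputs are not set out in the opinion, so this is an approximation and the function name is illustrative.

```python
import math


def standard_deviations(n_selected, pool_share, observed_share):
    """Gap between expected and observed counts, in binomial standard deviations."""
    expected = n_selected * pool_share
    observed = n_selected * observed_share
    sd = math.sqrt(n_selected * pool_share * (1 - pool_share))
    return (expected - observed) / sd


# 13,749 passers; 30.9% known minority share of applicants; 15.4% of passers.
z = standard_deviations(13749, 0.309, 0.154)
print(f"disparity: about {z:.1f} standard deviations")
```

Under these assumptions the disparity comes out at roughly 39 standard deviations, in line with the District Court's figure and far beyond the "two or three" that Castaneda treats as significant.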
18
Having concluded that the plaintiffs had established a prima facie case, the
Court next concluded that the test was not sufficiently valid to constitute a
legitimate attempt to choose those applicants who would become better police
officers. Relying primarily on the EEOC Guidelines, the Court stated that a
determination of validity based on the content of the test would be
inappropriate, first because the test purported to measure abilities that the
accepted applicants would be trained to acquire, see Guidelines, § 14(C)(1), and
second, because the test actually measured constructs, not abilities, see id.5
Moreover, the Court concluded that the job analysis was not sufficiently precise
to satisfy the Guideline requirement even for content validation, see Guidelines
§ 14(C)(2). Since content validation was the only method of validation which
the City attempted, the Court concluded that the test was invalid, and thus an
inadequate rebuttal to the plaintiffs' prima facie case.
19
The City moved to stay the District Court's order pending consideration of its
appeal from the decision. This Court denied the motion. However, we granted a
conditional stay, in view of the City's declared need to hire new police officers6
and set an expedited schedule for the appeal.
As the District Court concluded, the accepted procedure for Title VII cases is to
require the plaintiffs to establish a prima facie case, and then to require the
defendants to rebut this showing with proof that the test was legitimately job-related. See Albemarle Paper Co. v. Moody, 422 U.S. 405, 95 S.Ct. 2362, 45
L.Ed.2d 280 (1975); McDonnell Douglas Corp. v. Green, 411 U.S. 792, 93
S.Ct. 1817, 36 L.Ed.2d 668 (1973); Griggs v. Duke Power Co., 401 U.S. 424,
91 S.Ct. 849, 28 L.Ed.2d 158 (1971). The Court correctly concluded that a
prima facie case had been established. By any reasonable measure, including
the standard deviation rule of Castaneda, supra, or the four-fifths rule of the
EEOC Guidelines, Exam No. 8155 had a disparate racial impact.
22
The City argues that statistics alone, specifically a comparison of the racial
composition of the passing group to that of the applicant group, are not
sufficient to establish a prima facie case. But statistics showing a significantly
disparate racial impact have consistently been held to create a presumption of
The City also claims that finding a prima facie Title VII violation by state or
local governments without a showing of discriminatory intent violates the
Tenth Amendment. This view has been definitively rejected by the Seventh
Circuit in United States v. City of Chicago, 573 F.2d 416, 422-24 (7th Cir.
1978), and we agree with that analysis. Congress may enforce the Fourteenth
Amendment by legislation that prohibits practices the Amendment might not of
its own force condemn. See Katzenbach v. Morgan, 384 U.S. 641, 86 S.Ct.
1717, 16 L.Ed.2d 828 (1966).
24
The real issue in this case, therefore, is whether the defendants have rebutted
the plaintiffs' prima facie case by proving that their test was job-related: that the
test accurately selected applicants who would be better police officers.
Adjudication of this issue presents a more complex problem in the present case
than it has in many previous Title VII suits. Many of the previous suits involved
tests that were so artlessly constructed that they could be judged invalid without
extensive inquiry, fine distinctions, or a precise notion of where the line
between validity and invalidity was located. See, e. g., Griggs, supra, 401 U.S.
at 431, 91 S.Ct. at 853 (intelligence tests used "on the Company's judgment that
they generally would improve the overall quality of the work force"); United
States v. N. L. Industries, Inc., 479 F.2d 354, 371 (8th Cir. 1973) (test given to
one applicant "consisted of four or five mathematical problems which a
Company employee jotted down on a sheet of yellow paper"); Brito v. Zia Co.,
478 F.2d 1200, 1205-06 (10th Cir. 1973) (test based almost entirely on
subjective judgments of supervisors, not administered or scored under
controlled and standardized conditions); Vulcan Society, supra, 490 F.2d at
396-98 (no job analysis, test measured abilities that were clearly of secondary
importance to job).
25
Nevertheless the plaintiffs have alleged and the District Court has concluded
that the construction and use of Exam No. 8155 failed in several respects to
meet test validity standards, particularly those specified in the Guidelines.
Whether or not these deficiencies are fatal, they are plainly more substantial
than the defects deemed not to defeat validity in prior cases. Detroit Police
Officers Association v. Young, 446 F.Supp. 979, 990-91, 1007-08
(E.D.Mich.1978); Bridgeport Guardians v. Bridgeport Police Department, 431
F.Supp. 931 (D.Conn.1977); cf. Washington v. Davis, 426 U.S. 229, 248-52, 96
S.Ct. 2040, 2051-53, 48 L.Ed.2d 597 (1976) (Fourteenth Amendment case
involving some Title VII concepts). Consequently, assessment of Exam No.
8155 necessarily carries this Court into difficult areas of judging test validity.
We must determine, with some care, what the general standards are for judging
validity, and how these standards are to be applied in a specific factual
situation.
27
28
The need to modify rigid technical conclusions from the field of testing is
indicated by the view of certain testing experts, including those who testified
for the plaintiffs in this case, that there is no test that can be considered
completely valid to select candidates for any but the most rudimentary tasks. If
this view guided interpretation of Title VII, then at the current stage of the
technology of testing, no test that produces a disparate racial impact could be
used for positions such as police officers.
29
language nor by judicial precedent. Had Congress felt that testing for virtually
all employment was invalid, it would not have made a specific exception to
Title VII for the proper use of professionally designed employment tests.8
Clearly, Congress did not intend that the standard used in interpreting Title VII
would reject every test with a disparate racial impact. Nor have the courts
permitted this to occur; although they have subjected employment testing to the
careful scrutiny required by the statute, they have found a variety of tests to be
valid, despite a disparate racial impact. See, e. g., Sims v. Sheet Metal Workers,
Local 65, 489 F.2d 1023, 1025-26 (6th Cir. 1973); Detroit Police Officers
Association v. Young, supra, 446 F.Supp. at 1007-08; Friend v. Leidinger, 446
F.Supp. 361 (E.D.Va.1977), aff'd, 588 F.2d 61 (4th Cir. 1978); United States v.
South Carolina, 445 F.Supp. 1094 (D.S.C.1977), aff'd, 434 U.S. 1026, 98 S.Ct.
756, 54 L.Ed.2d 777 (1978); Bridgeport Guardians v. Bridgeport Police
Department, supra, 431 F.Supp. at 936-39; Jackson v. Nassau County Civil
Service Commission, 424 F.Supp. 1162 (E.D.N.Y.1976); Buckner v. Goodyear
Tire & Rubber Co., 339 F.Supp. 1108, 1113-16 (N.D.Ala.1972), aff'd, 476 F.2d
1287 (5th Cir. 1973).
30
The danger of too rigid an application of technical testing principles is that tests
for all but the most mundane tasks would lack sufficient validity to permit their
use. At least that is the risk given the current state of the art of employment
testing. This risk can be appreciated by considering the one example even most
test critics acknowledge to have substantial, though not complete, validity. This
is a typing test given to a group of applicants for jobs as typists.9 Such a test
substantially meets all the criteria suggested by plaintiffs' experts for content
validation, but the very success of this test casts doubt on the usefulness of the
example. To begin with, typing is a task that readily yields to quantitative
measurement. The quality of a typist's job performance depends on two factors,
both of which can be captured with precision in numbers: how fast he types,
and how many errors he commits. Most jobs involve tasks whose performance
can be evaluated only in the more subjective light of judgment. Surely this is
true of nearly all the tasks required to be performed by police officers. In
addition, there is a more basic problem with the typing test example. Typing is
one of the few activities that a test-taker can perform in virtually the same
manner as he will be required to perform on the job. That is obviously an ideal
testing situation, but it is not one that is frequently available, and such "on-the-job" testing could not possibly be done to select police officers. Yet the force of
the typing test example easily leads to one of the conclusions of the District
Court in this case: that Exam No. 8155 lacked validity because it measured
performance in an artificial classroom setting and did not necessarily indicate
who would perform well on the job. 10
31
32
33
A second legal basis for following the Guidelines is that they represent the
"administrative interpretation of the Act by the enforcing agency," and are
"entitled to great deference" on that basis. Griggs, supra, 401 U.S. at 433-34, 91
S.Ct. at 854-55; see Albemarle, supra, 422 U.S. at 431, 95 S.Ct. at 2378.
However, the Court has also recognized that the Guidelines "are not
With these considerations in mind, we turn to the validity of Exam No. 8155.
36
Developing such data is difficult, and tests for which it is required have
frequently been declared invalid.12 As a result, a conclusion that construct
validation is required would often decide a case against a test-maker, once a
disparate racial impact has been demonstrated.
37
38
39
The origin of this dilemma is not any inherent defect in testing, but rather the
Guidelines' definition of "content." This definition makes too sharp a distinction
between "content" and "construct," while at the same time blurring the
distinction between the two components of "content" : knowledge and ability.
The knowledge covered by the concept of "content" generally means factual
information. The abilities refer to a person's capacity to carry out a particular
function, once the necessary information is supplied. Unless the ability requires
virtually no thinking, the "ability" aspect of "content" is not closely related to
the "knowledge" aspect of "content"; instead it bears a closer relationship to a
"construct." Some researchers regard content tests as nothing more than
assessments of particular kinds of constructs, e. g., Tenopyr, Content-Construct
Confusion, 30 Personnel Psych. 47 (1977); others regard any ability that is
evidenced by observable behavior as sufficiently non-inferential to be
considered content, see Ebel, Comments on Some Problems of Employment
Recognition that abilities and constructs are not entirely distinct leads to a
conclusion that a validation technique for purposes of determining Title VII
compliance can best be selected by a functional approach that focuses on the
nature of the job. The crucial question under Title VII is job relatedness: whether or not the abilities being tested for are those that can be determined by
direct, verifiable observation to be required or desirable for the job. See Griggs,
supra, 401 U.S. at 431, 91 S.Ct. at 853; Vulcan Society, supra, 490 F.2d at 394-95; Chance, supra, 458 F.2d at 1177. If the job in question involves primarily
abilities that are somewhat abstract, content validation should not be rejected
simply because these abilities could be categorized as constructs. However, if
the test attempts to measure general qualities such as intelligence or
commonsense, which are no more relevant to the job in question than to any
other job, then insistence on the rigorous standards of construct validation is
needed. Since tests of this kind are often biased in favor of a person's familiarity
with the dominant culture, permitting them to be used without a showing of
predictive validity would perpetuate the effects of prior discrimination. But as
long as the abilities that the test attempts to measure are no more abstract than
necessary, that is, as long as they are the most observable abilities of
significance to the particular job in question, content validation should be
available. To lessen the risks of perpetuating cultural disadvantages, the degree
to which content validation must be demonstrated should increase as the
abilities tested for become more abstract.
41
This functional approach, which adjusts the distinction between content and
construct to the nature of the job being tested for, expands the opportunity for
both employers and courts to rely on content validation. It also avoids making a
threshold choice between content and construct validation based solely on the
nature of the quality tested for unrelated to the job, a choice that might make
content validation seem inappropriate. To base the content-construct
determination on the nature of the job, it is necessary first to analyze the job to
see if it requires abilities appropriate for content validation. Instead of choosing
between content and construct validation at the outset, as the Guidelines seem
to require, employers and courts can start the content validation inquiry and use
Just as lessening the severity of the Guidelines' distinction between content and
construct reduces the likelihood that a test is invalid because it measures
constructs, so sharpening the distinction between knowledge and ability, now
obscured by the Guidelines, reduces the problem that the test is invalid because
it duplicates the training period, i. e., tests for what will later be learned. Unlike
knowledge, some abilities are appropriate for testing confirmed by content
validation despite their overlap with post-selection training. A valid
measurement of some abilities can select applicants who will ultimately use
their training to perform their tasks more effectively or who will more
effectively perform similar tasks for which they have not been specifically
trained. On the other hand, content validation remains inappropriate for tests
that measure knowledge of factual information if that knowledge will be fully
acquired in a training program. Approval of such tests, without predictive
validation, risks favoring applicants with prior exposure to the information, a
course likely to discriminate against a disadvantaged minority. For example, it
would be duplicative of the Police Department's training program, and thus
invalid, to test applicants for their knowledge of the Department's arrest form.
Testing for their ability to fill out the form, however, can be expected to select
applicants who can be successfully trained to perform well at that task and
others like it.
43
measurement. Though all three abilities involve some inference about mental
processes, they are based on observable behaviors and are far less abstract than
such traits as intelligence, leadership, or judgment. Moreover, testing for these
three abilities sufficiently avoids the objection that the test duplicates the
Department's training program. Though all three abilities can be trained to
some extent, the test-makers were entitled to select applicants with existing
ability so that training would both enhance their abilities and prepare them for
other tasks requiring similar talents. The vice of testing for knowledge readily
taught in the training program was totally avoided.
B. Assessing the Content Validity of Exam No. 8155
44
45
precision. Some are complete and unambiguous, such as "1. Checks the
condition of personal and department equipment such as radio, patrol car,
weapons, etc."; "35. Attends training sessions." Others are more open-ended,
but do manage to fulfill their function by defining the behaviors associated with
the task, such as "3. Performs foot patrol"; "40. Controls various types of
crowds." Still others are so vague that they communicate very little real
information, such as "10. Interacts with juveniles in non-arrest situations"; "39.
Performs duties in hostage situations."
47
48
The second part of the Guideline standard for a job analysis requires
determination of the relative importance of the identified work behaviors. The
City performed this function by means of an extensively distributed
questionnaire, specifying the criteria to be used in ranking the 42 tasks (Job
Analysis, Step 3). The process as a whole appears to be reasonably accurate,
and neither the plaintiffs nor the District Court raised any serious objection to
it.
49
50
The plaintiffs criticize the required abilities identified by the City for being
undefined. But the type of definition suggested by the Guidelines (one that
describes the abilities in terms of "observable behaviors and outcomes,"
Guidelines § 15(C)(3)) seems repetitive, since the work behaviors are already
defined in this way. The five identified abilities, with the possible exception of
"human relations, including communication techniques," are comprehensible
enough. Their appropriateness for measurement would have been considerably
clearer, however, if each panel had explained which tasks required which
With a job analysis of questionable sufficiency, the City then proceeded to the
test construction stage. As an initial matter, we note that Exam No. 8155 was
developed "in-house," by staff members of New York City's Police Department
and Department of Personnel; there was little input from any outside source,
and no participation by anyone specializing in test preparation. Of course, the
law should not be designed to subsidize specialists. But employment testing is a
task of sufficient difficulty to suggest that an employer dispenses with expert
assistance at his peril. Certainly, the decision to forgo such assistance should
require a Court to give the resulting test careful scrutiny. See Kirkland, supra,
520 F.2d at 425-26; Vulcan Society, supra, 490 F.2d at 395-96.
52
While the determination of how many questions should be included for each
identified ability was made by a fairly careful numerical analysis (Test
Construction, Step 1), the process of writing the questions themselves was
rather haphazard. The questions were initially framed by police officers, who
may have had expertise in identifying tasks involved in their job but were
amateurs in the art of test construction. In addition, the officers did not have
access to the job analysis material during much of the process. Finally, the
questions, although they were reviewed, were not tested on a sample
population. To be sure, a complete determination of the questions' accuracy in
measuring the identified abilities would be equivalent in its complexity to a
criterion-related study. But the City did not even perform the minimal sample
testing to ensure that the questions were comprehensible and unambiguous.
53
Not surprisingly, the test construction process did not fully succeed in meeting
even its own goal of testing for all the identified abilities. As previously
indicated, Exam No. 8155 does appear to test for the three identified abilities of
remembering details, filling out forms, and applying general principles to
specific facts. However, the fourth identified ability, human relations skill,
proved more troublesome. In deciding how to test for this ability, the City faced
a dilemma inherent in testing for all but the most mundane jobs. To be fully
representative of the job, a test should measure all the significant abilities
needed for successful job performance, yet some abilities, especially in jobs of
any complexity, are far along the construct end of the content-construct
continuum where successful validation is difficult. If a test tries to be
representative and measure all significant abilities, including those that are
clearly constructs, it risks the use of inadequate assessment devices, because the
rigorous standard for construct validation will rarely be met. On the other hand,
if the test-makers acknowledge the difficulty of satisfactorily measuring
constructs and test only for those abilities that are appropriate for content
validation, they encounter the objection that the test is not sufficiently
representative of the job.
54
55
Assessing human relations skill will always be a difficult enterprise, but the
deficiency of the City's attempt does not mean that a content validation
approach is necessarily impermissible nor impossible to achieve. As indicated
above, at least within the middle range of the content-construct continuum, the
distinction between content and construct should be determined functionally, in
relation to the job. If the quality measured is not unduly abstract, and if it
constitutes a significant aspect of the job, content validation of the test
component used to measure that quality should be permitted. But that
component must be designed in an extremely careful way. Test-makers will be
well advised to obtain highly qualified assistance in constructing this portion of
an exam.
56
One desirable approach would be to confront applicants with simulated real life
situations and assess the appropriateness of their volunteered response. See
Firefighters Institute for Racial Equality v. City of St. Louis, 616 F.2d 350 (8th
Cir. 1980). That technique is normally too costly for large numbers of
applicants, but might have usefulness as a testing device to be used toward the
end of the overall selection procedure, after an initially large group of
applicants has been narrowed down by the results of a written exam and a
background check. If the test component is limited to traditional pencil and
paper methods, it may be preferable to forgo any pretense of being able to
make fine differentiation among candidates' human relations skills and instead
adopt a pass/fail approach, rejecting those whose demonstrably inappropriate
responses to human relations questions mark them as unsuitable for police
work. Another possibility is to recognize that questions in this area for which
only one answer is correct are likely to be too easy, as were most of the
questions on Exam No. 8155, and therefore of little use in making selections
from among applicants. Instead questions can be designed for which some
answers are appropriate responses and others are inappropriate. As feasible
techniques in this area evolve, employers will be expected to use them.
57
With these strengths and weaknesses of the job analysis and the test
construction in mind, we now consider how well the test, as constructed and
used, met the basic requirements of content validity.
The central requirement of Title VII, relationship of test content to job content,
was sufficiently satisfied by Exam No. 8155. The job analysis procedure
provides adequate assurance that the identified tasks are in fact the tasks that a
police officer performs. While the procedure for identifying the abilities
required for those tasks was less satisfactory, the three abilities that were
actually tested for appear adequately related to most of the identified tasks. The
list of tasks confirms one's intuitive assumption that police officers are required
to fill out forms (see, e. g., "16. Processes arrests using appropriate police
department forms and notifications"), to remember facts (see, e. g., "18. Gives
testimony in court (oral and written)"), and to apply general principles to
specific fact situations (see, e. g., "26. Executes warrants.").
59
Moreover, these abilities are among the most concrete ones that can be derived
from the list; they are certainly more concrete than human relations skills,
which the test purported to measure, but did not. Two of the abilities tested for,
filling out forms and remembering facts, are as specifically stated as they could
be without resort to trivial distinctions about particular kinds of forms and facts.
The ability to apply general standards is somewhat more problematical, since it
is a relatively abstract skill that is relevant to many jobs. However, if there is
any job for which ability in applying and following rules is an especially
The second requirement established by the Guidelines is that the test must be a
"representative sample of the content of the job." As presented by the
Guidelines, this representativeness requirement has two different meanings. The
first is that the content of the test must be representative of the content of the
job; the second is that the procedure, or methodology, of the test must be
similar to the procedures required by the job itself. The Guidelines express this
dual requirement in the following somewhat inscrutable language: "For any
selection procedure measuring a knowledge, skill, or ability the user should
show that (a) the selection procedure measures and is a representative sample
of that knowledge, skill, or ability. . . ." Guidelines 14(C)(4) (emphasis
added).
61
62
63
at least those for which appropriate measurement is feasible, but not that it
measure all aspects, regardless of significance, in their exact proportions. The
reason for a requirement that the test's procedure be representative is to prevent
distorting effects that go beyond the inherent distortions present in any
measuring instrument. For example, although all pencil and paper tests are
dependent on reading, even if many aspects of the job are not, the reading level
of the test should not be pointlessly high. Similarly, the instructions should not
be overly complex, and the exam should not place candidates under excessive
time pressure unless such time pressure is an identifiable aspect of the job.
64
65
Similarly, the procedure that the test employed was not needlessly
unrepresentative of the job itself. In electing to use a pencil and paper test, the
City did not forgo any readily available and realistically feasible alternative
procedure that would have been more representative of the job. Moreover, the
risks of using a written test were substantially minimized. The reading level
necessary to understand the questions was in some cases equal to, but generally
well below, the training materials used in the Police Academy. The instructions
were clear enough, and employed an ordinary four-answer, multiple-choice
format, perhaps the most familiar standardized test technique. In addition,
ample time was allowed for taking the exam, thereby avoiding an unnecessarily
pressured situation.
66
Thus the exam is adequately related to the content of the police officer's job,
and adequately representative. The combined effect of this assessment might
support a conclusion that the exam as a whole has content validity, though it
would be a close question whether a test with the disparate racial impact of this
one can be validated when its development departs in some significant respects
even from reasonably attainable requirements of the Guidelines. However, even
if the construction of the exam passes muster, the way in which it was used to
distinguish among candidates seriously departs from the third requirement for
content validity and defeats any claim of validity for a testing process that
produces disparate racial results.
Essentially, the City used the results of the exam to compile a rank-ordering of
all the applicants, and then selected a passing score sufficient to generate the
required number of potential trainees. Neither the rank-ordering nor the passing
score conforms to even the most minimal standards for these two devices.
68
69
This close scrutiny is required because rank-ordering makes such a refined use
of the test's basic power to distinguish between those who are qualified to
perform the job and those who are not. If a test is content valid, it may be
reasonable to infer that the test scores make some useful gross distinctions
between candidates. Candidates with high scores may well be expected to
perform the job better than candidates with low scores. See Science Research
Associates, Validation: Procedures and Results (1972) (use of criterion "tails"
identifying best and worst candidates more justifiable than continuous rating).
And it may even be that within some range of scores, some incremental
improvements in scores show some positive correlation with improvements in
job performance. But neither of these propositions provides confidence for
inferring that one-point increments among those who took Exam No. 8155 are
a valid basis for making job-related hiring decisions, especially in the range of
scores between 94 and 100. The reason such a precise inference cannot be so
readily drawn is that content validity is not an all or nothing matter; it comes in
degrees. A test may have enough validity for making gross distinctions
between those qualified and unqualified for a job, yet may be totally inadequate
to yield passing grades that show positive correlation with job performance.
70
Overlooking this point, the City earnestly contends that if the appropriate
abilities were tested for, it makes eminent sense to select candidates strictly on
the basis of ranked scores, even to the extent of concluding that a candidate
scoring 98 will perform better as a police officer than a candidate scoring 97.
The frequency with which such one-point differentials are used for important
decisions in our society, both in academic assessment and civil service
employment, should not obscure their equally frequent lack of demonstrated
significance. Rank-ordering satisfies a felt need for objectivity, but it does not
necessarily select better job performers. In some circumstances the virtues of
objectivity may justify the inherent artificiality of the substantively deficient
distinctions being made. But when test scores have a disparate racial impact, an
employer violates Title VII if he uses them in ways that lack significant
relationship to job performance.
71
72
In addition to inadequate demonstration of validity, the test may not be used for
rank-ordered selections because of the total absence of any evidence that the
exam possessed another vital feature: reliability, that is, the extent to which the
exam would produce consistent results if applicants repeatedly took it or
similar tests. Of course, there is no expectation that applicants will take any
given test more than once. But if an exam lacks reliability to such an extent that
results would be significantly inconsistent if the same applicants were to take it
again, that is an important indication that the test is not especially useful in
measuring their abilities. Although not explicitly mentioned in the Guidelines,
reliability is prominently identified in the APA Standards (to which the
Guidelines refer in 5(C)) to be as basic for evaluating an exam as validity
itself. See APA Standards, supra at 48-55. Like content validity, reliability is
not an all or nothing matter. It too comes in degrees. What is required is not
perfect reliability, but rather a sufficient degree of reliability to justify the use
being made of the test results. Without some substantial demonstration of
reliability it is wholly unwarranted to make hiring decisions, with a disparate
racial impact, for thousands of applicants that turn on one-point distinctions
among their passing grades.
74
Two aspects of reliability deserve consideration in assessing the use of rank-ordering. The first is the quality of the exam questions. The more skillfully they
have been formulated, the more likely it is that results on one question will
correlate with results on the other questions and that successive test scores
would be consistent. This will avoid the tendency of scores to vary because of
extraneous factors such as test administration. Whether this aspect of reliability
has been achieved to an extent sufficient to justify rank-ordering need not be
left to general consideration of the quality of the test construction process. A
basic demonstration of this aspect of reliability can easily be made by the test-maker, before the test is administered to job applicants. The test-maker can pre-test his exam by giving it twice to a sample of persons generally approximating
the characteristics of the population where the test is expected to be used for
employment selection. To avoid distortion due to recollection, the test given at
a later date to such a sample can use similar but not identical questions. Another
somewhat useful indicator of reliability is a technique known as a split-half
correlation: dividing each component of the test into equal halves and observing
how consistent an individual's scores on each half are.17 This technique can
also be used in the process of pre-testing the exam, before it is administered to
job applicants. The technique also is easily used on actual test results to provide
some minimal evidence of reliability. In this case the City offered no evidence
to demonstrate the quality of the questions used in Exam No. 8155.
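
The split-half technique described above can be sketched in a few lines. This is only an illustration under stated assumptions: the odd/even division of items, the function names, and the Spearman-Brown step-up (a standard correction for halving the test length that the opinion does not mention) are all hypothetical choices, not anything the City or the Court is shown to have used.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists.

    Raises ZeroDivisionError if either list has no variance.
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def split_half_reliability(responses):
    """responses: one list of 0/1 item scores per candidate.

    Assumption: the two "equal halves" are the odd- and even-numbered items.
    Returns (half-test correlation, Spearman-Brown estimate for the full test).
    """
    first_half = [sum(r[0::2]) for r in responses]
    second_half = [sum(r[1::2]) for r in responses]
    r_half = pearson(first_half, second_half)
    r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown step-up (assumed)
    return r_half, r_full


# Hypothetical four-item answers for three candidates.
sample = [[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 0, 0]]
r_half, r_full = split_half_reliability(sample)
```

A correlation near 1 suggests the two halves rank candidates consistently; a low value would be the kind of evidence of unreliability the opinion says was never examined here.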
75
The second aspect of reliability concerns what testing experts call the error of
measurement. See generally, H. Gulliksen, Theory of Mental Tests (1950).
This is a statistical phenomenon indicating the degree to which scores on
successive tests will be subject to inevitable random variation, no matter how
carefully the test-makers have eliminated or at least lessened the effects of
extraneous factors within their control. The error of measurement can be
calculated by use of the standard deviation concept. For any test, regardless of
how carefully it was prepared, statistical analysis, based on the normal
distribution curve, shows that there is a 68% probability that successive scores
would fall within a range of one standard deviation from an actual score and a
95% probability (generally a satisfactory confidence level) that successive
scores would fall within a range of two standard deviations from the actual
score. It is also possible to estimate, again for any test, how many raw score
points above and below the applicant's actual score are within the range of one
or more standard deviations. This calculation, as explained in the margin,18
depends upon the applicant's score and the number of items on the test. Thus,
though the test-maker can never eliminate the error of measurement, he can
minimize its effect for all scores by increasing the number of questions.
76
77
The inevitable error of measurement for a test consisting of 100 items, like
Exam No. 8155, has significance in assessing the use of rank-ordering. At the
passing score of 94, one standard deviation is equivalent to a range of 2.4
points above and below 94. The range narrows as actual scores approach 100.
At 97, for example, the range is plus or minus 1.7. Thus, to have 95%
confidence that an applicant's grade has statistical reliability, grades within two
standard deviations of his grade should theoretically be treated as equivalent to
his grade, for in fact there is a 95% likelihood that each applicant at each grade
would score within such a range on successive takings of equivalent tests. This
means that the range in which a satisfactory confidence level is achieved for an
applicant who scores 94 lies between 89 and 99, and even for one who scored
97, the range extends from 94 to 100. Care must be taken not to over-emphasize
the significance of the error of measurement. Though grounded on sound
principles of statistics, it remains an estimate, and it need not prevent the usual
use of test scores that do not have a disparate racial impact. At a minimum,
however, it should serve to illustrate the risks of making hiring decisions turn
on one-point increments at scores where even a single standard deviation
covers a raw score range greater than one point.
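
The figures in this paragraph can be reproduced with a short calculation. The formula in the margin note is not set out in this part of the opinion, so the sketch below assumes a binomial estimate of the error of measurement, sqrt(x(n - x)/(n - 1)), where x is the raw score and n the number of items. That choice is an assumption; it is used here only because it depends on the same two inputs the opinion names and yields the same numbers.

```python
from math import sqrt


def error_of_measurement(score, n_items):
    """Assumed binomial estimate of one standard deviation, in raw points."""
    return sqrt(score * (n_items - score) / (n_items - 1))


def confidence_band(score, n_items, n_sd=2):
    """Whole-point range within n_sd standard deviations of a score.

    The range is clipped to the possible scores 0..n_items.
    """
    half_width = n_sd * error_of_measurement(score, n_items)
    low = max(0, round(score - half_width))
    high = min(n_items, round(score + half_width))
    return low, high


sd_94 = round(error_of_measurement(94, 100), 1)   # 2.4
sd_97 = round(error_of_measurement(97, 100), 1)   # 1.7
band_94 = confidence_band(94, 100)                # (89, 99)
band_97 = confidence_band(97, 100)                # (94, 100)
```

On these assumptions, a candidate at the cutoff of 94 and a candidate at 97 have overlapping two-standard-deviation bands, which is the point the paragraph makes about one-point distinctions.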
78
The most serious implication of error of measurement for Exam No. 8155 arises
from the extraordinary extent to which high test scores were closely bunched.19
Each score from 94 to 97 was achieved by over 2,000 applicants.20 If the test
questions had sufficient differentiating power to produce a somewhat even
distribution of scores, or at least to avoid excessive bunching among the high
scores, the error of measurement would not have affected the ultimate selection
of such a significant portion of the applicants. But when 8,928 applicants, two-thirds of all who passed, are bunched between 94 and 97, the error of
measurement makes the use of rank-ordering an extremely unreliable basis for
hiring decisions.
79
If test scores produce disparate racial results, an employer who wants to use
rank-ordering of the scores for hiring decisions faces a substantial task in
demonstrating that rank-ordering is sufficiently justified to be used. But the task
is by no means impossible. Even without resorting to a criterion-related study,
the test-maker still has several ways to increase the justification for rank-ordering sufficiently to use it. First, he can conduct a job analysis and construct
the test with a high degree of adherence to Guideline requirements. That would
produce a much stronger showing of content validation than the City was able
to demonstrate in this case. Even content validity sufficient for rank-ordering
does not require literal compliance with every aspect of the Guidelines. But
Alternatively, the employer can acknowledge his inability to justify rank-ordering and resort to random selection from within either the entire group that
achieves a properly determined passing score, or some segment of the passing
group shown to be appropriate. The City itself, perhaps unwittingly, has
acknowledged the reasonableness of this second alternative. Since each of the
scores between 94 and 97 was achieved by more than 2,000 candidates, and
since each training class can accommodate slightly more than 400 candidates,
the test scores provide no basis for selecting from among candidates at each of
these scoring levels. At oral argument, the City acknowledged that random
selection would be used; for example, if all candidates scoring 98 or above have
been selected, and 400 academy trainees are needed from the 2,000 candidates
scoring 97, a random drawing from among all 2,000 would be used.22 Thus,
even the City recognizes that when the test scores afford no job-related basis
for making selections from within a group that passed the test, random
selection is appropriate.
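
The practice the City described at oral argument can be sketched as follows. This is a hypothetical illustration, not the City's actual procedure: the function name, the candidate identifiers, and the count of 50 candidates at 98 are invented for the example, and only the figures of 2,000 candidates at 97 and 400 seats come from the text.

```python
import random


def select_trainees(scores, seats, rng):
    """Fill seats from the highest score level downward.

    scores: dict mapping candidate id -> exam score.
    When a score level holds more candidates than the seats remaining,
    the remaining seats are filled by a random drawing within that level.
    """
    by_score = {}
    for candidate, score in scores.items():
        by_score.setdefault(score, []).append(candidate)

    chosen = []
    for score in sorted(by_score, reverse=True):
        remaining = seats - len(chosen)
        if remaining <= 0:
            break
        group = by_score[score]
        if len(group) <= remaining:
            chosen.extend(group)                         # whole level fits
        else:
            chosen.extend(rng.sample(group, remaining))  # random drawing
    return chosen


# Hypothetical pool: 50 candidates at 98 (invented), 2,000 at 97 (from the text).
pool = {f"c98_{i}": 98 for i in range(50)}
pool.update({f"c97_{i}": 97 for i in range(2000)})
picked = select_trainees(pool, seats=450, rng=random.Random(0))
```

With 450 seats, every candidate at 98 is taken and the other 400 seats are drawn at random from the 2,000 candidates at 97, which mirrors the example given in the text.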
81
We do not conclude that Title VII requires random selection from among those
who pass a content valid test. In some instances rank-ordering may be shown to
be justified. But where it is not, random selection from within a group validly
determined to have passed a content valid exam is simply an available option.
See Association Against Discrimination in Employment v. City of Bridgeport,
594 F.2d 306, 313 n.19 (2d Cir. 1979). The City may prefer not to use it.
However that may be, the City cannot use rank-ordering not shown to be job-
related when test scores produce a disparate racial impact. Nor can the City
justify the use of rank-ordering by reliance on what it contends are
requirements of state law. See N.Y.Const. art. 5, § 6; N.Y.Civil Service Law
§ 61(1). Title VII explicitly relieves employers from any duty to observe a state
hiring provision "which purports to require or permit" any discriminatory
employment practice. 42 U.S.C. 2000e-7 (1976).
82
83
Cutoff Score. The Guidelines state that a cutoff score "should normally be set
so as to be reasonable and consistent with normal expectations of acceptable
proficiency within the work force." Guidelines 5(H). This also makes sense.
No matter how valid the exam, it is the cutoff score that ultimately determines
whether a person passes or fails. A cutoff score unrelated to job performance
may well lead to the rejection of applicants who were fully capable of
performing the job. When a cutoff score unrelated to job performance produces
disparate racial results, Title VII is violated. See Association Against
Discrimination, supra, 594 F.2d at 312-13; Bridgeport Guardians, Inc. v.
Bridgeport Civil Service Commission, 482 F.2d 1333, 1338 (2d Cir. 1973).
Consequently, there should generally be some independent basis for choosing
the cutoff. As with rank-ordering, a criterion-related study is not necessarily
required; the employer might establish a valid cutoff score by using a
professional estimate of the requisite ability levels, or, at the very least, by
analyzing the test results to locate a logical "break-point" in the distribution of
scores. The City offered no such basis in this case. It merely chose as many
candidates as it needed, and then set the cutoff score so that the remaining
candidates would fail.
84
If it had been shown that the exam measures ability with sufficient
differentiating power to justify rank-ordering, it would have been valid to set
the cutoff score at the point where rank-ordering filled the City's needs. The
justification would be that each incremental change in score represents an
incremental change in job-related ability, so that, for any given cutoff (even one
determined solely by hiring needs), those who passed would likely perform the
job better than those who failed. But the City can make no such claim, since it
never established a valid basis for rank-ordering.
85
Indeed, the problems of both validity and reliability, which prevent the justified
use of rank-ordering, also cast serious doubt on the justification for the cutoff
score of 94. Of all these problems, the unreliability attributable to the error of
measurement has special significance for the cutoff score. As previously noted,
the error of measurement had an especially extensive impact on the applicants
because of the bunching of scores at the high end. The bunching occurred not
only at passing scores from 94 to 97, but also at failing scores of 92 and 93,
each of which was also achieved by more than 2,000 applicants. Scores within
a range of two points above and below the passing grade of 94 were achieved
by 10,731 applicants, 29% of the total. Had the scores been evenly distributed,
1,800 applicants, only 5%, would have fallen within this range. Selecting a
cutoff score in the middle of the range in which the test scores were closely
bunched meant that the inevitable error of measurement led to a much higher
number of mistaken passes and failures than would otherwise have occurred.23
Perhaps an even distribution of scores cannot be readily achieved, but the
impact of the error of measurement could have been held to acceptable limits if
a cutoff score had been selected within some range where scores were not
closely bunched. This does not mean that every person who fails a test by a
single point necessarily has a claim for legal redress. A cutoff score, properly
selected, is not impermissible simply because there will always be some error
of measurement associated with it. But when an exam produces disparate racial
results, a cutoff score requires adequate justification and cannot be used at a
point where its unreliability has such an extensive impact as occurred in this
case.
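
The arithmetic behind the bunching comparison is simple enough to check directly. The applicant counts below are taken from the opinion; treating the five scores from 92 through 96 as 5% of a 100-point scale is an assumption about how the even-distribution figure was reached.

```python
total_applicants = 36_797   # applicants who took Exam No. 8155
near_cutoff = 10_731        # applicants scoring within two points of 94

# Share of all applicants who scored within two points of the cutoff.
actual_share = near_cutoff / total_applicants

# Assumed even spread: five score points out of a 100-point scale.
even_share = 5 / 100
even_count = total_applicants * even_share

# How many times more applicants sat near the cutoff than an even spread implies.
bunching_ratio = near_cutoff / even_count

print(f"actual share near cutoff: {actual_share:.0%}")
print(f"expected under even spread: about {even_count:,.0f}")
print(f"bunching ratio: {bunching_ratio:.1f}x")
```

On these assumptions, nearly six times as many applicants fell within two points of the cutoff as an even distribution would place there, so the same error of measurement affects far more pass/fail outcomes.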
86
Primarily on the basis of Exam No. 8155's improper use of rank-ordering, and
of the cutoff score, we affirm the conclusion of the District Court that the exam
as used was invalid. Since we agree with the District Court that the exam had a
significant disparate racial impact, we hold that the City's use of the exam
violated Title VII.
V. Relief
87
88
The District Court's order is set out in full in the margin.24 It deals with several
topics including the use of Exam No. 8155, the development and approval of a
new selection procedure, hiring in the interim until a new selection procedure
receives court approval, and long-term hiring. The City objects most
strenuously to the provisions concerning interim and long-term hiring, since
these provisions involve the use of a quota.
89
90
1. As a general matter Title VII relief should at least assure compliance with
the law. When it has been established that a selection procedure has been
unlawfully used, an appropriate compliance remedy should forbid the use of
that procedure, or its disparate racial impact, and may properly assure the
establishment of a lawful new procedure. When it also appears that the
employer has discriminated prior to the use of the challenged selection
procedure, then it may also be appropriate to fashion some form of affirmative
relief, on an interim and long-term basis, to remedy past violations, see, e. g.,
Prate v. Freedman, 583 F.2d 42, 47 (2d Cir. 1978); United States v. City of
Chicago, supra, 549 F.2d at 436-37; Morrow v. Crisler, 491 F.2d 1053 (5th
Cir.) (en banc), cert. denied, 419 U.S. 895, 95 S.Ct. 173, 42 L.Ed.2d 139
(1974); Bridgeport Guardians, supra, 482 F.2d at 1340; Carter v. Gallagher,
452 F.2d 315, 331 (8th Cir. 1972) (en banc); cf. Franks v. Bowman
Transportation Co., 424 U.S. 747, 763-64, 96 S.Ct. 1251, 1263-64, 47 L.Ed.2d
444 (1976) (Title VII authorizes broad remedial relief); Albemarle, supra, 422
U.S. at 418 (same). However, the form of such affirmative relief, especially the
use of quotas, requires a most sensitive approach, see Association Against
Discrimination, supra, 594 F.2d at 310-11; Kirkland, supra, 520 F.2d at 427-28;
Patterson v. Newspaper & Mail Deliverers' Union, 514 F.2d 767, 775-76 (2d
Cir. 1975) (Feinberg, J., concurring), cert. denied, 427 U.S. 911, 96 S.Ct. 3198,
49 L.Ed.2d 1203 (1976); Bridgeport Guardians, supra, 482 F.2d at 1340;
Vulcan Society, supra, 490 F.2d at 398-99.
91
2. Initial consideration should be given to relief for the plaintiffs and those
similarly situated, that is, Black and Hispanic applicants who took Exam No.
8155. While relief in Title VII cases need not necessarily be limited to the
applicant class nor framed in specific relation to that class, their interests
obviously deserve consideration. See Castro v. Beecher, 459 F.2d 725, 736-37
(1st Cir. 1972); Carter v. Gallagher, supra, 452 F.2d at 328-31.
92
3. Interim hiring provisions, for the period prior to use of a valid selection
procedure, should be considered and formulated separately from long-term
hiring provisions.
93
94
5. Any use of a hiring ratio during the interim period to compensate for prior
discrimination, that is, a ratio greater than the minority percentage in the
applicant pool or the relevant work force, should be imposed only upon clear
evidence and appropriate findings of the need to redress demonstrated prior
discrimination of long standing that has had a significant impact on minority
employment. See Association Against Discrimination, supra, 594 F.2d at 312;
Patterson, supra, 514 F.2d at 776 (Feinberg, J., concurring); Vulcan Society,
supra, 490 F.2d at 398-99.
95
96
Compliance Remedies
97
Paragraph 2 of the order enjoins the use of Exam No. 8155 as a selection
procedure, except in connection with the implementation of the interim and
long-term hiring provisions. Deferring for the moment the exception
concerning the permissible use of the exam, we readily affirm the District
Court's prohibition against the unqualified use of the exam. The exam as used
violated Title VII, and it is obviously appropriate to bar its continued use,
except on an interim basis with adjustments that eliminate its disparate racial
impact and thereby avoid its unlawful effect.
98
The order prescribes four requirements for the development and approval of a
new selection procedure. We affirm the requirement, in paragraph 6 of the
order, that the City make extensive efforts in its search for a new procedure,
including consideration of "all reasonably available alternative selection
procedures" and broad consultation with appropriate professionals.
99
We also affirm the procedural requirement in paragraph 4 of the order that the
new selection device must be approved by the District Court prior to its use.
Once an exam has been adjudicated to be in violation of Title VII, it is a
reasonable remedy to require that any subsequent exam or other selection
device receive court approval prior to use. See, e. g., Bridgeport Guardians,
supra, 482 F.2d at 1339. This situation is to be contrasted with a case like
Guardians Ass'n v. Civil Service Commission, 490 F.2d 400 (2d Cir. 1973)
(Guardians I or "the '68-'70 exams case"), where the exams had not yet been
found to be invalid. In that case the City was obliged only to show the new
exam to plaintiffs and afford them an opportunity to criticize it.
100
However, we reject the District Court's principal substantive standard for
approval of the new selection procedure to the extent that it requires any new
procedure to be validated in accordance with the Guidelines and consistent with
the APA Standards. As discussed in part III of this opinion, we have concluded
that literal compliance with the Guidelines and with professional testing criteria
is not required by Title VII and can, in some instances, lead to results
inconsistent with Title VII's explicit endorsement of "any professionally
developed ability test." 42 U.S.C. 2000e-2(h). We therefore conclude that the
District Court, in determining the legality of a new selection procedure, should
not require that it must conform in all respects to the Guidelines and the APA
Standards; it will be sufficient if the new procedure conforms to the essential
purposes of Title VII. We have endeavored to outline the extent to which the
Guidelines are useful in carrying out those purposes and some of the respects in
which excessive rigidity in application of the Guidelines may undermine those
purposes. No all-encompassing formula is possible. The Guidelines remain
useful as a source of guidance, but they need not be adhered to in every detail
as if they were substantive regulations.
101
We also reject the District Court's substantive requirement, as expressed in
paragraph 6 of the order, that the new selection procedure must have "the least
adverse impact on minority applicants." This requirement appears to be an
attempt to implement the principle expressed in Albemarle that once an
employer has established the job-relatedness of a selection procedure that has a
disparate racial impact, the plaintiff may still establish a Title VII violation by
proving that "other tests or selection devices, without a similarly undesirable
racial effect, would also serve the employer's legitimate interest in 'efficient and
trustworthy workmanship.' " Albemarle Paper Co. v. Moody, supra, 422 U.S. at
425, 95 S.Ct. at 2375, quoting McDonnell Douglas Corp. v. Green, supra, 411
U.S. at 801, 93 S.Ct. at 1823. Of course, a decree may incorporate this
principle into the standard for approving any new selection procedure, but the
phrasing in paragraph 6 imposes a stricter and impermissible burden. To
comply with Title VII a new selection procedure need not have the least
adverse impact on minority applicants. That requirement would prohibit any
exam with any disparate racial impact because random selection would always
be a procedure with less adverse impact. What Albemarle contemplates, and
what the decree may require, is that a selection procedure proposed by the City
may not be used if the plaintiffs can establish the existence of an alternative
procedure with an equivalent degree of job relatedness and a lesser disparate
racial impact.
Affirmative Relief
102 Our initial concern with the provisions of the order concerning affirmative
relief arises from our uncertainty as to precisely what the District Court has
required. The District Court's January 11 order states, and its January 23
revised opinion appears to require, that the City take the affirmative action of
hiring 50% of entry-level police officers from among qualified Black and
Hispanic applicants. But the opinion and the order seem to contain different
provisions about the length of time for which this 50% minority hiring ratio is
to apply. The opinion characterizes the affirmative action required as an
"interim" measure. 484 F.Supp. at 799. In the context of Title VII testing cases,
"interim" has meant the time period between the date of a decree and the
subsequent use of a valid selection procedure. See, e. g., EEOC v. Local 638,
Sheet Metal Workers' Association, 532 F.2d 821, 829 (2d Cir. 1976); Kirkland,
supra, 520 F.2d at 423, 429-30; Vulcan Society, supra, 490 F.2d at 398. It is to
be contrasted with a long-term or permanent hiring requirement, which
specifies a minority composition of the employer's work force that must be
achieved, even if a valid testing procedure has been developed and approved
before the targeted minority composition is reached.28
103 In contrast to the District Court's opinion, its order, which is the operative
document brought here for review, appears to require numerical quotas that
continue beyond interim relief. Paragraph 3 states that the defendants "shall"
seek to achieve minority (Black and Hispanic) representation in the Police
Department "comparable to that of the minority composition of the labor force
in the relevant hiring area," a representation stated to be at least 30%. Paragraph
4 also indicates that the required 50% quota hiring29 may well last beyond the
interim that ends with approval of a valid test. Paragraph 4 prescribes 50%
minority hiring either until minority representation in the Department equals
minority representation in the relevant labor force or until the District Court has
both approved a valid selection procedure and, in addition, found that 50%
quota hiring is no longer "appropriate." That such a finding might not be made
until sometime after approval of a valid selection procedure and perhaps not
until the long-term hiring goal has been reached is indicated by the specific
reservation in Paragraph 4 of the plaintiffs' right to advocate the continued use
of hiring quotas because "the continuing effects of past discrimination have not
been eliminated."
104 In addition to creating uncertainty whether affirmative relief has been ordered
on an interim or long-term basis, the record contains inadequate findings and,
more significantly, inadequate evidence to support the hiring provisions of the
order. The District Court determined that affirmative relief was warranted
based upon conclusions concerning the prior employment practices of the
defendants and their state of mind in preparing and using Exam No. 8155. The
prior practices concern the defendants' use of an eligibility list compiled from
the results of police exams given between 1968 and 1970. The continued use of
the results of those exams after 1972, when Title VII was amended to include
municipal employers, had previously been found to violate Title VII because
the exams had a disparate racial impact and were not job-related. That
conclusion had been reached in litigation concerning the Police Department's
layoff policy, Guardians Ass'n v. Civil Service Commission, 431 F.Supp. 526
(S.D.N.Y.) (Guardians II or "the first layoff policy case"), vacated and
remanded for reconsideration, 562 F.2d 38 (2d Cir. 1977), and reaffirmed in
Guardians Ass'n v. Civil Service Commission, 466 F.Supp. 1273
(S.D.N.Y.1979) (Guardians III or "the second layoff policy case"), aff'd in part,
remanded in part, No. 79-7377, --- F.2d ---- (2d Cir. July 25, 1980).30
105 The District Court grounded its decision to impose affirmative relief on the
conclusion that the defendants designed Exam No. 8155 "either with a
deliberate intention to discriminate against blacks and hispanics or with reckless
disregard of whether the test would have that result." 484 F.Supp. at 798-99.
This serious indictment of responsible city and police administrators is
unsupported and indeed contradicted by the record. The conclusion is based
virtually exclusively on the fact that the Police Department failed to assemble a
valid eligibility list from the '68-'70 exams and failed again in 1979.31 It would
be contrary to Title VII's provision allowing the use of valid exams to hold that
once an employer tries to construct such an exam and fails, any further failure
to develop a valid exam constitutes intentional discrimination. Such a second
attempt, by itself, is evidence only of a desire to make use of a technique that
the law explicitly allows. Persistent use of exams with disparate racial effects
would support an inference of intentional discrimination if proper test
construction were not even attempted. But the record here indicates that the
City's police and personnel officials made extensive efforts to understand and
apply the Guidelines and develop a test they hoped would have the requisite
validity.32 Their failure entitles the plaintiffs to some relief, but does not justify
a remedy based upon an unwarranted inference of deliberate discrimination.
106 In the absence of intentional discrimination, affirmative relief requires some
demonstrated pattern of significant prior discrimination. There are no adequate
findings concerning such a pattern, and the record lacks sufficient evidence on
which such findings could be based. The District Court referred to the City's
prior Title VII violation in using the eligibility list resulting from the '68-'70
exams and concluded that the minority "imbalance" on the City's police force is
"directly caused by past and current discriminatory practices." 484 F.Supp. at
799. Obviously Exam No. 8155 has had no significant effect upon the current
minority proportion of the police force, because its results have been used to
select only one class for the training academy. As to the use of the eligibility
list from the '68-'70 exams, there are no findings and no evidence to indicate
the extent to which use of that list has affected the minority proportion of the
police force. The first layoff policy case provides some data as to the numbers
of Whites and minority members hired as a result of two of the '68-'70 exams,
431 F.Supp. at 552-53, Tables 4 and 6, but even with that data, the record in
this case does not disclose the minority percentage in the police department
before and after the '68-'70 exams. Nor is there any evidence of the impact of
hiring resulting from Exam No. 3014, administered in 1973, the validity of
which has not been challenged. The '68-'70 exams undoubtedly made some
contribution to the current racial imbalance of the police force, but the record
does not contain even estimates of how the hiring prior to the 1973 exam
currently affects the composition of the police force. Plaintiffs have failed to
prove that prior use of discriminatory exams has created a situation warranting
affirmative relief.
107 In the absence of proof of the specific impact of such prior discrimination, the
only probative evidence in the record is that the current minority proportion of
the police force is 12.7% compared to a relevant work force percentage of at
least 30%. That is cause for some concern, but does not reveal the flagrant
disparity shown in prior cases where long-term hiring quotas were in issue. Cf.
Association Against Discrimination, supra, 594 F.2d at 308 (minorities
constituted 0.2% of employees, 41% of population; quota vacated for
reconsideration and findings); EEOC v. Local 14, International Union of
Operating Engineers, 553 F.2d 251, 256 (2d Cir. 1977) (minorities constituted
2.8% of union members, at least 16.2% of relevant labor force; judgment
including quota vacated for further findings); Patterson, supra, 514 F.2d at 770,
772 (minorities constituted 2.45% of union and union-affected job-seekers, 30%
of relevant labor force; quota sustained); Bridgeport Guardians, supra, 482 F.2d
at 1335 (minorities constituted 3.6% of employees, 25% of population; hiring
quota sustained). If the disparity between existing minority employment and
relevant work force percentage were extreme and long-standing, that
circumstance alone might justify some affirmative relief, especially if minority
employment were low. But where, as here, the disparity is not extreme and
minority employment is not insubstantial, an affirmative hiring remedy must be
based on detailed findings, supported by evidence, that there exists a pattern of
prior discrimination warranting such relief. Cf. Association Against
Discrimination, supra, 594 F.2d at 312-13; Kirkland, supra, 520 F.2d at 427-28.
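As a rough aid to the comparison drawn above, the following sketch computes a simple "representation ratio" (minority share of employees or members divided by minority share of the benchmark population) for each case cited. The ratio is an illustrative assumption, not a measure the Court adopts. The benchmarks also differ from case to case (population in some, relevant labor force in others), and the "at least" figures are treated here as point values.

```python
# Figures quoted in the paragraph above, in percent:
# (minority share of employees or members, minority share of the benchmark).
CASES = {
    "This case (NYPD)": (12.7, 30.0),
    "Association Against Discrimination": (0.2, 41.0),
    "EEOC v. Local 14": (2.8, 16.2),
    "Patterson": (2.45, 30.0),
    "Bridgeport Guardians": (3.6, 25.0),
}


def representation_ratio(employed_pct, benchmark_pct):
    """Hypothetical helper: 1.0 would mean parity with the benchmark."""
    return employed_pct / benchmark_pct


ratios = {name: representation_ratio(*pcts) for name, pcts in CASES.items()}

# List the cases from the most severe disparity to the least severe.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:40s} {ratio:.3f}")
```

On these figures the ratio for the present case is the highest of the five, which is consistent with the observation that the disparity here is less flagrant than in the prior quota cases.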
108 We therefore conclude that the affirmative hiring provisions of the order must
be set aside. This will require elimination of the affirmative hiring quota of
50%, both as interim and long-term relief, and elimination of the long-term
hiring goal of 30%, a goal that obviously could be achieved only by affirmative
hiring at a ratio above the minority percentage of the relevant work force. The
only hiring remedy justified by this record is a compliance remedy, one
designed to make sure that the City complies with the requirements of Title VII
in making appointments to the police force. Such a remedy should permit the
City, in the interim period prior to development of a new, valid selection
procedure, to use the results of Exam No. 8155 in a way that avoids any
disparate racial impact. This means selecting candidates from the eligibility list
subject to the minority proportion of either the applicant pool or the relevant
work force, a determination to be made upon remand.
109 To accomplish such interim hiring the City may assemble a minority pool and a
majority pool of qualified candidates from the eligibility list. In assembling
these pools, the City may use a cut-off score somewhat lower than 94, a score
that was originally determined by the City's manpower needs, rather than by an
independent estimate of adequate ability. Within the majority and minority
pools, the City may choose candidates, maintaining the requisite proportion of
minority candidates and taking into account those already hired as a result of
this exam. The City is not obliged to hire on an interim basis, but it should have
the option of doing so in order to meet its manpower needs.
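The arithmetic of such interim hiring can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name and its parameters are hypothetical, the target proportion is left for determination on remand, and the racial breakdown of those already hired is not stated in this opinion, so the breakdown used in the example is an assumed figure. The class size of 380 and the 415 prior hires are the figures mentioned in the record; the one-third target mirrors the ratio in this Court's stay order.

```python
import math
from fractions import Fraction


def next_class_split(class_size, target_share, hired_minority, hired_other):
    """Split a new class between the minority and majority pools so that
    cumulative hires from the exam reach the target minority share,
    as far as a single class allows."""
    total_after = hired_minority + hired_other + class_size
    # Smallest cumulative number of minority hires that meets the target.
    needed_minority = math.ceil(target_share * total_after)
    # A class cannot contain fewer than zero or more than class_size hires.
    minority = max(0, min(class_size, needed_minority - hired_minority))
    return minority, class_size - minority


# Assumed breakdown of the 415 earlier hires: 100 minority, 315 other.
minority, other = next_class_split(380, Fraction(1, 3), 100, 315)
print(minority, other)
```

Under the assumed breakdown, the next class of 380 would comprise 165 minority and 215 other candidates, bringing cumulative hires to 265 of 795, exactly one third. An exact `Fraction` is used for the target share to avoid floating-point rounding at the boundary.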
110
We remand this case to the District Court for the entry of a revised decree
consistent with this opinion. Pending the entry of that decree, we continue in
effect the provisions of the stay order we previously entered, under which the
City is afforded the option of hiring from those who scored 94 or above on
Exam No. 8155 provided such hiring achieves a minority ratio of 33%, taking
into account those already hired as a result of this exam.
*
Of the United States District Court for the Eastern District of New York, sitting
by designation
One example:
37
Chris Hart and Larry Burns are walking by a big museum late one night. Hart
notices that, although the museum is closed to the public, one of the doors is
unlocked. Hart suggests that for a prank they go into the museum to see the
exhibits. Hart has a flashlight with him while Burns has an illegal gun hidden
under his clothing. Hart does not know that Burns has a gun illegally in his
possession. About five minutes after they enter the museum, they hear
footsteps and leave the museum the same way they entered. According to the
definitions given,
(A) Burns committed the crime of criminal trespass, but Hart did not
(B) Hart committed the crime of criminal trespass, but Burns did not
(C) both Hart and Burns committed the crime of criminal trespass
(D) neither Hart nor Burns committed the crime of criminal trespass.
The definition provided for criminal trespass is as follows:
The crime of criminal trespass is committed when a person knowingly enters or
remains in a building in which he has no right to be and while in the building
possesses, or knows that another person accompanying him possesses, an
explosive or a gun.
It will be noted, simply as an indication of the difficulty of the test construction
enterprise, that there is a slight ambiguity in this question. The applicant is told
that Hart is unaware that "Burns has a gun illegally in his possession," but not
whether Hart was unaware that Burns had a gun at all. This could affect the
answer, since, according to the definition given, knowledge of any gun, even of
a legal one, would render Hart's action a criminal trespass. While it would be
unreasonable to suggest that a test must be free of every possible ambiguity in
order to be acceptable, the ease with which such ambiguities can appear
emphasizes the value of confirming the test's reliability by some empirical
procedure. See Section IV, infra.
2
By the time of the November hearing, the Police Department had already
proceeded to use the list to accept 415 trainees, as described above. In order to
forestall any further hiring on the basis of the list, and to assist the City in
making alternative arrangements, the District Court informed the parties on
December 17, 1979 that the test violated Title VII, and orally enjoined the
defendants from its further use. On December 27, 1979, the City filed an order
to show cause and a motion for a stay, in which it stated that the Police
Department needed to accept an additional 380 trainees from the list on January
14, 1980. When the Court denied its motion, the City filed a petition in this
Court for a writ of mandamus. This Court granted the petition, ordering the
District Court to issue findings of fact and conclusions of law, pursuant to
Fed.R.Civ.P. 52(a), at least 48 hours before any injunction against the City was
to take effect. The District Court then held its second hearing, which was
devoted to the issue of relief. Since the Court's decision, issued that day,
preceded the January 14 action by more than 48 hours, it fulfills the
requirements of this Court's mandamus
The standard deviation for a particular set of data provides a measure of how
much the particular results of that data differ from the expected results. In
The terms of this conditional stay were that, if the City wished to hire, pending
appeal, it should establish two pools of candidates, one consisting of all the
minority applicants who passed the exam, and the second consisting of all
others who passed, and select trainees from these pools in the ratio of one
minority applicant for every two others. The 415 applicants already hired were
to be counted in determining whether the new hires conformed to this 1 to 2
ratio
10
In criticizing the questions involving application of the law, the District Court
stated: "In a real situation, the officer sees activity and must determine rather
quickly whether the activity is illegal, with no definitional aids before him. He
must operate on instinct and experience." 484 F.Supp. at 797. That is true, but it
is a criticism of all testing. In any situation, there are generally at least two
steps that are necessary to produce the correct behavioral response. The first is
to know what to do, and the second is to act accordingly. Clearly the limit of
any test, no matter how well designed, is to determine whether the applicant
knows or can determine what to do. Only a probationary period can determine
if the applicant will act correctly in a real life situation
11
12
See United States v. City of Chicago, supra, 549 F.2d at 430-32; Douglas v.
Hampton, 512 F.2d 976, 985-86 (D.C.Cir.1975); Vulcan Society, supra, 490
F.2d at 395 & n. 10; Bridgeport Guardians, Inc. v. Civil Service Commission,
354 F.Supp. 778 (D.Conn.), aff'd in part, rev'd in part, 482 F.2d 1333 (2d Cir.
1973); cf. Schmidt & Hunter, The Future of Criterion-Related Validity, 33
Personnel Psych. 41, 48 (1980) ("criterion related validity studies will
frequently, perhaps typically, be technically infeasible")
A rare example of a criterion-related study that was found acceptable is
Washington v. Davis, supra, 426 U.S. at 249-52, 251 n. 17, 96 S.Ct. at 2052-53,
2053 n. 17. This was not a Title VII case however, and the Court's use of Title
VII concepts, in dictum, to assess the validity of the test in question under the
Fourteenth Amendment does not indicate that the Court was reviewing the test
with the stringency that Title VII requires. See Guardians Ass'n v. Civil Service
Commission, 633 F.2d 232, 245-46 (2d Cir. 1980). In fact, a less
demanding standard was almost certainly being used, as is clear from the
comparison between Davis and the Court's Title VII decision the preceding
term in Albemarle.
13
There will be some tests whose character is sufficiently clear so that the
content-construct distinction can be applied at the threshold, without the need
to place the test in the context of the job it tests for. A general intelligence test,
see Griggs, supra (Wonderlic Personnel Test), will almost always need to be
assessed by construct validation, since it necessarily measures for an inferred
ability, regardless of the context. The much-vaunted typing test, in contrast, can
always be regarded as amenable to content validation. However, there are a
large number of tests, including virtually all the "second generation" tests for
jobs such as the one considered here, that will fall into the middle range
14
15
In some instances the relationship is obvious. Plainly the ability to fill out
forms is needed for task 16, "Processes arrests using appropriate police
department forms and notifications." But it is not evident, for example, why any
of the five listed abilities are critical to task 32, "Searches for lost children,
runaways, etc." or task 19, "Guards and transports prisoners."
16
The fact that the factual subject matter of the exam questions (as opposed to
their purpose in measuring abilities) was related to the subject matter of the job
is not a major indicator of the test's validity, since the test measured abilities,
not knowledge. But it does suggest that the test has avoided the dangers
inherent in using irrelevant factual material. Such material could skew the test
for ability in directions unrelated to the job, a phenomenon that even the best
designed test might not be able to avoid. The present test avoids that problem
by ensuring that any distorting effects resulting from the subject matter of the
questions are themselves job-related
17
See APA Standards, supra at 48-50; Kuder & Richardson, The Theory of
Estimation of Test Reliability in Principles of Educational and Psychological
Measurement 95 (W. Mehrens & R. Ebel, eds. 1967); Rulon, A Simplified
Procedure for Determining the Reliability of a Test by Split Halves, in id. at
104
18
The reason this bunching occurred was that the exam was too easy. An exam
that was too difficult might have had the same effect, except that the bunching
would have occurred at the lower end of the scale. Neither excessive easiness
nor excessive difficulty is necessarily fatal, but each magnifies effects that may
make scoring arrangements unjustified
20
Set forth below are the total number of applicants who achieved each score
from 110 to 70 and the number of White and minority (Black and Hispanic-surnamed) applicants at each of these scores. The White and minority figures
do not always equal the total because some applicants were members of other
minority groups and some applicants were not identified
Score   Total Applicants   White   Minority
110                    3       1          1
109                    6       4          2
108                    3       1          0
107                    9       5          1
106                   13       5          5
105                   36      19          6
104                   90      59         16
103                  102      64         23
102                   95      53         17
101                  125      60         38
100                  823     565         96
 99                 1570    1067        177
 98                 1845    1372        223
 97                 2238    1562        282
 96                 2311    1516        390
 95                 2255    1434        428
 94                 2124    1307        425
 93                 2024    1195        504
 92                 2017    1174        504
 91                 1772    1009        524
 90                 1675     913        529
 89                 1498     774        541
 88                 1359     678        485
 87                 1171     569        452
 86                 1077     500        452
 85                  989     468        402
 84                  870     395        371
 83                  813     358        362
 82                  722     313        330
 81                  608     246        298
 80                  558     214        287
 79                  497     191        245
 78                  449     163        222
 77                  409     157        197
 76                  363     130        192
 75                  323     117        156
 74                  342     121        182
 73                  265      89        141
 72                  247      87        138
 71                  213      89        102
 70                  198      65        112
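One way to read the figures above is to tally the identified White and minority applicants at or above a given score. The sketch below does so for a cut-off of 94, the score discussed in the opinion. It is an illustration only: it uses only the rows for scores 110 through 94, it ignores applicants of other or unidentified groups, and the resulting share is not a selection rate, since that would require the total number of applicants in each group.

```python
# (score, White, minority) rows for scores 110 through 94, from the table above.
ROWS = [
    (110, 1, 1), (109, 4, 2), (108, 1, 0), (107, 5, 1), (106, 5, 5),
    (105, 19, 6), (104, 59, 16), (103, 64, 23), (102, 53, 17),
    (101, 60, 38), (100, 565, 96), (99, 1067, 177), (98, 1372, 223),
    (97, 1562, 282), (96, 1516, 390), (95, 1434, 428), (94, 1307, 425),
]

CUTOFF = 94

# Identified applicants of each group scoring at or above the cut-off.
white_at_or_above = sum(w for score, w, m in ROWS if score >= CUTOFF)
minority_at_or_above = sum(m for score, w, m in ROWS if score >= CUTOFF)

# Minority share among identified White and minority applicants only.
minority_share = minority_at_or_above / (white_at_or_above + minority_at_or_above)

print(white_at_or_above, minority_at_or_above, f"{minority_share:.1%}")
```

On these rows the tally is 9,094 White and 2,130 minority applicants at or above 94, a minority share of about 19% of the identified applicants in that range.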
21
22
Actually, in this example, presumably 1,200 names would be drawn to yield the
anticipated 400 candidates who would complete the post-examination steps of
the hiring process
23
Upon consideration of the evidence presented at the liability and relief stages of
this case, and consideration of the briefs and oral arguments of the parties and
amicus curiae United States of America and Policewomen's Endowment
Association of New York City, Inc., and entry of findings of fact and
conclusions of law, it is hereby ORDERED:
The defendants shall seek to achieve as a long-term goal black and hispanic
(hereinafter "minority") representation in the sworn ranks of the Police
Department comparable to that of the minority composition of the labor force in
the relevant hiring area. As of 1978, the labor force of the relevant hiring area
was at least 30% black and hispanic
To achieve the long-term goal set forth in paragraph 3, supra, the defendants
shall as an interim goal appoint 50% of their entry level police officers from
among qualified black and hispanic applicants. The interim hiring goal for
minorities shall remain in effect until the minority representation in the sworn
ranks of the Police Department is at least equal to the percentage of minorities
in the labor force of the relevant hiring area as described in paragraph 3, supra,
or until this court has found, after a hearing, that all proposed selection
procedures for police officer positions have been validated in accordance with
the Uniform Guidelines on Employee Selection Procedures, 28 C.F.R. 50.14,
29 C.F.R. 1607, effective September 25, 1978 ("Uniform Guidelines"), and
that no further interim goals are appropriate. Nothing herein shall preclude
plaintiffs from advocating the continuance of the interim goals on the basis that
the continuing effects of past discrimination have not been eliminated
To satisfy the goals set forth above in paragraphs 3 and 4, defendants may use
an eligibility list derived from Examination No. 8155 as the pool from which it
selects police officers. At such time as that eligibility list or any future
eligibility list for the position of police officer does not contain sufficient
minority candidates to meet the interim goals set out in paragraph 4, supra, the
City shall take whatever steps are necessary to achieve the interim hiring goals
experience in the field of selection testing, and preferably who have performed
or have knowledge of analyses of the job of police officers
7
Defendants may continue to use the current qualifications and selection criteria
for police officer positions. However, no such qualification or selection
criterion shall be a valid basis for or defense for failure to meet the interim
hiring goals set out in paragraph 4, supra, unless the court has ruled that such
qualification or selection criterion has been validated in accordance with the
Uniform Guidelines and has determined that there is no further basis for
continuing the interim goals
Plaintiffs are entitled to their court costs and reasonable attorneys' fees to date.
The amount of such costs and fees shall be set by the court after a hearing.
Costs and attorneys' fees for work done in the future shall be fixed in such
manner as the court may determine
10
Within thirty (30) days after the entry of this order, and every six (6) months
thereafter, the defendant city shall submit to the plaintiffs the following reports:
(a) A list of its then current uniformed employees in the Police Department
showing for each person: name, address, race or national origin, police station
or other place of assignment, date of appointment, rank, and date such rank was
achieved.
(b) The total number of uniformed personnel employed by the Police
Department, by rank, race and national origin.
(c) A list of all minority applicants for all vacancies, including date of
application, names, addresses, and telephone numbers, whether the applicant
was accepted or rejected and reason(s) for rejection.
(d) The name, address, and telephone number of any minority employee
involuntarily terminated prior to the completion of the probationary period, and
reason(s) for termination.
(e) List of all hires, promotions and voluntary and involuntary terminations
showing race and national origin.
11
12
The court retains jurisdiction of this action for such further relief or other
orders as may be necessary or appropriate to enforce and insure rights to equal
employment opportunity within the New York City Police Department
SO ORDERED.
25
26
27
This requirement of reaching a target should be contrasted with the far more
modest use of a target figure merely to limit the extent of interim relief. The
latter occurs when a remedy provides that the interim relief will continue until
an acceptable selection procedure is developed or until a particular target figure
for minority employment is achieved. In such a case, the target figure does not
function as an absolute requirement; it simply serves as a means of assuring that
the interim requirements will end at some point, even if development of a valid
selection procedure is unduly delayed. The use of a target figure for this
limiting purpose does not require the same compelling justification as a target
figure prescribed as an absolute requirement, since it imposes no additional
obligation on the defendant
28
29
The order unfortunately refers to the 50% quota as an "interim goal." It is not a
goal at all. It is a procedure that the District Court has required to be used at
least until the true interim goal (approval of a valid selection procedure) has been
reached, and perhaps until the long-term goal (minority employment percentage
equal to minority work force percentage) has been reached
30
In 1973 the defendants gave still another exam whose validity has not been
adjudicated. See 431 F.Supp. at 545 n.36. It is interesting to note, however, that
even plaintiffs' well-known testing expert, Dr. Richard Barrett, acknowledged
on cross-examination during the first layoff policy case that he could not be
certain whether or not this exam is content valid. Id. n.37
31
The only evidence referred to by the District Court, in addition to the invalidity
of the '68-'70 exams, is the testimony of some police officers who warned the
test makers that the test would disfavor minorities. (484 F.Supp. at 798). Such a
caution would be significant if the test makers had made no effort to satisfy
Guideline standards. But when test makers have undertaken the elaborate
process of job analysis and test construction revealed by this record, their
willingness to disregard a prediction of disparate racial impact does not indicate
that they lacked a good faith belief that their exam could nonetheless be shown
to be adequately job-related
32
It is true that the City's previous unsuccessful attempt to design an exam on its
own renders somewhat questionable its apparent enthusiasm for another "in-house" effort. However, the decision to produce Exam No. 8155 "in-house"
may well have been motivated by a bureaucratic preference for internal
procedures, a need to save money, a naive self-confidence, or simply a desire to
try again. None of these motives is a basis for inferring a conscious intention or
even a reckless willingness to violate the law