Standards for Educational and Psychological Testing
Published by the
American Educational Research Association
1430 K St., NW, Suite 1200
Washington, DC 20005
Prepared by the
Joint Committee on the Standards for Educational and Psychological Testing of the American Educational
Research Association, the American Psychological Association, and the National Council on Measurement
in Education
CONTENTS

INTRODUCTION
The Purpose of the Standards ............ 1
Legal Disclaimer ............ 1
Tests and Test Uses to Which These Standards Apply ............ 2
Participants in the Testing Process ............ 3
Scope of the Revision ............ 4
Organization of the Volume ............ 5
Categories of Standards ............ 5
Presentation of Individual Standards ............ 6
Cautions to Be Considered in Using the Standards ............ 7

PART I
FOUNDATIONS

1. Validity ............ 11
Background ............ 11
Sources of Validity Evidence ............ 13
Integrating the Validity Evidence ............ 21
Standards for Validity ............ 23
Cluster 1. Establishing Intended Uses and Interpretations ............ 23
Cluster 2. Issues Regarding Samples and Settings Used in Validation ............ 25
Cluster 3. Specific Forms of Validity Evidence ............ 26

PART II
OPERATIONS

PART III
TESTING APPLICATIONS

13. Uses of Tests for Program Evaluation, Policy Studies, and Accountability ............ 203
Background ............ 203
Evaluation of Programs and Policy Initiatives ............ 204
Test-Based Accountability Systems ............ 205
Issues in Program and Policy Evaluation and Accountability ............ 206
Additional Considerations ............ 207
Standards for Uses of Tests for Program Evaluation, Policy Studies, and Accountability ............ 209
Cluster 1. Design and Development of Testing Programs and Indices for Program Evaluation, Policy Studies, and Accountability Systems ............ 209
Cluster 2. Interpretations and Uses of Information From Tests Used in Program Evaluation, Policy Studies, and Accountability Systems ............ 210

GLOSSARY ............ 215
INDEX ............ 227
PREFACE
This edition of the Standards for Educational and Psychological Testing is sponsored by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). Earlier documents from the sponsoring organizations also guided the development and use of tests. The first was Technical Recommendations for Psychological Tests and Diagnostic Techniques, prepared by an APA committee and published by APA in 1954. The second was Technical Recommendations for Achievement Tests, prepared by a committee representing AERA and the National Council on Measurement Used in Education (NCMUE) and published by the National Education Association in 1955.

The third, which replaced the earlier two, was prepared by a joint committee representing AERA, APA, and NCME and was published by APA in 1966. It was the first edition of the Standards for Educational and Psychological Testing, also known as the Standards. Three subsequent editions of the Standards were prepared by joint committees representing AERA, APA, and NCME, published in 1974, 1985, and 1999.

The current Standards Management Committee was formed by AERA, APA, and NCME.

Standards Management Committee
Wayne J. Camara (Chair), appointed by APA
David Frisbie (2008-present), appointed by NCME
Suzanne Lane, appointed by AERA
Barbara S. Plake (2005-2007), appointed by NCME

The present edition of the Standards was developed by the Joint Committee on the Standards for Educational and Psychological Testing, appointed by the Standards Management Committee in 2008. Members of the Joint Committee are members of at least one of the three sponsoring organizations, AERA, APA, and NCME. The Joint Committee was charged with the revision of the Standards and the preparation of a final document for publication. It held its first meeting in January 2009.

Joint Committee on the Standards for Educational and Psychological Testing
Barbara S. Plake (Co-Chair)
Lauress L. Wise (Co-Chair)
Linda L. Cook
Fritz Drasgow
Brian T. Gong
Laura S. Hamilton
Jo-Ida Hansen
Joan L. Herman
Michael T. Kane
Michael J. Kolen
Antonio E. Puente
Paul R. Sackett
Nancy T. Tippins
Walter D. Way
Frank C. Worrell

Marianne Ernesto (APA) served as the project director for the Joint Committee, and Dianne L. Schneider (APA) served as the project coordinator. Gerald Sroufe (AERA) provided administrative support for the Management Committee.
APA's legal counsel managed the external legal review of the Standards. Daniel R. Eignor and James C. Impara reviewed the Standards for technical accuracy and consistency across chapters.

In 2008, each of the three sponsoring organizations released a call for comments on the 1999 Standards. Based on a review of the comments received, the Management Committee identified four main content areas of focus for the revision: technological advances in testing, increased use of tests for accountability and education policy-setting, access for all examinee populations, and issues associated with workplace testing. In addition, the committee gave special attention to ensuring a common voice and consistent use of technical language across chapters.

In January 2011, a draft of the revised Standards was made available for public review and comment. Organizations that submitted comments on the draft and/or comments in response to the 2008 call for comments are listed below. Many individuals from each organization contributed comments, as did many individual members of AERA, APA, and NCME. The Joint Committee considered each comment in its revision of the Standards. These thoughtful reviews from a variety of professional vantage points helped the Joint Committee in drafting the final revisions of the present edition of the Standards.

Comments came from the following organizations:

Sponsoring Organizations
American Educational Research Association
American Psychological Association
National Council on Measurement in Education

Professional Associations
American Academy of Clinical Neuropsychology
American Board of Internal Medicine
American Counseling Association
American Institute of CPAs, Examinations Team
APA Board for the Advancement of Psychology in the Public Interest
APA Board of Educational Affairs
APA Board of Professional Affairs
APA Board of Scientific Affairs
APA Policy and Planning Board
APA Committee on Aging
APA Committee on Children, Youth, and Families
APA Committee on Ethnic Minority Affairs
APA Committee on International Relations in Psychology
APA Committee on Legal Issues
APA Committee on Psychological Tests and Assessment
APA Committee on Socioeconomic Status
APA Society for the Psychology of Women (Division 35)
APA Division of Evaluation, Measurement, and Statistics (Division 5)
APA Division of School Psychology (Division 16)
APA Ethics Committee
APA Society for Industrial and Organizational Psychology (Division 14)
APA Society of Clinical Child and Adolescent Psychology (Division 53)
APA Society of Counseling Psychology (Division 17)
Asian American Psychological Association
Association of Test Publishers
District of Columbia Psychological Association
Massachusetts Neuropsychological Society
Massachusetts Psychological Association
National Academy of Neuropsychology
National Association of School Psychologists
National Board of Medical Examiners
National Council of Teachers of Mathematics
NCME Board of Directors
NCME Diversity Issues and Testing Committee
NCME Standards and Test Use Committee

Testing Companies
ACT
Alpine Testing Solutions
The College Board
Educational Testing Service
Harcourt Assessment, Inc.
Hogan Assessment Systems
Pearson
Prometric
Vangent Human Capital Management
Wonderlic, Inc.

Academic and Research Institutions
Center for Educational Assessment, University of Massachusetts
George Washington University Center for Equity and Excellence in Education
Human Resources Research Organization (HumRRO)
National Center on Educational Outcomes, University of Minnesota

AERA: AERA's approval of the Standards means that the Council adopts the document as AERA policy.
INTRODUCTION

Educational and psychological testing and assessment are among the most important contributions of cognitive and behavioral sciences to our society, providing fundamental and significant sources of information about individuals and groups. Not all tests are well developed, nor are all testing practices wise or beneficial, but there is extensive evidence documenting the usefulness of well-constructed, well-interpreted tests. Well-constructed tests that are valid for their intended purposes have the potential to provide substantial benefits for test takers and test users. Their proper use can result in better decisions about individuals and programs than would result without their use and can also provide a route to broader and more equitable access to education and employment. The improper use of tests, on the other hand, can cause considerable harm to test takers and other parties affected by test-based decisions. The intent of the Standards for Educational and Psychological Testing is to promote sound testing practices and to provide a basis for evaluating the quality of those practices. The Standards is intended for professionals who specify, develop, or select tests and for those who interpret, or evaluate the technical quality of, test results.

The Purpose of the Standards

The purpose of the Standards is to provide criteria for the development and evaluation of tests and testing practices and to provide guidelines for assessing the validity of interpretations of test scores for the intended test uses. Although such evaluations should depend heavily on professional judgment, the Standards provides a frame of reference to ensure that relevant issues are addressed. All professional test developers, sponsors, publishers, and users should make reasonable efforts to satisfy and follow the Standards and should encourage others to do so. All applicable standards should be met by all tests and in all test uses unless a sound professional reason is available to show why a standard is not relevant or technically feasible in a particular case.

The Standards makes no attempt to provide psychometric answers to questions of public policy regarding the use of tests. In general, the Standards advocates that, within feasible limits, the relevant technical information be made available so that those involved in policy decisions may be fully informed.

Legal Disclaimer

The Standards is not a statement of legal requirements, and compliance with the Standards is not a substitute for legal advice. Numerous federal, state, and local statutes, regulations, rules, and judicial decisions relate to some aspects of the use, production, maintenance, and development of tests and test results and impose standards that may be different for different types of testing. A review of these legal issues is beyond the scope of the Standards, the distinct purpose of which is to set forth the criteria for sound testing practices from the perspective of cognitive and behavioral science professionals. Where it appears that one or more standards address an issue on which established legal requirements may be particularly relevant, the standard, comment, or introductory material may make note of that fact. Lack of specific reference to legal requirements, however, does not imply the absence of a relevant legal requirement. When applying standards across international borders, legal differences may raise additional issues or require different treatment of issues.

In some areas, such as the collection, analysis, and use of test data and results for different subgroups, the law may both require participants in the testing process to take certain actions and prohibit those participants from taking other actions. Furthermore, because the science of testing is an evolving discipline, recent revisions to the Standards may not be reflected in existing legal authorities, including judicial decisions and agency
guidelines. In all situations, participants in the testing process should obtain the advice of counsel concerning applicable legal requirements.

In addition, although the Standards is not enforceable by the sponsoring organizations, it has been repeatedly recognized by regulatory authorities and courts as setting forth the generally accepted professional standards that developers and users of tests and other selection procedures follow. Compliance or noncompliance with the Standards may be used as relevant evidence of legal liability in judicial and regulatory proceedings. The Standards therefore merits careful consideration by all participants in the testing process.

Nothing in the Standards is meant to constitute legal advice. Moreover, the publishers disclaim any and all responsibility for liability created by participation in the testing process.

Tests and Test Uses to Which These Standards Apply

A test is a device or procedure in which a sample of an examinee's behavior in a specified domain is obtained and subsequently evaluated and scored using a standardized process. Whereas the label test is sometimes reserved for instruments on which responses are evaluated for their correctness or quality, and the terms scale and inventory are used for measures of attitudes, interest, and dispositions, the Standards uses the single term test to refer to all such evaluative devices.

A distinction is sometimes made between tests and assessments. Assessment is a broader term than test, commonly referring to a process that integrates test information with information from other sources (e.g., information from other tests, inventories, and interviews; or the individual's social, educational, employment, health, or psychological history). The applicability of the Standards to an evaluation device or method is determined by substance and not altered by the label applied to it (e.g., test, assessment, scale, inventory). The Standards should not be used as a checklist, as is emphasized in the section "Cautions to Be Considered in Using the Standards" at the end of this chapter.

Tests differ on a number of dimensions: the mode in which test materials are presented (e.g., paper-and-pencil, oral, or computerized administration); the degree to which stimulus materials are standardized; the type of response format (selection of a response from a set of alternatives, as opposed to the production of a free-form response); and the degree to which test materials are designed to reflect or simulate a particular context. In all cases, however, tests standardize the process by which test takers' responses to test materials are evaluated and scored. As noted in prior versions of the Standards, the same general types of information are needed to judge the soundness of results obtained from using all varieties of tests.

The precise demarcation between measurement devices used in the fields of educational and psychological testing that do and do not fall within the purview of the Standards is difficult to identify. Although the Standards applies most directly to standardized measures generally recognized as "tests," such as measures of ability, aptitude, achievement, attitudes, interests, personality, cognitive functioning, and mental health, the Standards may also be usefully applied in varying degrees to a broad range of less formal assessment techniques. Rigorous application of the Standards to unstandardized employment assessments (such as some job interviews), or to the broad range of unstructured behavior samples used in some forms of clinical and school-based psychological assessment (e.g., an intake interview), or to instructor-made tests that are used to evaluate student performance in education and training, is generally not possible. It is useful to distinguish between devices that lay claim to the concepts and techniques of the field of educational and psychological testing and devices that represent unstandardized or less standardized aids to day-to-day evaluative decisions. Although the principles and concepts underlying the Standards can be fruitfully applied to day-to-day decisions—such as when a business owner interviews a job applicant, a manager evaluates the performance of subordinates, a teacher develops a classroom assessment to monitor student progress toward an educational goal, or a coach evaluates a prospective athlete—it would be overreaching to
expect that the standards of the educational and psychological testing field be followed by those making such decisions. In contrast, a structured interviewing system developed by a psychologist and accompanied by claims that the system has been found to be predictive of job performance in a variety of other settings falls within the purview of the Standards. Adhering to the Standards becomes more critical as the stakes for the test taker and the need to protect the public increase.

Participants in the Testing Process

Educational and psychological testing and assessment involve and significantly affect individuals, institutions, and society as a whole. The individuals affected include students, parents, families, teachers, educational administrators, job applicants, employees, clients, patients, supervisors, executives, and evaluators, among others. The institutions affected include schools, colleges, businesses, industry, psychological clinics, and government agencies. Individuals and institutions benefit when testing helps them achieve their goals. Society, in turn, benefits when testing contributes to the achievement of individual and institutional goals.

There are many participants in the testing process, including, among others, (a) those who prepare and develop the test; (b) those who publish and market the test; (c) those who administer and score the test; (d) those who interpret test results for clients; (e) those who use the test results for some decision-making purpose (including policy makers and those who use data to inform social policy); (f) those who take the test by choice, direction, or necessity; (g) those who sponsor tests, such as boards that represent institutions or governmental agencies that contract with a test developer for a specific instrument or service; and (h) those who select or review tests, evaluating their comparative merits or suitability for the uses proposed. In general, those who are participants in the testing process should have appropriate knowledge of tests and assessments to allow them to make good decisions about which tests to use and how to interpret test results.

The interests of the various parties involved in the testing process may or may not be congruent. For example, when a test is given for counseling purposes or for job placement, the interests of the individual and the institution often coincide. In contrast, when a test is used to select from among many individuals for a highly competitive job or for entry into an educational or training program, the preferences of an applicant may be inconsistent with those of an employer or admissions officer. Similarly, when testing is mandated by a court, the interests of the test taker may be different from those of the party requesting the court order.

Individuals or institutions may serve several roles in the testing process. For example, in clinics the test taker is typically the intended beneficiary of the test results. In some situations the test administrator is an agent of the test developer, and sometimes the test administrator is also the test user. When an organization prepares its own employment tests, it is both the developer and the user. Sometimes a test is developed by a test author but published, marketed, and distributed by an independent publisher, although the publisher may play an active role in the test development process. Roles may also be further subdivided. For example, both an organization and a professional assessor may play a role in the provision of an assessment center. Given this intermingling of roles, it is often difficult to assign precise responsibility for addressing various standards to specific participants in the testing process. Uses of tests and testing practices are improved to the extent that those involved have adequate levels of assessment literacy.

Tests are designed, developed, and used in a wide variety of ways. In some cases, they are developed and "published" for use outside the organization that produces them. In other cases, as with state educational assessments, they are designed by the state educational agency and developed by contractors for exclusive and often one-time use by the state and not really "published" at all. Throughout the Standards, we use the general term test developer, rather than the more specific term test publisher, to denote those involved in
the design and development of tests across the full range of test development scenarios.

The Standards is based on the premise that effective testing and assessment require that all professionals in the testing process possess the knowledge, skills, and abilities necessary to fulfill their roles, as well as an awareness of personal and contextual factors that may influence the testing process. For example, test developers and those selecting tests and interpreting test results need adequate knowledge of psychometric principles such as validity and reliability. They also should obtain any appropriate supervised experience and legislatively mandated practice credentials that are required to perform competently those aspects of the testing process in which they engage. All professionals in the testing process should follow the ethical guidelines of their profession.

Scope of the Revision

This volume serves as a revision of the 1999 Standards for Educational and Psychological Testing. The revision process started with the appointment of a Management Committee, composed of representatives of the three sponsoring organizations responsible for overseeing the general direction of the effort: the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). To guide the revision, the Management Committee solicited and synthesized comments on the 1999 Standards from members of the sponsoring organizations and convened the Joint Committee for the Revision of the 1999 Standards in 2009 to do the actual revision. The Joint Committee also was composed of members of the three sponsoring organizations and was charged by the Management Committee with addressing five major areas: considering the accountability issues for use of tests in educational policy; broadening the concept of accessibility of tests for all examinees; representing more comprehensively the role of tests in the workplace; broadening the role of technology in testing; and providing for a better organizational structure for communicating the standards.

To be responsive to this charge, several actions were taken:

• The chapters "Educational Testing and Assessment" and "Testing in Program Evaluation and Public Policy," in the 1999 version, were rewritten to attend to the issues associated with the uses of tests for educational accountability purposes.

• A new chapter, "Fairness in Testing," was written to emphasize accessibility and fairness as fundamental issues in testing. Specific concerns for fairness are threaded throughout all of the chapters of the Standards.

• The chapter "Testing in Employment and Credentialing" (now "Workplace Testing and Credentialing") was reorganized to more clearly identify when a standard is relevant to employment and/or credentialing.

• The impact of technology was considered throughout the volume. One of the major technology issues identified was the tension between the use of proprietary algorithms and the need for test users to be able to evaluate complex applications in areas such as automated scoring of essays, administering and scoring of innovative item types, and computer-based testing. These issues are considered in the chapter "Test Design and Development."

• A content editor was engaged to help with the technical accuracy and clarity of each chapter and with consistency of language across chapters.

As noted below, chapters in Part I ("Foundations") and Part II ("Operations") now have an "overarching standard" as well as themes under which the individual standards are organized. In addition, the glossary from the 1999 Standards for Educational and Psychological Testing was updated.

As stated above, a major change in the organization of this volume involves the conceptualization of fairness. The 1999 edition had a part devoted to this topic, with separate chapters titled "Fairness in Testing and Test Use," "Testing Individuals of Diverse Linguistic Backgrounds," and "Testing Individuals With Disabilities." In the present edition, the topics addressed in those chapters are combined into a single, comprehensive chapter, and the chapter is located in Part I. This change was made to emphasize that fairness demands that all test takers be treated equitably. Fairness and accessibility, the unobstructed opportunity for all examinees to demonstrate their standing on the construct(s) being measured, are relevant for valid score interpretations for all individuals and subgroups in the intended population of test takers. Because issues related to fairness in testing are not restricted to individuals with diverse linguistic backgrounds or those with disabilities, the chapter was more broadly cast to support appropriate testing experiences for all individuals. Although the examples in the chapter often refer to individuals with diverse linguistic and cultural backgrounds and individuals with disabilities, they also include examples relevant to gender and to older adults, people of various ethnicities and racial backgrounds, and young children, to illustrate potential barriers to fair and equitable assessment for all examinees.

Organization of the Volume

Part I of the Standards, "Foundations," contains standards for validity (chap. 1); reliability/precision and errors of measurement (chap. 2); and fairness in testing (chap. 3). Part II, "Operations," addresses test design and development (chap. 4); scores, scales, norms, score linking, and cut scores (chap. 5); test administration, scoring, reporting, and interpretation (chap. 6); supporting documentation for tests (chap. 7); the rights and responsibilities of test takers (chap. 8); and the rights and responsibilities of test users (chap. 9). Part III, "Testing Applications," treats specific applications in psychological testing and assessment (chap. 10); workplace testing and credentialing (chap. 11); educational testing and assessment (chap. 12); and uses of tests for program evaluation, policy studies, and accountability (chap. 13). Also included is a glossary, which provides definitions for terms as they are used specifically in this volume.

Each chapter begins with introductory text that provides background for the standards that follow. Although the introductory text is at times prescriptive, it should not be interpreted as imposing additional standards.

Categories of Standards

The text of each standard and any accompanying commentary include the conditions under which a standard is relevant. Depending on the context and purpose of test development or use, some standards will be more salient than others. Moreover, some standards are broad in scope, setting forth concerns or requirements relevant to nearly all tests or testing contexts, and other standards are narrower in scope. However, all standards are important in the contexts to which they apply. Any classification that gives the appearance of elevating the general importance of some standards over others could invite neglect of certain standards that need to be addressed in particular situations. Rather than differentiate standards using priority labels, such as "primary," "secondary," or "conditional" (as were used in the 1985 Standards), this edition emphasizes that unless a standard is deemed clearly irrelevant, inappropriate, or technically infeasible for a particular use, all standards should be met, making all of them essentially "primary" for that context.

Unless otherwise specified in a standard or commentary, and with the caveats outlined below, standards should be met before operational test use. Each standard should be carefully considered to determine its applicability to the testing context under consideration. In a given case there may be a sound professional reason that adherence to the standard is inappropriate. There may also be occasions when technical feasibility influences whether a standard can be met prior to operational test use. For example, some standards may call for analyses of data that are not available at the point of initial operational test use. In other cases, traditional quantitative analyses may not be feasible due to small sample sizes. However, there may be other methodologies that could be used to gather information to support the standard, such as small sample methodologies, qualitative
studies, focus groups, and even logical analysis. Some of the individual standards and intro
In such instances, test developers and users should ductory text refer to groups and subgroups. The
make a good faith effort to provide the kinds of term g)'0Up is generally used to identify the full
data called for in the standard to support the examinee population, referred to as the intended
valid interpretations o f the test results for their examinee group, the intended test-taker group, the
intended purposes. If test developers, users, and, intended examinee population, or the population.
when applicable, sponsors have deemed a standard A subgroup includes members of the larger group
to be inapplicable or technically infeasible, they who are identifiable in some way that is relevant
should be able, if called upon, to explain the to the standard being applied. When data or
basis for their decision. However, there is no ex analyses are indicated for various subgroups, they
pectation that documentation o f all such decisions are generally referred to as subgroups within the
be routinely available. intended examinee group, groups from the intended
examinee population, or relevant subgroups.
In applying the Standards, it is important to
Presentation of Individual Standards
bear in mind that the intended referent subgroups
Individual standards are presented after an intro for the individual standards are context specific.
ductory text that presents some key concepts for For example, referent ethnic subgroups to be con
interpreting and applying the standards. In many sidered during the design phase of a test would
cases, the standards themselves are coupled with depend on the expected ethnic composition of
one or more comments. These comments are in the intended test group. In addition, many more
tended to amplify, clarify, or provide examples to subgroups could be relevant to a standard dealing
aid in the interpretation o f the meaning o f the with the design of fair test questions than to a
standards. The standards often direct a developer standard dealing with adaptations of a tests format.
or user to implement certain actions. Depending Users of the Standards will need to exercise pro
on the type o f test, it is sometimes not clear in the fessional judgment when deciding which particular
statement of a standard to whom the standard is subgroups are relevant for the application o f a
directed. For example, Standard 1.2 in the chapter specific standard.
“Validity” states: In deciding which subgroups are relevant for
a particular standard, the following factors, among
A rationale should be presented for
others, may be considered: credible evidence that
each intended interpretation of test
suggests a group may face particular construct-
scores for a given use, together with
irrelevant barriers to test performance, statutes or
a summary of the evidence and
regulations that designate a group as relevant to
theory bearing on the intended in
score interpretations, and large numbers o f indi
terpretation.
viduals in the group within the general population.
The party responsible for implementing this stan Depending on the context, relevant subgroups
dard is the party or person who is articulating the might include, for example, males and females,
recommended interpretation of the test scores. individuals of differing socioeconomic status, in
This may be a test user, a test developer, or dividuals differing by race and/or ethnicity, indi
someone who is planning to use the test scores viduals with different sexual orientations, individuals
for a particular purpose, such as making classification with diverse linguistic and cultural backgrounds
or licensure decisions. It often is not possible in (particularly when testing extends across interna
the statement of a standard to specify who is re tional borders), individuals with disabilities, young
sponsible for such actions; it is intended that the children, or older adults.
party or person performing the action specified Numerous examples are provided in the Stan
in the standard be the party responsible for dards to clarify points or to provide illustrations
adhering to the standard. o f how to apply a particular standard. Many of
the examples are drawn from research with students with disabilities or persons from diverse language or cultural groups; fewer, from research with other identifiable groups, such as young children or adults. There was also a purposeful effort to provide examples for educational, psychological, and industrial settings.

The standards in each chapter in Parts I and II ("Foundations" and "Operations") are introduced by an overarching standard, designed to convey the central intent of the chapter. These overarching standards are always numbered with .0 following the chapter number. For example, the overarching standard in chapter 1 is numbered 1.0. The overarching standards summarize guiding principles that are applicable to all tests and test uses. Further, the themes and standards in each chapter are ordered to be consistent with the sequence of the material in the introductory text for the chapter. Because some users of the Standards may turn only to chapters directly relevant to a given application, certain standards are repeated in different chapters, particularly in Part III, "Testing Applications." When such repetition occurs, the essence of the standard is the same. Only the wording, area of application, or level of elaboration in the comment is changed.

Cautions to Be Considered in Using the Standards

In addition to the legal disclaimer set forth above, several cautions are important if we are to avoid misinterpretations, misapplications, and misuses of the Standards:

• Evaluating the acceptability of a test or test application does not rest on the literal satisfaction of every standard in this document, and the acceptability of a test or test application cannot be determined by using a checklist. Specific circumstances affect the importance of individual standards, and individual standards should not be considered in isolation. Therefore, evaluating acceptability depends on (a) professional judgment that is based on a knowledge of behavioral science, psychometrics, and the relevant standards in the professional field to which the test applies; (b) the degree to which the intent of the standard has been satisfied by the test developer and user; (c) the alternative measurement devices that are readily available; (d) research and experiential evidence regarding the feasibility of meeting the standard; and (e) applicable laws and regulations.

• When tests are at issue in legal proceedings and other situations requiring expert witness testimony, it is essential that professional judgment be based on the accepted corpus of knowledge in determining the relevance of particular standards in a given situation. The intent of the Standards is to offer guidance for such judgments.

• Claims by test developers or test users that a test, manual, or procedure satisfies or follows the standards in this volume should be made with care. It is appropriate for developers or users to state that efforts were made to adhere to the Standards, and to provide documents describing and supporting those efforts. Blanket claims without supporting evidence should not be made.

• The standards are concerned with a field that is rapidly evolving. Consequently, there is a continuing need to monitor changes in the field and to revise this document as knowledge develops. The use of older versions of the Standards may be a disservice to test users and test takers.

• Requiring the use of specific technical methods is not the intent of the Standards. For example, where specific statistical reporting requirements are mentioned, the phrase "or generally accepted equivalent" should always be understood.
PART I
Foundations
1. VALIDITY
BACKGROUND
Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing tests and evaluating tests. The process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores for proposed uses that are evaluated, not the test itself. When test scores are interpreted in more than one way (e.g., both to describe a test taker's current level of the attribute being measured and to make a prediction about a future outcome), each intended interpretation must be validated. Statements about validity should refer to particular interpretations for specified uses. It is incorrect to use the unqualified phrase "the validity of the test."

Evidence of the validity of a given interpretation of test scores for a specified use is a necessary condition for the justifiable use of the test. Where sufficient evidence of validity exists, the decision as to whether to actually administer a particular test generally takes additional considerations into account. These include cost-benefit considerations, framed in different subdisciplines as utility analysis or as consideration of negative consequences of test use, and a weighing of any negative consequences against the positive consequences of test use.

Validation logically begins with an explicit statement of the proposed interpretation of test scores, along with a rationale for the relevance of the interpretation to the proposed use. The proposed interpretation includes specifying the construct the test is intended to measure. The term construct is used in the Standards to refer to the concept or characteristic that a test is designed to measure. Rarely, if ever, is there a single possible meaning that can be attached to a test score or a pattern of test responses. Thus, it is always incumbent on test developers and users to specify the construct interpretation that will be made on the basis of the score or response pattern.

Examples of constructs currently used in assessment include mathematics achievement, general cognitive ability, racial identity attitudes, depression, and self-esteem. To support test development, the proposed construct interpretation is elaborated by describing its scope and extent and by delineating the aspects of the construct that are to be represented. The detailed description provides a conceptual framework for the test, delineating the knowledge, skills, abilities, traits, interests, processes, competencies, or characteristics to be assessed. Ideally, the framework indicates how the construct as represented is to be distinguished from other constructs and how it should relate to other variables.

The conceptual framework is partially shaped by the ways in which test scores will be used. For instance, a test of mathematics achievement might be used to place a student in an appropriate program of instruction, to endorse a high school diploma, or to inform a college admissions decision. Each of these uses implies a somewhat different interpretation of the mathematics achievement test scores: that a student will benefit from a particular instructional intervention, that a student has mastered a specified curriculum, or that a student is likely to be successful with college-level work. Similarly, a test of conscientiousness might be used for psychological counseling, to inform a decision about employment, or for the basic scientific purpose of elaborating the construct of conscientiousness. Each of these potential uses shapes the specified framework and the proposed interpretation of the test's scores and also can have implications for test development and evaluation. Validation can be viewed as a process of constructing and evaluating arguments for and against the intended interpretation of test scores and their relevance to the proposed use. The conceptual framework points to the kinds of
evidence that might be collected to evaluate the proposed interpretation in light of the purposes of testing. As validation proceeds, and new evidence regarding the interpretations that can and cannot be drawn from test scores becomes available, revisions may be needed in the test, in the conceptual framework that shapes it, and even in the construct underlying the test.

The wide variety of tests and circumstances makes it natural that some types of evidence will be especially critical in a given case, whereas other types will be less useful. Decisions about what types of evidence are important for the validation argument in each instance can be clarified by developing a set of propositions or claims that support the proposed interpretation for the particular purpose of testing. For instance, when a mathematics achievement test is used to assess readiness for an advanced course, evidence for the following propositions might be relevant: (a) that certain skills are prerequisite for the advanced course; (b) that the content domain of the test is consistent with these prerequisite skills; (c) that test scores can be generalized across relevant sets of items; (d) that test scores are not unduly influenced by ancillary variables, such as writing ability; (e) that success in the advanced course can be validly assessed; and (f) that test takers with high scores on the test will be more successful in the advanced course than test takers with low scores on the test. Examples of propositions in other testing contexts might include, for instance, the proposition that test takers with high general anxiety scores experience significant anxiety in a range of settings, the proposition that a child's score on an intelligence scale is strongly related to the child's academic performance, or the proposition that a certain pattern of scores on a neuropsychological battery indicates impairment that is characteristic of brain injury. The validation process evolves as these propositions are articulated and evidence is gathered to evaluate their soundness.

Identifying the propositions implied by a proposed test interpretation can be facilitated by considering rival hypotheses that may challenge the proposed interpretation. It is also useful to consider the perspectives of different interested parties, existing experience with similar tests and contexts, and the expected consequences of the proposed test use. A finding of unintended consequences of test use may also prompt a consideration of rival hypotheses. Plausible rival hypotheses can often be generated by considering whether a test measures less or more than its proposed construct. Such considerations are referred to as construct underrepresentation (or construct deficiency) and construct-irrelevant variance (or construct contamination), respectively.

Construct underrepresentation refers to the degree to which a test fails to capture important aspects of the construct. It implies a narrowed meaning of test scores because the test does not adequately sample some types of content, engage some psychological processes, or elicit some ways of responding that are encompassed by the intended construct. Take, for example, a test intended as a comprehensive measure of anxiety. A particular test might underrepresent the intended construct because it measures only physiological reactions and not emotional, cognitive, or situational components. As another example, a test of reading comprehension intended to measure children's ability to read and interpret stories with understanding might not contain a sufficient variety of reading passages or might ignore a common type of reading material.

Construct-irrelevant variance refers to the degree to which test scores are affected by processes that are extraneous to the test's intended purpose. The test scores may be systematically influenced to some extent by processes that are not part of the construct. In the case of a reading comprehension test, these might include material too far above or below the level intended to be tested, an emotional reaction to the test content, familiarity with the subject matter of the reading passages on the test, or the writing skill needed to compose a response. Depending on the detailed definition of the construct, vocabulary knowledge or reading speed might also be irrelevant components. On a test designed to measure anxiety, a response bias to underreport one's anxiety might be considered a source of construct-irrelevant variance. In the case
of a mathematics test, it might include overreliance on reading comprehension skills that English language learners may be lacking. On a test designed to measure science knowledge, test-taker internalizing of gender-based stereotypes about women in the sciences might be a source of construct-irrelevant variance.

Nearly all tests leave out elements that some potential users believe should be measured and include some elements that some potential users consider inappropriate. Validation involves careful attention to possible distortions in meaning arising from inadequate representation of the construct and also to aspects of measurement, such as test format, administration conditions, or language level, that may materially limit or qualify the interpretation of test scores for various groups of test takers. That is, the process of validation may lead to revisions in the test, in the conceptual framework of the test, or both. Interpretations drawn from the revised test would again need validation.

When propositions have been identified that would support the proposed interpretation of test scores, one can proceed with validation by obtaining empirical evidence, examining relevant literature, and/or conducting logical analyses to evaluate each of the propositions. Empirical evidence may include both local evidence, produced within the contexts where the test will be used, and evidence from similar testing applications in other settings. Use of existing evidence from similar tests and contexts can enhance the quality of the validity argument, especially when data for the test and context in question are limited.

Because an interpretation for a given use typically depends on more than one proposition, strong evidence in support of one part of the interpretation in no way diminishes the need for evidence to support other parts of the interpretation. For example, when an employment test is being considered for selection, a strong predictor-criterion relationship in an employment setting is ordinarily not sufficient to justify use of the test. One should also consider the appropriateness and meaningfulness of the criterion measure, the appropriateness of the testing materials and procedures for the full range of applicants, and the consistency of the support for the proposed interpretation across groups. Professional judgment guides decisions regarding the specific forms of evidence that can best support the intended interpretation for a specified use. As in all scientific endeavors, the quality of the evidence is paramount. A few pieces of solid evidence regarding a particular proposition are better than numerous pieces of evidence of questionable quality. The determination that a given test interpretation for a specific purpose is warranted is based on professional judgment that the preponderance of the available evidence supports that interpretation. The quality and quantity of evidence sufficient to reach this judgment may differ for test uses depending on the stakes involved in the testing. A given interpretation may not be warranted either as a result of insufficient evidence in support of it or as a result of credible evidence against it.

Validation is the joint responsibility of the test developer and the test user. The test developer is responsible for furnishing relevant evidence and a rationale in support of any test score interpretations for specified uses intended by the developer. The test user is ultimately responsible for evaluating the evidence in the particular setting in which the test is to be used. When a test user proposes an interpretation or use of test scores that differs from those supported by the test developer, the responsibility for providing validity evidence in support of that interpretation for the specified use is the responsibility of the user. It should be noted that important contributions to the validity evidence may be made as other researchers report findings of investigations that are related to the meaning of scores on the test.

Sources of Validity Evidence

The following sections outline various sources of evidence that might be used in evaluating the validity of a proposed interpretation of test scores for a particular use. These sources of evidence may illuminate different aspects of validity,
but they do not represent distinct types of validity. Validity is a unitary concept. It is the degree to which all the accumulated evidence supports the intended interpretation of test scores for the proposed use. Like the 1999 Standards, this edition refers to types of validity evidence, rather than distinct types of validity. To emphasize this distinction, the treatment that follows does not follow historical nomenclature (i.e., the use of the terms content validity or predictive validity).

As the discussion in the prior section emphasizes, each type of evidence presented below is not required in all settings. Rather, support is needed for each proposition that underlies a proposed test interpretation for a specified use. A proposition that a test is predictive of a given criterion can be supported without evidence that the test samples a particular content domain. In contrast, a proposition that a test covers a representative sample of a particular curriculum may be supported without evidence that the test predicts a given criterion. However, a more complex set of propositions, e.g., that a test samples a specified domain and thus is predictive of a criterion reflecting a related domain, will require evidence supporting both parts of this set of propositions. Test developers are also expected to make the case that the scores are not unduly influenced by construct-irrelevant variance (see chap. 3 for detailed treatment of issues related to construct-irrelevant variance). In general, adequate support for proposed interpretations for specific uses will require multiple sources of evidence.

The position developed above also underscores the fact that if a given test is interpreted in multiple ways for multiple uses, the propositions underlying these interpretations for different uses also are likely to differ. Support is needed for the propositions underlying each interpretation for a specific use. Evidence supporting the interpretation of scores on a mathematics achievement test for placing students in subsequent courses (i.e., evidence that the test interpretation is valid for its intended purpose) does not permit inferring validity for other purposes (e.g., promotion or teacher evaluation).

Evidence Based on Test Content

Important validity evidence can be obtained from an analysis of the relationship between the content of a test and the construct it is intended to measure. Test content refers to the themes, wording, and format of the items, tasks, or questions on a test. Administration and scoring may also be relevant to content-based evidence. Test developers often work from a specification of the content domain. The content specification carefully describes the content in detail, often with a classification of areas of content and types of items. Evidence based on test content can include logical or empirical analyses of the adequacy with which the test content represents the content domain and of the relevance of the content domain to the proposed interpretation of test scores. Evidence based on content can also come from expert judgments of the relationship between parts of the test and the construct. For example, in developing a licensure test, the major facets that are relevant to the purpose for which the occupation is regulated can be specified, and experts in that occupation can be asked to assign test items to the categories defined by those facets. These or other experts can then judge the representativeness of the chosen set of items.

Some tests are based on systematic observations of behavior. For example, a list of the tasks constituting a job domain may be developed from observations of behavior in a job, together with judgments of subject matter experts. Expert judgments can be used to assess the relative importance, criticality, and/or frequency of the various tasks. A job sample test can then be constructed from a random or stratified sampling of tasks rated highly on these characteristics. The test can then be administered under standardized conditions in an off-the-job setting.

The appropriateness of a given content domain is related to the specific inferences to be made from test scores. Thus, when considering an available test for a purpose other than that for which it was first developed, it is especially important to evaluate the appropriateness of the original content domain for the proposed new
purpose. For example, a test given for research purposes to compare student achievement across states in a given domain may properly also cover material that receives little or no attention in the curriculum. Policy makers can then evaluate student achievement with respect to both content neglected and content addressed. On the other hand, when student mastery of a delivered curriculum is tested for purposes of informing decisions about individual students, such as promotion or graduation, the framework elaborating a content domain is appropriately limited to what students have had an opportunity to learn from the curriculum as delivered.

Evidence about content can be used, in part, to address questions about differences in the meaning or interpretation of test scores across relevant subgroups of test takers. Of particular concern is the extent to which construct underrepresentation or construct-irrelevant variance may give an unfair advantage or disadvantage to one or more subgroups of test takers. For example, in an employment test, the use of vocabulary more complex than needed on the job may be a source of construct-irrelevant variance for English language learners or others. Careful review of the construct and test content domain by a diverse panel of experts may point to potential sources of irrelevant difficulty (or easiness) that require further investigation.

Content-oriented evidence of validation is at the heart of the process in the educational arena known as alignment, which involves evaluating the correspondence between student learning standards and test content. Content-sampling issues in the alignment process include evaluating whether test content appropriately samples the domain set forward in curriculum standards, whether the cognitive demands of test items correspond to the level reflected in the student learning standards (e.g., content standards), and whether the test avoids the inclusion of features irrelevant to the standard that is the intended target of each test item.

Evidence Based on Response Processes

Some construct interpretations involve more or less explicit assumptions about the cognitive processes engaged in by test takers. Theoretical and empirical analyses of the response processes of test takers can provide evidence concerning the fit between the construct and the detailed nature of the performance or response actually engaged in by test takers. For instance, if a test is intended to assess mathematical reasoning, it becomes important to determine whether test takers are, in fact, reasoning about the material given instead of following a standard algorithm applicable only to the specific items on the test.

Evidence based on response processes generally comes from analyses of individual responses. Questioning test takers from various groups making up the intended test-taking population about their performance strategies or responses to particular items can yield evidence that enriches the definition of a construct. Maintaining records that monitor the development of a response to a writing task, through successive written drafts or electronically monitored revisions, for instance, also provides evidence of process. Documentation of other aspects of performance, like eye movements or response times, may also be relevant to some constructs. Inferences about processes involved in performance can also be developed by analyzing the relationship among parts of the test and between the test and other variables. Wide individual differences in process can be revealing and may lead to reconsideration of certain test formats.

Evidence of response processes can contribute to answering questions about differences in meaning or interpretation of test scores across relevant subgroups of test takers. Process studies involving test takers from different subgroups can assist in determining the extent to which capabilities irrelevant or ancillary to the construct may be differentially influencing test takers' test performance.

Studies of response processes are not limited to the test taker. Assessments often rely on observers or judges to record and/or evaluate test takers' performances or products. In such cases, relevant validity evidence includes the extent to which the processes of observers or judges are consistent with the intended interpretation of scores. For instance, if judges are expected to apply particular criteria in scoring test takers' performances, it is
important to ascertain whether they are, in fact, applying the appropriate criteria and not being influenced by factors that are irrelevant to the intended interpretation (e.g., quality of handwriting is irrelevant to judging the content of a written essay). Thus, validation may include empirical studies of how observers or judges record and evaluate data along with analyses of the appropriateness of these processes to the intended interpretation or construct definition.

While evidence about response processes may be central in settings where explicit claims about response processes are made by test developers or where inferences about responses are made by test users, there are many other cases where claims about response processes are not part of the validity argument. In some cases, multiple response processes are available for solving the problems of interest, and the construct of interest is only concerned with whether the problem was solved correctly. As a simple example, there may be multiple possible routes to obtaining the correct solution to a mathematical problem.

Evidence Based on Internal Structure

Analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based. The conceptual framework for a test may imply a single dimension of behavior, or it may posit several components that are each expected to be homogeneous, but that are also distinct from each other. For example, a measure of discomfort on a health survey might assess both physical and emotional health. The extent to which item interrelationships bear out the presumptions of the framework would be relevant to validity.

The specific types of analyses and their interpretation depend on how the test will be used. For example, if a particular application posited a series of increasingly difficult test components, empirical evidence of the extent to which response patterns conformed to this expectation would be provided. A theory that posited unidimensionality would imply that item interrelationships form the basis for an estimate of score reliability, but such an index would be inappropriate for tests with a more complex internal structure.

Some studies of the internal structure of tests are designed to show whether particular items may function differently for identifiable subgroups of test takers (e.g., racial/ethnic or gender subgroups). Differential item functioning occurs when different groups of test takers with similar overall ability, or similar status on an appropriate criterion, have, on average, systematically different responses to a particular item. This issue is discussed in chapter 3. However, differential item functioning is not always a flaw or weakness. Subsets of items that have a specific characteristic in common (e.g., specific content, task representation) may function differently for different groups of similarly scoring test takers. This indicates a kind of multidimensionality that may be unexpected or may conform to the test framework.

Evidence Based on Relations to Other Variables

In many cases, the intended interpretation for a given use implies that the construct should be related to some other variables, and, as a result, analyses of the relationship of test scores to variables external to the test provide another important source of validity evidence. External variables may include measures of some criteria that the test is expected to predict, as well as relationships to other tests hypothesized to measure the same constructs, and tests measuring related or different constructs. Measures other than test scores, such as performance criteria, are often used in employment settings. Categorical variables, including group membership variables, become relevant when the theory underlying a proposed test use suggests that group differences should be present or absent if a proposed test score interpretation is to be supported. Evidence based on relationships with other variables provides evidence about the degree to which these relationships are consistent with the construct underlying the proposed test score interpretations.
would call for evidence of item homogeneity. In Convergent and discriminant evidence. Rela
this case, the number o f items and item interrela tionships between test scores and other measures
intended to assess the same or similar constructs provide convergent evidence, whereas relationships between test scores and measures purportedly of different constructs provide discriminant evidence. For instance, within some theoretical frameworks, scores on a multiple-choice test of reading comprehension might be expected to relate closely (convergent evidence) to other measures of reading comprehension based on other methods, such as essay responses. Conversely, test scores might be expected to relate less closely (discriminant evidence) to measures of other skills, such as logical reasoning. Relationships among different methods of measuring the construct can be especially helpful in sharpening and elaborating score meaning and interpretation.

Evidence of relations with other variables can involve experimental as well as correlational evidence. Studies might be designed, for instance, to investigate whether scores on a measure of anxiety improve as a result of some psychological treatment or whether scores on a test of academic achievement differentiate between instructed and noninstructed groups. If performance increases due to short-term coaching are viewed as a threat to validity, it would be useful to investigate whether coached and uncoached groups perform differently.

Test-criterion relationships. Evidence of the relation of test scores to a relevant criterion may be expressed in various ways, but the fundamental question is always, how accurately do test scores predict criterion performance? The degree of accuracy and the score range within which accuracy is needed depend on the purpose for which the test is used.

The criterion variable is a measure of some attribute or outcome that is operationally distinct from the test. Thus, the test is not a measure of a criterion, but rather is a measure hypothesized as a potential predictor of that targeted criterion. Whether a test predicts a given criterion in a given context is a testable hypothesis. The criteria that are of interest are determined by test users, for example, administrators in a school system or managers of a firm. The choice of the criterion and the measurement procedures used to obtain criterion scores are of central importance. The credibility of a test-criterion study depends on the relevance, reliability, and validity of the interpretation based on the criterion measure for a given testing application.

Historically, two designs, often called predictive and concurrent, have been distinguished for evaluating test-criterion relationships. A predictive study indicates the strength of the relationship between test scores and criterion scores that are obtained at a later time. A concurrent study obtains test scores and criterion information at about the same time. When prediction is actually contemplated, as in academic admission or employment settings, or in planning rehabilitation regimens, predictive studies can retain the temporal differences and other characteristics of the practical situation. Concurrent evidence, which avoids temporal changes, is particularly useful for psychodiagnostic tests or in investigating alternative measures of some specified construct for which an accepted measurement procedure already exists. The choice of a predictive or concurrent research strategy in a given domain is also usefully informed by prior research evidence regarding the extent to which predictive and concurrent studies in that domain yield the same or different results.

Test scores are sometimes used in allocating individuals to different treatments in a way that is advantageous for the institution and/or for the individuals. Examples would include assigning individuals to different jobs within an organization, or determining whether to place a given student in a remedial class or a regular class. In that context, evidence is needed to judge the suitability of using a test when classifying or assigning a person to one job versus another or to one treatment versus another. Support for the validity of the classification procedure is provided by showing that the test is useful in determining which persons are likely to profit differentially from one treatment or another. It is possible for tests to be highly predictive of performance for different education programs or jobs without providing the information necessary to make a comparative judgment of the efficacy of assignments or treatments. In general, decision rules for selection
or placement are also influenced by the number of persons to be accepted or the numbers that can be accommodated in alternative placement categories (see chap. 11).

Evidence about relations to other variables is also used to investigate questions of differential prediction for subgroups. For instance, a finding that the relation of test scores to a relevant criterion variable differs from one subgroup to another may imply that the meaning of the scores is not the same for members of the different groups, perhaps due to construct underrepresentation or construct-irrelevant sources of variance. However, the difference may also imply that the criterion has different meaning for different groups. The differences in test-criterion relationships can also arise from measurement error, especially when group means differ, so such differences do not necessarily indicate differences in score meaning. See the discussion of fairness in chapter 3 for more extended consideration of possible courses of action when scores have different meanings for different groups.

Validity generalization. An important issue in educational and employment settings is the degree to which validity evidence based on test-criterion relations can be generalized to a new situation without further study of validity in that new situation. When a test is used to predict the same or similar criteria (e.g., performance of a given job) at different times or in different places, it is typically found that observed test-criterion correlations vary substantially. In the past, this has been taken to imply that local validation studies are always required. More recently, a variety of approaches to generalizing evidence from other settings has been developed, with meta-analysis the most widely used in the published literature. In particular, meta-analyses have shown that in some domains, much of this variability may be due to statistical artifacts such as sampling fluctuations and variations across validation studies in the ranges of test scores and in the reliability of criterion measures. When these and other influences are taken into account, it may be found that the remaining variability in validity coefficients is relatively small. Thus, statistical summaries of past validation studies in similar situations may be useful in estimating test-criterion relationships in a new situation. This practice is referred to as the study of validity generalization.

In some circumstances, there is a strong basis for using validity generalization. This would be the case where the meta-analytic database is large, where the meta-analytic data adequately represent the type of situation to which one wishes to generalize, and where correction for statistical artifacts produces a clear and consistent pattern of validity evidence. In such circumstances, the informational value of a local validity study may be relatively limited if not actually misleading, especially if its sample size is small. In other circumstances, the inferential leap required for generalization may be much larger. The meta-analytic database may be small, the findings may be less consistent, or the new situation may involve features markedly different from those represented in the meta-analytic database. In such circumstances, situation-specific validity evidence will be relatively more informative. Although research on validity generalization shows that results of a single local validation study may be quite imprecise, there are situations where a single study, carefully done, with adequate sample size, provides sufficient evidence to support or reject test use in a new situation. This highlights the importance of examining carefully the comparative informational value of local versus meta-analytic studies.

In conducting studies of the generalizability of validity evidence, the prior studies that are included may vary according to several situational facets. Some of the major facets are (a) differences in the way the predictor construct is measured, (b) the type of job or curriculum involved, (c) the type of criterion measure used, (d) the type of test takers, and (e) the time period in which the study was conducted. In any particular study of validity generalization, any number of these facets might vary, and a major objective of the study is to determine empirically the extent to which variation in these facets affects the test-criterion correlations obtained.
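The meta-analytic reasoning described above can be illustrated with a minimal "bare-bones" summary of validation results: a sample-size-weighted mean validity coefficient, with the variance expected from sampling error subtracted from the observed between-study variance. The sketch below is illustrative only; the function name and all study results are hypothetical, and a full validity generalization analysis would also correct for artifacts such as range restriction and criterion unreliability.

```python
# Illustrative sketch only: a "bare-bones" meta-analytic summary of
# test-criterion correlations, in the spirit of validity generalization.
# The function name and the study data below are hypothetical.

def bare_bones_meta(rs, ns):
    """Sample-size-weighted mean validity, plus the between-study variance
    remaining after expected sampling-error variance is subtracted."""
    total_n = sum(ns)
    mean_r = sum(r * n for r, n in zip(rs, ns)) / total_n
    # Weighted observed variance of the correlations across studies.
    var_obs = sum(n * (r - mean_r) ** 2 for r, n in zip(rs, ns)) / total_n
    # Expected sampling-error variance for a correlation coefficient.
    avg_n = total_n / len(ns)
    var_error = (1 - mean_r ** 2) ** 2 / (avg_n - 1)
    return mean_r, var_obs, max(var_obs - var_error, 0.0)

# Hypothetical local validation studies: observed r and sample size n.
rs = [0.25, 0.31, 0.18, 0.35, 0.28]
ns = [80, 120, 60, 150, 100]
mean_r, var_obs, var_residual = bare_bones_meta(rs, ns)
print(f"weighted mean r = {mean_r:.3f}")
print(f"observed variance = {var_obs:.4f}, residual variance = {var_residual:.4f}")
```

A residual variance near zero would suggest that much of the observed variation across studies reflects sampling error, consistent with the artifact argument above; substantial residual variance points toward real situational moderators.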
The extent to which predictive or concurrent validity evidence can be generalized to new situations is in large measure a function of accumulated research. Although evidence of generalization can often help to support a claim of validity in a new situation, the extent of available data limits the degree to which the claim can be sustained.

The above discussion focuses on the use of cumulative databases to estimate predictor-criterion relationships. Meta-analytic techniques can also be used to summarize other forms of data relevant to other inferences one may wish to draw from test scores in a particular application, such as effects of coaching and effects of certain alterations in testing conditions for test takers with specified disabilities. Gathering evidence about how well validity findings can be generalized across groups of test takers is an important part of the validation process. When the evidence suggests that inferences from test scores can be drawn for some subgroups but not for others, pursuing options such as those discussed in chapter 3 can reduce the risk of unfair test use.

Evidence for Validity and Consequences of Testing

Some consequences of test use follow directly from the interpretation of test scores for uses intended by the test developer. The validation process involves gathering evidence to evaluate the soundness of these proposed interpretations for their intended uses.

Other consequences may also be part of a claim that extends beyond the interpretation or use of scores intended by the test developer. For example, a test of student achievement might provide data for a system intended to identify and improve lower-performing schools. The claim that testing results, used this way, will result in improved student learning may rest on propositions about the system or intervention itself, beyond propositions based on the meaning of the test itself. Consequences may point to the need for evidence about components of the system that will go beyond the interpretation of test scores as a valid measure of student achievement.

Still other consequences are unintended, and are often negative. For example, school district or statewide educational testing on selected subjects may lead teachers to focus on those subjects at the expense of others. As another example, a test developed to measure knowledge needed for a given job may result in lower passing rates for one group than for another. Unintended consequences merit close examination. While not all consequences can be anticipated, in some cases factors such as prior experiences in other settings offer a basis for anticipating and proactively addressing unintended consequences. See chapter 12 for additional examples from educational settings. In some cases, actions to address one consequence bring about other consequences. One example involves the notion of "missed opportunities," as in the case of moving to computerized scoring of student essays to increase grading consistency, thus forgoing the educational benefits of addressing the same problem by training teachers to grade more consistently. These types of consideration of consequences of testing are discussed further below.

Interpretation and uses of test scores intended by test developers. Tests are commonly administered in the expectation that some benefit will be realized from the interpretation and use of the scores intended by the test developers. A few of the many possible benefits that might be claimed are selection of efficacious therapies, placement of workers in suitable jobs, prevention of unqualified individuals from entering a profession, or improvement of classroom instructional practices. A fundamental purpose of validation is to indicate whether these specific benefits are likely to be realized. Thus, in the case of a test used in placement decisions, the validation would be informed by evidence that alternative placements, in fact, are differentially beneficial to the persons and the institution. In the case of employment testing, if a test publisher asserts that use of the test will result in reduced employee training costs, improved workforce efficiency, or some other benefit, then the validation would be informed by evidence in support of that proposition.

It is important to note that the validity of test score interpretations depends not only on the uses
of the test scores but specifically on the claims that underlie the theory of action for these uses. For example, consider a school district that wants to determine children's readiness for kindergarten, and so administers a test battery and screens out students with low scores. If higher scores do, in fact, predict higher performance on key kindergarten tasks, the claim that use of the test scores for screening results in higher performance on these key tasks is supported, and the interpretation of the test scores as a predictor of kindergarten readiness would be valid. If, however, the claim were made that use of the test scores for screening would result in the greatest benefit to students, the interpretation of test scores as indicators of readiness for kindergarten might not be valid, because students with low scores might actually benefit more from access to kindergarten. In this case, different evidence is needed to support different claims that might be made about the same use of the screening test (for example, evidence that students below a certain cut score benefit more from another assignment than from assignment to kindergarten). The test developer is responsible for the validation of the interpretation that the test scores assess the indicated readiness skills. The school district is responsible for the validation of the proper interpretation of the readiness test scores and for evaluation of the policy of using the readiness test for placement/admissions decisions.

Claims made about test use that are not directly based on test score interpretations. Claims are sometimes made for benefits of testing that go beyond the direct interpretations or uses of the test scores themselves that are specified by the test developers. Educational tests, for example, may be advocated on the grounds that their use will improve student motivation to learn or encourage changes in classroom instructional practices by holding educators accountable for valued learning outcomes. Where such claims are central to the rationale advanced for testing, the direct examination of testing consequences necessarily assumes even greater importance. Those making the claims are responsible for evaluation of the claims. In some cases, such information can be drawn from existing data collected for purposes other than test validation; in other cases new information will be needed to address the impact of the testing program.

Consequences that are unintended. Test score interpretation for a given use may result in unintended consequences. A key distinction is between consequences that result from a source of error in the intended test score interpretation for a given use and consequences that do not result from error in test score interpretation. Examples of each are given below.

As discussed at some length in chapter 3, one domain in which unintended negative consequences of test use are at times observed involves test score differences for groups defined in terms of race/ethnicity, gender, age, and other characteristics. In such cases, however, it is important to distinguish between evidence that is directly relevant to validity and evidence that may inform decisions about social policy but falls outside the realm of validity. For example, concerns have been raised about the effect of group differences in test scores on employment selection and promotion, the placement of children in special education classes, and the narrowing of a school's curriculum to exclude learning objectives that are not assessed. Although information about the consequences of testing may influence decisions about test use, such consequences do not, in and of themselves, detract from the validity of intended interpretations of the test scores. Rather, judgments of validity or invalidity in the light of testing consequences depend on a more searching inquiry into the sources of those consequences.

Take, as an example, a finding of different hiring rates for members of different groups as a consequence of using an employment test. If the difference is due solely to an unequal distribution of the skills the test purports to measure, and if those skills are, in fact, important contributors to job performance, then the finding of group differences per se does not imply any lack of validity for the intended interpretation. If, however, the test measured skill differences unrelated to job performance (e.g., a sophisticated reading test for
a job that required only minimal functional literacy), or if the differences were due to the test's sensitivity to some test-taker characteristic not intended to be part of the test construct, then the intended interpretation of test scores as predicting job performance in a comparable manner for all groups of applicants would be rendered invalid, even if test scores correlated positively with some measure of job performance. If a test covers most of the relevant content domain but omits some areas, the content coverage might be judged adequate for some purposes. However, if it is found that excluding some components that could readily be assessed has a noticeable impact on selection rates for groups of interest (e.g., subgroup differences are found to be smaller on excluded components than on included components), the intended interpretation of test scores as predicting job performance in a comparable manner for all groups of applicants would be rendered invalid. Thus, evidence about consequences is relevant to validity when it can be traced to a source of invalidity such as construct underrepresentation or construct-irrelevant components. Evidence about consequences that cannot be so traced is not relevant to the validity of the intended interpretations of the test scores.

As another example, consider the case where research supports an employer's use of a particular test in the personality domain (i.e., the test proves to be predictive of an aspect of subsequent job performance), but it is found that some applicants form a negative opinion of the organization due to the perception that the test invades personal privacy. Thus, there is an unintended negative consequence of test use, but one that is not due to a flaw in the intended interpretation of test scores as predicting subsequent performance. Some employers faced with this situation may conclude that this negative consequence is grounds for discontinuing test use; others may conclude that the benefits gained by screening applicants outweigh this negative consequence. As this example illustrates, a consideration of consequences can influence a decision about test use, even though the consequence is independent of the validity of the intended test score interpretation. The example also illustrates that different decision makers may make different value judgments about the impact of consequences on test use.

The fact that the validity evidence supports the intended interpretation of test scores for use in applicant screening does not mean that test use is thus required. Issues other than validity, including legal constraints, can play an important and, in some cases, a determinative role in decisions about test use. Legal constraints may also limit an employer's discretion to discard test scores from tests that have already been administered, when that decision is based on differences in scores for subgroups of different races, ethnicities, or genders.

Note that unintended consequences can also be positive. Reversing the above example of test takers who form a negative impression of an organization based on the use of a particular test, a different test may be viewed favorably by applicants, leading to a positive impression of the organization. A given test use may result in multiple consequences, some positive and some negative.

In short, decisions about test use are appropriately informed by validity evidence about intended test score interpretations for a given use, by evidence evaluating additional claims about consequences of test use that do not follow directly from test score interpretations, and by value judgments about unintended positive and negative consequences of test use.

Integrating the Validity Evidence

A sound validity argument integrates various strands of evidence into a coherent account of the degree to which existing evidence and theory support the intended interpretation of test scores for specific uses. It encompasses evidence gathered from new studies and evidence available from earlier reported research. The validity argument may indicate the need for refining the definition of the construct, may suggest revisions in the test or other aspects of the testing process, and may indicate areas needing further study.

It is commonly observed that the validation process never ends, as there is always additional information that can be gathered to more fully
understand a test and the inferences that can be drawn from it. In this way an inference of validity is similar to any scientific inference. However, a test interpretation for a given use rests on evidence for a set of propositions making up the validity argument, and at some point validation evidence allows for a summary judgment of the intended interpretation that is well supported and defensible. At some point the effort to provide sufficient validity evidence to support a given test interpretation for a specific use does end (at least provisionally, pending the emergence of a strong basis for questioning that judgment). Legal requirements may necessitate that the validation study be updated in light of such factors as changes in the test population or newly developed alternative testing methods.

The amount and character of evidence required to support a provisional judgment of validity often vary between areas and also within an area as research on a topic advances. For example, prevailing standards of evidence may vary with the stakes involved in the use or interpretation of the test scores. Higher stakes may entail higher standards of evidence. As another example, in areas where data collection comes at a greater cost, one may find it necessary to base interpretations on fewer data than in areas where data collection comes with less cost.

Ultimately, the validity of an intended interpretation of test scores relies on all the available evidence relevant to the technical quality of a testing system. Different components of validity evidence are described in subsequent chapters of the Standards, and include evidence of careful test construction; adequate score reliability; appropriate test administration and scoring; accurate score scaling, equating, and standard setting; and careful attention to fairness for all test takers, as appropriate to the test interpretation in question.
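One of the evidence components listed above, score reliability, is routinely quantified. As one illustration, the sketch below computes coefficient alpha, a widely used internal-consistency index for a set of items intended to measure a single construct. The function name and the item scores are hypothetical, and alpha is only one of many reliability estimates; it is not a procedure prescribed by the Standards.

```python
# Illustrative sketch only: coefficient alpha, one common index of
# internal-consistency reliability. The function name and the item
# scores below are hypothetical, not data from the Standards.

def cronbach_alpha(item_scores):
    """item_scores: one inner list per item, aligned across the same
    test takers; returns coefficient alpha."""
    k = len(item_scores)
    n_takers = len(item_scores[0])

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_var_sum = sum(variance(item) for item in item_scores)
    totals = [sum(item[i] for item in item_scores) for i in range(n_takers)]
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))

# Three hypothetical items scored for five test takers.
items = [
    [4, 3, 5, 2, 4],
    [5, 3, 4, 2, 5],
    [4, 2, 5, 3, 4],
]
print(f"alpha = {cronbach_alpha(items):.2f}")  # prints: alpha = 0.89
```

As the earlier discussion of internal structure notes, a high homogeneity index of this kind is informative only when the conceptual framework actually posits a single dimension; it would be inappropriate for a test with a more complex internal structure.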
The standards in this chapter begin with an overarching standard (numbered 1.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into three thematic clusters labeled as follows:

1. Establishing Intended Uses and Interpretations
2. Issues Regarding Samples and Settings Used in Validation
3. Specific Forms of Validity Evidence

Standard 1.0

Clear articulation of each intended test score interpretation for a specified use should be set forth, and appropriate validity evidence in support of each intended interpretation should be provided.

Cluster 1. Establishing Intended Uses and Interpretations

Standard 1.1

The test developer should set forth clearly how test scores are intended to be interpreted and consequently used. The population(s) for which a test is intended should be delimited clearly, and the construct or constructs that the test is intended to assess should be described clearly.

…be employed, and the processes by which the test is to be administered and scored.

Standard 1.2

A rationale should be presented for each intended interpretation of test scores for a given use, together with a summary of the evidence and theory bearing on the intended interpretation.

Comment: The rationale should indicate what propositions are necessary to investigate the intended interpretation. The summary should combine logical analysis with empirical evidence to provide support for the test rationale. Evidence may come from studies conducted locally, in the setting where the test is to be used; from specific prior studies; or from comprehensive statistical syntheses of available studies meeting clearly specified study quality criteria. No type of evidence is inherently preferable to others; rather, the quality and relevance of the evidence to the intended test score interpretation for a given use determine the value of a particular kind of evidence. A presentation of empirical evidence on any point should give due weight to all relevant findings in the scientific literature, including those inconsistent with the intended interpretation or use. Test developers have the responsibility to provide support for their own recommendations, but test users bear ultimate responsibility for evaluating the quality of the validity evidence provided and its relevance to the local situation.
…kinds of decisions or certain kinds of test takers, specific warnings against such uses should be given. Professional judgment is required to evaluate the extent to which existing validity evidence supports a given test use.

Standard 1.4

If a test score is interpreted for a given use in a way that has not been validated, it is incumbent on the user to justify the new interpretation for that use, providing a rationale and collecting new evidence, if necessary.

Comment: Professional judgment is required to evaluate the extent to which existing validity evidence applies in the new situation or to the new group of test takers and to determine what new evidence may be needed. The amount and kinds of new evidence required may be influenced by experience with similar prior test uses or interpretations and by the amount, quality, and relevance of existing data.

A test that has been altered or administered in ways that change the construct underlying the test for use with subgroups of the population requires evidence of the validity of the interpretation made on the basis of the modified test (see chap. 3). For example, if a test is adapted for use with individuals with a particular disability in a way that changes the underlying construct, the modified test should have its own evidence of validity for the intended interpretation.

Standard 1.5

When it is clearly stated or implied that a recommended test score interpretation for a given use will result in a specific outcome, the basis for expecting that outcome should be presented, together with relevant evidence.

Comment: If it is asserted, for example, that interpreting and using scores on a given test for employee selection will result in reduced employee errors or training costs, evidence in support of that assertion should be provided. A given claim may be supported by logical or theoretical argument as well as empirical data. Appropriate weight should be given to findings in the scientific literature that may be inconsistent with the stated expectation.

Standard 1.6

When a test use is recommended on the grounds that testing or the testing program itself will result in some indirect benefit, in addition to the utility of information from interpretation of the test scores themselves, the recommender should make explicit the rationale for anticipating the indirect benefit. Logical or theoretical arguments and empirical evidence for the indirect benefit should be provided. Appropriate weight should be given to any contradictory findings in the scientific literature, including findings suggesting important indirect outcomes other than those predicted.

Comment: For example, certain educational testing programs have been advocated on the grounds that they would have a salutary influence on classroom instructional practices or would clarify students' understanding of the kind or level of achievement they were expected to attain. To the extent that such claims enter into the justification for a testing program, they become part of the argument for test use. Evidence for such claims should be examined, in conjunction with evidence about the validity of intended test score interpretation and evidence about unintended negative consequences of test use, in making an overall decision about test use. Due weight should be given to evidence against such predictions, for example, evidence that under some conditions educational testing may have a negative effect on classroom instruction.

Standard 1.7

If test performance, or a decision made therefrom, is claimed to be essentially unaffected by practice and coaching, then the propensity for test performance to change with these forms of instruction should be documented.
Comment: Materials to aid in score interpretation should summarize evidence indicating the degree to which improvement with practice or coaching can be expected. Also, materials written for test takers should provide practical guidance about the value of test preparation activities, including coaching.

Cluster 2. Issues Regarding Samples and Settings Used in Validation

Standard 1.8

well as the gender and ethnic composition of the sample. Sometimes legal restrictions about privacy preclude obtaining or disclosing such population information or limit the level of particularity at which such data may be disclosed. The specific privacy laws, if any, governing the type of data should be considered, in order to ensure that any description of a population does not have the potential to identify an individual in a manner inconsistent with such standards. The extent of missing data, if any, and the methods for handling missing data (e.g., use of imputation procedures) should be described.
job incumbents, supervisors) as appropriate experts for the judgment or rating task should be articulated. It may be entirely appropriate to have experts work together to reach consensus, but it would not then be appropriate to treat their respective judgments as statistically independent. Different judges may be used for different purposes (e.g., one set may rate items for cultural sensitivity while another may rate for reading level) or for different portions of a test.

Standard 1.10

When validity evidence includes statistical analyses of test results, either alone or together with data on other variables, the conditions under which the data were collected should be described in enough detail that users can judge the relevance of the statistical findings to local conditions. Attention should be drawn to any features of a validation data collection that are likely to differ from typical operational testing conditions and that could plausibly influence test performance.

Comment: Such conditions might include (but would not be limited to) the following: test-taker motivation or prior preparation, the range of test scores over test takers, the time allowed for test takers to respond or other administrative conditions, the mode of test administration (e.g., unproctored online testing versus proctored on-site testing), examiner training or other examiner characteristics, the time intervals separating collection of data on different measures, or conditions that may have changed since the validity evidence was obtained.

Cluster 3. Specific Forms of Validity Evidence

Standard 1.11

When the rationale for test score interpretation for a given use rests in part on the appropriateness of test content, the procedures followed in specifying and generating test content should be described and justified with reference to the intended population to be tested and the construct the test is intended to measure or the domain it is intended to represent. If the definition of the content sampled incorporates criteria such as importance, frequency, or criticality, these criteria should also be clearly explained and justified.

Comment: For example, test developers might provide a logical structure that maps the items on the test to the content domain, illustrating the relevance of each item and the adequacy with which the set of items represents the content domain. Areas of the content domain that are not included among the test items could be indicated as well. The match of test content to the targeted domain in terms of cognitive complexity and the accessibility of the test content to all members of the intended population are also important considerations.

(b) Evidence Regarding Cognitive Processes

Standard 1.12

If the rationale for score interpretation for a given use depends on premises about the psychological processes or cognitive operations of test takers, then theoretical or empirical evidence in support of those premises should be provided. When statements about the processes employed by observers or scorers are part of the argument for validity, similar information should be provided.

Comment: If the test specification delineates the processes to be assessed, then evidence is needed that the test items do, in fact, tap the intended processes.
Standard 1.13

If the rationale for a test score interpretation for a given use depends on premises about the relationships among test items or among parts of the test, evidence concerning the internal structure of the test should be provided.

Comment: It might be claimed, for example, that a test is essentially unidimensional. Such a claim could be supported by a multivariate statistical analysis, such as a factor analysis, showing that the score variability attributable to one major dimension was much greater than the score variability attributable to any other identified dimension, or showing that a single factor adequately accounts for the covariation among test items. When a test provides more than one score, the interrelationships of those scores should be shown to be consistent with the construct(s) being assessed.

Standard 1.14

When interpretation of subscores, score differences, or profiles is suggested, the rationale and relevant evidence in support of such interpretation should be provided. Where composite scores are developed, the basis and rationale for arriving at the composites should be given.

Comment: When a test provides more than one score, the distinctiveness and reliability of the separate scores should be demonstrated, and the interrelationships of those scores should be shown to be consistent with the construct(s) being assessed. Moreover, evidence for the validity of interpretations of two or more separate scores would not necessarily justify a statistical or substantive interpretation of the difference between them. Rather, the rationale and supporting evidence must pertain directly to the specific score, score combination, or score pattern to be interpreted for a given use. When subscores from one test or scores from different tests are combined into a composite, the basis for combining scores and for how scores are combined (e.g., differential weighting versus simple summation) should be specified.

Standard 1.15

When interpretation of performance on specific items, or small subsets of items, is suggested, the rationale and relevant evidence in support of such interpretation should be provided. When interpretation of individual item responses is likely but is not recommended by the developer, the user should be warned against making such interpretations.

Comment: Users should be given sufficient guidance to enable them to judge the degree of confidence warranted for any interpretation for a use recommended by the test developer. Test manuals and score reports should discourage overinterpretation of information that may be subject to considerable error. This is especially important if interpretation of performance on isolated items, small subsets of items, or subtest scores is suggested.

(d) Evidence Regarding Relationships With Conceptually Related Constructs

Standard 1.16

When validity evidence includes empirical analyses of responses to test items together with data on other variables, the rationale for selecting the additional variables should be provided. Where appropriate and feasible, evidence concerning the constructs represented by other variables, as well as their technical properties, should be presented or cited. Attention should be drawn to any likely sources of dependence (or lack of independence) among variables other than dependencies among the construct(s) they represent.

Comment: The patterns of association between and among scores on the test under study and other variables should be consistent with theoretical expectations. The additional variables might be demographic characteristics, indicators of treatment conditions, or scores on other measures. They might include intended measures of the same construct or of different constructs. The reliability of scores from such other measures and the validity of intended interpretations of scores from these measures are an important part of the validity evidence for the test under study. If such variables
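The comment on Standard 1.13 above describes supporting a unidimensionality claim by showing that one major dimension dominates. As an illustrative sketch only (not part of the Standards), the dominance of a single dimension can be gauged by comparing the largest eigenvalue of an inter-item correlation matrix with the total standardized variance; the 4-item matrix below is invented for illustration.

```python
import math

# Hypothetical inter-item correlation matrix for a 4-item test
# (invented numbers; equal correlations of .60 are consistent with
# a single common factor).
R = [[1.0, 0.6, 0.6, 0.6],
     [0.6, 1.0, 0.6, 0.6],
     [0.6, 0.6, 1.0, 0.6],
     [0.6, 0.6, 0.6, 1.0]]

def dominant_eigenvalue(matrix, iterations=500):
    """Largest eigenvalue of a symmetric matrix via power iteration."""
    n = len(matrix)
    v = [1.0] * n
    value = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        value = math.sqrt(sum(x * x for x in w))  # norm of M·v
        v = [x / value for x in w]                # renormalize
    return value

first = dominant_eigenvalue(R)           # 2.8 for this matrix
total = sum(R[i][i] for i in range(4))   # total standardized variance = 4.0
print(f"dominant dimension accounts for {first / total:.0%} of the variance")
```

A large share for the first dimension relative to any other supports, but does not by itself establish, an essentially unidimensional interpretation.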
include composite scores, the manner in which the composites were constructed should be explained (e.g., transformation or standardization of the variables, and weighting of the variables). In addition to considering the properties of each variable in isolation, it is important to guard against faulty interpretations arising from spurious sources of dependency among measures, including correlated errors or shared variance due to common methods of measurement or common elements.

(e) Evidence Regarding Relationships With Criteria

Standard 1.17

as information about the distribution of criterion performances conditional upon a given test score. In the case of categorical rather than continuous variables, techniques appropriate to such data should be used (e.g., the use of logistic regression in the case of a dichotomous criterion). Evidence about the overall association between variables should be supplemented by information about the form of that association and about the variability of that association in different ranges of test scores. Note that data collections employing test takers selected for their extreme scores on one or more measures (extreme groups) typically cannot provide adequate information about the association.
for example, by reporting separate distributions for subsets of studies or by estimating the magnitudes of the influences of situational features on effect sizes.

This standard addresses the responsibilities of the individual who is drawing on meta-analytic evidence to support a test score interpretation for a given use. In some instances, that individual may also be the one conducting the meta-analysis; in other instances, existing meta-analyses are relied on. In the latter instance, the individual drawing on meta-analytic evidence does not have control over how the meta-analysis was conducted or reported, and must evaluate the soundness of the meta-analysis for the setting in question.

Standard 1.23

Any meta-analytic evidence used to support an intended test score interpretation for a given use should be clearly described, including methodological choices in identifying and coding studies, correcting for artifacts, and examining potential moderator variables. Assumptions made in correcting for artifacts such as criterion unreliability and range restriction should be presented, and the consequences of these assumptions made clear.

Comment: The description should include documented information about each study used as input to the meta-analysis, thus permitting evaluation by an independent party. Note also that meta-analysis inevitably involves judgments regarding a number of methodological choices. The bases for these judgments should be articulated. In the case of choices involving some degree of uncertainty, such as artifact corrections based on assumed values, the uncertainty should be acknowledged and the degree to which conclusions about validity hinge on these assumptions should be examined and reported.

As in the case of Standard 1.22, the individual who is drawing on meta-analytic evidence to support a test score interpretation for a given use may or may not also be the one conducting the meta-analysis. As Standard 1.22 addresses the reporting of meta-analytic evidence, the individual drawing on existing meta-analytic evidence must evaluate the soundness of the meta-analysis for the setting in question.

Standard 1.24

If a test is recommended for use in assigning persons to alternative treatments, and if outcomes from those treatments can reasonably be compared on a common criterion, then, whenever feasible, supporting evidence of differential outcomes should be provided.

Comment: If a test is used for classification into alternative occupational, therapeutic, or educational programs, it is not sufficient just to show that the test predicts treatment outcomes. Support for the validity of the classification procedure is provided by showing that the test is useful in determining which persons are likely to profit differentially from one treatment or another. Treatment categories may have to be combined to assemble sufficient cases for statistical analysis. It is recognized, however, that such research may not be feasible, because ethical and legal constraints on differential assignments may forbid control groups.

(f) Evidence Based on Consequences of Tests

Standard 1.25

When unintended consequences result from test use, an attempt should be made to investigate whether such consequences arise from the test’s sensitivity to characteristics other than those it is intended to assess or from the test’s failure to fully represent the intended construct.

Comment: The validity of test score interpretations may be limited by construct-irrelevant components or construct underrepresentation. When unintended consequences appear to stem, at least in part, from the use of one or more tests, it is especially important to check that these consequences do not arise from construct-
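Standard 1.23 above refers to corrections for artifacts such as criterion unreliability. As a minimal numerical sketch (not part of the Standards; all values invented), the classical correction for attenuation divides the observed test-criterion correlation by the square root of the assumed criterion reliability, which makes plain how the conclusion hinges on that assumed value:

```python
import math

# Invented example values, for illustration only.
r_xy = 0.30   # observed test-criterion correlation
r_yy = 0.64   # assumed reliability of the criterion measure

# Classical correction for attenuation due to criterion unreliability:
# estimates the correlation with a perfectly reliable criterion.
r_corrected = r_xy / math.sqrt(r_yy)
print(f"observed r = {r_xy:.3f}, corrected r = {r_corrected:.3f}")  # 0.300 -> 0.375
```

Recomputing the corrected coefficient under other plausible values of the assumed reliability is exactly the kind of sensitivity reporting the standard calls for.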
2. RELIABILITY/PRECISION AND ERRORS OF MEASUREMENT
BACKGROUND
A test, broadly defined, is a set of tasks or stimuli designed to elicit responses that provide a sample of an examinee’s behavior or performance in a specified domain. Coupled with the test is a scoring procedure that enables the scorer to evaluate the behavior or work samples and generate a score. In interpreting and using test scores, it is important to have some indication of their reliability.

The term reliability has been used in two ways in the measurement literature. First, the term has been used to refer to the reliability coefficients of classical test theory, defined as the correlation between scores on two equivalent forms of the test, presuming that taking one form has no effect on performance on the second form. Second, the term has been used in a more general sense, to refer to the consistency of scores across replications of a testing procedure, regardless of how this consistency is estimated or reported (e.g., in terms of standard errors, reliability coefficients per se, generalizability coefficients, error/tolerance ratios, item response theory (IRT) information functions, or various indices of classification consistency). To maintain a link to the traditional notions of reliability while avoiding the ambiguity inherent in using a single, familiar term to refer to a wide range of concepts and indices, we use the term reliability/precision to denote the more general notion of consistency of the scores across instances of the testing procedure, and the term reliability coefficient to refer to the reliability coefficients of classical test theory.

The reliability/precision of measurement is always important. However, the need for precision increases as the consequences of decisions and interpretations grow in importance. If a test score leads to a decision that is not easily reversed, such as rejection or admission of a candidate to a professional school, or a score-based clinical judgment (e.g., in a legal context) that a serious cognitive injury was sustained, a higher degree of reliability/precision is warranted. If a decision can and will be corroborated by information from other sources or if an erroneous initial decision can be easily corrected, scores with more modest reliability/precision may suffice.

Interpretations of test scores generally depend on assumptions that individuals and groups exhibit some degree of consistency in their scores across independent administrations of the testing procedure. However, different samples of performance from the same person are rarely identical. An individual’s performances, products, and responses to sets of tasks or test questions vary in quality or character from one sample of tasks to another and from one occasion to another, even under strictly controlled conditions. Different raters may award different scores to a specific performance. All of these sources of variation are reflected in the examinees’ scores, which will vary across instances of a measurement procedure.

The reliability/precision of the scores depends on how much the scores vary across replications of the testing procedure, and analyses of reliability/precision depend on the kinds of variability allowed in the testing procedure (e.g., over tasks, contexts, raters) and the proposed interpretation of the test scores. For example, if the interpretation of the scores assumes that the construct being assessed does not vary over occasions, the variability over occasions is a potential source of measurement error. If the test tasks vary over alternate forms of the test, and the observed performances are treated as a sample from a domain of similar tasks, the random variability in scores from one form to another would be considered error. If raters are used to assign scores to responses, the variability in scores over qualified raters is a source of error. Variations in a test taker’s scores that are not consistent with the definition of the construct being assessed are attributed to errors of measurement.
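The distinction drawn above between the construct and errors of measurement can be illustrated with a toy simulation (not part of the Standards; all distributions and values invented): a test taker's observed scores over replications fluctuate around a fixed true level, and their spread reflects measurement error rather than the attribute itself.

```python
import random
import statistics

random.seed(12345)  # fixed seed so the sketch is repeatable

true_score = 50.0   # assumed fixed attribute level for one test taker
error_sd = 4.0      # assumed spread of replication-to-replication error
replications = 200

# Observed score on each replication = true level + random error.
observed = [true_score + random.gauss(0.0, error_sd) for _ in range(replications)]

mean = statistics.mean(observed)     # estimates the true score (near 50)
spread = statistics.stdev(observed)  # estimates the error SD (near 4)
print(f"mean over replications = {mean:.1f}, SD = {spread:.1f}")
```

In practice such direct replication is rarely feasible, which is why the chapter turns to model-based estimates of error.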
A very basic way to evaluate the consistency of scores involves an analysis of the variation in each test taker’s scores across replications of the testing procedure. The test is administered and then, after a brief period during which the examinee’s standing on the variable being measured would not be expected to change, the test (or a distinct but equivalent form of the test) is administered a second time; it is assumed that the first administration has no influence on the second administration. Given that the attribute being measured is assumed to remain the same for each test taker over the two administrations and that the test administrations are independent of each other, more variation across the two administrations indicates more error in the test scores and therefore lower reliability/precision.

The impact of such measurement errors can be summarized in a number of ways, but typically, in educational and psychological measurement, it is conceptualized in terms of the standard deviation in the scores for a person over replications of the testing procedure. In most testing contexts, it is not possible to replicate the testing procedure repeatedly, and therefore it is not possible to estimate the standard error for each person’s score via repeated measurement. Instead, using model-based assumptions, the average error of measurement is estimated over some population, and this average is referred to as the standard error of measurement (SEM). The SEM is an indicator of a lack of consistency in the scores generated by the testing procedure for some population. A relatively large SEM indicates relatively low reliability/precision. The conditional standard error of measurement for a score level is the standard error of measurement at that score level.

To say that a score includes error implies that there is a hypothetical error-free value that characterizes the variable being assessed. In classical test theory this error-free value is referred to as the person’s true score for the test procedure. It is conceptualized as the hypothetical average score over an infinite set of replications of the testing procedure. In statistical terms, a person’s true score is an unknown parameter, or constant, and the observed score for the person is a random variable that fluctuates around the true score for the person.

Generalizability theory provides a different framework for estimating reliability/precision. While classical test theory assumes a single distribution for the errors in a test taker’s scores, generalizability theory seeks to evaluate the contributions of different sources of error (e.g., items, occasions, raters) to the overall error. The universe score for a person is defined as the expected value over a universe of all possible replications of the testing procedure for the test taker. The universe score of generalizability theory plays a role that is similar to the role of true scores in classical test theory.

Item response theory (IRT) addresses the basic issue of reliability/precision using information functions, which indicate the precision with which observed task/item performances can be used to estimate the value of a latent trait for each test taker. Using IRT, indices analogous to traditional reliability coefficients can be estimated from the item information functions and distributions of the latent trait in some population.

In practice, the reliability/precision of the scores is typically evaluated in terms of various coefficients, including reliability coefficients, generalizability coefficients, and IRT information functions, depending on the focus of the analysis and the measurement model being used. The coefficients tend to have high values when the variability associated with the error is small compared with the observed variation in the scores (or score differences) to be estimated.

Implications for Validity

Although reliability/precision is discussed here as an independent characteristic of test scores, it should be recognized that the level of reliability/precision of scores has implications for validity. Reliability/precision of data ultimately bears on the generalizability or dependability of the scores and/or the consistency of classifications of individuals derived from the scores. To the extent that scores are not consistent across replications of the testing procedure (i.e., to the extent that
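The classical model sketched above treats an observed score as the sum of a true score and an independent error, so observed variance decomposes into true-score and error variance. A small arithmetic sketch (variance figures invented for illustration) connects this decomposition to the reliability coefficient and the SEM:

```python
import math

# Invented variance components for a hypothetical population.
var_true = 64.0    # variance of true scores
var_error = 36.0   # variance of measurement errors (assumed independent of T)

var_observed = var_true + var_error    # 100.0 under the classical model
reliability = var_true / var_observed  # ratio of true-score to observed variance
sem = math.sqrt(var_error)             # SEM = SD of the error component

# Equivalent expression often used in practice: SEM = SD_X * sqrt(1 - reliability).
sem_alt = math.sqrt(var_observed) * math.sqrt(1.0 - reliability)

print(f"reliability = {reliability:.2f}, SEM = {sem:.1f} (check: {sem_alt:.1f})")
```

The two SEM expressions agree because both follow from the same variance decomposition.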
they reflect random errors of measurement), their potential for accurate prediction of criteria, for beneficial examinee diagnosis, and for wise decision making is limited.

Specifications for Replications of the Testing Procedure

As indicated earlier, the general notion of reliability/precision is defined in terms of consistency over replications of the testing procedure. Reliability/precision is high if the scores for each person are consistent over replications of the testing procedure and is low if the scores are not consistent over replications. Therefore, in evaluating reliability/precision, it is important to be clear about what constitutes a replication of the testing procedure.

Replications involve independent administrations of the testing procedure, such that the attribute being measured would not be expected to change. For example, in assessing an attribute that is not expected to change over an extended period of time (e.g., in measuring a trait), scores generated on two successive days (using different test forms if appropriate) would be considered replications. For a state variable (e.g., mood or hunger), where fairly rapid changes are common, scores generated on two successive days would not be considered replications; the scores obtained on each occasion would be interpreted in terms of the value of the state variable on that occasion. For many tests of knowledge or skill, the administration of alternate forms of a test with different samples of items would be considered replications of the test; for survey instruments and some personality measures, it is expected that the same questions will be used every time the test is administered, and any substantial change in wording would constitute a different test form.

Standardized tests present the same or very similar test materials to all test takers, maintain close adherence to stipulated procedures for test administration, and employ prescribed scoring rules that can be applied with a high degree of consistency. Administering the same questions or commonly scaled questions to all test takers under the same conditions promotes fairness and facilitates comparisons of scores across individuals. Conditions of observation that are fixed or standardized for the testing procedure remain the same across replications. However, some aspects of any standardized testing procedure will be allowed to vary. The time and place of testing, as well as the persons administering the test, are generally allowed to vary to some extent. The particular tasks included in the test may be allowed to vary (as samples from a common content domain), and the persons who score the results can vary over some set of qualified scorers.

Alternate forms (or parallel forms) of a standardized test are designed to have the same general distribution of content and item formats (as described, for example, in detailed test specifications), the same administrative procedures, and at least approximately the same score means and standard deviations in some specified population or populations. Alternate forms of a test are considered interchangeable, in the sense that they are built to the same specifications, and are interpreted as measures of the same construct.

In classical test theory, strictly parallel tests are assumed to measure the same construct and to yield scores that have the same means and standard deviations in the populations of interest and have the same correlations with all other variables. A classical reliability coefficient is defined in terms of the correlation between scores from strictly parallel forms of the test, but it is estimated in terms of the correlation between alternate forms of the test that may not quite be strictly parallel.

Different approaches to the estimation of reliability/precision can be implemented to fit different data-collection designs and different interpretations and uses of scores. In some cases, it may be feasible to estimate the variability over replications directly (e.g., by having a number of qualified raters evaluate a sample of test performances for each test taker). In other cases, it may be necessary to use less direct estimates of the reliability coefficient. For example, internal-consistency estimates of reliability (e.g., split-halves coefficient, KR-20, coefficient alpha) use the observed extent of agreement between different parts of one test to estimate the reliability associated with form-to-form variability.
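Among the internal-consistency estimates just mentioned, coefficient alpha can be computed from a single administration. A minimal sketch with an invented data set of four test takers and three items (not part of the Standards):

```python
from statistics import pvariance

# Invented item-score matrix: rows are test takers, columns are items.
scores = [
    [2, 3, 3],
    [4, 4, 5],
    [3, 3, 4],
    [5, 5, 5],
]

k = len(scores[0])                                   # number of items
item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
total_var = pvariance([sum(row) for row in scores])  # variance of total scores

# Coefficient alpha: k/(k-1) * (1 - sum of item variances / total-score variance).
alpha = (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)
print(f"coefficient alpha = {alpha:.3f}")
```

Population variances (`pvariance`) are used throughout; the formula gives the same result with sample variances as long as one convention is applied consistently.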
For the split-halves method, scores on two more-or-less parallel halves of the test (e.g., odd-numbered items and even-numbered items) are correlated, and the resulting half-test reliability coefficient is statistically adjusted to estimate reliability for the full-length test. However, when a test is designed to reflect rate of work, internal-consistency estimates of reliability (particularly by the odd-even method) are likely to yield inflated estimates of reliability for highly speeded tests.

In some cases, it may be reasonable to assume that a potential source of variability is likely to be negligible or that the user will be able to infer adequate reliability from other types of evidence. For example, if test scores are used mainly to predict some criterion scores and the test does an acceptable job in predicting the criterion, it can be inferred that the test scores are reliable/precise enough for their intended use.

The definition of what constitutes a standardized test or measurement procedure has broadened significantly over the last few decades. Various kinds of performance assessments, simulations, and portfolio-based assessments have been developed to provide measures of constructs that might otherwise be difficult to assess. Each step toward greater flexibility in the assessment procedures enlarges the scope of the variations allowed in replications of the testing procedure, and therefore tends to increase the measurement error. However, some of these sacrifices in reliability/precision may reduce construct irrelevance or construct underrepresentation and thereby improve the validity of the intended interpretations of the scores. For example, performance assessments that depend on ratings of extended responses tend to have lower reliability than more structured assessments (e.g., multiple-choice or short-answer tests), but they can sometimes provide more direct measures of the attribute of interest.

Random errors of measurement are viewed as unpredictable fluctuations in scores. They are conceptually distinguished from systematic errors, which may also affect the performances of individuals or groups but in a consistent rather than a random manner. For example, an incorrect answer key would contribute systematic error, as would differences in the difficulty of test forms that have not been adequately equated or linked; examinees who take one form may receive higher scores on average than if they had taken the other form. Such systematic errors would not generally be included in the standard error of measurement, and they are not regarded as contributing to a lack of reliability/precision. Rather, systematic errors constitute construct-irrelevant factors that reduce validity but not reliability/precision.

Important sources of random error may be grouped in two broad categories: those rooted within the test takers and those external to them. Fluctuations in the level of an examinee’s motivation, interest, or attention and the inconsistent application of skills are clearly internal sources that may lead to random error. Variations in testing conditions (e.g., time of day, level of distractions) and variations in scoring due to scorer subjectivity are examples of external sources that may lead to random error. The importance of any particular source of variation depends on the specific conditions under which the measures are taken, how performances are scored, and the interpretations derived from the scores.

Some changes in scores from one occasion to another are not regarded as error (random or systematic), because they result, in part, from changes in the construct being measured (e.g., due to learning or maturation that has occurred between the initial and final measures). In such cases, the changes in performance would constitute the phenomenon of interest and would not be considered errors of measurement.

Measurement error reduces the usefulness of test scores. It limits the extent to which test results can be generalized beyond the particulars of a given replication of the testing procedure. It reduces the confidence that can be placed in the results from any single measurement and therefore the reliability/precision of the scores. Because random measurement errors are unpredictable, they cannot be removed from observed scores. However, their aggregate magnitude can be summarized in several ways, as discussed below, and they can be controlled to some extent (e.g., by standardization or by averaging over multiple scores).
The standard error of measurement, as such, provides an indication of the expected level of random error over score points and replications for a specific population. In many cases, it is useful to have estimates of the standard errors for individual examinees (or for examinees with scores in certain score ranges). These conditional standard errors are difficult to estimate directly, but can be estimated indirectly. For example, the test information functions based on IRT models can be used to estimate standard errors for different values of a latent ability parameter and/or for different observed scores. In using any of these model-based estimates of conditional standard errors, it is important that the model assumptions be consistent with the data.

Evaluating Reliability/Precision

The ideal approach to the evaluation of reliability/precision would require many independent replications of the testing procedure on a large sample of test takers. The range of differences allowed in replications of the testing procedure and the proposed interpretation of the scores provide a framework for investigating reliability/precision.

For most testing programs, scores are expected to generalize over alternate forms of the test, occasions (within some period), testing contexts, and raters (if judgment is required in scoring). To the extent that the impact of any of these sources of variability is expected to be substantial, the variability should be estimated in some way. It is not necessary that the different sources of variance be estimated separately. The overall reliability/precision, given error variance due to the sampling of forms, occasions, and raters, can be estimated through a test-retest study involving different forms administered on different occasions and scored by different raters.

The interpretation of reliability/precision analyses depends on the population being tested. For example, reliability or generalizability coefficients derived from scores of a nationally representative sample may differ significantly from those obtained from a more homogeneous sample drawn from one gender, one ethnic group, or one community. Therefore, to the extent feasible (i.e., if sample sizes are large enough), reliability/precision should be estimated separately for all relevant subgroups (e.g., defined in terms of race/ethnicity, gender, language proficiency) in the population. (Also see chap. 3, "Fairness in Testing.")

Reliability/Generalizability Coefficients

In classical test theory, the consistency of test scores is evaluated mainly in terms of reliability coefficients, defined in terms of the correlation between scores derived from replications of the testing procedure on a sample of test takers. Three broad categories of reliability coefficients are recognized: (a) coefficients derived from the administration of alternate forms in independent testing sessions (alternate-form coefficients); (b) coefficients obtained by administration of the same form on separate occasions (test-retest coefficients); and (c) coefficients based on the relationships/interactions among scores derived from individual items or subsets of the items within a test, all data accruing from a single administration (internal-consistency coefficients). In addition, where test scoring involves a high level of judgment, indices of scorer consistency are commonly obtained. In formal treatments of classical test theory, reliability can be defined as the ratio of true-score variance to observed score variance, but it is estimated in terms of reliability coefficients of the kinds mentioned above.

In generalizability theory, these different reliability analyses are treated as special cases of a more general framework for estimating error variance in terms of the variance components associated with different sources of error. A generalizability coefficient is defined as the ratio of universe score variance to observed score variance. Unlike traditional approaches to the study of reliability, generalizability theory encourages the researcher to specify and estimate components of true score variance, error score variance, and observed score variance, and to calculate coefficients based on these estimates. Estimation is typically accomplished by the application of analysis-of-variance techniques. The separate numerical estimates of the components of variance (e.g., variance components for items,
occasions, and raters, and for the interactions among these potential sources of error) can be used to evaluate the contribution of each source of error to the overall measurement error; the variance-component estimates can be helpful in identifying an effective strategy for controlling overall error variance.

Different reliability (and generalizability) coefficients may appear to be interchangeable, but the different coefficients convey different information. A coefficient may encompass one or more sources of error. For example, a coefficient may reflect error due to scorer inconsistencies but not reflect the variation over an examinee's performances or products. A coefficient may reflect only the internal consistency of item responses within an instrument and fail to reflect measurement error associated with day-to-day changes in examinee performance.

It should not be inferred, however, that alternate-form or test-retest coefficients based on test administrations several days or weeks apart are always preferable to internal-consistency coefficients. In cases where we can assume that scores are not likely to change, based on past experience and/or theoretical considerations, it may be reasonable to assume invariance over occasions (without conducting a test-retest study). Another limitation of test-retest coefficients is that, when the same form of the test is used, the correlation between the first and second scores could be inflated by the test taker's recall of initial responses.

The test information function, an important result of IRT, summarizes how well the test discriminates among individuals at various levels of ability on the trait being assessed. Under the IRT conceptualization for dichotomously scored items, the item characteristic curve or item response function is used as a model to represent the increasing proportion of correct responses to an item at increasing levels of the ability or trait being measured. Given appropriate data, the parameters of the characteristic curve for each item in a test can be estimated. The test information function can then be calculated from the parameter estimates for the set of items in the test and can be used to derive coefficients with interpretations similar to reliability coefficients. The information function may be viewed as a mathematical statement of the precision of measurement at each level of the given trait. The IRT information function is based on the results obtained on a specific occasion or in a specific context, and therefore it does not provide an indication of generalizability over occasions or contexts.

Coefficients (e.g., reliability, generalizability, and IRT-based coefficients) have two major advantages over standard errors. First, as indicated above, they can be used to estimate standard errors (overall and/or conditional) in cases where it would not be possible to do so directly. Second, coefficients (e.g., reliability and generalizability coefficients), which are defined in terms of ratios of variances for scores on the same scale, are invariant over linear transformations of the score scale and can be useful in comparing different testing procedures based on different scales. However, such comparisons are rarely straightforward, because they can depend on the variability of the groups on which the coefficients are based, the techniques used to obtain the coefficients, the sources of error reflected in the coefficients, and the lengths and contents of the instruments being compared.

Factors Affecting Reliability/Precision

A number of factors can have significant effects on reliability/precision, and in some cases, these factors can lead to misinterpretations of the results, if not taken into account.

First, any evaluation of reliability/precision applies to a particular assessment procedure and is likely to change if the procedure is changed in any substantial way. In general, if the assessment is shortened (e.g., by decreasing the number of items or tasks), the reliability is likely to decrease; and if the assessment is lengthened with comparable tasks or items, the reliability is likely to increase. In fact, lengthening the assessment, and thereby increasing the size of the sample of tasks/items (or raters or occasions) being employed, is an effective and commonly used method for improving reliability/precision.
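The relationship just described between test length and reliability is often quantified with the Spearman-Brown formula, which predicts the coefficient for a lengthened or shortened test under the assumption that the added or dropped tasks/items are parallel to the existing ones (an assumption that may not hold in practice). The coefficient of .80 below is illustrative.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when test length is multiplied by length_factor.

    Assumes the added (or dropped) tasks/items are parallel to the
    existing ones; the prediction degrades when that assumption fails.
    """
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)

# Doubling a test with reliability .80 raises the predicted coefficient ...
doubled = spearman_brown(0.80, 2.0)   # 1.6 / 1.8, about .89
# ... while halving the test lowers it.
halved = spearman_brown(0.80, 0.5)    # 0.4 / 0.6, about .67
```

Note the asymmetry: lengthening yields diminishing returns as the coefficient approaches 1, so doubling an already reliable test buys less than doubling a weak one.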
Second, if the variability associated with raters is estimated for a select group of raters who have been especially well trained (and were perhaps involved in the development of the procedures), but raters are not as well trained in some operational contexts, the error associated with rater variability in these operational settings may be much higher than is indicated by the reported interrater reliability coefficients. Similarly, if raters are still refining their performance in the early days of an extended scoring window, the error associated with rater variability may be greater for examinees testing early in the window than for examinees who test later.

Reliability/precision can also depend on the population for which the procedure is being used. In particular, if variability in the construct of interest in the population for which scores are being generated is substantially different from what it is in the population for which reliability/precision was evaluated, the reliability/precision can be quite different in the two populations. When the variability in the construct being measured is low, reliability and generalizability coefficients tend to be small, and when the variability in the construct being measured is higher, the coefficients tend to be larger. Standard errors of measurement are less dependent than reliability and generalizability coefficients on the variability in the sample of test takers.

In addition, reliability/precision can vary from one population to another, even if the variability in the construct of interest in the two populations is the same. The reliability can vary from one population to another because particular sources of error (rater effects, familiarity with formats and instructions, etc.) have more impact in one population than they do in the other. In general, if any aspects of the assessment procedures or the population being assessed are changed in an operational setting, the reliability/precision may change.

Standard Errors of Measurement

The standard error of measurement can be used to generate confidence intervals around reported scores. It is therefore generally more informative than a reliability or generalizability coefficient, once a measurement procedure has been adopted and the interpretation of scores has become the user's primary concern.

Estimates of the standard errors at different score levels (that is, conditional standard errors) are usually a valuable supplement to the single statistic for all score levels combined. Conditional standard errors of measurement can be much more informative than a single average standard error for a population. If decisions are based on test scores and these decisions are concentrated in one area or a few areas of the score scale, then the conditional errors in those areas are of special interest.

Like reliability and generalizability coefficients, standard errors may reflect variation from many sources of error or only a few. A more comprehensive standard error (i.e., one that includes the most relevant sources of error, given the definition of the testing procedure and the proposed interpretation) tends to be more informative than a less comprehensive standard error. However, practical constraints often preclude the kinds of studies that would yield information on all potential sources of error, and in such cases, it is most informative to evaluate the sources of error that are likely to have the greatest impact.

Interpretations of test scores may be broadly categorized as relative or absolute. Relative interpretations convey the standing of an individual or group within a reference population. Absolute interpretations relate the status of an individual or group to defined performance standards. The standard error is not the same for the two types of interpretations. Any source of error that is the same for all individuals does not contribute to the relative error but may contribute to the absolute error.

Traditional norm-referenced reliability coefficients were developed to evaluate the precision with which test scores estimate the relative standing of examinees on some scale, and they evaluate reliability/precision in terms of the ratio of true-score variance to observed-score variance. As the range of uses of test scores has expanded and the contexts of use have been extended (e.g., diagnostic categorization, the evaluation of educational programs), the range of indices that are used to evaluate reliability/precision has also grown to include indices for various kinds of change scores
and difference scores, indices of decision consistency, and indices appropriate for evaluating the precision of group means.

Some indices of precision, especially standard errors and conditional standard errors, also depend on the scale in which they are reported. An index stated in terms of raw scores or the trait-level estimates of IRT may convey a very different perception of the error if restated in terms of scale scores. For example, for the raw-score scale, the conditional standard error may appear to be high at one score level and low at another, but when the conditional standard errors are restated in units of scale scores, quite different trends in comparative precision may emerge.

Decision Consistency

Where the purpose of measurement is classification, some measurement errors are more serious than others. Test takers who are far above or far below the cut score established for pass/fail or for eligibility for a special program can have considerable error in their observed scores without any effect on their classification decisions. Errors of measurement for examinees whose true scores are close to the cut score are more likely to lead to classification errors. The choice of techniques used to quantify reliability/precision should take these circumstances into account. This can be done by reporting the conditional standard error in the vicinity of the cut score or the decision-consistency/accuracy indices (e.g., percentage of correct decisions, Cohen's kappa), which vary as functions of both score reliability/precision and the location of the cut score.

Decision consistency refers to the extent to which the observed classifications of examinees would be the same across replications of the testing procedure. Decision accuracy refers to the extent to which observed classifications of examinees based on the results of a single replication would agree with their true classification status. Statistical methods are available to calculate indices for both decision consistency and decision accuracy. These methods evaluate the consistency or accuracy of classifications rather than the consistency in scores per se. Note that the degree of consistency or agreement in examinee classification is specific to the cut score employed and its location within the score distribution.

Reliability/Precision of Group Means

Estimates of mean (or average) scores of groups (or proportions in certain categories) involve sources of error that are different from those that operate at the individual level. Such estimates are often used as measures of program effectiveness (and, under some educational accountability systems, may be used to evaluate the effectiveness of schools and teachers).

In evaluating group performance by estimating the mean performance or mean improvement in performance for samples from the group, the variation due to the sampling of persons can be a major source of error, especially if the sample sizes are small. To the extent that different samples from the group of interest (e.g., all students who use certain educational materials) yield different results, conclusions about the expected outcome over all students in the group (including those who might join the group in the future) are uncertain. For large samples, the variability due to the sampling of persons in the estimates of the group means may be quite small. However, in cases where the samples of persons are not very large (e.g., in evaluating the mean achievement of students in a single classroom or the average expressed satisfaction of samples of clients in a clinical program), the error associated with the sampling of persons may be a major component of overall error. It can be a significant source of error in inferences about programs even if there is a high degree of precision in individual test scores.

Standard errors for individual scores are not appropriate measures of the precision of group averages. A more appropriate statistic is the standard error for the estimates of the group means.

Documenting Reliability/Precision

Typically, developers and distributors of tests have primary responsibility for obtaining and reporting
The standards in this chapter begin with an overarching standard (numbered 2.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into eight thematic clusters labeled as follows:

1. Specifications for Replications of the Testing Procedure
2. Evaluating Reliability/Precision
3. Reliability/Generalizability Coefficients
4. Factors Affecting Reliability/Precision
5. Standard Errors of Measurement
6. Decision Consistency
7. Reliability/Precision of Group Means
8. Documenting Reliability/Precision

Standard 2.0

Appropriate evidence of reliability/precision should be provided for the interpretation for each intended score use.

Comment: The form of the evidence (reliability or generalizability coefficient, information function, conditional standard error, index of decision consistency) for reliability/precision should be appropriate for the intended uses of the scores, the population involved, and the psychometric models used to derive the scores. A higher degree of reliability/precision is required for score uses that have more significant consequences for test takers. Conversely, a lower degree may be acceptable where a decision based on the test score is reversible or dependent on corroboration from other sources of information.

Cluster 1. Specifications for Replications of the Testing Procedure

Standard 2.1

The range of replications over which reliability/precision is being evaluated should be clearly stated, along with a rationale for the choice of this definition, given the testing situation.

Comment: For any testing program, some aspects of the testing procedure (e.g., time limits and availability of resources such as books, calculators, and computers) are likely to be fixed, and some aspects will be allowed to vary from one administration to another (e.g., specific tasks or stimuli, testing contexts, raters, and, possibly, occasions). Any test administration that maintains fixed conditions and involves acceptable samples of the conditions that are allowed to vary would be considered a legitimate replication of the testing procedure. As a first step in evaluating the reliability/precision of the scores obtained with a testing procedure, it is important to identify the range of conditions of various kinds that are allowed to vary, and over which scores are to be generalized.

Standard 2.2

The evidence provided for the reliability/precision of the scores should be consistent with the domain of replications associated with the testing procedures, and with the intended interpretations for use of the test scores.

Comment: The evidence for reliability/precision should be consistent with the design of the testing procedures and with the proposed interpretations for use of the test scores. For example, if the test can be taken on any of a range of occasions, and the interpretation presumes that the scores are invariant over these occasions, then any variability in scores over these occasions is a potential source of error. If the tasks or
stimuli are allowed to vary over alternate forms of the test, and the observed performances are treated as a sample from a domain of similar tasks, the variability in scores from one form to another would be considered error. If raters are used to assign scores to responses, the variability in scores over qualified raters is a source of error. Different sources of error can be evaluated in a single coefficient or standard error, or they can be evaluated separately, but they should all be addressed in some way. Reports of reliability/precision should specify the potential sources of error included in the analyses.

Cluster 2. Evaluating Reliability/Precision

Standard 2.3

For each total score, subscore, or combination of scores that is to be interpreted, estimates of relevant indices of reliability/precision should be reported.

Standard 2.4

When a test score interpretation emphasizes differences between two observed scores of an individual or two averages of a group, reliability/precision data, including standard errors, should be provided for such differences.

Comment: Observed score differences are used for a variety of purposes. Achievement gains are frequently of interest for groups as well as individuals. In some cases, the reliability/precision of change scores can be much lower than the reliabilities of the separate scores involved. Differences between verbal and performance scores on tests of intelligence and scholastic ability are often employed in the diagnosis of cognitive impairment and learning problems. Psychodiagnostic inferences are frequently drawn from the differences between subtest scores. Aptitude and achievement batteries, interest inventories, and personality assessments are commonly used to identify and quantify the relative strengths and weaknesses, or the pattern of trait levels, of a test taker. When the interpretation of test scores centers on the peaks and valleys in the examinee's test score profile, the reliability of score differences is critical.
longer) version of an existing test, based on data from an administration of the existing test. However, these models generally make assumptions that may not be met (e.g., that the items in the existing test and the items to be added or dropped are all randomly sampled from a single domain). Context effects are commonplace in tests of maximum performance, and the short version of a standardized test often comprises a nonrandom sample of items from the full-length version. As a result, the predicted value of the reliability/precision may not provide a very good estimate of the actual value, and therefore, where feasible, the reliability/precision of both forms should be evaluated directly and independently.

Standard 2.10

When significant variations are permitted in tests or test administration procedures, separate reliability/precision analyses should be provided for scores produced under each major variation if adequate sample sizes are available.

Comment: To make a test accessible to all examinees, test publishers or users might authorize, or might be legally required to authorize, accommodations or modifications in the procedures that are specified for the administration of a test. For example, audio or large print versions may be used for test takers who are visually impaired. Any alteration in standard testing materials or procedures may have an impact on the reliability/precision of the resulting scores, and therefore, to the extent feasible, the reliability/precision should be examined for all versions of the test and testing procedures.

Standard 2.11

Test publishers should provide estimates of reliability/precision as soon as feasible for each relevant subgroup for which the test is recommended.

Comment: Reporting estimates of reliability/precision for relevant subgroups is useful in many contexts, but it is especially important if the interpretation of scores involves within-group inferences (e.g., in terms of subgroup norms). For example, test users who work with a specific linguistic and cultural subgroup or with individuals who have a particular disability would benefit from an estimate of the standard error for the subgroup. Likewise, evidence that preschool children tend to respond to test stimuli in a less consistent fashion than do older children would be helpful to test users interpreting scores across age groups.

When considering the reliability/precision of test scores for relevant subgroups, it is useful to evaluate and report the standard error of measurement as well as any coefficients that are estimated. Reliability and generalizability coefficients can differ substantially when subgroups have different variances on the construct being assessed. Differences in within-group variability tend to have less impact on the standard error of measurement.

Standard 2.12

If a test is proposed for use in several grades or over a range of ages, and if separate norms are provided for each grade or each age range, reliability/precision data should be provided for each age or grade-level subgroup, not just for all grades or ages combined.

Comment: A reliability or generalizability coefficient based on a sample of examinees spanning several grades or a broad range of ages in which average scores are steadily increasing will generally give a spuriously inflated impression of reliability/precision. When a test is intended to discriminate within age or grade populations, reliability or generalizability coefficients and standard errors should be reported separately for each subgroup.

Cluster 5. Standard Errors of Measurement

Standard 2.13

The standard error of measurement, both overall and conditional (if reported), should be provided in units of each reported score.
period. Therefore, the students in a particular class or school at the current time, the current clients of a social service agency, and analogous groups exposed to a program of interest typically constitute a sample in a longitudinal sense. Presumably, comparable groups from the same population will recur in future years, given static conditions. The factors leading to uncertainty in conclusions about program effectiveness arise from the sampling of persons as well as from individual measurement error.

Standard 2.18

When the purpose of testing is to measure the performance of groups rather than individuals, subsets of items can be assigned randomly to different subsamples of examinees. Data are aggregated across subsamples and item subsets to obtain a measure of group performance. When such procedures are used for program evaluation or population descriptions, reliability/precision analyses must take the sampling scheme into account.

Comment: This type of measurement program is termed matrix sampling. It is designed to reduce the time demanded of individual examinees and yet to increase the total number of items on which data can be obtained. This testing approach provides the same type of information about group performances that would be obtained if all examinees had taken all of the items. Reliability/precision statistics should reflect the sampling plan used with respect to examinees and items.

Cluster 8. Documenting Reliability/Precision

Comment: Information on the method of data collection, sample sizes, means, standard deviations, and demographic characteristics of the groups tested helps users judge the extent to which reported data apply to their own examinee populations. If the test-retest or alternate-form approach is used, the interval between administrations should be indicated.

Because there are many ways of estimating reliability/precision, and each is influenced by different sources of measurement error, it is unacceptable to say simply, "The reliability/precision of scores on test X is .90." A better statement would be, "The reliability coefficient of .90 reported for scores on test X was obtained by correlating scores from forms A and B, administered on successive days. The data were based on a sample of 400 10th-grade students from five middle-class suburban schools in New York State. The demographic breakdown of this group was as follows: . . ." In some cases, for example, when small sample sizes or particularly sensitive data are involved, applicable legal restrictions governing privacy may limit the level of information that should be disclosed.

Standard 2.20

If reliability coefficients are adjusted for restriction of range or variability, the adjustment procedure and both the adjusted and unadjusted coefficients should be reported. The standard deviations of the group actually tested and of the target population, as well as the rationale for the adjustment, should be presented.
3. FAIRNESS IN TESTING
BACKGROUND
This chapter addresses the importance of fairness as a fundamental issue in protecting test takers and test users in all aspects of testing. The term fairness has no single technical meaning and is used in many different ways in public discourse. It is possible that individuals endorse fairness in testing as a desirable social goal, yet reach quite different conclusions about the fairness of a given testing program. A full consideration of the topic would explore the multiple functions of testing in relation to its many goals, including the broad goal of achieving equality of opportunity in our society. It would consider the technical properties of tests, the ways in which test results are reported and used, the factors that affect the validity of score interpretations, and the consequences of test use. A comprehensive analysis of fairness in testing also would examine the regulations, statutes, and case law that govern test use and the remedies for harmful testing practices. The Standards cannot hope to deal adequately with all of these broad issues, some of which have occasioned sharp disagreement among testing specialists and others interested in testing. Our focus must be limited here to delineating the aspects of tests, testing, and test use that relate to fairness as described in this chapter, which are the responsibility of those who develop, use, and interpret the results of tests, and upon which there is general professional and technical agreement.

Fairness is a fundamental validity issue and requires attention throughout all stages of test development and use. In previous versions of the Standards, fairness and the assessment of individuals from specific subgroups of test takers, such as individuals with disabilities and individuals with diverse linguistic and cultural backgrounds, were presented in separate chapters. In the current version of the Standards, these issues are presented in a single chapter to emphasize that fairness to all individuals in the intended population of test takers is an overriding, foundational concern, and that common principles apply in responding to test-taker characteristics that could interfere with the validity of test score interpretation. This is not to say that the response to test-taker characteristics is the same for individuals from diverse subgroups such as those defined by race, ethnicity, gender, culture, language, age, disability or socioeconomic status, but rather that these responses should be sensitive to individual characteristics that otherwise would compromise validity. Nonetheless, as discussed in the Introduction, it is important to bear in mind, when using the Standards, that applicability depends on context. For example, potential threats to test validity for examinees with limited English proficiency are different from those for examinees with disabilities. Moreover, threats to validity may differ even for individuals within the same subgroup. For example, individuals with diverse specific disabilities constitute the subgroup of "individuals with disabilities," and examinees classified as "limited English proficient" represent a range of language proficiency levels, educational and cultural backgrounds, and prior experiences. Further, the equivalence of the construct being assessed is a central issue in fairness, whether the context is, for example, individuals with diverse specific disabilities, individuals with limited English proficiency, or individuals across countries and cultures.

As in the previous versions of the Standards, the current chapter addresses measurement bias as a central threat to fairness in testing. However, it also adds two major concepts that have emerged in the literature, particularly in literature regarding education, for minimizing bias and thereby increasing fairness. The first concept is accessibility, the notion that all test takers should have an unobstructed opportunity to demonstrate their standing on the construct(s) being measured. For example, individuals with limited English proficiency
CHAPTER 3
may not be adequately diagnosed on the target construct of a clinical examination if the assessment requires a level of English proficiency that they do not possess. Similarly, standard print and some electronic formats can disadvantage examinees with visual impairments and some older adults who need magnification for reading, and the disadvantage is considered unfair if visual acuity is irrelevant to the construct being measured. These examples show how access to the construct the test is measuring can be impeded by characteristics and/or skills that are unrelated to the intended construct and thereby can limit the validity of score interpretations for intended uses for certain individuals and/or subgroups in the intended test-taking population. Accessibility is a legal requirement in some testing contexts.

The second new concept contained in this chapter is that of universal design. Universal design is an approach to test design that seeks to maximize accessibility for all intended examinees. Universal design, as described more thoroughly later in this chapter, demands that test developers be clear on the construct(s) to be measured, including the target of the assessment, the purpose for which scores will be used, the inferences that will be made from the scores, and the characteristics of examinees and subgroups of the intended test population that could influence access. Test items and tasks can then be purposively designed and developed from the outset to reflect the intended construct, to minimize construct-irrelevant features that might otherwise impede the performance of intended examinee groups, and to maximize, to the extent possible, access for as many examinees as possible in the intended population regardless of race, ethnicity, age, gender, socioeconomic status, disability, or language or cultural background.

Even so, for some individuals in some test contexts and for some purposes (as is described later), there may be a need for additional test adaptations to respond to individual characteristics that otherwise would limit access to the construct as measured. Some examples are creating a braille version of a test, allowing additional testing time, and providing test translations or language simplification. Any test adaptation must be carefully considered, as some adaptations may alter a test's intended construct. Responding to individual characteristics that would otherwise impede access and improving the validity of test score interpretations for intended uses are dual considerations for supporting fairness.

In summary, this chapter interprets fairness as responsiveness to individual characteristics and testing contexts so that test scores will yield valid interpretations for intended uses. The Standards' definition of fairness is often broader than what is legally required. A test that is fair within the meaning of the Standards reflects the same construct(s) for all test takers, and scores from it have the same meaning for all individuals in the intended population; a fair test does not advantage or disadvantage some individuals because of characteristics irrelevant to the intended construct. To the degree possible, characteristics of all individuals in the intended test population, including those associated with race, ethnicity, gender, age, socioeconomic status, or linguistic or cultural background, must be considered throughout all stages of development, administration, scoring, interpretation, and use so that barriers to fair assessment can be reduced. At the same time, test scores must yield valid interpretations for intended uses, and different test contexts and uses may call for different approaches to fairness. For example, in tests used for selection purposes, adaptations to standardized procedures that increase accessibility for some individuals but change the construct being measured could reduce the validity of score inferences for the intended purposes and unfairly advantage those who qualify for adaptation relative to those who do not. In contrast, for diagnostic purposes in medicine and education, adapting a test to increase accessibility for some individuals could increase the accuracy of the diagnosis. These issues are discussed in the sections below and are represented in the standards that follow the chapter introduction.

General Views of Fairness

The first view of fairness in testing described in this chapter establishes the principle of fair and
FAIRNESS IN TESTING
equitable treatment of all test takers during the testing process. The second, third, and fourth views presented here emphasize issues of fairness in measurement quality: fairness as the lack or absence of measurement bias, fairness as access to the constructs measured, and fairness as validity of individual test score interpretations for the intended use(s).

Fairness in Treatment During the Testing Process

Regardless of the purpose of testing, the goal of fairness is to maximize, to the extent possible, the opportunity for test takers to demonstrate their standing on the construct(s) the test is intended to measure. Traditionally, careful standardization of tests, administration conditions, and scoring procedures have helped to ensure that test takers have comparable contexts in which to demonstrate the abilities or attributes to be measured. For example, uniform directions, specified time limits, specified room arrangements, use of proctors, and use of consistent security procedures are implemented so that differences in administration conditions will not inadvertently influence the performance of some test takers relative to others. Similarly, concerns for equity in treatment may require, for some tests, that all test takers have qualified test administrators with whom they can communicate and feel comfortable to the extent practicable. Where technology is involved, it is important that examinees have had similar prior exposure to the technology and that the equipment provided to all test takers be of similar processing speed and provide similar clarity and size for images and other media. Procedures for the standardized administration of a test should be carefully documented by the test developer and followed carefully by the test administrator.

Although standardization has been a fundamental principle for assuring that all examinees have the same opportunity to demonstrate their standing on the construct that a test is intended to measure, sometimes flexibility is needed to provide essentially equivalent opportunities for some test takers. In these cases, aspects of a standardized testing process that pose no particular challenge for most test takers may prevent specific groups or individuals from accurately demonstrating their standing with respect to the construct of interest. For example, challenges may arise due to an examinee's disability, cultural background, linguistic background, race, ethnicity, socioeconomic status, limitations that may come with aging, or some combination of these or other factors. In some instances, greater comparability of scores may be attained if standardized procedures are changed to address the needs of specific groups or individuals without any adverse effects on the validity or reliability of the results obtained. For example, a braille test form, a large-print answer sheet, or a screen reader may be provided to enable those with some visual impairments to obtain more equitable access to test content. Legal considerations may also influence how to address individualized needs.

Fairness as Lack of Measurement Bias

Characteristics of the test itself that are not related to the construct being measured, or the manner in which the test is used, may sometimes result in different meanings for scores earned by members of different identifiable subgroups. For example, differential item functioning (DIF) is said to occur when equally able test takers differ in their probabilities of answering a test item correctly as a function of group membership. DIF can be evaluated in a variety of ways. The detection of DIF does not always indicate bias in an item; there needs to be a suitable, substantial explanation for the DIF to justify the conclusion that the item is biased. Differential test functioning (DTF) refers to differences in the functioning of tests (or sets of items) for different specially defined groups. When DTF occurs, individuals from different groups who have the same standing on the characteristic assessed by the test do not have the same expected test score.

The term predictive bias may be used when evidence is found that differences exist in the patterns of associations between test scores and other variables for different groups, bringing with it concerns about bias in the inferences drawn from the use of test scores. Differential prediction is examined using regression analysis. One approach
examines slope and intercept differences between two targeted groups (e.g., African American examinees and Caucasian examinees), while another examines systematic deviations from a common regression line for any number of groups of interest. Both approaches provide valuable information when examining differential prediction. Correlation coefficients provide inadequate evidence for or against a differential prediction hypothesis if groups are found to have unequal means and variances on the test and the criterion.

When credible evidence indicates potential bias in measurement (i.e., lack of consistent construct meaning across groups, DIF, DTF) or bias in predictive relations, these potential sources of bias should be independently investigated, because the presence or absence of one form of such bias may have no relationship with other forms of bias. For example, a predictor test may show no significant levels of DIF, yet show group differences in regression lines in predicting a criterion.

Although it is important to guard against the possibility of measurement bias for the subgroups that have been defined as relevant in the intended test population, it may not be feasible to fully investigate all possibilities, particularly in the employment context. For example, the number of subgroup members in the field test or norming population may limit the possibility of standard empirical analyses. In these cases, previous research, a construct-based rationale, and/or data from similar tests may address concerns related to potential bias in measurement. In addition, and especially where credible evidence of potential bias exists, small-sample methodologies should be considered. For example, potential bias for relevant subgroups may be examined through small-scale tryouts that use cognitive labs and/or interviews or focus groups to solicit evidence on the validity of interpretations made from the test scores.

A related issue is the extent to which the construct being assessed has equivalent meaning across the individuals and groups within the intended population of test takers. This is especially important when the assessment crosses international borders and cultures. Evaluation of the underlying construct and properties of the test within one country or culture may not generalize across borders or cultures. This can lead to invalid test score interpretations. Careful attention to bias in score interpretations should be practiced in such contexts.

Fairness in Access to the Construct(s) as Measured

The goal that all intended test takers have a full opportunity to demonstrate their standing on the construct being measured has given rise to concerns about accessibility in testing. Accessible testing situations are those that enable all test takers in the intended population, to the extent feasible, to show their status on the target construct(s) without being unduly advantaged or disadvantaged by individual characteristics (e.g., characteristics related to age, disability, race/ethnicity, gender, or language) that are irrelevant to the construct(s) the test is intended to measure. Accessibility is actually a test bias issue, because obstacles to accessibility can result in different interpretations of test scores for individuals from different groups. Accessibility also has important ethical and legal ramifications.

Accessibility can best be understood by contrasting the knowledge, skills, and abilities that reflect the construct(s) the test is intended to measure with the knowledge, skills, and abilities that are not the target of the test but are required to respond to the test tasks or test items. For some test takers, factors related to individual characteristics such as age, race, ethnicity, socioeconomic status, cultural background, disability, and/or English language proficiency may restrict accessibility and thus interfere with the measurement of the construct(s) of interest. For example, a test taker with impaired vision may not be able to access the printed text of a personality test. If the test were provided in large print, the test questions could be more accessible to the test taker and would be more likely to lead to a valid measurement of the test taker's personality characteristics. It is important to be aware of test characteristics that may inadvertently render test questions less accessible for some subgroups of the intended testing population. For example, a test question that employs idiomatic phrases unrelated to the construct being measured could have the effect of making
the test less accessible for test takers who are not native speakers of English. The accessibility of a test could also be decreased by questions that use regional vocabulary unrelated to the target construct or use stimulus contexts that are less familiar to individuals from some cultural subgroups than others.

As discussed later in this chapter, some test-taker characteristics that impede access are related to the construct being measured, for example, dyslexia in the context of tests of reading. In these cases, providing individuals with access to the construct and getting some measure of it may require some adaptation of the construct as well. In situations like this, it may not be possible to develop a measurement that is comparable across adapted and unadapted versions of the test; however, the measure obtained by the adapted test will most likely provide a more accurate assessment of the individual's skills and/or abilities (although perhaps not of the full intended construct) than that obtained without using the adaptation.

Providing access to a test construct becomes particularly challenging for individuals with more than one characteristic that could interfere with test performance, for example, older adults who are not fluent in English or English learners who have moderate cognitive disabilities.

Fairness as Validity of Individual Test Score Interpretations for the Intended Uses

It is important to keep in mind that fairness concerns the validity of individual score interpretations for intended uses. In attempting to ensure fairness, we often generalize across groups of test takers such as individuals with disabilities, older adults, individuals who are learning English, or those from different racial or ethnic groups or different cultural and/or socioeconomic backgrounds; however, this is done for convenience and is not meant to imply that these groups are homogeneous or that, consequently, all members of a group should be treated similarly when making interpretations of test scores for individuals (unless there is validity evidence to support such generalizations). It is particularly important, when drawing inferences about an examinee's skills or abilities, to take into account the individual characteristics of the test taker and how these characteristics may interact with the contextual features of the testing situation.

The complex interplay of language proficiency and context provides one example of the challenges to valid interpretation of test scores for some testing purposes. Proficiency in English not only affects the interpretation of an English language learner's test scores on tests administered in English but, more important, also may affect the individual's developmental and academic progress. Individuals who differ culturally and linguistically from the majority of the test takers are at risk for inaccurate score interpretations because of multiple factors associated with the assumption that, absent language proficiency issues, these individuals have developmental trajectories comparable to those of individuals who have been raised in an environment mediated by a single language and culture. For instance, consider two sixth-grade children who entered school as limited English speakers. The first child entered school in kindergarten and has been instructed in academic courses in English; the second also entered school in kindergarten but has been instructed in his or her native language. The two will have different developmental patterns. In the former case, the interrupted native language development has an attenuating effect on learning and academic performance, but the individual's English proficiency may not be a significant barrier to testing. In contrast, the examinee who has had instruction in his or her native language through the sixth grade has had the opportunity for fully age-appropriate cognitive, academic, and language development; but, if tested in English, the examinee will need the test administered in such a way as to minimize the language barrier if proficiency in English is not part of the construct being measured.

As the above examples show, adaptation to individual characteristics and recognition of the heterogeneity within subgroups may be important to the validity of individual interpretations of test results in situations where the intent is to understand and respond to individual performance. Professionals may be justified in deviating from standardized
procedures to gain a more accurate measurement of the intended construct and to provide more appropriate individual decisions. However, for other contexts and uses, deviations from standardized procedures may be inappropriate because they change the construct being measured, compromise the comparability of scores or use of norms, and/or unfairly advantage some individuals.

In closing this section on the meanings of fairness, note that the Standards' measurement perspective explicitly excludes one common view of fairness in public discourse: fairness as the equality of testing outcomes for relevant test-taker subgroups. Certainly, most testing professionals agree that group differences in testing outcomes should trigger heightened scrutiny for possible sources of test bias. Examination of group differences also may be important in generating new hypotheses about bias, fair treatment, and the accessibility of the construct as measured; and in fact, there may be legal requirements to investigate certain differences in the outcomes of testing among subgroups. However, group differences in outcomes do not in themselves indicate that a testing application is biased or unfair.

In many cases, it is not clear whether the differences are due to real differences between groups in the construct being measured or to some source of bias (e.g., construct-irrelevant variance or construct underrepresentation). In most cases, it may be some combination of real differences and bias. A serious search for possible sources of bias that comes up empty provides reassurance that the potential for bias is limited, but even a very extensive research program cannot rule the possibility out. It is always possible that something was missed, and therefore prudence would suggest that an attempt be made to minimize the differences. For example, some racial and ethnic subgroups have lower mean scores on some standardized tests than do other subgroups. Some of the factors that contribute to these differences are understood (e.g., large differences in family income and other resources, differences in school quality and students' opportunity to learn the material to be assessed), but even where serious efforts have been made to eliminate possible sources of bias in test content and formats, the potential for some score bias cannot be completely ruled out. Therefore, continuing efforts in test design and development to eliminate potential sources of bias without compromising validity, and consistent with legal and regulatory standards, are warranted.

Threats to Fair and Valid Interpretations of Test Scores

A prime threat to fair and valid interpretation of test scores comes from aspects of the test or testing process that may produce construct-irrelevant variance in scores that systematically lowers or raises scores for identifiable groups of test takers and results in inappropriate score interpretations for intended uses. Such construct-irrelevant components of scores may be introduced by inappropriate sampling of test content, aspects of the test context such as lack of clarity in test instructions, item complexities that are unrelated to the construct being measured, and/or test response expectations or scoring criteria that may favor one group over another. In addition, opportunity to learn (i.e., the extent to which an examinee has been exposed to instruction or experiences assumed by the test developer and/or user) can influence the fair and valid interpretation of test scores for their intended uses.

Test Content

One potential source of construct-irrelevant variance in test scores arises from inappropriate test content, that is, test content that confounds the measurement of the target construct and differentially favors individuals from some subgroups over others. A test intended to measure critical reading, for example, should not include words and expressions especially associated with particular occupations, disciplines, cultural backgrounds, socioeconomic status, racial/ethnic groups, or geographical locations, so as to maximize the measurement of the construct (the ability to read critically) and to minimize confounding of this measurement with prior knowledge and experience that are likely to advantage, or disadvantage, test takers from particular subgroups.
Differential engagement and motivational value may also be factors in exacerbating construct-irrelevant components of content. Material that is likely to be differentially interesting should be balanced to appeal broadly to the full range of the targeted testing population (except where the interest level is part of the construct being measured). In testing, such balance extends to representation of individuals from a variety of subgroups within the test content itself. For example, applied problems can feature children and families from different racial/ethnic, socioeconomic, and language groups. Also, test content or situations that are offensive or emotionally disturbing to some test takers and may impede their ability to engage with the test should not appear in the test unless the use of the offensive or disturbing content is needed to measure the intended construct. Examples of this type of content are graphic descriptions of slavery or the Holocaust, when such descriptions are not specifically required by the construct.

Depending on the context and purpose of tests, it is both common and advisable for test developers to engage an independent and diverse panel of experts to review test content for language, illustrations, graphics, and other representations that might be differentially familiar or interpreted differently by members of different groups, and for material that might be offensive or emotionally disturbing to some test takers.

Test Context

The term test context, as used here, refers to multiple aspects of the test and testing environment that may affect the performance of an examinee and consequently give rise to construct-irrelevant variance in the test scores. As research on contextual factors (e.g., stereotype threat) is ongoing, test developers and test users should pay attention to the emerging empirical literature on these topics so that they can use this information if and when the preponderance of evidence dictates that it is appropriate to do so. Construct-irrelevant variance may result from a lack of clarity in test instructions, from unrelated complexity or language demands in test tasks, and/or from other characteristics of test items that are unrelated to the construct but lead some individuals to respond in particular ways. For example, examinees from diverse racial/ethnic, linguistic, or cultural backgrounds, or who differ by gender, may be poorly assessed by a vocational interest inventory whose questions disproportionately ask about competencies, activities, and interests that are stereotypically associated with particular subgroups.

When test settings have an interpersonal context, the interaction of examiner with test taker can be a source of construct-irrelevant variance or bias. Users of tests should be alert to the possibility that such interactions may sometimes affect test fairness. Practitioners administering the test should be aware of the possibility of complex interactions with test takers and other situational variables. Factors that may affect the performance of the test taker include the race, ethnicity, gender, and linguistic and cultural background of both examiner and test taker; the test taker's experience with formal education; the testing style of the examiner; the level of acculturation of the test taker and examiner; the test taker's primary language; the language used for test administration (if it is not the primary language of the test taker); and the use of a bilingual or bicultural interpreter.

Testing of individuals who are bilingual or multilingual poses special challenges. An individual who knows two or more languages may not test well in one or more of the languages. For example, children from homes whose families speak Spanish may be able to understand Spanish but express themselves best in English, or vice versa. In addition, some persons who are bilingual use their native language in most social situations and use English primarily for academic and work-related activities; the use of one or both languages depends on the nature of the situation. Non-native English speakers who give the impression of being fluent in conversational English may be slower or not completely competent in taking tests that require English comprehension and literacy skills. Thus, in some settings, an understanding of an individual's type and degree of bilingualism or multilingualism is important for testing the individual appropriately. Note that this concern may not apply when the
construct of interest is defined as a particular kind of language proficiency (e.g., academic language of the kind found in textbooks, language and vocabulary specific to workplace and employment testing).

Test Response

In some cases, construct-irrelevant variance may arise because test items elicit varieties of responses other than those intended, or because items can be solved in ways that were not intended. To the extent that such responses are more typical of some subgroups than others, biased score interpretations may result. For example, some clients responding to a neuropsychological test may attempt to provide the answers they think the test administrator expects, as opposed to the answers that best describe themselves.

Construct-irrelevant components in test scores may also be associated with test response formats that pose particular difficulties or are differentially valued by particular individuals. For example, test performance may rely on some capability (e.g., English language proficiency or fine-motor coordination) that is irrelevant to the target construct(s) but nonetheless poses impediments to the test responses for some test takers not having the capability. Similarly, different values associated with the nature and degree of verbal output can influence test-taker responses. Some individuals may judge verbosity or rapid speech as rude, whereas others may regard those speech patterns as indications of high mental ability or friendliness. An individual of the first type who is evaluated with values appropriate to the second may be considered taciturn, withdrawn, or of low mental ability. Another example is a person with memory or language problems or depression; such a person's ability to communicate or show interest in communicating verbally may be constrained, which may result in interpretations of the outcomes of the assessment that are invalid and potentially harmful to the person being tested.

In the development and use of scoring rubrics, it is particularly important that credit be awarded for response characteristics central to the construct being measured and not for response characteristics that are irrelevant or tangential to the construct. Scoring rubrics may inadvertently advantage some individuals over others. For example, a scoring rubric for a constructed-response item might reserve the highest score level for test takers who provide more information or elaboration than was actually requested. In this situation, test takers who simply follow instructions, or test takers who value succinctness in responses, will earn lower scores; thus, characteristics of the individuals become construct-irrelevant components of the test scores. Similarly, the scoring of open-ended responses may introduce construct-irrelevant variance for some test takers if scorers and/or automated scoring routines are not sensitive to the full diversity of ways in which individuals express their ideas. With the advent of automated scoring for complex performance tasks, for example, it is important to examine the validity of the automated scoring results for relevant subgroups in the test-taking population.

Opportunity to Learn

Finally, opportunity to learn (the extent to which individuals have had exposure to instruction or knowledge that affords them the opportunity to learn the content and skills targeted by the test) has several implications for the fair and valid interpretation of test scores for their intended uses. Individuals' prior opportunity to learn can be an important contextual factor to consider in interpreting and drawing inferences from test scores. For example, a recent immigrant who has had little prior exposure to school may not have had the opportunity to learn concepts assumed to be common knowledge by a personality inventory or ability measure, even if that measure is administered in the native language of the test taker. Similarly, as another example, there has been considerable public discussion about potential inequities in school resources available to students from traditionally disadvantaged groups, for example, racial, ethnic, language, and cultural minorities and rural students. Such inequities affect the quality of education received. To the extent that inequity exists, the validity of inferences about student ability drawn from achievement test scores
may be compromised. Not taking into account prior opportunity to learn could lead to misdiagnosis, inappropriate placement, and/or inappropriate assignment of services, which could have significant consequences for an individual.

Beyond its impact on the validity of test score interpretations for intended uses, opportunity to learn has important policy and legal ramifications in education. Opportunity to learn is a fairness issue when an authority provides differential access to opportunity to learn for some individuals and then holds those individuals who have not been provided that opportunity accountable for their test performance. This problem may affect high-stakes competency tests in education, for example, when educational authorities require a certain level of test performance for high school graduation. Here, there is a fairness concern that students not be held accountable for, or face serious permanent negative consequences from, their test results when their school experiences have not provided them the opportunity to learn the subject matter covered by the test. In such cases, students' low scores may accurately reflect what they know and can do, so that, technically, the interpretation of the test results for the purpose of measuring how much the students have learned may not be biased. However, it may be considered unfair to severely penalize students for circumstances that are not under their control, that is, for not learning content that their schools have not taught. It is generally accepted that before high-stakes consequences can be imposed for failing an examination in educational settings, there must be evidence that students have been provided curriculum and instruction that incorporates the constructs addressed by the test.

Several important issues arise when opportunity to learn is considered as a component of fairness. First, it is difficult to define opportunity to learn in educational practice, particularly at the individual level. Opportunity is generally a matter of degree and is difficult to quantify; moreover, the measurement of some important learning outcomes may require students to work with materials that they have not seen before. Second, even if it is possible to document the topics included in the curriculum for a group of students, specific content coverage for any one student may be impossible to determine. Third, granting a diploma to a low-scoring examinee on the grounds that the student had insufficient opportunity to learn the material tested means certificating someone who has not attained the degree of proficiency the diploma is intended to signify.

It should be noted that concerns about opportunity to learn do not necessarily apply to situations where the same authority is not responsible for both the delivery of instruction and the testing and/or interpretation of results. For example, in college admissions decisions, opportunity to learn may be beyond the control of the test users and it may not influence the validity of test interpretations for their intended use (e.g., selection and/or admissions decisions). Chapter 12, "Educational Testing and Assessment," provides additional perspective on opportunity to learn.

Minimizing Construct-Irrelevant Components Through Test Design and Testing Adaptations

Standardized tests should be designed to facilitate accessibility and minimize construct-irrelevant barriers for all test takers in the target population, as far as practicable. Before considering the need for any assessment adaptations for test takers who may have special needs, the assessment developer first must attempt to improve accessibility within the test itself. Some of these basic principles are included in the test design process called universal design. By using universal design, test developers begin the test development process with an eye toward maximizing fairness. Universal design emphasizes the need to develop tests that are as usable as possible for all test takers in the intended test population, regardless of characteristics such as gender, age, language background, culture, socioeconomic status, or disability.

Principles of universal design include defining constructs precisely, so that what is being measured can be clearly differentiated from test-taker characteristics that are irrelevant to the construct but that could otherwise interfere with some test
takers' ability to respond. Universal design avoids, where possible, item characteristics and formats, or test characteristics (for example, inappropriate test speededness), that may bias scores for individuals or subgroups due to construct-irrelevant characteristics that are specific to these test takers.

Universal design processes strive to minimize access challenges by taking into account test characteristics that may impede access to the construct for certain test takers, such as the choice of content, test tasks, response procedures, and testing procedures. For example, the content of tests can be made more accessible by providing user-selected font sizes in a technology-based test, by avoiding item contexts that would likely be unfamiliar to individuals because of their cultural background, by providing extended administration time when speed is not relevant to the construct being measured, or by minimizing the linguistic load of test items intended to measure constructs other than competencies in the language in which the test is administered.

Although the principles of universal design for assessment provide a useful guide for developing assessments that reduce construct-irrelevant variance, researchers are still in the process of gathering empirical evidence to support some of these principles. It is important to note that not all tests can be made accessible for everyone by attention to design changes such as those discussed above. Even when tests are developed to maximize fairness through the use of universal design and other practices to increase access, there will still be situations where the test is not appropriate for all test takers in the intended population. Therefore, some test adaptations may be needed for those individuals whose characteristics would otherwise impede their access to the examination.

Adaptations are changes to the original test design or administration to increase access to the test for such individuals. For example, a person who is blind may read only in braille format, and an individual with hemiplegia may be unable to hold a pencil and thus have difficulty completing a standard written exam. Students with limited English proficiency may be proficient in physics but may not be able to demonstrate their knowledge if the physics test is administered in English. Depending on testing circumstances and purposes of the test, as well as individual characteristics, such adaptations might include changing the content or presentation of the test items, changing the administration conditions, and/or changing the response processes. The term adaptation is used to refer to any such change. It is important, however, to differentiate between changes that result in comparable scores and changes that may not produce scores that are comparable to those from the original test. Although the terms may have different meanings under applicable laws, as used in the Standards the term accommodation is used to denote changes with which the comparability of scores is retained, and the term modification is used to denote changes that affect the construct measured by the test. With a modification, the changes affect the construct being measured and consequently lead to scores that differ in meaning from those from the original test.¹

¹The Americans with Disabilities Act (ADA) uses the terms accommodation and modification differently from the Standards. Title I of the ADA uses the term reasonable accommodation to refer to changes that enable qualified individuals with disabilities to obtain employment to perform their jobs. Titles II and III use the term reasonable modification in much the same way. Under the ADA, an accommodation or modification to a test that fundamentally alters the construct being measured would not be called something different; rather it would probably be found not "reasonable."

It is important to keep in mind that attention to design and the provision of altered tests do not always ensure that test results will be fair and valid for all examinees. Those who administer tests and interpret test scores need to develop a full understanding of the usefulness and limitations of test design procedures for accessibility and any alterations that are offered.

A Range of Test Adaptations

Rather than a simple dichotomy, potential test adaptations reflect a broad range of test changes. At one end of the range are test accommodations. As the term is used in the Standards, accommodations consist of relatively minor changes to the presentation and/or format of the test, test administration, or response procedures that maintain the original construct and result in scores comparable to those on the original test. For example, text magnification might be an accommodation for a test taker with a visual impairment who otherwise would have difficulty deciphering test directions or items. English–native language glossaries are an example of an accommodation that might be provided for limited English proficient test takers on a construction safety test to help them understand what is being asked. The glossaries would contain words that, while not directly related to the construct being measured, would help limited English test takers understand the context of the question or task being posed.

At the other end of the range are adaptations that transform the construct being measured, including the test content and/or testing conditions, to get a reasonable measure of a somewhat different but appropriate construct for designated test takers. For example, in educational testing, different tests addressing alternate achievement standards are designed for students with severe cognitive disabilities for the same subjects in which students without disabilities are assessed. Clearly, scores from these different tests cannot be considered comparable to those resulting from the general assessment, but instead represent scores from a new test that requires the same rigorous development and validation processes as would be carried out for any new assessment. (An expanded discussion of the use of such alternate assessments is found in chap. 12; alternate assessments will not be treated further in the present chapter.) Other adaptations change the intended construct to make it accessible for designated students while retaining as much of the original construct as possible. For example, a reading test adaptation might provide a dyslexic student with a screen reader that reads aloud the passages and the test questions measuring reading comprehension. If the construct is intentionally defined as requiring both the ability to decode and the ability to comprehend written language, the adaptation would require a different interpretation of the test scores as a measure of reading comprehension. Clearly, this adaptation changes the construct being measured, because the student does not have to decode the printed text; but without the adaptation, the student may not be able to demonstrate any standing on the construct of reading comprehension. On the other hand, if the purpose of the reading test is to evaluate comprehension without concern for decoding ability, the adaptation might be judged to support more valid interpretations of some students' reading comprehension and the essence of the relevant parts of the construct might be judged to be intact. The challenge for those who report, interpret, and/or use test scores from adapted tests is to recognize which adaptations provide scores that are comparable to the scores from the original, unadapted assessment and which adaptations do not. This challenge becomes even more difficult when evidence to support the comparability of scores is not available.

Test Accommodations: Comparable Measures That Maintain the Intended Construct

Comparability of scores enables test users to make comparable inferences based on the scores for all test takers. Comparability also is the defining feature for a test adaptation to be considered an accommodation. Scores from the accommodated version of the test must yield inferences comparable to those from the standard version; to make this happen is a challenging proposition. On the one hand, common, uniform procedures are a basic underpinning for score validity and comparability. On the other hand, accommodations by their very nature mean that something in the testing circumstance has been changed because adhering to the original standardized procedures would interfere with valid measurement of the intended construct(s) for some individuals.

The comparability of inferences made from accommodated test scores rests largely on whether the scores represent the same constructs as those from the original test. This determination requires a very clear definition of the intended construct(s). For example, when non-native speakers of the language of the test take a survey of their health and nutrition knowledge, one may not know whether the test score is, in whole or in part, a measure of the ability to read in the language of
the test rather than a measure of the intended construct. If the test is not intended to also be a measure of the ability to read in English, then test scores do not represent the same construct(s) for examinees who may have poor reading skills, such as limited English proficient test takers, as they do for those who are fully proficient in reading English. An adaptation that improves the accessibility of the test for non-native speakers of English by providing direct or indirect linguistic supports may yield a score that is uncontaminated by the ability to understand English.

At the same time, construct underrepresentation is a primary threat to the validity of test accommodations. For example, extra time is a common accommodation, but if speed is part of the intended construct, it is inappropriate to allow for extra time in the test administration. Scores obtained on the test with extended administration time may underrepresent the construct measured by the strictly timed test because speed will not be part of the construct measured by the extended time test. Similarly, translating a reading comprehension test used for selection into an organization's training program is inappropriate if reading comprehension in English is important to successful participation in the program.

Claims that accommodated versions of a test yield interpretations comparable to those based on scores from the original test and that the construct being measured has not been changed need to be evaluated and substantiated with evidence. Although score comparability is easiest to establish when different test forms are constructed following identical procedures and then equated statistically, such procedures usually are not possible for accommodated and nonaccommodated versions of tests. Instead, relevant evidence can take a variety of forms, from experimental studies to assess construct equivalence to smaller, qualitative studies and/or use of professional judgment and expert review. Whatever the case, test developers and/or users should seek evidence of the comparability of the accommodated and original assessments.

A variety of strategies for accommodating tests and testing procedures have been implemented to be responsive to the needs of test takers with disabilities and those with diverse linguistic and cultural backgrounds. Similar approaches may be adapted for other subgroups. Specific strategies depend on the purpose of the test and the construct(s) the test is intended to measure. Some strategies require changing test administration procedures (e.g., instructions, response format), whereas others alter testing medium, timing, settings, or format. Depending on the linguistic background or the nature and extent of the disability, one or more testing changes may be appropriate for a particular individual.

Regardless of the individual's characteristics that make accommodations necessary, it is important that test accommodations address the specific access issue(s) that otherwise would bias an individual's test results. For example, accommodations provided to limited English proficient test takers should be designed to address appropriate linguistic support needs; those provided to test takers with visual impairments should address the inability to see test material. Accommodations should be effective in removing construct-irrelevant barriers to an individual's test performance without providing an unfair advantage over individuals who do not receive the accommodation. Admittedly, achieving both objectives can be challenging.

Adaptations involving test translations merit special consideration. Simply translating a test from one language to another does not ensure that the translation produces a version of the test that is comparable in content and difficulty level to the original version of the test, or that the translated test produces scores that are equally reliable/precise and valid as those from the original test. Furthermore, one cannot assume that the relevant acculturation, clinical, or educational experiences are similar for test takers taking the translated version and for the target group used to develop the original version. In addition, it cannot be assumed that translation into the native language is always a preferred accommodation. Research in educational testing, for example, shows that translated content tests are not effective unless test takers have been instructed using the language of the translated test. Whenever tests are translated from one language to a second language, evidence
of the validity, reliability/precision, and comparability of scores on the different versions of the tests should be collected and reported.

When the testing accommodation employs the use of an interpreter, it is desirable, where feasible, to obtain someone who has a basic understanding of the process of psychological and educational assessment, is fluent in the language of the test and the test taker's native language, and is familiar with the test taker's cultural background. The interpreter ideally needs to understand the importance of following standardized procedures, the importance of accurately conveying to the examiner a test taker's actual responses, and the role and responsibilities of the interpreter in testing. The interpreter must be careful not to provide any assistance to the candidate that might potentially compromise the validity of the interpretation for intended uses of the assessment results.

Finally, it is important to standardize procedures for implementing accommodations, as far as possible, so that comparability of scores is maintained. Standardized procedures for test accommodations must include rules for determining who is eligible for an accommodation, as well as precisely how the accommodation is to be administered. Test users should monitor adherence to the rules for eligibility and for appropriate administration of the accommodated test.

Test Modifications: Noncomparable Measures That Change the Intended Construct

There may be times when additional flexibility is required to obtain even partial measurement of the construct; that is, it may be necessary to consider a modification to a test that will result in changing the intended construct to provide even limited access to the construct that is being measured. For example, an individual with dyscalculia may have limited ability to do computations without a calculator; however, if provided a calculator, the individual may be able to do the calculations required in the assessment. If the construct being assessed involves broader mathematics skill, the individual may have limited access to the construct being measured without the use of a calculator; with the modification, however, the individual may be able to demonstrate mathematics problem-solving skills, even if he or she is not able to demonstrate computation skills. Because modified assessments are measuring a different construct from that measured by the standardized assessment, it is important to interpret the assessment scores as resulting from a new test and to gather whatever evidence is necessary to evaluate the validity of the interpretations for intended uses of the scores. For norm-based score interpretations, any modification that changes the construct will invalidate the norms for score interpretations. Likewise, if the construct is changed, criterion-based score interpretations from the modified assessment (for example, making classification decisions such as "pass/fail" or assigning categories of mastery such as "basic," "proficient," or "advanced" using cut scores determined on the original assessment) will not be valid.

Reporting Scores From Accommodated and Modified Tests

Typically, test administrators and testing professionals document steps used in making test accommodations or modifications in the test report; clinicians may also include a discussion of the validity of the interpretations of the resulting scores for intended uses. This practice of reporting the nature of accommodations and modifications is consistent with implied requirements to communicate information as to the nature of the assessment process if these changes may affect the reliability/precision of test scores or the validity of interpretations drawn from test scores.

The flagging of test score reports can be a controversial issue and subject to legal requirements. When there is clear evidence that scores from regular and altered tests or test administrations are not comparable, consideration should be given to informing score users, potentially by flagging the test results to indicate their special nature, to the extent permitted by law. Where there is credible evidence that scores from regular and altered tests are comparable, then flagging generally is not appropriate. There is little agreement in the field on how to proceed when credible evidence
on comparability does not exist. To the extent possible, test developers and/or users should collect evidence to examine the comparability of regular and altered tests or administration procedures for the test's intended purposes.

Appropriate Use of Accommodations or Modifications

Depending on the construct to be measured and the test's purpose, there are some testing situations where accommodations as defined by the Standards are not needed or modifications as defined by the Standards are not appropriate. First, the reason for the possible alteration, such as English language skills or a disability, may in fact be directly relevant to the focal construct. In employment testing, it would be inappropriate to make changes to the test if the test is designed to assess essential skills required for the job and the test changes would fundamentally alter the constructs being measured. For example, despite increased automation and use of recording devices, some court reporter jobs require individuals to be able to work quickly and accurately. Speed is an important aspect of the construct that cannot be adapted. As another example, a work sample for a customer service job that requires fluent communication in English would not be translated into another language.

Second, an adaptation for a particular disability is inappropriate when the purpose of a test is to diagnose the presence and degree of that disability. For example, allowing extra time on a timed test to determine distractibility and speed-of-processing difficulties associated with attention deficit disorder would make it impossible to determine the extent to which the attention and processing-speed difficulties actually exist.

Third, it is important to note that not all individuals within a general class of examinees, such as those with diverse linguistic or cultural backgrounds or with disabilities, may require special provisions when taking tests. The language skills, cultural knowledge, or specific disabilities that these individuals possess, for example, might not influence their performance on a particular type of test. Hence, for these individuals, no changes are needed.

The effectiveness of a given accommodation also plays a role in determinations of appropriate use. If a given accommodation or modification does not increase access to the construct as measured, there is little point in using it. Evidence of effectiveness may be gathered through quantitative or qualitative studies. Professional judgment necessarily plays a substantial role in decisions about changes to the test or testing situation.

In summary, fairness is a fundamental issue for valid test score interpretation, and it should therefore be the goal for all testing applications. Fairness is the responsibility of all parties involved in test development, administration, and score interpretation for the intended purposes of the test.
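Quantitative studies of accommodation effectiveness, mentioned above, are often framed as a check for differential boost: an accommodation that removes a construct-irrelevant barrier should raise scores substantially more for the test takers who need it than for those who do not. The sketch below illustrates only the descriptive logic of that comparison; it is not part of the Standards, and the group labels and scores are hypothetical.

```python
from statistics import mean

def differential_boost(target_std, target_acc, other_std, other_acc):
    """Compare the mean score gain an accommodation gives the target group
    (e.g., test takers with a documented disability) with the gain it gives
    other test takers. A markedly larger gain for the target group is
    consistent with the accommodation removing a construct-irrelevant
    barrier rather than conferring a general advantage."""
    target_gain = mean(target_acc) - mean(target_std)
    other_gain = mean(other_acc) - mean(other_std)
    return target_gain, other_gain, target_gain - other_gain

# Hypothetical pilot data: scores under standard and accommodated conditions.
target_standard = [18, 21, 17, 20, 19]      # target group, no accommodation
target_accommodated = [25, 27, 24, 26, 25]  # target group, with accommodation
other_standard = [26, 28, 25, 27, 26]       # comparison group, no accommodation
other_accommodated = [27, 28, 26, 27, 27]   # comparison group, with accommodation

t_gain, o_gain, boost = differential_boost(
    target_standard, target_accommodated, other_standard, other_accommodated)
print(f"target-group gain: {t_gain:.1f}, other-group gain: {o_gain:.1f}, "
      f"differential boost: {boost:.1f}")
```

In practice, such evidence would come from adequately sized samples and a formal test of the group-by-condition interaction (for example, a two-way ANOVA); this sketch shows only the comparison being made, not a complete study design.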
FAIRNESS IN TESTING
The standards in this chapter begin with an overarching standard (numbered 3.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into four thematic clusters labeled as follows:

1. Test Design, Development, Administration, and Scoring Procedures That Minimize Barriers to Valid Score Interpretations for the Widest Possible Range of Individuals and Relevant Subgroups
2. Validity of Test Score Interpretations for Intended Uses for the Intended Examinee Population
3. Accommodations to Remove Construct-Irrelevant Barriers and Support Valid Interpretations of Scores for Their Intended Uses
4. Safeguards Against Inappropriate Score Interpretations for Intended Uses

Standard 3.0

All steps in the testing process, including test design, validation, development, administration, and scoring procedures, should be designed in such a manner as to minimize construct-irrelevant variance and to promote valid score interpretations for the intended uses for all examinees in the intended population.

Comment: The central idea of fairness in testing is to identify and remove construct-irrelevant barriers to maximal performance for any examinee. Removing these barriers allows for the comparable and valid interpretation of test scores for all examinees. Fairness is thus central to the validity and comparability of the interpretation of test scores for intended uses.

Cluster 1. Test Design, Development, Administration, and Scoring Procedures That Minimize Barriers to Valid Score Interpretations for the Widest Possible Range of Individuals and Relevant Subgroups

Standard 3.1

Those responsible for test development, revision, and administration should design all steps of the testing process to promote valid score interpretations for intended score uses for the widest possible range of individuals and relevant subgroups in the intended population.

Comment: Test developers must clearly delineate both the constructs that are to be measured by the test and the characteristics of the individuals and subgroups in the intended population of test takers. Test tasks and items should be designed to maximize access and be free of construct-irrelevant barriers as far as possible for all individuals and relevant subgroups in the intended test-taker population. One way to accomplish these goals is to create the test using principles of universal design, which take account of the characteristics of all individuals for whom the test is intended and include such elements as precisely defining constructs and avoiding, where possible, characteristics and formats of items and tests (for example, test speededness) that may compromise valid score interpretations for individuals or relevant subgroups. Another principle of universal design is to provide simple, clear, and intuitive testing procedures and instructions. Ultimately, the goal is to design a testing process that will, to the extent practicable, remove potential barriers to the measurement of the intended construct for all individuals, including those individuals requiring accommodations. Test developers need to be knowledgeable about group differences that may interfere
with the precision of scores and the validity of test score inferences, and they need to be able to take steps to reduce bias.

Standard 3.2

Test developers are responsible for developing tests that measure the intended construct and for minimizing the potential for tests' being affected by construct-irrelevant characteristics, such as linguistic, communicative, cognitive, cultural, physical, or other characteristics.

Comment: Unnecessary linguistic, communicative, cognitive, cultural, physical, and/or other characteristics in test item stimulus and/or response requirements can impede some individuals in demonstrating their standing on intended constructs. Test developers should use language in tests that is consistent with the purposes of the tests and that is familiar to as wide a range of test takers as possible. Avoiding the use of language that has different meanings or different connotations for relevant subgroups of test takers will help ensure that test takers who have the skills being assessed are able to understand what is being asked of them and respond appropriately. The level of language proficiency, physical response, or other demands required by the test should be kept to the minimum required to meet work and credentialing requirements and/or to represent the target construct(s). In work situations, the modality in which language proficiency is assessed should be comparable to that required on the job, for example, oral and/or written, comprehension and/or production. Similarly, the physical and verbal demands of response requirements should be consistent with the intended construct.

Standard 3.3

Those responsible for test development should include relevant subgroups in validity, reliability/precision, and other preliminary studies used when constructing the test.

Comment: Test developers should include individuals from relevant subgroups of the intended testing population in pilot or field test samples used to evaluate item and test appropriateness for construct interpretations. The analyses that are carried out using pilot and field testing data should seek to detect aspects of test design, content, and format that might distort test score interpretations for the intended uses of the test scores for particular groups and individuals. Such analyses could employ a range of methodologies, including those appropriate for small sample sizes, such as expert judgment, focus groups, and cognitive labs. Both qualitative and quantitative sources of evidence are important in evaluating whether items are psychometrically sound and appropriate for all relevant subgroups.

If sample sizes permit, it is often valuable to carry out separate analyses for relevant subgroups of the population. When it is not possible to include sufficient numbers in pilot and/or field test samples in order to do separate analyses, operational test results may be accumulated and used to conduct such analyses when sample sizes become large enough to support the analyses.

If pilot or field test results indicate that items or tests function differentially for individuals from, for example, relevant age, cultural, disability, gender, linguistic and/or racial/ethnic groups in the population of test takers, test developers should investigate aspects of test design, content, and format (including response formats) that might contribute to the differential performance of members of these groups and, if warranted, eliminate these aspects from future test development practices.

Expert and sensitivity reviews can serve to guard against construct-irrelevant language and images, including those that may offend some individuals or subgroups, and against construct-irrelevant context that may be more familiar to some than others. Test publishers often conduct sensitivity reviews of all test material to detect and remove sensitive material from tests (e.g., text, graphics, and other visual representations within the test that could be seen as offensive to some groups and possibly affect the scores of individuals from these groups). Such reviews should
viduals from relevant subgroups o f the intended be conducted before a test becom es operational.
64
FAIRNESS IN TESTING
65
CHAPTER 3
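As a purely illustrative aside (not part of the Standards text): when the comment to Standard 3.3 above speaks of items that "function differentially" for subgroups, one widely used screening statistic is the Mantel-Haenszel common odds ratio, computed within strata of a matching criterion such as total test score. The Standards do not prescribe this procedure; the data layout, function name, and flagging threshold below are hypothetical, and the sketch assumes complete 0/1 item scores and at least one stratum containing both correct and incorrect responses in each group.

```python
# Hypothetical sketch of a Mantel-Haenszel DIF screen; not from the Standards.
import math
from collections import defaultdict

def mantel_haenszel_dif(responses, groups, total_scores, item):
    """Common odds ratio for one item, stratified by a matching score.

    responses[i][item] -> 1 (correct) or 0 (incorrect) for examinee i
    groups[i]          -> "reference" or "focal"
    total_scores[i]    -> stratifying criterion, e.g. total test score
    """
    # One 2x2 table (group x correct/incorrect) per total-score stratum.
    tables = defaultdict(lambda: [[0, 0], [0, 0]])
    for resp, grp, tot in zip(responses, groups, total_scores):
        row = 0 if grp == "reference" else 1
        col = 0 if resp[item] == 1 else 1
        tables[tot][row][col] += 1

    num = den = 0.0
    for (a, b), (c, d) in tables.values():
        n = a + b + c + d
        num += a * d / n  # reference-correct * focal-incorrect
        den += b * c / n  # reference-incorrect * focal-correct
    odds_ratio = num / den  # ~1.0 is consistent with no DIF on this item
    # ETS-style delta scale; |delta| of roughly 1.5+ is often flagged for review.
    delta = -2.35 * math.log(odds_ratio)
    return odds_ratio, delta
```

An odds ratio near 1.0 (delta near 0) is consistent with no differential functioning; a flagged item would then go to the kind of content, format, and sensitivity review the comment describes.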
sample sizes are sufficient, studies of score precision and accuracy for relevant subgroups also should be conducted. When sample sizes are small, data may sometimes be accumulated over operational administrations of the test so that suitable quantitative analyses by subgroup can be performed after the test has been in use for a period of time. Qualitative studies also are relevant to the supporting validity arguments (e.g., expert reviews, focus groups, cognitive labs). Test developers should closely consider findings from quantitative and/or qualitative analyses in documenting the interpretations for the intended score uses, as well as in subsequent test revisions.

Analyses, where possible, may need to take into account the level of heterogeneity within relevant subgroups, for example, individuals with different disabilities, or linguistic minority examinees at different levels of English proficiency. Differences within these subgroups may influence the appropriateness of test content, the internal structure of the test responses, the relation of test scores to other variables, or the response processes employed by individual examinees.

Standard 3.7

When criterion-related validity evidence is used as a basis for test score-based predictions of future performance and sample sizes are sufficient, test developers and/or users are responsible for evaluating the possibility of differential prediction for relevant subgroups for which there is prior evidence or theory suggesting differential prediction.

Comment: When sample sizes are sufficient, differential prediction is often examined using regression analysis. One approach to regression analysis examines slope and intercept differences between targeted groups (e.g., Black and White samples), while another examines systematic deviations from a common regression line for the groups of interest. Both approaches can account for the possibility of predictive bias and/or differences in heterogeneity between groups and provide valuable information for the examination of differential predictions. In contrast, correlation coefficients provide inadequate evidence for or against a differential prediction hypothesis if groups or treatments are found to have unequal means and variances on the test and the criterion. It is particularly important in the context of testing for high-stakes purposes that test developers and/or users examine differential prediction and avoid the use of correlation coefficients in situations where groups or treatments result in unequal means or variances on the test and criterion.

Standard 3.8

When tests require the scoring of constructed responses, test developers and/or users should collect and report evidence of the validity of score interpretations for relevant subgroups in the intended population of test takers for the intended uses of the test scores.

Comment: Subgroup differences in examinee responses and/or the expectations and perceptions of scorers can introduce construct-irrelevant variance in scores from constructed response tests. These, in turn, could seriously affect the reliability/precision, validity, and comparability of score interpretations for intended uses for some individuals. Different methods of scoring could differentially influence the construct representation of scores for individuals from some subgroups. For human scoring, scoring procedures should be designed with the intent that the scores reflect the examinee's standing relative to the tested construct(s) and are not influenced by the perceptions and personal predispositions of the scorers. It is essential that adequate training and calibration of scorers be carried out and monitored throughout the scoring process to support the consistency of scorers' ratings for individuals from relevant subgroups. Where sample sizes permit, the precision and accuracy of scores for relevant subgroups also should be calculated.

Automated scoring algorithms may be used to score complex constructed responses, such as essays, either as the sole determiner of the score or in conjunction with a score provided by a human scorer. Scoring algorithms need to be reviewed for potential sources of bias. The precision of scores and validity of score interpretations resulting from automated scoring should be evaluated for all relevant subgroups of the intended population.

Cluster 3. Accommodations to Remove Construct-Irrelevant Barriers and Support Valid Interpretations of Scores for Their Intended Uses

Standard 3.9

Test developers and/or test users are responsible for developing and providing test accommodations, when appropriate and feasible, to remove construct-irrelevant barriers that otherwise would interfere with examinees' ability to demonstrate their standing on the target constructs.

Comment: Test accommodations are designed to remove construct-irrelevant barriers related to individual characteristics that otherwise would interfere with the measurement of the target construct and therefore would unfairly disadvantage individuals with these characteristics. These accommodations include changes in administration setting, presentation, interface/engagement, and response requirements, and may include the addition of individuals to the administration process (e.g., readers, scribes).

An appropriate accommodation is one that responds to specific individual characteristics but does so in a way that does not change the construct the test is measuring or the meaning of scores. Test developers and/or test users should document the basis for the conclusion that the accommodation does not change the construct that the test is measuring. Accommodations must address individual test takers' specific needs (e.g., cognitive, linguistic, sensory, physical) and may be required by law. For example, individuals who are not fully proficient in English may need linguistic accommodations that address their language status, while visually impaired individuals may need text magnification. In many cases when a test is used to evaluate the academic progress of an individual, the accommodation that will best eliminate construct irrelevance will match the accommodation used for instruction.

Test modifications that change the construct that the test is measuring may be needed for some examinees to demonstrate their standing on some aspect of the intended construct. If an assessment is modified to improve access to the intended construct for designated individuals, the modified assessment should be treated like a newly developed assessment that needs to adhere to the test standards for validity, reliability/precision, fairness, and so forth.

Standard 3.10

When test accommodations are permitted, test developers and/or test users are responsible for documenting standard provisions for using the accommodation and for monitoring the appropriate implementation of the accommodation.

Comment: Test accommodations should be used only when the test taker has a documented need for the accommodation, for example, an Individualized Education Plan (IEP) or documentation by a physician, psychologist, or other qualified professional. The documentation should be prepared in advance of the test-taking experience and reviewed by one or more experts qualified to make a decision about the relevance of the documentation to the requested accommodation.

Test developers and/or users should provide individuals requiring accommodations in a testing situation with information about the availability of accommodations and the procedures for requesting them prior to the test administration. In settings where accommodations are routinely provided for individuals with documented needs (e.g., educational settings), the documentation should describe permissible accommodations and include standardized protocols and/or procedures for identifying examinees eligible for accommodations, identifying and assigning appropriate accommodations for these individuals, and administering accommodations, scoring, and reporting in accordance with standardized rules.
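As an illustrative aside (not part of the Standards text), returning to the comment to Standard 3.7 above: the first regression approach it mentions, comparing slope and intercept differences between targeted groups, can be sketched as follows. The function names and data layout are invented, and a real analysis would add formal significance tests for the slope and intercept differences rather than simple inspection.

```python
# Hypothetical sketch of group-specific regressions for differential
# prediction (Standard 3.7, comment); names and data are invented.
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope  # (intercept, slope)

def differential_prediction(test_scores, criteria, groups):
    """Fit a criterion-prediction line per subgroup.

    Large slope or intercept differences between groups suggest that a
    common regression line would over- or under-predict the criterion
    for some group; a real study would test these differences formally.
    """
    lines = {}
    for g in set(groups):
        xs = [x for x, gg in zip(test_scores, groups) if gg == g]
        ys = [y for y, gg in zip(criteria, groups) if gg == g]
        lines[g] = fit_line(xs, ys)
    return lines
```

Here equal slopes with unequal intercepts would indicate a constant over- or under-prediction for one group, which is exactly the pattern correlation coefficients alone cannot reveal when group means and variances differ.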
Test administrators and users should also provide those who have a role in determining and administering accommodations with sufficient information and expertise to appropriately use accommodations that may be applied to the assessment. Instructions for administering any changes in the test or testing procedures should be clearly documented and, when necessary, test administrators should be trained to follow these procedures. The test administrator should administer the accommodations in a standardized manner as documented by the test developer. Administration procedures should include procedures for recording which accommodations were used for specific individuals and, where relevant, for recording any deviation from standardized procedures for administering the accommodations.

The test administrator or appropriate representative of the test user should document any use of accommodations. For large-scale education assessments, test users also should monitor the appropriate use of accommodations.

Standard 3.11

When a test is changed to remove barriers to the accessibility of the construct being measured, test developers and/or users are responsible for obtaining and documenting evidence of the validity of score interpretations for intended uses of the changed test, when sample sizes permit.

Comment: It is desirable, where feasible and appropriate, to pilot and/or field test any test alterations with individuals representing each relevant subgroup for whom the alteration is intended. Validity studies typically should investigate both the efficacy of the alteration for intended subgroup(s) and the comparability of score inferences from the altered and original tests.

In some circumstances, developers may not be able to obtain sufficient samples of individuals, for example, those with the same disability or similar levels of a disability, to conduct standard empirical analyses of reliability/precision and validity. In these situations, alternative ways should be sought to evaluate the validity of the changed test for relevant subgroups, for example through small-sample qualitative studies or professional judgments that examine the comparability of the original and altered tests and/or that investigate alternative explanations for performance on the changed tests.

Evidence should be provided for recommended alterations. If a test developer recommends different time limits, for example, for individuals with disabilities or those from diverse linguistic and cultural backgrounds, pilot or field testing should be used, whenever possible, to establish these particular time limits rather than simply allowing test takers a multiple of the standard time without examining the utility of the arbitrary implementation of multiples of the standard time. When possible, fatigue and other time-related issues should be investigated as potentially important factors when time limits are extended.

When tests are linguistically simplified to remove construct-irrelevant variance, test developers and/or users are responsible for documenting evidence of the comparability of scores from the linguistically simplified tests to the original test, when sample sizes permit.

Standard 3.12

When a test is translated and adapted from one language to another, test developers and/or test users are responsible for describing the methods used in establishing the adequacy of the adaptation and documenting empirical or logical evidence for the validity of test score interpretations for intended use.

Comment: The term adaptation is used here to describe changes made to tests translated from one language to another to reduce construct-irrelevant variance that may arise due to individual or subgroup characteristics. In this case the translation/adaptation process involves not only translating the language of the test so that it is suitable for the subgroup taking the test, but also addressing any construct-irrelevant linguistic and cultural subgroup characteristics that may interfere with
measurement of the intended construct(s). When multiple language versions of a test are intended to provide comparable scores, test developers should describe in detail the methods used for test translation and adaptation and should report evidence of test score validity pertinent to the linguistic and cultural groups for whom the test is intended and pertinent to the scores' intended uses. Evidence of validity may include empirical studies and/or professional judgment documenting that the different language versions measure comparable or similar constructs and that the score interpretations from the two versions have comparable validity for their intended uses. For example, if a test is translated and adapted into Spanish for use with Central American, Cuban, Mexican, Puerto Rican, South American, and Spanish populations, the validity of test score interpretations for specific uses should be evaluated with members of each of these groups separately, where feasible. Where sample sizes permit, evidence of score accuracy and precision should be provided for each group, and test properties for each subgroup should be included in test manuals.

Standard 3.13

A test should be administered in the language that is most relevant and appropriate to the test purpose.

Comment: Test users should take into account the linguistic and cultural characteristics and relative language proficiencies of examinees who are bilingual or use multiple languages. Identifying the most appropriate language(s) for testing also requires close consideration of the context and purpose for testing. Except in cases where the purpose of testing is to determine test takers' level of proficiency in a particular language, the test takers should be tested in the language in which they are most proficient. In some cases, test takers' most proficient language in general may not be the language in which they were instructed or trained in relation to tested constructs, and in these cases it may be more appropriate to administer the test in the language of instruction.

Professional judgment needs to be used to determine the most appropriate procedures for establishing relative language proficiencies. Such procedures may range from self-identification by examinees to formal language proficiency testing. Sensitivity to linguistic and cultural characteristics may require the sole use of one language in testing or use of multiple languages to minimize the introduction of construct-irrelevant components into the measurement process.

Determination of a test taker's most proficient language for test administration does not automatically guarantee validity of score inferences for the intended use. For example, individuals may be more proficient in one language than another, but not necessarily developmentally proficient in either; disconnects between the language of construct acquisition and that of assessment also can compromise appropriate interpretation of the test taker's scores.

Standard 3.14

When testing requires the use of an interpreter, the interpreter should follow standardized procedures and, to the extent feasible, be sufficiently fluent in the language and content of the test and the examinee's native language and culture to translate the test and related testing materials and to explain the examinee's test responses, as necessary.

Comment: Although individuals with limited proficiency in the language of the test (including deaf and hard-of-hearing individuals whose native language may be sign language) should ideally be tested by professionally trained bilingual/bicultural examiners, the use of an interpreter may be necessary in some situations. If an interpreter is required, the test user is responsible for selecting an interpreter with reasonable qualifications, experience, and preparation to assist appropriately in the administration of the test. As with other aspects of standardized testing, procedures for administering a test when an interpreter is used should be standardized and documented. It is necessary for the interpreter to understand the
may evaluate some groups under the test and other groups under a different test.

Standard 3.17

When aggregate scores are publicly reported for relevant subgroups—for example, males and females, individuals of differing socioeconomic status, individuals differing by race/ethnicity, individuals with different sexual orientations, individuals with diverse linguistic and cultural backgrounds, individuals with disabilities, young children or older adults—test users are responsible for providing evidence of comparability and for including cautionary statements whenever credible research or theory indicates that test scores may not have comparable meaning across these subgroups.

Comment: Reporting scores for relevant subgroups is justified only if the scores have comparable meaning across these groups and there is sufficient sample size per group to protect individual identity and warrant aggregation. This standard is intended to be applicable to settings where scores are implicitly or explicitly presented as comparable in meaning across subgroups. Care should be taken that the terms used to describe reported subgroups are clearly defined, consistent with common usage, and clearly understood by those interpreting test scores.

Terminology for describing specific subgroups for which valid test score inferences can and cannot be drawn should be as precise as possible, and categories should be consistent with the intended uses of the results. For example, the terms Latino or Hispanic can be ambiguous if not specifically defined, in that they may denote individuals of Cuban, Mexican, Puerto Rican, South or Central American, or other Spanish-culture origin, regardless of race/ethnicity, and may combine those who are recent immigrants with those who are U.S. native born, those who may not be proficient in English, and those of diverse socioeconomic background. Similarly, the term “individuals with disabilities” encompasses a wide range of specific conditions and background characteristics. Even references to specific categories of individuals with disabilities, such as hearing impaired, should be accompanied by an explanation of the meaning of the term and an indication of the variability of individuals within the group.

Standard 3.18

In testing individuals for diagnostic and/or special program placement purposes, test users should not use test scores as the sole indicators to characterize an individual's functioning, competence, attitudes, and/or predispositions. Instead, multiple sources of information should be used, alternative explanations for test performance should be considered, and the professional judgment of someone familiar with the test should be brought to bear on the decision.

Comment: Many test manuals point out variables that should be considered in interpreting test scores, such as clinically relevant history, medications, school record, vocational status, and test-taker motivation. Influences associated with variables such as age, culture, disability, gender, and linguistic or racial/ethnic characteristics may also be relevant.

Opportunity to learn is another variable that may need to be taken into account in educational and/or clinical settings. For instance, if recent immigrants being tested on a personality inventory or an ability measure have little prior exposure to school, they may not have had the opportunity to learn concepts that the test assumes are common knowledge or common experience, even if the test is administered in the native language. Not taking into account prior opportunity to learn can lead to misdiagnoses, inappropriate placements and/or services, and unintended negative consequences.

Inferences about test takers' general language proficiency should be based on tests that measure a range of language features, not a single linguistic skill. A more complete range of communicative abilities (e.g., word knowledge, syntax as well as cultural variation) will typically need to be assessed. Test users are responsible for interpreting individual
scores in light of alternative explanations and/or relevant individual variables noted in the test manual.

Standard 3.19

In settings where the same authority is responsible for both provision of curriculum and high-stakes decisions based on testing of examinees' curriculum mastery, examinees should not suffer permanent negative consequences if evidence indicates that they have not had the opportunity to learn the test content.

Comment: In educational settings, students' opportunity to learn the content and skills assessed by an achievement test can seriously affect their test performance and the validity of test score interpretations for intended use for high-stakes individual decisions. If there is not a good match between the content of curriculum and instruction and that of tested constructs for some students, those students cannot be expected to do well on the test and can be unfairly disadvantaged by high-stakes individual decisions, such as denying high school graduation, that are made based on test results. When an authority, such as a state or district, is responsible for prescribing and/or delivering curriculum and instruction, it should not penalize individuals for test performance on content that the authority has not provided.

Note that this standard is not applicable in situations where different authorities are responsible for curriculum, testing, and/or interpretation and use of results. For example, opportunity to learn may be beyond the knowledge or control of test users, and it may not influence the validity of test interpretations such as predictions of future performance.

Standard 3.20

When a construct can be measured in different ways that are equal in their degree of construct representation and validity (including freedom from construct-irrelevant variance), test users should consider, among other factors, evidence of subgroup differences in mean scores or in percentages of examinees whose scores exceed the cut scores, in deciding which test and/or cut scores to use.

Comment: Evidence of differential subgroup performance is one important factor influencing the choice between one test and another. However, other factors, such as cost, testing time, test security, and logistical issues (e.g., the need to screen very large numbers of examinees in a very short time), must also enter into professional judgments about test selection and use. If the scores from two tests lead to equally valid interpretations and impose similar costs or other burdens, legal considerations may require selecting the test that minimizes subgroup differences.
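As an illustrative aside to Standard 3.20 above (not part of the Standards text): the evidence it asks users to weigh, subgroup mean differences and the percentages of examinees at or above a cut score, can be tabulated simply. The function name and data layout are hypothetical.

```python
# Hypothetical sketch of the subgroup evidence named in Standard 3.20;
# the data, groups, and cut score are invented for illustration.
def subgroup_summary(scores, groups, cut_score):
    """Return {group: (mean score, proportion at or above the cut)}."""
    out = {}
    for g in set(groups):
        gs = [s for s, gg in zip(scores, groups) if gg == g]
        mean = sum(gs) / len(gs)
        pass_rate = sum(s >= cut_score for s in gs) / len(gs)
        out[g] = (mean, pass_rate)
    return out
```

Comparing such summaries across the candidate tests (and candidate cut scores) supplies the differential-performance evidence that, alongside cost, security, and logistics, informs the selection judgment the comment describes.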
PART II
Operations
4. TEST DESIGN AND DEVELOPMENT
BACKGROUND
Test development is the process of producing a measure of some aspect of an individual's knowledge, skills, abilities, interests, attitudes, or other characteristics by developing questions or tasks and combining them to form a test, according to a specified plan. The steps and considerations for this process are articulated in the test design plan. Test design begins with consideration of expected interpretations for intended uses of the scores to be generated by the test. The content and format of the test are then specified to provide evidence to support the interpretations for intended uses. Test design also includes specification of test administration and scoring procedures, and of how scores are to be reported. Questions or tasks (hereafter referred to as items) are developed following the test specifications and screened using criteria appropriate to the intended uses of the test. Procedures for scoring individual items and the test as a whole are also developed, reviewed, and revised as needed. Test design is commonly iterative, with adjustments and revisions made in response to data from tryouts and operational use.

Test design and development procedures must support the validity of the interpretations of test scores for their intended uses. For example, current educational assessments often are used to indicate students' proficiency with regard to standards for the knowledge and skill a student should exhibit; thus, the relationship between the test content and the established content standards is key. In this case, content specifications must clearly describe the content and/or cognitive categories to be covered so that evidence of the alignment of the test questions to these categories can be gathered. When normative interpretations are intended, development procedures should include a precise definition of the reference population and plans to collect appropriate normative data. Many tests, such as employment or college selection tests, rely on predictive validity evidence. Specifications for such tests should include descriptions of the outcomes the test is designed to predict and plans to collect evidence of the effectiveness of test scores in predicting these outcomes.

Issues bearing on validity, reliability, and fairness are interwoven within the stages of test development. Each of these topics is addressed comprehensively in other chapters of the Standards: validity in chapter 1, reliability in chapter 2, and fairness in chapter 3. Additional material on test administration and scoring, and on reporting and interpretation of scores and results, is provided in chapter 6. Chapter 5 discusses score scales, and chapter 7 covers documentation requirements.

In addition, test developers should respect the rights of participants in the development process, including pretest participants. In particular, developers should take steps to ensure proper notice and consent from participants and to protect participants' personally identifiable information consistent with applicable legal and professional requirements. The rights of test takers are discussed in chapter 8.

This chapter describes four phases of the test development process leading from the original statement of purpose(s) to the final product: (a) development and evaluation of the test specifications; (b) development, tryout, and evaluation of the items; (c) assembly and evaluation of new test forms; and (d) development of procedures and materials for administration and scoring. What follows is a description of typical test development procedures, although there may be sound reasons that some of the steps covered in the description are followed in some settings and not in others.

Test Specifications

General Considerations

In nearly all cases, test development is guided by a set of test specifications. The nature of these
specifications and the way in which they are created may vary widely as a function of the nature of the test and its intended uses. The term test specifications is sometimes limited to description of the content and format of the test. In the Standards, test specifications are defined more broadly to also include documentation of the purpose and intended uses of the test, as well as detailed decisions about content, format, test length, psychometric characteristics of the items and test, delivery mode, administration, scoring, and score reporting.

Responsibility for developing test specifications also varies widely across testing programs. For most commercial tests, test specifications are created by the test developer. In other contexts, such as tests used for educational accountability, many aspects of the test specifications are established through a public policy process. As discussed in the introduction, the generic term test developer is used in this chapter in preference to other terms, such as test publisher, to cover both those responsible for developing and those responsible for implementing test specifications across a wide range of test development processes.

Statement of Purpose and Intended Uses

The process of developing educational and psychological tests should begin with a statement of the purpose(s) of the test, the intended users and uses, the construct or content domain to be measured, and the intended examinee population. Tests of the same construct or domain can differ in important ways because factors such as purpose, intended uses, and examinee population may vary. In addition, tests intended for diverse examinee populations must be developed to minimize construct-irrelevant factors that may unfairly depress or inflate some examinees' performance. In many cases, accommodations and/or alternative versions of tests may need to be specified to remove irrelevant barriers to performance for particular subgroups in the intended examinee population.

Specification of intended uses will include an indication of whether the test score interpretations will be primarily norm-referenced or criterion-referenced. When scores are norm-referenced, relative score interpretations are of primary interest. A score for an individual or for a definable group is ranked within a distribution of scores or compared with the average performance of test takers in a reference population (e.g., based on age, grade, diagnostic category, or job classification). When interpretations are criterion-referenced, absolute score interpretations are of primary interest. The meaning of such scores does not depend on rank information. Rather, the test score conveys directly a level of competence in some defined criterion domain. Both relative and absolute interpretations are often used with a given test, but the test developer determines which approach is most relevant to specific uses of the test.

Content Specifications

The first step in developing test specifications is to extend the original statement of purpose(s), and the construct or content domain being considered, into a framework for the test that describes the extent of the domain, or the scope of the construct to be measured. Content specifications, sometimes referred to as content frameworks, delineate the aspects (e.g., content, skills, processes, and diagnostic features) of the construct or domain to be measured. The specifications should address questions about what is to be included, such as “Does eighth-grade mathematics include algebra?” “Does verbal ability include text comprehension as well as vocabulary?” “Does self-esteem include both feelings and acts?” The delineation of the content specifications can be guided by theory or by an analysis of the content domain (e.g., an analysis of job requirements in the case of many credentialing and employment tests). The content specifications serve as a guide to subsequent test evaluation. The chapter on validity provides a more thorough discussion of the relationships among the construct or content domain, the test framework, and the purpose(s) of the test.

Format Specifications

Once decisions have been made about what the test is to measure and what meaning its scores are intended to convey, the next step is to create format specifications. Format specifications delineate the format of items (i.e., tasks or questions); the response format or conditions for responding;

valid for all intended examinees, to the maximum extent possible, is critical. Formats that may be
and the type o f scoring procedures. Although unfamiliar to some groups o f test takers or that
format decisions are often driven by considerations place inappropriate demands should be avoided.
o f expediency, such as ease o f responding or cost The principles of universal design describe the use
o f scoring, validity considerations must not be o f test formats that allow tests to be taken without
overlooked. For example, if test questions require adaptation by as broad a range o f individuals as
test takers to possess significant linguistic skill to possible, but they do not necessarily eliminate the
interpret them but the test is not intended as a need for adaptations. Format specifications should
measure o f linguistic skill, the complexity o f the include consideration o f alternative formats that
questions may lead to construct-irrelevant variance might also be needed to remove irrelevant barriers
in test scores. This would be unfair to test takers to performance, such as large print or braille for
with limited linguistic skills, thereby reducing the examinees who are visually impaired or, where ap
validity o f the test scores as a measure o f the propriate to die construct being measured, bilingual
intended content. Format specifications should dictionaries for test takers who are more proficient
include a rationale for how the chosen format in a language other than the language o f the test.
supports the validity, reliability, and fairness o f The number and types o f adaptations to be specified
intended uses of the resulting scores. depend on both the nature o f the construct being
T h e nature o f the item and response formats assessed and the targeted population o f test takers.
that may be specified depends on the purposes o f
the test, the defined dom ain o f the test, and the Com plex item formats. Some testing programs
testing platform. Selected-response formats, such employ more complex item formats. Examples in
as true-false or multiple-choice items, are suitable clude performance assessments, simulations, and
for many purposes o f testing. Computer-based portfolios. Specifications for more complex item
testing allows different ways o f indicating responses, formats should describe the domain from which
such as drag-and-drop. Other purposes may be the items or tasks are sampled, components o f the
more effectively served by a short-answer format. domain to be assessed by the tasks or items, and
Short-answer items require a response o f no more critical features o f the items that should be replicated
than a few words. Extended-response formats in creating items for alternate forms. Special con
require the test taker to write a more extensive re siderations for complex item formats are illustrated
sponse o f one or more sentences or paragraphs. through the following discussion o f performance
Performance assessments often seek to emulate assessments, simulations, and portfolios.
the context or conditions in which the intended
knowledge or skills are actually applied. O ne type Petfoimance assessments. Performance assessments
o f performance assessment, for example, is the require examinees to demonstrate the ability to
standardized job or work sample where a task is perform tasks that are often complex in nature
presented to the test taker in a standardized format and generally require the test takers ro demonstrate
under standardized conditions. Job or work samples their abilities or skills in settings that closely
might include the assessment o f a medical practi resemble real-life situations. O ne distinction
tioners ability to make an accurate diagnosis and between performance assessments and other forms
recommend treatment for a defined condition, a o f tests is the type o f response that is required
m anagers ability to articulate goals for an organi from the test takers. Performance assessments
zation, or a student’s proficiency in performing a require the test takers to carry out a process such
science laboratory experiment. as playing a musical instrument or tuning a car’s
engine or creating a product such as a written
Accessibility o f item formats. As described in essay. An assessment o f a clinical psychologist in
chapter 3, designing tests to be accessible and training may require the test taker to interview a
77
CHAPTER 4
client, choose appropriate tests, arrive at a diagnosis, like that o f other assessment procedures, must
and plan for therapy. flow from the purpose o f the assessment. Typical
Because performance assessm ents typically purposes include judgment o f Improvement in
consist o f a small number o f tasks, establishing job or educational performance and evaluation o f
the extent to which the results can be generalized eligibility for employment, prom otion, or gradu
to a broader domain described in the test specifi ation. Portfolio specifications indicate the nature
cations is particularly important. T h e test specifi o f the work that is to be included in the portfolio.
cations should indicate critical dimensions to be The portfolio may include entries such as repre
measured (e.g., skills and knowledge, cognitive sentative products, the best work o f the test taker,
processes, context for performing the tasks) so or indicators o f progress. For example, in an em
that tasks selected for testing will systematically ployment setting involving promotion decisions,
represent the critical dimensions, leading to a employees may be instructed to include their best
comprehensive coverage o f the dom ain as well as work or products. Alternatively, i f the purpose is
consistent coverage across test forms. Specification to judge students’ educational growth, the students
o f the domain to be covered is also im portant for may be asked to provide evidence o f improvement
clarifying potentially irrelevant sources o f variation with respect to particular competencies or skills.
in performance. Further, both theoretical and Students m ay also be asked to provide justifications
empirical evidence are important for documenting for their choices or a cover piece reflecting on the
the extent to which performance assessments— work presented and what the student has learned
tasks as well as scoring criteria— reflect the processes from it. Still other methods may call for the use
or skills that are specified by the domain definition. o f videos, exhibitions, or demonstrations.
When tasks are designed to elicit complex cognitive The specifications for the portfolio indicate
processes, detailed analyses o f the tasks and scoring who is responsible for selecting its contents. For
criteria and both theoretical and empirical analyses example, the specifications must state whether
o f the test takers’ performances on the tasks the test taker, the examiner, or both parties working
provide necessary validity evidence. together should be involved in the selection o f
the contents o f the portfolio. The particular re
Simulations. Simulation assessments are similar sponsibilities o f each party are delineated in the
to performance assessments in that they require specifications. In employment settings, employees
the examinee to engage in a com plex set o f may be involved in the selection o f their work
behaviors for a specified period o f time. Simulations and products that demonstrate their competencies
are sometimes a substitute for performance as for promotion purposes. Analogously, in educational
sessments, when actual task performance might applications, students may participate in the se
be cosdy or dangerous. Specifications for simulation lection o f som e o f their work and the products to
tasks should describe the domain o f activities to be included in their portfolios.
be covered by the tasks, critical dimensions o f Specifications for how portfolios are scored
performance to be reflected in each task, and and by whom will vary as a function o f the use o f
specific format considerations such as the number the portfolio scores. Centralized evaluation o f
or duration o f the tasks and essentials o f how the portfolios is com m on where portfolios are used
user interacts with the tasks. Specifications should in high-stakes decisions. The more standardized
be sufficient to allow experts to judge the compa the contents and procedures for collecting and
rability o f different sets o f simulation tasks included scoring material, the more comparable the scores
in alternate forms. from the resulting portfolios will be. Regardless
o f the methods used, all performance assessments,
Portfolios. Portfolios are systematic collections simulations, and portfolios are evaluated by the
o f work or educational products, typically gathered same standards o f technical quality as other forms
over time. The design o f a portfolio assessment, o f tests.
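Scoring comparability for portfolios and other centrally scored performance assessments is often summarized with rater-agreement statistics. As an illustrative sketch only (the Standards do not prescribe a particular statistic, and the rubric scores below are invented), the following computes exact agreement and Cohen's kappa for two raters:

```python
from collections import Counter

def exact_agreement(rater_a, rater_b):
    """Proportion of submissions given identical scores by both raters."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on a categorical rubric."""
    n = len(rater_a)
    observed = exact_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if each rater assigned scores independently
    # at their observed base rates (Counter returns 0 for missing keys).
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical 1-4 rubric scores from two trained raters on ten portfolios.
rater_1 = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
rater_2 = [3, 2, 3, 3, 1, 2, 4, 4, 2, 3]
print(exact_agreement(rater_1, rater_2))          # 0.8
print(round(cohens_kappa(rater_1, rater_2), 2))   # 0.71
```

A kappa well above zero indicates agreement beyond what the raters' base rates alone would produce; operational programs typically also check each rater against scores assigned by experts.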
TEST DESIGN AND DEVELOPMENT
CHAPTER 4
[…] score complex examinee responses, such as essays. In such cases, scoring specifications should indicate how scores are generated by these algorithms and how they are to be checked and validated.

Scoring specifications will also include whether test scores are simple sums of item scores, involve differential weighting of items or sections, or are based on a more complex measurement model. If an IRT model is used, specifications should indicate the form of the model, how model parameters are to be estimated, and how model fit is to be evaluated.

Test Administration Specifications

Test administration specifications describe how the test is to be administered. Administration procedures include mode of test delivery (e.g., paper-and-pencil or computer based), time limits, accommodation procedures, instructions and materials provided to examiners and examinees, and procedures for monitoring test taking and ensuring test security. For tests administered by computer, administration specifications will also include a description of any hardware and software requirements, including connectivity considerations for Web-based testing.

Refining the Test Specifications

There is often a subtle interplay between the process of conceptualizing a construct or content domain and the development of a test of that construct or domain. The specifications for the test provide a description of how the construct or domain will be represented and may need to be refined as development proceeds. The procedures used to develop items and scoring rubrics and to examine item and test characteristics may often contribute to clarifying the specifications. The extent to which the construct is fully defined a priori depends on the testing application. In many testing applications, well-defined and detailed test specifications guide the development of items and their associated scoring rubrics and procedures. In some areas of psychological measurement, test development may be less dependent on an a priori defined framework and may rely more on a data-based approach that results in an empirically derived definition of the construct being measured. In such instances, items are selected primarily on the basis of their empirical relationship with an external criterion, their relationships with one another, or the degree to which they discriminate among groups of individuals. For example, items for a test for sales personnel might be selected based on the correlations of item scores with productivity measures of current sales personnel. Similarly, an inventory to help identify different patterns of psychopathology might be developed using patients from different diagnostic subgroups. When test development relies on a data-based approach, some items will likely be selected based on chance occurrences in the data. Cross-validation studies, which involve administering the test to a comparable sample that was not involved in the original test development effort, are routinely conducted to determine the tendency to select items by chance.

In other testing applications, however, the test specifications are fixed in advance and guide the development of items and scoring procedures. Empirical relationships may then be used to inform decisions about retaining, rejecting, or modifying items. Interpretations of scores from tests developed by this process have the advantage of both a theoretical and an empirical foundation for the underlying dimensions represented by the test.

Considerations for Adaptive Testing

In adaptive testing, test items or sets of items are selected as the test is being administered, based on the test taker's responses to prior items. Specification of item selection algorithms may involve consideration of content coverage as well as increasing the precision of the score estimate. When several items are tied to a single passage or task, more complex algorithms for selecting the next passage or task are needed. In some instances, a larger number of items are developed for each passage or task, and the selection algorithm chooses specific items to administer based on content and precision considerations. Specifications must also indicate whether a fixed number of items are to be administered or whether the test is to continue until precision or content coverage criteria are met.
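The scoring-model and item-selection specifications described above can be made concrete with a small sketch. The two-parameter logistic (2PL) IRT model and maximum-information selection used here are one common choice rather than anything the Standards prescribe; the item parameters are invented, and content-balancing and exposure-control constraints are omitted for brevity:

```python
import math

def p_correct(theta, a, b):
    """2PL IRT model: probability of a correct response at ability theta,
    for an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta: how much the item
    sharpens the ability estimate at that point on the scale."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta, pool, administered):
    """Adaptive step: among unadministered items, pick the one with
    maximum information at the current provisional ability estimate."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *pool[i]))

# Hypothetical item pool: (discrimination, difficulty) pairs.
pool = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.5)]
print(next_item(0.4, pool, administered=set()))   # 2: most informative near 0.4
print(next_item(0.4, pool, administered={2}))     # 0: best remaining item
```

An operational program would interleave this selection step with re-estimation of theta after each response and with the termination rules (fixed length, or precision and coverage criteria) that the specifications require.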
The use of adaptive testing and related computer-based testing models also involves special considerations related to item development. When a pool of operational items is developed for a computerized adaptive test, the specifications refer both to the item pool and to the rules or procedures by which an individualized set of items is selected for each test taker. Some of the appealing features of computerized adaptive tests, such as tailoring the difficulty level of the items to the test taker's ability, place additional constraints on the design of such tests. In most cases, large numbers of items are needed in constructing a computerized adaptive test to ensure that the set of items administered to each test taker meets all of the requirements of the test specifications. Further, tests often are developed in the context of larger systems or programs. Multiple pools of items, for example, may be created for use with different groups of test takers or on different testing dates. Test security concerns are heightened when limited availability of equipment makes it impossible to test all examinees at the same time. A number of issues, including test security, the complexity of content coverage requirements, required score precision levels, and whether test takers might be allowed to retest using the same pool, must be considered when specifying the size of the item pools associated with each form of the adaptive test.

The development of items for adaptive testing typically requires a greater proportion of items to be developed at high or low levels of difficulty relative to the targeted testing population. Tryout data for items developed for use in adaptive tests should be examined for possible context effects, to assess how much item parameters might shift when items are administered in different orders. In addition, if items are associated with a common passage or stimulus, development should be informed by an understanding of how item selection will work. For example, the approach to developing items associated with a passage may differ depending on whether the item selection algorithm selects all of the available items related to the passage or is able to choose subsets of the available items related to the passage. Because of the issues that arise when items or tasks are nested within common passages or stimuli, variations on adaptive testing are often considered. For example, multistage testing begins with a set of routing items. Once these are given and scored, the computer branches to item groups that are explicitly targeted to appropriate difficulty levels, based on the evaluation of examinees' observed performance on the routing items. In general, the special requirements of adaptive testing necessitate some shift in the way in which items are developed and tried out. Although the fundamental principles of quality item development are no different, greater attention must be given to the interactions among content, format, and item difficulty to achieve item pools that are best suited to this testing approach.

Systems Supporting Item and Test Development

The increased reliance on technology and the need for speed and efficiency in the test development process require consideration of the systems supporting item and test development. Such systems can enhance good item and test development practice by facilitating item/task authoring and reviewing, providing item banking and automated tools to assist with test form development, and integrating item/task statistical information with item/task text and graphics. These systems can be developed to comply with interoperability and accessibility standards and frameworks that make it easier for test users to transition their testing programs from one test developer to another. Although the specifics of item databases and supporting systems are outside the scope of the Standards, the increased availability of such systems compels those responsible for developing tests to consider applying technology to test design and development. Test developers should evaluate the costs and benefits of different applications, considering issues such as speed of development, transportability across testing platforms, and security.

Item Development and Review

The test developer usually assembles an item pool that consists of more questions or tasks than are needed to populate the test form or forms to be built. This allows the test developer to select a set of items for one or more forms of the test that meet the test specifications. The quality of the items is usually ascertained through item review procedures and item tryouts, often referred to as pretesting. Items are reviewed for content quality, clarity, and construct-irrelevant aspects of content that influence test takers' responses. In most cases, sound practice dictates that items be reviewed for sensitivity and potential offensiveness that could introduce construct-irrelevant variance for individuals or groups of test takers. An attempt is generally made to avoid words and topics that may offend or otherwise disturb some test takers, if less offensive material is equally useful (see chap. 3). For constructed-response questions and performance tasks, development includes item-specific scoring rubrics as well as prompts or task descriptions. Reviewers should be knowledgeable about test content and about the examinee groups covered by this review.

Often, new test items are administered to a group of test takers who are as representative as possible of the target population for the test and, where possible, who adequately represent individuals from intended subgroups. Item tryouts help determine some of the psychometric properties of the test items, such as an item's difficulty and its ability to distinguish among test takers of different standing on the construct being assessed. Ongoing testing programs often pretest items by inserting them into existing operational tests (the tryout items do not contribute to the scores that test takers receive). Analyses of responses to these tryout items provide useful data for evaluating quality and appropriateness prior to operational use.

Statistical analyses of item tryout data commonly include studies of differential item functioning (see chap. 3, “Fairness in Testing”). Differential item functioning is said to exist when test takers from different groups (e.g., groups defined by gender, race/ethnicity, or age) who have approximately equal ability on the targeted construct or content domain differ in their responses to an item. In theory, the ultimate goal of such studies is to identify construct-irrelevant aspects of item content, item format, or scoring criteria that may differentially affect test scores of one or more groups of test takers. When differential item functioning is detected, test developers try to identify plausible explanations for the differences, and they may then replace or revise items to promote sound score interpretations for all examinees. When items are dropped due to a differential item functioning index, the test developer must take care that any replacements or revisions do not compromise coverage of the specified test content.

Test developers sometimes use approaches involving structured interviews or think-aloud protocols with selected test takers. Such approaches, sometimes referred to as cognitive labs, are used to identify irrelevant barriers to responding correctly that might limit the accessibility of the test content. Cognitive labs are also used to provide evidence that the cognitive processes being followed by those taking the assessment are consistent with the construct to be measured.

Additional steps are involved in the evaluation of scoring rubrics for extended-response items or performance tasks. Test developers must identify responses that illustrate each scoring level, for use in training and checking scorers. Developers also identify responses at the borders between adjacent score levels for use in more detailed discussions during scorer training. Statistical analyses of scoring consistency and accuracy (agreement with scores assigned by experts) should be included in the analysis of tryout data.

Assembling and Evaluating Test Forms

The next step in test development is to assemble items into one or more test forms or to identify one or more pools of items for an adaptive or multistage test. The test developer is responsible for documenting that the items selected for the test meet the requirements of the test specifications. In particular, the set of items selected for a new test form or an item pool for an adaptive test must meet both content and psychometric specifications. In addition, editorial and content reviews are commonly conducted to replace items that are too similar to other items or that may provide clues to the answers to other items in the same test form or item pool. When multiple forms of a test are prepared, the test specifications govern each of the forms.

New test forms are sometimes tried out or field tested prior to operational use. The purpose of a field test is to determine whether items function as intended in the context of the new test form and to assess statistical properties, such as score precision or reliability, of the new form. When field tests are conducted, all relevant examinee groups should be included so that results and conclusions will generalize to the intended operational use of the new test forms and support further analyses of the fairness of the new forms.

Developing Procedures and Materials for Administration and Scoring

Many interested persons (e.g., practitioners, teachers) may be involved in developing items and scoring rubrics, and/or evaluating the subsequent performances. If a participatory approach is used, participants' knowledge about the domain being assessed and their ability to apply the scoring rubrics are of critical importance. Equally important for those involved in developing tests and evaluating performances is their familiarity with the nature of the population being tested. Relevant characteristics of the population being tested may include the typical range of expected skill levels, familiarity with the response modes required of them, typical ways in which knowledge and skills are displayed, and the primary language used.

Test development includes creation of a number of documents to support test administration as described in the test specifications. Instructions to test users are developed and tried out as part of pilot or field testing procedures. Instructions and training for test administrators must also be developed and tried out. A key consideration in developing test administration procedures and materials is that test administration should be fair to all examinees. This means that instructions for taking the test should be clear and that test administration conditions should be standardized for all examinees. It also means consideration must be given in advance to appropriate testing accommodations for examinees who need them, as discussed in chapter 3.

For computer-administered tests, administration procedures must be consistent with hardware and software requirements included in the test specifications. Hardware requirements may cover processor speed and memory; keyboard, mouse, or other input devices; monitor size and display resolution; and connectivity to local servers or the Internet. Software requirements cover operating systems, browsers, or other common tools, and provisions for blocking access to, or interference from, other software. Examinees taking computer-administered tests should be informed about how to respond to questions, how to navigate through the test, whether they can skip items, whether they can revisit previously answered items later in the testing period, whether they can suspend the testing session to a later time, and other exigencies that may occur during testing.

Test security procedures should also be implemented in conjunction with both administration and scoring of the tests. Such procedures often include tracking and storage of materials; encryption of electronic transmission of exam content and scores; nondisclosure agreements for test takers, scorers, and administrators; and procedures for monitoring examinees during the testing session. In addition, for testing programs that reuse test items or test forms, security procedures should include evaluation of changes in item statistics to assess the possibility of a security breach. Test developers or users might consider monitoring of websites for possible disclosure of test content.

Test Revisions

Tests and their supporting documents (e.g., test manuals, technical manuals, user guides) should be reviewed periodically to determine whether revisions are needed. Revisions or amendments are necessary when new research data, significant changes in the domain, or new conditions of test use and interpretation suggest that the test is no longer optimal or fully appropriate for some of its intended uses. As an example, tests are revised if the test content or language has become outdated and, therefore, may subsequently affect the validity of the test score interpretations. However, outdated norms may not have the same implications for revisions as an outdated test. For example, it may be necessary to update the norms for an achievement test after a period of rising or falling achievement in the norming population, or when there are changes in the test-taking population; but the test content itself may continue to be as relevant as it was when the test was developed. The timing of the need for review will vary as a function of test content and intended use(s). For example, tests of mastery of educational or training curricula should be reviewed whenever the corresponding curriculum is updated. Tests assessing psychological constructs should be reviewed when research suggests a revised conceptualization of the construct.
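The evaluation of changes in item statistics recommended above, both to detect possible security breaches and to help decide when content has become outdated, can be sketched as a simple screen on classical proportion-correct (p-value) statistics. This is an illustrative sketch only; the threshold and data are invented:

```python
def flag_drifting_items(old_pvalues, new_pvalues, threshold=0.10):
    """Return IDs of reused items whose proportion-correct changed by more
    than `threshold` between administrations. A sudden rise can signal
    item exposure; a steady fall can signal outdated content."""
    flagged = []
    for item_id, old_p in old_pvalues.items():
        new_p = new_pvalues.get(item_id)
        if new_p is not None and abs(new_p - old_p) > threshold:
            flagged.append(item_id)
    return flagged

# Hypothetical proportion-correct statistics from two administrations.
spring = {"ITEM_01": 0.62, "ITEM_02": 0.45, "ITEM_03": 0.71}
fall = {"ITEM_01": 0.64, "ITEM_02": 0.61, "ITEM_03": 0.70}
print(flag_drifting_items(spring, fall))   # ['ITEM_02'] rose by 0.16
```

A flagged item is a trigger for investigation rather than automatic removal; operational programs commonly run parallel checks on discrimination indices and on IRT parameter drift.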
The standards in this chapter begin with an over Cluster 1. Standards for Test
arching standard (numbered 4.0), which is designed Specifications
to convey the central intent or primary focus o f
the chapter. The overarching standard may also
be viewed as the guiding principle o f the chapter,
Standard 4.1
and is applicable to all tests and test users. All Test specifications should describe the purpose(s)
CHAPTER 4

subsequent standards have been separated into four thematic clusters labeled as follows:

1. Standards for Test Specifications
2. Standards for Item Development and Review
3. Standards for Developing Test Administration and Scoring Procedures and Materials
4. Standards for Test Revision

Standard 4.0

Tests and testing programs should be designed and developed in a way that supports the validity of interpretations of the test scores for their intended uses. Test developers and publishers should document steps taken during the design and development process to provide evidence of fairness, reliability, and validity for intended uses for individuals in the intended examinee population.

Comment: Specific standards for designing and developing tests in a way that supports intended uses are described below. Initial specifications for a test, intended to guide the development process, may be modified or expanded as development proceeds and new information becomes available. Both initial and final documentation of test specifications and development procedures provide a basis on which external experts and test users can judge the extent to which intended uses have been or are likely to be supported, leading to valid interpretations of test results for all individuals. Initial test specifications may be modified as evidence is collected during development and implementation of the test.

Standard 4.1

Test specifications should describe the purpose(s) of the test, the definition of the construct or domain measured, the intended examinee population, and interpretations for intended uses. The specifications should include a rationale supporting the interpretations and uses of test results for the intended purpose(s).

Comment: The adequacy and usefulness of test interpretations depend on the rigor with which the purpose(s) of the test and the domain represented by the test have been defined and explicated. The domain definition should be sufficiently detailed and delimited to show clearly what dimensions of knowledge, skills, cognitive processes, attitudes, values, emotions, or behaviors are included and what dimensions are excluded. A clear description will enhance accurate judgments by reviewers and others about the degree of congruence between the defined domain and the test items. Clear specification of the intended examinee population and its characteristics can help to guard against construct-irrelevant characteristics of item content and format. Specifications should include plans for collecting evidence of the validity of the intended interpretations of the test scores for their intended uses. Test developers should also identify potential limitations on test use or possible inappropriate uses.

Standard 4.2

In addition to describing intended uses of the test, the test specifications should define the content of the test, the proposed test length, the item formats, the desired psychometric properties of the test items and the test, and the ordering of items and sections. Test specifications should also specify the amount of time allowed for
testing; directions for the test takers; procedures to be used for test administration, including permissible variations; any materials to be used; and scoring and reporting procedures. Specifications for computer-based tests should include a description of any hardware and software requirements.

Comment: Professional judgment plays a major role in developing the test specifications. The specific procedures used for developing the specifications depend on the purpose(s) of the test. For example, in developing licensure and certification tests, practice analyses or job analyses usually provide the basis for defining the test specifications; job analyses alone usually serve this function for employment tests. For achievement tests given at the end of a course, the test specifications should be based on an outline of course content and goals. For placement tests, developers will examine the required entry-level knowledge and skills for different courses. In developing psychological tests, descriptions and diagnostic criteria of behavioral, mental, and emotional deficits and psychopathology inform test specifications.

The types of items, the response formats, the scoring procedures, and the test administration procedures should be selected based on the purpose(s) of the test, the domain to be measured, and the intended test takers. To the extent possible, test content and administration procedures should be chosen so that intended inferences from test scores are equally valid for all test takers. Some details of the test specifications may be revised on the basis of initial pilot or field tests. For example, specifications of the test length or mix of item types might be modified based on initial data to achieve desired precision of measurement.

Standard 4.3

Test developers should document the rationale and supporting evidence for the administration, scoring, and reporting rules used in computer-adaptive, multistage-adaptive, or other tests delivered using computer algorithms to select items. This documentation should include procedures used in selecting items or sets of items for administration, in determining the starting point and termination conditions for the test, in scoring the test, and in controlling item exposure.

Comment: If a computerized adaptive test is intended to measure a number of different content subcategories, item selection procedures should ensure that the subcategories are adequately represented by the items presented to the test taker. Common rationales for computerized adaptive tests are that score precision is increased, particularly for high- and low-scoring examinees, or that comparable precision is achieved while testing time is reduced. Note that these tests are subject to the same requirements for documenting the validity of score interpretations for their intended use as other types of tests. Test specifications should include plans to collect evidence required for such documentation.

Standard 4.4

If test developers prepare different versions of a test with some change to the test specifications, they should document the content and psychometric specifications of each version. The documentation should describe the impact of differences among versions on the validity of score interpretations for intended uses and on the precision and comparability of scores.

Comment: Test developers may have a number of reasons for creating different versions of a test, such as allowing different amounts of time for test administration by reducing or increasing the number of items on the original test, or allowing administration to different populations by translating test questions into different languages. Test developers should document the extent to which the specifications differ from those of the original test, provide a rationale for the different versions, and describe the implications of such differences for interpreting the scores derived from the different versions. Test developers and users should monitor and document any psychometric differences among versions of the test based on evidence collected during development and implementation. Evidence
TEST DESIGN AND DEVELOPMENT
(DIF) for major examinee groups should also be documented. When model-based methods (e.g., IRT) are used to estimate item parameters in test development, the item response model, estimation procedures, and evidence of model fit should be documented.

Comment: Although overall sample size is relevant, there should also be an adequate number of cases in regions critical to the determination of the psychometric properties of items. If the test is to achieve greatest precision in a particular part of the score scale and this consideration affects item selection, the manner in which item statistics are used for item selection needs to be carefully described. When IRT is used as the basis of test development, it is important to document the adequacy of fit of the model to the data. This is accomplished by providing information about the extent to which IRT assumptions (e.g., unidimensionality, local item independence, or, for certain models, equality of slope parameters) are satisfied.

Statistics used for flagging items that function differently for different groups should be described, including specification of the groups to be analyzed, the criteria for flagging, and the procedures for reviewing and making final decisions about flagged items. Sample sizes for groups of concern should be adequate for detecting meaningful DIF.

Test developers should consider how any differences between the administration conditions of the field test and the final form might affect item performance. Conditions that can affect item statistics include motivation of the test takers, item position, time limits, length of test, mode of testing (e.g., paper-and-pencil versus computer administered), and use of calculators or other tools.

Standard 4.11

Test developers should conduct cross-validation studies when items or tests are selected primarily on the basis of empirical relationships rather than on the basis of content or theoretical considerations. The extent to which the different studies show consistent results should be documented.

Comment: When data-based approaches to test development are used, items are selected primarily on the basis of their empirical relationships with an external criterion, their relationships with one another, or their power to discriminate among groups of individuals. Under these circumstances, it is likely that some items will be selected based on chance occurrences in the data used. Administering the test to a comparable sample of test takers or use of a separate validation sample provides independent verification of the relationships used in selecting items.

Statistical optimization techniques such as stepwise regression are sometimes used to develop test composites or to select tests for further use in a test battery. As with the empirical selection of items, capitalization on chance can occur. Cross-validation on an independent sample or the use of a formula that predicts the shrinkage of correlations in an independent sample may provide a less biased index of the predictive power of the tests or composite.

Standard 4.12

Test developers should document the extent to which the content domain of a test represents the domain defined in the test specifications.

Comment: Test developers should provide evidence of the extent to which the test items and scoring criteria yield scores that represent the defined domain. This affords a basis to help determine whether performance on the test can be generalized to the domain that is being assessed. This is especially important for tests that contain a small number of items, such as performance assessments. Such evidence may be provided by expert judges. In some situations, an independent study of the alignment of test questions to the content specifications is conducted to validate the developers' internal processing for ensuring appropriate content coverage.
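The DIF-flagging procedures described above are often operationalized with the Mantel-Haenszel statistic, reported on the ETS delta metric with an A/B/C severity classification. The sketch below is illustrative only; it is not part of the Standards, the data are hypothetical, and operational flagging rules also involve significance tests alongside the effect-size categories.

```python
# Illustrative sketch of Mantel-Haenszel DIF flagging (simplified; operational
# ETS rules also require statistical significance, not just effect size).
from math import log

def mantel_haenszel_delta(strata):
    """strata: one 2x2 table per matched score level, each a tuple
    (ref_correct, ref_incorrect, focal_correct, focal_incorrect)."""
    num = den = 0.0
    for a, b, c, d in strata:      # a, b = reference group; c, d = focal group
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n           # reference-correct * focal-incorrect
        den += b * c / n           # reference-incorrect * focal-correct
    alpha = num / den              # common odds ratio across score strata
    return -2.35 * log(alpha)      # ETS delta metric (0 = no DIF)

def ets_category(delta):
    """Simplified ETS A/B/C rule: A = negligible, B = moderate, C = large."""
    d = abs(delta)
    return "A" if d < 1.0 else ("B" if d < 1.5 else "C")

# Hypothetical item: reference and focal examinees matched on total score
# perform almost identically, so the item should land in category A.
strata = [(40, 10, 38, 12), (30, 20, 29, 21), (15, 35, 14, 36)]
delta = mantel_haenszel_delta(strata)
print(round(delta, 2), ets_category(delta))
```

A negative delta indicates the item is relatively harder for the focal group; items in category C are the ones typically routed to the review procedures the text describes.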
As another example, in directions for interest or occupational inventories, it may be important to specify whether test takers are to mark the activities they would prefer under ideal conditions or whether they are to consider both their opportunity and their ability realistically.

Instructions and any practice materials should be available in formats that can be accessed by all test takers. For example, if a braille version of the test is provided, the instructions and any practice materials should also be provided in a form that can be accessed by students who take the braille version.

The extent and nature of practice materials and directions depend on expected levels of knowledge among test takers. For example, in using a novel test format, it may be very important to provide the test taker with a practice opportunity as part of the test administration. In some testing situations, it may be important for the instructions to address such matters as time limits and the effects that guessing has on test scores. If expansion or elaboration of the test instructions is permitted, the conditions under which this may be done should be stated clearly in the form of general rules and by giving representative examples. If no expansion or elaboration is to be permitted, this should be stated explicitly. Test developers should include guidance for dealing with typical questions from test takers. Test administrators should be instructed on how to deal with questions that may arise during the testing period.

Standard 4.17

If a test or part of a test is intended for research use only and is not distributed for operational use, statements to that effect should be displayed prominently on all relevant test administration and interpretation materials that are provided to the test user.

Comment: This standard refers to tests that are intended for research use only. It does not refer to standard test development functions that occur prior to the operational use of a test (e.g., item and form tryouts). There may be legal requirements to inform participants of how the test developer will use the data generated from the test, including the user's personally identifiable information, how that information will be protected, and with whom it might be shared.

Standard 4.18

Procedures for scoring and, if relevant, scoring criteria, should be presented by the test developer with sufficient detail and clarity to maximize the accuracy of scoring. Instructions for using rating scales or for deriving scores obtained by coding, scaling, or classifying constructed responses should be clear. This is especially critical for extended-response items such as performance tasks, portfolios, and essays.

Comment: In scoring more complex responses, test developers must provide detailed rubrics and training in their use. Providing multiple examples of responses at each score level for use in training scorers and monitoring scoring consistency is also common practice, although these are typically added to scoring specifications during item development and tryouts. For monitoring scoring effectiveness, consistency criteria for qualifying scorers should be specified, as appropriate, along with procedures, such as double-scoring of some or all responses. As appropriate, test developers should specify selection criteria for scorers and procedures for training, qualifying, and monitoring scorers. If different groups of scorers are used with different administrations, procedures for checking the comparability of scores generated by the different groups should be specified and implemented.

Standard 4.19

When automated algorithms are to be used to score complex examinee responses, characteristics of responses at each score level should be documented along with the theoretical and empirical bases for the use of the algorithms.

Comment: Automated scoring algorithms should be supported by an articulation of the theoretical
and methodological bases for their use that is sufficiently detailed to establish a rationale for linking the resulting test scores to the underlying construct of interest. In addition, the automated scoring algorithm should have empirical research support, such as agreement rates with human scorers, prior to operational use, as well as evidence that the scoring algorithms do not introduce systematic bias against some subgroups.

Because automated scoring algorithms are often considered proprietary, their developers are rarely willing to reveal scoring and weighting rules in public documentation. Also, in some cases, full disclosure of details of the scoring algorithm might result in coaching strategies that would increase scores without any real change in the construct(s) being assessed. In such cases, developers should describe the general characteristics of scoring algorithms. They may also have the algorithms reviewed by independent experts, under conditions of nondisclosure, and collect independent judgments of the extent to which the resulting scores will accurately implement intended scoring rubrics and be free from bias for intended examinee subpopulations.

Standard 4.20

The process for selecting, training, qualifying, and monitoring scorers should be specified by the test developer. The training materials, such as the scoring rubrics and examples of test takers' responses that illustrate the levels on the rubric score scale, and the procedures for training scorers should result in a degree of accuracy and agreement among scorers that allows the scores to be interpreted as originally intended by the test developer. Specifications should also describe processes for assessing scorer consistency and potential drift over time in raters' scoring.

Comment: To the extent possible, scoring processes and materials should anticipate issues that may arise during scoring. Training materials should address any common misconceptions about the rubrics used to describe score levels. When written text is being scored, it is common to include a set of prescored responses for use in training and for judging scoring accuracy. The basis for determining scoring consistency (e.g., percentage of exact agreement, percentage within one score point, or some other index of agreement) should be indicated. Information on scoring consistency is essential to estimating the precision of resulting scores.

Standard 4.21

When test users are responsible for scoring and scoring requires scorer judgment, the test user is responsible for providing adequate training and instruction to the scorers and for examining scorer agreement and accuracy. The test developer should document the expected level of scorer agreement and accuracy and should provide as much technical guidance as possible to aid test users in satisfying this standard.

Comment: A common practice of test developers is to provide training materials (e.g., scoring rubrics, examples of test takers' responses at each score level) and procedures when scoring is done by test users and requires scorer judgment. Training provided to support local scoring should include standards for checking scorer accuracy during training and operational scoring. Training should also cover any special consideration for test-taker groups that might interact differently with the task to be scored.

Standard 4.22

Test developers should specify the procedures used to interpret test scores and, when appropriate, the normative or standardization samples or the criterion used.

Comment: Test specifications may indicate that the intended scores should be interpreted as indicating an absolute level of the construct being measured or as indicating standing on the construct relative to other examinees, or both. In absolute score interpretations, the score or average is assumed to reflect directly a level of competence or mastery in some defined criterion domain. In relative score interpretations the status of an in-
dividual (or group) is determined by comparing the score (or mean score) with the performance of others in one or more defined populations. Tests designed to facilitate one type of interpretation may function less effectively for the other type of interpretation. Given appropriate test design and adequate supporting data, however, scores arising from norm-referenced testing programs may provide reasonable absolute score interpretations, and scores arising from criterion-referenced programs may provide reasonable relative score interpretations.

Standard 4.23

When a test score is derived from the differential weighting of items or subscores, the test developer should document the rationale and process used to develop, review, and assign item weights. When the item weights are obtained based on empirical data, the sample used for obtaining item weights should be representative of the population for which the test is intended and large enough to provide accurate estimates of optimal weights. When the item weights are obtained based on expert judgment, the qualifications of the judges should be documented.

Comment: Changes in the population of test takers, along with other changes, for example in instructions, training, or job requirements, may affect the original derived item weights, necessitating subsequent studies. In many cases, content areas are weighted by specifying a different number of items from different areas. The rationale for weighting the different content areas should also be documented and periodically reviewed.

Cluster 4. Standards for Test Revision

Standard 4.24

Test specifications should be amended or revised when new research data, significant changes in the domain represented, or newly recommended conditions of test use may reduce the validity of test score interpretations. Although a test that remains useful need not be withdrawn or revised simply because of the passage of time, test developers and test publishers are responsible for monitoring changing conditions and for amending, revising, or withdrawing the test as indicated.

Comment: Test developers need to consider a number of factors that may warrant the revision of a test, including outdated test content and language, new evidence of relationships among measured or predicted constructs, or changes to test frameworks to reflect changes in curriculum, instruction, or job requirements. If an older version of a test is used when a newer version has been published or made available, test users are responsible for providing evidence that the older version is as appropriate as the new version for that particular test use.

Standard 4.25

When tests are revised, users should be informed of the changes to the specifications, of any adjustments made to the score scale, and of the degree of comparability of scores from the original and revised tests. Tests should be labeled as "revised" only when the test specifications have been updated in significant ways.

Comment: It is the test developer's responsibility to determine whether revisions to a test would influence test score interpretations. If test score interpretations would be affected by the revisions, it is appropriate to label the test "revised." When tests are revised, the nature of the revisions and their implications for test score interpretations should be documented. Examples of changes that require consideration include adding new areas of content, refining content descriptions, redistributing the emphasis across different content areas, and even just changing item format specifications. Note that creating a new test form using the same specifications is not considered a revision within the context of this standard.
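The consistency indices named in the comments to Standards 4.20 and 4.21, percentage of exact agreement and percentage of scores within one score point, can be computed directly from a double-scored sample of responses. The sketch below is illustrative only; it is not part of the Standards, and the ratings are hypothetical.

```python
# Illustrative sketch: two common scorer-consistency indices for a set of
# responses independently rated by two scorers on the same rubric scale.

def agreement_rates(scores_a, scores_b):
    """Return (exact, within_one) agreement proportions for two raters."""
    pairs = list(zip(scores_a, scores_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)          # identical scores
    within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)  # adjacent or better
    return exact, within_one

# Hypothetical double-scored essay ratings on a 1-6 rubric scale.
rater_1 = [4, 3, 5, 2, 6, 4, 3, 5]
rater_2 = [4, 4, 5, 2, 5, 4, 2, 5]
exact, within_one = agreement_rates(rater_1, rater_2)
print(f"exact: {exact:.2f}, within one point: {within_one:.2f}")
```

A program would compare these observed rates against the qualifying criteria the test developer documents, and route responses that disagree by more than one point to adjudication.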
5. SCORES, SCALES, NORMS, SCORE LINKING, AND CUT SCORES

BACKGROUND
or more comparison groups to draw useful inferences about the person's relative performance. Test score interpretations based on such comparisons are said to be norm referenced. Percentile rank norms, for example, indicate the standing of an individual or group within a defined population of individuals or groups. An example might be the percentile scores used in military enlistment testing, which compare each applicant's score with scores for the population of 18-to-23-year-old American youth. Percentiles, averages, or other statistics for such reference groups are called norms. By showing how the test score of a given examinee compares with those of others, norms assist in the classification or description of examinees.

Other test score interpretations make no direct reference to the performance of other examinees. These interpretations may take a variety of forms; most are collectively referred to as criterion-referenced interpretations. Scale scores supporting such interpretations may indicate the likely proportion of correct responses that would be obtained on some larger domain of similar items, or the probability that an examinee will answer particular sorts of items correctly. Other criterion-referenced interpretations may indicate the likelihood that some psychopathology is present. Still other criterion-referenced interpretations may indicate the probability that an examinee's level of tested knowledge or skill is adequate to perform successfully in some other setting. Scale scores to support such criterion-referenced score interpretations often are developed on the basis of statistical analyses of the relationships of test scores to other variables.

Some scale scores are developed primarily to support norm-referenced interpretations; others support criterion-referenced interpretations. In practice, however, there is not always a sharp distinction. Both criterion-referenced and norm-referenced scales may be developed and used with the same test scores if appropriate methods are used to validate each type of interpretation. Moreover, a norm-referenced score scale originally developed, for example, to indicate performance relative to some specific reference population might, over time, also come to support criterion-referenced interpretations. This could happen as research and experience bring increased understanding of the capabilities implied by different scale score levels. Conversely, results of an educational assessment might be reported on a scale consisting of several ordered proficiency levels, defined by descriptions of the kinds of tasks students at each level are able to perform. That would be a criterion-referenced scale, but once the distribution of scores over levels is reported, say, for all eighth-grade students in a given state, individual students' scores will also convey information about their standing relative to that tested population.

Interpretations based on cut scores may likewise be either criterion referenced or norm referenced. If qualitatively different descriptions are attached to successive score ranges, a criterion-referenced interpretation is supported. For example, the descriptions of proficiency levels in some assessment task-scoring rubrics can enhance score interpretation by summarizing the capabilities that must be demonstrated to merit a given score. In other cases, criterion-referenced interpretations may be based on empirically determined relationships between test scores and other variables. But when tests are used for selection, it may be appropriate to rank-order examinees according to their test performance and establish a cut score so as to select a prespecified number or proportion of examinees from one end of the distribution, provided the selection use is sufficiently supported by relevant reliability and validity evidence to support rank ordering. In such cases, the cut score interpretation is norm referenced; the labels "reject" or "fail" versus "accept" or "pass" are determined primarily by an examinee's standing relative to others tested in the current selection process.

Criterion-referenced interpretations based on cut scores are sometimes criticized on the grounds that there is rarely a sharp distinction between those just below and those just above a cut score. A neuropsychological test may be helpful in diagnosing some particular impairment, for example, but the probability that the impairment is present is likely to increase continuously as a function of the test score rather than to change sharply at a
particular score. Cut scores may aid in formulating rules for reaching decisions on the basis of test performance. It should be recognized, however, that the likelihood of misclassification will generally be relatively high for persons with scores close to the cut scores.

Norms

The validity of norm-referenced interpretations depends in part on the appropriateness of the reference group to which test scores are compared. Norms based on hospitalized patients, for example, might be inappropriate for some interpretations of nonhospitalized patients' scores. Thus, it is important that reference populations be carefully defined and clearly described. Validity of such interpretations also depends on the accuracy with which norms summarize the performance of the reference population. That population may be small enough that essentially the entire population can be tested (e.g., all test takers at a given grade level in a given district tested on the same occasion). Often, however, only a sample of examinees from the reference population is tested. It is then important that the norms be based on a technically sound, representative sample of test takers of sufficient size. Patients in a few hospitals in a small geographic region are unlikely to be representative of all patients in the United States, for example. Moreover, the usefulness of norms based on a given sample may diminish over time. Thus, for tests that have been in use for a number of years, periodic review is generally required to ensure the continued utility of their norms. Renorming may be required to maintain the validity of norm-referenced test score interpretations.

More than one reference population may be appropriate for the same test. For example, achievement test performance might be interpreted by reference to local norms based on sampling from a particular school district for use in making local instructional decisions, or to norms for a state or type of community for use in interpreting statewide testing results, or to national norms for use in making comparisons with national groups. For other tests, norms might be based on occupational or educational classifications. Descriptive statistics for all examinees who happen to be tested during a given period of time (sometimes called user norms or program norms) may be useful for some purposes, such as describing trends over time. But there must be a sound reason to regard that group of test takers as an appropriate basis for such inferences. When there is a suitable rationale for using such a group, the descriptive statistics should be clearly characterized as being based on a sample of persons routinely tested as part of an ongoing program.

Score Linking

Score linking is a general term that refers to relating scores from different tests or test forms. When different forms of a test are constructed to the same content and statistical specifications and administered under the same conditions, they are referred to as alternate forms or sometimes parallel or equivalent forms. The process of placing raw scores from such alternate forms on a common scale is referred to as equating. Equating involves small statistical adjustments to account for minor differences in the difficulty of the alternate forms. After equating, alternate forms of the same test yield scale scores that can be used interchangeably even though they are based on different sets of items. In many testing programs that administer tests multiple times, concerns with test security may be raised if the same form is used repeatedly. In other testing programs, the same test takers may be measured repeatedly, perhaps to measure change in levels of psychological dysfunction, attitudes, or educational achievement. In these cases, reusing the same test items may result in biased estimates of change. Score equating allows for the use of alternate forms, thereby avoiding these concerns.

Although alternate forms are built to the same content and statistical specifications, differences in test difficulty will occur, creating the need for equating. One approach to equating involves administering the forms to be equated to the same sample of examinees or to equivalent samples. Another approach involves administering a common
set of items, referred to as anchor items, to the samples taking each form. Each approach has unique strengths, but also involves assumptions that could influence the equating results, and so these assumptions must be checked. Choosing among equating approaches may include the following considerations:

• Administering forms to the same sample allows for an estimate of the correlation between the scores on the two forms, as well as providing data needed to adjust for differences in difficulty. However, there could be order effects related to practice or fatigue that may affect the score distribution for the form administered second.

• Administering alternate forms to equivalent samples, usually through random assignment, avoids any order effects but does not provide a direct estimate of the correlation between the scores; other methods are needed to demonstrate that the two forms measure the same construct.

• Embedding a set of anchor items in each of the forms being equated provides a basis for adjusting for differences in the samples of examinees taking each form. The anchor items should cover the same content and difficulty range as each of the full forms being equated so that differences on the anchor items will accurately reflect differences on the full forms. Also, anchor item position and other context factors should be the same in both forms. It is important to check that the anchor items function similarly in the forms being equated. Anchor items are often dropped from the anchor if their relative difficulty is substantially different in the forms being equated.

• Sometimes an external anchor test is used in which the anchor items are administered in a separate section and do not contribute to the total score on the test. This approach eliminates some context factors as the presentation of the anchor items is identical for each examinee taking the operational forms being equated. Both embedded and external anchor test designs involve strong statistical assumptions regarding the equivalence of the anchor and the forms being equated. These assumptions are particularly critical when the samples of examinees taking the different forms vary considerably on the construct being measured.

When claiming that scores on test forms are equated, it is important to document how the forms are built to the same content and statistical specifications and to demonstrate that scores on the alternate forms are measures of the same construct and have similar reliability. Equating should provide accurate score conversions for any set of persons drawn from the examinee population for which the test is designed; hence the stability of conversions across relevant subgroups should be documented. Whenever possible, the definitions of important examinee populations should include groups for which fairness may be a particular issue, such as examinees with disabilities or from diverse linguistic and cultural backgrounds. When sample sizes permit, it is important to examine the stability of equating conversions across these populations.

The increased use of tests delivered by computer raises special considerations for equating and linking because more flexible models for delivering tests become possible. These include adaptive testing as well as approaches where unique items or multiple intact sets of items are selected from a larger pool of available items. It has long been recognized that little is learned from examinees' responses to items that are much too easy or much too difficult for them. Consequently, some testing procedures use only a subset of the available items with each examinee. An adaptive test consists of a pool of items together with rules for selecting a subset of those items to be administered to an individual examinee and a procedure for placing different examinees' scores on a common scale. The selection of successive items is based in part on the examinees' responses to previous items.
sample. Again, however, the anchor test m ust The item pool and item selection rules may be
reflect the content and difficulty o f the opera designed so that each examinee receives a repre-
SCORES, SCALES, NO R M S, SCORF. LINK ING, AND CUT SCORES
sentativc set o f items o f appropriate difficulty. scales that span a broad range o f developmental
W ith som e adaptive tests, it friay happen that two or educational levels. The development o f ver
examinees rarely i f ever receive the same set o f tical scales typically requires linking o f tests
items. Moreover, two examinees taking the same that are purposefully constructed to differ in
adaptive test m ay be given sets o f items that differ difficulty.
markedly in difficulty. Nevertheless, adaptive test
* Test revision often brings a need to link scores
scores can be reported on a common scale and
obtained using newer and older test specifications.
function much like scores from a single alternate
form o f a test that is not adaptive. ° International comparative studies may require
O ften, the adaptation o f the test is done item linking o f scores on tests given in different
by item. In other situations, such as in multistage languages.
testing, the exam process may branch from choos
0 Scores may be linked on tests measuring dif
ing am ong sets o f items that are broadly repre
ferent constructs, perhaps comparing an aptitude
sentative o f content and difficulty to choosing
with a form o f behavior, or linking measures
am ong sets o f items that are targeted explicitly
o f achievement in several content areas or
for a higher or lower level o f the construct being
across different test publishers.
measured, based on an interim evaluation o f ex
aminee performance. ° Sometimes linkings are made to compare per
In many situations, item pools for adaptive formance o f groups (e.g., school districts,
tests are updated by replacing some o f the items states) on different measures o f similar con
in the pool with new items. In other cases, entire structs, such as when linking scores on a state
pools o f items are replaced. In either case, statistical achievement test to scores on an international
procedures are used to link item param eter assessment.
estimates for the new items to the existing IR T
scale so that scores from alternate pools can be ° Results from linking studies are sometimes
used interchangeably, in much the same way that aligned or presented in a concordance table to
scores on alternate forms o f tests are used when aid users in estimating performance on one
scores on the alternate forms are equated. To test from performance on another.
support comparability o f scores on adaptive tests • In situations where complex item types are
across pools, it is necessary to construct the pools used, score linking is sometimes conducted
to the sam e explicit content and statistical speci through judgm ents about the comparability
fications and administer them under the same o f item content from one test to another. For
conditions. M ost often, a common-item design example, writing prom pts built to be similar,
is used in linking parameter estimates for the where responses are scored using a common
new items to the IR T scale used for adaptive rubric, might be assumed to be equivalent in
testing. In such cases, stability checks should be difficulty. When possible, these linkings should
m ade on the statistical characteristics o f the com be checked empirically.
m on items, and the num ber o f com m on items
should be sufficient to yield stable results. The ® In some situations, judgmental m ethods are
adequacy o f the assum ptions needed to link used to link scores across tests. In these situa
scores across pools should be checked. tions, the judgment processes and their reliability
M any other examples o f linking exist that should be well documented and the rationale
m ay not result in interchangeable scores, including for their use should be clear.
the following:
Processes used to facilitate comparisons may
e For the evaluation o f examinee growth over be described with terms such as linking, calibration,
time, it may be desirable to develop vertical concordance, vertical scaling, projection, or moderation.
These processes may be technically sound and may fully satisfy desired goals of comparability for one purpose or for one relevant subgroup of examinees, but they cannot be assumed to be stable over time or invariant across multiple subgroups of the examinee population, nor is there any assurance that scores obtained using different tests will be equally precise. Thus, their use for other purposes or with other populations than the originally intended population may require additional support. For example, a score conversion that was accurate for a group of native speakers might systematically overpredict or underpredict the scores of a group of nonnative speakers.

Cut Scores

A critical step in the development and use of some tests is to establish one or more cut scores dividing the score range to partition the distribution of scores into categories. These categories may be used just for descriptive purposes or may be used to distinguish among examinees for whom different programs are deemed desirable or different predictions are warranted. An employer may determine a cut score to screen potential employees or to promote current employees; proficiency levels of “basic,” “proficient,” and “advanced” may be established using standard-setting methods to set cut scores on a state test of mathematics achievement in fourth grade; educators may want to use test scores to identify students who are prepared to go on to college and take credit-bearing courses; or in granting a professional license, a state may specify a minimum passing score on a licensure test.

These examples differ in important respects, but all involve delineating categories of examinees on the basis of test scores. Such cut scores provide the basis for using and interpreting test results. Thus, in some situations, the validity of test score interpretations may hinge on the cut scores. There can be no single method for determining cut scores for all tests or for all purposes, nor can there be any single set of procedures for establishing their defensibility. In addition, although cut scores are helpful for informing selection, placement, and other classifications, it should be acknowledged that such categorical decisions are rarely made on the basis of test performance alone. The examples that follow serve only as illustrations.

The first example, that of an employer interviewing all those who earn scores above a given level on an employment test, is the most straightforward. Assuming that validity evidence has been provided for scores on the employment test for its intended use, average job performance typically would be expected to rise steadily, albeit slowly, with each increment in test score, at least for some range of scores surrounding the cut score. In such a case the designation of the particular value for the cut score may be largely determined by the number of persons to be interviewed or further screened.

In the second example, a state department of education establishes content standards for what fourth-grade students are to learn in mathematics and implements a test for assessing student achievement on these standards. Using a structured, judgmental standard-setting process, committees of subject matter experts develop or elaborate on performance-level descriptors (sometimes referred to as achievement-level descriptors) that indicate what students at achievement levels of “basic,” “proficient,” and “advanced” should know and be able to do in fourth-grade mathematics. In addition, committees examine test items and student performance to recommend cut scores that are used to assign students to each achievement level based on their test performance. The final decision about the cut scores is a policy decision typically made by a policy body such as the board of education for the state.

In the third example, educators wish to use test scores to identify students who are prepared to go on to college and take credit-bearing courses. Cut scores might initially be identified based on judgments about requirements for taking credit-bearing courses across a range of colleges. Alternatively, judgments about individual students might be collected and then used to find a score level that most effectively differentiates those judged to be prepared from those judged not to be. In such cases, judges must be familiar with
both the college course requirements and the students themselves. Where possible, initial judgments could be followed up with longitudinal data indicating whether former examinees did or did not have to take remedial courses.

In the final example, that of a professional licensure examination, the cut score represents an informed judgment that those scoring below it are at risk of making serious errors because they lack the knowledge or skills tested. No test is perfect, of course, and regardless of the cut score chosen, some examinees with inadequate skills are likely to pass, and some with adequate skills are likely to fail. The relative probabilities of such false positive and false negative errors will vary depending on the cut score chosen. A given probability of exposing the public to potential harm by issuing a license to an incompetent individual (false positive) must be weighed against some corresponding probability of denying a license to, and thereby disenfranchising, a qualified examinee (false negative). Changing the cut score to reduce either probability will increase the other, although both kinds of errors can be minimized through sound test design that anticipates the role of the cut score in test use and interpretation. Determining cut scores in such situations cannot be a purely technical matter, although empirical studies and statistical models can be of great value in informing the process.

Cut scores embody value judgments as well as technical and empirical considerations. Where the results of the standard-setting process have highly significant consequences, those involved in the standard-setting process should be concerned that the process by which cut scores are determined be clearly documented and that it be defensible. When standard-setting involves judges or subject matter experts, their qualifications and the process by which they were selected are part of that documentation. Care must be taken to ensure that these persons understand what they are to do and that their judgments are as thoughtful and objective as possible. The process must be such that well-qualified participants can apply their knowledge and experience to reach meaningful and relevant judgments that accurately reflect their understandings and intentions. A sufficiently large and representative group of participants should be involved to provide reasonable assurance that the expert ratings across judges are sufficiently reliable and that the results of the judgments would not vary greatly if the process were replicated.
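The tradeoff described above (lowering a cut score reduces false negatives while raising false positives, and vice versa) can be made concrete with a small sketch. The score distributions and all numbers below are hypothetical illustrations, not part of the Standards.

```python
from statistics import NormalDist

# Hypothetical score distributions (illustrative only): qualified
# examinees tend to score higher than unqualified examinees.
qualified = NormalDist(mu=75, sigma=8)
unqualified = NormalDist(mu=55, sigma=8)

def error_rates(cut: float) -> tuple:
    """Return (false_positive, false_negative) rates at a cut score.

    False positive: an unqualified examinee scores at or above the cut.
    False negative: a qualified examinee scores below the cut.
    """
    false_positive = 1.0 - unqualified.cdf(cut)
    false_negative = qualified.cdf(cut)
    return false_positive, false_negative

# Raising the cut trades false positives for false negatives.
for cut in (60, 65, 70):
    fp, fn = error_rates(cut)
    print(f"cut={cut}: false positive={fp:.3f}, false negative={fn:.3f}")
```

Under any such model, moving the cut in one direction shrinks one error rate only by inflating the other; reducing both requires better measurement precision near the cut, which is a matter of test design.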
The standards in this chapter begin with an overarching standard (numbered 5.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into four thematic clusters labeled as follows:

1. Interpretations of Scores
2. Norms
3. Score Linking
4. Cut Scores

Standard 5.0

Test scores should be derived in a way that supports the interpretations of test scores for the proposed uses of tests. Test developers and users should document evidence of fairness, reliability, and validity of test scores for their proposed use.

Comment: Specific standards for various uses and interpretations of test scores and score scales are described below. These include standards for norm-referenced and criterion-referenced interpretations, interpretations of cut scores, interchangeability of scores on alternate forms following equating, and score comparability following the use of other procedures for score linking. Documentation supporting such interpretations provides a basis for external experts and test users to judge the extent to which the interpretations are likely to be supported and can lead to valid interpretations of scores for all individuals in the intended examinee population.

Cluster 1. Interpretations of Scores

Standard 5.1

Test users should be provided with clear explanations of the characteristics, meaning, and intended interpretation of scale scores, as well as their limitations.

Comment: Illustrations of appropriate and inappropriate interpretations may be helpful, especially for types of scales or interpretations that are unfamiliar to most users. This standard pertains to score scales intended for criterion-referenced as well as norm-referenced interpretations. All scores (raw scores or scale scores) may be subject to misinterpretation. If the nature or intended uses of a scale are novel, it is especially important that its uses, interpretations, and limitations be clearly described.

Standard 5.2

The procedures for constructing scales used for reporting scores and the rationale for these procedures should be described clearly.

Comment: When scales, norms, or other interpretive systems are provided by the test developer, technical documentation should describe their rationale and enable users to judge the quality and precision of the resulting scale scores. For example, the test developer should describe any normative, content, or score precision information that is incorporated into the scale and provide a rationale for the number of score points that are used. This standard pertains to score scales intended for criterion-referenced as well as norm-referenced interpretations.

Standard 5.3

If there is sound reason to believe that specific misinterpretations of a score scale are likely, test users should be explicitly cautioned.
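As a hypothetical sketch of the kind of scale-construction procedure whose rationale Standard 5.2 asks developers to document, a program might report scores on a fixed-range scale derived from raw scores by a linear transformation. The 200-800 range and the 50-item test below are invented for illustration, not prescribed by the Standards.

```python
def raw_to_scale(raw: int, raw_max: int,
                 scale_min: int = 200, scale_max: int = 800) -> int:
    """Linearly map a raw score in [0, raw_max] onto a reported score
    scale, rounding to whole scale points.

    The endpoints and the number of distinct score points are policy
    choices; their rationale should be documented for users.
    """
    if not 0 <= raw <= raw_max:
        raise ValueError("raw score out of range")
    fraction = raw / raw_max
    return round(scale_min + fraction * (scale_max - scale_min))

# A hypothetical 50-item test reported on a 200-800 scale:
print(raw_to_scale(0, 50))   # 200
print(raw_to_scale(25, 50))  # 500
print(raw_to_scale(50, 50))  # 800
```

Documenting even a simple rule like this lets users see how many distinct score points exist and why, which is part of judging the quality and precision of the resulting scale scores.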
defined as the mean of some reference population should no longer be interpreted as representing average performance if the scale is held constant over time and the examinee population changes. Similarly, caution is needed if score meanings may vary for some test takers, such as the meaning of achievement scores for students who have not had adequate opportunity to learn the material covered by the test.

Standard 5.4

When raw scores are intended to be directly interpretable, their meanings, intended interpretations, and limitations should be described and justified in the same manner as is done for scale scores.

Comment: In some cases the items in a test are a representative sample of a well-defined domain of items with regard to both content and item difficulty. The proportion answered correctly on the test may then be interpreted as an estimate of the proportion of items in the domain that could be answered correctly. In other cases, different interpretations may be attached to scores above or below a particular cut score. Support should be offered for any such interpretations recommended by the test developer.

refer to the absolute levels of test scores or to patterns of scores for an individual examinee. Whenever the test developer recommends such interpretations, the rationale and empirical basis should be presented clearly. Serious efforts should be made whenever possible to obtain independent evidence concerning the soundness of such score interpretations.

Standard 5.6

Testing programs that attempt to maintain a common scale over time should conduct periodic checks of the stability of the scale on which scores are reported.

Comment: The frequency of such checks depends on various characteristics of the testing program. In some testing programs, items are introduced into and retired from item pools on an ongoing basis. In other cases, the items in successive test forms may overlap very little, or not at all. In either case, if a fixed scale is used for reporting, it is important to ensure that the meaning of the scale scores does not change over time. When scales are based on the subsequent application of precalibrated item parameter estimates using item response theory, periodic analyses of item parameter stability should be routinely undertaken.
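A periodic stability check of the kind the comment to Standard 5.6 calls for can be as simple as comparing item difficulty estimates across calibrations and flagging drift. The items, values, and threshold below are invented for illustration.

```python
# Hypothetical IRT difficulty (b) estimates for the same items
# calibrated at two time points; all values are invented.
old_b = {"item1": -0.52, "item2": 0.10, "item3": 1.35, "item4": 0.80}
new_b = {"item1": -0.48, "item2": 0.65, "item3": 1.30, "item4": 0.83}

def flag_drift(old, new, threshold=0.3):
    """Return items whose difficulty estimate shifted by more than
    `threshold` between calibrations, a simple screen for item
    parameter drift. The threshold is a judgment call that should
    be justified for the testing program at hand."""
    return [item for item in old
            if item in new and abs(new[item] - old[item]) > threshold]

print(flag_drift(old_b, new_b))  # flags item2 (shift of 0.55)
```

Operational programs would use more refined drift statistics, but even a simple screen like this, run routinely, helps ensure that the meaning of scale scores does not quietly change over time.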
diverse linguistic and cultural backgrounds. A test may be translated into braille so that it is accessible to individuals who are blind, or the testing procedure may be changed to include extra time for certain groups of examinees. These changes may or may not have an effect on the underlying constructs that are measured by the test and, consequently, on the score conversions used with the test. If scores on the changed test will be compared with scores on the original test, the test developer should provide empirical evidence of the comparability of scores on the changed and original test whenever sample sizes are sufficiently large to provide this type of evidence.

Cluster 2. Norms

Standard 5.8

Norms, if used, should refer to clearly described populations. These populations should include individuals or groups with whom test users will ordinarily wish to compare their own examinees.

Comment: It is the responsibility of test developers to describe norms clearly and the responsibility of test users to use norms appropriately. Users need to know the applicability of a test to different groups. Differentiated norms or summary information about differences between gender, racial/ethnic, language, disability, grade, or age groups, for example, may be useful in some cases. The permissible uses of such differentiated norms and related information may be limited by law. Users also need to be alerted to situations in which norms are less appropriate for some groups or individuals than others. On an occupational interest inventory, for example, norms for persons actually engaged in an occupation may be inappropriate for interpreting the scores of persons not so engaged.

Standard 5.9

Reports of norming studies should include precise specification of the population that was sampled, sampling procedures and participation rates, any weighting of the sample, the dates of testing, and descriptive statistics. Technical documentation should indicate the precision of the norms themselves.

Comment: The information provided should be sufficient to enable users to judge the appropriateness of the norms for interpreting the scores of local examinees. The information should be presented so as to comply with applicable legal requirements and professional standards relating to privacy and data security.

Standard 5.10

When norms are used to characterize examinee groups, the statistics used to summarize each group's performance and the norms to which those statistics are referred should be defined clearly and should support the intended use or interpretation.

Comment: It is not possible to determine the percentile rank of a school's average test score if all that is known is the percentile rank of each of that school's students. It may sometimes be useful to develop special norms for group means, but when the sizes of the groups differ materially or when some groups are much more heterogeneous than others, the construction and interpretation of group norms is problematic. One common and acceptable procedure is to report the percentile rank of the median group member, for example, the median percentile rank of the pupils tested in a given school.

Standard 5.11

If a test publisher provides norms for use in test score interpretation, then as long as the test remains in print, it is the test publisher's responsibility to renorm the test with sufficient frequency to permit continued accurate and appropriate score interpretations.

Comment: Test publishers should ensure that up-to-date norms are readily available or provide evidence that older norms are still appropriate. However, it remains the test user's responsibility
to avoid inappropriate use of norms that are out of date and to strive to ensure accurate and appropriate score interpretations.

Cluster 3. Score Linking

Standard 5.12

A clear rationale and supporting evidence should be provided for any claim that scale scores earned on alternate forms of a test may be used interchangeably.

Comment: For scores on alternate forms to be used interchangeably, the alternate forms must be built to common detailed content and statistical specifications. Adequate data should be collected and appropriate statistical methodology should be applied to conduct the equating of scores on alternate test forms. The quality of the equating should be evaluated to assess whether the resulting scale scores on the alternate forms can be used interchangeably.

Standard 5.13

When claims of form-to-form score equivalence are based on equating procedures, detailed technical information should be provided on the method by which equating functions were established and on the accuracy of the equating functions.

Comment: Evidence should be provided to show that equated scores on alternate forms measure essentially the same construct with very similar levels of reliability and conditional standard errors of measurement and that the results are appropriate for relevant subgroups. Technical information should include the design of the equating study, the statistical methods used, the size and relevant characteristics of examinee samples used in equating studies, and the characteristics of any anchor tests or anchor items. For tests for which equating is conducted prior to operational use (i.e., pre-equating), documentation of the item calibration process should be provided and the adequacy of the equating functions should be evaluated following operational administration. When equivalent forms of computer-based tests are constructed dynamically, the algorithms used should be documented and the technical characteristics of alternate forms should be evaluated based on simulation and/or analysis of administration data. Standard errors of equating functions should be estimated and reported whenever possible. Sample sizes permitting, it may be informative to assess whether equating functions developed for relevant subgroups of examinees are similar. It may also be informative to use two or more anchor forms and to conduct the equating using each of the anchors. To be most useful, equating error should be presented in units of the reported score scale. For testing programs with cut scores, equating error near the cut score is of primary importance.

Standard 5.14

In equating studies that rely on the statistical equivalence of examinee groups receiving different forms, methods of establishing such equivalence should be described in detail.

Comment: Certain equating designs rely on the random equivalence of groups receiving different forms. Often, one way to ensure such equivalence is to mix systematically different test forms and then distribute them in a random fashion so that roughly equal numbers of examinees receive each form. Because administration designs intended to yield equivalent groups are not always adhered to in practice, the equivalence of groups should be evaluated statistically.

Standard 5.15

In equating studies that employ an anchor test design, the characteristics of the anchor test and its similarity to the forms being equated should be presented, including both content specifications and empirically determined relationships among test scores. If anchor items are used in the equating study, the representativeness and psychometric characteristics of the anchor items should be presented.
and the quality of the linking. Technical information about the linking should include, as appropriate, the reliability of the sets of scores being linked, the correlation between test scores, an assessment of content similarity, the conditions of measurement for each test, the data collection design, the statistical methods used, the standard errors of the linking function, evaluations of sampling stability, and assessments of score comparability.

Standard 5.19

When tests are created by taking a subset of the items in an existing test or by rearranging items, evidence should be provided that there are no distortions of scale scores, cut scores, or norms for the different versions or for score linkings between them.

Comment: Some tests and test batteries are published in both a full-length version and a survey or short version. In other cases, multiple versions of a single test form may be created by rearranging its items. It should not be assumed that performance data derived from the administration of items as part of the initial version can be used to compute scale scores, compute linked scores, construct conversion tables, approximate norms, or approximate cut scores for alternative intact tests. Caution is required in cases where context effects are likely, including speeded tests, long tests where fatigue may be a factor, adaptive tests, and tests developed from calibrated item pools. Options for gathering evidence related to context effects might include examinations of model-data fit, operational recalibrations of item parameter estimates initially derived using pretest data, and comparisons of performance on original and revised test forms as administered to randomly equivalent groups.

Standard 5.20

If test specifications are changed from one version of a test to a subsequent version, such changes should be identified, and an indication should be given that converted scores for the two versions may not be strictly equivalent, even when statistical procedures have been used to link scores from the different versions. When substantial changes in test specifications occur, scores should be reported on a new scale, or a clear statement should be provided to alert users that the scores are not directly comparable with those on earlier versions of the test.

Comment: Major shifts sometimes occur in the specifications of tests that are used for substantial periods of time. Often such changes take advantage of improvements in item types or shifts in content that have been shown to improve validity and therefore are highly desirable. It is important to recognize, however, that such shifts will result in scores that cannot be made strictly interchangeable with scores on an earlier form of the test, even when statistical linking procedures are used. To assess score comparability, it is advisable to evaluate the relationship between scores on the old and new versions.

Cluster 4. Cut Scores

Standard 5.21

When proposed score interpretations involve one or more cut scores, the rationale and procedures used for establishing cut scores should be documented clearly.

Comment: Cut scores may be established to select a specified number of examinees (e.g., to identify a fixed number of job applicants for further screening), in which case little further documentation may be needed concerning the specific question of how the cut scores are established, although attention should be paid to the rationale for using the test in selection and the precision of comparisons among examinees. In other cases, however, cut scores may be used to classify examinees into distinct categories (e.g., diagnostic categories, proficiency levels, or passing versus failing) for which there are no pre-established quotas. In these cases, the standard-setting method must be documented in more detail. Ideally, the role of cut scores in test use and interpretation is taken into account during
test design. Adequate precision in regions of score scales where cut scores are established is prerequisite to reliable classification of examinees into categories. If standard setting employs data on the score distributions for criterion groups or on the relation of test scores to one or more criterion variables, those data should be summarized in technical documentation. If a judgmental standard-setting process is followed, the method employed should be described clearly, and the precise nature and reliability of the judgments called for should be presented, whether those are judgments of persons, of item or test performances, or of other criterion performances predicted by test scores. Documentation should also include the selection and qualifications of standard-setting panel participants, training provided, any feedback to participants concerning the implications of their provisional judgments, and any opportunities for participants to confer with one another. Where applicable, variability over participants should be reported. Whenever feasible, an estimate should be provided of the amount of variation in cut scores that might be expected if the standard-setting procedure were replicated with a comparable standard-setting panel.

Standard 5.22

When cut scores defining pass-fail or proficiency levels are based on direct judgments about the adequacy of item or test performances, the judgmental process should be designed so that the participants providing the judgments can bring their knowledge and experience to bear in a reasonable way.

Comment: Cut scores are sometimes based on judgments about the adequacy of item or test performances (e.g., essay responses to a writing prompt) or proficiency expectations (e.g., the scale score that would characterize a borderline examinee). The procedures used to elicit such judgments should result in reasonable, defensible proficiency standards that accurately reflect the standard-setting participants' values and intentions. Reaching such judgments may be most straightforward when participants are asked to consider kinds of performances with which they are familiar and for which they have formed clear conceptions of adequacy or quality. When the responses elicited by a test neither sample nor closely simulate the use of tested knowledge or skills in the actual criterion domain, participants are not likely to approach the task with such clear understandings of adequacy or quality. Special care must then be taken to ensure that participants have a sound basis for making the judgments requested. Thorough familiarity with descriptions of different proficiency levels, practice in judging task difficulty with feedback on accuracy, the experience of actually taking a form of the test, feedback on the pass rates entailed by provisional proficiency standards, and other forms of information may be beneficial in helping participants to reach sound and principled decisions.

Standard 5.23

When feasible and appropriate, cut scores defining categories with distinct substantive interpretations should be informed by sound empirical data concerning the relation of test performance to the relevant criteria.

Comment: In employment settings where it has been established that test scores are related to job performance, the precise relation of test and criterion may have little bearing on the choice of a cut score, if the choice is based on the need for a predetermined number of candidates. However, in contexts where distinct interpretations are applied to different score categories, the empirical relation of test to criterion assumes greater importance. For example, if a cut score is to be set on a high school mathematics test indicating readiness for college-level mathematics instruction, it may be desirable to collect empirical data establishing a relationship between test scores and grades obtained in relevant college courses. Cut scores used in interpreting diagnostic tests may be established on the basis of empirically determined score distributions for criterion groups. With many achievement or proficiency tests, such as
those used in credentialing, suitable criterion groups (e.g., successful versus unsuccessful practitioners) are often unavailable. Nevertheless, when appropriate and feasible, the test developer should investigate and report the relation between test scores and performance in relevant practical settings. Professional judgment is required to determine an appropriate standard-setting approach (or combination of approaches) in any given situation. In general, one would not expect to find a sharp difference in levels of the criterion variable between those just below and those just above the cut score, but evidence should be provided, where feasible, of a relationship between test and criterion performance over a score interval that includes or approaches the cut score.
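The panel-replication variability that the comment on standard setting asks developers to estimate can be approximated by resampling panelists. The sketch below is illustrative only: the Angoff-style ratings are invented, and a simple panelist bootstrap is one possible estimator, not a prescribed method.

```python
import random
import statistics

def angoff_cut_score(panelist_ratings):
    """Cut score = mean over panelists of their summed item ratings."""
    return statistics.mean(sum(items) for items in panelist_ratings)

def bootstrap_cut_score_se(panelist_ratings, n_boot=2000, seed=0):
    """Approximate the variation in the cut score expected if the
    procedure were replicated with a comparable panel, by resampling
    panelists with replacement."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        sample = [rng.choice(panelist_ratings) for _ in panelist_ratings]
        replicates.append(angoff_cut_score(sample))
    return statistics.stdev(replicates)

# Hypothetical data: 5 panelists x 4 items; each rating is the judged
# probability that a minimally competent examinee answers correctly.
ratings = [
    [0.6, 0.7, 0.5, 0.8],
    [0.5, 0.6, 0.6, 0.7],
    [0.7, 0.8, 0.6, 0.9],
    [0.4, 0.6, 0.5, 0.7],
    [0.6, 0.7, 0.7, 0.8],
]
print(round(angoff_cut_score(ratings), 2))
print(round(bootstrap_cut_score_se(ratings), 3))
```

Reporting the resampled standard error alongside the cut score is one way to document the "variation ... expected if the standard-setting procedure were replicated."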
6. TEST ADMINISTRATION, SCORING,
REPORTING, AND INTERPRETATION
BACKGROUND
The usefulness and interpretability of test scores require that a test be administered and scored according to the test developer's instructions. When directions, testing conditions, and scoring follow the same detailed procedures for all test takers, the test is said to be standardized. Without such standardization, the accuracy and comparability of score interpretations would be reduced. For tests designed to assess the test taker's knowledge, skills, abilities, or other personal characteristics, standardization helps to ensure that all test takers have the same opportunity to demonstrate their competencies. Maintaining test security also helps ensure that no one has an unfair advantage. The importance of adherence to appropriate standardization of administration procedures increases with the stakes of the test.

Sometimes, however, situations arise in which variations from standardized procedures may be advisable or legally mandated. For example, individuals with disabilities and persons of different linguistic backgrounds, ages, or familiarity with testing may need nonstandard modes of test administration or a more comprehensive orientation to the testing process, so that all test takers can have an unobstructed opportunity to demonstrate their standing on the construct(s) being measured. Different modes of presenting the test or its instructions, or of responding, may be suitable for specific individuals, such as persons with some kinds of disability, or persons with limited proficiency in the language of the test, in order to provide appropriate access to reduce construct-irrelevant variance (see chap. 3, "Fairness in Testing"). In clinical or neuropsychological testing situations, flexibility in administration may be required, depending on the individual's ability to comprehend and respond to test items or tasks and/or the construct required to be measured. Some situations and/or the construct (e.g., testing for memory impairment in a test taker with dementia who is in a hospital) may require that the assessment be abbreviated or altered. Large-scale testing programs typically establish specific procedures for considering and granting accommodations and other variations from standardized procedures. Usually these accommodations themselves are somewhat standardized; occasionally, some alternative other than the accommodations foreseen and specified by the test developer may be indicated. Appropriate care should be taken to avoid unfair treatment and discrimination. Although variations may be made with the intent of maintaining score comparability, the extent to which that is possible often cannot be determined. Comparability of scores may be compromised, and the test may then not measure the same constructs for all test takers.

Tests and assessments differ in their degree of standardization. In many instances, different test takers are not given the same test form but receive equivalent forms that have been shown to yield comparable scores, or alternate test forms where scores are adjusted to make them comparable. Some assessments permit test takers to choose which tasks to perform or which pieces of their work are to be evaluated. Standardization can be maintained in these situations by specifying the conditions of the choice and the criteria for evaluation of the products. When an assessment permits a certain kind of collaboration between test takers or between test taker and test administrator, the limits of that collaboration should be specified. With some assessments, test administrators may be expected to tailor their instructions to help ensure that all test takers understand what is expected of them. In all such cases, the goal remains the same: to provide accurate, fair, and comparable measurement for everyone. The degree of standardization is dictated by that goal, and by the intended use of the test score.

Standardized directions help ensure that all test takers have a common understanding of the
mechanics of test taking. Directions generally inform test takers on how to make their responses, what kind of help they may legitimately be given if they do not understand the question or task, how they can correct inadvertent responses, and the nature of any time constraints. General advice is sometimes given about omitting item responses.

Many tests, including computer-administered tests, require special equipment or software. Instruction and practice exercises are often presented in such cases so that the test taker understands how to operate the equipment or software. The principle of standardization includes orienting test takers to materials and accommodations with which they may not be familiar. Some equipment may be provided at the testing site, such as shop tools or software systems. Opportunity for test takers to practice with the equipment will often be appropriate, unless ability to use the equipment is the construct being assessed.

Tests are sometimes administered via technology, with test responses entered by keyboard, computer mouse, voice input, or other devices. Increasingly, many test takers are accustomed to using computers. Those who are not may require training to reduce construct-irrelevant variance. Even those test takers who are familiar with computers may need some brief explanation and practice to manage test-specific details such as the test's interface. Special issues arise in managing the testing environment to reduce construct-irrelevant variance, such as avoiding light reflections on the computer screen that interfere with display legibility, or maintaining a quiet environment when test takers start or finish at different times from neighboring test takers. Those who administer computer-based tests should be trained so that they can deal with hardware, software, or test administration problems. Tests administered by computer in Web-based applications may require other supports to maintain standardized environments.

Standardized scoring procedures help to ensure consistent scoring and reporting, which are essential in all circumstances. When scoring is done by machine, the accuracy of the machine, including any scoring program or algorithm, should be established and monitored. When the scoring of complex responses is done by human scorers or automatic scoring engines, careful training is required. The training typically requires expert human raters to provide a sample of responses that span the range of possible score points or ratings. Within the score point ranges, trainers should also provide samples that exemplify the variety of responses that will yield the score point or rating. Regular monitoring can help ensure that every test performance is scored according to the same standardized criteria and that the test scorers do not apply the criteria differently as they progress through the submitted test responses.

Test scores, per se, are not readily interpreted without other information, such as norms or standards, indications of measurement error, and descriptions of test content. Just as a temperature of 50 degrees Fahrenheit in January is warm for Minnesota and cool for Florida, a test score of 50 is not meaningful without some context. Interpretive material should be provided that is readily understandable to those receiving the report. Often, the test user provides an interpretation of the results for the test taker, suggesting the limitations of the results and the relationship of any reported scores to other information. Scores on some tests are not designed to be released to test takers; only broad test interpretations, or dichotomous classifications, such as "pass/fail," are intended to be reported.

Interpretations of test results are sometimes prepared by computer systems. Such interpretations are generally based on a combination of empirical data, expert judgment, and experience and require validation. In some professional applications of individualized testing, the computer-prepared interpretations are communicated by a professional, who might modify the computer-based interpretation to fit special circumstances. Care should be taken so that test interpretations provided by nonalgorithmic approaches are appropriately consistent. Automatically generated reports are not a substitute for the clinical judgment of a professional evaluator who has worked directly with the test taker, or for the integration of other information, including but not limited to other test results, interviews, existing records, and behavioral observations.
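The routine monitoring of human scorers described above is often operationalized as agreement statistics computed between pairs of raters. A minimal sketch follows; the essay scores are invented, and Cohen's kappa is one common chance-corrected index, not the only acceptable one.

```python
from collections import Counter

def exact_agreement(r1, r2):
    """Proportion of responses given identical scores by two raters."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Chance-corrected agreement between two raters on the same
    ordinal score scale."""
    n = len(r1)
    po = exact_agreement(r1, r2)                      # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)  # chance
    return (po - pe) / (1 - pe)

# Hypothetical essay scores (0-4 scale) for 10 responses, two raters.
rater_a = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
rater_b = [3, 2, 3, 3, 1, 2, 4, 4, 2, 3]
print(exact_agreement(rater_a, rater_b))  # 0.8
print(round(cohens_kappa(rater_a, rater_b), 3))
```

Tracking such indices over the course of scoring is one way to detect the rater drift the text warns about, where scorers "apply the criteria differently as they progress."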
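The point that a score of 50 means little without context can be made concrete with a norm-referenced lookup. In this small sketch the two norm groups are hypothetical, and a normal approximation stands in for empirically tabled norms.

```python
from math import erf, sqrt

def z_score(raw, norm_mean, norm_sd):
    """Standardize a raw score against a reference (norm) group."""
    return (raw - norm_mean) / norm_sd

def percentile_from_z(z):
    """Percentile rank under a normal approximation to the norm
    distribution."""
    return 50 * (1 + erf(z / sqrt(2)))

# The same raw score of 50 reads very differently against two
# hypothetical reference groups.
for group, (mu, sd) in {"group A": (40, 10), "group B": (60, 10)}.items():
    z = z_score(50, mu, sd)
    print(group, round(percentile_from_z(z), 1))
```

Against group A the score sits well above average; against group B, well below it, which is exactly why interpretive material must identify the reference group.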
In some large-scale assessments, the primary target of assessment is not the individual test taker but rather a larger unit, such as a school district or an industrial plant. Often, different test takers are given different sets of items, following a carefully balanced matrix sampling plan, to broaden the range of information that can be obtained in a reasonable time period. The results acquire meaning when aggregated over many individuals taking different samples of items. Such assessments may not furnish enough information to support even minimally valid or reliable scores for individuals, as each individual may take only an incomplete test, while in the aggregate, the assessment results may be valid and acceptably reliable for interpretations about performance of the larger unit.

Some further issues of administration and scoring are discussed in chapter 4, "Test Design and Development."

Test users and those who receive test materials, test scores, and ancillary information such as test takers' personally identifiable information are responsible for appropriately maintaining the security and confidentiality of that information.
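The matrix-sampling logic described above, where no individual sees enough items for a dependable score but the aggregate covers the whole domain, can be illustrated with a small simulation. All item difficulties and the form-assignment scheme here are invented for illustration.

```python
import random
import statistics

random.seed(1)

# A 60-item domain split into 6 non-overlapping 10-item forms;
# each simulated student takes exactly one form.
N_ITEMS, N_FORMS, N_STUDENTS = 60, 6, 600
item_difficulty = [random.uniform(0.3, 0.9) for _ in range(N_ITEMS)]
forms = [list(range(f * 10, f * 10 + 10)) for f in range(N_FORMS)]

def take_form(form):
    """One student answers the 10 items on one form (1 = correct)."""
    return [int(random.random() < item_difficulty[i]) for i in form]

scores = []
for s in range(N_STUDENTS):
    form = forms[s % N_FORMS]                 # balanced assignment
    scores.append(sum(take_form(form)) / len(form))

# Any one student saw only 10 of 60 items, so an individual score is a
# coarse, form-dependent estimate; the mean over all students covers
# the full domain and tracks the true domain difficulty closely.
group_mean = statistics.mean(scores)
domain_mean = statistics.mean(item_difficulty)
print(round(group_mean, 3), round(domain_mean, 3))
```

The group-level estimate stabilizes because every item in the domain is answered by many students, even though each student answers only a sixth of the items.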
The standards in this chapter begin with an overarching standard (numbered 6.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into three thematic clusters labeled as follows:

1. Test Administration
2. Test Scoring
3. Reporting and Interpretation

Standard 6.0

To support useful interpretations of score results, assessment instruments should have established procedures for test administration, scoring, reporting, and interpretation. Those responsible for administering, scoring, reporting, and interpreting should have sufficient training and supports to help them follow the established procedures. Adherence to the established procedures should be monitored, and any material errors should be documented and, if possible, corrected.

Comment: In order to support the validity of score interpretations, administration should follow any and all established procedures, and compliance with such procedures needs to be monitored.

Cluster 1. Test Administration

Standard 6.1

Test administrators should follow carefully the standardized procedures for administration and scoring specified by the test developer and any instructions from the test user.

Comment: Those responsible for testing programs should provide appropriate training, documentation, and oversight so that the individuals who administer or score the test(s) are proficient in the appropriate test administration or scoring procedures and understand the importance of adhering to the directions provided by the test developer. Large-scale testing programs should specify accepted standardized procedures for determining accommodations and other acceptable variations in test administration. Training should enable test administrators to make appropriate adjustments if an accommodation or modification is required that is not covered by the standardized procedures.

Specifications regarding instructions to test takers, time limits, the form of item presentation or response, and test materials or equipment should be strictly observed. In general, the same procedures should be followed as were used when obtaining the data for scaling and norming the test scores. Some programs do not scale or establish norms, such as portfolio assessments and most alternate academic assessments for students with severe cognitive disabilities. However, these programs typically have specified standardized procedures for administration and scoring when they establish performance standards. A test taker with a disability may require variations to provide access without changing the construct that is measured. Other special circumstances may require some flexibility in administration, such as language support to provide access under certain conditions, or some clinical or neuropsychological evaluations, in addition to procedures related to accommodations. Judgments of the suitability of adjustments should be tempered by the consideration that departures from standard procedures may jeopardize the validity or complicate the comparability of the test score interpretations. These judgments should be made by qualified individuals and be consistent with the guidelines provided by the test user or test developer.

Policies regarding retesting should be established by the test developer or user. The test user and administrator should follow the established policy. Such retest policies should be clearly communicated
by the test user as part of the conditions for standardized test administration. Retesting is intended to decrease the probability that a person will be incorrectly classified as not meeting some standard. For example, some testing programs specify that a person may retake the test; some offer multiple opportunities to take a test, for example when passing the test is required for high school graduation or credentialing.

Test developers should specify the standardized administration conditions that support intended uses of score interpretations. Test users should be aware of the implications of less controlled administration conditions. Test users are responsible for providing technical and other support to help ensure that test administrations meet these conditions to the extent possible. However, technology and the Internet have made it possible to administer tests in many settings, including settings in which the administration conditions may not be strictly controlled or monitored. Those who allow lack of standardization are responsible for providing evidence that the lack of standardization did not affect test-taker performance or the quality or comparability of the scores produced. Complete documentation would include reporting the extent to which standardized administration conditions were not met.

Characteristics such as time limits, choices about item types and response formats, complex interfaces, and instructions that potentially add construct-irrelevant variance should be scrutinized in terms of the test purpose and the constructs being measured. Appropriate usability and empirical research should be carried out, as feasible, to document and ideally minimize the impact of sources or conditions that contribute to construct-irrelevant variability.

Standard 6.2

When formal procedures have been established for requesting and receiving accommodations, test takers should be informed of these procedures in advance of testing.

Comment: When testing programs have established procedures and criteria for identifying and providing accommodations for test takers, the procedures and criteria should be carefully followed and documented. Ideally, these procedures include how to consider the instances when some alternative may be appropriate in addition to those accommodations foreseen and specified by the test developer. Test takers should be informed of any testing accommodations that may be available to them and the process and requirements, if any, for obtaining needed accommodations. Similarly, in educational settings, appropriate school personnel and parents/legal guardians should be informed of the requirements, if any, for obtaining needed accommodations for students being tested.

Standard 6.3

Changes or disruptions to standardized test administration procedures or scoring should be documented and reported to the test user.

Comment: Information about the nature of changes to standardized administration or scoring procedures should be maintained in secure data files so that research studies or case reviews based on test records can take it into account. This includes not only accommodations or modifications for particular test takers but also disruptions in the testing environment that may affect all test takers in the testing session. A researcher may wish to use only the records based on standardized administration. In other cases, research studies may depend on such information to form groups of test takers. Test users or test sponsors should establish policies specifying who secures the data files, who may have access to the files, and, if necessary, how to maintain confidentiality of respondents, for example by de-identifying respondents. Whether the information about deviations from standard procedures is reported to users of test data depends on considerations such as whether the users are admissions officers or users of individualized psychological reports in clinical settings. If such reports are made, it may be appropriate to include clear documentation of any deviation from standard administration procedures, discussion of how such administrative variations may have
[...]
The testing then requires less time from each test taker, while the aggregation of individual results provides for domain coverage that can be adequate for meaningful group- or program-level interpretations, such as for schools or grade levels within a locality or particular subject areas. However, because the individual is administered only an incomplete test, an individual score would have limited meaning, if any.

Standard 6.13

When a material error is found in test scores or other important information issued by a testing organization or other institution, this information and a corrected score report should be distributed as soon as practicable to all known recipients who might otherwise use the erroneous scores as a basis for decision making. The corrected report should be labeled as such. What was done to correct the reports should be documented. The reason for the corrected score report should be made clear to the recipients of the report.

Comment: A material error is one that could change the interpretation of the test score and make a difference in a significant way. An example is an erroneous test score (e.g., incorrectly computed or fraudulently obtained) that would affect an important decision about the test taker, such as a credentialing decision or the awarding of a high school diploma. Innocuous typographical errors would be excluded. Timeliness is essential for decisions that will be made soon after the test scores are received. Where test results have been used to inform high-stakes decisions, corrective actions by test users may be necessary to rectify circumstances affected by erroneous scores, in addition to issuing corrected reports. The reporting or corrective actions may not be possible or practicable in certain work or other settings. Test users should develop a policy of how to handle material errors in test scores and should document what was done in the case of suspected or actual material errors.

Standard 6.14

Organizations that maintain individually identifiable test score information should develop a clear set of policy guidelines on the duration of retention of an individual's records and on the availability and use over time of such data for research or other purposes. The policy should be documented and available to the test taker. Test users should maintain appropriate data security, which should include administrative, technical, and physical protections.

Comment: In some instances, test scores become obsolete over time, no longer reflecting the current state of the test taker. Outdated scores should generally not be used or made available, except for research purposes. In other cases, test scores obtained in past years can be useful, as in longitudinal assessment or the tracking of deterioration of function or cognition. The key issue is the valid use of the information. Organizations and individuals who maintain individually identifiable test score information should be aware of and comply with legal and professional requirements. Organizations and individuals who maintain test scores on individuals may be requested to provide data to researchers or other third-party users. Where data release is deemed appropriate and is not prohibited by statutes or regulations, the test user should protect the confidentiality of the test takers through appropriate policies, such as de-identifying test data or requiring nondisclosure and confidentiality of the data. Organizations and individuals who maintain or use confidential information about test takers or their scores should have and implement an appropriate policy for maintaining security and integrity of the data, including protecting from accidental or deliberate modification as well as preventing loss or unauthorized destruction. In some cases, organizations may need to obtain test takers' consent to use or disclose records. Adequate security and appropriate protocols should be established when confidential test data are made part of a larger record (e.g., an
electronic medical record) or merged into a data warehouse. If records are to be released for clinical and/or forensic evaluations, care should be taken to release them to appropriately licensed individuals, with appropriate signed release authorization by the test taker or appropriate legal authority.

Standard 6.15

When individual test data are retained, both the test protocol and any written report should also be preserved in some form.

Comment: The protocol may be needed to respond to a possible challenge from a test taker or to facilitate interpretation at a subsequent time. The protocol would ordinarily be accompanied by testing materials and test scores. Retention of more detailed records of responses would depend on circumstances and should be covered in a retention policy. Record keeping may be subject to legal and professional requirements. Policy for the release of any test information for other than research purposes is discussed in chapter 9, "The Rights and Responsibilities of Test Users."

Standard 6.16

Transmission of individually identified test scores to authorized individuals or institutions should be done in a manner that protects the confidential nature of the scores and pertinent ancillary information.

Comment: Care is always needed when communicating the scores of identified test takers, regardless of the form of communication. Similar care may be needed to protect the confidentiality of ancillary information, such as personally identifiable information on disability status for students or clinical test scores shared between practitioners. Appropriate caution with respect to confidential information should be exercised in communicating face to face, as well as by telephone, fax, and other forms of written communication. Similarly, transmission of test data through electronic media, and transmission and storage on computer networks, including wireless transmission and storage or processing on the Internet, require caution to maintain appropriate confidentiality and security. Data integrity must also be maintained by preventing inappropriate modification of results during such transmissions. Test users are responsible for understanding and adhering to applicable legal obligations in their data management, transmission, use, and retention practices, including collection, handling, storage, and disposition. Test users should set and follow appropriate security policies regarding confidential test data and other assessment information. Release of clinical raw data, tests, or protocols to third parties should follow laws, regulations, and guidelines provided by professional organizations and should take into account the impact of availability of tests in public domains (e.g., court proceedings) and the potential for violation of intellectual property rights.
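The de-identification safeguard invoked in Standards 6.14 and 6.16 can be sketched as keyed pseudonymization before data release. The field names, the identifier set, and the key-handling scheme below are hypothetical; real policies depend on the applicable statutes and the custodian's security program.

```python
import hashlib
import hmac

# Secret key held by the data custodian, never shared with recipients.
PSEUDONYM_KEY = b"replace-with-a-securely-stored-secret"

def pseudonym(test_taker_id: str) -> str:
    """Stable, non-reversible pseudonym for a test-taker ID (HMAC-SHA256).
    The same ID always maps to the same token, permitting longitudinal
    linkage without exposing the identity."""
    digest = hmac.new(PSEUDONYM_KEY, test_taker_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

def de_identify(record: dict) -> dict:
    """Drop direct identifiers and replace the ID with a pseudonym."""
    released = {k: v for k, v in record.items()
                if k not in {"name", "date_of_birth", "ssn"}}
    released["test_taker_id"] = pseudonym(record["test_taker_id"])
    return released

record = {"test_taker_id": "S-001", "name": "Jane Doe",
          "date_of_birth": "2001-04-05", "score": 52}
print(de_identify(record))
```

Because the mapping is keyed, recipients of the released file cannot recompute pseudonyms from known IDs, while the custodian retains the ability to link records across releases.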
7. SUPPORTING DOCUMENTATION
FOR TESTS
BACKGROUND
This chapter provides general standards for the preparation and publication of test documentation by test developers, publishers, and other providers of tests. Other chapters contain specific standards that should be useful in the preparation of materials to be included in a test's documentation. In addition, test users may have their own documentation requirements. The rights and responsibilities of test users are discussed in chapter 9.

The supporting documents for tests are the primary means by which test developers, publishers, and other providers of tests communicate with test users. These documents are evaluated on the basis of their completeness, accuracy, currency, and clarity and should be available to qualified individuals as appropriate. A test's documentation typically specifies the nature of the test; the use(s) for which it was developed; the processes involved in the test's development; technical information related to scoring, interpretation, and evidence of validity, fairness, and reliability/precision; scaling, norming, and standard-setting information if appropriate to the instrument; and guidelines for test administration, reporting, and interpretation. The objective of the documentation is to provide test users with the information needed to help them assess the nature and quality of the test, the resulting scores, and the interpretations based on the test scores. The information may be reported in documents such as test manuals, technical manuals, user's guides, research reports, specimen sets, examination kits, directions for test administrators and scorers, or preview materials for test takers.

Regardless of who develops a test (e.g., test publisher, certification or licensure board, employer, or educational institution) or how many users exist, the development process should include thorough, timely, and useful documentation. Although proper documentation of the evidence supporting the interpretation of test scores for proposed uses of a test is important, failure to formally document such evidence in advance does not automatically render the corresponding test use or interpretation invalid. For example, consider an unpublished employment selection test developed by a psychologist solely for internal use within a single organization, where there is an immediate need to fill vacancies. The test may properly be put to operational use after needed validity evidence is collected but before formal documentation of the evidence is completed. Similarly, a test used for certification may need to be revised frequently, in which case technical reports describing the test's development as well as information concerning item, exam, and candidate performance should be produced periodically, but not necessarily prior to every exam.

Test documentation is effective if it communicates information to user groups in a manner that is appropriate for the particular audience. To accommodate the breadth of training of those who use tests, separate documents or sections of documents may be written for identifiable categories of users such as practitioners, consultants, administrators, researchers, educators, and sometimes examinees. For example, the test user who administers the tests and interprets the results needs guidelines for doing so. Those who are responsible for selecting tests need to be able to judge the technical adequacy of the tests and therefore need some combination of technical manuals, user's guides, test manuals, test supplements, examination kits, and specimen sets. Ordinarily, these supporting documents are provided to potential test users or test reviewers with sufficient information to enable them to evaluate the appropriateness and technical adequacy of a test. The types of information presented in these documents typically include a description of the intended test-taking population, stated purpose of the test, test specifications, item formats, administration and scoring procedures,
123
CHAPTER 7
test security protocols, cut scores or other standards, test user or reviewer. This supplemental material
and a description o f the test development process. can be provided in any o f a variety o f published
Also typically provided are summaries o f technical or unpublished forms and in either paper or elec
data such as psychometric indices o f the items; tronic formats.
reliability/precision and validity evidence; normative In addition to technical documentation, de
data; and cut scores or rules for combining scores, scriptive materials are needed in some settings to
including those for computer-generated interpre inform examinees and other interested parties
tations o f test scores. about the nature and content o f a test. The amount
An essential feature o f the documentation for and type o f information provided will depend on
every test is a discussion o f the common appropriate the particular test and application. For example,
and inappropriate uses and interpretations o f the in situations requiring informed consent, information
test scores and a summary o f the evidence sup should be sufficient for test takers (or their repre
porting these conclusions. The inclusion o f examples sentatives) to make a sound judgm ent about the
o f score interpretations consistent with the test test. Such information should be phrased in non
developers intended applications helps users make technical language and should contain information
accurate inferences on the basis o f the test scores. that is consistent with the use o f the test scores
When possible, examples o f improper test uses and is sufficient to help the user make an informed
and inappropriate test score interpretations can decision. The materials may include a general de
help guard against the misuse o f the test or its scription and rationale for the test; intended uses
scores. When feasible, common negative unintended o f the test results; sample items or complete sample
consequences o f test use (including missed op tests; and information about conditions o f test ad
portunities) should be described and suggestions ministration, confidentiality, and retention o f test
given for avoiding such consequences. results. For some applications, however, the true
le s t documents need to include enough in nature and purpose o f the test are purposely hidden
formation to allow test users and reviewers to de or disguised to prevent faking or response bias. In
termine the appropriateness o f the test for its in these instances, examinees may be motivated to
tended uses. Other materials that provide more reveal more or less o f a characteristic intended to
details about research by the publisher or inde be assessed. H iding or disguising the true nature
pendent investigators (e.g., the samples on which or purpose o f a test is acceptable provided that the
the research is based and summative data) should actions involved are consistent with legal principles
be cited and should be readily obtainable by the and ethical standards.
CHAPTER 7
SUPPORTING DOCUMENTATION FOR TESTS
and what tasks each person or group performed. For example, the participants who set the test cut scores and their relevant expertise should be documented. Depending on the use of the test results, relevant characteristics of the participants may include race/ethnicity, gender, age, employment status, education, disability status, and primary language. Descriptions of the tasks and the specific instructions provided to the participants may help future test users select and subsequently use the test appropriately. Testing conditions, such as the extent of proctoring in the validity study, may have implications for the generalizability of the results and should be documented. Any changes to the standardized testing conditions, such as accommodations or modifications made to the test or test administration, should also be documented. Test developers and users should take care to comply with applicable legal requirements and professional standards relating to privacy and data security when providing the documentation required by this standard.

Standard 7.6

When a test is available in more than one language, the test documentation should provide information on the procedures that were employed to translate and adapt the test. Information should also be provided regarding the reliability/precision and validity evidence for the adapted form when feasible.

Comment: In addition to providing information on translation and adaptation procedures, the test documents should include the demographics of translators and samples of test takers used in the adaptation process, as well as information on any score interpretation issues for each language into which the test has been translated and adapted. Evidence of reliability/precision, validity, and comparability of translated and adapted scores should be provided in test documentation when feasible. (See Standard 3.14, in chap. 3, for further discussion of translations.)

Cluster 3. Content of Test Documents: Test Administration and Scoring

Standard 7.7

Test documents should specify user qualifications that are required to administer and score a test, as well as the user qualifications needed to interpret the test scores accurately.

Comment: Statements of user qualifications should specify the training, certification, competencies, and experience needed to allow access to a test or scores obtained with it. When user qualifications are expressed in terms of the knowledge, skills, abilities, and other characteristics required to administer, score, and interpret a test, the test documentation should clearly define the requirements so the user can properly evaluate the competence of administrators.

Standard 7.8

Test documentation should include detailed instructions on how a test is to be administered and scored.

Comment: Regardless of whether a test is to be administered in paper-and-pencil format, computer format, or orally, or whether the test is performance based, instructions for administration should be included in the test documentation. As appropriate, these instructions should include all factors related to test administration, including qualifications, competencies, and training of test administrators; equipment needed; protocols for test administrators; timing instructions; and procedures for implementation of test accommodations. When available, test documentation should also include estimates of the time required to administer the test to clinical, disabled, or other special populations for whom the test is intended to be used, based on data obtained from these groups during the norming of the test. In addition, test users need instructions on how to score a test and what cut
scores to use (or whether to use cut scores) in interpreting scores. If the test user does not score the test, instructions should be given on how to have a test scored. Finally, test administration documentation should include instructions for dealing with irregularities in test administration and guidance on how they should be documented.

If a test is designed so that more than one method can be used for administration or for recording responses (such as marking responses in a test booklet, on a separate answer sheet, or via computer), then the manual should clearly document the extent to which scores arising from application of these methods are interchangeable. If the scores are not interchangeable, this fact should be reported, and guidance should be given on the comparability of scores obtained under the various conditions or methods of administration.

Standard 7.9

If test security is critical to the interpretation of test scores, the documentation should explain the steps necessary to protect test materials and to prevent inappropriate exchange of information during the test administration session.

[…] on how test scores are stored and who is authorized to see the scores.

Standard 7.10

Tests that are designed to be scored and interpreted by test takers should be accompanied by scoring instructions and interpretive materials that are written in language the test takers can understand and that assist them in understanding the test scores.

Comment: If a test is designed to be scored by test takers or its scores interpreted by test takers, the publisher and test developer should develop procedures that facilitate accurate scoring and interpretation. Interpretive material may include information such as the construct that was measured, the test taker's results, and the comparison group. The appropriate language for the scoring procedures and interpretive materials is one that meets the particular language needs of the test taker. Thus, the scoring and interpretive materials may need to be offered in the native language of the test taker to be understood.
requirements such as race or gender norming in employment contexts.

Standard 7.12

When test scores are used to make predictions about future behavior, the evidence supporting those predictions should be provided to the test user.

Comment: The test user should be informed of any cut scores or rules for combining raw or reported scores that are necessary for understanding score interpretations. A description of both the group of judges used in establishing the cut scores and the methods used to derive the cut scores should be provided. When security or proprietary reasons necessitate the withholding of cut scores or rules for combining scores, the owners of the intellectual property are responsible for documenting evidence in support of the validity of interpretations for intended uses. Such evidence might be provided, for example, by reporting the finding of an independent review of the algorithms by qualified professionals. When any interpretations of test scores, including computer-generated interpretations, are provided, a summary of the evidence supporting the interpretations should be given, as well as the rules and guidelines used in making the interpretations.

Cluster 4. Timeliness of Delivery of Test Documents

Standard 7.13

Supporting documents (e.g., test manuals, technical manuals, user's guides, and supplemental material) should be made available to the appropriate people in a timely manner.

Comment: Supporting documents should be supplied in a timely manner. Some documents (e.g., administration instructions, user's guides, sample tests or items) must be made available prior to the first administration of the test. Other documents (e.g., technical manuals containing information based on data from the first administration) cannot be supplied prior to that administration; however, such documents should be created promptly. The test developer or publisher should judge carefully which information should be included in first editions of the test manual, technical manual, or user's guide and which information can be provided in supplements. For low-volume, unpublished tests, the documentation may be relatively brief. When the developer is also the user, documentation and summaries are still necessary.

Standard 7.14

When substantial changes are made to a test, the test's documentation should be amended, supplemented, or revised to keep information for users current and to provide useful additional information or cautions.

Comment: Supporting documents should clearly note the date of their publication as well as the name or version of the test for which the documentation is relevant. When substantial changes are made to items and scoring, information on the extent to which the old scores and new scores are interchangeable should be included in the test documentation.

Sometimes it is necessary to change a test or testing procedure to remove construct-irrelevant variance that may arise due to the characteristics of an individual that are unrelated to the construct being measured (e.g., when testing individuals with disabilities). When a test or testing procedures are altered, the documentation for the test should include a discussion of how the alteration may affect the validity and comparability of the test scores, and evidence should be provided to demonstrate the effect of the alteration on the scores obtained from the altered test or testing procedures, if sample size permits.
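The comment to Standard 7.12 asks that any cut scores or rules for combining raw or reported scores be disclosed to the test user. A minimal sketch of what such a disclosed rule might look like in executable form follows; the subscore names, weights, and cut score are invented for illustration and are not drawn from the Standards.

```python
# Hypothetical published scoring rule: weighted composite plus a cut score.
# All values here are illustrative placeholders, not real test parameters.
WEIGHTS = {"verbal": 0.6, "quantitative": 0.4}
CUT_SCORE = 500  # illustrative value only

def composite(subscores: dict) -> float:
    """Combine subscores using the published weights."""
    return sum(WEIGHTS[name] * value for name, value in subscores.items())

def classify(subscores: dict) -> str:
    """Apply the published cut score to the composite score."""
    return "pass" if composite(subscores) >= CUT_SCORE else "fail"

print(classify({"verbal": 520, "quantitative": 510}))  # composite 516 -> pass
```

Documenting the rule at this level of detail lets a test user reproduce any reported composite and understand exactly how a "pass"/"fail" decision was reached.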
8. THE RIGHTS AND RESPONSIBILITIES
OF TEST TAKERS
BACKGROUND
This chapter addresses issues of fairness from the point of view of the individual test taker. Most aspects of fairness affect the validity of interpretations of test scores for their intended uses. The standards in this chapter address test takers' rights and responsibilities with regard to test security, their access to test results, and their rights when irregularities in their testing process are claimed. Other issues of fairness are addressed in chapter 3 ("Fairness in Testing"). General considerations concerning reports of test results are covered in chapter 6 ("Test Administration, Scoring, Reporting, and Interpretation"). Issues related to test takers' rights and responsibilities in clinical or individual settings are also discussed in chapter 10 ("Psychological Testing and Assessment").

The standards in this chapter are directed to test providers, not to test takers. It is the shared responsibility of the test developer, test administrator, test proctor (if any), and test user to provide test takers with information about their rights and their own responsibilities. The responsibility to inform the test taker should be apportioned according to particular circumstances.

Test takers have the right to be assessed with tests that meet current professional standards, including standards of technical quality, consistent treatment, fairness, conditions for test administration, and reporting of results. The chapters in Part I, "Foundations," and Part II, "Operations," deal specifically with fair and appropriate test design, development, administration, scoring, and reporting. In addition, test takers have a right to basic information about the test and how the test results will be used. In most situations, fair and equitable treatment of test takers involves providing information about the general nature of the test, the intended use of test scores, and the confidentiality of the results in advance of testing. When full disclosure of this information is not appropriate (as is the case with some psychological or employment tests), the information that is provided should be consistent across test takers. Test takers, or their legal representatives when appropriate, need enough information about the test and the intended use of test results to reach an informed decision about their participation.

In some instances, the laws or standards of professional practice, such as those governing research on human subjects, require formal informed consent for testing. In other instances (e.g., employment testing), informed consent is implied by other actions (e.g., submission of an employment application), and formal consent is not required. The greater the consequences to the test taker, the greater the importance of ensuring that the test taker is fully informed about the test and voluntarily consents to participate, except when testing without consent is permitted by law (e.g., when participating in testing is legally required or mandated by a court order). If a test is optional, the test taker has the right to know the consequences of taking or not taking the test. Under most circumstances, the test taker has the right to ask questions or express concerns and should receive a timely response to legitimate inquiries.

When consistent with the purposes and nature of the assessment, general information is usually provided about the test's content and purposes. Some programs, in the interest of fairness, provide all test takers with helpful materials, such as study guides, sample questions, or complete sample tests, when such information does not jeopardize the validity of the interpretations of results from future test administrations. Practice materials should have the same appearance and format as the actual test. A practice test for a Web-based assessment, for example, should be available via computer. Employee selection programs may legitimately provide more training to certain classes of test takers (e.g., internal applicants) and not to others (e.g., external applicants). For example, an
organization may train current employees on skills that are measured on employment tests in the context of an employee development program but not offer that training to external applicants. Advice may also be provided about test-taking strategies, including time management and the advisability of omitting a response to an item (when omitting a response is permitted). Information on various testing policies, for example about making accommodations available and determining for which individuals the accommodations are appropriate, is also provided to the test taker. In addition, communications to test takers should include policies on retesting when major disruptions of the test administration occur, when the test taker feels that the present performance does not appropriately reflect his or her true capabilities, or when the test taker improves on his or her underlying knowledge, skills, abilities, or other personal characteristics.

As participants in the assessment, test takers have responsibilities as well as rights. Their responsibilities include being prepared to take the test, following the directions of the test administrator, representing themselves honestly on the test, and protecting the security of the test materials. Requests for accommodations or modifications are the responsibility of the test taker, or in the case of minors, the test taker's guardian. In group testing situations, test takers should not interfere with the performance of other test takers. In some testing programs, test takers are also expected to inform the appropriate persons in a timely manner if they believe there are reasons that their test results will not reflect their true capabilities.

The validity of score interpretations rests on the assumption that a test taker has earned fairly a particular score or categorical decision, such as "pass" or "fail." Many forms of cheating or other malfeasant behaviors can reduce the validity of the interpretations of test scores and cause harm to other test takers, particularly in competitive situations in which test takers' scores are compared. There are many forms of behavior that affect test scores, such as using prohibited aids or arranging for someone to take the test in the test taker's place. Similarly, there are many forms of behavior that jeopardize the security of test materials, including communicating the specific content of the test to other test takers in advance. The test taker is obligated to respect the copyrights in test materials and may not reproduce the materials without authorization or disseminate in any form material that is similar in nature to the test. Test takers, as well as test administrators, have the responsibility to protect test security by refusing to divulge any details of the test content to others, unless the particular test is designed to be openly available in advance. Failure to honor these responsibilities may compromise the validity of test score interpretations for the test taker and for others. Outside groups that develop items for test preparation should base those items on publicly disclosed information and not on information that has been inappropriately shared by test takers.

Sometimes, testing programs use special scores, statistical indicators, and other indirect information about irregularities in testing to examine whether the test scores have been obtained fairly. Unusual patterns of responses, large changes in test scores upon retesting, response speed, and similar indicators may trigger careful scrutiny of certain testing protocols and test scores. The details of the procedures for detecting problems are generally kept secure to avoid compromising their use. However, test takers should be informed that in special circumstances, such as response or test score anomalies, their test responses may receive special scrutiny. Test takers should be informed that their score may be canceled or other action taken if evidence of impropriety or fraud is discovered.
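One of the statistical indicators described above, a large change in test scores upon retesting, can be sketched as a simple screening rule. The sketch below is purely illustrative: the normal-error model, the `sem` value, and the flagging threshold are hypothetical choices, not procedures prescribed by the Standards (real detection procedures are, as noted, kept secure), and a flag would only trigger further review, never an automatic decision.

```python
# Hypothetical screening rule: flag retest gains that are improbably large
# relative to the standard error of measurement of the difference (sem),
# which would come from the test's reliability documentation.
def flag_retest_gains(pairs, sem, z_threshold=3.0):
    """Return indices of (first, retest) score pairs whose gain exceeds
    z_threshold standard errors; these would be routed to human review."""
    flagged = []
    for i, (first, retest) in enumerate(pairs):
        z = (retest - first) / sem
        if z > z_threshold:
            flagged.append(i)
    return flagged

pairs = [(500, 520), (480, 495), (510, 640), (530, 525)]
print(flag_retest_gains(pairs, sem=25.0))  # only the 130-point gain is flagged
```

The design point is that the indicator is a trigger for scrutiny of the testing protocol, consistent with the passage above, rather than evidence of impropriety in itself.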
The standards in this chapter begin with an overarching standard (numbered 8.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into four thematic clusters labeled as follows:

[…] in this chapter address the responsibility of test takers to represent themselves fairly and accurately during the testing process and to respect the confidentiality of, and copyright in, all test materials.

Cluster 1. Test Takers' Rights to Information Prior to Testing
Consent is not required when testing is legally mandated, as in the case of a court-ordered psychological assessment, although there may be legal requirements for providing information about the testing session outcomes to the test taker. Nor is consent typically required in educational settings for tests administered to all pupils. When testing is required for employment, credentialing, or educational admissions, applicants, by applying, have implicitly given consent to the testing. When feasible, the person explaining the reason for a test should be experienced in communicating with individuals within the intended population for the test (e.g., individuals with disabilities or from different linguistic backgrounds).

Cluster 2. Test Takers' Rights to Access Their Test Results and to Be Protected From Unauthorized Use of Test Results

Standard 8.5

Policies for the release of test scores with identifying information should be carefully considered and clearly communicated to those who have access to the scores. Policies should make sure that test results containing the names of individual test takers or other personal identifying information are released only to those who have a legitimate, professional interest in the test takers and are permitted to access such information under applicable privacy laws, who are covered by the test takers' informed consent documents, or who are otherwise permitted by law to access the results.

Comment: Test results of individuals identified by name, or by some other information by means of which a person can be readily identified, or readily identified when the information is combined with other information, should be kept confidential. In some situations, information may be provided on a confidential basis to other practitioners with a legitimate interest in the particular case, consistent with legal and ethical considerations, including, as applicable, privacy laws. Information may be provided to researchers if several conditions are all met: (a) each test taker's confidentiality is maintained, (b) the intended use is consistent with accepted research practice, (c) the use is in compliance with current legal and institutional requirements for subjects' rights and with applicable privacy laws, and (d) the use is consistent with the test taker's informed consent documents that are on file or with the conditions of implied consent that are appropriate in some settings.

Standard 8.6

Test data maintained or transmitted in data files, including all personally identifiable information (not just results), should be adequately protected from improper access, use, or disclosure, including by reasonable physical, technical, and administrative protections as appropriate to the particular data set and its risks, and in compliance with applicable legal requirements. Use of facsimile transmission, computer networks, data banks, or other electronic data-processing or transmittal systems should be restricted to situations in which confidentiality can be reasonably assured. Users should develop and/or follow policies, consistent with any legal requirements, for whether and how test takers may review and correct personal information.

Comment: Risk of compromise is reduced by avoiding identification numbers or codes that are linked to individuals and used for other purposes (e.g., Social Security numbers or employee IDs). If facsimile or computer communication is used to transmit test responses to another site for scoring or if scores are similarly transmitted, reasonable provisions should be made to keep the information confidential, such as encrypting the information. In some circumstances, applicable data security laws may require that specific measures be taken to protect the data. In most cases, these policies will be developed by the owner of the data.
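One way to follow the comment's advice to avoid identification numbers that are linked to individuals and used for other purposes is to replace direct identifiers with keyed pseudonyms before score records are stored or transmitted. The sketch below is illustrative only: the function name, the truncation length, and the in-memory secret key are assumptions, and a real deployment would need proper key management and a legal review of the whole scheme.

```python
import hmac
import hashlib

# Illustrative only: derive a program-specific pseudonym so that stored
# score records never contain the original ID (e.g., an employee ID).
# The key would be held by the data owner; this value is a placeholder.
SECRET_KEY = b"program-specific-secret"  # hypothetical, not a real key

def pseudonymize(test_taker_id: str) -> str:
    """Return a stable, non-reversible code for a test taker ID."""
    digest = hmac.new(SECRET_KEY, test_taker_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability

record = {"id": pseudonymize("EMP-10293"), "score": 512}
print(record["id"])  # same input always yields the same code; raw ID not stored
```

Because the digest is keyed, records from different data sets cannot be linked by the code alone, which is the property the comment is after when it warns against reusing Social Security numbers or employee IDs.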
Cluster 3. Test Takers' Rights to Fair and Accurate Score Reports

Standard 8.7

[…] results are used solely for the purpose of aiding selection decisions, waivers of access are often a condition of employment applications, although access to test information may often be appropriately required in other circumstances.
9. THE RIGHTS AND RESPONSIBILITIES
OF TEST USERS
BACKGROUND
The previous chapters have dealt primarily with the responsibilities of those who develop, promote, evaluate, or mandate the administration of tests and with the rights and responsibilities of test takers. The present chapter centers attention on the responsibilities of those who may be considered the users of tests. Test users are professionals who select the specific instruments or supervise test administration, on their own authority or at the behest of others, as well as all other professionals who actively participate in the interpretation and use of test results. They include psychologists, educators, employers, test developers, test publishers, and other professionals. Given the reliance on test results in many settings, pressure has typically been placed on test users to explain test-based decisions and testing practices; in many circumstances, test users have legal obligations to document the validity and fairness of those decisions and practices. The standards in this chapter provide guidance with regard to test administration procedures and decision making in which tests play a part. Thus, the present chapter includes standards of a general nature that apply in almost all testing contexts.

These Standards presume that a legitimate educational, psychological, credentialing, or employment purpose justifies the time and expense of test administration. In most settings, the user communicates this purpose to those who have a legitimate interest in the measurement process and subsequently conveys the implications of examinee performance to those entitled to receive the information. Depending on the measurement setting, this group may include individual test takers, parents and guardians, educators, employers, policy makers, the courts, or the general public.

Validity and reliability are critical considerations in test selection and use, and test users should consider evidence of (a) the validity of the interpretation for intended uses of the scores, (b) the reliability/precision of the scores, (c) the applicability of the normative data available in the test manual, and (d) the potential positive and negative consequences of use. The accumulated research literature should also be considered, as well as, where appropriate, demographic characteristics (e.g., race/ethnicity; gender; age; income; socioeconomic, cultural, and linguistic background; education; and other socioeconomic variables) of the group for which the test was originally constructed and for which normative data are available. Test users can also consult with measurement professionals. The name of the test alone never provides adequate information for deciding whether to select it.

In some cases, the selection of tests and inventories is individualized for a particular client. In other settings, a predetermined battery of tests is taken by all participants. In both cases, test users should be well versed in proper administrative procedures and are responsible for understanding the validity and reliability evidence and articulating that evidence if the need arises. Test users who oversee testing and assessment are responsible for ensuring that the test administrators who administer and score tests have received the appropriate education and training needed to perform these tasks. A higher level of competence is required of the test user who interprets the scores and integrates the inferences derived from the scores and other relevant information.

Test scores ideally are interpreted in light of the available data, the psychometric properties of the scores, indicators of effort, and the effects of moderator variables and demographic characteristics on test results. Because items or tasks contained in a test that was designed for a particular group may introduce construct-irrelevant variance when used with other groups, selecting a test with demographically appropriate reference groups is important to the generalizability of the inference that the test user seeks to make. When a test developed and normed for one group is applied to
other groups, score interpretations should be qualified and presented as hypotheses rather than conclusions. Further, statistical analyses conducted on only one group should be evaluated for appropriateness when generalized to other examinee populations. The test user should rely on any available extant research evidence for the test to draw appropriate inferences and should be aware of requirements restricting certain practices (e.g., norming by race or gender in certain contexts).

Moreover, where applicable, an interpretation of test takers' scores needs to consider not only the demonstrated relationship between the scores and the criteria, but also the appropriateness of the latter. The criteria need to be subjected to an examination similar to the examination of the predictors if one is to understand the degree to which the underlying constructs are congruent with the inferences under consideration. It is important that data which are not supportive of the inferences be acknowledged and either reconciled or noted as limits to the confidence that can be placed in the inferences. The education and experience necessary to interpret group tests are generally less stringent than the qualifications necessary to interpret individually administered tests.

Test users should follow the standardized test administration procedures outlined by the test developers. Computer administration of tests should also follow standardized procedures, and sufficient oversight should be provided to ensure the integrity of test results. When nonstandard procedures are needed, they should be described and justified. Test users are also responsible for providing appropriate testing conditions. For example, the test user may need to determine

[…] In such settings, there is often no clear separation in terms of professional responsibilities between those who develop the instrument and those who administer it and interpret the results. Instruments produced by independent publishers, on the other hand, present a somewhat different picture. Typically, these will be used by different test users with a variety of populations and for diverse purposes. The conscientious developer of a standardized test attempts to control who has access to the test and to educate potential users. Furthermore, most publishers and test sponsors work to prevent the misuse of standardized measures and the misinterpretation of individual scores and group averages. Test manuals often illustrate sound and unsound interpretations and applications. Some identify specific practices that are not appropriate and should be discouraged. Despite the best efforts of test developers, however, appropriate test use and sound interpretation of test scores are likely to remain primarily the responsibility of the test user.

Test takers, parents and guardians, legislators, policy makers, the media, the courts, and the public at large often prefer unambiguous interpretations of test data. In particular, they often tend to attribute positive or negative results, including group differences, to a single factor or to the conditions that prevail in one social institution, most often the home or the school. These consumers of test data frequently press for score-based rationales for decisions that are based only in part on test scores. The wise test user helps all interested parties understand that sound decisions regarding test use and score interpretation involve an element of professional judgment. It is not always obvious to the consumers that the choice of various informa-
whether a test taker is capable o f reading at the tion-gathering procedures involves experience that
level required and whether a test taker with vision, is not easily quantified or verbalized. The user can
hearing, or neurological disabilities is adequately help consumers appreciate the fact diat the weighting
accom modated. Chapter 3 (“Fairness in Testing”) o f quantitative data, educational and occupational
addresses equal access considerations and standards information, behavioral observations, anecdotal
in detail. reports, and other relevant data often cannot be
W here administration o f tests or use o f test specified precisely. Nonetheless, test users should
data is mandated for a specific population by gov provide reports and interpretations o f test data
ernmental authorities, educational institutions, li that are clear and understandable.
censing boards, or employers, the developer and Because test results are frequently reported
user o f an instrument may be essentially the same. as num bers, they often appear to be precise,
140
THE RIGHTS AND RESPONSIBILITIES OF TEST USERS
and test data are som etim es allowed to override and agencies mandating test use agree to provide
other sources o f evidence about test takers. information on the strengths and weaknesses of
There are circumstanccs in which selection based their instruments. They accept the responsibility
exclusively on test scores may be appropriate to warn against likely misinterpretations by unso
(e.g., in pre-employment screening). However, phisticated interpreters o f individual scores or ag
in educational, psychological, forensic, and some gregated data. However, the ultimate responsibility
em ploym ent settings, test users arc well advised, for appropriate test use and interpretation lies
and may be legally required, to consider other predominantly with the test user. In assuming
relevant sources o f inform ation on test takers, this responsibility, the user must become knowl
not just test scores. In such situations, psychol edgeable about a tests appropriate uses and the
ogists, educators, or other professionals familiar populations for which it is suitable. The test user
with the local setting and with local test takers should be prepared to develop a logical analysis
are often best qualified to integrate this diverse that supports the various facets o f the assessment
inform ation effectively. and the inferences made from the assessment
It is not appropriate for these standards to results, le s t users in all settings (e.g., clinical,
dictate minimal levels o f test-criterion correlation, counseling, credentialing, educational, employment,
classification accuracy, or reliability/precision for forensic, psychological) must also become adept
any given purpose. Such levels depend on factors in communicating the implications o f test results
such as the nature o f the measured construct, the to those entitled to receive them.
age o f the tested individuals, and whether decisions In some instances, users may be obligated to
must be made immediately on the strength o f the collect additional evidence about a test’s technical
best available evidence, however weak, or whether quality. For example, if performance assessments
they can be delayed until better evidence becomes are locally scored, evidence o f the degree o f inter-
available. But it is appropriate to expect the user to scorer agreement may be required. Users should
ascertain what the alternatives are, what the quality also be alert to the probable local consequences of
and consequences o f these alternatives are, and test use, particularly in the case o f large-scale
whether a delay in decision making would be ben testing programs. I f the same test material is used
eficial. Cost-benefit compromises become necessary in successive years, users should actively monitor
in test use, as they often are in test development. the program to determine if reuse has compromised
However, in some contexts, legal requirements may the integrity o f the results.
place limits on the extent to which such compromises Some o f the standards that follow reiterate
can be made. As with standards for the various ideas contained in other chapters, principally
phases o f test development, when relevant standards chapter 3 (“Fairness in Testing”), chapter 6 (“Test
are not met in test use, the reasons should be per Administration, Scoring, Reporting, and Inter
suasive. The greater the potential impact on test pretation”), chapter 8 (“T he Rights and Respon
talcers, for good or ill, the greater the need to sibilities o f Test Takers”), chapter 10 (“Psychological
identify and satisfy the relevant standards. Testing and Assessment”), chapter 11 (“Workplace
In selecting a test and interpreting a test score, Testing and Credentialing”), and chapter 12 (“ Ed
the test user is expected to have a clear understanding ucational Testing and Assessment”). This repetition
o f the purposes o f the testing and its probable is intentional. It perm its an enumeration in one
consequences. The knowledgeable user has definite chapter o f the m ajor obligations that must be as
ideas on how to achieve these purposes and how sumed largely by the test administrator and user,
to avoid unfairness and undesirable consequences. although these responsibilities may refer to topics
In subscribing to the Standai'ds, test publishers that are covered more fully in other chapters.
141
to minimize or avoid foreseeable misinterpretations and inappropriate uses of test scores.

Comment: Untrained audiences may adopt simplistic interpretations of test results or may attribute high or low scores or averages to a single causal factor. Test users can sometimes anticipate such misinterpretations and should try to prevent them. Obviously, not every unintended interpretation can be anticipated, and unforeseen negative consequences can occur. What is required is a reasonable effort to encourage sound interpretations and uses and to address any negative consequences that occur.

Standard 9.7

Test users should verify periodically that their interpretations of test data continue to be appropriate, given any significant changes in the population of test takers, the mode(s) of test administration, or the purposes in testing.

Comment: Over time, a gradual change in the characteristics of an examinee population may significantly affect the accuracy of inferences drawn from group averages. Modifications in test administration in response to unforeseen circumstances also may affect interpretations.

Standard 9.8

When test results are released to the public or to policy makers, those responsible for the release should provide and explain any supplemental information that will minimize possible misinterpretations of the data.

Comment: Test users have a responsibility to report results in ways that facilitate the intended interpretations for the proposed use(s) of the scores, and this responsibility extends beyond the individual test taker to any individuals or groups who are provided with test scores. Test users in group testing situations are responsible for ensuring that the individuals who use the test results are trained to interpret the scores properly. Preliminary briefings prior to the release of test results can give reporters, policy makers, or members of the public an opportunity to assimilate relevant data. Misinterpretation often can be the result of inadequate presentation of information that bears on test score interpretation.

Standard 9.9

When a test user contemplates an alteration in test format, mode of administration, instructions, or the language used in administering a test, the user should have a sound rationale and empirical evidence, when possible, for concluding that the reliability/precision of scores and the validity of interpretations based on the scores will not be compromised.

Comment: In some instances, minor changes in format or mode of administration may be reasonably expected, without evidence, to have little or no effect on test scores, classification decisions, and/or appropriateness of norms. In other instances, however, changes in the format or administrative procedures could have significant effects on the validity of interpretations of the scores—that is, these changes modify or change the construct being assessed. If a given modification becomes widespread, evidence for validity should be gathered; if appropriate, norms should also be developed under the modified conditions.

Standard 9.10

Test users should not rely solely on computer-generated interpretations of test results.

Comment: The user of automatically generated scoring and reporting services has the obligation to be familiar with the principles on which such interpretations were derived. All users who are making inferences and decisions on the basis of these reports should have the ability to evaluate a computer-based score interpretation in the light of other relevant evidence on each test taker. Automated narrative reports can be misleading, if used in isolation, and are not a substitute for sound professional judgment.
potential recipients. Test takers and other score recipients should be informed of such privileges, if any, and the conditions under which they apply.

Standard 9.19

Test users are obligated to protect the privacy of examinees and institutions that are involved in a testing program, unless a disclosure of private information is agreed upon or is specifically authorized by law.

Comment: Protection of the privacy of individual examinees is a well-established principle in psychological and educational measurement. Storage and transmission of this type of information should meet existing professional and legal standards, and care should be taken to protect the confidentiality of scores and ancillary information (e.g., disability status). In certain circumstances, test users and testing agencies may adopt more stringent restrictions on the communication and sharing of test results than relevant law dictates. Privacy laws may apply to certain types of information, and similar or more rigorous standards sometimes arise through the codes of ethics adopted by relevant professional organizations. In some testing programs the conditions for disclosure are stated to the examinee prior to testing, and taking the test can constitute agreement to the disclosure of test score information as specified. In other programs, the test taker or his or her parents or guardians must formally agree to any disclosure of test information to individuals or agencies other than those specified in the test administrator's published literature. Applicable privacy laws, if any, may govern and allow (as in the case of school districts for accountability purposes) or prohibit (as in clinical settings) the disclosure of test information. It should be noted that the right of the public and the media to examine the aggregate test results of public school systems is often guaranteed by law. This may often include test scores disaggregated by demographic subgroups when the numbers are sufficient to yield statistically sound results and to prevent the identification of individual test takers.

Standard 9.20

In situations where test results are shared with the public, test users should formulate and share the established policy regarding the release of the results (e.g., timeliness, amount of detail) and apply that policy consistently over time.

Comment: Test developers and test users should consider the practices of the communities they serve and facilitate the creation of common policies regarding the release of test results. For example, in many states, the release of data from large-scale educational tests is often required by law. However, even when the release of data is not required but is routinely done, test users should have clear policies governing the release procedures. Different policies without appropriate rationales can confuse the public and lead to unnecessary controversy.

Cluster 3. Test Security and Protection of Copyrights

Standard 9.21

Test users have the responsibility to protect the security of tests, including that of previous editions.

Comment: When tests are used for purposes of selection, credentialing, educational accountability, or for clinical diagnosis, treatment, and monitoring, the rigorous protection of test security is essential, for reasons related to validity of inferences drawn, protection of intellectual property rights, and the costs associated with developing tests. Test developers, test publishers, and individuals who hold the copyrights on tests provide specific guidelines about test security and disposal of test materials. The test user is responsible for helping to ensure the security of test materials according to the professional guidelines established for that test as well as any applicable legal standards. Resale of copyrighted materials in open forums is a violation of this standard, and audio and video recordings for training purposes must also be handled in such a way that they are not released to the public. These
prohibitions also apply to outdated and previous editions of tests; test users should help to ensure that test materials are securely disposed of when no longer in use (e.g., upon retirement or after purchase of a new edition). Consistency and clarity in the definition of acceptable and unacceptable practices is critical in such situations. When tests are involved in litigation, inspection of the instruments should be restricted—to the extent permitted by law—to those who are obligated legally or by professional ethics to safeguard test security.

Standard 9.22

Test users have the responsibility to respect test copyrights, including copyrights of tests that are administered via electronic devices.

Comment: Legally and ethically, test users may not reproduce or create electronic versions of copyrighted materials for routine test use without consent of the copyright holder. These materials—in both paper and electronic form—include test items, test protocols, ancillary forms such as answer sheets or profile forms, scoring templates, conversion tables of raw scores to reported scores, and tables of norms. Storage and transmission of test information should satisfy existing legal and professional standards.

Standard 9.23

Test users should remind all test takers, including those taking electronically administered tests, and others who have access to test materials that copyright policies and regulations may prohibit the disclosure of test items without specific authorization.

Comment: In some cases, information on copyrights and prohibitions on the disclosure of test items are provided in written form or verbally as part of the procedure prior to beginning the test or as part of the administration procedures. However, even in cases where this information is not a formal part of the test administration, if materials are copyrighted, test users should inform test takers of their responsibilities in this area.
PART III
Testing Applications

10. PSYCHOLOGICAL TESTING AND ASSESSMENT
BACKGROUND
This chapter addresses issues important to professionals who use psychological tests to assess individuals. Topics covered in this chapter include test selection and administration, test score interpretation, use of collateral information in psychological testing, types of tests, and purposes of psychological testing. The types of psychological tests reviewed in this chapter include cognitive and neuropsychological, problem behavior, family and couples, social and adaptive behavior, personality, and vocational. In addition, the chapter includes an overview of five common uses of psychological tests: for diagnosis; neuropsychological evaluation; intervention planning and outcome evaluation; judicial and governmental decisions; and personal awareness, social identity, and psychological health, growth, and action. The standards in this chapter are applicable to settings where in-depth assessment of people, individually or in groups, is conducted. Psychological tests are used in several other contexts as well, most notably in employment and educational settings. Tests designed to measure specific job-related characteristics across multiple candidates for selection purposes are treated in the text and standards of chapter 11; tests used in educational settings are addressed in depth in chapter 12.

It is critical that professionals who use tests to conduct assessments of individuals have knowledge of educational, linguistic, national, and cultural factors as well as physical capabilities that influence (a) a test taker's development, (b) the methods for obtaining and conveying information, and (c) the planning and implementation of interventions. Therefore, readers are encouraged to review chapter 3, which discusses fairness in testing; chapter 8, which focuses on rights of test takers; and chapter 9, which focuses on rights and responsibilities of test users. In chapters 1, 2, 4, 5, 6, and 7, readers will find important additional detail on validity; on reliability and precision; on test development; on scaling and equating; on test administration, scoring, reporting, and interpretation; and on supporting documentation.

The use of psychological tests provides one approach to collecting information within the larger framework of a psychological assessment of an individual. Typically, psychological assessments involve an interaction between a professional, who is trained and experienced in testing, the test taker, and a client who may be the test taker or another party. The test taker may be a child, an adolescent, or an adult. The client usually is the person or agency that arranges for the assessment. Clients may be patients, counselees, parents, children, employees, employers, attorneys, students, government agencies, or other responsible parties. The settings in which psychological tests or inventories are used include (but are not limited to) preschools; elementary, middle, and secondary schools; colleges and universities; pre-employment settings; hospitals; prisons; mental health and health clinics; and other professionals' offices.

The tasks involved in a psychological assessment—collecting, evaluating, integrating, and reporting salient information relevant to the aspects of a test taker's functioning that are under examination—comprise a complex and sophisticated set of professional activities. A psychological assessment is conducted to answer specific questions about a test taker's psychological functioning or behavior during a particular time interval or to predict an aspect of a test taker's psychological functioning or behavior in the future. Because test scores characteristically are interpreted in the context of other information about the test taker, an individual psychological assessment usually also includes interviewing the test taker; observing the test taker's behavior in the appropriate setting; reviewing educational, health, psychological, and other relevant records; and integrating these findings with other information that may be provided by third parties. The results from tests and inventories used in psychological assessments may help the professional to understand test takers more fully and to develop more informed and accurate hypotheses, inferences, and decisions about aspects of the test taker's psychological functioning or appropriate interventions.

The interpretation of test and inventory scores can be a valuable part of the assessment process and, if used appropriately, can provide useful information to test takers as well as to other users of the test interpretation. For example, the results of tests and inventories may be used to assess the psychological functioning of an individual; to assign diagnostic classification; to detect and characterize neuropsychological impairment, developmental delays, and learning disabilities; to determine the validity of a symptom; to assess cognitive and personality strengths or mental health and emotional behavior problems; to assess vocational interests and values; to determine developmental stages; to assist in health decision making; or to evaluate treatment outcomes. Test results also may provide information used to make decisions that have a powerful and lasting impact on people's lives (e.g., vocational and educational decisions; diagnoses; treatment plans, including plans for psychopharmacological intervention; intervention and outcome evaluations; health decisions; disability determinations; decisions on parole sentencing, civil commitment, child custody, and competency to stand trial; personal injury litigation; and death penalty decisions).

Test Selection and Administration

The selection and administration of psychological tests and inventories often is individualized for each participant. However, in some settings predetermined tests may be taken by all participants, and interpretations of results may be provided in a group setting.

The assessment process begins by clarifying, as much as possible, the reasons why a test taker will be assessed. Guided by these reasons or other relevant concerns, the tests, inventories, and diagnostic procedures to be used are selected, and other sources of information needed to evaluate the test taker are identified. Preliminary findings may lead to the selection of additional tests. The professional is responsible for being familiar with the evidence of validity for the intended uses of scores from the tests and inventories selected, including computer-administered or online tests. Evidence of the reliability/precision of scores, and the availability of applicable normative data in the test's accumulated research literature, also should be considered during test selection. In the case of tests that have been revised, editions currently supported by the publisher usually should be selected. On occasion, use of an earlier edition of an instrument is appropriate (e.g., when longitudinal research is conducted, or when an earlier edition contains relevant subtests not included in a later edition). In addition, professionals are responsible for guarding against reliance on test scores that are outdated; in such cases, retesting is appropriate. In international applications, it is especially important to verify that the construct being assessed has equivalent meaning across international borders and cultural contexts.

Validity and reliability/precision considerations are paramount, but the demographic characteristics of the group(s) for which the test originally was constructed and for which initial and subsequent normative data are available also are important test selection considerations. Selecting a test with demographically and clinically appropriate normative groups relevant for the test taker and for the purpose of the assessment is important for the generalizability of the inferences that the professional seeks to make. Applying a test constructed for one group to other groups may not be appropriate, and score interpretations, if the test is used, should be qualified and presented as hypotheses rather than conclusions.

Tests and inventories that meet high technical standards of quality are a necessary but not a sufficient condition for the responsible administration and scoring of tests and interpretation and use of test scores. A professional conducting a psychological assessment must complete the appropriate education and training, acquire appropriate credentials, adhere to professional ethical guidelines, and possess a high degree of professional judgment and scientific knowledge.

Professionals who oversee testing and assessment should be thoroughly versed in proper test administration procedures. They are responsible for ensuring that all persons who administer and score tests have received the appropriate education and training needed to perform their assigned tasks. Test administrators should administer tests in the manner that the test manuals indicate and should adhere to ethical and professional standards. The education and experience necessary to administer group tests and/or to proctor computer-administered tests generally are less extensive than the qualifications necessary to administer and interpret scores from individually administered tests that require interactions between the test taker and the test administrator. In many situations where complex behavioral observations are required, the use of a nonprofessional to administer or score tests may be inappropriate. Prior to beginning the assessment process, the test taker or a responsible party acting on the test taker's behalf (e.g., parent, legal guardian) should understand who will have access to the test results and the written report, how test results will be shared with the test taker, and whether and when decisions based on the test results will be shared with the test taker and/or a third party or the public (e.g., in court proceedings).

Test administrators must be aware of any personal limitations that affect their ability to administer and score the test fairly and accurately. These limitations may include physical, perceptual, and cognitive factors. Some tests place considerable demands on the test administrator (e.g., recording responses rapidly, manipulating equipment, or performing complex item scoring during administration). Test administrators who cannot comfortably meet these demands should not administer such tests. For tests that require oral instructions prior to or during administration, test administrators should be sure that there are no barriers to being clearly understood by test takers.

When using a battery of tests, the professional should determine the appropriate order of tests to be administered. For example, when administering cognitive and neuropsychological tests, some professionals first administer tests to assess basic domains (e.g., attention) and end with tests to assess more complex domains (e.g., executive functions). Professionals also are responsible for establishing testing conditions that are appropriate to the test taker's needs and abilities. For example, the examiner may need to determine if the test taker is capable of reading at the level required and if vision, hearing, psychomotor, or clinical impairments or neurological deficits are adequately accommodated. Chapter 3 addresses access considerations and standards in detail.

Standardized administration is not required for all tests but is important for the interpretation of test scores for many tests and purposes. In those situations, standardized test administration procedures should be followed. When nonstandard administration procedures are needed or allowed, they should be described and justified. The interpreter of the test results should be informed if the test was unproctored or if it was administered under nonstandardized procedures. In some circumstances, test administration may provide the opportunity for skilled examiners to carefully observe the performance of test takers under standardized conditions. For example, the test administrators' observations may allow them to record behaviors being assessed, to understand the manner in which test takers arrived at their answers, to identify test-taker strengths and weaknesses, and to make modifications in the testing process. If tests are administered by computer or other technological devices or online, the professional is responsible for determining if the purpose of the assessment and the capabilities of the test taker require the presence of a proctor or support staff (e.g., to assist with the use of the computer equipment or software). Also, some computer-administered tests may require giving the test taker the opportunity to receive instructions and to practice prior to the test administration. Chapters 4 and 6 provide additional detail on technologically administered tests.

Inappropriate effort on the part of the person being assessed may affect the results of psychological assessment and may introduce error into the measurement of the construct in question. Therefore,
153
CHAPTER 10
in some cases, the importance of expending appropriate effort when taking the test should be explained to the test taker. For many tests, measures of effort can be derived from stand-alone tests or from responses embedded within a standard assessment procedure (e.g., increased numbers of errors, inconsistent responding, and unusual responses relevant to symptom patterns), and effort may be measured throughout the assessment process. When low levels of effort and motivation are evident during the test administration, continuing an evaluation may result in inappropriate score interpretations.

Professionals are responsible for protecting the confidentiality and security of the test results and the testing materials. Storage and transmission of this type of information should satisfy relevant professional and legal standards.

Test Score Interpretation

Test scores used in psychological assessment ideally are interpreted in light of a number of factors, including the available normative data appropriate to the characteristics of the test taker, the psychometric properties of the test, indicators of effort, the circumstances of the test taker at the time the test is given, the temporal stability of the constructs being measured, and the effects of moderator variables and demographic characteristics on test results. The professional rarely has the resources available to personally conduct the research or to assemble representative norms that, in some types of assessment, might be needed to make accurate inferences about each individual test taker's past, current, and future functioning. Therefore, the professional may need to rely on the research and the body of scientific knowledge available for the test that support appropriate inferences. Presentation of validity and reliability/precision evidence often is not needed in the written report summarizing the findings of the assessment, but the professional should strive to understand, and be prepared to articulate, such evidence as the need arises.

When making inferences about a test taker's past, present, and future behaviors and other characteristics from test scores, the professional should consider other available data that support or challenge the inferences. For example, the professional should review the test taker's history and information about past behaviors, as well as the relevant literature, to develop familiarity with supporting evidence. At times, the professional also should corroborate results from one testing session with results from other tests and testing sessions to address reliability/precision and validity of the inferences made about the test taker's performance across time and/or tests. Triangulation of multiple sources of information—including stylistic and test-taking behaviors inferred from observation during the test administration—may strengthen confidence in the inference. Importantly, data that are not supportive of the inferences should be acknowledged and either reconciled with other information or noted as a limitation to the confidence placed in the inference. When there is strong evidence for the reliability/precision and validity of the scores for the intended uses of a test and strong evidence for the appropriateness of the test for the test taker being assessed, then the professional's ability to draw appropriate inferences increases. When an inference is based on a single study or based on several studies whose samples are of limited generalizability to the test taker, then the professional should be more cautious about the inference and note in the report limitations regarding conclusions drawn from the inference.

Threats to the interpretability of obtained scores are minimized by clearly defining how particular psychological tests are to be used. These threats occur as a result of construct-irrelevant variance (i.e., aspects of the test and the testing process that are not relevant to the purpose of the test scores) and construct underrepresentation (i.e., failure of the test to account for important facets relevant to the purpose of the testing). Response bias and faking are examples of construct-irrelevant components that may significantly skew the obtained scores, possibly resulting in inaccurate or misleading interpretations. In situations where response bias or faking is anticipated, professionals may choose a test that has scales (e.g., percentage of "yes" answers, percentage of "no" answers; "faking good," "faking bad") that clarify the threats
PSYCHOLOGICAL TESTING AND ASSESSMENT
to validity. In so doing, the professionals may be able to assess the degree to which test takers are acquiescing to the perceived demands of the test administrator or attempting to portray themselves as impaired by "faking bad," or as well functioning by "faking good."

For some purposes, including career counseling and neuropsychological assessment, batteries of tests are frequently used. For example, career counseling batteries may include tests of abilities, values, interests, and personality. Neuropsychological batteries may include measures of orientation, attention, communication skills, executive function, fluency, visual-motor and visual-spatial skills, problem solving, organization, memory, intelligence, academic achievement, and/or personality, along with tests of effort. When psychological test batteries incorporate multiple methods and scores, patterns of test results frequently are interpreted as reflecting a construct or even an interaction among constructs underlying test performance. Interactions among the constructs underlying configurations of test outcomes may be postulated on the basis of test score patterns. The literature reporting evidence of reliability/precision and validity of configurations of scores that supports the proposed interpretations should be identified when possible. However, it is understood that little, if any, literature exists that describes the validity of interpretations of scores from highly customized or flexible batteries of tests. The professional should recognize that variability in scores on different tests within a battery commonly occurs in the general population, and should use base rate data, when available, to determine whether the observed variability is exceptional. If the literature is incomplete, the resulting inferences may be presented with the qualification that they are hypotheses for future verification rather than probabilistic statements regarding the likelihood of some behavior that imply some known validity evidence.

Collateral Information Used in Psychological Testing and Assessment

Test scores that are used as part of a psychological assessment are best interpreted in the context of the test taker's personal history and other relevant traits and personal characteristics. The quality of interpretations made from psychological tests and assessments often can be enhanced by obtaining credible collateral information from various third-party sources, such as significant others, teachers, health professionals, and school, legal, military, and employment records. The quality of collateral information is enhanced by using various methods to acquire it. Structured behavioral observations, checklists, ratings, and interviews are a few of the methods that may be used, along with objective test scores, to minimize the need for the scorer to rely on individual judgment. For example, an evaluation of career goals may be enhanced by obtaining a history of employment as well as by administering tests to assess academic aptitude and achievement, vocational interests, work values, personality, and temperament. The availability of information on multiple traits or attributes, when acquired from various sources and through the use of various methods, enables professionals to assess more accurately an individual's psychosocial functioning and facilitates more effective decision making. When using collateral data, the professional should take steps to ascertain their accuracy and reliability, especially when the data come from third parties who may have a vested interest in the outcome of the assessment.

Types of Psychological Testing and Assessment

For purposes of this chapter, the types of psychological tests have been divided into six categories: cognitive and neuropsychological tests; problem behavior tests; family and couples tests; social and adaptive behavior tests; personality tests; and vocational tests.

Cognitive and Neuropsychological Testing and Assessment

Tests often are used to assess various classes of cognitive and neuropsychological functioning, including intelligence, broad ability domains, and more focused domains (e.g., abstract reasoning and categorical thinking; academic achievement; attention; cognitive ability; executive function;
language; learning and memory; motor and sensorimotor functions and lateral preferences; and perception and perceptual organization/integration). Overlap may occur in the constructs that are assessed by tests of differing functions or domains. In common with other types of tests, cognitive and neuropsychological tests require a minimally sufficient level of test-taker capacity to maintain attention as well as appropriate effort. For example, when administering cognitive and neuropsychological tests, some professionals first administer tests to assess basic domains (e.g., attention) and end with administration of tests to assess more complex domains (e.g., executive function).

Abstract reasoning and categorical thinking. Tests of reasoning and thinking measure a broad array of skills and abilities, including the examinee's ability to infer relationships, to form new concepts or strategies, to respond to changing environmental circumstances, and to act in goal-oriented situations, as well as the ability to understand a problem or a concept, to develop a strategy to solve that problem, and, as necessary, to alter such concepts or strategies as situations vary.

Academic achievement. Academic achievement tests are measures of knowledge and skills that a person has acquired in formal and informal learning situations. Two major types of academic achievement tests include general achievement batteries and diagnostic achievement tests. General achievement batteries are designed to assess a person's level of learning in multiple areas (e.g., reading, mathematics, and spelling). In contrast, diagnostic achievement tests typically focus on one subject area (e.g., reading) and assess an academic skill in greater detail. Test results are used to determine the test taker's strengths and may also help identify sources of academic difficulties or deficiencies. Chapter 12 provides additional detail on academic achievement testing in educational settings.

Attention. Attention refers to a domain that encompasses the constructs of arousal, establishment of sets, strategic deployment of attention, sustained attention, divided attention, focused attention, selective attention, and vigilance. Tests may measure (a) levels of alertness, orientation, and localization; (b) the ability to focus, shift, and maintain attention and to track one or more stimuli under various conditions; (c) span of attention; and (d) short-term information storage functioning. Scores for each aspect of attention that have been examined should be reported individually so that the nature of an attention disorder can be clarified.

Cognitive ability. Measures designed to quantify cognitive abilities are among the most widely administered tests. The interpretation of results from a cognitive ability test is guided by the theoretical constructs used to develop the test. Some cognitive ability assessments are based on results from multidimensional test batteries that are designed to assess a broad range of skills and abilities. Test results are used to draw inferences about a person's overall level of intellectual functioning and about strengths and weaknesses in various cognitive abilities, and to diagnose cognitive disorders.

Executive function. This class of functions is involved in the organized performances (e.g., cognitive flexibility, inhibitory control, multitasking) that are necessary for the independent, purposive, and effective attainment of goals in various cognitive-processing, problem-solving, and social situations. Some tests emphasize (a) reasoned plans of action that anticipate consequences of alternative solutions, (b) motor performance in problem-solving situations that require goal-oriented intentions, and/or (c) regulation of performance for achieving a desired outcome.

Language. Language deficiencies typically are identified with assessments that focus on phonology, morphology, syntax, semantics, supralinguistics, and pragmatics. Various functions may be assessed, including listening, reading, and spoken and written language skills and abilities. Language disorder assessments focus on functional speech and verbal comprehension measured through oral, written, or gestural modes; lexical access and elaboration; repetition of spoken language; and associative
verbal fluency. If a multilingual person is assessed for a possible language disorder, the degree to which the disorder may be due more directly to developmental language issues (e.g., phonological, morphological, syntactic, semantic, or pragmatic delays; intellectual disabilities; peripheral, sensory, or central neurological impairment; psychological conditions; or sensory disorders) than to lack of proficiency in a given language must be addressed.

Learning and memory. This class of functions involves the acquisition, retention, and retrieval of information beyond the requirements of immediate or short-term information processing and storage. These tests may measure acquisition of new information through various sensory channels and by means of assorted test formats (e.g., word lists, prose passages, geometric figures, form-boards, digits, and musical melodies). Memory tests also may require retention and recall of old information (e.g., personal data as well as commonly learned facts and skills). In addition, testing of recognition of stored information may be used in understanding memory deficits.

Motor functions, sensorimotor functions, and lateral preferences. Motor functions (e.g., finger tapping) and sensory functions (e.g., tactile stimulation) are often measured as part of a comprehensive neuropsychological evaluation. Motor tests assess various aspects of movement, such as speed, dexterity, coordination, and purposeful movement. Sensory tests evaluate function in the areas of vision, hearing, touch, and sometimes smell. Testing also is done to examine the integration of perceptual and motor functions.

Perception and perceptual organization/integration. This class of functioning involves reasoning and judgment as they relate to the processing and elaboration of complex sensory combinations and inputs. Tests of perception may emphasize immediate perceptual processing but also may require conceptualizations that involve some reasoning and judgmental processes. Some tests have motor components ranging from making simple movements to building complex constructions. These tests assess activities ranging from perceptual speed, to choice reaction time, to complex information processing and visual-spatial reasoning.

Problem Behavior Testing and Assessment

Problem behaviors include behavioral adjustment difficulties that interfere with a person's effective functioning in daily life situations. Tests are used to assess the individual's behavior and self-perceptions for differential diagnosis and educational classification for a variety of emotional and behavioral disorders and to aid in the development of treatment plans. In some cases (e.g., death penalty evaluations), retrospective analysis is required, and multiple sources of information help provide the most comprehensive assessment possible. Observing a person in her or his environment often is helpful for understanding fully the specific demands of the environment, not only to offer a more comprehensive assessment but to provide more useful recommendations.

Family and Couples Testing and Assessment

Family testing addresses the issues of family dynamics, cohesion, and interpersonal relations among family members, including partners, parents, children, and extended family members. Tests developed to assess families and couples are distinguished by whether they measure the interaction patterns of partial or whole families, in both cases requiring simultaneous focus on two or more family members in terms of their transactions. Testing with couples may address factors such as issues of intimacy, compatibility, shared interests, trust, and spiritual beliefs.

Social and Adaptive Behavior Testing and Assessment

Measures of social and adaptive behaviors assess motivation and ability to care for oneself and relate to others. Social and adaptive behaviors are based on a repertoire of knowledge, skills, and abilities that enable a person to meet the daily demands and expectations of the environment, such as eating, dressing, working, participating in leisure activities, using transportation, interacting with peers, communicating with others, making purchases, managing money, maintaining a schedule, living independently, being socially responsive, and engaging in healthy behaviors.

Personality Testing and Assessment

The assessment of personality requires a synthesis of aspects of an individual's functioning that contribute to the formulation and expression of thoughts, attitudes, emotions, and behaviors. Some of these aspects are stable over time; others change with age or are situation specific. Cognitive and emotional functioning may be considered separately in assessing an individual, but their influences are interrelated. For example, a person whose perceptions are highly accurate, or who is relatively stable emotionally, may be able to control suspiciousness better than a person whose perceptions are inaccurate or distorted or who is emotionally unstable.

Scores or personality descriptors derived from a personality test may be regarded as reflecting the underlying theoretical constructs or empirically derived scales or factors that guided the test's construction. The stimulus-and-response formats of personality tests vary widely. Some include a series of questions (e.g., self-report inventories) to which the test taker is required to respond by choosing from multiple well-defined options; others involve being placed in a novel situation in which the test taker's response is not completely structured (e.g., responding to visual stimuli, telling stories, discussing pictures, or responding to other projective stimuli). Results may consist of themes, patterns, or diagnostic indicators, as well as scores. The responses are scored and combined into either logically or statistically derived dimensions established by previous research.

Personality tests may be designed to assess normal or abnormal attitudes, feelings, traits, and related characteristics. Tests intended to measure normal personality characteristics are constructed to yield scores reflecting the degree to which a person manifests personality dimensions empirically identified and hypothesized to be present in the behavior of most individuals. A person's configuration of scores on these dimensions is then used to infer how the person behaves presently and how she or he may behave in new situations. Test scores outside the expected range may be considered strong expressions of normal traits or may be indicative of psychopathology. Such scores also may reflect normal functioning of the person within a culture different from that of the population on which the norms are based.

Other personality tests are designed specifically to measure constructs underlying abnormal functioning and psychopathology. Developers of some of these tests use previously diagnosed individuals to construct their scales and base their interpretations on the association between the test's scale scores, within a given range, and the behavioral correlates of persons who scored within that range, as compared with clinical samples. If interpretations made from scores go beyond the theory that guided the test's construction, then evidence of the validity of the interpretations should be collected and analyzed from additional relevant data.

Vocational Testing and Assessment

Vocational testing generally includes the measurement of interests, work needs, and values, as well as consideration and assessment of related elements of career development, maturity, and indecision. Academic achievement and cognitive abilities, discussed earlier in the section on cognitive ability, also are important components in vocational testing and assessment. Results from these tests often are used to enhance personal growth and understanding and for career counseling, outplacement counseling, and vocational decision making. These interventions frequently take place in the context of educational and vocational rehabilitation. However, vocational testing may also be used in the workplace as part of corporate programs for career planning.

Interest inventories. The measurement of interests is designed to identify a person's preferences for various activities. Self-report interest inventories are widely used to assess personal preferences, including likes and dislikes for various work and leisure activities, school subjects, occupations, or types of people. The resulting scores may provide
insight into types and patterns of interests in educational curricula (e.g., college majors), in various fields of work (e.g., specific occupations), or in more general or basic areas of interests related to specific activities (e.g., sales, office practices, or mechanical activities).

uations; testing for intervention planning and outcome evaluation; testing for judicial and governmental decisions; and testing for personal awareness, social identity, and psychological health, growth, and action. However, these categories are not always mutually exclusive.
(e.g., functional capacity, degree of anxiety, amount of suspiciousness, openness to interpretations, amount of insight into behaviors, and level of intellectual functioning).

Diagnostic criteria may vary from one nomenclature system to another. Noting which nomenclature system is being used is an important initial step because different diagnostic systems may use the same diagnostic term to describe different symptoms. Even within one diagnostic system, the symptoms described by the same term may differ between editions of the manual. Similarly, a test that uses a diagnostic term in its title may differ significantly from another test using a similar title or from a subscale using the same term. For example, some diagnostic systems may define depression by behavioral symptomatology (e.g., psychomotor retardation, disturbance in appetite or sleep), by affective symptomatology (e.g., dysphoric feeling, emotional flatness), or by cognitive symptomatology (e.g., thoughts of hopelessness, morbidity). Further, rarely are the symptoms of diagnostic categories mutually exclusive. Hence, it can be expected that a given symptom may be shared by several diagnostic categories. More knowledgeable and precisely drawn inferences relating to a diagnosis may be obtained from test scores if appropriate weight is given to the symptoms included in the diagnostic category and to the suitability of each test for assessing the symptoms. Therefore, the first step in evaluating a test's suitability for yielding scores or information indicative of a particular diagnostic syndrome is to compare the construct that the test is intended to measure with the symptomatology described in the diagnostic criteria.

Different methods may be used to assess particular diagnostic categories. Some methods rely primarily on structured interviews using a "yes"/"no" or "true"/"false" format, in which the professional is interested in the presence or absence of diagnosis-specific symptomatology. Other methods often rely principally on tests of personality or cognitive functioning and use configurations of obtained scores. These configurations of scores indicate the degree to which a test taker's responses are similar to those of individuals who have been determined by prior research to belong to a specific diagnostic group.

Diagnoses made with the help of test scores typically are based on empirically demonstrated relationships between the test score and the diagnostic category. Validity studies that demonstrate relationships between test scores and diagnostic categories currently are available for some, but not all, diagnostic categories. Many more studies demonstrate evidence of validity for the relations between test scores and various subsets of symptoms that contribute to a diagnostic category. Although it often is not feasible for individual professionals to personally conduct research into relationships between obtained scores and diagnostic categories, familiarity with the research literature that examines these relationships is important.

The professional often can enhance the diagnostic interpretations derived from test scores by integrating the test results with inferences made from other sources of information regarding the test taker's functioning, such as self-reported history, information provided by significant others, or systematic observations in the natural environment or in the testing setting. In arriving at a diagnosis, a professional also looks for information that does not corroborate the diagnosis, and in those instances, places appropriate limits on the degree of confidence placed in the diagnosis. When relevant to a referral decision, the professional should acknowledge alternative diagnoses that may require consideration. Particular attention should be paid to all relevant available data before concluding that a test taker falls into a diagnostic category. Cultural competency is paramount in the effort to avoid misdiagnosing or overpathologizing culturally appropriate behavior, affect, or cognition. Tests also are used to assess the appropriateness of continuing the initial diagnosis, especially after a course of treatment or if the client's psychological functioning has changed over time.

Testing for Neuropsychological Evaluations

Neuropsychological testing analyzes the test taker's current psychological and behavioral status, including manifestations of neurological, neuropathological, and neurochemical changes that may arise during
seeking to obtain the greatest possible monetary award for a personal injury may be motivated to exaggerate cognitive and emotional symptoms, whereas persons attempting to forestall the loss of a professional license may attempt to portray themselves in the best possible light by minimizing symptoms or deficits. In forming an assessment opinion, it is necessary to interpret the test scores with informed knowledge relating to the available validity and reliability evidence. When forming such opinions, it also is necessary to integrate a test taker's test scores with all other sources of information that bear on the test taker's current status, including psychological, health, educational, occupational, legal, sociocultural, and other relevant collateral records.

Some tests are intended to provide information about a client's functioning that helps clarify a given legal issue (e.g., parental functioning in a child custody case or a defendant's ability to understand charges in hearings on competency to stand trial). The manuals of some tests also provide demographic and actuarial data for normative groups that are representative of persons involved in the legal system. However, many tests measure constructs that are generally relevant to the legal issues even though norms specific to the judicial or governmental context may not be available. Professionals are expected to make every effort to be aware of evidence of validity and reliability/precision that supports or does not support their interpretations and to place appropriate limits on the opinions rendered. Test users who practice in judicial and governmental settings are expected to be aware of conflicts of interest that may lead to bias in the interpretation of test results.

Protecting the confidentiality of a test taker's test results and of the test instrument itself poses particular challenges for professionals involved with attorneys, judges, jurors, and other legal decision makers. The test taker has the right to expect that test results will be communicated only to persons who are legally authorized to receive them and that other information from the testing session that is not relevant to the evaluation will not be reported. The professional should be apprised of possible threats to confidentiality and test security (e.g., releasing the test questions, the examinee's responses, or raw or standardized scores on tests to another qualified professional) and should seek, if necessary, appropriate legal and professional remedies.

Testing for Personal Awareness, Social Identity, and Psychological Health, Growth, and Action

Tests and inventories frequently are used to provide information to help individuals understand themselves, identify their own strengths and weaknesses, and clarify issues important to their own development. For example, test results from personality inventories may help test takers better understand themselves and their interactions with others. Measures of ethnic identity and acculturation—two components of social identity—that assess the cognitive, affective, and behavioral facets of the ways in which people identify with their cultural backgrounds also may be informative.

Psychological tests are used sometimes to assess an individual's ability to understand and adapt to health conditions. In these instances, observations and checklists, as well as tests, are used to measure the understanding that an individual with a health condition (e.g., diabetes) has about the disease process and about behavioral and cognitive techniques applicable to the amelioration or control of the symptoms of the disease state.

Results from interest inventories and tests of ability may be useful to individuals who are making educational and career decisions. Appropriate cognitive and neuropsychological tests that have been normed and standardized for children may facilitate the monitoring of development and growth during the formative years, when relevant interventions may be more efficacious for recognizing and preventing potentially disabling learning difficulties. Test scores for young adults or children on these types of measures may change in later years; therefore, test users should be cautious about overreliance on results that may be outdated.

Test results may be used in several ways for self-exploration, growth, and decision making. First, the results can provide individuals with new information that allows them to compare themselves with others or to evaluate themselves by focusing
on self-descriptions and self-characterizations. Test results may also serve to stimulate discussions between test taker and professional, to facilitate test-taker insights, to provide directions for future treatment considerations, to help individuals identify strengths and weaknesses, and to provide the professional with a general framework for organizing and integrating information about an individual. Testing for personal growth may take place in training and development programs, within an educational curriculum, during psychotherapy, in rehabilitation programs as part of an educational or career-planning process, or in other situations.

Summary

The responsible use of tests in psychological practice requires a commitment by the professional to develop and maintain the necessary knowledge and competence to select, administer, and interpret tests and inventories as crucial elements of the psychological testing and assessment process (see chap. 9). The standards in this chapter provide a framework for guiding the professional toward achieving relevance and effectiveness in the use of psychological tests within the boundaries or limits defined by the professional's educational, experiential, and ethical foundations. Earlier chapters and standards that are relevant to psychological testing and assessment describe general aspects of test quality (chaps. 1 and 2), fairness (chap. 3), test design and development (chap. 4), and test administration (chap. 6). Chapter 11 discusses test uses for the workplace, including credentialing, and the importance of collecting data that provide evidence of a test's accuracy for predicting job performance; chapter 12 discusses educational applications; and chapter 13 discusses test use in program evaluation and public policy.
CHAPTER 10
procedures may have on the test taker's obtained score and the interpretation of that score. When using tests that employ an unstructured response format, such as some projective tests, the professional should follow the administration instructions provided and apply objective scoring criteria when available and appropriate.

In some cases, testing may be conducted in a realistic setting to determine how a test taker responds in these settings. For example, an assessment for an attention disorder may be conducted in a noisy or distracting environment rather than in an environment that typically protects the test taker from such external threats to performance efficiency.

Standard 10.9

Professionals should take into account the purpose of the assessment, the construct being measured, and the capabilities of the test taker when deciding whether technology-based administration of tests should be used.

Comment: Quality control should be integral to the administration of computerized or technology-based tests. Some technology-based tests may require that test takers have an opportunity to receive instruction and to practice prior to the test administration, unless assessing ability to use the equipment is the purpose of the test. The professional is responsible for determining whether the technology-based administration of the test should be proctored, or whether technical support staff are necessary to assist with the use of the test equipment and software. The interpreter of the test scores should be informed if the test was unproctored or if no support staff were available.

Cluster 4. Test Interpretation

Standard 10.10

Those who select tests and interpret test results should not allow individuals or groups with vested interests in the outcomes of an assessment to have an inappropriate influence on the interpretation of the assessment results.

Comment: Individuals or groups with a vested interest in the significance or meaning of the findings from psychological testing may include but are not limited to employers, health professionals, legal representatives, school personnel, third-party payers, and family members. In some instances, legal requirements may limit a professional's ability to prevent inappropriate interpretations of assessments from affecting decisions, but professionals have an obligation to document any disagreement in such circumstances.

Standard 10.11

Professionals should share test scores and interpretations with the test taker when appropriate or required by law. Such information should be expressed in language that the test taker or, when appropriate, the test taker's legal representative, can understand.

Comment: Test scores and interpretations should be expressed in terms that can be understood readily by the test taker or others entitled to the results. In most instances, a report should be generated and made available to the referral source. That report should adhere to standards required by the profession and/or the referral source, and the information should be documented in a manner that is understandable to the referral source. In some clinical situations, providing feedback to the test taker may actually cause harm. Care should be taken to minimize unintended consequences of test feedback. Any disclosure of test results to an individual or any decision not to release such results should be consistent with applicable legal standards, such as privacy laws.

Standard 10.12

In psychological assessment, the interpretation of test scores or patterns of test battery results should consider other factors that may influence a particular testing outcome. Where appropriate,
measures of interests and personality styles are packaged together, then supporting reliability/precision and validity data for such combinations of the test scores and interpretations should be available.

Standard 10.17

Those who use computer-generated interpretations of test data should verify that the quality of the evidence of validity is sufficient for the interpretations.

Comment: Efforts to reduce a complex set of data into computer-generated interpretations of a given construct may yield misleading or oversimplified analyses of the meanings of test scores, which in turn may lead to faulty diagnostic and prognostic decisions. Norms on which the interpretations are based should be reviewed for their relevance and appropriateness.

Cluster 5. Test Security

Standard 10.18

Professionals and others who have access to test materials and test results should maintain the confidentiality of the test results and testing materials consistent with scientific, professional, legal, and ethical requirements. Tests (including obsolete versions) should not be made available to the public or resold to unqualified test users.

Comment: Professionals should be knowledgeable about and should conform to record-keeping and confidentiality guidelines required by applicable federal law and within the jurisdictions where they practice, as well as guidelines of the professional organizations to which they belong. The test publisher, the test user, the test taker, and third parties (e.g., school, court, employer) may have different levels of understanding or recognition of the need for confidentiality of test materials. To the extent possible, the professional who uses tests is responsible for managing the confidentiality of test information across all parties. It is important for the professional to be aware of possible threats to confidentiality and the legal and professional remedies available. Professionals also are responsible for maintaining the security of testing materials and respecting the copyrights of all tests. Distribution, display, or resale of test materials (including obsolete editions) to unauthorized recipients infringes the copyright of the materials and compromises test security. When it is necessary to reveal test content in the process of explaining results or in a court proceeding, this should happen in a controlled environment. When possible, copies of the content should not be distributed, or should be distributed in a manner that protects test security to the extent possible.
11. WORKPLACE TESTING AND
CREDENTIALING
BACKGROUND
Organizations use employment testing for many purposes, including employee selection, placement, and promotion. Selection generally refers to decisions about which individuals will enter the organization; placement refers to decisions about how to assign individuals to positions within the organization; and promotion refers to decisions about which individuals within the organization will advance. What all three have in common is a focus on the prediction of future job behaviors, with the goal of influencing organizational outcomes such as efficiency, growth, productivity, and employee motivation and satisfaction.

Testing used in the processes of licensure and certification, which will here be generically called credentialing, focuses on an applicant's current skill or competence in a specified domain. In many occupations, individual practitioners must be licensed by governmental agencies. In other occupations, it is professional societies, employers, or other organizations that assume responsibility for credentialing. Although licensure typically involves provision of a credential for entry into an occupation, credentialing programs may exist at various levels, from novice to expert in a given field. Certification is usually sought voluntarily, although occupations differ in the degree to which obtaining certification influences employability or advancement. The credentialing process may include testing and other requirements, such as education or supervised experiences. The Standards applies to the use of tests as a component of the broader credentialing process.

Testing is also conducted in workplaces for a variety of purposes other than staffing decisions and credentialing. Testing as a tool for personal growth can be part of training and development programs, in which instruments measuring personality characteristics, interests, values, preferences, and work styles are commonly used with the goal of providing self-insight to employees. Testing can also take place in the context of program evaluation, as in the case of an experimental study of the effectiveness of a training program, where tests may be administered as pre- and post-measures. Some assessments conducted in employment settings, such as unstructured job interviews for which no claim of predictive validity is made, are nonstandardized in nature, and it is generally not feasible to apply standards to such assessments. The focus of this chapter, however, is on the use of testing specifically in staffing decisions and credentialing. Many additional issues relevant to uses of testing in organizational settings are discussed in other chapters: technical matters in chapters 1, 2, 4, and 5; documentation in chapter 7; and individualized psychological and personality assessment of job candidates in chapter 10.

As described in chapter 3, the ideal of fairness in testing is achieved if a given test score has the same meaning for all individuals and is not substantially influenced by construct-irrelevant barriers to individuals' performance. For example, a visually impaired person may have difficulty reading questions on a personality inventory or other vocational assessment provided in small print. Young people just entering the workforce may be less sophisticated in test-taking strategies than more experienced job applicants, and their scores may suffer. A person unfamiliar with computer technology may have difficulty with the user interface for a computer simulation assessment. In each of these cases, performance is hindered by a source of variance that is unrelated to the construct of interest. Sound testing practice involves careful monitoring of all aspects of the assessment process and appropriate action when needed to prevent undue disadvantages or advantages for some candidates caused by factors unrelated to the construct being assessed.
advance. The key question is whether advance knowledge of test content affects candidates' performance unfairly and consequently changes the constructs measured by the test and the validity of inferences based on the scores.

Fixed applicant pool versus continuous flow. In some instances, an applicant pool can be assembled prior to beginning the selection process, as when an organization's policy is to consider all candidates who apply before a specific date. In other cases, there is a continuous flow of applicants about whom employment decisions need to be made on an ongoing basis. Ranking of candidates is possible in the case of the fixed pool; in the case of a continuous flow, a decision may need to be made about each candidate independent of information about other candidates.

Small versus large sample size. Sample size affects the degree to which different lines of evidence can be used to examine validity and fairness of interpretations of test scores for proposed uses of tests. For example, relying on the local setting to establish empirical linkages between test and criterion scores is not technically feasible with small sample sizes. In employment testing, sample sizes are often small; at the extreme is a job with only a single incumbent. Large sample sizes are sometimes available when there are many incumbents for the job, when multiple jobs share similar requirements and can be pooled, or when organizations with similar jobs collaborate in developing a selection system.

A new job. A special case of the problem of small sample size exists when a new job is created and there are no job incumbents. As new jobs emerge, employers need selection procedures to staff the new positions. Professional judgment may be used to identify appropriate employment tests and provide a rationale for the selection program even though the array of methods for documenting validity may be restricted. Although validity evidence based on criterion-oriented studies can rarely be assembled prior to the creation of a new job, the methods for generalizing validity evidence in situations with small sample sizes can be used (see the discussion on page 173 concerning settings with small samples), as well as content-oriented studies using the subject matter experts responsible for designing the job.

Size of applicant pool relative to the number of job openings. The size of an applicant pool can constrain the type of testing system that is feasible. For desirable jobs, very large numbers of candidates may compete, and short screening tests may be used to reduce the pool to a size for which the administration of more time-consuming and expensive tests is practical. Large applicant pools may also pose test security concerns, limiting the organization to testing methods that permit simultaneous test administration to all candidates.

Thus, test use by employers is conditioned by contextual features. Knowledge of these features plays an important part in the professional judgment that will influence both the types of testing system developed and the strategies used to evaluate critically the validity of interpretations of test scores for proposed uses of the tests.

The Validation Process in Employment Testing

The validation process often begins with a job analysis in which information about job duties and tasks, responsibilities, worker characteristics, and other relevant information is collected. This information provides an empirical basis for articulating what is meant by job performance in the job under consideration, for developing measures of job performance, and for hypothesizing characteristics of individuals that may be predictive of performance.

The fundamental inference to be drawn from test scores in most applications of testing in employment settings is one of prediction: The test user wishes to make an inference from test results to some future job behavior or job outcome. Even when the validation strategy used does not involve empirical predictor-criterion linkages, as in the case of validity evidence based on test content, there is an implied criterion. Thus, although different strategies for gathering evidence
may be used, the inference to be supported is that scores on the test can be used to predict subsequent job behavior. The validation process in employment settings involves the gathering and evaluation of evidence relevant to sustaining or challenging this inference. As detailed below and in chapter 1 (in the section "Evidence Based on Relations to Other Variables"), a variety of validation strategies can be used to support the inference.

It follows that establishing this predictive inference requires attention to two domains: that of the test (the predictor) and that of the job behavior or outcome of interest (the criterion). Evaluating the use of a test for an employment decision can be viewed as testing the hypothesis of a linkage between these domains. Operationally, there are many ways of linking these domains, as illustrated by the diagram below.

[Diagram: a predictor measure and a criterion measure (top row) and a predictor construct domain and a criterion construct domain (bottom row), connected by numbered linkages. Caption: Alternative links between predictor and criterion measures.]

The diagram differentiates between a predictor construct domain and a predictor measure, and between a criterion construct domain and a criterion measure. A predictor construct domain is defined by specifying the set of behaviors, knowledge, skills, abilities, traits, dispositions, and values that will be included under particular construct labels (e.g., verbal reasoning, typing speed, conscientiousness). Similarly, a criterion construct domain specifies the set of job behaviors or job outcomes that will be included under particular construct labels (e.g., performance of core job tasks, teamwork, attendance, sales volume, overall job performance). Predictor and criterion measures are intended to assess an individual's standing on the characteristics assessed in those domains.

The diagram enumerates inferences about a number of linkages that are commonly of interest. The first linkage (labeled 1 in the diagram) is between scores on a predictor measure and scores on a criterion measure. This inference is tested through empirical examination of relationships between the two measures. The second and fourth linkages (labeled 2 and 4) are conceptually similar: Both examine the relationship of an operational measure to the construct domain of interest. Logical analysis, expert judgment, and convergence with or divergence from conceptually similar or different measures are among the forms of evidence that can be examined in testing these linkages. Linkage 3 involves the relationship between the predictor construct domain and the criterion construct domain. This inferred linkage is established on the basis of theoretical and logical analysis. It commonly draws on systematic evaluation of job content and expert judgment as to the individual characteristics linked to successful job performance. Linkage 5 examines a direct relationship of the predictor measure to the criterion construct domain.

Some predictor measures are designed explicitly as samples of the criterion construct domain of interest; thus, isomorphism between the measure and the construct domain constitutes direct evidence for linkage 5. Establishing linkage 5 in this fashion is the hallmark of approaches that rely heavily on what the Standards refers to as validity evidence based on test content. Tests in which candidates for lifeguard positions perform rescue operations, or in which candidates for word processor positions type and edit text, provide examples of test content that forms the basis for validity.

A prerequisite to the use of a predictor measure for personnel selection is that the inferences concerning the linkage between the predictor measure and the criterion construct domain be established. As the diagram illustrates, there are multiple strategies for establishing this crucial linkage. One strategy is direct, via linkage 5; a second involves
pairing linkage 1 and linkage 4; and a third involves pairing linkage 2 and linkage 3.

When the test is designed as a sample of the criterion construct domain, the validity evidence can be established directly via linkage 5. Another strategy for linking a predictor measure and the criterion construct domain focuses on linkages 1 and 4: pairing an empirical link between the predictor and criterion measures with evidence of the adequacy with which the criterion measure represents the criterion construct domain. The empirical link between the predictor measure and the criterion measure is part of what the Standards refers to as validity evidence based on relationships to other variables. The empirical link of the test and the criterion measure must be supplemented by evidence of the relevance of the criterion measure to the criterion construct domain to complete the linkage between the test and the criterion construct domain. Evidence of the relevance of the criterion measure to the criterion construct domain is commonly based on job analysis, although in some cases the link between the domain and the measure is so direct that relevance is apparent without job analysis (e.g., when the criterion construct of interest is absenteeism or turnover). Note that this strategy does not necessarily rely on a well-developed predictor construct domain. Predictor measures such as empirically keyed biodata measures are constructed on the basis of empirical links between test item responses and the criterion measure of interest. Such measures may, in some instances, be developed without a fully established conception of the predictor construct domain; the basis for their use is the direct empirical link between test responses and a relevant criterion measure. Unless sample sizes are very large, capitalization on chance may be a problem, in which case appropriate steps should be taken (e.g., cross-validation).

Yet another strategy for linking predictor scores and the criterion construct domain focuses on pairing evidence of the adequacy with which the predictor measure represents the predictor construct domain (linkage 2) with evidence of the linkage between the predictor construct domain and the criterion construct domain (linkage 3). As noted above, there is no single direct route to establishing these linkages. They involve lines of evidence subsumed under "construct validity" in prior conceptualizations of the validation process. A combination of lines of evidence (e.g., expert judgment of the characteristics predictive of job success, inferences drawn from an analysis of critical incidents of effective and ineffective job performance, and interview and observation methods) may support inferences about the predictor constructs linked to the criterion construct domain. Measures of these predictor constructs may then be selected or developed, and the linkage between the predictor measure and the predictor construct domain can be established with various lines of evidence for linkage 2, discussed above.

The various strategies for linking predictor scores to the criterion construct domain may differ in their potential applicability to any given employment testing context. While the availability of certain lines of evidence may be constrained, such constraints do not reduce the importance of establishing a validity argument for the predictive inference.

For example, methods for establishing linkages are more limited in settings with only small samples available. In such situations, gathering local evidence of predictor-criterion relationships is not feasible, and approaches to generalizing evidence from other settings may be more useful. A variety of methods exist for generalizing evidence of the validity of the interpretation of the predictive inference from other settings. Validity evidence may be directly transported from another setting in a case where sound evidence (e.g., careful job analysis) indicates that the local job is highly comparable to the job for which the validity data are being imported. These methods may rely on evidence for linkage 1 and linkage 4 that have already been established in other studies, as in the case of the transportability study described previously. Evidence for linkage 1 may also be established using techniques such as meta-analysis to combine results from multiple studies, and a careful job analysis may establish evidence for linkage 4 by showing the focal job to be similar to other jobs included in the meta-analysis. At the extreme, a
selection system may be developed for a newly created job with no current incumbents. Here, generalizing evidence from other settings may be especially helpful.

For many testing applications, there is a considerable cumulative body of research that speaks to some, if not all, of the inferences discussed above. A meta-analytic integration of this research can form an integral part of the strategy for linking test information to the construct domain of interest. The value of collecting local validation data varies with the magnitude, relevance, and consistency of research findings using similar predictor measures and similar criterion construct domains for similar jobs. In some cases, a small and inconsistent cumulative research record may lead to a validation strategy that relies heavily on local data; in others, a large, consistent research base may make investing resources in additional local data collection unnecessary.

Thus, multiple sources of data and multiple lines of evidence can be drawn upon to evaluate the linkage between a predictor measure and the criterion construct domain of interest. There is no single preferred method of inquiry for establishing this linkage. Rather, the test user must consider the specifics of the testing situation and apply professional judgment in developing a strategy for testing the hypothesis of a linkage between the predictor measure and the criterion domain.

Bases for Evaluating Employment Test Use

Although a primary goal of employment testing is the accurate prediction of subsequent job behaviors or job outcomes, it is important to recognize that there are limits to the degree to which such criteria can be predicted. Perfect prediction is an unattainable goal. First, behavior in work settings is influenced by a wide variety of organizational and extra-organizational factors, including supervisor and peer coaching, formal and informal training, job design, organizational structures and systems, and family responsibilities, among others. Second, behavior in work settings is also influenced by a wide variety of individual characteristics, including knowledge, skills, abilities, personality, and work attitudes, among others. Thus, any single characteristic will be only an imperfect predictor, and even complex selection systems only focus on the set of constructs deemed most critical for the job, rather than on all characteristics that can influence job behavior. Third, some measurement error always occurs, even in well-developed test and criterion measures.

Thus, testing systems cannot be judged against a standard of perfect prediction. Rather, they should be judged in terms of comparisons with available alternative selection methods. Professional judgment, informed by knowledge of the research literature about the degree of predictive accuracy relative to available alternatives, influences decisions about test use.

Decisions about test use are often influenced by additional considerations, including utility (i.e., cost-benefit) and return on investment, value judgments about the relative importance of selecting for one criterion domain versus others, concerns about applicant reactions to test content and processes, the availability and appropriateness of alternative selection methods, and statutory or regulatory requirements governing test use, fairness, and policy objectives such as workforce diversity. Organizational values necessarily come into play in decisions about test use; thus, even organizations with comparable evidence supporting an intended inference drawn from test scores may reach different conclusions about whether to use any particular test.

Testing in Professional and Occupational Credentialing

Tests are widely used in the credentialing of persons for many occupations and professions. Licensing requirements are imposed by federal, state, and local governments to ensure that those who are licensed possess knowledge and skills in sufficient degree to perform important occupational activities safely and effectively. Certification plays a similar role in many occupations not regulated by governments and is often a necessary precursor to advancement. Certification has also become widely used to indicate that a person has specific skills (e.g., operation of specialized auto repair
equipment) or knowledge (e.g., estate planning), which may be only a part of their occupational duties. Licensure and certification will here generically be called credentialing.

Tests used in credentialing are intended to provide the public, including employers and government agencies, with a dependable mechanism for identifying practitioners who have met particular standards. The standards may be strict, but not so stringent as to unduly restrain the right of qualified individuals to offer their services to the public. Credentialing also serves to protect the public by excluding persons who are deemed to be not qualified to do the work of the profession or occupation. Qualifications for credentials typically include educational requirements, some amount of supervised experience, and other specific criteria, as well as attainment of a passing score on one or more examinations. Tests are used in credentialing in a broad spectrum of professions and occupations, including medicine, law, psychology, teaching, architecture, real estate, and cosmetology. In some of these, such as actuarial science, clinical neuropsychology, and medical specialties, tests are also used to certify advanced levels of expertise. Relicensure or periodic recertification is also required in some occupations and professions.

Tests used in credentialing are designed to determine whether the essential knowledge and skills have been mastered by the candidate. The focus is on the standards of competence needed for effective performance (e.g., in licensure this refers to safe and effective performance in practice). Test design generally starts with an adequate definition of the occupation or specialty, so that persons can be clearly identified as engaging in the activity. Then the nature and requirements of the occupation, in its current form, are delineated. To identify the knowledge and skills necessary for competent practice, it is important to complete an analysis of the actual work performed and then document the tasks and responsibilities that are essential to the occupation or profession of interest. A wide variety of empirical approaches may be used, including the critical incident technique, job analysis, training needs assessments, or practice studies and surveys of practicing professionals. Panels of experts in the field often work in collaboration with measurement experts to define test specifications, including the knowledge and skills needed for safe, effective performance and an appropriate way of assessing them. The Standards apply to all forms of testing, including traditional multiple-choice and other selected-response tests, constructed-response tasks, portfolios, situational judgment tasks, and oral examinations. More elaborate performance tasks, sometimes using computer-based simulation, are also used in assessing such practice components as, for example, patient diagnosis or treatment planning. Hands-on performance tasks may also be used (e.g., operating a boom crane or filling a tooth), with observation and evaluation by one or more examiners.

Credentialing tests may cover a number of related but distinct areas of knowledge or skill. Designing the testing program includes deciding what areas are to be covered, whether one or a series of tests is to be used, and how multiple test scores are to be combined to reach an overall decision. In some cases, high scores on some tests are permitted to offset (i.e., compensate for) low scores on other tests, so that an additive combination is appropriate. In other cases, a conjunctive decision model requiring acceptable performance on each test in an examination series is used. The type of pass-fail decision model appropriate for a credentialing program should be carefully considered, and the conceptual and/or empirical basis for the decision model should be articulated.

Validation of credentialing tests depends mainly on content-related evidence, often in the form of judgments that the test adequately represents the content domain associated with the occupation or specialty being considered. Such evidence may be supplemented with other forms of evidence external to the test. For example, information may be provided about the process by which specifications for the content domain were developed and the expertise of the individuals making judgments about the content domain. Criterion-related evidence is of limited applicability because credentialing examinations are not intended to predict individual performance in a specific job but rather
CHAPTER 11
to provide evidence that candidates have acquired the knowledge, skills, and judgment required for effective performance, often in a wide variety of jobs or settings (we use the term judgment to refer to the applications of knowledge and skill to particular situations). In addition, measures of performance in practice are generally not available for those who are not granted a credential.

Defining the minimum level of knowledge and skill required for licensure or certification is one of the most important and difficult tasks facing those responsible for credentialing. The validity of the interpretation of the test scores depends on whether the standard for passing makes an appropriate distinction between adequate and inadequate performance. Often, panels of experts are used to specify the level of performance that should be required. Standards must be high enough to ensure that the public, employers, and government agencies are well served, but not so high as to be unreasonably limiting. Verifying the appropriateness of the cut score or scores on a test used for licensure or certification is a critical element of the validation process. Chapter 5 provides a general discussion of setting cut scores (see Standards 5.21–5.23 for specific topics concerning cut scores).

Legislative bodies sometimes attempt to legislate a cut score, such as answering 70% of test items correctly. Cut scores established in such an arbitrary fashion can be harmful for two reasons. First, without detailed information about the test, job requirements, and their relationship, sound standard setting is impossible. Second, without detailed information about the format of the test and the difficulty of items, such arbitrary cut scores have little meaning.

Scores from credentialing tests need to be precise in the vicinity of the cut score. They may not need to be as precise for test takers who clearly pass or clearly fail. Computer-based mastery tests may include a provision to end the testing when it becomes clear that a decision about the candidate's performance can be made, resulting in a shorter test for candidates whose performance clearly exceeds or falls below the minimum performance required for a passing score. Because mastery tests may not be designed to provide accurate results over the full score range, many such tests report results as simply "pass" or "fail." When feedback is given to candidates about how well or how poorly they performed, precision throughout the score range is needed. Conditional standard errors of measurement, discussed in chapter 2, provide information about the precision of specific scores.

Candidates who fail may profit from information about the areas in which their performance was especially weak. This is the reason that subscores are sometimes provided. Subscores are often based on relatively small numbers of items and can be much less reliable than the total score. Moreover, differences in subscores may simply reflect measurement error. For these reasons, the decision to provide subscores to candidates should be made carefully, and information should be provided to facilitate proper interpretation. Chapter 2 and Standard 2.3 speak to the importance of subscore reliability.

Because credentialing tends to involve high stakes and is an ongoing process, with tests given on a regular schedule, it is generally not desirable to use the same test form repeatedly. Thus, new forms, or versions of the test, are generally needed on an ongoing basis. From a technical perspective, all forms of a test should be prepared to the same specifications, assess the same content domains, and use the same weighting of components or topics.

Alternate test forms should have the same score scale so that scores can retain their meaning. Various methods of linking or equating alternate forms can be used to ensure that the standard for passing represents the same level of performance on all forms. Note that release of past test forms may compromise the extent to which different test forms are comparable.

Practice in professions and occupations often changes over time. Evolving legal restrictions, progress in scientific fields, and refinements in techniques can result in a need for changes in test content. Each profession or occupation should periodically reevaluate the knowledge and skills measured in its examination used to meet the
WORKPLACE TESTING AND CREDENTIALING
requirements of the credential. When change is substantial, it becomes necessary to revise the definition of the profession, and the test content, to reflect changing circumstances. These changes to the test may alter the meaning of the score scale. When major revisions are made in the test or when the score scale changes, the cut score should also be reestablished.

Some credentialing groups consider it necessary, as a practical matter, to adjust their passing score or other criteria periodically to regulate the number of accredited candidates entering the profession. This questionable procedure raises serious problems for the technical quality of the test scores and threatens the validity of the interpretation of a passing score as indicating entry-level competence. Adjusting the cut score periodically also implies that standards are set higher in some years than in others, a practice that is difficult to justify on the grounds of quality of performance. The score scale is sometimes adjusted so that a certain number or proportion of candidates will reach the passing score. This approach, while less obvious to the candidates than changing the cut score, is also technically inappropriate because it changes the meaning of the scores from year to year. Passing a credentialing examination should signify that the candidate meets the knowledge and skill standards set by the credentialing body to ensure effective practice.

Issues of cheating and test security are of special importance for testing practices in credentialing. Issues of test security are covered in chapters 6 and 9. Issues of cheating by test takers are covered in chapter 8 (see Standards 8.9–8.12, addressing testing irregularities).

Fairness and access, discussed in chapter 3, are important for licensing and certification testing. An evaluation of an accommodation or modification for a credentialing test should take into consideration the critical functions performed in the work targeted by the test. In the case of credentialing tests, the criticality of job functions is informed by the public interest as well as the nature of the work itself. When a condition limits an individual's ability to perform a critical function of a job, an accommodation or modification of the licensing or certification exam may not be appropriate (i.e., some changes may fundamentally alter factors that the examination is designed to measure for protection of the public's health, safety, and welfare).
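The chapter notes that methods of linking or equating alternate forms can be used so that the passing standard represents the same level of performance on all forms. A minimal sketch of one classical approach, mean-sigma linear equating, follows. All form statistics and the cut score are hypothetical, and operational equating would rest on a designed data-collection plan (e.g., common items or randomly equivalent groups), not on the bare formula shown here.

```python
# Illustrative sketch of mean-sigma linear equating between two test forms.
# All numbers are hypothetical; this is not the Standards' prescribed method,
# just one classical linear linking technique.

def mean_sigma_equate(score_b, mean_a, sd_a, mean_b, sd_b):
    """Map a Form B raw score onto the Form A scale so that equated
    scores (including the passing score) keep the same relative standing."""
    return mean_a + (sd_a / sd_b) * (score_b - mean_b)

# Hypothetical form statistics from equivalent groups of examinees.
mean_a, sd_a = 70.0, 8.0   # reference form, on which the cut score was set
mean_b, sd_b = 66.0, 10.0  # new form, which turned out harder and more spread out

cut_on_a = 75.0  # passing score established on Form A

# The Form B raw score that carries the same standard is found by
# inverting the linear transformation.
cut_on_b = mean_b + (sd_b / sd_a) * (cut_on_a - mean_a)
print(cut_on_b)  # Form B score representing the same level of performance
print(mean_sigma_equate(cut_on_b, mean_a, sd_a, mean_b, sd_b))  # maps back to 75.0
```

Here the harder Form B needs a lower raw cut (72.25) to represent the same performance level as 75 on Form A; releasing old forms, as the chapter cautions, can undermine the comparability this linking assumes.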
The standards in this chapter have been separated into three thematic clusters labeled as follows:

1. Standards Generally Applicable to Both Employment Testing and Credentialing
2. Standards for Employment Testing
3. Standards for Credentialing

Cluster 1. Standards Generally Applicable to Both Employment Testing and Credentialing

Standard 11.1

Prior to development and implementation of an employment or credentialing test, a clear statement of the intended interpretations of test scores for specified uses should be made. The subsequent validation effort should be designed to determine how well this has been achieved for all relevant subgroups.

Comment: The objectives of employment and credentialing tests can vary considerably. Some employment tests aim to screen out those least suited for the job in question, while others are designed to identify those best suited for the job. Employment tests also vary in the aspects of job behavior they are intended to predict, which may include quantity or quality of work output, tenure, counterproductive behavior, and teamwork, among others. Credentialing tests and some employment tests are designed to identify candidates who have met some specified level of proficiency in a target domain of knowledge, skills, or judgment.

Standard 11.2

Evidence of validity based on test content requires a thorough and explicit definition of the content domain of interest.

Comment: In general, the job content domain for an employment test should be described in terms of the tasks that are performed and/or the knowledge, skills, abilities, and other characteristics that are required on the job. They should be clearly defined so that they can be linked to test content. The knowledge, skills, abilities, and other characteristics included in the content domain should be those that qualified applicants already possess when being considered for the job in question. Moreover, the importance of these characteristics for the job under consideration should not be expected to change substantially over a specified period of time.

For credentialing tests, the target content domain generally consists of the knowledge, skills, and judgment required for effective performance. The target content domain should be clearly defined so it can be linked to test content.

Standard 11.3

When test content is a primary source of validity evidence in support of the interpretation for the use of a test for employment decisions or credentialing, a close link between test content and the job or professional/occupational requirements should be demonstrated.

Comment: For example, if the test content samples job tasks with considerable fidelity (e.g., with actual job samples such as machine operation) or, in the judgment of experts, correctly simulates job task content (e.g., with certain assessment center exercises), or if the test samples specific job knowledge (e.g., information necessary to perform certain tasks) or skills required for competent performance, then content-related evidence can be offered as the principal form of evidence of validity. If the link between the test content and the job content is not clear and direct, other lines of validity evidence take on greater importance.

When evidence of validity based on test content is presented for a job or class of jobs, the evidence should include a description of the major job characteristics that a test is meant to sample. It is often valuable to also include information about…
When multiple test scores or test scores and nontest information are integrated for the purpose of making a decision, the role played by each should be clearly explicated, and the inference made from each source of information should be supported by validity evidence.

Comment: In credentialing, candidates may be required to score at or above a specified minimum on each of several tests (e.g., a practical, skill-based examination and a multiple-choice knowledge test) or at or above a cut score on a total composite score. Specific educational and/or experience requirements may also be mandated. A rationale and its supporting evidence should be provided for each requirement. For tests and assessments, such evidence includes, but is not necessarily limited to, the reliability/precision of scores and the correlations among the tests and assessments. In employment testing, a decision maker may integrate test scores with interview data, reference checks, and many other sources of information in making employment decisions. The inferences drawn from test scores should be limited to those for which validity evidence is available. For example, viewing a high test score as indicating overall job suitability, and thus precluding the need for reference checks, would be an inappropriate inference from a test measuring a single narrow, albeit relevant, domain, such as job knowledge. In other circumstances, decision makers integrate scores across multiple tests, or across multiple scales within a given test.

Standard 11.5

Comment: The cumulative literature on the relationship between a particular type of predictor and type of criterion may be sufficiently large and consistent to support the predictor-criterion relationship without additional research. In some settings, the cumulative research literature may be so substantial and so consistent that a dissimilar finding in a local study should be viewed with caution unless the local study is exceptionally sound. Local studies are of greatest value in settings where the cumulative research literature is sparse (e.g., due to the novelty of the predictor and/or criterion used), where the cumulative record is inconsistent, or where the cumulative literature does not include studies similar to the study from the local setting (e.g., a study of a test with a large cumulative literature dealing exclusively with production jobs and a local setting involving managerial jobs).

Standard 11.6

Reliance on local evidence of empirically determined predictor-criterion relationships as a validation strategy is contingent on a determination of technical feasibility.

Comment: Meaningful evidence of predictor-criterion relationships is conditional on a number of features, including (a) the jobs being relatively stable rather than in a period of rapid evolution; (b) the availability of a relevant and reliable criterion measure; (c) the availability of a sample reasonably representative of the population of interest; and (d) an adequate sample size for estimating
the strength of the predictor-criterion relationship. If any of these conditions is not met, some alternative validation strategy should be used. For example, as noted in the comment to Standard 11.5, the cumulative research literature may provide strong evidence of validity.

Standard 11.7

When empirical evidence of predictor-criterion relationships is part of the pattern of evidence used to support test use, the criterion measure(s) used should reflect the criterion construct domain of interest to the organization. All criteria used should represent important work behaviors or work outputs, either on the job or in job-relevant training, as indicated by an appropriate review of information about the job.

Comment: When criteria are constructed to represent job activities or behaviors (e.g., supervisory ratings of subordinates on important job dimensions), systematic collection of information about the job should inform the development of the criterion measures. However, there is no clear choice among the many available job analysis methods. Note that job analysis is not limited to direct observation of the job or direct sampling of subject matter experts; large-scale job-analytic databases often provide useful information. There is not a clear need for job analysis to support criterion use when measures such as absenteeism, turnover, or accidents are the criteria of interest.

Standard 11.8

Comment: Errors of measurement in the criterion and restrictions on the variability of predictor or criterion scores systematically reduce estimates of the relationship between predictor measures and the criterion construct domain, but procedures for correction for the effects of these artifacts are available. When these procedures are applied, both corrected and uncorrected values should be presented, along with the rationale for the correction procedures chosen. Statistical significance tests for uncorrected correlations should not be used with corrected correlations. Other features to be considered include issues such as missing data for some variables for some individuals, decisions about the retention or removal of extreme data points, the effects of capitalization on chance in selecting predictors from a larger set on the basis of strength of predictor-criterion relationships, and the possibility of spurious predictor-criterion relationships, as in the case of collecting criterion ratings from supervisors who know selection test scores. Chapter 3, on fairness, describes additional issues that should be considered.

Standard 11.9

Evidence of predictor-criterion relationships in a current local situation should not be inferred from a single previous validation study unless the previous study of the predictor-criterion relationships was done under favorable conditions (i.e., with a large sample size and a relevant criterion) and the current situation corresponds closely to the previous situation.
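The artifact corrections discussed above (measurement error in the criterion, restriction of range) can be illustrated with the classical formulas. This is a minimal sketch, not a prescribed procedure; the observed correlation, criterion reliability, and variance ratio are all hypothetical, and, as the comment requires, both uncorrected and corrected values are reported.

```python
import math

# Illustrative sketch of two classical corrections applied to an observed
# predictor-criterion correlation. All input values are hypothetical.

def correct_for_criterion_unreliability(r_xy, r_yy):
    """Correct the validity coefficient for measurement error in the
    criterion (the classical correction for attenuation)."""
    return r_xy / math.sqrt(r_yy)

def correct_for_range_restriction(r_restricted, u):
    """Thorndike Case II correction for direct range restriction on the
    predictor; u = SD(unrestricted) / SD(restricted)."""
    r = r_restricted
    return (r * u) / math.sqrt(1 + r * r * (u * u - 1))

r_observed = 0.30    # correlation in the selected (restricted) sample
r_criterion = 0.64   # reliability of supervisory ratings (hypothetical)
u = 1.5              # applicant-pool SD is 1.5 times the incumbent SD

r_unrestricted = correct_for_range_restriction(r_observed, u)
r_fully_corrected = correct_for_criterion_unreliability(r_unrestricted, r_criterion)

# Report both uncorrected and corrected values, per the comment above.
print(f"observed r = {r_observed:.3f}")
print(f"corrected for range restriction = {r_unrestricted:.3f}")
print(f"also corrected for criterion unreliability = {r_fully_corrected:.3f}")
```

Note that significance tests computed for the uncorrected correlation should not be attached to the corrected values, and the rationale for applying any correction should accompany the report.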
…and are consistent with the purpose for which the credentialing program was instituted.

Comment: Typically, some form of job or practice analysis provides the primary basis for defining the content domain. If the same examination is used in the credentialing of people employed in a variety of settings and specialties, a number of different job settings may need to be analyzed. Although the job analysis techniques may be similar to those used in employment testing, the emphasis for credentialing is limited appropriately to knowledge and skills necessary for effective practice. The knowledge and skills contained in a core curriculum designed to train people for the job or occupation may be relevant, especially if the curriculum has been designed to be consistent with empirical job or practice analyses.

In tests used for licensure, knowledge and skills that may be important to success but are not directly related to the purpose of licensure (e.g., protecting the public) should not be included. For example, in accounting, marketing skills may be important for success, and assessment of those skills might have utility for organizations selecting accountants for employment. However, lack of those skills may not present a threat to the public, and thus the skills would appropriately be excluded from this licensing examination. The fact that successful practitioners possess certain knowledge or skills is relevant but not persuasive. Such information needs to be coupled with an analysis of the purpose of a credentialing program and the reasons that the knowledge or skills are required in an occupation or profession.

Standard 11.14

Estimates of the consistency of test-based credentialing decisions should be provided in addition to other sources of reliability evidence.

Comment: The standards for decision consistency described in chapter 2 are applicable to tests used for licensure and certification. Other types of reliability estimates and associated standard errors of measurement may also be useful, particularly the conditional standard error at the cut score. However, the consistency of decisions on whether to certify is of primary importance.

Standard 11.15

Rules and procedures that are used to combine scores on different parts of an assessment or scores from multiple assessments to determine the overall outcome of a credentialing test should be reported to test takers, preferably before the test is administered.

Comment: In some credentialing cases, candidates may be required to score at or above a specified minimum on each of several tests. In other cases, the pass-fail decision may be based solely on a total composite score. If tests will be combined into a composite, candidates should be provided information about the relative weighting of the tests. It is not always possible to inform candidates of the exact weights prior to test administration because the weights may depend on empirical properties of the score distributions (e.g., their variances). However, candidates should be informed of the intention of weighting (e.g., test A contributes 25% and test B contributes 75% to the total score).

Standard 11.16

The level of performance required for passing a credentialing test should depend on the knowledge and skills necessary for credential-worthy performance in the occupation or profession and should not be adjusted to control the number or proportion of persons passing the test.

Comment: The cut score should be determined by a careful analysis and judgment of credential-worthy performance (see chap. 5). When there are alternate forms of a test, the cut score should refer to the same level of performance for all forms.
12. EDUCATIONAL TESTING AND ASSESSMENT
BACKGROUND
Educational testing has a long history of use for informing decisions about learning, instruction, and educational policy. Results of tests are used to make judgments about the status, progress, or accomplishments of individual students, as well as entities such as schools, school districts, states, or nations. Tests used in educational settings represent a variety of approaches, ranging from traditional multiple-choice and open-ended item formats to performance assessments, including scorable portfolios. As noted in the introductory chapter, a distinction is sometimes made between the terms test and assessment, the latter term encompassing broader sources of information than a score on a single instrument. In this chapter we use both terms, sometimes interchangeably, because the standards discussed generally apply to both.

This chapter does not explicitly address issues related to tests developed or selected exclusively to inform learning and instruction at the classroom level. Those tests often have consequences for students, including influencing instructional actions, placing students in educational programs, and affecting grades that may affect admission to colleges. The Standards provide desirable criteria of quality that can be applied to such tests. However, as with past editions, practical considerations limit the Standards' applicability at the classroom level. Formal validation practices are often not feasible for classroom tests because schools and teachers do not have the resources to document the characteristics of their tests and are not publishing their tests for widespread use. Nevertheless, the core expectations of validity, reliability/precision, and fairness should be considered in the development of such tests.

The Standards clearly applies to formal tests whose scores or other results are used for purposes that extend beyond the classroom, such as benchmark or interim tests that schools and districts use to monitor student progress. The Standards also applies to assessments that are adopted for use across classrooms and whose developers make claims for the validity of score interpretations for intended uses. Admittedly, this distinction is not always clear. Increasingly, districts, schools, and teachers are using an array of coordinated instruction and/or assessment systems, many of which are technology based. These systems may include, for example, banks of test items that individual teachers can use in constructing tests for their own purposes, focused assessment exercises that accompany instructional lessons, or simulations and games designed for instruction or assessment purposes. Even though it is not always possible to separate measurement issues from corresponding instructional and learning issues in these systems, assessments that are part of these systems and that serve purposes beyond an individual teacher's instruction fall within the purview of the Standards. Developers of these systems bear responsibility for adhering to the Standards to support their claims.

Both the introductory discussion and the standards provided in this chapter are organized into three broad clusters: (1) design and development of educational assessments; (2) use and interpretation of educational assessments; and (3) administration, scoring, and reporting of educational assessments. Although the clusters are related to the chapters addressing operational areas of the standards, this discussion draws upon the principles and concepts provided in the foundational chapters on validity, reliability/precision, and fairness and applies them to educational settings. It should also be noted that this chapter does not specifically address the use of test results in mandated accountability systems that may impose performance-based rewards or sanctions on institutions such as schools or school districts or on individuals such as teachers or principals. Accountability applications involving aggregates of scores are
addressed in chapter 13 ("Uses of Tests for Program Evaluation, Policy Studies, and Accountability").

Design and Development of Educational Assessments

Educational tests are designed and developed to provide scores that support interpretations for the intended test purposes and uses. Design and development of educational tests, therefore, begins by considering test purpose. Once a test's purposes are established, considerations related to the specifics of test design and development can be addressed.

Major Purposes of Educational Testing

Although educational tests are used in a variety of ways, most address at least one of three major purposes: (a) to make inferences that inform teaching and learning at the individual or curricular level; (b) to make inferences about outcomes for individual students and groups of students; and (c) to inform decisions about students, such as certifying students' acquisition of particular knowledge and skills for promotion, placement in special instructional programs, or graduation.

Informing teaching and learning. Assessments that inform teaching and learning start with clear goals for student learning and may involve a variety of strategies for assessing student status and progress. The goals are typically cognitive in nature, such as student understanding of rational number equivalence, but may also address affective states or psychomotor skills. For example, teaching and learning goals could include increasing student interest in science or teaching students to form letters with a pen or pencil.

Many assessments that inform teaching and learning are used for formative purposes. Teachers use them in day-to-day classroom settings to guide ongoing instruction. For example, teachers may assess students prior to starting a new unit to ascertain whether they have acquired the necessary prerequisite knowledge and skills. Teachers may then gather evidence throughout the unit to see whether students are making anticipated progress and to identify any gaps and/or misconceptions that need to be addressed.

More formal assessments used for teaching and learning purposes may not only inform classroom instruction but also provide individual and aggregated assessment data that others may use to support learning improvement. For example, teachers in a district may periodically administer commercial or locally constructed assessments that are aligned with the district curriculum or state content standards. These tests may be used to evaluate student learning over one or more units of instruction. Results may be reported immediately to students, teachers, and/or school or district leaders. The results may also be broken down by content standard or subdomain to help teachers and instructional leaders identify strengths and weaknesses in students' learning and/or to identify students, teachers, and/or schools that may need special assistance. For example, special programs may be designed to tutor students in specific areas in which test results indicate they need help. Because the test results may influence decisions about subsequent instruction, it is important to base content domain or subdomain scores on sufficient numbers of items or tasks to reliably support the intended uses.

In some cases, assessments administered during the school year may be used to predict student performance on a year-end summative assessment. If the predicted performance on the year-end assessment is low, additional instructional interventions may be warranted. Statistical techniques, such as linear regression, may be used to establish the predictive relationships. A confounding variable in such predictions may be the extent to which instructional interventions based on interim results improve the performance of initially low-scoring students over the course of the school year; the predictive relationships will decrease to the extent that student learning is improved.

Assessing student outcomes. The assessment of student outcomes typically serves summative functions, that is, to help assess pupils' learning at the completion of a particular instructional sequence (e.g., the end of the school year). Educational testing
o f student outcomes can be concerned with several For example, performance assessments that are
types o f score interpretations,' including standards- more closely connected with instructional units
based interpretations, growth-based interpretations, may measure certain content standards that are
and normative interpretations. These outcomes may not easily assessed by a more traditional end-of-
relate to the individual student or be aggregated year summative assessment.
over groups o f students, for example, classes, sub The evaluation o f student outcomes can also
groups, schools, districts, states, or nations. involve interpretations related to student progress
Standards-based interpretations o f student out or growth over time, rather than just performance
comes typically start with content standards, which specify what students are expected to know and be able to do. Such standards are typically established by committees of experts in the area to be tested. Content standards should be clear and specific and give teachers, students, and parents sufficient direction to guide teaching and learning. Academic achievement standards, which are sometimes referred to as performance standards, connect content standards to information that describes how well students are acquiring the knowledge and skills contained in academic content standards. Performance standards may include labels for levels of performance (e.g., "basic," "proficient," "advanced"), descriptions of what students at different performance levels know and can do, examples of student work that illustrate the range of achievement within each performance level, and cut scores specifying the levels of performance on an assessment that separate adjacent levels of achievement. The process of establishing the cut scores for the academic achievement standards is often referred to as standard setting.

Although it follows from a consideration of standards-based testing that assessments should be tightly aligned with content standards, it is usually not possible to comprehensively measure all of the content standards using a single summative test. For example, content standards that focus on student collaboration, oral argumentation, or scientific lab activities do not easily lend themselves to measurement by traditional tests. As a result, certain content standards may be underemphasized in instruction at the expense of standards that can be measured by the end-of-year summative test. Such limitations may be addressed by developing assessment components that focus on various aspects of a set of common content standards.

at a particular time. In standards-based testing, an important consideration is measuring student growth from year to year, both at the level of the individual student and aggregated across students, for example at the teacher, subgroup, or school level. A number of educational assessments are used to monitor the progress or growth of individual students within and/or across school years. Tests used for these purposes are sometimes supported by vertical scales that span a broad range of developmental or educational levels and include (but are not limited to) both conventional multilevel test batteries and computerized adaptive assessments. In constructing vertical scales for educational tests, it is important to align standards and/or learning objectives vertically across grades and to design tests at adjacent levels (or grades) that have substantial overlap in the content measured.

However, a variety of alternative statistical models exist for measuring student growth, not all of which require the use of a vertical scale. In using and evaluating various growth models, it is important to clearly understand which questions each growth model can (and cannot) answer, what assumptions each growth model is based on, and what appropriate inferences can be derived from each growth model's results. Missing data can create challenges for some growth models. Attention should be paid to whether some populations are being excluded from the model due to missing data (for example, students who are mobile or have poor attendance). Other factors to consider in the use of growth models are the relative reliability/precision of scores estimated for groups with different amounts of missing data, and whether the model treats students the same regardless of where they are on the performance continuum.
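The contrast between models that require a vertical scale and the missing-data exclusions noted above can be illustrated with a minimal gain-score sketch. The data, names, and listwise-deletion rule here are illustrative assumptions, not a model prescribed by these standards:

```python
# Minimal sketch of a gain-score growth model on a vertical scale.
# Hypothetical scores for the same students in two adjacent grades,
# reported on a common (vertical) scale; None marks a missing test
# (e.g., a mobile student or one with poor attendance).
scores_g3 = {"s1": 420, "s2": 435, "s3": None, "s4": 450}
scores_g4 = {"s1": 445, "s2": 455, "s3": 470, "s4": None}

def gain_scores(year1, year2):
    """Growth expressed as simple year-over-year gains. Students missing
    either score are silently dropped, which is one way a growth model
    can exclude some populations when data are missing."""
    return {s: year2[s] - year1[s]
            for s in year1
            if year1[s] is not None and year2.get(s) is not None}

gains = gain_scores(scores_g3, scores_g4)
print(gains)                             # per-student growth: s1 and s2 only
print(sum(gains.values()) / len(gains))  # growth aggregated, e.g., per classroom
```

A model that conditions on prior scores (e.g., a covariate-adjustment or student-growth-percentile model) would not difference the two scores and so would not require a vertical scale, which is one reason different growth models support different inferences.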
CHAPTER 12
Student outcomes in educational testing are sometimes evaluated through norm-referenced interpretations. A norm-referenced interpretation compares a student's performance with the performances of other students. Such interpretations may be made when assessing both status and growth. Comparisons may be made to all students, to a particular subgroup (e.g., other test takers who have majored in the test taker's intended field of study), or to subgroups based on many other conditions (e.g., students with similar academic performance, students from similar schools). Norms can be developed for a variety of targeted populations, ranging from national or international samples of students to the students in a particular school district (i.e., local norms). Norm-referenced interpretations should consider differences in the target populations at different times of a school year and in different years. When a test is routinely administered to an entire target population, as in the case of a statewide assessment, norm-referenced interpretations are relatively easy to produce and generally apply only to a single point in the school year. However, national norms for a standardized achievement test are often provided at several intervals within the school year. In that case, developers should indicate whether the norms covering a particular time interval were based on data or interpolated from data collected at other times of year. For example, winter norms are often based on an interpolation between empirical norms collected in fall and spring. The basis for calculating interpolated norms should be documented so that users can be made aware of the underlying assumptions about student growth over the school year.

Because of the time and expense associated with developing national norms, many test developers report alternative user norms that consist of descriptive statistics based on all those who take their test or a demographically representative subset of those test takers over a given period of time. Although such statistics (based on people who happen to take the test) are often useful, the norms based on them will change as the makeup of the reference group changes. Consequently, user norms should not be confused with norms representative of more systematically sampled groups.

Informing decisions about students. Test results are often used in the process of making decisions about individual students, for example, about high school graduation, placement in certain educational programs, or promotion from one grade to the next. In higher education, test results inform admissions decisions and the placement of admitted students in different courses (e.g., remedial or regular) or instructional programs.

Fairness is a fundamental concern with all tests, but because decisions regarding educational placement, promotion, or graduation can have profound individual effects, fairness is paramount when tests are used to inform such decisions. Fairness in this context can be enhanced through careful consideration of conditions that affect students' opportunities to demonstrate their capabilities. For example, when tests are used for promotion and graduation, the fairness of individual score interpretations can be enhanced by (a) providing students with multiple opportunities to demonstrate their capabilities through repeated testing with alternate forms or other construct-equivalent means; (b) providing students with adequate notice of the skills and content to be tested, along with appropriate test preparation materials; (c) providing students with curriculum and instruction that afford them the opportunity to learn the content and skills to be tested; (d) providing students with equal access to disclosed test content and responses as well as any specific guidance for test taking (e.g., test-taking strategies); (e) providing students with appropriate testing accommodations to address particular access needs; and (f) in appropriate cases, taking into account multiple criteria rather than just a single test score.

Tests informing college admissions decisions are used in conjunction with other information about students' capabilities. Selection criteria may vary within an institution by academic specialization and may include past academic records, transcripts, and grade-point average or rank in class. Scores on tests used to certify students for high school graduation or scores on tests administered at the end of specific high school courses may be used in college admissions decisions. The interpretations
EDUCATIONAL TESTING AND ASSESSMENT
inherent in these uses of high school tests should be supported by multiple lines of relevant validity evidence (e.g., both concurrent and predictive evidence). Other measures used by some institutions in making admissions decisions are samples of previous work by students, lists of academic and service accomplishments, letters of recommendation, and student-composed statements evaluated for the appropriateness of the goals and experience of the student and/or for writing proficiency.

Tests used to place students in appropriate college-level or remedial courses play an important role in both community colleges and four-year institutions. Most institutions either use commercial placement tests or develop their own tests for placement purposes. The items on placement tests are typically selected to serve this single purpose in an efficient manner and usually do not comprehensively measure prerequisite content. For example, a placement test in algebra will cover only a subset of algebra content taught in high school. Results of some placement tests are used to exempt students from having to take a course that would normally be required. Other placement tests are used by advisors for placing students in remedial courses or the most appropriate course in an introductory college-level sequence. In some cases, placement decisions are mechanized through the application of locally determined cut scores on the placement exam. Such cut scores should be established through a documented process involving appropriate stakeholders and validated through empirical research.

Results from educational tests may also inform decisions related to placing students in special instructional programs, including those for students with disabilities, English learners, and gifted and talented students. Test scores alone should never be used as the sole basis for including any student in special education programming, or excluding any student from such programming. Test scores should be interpreted in the context of the student's history, functioning, and needs. Nevertheless, test results may provide an important basis for determining whether a student has a disability and what the student's educational needs are.

Development of Educational Tests

As with all tests, once the construct and purposes of an educational test have been delineated, consideration must be given to the intended population of test takers, as well as to practical issues such as available testing time and the resources available to support the development effort. In the development of educational tests, focus is placed on measuring the knowledge, skills, and abilities of all examinees in the intended population without introducing any advantages or disadvantages because of individual characteristics (e.g., age, culture, disability, gender, language, race/ethnicity) that are irrelevant to the construct the test is intended to measure. The principles of universal design (an approach to assessment development that attempts to maximize the accessibility of a test for all of its intended examinees) provide one basis for developing educational assessments in this manner. Paramount in the process is explicit documentation of the steps taken during the development process to provide evidence of fairness, reliability/precision, and validity for the test's intended uses. The higher the stakes associated with the assessment, the more attention needs to be paid to such documentation. More detailed considerations related to the development of educational tests are discussed in the chapters on fairness in testing (chap. 3) and test design and development (chap. 4).

A variety of formats are used in developing educational tests, ranging from traditional item formats such as multiple-choice and open-ended items to performance assessments, including scorable portfolios, simulations, and games. Examples of such performance assessments might include solving problems using manipulable materials, making complex inferences after collecting information, or explaining orally or in writing the rationale for a particular course of government action under given economic conditions. An individual portfolio may be used as another type of performance assessment. Scorable portfolios are systematic collections of educational products typically collected, and possibly revised, over time.
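The mechanized placement described earlier, in which locally determined cut scores on a placement exam map directly to course assignments, can be sketched as follows. The cut points and course labels are hypothetical; in practice, such rules would be established through a documented process and validated through empirical research:

```python
# Minimal sketch of cut-score-based course placement.
# Cut points and placements are illustrative assumptions, not values
# drawn from any actual placement program.

# (minimum score, placement) rules, checked from the highest cut down.
CUT_SCORES = [
    (85, "exempt from course"),    # exemption cut
    (60, "college-level course"),  # regular placement cut
    (0,  "remedial course"),       # everyone else
]

def place_student(score):
    """Return the course placement for a placement-exam score."""
    for cut, placement in CUT_SCORES:
        if score >= cut:
            return placement
    raise ValueError("score below all cuts")

print(place_student(92))  # exempt from course
print(place_student(70))  # college-level course
print(place_student(40))  # remedial course
```

A score falling exactly on a cut receives the higher placement here; where a borderline score lands is exactly the kind of decision rule that stakeholder review and empirical validation are meant to scrutinize.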
Technology is often used in educational settings to present testing material and to record and score test takers' responses. Examples include enhancements of text by audio instructions to facilitate student understanding, computer-based and adaptive tests, and simulation exercises where attributes of performance assessments are supported by technology. Some test administration formats also may have the capacity to capture aspects of students' processes as they solve test items. They may, for example, monitor time spent on items, solutions tried and rejected, or editing sequences for texts created by test takers. Technologies also make it possible to provide test administration conditions designed to accommodate students with particular needs, such as those with different language backgrounds, attention deficit disorders, or physical disabilities.

Interpretations of scores on technology-based tests are evaluated by the same standards for validity, reliability/precision, and fairness as tests administered through more traditional means. It is especially important that test takers be familiarized with the assessment technologies so that any unfamiliarity with an input device or assessment interface does not lead to inferences based on construct-irrelevant variance. Furthermore, explicit consideration of sources of construct-irrelevant variance should be part of the validation framework as new technologies or interfaces are incorporated into assessment programs. Finally, it is important to describe scoring algorithms used in technology-based tests and the expert models on which they may be based, and to provide technical data supporting their use in the testing system documentation. Such documentation, however, should stop short of jeopardizing the security of the assessment in ways that could adversely affect the validity of score interpretations.

Assessments Serving Multiple Purposes

By evaluating students' knowledge and skills relative to a specific set of academic goals, test results may serve a variety of purposes, including improving instruction to better meet student needs; evaluating curriculum and instruction district-wide; identifying students, schools, and/or teachers who need help; and/or predicting each student's likelihood of success on a summative assessment. It is important to validate the interpretations made from test scores on such assessments for each of their intended uses.

There are often tensions associated with using educational assessments for multiple purposes. For example, a test developed to monitor the progress or growth of individual students across school years is unlikely to also effectively provide detailed and actionable diagnostic information about students' strengths and weaknesses. Similarly, an assessment designed to be given several times over the course of the school year to predict student performance on a year-end summative assessment is unlikely to provide useful information about student learning with respect to particular instructional units. Most educational tests will serve one purpose better than others; and the more purposes an educational test is purported to serve, the less likely it is to serve any of those purposes effectively. For this reason, test developers and users should design and/or select educational assessments to achieve the purposes they believe are most important, and they should consider whether additional purposes can be fulfilled and should monitor the appropriateness of any identified additional uses.

Use and Interpretation of Educational Assessments

Stakes and Consequences of Assessment

The importance of the results of testing programs for individuals, institutions, or groups is often referred to as the stakes of the testing program. When the stakes for an individual are high, and important decisions depend substantially on test performance, the responsibility for providing evidence supporting a test's intended purposes is greater than might be expected for tests used in low-stakes settings. Although it is never possible to achieve perfect accuracy in describing an individual's performance, efforts need to be made to minimize errors of measurement or errors in classifying individuals into categories such as "pass,"
"fail," "admit," or "reject." Further, supporting the validity of interpretations for high-stakes purposes, whether individual or institutional, typically entails collecting sound collateral information that can be used to assist in understanding the factors that contributed to test results and to provide corroborating evidence that supports inferences based on the results. For example, test results can be influenced by multiple factors, both institutional and individual, such as the quality of education provided, students' exposure to education (e.g., through regular school attendance), and students' motivation to perform well on the test. Collecting this type of information can contribute to appropriate interpretations of test results.

The high-stakes nature of some testing programs can create special challenges when new test versions are introduced. For example, a state may introduce a series of high school end-of-course tests that are based on new content standards and are partially tied to graduation requirements. The operational use of these new tests must be accompanied by documentation that students have indeed been instructed on content aligned to the new standards. Because of feasibility constraints, this may require a carefully planned phase-in period that includes special surveys or qualitative research studies that provide the needed opportunity-to-learn documentation. Until such documentation is available, the tests should not be used for their intended high-stakes purpose.

Many types of educational tests are viewed as tools of educational policy. Beyond any intended policy goals, it is important to consider potential unintended effects of large-scale testing programs. These possible unintended effects include (a) narrowing of curricula in some schools to focus exclusively on anticipated test content, (b) restriction of the range of instructional approaches to correspond to the testing format, (c) higher dropout rates among students who do not pass the test, and (d) encouragement of instructional or administrative practices that may raise test scores without improving the quality of education. It is essential for those who mandate and use educational tests to be aware of such potential negative consequences (including missed opportunities to improve teaching and learning), to collect information that bears on these issues, and to make decisions about the uses of assessments that take this information into account.

Assessments for Students With Disabilities and English Language Learners

In the 1999 edition of the Standards, the material on educational testing for special populations focused primarily on individualized diagnostic assessment and educational placement for students with special needs. Since then, requirements stemming from federal legislation have significantly increased the participation of special populations in large-scale educational assessment programs. Special populations have also become more diverse and now represent a larger percentage of those test takers who participate in general education programs. More students are being diagnosed with disabilities, and more of these students are included in general education programs and in state standards-based assessments. In addition, the number of students who are English language learners has grown dramatically, and the number included in educational assessments has increased accordingly.

As discussed in chapter 3 ("Fairness in Testing"), assessments for special populations involve a continuum of potential adaptations, ranging from specially developed alternate assessments to modifications and accommodations of regular assessments. The purpose of alternate assessments and adaptations is to increase the accessibility of tests that may not otherwise allow students with some characteristics to display their knowledge and skills. Assessments for special populations may also include assessments developed for English language learners and individually administered assessments that are used for diagnosis and placement.

Alternate assessments. The term alternate assessments as used here, in the context of educational testing, refers to assessments developed for students with significant cognitive disabilities. Based on performance standards different from those used for regular assessments, alternate assessments provide these students with the opportunity to demonstrate
their standing and progress in learning. An alternate assessment might consist of an observation checklist, a multilevel assessment with performance tasks, or a portfolio that includes responses to selected-response and/or open-ended tasks. The assessment tasks are developed with the special characteristics of this population in mind. For example, a multilevel assessment with performance tasks might include scaffolding procedures in which the examiner eliminates question distracters when students answer incorrectly, in order to reduce question complexity. Or, in a portfolio assessment, the teacher might include work samples and other assessment information tailored specifically to the student. The teacher may assess the same English language arts standard by asking one student to write a story and another to sequence a story using picture cards, depending on which activity provides students with access to demonstrate what they know and can do.

The development and use of alternate assessments in education have been heavily influenced by federal legislation. Federal regulations may require that alternate assessments used in a given state have explicit connections to the content standards measured by the regular state assessment while allowing for content with less depth, breadth, and complexity. Such requirements clearly influence the design and development of alternate assessments in state standards-based programs.

Alternate assessments in education should be held to the same technical requirements that apply to regular large-scale assessments. These include documentation and empirical data that support test development, standard setting, validity, reliability/precision, and technical characteristics of the tests. When the number of students served under alternate assessments is too small to generate stable statistical data, the test developer and users should describe alternate judgmental or other procedures used to document evidence of the validity of score interpretations.

A variety of comparability issues may arise when alternate assessments are used in statewide testing programs, for example, in aggregating the results of alternate and regular assessments or in comparing trend data for subgroups when alternate assessments have been used in some years and regular assessments in other years.

Accommodations and modifications. To enable assessment systems to include all students, accommodations and modifications are provided to those students who need them, including those who participate in alternate assessments because of their significant cognitive disabilities. Adaptations, which include both accommodations and modifications, provide access to educational assessments. Accommodations are adaptations to test format or administration (such as changes in the way the test is presented, the setting for the test, or the way in which the student responds) that maintain the same construct and produce results that are comparable to those obtained by students who do not use accommodations. Accommodations may be provided to English language learners to address their linguistic needs, as well as to students with disabilities to address specific, individual characteristics that otherwise would interfere with accessibility. For example, a student with extreme dyslexia may be provided with a screen reader to read aloud the scenarios and questions on a test measuring science inquiry skills. The screen reader would be considered an accommodation because reading is not part of the defined construct (science inquiry) and the scores obtained by the student on the test would be assumed to be comparable to those obtained by students testing under regular conditions.

The use of accommodations should be supported by evidence that their application does not change the construct that is being measured by the assessment. Such evidence may be available from studies of similar applications but may also require specially designed research.

Modifications are adaptations to test format or administration that change the construct being measured in order to make it accessible for designated students while retaining as much of the original construct as possible. Modifications result in scores that differ in meaning from those for the regular assessment. For example, a student with extreme dyslexia may be provided with a screen reader to read aloud the passages and questions on a reading
comprehension test that includes decoding as part of the construct. In this case, the screen reader would be considered a modification because it changes the construct being measured, and scores obtained by the student on the test would not be assumed to be comparable to those obtained by students testing under regular conditions. In many cases, accommodations can meet student access needs without the use of modifications, but in some cases, modifications are the only option for providing some students with access to an educational assessment. As with alternate assessments, comparability issues arise with the use of modifications in educational testing programs.

Modified tests should be designed and developed with the same considerations of validity, reliability/precision, and fairness as regular assessments. It is not sufficient to assume that the validity evidence associated with a regular assessment generalizes to a modified version.

An extensive discussion of modifications and accommodations for special populations is provided in chapter 3 ("Fairness in Testing").

Assessments for English language proficiency. An increasing focus on the measurement of English language proficiency (ELP) for English language learners (ELLs) has mirrored the growing presence of these students in U.S. classrooms. Like standards-based content tests, ELP tests are based on ELP standards and are held to the same standards for precision of scores and validity and fairness of score interpretations for intended uses as are other large-scale tests.

ELP tests can serve a variety of purposes. They are used to identify students as English learners and qualify them for special ELL programs and services, to redesignate students as English proficient, and for purposes of diagnosis and instruction. States, districts, and schools also use ELP tests to monitor these students' progress and to hold schools and educators accountable for ELL learning and progress toward English proficiency.

As with any educational test, validity evidence for measures of ELP can be provided by examining the test blueprint, the alignment of content with ELP standards, construct comparability across students, classification consistency, and other claims in the validity argument. The rationale and evidence supporting the ELP domain definition and the roles/relationships of the language modalities (e.g., reading, writing, speaking, listening) to overall ELP are important considerations in articulating the validity argument for an ELP test and can inform the interpretation of test results. Since no single assessment is equally effective in serving all desired purposes, users should consider which uses of ELP tests are their highest priority and choose or develop instruments accordingly.

Accommodations associated with ELP tests should be carefully considered, as adaptations that are appropriate for regular content assessments may compromise the ELP standards being assessed. In addition, users should establish common guidelines for using ELP results in making decisions about ELL students. The guidelines should include explicit policies and procedures for using results in identifying and redesignating ELL students as English proficient, an important process because of the legal and educational importance of these designations. Local education agencies and schools should be provided with easy access to the guidelines.

Individual assessments. Individually administered tests are used by psychologists and other professionals in schools and other related settings to inform decisions about a variety of services that may be administered to students. Services are provided for students who are gifted as well as for those who encounter academic difficulties (e.g., students requiring remedial reading instruction). Still other services are provided for students who display behavioral, emotional, physical, and/or more severe learning difficulties. Services may be provided for students who are taught in regular classrooms as well as for those receiving more specialized instruction (e.g., special education students).

Aspects of the test that may result in construct-irrelevant variance for students with certain relevant characteristics should be taken into account as appropriate by qualified testing professionals when using test results to aid placement decisions. For example, students' English language proficiency or prior educational experience may interfere with
their performance on a test of academic ability and, if not taken into account, could lead to misclassification in special education. Once a student is placed, tests may be administered to monitor the progress of the student toward prescribed learning goals and objectives. Test results may also be used to inform evaluations of instructional effectiveness and determinations of whether the special services need to be continued, modified, or discontinued.

Many types of tests are used in individualized and special needs testing. These include tests of cognitive abilities, academic achievement, learning processes, visual and auditory memory, speech and language, vision and hearing, and behavior and personality. These tests typically are used in conjunction with other assessment methods, such as interviews, behavioral observations, and reviews of records, for purposes of identifying and placing students with disabilities. Regardless of the qualities being assessed and the data collection methods employed, assessment data used in making special education decisions are evaluated in terms of evidence supporting intended interpretations as related to the specific needs of the students. The data must also be judged in terms of their usefulness for designing appropriate educational programs for students who have special needs. For further information, see chapter 10 ("Psychological Testing and Assessment").

Assessment Literacy and Professional Development

Assessment literacy can be broadly defined as knowledge about the basic principles of sound assessment practice, including terminology, the development and use of assessment methodologies and techniques, and familiarity with standards by which the quality of testing practices is judged. The results of educational assessments are used in decision making across a variety of settings in classrooms, schools, districts, and states. Given the range and complexity of test purposes, it is important for test developers and those responsible for educational testing programs to encourage educators to be informed consumers of the tests and to fully understand and appropriately use results that are reported to them. Similarly, as test users, it is the responsibility of educators to pursue and attain assessment literacy as it pertains to their roles in the education system.

Test sponsors and test developers can promote educator assessment literacy in a variety of ways, including workshops, development of written materials and media, and collaboration with educators in the test development process (e.g., development of content standards, item writing and review, and standard setting). In particular, those responsible for educational testing programs should incorporate assessment literacy into the ongoing professional development of educators. In addition, regular attempts should be made to educate other major stakeholders in the educational process, including parents, students, and policy makers.

Administration, Scoring, and Reporting of Educational Assessments

Administration of Educational Tests

Most educational tests involve standardized procedures for administration. These include directions to test administrators and examinees, specifications for testing conditions, and scoring procedures. Because educational tests typically are administered by school personnel, it is important for the sponsoring agency to provide appropriate oversight to the process and for schools to assign local roles and responsibilities (e.g., testing coordination) for training those who will administer the test. Similarly, test developers have an obligation to support the test administration process and to provide resources to help solve problems when they arise. For example, with high-stakes tests administered by computer, effective technical support to the local administration is critical and should involve personnel who understand the context of the testing program as well as the technical aspects of the delivery system.

Those responsible for educational testing programs should have formal procedures for granting testing accommodations and involve qualified personnel in the associated decision-making process. For students with disabilities, changes in both
192
EDUCATIONAL TESTING AND ASSESSMENT
For students with disabilities, changes in both instruction and assessment are typically specified in an individualized education program (IEP). For English language learners, schools may use guidance from the state or district to match students’ language proficiency and instructional experience with appropriate language accommodations. Test accommodations should be chosen by qualified personnel on the basis of the individual student’s needs. It is particularly important in large-scale assessment programs to establish clear policies and procedures for assigning and using accommodations. These steps help to maintain the comparability of scores for students testing with accommodations on academic assessments across districts and schools. Once selected, accommodations should be used consistently for both instruction and assessment, and test administrators should be fully familiar with procedures for accommodated testing. Additional information related to test administration accommodations is provided in chapter 3 (“Fairness in Testing”).

Weighted and Composite Scoring

Scoring educational tests and assessments requires developing rules for combining scores on items and/or tasks to obtain a total score and, in some cases, for combining multiple scores into an overall composite. Scores from multiple tests are sometimes combined into linear composites using nominal weights, which are assigned to each component score in accordance with a logical judgment of its relative importance. Nominal weights may sometimes be misleading because the variance of the composite is also determined by the variances and covariances of the individual component scores. As a result, the “effective weight” of each component may not reflect the nominal weighting. When composite scores are used, differences between nominal and effective weights should be understood and documented.

For a single test, total scores are often based on a simple sum of the item and task scores. However, differential weighting schemes may be applied to reflect differential emphasis on specific content or constructs. For example, in an English language arts test, more weight may be assigned to an extended essay because of the importance of the task and because it is not feasible to include more than one extended writing task in the test. In addition, scoring based on item response theory (IRT) models can result in item weights that differ from nominal or desired weights. Such applications of IRT should include consideration and explanation of item weights in scoring. In general, the scoring rules used for educational tests should be documented and include a validity-based rationale.

In addition, test developers should discuss with policy makers the various methods of combining the results from different educational tests used to make decisions about students, and should clearly document and communicate the methods, also known as decision rules. For example, as part of graduation requirements, a state may require a student to achieve established levels of performance on multiple tests measuring different content areas using either a noncompensatory or a compensatory decision rule. Under a noncompensatory decision rule, the student has to achieve a determined level of performance on each test; under a compensatory decision rule, the student may only have to achieve a certain total composite score based on a combination of scores across tests. For a high-stakes decision, such as one related to graduation, the rules used to combine scores across tests should be established with a clear understanding of the associated implications. In these situations, important consequences such as passing rates and classification error rates will differ depending on the rules for combining test results. Test developers should document and communicate these implications to policy makers to encourage policy decisions that are fully informed.

Reporting Scores

Score reports for educational assessments should support the interpretations and decisions of their intended audiences, which include students, teachers, parents, principals, policy makers, and other educators. Different reports may be developed and produced for different audiences, and the score report layouts may differ accordingly. For example, reports prepared for individual students
CHAPTER 12
and parents may include background information about the purpose of the assessment, definitions of performance categories, and more user-friendly representations of measurement error (e.g., error bands around graphical score displays). Those who develop such reports should strive to provide information that can help students make productive decisions about their own learning. In contrast, reports prepared for principals and district-level personnel may include more detailed summaries but less foundational information because these individuals typically have a much better understanding of assessments.

As discussed in chapter 3, when modifications have been made to a test for some test takers that affect the construct being measured, consideration may be given to reporting that a modification was made because it affects the reliability/precision of test scores or the validity of interpretations drawn from test scores. Conversely, when accommodations are made that do not affect the comparability of test scores, flagging those accommodations is not appropriate.

In general, score reports for educational tests should be designed to provide information that is understandable and useful to stakeholders without leading to unwarranted score interpretations. Test developers can significantly improve the design of score reports by conducting supporting research. For example, surveys of available reports for other educational tests can provide ideas for effectively displaying test results. In addition, usability research with consumers of score reports can provide insights into report design. A number of techniques can be used in this type of research, including focus groups, surveys, and analyses of verbal protocols. For example, the advantages and disadvantages of alternate prototype designs can be compared by gathering data about the interpretations and inferences made by users based on the data presented in each report.

Online reporting capabilities give users flexible access to test results. For example, the user can select options online to break down the results by content or subgroup. The options provided to test users for querying the results should support the test’s intended uses and interpretations. For example, online systems may discourage or disallow viewing of results, in some cases as required by law, if the sample sizes of particular subgroups fall below an acceptable number. In addition, care should be taken to allow access only to the appropriate individuals. As with score reports, the validity of interpretations from online supporting systems can be enhanced through usability research involving the intended score users.

Technology also facilitates close alignment of instructional materials with the results of educational tests. For example, results reported for an individual student could include not only strengths and weaknesses but direct links to specific instructional materials that a teacher may use with the student in the future. Rationales and documentation supporting the efficacy of the recommended interventions should be provided, and users should be encouraged to consider such information in conjunction with other evidence and judgments about student instructional needs.

When results are reported for large-scale assessments, the test sponsors or users should prepare accompanying guidance to promote sound use and valid interpretations of the data by the media and other stakeholders in the assessment process. Such communications should address likely testing consequences (both positive and negative), as well as anticipated misuses of the results.
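To make the earlier point about nominal versus effective weights concrete, the following sketch computes effective weights from component variances and covariances. It is not part of the Standards; the variance and covariance values are invented purely for illustration.

```python
import numpy as np

# Two component scores combined with equal nominal weights (0.5 / 0.5).
# Component 1 has much larger variance, so it dominates the composite
# even though the nominal weights are equal.
nominal_w = np.array([0.5, 0.5])
cov = np.array([[100.0, 10.0],    # var(X1) = 100, cov(X1, X2) = 10
                [10.0,  25.0]])   # var(X2) = 25

# Variance of the composite W'X, and each component's share of it.
composite_var = nominal_w @ cov @ nominal_w
contrib = nominal_w * (cov @ nominal_w)   # w_i * cov(X_i, composite)
effective_w = contrib / composite_var

print(composite_var)   # 36.25, driven mostly by component 1
print(effective_w)     # roughly [0.76, 0.24], far from the nominal 0.5 / 0.5
```

Even with equal nominal weights, the high-variance component contributes about three quarters of the composite variance, which is exactly the mismatch the text says should be understood and documented.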
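The noncompensatory and compensatory decision rules discussed earlier can be sketched in a few lines. The subjects and cut scores below are hypothetical, chosen only to show how the two rules can classify the same student differently.

```python
# Hypothetical cut scores for three content-area tests (illustrative only).
cuts = {"math": 60, "reading": 60, "science": 60}
composite_cut = 195  # hypothetical total required under the compensatory rule

def noncompensatory_pass(scores):
    """Student must reach the cut score on every test."""
    return all(scores[subj] >= cut for subj, cut in cuts.items())

def compensatory_pass(scores):
    """A strong score on one test can offset a weak score on another."""
    return sum(scores.values()) >= composite_cut

student = {"math": 55, "reading": 75, "science": 70}
print(noncompensatory_pass(student))  # False: math is below its cut
print(compensatory_pass(student))     # True: total of 200 clears 195
```

Run over a whole cohort, the two rules produce different passing rates and different classification errors, which is why the Standards ask developers to document such implications for policy makers.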
access to the construct for all individuals and subgroups for whom the assessment is intended.

Comment: It is important in educational contexts to provide for all students—regardless of their individual characteristics—the opportunity to demonstrate their proficiency on the construct being measured. Test specifications should clearly specify all relevant subgroups in the target population, including those for whom the test may not allow demonstration of knowledge and skills. Items and tasks should be designed to maximize access to the test content for all individuals in the intended test-taker population. Tools and strategies should be implemented to familiarize all test takers with the technology and testing format used, and the administration and scoring approach should avoid introducing any construct-irrelevant variance into the testing process. In situations where individual characteristics such as English language proficiency, cultural or linguistic background, disability, or age are believed to interfere with access to the construct(s) that the test is intended to measure, appropriate adaptations should be provided to allow access to the content, context, and response formats of the test items. These may include both accommodations (changes that are assumed to preserve the construct being measured) and modifications (changes that are assumed to make an altered version of the construct accessible). Additional considerations related to fairness and accessibility in educational tests and assessments are provided in chapter 3 (“Fairness in Testing”).

Standard 12.4

When a test is used as an indicator of achievement in an instructional domain or with respect to specified content standards, evidence of the extent to which the test samples the range of knowledge and elicits the processes reflected in the target domain should be provided. Both the tested and the target domains should be described in sufficient detail for their relationship to be evaluated. The analyses should make explicit those aspects of the target domain that the test represents, as well as those aspects that the test fails to represent.

Comment: Tests are commonly developed to monitor the status or progress of individuals and groups with respect to local, state, national, or professional content standards. Rarely can a single test cover the full range of performances reflected in the content standards. In developing a new test or selecting an existing test, appropriate interpretation of test scores as indicators of performance on these standards requires documenting and evaluating both the relevance of the test to the standards and the extent to which the test is aligned to the standards. Such alignment studies should address multiple criteria, including not only alignment of the test with the content areas covered by the standards but also alignment with the standards in terms of the range and complexity of knowledge and skills that students are expected to demonstrate. Further, conducting studies of the cognitive strategies and skills employed by test takers, or studies of the relationships between test scores and other performance indicators relevant to the broader target domain, enables evaluation of the extent to which generalizations to that domain are supported. This information should be made available to all who use the test or interpret the test scores.

Standard 12.5

Local norms should be developed when appropriate to support test users’ intended interpretations.

Comment: Comparison of examinees’ scores to local as well as more broadly representative norm groups can be informative. Thus, sample size permitting, local norms are often useful in conjunction with published norms, especially if the local population differs markedly from the population on which published norms are based. In some cases, local norms may be used exclusively.
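Where local norms are appropriate, a local percentile rank can be computed directly from the local score distribution. The sketch below is illustrative only and uses an invented norm group; sample size permitting, the same comparison could be made against both local and published norms.

```python
from bisect import bisect_right

def local_percentile_rank(score, local_scores):
    """Percentage of the local norm group scoring at or below `score`."""
    ordered = sorted(local_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)

# Invented local norm group of 20 students.
local = [48, 52, 55, 57, 60, 61, 63, 64, 66, 68,
         70, 71, 73, 75, 77, 78, 80, 83, 86, 90]

print(local_percentile_rank(70, local))  # 55.0: 11 of 20 score at or below 70
```

If the same raw score of 70 sits at a very different percentile in the published norms, that gap itself signals that the local population differs markedly from the norming population, as the comment above anticipates.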
as the primary assessment. In particular, evidence should be provided that the alternative approach measures the same skills and has the same passing expectations as the primary assessment.

Standard 12.10

In educational settings, a decision or characterization that will have major impact on a student should take into consideration not just scores from a single test but other relevant information.

Comment: In general, multiple measures or data sources will often enhance the appropriateness of decisions about students in educational settings and therefore should be considered by test sponsors and test users in establishing decision rules and policy. It is important that in addition to scores on a single test, other relevant information (e.g., school coursework, classroom observation, parental reports, other test scores) be taken into account when warranted. These additional data sources should demonstrate information relevant to the intended construct. For example, it may not be advisable or lawful to automatically accept students into a gifted program if their IQ is measured to be above 130 without considering additional relevant information about their performance. Similarly, some students with measured IQs below 130 may be accepted based on other measures or data sources, such as a test of creativity, a portfolio of student work, or teacher recommendations. In these cases, other evidence of gifted performance serves to compensate for the lower IQ test score.

Standard 12.11

When difference or growth scores are used for individual students, such scores should be clearly defined, and evidence of their validity, reliability/precision, and fairness should be reported.

Comment: The standard error of the difference between scores on the pretest and posttest, the regression of posttest scores on pretest scores, or relevant data from other appropriate methods for examining change should be reported.

In cases where growth scores are predicted for individual students, results based on different versions of tests taken over time may be used. For example, math scores in Grades 3, 4, and 5 may be used to predict the expected math score in Grade 6. In such cases, if complex statistical models are used to predict scores for individual students, the method for constructing the models should be made explicit and should be justified, and supporting technical and interpretive information should be provided to the score users. Chapter 13 (“Uses of Tests for Program Evaluation, Policy Studies, and Accountability”) addresses the application of more complex models to groups or systems within accountability settings.

Standard 12.12

When an individual student’s scores from different tests are compared, any educational decision based on the comparison should take into account the extent of overlap between the two constructs and the reliability or standard error of the difference score.

Comment: When difference scores between two tests are used to aid in making educational decisions, it is important that the two tests be placed on a common scale, either by standardization or by some other means, and, if appropriate, normed on the same population at about the same time. In addition, the reliability and standard error of the difference scores between the two tests are affected by the relationship between the constructs measured by the tests as well as by the standard errors of measurement of the scores of the two tests. For example, when scores on a nonverbal ability measure are compared with achievement test scores, the overlapping nature of the two constructs may render the reliability of the difference scores lower than test users normally would expect. If the ability and/or achievement tests involve a significant amount of measurement error, this will also reduce the confidence that can be placed in the difference scores. All these factors affect the reliability of difference scores between tests and should be considered when such scores
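The factors discussed in connection with difference scores combine in well-known classical formulas for the reliability and standard error of a difference. The sketch below assumes standardized scores with equal variances; all numeric values are invented for illustration and are not drawn from the Standards.

```python
import math

def difference_reliability(r_xx, r_yy, r_xy):
    """Reliability of the difference X - Y for standardized scores with
    equal variances: a high correlation between the two measures (r_xy)
    drives the reliability of the difference down."""
    return (0.5 * (r_xx + r_yy) - r_xy) / (1.0 - r_xy)

def se_difference(sd_x, r_xx, sd_y, r_yy):
    """Standard error of a difference score, combining the two standard
    errors of measurement (assuming independent errors)."""
    sem_x = sd_x * math.sqrt(1.0 - r_xx)
    sem_y = sd_y * math.sqrt(1.0 - r_yy)
    return math.sqrt(sem_x ** 2 + sem_y ** 2)

# Two reliable tests (0.90 each) measuring overlapping constructs (r = 0.75):
print(difference_reliability(0.90, 0.90, 0.75))  # 0.6, far below 0.90
print(se_difference(15.0, 0.90, 15.0, 0.90))     # about 6.7 score points
```

Even though each test is quite reliable on its own, the overlap between the constructs leaves the difference score with a reliability of only .60, which is exactly the effect the comment warns test users to expect.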
and the availability of experts to answer questions that arise as test results are disseminated.

The interpretation of some test scores is sufficiently complex to require that the user have relevant training and experience or be assisted by and consult with persons who have such training and experience. Examples of such tests include individually administered intelligence tests, interest inventories, growth scores on state assessments, projective tests, and neuropsychological tests.

Cluster 3. Administration, Scoring, and Reporting of Educational Assessments

Standard 12.16

Those responsible for educational testing programs should provide appropriate training, documentation, and oversight so that the individuals who administer and score the test(s) are proficient in the appropriate test administration and scoring procedures and understand the importance of adhering to the directions provided by the test developer.

Comment: In addition to being familiar with standardized test administration documentation and procedures (including test security protocols), it is important for test coordinators and test administrators to be familiar with materials and procedures for accommodations and modifications for testing. Test developers should therefore provide appropriate manuals and training materials that specifically address accommodated administrations. Test coordinators and test administrators should also receive information about the characteristics of the student populations included in the testing program.

Standard 12.17

In educational settings, reports of group differences in test scores should be accompanied by relevant contextual information, where possible, to enable meaningful interpretation of the differences. Where appropriate contextual information is not available, users should be cautioned against misinterpretation.

Comment: Differences in test scores between relevant subgroups (e.g., classified by gender, race/ethnicity, school/district, or geographical region) can be influenced, for example, by differences in student characteristics, in course-taking patterns, in curriculum, in teachers’ qualifications, or in parental educational levels. Differences in performance of cohorts of students across time may be influenced by changes in the population of students tested or changes in learning opportunities for students. Users should be advised to consider the appropriate contextual information and be cautioned against misinterpretation.

Standard 12.18

In educational settings, score reports should be accompanied by a clear presentation of information on how to interpret the scores, including the degree of measurement error associated with each score or classification level, and by supplementary information related to group summary scores. In addition, dates of test administration and relevant norming studies should be included in score reports.

Comment: Score information should be communicated in a way that is accessible to persons receiving the score report. Empirical research involving score report users can help to improve the clarity of reports. For instance, the degree of uncertainty in the scores might be represented by presenting standard errors of measurement graphically; or the probability of misclassification associated with performance levels might be provided. Similarly, when average or summary scores for groups of students are reported, they should be supplemented with additional information about the sample sizes and the shapes or dispersions of score distributions. Particular care should be taken to portray subscore information in score reports in ways that facilitate proper interpretation. Score reports should include the date of administration so that score users can consider the validity of inferences as time passes. Score reports should also include the dates of relevant norming studies so users can consider the age of the norms in making inferences about student performance.
13. USES OF TESTS FOR PROGRAM EVALUATION, POLICY STUDIES, AND ACCOUNTABILITY
BACKGROUND
Tests are widely used to inform decisions as part of public policy. One example is the use of tests in the context of the design and evaluation of programs or policy initiatives. Program evaluation is the set of procedures used to make judgments about a program’s design, its implementation, and its outcomes. Policy studies are somewhat broader than program evaluations; they contribute to judgments about plans, principles, or procedures enacted to achieve broad public goals. Tests often provide the data that are analyzed to estimate the effect of a policy, program, or initiative on outcomes such as student achievement or motivation. A second broad category of test use in policy settings is in accountability systems, which attach consequences (e.g., rewards and sanctions) to the performance of institutions (such as schools or school districts) or individuals (such as teachers or mental health care providers). Program evaluations, policy studies, and accountability systems should not necessarily be viewed as discrete categories. They are frequently adopted in combination with one another, as is the case when accountability systems impose requirements or recommendations to use test results for evaluating programs adopted by schools or districts.

The uses of tests for program evaluations, policy studies, and accountability share several characteristics, including measurement of the performance of a group of people and use of test scores as evidence of the success or shortcomings of an institution or initiative. This chapter examines these uses of tests. The accountability discussion focuses on systems that involve aggregates of scores, such as school-wide or institution-wide averages, percentages of students or patients scoring above a certain level, or growth or value-added modeling results aggregated at the classroom, school, or institution level. Systems or programs that focus on accountability for individual students, such as through test-based promotion policies or graduation exams, are addressed in chapter 12. (However, many of the issues raised in that chapter are relevant to the use of educational tests for program evaluation or school accountability purposes.) If accountability systems or programs include tests administered to teachers, principals, or other providers for purposes of evaluating their practice or performance (e.g., for teacher pay-for-performance programs that include a test of teacher knowledge or an observation-based measure of their practices), those tests should be evaluated according to the standards related to workplace testing and credentialing in chapter 11.

The contexts in which testing for evaluation and accountability takes place vary in the stakes for test takers and for those who are responsible for promoting specific outcomes (such as teachers or health care providers). Testing programs for institutions can have high stakes when the aggregate performance of a sample or of the entire population of test takers is used to make inferences about the quality of services provided and, as a result, decisions are made about institutional status, rewards, or sanctions. For example, the quality of reading curriculum and instruction may be judged in part on the basis of results of testing for levels of attainment reached by groups of students. Similarly, aggregated scores on psychological tests are sometimes used to evaluate the effectiveness of treatment provided by mental health programs or agencies and may be included in accountability systems. Even when test results are reported in the aggregate and intended for low-stakes purposes,
CHAPTER 13
the public release of data may be used to inform judgments about program quality, personnel, or educational programs and may influence policy decisions.

Evaluation of Programs and Policy Initiatives

As noted earlier, program evaluation typically involves making judgments about a single program, whereas policy studies address plans, principles, or procedures enacted to achieve broad public goals. Policy studies may address policies at various levels of government, including local, state, federal, and international, and may be conducted in both public and private organizational or institutional contexts. There is no sharp distinction between policy studies and program evaluations, and in many instances there is substantial overlap between the two types of investigations. Test results are often one important source of evidence for the initiation, continuation, modification, termination, or expansion of various programs and policies.

Tests may be used in program evaluations or policy studies to provide information on the status of clients, students, or other groups before, during, or after an intervention or policy enactment, as well as to provide score information for appropriate comparison groups. Whereas many testing activities are intended to document the performance of individual test takers, program evaluation and policy studies target the performance of groups or the impact of the test results on these groups. A variety of tests can be used for evaluating programs and policies; examples include standardized achievement tests administered by states or districts, published psychological tests that measure outcomes of interest, and measures developed specifically for the purposes of the evaluation. In addition, evaluations of programs and policies sometimes synthesize results from multiple studies or tests.

It is important to evaluate any proposed test in terms of its relevance to the goals of the program or policy and/or to the particular questions its use will address. It is relatively rare for a test to be designed specifically for program evaluation or policy study purposes, and therefore it is often necessary for those who conduct such studies to rely on measures developed for other purposes. In addition, for reasons of cost or convenience, certain tests may be adopted for use in a program evaluation or policy study even though they were developed for a somewhat different population of respondents. Some tests may be selected because they are well known and thought to be especially credible in the view of clients or public consumers, or because useful data already exist from earlier administrations of the tests. Evidence for the validity of test scores for the intended uses should be provided whenever tests are used for program or policy evaluations or for accountability purposes.

Because of administrative realities, such as cost constraints and response burden, methodological refinements may be adopted to increase the efficiency of testing. One strategy is to obtain a sample of participants to be evaluated from the larger set of those exposed to a program or policy. When a sufficient number of clients are affected by the program or policy that will be evaluated, and when there is a desire to limit the time spent on testing, evaluators can create multiple forms of short tests from a larger pool of items. By constructing a number of test forms consisting of relatively few items each and assigning the test forms to different subsamples of test takers (a procedure known as matrix sampling), a larger number of items can be included in the study than could reasonably be administered to any single test taker. When it is desirable to represent a domain with a large number of test items, this approach is often used. However, in matrix sample testing, individual scores usually are not created or interpreted. Because procedures for sampling individuals or test items may vary in a number of ways, adequate analysis and interpretation of test results depend on a clear description of how samples were formed and how the tests were designed, scored, and reported. Reports of test results used for evaluation or accountability should describe the sampling strategy and the extent to which the sample is representative of the population that is relevant to the intended inferences.

Evaluations and policy studies sometimes rely on secondary data analysis: analysis of data previously
collected for other purposes. In some circumstances, it may be difficult to ensure a good match between the existing test and the intervention or policy under examination, or to reconstruct in detail the conditions under which the data were originally collected. Secondary data analysis also requires consideration of the privacy rights of test takers and others affected by the analysis. Sometimes this requires determining whether the informed consent obtained from participants in the original data collection was adequate to allow secondary analysis to proceed without a need for additional consent. It may also require an understanding of the extent to which individually identifiable information has been redacted from the data set consistent with applicable legal standards. In selecting (or developing) a test or deciding whether to use existing data in evaluation and policy studies, careful investigators attempt to balance the purpose of the test, the likelihood that it will be sensitive to the intervention under study, its credibility to interested parties, and the costs of administration. Otherwise, test results may lead to inappropriate conclusions about the progress, impact, and overall value of programs and policies under review.

Interpretation of test scores in program evaluation and policy studies usually entails complex analysis of a number of variables. For example, some programs are mandated for a broad population; others target only certain subgroups. Some are designed to affect attitudes, beliefs, or values; others are intended to have a more direct impact on behavior, knowledge, or skills. It is important that the participants included in any study meet the specified criteria for participating in the program or policy under review, so that appropriate interpretation of test results will be possible. Test results will reflect not only the effects of rules for participant selection and the impact on the participants of taking part in programs or treatments, but also the characteristics of the participants. Relevant background information about clients or students may be obtained to strengthen the inferences derived from the test results. Valid interpretations may depend on additional considerations of the test or its technical quality, including study design, administrative feasibility, and the quality of other available data. This chapter focuses on testing and does not deal with these other considerations in any substantial way. In order to develop defensible conclusions, however, investigators conducting program evaluations and policy studies should supplement test results with data from other sources. These data may include information about program characteristics, delivery, costs, client backgrounds, degree of participation, and evidence of side effects. Because test results lend important weight to evaluation and policy studies, it is critical that any tests used in these investigations be sensitive to the questions of the study and appropriate for the test takers.

Test-Based Accountability Systems

The inclusion of test scores in educational accountability systems has become common in the United States and in other nations. Most test-based educational accountability in the United States takes place at the K–12 level, but many of the issues raised in the K–12 context are relevant to efforts to adopt outcomes-based accountability in postsecondary education. In addition, accountability systems may incorporate information from longitudinal data systems linking students’ performance on tests and other indicators, including systems that capture a cohort’s performance from preschool through higher education and into the workforce. Test-based accountability sometimes occurs in sectors other than education; one example is the use of psychological tests to create measures of effectiveness for providers of mental health care. These uses of tests raise issues similar to those that arise in educational contexts.

Test-based accountability systems take a variety of approaches to measuring performance and holding individuals or groups accountable for that performance. These systems vary along a number of dimensions, including the unit of accountability (e.g., district, school, teacher), the stakes attached to results, the frequency of measurement, and whether nontest indicators are in
that have nothing to do with the appropriateness cluded in the accountability system. One important
CHAPTER 13
measurement concern in accountability stems from the construction of an accountability index: a number or label that reflects a set of rules for combining scores and other information to arrive at conclusions and inform decision making. An accountability index could be as simple as an average test score for students in a particular grade in a particular school, but most systems rely on more complex indices. These may involve a set of rules (often called decision rules) for synthesizing multiple sources of information, such as test scores, graduation rates, course-taking rates, and teacher qualifications. An accountability index may also be created from applications of complex statistical models such as those used in value-added modeling approaches. As discussed in chapter 12, for high-stakes decisions, such as classification of schools or teachers into performance categories that are linked to rewards or sanctions, the establishment of rules used to create accountability indices should be informed by a consideration of the nature of the information the system is intended to provide and by an understanding of how consequences will be affected by these rules. The implications of the rules should be communicated to decision makers so that they understand the consequences of any policy decisions based on the accountability index.

Test-based accountability systems include interpretations and assumptions that go beyond those for the interpretation of the test scores on which they are based; therefore, they require additional evidence to support their validity. Accountability systems in education typically aggregate scores over the students in a class or school, and may use complex mathematical models to generate a summary statistic, or index, for each teacher or school. These indices are often interpreted as estimates of the effectiveness of the teacher or school. Users of information from accountability systems might assume that the accountability indices provide valid indicators of the intended outcomes of education (e.g., mastery of the skills and knowledge described in the state content standards), that differences among indices can be attributed to differences in the effectiveness of the teacher or school, and that these differences are reasonably stable over time and across students and items. These assumptions must be supported by evidence. Moreover, those responsible for developing or implementing test-based accountability systems often assert that these systems will lead to specific outcomes, such as increased educator motivation or improved achievement; these assertions should also be supported by evidence. In particular, efforts should be made to investigate any potential positive or negative consequences of the selected accountability system.

Similarly, the choice of specific rules and data that are used to create an accountability index should reflect the goals and values of those who are developing the accountability system, as well as the inferences that the system is designed to support. For example, if a primary goal of an accountability system is to identify teachers who are effective at improving student achievement, the accountability index should be based on assessments that are closely aligned with the content the teacher is expected to cover, and should take into account factors outside the teacher's control. The process typically involves decisions such as whether to measure percentages above a cut score or an average of scale scores, whether to measure status or growth, how to combine information for multiple subjects and grade levels, and whether to measure performance against a fixed target or use a rank-based approach. The development of an accountability index also involves political considerations, such as how to balance technical concerns and transparency.

Issues in Program and Policy Evaluation and Accountability

Test results are sometimes used as one way to motivate program administrators or other service providers as well as to infer institutional effectiveness. This use of tests, including the public reporting of results, is thought to encourage an institution to improve its services for its clients. For example, in some test-based accountability systems, consistently poor results on achievement tests at the school level may result in interventions that affect
USES OF TESTS FOR PROGRAM EVALUATION, POLICY STUDIES, AND ACCOUNTABILITY
the school's staffing or operations. The interpretation of test results is especially complex when tests are used both as an institutional policy mechanism and as a measure of effectiveness. For example, a policy or program may be based on the assumption that providing clear goals and general specifications of test content (such as the types of topics, constructs, cognitive domains, and response formats included in the test) may be a reasonable strategy to communicate new expectations to educators. Yet the desire to influence test or evaluation results to show acceptable institutional performance could lead to inappropriate testing practices, such as teaching the test items in advance, modifying test administration procedures, discouraging certain students or clients from participating in the testing sessions, or focusing teaching exclusively on test-taking skills. These responses illustrate that the more an indicator is used for decision making, the more likely it is to become corrupted and distort the process that it was intended to measure. Undesirable practices such as excessive emphasis on test-taking skills might replace practices aimed at helping the test takers learn the broader domains measured by the test. Because results derived from such practices may lead to spuriously high estimates of performance, the diligent investigator should estimate the impact of changes in teaching practices that may result from testing in order to interpret the test results appropriately. Looking at possible inappropriate consequences of tests as well as their benefits will result in more accurate assessment of policy claims that particular types of testing programs lead to improved performance.

Investigators conducting policy studies and program evaluations may give no clear reasons to the test takers for participating in the testing procedure, and they often withhold the results from the test takers. When matrix sampling is used for program evaluation, it may not be feasible to provide such reports. If little effort is made to motivate the test takers to regard the test seriously (e.g., if the purpose of the test is not explained), the test takers may have little reason to maximize their effort on the test. The test results thus may misrepresent the impact of a program, institution, or policy. When there is suspicion that a test has not been taken seriously, the motivation of test takers may be explored by collecting additional information where feasible, using observation or interview methods. Issues of inappropriate preparation and unmotivated performance raise questions about the validity of interpretations of test results. In every case, it is important to consider the potential impact on the test taker of the testing process itself, including test administration and reporting practices.

Public policy decisions are rarely based solely on the results of empirical studies, even when the studies are of high quality. The more expansive and indirect the policy, the more likely it is that other considerations will come into play, such as the political and economic impact of abandoning, changing, or retaining the policy, or the reactions of various stakeholders when institutions become the targets of rewards or sanctions. Tests used in policy settings may be subjected to intense and detailed scrutiny for political reasons. When the test results contradict a favored position, attempts may be made to discredit the testing procedure, content, or interpretation. Test users should be able to defend the use of the test and the interpretation of results but should also recognize that they cannot control the reactions of stakeholder groups.

It is essential that all tests used in accountability, program evaluation, or policy contexts meet the standards for validity, reliability, and fairness appropriate to the intended test score interpretations and use. Moreover, as described in chapter 6, tests should be administered by personnel who are appropriately trained to implement the test administration procedures. It is also essential that assistance be provided to those responsible for interpreting study results for practitioners, the lay public, and the media. Careful communication about goals, procedures, findings, and limitations increases the likelihood that the interpretations of the results will be accurate and useful.

Additional Considerations

This chapter and its associated standards are directed to users of tests in program evaluations,
policy studies, and accountability systems. Users include those who mandate, design, or implement these evaluations, studies, or systems and those who make decisions based on the information they provide. Users include, among others, psychologists who develop, evaluate, or enforce policies, as well as educators, administrators, and policy makers who are engaged in efforts to measure school performance or evaluate the effectiveness of education policies or programs. In addition to the standards below, users should consider other available documents containing relevant standards.
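The decision rules for constructing accountability indices discussed in this chapter can be made concrete with a short sketch. Everything in it is hypothetical: the weights, the cut score of 50, the graduation-rate indicator, and the student scores are invented for illustration and are not drawn from the Standards or any actual accountability system. The sketch computes a percent-above-cut index, a mean-scale-score index, and a weighted composite, and shows that the two test-based indices need not rank schools the same way.

```python
# Hypothetical sketch of accountability-index construction. The weights,
# cut score, and data below are invented for illustration only.

def percent_above_cut(scores, cut):
    """Share of students at or above a cut score (a 'status' index)."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)

def mean_scale_score(scores):
    """Average scale score (an alternative status index)."""
    return sum(scores) / len(scores)

def composite_index(scores, grad_rate, cut, w_test=0.7, w_grad=0.3):
    """Weighted composite of a test index and a nontest indicator.

    The choice of weights (here a hypothetical 0.7/0.3) and of the test
    index itself are policy decisions; both shape the composite's
    distribution.
    """
    return w_test * percent_above_cut(scores, cut) + w_grad * grad_rate

# Two hypothetical schools with the same number of students.
school_a = [48, 49, 51, 52, 90]   # clustered near the cut, one high score
school_b = [30, 65, 66, 67, 68]   # one low score, the rest clearly above

CUT = 50
pac_a, pac_b = percent_above_cut(school_a, CUT), percent_above_cut(school_b, CUT)
mean_a, mean_b = mean_scale_score(school_a), mean_scale_score(school_b)
print(pac_a, pac_b)    # 60.0 80.0 -> school B looks much stronger
print(mean_a, mean_b)  # 58.0 59.2 -> the schools look nearly alike
print(composite_index(school_a, grad_rate=95.0, cut=CUT))
```

Because the two test-based indices disagree about how different the schools are, the decision rule chosen (and the weight given to each component) can change which school is rewarded or sanctioned, which is why such rules warrant the same scrutiny as the underlying scores.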
Standard 13.3

When accountability indices, indicators, or models (e.g., value-added models) are used, the method for constructing such indices, indicators, or models should be described and justified, and their technical qualities should be reported.

Comment: An index that is constructed by manipulating and combining test scores should be subjected to the same validity, reliability, and fairness investigations that are expected for the test scores that underlie the index. The methods and rules for constructing such indices should be made available to users, along with documentation of their technical properties. The strengths and limitations of various approaches to combining scores should be evaluated, and information that would allow independent replication of the construction of indices, indicators, or models should be made available for use by appropriate parties.

As with regular test scores, a validity argument should be set forth to justify inferences about indices as measures of a desired outcome. It is important to help users understand the extent to which the models support causal inferences. For example, when value-added estimates are used as measures of teachers' effectiveness in improving student achievement, evidence for the appropriateness of this inference needs to be provided. Similarly, if published ratings of health care providers are based on indices constructed from psychological test scores of their patients, the public information should include information to help users understand what inferences about provider performance are warranted. Developers and users of indices should be aware of ways in which the process of combining individual scores into an index may introduce technical problems that did not affect the original scores. Linking errors, floor or ceiling effects, differences in variability across different measures, and lack of an interval scale are examples of features that may not be problematic for the purpose of interpreting individual test scores but can become problematic when scores are combined into an aggregate measure. Finally, when evaluations or accountability systems rely on measures that combine various sources of information, such as when scores on multiple forms of a test are combined or when nontest information is included in an accountability index, the rules for combining the information need to be made explicit and must be justified. It is important to recognize that when multiple sources of data are collapsed into a single composite score or rating, the weights and distributional characteristics of the sources will affect the distribution of the composite scores. The effects of the weighting and distributional characteristics on the composite score should be investigated.

When indices combine scores from tests administered under standard conditions with those that involve modifications or other changes to administration conditions, there should be a clear rationale for combining the information into a single index, and the implications for validity and reliability should be examined.

Cluster 2. Interpretations and Uses of Information From Tests Used in Program Evaluation, Policy Studies, and Accountability Systems

Standard 13.4

Evidence of validity, reliability, and fairness for each purpose for which a test is used in a program evaluation, policy study, or accountability system should be collected and made available.

Comment: Evidence should be provided of the suitability of a test for use in program evaluation, policy studies, or accountability systems, including the relevance of the test to the goals of the program, policy, or system under study and the suitability of the test for the populations involved. Those responsible for the release or reporting of test results should provide and explain any supplemental information that will minimize possible misinterpretations or misuse of the data. In particular, if an evaluation or accountability system is designed to support interpretations regarding the effectiveness of a program, institution, or provider, the validity of these interpretations for the intended uses should be investigated and documented. Reports should include cautions against
making unwarranted inferences, such as holding health care providers accountable for test-score changes that may not be under their control. If the use involves a classification of persons, institutions, or programs into distinct categories, the consistency, accuracy, and fairness of the classifications should be reported. If the same test is used for multiple purposes (e.g., monitoring achievement of individual students; providing information to assist in instructional planning for individuals or groups of students; evaluating districts, schools, or teachers), evidence related to the validity of interpretations for each of these uses should be gathered and provided to users, and the potential negative effects for certain uses (e.g., improving instruction) that might result from unintended uses (e.g., high-stakes accountability) need to be considered and mitigated. When tests are used to evaluate the performance of personnel, the suitability of the tests for different groups of personnel (e.g., regular teachers, special education teachers, principals) should be examined.

Standard 13.5

Those responsible for the development and use of tests for evaluation or accountability purposes should take steps to promote accurate interpretations and appropriate uses for all groups for which results will be applied.

Comment: Those responsible for measuring outcomes should, to the extent possible, design the testing process to promote access and to maximize the validity of interpretations (e.g., by providing appropriate accommodations) for any relevant subgroups of test takers who participate in program or policy evaluation. Users of secondary data should clearly describe the extent to which the population included in the test-score database includes all relevant subgroups. The users should also document any exclusion rules that were applied and any other changes to the testing process that could affect interpretations of results. Similarly, users of tests for accountability purposes should make every effort to include all relevant subgroups in the testing program; provide documentation of any exclusion rules, testing modifications, or other changes to the test or administration conditions; and provide evidence regarding the validity of score interpretations for subgroups. When summaries of test scores are reported separately by subgroup (e.g., by racial/ethnic group), test users should conduct analyses to evaluate the reliability/precision of scores for these groups and the validity of score interpretations, and should report this information when publishing the score summaries. Analyses of complex indices used for accountability or for measuring program effectiveness should address the possibility of bias against specific subgroups or against programs or institutions serving those subgroups. If bias is detected (e.g., if scores on the index are shown to be subject to systematic error that is related to examinee characteristics such as race/ethnicity), these indices should not be used unless they are modified in a way that removes the bias. Additional considerations related to fairness and accessibility in educational tests and assessments are provided in chapter 3.

When test results are used to support actions regarding program or policy adoption or change, the professionals who are expected to make interpretations leading to these actions may need assistance in interpreting test results for this purpose. Advances in technology have led to increased availability of data and reports among teachers, administrators, and others who may not have received training in appropriate test use and interpretation or in analysis of test-score data. Those who provide the data or tools have the responsibility to offer support and assistance to users, and users have the responsibility to seek guidance on appropriate analysis and interpretation. Those responsible for the release or reporting of test results should provide and explain any supplemental information that will minimize possible misinterpretations of the data.

Often, the test results for program evaluation or policy analysis are analyzed well after the tests have been given. When this is the case, the user should investigate and describe the context in which the tests were given. Factors such as inclusion/exclusion rules, test purpose, content sampling,
instructional alignment, and the attachment of high stakes can affect the aggregated results and should be made known to the audiences for the evaluation or analysis.

Standard 13.6

Reports of group differences in test performance should be accompanied by relevant contextual information, where possible, to enable meaningful interpretation of the differences. If appropriate contextual information is not available, users should be cautioned against misinterpretation.

Comment: Observed differences in average test scores between groups (e.g., classified by gender, race/ethnicity, disability, language proficiency, socioeconomic status, or geographical region) can be influenced by differences in factors such as opportunity to learn, training experience, effort, instructor quality, and level and type of parental support. In education, differences in group performance across time may be influenced by changes in the population of those tested (including changes in sample size) or changes in their experiences. Users should be advised to consider the appropriate contextual information when interpreting these group differences and when designing policies or practices to address those differences. In addition, if evaluations involve comparisons of test scores across national borders, evidence for the comparability of scores should be provided.

Standard 13.7

When tests are selected for use in evaluation or accountability settings, the ways in which the test results are intended to be used, and the consequences they are expected to promote, should be clearly described, along with cautions against inappropriate uses.

Comment: In some contexts, such as evaluation of a specific curriculum program, a test may have a limited purpose and may not be intended to promote specific outcomes other than informing the evaluation. In other settings, particularly with test-based accountability systems, the use of tests is often justified on the grounds that it will improve the quality of education by providing useful information to decision makers and by creating incentives to promote better performance by educators and students. These kinds of claims should be made explicit when the system is mandated or adopted, and evidence to support their validity should be provided when available. The collection and reporting of evidence for a particular validity claim should be incorporated into the program design. A given claim for the benefits of test use, such as improving students' achievement, may be supported by logical or theoretical argument as well as empirical data. Due weight should be given to findings in the scientific literature that may be inconsistent with the stated claim.

Standard 13.8

Those who mandate the use of tests in policy, evaluation, and accountability contexts and those who use tests in such contexts should monitor their impact and should identify and minimize negative consequences.

Comment: The use of tests in policy, evaluation, and accountability settings may, in some cases, lead to unanticipated consequences. Particularly when high stakes are attached, those who mandate tests, as well as those who use the results, should take steps to identify potential unanticipated consequences. Unintended negative consequences may include teaching test items in advance, modifying test administration procedures, and discouraging or excluding certain test takers from taking the test. These practices can lead to spuriously high scores that do not reflect performance on the underlying construct or domain of interest. In addition, these practices may be prohibited by law. Testing procedures should be designed to minimize the likelihood of such consequences, and users should be given guidance and encouragement to refrain from inappropriate test-preparation practices.

Some consequences can be anticipated on the basis of past research and understanding of how people respond to incentives. For example, research
shows that educational accountability tests influence curriculum and instruction by signaling what is important for students to know and be able to do. This influence can be positive if a test encourages a focus on valuable learning outcomes, but it is negative if it narrows the curriculum in unintended ways. These and other common negative consequences, such as possible motivational impact on teachers and students (even when test results are used as intended) and increasing dropout rates, should be studied and the results taken into consideration. The integrity of test results should be maintained by striving to eliminate practices designed to raise test scores without improving performance on the construct or domain measured by the test. In addition, administering an audit measure (i.e., another measure of the tested construct) may detect possible corruption of scores.

Standard 13.9

In evaluation or accountability settings, test results should be used in conjunction with information from other sources when the use of the additional information contributes to the validity of the overall interpretation.

Comment: Performance on indicators other than tests is almost always useful and in many cases essential. Descriptions or analyses of such variables as client selection criteria, services, client characteristics, setting, and resources are often needed to provide a comprehensive picture of the program or policy under review and to aid in the interpretation of test results. In the accountability context, a decision that will have a major impact on an individual such as a teacher or health care provider, or on an organization such as a school or treatment facility, should take into consideration other relevant information in addition to test scores. Examples of other information that may be incorporated into evaluations or accountability systems are measures of educators' or health care providers' practices (e.g., classroom observations, checklists) and nontest measures of student attainment (course taking, college attendance).

In the case of value-added modeling, some researchers have argued for the inclusion of student demographic characteristics (e.g., race/ethnicity and socioeconomic status) as controls, whereas other work suggests that including such variables does not improve the performance of the measures and can promote undesirable consequences such as a perception that lower standards are being set for some students than for others. Decisions regarding what variables to include in such models should be informed by empirical evidence regarding the effects of their inclusion or exclusion.

An additional type of information that is relevant to the interpretation of test results in policy settings is the degree of motivation of the test takers. It is important to determine whether test takers regard the test experience seriously, particularly when individual scores are not reported to test takers or when the scores are not associated with consequences for the test takers. Decision criteria regarding whether to include scores from individuals with questionable motivation should be clearly documented.
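The value-added estimates discussed above can be illustrated with a deliberately simplified sketch. Every detail is hypothetical: the student data are invented, the model uses a single prior-score covariate fit by ordinary least squares, and no demographic controls are included; operational value-added models are far more elaborate. Here a teacher's estimate is simply the average amount by which that teacher's students outperform the prediction based on their prior scores.

```python
# Hypothetical value-added sketch: regress current scores on prior scores,
# then average each teacher's residuals. The data and single-covariate
# model are simplifications for illustration only.
from collections import defaultdict

def ols_fit(x, y):
    """Slope and intercept for simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def value_added(records):
    """records: list of (teacher, prior_score, current_score) tuples.

    Returns {teacher: mean residual}, i.e., how far a teacher's students
    score above or below what their prior scores alone would predict.
    """
    prior = [r[1] for r in records]
    current = [r[2] for r in records]
    slope, intercept = ols_fit(prior, current)
    residuals = defaultdict(list)
    for teacher, p, c in records:
        residuals[teacher].append(c - (intercept + slope * p))
    return {t: sum(rs) / len(rs) for t, rs in residuals.items()}

# Invented students: teacher B's students gain more than their prior
# scores predict; teacher A's students gain less.
data = [("A", 40, 42), ("A", 50, 51), ("A", 60, 60),
        ("B", 40, 46), ("B", 50, 55), ("B", 60, 64)]
va = value_added(data)
print(va)  # teacher B's estimate exceeds teacher A's
```

Even in this toy form, the estimate depends entirely on which covariates enter the regression, which is the modeling choice the comment above describes as contested.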
GLOSSARY
This glossary provides definitions of terms as used in the text and standards. For many of the terms, multiple definitions can be found in the literature; also, technical usage may differ from common usage.

ability parameter: In item response theory (IRT), a theoretical value indicating the level of a test taker on the ability or trait measured by the test; analogous to the concept of true score in classical test theory.

ability testing: The use of tests to evaluate the current performance of a person in some defined domain of cognitive, psychomotor, or physical functioning.

accessibility: The degree to which the items or tasks on a test enable as many test takers as possible to demonstrate their standing on the target construct without being impeded by characteristics of the item that are irrelevant to the construct being measured. A test that ranks high on this criterion is referred to as accessible.

accommodations/test accommodations: Adjustments that do not alter the assessed construct that are applied to test presentation, environment, content, format (including response format), or administration conditions for particular test takers, and that are embedded within assessments or applied after the assessment is designed. Tests or assessments with such accommodations, and their scores, are said to be accommodated. Accommodated scores should be sufficiently comparable to unaccommodated scores that they can be aggregated together.

accountability index: A number or label that reflects a set of rules for combining scores and other information to form conclusions and inform decision making in an accountability system.

accountability system: A system that imposes student performance-based rewards or sanctions on institutions such as schools or school systems or on individuals such as teachers or mental health care providers.

acculturation: A process related to the acquisition of cultural knowledge and artifacts that is developmental in nature and dependent upon time of exposure and opportunity for learning.

achievement levels/proficiency levels: Descriptions of test takers' levels of competency in a particular area of knowledge or skill, usually defined in terms of categories ordered on a continuum, for example from "basic" to "advanced," or "novice" to "expert." The categories constitute broad ranges for classifying performance. See cut score.

achievement standards: See performance standards.

achievement test: A test to measure the extent of knowledge or skill attained by a test taker in a content domain in which the test taker has received instruction.

adaptation/test adaptation: 1. Any change in test content, format (including response format), or administration conditions that is made to increase a test's accessibility for individuals who otherwise would face construct-irrelevant barriers on the original test. An adaptation may or may not change the meaning of the construct being measured or alter score interpretations. An adaptation that changes score meaning is referred to as a modification; an adaptation that does not change the score meaning is referred to as an accommodation (see definitions in this glossary). 2. Change made to a test that has been translated into the language of a target group and that takes into account the nuances of the language and culture of that group.

adaptive test: A sequential form of individual testing in which successive items, or sets of items, in the test are selected for administration based primarily on their psychometric properties and content, in relation to the test taker's responses to previous items.

adjusted validity or reliability coefficient: A validity or reliability coefficient (most often, a product-moment correlation) that has been adjusted to offset the effects of differences in score variability, criterion variability, or the unreliability of test and/or criterion scores. See restriction of range or variability.

aggregate score: A total score formed by combining scores on the same test or across test components. The scores may be raw or standardized. The components of the aggregate score may be weighted or not, depending on the interpretation to be given to the aggregate score.
215
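The adaptive-test entry above describes selecting successive items based on the test taker's previous responses. A minimal Python sketch of that idea, choosing each next item by matching its difficulty to a running ability estimate; the fixed-step ability update and the item parameters are illustrative assumptions, not a procedure prescribed by the Standards (operational adaptive tests use IRT-based estimation):

```python
# Illustrative adaptive item selection; parameters and update rule are invented.
def run_adaptive_test(items, answer, n_items=3):
    """items: dict of item_id -> difficulty; answer(item_id) -> True/False."""
    theta = 0.0          # provisional ability estimate
    remaining = dict(items)
    administered = []
    for _ in range(n_items):
        # choose the unused item whose difficulty is closest to theta
        item_id = min(remaining, key=lambda i: abs(remaining[i] - theta))
        correct = answer(item_id)
        # crude fixed-step update (real adaptive tests re-estimate theta via IRT)
        theta += 0.5 if correct else -0.5
        administered.append((item_id, correct))
        del remaining[item_id]
    return theta, administered

# Example: a test taker who answers every administered item correctly
items = {"a": -1.0, "b": 0.0, "c": 1.0, "d": 2.0}
theta, log = run_adaptive_test(items, lambda i: True)  # theta -> 1.5
```

As the glossary entry notes, item selection is driven by the interaction of item properties with the test taker's running performance, which the loop above captures in miniature.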
GLOSSARY
alignment: The degree to which the content and cognitive demands of test questions match targeted content and cognitive demands described in the test specifications.

alternate assessments/alternate tests: Assessments or tests used to evaluate the performance of students in educational settings who are unable to participate in standardized accountability assessments, even with accommodations. Alternate assessments or tests typically measure achievement relative to alternate content standards.

alternate forms: Two or more versions of a test that are considered interchangeable, in that they measure the same constructs in the same ways, are built to the same content and statistical specifications, and are administered under the same conditions using the same directions. See equivalent forms, parallel forms.

alternate or alternative standards: Content and performance standards in educational assessment for students with significant cognitive disabilities.

analytic scoring: A method of scoring constructed responses (such as essays) in which each critical dimension of a particular performance is judged and scored separately, and the resultant values are combined for an overall score. In some instances, scores on the separate dimensions may also be used in interpreting performance. Contrast with holistic scoring.

anchor items: Items administered with each of two or more alternate forms of a test for the purpose of equating the scores obtained on these alternate forms.

anchor test: A set of anchor items used for equating.

assessment: Any systematic method of obtaining information, used to draw inferences about characteristics of people, objects, or programs; a systematic process to measure or evaluate the characteristics or performance of individuals, programs, or other entities, for purposes of drawing inferences; sometimes used synonymously with test.

assessment literacy: Knowledge about testing that supports valid interpretations of test scores for their intended purposes, such as knowledge about test development practices, test score interpretations, threats to valid score interpretations, score reliability and precision, test administration, and use.

automated scoring: A procedure by which constructed response items are scored by computer using a rules-based approach.

battery: A set of tests usually administered as a unit. The scores on the tests usually are scaled so that they can readily be compared or used in combination for decision making.

behavioral science: A scientific discipline, such as sociology, anthropology, or psychology, in which the actions and reactions of humans and animals are studied through observational and experimental methods.

benchmark assessments: Assessments administered in educational settings at specified times during a curriculum sequence, to evaluate students' knowledge and skills relative to an explicit set of longer-term learning goals. See interim assessments or tests.

bias: 1. In test fairness, construct underrepresentation or construct-irrelevant components of test scores that differentially affect the performance of different groups of test takers and consequently the reliability/precision and validity of interpretations and uses of their test scores. 2. In statistics or measurement, systematic error in a test score. See construct underrepresentation, construct-irrelevant variance, fairness, predictive bias.

bilingual/multilingual: Having a degree of proficiency in two or more languages.

calibration: 1. In linking test scores, the process of relating scores on one test to scores on another that differ in reliability/precision from those on the first test, so that scores have the same relative meaning for a group of test takers. 2. In item response theory, the process of estimating the parameters of the item response function. 3. In scoring constructed response tasks, procedures used during training and scoring to achieve a desired level of scorer agreement.

certification: A process by which individuals are recognized (or certified) as having demonstrated some level of knowledge and skill in some domain. See licensing, credentialing.

classical test theory: A psychometric theory based on the view that an individual's observed score on a test is the sum of a true score component for the test taker and an independent random error component.

classification accuracy: Degree to which the assignment of test takers to specific categories is accurate; the degree to which false positive and false negative classifications are avoided. See sensitivity, specificity.

coaching: Planned short-term instructional activities for prospective test takers provided prior to the test administration for the primary purpose of improving their test scores. Activities that approximate the instruction provided by regular school curricula or training programs are not typically referred to as coaching.

coefficient alpha: An internal-consistency reliability coefficient based on the number of parts into which a test is partitioned (e.g., items, subtests, or raters), the interrelationships of the parts, and the total test score variance. Also called Cronbach's alpha and, for dichotomous items, KR-20. See internal-consistency coefficient, reliability coefficient.

cognitive assessment: The process of systematically collecting test scores and related data to make judgments about an individual's ability to perform various mental activities involved in the processing, acquisition, retention, conceptualization, and organization of sensory, perceptual, verbal, spatial, and psychomotor information.

cognitive lab: A method of studying the cognitive processes that test takers use when completing a task such as solving a mathematics problem or interpreting a passage of text, typically involving test takers' thinking aloud while responding to the task and/or responding to interview questions after completing the task.

cognitive science: The interdisciplinary study of learning and information processing.

comparability/score comparability: In test linking, the degree of score comparability resulting from the application of a linking procedure. Score comparability varies along a continuum that depends on the type of linking conducted. See alternate forms, equating, calibration, linking, moderation, projection, vertical scaling.

composite score: A score that combines several scores according to a specified formula.

computer-administered test: A test administered by a computer; test takers respond by using a keyboard, …

…empirical data and/or expert judgment using various formats such as narratives, tables, and graphs. Sometimes referred to as automated scoring or narrative report.

computerized adaptive test: An adaptive test administered by computer. See adaptive test.

concordance: In linking test scores for tests that measure similar constructs, the process of relating a score on one test to a score on another, so that the scores have the same relative meaning for a group of test takers.

conditional standard error of measurement: The standard deviation of measurement errors that affect the scores of test takers at a specified test score level.

confidence interval: An interval within which the parameter of interest will be included with a specified probability.

consequences: The outcomes, intended and unintended, of using tests in particular ways in certain contexts and with certain populations.

construct: The concept or characteristic that a test is designed to measure.

construct domain: The set of interrelated attributes (e.g., behaviors, attitudes, values) that are included under a construct's label.

construct equivalence: 1. The extent to which a construct measured by one test is essentially the same as the construct measured by another test. 2. The degree to which a construct measured by a test in one cultural or linguistic group is comparable to the construct measured by the same test in a different cultural or linguistic group.

construct-irrelevant variance: Variance in test-taker scores that is attributable to extraneous factors that distort the meaning of the scores and thereby decrease the validity of the proposed interpretation.
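The coefficient alpha entry above describes a coefficient computed from the number of parts, the part variances, and the total score variance. A minimal sketch of the usual formula, alpha = k/(k-1) × (1 − Σ item variances / total variance), using population variances on an invented item-by-person score matrix (the function names are illustrative, not from the Standards):

```python
# Illustrative computation of coefficient alpha; data are invented.
def variance(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def coefficient_alpha(item_scores):
    """item_scores: list of items, each a list of scores per test taker."""
    k = len(item_scores)
    totals = [sum(col) for col in zip(*item_scores)]  # total score per person
    item_var = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_var / variance(totals))

# Four dichotomous items, five test takers (rows = items, columns = persons)
scores = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
alpha = coefficient_alpha(scores)  # for 0/1 items this equals KR-20
```

Because the items here are dichotomous, the result is also the KR-20 value mentioned in the entry.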
…include diagrams, mathematical proofs, essays, or problem solutions such as network repairs or other work products.

content domain: The set of behaviors, knowledge, skills, abilities, attitudes, or other characteristics to be measured by a test, represented in detailed test specifications and often organized into categories by which items are classified.

content-related validity evidence: Evidence based on test content that supports the intended interpretation of test scores for a given purpose. Such evidence may address issues such as the fidelity of test content to performance in the domain in question and the degree to which test content representatively samples a domain, such as a course curriculum or job.

content standard: In educational assessment, a statement of content and skills that students are expected to learn in a subject matter area, often at a particular grade or at the completion of a particular level of schooling.

convergent evidence: Evidence based on the relationship between test scores and other measures of the same or related construct.

credentialing: Granting to a person, by some authority, a credential, such as a certificate, license, or diploma, that signifies an acceptable level of performance in some domain of knowledge or activity.

criterion domain: The construct domain of a variable that is used as a criterion. See construct domain.

criterion-referenced score interpretation: The meaning of a test score for an individual or of an average score for a defined group, indicating the individual's or group's level of performance in relationship to some defined criterion domain. Examples of criterion-referenced interpretations include comparisons to cut scores, interpretations based on expectancy tables, and domain-referenced score interpretations. Contrast with norm-referenced score interpretation.

cross-validation: A procedure in which a scoring system for predicting performance, derived from one sample, is applied to a second sample to investigate the stability of prediction of the scoring system.

cut score: A specified point on a score scale, such that scores at or above that point are reported, interpreted, or acted upon differently from scores below that point.

differential item functioning (DIF): For a particular item in a test, a statistical indicator of the extent to which different groups of test takers who are at the same ability level have different frequencies of correct responses or, in some cases, different rates of choosing various item options.

differential test functioning (DTF): Differential performance at the test or dimension level indicating that individuals from different groups who have the same standing on the characteristic assessed by a test do not have the same expected test score.

discriminant evidence: Evidence indicating whether two tests interpreted as measures of different constructs are sufficiently independent (uncorrelated) that they do, in fact, measure two distinct constructs.

documentation: The body of literature (e.g., test manuals, manual supplements, research reports, publications, user's guides) developed by a test's author, developer, user, and/or publisher to support test score interpretations for their intended use.

domain or content sampling: The process of selecting test items, in a systematic way, to represent the total set of items measuring a domain.

effort: The extent to which a test taker appropriately participates in test taking.

empirical evidence: Evidence based on some form of data, as opposed to that based on logic or theory.

English language learner (ELL): An individual who is not yet proficient in English. An ELL may be an individual whose first language is not English, a language minority individual just beginning to learn English, or an individual who has developed considerable proficiency in English. Related terms include English learner (EL), limited English proficient (LEP), English as a second language (ESL), and culturally and linguistically diverse.

equated forms: Alternate forms of a test whose scores have been related through a statistical process known as equating, which allows scale scores on equated forms to be used interchangeably.

equating: A process for relating scores on alternate forms of a test so that they have essentially the same meaning. The equated scores are typically reported on a common score scale.

equivalent forms: See alternate forms, parallel forms.
error of measurement: The difference between an observed score and the corresponding true score. See standard error of measurement, systematic error, random error, true score.

factor: Any variable, real or hypothetical, that is an aspect of a concept or construct.

factor analysis: Any of several statistical methods of describing the interrelationships of a set of variables by statistically deriving new variables, called factors, that are fewer in number than the original set of variables.

fairness: The validity of test score interpretations for intended use(s) for individuals from all relevant subgroups. A test that is fair minimizes the construct-irrelevant variance associated with individual characteristics and testing contexts that otherwise would compromise the validity of scores for some individuals.

fake bad: Exaggerate or falsify responses to test items in an effort to appear impaired.

fake good: Exaggerate or falsify responses to test items in an effort to present oneself in an overly positive way.

false negative: An error of classification, diagnosis, or selection leading to a determination that an individual does not meet the standard based on an assessment for inclusion in a particular group, when, in truth, he or she does meet the standard (or would, absent measurement error). See sensitivity, specificity.

false positive: An error of classification, diagnosis, or selection leading to a determination that an individual meets the standard based on an assessment for inclusion in a particular group, when, in truth, he or she does not meet the standard (or would not, absent measurement error). See sensitivity, specificity.

formative assessment: An assessment process used by teachers and students during instruction that provides feedback to adjust ongoing teaching and learning with the goal of improving students' achievement of intended instructional outcomes.

gain score: In testing, the difference between two scores obtained by a test taker on the same test or two equated tests taken on different occasions, often before and after some treatment.

generalizability coefficient: An index of reliability/precision based on generalizability theory (G theory). A generalizability coefficient is the ratio of universe score variance to observed score variance, where the observed score variance is equal to the universe score variance plus the total error variance. See generalizability theory.

generalizability theory: Methodological framework for evaluating reliability/precision in which various sources of error variance are estimated through the application of the statistical techniques of analysis of variance. The analysis indicates the generalizability of scores beyond the specific sample of items, persons, and observational conditions that were studied. Also called G theory.

group testing: Testing for groups of test takers, usually in a group setting, typically with standardized administration procedures and supervised by a proctor or test administrator.

growth models: Statistical models that measure students' progress on achievement tests by comparing the test scores of the same students over time. See value-added modeling.

high-stakes test: A test used to provide results that have important, direct consequences for individuals, programs, or institutions involved in the testing.
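The generalizability coefficient entry above defines the coefficient as the ratio of universe score variance to observed score variance, with observed variance equal to universe score variance plus total error variance. That arithmetic can be sketched directly; the variance components below are invented, whereas in practice they would be estimated from an ANOVA-based G study as the generalizability theory entry describes:

```python
# Illustrative generalizability coefficient; variance components are invented.
def generalizability_coefficient(universe_var, error_var):
    """Ratio of universe score variance to observed score variance,
    where observed variance = universe variance + total error variance."""
    observed_var = universe_var + error_var
    return universe_var / observed_var

# Hypothetical G-study estimates: universe variance 12.0, total error variance 4.0
g = generalizability_coefficient(universe_var=12.0, error_var=4.0)  # 12/16 = 0.75
```

Splitting the error term into separate components (items, raters, occasions) is what distinguishes this framework from the single undifferentiated error of classical test theory.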
informed consent: The agreement of a person, or that person's legal representative, for some procedure to be performed on or by the individual, such as taking a test or completing a questionnaire.

intelligence test: A test designed to measure an individual's level of cognitive functioning in accord with some recognized theory of intelligence. See cognitive assessment.

interim assessments or tests: Assessments administered during instruction to evaluate students' knowledge and skills relative to a specific set of academic goals to inform policy-maker or educator decisions at the classroom, school, or district level. See benchmark assessments.

internal-consistency coefficient: An index of the reliability of test scores derived from the statistical interrelationships among item responses or scores on separate parts of a test. See coefficient alpha, split-halves reliability coefficient.

internal structure: In test analysis, the factorial structure of item responses or subscales of a test.

interpreter: Someone who facilitates cross-cultural communication by converting concepts from one language to another (including sign language).

interrater agreement/consistency: The level of consistency with which two or more judges rate the work or performance of test takers. See interrater reliability.

interrater reliability: The level of consistency in rank ordering of ratings across raters. See interrater agreement.

intrarater reliability: The level of consistency among repetitions of a single rater in scoring test takers' responses. Inconsistencies in the scoring process resulting from influences that are internal to the rater rather than true differences in test takers' performances result in low intrarater reliability.

inventory: A questionnaire or checklist that elicits information about an individual's personal opinions, interests, attitudes, preferences, personality characteristics, motivations, or typical reactions to situations and problems.

item: A statement, question, exercise, or task on a test for which the test taker is to select or construct a response, or perform a task. See prompt.

item characteristic curve (ICC): A mathematical function relating the probability of a certain item response, usually a correct response, to the level of the attribute measured by the item. Also called item response curve, item response function.

item context effect: Influence of item position, other items administered, time limits, administration conditions, and so forth, on item difficulty and other statistical item characteristics.

item pool/item bank: The collection or set of items from which a test or test scale's items are selected during test development, or the total set of items from which a particular subset is selected for a test taker during adaptive testing.

item response theory (IRT): A mathematical model of the functional relationship between performance on a test item, the test item's characteristics, and the test taker's standing on the construct being measured.

job analysis: The investigation of positions or job classes to obtain information about job duties and tasks, responsibilities, necessary worker characteristics (e.g., knowledge, skills, and abilities), working conditions, and/or other aspects of the work. See practice analysis.

job/job classification: A group of positions that are similar enough in duties, responsibilities, necessary worker characteristics, and other relevant aspects that they may be properly placed under the same job title.

job performance measurement: Measurement of an incumbent's observed performance of a job as evaluated by a job sample test, an assessment of job knowledge, or ratings of the incumbent's actual performance on the job. See job sample test.

job sample test: A test of the ability of an individual to perform the tasks comprised by a job. See job performance measurement.

licensing: The granting, usually by a government agency, of an authorization or legal permission to practice an occupation or profession. See certification, credentialing.

linking/score linking: The process of relating scores on tests. See alternate forms, equating, calibration, moderation, projection, vertical scaling.

local evidence: Evidence (usually related to reliability/precision or validity) collected for a specific test and a specific set of test takers in a single institution or at a specific location.

local norms: Norms by which test scores are referred to a specific, limited reference population of particular interest to the test user (e.g., population of a locale, organization, or institution). Local norms are not intended to be representative of populations beyond that limited setting.

low-stakes test: A test used to provide results that have only minor or indirect consequences for individuals, programs, or institutions involved in the testing. Contrast with high-stakes test.

mastery test: A test designed to indicate whether a test taker has attained a prescribed level of competence, or mastery, in a domain. See cut score, computer-based mastery test.

matrix sampling: A measurement format in which a large set of test items is organized into a number of relatively short item sets, each of which is randomly assigned to a subsample of test takers, thereby avoiding the need to administer all items to all test takers. Equivalence of the short item sets, or subsets, is not assumed.

meta-analysis: A statistical method of research in which the results from independent, comparable studies are combined to determine the size of an overall effect or the degree of relationship between two variables.

moderation: A process of relating scores on different tests so that scores have the same relative meaning.

moderator variable: A variable that affects the direction or strength of the relationship between two other variables.

modification/test modification: A change in test content, format (including response formats), and/or administration conditions that is made to increase accessibility for some individuals but that also affects the construct measured and, consequently, results in scores that differ in meaning from scores from the unmodified assessment.

neuropsychological assessment: A specialized type of psychological assessment of normal or pathological processes affecting the central nervous system and the resulting psychological and behavioral functions or dysfunctions.

norm-referenced score interpretation: A score interpretation based on a comparison of a test taker's performance with the distribution of performance in a specified reference population. Contrast with criterion-referenced score interpretation.

norms: Statistics or tabular data that summarize the distribution or frequency of test scores for one or more specified groups, such as test takers of various ages or grades, usually designed to represent some larger population, referred to as the reference population. See local norms.

operational use: The actual use of a test, after initial test development has been completed, to inform an interpretation, decision, or action, based in part or wholly on test scores.

opportunity to learn: The extent to which test takers have been exposed to the tested constructs through their educational program and/or have had exposure to or experience with the language or the majority culture required to understand the test.

parallel forms: In classical test theory, strictly parallel test forms that are assumed to measure the same construct and to have the same means and the same standard deviations in the populations of interest. See alternate forms.

percentile: The score on a test below which a given percentage of scores for a specified population occurs.

percentile rank: The rank of a given score based on the percentage of scores in a specified score distribution that are below the score being ranked.

performance assessments: Assessments for which the test taker actually demonstrates the skills the test is intended to measure by doing tasks that require those skills.

performance level: Label or brief statement classifying a test taker's competency in a particular domain, usually defined by a range of scores on a test. For example, labels such as "basic" to "advanced," or "novice" to "expert," constitute broad ranges for classifying proficiency. See achievement levels, cut score, performance-level descriptor, standard setting.

performance-level descriptor: Descriptions of what test takers know and can do at specific performance levels.

performance standards: Descriptions of levels of knowledge and skill acquisition contained in content standards, as articulated through performance-level labels (e.g., "basic," "proficient," "advanced"); statements of what test takers at different performance levels know and can do; and cut scores or ranges of scores on the scale of an assessment that differentiate levels of performance. See cut score, performance level, performance-level descriptor.
personality inventory: An inventory that measures one or more characteristics that are regarded generally as psychological attributes or interpersonal tendencies.

pilot test: A test administered to a sample of test takers to try out some aspects of the test or test items, such as instructions, time limits, item response formats, or item response options. See field test.

policy study: A study that contributes to judgments about plans, principles, or procedures enacted to achieve broad public goals.

portfolio: In assessment, a systematic collection of educational or work products that have been compiled or accumulated over time, according to a specific set of principles or rules.

position: In employment contexts, the smallest organizational unit, a set of assigned duties and responsibilities that are performed by a person within an organization.

practice analysis: An investigation of a certain occupation or profession to obtain descriptive information about the activities and responsibilities of the occupation or profession and about the knowledge, skills, and abilities needed to engage successfully in the occupation or profession. See job analysis.

precision of measurement: The impact of measurement error on the outcome of the measurement. See standard error of measurement, error of measurement, reliability/precision.

predictive bias: The systematic under- or over-prediction of criterion performance for people belonging to groups differentiated by characteristics not relevant to the …

projection: A method of score linking in which scores on one test are used to predict scores on another test for a group of test takers, often using regression methodology.

prompt/item prompt/writing prompt: The question, stimulus, or instruction that elicits a test taker's response.

proprietary algorithms: Procedures, often computer code, used by commercial publishers or test developers that are not revealed to the public for commercial reasons.

psychodiagnosis: Formalization or classification of functional mental health status based on psychological assessment.

psychological assessment: An examination of psychological functioning that involves collecting, evaluating, and integrating test results and collateral information, and reporting information about an individual.

psychological testing: The use of tests or inventories to assess particular psychological characteristics of an individual.

random error: A nonsystematic error; a component of test scores that appears to have no relationship to other variables.

random sample: A selection from a defined population of entities according to a random process with the selection of each entity independent of the selection of other entities. See sample.

raw score: A score on a test that is calculated by counting the number of correct answers, or more generally, a sum or other combination of item scores.

reference population: The population of test takers to which individual test takers are compared through the …
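The item characteristic curve and item response theory entries above describe a function relating the probability of a correct response to the test taker's standing on the measured attribute. One widely used functional form (among several IRT models; the glossary does not prescribe one) is the two-parameter logistic model; the item parameters below are invented for illustration:

```python
import math

# Illustrative two-parameter logistic (2PL) item response function;
# the discrimination (a) and difficulty (b) values are invented.
def icc_2pl(theta, a, b):
    """P(correct) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An item with discrimination a=1.2 and difficulty b=0.5
p_at_b = icc_2pl(0.5, a=1.2, b=0.5)  # probability is exactly 0.5 at theta = b
p_high = icc_2pl(2.0, a=1.2, b=0.5)  # higher standing -> higher probability
```

The difficulty parameter locates the curve on the attribute scale (the point where the probability is .5), and the discrimination parameter controls how steeply the curve rises around that point.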
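The percentile and percentile rank entries above can be made concrete with a short computation. The sketch below follows the glossary's wording (the percentage of scores strictly below the given score); operational testing programs sometimes use other conventions, such as counting half of the tied scores:

```python
# Illustrative percentile rank, per the glossary's "percentage below" wording.
def percentile_rank(score, distribution):
    """Percentage of scores in the distribution strictly below `score`."""
    below = sum(1 for s in distribution if s < score)
    return 100.0 * below / len(distribution)

scores = [55, 60, 60, 70, 75, 80, 85, 90, 95, 100]
pr = percentile_rank(75, scores)  # 4 of 10 scores are below 75 -> 40.0
```

Note that a percentile rank is a norm-referenced statistic: its meaning depends entirely on the reference distribution supplied.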
…applications of a measurement procedure and hence are inferred to be dependable and consistent for an individual test taker; the degree to which scores are free of random errors of measurement for a given group. See generalizability theory, classical test theory, precision of measurement.

response bias: A test taker's tendency to respond in a particular way or style to items on a test (e.g., acquiescence, choice of socially desirable options, choice of "true" on a true-false test) that yields systematic, construct-irrelevant error in test scores.

response format: The mechanism that a test taker uses to respond to a test item, such as selecting from a list of options (multiple-choice question) or providing a written response (fill-in or written response to an open-ended or constructed-response question); oral response; or physical performance.

response protocol: A record of the responses given by a test taker to a particular test.

restriction of range or variability: Reduction in the observed score variance of a test-taker sample, compared with the variance of the entire test-taker population, as a consequence of constraints on the process of sampling test takers. See adjusted validity or reliability coefficient.

retesting: A repeat administration of a test, using either the same test or an alternate form, sometimes with additional training or education between administrations.

rubric: See scoring rubric.

sample: A selection of a specified number of entities, called sampling units (test takers, items, etc.), from a larger specified set of possible entities, called the population. See random sample, stratified random sample.

scale: 1. The system of numbers, and their units, by which a value is reported on some dimension of measurement. 2. In testing, the set of items or subtests used to measure a specific characteristic (e.g., a test of verbal ability or a scale of extroversion-introversion).

scale score: A score obtained by transforming raw scores. Scale scores are typically used to facilitate interpretation.

scaling: The process of creating a scale or a scale score by producing scale scores designed to support score interpretations. See scale.

school district: A local education agency administered by a public board of education or other public authority that oversees public elementary or secondary schools in a political subdivision of a state.

score: Any specific number resulting from the assessment of an individual, such as a raw score, a scale score, an estimate of a latent variable, a production count, an absence record, a course grade, or a rating.

scoring rubric: The established criteria, including rules, principles, and illustrations, used in scoring constructed responses to individual tasks and clusters of tasks.

screening test: A test that is used to make broad categorizations of test takers as a first step in selection decisions or diagnostic processes.

selection: The acceptance or rejection of applicants for a particular educational or employment opportunity.

sensitivity: In classification, diagnosis, and selection, the proportion of cases that are assessed as meeting or predicted to meet the criteria and which, in truth, do meet the criteria.

specificity: In classification, diagnosis, and selection, the proportion of cases that are assessed as not meeting or predicted to not meet the criteria and which, in truth, do not meet the criteria.

speededness: The extent to which test takers' scores depend on the rate at which work is performed as well as on the correctness of the responses. The term is not used to describe tests of speed.

split-halves reliability coefficient: An internal-consistency coefficient obtained by using half the items on a test to yield one score and the other half of the items to yield a second, independent score. See internal-consistency coefficient, coefficient alpha.

stability: The extent to which scores on a test are essentially invariant over time, assessed by correlating the test scores of a group of individuals with scores on the same test or an equated test taken by the same group at a later time. See test-retest reliability coefficient.

standard error of measurement: The standard deviation
to enhance test score interpretation by placing scores o f an individual’s observed scores from repeated ad
from different tests o r test form s on a com m on scale or m inistration s o f a test (or parallel form s o f a test)
under identical conditions. Because such data generally cannot be collected, the standard error of measurement is usually estimated from group data. See error of measurement.

standard setting: The process, often judgment based, of setting cut scores using a structured procedure that seeks to map test scores into discrete performance levels that are usually specified by performance-level descriptors.

standardization: 1. In test administration, maintaining a consistent testing environment and conducting tests according to detailed rules and specifications, so that testing conditions are the same for all test takers on the same and multiple occasions. 2. In test development, establishing a reporting scale using norms based on the test performance of a representative sample of individuals from the population with which the test is intended to be used.

standards-based assessment: Assessment of an individual's standing with respect to systematically described content and performance standards.

stratified random sample: A set of random samples, each of a specified size, from each of several different sets, which are viewed as strata of a population. See random sample, sample.

summative assessment: The assessment of a test taker's knowledge and skills typically carried out at the completion of a program of learning, such as the end of an instructional unit.

systematic error: An error that consistently increases or decreases the scores of all test takers or some subset of test takers, but is not related to the construct that the test is intended to measure. See bias.

technical manual: A publication prepared by test developers and/or publishers to provide technical and psychometric information about a test.

test: An evaluative device or procedure in which a systematic sample of a test taker's behavior in a specified domain is obtained and scored using a standardized process.

test design: The process of developing detailed specifications for what a test is to measure and the content, cognitive level, format, and types of test items to be used.

test developer: The person(s) or organization responsible for the design and construction of a test and for the documentation regarding its technical quality for an intended purpose.

test development: The process through which a test is planned, constructed, evaluated, and modified, including consideration of content, format, administration, scoring, item properties, scaling, and technical quality for the test's intended purpose.

test documents: Documents such as test manuals, technical manuals, user's guides, specimen sets, and directions for test administrators and scorers that provide information for evaluating the appropriateness and technical adequacy of a test for its intended purpose.

test form: A set of test items or exercises that meet requirements of the specifications for a testing program. Many testing programs use alternate test forms, each built according to the same specifications but with some or all of the test items unique to each form. See alternate forms.

test format/mode: The manner in which test content is presented to the test taker: with paper and pencil, via computer terminal or Internet, or orally by an examiner.

test information function: A mathematical function relating each level of an ability or latent trait, as defined under item response theory (IRT), to the reciprocal of the corresponding conditional measurement error variance.

test manual: A publication prepared by test developers and/or publishers to provide information on test administration, scoring, and interpretation and to provide selected technical data on test characteristics. See user's guide, technical manual.

test modification: Changes made in the content, format, and/or administration procedure of a test to increase the accessibility of the test for test takers who are unable to take the original test under standard testing conditions. In contrast to test accommodations, test modifications change the construct being measured by the test to some extent and hence change score interpretations. See adaptation/test adaptation, modification/test modification. Contrast with accommodations/test accommodations.

test publisher: An entity, individual, organization, or agency that produces and/or distributes a test.

test-retest reliability coefficient: A reliability coefficient obtained by administering the same test a second time
to the same group after a time interval and correlating the two sets of scores; typically used as a measure of stability of the test scores. See stability.

test security: Protection of the content of a test from unauthorized release or use, to protect the integrity of the test scores so they are valid for their intended use.

test specifications: Documentation of the purpose and intended uses of a test as well as of the test's content, format, length, psychometric characteristics (of the items and test overall), delivery mode, administration, scoring, and score reporting.

test-taking strategies: Strategies that test takers might use while taking a test to improve their performance, such as time management or the elimination of obviously incorrect options on a multiple-choice question before responding to the question.

test user: A person or entity responsible for the choice and administration of a test, for the interpretation of test scores produced in a given context, and for any decisions or actions that are based, in part, on test scores.

timed test: A test administered to test takers who are allotted a prescribed amount of time to respond to the test.

top-down selection: Selection of applicants on the basis of rank-ordered test scores from highest to lowest.

true score: In classical test theory, the average of the scores that would be earned by an individual on an unlimited number of strictly parallel forms of the same test.

user's guide: A publication prepared by test developers and/or publishers to provide information on a test's purpose, appropriate uses, proper administration, scoring procedures, normative data, interpretation of results, and case studies. See test manual.

validation: The process through which the validity of a proposed interpretation of test scores for their intended uses is investigated.

validity: The degree to which accumulated evidence and theory support a specific interpretation of test scores for a given use of a test. If multiple interpretations of a test score for different uses are intended, validity evidence for each interpretation is needed.

validity argument: An explicit justification of the degree to which accumulated evidence and theory support the proposed interpretation(s) of test scores for their intended uses.

validity generalization: Application of validity evidence obtained in one or more situations to other similar situations on the basis of methods such as meta-analysis.

value-added modeling: Estimating the contribution of individual schools or teachers to student performance by means of complex statistical techniques that use multiple years of student outcome data, which typically are standardized test scores. See growth models.

variance components: Variances accruing from the separate constituent sources that are assumed to contribute to the overall variance of observed scores. Such variances are estimated by methods of the analysis of variance.
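The sensitivity and specificity entries above describe proportions computed from a two-by-two cross-classification of assessed (or predicted) status against true criterion status. A minimal sketch in Python; the 0/1 classification data are hypothetical and purely illustrative:

```python
# Hypothetical data: 1 = meets the criterion, 0 = does not.
predicted = [1, 1, 0, 1, 0, 0, 1, 0]   # assessed/predicted classifications
actual    = [1, 0, 0, 1, 0, 1, 1, 0]   # true criterion status

# Tally the four cells of the classification table.
true_pos  = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
false_neg = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
true_neg  = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
false_pos = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)

# Proportion of cases that truly meet the criteria and are identified as meeting them.
sensitivity = true_pos / (true_pos + false_neg)
# Proportion of cases that truly do not meet the criteria and are identified as not meeting them.
specificity = true_neg / (true_neg + false_pos)
```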
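The split-halves reliability coefficient entry describes correlating two independent half-test scores. The sketch below computes that coefficient for hypothetical item-response data and then applies the Spearman-Brown step-up to estimate full-length reliability; the step-up formula is standard classical test theory, not part of the glossary entry itself:

```python
import statistics

def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (statistics.pstdev(x) * statistics.pstdev(y) * len(x))

# Hypothetical scored items (rows = test takers, columns = items; 1 = correct).
items = [
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 1],
]

# Odd-even split: one half from even-indexed items, the other from odd-indexed items.
half_a = [sum(row[0::2]) for row in items]
half_b = [sum(row[1::2]) for row in items]

r_half = pearson(half_a, half_b)    # split-halves coefficient (half-length test)
r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown estimate for the full-length test
```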
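The standard error of measurement entry notes that the SEM is usually estimated from group data. The usual classical-test-theory estimate, assumed here since the entry gives no formula, multiplies the group's observed-score standard deviation by the square root of one minus the reliability coefficient:

```python
import math

# Hypothetical group statistics for some test.
observed_sd = 15.0   # standard deviation of observed scores in the group
reliability = 0.91   # reliability coefficient for the same group

# Classical group-data estimate: SEM = s_x * sqrt(1 - r_xx)
sem = observed_sd * math.sqrt(1 - reliability)
```

With these illustrative numbers the estimate is 15 × √0.09 = 4.5 score points.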
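The test-retest reliability coefficient and stability entries both rest on correlating scores from two administrations of the same (or an equated) test to the same group. A minimal sketch with hypothetical score vectors:

```python
import statistics

def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (statistics.pstdev(x) * statistics.pstdev(y) * len(x))

# Hypothetical scores for the same five test takers on two administrations.
time_1 = [52, 61, 47, 70, 58]
time_2 = [55, 60, 50, 68, 57]

# Correlation across the time interval, used as a stability estimate.
r_test_retest = pearson(time_1, time_2)
```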