
AMERICAN EDUCATIONAL RESEARCH ASSOCIATION

AMERICAN PSYCHOLOGICAL ASSOCIATION


NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION
STANDARDS
for Educational and Psychological Testing

American Educational Research Association


American Psychological Association
National Council on Measurement in Education
Copyright 2014 by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. All rights reserved. No part of this publication may be reproduced or distributed in any form or by any means now known or later developed, including, but not limited to, photocopying or the process of scanning and digitization, or transmitted or stored in a database or retrieval system, without the prior written permission of the publisher.

Published by the
American Educational Research Association
1430 K St., NW, Suite 1200
Washington, DC 20005

Printed in the United States of America

Prepared by the
Joint Committee on the Standards for Educational and Psychological Testing of the American Educational
Research Association, the American Psychological Association, and the National Council on Measurement
in Education

Library of Congress Cataloging-in-Publication Data

American Educational Research Association.


Standards for educational and psychological testing / American Educational Research Association, American Psychological Association, National Council on Measurement in Education.
pages cm
"Prepared by the Joint Committee on Standards for Educational and Psychological Testing of the American Educational Research Association, American Psychological Association and National Council on Measurement in Education"-T.p. verso.
Includes index.
ISBN 978-0-935302-35-6 (alk. paper)
1. Educational tests and measurements-Standards-United States. 2. Psychological tests-Standards-United States. I. American Psychological Association. II. National Council on Measurement in Education. III. Joint Committee on Standards for Educational and Psychological Testing (U.S.) IV. Title.
LB3051.A693 2014
371.26'0973-dc23
CONTENTS

PREFACE ......................................................................vii

INTRODUCTION
The Purpose of the Standards ...................................................1
Legal Disclaimer ...............................................................1
Tests and Test Uses to Which These Standards Apply .............................2
Participants in the Testing Process ..............................................3
Scope of the Revision ..........................................................4
Organization of the Volume ....................................................5
Categories of Standards ........................................................5
Presentation of Individual Standards ............................................6
Cautions to Be Considered in Using the Standards ...............................7

PART I
FOUNDATIONS

1. Validity .....................................................................11
Background .................................................................11
Sources of Validity Evidence ................................................13
Integrating the Validity Evidence ............................................21
Standards for Validity ........................................................23
Cluster 1. Establishing Intended Uses and Interpretations ......................23
Cluster 2. Issues Regarding Samples and Settings Used in Validation ...........25
Cluster 3. Specific Forms of Validity Evidence .................................26
2. Reliability/Precision and Errors of Measurement ..............................33
Background .................................................................33
Implications for Validity ....................................................34
Specifications for Replications of the Testing Procedure .......................35
Evaluating Reliability/Precision .............................................37
Reliability/Generalizability Coefficients ......................................37
Factors Affecting Reliability/Precision .......................................38
Standard Errors of Measurement .............................................39
Decision Consistency .......................................................40
Reliability/Precision of Group Means ........................................40
Documenting Reliability/Precision ...........................................40
Standards for Reliability/Precision ............................................42
Cluster 1. Specifications for Replications of the Testing Procedure .............42
Cluster 2. Evaluating Reliability/Precision ....................................43
Cluster 3. Reliability/Generalizability Coefficients ............................44
Cluster 4. Factors Affecting Reliability/Precision ..............................44
Cluster 5. Standard Errors of Measurement ...................................45
Cluster 6. Decision Consistency ..............................................46
Cluster 7. Reliability/Precision of Group Means ...............................46
Cluster 8. Documenting Reliability/Precision .................................47
3. Fairness in Testing ..........................................................49
Background .................................................................49
General Views of Fairness ...................................................50
Threats to Fair and Valid Interpretations of Test Scores .......................54
Minimizing Construct-Irrelevant Components Through Test Design and
Testing Adaptations ......................................................57
Standards for Fairness ........................................................63
Cluster 1. Test Design, Development, Administration, and Scoring Procedures
That Minimize Barriers to Valid Score Interpretations for the Widest Possible
Range of Individuals and Relevant Subgroups ..............................63
Cluster 2. Validity of Test Score Interpretations for Intended Uses for the
Intended Examinee Population ............................................65
Cluster 3. Accommodations to Remove Construct-Irrelevant Barriers and Support
Valid Interpretations of Scores for Their Intended Uses .....................67
Cluster 4. Safeguards Against Inappropriate Score Interpretations for Intended Uses .....70

PART II
OPERATIONS

4. Test Design and Development ................................................75
Background .................................................................75
Test Specifications ..........................................................75
Item Development and Review ...............................................81
Assembling and Evaluating Test Forms .......................................82
Developing Procedures and Materials for Administration and Scoring ..........83
Test Revisions ..............................................................83
Standards for Test Design and Development ...................................85
Cluster 1. Standards for Test Specifications ...................................85
Cluster 2. Standards for Item Development and Review ........................87
Cluster 3. Standards for Developing Test Administration and Scoring Procedures
and Materials .............................................................90
Cluster 4. Standards for Test Revision ........................................93
5. Scores, Scales, Norms, Score Linking, and Cut Scores .........................95
Background .................................................................95
Interpretations of Scores ....................................................95
Norms ......................................................................97
Score Linking ...............................................................97
Cut Scores .................................................................100
Standards for Scores, Scales, Norms, Score Linking, and Cut Scores ............102
Cluster 1. Interpretations of Scores ..........................................102
Cluster 2. Norms ...........................................................104
Cluster 3. Score Linking ....................................................105
Cluster 4. Cut Scores .......................................................107
6. Test Administration, Scoring, Reporting, and Interpretation ..................111
Background ................................................................111
Standards for Test Administration, Scoring, Reporting, and Interpretation .....114
Cluster 1. Test Administration ...............................................114
Cluster 2. Test Scoring .....................................................118
Cluster 3. Reporting and Interpretation ......................................119
7. Supporting Documentation for Tests .........................................123
Background ................................................................123
Standards for Supporting Documentation for Tests ............................125
Cluster 1. Content of Test Documents: Appropriate Use .......................125
Cluster 2. Content of Test Documents: Test Development .....................126
Cluster 3. Content of Test Documents: Test Administration and Scoring ........127
Cluster 4. Timeliness of Delivery of Test Documents ..........................129
8. The Rights and Responsibilities of Test Takers ...............................131
Background ................................................................131
Standards for Test Takers' Rights and Responsibilities ........................133
Cluster 1. Test Takers' Rights to Information Prior to Testing .................133
Cluster 2. Test Takers' Rights to Access Their Test Results and to Be Protected
From Unauthorized Use of Test Results ....................................135
Cluster 3. Test Takers' Rights to Fair and Accurate Score Reports ..............136
Cluster 4. Test Takers' Responsibilities for Behavior Throughout the Test
Administration Process ...................................................136
9. The Rights and Responsibilities of Test Users ................................139
Background ................................................................139
Standards for Test Users' Rights and Responsibilities .........................142
Cluster 1. Validity of Interpretations ........................................142
Cluster 2. Dissemination of Information .....................................146
Cluster 3. Test Security and Protection of Copyrights .........................147

PART III
TESTING APPLICATIONS

10. Psychological Testing and Assessment ......................................151
Background ................................................................151
Test Selection and Administration ...........................................152
Test Score Interpretation ...................................................154
Collateral Information Used in Psychological Testing and Assessment ..........155
Types of Psychological Testing and Assessment ...............................155
Purposes of Psychological Testing and Assessment ............................159
Summary ...................................................................163
Standards for Psychological Testing and Assessment ..........................164
Cluster 1. Test User Qualifications ..........................................164
Cluster 2. Test Selection ....................................................165
Cluster 3. Test Administration ...............................................165
Cluster 4. Test Interpretation ...............................................166
Cluster 5. Test Security ....................................................168
11. Workplace Testing and Credentialing .......................................169
Background ................................................................169
Employment Testing ........................................................170
Testing in Professional and Occupational Credentialing .......................174
Standards for Workplace Testing and Credentialing ............................178
Cluster 1. Standards Generally Applicable to Both Employment
Testing and Credentialing .................................................178
Cluster 2. Standards for Employment Testing .................................179
Cluster 3. Standards for Credentialing .......................................181
12. Educational Testing and Assessment ........................................183
Background ................................................................183
Design and Development of Educational Assessments .........................184
Use and Interpretation of Educational Assessments ...........................188
Administration, Scoring, and Reporting of Educational Assessments ...........192
Standards for Educational Testing and Assessment ............................195
Cluster 1. Design and Development of Educational Assessments ................195
Cluster 2. Use and Interpretation of Educational Assessments .................197
Cluster 3. Administration, Scoring, and Reporting of Educational Assessments ......200
13. Uses of Tests for Program Evaluation, Policy Studies, and Accountability ......203
Background ................................................................203
Evaluation of Programs and Policy Initiatives ................................204
Test-Based Accountability Systems ..........................................205
Issues in Program and Policy Evaluation and Accountability ...................206
Additional Considerations ..................................................207
Standards for Uses of Tests for Program Evaluation, Policy Studies,
and Accountability ........................................................209
Cluster 1. Design and Development of Testing Programs and Indices for
Program Evaluation, Policy Studies, and Accountability Systems .............209
Cluster 2. Interpretations and Uses of Information From Tests Used in
Program Evaluation, Policy Studies, and Accountability Systems .............210

GLOSSARY ...................................................................215
INDEX .......................................................................227
PREFACE

This edition of the Standards for Educational and Psychological Testing is sponsored by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). Earlier documents from the sponsoring organizations also guided the development and use of tests. The first was Technical Recommendations for Psychological Tests and Diagnostic Techniques, prepared by an APA committee and published by APA in 1954. The second was Technical Recommendations for Achievement Tests, prepared by a committee representing AERA and the National Council on Measurements Used in Education (NCMUE) and published by the National Education Association in 1955.

The third, which replaced the earlier two, was prepared by a joint committee representing AERA, APA, and NCME and was published by APA in 1966. It was the first edition of the Standards for Educational and Psychological Testing, also known as the Standards. Three subsequent editions of the Standards were prepared by joint committees representing AERA, APA, and NCME, published in 1974, 1985, and 1999.

The current Standards Management Committee was formed by AERA, APA, and NCME, the three sponsoring organizations, in 2005, consisting of one representative from each organization. The committee's responsibilities included determining whether the 1999 Standards needed revision and then creating the charge, budget, and work timeline for a joint committee; appointing joint committee co-chairs and members; overseeing finances and a development fund; and performing other tasks related to the revision and publication of the Standards.

Standards Management Committee
Wayne J. Camara (Chair), appointed by APA
David Frisbie (2008-present), appointed by NCME
Suzanne Lane, appointed by AERA
Barbara S. Plake (2005-2007), appointed by NCME

The present edition of the Standards was developed by the Joint Committee on the Standards for Educational and Psychological Testing, appointed by the Standards Management Committee in 2008. Members of the Joint Committee are members of at least one of the three sponsoring organizations, AERA, APA, and NCME. The Joint Committee was charged with the revision of the Standards and the preparation of a final document for publication. It held its first meeting in January 2009.

Joint Committee on the Standards for Educational and Psychological Testing
Barbara S. Plake (Co-Chair)
Lauress L. Wise (Co-Chair)
Linda L. Cook
Fritz Drasgow
Brian T. Gong
Laura S. Hamilton
Jo-Ida Hansen
Joan L. Herman
Michael T. Kane
Michael J. Kolen
Antonio E. Puente
Paul R. Sackett
Nancy T. Tippins
Walter D. Way
Frank C. Worrell

Each sponsoring organization appointed one or two liaisons, some of whom were members of the Joint Committee, to serve as the communication conduits between the sponsoring organizations and the committee during the revision process.

Liaisons to the Joint Committee
AERA: Joan L. Herman
APA: Michael J. Kolen and Frank C. Worrell
NCME: Steve Ferrara

Marianne Ernesto (APA) served as the project director for the Joint Committee, and Dianne L. Schneider (APA) served as the project coordinator. Gerald Sroufe (AERA) provided administrative support for the Management Committee. APA's legal counsel managed the external legal review of the Standards. Daniel R. Eignor and James C. Impara reviewed the Standards for technical accuracy and consistency across chapters.

In 2008, each of the three sponsoring organizations released a call for comments on the 1999 Standards. Based on a review of the comments received, the Management Committee identified four main content areas of focus for the revision: technological advances in testing, increased use of tests for accountability and education policy setting, access for all examinee populations, and issues associated with workplace testing. In addition, the committee gave special attention to ensuring a common voice and consistent use of technical language across chapters.

In January 2011, a draft of the revised Standards was made available for public review and comment. Organizations that submitted comments on the draft and/or comments in response to the 2008 call for comments are listed below. Many individuals from each organization contributed comments, as did many individual members of AERA, APA, and NCME. The Joint Committee considered each comment in its revision of the Standards. These thoughtful reviews from a variety of professional vantage points helped the Joint Committee in drafting the final revisions of the present edition of the Standards.

Comments came from the following organizations:

Sponsoring Organizations
American Educational Research Association
American Psychological Association
National Council on Measurement in Education

Professional Associations
American Academy of Clinical Neuropsychology
American Board of Internal Medicine
American Counseling Association
American Institute of CPAs, Examinations Team
APA Board for the Advancement of Psychology in the Public Interest
APA Board of Educational Affairs
APA Board of Professional Affairs
APA Board of Scientific Affairs
APA Policy and Planning Board
APA Committee on Aging
APA Committee on Children, Youth, and Families
APA Committee on Ethnic Minority Affairs
APA Committee on International Relations in Psychology
APA Committee on Legal Issues
APA Committee on Psychological Tests and Assessment
APA Committee on Socioeconomic Status
APA Society for the Psychology of Women (Division 35)
APA Division of Evaluation, Measurement, and Statistics (Division 5)
APA Division of School Psychology (Division 16)
APA Ethics Committee
APA Society for Industrial and Organizational Psychology (Division 14)
APA Society of Clinical Child and Adolescent Psychology (Division 53)
APA Society of Counseling Psychology (Division 17)
Asian American Psychological Association
Association of Test Publishers
District of Columbia Psychological Association
Massachusetts Neuropsychological Society
Massachusetts Psychological Association
National Academy of Neuropsychology
National Association of School Psychologists
National Board of Medical Examiners
National Council of Teachers of Mathematics
NCME Board of Directors
NCME Diversity Issues and Testing Committee
NCME Standards and Test Use Committee

Testing Companies
ACT
Alpine Testing Solutions
The College Board
Educational Testing Service
Harcourt Assessment, Inc.
Hogan Assessment Systems
Pearson
Prometric
Vangent Human Capital Management
Wonderlic, Inc.

Academic and Research Institutions
Center for Educational Assessment, University of Massachusetts
George Washington University Center for Equity and Excellence in Education
Human Resources Research Organization (HumRRO)
National Center on Educational Outcomes, University of Minnesota

Credentialing Organizations
American Registry of Radiologic Technologists
National Board for Certified Counselors
National Board of Medical Examiners

Other Institutions
California Department of Education
Equal Employment Advisory Council
Fair Access Coalition on Testing
Instituto de Evaluación e Ingeniería Avanzada, Mexico
Qualifications and Curriculum Authority, UK Department for Education
Performance Testing Council

When the Joint Committee completed its final revision of the Standards, it submitted the revision to the three sponsoring organizations for approval and endorsement. Each organization had its own governing body and mechanism for approval, as well as a statement on the meaning of its approval:

AERA: The AERA's approval of the Standards means that the Council adopts the document as AERA policy.

APA: The APA's approval of the Standards means that the Council of Representatives adopts the document as APA policy.

NCME: The Standards for Educational and Psychological Testing has been endorsed by NCME, and this endorsement carries with it an ethical imperative for all NCME members to abide by these standards in the practice of measurement.

Although the Standards is prescriptive, it does not contain enforcement mechanisms. The Standards was formulated with the intent of being consistent with other standards, guidelines, and codes of conduct published by the three sponsoring organizations.

Joint Committee on the Standards for Educational and Psychological Testing
INTRODUCTION

Educational and psychological testing and assessment are among the most important contributions of cognitive and behavioral sciences to our society, providing fundamental and significant sources of information about individuals and groups. Not all tests are well developed, nor are all testing practices wise or beneficial, but there is extensive evidence documenting the usefulness of well-constructed, well-interpreted tests. Well-constructed tests that are valid for their intended purposes have the potential to provide substantial benefits for test takers and test users. Their proper use can result in better decisions about individuals and programs than would result without their use and can also provide a route to broader and more equitable access to education and employment. The improper use of tests, on the other hand, can cause considerable harm to test takers and other parties affected by test-based decisions. The intent of the Standards for Educational and Psychological Testing is to promote sound testing practices and to provide a basis for evaluating the quality of those practices. The Standards is intended for professionals who specify, develop, or select tests and for those who interpret, or evaluate the technical quality of, test results.

The Purpose of the Standards

The purpose of the Standards is to provide criteria for the development and evaluation of tests and testing practices and to provide guidelines for assessing the validity of interpretations of test scores for the intended test uses. Although such evaluations should depend heavily on professional judgment, the Standards provides a frame of reference to ensure that relevant issues are addressed. All professional test developers, sponsors, publishers, and users should make reasonable efforts to satisfy and follow the Standards and should encourage others to do so. All applicable standards should be met by all tests and in all test uses unless a sound professional reason is available to show why a standard is not relevant or technically feasible in a particular case.

The Standards makes no attempt to provide psychometric answers to questions of public policy regarding the use of tests. In general, the Standards advocates that, within feasible limits, the relevant technical information be made available so that those involved in policy decisions may be fully informed.

Legal Disclaimer

The Standards is not a statement of legal requirements, and compliance with the Standards is not a substitute for legal advice. Numerous federal, state, and local statutes, regulations, rules, and judicial decisions relate to some aspects of the use, production, maintenance, and development of tests and test results and impose standards that may be different for different types of testing. A review of these legal issues is beyond the scope of the Standards, the distinct purpose of which is to set forth the criteria for sound testing practices from the perspective of cognitive and behavioral science professionals. Where it appears that one or more standards address an issue on which established legal requirements may be particularly relevant, the standard, comment, or introductory material may make note of that fact. Lack of specific reference to legal requirements, however, does not imply the absence of a relevant legal requirement. When applying standards across international borders, legal differences may raise additional issues or require different treatment of issues.

In some areas, such as the collection, analysis, and use of test data and results for different subgroups, the law may both require participants in the testing process to take certain actions and prohibit those participants from taking other actions. Furthermore, because the science of testing is an evolving discipline, recent revisions to the Standards may not be reflected in existing legal authorities, including judicial decisions and agency guidelines. In all situations, participants in the testing process should obtain the advice of counsel concerning applicable legal requirements.

In addition, although the Standards is not enforceable by the sponsoring organizations, it has been repeatedly recognized by regulatory authorities and courts as setting forth the generally accepted professional standards that developers and users of tests and other selection procedures follow. Compliance or noncompliance with the Standards may be used as relevant evidence of legal liability in judicial and regulatory proceedings. The Standards therefore merits careful consideration by all participants in the testing process.

Nothing in the Standards is meant to constitute legal advice. Moreover, the publishers disclaim any and all responsibility for liability created by participation in the testing process.

Tests and Test Uses to Which These Standards Apply

A test is a device or procedure in which a sample of an examinee's behavior in a specified domain is obtained and subsequently evaluated and scored using a standardized process. Whereas the label test is sometimes reserved for instruments on which responses are evaluated for their correctness or quality, and the terms scale and inventory are used for measures of attitudes, interest, and dispositions, the Standards uses the single term test to refer to all such evaluative devices.

A distinction is sometimes made between tests and assessments. Assessment is a broader term than test, commonly referring to a process that integrates test information with information from other sources (e.g., information from other tests, inventories, and interviews; or the individual's social, educational, employment, health, or psychological history). The applicability of the Standards to an evaluation device or method is determined by substance and not altered by the label applied to it (e.g., test, assessment, scale, inventory). The Standards should not be used as a checklist, as is emphasized in the section "Cautions to Be Considered in Using the Standards" at the end of this chapter.

Tests differ on a number of dimensions: the mode in which test materials are presented (e.g., paper-and-pencil, oral, or computerized administration); the degree to which stimulus materials are standardized; the type of response format (selection of a response from a set of alternatives, as opposed to the production of a free-form response); and the degree to which test materials are designed to reflect or simulate a particular context. In all cases, however, tests standardize the process by which test takers' responses to test materials are evaluated and scored. As noted in prior versions of the Standards, the same general types of information are needed to judge the soundness of results obtained from using all varieties of tests.

The precise demarcation between measurement devices used in the fields of educational and psychological testing that do and do not fall within the purview of the Standards is difficult to identify. Although the Standards applies most directly to standardized measures generally recognized as "tests," such as measures of ability, aptitude, achievement, attitudes, interests, personality, cognitive functioning, and mental health, the Standards may also be usefully applied in varying degrees to a broad range of less formal assessment techniques. Rigorous application of the Standards to unstandardized employment assessments (such as some job interviews) or to the broad range of unstructured behavior samples used in some forms of clinical and school-based psychological assessment (e.g., an intake interview), or to instructor-made tests that are used to evaluate student performance in education and training, is generally not possible. It is useful to distinguish between devices that lay claim to the concepts and techniques of the field of educational and psychological testing and devices that represent unstandardized or less standardized aids to day-to-day evaluative decisions. Although the principles and concepts underlying the Standards can be fruitfully applied to day-to-day decisions, such as when a business owner interviews a job applicant, a manager evaluates the performance of subordinates, a teacher develops a classroom assessment to monitor student progress toward an educational goal, or a coach evaluates a prospective athlete, it would be overreaching to expect that the standards of the educational and psychological testing field be followed by those making such decisions. In contrast, a structured interviewing system developed by a psychologist and accompanied by claims that the system has been found to be predictive of job performance in a variety of other settings falls within the purview of the Standards. Adhering to the Standards becomes more critical as the stakes for the test taker and the need to protect the public increase.

Participants in the Testing Process

Educational and psychological testing and assessment involve and significantly affect individuals, institutions, and society as a whole. The individuals affected include students, parents, families, teachers, educational administrators, job applicants, employees, clients, patients, supervisors, executives, and evaluators, among others. The institutions affected include schools, colleges, businesses, industry, psychological clinics, and government agencies. Individuals and institutions benefit when testing helps them achieve their goals. Society, in turn, benefits when testing contributes to the achievement of individual and institutional goals.

There are many participants in the testing process, including, among others, (a) those who prepare and develop the test; (b) those who publish and market the test; (c) those who administer and score the test; (d) those who interpret test results for clients; (e) those who use the test results for some decision-making purpose (including policy makers and those who use data to inform social policy); (f) those who take the test by choice, direction, or necessity; (g) those who sponsor tests, such as boards that represent institutions or governmental agencies that contract with a test developer for a specific instrument or service; and (h) those who select or review tests, evaluating their comparative merits or suitability for the uses proposed. In general, those who are participants in the testing process should have appropriate knowledge of tests and assessments to allow them to make good decisions about which tests to use and how to interpret test results.

The interests of the various parties involved in the testing process may or may not be congruent. For example, when a test is given for counseling purposes or for job placement, the interests of the individual and the institution often coincide. In contrast, when a test is used to select from among many individuals for a highly competitive job or for entry into an educational or training program, the preferences of an applicant may be inconsistent with those of an employer or admissions officer. Similarly, when testing is mandated by a court, the interests of the test taker may be different from those of the party requesting the court order.

Individuals or institutions may serve several roles in the testing process. For example, in clinics the test taker is typically the intended beneficiary of the test results. In some situations the test administrator is an agent of the test developer, and sometimes the test administrator is also the test user. When an organization prepares its own employment tests, it is both the developer and the user. Sometimes a test is developed by a test author but published, marketed, and distributed by an independent publisher, although the publisher may play an active role in the test development process. Roles may also be further subdivided. For example, both an organization and a professional assessor may play a role in the provision of an assessment center. Given this intermingling of roles, it is often difficult to assign precise responsibility for addressing various standards to specific participants in the testing process. Uses of tests and testing practices are improved to the extent that those involved have adequate levels of assessment literacy.

Tests are designed, developed, and used in a wide variety of ways. In some cases, they are developed and "published" for use outside the organization that produces them. In other cases, as with state educational assessments, they are designed by the state educational agency and developed by contractors for exclusive and often one-time use by the state and not really "published" at all. Throughout the Standards, we use the general term test developer, rather than the more specific term test publisher, to denote those involved in the design and development of tests across the full range of test development scenarios.

The Standards is based on the premise that effective testing and assessment require that all professionals in the testing process possess the knowledge, skills, and abilities necessary to fulfill their roles, as well as an awareness of personal and contextual factors that may influence the testing process. For example, test developers and those selecting tests and interpreting test results need adequate knowledge of psychometric principles such as validity and reliability. They also should obtain any appropriate supervised experience and legislatively mandated practice credentials that are required to perform competently those aspects of the testing process in which they engage. All professionals in the testing process should follow the ethical guidelines of their profession.

Scope of the Revision

This volume serves as a revision of the 1999 Standards for Educational and Psychological Testing. The revision process started with the appointment of a Management Committee, composed of representatives of the three sponsoring organizations responsible for overseeing the general direction of the effort: the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). To guide the revision, the Management Committee solicited and synthesized comments on the 1999 Standards from members of the sponsoring organizations and convened the Joint Committee for the Revision of the 1999 Standards in 2009 to do the actual revision. The Joint Committee also was composed of members of the three sponsoring organizations and was charged by the Management Committee with addressing five major areas: considering the accountability issues for use of tests in educational policy; broadening the concept of accessibility of tests for all examinees; representing more comprehensively the role of tests in the workplace; broadening the role of technology in testing; and providing for a better organizational structure for communicating the standards.

To be responsive to this charge, several actions were taken:

• The chapters "Educational Testing and Assessment" and "Testing in Program Evaluation and Public Policy," in the 1999 version, were rewritten to attend to the issues associated with the uses of tests for educational accountability purposes.

• A new chapter, "Fairness in Testing," was written to emphasize accessibility and fairness as fundamental issues in testing. Specific concerns for fairness are threaded throughout all of the chapters of the Standards.

• The chapter "Testing in Employment and Credentialing" (now "Workplace Testing and Credentialing") was reorganized to more clearly identify when a standard is relevant to employment and/or credentialing.

• The impact of technology was considered throughout the volume. One of the major technology issues identified was the tension between the use of proprietary algorithms and the need for test users to be able to evaluate complex applications in areas such as automated scoring of essays, administering and scoring of innovative item types, and computer-based testing. These issues are considered in the chapter "Test Design and Development."

• A content editor was engaged to help with the technical accuracy and clarity of each chapter and with consistency of language across chapters.

As noted below, chapters in Part I ("Foundations") and Part II ("Operations") now have an "overarching standard" as well as themes under which the individual standards are organized. In addition, the glossary from the 1999 Standards for Educational and Psychological Testing was updated. As stated above, a major change in the organization of this volume involves the conceptualization of fairness. The 1999 edition had a part devoted to this topic, with separate chapters titled "Fairness in Testing and Test Use," "Testing Individuals of Diverse Linguistic Backgrounds," and "Testing Individuals With Disabilities." In the present edition, the topics addressed in those chapters are combined into a single, comprehensive chapter, and the chapter is located in Part I. This change was made to emphasize that fairness demands that all test takers be treated equitably. Fairness and accessibility, the unobstructed opportunity for all examinees to demonstrate their standing on the construct(s) being measured, are relevant for valid score interpretations for all individuals and subgroups in the intended population of test takers. Because issues related to fairness in testing are not restricted to individuals with diverse linguistic backgrounds or those with disabilities, the chapter was more broadly cast to support appropriate testing experiences for all individuals. Although the examples in the chapter often refer to individuals with diverse linguistic and cultural backgrounds and individuals with disabilities, they also include examples relevant to gender and to older adults, people of various ethnicities and racial backgrounds, and young children, to illustrate potential barriers to fair and equitable assessment for all examinees.

Organization of the Volume

Part I of the Standards, "Foundations," contains standards for validity (chap. 1); reliability/precision and errors of measurement (chap. 2); and fairness in testing (chap. 3). Part II, "Operations," addresses test design and development (chap. 4); scores, scales, norms, score linking, and cut scores (chap. 5); test administration, scoring, reporting, and interpretation (chap. 6); supporting documentation for tests (chap. 7); the rights and responsibilities of test takers (chap. 8); and the rights and responsibilities of test users (chap. 9). Part III, "Testing Applications," treats specific applications in psychological testing and assessment (chap. 10); workplace testing and credentialing (chap. 11); educational testing and assessment (chap. 12); and uses of tests for program evaluation, policy studies, and accountability (chap. 13). Also included is a glossary, which provides definitions for terms as they are used specifically in this volume.

Each chapter begins with introductory text that provides background for the standards that follow. Although the introductory text is at times prescriptive, it should not be interpreted as imposing additional standards.

Categories of Standards

The text of each standard and any accompanying commentary include the conditions under which a standard is relevant. Depending on the context and purpose of test development or use, some standards will be more salient than others. Moreover, some standards are broad in scope, setting forth concerns or requirements relevant to nearly all tests or testing contexts, and other standards are narrower in scope. However, all standards are important in the contexts to which they apply. Any classification that gives the appearance of elevating the general importance of some standards over others could invite neglect of certain standards that need to be addressed in particular situations. Rather than differentiate standards using priority labels, such as "primary," "secondary," or "conditional" (as were used in the 1985 Standards), this edition emphasizes that unless a standard is deemed clearly irrelevant, inappropriate, or technically infeasible for a particular use, all standards should be met, making all of them essentially "primary" for that context.

Unless otherwise specified in a standard or commentary, and with the caveats outlined below, standards should be met before operational test use. Each standard should be carefully considered to determine its applicability to the testing context under consideration. In a given case there may be a sound professional reason that adherence to the standard is inappropriate. There may also be occasions when technical feasibility influences whether a standard can be met prior to operational test use. For example, some standards may call for analyses of data that are not available at the point of initial operational test use. In other cases, traditional quantitative analyses may not be feasible due to small sample sizes. However, there may be other methodologies that could be used to gather information to support the standard, such as small sample methodologies, qualitative studies, focus groups, and even logical analysis. In such instances, test developers and users should make a good faith effort to provide the kinds of data called for in the standard to support the valid interpretations of the test results for their intended purposes. If test developers, users, and, when applicable, sponsors have deemed a standard to be inapplicable or technically infeasible, they should be able, if called upon, to explain the basis for their decision. However, there is no expectation that documentation of all such decisions be routinely available.

Presentation of Individual Standards

Individual standards are presented after an introductory text that presents some key concepts for interpreting and applying the standards. In many cases, the standards themselves are coupled with one or more comments. These comments are intended to amplify, clarify, or provide examples to aid in the interpretation of the meaning of the standards. The standards often direct a developer or user to implement certain actions. Depending on the type of test, it is sometimes not clear in the statement of a standard to whom the standard is directed. For example, Standard 1.2 in the chapter "Validity" states:

    A rationale should be presented for each intended interpretation of test scores for a given use, together with a summary of the evidence and theory bearing on the intended interpretation.

The party responsible for implementing this standard is the party or person who is articulating the recommended interpretation of the test scores. This may be a test user, a test developer, or someone who is planning to use the test scores for a particular purpose, such as making classification or licensure decisions. It often is not possible in the statement of a standard to specify who is responsible for such actions; it is intended that the party or person performing the action specified in the standard be the party responsible for adhering to the standard.

Some of the individual standards and introductory text refer to groups and subgroups. The term group is generally used to identify the full examinee population, referred to as the intended examinee group, the intended test-taker group, the intended examinee population, or the population. A subgroup includes members of the larger group who are identifiable in some way that is relevant to the standard being applied. When data or analyses are indicated for various subgroups, they are generally referred to as subgroups within the intended examinee group, groups from the intended examinee population, or relevant subgroups.

In applying the Standards, it is important to bear in mind that the intended referent subgroups for the individual standards are context specific. For example, referent ethnic subgroups to be considered during the design phase of a test would depend on the expected ethnic composition of the intended test group. In addition, many more subgroups could be relevant to a standard dealing with the design of fair test questions than to a standard dealing with adaptations of a test's format. Users of the Standards will need to exercise professional judgment when deciding which particular subgroups are relevant for the application of a specific standard.

In deciding which subgroups are relevant for a particular standard, the following factors, among others, may be considered: credible evidence that suggests a group may face particular construct-irrelevant barriers to test performance, statutes or regulations that designate a group as relevant to score interpretations, and large numbers of individuals in the group within the general population. Depending on the context, relevant subgroups might include, for example, males and females, individuals of differing socioeconomic status, individuals differing by race and/or ethnicity, individuals with different sexual orientations, individuals with diverse linguistic and cultural backgrounds (particularly when testing extends across international borders), individuals with disabilities, young children, or older adults.

Numerous examples are provided in the Standards to clarify points or to provide illustrations of how to apply a particular standard. Many of the examples are drawn from research with students with disabilities or persons from diverse language or cultural groups; fewer, from research with other identifiable groups, such as young children or adults. There was also a purposeful effort to provide examples for educational, psychological, and industrial settings.

The standards in each chapter in Parts I and II ("Foundations" and "Operations") are introduced by an overarching standard, designed to convey the central intent of the chapter. These overarching standards are always numbered with .0 following the chapter number. For example, the overarching standard in chapter 1 is numbered 1.0. The overarching standards summarize guiding principles that are applicable to all tests and test uses. Further, the themes and standards in each chapter are ordered to be consistent with the sequence of the material in the introductory text for the chapter. Because some users of the Standards may turn only to chapters directly relevant to a given application, certain standards are repeated in different chapters, particularly in Part III, "Testing Applications." When such repetition occurs, the essence of the standard is the same. Only the wording, area of application, or level of elaboration in the comment is changed.

Cautions to Be Considered in Using the Standards

In addition to the legal disclaimer set forth above, several cautions are important if we are to avoid misinterpretations, misapplications, and misuses of the Standards:

• Evaluating the acceptability of a test or test application does not rest on the literal satisfaction of every standard in this document, and the acceptability of a test or test application cannot be determined by using a checklist. Specific circumstances affect the importance of individual standards, and individual standards should not be considered in isolation. Therefore, evaluating acceptability depends on (a) professional judgment that is based on a knowledge of behavioral science, psychometrics, and the relevant standards in the professional field to which the test applies; (b) the degree to which the intent of the standard has been satisfied by the test developer and user; (c) the alternative measurement devices that are readily available; (d) research and experiential evidence regarding the feasibility of meeting the standard; and (e) applicable laws and regulations.

• When tests are at issue in legal proceedings and other situations requiring expert witness testimony, it is essential that professional judgment be based on the accepted corpus of knowledge in determining the relevance of particular standards in a given situation. The intent of the Standards is to offer guidance for such judgments.

• Claims by test developers or test users that a test, manual, or procedure satisfies or follows the standards in this volume should be made with care. It is appropriate for developers or users to state that efforts were made to adhere to the Standards, and to provide documents describing and supporting those efforts. Blanket claims without supporting evidence should not be made.

• The standards are concerned with a field that is rapidly evolving. Consequently, there is a continuing need to monitor changes in the field and to revise this document as knowledge develops. The use of older versions of the Standards may be a disservice to test users and test takers.

• Requiring the use of specific technical methods is not the intent of the Standards. For example, where specific statistical reporting requirements are mentioned, the phrase "or generally accepted equivalent" should always be understood.
PART I

Foundations
1 . VALIDITY

BACKGROUND
Validity refers to the degree to which evidence the construct interpretation that will be made on
and theory support the interpretations of test the basis of the score or response pattern.
scores for proposed uses of tests. Validity is, Examples of constructs currently used in as
therefore, the most fundamental consideration in sessment include mathematics achievement, general
developing tests and evaluating tests.The process cognitive ability, racial identity attitudes, depression,
of validation involves accumulating relevant and self-esteem. To support test development,
evidence to provide a sound scientific basis for the proposed construct interpretation is elaborated
the proposed score interpretations.It is the inter by describing its scope and extent and by delin
pretations of test scores for proposed uses that are eating the aspects of the construct that are to be
evaluated, not the test itself. When test scores are represented.The detailed description provides a
interpreted in more than one way ( e.g., both to conceptual framework for the test, delineating
describe a test taker's current level of the attribute the knowledge, skills, abilities, traits, interests,
being measured and to make a prediction about a processes, competencies, or characteristics to be
future outcome), each intended interpretation assessed. Ideally, the framework indicates how
must be validated. Statements about validity the construct as represented is to be distinguished
should refer to particular interpretations for from other constructs and how it should relate to
specified uses.It is incorrect to use the unqualified other variables.
phrase "the validity of the test." The conceptual framework is partially shaped
Evidence of the validity of a given interpretation by the ways in which test scores will be used. For
of test scores for a specified use is a necessary con instance, a test of mathematics achievement might
dition for the justifiable use of the test.Where suf be used to place a student in an appropriate program
ficient evidence of validity exists, the decision as of instruction, to endorse a high school diploma,
to whether to actually administer a particular test or to inform a college admissions decision.Each of
generally takes additional considerations into ac these uses implies a somewhat different interpretation
count.These include cost-benefit considerations, of the mathematics achievement test scores: that a
framed in different subdisciplines as utility analysis student will benefit from a particular instructional
or as consideration of negative consequences of intervention, that a student has mastered a specified
test use, and a weighing of any negative consequences curriculum, or that a student is likely to be successful
against the positive consequences of test use. with college-level work.Similarly, a test of consci
Validation logically begins with an explicit entiousness might be used for psychological coun
statement of the proposed interpretation of test seling, to inform a decision about employment, or
scores, along with a rationale for the relevance of for the basic scientific purpose of elaborating the
the interpretation to the proposed use. The construct of conscientiousness. Each of these
proposed interpretation includes specifying the potential uses shapes the specified framework and
construct the test is intended to measure. The the proposed interpretation of the test's scores and
term construct is used in the Standards to refer to also can have implications for test development
the concept or characteristic that a test is designed and evaluation. Validation can be viewed as a
to measure.Rarely, if ever, is there a single possible process of constructing and evaluating arguments
meaning that can be attached to a test score or a for and against the intended interpretation of test
pattern of test responses. Thus, it is always in scores and their relevance to the proposed use.The
cumbent on test developers and users to specify conceptual framework points to the kinds of

11
The wide variety of tests and circumstances makes it natural that some types of evidence will be especially critical in a given case, whereas other types will be less useful. Decisions about what types of evidence are important for the validation argument in each instance can be clarified by developing a set of propositions or claims that support the proposed interpretation for the particular purpose of testing. For instance, when a mathematics achievement test is used to assess readiness for an advanced course, evidence for the following propositions might be relevant: (a) that certain skills are prerequisite for the advanced course; (b) that the content domain of the test is consistent with these prerequisite skills; (c) that test scores can be generalized across relevant sets of items; (d) that test scores are not unduly influenced by ancillary variables, such as writing ability; (e) that success in the advanced course can be validly assessed; and (f) that test takers with high scores on the test will be more successful in the advanced course than test takers with low scores on the test. Examples of propositions in other testing contexts might include, for instance, the proposition that test takers with high general anxiety scores experience significant anxiety in a range of settings, the proposition that a child's score on an intelligence scale is strongly related to the child's academic performance, or the proposition that a certain pattern of scores on a neuropsychological battery indicates impairment that is characteristic of brain injury. The validation process evolves as these propositions are articulated and evidence is gathered to evaluate their soundness.

Identifying the propositions implied by a proposed test interpretation can be facilitated by considering rival hypotheses that may challenge the proposed interpretation. It is also useful to consider the perspectives of different interested parties, existing experience with similar tests and contexts, and the expected consequences of the proposed test use. A finding of unintended consequences of test use may also prompt a consideration of rival hypotheses. Plausible rival hypotheses can often be generated by considering whether a test measures less or more than its proposed construct. Such considerations are referred to as construct underrepresentation (or construct deficiency) and construct-irrelevant variance (or construct contamination), respectively.

Construct underrepresentation refers to the degree to which a test fails to capture important aspects of the construct. It implies a narrowed meaning of test scores because the test does not adequately sample some types of content, engage some psychological processes, or elicit some ways of responding that are encompassed by the intended construct. Take, for example, a test intended as a comprehensive measure of anxiety. A particular test might underrepresent the intended construct because it measures only physiological reactions and not emotional, cognitive, or situational components. As another example, a test of reading comprehension intended to measure children's ability to read and interpret stories with understanding might not contain a sufficient variety of reading passages or might ignore a common type of reading material.

Construct-irrelevant variance refers to the degree to which test scores are affected by processes that are extraneous to the test's intended purpose. The test scores may be systematically influenced to some extent by processes that are not part of the construct. In the case of a reading comprehension test, these might include material too far above or below the level intended to be tested, an emotional reaction to the test content, familiarity with the subject matter of the reading passages on the test, or the writing skill needed to compose a response. Depending on the detailed definition of the construct, vocabulary knowledge or reading speed might also be irrelevant components. On a test designed to measure anxiety, a response bias to underreport one's anxiety might be considered a source of construct-irrelevant variance. In the case of a mathematics test, it might include overreliance on reading comprehension skills that English language learners may be lacking. On a test designed to measure science knowledge, test-taker internalizing of gender-based stereotypes about women in the sciences might be a source of construct-irrelevant variance.
Nearly all tests leave out elements that some potential users believe should be measured and include some elements that some potential users consider inappropriate. Validation involves careful attention to possible distortions in meaning arising from inadequate representation of the construct and also to aspects of measurement, such as test format, administration conditions, or language level, that may materially limit or qualify the interpretation of test scores for various groups of test takers. That is, the process of validation may lead to revisions in the test, in the conceptual framework of the test, or both. Interpretations drawn from the revised test would again need validation.

When propositions have been identified that would support the proposed interpretation of test scores, one can proceed with validation by obtaining empirical evidence, examining relevant literature, and/or conducting logical analyses to evaluate each of the propositions. Empirical evidence may include both local evidence, produced within the contexts where the test will be used, and evidence from similar testing applications in other settings. Use of existing evidence from similar tests and contexts can enhance the quality of the validity argument, especially when data for the test and context in question are limited.

Because an interpretation for a given use typically depends on more than one proposition, strong evidence in support of one part of the interpretation in no way diminishes the need for evidence to support other parts of the interpretation. For example, when an employment test is being considered for selection, a strong predictor-criterion relationship in an employment setting is ordinarily not sufficient to justify use of the test. One should also consider the appropriateness and meaningfulness of the criterion measure, the appropriateness of the testing materials and procedures for the full range of applicants, and the consistency of the support for the proposed interpretation across groups. Professional judgment guides decisions regarding the specific forms of evidence that can best support the intended interpretation for a specified use. As in all scientific endeavors, the quality of the evidence is paramount. A few pieces of solid evidence regarding a particular proposition are better than numerous pieces of evidence of questionable quality. The determination that a given test interpretation for a specific purpose is warranted is based on professional judgment that the preponderance of the available evidence supports that interpretation. The quality and quantity of evidence sufficient to reach this judgment may differ for test uses depending on the stakes involved in the testing. A given interpretation may not be warranted either as a result of insufficient evidence in support of it or as a result of credible evidence against it.

Validation is the joint responsibility of the test developer and the test user. The test developer is responsible for furnishing relevant evidence and a rationale in support of any test score interpretations for specified uses intended by the developer. The test user is ultimately responsible for evaluating the evidence in the particular setting in which the test is to be used. When a test user proposes an interpretation or use of test scores that differs from those supported by the test developer, the responsibility for providing validity evidence in support of that interpretation for the specified use falls to the user. It should be noted that important contributions to the validity evidence may be made as other researchers report findings of investigations that are related to the meaning of scores on the test.

Sources of Validity Evidence

The following sections outline various sources of evidence that might be used in evaluating the validity of a proposed interpretation of test scores for a particular use. These sources of evidence may illuminate different aspects of validity, but they do not represent distinct types of validity. Validity is a unitary concept. It is the degree to which all the accumulated evidence supports the intended interpretation of test scores for the proposed use. Like the 1999 Standards, this edition refers to types of validity evidence, rather than distinct types of validity. To emphasize this distinction, the treatment that follows does not follow historical nomenclature (i.e., the use of the terms content validity or predictive validity).
As the discussion in the prior section emphasizes, each type of evidence presented below is not required in all settings. Rather, support is needed for each proposition that underlies a proposed test interpretation for a specified use. A proposition that a test is predictive of a given criterion can be supported without evidence that the test samples a particular content domain. In contrast, a proposition that a test covers a representative sample of a particular curriculum may be supported without evidence that the test predicts a given criterion. However, a more complex set of propositions, e.g., that a test samples a specified domain and thus is predictive of a criterion reflecting a related domain, will require evidence supporting both parts of this set of propositions. Test developers are also expected to make the case that the scores are not unduly influenced by construct-irrelevant variance (see chap. 3 for detailed treatment of issues related to construct-irrelevant variance). In general, adequate support for proposed interpretations for specific uses will require multiple sources of evidence.

The position developed above also underscores the fact that if a given test is interpreted in multiple ways for multiple uses, the propositions underlying these interpretations for different uses also are likely to differ. Support is needed for the propositions underlying each interpretation for a specific use. Evidence supporting the interpretation of scores on a mathematics achievement test for placing students in subsequent courses (i.e., evidence that the test interpretation is valid for its intended purpose) does not permit inferring validity for other purposes (e.g., promotion or teacher evaluation).

Evidence Based on Test Content

Important validity evidence can be obtained from an analysis of the relationship between the content of a test and the construct it is intended to measure. Test content refers to the themes, wording, and format of the items, tasks, or questions on a test. Administration and scoring may also be relevant to content-based evidence. Test developers often work from a specification of the content domain. The content specification carefully describes the content in detail, often with a classification of areas of content and types of items. Evidence based on test content can include logical or empirical analyses of the adequacy with which the test content represents the content domain and of the relevance of the content domain to the proposed interpretation of test scores. Evidence based on content can also come from expert judgments of the relationship between parts of the test and the construct. For example, in developing a licensure test, the major facets that are relevant to the purpose for which the occupation is regulated can be specified, and experts in that occupation can be asked to assign test items to the categories defined by those facets. These or other experts can then judge the representativeness of the chosen set of items.

Some tests are based on systematic observations of behavior. For example, a list of the tasks constituting a job domain may be developed from observations of behavior in a job, together with judgments of subject matter experts. Expert judgments can be used to assess the relative importance, criticality, and/or frequency of the various tasks. A job sample test can then be constructed from a random or stratified sampling of tasks rated highly on these characteristics. The test can then be administered under standardized conditions in an off-the-job setting.

The appropriateness of a given content domain is related to the specific inferences to be made from test scores. Thus, when considering an available test for a purpose other than that for which it was first developed, it is especially important to evaluate the appropriateness of the original content domain for the proposed new purpose. For example, a test given for research purposes to compare student achievement across states in a given domain may properly also cover material that receives little or no attention in the curriculum. Policy makers can then evaluate student achievement with respect to both content neglected and content addressed. On the other hand, when student mastery of a delivered curriculum is tested for purposes of informing decisions about individual students, such as promotion or graduation, the framework elaborating a content domain is appropriately limited to what students have had an opportunity to learn from the curriculum as delivered.
Evidence about content can be used, in part, to address questions about differences in the meaning or interpretation of test scores across relevant subgroups of test takers. Of particular concern is the extent to which construct underrepresentation or construct-irrelevance may give an unfair advantage or disadvantage to one or more subgroups of test takers. For example, in an employment test, the use of vocabulary more complex than needed on the job may be a source of construct-irrelevant variance for English language learners or others. Careful review of the construct and test content domain by a diverse panel of experts may point to potential sources of irrelevant difficulty (or easiness) that require further investigation.

Content-oriented evidence of validation is at the heart of the process in the educational arena known as alignment, which involves evaluating the correspondence between student learning standards and test content. Content-sampling issues in the alignment process include evaluating whether test content appropriately samples the domain set forward in curriculum standards, whether the cognitive demands of test items correspond to the level reflected in the student learning standards (e.g., content standards), and whether the test avoids the inclusion of features irrelevant to the standard that is the intended target of each test item.

Evidence Based on Response Processes

Some construct interpretations involve more or less explicit assumptions about the cognitive processes engaged in by test takers. Theoretical and empirical analyses of the response processes of test takers can provide evidence concerning the fit between the construct and the detailed nature of the performance or response actually engaged in by test takers. For instance, if a test is intended to assess mathematical reasoning, it becomes important to determine whether test takers are, in fact, reasoning about the material given instead of following a standard algorithm applicable only to the specific items on the test.

Evidence based on response processes generally comes from analyses of individual responses. Questioning test takers from various groups making up the intended test-taking population about their performance strategies or responses to particular items can yield evidence that enriches the definition of a construct. Maintaining records that monitor the development of a response to a writing task, through successive written drafts or electronically monitored revisions, for instance, also provides evidence of process. Documentation of other aspects of performance, like eye movements or response times, may also be relevant to some constructs. Inferences about processes involved in performance can also be developed by analyzing the relationship among parts of the test and between the test and other variables. Wide individual differences in process can be revealing and may lead to reconsideration of certain test formats.

Evidence of response processes can contribute to answering questions about differences in meaning or interpretation of test scores across relevant subgroups of test takers. Process studies involving test takers from different subgroups can assist in determining the extent to which capabilities irrelevant or ancillary to the construct may be differentially influencing test takers' test performance.

Studies of response processes are not limited to the test taker. Assessments often rely on observers or judges to record and/or evaluate test takers' performances or products. In such cases, relevant validity evidence includes the extent to which the processes of observers or judges are consistent with the intended interpretation of scores. For instance, if judges are expected to apply particular criteria in scoring test takers' performances, it is important to ascertain whether they are, in fact, applying the appropriate criteria and not being influenced by factors that are irrelevant to the intended interpretation (e.g., quality of handwriting is irrelevant to judging the content of a written essay). Thus, validation may include empirical studies of how observers or judges record and evaluate data along with analyses of the appropriateness of these processes to the intended interpretation or construct definition.
While evidence about response processes may be central in settings where explicit claims about response processes are made by test developers or where inferences about responses are made by test users, there are many other cases where claims about response processes are not part of the validity argument. In some cases, multiple response processes are available for solving the problems of interest, and the construct of interest is only concerned with whether the problem was solved correctly. As a simple example, there may be multiple possible routes to obtaining the correct solution to a mathematical problem.

Evidence Based on Internal Structure

Analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based. The conceptual framework for a test may imply a single dimension of behavior, or it may posit several components that are each expected to be homogeneous, but that are also distinct from each other. For example, a measure of discomfort on a health survey might assess both physical and emotional health. The extent to which item interrelationships bear out the presumptions of the framework would be relevant to validity.

The specific types of analyses and their interpretation depend on how the test will be used. For example, if a particular application posited a series of increasingly difficult test components, empirical evidence of the extent to which response patterns conformed to this expectation would be provided. A theory that posited unidimensionality would call for evidence of item homogeneity. In this case, the number of items and item interrelationships form the basis for an estimate of score reliability, but such an index would be inappropriate for tests with a more complex internal structure.
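For a test posited to be unidimensional, item interrelationships are commonly summarized with an internal-consistency index such as coefficient alpha. The following sketch is illustrative only and is not part of the Standards; the function name and the small score matrix are hypothetical.

    import numpy as np

    def coefficient_alpha(scores: np.ndarray) -> float:
        """Coefficient alpha for a (test takers x items) matrix of item scores."""
        k = scores.shape[1]                              # number of items
        item_variances = scores.var(axis=0, ddof=1)      # variance of each item
        total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Hypothetical 0/1 (incorrect/correct) scores for six test takers on four items.
    scores = np.array([
        [1, 1, 1, 0],
        [1, 0, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
        [1, 1, 0, 1],
    ])
    print(round(coefficient_alpha(scores), 3))

As the text notes, such an index is informative only under the posited unidimensional structure; for a test with several distinct components, a single alpha computed on the total score could mask the intended structure.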
Some studies of the internal structure of tests are designed to show whether particular items may function differently for identifiable subgroups of test takers (e.g., racial/ethnic or gender subgroups). Differential item functioning occurs when different groups of test takers with similar overall ability, or similar status on an appropriate criterion, have, on average, systematically different responses to a particular item. This issue is discussed in chapter 3. However, differential item functioning is not always a flaw or weakness. Subsets of items that have a specific characteristic in common (e.g., specific content, task representation) may function differently for different groups of similarly scoring test takers. This indicates a kind of multidimensionality that may be unexpected or may conform to the test framework.
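One widely used statistical screen for differential item functioning is the Mantel-Haenszel procedure, which compares the odds of a correct response for a reference group and a focal group after matching test takers on total score. The sketch below is a simplified illustration with hypothetical data and invented names; a common odds ratio near 1.0 merely fails to flag the item, and a flagged item still requires substantive review.

    from collections import defaultdict

    def mantel_haenszel_odds_ratio(records):
        """records: (matched_total_score, group, correct) triples, with group in
        {"ref", "focal"}. Returns the MH common odds ratio for one item."""
        strata = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
        for score, group, correct in records:
            strata[score][group][0 if correct else 1] += 1
        numerator = denominator = 0.0
        for cells in strata.values():
            a, b = cells["ref"]    # reference group: correct, incorrect
            c, d = cells["focal"]  # focal group: correct, incorrect
            n = a + b + c + d
            if n:
                numerator += a * d / n
                denominator += b * c / n
        return numerator / denominator

    records = [
        (10, "ref", True), (10, "ref", False), (10, "focal", True), (10, "focal", False),
        (20, "ref", True), (20, "ref", True), (20, "focal", True), (20, "focal", False),
    ]
    # Values near 1.0 suggest comparable item functioning across the matched groups.
    print(round(mantel_haenszel_odds_ratio(records), 2))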
Evidence Based on Relations to Other Variables

In many cases, the intended interpretation for a given use implies that the construct should be related to some other variables, and, as a result, analyses of the relationship of test scores to variables external to the test provide another important source of validity evidence. External variables may include measures of some criteria that the test is expected to predict, as well as relationships to other tests hypothesized to measure the same constructs, and tests measuring related or different constructs. Measures other than test scores, such as performance criteria, are often used in employment settings. Categorical variables, including group membership variables, become relevant when the theory underlying a proposed test use suggests that group differences should be present or absent if a proposed test score interpretation is to be supported. Evidence based on relationships with other variables provides evidence about the degree to which these relationships are consistent with the construct underlying the proposed test score interpretations.
Convergent and discriminant evidence. Relationships between test scores and other measures intended to assess the same or similar constructs provide convergent evidence, whereas relationships between test scores and measures purportedly of different constructs provide discriminant evidence. For instance, within some theoretical frameworks, scores on a multiple-choice test of reading comprehension might be expected to relate closely (convergent evidence) to other measures of reading comprehension based on other methods, such as essay responses. Conversely, test scores might be expected to relate less closely (discriminant evidence) to measures of other skills, such as logical reasoning. Relationships among different methods of measuring the construct can be especially helpful in sharpening and elaborating score meaning and interpretation.
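In correlational terms, the expectation is that the pattern of coefficients, rather than any single value, supports the interpretation. A minimal sketch with hypothetical scores for ten test takers (all variable names and values are invented for illustration):

    import numpy as np

    mc_reading = np.array([12, 15, 9, 18, 14, 11, 16, 13, 10, 17])  # multiple-choice reading
    essay_reading = np.array([3, 4, 2, 5, 4, 3, 5, 3, 2, 5])        # essay-based reading
    logical_reasoning = np.array([7, 9, 10, 8, 6, 9, 7, 10, 8, 6])

    convergent_r = np.corrcoef(mc_reading, essay_reading)[0, 1]
    discriminant_r = np.corrcoef(mc_reading, logical_reasoning)[0, 1]
    print(f"convergent r = {convergent_r:.2f}, discriminant r = {discriminant_r:.2f}")

Support for the intended interpretation would appear as a convergent coefficient that is substantially larger than the discriminant coefficient.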
Evidence of relations with other variables can involve experimental as well as correlational evidence. Studies might be designed, for instance, to investigate whether scores on a measure of anxiety improve as a result of some psychological treatment or whether scores on a test of academic achievement differentiate between instructed and noninstructed groups. If performance increases due to short-term coaching are viewed as a threat to validity, it would be useful to investigate whether coached and uncoached groups perform differently.

Test-criterion relationships. Evidence of the relation of test scores to a relevant criterion may be expressed in various ways, but the fundamental question is always, how accurately do test scores predict criterion performance? The degree of accuracy and the score range within which accuracy is needed depends on the purpose for which the test is used.

The criterion variable is a measure of some attribute or outcome that is operationally distinct from the test. Thus, the test is not a measure of a criterion, but rather is a measure hypothesized as a potential predictor of that targeted criterion. Whether a test predicts a given criterion in a given context is a testable hypothesis. The criteria that are of interest are determined by test users, for example administrators in a school system or managers of a firm. The choice of the criterion and the measurement procedures used to obtain criterion scores are of central importance. The credibility of a test-criterion study depends on the relevance, reliability, and validity of the interpretation based on the criterion measure for a given testing application.

Historically, two designs, often called predictive and concurrent, have been distinguished for evaluating test-criterion relationships. A predictive study indicates the strength of the relationship between test scores and criterion scores that are obtained at a later time. A concurrent study obtains test scores and criterion information at about the same time. When prediction is actually contemplated, as in academic admission or employment settings, or in planning rehabilitation regimens, predictive studies can retain the temporal differences and other characteristics of the practical situation. Concurrent evidence, which avoids temporal changes, is particularly useful for psychodiagnostic tests or in investigating alternative measures of some specified construct for which an accepted measurement procedure already exists. The choice of a predictive or concurrent research strategy in a given domain is also usefully informed by prior research evidence regarding the extent to which predictive and concurrent studies in that domain yield the same or different results.
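In the simplest predictive design, the evidence is the correlation between test scores gathered at the time of selection and criterion scores gathered later; a concurrent design applies the same computation to measures collected at about the same time. A schematic sketch with hypothetical admissions data (all values invented):

    import numpy as np

    admission_test = np.array([520, 580, 610, 450, 700, 640, 490, 560])
    first_year_gpa = np.array([2.6, 3.0, 3.1, 2.2, 3.7, 3.2, 2.5, 2.9])  # one year later

    r = np.corrcoef(admission_test, first_year_gpa)[0, 1]
    slope, intercept = np.polyfit(admission_test, first_year_gpa, 1)
    print(f"test-criterion r = {r:.2f}")
    print(f"predicted GPA at a test score of 600: {intercept + slope * 600:.2f}")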
Test scores are sometimes used in allocating individuals to different treatments in a way that is advantageous for the institution and/or for the individuals. Examples would include assigning individuals to different jobs within an organization, or determining whether to place a given student in a remedial class or a regular class. In that context, evidence is needed to judge the suitability of using a test when classifying or assigning a person to one job versus another or to one treatment versus another. Support for the validity of the classification procedure is provided by showing that the test is useful in determining which persons are likely to profit differentially from one treatment or another. It is possible for tests to be highly predictive of performance for different education programs or jobs without providing the information necessary to make a comparative judgment of the efficacy of assignments or treatments. In general, decision rules for selection or placement are also influenced by the number of persons to be accepted or the numbers that can be accommodated in alternative placement categories (see chap. 11).
Evidence about relations to other variables is also used to investigate questions of differential prediction for subgroups. For instance, a finding that the relation of test scores to a relevant criterion variable differs from one subgroup to another may imply that the meaning of the scores is not the same for members of the different groups, perhaps due to construct underrepresentation or construct-irrelevant sources of variance. However, the difference may also imply that the criterion has different meaning for different groups. The differences in test-criterion relationships can also arise from measurement error, especially when group means differ, so such differences do not necessarily indicate differences in score meaning. See the discussion of fairness in chapter 3 for more extended consideration of possible courses of action when scores have different meanings for different groups.
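One common way to examine differential prediction is to compare subgroup regressions of the criterion on the test score; systematically different slopes or intercepts would prompt the kind of follow-up described above. A minimal sketch with hypothetical data (group labels and values invented):

    import numpy as np

    # (test scores, criterion scores) for two hypothetical subgroups
    groups = {
        "group 1": (np.array([40, 50, 60, 70, 80]), np.array([2.0, 2.4, 2.9, 3.3, 3.8])),
        "group 2": (np.array([40, 50, 60, 70, 80]), np.array([2.3, 2.6, 3.0, 3.5, 3.7])),
    }
    for label, (x, y) in groups.items():
        slope, intercept = np.polyfit(x, y, 1)  # least-squares line of criterion on test
        print(f"{label}: criterion = {intercept:.2f} + {slope:.3f} * test score")

As the text cautions, an observed difference between the lines does not by itself establish a difference in score meaning; measurement error and criterion differences must also be considered.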
Validity generalization. An important issue in educational and employment settings is the degree to which validity evidence based on test-criterion relations can be generalized to a new situation without further study of validity in that new situation. When a test is used to predict the same or similar criteria (e.g., performance of a given job) at different times or in different places, it is typically found that observed test-criterion correlations vary substantially. In the past, this has been taken to imply that local validation studies are always required. More recently, a variety of approaches to generalizing evidence from other settings has been developed, with meta-analysis the most widely used in the published literature. In particular, meta-analyses have shown that in some domains, much of this variability may be due to statistical artifacts such as sampling fluctuations and variations across validation studies in the ranges of test scores and in the reliability of criterion measures. When these and other influences are taken into account, it may be found that the remaining variability in validity coefficients is relatively small. Thus, statistical summaries of past validation studies in similar situations may be useful in estimating test-criterion relationships in a new situation. This practice is referred to as the study of validity generalization.
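Two statistical artifacts routinely corrected for in such summaries are unreliability in the criterion measure and range restriction in the predictor. The standard textbook corrections are sketched below; this is an illustration with invented figures, not a prescribed procedure, and the order and applicability of the corrections depend on the study design.

    import math

    def correct_for_criterion_unreliability(r_obs: float, r_yy: float) -> float:
        """Disattenuate an observed validity coefficient, where r_yy is the
        reliability of the criterion measure."""
        return r_obs / math.sqrt(r_yy)

    def correct_for_range_restriction(r_obs: float, u: float) -> float:
        """Thorndike Case II correction for direct range restriction on the
        predictor; u = (unrestricted SD) / (restricted SD), typically > 1."""
        return r_obs * u / math.sqrt(1 + r_obs ** 2 * (u ** 2 - 1))

    # Hypothetical local study: r = .25 among those hired, criterion reliability .70,
    # applicant-pool SD twice the SD among those hired.
    r = correct_for_range_restriction(0.25, 2.0)
    r = correct_for_criterion_unreliability(r, 0.70)
    print(round(r, 2))  # corrected estimate, about .55 for these hypothetical figures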
In some circumstances, there is a strong basis for using validity generalization. This would be the case where the meta-analytic database is large, where the meta-analytic data adequately represent the type of situation to which one wishes to generalize, and where correction for statistical artifacts produces a clear and consistent pattern of validity evidence. In such circumstances, the informational value of a local validity study may be relatively limited if not actually misleading, especially if its sample size is small. In other circumstances, the inferential leap required for generalization may be much larger. The meta-analytic database may be small, the findings may be less consistent, or the new situation may involve features markedly different from those represented in the meta-analytic database. In such circumstances, situation-specific validity evidence will be relatively more informative. Although research on validity generalization shows that results of a single local validation study may be quite imprecise, there are situations where a single study, carefully done, with adequate sample size, provides sufficient evidence to support or reject test use in a new situation. This highlights the importance of examining carefully the comparative informational value of local versus meta-analytic studies.

In conducting studies of the generalizability of validity evidence, the prior studies that are included may vary according to several situational facets. Some of the major facets are (a) differences in the way the predictor construct is measured, (b) the type of job or curriculum involved, (c) the type of criterion measure used, (d) the type of test takers, and (e) the time period in which the study was conducted. In any particular study of validity generalization, any number of these facets might vary, and a major objective of the study is to determine empirically the extent to which variation in these facets affects the test-criterion correlations obtained.
The extent to which predictive or concurrent validity evidence can be generalized to new situations is in large measure a function of accumulated research. Although evidence of generalization can often help to support a claim of validity in a new situation, the extent of available data limits the degree to which the claim can be sustained.

The above discussion focuses on the use of cumulative databases to estimate predictor-criterion relationships. Meta-analytic techniques can also be used to summarize other forms of data relevant to other inferences one may wish to draw from test scores in a particular application, such as effects of coaching and effects of certain alterations in testing conditions for test takers with specified disabilities. Gathering evidence about how well validity findings can be generalized across groups of test takers is an important part of the validation process. When the evidence suggests that inferences from test scores can be drawn for some subgroups but not for others, pursuing options such as those discussed in chapter 3 can reduce the risk of unfair test use.

Evidence for Validity and Consequences of Testing

Some consequences of test use follow directly from the interpretation of test scores for uses intended by the test developer. The validation process involves gathering evidence to evaluate the soundness of these proposed interpretations for their intended uses.

Other consequences may also be part of a claim that extends beyond the interpretation or use of scores intended by the test developer. For example, a test of student achievement might provide data for a system intended to identify and improve lower-performing schools. The claim that testing results, used this way, will result in improved student learning may rest on propositions about the system or intervention itself, beyond propositions based on the meaning of the test itself. Consequences may point to the need for evidence about components of the system that will go beyond the interpretation of test scores as a valid measure of student achievement.

Still other consequences are unintended, and are often negative. For example, school district or statewide educational testing on selected subjects may lead teachers to focus on those subjects at the expense of others. As another example, a test developed to measure knowledge needed for a given job may result in lower passing rates for one group than for another. Unintended consequences merit close examination. While not all consequences can be anticipated, in some cases factors such as prior experiences in other settings offer a basis for anticipating and proactively addressing unintended consequences. See chapter 12 for additional examples from educational settings. In some cases, actions to address one consequence bring about other consequences. One example involves the notion of "missed opportunities," as in the case of moving to computerized scoring of student essays to increase grading consistency, thus forgoing the educational benefits of addressing the same problem by training teachers to grade more consistently. These types of consideration of consequences of testing are discussed further below.

Interpretation and uses of test scores intended by test developers. Tests are commonly administered in the expectation that some benefit will be realized from the interpretation and use of the scores intended by the test developers. A few of the many possible benefits that might be claimed are selection of efficacious therapies, placement of workers in suitable jobs, prevention of unqualified individuals from entering a profession, or improvement of classroom instructional practices. A fundamental purpose of validation is to indicate whether these specific benefits are likely to be realized. Thus, in the case of a test used in placement decisions, the validation would be informed by evidence that alternative placements, in fact, are differentially beneficial to the persons and the institution. In the case of employment testing, if a test publisher asserts that use of the test will result in reduced employee training costs, improved workforce efficiency, or some other benefit, then the validation would be informed by evidence in support of that proposition.
It is important to note that the validity of test score interpretations depends not only on the uses of the test scores but specifically on the claims that underlie the theory of action for these uses. For example, consider a school district that wants to determine children's readiness for kindergarten, and so administers a test battery and screens out students with low scores. If higher scores do, in fact, predict higher performance on key kindergarten tasks, the claim that use of the test scores for screening results in higher performance on these key tasks is supported, and the interpretation of the test scores as a predictor of kindergarten readiness would be valid. If, however, the claim were made that use of the test scores for screening would result in the greatest benefit to students, the interpretation of test scores as indicators of readiness for kindergarten might not be valid, because students with low scores might actually benefit more from access to kindergarten. In this case, different evidence is needed to support different claims that might be made about the same use of the screening test (for example, evidence that students below a certain cut score benefit more from another assignment than from assignment to kindergarten). The test developer is responsible for the validation of the interpretation that the test scores assess the indicated readiness skills. The school district is responsible for the validation of the proper interpretation of the readiness test scores and for evaluation of the policy of using the readiness test for placement/admissions decisions.

Claims made about test use that are not directly based on test score interpretations. Claims are sometimes made for benefits of testing that go beyond the direct interpretations or uses of the test scores themselves that are specified by the test developers. Educational tests, for example, may be advocated on the grounds that their use will improve student motivation to learn or encourage changes in classroom instructional practices by holding educators accountable for valued learning outcomes. Where such claims are central to the rationale advanced for testing, the direct examination of testing consequences necessarily assumes even greater importance. Those making the claims are responsible for evaluation of the claims. In some cases, such information can be drawn from existing data collected for purposes other than test validation; in other cases new information will be needed to address the impact of the testing program.

Consequences that are unintended. Test score interpretation for a given use may result in unintended consequences. A key distinction is between consequences that result from a source of error in the intended test score interpretation for a given use and consequences that do not result from error in test score interpretation. Examples of each are given below.

As discussed at some length in chapter 3, one domain in which unintended negative consequences of test use are at times observed involves test score differences for groups defined in terms of race/ethnicity, gender, age, and other characteristics. In such cases, however, it is important to distinguish between evidence that is directly relevant to validity and evidence that may inform decisions about social policy but falls outside the realm of validity. For example, concerns have been raised about the effect of group differences in test scores on employment selection and promotion, the placement of children in special education classes, and the narrowing of a school's curriculum to exclude learning objectives that are not assessed. Although the information about the consequences of testing may influence decisions about test use, such consequences do not, in and of themselves, detract from the validity of intended interpretations of the test scores. Rather, judgments of validity or invalidity in the light of testing consequences depend on a more searching inquiry into the sources of those consequences.
Take, as an example, a finding of different hiring rates for members of different groups as a consequence of using an employment test. If the difference is due solely to an unequal distribution of the skills the test purports to measure, and if those skills are, in fact, important contributors to job performance, then the finding of group differences per se does not imply any lack of validity for the intended interpretation. If, however, the test measured skill differences unrelated to job performance (e.g., a sophisticated reading test for a job that required only minimal functional literacy), or if the differences were due to the test's sensitivity to some test-taker characteristic not intended to be part of the test construct, then the intended interpretation of test scores as predicting job performance in a comparable manner for all groups of applicants would be rendered invalid, even if test scores correlated positively with some measure of job performance. If a test covers most of the relevant content domain but omits some areas, the content coverage might be judged adequate for some purposes. However, if it is found that excluding some components that could readily be assessed has a noticeable impact on selection rates for groups of interest (e.g., subgroup differences are found to be smaller on excluded components than on included components), the intended interpretation of test scores as predicting job performance in a comparable manner for all groups of applicants would be rendered invalid. Thus, evidence about consequences is relevant to validity when it can be traced to a source of invalidity such as construct underrepresentation or construct-irrelevant components. Evidence about consequences that cannot be so traced is not relevant to the validity of the intended interpretations of the test scores.

As another example, consider the case where research supports an employer's use of a particular test in the personality domain (i.e., the test proves to be predictive of an aspect of subsequent job performance), but it is found that some applicants form a negative opinion of the organization due to the perception that the test invades personal privacy. Thus, there is an unintended negative consequence of test use, but one that is not due to a flaw in the intended interpretation of test scores as predicting subsequent performance. Some employers faced with this situation may conclude that this negative consequence is grounds for discontinuing test use; others may conclude that the benefits gained by screening applicants outweigh this negative consequence. As this example illustrates, a consideration of consequences can influence a decision about test use, even though the consequence is independent of the validity of the intended test score interpretation. The example also illustrates that different decision makers may make different value judgments about the impact of consequences on test use.

The fact that the validity evidence supports the intended interpretation of test scores for use in applicant screening does not mean that test use is thus required: Issues other than validity, including legal constraints, can play an important and, in some cases, a determinative role in decisions about test use. Legal constraints may also limit an employer's discretion to discard test scores from tests that have already been administered, when that decision is based on differences in scores for subgroups of different races, ethnicities, or genders.

Note that unintended consequences can also be positive. Reversing the above example of test takers who form a negative impression of an organization based on the use of a particular test, a different test may be viewed favorably by applicants, leading to a positive impression of the organization. A given test use may result in multiple consequences, some positive and some negative.

In short, decisions about test use are appropriately informed by validity evidence about intended test score interpretations for a given use, by evidence evaluating additional claims about consequences of test use that do not follow directly from test score interpretations, and by value judgments about unintended positive and negative consequences of test use.

Integrating the Validity Evidence

A sound validity argument integrates various strands of evidence into a coherent account of the degree to which existing evidence and theory support the intended interpretation of test scores for specific uses. It encompasses evidence gathered from new studies and evidence available from earlier reported research. The validity argument may indicate the need for refining the definition of the construct, may suggest revisions in the test or other aspects of the testing process, and may indicate areas needing further study.
It is commonly observed that the validation process never ends, as there is always additional information that can be gathered to more fully understand a test and the inferences that can be drawn from it. In this way an inference of validity is similar to any scientific inference. However, a test interpretation for a given use rests on evidence for a set of propositions making up the validity argument, and at some point validation evidence allows for a summary judgment of the intended interpretation that is well supported and defensible. At some point the effort to provide sufficient validity evidence to support a given test interpretation for a specific use does end (at least provisionally, pending the emergence of a strong basis for questioning that judgment). Legal requirements may necessitate that the validation study be updated in light of such factors as changes in the test population or newly developed alternative testing methods.

The amount and character of evidence required to support a provisional judgment of validity often vary between areas and also within an area as research on a topic advances. For example, prevailing standards of evidence may vary with the stakes involved in the use or interpretation of the test scores. Higher stakes may entail higher standards of evidence. As another example, in areas where data collection comes at a greater cost, one may find it necessary to base interpretations on fewer data than in areas where data collection comes with less cost.

Ultimately, the validity of an intended interpretation of test scores relies on all the available evidence relevant to the technical quality of a testing system. Different components of validity evidence are described in subsequent chapters of the Standards, and include evidence of careful test construction; adequate score reliability; appropriate test administration and scoring; accurate score scaling, equating, and standard setting; and careful attention to fairness for all test takers, as appropriate to the test interpretation in question.
STANDARDS FOR VALIDITY

The standards in this chapter begin with an overarching standard (numbered 1.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into three thematic clusters labeled as follows:

1. Establishing Intended Uses and Interpretations
2. Issues Regarding Samples and Settings Used in Validation
3. Specific Forms of Validity Evidence

Standard 1.0

Clear articulation of each intended test score interpretation for a specified use should be set forth, and appropriate validity evidence in support of each intended interpretation should be provided.

Cluster 1. Establishing Intended Uses and Interpretations

Standard 1.1

The test developer should set forth clearly how test scores are intended to be interpreted and consequently used. The population(s) for which a test is intended should be delimited clearly, and the construct or constructs that the test is intended to assess should be described clearly.

Comment: Statements about validity should refer to particular interpretations and consequent uses. It is incorrect to use the unqualified phrase "the validity of the test." No test permits interpretations that are valid for all purposes or in all situations. Each recommended interpretation for a given use requires validation. The test developer should specify in clear language the population for which the test is intended, the construct it is intended to measure, the contexts in which test scores are to be employed, and the processes by which the test is to be administered and scored.

Standard 1.2

A rationale should be presented for each intended interpretation of test scores for a given use, together with a summary of the evidence and theory bearing on the intended interpretation.

Comment: The rationale should indicate what propositions are necessary to investigate the intended interpretation. The summary should combine logical analysis with empirical evidence to provide support for the test rationale. Evidence may come from studies conducted locally, in the setting where the test is to be used; from specific prior studies; or from comprehensive statistical syntheses of available studies meeting clearly specified study quality criteria. No type of evidence is inherently preferable to others; rather, the quality and relevance of the evidence to the intended test score interpretation for a given use determine the value of a particular kind of evidence. A presentation of empirical evidence on any point should give due weight to all relevant findings in the scientific literature, including those inconsistent with the intended interpretation or use. Test developers have the responsibility to provide support for their own recommendations, but test users bear ultimate responsibility for evaluating the quality of the validity evidence provided and its relevance to the local situation.

Standard 1.3

If validity for some common or likely interpretation for a given use has not been evaluated, or if such an interpretation is inconsistent with available evidence, that fact should be made clear and potential users should be strongly cautioned about making unsupported interpretations.

Comment: If past experience suggests that a test is likely to be used inappropriately for certain kinds of decisions or certain kinds of test takers, specific warnings against such uses should be given. Professional judgment is required to evaluate the extent to which existing validity evidence supports a given test use.
Standard 1.4

If a test score is interpreted for a given use in a way that has not been validated, it is incumbent on the user to justify the new interpretation for that use, providing a rationale and collecting new evidence, if necessary.

Comment: Professional judgment is required to evaluate the extent to which existing validity evidence applies in the new situation or to the new group of test takers and to determine what new evidence may be needed. The amount and kinds of new evidence required may be influenced by experience with similar prior test uses or interpretations and by the amount, quality, and relevance of existing data.

A test that has been altered or administered in ways that change the construct underlying the test for use with subgroups of the population requires evidence of the validity of the interpretation made on the basis of the modified test (see chap. 3). For example, if a test is adapted for use with individuals with a particular disability in a way that changes the underlying construct, the modified test should have its own evidence of validity for the intended interpretation.

Standard 1.5

When it is clearly stated or implied that a recommended test score interpretation for a given use will result in a specific outcome, the basis for expecting that outcome should be presented, together with relevant evidence.

Comment: If it is asserted, for example, that interpreting and using scores on a given test for employee selection will result in reduced employee errors or training costs, evidence in support of that assertion should be provided. A given claim may be supported by logical or theoretical argument as well as empirical data. Appropriate weight should be given to findings in the scientific literature that may be inconsistent with the stated expectation.

Standard 1.6

When a test use is recommended on the grounds that testing or the testing program itself will result in some indirect benefit, in addition to the utility of information from interpretation of the test scores themselves, the recommender should make explicit the rationale for anticipating the indirect benefit. Logical or theoretical arguments and empirical evidence for the indirect benefit should be provided. Appropriate weight should be given to any contradictory findings in the scientific literature, including findings suggesting important indirect outcomes other than those predicted.

Comment: For example, certain educational testing programs have been advocated on the grounds that they would have a salutary influence on classroom instructional practices or would clarify students' understanding of the kind or level of achievement they were expected to attain. To the extent that such claims enter into the justification for a testing program, they become part of the argument for test use. Evidence for such claims should be examined, in conjunction with evidence about the validity of intended test score interpretation and evidence about unintended negative consequences of test use, in making an overall decision about test use. Due weight should be given to evidence against such predictions, for example, evidence that under some conditions educational testing may have a negative effect on classroom instruction.

Standard 1.7

If test performance, or a decision made therefrom, is claimed to be essentially unaffected by practice and coaching, then the propensity for test performance to change with these forms of instruction should be documented.
Comment: Materials to aid in score interpretation should summarize evidence indicating the degree to which improvement with practice or coaching can be expected. Also, materials written for test takers should provide practical guidance about the value of test preparation activities, including coaching.

Cluster 2. Issues Regarding Samples and Settings Used in Validation

Standard 1.8

The composition of any sample of test takers from which validity evidence is obtained should be described in as much detail as is practical and permissible, including major relevant sociodemographic and developmental characteristics.

Comment: Statistical findings can be influenced by factors affecting the sample on which the results are based. When the sample is intended to represent a population, that population should be described, and attention should be drawn to any systematic factors that may limit the representativeness of the sample. Factors that might reasonably be expected to affect the results include self-selection, attrition, linguistic ability, disability status, and exclusion criteria, among others. If the participants in a validity study are patients, for example, then the diagnoses of the patients are important, as well as other characteristics, such as the severity of the diagnosed conditions. For tests used in employment settings, the employment status (e.g., applicants versus current job holders), the general level of experience and educational background, and the gender and ethnic composition of the sample may be relevant information. For tests used in credentialing, the status of those providing information (e.g., candidates for a credential versus already-credentialed individuals) is important for interpreting the resulting data. For tests used in educational settings, relevant information may include educational background, developmental level, community characteristics, or school admissions policies, as well as the gender and ethnic composition of the sample. Sometimes legal restrictions about privacy preclude obtaining or disclosing such population information or limit the level of particularity at which such data may be disclosed. The specific privacy laws, if any, governing the type of data should be considered, in order to ensure that any description of a population does not have the potential to identify an individual in a manner inconsistent with such standards. The extent of missing data, if any, and the methods for handling missing data (e.g., use of imputation procedures) should be described.

Standard 1.9

When a validation rests in part on the opinions or decisions of expert judges, observers, or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The qualifications and experience of the judges should be presented. The description of procedures should include any training and instructions provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth.

Comment: Systematic collection of judgments or opinions may occur at many points in test construction (e.g., eliciting expert judgments of content appropriateness or adequate content representation), in the formulation of rules or standards for score interpretation (e.g., in setting cut scores), or in test scoring (e.g., rating of essay responses). Whenever such procedures are employed, the quality of the resulting judgments is important to the validation. Level of agreement should be specified clearly (e.g., whether percent agreement refers to agreement prior to or after a consensus discussion, and whether the criterion for agreement is exact agreement of ratings or agreement within a certain number of scale points). The basis for specifying certain types of individuals (e.g., experienced teachers, experienced job incumbents, supervisors) as appropriate experts for the judgment or rating task should be articulated. It may be entirely appropriate to have experts work together to reach consensus, but it would not then be appropriate to treat their respective judgments as statistically independent. Different judges may be used for different purposes (e.g., one set may rate items for cultural sensitivity while another may rate for reading level) or for different portions of a test.
CHAPTER 1

job incumbents, supervisors) as appropriate experts ifying and generating test content should be de
for the judgment or rating task should be articulated. scribed and justified with reference to the intended
It may be entirely appropriate to have experts work population to be tested and the construct the
together to reach consensus, but it would not then test is intended to measure or the domain it is
be appropriate to treat their respective judgments intended to represent. If the definition of the
as statistically independent. Different judges may content sampled incorporates criteria such as
be used for different purposes ( e.g., one set may importance, frequency, or criticality, these criteria
rate items for cultural sensitivity while another should also be clearly explained and justified.
may rate for reading level) or for different portions
Comment: For example, test developers might
of a test.
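As an illustration of the distinction drawn above, the following sketch (not part of the standard) computes exact agreement and agreement within one scale point for two raters; the ratings and the one-point tolerance are hypothetical.

    import numpy as np

    # Hypothetical ratings of ten essays by two independent raters on a 1-6 scale.
    rater_a = np.array([4, 3, 5, 2, 6, 4, 3, 5, 1, 4])
    rater_b = np.array([4, 2, 5, 3, 6, 5, 3, 5, 2, 3])

    # Exact agreement: both raters assign the identical score point.
    exact = np.mean(rater_a == rater_b)

    # Agreement within one scale point: a looser criterion that should be
    # reported as such, never interchangeably with exact agreement.
    within_one = np.mean(np.abs(rater_a - rater_b) <= 1)

    print(f"Exact agreement: {exact:.2f}")              # 0.50 for these data
    print(f"Agreement within 1 point: {within_one:.2f}")  # 1.00 for these data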
Standard 1.10

When validity evidence includes statistical analyses of test results, either alone or together with data on other variables, the conditions under which the data were collected should be described in enough detail that users can judge the relevance of the statistical findings to local conditions. Attention should be drawn to any features of a validation data collection that are likely to differ from typical operational testing conditions and that could plausibly influence test performance.

Comment: Such conditions might include (but would not be limited to) the following: test-taker motivation or prior preparation, the range of test scores over test takers, the time allowed for test takers to respond or other administrative conditions, the mode of test administration (e.g., unproctored online testing versus proctored on-site testing), examiner training or other examiner characteristics, the time intervals separating collection of data on different measures, or conditions that may have changed since the validity evidence was obtained.

Cluster 3. Specific Forms of Validity Evidence

(a) Content-Oriented Evidence

Standard 1.11

When the rationale for test score interpretation for a given use rests in part on the appropriateness of test content, the procedures followed in specifying and generating test content should be described and justified with reference to the intended population to be tested and the construct the test is intended to measure or the domain it is intended to represent. If the definition of the content sampled incorporates criteria such as importance, frequency, or criticality, these criteria should also be clearly explained and justified.

Comment: For example, test developers might provide a logical structure that maps the items on the test to the content domain, illustrating the relevance of each item and the adequacy with which the set of items represents the content domain. Areas of the content domain that are not included among the test items could be indicated as well. The match of test content to the targeted domain in terms of cognitive complexity and the accessibility of the test content to all members of the intended population are also important considerations.

(b) Evidence Regarding Cognitive Processes

Standard 1.12

If the rationale for score interpretation for a given use depends on premises about the psychological processes or cognitive operations of test takers, then theoretical or empirical evidence in support of those premises should be provided. When statements about the processes employed by observers or scorers are part of the argument for validity, similar information should be provided.

Comment: If the test specification delineates the processes to be assessed, then evidence is needed that the test items do, in fact, tap the intended processes.

(c) Evidence Regarding Internal Structure

Standard 1.13

If the rationale for a test score interpretation for a given use depends on premises about the relationships among test items or among parts of the test, evidence concerning the internal structure of the test should be provided.

Comment: It might be claimed, for example, that a test is essentially unidimensional. Such a claim could be supported by a multivariate statistical analysis, such as a factor analysis, showing that the score variability attributable to one major dimension was much greater than the score variability attributable to any other identified dimension, or showing that a single factor adequately accounts for the covariation among test items. When a test provides more than one score, the interrelationships of those scores should be shown to be consistent with the construct(s) being assessed.
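The kind of analysis this comment describes can be sketched crudely by inspecting the eigenvalues of the inter-item correlation matrix: a first eigenvalue far larger than the second is often read as evidence of one dominant dimension. The data below are simulated, and a full validation would use a proper factor-analytic model rather than this shortcut.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulate 500 examinees responding to 10 items driven by one common factor.
    ability = rng.normal(size=(500, 1))
    loadings = rng.uniform(0.5, 0.8, size=(1, 10))
    items = ability @ loadings + rng.normal(scale=0.6, size=(500, 10))

    # Eigenvalues of the inter-item correlation matrix, sorted largest first.
    eigenvalues = np.linalg.eigvalsh(np.corrcoef(items, rowvar=False))[::-1]

    # A large ratio of the first to the second eigenvalue is a crude signal
    # that a single dimension accounts for most of the item covariation.
    print("First two eigenvalues:", np.round(eigenvalues[:2], 2))
    print("Ratio:", round(eigenvalues[0] / eigenvalues[1], 1))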

Standard 1.14

When interpretation of subscores, score differences, or profiles is suggested, the rationale and relevant evidence in support of such interpretation should be provided. Where composite scores are developed, the basis and rationale for arriving at the composites should be given.

Comment: When a test provides more than one score, the distinctiveness and reliability of the separate scores should be demonstrated, and the interrelationships of those scores should be shown to be consistent with the construct(s) being assessed. Moreover, evidence for the validity of interpretations of two or more separate scores would not necessarily justify a statistical or substantive interpretation of the difference between them. Rather, the rationale and supporting evidence must pertain directly to the specific score, score combination, or score pattern to be interpreted for a given use. When subscores from one test or scores from different tests are combined into a composite, the basis for combining scores and for how scores are combined (e.g., differential weighting versus simple summation) should be specified.
Standard 1.15

When interpretation of performance on specific items, or small subsets of items, is suggested, the rationale and relevant evidence in support of such interpretation should be provided. When interpretation of individual item responses is likely but is not recommended by the developer, the user should be warned against making such interpretations.

Comment: Users should be given sufficient guidance to enable them to judge the degree of confidence warranted for any interpretation for a use recommended by the test developer. Test manuals and score reports should discourage overinterpretation of information that may be subject to considerable error. This is especially important if interpretation of performance on isolated items, small subsets of items, or subtest scores is suggested.


(d) Evidence Regarding Relationships With Conceptually Related Constructs

Standard 1.16

When validity evidence includes empirical analyses of responses to test items together with data on other variables, the rationale for selecting the additional variables should be provided. Where appropriate and feasible, evidence concerning the constructs represented by other variables, as well as their technical properties, should be presented or cited. Attention should be drawn to any likely sources of dependence (or lack of independence) among variables other than dependencies among the construct(s) they represent.

Comment: The patterns of association between and among scores on the test under study and other variables should be consistent with theoretical expectations. The additional variables might be demographic characteristics, indicators of treatment conditions, or scores on other measures. They might include intended measures of the same construct or of different constructs. The reliability of scores from such other measures and the validity of intended interpretations of scores from these measures are an important part of the validity evidence for the test under study. If such variables include composite scores, the manner in which the composites were constructed should be explained (e.g., transformation or standardization of the variables, and weighting of the variables). In addition to considering the properties of each variable in isolation, it is important to guard against faulty interpretations arising from spurious sources of dependency among measures, including correlated errors or shared variance due to common methods of measurement or common elements.
(e) Evidence Regarding Relationships With Criteria

Standard 1.17

When validation relies on evidence that test scores are related to one or more criterion variables, information about the suitability and technical quality of the criteria should be reported.

Comment: The description of each criterion variable should include evidence concerning its reliability, the extent to which it represents the intended construct (e.g., task performance on the job), and the extent to which it is likely to be influenced by extraneous sources of variance. Special attention should be given to sources that previous research suggests may introduce extraneous variance that might bias the criterion for or against identifiable groups.
Standard 1.18

When it is asserted that a certain level of test performance predicts adequate or inadequate criterion performance, information about the levels of criterion performance associated with given levels of test scores should be provided.

Comment: For purposes of linking specific test scores with specific levels of criterion performance, regression equations are more useful than correlation coefficients, which are generally insufficient to fully describe patterns of association between tests and other variables. Means, standard deviations, and other statistical summaries are needed, as well as information about the distribution of criterion performances conditional upon a given test score. In the case of categorical rather than continuous variables, techniques appropriate to such data should be used (e.g., the use of logistic regression in the case of a dichotomous criterion). Evidence about the overall association between variables should be supplemented by information about the form of that association and about the variability of that association in different ranges of test scores. Note that data collections employing test takers selected for their extreme scores on one or more measures (extreme groups) typically cannot provide adequate information about the association.
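A minimal sketch of the logistic-regression approach mentioned above, assuming scikit-learn is available; the scores and the dichotomous criterion are simulated, and the output is the estimated probability of adequate criterion performance conditional on selected test scores.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)

    # Hypothetical data: test scores and a dichotomous criterion (1 = adequate).
    scores = rng.normal(50, 10, size=300)
    prob_true = 1 / (1 + np.exp(-(scores - 50) / 5))
    criterion = rng.binomial(1, prob_true)

    # Logistic regression of the criterion on the test score.
    model = LogisticRegression().fit(scores.reshape(-1, 1), criterion)

    # Estimated probability of adequate performance at selected score levels,
    # i.e., the criterion distribution conditional on a given test score.
    for x in (40, 50, 60):
        p = model.predict_proba([[x]])[0, 1]
        print(f"score {x}: P(adequate) = {p:.2f}")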
Standard 1.19

If test scores are used in conjunction with other variables to predict some outcome or criterion, analyses based on statistical models of the predictor-criterion relationship should include those additional relevant variables along with the test scores.

Comment: In general, if several predictors of some criterion are available, the optimum combination of predictors cannot be determined solely from separate, pairwise examinations of the criterion variable with each separate predictor in turn, due to intercorrelation among predictors. It is often informative to estimate the increment in predictive accuracy that may be expected when each variable, including the test score, is introduced in addition to all other available variables. As empirically derived weights for combining predictors can capitalize on chance factors in a given sample, analyses involving multiple predictors should be verified by cross-validation or equivalent analysis whenever feasible, and the precision of estimated regression coefficients or other indices should be reported. Cross-validation procedures include formula estimates of validity in subsequent samples and empirical approaches such as deriving weights in one portion of a sample and applying them to an independent subsample.


Standard 1.20

When effect size measures (e.g., correlations between test scores and criterion measures, standardized mean test score differences between subgroups) are used to draw inferences that go beyond describing the sample or samples on which data have been collected, indices of the degree of uncertainty associated with these measures (e.g., standard errors, confidence intervals, or significance tests) should be reported.

Comment: Effect size measures are usefully paired with indices reflecting their sampling error to make meaningful evaluation possible. There are various possible measures of effect size, each applicable to different settings. In the presentation of indices of uncertainty, standard errors or confidence intervals provide more information and thus are preferred in place of, or as supplements to, significance testing.
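One common way to pair an effect size with a confidence interval, sketched here for a correlation coefficient via the standard Fisher z transformation; the sample values are hypothetical.

    import numpy as np

    # Hypothetical validity coefficient and sample size.
    r, n = 0.35, 120

    # Fisher z transformation; the transformed value is approximately normal
    # with standard error 1 / sqrt(n - 3).
    z = np.arctanh(r)
    se = 1 / np.sqrt(n - 3)

    # 95% confidence interval, transformed back to the correlation metric.
    lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
    print(f"r = {r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")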
Standard 1.21

When statistical adjustments, such as those for restriction of range or attenuation, are made, both adjusted and unadjusted coefficients, as well as the specific procedure used, and all statistics used in the adjustment, should be reported. Estimates of the construct-criterion relationship that remove the effects of measurement error on the test should be clearly reported as adjusted estimates.

Comment: The correlation between two variables, such as test scores and criterion measures, depends on the range of values on each variable. For example, the test scores and the criterion values of a selected subset of test takers (e.g., job applicants who have been selected for hire) will typically have a smaller range than the scores of all test takers (e.g., the entire applicant pool). Statistical methods are available for adjusting the correlation to reflect the population of interest rather than the sample available. Such adjustments are often appropriate, as when results are compared across various situations. The correlation between two variables is also affected by measurement error, and methods are available for adjusting the correlation to estimate the strength of the correlation net of the effects of measurement error in either or both variables. Reporting of an adjusted correlation should be accompanied by a statement of the method and the statistics used in making the adjustment.
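Two widely used adjustments of this kind are sketched below: the classical correction for direct range restriction on the predictor and the correction for attenuation due to unreliability. The input statistics are hypothetical, and, consistent with the standard, both unadjusted and adjusted values are reported.

    import numpy as np

    # Hypothetical statistics: restricted-sample validity, restricted and
    # unrestricted predictor standard deviations, and score reliabilities.
    r = 0.30                 # observed correlation in the selected sample
    sd_restricted = 6.0
    sd_population = 10.0
    rxx, ryy = 0.85, 0.70    # reliabilities of test and criterion scores

    # Classical correction for direct range restriction on the predictor.
    u = sd_population / sd_restricted
    r_range = (r * u) / np.sqrt(1 - r**2 + (r**2) * (u**2))

    # Correction for attenuation due to measurement error in both variables.
    r_disattenuated = r / np.sqrt(rxx * ryy)

    print(f"Unadjusted r: {r:.2f}")
    print(f"Adjusted for range restriction: {r_range:.2f}")
    print(f"Adjusted for attenuation: {r_disattenuated:.2f}")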
Standard 1.22

When a meta-analysis is used as evidence of the strength of a test-criterion relationship, the test and the criterion variables in the local situation should be comparable with those in the studies summarized. If relevant research includes credible evidence that any other specific features of the testing application may influence the strength of the test-criterion relationship, the correspondence between those features in the local situation and in the meta-analysis should be reported. Any significant disparities that might limit the applicability of the meta-analytic findings to the local situation should be noted explicitly.

Comment: The meta-analysis should incorporate all available studies meeting explicitly stated inclusion criteria. Meta-analytic evidence used in test validation typically is based on a number of tests measuring the same or very similar constructs and criterion measures that likewise measure the same or similar constructs. A meta-analytic study may also be limited to multiple studies of a single test and a single criterion. For each study included in the analysis, the test-criterion relationship is expressed in some common metric, often as an effect size. The strength of the test-criterion relationship may be moderated by features of the situation in which the test and criterion measures were obtained (e.g., types of jobs, characteristics of test takers, time interval separating collection of test and criterion measures, year or decade in which the data were collected). If test-criterion relationships vary according to such moderator variables, then the meta-analysis should report separate estimated effect-size distributions conditional upon levels of these moderator variables when the number of studies available for analysis permits doing so. This might be accomplished, for example, by reporting separate distributions for subsets of studies or by estimating the magnitudes of the influences of situational features on effect sizes.

This standard addresses the responsibilities of the individual who is drawing on meta-analytic evidence to support a test score interpretation for a given use. In some instances, that individual may also be the one conducting the meta-analysis; in other instances, existing meta-analyses are relied on. In the latter instance, the individual drawing on meta-analytic evidence does not have control over how the meta-analysis was conducted or reported, and must evaluate the soundness of the meta-analysis for the setting in question.

Standard 1.23

Any meta-analytic evidence used to support an intended test score interpretation for a given use should be clearly described, including methodological choices in identifying and coding studies, correcting for artifacts, and examining potential moderator variables. Assumptions made in correcting for artifacts such as criterion unreliability and range restriction should be presented, and the consequences of these assumptions made clear.

Comment: The description should include documented information about each study used as input to the meta-analysis, thus permitting evaluation by an independent party. Note also that meta-analysis inevitably involves judgments regarding a number of methodological choices. The bases for these judgments should be articulated. In the case of choices involving some degree of uncertainty, such as artifact corrections based on assumed values, the uncertainty should be acknowledged and the degree to which conclusions about validity hinge on these assumptions should be examined and reported.

As in the case of Standard 1.22, the individual who is drawing on meta-analytic evidence to support a test score interpretation for a given use may or may not also be the one conducting the meta-analysis. As Standard 1.22 addresses the reporting of meta-analytic evidence, the individual drawing on existing meta-analytic evidence must evaluate the soundness of the meta-analysis for the setting in question.

Standard 1.24

If a test is recommended for use in assigning persons to alternative treatments, and if outcomes from those treatments can reasonably be compared on a common criterion, then, whenever feasible, supporting evidence of differential outcomes should be provided.

Comment: If a test is used for classification into alternative occupational, therapeutic, or educational programs, it is not sufficient just to show that the test predicts treatment outcomes. Support for the validity of the classification procedure is provided by showing that the test is useful in determining which persons are likely to profit differentially from one treatment or another. Treatment categories may have to be combined to assemble sufficient cases for statistical analysis. It is recognized, however, that such research may not be feasible, because ethical and legal constraints on differential assignments may forbid control groups.

(f) Evidence Based on Consequences of Tests

Standard 1.25

When unintended consequences result from test use, an attempt should be made to investigate whether such consequences arise from the test's sensitivity to characteristics other than those it is intended to assess or from the test's failure to fully represent the intended construct.


Comment: The validity of test score interpretations may be limited by construct-irrelevant components or construct underrepresentation. When unintended consequences appear to stem, at least in part, from the use of one or more tests, it is especially important to check that these consequences do not arise from construct-irrelevant components or construct underrepresentation. For example, although group differences, in and of themselves, do not call into question the validity of a proposed interpretation, they may increase the salience of plausible rival hypotheses that should be evaluated as part of the validation effort. A finding of unintended consequences may also lead to reconsideration of the appropriateness of the construct in question. Ensuring that unintended consequences are evaluated is the responsibility of those making the decision whether to use a particular test, although legal constraints may limit the test user's discretion to discard the results of a previously administered test, when that decision is based on differences in scores for subgroups of different races, ethnicities, or genders. These issues are discussed further in chapter 3.

2. RELIABILITY/PRECISION AND ERRORS OF MEASUREMENT
BACKGROUND
A test, broadly defined, is a set of tasks or stimuli designed to elicit responses that provide a sample of an examinee's behavior or performance in a specified domain. Coupled with the test is a scoring procedure that enables the scorer to evaluate the behavior or work samples and generate a score. In interpreting and using test scores, it is important to have some indication of their reliability.

The term reliability has been used in two ways in the measurement literature. First, the term has been used to refer to the reliability coefficients of classical test theory, defined as the correlation between scores on two equivalent forms of the test, presuming that taking one form has no effect on performance on the second form. Second, the term has been used in a more general sense, to refer to the consistency of scores across replications of a testing procedure, regardless of how this consistency is estimated or reported (e.g., in terms of standard errors, reliability coefficients per se, generalizability coefficients, error/tolerance ratios, item response theory (IRT) information functions, or various indices of classification consistency). To maintain a link to the traditional notions of reliability while avoiding the ambiguity inherent in using a single, familiar term to refer to a wide range of concepts and indices, we use the term reliability/precision to denote the more general notion of consistency of the scores across instances of the testing procedure, and the term reliability coefficient to refer to the reliability coefficients of classical test theory.

The reliability/precision of measurement is always important. However, the need for precision increases as the consequences of decisions and interpretations grow in importance. If a test score leads to a decision that is not easily reversed, such as rejection or admission of a candidate to a professional school, or a score-based clinical judgment (e.g., in a legal context) that a serious cognitive injury was sustained, a higher degree of reliability/precision is warranted. If a decision can and will be corroborated by information from other sources or if an erroneous initial decision can be easily corrected, scores with more modest reliability/precision may suffice.

Interpretations of test scores generally depend on assumptions that individuals and groups exhibit some degree of consistency in their scores across independent administrations of the testing procedure. However, different samples of performance from the same person are rarely identical. An individual's performances, products, and responses to sets of tasks or test questions vary in quality or character from one sample of tasks to another and from one occasion to another, even under strictly controlled conditions. Different raters may award different scores to a specific performance. All of these sources of variation are reflected in the examinees' scores, which will vary across instances of a measurement procedure.

The reliability/precision of the scores depends on how much the scores vary across replications of the testing procedure, and analyses of reliability/precision depend on the kinds of variability allowed in the testing procedure (e.g., over tasks, contexts, raters) and the proposed interpretation of the test scores. For example, if the interpretation of the scores assumes that the construct being assessed does not vary over occasions, the variability over occasions is a potential source of measurement error. If the test tasks vary over alternate forms of the test, and the observed performances are treated as a sample from a domain of similar tasks, the random variability in scores from one form to another would be considered error. If raters are used to assign scores to responses, the variability in scores over qualified raters is a source of error. Variations in a test taker's scores that are not consistent with the definition of the construct being assessed are attributed to errors of measurement.


A very basic way to evaluate the consistency of scores involves an analysis of the variation in each test taker's scores across replications of the testing procedure. The test is administered and then, after a brief period during which the examinee's standing on the variable being measured would not be expected to change, the test (or a distinct but equivalent form of the test) is administered a second time; it is assumed that the first administration has no influence on the second administration. Given that the attribute being measured is assumed to remain the same for each test taker over the two administrations and that the test administrations are independent of each other, more variation across the two administrations indicates more error in the test scores and therefore lower reliability/precision.

The impact of such measurement errors can be summarized in a number of ways, but typically, in educational and psychological measurement, it is conceptualized in terms of the standard deviation in the scores for a person over replications of the testing procedure. In most testing contexts, it is not possible to replicate the testing procedure repeatedly, and therefore it is not possible to estimate the standard error for each person's score via repeated measurement. Instead, using model-based assumptions, the average error of measurement is estimated over some population, and this average is referred to as the standard error of measurement (SEM). The SEM is an indicator of a lack of consistency in the scores generated by the testing procedure for some population. A relatively large SEM indicates relatively low reliability/precision. The conditional standard error of measurement for a score level is the standard error of measurement at that score level.
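Under classical test theory, the SEM can be estimated from the observed-score standard deviation and a reliability coefficient, as in this sketch with hypothetical values.

    import numpy as np

    # Hypothetical summary statistics for some population of test takers.
    sd_observed = 12.0   # standard deviation of observed scores
    reliability = 0.91   # reliability coefficient for the same population

    # Classical test theory: SEM = SD * sqrt(1 - reliability).
    sem = sd_observed * np.sqrt(1 - reliability)
    print(f"Estimated SEM: {sem:.1f} score points")   # about 3.6 here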
To say that a score includes error implies that there is a hypothetical error-free value that characterizes the variable being assessed. In classical test theory this error-free value is referred to as the person's true score for the test procedure. It is conceptualized as the hypothetical average score over an infinite set of replications of the testing procedure. In statistical terms, a person's true score is an unknown parameter, or constant, and the observed score for the person is a random variable that fluctuates around the true score for the person.

Generalizability theory provides a different framework for estimating reliability/precision. While classical test theory assumes a single distribution for the errors in a test taker's scores, generalizability theory seeks to evaluate the contributions of different sources of error (e.g., items, occasions, raters) to the overall error. The universe score for a person is defined as the expected value over a universe of all possible replications of the testing procedure for the test taker. The universe score of generalizability theory plays a role that is similar to the role of true scores in classical test theory.

Item response theory (IRT) addresses the basic issue of reliability/precision using information functions, which indicate the precision with which observed task/item performances can be used to estimate the value of a latent trait for each test taker. Using IRT, indices analogous to traditional reliability coefficients can be estimated from the item information functions and distributions of the latent trait in some population.

In practice, the reliability/precision of the scores is typically evaluated in terms of various coefficients, including reliability coefficients, generalizability coefficients, and IRT information functions, depending on the focus of the analysis and the measurement model being used. The coefficients tend to have high values when the variability associated with the error is small compared with the observed variation in the scores (or score differences) to be estimated.


Implications for Validity

Although reliability/precision is discussed here as an independent characteristic of test scores, it should be recognized that the level of reliability/precision of scores has implications for validity. Reliability/precision of data ultimately bears on the generalizability or dependability of the scores and/or the consistency of classifications of individuals derived from the scores. To the extent that scores are not consistent across replications of the testing procedure (i.e., to the extent that they reflect random errors of measurement), their potential for accurate prediction of criteria, for beneficial examinee diagnosis, and for wise decision making is limited.
Specifications for Replications of the Testing Procedure

As indicated earlier, the general notion of reliability/precision is defined in terms of consistency over replications of the testing procedure. Reliability/precision is high if the scores for each person are consistent over replications of the testing procedure and is low if the scores are not consistent over replications. Therefore, in evaluating reliability/precision, it is important to be clear about what constitutes a replication of the testing procedure.

Replications involve independent administrations of the testing procedure, such that the attribute being measured would not be expected to change. For example, in assessing an attribute that is not expected to change over an extended period of time (e.g., in measuring a trait), scores generated on two successive days (using different test forms if appropriate) would be considered replications. For a state variable (e.g., mood or hunger), where fairly rapid changes are common, scores generated on two successive days would not be considered replications; the scores obtained on each occasion would be interpreted in terms of the value of the state variable on that occasion. For many tests of knowledge or skill, the administration of alternate forms of a test with different samples of items would be considered replications of the test; for survey instruments and some personality measures, it is expected that the same questions will be used every time the test is administered, and any substantial change in wording would constitute a different test form.

Standardized tests present the same or very similar test materials to all test takers, maintain close adherence to stipulated procedures for test administration, and employ prescribed scoring rules that can be applied with a high degree of consistency. Administering the same questions or commonly scaled questions to all test takers under the same conditions promotes fairness and facilitates comparisons of scores across individuals. Conditions of observation that are fixed or standardized for the testing procedure remain the same across replications. However, some aspects of any standardized testing procedure will be allowed to vary. The time and place of testing, as well as the persons administering the test, are generally allowed to vary to some extent. The particular tasks included in the test may be allowed to vary (as samples from a common content domain), and the persons who score the results can vary over some set of qualified scorers.

Alternate forms (or parallel forms) of a standardized test are designed to have the same general distribution of content and item formats (as described, for example, in detailed test specifications), the same administrative procedures, and at least approximately the same score means and standard deviations in some specified population or populations. Alternate forms of a test are considered interchangeable, in the sense that they are built to the same specifications, and are interpreted as measures of the same construct.

In classical test theory, strictly parallel tests are assumed to measure the same construct and to yield scores that have the same means and standard deviations in the populations of interest and have the same correlations with all other variables. A classical reliability coefficient is defined in terms of the correlation between scores from strictly parallel forms of the test, but it is estimated in terms of the correlation between alternate forms of the test that may not quite be strictly parallel.


Different approaches to the estimation of reliability/precision can be implemented to fit different data-collection designs and different interpretations and uses of scores. In some cases, it may be feasible to estimate the variability over replications directly (e.g., by having a number of qualified raters evaluate a sample of test performances for each test taker). In other cases, it may be necessary to use less direct estimates of the reliability coefficient. For example, internal-consistency estimates of reliability (e.g., split-halves coefficient, KR-20, coefficient alpha) use the observed extent of agreement between different parts of one test to estimate the reliability associated with form-to-form variability. For the split-halves method, scores on two more-or-less parallel halves of the test (e.g., odd-numbered items and even-numbered items) are correlated, and the resulting half-test reliability coefficient is statistically adjusted to estimate reliability for the full-length test. However, when a test is designed to reflect rate of work, internal-consistency estimates of reliability (particularly by the odd-even method) are likely to yield inflated estimates of reliability for highly speeded tests.
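A sketch of two of the internal-consistency estimates just mentioned, computed on a simulated 0/1 item-response matrix: the odd-even split-half coefficient with the usual Spearman-Brown adjustment to full length, and coefficient alpha.

    import numpy as np

    rng = np.random.default_rng(3)

    # Simulated dichotomous responses: 200 examinees by 20 items.
    ability = rng.normal(size=(200, 1))
    difficulty = rng.normal(size=(1, 20))
    responses = (ability - difficulty + rng.normal(size=(200, 20)) > 0).astype(int)

    # Split-half: correlate odd-item and even-item half scores, then apply
    # the Spearman-Brown adjustment to estimate full-length reliability.
    odd = responses[:, 0::2].sum(axis=1)
    even = responses[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]
    split_half = 2 * r_half / (1 + r_half)

    # Coefficient alpha from item variances and total-score variance.
    k = responses.shape[1]
    item_var = responses.var(axis=0, ddof=1).sum()
    total_var = responses.sum(axis=1).var(ddof=1)
    alpha = (k / (k - 1)) * (1 - item_var / total_var)

    print(f"Split-half (Spearman-Brown): {split_half:.2f}")
    print(f"Coefficient alpha: {alpha:.2f}")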
In some cases, it may be reasonable to assume that a potential source of variability is likely to be negligible or that the user will be able to infer adequate reliability from other types of evidence. For example, if test scores are used mainly to predict some criterion scores and the test does an acceptable job in predicting the criterion, it can be inferred that the test scores are reliable/precise enough for their intended use.

The definition of what constitutes a standardized test or measurement procedure has broadened significantly over the last few decades. Various kinds of performance assessments, simulations, and portfolio-based assessments have been developed to provide measures of constructs that might otherwise be difficult to assess. Each step toward greater flexibility in the assessment procedures enlarges the scope of the variations allowed in replications of the testing procedure, and therefore tends to increase the measurement error. However, some of these sacrifices in reliability/precision may reduce construct irrelevance or construct underrepresentation and thereby improve the validity of the intended interpretations of the scores. For example, performance assessments that depend on ratings of extended responses tend to have lower reliability than more structured assessments (e.g., multiple-choice or short-answer tests), but they can sometimes provide more direct measures of the attribute of interest.

Random errors of measurement are viewed as unpredictable fluctuations in scores. They are conceptually distinguished from systematic errors, which may also affect the performances of individuals or groups but in a consistent rather than a random manner. For example, an incorrect answer key would contribute systematic error, as would differences in the difficulty of test forms that have not been adequately equated or linked; examinees who take one form may receive higher scores on average than if they had taken the other form. Such systematic errors would not generally be included in the standard error of measurement, and they are not regarded as contributing to a lack of reliability/precision. Rather, systematic errors constitute construct-irrelevant factors that reduce validity but not reliability/precision.

Important sources of random error may be grouped in two broad categories: those rooted within the test takers and those external to them. Fluctuations in the level of an examinee's motivation, interest, or attention and the inconsistent application of skills are clearly internal sources that may lead to random error. Variations in testing conditions (e.g., time of day, level of distractions) and variations in scoring due to scorer subjectivity are examples of external sources that may lead to random error. The importance of any particular source of variation depends on the specific conditions under which the measures are taken, how performances are scored, and the interpretations derived from the scores.

Some changes in scores from one occasion to another are not regarded as error (random or systematic), because they result, in part, from changes in the construct being measured (e.g., due to learning or maturation that has occurred between the initial and final measures). In such cases, the changes in performance would constitute the phenomenon of interest and would not be considered errors of measurement.

Measurement error reduces the usefulness of test scores. It limits the extent to which test results can be generalized beyond the particulars of a given replication of the testing procedure. It reduces the confidence that can be placed in the results from any single measurement and therefore the reliability/precision of the scores. Because random measurement errors are unpredictable, they cannot be removed from observed scores. However, their aggregate magnitude can be summarized in several ways, as discussed below, and they can be controlled to some extent (e.g., by standardization or by averaging over multiple scores).


The standard error of measurement, as such, provides an indication of the expected level of random error over score points and replications for a specific population. In many cases, it is useful to have estimates of the standard errors for individual examinees (or for examinees with scores in certain score ranges). These conditional standard errors are difficult to estimate directly, but can be estimated indirectly. For example, the test information functions based on IRT models can be used to estimate standard errors for different values of a latent ability parameter and/or for different observed scores. In using any of these model-based estimates of conditional standard errors, it is important that the model assumptions be consistent with the data.
in independent testing sessions ( alternate-form co
Evaluati n g Reliability/Precision efficients); ( b) coefficients obtained by administration
of the same form on separate occasions ( test-retest
The ideal approach to the evaluation of reliability/pre coefficients); and ( c) coefficients based on the rela
cision would require many independent replications tionships/interactions among scores derived from
of the testing procedure on a large sample of test individual items or subsets of the items within a
takers.The range of differences allowed in replications test, all data accruing from a single administration
of the testing procedure and the proposed inter ( internal-consistency coefficients).In addition, where
pretation of the scores provide a framework for in test scoring involves a high level of judgment,
vestigating reliability/precision. indices of scorer consistency are commonly obtained.
For most testing programs, scores are expected In formaltreatments of classical test theory, reliability
to generalize over alternate forms of the test, oc can be defined as the ratio of true-score variance to
casions ( within some period), testing contexts, observed score variance, but it is estimated in terms
and raters ( if judgment is required in scoring).To of reliability coefficients of the kinds mentioned
the extent that the impact of any of these sources above.
of variability is expected to be substantial, the In generalizability theory, these different reli
variability should be estimated in some way. It is ability analyses are treated as special cases of a
not necessary that the different sources of variance more general framework for estimating error vari
be estimated separately.The overall reliability/pre ance in terms of the variance components associated
cision, given error variance due to the sampling with different sources of error. A generalizability
of forms, occasions, and raters, can be estimated coefficient is defined as the ratio of universe score
through a test-retest study involving different variance to observed score variance.Unlike tradi
forms administered on different occasions and tional approaches to the study of reliability, gen
scored by different raters. eralizability theory encourages the researcher to
The interpretation of reliability/precision analy specify and estimate components of true score
ses depends on the population being tested. For variance, error score variance, and observed score
example, reliability or generalizability coefficients variance, and to calculate coefficients based on
derived from scores of a nationally representative these estimates. Estimation is typically accomplished
sample may differ significantly from those obtained by the application of analysis-of-variance techniques.
from a more homogeneous sample drawn from The separate numerical estimates of the components
one gender, one ethnic group, or one community. of variance ( e.g., variance components for items,


Reliability/Generalizability Coefficients

In classical test theory, the consistency of test scores is evaluated mainly in terms of reliability coefficients, defined in terms of the correlation between scores derived from replications of the testing procedure on a sample of test takers. Three broad categories of reliability coefficients are recognized: (a) coefficients derived from the administration of alternate forms in independent testing sessions (alternate-form coefficients); (b) coefficients obtained by administration of the same form on separate occasions (test-retest coefficients); and (c) coefficients based on the relationships/interactions among scores derived from individual items or subsets of the items within a test, all data accruing from a single administration (internal-consistency coefficients). In addition, where test scoring involves a high level of judgment, indices of scorer consistency are commonly obtained. In formal treatments of classical test theory, reliability can be defined as the ratio of true-score variance to observed score variance, but it is estimated in terms of reliability coefficients of the kinds mentioned above.

In generalizability theory, these different reliability analyses are treated as special cases of a more general framework for estimating error variance in terms of the variance components associated with different sources of error. A generalizability coefficient is defined as the ratio of universe score variance to observed score variance. Unlike traditional approaches to the study of reliability, generalizability theory encourages the researcher to specify and estimate components of true score variance, error score variance, and observed score variance, and to calculate coefficients based on these estimates. Estimation is typically accomplished by the application of analysis-of-variance techniques. The separate numerical estimates of the components of variance (e.g., variance components for items, occasions, and raters, and for the interactions among these potential sources of error) can be used to evaluate the contribution of each source of error to the overall measurement error; the variance-component estimates can be helpful in identifying an effective strategy for controlling overall error variance.
Different reliability (and generalizability) coefficients may appear to be interchangeable, but the different coefficients convey different information. A coefficient may encompass one or more sources of error. For example, a coefficient may reflect error due to scorer inconsistencies but not reflect the variation over an examinee's performances or products. A coefficient may reflect only the internal consistency of item responses within an instrument and fail to reflect measurement error associated with day-to-day changes in examinee performance.

It should not be inferred, however, that alternate-form or test-retest coefficients based on test administrations several days or weeks apart are always preferable to internal-consistency coefficients. In cases where we can assume that scores are not likely to change, based on past experience and/or theoretical considerations, it may be reasonable to assume invariance over occasions (without conducting a test-retest study). Another limitation of test-retest coefficients is that, when the same form of the test is used, the correlation between the first and second scores could be inflated by the test taker's recall of initial responses.
The test information function, an important result of IRT, summarizes how well the test discriminates among individuals at various levels of ability on the trait being assessed. Under the IRT conceptualization for dichotomously scored items, the item characteristic curve or item response function is used as a model to represent the increasing proportion of correct responses to an item at increasing levels of the ability or trait being measured. Given appropriate data, the parameters of the characteristic curve for each item in a test can be estimated. The test information function can then be calculated from the parameter estimates for the set of items in the test and can be used to derive coefficients with interpretations similar to reliability coefficients. The information function may be viewed as a mathematical statement of the precision of measurement at each level of the given trait. The IRT information function is based on the results obtained on a specific occasion or in a specific context, and therefore it does not provide an indication of generalizability over occasions or contexts.
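For the two-parameter logistic model, the item information function has a closed form, so the test information function can be sketched directly from item parameter estimates; the parameters below are hypothetical, and the conditional standard error at each trait level is the reciprocal square root of the test information.

    import numpy as np

    # Hypothetical 2PL item parameters: discrimination (a) and difficulty (b).
    a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])
    b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])

    def test_information(theta):
        """Sum of 2PL item information values I(theta) = a^2 * P * (1 - P)."""
        p = 1 / (1 + np.exp(-a * (theta - b)))
        return np.sum(a**2 * p * (1 - p))

    # Precision varies over the trait scale: SE(theta) = 1 / sqrt(I(theta)).
    for theta in (-2.0, 0.0, 2.0):
        info = test_information(theta)
        print(f"theta {theta:+.1f}: information {info:.2f}, SE {1/np.sqrt(info):.2f}")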


Coefficients (e.g., reliability, generalizability, and IRT-based coefficients) have two major advantages over standard errors. First, as indicated above, they can be used to estimate standard errors (overall and/or conditional) in cases where it would not be possible to do so directly. Second, coefficients (e.g., reliability and generalizability coefficients), which are defined in terms of ratios of variances for scores on the same scale, are invariant over linear transformations of the score scale and can be useful in comparing different testing procedures based on different scales. However, such comparisons are rarely straightforward, because they can depend on the variability of the groups on which the coefficients are based, the techniques used to obtain the coefficients, the sources of error reflected in the coefficients, and the lengths and contents of the instruments being compared.

Factors Affecting Reliability/Precision

A number of factors can have significant effects on reliability/precision, and in some cases, these factors can lead to misinterpretations of the results, if not taken into account.

First, any evaluation of reliability/precision applies to a particular assessment procedure and is likely to change if the procedure is changed in any substantial way. In general, if the assessment is shortened (e.g., by decreasing the number of items or tasks), the reliability is likely to decrease; and if the assessment is lengthened with comparable tasks or items, the reliability is likely to increase. In fact, lengthening the assessment, and thereby increasing the size of the sample of tasks/items (or raters or occasions) being employed, is an effective and commonly used method for improving reliability/precision.
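The classical Spearman-Brown formula quantifies this length effect; the sketch below projects, for a hypothetical starting reliability of .80, the coefficient expected when the test is halved or doubled in length.

    def spearman_brown(reliability, length_factor):
        """Projected reliability when test length is multiplied by length_factor."""
        return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

    # Hypothetical test with reliability .80 at its current length.
    for factor in (0.5, 1.0, 2.0):
        print(f"length x {factor}: projected reliability "
              f"{spearman_brown(0.80, factor):.2f}")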
Second, if the variability associated with raters is estimated for a select group of raters who have been especially well trained (and were perhaps involved in the development of the procedures), but raters are not as well trained in some operational contexts, the error associated with rater variability in these operational settings may be much higher than is indicated by the reported interrater reliability coefficients. Similarly, if raters are still refining their performance in the early days of an extended scoring window, the error associated with rater variability may be greater for examinees testing early in the window than for examinees who test later.

Reliability/precision can also depend on the population for which the procedure is being used. In particular, if variability in the construct of interest in the population for which scores are being generated is substantially different from what it is in the population for which reliability/precision was evaluated, the reliability/precision can be quite different in the two populations. When the variability in the construct being measured is low, reliability and generalizability coefficients tend to be small, and when the variability in the construct being measured is higher, the coefficients tend to be larger. Standard errors of measurement are less dependent than reliability and generalizability coefficients on the variability in the sample of test takers.

In addition, reliability/precision can vary from one population to another, even if the variability in the construct of interest in the two populations is the same. The reliability can vary from one population to another because particular sources of error (rater effects, familiarity with formats and instructions, etc.) have more impact in one population than they do in the other. In general, if any aspects of the assessment procedures or the population being assessed are changed in an operational setting, the reliability/precision may change.
Standard Errors of Measurement

The standard error of measurement can be used to generate confidence intervals around reported scores. It is therefore generally more informative than a reliability or generalizability coefficient, once a measurement procedure has been adopted and the interpretation of scores has become the user's primary concern.
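A minimal sketch of such an interval, assuming approximately normal measurement errors; the reported score and SEM are hypothetical.

    # Hypothetical reported score and SEM; 1.96 gives an approximate 95% band
    # under the usual normality assumption for measurement errors.
    score, sem = 74.0, 3.6
    lower, upper = score - 1.96 * sem, score + 1.96 * sem
    print(f"Observed {score:.0f}, 95% interval {lower:.1f} to {upper:.1f}")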


Estimates of the standard errors at different score levels (that is, conditional standard errors) are usually a valuable supplement to the single statistic for all score levels combined. Conditional standard errors of measurement can be much more informative than a single average standard error for a population. If decisions are based on test scores and these decisions are concentrated in one area or a few areas of the score scale, then the conditional errors in those areas are of special interest.

Like reliability and generalizability coefficients, standard errors may reflect variation from many sources of error or only a few. A more comprehensive standard error (i.e., one that includes the most relevant sources of error, given the definition of the testing procedure and the proposed interpretation) tends to be more informative than a less comprehensive standard error. However, practical constraints often preclude the kinds of studies that would yield information on all potential sources of error, and in such cases, it is most informative to evaluate the sources of error that are likely to have the greatest impact.

Interpretations of test scores may be broadly categorized as relative or absolute. Relative interpretations convey the standing of an individual or group within a reference population. Absolute interpretations relate the status of an individual or group to defined performance standards. The standard error is not the same for the two types of interpretations. Any source of error that is the same for all individuals does not contribute to the relative error but may contribute to the absolute error.

Traditional norm-referenced reliability coefficients were developed to evaluate the precision with which test scores estimate the relative standing of examinees on some scale, and they evaluate reliability/precision in terms of the ratio of true-score variance to observed-score variance. As the range of uses of test scores has expanded and the contexts of use have been extended (e.g., diagnostic categorization, the evaluation of educational programs), the range of indices that are used to evaluate reliability/precision has also grown to include indices for various kinds of change scores and difference scores, indices of decision consistency, and indices appropriate for evaluating the precision of group means.

Some indices of precision, especially standard errors and conditional standard errors, also depend on the scale in which they are reported. An index stated in terms of raw scores or the trait-level estimates of IRT may convey a very different perception of the error if restated in terms of scale scores. For example, for the raw-score scale, the conditional standard error may appear to be high at one score level and low at another, but when the conditional standard errors are restated in units of scale scores, quite different trends in comparative precision may emerge.
Decision Consistency

Where the purpose of measurement is classification, some measurement errors are more serious than others. Test takers who are far above or far below the cut score established for pass/fail or for eligibility for a special program can have considerable error in their observed scores without any effect on their classification decisions. Errors of measurement for examinees whose true scores are close to the cut score are more likely to lead to classification errors. The choice of techniques used to quantify reliability/precision should take these circumstances into account. This can be done by reporting the conditional standard error in the vicinity of the cut score or the decision consistency/accuracy indices (e.g., percentage of correct decisions, Cohen's kappa), which vary as functions of both score reliability/precision and the location of the cut score.
Decision consistency refers to the extent to which the observed classifications of examinees would be the same across replications of the testing procedure. Decision accuracy refers to the extent to which observed classifications of examinees based on the results of a single replication would agree with their true classification status. Statistical methods are available to calculate indices for both decision consistency and decision accuracy. These methods evaluate the consistency or accuracy of classifications rather than the consistency in scores per se. Note that the degree of consistency or agreement in examinee classification is specific to the cut score employed and its location within the score distribution.
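A sketch of a decision-consistency analysis: pass/fail classifications from two simulated replications are compared, and both raw agreement and Cohen's kappa are computed at a hypothetical cut score.

    import numpy as np

    rng = np.random.default_rng(5)

    # Simulated scores from two replications of a testing procedure.
    true = rng.normal(70, 10, size=1000)
    form1 = true + rng.normal(0, 4, size=1000)
    form2 = true + rng.normal(0, 4, size=1000)

    # Classify against a hypothetical cut score of 65.
    cut = 65
    pass1, pass2 = form1 >= cut, form2 >= cut

    # Observed agreement and chance-corrected agreement (Cohen's kappa).
    p_agree = np.mean(pass1 == pass2)
    p_chance = (np.mean(pass1) * np.mean(pass2)
                + np.mean(~pass1) * np.mean(~pass2))
    kappa = (p_agree - p_chance) / (1 - p_chance)

    print(f"Decision consistency (raw agreement): {p_agree:.2f}")
    print(f"Cohen's kappa: {kappa:.2f}")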


Reliability/Precision of Group Means

To the extent that different samples of persons (e.g., successive groups of students in a program, or groups of students who use certain educational materials) yield different results, conclusions about the expected outcome over all students in the group (including those who might join the group in the future) are uncertain. For large samples, the variability due to the sampling of persons in the estimates of the group means may be quite small. However, in cases where the samples of persons are not very large (e.g., in evaluating the mean achievement of students in a single classroom or the average expressed satisfaction of samples of clients in a clinical program), the error associated with the sampling of persons may be a major component of overall error. It can be a significant source of error in inferences about programs even if there is a high degree of precision in individual test scores. Standard errors for individual scores are not appropriate measures of the precision of group averages. A more appropriate statistic is the standard error for the estimates of the group means.
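As an illustration of the distinction just drawn (a sketch, not part of the Standards' text), the following Python fragment contrasts the standard error of a group mean, computed under the assumption that the tested group is a simple random sample of persons, with the standard error of measurement for an individual score. The scores and the assumed reliability of .90 are invented.

# Illustrative sketch (not from the Standards): the standard error of a
# group mean, treating the tested group as a simple random sample of
# persons. Observed-score variance already includes individual
# measurement error, so this standard error reflects both the sampling
# of persons and measurement error. Data are invented.
import math

scores = [512, 487, 530, 498, 505, 521, 476, 509, 515, 493]  # one classroom
n = len(scores)
mean = sum(scores) / n
# Sample standard deviation of the observed scores.
sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

se_mean = sd / math.sqrt(n)                 # standard error of the group mean
sem_individual = sd * math.sqrt(1 - 0.90)   # individual SEM if reliability = .90 (assumed)

print(f"group mean = {mean:.1f}, SE(mean) = {se_mean:.2f}")
print(f"individual SEM (assumed reliability .90) = {sem_individual:.2f}")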
Documenting Reliability/Precision

Typically, developers and distributors of tests have primary responsibility for obtaining and reporting evidence for reliability/precision (e.g., appropriate standard errors, reliability or generalizability coefficients, or test information functions). The test user must have such data to make an informed choice among alternative measurement approaches and will generally be unable to conduct adequate reliability/precision studies prior to operational use of an instrument.

In some instances, however, local users of a test or assessment procedure must accept at least partial responsibility for documenting the precision of measurement. This obligation holds when one of the primary purposes of measurement is to classify students using locally developed performance standards, or to rank examinees within the local population. It also holds when users must rely on local scorers who are trained to use the scoring rubrics provided by the test developer. In such settings, local factors may materially affect the magnitude of error variance and observed score variance. Therefore, the reliability/precision of scores may differ appreciably from that reported by the developer.

Reported evaluations of reliability/precision should identify the potential sources of error for the testing program, given the proposed uses of the scores. These potential sources of error can then be evaluated in terms of previously reported research, new empirical studies, or analyses of the reasons for assuming that a potential source of error is likely to be negligible and therefore can be ignored.

The reporting of indices of reliability/precision alone, with little detail regarding the methods used to estimate the indices reported, the nature of the group from which the data were derived, and the conditions under which the data were obtained, constitutes inadequate documentation. General statements to the effect that a test is "reliable" or that it is "sufficiently reliable to permit interpretations of individual scores" are rarely, if ever, acceptable. It is the user who must take responsibility for determining whether scores are sufficiently trustworthy to justify anticipated interpretations for particular uses. Nevertheless, test constructors and publishers are obligated to provide sufficient data to make informed judgments possible.

If scores are to be used for classification, indices of decision consistency are useful in addition to estimates of the reliability/precision of the scores. If group means are likely to play a substantial role in the use of the scores, the reliability/precision of these mean scores should be reported.

As the foregoing comments emphasize, there is no single, preferred approach to quantification of reliability/precision. No single index adequately conveys all of the relevant information. No one method of investigation is optimal in all situations, nor is the test developer limited to a single approach for any instrument. The choice of estimation techniques and the minimum acceptable level for any index remain a matter of professional judgment.
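The point that no single index suffices can be made concrete with a small computation (again, a sketch rather than anything prescribed here): coefficient alpha, and the standard error of measurement derived from it, summarize only internal-consistency error and would need to be supplemented by, for example, test-retest or generalizability analyses to reflect other sources of error. The simulated item-score matrix below is invented.

# Illustrative sketch (not from the Standards): two different summaries
# of reliability/precision computed from the same (invented) item-score
# matrix -- coefficient alpha, an internal-consistency estimate, and the
# standard error of measurement derived from it. Neither index reflects
# day-to-day or rater variability.
import numpy as np

rng = np.random.default_rng(0)
true = rng.normal(0, 1, size=200)                    # latent trait, 200 examinees
items = true[:, None] + rng.normal(0, 1, (200, 8))   # 8 roughly parallel items

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)
total = items.sum(axis=1)
total_var = total.var(ddof=1)

# Coefficient alpha: (k / (k - 1)) * (1 - sum of item variances / total variance)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
sem = np.sqrt(total_var) * np.sqrt(1 - alpha)        # SEM in total-score units

print(f"coefficient alpha = {alpha:.3f}")
print(f"SEM = {sem:.2f} raw-score points")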

STANDARDS FOR RELIABILITY/PRECISION

The standards in this chapter begin with an overarching standard (numbered 2.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into eight thematic clusters labeled as follows:

1. Specifications for Replications of the Testing Procedure
2. Evaluating Reliability/Precision
3. Reliability/Generalizability Coefficients
4. Factors Affecting Reliability/Precision
5. Standard Errors of Measurement
6. Decision Consistency
7. Reliability/Precision of Group Means
8. Documenting Reliability/Precision

Standard 2.0

Appropriate evidence of reliability/precision should be provided for the interpretation for each intended score use.

Comment: The form of the evidence (reliability or generalizability coefficient, information function, conditional standard error, index of decision consistency) for reliability/precision should be appropriate for the intended uses of the scores, the population involved, and the psychometric models used to derive the scores. A higher degree of reliability/precision is required for score uses that have more significant consequences for test takers. Conversely, a lower degree may be acceptable where a decision based on the test score is reversible or dependent on corroboration from other sources of information.

Cluster 1. Specifications for Replications of the Testing Procedure

Standard 2.1

The range of replications over which reliability/precision is being evaluated should be clearly stated, along with a rationale for the choice of this definition, given the testing situation.

Comment: For any testing program, some aspects of the testing procedure (e.g., time limits and availability of resources such as books, calculators, and computers) are likely to be fixed, and some aspects will be allowed to vary from one administration to another (e.g., specific tasks or stimuli, testing contexts, raters, and, possibly, occasions). Any test administration that maintains fixed conditions and involves acceptable samples of the conditions that are allowed to vary would be considered a legitimate replication of the testing procedure. As a first step in evaluating the reliability/precision of the scores obtained with a testing procedure, it is important to identify the range of conditions of various kinds that are allowed to vary, and over which scores are to be generalized.

Standard 2.2

The evidence provided for the reliability/precision of the scores should be consistent with the domain of replications associated with the testing procedures, and with the intended interpretations for use of the test scores.

Comment: The evidence for reliability/precision should be consistent with the design of the testing procedures and with the proposed interpretations for use of the test scores. For example, if the test can be taken on any of a range of occasions, and the interpretation presumes that the scores are invariant over these occasions, then any variability in scores over these occasions is a potential source of error. If the tasks or
stimuli are allowed to vary over alternate forms of the test, and the observed performances are treated as a sample from a domain of similar tasks, the variability in scores from one form to another would be considered error. If raters are used to assign scores to responses, the variability in scores over qualified raters is a source of error. Different sources of error can be evaluated in a single coefficient or standard error, or they can be evaluated separately, but they should all be addressed in some way. Reports of reliability/precision should specify the potential sources of error included in the analyses.

Cluster 2. Evaluating Reliability/Precision

Standard 2.3

For each total score, subscore, or combination of scores that is to be interpreted, estimates of relevant indices of reliability/precision should be reported.

Comment: It is not sufficient to report estimates of reliabilities and standard errors of measurement only for total scores when subscores are also interpreted. The form-to-form and day-to-day consistency of total scores on a test may be acceptably high, yet subscores may have unacceptably low reliability, depending on how they are defined and used. Users should be supplied with reliability data for all scores to be interpreted, and these data should be detailed enough to enable the users to judge whether the scores are precise enough for the intended interpretations for use. Composites formed from selected subtests within a test battery are frequently proposed for predictive and diagnostic purposes. Users need information about the reliability of such composites.

Standard 2.4

When a test score interpretation emphasizes differences between two observed scores of an individual or two averages of a group, reliability/precision data, including standard errors, should be provided for such differences.

Comment: Observed score differences are used for a variety of purposes. Achievement gains are frequently of interest for groups as well as individuals. In some cases, the reliability/precision of change scores can be much lower than the reliabilities of the separate scores involved. Differences between verbal and performance scores on tests of intelligence and scholastic ability are often employed in the diagnosis of cognitive impairment and learning problems. Psychodiagnostic inferences are frequently drawn from the differences between subtest scores. Aptitude and achievement batteries, interest inventories, and personality assessments are commonly used to identify and quantify the relative strengths and weaknesses, or the pattern of trait levels, of a test taker. When the interpretation of test scores centers on the peaks and valleys in the examinee's test score profile, the reliability of score differences is critical.
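The caution about difference scores can be made concrete with the classical formula for the reliability of a difference (an illustrative sketch; the input values below are invented):

# Illustrative sketch (not from the Standards): classical reliability of
# a difference score D = X - Y. Even when X and Y are each measured
# reliably, the difference can be much less reliable if X and Y are
# highly correlated. Inputs here are invented.

def difference_reliability(rxx, ryy, rxy, sx, sy):
    """Reliability of X - Y under classical test theory."""
    true_var = sx**2 * rxx + sy**2 * ryy - 2 * rxy * sx * sy
    obs_var = sx**2 + sy**2 - 2 * rxy * sx * sy
    return true_var / obs_var

# Two subtests, each reliable (.85), correlated .75, with equal spreads.
print(difference_reliability(rxx=0.85, ryy=0.85, rxy=0.75, sx=10, sy=10))
# -> 0.40: the difference score is far less reliable than either score.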


Standard 2.5

Reliability estimation procedures should be consistent with the structure of the test.

Comment: A single total score can be computed on tests that are multidimensional. The total score on a test that is substantially multidimensional should be treated as a composite score. If an internal-consistency estimate of total score reliability is obtained by the split-halves procedure, the halves should be comparable in content and statistical characteristics.

In adaptive testing procedures, the set of tasks included in the test and the sequencing of tasks are tailored to the test taker, using model-based algorithms. In this context, reliability/precision can be estimated using simulations based on the model. For adaptive testing, model-based conditional standard errors may be particularly useful and appropriate in evaluating the technical adequacy of the procedure.
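As an illustration of the split-halves procedure mentioned above (a sketch with simulated, invented data, not a prescribed method), a half-test correlation can be stepped up to full length with the Spearman-Brown formula, assuming the halves are comparable:

# Illustrative sketch (not from the Standards): a split-half reliability
# estimate with the Spearman-Brown correction to full length. The item
# matrix is invented.
import numpy as np

rng = np.random.default_rng(1)
true = rng.normal(0, 1, 300)
items = true[:, None] + rng.normal(0, 1.2, (300, 10))

odd_half = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
even_half = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...

r_half = np.corrcoef(odd_half, even_half)[0, 1]
# Spearman-Brown: reliability of the full-length (doubled) test.
r_full = 2 * r_half / (1 + r_half)

print(f"half-test correlation = {r_half:.3f}")
print(f"Spearman-Brown full-length estimate = {r_full:.3f}")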

Cluster 3 . Reliability/G eneralizabil ity time period of trait stability. Information should
Coefficients be provided on the qualifications and training of
the judges used in reliability studies.Interrater or
interobserver agreement may be particularly im
Standard 2.6 portant for ratings and observational data that in
volve subtle discriminations. It should be noted,
A reliability or generalizability coefficient (or
however, that when raters evaluate positively cor
standard error) that addresses one kind of vari
related characteristics, a favorable or unfavorable
ability should not be interpreted as interchangeable
assessment of one trait may color their opinions
with indices that address other kinds of variability,
of other traits.Moreover, high interrater consistency
unless their definitions of measurement error
does not imply high examinee consistency from
can be considered equivalent.
task to task.Therefore, interrater agreement does
Comment: Internal-consistency, alternate-form, not guarantee high reliability of examinee scores.
and test-retest coefficients should not be considered
equivalent, as each incorporates a unique definition Cluster 4. Factors Affecting
of measurement error. Error variances derived via
item response theory are generally not equivalent
Reliabil ity/Precision
to error variances estimated via other approaches.
Test developers should state the sources of error Standard 2.8
that are reflected in, and those that are ignored
When constructed-response tests are scored locally,
by, the reported reliabiliry or generalizability co
reliability/precision data should be gathered and
efficients.
reported for the local scoring when adequate
size samples are available.
Standard 2.7
Comment: For example, many statewide testing
When subjective judgment enters into test scoring, programs depend on local scoring of essays, con
evidence should be provided on both interrater structed-response exercises, and performance tasks.
consistency in scoring and within-examinee con Reliability/precision analyses can indicate that ad
sistency over repeated measurements. A clear dis ditional training of scorers is needed and, hence,
tinction should be made among reliability data should be an integral part of program monitoring.
based on (a) independent panels of raters scoring Reliability/precision data should be released only
the same performances or products, (b) a single when sufficient to yield statistically sound results
panel scoring successive performances or new and consistent with applicable privacy obligations.
products, and (c) independent panels scoring
successive performances or new products. Standard 2.9
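The variance-component approach mentioned in the comment can be sketched for the simplest case, a fully crossed persons-by-raters design (illustrative only; the simulated ratings and effect sizes are invented, and a real generalizability study would typically also include tasks and occasions as facets):

# Illustrative sketch (not from the Standards): estimating variance
# components in a fully crossed persons x raters design via a two-way
# random-effects ANOVA decomposition. Ratings are invented.
import numpy as np

rng = np.random.default_rng(2)
n_p, n_r = 50, 4
person = rng.normal(0, 1.0, (n_p, 1))    # person effects
rater = rng.normal(0, 0.3, (1, n_r))     # rater severity/leniency effects
ratings = 3 + person + rater + rng.normal(0, 0.5, (n_p, n_r))

grand = ratings.mean()
ss_p = n_r * ((ratings.mean(axis=1) - grand) ** 2).sum()
ss_r = n_p * ((ratings.mean(axis=0) - grand) ** 2).sum()
ss_res = ((ratings - grand) ** 2).sum() - ss_p - ss_r

ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_res = ss_res / ((n_p - 1) * (n_r - 1))

var_res = ms_res                   # person-x-rater interaction plus error
var_p = (ms_p - ms_res) / n_r      # persons (the "true score" facet)
var_r = (ms_r - ms_res) / n_p      # raters (a source of error)
print(f"persons: {var_p:.3f}  raters: {var_r:.3f}  residual: {var_res:.3f}")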
Cluster 4. Factors Affecting Reliability/Precision

Standard 2.8

When constructed-response tests are scored locally, reliability/precision data should be gathered and reported for the local scoring when adequate size samples are available.

Comment: For example, many statewide testing programs depend on local scoring of essays, constructed-response exercises, and performance tasks. Reliability/precision analyses can indicate that additional training of scorers is needed and, hence, should be an integral part of program monitoring. Reliability/precision data should be released only when sufficient to yield statistically sound results and consistent with applicable privacy obligations.

Standard 2.9

When a test is available in both long and short versions, evidence for reliability/precision should be reported for scores on each version, preferably based on independent administration(s) of each version with independent samples of test takers.

Comment: The reliability/precision of scores on each version is best evaluated through an independent administration of each, using the designated time limits. Psychometric models can be used to estimate the reliability/precision of a shorter (or
longer) version of an existing test, based on data from an administration of the existing test. However, these models generally make assumptions that may not be met (e.g., that the items in the existing test and the items to be added or dropped are all randomly sampled from a single domain). Context effects are commonplace in tests of maximum performance, and the short version of a standardized test often comprises a nonrandom sample of items from the full-length version. As a result, the predicted value of the reliability/precision may not provide a very good estimate of the actual value, and therefore, where feasible, the reliability/precision of both forms should be evaluated directly and independently.
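For concreteness (an illustrative sketch, not an endorsement of projection over direct evaluation), the model-based projection discussed above is often the Spearman-Brown formula; the reliability value and test lengths below are invented:

# Illustrative sketch (not from the Standards): the Spearman-Brown
# formula used to project the reliability of a shortened test version.
# As noted above, the projection assumes the retained items are a
# random sample from the same domain, which often does not hold for
# published short forms, so direct evaluation is preferred.

def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

full_form = 0.91  # reliability reported for a 60-item form (invented)
short_form = spearman_brown(full_form, 30 / 60)
print(f"projected reliability of a 30-item short form: {short_form:.3f}")
# -> about 0.835, likely optimistic if the short form is not a random
#    sample of items from the full form.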
Standard 2.10

When significant variations are permitted in tests or test administration procedures, separate reliability/precision analyses should be provided for scores produced under each major variation if adequate sample sizes are available.

Comment: To make a test accessible to all examinees, test publishers or users might authorize, or might be legally required to authorize, accommodations or modifications in the procedures that are specified for the administration of a test. For example, audio or large print versions may be used for test takers who are visually impaired. Any alteration in standard testing materials or procedures may have an impact on the reliability/precision of the resulting scores, and therefore, to the extent feasible, the reliability/precision should be examined for all versions of the test and testing procedures.

Standard 2.11

Test publishers should provide estimates of reliability/precision as soon as feasible for each relevant subgroup for which the test is recommended.

Comment: Reporting estimates of reliability/precision for relevant subgroups is useful in many contexts, but it is especially important if the interpretation of scores involves within-group inferences (e.g., in terms of subgroup norms). For example, test users who work with a specific linguistic and cultural subgroup or with individuals who have a particular disability would benefit from an estimate of the standard error for the subgroup. Likewise, evidence that preschool children tend to respond to test stimuli in a less consistent fashion than do older children would be helpful to test users interpreting scores across age groups.

When considering the reliability/precision of test scores for relevant subgroups, it is useful to evaluate and report the standard error of measurement as well as any coefficients that are estimated. Reliability and generalizability coefficients can differ substantially when subgroups have different variances on the construct being assessed. Differences in within-group variability tend to have less impact on the standard error of measurement.

Standard 2.12

If a test is proposed for use in several grades or over a range of ages, and if separate norms are provided for each grade or each age range, reliability/precision data should be provided for each age or grade-level subgroup, not just for all grades or ages combined.

Comment: A reliability or generalizability coefficient based on a sample of examinees spanning several grades or a broad range of ages in which average scores are steadily increasing will generally give a spuriously inflated impression of reliability/precision. When a test is intended to discriminate within age or grade populations, reliability or generalizability coefficients and standard errors should be reported separately for each subgroup.

Cluster 5. Standard Errors of Measurement

Standard 2.13

The standard error of measurement, both overall and conditional (if reported), should be provided in units of each reported score.
Comment: The standard error of measurement (overall or conditional) that is reported should be consistent with the scales that are used in reporting scores. Standard errors in scale-score units for the scales used to report scores and/or to make decisions are particularly helpful to the typical test user. The data on examinee performance should be consistent with the assumptions built into any statistical models used to generate scale scores and to estimate the standard errors for these scores.

Standard 2.14

When possible and appropriate, conditional standard errors of measurement should be reported at several score levels unless there is evidence that the standard error is constant across score levels. Where cut scores are specified for selection or classification, the standard errors of measurement should be reported in the vicinity of each cut score.

Comment: Estimation of conditional standard errors is usually feasible with the sample sizes that are used for analyses of reliability/precision. If it is assumed that the standard error is constant over a broad range of score levels, the rationale for this assumption should be presented. The model on which the computation of the conditional standard errors is based should be specified.
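One common model-based computation of conditional standard errors (offered here as an illustrative sketch, with invented item difficulties) uses the test information function of an item response theory model, where the conditional standard error at a given ability level is the reciprocal of the square root of the information at that level:

# Illustrative sketch (not from the Standards): conditional standard
# errors from the test information function of a Rasch-type model,
# where SEM(theta) = 1 / sqrt(information(theta)). Item difficulties
# are invented; real applications use calibrated item parameters.
import math

difficulties = [-1.5, -0.8, -0.3, 0.0, 0.4, 0.9, 1.6]  # hypothetical items

def rasch_information(theta, difficulties):
    """Test information at ability theta under the Rasch model."""
    info = 0.0
    for b in difficulties:
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        info += p * (1.0 - p)   # item information for a Rasch item
    return info

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    csem = 1.0 / math.sqrt(rasch_information(theta, difficulties))
    print(f"theta = {theta:+.1f}: conditional SEM = {csem:.2f}")
# The SEM is smallest where the items are concentrated and grows toward
# the extremes -- one reason to report it at several score levels.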
Standard 2.15

When there is credible evidence for expecting that conditional standard errors of measurement or test information functions will differ substantially for various subgroups, investigation of the extent and impact of such differences should be undertaken and reported as soon as is feasible.

Comment: If differences are found, they should be clearly indicated in the appropriate documentation. In addition, if substantial differences do exist, the test content and scoring models should be examined to see if there are legally acceptable alternatives that do not result in such differences.

Cluster 6. Decision Consistency

Standard 2.16

When a test or combination of measures is used to make classification decisions, estimates should be provided of the percentage of test takers who would be classified in the same way on two replications of the procedure.

Comment: When a test score or composite score is used to make classification decisions (e.g., pass/fail, achievement levels), the standard error of measurement at or near the cut scores has important implications for the trustworthiness of these decisions. However, the standard error cannot be translated into the expected percentage of consistent or accurate decisions without strong assumptions about the distributions of measurement errors and true scores. Although decision consistency is typically estimated from the administration of a single form, it can and should be estimated directly through the use of a test-retest approach, if consistent with the requirements of test security, and if the assumption of no change in the construct is met and adequate samples are available.

Cluster 7. Reliability/Precision of Group Means

Standard 2.17

When average test scores for groups are the focus of the proposed interpretation of the test results, the groups tested should generally be regarded as a sample from a larger population, even if all examinees available at the time of measurement are tested. In such cases the standard error of the group mean should be reported, because it reflects variability due to sampling of examinees as well as variability due to individual measurement error.

Comment: The overall levels of performance in various groups tend to be the focus in program evaluation and in accountability systems, and the groups that are of interest include all students/clients who could participate in the program over some
period. Therefore, the students in a particular class or school at the current time, the current clients of a social service agency, and analogous groups exposed to a program of interest typically constitute a sample in a longitudinal sense. Presumably, comparable groups from the same population will recur in future years, given static conditions. The factors leading to uncertainty in conclusions about program effectiveness arise from the sampling of persons as well as from individual measurement error.

Standard 2.18

When the purpose of testing is to measure the performance of groups rather than individuals, subsets of items can be assigned randomly to different subsamples of examinees. Data are aggregated across subsamples and item subsets to obtain a measure of group performance. When such procedures are used for program evaluation or population descriptions, reliability/precision analyses must take the sampling scheme into account.

Comment: This type of measurement program is termed matrix sampling. It is designed to reduce the time demanded of individual examinees and yet to increase the total number of items on which data can be obtained. This testing approach provides the same type of information about group performances that would be obtained if all examinees had taken all of the items. Reliability/precision statistics should reflect the sampling plan used with respect to examinees and items.

Cluster 8. Documenting Reliability/Precision

Standard 2.19

Each method of quantifying the reliability/precision of scores should be described clearly and expressed in terms of statistics appropriate to the method. The sampling procedures used to select test takers for reliability/precision analyses and the descriptive statistics on these samples, subject to privacy obligations where applicable, should be reported.

Comment: Information on the method of data collection, sample sizes, means, standard deviations, and demographic characteristics of the groups tested helps users judge the extent to which reported data apply to their own examinee populations. If the test-retest or alternate-form approach is used, the interval between administrations should be indicated.

Because there are many ways of estimating reliability/precision, and each is influenced by different sources of measurement error, it is unacceptable to say simply, "The reliability/precision of scores on test X is .90." A better statement would be, "The reliability coefficient of .90 reported for scores on test X was obtained by correlating scores from forms A and B, administered on successive days. The data were based on a sample of 400 10th-grade students from five middle-class suburban schools in New York State. The demographic breakdown of this group was as follows: ..." In some cases, for example, when small sample sizes or particularly sensitive data are involved, applicable legal restrictions governing privacy may limit the level of information that should be disclosed.

Standard 2.20

If reliability coefficients are adjusted for restriction of range or variability, the adjustment procedure and both the adjusted and unadjusted coefficients should be reported. The standard deviations of the group actually tested and of the target population, as well as the rationale for the adjustment, should be presented.

Comment: Application of a correction for restriction in variability presumes that the available sample is not representative (in terms of variability) of the test-taker population to which users might be expected to generalize. The rationale for the correction should consider the appropriateness of such a generalization. Adjustment formulas that presume constancy in the standard error across score levels should not be used unless constancy can be defended.
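The adjustment most often used in this situation (shown below as an illustrative sketch with invented values) assumes a constant standard error of measurement in the tested group and the target population, which is the very constancy assumption the comment says must be defended:

# Illustrative sketch (not from the Standards): a common adjustment of
# a reliability coefficient for restriction of range, assuming the
# standard error of measurement is the same in the tested group and in
# the target population. Numbers are invented.

def adjust_reliability(r_restricted, sd_restricted, sd_target):
    """Reliability in the target population, assuming a constant SEM."""
    sem_sq = sd_restricted**2 * (1 - r_restricted)  # error variance
    return 1 - sem_sq / sd_target**2

r_obs = 0.80  # reliability in a selected (less variable) sample
print(adjust_reliability(r_obs, sd_restricted=8.0, sd_target=12.0))
# -> about 0.911; both coefficients, both standard deviations, and the
#    rationale would need to be reported under Standard 2.20.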
3. FAIRNESS IN TESTING

BACKGROUND
This chapter addresses the importance of fairness as a fundamental issue in protecting test takers and test users in all aspects of testing. The term fairness has no single technical meaning and is used in many different ways in public discourse. It is possible that individuals endorse fairness in testing as a desirable social goal, yet reach quite different conclusions about the fairness of a given testing program. A full consideration of the topic would explore the multiple functions of testing in relation to its many goals, including the broad goal of achieving equality of opportunity in our society. It would consider the technical properties of tests, the ways in which test results are reported and used, the factors that affect the validity of score interpretations, and the consequences of test use. A comprehensive analysis of fairness in testing also would examine the regulations, statutes, and case law that govern test use and the remedies for harmful testing practices. The Standards cannot hope to deal adequately with all of these broad issues, some of which have occasioned sharp disagreement among testing specialists and others interested in testing. Our focus must be limited here to delineating the aspects of tests, testing, and test use that relate to fairness as described in this chapter, which are the responsibility of those who develop, use, and interpret the results of tests, and upon which there is general professional and technical agreement.

Fairness is a fundamental validity issue and requires attention throughout all stages of test development and use. In previous versions of the Standards, fairness and the assessment of individuals from specific subgroups of test takers, such as individuals with disabilities and individuals with diverse linguistic and cultural backgrounds, were presented in separate chapters. In the current version of the Standards, these issues are presented in a single chapter to emphasize that fairness to all individuals in the intended population of test takers is an overriding, foundational concern, and that common principles apply in responding to test-taker characteristics that could interfere with the validity of test score interpretation. This is not to say that the response to test-taker characteristics is the same for individuals from diverse subgroups such as those defined by race, ethnicity, gender, culture, language, age, disability, or socioeconomic status, but rather that these responses should be sensitive to individual characteristics that otherwise would compromise validity. Nonetheless, as discussed in the Introduction, it is important to bear in mind, when using the Standards, that applicability depends on context. For example, potential threats to test validity for examinees with limited English proficiency are different from those for examinees with disabilities. Moreover, threats to validity may differ even for individuals within the same subgroup. For example, individuals with diverse specific disabilities constitute the subgroup of "individuals with disabilities," and examinees classified as "limited English proficient" represent a range of language proficiency levels, educational and cultural backgrounds, and prior experiences. Further, the equivalence of the construct being assessed is a central issue in fairness, whether the context is, for example, individuals with diverse specific disabilities, individuals with limited English proficiency, or individuals across countries and cultures.

As in the previous versions of the Standards, the current chapter addresses measurement bias as a central threat to fairness in testing. However, it also adds two major concepts that have emerged in the literature, particularly in literature regarding education, for minimizing bias and thereby increasing fairness. The first concept is accessibility, the notion that all test takers should have an unobstructed opportunity to demonstrate their standing on the construct(s) being measured. For example, individuals with limited English proficiency
may not be adequately diagnosed on the target construct of a clinical examination if the assessment requires a level of English proficiency that they do not possess. Similarly, standard print and some electronic formats can disadvantage examinees with visual impairments and some older adults who need magnification for reading, and the disadvantage is considered unfair if visual acuity is irrelevant to the construct being measured. These examples show how access to the construct the test is measuring can be impeded by characteristics and/or skills that are unrelated to the intended construct and thereby can limit the validity of score interpretations for intended uses for certain individuals and/or subgroups in the intended test-taking population. Accessibility is a legal requirement in some testing contexts.

The second new concept contained in this chapter is that of universal design. Universal design is an approach to test design that seeks to maximize accessibility for all intended examinees. Universal design, as described more thoroughly later in this chapter, demands that test developers be clear on the construct(s) to be measured, including the target of the assessment, the purpose for which scores will be used, the inferences that will be made from the scores, and the characteristics of examinees and subgroups of the intended test population that could influence access. Test items and tasks can then be purposively designed and developed from the outset to reflect the intended construct, to minimize construct-irrelevant features that might otherwise impede the performance of intended examinee groups, and to maximize, to the extent possible, access for as many examinees as possible in the intended population regardless of race, ethnicity, age, gender, socioeconomic status, disability, or language or cultural background.

Even so, for some individuals in some test contexts and for some purposes, as is described later, there may be need for additional test adaptations to respond to individual characteristics that otherwise would limit access to the construct as measured. Some examples are creating a braille version of a test, allowing additional testing time, and providing test translations or language simplification. Any test adaptation must be carefully considered, as some adaptations may alter a test's intended construct. Responding to individual characteristics that would otherwise impede access and improving the validity of test score interpretations for intended uses are dual considerations for supporting fairness.

In summary, this chapter interprets fairness as responsiveness to individual characteristics and testing contexts so that test scores will yield valid interpretations for intended uses. The Standards' definition of fairness is often broader than what is legally required. A test that is fair within the meaning of the Standards reflects the same construct(s) for all test takers, and scores from it have the same meaning for all individuals in the intended population; a fair test does not advantage or disadvantage some individuals because of characteristics irrelevant to the intended construct. To the degree possible, characteristics of all individuals in the intended test population, including those associated with race, ethnicity, gender, age, socioeconomic status, or linguistic or cultural background, must be considered throughout all stages of development, administration, scoring, interpretation, and use so that barriers to fair assessment can be reduced. At the same time, test scores must yield valid interpretations for intended uses, and different test contexts and uses may call for different approaches to fairness. For example, in tests used for selection purposes, adaptations to standardized procedures that increase accessibility for some individuals but change the construct being measured could reduce the validity of score inferences for the intended purposes and unfairly advantage those who qualify for adaptation relative to those who do not. In contrast, for diagnostic purposes in medicine and education, adapting a test to increase accessibility for some individuals could increase the accuracy of the diagnosis. These issues are discussed in the sections below and are represented in the standards that follow the chapter introduction.

General Views of Fairness

The first view of fairness in testing described in this chapter establishes the principle of fair and
equitable treatment of all test takers during the testing process. The second, third, and fourth views presented here emphasize issues of fairness in measurement quality: fairness as the lack or absence of measurement bias, fairness as access to the constructs measured, and fairness as validity of individual test score interpretations for the intended use(s).

Fairness in Treatment During the Testing Process

Regardless of the purpose of testing, the goal of fairness is to maximize, to the extent possible, the opportunity for test takers to demonstrate their standing on the construct(s) the test is intended to measure. Traditionally, careful standardization of tests, administration conditions, and scoring procedures have helped to ensure that test takers have comparable contexts in which to demonstrate the abilities or attributes to be measured. For example, uniform directions, specified time limits, specified room arrangements, use of proctors, and use of consistent security procedures are implemented so that differences in administration conditions will not inadvertently influence the performance of some test takers relative to others. Similarly, concerns for equity in treatment may require, for some tests, that all test takers have qualified test administrators with whom they can communicate and feel comfortable to the extent practicable. Where technology is involved, it is important that examinees have had similar prior exposure to the technology and that the equipment provided to all test takers be of similar processing speed and provide similar clarity and size for images and other media. Procedures for the standardized administration of a test should be carefully documented by the test developer and followed carefully by the test administrator.

Although standardization has been a fundamental principle for assuring that all examinees have the same opportunity to demonstrate their standing on the construct that a test is intended to measure, sometimes flexibility is needed to provide essentially equivalent opportunities for some test takers. In these cases, aspects of a standardized testing process that pose no particular challenge for most test takers may prevent specific groups or individuals from accurately demonstrating their standing with respect to the construct of interest. For example, challenges may arise due to an examinee's disability, cultural background, linguistic background, race, ethnicity, socioeconomic status, limitations that may come with aging, or some combination of these or other factors. In some instances, greater comparability of scores may be attained if standardized procedures are changed to address the needs of specific groups or individuals without any adverse effects on the validity or reliability of the results obtained. For example, a braille test form, a large-print answer sheet, or a screen reader may be provided to enable those with some visual impairments to obtain more equitable access to test content. Legal considerations may also influence how to address individualized needs.

Fairness as Lack of Measurement Bias

Characteristics of the test itself that are not related to the construct being measured, or the manner in which the test is used, may sometimes result in different meanings for scores earned by members of different identifiable subgroups. For example, differential item functioning (DIF) is said to occur when equally able test takers differ in their probabilities of answering a test item correctly as a function of group membership. DIF can be evaluated in a variety of ways. The detection of DIF does not always indicate bias in an item; there needs to be a suitable, substantial explanation for the DIF to justify the conclusion that the item is biased. Differential test functioning (DTF) refers to differences in the functioning of tests (or sets of items) for different specially defined groups. When DTF occurs, individuals from different groups who have the same standing on the characteristic assessed by the test do not have the same expected test score.
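One widely used screening method for DIF (presented here as an illustrative sketch only, with invented counts) is the Mantel-Haenszel procedure, which compares the odds of success on the studied item for members of two groups who have been matched on total score:

# Illustrative sketch (not from the Standards): the Mantel-Haenszel
# common odds ratio, a common DIF screening statistic. Examinees are
# matched on total score; within each score stratum the odds of a
# correct response are compared across groups. Counts are invented;
# operational analyses use many strata plus significance and
# effect-size criteria.

# Per stratum: (ref_correct, ref_wrong, focal_correct, focal_wrong)
strata = [
    (30, 20, 25, 25),
    (45, 15, 38, 22),
    (60, 10, 52, 18),
]

num = sum(rc * fw / (rc + rw + fc + fw) for rc, rw, fc, fw in strata)
den = sum(rw * fc / (rc + rw + fc + fw) for rc, rw, fc, fw in strata)
alpha_mh = num / den   # common odds ratio; 1.0 indicates no DIF

print(f"Mantel-Haenszel odds ratio = {alpha_mh:.2f}")
# Values far from 1 flag the item for review; as noted above, a flagged
# item is not automatically biased without a substantive explanation.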
The term predictive bias may be used when evidence is found that differences exist in the patterns of associations between test scores and other variables for different groups, bringing with it concerns about bias in the inferences drawn from the use of test scores. Differential prediction is examined using regression analysis. One approach
examines slope and intercept differences between two targeted groups (e.g., African American examinees and Caucasian examinees), while another examines systematic deviations from a common regression line for any number of groups of interest. Both approaches provide valuable information when examining differential prediction. Correlation coefficients provide inadequate evidence for or against a differential prediction hypothesis if groups are found to have unequal means and variances on the test and the criterion.
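The first approach can be sketched as follows (an illustration with simulated, invented data, not a prescribed analysis): fit the regression of the criterion on the test separately within each group, then compare the estimated slopes and intercepts.

# Illustrative sketch (not from the Standards): comparing regression
# slopes and intercepts across two groups. Data are simulated so that
# the same prediction line holds in both groups; with real data,
# meaningful slope or intercept differences would signal possible
# predictive bias.
import numpy as np

rng = np.random.default_rng(3)

def fit_line(x, y):
    """Least-squares slope and intercept of y on x."""
    slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return slope, y.mean() - slope * x.mean()

for group in ("group A", "group B"):
    test = rng.normal(50, 10, 150)                      # predictor scores
    criterion = 0.6 * test + 10 + rng.normal(0, 6, 150)
    slope, intercept = fit_line(test, criterion)
    print(f"{group}: slope = {slope:.2f}, intercept = {intercept:.1f}")
# Pooling groups that differ in means and variances can mask or mimic
# such differences, which is why correlations alone are inadequate.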
When credible evidence indicates potential bias in measurement (i.e., lack of consistent construct meaning across groups, DIF, DTF) or bias in predictive relations, these potential sources of bias should be independently investigated, because the presence or absence of one form of such bias may have no relationship with other forms of bias. For example, a predictor test may show no significant levels of DIF, yet show group differences in regression lines in predicting a criterion.

Although it is important to guard against the possibility of measurement bias for the subgroups that have been defined as relevant in the intended test population, it may not be feasible to fully investigate all possibilities, particularly in the employment context. For example, the number of subgroup members in the field test or norming population may limit the possibility of standard empirical analyses. In these cases, previous research, a construct-based rationale, and/or data from similar tests may address concerns related to potential bias in measurement. In addition, and especially where credible evidence of potential bias exists, small-sample methodologies should be considered. For example, potential bias for relevant subgroups may be examined through small-scale tryouts that use cognitive labs and/or interviews or focus groups to solicit evidence on the validity of interpretations made from the test scores.

A related issue is the extent to which the construct being assessed has equivalent meaning across the individuals and groups within the intended population of test takers. This is especially important when the assessment crosses international borders and cultures. Evaluation of the underlying construct and properties of the test within one country or
culture may not generalize across borders or cultures. This can lead to invalid test score interpretations. Careful attention to bias in score interpretations should be practiced in such contexts.

Fairness in Access to the Construct(s) as Measured

The goal that all intended test takers have a full opportunity to demonstrate their standing on the construct being measured has given rise to concerns about accessibility in testing. Accessible testing situations are those that enable all test takers in the intended population, to the extent feasible, to show their status on the target construct(s) without being unduly advantaged or disadvantaged by individual characteristics (e.g., characteristics related to age, disability, race/ethnicity, gender, or language) that are irrelevant to the construct(s) the test is intended to measure. Accessibility is actually a test bias issue because obstacles to accessibility can result in different interpretations of test scores for individuals from different groups. Accessibility also has important ethical and legal ramifications.

Accessibility can best be understood by contrasting the knowledge, skills, and abilities that reflect the construct(s) the test is intended to measure with the knowledge, skills, and abilities that are not the target of the test but are required to respond to the test tasks or test items. For some test takers, factors related to individual characteristics such as age, race, ethnicity, socioeconomic status, cultural background, disability, and/or English language proficiency may restrict accessibility and thus interfere with the measurement of the construct(s) of interest. For example, a test taker with impaired vision may not be able to access the printed text of a personality test. If the test were provided in large print, the test questions would be more accessible to the test taker and would be more likely to lead to a valid measurement of the test taker's personality characteristics. It is important to be aware of test characteristics that may inadvertently render test questions less accessible for some subgroups of the intended testing population. For example, a test question that employs idiomatic phrases unrelated to the construct being measured could have the effect of making the test less accessible for test takers who are not native speakers of English. The accessibility of a test could also be decreased by questions that use regional vocabulary unrelated to the target construct or use stimulus contexts that are less familiar to individuals from some cultural subgroups than others.

As discussed later in this chapter, some test-taker characteristics that impede access are related to the construct being measured, for example, dyslexia in the context of tests of reading. In these cases, providing individuals with access to the construct and getting some measure of it may require some adaptation of the construct as well. In situations like this, it may not be possible to develop a measurement that is comparable across adapted and unadapted versions of the test; however, the measure obtained by the adapted test will most likely provide a more accurate assessment of the individual's skills and/or abilities (although perhaps not of the full intended construct) than that obtained without using the adaptation.

Providing access to a test construct becomes particularly challenging for individuals with more than one characteristic that could interfere with test performance; for example, older adults who are not fluent in English or English learners who have moderate cognitive disabilities.

Fairness as Validity of Individual Test Score Interpretations for the Intended Uses

It is important to keep in mind that fairness concerns the validity of individual score interpretations for intended uses. In attempting to ensure fairness, we often generalize across groups of test takers such as individuals with disabilities, older adults, individuals who are learning English, or those from different racial or ethnic groups or different cultural and/or socioeconomic backgrounds; however, this is done for convenience and is not meant to imply that these groups are homogeneous or that, consequently, all members of a group should be treated similarly when making interpretations of test scores for individuals (unless there is validity evidence to support such generalizations). It is particularly important, when drawing inferences about an examinee's skills or abilities, to take into account the individual characteristics of the test taker and how these characteristics may interact with the contextual features of the testing situation.

The complex interplay of language proficiency and context provides one example of the challenges to valid interpretation of test scores for some testing purposes. Proficiency in English not only affects the interpretation of an English language learner's test scores on tests administered in English but, more important, also may affect the individual's developmental and academic progress. Individuals who differ culturally and linguistically from the majority of the test takers are at risk for inaccurate score interpretations because of multiple factors associated with the assumption that, absent language proficiency issues, these individuals have developmental trajectories comparable to those of individuals who have been raised in an environment mediated by a single language and culture. For instance, consider two sixth-grade children who entered school as limited English speakers. The first child entered school in kindergarten and has been instructed in academic courses in English; the second also entered school in kindergarten but has been instructed in his or her native language. The two will have a different developmental pattern. In the former case, the interrupted native language development has an attenuating effect on learning and academic performance, but the individual's English proficiency may not be a significant barrier to testing. In contrast, the examinee who has had instruction in his or her native language through the sixth grade has had the opportunity for fully age-appropriate cognitive, academic, and language development; but, if tested in English, the examinee will need the test administered in such a way as to minimize the language barrier if proficiency in English is not part of the construct being measured.

As the above examples show, adaptation to individual characteristics and recognition of the heterogeneity within subgroups may be important to the validity of individual interpretations of test results in situations where the intent is to understand and respond to individual performance. Professionals may be justified in deviating from standardized
procedures to gain a more accurate measurement of the intended construct and to provide more appropriate individual decisions. However, for other contexts and uses, deviations from standardized procedures may be inappropriate because they change the construct being measured, compromise the comparability of scores or use of norms, and/or unfairly advantage some individuals.

In closing this section on the meanings of fairness, note that the Standards' measurement perspective explicitly excludes one common view of fairness in public discourse: fairness as the equality of testing outcomes for relevant test-taker subgroups. Certainly, most testing professionals agree that group differences in testing outcomes should trigger heightened scrutiny for possible sources of test bias. Examination of group differences also may be important in generating new hypotheses about bias, fair treatment, and the accessibility of the construct as measured; and in fact, there may be legal requirements to investigate certain differences in the outcomes of testing among subgroups. However, group differences in outcomes do not in themselves indicate that a testing application is biased or unfair.

In many cases, it is not clear whether the differences are due to real differences between groups in the construct being measured or to some source of bias (e.g., construct-irrelevant variance or construct underrepresentation). In most cases, it may be some combination of real differences and bias. A serious search for possible sources of bias that comes up empty provides reassurance that the potential for bias is limited, but even a very extensive research program cannot rule the possibility out. It is always possible that something was missed, and therefore, prudence would suggest that an attempt be made to minimize the differences. For example, some racial and ethnic subgroups have lower mean scores on some standardized tests than do other subgroups. Some of the factors that contribute to these differences are understood (e.g., large differences in family income and other resources, differences in school quality and students' opportunity to learn the material to be assessed), but even where serious efforts have been made to eliminate possible sources of bias in test content and formats, the potential for some score bias cannot be completely ruled out. Therefore, continuing efforts in test design and development to eliminate potential sources of bias without compromising validity, and consistent with legal and regulatory standards, are warranted.

Threats to Fair and Valid Interpretations of Test Scores

A prime threat to fair and valid interpretation of test scores comes from aspects of the test or testing process that may produce construct-irrelevant variance in scores that systematically lowers or raises scores for identifiable groups of test takers and results in inappropriate score interpretations for intended uses. Such construct-irrelevant components of scores may be introduced by inappropriate sampling of test content, aspects of the test context such as lack of clarity in test instructions, item complexities that are unrelated to the construct being measured, and/or test response expectations or scoring criteria that may favor one group over another. In addition, opportunity to learn (i.e., the extent to which an examinee has been exposed to instruction or experiences assumed by the test developer and/or user) can influence the fair and valid interpretations of test scores for their intended uses.

Test Content

One potential source of construct-irrelevant variance in test scores arises from inappropriate test content, that is, test content that confounds the measurement of the target construct and differentially favors individuals from some subgroups over others. A test intended to measure critical reading, for example, should not include words and expressions especially associated with particular occupations, disciplines, cultural backgrounds, socioeconomic status, racial/ethnic groups, or geographical locations, so as to maximize the measurement of the construct (the ability to read critically) and to minimize confounding of this measurement with prior knowledge and experience that are likely to advantage, or disadvantage, test takers from particular subgroups.
Differential engagement and motivational value may also be factors in exacerbating construct-irrelevant components of content. Material that is likely to be differentially interesting should be balanced to appeal broadly to the full range of the targeted testing population (except where the interest level is part of the construct being measured). In testing, such balance extends to representation of individuals from a variety of subgroups within the test content itself. For example, applied problems can feature children and families from different racial/ethnic, socioeconomic, and language groups. Also, test content or situations that are offensive or emotionally disturbing to some test takers and may impede their ability to engage with the test should not appear in the test unless the use of the offensive or disturbing content is needed to measure the intended construct. Examples of this type of content are graphic descriptions of slavery or the Holocaust, when such descriptions are not specifically required by the construct.

Depending on the context and purpose of tests, it is both common and advisable for test developers to engage an independent and diverse panel of experts to review test content for language, illustrations, graphics, and other representations that might be differentially familiar or interpreted differently by members of different groups and for material that might be offensive or emotionally disturbing to some test takers.

Test Context

The term test context, as used here, refers to multiple aspects of the test and testing environment that may affect the performance of an examinee and consequently give rise to construct-irrelevant variance in the test scores. As research on contextual factors (e.g., stereotype threat) is ongoing, test developers and test users should pay attention to the emerging empirical literature on these topics so that they can use this information if and when the preponderance of evidence dictates that it is appropriate to do so. Construct-irrelevant variance may result from a lack of clarity in test instructions, from unrelated complexity or language demands in test tasks, and/or from other characteristics of test items that are unrelated to the construct but lead some individuals to respond in particular ways. For example, examinees from diverse racial/ethnic, linguistic, or cultural backgrounds or who differ by gender may be poorly assessed by a vocational interest inventory whose questions disproportionately ask about competencies, activities, and interests that are stereotypically associated with particular subgroups.

When test settings have an interpersonal context, the interaction of examiner with test taker can be a source of construct-irrelevant variance or bias. Users of tests should be alert to the possibility that such interactions may sometimes affect test fairness. Practitioners administering the test should be aware of the possibility of complex interactions with test takers and other situational variables. Factors that may affect the performance of the test taker include the race, ethnicity, gender, and linguistic and cultural background of both examiner and test taker, the test taker's experience with formal education, the testing style of the examiner, the level of acculturation of the test taker and examiner, the test taker's primary language, the language used for test administration (if it is not the primary language of the test taker), and the use of a bilingual or bicultural interpreter.

Testing of individuals who are bilingual or multilingual poses special challenges. An individual who knows two or more languages may not test well in one or more of the languages. For example, children from homes whose families speak Spanish may be able to understand Spanish but express themselves best in English, or vice versa. In addition, some persons who are bilingual use their native language in most social situations and use English primarily for academic and work-related activities; the use of one or both languages depends on the nature of the situation. Non-native English speakers who give the impression of being fluent in conversational English may be slower or not completely competent in taking tests that require English comprehension and literacy skills. Thus, in some settings, an understanding of an individual's type and degree of bilingualism or multilingualism is important for testing the individual appropriately. Note that this concern may not apply when the
construct of interest is defined as a particular kind of language proficiency (e.g., academic language of the kind found in textbooks, language and vocabulary specific to workplace and employment testing).

Test Response

In some cases, construct-irrelevant variance may arise because test items elicit varieties of responses other than those intended or because items can be solved in ways that were not intended. To the extent that such responses are more typical of some subgroups than others, biased score interpretations may result. For example, some clients responding to a neuropsychological test may attempt to provide the answers they think the test administrator expects, as opposed to the answers that best describe themselves.

Construct-irrelevant components in test scores may also be associated with test response formats that pose particular difficulties or are differentially valued by particular individuals. For example, test performance may rely on some capability (e.g., English language proficiency or fine-motor coordination) that is irrelevant to the target construct(s) but nonetheless poses impediments to the test responses for some test takers not having the capability. Similarly, different values associated with the nature and degree of verbal output can influence test-taker responses. Some individuals may judge verbosity or rapid speech as rude, whereas others may regard those speech patterns as indications of high mental ability or friendliness. An individual of the first type who is evaluated with values appropriate to the second may be considered taciturn, withdrawn, or of low mental ability. Another example is a person with memory or language problems or depression; such a person's ability to communicate or show interest in communicating verbally may be constrained, which may result in interpretations of the outcomes of the assessment that are invalid and potentially harmful to the person being tested.

In the development and use of scoring rubrics, it is particularly important that credit be awarded for response characteristics central to the construct being measured and not for response characteristics that are irrelevant or tangential to the construct. Scoring rubrics may inadvertently advantage some individuals over others. For example, a scoring rubric for a constructed response item might reserve the highest score level for test takers who provide more information or elaboration than was actually requested. In this situation, test takers who simply follow instructions, or test takers who value succinctness in responses, will earn lower scores; thus, characteristics of the individuals become construct-irrelevant components of the test scores. Similarly, the scoring of open-ended responses may introduce construct-irrelevant variance for some test takers if scorers and/or automated scoring routines are not sensitive to the full diversity of ways in which individuals express their ideas. With the advent of automated scoring for complex performance tasks, for example, it is important to examine the validity of the automated scoring results for relevant subgroups in the test-taking population.

Opportunity to Learn

Finally, opportunity to learn (the extent to which individuals have had exposure to instruction or knowledge that affords them the opportunity to learn the content and skills targeted by the test) has several implications for the fair and valid interpretation of test scores for their intended uses. Individuals' prior opportunity to learn can be an important contextual factor to consider in interpreting and drawing inferences from test scores. For example, a recent immigrant who has had little prior exposure to school may not have had the opportunity to learn concepts assumed to be common knowledge by a personality inventory or ability measure, even if that measure is administered in the native language of the test taker. Similarly, as another example, there has been considerable public discussion about potential inequities in school resources available to students from traditionally disadvantaged groups, for example, racial, ethnic, language, and cultural minorities and rural students. Such inequities affect the quality of education received. To the extent that inequity exists, the validity of inferences about student ability drawn from achievement test scores
may be compromised. Not taking into account prior opportunity to learn could lead to misdiagnosis, inappropriate placement, and/or inappropriate assignment of services, which could have significant consequences for an individual.

Beyond its impact on the validity of test score interpretations for intended uses, opportunity to learn has important policy and legal ramifications in education. Opportunity to learn is a fairness issue when an authority provides differential access to opportunity to learn for some individuals and then holds those individuals who have not been provided that opportunity accountable for their test performance. This problem may affect high-stakes competency tests in education, for example, when educational authorities require a certain level of test performance for high school graduation. Here, there is a fairness concern that students not be held accountable for, or face serious permanent negative consequences from, their test results when their school experiences have not provided them the opportunity to learn the subject matter covered by the test. In such cases, students' low scores may accurately reflect what they know and can do, so that, technically, the interpretation of the test results for the purpose of measuring how much the students have learned may not be biased. However, it may be considered unfair to severely penalize students for circumstances that are not under their control, that is, for not learning content that their schools have not taught. It is generally accepted that before high-stakes consequences can be imposed for failing an examination in educational settings, there must be evidence that students have been provided curriculum and instruction that incorporates the constructs addressed by the test.

Several important issues arise when opportunity to learn is considered as a component of fairness. First, it is difficult to define opportunity to learn in educational practice, particularly at the individual level. Opportunity is generally a matter of degree and is difficult to quantify; moreover, the measurement of some important learning outcomes may require students to work with materials that they have not seen before. Second, even if it is possible to document the topics included in the curriculum for a group of students, specific content coverage for any one student may be impossible to determine. Third, granting a diploma to a low-scoring examinee on the grounds that the student had insufficient opportunity to learn the material tested means certificating someone who has not attained the degree of proficiency the diploma is intended to signify.

It should be noted that concerns about opportunity to learn do not necessarily apply to situations where the same authority is not responsible for both the delivery of instruction and the testing and/or interpretation of results. For example, in college admissions decisions, opportunity to learn may be beyond the control of the test users and it may not influence the validity of test interpretations for their intended use (e.g., selection and/or admissions decisions). Chapter 12, "Educational Testing and Assessment," provides additional perspective on opportunity to learn.

Minimizing Construct-Irrelevant Components Through Test Design and Testing Adaptations

Standardized tests should be designed to facilitate accessibility and minimize construct-irrelevant barriers for all test takers in the target population, as far as practicable. Before considering the need for any assessment adaptations for test takers who may have special needs, the assessment developer first must attempt to improve accessibility within the test itself. Some of these basic principles are included in the test design process called universal design. By using universal design, test developers begin the test development process with an eye toward maximizing fairness. Universal design emphasizes the need to develop tests that are as usable as possible for all test takers in the intended test population, regardless of characteristics such as gender, age, language background, culture, socioeconomic status, or disability.

Principles of universal design include defining constructs precisely, so that what is being measured can be clearly differentiated from test-taker characteristics that are irrelevant to the construct but that could otherwise interfere with some test
takers' ability to respond. Universal design avoids, where possible, item characteristics and formats, or test characteristics (for example, inappropriate test speededness), that may bias scores for individuals or subgroups due to construct-irrelevant characteristics that are specific to these test takers.

Universal design processes strive to minimize access challenges by taking into account test characteristics that may impede access to the construct for certain test takers, such as the choice of content, test tasks, response procedures, and testing procedures. For example, the content of tests can be made more accessible by providing user-selected font sizes in a technology-based test, by avoiding item contexts that would likely be unfamiliar to individuals because of their cultural background, by providing extended administration time when speed is not relevant to the construct being measured, or by minimizing the linguistic load of test items intended to measure constructs other than competencies in the language in which the test is administered.

Although the principles of universal design for assessment provide a useful guide for developing assessments that reduce construct-irrelevant variance, researchers are still in the process of gathering empirical evidence to support some of these principles. It is important to note that not all tests can be made accessible for everyone by attention to design changes such as those discussed above. Even when tests are developed to maximize fairness through the use of universal design and other practices to increase access, there will still be situations where the test is not appropriate for all test takers in the intended population. Therefore, some test adaptations may be needed for those individuals whose characteristics would otherwise impede their access to the examination.

Adaptations are changes to the original test design or administration to increase access to the test for such individuals. For example, a person who is blind may read only in braille format, and an individual with hemiplegia may be unable to hold a pencil and thus have difficulty completing a standard written exam. Students with limited English proficiency may be proficient in physics but may not be able to demonstrate their knowledge if the physics test is administered in English. Depending on testing circumstances and purposes of the test, as well as individual characteristics, such adaptations might include changing the content or presentation of the test items, changing the administration conditions, and/or changing the response processes. The term adaptation is used to refer to any such change. It is important, however, to differentiate between changes that result in comparable scores and changes that may not produce scores that are comparable to those from the original test. Although the terms may have different meanings under applicable laws, as used in the Standards the term accommodation is used to denote changes with which the comparability of scores is retained, and the term modification is used to denote changes that affect the construct measured by the test. With a modification, the changes affect the construct being measured and consequently lead to scores that differ in meaning from those from the original test.1

1 The Americans with Disabilities Act (ADA) uses the terms accommodation and modification differently from the Standards. Title I of the ADA uses the term reasonable accommodation to refer to changes that enable qualified individuals with disabilities to obtain employment or to perform their jobs. Titles II and III use the term reasonable modification in much the same way. Under the ADA, an accommodation or modification to a test that fundamentally alters the construct being measured would not be called something different; rather, it would probably be found not "reasonable."

It is important to keep in mind that attention to design and the provision of altered tests do not always ensure that test results will be fair and valid for all examinees. Those who administer tests and interpret test scores need to develop a full understanding of the usefulness and limitations of test design procedures for accessibility and any alterations that are offered.

A Range of Test Adaptations

Rather than a simple dichotomy, potential test adaptations reflect a broad range of test changes. At one end of the range are test accommodations. As the term is used in the Standards, accommodations consist of relatively minor changes to the presentation and/or format of the test, test
administration, or response procedures that maintain the original construct and result in scores comparable to those on the original test. For example, text magnification might be an accommodation for a test taker with a visual impairment who otherwise would have difficulty deciphering test directions or items. English-native language glossaries are an example of an accommodation that might be provided for limited English proficient test takers on a construction safety test to help them understand what is being asked. The glossaries would contain words that, while not directly related to the construct being measured, would help limited English test takers understand the context of the question or task being posed.

At the other end of the range are adaptations that transform the construct being measured, including the test content and/or testing conditions, to get a reasonable measure of a somewhat different but appropriate construct for designated test takers. For example, in educational testing, different tests addressing alternate achievement standards are designed for students with severe cognitive disabilities for the same subjects in which students without disabilities are assessed. Clearly, scores from these different tests cannot be considered comparable to those resulting from the general assessment, but instead represent scores from a new test that requires the same rigorous development and validation processes as would be carried out for any new assessment. (An expanded discussion of the use of such alternate assessments is found in chap. 12; alternate assessments will not be treated further in the present chapter.) Other adaptations change the intended construct to make it accessible for designated students while retaining as much of the original construct as possible. For example, a reading test adaptation might provide a dyslexic student with a screen reader that reads aloud the passages and the test questions measuring reading comprehension. If the construct is intentionally defined as requiring both the ability to decode and the ability to comprehend written language, the adaptation would require a different interpretation of the test scores as a measure of reading comprehension. Clearly, this adaptation changes the construct being measured, because the student does not have to decode the printed text; but without the adaptation, the student may not be able to demonstrate any standing on the construct of reading comprehension. On the other hand, if the purpose of the reading test is to evaluate comprehension without concern for decoding ability, the adaptation might be judged to support more valid interpretations of some students' reading comprehension, and the essence of the relevant parts of the construct might be judged to be intact. The challenge for those who report, interpret, and/or use test scores from adapted tests is to recognize which adaptations provide scores that are comparable to the scores from the original, unadapted assessment and which adaptations do not. This challenge becomes even more difficult when evidence to support the comparability of scores is not available.

Test Accommodations: Comparable Measures That Maintain the Intended Construct

Comparability of scores enables test users to make comparable inferences based on the scores for all test takers. Comparability also is the defining feature for a test adaptation to be considered an accommodation. Scores from the accommodated version of the test must yield inferences comparable to those from the standard version; to make this happen is a challenging proposition. On the one hand, common, uniform procedures are a basic underpinning for score validity and comparability. On the other hand, accommodations by their very nature mean that something in the testing circumstance has been changed, because adhering to the original standardized procedures would interfere with valid measurement of the intended construct(s) for some individuals.

The comparability of inferences made from accommodated test scores rests largely on whether the scores represent the same constructs as those from the original test. This determination requires a very clear definition of the intended construct(s). For example, when non-native speakers of the language of the test take a survey of their health and nutrition knowledge, one may not know whether the test score is, in whole or in part, a measure of the ability to read in the language of
the test rather than a measure of the intended construct. If the test is not intended to also be a measure of the ability to read in English, then test scores do not represent the same construct(s) for examinees who may have poor reading skills, such as limited English proficient test takers, as they do for those who are fully proficient in reading English. An adaptation that improves the accessibility of the test for non-native speakers of English by providing direct or indirect linguistic supports may yield a score that is uncontaminated by the ability to understand English.

At the same time, construct underrepresentation is a primary threat to the validity of test accommodations. For example, extra time is a common accommodation, but if speed is part of the intended construct, it is inappropriate to allow for extra time in the test administration. Scores obtained on the test with extended administration time may underrepresent the construct measured by the strictly timed test because speed will not be part of the construct measured by the extended-time test. Similarly, translating a reading comprehension test used for selection into an organization's training program is inappropriate if reading comprehension in English is important to successful participation in the program.

Claims that accommodated versions of a test yield interpretations comparable to those based on scores from the original test and that the construct being measured has not been changed need to be evaluated and substantiated with evidence. Although score comparability is easiest to establish when different test forms are constructed following identical procedures and then equated statistically, such procedures usually are not possible for accommodated and nonaccommodated versions of tests. Instead, relevant evidence can take a variety of forms, from experimental studies to assess construct equivalence to smaller, qualitative studies and/or use of professional judgment and expert review. Whatever the case, test developers and/or users should seek evidence of the comparability of the accommodated and original assessments.
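
To make the statistical equating mentioned above concrete, the sketch below illustrates linear (mean-sigma) equating, one elementary method for placing scores from two test forms on a common scale. The score arrays and the assumption of randomly equivalent groups are hypothetical illustrations introduced here, not material from the Standards.

```python
# Hypothetical illustration of linear (mean-sigma) equating between two
# test forms administered to randomly equivalent groups. Data are invented.
import statistics

form_x = [31, 35, 28, 40, 33, 37, 29, 36, 34, 32]  # raw scores on form X
form_y = [27, 30, 25, 36, 29, 33, 24, 31, 30, 28]  # raw scores on form Y

mean_x, sd_x = statistics.mean(form_x), statistics.stdev(form_x)
mean_y, sd_y = statistics.mean(form_y), statistics.stdev(form_y)

def x_on_y_scale(x):
    """Map a form-X raw score onto the form-Y scale so that the two
    score distributions are matched in mean and standard deviation."""
    return sd_y / sd_x * (x - mean_x) + mean_y

print(round(x_on_y_scale(34), 2))
```

The randomly-equivalent-groups assumption behind even this simple procedure is exactly what typically cannot be met for accommodated versus nonaccommodated administrations, which is why the text points instead to experimental studies, qualitative evidence, and professional judgment.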

A variety of strategies for accommodating tests and testing procedures have been implemented to be responsive to the needs of test takers with disabilities and those with diverse linguistic and cultural backgrounds. Similar approaches may be adapted for other subgroups. Specific strategies depend on the purpose of the test and the construct(s) the test is intended to measure. Some strategies require changing test administration procedures (e.g., instructions, response format), whereas others alter testing medium, timing, settings, or format. Depending on the linguistic background or the nature and extent of the disability, one or more testing changes may be appropriate for a particular individual.

Regardless of the individual's characteristics that make accommodations necessary, it is important that test accommodations address the specific access issue(s) that otherwise would bias an individual's test results. For example, accommodations provided to limited English proficient test takers should be designed to address appropriate linguistic support needs; those provided to test takers with visual impairments should address the inability to see test material. Accommodations should be effective in removing construct-irrelevant barriers to an individual's test performance without providing an unfair advantage over individuals who do not receive the accommodation. Admittedly, achieving both objectives can be challenging.

Adaptations involving test translations merit special consideration. Simply translating a test from one language to another does not ensure that the translation produces a version of the test that is comparable in content and difficulty level to the original version of the test, or that the translated test produces scores that are equally reliable/precise and valid as those from the original test. Furthermore, one cannot assume that the relevant acculturation, clinical, or educational experiences are similar for test takers taking the translated version and for the target group used to develop the original version. In addition, it cannot be assumed that translation into the native language is always a preferred accommodation. Research in educational testing, for example, shows that translated content tests are not effective unless test takers have been instructed using the language of the translated test. Whenever tests are translated from one language to a second language, evidence of the validity, reliability/precision, and comparability of scores on the different versions of the tests should be collected and reported.

When the testing accommodation employs the use of an interpreter, it is desirable, where feasible, to obtain someone who has a basic understanding of the process of psychological and educational assessment, is fluent in the language of the test and the test taker's native language, and is familiar with the test taker's cultural background. The interpreter ideally needs to understand the importance of following standardized procedures, the importance of accurately conveying to the examiner a test taker's actual responses, and the role and responsibilities of the interpreter in testing. The interpreter must be careful not to provide any assistance to the candidate that might potentially compromise the validity of the interpretation for intended uses of the assessment results.

Finally, it is important to standardize procedures for implementing accommodations, as far as possible, so that comparability of scores is maintained. Standardized procedures for test accommodations must include rules for determining who is eligible for an accommodation, as well as precisely how the accommodation is to be administered. Test users should monitor adherence to the rules for eligibility and for appropriate administration of the accommodated test.

Test Modifications: Noncomparable Measures That Change the Intended Construct

There may be times when additional flexibility is required to obtain even partial measurement of the construct; that is, it may be necessary to consider a modification to a test that will result in changing the intended construct to provide even limited access to the construct that is being measured. For example, an individual with dyscalculia may have limited ability to do computations without a calculator; however, if provided a calculator, the individual may be able to do the calculations required in the assessment. If the construct being assessed involves broader mathematics skill, the individual may have limited access to the construct being measured without the use of a calculator; with the modification, however, the individual may be able to demonstrate mathematics problem-solving skills, even if he or she is not able to demonstrate computation skills. Because modified assessments are measuring a different construct from that measured by the standardized assessment, it is important to interpret the assessment scores as resulting from a new test and to gather whatever evidence is necessary to evaluate the validity of the interpretations for intended uses of the scores. For norm-based score interpretations, any modification that changes the construct will invalidate the norms for score interpretations. Likewise, if the construct is changed, criterion-based score interpretations from the modified assessment (for example, making classification decisions such as "pass/fail" or assigning categories of mastery such as "basic," "proficient," or "advanced" using cut scores determined on the original assessment) will not be valid.

Reporting Scores From Accommodated and Modified Tests

Typically, test administrators and testing professionals document steps used in making test accommodations or modifications in the test report; clinicians may also include a discussion of the validity of the interpretations of the resulting scores for intended uses. This practice of reporting the nature of accommodations and modifications is consistent with implied requirements to communicate information as to the nature of the assessment process if these changes may affect the reliability/precision of test scores or the validity of interpretations drawn from test scores.

The flagging of test score reports can be a controversial issue and subject to legal requirements. When there is clear evidence that scores from regular and altered tests or test administrations are not comparable, consideration should be given to informing score users, potentially by flagging the test results to indicate their special nature, to the extent permitted by law. Where there is credible evidence that scores from regular and altered tests are comparable, then flagging generally is not appropriate. There is little agreement in the field on how to proceed when credible evidence
on comparability does not exist. To the extent possible, test developers and/or users should collect evidence to examine the comparability of regular and altered tests or administration procedures for the test's intended purposes.

Appropriate Use of Accommodations or Modifications

Depending on the construct to be measured and the test's purpose, there are some testing situations where accommodations as defined by the Standards are not needed or modifications as defined by the Standards are not appropriate. First, the reason for the possible alteration, such as English language skills or a disability, may in fact be directly relevant to the focal construct. In employment testing, it would be inappropriate to make changes to the test if the test is designed to assess essential skills required for the job and the test changes would fundamentally alter the constructs being measured. For example, despite increased automation and use of recording devices, some court reporter jobs require individuals to be able to work quickly and accurately. Speed is an important aspect of the construct that cannot be adapted. As another example, a work sample for a customer service job that requires fluent communication in English would not be translated into another language.

Second, an adaptation for a particular disability is inappropriate when the purpose of a test is to diagnose the presence and degree of that disability. For example, allowing extra time on a timed test to determine distractibility and speed-of-processing difficulties associated with attention deficit disorder would make it impossible to determine the extent to which the attention and processing-speed difficulties actually exist.

Third, it is important to note that not all individuals within a general class of examinees, such as those with diverse linguistic or cultural backgrounds or with disabilities, may require special provisions when taking tests. The language skills, cultural knowledge, or specific disabilities that these individuals possess, for example, might not influence their performance on a particular type of test. Hence, for these individuals, no changes are needed.

The effectiveness of a given accommodation also plays a role in determinations of appropriate use. If a given accommodation or modification does not increase access to the construct as measured, there is little point in using it. Evidence of effectiveness may be gathered through quantitative or qualitative studies. Professional judgment necessarily plays a substantial role in decisions about changes to the test or testing situation.

In summary, fairness is a fundamental issue for valid test score interpretation, and it should therefore be the goal for all testing applications. Fairness is the responsibility of all parties involved in test development, administration, and score interpretation for the intended purposes of the test.
STANDARDS FOR FAIRNESS

The standards in this chapter begin with an overarching standard (numbered 3.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into four thematic clusters labeled as follows:

1. Test Design, Development, Administration, and Scoring Procedures That Minimize Barriers to Valid Score Interpretations for the Widest Possible Range of Individuals and Relevant Subgroups
2. Validity of Test Score Interpretations for Intended Uses for the Intended Examinee Population
3. Accommodations to Remove Construct-Irrelevant Barriers and Support Valid Interpretations of Scores for Their Intended Uses
4. Safeguards Against Inappropriate Score Interpretations for Intended Uses

Standard 3.0

All steps in the testing process, including test design, validation, development, administration, and scoring procedures, should be designed in such a manner as to minimize construct-irrelevant variance and to promote valid score interpretations for the intended uses for all examinees in the intended population.

Comment: The central idea of fairness in testing is to identify and remove construct-irrelevant barriers to maximal performance for any examinee. Removing these barriers allows for the comparable and valid interpretation of test scores for all examinees. Fairness is thus central to the validity and comparability of the interpretation of test scores for intended uses.

Cluster 1. Test Design, Development, Administration, and Scoring Procedures That Minimize Barriers to Valid Score Interpretations for the Widest Possible Range of Individuals and Relevant Subgroups

Standard 3.1

Those responsible for test development, revision, and administration should design all steps of the testing process to promote valid score interpretations for intended score uses for the widest possible range of individuals and relevant subgroups in the intended population.

Comment: Test developers must clearly delineate both the constructs that are to be measured by the test and the characteristics of the individuals and subgroups in the intended population of test takers. Test tasks and items should be designed to maximize access and be free of construct-irrelevant barriers as far as possible for all individuals and relevant subgroups in the intended test-taker population. One way to accomplish these goals is to create the test using principles of universal design, which take account of the characteristics of all individuals for whom the test is intended and include such elements as precisely defining constructs and avoiding, where possible, characteristics and formats of items and tests (for example, test speededness) that may compromise valid score interpretations for individuals or relevant subgroups. Another principle of universal design is to provide simple, clear, and intuitive testing procedures and instructions. Ultimately, the goal is to design a testing process that will, to the extent practicable, remove potential barriers to the measurement of the intended construct for all individuals, including those individuals requiring accommodations. Test developers need to be knowledgeable about group differences that may interfere
with the precision of scores and the validity of test score inferences, and they need to be able to take steps to reduce bias.

Standard 3.2

Test developers are responsible for developing tests that measure the intended construct and for minimizing the potential for tests' being affected by construct-irrelevant characteristics, such as linguistic, communicative, cognitive, cultural, physical, or other characteristics.

Comment: Unnecessary linguistic, communicative, cognitive, cultural, physical, and/or other characteristics in test item stimulus and/or response requirements can impede some individuals in demonstrating their standing on intended constructs. Test developers should use language in tests that is consistent with the purposes of the tests and that is familiar to as wide a range of test takers as possible. Avoiding the use of language that has different meanings or different connotations for relevant subgroups of test takers will help ensure that test takers who have the skills being assessed are able to understand what is being asked of them and respond appropriately. The level of language proficiency, physical response, or other demands required by the test should be kept to the minimum required to meet work and credentialing requirements and/or to represent the target construct(s). In work situations, the modality in which language proficiency is assessed should be comparable to that required on the job, for example, oral and/or written, comprehension and/or production. Similarly, the physical and verbal demands of response requirements should be consistent with the intended construct.

Standard 3.3

Those responsible for test development should include relevant subgroups in validity, reliability/precision, and other preliminary studies used when constructing the test.

Comment: Test developers should include individuals from relevant subgroups of the intended testing population in pilot or field test samples used to evaluate item and test appropriateness for construct interpretations. The analyses that are carried out using pilot and field testing data should seek to detect aspects of test design, content, and format that might distort test score interpretations for the intended uses of the test scores for particular groups and individuals. Such analyses could employ a range of methodologies, including those appropriate for small sample sizes, such as expert judgment, focus groups, and cognitive labs. Both qualitative and quantitative sources of evidence are important in evaluating whether items are psychometrically sound and appropriate for all relevant subgroups.

If sample sizes permit, it is often valuable to carry out separate analyses for relevant subgroups of the population. When it is not possible to include sufficient numbers in pilot and/or field test samples in order to do separate analyses, operational test results may be accumulated and used to conduct such analyses when sample sizes become large enough to support the analyses.

If pilot or field test results indicate that items or tests function differentially for individuals from, for example, relevant age, cultural, disability, gender, linguistic, and/or racial/ethnic groups in the population of test takers, test developers should investigate aspects of test design, content, and format (including response formats) that might contribute to the differential performance of members of these groups and, if warranted, eliminate these aspects from future test development practices.

Expert and sensitivity reviews can serve to guard against construct-irrelevant language and images, including those that may offend some individuals or subgroups, and against construct-irrelevant context that may be more familiar to some than others. Test publishers often conduct sensitivity reviews of all test material to detect and remove sensitive material from tests (e.g., text, graphics, and other visual representations within the test that could be seen as offensive to some groups and possibly affect the scores of individuals from these groups). Such reviews should be conducted before a test becomes operational.
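
The comment above calls for investigating items that function differentially across subgroups. As one hypothetical illustration (not a procedure prescribed by the Standards), the sketch below computes the Mantel-Haenszel statistic, a widely used screen for differential item functioning, on invented counts; the strata, group labels, and flagging threshold are assumptions made for the example.

```python
import math

# Hypothetical Mantel-Haenszel screen for differential item functioning.
# Examinees are first stratified on total test score; within each stratum
# we tally correct/incorrect counts on one item, separately for the
# reference and focal groups. All counts below are invented.

# Each stratum: (ref_correct, ref_incorrect, focal_correct, focal_incorrect)
strata = [
    (30, 70, 20, 80),   # low total scores
    (60, 40, 45, 55),   # middle total scores
    (85, 15, 75, 25),   # high total scores
]

num = den = 0.0
for a, b, c, d in strata:
    n = a + b + c + d
    num += a * d / n    # reference correct, focal incorrect
    den += b * c / n    # reference incorrect, focal correct

alpha_mh = num / den                   # common odds ratio; 1.0 = no DIF signal
delta_mh = -2.35 * math.log(alpha_mh)  # ETS delta scale; |delta| > 1.5 often flags review

print(f"MH odds ratio: {alpha_mh:.2f}  MH D-DIF: {delta_mh:.2f}")
```

In operational work a statistic like this is only a screen: as the comment notes, flagged items still require substantive review of design, content, and format before any decision is made.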

Standard 3.4

Test takers should receive comparable treatment during the test administration and scoring process.

Comment: Those responsible for testing should adhere to standardized test administration, scoring, and security protocols so that test scores will reflect the construct(s) being assessed and will not be unduly influenced by idiosyncrasies in the testing process. Those responsible for test administration should mitigate the possibility of personal predispositions that might affect the test administration or interpretation of scores.

Computerized and other forms of technology-based testing add extra concerns for standardization in administration and scoring. Examinees must have access to technology so that aspects of the technology itself do not influence scores. Examinees working on older, slower equipment may be unfairly disadvantaged relative to those working on newer equipment. If computers or other devices differ in speed of processing or movement from one screen to the next, in the fidelity of the visuals, or in other important ways, it is possible that construct-irrelevant factors may influence test performance.

Issues related to test security and fidelity of administration can also threaten the comparability of treatment of individuals and the validity and fairness of test score interpretations. For example, unauthorized distribution of items to some examinees but not others, or unproctored test administrations where standardization cannot be ensured, could provide an advantage to some test takers over others. In these situations, test results should be interpreted with caution.

Standard 3.5

Test developers should specify and document provisions that have been made to test administration and scoring procedures to remove construct-irrelevant barriers for all relevant subgroups in the test-taker population.

Comment: Test developers should specify how construct-irrelevant barriers were minimized in the test development process for individuals from all relevant subgroups in the intended test population. Test developers and/or users should also document any studies carried out to examine the reliability/precision of scores and validity of score interpretations for relevant subgroups of the intended population of test takers for the intended uses of the test scores. Special test administration, scoring, and reporting procedures should be documented and made available to test users.

Cluster 2. Validity of Test Score Interpretations for Intended Uses for the Intended Examinee Population

Standard 3.6

Where credible evidence indicates that test scores may differ in meaning for relevant subgroups in the intended examinee population, test developers and/or users are responsible for examining the evidence for validity of score interpretations for intended uses for individuals from those subgroups. What constitutes a significant difference in subgroup scores and what actions are taken in response to such differences may be defined by applicable laws.

Comment: Subgroup mean differences do not in and of themselves indicate lack of fairness, but such differences should trigger follow-up studies, where feasible, to identify the potential causes of such differences. Depending on whether subgroup differences are discovered during the development or use phase, either the test developer or the test user is responsible for initiating follow-up inquiries and, as appropriate, relevant studies. The inquiry should investigate construct underrepresentation and sources of construct-irrelevant variance as potential causes of subgroup differences, investigated as feasible, through quantitative and/or qualitative studies. The kinds of validity evidence considered may include analysis of test content, internal structure of test responses, the relationship of test scores to other variables, or the response processes employed by the individual examinees. When
sample sizes are sufficient, studies of score precision and accuracy for relevant subgroups also should be conducted. When sample sizes are small, data may sometimes be accumulated over operational administrations of the test so that suitable quantitative analyses by subgroup can be performed after the test has been in use for a period of time. Qualitative studies also are relevant to the supporting validity arguments (e.g., expert reviews, focus groups, cognitive labs). Test developers should closely consider findings from quantitative and/or qualitative analyses in documenting the interpretations for the intended score uses, as well as in subsequent test revisions.

Analyses, where possible, may need to take into account the level of heterogeneity within relevant subgroups, for example, individuals with different disabilities, or linguistic minority examinees at different levels of English proficiency. Differences within these subgroups may influence the appropriateness of test content, the internal structure of the test responses, the relation of test scores to other variables, or the response processes employed by individual examinees.

Standard 3.7

When criterion-related validity evidence is used as a basis for test score-based predictions of future performance and sample sizes are sufficient, test developers and/or users are responsible for evaluating the possibility of differential prediction for relevant subgroups for which there is prior evidence or theory suggesting differential prediction.

Comment: When sample sizes are sufficient, differential prediction is often examined using regression analysis. One approach to regression analysis examines slope and intercept differences between targeted groups (e.g., Black and White samples), while another examines systematic deviations from a common regression line for the groups of interest. Both approaches can account for the possibility of predictive bias and/or differences in heterogeneity between groups and provide valuable information for the examination of differential predictions. In contrast, correlation coefficients provide inadequate evidence for or against a differential prediction hypothesis if groups or treatments are found to have unequal means and variances on the test and the criterion. It is particularly important in the context of testing for high-stakes purposes that test developers and/or users examine differential prediction and avoid the use of correlation coefficients in situations where groups or treatments result in unequal means or variances on the test and criterion.

Standard 3.8

When tests require the scoring of constructed responses, test developers and/or users should collect and report evidence of the validity of score interpretations for relevant subgroups in the intended population of test takers for the intended uses of the test scores.

Comment: Subgroup differences in examinee responses and/or the expectations and perceptions of scorers can introduce construct-irrelevant variance in scores from constructed response tests. These, in turn, could seriously affect the reliability/precision, validity, and comparability of score interpretations for intended uses for some individuals. Different methods of scoring could differentially influence the construct representation of scores for individuals from some subgroups.

For human scoring, scoring procedures should be designed with the intent that the scores reflect the examinee's standing relative to the tested construct(s) and are not influenced by the perceptions and personal predispositions of the scorers. It is essential that adequate training and calibration of scorers be carried out and monitored throughout the scoring process to support the consistency of scorers' ratings for individuals from relevant subgroups. Where sample sizes permit, the precision and accuracy of scores for relevant subgroups also should be calculated.

Automated scoring algorithms may be used to score complex constructed responses, such as essays, either as the sole determiner of the score or in conjunction with a score provided by a human scorer. Scoring algorithms need to be reviewed for potential sources of bias. The precision of scores and validity of score interpretations resulting from automated scoring should be evaluated for all relevant subgroups of the intended population.
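
Where the comment asks that the precision of constructed-response scores be calculated for relevant subgroups, one hypothetical starting point is to compute inter-rater agreement separately within each subgroup. The sketch below implements quadratic weighted kappa for two raters on an invented 0-4 rubric; the subgroup labels, ratings, and choice of statistic are assumptions for illustration only.

```python
# Hypothetical subgroup check of scorer consistency: quadratic weighted
# kappa between two raters, computed separately for each subgroup.

def weighted_kappa(r1, r2, categories):
    """Quadratic weighted kappa for two raters over ordered categories."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        obs[idx[a]][idx[b]] += 1
    n = len(r1)
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    weight = lambda i, j: (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
    observed = sum(weight(i, j) * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(weight(i, j) * row[i] * col[j] / n for i in range(k) for j in range(k))
    return 1 - observed / expected

ratings = {  # subgroup -> (rater 1 scores, rater 2 scores); data are invented
    "group_a": ([3, 2, 4, 1, 3, 2, 4, 0], [3, 2, 3, 1, 4, 2, 4, 1]),
    "group_b": ([2, 4, 1, 3, 0, 2, 3, 4], [1, 4, 2, 2, 0, 3, 3, 4]),
}

for subgroup, (r1, r2) in ratings.items():
    print(subgroup, round(weighted_kappa(r1, r2, [0, 1, 2, 3, 4]), 2))
```

Markedly lower agreement in one subgroup would be the kind of signal that, under this standard, should prompt review of rubrics, scorer training, or the automated scoring model.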

Cluster 3. Accommodations to Remove Construct-Irrelevant Barriers and Support Valid Interpretations of Scores for Their Intended Uses

Standard 3.9

Test developers and/or test users are responsible for developing and providing test accommodations, when appropriate and feasible, to remove construct-irrelevant barriers that otherwise would interfere with examinees' ability to demonstrate their standing on the target constructs.

Comment: Test accommodations are designed to remove construct-irrelevant barriers related to individual characteristics that otherwise would interfere with the measurement of the target construct and therefore would unfairly disadvantage individuals with these characteristics. These accommodations include changes in administration setting, presentation, interface/engagement, and response requirements, and may include the addition of individuals to the administration process (e.g., readers, scribes).

An appropriate accommodation is one that responds to specific individual characteristics but does so in a way that does not change the construct the test is measuring or the meaning of scores. Test developers and/or test users should document the basis for the conclusion that the accommodation does not change the construct that the test is measuring. Accommodations must address individual test takers' specific needs (e.g., cognitive, linguistic, sensory, physical) and may be required by law. For example, individuals who are not fully proficient in English may need linguistic accommodations that address their language status, while visually impaired individuals may need text magnification. In many cases when a test is used to evaluate the academic progress of an individual, the accommodation that will best eliminate construct irrelevance will match the accommodation used for instruction.

Test modifications that change the construct that the test is measuring may be needed for some examinees to demonstrate their standing on some aspect of the intended construct. If an assessment is modified to improve access to the intended construct for designated individuals, the modified assessment should be treated like a newly developed assessment that needs to adhere to the test standards for validity, reliability/precision, fairness, and so forth.

Standard 3.10

When test accommodations are permitted, test developers and/or test users are responsible for documenting standard provisions for using the accommodation and for monitoring the appropriate implementation of the accommodation.

Comment: Test accommodations should be used only when the test taker has a documented need for the accommodation, for example, an Individualized Education Plan (IEP) or documentation by a physician, psychologist, or other qualified professional. The documentation should be prepared in advance of the test-taking experience and reviewed by one or more experts qualified to make a decision about the relevance of the documentation to the requested accommodation.

Test developers and/or users should provide individuals requiring accommodations in a testing situation with information about the availability of accommodations and the procedures for requesting them prior to the test administration. In settings where accommodations are routinely provided for individuals with documented needs (e.g., educational settings), the documentation should describe permissible accommodations and include standardized protocols and/or procedures for identifying examinees eligible for accommodations, identifying and assigning appropriate accommodations for these individuals, and administering accommodations, scoring, and reporting in accordance with standardized rules.

Test administrators and users should also provide those who have a role in determining and administering accommodations with sufficient information and expertise to appropriately use accommodations that may be applied to the assessment. Instructions for administering any changes in the test or testing procedures should be clearly documented and, when necessary, test administrators should be trained to follow these procedures. The test administrator should administer the accommodations in a standardized manner as documented by the test developer. Administration procedures should include procedures for recording which accommodations were used for specific individuals and, where relevant, for recording any deviation from standardized procedures for administering the accommodations.

The test administrator or appropriate representative of the test user should document any use of accommodations. For large-scale education assessments, test users also should monitor the appropriate use of accommodations.

Standard 3.11

When a test is changed to remove barriers to the accessibility of the construct being measured, test developers and/or users are responsible for obtaining and documenting evidence of the validity of score interpretations for intended uses of the changed test, when sample sizes permit.

Comment: It is desirable, where feasible and appropriate, to pilot and/or field test any test alterations with individuals representing each relevant subgroup for whom the alteration is intended. Validity studies typically should investigate both the efficacy of the alteration for intended subgroup(s) and the comparability of score inferences from the altered and original tests.

In some circumstances, developers may not be able to obtain sufficient samples of individuals, for example, those with the same disability or similar levels of a disability, to conduct standard empirical analyses of reliability/precision and validity. In these situations, alternative ways should be sought to evaluate the validity of the changed test for relevant subgroups, for example through small-sample qualitative studies or professional judgments that examine the comparability of the original and altered tests and/or that investigate alternative explanations for performance on the changed tests.

Evidence should be provided for recommended alterations. If a test developer recommends different time limits, for example, for individuals with disabilities or those from diverse linguistic and cultural backgrounds, pilot or field testing should be used, whenever possible, to establish these particular time limits rather than simply allowing test takers a multiple of the standard time without examining the utility of the arbitrary implementation of multiples of the standard time. When possible, fatigue and other time-related issues should be investigated as potentially important factors when time limits are extended.

When tests are linguistically simplified to remove construct-irrelevant variance, test developers and/or users are responsible for documenting evidence of the comparability of scores from the linguistically simplified tests to the original test, when sample sizes permit.

Standard 3.12

When a test is translated and adapted from one language to another, test developers and/or test users are responsible for describing the methods used in establishing the adequacy of the adaptation and documenting empirical or logical evidence for the validity of test score interpretations for intended use.

Comment: The term adaptation is used here to describe changes made to tests translated from one language to another to reduce construct-irrelevant variance that may arise due to individual or subgroup characteristics. In this case the translation/adaptation process involves not only translating the language of the test so that it is suitable for the subgroup taking the test, but also addressing any construct-irrelevant linguistic and cultural subgroup characteristics that may interfere with
measurement of the intended construct(s). When multiple language versions of a test are intended to provide comparable scores, test developers should describe in detail the methods used for test translation and adaptation and should report evidence of test score validity pertinent to the linguistic and cultural groups for whom the test is intended and pertinent to the scores' intended uses. Evidence of validity may include empirical studies and/or professional judgment documenting that the different language versions measure comparable or similar constructs and that the score interpretations from the two versions have comparable validity for their intended uses. For example, if a test is translated and adapted into Spanish for use with Central American, Cuban, Mexican, Puerto Rican, South American, and Spanish populations, the validity of test score interpretations for specific uses should be evaluated with members of each of these groups separately, where feasible. Where sample sizes permit, evidence of score accuracy and precision should be provided for each group, and test properties for each subgroup should be included in test manuals.

Standard 3.13

A test should be administered in the language that is most relevant and appropriate to the test purpose.

Comment: Test users should take into account the linguistic and cultural characteristics and relative language proficiencies of examinees who are bilingual or use multiple languages. Identifying the most appropriate language(s) for testing also requires close consideration of the context and purpose for testing. Except in cases where the purpose of testing is to determine test takers' level of proficiency in a particular language, the test takers should be tested in the language in which they are most proficient. In some cases, test takers' most proficient language in general may not be the language in which they were instructed or trained in relation to tested constructs, and in these cases it may be more appropriate to administer the test in the language of instruction.

Professional judgment needs to be used to determine the most appropriate procedures for establishing relative language proficiencies. Such procedures may range from self-identification by examinees to formal language proficiency testing. Sensitivity to linguistic and cultural characteristics may require the sole use of one language in testing or use of multiple languages to minimize the introduction of construct-irrelevant components into the measurement process.

Determination of a test taker's most proficient language for test administration does not automatically guarantee validity of score inferences for the intended use. For example, individuals may be more proficient in one language than another, but not necessarily developmentally proficient in either; disconnects between the language of construct acquisition and that of assessment also can compromise appropriate interpretation of the test taker's scores.

Standard 3.14

When testing requires the use of an interpreter, the interpreter should follow standardized procedures and, to the extent feasible, be sufficiently fluent in the language and content of the test and the examinee's native language and culture to translate the test and related testing materials and to explain the examinee's test responses, as necessary.

Comment: Although individuals with limited proficiency in the language of the test (including deaf and hard-of-hearing individuals whose native language may be sign language) should ideally be tested by professionally trained bilingual/bicultural examiners, the use of an interpreter may be necessary in some situations. If an interpreter is required, the test user is responsible for selecting an interpreter with reasonable qualifications, experience, and preparation to assist appropriately in the administration of the test. As with other aspects of standardized testing, procedures for administering a test when an interpreter is used should be standardized and documented. It is necessary for the interpreter to understand the
importance of following standardized procedures for this test, the importance of accurately conveying to the examiner an examinee's actual responses, and the role and responsibilities of the interpreter in testing. When the translation of technical terms is important to accurately assess the construct, the interpreter should be familiar with the meaning of these terms and corresponding vocabularies in the respective languages.

Unless a test has been standardized and normed with the use of interpreters, their use may need to be viewed as an alteration that could change the measurement of the intended construct, in particular because of the introduction of a third party during testing, as well as the modification of the standardized protocol. Differences in word meaning, familiarity, frequency, connotations, and associations make it difficult to directly compare scores from any nonstandardized translations to English-language norms.

When a test is likely to require the use of interpreters, the test developer should provide clear guidance on how interpreters should be selected and their role in administration.

Cluster 4. Safeguards Against Inappropriate Score Interpretations for Intended Uses

Standard 3.15

Test developers and publishers who claim that a test can be used with examinees from specific subgroups are responsible for providing the necessary information to support appropriate test score interpretations for their intended uses for individuals from these subgroups.

Comment: Test developers should include in test manuals and instructions for score interpretation explicit statements about the applicability of the test for relevant subgroups. Test developers should provide evidence of the applicability of the test for relevant subgroups and make explicit cautions against foreseeable (based on prior experience or other relevant sources such as research literature) misuses of test results.

Standard 3.16

When credible research indicates that test scores for some relevant subgroups are differentially affected by construct-irrelevant characteristics of the test or of the examinees, when legally permissible, test users should use the test only for those subgroups for which there is sufficient evidence of validity to support score interpretations for the intended uses.

Comment: A test may not measure the same construct(s) for individuals from different relevant subgroups because different characteristics of test content or format influence scores of test takers from one subgroup to another. Any such differences may inadvertently advantage or disadvantage individuals from these subgroups. The decision whether to use a test with any given relevant subgroup necessarily involves a careful analysis of the validity evidence for the subgroup, as is called for in Standard 1.4. The decision also requires consideration of applicable legal requirements and the exercise of thoughtful professional judgment regarding the significance of any construct-irrelevant components. In cases where there is credible evidence of differential validity, developers should provide clear guidance to the test user about when and whether valid interpretations of scores for their intended uses can or cannot be drawn for individuals from these subgroups.

There may be occasions when examinees request or demand to take a version of the test other than that deemed most appropriate by the developer or user. For example, an individual with a disability may decline an altered format and request the standard form. Acceding to such requests, after fully informing the examinee about the characteristics of the test, the accommodations that are available, and how the test scores will be used, is not a violation of this standard and in some instances may be required by law.

In some cases, such as when a test will distribute benefits or burdens (such as qualifying for an honors class or denial of a promotion in a job), the law may limit the extent to which a test user
70
FAIRNESS IN TESTING

may evaluate some groups under the test and other groups under a different test.

Standard 3.17

When aggregate scores are publicly reported for relevant subgroups - for example, males and females, individuals of differing socioeconomic status, individuals differing by race/ethnicity, individuals with different sexual orientations, individuals with diverse linguistic and cultural backgrounds, individuals with disabilities, young children or older adults - test users are responsible for providing evidence of comparability and for including cautionary statements whenever credible research or theory indicates that test scores may not have comparable meaning across these subgroups.

Comment: Reporting scores for relevant subgroups is justified only if the scores have comparable meaning across these groups and there is sufficient sample size per group to protect individual identity and warrant aggregation. This standard is intended to be applicable to settings where scores are implicitly or explicitly presented as comparable in meaning across subgroups. Care should be taken that the terms used to describe reported subgroups are clearly defined, consistent with common usage, and clearly understood by those interpreting test scores.

Terminology for describing specific subgroups for which valid test score inferences can and cannot be drawn should be as precise as possible, and categories should be consistent with the intended uses of the results. For example, the terms Latino or Hispanic can be ambiguous if not specifically defined, in that they may denote individuals of Cuban, Mexican, Puerto Rican, South or Central American, or other Spanish-culture origin, regardless of race/ethnicity, and may combine those who are recent immigrants with those who are U.S. native born, those who may not be proficient in English, and those of diverse socioeconomic background. Similarly, the term "individuals with disabilities" encompasses a wide range of specific conditions and background characteristics. Even references to specific categories of individuals with disabilities, such as hearing impaired, should be accompanied by an explanation of the meaning of the term and an indication of the variability of individuals within the group.

Standard 3.18

In testing individuals for diagnostic and/or special program placement purposes, test users should not use test scores as the sole indicators to characterize an individual's functioning, competence, attitudes, and/or predispositions. Instead, multiple sources of information should be used, alternative explanations for test performance should be considered, and the professional judgment of someone familiar with the test should be brought to bear on the decision.

Comment: Many test manuals point out variables that should be considered in interpreting test scores, such as clinically relevant history, medications, school record, vocational status, and test-taker motivation. Influences associated with variables such as age, culture, disability, gender, and linguistic or racial/ethnic characteristics may also be relevant.

Opportunity to learn is another variable that may need to be taken into account in educational and/or clinical settings. For instance, if recent immigrants being tested on a personality inventory or an ability measure have little prior exposure to school, they may not have had the opportunity to learn concepts that the test assumes are common knowledge or common experience, even if the test is administered in the native language. Not taking into account prior opportunity to learn can lead to misdiagnoses, inappropriate placements and/or services, and unintended negative consequences.

Inferences about test takers' general language proficiency should be based on tests that measure a range of language features, not a single linguistic skill. A more complete range of communicative abilities (e.g., word knowledge, syntax, as well as cultural variation) will typically need to be assessed. Test users are responsible for interpreting individual scores in light of alternative explanations and/or relevant individual variables noted in the test manual.

Standard 3.19

In settings where the same authority is responsible for both provision of curriculum and high-stakes decisions based on testing of examinees' curriculum mastery, examinees should not suffer permanent negative consequences if evidence indicates that they have not had the opportunity to learn the test content.

Comment: In educational settings, students' opportunity to learn the content and skills assessed by an achievement test can seriously affect their test performance and the validity of test score interpretations for intended use for high-stakes individual decisions. If there is not a good match between the content of curriculum and instruction and that of tested constructs for some students, those students cannot be expected to do well on the test and can be unfairly disadvantaged by high-stakes individual decisions, such as denying high school graduation, that are made based on test results. When an authority, such as a state or district, is responsible for prescribing and/or delivering curriculum and instruction, it should not penalize individuals for test performance on content that the authority has not provided.

Note that this standard is not applicable in situations where different authorities are responsible for curriculum, testing, and/or interpretation and use of results. For example, opportunity to learn may be beyond the knowledge or control of test users, and it may not influence the validity of test interpretations such as predictions of future performance.

Standard 3.20

When a construct can be measured in different ways that are equal in their degree of construct representation and validity (including freedom from construct-irrelevant variance), test users should consider, among other factors, evidence of subgroup differences in mean scores or in percentages of examinees whose scores exceed the cut scores, in deciding which test and/or cut scores to use.

Comment: Evidence of differential subgroup performance is one important factor influencing the choice between one test and another. However, other factors, such as cost, testing time, test security, and logistical issues (e.g., the need to screen very large numbers of examinees in a very short time), must also enter into professional judgments about test selection and use. If the scores from two tests lead to equally valid interpretations and impose similar costs or other burdens, legal considerations may require selecting the test that minimizes subgroup differences.
PART II

Operations
4. TEST DESIGN AND DEVELOPMENT

BACKGROUND
Test development is the process of producing a measure of some aspect of an individual's knowledge, skills, abilities, interests, attitudes, or other characteristics by developing questions or tasks and combining them to form a test, according to a specified plan. The steps and considerations for this process are articulated in the test design plan. Test design begins with consideration of expected interpretations for intended uses of the scores to be generated by the test. The content and format of the test are then specified to provide evidence to support the interpretations for intended uses. Test design also includes specification of test administration and scoring procedures, and of how scores are to be reported. Questions or tasks (hereafter referred to as items) are developed following the test specifications and screened using criteria appropriate to the intended uses of the test. Procedures for scoring individual items and the test as a whole are also developed, reviewed, and revised as needed. Test design is commonly iterative, with adjustments and revisions made in response to data from tryouts and operational use.

Test design and development procedures must support the validity of the interpretations of test scores for their intended uses. For example, current educational assessments often are used to indicate students' proficiency with regard to standards for the knowledge and skill a student should exhibit; thus, the relationship between the test content and the established content standards is key. In this case, content specifications must clearly describe the content and/or cognitive categories to be covered so that evidence of the alignment of the test questions to these categories can be gathered. When normative interpretations are intended, development procedures should include a precise definition of the reference population and plans to collect appropriate normative data. Many tests, such as employment or college selection tests, rely on predictive validity evidence. Specifications for such tests should include descriptions of the outcomes the test is designed to predict and plans to collect evidence of the effectiveness of test scores in predicting these outcomes.

Issues bearing on validity, reliability, and fairness are interwoven within the stages of test development. Each of these topics is addressed comprehensively in other chapters of the Standards: validity in chapter 1, reliability in chapter 2, and fairness in chapter 3. Additional material on test administration and scoring, and on reporting and interpretation of scores and results, is provided in chapter 6. Chapter 5 discusses score scales, and chapter 7 covers documentation requirements.

In addition, test developers should respect the rights of participants in the development process, including pretest participants. In particular, developers should take steps to ensure proper notice and consent from participants and to protect participants' personally identifiable information consistent with applicable legal and professional requirements. The rights of test takers are discussed in chapter 8.

This chapter describes four phases of the test development process leading from the original statement of purpose(s) to the final product: (a) development and evaluation of the test specifications; (b) development, tryout, and evaluation of the items; (c) assembly and evaluation of new test forms; and (d) development of procedures and materials for administration and scoring. What follows is a description of typical test development procedures, although there may be sound reasons that some of the steps covered in the description are followed in some settings and not in others.

Test Specifications

General Considerations

In nearly all cases, test development is guided by a set of test specifications. The nature of these
specifications and the way in which they are created may vary widely as a function of the nature of the test and its intended uses. The term test specifications is sometimes limited to description of the content and format of the test. In the Standards, test specifications are defined more broadly to also include documentation of the purpose and intended uses of the test, as well as detailed decisions about content, format, test length, psychometric characteristics of the items and test, delivery mode, administration, scoring, and score reporting.

Responsibility for developing test specifications also varies widely across testing programs. For most commercial tests, test specifications are created by the test developer. In other contexts, such as tests used for educational accountability, many aspects of the test specifications are established through a public policy process. As discussed in the introduction, the generic term test developer is used in this chapter in preference to other terms, such as test publisher, to cover both those responsible for developing and those responsible for implementing test specifications across a wide range of test development processes.

Statement of Purpose and Intended Uses

The process of developing educational and psychological tests should begin with a statement of the purpose(s) of the test, the intended users and uses, the construct or content domain to be measured, and the intended examinee population. Tests of the same construct or domain can differ in important ways because factors such as purpose, intended uses, and examinee population may vary. In addition, tests intended for diverse examinee populations must be developed to minimize construct-irrelevant factors that may unfairly depress or inflate some examinees' performance. In many cases, accommodations and/or alternative versions of tests may need to be specified to remove irrelevant barriers to performance for particular subgroups in the intended examinee population.

Specification of intended uses will include an indication of whether the test score interpretations will be primarily norm-referenced or criterion-referenced. When scores are norm-referenced, relative score interpretations are of primary interest. A score for an individual or for a definable group is ranked within a distribution of scores or compared with the average performance of test takers in a reference population (e.g., based on age, grade, diagnostic category, or job classification). When interpretations are criterion-referenced, absolute score interpretations are of primary interest. The meaning of such scores does not depend on rank information. Rather, the test score conveys directly a level of competence in some defined criterion domain. Both relative and absolute interpretations are often used with a given test, but the test developer determines which approach is most relevant to specific uses of the test.
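For illustration only (this sketch is not part of the Standards), the contrast between relative and absolute interpretations can be made concrete in a few lines of code; the norm sample and cut score below are hypothetical.

```python
from bisect import bisect_left

def percentile_rank(score: float, norm_scores: list[float]) -> float:
    """Relative (norm-referenced) interpretation: rank a score
    within a reference distribution of scores."""
    ranked = sorted(norm_scores)
    below = bisect_left(ranked, score)
    return 100.0 * below / len(ranked)

def mastery_level(score: float, cut_score: float) -> str:
    """Absolute (criterion-referenced) interpretation: compare a
    score with a fixed criterion, independent of rank."""
    return "meets criterion" if score >= cut_score else "below criterion"

norms = [12, 15, 18, 20, 21, 23, 25, 27, 28, 30]  # hypothetical norm sample
print(percentile_rank(24, norms))        # 60.0 -- meaning depends on the group
print(mastery_level(24, cut_score=22))   # "meets criterion" -- meaning is absolute
```

The same raw score thus supports two distinct statements: where the examinee stands in a reference group, and whether the examinee meets a fixed performance criterion.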
Content Specifications

The first step in developing test specifications is to extend the original statement of purpose(s), and the construct or content domain being considered, into a framework for the test that describes the extent of the domain, or the scope of the construct to be measured. Content specifications, sometimes referred to as content frameworks, delineate the aspects (e.g., content, skills, processes, and diagnostic features) of the construct or domain to be measured. The specifications should address questions about what is to be included, such as "Does eighth-grade mathematics include algebra?" "Does verbal ability include text comprehension as well as vocabulary?" "Does self-esteem include both feelings and acts?" The delineation of the content specifications can be guided by theory or by an analysis of the content domain (e.g., an analysis of job requirements in the case of many credentialing and employment tests). The content specifications serve as a guide to subsequent test evaluation. The chapter on validity provides a more thorough discussion of the relationships among the construct or content domain, the test framework, and the purpose(s) of the test.

Format Specifications

Once decisions have been made about what the test is to measure and what meaning its scores are intended to convey, the next step is to create format specifications. Format specifications delineate
the format of items (i.e., tasks or questions); the response format or conditions for responding; and the type of scoring procedures. Although format decisions are often driven by considerations of expediency, such as ease of responding or cost of scoring, validity considerations must not be overlooked. For example, if test questions require test takers to possess significant linguistic skill to interpret them but the test is not intended as a measure of linguistic skill, the complexity of the questions may lead to construct-irrelevant variance in test scores. This would be unfair to test takers with limited linguistic skills, thereby reducing the validity of the test scores as a measure of the intended content. Format specifications should include a rationale for how the chosen format supports the validity, reliability, and fairness of intended uses of the resulting scores.

The nature of the item and response formats that may be specified depends on the purposes of the test, the defined domain of the test, and the testing platform. Selected-response formats, such as true-false or multiple-choice items, are suitable for many purposes of testing. Computer-based testing allows different ways of indicating responses, such as drag-and-drop. Other purposes may be more effectively served by a short-answer format. Short-answer items require a response of no more than a few words. Extended-response formats require the test taker to write a more extensive response of one or more sentences or paragraphs. Performance assessments often seek to emulate the context or conditions in which the intended knowledge or skills are actually applied. One type of performance assessment, for example, is the standardized job or work sample, where a task is presented to the test taker in a standardized format under standardized conditions. Job or work samples might include the assessment of a medical practitioner's ability to make an accurate diagnosis and recommend treatment for a defined condition, a manager's ability to articulate goals for an organization, or a student's proficiency in performing a science laboratory experiment.

Accessibility of item formats. As described in chapter 3, designing tests to be accessible and valid for all intended examinees, to the maximum extent possible, is critical. Formats that may be unfamiliar to some groups of test takers or that place inappropriate demands should be avoided. The principles of universal design describe the use of test formats that allow tests to be taken without adaptation by as broad a range of individuals as possible, but they do not necessarily eliminate the need for adaptations. Format specifications should include consideration of alternative formats that might also be needed to remove irrelevant barriers to performance, such as large print or braille for examinees who are visually impaired or, where appropriate to the construct being measured, bilingual dictionaries for test takers who are more proficient in a language other than the language of the test. The number and types of adaptations to be specified depend on both the nature of the construct being assessed and the targeted population of test takers.

Complex item formats. Some testing programs employ more complex item formats. Examples include performance assessments, simulations, and portfolios. Specifications for more complex item formats should describe the domain from which the items or tasks are sampled, components of the domain to be assessed by the tasks or items, and critical features of the items that should be replicated in creating items for alternate forms. Special considerations for complex item formats are illustrated through the following discussion of performance assessments, simulations, and portfolios.

Performance assessments. Performance assessments require examinees to demonstrate the ability to perform tasks that are often complex in nature and generally require the test takers to demonstrate their abilities or skills in settings that closely resemble real-life situations. One distinction between performance assessments and other forms of tests is the type of response that is required from the test takers. Performance assessments require the test takers to carry out a process, such as playing a musical instrument or tuning a car's engine, or to create a product, such as a written essay. An assessment of a clinical psychologist in training may require the test taker to interview a
client, choose appropriate tests, arrive at a diagnosis, and plan for therapy.

Because performance assessments typically consist of a small number of tasks, establishing the extent to which the results can be generalized to a broader domain described in the test specifications is particularly important. The test specifications should indicate critical dimensions to be measured (e.g., skills and knowledge, cognitive processes, context for performing the tasks) so that tasks selected for testing will systematically represent the critical dimensions, leading to comprehensive coverage of the domain as well as consistent coverage across test forms. Specification of the domain to be covered is also important for clarifying potentially irrelevant sources of variation in performance. Further, both theoretical and empirical evidence are important for documenting the extent to which performance assessment tasks, as well as scoring criteria, reflect the processes or skills that are specified by the domain definition. When tasks are designed to elicit complex cognitive processes, detailed analyses of the tasks and scoring criteria, and both theoretical and empirical analyses of the test takers' performances on the tasks, provide necessary validity evidence.

Simulations. Simulation assessments are similar to performance assessments in that they require the examinee to engage in a complex set of behaviors for a specified period of time. Simulations are sometimes a substitute for performance assessments, when actual task performance might be costly or dangerous. Specifications for simulation tasks should describe the domain of activities to be covered by the tasks, critical dimensions of performance to be reflected in each task, and specific format considerations such as the number or duration of the tasks and essentials of how the user interacts with the tasks. Specifications should be sufficient to allow experts to judge the comparability of different sets of simulation tasks included in alternate forms.

Portfolios. Portfolios are systematic collections of work or educational products, typically gathered over time. The design of a portfolio assessment, like that of other assessment procedures, must flow from the purpose of the assessment. Typical purposes include judgment of improvement in job or educational performance and evaluation of eligibility for employment, promotion, or graduation. Portfolio specifications indicate the nature of the work that is to be included in the portfolio. The portfolio may include entries such as representative products, the best work of the test taker, or indicators of progress. For example, in an employment setting involving promotion decisions, employees may be instructed to include their best work or products. Alternatively, if the purpose is to judge students' educational growth, the students may be asked to provide evidence of improvement with respect to particular competencies or skills. Students may also be asked to provide justifications for their choices or a cover piece reflecting on the work presented and what the student has learned from it. Still other methods may call for the use of videos, exhibitions, or demonstrations.

The specifications for the portfolio indicate who is responsible for selecting its contents. For example, the specifications must state whether the test taker, the examiner, or both parties working together should be involved in the selection of the contents of the portfolio. The particular responsibilities of each party are delineated in the specifications. In employment settings, employees may be involved in the selection of their work and products that demonstrate their competencies for promotion purposes. Analogously, in educational applications, students may participate in the selection of some of their work and the products to be included in their portfolios.

Specifications for how portfolios are scored and by whom will vary as a function of the use of the portfolio scores. Centralized evaluation of portfolios is common where portfolios are used in high-stakes decisions. The more standardized the contents and procedures for collecting and scoring material, the more comparable the scores from the resulting portfolios will be. Regardless of the methods used, all performance assessments, simulations, and portfolios are evaluated by the same standards of technical quality as other forms of tests.
Test Length

Test developers frequently follow test blueprints that specify the number or percentage of items for each content area to be included in each test form and that may also include specification of the distribution of items by cognitive requirements or by item format. Specifications for test length must balance testing time requirements with the precision of the resulting scores, with longer tests generally leading to more precise scores. Test length and blueprint specifications are often updated based on data from tryouts on time requirements, content coverage, and score precision. When tests are administered adaptively, test length (the number of items administered to each examinee) is determined by stopping rules, which may be based on a fixed number of test questions or may be based on a desired level of score precision.
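The trade-off between test length and score precision is often projected with the Spearman-Brown prophecy formula. The sketch below is illustrative only; it assumes the added or removed items are parallel to the existing ones, and the reliability values are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Project the reliability of a test lengthened (or shortened)
    by the given factor, assuming the new items are parallel
    to the existing ones."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# A 20-item test with reliability .80, doubled to 40 parallel items:
print(round(spearman_brown(0.80, 2.0), 3))   # 0.889
# Halved to 10 items:
print(round(spearman_brown(0.80, 0.5), 3))   # 0.667
```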
Psychometric Specifications

Psychometric specifications indicate desired statistical properties of items (e.g., difficulty, discrimination, and inter-item correlations) as well as the desired statistical properties of the whole test, including the nature of the reporting scale, test difficulty and precision, and the distribution of items across content or cognitive categories. When psychometric indices of the items are estimated using item response theory (IRT), the fit of the model to the data is also evaluated. This is accomplished by evaluating the extent to which the assumptions underlying the item response model (e.g., unidimensionality and local independence) are satisfied.
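As an illustrative sketch (not prescribed by the Standards), classical indices of the kind named above, difficulty as the proportion correct and discrimination as the corrected item-total correlation, might be computed as follows for dichotomously scored items. The response matrix is hypothetical, and statistics.correlation requires Python 3.10 or later and items with nonzero variance.

```python
import statistics

def item_statistics(responses: list[list[int]]) -> list[dict]:
    """Classical item analysis for 0/1-scored items.
    responses[i][j] = score of examinee i on item j."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        difficulty = statistics.mean(item)  # proportion correct (p value)
        # Corrected item-total correlation: item vs. total excluding the item.
        rest = [t - x for t, x in zip(totals, item)]
        discrimination = statistics.correlation(item, rest)
        stats.append({"item": j, "difficulty": difficulty,
                      "discrimination": discrimination})
    return stats

data = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 1]]  # hypothetical
for s in item_statistics(data):
    print(s)
```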
Scoring Specifications

Test specifications will describe how individual test items are to be scored and how item scores are to be combined to yield one or more overall test scores. All types of items require some indication of how to score the responses. For selected-response items, one of the response options is considered the correct response in some testing programs. In other testing programs, each response option may yield a different item score. For short-answer items, a list of acceptable responses may suffice, although more general scoring instructions are sometimes required. Extended-response items require more detailed rules for scoring, sometimes called scoring rubrics. Scoring rubrics specify the criteria for evaluating performance and may vary in the degree of judgment entailed, the number of score levels employed, and the ways in which criteria for each score level are described. It is common practice for test developers to provide scorers with examples of performances at each of the score levels to help clarify the criteria.

For extended-response items, including performance tasks, simulations, and portfolios, two major types of scoring procedures are used: analytic and holistic. Both of the procedures require explicit performance criteria that reflect the test framework. However, the approaches lead to some differences in the scoring specifications. Under the analytic scoring procedure, each critical dimension of the performance criteria is judged independently, and separate scores are obtained for each of these dimensions in addition to an overall score. Under the holistic scoring procedure, the same performance criteria may implicitly be considered, but only one overall score is provided. Because the analytic procedure can provide information on a number of critical dimensions, it potentially provides valuable information for diagnostic purposes and lends itself to evaluating strengths and weaknesses of test takers. However, validation will be required for diagnostic interpretations for particular uses of the separate scores. In contrast, the holistic procedure may be preferable when an overall judgment is desired and when the skills being assessed are complex and highly interrelated. Regardless of the type of scoring procedure, designing the items and developing the scoring rubrics and procedures is an integrated process.

When scoring procedures require human judgment, the scoring specifications should describe essential scorer qualifications, how scorers are to be trained and monitored, how scoring discrepancies are to be identified and resolved, and how the absence of bias in scorer judgment is to be checked. In some cases, computer algorithms are used to
score complex examinee responses, such as essays. In such cases, scoring specifications should indicate how scores are generated by these algorithms and how they are to be checked and validated.

Scoring specifications will also include whether test scores are simple sums of item scores, involve differential weighting of items or sections, or are based on a more complex measurement model. If an IRT model is used, specifications should indicate the form of the model, how model parameters are to be estimated, and how model fit is to be evaluated.
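As an illustration of what "the form of the model" can mean, under the widely used two-parameter logistic (2PL) IRT model the item response function is as sketched below, and an ability estimate can be obtained by maximizing the response-pattern log-likelihood. This is an illustrative sketch, not a prescribed scoring method.

```python
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability of a correct response
    given ability theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta: float, items: list[tuple[float, float]],
                   responses: list[int]) -> float:
    """Log-likelihood of a 0/1 response pattern; maximizing this
    over theta yields a maximum-likelihood ability estimate."""
    ll = 0.0
    for (a, b), u in zip(items, responses):
        p = p_correct_2pl(theta, a, b)
        ll += u * math.log(p) + (1 - u) * math.log(1 - p)
    return ll
```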
Test Administration Specifications

Test administration specifications describe how the test is to be administered. Administration procedures include mode of test delivery (e.g., paper-and-pencil or computer based), time limits, accommodation procedures, instructions and materials provided to examiners and examinees, and procedures for monitoring test taking and ensuring test security. For tests administered by computer, administration specifications will also include a description of any hardware and software requirements, including connectivity considerations for Web-based testing.
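Administration specifications of this kind lend themselves to a structured, machine-readable record. The sketch below shows one hypothetical way to capture a few of the elements listed above; none of the field names or values are prescribed by the Standards.

```python
from dataclasses import dataclass, field

@dataclass
class AdministrationSpec:
    """Illustrative record of administration specifications
    (fields and defaults are hypothetical examples)."""
    delivery_mode: str                  # e.g., "computer" or "paper"
    time_limit_minutes: int
    accommodations: list[str] = field(default_factory=list)
    min_display_resolution: str = "1024x768"  # computer delivery only
    requires_connectivity: bool = False

spec = AdministrationSpec(
    delivery_mode="computer",
    time_limit_minutes=90,
    accommodations=["extended time", "screen reader"],
    requires_connectivity=True,
)
```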
Refining the Test Specifications

There is often a subtle interplay between the process of conceptualizing a construct or content domain and the development of a test of that construct or domain. The specifications for the test provide a description of how the construct or domain will be represented and may need to be refined as development proceeds. The procedures used to develop items and scoring rubrics and to examine item and test characteristics may often contribute to clarifying the specifications. The extent to which the construct is fully defined a priori is dependent on the testing application. In many testing applications, well-defined and detailed test specifications guide the development of items and their associated scoring rubrics and procedures. In some areas of psychological measurement, test development may be less dependent on an a priori defined framework and may rely more on a data-based approach that results in an empirically derived definition of the construct being measured. In such instances, items are selected primarily on the basis of their empirical relationship with an external criterion, their relationships with one another, or the degree to which they discriminate among groups of individuals. For example, items for a test for sales personnel might be selected based on the correlations of item scores with productivity measures of current sales personnel. Similarly, an inventory to help identify different patterns of psychopathology might be developed using patients from different diagnostic subgroups. When test development relies on a data-based approach, some items will likely be selected based on chance occurrences in the data. Cross-validation studies, which involve administering the test to a comparable sample that was not involved in the original test development effort, are routinely conducted to determine the tendency to select items by chance.

In other testing applications, however, the test specifications are fixed in advance and guide the development of items and scoring procedures. Empirical relationships may then be used to inform decisions about retaining, rejecting, or modifying items. Interpretations of scores from tests developed by this process have the advantage of a theoretical and an empirical foundation for the underlying dimensions represented by the test.

Considerations for Adaptive Testing

In adaptive testing, test items or sets of items are selected as the test is being administered based on the test taker's responses to prior items. Specification of item selection algorithms may involve consideration of content coverage as well as increasing the precision of the score estimate. When several items are tied to a single passage or task, more complex algorithms for selecting the next passage or task are needed. In some instances, a larger number of items are developed for each passage or task and the selection algorithm chooses specific items to administer based on content and precision considerations. Specifications must also indicate whether a fixed number of items are to be administered or whether the test is to continue until precision or content coverage criteria are met.
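A minimal sketch of the two specification choices just described, an item selection rule and a stopping rule, is shown below for a 2PL item pool. It is illustrative only; the pool structure, maximum length, and target standard error are hypothetical, and content-coverage constraints are omitted for brevity.

```python
import math

def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def next_item(theta: float, pool: dict[int, tuple[float, float]],
              administered: set[int]) -> int:
    """Select the unused item with maximum information at the
    current ability estimate (assumes the pool is not exhausted)."""
    candidates = (i for i in pool if i not in administered)
    return max(candidates, key=lambda i: item_information(theta, *pool[i]))

def stop(test_info: float, n_given: int,
         max_items: int = 30, target_se: float = 0.30) -> bool:
    """Stopping rule: fixed maximum length or target precision,
    using SE(theta) ~ 1 / sqrt(total information)."""
    se = 1.0 / math.sqrt(test_info) if test_info > 0 else float("inf")
    return n_given >= max_items or se <= target_se
```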

The use of adaptive testing and related computer-based testing models also involves special considerations related to item development. When a pool of operational items is developed for a computerized adaptive test, the specifications refer both to the item pool and to the rules or procedures by which an individualized set of items is selected for each test taker. Some of the appealing features of computerized adaptive tests, such as tailoring the difficulty level of the items to the test taker's ability, place additional constraints on the design of such tests. In most cases, large numbers of items are needed in constructing a computerized adaptive test to ensure that the set of items administered to each test taker meets all of the requirements of the test specifications. Further, tests often are developed in the context of larger systems or programs. Multiple pools of items, for example, may be created for use with different groups of test takers or on different testing dates. Test security concerns are heightened when limited availability of equipment makes it impossible to test all examinees at the same time. A number of issues, including test security, the complexity of content coverage requirements, required score precision levels, and whether test takers might be allowed to retest using the same pool, must be considered when specifying the size of item pools associated with each form of the adaptive test.

The development of items for adaptive testing typically requires a greater proportion of items to be developed at high or low levels of difficulty relative to the targeted testing population. Tryout data for items developed for use in adaptive tests should be examined for possible context effects to assess how much item parameters might shift when items are administered in different orders. In addition, if items are associated with a common passage or stimulus, development should be informed by an understanding of how item selection will work. For example, the approach to developing items associated with a passage may differ depending on whether the item selection algorithm selects all of the available items related to the passage or is able to choose subsets of the available items related to the passage. Because of the issues that arise when items or tasks are nested within common passages or stimuli, variations on adaptive testing are often considered. For example, multistage testing begins with a set of routing items. Once these are given and scored, the computer branches to item groups that are explicitly targeted to appropriate difficulty levels, based on the evaluation of examinees' observed performance on the routing items. In general, the special requirements of adaptive testing necessitate some shift in the way in which items are developed and tried out. Although the fundamental principles of quality item development are no different, greater attention must be given to the interactions among content, format, and item difficulty to achieve item pools that are best suited to this testing approach.
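As an illustrative sketch of the multistage routing logic described above (the module names and cut points are hypothetical, not prescribed):

```python
def route(routing_score: int, cut_low: int = 8, cut_high: int = 14) -> str:
    """Branch to a second-stage module targeted to the difficulty
    level suggested by performance on the routing items
    (hypothetical cut points on a 20-item routing stage)."""
    if routing_score < cut_low:
        return "easy_module"
    if routing_score < cut_high:
        return "medium_module"
    return "hard_module"

print(route(5), route(10), route(17))  # easy_module medium_module hard_module
```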
Systems Supporting Item and Test Development

The increased reliance on technology and the need for speed and efficiency in the test development process require consideration of the systems supporting item and test development. Such systems can enhance good item and test development practice by facilitating item/task authoring and reviewing, providing item banking and automated tools to assist with test form development, and integrating item/task statistical information with item/task text and graphics. These systems can be developed to comply with interoperability and accessibility standards and frameworks that make it easier for test users to transition their testing programs from one test developer to another. Although the specifics of item databases and supporting systems are outside the scope of the Standards, the increased availability of such systems compels those responsible for developing such tests to consider applying technology to test design and development. Test developers should evaluate the costs and benefits of different applications, considering issues such as speed of development, transportability across testing platforms, and security.

Item Development and Review

The test developer usually assembles an item pool that consists of more questions or tasks than are needed to populate the test form or forms to be built. This allows the test developer to select a set
of items for one or more forms of the test that meet the test specifications. The quality of the items is usually ascertained through item review procedures and item tryouts, often referred to as pretesting. Items are reviewed for content quality, clarity, and construct-irrelevant aspects of content that influence test takers' responses. In most cases, sound practice dictates that items be reviewed for sensitivity and potential offensiveness that could introduce construct-irrelevant variance for individuals or groups of test takers. An attempt is generally made to avoid words and topics that may offend or otherwise disturb some test takers, if less offensive material is equally useful (see chap. 3). For constructed-response questions and performance tasks, development includes item-specific scoring rubrics as well as prompts or task descriptions. Reviewers should be knowledgeable about test content and about the examinee groups covered by this review.

Often, new test items are administered to a group of test takers who are as representative as possible of the target population for the test and, where possible, who adequately represent individuals from intended subgroups. Item tryouts help determine some of the psychometric properties of the test items, such as an item's difficulty and ability to distinguish among test takers of different standing on the construct being assessed. Ongoing testing programs often pretest items by inserting them into existing operational tests (the tryout items do not contribute to the scores that test takers receive). Analyses of responses to these tryout items provide useful data for evaluating quality and appropriateness prior to operational use.

Statistical analyses of item tryout data commonly include studies of differential item functioning (see chap. 3, "Fairness in Testing"). Differential item functioning is said to exist when test takers from different groups (e.g., groups defined by gender, race/ethnicity, or age) who have approximately equal ability on the targeted construct or content domain differ in their responses to an item. In theory, the ultimate goal of such studies is to identify construct-irrelevant aspects of item content, item format, or scoring criteria that may differentially affect test scores of one or more groups of test takers. When differential item functioning is detected, test developers try to identify plausible explanations for the differences, and they may then replace or revise items to promote sound score interpretations for all examinees. When items are dropped due to a differential item functioning index, the test developer must take care that any replacements or revisions do not compromise coverage of the specified test content.
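One common screening statistic for differential item functioning, used here purely to illustrate the kind of analysis described above, is the Mantel-Haenszel common odds ratio, computed across strata of examinees matched on total score. The sketch below assumes dichotomously scored items; operational programs typically use specialized software and accompany the odds ratio with significance tests and effect-size classifications.

```python
from collections import defaultdict

def mantel_haenszel_odds_ratio(rows):
    """Common odds ratio across matching-score strata for one item.
    rows: iterable of (total_score, group, correct), with group in
    {"ref", "focal"} and correct in {0, 1}. Values far from 1.0
    flag potential differential item functioning."""
    strata = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
    for total, group, correct in rows:
        strata[total][group][correct] += 1  # index 0 = wrong, 1 = right
    num = den = 0.0
    for cells in strata.values():
        wrong_r, right_r = cells["ref"]
        wrong_f, right_f = cells["focal"]
        n = wrong_r + right_r + wrong_f + right_f
        if n == 0:
            continue
        num += right_r * wrong_f / n
        den += right_f * wrong_r / n
    return num / den if den else float("nan")
```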
Test developers sometimes use approaches involving structured interviews or think-aloud protocols with selected test takers. Such approaches, sometimes referred to as cognitive labs, are used to identify irrelevant barriers to responding correctly that might limit the accessibility of the test content. Cognitive labs are also used to provide evidence that the cognitive processes being followed by those taking the assessment are consistent with the construct to be measured.

Additional steps are involved in the evaluation of scoring rubrics for extended-response items or performance tasks. Test developers must identify responses that illustrate each scoring level, for use in training and checking scorers. Developers also identify responses at the borders between adjacent score levels for use in more detailed discussions during scorer training. Statistical analyses of scoring consistency and accuracy (agreement with scores assigned by experts) should be included in the analysis of tryout data.
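Two simple indices of scoring consistency, exact and adjacent agreement with expert-assigned scores, illustrate the kind of statistical analysis described above. The ratings below are hypothetical, and operational programs typically add chance-corrected indices such as kappa.

```python
def agreement_rates(scores: list[int], expert: list[int]) -> dict:
    """Exact and adjacent agreement between operational scorers'
    ratings and expert-assigned scores on the same responses."""
    n = len(scores)
    exact = sum(s == e for s, e in zip(scores, expert)) / n
    adjacent = sum(abs(s - e) <= 1 for s, e in zip(scores, expert)) / n
    return {"exact": exact, "adjacent": adjacent}

print(agreement_rates([3, 2, 4, 1, 3], [3, 3, 4, 2, 2]))
# {'exact': 0.4, 'adjacent': 1.0}
```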
Assembling and Evaluating Test Forms

The next step in test development is to assemble items into one or more test forms or to identify one or more pools of items for an adaptive or multistage test. The test developer is responsible for documenting that the items selected for the test meet the requirements of the test specifications. In particular, the set of items selected for a new test form or an item pool for an adaptive test must meet both content and psychometric specifications. In addition, editorial and content reviews are commonly conducted to replace items that are too similar to other items or that may provide clues to the answers to other items in the same test form or item pool. When multiple forms of a
test are prepared, the test specifications govern each of the forms.

New test forms are sometimes tried out or field tested prior to operational use. The purpose of a field test is to determine whether items function as intended in the context of the new test form and to assess statistical properties, such as score precision or reliability, of the new form. When field tests are conducted, all relevant examinee groups should be included so that results and conclusions will generalize to the intended operational use of the new test forms and support further analyses of the fairness of the new forms.

Developing Procedures and Materials for Administration and Scoring

Many interested persons (e.g., practitioners, teachers) may be involved in developing items and scoring rubrics, and/or evaluating the subsequent performances. If a participatory approach is used, participants' knowledge about the domain being assessed and their ability to apply the scoring rubrics are of critical importance. Equally important for those involved in developing tests and evaluating performances is their familiarity with the nature of the population being tested. Relevant characteristics of the population being tested may include the typical range of expected skill levels, familiarity with the response modes required of them, typical ways in which knowledge and skills are displayed, and the primary language used.

Test development includes creation of a number of documents to support test administration as described in the test specifications. Instructions to test users are developed and tried out as part of pilot or field testing procedures. Instructions and training for test administrators must also be developed and tried out. A key consideration in developing test administration procedures and materials is that test administration should be fair to all examinees. This means that instructions for taking the test should be clear and that test administration conditions should be standardized for all examinees. It also means consideration must be given in advance to appropriate testing accommodations for examinees who need them, as discussed in chapter 3.

For computer-administered tests, administration procedures must be consistent with hardware and software requirements included in the test specifications. Hardware requirements may cover processor speed and memory; keyboard, mouse, or other input devices; monitor size and display resolution; and connectivity to local servers or the Internet. Software requirements cover operating systems, browsers, or other common tools and provisions for blocking access to, or interference from, other software. Examinees taking computer-administered tests should be informed on how to respond to questions, how to navigate through the test, whether they can skip items, whether they can revisit previously answered items later in the testing period, whether they can suspend the testing session to a later time, and other exigencies that may occur during testing.

Test security procedures should also be implemented in conjunction with both administration and scoring of the tests. Such procedures often include tracking and storage of materials; encryption of electronic transmission of exam content and scores; nondisclosure agreements for test takers, scorers, and administrators; and procedures for monitoring examinees during the testing session. In addition, for testing programs that reuse test items or test forms, security procedures should include evaluation of changes in item statistics to assess the possibility of a security breach. Test developers or users might consider monitoring of websites for possible disclosure of test content.
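As a minimal illustration of monitoring changes in item statistics for reused items (the threshold and item labels are hypothetical, and real security analyses typically condition on examinee ability as well):

```python
def flag_drifting_items(p_old: dict[str, float], p_new: dict[str, float],
                        threshold: float = 0.10) -> list[str]:
    """Flag reused items whose proportion-correct values rose by
    more than the threshold between administrations; a large upward
    drift can signal prior exposure of the item."""
    return [item for item in p_old
            if item in p_new and p_new[item] - p_old[item] > threshold]

p_2023 = {"A17": 0.52, "B04": 0.61, "C09": 0.45}  # hypothetical statistics
p_2024 = {"A17": 0.71, "B04": 0.63, "C09": 0.47}
print(flag_drifting_items(p_2023, p_2024))  # ['A17']
```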
Test Revisions

Tests and their supporting documents (e.g., test manuals, technical manuals, user guides) should be reviewed periodically to determine whether revisions are needed. Revisions or amendments are necessary when new research data, significant changes in the domain, or new conditions of test use and interpretation suggest that the test is no longer optimal or fully appropriate for some of its intended uses. As an example, tests are revised if the test content or language has become
outdated and, therefore, may subsequently affect the validity of the test score interpretations. However, outdated norms may not have the same implications for revisions as an outdated test. For example, it may be necessary to update the norms for an achievement test after a period of rising or falling achievement in the norming population, or when there are changes in the test-taking population; but the test content itself may continue to be as relevant as it was when the test was developed. The timing of the need for review will vary as a function of test content and intended use(s). For example, tests of mastery of educational or training curricula should be reviewed whenever the corresponding curriculum is updated. Tests assessing psychological constructs should be reviewed when research suggests a revised conceptualization of the construct.
STANDARDS FOR TEST DESIGN AND DEVELOPMENT


The standards in this chapter begin with an overarching standard (numbered 4.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into four thematic clusters labeled as follows:

1. Standards for Test Specifications
2. Standards for Item Development and Review
3. Standards for Developing Test Administration and Scoring Procedures and Materials
4. Standards for Test Revision

Standard 4.0

Tests and testing programs should be designed and developed in a way that supports the validity of interpretations of the test scores for their intended uses. Test developers and publishers should document steps taken during the design and development process to provide evidence of fairness, reliability, and validity for intended uses for individuals in the intended examinee population.

Comment: Specific standards for designing and developing tests in a way that supports intended uses are described below. Initial specifications for a test, intended to guide the development process, may be modified or expanded as development proceeds and new information becomes available. Both initial and final documentation of test specifications and development procedures provide a basis on which external experts and test users can judge the extent to which intended uses have been or are likely to be supported, leading to valid interpretations of test results for all individuals. Initial test specifications may be modified as evidence is collected during development and implementation of the test.

Cluster 1. Standards for Test Specifications

Standard 4.1

Test specifications should describe the purpose(s) of the test, the definition of the construct or domain measured, the intended examinee population, and interpretations for intended uses. The specifications should include a rationale supporting the interpretations and uses of test results for the intended purpose(s).

Comment: The adequacy and usefulness of test interpretations depend on the rigor with which the purpose(s) of the test and the domain represented by the test have been defined and explicated. The domain definition should be sufficiently detailed and delimited to show clearly what dimensions of knowledge, skills, cognitive processes, attitudes, values, emotions, or behaviors are included and what dimensions are excluded. A clear description will enhance accurate judgments by reviewers and others about the degree of congruence between the defined domain and the test items. Clear specification of the intended examinee population and its characteristics can help to guard against construct-irrelevant characteristics of item content and format. Specifications should include plans for collecting evidence of the validity of the intended interpretations of the test scores for their intended uses. Test developers should also identify potential limitations on test use or possible inappropriate uses.

Standard 4.2

In addition to describing intended uses of the test, the test specifications should define the content of the test, the proposed test length, the item formats, the desired psychometric properties of the test items and the test, and the ordering of items and sections. Test specifications should also specify the amount of time allowed for
testing; directions for the test takers; procedures used in selecting items or sets of items for ad
to be used for test administration, including ministration, in determining the starting point
permissible variations; any materials to be used; and termination conditions for the test, in scoring
and scoring and reporting procedures. Specifica the test, and in controlling item exposure.
tions for computer-based tests should include a
Comment: If a computerized adaptive test is in
description of any hardware and software re
tended to measure a number of different content
quirements.
subcategories, item selection procedures should
Comment: P rofessional judgment plays a major ensure that the subcategories are adequately rep
role in developing the test specifications. T he resented by the items presented to the test taker.
specific procedures used for developing the speci Common rationales for computerized adaptive
fications depend on the purpose(s) of the test. tests are that score precision is increased, particularly
For example, in developing licensure and certifi for high- and low-scoring examinees, or chat com
cation tests, practice analyses or job analyses parable precision is achieved while testing time is
usually provide the basis for defining the test reduced. Note that these tests are subject to the
specifications; job analyses alone usually serve this same requirements for documenting the validity
function for employment tests. For achievement of score interpretations for their intended use as
tests given at the end of a course, the test specifi other types of tests. Test specifications should
cations should be based on an outline of course include plans to collect evidence required for such
content and goals.For placement tests, developers documentation.
will examine the required entry-level knowledge
and skills for different courses.In developing psy Standard 4.4
chological tests, descriptions and diagnostic criteria
of behavioral, mental, and emotional deficits and If test developers prepare different versions of a
psychopathology inform test specifications. test with some change to the test specifications,
T he types of items, the response formats, the they should document the content and psycho
Standard 4.4

If test developers prepare different versions of a test with some change to the test specifications, they should document the content and psychometric specifications of each version. The documentation should describe the impact of differences among versions on the validity of score interpretations for intended uses and on the precision and comparability of scores.

Comment: Test developers may have a number of reasons for creating different versions of a test, such as allowing different amounts of time for test administration by reducing or increasing the number of items on the original test, or allowing administration to different populations by translating test questions into different languages. Test developers should document the extent to which the specifications differ from those of the original test, provide a rationale for the different versions, and describe the implications of such differences for interpreting the scores derived from the different versions. Test developers and users should monitor and document any psychometric differences among versions of the test based on evidence collected during development and implementation.


Evidence of differences may involve judgments when the number of examinees receiving a particular version is small (e.g., a braille version). Note that these requirements are in addition to the normal requirements for demonstrating the equivalency of scores from different forms of the same test. When different languages are used in different test versions, the procedures used to develop and check translations into each language should be documented.

Standard 4.5

If the test developer indicates that the conditions of administration are permitted to vary from one test taker or group to another, permissible variation in conditions for administration should be identified. A rationale for permitting the different conditions and any requirements for permitting the different conditions should be documented.

Comment: Variation in conditions of administration may reflect administration constraints in different locations or, more commonly, may be designed as testing accommodations for specific examinees or groups of examinees. One example of a common variation is the use of computer administration of a test form in some locations and paper-and-pencil administration of the same form in other locations. Another example is small-group or one-on-one administration for test takers whose test performance might be limited by distractions in large group settings. Test accommodations, as discussed in chapter 3 ("Fairness in Testing"), are changes made in a test to increase fairness for individuals who otherwise would be disadvantaged by construct-irrelevant features of test items. Test developers should specify procedures for monitoring variations and for collecting evidence to show that the target construct is or is not altered by allowable variations. These procedures should be documented based on data collected during implementation.

Standard 4.6

When appropriate to documenting the validity of test score interpretations for intended uses, relevant experts external to the testing program should review the test specifications to evaluate their appropriateness for intended uses of the test scores and fairness for intended test takers. The purpose of the review, the process by which the review is conducted, and the results of the review should be documented. The qualifications, relevant experiences, and demographic characteristics of expert judges should also be documented.

Comment: A number of factors may be considered in deciding whether external review of test specifications is needed, including the extent of intended use, whether score interpretations may have important consequences, and the availability of external experts. Expert review of the test specifications may serve many useful purposes, such as helping to ensure content quality and representativeness. Use of experts external to the test development process supports objectivity in judgments of the quality of the test specifications. Review of the specifications prior to starting item development can avoid significant problems during subsequent test item reviews. The expert judges may include individuals representing defined populations of concern to the test specifications. For example, if the test is to be administered to different linguistic and cultural groups, the expert review typically includes members of these groups and experts on testing issues specific to these groups.

Cluster 2. Standards for Item Development and Review

Standard 4.7

The procedures used to develop, review, and try out items and to select items from the item pool should be documented.


Comment: The qualifications of individuals developing and reviewing items and the processes used to train and guide them in these activities are important aspects of test development documentation. Typically, several groups of individuals participate in the test development process, including item writers and individuals participating in reviews for item and test content, for sensitivity, or for other purposes.

Standard 4.8

The test review process should include empirical analyses and/or the use of expert judges to review items and scoring criteria. When expert judges are used, their qualifications, relevant experiences, and demographic characteristics should be documented, along with the instructions and training in the item review process that the judges receive.

Comment: When sample size permits, empirical analyses are needed to check the psychometric properties of test items and also to check whether test items function similarly for different groups. Expert judges may be asked to check item scoring and to identify material likely to be inappropriate, confusing, or offensive for groups in the test-taking population. For example, judges may be asked to identify whether lack of exposure to problem contexts in mathematics word problems may be of concern for some groups of students. Various groups of test takers can be defined by characteristics such as age, ethnicity, culture, gender, disability, or geographic region. When feasible, both empirical and judgmental evidence of the extent to which test items function similarly for different groups should be used in screening the items. (See chap. 3 for examples of appropriate types of evidence.)

Studies of the alignment of test forms to content specifications are sometimes conducted to support interpretations that test scores indicate mastery of targeted test content. Experts independent of the test developers judge the degree to which item content matches content categories in the test specifications and whether test forms provide balanced coverage of the targeted content.

Standard 4.9

When item or test form tryouts are conducted, the procedures used to select the sample(s) of test takers as well as the resulting characteristics of the sample(s) should be documented. The sample(s) should be as representative as possible of the population(s) for which the test is intended.

Comment: Conditions that may differentially affect performance on the test items by the tryout sample(s) as compared with the intended population(s) should be documented when appropriate. For example, test takers may be less motivated when they know their scores will not have an impact on them. Where possible, item and test characteristics should be examined and documented for relevant subgroups in the intended examinee population.

To the extent feasible, item and test form tryouts should include relevant examinee groups. Where sample size permits, test developers should determine whether item scores have different relationships to the construct being measured for different groups (differential item functioning). When testing accommodations are designed for specific examinee groups, information on item performance under accommodated conditions should also be collected. For relatively small groups, qualitative information may be useful. For example, test-taker interviews might be used to assess the effectiveness of accommodations in removing irrelevant variance.

Standard 4.10

When a test developer evaluates the psychometric properties of items, the model used for that purpose (e.g., classical test theory, item response theory, or another model) should be documented. The sample used for estimating item properties should be described and should be of adequate size and diversity for the procedure. The process by which items are screened and the data used for screening, such as item difficulty, item discrimination, or differential item functioning (DIF) for major examinee groups, should also be documented. When model-based methods (e.g., IRT) are used to estimate item parameters in test development, the item response model, estimation procedures, and evidence of model fit should be documented.


Comment: Although overall sample size is relevant, there should also be an adequate number of cases in regions critical to the determination of the psychometric properties of items. If the test is to achieve greatest precision in a particular part of the score scale and this consideration affects item selection, the manner in which item statistics are used for item selection needs to be carefully described. When IRT is used as the basis of test development, it is important to document the adequacy of fit of the model to the data. This is accomplished by providing information about the extent to which IRT assumptions (e.g., unidimensionality, local item independence, or, for certain models, equality of slope parameters) are satisfied.

Statistics used for flagging items that function differently for different groups should be described, including specification of the groups to be analyzed, the criteria for flagging, and the procedures for reviewing and making final decisions about flagged items. Sample sizes for groups of concern should be adequate for detecting meaningful DIF.
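As one concrete instance of a flagging statistic, the sketch below computes the Mantel-Haenszel common odds ratio and the associated ETS-style delta metric for a single dichotomous item. The score strata, counts, and the flagging threshold in the comments are hypothetical; the actual criteria a program uses must be specified and documented.

```python
import math

# Mantel-Haenszel DIF check for one dichotomous item, with examinees
# matched on total score. Counts per stratum are hypothetical.

def mantel_haenszel_odds_ratio(strata):
    """
    strata: list of (a, b, c, d) tuples per matched score level, where
      a = reference-group correct,  b = reference-group incorrect,
      c = focal-group correct,      d = focal-group incorrect.
    Returns the estimated common odds ratio.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata if a + b + c + d)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata if a + b + c + d)
    return num / den

def mh_delta(alpha_mh):
    """ETS delta metric; negative values indicate DIF against the focal group."""
    return -2.35 * math.log(alpha_mh)

strata = [(40, 10, 30, 20), (25, 25, 20, 30), (10, 40, 5, 45)]
print(round(mh_delta(mantel_haenszel_odds_ratio(strata)), 2))  # -1.63
# Items whose |delta| exceeds a documented threshold (often around 1.5)
# would be routed to the review procedures described above.
```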
Test developers should consider how any differences between the administration conditions of the field test and the final form might affect item performance. Conditions that can affect item statistics include motivation of the test takers, item position, time limits, length of test, mode of testing (e.g., paper-and-pencil versus computer administered), and use of calculators or other tools.

Standard 4.11

Test developers should conduct cross-validation studies when items or tests are selected primarily on the basis of empirical relationships rather than on the basis of content or theoretical considerations. The extent to which the different studies show consistent results should be documented.

Comment: When data-based approaches to test development are used, items are selected primarily on the basis of their empirical relationships with an external criterion, their relationships with one another, or their power to discriminate among groups of individuals. Under these circumstances, it is likely that some items will be selected based on chance occurrences in the data used. Administering the test to a comparable sample of test takers or use of a separate validation sample provides independent verification of the relationships used in selecting items.

Statistical optimization techniques such as stepwise regression are sometimes used to develop test composites or to select tests for further use in a test battery. As with the empirical selection of items, capitalization on chance can occur. Cross-validation on an independent sample or the use of a formula that predicts the shrinkage of correlations in an independent sample may provide a less biased index of the predictive power of the tests or composite.
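For example, a Wherry-type adjustment, one of several shrinkage formulas in use, estimates how much a sample squared multiple correlation would shrink in the population; the sample size, predictor count, and R-squared value below are hypothetical.

```python
# Hypothetical shrinkage estimate for an empirically selected composite.

def adjusted_r_squared(r2, n, k):
    """Wherry-type estimate of the population squared multiple correlation,
    given sample R^2, sample size n, and number of predictors k."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r_squared(r2=0.36, n=120, k=10), 3))  # about 0.301
```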
Standard 4.12

Test developers should document the extent to which the content domain of a test represents the domain defined in the test specifications.

Comment: Test developers should provide evidence of the extent to which the test items and scoring criteria yield scores that represent the defined domain. This affords a basis to help determine whether performance on the test can be generalized to the domain that is being assessed. This is especially important for tests that contain a small number of items, such as performance assessments. Such evidence may be provided by expert judges. In some situations, an independent study of the alignment of test questions to the content specifications is conducted to validate the developer's internal processing for ensuring appropriate content coverage.


Standard 4.13

When credible evidence indicates that irrelevant variance could affect scores from the test, then to the extent feasible, the test developer should investigate sources of irrelevant variance. Where possible, such sources of irrelevant variance should be removed or reduced by the test developer.

Comment: A variety of methods may be used to check for the influence of irrelevant factors, including analyses of correlations with measures of other relevant and irrelevant constructs and, in some cases, deeper cognitive analyses (e.g., use of follow-up probes to identify relevant and irrelevant reasons for correct and incorrect responses) of examinee standing on the targeted construct. A deeper understanding of irrelevant sources of variance may also lead to refinement of the description of the construct under examination.

Standard 4.14

For a test that has a time limit, test development research should examine the degree to which scores include a speed component and should evaluate the appropriateness of that component, given the domain the test is designed to measure.

Comment: At a minimum, test developers should examine the proportion of examinees who complete the entire test, as well as the proportion who fail to respond to (omit) individual test questions. Where speed is a meaningful part of the target construct, the distribution of the number of items answered should be analyzed to check for appropriate variability in the number of items attempted as well as the number of correct responses. When speed is not a meaningful part of the target construct, time limits should be determined so that examinees will have adequate time to demonstrate the targeted knowledge and skill.
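To make the minimum analysis concrete, the sketch below computes two of the quantities just mentioned, the completion rate and per-item omit rates, from a hypothetical response matrix in which None marks an omitted item; the data and the reading of the results are illustrative only.

```python
# Hypothetical speededness checks. `responses` is a list of per-examinee
# response lists in presentation order; None marks an omitted item.

def completion_rate(responses, n_items):
    """Proportion of examinees who responded to the final item."""
    return sum(1 for r in responses if r[n_items - 1] is not None) / len(responses)

def omit_rates(responses, n_items):
    """Per-item proportion of examinees omitting each question."""
    n = len(responses)
    return [sum(1 for r in responses if r[i] is None) / n for i in range(n_items)]

responses = [[1, 0, 1, None], [1, 1, 0, 1], [0, 1, None, None]]
print(completion_rate(responses, 4))  # 0.33...
print(omit_rates(responses, 4))       # omit rates rising toward the end
# A low completion rate, or omit rates that climb sharply near the end of
# the test, suggests that scores include a speed component.
```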

Cluster 3. Standards for Developing Test Administration and Scoring Procedures and Materials

Standard 4.15

The directions for test administration should be presented with sufficient clarity so that it is possible for others to replicate the administration conditions under which the data on reliability, validity, and (where appropriate) norms were obtained. Allowable variations in administration procedures should be clearly described. The process for reviewing requests for additional testing variations should also be documented.

Comment: Because all people administering tests, including those in schools, industry, and clinics, need to follow test administration procedures carefully, it is essential that test administrators receive detailed instructions on test administration guidelines and procedures. Testing accommodations may be needed to allow accurate measurement of intended constructs for specific groups of test takers, such as individuals with disabilities and individuals whose native language is not English. (See chap. 3, "Fairness in Testing.")

Standard 4.16

The instructions presented to test takers should contain sufficient detail so that test takers can respond to a task in the manner that the test developer intended. When appropriate, sample materials, practice or sample questions, criteria for scoring, and a representative item identified with each item format or major area in the test's classification or domain should be provided to the test takers prior to the administration of the test, or should be included in the testing material as part of the standard administration instructions.

Comment: For example, in a personality inventory the intent may be that test takers give the first response that occurs to them. Such an expectation should be made clear in the inventory directions.


As another example, in directions for interest or occupational inventories, it may be important to specify whether test takers are to mark the activities they would prefer under ideal conditions or whether they are to consider both their opportunity and their ability realistically.

Instructions and any practice materials should be available in formats that can be accessed by all test takers. For example, if a braille version of the test is provided, the instructions and any practice materials should also be provided in a form that can be accessed by students who take the braille version.

The extent and nature of practice materials and directions depend on expected levels of knowledge among test takers. For example, in using a novel test format, it may be very important to provide the test taker with a practice opportunity as part of the test administration. In some testing situations, it may be important for the instructions to address such matters as time limits and the effects that guessing has on test scores. If expansion or elaboration of the test instructions is permitted, the conditions under which this may be done should be stated clearly in the form of general rules and by giving representative examples. If no expansion or elaboration is to be permitted, this should be stated explicitly. Test developers should include guidance for dealing with typical questions from test takers. Test administrators should be instructed on how to deal with questions that may arise during the testing period.

Standard 4.17

If a test or part of a test is intended for research use only and is not distributed for operational use, statements to that effect should be displayed prominently on all relevant test administration and interpretation materials that are provided to the test user.

Comment: This standard refers to tests that are intended for research use only. It does not refer to standard test development functions that occur prior to the operational use of a test (e.g., item and form tryouts). There may be legal requirements to inform participants of how the test developer will use the data generated from the test, including the user's personally identifiable information, how that information will be protected, and with whom it might be shared.

Standard 4.18

Procedures for scoring and, if relevant, scoring criteria, should be presented by the test developer with sufficient detail and clarity to maximize the accuracy of scoring. Instructions for using rating scales or for deriving scores obtained by coding, scaling, or classifying constructed responses should be clear. This is especially critical for extended-response items such as performance tasks, portfolios, and essays.

Comment: In scoring more complex responses, test developers must provide detailed rubrics and training in their use. Providing multiple examples of responses at each score level for use in training scorers and monitoring scoring consistency is also common practice, although these are typically added to scoring specifications during item development and tryouts. For monitoring scoring effectiveness, consistency criteria for qualifying scorers should be specified, as appropriate, along with procedures, such as double-scoring of some or all responses. As appropriate, test developers should specify selection criteria for scorers and procedures for training, qualifying, and monitoring scorers. If different groups of scorers are used with different administrations, procedures for checking the comparability of scores generated by the different groups should be specified and implemented.

Standard 4.19

When automated algorithms are to be used to score complex examinee responses, characteristics of responses at each score level should be documented along with the theoretical and empirical bases for the use of the algorithms.


Comment: Automated scoring algorithms should be supported by an articulation of the theoretical and methodological bases for their use that is sufficiently detailed to establish a rationale for linking the resulting test scores to the underlying construct of interest. In addition, the automated scoring algorithm should have empirical research support, such as agreement rates with human scorers, prior to operational use, as well as evidence that the scoring algorithms do not introduce systematic bias against some subgroups.

Because automated scoring algorithms are often considered proprietary, their developers are rarely willing to reveal scoring and weighting rules in public documentation. Also, in some cases, full disclosure of details of the scoring algorithm might result in coaching strategies that would increase scores without any real change in the construct(s) being assessed. In such cases, developers should describe the general characteristics of scoring algorithms. They may also have the algorithms reviewed by independent experts, under conditions of nondisclosure, and collect independent judgments of the extent to which the resulting scores will accurately implement intended scoring rubrics and be free from bias for intended examinee subpopulations.

Standard 4.20

The process for selecting, training, qualifying, and monitoring scorers should be specified by the test developer. The training materials, such as the scoring rubrics and examples of test takers' responses that illustrate the levels on the rubric score scale, and the procedures for training scorers should result in a degree of accuracy and agreement among scorers that allows the scores to be interpreted as originally intended by the test developer. Specifications should also describe processes for assessing scorer consistency and potential drift over time in raters' scoring.

Comment: To the extent possible, scoring processes and materials should anticipate issues that may arise during scoring. Training materials should address any common misconceptions about the rubrics used to describe score levels. When written text is being scored, it is common to include a set of prescored responses for use in training and for judging scoring accuracy. The basis for determining scoring consistency (e.g., percentage of exact agreement, percentage within one score point, or some other index of agreement) should be indicated. Information on scoring consistency is essential to estimating the precision of resulting scores.
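The sketch below computes the two agreement indices named here, percentage of exact agreement and percentage within one score point, for a pair of raters double-scoring the same responses; the ratings and the 0-4 rubric scale are hypothetical.

```python
# Hypothetical scorer-agreement check for paired ratings on a 0-4 rubric.

def agreement_rates(scores_a, scores_b):
    """Return (exact, within_one) agreement proportions for paired ratings."""
    pairs = list(zip(scores_a, scores_b))
    exact = sum(1 for a, b in pairs if a == b) / len(pairs)
    within_one = sum(1 for a, b in pairs if abs(a - b) <= 1) / len(pairs)
    return exact, within_one

rater1 = [4, 3, 2, 4, 1, 0, 3, 2]
rater2 = [4, 2, 2, 3, 1, 1, 3, 2]
print(agreement_rates(rater1, rater2))  # (0.625, 1.0)
```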
Standard 4.21

When test users are responsible for scoring and scoring requires scorer judgment, the test user is responsible for providing adequate training and instruction to the scorers and for examining scorer agreement and accuracy. The test developer should document the expected level of scorer agreement and accuracy and should provide as much technical guidance as possible to aid test users in satisfying this standard.

Comment: A common practice of test developers is to provide training materials (e.g., scoring rubrics, examples of test takers' responses at each score level) and procedures when scoring is done by test users and requires scorer judgment. Training provided to support local scoring should include standards for checking scorer accuracy during training and operational scoring. Training should also cover any special consideration for test-taker groups that might interact differently with the task to be scored.

Standard 4.22

Test developers should specify the procedures used to interpret test scores and, when appropriate, the normative or standardization samples or the criterion used.

Comment: Test specifications may indicate that the intended scores should be interpreted as indicating an absolute level of the construct being measured or as indicating standing on the construct relative to other examinees, or both. In absolute score interpretations, the score or average is assumed to reflect directly a level of competence or mastery in some defined criterion domain.


In relative score interpretations the status of an individual (or group) is determined by comparing the score (or mean score) with the performance of others in one or more defined populations. Tests designed to facilitate one type of interpretation may function less effectively for the other type of interpretation. Given appropriate test design and adequate supporting data, however, scores arising from norm-referenced testing programs may provide reasonable absolute score interpretations, and scores arising from criterion-referenced programs may provide reasonable relative score interpretations.

Standard 4.23

When a test score is derived from the differential weighting of items or subscores, the test developer should document the rationale and process used to develop, review, and assign item weights. When the item weights are obtained based on empirical data, the sample used for obtaining item weights should be representative of the population for which the test is intended and large enough to provide accurate estimates of optimal weights. When the item weights are obtained based on expert judgment, the qualifications of the judges should be documented.

Comment: Changes in the population of test takers, along with other changes, for example in instructions, training, or job requirements, may affect the original derived item weights, necessitating subsequent studies. In many cases, content areas are weighted by specifying a different number of items from different areas. The rationale for weighting the different content areas should also be documented and periodically reviewed.
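As a minimal illustration, the composite below combines three subscores with fixed judgmental weights. The content areas and weights are hypothetical; it is exactly this kind of weighting scheme whose rationale, derivation sample, or expert judgments the standard asks developers to document.

```python
# Hypothetical differentially weighted composite of subscores.
weights = {"algebra": 0.5, "geometry": 0.3, "data_analysis": 0.2}

def composite(subscores):
    """Weighted sum of subscores; the weights are assumed to sum to 1."""
    return sum(weights[area] * subscores[area] for area in weights)

print(composite({"algebra": 70, "geometry": 80, "data_analysis": 60}))  # 71.0
```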
Cluster 4. Standards for Test Revision

Standard 4.24

Test specifications should be amended or revised when new research data, significant changes in the domain represented, or newly recommended conditions of test use may reduce the validity of test score interpretations. Although a test that remains useful need not be withdrawn or revised simply because of the passage of time, test developers and test publishers are responsible for monitoring changing conditions and for amending, revising, or withdrawing the test as indicated.

Comment: Test developers need to consider a number of factors that may warrant the revision of a test, including outdated test content and language, new evidence of relationships among measured or predicted constructs, or changes to test frameworks to reflect changes in curriculum, instruction, or job requirements. If an older version of a test is used when a newer version has been published or made available, test users are responsible for providing evidence that the older version is as appropriate as the new version for that particular test use.

Standard 4.25

When tests are revised, users should be informed of the changes to the specifications, of any adjustments made to the score scale, and of the degree of comparability of scores from the original and revised tests. Tests should be labeled as "revised" only when the test specifications have been updated in significant ways.

Comment: It is the test developer's responsibility to determine whether revisions to a test would influence test score interpretations. If test score interpretations would be affected by the revisions, it is appropriate to label the test "revised." When tests are revised, the nature of the revisions and their implications for test score interpretations should be documented. Examples of changes that require consideration include adding new areas of content, refining content descriptions, redistributing the emphasis across different content areas, and even just changing item format specifications. Note that creating a new test form using the same specifications is not considered a revision within the context of this standard.

5. SCORES, SCALES, NORMS, SCORE LINKING, AND CUT SCORES
BACKGROUND

Test scores are reported on scales designed to assist in score interpretation. Typically, scoring begins with responses to separate test items. These item scores are combined, sometimes by addition, to obtain a raw score when using classical test theory or to produce an IRT score when using item response theory (IRT) or other model-based techniques. Raw scores and IRT scores often are difficult to interpret in the absence of further information. Interpretation may be facilitated by converting raw scores or IRT scores to scale scores. Examples include various scale scores used on college admissions tests and those used to report results for intelligence tests or vocational interest and personality inventories. The process of developing a score scale is referred to as scaling a test. Scale scores may aid interpretation by indicating how a given score compares with those of other test takers, by enhancing the comparability of scores obtained through different forms of a test, and by helping to prevent confusion with other scores.

Another way of assisting score interpretation is to establish cut scores that distinguish different score ranges. In some cases, a single cut score defines the boundary between passing and failing. In other cases, a series of cut scores define distinct proficiency levels. Scale scores, proficiency levels, and cut scores can be central to the use and interpretation of test scores. For that reason, their defensibility is an important consideration in test score validation for the intended purposes.

Decisions about how many scale score points to use often are based on test score reliability concerns. If too few scale score points are used, then the reliability of scale scores is decreased as information is discarded. If too many scale-score points are used, then test users might attempt to interpret scale score differences that are small relative to the amount of measurement error in the scores.
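One quantity that can ground such decisions is the classical standard error of measurement, which expresses the typical size of measurement error in scale-score units; the standard deviation and reliability values below are hypothetical.

```python
# Hypothetical standard error of measurement (SEM) for a reported scale.

def sem(sd, reliability):
    """Classical SEM: sd * sqrt(1 - reliability), in scale-score units."""
    return sd * (1.0 - reliability) ** 0.5

print(round(sem(sd=15.0, reliability=0.91), 1))  # 4.5 scale-score points
# Scale-score differences much smaller than the SEM invite overinterpretation.
```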
In addition to facilitating interpretations of scores on a single test form, scale scores often are created to enhance comparability across alternate forms* of the same test, by using equating methods. Score linking is a general term for methods used to develop scales with similar scale properties. Score linking includes equating and other methods for transforming scores to enhance their comparability on tests designed to measure different constructs (e.g., related subtests in a battery). Linking methods are also used to relate scale scores on different measures of similar constructs (e.g., tests of a particular construct from different test developers) and to relate scale scores on tests that measure similar constructs given under different modes of administration (e.g., computer and paper-and-pencil administrations). Vertical scaling methods sometimes are used to place scores from different levels of an achievement test on a single scale to facilitate inferences about growth or development. The degree of score comparability that results from the application of a linking procedure varies along a continuum. Equating is intended to allow scores on alternate test forms to be used interchangeably, whereas comparability of scores associated with other types of linking may be more restricted.

*The term alternate form is used in this chapter to indicate test forms that have been built to the same content and statistical specifications and developed to measure the same construct. This term is not to be confused with the term alternate assessment as it is used in chapter 3, to indicate a test that has been modified or changed to increase access to the construct for subgroups of the population. The alternate assessment may or may not measure the same construct as the unaltered assessment.


Interpretations of Scores

An individual's raw scores or scale scores often are compared with the distribution of scores for one or more comparison groups to draw useful inferences about the person's relative performance. Test score interpretations based on such comparisons are said to be norm referenced. Percentile rank norms, for example, indicate the standing of an individual or group within a defined population of individuals or groups. An example might be the percentile scores used in military enlistment testing, which compare each applicant's score with scores for the population of 18-to-23-year-old American youth. Percentiles, averages, or other statistics for such reference groups are called norms. By showing how the test score of a given examinee compares with those of others, norms assist in the classification or description of examinees.
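For illustration, the sketch below computes a percentile rank against a hypothetical norm sample using the common midpoint convention (percent scoring below, plus half the percent obtaining the same score); operational norms tables should document whichever convention they actually use.

```python
# Hypothetical percentile-rank computation against a small norm sample.

def percentile_rank(score, norm_sample):
    """Midpoint convention: percent below plus half the percent at the score."""
    below = sum(1 for s in norm_sample if s < score)
    at = sum(1 for s in norm_sample if s == score)
    return 100.0 * (below + 0.5 * at) / len(norm_sample)

norm_sample = [12, 15, 15, 18, 20, 21, 21, 21, 24, 27]
print(percentile_rank(21, norm_sample))  # 65.0
```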
Other test score interpretations make no direct reference to the performance of other examinees. These interpretations may take a variety of forms; most are collectively referred to as criterion-referenced interpretations. Scale scores supporting such interpretations may indicate the likely proportion of correct responses that would be obtained on some larger domain of similar items, or the probability that an examinee will answer particular sorts of items correctly. Other criterion-referenced interpretations may indicate the likelihood that some psychopathology is present. Still other criterion-referenced interpretations may indicate the probability that an examinee's level of tested knowledge or skill is adequate to perform successfully in some other setting. Scale scores to support such criterion-referenced score interpretations often are developed on the basis of statistical analyses of the relationships of test scores to other variables.

Some scale scores are developed primarily to support norm-referenced interpretations; others support criterion-referenced interpretations. In practice, however, there is not always a sharp distinction. Both criterion-referenced and norm-referenced scales may be developed and used with the same test scores if appropriate methods are used to validate each type of interpretation. Moreover, a norm-referenced score scale originally developed, for example, to indicate performance relative to some specific reference population might, over time, also come to support criterion-referenced interpretations. This could happen as research and experience bring increased understanding of the capabilities implied by different scale score levels. Conversely, results of an educational assessment might be reported on a scale consisting of several ordered proficiency levels, defined by descriptions of the kinds of tasks students at each level are able to perform. That would be a criterion-referenced scale, but once the distribution of scores over levels is reported, say, for all eighth-grade students in a given state, individual students' scores will also convey information about their standing relative to that tested population.

Interpretations based on cut scores may likewise be either criterion referenced or norm referenced. If qualitatively different descriptions are attached to successive score ranges, a criterion-referenced interpretation is supported. For example, the descriptions of proficiency levels in some assessment task-scoring rubrics can enhance score interpretation by summarizing the capabilities that must be demonstrated to merit a given score. In other cases, criterion-referenced interpretations may be based on empirically determined relationships between test scores and other variables. But when tests are used for selection, it may be appropriate to rank-order examinees according to their test performance and establish a cut score so as to select a prespecified number or proportion of examinees from one end of the distribution, provided the selection use is sufficiently supported by relevant reliability and validity evidence to support rank ordering. In such cases, the cut score interpretation is norm referenced; the labels "reject" or "fail" versus "accept" or "pass" are determined primarily by an examinee's standing relative to others tested in the current selection process.

Criterion-referenced interpretations based on cut scores are sometimes criticized on the grounds that there is rarely a sharp distinction between those just below and those just above a cut score. A neuropsychological test may be helpful in diagnosing some particular impairment, for example, but the probability that the impairment is present is likely to increase continuously as a function of the test score rather than to change sharply at a particular score.


Cut scores may aid in formulating rules for reaching decisions on the basis of test performance. It should be recognized, however, that the likelihood of misclassification will generally be relatively high for persons with scores close to the cut scores.

Norms

The validity of norm-referenced interpretations depends in part on the appropriateness of the reference group to which test scores are compared. Norms based on hospitalized patients, for example, might be inappropriate for some interpretations of nonhospitalized patients' scores. Thus, it is important that reference populations be carefully defined and clearly described. Validity of such interpretations also depends on the accuracy with which norms summarize the performance of the reference population. That population may be small enough that essentially the entire population can be tested (e.g., all test takers at a given grade level in a given district tested on the same occasion). Often, however, only a sample of examinees from the reference population is tested. It is then important that the norms be based on a technically sound, representative sample of test takers of sufficient size. Patients in a few hospitals in a small geographic region are unlikely to be representative of all patients in the United States, for example. Moreover, the usefulness of norms based on a given sample may diminish over time. Thus, for tests that have been in use for a number of years, periodic review is generally required to ensure the continued utility of their norms. Renorming may be required to maintain the validity of norm-referenced test score interpretations.

More than one reference population may be appropriate for the same test. For example, achievement test performance might be interpreted by reference to local norms based on sampling from a particular school district for use in making local instructional decisions, or to norms for a state or type of community for use in interpreting statewide testing results, or to national norms for use in making comparisons with national groups. For other tests, norms might be based on occupational or educational classifications. Descriptive statistics for all examinees who happen to be tested during a given period of time (sometimes called user norms or program norms) may be useful for some purposes, such as describing trends over time. But there must be a sound reason to regard that group of test takers as an appropriate basis for such inferences. When there is a suitable rationale for using such a group, the descriptive statistics should be clearly characterized as being based on a sample of persons routinely tested as part of an ongoing program.

Score Linking

Score linking is a general term that refers to relating scores from different tests or test forms. When different forms of a test are constructed to the same content and statistical specifications and administered under the same conditions, they are referred to as alternate forms or sometimes parallel or equivalent forms. The process of placing raw scores from such alternate forms on a common scale is referred to as equating. Equating involves small statistical adjustments to account for minor differences in the difficulty of the alternate forms. After equating, alternate forms of the same test yield scale scores that can be used interchangeably even though they are based on different sets of items. In many testing programs that administer tests multiple times, concerns with test security may be raised if the same form is used repeatedly. In other testing programs, the same test takers may be measured repeatedly, perhaps to measure change in levels of psychological dysfunction, attitudes, or educational achievement. In these cases, reusing the same test items may result in biased estimates of change. Score equating allows for the use of alternate forms, thereby avoiding these concerns.

Although alternate forms are built to the same content and statistical specifications, differences in test difficulty will occur, creating the need for equating. One approach to equating involves administering the forms to be equated to the same sample of examinees or to equivalent samples. Another approach involves administering a common set of items, referred to as anchor items, to the samples taking each form.


Each approach has unique strengths, but also involves assumptions that could influence the equating results, and so these assumptions must be checked. Choosing among equating approaches may include the following considerations:

• Administering forms to the same sample allows for an estimate of the correlation between the scores on the two forms, as well as providing data needed to adjust for differences in difficulty. However, there could be order effects related to practice or fatigue that may affect the score distribution for the form administered second.

• Administering alternate forms to equivalent samples, usually through random assignment, avoids any order effects but does not provide a direct estimate of the correlation between the scores; other methods are needed to demonstrate that the two forms measure the same construct. (A simplified linear version of this design is sketched after this list.)

• Embedding a set of anchor items in each of the forms being equated provides a basis for adjusting for differences in the samples of examinees taking each form. The anchor items should cover the same content and difficulty range as each of the full forms being equated so that differences on the anchor items will accurately reflect differences on the full forms. Also, anchor item position and other context factors should be the same in both forms. It is important to check that the anchor items function similarly in the forms being equated. Anchor items are often dropped from the anchor if their relative difficulty is substantially different in the forms being equated.

• Sometimes an external anchor test is used in which the anchor items are administered in a separate section and do not contribute to the total score on the test. This approach eliminates some context factors as the presentation of the anchor items is identical for each examinee sample. Again, however, the anchor test must reflect the content and difficulty of the operational forms being equated. Both embedded and external anchor test designs involve strong statistical assumptions regarding the equivalence of the anchor and the forms being equated. These assumptions are particularly critical when the samples of examinees taking the different forms vary considerably on the construct being measured.
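As a deliberately simplified instance of the equivalent-groups design, the sketch below performs linear (mean-sigma) equating of Form X raw scores onto the Form Y scale; the score data are hypothetical, and an operational equating would involve the design and subgroup checks described in the surrounding text.

```python
import statistics

# Hypothetical linear (mean-sigma) equating from equivalent samples:
# a Form X score is mapped to the Form Y score with the same standardized
# position in its group's distribution.

def linear_equate(x_scores, y_scores):
    """Return a function mapping Form X raw scores onto the Form Y scale."""
    mx, sx = statistics.mean(x_scores), statistics.pstdev(x_scores)
    my, sy = statistics.mean(y_scores), statistics.pstdev(y_scores)
    return lambda x: my + (sy / sx) * (x - mx)

form_x = [18, 22, 25, 30, 33, 37, 41]   # raw scores, group taking Form X
form_y = [20, 24, 28, 31, 35, 40, 44]   # raw scores, group taking Form Y
to_y_scale = linear_equate(form_x, form_y)
print(round(to_y_scale(30), 1))          # Form X raw 30, expressed on Form Y
```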
When claiming that scores on test forms are equated, it is important to document how the forms are built to the same content and statistical specifications and to demonstrate that scores on the alternate forms are measures of the same construct and have similar reliability. Equating should provide accurate score conversions for any set of persons drawn from the examinee population for which the test is designed; hence the stability of conversions across relevant subgroups should be documented. Whenever possible, the definitions of important examinee populations should include groups for which fairness may be a particular issue, such as examinees with disabilities or from diverse linguistic and cultural backgrounds. When sample sizes permit, it is important to examine the stability of equating conversions across these populations.

The increased use of tests delivered by computer raises special considerations for equating and linking because more flexible models for delivering tests become possible. These include adaptive testing as well as approaches where unique items or multiple intact sets of items are selected from a larger pool of available items. It has long been recognized that little is learned from examinees' responses to items that are much too easy or much too difficult for them. Consequently, some testing procedures use only a subset of the available items with each examinee. An adaptive test consists of a pool of items together with rules for selecting a subset of those items to be administered to an individual examinee and a procedure for placing different examinees' scores on a common scale. The selection of successive items is based in part on the examinees' responses to previous items. The item pool and item selection rules may be designed so that each examinee receives a representative set of items of appropriate difficulty.


With some adaptive tests, it may happen that two examinees rarely if ever receive the same set of items. Moreover, two examinees taking the same adaptive test may be given sets of items that differ markedly in difficulty. Nevertheless, adaptive test scores can be reported on a common scale and function much like scores from a single alternate form of a test that is not adaptive.

Often, the adaptation of the test is done item by item. In other situations, such as in multistage testing, the exam process may branch from choosing among sets of items that are broadly representative of content and difficulty to choosing among sets of items that are targeted explicitly for a higher or lower level of the construct being measured, based on an interim evaluation of examinee performance.

In many situations, item pools for adaptive tests are updated by replacing some of the items in the pool with new items. In other cases, entire pools of items are replaced. In either case, statistical procedures are used to link item parameter estimates for the new items to the existing IRT scale so that scores from alternate pools can be used interchangeably, in much the same way that scores on alternate forms of tests are used when scores on the alternate forms are equated. To support comparability of scores on adaptive tests across pools, it is necessary to construct the pools to the same explicit content and statistical specifications and administer them under the same conditions. Most often, a common-item design is used in linking parameter estimates for the new items to the IRT scale used for adaptive testing. In such cases, stability checks should be made on the statistical characteristics of the common items, and the number of common items should be sufficient to yield stable results. The adequacy of the assumptions needed to link scores across pools should be checked.
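A minimal sketch of one such linking procedure appears below: mean-sigma linking, which rescales new difficulty estimates onto the existing IRT scale using the common items. The item difficulties are hypothetical, and operational programs typically apply more robust transformation methods together with the stability checks just described.

```python
import statistics

# Hypothetical mean-sigma linking of new IRT difficulty estimates onto an
# existing scale via common (anchor) items calibrated on both occasions.

old_b = {"i1": -0.80, "i2": 0.10, "i3": 0.95, "i4": 1.60}   # existing scale
new_b = {"i1": -1.05, "i2": -0.10, "i3": 0.70, "i4": 1.45}  # new calibration

common = sorted(set(old_b) & set(new_b))
slope = (statistics.pstdev([old_b[i] for i in common])
         / statistics.pstdev([new_b[i] for i in common]))
intercept = (statistics.mean([old_b[i] for i in common])
             - slope * statistics.mean([new_b[i] for i in common]))

def to_existing_scale(b_new):
    """Place a newly calibrated item difficulty on the existing scale."""
    return slope * b_new + intercept

print(round(to_existing_scale(0.00), 2))  # a new item's difficulty, rescaled
```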
Many other examples of linking exist that may not result in interchangeable scores, including the following:

• For the evaluation of examinee growth over time, it may be desirable to develop vertical scales that span a broad range of developmental or educational levels. The development of vertical scales typically requires linking of tests that are purposefully constructed to differ in difficulty.

• Test revision often brings a need to link scores obtained using newer and older test specifications.

• International comparative studies may require linking of scores on tests given in different languages.

• Scores may be linked on tests measuring different constructs, perhaps comparing an aptitude with a form of behavior, or linking measures of achievement in several content areas or across different test publishers.

• Sometimes linkings are made to compare performance of groups (e.g., school districts, states) on different measures of similar constructs, such as when linking scores on a state achievement test to scores on an international assessment.

• Results from linking studies are sometimes aligned or presented in a concordance table to aid users in estimating performance on one test from performance on another.

• In situations where complex item types are used, score linking is sometimes conducted through judgments about the comparability of item content from one test to another. For example, writing prompts built to be similar, where responses are scored using a common rubric, might be assumed to be equivalent in difficulty. When possible, these linkings should be checked empirically.

• In some situations, judgmental methods are used to link scores across tests. In these situations, the judgment processes and their reliability should be well documented and the rationale for their use should be clear.

Processes used to facilitate comparisons may be described with terms such as linking, calibration, concordance, vertical scaling, projection, or moderation.


These processes may be technically sound and may fully satisfy desired goals of comparability for one purpose or for one relevant subgroup of examinees, but they cannot be assumed to be stable over time or invariant across multiple subgroups of the examinee population, nor is there any assurance that scores obtained using different tests will be equally precise. Thus, their use for other purposes or with other populations than the originally intended population may require additional support. For example, a score conversion that was accurate for a group of native speakers might systematically overpredict or underpredict the scores of a group of nonnative speakers.

Cut Scores

A critical step in the development and use of some tests is to establish one or more cut scores dividing the score range to partition the distribution of scores into categories. These categories may be used just for descriptive purposes or may be used to distinguish among examinees for whom different programs are deemed desirable or different predictions are warranted. An employer may determine a cut score to screen potential employees or to promote current employees; proficiency levels of "basic," "proficient," and "advanced" may be established using standard-setting methods to set cut scores on a state test of mathematics achievement in fourth grade; educators may want to use test scores to identify students who are prepared to go on to college and take credit-bearing courses; or in granting a professional license, a state may specify a minimum passing score on a licensure test.

These examples differ in important respects, but all involve delineating categories of examinees on the basis of test scores. Such cut scores provide the basis for using and interpreting test results. Thus, in some situations, the validity of test score interpretations may hinge on the cut scores. There can be no single method for determining cut scores for all tests or for all purposes, nor can there be any single set of procedures for establishing their defensibility. In addition, although cut scores are helpful for informing selection, placement, and other classifications, it should be acknowledged that such categorical decisions are rarely made on the basis of test performance alone. The examples that follow serve only as illustrations.

The first example, that of an employer interviewing all those who earn scores above a given level on an employment test, is the most straightforward. Assuming that validity evidence has been provided for scores on the employment test for its intended use, average job performance typically would be expected to rise steadily, albeit slowly, with each increment in test score, at least for some range of scores surrounding the cut score. In such a case the designation of the particular value for the cut score may be largely determined by the number of persons to be interviewed or further screened.

In the second example, a state department of education establishes content standards for what fourth-grade students are to learn in mathematics and implements a test for assessing student achievement on these standards. Using a structured, judgmental standard-setting process, committees of subject matter experts develop or elaborate on performance-level descriptors (sometimes referred to as achievement-level descriptors) that indicate what students at achievement levels of "basic," "proficient," and "advanced" should know and be able to do in fourth-grade mathematics. In addition, committees examine test items and student performance to recommend cut scores that are used to assign students to each achievement level based on their test performance. The final decision about the cut scores is a policy decision typically made by a policy body such as the board of education for the state.

In the third example, educators wish to use test scores to identify students who are prepared to go on to college and take credit-bearing courses. Cut scores might initially be identified based on judgments about requirements for taking credit-bearing courses across a range of colleges. Alternatively, judgments about individual students might be collected and then used to find a score level that most effectively differentiates those judged to be prepared from those judged not to be. In such cases, judges must be familiar with both the college course requirements and the students themselves.

Where possible, initial judgments could be followed up with longitudinal data indicating whether former examinees did or did not have to take remedial courses.

In the final example, that of a professional licensure examination, the cut score represents an informed judgment that those scoring below it are at risk of making serious errors because they lack the knowledge or skills tested. No test is perfect, of course, and regardless of the cut score chosen, some examinees with inadequate skills are likely to pass, and some with adequate skills are likely to fail. The relative probabilities of such false positive and false negative errors will vary depending on the cut score chosen. A given probability of exposing the public to potential harm by issuing a license to an incompetent individual (false positive) must be weighed against some corresponding probability of denying a license to, and thereby disenfranchising, a qualified examinee (false negative). Changing the cut score to reduce either probability will increase the other, although both kinds of errors can be minimized through sound test design that anticipates the role of the cut score in test use and interpretation. Determining cut scores in such situations cannot be a purely technical matter, although empirical studies and statistical models can be of great value in informing the process.
ability of exposing the public to potential harm thoughtful and objective as possible.T he process
by issuing a license to an incompetent individual must be such that well-qualified participants can
( false positive) must be weighed against some apply their knowledge and experience to reach
corresponding probability of denying a license to, meaningful and relevant judgments that accurately
and thereby disenfranchising, a qualified examinee reflect their understandings and intentions. A
( false negative). Changing the cut score to reduce sufficiently large and representative group of
either probability will increase the other, although participants should be involved to provide rea
both kinds of errors can be minimized through sonable assurance that the expert ratings across
sound test design that anticipates the role of the judges are sufficiently reliable and that the results
cut score in test use and interpretation.Determining of the judgments would not vary greatly if the
cut scores in such situations cannot be a purely process were replicated.


STANDARDS FOR SCORES, SCALES, NORMS, SCORE LINKING, AND CUT SCORES

The standards in this chapter begin with an overarching standard (numbered 5.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into four thematic clusters labeled as follows:

1. Interpretations of Scores
2. Norms
3. Score Linking
4. Cut Scores

Standard 5.0

Test scores should be derived in a way that supports the interpretations of test scores for the proposed uses of tests. Test developers and users should document evidence of fairness, reliability, and validity of test scores for their proposed use.

Comment: Specific standards for various uses and interpretations of test scores and score scales are described below. These include standards for norm-referenced and criterion-referenced interpretations, interpretations of cut scores, interchangeability of scores on alternate forms following equating, and score comparability following the use of other procedures for score linking. Documentation supporting such interpretations provides a basis for external experts and test users to judge the extent to which the interpretations are likely to be supported and can lead to valid interpretations of scores for all individuals in the intended examinee population.

Cluster 1. Interpretations of Scores

Standard 5.1

Test users should be provided with clear explanations of the characteristics, meaning, and intended interpretation of scale scores, as well as their limitations.

Comment: Illustrations of appropriate and inappropriate interpretations may be helpful, especially for types of scales or interpretations that are unfamiliar to most users. This standard pertains to score scales intended for criterion-referenced as well as norm-referenced interpretations. All scores (raw scores or scale scores) may be subject to misinterpretation. If the nature or intended uses of a scale are novel, it is especially important that its uses, interpretations, and limitations be clearly described.

Standard 5.2

The procedures for constructing scales used for reporting scores and the rationale for these procedures should be described clearly.

Comment: When scales, norms, or other interpretive systems are provided by the test developer, technical documentation should describe their rationale and enable users to judge the quality and precision of the resulting scale scores. For example, the test developer should describe any normative, content, or score precision information that is incorporated into the scale and provide a rationale for the number of score points that are used. This standard pertains to score scales intended for criterion-referenced as well as norm-referenced interpretations.

Standard 5.3

If there is sound reason to believe that specific misinterpretations of a score scale are likely, test users should be explicitly cautioned.

Comment: Test publishers and users can reduce misinterpretations of scale scores if they explicitly describe both appropriate uses and potential misuses.
1 02
SCORES, SCALES, NORMS, SCORE LINKING, AND CUT SCORES
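As one hypothetical illustration of the kind of documentation Standard 5.2 calls for, a developer might publish the transformation that produces reported scale scores, for example a linear conversion anchored in a reference population. The sketch below is an invented example, not a prescribed method; the anchor values (mean 500, standard deviation 100) and the raw scores are assumptions made for illustration.

# Hypothetical scale construction: linear conversion of raw scores to a
# reported scale anchored at mean 500, SD 100 in a reference population.
# The anchor values are illustrative, not prescribed by the Standards.
from statistics import mean, stdev

def make_scale(reference_raw_scores, target_mean=500.0, target_sd=100.0):
    m, s = mean(reference_raw_scores), stdev(reference_raw_scores)
    def to_scale(raw):
        # Round to whole scale points; the number of distinct reported
        # points is a design choice the developer should justify.
        return round(target_mean + target_sd * (raw - m) / s)
    return to_scale

reference = [23, 31, 28, 35, 40, 27, 33, 38, 25, 30]  # invented raw scores
to_scale = make_scale(reference)
print(to_scale(35))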

Standard 5.3

If there is sound reason to believe that specific misinterpretations of a score scale are likely, test users should be explicitly cautioned.

Comment: Test publishers and users can reduce misinterpretations of scale scores if they explicitly describe both appropriate uses and potential misuses. For example, a score scale point originally defined as the mean of some reference population should no longer be interpreted as representing average performance if the scale is held constant over time and the examinee population changes. Similarly, caution is needed if score meanings may vary for some test takers, such as the meaning of achievement scores for students who have not had adequate opportunity to learn the material covered by the test.

Standard 5.4

When raw scores are intended to be directly interpretable, their meanings, intended interpretations, and limitations should be described and justified in the same manner as is done for scale scores.

Comment: In some cases the items in a test are a representative sample of a well-defined domain of items with regard to both content and item difficulty. The proportion answered correctly on the test may then be interpreted as an estimate of the proportion of items in the domain that could be answered correctly. In other cases, different interpretations may be attached to scores above or below a particular cut score. Support should be offered for any such interpretations recommended by the test developer.

Standard 5.5

When raw scores or scale scores are designed for criterion-referenced interpretation, including the classification of examinees into separate categories, the rationale for recommended score interpretations should be explained clearly.

Comment: Criterion-referenced interpretations are score-based descriptions or inferences that do not take the form of comparisons of an examinee's test performance with the test performance of other examinees. Examples include statements that some psychopathology is likely present, that a prospective employee possesses specific skills required in a given position, or that a child scoring above a certain score point can successfully apply a given set of skills. Such interpretations may refer to the absolute levels of test scores or to patterns of scores for an individual examinee. Whenever the test developer recommends such interpretations, the rationale and empirical basis should be presented clearly. Serious efforts should be made whenever possible to obtain independent evidence concerning the soundness of such score interpretations.

Standard 5.6

Testing programs that attempt to maintain a common scale over time should conduct periodic checks of the stability of the scale on which scores are reported.

Comment: The frequency of such checks depends on various characteristics of the testing program. In some testing programs, items are introduced into and retired from item pools on an ongoing basis. In other cases, the items in successive test forms may overlap very little, or not at all. In either case, if a fixed scale is used for reporting, it is important to ensure that the meaning of the scale scores does not change over time. When scales are based on the subsequent application of precalibrated item parameter estimates using item response theory, periodic analyses of item parameter stability should be routinely undertaken.
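One common way to implement the periodic checks called for in Standard 5.6 is to recalibrate items and compare the new difficulty estimates with the precalibrated values. The sketch below is a minimal, hypothetical example; the item identifiers, parameter values, and the 0.3-logit flagging threshold are all invented for illustration, not recommended values.

# Hypothetical check of item parameter drift: compare precalibrated IRT
# b-parameters with estimates from a new calibration and flag large shifts.
# Items, values, and the 0.3-logit threshold are purely illustrative.
old_b = {"item01": -0.52, "item02": 0.10, "item03": 1.25, "item04": 0.40}
new_b = {"item01": -0.48, "item02": 0.55, "item03": 1.20, "item04": 0.35}

THRESHOLD = 0.3
for item, b0 in old_b.items():
    drift = new_b[item] - b0
    flag = "DRIFT" if abs(drift) > THRESHOLD else "ok"
    print(f"{item}: old={b0:+.2f} new={new_b[item]:+.2f} delta={drift:+.2f} {flag}")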

Standard 5.7

When standardized tests or testing procedures are changed for relevant subgroups of test takers, the individual or group making the change should provide evidence of the comparability of scores on the changed versions with scores obtained on the original versions of the tests. If evidence is lacking, documentation should be provided that cautions users that scores from the changed test or testing procedure may not be comparable with those from the original version.

Comment: Sometimes it becomes necessary to change original versions of a test or testing procedure when the test is given to relevant subgroups of the testing population, for example, individuals with disabilities or individuals with diverse linguistic and cultural backgrounds. A test may be translated into braille so that it is accessible to individuals who are blind, or the testing procedure may be changed to include extra time for certain groups of examinees. These changes may or may not have an effect on the underlying constructs that are measured by the test and, consequently, on the score conversions used with the test. If scores on the changed test will be compared with scores on the original test, the test developer should provide empirical evidence of the comparability of scores on the changed and original test whenever sample sizes are sufficiently large to provide this type of evidence.

Cluster 2. Norms

Standard 5.8

Norms, if used, should refer to clearly described populations. These populations should include individuals or groups with whom test users will ordinarily wish to compare their own examinees.

Comment: It is the responsibility of test developers to describe norms clearly and the responsibility of test users to use norms appropriately. Users need to know the applicability of a test to different groups. Differentiated norms or summary information about differences between gender, racial/ethnic, language, disability, grade, or age groups, for example, may be useful in some cases. The permissible uses of such differentiated norms and related information may be limited by law. Users also need to be alerted to situations in which norms are less appropriate for some groups or individuals than others. On an occupational interest inventory, for example, norms for persons actually engaged in an occupation may be inappropriate for interpreting the scores of persons not so engaged.

Standard 5.9

Reports of norming studies should include precise specification of the population that was sampled, sampling procedures and participation rates, any weighting of the sample, the dates of testing, and descriptive statistics. Technical documentation should indicate the precision of the norms themselves.

Comment: The information provided should be sufficient to enable users to judge the appropriateness of the norms for interpreting the scores of local examinees. The information should be presented so as to comply with applicable legal requirements and professional standards relating to privacy and data security.

Standard 5.10

When norms are used to characterize examinee groups, the statistics used to summarize each group's performance and the norms to which those statistics are referred should be defined clearly and should support the intended use or interpretation.

Comment: It is not possible to determine the percentile rank of a school's average test score if all that is known is the percentile rank of each of that school's students. It may sometimes be useful to develop special norms for group means, but when the sizes of the groups differ materially or when some groups are much more heterogeneous than others, the construction and interpretation of group norms is problematic. One common and acceptable procedure is to report the percentile rank of the median group member, for example, the median percentile rank of the pupils tested in a given school.
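The procedure mentioned in the comment to Standard 5.10, reporting the percentile rank of the median group member, can be sketched in a few lines. The individual percentile ranks below are invented for illustration.

# Illustrative computation of the "median percentile rank" summary for a
# group (e.g., pupils in one school), as described under Standard 5.10.
# Individual percentile ranks below are invented.
from statistics import median, mean

pupil_percentile_ranks = [12, 35, 41, 47, 52, 58, 63, 70, 88]

# The median pupil's percentile rank is an acceptable group summary...
print("median percentile rank:", median(pupil_percentile_ranks))

# ...whereas treating the school's average score as if student norms
# applied to it directly is not supported by student-level norms.
print("mean of percentile ranks (not a group norm):", mean(pupil_percentile_ranks))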

Standard 5.11

If a test publisher provides norms for use in test score interpretation, then as long as the test remains in print, it is the test publisher's responsibility to renorm the test with sufficient frequency to permit continued accurate and appropriate score interpretations.

Comment: Test publishers should ensure that up-to-date norms are readily available or provide evidence that older norms are still appropriate. However, it remains the test user's responsibility to avoid inappropriate use of norms that are out of date and to strive to ensure accurate and appropriate score interpretations.

Cluster 3. Score Linking

Standard 5.12

A clear rationale and supporting evidence should be provided for any claim that scale scores earned on alternate forms of a test may be used interchangeably.

Comment: For scores on alternate forms to be used interchangeably, the alternate forms must be built to common detailed content and statistical specifications. Adequate data should be collected and appropriate statistical methodology should be applied to conduct the equating of scores on alternate test forms. The quality of the equating should be evaluated to assess whether the resulting scale scores on the alternate forms can be used interchangeably.

Standard 5.13

When claims of form-to-form score equivalence are based on equating procedures, detailed technical information should be provided on the method by which equating functions were established and on the accuracy of the equating functions.

Comment: Evidence should be provided to show that equated scores on alternate forms measure essentially the same construct with very similar levels of reliability and conditional standard errors of measurement and that the results are appropriate for relevant subgroups. Technical information should include the design of the equating study, the statistical methods used, the size and relevant characteristics of examinee samples used in equating studies, and the characteristics of any anchor tests or anchor items. For tests for which equating is conducted prior to operational use (i.e., pre-equating), documentation of the item calibration process should be provided and the adequacy of the equating functions should be evaluated following operational administration. When equivalent forms of computer-based tests are constructed dynamically, the algorithms used should be documented and the technical characteristics of alternate forms should be evaluated based on simulation and/or analysis of administration data. Standard errors of equating functions should be estimated and reported whenever possible. Sample sizes permitting, it may be informative to assess whether equating functions developed for relevant subgroups of examinees are similar. It may also be informative to use two or more anchor forms and to conduct the equating using each of the anchors. To be most useful, equating error should be presented in units of the reported score scale. For testing programs with cut scores, equating error near the cut score is of primary importance.
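To make the idea of an equating function concrete, the sketch below shows a linear equating of Form Y onto the scale of Form X under a random-groups design, one of several methods consistent with the documentation requirements above. The score data are invented, and a full evaluation would also report standard errors of equating, which are omitted here.

# Illustrative linear equating (random-groups design): place Form Y scores
# on the Form X scale by matching means and standard deviations:
#   lx(y) = mu_x + (sigma_x / sigma_y) * (y - mu_y). Data are invented.
from statistics import mean, pstdev

form_x = [18, 22, 25, 27, 30, 31, 33, 36, 40, 44]  # group A took Form X
form_y = [15, 19, 21, 24, 26, 28, 30, 32, 35, 38]  # group B took Form Y

mx, sx = mean(form_x), pstdev(form_x)
my, sy = mean(form_y), pstdev(form_y)

def equate_y_to_x(y):
    return mx + (sx / sy) * (y - my)

for y in (20, 26, 32):
    print(f"Form Y score {y} -> Form X equivalent {equate_y_to_x(y):.1f}")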

Standard 5.14

In equating studies that rely on the statistical equivalence of examinee groups receiving different forms, methods of establishing such equivalence should be described in detail.

Comment: Certain equating designs rely on the random equivalence of groups receiving different forms. Often, one way to ensure such equivalence is to mix systematically different test forms and then distribute them in a random fashion so that roughly equal numbers of examinees receive each form. Because administration designs intended to yield equivalent groups are not always adhered to in practice, the equivalence of groups should be evaluated statistically.

Standard 5.15

In equating studies that employ an anchor test design, the characteristics of the anchor test and its similarity to the forms being equated should be presented, including both content specifications and empirically determined relationships among test scores. If anchor items are used in the equating study, the representativeness and psychometric characteristics of the anchor items should be presented.

Comment: Scores on tests or test forms may be equated via common items embedded within each of them, or a common test administered together with each of them. These common items or tests are referred to as linking items, common items, anchor items, or anchor tests. Statistical procedures applied to anchor items make assumptions that substitute for the equivalence achieved with an equivalent groups design. Performances on these items are the only empirical evidence used to adjust for differences in ability between groups before making adjustments for test difficulty. With such approaches, the quality of the resulting equating depends strongly on the number of the anchor items used and how well the anchor items proportionally reflect the content and statistical characteristics of the test. The content of the anchor items should be exactly the same in each test form to be equated. The anchor items should be in similar positions to help reduce error in equating due to item context effects. In addition, checks should be made to ensure that, after controlling for examinee group differences, the anchor items have similar statistical characteristics on each test form.

Standard 5.16

When test scores are based on model-based psychometric procedures, such as those used in computerized adaptive or multistage testing, documentation should be provided to indicate that the scores have comparable meaning over alternate sets of test items.

Comment: When model-based psychometric procedures are used, technical documentation should be provided that supports the comparability of scores over alternate sets of items. Such documentation should include the assumptions and procedures that were used to establish comparability, including clear descriptions of model-based algorithms, software used, quality control procedures followed, and technical analyses conducted that justify the use of the psychometric models for the particular test scores that are intended to be comparable.

Standard 5.17

When scores on tests that cannot be equated are linked, direct evidence of score comparability should be provided, and the examinee population for which score comparability applies should be specified clearly. The specific rationale and the evidence required will depend in part on the intended uses for which score comparability is claimed.

Comment: Support should be provided for any assertion that linked scores obtained using tests built to different content or statistical specifications, tests that use different testing materials, or tests that are administered under different test administration conditions are comparable for the intended purpose. For these links, the examinee population for which score comparability is established should be specified clearly. This standard applies, for example, to tests that differ in length, tests administered in different formats (e.g., paper-and-pencil and computer-based tests), test forms designed for individual versus group administration, tests that are vertically scaled, computerized adaptive tests, tests that are revised substantially, tests given in different languages, tests administered under various accommodations, tests measuring different constructs, and tests from different publishers.

Standard 5.18

When linking procedures are used to relate scores on tests or test forms that are not closely parallel, the construction, intended interpretation, and limitations of those linkings should be described clearly.

Comment: Various linkings have been conducted relating scores on tests developed at different levels of difficulty, relating earlier to revised forms of published tests, creating concordances between different tests of similar or different constructs, or for other purposes. Such linkings often are useful, but they may also be subject to misinterpretation. The limitations of such linkings should be described clearly. Detailed technical information should be provided on the linking methodology

and the quality of the linking. Technical information about the linking should include, as appropriate, the reliability of the sets of scores being linked, the correlation between test scores, an assessment of content similarity, the conditions of measurement for each test, the data collection design, the statistical methods used, the standard errors of the linking function, evaluations of sampling stability, and assessments of score comparability.

Standard 5.19

When tests are created by taking a subset of the items in an existing test or by rearranging items, evidence should be provided that there are no distortions of scale scores, cut scores, or norms for the different versions or for score linkings between them.

Comment: Some tests and test batteries are published in both a full-length version and a survey or short version. In other cases, multiple versions of a single test form may be created by rearranging its items. It should not be assumed that performance data derived from the administration of items as part of the initial version can be used to compute scale scores, compute linked scores, construct conversion tables, approximate norms, or approximate cut scores for alternative intact tests. Caution is required in cases where context effects are likely, including speeded tests, long tests where fatigue may be a factor, adaptive tests, and tests developed from calibrated item pools. Options for gathering evidence related to context effects might include examinations of model-data fit, operational recalibrations of item parameter estimates initially derived using pretest data, and comparisons of performance on original and revised test forms as administered to randomly equivalent groups.

Standard 5.20

If test specifications are changed from one version of a test to a subsequent version, such changes should be identified, and an indication should be given that converted scores for the two versions may not be strictly equivalent, even when statistical procedures have been used to link scores from the different versions. When substantial changes in test specifications occur, scores should be reported on a new scale, or a clear statement should be provided to alert users that the scores are not directly comparable with those on earlier versions of the test.

Comment: Major shifts sometimes occur in the specifications of tests that are used for substantial periods of time. Often such changes take advantage of improvements in item types or shifts in content that have been shown to improve validity and therefore are highly desirable. It is important to recognize, however, that such shifts will result in scores that cannot be made strictly interchangeable with scores on an earlier form of the test, even when statistical linking procedures are used. To assess score comparability, it is advisable to evaluate the relationship between scores on the old and new versions.

Cluster 4. Cut Scores

Standard 5.21

When proposed score interpretations involve one or more cut scores, the rationale and procedures used for establishing cut scores should be documented clearly.

Comment: Cut scores may be established to select a specified number of examinees (e.g., to identify a fixed number of job applicants for further screening), in which case little further documentation may be needed concerning the specific question of how the cut scores are established, although attention should be paid to the rationale for using the test in selection and the precision of comparisons among examinees. In other cases, however, cut scores may be used to classify examinees into distinct categories (e.g., diagnostic categories, proficiency levels, or passing versus failing) for which there are no pre-established quotas. In these cases, the standard-setting method must be documented in more detail. Ideally, the role of cut scores in test use and interpretation is taken into account during

test design. Adequate precision in regions of score scales where cut scores are established is prerequisite to reliable classification of examinees into categories. If standard setting employs data on the score distributions for criterion groups or on the relation of test scores to one or more criterion variables, those data should be summarized in technical documentation. If a judgmental standard-setting process is followed, the method employed should be described clearly, and the precise nature and reliability of the judgments called for should be presented, whether those are judgments of persons, of item or test performances, or of other criterion performances predicted by test scores. Documentation should also include the selection and qualifications of standard-setting panel participants, training provided, any feedback to participants concerning the implications of their provisional judgments, and any opportunities for participants to confer with one another. Where applicable, variability over participants should be reported. Whenever feasible, an estimate should be provided of the amount of variation in cut scores that might be expected if the standard-setting procedure were replicated with a comparable standard-setting panel.
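The replication estimate mentioned at the end of the comment above can be approximated from judge-level data. The sketch below assumes a hypothetical judgmental panel in which each judge's recommended cut score has already been computed (for example, by an Angoff-type procedure, named here only as one common possibility) and uses the standard error of the mean across judges as a rough index of how much the panel cut score might vary on replication. The judge values are invented.

# Hypothetical estimate of cut-score variability over replications of a
# judgmental standard-setting panel: standard error of the panel mean.
# Judge-level cut scores below are invented.
from statistics import mean, stdev
from math import sqrt

judge_cut_scores = [62.5, 60.0, 65.0, 63.5, 61.0, 64.0, 59.5, 62.0]

panel_cut = mean(judge_cut_scores)
se_replication = stdev(judge_cut_scores) / sqrt(len(judge_cut_scores))

print(f"panel cut score: {panel_cut:.2f}")
print(f"approximate SE over comparable panels: {se_replication:.2f}")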

Standard 5.22

When cut scores defining pass-fail or proficiency levels are based on direct judgments about the adequacy of item or test performances, the judgmental process should be designed so that the participants providing the judgments can bring their knowledge and experience to bear in a reasonable way.

Comment: Cut scores are sometimes based on judgments about the adequacy of item or test performances (e.g., essay responses to a writing prompt) or proficiency expectations (e.g., the scale score that would characterize a borderline examinee). The procedures used to elicit such judgments should result in reasonable, defensible proficiency standards that accurately reflect the standard-setting participants' values and intentions. Reaching such judgments may be most straightforward when participants are asked to consider kinds of performances with which they are familiar and for which they have formed clear conceptions of adequacy or quality. When the responses elicited by a test neither sample nor closely simulate the use of tested knowledge or skills in the actual criterion domain, participants are not likely to approach the task with such clear understandings of adequacy or quality. Special care must then be taken to ensure that participants have a sound basis for making the judgments requested. Thorough familiarity with descriptions of different proficiency levels, practice in judging task difficulty with feedback on accuracy, the experience of actually taking a form of the test, feedback on the pass rates entailed by provisional proficiency standards, and other forms of information may be beneficial in helping participants to reach sound and principled decisions.

Standard 5.23

When feasible and appropriate, cut scores defining categories with distinct substantive interpretations should be informed by sound empirical data concerning the relation of test performance to the relevant criteria.

Comment: In employment settings where it has been established that test scores are related to job performance, the precise relation of test and criterion may have little bearing on the choice of a cut score, if the choice is based on the need for a predetermined number of candidates. However, in contexts where distinct interpretations are applied to different score categories, the empirical relation of test to criterion assumes greater importance. For example, if a cut score is to be set on a high school mathematics test indicating readiness for college-level mathematics instruction, it may be desirable to collect empirical data establishing a relationship between test scores and grades obtained in relevant college courses. Cut scores used in interpreting diagnostic tests may be established on the basis of empirically determined score distributions for criterion groups. With many achievement or proficiency tests, such as those used in credentialing, suitable criterion groups (e.g., successful versus unsuccessful practitioners) are often unavailable. Nevertheless, when appropriate and feasible, the test developer should investigate and report the relation between test scores and performance in relevant practical settings. Professional judgment is required to determine an appropriate standard-setting approach (or combination of approaches) in any given situation. In general, one would not expect to find a sharp difference in levels of the criterion variable between those just below and those just above the cut score, but evidence should be provided, where feasible, of a relationship between test and criterion performance over a score interval that includes or approaches the cut score.
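One simple way to examine the relationship described above is to tabulate criterion success rates within score bands spanning the cut score. The sketch below uses invented data for a hypothetical readiness test with a cut score of 60 and a dichotomous criterion such as an adequate subsequent course grade; it is an illustration, not a prescribed analysis.

# Illustrative check of the test-criterion relationship over a score
# interval around a cut score of 60. Test scores and criterion outcomes
# (1 = success, e.g., adequate course grade) are invented.
data = [(48, 0), (52, 0), (55, 1), (57, 0), (59, 1), (61, 1),
        (63, 0), (65, 1), (68, 1), (71, 1), (74, 1), (78, 1)]

bands = [(45, 54), (55, 64), (65, 74), (75, 84)]
for lo, hi in bands:
    outcomes = [c for score, c in data if lo <= score <= hi]
    if outcomes:
        rate = sum(outcomes) / len(outcomes)
        print(f"scores {lo}-{hi}: n={len(outcomes)}, success rate={rate:.2f}")

# One would typically expect a gradual rise in success rates across the
# cut region rather than a sharp discontinuity at the cut score itself.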
6. TEST ADMINISTRATION, SCORING, REPORTING, AND INTERPRETATION
BACKGROUND
The usefulness and interpretability of test scores require that a test be administered and scored according to the test developer's instructions. When directions, testing conditions, and scoring follow the same detailed procedures for all test takers, the test is said to be standardized. Without such standardization, the accuracy and comparability of score interpretations would be reduced. For tests designed to assess the test taker's knowledge, skills, abilities, or other personal characteristics, standardization helps to ensure that all test takers have the same opportunity to demonstrate their competencies. Maintaining test security also helps ensure that no one has an unfair advantage. The importance of adherence to appropriate standardization of administration procedures increases with the stakes of the test.

Sometimes, however, situations arise in which variations from standardized procedures may be advisable or legally mandated. For example, individuals with disabilities and persons of different linguistic backgrounds, ages, or familiarity with testing may need nonstandard modes of test administration or a more comprehensive orientation to the testing process, so that all test takers can have an unobstructed opportunity to demonstrate their standing on the construct(s) being measured. Different modes of presenting the test or its instructions, or of responding, may be suitable for specific individuals, such as persons with some kinds of disability, or persons with limited proficiency in the language of the test, in order to provide appropriate access to reduce construct-irrelevant variance (see chap. 3, "Fairness in Testing"). In clinical or neuropsychological testing situations, flexibility in administration may be required, depending on the individual's ability to comprehend and respond to test items or tasks and/or the construct required to be measured. Some situations and/or the construct (e.g., testing for memory impairment in a test taker with dementia who is in a hospital) may require that the assessment be abbreviated or altered. Large-scale testing programs typically establish specific procedures for considering and granting accommodations and other variations from standardized procedures. Usually these accommodations themselves are somewhat standardized; occasionally, some alternative other than the accommodations foreseen and specified by the test developer may be indicated. Appropriate care should be taken to avoid unfair treatment and discrimination. Although variations may be made with the intent of maintaining score comparability, the extent to which that is possible often cannot be determined. Comparability of scores may be compromised, and the test may then not measure the same constructs for all test takers.

Tests and assessments differ in their degree of standardization. In many instances, different test takers are not given the same test form but receive equivalent forms that have been shown to yield comparable scores, or alternate test forms where scores are adjusted to make them comparable. Some assessments permit test takers to choose which tasks to perform or which pieces of their work are to be evaluated. Standardization can be maintained in these situations by specifying the conditions of the choice and the criteria for evaluation of the products. When an assessment permits a certain kind of collaboration between test takers or between test taker and test administrator, the limits of that collaboration should be specified. With some assessments, test administrators may be expected to tailor their instructions to help ensure that all test takers understand what is expected of them. In all such cases, the goal remains the same: to provide accurate, fair, and comparable measurement for everyone. The degree of standardization is dictated by that goal, and by the intended use of the test score.

Standardized directions help ensure that all test takers have a common understanding of the

mechanics of test taking. Directions generally inform test takers on how to make their responses, what kind of help they may legitimately be given if they do not understand the question or task, how they can correct inadvertent responses, and the nature of any time constraints. General advice is sometimes given about omitting item responses. Many tests, including computer-administered tests, require special equipment or software. Instruction and practice exercises are often presented in such cases so that the test taker understands how to operate the equipment or software. The principle of standardization includes orienting test takers to materials and accommodations with which they may not be familiar. Some equipment may be provided at the testing site, such as shop tools or software systems. Opportunity for test takers to practice with the equipment will often be appropriate, unless ability to use the equipment is the construct being assessed.

Tests are sometimes administered via technology, with test responses entered by keyboard, computer mouse, voice input, or other devices. Increasingly, many test takers are accustomed to using computers. Those who are not may require training to reduce construct-irrelevant variance. Even those test takers who are familiar with computers may need some brief explanation and practice to manage test-specific details such as the test's interface. Special issues arise in managing the testing environment to reduce construct-irrelevant variance, such as avoiding light reflections on the computer screen that interfere with display legibility, or maintaining a quiet environment when test takers start or finish at different times from neighboring test takers. Those who administer computer-based tests should be trained so that they can deal with hardware, software, or test administration problems. Tests administered by computer in Web-based applications may require other supports to maintain standardized environments.

Standardized scoring procedures help to ensure consistent scoring and reporting, which are essential in all circumstances. When scoring is done by machine, the accuracy of the machine, including any scoring program or algorithm, should be established and monitored. When the scoring of complex responses is done by human scorers or automatic scoring engines, careful training is required. The training typically requires expert human raters to provide a sample of responses that span the range of possible score points or ratings. Within the score point ranges, trainers should also provide samples that exemplify the variety of responses that will yield the score point or rating. Regular monitoring can help ensure that every test performance is scored according to the same standardized criteria and that the test scorers do not apply the criteria differently as they progress through the submitted test responses.

Test scores, per se, are not readily interpreted without other information, such as norms or standards, indications of measurement error, and descriptions of test content. Just as a temperature of 50 degrees Fahrenheit in January is warm for Minnesota and cool for Florida, a test score of 50 is not meaningful without some context. Interpretive material should be provided that is readily understandable to those receiving the report. Often, the test user provides an interpretation of the results for the test taker, suggesting the limitations of the results and the relationship of any reported scores to other information. Scores on some tests are not designed to be released to test takers; only broad test interpretations, or dichotomous classifications, such as "pass/fail," are intended to be reported.

Interpretations of test results are sometimes prepared by computer systems. Such interpretations are generally based on a combination of empirical data, expert judgment, and experience and require validation. In some professional applications of individualized testing, the computer-prepared interpretations are communicated by a professional, who might modify the computer-based interpretation to fit special circumstances. Care should be taken so that test interpretations provided by nonalgorithmic approaches are appropriately consistent. Automatically generated reports are not a substitute for the clinical judgment of a professional evaluator who has worked directly with the test taker, or for the integration of other information, including but not limited to other test results, interviews, existing records, and behavioral observations.

In some large-scale assessments, the primary target of assessment is not the individual test taker but rather a larger unit, such as a school district or an industrial plant. Often, different test takers are given different sets of items, following a carefully balanced matrix sampling plan, to broaden the range of information that can be obtained in a reasonable time period. The results acquire meaning when aggregated over many individuals taking different samples of items. Such assessments may not furnish enough information to support even minimally valid or reliable scores for individuals, as each individual may take only an incomplete test, while in the aggregate, the assessment results may be valid and acceptably reliable for interpretations about performance of the larger unit.

Some further issues of administration and scoring are discussed in chapter 4, "Test Design and Development."

Test users and those who receive test materials, test scores, and ancillary information such as test takers' personally identifiable information are responsible for appropriately maintaining the security and confidentiality of that information.
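The aggregation logic described above can be illustrated with a toy matrix-sampling design. In the sketch below (all responses invented), each test taker answers only one of three item blocks; individual results are too thin to score reliably, but pooling across takers yields an estimate of group performance on the full domain.

# Toy matrix-sampling illustration: each examinee takes one block of items;
# block results are aggregated to estimate group-level domain performance.
# All responses are invented.
blocks = {
    "A": [[1, 0, 1], [1, 1, 1], [0, 0, 1]],   # each row: one examinee's block
    "B": [[0, 1, 1], [1, 1, 0], [1, 0, 0]],
    "C": [[1, 1, 1], [0, 1, 1], [1, 1, 0]],
}

total_correct = total_items = 0
for block, examinees in blocks.items():
    correct = sum(sum(resp) for resp in examinees)
    items = sum(len(resp) for resp in examinees)
    total_correct += correct
    total_items += items
    print(f"block {block}: proportion correct = {correct / items:.2f}")

print(f"group-level domain estimate: {total_correct / total_items:.2f}")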

STANDARDS FOR TEST ADMINISTRATION, SCORING, REPORTING, AND INTERPRETATION
The standards in this chapter begin with an overarching standard (numbered 6.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into three thematic clusters labeled as follows:

1. Test Administration
2. Test Scoring
3. Reporting and Interpretation

Standard 6.0

To support useful interpretations of score results, assessment instruments should have established procedures for test administration, scoring, reporting, and interpretation. Those responsible for administering, scoring, reporting, and interpreting should have sufficient training and supports to help them follow the established procedures. Adherence to the established procedures should be monitored, and any material errors should be documented and, if possible, corrected.

Comment: In order to support the validity of score interpretations, administration should follow any and all established procedures, and compliance with such procedures needs to be monitored.

Cluster 1. Test Administration

Standard 6.1

Test administrators should follow carefully the standardized procedures for administration and scoring specified by the test developer and any instructions from the test user.

Comment: Those responsible for testing programs should provide appropriate training, documentation, and oversight so that the individuals who administer or score the test(s) are proficient in the appropriate test administration or scoring procedures and understand the importance of adhering to the directions provided by the test developer. Large-scale testing programs should specify accepted standardized procedures for determining accommodations and other acceptable variations in test administration. Training should enable test administrators to make appropriate adjustments if an accommodation or modification is required that is not covered by the standardized procedures.

Specifications regarding instructions to test takers, time limits, the form of item presentation or response, and test materials or equipment should be strictly observed. In general, the same procedures should be followed as were used when obtaining the data for scaling and norming the test scores. Some programs do not scale or establish norms, such as portfolio assessments and most alternate academic assessments for students with severe cognitive disabilities. However, these programs typically have specified standardized procedures for administration and scoring when they establish performance standards. A test taker with a disability may require variations to provide access without changing the construct that is measured. Other special circumstances may require some flexibility in administration, such as language support to provide access under certain conditions, or some clinical or neuropsychological evaluations, in addition to procedures related to accommodations. Judgments of the suitability of adjustments should be tempered by the consideration that departures from standard procedures may jeopardize the validity or complicate the comparability of the test score interpretations. These judgments should be made by qualified individuals and be consistent with the guidelines provided by the test user or test developer.

Policies regarding retesting should be established by the test developer or user. The test user and administrator should follow the established policy. Such retest policies should be clearly communicated

by the test user as part of the conditions for standardized test administration. Retesting is intended to decrease the probability that a person will be incorrectly classified as not meeting some standard. For example, some testing programs specify that a person may retake the test; some offer multiple opportunities to take a test, for example when passing the test is required for high school graduation or credentialing.

Test developers should specify the standardized administration conditions that support intended uses of score interpretations. Test users should be aware of the implications of less controlled administration conditions. Test users are responsible for providing technical and other support to help ensure that test administrations meet these conditions to the extent possible. However, technology and the Internet have made it possible to administer tests in many settings, including settings in which the administration conditions may not be strictly controlled or monitored. Those who allow lack of standardization are responsible for providing evidence that the lack of standardization did not affect test taker performance or the quality or comparability of the scores produced. Complete documentation would include reporting the extent to which standardized administration conditions were not met.

Characteristics such as time limits, choices about item types and response formats, complex interfaces, and instructions that potentially add construct-irrelevant variance should be scrutinized in terms of the test purpose and the constructs being measured. Appropriate usability and empirical research should be carried out, as feasible, to document and ideally minimize the impact of sources or conditions that contribute to construct-irrelevant variability.

Standard 6.2

When formal procedures have been established for requesting and receiving accommodations, test takers should be informed of these procedures in advance of testing.

Comment: When testing programs have established procedures and criteria for identifying and providing accommodations for test takers, the procedures and criteria should be carefully followed and documented. Ideally, these procedures include how to consider the instances when some alternative may be appropriate in addition to those accommodations foreseen and specified by the test developer. Test takers should be informed of any testing accommodations that may be available to them and the process and requirements, if any, for obtaining needed accommodations. Similarly, in educational settings, appropriate school personnel and parents/legal guardians should be informed of the requirements, if any, for obtaining needed accommodations for students being tested.

Standard 6.3

Changes or disruptions to standardized test administration procedures or scoring should be documented and reported to the test user.

Comment: Information about the nature of changes to standardized administration or scoring procedures should be maintained in secure data files so that research studies or case reviews based on test records can take it into account. This includes not only accommodations or modifications for particular test takers but also disruptions in the testing environment that may affect all test takers in the testing session. A researcher may wish to use only the records based on standardized administration. In other cases, research studies may depend on such information to form groups of test takers. Test users or test sponsors should establish policies specifying who secures the data files, who may have access to the files, and, if necessary, how to maintain confidentiality of respondents, for example by de-identifying respondents. Whether the information about deviations from standard procedures is reported to users of test data depends on considerations such as whether the users are admissions officers or users of individualized psychological reports in clinical settings. If such reports are made, it may be appropriate to include clear documentation of any deviation from standard administration procedures, discussion of how such administrative variations may have

affected the results, and perhaps certain cautions. For example, test users may need to be informed about the comparability of scores when modifications are provided (see chap. 3, "Fairness in Testing," and chap. 9, "The Rights and Responsibilities of Test Users"). If a deviation or change to a standardized test administration procedure is judged significant enough to adversely affect the validity of score interpretation, then appropriate action should be taken, such as not reporting the scores, invalidating the scores, or providing opportunities for readministration under appropriate circumstances. Testing environments that are not monitored (e.g., in temporary conditions or on the Internet) should meet these standardized administration conditions; otherwise, the report on scores should note that standardized conditions were not guaranteed.

Standard 6.4

The testing environment should furnish reasonable comfort with minimal distractions to avoid construct-irrelevant variance.

Comment: Test developers should provide information regarding the intended test administration conditions and environment. Noise, disruption in the testing area, extremes of temperature, poor lighting, inadequate work space, illegible materials, and malfunctioning computers are among the conditions that should be avoided in testing situations, unless measuring the construct requires such conditions. The testing site should be readily accessible. Technology-based administrations should avoid distractions such as equipment or Internet-connectivity failures, or large variations in the time taken to present test items or score responses. Testing sessions should be monitored where appropriate to assist the test taker when a need arises and to maintain proper administrative procedures. In general, the testing conditions should be equivalent to those that prevailed when norms and other interpretative data were obtained.

Standard 6.5

Test takers should be provided appropriate instructions, practice, and other support necessary to reduce construct-irrelevant variance.

Comment: Instructions to test takers should clearly indicate how to make responses, except when doing so would obstruct measurement of the intended construct (e.g., when an individual's spontaneous approach to the test-taking situation is being assessed). Instructions should also be given in the use of any equipment or software likely to be unfamiliar to test takers, unless accommodating to unfamiliar tools is part of what is being assessed. The functions or interfaces of computer-administered tests may be unfamiliar to some test takers, who may need to be shown how to log on, navigate, or access tools. Practice opportunities should be given when equipment is involved, unless use of the equipment is being assessed. Some test takers may need practice responding with particular means required by the test, such as filling in a multiple-choice "bubble" or interacting with a multimedia simulation. Where possible, practice responses should be monitored to confirm that the test taker is making acceptable responses. If a test taker is unable to use the equipment or make the responses, it may be appropriate to consider alternative testing modes. In addition, test takers should be clearly informed on how their rate of work may affect scores, and how certain responses, such as not responding, guessing, or responding incorrectly, will be treated in scoring, unless such directions would undermine the construct being assessed.

Standard 6.6

Reasonable efforts should be made to ensure the integrity of test scores by eliminating opportunities for test takers to attain scores by fraudulent or deceptive means.

Comment: In testing programs where the results may be viewed as having important consequences, score integrity should be supported through active

efforts to prevent, detect, and correct scores obtained by fraudulent or deceptive means. Such efforts may include, when appropriate and practicable, stipulating requirements for identification, constructing seating charts, assigning test takers to seats, requiring appropriate space between seats, and providing continuous monitoring of the testing process. Test developers should design test materials and procedures to minimize the possibility of cheating. A local change in the date or time of testing may offer an opportunity for cheating. Test administrators should be trained on how to take appropriate precautions against and detect opportunities to cheat, such as opportunities afforded by technology that would allow a test taker to communicate with an accomplice outside the testing area, or technology that would allow a test taker to copy test information for subsequent disclosure. Test administrators should follow established policies for dealing with any instances of testing irregularity. In general, steps should be taken to minimize the possibility of breaches in test security, and to detect any breaches. In any evaluation of work products (e.g., portfolios), steps should be taken to ensure that the product represents the test taker's own work, and that the amount and kind of assistance provided is consistent with the intent of the assessment. Ancillary documentation, such as the date when the work was done, may be useful. Testing programs may use technologies during scoring to detect possible irregularities, such as computer analyses of erasure patterns, similar answer patterns for multiple test takers, plagiarism from online sources, or unusual item parameter shifts. Users of such technologies are responsible for their accuracy and appropriate application. Test developers and test users may need to monitor for disclosure of test items on the Internet or from other sources. Testing programs with high-stakes consequences should have defined policies and procedures for detecting and processing potential testing irregularities, including a process by which a person charged with an irregularity can qualify for and/or present an appeal, and for invalidating test scores and providing opportunity for retesting.
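Screening for unusually similar answer patterns, one of the irregularity checks mentioned above, can be sketched very simply: compare the proportion of identical responses for each pair of test takers and flag pairs far above the typical level. Operational forensic indices are considerably more sophisticated than this; the response strings and the 0.9 threshold below are invented purely for illustration.

# Minimal illustration of answer-similarity screening: flag pairs of test
# takers whose identical-response rate is unusually high. Responses and the
# 0.9 threshold are invented; operational indices are far more elaborate.
from itertools import combinations

answers = {
    "tt01": "ABCDABCDAB",
    "tt02": "ABCDABCDAB",   # identical to tt01 -> flagged for review
    "tt03": "ACBDABDCAB",
    "tt04": "BBCDACCDAB",
}

THRESHOLD = 0.9
for (a, ra), (b, rb) in combinations(answers.items(), 2):
    match = sum(x == y for x, y in zip(ra, rb)) / len(ra)
    if match >= THRESHOLD:
        print(f"review pair {a}/{b}: {match:.0%} identical responses")

A flag of this kind warrants human review under established irregularity policies rather than automatic invalidation.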

Standard 6.7

Test users have the responsibility of protecting the security of test materials at all times.

Comment: Those who have test materials under their control should, with due consideration of ethical and legal requirements, take all steps necessary to ensure that only individuals with legitimate needs and qualifications for access to test materials are able to obtain such access before the test administration, and afterwards as well, if any part of the test will be reused at a later time. Concerns with inappropriate access to test materials include inappropriate disclosure of test content, tampering with test responses or results, and protection of test takers' privacy rights. Test users must balance test security with the rights of all test takers and test users. When sensitive test documents are at issue in court or in administrative agency challenges, it is important to identify security and privacy concerns and needed protections at the outset. Parties should ensure that the release or exposure of such documents (including specific sections of those documents that may warrant redaction) to third parties, experts, and the courts/agencies themselves are consistent with conditions (often reflected in protective orders) that do not result in inappropriate disclosure and that do not risk unwarranted release beyond the particular setting in which the challenge has occurred. Under certain circumstances, when sensitive test documents are challenged, it may be appropriate to employ an independent third party, using a closely supervised secure procedure to conduct a review of the relevant materials rather than placing tests, manuals, or a test taker's test responses in the public record. Those who have confidential information related to testing, such as registration information, scheduling, and payments, have similar responsibility for protecting that information. Those with test materials under their control should use and disclose such information only in accordance with any applicable privacy laws.

Cluster 2. Test Scoring

Standard 6.8

Those responsible for test scoring should establish scoring protocols. Test scoring that involves human judgment should include rubrics, procedures, and criteria for scoring. When scoring of complex responses is done by computer, the accuracy of the algorithm and processes should be documented.

Comment: A scoring protocol should be established, which may be as simple as an answer key for multiple-choice questions. For constructed responses, scorers (humans or machine programs) may be provided with scoring rubrics listing acceptable alternative responses, as well as general criteria. A common practice of test developers is to provide scoring training materials, scoring rubrics, and examples of test takers' responses at each score level. When tests or items are used over a period of time, scoring materials should be reviewed periodically.

Standard 6.9

Those responsible for test scoring should establish and document quality control processes and criteria. Adequate training should be provided. The quality of scoring should be monitored and documented. Any systematic source of scoring errors should be documented and corrected.

Comment: Criteria should be established for acceptable scoring quality. Procedures should be instituted to calibrate scorers (human or machine) prior to operational scoring, and to monitor how consistently scorers are scoring in accordance with those established standards during operational scoring. Where scoring is distributed across scorers, procedures to monitor raters' accuracy and reliability may also be useful as a quality control procedure. Consistency in applying scoring criteria is often checked by independently rescoring randomly selected test responses. Periodic checks of the statistical properties (e.g., means, standard deviations, percentage of agreement with scores previously determined to be accurate) of scores assigned by individual scorers during a scoring session can provide feedback for the scorers, helping them to maintain scoring standards. In addition, analyses might monitor possible effects on scoring accuracy of variables such as scorer, task, time or day of scoring, scoring trainer, scorer pairing, and so on, to inform appropriate corrective or preventative actions. When the same items are used in multiple administrations, programs should have procedures in place to monitor consistency of scoring across administrations (e.g., year-to-year comparability). One way to check for consistency over time is to rescore some responses from earlier administrations. Inaccurate or inconsistent scoring may call for retraining, rescoring, dismissing some scorers, and/or reexamining the scoring rubrics or programs. Systematic scoring errors should be corrected, which may involve rescoring responses previously scored, as well as correcting the source of the error. Clerical and mechanical errors should be examined. Scoring errors should be minimized and, when they are found, steps should be taken promptly to minimize their recurrence.

Typically, those responsible for scoring will document the procedures followed for scoring, procedures followed for quality assurance of that scoring, the results of the quality assurance, and any unusual circumstances. Depending on the test user, that documentation may be provided regularly or upon reasonable request. Computerized scoring applications of text, speech, or other constructed responses should provide similar documentation of accuracy and reliability, including comparisons with human scoring.

When scoring is done locally and requires scorer judgment, the test user is responsible for providing adequate training and instruction to the scorers and for examining scorer agreement and accuracy. The expected level of scorer agreement and accuracy should be documented, as feasible.
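The agreement and accuracy monitoring described above can be computed in a few lines once double-scored responses are available. The sketch below (with invented ratings) reports exact and adjacent agreement between two scorers, and each scorer's exact agreement with pre-validated scores; which statistics to report, and the acceptable levels, remain program decisions.

# Illustrative scorer quality-control statistics for double-scored essays:
# exact agreement, adjacent agreement, and agreement with validated scores.
# All ratings are invented.
scorer1 = [3, 4, 2, 5, 3, 4, 1, 3, 4, 2]
scorer2 = [3, 3, 2, 5, 4, 4, 1, 2, 4, 2]
validated = [3, 4, 2, 5, 4, 4, 1, 3, 4, 2]   # scores vetted by experts

def exact(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def adjacent(a, b):
    return sum(abs(x - y) <= 1 for x, y in zip(a, b)) / len(a)

print(f"scorer1 vs scorer2: exact={exact(scorer1, scorer2):.2f}, "
      f"adjacent={adjacent(scorer1, scorer2):.2f}")
print(f"scorer1 vs validated: exact={exact(scorer1, validated):.2f}")
print(f"scorer2 vs validated: exact={exact(scorer2, validated):.2f}")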

Cluster 3. Reporting and Interpretation

Standard 6.10

When test score information is released, those responsible for testing programs should provide interpretations appropriate to the audience. The interpretations should describe in simple language what the test covers, what scores represent, the precision/reliability of the scores, and how scores are intended to be used.

Comment: Test users should consult the interpretive material prepared by the test developer and should revise or supplement the material as necessary to present the local and individual results accurately and clearly to the intended audience, which may include clients, legal representatives, media, referral sources, test takers, parents, or teachers. Reports and feedback should be designed to support valid interpretations and use, and minimize potential negative consequences. Score precision might be depicted by error bands or likely score ranges, showing the standard error of measurement. Reports should include discussion of any administrative variations or behavioral observations in clinical settings that may affect results and interpretations. Test users should avoid misinterpretation and misuse of test score information. While test users are primarily responsible for avoiding misinterpretation and misuse, the interpretive materials prepared by the test developer or publisher may address common misuses or misinterpretations. To accomplish this, developers of reports and interpretive materials may conduct research to help verify that reports and materials can be interpreted as intended (e.g., focus groups with representative end-users of the reports). The test developer should inform test users of changes in the test over time that may affect test score interpretation, such as changes in norms, test content frameworks, or scale score meanings.
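One conventional way to construct such an error band uses the classical relationship SEM = SD x sqrt(1 - reliability). The sketch below is illustrative only; the scale standard deviation, reliability, and observed score are hypothetical.

```python
# Illustrative only: a likely score range based on the standard error of
# measurement (SEM). All numeric values here are hypothetical.
import math

def sem(sd, reliability):
    """Classical standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def score_band(observed, sd, reliability, z=1.96):
    """Approximate 95% likely score range around an observed score."""
    margin = z * sem(sd, reliability)
    return observed - margin, observed + margin

low, high = score_band(observed=512, sd=40, reliability=0.91)
print(f"Reported score 512, likely range {low:.0f}-{high:.0f}")
```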
Standard 6.11

When automatically generated interpretations of test response protocols or test performance are reported, the sources, rationale, and empirical basis for these interpretations should be available, and their limitations should be described.

Comment: Interpretations of test results are sometimes automatically generated, either by a computer program in conjunction with computer scoring, or by manually prepared materials. Automatically generated interpretations may not be able to take into consideration the context of the individual's circumstances. Automatically generated interpretations should be used with care in diagnostic settings, because they may not take into account other relevant information about the individual test taker that provides context for test results, such as age, gender, education, prior employment, psychosocial situation, health, psychological history, and symptomatology. Similarly, test developers and test users of automatically generated interpretations of academic performance and accompanying prescriptions for instructional follow-up should report the bases and limitations of the interpretations. Test interpretations should not imply that empirical evidence exists for a relationship among particular test results, prescribed interventions, and desired outcomes, unless empirical evidence is available for populations similar to those representative of the test taker.

Standard 6.12

When group-level information is obtained by aggregating the results of partial tests taken by individuals, evidence of validity and reliability/precision should be reported for the level of aggregation at which results are reported. Scores should not be reported for individuals without appropriate evidence to support the interpretations for intended uses.

Comment: Large-scale assessments often achieve efficiency by "matrix sampling" the content domain by asking different test takers different questions.


The testing then requires less time from each test taker, while the aggregation of individual results provides for domain coverage that can be adequate for meaningful group- or program-level interpretations, such as for schools or grade levels within a locality or particular subject areas. However, because the individual is administered only an incomplete test, an individual score would have limited meaning, if any.
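As a purely illustrative sketch of this aggregation logic, with invented forms, items, and responses, group-level coverage of an item pool can be summarized even though no individual answered every item:

```python
# Illustrative only: group-level estimation under matrix sampling. Each test
# taker answers only one "form" (a subset of the item pool); item-level
# results are aggregated to describe the group, not individuals.
from collections import defaultdict

# responses[test_taker] = {item_id: 1 correct / 0 incorrect} for that form
responses = {
    "s1": {"i1": 1, "i2": 0, "i3": 1},   # form A
    "s2": {"i4": 1, "i5": 1, "i6": 0},   # form B
    "s3": {"i1": 0, "i2": 1, "i3": 1},   # form A
    "s4": {"i4": 1, "i5": 0, "i6": 1},   # form B
}

totals = defaultdict(lambda: [0, 0])  # item_id -> [correct, attempted]
for answers in responses.values():
    for item, score in answers.items():
        totals[item][0] += score
        totals[item][1] += 1

# Group-level domain estimate: mean proportion correct across the item pool.
p_values = {item: c / n for item, (c, n) in totals.items()}
domain_estimate = sum(p_values.values()) / len(p_values)
print(f"Estimated group proportion correct: {domain_estimate:.2f}")
# Note: no individual score is computed; each person saw only part of the pool.
```

The point of the sketch is the asymmetry the comment describes: the item pool is covered at the group level, while any single test taker's three-item total would support, at best, a very limited interpretation.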
Standard 6.13

When a material error is found in test scores or other important information issued by a testing organization or other institution, this information and a corrected score report should be distributed as soon as practicable to all known recipients who might otherwise use the erroneous scores as a basis for decision making. The corrected report should be labeled as such. What was done to correct the reports should be documented. The reason for the corrected score report should be made clear to the recipients of the report.

Comment: A material error is one that could change the interpretation of the test score and make a difference in a significant way. An example is an erroneous test score (e.g., incorrectly computed or fraudulently obtained) that would affect an important decision about the test taker, such as a credentialing decision or the awarding of a high school diploma. Innocuous typographical errors would be excluded. Timeliness is essential for decisions that will be made soon after the test scores are received. Where test results have been used to inform high-stakes decisions, corrective actions by test users may be necessary to rectify circumstances affected by erroneous scores, in addition to issuing corrected reports. The reporting or corrective actions may not be possible or practicable in certain work or other settings. Test users should develop a policy of how to handle material errors in test scores and should document what was done in the case of suspected or actual material errors.

Standard 6.14

Organizations that maintain individually identifiable test score information should develop a clear set of policy guidelines on the duration of retention of an individual's records and on the availability and use over time of such data for research or other purposes. The policy should be documented and available to the test taker. Test users should maintain appropriate data security, which should include administrative, technical, and physical protections.

Comment: In some instances, test scores become obsolete over time, no longer reflecting the current state of the test taker. Outdated scores should generally not be used or made available, except for research purposes. In other cases, test scores obtained in past years can be useful, as in longitudinal assessment or the tracking of deterioration of function or cognition. The key issue is the valid use of the information. Organizations and individuals who maintain individually identifiable test score information should be aware of and comply with legal and professional requirements. Organizations and individuals who maintain test scores on individuals may be requested to provide data to researchers or other third-party users. Where data release is deemed appropriate and is not prohibited by statutes or regulations, the test user should protect the confidentiality of the test takers through appropriate policies, such as de-identifying test data or requiring nondisclosure and confidentiality of the data. Organizations and individuals who maintain or use confidential information about test takers or their scores should have and implement an appropriate policy for maintaining security and integrity of the data, including protecting from accidental or deliberate modification as well as preventing loss or unauthorized destruction. In some cases, organizations may need to obtain test takers' consent to use or disclose records. Adequate security and appropriate protocols should be established when confidential test data are made part of a larger record (e.g., an electronic medical record) or merged into a data warehouse. If records are to be released for clinical and/or forensic evaluations, care should be taken to release them to appropriately licensed individuals, with appropriate signed release authorization by the test taker or appropriate legal authority.
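De-identification, one of the protective policies mentioned above, can take many forms. The sketch below is a minimal illustration with hypothetical field names and a placeholder secret key: direct identifiers are dropped, and the student identifier is replaced with a keyed one-way hash so records can be linked for research without being traceable by recipients.

```python
# Illustrative only: one common de-identification step before releasing test
# data to researchers. Field names and the secret key are hypothetical.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # held by the data owner, never released

def pseudonymize(record):
    """Return a research copy of a score record without direct identifiers."""
    token = hmac.new(SECRET_KEY, record["student_id"].encode(), hashlib.sha256)
    return {
        "pseudo_id": token.hexdigest()[:16],  # stable pseudonym, not reversible
        "score": record["score"],
        "test_form": record["test_form"],
        # name, date of birth, and other direct identifiers deliberately omitted
    }

row = {"student_id": "A123456", "name": "Jane Doe", "score": 514, "test_form": "B"}
print(pseudonymize(row))
```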
Standard 6.15

When individual test data are retained, both the test protocol and any written report should also be preserved in some form.

Comment: The protocol may be needed to respond to a possible challenge from a test taker or to facilitate interpretation at a subsequent time. The protocol would ordinarily be accompanied by testing materials and test scores. Retention of more detailed records of responses would depend on circumstances and should be covered in a retention policy. Record keeping may be subject to legal and professional requirements. Policy for the release of any test information for other than research purposes is discussed in chapter 9, "The Rights and Responsibilities of Test Users."

Standard 6.16

Transmission of individually identified test scores to authorized individuals or institutions should be done in a manner that protects the confidential nature of the scores and pertinent ancillary information.

Comment: Care is always needed when communicating the scores of identified test takers, regardless of the form of communication. Similar care may be needed to protect the confidentiality of ancillary information, such as personally identifiable information on disability status for students or clinical test scores shared between practitioners. Appropriate caution with respect to confidential information should be exercised in communicating face to face, as well as by telephone, fax, and other forms of written communication. Similarly, transmission of test data through electronic media and transmission and storage on computer networks, including wireless transmission and storage or processing on the Internet, require caution to maintain appropriate confidentiality and security. Data integrity must also be maintained by preventing inappropriate modification of results during such transmissions. Test users are responsible for understanding and adhering to applicable legal obligations in their data management, transmission, use, and retention practices, including collection, handling, storage, and disposition. Test users should set and follow appropriate security policies regarding confidential test data and other assessment information. Release of clinical raw data, tests, or protocols to third parties should follow laws, regulations, and guidelines provided by professional organizations and should take into account the impact of availability of tests in public domains (e.g., court proceedings) and the potential for violation of intellectual property rights.

7. SUPPORTING DOCUMENTATION
FOR TESTS
BACKGROUND
This chapter provides general standards for the preparation and publication of test documentation by test developers, publishers, and other providers of tests. Other chapters contain specific standards that should be useful in the preparation of materials to be included in a test's documentation. In addition, test users may have their own documentation requirements. The rights and responsibilities of test users are discussed in chapter 9.

The supporting documents for tests are the primary means by which test developers, publishers, and other providers of tests communicate with test users. These documents are evaluated on the basis of their completeness, accuracy, currency, and clarity and should be available to qualified individuals as appropriate. A test's documentation typically specifies the nature of the test; the use(s) for which it was developed; the processes involved in the test's development; technical information related to scoring, interpretation, and evidence of validity, fairness, and reliability/precision; scaling, norming, and standard-setting information if appropriate to the instrument; and guidelines for test administration, reporting, and interpretation. The objective of the documentation is to provide test users with the information needed to help them assess the nature and quality of the test, the resulting scores, and the interpretations based on the test scores. The information may be reported in documents such as test manuals, technical manuals, user's guides, research reports, specimen sets, examination kits, directions for test administrators and scorers, or preview materials for test takers.

Regardless of who develops a test (e.g., test publisher, certification or licensure board, employer, or educational institution) or how many users exist, the development process should include thorough, timely, and useful documentation. Although proper documentation of the evidence supporting the interpretation of test scores for proposed uses of a test is important, failure to formally document such evidence in advance does not automatically render the corresponding test use or interpretation invalid. For example, consider an unpublished employment selection test developed by a psychologist solely for internal use within a single organization, where there is an immediate need to fill vacancies. The test may properly be put to operational use after needed validity evidence is collected but before formal documentation of the evidence is completed. Similarly, a test used for certification may need to be revised frequently, in which case technical reports describing the test's development as well as information concerning item, exam, and candidate performance should be produced periodically, but not necessarily prior to every exam.

Test documentation is effective if it communicates information to user groups in a manner that is appropriate for the particular audience. To accommodate the breadth of training of those who use tests, separate documents or sections of documents may be written for identifiable categories of users such as practitioners, consultants, administrators, researchers, educators, and sometimes examinees. For example, the test user who administers the tests and interprets the results needs guidelines for doing so. Those who are responsible for selecting tests need to be able to judge the technical adequacy of the tests and therefore need some combination of technical manuals, user's guides, test manuals, test supplements, examination kits, and specimen sets. Ordinarily, these supporting documents are provided to potential test users or test reviewers with sufficient information to enable them to evaluate the appropriateness and technical adequacy of a test. The types of information presented in these documents typically include a description of the intended test-taking population, stated purpose of the test, test specifications, item formats, administration and scoring procedures, test security protocols, cut scores or other standards, and a description of the test development process. Also typically provided are summaries of technical data such as psychometric indices of the items; reliability/precision and validity evidence; normative data; and cut scores or rules for combining scores, including those for computer-generated interpretations of test scores.

An essential feature of the documentation for every test is a discussion of the common appropriate and inappropriate uses and interpretations of the test scores and a summary of the evidence supporting these conclusions. The inclusion of examples of score interpretations consistent with the test developer's intended applications helps users make accurate inferences on the basis of the test scores. When possible, examples of improper test uses and inappropriate test score interpretations can help guard against the misuse of the test or its scores. When feasible, common negative unintended consequences of test use (including missed opportunities) should be described and suggestions given for avoiding such consequences.

Test documents need to include enough information to allow test users and reviewers to determine the appropriateness of the test for its intended uses. Other materials that provide more details about research by the publisher or independent investigators (e.g., the samples on which the research is based and summative data) should be cited and should be readily obtainable by the test user or reviewer. This supplemental material can be provided in any of a variety of published or unpublished forms and in either paper or electronic formats.

In addition to technical documentation, descriptive materials are needed in some settings to inform examinees and other interested parties about the nature and content of a test. The amount and type of information provided will depend on the particular test and application. For example, in situations requiring informed consent, information should be sufficient for test takers (or their representatives) to make a sound judgment about the test. Such information should be phrased in nontechnical language and should contain information that is consistent with the use of the test scores and is sufficient to help the user make an informed decision. The materials may include a general description and rationale for the test; intended uses of the test results; sample items or complete sample tests; and information about conditions of test administration, confidentiality, and retention of test results. For some applications, however, the true nature and purpose of the test are purposely hidden or disguised to prevent faking or response bias. In these instances, examinees may be motivated to reveal more or less of a characteristic intended to be assessed. Hiding or disguising the true nature or purpose of a test is acceptable provided that the actions involved are consistent with legal principles and ethical standards.


STANDARDS FOR SUPPORTING DOCUMENTATION FOR TESTS


The standards in this chapter begin with an overarching standard (numbered 7.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into four thematic clusters labeled as follows:

1. Content of Test Documents: Appropriate Use
2. Content of Test Documents: Test Development
3. Content of Test Documents: Test Administration and Scoring
4. Timeliness of Delivery of Test Documents

Standard 7.0

Information relating to tests should be clearly documented so that those who use tests can make informed decisions regarding which test to use for a specific purpose, how to administer the chosen test, and how to interpret test scores.

Comment: Test developers and publishers should provide general information to help test users and researchers determine the appropriateness of an intended test use in a specific context. When test developers and publishers become aware of a particular test use that cannot be justified, they should indicate this fact clearly. General information also should be provided for test takers and legal guardians who must provide consent prior to a test's administration. (See Standard 8.4 regarding informed consent.) Administrators and even the general public may also need general information about the test and its results so that they can correctly interpret the results.

Test documents should be complete, accurate, and clearly written so that the intended audience can readily understand the content. Test documentation should be provided in a format that is accessible to the population for which it is intended. For tests used for educational accountability purposes, documentation should be made publicly available in a format and language that are accessible to potential users, including appropriate school personnel, parents, students from all relevant subgroups of intended test takers, and the members of the community (e.g., via the Internet). Test documentation in educational settings might also include guidance on how users could use test materials and results to improve instruction.

Test documents should provide sufficient detail to permit reviewers and researchers to evaluate important analyses published in the test manual or technical report. For example, reporting correlation matrices in the test document may allow the test user to judge the data on which decisions and conclusions were based. Similarly, describing in detail the sample and the nature of factor analyses that were conducted may allow the test user to replicate reported studies.

Test documentation will also help those who are affected by the score interpretations to decide whether to participate in the testing program or how to participate if participation is not optional.

Cluster 1. Content of Test Documents: Appropriate Use

Standard 7.1

The rationale for a test, recommended uses of the test, support for such uses, and information that assists in score interpretation should be documented. When particular misuses of a test can be reasonably anticipated, cautions against such misuses should be specified.

Comment: Test publishers should make every effort to caution test users against known misuses of tests. However, test publishers cannot anticipate all possible misuses of a test. If publishers do know of persistent test misuse by a test user, additional educational efforts, including providing information regarding potential harm to the individual, organization, or society, may be appropriate.


Standard 7.2

The population for whom a test is intended and specifications for the test should be documented. If normative data are provided, the procedures used to gather the data should be explained; the norming population should be described in terms of relevant demographic variables; and the year(s) in which the data were collected should be reported.

Comment: Known limitations of a test for certain populations should be clearly delineated in the test documents. For example, a test used to assess educational progress may not be appropriate for employee selection in business and industry. Other documentation can assist the user in identifying the appropriate normative information to use to interpret test scores appropriately. For example, the time of year in which the normative data were collected may be relevant in some educational settings. In organizational settings, information on the context in which normative data were gathered (e.g., in concurrent or predictive studies; for development or selection purposes) may also have implications for which norms are appropriate for operational use.

Standard 7.3

When the information is available and appropriately shared, test documents should cite a representative set of the studies pertaining to general and specific uses of the test.

Comment: If a study cited by the test publisher is not published, summaries should be made available on request to test users and researchers by the publisher.

Cluster 2. Content of Test Documents: Test Development

Standard 7.4

Test documentation should summarize test development procedures, including descriptions and the results of the statistical analyses that were used in the development of the test, evidence of the reliability/precision of scores and the validity of their recommended interpretations, and the methods for establishing performance cut scores.

Comment: When applicable, test documents should include descriptions of the procedures used to develop items and create the item pool, to create tests or forms of tests, to establish scales for reported scores, and to set standards and rules for cut scores or combining scores. Test documents should also provide information that allows the user to evaluate bias or fairness for all relevant groups of intended test takers when it is meaningful and feasible for such studies to be conducted. In addition, other statistical data should be provided as appropriate, such as item-level information, information on the effects of various cut scores (e.g., number of candidates passing at potential cut scores, level of adverse impact at potential cut scores), information about raw scores and reported scores, normative data, the standard errors of measurement, and a description of the procedures used to equate multiple forms. (See chaps. 3 and 4 for more information on the evaluation of fairness and on procedures and statistics commonly used in test development.)
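The effects-of-cut-scores information mentioned in the comment is straightforward to tabulate. The following sketch is illustrative only, with invented score distributions for two groups; it reports the passing rate at several candidate cut scores and a simplified adverse-impact ratio (lower rate divided by higher rate) at each.

```python
# Illustrative only: the kind of cut-score impact summary a technical report
# might tabulate. Groups, scores, and candidate cut scores are hypothetical.
def pass_rate(scores, cut):
    return sum(s >= cut for s in scores) / len(scores)

group_a = [68, 72, 75, 80, 83, 85, 90, 92]
group_b = [60, 65, 70, 74, 78, 81, 88, 95]

for cut in (70, 75, 80):
    rate_a, rate_b = pass_rate(group_a, cut), pass_rate(group_b, cut)
    # Simplified adverse-impact ratio: lower passing rate over higher rate.
    impact_ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
    print(f"cut={cut}: pass A={rate_a:.2f}, pass B={rate_b:.2f}, "
          f"adverse-impact ratio={impact_ratio:.2f}")
```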
Standard 7.5

Test documents should record the relevant characteristics of the individuals or groups of individuals who participated in data collection efforts associated with test development or validation (e.g., demographic information, job status, grade level); the nature of the data that were contributed (e.g., predictor data, criterion data); the nature of judgments made by subject matter experts (e.g., content validation linkages); the instructions that were provided to participants in data collection efforts for their specific tasks; and the conditions under which the test data were collected in the validity study.

Comment: Test developers should describe the relevant characteristics of those who participated in various steps of the test development process and what tasks each person or group performed. For example, the participants who set the test cut scores and their relevant expertise should be documented. Depending on the use of the test results, relevant characteristics of the participants may include race/ethnicity, gender, age, employment status, education, disability status, and primary language. Descriptions of the tasks and the specific instructions provided to the participants may help future test users select and subsequently use the test appropriately. Testing conditions, such as the extent of proctoring in the validity study, may have implications for the generalizability of the results and should be documented. Any changes to the standardized testing conditions, such as accommodations or modifications made to the test or test administration, should also be documented. Test developers and users should take care to comply with applicable legal requirements and professional standards relating to privacy and data security when providing the documentation required by this standard.

Standard 7.6

When a test is available in more than one language, the test documentation should provide information on the procedures that were employed to translate and adapt the test. Information should also be provided regarding the reliability/precision and validity evidence for the adapted form when feasible.

Comment: In addition to providing information on translation and adaptation procedures, the test documents should include the demographics of translators and samples of test takers used in the adaptation process, as well as information on any score interpretation issues for each language into which the test has been translated and adapted. Evidence of reliability/precision, validity, and comparability of translated and adapted scores should be provided in test documentation when feasible. (See Standard 3.14, in chap. 3, for further discussion of translations.)

Cluster 3. Content of Test Documents: Test Administration and Scoring

Standard 7.7

Test documents should specify user qualifications that are required to administer and score a test, as well as the user qualifications needed to interpret the test scores accurately.

Comment: Statements of user qualifications should specify the training, certification, competencies, and experience needed to allow access to a test or scores obtained with it. When user qualifications are expressed in terms of the knowledge, skills, abilities, and other characteristics required to administer, score, and interpret a test, the test documentation should clearly define the requirements so the user can properly evaluate the competence of administrators.

Standard 7.8

Test documentation should include detailed instructions on how a test is to be administered and scored.

Comment: Regardless of whether a test is to be administered in paper-and-pencil format, computer format, or orally, or whether the test is performance based, instructions for administration should be included in the test documentation. As appropriate, these instructions should include all factors related to test administration, including qualifications, competencies, and training of test administrators; equipment needed; protocols for test administrators; timing instructions; and procedures for implementation of test accommodations. When available, test documentation should also include estimates of the time required to administer the test to clinical, disabled, or other special populations for whom the test is intended to be used, based on data obtained from these groups during the norming of the test. In addition, test users need instructions on how to score a test and what cut scores to use (or whether to use cut scores) in interpreting scores. If the test user does not score the test, instructions should be given on how to have a test scored. Finally, test administration documentation should include instructions for dealing with irregularities in test administration and guidance on how they should be documented.

If a test is designed so that more than one method can be used for administration or for recording responses, such as marking responses in a test booklet, on a separate answer sheet, or via computer, then the manual should clearly document the extent to which scores arising from application of these methods are interchangeable. If the scores are not interchangeable, this fact should be reported, and guidance should be given on the comparability of scores obtained under the various conditions or methods of administration.

Standard 7.9

If test security is critical to the interpretation of test scores, the documentation should explain the steps necessary to protect test materials and to prevent inappropriate exchange of information during the test administration session.

Comment: When the proper interpretation of test scores assumes that the test taker has not been exposed to the test content or received illicit assistance, the instructions should include procedures for ensuring the security of the testing process and of all test materials at all times. Security procedures may include guidance for storing and distributing test materials as well as instructions for maintaining a secure testing process, such as identifying test takers and seating test takers to prevent exchange of information. Test users should be aware that federal and state laws, regulations, and policies may affect security procedures.

In many situations, test scores should also be maintained securely. For example, in promotional testing in some employment settings, only the candidate and the staffing personnel are authorized to see the scores, and the candidate's current supervisor is specifically prohibited from viewing them. Documentation may include information on how test scores are stored and who is authorized to see the scores.

Standard 7.10

Tests that are designed to be scored and interpreted by test takers should be accompanied by scoring instructions and interpretive materials that are written in language the test takers can understand and that assist them in understanding the test scores.

Comment: If a test is designed to be scored by test takers or its scores interpreted by test takers, the publisher and test developer should develop procedures that facilitate accurate scoring and interpretation. Interpretive material may include information such as the construct that was measured, the test taker's results, and the comparison group. The appropriate language for the scoring procedures and interpretive materials is one that meets the particular language needs of the test taker. Thus, the scoring and interpretive materials may need to be offered in the native language of the test taker to be understood.

Standard 7.11

Interpretive materials for tests that include case studies should provide examples illustrating the diversity of prospective test takers.

Comment: When case studies can assist the user in the interpretation of the test scores and profiles, the case studies should be included in the test documentation and represent members of the subgroups for which the test is relevant. To illustrate the diversity of prospective test takers, case studies might cite examples involving women and men of different ages, individuals differing in sexual orientation, persons representing various racial/ethnic or cultural groups, and individuals with disabilities. Test developers may wish to inform users that the inclusion of such examples is intended to illustrate the diversity of prospective test takers and not to promote interpretation of test scores in a manner that conflicts with legal requirements such as race or gender norming in employment contexts.

Standard 7.12

When test scores are used to make predictions about future behavior, the evidence supporting those predictions should be provided to the test user.

Comment: The test user should be informed of any cut scores or rules for combining raw or reported scores that are necessary for understanding score interpretations. A description of both the group of judges used in establishing the cut scores and the methods used to derive the cut scores should be provided. When security or proprietary reasons necessitate the withholding of cut scores or rules for combining scores, the owners of the intellectual property are responsible for documenting evidence in support of the validity of interpretations for intended uses. Such evidence might be provided, for example, by reporting the finding of an independent review of the algorithms by qualified professionals. When any interpretations of test scores, including computer-generated interpretations, are provided, a summary of the evidence supporting the interpretations should be given, as well as the rules and guidelines used in making the interpretations.

Cluster 4. Timeliness of Delivery of Test Documents

Standard 7.13

Supporting documents (e.g., test manuals, technical manuals, user's guides, and supplemental material) should be made available to the appropriate people in a timely manner.

Comment: Supporting documents should be supplied in a timely manner. Some documents (e.g., administration instructions, user's guides, sample tests or items) must be made available prior to the first administration of the test. Other documents (e.g., technical manuals containing information based on data from the first administration) cannot be supplied prior to that administration; however, such documents should be created promptly. The test developer or publisher should judge carefully which information should be included in first editions of the test manual, technical manual, or user's guide and which information can be provided in supplements. For low-volume, unpublished tests, the documentation may be relatively brief. When the developer is also the user, documentation and summaries are still necessary.

Standard 7.14

When substantial changes are made to a test, the test's documentation should be amended, supplemented, or revised to keep information for users current and to provide useful additional information or cautions.

Comment: Supporting documents should clearly note the date of their publication as well as the name or version of the test for which the documentation is relevant. When substantial changes are made to items and scoring, information on the extent to which the old scores and new scores are interchangeable should be included in the test documentation.

Sometimes it is necessary to change a test or testing procedure to remove construct-irrelevant variance that may arise due to the characteristics of an individual that are unrelated to the construct being measured (e.g., when testing individuals with disabilities). When a test or testing procedures are altered, the documentation for the test should include a discussion of how the alteration may affect the validity and comparability of the test scores, and evidence should be provided to demonstrate the effect of the alteration on the scores obtained from the altered test or testing procedures, if sample size permits.

8. THE RIGHTS AND RESPONSIBILITIES
OF TEST TAKERS
BACKGROUND
This chapter addresses issues of fairness from the point of view of the individual test taker. Most aspects of fairness affect the validity of interpretations of test scores for their intended uses. The standards in this chapter address test takers' rights and responsibilities with regard to test security, their access to test results, and their rights when irregularities in their testing process are claimed. Other issues of fairness are addressed in chapter 3 ("Fairness in Testing"). General considerations concerning reports of test results are covered in chapter 6 ("Test Administration, Scoring, Reporting, and Interpretation"). Issues related to test takers' rights and responsibilities in clinical or individual settings are also discussed in chapter 10 ("Psychological Testing and Assessment").

The standards in this chapter are directed to test providers, not to test takers. It is the shared responsibility of the test developer, test administrator, test proctor (if any), and test user to provide test takers with information about their rights and their own responsibilities. The responsibility to inform the test taker should be apportioned according to particular circumstances.

Test takers have the right to be assessed with tests that meet current professional standards, including standards of technical quality, consistent treatment, fairness, conditions for test administration, and reporting of results. The chapters in Part I, "Foundations," and Part II, "Operations," deal specifically with fair and appropriate test design, development, administration, scoring, and reporting. In addition, test takers have a right to basic information about the test and how the test results will be used. In most situations, fair and equitable treatment of test takers involves providing information about the general nature of the test, the intended use of test scores, and the confidentiality of the results in advance of testing. When full disclosure of this information is not appropriate (as is the case with some psychological or employment tests), the information that is provided should be consistent across test takers. Test takers, or their legal representatives when appropriate, need enough information about the test and the intended use of test results to reach an informed decision about their participation.

In some instances, the laws or standards of professional practice, such as those governing research on human subjects, require formal informed consent for testing. In other instances (e.g., employment testing), informed consent is implied by other actions (e.g., submission of an employment application), and formal consent is not required. The greater the consequences to the test taker, the greater the importance of ensuring that the test taker is fully informed about the test and voluntarily consents to participate, except when testing without consent is permitted by law (e.g., when participating in testing is legally required or mandated by a court order). If a test is optional, the test taker has the right to know the consequences of taking or not taking the test. Under most circumstances, the test taker has the right to ask questions or express concerns and should receive a timely response to legitimate inquiries.

When consistent with the purposes and nature of the assessment, general information is usually provided about the test's content and purposes. Some programs, in the interest of fairness, provide all test takers with helpful materials, such as study guides, sample questions, or complete sample tests, when such information does not jeopardize the validity of the interpretations of results from future test administrations. Practice materials should have the same appearance and format as the actual test. A practice test for a Web-based assessment, for example, should be available via computer. Employee selection programs may legitimately provide more training to certain classes of test takers (e.g., internal applicants) and not to others (e.g., external applicants). For example, an organization may train current employees on skills that are measured on employment tests in the context of an employee development program but not offer that training to external applicants. Advice may also be provided about test-taking strategies, including time management and the advisability of omitting a response to an item (when omitting a response is permitted). Information on various testing policies, for example about making accommodations available and determining for which individuals the accommodations are appropriate, is also provided to the test taker. In addition, communications to test takers should include policies on retesting when major disruptions of the test administration occur, when the test taker feels that the present performance does not appropriately reflect his or her true capabilities, or when the test taker improves on his or her underlying knowledge, skills, abilities, or other personal characteristics.

As participants in the assessment, test takers have responsibilities as well as rights. Their responsibilities include being prepared to take the test, following the directions of the test administrator, representing themselves honestly on the test, and protecting the security of the test materials. Requests for accommodations or modifications are the responsibility of the test taker, or in the case of minors, the test taker's guardian. In group testing situations, test takers should not interfere with the performance of other test takers. In some testing programs, test takers are also expected to inform the appropriate persons in a timely manner if they believe there are reasons that their test results will not reflect their true capabilities.

The validity of score interpretations rests on the assumption that a test taker has earned fairly a particular score or categorical decision, such as "pass" or "fail." Many forms of cheating or other malfeasant behaviors can reduce the validity of the interpretations of test scores and cause harm to other test takers, particularly in competitive situations in which test takers' scores are compared. There are many forms of behavior that affect test scores, such as using prohibited aids or arranging for someone to take the test in the test taker's place. Similarly, there are many forms of behavior that jeopardize the security of test materials, including communicating the specific content of the test to other test takers in advance. The test taker is obligated to respect the copyrights in test materials and may not reproduce the materials without authorization or disseminate in any form material that is similar in nature to the test. Test takers, as well as test administrators, have the responsibility to protect test security by refusing to divulge any details of the test content to others, unless the particular test is designed to be openly available in advance. Failure to honor these responsibilities may compromise the validity of test score interpretations for the test taker and for others. Outside groups that develop items for test preparation should base those items on publicly disclosed information and not on information that has been inappropriately shared by test takers.

Sometimes, testing programs use special scores, statistical indicators, and other indirect information about irregularities in testing to examine whether the test scores have been obtained fairly. Unusual patterns of responses, large changes in test scores upon retesting, response speed, and similar indicators may trigger careful scrutiny of certain testing protocols and test scores. The details of the procedures for detecting problems are generally kept secure to avoid compromising their use. However, test takers should be informed that in special circumstances, such as response or test score anomalies, their test responses may receive special scrutiny. Test takers should be informed that their score may be canceled or other action taken if evidence of impropriety or fraud is discovered.
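The Standards intentionally do not specify detection procedures, and operational programs keep theirs confidential. Purely to illustrate the kind of indicator mentioned above, the sketch below flags unusually large retest gains for human review; the data, the standard deviation of difference scores, and the threshold are all invented.

```python
# Illustrative only: flagging unusually large score changes upon retesting for
# follow-up review (not for automatic action). Data and threshold are
# hypothetical; operational programs use validated, documented procedures.
def flag_large_gains(pairs, sd_diff, z_cutoff=3.0):
    """Return IDs whose retest gain exceeds z_cutoff standard errors."""
    flagged = []
    for taker_id, (first, second) in pairs.items():
        z = (second - first) / sd_diff
        if z > z_cutoff:
            flagged.append((taker_id, first, second, round(z, 2)))
    return flagged

retests = {"t1": (480, 495), "t2": (450, 610), "t3": (520, 535)}
print(flag_large_gains(retests, sd_diff=30))  # only t2's gain merits review
```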


STANDARDS FOR TEST TAKERS' RIGHTS AND RESPONSIBILITIES


The standards in this chapter begin with an overarching standard (numbered 8.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into four thematic clusters labeled as follows:

1. Test Takers' Rights to Information Prior to Testing
2. Test Takers' Rights to Access Their Test Results and to Be Protected From Unauthorized Use of Test Results
3. Test Takers' Rights to Fair and Accurate Score Reports
4. Test Takers' Responsibilities for Behavior Throughout the Test Administration Process

Standard 8.0

Test takers have the right to adequate information to help them properly prepare for a test so that the test results accurately reflect their standing on the construct being assessed and lead to fair and accurate score interpretations. They also have the right to protection of their personally identifiable score results from unauthorized access, use, or disclosure. Further, test takers have the responsibility to represent themselves accurately in the testing process and to respect copyright in test materials.

Comment: Specific standards for test takers' rights and responsibilities are described below. These include standards for the kinds of information that should be provided to test takers prior to testing so they can properly prepare to take the test and so that their results accurately reflect their standing on the construct being assessed. Standards also cover test takers' access to their test results; protection of the results from unauthorized access, use, or disclosure by others; and test takers' rights to fair and accurate score reports. In addition, standards in this chapter address the responsibility of test takers to represent themselves fairly and accurately during the testing process and to respect the confidentiality of copyright in all test materials.

Cluster 1. Test Takers' Rights to Information Prior to Testing

Standard 8.1

Information about test content and purposes that is available to any test taker prior to testing should be available to all test takers. Shared information should be available free of charge and in accessible formats.

Comment: The intent of this standard is equitable treatment for all test takers with respect to access to basic information about a testing event, such as when and where the test will be given, what materials should be brought, what the purpose of the test is, and how the results will be used. When applicable, such offerings should be made to all test takers and, to the degree possible, should be in formats accessible to all test takers. Accessibility of formats also applies to information that may be provided on a public website. For example, depending on the format of the information, conversions can be made so that individuals with visual disabilities can access textual or graphical material. For test takers with disabilities, providing these materials in accessible formats may be required by law.

It merits noting that while general information about test content and purpose should be made available to all test takers, some organizations may supplement this information with additional training or coaching. For example, some employers may teach basic skills to workers to help them qualify for higher level positions. Similarly, one teacher in a school may choose to drill students on a topic that will be tested while other teachers focus on other topics.


Standard 8.2

Test takers should be provided in advance with as much information about the test, the testing process, the intended test use, test scoring criteria, testing policy, availability of accommodations, and confidentiality protection as is consistent with obtaining valid responses and making appropriate interpretations of test scores.

Comment: When appropriate, test takers should be informed in advance about test content, including subject area, topics covered, and item formats. General advice should be given about test-taking strategies. For example, test takers should usually be informed about the advisability of omitting responses and made aware of any imposed time limits, so that they can manage their time appropriately. For computer administrations, test takers should be shown samples of the interface they will be expected to use during the test and be provided an opportunity to practice with those tools and master their use before the test begins. In addition, they should be told about possibilities for revisiting items they have previously answered or omitted.

In most testing situations, test takers should be informed about the intended use of test scores and the extent of the confidentiality of test results, and should be told whether and when they will have access to their results. Exceptions occur when knowledge of the purposes or intended score uses would violate the integrity of the interpretations of the scores, such as when the test is intended to detect malingering. If a record of the testing session is kept in written, video, audio, or any other form, or if other records associated with the testing event, such as scoring information, are kept, test takers are entitled to know what testing information will be released and to whom and for what purposes the results will be used. In some cases, legal standards apply to information about the use and confidentiality of, and test-taker access to, test scores. Policies concerning retesting should also be communicated. Test takers should be warned against improper behavior and made cognizant of the consequences of misconduct, such as cheating, that could result in their being prohibited from completing the test or receiving test scores, or could make them subject to other sanctions. Test takers should be informed, at least in a general way, if there will be special scrutiny of testing protocols or score patterns to detect breaches of security.

Standard 8.3

When the test taker is offered a choice of test format, information about the characteristics of each format should be provided.

Comment: Test takers sometimes may choose between paper-and-pencil administration of a test and computer administration. Some tests are offered in different languages. Sometimes, an alternative assessment is offered. Test takers need to know the characteristics of each alternative that is available to them so that they can make an informed choice.

Standard 8.4

Informed consent should be obtained from test takers, or from their legal representatives when appropriate, before testing begins, except (a) when testing without consent is mandated by law or governmental regulation, (b) when testing is conducted as a regular part of school activities, or (c) when consent is clearly implied, such as in employment settings. Informed consent may be required by applicable law and professional standards.

Comment: Informed consent implies that the test takers or their representatives are made aware, in language that they can understand, of the reasons for testing, the types of tests to be used, the intended uses of test takers' test results or other information, and the range of material consequences of the intended use. It is generally recommended that persons be asked directly to give their formal consent rather than being asked only to indicate if they are withholding their consent.


Consent is not required when testing is legally mandated, as in the case of a court-ordered psychological assessment, although there may be legal requirements for providing information about the testing session outcomes to the test taker. Nor is consent typically required in educational settings for tests administered to all pupils. When testing is required for employment, credentialing, or educational admissions, applicants, by applying, have implicitly given consent to the testing. When feasible, the person explaining the reason for a test should be experienced in communicating with individuals within the intended population for the test (e.g., individuals with disabilities or from different linguistic backgrounds).

Cluster 2. Test Takers' Rights to Access Their Test Results and to Be Protected From Unauthorized Use of Test Results

Standard 8.5

Policies for the release of test scores with identifying information should be carefully considered and clearly communicated to those who have access to the scores. Policies should make sure that test results containing the names of individual test takers or other personal identifying information are released only to those who have a legitimate, professional interest in the test takers and are permitted to access such information under applicable privacy laws, who are covered by the test takers' informed consent documents, or who are otherwise permitted by law to access the results.

Comment: Test results of individuals identified by name, or by some other information by means of which a person can be readily identified, or readily identified when the information is combined with other information, should be kept confidential. In some situations, information may be provided on a confidential basis to other practitioners with a legitimate interest in the particular case, consistent with legal and ethical considerations, including, as applicable, privacy laws. Information may be provided to researchers if several conditions are all met: (a) each test taker's confidentiality is maintained, (b) the intended use is consistent with accepted research practice, (c) the use is in compliance with current legal and institutional requirements for subjects' rights and with applicable privacy laws, and (d) the use is consistent with the test taker's informed consent documents that are on file or with the conditions of implied consent that are appropriate in some settings.

Standard 8.6

Test data maintained or transmitted in data files, including all personally identifiable information (not just results), should be adequately protected from improper access, use, or disclosure, including by reasonable physical, technical, and administrative protections as appropriate to the particular data set and its risks, and in compliance with applicable legal requirements. Use of facsimile transmission, computer networks, data banks, or other electronic data-processing or transmittal systems should be restricted to situations in which confidentiality can be reasonably assured. Users should develop and/or follow policies, consistent with any legal requirements, for whether and how test takers may review and correct personal information.

Comment: Risk of compromise is reduced by avoiding identification numbers or codes that are linked to individuals and used for other purposes (e.g., Social Security numbers or employee IDs). If facsimile or computer communication is used to transmit test responses to another site for scoring or if scores are similarly transmitted, reasonable provisions should be made to keep the information confidential, such as encrypting the information. In some circumstances, applicable data security laws may require that specific measures be taken to protect the data. In most cases, these policies will be developed by the owner of the data.
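Encryption, offered in the comment as one reasonable provision, can be as simple as the following sketch. It is illustrative only and assumes the third-party Python package cryptography; any vetted encryption mechanism serving the same purpose would do, and real deployments must also manage keys securely.

```python
# Illustrative only: symmetric encryption of a score record before electronic
# transmission, using the third-party "cryptography" package (an assumption).
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # shared with the recipient out of band
cipher = Fernet(key)

score_report = b'{"pseudo_id": "a1b2c3", "score": 514}'
token = cipher.encrypt(score_report)   # safe to transmit or store
restored = cipher.decrypt(token)       # recipient recovers the report
assert restored == score_report
```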


Cluster 3. Test Takers' Rights to Fair and Accurate Score Reports

Standard 8.7

When score reporting assigns scores of individual test takers into categories, the labels assigned to the categories should be chosen to reflect intended inferences and should be described precisely.

Comment: When labels are associated with test results, care should be taken to avoid labels with unnecessarily stigmatizing implications. For example, descriptive labels such as "basic," "proficient," and "advanced" would carry less stigmatizing interpretations than terms such as "poor" or "unsatisfactory." In addition, information should be provided regarding the accuracy of score classifications (e.g., decision accuracy and decision consistency).
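Decision consistency, one of the classification-accuracy indices mentioned in the comment, can be estimated when test takers are classified twice (for example, on parallel forms). The sketch below is illustrative only; the category labels, cut scores, and paired scores are invented.

```python
# Illustrative only: decision consistency as the proportion of test takers
# classified into the same category on two administrations. All values are
# hypothetical.
def classify(score, cuts=(60, 75, 90)):
    """Map a score to 'below basic', 'basic', 'proficient', or 'advanced'."""
    labels = ["below basic", "basic", "proficient", "advanced"]
    return labels[sum(score >= c for c in cuts)]

pairs = [(58, 62), (70, 73), (88, 91), (95, 93), (61, 59), (77, 80)]
same = sum(classify(a) == classify(b) for a, b in pairs)
print(f"decision consistency = {same / len(pairs):.2f}")
```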
Standard 8.8

When test scores are used to make decisions about a test taker or to make recommendations to a test taker or a third party, the test taker should have timely access to a copy of any report of test scores and test interpretation, unless that right has been waived explicitly in the test taker's informed consent document or implicitly through the application procedure in education, credentialing, or employment testing or is prohibited by law or court order.

Comment: In some cases, a test taker may be adequately informed when the test report is given to an appropriate third party (e.g., treating psychologist or psychiatrist) who can interpret the test findings for the test taker. When the test taker is given a copy of the test report and there is a credible reason to believe that test scores might be incorrectly interpreted, the examiner or a knowledgeable third party should be available to interpret them, even if the score report is clearly written, as the test taker may misunderstand or raise questions not specifically answered in the report. In employment testing situations, when test results are used solely for the purpose of aiding selection decisions, waivers of access are often a condition of employment applications, although access to test information may often be appropriately required in other circumstances.

Cluster 4. Test Takers' Responsibilities for Behavior Throughout the Test Administration Process

Standard 8.9

Test takers should be made aware that having someone else take the test for them, disclosing confidential test material, or engaging in any other form of cheating is unacceptable and that such behavior may result in sanctions.

Comment: Although the Standards cannot regulate test takers' behavior, test takers should be made aware of their personal and legal responsibilities. Arranging for someone else to impersonate the test taker constitutes fraud. In tests designed to measure a test taker's independent thinking, providing responses that make use of the work of others without attribution or that were prepared by someone other than the test taker constitutes plagiarism. Disclosure of confidential testing material for the purpose of giving other test takers advance knowledge interferes with the validity of test score interpretations; and circulation of test items in print or electronic form may constitute copyright infringement. In licensure and certification tests, such actions may compromise public health and safety. In general, the validity of test score interpretations is compromised by inappropriate disclosure.

Standard 8.10

In educational and credentialing testing programs, when an individual score report is expected to be significantly delayed beyond a brief investigative period because of possible irregularities such as suspected misconduct, the test taker should be notified and given the reason for the investigation.

1 36
THE RIGHTS AND RESPONSIBILITIES OF TEST TAKERS

Reasonable efforts should be made to expedite the review and to protect the interests of the test taker. The test taker should be notified of the disposition when the investigation is closed.

Standard 8.11

In educational and credentialing testing programs, when it is deemed necessary to cancel or withhold a test taker's score because of possible testing irregularities, including suspected misconduct, the type of evidence and the general procedures to be used to investigate the irregularity should be explained to all test takers whose scores are directly affected by the decision. Test takers should be given a timely opportunity to provide evidence that the score should not be canceled or withheld. Evidence considered in deciding on the final action should be made available to the test taker on request.

Comment: Any form of cheating or behavior that reduces the validity and fairness of the interpretations of test results should be investigated promptly, with appropriate action taken. A test score may be withheld or canceled because of suspected misconduct by the test taker or because of some anomaly involving others, such as theft or administrative mishap. An avenue of appeal should be available and made known to candidates whose scores may be amended or withheld. Some testing organizations offer the option of a prompt and free retest or arbitration of disputes. The information provided to the test takers should be specific enough for them to understand the evidence that is being used to support the contention of a testing irregularity but not specific enough to divulge trade secrets or to facilitate cheating.

Standard 8.12

In educational and credentialing testing programs, a test taker is entitled to fair treatment and a reasonable resolution process, appropriate to the particular circumstances, regarding charges associated with testing irregularities, or challenges issued by the test taker regarding the accuracy of the scoring or the scoring key. Test takers are entitled to be informed of any available means of recourse.

Comment: When a test taker's score is questioned and invalidated, or when a test taker seeks a review or revision of his or her score or of some other aspect of the testing, scoring, or reporting process, the test taker is entitled to some orderly process for effective input into or review of the decision making of the test administrator or test user. Depending on the magnitude of the consequences associated with the test, this process can range from an internal review of all relevant data by a test administrator, to an informal conversation with an examinee, to a full administrative hearing. The greater the consequences, the greater the extent of procedural protections that should be made available. Test takers should also be made aware of procedures for recourse, possible fees associated with recourse procedures, expected time for resolution, and any other significant related issues, including consequences for the test taker. Some testing programs advise that the test taker may be represented by an attorney, although possibly at the test taker's expense. Depending on the circumstances and context, principles of due process under law may be relevant to the process afforded to test takers.
9. THE RIGHTS AND RESPONSIBILITIES
OF TEST USERS
BACKGROUND
The previous chapters have dealt primarily with the responsibilities of those who develop, promote, evaluate, or mandate the administration of tests and with the rights and responsibilities of test takers. The present chapter centers attention on the responsibilities of those who may be considered the users of tests. Test users are professionals who select the specific instruments or supervise test administration-on their own authority or at the behest of others-as well as all other professionals who actively participate in the interpretation and use of test results. They include psychologists, educators, employers, test developers, test publishers, and other professionals. Given the reliance on test results in many settings, pressure has typically been placed on test users to explain test-based decisions and testing practices; in many circumstances, test users have legal obligations to document the validity and fairness of those decisions and practices. The standards in this chapter provide guidance with regard to test administration procedures and decision making in which tests play a part. Thus, the present chapter includes standards of a general nature that apply in almost all testing contexts.

These Standards presume that a legitimate educational, psychological, credentialing, or employment purpose justifies the time and expense of test administration. In most settings, the user communicates this purpose to those who have a legitimate interest in the measurement process and subsequently conveys the implications of examinee performance to those entitled to receive the information. Depending on the measurement setting, this group may include individual test takers, parents and guardians, educators, employers, policy makers, the courts, or the general public.

Validity and reliability are critical considerations in test selection and use, and test users should consider evidence of (a) the validity of the interpretation for intended uses of the scores, (b) the reliability/precision of the scores, (c) the applicability of the normative data available in the test manual, and (d) the potential positive and negative consequences of use. The accumulated research literature should also be considered, as well as, where appropriate, demographic characteristics (e.g., race/ethnicity; gender; age; income; socioeconomic, cultural, and linguistic background; education; and other socioeconomic variables) of the group for which the test was originally constructed and for which normative data are available. Test users can also consult with measurement professionals. The name of the test alone never provides adequate information for deciding whether to select it.

In some cases, the selection of tests and inventories is individualized for a particular client. In other settings, a predetermined battery of tests is taken by all participants. In both cases, test users should be well versed in proper administrative procedures and are responsible for understanding the validity and reliability evidence and articulating that evidence if the need arises. Test users who oversee testing and assessment are responsible for ensuring that the test administrators who administer and score tests have received the appropriate education and training needed to perform these tasks. A higher level of competence is required of the test user who interprets the scores and integrates the inferences derived from the scores and other relevant information.

Test scores ideally are interpreted in light of the available data, the psychometric properties of the scores, indicators of effort, and the effects of moderator variables and demographic characteristics on test results. Because items or tasks contained in a test that was designed for a particular group may introduce construct-irrelevant variance when used with other groups, selecting a test with demographically appropriate reference groups is important to the generalizability of the inference that the test user seeks to make. When a test developed and normed for one group is applied to
other groups, score interpretations should be qualified and presented as hypotheses rather than conclusions. Further, statistical analyses conducted on only one group should be evaluated for appropriateness when generalized to other examinee populations. The test user should rely on any available extant research evidence for the test to draw appropriate inferences and should be aware of requirements restricting certain practices (e.g., norming by race or gender in certain contexts).

Moreover, where applicable, an interpretation of test takers' scores needs to consider not only the demonstrated relationship between the scores and the criteria, but also the appropriateness of the latter. The criteria need to be subjected to an examination similar to the examination of the predictors if one is to understand the degree to which the underlying constructs are congruent with the inferences under consideration. Data that are not supportive of the inferences should be acknowledged and either reconciled or noted as limits to the confidence that can be placed in the inferences. The education and experience necessary to interpret group tests are generally less stringent than the qualifications necessary to interpret individually administered tests.

Test users should follow the standardized test administration procedures outlined by the test developers. Computer administration of tests should also follow standardized procedures, and sufficient oversight should be provided to ensure the integrity of test results. When nonstandard procedures are needed, they should be described and justified. Test users are also responsible for providing appropriate testing conditions. For example, the test user may need to determine whether a test taker is capable of reading at the level required and whether a test taker with vision, hearing, or neurological disabilities is adequately accommodated. Chapter 3 ("Fairness in Testing") addresses equal access considerations and standards in detail.

Where administration of tests or use of test data is mandated for a specific population by governmental authorities, educational institutions, licensing boards, or employers, the developer and user of an instrument may be essentially the same. In such settings, there is often no clear separation in terms of professional responsibilities between those who develop the instrument and those who administer it and interpret the results. Instruments produced by independent publishers, on the other hand, present a somewhat different picture. Typically, these will be used by different test users with a variety of populations and for diverse purposes. The conscientious developer of a standardized test attempts to control who has access to the test and to educate potential users. Furthermore, most publishers and test sponsors work to prevent the misuse of standardized measures and the misinterpretation of individual scores and group averages. Test manuals often illustrate sound and unsound interpretations and applications. Some identify specific practices that are not appropriate and should be discouraged. Despite the best efforts of test developers, however, appropriate test use and sound interpretation of test scores are likely to remain primarily the responsibility of the test user.

Test takers, parents and guardians, legislators, policy makers, the media, the courts, and the public at large often prefer unambiguous interpretations of test data. In particular, they often tend to attribute positive or negative results, including group differences, to a single factor or to the conditions that prevail in one social institution-most often, the home or the school. These consumers of test data frequently press for score-based rationales for decisions that are based only in part on test scores. The wise test user helps all interested parties understand that sound decisions regarding test use and score interpretation involve an element of professional judgment. It is not always obvious to consumers that the choice of various information-gathering procedures involves experience that is not easily quantified or verbalized. The user can help consumers appreciate the fact that the weighting of quantitative data, educational and occupational information, behavioral observations, anecdotal reports, and other relevant data often cannot be specified precisely. Nonetheless, test users should provide reports and interpretations of test data that are clear and understandable.

Because test results are frequently reported as numbers, they often appear to be precise,
and test data are sometimes allowed to override other sources of evidence about test takers. There are circumstances in which selection based exclusively on test scores may be appropriate (e.g., in pre-employment screening). However, in educational, psychological, forensic, and some employment settings, test users are well advised, and may be legally required, to consider other relevant sources of information on test takers, not just test scores. In such situations, psychologists, educators, or other professionals familiar with the local setting and with local test takers are often best qualified to integrate this diverse information effectively.

It is not appropriate for these standards to dictate minimal levels of test-criterion correlation, classification accuracy, or reliability/precision for any given purpose. Such levels depend on factors such as the nature of the measured construct, the age of the tested individuals, and whether decisions must be made immediately on the strength of the best available evidence, however weak, or whether they can be delayed until better evidence becomes available. But it is appropriate to expect the user to ascertain what the alternatives are, what the quality and consequences of these alternatives are, and whether a delay in decision making would be beneficial. Cost-benefit compromises become necessary in test use, as they often are in test development. However, in some contexts, legal requirements may place limits on the extent to which such compromises can be made. As with standards for the various phases of test development, when relevant standards are not met in test use, the reasons should be persuasive. The greater the potential impact on test takers, for good or ill, the greater the need to identify and satisfy the relevant standards.

In selecting a test and interpreting a test score, the test user is expected to have a clear understanding of the purposes of the testing and its probable consequences. The knowledgeable user has definite ideas on how to achieve these purposes and how to avoid unfairness and undesirable consequences. In subscribing to the Standards, test publishers and agencies mandating test use agree to provide information on the strengths and weaknesses of their instruments. They accept the responsibility to warn against likely misinterpretations by unsophisticated interpreters of individual scores or aggregated data. However, the ultimate responsibility for appropriate test use and interpretation lies predominantly with the test user. In assuming this responsibility, the user must become knowledgeable about a test's appropriate uses and the populations for which it is suitable. The test user should be prepared to develop a logical analysis that supports the various facets of the assessment and the inferences made from the assessment results. Test users in all settings (e.g., clinical, counseling, credentialing, educational, employment, forensic, psychological) must also become adept in communicating the implications of test results to those entitled to receive them.

In some instances, users may be obligated to collect additional evidence about a test's technical quality. For example, if performance assessments are locally scored, evidence of the degree of interscorer agreement may be required, as the sketch below illustrates. Users should also be alert to the probable local consequences of test use, particularly in the case of large-scale testing programs. If the same test material is used in successive years, users should actively monitor the program to determine if reuse has compromised the integrity of the results.
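The chapter does not specify how interscorer agreement should be quantified. The following hypothetical Python sketch shows two common raw indices, exact percent agreement and Cohen's kappa, for two raters scoring the same ten responses; operational programs would typically involve more raters and responses and often weighted agreement indices.

    # Hedged sketch: interscorer agreement for two raters on the same
    # ten locally scored responses (hypothetical data).
    from collections import Counter

    rater_1 = [3, 2, 4, 4, 1, 3, 2, 4, 3, 2]
    rater_2 = [3, 2, 4, 3, 1, 3, 2, 4, 2, 2]

    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n

    # Chance agreement expected from each rater's marginal distribution.
    c1, c2 = Counter(rater_1), Counter(rater_2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)

    kappa = (observed - expected) / (1 - expected)
    print(f"percent agreement = {observed:.2f}, kappa = {kappa:.2f}")
    # percent agreement = 0.80, kappa = 0.72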
Some of the standards that follow reiterate ideas contained in other chapters, principally chapter 3 ("Fairness in Testing"), chapter 6 ("Test Administration, Scoring, Reporting, and Interpretation"), chapter 8 ("The Rights and Responsibilities of Test Takers"), chapter 10 ("Psychological Testing and Assessment"), chapter 11 ("Workplace Testing and Credentialing"), and chapter 12 ("Educational Testing and Assessment"). This repetition is intentional. It permits an enumeration in one chapter of the major obligations that must be assumed largely by the test administrator and user, although these responsibilities may refer to topics that are covered more fully in other chapters.
STANDARDS FOR TEST USERS' RIGHTS AND RESPONSIBILITIES


The standards in this chapter begin with an overarching standard (numbered 9.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into three thematic clusters labeled as follows:

1. Validity of Interpretations
2. Dissemination of Information
3. Test Security and Protection of Copyrights

Standard 9.0

Test users are responsible for knowing the validity evidence in support of the intended interpretations of scores on tests that they use, from test selection through the use of scores, as well as common positive and negative consequences of test use. Test users also have a legal and ethical responsibility to protect the security of test content and the privacy of test takers and should provide pertinent and timely information to test takers and other test users with whom they share test scores.

Comment: Test users are professionals who fall into several categories, including those who administer tests and those who interpret and use the results of tests. Test users who interpret and use the results of tests are responsible for ascertaining that there is appropriate validity evidence supporting their interpretations and uses of test results. In some circumstances, test users are also legally responsible for ascertaining the effect of their testing practices on relevant subgroups and for considering appropriate measures if negative consequences exist. In addition, although test users are often required to share the results of tests with test takers and other groups of test users, they must also remember that test content has to be protected to maintain the integrity of test scores, and that test takers have reasonable expectations of privacy, which may be specified in certain federal or state laws and regulations.

Cluster 1. Validity of Interpretations

Standard 9.1

Responsibility for test use should be assumed by or delegated to only those individuals who have the training, professional credentials, and/or experience necessary to handle this responsibility. All special qualifications for test administration or interpretation specified in the test manual should be met.

Comment: Test users should only interpret the scores of test takers whose special needs or characteristics are within the range of the test users' qualifications. This standard has special significance in areas such as clinical testing, forensic testing, personality testing, testing in special education, testing of people with disabilities or limited exposure to the dominant culture, testing of English language learners, and in other such situations where the potential impact is great. When the situation or test-taker group falls outside the user's experience, assistance should be obtained. A number of professional organizations have codes of ethics that specify the qualifications required of those who administer tests and interpret scores within the organizations' scope of practice. Ultimately, the professional is responsible for ensuring that the clinical training requirements, ethical codes, and legal standards for administering and interpreting tests are met.

Standard 9.2

Prior to the adoption and use of a published test, the test user should study and evaluate the materials provided by the test developer. Of particular importance are materials that summarize the test's purposes, specify the procedures for test administration, define the intended population(s) of test takers, and discuss the score interpretations for which validity and reliability/precision data are available.
Comment: A prerequisite to sound test use is knowledge of the materials accompanying the instrument. At a minimum, these include manuals provided by the test developer. Ideally, the user should be conversant with relevant studies reported in the professional literature, and should be able to discriminate between appropriate and inappropriate tests for the intended use with the intended population. The level of score reliability/precision and the types of validity evidence required for sound score interpretations depend on the test's role in the assessment process and the potential impact of the process on the people involved. The test user should be aware of legal restrictions that may constrain the use of the test. On occasion, professional judgment may lead to the use of instruments for which there is little evidence of validity of the score interpretations for the chosen use. In these situations, the user should not imply that the scores, decisions, or inferences are based on well-documented evidence with respect to reliability or validity.

Standard 9.3

The test user should have a clear rationale for the intended uses of a test or evaluation procedure in terms of the validity of interpretations based on the scores and the contribution the scores make to the assessment and decision-making process.

Comment: The test user should be clear about the reasons that a test is being given. In other words, justification for the role of each instrument in selection, diagnosis, classification, and decision making should be arrived at before test administration, not afterwards. In some cases, the reasons for the referrals provide the rationale for the choice of the tests, inventories, and diagnostic procedures to be used, and the rationale may also be supported in printed materials prepared by the test publisher. The rationale may come from other sources as well, such as the empirical literature.

Standard 9.4

When a test is to be used for a purpose for which little or no validity evidence is available, the user is responsible for documenting the rationale for the selection of the test and obtaining evidence of the reliability/precision of the test scores and the validity of the interpretations supporting the use of the scores for this purpose.

Comment: The individual who uses test scores for purposes that are not specifically recommended by the test developer is responsible for collecting the necessary validity evidence. Support for such uses may sometimes be found in the professional literature. If previous evidence is not sufficient, then additional data should be collected over time as the test is being used. The provisions of this standard should not be construed as prohibiting the generation of hypotheses from test data. However, these hypotheses should be clearly labeled as tentative. Interested parties should be made aware of the potential limitations of the test scores in such situations.

Standard 9.5

Test users should be alert to the possibility of scoring errors and should take appropriate action when errors are suspected.

Comment: The costs of scoring errors are great, particularly in high-stakes testing programs. In some cases, rescoring may be requested by the test taker. If such a test-taker right is recognized in published materials, it should be respected. However, test users should not depend entirely on test takers to alert them to the possibility of scoring errors. Monitoring scoring accuracy should be a routine responsibility of testing program administrators wherever feasible, and rescoring should be done when mistakes are suspected.

Standard 9.6

Test users should be alert to potential misinterpretations of test scores; they should take steps
to minimize or avoid foreseeable misinterpretations and inappropriate uses of test scores.

Comment: Untrained audiences may adopt simplistic interpretations of test results or may attribute high or low scores or averages to a single causal factor. Test users can sometimes anticipate such misinterpretations and should try to prevent them. Obviously, not every unintended interpretation can be anticipated, and unforeseen negative consequences can occur. What is required is a reasonable effort to encourage sound interpretations and uses and to address any negative consequences that occur.

Standard 9.7

Test users should verify periodically that their interpretations of test data continue to be appropriate, given any significant changes in the population of test takers, the mode(s) of test administration, or the purposes in testing.

Comment: Over time, a gradual change in the characteristics of an examinee population may significantly affect the accuracy of inferences drawn from group averages. Modifications in test administration in response to unforeseen circumstances also may affect interpretations.

Standard 9.8

When test results are released to the public or to policy makers, those responsible for the release should provide and explain any supplemental information that will minimize possible misinterpretations of the data.

Comment: Test users have a responsibility to report results in ways that facilitate the intended interpretations for the proposed use(s) of the scores, and this responsibility extends beyond the individual test taker to any individuals or groups who are provided with test scores. Test users in group testing situations are responsible for ensuring that the individuals who use the test results are trained to interpret the scores properly. Preliminary briefings prior to the release of test results can give reporters, policy makers, or members of the public an opportunity to assimilate relevant data. Misinterpretation often can be the result of inadequate presentation of information that bears on test score interpretation.

Standard 9.9

When a test user contemplates an alteration in test format, mode of administration, instructions, or the language used in administering a test, the user should have a sound rationale and empirical evidence, when possible, for concluding that the reliability/precision of scores and the validity of interpretations based on the scores will not be compromised.

Comment: In some instances, minor changes in format or mode of administration may be reasonably expected, without evidence, to have little or no effect on test scores, classification decisions, and/or appropriateness of norms. In other instances, however, changes in the format or administrative procedures could have significant effects on the validity of interpretations of the scores-that is, these changes modify or change the construct being assessed. If a given modification becomes widespread, evidence for validity should be gathered; if appropriate, norms should also be developed under the modified conditions.

Standard 9.10

Test users should not rely solely on computer-generated interpretations of test results.

Comment: The user of automatically generated scoring and reporting services has the obligation to be familiar with the principles on which such interpretations were derived. All users who are making inferences and decisions on the basis of these reports should have the ability to evaluate a computer-based score interpretation in the light of other relevant evidence on each test taker. Automated narrative reports can be misleading, if used in isolation, and are not a substitute for sound professional judgment.
Standard 9.11

When circumstances require that a test be administered in the same language to all examinees in a linguistically diverse population, the test user should investigate the validity of the score interpretations for test takers with limited proficiency in the language of the test.

Comment: The achievement, abilities, and traits of examinees who do not speak the language of the test as their primary language may be mismeasured by the test, even if administering an alternative test is legally unacceptable. Sound practice requires ongoing evaluation of data to provide evidence supporting the use of the test with all linguistic groups or evidence to challenge the use of the test when language proficiency is not relevant.

Standard 9.12

When a major purpose of testing is to describe the status of a local, regional, or particular examinee population, the criteria for inclusion or exclusion of individuals should be adhered to strictly.

Comment: Biased results can arise from the exclusion of particular subgroups of examinees. Thus, decisions to exclude or include examinees should be based on appropriately representing the population.

Standard 9.13

In educational, clinical, and counseling settings, a test taker's score should not be interpreted in isolation; other relevant information that may lead to alternative explanations for the examinee's test performance should be considered.

Comment: It is neither necessary nor feasible to make an intensive review of every test taker's score. In some settings, there may be little or no collateral information of value. In counseling, clinical, and educational settings, however, considerable relevant information is sometimes available. Obvious alternative explanations of low scores include low motivation, limited fluency in the language of the test, limited opportunity to learn, unfamiliarity with cultural concepts on which test items are based, and perceptual or motor impairments. The test user corroborates results from testing with additional information from a variety of sources, such as interviews and results from other tests (e.g., to address the concept of reliability of performance across time and/or tests). When an inference is based on a single study or based on studies with samples that are not representative of the test takers, the test user should be more cautious about the inference that is made. In clinical and counseling settings, the test user should not ignore how well the test taker is functioning in daily life. If tests are being administered by computers and other electronic devices or via the Internet, test users still have a responsibility to provide support for the interpretation of test scores, including considerations of alternative explanations, when appropriate.

Standard 9.14

Test users should inform individuals who may need accommodations in test administration (e.g., older adults, test takers with disabilities, or English language learners) about the availability of accommodations and, when required, should see that these accommodations are appropriately made available.

Comment: Appropriate accommodations depend on the nature of the test and the needs of the test takers, and should be in keeping with the documentation provided with the test. Test users should inform test takers of the availability of accommodations, and the onus may then fall on the test takers or their guardians to request accommodations and provide documentation in support of their requests. Test users should be able to indicate the information or evidence (e.g., test manual, research study) used to choose an appropriate accommodation.
Cluster 2. Dissemination of Information

Standard 9.15

Those who have a legitimate interest in an assessment should be informed about the purposes of testing, how tests will be administered, the factors considered in scoring examinee responses, how the scores will be used, how long the records will be retained, and to whom and under what conditions the records may be released.

Comment: Individuals with a legitimate interest in assessment results include, but may not be limited to, test takers, parents or guardians of test takers, educators, and courts. This standard has greater relevance and application to educational and clinical testing than to employment testing. In most uses of tests for screening job applicants and applicants to educational programs, for licensing professionals and awarding credentials, or for measuring achievement, the purposes of testing and the uses to be made of the test scores are obvious to the test takers. Nevertheless, it is wise to communicate this information at least briefly even in these settings. In some situations, however, the rationale for the testing may be clear to relatively few test takers. In such settings, a more detailed and explicit discussion may be warranted. Retention of records, security requirements, and privacy of records are often governed by legal requirements or institutional practices, even in situations where release of records would clearly benefit the examinees. Prior to testing, where appropriate, the test user should tell the test taker who will have access to the test results and the written report, how the test results will be shared with the test taker, and whether and under what conditions the test results will be shared with a third party or the public (e.g., in court proceedings).

Standard 9.16

Unless circumstances clearly require that test results be withheld, a test user is obligated to provide a timely report of the results to the test taker and others entitled to receive this information.

Comment: The nature of score reports is often dictated by practical considerations. In some cases (e.g., with some certification or employment tests), only a brief printed report may be feasible. In other cases, it may be desirable to provide both an oral and a written report. The interpretation should vary according to the level of sophistication of the recipient. When the examinee is a young child, an explanation of the test results is typically provided to parents or guardians. Feedback in the form of a score report or interpretation is not always provided when tests are administered for personnel selection or promotion, or in certain other circumstances. In some cases, federal or state privacy laws may govern the scope of information disclosed and to whom it may be disclosed.

Standard 9.17

If a test taker or test user is concerned about the integrity of the test taker's scores, the test user should inform the test taker of his or her relevant rights, including the possibility of appeal and representation by counsel.

Comment: Proctors in entrance or licensure testing programs may report irregularities in the test administration process that result in challenges from test takers (e.g., fire alarm in building or temporary failure of Internet access). Other challenges may be raised by test users (e.g., university admissions officers) when test scores are grossly inconsistent with other applicant information. Test takers should be apprised of their rights, if any, in such situations.

Standard 9.18

Test users should explain to test takers their opportunities, if any, to retake an examination; users should also indicate whether any earlier as well as later scores will be reported to those entitled to receive the score reports.

Comment: Some testing programs permit test takers to retake an examination several times, to cancel scores, or to have scores withheld from
potential recipients. Test takers and other score recipients should be informed of such privileges, if any, and the conditions under which they apply.

Standard 9.19

Test users are obligated to protect the privacy of examinees and institutions that are involved in a testing program, unless a disclosure of private information is agreed upon or is specifically authorized by law.

Comment: Protection of the privacy of individual examinees is a well-established principle in psychological and educational measurement. Storage and transmission of this type of information should meet existing professional and legal standards, and care should be taken to protect the confidentiality of scores and ancillary information (e.g., disability status). In certain circumstances, test users and testing agencies may adopt more stringent restrictions on the communication and sharing of test results than relevant law dictates. Privacy laws may apply to certain types of information, and similar or more rigorous standards sometimes arise through the codes of ethics adopted by relevant professional organizations. In some testing programs the conditions for disclosure are stated to the examinee prior to testing, and taking the test can constitute agreement to the disclosure of test score information as specified. In other programs, the test taker or his or her parents or guardians must formally agree to any disclosure of test information to individuals or agencies other than those specified in the test administrator's published literature. Applicable privacy laws, if any, may govern and allow (as in the case of school districts for accountability purposes) or prohibit (as in clinical settings) the disclosure of test information. It should be noted that the right of the public and the media to examine the aggregate test results of public school systems is often guaranteed by law. This may often include test scores disaggregated by demographic subgroups when the numbers are sufficient to yield statistically sound results and to prevent the identification of individual test takers.

Standard 9.20

In situations where test results are shared with the public, test users should formulate and share the established policy regarding the release of the results (e.g., timeliness, amount of detail) and apply that policy consistently over time.

Comment: Test developers and test users should consider the practices of the communities they serve and facilitate the creation of common policies regarding the release of test results. For example, in many states, the release of data from large-scale educational tests is often required by law. However, even when the release of data is not required but is routinely done, test users should have clear policies governing the release procedures. Different policies without appropriate rationales can confuse the public and lead to unnecessary controversy.

Cluster 3. Test Security and Protection of Copyrights

Standard 9.21

Test users have the responsibility to protect the security of tests, including that of previous editions.

Comment: When tests are used for purposes of selection, credentialing, educational accountability, or for clinical diagnosis, treatment, and monitoring, the rigorous protection of test security is essential, for reasons related to validity of inferences drawn, protection of intellectual property rights, and the costs associated with developing tests. Test developers, test publishers, and individuals who hold the copyrights on tests provide specific guidelines about test security and disposal of test materials. The test user is responsible for helping to ensure the security of test materials according to the professional guidelines established for that test as well as any applicable legal standards. Resale of copyrighted materials in open forums is a violation of this standard, and audio and video recordings for training purposes must also be handled in such a way that they are not released to the public. These
prohibitions also apply to outdated and previous editions of tests; test users should help to ensure that test materials are securely disposed of when no longer in use (e.g., upon retirement or after purchase of a new edition). Consistency and clarity in the definition of acceptable and unacceptable practices is critical in such situations. When tests are involved in litigation, inspection of the instruments should be restricted-to the extent permitted by law-to those who are obligated legally or by professional ethics to safeguard test security.

Standard 9.22

Test users have the responsibility to respect test copyrights, including copyrights of tests that are administered via electronic devices.

Comment: Legally and ethically, test users may not reproduce or create electronic versions of copyrighted materials for routine test use without consent of the copyright holder. These materials-in both paper and electronic form-include test items, test protocols, ancillary forms such as answer sheets or profile forms, scoring templates, conversion tables of raw scores to reported scores, and tables of norms. Storage and transmission of test information should satisfy existing legal and professional standards.

Standard 9.23

Test users should remind all test takers, including those taking electronically administered tests, and others who have access to test materials that copyright policies and regulations may prohibit the disclosure of test items without specific authorization.

Comment: In some cases, information on copyrights and prohibitions on the disclosure of test items are provided in written form or verbally as part of the procedure prior to beginning the test or as part of the administration procedures. However, even in cases where this information is not a formal part of the test administration, if materials are copyrighted, test users should inform test takers of their responsibilities in this area.
PART III

Testing Applications

10. PSYCHOLOGICAL TESTING AND ASSESSMENT
BACKGROUND
This chapter addresses issues important to professionals who use psychological tests to assess individuals. Topics covered in this chapter include test selection and administration, test score interpretation, use of collateral information in psychological testing, types of tests, and purposes of psychological testing. The types of psychological tests reviewed in this chapter include cognitive and neuropsychological, problem behavior, family and couples, social and adaptive behavior, personality, and vocational. In addition, the chapter includes an overview of five common uses of psychological tests: for diagnosis; neuropsychological evaluation; intervention planning and outcome evaluation; judicial and governmental decisions; and personal awareness, social identity, and psychological health, growth, and action. The standards in this chapter are applicable to settings where in-depth assessment of people, individually or in groups, is conducted. Psychological tests are used in several other contexts as well, most notably in employment and educational settings. Tests designed to measure specific job-related characteristics across multiple candidates for selection purposes are treated in the text and standards of chapter 11; tests used in educational settings are addressed in depth in chapter 12.

It is critical that professionals who use tests to conduct assessments of individuals have knowledge of educational, linguistic, national, and cultural factors as well as physical capabilities that influence (a) a test taker's development, (b) the methods for obtaining and conveying information, and (c) the planning and implementation of interventions. Therefore, readers are encouraged to review chapter 3, which discusses fairness in testing; chapter 8, which focuses on rights of test takers; and chapter 9, which focuses on rights and responsibilities of test users. In chapters 1, 2, 4, 5, 6, and 7, readers will find important additional detail on validity; on reliability and precision; on test development; on scaling and equating; on test administration, scoring, reporting, and interpretation; and on supporting documentation.

The use of psychological tests provides one approach to collecting information within the larger framework of a psychological assessment of an individual. Typically, psychological assessments involve an interaction between a professional, who is trained and experienced in testing, the test taker, and a client who may be the test taker or another party. The test taker may be a child, an adolescent, or an adult. The client usually is the person or agency that arranges for the assessment. Clients may be patients, counselees, parents, children, employees, employers, attorneys, students, government agencies, or other responsible parties. The settings in which psychological tests or inventories are used include (but are not limited to) preschools; elementary, middle, and secondary schools; colleges and universities; pre-employment settings; hospitals; prisons; mental health and health clinics; and other professionals' offices.

The tasks involved in a psychological assessment-collecting, evaluating, integrating, and reporting salient information relevant to the aspects of a test taker's functioning that are under examination-comprise a complex and sophisticated set of professional activities. A psychological assessment is conducted to answer specific questions about a test taker's psychological functioning or behavior during a particular time interval or to predict an aspect of a test taker's psychological functioning or behavior in the future. Because test scores characteristically are interpreted in the context of other information about the test taker, an individual psychological assessment usually also includes interviewing the test taker; observing the test taker's behavior in the appropriate setting; reviewing educational, health, psychological, and other relevant records; and integrating these findings with other information that may be
provided by third parties. The results from tests and inventories used in psychological assessments may help the professional to understand test takers more fully and to develop more informed and accurate hypotheses, inferences, and decisions about aspects of the test taker's psychological functioning or appropriate interventions.

The interpretation of test and inventory scores can be a valuable part of the assessment process and, if used appropriately, can provide useful information to test takers as well as to other users of the test interpretation. For example, the results of tests and inventories may be used to assess the psychological functioning of an individual; to assign diagnostic classification; to detect and characterize neuropsychological impairment, developmental delays, and learning disabilities; to determine the validity of a symptom; to assess cognitive and personality strengths or mental health and emotional behavior problems; to assess vocational interests and values; to determine developmental stages; to assist in health decision making; or to evaluate treatment outcomes. Test results also may provide information used to make decisions that have a powerful and lasting impact on people's lives (e.g., vocational and educational decisions; diagnoses; treatment plans, including plans for psychopharmacological intervention; intervention and outcome evaluations; health decisions; disability determinations; decisions on parole sentencing, civil commitment, child custody, and competency to stand trial; personal injury litigation; and death penalty decisions).

Test Selection and Administration

The selection and administration of psychological tests and inventories often is individualized for each participant. However, in some settings predetermined tests may be taken by all participants, and interpretations of results may be provided in a group setting.

The assessment process begins by clarifying, as much as possible, the reasons why a test taker will be assessed. Guided by these reasons or other relevant concerns, the tests, inventories, and diagnostic procedures to be used are selected, and other sources of information needed to evaluate the test taker are identified. Preliminary findings may lead to the selection of additional tests. The professional is responsible for being familiar with the evidence of validity for the intended uses of scores from the tests and inventories selected, including computer-administered or online tests. Evidence of the reliability/precision of scores, and the availability of applicable normative data in the test's accumulated research literature, also should be considered during test selection. In the case of tests that have been revised, editions currently supported by the publisher usually should be selected. On occasion, use of an earlier edition of an instrument is appropriate (e.g., when longitudinal research is conducted, or when an earlier edition contains relevant subtests not included in a later edition). In addition, professionals are responsible for guarding against reliance on test scores that are outdated; in such cases, retesting is appropriate. In international applications, it is especially important to verify that the construct being assessed has equivalent meaning across international borders and cultural contexts.

Validity and reliability/precision considerations are paramount, but the demographic characteristics of the group(s) for which the test originally was constructed and for which initial and subsequent normative data are available also are important test selection considerations. Selecting a test with demographically and clinically appropriate normative groups relevant for the test taker and for the purpose of the assessment is important for the generalizability of the inferences that the professional seeks to make. Applying a test constructed for one group to other groups may not be appropriate, and score interpretations, if the test is used, should be qualified and presented as hypotheses rather than conclusions.

Tests and inventories that meet high technical standards of quality are a necessary but not a sufficient condition for the responsible administration and scoring of tests and interpretation and use of test scores. A professional conducting a psychological assessment must complete the appropriate education and training, acquire appropriate credentials, adhere to professional ethical guidelines, and
possess a high degree of professional judgment and scientific knowledge.

Professionals who oversee testing and assessment should be thoroughly versed in proper test administration procedures. They are responsible for ensuring that all persons who administer and score tests have received the appropriate education and training needed to perform their assigned tasks. Test administrators should administer tests in the manner that the test manuals indicate and should adhere to ethical and professional standards. The education and experience necessary to administer group tests and/or to proctor computer-administered tests generally are less extensive than the qualifications necessary to administer and interpret scores from individually administered tests that require interactions between the test taker and the test administrator. In many situations where complex behavioral observations are required, the use of a nonprofessional to administer or score tests may be inappropriate. Prior to beginning the assessment process, the test taker or a responsible party acting on the test taker's behalf (e.g., parent, legal guardian) should understand who will have access to the test results and the written report, how test results will be shared with the test taker, and whether and when decisions based on the test results will be shared with the test taker and/or a third party or the public (e.g., in court proceedings).

Test administrators must be aware of any personal limitations that affect their ability to administer and score the test fairly and accurately. These limitations may include physical, perceptual, and cognitive factors. Some tests place considerable demands on the test administrator (e.g., recording responses rapidly, manipulating equipment, or performing complex item scoring during administration). Test administrators who cannot comfortably meet these demands should not administer such tests. For tests that require oral instructions prior to or during administration, test administrators should be sure that there are no barriers to being clearly understood by test takers.

When using a battery of tests, the professional should determine the appropriate order of tests to be administered. For example, when administering cognitive and neuropsychological tests, some professionals first administer tests to assess basic domains (e.g., attention) and end with tests to assess more complex domains (e.g., executive functions). Professionals also are responsible for establishing testing conditions that are appropriate to the test taker's needs and abilities. For example, the examiner may need to determine if the test taker is capable of reading at the level required and if vision, hearing, psychomotor, or clinical impairments or neurological deficits are adequately accommodated. Chapter 3 addresses access considerations and standards in detail.

Standardized administration is not required for all tests but is important for the interpretation of test scores for many tests and purposes. In those situations, standardized test administration procedures should be followed. When nonstandard administration procedures are needed or allowed, they should be described and justified. The interpreter of the test results should be informed if the test was unproctored or if it was administered under nonstandardized procedures. In some circumstances, test administration may provide the opportunity for skilled examiners to carefully observe the performance of test takers under standardized conditions. For example, the test administrators' observations may allow them to record behaviors being assessed, to understand the manner in which test takers arrived at their answers, to identify test-taker strengths and weaknesses, and to make modifications in the testing process. If tests are administered by computer or other technological devices or online, the professional is responsible for determining if the purpose of the assessment and the capabilities of the test taker require the presence of a proctor or support staff (e.g., to assist with the use of the computer equipment or software). Also, some computer-administered tests may require giving the test taker the opportunity to receive instructions and to practice prior to the test administration. Chapters 4 and 6 provide additional detail on technologically administered tests.

Inappropriate effort on the part of the person being assessed may affect the results of psychological assessment and may introduce error into the measurement of the construct in question. Therefore,
in some cases, the importance of expending appropriate effort when taking the test should be explained to the test taker. For many tests, measures of effort can be derived from stand-alone tests or from responses embedded within a standard assessment procedure (e.g., increased numbers of errors, inconsistent responding, and unusual responses relevant to symptom patterns), and effort may be measured throughout the assessment process. When low levels of effort and motivation are evident during the test administration, continuing an evaluation may result in inappropriate score interpretations.

Professionals are responsible for protecting the confidentiality and security of the test results and the testing materials. Storage and transmission of this type of information should satisfy relevant professional and legal standards.

Test Score Interpretation

Test scores used in psychological assessment ideally are interpreted in light of a number of factors, including the available normative data appropriate to the characteristics of the test taker, the psychometric properties of the test, indicators of effort, the circumstances of the test taker at the time the test is given, the temporal stability of the constructs being measured, and the effects of moderator variables and demographic characteristics on test results. The professional rarely has the resources available to personally conduct the research or to assemble representative norms that, in some types of assessment, might be needed to make accurate inferences about each individual test taker's past, current, and future functioning. Therefore, the professional may need to rely on the research and the body of scientific knowledge available for the test that support appropriate inferences. Presentation of validity and reliability/precision evidence often is not needed in the written report summarizing the findings of the assessment, but the professional should strive to understand, and be prepared to articulate, such evidence as the need arises.

When making inferences about a test taker's past, present, and future behaviors and other characteristics from test scores, the professional should consider other available data that support or challenge the inferences. For example, the professional should review the test taker's history and information about past behaviors, as well as the relevant literature, to develop familiarity with supporting evidence. At times, the professional also should corroborate results from one testing session with results from other tests and testing sessions to address reliability/precision and validity of the inferences made about the test taker's performance across time and/or tests. Triangulation of multiple sources of information, including stylistic and test-taking behaviors inferred from observation during the test administration, may strengthen confidence in the inference. Importantly, data that are not supportive of the inferences should be acknowledged and either reconciled with other information or noted as a limitation to the confidence placed in the inference. When there is strong evidence for the reliability/precision and validity of the scores for the intended uses of a test and strong evidence for the appropriateness of the test for the test taker being assessed, then the professional's ability to draw appropriate inferences increases. When an inference is based on a single study or based on several studies whose samples are of limited generalizability to the test taker, then the professional should be more cautious about the inference and note in the report limitations regarding conclusions drawn from the inference.

Threats to the interpretability of obtained scores are minimized by clearly defining how particular psychological tests are to be used. These threats occur as a result of construct-irrelevant variance (i.e., aspects of the test and the testing process that are not relevant to the purpose of the test scores) and construct underrepresentation (i.e., failure of the test to account for important facets relevant to the purpose of the testing). Response bias and faking are examples of construct-irrelevant components that may significantly skew the obtained scores, possibly resulting in inaccurate or misleading interpretations. In situations where response bias or faking is anticipated, professionals may choose a test that has scales (e.g., percentage of "yes" answers, percentage of "no" answers; "faking good," "faking bad") that clarify the threats
to validity. In so doing, the professionals may be able to assess the degree to which test takers are acquiescing to the perceived demands of the test administrator or attempting to portray themselves as impaired by "faking bad," or as well functioning by "faking good."

For some purposes, including career counseling and neuropsychological assessment, batteries of tests are frequently used. For example, career counseling batteries may include tests of abilities, values, interests, and personality. Neuropsychological batteries may include measures of orientation, attention, communication skills, executive function, fluency, visual-motor and visual-spatial skills, problem solving, organization, memory, intelligence, academic achievement, and/or personality, along with tests of effort. When psychological test batteries incorporate multiple methods and scores, patterns of test results frequently are interpreted as reflecting a construct or even an interaction among constructs underlying test performance. Interactions among the constructs underlying configurations of test outcomes may be postulated on the basis of test score patterns. The literature reporting evidence of reliability/precision and validity of configurations of scores that supports the proposed interpretations should be identified when possible. However, it is understood that little, if any, literature exists that describes the validity of interpretations of scores from highly customized or flexible batteries of tests. The professional should recognize that variability in scores on different tests within a battery commonly occurs in the general population, and should use base rate data, when available, to determine whether the observed variability is exceptional. If the literature is incomplete, the resulting inferences may be presented with the qualification that they are hypotheses for future verification rather than probabilistic statements regarding the likelihood of some behavior that imply some known validity evidence.

Collateral Information Used in Psychological Testing and Assessment

Test scores that are used as part of a psychological assessment are best interpreted in the context of the test taker's personal history and other relevant traits and personal characteristics. The quality of interpretations made from psychological tests and assessments often can be enhanced by obtaining credible collateral information from various third-party sources, such as significant others, teachers, health professionals, and school, legal, military, and employment records. The quality of collateral information is enhanced by using various methods to acquire it. Structured behavioral observations, checklists, ratings, and interviews are a few of the methods that may be used, along with objective test scores, to minimize the need for the scorer to rely on individual judgment. For example, an evaluation of career goals may be enhanced by obtaining a history of employment as well as by administering tests to assess academic aptitude and achievement, vocational interests, work values, personality, and temperament. The availability of information on multiple traits or attributes, when acquired from various sources and through the use of various methods, enables professionals to assess more accurately an individual's psychosocial functioning and facilitates more effective decision making. When using collateral data, the professional should take steps to ascertain their accuracy and reliability, especially when the data come from third parties who may have a vested interest in the outcome of the assessment.

Types of Psychological Testing and Assessment

For purposes of this chapter, the types of psychological tests have been divided into six categories: cognitive and neuropsychological tests; problem behavior tests; family and couples tests; social and adaptive behavior tests; personality tests; and vocational tests.

Cognitive and Neuropsychological Testing and Assessment

Tests often are used to assess various classes of cognitive and neuropsychological functioning, including intelligence, broad ability domains, and more focused domains (e.g., abstract reasoning and categorical thinking; academic achievement; attention; cognitive ability; executive function;
language; learning and memory; motor and sensorimotor functions and lateral preferences; and perception and perceptual organization/integration). Overlap may occur in the constructs that are assessed by tests of differing functions or domains. In common with other types of tests, cognitive and neuropsychological tests require a minimally sufficient level of test-taker capacity to maintain attention as well as appropriate effort. For example, when administering cognitive and neuropsychological tests, some professionals first administer tests to assess basic domains (e.g., attention) and end with administration of tests to assess more complex domains (e.g., executive function).

Abstract reasoning and categorical thinking. Tests of reasoning and thinking measure a broad array of skills and abilities, including the examinee's ability to infer relationships, to form new concepts or strategies, to respond to changing environmental circumstances, and to act in goal-oriented situations, as well as the ability to understand a problem or a concept, to develop a strategy to solve that problem, and, as necessary, to alter such concepts or strategies as situations vary.

Academic achievement. Academic achievement tests are measures of knowledge and skills that a person has acquired in formal and informal learning situations. Two major types of academic achievement tests include general achievement batteries and diagnostic achievement tests. General achievement batteries are designed to assess a person's level of learning in multiple areas (e.g., reading, mathematics, and spelling). In contrast, diagnostic achievement tests typically focus on one subject area (e.g., reading) and assess an academic skill in greater detail. Test results are used to determine the test taker's strengths and may also help identify sources of academic difficulties or deficiencies. Chapter 12 provides additional detail on academic achievement testing in educational settings.

Attention. Attention refers to a domain that encompasses the constructs of arousal, establishment of sets, strategic deployment of attention, sustained attention, divided attention, focused attention, selective attention, and vigilance. Tests may measure (a) levels of alertness, orientation, and localization; (b) the ability to focus, shift, and maintain attention and to track one or more stimuli under various conditions; (c) span of attention; and (d) short-term information storage functioning. Scores for each aspect of attention that have been examined should be reported individually so that the nature of an attention disorder can be clarified.

Cognitive ability. Measures designed to quantify cognitive abilities are among the most widely administered tests. The interpretation of results from a cognitive ability test is guided by the theoretical constructs used to develop the test. Some cognitive ability assessments are based on results from multidimensional test batteries that are designed to assess a broad range of skills and abilities. Test results are used to draw inferences about a person's overall level of intellectual functioning and about strengths and weaknesses in various cognitive abilities, and to diagnose cognitive disorders.

Executive function. This class of functions is involved in the organized performances (e.g., cognitive flexibility, inhibitory control, multitasking) that are necessary for the independent, purposive, and effective attainment of goals in various cognitive processing, problem-solving, and social situations. Some tests emphasize (a) reasoned plans of action that anticipate consequences of alternative solutions, (b) motor performance in problem-solving situations that require goal-oriented intentions, and/or (c) regulation of performance for achieving a desired outcome.

Language. Language deficiencies typically are identified with assessments that focus on phonology, morphology, syntax, semantics, supralinguistics, and pragmatics. Various functions may be assessed, including listening, reading, and spoken and written language skills and abilities. Language disorder assessments focus on functional speech and verbal comprehension measured through oral, written, or gestural modes; lexical access and elaboration; repetition of spoken language; and associative
verbal fluency. If a multilingual person is assessed for a possible language disorder, the degree to which the disorder may be due more directly to developmental language issues (e.g., phonological, morphological, syntactic, semantic, or pragmatic delays; intellectual disabilities; peripheral, sensory, or central neurological impairment; psychological conditions; or sensory disorders) than to lack of proficiency in a given language must be addressed.

Learning and memory. This class of functions involves the acquisition, retention, and retrieval of information beyond the requirements of immediate or short-term information processing and storage. These tests may measure acquisition of new information through various sensory channels and by means of assorted test formats (e.g., word lists, prose passages, geometric figures, form boards, digits, and musical melodies). Memory tests also may require retention and recall of old information (e.g., personal data as well as commonly learned facts and skills). In addition, testing of recognition of stored information may be used in understanding memory deficits.

Motor functions, sensorimotor functions, and lateral preferences. Motor functions (e.g., finger tapping) and sensory functions (e.g., tactile stimulation) are often measured as part of a comprehensive neuropsychological evaluation. Motor tests assess various aspects of movement such as speed, dexterity, coordination, and purposeful movement. Sensory tests evaluate function in the areas of vision, hearing, touch, and sometimes smell. Testing also is done to examine the integration of perceptual and motor functions.

Perception and perceptual organization/integration. This class of functioning involves reasoning and judgment as they relate to the processing and elaboration of complex sensory combinations and inputs. Tests of perception may emphasize immediate perceptual processing but also may require conceptualizations that involve some reasoning and judgmental processes. Some tests have motor components ranging from making simple movements to building complex constructions. These tests assess activities ranging from perceptual speed to choice reaction time, to complex information processing and visual-spatial reasoning.

Problem Behavior Testing and Assessment

Problem behaviors include behavioral adjustment difficulties that interfere with a person's effective functioning in daily life situations. Tests are used to assess the individual's behavior and self-perceptions for differential diagnosis and educational classification for a variety of emotional and behavioral disorders and to aid in the development of treatment plans. In some cases (e.g., death penalty evaluations), retrospective analysis is required and multiple sources of information help provide the most comprehensive assessment possible. Observing a person in her or his environment often is helpful for understanding fully the specific demands of the environment, not only to offer a more comprehensive assessment but to provide more useful recommendations.

Family and Couples Testing and Assessment

Family testing addresses the issues of family dynamics, cohesion, and interpersonal relations among family members, including partners, parents, children, and extended family members. Tests developed to assess families and couples are distinguished by whether they measure the interaction patterns of partial or whole families, in both cases requiring simultaneous focus on two or more family members in terms of their transactions. Testing with couples may address factors such as issues of intimacy, compatibility, shared interests, trust, and spiritual beliefs.

Social and Adaptive Behavior Testing and Assessment

Measures of social and adaptive behaviors assess motivation and ability to care for oneself and relate to others. Social and adaptive behaviors are based on a repertoire of knowledge, skills, and abilities that enable a person to meet the daily demands and expectations of the environment, such as eating, dressing, working, participating in leisure activities, using transportation, interacting with peers, communicating with others, making pur-
chases, managing money, maintaining a schedule, living independently, being socially responsive, and engaging in healthy behaviors.

Personality Testing and Assessment

The assessment of personality requires a synthesis of aspects of an individual's functioning that contribute to the formulation and expression of thoughts, attitudes, emotions, and behaviors. Some of these aspects are stable over time; others change with age or are situation specific. Cognitive and emotional functioning may be considered separately in assessing an individual, but their influences are interrelated. For example, a person whose perceptions are highly accurate, or who is relatively stable emotionally, may be able to control suspiciousness better than a person whose perceptions are inaccurate or distorted or who is emotionally unstable.

Scores or personality descriptors derived from a personality test may be regarded as reflecting the underlying theoretical constructs or empirically derived scales or factors that guided the test's construction. The stimulus-and-response formats of personality tests vary widely. Some include a series of questions (e.g., self-report inventories) to which the test taker is required to respond by choosing from multiple well-defined options; others involve being placed in a novel situation in which the test taker's response is not completely structured (e.g., responding to visual stimuli, telling stories, discussing pictures, or responding to other projective stimuli). Results may consist of themes, patterns, or diagnostic indicators, as well as scores. The responses are scored and combined into either logically or statistically derived dimensions established by previous research.

Personality tests may be designed to assess normal or abnormal attitudes, feelings, traits, and related characteristics. Tests intended to measure normal personality characteristics are constructed to yield scores reflecting the degree to which a person manifests personality dimensions empirically identified and hypothesized to be present in the behavior of most individuals. A person's configuration of scores on these dimensions is then used to infer how the person behaves presently and how she or he may behave in new situations. Test scores outside the expected range may be considered strong expressions of normal traits or may be indicative of psychopathology. Such scores also may reflect normal functioning of the person within a culture different from that of the population on which the norms are based.

Other personality tests are designed specifically to measure constructs underlying abnormal functioning and psychopathology. Developers of some of these tests use previously diagnosed individuals to construct their scales and base their interpretations on the association between the test's scale scores, within a given range, and the behavioral correlates of persons who scored within that range, as compared with clinical samples. If interpretations made from scores go beyond the theory that guided the test's construction, then evidence of the validity of the interpretations should be collected and analyzed from additional relevant data.

Vocational Testing and Assessment

Vocational testing generally includes the measurement of interests, work needs, and values, as well as consideration and assessment of related elements of career development, maturity, and indecision. Academic achievement and cognitive abilities, discussed earlier in the section on cognitive ability, also are important components in vocational testing and assessment. Results from these tests often are used to enhance personal growth and understanding and for career counseling, outplacement counseling, and vocational decision making. These interventions frequently take place in the context of educational and vocational rehabilitation. However, vocational testing may also be used in the workplace as part of corporate programs for career planning.

Interest inventories. The measurement of interests is designed to identify a person's preferences for various activities. Self-report interest inventories are widely used to assess personal preferences, including likes and dislikes for various work and leisure activities, school subjects, occupations, or types of people. The resulting scores may provide
insight into types and patterns of interests in educational curricula (e.g., college majors), in various fields of work (e.g., specific occupations), or in more general or basic areas of interests related to specific activities (e.g., sales, office practices, or mechanical activities).

Work values inventories. The measurement of work values identifies a person's preferences for the various reinforcements one may obtain from work activities. Sometimes these values are identified as needs that persons seek to satisfy. Work values or needs may be categorized as intrinsic and important for the pleasure gained from the activity (e.g., being independent, using one's abilities) or as extrinsic and important for the rewards they bring (e.g., pay, promotion). The format of work values tests usually involves a self-rating of the importance of the value associated with qualities described by the items.

Measures of career development, maturity, and indecision. Additional areas of vocational assessment include measures of career development and maturity and measures of career indecision. Inventories that measure career development and maturity typically elicit self-descriptions in response to items that inquire about individuals' knowledge of the world of work; self-appraisal of their decision-making skills; attitudes toward careers and career choices; and the degree to which the individuals already have engaged in career planning. Measures of career indecision usually are constructed and standardized to assess both the level of career indecision of a test taker and the reasons for, or antecedents of, this indecision. Results from tests such as these are often used with individuals and groups to guide the design and delivery of career services and to evaluate the effectiveness of career interventions.

Purposes of Psychological Testing and Assessment

For purposes of this chapter, psychological test uses have been divided into five categories: testing for diagnosis; testing for neuropsychological evaluations; testing for intervention planning and outcome evaluation; testing for judicial and governmental decisions; and testing for personal awareness, social identity, and psychological health, growth, and action. However, these categories are not always mutually exclusive.

Testing for Diagnosis

Diagnosis refers to a process that includes the collection and integration of test results with prior and current information about a person, together with relevant contextual conditions, to identify characteristics of healthy psychological functioning as well as psychological disorders. Disorders may manifest themselves in information obtained during the testing of an individual's cognitive, emotional, adaptive, behavioral, personality, neuropsychological, physical, or social attributes.

Psychological tests are helpful to professionals involved in the diagnosis of an individual's psychological health. Testing may be performed to confirm a hypothesized diagnosis or to rule out alternative diagnoses. Diagnosis is complicated by the prevalence of comorbidity between diagnostic categories. For example, an individual diagnosed with dementia may simultaneously be diagnosed as depressed. Or a child diagnosed as having a learning disability also may be diagnosed as suffering from an attention-deficit/hyperactivity disorder. The goal of diagnosis is to provide a brief description of the test taker's psychological dysfunction and to assist each test taker in receiving the appropriate interventions for the psychological or behavioral dysfunctions that the client, or a third party, views as impairing the client's expected functioning and/or enjoyment of life. When the intent of assessment is differential diagnosis, the professional should use tests for which there is evidence that the scores distinguish between two or more diagnostic groups. Group mean differences do not provide sufficient evidence for the accuracy of differential diagnosis; additional information, such as effect sizes or data indicating the degree of overlap between criterion groups, also should be provided by the test developers. In developing treatment plans, professionals often use noncategorical diagnostic descriptions of client functioning along treatment-relevant dimensions
(e.g., functional capacity, degree of anxiety, amount of suspiciousness, openness to interpretations, amount of insight into behaviors, and level of intellectual functioning).

Diagnostic criteria may vary from one nomenclature system to another. Noting which nomenclature system is being used is an important initial step because different diagnostic systems may use the same diagnostic term to describe different symptoms. Even within one diagnostic system, the symptoms described by the same term may differ between editions of the manual. Similarly, a test that uses a diagnostic term in its title may differ significantly from another test using a similar title or from a subscale using the same term. For example, some diagnostic systems may define depression by behavioral symptomatology (e.g., psychomotor retardation, disturbance in appetite or sleep), by affective symptomatology (e.g., dysphoric feeling, emotional flatness), or by cognitive symptomatology (e.g., thoughts of hopelessness, morbidity). Further, rarely are the symptoms of diagnostic categories mutually exclusive. Hence, it can be expected that a given symptom may be shared by several diagnostic categories. More knowledgeable and precisely drawn inferences relating to a diagnosis may be obtained from test scores if appropriate weight is given to the symptoms included in the diagnostic category and to the suitability of each test for assessing the symptoms. Therefore, the first step in evaluating a test's suitability for yielding scores or information indicative of a particular diagnostic syndrome is to compare the construct that the test is intended to measure with the symptomatology described in the diagnostic criteria.

Different methods may be used to assess particular diagnostic categories. Some methods rely primarily on structured interviews using a "yes"/"no" or "true"/"false" format, in which the professional is interested in the presence or absence of diagnosis-specific symptomatology. Other methods often rely principally on tests of personality or cognitive functioning and use configurations of obtained scores. These configurations of scores indicate the degree to which a test taker's responses are similar to those of individuals who have been determined by prior research to belong to a specific diagnostic group.

Diagnoses made with the help of test scores typically are based on empirically demonstrated relationships between the test score and the diagnostic category. Validity studies that demonstrate relationships between test scores and diagnostic categories currently are available for some, but not all, diagnostic categories. Many more studies demonstrate evidence of validity for the relations between test scores and various subsets of symptoms that contribute to a diagnostic category. Although it often is not feasible for individual professionals to personally conduct research into relationships between obtained scores and diagnostic categories, familiarity with the research literature that examines these relationships is important.

The professional often can enhance the diagnostic interpretations derived from test scores by integrating the test results with inferences made from other sources of information regarding the test taker's functioning, such as self-reported history, information provided by significant others, or systematic observations in the natural environment or in the testing setting. In arriving at a diagnosis, a professional also looks for information that does not corroborate the diagnosis, and in those instances, places appropriate limits on the degree of confidence placed in the diagnosis. When relevant to a referral decision, the professional should acknowledge alternative diagnoses that may require consideration. Particular attention should be paid to all relevant available data before concluding that a test taker falls into a diagnostic category. Cultural competency is paramount in the effort to avoid misdiagnosing or overpathologizing culturally appropriate behavior, affect, or cognition. Tests also are used to assess the appropriateness of continuing the initial diagnosis, especially after a course of treatment or if the client's psychological functioning has changed over time.

Testing for Neuropsychological Evaluations

Neuropsychological testing analyzes the test taker's current psychological and behavioral status, including manifestations of neurological, neuropathological, and neurochemical changes that may arise during
development or from psychopathology, bodily and/or brain injury, or illness. The purposes of neuropsychological testing typically include, but are not limited to, the following: differential diagnosis associated with the sources of cognitive, perceptual, and personality dysfunction; differential diagnosis between two or more suspected etiologies of cerebral dysfunction; evaluation of impaired functioning secondary to a cortical or subcortical event; establishment of neuropsychological baseline measurements for monitoring progressive cerebral disease or recovery effects; comparison of test results before and after pharmacologic, surgical, behavioral, or psychological interventions; identification of patterns of higher cortical functions and dysfunctions for the formulation of rehabilitation strategies and for the design of remedial procedures; and characterization of brain behavior functions to assist in criminal and civil legal actions.

Testing for Intervention Planning and Outcome Evaluation

Professionals often rely on test results for assistance in planning, executing, and evaluating interventions. Therefore, their awareness of validity information that supports or does not support the relationships among test results, prescribed interventions, and desired outcomes is important. Interventions may be used to prevent the onset of one or more symptoms, to remediate deficits, and to provide for a person's basic physical, psychological, and social needs to enhance quality of life. Intervention planning typically occurs following an evaluation of the nature, evolution, and severity of a disorder and a review of personal and contextual conditions that may affect its resolution. Subsequent evaluations that require the repeated administration of the same test may occur in an effort to further diagnose the nature and severity of the disorder, to review the effects of interventions, to revise the interventions as needed, and to meet ethical and legal standards.

Testing for Judicial and Governmental Decisions

Clients may voluntarily seek psychological assessment to assist in matters before a court of law or other government agency. Conversely, courts or other government agencies sometimes require a person to submit involuntarily to a psychological assessment that may involve a wide range of psychological tests. The goal of these psychological assessments is to provide important information to a third party (e.g., test taker's attorney, opposing attorney, judge, or administrative board) about the psychological functioning of the test taker that has bearing on the legal issues in question. Informed consent generally should be obtained; informed consent for children or mentally incompetent individuals (e.g., individuals with dementia) should be obtained from legal guardians. At the outset of the evaluation for judicial and government decisions, the professional should explain the intended purposes of the evaluation and identify who is expected to have access to the test results and the report. Often, the professional and the test taker are not fully aware of legal issues or parameters that impinge on the evaluation, and if the test taker declines to proceed after being notified of the nature and purpose of the examination, the professional, as appropriate, may attempt to administer the assessment, postpone the assessment, advise the test taker to contact her or his attorney, or notify the individual or agency requesting the assessment about the test taker's unwillingness to proceed.

Assessments for legal reasons may occur as part of a civil proceeding (e.g., involuntary commitment, testamentary capacity, competence to stand trial, ruling of child custody, personal injury lawsuit), a criminal proceeding (e.g., competence to stand trial, ruling of not guilty by reason of insanity, mitigating circumstances in sentencing), determination of reasonable accommodations for employees with disabilities, or an administrative proceeding or decision (e.g., license revocation, parole, worker's compensation). The professional is responsible for explaining test scores and the interpretations made from them in terms of the legal criteria by which the jury, judge, or administrative board will decide the legal issue. In instances involving legal issues, it is important to assess the examinee's test-taking orientation, including response bias, to ensure that the legal proceedings have not affected the responses given. For example, persons
seeking to obtain the greatest possible monetary award for a personal injury may be motivated to exaggerate cognitive and emotional symptoms, whereas persons attempting to forestall the loss of a professional license may attempt to portray themselves in the best possible light by minimizing symptoms or deficits. In forming an assessment opinion, it is necessary to interpret the test scores with informed knowledge relating to the available validity and reliability evidence. When forming such opinions, it also is necessary to integrate a test taker's test scores with all other sources of information that bear on the test taker's current status, including psychological, health, educational, occupational, legal, sociocultural, and other relevant collateral records.

Some tests are intended to provide information about a client's functioning that helps clarify a given legal issue (e.g., parental functioning in a child custody case or a defendant's ability to understand charges in hearings on competency to stand trial). The manuals of some tests also provide demographic and actuarial data for normative groups that are representative of persons involved in the legal system. However, many tests measure constructs that are generally relevant to the legal issues even though norms specific to the judicial or governmental context may not be available. Professionals are expected to make every effort to be aware of evidence of validity and reliability/precision that supports or does not support their interpretations and to place appropriate limits on the opinions rendered. Test users who practice in judicial and governmental settings are expected to be aware of conflicts of interest that may lead to bias in the interpretation of test results.

Protecting the confidentiality of a test taker's test results and of the test instrument itself poses particular challenges for professionals involved with attorneys, judges, jurors, and other legal decision makers. The test taker has the right to expect that test results will be communicated only to persons who are legally authorized to receive them and that other information from the testing session that is not relevant to the evaluation will not be reported. The professional should be apprised of possible threats to confidentiality and test security (e.g., releasing the test questions, the examinee's responses, or raw or standardized scores on tests to another qualified professional) and should seek, if necessary, appropriate legal and professional remedies.

Testing for Personal Awareness, Social Identity, and Psychological Health, Growth, and Action

Tests and inventories frequently are used to provide information to help individuals understand themselves, identify their own strengths and weaknesses, and clarify issues important to their own development. For example, test results from personality inventories may help test takers better understand themselves and their interactions with others. Measures of ethnic identity and acculturation (two components of social identity) that assess the cognitive, affective, and behavioral facets of the ways in which people identify with their cultural backgrounds also may be informative. Psychological tests are used sometimes to assess an individual's ability to understand and adapt to health conditions. In these instances, observations and checklists, as well as tests, are used to measure the understanding that an individual with a health condition (e.g., diabetes) has about the disease process and about behavioral and cognitive techniques applicable to the amelioration or control of the symptoms of the disease state.

Results from interest inventories and tests of ability may be useful to individuals who are making educational and career decisions. Appropriate cognitive and neuropsychological tests that have been normed and standardized for children may facilitate the monitoring of development and growth during the formative years, when relevant interventions may be more efficacious for recognizing and preventing potentially disabling learning difficulties. Test scores for young adults or children on these types of measures may change in later years; therefore, test users should be cautious about overreliance on results that may be outdated.

Test results may be used in several ways for self-exploration, growth, and decision making. First, the results can provide individuals with new information that allows them to compare themselves with others or to evaluate themselves by focusing
on self-descriptions and self-characterizations. Test results may also serve to stimulate discussions between test taker and professional, to facilitate test-taker insights, to provide directions for future treatment considerations, to help individuals identify strengths and weaknesses, and to provide the professional with a general framework for organizing and integrating information about an individual. Testing for personal growth may take place in training and development programs, within an educational curriculum, during psychotherapy, in rehabilitation programs as part of an educational or career-planning process, or in other situations.

Summary

The responsible use of tests in psychological practice requires a commitment by the professional to develop and maintain the necessary knowledge and competence to select, administer, and interpret tests and inventories as crucial elements of the psychological testing and assessment process (see chap. 9). The standards in this chapter provide a framework for guiding the professional toward achieving relevance and effectiveness in the use of psychological tests within the boundaries or limits defined by the professional's educational, experiential, and ethical foundations. Earlier chapters and standards that are relevant to psychological testing and assessment describe general aspects of test quality (chaps. 1 and 2), fairness (chap. 3), test design and development (chap. 4), and test administration (chap. 6). Chapter 11 discusses test uses for the workplace, including credentialing, and the importance of collecting data that provide evidence of a test's accuracy for predicting job performance; chapter 12 discusses educational applications; and chapter 13 discusses test use in program evaluation and public policy.
STANDARDS FOR PSYCHOLOGICAL TESTING AND ASSESSMENT

The standards in this chapter have been separated into five thematic clusters labeled as follows:

1. Test User Qualifications
2. Test Selection
3. Test Administration
4. Test Interpretation
5. Test Security

Cluster 1. Test User Qualifications

Standard 10.1

Those who use psychological tests should confine their testing and related assessment activities to their areas of competence, as demonstrated through education, training, experience, and appropriate credentials.

Comment: Responsible use and interpretation of test scores require appropriate levels of experience, sound professional judgment, and understanding of the empirical and theoretical foundations of tests. For many assessments, competency also requires sufficient familiarity with the population of which the test taker is a member to facilitate test selection, test administration, and test score interpretation. For example, when personality tests and neuropsychological tests are administered as part of a psychological assessment of an individual, the test scores must be understood in the context of the individual's physical and psychological state; cultural and linguistic development; and educational, gender, health, and occupational background. Scoring also must take into account other evidence relevant to the tests used. Test score interpretation requires professionally responsible judgment that is exercised within the boundaries of knowledge and skill afforded by the professional's education, training, and supervised experience, as well as the context in which the assessment is being performed.

Standard 10.2

Those who select tests and draw inferences from test scores should be familiar with the relevant evidence of validity and reliability/precision for the intended uses of the test scores and assessments, and should be prepared to articulate a logical analysis that supports all facets of the assessment and the inferences made from the assessment.

Comment: A presentation and analysis of validity and reliability/precision evidence generally is not needed in a report that is provided for the test taker or a third party, because it is too cumbersome and of little interest to most report readers. However, in situations in which the selection of tests may be problematic (e.g., oral subtests with deaf test takers), a brief description of the rationale for using or not using particular measures is advisable.

When potential inferences derived from psychological test scores are not supported by current data yet may hold promise for future validation, they may be described by the test developer and test user as hypotheses for further validation in test score interpretation. Those receiving interpretations of such results should be cautioned that such inferences do not yet have adequately demonstrated evidence of validity and should not be the basis for a diagnostic decision or prognostic formulation.

Standard 10.3

Professionals should verify that persons under their supervision have appropriate knowledge and skills to administer and score tests.

Comment: Individuals administering tests but not involved in their selection or interpretation should be supervised by a professional. They should have knowledge of, as well as experience with, the test takers' presenting problems (e.g., brain injury) and the test settings (e.g., clinical, forensic).
Cluster 2. Test Selection

Standard 10.4

Tests that are combined to form a battery of tests should be appropriate for the purposes of the assessment.

Comment: For example, in a neuropsychological assessment for evidence of an injury to an area of the brain, it is necessary to select a combination of tests with known diagnostic sensitivity and specificity to impairments arising from trauma to specific regions of the brain.

Standard 10.5

Tests selected for use in psychological testing should be suitable for the characteristics and background of the test taker.

Comment: When tests are part of a psychological assessment, the professional generally should take into account characteristics of the individual test taker, including age and developmental level, race/ethnicity, gender, and linguistic and/or physical characteristics that may affect the ability of the test taker to meet the requirements of the test. The professional should also take into account the availability of norms and evidence of validity for a population representative of the test taker. If no normative or validity studies are available for a relevant population, test interpretations should be qualified and presented as hypotheses rather than conclusions.

Standard 10.6

When differential diagnosis is needed, the professional should choose, if possible, a test or tests for which there is credible evidence that the scores of the test(s) distinguish between the two or more diagnostic groups of concern rather than merely distinguishing abnormal cases from the general population.

Comment: Professionals will find it particularly helpful if evidence of validity is in a form that enables them to determine how much confidence can be placed in interpretations for an individual. Differences between group means and their statistical significance provide inadequate information regarding validity for individual diagnostic purposes. Additional information that might be considered includes effect sizes or a table showing the degree of overlap of predictor distributions among different criterion groups.
test taker to meet the requirements of the test. confidentiality, who will have access to the test
The professional should also take into account results, whether and when test results or decisions
the availability of norms and evidence of validity based on the scores will be shared with the test
for a population representative of the test taker.If taker, whether the test taker will have an opportunity
no normative or validity studies are available for a to retest, and under what circumstances retesting
relevant population, test interpretations should could occur.
be qualified and presented as hypotheses rather
than conclusions. Standard 1 0.8

Standard 1 0.6 Professionals and test administrators should


follow administration instructions, including
When differential diagnosis is needed, the pro calibration of technical equipment and verification
fessional should choose, if possible, a test or of scoring accuracy and replicability, and should
tests for which there is credible evidence that provide settings for testing that facilitate the
the scores of the test(s) distinguish between the performance of test takers.
two or more diagnostic groups of concern rather
Comment: Because the normative data against
than merely distinguishing abnormal cases from
which a test taker's performance will be evaluated
the general population.
were collected under the reported standard pro
Comment: Professionals will find it particularly cedures, the professional needs to be aware of and
helpful if evidence of validity is in a form that take into account the effect that any nonstandard

1 65
CHAPTER 1 0

procedures may have on the test taker's obtained score and the interpretation of that score. When using tests that employ an unstructured response format, such as some projective tests, the professional should follow the administration instructions provided and apply objective scoring criteria when available and appropriate.

In some cases, testing may be conducted in a realistic setting to determine how a test taker responds in these settings. For example, an assessment for an attention disorder may be conducted in a noisy or distracting environment rather than in an environment that typically protects the test taker from such external threats to performance efficiency.

Standard 10.9

Professionals should take into account the purpose of the assessment, the construct being measured, and the capabilities of the test taker when deciding whether technology-based administration of tests should be used.

Comment: Quality control should be integral to the administration of computerized or technology-based tests. Some technology-based tests may require that test takers have an opportunity to receive instruction and to practice prior to the test administration, unless assessing ability to use the equipment is the purpose of the test. The professional is responsible for determining whether the technology-based administration of the test should be proctored, or whether technical support staff are necessary to assist with the use of the test equipment and software. The interpreter of the test scores should be informed if the test was unproctored or if no support staff were available.

Cluster 4. Test Interpretation

Standard 10.10

Those who select tests and interpret test results should not allow individuals or groups with vested interests in the outcomes of an assessment to have an inappropriate influence on the interpretation of the assessment results.

Comment: Individuals or groups with a vested interest in the significance or meaning of the findings from psychological testing may include but are not limited to employers, health professionals, legal representatives, school personnel, third-party payers, and family members. In some instances, legal requirements may limit a professional's ability to prevent inappropriate interpretations of assessments from affecting decisions, but professionals have an obligation to document any disagreement in such circumstances.

Standard 10.11

Professionals should share test scores and interpretations with the test taker when appropriate or required by law. Such information should be expressed in language that the test taker or, when appropriate, the test taker's legal representative, can understand.

Comment: Test scores and interpretations should be expressed in terms that can be understood readily by the test taker or others entitled to the results. In most instances, a report should be generated and made available to the referral source. That report should adhere to standards required by the profession and/or the referral source, and the information should be documented in a manner that is understandable to the referral source. In some clinical situations, providing feedback to the test taker may actually cause harm. Care should be taken to minimize unintended consequences of test feedback. Any disclosure of test results to an individual or any decision not to release such results should be consistent with applicable legal standards, such as privacy laws.

Standard 10.12

In psychological assessment, the interpretation of test scores or patterns of test battery results should consider other factors that may influence a particular testing outcome. Where appropriate,
a description of such factors and an analysis of the alternative hypotheses or explanations regarding what may have contributed to the pattern of results should be included in the report.

Comment: Many factors (e.g., culture, gender, race/ethnicity, educational level, effort, employment status, left- or right-handedness, current mental state, health status, linguistic preference, and testing situation) may influence individual test results and the overall outcome of the psychological assessment. When preparing test score interpretations and reports drawn from an assessment, professionals should consider the extent to which these factors may introduce construct-irrelevant variance into the test results. The interpretation of test results in the assessment process also should be informed, when possible or appropriate, by an analysis of stylistic and other qualitative features of test-taking behavior that may be obtained from observations, interviews, and historical information. Inclusion of qualitative information may assist in understanding the outcome of tests and evaluations. In addition, tests of faking or effort often are used to determine the possibility of deception or malingering.

Standard 10.13

When the validity of a diagnosis is appraised by evaluating the level of agreement between interpretations of the test scores and the diagnosis, the diagnostic terms or categories employed should be carefully defined or identified.

Comment: Two diagnostic systems typically used are psychiatric (i.e., based on the Diagnostic and Statistical Manual of Mental Disorders) and health related (i.e., based on the International Classification of Diseases). As applicable, the system used to diagnose the test taker should be noted. Some syndromes (e.g., Mild Cognitive Impairment, Social Learning Disability) do not appear in either system; for these, a description of the deficits should be used, with the closest diagnosis possible.

Standard 10.14

Criterion-related evidence of validity should be available when recommendations or decisions are presented by the professional as having an actuarial basis.

Comment: Test score interpretations should not imply that empirical evidence exists for a relationship among particular test results, prescribed interventions, and desired outcomes, unless such evidence is available for populations similar to those representative of the examinee.

Standard 10.15

The interpretation of test or test battery results for diagnostic purposes should be based on multiple sources of test and collateral information and on an understanding of the normative, empirical, and theoretical foundations, as well as the limitations, of such tests and data.

Comment: A given pattern of test performances represents a cross-sectional view of the individual being assessed within a particular context. The interpretation of findings derived from a complex battery of tests in such contexts requires appropriate education about, supervised experience with, and knowledge of procedural, theoretical, and empirical limitations of the tests and the evaluation procedure.

Standard 10.16

If a publisher suggests that tests are to be used in combination with one another, the professional should review the recommended procedures and evidence for combining tests and determine whether the rationale provided by the publisher is appropriate for the specific combination of tests and their intended uses.

Comment: For example, if measures of intelligence are packaged with measures of memory, or if
measures of interests and personality styles are packaged together, then supporting reliability/precision and validity data for such combinations of the test scores and interpretations should be available.

Standard 10.17

Those who use computer-generated interpretations of test data should verify that the quality of the evidence of validity is sufficient for the interpretations.

Comment: Efforts to reduce a complex set of data into computer-generated interpretations of a given construct may yield misleading or oversimplified analyses of the meanings of test scores, which in turn may lead to faulty diagnostic and prognostic decisions. Norms on which the interpretations are based should be reviewed for their relevance and appropriateness.

Cluster 5. Test Security

Standard 10.18

Professionals and others who have access to test materials and test results should maintain the confidentiality of the test results and testing materials consistent with scientific, professional, legal, and ethical requirements. Tests (including obsolete versions) should not be made available to the public or resold to unqualified test users.

Comment: Professionals should be knowledgeable about and should conform to record-keeping and confidentiality guidelines required by applicable federal law and within the jurisdictions where they practice, as well as guidelines of the professional organizations to which they belong. The test publisher, the test user, the test taker, and third parties (e.g., school, court, employer) may have different levels of understanding or recognition of the need for confidentiality of test materials. To the extent possible, the professional who uses tests is responsible for managing the confidentiality of test information across all parties. It is important for the professional to be aware of possible threats to confidentiality and the legal and professional remedies available. Professionals also are responsible for maintaining the security of testing materials and respecting the copyrights of all tests. Distribution, display, or resale of test materials (including obsolete editions) to unauthorized recipients infringes the copyright of the materials and compromises test security. When it is necessary to reveal test content in the process of explaining results or in a court proceeding, this should happen in a controlled environment. When possible, copies of the content should not be distributed, or should be distributed in a manner that protects test security to the extent possible.

11. WORKPLACE TESTING AND CREDENTIALING
BACKGROUND

Organizations use employment testing for many purposes, including employee selection, placement, and promotion. Selection generally refers to decisions about which individuals will enter the organization; placement refers to decisions about how to assign individuals to positions within the organization; and promotion refers to decisions about which individuals within the organization will advance. What all three have in common is a focus on the prediction of future job behaviors, with the goal of influencing organizational outcomes such as efficiency, growth, productivity, and employee motivation and satisfaction.

Testing used in the processes of licensure and certification, which will here be generically called credentialing, focuses on an applicant's current skill or competence in a specified domain. In many occupations, individual practitioners must be licensed by governmental agencies. In other occupations, it is professional societies, employers, or other organizations that assume responsibility for credentialing. Although licensure typically involves provision of a credential for entry into an occupation, credentialing programs may exist at various levels, from novice to expert in a given field. Certification is usually sought voluntarily, although occupations differ in the degree to which obtaining certification influences employability or advancement. The credentialing process may include testing and other requirements, such as education or supervised experiences. The Standards applies to the use of tests as a component of the broader credentialing process.

Testing is also conducted in workplaces for a variety of purposes other than staffing decisions and credentialing. Testing as a tool for personal growth can be part of training and development programs, in which instruments measuring personality characteristics, interests, values, preferences, and work styles are commonly used with the goal of providing self-insight to employees. Testing can also take place in the context of program evaluation, as in the case of an experimental study of the effectiveness of a training program, where tests may be administered as pre- and post-measures. Some assessments conducted in employment settings, such as unstructured job interviews for which no claim of predictive validity is made, are nonstandardized in nature, and it is generally not feasible to apply standards to such assessments. The focus of this chapter, however, is on the use of testing specifically in staffing decisions and credentialing. Many additional issues relevant to uses of testing in organizational settings are discussed in other chapters: technical matters in chapters 1, 2, 4, and 5; documentation in chapter 7; and individualized psychological and personality assessment of job candidates in chapter 10.

As described in chapter 3, the ideal of fairness in testing is achieved if a given test score has the same meaning for all individuals and is not substantially influenced by construct-irrelevant barriers to individuals' performance. For example, a visually impaired person may have difficulty reading questions on a personality inventory or other vocational assessment provided in small print. Young people just entering the workforce may be less sophisticated in test-taking strategies than more experienced job applicants, and their scores may suffer. A person unfamiliar with computer technology may have difficulty with the user interface for a computer simulation assessment. In each of these cases, performance is hindered by a source of variance that is unrelated to the construct of interest. Sound testing practice involves careful monitoring of all aspects of the assessment process and appropriate action when needed to prevent undue disadvantages or advantages for some candidates caused by factors unrelated to the construct being assessed.

Employment Testing

The Influence of Context on Test Use

Employment testing involves using test information to aid in personnel decision making. Both the content and the context of employment testing vary widely. Content may cover various domains of knowledge, skills, abilities, traits, dispositions, values, and other individual characteristics. Some contextual features represent choices made by the employing organization; others represent constraints that must be accommodated by the employing organization. Decisions about the design, evaluation, and implementation of a testing system are specific to the context in which the system is to be used. Important contextual features include the following:

Internal versus external candidate pool. In some instances, such as promotional settings, the candidates to be tested are already employed by the organization. In others, applications are sought from individuals outside the organization. In yet other cases, a mix of internal and external candidates is sought.

Trained versus untrained candidates. In some instances, individuals with little training in a specialized knowledge or skill are sought, either because the job does not require the specialized knowledge or skill or because the organization plans to offer training after the point of hire. In other instances, trained or experienced workers are sought with the expectation that they can immediately perform a specialized job. Thus, a particular job may require very different selection systems, depending on whether trained or untrained individuals will be hired or promoted.

Short-term versus long-term focus. In some instances, the goal of the selection system is to predict performance immediately upon or shortly after hire. In other instances, the concern is with longer-term performance, as in the case of predictions as to whether candidates will successfully complete a multiyear overseas job assignment. Concerns about changing job tasks and job requirements also can lead to a focus on knowledge, skills, abilities, and other characteristics projected to be necessary for performance on the target job in the future, even if they are not part of the job as currently constituted.

Screening in versus screening out. In some instances, the goal of the selection system is to screen in individuals who are likely to be very high performers on one set of behavioral or outcome criteria of interest to the organization. In others, the goal is to screen out individuals who are likely to be very poor performers. For example, an organization may wish to screen out a small proportion of individuals for whom the risk of pathological, deviant, counterproductive, or criminal behavior on the job is deemed too high. The same organization may want to screen in applicants who have a high probability of superior performance.

Mechanical versus judgmental decision making. In some instances, test information is used in a mechanical, automated fashion. This is the case when scores on a test battery are combined by formula and candidates are selected in strict top-down rank order, or when only candidates above specific cut scores are eligible to continue to subsequent stages of a selection system (a minimal sketch of such mechanical rules appears after this list of features). In other instances, information from a test is judgmentally integrated with information from other tests and with nontest information to form an overall assessment of the candidate.

Ongoing versus one-time use of a test. In some instances, a test may be used over an extended period in an organization, permitting the accumulation of data and experience using the test in that context. In other instances, concerns about test security are such that repeated use is infeasible, and a new test is required for each test administration. For example, a work-sample test for lifeguards, requiring retrieval of a mannequin from the bottom of a pool, is not compromised if candidates possess detailed knowledge of the test in advance. In contrast, a written job-knowledge test for police officers may be severely compromised if some candidates have access to the test in

advance. The key question is whether advance knowledge of test content affects candidates' performance unfairly and consequently changes the constructs measured by the test and the validity of inferences based on the scores.

Fixed applicant pool versus continuous flow. In some instances, an applicant pool can be assembled prior to beginning the selection process, as when an organization's policy is to consider all candidates who apply before a specific date. In other cases, there is a continuous flow of applicants about whom employment decisions need to be made on an ongoing basis. Ranking of candidates is possible in the case of the fixed pool; in the case of a continuous flow, a decision may need to be made about each candidate independent of information about other candidates.

Small versus large sample size. Sample size affects the degree to which different lines of evidence can be used to examine validity and fairness of interpretations of test scores for proposed uses of tests. For example, relying on the local setting to establish empirical linkages between test and criterion scores is not technically feasible with small sample sizes. In employment testing, sample sizes are often small; at the extreme is a job with only a single incumbent. Large sample sizes are sometimes available when there are many incumbents for the job, when multiple jobs share similar requirements and can be pooled, or when organizations with similar jobs collaborate in developing a selection system.

A new job. A special case of the problem of small sample size exists when a new job is created and there are no job incumbents. As new jobs emerge, employers need selection procedures to staff the new positions. Professional judgment may be used to identify appropriate employment tests and provide a rationale for the selection program even though the array of methods for documenting validity may be restricted. Although validity evidence based on criterion-oriented studies can rarely be assembled prior to the creation of a new job, the methods for generalizing validity evidence in situations with small sample sizes can be used (see the discussion below concerning settings with small samples), as well as content-oriented studies using the subject matter experts responsible for designing the job.

Size of applicant pool relative to the number of job openings. The size of an applicant pool can constrain the type of testing system that is feasible. For desirable jobs, very large numbers of candidates may compete, and short screening tests may be used to reduce the pool to a size for which the administration of more time-consuming and expensive tests is practical. Large applicant pools may also pose test security concerns, limiting the organization to testing methods that permit simultaneous test administration to all candidates.
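The mechanical rules described under "Mechanical versus judgmental decision making" above can be stated in a few lines of code. The sketch below is illustrative only: the candidates, weights, and cut score are hypothetical, and nothing in it is prescribed by the Standards.

    # A minimal sketch of mechanical test use: scores combined by a fixed
    # formula, screened against a cut score, then ranked top-down.
    # All names, weights, and score values here are hypothetical.
    candidates = {
        "A": {"ability": 28, "integrity": 41},
        "B": {"ability": 35, "integrity": 30},
        "C": {"ability": 31, "integrity": 44},
    }
    WEIGHTS = {"ability": 0.6, "integrity": 0.4}  # fixed combination formula
    ABILITY_CUT = 30                              # minimum screen on one test

    def composite(scores):
        # Formula-based combination: no judgmental integration occurs.
        return sum(WEIGHTS[t] * s for t, s in scores.items())

    # Stage 1: only candidates at or above the cut continue.
    eligible = {c: s for c, s in candidates.items()
                if s["ability"] >= ABILITY_CUT}
    # Stage 2: strict top-down rank order on the composite.
    ranked = sorted(eligible, key=lambda c: composite(eligible[c]), reverse=True)
    print(ranked)  # ['C', 'B']; candidate A is screened out at stage 1

Judgmental integration, by contrast, has no such closed-form rule; a decision maker weighs the same information holistically along with nontest information.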
Thus, test use by employers is conditioned by contextual features. Knowledge of these features plays an important part in the professional judgment that will influence both the types of testing system developed and the strategies used to evaluate critically the validity of interpretations of test scores for proposed uses of the tests.

The Validation Process in Employment Testing

The validation process often begins with a job analysis in which information about job duties and tasks, responsibilities, worker characteristics, and other relevant information is collected. This information provides an empirical basis for articulating what is meant by job performance in the job under consideration, for developing measures of job performance, and for hypothesizing characteristics of individuals that may be predictive of performance.

The fundamental inference to be drawn from test scores in most applications of testing in employment settings is one of prediction: The test user wishes to make an inference from test results to some future job behavior or job outcome. Even when the validation strategy used does not involve empirical predictor-criterion linkages, as in the case of validity evidence based on test content, there is an implied criterion. Thus, although different strategies for gathering evidence


may be used, the inference to be supported is that scores on the test can be used to predict subsequent job behavior. The validation process in employment settings involves the gathering and evaluation of evidence relevant to sustaining or challenging this inference. As detailed below and in chapter 1 (in the section "Evidence Based on Relations to Other Variables"), a variety of validation strategies can be used to support the inference.

It follows that establishing this predictive inference requires attention to two domains: that of the test (the predictor) and that of the job behavior or outcome of interest (the criterion). Evaluating the use of a test for an employment decision can be viewed as testing the hypothesis of a linkage between these domains. Operationally, there are many ways of linking these domains, as illustrated by the diagram below.

[Diagram: Alternative links between predictor and criterion measures. Four elements are shown: a predictor measure, a criterion measure, a predictor construct domain, and a criterion construct domain. Linkage 1 connects the predictor measure with the criterion measure; linkage 2 connects the predictor measure with the predictor construct domain; linkage 4 connects the criterion measure with the criterion construct domain; linkage 3 connects the predictor construct domain with the criterion construct domain; and linkage 5 connects the predictor measure directly with the criterion construct domain.]

The diagram differentiates between a predictor construct domain and a predictor measure, and between a criterion construct domain and a criterion measure. A predictor construct domain is defined by specifying the set of behaviors, knowledge, skills, abilities, traits, dispositions, and values that will be included under particular construct labels (e.g., verbal reasoning, typing speed, conscientiousness). Similarly, a criterion construct domain specifies the set of job behaviors or job outcomes that will be included under particular construct labels (e.g., performance of core job tasks, teamwork, attendance, sales volume, overall job performance). Predictor and criterion measures are intended to assess an individual's standing on the characteristics assessed in those domains.

The diagram enumerates inferences about a number of linkages that are commonly of interest. The first linkage (labeled 1 in the diagram) is between scores on a predictor measure and scores on a criterion measure. This inference is tested through empirical examination of relationships between the two measures. The second and fourth linkages (labeled 2 and 4) are conceptually similar: Both examine the relationship of an operational measure to the construct domain of interest. Logical analysis, expert judgment, and convergence with or divergence from conceptually similar or different measures are among the forms of evidence that can be examined in testing these linkages. Linkage 3 involves the relationship between the predictor construct domain and the criterion construct domain. This inferred linkage is established on the basis of theoretical and logical analysis. It commonly draws on systematic evaluation of job content and expert judgment as to the individual characteristics linked to successful job performance. Linkage 5 examines a direct relationship of the predictor measure to the criterion construct domain.

Some predictor measures are designed explicitly as samples of the criterion construct domain of interest; thus, isomorphism between the measure and the construct domain constitutes direct evidence for linkage 5. Establishing linkage 5 in this fashion is the hallmark of approaches that rely heavily on what the Standards refers to as validity evidence based on test content. Tests in which candidates for lifeguard positions perform rescue operations, or in which candidates for word processor positions type and edit text, provide examples of test content that forms the basis for validity.

A prerequisite to the use of a predictor measure for personnel selection is that the inferences concerning the linkage between the predictor measure and the criterion construct domain be established. As the diagram illustrates, there are multiple strategies for establishing this crucial linkage. One strategy is direct, via linkage 5; a second involves

pairing linkage 1 and linkage 4; and a third involves pairing linkage 2 and linkage 3.

When the test is designed as a sample of the criterion construct domain, the validity evidence can be established directly via linkage 5. Another strategy for linking a predictor measure and the criterion construct domain focuses on linkages 1 and 4: pairing an empirical link between the predictor and criterion measures with evidence of the adequacy with which the criterion measure represents the criterion construct domain. The empirical link between the predictor measure and the criterion measure is part of what the Standards refers to as validity evidence based on relationships to other variables. The empirical link of the test and the criterion measure must be supplemented by evidence of the relevance of the criterion measure to the criterion construct domain to complete the linkage between the test and the criterion construct domain. Evidence of the relevance of the criterion measure to the criterion construct domain is commonly based on job analysis, although in some cases the link between the domain and the measure is so direct that relevance is apparent without job analysis (e.g., when the criterion construct of interest is absenteeism or turnover). Note that this strategy does not necessarily rely on a well-developed predictor construct domain. Predictor measures such as empirically keyed biodata measures are constructed on the basis of empirical links between test item responses and the criterion measure of interest. Such measures may, in some instances, be developed without a fully established conception of the predictor construct domain; the basis for their use is the direct empirical link between test responses and a relevant criterion measure. Unless sample sizes are very large, capitalization on chance may be a problem, in which case appropriate steps should be taken (e.g., cross-validation).

Yet another strategy for linking predictor scores and the criterion construct domain focuses on pairing evidence of the adequacy with which the predictor measure represents the predictor construct domain (linkage 2) with evidence of the linkage between the predictor construct domain and the criterion construct domain (linkage 3). As noted above, there is no single direct route to establishing these linkages. They involve lines of evidence subsumed under "construct validity" in prior conceptualizations of the validation process. A combination of lines of evidence (e.g., expert judgment of the characteristics predictive of job success, inferences drawn from an analysis of critical incidents of effective and ineffective job performance, and interview and observation methods) may support inferences about the predictor constructs linked to the criterion construct domain. Measures of these predictor constructs may then be selected or developed, and the linkage between the predictor measure and the predictor construct domain can be established with various lines of evidence for linkage 2, discussed above.

The various strategies for linking predictor scores to the criterion construct domain may differ in their potential applicability to any given employment testing context. While the availability of certain lines of evidence may be constrained, such constraints do not reduce the importance of establishing a validity argument for the predictive inference.

For example, methods for establishing linkages are more limited in settings with only small samples available. In such situations, gathering local evidence of predictor-criterion relationships is not feasible, and approaches to generalizing evidence from other settings may be more useful. A variety of methods exist for generalizing evidence of the validity of the interpretation of the predictive inference from other settings. Validity evidence may be directly transported from another setting in a case where sound evidence (e.g., careful job analysis) indicates that the local job is highly comparable to the job for which the validity data are being imported. These methods may rely on evidence for linkage 1 and linkage 4 that have already been established in other studies, as in the case of the transportability study described previously. Evidence for linkage 1 may also be established using techniques such as meta-analysis to combine results from multiple studies, and a careful job analysis may establish evidence for linkage 4 by showing the focal job to be similar to other jobs included in the meta-analysis. At the extreme, a


selection system may be developed for a newly created job with no current incumbents. Here, generalizing evidence from other settings may be especially helpful.

For many testing applications, there is a considerable cumulative body of research that speaks to some, if not all, of the inferences discussed above. A meta-analytic integration of this research can form an integral part of the strategy for linking test information to the construct domain of interest. The value of collecting local validation data varies with the magnitude, relevance, and consistency of research findings using similar predictor measures and similar criterion construct domains for similar jobs. In some cases, a small and inconsistent cumulative research record may lead to a validation strategy that relies heavily on local data; in others, a large, consistent research base may make investing resources in additional local data collection unnecessary.
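To make the idea of meta-analytic integration concrete, the following sketch computes a sample-size-weighted mean validity in the "bare-bones" style associated with Hunter and Schmidt. It is an illustration only: the four studies and all values are invented, and operational meta-analyses involve artifact corrections and judgments well beyond this.

    # A bare-bones meta-analytic summary of predictor-criterion correlations.
    # All study values below are hypothetical.
    studies = [(120, 0.21), (45, 0.35), (300, 0.28), (80, 0.18)]  # (N, r)

    n_total = sum(n for n, _ in studies)
    # Sample-size-weighted mean observed validity.
    r_bar = sum(n * r for n, r in studies) / n_total
    # Weighted variance of observed correlations across studies.
    var_obs = sum(n * (r - r_bar) ** 2 for n, r in studies) / n_total
    # Variance expected from sampling error alone.
    var_err = sum(n * (1 - r_bar ** 2) ** 2 / (n - 1) for n, r in studies) / n_total
    # Residual variance suggests how much validity truly varies by setting.
    var_res = max(var_obs - var_err, 0.0)

    print(f"weighted mean r = {r_bar:.3f}, residual SD = {var_res ** 0.5:.3f}")

A small residual standard deviation relative to the weighted mean is the kind of consistency that can make additional local data collection unnecessary; a large one argues for local study.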
Thus, multiple sources of data and multiple lines of evidence can be drawn upon to evaluate the linkage between a predictor measure and the criterion construct domain of interest. There is no single preferred method of inquiry for establishing this linkage. Rather, the test user must consider the specifics of the testing situation and apply professional judgment in developing a strategy for testing the hypothesis of a linkage between the predictor measure and the criterion domain.

Bases for Evaluating Employment Test Use

Although a primary goal of employment testing is the accurate prediction of subsequent job behaviors or job outcomes, it is important to recognize that there are limits to the degree to which such criteria can be predicted. Perfect prediction is an unattainable goal. First, behavior in work settings is influenced by a wide variety of organizational and extra-organizational factors, including supervisor and peer coaching, formal and informal training, job design, organizational structures and systems, and family responsibilities, among others. Second, behavior in work settings is also influenced by a wide variety of individual characteristics, including knowledge, skills, abilities, personality, and work attitudes, among others. Thus, any single characteristic will be only an imperfect predictor, and even complex selection systems only focus on the set of constructs deemed most critical for the job, rather than on all characteristics that can influence job behavior. Third, some measurement error always occurs, even in well-developed test and criterion measures.

Thus, testing systems cannot be judged against a standard of perfect prediction. Rather, they should be judged in terms of comparisons with available alternative selection methods. Professional judgment, informed by knowledge of the research literature about the degree of predictive accuracy relative to available alternatives, influences decisions about test use.

Decisions about test use are often influenced by additional considerations, including utility (i.e., cost-benefit) and return on investment, value judgments about the relative importance of selecting for one criterion domain versus others, concerns about applicant reactions to test content and processes, the availability and appropriateness of alternative selection methods, and statutory or regulatory requirements governing test use, fairness, and policy objectives such as workforce diversity. Organizational values necessarily come into play in decisions about test use; thus, even organizations with comparable evidence supporting an intended inference drawn from test scores may reach different conclusions about whether to use any particular test.

Testing in Professional and Occupational Credentialing

Tests are widely used in the credentialing of persons for many occupations and professions. Licensing requirements are imposed by federal, state, and local governments to ensure that those who are licensed possess knowledge and skills in sufficient degree to perform important occupational activities safely and effectively. Certification plays a similar role in many occupations not regulated by governments and is often a necessary precursor to advancement. Certification has also become widely used to indicate that a person has specific skills (e.g., operation of specialized auto repair


equipment) or knowledge (e.g., estate planning), which may be only a part of their occupational duties. Licensure and certification will here generically be called credentialing.

Tests used in credentialing are intended to provide the public, including employers and government agencies, with a dependable mechanism for identifying practitioners who have met particular standards. The standards may be strict, but not so stringent as to unduly restrain the right of qualified individuals to offer their services to the public. Credentialing also serves to protect the public by excluding persons who are deemed to be not qualified to do the work of the profession or occupation. Qualifications for credentials typically include educational requirements, some amount of supervised experience, and other specific criteria, as well as attainment of a passing score on one or more examinations. Tests are used in credentialing in a broad spectrum of professions and occupations, including medicine, law, psychology, teaching, architecture, real estate, and cosmetology. In some of these, such as actuarial science, clinical neuropsychology, and medical specialties, tests are also used to certify advanced levels of expertise. Relicensure or periodic recertification is also required in some occupations and professions.

Tests used in credentialing are designed to determine whether the essential knowledge and skills have been mastered by the candidate. The focus is on the standards of competence needed for effective performance (e.g., in licensure this refers to safe and effective performance in practice). Test design generally starts with an adequate definition of the occupation or specialty, so that persons can be clearly identified as engaging in the activity. Then the nature and requirements of the occupation, in its current form, are delineated. To identify the knowledge and skills necessary for competent practice, it is important to complete an analysis of the actual work performed and then document the tasks and responsibilities that are essential to the occupation or profession of interest. A wide variety of empirical approaches may be used, including the critical incident technique, job analysis, training needs assessments, or practice studies and surveys of practicing professionals. Panels of experts in the field often work in collaboration with measurement experts to define test specifications, including the knowledge and skills needed for safe, effective performance and an appropriate way of assessing them. The Standards apply to all forms of testing, including traditional multiple-choice and other selected-response tests, constructed-response tasks, portfolios, situational judgment tasks, and oral examinations. More elaborate performance tasks, sometimes using computer-based simulation, are also used in assessing such practice components as, for example, patient diagnosis or treatment planning. Hands-on performance tasks may also be used (e.g., operating a boom crane or filling a tooth), with observation and evaluation by one or more examiners.

Credentialing tests may cover a number of related but distinct areas of knowledge or skill. Designing the testing program includes deciding what areas are to be covered, whether one or a series of tests is to be used, and how multiple test scores are to be combined to reach an overall decision. In some cases, high scores on some tests are permitted to offset (i.e., compensate for) low scores on other tests, so that an additive combination is appropriate. In other cases, a conjunctive decision model requiring acceptable performance on each test in an examination series is used. The type of pass-fail decision model appropriate for a credentialing program should be carefully considered, and the conceptual and/or empirical basis for the decision model should be articulated.
Test design generally starts with an adequate def Validation of credentialing tests depends mainly
inition of the occupation or specialty, so that on content-related evidence, often in the form of
persons can be clearly identified as engaging in j udgments that the test adequately represents the
the activity.T hen the nature and requirements of content domain associated with the occupation or
the occupation, in its current form, are delineated. specialty being considered.Such evidence may be
To identify the knowledge and skills necessary for supplemented with other forms of evidence external
competent practice, it is important to complete to the test.For example, information may be pro
an analysis of the actual work performed and vided about the process by which specifications
then document the tasks and responsibilities that for the content domain were developed and the
are essential to the occupation or profession of expertise of the individuals making judgments
interest.A wide variety of empirical approaches about the content domain. Criterion-related
may be used, including the critical incident tech evidence is of limited applicability because cre
nique, job analysis, training needs assessments, or dentialing examinations are not intended to predict
practice studies and surveys of practicing profes- individual performance in a specific job but rather


to provide evidence that candidates have acquired the knowledge, skills, and judgment required for effective performance, often in a wide variety of jobs or settings (we use the term judgment to refer to the applications of knowledge and skill to particular situations). In addition, measures of performance in practice are generally not available for those who are not granted a credential.

Defining the minimum level of knowledge and skill required for licensure or certification is one of the most important and difficult tasks facing those responsible for credentialing. The validity of the interpretation of the test scores depends on whether the standard for passing makes an appropriate distinction between adequate and inadequate performance. Often, panels of experts are used to specify the level of performance that should be required. Standards must be high enough to ensure that the public, employers, and government agencies are well served, but not so high as to be unreasonably limiting. Verifying the appropriateness of the cut score or scores on a test used for licensure or certification is a critical element of the validation process. Chapter 5 provides a general discussion of setting cut scores (see Standards 5.21-5.23 for specific topics concerning cut scores).

Legislative bodies sometimes attempt to legislate a cut score, such as answering 70% of test items correctly. Cut scores established in such an arbitrary fashion can be harmful for two reasons. First, without detailed information about the test, job requirements, and their relationship, sound standard setting is impossible. Second, without detailed information about the format of the test and the difficulty of items, such arbitrary cut scores have little meaning.

Scores from credentialing tests need to be precise in the vicinity of the cut score. They may not need to be as precise for test takers who clearly pass or clearly fail. Computer-based mastery tests may include a provision to end the testing when it becomes clear that a decision about the candidate's performance can be made, resulting in a shorter test for candidates whose performance clearly exceeds or falls below the minimum performance required for a passing score. Because mastery tests may not be designed to provide accurate results over the full score range, many such tests report results as simply "pass" or "fail." When feedback is given to candidates about how well or how poorly they performed, precision throughout the score range is needed. Conditional standard errors of measurement, discussed in chapter 2, provide information about the precision of specific scores.
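As an illustration of what a conditional standard error can look like (the Standards do not prescribe any particular estimator), one classical estimate for a raw score x on an n-item test, under Lord's binomial error model, is

    SE(x) = \sqrt{ \frac{x\,(n - x)}{n - 1} }

Under this assumed model the error is zero at the extremes of the score range and largest for middle scores, which is one reason precision specifically in the vicinity of the cut score has to be checked rather than assumed.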
Candidates who fail may profit from information about the areas in which their performance was especially weak. This is the reason that subscores are sometimes provided. Subscores are often based on relatively small numbers of items and can be much less reliable than the total score. Moreover, differences in subscores may simply reflect measurement error. For these reasons, the decision to provide subscores to candidates should be made carefully, and information should be provided to facilitate proper interpretation. Chapter 2 and Standard 2.3 speak to the importance of subscore reliability.

Because credentialing tends to involve high stakes and is an ongoing process, with tests given on a regular schedule, it is generally not desirable to use the same test form repeatedly. Thus, new forms, or versions of the test, are generally needed on an ongoing basis. From a technical perspective, all forms of a test should be prepared to the same specifications, assess the same content domains, and use the same weighting of components or topics.

Alternate test forms should have the same score scale so that scores can retain their meaning. Various methods of linking or equating alternate forms can be used to ensure that the standard for passing represents the same level of performance on all forms. Note that release of past test forms may compromise the extent to which different test forms are comparable.
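As one simple example of such a method (offered as an illustration, not as the preferred approach), linear equating places a score x from a new form X onto the scale of a reference form Y by matching means and standard deviations:

    y(x) = \mu_Y + \frac{\sigma_Y}{\sigma_X}\,(x - \mu_X)

so that the cut score expressed on form Y corresponds to the same relative level of performance on form X. More refined methods (e.g., equipercentile or item-response-theory-based equating) serve the same purpose when the linear relationship between forms is untenable.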
Practice in professions and occupations often changes over time. Evolving legal restrictions, progress in scientific fields, and refinements in techniques can result in a need for changes in test content. Each profession or occupation should periodically reevaluate the knowledge and skills measured in its examination used to meet the


requirements of the credential. When change is substantial, it becomes necessary to revise the definition of the profession, and the test content, to reflect changing circumstances. These changes to the test may alter the meaning of the score scale. When major revisions are made in the test or when the score scale changes, the cut score should also be reestablished.

Some credentialing groups consider it necessary, as a practical matter, to adjust their passing score or other criteria periodically to regulate the number of accredited candidates entering the profession. This questionable procedure raises serious problems for the technical quality of the test scores and threatens the validity of the interpretation of a passing score as indicating entry-level competence. Adjusting the cut score periodically also implies that standards are set higher in some years than in others, a practice that is difficult to justify on the grounds of quality of performance. The score scale is sometimes adjusted so that a certain number or proportion of candidates will reach the passing score. This approach, while less obvious to the candidates than changing the cut score, is also technically inappropriate because it changes the meaning of the scores from year to year. Passing a credentialing examination should signify that the candidate meets the knowledge and skill standards set by the credentialing body to ensure effective practice.

Issues of cheating and test security are of special importance for testing practices in credentialing. Issues of test security are covered in chapters 6 and 9. Issues of cheating by test takers are covered in chapter 8 (see Standards 8.9-8.12, addressing testing irregularities).

Fairness and access, discussed in chapter 3, are important for licensing and certification testing. An evaluation of an accommodation or modification for a credentialing test should take into consideration the critical functions performed in the work targeted by the test. In the case of credentialing tests, the criticality of job functions is informed by the public interest as well as the nature of the work itself. When a condition limits an individual's ability to perform a critical function of a job, an accommodation or modification of the licensing or certification exam may not be appropriate (i.e., some changes may fundamentally alter factors that the examination is designed to measure for protection of the public's health, safety, and welfare).


STANDARDS FOR WORKPLACE TESTING AND CREDENTIALING

The standards in this chapter have been separated into three thematic clusters labeled as follows:

1. Standards Generally Applicable to Both Employment Testing and Credentialing
2. Standards for Employment Testing
3. Standards for Credentialing

Cluster 1. Standards Generally Applicable to Both Employment Testing and Credentialing

Standard 11.1

Prior to development and implementation of an employment or credentialing test, a clear statement of the intended interpretations of test scores for specified uses should be made. The subsequent validation effort should be designed to determine how well this has been achieved for all relevant subgroups.

Comment: The objectives of employment and credentialing tests can vary considerably. Some employment tests aim to screen out those least suited for the job in question, while others are designed to identify those best suited for the job. Employment tests also vary in the aspects of job behavior they are intended to predict, which may include quantity or quality of work output, tenure, counterproductive behavior, and teamwork, among others. Credentialing tests and some employment tests are designed to identify candidates who have met some specified level of proficiency in a target domain of knowledge, skills, or judgment.

Standard 11.2

Evidence of validity based on test content requires a thorough and explicit definition of the content domain of interest.

Comment: In general, the job content domain for an employment test should be described in terms of the tasks that are performed and/or the knowledge, skills, abilities, and other characteristics that are required on the job. They should be clearly defined so that they can be linked to test content. The knowledge, skills, abilities, and other characteristics included in the content domain should be those that qualified applicants already possess when being considered for the job in question. Moreover, the importance of these characteristics for the job under consideration should not be expected to change substantially over a specified period of time. For credentialing tests, the target content domain generally consists of the knowledge, skills, and judgment required for effective performance. The target content domain should be clearly defined so it can be linked to test content.

Standard 11.3

When test content is a primary source of validity evidence in support of the interpretation for the use of a test for employment decisions or credentialing, a close link between test content and the job or professional/occupational requirements should be demonstrated.

Comment: For example, if the test content samples job tasks with considerable fidelity (e.g., with actual job samples such as machine operation) or, in the judgment of experts, correctly simulates job task content (e.g., with certain assessment center exercises), or if the test samples specific job knowledge (e.g., information necessary to perform certain tasks) or skills required for competent performance, then content-related evidence can be offered as the principal form of evidence of validity. If the link between the test content and the job content is not clear and direct, other lines of validity evidence take on greater importance. When evidence of validity based on test content is presented for a job or class of jobs, the evidence should include a description of the major job characteristics that a test is meant to sample. It is often valuable to also include information about

the relative frequency, importance, or criticality of the elements. For a credentialing examination, the evidence should include a description of the major responsibilities, tasks, and/or activities performed by practitioners that the test is meant to sample, as well as the underlying knowledge and skills required to perform those responsibilities, tasks, and/or activities.

Standard 11.4

When multiple test scores or test scores and nontest information are integrated for the purpose of making a decision, the role played by each should be clearly explicated, and the inference made from each source of information should be supported by validity evidence.

Comment: In credentialing, candidates may be required to score at or above a specified minimum on each of several tests (e.g., a practical, skill-based examination and a multiple-choice knowledge test) or at or above a cut score on a total composite score. Specific educational and/or experience requirements may also be mandated. A rationale and its supporting evidence should be provided for each requirement. For tests and assessments, such evidence includes, but is not necessarily limited to, the reliability/precision of scores and the correlations among the tests and assessments. In employment testing, a decision maker may integrate test scores with interview data, reference checks, and many other sources of information in making employment decisions. The inferences drawn from test scores should be limited to those for which validity evidence is available. For example, viewing a high test score as indicating overall job suitability, and thus precluding the need for reference checks, would be an inappropriate inference from a test measuring a single narrow, albeit relevant, domain, such as job knowledge. In other circumstances, decision makers integrate scores across multiple tests, or across multiple scales within a given test.

Cluster 2. Standards for Employment Testing

Standard 11.5

When a test is used to predict a criterion, the decision to conduct local empirical studies of predictor-criterion relationships and the interpretation of the results should be grounded in knowledge of relevant research.

Comment: The cumulative literature on the relationship between a particular type of predictor and type of criterion may be sufficiently large and consistent to support the predictor-criterion relationship without additional research. In some settings, the cumulative research literature may be so substantial and so consistent that a dissimilar finding in a local study should be viewed with caution unless the local study is exceptionally sound. Local studies are of greatest value in settings where the cumulative research literature is sparse (e.g., due to the novelty of the predictor and/or criterion used), where the cumulative record is inconsistent, or where the cumulative literature does not include studies similar to the study from the local setting (e.g., a study of a test with a large cumulative literature dealing exclusively with production jobs and a local setting involving managerial jobs).

Standard 11.6

Reliance on local evidence of empirically determined predictor-criterion relationships as a validation strategy is contingent on a determination of technical feasibility.

Comment: Meaningful evidence of predictor-criterion relationships is conditional on a number of features, including (a) the job's being relatively stable rather than in a period of rapid evolution; (b) the availability of a relevant and reliable criterion measure; (c) the availability of a sample reasonably representative of the population of interest; and (d) an adequate sample size for estimating


the strength of the predictor-criterion relationship. If any of these conditions is not met, some alternative validation strategy should be used. For example, as noted in the comment to Standard 11.5, the cumulative research literature may provide strong evidence of validity.

Standard 11.7

When empirical evidence of predictor-criterion relationships is part of the pattern of evidence used to support test use, the criterion measure(s) used should reflect the criterion construct domain of interest to the organization. All criteria used should represent important work behaviors or work outputs, either on the job or in job-relevant training, as indicated by an appropriate review of information about the job.

Comment: When criteria are constructed to represent job activities or behaviors (e.g., supervisory ratings of subordinates on important job dimensions), systematic collection of information about the job should inform the development of the criterion measures. However, there is no clear choice among the many available job analysis methods. Note that job analysis is not limited to direct observation of the job or direct sampling of subject matter experts; large-scale job-analytic databases often provide useful information. There is not a clear need for job analysis to support criterion use when measures such as absenteeism, turnover, or accidents are the criteria of interest.

Standard 11.8

Individuals conducting and interpreting empirical studies of predictor-criterion relationships should identify artifacts that may have influenced study findings, such as errors of measurement, range restriction, criterion deficiency, criterion contamination, and missing data. Evidence of the presence or absence of such features, and of actions taken to remove or control their influence, should be documented and made available as needed.

Comment: Errors of measurement in the criterion and restrictions on the variability of predictor or criterion scores systematically reduce estimates of the relationship between predictor measures and the criterion construct domain, but procedures for correction for the effects of these artifacts are available. When these procedures are applied, both corrected and uncorrected values should be presented, along with the rationale for the correction procedures chosen. Statistical significance tests for uncorrected correlations should not be used with corrected correlations. Other features to be considered include issues such as missing data for some variables for some individuals, decisions about the retention or removal of extreme data points, the effects of capitalization on chance in selecting predictors from a larger set on the basis of strength of predictor-criterion relationships, and the possibility of spurious predictor-criterion relationships, as in the case of collecting criterion ratings from supervisors who know selection test scores. Chapter 3, on fairness, describes additional issues that should be considered.
visory ratings of subordinates on important j ob
scores.Chapter 3, on fairness, describes additional
dimensions), systematic collection of information
issues that should be considered.
about the job should inform the development
of the criterion measures. However, there is no
clear choice among the many available j ob Standard 1 1 .9
analysis methods. Note that j ob analysis is not
Evidence of predictor-criterion relationships in
limited to direct observation of the job or direct
a current local situation should not be inferred
sampling of subject matter experts; large- scale
from a single previous validation study unless
j ob-analytic databases often provide useful in
the previous study of the predictor-criterion re
formation. There is not a clear need for job
lationships was done under favorable conditions
analysis to support criterion use when measures
(i.e., with a large sample size and a relevant cri
such as absenteeism, turnover, or accidents are
terion) and the current situation corresponds
the criteria of interest.
closely to the previous situation.

Standard 1 1 .8 Comment: Close correspondence means that the


criteria (e.g., the job requirements or underlying
Individuals conducting and interpreting empirical psychological constructs) are substantially the
studies of predictor-criterion relationships should same (e.g., as is determined by a job analysis),
identify artifacts that may have influenced study and that the predictor is substantially the same.
findings, such as errors of measurement, range Judgments about the degree of correspondence
restriction, criterion deficiency, criterion con should be based on factors that are likely to affect
tamination, and missing data. Evidence of the the predictor-criterion relationship.For example,
presence or absence of such features, and of a test of situational judgment found to predict
actions taken to remove or control their influence, performance of managers in one country may or
should be documented and made available as may not predict managerial performance in another
needed. country with a very different culture.


Standard 11.10

If tests are to be used to make job classification decisions (e.g., if the pattern of predictor scores will be used to make differential job assignments), evidence that scores are linked to different levels or likelihoods of success among jobs, job groups, or job levels is needed.

Comment: As noted in chapter 1, it is possible for tests to be highly predictive of performance for different jobs but not provide evidence of differential success among the jobs. For example, the same people may be predicted to be successful for each of the jobs.

Standard 11.11

If evidence based on test content is a primary source of validity evidence supporting the use of a test for selection into a particular job, a similar inference should be made about the test in a new situation only if the job and situation are substantially the same as the job and situation where the original validity evidence was collected.

Comment: Appropriate test use in this context requires that the critical job content factors be substantially the same (e.g., as is determined by a job analysis) and that the reading level of the test material not exceed that appropriate for the new job. In addition, the original meaning of the test materials should not be substantially changed in the new situation. For example, "salt is to pepper" may be the correct answer to the analogy item "white is to black" in a culture where people ordinarily use black pepper, but the item would have a different meaning in a culture where white pepper is the norm.

Standard 11.12

When the use of a given test for personnel selection relies on relationships between a predictor construct domain that the test represents and a criterion construct domain, two links need to be established. First, there should be evidence that the test scores are reliable and that the test content adequately samples the predictor construct domain; and second, there should be evidence for the relationship between the predictor construct domain and major factors of the criterion construct domain.

Comment: There should be a clear conceptual rationale for these linkages. Both the predictor construct domain and the criterion construct domain to which it is to be linked should be defined carefully. There is no single preferred route to establishing these linkages. Evidence in support of linkages between the two construct domains can include patterns of findings in the research literature and systematic evaluation of job content to identify predictor constructs linked to the criterion domain. The bases for judgments linking the predictor and criterion construct domains should be documented.

For example, a test of cognitive ability might be used to predict performance in a job that is complex and requires sophisticated analysis of many factors. Here, the predictor construct domain would be cognitive ability, and verifying the first link would entail demonstrating that the test is an adequate measure of the cognitive ability domain. The second linkage might be supported by multiple lines of evidence, including a compilation of research findings showing a consistent relationship between cognitive ability and performance on complex tasks, and by judgments from subject matter experts regarding the importance of cognitive ability for performance in the performance domain.

Cluster 3. Standards for Credentialing

Standard 11.13

The content domain to be covered by a credentialing test should be defined clearly and justified in terms of the importance of the content for credential-worthy performance in an occupation or profession. A rationale and evidence should be provided to support the claim that the knowledge or skills being assessed are required for credential-worthy performance in that occupation


and are consistent with the purpose for which the credentialing program was instituted.

Comment: Typically, some form of job or practice analysis provides the primary basis for defining the content domain. If the same examination is used in the credentialing of people employed in a variety of settings and specialties, a number of different job settings may need to be analyzed. Although the job analysis techniques may be similar to those used in employment testing, the emphasis for credentialing is limited appropriately to knowledge and skills necessary for effective practice. The knowledge and skills contained in a core curriculum designed to train people for the job or occupation may be relevant, especially if the curriculum has been designed to be consistent with empirical job or practice analyses.

In tests used for licensure, knowledge and skills that may be important to success but are not directly related to the purpose of licensure (e.g., protecting the public) should not be included. For example, in accounting, marketing skills may be important for success, and assessment of those skills might have utility for organizations selecting accountants for employment. However, lack of those skills may not present a threat to the public, and thus the skills would appropriately be excluded from this licensing examination. The fact that successful practitioners possess certain knowledge or skills is relevant but not persuasive. Such information needs to be coupled with an analysis of the purpose of a credentialing program and the reasons that the knowledge or skills are required in an occupation or profession.

Standard 11.14

Estimates of the consistency of test-based credentialing decisions should be provided in addition to other sources of reliability evidence.

Comment: The standards for decision consistency described in chapter 2 are applicable to tests used for licensure and certification. Other types of reliability estimates and associated standard errors of measurement may also be useful, particularly the conditional standard error at the cut score. However, the consistency of decisions on whether to certify is of primary importance.
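One common pair of decision-consistency statistics, offered here as an illustration rather than as the required method, is the proportion of agreement p_0, the proportion of candidates classified the same way (pass-pass or fail-fail) across two parallel forms or administrations, and Cohen's kappa, which discounts the agreement expected by chance:

    \kappa = \frac{ p_0 - p_c }{ 1 - p_c }, \qquad p_c = p_1 p_2 + (1 - p_1)(1 - p_2)

where p_1 and p_2 are the pass rates on the two occasions. Chapter 2 treats these and related indices in detail.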
Standard 11.15

Rules and procedures that are used to combine scores on different parts of an assessment or scores from multiple assessments to determine the overall outcome of a credentialing test should be reported to test takers, preferably before the test is administered.

Comment: In some credentialing cases, candidates may be required to score at or above a specified minimum on each of several tests. In other cases, the pass-fail decision may be based solely on a total composite score. If tests will be combined into a composite, candidates should be provided information about the relative weighting of the tests. It is not always possible to inform candidates of the exact weights prior to test administration because the weights may depend on empirical properties of the score distributions (e.g., their variances). However, candidates should be informed of the intention of weighting (e.g., test A contributes 25% and test B contributes 75% to the total score).
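The following sketch shows one way the intended 25/75 weighting in the comment's example might be realized; the score values and scale statistics are hypothetical. Standardizing before weighting is a common device for keeping nominal weights from being distorted by unequal score variances, which is also why exact operational weights may not be known before administration.

    # Intended weighting from the comment above: test A 25%, test B 75%.
    # All score values and scale statistics are hypothetical.
    def z(x, mean, sd):
        return (x - mean) / sd

    a_raw, b_raw = 14, 52
    z_a = z(a_raw, 12, 4)    # 0.5 on test A's scale
    z_b = z(b_raw, 48, 10)   # 0.4 on test B's scale
    composite = 0.25 * z_a + 0.75 * z_b
    print(round(composite, 3))  # 0.425; pass-fail compares this with a cut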
Standard 11.16

The level of performance required for passing a credentialing test should depend on the knowledge and skills necessary for credential-worthy performance in the occupation or profession and should not be adjusted to control the number or proportion of persons passing the test.

Comment: The cut score should be determined by a careful analysis and judgment of credential-worthy performance (see chap. 5). When there are alternate forms of a test, the cut score should refer to the same level of performance for all forms.

12. EDUCATIONAL TESTING AND ASSESSMENT
BACKGROUND
Educational testing has a long history of use for informing decisions about learning, instruction, and educational policy. Results of tests are used to make judgments about the status, progress, or accomplishments of individual students, as well as entities such as schools, school districts, states, or nations. Tests used in educational settings represent a variety of approaches, ranging from traditional multiple-choice and open-ended item formats to performance assessments, including scorable portfolios. As noted in the introductory chapter, a distinction is sometimes made between the terms test and assessment, the latter term encompassing broader sources of information than a score on a single instrument. In this chapter we use both terms, sometimes interchangeably, because the standards discussed generally apply to both.

This chapter does not explicitly address issues related to tests developed or selected exclusively to inform learning and instruction at the classroom level. Those tests often have consequences for students, including influencing instructional actions, placing students in educational programs, and affecting grades that may affect admission to colleges. The Standards provide desirable criteria of quality that can be applied to such tests. However, as with past editions, practical considerations limit the Standards' applicability at the classroom level. Formal validation practices are often not feasible for classroom tests because schools and teachers do not have the resources to document the characteristics of their tests and are not publishing their tests for widespread use. Nevertheless, the core expectations of validity, reliability/precision, and fairness should be considered in the development of such tests.

The Standards clearly applies to formal tests whose scores or other results are used for purposes that extend beyond the classroom, such as benchmark or interim tests that schools and districts use to monitor student progress. The Standards also applies to assessments that are adopted for use across classrooms and whose developers make claims for the validity of score interpretations for intended uses. Admittedly, this distinction is not always clear. Increasingly, districts, schools, and teachers are using an array of coordinated instruction and/or assessment systems, many of which are technology based. These systems may include, for example, banks of test items that individual teachers can use in constructing tests for their own purposes, focused assessment exercises that accompany instructional lessons, or simulations and games designed for instruction or assessment purposes. Even though it is not always possible to separate measurement issues from corresponding instructional and learning issues in these systems, assessments that are part of these systems and that serve purposes beyond an individual teacher's instruction fall within the purview of the Standards. Developers of these systems bear responsibility for adhering to the Standards to support their claims.

Both the introductory discussion and the standards provided in this chapter are organized into three broad clusters: (1) design and development of educational assessments; (2) use and interpretation of educational assessments; and (3) administration, scoring, and reporting of educational assessments. Although the clusters are related to the chapters addressing operational areas of the standards, this discussion draws upon the principles and concepts provided in the foundational chapters on validity, reliability/precision, and fairness and applies them to educational settings. It should also be noted that this chapter does not specifically address the use of test results in mandated accountability systems that may impose performance-based rewards or sanctions on institutions such as schools or school districts or on individuals such as teachers or principals. Accountability applications involving aggregates of scores are
addressed in chapter 13 ("Uses of Tests for Program Evaluation, Policy Studies, and Accountability").

Design and Development of Educational Assessments

Educational tests are designed and developed to provide scores that support interpretations for the intended test purposes and uses. Design and development of educational tests, therefore, begins by considering test purpose. Once a test's purposes are established, considerations related to the specifics of test design and development can be addressed.

Major Purposes of Educational Testing

Although educational tests are used in a variety of ways, most address at least one of three major purposes: (a) to make inferences that inform teaching and learning at the individual or curricular level; (b) to make inferences about outcomes for individual students and groups of students; and (c) to inform decisions about students, such as certifying students' acquisition of particular knowledge and skills for promotion, placement in special instructional programs, or graduation.

Informing teaching and learning. Assessments that inform teaching and learning start with clear goals for student learning and may involve a variety of strategies for assessing student status and progress. The goals are typically cognitive in nature, such as student understanding of rational number equivalence, but may also address affective states or psychomotor skills. For example, teaching and learning goals could include increasing student interest in science or teaching students to form letters with a pen or pencil.

Many assessments that inform teaching and learning are used for formative purposes. Teachers use them in day-to-day classroom settings to guide ongoing instruction. For example, teachers may assess students prior to starting a new unit to ascertain whether they have acquired the necessary prerequisite knowledge and skills. Teachers may then gather evidence throughout the unit to see whether students are making anticipated progress and to identify any gaps and/or misconceptions that need to be addressed.

More formal assessments used for teaching and learning purposes may not only inform classroom instruction but also provide individual and aggregated assessment data that others may use to support learning improvement. For example, teachers in a district may periodically administer commercial or locally constructed assessments that are aligned with the district curriculum or state content standards. These tests may be used to evaluate student learning over one or more units of instruction. Results may be reported immediately to students, teachers, and/or school or district leaders. The results may also be broken down by content standard or subdomain to help teachers and instructional leaders identify strengths and weaknesses in students' learning and/or to identify students, teachers, and/or schools that may need special assistance. For example, special programs may be designed to tutor students in specific areas in which test results indicate they need help. Because the test results may influence decisions about subsequent instruction, it is important to base content domain or subdomain scores on sufficient numbers of items or tasks to reliably support the intended uses.

In some cases, assessments administered during the school year may be used to predict student performance on a year-end summative assessment. If the predicted performance on the year-end assessment is low, additional instructional interventions may be warranted. Statistical techniques, such as linear regression, may be used to establish the predictive relationships. A confounding variable in such predictions may be the extent to which instructional interventions based on interim results improve the performance of initially low-scoring students over the course of the school year; the predictive relationships will decrease to the extent that student learning is improved.
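As one illustration of such a predictive relationship, the sketch below fits an ordinary least-squares model relating two interim scores to the year-end summative score for a prior cohort, then predicts a current student's year-end performance. All scores and the score scale are hypothetical, and an operational model would be built on far more data and validated before use.

```python
import numpy as np

# Hypothetical prior-cohort data: fall and winter interim scores
# paired with the year-end summative score.
interim = np.array([[410.0, 435.0],
                    [390.0, 410.0],
                    [460.0, 482.0],
                    [505.0, 520.0],
                    [445.0, 470.0]])
year_end = np.array([450.0, 420.0, 495.0, 540.0, 480.0])

# Fit year_end ~ b0 + b1 * fall + b2 * winter by least squares.
X = np.column_stack([np.ones(len(interim)), interim])
coef, *_ = np.linalg.lstsq(X, year_end, rcond=None)

# Predict a current student's year-end score from interim results;
# a low prediction could flag the student for added intervention.
new_student = np.array([1.0, 400.0, 418.0])
print(new_student @ coef)
```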
Assessing student outcomes. The assessment of student outcomes typically serves summative functions, that is, to help assess pupils' learning at the completion of a particular instructional sequence (e.g., the end of the school year). Educational testing
of student outcomes can be concerned with several types of score interpretations, including standards-based interpretations, growth-based interpretations, and normative interpretations. These outcomes may relate to the individual student or be aggregated over groups of students, for example, classes, subgroups, schools, districts, states, or nations.

Standards-based interpretations of student outcomes typically start with content standards, which specify what students are expected to know and be able to do. Such standards are typically established by committees of experts in the area to be tested. Content standards should be clear and specific and give teachers, students, and parents sufficient direction to guide teaching and learning. Academic achievement standards, which are sometimes referred to as performance standards, connect content standards to information that describes how well students are acquiring the knowledge and skills contained in academic content standards. Performance standards may include labels for levels of performance (e.g., "basic," "proficient," "advanced"), descriptions of what students at different performance levels know and can do, examples of student work that illustrate the range of achievement within each performance level, and cut scores specifying the levels of performance on an assessment that separate adjacent levels of achievement. The process of establishing the cut scores for the academic achievement standards is often referred to as standard setting.

Although it follows from a consideration of standards-based testing that assessments should be tightly aligned with content standards, it is usually not possible to comprehensively measure all of the content standards using a single summative test. For example, content standards that focus on student collaboration, oral argumentation, or scientific lab activities do not easily lend themselves to measurement by traditional tests. As a result, certain content standards may be underemphasized in instruction at the expense of standards that can be measured by the end-of-year summative test. Such limitations may be addressed by developing assessment components that focus on various aspects of a set of common content standards. For example, performance assessments that are more closely connected with instructional units may measure certain content standards that are not easily assessed by a more traditional end-of-year summative assessment.

The evaluation of student outcomes can also involve interpretations related to student progress or growth over time, rather than just performance at a particular time. In standards-based testing, an important consideration is measuring student growth from year to year, both at the level of the individual student and aggregated across students, for example at the teacher, subgroup, or school level. A number of educational assessments are used to monitor the progress or growth of individual students within and/or across school years. Tests used for these purposes are sometimes supported by vertical scales that span a broad range of developmental or educational levels and include (but are not limited to) both conventional multilevel test batteries and computerized adaptive assessments. In constructing vertical scales for educational tests, it is important to align standards and/or learning objectives vertically across grades and to design tests at adjacent levels (or grades) that have substantial overlap in the content measured.

However, a variety of alternative statistical models exist for measuring student growth, not all of which require the use of a vertical scale. In using and evaluating various growth models, it is important to clearly understand which questions each growth model can (and cannot) answer, what assumptions each growth model is based on, and what appropriate inferences can be derived from each growth model's results. Missing data can create challenges for some growth models. Attention should be paid to whether some populations are being excluded from the model due to missing data (for example, students who are mobile or have poor attendance). Other factors to consider in the use of growth models are the relative reliability/precision of scores estimated for groups with different amounts of missing data, and whether the model treats students the same regardless of where they are on the performance continuum.
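The difference between growth models can be seen in miniature below. The sketch contrasts two simple approaches using hypothetical scores: gain scores, which presume a vertical scale, and residual gains, which compare each student with others who started at the same point and do not require a common scale across years. These are illustrative toy models, not recommended operational choices.

```python
import numpy as np

# Hypothetical scores for the same five students in two successive grades.
year1 = np.array([320.0, 355.0, 410.0, 388.0, 300.0])
year2 = np.array([365.0, 390.0, 440.0, 430.0, 330.0])

# Gain scores: meaningful only when both years share a vertical scale.
gain = year2 - year1

# Residual gain: growth relative to the expectation given the starting
# score, from a least-squares regression of year 2 on year 1.
X = np.column_stack([np.ones_like(year1), year1])
coef, *_ = np.linalg.lstsq(X, year2, rcond=None)
residual_gain = year2 - X @ coef

print(gain)
print(residual_gain)
```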


Student outcomes in educational testing are sometimes evaluated through norm-referenced interpretations. A norm-referenced interpretation compares a student's performance with the performances of other students. Such interpretations may be made when assessing both status and growth. Comparisons may be made to all students, to a particular subgroup (e.g., other test takers who have majored in the test taker's intended field of study), or to subgroups based on many other conditions (e.g., students with similar academic performance, students from similar schools). Norms can be developed for a variety of targeted populations ranging from national or international samples of students to the students in a particular school district (i.e., local norms). Norm-referenced interpretations should consider differences in the target populations at different times of a school year and in different years. When a test is routinely administered to an entire target population, as in the case of a statewide assessment, norm-referenced interpretations are relatively easy to produce and generally apply only to a single point in the school year. However, national norms for a standardized achievement test are often provided at several intervals within the school year. In that case, developers should indicate whether the norms covering a particular time interval were based on data or interpolated from data collected at other times of year. For example, winter norms are often based on an interpolation between empirical norms collected in fall and spring. The basis for calculating interpolated norms should be documented so that users can be made aware of the underlying assumptions about student growth over the school year.
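A minimal sketch of such an interpolation appears below. The percentile table and scores are hypothetical; the point is that the interpolated winter values rest on an explicit assumption (here, linear and uniform growth between the fall and spring norming windows) that should be documented for users.

```python
import numpy as np

# Hypothetical empirical norms: raw scores at selected percentile ranks
# from the fall and spring norming windows.
percentiles = np.array([25, 50, 75])
fall_scores = np.array([18.0, 24.0, 30.0])
spring_scores = np.array([24.0, 31.0, 37.0])

# Winter falls halfway between the two windows in this example, so the
# interpolated norm assumes half the fall-to-spring growth has occurred.
frac = 0.5
winter_scores = fall_scores + frac * (spring_scores - fall_scores)

print(dict(zip(percentiles.tolist(), winter_scores.tolist())))
```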
Because of the time and expense associated with developing national norms, many test developers report alternative user norms that consist of descriptive statistics based on all those who take their test or a demographically representative subset of those test takers over a given period of time. Although such statistics, based on people who happen to take the test, are often useful, the norms based on them will change as the makeup of the reference group changes. Consequently, user norms should not be confused with norms representative of more systematically sampled groups.

Informing decisions about students. Test results are often used in the process of making decisions about individual students, for example, about high school graduation, placement in certain educational programs, or promotion from one grade to the next. In higher education, test results inform admissions decisions and the placement of admitted students in different courses (e.g., remedial or regular) or instructional programs.

Fairness is a fundamental concern with all tests, but because decisions regarding educational placement, promotion, or graduation can have profound individual effects, fairness is paramount when tests are used to inform such decisions. Fairness in this context can be enhanced through careful consideration of conditions that affect students' opportunities to demonstrate their capabilities. For example, when tests are used for promotion and graduation, the fairness of individual score interpretations can be enhanced by (a) providing students with multiple opportunities to demonstrate their capabilities through repeated testing with alternate forms or other construct-equivalent means; (b) providing students with adequate notice of the skills and content to be tested, along with appropriate test preparation materials; (c) providing students with curriculum and instruction that afford them the opportunity to learn the content and skills to be tested; (d) providing students with equal access to disclosed test content and responses as well as any specific guidance for test taking (e.g., test-taking strategies); (e) providing students with appropriate testing accommodations to address particular access needs; and (f) in appropriate cases, taking into account multiple criteria rather than just a single test score.

Tests informing college admissions decisions are used in conjunction with other information about students' capabilities. Selection criteria may vary within an institution by academic specialization and may include past academic records, transcripts, and grade-point average or rank in class. Scores on tests used to certify students for high school graduation or scores on tests administered at the end of specific high school courses may be used in college admissions decisions. The interpretations
inherent in these uses of high school tests should be supported by multiple lines of relevant validity evidence (e.g., both concurrent and predictive evidence). Other measures used by some institutions in making admissions decisions are samples of previous work by students, lists of academic and service accomplishments, letters of recommendation, and student-composed statements evaluated for the appropriateness of the goals and experience of the student and/or for writing proficiency.

Tests used to place students in appropriate college-level or remedial courses play an important role in both community colleges and four-year institutions. Most institutions either use commercial placement tests or develop their own tests for placement purposes. The items on placement tests are typically selected to serve this single purpose in an efficient manner and usually do not comprehensively measure prerequisite content. For example, a placement test in algebra will cover only a subset of algebra content taught in high school. Results of some placement tests are used to exempt students from having to take a course that would normally be required. Other placement tests are used by advisors for placing students in remedial courses or the most appropriate course in an introductory college-level sequence. In some cases, placement decisions are mechanized through the application of locally determined cut scores on the placement exam. Such cut scores should be established through a documented process involving appropriate stakeholders and validated through empirical research.

Results from educational tests may also inform decisions related to placing students in special instructional programs, including those for students with disabilities, English learners, and gifted and talented students. Test scores alone should never be used as the sole basis for including any student in special education programming, or excluding any student from such programming. Test scores should be interpreted in the context of the student's history, functioning, and needs. Nevertheless, test results may provide an important basis for determining whether a student has a disability and what the student's educational needs are.

Development of Educational Tests

As with all tests, once the construct and purposes of an educational test have been delineated, consideration must be given to the intended population of test takers, as well as to practical issues such as available testing time and the resources available to support the development effort. In the development of educational tests, focus is placed on measuring the knowledge, skills, and abilities of all examinees in the intended population without introducing any advantages or disadvantages because of individual characteristics (e.g., age, culture, disability, gender, language, race/ethnicity) that are irrelevant to the construct the test is intended to measure. The principles of universal design, an approach to assessment development that attempts to maximize the accessibility of a test for all of its intended examinees, provide one basis for developing educational assessments in this manner. Paramount in the process is explicit documentation of the steps taken during the development process to provide evidence of fairness, reliability/precision, and validity for the test's intended uses. The higher the stakes associated with the assessment, the more attention needs to be paid to such documentation. More detailed considerations related to the development of educational tests are discussed in the chapters on fairness in testing (chap. 3) and test design and development (chap. 4).

A variety of formats are used in developing educational tests, ranging from traditional item formats such as multiple-choice and open-ended items to performance assessments, including scorable portfolios, simulations, and games. Examples of such performance assessments might include solving problems using manipulable materials, making complex inferences after collecting information, or explaining orally or in writing the rationale for a particular course of government action under given economic conditions. An individual portfolio may be used as another type of performance assessment. Scorable portfolios are systematic collections of educational products typically collected, and possibly revised, over time.


Technology is often used in educational settings to present testing material and to record and score test takers' responses. Examples include enhancements of text by audio instructions to facilitate student understanding, computer-based and adaptive tests, and simulation exercises where attributes of performance assessments are supported by technology. Some test administration formats also may have the capacity to capture aspects of students' processes as they solve test items. They may, for example, monitor time spent on items, solutions tried and rejected, or editing sequences for texts created by test takers. Technologies also make it possible to provide test administration conditions designed to accommodate students with particular needs, such as those with different language backgrounds, attention deficit disorders, or physical disabilities.

Interpretations of scores on technology-based tests are evaluated by the same standards for validity, reliability/precision, and fairness as tests administered through more traditional means. It is especially important that test takers be familiarized with the assessment technologies so that any unfamiliarity with an input device or assessment interface does not lead to inferences based on construct-irrelevant variance. Furthermore, explicit consideration of sources of construct-irrelevant variance should be part of the validation framework as new technologies or interfaces are incorporated into assessment programs. Finally, it is important to describe scoring algorithms used in technology-based tests and the expert models on which they may be based, and to provide technical data supporting their use in the testing system documentation. Such documentation, however, should stop short of jeopardizing the security of the assessment in ways that could adversely affect the validity of score interpretations.

Assessments Serving Multiple Purposes

By evaluating students' knowledge and skills relative to a specific set of academic goals, test results may serve a variety of purposes, including improving instruction to better meet student needs; evaluating curriculum and instruction district-wide; identifying students, schools and/or teachers who need help; and/or predicting each student's likelihood of success on a summative assessment. It is important to validate the interpretations made from test scores on such assessments for each of their intended uses.

There are often tensions associated with using educational assessments for multiple purposes. For example, a test developed to monitor the progress or growth of individual students across school years is unlikely to also effectively provide detailed and actionable diagnostic information about students' strengths and weaknesses. Similarly, an assessment designed to be given several times over the course of the school year to predict student performance on a year-end summative assessment is unlikely to provide useful information about student learning with respect to particular instructional units. Most educational tests will serve one purpose better than others; and the more purposes an educational test is purported to serve, the less likely it is to serve any of those purposes effectively. For this reason, test developers and users should design and/or select educational assessments to achieve the purposes they believe are most important, and they should consider whether additional purposes can be fulfilled and should monitor the appropriateness of any identified additional uses.

Use and Interpretation of Educational Assessments

Stakes and Consequences of Assessment

The importance of the results of testing programs for individuals, institutions, or groups is often referred to as the stakes of the testing program. When the stakes for an individual are high, and important decisions depend substantially on test performance, the responsibility for providing evidence supporting a test's intended purposes is greater than might be expected for tests used in low-stakes settings. Although it is never possible to achieve perfect accuracy in describing an individual's performance, efforts need to be made to minimize errors of measurement or errors in classifying individuals into categories such as "pass," "fail," "admit," or "reject." Further, supporting
the validity of interpretations for high-stakes purposes, whether individual or institutional, typically entails collecting sound collateral information that can be used to assist in understanding the factors that contributed to test results and to provide corroborating evidence that supports inferences based on the results. For example, test results can be influenced by multiple factors, both institutional and individual, such as the quality of education provided, students' exposure to education (e.g., through regular school attendance), and students' motivation to perform well on the test. Collecting this type of information can contribute to appropriate interpretations of test results.

The high-stakes nature of some testing programs can create special challenges when new test versions are introduced. For example, a state may introduce a series of high school end-of-course tests that are based on new content standards and are partially tied to graduation requirements. The operational use of these new tests must be accompanied by documentation that students have indeed been instructed on content aligned to the new standards. Because of feasibility constraints, this may require a carefully planned phase-in period that includes special surveys or qualitative research studies that provide the needed opportunity-to-learn documentation. Until such documentation is available, the tests should not be used for their intended high-stakes purpose.

Many types of educational tests are viewed as tools of educational policy. Beyond any intended policy goals, it is important to consider potential unintended effects of large-scale testing programs. These possible unintended effects include (a) narrowing of curricula in some schools to focus exclusively on anticipated test content, (b) restriction of the range of instructional approaches to correspond to the testing format, (c) higher dropout rates among students who do not pass the test, and (d) encouragement of instructional or administrative practices that may raise test scores without improving the quality of education. It is essential for those who mandate and use educational tests to be aware of such potential negative consequences (including missed opportunities to improve teaching and learning), to collect information that bears on these issues, and to make decisions about the uses of assessments that take this information into account.

Assessments for Students With Disabilities and English Language Learners

In the 1999 edition of the Standards, the material on educational testing for special populations focused primarily on individualized diagnostic assessment and educational placement for students with special needs. Since then, requirements stemming from federal legislation have significantly increased the participation of special populations in large-scale educational assessment programs. Special populations have also become more diverse and now represent a larger percentage of those test takers who participate in general education programs. More students are being diagnosed with disabilities, and more of these students are included in general education programs and in state standards-based assessments. In addition, the number of students who are English language learners has grown dramatically, and the number included in educational assessments has increased accordingly.

As discussed in chapter 3 ("Fairness in Testing"), assessments for special populations involve a continuum of potential adaptations, ranging from specially developed alternate assessments to modifications and accommodations of regular assessments. The purpose of alternate assessments and adaptations is to increase the accessibility of tests that may not otherwise allow students with some characteristics to display their knowledge and skills. Assessments for special populations may also include assessments developed for English language learners and individually administered assessments that are used for diagnosis and placement.

Alternate assessments. The term alternate assessments as used here, in the context of educational testing, refers to assessments developed for students with significant cognitive disabilities. Based on performance standards different from those used for regular assessments, alternate assessments provide these students with the opportunity to demonstrate
their standing and progress in learning. An alternate assessment might consist of an observation checklist, a multilevel assessment with performance tasks, or a portfolio that includes responses to selected-response and/or open-ended tasks. The assessment tasks are developed with the special characteristics of this population in mind. For example, a multilevel assessment with performance tasks might include scaffolding procedures in which the examiner eliminates question distracters when students answer incorrectly, in order to reduce question complexity. Or, in a portfolio assessment, the teacher might include work samples and other assessment information tailored specifically to the student. The teacher may assess the same English language arts standard by asking one student to write a story and another to sequence a story using picture cards, depending on which activity provides students with access to demonstrate what they know and can do.

The development and use of alternate assessments in education have been heavily influenced by federal legislation. Federal regulations may require that alternate assessments used in a given state have explicit connections to the content standards measured by the regular state assessment while allowing for content with less depth, breadth, and complexity. Such requirements clearly influence the design and development of alternate assessments in state standards-based programs.

Alternate assessments in education should be held to the same technical requirements that apply to regular large-scale assessments. These include documentation and empirical data that support test development, standard setting, validity, reliability/precision, and technical characteristics of the tests. When the number of students served under alternate assessments is too small to generate stable statistical data, the test developer and users should describe alternate judgmental or other procedures used to document evidence of the validity of score interpretations.

A variety of comparability issues may arise when alternate assessments are used in statewide testing programs, for example, in aggregating the results of alternate and regular assessments or in comparing trend data for subgroups when alternate assessments have been used in some years and regular assessments in other years.

Accommodations and modifications. To enable assessment systems to include all students, accommodations and modifications are provided to those students who need them, including those who participate in alternate assessments because of their significant cognitive disabilities. Adaptations, which include both accommodations and modifications, provide access to educational assessments. Accommodations are adaptations to test format or administration (such as changes in the way the test is presented, the setting for the test, or the way in which the student responds) that maintain the same construct and produce results that are comparable to those obtained by students who do not use accommodations. Accommodations may be provided to English language learners to address their linguistic needs, as well as to students with disabilities to address specific, individual characteristics that otherwise would interfere with accessibility. For example, a student with extreme dyslexia may be provided with a screen reader to read aloud the scenarios and questions on a test measuring science inquiry skills. The screen reader would be considered an accommodation because reading is not part of the defined construct (science inquiry) and the scores obtained by the student on the test would be assumed to be comparable to those obtained by students testing under regular conditions.

The use of accommodations should be supported by evidence that their application does not change the construct that is being measured by the assessment. Such evidence may be available from studies of similar applications but may also require specially designed research.

Modifications are adaptations to test format or administration that change the construct being measured in order to make it accessible for designated students while retaining as much of the original construct as possible. Modifications result in scores that differ in meaning from those for the regular assessment. For example, a student with extreme dyslexia may be provided with a screen reader to read aloud the passages and questions on a reading
comprehension test that includes decoding as part of the construct. In this case, the screen reader would be considered a modification because it changes the construct being measured, and scores obtained by the student on the test would not be assumed to be comparable to those obtained by students testing under regular conditions. In many cases, accommodations can meet student access needs without the use of modifications, but in some cases, modifications are the only option for providing some students with access to an educational assessment. As with alternate assessments, comparability issues arise with the use of modifications in educational testing programs.

Modified tests should be designed and developed with the same considerations of validity, reliability/precision, and fairness as regular assessments. It is not sufficient to assume that the validity evidence associated with a regular assessment generalizes to a modified version.

An extensive discussion of modifications and accommodations for special populations is provided in chapter 3 ("Fairness in Testing").

Assessments for English language proficiency. An increasing focus on the measurement of English language proficiency (ELP) for English language learners (ELLs) has mirrored the growing presence of these students in U.S. classrooms. Like standards-based content tests, ELP tests are based on ELP standards and are held to the same standards for precision of scores and validity and fairness of score interpretations for intended uses as are other large-scale tests.

ELP tests can serve a variety of purposes. They are used to identify students as English learners and qualify them for special ELL programs and services, to redesignate students as English proficient, and for purposes of diagnosis and instruction. States, districts, and schools also use ELP tests to monitor these students' progress and to hold schools and educators accountable for ELL learning and progress toward English proficiency.

As with any educational test, validity evidence for measures of ELP can be provided by examining the test blueprint, the alignment of content with ELP standards, construct comparability across students, classification consistency, and other claims in the validity argument. The rationale and evidence supporting the ELP domain definition and the roles/relationships of the language modalities (e.g., reading, writing, speaking, listening) to overall ELP are important considerations in articulating the validity argument for an ELP test and can inform the interpretation of test results. Since no single assessment is equally effective in serving all desired purposes, users should consider which uses of ELP tests are their highest priority and choose or develop instruments accordingly.

Accommodations associated with ELP tests should be carefully considered, as adaptations that are appropriate for regular content assessments may compromise the ELP standards being assessed. In addition, users should establish common guidelines for using ELP results in making decisions about ELL students. The guidelines should include explicit policies and procedures for using results in identifying and redesignating ELL students as English proficient, an important process because of the legal and educational importance of these designations. Local education agencies and schools should be provided with easy access to the guidelines.

Individual assessments. Individually administered tests are used by psychologists and other professionals in schools and other related settings to inform decisions about a variety of services that may be administered to students. Services are provided for students who are gifted as well as for those who encounter academic difficulties (e.g., students requiring remedial reading instruction). Still other services are provided for students who display behavioral, emotional, physical, and/or more severe learning difficulties. Services may be provided for students who are taught in regular classrooms as well as for those receiving more specialized instruction (e.g., special education students).

Aspects of the test that may result in construct-irrelevant variance for students with certain relevant characteristics should be taken into account as appropriate by qualified testing professionals when using test results to aid placement decisions. For example, students' English language proficiency or prior educational experience may interfere with
their performance on a test of academic ability and, if not taken into account, could lead to misclassification in special education. Once a student is placed, tests may be administered to monitor the progress of the student toward prescribed learning goals and objectives. Test results may also be used to inform evaluations of instructional effectiveness and determinations of whether the special services need to be continued, modified, or discontinued.

Many types of tests are used in individualized and special needs testing. These include tests of cognitive abilities, academic achievement, learning processes, visual and auditory memory, speech and language, vision and hearing, and behavior and personality. These tests typically are used in conjunction with other assessment methods, such as interviews, behavioral observations, and reviews of records, for purposes of identifying and placing students with disabilities. Regardless of the qualities being assessed and the data collection methods employed, assessment data used in making special education decisions are evaluated in terms of evidence supporting intended interpretations as related to the specific needs of the students. The data must also be judged in terms of their usefulness for designing appropriate educational programs for students who have special needs. For further information, see chapter 10 ("Psychological Testing and Assessment").

Assessment Literacy and Professional Development

Assessment literacy can be broadly defined as knowledge about the basic principles of sound assessment practice, including terminology, the development and use of assessment methodologies and techniques, and familiarity with standards by which the quality of testing practices are judged. The results of educational assessments are used in decision making across a variety of settings in classrooms, schools, districts, and states. Given the range and complexity of test purposes, it is important for test developers and those responsible for educational testing programs to encourage educators to be informed consumers of the tests and to fully understand and appropriately use results that are reported to them. Similarly, as test users, it is the responsibility of educators to pursue and attain assessment literacy as it pertains to their roles in the education system.

Test sponsors and test developers can promote educator assessment literacy in a variety of ways, including workshops, development of written materials and media, and collaboration with educators in the test development process (e.g., development of content standards, item writing and review, and standard setting). In particular, those responsible for educational testing programs should incorporate assessment literacy into the ongoing professional development of educators. In addition, regular attempts should be made to educate other major stakeholders in the educational process, including parents, students, and policy makers.

Administration, Scoring, and Reporting of Educational Assessments

Administration of Educational Tests

Most educational tests involve standardized procedures for administration. These include directions to test administrators and examinees, specifications for testing conditions, and scoring procedures. Because educational tests typically are administered by school personnel, it is important for the sponsoring agency to provide appropriate oversight to the process and for schools to assign local roles and responsibilities (e.g., testing coordination) for training those who will administer the test. Similarly, test developers have an obligation to support the test administration process and to provide resources to help solve problems when they arise. For example, with high-stakes tests administered by computer, effective technical support to the local administration is critical and should involve personnel who understand the context of the testing program as well as the technical aspects of the delivery system.

Those responsible for educational testing programs should have formal procedures for granting testing accommodations and involve qualified personnel in the associated decision-making process. For students with disabilities, changes in both in-
struction and assessment are typically specified in an individualized education program (IEP). For English language learners, schools may use guidance from the state or district to match students' language proficiency and instructional experience with appropriate language accommodations. Test accommodations should be chosen by qualified personnel on the basis of the individual student's needs. It is particularly important in large-scale assessment programs to establish clear policies and procedures for assigning and using accommodations. These steps help to maintain the comparability of scores for students testing with accommodations on academic assessments across districts and schools. Once selected, accommodations should be used consistently for both instruction and assessment, and test administrators should be fully familiar with procedures for accommodated testing. Additional information related to test administration accommodations is provided in chapter 3 ("Fairness in Testing").

Weighted and Composite Scoring

Scoring educational tests and assessments requires developing rules for combining scores on items and/or tasks to obtain a total score and, in some cases, for combining multiple scores into an overall composite. Scores from multiple tests are sometimes combined into linear composites using nominal weights, which are assigned to each component score in accordance with a logical judgment of its relative importance. Nominal weights may sometimes be misleading because the variance of the composite is also determined by the variances and covariances of the individual component scores. As a result, the "effective weight" of each component may not reflect the nominal weighting. When composite scores are used, differences between nominal and effective weights should be understood and documented.
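The gap between nominal and effective weights can be computed directly from the component variances and covariances. In the sketch below (the weights and covariance matrix are hypothetical), the effective weight is taken as each component's share of composite variance, one common definition; the first component dominates the composite despite its modest nominal weight because its variance is large.

```python
import numpy as np

# Hypothetical nominal weights and covariance matrix of three components.
w = np.array([0.4, 0.3, 0.3])
cov = np.array([[100.0, 30.0, 20.0],
                [ 30.0, 25.0, 10.0],
                [ 20.0, 10.0, 16.0]])

# Variance of the composite C = sum_i w_i * X_i.
var_c = w @ cov @ w

# Effective weight of each component: w_i * cov(X_i, C) / var(C),
# i.e., the proportion of composite variance it contributes.
effective = w * (cov @ w) / var_c

print(effective)  # about [0.66, 0.20, 0.14], not the nominal 0.4/0.3/0.3
```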
For a single test, total scores are often based on a simple sum of the item and task scores. However, differential weighting schemes may be applied to reflect differential emphasis on specific content or constructs. For example, in an English language arts test, more weight may be assigned to an extended essay because of the importance of the task and because it is not feasible to include more than one extended writing task in the test. In addition, scoring based on item response theory (IRT) models can result in item weights that differ from nominal or desired weights. Such applications of IRT should include consideration and explanation of item weights in scoring. In general, the scoring rules used for educational tests should be documented and include a validity-based rationale.

In addition, test developers should discuss with policy makers the various methods of combining the results from different educational tests used to make decisions about students, and should clearly document and communicate the methods, also known as decision rules. For example, as part of graduation requirements, a state may require a student to achieve established levels of performance on multiple tests measuring different content areas using either a noncompensatory or a compensatory decision rule. Under a noncompensatory decision rule, the student has to achieve a determined level of performance on each test; under a compensatory decision rule, the student may only have to achieve a certain total composite score based on a combination of scores across tests. For a high-stakes decision, such as one related to graduation, the rules used to combine scores across tests should be established with a clear understanding of the associated implications. In these situations, important consequences such as passing rates and classification error rates will differ depending on the rules for combining test results. Test developers should document and communicate these implications to policy makers to encourage policy decisions that are fully informed.
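The sketch below illustrates how the two families of rules can classify the same students differently. The scores and required levels are hypothetical; for an operational program, the consequences of any candidate rule would be studied empirically before adoption.

```python
import numpy as np

# Hypothetical scores for five students on two required tests.
math = np.array([70.0, 55.0, 88.0, 60.0, 45.0])
ela = np.array([62.0, 80.0, 58.0, 61.0, 90.0])

# Noncompensatory rule: a required level must be met on each test.
pass_noncompensatory = (math >= 60) & (ela >= 60)

# Compensatory rule: only the total matters, so strength on one test
# can offset weakness on the other.
pass_compensatory = (math + ela) >= 120

# The same students yield different passing rates under the two rules.
print(pass_noncompensatory.mean(), pass_compensatory.mean())  # 0.4 vs 1.0
```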
Reporting Scores

Score reports for educational assessments should support the interpretations and decisions of their intended audiences, which include students, teachers, parents, principals, policy makers, and other educators. Different reports may be developed and produced for different audiences, and the score report layouts may differ accordingly. For example, reports prepared for individual students
and parents may include background information about the purpose of the assessment, definitions of performance categories, and more user-friendly representations of measurement error (e.g., error bands around graphical score displays). Those who develop such reports should strive to provide information that can help students make productive decisions about their own learning. In contrast, reports prepared for principals and district-level personnel may include more detailed summaries but less foundational information because these individuals typically have a much better understanding of assessments.

As discussed in chapter 3, when modifications have been made to a test for some test takers that affect the construct being measured, consideration may be given to reporting that a modification was made because it affects the reliability/precision of test scores or the validity of interpretations drawn from test scores. Conversely, when accommodations are made that do not affect the comparability of test scores, flagging those accommodations is not appropriate.

In general, score reports for educational tests should be designed to provide information that is understandable and useful to stakeholders without leading to unwarranted score interpretations. Test developers can significantly improve the design of score reports by conducting supporting research. For example, surveys of available reports for other educational tests can provide ideas for effectively displaying test results. In addition, usability research with consumers of score reports can provide insights into report design. A number of techniques can be used in this type of research, including focus groups, surveys, and analyses of verbal protocols. For example, the advantages and disadvantages of alternate prototype designs can be compared by gathering data about the interpretations and inferences made by users based on the data presented in each report.

Online reporting capabilities give users flexible access to test results. For example, the user can select options online to break down the results by content or subgroup. The options provided to test users for querying the results should support the test's intended uses and interpretations. For example, online systems may discourage or disallow viewing of results, in some cases as required by law, if the sample sizes of particular subgroups fall below an acceptable number. In addition, care should be taken to allow access only to the appropriate individuals. As with score reports, the validity of interpretations from online supporting systems can be enhanced through usability research involving the intended score users.
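A minimum-n suppression rule of the kind described here can be expressed very simply. The function and threshold below are hypothetical; actual minimum sample sizes are typically set by statute, regulation, or program policy rather than chosen by the reporting system's developers.

```python
MIN_N = 10  # hypothetical minimum reportable subgroup size

def reportable(subgroup_results, min_n=MIN_N):
    """Return only the subgroups that meet the minimum sample size."""
    return {name: stats for name, stats in subgroup_results.items()
            if stats["n"] >= min_n}

results = {
    "all_students": {"n": 240, "mean_score": 312.4},
    "subgroup_a": {"n": 54, "mean_score": 305.1},
    "subgroup_b": {"n": 6, "mean_score": 298.0},  # withheld: n < 10
}
print(reportable(results))
```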
Technology also facilitates close alignment of instructional materials with the results of educational tests. For example, results reported for an individual student could include not only strengths and weaknesses but direct links to specific instructional materials that a teacher may use with the student in the future. Rationales and documentation supporting the efficacy of the recommended interventions should be provided, and users should be encouraged to consider such information in conjunction with other evidence and judgments about student instructional needs.

When results are reported for large-scale assessments, the test sponsors or users should prepare accompanying guidance to promote sound use and valid interpretations of the data by the media and other stakeholders in the assessment process. Such communications should address likely testing consequences (both positive and negative), as well as anticipated misuses of the results.


STANDARDS FOR EDUCATIONAL TESTING AND ASSESSMENT


The standards in this chapter have been separated into three thematic clusters labeled as follows:

1. Design and Development of Educational Assessments
2. Use and Interpretation of Educational Assessments
3. Administration, Scoring, and Reporting of Educational Assessments

Users of educational tests for evaluation, policy, or accountability should also refer to the standards in chapter 13 ("Uses of Tests for Program Evaluation, Policy Studies, and Accountability").

Cluster 1. Design and Development of Educational Assessments

Standard 12.1

When educational testing programs are mandated by school, district, state, or other authorities, the ways in which test results are intended to be used should be clearly described by those who mandate the tests. It is also the responsibility of those who mandate the use of tests to monitor their impact and to identify and minimize potential negative consequences as feasible. Consequences resulting from the uses of the test, both intended and unintended, should also be examined by the test developer and/or user.

Comment: Mandated testing programs are often justified in terms of their potential benefits for teaching and learning. Concerns have been raised about the potential negative impact of mandated testing programs, particularly when they directly result in important decisions for individuals or institutions. There is concern that some schools are narrowing their curriculum to focus exclusively on the objectives tested, encouraging instructional or administrative practices designed simply to raise test scores rather than improve the quality of education, and losing higher numbers of students because many drop out after failing tests. The need to monitor the impact of educational testing programs relates directly to fairness in testing, which requires ensuring that scores on a given test reflect the same construct and have essentially the same meaning for all individuals in the intended test-taker population. Consistent with appropriate testing objectives, potential negative consequences should be monitored and, when identified, should be addressed to the extent possible. Depending on the intended use, the person responsible for examining the consequences could be the mandating authority, the test developer, or the user.

Standard 12.2

In educational settings, when a test is designed or used to serve multiple purposes, evidence of validity, reliability/precision, and fairness should be provided for each intended use.

Comment: In educational testing, it has become common practice to use the same test for multiple purposes. For example, interim/benchmark tests may be used for a variety of purposes, including diagnosing student strengths and weaknesses, monitoring individual student growth, providing information to assist in instructional planning for individuals or groups of students, and evaluating schools or districts. No test will serve all purposes equally well. Choices in test design and development that enhance validity for one purpose may diminish validity for other purposes. Different purposes may require different kinds of technical evidence, and appropriate evidence of validity, reliability/precision, and fairness for each purpose should be provided by the test developer. If the test user wishes to use the test for a purpose not supported by the available evidence, it is incumbent on the user to provide the necessary additional evidence. See chapter 1 ("Validity").

Standard 12.3

Those responsible for the development and use of educational assessments should design all relevant steps of the testing process to promote
Those responsible for the development and use of educational assessments should design all relevant steps of the testing process to promote access to the construct for all individuals and subgroups for whom the assessment is intended.

Comment: It is important in educational contexts to provide for all students, regardless of their individual characteristics, the opportunity to demonstrate their proficiency on the construct being measured. Test specifications should clearly specify all relevant subgroups in the target population, including those for whom the test may not allow demonstration of knowledge and skills. Items and tasks should be designed to maximize access to the test content for all individuals in the intended test-taker population. Tools and strategies should be implemented to familiarize all test takers with the technology and testing format used, and the administration and scoring approach should avoid introducing any construct-irrelevant variance into the testing process. In situations where individual characteristics such as English language proficiency, cultural or linguistic background, disability, or age are believed to interfere with access to the construct(s) that the test is intended to measure, appropriate adaptations should be provided to allow access to the content, context, and response formats of the test items. These may include both accommodations (changes that are assumed to preserve the construct being measured) and modifications (changes that are assumed to make an altered version of the construct accessible). Additional considerations related to fairness and accessibility in educational tests and assessments are provided in chapter 3 ("Fairness in Testing").

Standard 12.4

When a test is used as an indicator of achievement in an instructional domain or with respect to specified content standards, evidence of the extent to which the test samples the range of knowledge and elicits the processes reflected in the target domain should be provided. Both the tested and the target domains should be described in sufficient detail for their relationship to be evaluated. The analyses should make explicit those aspects of the target domain that the test represents, as well as those aspects that the test fails to represent.

Comment: Tests are commonly developed to monitor the status or progress of individuals and groups with respect to local, state, national, or professional content standards. Rarely can a single test cover the full range of performances reflected in the content standards. In developing a new test or selecting an existing test, appropriate interpretation of test scores as indicators of performance on these standards requires documenting and evaluating both the relevance of the test to the standards and the extent to which the test is aligned to the standards. Such alignment studies should address multiple criteria, including not only alignment of the test with the content areas covered by the standards but also alignment with the standards in terms of the range and complexity of knowledge and skills that students are expected to demonstrate. Further, conducting studies of the cognitive strategies and skills employed by test takers, or studies of the relationships between test scores and other performance indicators relevant to the broader target domain, enables evaluation of the extent to which generalizations to that domain are supported. This information should be made available to all who use the test or interpret the test scores.

Standard 12.5

Local norms should be developed when appropriate to support test users' intended interpretations.

Comment: Comparison of examinees' scores to local as well as more broadly representative norm groups can be informative. Thus, sample size permitting, local norms are often useful in conjunction with published norms, especially if the local population differs markedly from the population on which published norms are based. In some cases, local norms may be used exclusively.
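Purely as an illustration of how such local norms might be computed (a minimal sketch; the scores and the function name percentile_rank below are hypothetical, not part of the Standards), a percentile rank within a local norm group can be obtained along these lines:

    # Sketch: local percentile-rank norms from a district's own score data.
    from bisect import bisect_left, bisect_right

    def percentile_rank(norm_scores, score):
        # Percentile rank within a sorted local norm group, counting
        # half of any tied scores (a common convention).
        below = bisect_left(norm_scores, score)
        ties = bisect_right(norm_scores, score) - below
        return 100.0 * (below + 0.5 * ties) / len(norm_scores)

    local_norm_group = sorted([12, 15, 15, 18, 21, 22, 25, 25, 28, 31])
    print(percentile_rank(local_norm_group, 22))  # 55.0 relative to local norms

The same raw score could then be reported against both the local table and the publisher's norms, consistent with the comment above.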


Standard 12.6

Documentation of design, models, and scoring algorithms should be provided for tests administered and scored using multimedia or computers.

Comment: Computer and multimedia tests need to be held to the same requirements of technical quality as other tests. For example, the use of technology-enhanced item formats should be supported with evidence that the formats are a feasible way to collect information about the construct, that they do not introduce construct-irrelevant variance, and that steps have been taken to promote accessibility for all students.

Cluster 2. Use and Interpretation of Educational Assessments

Standard 12.7

In educational settings, test users should take steps to prevent test preparation activities and distribution of materials to students that may adversely affect the validity of test score inferences.

Comment: In most educational testing contexts, the goal is to use a sample of test items to make inferences to a broader domain. When inappropriate test preparation activities occur, such as excessive teaching of items that are equivalent to those on the test, the validity of test score inferences is adversely affected. The appropriateness of test preparation activities and materials can be evaluated, for example, by determining the extent to which they reflect the specific test items and by considering the extent to which test scores may be artificially raised as a result, without increasing students' level of genuine achievement.

Standard 12.8

When test results contribute substantially to decisions about student promotion or graduation, evidence should be provided that students have had an opportunity to learn the content and skills measured by the test.

Comment: Students, parents, and educational staff should be informed of the domains on which the students will be tested, the nature of the item types, and the criteria for determining mastery. Reasonable efforts should be made to document the provision of instruction on the tested content and skills, even though it may not be possible or feasible to determine the specific content of instruction for every student. In addition and as appropriate, evidence should also be provided that students have had the opportunity to become familiar with the mode of administration and item formats used in testing.

Standard 12.9

Students who must demonstrate mastery of certain skills or knowledge before being promoted or granted a diploma should have a reasonable number of opportunities to succeed on alternate forms of the test or be provided with technically sound alternatives to demonstrate mastery of the same skills or knowledge. In most circumstances, when students are provided with multiple opportunities to demonstrate mastery, the time interval between the opportunities should allow students to obtain the relevant instructional experiences.

Comment: The number of testing opportunities and the time between opportunities will vary with the specific circumstances of the setting. Further, policy may dictate that some students should be given opportunities to demonstrate their achievement using a different approach. For example, some states that administer high school graduation tests permit students who have participated in the regular curriculum but are unable to demonstrate the required performance level on one or more of the tests to show, through a structured portfolio of their coursework and other indicators (e.g., participation in approved assistance programs, satisfaction of other graduation requirements), that they have the knowledge and skills necessary to obtain a high school diploma.
If another assessment approach is used, it should be held to the same standards of technical quality as the primary assessment. In particular, evidence should be provided that the alternative approach measures the same skills and has the same passing expectations as the primary assessment.

Standard 12.10

In educational settings, a decision or characterization that will have major impact on a student should take into consideration not just scores from a single test but other relevant information.

Comment: In general, multiple measures or data sources will often enhance the appropriateness of decisions about students in educational settings and therefore should be considered by test sponsors and test users in establishing decision rules and policy. It is important that in addition to scores on a single test, other relevant information (e.g., school coursework, classroom observation, parental reports, other test scores) be taken into account when warranted. These additional data sources should demonstrate information relevant to the intended construct. For example, it may not be advisable or lawful to automatically accept students into a gifted program if their IQ is measured to be above 130 without considering additional relevant information about their performance. Similarly, some students with measured IQs below 130 may be accepted based on other measures or data sources, such as a test of creativity, a portfolio of student work, or teacher recommendations. In these cases, other evidence of gifted performance serves to compensate for the lower IQ test score.

Standard 12.11

When difference or growth scores are used for individual students, such scores should be clearly defined, and evidence of their validity, reliability/precision, and fairness should be reported.

Comment: The standard error of the difference between scores on the pretest and posttest, the regression of posttest scores on pretest scores, or relevant data from other appropriate methods for examining change should be reported. In cases where growth scores are predicted for individual students, results based on different versions of tests taken over time may be used. For example, math scores in Grades 3, 4, and 5 may be used to predict the expected math score in Grade 6. In such cases, if complex statistical models are used to predict scores for individual students, the method for constructing the models should be made explicit and should be justified, and supporting technical and interpretive information should be provided to the score users. Chapter 13 ("Uses of Tests for Program Evaluation, Policy Studies, and Accountability") addresses the application of more complex models to groups or systems within accountability settings.
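As a hedged illustration of the kind of prediction the comment describes (all scores below are hypothetical, and operational growth models are considerably more elaborate), an expected Grade 6 score could be estimated from Grades 3-5 scores by ordinary least squares:

    # Sketch: predicting an expected Grade 6 math score from Grades 3-5.
    import numpy as np

    # Rows are students; columns are Grade 3, 4, and 5 scale scores.
    X = np.array([[410, 435, 455],
                  [390, 405, 430],
                  [450, 470, 500],
                  [420, 440, 465],
                  [400, 425, 445]], dtype=float)
    y = np.array([480, 450, 525, 490, 470], dtype=float)  # observed Grade 6

    X1 = np.column_stack([np.ones(len(X)), X])     # add an intercept term
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # fit the prediction model

    new_student = np.array([1.0, 415, 430, 460])   # Grades 3-5 for one student
    print(new_student @ beta)                      # expected Grade 6 score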
Standard 12.12

When an individual student's scores from different tests are compared, any educational decision based on the comparison should take into account the extent of overlap between the two constructs and the reliability or standard error of the difference score.

Comment: When difference scores between two tests are used to aid in making educational decisions, it is important that the two tests be placed on a common scale, either by standardization or by some other means, and, if appropriate, normed on the same population at about the same time. In addition, the reliability and standard error of the difference scores between the two tests are affected by the relationship between the constructs measured by the tests as well as by the standard errors of measurement of the scores of the two tests. For example, when scores on a nonverbal ability measure are compared with achievement test scores, the overlapping nature of the two constructs may render the reliability of the difference scores lower than test users normally would expect. If the ability and/or achievement tests involve a significant amount of measurement error, this will also reduce the confidence that can be placed in the difference scores.
All these factors affect the reliability of difference scores between tests and should be considered when such scores are used as a basis for making important decisions about a student. This standard is also relevant in comparisons of subscores or scores from different components of the same test, such as may be reported for multiple aptitude test batteries, educational tests, and/or selection tests.
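The quantities the comment refers to can be made concrete with the classical test theory formulas for difference scores; the sketch below uses hypothetical reliabilities, standard deviations, and correlation, and assumes scores on a common scale with uncorrelated measurement errors:

    # Sketch: standard error and reliability of a difference score X - Y.
    import math

    def sem(sd, rel):
        # Standard error of measurement: SD * sqrt(1 - reliability).
        return sd * math.sqrt(1.0 - rel)

    def se_difference(sd_x, rel_x, sd_y, rel_y):
        # SE of X - Y, assuming uncorrelated measurement errors.
        return math.sqrt(sem(sd_x, rel_x) ** 2 + sem(sd_y, rel_y) ** 2)

    def rel_difference(sd_x, rel_x, sd_y, rel_y, r_xy):
        # Reliability of X - Y; it shrinks as the correlation r_xy
        # between the two measures grows.
        num = sd_x**2 * rel_x + sd_y**2 * rel_y - 2 * r_xy * sd_x * sd_y
        den = sd_x**2 + sd_y**2 - 2 * r_xy * sd_x * sd_y
        return num / den

    # Nonverbal ability vs. achievement, both reported with SD = 15:
    print(se_difference(15, 0.90, 15, 0.88))         # about 7 scale-score points
    print(rel_difference(15, 0.90, 15, 0.88, 0.75))  # 0.56, far below 0.90

With overlapping constructs (r_xy = 0.75 in this hypothetical), two individually reliable tests yield a difference score with reliability of only 0.56, which is the phenomenon the comment warns about.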
Standard 12.13

When test scores are intended to be used as part of the process for making decisions about educational placement, promotion, implementation of individualized educational programs, or provision of services for English language learners, then empirical evidence documenting the relationship among particular test scores, the instructional programs, and desired student outcomes should be provided. When adequate empirical evidence is not available, users should be cautioned to weigh the test results accordingly in light of other relevant information about the students.

Comment: The use of test scores for placement or promotion decisions should be supported by evidence about the relationship between the test scores and the expected benefits of the resulting educational programs. Thus, empirical evidence should be gathered to support the use of a test by a community college to place entering students in different mathematics courses. Similarly, in special education, when test scores are used in the development of specific educational objectives and instructional strategies, evidence is needed to show that the prescribed instruction is (a) directly linked to the test scores, and (b) likely to enhance student learning. When there is limited evidence about the relationship among test results, instructional plans, and student achievement outcomes, test developers and users should stress the tentative nature of the test-based recommendations and encourage teachers and other decision makers to weigh the usefulness of the test scores in light of other relevant information about the students.

Standard 12.14

In educational settings, those who supervise others in test selection, administration, and score interpretation should be familiar with the evidence for the reliability/precision, the validity of the intended interpretations, and the fairness of the scores. They should be able to articulate and effectively train others to articulate a logical explanation of the relationships among the tests used, the purposes served by the tests, and the interpretations of the test scores for the intended uses.

Comment: Appropriate interpretations of scores on educational tests depend on the effective training of individuals who carry out test administration and on the appropriate education of those who make use of test results. Establishing ongoing professional development programs that include a focus on improving the assessment literacy of teachers and stakeholders is one mechanism by which those who are responsible for test use in educational settings can facilitate the validity of test score interpretations. Establishing educational requirements (e.g., an advanced degree, relevant coursework, or attendance at workshops provided by the test developer or test sponsor) is another strategy that might be used to provide documentation of qualifications and expertise.

Standard 12.15

Those responsible for educational testing programs should take appropriate steps to verify that the individuals who interpret the test results to make decisions within the school context are qualified to do so or are assisted by and consult with persons who are so qualified.

Comment: When testing programs are used as a strategy for guiding instruction, the school personnel who are expected to make inferences about instructional planning may need assistance in interpreting test results for this purpose.
Such assistance may consist of ongoing professional development, interpretive guides, training, information sessions, and the availability of experts to answer questions that arise as test results are disseminated. The interpretation of some test scores is sufficiently complex to require that the user have relevant training and experience or be assisted by and consult with persons who have such training and experience. Examples of such tests include individually administered intelligence tests, interest inventories, growth scores on state assessments, projective tests, and neuropsychological tests.

Cluster 3. Administration, Scoring, and Reporting of Educational Assessments

Standard 12.16

Those responsible for educational testing programs should provide appropriate training, documentation, and oversight so that the individuals who administer and score the test(s) are proficient in the appropriate test administration and scoring procedures and understand the importance of adhering to the directions provided by the test developer.

Comment: In addition to being familiar with standardized test administration documentation and procedures (including test security protocols), it is important for test coordinators and test administrators to be familiar with materials and procedures for accommodations and modifications for testing. Test developers should therefore provide appropriate manuals and training materials that specifically address accommodated administrations. Test coordinators and test administrators should also receive information about the characteristics of the student populations included in the testing program.

Standard 12.17

In educational settings, reports of group differences in test scores should be accompanied by relevant contextual information, where possible, to enable meaningful interpretation of the differences. Where appropriate contextual information is not available, users should be cautioned against misinterpretation.

Comment: Differences in test scores between relevant subgroups (e.g., classified by gender, race/ethnicity, school/district, or geographical region) can be influenced, for example, by differences in student characteristics, in course-taking patterns, in curriculum, in teachers' qualifications, or in parental educational levels. Differences in performance of cohorts of students across time may be influenced by changes in the population of students tested or changes in learning opportunities for students. Users should be advised to consider the appropriate contextual information and be cautioned against misinterpretation.

Standard 12.18

In educational settings, score reports should be accompanied by a clear presentation of information on how to interpret the scores, including the degree of measurement error associated with each score or classification level, and by supplementary information related to group summary scores. In addition, dates of test administration and relevant norming studies should be included in score reports.

Comment: Score information should be communicated in a way that is accessible to persons receiving the score report. Empirical research involving score report users can help to improve the clarity of reports. For instance, the degree of uncertainty in the scores might be represented by presenting standard errors of measurement graphically, or the probability of misclassification associated with performance levels might be provided. Similarly, when average or summary scores for groups of students are reported, they should be supplemented with additional information about the sample sizes and the shapes or dispersions of score distributions. Particular care should be taken to portray subscore information in score reports in ways that facilitate proper interpretation. Score reports should include the date of administration so that score users can consider the validity of inferences as time passes. Score reports should also include the dates of relevant norming studies so users can consider the age of the norms in making inferences about student performance.
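For instance, under the common assumption of normally distributed measurement error, an error band and a misclassification probability might be computed as in the following sketch (the score, cut score, and SEM values are hypothetical):

    # Sketch: score uncertainty as an error band, and the chance of
    # misclassification near a performance-level cut score.
    from statistics import NormalDist

    def error_band(score, sem, coverage=0.95):
        # Approximate confidence band: score +/- z * SEM.
        z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
        return score - z * sem, score + z * sem

    def prob_classified_proficient(true_score, cut, sem):
        # P(observed score >= cut) for a given true score.
        return 1.0 - NormalDist(mu=true_score, sigma=sem).cdf(cut)

    print(error_band(243, sem=5.0))                           # about (233, 253)
    print(prob_classified_proficient(240, cut=245, sem=5.0))  # about 0.16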


Standard 12.19

In educational settings, when score reports include recommendations for instructional intervention or are linked to recommended plans or materials for instruction, a rationale for and evidence to support these recommendations should be provided.

Comment: Technology is making it increasingly possible to assign particular instructional interventions to students based on assessment results. Specific digital content (e.g., worksheets or lessons) may be made available to students using a rules-based interpretation of their performance on a standards-based test. In such instances, documentation supporting the appropriateness of instructional assignments should be provided. Similarly, when the pattern of subscores on a test is used to assign students to particular instructional interventions, it is important to provide both a rationale and empirical evidence to support the claim that these assignments are appropriate. In addition, users should be advised to consider such pedagogical recommendations in conjunction with other relevant information about students' strengths and weaknesses.

13. USES OF TESTS FOR PROGRAM EVALUATION, POLICY STUDIES, AND ACCOUNTABILITY

BACKGROUND
Tests are widely used to inform decisions as part of public policy. One example is the use of tests in the context of the design and evaluation of programs or policy initiatives. Program evaluation is the set of procedures used to make judgments about a program's design, its implementation, and its outcomes. Policy studies are somewhat broader than program evaluations; they contribute to judgments about plans, principles, or procedures enacted to achieve broad public goals. Tests often provide the data that are analyzed to estimate the effect of a policy, program, or initiative on outcomes such as student achievement or motivation. A second broad category of test use in policy settings is in accountability systems, which attach consequences (e.g., rewards and sanctions) to the performance of institutions (such as schools or school districts) or individuals (such as teachers or mental health care providers). Program evaluations, policy studies, and accountability systems should not necessarily be viewed as discrete categories. They are frequently adopted in combination with one another, as is the case when accountability systems impose requirements or recommendations to use test results for evaluating programs adopted by schools or districts.

The uses of tests for program evaluations, policy studies, and accountability share several characteristics, including measurement of the performance of a group of people and use of test scores as evidence of the success or shortcomings of an institution or initiative. This chapter examines these uses of tests. The accountability discussion focuses on systems that involve aggregates of scores, such as school-wide or institution-wide averages, percentages of students or patients scoring above a certain level, or growth or value-added modeling results aggregated at the classroom, school, or institution level. Systems or programs that focus on accountability for individual students, such as through test-based promotion policies or graduation exams, are addressed in chapter 12. (However, many of the issues raised in that chapter are relevant to the use of educational tests for program evaluation or school accountability purposes.) If accountability systems or programs include tests administered to teachers, principals, or other providers for purposes of evaluating their practice or performance (e.g., for teacher pay-for-performance programs that include a test of teacher knowledge or an observation-based measure of their practices), those tests should be evaluated according to the standards related to workplace testing and credentialing in chapter 11.

The contexts in which testing for evaluation and accountability takes place vary in the stakes for test takers and for those who are responsible for promoting specific outcomes (such as teachers or health care providers). Testing programs for institutions can have high stakes when the aggregate performance of a sample or of the entire population of test takers is used to make inferences about the quality of services provided and, as a result, decisions are made about institutional status, rewards, or sanctions. For example, the quality of reading curriculum and instruction may be judged in part on the basis of results of testing for levels of attainment reached by groups of students. Similarly, aggregated scores on psychological tests are sometimes used to evaluate the effectiveness of treatment provided by mental health programs or agencies and may be included in accountability systems. Even when test results are reported in the aggregate and intended for low-stakes purposes, the public release of data may be used to inform judgments about program quality, personnel, or educational programs and may influence policy decisions.


Evaluation of Programs and Policy Initiatives

As noted earlier, program evaluation typically involves making judgments about a single program, whereas policy studies address plans, principles, or procedures enacted to achieve broad public goals. Policy studies may address policies at various levels of government, including local, state, federal, and international, and may be conducted in both public and private organizational or institutional contexts. There is no sharp distinction between policy studies and program evaluations, and in many instances there is substantial overlap between the two types of investigations. Test results are often one important source of evidence for the initiation, continuation, modification, termination, or expansion of various programs and policies.

Tests may be used in program evaluations or policy studies to provide information on the status of clients, students, or other groups before, during, or after an intervention or policy enactment, as well as to provide score information for appropriate comparison groups. Whereas many testing activities are intended to document the performance of individual test takers, program evaluation and policy studies target the performance of groups or the impact of the test results on these groups. A variety of tests can be used for evaluating programs and policies; examples include standardized achievement tests administered by states or districts, published psychological tests that measure outcomes of interest, and measures developed specifically for the purposes of the evaluation. In addition, evaluations of programs and policies sometimes synthesize results from multiple studies or tests.

It is important to evaluate any proposed test in terms of its relevance to the goals of the program or policy and/or to the particular questions its use will address. It is relatively rare for a test to be designed specifically for program evaluation or policy study purposes, and therefore it is often necessary for those who conduct such studies to rely on measures developed for other purposes. In addition, for reasons of cost or convenience, certain tests may be adopted for use in a program evaluation or policy study even though they were developed for a somewhat different population of respondents. Some tests may be selected because they are well known and thought to be especially credible in the view of clients or public consumers, or because useful data already exist from earlier administrations of the tests. Evidence for the validity of test scores for the intended uses should be provided whenever tests are used for program or policy evaluations or for accountability purposes.

Because of administrative realities, such as cost constraints and response burden, methodological refinements may be adopted to increase the efficiency of testing. One strategy is to obtain a sample of participants to be evaluated from the larger set of those exposed to a program or policy. When a sufficient number of clients are affected by the program or policy that will be evaluated, and when there is a desire to limit the time spent on testing, evaluators can create multiple forms of short tests from a larger pool of items. By constructing a number of test forms consisting of relatively few items each and assigning the test forms to different subsamples of test takers (a procedure known as matrix sampling), a larger number of items can be included in the study than could reasonably be administered to any single test taker. When it is desirable to represent a domain with a large number of test items, this approach is often used. However, in matrix sample testing, individual scores usually are not created or interpreted. Because procedures for sampling individuals or test items may vary in a number of ways, adequate analysis and interpretation of test results depend on a clear description of how samples were formed and how the tests were designed, scored, and reported. Reports of test results used for evaluation or accountability should describe the sampling strategy and the extent to which the sample is representative of the population that is relevant to the intended inferences.
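A minimal sketch of the matrix-sampling idea (the pool size, forms, and simulated responses are hypothetical; operational designs also address form balancing, weighting, and linking):

    # Sketch: matrix sampling. A large item pool is split into short forms;
    # each examinee takes one form, yet the group estimate draws on the
    # whole pool.
    import random

    random.seed(1)
    pool = list(range(60))                    # a 60-item pool
    forms = [pool[i::4] for i in range(4)]    # four 15-item forms

    results = []
    for idx in range(200):                    # 200 examinees
        form = forms[idx % len(forms)]        # rotate forms across the sample
        p_correct = random.uniform(0.3, 0.9)  # stand-in for proficiency
        score = sum(random.random() < p_correct for _ in form)
        results.append(score / len(form))     # proportion correct on that form

    # A group-level estimate for the full domain; the 15-item individual
    # scores would be too imprecise to report or interpret.
    print(sum(results) / len(results))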
Evaluations and policy studies sometimes rely on secondary data analysis: analysis of data previously collected for other purposes.
In some circumstances, it may be difficult to ensure a good match between the existing test and the intervention or policy under examination, or to reconstruct in detail the conditions under which the data were originally collected. Secondary data analysis also requires consideration of the privacy rights of test takers and others affected by the analysis. Sometimes this requires determining whether the informed consent obtained from participants in the original data collection was adequate to allow secondary analysis to proceed without a need for additional consent. It may also require an understanding of the extent to which individually identifiable information has been redacted from the data set consistent with applicable legal standards. In selecting (or developing) a test or deciding whether to use existing data in evaluation and policy studies, careful investigators attempt to balance the purpose of the test, the likelihood that it will be sensitive to the intervention under study, its credibility to interested parties, and the costs of administration. Otherwise, test results may lead to inappropriate conclusions about the progress, impact, and overall value of programs and policies under review.

Interpretation of test scores in program evaluation and policy studies usually entails complex analysis of a number of variables. For example, some programs are mandated for a broad population; others target only certain subgroups. Some are designed to affect attitudes, beliefs, or values; others are intended to have a more direct impact on behavior, knowledge, or skills. It is important that the participants included in any study meet the specified criteria for participating in the program or policy under review, so that appropriate interpretation of test results will be possible. Test results will reflect not only the effects of rules for participant selection and the impact on the participants of taking part in programs or treatments, but also the characteristics of the participants. Relevant background information about clients or students may be obtained to strengthen the inferences derived from the test results. Valid interpretations may depend on additional considerations that have nothing to do with the appropriateness of the test or its technical quality, including study design, administrative feasibility, and the quality of other available data. This chapter focuses on testing and does not deal with these other considerations in any substantial way. In order to develop defensible conclusions, however, investigators conducting program evaluations and policy studies should supplement test results with data from other sources. These data may include information about program characteristics, delivery, costs, client backgrounds, degree of participation, and evidence of side effects. Because test results lend important weight to evaluation and policy studies, it is critical that any tests used in these investigations be sensitive to the questions of the study and appropriate for the test takers.

Test-Based Accountability Systems

The inclusion of test scores in educational accountability systems has become common in the United States and in other nations. Most test-based educational accountability in the United States takes place at the K-12 level, but many of the issues raised in the K-12 context are relevant to efforts to adopt outcomes-based accountability in postsecondary education. In addition, accountability systems may incorporate information from longitudinal data systems linking students' performance on tests and other indicators, including systems that capture a cohort's performance from preschool through higher education and into the workforce. Test-based accountability sometimes occurs in sectors other than education; one example is the use of psychological tests to create measures of effectiveness for providers of mental health care. These uses of tests raise issues similar to those that arise in educational contexts.

Test-based accountability systems take a variety of approaches to measuring performance and holding individuals or groups accountable for that performance. These systems vary along a number of dimensions, including the unit of accountability (e.g., district, school, teacher), the stakes attached to results, the frequency of measurement, and whether nontest indicators are included in the accountability system.
One important measurement concern in accountability stems from the construction of an accountability index: a number or label that reflects a set of rules for combining scores and other information to arrive at conclusions and inform decision making. An accountability index could be as simple as an average test score for students in a particular grade in a particular school, but most systems rely on more complex indices. These may involve a set of rules (often called decision rules) for synthesizing multiple sources of information, such as test scores, graduation rates, course-taking rates, and teacher qualifications. An accountability index may also be created from applications of complex statistical models such as those used in value-added modeling approaches. As discussed in chapter 12, for high-stakes decisions, such as classification of schools or teachers into performance categories that are linked to rewards or sanctions, the establishment of rules used to create accountability indices should be informed by a consideration of the nature of the information the system is intended to provide and by an understanding of how consequences will be affected by these rules. The implications of the rules should be communicated to decision makers so that they understand the consequences of any policy decisions based on the accountability index.
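As a deliberately simplified sketch of the value-added idea mentioned above (hypothetical data; operational models typically use several prior years of scores, additional controls, and shrinkage estimation):

    # Sketch: a bare-bones value-added index. Current scores are regressed
    # on prior scores; a school's index is its students' average residual,
    # i.e., performance above or below statistical expectation.
    import numpy as np

    prior   = np.array([400, 420, 380, 450, 410, 430, 390, 440], dtype=float)
    current = np.array([455, 470, 420, 510, 450, 495, 430, 500], dtype=float)
    school  = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

    X = np.column_stack([np.ones(len(prior)), prior])
    beta, *_ = np.linalg.lstsq(X, current, rcond=None)  # expected scores
    residual = current - X @ beta

    for s in ("A", "B"):
        print(s, residual[school == s].mean())  # simple school-level index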
Test-based accountability systems include interpretations and assumptions that go beyond those for the interpretation of the test scores on which they are based; therefore, they require additional evidence to support their validity. Accountability systems in education typically aggregate scores over the students in a class or school, and may use complex mathematical models to generate a summary statistic, or index, for each teacher or school. These indices are often interpreted as estimates of the effectiveness of the teacher or school. Users of information from accountability systems might assume that the accountability indices provide valid indicators of the intended outcomes of education (e.g., mastery of the skills and knowledge described in the state content standards), that differences among indices can be attributed to differences in the effectiveness of the teacher or school, and that these differences are reasonably stable over time and across students and items. These assumptions must be supported by evidence. Moreover, those responsible for developing or implementing test-based accountability systems often assert that these systems will lead to specific outcomes, such as increased educator motivation or improved achievement; these assertions should also be supported by evidence. In particular, efforts should be made to investigate any potential positive or negative consequences of the selected accountability system.

Similarly, the choice of specific rules and data that are used to create an accountability index should reflect the goals and values of those who are developing the accountability system, as well as the inferences that the system is designed to support. For example, if a primary goal of an accountability system is to identify teachers who are effective at improving student achievement, the accountability index should be based on assessments that are closely aligned with the content the teacher is expected to cover, and should take into account factors outside the teacher's control. The process typically involves decisions such as whether to measure percentages above a cut score or an average of scale scores, whether to measure status or growth, how to combine information for multiple subjects and grade levels, and whether to measure performance against a fixed target or use a rank-based approach. The development of an accountability index also involves political considerations, such as how to balance technical concerns and transparency.

Issues in Program and Policy Evaluation and Accountability


Test results are sometimes used as one way to motivate program administrators or other service providers as well as to infer institutional effectiveness. This use of tests, including the public reporting of results, is thought to encourage an institution to improve its services for its clients. For example, in some test-based accountability systems, consistently poor results on achievement tests at the school level may result in interventions that affect the school's staffing or operations. The interpretation of test results is especially complex when tests are used both as an institutional policy mechanism and as a measure of effectiveness. For example, a policy or program may be based on the assumption that providing clear goals and general specifications of test content (such as the types of topics, constructs, cognitive domains, and response formats included in the test) may be a reasonable strategy to communicate new expectations to educators. Yet the desire to influence test or evaluation results to show acceptable institutional performance could lead to inappropriate testing practices, such as teaching the test items in advance, modifying test administration procedures, discouraging certain students or clients from participating in the testing sessions, or focusing teaching exclusively on test-taking skills. These responses illustrate that the more an indicator is used for decision making, the more likely it is to become corrupted and distort the process that it was intended to measure. Undesirable practices such as excessive emphasis on test-taking skills might replace practices aimed at helping the test takers learn the broader domains measured by the test. Because results derived from such practices may lead to spuriously high estimates of performance, the diligent investigator should estimate the impact of changes in teaching practices that may result from testing in order to interpret the test results appropriately. Looking at possible inappropriate consequences of tests as well as their benefits will result in more accurate assessment of policy claims that particular types of testing programs lead to improved performance.

Investigators conducting policy studies and program evaluations may give no clear reasons to the test takers for participating in the testing procedure, and they often withhold the results from the test takers. When matrix sampling is used for program evaluation, it may not be feasible to provide such reports. If little effort is made to motivate the test takers to regard the test seriously (e.g., if the purpose of the test is not explained), the test takers may have little reason to maximize their effort on the test. The test results thus may misrepresent the impact of a program, institution, or policy. When there is suspicion that a test has not been taken seriously, the motivation of test takers may be explored by collecting additional information where feasible, using observation or interview methods. Issues of inappropriate preparation and unmotivated performance raise questions about the validity of interpretations of test results. In every case, it is important to consider the potential impact on the test taker of the testing process itself, including test administration and reporting practices.

Public policy decisions are rarely based solely on the results of empirical studies, even when the studies are of high quality. The more expansive and indirect the policy, the more likely it is that other considerations will come into play, such as the political and economic impact of abandoning, changing, or retaining the policy, or the reactions of various stakeholders when institutions become the targets of rewards or sanctions. Tests used in policy settings may be subjected to intense and detailed scrutiny for political reasons. When the test results contradict a favored position, attempts may be made to discredit the testing procedure, content, or interpretation. Test users should be able to defend the use of the test and the interpretation of results but should also recognize that they cannot control the reactions of stakeholder groups.

It is essential that all tests used in accountability, program evaluation, or policy contexts meet the standards for validity, reliability, and fairness appropriate to the intended test score interpretations and use. Moreover, as described in chapter 6, tests should be administered by personnel who are appropriately trained to implement the test administration procedures. It is also essential that assistance be provided to those responsible for interpreting study results for practitioners, the lay public, and the media. Careful communication about goals, procedures, findings, and limitations increases the likelihood that the interpretations of the results will be accurate and useful.

Additional Considerations


This chapter and its associated standards are directed to users of tests in program evaluations, policy studies, and accountability systems. Users include those who mandate, design, or implement these evaluations, studies, or systems and those who make decisions based on the information they provide. Users include, among others, psychologists who develop, evaluate, or enforce policies, as well as educators, administrators, and policymakers who are engaged in efforts to measure school performance or evaluate the effectiveness of education policies or programs. In addition to the standards below, users should consider other available documents containing relevant standards.


STANDARDS FOR USES OF TESTS FOR PROGRAM EVALUATION, POLICY STUDIES, AND ACCOUNTABILITY

The standards in this chapter have been separated into two thematic clusters labeled as follows:

1. Design and Development of Testing Programs and Indices for Program Evaluation, Policy Studies, and Accountability Systems
2. Interpretations and Uses of Information From Tests Used in Program Evaluation, Policy Studies, and Accountability Systems

Users of educational tests for evaluation, policy, or accountability should also refer to the standards in chapter 12 ("Educational Testing and Assessment") and to the other standards in this volume.

Cluster 1. Design and Development of Testing Programs and Indices for Program Evaluation, Policy Studies, and Accountability Systems

Standard 13.1

Users of tests who conduct program evaluations or policy studies, or monitor outcomes, should clearly describe the population that the program or policy is intended to serve and should document the extent to which the sample of test takers is representative of that population. In addition, when matrix sampling procedures are used, rules for sampling items and test takers should be provided, and error calculations must take the sampling scheme into account. When multiple studies are combined as part of a program evaluation or policy study, information about the samples included in each individual study should be provided.

Comment: It is important to provide information about sampling weights that may need to be applied for accurate inferences about performance. When matrix sampling is used, documentation should address the limitations that stem from this sampling approach, such as the difficulty in creating individual-level scores. Test developers should also report appropriate sampling error variance estimates if simple random sampling was not used.
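The comment's point about weights and sampling error can be illustrated with a toy calculation (hypothetical scores and weights; an operational analysis would use the full survey design, including strata, clusters, and replicate weights):

    # Sketch: a weighted mean and a rough design-based standard error
    # when selection probabilities are unequal.
    import math

    scores  = [250.0, 262.0, 241.0, 255.0, 248.0, 270.0]
    weights = [1.0, 2.5, 1.0, 3.0, 1.5, 2.0]  # e.g., inverse selection probabilities

    wsum  = sum(weights)
    wmean = sum(w * x for w, x in zip(weights, scores)) / wsum

    # Linearization variance estimate (with-replacement approximation):
    n = len(scores)
    var = n / (n - 1) * sum((w * (x - wmean) / wsum) ** 2
                            for w, x in zip(weights, scores))
    print(wmean, math.sqrt(var))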
Standard 13.2

When change or gain scores are used, the procedures for constructing the scores, as well as their technical qualities and limitations, should be reported. In addition, the time periods between test administrations should be reported, and care should be taken to avoid practice effects.

Comment: The use of change or gain scores presumes that the same test, equivalent forms of the test, or forms of a vertically scaled test are used and that the test (or form or vertical scale) is not materially altered between administrations. The standard error of the difference between scores on pretests and posttests, the error associated with regression of posttest scores on pretest scores, or relevant data from other methods for examining change, such as those based on structural equation modeling, should be reported. In addition to technical or methodological considerations, details related to test administration may also be relevant to interpreting change or gain scores. For example, it is important to consider that the error associated with change scores is higher than the error associated with the original scores on which they are based. If change scores are used, information about the reliability/precision of these scores should be reported. It is also important to report the time period between administrations of tests; and if the same test is used on multiple occasions, the possibility of practice effects (i.e., improved performance due to familiarity with the test items) should be examined.

Standard 13.3


When accountability indices, indicators of effectiveness in program evaluations or policy studies, or other statistical models (such as value-added models) are used, the method for constructing such indices, indicators, or models should be described and justified, and their technical qualities should be reported.

Comment: An index that is constructed by manipulating and combining test scores should be subjected to the same validity, reliability, and fairness investigations that are expected for the test scores that underlie the index. The methods and rules for constructing such indices should be made available to users, along with documentation of their technical properties. The strengths and limitations of various approaches to combining scores should be evaluated, and information that would allow independent replication of the construction of indices, indicators, or models should be made available for use by appropriate parties.

As with regular test scores, a validity argument should be set forth to justify inferences about indices as measures of a desired outcome. It is important to help users understand the extent to which the models support causal inferences. For example, when value-added estimates are used as measures of teachers' effectiveness in improving student achievement, evidence for the appropriateness of this inference needs to be provided. Similarly, if published ratings of health care providers are based on indices constructed from psychological test scores of their patients, the public information should include information to help users understand what inferences about provider performance are warranted. Developers and users of indices should be aware of ways in which the process of combining individual scores into an index may introduce technical problems that did not affect the original scores. Linking errors, floor or ceiling effects, differences in variability across different measures, and lack of an interval scale are examples of features that may not be problematic for the purpose of interpreting individual test scores but can become problematic when scores are combined into an aggregate measure. Finally, when evaluations or accountability systems rely on measures that combine various sources of information, such as when scores on multiple forms of a test are combined or when nontest information is included in an accountability index, the rules for combining the information need to be made explicit and must be justified. It is important to recognize that when multiple sources of data are collapsed into a single composite score or rating, the weights and distributional characteristics of the sources will affect the distribution of the composite scores. The effects of the weighting and distributional characteristics on the composite score should be investigated.

When indices combine scores from tests administered under standard conditions with those that involve modifications or other changes to administration conditions, there should be a clear rationale for combining the information into a single index, and the implications for validity and reliability should be examined.
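The effect of weighting and dispersion noted in the comment can be seen in a small sketch (hypothetical school-level data): a component with a larger spread dominates a composite regardless of the nominal weights, unless the components are standardized first.

    # Sketch: nominal versus effective weights in a composite index.
    import statistics as st

    test_score = [238, 255, 242, 260, 249]       # mean scale scores (SD ~ 9)
    grad_rate  = [0.81, 0.86, 0.79, 0.92, 0.84]  # proportions (SD ~ 0.05)

    w_test, w_grad = 0.5, 0.5                    # intended (nominal) weights
    composite = [w_test * t + w_grad * g for t, g in zip(test_score, grad_rate)]

    # Contribution of each component to composite variability:
    print(st.stdev([w_test * t for t in test_score]))  # ~4.5: dominates
    print(st.stdev([w_grad * g for g in grad_rate]))   # ~0.025: negligible

    # Standardizing first makes the nominal weights effective:
    def z(xs):
        m, s = st.mean(xs), st.stdev(xs)
        return [(x - m) / s for x in xs]

    composite_z = [w_test * a + w_grad * b
                   for a, b in zip(z(test_score), z(grad_rate))]
    print(composite_z)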
and Accountabil ity Systems
example, when value-added estimates are used as
measures of teachers' effectiveness in improving
student achievement, evidence for the appropri Standard 1 3.4
ateness of this inference needs to be provided.
Evidence of validity, reliability, and fairness for
Similarly, if published ratings of health care
each purpose for which a test is used in a pr ogram
providers are based on indices constructed from
evaluation, policy study, or accountability system
psychological test scores of their patients, the
should be collected and made available.
public information should include information
to help users understand what inferences about Comment: Evidence should be provided of the
provider performance are warranted. Developers suitability of a test for use in program evaluation,
and users of indices should be aware of ways in policy studies, or accountability systems, including
which the process of combining individual scores the relevance of the test to the goals of the
into an index may introduce technical problems program, policy, or system under study and the
that did not affect the original scores. Linking suitability of the test for the populations involved.
errors, floor or ceiling effects, differences in vari Those responsible for the release or reporting of
ability across different measures, and lack of an test results should provide and explain any sup
interval scale are examples of features that may plemental information that will minimize possible
not be problematic for the purpose of interpreting misinterpretations or misuse of the data. In par
individual test scores but can become problematic ticular, if an evaluation or accountability system
when scores are combined into an aggregate meas is designed to support interpretations regarding
ure. Finally, when evaluations or accountability the effectiveness of a program, institution, or
systems rely on measures that combine various provider, the validity of these interpretations for
sources of information, such as when scores on the intended uses should be investigated and doc
multiple forms of a test are combined or when umented.Reports should include cautions against

210
USES OF TESTS FOR PROGRAM EVALUATION, POLICY STUDIES, AND ACCOUNTABILITY

making unwarranted inferences, such as holding mentation of any exclusion rules, testing modifi
health care providers accountable for test-score cations, or other changes to the test or adminis
changes that may not be under their control. If tration conditions; and provide evidence regarding
the use involves a classification of persons, insti the validity of score interpretations for subgroups.
tutions, or programs into distinct categories, the When summaries of test scores are reported sepa
consistency, accuracy, and fairness of the classifi rately by subgroup (e.g., by racial/ethnic group),
cations should be reported. If the same test is test users should conduct analyses to evaluate the
used fo r multiple purposes (e.g., monitoring reliability/precision of scores for these groups and
achievement of individual students; providing in the validity of score interpretations, and should
formation to assist in instructional planning for report this information when publishing the score
individuals or groups of students; evaluating summaries. Analyses of complex indices used for
districts, schools, or teachers), evidence related to accountability or for measuring program effec
the validity of interpretations for each of these tiveness should address the possibility of bias
uses should be gathered and provided to users, against specific subgroups or against programs or
and the potential negative effects for certain uses institutions serving those subgroups.If bias is de
(e.g., improving instruction) that might result tected (e.g., if scores on the index are shown to be
from unintended uses (e.g., high-stakes account subject to systematic error that is related to
abili ty) need to be considered and mitigated. examinee characteristics such as race/ethnicity),
When tests are used to evaluate the performance these indices should not be used unless they are
of personnel, the suitability of the tests for different modified in a way that removes the bias.Additional
groups of personnel (e.g., regular teachers, special considerations related to fairness and accessibility
education teachers, principals) should be examined. in educational tests and assessments are provided
in chapter 3.
When test results are used to support actions
Standard 1 3.5 regarding program or policy adoption or change,
the professionals who are expected to make inter
Those responsible for the development and use
pretations leading to these actions may need as
of tests for evaluation or accountability purposes
sistance in interpreting test results for this purpose.
should take steps to promote accurate interpre
Advances in technology have led to increased
tations and appropriate uses for all groups for
availability of data and reports among teachers,
which results will be applied.
administrators, and others who may not have re
Comment: T hose responsible for measuring out ceived training in appropriate test use and inter
comes should, to the extent possible, design the pretation or in analysis of test-score data. T hose
testing process to promote access and to maximize who provide the data or tools have the responsibility
the validity of interpretations (e.g., by providing to offer support and assistance to users, and users
appropriate accommodations) for any relevant have the responsibility to seek guidance on ap
subgroups of test takers who participate in program propriate analysis and interpretation. T hose re
or policy evalua.tion. Users of secondary data sponsible for the release or reporting of test results
should clearly describe the extent to which the should provide and explain any supplemental in
population included in the test-score database in formation that will minimize possible misinter
cludes all relevant subgroups.T he users should pretations of the data.
also document any exclusion rules that were Often, the test results for program evaluation
applied and any other changes to the testing or policy analysis are analyzed well after the tests
process that could affect interpretations of results. have been given. When this is the case, the user
Similarly, users of tests for accountability purposes should investigate and describe the context in
should make every effort to include all relevant which the tests were given.Factors such as inclu
subgroups in the testing program; provide docu- sion/exclusion rules, test purpose, content sampling,

21 1
CHAPTER 1 3

Factors such as inclusion/exclusion rules, test purpose, content sampling, instructional alignment, and the attachment of high stakes can affect the aggregated results and should be made known to the audiences for the evaluation or analysis.

Standard 13.6

Reports of group differences in test performance should be accompanied by relevant contextual information, where possible, to enable meaningful interpretation of the differences. If appropriate contextual information is not available, users should be cautioned against misinterpretation.

Comment: Observed differences in average test scores between groups (e.g., classified by gender, race/ethnicity, disability, language proficiency, socioeconomic status, or geographical region) can be influenced by differences in factors such as opportunity to learn, training experience, effort, instructor quality, and level and type of parental support. In education, differences in group performance across time may be influenced by changes in the population of those tested (including changes in sample size) or changes in their experiences. Users should be advised to consider the appropriate contextual information when interpreting these group differences and when designing policies or practices to address those differences. In addition, if evaluations involve comparisons of test scores across national borders, evidence for the comparability of scores should be provided.

Standard 13.7

When tests are selected for use in evaluation or accountability settings, the ways in which the test results are intended to be used, and the consequences they are expected to promote, should be clearly described, along with cautions against inappropriate uses.

Comment: In some contexts, such as evaluation of a specific curriculum program, a test may have a limited purpose and may not be intended to promote specific outcomes other than informing the evaluation. In other settings, particularly with test-based accountability systems, the use of tests is often justified on the grounds that it will improve the quality of education by providing useful information to decision makers and by creating incentives to promote better performance by educators and students. These kinds of claims should be made explicit when the system is mandated or adopted, and evidence to support their validity should be provided when available. The collection and reporting of evidence for a particular validity claim should be incorporated into the program design. A given claim for the benefits of test use, such as improving students' achievement, may be supported by logical or theoretical argument as well as empirical data. Due weight should be given to findings in the scientific literature that may be inconsistent with the stated claim.

Standard 13.8

Those who mandate the use of tests in policy, evaluation, and accountability contexts and those who use tests in such contexts should monitor their impact and should identify and minimize negative consequences.

Comment: The use of tests in policy, evaluation, and accountability settings may, in some cases, lead to unanticipated consequences. Particularly when high stakes are attached, those who mandate tests, as well as those who use the results, should take steps to identify potential unanticipated consequences. Unintended negative consequences may include teaching test items in advance, modifying test administration procedures, and discouraging or excluding certain test takers from taking the test. These practices can lead to spuriously high scores that do not reflect performance on the underlying construct or domain of interest. In addition, these practices may be prohibited by law. Testing procedures should be designed to minimize the likelihood of such consequences, and users should be given guidance and encouragement to refrain from inappropriate test-preparation practices.

21 2
USES OF TESTS FOR PROGRAM EVALUATION, POLICY STUDIES, AND ACCOUNTABILITY

shows that educational accountability tests influence curriculum and instruction by signaling what is important for students to know and be able to do. This influence can be positive if a test encourages a focus on valuable learning outcomes, but it is negative if it narrows the curriculum in unintended ways. These and other common negative consequences, such as possible motivational impact on teachers and students (even when test results are used as intended) and increasing dropout rates, should be studied and the results taken into consideration. The integrity of test results should be maintained by striving to eliminate practices designed to raise test scores without improving performance on the construct or domain measured by the test. In addition, administering an audit measure (i.e., another measure of the tested construct) may detect possible corruption of scores.

Standard 13.9

In evaluation or accountability settings, test results should be used in conjunction with information from other sources when the use of the additional information contributes to the validity of the overall interpretation.

Comment: Performance on indicators other than tests is almost always useful and in many cases essential. Descriptions or analyses of such variables as client selection criteria, services, client characteristics, setting, and resources are often needed to provide a comprehensive picture of the program or policy under review and to aid in the interpretation of test results. In the accountability context, a decision that will have a major impact on an individual such as a teacher or health care provider, or on an organization such as a school or treatment facility, should take into consideration other relevant information in addition to test scores. Examples of other information that may be incorporated into evaluations or accountability systems are measures of educators' or health care providers' practices (e.g., classroom observations, checklists) and nontest measures of student attainment (course taking, college attendance).

In the case of value-added modeling, some researchers have argued for the inclusion of student demographic characteristics (e.g., race/ethnicity and socioeconomic status) as controls, whereas other work suggests that including such variables does not improve the performance of the measures and can promote undesirable consequences, such as a perception that lower standards are being set for some students than for others. Decisions regarding what variables to include in such models should be informed by empirical evidence regarding the effects of their inclusion or exclusion. (An illustrative model of this kind is sketched following this comment.)

An additional type of information that is relevant to the interpretation of test results in policy settings is the degree of motivation of the test takers. It is important to determine whether test takers regard the test experience seriously, particularly when individual scores are not reported to test takers or when the scores are not associated with consequences for the test takers. Decision criteria regarding whether to include scores from individuals with questionable motivation should be clearly documented.
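The display below is an illustrative sketch only, not part of the Standards' text: one common form that a value-added model of the kind debated above can take. All symbols are editorial assumptions introduced for exposition.

\[
y_{ijt} = \beta_0 + \beta_1\, y_{i,t-1} + \boldsymbol{\beta}_2^{\top}\mathbf{x}_i + \theta_j + \varepsilon_{ijt}
\]

Here \(y_{ijt}\) is the current-year test score of student \(i\) linked to teacher or school \(j\), \(y_{i,t-1}\) is the prior-year score, \(\mathbf{x}_i\) is the optional vector of student demographic controls whose inclusion or exclusion is the contested modeling choice, \(\theta_j\) is the estimated value-added effect, and \(\varepsilon_{ijt}\) is residual error. The empirical question noted in the comment is whether adding \(\mathbf{x}_i\) changes the \(\theta_j\) estimates enough to justify the consequences of including it.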

GLOSSARY

This glossary provides definitions of terms as used in the text and standards. For many of the terms, multiple definitions can be found in the literature; also, technical usage may differ from common usage.

ability parameter: In item response theory (IRT), a theoretical value indicating the level of a test taker on the ability or trait measured by the test; analogous to the concept of true score in classical test theory.

ability testing: The use of tests to evaluate the current performance of a person in some defined domain of cognitive, psychomotor, or physical functioning.

accessibility: The degree to which the items or tasks on a test enable as many test takers as possible to demonstrate their standing on the target construct without being impeded by characteristics of the item that are irrelevant to the construct being measured. A test that ranks high on this criterion is referred to as accessible.

accommodations/test accommodations: Adjustments that do not alter the assessed construct that are applied to test presentation, environment, content, format (including response format), or administration conditions for particular test takers, and that are embedded within assessments or applied after the assessment is designed. Tests or assessments with such accommodations, and their scores, are said to be accommodated. Accommodated scores should be sufficiently comparable to unaccommodated scores that they can be aggregated together.

accountability index: A number or label that reflects a set of rules for combining scores and other information to form conclusions and inform decision making in an accountability system.

accountability system: A system that imposes student performance-based rewards or sanctions on institutions such as schools or school systems or on individuals such as teachers or mental health care providers.

acculturation: A process related to the acquisition of cultural knowledge and artifacts that is developmental in nature and dependent upon time of exposure and opportunity for learning.

achievement levels/proficiency levels: Descriptions of test takers' levels of competency in a particular area of knowledge or skill, usually defined in terms of categories ordered on a continuum, for example from "basic" to "advanced," or "novice" to "expert." The categories constitute broad ranges for classifying performance. See cut score.

achievement standards: See performance standards.

achievement test: A test to measure the extent of knowledge or skill attained by a test taker in a content domain in which the test taker has received instruction.

adaptation/test adaptation: 1. Any change in test content, format (including response format), or administration conditions that is made to increase a test's accessibility for individuals who otherwise would face construct-irrelevant barriers on the original test. An adaptation may or may not change the meaning of the construct being measured or alter score interpretations. An adaptation that changes score meaning is referred to as a modification; an adaptation that does not change the score meaning is referred to as an accommodation (see definitions in this glossary). 2. Change made to a test that has been translated into the language of a target group and that takes into account the nuances of the language and culture of that group.

adaptive test: A sequential form of individual testing in which successive items, or sets of items, in the test are selected for administration based primarily on their psychometric properties and content, in relation to the test taker's responses to previous items.

adjusted validity or reliability coefficient: A validity or reliability coefficient (most often, a product-moment correlation) that has been adjusted to offset the effects of differences in score variability, criterion variability, or the unreliability of test and/or criterion scores. See restriction of range or variability.

aggregate score: A total score formed by combining scores on the same test or across test components. The scores may be raw or standardized. The components of the aggregate score may be weighted or not, depending on the interpretation to be given to the aggregate score.

alignment: The degree to which the content and cognitive demands of test questions match targeted content and cognitive demands described in the test specifications.

alternate assessments/alternate tests: Assessments or tests used to evaluate the performance of students in educational settings who are unable to participate in standardized accountability assessments, even with accommodations. Alternate assessments or tests typically measure achievement relative to alternate content standards.

alternate forms: Two or more versions of a test that are considered interchangeable, in that they measure the same constructs in the same ways, are built to the same content and statistical specifications, and are administered under the same conditions using the same directions. See equivalent forms, parallel forms.

alternate or alternative standards: Content and performance standards in educational assessment for students with significant cognitive disabilities.

analytic scoring: A method of scoring constructed responses (such as essays) in which each critical dimension of a particular performance is judged and scored separately, and the resultant values are combined for an overall score. In some instances, scores on the separate dimensions may also be used in interpreting performance. Contrast with holistic scoring.

anchor items: Items administered with each of two or more alternate forms of a test for the purpose of equating the scores obtained on these alternate forms.

anchor test: A set of anchor items used for equating.

assessment: Any systematic method of obtaining information, used to draw inferences about characteristics of people, objects, or programs; a systematic process to measure or evaluate the characteristics or performance of individuals, programs, or other entities, for purposes of drawing inferences; sometimes used synonymously with test.

assessment literacy: Knowledge about testing that supports valid interpretations of test scores for their intended purposes, such as knowledge about test development practices, test score interpretations, threats to valid score interpretations, score reliability and precision, test administration, and use.

automated scoring: A procedure by which constructed-response items are scored by computer using a rules-based approach.

battery: A set of tests usually administered as a unit. The scores on the tests usually are scaled so that they can readily be compared or used in combination for decision making.

behavioral science: A scientific discipline, such as sociology, anthropology, or psychology, in which the actions and reactions of humans and animals are studied through observational and experimental methods.

benchmark assessments: Assessments administered in educational settings at specified times during a curriculum sequence, to evaluate students' knowledge and skills relative to an explicit set of longer-term learning goals. See interim assessments or tests.

bias: 1. In test fairness, construct underrepresentation or construct-irrelevant components of test scores that differentially affect the performance of different groups of test takers and consequently the reliability/precision and validity of interpretations and uses of their test scores. 2. In statistics or measurement, systematic error in a test score. See construct underrepresentation, construct-irrelevant variance, fairness, predictive bias.

bilingual/multilingual: Having a degree of proficiency in two or more languages.

calibration: 1. In linking test scores, the process of relating scores on one test to scores on another that differ in reliability/precision from those on the first test, so that scores have the same relative meaning for a group of test takers. 2. In item response theory, the process of estimating the parameters of the item response function. 3. In scoring constructed-response tasks, procedures used during training and scoring to achieve a desired level of scorer agreement.

certification: A process by which individuals are recognized (or certified) as having demonstrated some level of knowledge and skill in some domain. See licensing, credentialing.

classical test theory: A psychometric theory based on the view that an individual's observed score on a test is the sum of a true score component for the test taker and an independent random error component.
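In conventional psychometric notation (an illustrative editorial gloss; the symbols are standard but are not the Standards' own), the classical test theory decomposition is

\[
X = T + E, \qquad \sigma_X^{2} = \sigma_T^{2} + \sigma_E^{2}, \qquad \rho_{XX'} = \frac{\sigma_T^{2}}{\sigma_X^{2}},
\]

where \(X\) is the observed score, \(T\) the true score, \(E\) the independent random error, and \(\rho_{XX'}\) the reliability, i.e., the proportion of observed-score variance attributable to true scores.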
classification accuracy: Degree to which the assignment of test takers to specific categories is accurate; the degree to which false positive and false negative classifications are avoided. See sensitivity, specificity.

coaching: Planned short-term instructional activities for prospective test takers provided prior to the test administration for the primary purpose of improving their test scores. Activities that approximate the instruction provided by regular school curricula or training programs are not typically referred to as coaching.
coefficient alpha: An internal-consistency reliability coefficient based on the number of parts into which a test is partitioned (e.g., items, subtests, or raters), the interrelationships of the parts, and the total test score variance. Also called Cronbach's alpha and, for dichotomous items, KR-20. See internal-consistency coefficient, reliability coefficient.
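For reference, a conventional statement of this coefficient (an editorial addition; the formula is standard psychometrics, with symbols assumed here rather than quoted from the Standards): for a test partitioned into \(k\) parts with part-score variances \(\sigma_i^{2}\) and total-score variance \(\sigma_X^{2}\),

\[
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_X^{2}}\right).
\]

With dichotomous items, substituting the item variances \(p_i(1-p_i)\) yields KR-20, as the entry notes.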
cognitive assessment: The process of systematically collecting test scores and related data to make judgments about an individual's ability to perform various mental activities involved in the processing, acquisition, retention, conceptualization, and organization of sensory, perceptual, verbal, spatial, and psychomotor information.

cognitive lab: A method of studying the cognitive processes that test takers use when completing a task such as solving a mathematics problem or interpreting a passage of text, typically involving test takers' thinking aloud while responding to the task and/or responding to interview questions after completing the task.

cognitive science: The interdisciplinary study of learning and information processing.

comparability/score comparability: In test linking, the degree of score comparability resulting from the application of a linking procedure. Score comparability varies along a continuum that depends on the type of linking conducted. See alternate forms, equating, calibration, linking, moderation, projection, vertical scaling.

composite score: A score that combines several scores according to a specified formula.

computer-administered test: A test administered by a computer; test takers respond by using a keyboard, mouse, or other response devices.

computer-based mastery test: A test administered by computer that indicates whether the test taker has achieved a specified level of competence in a certain domain, rather than the test taker's degree of achievement in that domain. See mastery test.

computer-based test: See computer-administered test.

computer-prepared interpretive report: A programmed interpretation of a test taker's test results, based on empirical data and/or expert judgment using various formats such as narratives, tables, and graphs. Sometimes referred to as automated scoring or narrative report.

computerized adaptive test: An adaptive test administered by computer. See adaptive test.

concordance: In linking test scores for tests that measure similar constructs, the process of relating a score on one test to a score on another, so that the scores have the same relative meaning for a group of test takers.

conditional standard error of measurement: The standard deviation of measurement errors that affect the scores of test takers at a specified test score level.

confidence interval: An interval within which the parameter of interest will be included with a specified probability.

consequences: The outcomes, intended and unintended, of using tests in particular ways in certain contexts and with certain populations.

construct: The concept or characteristic that a test is designed to measure.

construct domain: The set of interrelated attributes (e.g., behaviors, attitudes, values) that are included under a construct's label.

construct equivalence: 1. The extent to which a construct measured by one test is essentially the same as the construct measured by another test. 2. The degree to which a construct measured by a test in one cultural or linguistic group is comparable to the construct measured by the same test in a different cultural or linguistic group.

construct-irrelevant variance: Variance in test-taker scores that is attributable to extraneous factors that distort the meaning of the scores and thereby decrease the validity of the proposed interpretation.

construct underrepresentation: The extent to which a test fails to capture important aspects of the construct domain that the test is intended to measure, resulting in test scores that do not fully represent that construct.

constructed-response items, tasks, or exercises: Items, tasks, or exercises for which test takers must create their own responses or products rather than choose a response from a specified set. Short-answer items require a few words or a number as an answer; extended-response items require at least a few sentences and may include diagrams, mathematical proofs, essays, or problem solutions such as network repairs or other work products.

content domain: The set of behaviors, knowledge, skills, abilities, attitudes, or other characteristics to be measured by a test, represented in detailed test specifications and often organized into categories by which items are classified.

content-related validity evidence: Evidence based on test content that supports the intended interpretation of test scores for a given purpose. Such evidence may address issues such as the fidelity of test content to performance in the domain in question and the degree to which test content representatively samples a domain, such as a course curriculum or job.

content standard: In educational assessment, a statement of content and skills that students are expected to learn in a subject matter area, often at a particular grade or at the completion of a particular level of schooling.

convergent evidence: Evidence based on the relationship between test scores and other measures of the same or related construct.

credentialing: Granting to a person, by some authority, a credential, such as a certificate, license, or diploma, that signifies an acceptable level of performance in some domain of knowledge or activity.

criterion domain: The construct domain of a variable that is used as a criterion. See construct domain.

criterion-referenced score interpretation: The meaning of a test score for an individual or of an average score for a defined group, indicating the individual's or group's level of performance in relationship to some defined criterion domain. Examples of criterion-referenced interpretations include comparisons to cut scores, interpretations based on expectancy tables, and domain-referenced score interpretations. Contrast with norm-referenced score interpretation.

cross-validation: A procedure in which a scoring system for predicting performance, derived from one sample, is applied to a second sample to investigate the stability of prediction of the scoring system.

cut score: A specified point on a score scale, such that scores at or above that point are reported, interpreted, or acted upon differently from scores below that point.

differential item functioning (DIF): For a particular item in a test, a statistical indicator of the extent to which different groups of test takers who are at the same ability level have different frequencies of correct responses or, in some cases, different rates of choosing various item options.

differential test functioning (DTF): Differential performance at the test or dimension level indicating that individuals from different groups who have the same standing on the characteristic assessed by a test do not have the same expected test score.

discriminant evidence: Evidence indicating whether two tests interpreted as measures of different constructs are sufficiently independent (uncorrelated) that they do, in fact, measure two distinct constructs.

documentation: The body of literature (e.g., test manuals, manual supplements, research reports, publications, user's guides) developed by a test's author, developer, user, and/or publisher to support test score interpretations for their intended use.

domain or content sampling: The process of selecting test items, in a systematic way, to represent the total set of items measuring a domain.

effort: The extent to which a test taker appropriately participates in test taking.

empirical evidence: Evidence based on some form of data, as opposed to that based on logic or theory.

English language learner (ELL): An individual who is not yet proficient in English. An ELL may be an individual whose first language is not English, a language minority individual just beginning to learn English, or an individual who has developed considerable proficiency in English. Related terms include English learner (EL), limited English proficient (LEP), English as a second language (ESL), and culturally and linguistically diverse.

equated forms: Alternate forms of a test whose scores have been related through a statistical process known as equating, which allows scale scores on equated forms to be used interchangeably.

equating: A process for relating scores on alternate forms of a test so that they have essentially the same meaning. The equated scores are typically reported on a common score scale.

equivalent forms: See alternate forms, parallel forms.

error of measurement: The difference between an observed score and the corresponding true score. See standard error of measurement, systematic error, random error, true score.

factor: Any variable, real or hypothetical, that is an aspect of a concept or construct.

factor analysis: Any of several statistical methods of describing the interrelationships of a set of variables by statistically deriving new variables, called factors, that are fewer in number than the original set of variables.

fairness: The validity of test score interpretations for intended use(s) for individuals from all relevant subgroups. A test that is fair minimizes the construct-irrelevant variance associated with individual characteristics and testing contexts that otherwise would compromise the validity of scores for some individuals.

fake bad: Exaggerate or falsify responses to test items in an effort to appear impaired.

fake good: Exaggerate or falsify responses to test items in an effort to present oneself in an overly positive way.

false negative: An error of classification, diagnosis, or selection leading to a determination that an individual does not meet the standard based on an assessment for inclusion in a particular group, when, in truth, he or she does meet the standard (or would, absent measurement error). See sensitivity, specificity.

false positive: An error of classification, diagnosis, or selection leading to a determination that an individual meets the standard based on an assessment for inclusion in a particular group, when, in truth, he or she does not meet the standard (or would not, absent measurement error). See sensitivity, specificity.

field test: A test administration used to check the adequacy of testing procedures and the statistical characteristics of new test items or new test forms. A field test is generally more extensive than a pilot test. See pilot test.

flag: An indicator attached to a test score, a test item, or other entity to indicate a special status. A flagged test score generally signifies a score obtained from a modified test resulting in a change in the underlying construct measured by the test. Flagged scores may not be comparable to scores that are not flagged.

formative assessment: An assessment process used by teachers and students during instruction that provides feedback to adjust ongoing teaching and learning with the goal of improving students' achievement of intended instructional outcomes.

gain score: In testing, the difference between two scores obtained by a test taker on the same test or two equated tests taken on different occasions, often before and after some treatment.

generalizability coefficient: An index of reliability/precision based on generalizability theory (G theory). A generalizability coefficient is the ratio of universe score variance to observed score variance, where the observed score variance is equal to the universe score variance plus the total error variance. See generalizability theory.
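In symbols (an illustrative restatement of the entry above, using generic G-theory notation that is assumed rather than quoted from the Standards),

\[
E\rho^{2} = \frac{\sigma_{\tau}^{2}}{\sigma_{\tau}^{2} + \sigma_{\delta}^{2}},
\]

where \(\sigma_{\tau}^{2}\) is the universe-score variance and \(\sigma_{\delta}^{2}\) the total error variance, so the denominator is the observed-score variance referred to in the definition.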
generalizability theory: Methodological framework for evaluating reliability/precision in which various sources of error variance are estimated through the application of the statistical techniques of analysis of variance. The analysis indicates the generalizability of scores beyond the specific sample of items, persons, and observational conditions that were studied. Also called G theory.

group testing: Testing for groups of test takers, usually in a group setting, typically with standardized administration procedures and supervised by a proctor or test administrator.

growth models: Statistical models that measure students' progress on achievement tests by comparing the test scores of the same students over time. See value-added modeling.

high-stakes test: A test used to provide results that have important, direct consequences for individuals, programs, or institutions involved in the testing. Contrast with low-stakes test.

holistic scoring: A method of obtaining a score on a test, or a test item, based on a judgment of overall performance using specified criteria. Contrast with analytic scoring.

individualized education program (IEP): A documented plan that delineates special education services for a special-needs student and that includes any adaptations that are required in the regular classroom or for assessments and any additional special programs or services.
informed consent: The agreement of a person, or that person's legal representative, for some procedure to be performed on or by the individual, such as taking a test or completing a questionnaire.

intelligence test: A test designed to measure an individual's level of cognitive functioning in accord with some recognized theory of intelligence. See cognitive assessment.

interim assessments or tests: Assessments administered during instruction to evaluate students' knowledge and skills relative to a specific set of academic goals to inform policy-maker or educator decisions at the classroom, school, or district level. See benchmark assessments.

internal-consistency coefficient: An index of the reliability of test scores derived from the statistical interrelationships among item responses or scores on separate parts of a test. See coefficient alpha, split-halves reliability coefficient.

internal structure: In test analysis, the factorial structure of item responses or subscales of a test.

interpreter: Someone who facilitates cross-cultural communication by converting concepts from one language to another (including sign language).

interrater agreement/consistency: The level of consistency with which two or more judges rate the work or performance of test takers. See interrater reliability.

interrater reliability: The level of consistency in rank ordering of ratings across raters. See interrater agreement.

intrarater reliability: The level of consistency among repetitions of a single rater in scoring test takers' responses. Inconsistencies in the scoring process resulting from influences that are internal to the rater rather than true differences in test takers' performances result in low intrarater reliability.

inventory: A questionnaire or checklist that elicits information about an individual's personal opinions, interests, attitudes, preferences, personality characteristics, motivations, or typical reactions to situations and problems.

item: A statement, question, exercise, or task on a test for which the test taker is to select or construct a response, or perform a task. See prompt.

item characteristic curve (ICC): A mathematical function relating the probability of a certain item response, usually a correct response, to the level of the attribute measured by the item. Also called item response curve, item response function.

item context effect: Influence of item position, other items administered, time limits, administration conditions, and so forth, on item difficulty and other statistical item characteristics.

item pool/item bank: The collection or set of items from which a test or test scale's items are selected during test development, or the total set of items from which a particular subset is selected for a test taker during adaptive testing.

item response theory (IRT): A mathematical model of the functional relationship between performance on a test item, the test item's characteristics, and the test taker's standing on the construct being measured.
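As one concrete example of such a model (an editorial illustration; the glossary entries above do not single out a particular functional form), the three-parameter logistic item response function is

\[
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}},
\]

where \(\theta\) is the test taker's ability parameter and \(a_i\), \(b_i\), and \(c_i\) are the item's discrimination, difficulty, and lower-asymptote (pseudo-guessing) parameters. Setting \(c_i = 0\) gives the two-parameter logistic model; additionally holding \(a_i\) constant across items gives the one-parameter (Rasch-type) model.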
job analysis: The investigation of positions or job classes to obtain information about job duties and tasks, responsibilities, necessary worker characteristics (e.g., knowledge, skills, and abilities), working conditions, and/or other aspects of the work. See practice analysis.

job/job classification: A group of positions that are similar enough in duties, responsibilities, necessary worker characteristics, and other relevant aspects that they may be properly placed under the same job title.

job performance measurement: Measurement of an incumbent's observed performance of a job as evaluated by a job sample test, an assessment of job knowledge, or ratings of the incumbent's actual performance on the job. See job sample test.

job sample test: A test of the ability of an individual to perform the tasks comprised by a job. See job performance measurement.

licensing: The granting, usually by a government agency, of an authorization or legal permission to practice an occupation or profession. See certification, credentialing.

linking/score linking: The process of relating scores on tests. See alternate forms, equating, calibration, moderation, projection, vertical scaling.

local evidence: Evidence (usually related to reliability/precision or validity) collected for a specific test and a specific set of test takers in a single institution or at a specific location.

local norms: Norms by which test scores are referred to a specific, limited reference population of particular interest to the test user (e.g., population of a locale, organization, or institution). Local norms are not intended to be representative of populations beyond that limited setting.
low-stakes test: A test used to provide results that have only minor or indirect consequences for individuals, programs, or institutions involved in the testing. Contrast with high-stakes test.

mastery test: A test designed to indicate whether a test taker has attained a prescribed level of competence, or mastery, in a domain. See cut score, computer-based mastery test.

matrix sampling: A measurement format in which a large set of test items is organized into a number of relatively short item sets, each of which is randomly assigned to a subsample of test takers, thereby avoiding the need to administer all items to all test takers. Equivalence of the short item sets, or subsets, is not assumed.

meta-analysis: A statistical method of research in which the results from independent, comparable studies are combined to determine the size of an overall effect or the degree of relationship between two variables.

moderation: A process of relating scores on different tests so that scores have the same relative meaning.

moderator variable: A variable that affects the direction or strength of the relationship between two other variables.

modification/test modification: A change in test content, format (including response formats), and/or administration conditions that is made to increase accessibility for some individuals but that also affects the construct measured and, consequently, results in scores that differ in meaning from scores from the unmodified assessment.

neuropsychological assessment: A specialized type of psychological assessment of normal or pathological processes affecting the central nervous system and the resulting psychological and behavioral functions or dysfunctions.

norm-referenced score interpretation: A score interpretation based on a comparison of a test taker's performance with the distribution of performance in a specified reference population. Contrast criterion-referenced score interpretation.

norms: Statistics or tabular data that summarize the distribution or frequency of test scores for one or more specified groups, such as test takers of various ages or grades, usually designed to represent some larger population, referred to as the reference population. See local norms.

operational use: The actual use of a test, after initial test development has been completed, to inform an interpretation, decision, or action, based in part or wholly on test scores.

opportunity to learn: The extent to which test takers have been exposed to the tested constructs through their educational program and/or have had exposure to or experience with the language or the majority culture required to understand the test.

parallel forms: In classical test theory, strictly parallel test forms that are assumed to measure the same construct and to have the same means and the same standard deviations in the populations of interest. See alternate forms.

percentile: The score on a test below which a given percentage of scores for a specified population occurs.

percentile rank: The rank of a given score based on the percentage of scores in a specified score distribution that are below the score being ranked.
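A small worked example (an editorial addition; the numbers are invented for illustration): if 300 of the 400 scores in the specified distribution fall below a score of 62, then

\[
\mathrm{PR}(62) = \frac{300}{400} \times 100 = 75,
\]

so the score of 62 has a percentile rank of 75. Some operational definitions also credit half of the scores exactly equal to the score being ranked.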
performance assessments: Assessments for which the test taker actually demonstrates the skills the test is intended to measure by doing tasks that require those skills.

performance level: Label or brief statement classifying a test taker's competency in a particular domain, usually defined by a range of scores on a test. For example, labels such as "basic" to "advanced," or "novice" to "expert," constitute broad ranges for classifying proficiency. See achievement levels, cut score, performance-level descriptor, standard setting.

performance-level descriptor: Descriptions of what test takers know and can do at specific performance levels.

performance standards: Descriptions of levels of knowledge and skill acquisition contained in content standards, as articulated through performance-level labels (e.g., "basic," "proficient," "advanced"); statements of what test takers at different performance levels know and can do; and cut scores or ranges of scores on the scale of an assessment that differentiate levels of performance. See cut score, performance level, performance-level descriptor.

personality inventory: An inventory that measures one or more characteristics that are regarded generally as psychological attributes or interpersonal tendencies.

pilot test: A test administered to a sample of test takers to try out some aspects of the test or test items, such as instructions, time limits, item response formats, or item response options. See field test.

policy study: A study that contributes to judgments about plans, principles, or procedures enacted to achieve broad public goals.

portfolio: In assessment, a systematic collection of educational or work products that have been compiled or accumulated over time, according to a specific set of principles or rules.

position: In employment contexts, the smallest organizational unit, a set of assigned duties and responsibilities that are performed by a person within an organization.

practice analysis: An investigation of a certain occupation or profession to obtain descriptive information about the activities and responsibilities of the occupation or profession and about the knowledge, skills, and abilities needed to engage successfully in the occupation or profession. See job analysis.

precision of measurement: The impact of measurement error on the outcome of the measurement. See standard error of measurement, error of measurement, reliability/precision.

predictive bias: The systematic under- or over-prediction of criterion performance for people belonging to groups differentiated by characteristics not relevant to the criterion performance.

predictive validity evidence: Evidence indicating how accurately test data collected at one time can predict criterion scores that are obtained at a later time.

proctor: In test administration, a person responsible for monitoring the testing process and implementing the test administration procedures.

program evaluation: The collection and synthesis of evidence about the use, operation, and effects of a program; the set of procedures used to make judgments about a program's design, implementation, and outcomes.

projection: A method of score linking in which scores on one test are used to predict scores on another test for a group of test takers, often using regression methodology.

prompt/item prompt/writing prompt: The question, stimulus, or instruction that elicits a test taker's response.

proprietary algorithms: Procedures, often computer code, used by commercial publishers or test developers that are not revealed to the public for commercial reasons.

psychodiagnosis: Formalization or classification of functional mental health status based on psychological assessment.

psychological assessment: An examination of psychological functioning that involves collecting, evaluating, and integrating test results and collateral information, and reporting information about an individual.

psychological testing: The use of tests or inventories to assess particular psychological characteristics of an individual.

random error: A nonsystematic error; a component of test scores that appears to have no relationship to other variables.

random sample: A selection from a defined population of entities according to a random process with the selection of each entity independent of the selection of other entities. See sample.

raw score: A score on a test that is calculated by counting the number of correct answers, or more generally, a sum or other combination of item scores.

reference population: The population of test takers to which individual test takers are compared through the test norms. The reference population may be defined in terms of test taker age, grade, clinical status at the time of testing, or other characteristics. See norms.

relevant subgroup: A subgroup of the population for which a test is intended that is identifiable in some way that is relevant to the interpretation of test scores for their intended purposes.

reliability coefficient: A unit-free indicator that reflects the degree to which scores are free of random measurement error. See generalizability theory.

reliability/precision: The degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and consistent for an individual test taker; the degree to which scores are free of random errors of measurement for a given group. See generalizability theory, classical test theory, precision of measurement.

response bias: A test taker's tendency to respond in a particular way or style to items on a test (e.g., acquiescence, choice of socially desirable options, choice of "true" on a true-false test) that yields systematic, construct-irrelevant error in test scores.

response format: The mechanism that a test taker uses to respond to a test item, such as selecting from a list of options (multiple-choice question) or providing a written response (fill-in or written response to an open-ended or constructed-response question); oral response; or physical performance.

response protocol: A record of the responses given by a test taker to a particular test.

restriction of range or variability: Reduction in the observed score variance of a test-taker sample, compared with the variance of the entire test-taker population, as a consequence of constraints on the process of sampling test takers. See adjusted validity or reliability coefficient.

retesting: A repeat administration of a test, using either the same test or an alternate form, sometimes with additional training or education between administrations.

rubric: See scoring rubric.

sample: A selection of a specified number of entities, called sampling units (test takers, items, etc.), from a larger specified set of possible entities, called the population. See random sample, stratified random sample.

scale: 1. The system of numbers, and their units, by which a value is reported on some dimension of measurement. 2. In testing, the set of items or subtests used to measure a specific characteristic (e.g., a test of verbal ability or a scale of extroversion-introversion).

scale score: A score obtained by transforming raw scores. Scale scores are typically used to facilitate interpretation.

scaling: The process of creating a scale or a scale score to enhance test score interpretation by placing scores from different tests or test forms on a common scale or by producing scale scores designed to support score interpretations. See scale.

school district: A local education agency administered by a public board of education or other public authority that oversees public elementary or secondary schools in a political subdivision of a state.

score: Any specific number resulting from the assessment of an individual, such as a raw score, a scale score, an estimate of a latent variable, a production count, an absence record, a course grade, or a rating.

scoring rubric: The established criteria, including rules, principles, and illustrations, used in scoring constructed responses to individual tasks and clusters of tasks.

screening test: A test that is used to make broad categorizations of test takers as a first step in selection decisions or diagnostic processes.

selection: The acceptance or rejection of applicants for a particular educational or employment opportunity.

sensitivity: In classification, diagnosis, and selection, the proportion of cases that are assessed as meeting or predicted to meet the criteria and which, in truth, do meet the criteria.

specificity: In classification, diagnosis, and selection, the proportion of cases that are assessed as not meeting or predicted to not meet the criteria and which, in truth, do not meet the criteria.

speededness: The extent to which test takers' scores depend on the rate at which work is performed as well as on the correctness of the responses. The term is not used to describe tests of speed.

split-halves reliability coefficient: An internal-consistency coefficient obtained by using half the items on a test to yield one score and the other half of the items to yield a second, independent score. See internal-consistency coefficient, coefficient alpha.
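The half-test correlation is usually stepped up to full test length with the Spearman-Brown formula (an editorial note; the formula is standard psychometrics rather than text from the Standards):

\[
r_{XX'} = \frac{2\, r_{hh}}{1 + r_{hh}},
\]

where \(r_{hh}\) is the correlation between the two half-test scores and \(r_{XX'}\) the estimated full-test reliability.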
stability: The extent to which scores on a test are essentially invariant over time, assessed by correlating the test scores of a group of individuals with scores on the same test or an equated test taken by the same group at a later time. See test-retest reliability coefficient.

standard error of measurement: The standard deviation of an individual's observed scores from repeated administrations of a test (or parallel forms of a test) under identical conditions. Because such data generally cannot be collected, the standard error of measurement is usually estimated from group data. See error of measurement.
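The usual group-data estimate (an illustrative formula in the classical notation used in the editorial notes above, not the Standards' own) is

\[
\mathrm{SEM} = \sigma_X \sqrt{1 - \rho_{XX'}},
\]

where \(\sigma_X\) is the observed-score standard deviation for the group and \(\rho_{XX'}\) an estimated reliability coefficient for the scores.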

standard setting: The process, often judgment based, of setting cut scores using a structured procedure that seeks to map test scores into discrete performance levels that are usually specified by performance-level descriptors.

standardization: 1. In test administration, maintaining a consistent testing environment and conducting tests according to detailed rules and specifications, so that testing conditions are the same for all test takers on the same and multiple occasions. 2. In test development, establishing a reporting scale using norms based on the test performance of a representative sample of individuals from the population with which the test is intended to be used.

standards-based assessment: Assessment of an individual's standing with respect to systematically described content and performance standards.

stratified random sample: A set of random samples, each of a specified size, from each of several different sets, which are viewed as strata of a population. See random sample, sample.

summative assessment: The assessment of a test taker's knowledge and skills typically carried out at the completion of a program of learning, such as the end of an instructional unit.

systematic error: An error that consistently increases or decreases the scores of all test takers or some subset of test takers, but is not related to the construct that the test is intended to measure. See bias.

technical manual: A publication prepared by test developers and/or publishers to provide technical and psychometric information about a test.

test: An evaluative device or procedure in which a systematic sample of a test taker's behavior in a specified domain is obtained and scored using a standardized process.

test design: The process of developing detailed specifications for what a test is to measure and the content, cognitive level, format, and types of test items to be used.

test developer: The person(s) or organization responsible for the design and construction of a test and for the documentation regarding its technical quality for an intended purpose.

test development: The process through which a test is planned, constructed, evaluated, and modified, including consideration of content, format, administration, scoring, item properties, scaling, and technical quality for the test's intended purpose.

test documents: Documents such as test manuals, technical manuals, user's guides, specimen sets, and directions for test administrators and scorers that provide information for evaluating the appropriateness and technical adequacy of a test for its intended purpose.

test form: A set of test items or exercises that meet requirements of the specifications for a testing program. Many testing programs use alternate test forms, each built according to the same specifications but with some or all of the test items unique to each form. See alternate forms.

test format/mode: The manner in which test content is presented to the test taker: with paper and pencil, via computer terminal or Internet, or orally by an examiner.

test information function: A mathematical function relating each level of an ability or latent trait, as defined under item response theory (IRT), to the reciprocal of the corresponding conditional measurement error variance.
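Stated symbolically (an editorial restatement of the entry above in generic IRT notation), the definition amounts to

\[
I(\theta) = \frac{1}{\sigma^{2}(\hat{\theta} \mid \theta)},
\]

so larger test information at a trait level \(\theta\) corresponds to a smaller conditional error variance for the ability estimate \(\hat{\theta}\); for a test of locally independent items, \(I(\theta)\) is the sum of the item information functions.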
test manual: A publication prepared by test developers and/or publishers to provide information on test administration, scoring, and interpretation and to provide selected technical data on test characteristics. See user's guide, technical manual.

test modification: Changes made in the content, format, and/or administration procedure of a test to increase the accessibility of the test for test takers who are unable to take the original test under standard testing conditions. In contrast to test accommodations, test modifications change the construct being measured by the test to some extent and hence change score interpretations. See adaptation/test adaptation, modification/test modification. Contrast with accommodations/test accommodations.

test publisher: An entity, individual, organization, or agency that produces and/or distributes a test.

test-retest reliability coefficient: A reliability coefficient obtained by administering the same test a second time to the same group after a time interval and correlating the two sets of scores; typically used as a measure of stability of the test scores. See stability.

test security: Protection of the content of a test from unauthorized release or use, to protect the integrity of the test scores so they are valid for their intended use.

test specifications: Documentation of the purpose and intended uses of a test as well as of the test's content, format, length, psychometric characteristics (of the items and test overall), delivery mode, administration, scoring, and score reporting.

test-taking strategies: Strategies that test takers might use while taking a test to improve their performance, such as time management or the elimination of obviously incorrect options on a multiple-choice question before responding to the question.

test user: A person or entity responsible for the choice and administration of a test, for the interpretation of test scores produced in a given context, and for any decisions or actions that are based, in part, on test scores.

timed test: A test administered to test takers who are allotted a prescribed amount of time to respond to the test.

top-down selection: Selection of applicants on the basis of rank-ordered test scores from highest to lowest.

true score: In classical test theory, the average of the scores that would be earned by an individual on an unlimited number of strictly parallel forms of the same test.

unidimensional test: A test that measures only one dimension or only one latent variable.

universal design: An approach to assessment development that attempts to maximize the accessibility of a test for all of its intended test takers.

universe score: In generalizability theory, the expected value over all possible replications of a procedure for the test taker. See generalizability theory.

user norms: Descriptive statistics (including percentile ranks) for a group of test takers that does not represent a well-defined reference population, for example, all persons tested during a certain period of time, or a set of self-selected test takers. See local norms, norms.

user's guide: A publication prepared by test developers and/or publishers to provide information on a test's purpose, appropriate uses, proper administration, scoring procedures, normative data, interpretation of results, and case studies. See test manual.

validation: The process through which the validity of a proposed interpretation of test scores for their intended uses is investigated.

validity: The degree to which accumulated evidence and theory support a specific interpretation of test scores for a given use of a test. If multiple interpretations of a test score for different uses are intended, validity evidence for each interpretation is needed.

validity argument: An explicit justification of the degree to which accumulated evidence and theory support the proposed interpretation(s) of test scores for their intended uses.

validity generalization: Application of validity evidence obtained in one or more situations to other similar situations on the basis of methods such as meta-analysis.

value-added modeling: Estimating the contribution of individual schools or teachers to student performance by means of complex statistical techniques that use multiple years of student outcome data, which typically are standardized test scores. See growth models.

variance components: Variances accruing from the separate constituent sources that are assumed to contribute to the overall variance of observed scores. Such variances, estimated by methods of the analysis of variance, often reflect situation, location, time, test form, rater, and related effects. See generalizability theory.

vertical scaling: In test linking, the process of relating scores on tests that measure the same construct but differ in difficulty. Typically used with achievement and ability tests with content or difficulty that spans a variety of grade or age levels.

vocational assessment: A specialized type of psychological assessment designed to generate hypotheses and inferences about interests, work needs and values, career development, vocational maturity, and indecision.

weighted scores/scoring: A method of scoring a test in which a different number of points is awarded for a correct (or diagnostically relevant) response for different items. In some cases, the scoring formula awards differing points for each different response to the same item.

INDEX

Abbreviated test form, 44-45, 107
Accommodations, 45, 59-61
  appropriateness, 62, 67-69, 115, 145, 190
  documenting, 67, 88
  English language learners (ELL), 191
  meaning, 58, 190
  score comparability, 59
  (see also Modifications)
Accountability
  index, 206, 209-211
  measures, reliability/precision, 40
  opportunity to learn, 57
  systems, 203
Achievement standards (see Performance standards)
Adaptations, 50, 58-59
  alternate assessments, 189-190
  employment testing, 177
  test-taker responsibilities, 132
  test-user responsibilities, 144
  translations, 60
  (see also Accommodations, Modifications)
Adaptive testing
  item selection, 81, 89, 98
  reliability/precision, 43
  score comparability, 106
  specifications, 80-81, 86
Admissions testing, 186-187
Aggregate scores, 71, 119-120, 190, 210
Alignment, 15, 26, 87-89, 185, 196
Alternate assessments, 189-190
Anchor test design, 98, 105-106
Assessment
  formative, 184
  meaning, 2, 183
  psychological, 151
  summative, 184
Assessment literacy, 192
Attenuation, 29, 47, 180
Bias, 49, 51-54, 211
  cultural, 52-53, 55-56, 60, 64
  predictive, 51-52, 66
  (see also Differential item functioning, Differential prediction, Differential test functioning)
Certification testing, 136, 169, 174-175
Change scores (see Growth measures)
Cheating, 116-117, 132, 136-137
Classical test theory, 33-35, 37, 88
Classification, 30, 181
  decision consistency, 40-41, 46, 136
  score labels, 136
Clinical assessment (see Psychological assessment)
Coaching (see Practice effects)
Cognitive labs, 82
Collateral information, 155, 167
Composite scores, 27, 43, 93, 182, 193, 210
Computer adaptive testing (see Adaptive testing)
Computer-administered tests, 83, 112, 116, 145, 153, 166, 188, 197
Concordance (see Score linking)
Consequential evidence (see Validation evidence, Unintended consequences)
Construct, 11
Construct irrelevance, 12, 54-56, 64, 67, 90, 154
Construct underrepresentation, 12, 154
  accommodations, 60
Content standards, 185
Content validation evidence, 14-15
Context effects, 45
Copyright protection, 147-148
Credentialing test (see Licensing, Certification testing)
Criterion variable, 17, 172, 180
Criterion-referenced interpretation, 96
Cross-validation, 28, 89
Cut scores, 46, 96, 100-101, 107-109, 129, 176
  adjusting, 177, 182
  standard setting, 176
Decision accuracy, 40, 136
Decision consistency, 40-41, 44
  estimating, 46
  reporting, 46, 136, 182
Difference scores, 43
Differential item functioning (DIF), 16, 51, 82
Differential prediction, 18, 30, 51-52, 66
Differential test functioning (DTF), 51, 65, 70-71
Dimensionality, 16, 27, 43
Disattenuated correlations, 29
Documentation, 123-126
  availability, 129
  cut scores, 107-109
  equating procedures, 105
  forms differences, 86-87
  psychometric item properties, 88-89
227
INDEX

    rater qualifications, 92
    rater scoring, 92
    reliability/precision, 126
    research studies, 126-127
    score interpretation, 92
    score linking, 106
    score scale development, 102
    scoring procedures, 118, 197
    test administration, 127-128
    test development, 126
    test revision, 129

Educational testing
    accountability, 126, 147, 203-207, 209-213
    admissions, 186-187
    placement, 187
    purposes, 184-187, 195
Effect size, 29
Employment testing
    contextual factors, 170-171
    job analysis, 173, 175, 182
    validation, 175-176
    validation process, 171-174, 178-179, 181
English language proficiency, 191
Equating (see Score linking)
Errors of measurement, 33-34
Expert review, 87-88

Fairness
    accessibility, 49, 52-53, 77
    educational tests, 186
    meaning, 49
    score validity, 53-54, 63
    universal design, 50, 57-58, 187
    (see also Bias)
Faking, 154-155
Field testing, 83, 88
Flagging test scores (see Adaptations)

Gain scores (see Difference scores, Growth measures)
Generalizability theory framework, 34
Group performance
    interpretation, 66, 200, 207, 212
    norms, 104
    reliability/precision, 40, 46-47, 119
    subgroups, 72, 145, 165
    (see also Aggregate scores)
Growth measures, 185, 198, 209

High-stakes tests, 189, 203

Informed consent, 131, 134-135
Item format
    accessibility, 77
    adaptations, 77
    performance assessments, 77-78
    portfolios, 78
    simulations, 78
Item response theory (IRT), 38
    information function, 34, 37-38
Item tryout, 82, 88
Item weights, 93

Language proficiency, 53, 55, 68-69, 146, 156-157, 191 (see also Translated tests)
Licensing, 169, 175
Linking tests (see Score linking)
Local scoring, 128

Mandated tests, 195, 212-213
Matrix sampling, 47, 119-120, 204, 209
Meta-analysis, 29-30, 173-174, 209
Modifications, 24, 45, 67
    appropriateness, 62, 69
    documenting, 68
    meaning, 58, 190
    score interpretations, 68, 191
    (see also Accommodations)
Multi-stage testing, 81 (see also Adaptive testing)

Norm-referenced interpretation, 96-97, 186
Norms, 96-97, 104, 126, 186
    local, 196
    updating, 104-105
    user, 97, 186

Observed score, 34
Opportunity to learn, 56-57, 72, 197

Parallel tests, 35
Passing score (see Cut scores)
Performance standards, 185
Personality measures, 43, 142, 155, 158, 164
Personnel selection testing (see Employment testing)
Placement tests, 169, 187
Policy studies, 203, 204
Practice effects, 24-25
Practice material, 91, 116, 131
Program evaluation, 203-204
Psychological assessment
    batteries, 155, 165-167
    collateral information, 155, 167
    diagnosis, 159-160, 165, 167
    interpretation, 155


    interventions, 161
    meaning, 151
    personality, 158
    process, 151-152
    purposes, 159-163
    qualifications, 164
    types of, 155-157
    vocational, 158-159

Random errors, 36
Rater agreement, 25, 39, 44, 118
Rater training (see Scorer training)
Raw scores, 103
Records retention, 120-121, 146
Reliability coefficient
    interpretation, 44
    meaning, 33-35
Reliability/precision
    documentation, 126
    meaning, 33
Reliability/precision estimates
    adjustments with, 29, 47
    interpretations, 38-39
    reporting of results, 40-45
    reporting subscores, 43
Reliability/precision estimation procedures, 36-37
    alternate forms, 34-35, 37, 95
    generalizability coefficient, 37-38
    group means, 40, 46-47
    internal consistency, 35-37
    reporting, 47
    scorer consistency, 37, 44, 92
    test-retest, 36-38
Replications, 35-37
Response bias, 154
Restriction of range, 29, 47, 180
Retention of records, 120-121, 146
Retesting, 114-115, 132, 146-147, 152, 197

Scale drift, 107
Scale scores
    appropriate use, 102
    documentation, 102
    drift, 107
    interpretation, 102-103
Scale stability, 103
Score comparability
    adaptive testing, 106
    evidence, 60, 103, 105, 106
    interpretations, 61, 71, 95, 111, 116
    translations, 69
Score interpretation, 23-25
    absolute, 39
    automated, 119, 144, 168
    case studies, 128-129
    composite scores, 27, 43, 182
    documentation, 92
    inappropriate, 23, 27, 124, 143-144, 166
    meta-analysis, 30, 173-174
    multiple indicators, 71, 140-141, 145, 154-155, 166-167, 179, 198, 213
    qualifications, 139-142, 199-200
    relative, 39
    reliability/precision, 33-34, 42, 119, 198-199
    subgroups, 65, 70-72, 211
    subscores, 27, 176, 201
    test batteries, 155
    validation, 23, 27, 85, 199
Score linking, 99-100
    documentation, 106
    equating meaning, 97
    equating methods, 98, 105-106
    meaning, 95
Score reporting, 135
    adaptations, 61
    automated, 119, 144, 168, 194
    errors, 120, 143
    flagging, 61, 194
    release, 135, 146-147, 211-212
    supporting materials, 119, 144, 166, 194, 200
    timelines, 136-137, 146
    transmission, 121, 135
Scorer training, 112, 118
Scoring
    analytic, 79
    holistic, 79
Scoring algorithms, 66-67, 91-92, 118
    documenting, 92
Scoring bias, 66
Scoring errors, 143
Scoring portfolios, 78, 187
Scoring rubrics, 79, 82, 92, 118
    bias, 57
Security, 117, 120-121, 128, 132, 147-148, 168
Selection, 169
Sensitivity reviews, 64
Short forms of tests (see Abbreviated test form)
Standard error of measurement (SEM), 34, 37, 39-40, 45-46
    conditional, 34, 39, 46, 176, 182
Standard setting (see Cut scores)
Standardized test, 111


Systematic errors, 36

Technical manuals (see Documentation)
Test
    classroom, 183
    meaning, 2, 183
Test administration, 114, 192
    directions, 83, 90-91, 112
    documentation, 127-128
    interpreter use, 69-70
    qualifications, 127, 139, 142, 153, 164, 199-200
    security, 128
    standardized, 65, 115
    variations, 87, 90, 115
Test bias (see Bias)
Test developer, 23
    meaning, 3, 76
Test development
    accessibility, 195-196
    design, 75
    documentation, 126
    meaning, 75
    (see also Universal design)
Test manuals (see Documentation)
Test preparation, 24-25, 134, 165, 197
Test publisher, 76
Test revisions, 83-84, 93, 107, 176-177
    documentation, 129
Test security procedures, 83
Test selection, 72, 139, 142-143, 204, 212
    psychological, 152, 164-165
Test specifications, 85-86
    adaptive testing, 80-81
    administration, 80
    content, 76, 85
    employment testing, 175
    item formats, 77-78
    length, 79
    meaning, 76
    portfolios, 78
    purpose, 76
    scoring, 79-80
Test standards
    applicability, 2-3, 5-6
    cautions, 7
    enforcement, 2
    legal requirements, 1, 7
    purposes, 1
Test users, 139-141
    responsibilities, 142, 153
Testing environment, 116
Testing irregularities, 136-137, 146
Test-taker responsibilities, 131-132
    adaptations, 132
Test-taker rights, 131-133, 162
    informed consent, 131, 134-135
    irregularities, 137
    research instrument, 91
    test preparation, 133
Time limits, appropriateness, 90
Translated tests, 60-61, 68-69, 127
True score, 34

Unintended consequences, 12, 19-20, 30-31, 124, 189, 196, 207, 212
Universal design, 50, 57-58, 63, 77, 187
Universe score, 34

Validation
    meaning, 11
    process, 11-12, 19-21, 23, 85, 171-174, 210
    samples, 25, 126-127
Validation evidence, 13-19
    absence of, 143, 164
    concurrent, 17-18
    consequential, 19-21, 30-31
    construct-related, 27-28, 66
    content-oriented, 14, 26, 54-55, 87-89, 172, 175-176, 178, 181-182, 196
    convergent, 16-17
    criterion variable, 28, 172, 180
    criterion-related, 17-19, 29, 66, 167, 172, 175-176
    data collection, 26
    discriminant, 16-17
    integration of, 21-22
    internal structure, 16, 26-27
    interrelationships, 16, 27-29
    predictive, 17-18, 28, 129, 167, 172, 179
    rater variables, 25-26
    ratings, 25-26
    relations to other variables, 16-18, 172
    response processes, 15-16, 26
    statistical, 26, 28-29, 126
    subgroups, 64
    validity generalization, 18, 173, 180
Validity
    fairness, 49-57
    meaning, 11, 14
    process, 13
    reliability/precision implications, 34-35
Validity generalization, 18, 173, 180
Vertical scaling, 95, 99, 185

