
AMERICAN EDUCATIONAL RESEARCH ASSOCIATION

AMERICAN PSYCHOLOGICAL ASSOCIATION


NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION
1. VALIDITY

BACKGROUND
Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing tests and evaluating tests. The process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores for proposed uses that are evaluated, not the test itself. When test scores are interpreted in more than one way (e.g., both to describe a test taker's current level of the attribute being measured and to make a prediction about a future outcome), each intended interpretation must be validated. Statements about validity should refer to particular interpretations for specified uses. It is incorrect to use the unqualified phrase "the validity of the test."

Evidence of the validity of a given interpretation of test scores for a specified use is a necessary condition for the justifiable use of the test. Where sufficient evidence of validity exists, the decision as to whether to actually administer a particular test generally takes additional considerations into account. These include cost-benefit considerations, framed in different subdisciplines as utility analysis or as consideration of negative consequences of test use, and a weighing of any negative consequences against the positive consequences of test use.

Validation logically begins with an explicit statement of the proposed interpretation of test scores, along with a rationale for the relevance of the interpretation to the proposed use. The proposed interpretation includes specifying the construct the test is intended to measure. The term construct is used in the Standards to refer to the concept or characteristic that a test is designed to measure. Rarely, if ever, is there a single possible meaning that can be attached to a test score or a pattern of test responses. Thus, it is always incumbent on test developers and users to specify the construct interpretation that will be made on the basis of the score or response pattern. Examples of constructs currently used in assessment include mathematics achievement, general cognitive ability, racial identity attitudes, depression, and self-esteem. To support test development, the proposed construct interpretation is elaborated by describing its scope and extent and by delineating the aspects of the construct that are to be represented. The detailed description provides a conceptual framework for the test, delineating the knowledge, skills, abilities, traits, interests, processes, competencies, or characteristics to be assessed. Ideally, the framework indicates how the construct as represented is to be distinguished from other constructs and how it should relate to other variables.

The conceptual framework is partially shaped by the ways in which test scores will be used. For instance, a test of mathematics achievement might be used to place a student in an appropriate program of instruction, to endorse a high school diploma, or to inform a college admissions decision. Each of these uses implies a somewhat different interpretation of the mathematics achievement test scores: that a student will benefit from a particular instructional intervention, that a student has mastered a specified curriculum, or that a student is likely to be successful with college-level work. Similarly, a test of conscientiousness might be used for psychological counseling, to inform a decision about employment, or for the basic scientific purpose of elaborating the construct of conscientiousness. Each of these potential uses shapes the specified framework and the proposed interpretation of the test's scores and also can have implications for test development and evaluation. Validation can be viewed as a process of constructing and evaluating arguments for and against the intended interpretation of test scores and their relevance to the proposed use. The conceptual framework points to the kinds of evidence that might be collected to evaluate the proposed interpretation in light of the purposes of testing.

As validation proceeds, and new evidence regarding the interpretations that can and cannot be drawn from test scores becomes available, revisions may be needed in the test, in the conceptual framework that shapes it, and even in the construct underlying the test.

The wide variety of tests and circumstances makes it natural that some types of evidence will be especially critical in a given case, whereas other types will be less useful. Decisions about what types of evidence are important for the validation argument in each instance can be clarified by developing a set of propositions or claims that support the proposed interpretation for the particular purpose of testing. For instance, when a mathematics achievement test is used to assess readiness for an advanced course, evidence for the following propositions might be relevant: (a) that certain skills are prerequisite for the advanced course; (b) that the content domain of the test is consistent with these prerequisite skills; (c) that test scores can be generalized across relevant sets of items; (d) that test scores are not unduly influenced by ancillary variables, such as writing ability; (e) that success in the advanced course can be validly assessed; and (f) that test takers with high scores on the test will be more successful in the advanced course than test takers with low scores on the test. Examples of propositions in other testing contexts might include, for instance, the proposition that test takers with high general anxiety scores experience significant anxiety in a range of settings, the proposition that a child's score on an intelligence scale is strongly related to the child's academic performance, or the proposition that a certain pattern of scores on a neuropsychological battery indicates impairment that is characteristic of brain injury. The validation process evolves as these propositions are articulated and evidence is gathered to evaluate their soundness.

Identifying the propositions implied by a proposed test interpretation can be facilitated by considering rival hypotheses that may challenge the proposed interpretation. It is also useful to consider the perspectives of different interested parties, existing experience with similar tests and contexts, and the expected consequences of the proposed test use. A finding of unintended consequences of test use may also prompt a consideration of rival hypotheses. Plausible rival hypotheses can often be generated by considering whether a test measures less or more than its proposed construct. Such considerations are referred to as construct underrepresentation (or construct deficiency) and construct-irrelevant variance (or construct contamination), respectively.

Construct underrepresentation refers to the degree to which a test fails to capture important aspects of the construct. It implies a narrowed meaning of test scores because the test does not adequately sample some types of content, engage some psychological processes, or elicit some ways of responding that are encompassed by the intended construct. Take, for example, a test intended as a comprehensive measure of anxiety. A particular test might underrepresent the intended construct because it measures only physiological reactions and not emotional, cognitive, or situational components. As another example, a test of reading comprehension intended to measure children's ability to read and interpret stories with understanding might not contain a sufficient variety of reading passages or might ignore a common type of reading material.

Construct-irrelevance refers to the degree to which test scores are affected by processes that are extraneous to the test's intended purpose. The test scores may be systematically influenced to some extent by processes that are not part of the construct. In the case of a reading comprehension test, these might include material too far above or below the level intended to be tested, an emotional reaction to the test content, familiarity with the subject matter of the reading passages on the test, or the writing skill needed to compose a response. Depending on the detailed definition of the construct, vocabulary knowledge or reading speed might also be irrelevant components. On a test designed to measure anxiety, a response bias to underreport one's anxiety might be considered a source of construct-irrelevant variance. In the case of a mathematics test, it might include overreliance on reading comprehension skills that English language learners may be lacking. On a test designed to measure science knowledge, test-taker internalizing of gender-based stereotypes about women in the sciences might be a source of construct-irrelevant variance.

Nearly all tests leave out elements that some potential users believe should be measured and include some elements that some potential users consider inappropriate. Validation involves careful attention to possible distortions in meaning arising from inadequate representation of the construct and also to aspects of measurement, such as test format, administration conditions, or language level, that may materially limit or qualify the interpretation of test scores for various groups of test takers. That is, the process of validation may lead to revisions in the test, in the conceptual framework of the test, or both. Interpretations drawn from the revised test would again need validation.

When propositions have been identified that would support the proposed interpretation of test scores, one can proceed with validation by obtaining empirical evidence, examining relevant literature, and/or conducting logical analyses to evaluate each of the propositions. Empirical evidence may include both local evidence, produced within the contexts where the test will be used, and evidence from similar testing applications in other settings. Use of existing evidence from similar tests and contexts can enhance the quality of the validity argument, especially when data for the test and context in question are limited.

Because an interpretation for a given use typically depends on more than one proposition, strong evidence in support of one part of the interpretation in no way diminishes the need for evidence to support other parts of the interpretation. For example, when an employment test is being considered for selection, a strong predictor-criterion relationship in an employment setting is ordinarily not sufficient to justify use of the test. One should also consider the appropriateness and meaningfulness of the criterion measure, the appropriateness of the testing materials and procedures for the full range of applicants, and the consistency of the support for the proposed interpretation across groups. Professional judgment guides decisions regarding the specific forms of evidence that can best support the intended interpretation for a specified use. As in all scientific endeavors, the quality of the evidence is paramount. A few pieces of solid evidence regarding a particular proposition are better than numerous pieces of evidence of questionable quality. The determination that a given test interpretation for a specific purpose is warranted is based on professional judgment that the preponderance of the available evidence supports that interpretation. The quality and quantity of evidence sufficient to reach this judgment may differ for test uses depending on the stakes involved in the testing. A given interpretation may not be warranted either as a result of insufficient evidence in support of it or as a result of credible evidence against it.

Validation is the joint responsibility of the test developer and the test user. The test developer is responsible for furnishing relevant evidence and a rationale in support of any test score interpretations for specified uses intended by the developer. The test user is ultimately responsible for evaluating the evidence in the particular setting in which the test is to be used. When a test user proposes an interpretation or use of test scores that differs from those supported by the test developer, the responsibility for providing validity evidence in support of that interpretation for the specified use is the responsibility of the user. It should be noted that important contributions to the validity evidence may be made as other researchers report findings of investigations that are related to the meaning of scores on the test.

Sources of Validity Evidence
The following sections outline various sources of evidence that might be used in evaluating the validity of a proposed interpretation of test scores for a particular use. These sources of evidence may illuminate different aspects of validity, but they do not represent distinct types of validity. Validity is a unitary concept. It is the degree to which all the accumulated evidence supports the intended interpretation of test scores for the proposed use. Like the 1999 Standards, this edition refers to types of validity evidence, rather than distinct types of validity. To emphasize this distinction, the treatment that follows does not follow historical nomenclature (i.e., the use of the terms content validity or predictive validity).

As the discussion in the prior section emphasizes, each type of evidence presented below is not required in all settings. Rather, support is needed for each proposition that underlies a proposed test interpretation for a specified use. A proposition that a test is predictive of a given criterion can be supported without evidence that the test samples a particular content domain. In contrast, a proposition that a test covers a representative sample of a particular curriculum may be supported without evidence that the test predicts a given criterion. However, a more complex set of propositions, e.g., that a test samples a specified domain and thus is predictive of a criterion reflecting a related domain, will require evidence supporting both parts of this set of propositions. Test developers are also expected to make the case that the scores are not unduly influenced by construct-irrelevant variance (see chap. 3 for detailed treatment of issues related to construct-irrelevant variance). In general, adequate support for proposed interpretations for specific uses will require multiple sources of evidence.

The position developed above also underscores the fact that if a given test is interpreted in multiple ways for multiple uses, the propositions underlying these interpretations for different uses also are likely to differ. Support is needed for the propositions underlying each interpretation for a specific use. Evidence supporting the interpretation of scores on a mathematics achievement test for placing students in subsequent courses (i.e., evidence that the test interpretation is valid for its intended purpose) does not permit inferring validity for other purposes (e.g., promotion or teacher evaluation).

Evidence Based on Test Content

Important validity evidence can be obtained from an analysis of the relationship between the content of a test and the construct it is intended to measure. Test content refers to the themes, wording, and format of the items, tasks, or questions on a test. Administration and scoring may also be relevant to content-based evidence. Test developers often work from a specification of the content domain. The content specification carefully describes the content in detail, often with a classification of areas of content and types of items. Evidence based on test content can include logical or empirical analyses of the adequacy with which the test content represents the content domain and of the relevance of the content domain to the proposed interpretation of test scores. Evidence based on content can also come from expert judgments of the relationship between parts of the test and the construct. For example, in developing a licensure test, the major facets that are relevant to the purpose for which the occupation is regulated can be specified, and experts in that occupation can be asked to assign test items to the categories defined by those facets. These or other experts can then judge the representativeness of the chosen set of items.

Some tests are based on systematic observations of behavior. For example, a list of the tasks constituting a job domain may be developed from observations of behavior in a job, together with judgments of subject matter experts. Expert judgments can be used to assess the relative importance, criticality, and/or frequency of the various tasks. A job sample test can then be constructed from a random or stratified sampling of tasks rated highly on these characteristics. The test can then be administered under standardized conditions in an off-the-job setting.

The appropriateness of a given content domain is related to the specific inferences to be made from test scores. Thus, when considering an available test for a purpose other than that for which it was first developed, it is especially important to evaluate the appropriateness of the original content domain for the proposed new purpose. For example, a test given for research purposes to compare student achievement across states in a given domain may properly also cover material that receives little or no attention in the curriculum. Policy makers can then evaluate student achievement with respect to both content neglected and content addressed. On the other hand, when student mastery of a delivered curriculum is tested for purposes of informing decisions about individual students, such as promotion or graduation, the framework elaborating a content domain is appropriately limited to what students have had an opportunity to learn from the curriculum as delivered.

Evidence about content can be used, in part, to address questions about differences in the meaning or interpretation of test scores across relevant subgroups of test takers. Of particular concern is the extent to which construct underrepresentation or construct-irrelevance may give an unfair advantage or disadvantage to one or more subgroups of test takers. For example, in an employment test, the use of vocabulary more complex than needed on the job may be a source of construct-irrelevant variance for English language learners or others. Careful review of the construct and test content domain by a diverse panel of experts may point to potential sources of irrelevant difficulty (or easiness) that require further investigation.

Content-oriented evidence of validation is at the heart of the process in the educational arena known as alignment, which involves evaluating the correspondence between student learning standards and test content. Content-sampling issues in the alignment process include evaluating whether test content appropriately samples the domain set forward in curriculum standards, whether the cognitive demands of test items correspond to the level reflected in the student learning standards (e.g., content standards), and whether the test avoids the inclusion of features irrelevant to the standard that is the intended target of each test item.

Evidence Based on Response Processes

Some construct interpretations involve more or less explicit assumptions about the cognitive processes engaged in by test takers. Theoretical and empirical analyses of the response processes of test takers can provide evidence concerning the fit between the construct and the detailed nature of the performance or response actually engaged in by test takers. For instance, if a test is intended to assess mathematical reasoning, it becomes important to determine whether test takers are, in fact, reasoning about the material given instead of following a standard algorithm applicable only to the specific items on the test.

Evidence based on response processes generally comes from analyses of individual responses. Questioning test takers from various groups making up the intended test-taking population about their performance strategies or responses to particular items can yield evidence that enriches the definition of a construct. Maintaining records that monitor the development of a response to a writing task, through successive written drafts or electronically monitored revisions, for instance, also provides evidence of process. Documentation of other aspects of performance, like eye movements or response times, may also be relevant to some constructs. Inferences about processes involved in performance can also be developed by analyzing the relationship among parts of the test and between the test and other variables. Wide individual differences in process can be revealing and may lead to reconsideration of certain test formats.

Evidence of response processes can contribute to answering questions about differences in meaning or interpretation of test scores across relevant subgroups of test takers. Process studies involving test takers from different subgroups can assist in determining the extent to which capabilities irrelevant or ancillary to the construct may be differentially influencing test takers' test performance.

Studies of response processes are not limited to the test taker. Assessments often rely on observers or judges to record and/or evaluate test takers' performances or products. In such cases, relevant validity evidence includes the extent to which the processes of observers or judges are consistent with the intended interpretation of scores. For instance, if judges are expected to apply particular criteria in scoring test takers' performances, it is important to ascertain whether they are, in fact, applying the appropriate criteria and not being influenced by factors that are irrelevant to the intended interpretation (e.g., quality of handwriting is irrelevant to judging the content of a written essay). Thus, validation may include empirical studies of how observers or judges record and evaluate data, along with analyses of the appropriateness of these processes to the intended interpretation or construct definition.

While evidence about response processes may be central in settings where explicit claims about response processes are made by test developers or where inferences about responses are made by test users, there are many other cases where claims about response processes are not part of the validity argument. In some cases, multiple response processes are available for solving the problems of interest, and the construct of interest is only concerned with whether the problem was solved correctly. As a simple example, there may be multiple possible routes to obtaining the correct solution to a mathematical problem.

Evidence Based on Internal Structure

Analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based. The conceptual framework for a test may imply a single dimension of behavior, or it may posit several components that are each expected to be homogeneous, but that are also distinct from each other. For example, a measure of discomfort on a health survey might assess both physical and emotional health. The extent to which item interrelationships bear out the presumptions of the framework would be relevant to validity.

The specific types of analyses and their interpretation depend on how the test will be used. For example, if a particular application posited a series of increasingly difficult test components, empirical evidence of the extent to which response patterns conformed to this expectation would be provided. A theory that posited unidimensionality would call for evidence of item homogeneity. In this case, the number of items and item interrelationships form the basis for an estimate of score reliability, but such an index would be inappropriate for tests with a more complex internal structure.
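For illustration only (this example is not part of the Standards' text), the homogeneity-based reliability estimate mentioned above can be sketched as coefficient alpha computed from an item-score matrix. The response data and the function name cronbach_alpha are hypothetical.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a (test takers x items) score matrix.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score)).
    As the text notes, this index presumes an essentially unidimensional
    test and is inappropriate for more complex internal structures.
    """
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()   # summed item variances
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of total scores
    return k / (k - 1) * (1 - item_var / total_var)

# Hypothetical responses: 6 test takers x 4 dichotomously scored items.
responses = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
print(f"coefficient alpha = {cronbach_alpha(responses):.2f}")
```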
Some studies of the internal structure of tests are designed to show whether particular items may function differently for identifiable subgroups of test takers (e.g., racial/ethnic or gender subgroups). Differential item functioning occurs when different groups of test takers with similar overall ability, or similar status on an appropriate criterion, have, on average, systematically different responses to a particular item. This issue is discussed in chapter 3. However, differential item functioning is not always a flaw or weakness. Subsets of items that have a specific characteristic in common (e.g., specific content, task representation) may function differently for different groups of similarly scoring test takers. This indicates a kind of multidimensionality that may be unexpected or may conform to the test framework.
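As a purely illustrative sketch of how such item-level screening might be carried out (one common approach, not a method mandated by the Standards), the Mantel-Haenszel procedure matches test takers on total score and compares the odds of success on the studied item across groups within each score stratum. All data and names below are hypothetical.

```python
import numpy as np

def mantel_haenszel_or(item, total, group):
    """Mantel-Haenszel common odds ratio for one dichotomous item.

    item  : array of 0/1 responses to the studied item
    total : array of total test scores (the matching variable)
    group : array with 0 = reference group, 1 = focal group
    A pooled ratio near 1.0 suggests little differential functioning.
    """
    num = den = 0.0
    for stratum in np.unique(total):
        s = total == stratum
        a = np.sum(s & (group == 0) & (item == 1))  # reference, correct
        b = np.sum(s & (group == 0) & (item == 0))  # reference, incorrect
        c = np.sum(s & (group == 1) & (item == 1))  # focal, correct
        d = np.sum(s & (group == 1) & (item == 0))  # focal, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    return num / den

# Hypothetical data: 12 test takers grouped into three score strata.
item  = np.array([1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0])
total = np.array([5, 5, 5, 5, 3, 3, 3, 3, 1, 1, 1, 1])
group = np.array([0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1])
print(f"MH common odds ratio = {mantel_haenszel_or(item, total, group):.2f}")
```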
Evidence Based on Relations to Other Variables

In many cases, the intended interpretation for a given use implies that the construct should be related to some other variables, and, as a result, analyses of the relationship of test scores to variables external to the test provide another important source of validity evidence. External variables may include measures of some criteria that the test is expected to predict, as well as relationships to other tests hypothesized to measure the same constructs, and tests measuring related or different constructs. Measures other than test scores, such as performance criteria, are often used in employment settings. Categorical variables, including group membership variables, become relevant when the theory underlying a proposed test use suggests that group differences should be present or absent if a proposed test score interpretation is to be supported. Evidence based on relationships with other variables provides evidence about the degree to which these relationships are consistent with the construct underlying the proposed test score interpretations.

Convergent and discriminant evidence. Relationships between test scores and other measures intended to assess the same or similar constructs provide convergent evidence, whereas relationships between test scores and measures purportedly of different constructs provide discriminant evidence. For instance, within some theoretical frameworks, scores on a multiple-choice test of reading comprehension might be expected to relate closely (convergent evidence) to other measures of reading comprehension based on other methods, such as essay responses. Conversely, test scores might be expected to relate less closely (discriminant evidence) to measures of other skills, such as logical reasoning. Relationships among different methods of measuring the construct can be especially helpful in sharpening and elaborating score meaning and interpretation.
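The following sketch, offered only as an illustration with invented data, shows the pattern sought in convergent and discriminant evidence: a higher correlation between two methods of measuring the same construct than between measures of different constructs.

```python
import numpy as np

# Hypothetical scores for 8 test takers on three measures.
mc_reading    = np.array([12, 15, 9, 18, 14, 7, 16, 11])   # multiple-choice reading
essay_reading = np.array([11, 16, 10, 17, 13, 8, 15, 12])  # essay-based reading
logic         = np.array([14, 9, 13, 12, 8, 15, 10, 11])   # logical reasoning

# Convergent evidence: same construct, different method -- expect a high r.
r_convergent = np.corrcoef(mc_reading, essay_reading)[0, 1]
# Discriminant evidence: different construct -- expect a noticeably lower r.
r_discriminant = np.corrcoef(mc_reading, logic)[0, 1]

print(f"convergent r (two reading measures)    = {r_convergent:.2f}")
print(f"discriminant r (reading vs. reasoning) = {r_discriminant:.2f}")
```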
Evidence of relations with other variables can involve experimental as well as correlational evidence. Studies might be designed, for instance, to investigate whether scores on a measure of anxiety improve as a result of some psychological treatment or whether scores on a test of academic achievement differentiate between instructed and noninstructed groups. If performance increases due to short-term coaching are viewed as a threat to validity, it would be useful to investigate whether coached and uncoached groups perform differently.

Test-criterion relationships. Evidence of the relation of test scores to a relevant criterion may be expressed in various ways, but the fundamental question is always, how accurately do test scores predict criterion performance? The degree of accuracy and the score range within which accuracy is needed depend on the purpose for which the test is used.

The criterion variable is a measure of some attribute or outcome that is operationally distinct from the test. Thus, the test is not a measure of a criterion, but rather is a measure hypothesized as a potential predictor of that targeted criterion. Whether a test predicts a given criterion in a given context is a testable hypothesis. The criteria that are of interest are determined by test users, for example, administrators in a school system or managers of a firm. The choice of the criterion and the measurement procedures used to obtain criterion scores are of central importance. The credibility of a test-criterion study depends on the relevance, reliability, and validity of the interpretation based on the criterion measure for a given testing application.

Historically, two designs, often called predictive and concurrent, have been distinguished for evaluating test-criterion relationships. A predictive study indicates the strength of the relationship between test scores and criterion scores that are obtained at a later time. A concurrent study obtains test scores and criterion information at about the same time. When prediction is actually contemplated, as in academic admission or employment settings, or in planning rehabilitation regimens, predictive studies can retain the temporal differences and other characteristics of the practical situation. Concurrent evidence, which avoids temporal changes, is particularly useful for psychodiagnostic tests or in investigating alternative measures of some specified construct for which an accepted measurement procedure already exists. The choice of a predictive or concurrent research strategy in a given domain is also usefully informed by prior research evidence regarding the extent to which predictive and concurrent studies in that domain yield the same or different results.
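As a minimal illustration of the statistic at the center of such designs (hypothetical data; not a prescribed procedure), a predictive study's basic summary is the correlation between test scores at selection and criterion scores obtained later. A real study would also attend to the criterion's relevance and reliability and to range restriction in the selected sample.

```python
import numpy as np

# Hypothetical predictive study: selection test scores for 10 hires,
# paired with supervisor performance ratings collected a year later.
test_scores       = np.array([52, 61, 47, 70, 58, 66, 43, 75, 55, 63])
later_performance = np.array([3.1, 3.8, 2.9, 4.2, 3.3, 4.0, 2.7, 4.5, 3.0, 3.9])

validity_coefficient = np.corrcoef(test_scores, later_performance)[0, 1]
print(f"test-criterion correlation (predictive design) = {validity_coefficient:.2f}")
```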
Test scores are sometimes used in allocating individuals to different treatments in a way that is advantageous for the institution and/or for the individuals. Examples would include assigning individuals to different jobs within an organization, or determining whether to place a given student in a remedial class or a regular class. In that context, evidence is needed to judge the suitability of using a test when classifying or assigning a person to one job versus another or to one treatment versus another. Support for the validity of the classification procedure is provided by showing that the test is useful in determining which persons are likely to profit differentially from one treatment or another. It is possible for tests to be highly predictive of performance for different education programs or jobs without providing the information necessary to make a comparative judgment of the efficacy of assignments or treatments. In general, decision rules for selection or placement are also influenced by the number of persons to be accepted or the numbers that can be accommodated in alternative placement categories (see chap. 11).

Evidence about relations to other variables is also used to investigate questions of differential prediction for subgroups. For instance, a finding that the relation of test scores to a relevant criterion variable differs from one subgroup to another may imply that the meaning of the scores is not the same for members of the different groups, perhaps due to construct underrepresentation or construct-irrelevant sources of variance. However, the difference may also imply that the criterion has different meaning for different groups. The differences in test-criterion relationships can also arise from measurement error, especially when group means differ, so such differences do not necessarily indicate differences in score meaning. See the discussion of fairness in chapter 3 for more extended consideration of possible courses of action when scores have different meanings for different groups.
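One common way to examine differential prediction, sketched below with hypothetical data, is to regress the criterion on test score, group membership, and their interaction: a nonzero interaction term suggests different slopes across groups, and a nonzero group term (with a common slope) suggests different intercepts. This is an illustration only; as the text notes, apparent differences can also reflect measurement error or differences in criterion meaning.

```python
import numpy as np

# Hypothetical data: test scores, group labels (0/1), and criterion scores.
test  = np.array([50., 55, 60, 65, 70, 75, 50, 55, 60, 65, 70, 75])
group = np.array([0.,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1])
crit  = np.array([2.8, 3.0, 3.3, 3.5, 3.9, 4.1,
                  2.5, 2.8, 3.0, 3.2, 3.5, 3.8])

# Design matrix: intercept, test, group, and test x group interaction.
X = np.column_stack([np.ones_like(test), test, group, test * group])
coef, *_ = np.linalg.lstsq(X, crit, rcond=None)
b0, b_test, b_group, b_inter = coef
print(f"common slope: {b_test:.3f}, intercept difference: {b_group:.3f}, "
      f"slope difference: {b_inter:.3f}")
```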
Validity generalization. An important issue in educational and employment settings is the degree to which validity evidence based on test-criterion relations can be generalized to a new situation without further study of validity in that new situation. When a test is used to predict the same or similar criteria (e.g., performance of a given job) at different times or in different places, it is typically found that observed test-criterion correlations vary substantially. In the past, this has been taken to imply that local validation studies are always required. More recently, a variety of approaches to generalizing evidence from other settings has been developed, with meta-analysis the most widely used in the published literature. In particular, meta-analyses have shown that in some domains, much of this variability may be due to statistical artifacts such as sampling fluctuations and variations across validation studies in the ranges of test scores and in the reliability of criterion measures. When these and other influences are taken into account, it may be found that the remaining variability in validity coefficients is relatively small. Thus, statistical summaries of past validation studies in similar situations may be useful in estimating test-criterion relationships in a new situation. This practice is referred to as the study of validity generalization.
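The arithmetic underlying such summaries can be sketched as follows: the observed variance of validity coefficients across studies is compared with the variance expected from sampling error alone, and only the residual is attributed to true situational variation. This bare-bones illustration uses hypothetical study results and ignores the corrections (e.g., for range restriction and criterion unreliability) that a full meta-analysis would apply.

```python
import numpy as np

# Hypothetical prior studies: observed validity coefficients and sample sizes.
r = np.array([0.25, 0.31, 0.18, 0.35, 0.22, 0.28])
n = np.array([120, 85, 60, 200, 95, 150])

r_bar = np.average(r, weights=n)                        # weighted mean validity
var_observed = np.average((r - r_bar) ** 2, weights=n)  # observed variance of r
# Expected sampling-error variance of r, per study: (1 - r_bar^2)^2 / (n - 1)
var_sampling = np.average((1 - r_bar**2) ** 2 / (n - 1), weights=n)
var_residual = max(var_observed - var_sampling, 0.0)

print(f"weighted mean r = {r_bar:.3f}")
print(f"observed variance = {var_observed:.5f}, "
      f"sampling-error variance = {var_sampling:.5f}")
print(f"residual (situational) variance = {var_residual:.5f}")
```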
In some circumstances, there is a strong basis for using validity generalization. This would be the case where the meta-analytic database is large, where the meta-analytic data adequately represent the type of situation to which one wishes to generalize, and where correction for statistical artifacts produces a clear and consistent pattern of validity evidence. In such circumstances, the informational value of a local validity study may be relatively limited, if not actually misleading, especially if its sample size is small. In other circumstances, the inferential leap required for generalization may be much larger. The meta-analytic database may be small, the findings may be less consistent, or the new situation may involve features markedly different from those represented in the meta-analytic database. In such circumstances, situation-specific validity evidence will be relatively more informative. Although research on validity generalization shows that results of a single local validation study may be quite imprecise, there are situations where a single study, carefully done, with adequate sample size, provides sufficient evidence to support or reject test use in a new situation. This highlights the importance of examining carefully the comparative informational value of local versus meta-analytic studies.

In conducting studies of the generalizability of validity evidence, the prior studies that are included may vary according to several situational facets. Some of the major facets are (a) differences in the way the predictor construct is measured, (b) the type of job or curriculum involved, (c) the type of criterion measure used, (d) the type of test takers, and (e) the time period in which the study was conducted. In any particular study of validity generalization, any number of these facets might vary, and a major objective of the study is to determine empirically the extent to which variation in these facets affects the test-criterion correlations obtained.

The extent to which predictive or concurrent validity evidence can be generalized to new situations is in large measure a function of accumulated research. Although evidence of generalization can often help to support a claim of validity in a new situation, the extent of available data limits the degree to which the claim can be sustained.

The above discussion focuses on the use of cumulative databases to estimate predictor-criterion relationships. Meta-analytic techniques can also be used to summarize other forms of data relevant to other inferences one may wish to draw from test scores in a particular application, such as effects of coaching and effects of certain alterations in testing conditions for test takers with specified disabilities. Gathering evidence about how well validity findings can be generalized across groups of test takers is an important part of the validation process. When the evidence suggests that inferences from test scores can be drawn for some subgroups but not for others, pursuing options such as those discussed in chapter 3 can reduce the risk of unfair test use.

Evidence for Validity and Consequences of Testing

Some consequences of test use follow directly from the interpretation of test scores for uses intended by the test developer. The validation process involves gathering evidence to evaluate the soundness of these proposed interpretations for their intended uses.

Other consequences may also be part of a claim that extends beyond the interpretation or use of scores intended by the test developer. For example, a test of student achievement might provide data for a system intended to identify and improve lower-performing schools. The claim that testing results, used this way, will result in improved student learning may rest on propositions about the system or intervention itself, beyond propositions based on the meaning of the test itself. Consequences may point to the need for evidence about components of the system that will go beyond the interpretation of test scores as a valid measure of student achievement.

Still other consequences are unintended, and are often negative. For example, school district or statewide educational testing on selected subjects may lead teachers to focus on those subjects at the expense of others. As another example, a test developed to measure knowledge needed for a given job may result in lower passing rates for one group than for another. Unintended consequences merit close examination. While not all consequences can be anticipated, in some cases factors such as prior experiences in other settings offer a basis for anticipating and proactively addressing unintended consequences. See chapter 12 for additional examples from educational settings. In some cases, actions to address one consequence bring about other consequences. One example involves the notion of "missed opportunities," as in the case of moving to computerized scoring of student essays to increase grading consistency, thus forgoing the educational benefits of addressing the same problem by training teachers to grade more consistently. These types of consideration of consequences of testing are discussed further below.

Interpretation and uses of test scores intended by test developers. Tests are commonly administered in the expectation that some benefit will be realized from the interpretation and use of the scores intended by the test developers. A few of the many possible benefits that might be claimed are selection of efficacious therapies, placement of workers in suitable jobs, prevention of unqualified individuals from entering a profession, or improvement of classroom instructional practices. A fundamental purpose of validation is to indicate whether these specific benefits are likely to be realized. Thus, in the case of a test used in placement decisions, the validation would be informed by evidence that alternative placements, in fact, are differentially beneficial to the persons and the institution. In the case of employment testing, if a test publisher asserts that use of the test will result in reduced employee training costs, improved workforce efficiency, or some other benefit, then the validation would be informed by evidence in support of that proposition.

It is important to note that the validity of test score interpretations depends not only on the uses of the test scores but specifically on the claims that underlie the theory of action for these uses. For example, consider a school district that wants to determine children's readiness for kindergarten, and so administers a test battery and screens out students with low scores. If higher scores do, in fact, predict higher performance on key kindergarten tasks, the claim that use of the test scores for screening results in higher performance on these key tasks is supported, and the interpretation of the test scores as a predictor of kindergarten readiness would be valid. If, however, the claim were made that use of the test scores for screening would result in the greatest benefit to students, the interpretation of test scores as indicators of readiness for kindergarten might not be valid, because students with low scores might actually benefit more from access to kindergarten. In this case, different evidence is needed to support different claims that might be made about the same use of the screening test (for example, evidence that students below a certain cut score benefit more from another assignment than from assignment to kindergarten). The test developer is responsible for the validation of the interpretation that the test scores assess the indicated readiness skills. The school district is responsible for the validation of the proper interpretation of the readiness test scores and for evaluation of the policy of using the readiness test for placement/admissions decisions.

Claims made about test use that are not directly based on test score interpretations. Claims are sometimes made for benefits of testing that go beyond the direct interpretations or uses of the test scores themselves that are specified by the test developers. Educational tests, for example, may be advocated on the grounds that their use will improve student motivation to learn or encourage changes in classroom instructional practices by holding educators accountable for valued learning outcomes. Where such claims are central to the rationale advanced for testing, the direct examination of testing consequences necessarily assumes even greater importance. Those making the claims are responsible for evaluation of the claims. In some cases, such information can be drawn from existing data collected for purposes other than test validation; in other cases new information will be needed to address the impact of the testing program.

Consequences that are unintended. Test score interpretation for a given use may result in unintended consequences. A key distinction is between consequences that result from a source of error in the intended test score interpretation for a given use and consequences that do not result from error in test score interpretation. Examples of each are given below.

As discussed at some length in chapter 3, one domain in which unintended negative consequences of test use are at times observed involves test score differences for groups defined in terms of race/ethnicity, gender, age, and other characteristics. In such cases, however, it is important to distinguish between evidence that is directly relevant to validity and evidence that may inform decisions about social policy but falls outside the realm of validity. For example, concerns have been raised about the effect of group differences in test scores on employment selection and promotion, the placement of children in special education classes, and the narrowing of a school's curriculum to exclude learning objectives that are not assessed. Although information about the consequences of testing may influence decisions about test use, such consequences do not, in and of themselves, detract from the validity of intended interpretations of the test scores. Rather, judgments of validity or invalidity in the light of testing consequences depend on a more searching inquiry into the sources of those consequences.

Take, as an example, a finding of different hiring rates for members of different groups as a consequence of using an employment test. If the difference is due solely to an unequal distribution of the skills the test purports to measure, and if those skills are, in fact, important contributors to job performance, then the finding of group differences per se does not imply any lack of validity for the intended interpretation. If, however, the test measured skill differences unrelated to job performance (e.g., a sophisticated reading test for a job that required only minimal functional literacy), or if the differences were due to the test's sensitivity to some test-taker characteristic not intended to be part of the test construct, then the intended interpretation of test scores as predicting job performance in a comparable manner for all groups of applicants would be rendered invalid, even if test scores correlated positively with some measure of job performance. If a test covers most of the relevant content domain but omits some areas, the content coverage might be judged adequate for some purposes. However, if it is found that excluding some components that could readily be assessed has a noticeable impact on selection rates for groups of interest (e.g., subgroup differences are found to be smaller on excluded components than on included components), the intended interpretation of test scores as predicting job performance in a comparable manner for all groups of applicants would be rendered invalid. Thus, evidence about consequences is relevant to validity when it can be traced to a source of invalidity such as construct underrepresentation or construct-irrelevant components. Evidence about consequences that cannot be so traced is not relevant to the validity of the intended interpretations of the test scores.

As another example, consider the case where research supports an employer's use of a particular test in the personality domain (i.e., the test proves to be predictive of an aspect of subsequent job performance), but it is found that some applicants form a negative opinion of the organization due to the perception that the test invades personal privacy. Thus, there is an unintended negative consequence of test use, but one that is not due to a flaw in the intended interpretation of test scores as predicting subsequent performance. Some employers faced with this situation may conclude that this negative consequence is grounds for discontinuing test use; others may conclude that the benefits gained by screening applicants outweigh this negative consequence. As this example illustrates, a consideration of consequences can influence a decision about test use, even though the consequence is independent of the validity of the intended test score interpretation. The example also illustrates that different decision makers may make different value judgments about the impact of consequences on test use.

The fact that the validity evidence supports the intended interpretation of test scores for use in applicant screening does not mean that test use is thus required: issues other than validity, including legal constraints, can play an important and, in some cases, a determinative role in decisions about test use. Legal constraints may also limit an employer's discretion to discard test scores from tests that have already been administered, when that decision is based on differences in scores for subgroups of different races, ethnicities, or genders.

Note that unintended consequences can also be positive. Reversing the above example of test takers who form a negative impression of an organization based on the use of a particular test, a different test may be viewed favorably by applicants, leading to a positive impression of the organization. A given test use may result in multiple consequences, some positive and some negative.

In short, decisions about test use are appropriately informed by validity evidence about intended test score interpretations for a given use, by evidence evaluating additional claims about consequences of test use that do not follow directly from test score interpretations, and by value judgments about unintended positive and negative consequences of test use.

Integrating the Validity Evidence

A sound validity argument integrates various strands of evidence into a coherent account of the degree to which existing evidence and theory support the intended interpretation of test scores for specific uses. It encompasses evidence gathered from new studies and evidence available from earlier reported research. The validity argument may indicate the need for refining the definition of the construct, may suggest revisions in the test or other aspects of the testing process, and may indicate areas needing further study.
It is commonly observed that the validation process never ends, as there is always additional information that can be gathered to more fully understand a test and the inferences that can be drawn from it. In this way an inference of validity is similar to any scientific inference. However, a test interpretation for a given use rests on evidence for a set of propositions making up the validity argument, and at some point validation evidence allows for a summary judgment of the intended interpretation that is well supported and defensible. At some point the effort to provide sufficient validity evidence to support a given test interpretation for a specific use does end (at least provisionally, pending the emergence of a strong basis for questioning that judgment). Legal requirements may necessitate that the validation study be updated in light of such factors as changes in the test population or newly developed alternative testing methods.

The amount and character of evidence required to support a provisional judgment of validity often vary between areas and also within an area as research on a topic advances. For example, prevailing standards of evidence may vary with the stakes involved in the use or interpretation of the test scores. Higher stakes may entail higher standards of evidence. As another example, in areas where data collection comes at a greater cost, one may find it necessary to base interpretations on fewer data than in areas where data collection comes with less cost.

Ultimately, the validity of an intended interpretation of test scores relies on all the available evidence relevant to the technical quality of a testing system. Different components of validity evidence are described in subsequent chapters of the Standards, and include evidence of careful test construction; adequate score reliability; appropriate test administration and scoring; accurate score scaling, equating, and standard setting; and careful attention to fairness for all test takers, as appropriate to the test interpretation in question.

STANDARDS FOR VALIDITY

The standards in this chapter begin with an overarching standard (numbered 1.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into three thematic clusters labeled as follows:

1. Establishing Intended Uses and Interpretations
2. Issues Regarding Samples and Settings Used in Validation
3. Specific Forms of Validity Evidence

Standard 1.0

Clear articulation of each intended test score interpretation for a specified use should be set forth, and appropriate validity evidence in support of each intended interpretation should be provided.

Cluster 1. Establishing Intended Uses and Interpretations

Standard 1.1

The test developer should set forth clearly how test scores are intended to be interpreted and consequently used. The population(s) for which a test is intended should be delimited clearly, and the construct or constructs that the test is intended to assess should be described clearly.

Comment: Statements about validity should refer to particular interpretations and consequent uses. It is incorrect to use the unqualified phrase "the validity of the test." No test permits interpretations that are valid for all purposes or in all situations. Each recommended interpretation for a given use requires validation. The test developer should specify in clear language the population for which the test is intended, the construct it is intended to measure, the contexts in which test scores are to be employed, and the processes by which the test is to be administered and scored.

Standard 1.2

A rationale should be presented for each intended interpretation of test scores for a given use, together with a summary of the evidence and theory bearing on the intended interpretation.

Comment: The rationale should indicate what propositions are necessary to investigate the intended interpretation. The summary should combine logical analysis with empirical evidence to provide support for the test rationale. Evidence may come from studies conducted locally, in the setting where the test is to be used; from specific prior studies; or from comprehensive statistical syntheses of available studies meeting clearly specified study quality criteria. No type of evidence is inherently preferable to others; rather, the quality and relevance of the evidence to the intended test score interpretation for a given use determine the value of a particular kind of evidence. A presentation of empirical evidence on any point should give due weight to all relevant findings in the scientific literature, including those inconsistent with the intended interpretation or use. Test developers have the responsibility to provide support for their own recommendations, but test users bear ultimate responsibility for evaluating the quality of the validity evidence provided and its relevance to the local situation.

Standard 1.3

If validity for some common or likely interpretation for a given use has not been evaluated, or if such an interpretation is inconsistent with available evidence, that fact should be made clear, and potential users should be strongly cautioned about making unsupported interpretations.

Comment: If past experience suggests that a test is likely to be used inappropriately for certain kinds of decisions or certain kinds of test takers, specific warnings against such uses should be given. Professional judgment is required to evaluate the extent to which existing validity evidence supports a given test use.
VALIDITY

Comment: If past experience suggests that a test is likely to be used inappropriately for certain kinds of decisions or certain kinds of test takers, specific warnings against such uses should be given. Professional judgment is required to evaluate the extent to which existing validity evidence supports a given test use.

Standard 1.4

If a test score is interpreted for a given use in a way that has not been validated, it is incumbent on the user to justify the new interpretation for that use, providing a rationale and collecting new evidence, if necessary.

Comment: Professional judgment is required to evaluate the extent to which existing validity evidence applies in the new situation or to the new group of test takers and to determine what new evidence may be needed. The amount and kinds of new evidence required may be influenced by experience with similar prior test uses or interpretations and by the amount, quality, and relevance of existing data.

A test that has been altered or administered in ways that change the construct underlying the test for use with subgroups of the population requires evidence of the validity of the interpretation made on the basis of the modified test (see chap. 3). For example, if a test is adapted for use with individuals with a particular disability in a way that changes the underlying construct, the modified test should have its own evidence of validity for the intended interpretation.
Standard 1.5

When it is clearly stated or implied that a recommended test score interpretation for a given use will result in a specific outcome, the basis for expecting that outcome should be presented, together with relevant evidence.
Comment: If it is asserted, for example, that interpreting and using scores on a given test for employee selection will result in reduced employee errors or training costs, evidence in support of that assertion should be provided. A given claim may be supported by logical or theoretical argument as well as empirical data. Appropriate weight should be given to findings in the scientific literature that may be inconsistent with the stated expectation.

Standard 1.6

When a test use is recommended on the grounds that testing or the testing program itself will result in some indirect benefit, in addition to the utility of information from interpretation of the test scores themselves, the recommender should make explicit the rationale for anticipating the indirect benefit. Logical or theoretical arguments and empirical evidence for the indirect benefit should be provided. Appropriate weight should be given to any contradictory findings in the scientific literature, including findings suggesting important indirect outcomes other than those predicted.

Comment: For example, certain educational testing programs have been advocated on the grounds that they would have a salutary influence on classroom instructional practices or would clarify students' understanding of the kind or level of achievement they were expected to attain. To the extent that such claims enter into the justification for a testing program, they become part of the argument for test use. Evidence for such claims should be examined, in conjunction with evidence about the validity of intended test score interpretation and evidence about unintended negative consequences of test use, in making an overall decision about test use. Due weight should be given to evidence against such predictions, for example, evidence that under some conditions educational testing may have a negative effect on classroom instruction.

Standard 1.7

If test performance, or a decision made therefrom, is claimed to be essentially unaffected by practice and coaching, then the propensity for test performance to change with these forms of instruction should be documented.

Comment: Materials to aid in score interpretation should summarize evidence indicating the degree to which improvement with practice or coaching can be expected. Also, materials written for test takers should provide practical guidance about the value of test preparation activities, including coaching.

Cluster 2. Issues Regarding Samples and Settings Used in Validation

Standard 1.8

The composition of any sample of test takers from which validity evidence is obtained should be described in as much detail as is practical and permissible, including major relevant sociodemographic and developmental characteristics.

Comment: Statistical findings can be influenced by factors affecting the sample on which the results are based. When the sample is intended to represent a population, that population should be described, and attention should be drawn to any systematic factors that may limit the representativeness of the sample. Factors that might reasonably be expected to affect the results include self-selection, attrition, linguistic ability, disability status, and exclusion criteria, among others. If the participants in a validity study are patients, for example, then the diagnoses of the patients are important, as well as other characteristics, such as the severity of the diagnosed conditions. For tests used in employment settings, the employment status (e.g., applicants versus current job holders), the general level of experience and educational background, and the gender and ethnic composition of the sample may be relevant information. For tests used in credentialing, the status of those providing information (e.g., candidates for a credential versus already-credentialed individuals) is important for interpreting the resulting data. For tests used in educational settings, relevant information may include educational background, developmental level, community characteristics, or school admissions policies, as well as the gender and ethnic composition of the sample. Sometimes legal restrictions about privacy preclude obtaining or disclosing such population information or limit the level of particularity at which such data may be disclosed. The specific privacy laws, if any, governing the type of data should be considered, in order to ensure that any description of a population does not have the potential to identify an individual in a manner inconsistent with such standards. The extent of missing data, if any, and the methods for handling missing data (e.g., use of imputation procedures) should be described.

Standard 1.9

When a validation rests in part on the opinions or decisions of expert judges, observers, or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The qualifications and experience of the judges should be presented. The description of procedures should include any training and instructions provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth.

Comment: Systematic collection of judgments or opinions may occur at many points in test construction (e.g., eliciting expert judgments of content appropriateness or adequate content representation), in the formulation of rules or standards for score interpretation (e.g., in setting cut scores), or in test scoring (e.g., rating of essay responses). Whenever such procedures are employed, the quality of the resulting judgments is important to the validation. Level of agreement should be specified clearly (e.g., whether percent agreement refers to agreement prior to or after a consensus discussion, and whether the criterion for agreement is exact agreement of ratings or agreement within a certain number of scale points). The basis for specifying certain types of individuals (e.g., experienced teachers, experienced job incumbents, supervisors) as appropriate experts for the judgment or rating task should be articulated. It may be entirely appropriate to have experts work together to reach consensus, but it would not then be appropriate to treat their respective judgments as statistically independent. Different judges may be used for different purposes (e.g., one set may rate items for cultural sensitivity while another may rate for reading level) or for different portions of a test.
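For concreteness, a percent-agreement criterion of the kind described above can be written out explicitly. The notation below is illustrative only and is not part of the Standards: with two raters assigning ratings $r_{1i}$ and $r_{2i}$ to each of $N$ responses, agreement within $t$ scale points is

\[
P_t \;=\; \frac{100}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\,\lvert r_{1i}-r_{2i}\rvert \le t\,\right],
\]

so that $t=0$ yields exact agreement and $t=1$ agreement within one scale point. Because percent agreement does not correct for chance, it is often reported alongside a chance-corrected index such as Cohen's kappa.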

Standard 1.10

When validity evidence includes statistical analyses of test results, either alone or together with data on other variables, the conditions under which the data were collected should be described in enough detail that users can judge the relevance of the statistical findings to local conditions. Attention should be drawn to any features of a validation data collection that are likely to differ from typical operational testing conditions and that could plausibly influence test performance.

Comment: Such conditions might include (but would not be limited to) the following: test-taker motivation or prior preparation, the range of test scores over test takers, the time allowed for test takers to respond or other administrative conditions, the mode of test administration (e.g., unproctored online testing versus proctored on-site testing), examiner training or other examiner characteristics, the time intervals separating collection of data on different measures, or conditions that may have changed since the validity evidence was obtained.

Cluster 3. Specific Forms of Validity Evidence

(a) Content-Oriented Evidence

Standard 1.11

When the rationale for test score interpretation for a given use rests in part on the appropriateness of test content, the procedures followed in specifying and generating test content should be described and justified with reference to the intended population to be tested and the construct the test is intended to measure or the domain it is intended to represent. If the definition of the content sampled incorporates criteria such as importance, frequency, or criticality, these criteria should also be clearly explained and justified.

Comment: For example, test developers might provide a logical structure that maps the items on the test to the content domain, illustrating the relevance of each item and the adequacy with which the set of items represents the content domain. Areas of the content domain that are not included among the test items could be indicated as well. The match of test content to the targeted domain in terms of cognitive complexity and the accessibility of the test content to all members of the intended population are also important considerations.

(b) Evidence Regarding Cognitive Processes

Standard 1.12

If the rationale for score interpretation for a given use depends on premises about the psychological processes or cognitive operations of test takers, then theoretical or empirical evidence in support of those premises should be provided. When statements about the processes employed by observers or scorers are part of the argument for validity, similar information should be provided.

Comment: If the test specification delineates the processes to be assessed, then evidence is needed that the test items do, in fact, tap the intended processes.

(c) Evidence Regarding Internal Structure

Standard 1.13

If the rationale for a test score interpretation for a given use depends on premises about the relationships among test items or among parts of the test, evidence concerning the internal structure of the test should be provided.

Comment: It might be claimed, for example, that a test is essentially unidimensional. Such a claim could be supported by a multivariate statistical analysis, such as a factor analysis, showing that the score variability attributable to one major dimension was much greater than the score variability attributable to any other identified dimension, or showing that a single factor adequately accounts for the covariation among test items. When a test provides more than one score, the interrelationships of those scores should be shown to be consistent with the construct(s) being assessed.
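One common way to examine such a unidimensionality claim (offered as an illustrative sketch, not a procedure prescribed by the Standards) is to inspect the eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots$ of the inter-item correlation matrix:

\[
\frac{\lambda_1}{\lambda_2} \gg 1
\quad\text{and}\quad
\frac{\lambda_1}{\sum_{k}\lambda_k}\ \text{large}
\]

are consistent with a single dominant dimension accounting for most common variance; comparing the fit of one-factor and multifactor models in a confirmatory factor analysis serves the same purpose.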


Standard 1.14

When interpretation of subscores, score differences, or profiles is suggested, the rationale and relevant evidence in support of such interpretation should be provided. Where composite scores are developed, the basis and rationale for arriving at the composites should be given.

Comment: When a test provides more than one score, the distinctiveness and reliability of the separate scores should be demonstrated, and the interrelationships of those scores should be shown to be consistent with the construct(s) being assessed. Moreover, evidence for the validity of interpretations of two or more separate scores would not necessarily justify a statistical or substantive interpretation of the difference between them. Rather, the rationale and supporting evidence must pertain directly to the specific score, score combination, or score pattern to be interpreted for a given use. When subscores from one test or scores from different tests are combined into a composite, the basis for combining scores and for how scores are combined (e.g., differential weighting versus simple summation) should be specified.

Standard 1.15

When interpretation of performance on specific items, or small subsets of items, is suggested, the rationale and relevant evidence in support of such interpretation should be provided. When interpretation of individual item responses is likely but is not recommended by the developer, the user should be warned against making such interpretations.

Comment: Users should be given sufficient guidance to enable them to judge the degree of confidence warranted for any interpretation for a use recommended by the test developer. Test manuals and score reports should discourage overinterpretation of information that may be subject to considerable error. This is especially important if interpretation of performance on isolated items, small subsets of items, or subtest scores is suggested.

(d) Evidence Regarding Relationships With Conceptually Related Constructs

Standard 1.16

When validity evidence includes empirical analyses of responses to test items together with data on other variables, the rationale for selecting the additional variables should be provided. Where appropriate and feasible, evidence concerning the constructs represented by other variables, as well as their technical properties, should be presented or cited. Attention should be drawn to any likely sources of dependence (or lack of independence) among variables other than dependencies among the construct(s) they represent.

Comment: The patterns of association between and among scores on the test under study and other variables should be consistent with theoretical expectations. The additional variables might be demographic characteristics, indicators of treatment conditions, or scores on other measures. They might include intended measures of the same construct or of different constructs. The reliability of scores from such other measures and the validity of intended interpretations of scores from these measures are an important part of the validity evidence for the test under study. If such variables include composite scores, the manner in which the composites were constructed should be explained (e.g., transformation or standardization of the variables, and weighting of the variables). In addition to considering the properties of each variable in isolation, it is important to guard against faulty interpretations arising from spurious sources of dependency among measures, including correlated errors or shared variance due to common methods of measurement or common elements.

(e) Evidence Regarding Relationships With Criteria

Standard 1.17

When validation relies on evidence that test scores are related to one or more criterion variables, information about the suitability and technical quality of the criteria should be reported.

Comment: The description of each criterion variable should include evidence concerning its reliability, the extent to which it represents the intended construct (e.g., task performance on the job), and the extent to which it is likely to be influenced by extraneous sources of variance. Special attention should be given to sources that previous research suggests may introduce extraneous variance that might bias the criterion for or against identifiable groups.

Standard 1.18

When it is asserted that a certain level of test performance predicts adequate or inadequate criterion performance, information about the levels of criterion performance associated with given levels of test scores should be provided.

Comment: For purposes of linking specific test scores with specific levels of criterion performance, regression equations are more useful than correlation coefficients, which are generally insufficient to fully describe patterns of association between tests and other variables. Means, standard deviations, and other statistical summaries are needed, as well as information about the distribution of criterion performances conditional upon a given test score. In the case of categorical rather than continuous variables, techniques appropriate to such data should be used (e.g., the use of logistic regression in the case of a dichotomous criterion). Evidence about the overall association between variables should be supplemented by information about the form of that association and about the variability of that association in different ranges of test scores. Note that data collections employing test takers selected for their extreme scores on one or more measures (extreme groups) typically cannot provide adequate information about the association.
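As a sketch of the contrast the comment draws (the symbols are generic, not taken from the Standards): for a continuous criterion $Y$ and test score $X$, the simple linear regression

\[
\hat{Y} \;=\; \bar{Y} + r_{XY}\,\frac{s_Y}{s_X}\,\bigl(X-\bar{X}\bigr)
\]

reports the expected criterion performance at each test score, which a correlation coefficient alone does not. For a dichotomous criterion (e.g., pass/fail), a logistic regression models the conditional probability

\[
P(Y=1 \mid X=x) \;=\; \frac{1}{1+e^{-(\beta_0+\beta_1 x)}}.
\]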
Standard 1.19

If test scores are used in conjunction with other variables to predict some outcome or criterion, analyses based on statistical models of the predictor-criterion relationship should include those additional relevant variables along with the test scores.

Comment: In general, if several predictors of some criterion are available, the optimum combination of predictors cannot be determined solely from separate, pairwise examinations of the criterion variable with each separate predictor in turn, due to intercorrelation among predictors. It is often informative to estimate the increment in predictive accuracy that may be expected when each variable, including the test score, is introduced in addition to all other available variables. As empirically derived weights for combining predictors can capitalize on chance factors in a given sample, analyses involving multiple predictors should be verified by cross-validation or equivalent analysis whenever feasible, and the precision of estimated regression coefficients or other indices should be reported. Cross-validation procedures include formula estimates of validity in subsequent samples and empirical approaches such as deriving weights in one portion of a sample and applying them to an independent subsample.
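A familiar formula estimate of the kind mentioned in the comment (given only as an illustration; the Standards do not prescribe a particular correction) is the adjusted squared multiple correlation for $k$ predictors and $n$ cases,

\[
\hat{R}^2_{\mathrm{adj}} \;=\; 1-\bigl(1-R^2\bigr)\frac{n-1}{\,n-k-1\,},
\]

which discounts the sample $R^2$ for capitalization on chance; formulas that instead estimate the validity of the sample-derived weights in a new sample (cross-validity) shrink $R^2$ further.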
Standard 1.20

When effect size measures (e.g., correlations between test scores and criterion measures, standardized mean test score differences between subgroups) are used to draw inferences that go beyond describing the sample or samples on which data have been collected, indices of the degree of uncertainty associated with these measures (e.g., standard errors, confidence intervals, or significance tests) should be reported.

Comment: Effect size measures are usefully paired with indices reflecting their sampling error to make meaningful evaluation possible. There are various possible measures of effect size, each applicable to different settings. In the presentation of indices of uncertainty, standard errors or confidence intervals provide more information and thus are preferred in place of, or as supplements to, significance testing.
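For a correlation $r$ computed on $n$ cases, one standard way to form such a confidence interval (an illustrative sketch; other standard methods serve equally well) is the Fisher $z$ transformation:

\[
z=\tfrac{1}{2}\ln\frac{1+r}{1-r},\qquad
z \;\pm\; z_{\alpha/2}\,\frac{1}{\sqrt{n-3}},
\]

with the interval endpoints transformed back to the correlation metric via $r=\tanh(z)$.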
Standard 1.21

When statistical adjustments, such as those for restriction of range or attenuation, are made, both adjusted and unadjusted coefficients, as well as the specific procedure used, and all statistics used in the adjustment, should be reported. Estimates of the construct-criterion relationship that remove the effects of measurement error on the test should be clearly reported as adjusted estimates.

Comment: The correlation between two variables, such as test scores and criterion measures, depends on the range of values on each variable. For example, the test scores and the criterion values of a selected subset of test takers (e.g., job applicants who have been selected for hire) will typically have a smaller range than the scores of all test takers (e.g., the entire applicant pool). Statistical methods are available for adjusting the correlation to reflect the population of interest rather than the sample available. Such adjustments are often appropriate, as when results are compared across various situations. The correlation between two variables is also affected by measurement error, and methods are available for adjusting the correlation to estimate the strength of the correlation net of the effects of measurement error in either or both variables. Reporting of an adjusted correlation should be accompanied by a statement of the method and the statistics used in making the adjustment.
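Two of the classical adjustments alluded to above, written out for concreteness (the Standards do not mandate specific formulas): the correction for attenuation due to measurement error and, for direct restriction of range on the test, Thorndike's Case II correction,

\[
r_{c} \;=\; \frac{r_{xy}}{\sqrt{r_{xx}\,r_{yy}}},
\qquad
R \;=\; \frac{r\,(S/s)}{\sqrt{1-r^{2}+r^{2}\,(S/s)^{2}}},
\]

where $r_{xx}$ and $r_{yy}$ are reliability estimates for the predictor and criterion, $r$ is the correlation in the restricted sample, and $S/s$ is the ratio of the unrestricted to the restricted standard deviation of test scores. Consistent with Standard 1.21, the unadjusted coefficient, the formula applied, and every statistic entering it should be reported together.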

Standard 1.22

When a meta-analysis is used as evidence of the strength of a test-criterion relationship, the test and the criterion variables in the local situation should be comparable with those in the studies summarized. If relevant research includes credible evidence that any other specific features of the testing application may influence the strength of the test-criterion relationship, the correspondence between those features in the local situation and in the meta-analysis should be reported. Any significant disparities that might limit the applicability of the meta-analytic findings to the local situation should be noted explicitly.

Comment: The meta-analysis should incorporate all available studies meeting explicitly stated inclusion criteria. Meta-analytic evidence used in test validation typically is based on a number of tests measuring the same or very similar constructs and criterion measures that likewise measure the same or similar constructs. A meta-analytic study may also be limited to multiple studies of a single test and a single criterion. For each study included in the analysis, the test-criterion relationship is expressed in some common metric, often as an effect size. The strength of the test-criterion relationship may be moderated by features of the situation in which the test and criterion measures were obtained (e.g., types of jobs, characteristics of test takers, time interval separating collection of test and criterion measures, year or decade in which the data were collected). If test-criterion relationships vary according to such moderator variables, then the meta-analysis should report separate estimated effect-size distributions conditional upon levels of these moderator variables when the number of studies available for analysis permits doing so. This might be accomplished, for example, by reporting separate distributions for subsets of studies or by estimating the magnitudes of the influences of situational features on effect sizes.

This standard addresses the responsibilities of the individual who is drawing on meta-analytic evidence to support a test score interpretation for a given use. In some instances, that individual may also be the one conducting the meta-analysis; in other instances, existing meta-analyses are relied on. In the latter instance, the individual drawing on meta-analytic evidence does not have control over how the meta-analysis was conducted or reported, and must evaluate the soundness of the meta-analysis for the setting in question.

Standard 1.23

Any meta-analytic evidence used to support an intended test score interpretation for a given use should be clearly described, including methodological choices in identifying and coding studies, correcting for artifacts, and examining potential moderator variables. Assumptions made in correcting for artifacts such as criterion unreliability and range restriction should be presented, and the consequences of these assumptions made clear.
Comment: The description should include documented information about each study used as input to the meta-analysis, thus permitting evaluation by an independent party. Note also that meta-analysis inevitably involves judgments regarding a number of methodological choices. The bases for these judgments should be articulated. In the case of choices involving some degree of uncertainty, such as artifact corrections based on assumed values, the uncertainty should be acknowledged and the degree to which conclusions about validity hinge on these assumptions should be examined and reported.

As in the case of Standard 1.22, the individual who is drawing on meta-analytic evidence to support a test score interpretation for a given use may or may not also be the one conducting the meta-analysis. As Standard 1.22 addresses the reporting of meta-analytic evidence, the individual drawing on existing meta-analytic evidence must evaluate the soundness of the meta-analysis for the setting in question.
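As a minimal sketch of the common-metric aggregation underlying Standards 1.22 and 1.23 (the notation is illustrative only): with study effect sizes $d_1,\dots,d_K$ and estimated sampling variances $v_1,\dots,v_K$, a fixed-effect summary weights each study by inverse variance,

\[
\bar{d}=\frac{\sum_{i=1}^{K} w_i\,d_i}{\sum_{i=1}^{K} w_i},\qquad
w_i=\frac{1}{v_i},\qquad
\mathrm{SE}(\bar{d})=\Bigl(\sum_{i=1}^{K} w_i\Bigr)^{-1/2}.
\]

Artifact corrections (e.g., dividing an observed correlation by $\sqrt{r_{yy}}$ to correct for criterion unreliability) alter both $d_i$ and $v_i$, which is why the assumed artifact values and their consequences must be reported; a random-effects analysis replaces $w_i$ with $1/(v_i+\tau^2)$ to incorporate between-study variance when moderators operate.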
Standard 1.24

If a test is recommended for use in assigning persons to alternative treatments, and if outcomes from those treatments can reasonably be compared on a common criterion, then, whenever feasible, supporting evidence of differential outcomes should be provided.

Comment: If a test is used for classification into alternative occupational, therapeutic, or educational programs, it is not sufficient just to show that the test predicts treatment outcomes. Support for the validity of the classification procedure is provided by showing that the test is useful in determining which persons are likely to profit differentially from one treatment or another. Treatment categories may have to be combined to assemble sufficient cases for statistical analysis. It is recognized, however, that such research may not be feasible, because ethical and legal constraints on differential assignments may forbid control groups.

(f) Evidence Based on Consequences of Tests

Standard 1.25

When unintended consequences result from test use, an attempt should be made to investigate whether such consequences arise from the test's sensitivity to characteristics other than those it is intended to assess or from the test's failure to fully represent the intended construct.

Comment: The validity of test score interpretations may be limited by construct-irrelevant components or construct underrepresentation. When unintended consequences appear to stem, at least in part, from the use of one or more tests, it is especially important to check that these consequences do not arise from construct-irrelevant components or construct underrepresentation. For example, although group differences, in and of themselves, do not call into question the validity of a proposed interpretation, they may increase the salience of plausible rival hypotheses that should be evaluated as part of the validation effort. A finding of unintended consequences may also lead to reconsideration of the appropriateness of the construct in question. Ensuring that unintended consequences are evaluated is the responsibility of those making the decision whether to use a particular test, although legal constraints may limit the test user's discretion to discard the results of a previously administered test, when that decision is based on differences in scores for subgroups of different races, ethnicities, or genders. These issues are discussed further in chapter 3.