BACKGROUND
Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing tests and evaluating tests. The process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores for proposed uses that are evaluated, not the test itself. When test scores are interpreted in more than one way (e.g., both to describe a test taker's current level of the attribute being measured and to make a prediction about a future outcome), each intended interpretation must be validated. Statements about validity should refer to particular interpretations for specified uses. It is incorrect to use the unqualified phrase "the validity of the test."

Evidence of the validity of a given interpretation of test scores for a specified use is a necessary condition for the justifiable use of the test. Where sufficient evidence of validity exists, the decision as to whether to actually administer a particular test generally takes additional considerations into account. These include cost-benefit considerations, framed in different subdisciplines as utility analysis or as consideration of negative consequences of test use, and a weighing of any negative consequences against the positive consequences of test use.

Validation logically begins with an explicit statement of the proposed interpretation of test scores, along with a rationale for the relevance of the interpretation to the proposed use. The proposed interpretation includes specifying the construct the test is intended to measure. The term construct is used in the Standards to refer to the concept or characteristic that a test is designed to measure. Rarely, if ever, is there a single possible meaning that can be attached to a test score or a pattern of test responses. Thus, it is always incumbent on test developers and users to specify the construct interpretation that will be made on the basis of the score or response pattern.

Examples of constructs currently used in assessment include mathematics achievement, general cognitive ability, racial identity attitudes, depression, and self-esteem. To support test development, the proposed construct interpretation is elaborated by describing its scope and extent and by delineating the aspects of the construct that are to be represented. The detailed description provides a conceptual framework for the test, delineating the knowledge, skills, abilities, traits, interests, processes, competencies, or characteristics to be assessed. Ideally, the framework indicates how the construct as represented is to be distinguished from other constructs and how it should relate to other variables.

The conceptual framework is partially shaped by the ways in which test scores will be used. For instance, a test of mathematics achievement might be used to place a student in an appropriate program of instruction, to endorse a high school diploma, or to inform a college admissions decision. Each of these uses implies a somewhat different interpretation of the mathematics achievement test scores: that a student will benefit from a particular instructional intervention, that a student has mastered a specified curriculum, or that a student is likely to be successful with college-level work. Similarly, a test of conscientiousness might be used for psychological counseling, to inform a decision about employment, or for the basic scientific purpose of elaborating the construct of conscientiousness. Each of these potential uses shapes the specified framework and the proposed interpretation of the test's scores and also can have implications for test development and evaluation. Validation can be viewed as a process of constructing and evaluating arguments for and against the intended interpretation of test scores and their relevance to the proposed use. The conceptual framework points to the kinds of
CHAPTER 1
VALIDITY
but they do not represent distinct types of validity. Validity is a unitary concept. It is the degree to which all the accumulated evidence supports the intended interpretation of test scores for the proposed use. Like the 1999 Standards, this edition refers to types of validity evidence, rather than distinct types of validity. To emphasize this distinction, the treatment that follows does not follow historical nomenclature (i.e., the use of the terms content validity or predictive validity).

As the discussion in the prior section emphasizes, each type of evidence presented below is not required in all settings. Rather, support is needed for each proposition that underlies a proposed test interpretation for a specified use. A proposition that a test is predictive of a given criterion can be supported without evidence that the test samples a particular content domain. In contrast, a proposition that a test covers a representative sample of a particular curriculum may be supported without evidence that the test predicts a given criterion. However, a more complex set of propositions, e.g., that a test samples a specified domain and thus is predictive of a criterion reflecting a related domain, will require evidence supporting both parts of this set of propositions. Test developers are also expected to make the case that the scores are not unduly influenced by construct-irrelevant variance (see chap. 3 for detailed treatment of issues related to construct-irrelevant variance). In general, adequate support for proposed interpretations for specific uses will require multiple sources of evidence.

The position developed above also underscores the fact that if a given test is interpreted in multiple ways for multiple uses, the propositions underlying these interpretations for different uses also are likely to differ. Support is needed for the propositions underlying each interpretation for a specific use. Evidence supporting the interpretation of scores on a mathematics achievement test for placing students in subsequent courses (i.e., evidence that the test interpretation is valid for its intended purpose) does not permit inferring validity for other purposes (e.g., promotion or teacher evaluation).

Evidence Based on Test Content

Important validity evidence can be obtained from an analysis of the relationship between the content of a test and the construct it is intended to measure. Test content refers to the themes, wording, and format of the items, tasks, or questions on a test. Administration and scoring may also be relevant to content-based evidence. Test developers often work from a specification of the content domain. The content specification carefully describes the content in detail, often with a classification of areas of content and types of items. Evidence based on test content can include logical or empirical analyses of the adequacy with which the test content represents the content domain and of the relevance of the content domain to the proposed interpretation of test scores. Evidence based on content can also come from expert judgments of the relationship between parts of the test and the construct. For example, in developing a licensure test, the major facets that are relevant to the purpose for which the occupation is regulated can be specified, and experts in that occupation can be asked to assign test items to the categories defined by those facets. These or other experts can then judge the representativeness of the chosen set of items.

Some tests are based on systematic observations of behavior. For example, a list of the tasks constituting a job domain may be developed from observations of behavior in a job, together with judgments of subject matter experts. Expert judgments can be used to assess the relative importance, criticality, and/or frequency of the various tasks. A job sample test can then be constructed from a random or stratified sampling of tasks rated highly on these characteristics. The test can then be administered under standardized conditions in an off-the-job setting.

The appropriateness of a given content domain is related to the specific inferences to be made from test scores. Thus, when considering an available test for a purpose other than that for which it was first developed, it is especially important to evaluate the appropriateness of the original content domain for the proposed new purpose. For example, a test given for research purposes to compare student achievement across states in a given domain may properly also cover material that receives little or no attention in the curriculum. Policy makers can then evaluate student achievement with respect to both content neglected and content addressed. On the other hand, when student mastery of a delivered curriculum is tested for purposes of informing decisions about individual students, such as promotion or graduation, the framework elaborating a content domain is appropriately limited to what students have had an opportunity to learn from the curriculum as delivered.

Evidence about content can be used, in part, to address questions about differences in the meaning or interpretation of test scores across relevant subgroups of test takers. Of particular concern is the extent to which construct underrepresentation or construct-irrelevance may give an unfair advantage or disadvantage to one or more subgroups of test takers. For example, in an employment test, the use of vocabulary more complex than needed on the job may be a source of construct-irrelevant variance for English language learners or others. Careful review of the construct and test content domain by a diverse panel of experts may point to potential sources of irrelevant difficulty (or easiness) that require further investigation.

Content-oriented evidence of validation is at the heart of the process in the educational arena known as alignment, which involves evaluating the correspondence between student learning standards and test content. Content-sampling issues in the alignment process include evaluating whether test content appropriately samples the domain set forward in curriculum standards, whether the cognitive demands of test items correspond to the level reflected in the student learning standards (e.g., content standards), and whether the test avoids the inclusion of features irrelevant to the standard that is the intended target of each test item.

Evidence Based on Response Processes

Some construct interpretations involve more or less explicit assumptions about the cognitive processes engaged in by test takers. Theoretical and empirical analyses of the response processes of test takers can provide evidence concerning the fit between the construct and the detailed nature of the performance or response actually engaged in by test takers. For instance, if a test is intended to assess mathematical reasoning, it becomes important to determine whether test takers are, in fact, reasoning about the material given instead of following a standard algorithm applicable only to the specific items on the test.

Evidence based on response processes generally comes from analyses of individual responses. Questioning test takers from various groups making up the intended test-taking population about their performance strategies or responses to particular items can yield evidence that enriches the definition of a construct. Maintaining records that monitor the development of a response to a writing task, through successive written drafts or electronically monitored revisions, for instance, also provides evidence of process. Documentation of other aspects of performance, like eye movements or response times, may also be relevant to some constructs. Inferences about processes involved in performance can also be developed by analyzing the relationship among parts of the test and between the test and other variables. Wide individual differences in process can be revealing and may lead to reconsideration of certain test formats.

Evidence of response processes can contribute to answering questions about differences in meaning or interpretation of test scores across relevant subgroups of test takers. Process studies involving test takers from different subgroups can assist in determining the extent to which capabilities irrelevant or ancillary to the construct may be differentially influencing test takers' test performance.

Studies of response processes are not limited to the test taker. Assessments often rely on observers or judges to record and/or evaluate test takers' performances or products. In such cases, relevant validity evidence includes the extent to which the processes of observers or judges are consistent with the intended interpretation of scores. For instance, if judges are expected to apply particular criteria in scoring test takers' performances, it is
understand a test and the inferences that can be drawn from it. In this way an inference of validity is similar to any scientific inference. However, a test interpretation for a given use rests on evidence for a set of propositions making up the validity argument, and at some point validation evidence allows for a summary judgment of the intended interpretation that is well supported and defensible.

At some point the effort to provide sufficient validity evidence to support a given test interpretation for a specific use does end (at least provisionally, pending the emergence of a strong basis for questioning that judgment). Legal requirements may necessitate that the validation study be updated in light of such factors as changes in the test population or newly developed alternative testing methods.

The amount and character of evidence required to support a provisional judgment of validity often vary between areas and also within an area as research on a topic advances. For example, prevailing standards of evidence may vary with the stakes involved in the use or interpretation of the test scores. Higher stakes may entail higher standards of evidence. As another example, in areas where data collection comes at a greater cost, one may find it necessary to base interpretations on fewer data than in areas where data collection comes with less cost.

Ultimately, the validity of an intended interpretation of test scores relies on all the available evidence relevant to the technical quality of a testing system. Different components of validity evidence are described in subsequent chapters of the Standards, and include evidence of careful test construction; adequate score reliability; appropriate test administration and scoring; accurate score scaling, equating, and standard setting; and careful attention to fairness for all test takers, as appropriate to the test interpretation in question.

STANDARDS FOR VALIDITY

The standards in this chapter begin with an overarching standard (numbered 1.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into three thematic clusters labeled as follows:

1. Establishing Intended Uses and Interpretations
2. Issues Regarding Samples and Settings Used in Validation
3. Specific Forms of Validity Evidence

Standard 1.0

Clear articulation of each intended test score interpretation for a specified use should be set forth, and appropriate validity evidence in support of each intended interpretation should be provided.

Cluster 1. Establishing Intended Uses and Interpretations

Standard 1.1

The test developer should set forth clearly how test scores are intended to be interpreted and consequently used. The population(s) for which a test is intended should be delimited clearly, and the construct or constructs that the test is intended to assess should be described clearly.

Comment: Statements about validity should refer to particular interpretations and consequent uses. It is incorrect to use the unqualified phrase "the validity of the test." No test permits interpretations that are valid for all purposes or in all situations. Each recommended interpretation for a given use requires validation. The test developer should specify in clear language the population for which the test is intended, the construct it is intended to measure, the contexts in which test scores are to be employed, and the processes by which the test is to be administered and scored.

Standard 1.2

A rationale should be presented for each intended interpretation of test scores for a given use, together with a summary of the evidence and theory bearing on the intended interpretation.

Comment: The rationale should indicate what propositions are necessary to investigate the intended interpretation. The summary should combine logical analysis with empirical evidence to provide support for the test rationale. Evidence may come from studies conducted locally, in the setting where the test is to be used; from specific prior studies; or from comprehensive statistical syntheses of available studies meeting clearly specified study quality criteria. No type of evidence is inherently preferable to others; rather, the quality and relevance of the evidence to the intended test score interpretation for a given use determine the value of a particular kind of evidence. A presentation of empirical evidence on any point should give due weight to all relevant findings in the scientific literature, including those inconsistent with the intended interpretation or use. Test developers have the responsibility to provide support for their own recommendations, but test users bear ultimate responsibility for evaluating the quality of the validity evidence provided and its relevance to the local situation.

Standard 1.3

If validity for some common or likely interpretation for a given use has not been evaluated, or if such an interpretation is inconsistent with available evidence, that fact should be made clear and potential users should be strongly cautioned about making unsupported interpretations.

Comment: If past experience suggests that a test is likely to be used inappropriately for certain
kinds of decisions or certain kinds of test takers, specific warnings against such uses should be given. Professional judgment is required to evaluate the extent to which existing validity evidence supports a given test use.

Standard 1.4

If a test score is interpreted for a given use in a way that has not been validated, it is incumbent on the user to justify the new interpretation for that use, providing a rationale and collecting new evidence, if necessary.

Comment: Professional judgment is required to evaluate the extent to which existing validity evidence applies in the new situation or to the new group of test takers and to determine what new evidence may be needed. The amount and kinds of new evidence required may be influenced by experience with similar prior test uses or interpretations and by the amount, quality, and relevance of existing data.

A test that has been altered or administered in ways that change the construct underlying the test for use with subgroups of the population requires evidence of the validity of the interpretation made on the basis of the modified test (see chap. 3). For example, if a test is adapted for use with individuals with a particular disability in a way that changes the underlying construct, the modified test should have its own evidence of validity for the intended interpretation.

Standard 1.5

When it is clearly stated or implied that a recommended test score interpretation for a given use will result in a specific outcome, the basis for expecting that outcome should be presented, together with relevant evidence.

Comment: If it is asserted, for example, that interpreting and using scores on a given test for employee selection will result in reduced employee errors or training costs, evidence in support of that assertion should be provided. A given claim may be supported by logical or theoretical argument as well as empirical data. Appropriate weight should be given to findings in the scientific literature that may be inconsistent with the stated expectation.

Standard 1.6

When a test use is recommended on the grounds that testing or the testing program itself will result in some indirect benefit, in addition to the utility of information from interpretation of the test scores themselves, the recommender should make explicit the rationale for anticipating the indirect benefit. Logical or theoretical arguments and empirical evidence for the indirect benefit should be provided. Appropriate weight should be given to any contradictory findings in the scientific literature, including findings suggesting important indirect outcomes other than those predicted.

Comment: For example, certain educational testing programs have been advocated on the grounds that they would have a salutary influence on classroom instructional practices or would clarify students' understanding of the kind or level of achievement they were expected to attain. To the extent that such claims enter into the justification for a testing program, they become part of the argument for test use. Evidence for such claims should be examined, in conjunction with evidence about the validity of intended test score interpretation and evidence about unintended negative consequences of test use, in making an overall decision about test use. Due weight should be given to evidence against such predictions, for example, evidence that under some conditions educational testing may have a negative effect on classroom instruction.

Standard 1.7

If test performance, or a decision made therefrom, is claimed to be essentially unaffected by practice and coaching, then the propensity for test performance to change with these forms of instruction should be documented.

Comment: Materials to aid in score interpretation should summarize evidence indicating the degree to which improvement with practice or coaching can be expected. Also, materials written for test takers should provide practical guidance about the value of test preparation activities, including coaching.

Cluster 2. Issues Regarding Samples and Settings Used in Validation

Standard 1.8

The composition of any sample of test takers from which validity evidence is obtained should be described in as much detail as is practical and permissible, including major relevant sociodemographic and developmental characteristics.

Comment: Statistical findings can be influenced by factors affecting the sample on which the results are based. When the sample is intended to represent a population, that population should be described, and attention should be drawn to any systematic factors that may limit the representativeness of the sample. Factors that might reasonably be expected to affect the results include self-selection, attrition, linguistic ability, disability status, and exclusion criteria, among others. If the participants in a validity study are patients, for example, then the diagnoses of the patients are important, as well as other characteristics, such as the severity of the diagnosed conditions. For tests used in employment settings, the employment status (e.g., applicants versus current job holders), the general level of experience and educational background, and the gender and ethnic composition of the sample may be relevant information. For tests used in credentialing, the status of those providing information (e.g., candidates for a credential versus already-credentialed individuals) is important for interpreting the resulting data. For tests used in educational settings, relevant information may include educational background, developmental level, community characteristics, or school admissions policies, as well as the gender and ethnic composition of the sample. Sometimes legal restrictions about privacy preclude obtaining or disclosing such population information or limit the level of particularity at which such data may be disclosed. The specific privacy laws, if any, governing the type of data should be considered, in order to ensure that any description of a population does not have the potential to identify an individual in a manner inconsistent with such standards. The extent of missing data, if any, and the methods for handling missing data (e.g., use of imputation procedures) should be described.

Standard 1.9

When a validation rests in part on the opinions or decisions of expert judges, observers, or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The qualifications and experience of the judges should be presented. The description of procedures should include any training and instructions provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth.

Comment: Systematic collection of judgments or opinions may occur at many points in test construction (e.g., eliciting expert judgments of content appropriateness or adequate content representation), in the formulation of rules or standards for score interpretation (e.g., in setting cut scores), or in test scoring (e.g., rating of essay responses). Whenever such procedures are employed, the quality of the resulting judgments is important to the validation. Level of agreement should be specified clearly (e.g., whether percent agreement refers to agreement prior to or after a consensus discussion, and whether the criterion for agreement is exact agreement of ratings or agreement within a certain number of scale points). The basis for specifying certain types of individuals (e.g., experienced teachers, experienced
job incumbents, supervisors) as appropriate experts for the judgment or rating task should be articulated. It may be entirely appropriate to have experts work together to reach consensus, but it would not then be appropriate to treat their respective judgments as statistically independent. Different judges may be used for different purposes (e.g., one set may rate items for cultural sensitivity while another may rate for reading level) or for different portions of a test.

Standard 1.10

When validity evidence includes statistical analyses of test results, either alone or together with data on other variables, the conditions under which the data were collected should be described in enough detail that users can judge the relevance of the statistical findings to local conditions. Attention should be drawn to any features of a validation data collection that are likely to differ from typical operational testing conditions and that could plausibly influence test performance.

Comment: Such conditions might include (but would not be limited to) the following: test-taker motivation or prior preparation, the range of test scores over test takers, the time allowed for test takers to respond or other administrative conditions, the mode of test administration (e.g., unproctored online testing versus proctored on-site testing), examiner training or other examiner characteristics, the time intervals separating collection of data on different measures, or conditions that may have changed since the validity evidence was obtained.

Cluster 3. Specific Forms of Validity Evidence

(a) Content-Oriented Evidence

Standard 1.11

When the rationale for test score interpretation for a given use rests in part on the appropriateness of test content, the procedures followed in specifying and generating test content should be described and justified with reference to the intended population to be tested and the construct the test is intended to measure or the domain it is intended to represent. If the definition of the content sampled incorporates criteria such as importance, frequency, or criticality, these criteria should also be clearly explained and justified.

Comment: For example, test developers might provide a logical structure that maps the items on the test to the content domain, illustrating the relevance of each item and the adequacy with which the set of items represents the content domain. Areas of the content domain that are not included among the test items could be indicated as well. The match of test content to the targeted domain in terms of cognitive complexity and the accessibility of the test content to all members of the intended population are also important considerations.

(b) Evidence Regarding Cognitive Processes

Standard 1.12

If the rationale for score interpretation for a given use depends on premises about the psychological processes or cognitive operations of test takers, then theoretical or empirical evidence in support of those premises should be provided. When statements about the processes employed by observers or scorers are part of the argument for validity, similar information should be provided.

Comment: If the test specification delineates the processes to be assessed, then evidence is needed that the test items do, in fact, tap the intended processes.

(c) Evidence Regarding Internal Structure

Standard 1.13

If the rationale for a test score interpretation for a given use depends on premises about the relationships among test items or among parts of the test, evidence concerning the internal structure of the test should be provided.

Comment: It might be claimed, for example, that a test is essentially unidimensional. Such a claim could be supported by a multivariate statistical analysis, such as a factor analysis, showing that the score variability attributable to one major dimension was much greater than the score variability attributable to any other identified dimension, or showing that a single factor adequately accounts for the covariation among test items. When a test provides more than one score, the interrelationships of those scores should be shown to be consistent with the construct(s) being assessed.

Standard 1.14

When interpretation of subscores, score differences, or profiles is suggested, the rationale and relevant evidence in support of such interpretation should be provided. Where composite scores are developed, the basis and rationale for arriving at the composites should be given.

Comment: When a test provides more than one score, the distinctiveness and reliability of the separate scores should be demonstrated, and the interrelationships of those scores should be shown to be consistent with the construct(s) being assessed. Moreover, evidence for the validity of interpretations of two or more separate scores would not necessarily justify a statistical or substantive interpretation of the difference between them. Rather, the rationale and supporting evidence must pertain directly to the specific score, score combination, or score pattern to be interpreted for a given use. When subscores from one test or scores from different tests are combined into a composite, the basis for combining scores and for how scores are combined (e.g., differential weighting versus simple summation) should be specified.

Standard 1.15

When interpretation of performance on specific items, or small subsets of items, is suggested, the rationale and relevant evidence in support of such interpretation should be provided. When interpretation of individual item responses is likely but is not recommended by the developer, the user should be warned against making such interpretations.

Comment: Users should be given sufficient guidance to enable them to judge the degree of confidence warranted for any interpretation for a use recommended by the test developer. Test manuals and score reports should discourage overinterpretation of information that may be subject to considerable error. This is especially important if interpretation of performance on isolated items, small subsets of items, or subtest scores is suggested.

(d) Evidence Regarding Relationships With Conceptually Related Constructs

Standard 1.16

When validity evidence includes empirical analyses of responses to test items together with data on other variables, the rationale for selecting the additional variables should be provided. Where appropriate and feasible, evidence concerning the constructs represented by other variables, as well as their technical properties, should be presented or cited. Attention should be drawn to any likely sources of dependence (or lack of independence) among variables other than dependencies among the construct(s) they represent.

Comment: The patterns of association between and among scores on the test under study and other variables should be consistent with theoretical expectations. The additional variables might be demographic characteristics, indicators of treatment conditions, or scores on other measures. They might include intended measures of the same construct or of different constructs. The reliability of scores from such other measures and the validity of intended interpretations of scores from these measures are an important part of the validity evidence for the test under study. If such variables include composite scores, the manner in which
the composites were constructed should be explained (e.g., transformation or standardization of the variables, and weighting of the variables). In addition to considering the properties of each variable in isolation, it is important to guard against faulty interpretations arising from spurious sources of dependency among measures, including correlated errors or shared variance due to common methods of measurement or common elements.

(e) Evidence Regarding Relationships With Criteria

Standard 1.17

When validation relies on evidence that test scores are related to one or more criterion variables, information about the suitability and technical quality of the criteria should be reported.

Comment: The description of each criterion variable should include evidence concerning its reliability, the extent to which it represents the intended construct (e.g., task performance on the job), and the extent to which it is likely to be influenced by extraneous sources of variance. Special attention should be given to sources that previous research suggests may introduce extraneous variance that might bias the criterion for or against identifiable groups.

Standard 1.18

When it is asserted that a certain level of test performance predicts adequate or inadequate criterion performance, information about the levels of criterion performance associated with given levels of test scores should be provided.

Comment: For purposes of linking specific test scores with specific levels of criterion performance, regression equations are more useful than correlation coefficients, which are generally insufficient to fully describe patterns of association between tests and other variables. Means, standard deviations, and other statistical summaries are needed, as well as information about the distribution of criterion performances conditional upon a given test score. In the case of categorical rather than continuous variables, techniques appropriate to such data should be used (e.g., the use of logistic regression in the case of a dichotomous criterion). Evidence about the overall association between variables should be supplemented by information about the form of that association and about the variability of that association in different ranges of test scores. Note that data collection employing test takers selected for their extreme scores on one or more measures (extreme groups) typically cannot provide adequate information about the association.
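The comment's point about dichotomous criteria can be sketched in a few lines. The data below are hypothetical, and the hand-coded gradient-ascent fit is only a stand-in for whatever statistical package a validation study would actually use; the point is that logistic regression expresses the probability of adequate criterion performance conditional on a given test score.

```python
import math

# Hypothetical data: test scores and a dichotomous criterion
# (1 = adequate criterion performance, 0 = inadequate).
scores = [52, 58, 61, 65, 70, 74, 78, 83, 88, 92]
passed = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

mean = sum(scores) / len(scores)
sd = (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5
z = [(x - mean) / sd for x in scores]  # standardize for stable fitting

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Maximize the log-likelihood by gradient ascent; the logistic
# log-likelihood is concave, so this converges on data this small.
b0 = b1 = 0.0
for _ in range(5000):
    resid = [y - sigmoid(b0 + b1 * zi) for zi, y in zip(z, passed)]
    b0 += 0.1 * sum(resid)
    b1 += 0.1 * sum(r * zi for r, zi in zip(resid, z))

def p_adequate(score):
    """Estimated P(adequate criterion performance | test score)."""
    return sigmoid(b0 + b1 * (score - mean) / sd)

print([round(p_adequate(x), 2) for x in (55, 70, 85)])
```

The fitted curve is exactly the kind of "distribution of criterion performance conditional upon a given test score" summary the comment asks for when the criterion is categorical.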
Standard 1.19

If test scores are used in conjunction with other variables to predict some outcome or criterion, analyses based on statistical models of the predictor-criterion relationship should include those additional relevant variables along with the test scores.

Comment: In general, if several predictors of some criterion are available, the optimum combination of predictors cannot be determined solely from separate, pairwise examinations of the criterion variable with each separate predictor in turn, due to intercorrelation among predictors. It is often informative to estimate the increment in predictive accuracy that may be expected when each variable, including the test score, is introduced in addition to all other available variables. As empirically derived weights for combining predictors can capitalize on chance factors in a given sample, analyses involving multiple predictors should be verified by cross-validation or equivalent analysis whenever feasible, and the precision of estimated regression coefficients or other indices should be reported. Cross-validation procedures include formula estimates of validity in subsequent samples and empirical approaches such as deriving weights in one portion of a sample and applying them to an independent subsample.
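The empirical cross-validation approach the comment describes, deriving weights in one portion of a sample and applying them to an independent subsample, can be illustrated with simulated data. Everything below is hypothetical, and the hand-rolled normal-equations solver merely stands in for standard regression software.

```python
import random

random.seed(1)

# Simulated sample: two predictors (e.g., test score and a second
# variable) and a criterion that depends on both, plus noise.
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * a + 0.3 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

def fit_ols(rows, ys):
    """Least-squares weights for [1, x1, x2] via the normal equations."""
    k = 3
    X = [[1.0, r[0], r[1]] for r in rows]
    A = [[sum(X[i][p] * X[i][q] for i in range(len(X))) for q in range(k)]
         for p in range(k)]
    b = [sum(X[i][p] * ys[i] for i in range(len(X))) for p in range(k)]
    # Gaussian elimination with partial pivoting, then back-substitution.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * k
    for r in range(k - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, k))) / A[r][r]
    return w

def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

# Derive weights on one half of the sample ...
half = n // 2
w = fit_ols(list(zip(x1[:half], x2[:half])), y[:half])

# ... and evaluate predictive accuracy on the independent subsample.
pred = [w[0] + w[1] * a + w[2] * b for a, b in zip(x1[half:], x2[half:])]
r_cv = pearson(pred, y[half:])
print(round(r_cv, 2))
```

The cross-validated correlation computed on the held-out half is the honest estimate; the in-sample correlation would be inflated by the capitalization on chance the comment warns about.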
Standard 1.20

When effect size measures (e.g., correlations between test scores and criterion measures, standardized mean test score differences between subgroups) are used to draw inferences that go beyond describing the sample or samples on which data have been collected, indices of the degree of uncertainty associated with these measures (e.g., standard errors, confidence intervals, or significance tests) should be reported.

Comment: Effect size measures are usefully paired with indices reflecting their sampling error to make meaningful evaluation possible. There are various possible measures of effect size, each applicable to different settings. In the presentation of indices of uncertainty, standard errors or confidence intervals provide more information and thus are preferred in place of, or as supplements to, significance testing.
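As one illustration of pairing an effect size with a confidence interval, the familiar Fisher z approach for a correlation can be computed directly (the r and n below are hypothetical).

```python
import math

# Hypothetical result: a test-criterion correlation of r = .35 in a
# validation sample of n = 120.
r, n = 0.35, 120

# Fisher z transformation: atanh(r) is approximately normal with
# standard error 1 / sqrt(n - 3).
z = math.atanh(r)
se = 1.0 / math.sqrt(n - 3)

# 95% confidence interval, transformed back to the correlation metric.
lo = math.tanh(z - 1.96 * se)
hi = math.tanh(z + 1.96 * se)
print(round(lo, 2), round(hi, 2))
```

Reporting the interval (here roughly .18 to .50) conveys far more about the precision of the validity estimate than a bare statement that the correlation is statistically significant.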
Standard 1.21

When statistical adjustments, such as those for restriction of range or attenuation, are made, both adjusted and unadjusted coefficients, as well as the specific procedure used, and all statistics used in the adjustment, should be reported. Estimates of the construct-criterion relationship that remove the effects of measurement error on the test should be clearly reported as adjusted estimates.

Comment: The correlation between two variables, such as test scores and criterion measures, depends on the range of values on each variable. For example, the test scores and the criterion values of a selected subset of test takers (e.g., job applicants who have been selected for hire) will typically have a smaller range than the scores of all test takers (e.g., the entire applicant pool). Statistical methods are available for adjusting the correlation to reflect the population of interest rather than the sample available. Such adjustments are often appropriate, as when results are compared across various situations. The correlation between two variables is also affected by measurement error, and methods are available for adjusting the correlation to estimate the strength of the correlation net of the effects of measurement error in either or both variables. Reporting of an adjusted correlation should be accompanied by a statement of the method and the statistics used in making the adjustment.
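The two adjustments the standard names can be sketched numerically. The figures below are hypothetical, and the Thorndike Case II formula shown is only one of several available corrections for direct range restriction; note that both the unadjusted and the adjusted values remain available for reporting, as the standard requires.

```python
import math

# Hypothetical observed (restricted) test-criterion correlation among
# selected hires:
r_restricted = 0.25
# Ratio of unrestricted to restricted test-score standard deviations
# (applicant-pool SD / hired-group SD):
u = 2.0

# Thorndike Case II correction for direct range restriction.
r_corrected = (u * r_restricted) / math.sqrt(
    1 - r_restricted ** 2 + (u * r_restricted) ** 2)

# Correction for attenuation due to criterion unreliability
# (r_yy is the assumed reliability of the criterion measure).
r_yy = 0.80
r_disattenuated = r_corrected / math.sqrt(r_yy)

print(round(r_restricted, 2), round(r_corrected, 2),
      round(r_disattenuated, 2))
```

Because each corrected value rests on an assumed statistic (the unrestricted standard deviation, the criterion reliability), reporting those inputs alongside the adjusted and unadjusted coefficients is exactly what the standard calls for.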
Standard 1.22

When a meta-analysis is used as evidence of the strength of a test-criterion relationship, the test and the criterion variables in the local situation should be comparable with those in the studies summarized. If relevant research includes credible evidence that any other specific features of the testing application may influence the strength of the test-criterion relationship, the correspondence between those features in the local situation and in the meta-analysis should be reported. Any significant disparities that might limit the applicability of the meta-analytic findings to the local situation should be noted explicitly.

Comment: The meta-analysis should incorporate all available studies meeting explicitly stated inclusion criteria. Meta-analytic evidence used in test validation typically is based on a number of tests measuring the same or very similar constructs and criterion measures that likewise measure the same or similar constructs. A meta-analytic study may also be limited to multiple studies of a single test and a single criterion. For each study included in the analysis, the test-criterion relationship is expressed in some common metric, often as an effect size. The strength of the test-criterion relationship may be moderated by features of the situation in which the test and criterion measures were obtained (e.g., types of jobs, characteristics of test takers, time interval separating collection of test and criterion measures, year or decade in which the data were collected). If test-criterion relationships vary according to such moderator variables, then the meta-analysis should report separate estimated effect-size distributions conditional upon levels of these moderator variables when the number of studies available for analysis permits doing so. This might be accomplished, for example, by reporting separate distributions for subsets of studies or by estimating the magnitudes of the influences of situational features on effect sizes.

This standard addresses the responsibilities of the individual who is drawing on meta-analytic evidence to support a test score interpretation for a given use. In some instances, that individual may also be the one conducting the meta-analysis; in other instances, existing meta-analyses are relied on. In the latter instance, the individual drawing on meta-analytic evidence does not have control over how the meta-analysis was conducted or reported, and must evaluate the soundness of the meta-analysis for the setting in question.
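A minimal sketch of how study-level effect sizes are expressed in a common metric and combined: the inputs below are hypothetical, and the simple fixed-effect combination of correlations in the Fisher z metric is only the starting point; an operational meta-analysis would also examine heterogeneity, artifacts, and the moderator variables the comment discusses.

```python
import math

# Hypothetical meta-analytic input: test-criterion correlations and
# sample sizes from k = 5 studies of the same test-criterion pair.
studies = [(0.30, 80), (0.42, 150), (0.25, 60), (0.35, 200), (0.28, 90)]

# Fixed-effect combination in the Fisher z metric, weighting each
# study by its inverse sampling variance, i.e., by (n - 3).
num = sum((n - 3) * math.atanh(r) for r, n in studies)
den = sum(n - 3 for _, n in studies)
z_bar = num / den

# Pooled estimate and its 95% confidence interval, back-transformed
# to the correlation metric.
se = 1.0 / math.sqrt(den)
r_bar = math.tanh(z_bar)
lo = math.tanh(z_bar - 1.96 * se)
hi = math.tanh(z_bar + 1.96 * se)
print(round(r_bar, 2), round(lo, 2), round(hi, 2))
```

The pooled correlation necessarily falls within the range of the study-level values; whether it transfers to a local situation is precisely the comparability question Standards 1.22 and 1.23 address.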
Standard 1.23

Any meta-analytic evidence used to support an intended test score interpretation for a given use should be clearly described, including methodological choices in identifying and coding studies, correcting for artifacts, and examining potential moderator variables. Assumptions made in correcting for artifacts such as criterion unreliability and range restriction should be presented, and the consequences of these assumptions made clear.

Comment: The description should include documented information about each study used as input to the meta-analysis, thus permitting evaluation by an independent party. Note also that meta-analysis inevitably involves judgments regarding a number of methodological choices. The bases for these judgments should be articulated. In the case of choices involving some degree of uncertainty, such as artifact corrections based on assumed values, the uncertainty should be acknowledged and the degree to which conclusions about validity hinge on these assumptions should be examined and reported.

As in the case of Standard 1.22, the individual who is drawing on meta-analytic evidence to support a test score interpretation for a given use may or may not also be the one conducting the meta-analysis. As Standard 1.22 addresses the reporting of meta-analytic evidence, the individual drawing on existing meta-analytic evidence must evaluate the soundness of the meta-analysis for the setting in question.

Standard 1.24

If a test is recommended for use in assigning persons to alternative treatments, and if outcomes from those treatments can reasonably be compared on a common criterion, then, whenever feasible, supporting evidence of differential outcomes should be provided.

Comment: If a test is used for classification into alternative occupational, therapeutic, or educational programs, it is not sufficient just to show that the test predicts treatment outcomes. Support for the validity of the classification procedure is provided by showing that the test is useful in determining which persons are likely to profit differentially from one treatment or another. Treatment categories may have to be combined to assemble sufficient cases for statistical analysis. It is recognized, however, that such research may not be feasible, because ethical and legal constraints on differential assignments may forbid control groups.

(f) Evidence Based on Consequences of Tests

Standard 1.25

When unintended consequences result from test use, an attempt should be made to investigate whether such consequences arise from the test's sensitivity to characteristics other than those it is intended to assess or from the test's failure to fully represent the intended construct.

Comment: The validity of test score interpretations may be limited by construct-irrelevant components or construct underrepresentation. When unintended consequences appear to stem, at least in part, from the use of one or more tests, it is especially important to check that these consequences do not arise from construct-irrelevant components or construct underrepresentation. For example, although group differences, in and of themselves, do not call into question the validity of a proposed interpretation, they may increase the salience of plausible rival hypotheses that should be evaluated as part of the validation effort. A finding of unintended consequences may also lead to reconsideration of the appropriateness of the construct in question. Ensuring that unintended consequences are evaluated is the responsibility of those making the decision whether to use a particular test, although legal constraints may limit the test user's discretion to discard the results of a previously administered test, when that decision is based on differences in scores for subgroups of different races, ethnicities, or genders. These issues are discussed further in chapter 3.