
Estimating Present Performance

Up to this point, we have emphasized the role of the test-criterion relationship in predicting
future performance. Although this is probably its major use, at times we are interested in the
relation of test performance to some other current measure of performance. In this case, we obtain
both measures at approximately the same time and correlate the results. This is commonly done
when a test is being considered as a replacement for a more time consuming method of obtaining
information. For example, Mrs. Valencia, a biology teacher, wondered if an objective test of study
skills could be used in place of the elaborate observation and rating procedures she was currently
using. She believed that if a test could be substituted for the more complex procedures, she would
have much more time to devote to individual students during the supervised study period. An
analysis of the specific student characteristics on which she rated the students' study skills
indicated that many of the procedures could be stated in the form of objective test questions.
Consequently, she developed an objective test of study skills that she administered to her students.
To determine how adequately the test measured study skills, she correlated the test results with
ratings of the students' study skills she obtained through arduous observation. The resulting
correlation coefficient of 0.75 indicates considerable agreement between the test results and the
criterion measure and validates Mrs. Valencia's test of study skills.
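
The agreement index Mrs. Valencia computed is an ordinary Pearson product-moment correlation between the paired scores. As a rough sketch of the computation (the score pairs and variable names below are hypothetical illustrations, not the data from the example), a short routine might look like this:

```python
# A minimal sketch of computing a concurrent validity coefficient.
# The paired scores below are hypothetical, not Mrs. Valencia's data.
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two paired score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical objective study-skills test scores and observation ratings.
test_scores = [72, 85, 60, 90, 78, 66, 88, 74, 58, 81]
ratings = [3.0, 4.5, 2.5, 4.0, 3.5, 3.0, 4.5, 3.0, 2.0, 4.0]

print(f"validity coefficient = {pearson_r(test_scores, ratings):.2f}")
```

A coefficient near 0.75, as in the example, would indicate considerable agreement between the two measures.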
Several factors influence the size of correlation coefficients, including validity coefficients.
Knowing these factors can help with the interpretation of a particular correlation coefficient, whether
one we have computed ourselves or one found in a test manual. Basic factors to consider are shown
in Figure 4.7. In general, larger correlation coefficients are obtained when the characteristics
measured are more alike (e.g., correlating scores from two reading tests), the spread of scores is
large, the stability of the scores is high, and the time span between measures is short. As we move
along the continuum toward the other end of the scale on any of these factors, the correlation
coefficients tend to become smaller. Thus, a small predictive validity coefficient might be
explained, in part, by any one of the factors shown on the right side of Figure 4.7 or, more
commonly, by some combination of them.
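
The effect of one of these factors, the spread of scores, can be seen in a small simulation. The sketch below is purely illustrative (the distributions and the strength of the relationship are assumptions, not values from the chapter): when the group is restricted to students above average on the predictor, the spread of predictor scores shrinks and so does the correlation.

```python
# A hypothetical demonstration that restricting the spread of scores
# tends to shrink a correlation coefficient (one factor in Figure 4.7).
import random, math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

random.seed(0)
# Simulated aptitude scores and criterion scores that share a common component.
aptitude = [random.gauss(100, 15) for _ in range(500)]
criterion = [0.7 * (a - 100) + random.gauss(0, 12) + 70 for a in aptitude]

full_r = pearson_r(aptitude, criterion)

# Keep only the students scoring above 100 on the predictor (a narrower spread).
restricted = [(a, c) for a, c in zip(aptitude, criterion) if a > 100]
res_r = pearson_r([a for a, _ in restricted], [c for _, c in restricted])

print(f"full group r = {full_r:.2f}, restricted group r = {res_r:.2f}")
```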

Expectancy Table
How well a test predicts future performance or estimates current performance on some
criterion measure also can be shown by directly plotting the data in a twofold chart as shown in
Figure 4.8. Here, Mr. Tanaka's data (from Table 4.3) have been tabulated by placing a tally
showing each individual's standing on both the fall aptitude scores and the spring mathematics
scores. For example, Sandra scored 119 on the fall aptitude test and 77 on the spring math test, so
a tally representing her performance was placed in the upper-right-hand cell. The performance of
all other students on the two tests was tallied in the same manner. Thus, each tally mark in Figure
4.8 represents how well each of Mr. Tanaka's 20 students performed on the fall and spring tests.
The total number of students in each cell and in each column and row is also indicated.
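
The tallying itself amounts to a simple cross-tabulation of each student's pair of scores. A rough sketch of the idea follows; the score pairs, the cut used for "above average," and the criterion intervals are hypothetical stand-ins for the actual values in Table 4.3 and Figure 4.8.

```python
# A minimal sketch of tallying score pairs into an expectancy grid.
# All values below are hypothetical, not Mr. Tanaka's data.
from collections import Counter

def predictor_band(score):
    """Classify a fall aptitude score as above or below average (assumed cut of 100)."""
    return "above average" if score >= 100 else "below average"

def criterion_band(score):
    """Assign a spring mathematics score to a criterion interval."""
    if score >= 75:
        return "75-84"
    if score >= 65:
        return "65-74"
    return "below 65"

# Hypothetical (fall aptitude, spring mathematics) score pairs.
pairs = [(119, 77), (104, 68), (96, 71), (88, 60), (112, 80),
         (92, 58), (107, 73), (85, 55), (121, 82), (98, 63)]

grid = Counter((predictor_band(apt), criterion_band(math_score)) for apt, math_score in pairs)
for cell, count in sorted(grid.items()):
    print(cell, count)
```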
The expectancy grid shown in Figure 4.8 can be used as an expectancy table simply by using
the frequencies in each cell. The interpretation of such information is simple and direct. For
example, of those students who scored above average on the fall aptitude test, none scored below
65 on the spring mathematics test, two of five scored between 65 and 74, and three of five scored
between 75 and 84. Of those who scored below average on the fall aptitude test, none scored in
the top category on the spring mathematics test, and four of five scored below 65. These
interpretations are limited to the group tested, but from such results one might make predictions
concerning future students. We can say, for example, that students who score above average on
the fall aptitude test will probably score above average on the spring mathematics test. Other
predictions can be made in the same way by noting the frequencies in each cell of the grid in Figure 4.8.

Figure 4.8
Expectancy grid showing how scores on the fall aptitude test and spring
mathematics test are tallied in appropriate cells (from data in Table 4.3)

More commonly, the figures in an expectancy
table are expressed in percentages, which can be readily obtained from the grid by converting each
cell frequency to a percentage of the total number of tallies in its row. This has been done for the
data in Figure 4.8, and the results are presented in Table 4.4. The first row of the table shows that
of the five students who scored above average on the fall aptitude test, 40% (two students) scored
between 65 and 74 on the spring math test, and 60% (three students) scored between 75 and 84. The
remaining rows should be read in a similar manner. The use of percentages makes the figures in each
row and column comparable. Our predictions then can be made in standard terms (i.e., chances out
of 100) for all score levels. Our interpretation is apt to be a little more clear if we say that Maria's
chances of being in the top group on the criterion measure are 60 of 100 and that Jim's are only 10
of 100 than if we say that Maria's chances are 3 of 5 and Jim's are 1 of 10.
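
Converting a grid to the percentage form of Table 4.4 involves nothing more than dividing each cell frequency by its row total. In the sketch below, the first and last rows follow the frequencies described above for the above-average and below-average groups; the middle row and the exact row labels are hypothetical assumptions.

```python
# A sketch of converting expectancy-grid frequencies to row percentages.
# The middle row and the labels are hypothetical; they are not Table 4.4.
rows = {
    "above average": {"below 65": 0, "65-74": 2, "75-84": 3},
    "average":       {"below 65": 3, "65-74": 5, "75-84": 2},
    "below average": {"below 65": 4, "65-74": 1, "75-84": 0},
}

for predictor_level, cells in rows.items():
    total = sum(cells.values())
    percents = {band: round(100 * count / total) for band, count in cells.items()}
    print(predictor_level, percents)
```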
Expectancy tables take varied forms and may be used to show the relation between different types
of measures. The number of categories used with the predictor, or criterion, may be as few as two
or as many as seem desirable. Also, the predictor may be any set of measures useful in predicting,
and the criterion may be course grades, ratings, test scores, or whatever measure of success is relevant. For
example, expectancy tables are frequently used in predicting the grade-point averages of college
freshmen by combining data from high school grades and an admissions test such as the Scholastic
Aptitude Test (SAT) or the American College Testing Program's ACT. An example of such an
expectancy table is shown in Table 4.5. When interpreting expectancy tables based on a small
number of cases, such as Mr. Tanaka's class of 20 students, our predictions should be regarded as
highly tentative. Each percentage is based on so few students that we can expect large fluctuations
in these figures from one group of students to another. It is frequently possible to increase the
number of students represented in the table by combining test results from several classes. When
we do this, our percentages are, of course, much more stable, and our predictions can be made with
greater confidence. In any event, expectancy tables provide a simple and direct means of
indicating the predictive value of test results.
Another commonly used approach to obtain predicted or estimated criterion performance from
scores on an assessment is by means of regression equations. The use of regression equations to
obtain estimates will not be developed here, but it is illustrated in Appendix A.
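
Although the details are left to Appendix A, the general idea is that a line relating predictor and criterion scores yields an estimated criterion score for any predictor score. The sketch below illustrates only this general idea with hypothetical data and a simple least-squares fit; it is not the procedure presented in Appendix A.

```python
# A rough sketch of estimating a criterion score from a predictor score with a
# least-squares line. Data are hypothetical, not taken from Table 4.3.
def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

aptitude = [119, 104, 96, 88, 112, 92, 107, 85]   # hypothetical fall aptitude scores
math_scores = [77, 68, 71, 60, 80, 58, 73, 55]    # hypothetical spring math scores

b, a = fit_line(aptitude, math_scores)
predicted = a + b * 110   # estimated spring score for a fall aptitude score of 110
print(round(predicted, 1))
```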

The "Criterion" Problem


In a criterion-related validation study, a major problem is obtaining a satisfactory criterion of
success. Remember that Mr. Tanaka used a comprehensive departmental examination as the
criterion of success in his seventh-grade mathematics class and that Mrs. Valencia used her own
ratings of the students' study skills. In each instance, the criterion of success was only partially
suitable as a basis for test validation. Mr. Tanaka recognized that the departmental examination
did not measure all the important learning outcomes that he aimed at in teaching mathematics.
There was not nearly enough emphasis on problem solving or mathematical reasoning, the
interpretation of graphs and charts was sadly neglected, and, of course, the test did not evaluate
the students' attitudes toward mathematics (which Mr. Tanaka considered to be extremely
important). Likewise, Mrs. Valencia was well aware of the shortcomings of her rating of students'
study skills. She sensed that some students "put on a show" when they knew they were being
observed and that other students were probably overrated on study skills because of their high
achievement in class work. Despite these recognized shortcomings, both Mr. Tanaka and Mrs.
Valencia found it necessary to use these criterion measures because they were the best available.
The plights of Mr. Tanaka and Mrs. Valencia in finding a suitable criterion of success for test
validation are not unusual. For most educational purposes, there is no entirely satisfactory criterion
of success. Those used tend to be lacking in comprehensiveness and in most cases produce results
that are less stable than those of the test being validated. The lack of a completely satisfactory
criterion measure makes content and construct considerations all the more important.

CONSIDERATION OF CONSEQUENCES
Messick (1989, 1994) has argued persuasively that an overall judgment regarding the validity
of particular uses and interpretations of assessment results requires an evaluation of the
consequences of those uses and interpretations. Assessments are intended to contribute to
improved student learning. The question is, Do they? And, if so, to what extent? What impact do
assessments have on teaching? What are the possibly negative, unintended consequences of a
particular use of assessment results?
The expansion of the concept of validity to include consideration of the consequences of use
and interpretation of assessment results has been especially important in the recent movement
toward more authentic, performance-based approaches to assessment. Several proponents of these
alternative forms of assessment have argued that holding teachers accountable for test results
has had unintended negative effects. They argue that the high stakes associated with test results lead
teachers to focus narrowly on what is on the test while ignoring important parts of the curriculum
not covered by the test.
In some instances, a single form of a standardized test (that is, a single set of multiple-choice
questions) is used year after year in a state or school district. In such a situation, the narrowing
may be worse than simply focusing on the domain of skills that the test is intended to measure. It
may lead to teaching the specific content of the test items. Not only would this be an undesirable
narrowing of what is taught, but also it would likely inflate test scores and change the meaning of
the results, possibly changing the construct measured from problem solving to memorization
ability.
Although the negative effects are most serious where high stakes are attached to the results
obtained on a single test form used year after year, the concern about consequences is not limited
to such situations. Distortions of instruction could be expected even if a new test form were used
each year because of a lack of alignment of the test to the learning objectives emphasized in the
curriculum. For example, a curriculum that emphasized problem solving and conceptual
understanding could be undermined by holding teachers accountable for student scores on a test
that emphasized low-level skills and factual knowledge. Drill and practice on the skills and facts
emphasized on the test might raise scores but would not facilitate achievement of the primary
learning goals.
Considerations of consequences are equally important for performance-based assessment. It
is just as important to attend to both the intended positive effects and the possible unintended
negative effects for performance-based assessments as it is for standardized, multiple-choice tests.
For both types of assessment, consequences are directly related to the stakes that are attached to
the results. As the stakes increase for teachers or students, so too should the demands for evidence
regarding consequences of the uses and interpretations of results.
An adequate consideration of consequences needs to include both intended consequences
(e.g., contributes to learning, increases student motivation) and unintended consequences (e.g.,
narrows the curriculum, increases the number of high school dropouts). Consequences are
particularly important where assessment results are used to make high-stakes decisions regarding
individuals (e.g., retention in grade, assignment to a remedial instructional program, or the award
of a high school diploma).
The use of an assessment to retain students in grade is not recommended, but it is a way in
which assessments are sometimes used and may be used to illustrate the impact of consequences
on the overall validation. An analysis of consequences of the use of an assessment for retention in
grade should seek evidence that students who are retained are generally better off than they would
have been if they had not been retained. Such evidence might come from a comparison of the
subsequent achievement and attitudes of students just below and just above the minimum score
required for advancement in grade. A positive finding would require that students who were just
below the cut score and therefore retained in grade eventually outperformed their counterparts who
were just above the cut score and therefore promoted in grade. The opposite finding would provide
evidence that the use of assessment to retain students had negative consequences and would
contribute to a judgment that grade retention was not a valid use of this assessment. It would also
be important to compare long-term results, such as dropout rates and eventual graduation rates.
For a classroom assessment, studies of consequences obviously would be much less elaborate.
Indeed, considerations of consequences generally would be limited to a logical analysis of likely
effects.
Teachers have an excellent vantage point for considering the likely effects of assessments.
First, they know the learning objectives that they are trying to help their students achieve. Second,
they are quite familiar with instructional experiences that the students have had. Third, they have
an opportunity to observe students while they are working on an assessment task and to talk to
students about their performances. This firsthand awareness of learning objectives, instructional
experiences, and students can be brought to bear on an analysis of the likely effects of assessments
by systematically considering questions such as the following:
1. Do the tasks match important learning objectives? WYTIWYG (What You Test Is What You
Get) has become a popular slogan. Despite the fact that it is an oversimplification, it is a good
reminder that assessments need to reflect major learning outcomes. Problem-solving skills and
complex thinking skills requiring integration, evaluation, and synthesis of information are more
likely to be fostered by assessments that require the application of such skills than by assessments
that require students merely to repeat what the teacher has said or what is stated in the textbook.
2. Is there reason to believe that students study harder in preparation for the assessment?
Motivating student effort is a potentially important consequence of tests and assessments. The
chances of achieving this goal are improved if students have a clear understanding of what to
expect on the assessment, know how the results will be used, and believe that the assessment will
be fair.
3. Does the assessment artificially constrain the focus of students' study? If it is judged
important, for example, to be sure that students can solve a particular type of mathematics
problem, then it is reasonable to focus an assessment on that type of problem. However, much will
be missed if such an approach is the only mode of assessment. In many cases, the identification
of the nature of the problem may be at least as important as facility with application of a particular
formula or algorithm. Assessments that focus only on the latter skills are not likely to facilitate
development of problem identification skills.
4. Does the assessment encourage or discourage exploration and creative modes of expression?
Although it is important for students to know what to expect on an assessment and have a sense of
what to do to prepare for it, care should be taken to avoid overly narrow and artificial constraints
that will discourage students from exploring new ideas and concepts.

FACTORS INFLUENCING VALIDITY


Numerous factors tend to make assessment results invalid for their intended use. Some are
rather obvious and easily avoided. No teacher would think of measuring knowledge of social
studies with a mathematics assessment, nor would a teacher consider measuring problem-solving
skills in third-grade mathematics with an assessment designed for seventh graders. In both
instances, the assessment results would obviously be invalid. The factors influencing validity are
of this same general nature but much more subtle in character. For example, a teacher may
overload a social studies test with items concerning historical facts, making the scores less
valid as a measure of achievement in social studies. Or a third-grade teacher may select
appropriate mathematical problems for an assessment but use vocabulary in the problems and
directions that only the better readers are able to understand. The mathematics assessment then
becomes, in part, a reading assessment and reduces the validity of the results for their intended
use. These examples show some of the more subtle factors influencing validity to which the teacher
should be alert, whether constructing classroom assessments or selecting published ones.
