Language Assessment: Principles and Classroom Practice (pages 25-51)
This section shows how principles of language assessment can and should be applied to formal tests, with the ultimate recognition that these principles also apply to assessments of all kinds.
How do you know whether a test is effective, appropriate, useful, or, in down-to-earth terms, a good test?
Can it be given within appropriate administrative constraints? Is it dependable?
Does it accurately measure what you want it to measure?
Is the language in the test representative of real world language use? Does the
test provide information that is useful for the learner?
There are five cardinal criteria for "testing a test":
1- Practicality
2- Reliability
3- Validity
4- Authenticity
5- Washback
Practicality
A test that takes a few minutes for a student to take and several hours for an examiner to evaluate is impractical for most classroom situations.
A test that can be scored only by computer is impractical if no computer is available where and when the test is administered.
Reliability
A reliable test is consistent and dependable. If you give the same test to the same
student or matched students on two different occasions, the test should yield
similar results.
A reliable test:
- Is consistent in its conditions across two or more administrations
- Gives clear directions for scoring/evaluation
- Has uniform rubrics for scoring/evaluation
- Lends itself to consistent application of those rubrics by the scorer
- Contains items/tasks that are unambiguous to the test-taker
What is unreliability?
The issue of the reliability of tests can be better understood by considering a
number of factors that can contribute to their unreliability.
Fluctuations in test scores can stem from four possible sources:
a) The student
b) The scoring
c) The test administration
d) The test itself
Student-Related Reliability
The most common learner-related issue in reliability is caused by temporary illness, fatigue, a "bad day," anxiety, and other physical or psychological factors, which may make an observed score deviate from one's true score.
Also included in this category are such factors as a test-taker's "test-wiseness," or strategies for efficient test-taking.
Rater Reliability
Human error, subjectivity, and bias may enter into the scoring process. Inter-rater reliability is achieved when two or more scorers yield consistent scores for the same test. Failure to achieve inter-rater reliability could stem from lack of adherence to scoring criteria, inexperience, inattention, or even preconceived biases.
Rater-reliability issues are not limited to contexts in which two or more scorers are involved. Intra-rater reliability is an internal factor, a common concern for classroom teachers.
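The notes do not name a statistic for rater agreement, but one common way to quantify inter-rater reliability for categorical scores (e.g., pass/fail judgments on essays) is Cohen's kappa, which corrects raw agreement for chance. A minimal sketch, with hypothetical raters and scores:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' categorical scores."""
    n = len(rater_a)
    # Observed agreement: proportion of items both raters scored identically.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Sketch assumes the raters do not agree purely by chance (p_expected < 1).
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical pass/fail scores from two raters on ten essays
a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail", "pass", "fail"]
print(round(cohens_kappa(a, b), 2))  # 0.6
```

Values near 1 indicate strong agreement; values near 0 mean the raters agree no more often than chance, a signal that scoring criteria or rater training need attention.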
Test Administration Reliability
Unreliability can also stem from the conditions in which a test is administered. If, for example, there is noise outside the building, students sitting next to open windows may not hear aural stimuli accurately.
Other sources of unreliability are found in photocopying variations, the amount
of light in different parts of the room, variations in temperature, and even the
condition of desks and chairs.
Test Reliability
Sometimes the nature of the test itself can cause measurement errors. Tests with multiple-choice items must be carefully designed to include a number of characteristics that will guard against unreliability.
For example:
The items need to be evenly difficult, distractors need to be well designed, and
items need to be well distributed to make the test reliable.
In classroom based assessment, test unreliability can be caused by many factors,
including rater bias.
This typically occurs with subjective tests, with open ended responses (e.g., essay
responses) that require a judgment on the part of the teacher to determine correct
and incorrect answers.
Objective tests, in contrast, have predetermined fixed responses, a format which
of course increases their reliability.
Further unreliability may be caused by poorly written test items, that is, items that are ambiguous or that have more than one correct answer. Also, a test that contains too many items may ultimately cause test-takers to become fatigued by the time they reach the later items and respond incorrectly.
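The notes attribute test unreliability to item design but do not name a statistic; a standard measure of how consistently a test's items hang together is Cronbach's alpha. A minimal sketch, with hypothetical 0/1 item scores:

```python
def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def cronbach_alpha(item_scores):
    """Internal consistency of a test.

    item_scores: one inner list per item, each holding every student's
    score on that item (same student order in every list).
    """
    k = len(item_scores)
    sum_item_vars = sum(variance(item) for item in item_scores)
    # Each student's total score across all items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    return k / (k - 1) * (1 - sum_item_vars / variance(totals))

# Hypothetical data: 3 items, 5 students, 1 = correct, 0 = incorrect
items = [
    [1, 1, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
]
print(round(cronbach_alpha(items), 2))  # 0.79
```

Higher alpha (closer to 1) suggests the items measure the same underlying ability; very low alpha points to ambiguous or poorly written items of the kind described above.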
Validity
The most complex criterion of an effective test, and arguably the most important principle, is validity: the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment.
Samuel Messick defined validity as "an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on scores or other modes of assessment."
A valid test:
- Measures exactly what it proposes to measure
- Does not measure irrelevant or "contaminating" variables
- Relies as much as possible on empirical evidence (performance)
- Involves performance that samples the test's criterion (objective)
- Offers useful, meaningful information about a test-taker's ability
- Is supported by a theoretical rationale or argument
A valid test of reading ability actually measures reading ability.
To measure writing ability, one might ask students to write as many words as they can in 15 minutes and then simply count the words for the final score. Such a test would be easy to administer (practical), and its scoring would be quite dependable (reliable), but it would not constitute a valid test of writing ability without some consideration of comprehensibility, rhetorical discourse elements, and the like.
How is the validity of a test established?
There is no final, absolute measure of validity, but several different kinds of evidence may be invoked in support. As Messick emphasized, it is important to note that validity is a matter of degree, not all or none.
Statistical correlation with other related but independent measures is another widely accepted form of evidence. Other concerns about a test's validity may focus on the consequences of a test, beyond measuring the criteria themselves, or even on the test-taker's perception of validity.
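The "statistical correlation" mentioned here is typically a Pearson product-moment correlation between test scores and an independent measure of the same ability. A minimal sketch, with hypothetical score data:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance term: co-deviations of paired scores from their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: classroom test scores vs. an independent proficiency measure
test_scores = [55, 62, 70, 78, 85, 91]
independent = [50, 60, 65, 80, 82, 95]
print(round(pearson_r(test_scores, independent), 2))  # 0.99
```

A correlation near 1 between the test and an established independent measure supports (but does not by itself prove) the test's validity; validity remains a matter of degree.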
Content-Related Evidence
If a test actually samples the subject matter about which conclusions are to be
drawn, and if it requires the test-taker to perform the behavior that is being
measured, it can claim content-related evidence of validity, often popularly referred
to as content-related validity.
If a course has, say, 10 objectives but only two are covered in a test, then content validity suffers.
Another way of understanding content validity is to consider the difference
between direct and indirect testing.
Direct testing involves the test-taker in actually performing the target task.
In an indirect test, learners are not performing the task itself but rather a task that is related in some way. If you intend to test learners' oral production of syllable stress and your test task is to have learners mark (with written accent marks) stressed syllables in a list of written words, you could argue that you are indirectly testing their oral production.
A direct test of syllable production would require that students actually produce
target words orally.
Criterion-Related Evidence
A second form of evidence of the validity of a test may be found in what is called
criterion-related evidence, also referred to as criterion-related validity, or the extent
to which the criterion of the test has actually been reached.
In the case of teacher-made classroom assessments, criterion-related evidence is best demonstrated through a comparison of the results of an assessment with the results of some other measure of the same criterion.
Criterion-related evidence usually falls into one of two categories:
- Concurrent validity
- Predictive validity
A test has concurrent validity if its results are supported by other concurrent performance beyond the assessment itself. For example, the validity of a high score on the final exam of a foreign language course will be substantiated by actual proficiency in the language.
A test has predictive validity if its results accurately anticipate a test-taker's future performance; placement and admissions tests are typical cases.
Authenticity
The authenticity of test tasks has increased noticeably in recent years. Two or three decades ago, unconnected, boring, contrived items were accepted as a necessary component of testing.
Washback
A facet of consequential validity, discussed above, is the effect of testing on teaching and learning, otherwise known in the language assessment field as washback.
Messick (1996) reminded us that the washback effect may refer to both the promotion and the inhibition of learning, thus emphasizing what may be referred to as beneficial versus harmful (negative) washback.
The following factors characterize beneficial washback. A test that provides beneficial washback:
- Positively influences what and how teachers teach
- Positively influences what and how learners learn
- Offers learners a chance to adequately prepare
- Gives learners feedback that enhances their language development
- Is more formative in nature than summative
- Provides conditions for peak performance by the learner
In large-scale assessment, washback often refers to the effects that tests have on instruction in terms of how students prepare for the test.
"Cram" courses and "teaching to the test" are examples of washback that may have both negative and positive effects.
Teachers can provide information that "washes back" to students in the form of useful diagnoses of strengths and weaknesses.
Washback also includes the effects of an assessment on teaching and learning prior to the assessment itself.
Informal performance assessment is by nature more likely to have built-in washback effects, because the teacher is usually providing interactive feedback.
Formal tests can also have positive washback, but they provide no beneficial washback if students receive only a simple letter grade or a single overall numerical score.
The challenge to teachers is to create classroom tests that serve as learning devices through which washback is achieved. Students' incorrect responses can become windows of insight into further work.
Teachers can suggest strategies for success as part of their coaching role.
Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment, among others.
Another viewpoint on washback is achieved by a quick consideration of the differences between formative and summative tests.
1- Formative tests by definition provide washback in the form of information to the learner on progress toward goals.
2- Summative tests, which provide assessment at the end of a course or program, do not need to offer much in the way of washback.
Finally, washback implies that students have ready access to you to discuss the feedback and evaluation you have given.
Whereas you have almost certainly known teachers with whom you wouldn't dare argue about a grade, an interactive, cooperative, collaborative classroom can promote an atmosphere of dialogue between students and teachers regarding evaluative judgments.
Notice: you can find the answers to these eight questions on pages 40 to 48 of the book.
The end
Best Wishes!
Saeed Mojarradi
1393-08-18