Language Assessment: Principles and Classroom Practice (pages 25-51)
This section shows how principles of language assessment can and should be applied to formal tests, with the ultimate recognition that these principles also apply to assessments of all kinds.
How do you know whether a test is effective, appropriate, useful, or, in down-to-earth terms, a good test?
Can it be given within appropriate administrative constraints? Is it dependable?
Does it accurately measure what you want it to measure?
Is the language in the test representative of real world language use? Does the
test provide information that is useful for the learner?
There are five cardinal criteria for "testing a test":
1- Practicality
2- Reliability
3- Validity
4- Authenticity
5- Washback
Practicality
A test that takes a few minutes for a student to take and several hours for an examiner to evaluate is impractical for most classroom situations.
A test that can be scored only by computer is impractical if no computer is available where and when the test is administered.
Reliability
A reliable test is consistent and dependable. If you give the same test to the same
student or matched students on two different occasions, the test should yield
similar results.
A reliable test:
- Is consistent in its conditions across two or more administrations
- Gives clear directions for scoring/evaluation
- Has uniform rubrics for scoring/evaluation
- Lends itself to consistent application of those rubrics by the scorer
- Contains items/tasks that are unambiguous to the test-taker
What is unreliability?
The issue of the reliability of tests can be better understood by considering a
number of factors that can contribute to their unreliability.
Fluctuations in test scores can stem from four possible sources:
a) The student
b) The scoring
c) The test administration
d) The test itself
Student-Related Reliability
The most common learner-related issue in reliability is caused by temporary illness, fatigue, a "bad day," anxiety, and other physical or psychological factors, which may make an observed score deviate from one's true score.
Also included in this category are such factors as a test-taker's "test-wiseness," or strategies for efficient test-taking.
Rater Reliability
Human error, subjectivity, and bias may enter into the scoring process. Inter-rater reliability is achieved when two or more scorers yield consistent scores for the same test. Failure to achieve inter-rater reliability could stem from lack of adherence to scoring criteria, inexperience, inattention, or even preconceived biases.
Rater-reliability issues are not limited to contexts in which two or more scorers are involved. Intra-rater reliability is an internal factor, a common concern for classroom teachers.
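The notes do not name a statistic for rater agreement, but one common way to quantify inter-rater reliability for categorical scores (e.g., pass/fail judgments on essays) is Cohen's kappa, which corrects raw agreement for chance. A minimal sketch, with hypothetical raters and scores:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' categorical scores."""
    n = len(rater_a)
    # Observed agreement: proportion of items both raters scored identically.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Sketch assumes the raters do not agree purely by chance (p_expected < 1).
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical pass/fail scores from two raters on ten essays
a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail", "pass", "fail"]
print(round(cohens_kappa(a, b), 2))  # 0.6
```

Values near 1 indicate strong agreement; values near 0 mean the raters agree no more often than chance, a signal that scoring criteria or rater training need attention.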
Test Administration Reliability
Unreliability can also stem from the conditions in which a test is administered. If, for example, there is noise outside the building, students sitting next to open windows may not hear aural stimuli accurately.
Other sources of unreliability are found in photocopying variations, the amount
of light in different parts of the room, variations in temperature, and even the
condition of desks and chairs.
Test Reliability
Sometimes the nature of the test itself can cause measurement errors. Tests with multiple-choice items must be carefully designed to include a number of characteristics that will guard against unreliability.
For example:
The items need to be evenly difficult, distractors need to be well designed, and
items need to be well distributed to make the test reliable.
In classroom based assessment, test unreliability can be caused by many factors,
including rater bias.
This typically occurs with subjective tests, with open ended responses (e.g., essay
responses) that require a judgment on the part of the teacher to determine correct
and incorrect answers.
Objective tests, in contrast, have predetermined fixed responses, a format which
of course increases their reliability.
Further unreliability may be caused by poorly written test items, that is, items that are ambiguous or that have more than one correct answer. Also, a test that contains too many items may ultimately cause test-takers to become fatigued by the time they reach the later items and respond incorrectly.
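The notes attribute test unreliability to item design but do not name a statistic; a standard measure of how consistently a test's items hang together is Cronbach's alpha. A minimal sketch, with hypothetical 0/1 item scores:

```python
def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def cronbach_alpha(item_scores):
    """Internal consistency of a test.

    item_scores: one inner list per item, each holding every student's
    score on that item (same student order in every list).
    """
    k = len(item_scores)
    sum_item_vars = sum(variance(item) for item in item_scores)
    # Each student's total score across all items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    return k / (k - 1) * (1 - sum_item_vars / variance(totals))

# Hypothetical data: 3 items, 5 students, 1 = correct, 0 = incorrect
items = [
    [1, 1, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
]
print(round(cronbach_alpha(items), 2))  # 0.79
```

Higher alpha (closer to 1) suggests the items measure the same underlying ability; very low alpha points to ambiguous or poorly written items of the kind described above.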
Validity
The most complex criterion of an effective test, and arguably the most important principle, is validity: the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment.
Samuel Messick defined validity as "an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on scores or other modes of assessment."
A valid test:
- Measures exactly what it proposes to measure
- Does not measure irrelevant or "contaminating" variables
- Relies as much as possible on empirical evidence (performance)
- Involves performance that samples the test's criterion (objective)
- Offers useful, meaningful information about a test-taker's ability
- Is supported by a theoretical rationale or argument
A valid test of reading ability actually measures reading ability.
To measure writing ability, one might ask students to write as many words as they can in 15 minutes and then simply count the words for the final score. Such a test would be easy to administer (practical), and its scoring would be quite dependable (reliable), but it would not constitute a valid test of writing ability without some consideration of comprehensibility, rhetorical discourse elements, and the like.
How is the validity of a test established?
There is no final, absolute measure of validity, but several different kinds of evidence may be invoked in support. As Messick emphasized, it is important to note that validity is a matter of degree, not all or none.
Statistical correlation with other related but independent measures is another widely accepted form of evidence. Other concerns about a test's validity may focus on the consequences of a test, beyond measuring the criteria themselves, or even on the test-taker's perception of validity.
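The "statistical correlation" mentioned here is typically a Pearson product-moment correlation between test scores and an independent measure of the same ability. A minimal sketch, with hypothetical score data:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance term: co-deviations of paired scores from their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: classroom test scores vs. an independent proficiency measure
test_scores = [55, 62, 70, 78, 85, 91]
independent = [50, 60, 65, 80, 82, 95]
print(round(pearson_r(test_scores, independent), 2))  # 0.99
```

A correlation near 1 between the test and an established independent measure supports (but does not by itself prove) the test's validity; validity remains a matter of degree.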
Content-Related Evidence
If a test actually samples the subject matter about which conclusions are to be
drawn, and if it requires the test-taker to perform the behavior that is being
measured, it can claim content-related evidence of validity, often popularly referred
to as content-related validity.
If a course has, say, 10 objectives but only two are covered in a test, then content validity suffers.
Another way of understanding content validity is to consider the difference
between direct and indirect testing.
Direct testing involves the test-taker in actually performing the target task.
In an indirect test, learners are not performing the task itself but rather a task that is related in some way. If you intend to test learners' oral production of syllable stress and your test task is to have learners mark (with written accent marks) stressed syllables in a list of written words, you could argue that you are indirectly testing their oral production.
A direct test of syllable production would require that students actually produce
target words orally.
Criterion-Related Evidence
A second form of evidence of the validity of a test may be found in what is called
criterion-related evidence, also referred to as criterion-related validity, or the extent
to which the criterion of the test has actually been reached.
In the case of teacher-made classroom assessments, criterion-related evidence is best demonstrated through a comparison of the results of an assessment with the results of some other measure of the same criterion.
Criterion-related evidence usually falls into one of two categories:
- Concurrent validity
- Predictive validity
A test has concurrent validity if its results are supported by other concurrent performance beyond the assessment itself. For example, the validity of a high score on the final exam of a foreign language course will be substantiated by actual proficiency in the language.
A test has predictive validity if its results accurately anticipate a test-taker's future performance; placement and admissions tests are typical cases.
Authenticity
The authenticity of test tasks has increased noticeably in recent years. Two or three decades ago, unconnected, boring, contrived items were accepted as a necessary component of testing.
Washback
A facet of consequential validity, discussed above, is the effect of testing on teaching and learning, otherwise known in the language assessment field as washback.
Messick (1996) reminded us that the washback effect may refer to both the promotion and the inhibition of learning, thus emphasizing what may be referred to as beneficial versus harmful (negative) washback.
The following factors characterize beneficial washback. A test that provides beneficial washback:
- Positively influences what and how teachers teach
- Positively influences what and how learners learn
- Offers learners a chance to adequately prepare
- Gives learners feedback that enhances their language development
- Is more formative in nature than summative
- Provides conditions for peak performance by the learner
In large-scale assessment, washback often refers to the effects that tests have on instruction in terms of how students prepare for the test.
"Cram" courses and "teaching to the test" are examples of washback that may have both negative and positive effects.
Teachers can provide information that "washes back" to students in the form of useful diagnoses of strengths and weaknesses.
Washback also includes the effects of an assessment on teaching and learning prior to the assessment itself.
Informal performance assessment is by nature more likely to have built-in washback effects, because the teacher is usually providing interactive feedback.
Formal tests can also have positive washback, but they provide no beneficial washback if students receive only a simple letter grade or a single overall numerical score.
The challenge to teachers is to create classroom tests that serve as learning devices through which washback is achieved. Students' incorrect responses can become windows of insight into further work.
Teachers can suggest strategies for success as part of their coaching role.
Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment, among others.
Another viewpoint on washback is achieved by a quick consideration of the differences between formative and summative tests.
1- Formative tests by definition provide washback in the form of information to the learner on progress toward goals.
2- Summative tests, which provide assessment at the end of a course or program, do not need to offer much in the way of washback.
Finally, washback implies that students have ready access to you to discuss the feedback and evaluation you have given.
Whereas you have almost certainly known teachers with whom you wouldn't dare argue about a grade, an interactive, cooperative, collaborative classroom can promote an atmosphere of dialogue between students and teachers regarding evaluative judgments.
Notice: you can find the answers to these eight questions on pages 40 to 48 of the book.
The end
Best Wishes!
Saeed Mojarradi
1393-08-18