
Language Testing and Assessment

1. LANGUAGE ASSESSMENT ELT Teacher Training Tarık İNCE


2. CHAPTER 1 TESTING, ASSESSING, AND TEACHING
3. In an era of communicative language teaching, tests should measure up to standards of authenticity and meaningfulness. Ts should design tests that serve as motivating learning experiences rather than anxiety-provoking threats. Tests: should be positive experiences; should build a person's confidence and become learning experiences; should bring out the best in students; shouldn't be degrading; shouldn't be artificial; shouldn't be anxiety-provoking. Language assessment aims to create more authentic, intrinsically motivating assessment procedures that are appropriate for their context and designed to offer constructive feedback to sts.
4. What is a test? A test is a method of measuring a person's ability, knowledge, or performance in a given domain. 1. Method: a set of techniques, procedures, or items. To qualify as a test, the method must be explicit and structured. For example: multiple-choice questions with prescribed correct answers; a writing prompt with a scoring rubric; an oral interview based on a question script and a checklist of expected responses to be filled in by the administrator. 2. Measure: a means of offering the test-taker some kind of result. If an instrument does not specify a form of reporting measurement, then that technique cannot be defined as a test. Scoring may take forms like the following: a classroom-based short-answer essay test may earn the test-taker a letter grade accompanied by the instructor's marginal comments; large-scale standardized tests provide a total numerical score, a percentile rank, and perhaps some sub-scores.
5. 3. The test-taker (the individual): the person who takes the test. Testers need to understand who the test-takers are, what their previous experience and background is, whether the test is appropriately matched to their abilities, and how test-takers should interpret their scores. 4. Performance: a test measures performance, but the results imply the test-taker's ability or competence. Some language tests measure one's ability to perform language: to speak, write, read, or listen to a subset of language. Others measure a test-taker's knowledge about language: defining a vocabulary item, reciting a grammatical rule, or identifying a rhetorical feature in written discourse.
6. 5. Measuring a given domain: this means measuring the desired criterion and not including other factors. Proficiency tests: even though the actual performance on the test involves only a sampling of skills, the domain is overall proficiency in a language – general competence in all skills of a language. Classroom-based performance tests: these have more specific criteria. For example, a test of pronunciation might well be a test of only a limited set of phonemic minimal pairs; a vocabulary test may focus on only the set of words covered in a particular lesson. A well-constructed test is an instrument that provides an accurate measure of the test-taker's ability within a particular domain.
7. TESTING, ASSESSMENT & TEACHING TESTS are prepared administrative procedures that occur at identifiable times in a curriculum. When tested, learners know that their performance is being measured and evaluated, and they muster all their faculties to offer peak performance. Tests are a subset of assessment: they are only one among many procedures and tasks that teachers can ultimately use to assess students. Tests are usually time-constrained (usually spanning a class period or at most several hours) and draw on a limited sample of behaviour. ASSESSMENT is an ongoing process that encompasses a much wider domain. A good teacher never ceases to assess students, whether those assessments are incidental or intended. Whenever a student responds to a question, offers a comment, or tries out a new word or structure, the teacher subconsciously makes an assessment of the student's performance. Assessment includes testing; it is more extended and includes many more components.
8. What about TEACHING? For optimal learning to take place, learners must have opportunities to "play" with language without being formally graded. Teaching sets up the practice games of language learning: the opportunities for learners to listen, think, take risks, set goals, and process feedback from the teacher (coach), and then recycle through the skills that they are trying to master. During these practice activities, teachers are indeed observing students' performance and making various evaluations of each learner. It can therefore be said that testing and assessment are subsets of teaching.
9. ASSESSMENT Informal Assessment: incidental, unplanned comments and responses. Examples include: "Nice job!", "Well done!", "Good work!", "Did you say can or can't?", "Broke or break?", or putting a ☺ on some homework. Classroom tasks can also be designed to elicit performance without recording results or making fixed judgements about a student's competence. Examples of unrecorded assessment: marginal comments on papers, responding to a draft of an essay, advice about how to better pronounce a word, a suggestion for a strategy for compensating for a reading difficulty, and showing how to modify a student's note-taking to better remember the content of a lecture. Formal Assessment: exercises or procedures specifically designed to tap into a storehouse of skills and knowledge. They are systematic, planned sampling techniques constructed to give Ts and sts an appraisal of student achievement. They are the tournament games that occur periodically in the course of teaching. It can be said that all tests are formal assessments, but not all formal assessment is testing. Example 1: a student's journal or portfolio of materials can be used as a formal assessment of attainment of certain course objectives, but it is problematic to call those two procedures "tests". Example 2: a systematic set of observations of a student's frequency of oral participation in class is certainly a formal assessment, but not a "test".
10. THE FUNCTION OF AN ASSESSMENT Formative Assessment: evaluating students in the process of "forming" their competencies and skills, with the goal of helping them to continue that growth process. It provides for the ongoing development of the learner's language. Example: when you give sts a comment or a suggestion, or call attention to an error, that feedback is offered to improve the learner's language ability. Virtually all kinds of informal assessment are formative. Summative Assessment: it aims to measure, or summarize, what a student has grasped, and typically occurs at the end of a course. It does not necessarily point the way to future progress. Examples: final exams in a course and general proficiency exams. All tests/formal assessments (quizzes, periodic review tests, midterm exams, etc.) are summative.
11. IMPORTANT: As far as summative assessment is concerned, in the aftermath of any test, students tend to think, "Whew! I'm glad that's over. Now I don't have to remember that stuff anymore!" An ideal teacher should try to change this attitude among students. A teacher should: · instill a more formative quality into his lessons; · offer students an opportunity to convert tests into "learning experiences".
12. TESTS Norm-Referenced Tests: each test-taker's score is interpreted in relation to a mean (average score), median (middle score), standard deviation (extent of variance in scores), and/or percentile rank (a computational sketch follows this slide). The purpose is to place test-takers along a mathematical continuum in rank order. Scores are usually reported back to the test-taker in the form of a numerical score (230 out of 300, 84%, etc.). Typical of these tests are standardized tests like the SAT, TOEFL, ÜDS, KPDS, etc. These tests are intended to be administered to large audiences, with results efficiently disseminated to test-takers. They must have fixed, predetermined responses in a format that can be scored quickly at minimum expense: money and efficiency are primary concerns in these tests. Criterion-Referenced Tests: they are designed to give test-takers feedback, usually in the form of grades, on specific course or lesson objectives. Tests that involve the sts in only one class, and are connected to a curriculum, are criterion-referenced tests. Much time and effort on the part of the teacher are required to deliver useful, appropriate feedback to students. The distribution of students' scores across a continuum may be of little concern as long as the instrument assesses appropriate objectives. For classroom-based testing, as opposed to standardized large-scale testing, criterion-referenced testing is of more prominent interest than norm-referenced testing.
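As a rough illustration of the statistics a norm-referenced report draws on, here is a minimal sketch in Python. The scores are hypothetical, and the percentile-rank convention shown is only one of several in use:

```python
# Minimal sketch, assuming hypothetical raw scores: the figures a
# norm-referenced report typically includes (mean, median, standard
# deviation, percentile rank).
import statistics

scores = [230, 245, 210, 260, 225, 240, 255, 235, 220, 250]  # hypothetical

mean = statistics.mean(scores)      # average score
median = statistics.median(scores)  # middle score
stdev = statistics.stdev(scores)    # extent of variance in scores

def percentile_rank(score, all_scores):
    """Percentage of test-takers scoring at or below the given score
    (one common convention among several)."""
    at_or_below = sum(1 for s in all_scores if s <= score)
    return 100 * at_or_below / len(all_scores)

print(f"mean={mean:.1f}  median={median}  stdev={stdev:.1f}")
print(f"245 is at the {percentile_rank(245, scores):.0f}th percentile")
```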
13. Approaches to Language Testing: A Brief History Historically, language-testing trends have followed the trends of teaching methods. During the 1950s: an era of behaviourism and special attention to contrastive analysis. Testing focused on specific language elements such as phonological, grammatical, and lexical contrasts between two languages. During the 1970s and 80s: communicative theories were widely accepted, bringing a more integrative view of testing. Today: test designers are trying to form authentic, valid instruments that simulate real-world interaction.
14. APPROACHES TO LANGUAGE TESTING A) Discrete-Point Testing: language can be broken down into its component parts, and those parts can be tested successfully. Component parts: listening, speaking, reading, and writing. Units of language (discrete points): phonology, graphology, morphology, lexicon, syntax, and discourse. A language proficiency test should sample all four skills and as many linguistic discrete points as possible. B) Integrative Testing: language competence is a unified set of interacting abilities that cannot be tested separately. Communicative competence is so global and requires such integration that it cannot be captured in additive tests of grammar, reading, vocabulary, and other discrete points of language. Two types of tests are examples of integrative tests: the cloze test and dictation. The unitary trait hypothesis suggests an "indivisible" view of language proficiency: that vocabulary, grammar, phonology, the "four skills", and other discrete points of language cannot be disentangled. In the face of evidence that, in one study, each student scored differently in various skills depending on his background, country, and major field, Oller admitted that the "unitary trait hypothesis was wrong."
15. Cloze Test: cloze test results are good measures of overall proficiency. The ability to supply appropriate words in blanks requires a number of abilities that lie at the heart of competence in a language: knowledge of vocabulary, grammatical structure, discourse structure, and reading skills and strategies. It was argued that successful completion of cloze items taps into all of those abilities, which were said to be the essence of global language proficiency (a minimal generator is sketched after this slide). Dictation: essentially, learners listen to a passage of 100 to 150 words read aloud by an administrator (or on tape) and write what they hear, using correct spelling. Supporters argue that dictation is an integrative test because success on a dictation requires careful listening, reproduction in writing of what is heard, efficient short-term memory, and, to an extent, expectancy rules to aid the short-term memory.
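The fixed-ratio ("every nth word") format of a cloze test is mechanical enough to sketch in a few lines. This is a minimal illustration, not a recipe from the text; a real cloze passage would be prepared more carefully (e.g. leaving the first sentence intact and handling punctuation attached to words):

```python
# Minimal sketch (an assumption, not from the text): a fixed-ratio cloze
# generator that blanks out every nth word of a passage.
def make_cloze(passage: str, n: int = 7):
    """Replace every nth word with a numbered blank; return the test
    text and the answer key."""
    words = passage.split()
    answers = []
    for i in range(n - 1, len(words), n):
        answers.append(words[i])          # record the deleted word
        words[i] = f"__({len(answers)})__"
    return " ".join(words), answers

text = ("Cloze tests are said to tap knowledge of vocabulary, grammatical "
        "structure, discourse structure, and reading skills and strategies, "
        "all of which contribute to overall proficiency in a language.")
cloze, key = make_cloze(text, n=7)
print(cloze)
print("Answer key:", key)
```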
16. C) Communicative Language Testing (a recent approach, after the mid-1980s) What does it criticise? In order for a particular language test to be useful for its intended purposes, test performance must correspond in demonstrable ways to language use in non-test situations. Integrative tests such as cloze only tell us about a candidate's linguistic competence; they do not tell us anything directly about a student's performance ability (knowledge about a language, not the use of language). Any suggestion? A quest for authenticity, as test designers centered on communicative performance. The supporters emphasized the importance of strategic competence (the ability to employ communicative strategies to compensate for breakdowns as well as to enhance the rhetorical effect of utterances) in the process of communication. Any problem in using this approach? Yes, communicative testing presented challenges to test designers, because they began to identify the real-world tasks that language learners were called upon to perform. It was clear that the contexts for those tasks were extraordinarily varied and that the sampling of tasks for any one assessment procedure needed to be validated by what language users actually do with language. As a result, the assessment field became more and more concerned with the authenticity of tasks and the genuineness of texts.
17. D) Performance-Based Assessment Performance-based assessment of language typically involves oral production, written production, open-ended responses, integrated performance (across skill areas), group performance, and other interactive tasks. Any problems? It is time-consuming and expensive, but those extra efforts pay off in more direct testing, because sts are assessed as they perform actual or simulated real-world tasks. The advantage of this approach? Higher content validity is achieved because learners are measured in the process of performing the targeted linguistic acts. Importantly, performance-based assessment means that Ts should rely a little less on formally structured tests and a little more on evaluation while sts are performing various tasks. In performance-based assessment, interactive tests (speaking, requesting, responding, etc.) are IN ☺ and paper-and-pencil tests are OUT. Result: test tasks can approach the authenticity of real-life language use.
18. CURRENT ISSUES IN CLASSROOM TESTING The design of communicative, performance-based assessment continues to challenge both assessment experts and classroom teachers. There are three issues which are helping to shape our current understanding of effective assessment: · the effect of new theories of intelligence on the testing industry; · the advent of what has come to be called "alternative assessment"; · the increasing popularity of computer-based testing. New Views on Intelligence In the past: intelligence was viewed strictly as the ability to perform linguistic and logical-mathematical problem solving. For many years, we have lived in a world of standardized, norm-referenced tests that are timed, in a multiple-choice format, consisting of a multiplicity of logic-constrained items, many of which are inauthentic. We were relying on timed, discrete-point, analytical tests to measure language, forced to stay within the limits of objectivity and to give impersonal responses.
19. Recently: spatial intelligence, musical intelligence, bodily-kinesthetic intelligence, interpersonal intelligence, intrapersonal intelligence, and EQ (Emotional Quotient) underscore the role of emotions in our cognitive processing. Those who manage their emotions tend to be more capable of fully intelligent processing, because anger, grief, resentment, and other feelings can easily impair peak performance in everyday tasks as well as higher-order problem solving. The intuitive appeal of these conceptualizations of intelligence infused the 1990s with a sense of both freedom and responsibility in our testing agenda. Our challenge now is to test interpersonal, creative, communicative, interactive skills, and in doing so to place some trust in our subjectivity and intuition.
20. Traditional and "Alternative" Assessment
Traditional Assessment: one-shot, standardized exams; timed, multiple-choice format; decontextualized test items; scores suffice for feedback; norm-referenced scores; focus on the "right" answer; summative; oriented to product; non-interactive process; fosters extrinsic motivation.
Alternative Assessment: continuous long-term assessment; untimed, free-response format; contextualized communicative tasks; individualized feedback and washback; criterion-referenced scores; open-ended, creative answers; formative; oriented to process; interactive process; fosters intrinsic motivation.
21. IMPORTANT: It is difficult to draw a clear line of distinction between traditional and alternative assessment. Many forms of assessment fall in between the two, and some combine the best of both. More time and higher institutional budgets are required to administer and score assessments that presuppose more subjective evaluation, more individualization, and more interaction in the process of offering feedback. But the payoff of "alternative assessment" comes with more useful feedback to students, the potential for intrinsic motivation, and ultimately a more complete description of a student's ability.
22. Computer-Based Testing Some computer-based tests are small-scale; others are standardized, large-scale tests (e.g. the TOEFL) in which thousands of test-takers are involved. One type of computer-based test, the Computer-Adaptive Test (CAT), is also available. In a CAT, the test-taker sees only one question at a time, and the computer scores each question before selecting the next one (a minimal sketch of such a loop follows below). Test-takers cannot skip questions, and, once they have entered and confirmed their answers, they cannot return to questions. Advantages of computer-based testing: classroom-based testing; self-directed testing on various aspects of a language (vocabulary, grammar, discourse, etc.); practice for upcoming high-stakes standardized tests; some individualization, in the case of CATs; electronic scoring for rapid reporting of results. Disadvantages of computer-based testing: lack of security and the possibility of cheating in unsupervised computerized tests; home-grown quizzes may be mistaken for validated assessments; open-ended responses are less likely to appear because of the need for human scorers; the human interactive element is absent.
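The select-score-adapt loop described above can be sketched as follows. This is a toy illustration of the general CAT idea, not the algorithm of the TOEFL or any real test: the item pool, step size, and fixed test length are all hypothetical, and operational CATs use IRT-based ability estimates rather than a simple up/down rule.

```python
# Minimal sketch, assuming a hypothetical item pool: a computer-adaptive
# loop in which each answer is scored before the next item is chosen,
# moving to a harder item after a correct response and an easier one
# after an incorrect response. No skipping, no going back.
import random

item_pool = {d: [f"item-{d}-{i}" for i in range(5)] for d in range(1, 6)}  # difficulty 1-5

def run_cat(answer_correctly, num_items=8):
    difficulty, administered, score = 3, [], 0   # start at mid difficulty
    for _ in range(num_items):
        item = random.choice(item_pool[difficulty])
        administered.append(item)
        correct = answer_correctly(item, difficulty)  # scored immediately
        if correct:
            score += 1
            difficulty = min(5, difficulty + 1)       # next item is harder
        else:
            difficulty = max(1, difficulty - 1)       # next item is easier
    return score, administered

# Simulated test-taker who can handle items up to difficulty 3:
score, items = run_cat(lambda item, d: d <= 3)
print(score, items)
```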
23. An Overall Summary Tests: assessment is an integral part of the teaching-learning cycle. In an interactive, communicative curriculum, assessment is almost constant. Tests can provide authenticity, motivation, and feedback to the learner. Tests are essential components of a successful curriculum and learning process. Assessments: periodic assessments can increase motivation as milestones of student progress. Appropriate assessments aid in the reinforcement and retention of information. Assessments can confirm strengths and pinpoint areas needing further work. Assessments provide a sense of periodic closure to modules within a curriculum. Assessments promote student autonomy by encouraging self-evaluation of progress. Assessments can spur learners to set goals for themselves. Assessments can aid in evaluating teaching effectiveness.
24. Decide whether the following statements are TRUE or FALSE. 1. It's possible to create authentic and motivating assessment to offer constructive feedback to the sts. ---------- 2. All tests should offer the test-takers some kind of measurement or result. ---------- 3. Performance-based tests measure test-takers' knowledge about language. ---------- 4. Tests are the best tools to assess students. ---------- 5. Assessment and testing are synonymous terms. ---------- 6. Ts' incidental and unplanned comments and responses to sts are an example of formal assessment. ---------- 7. Most of our classroom assessment is summative assessment. ---------- 8. Formative assessment always points toward future formation of learning. ---------- 9. The distribution of sts' scores across a continuum is a concern in norm-referenced tests. ---------- 10. Criterion-referenced testing has more instructional value than norm-referenced testing for classroom teachers. ---------- Answers: 1. TRUE 2. TRUE 3. FALSE (They are designed to test actual use of language, not knowledge about language.) 4. FALSE (We cannot say they are the best, but they are one of many useful devices to assess sts.) 5. FALSE (They are not.) 6. FALSE (They are informal assessment.) 7. FALSE (It is formative assessment.) 8. TRUE 9. TRUE 10. TRUE
25. CHAPTER 2 PRINCIPLES OF LANGUAGE ASSESSMENT
26. There are five criteria for "testing a test": 1. Practicality 2. Reliability 3. Validity 4. Authenticity 5. Washback 1. PRACTICALITY A practical test: · is not excessively expensive, · stays within appropriate time constraints, · is relatively easy to administer, and · has a scoring/evaluation procedure that is specific and time-efficient. For a test to be practical: · administrative details should be clearly established before the test, · sts should be able to complete the test reasonably within the set time frame, · the test should be able to be administered smoothly, without drowning test-takers in procedure, · all materials and equipment should be ready, · the cost of the test should be within budgeted limits, · the scoring/evaluation system should be feasible in the teacher's time frame, and · methods for reporting results should be determined in advance.
27. 2. RELIABILITY A reliable test is consistent and dependable. The issue of the reliability of a test may best be addressed by considering a number of factors that may contribute to its unreliability. Consider the following possible sources of fluctuation: · in the student (student-related reliability), · in scoring (rater reliability), · in test administration (test administration reliability), and · in the test itself (test reliability). Student-Related Reliability: temporary illness, fatigue, a bad day, anxiety, and other physical or psychological factors may make an "observed" score deviate from one's "true" score. A test-taker's "test-wiseness", or strategies for efficient test taking, can also be included in this category.
28. Rater Reliability: human error, subjectivity, lack of attention to scoring criteria, inexperience, inattention, or even preconceived biases may enter into the scoring process. Inter-rater unreliability occurs when two or more scorers yield inconsistent scores on the same test (a simple numerical check of inter-rater consistency is sketched below). Intra-rater unreliability arises from unclear scoring criteria, fatigue, bias toward particular "good" and "bad" students, or simple carelessness. One solution to such intra-rater unreliability is to read through about half of the tests before rendering any final scores or grades, then to recycle back through the whole set of tests to ensure an even-handed judgment. The careful specification of an analytical scoring instrument can increase rater reliability. Test Administration Reliability: unreliability may also result from the conditions in which the test is administered: street noise, photocopying variations, poor light, temperature, desks and chairs. Test Reliability: sometimes the nature of the test itself can cause measurement errors. Timed tests may discriminate against sts who do not perform well with a time limit. Poorly written test items may be a further source of test unreliability.
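One common way to put a number on inter-rater consistency is to correlate two raters' scores on the same set of scripts. The text itself prescribes no particular statistic, so this is a minimal sketch with hypothetical scores:

```python
# Minimal sketch (an illustration, not a procedure from the text):
# checking inter-rater reliability by correlating two raters' scores on
# the same eight essays. Requires Python 3.10+ for statistics.correlation.
import statistics

rater_a = [4, 3, 5, 2, 4, 3, 5, 4]  # hypothetical scores, rater A
rater_b = [4, 2, 5, 3, 4, 3, 4, 4]  # hypothetical scores, rater B

r = statistics.correlation(rater_a, rater_b)  # Pearson's r
print(f"Inter-rater correlation: r = {r:.2f}")  # closer to 1.0 = more consistent
```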
29. 3. VALIDITY The extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment. How is the validity of a test established? There is no final, absolute measure of validity, but several different kinds of evidence may be invoked in support. It may be appropriate to examine the extent to which a test calls for performance that matches that of the course or unit of study being tested. In other cases we may be concerned with how well a test determines whether or not students have reached an established set of goals or level of competence. It could be appropriate to study statistical correlation with other related but independent measures. Other concerns about a test's validity may focus on the consequences of a test – beyond measuring the criteria themselves – or even on the test-taker's perception of validity. We will look at these five types of evidence below.
30. Content Validity: if a test requires the test-taker to perform the behaviour that is being measured, it has content-related evidence of validity, often popularly referred to as content validity. If you want to assess a person's ability to speak the TL, asking sts to answer paper-and-pencil multiple-choice questions requiring grammatical judgements does not achieve content validity. For content validity to be achieved, the following conditions should hold: · classroom objectives should be identified and appropriately framed (the first measure of an effective classroom test is the identification of objectives); · lesson objectives should be represented in the form of test specifications (a test should have a structure that follows logically from the lesson or unit you are testing). If you clearly perceive the performance of test-takers as reflective of the classroom objectives, then you can argue that content validity has probably been achieved. To understand content validity, consider the difference between direct and indirect testing. Direct testing involves the test-taker in actually performing the target task. Indirect testing involves performing not the target task itself, but a task related to it in some way. Direct testing is the most feasible way to achieve content validity in assessment.
31. Criterion-Related Validity: it examines the extent to which the criterion of the test has actually been reached. For example, a classroom test designed to assess a point of grammar in communicative use will have criterion validity if test scores are corroborated either by observed subsequent behavior or by other communicative measures of the grammar point in question. Criterion-related evidence usually falls into one of two categories: · Concurrent validity: a test has concurrent validity if its results are supported by other concurrent performance beyond the assessment itself. For example, the validity of a high score on the final exam of a foreign language course will be substantiated by actual proficiency in the language. · Predictive validity: the assessment criterion in such cases is not to measure concurrent ability but to assess (and predict) a test-taker's likelihood of future success. For example, the predictive validity of an assessment becomes important in the case of placement tests, language aptitude tests, and the like.
32. · Construct Validity: every issue in language learning and teaching involves theoretical constructs. In the field of assessment, construct validity asks, "Does this test actually tap into the theoretical construct as it has been identified?" (That is, does the test really have the structural properties needed to test the subject or skill I want to test?) Imagine that you have been given a procedure for conducting an oral interview. The scoring analysis for the interview includes several factors in the final score: pronunciation, fluency, grammatical accuracy, vocabulary use, and sociolinguistic appropriateness. The justification for these five factors lies in a theoretical construct that claims those factors to be major components of oral proficiency. So if you were asked to conduct an oral proficiency interview that evaluated only pronunciation and grammar, you could be justifiably suspicious about the construct validity of that test. Large-scale standardized tests are often weak in construct validity because, for reasons of practicality (both time and cost), they cannot measure all the language skills that ought to be measured. For example, the absence of an "oral production" section in the TOEFL is a major obstacle in terms of construct validity.
33. Consequential Validity: consequential validity encompasses all the consequences of a test, including such considerations as its accuracy in measuring intended criteria, its impact on the preparation of test-takers, its effect on the learner, and the (intended and unintended) social consequences of a test's interpretation and use. McNamara (2000, p. 54) cautions against test results that may reflect socioeconomic conditions such as opportunities for coaching (private tutoring, special attention). For example, only some families can afford coaching, and children with more highly educated parents get help from their parents. Teachers should consider the effect of assessments on students' motivation, subsequent performance in a course, independent learning, study habits, and attitude toward school work.
34. Face Validity: the degree to which a test looks right and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of test-takers. · Face validity means that the students perceive the test to be valid. Face validity asks the question: "Does the test, on the 'face' of it, appear from the learner's perspective to test what it is designed to test?" · Face validity is not something that can be empirically tested by a teacher or even by a testing expert; it depends on the subjective evaluation of the test-taker. · A classroom test is not the time to introduce new tasks. · If a test samples the actual content of what the learner has achieved or expects to achieve, face validity will be more likely to be perceived. · Content validity is a very important ingredient in achieving face validity. · Students will generally judge a test to be face-valid if directions are clear, the structure of the test is organized logically, its difficulty level is appropriately pitched, the test has no "surprises", and timing is appropriate. · To give an assessment procedure that is "biased for best", a teacher offers students appropriate review and preparation for the test, suggests strategies that will be beneficial, and structures the test so that the best students will be modestly challenged and the weaker students will not be overwhelmed.
35. 4. AUTHENTICITY In an authentic test: · the language is as natural as possible, · items are as contextualized as possible, · topics and situations are interesting, enjoyable, and/or humorous, · some thematic organization, such as a story line or episode, is provided, and · tasks represent real-world tasks. Reading passages are selected from real-world sources that test-takers are likely to have encountered or will encounter. Listening comprehension sections feature natural language with hesitations, white noise, and interruptions. More and more tests offer items that are "episodic", in that they are sequenced to form meaningful units, paragraphs, or stories.
36. 5. WASHBACK Washback includes the effects of an assessment on teaching and learning prior to the assessment itself, that is, on preparation for the assessment. Informal performance assessment is by nature more likely to have built-in washback effects, because the teacher is usually providing interactive feedback. Formal tests can also have positive washback, but they provide no washback if the students receive a simple letter grade or a single overall numerical score. Tests should serve as learning devices through which washback is achieved. Sts' incorrect responses can become windows of insight into further work. Their correct responses need to be praised, especially when they represent accomplishments in a student's interlanguage. Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment, among others. To enhance washback, comment generously and specifically on test performance. Washback implies that students have ready access to the teacher to discuss the feedback and evaluation he has given. Teachers can raise the washback potential by asking students to use test results as a guide to setting goals for their future effort.
37. What is washback? In general terms: the effect of testing on teaching and learning. In large-scale assessment: the effects that tests have on instruction in terms of how students prepare for the test. In classroom assessment: the information that washes back to students in the form of useful diagnoses of strengths and weaknesses. What does washback enhance? Intrinsic motivation, language ego, autonomy, interlanguage, self-confidence, strategic investment. What should teachers do to enhance washback? Comment generously and specifically on test performance; respond to as many details as possible; praise strengths; criticize weaknesses constructively; give strategic hints to improve performance.
38. Decide whether the following statements are TRUE or FALSE. 1. An expensive test is not practical. 2. One of the sources of unreliability of a test is the school. 3. Sts, raters, the test, and its administration may affect the test's reliability. 4. In indirect tests, students do not actually perform the target task. 5. If students are aware of what is being tested when they take a test, and think that the questions are appropriate, the test has face validity. 6. Face validity can be tested empirically. 7. Diagnosing strengths and weaknesses of students in language learning is a facet of washback. 8. One way of achieving authenticity in testing is to use simplified language. Answers: 1. TRUE 2. FALSE 3. TRUE 4. TRUE 5. TRUE 6. FALSE 7. TRUE 8. FALSE
39. Decide which type of validity each sentence relates to. 1. It is based on subjective judgment. ---------- 2. It questions the accuracy of measuring the intended criteria. ---------- 3. It appears to measure the knowledge and abilities it claims to measure. ---------- 4. It measures whether the test meets the classroom objectives. ---------- 5. It requires the test to be based on a theoretical background. ---------- 6. Washback is part of it. ---------- 7. It requires the test-taker to perform the behavior being measured. ---------- 8. The students (test-takers) think they are given enough time to do the test. ---------- 9. It assesses a test-taker's likelihood of future success (e.g. placement tests). ---------- 10. The students' psychological mood may affect it negatively or positively. ---------- 11. It includes consideration of the test's effect on the learner. ---------- 12. Items of the test do not seem to be complicated. ---------- 13. The test covers the objectives of the course. ---------- 14. The test has clear directions. ---------- Answers: 1. Face 2. Consequential 3. Face 4. Content 5. Construct 6. Consequential 7. Content 8. Face 9. Criterion-related 10. Consequential 11. Consequential 12. Face 13. Content 14. Face
40. Decide which type of reliability each sentence could be related to. 1. There are ambiguous items. 2. The student is anxious. 3. The tape is of bad quality. 4. The teacher is tired but continues scoring. 5. The test is too long. 6. The room is dark. 7. The student has had an argument with the teacher. 8. The scorers interpret the criteria differently. 9. There is a lot of noise outside the building. Answers: 1. Test reliability 2. Student-related reliability 3. Test administration reliability 4. Rater reliability 5. Test reliability 6. Test administration reliability 7. Student-related reliability 8. Rater reliability 9. Test administration reliability
41. CHAPTER 3 DESIGNING CLASSROOM LANGUAGE TESTS
42. We examine test types and learn how to design tests and revise existing ones. To start the process of designing tests, we will ask some critical questions. Five questions should form the basis of your approach to designing tests for your class. Question 1: What is the purpose of the test? · Why am I creating this test? · For an evaluation of overall proficiency? (Proficiency Test) · To place students into a course? (Placement Test) · To measure achievement within a course? (Achievement Test) Once you have established the major purpose of a test, you can determine its objectives. Question 2: What are the objectives of the test? · What specifically am I trying to find out? · What language abilities are to be assessed? Question 3: How will the test specifications reflect both purpose and objectives? · When a test is designed, the objectives should be incorporated into a structure that appropriately weights the various competencies being assessed.

43. Question 4: How will test tasks be selected and the separate items arranged? · The tasks need to be practical. · They should also achieve content validity by presenting tasks that mirror those of the course being assessed. · They should be able to be evaluated reliably by the teacher or scorer. · The tasks themselves should strive for authenticity, and the progression of tasks ought to be biased for best performance. Question 5: What kind of scoring, grading, and/or feedback is expected? · Tests vary in the form and function of feedback, depending on their purpose. · For every test, the way results are reported is an important consideration. · Under some circumstances a letter grade or a holistic score may be appropriate; other circumstances may require that a teacher offer substantive washback to the learner.
44. TEST TYPES Defining your purpose will help you choose the right kind of test, and it will also help
you to focus on the specific objectives of the test. Below are the test types to be examined: 1. Language
Aptitude Tests 2. Proficiency Tests 3. Placement Tests 4. Diagnostic Tests 5. Achievement Tests
45. 1. Language Aptitude Tests They predict a person's success prior to exposure to the second language. An aptitude test is designed to measure capacity or general ability to learn a FL, and is designed to apply to the classroom learning of any language. Two standardized aptitude tests have been used in the US: the Modern Language Aptitude Test (MLAT) and the Pimsleur Language Aptitude Battery (PLAB). Tasks in the MLAT include: number learning, phonetic script, spelling clues, words in sentences, and paired associates. There is no unequivocal evidence that language aptitude tests predict communicative success in a language. Any test that claims to predict success in learning a language is undoubtedly flawed, because we now know that, with appropriate self-knowledge and active strategic involvement in learning, everyone can eventually succeed.
46. 2. Proficiency Tests A proficiency test is not limited to any one course, curriculum, or single skill in the language; rather, it tests overall ability. It includes standardized multiple-choice items on grammar, vocabulary, reading comprehension, and aural comprehension. Sometimes a sample of writing is added, and more recent tests also include oral production. Such tests often have content validity weaknesses. Proficiency tests are almost always summative and norm-referenced. They are usually not equipped to provide diagnostic feedback; their role is to accept or deny someone's passage into the next stage of a journey. The TOEFL is a typical standardized proficiency test. Creating and validating such tests with research is a time-consuming and costly process, so choosing one of a number of commercially available proficiency tests is a far more practical option for classroom teachers.
47. 3. Placement Tests The objective of a placement test is to place sts correctly into a course or level. Certain proficiency tests can act in the role of placement tests. A placement test usually includes a sampling of the material to be covered in the various courses in a curriculum. Sts should find the test neither too easy nor too difficult, but challenging. The ESL Placement Test (ESLPT) at San Francisco State University has three parts. Part 1: sts read a short article and then write a summary essay. Part 2: sts write a composition in response to an article. Part 3: multiple-choice; sts read an essay and identify grammar errors in it. The ESLPT is more authentic but less practical, because human evaluators are required for the first two parts. Reliability problems are present, but they are mitigated by conscientious training of evaluators. What is lost in practicality and reliability is gained in the diagnostic information that the ESLPT provides.
48. 4. Diagnostic Tests A diagnostic test is designed to diagnose specified aspects of a language. A diagnostic test can help a student become aware of errors and encourage the adoption of appropriate compensatory strategies. A test of pronunciation might diagnose the phonological features that are difficult for sts and should become part of a curriculum. Such tests offer a checklist of features for the administrator to use in pinpointing difficulties. A writing diagnostic elicits a writing sample from sts that allows Ts to identify those rhetorical and linguistic features on which the course needs to focus special attention. A diagnostic test of oral production was created by Clifford Prator (1972) to accompany a manual of English pronunciation. In the test, test-takers are directed to read a 150-word passage while they are tape-recorded. The test administrator then refers to an inventory of phonological items for analyzing a learner's production. After multiple listenings, the administrator produces a checklist of errors in five categories: stress and rhythm, intonation, vowels, consonants, and other factors. This information helps Ts make decisions about which aspects of English phonology to address.
49. 5. Achievement Tests An achievement test is related directly to lessons, units, or even a total curriculum. Achievement tests should be limited to the particular material addressed in a curriculum within a particular time frame, and should be offered after a course has focused on the objectives in question. There is a fine line of difference between a diagnostic test and an achievement test: achievement tests analyze the extent to which students have acquired language features that have already been taught (an analysis of the past), while diagnostic tests should elicit information on what students need to work on in the future. The primary role of an achievement test is to determine whether course objectives have been met – and appropriate knowledge and skills acquired – by the end of a period of instruction. They are often summative because they are administered at the end of a unit or term, but effective achievement tests can also provide useful washback by showing students their errors and helping them analyze their weaknesses and strengths. Achievement tests range from five- or ten-minute quizzes to three-hour final examinations, with an almost infinite variety of item types and formats.
50. Practical steps in constructing classroom tests: A) Assessing Clear, Unambiguous Objectives Before giving a test, examine the objectives for the unit you are testing. Your first task in designing a test, then, is to determine appropriate objectives, e.g.: "Students will recognize and produce tag questions, with the correct grammatical form and final intonation pattern, in simple social conversations." B) Drawing Up Test Specifications Test specifications will simply comprise: a) a broad outline of the test, b) what skills you will test, and c) what the items will look like – all based on objectives such as the one stated above.
51. C) Devising Test Tasks In devising test tasks, consider: how students will perceive them (face validity), the extent to which authentic language and contexts are present, and the potential difficulty caused by cultural schemata. In revising your draft, you should ask yourself some important questions: 1. Are the directions to each section absolutely clear? 2. Is there an example item for each section? 3. Does each item measure a specified objective? 4. Is each item stated in clear, simple language? 5. Does each multiple-choice item have appropriate distractors; that is, are the wrong options clearly wrong and yet sufficiently "alluring" that they are not ridiculously easy? 6. Is the difficulty of each item appropriate for your students? 7. Is the language of each item sufficiently authentic? 8. Do the sum of the items and the test as a whole adequately reflect the learning objectives? In the final revision of your test: time yourself as you take the test and, if the test should be shortened or lengthened, make the necessary adjustments; make sure your test is neat and uncluttered on the page; and, if there is an audio component, make sure that the script is clear.
52. D) Designing Multiple-Choice Test Items There are a number of weaknesses in multiple-choice items: the technique tests only recognition knowledge; guessing may have a considerable effect on test scores; the technique severely restricts what can be tested; it is very difficult to write successful items; washback may be harmful; and cheating may be facilitated. However, two principles support multiple-choice formats: practicality and reliability. Some important terms for multiple-choice items: multiple-choice items are all receptive, or selective; that is, the test-taker chooses from a set of responses rather than creating a response. Other receptive item types include true-false questions and matching lists. Every multiple-choice item has a stem, which presents a stimulus, and several options or alternatives to choose from. One of those options, the key, is the correct response; the others serve as distractors.
53. IMPORTANT!!! Consider the following four guidelines for designing multiple-choice items for both classroom-based and large-scale situations: 1. Design each item to measure a specific objective. (Do not, for example, measure knowledge of modals and knowledge of articles in the same item.) 2. State both stem and options as simply and directly as possible. Do not use superfluous words; another rule of succinctness is to remove needless redundancy from your options. 3. Make certain that the intended answer is clearly the only correct one. Eliminating unintended possible answers is often the most difficult problem of designing multiple-choice items: with only a minimum of context in each stem, a wide variety of responses may be perceived as correct. 4. Use item indices to accept, discard, or revise items. The appropriate selection and arrangement of suitable multiple-choice items on a test can best be accomplished by measuring items against three indices: a) item facility (IF), or item difficulty, b) item discrimination (ID), or item differentiation, and c) distractor analysis.
54. a) Item facility (IF) is the extent to which an item is easy or difficult for the proposed group of test-takers. If 13 out of 20 students answer an item correctly, IF = 13/20 = 0.65 (65%). Items with an IF between roughly 15% and 85% are acceptable. Two good reasons for including a very easy item (85% or higher) are to build in some affective feelings of "success" among lower-ability students and to serve as warm-up items; very difficult items can provide a challenge to the highest-ability sts. b) Item discrimination (ID) is the extent to which an item differentiates between high- and low-ability test-takers. An item on which high-ability students and low-ability students score equally well would have poor ID, because it does not discriminate between the two groups. An item that garners correct responses from most of the high-ability group and incorrect responses from most of the low-ability group has good discrimination power. Example: rank 30 students from the highest to the lowest total score and divide them into three equal groups; then compare the 10 highest-scoring and the 10 lowest-scoring students on a single item:

Item #1      High-ability students (top 10)   Low-ability students (bottom 10)
Correct      7                                2
Incorrect    3                                8

ID = (7 − 2) / 10 = 0.50. The result tells us that the item has a moderate level of ID. A high discrimination level would approach 1.0, and no discriminating power at all would be zero. In most cases, you would want to discard an item that scored near zero. No absolute rule governs the establishment of acceptable and unacceptable ID indices.
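Both indices reduce to one-line formulas. A minimal sketch, computed with the same figures as the worked examples above:

```python
# Minimal sketch of the two item indices described in the text, using
# the text's own numbers (13/20 correct; top/bottom groups of 10).
def item_facility(num_correct, num_takers):
    """IF: proportion of test-takers answering the item correctly."""
    return num_correct / num_takers

def item_discrimination(high_correct, low_correct, group_size):
    """ID: (high-group correct - low-group correct) / group size."""
    return (high_correct - low_correct) / group_size

print(item_facility(13, 20))          # 0.65 -> a moderately easy item
print(item_discrimination(7, 2, 10))  # 0.5  -> moderate discrimination
```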
55. c) Distractor efficiency (DE) is the extent to which the distractors "lure" a sufficient number of test-takers, especially lower-ability ones, and the extent to which those responses are somewhat evenly distributed across all distractors. Example (C is the correct response):

Choices   High-ability students (10)   Low-ability students (10)
A         0                            3
B         1                            5
C*        7                            2
D         0                            0
E         2                            0

The item might be improved in two ways: a) Distractor D doesn't fool anyone, so it probably has no utility; a revision might provide a distractor that actually attracts a response or two. b) Distractor E attracts more responses (2) from the high-ability group than from the low-ability group (0). Why are good students choosing this one? Perhaps it includes a subtle reference that entices the high group but is "over the head" of the low group, so the latter sts don't even consider it. The other two distractors (A and B) seem to be fulfilling their function of attracting some attention from the lower-ability students.
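The same tally can be automated. A minimal sketch reproducing the table above and flagging the two problem distractors:

```python
# Minimal sketch of the distractor analysis above: tally how each option
# drew responses from the high- and low-ability groups (data from the
# text; C is the key).
responses = {
    "A": {"high": 0, "low": 3},
    "B": {"high": 1, "low": 5},
    "C": {"high": 7, "low": 2},   # key
    "D": {"high": 0, "low": 0},   # lures no one: candidate for revision
    "E": {"high": 2, "low": 0},   # lures more high- than low-ability takers
}
for option, counts in responses.items():
    flag = ""
    if option != "C":
        if counts["high"] + counts["low"] == 0:
            flag = "  <- no utility as a distractor"
        elif counts["high"] > counts["low"]:
            flag = "  <- misleading the better students; inspect wording"
    print(f"{option}: high={counts['high']} low={counts['low']}{flag}")
```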
56. SCORING, GRADING AND GIVING FEEDBACK A) Scoring As you design a test, you must consider how the test will be scored and graded. Your scoring plan reflects the relative weight that you place on each section and on the items in each section: whichever skill has been emphasized most should receive the most points, e.g. oral production 30%, listening 30%, reading 20%, and writing 20% (see the sketch after this slide). B) Grading Grading doesn't mean just giving an "A" for 90-100; it's not that simple. How you assign letter grades is a product of the country, culture, and context of the class; institutional expectations (most of them unwritten); explicit and implicit definitions of grades that you have set forth; the relationship you have established with the class; and sts' expectations engendered by previous tests and quizzes in the class.
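A weighted scoring plan like the one in the example reduces to a weighted sum. A minimal sketch, with weights taken from the text and hypothetical section scores:

```python
# Minimal sketch: combining section scores into a total according to the
# relative weight placed on each skill. Weights are from the example in
# the text; the section scores are hypothetical.
weights = {"oral": 0.30, "listening": 0.30, "reading": 0.20, "writing": 0.20}
section_scores = {"oral": 80, "listening": 70, "reading": 90, "writing": 60}  # out of 100

total = sum(weights[s] * section_scores[s] for s in weights)
print(f"Weighted total: {total:.1f} / 100")  # 75.0
```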
57. C) Giving Feedback Feedback should become beneficial washback. Here are some examples of feedback: 1. a letter grade; 2. a total score; 3. four subscores (speaking, listening, reading, writing); 4. for the listening and reading sections: a. an indication of correct/incorrect responses, b. marginal comments; 5. for the oral interview: a. scores for each element being rated, b. a checklist of areas needing work, c. oral feedback after the interview, d. a post-interview conference to go over results; 6. on the essay: a. scores for each element being rated, b. a checklist of areas needing work, c. marginal and end-of-essay comments and suggestions, d. a post-test conference to go over the work, e. a self-assessment; 7. on all or selected parts of the test, peer checking of results; 8. a whole-class discussion of the results of the test; 9. individual conferences with each student to review the whole test.
58. Decide whether the following statements are TRUE or FALSE. 1. A language aptitude test measures a learner's future success in learning a FL. 2. Language aptitude tests are very common today. 3. A proficiency test is limited to a particular course or curriculum. 4. The aim of a placement test is to place a student into a particular level. 5. Placement tests have many varieties. 6. Any placement test can be used in a particular teaching program. 7. Achievement tests are related to classroom lessons, units, or curriculum. 8. A five-minute quiz can be an achievement test. 9. The first task in designing a test is to determine the test specifications. Answers: 1. TRUE 2. FALSE 3. FALSE 4. TRUE 5. TRUE 6. FALSE (Not all placement tests suit every teaching program.) 7. TRUE 8. TRUE (Achievement tests range from five- or ten-minute quizzes to three-hour final examinations.) 9. FALSE (The first task is to determine appropriate objectives.)
59. Decide whether the following statements are TRUE or FALSE. 1. It is very easy to develop multiple-choice tests. 2. Multiple-choice tests are practical but not reliable. 3. Multiple-choice tests are time-saving in terms of scoring and grading. 4. Multiple-choice items are receptive. 5. Each multiple-choice item in a test should measure a specific objective. 6. The stem of a multiple-choice item should be as long as possible in order to help students understand the context. 7. If the item facility value is .10 (10%), it means the item is very easy. 8. The item discrimination index differentiates between high- and low-ability sts. Answers: 1. FALSE (It seems easy, but it is not.) 2. FALSE (They can be both practical and reliable.) 3. TRUE 4. TRUE 5. TRUE 6. FALSE (It should be short and to the point.) 7. FALSE (An item with an IF value of .10 is a very difficult one.) 8. TRUE
60. CHAPTER 4 STANDARDIZED TESTING
61. WHAT IS STANDARDIZATION? A standardized test presupposes certain standard objectives, or criteria, that are held constant from one form of the test to another. Standardized tests measure a broad band of competencies, not only one particular curriculum. They are norm-referenced, and their main goal is to place sts in rank order. Scholastic Aptitude Test (SAT): a college entrance exam. The Graduate Record Exam (GRE): a test for entry into many graduate school programs. Graduate Management Admission Test (GMAT) and Law School Aptitude Test (LSAT): tests that specialize in particular disciplines. Test of English as a Foreign Language (TOEFL): produced by the Educational Testing Service (ETS). These tests are standardized because they specify a set of competencies for a given domain and, through a process of construct validation, they program a set of tasks. In general, standardized test items are in multiple-choice format, which provides an "objective" means of determining correct and incorrect responses. However, MC is not the only item type in standardized tests: human-scored tests of oral and written production are also involved.
62. ADVANTAGES AND DISADVANTAGES OF STANDARDIZED TESTS Advantages: * ready-made (Ts don't need to spend time preparing them); * can be administered to a large number of sts within a time constraint; * easy to score thanks to the MC format (computerized or hole-punched grid scoring); * face validity. Disadvantages: * inappropriate use of the tests; * misunderstanding of the difference between direct and indirect testing.
63. Characteristics of a standardized test
64. DEVELOPING A STANDARDIZED TEST Knowing how to develop a standardized test can be helpful if you need to revise an existing test, adapt or expand an existing test, or create a smaller-scale standardized test. Three examples are followed through the steps below: (A) the Test of English as a Foreign Language (TOEFL), a test of general ability or proficiency; (B) the English as a Second Language Placement Test (ESLPT) at San Francisco State University (SFSU), a placement test at a university; and (C) the Graduate Essay Test (GET) at SFSU, a gate-keeping essay test.
65. 1. Determine the purpose and objectives of the test. Standardized tests are expected to be valid and practical. TOEFL: *to evaluate the English proficiency of people whose native language is not English; *colleges and universities in the US use TOEFL scores to admit or refuse international applicants. ESLPT: *to place already-admitted sts at SFSU in appropriate courses in academic writing and oral production; *to provide Ts with some diagnostic information about sts. GET: *to determine whether students' writing ability is sufficient to permit them to enter graduate-level courses in their programs (it is offered at the beginning of each term).
66. 2. Design test specifications. TOEFL: the first step is to define the construct of language proficiency. After breaking language competence down into a subset of four skills, each performance mode can be examined on a continuum of linguistic units (pronunciation, spelling, words, grammar). The oral production section tests fluency and pronunciation by using imitation; the listening section focuses on particular features of language or on overall listening comprehension; the reading section aims to test comprehension of long/short passages, single sentences, phrases, or words; the writing section tests writing ability in open-ended form (free composition) or can be structured to elicit anything from correct spelling to discourse-level competence. ESLPT: designing test specs for the ESLPT was a simpler task; the purpose is placement, and construct validation of the test consisted of an examination of the content of the ESL courses. In the recent revision of the ESLPT, content and face validity were important theoretical issues, with practicality and reliability in tasks and item response formats equally important. The specifications mirrored the reading-based and process-writing approach used in class. GET: the specifications for the GET are the skills of writing grammatically and rhetorically acceptable prose on a topic, with clearly produced organization of ideas and logical development.
67. 3. Design, select, and arrange test tasks/items. TOEFL: • content coding: the skills and a variety of subject matter, without bias (the content must be universal and as neutral as possible); • statistical characteristics: these include IF and ID; • before administration, items are piloted and scientifically selected to meet difficulty specifications within each subsection, each section, and the test overall. ESLPT: for the written parts, the main problems are a) selecting appropriate passages (they must conform to the standards of content validity), b) providing appropriate prompts (they should fit the passages), and c) processing data from pilot testing. In the MC editing test, the easier task is to choose an appropriate essay within which to embed errors; a more complicated one is to embed a specified number of errors from pre-determined error categories. (Ts can derive the categories from sts' previous errors in written work, and sts' errors can be used as distractors.) GET: topics are appealing and capable of yielding the intended product: an essay with an organized, logical argument and conclusion. No pilot testing of prompts is conducted. • Be careful about the potential cultural effect of topics on the numerous international students who must take the GET.
68. 4. Make appropriate evaluations of different kinds of items. - IF, ID and distractor analysis may not be necessary for a classroom (one-time) test, but they are a must for a standardized MC test. - For production responses, different forms of evaluation become important (i.e., practicality, reliability & facility). *Practicality: clarity of directions, timing of the test, ease of administration & how much time is required to score. *Reliability: a major player in instances where more than one scorer is employed and, to a lesser extent, when a single scorer has to evaluate tests over long spans of time, which could lead to deterioration of standards. *Facility: key for valid and successful items; unclear directions, complex language, obscure topics, fuzzy data, or culturally biased information may lead to a higher level of difficulty. GET: *No data are collected from sts on their perceptions, but the scorers have an opportunity to reflect on the validity of the given topic.
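The IF and ID statistics named above are simple proportions, and a short script makes them concrete. Below is a minimal sketch, not tied to any real testing program: IF is computed as the proportion of test-takers answering an item correctly, and ID by the common high-group/low-group method; all data are invented for illustration.

```python
# Minimal sketch of item facility (IF) and item discrimination (ID),
# using the common high/low-group method. All data are hypothetical.

def item_facility(responses):
    """IF = proportion of test-takers who answered the item correctly.
    `responses` is a list of 1 (correct) / 0 (incorrect)."""
    return sum(responses) / len(responses)

def item_discrimination(item_scores, total_scores, group_size=None):
    """ID = IF(high group) - IF(low group), where the groups are the
    top and bottom thirds of test-takers ranked by total test score."""
    n = len(item_scores)
    k = group_size or n // 3
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    high = [item_scores[i] for i in ranked[:k]]
    low = [item_scores[i] for i in ranked[-k:]]
    return item_facility(high) - item_facility(low)

# Hypothetical data: one item's scores and each test-taker's total score.
item = [1, 1, 1, 0, 1, 0, 1, 0, 0]
totals = [95, 90, 88, 75, 72, 70, 55, 50, 40]
print(f"IF = {item_facility(item):.2f}")                 # 0.56
print(f"ID = {item_discrimination(item, totals):.2f}")   # 0.67
```

An item with a very high or very low IF, or an ID near zero, would be a candidate for revision before it reaches the operational test.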
69. 5. Specify scoring procedures and reporting formats. TOEFL: -Scores are calculated and reported for *the three sections of the TOEFL, *a total score, and *a separate score. ESLPT: *It reports a score for each of the essay sections (each essay is read by 2 readers). *The editing section is machine-scanned. *It provides data to place sts, plus diagnostic information. *Sts don't receive their essays back. GET: *Each GET is read by two trained readers, who each give a score between 1 and 4 (the two ratings are combined). *The recommended score is 6 as the threshold for allowing sts to pursue graduate-level courses. *If a st scores below 6, he or she either repeats the test or takes a remedial course.
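Because each GET reader awards 1 to 4 and the pass threshold is 6, the decision has to be made on the combined rating. The sketch below assumes, as the numbers imply, that the two ratings are summed; the function name and messages are illustrative, not part of the actual GET procedure.

```python
# Minimal sketch of a GET-style decision rule, assuming the two
# readers' 1-4 ratings are summed (giving 2-8 overall) and that 6
# is the pass threshold. Names and wording are illustrative only.

def get_decision(reader1: int, reader2: int, threshold: int = 6) -> str:
    for score in (reader1, reader2):
        if not 1 <= score <= 4:
            raise ValueError("each reader scores on a 1-4 scale")
    combined = reader1 + reader2
    if combined >= threshold:
        return f"{combined}: eligible for graduate-level courses"
    return f"{combined}: repeat the test or take a remedial course"

print(get_decision(3, 4))  # 7: eligible for graduate-level courses
print(get_decision(2, 3))  # 5: repeat the test or take a remedial course
```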
70. 6. Perform ongoing construct validation studies. Any standardized test must be accompanied by systematic, periodic corroboration of its effectiveness and by steps toward its improvement. TOEFL: *The latest study on the TOEFL examined its content characteristics from a communicative perspective, based on current research in applied linguistics and language proficiency assessment. ESLPT: *The development of the new ESLPT involved a lengthy process of both content and construct validation, along with facing such practical issues as scoring the written sections and a machine-scorable MC answer sheet. GET: *There is no research to validate the GET itself; administrators rely on research on university-level academic writing tests such as the TWE. *Some criticism of the GET has come from international test-takers, who posit that the topics and time limits of the GET work to the disadvantage of writers whose native language is not English.
71. TOEFL Primary market: U.S. universities and colleges for admission purposes. Type: Computer-based and paper-based. Response modes: Multiple-choice responses and essay. Time allocation: Up to 4 hours (CB); 3 hours (PB). Specifications: A listening section which includes dialogs, short conversations, academic discussions, and mini-lectures; a structure section which tests formal language with two types of questions (completing incomplete sentences and identifying one of four underlined words or phrases that is not acceptable in English); a reading section which includes four to five passages on academic subjects with 10-14 questions for each passage; a writing section which requires examinees to compose an essay on a given topic.
72. MELAB Primary market: U.S. and Canadian language programs and colleges; some worldwide educational settings. Type: Paper-based. Response modes: Multiple-choice responses and essay. Time allocation: 2.5 to 3.5 hours. Specifications: A 30-minute impromptu essay on a given topic; a 25-minute multiple-choice listening comprehension test; a 100-item, 75-minute multiple-choice test of grammar, cloze reading, vocabulary, and reading comprehension; an optional oral interview.
73. IELTS Primary market: Australian, British, Canadian, and New Zealand academic institutions and professional organizations, and some American academic institutions. Type: Computer-based for Reading and Writing sections; paper-based for Listening and Speaking parts. Response modes: Multiple-choice responses, essay, and oral production. Time allocation: 2 hours, 45 minutes. Specifications: A 60-minute reading; a 60-minute writing; a 30-minute listening of four sections; a 10- to 15-minute speaking of five sections.
74. TOEIC Primary market: Worldwide; workplace settings. Type: Computer-based and paper-based. Response modes: Multiple-choice responses. Time allocation: 2 hours. Specifications: A 100-item, approximately 45-minute listening administered by audiocassette, which includes statements, questions, short conversations, and short talks; a 100-item, 75-minute reading which includes cloze sentences, error recognition, and reading comprehension.
75. CHAPTER 5 STANDARDS-BASED ASSESSMENT:


76. Mid-20th Century: Standardized tests enjoyed unchallenged popularity and growth. They brought convenience, efficiency, and an air of empirical science, and were considered a way of making reforms in education. Quickly and cheaply assessing students became a political issue. Late 20th Century: *There was possible inequity: a disparity between the content and tasks of such tests and what teachers teach in their classes. *The claims of the mid-20th century began to be questioned and criticised in all areas. *Teachers were in the leading position in those challenges. The Last 20 Years: *Educators became aware of weaknesses in standardized testing: the tests were not accurate measures of achievement and success, and they were not based on carefully framed, comprehensive and validated standards of achievement. *A movement started to establish standards for assessing students of all ages and subject-matter areas. *There have been efforts to base standardised tests on clearly specified criteria for each content area being measured.
77. Criticism: Some teachers claimed that those tests were unfair: there was dissimilarity between the content & tasks of the tests & what they were teaching in their classes. Solutions: Becoming aware of these weaknesses, educators started to establish standards on which sts of all ages & subject-matter areas might be assessed. Most departments of education at the state level in the US have specified appropriate standards (criteria, objectives) for each grade level (pre-school to grade 12) and each content area (math, science, arts…). The construction of standards makes possible concordance between standardized test specifications and the goals and objectives of programs for English learners (ESL, ESOL, ELD, ELLs). (The label LEP has been discarded because of the negative connotation of the word 'limited'.)
78. ELD STANDARDS In creating benchmarks for accountability, there is a tremendous responsibility to carry out a comprehensive study of a number of domains: Categories of language: phonological, discourse, pragmatic, functional and sociolinguistic elements. Specification of what ELD students' needs are. A realistic scope of standards to be included in the curriculum (the standards in the curriculum should be realistic). Standards for teachers: qualifications, expertise, training (this sets standards for teachers). A thorough analysis of the means available to assess student attainment of those standards (how will we assess what students have learned?).
79. ELD ASSESSMENT The development of standards obviously implies the responsibility for correctly assessing their attainment. When it was found that the standardized tests of past decades were not in line with the newly developed standards, the interactive process not only of developing standards but also of creating standards-based assessments began. Specialists design, revise and validate many tests. The California English Language Development Test (CELDT) is a battery of instruments designed to assess attainment of the ELD standards across grade levels (not publicly available). A language and literacy assessment rubric collected students' work, with teachers' observations recorded on scannable forms. It provided useful data on students' performance in oral production, reading and writing across different grades.
80. CASAS AND SCANS CASAS (Comprehensive Adult Student Assessment System): Designed to provide broadly based assessments of ESL curricula across the US. It includes more than 80 standardized assessment instruments used to *place sts in programs, *diagnose learners' needs, *monitor progress, and *certify mastery of functional skills, at higher levels of education (colleges, adult and language schools, the workplace). SCANS (Secretary's Commission on Achieving Necessary Skills): outlines competencies necessary for language in the workplace. The competencies are acquired and maintained through training in basic skills (the four skills), thinking skills (reasoning & problem solving), and personal qualities (self-esteem & sociability), applied to: resources (allocating time, materials, staff, etc.); interpersonal skills (teamwork, customer service, etc.); information (processing and evaluating data, organising files, etc.); systems (understanding social and organizational systems); and technology (use and application).
81. TEACHER STANDARDS (what a teacher should be like): Linguistics and language development; culture and the interrelationship between language and culture; planning and managing instruction. Consequences of standards-based and standardized testing: Positive: high level of practicality and reliability; provides insights into academic performance; accuracy in placing large numbers of test-takers onto a norm-referenced scale; ongoing construct validation studies. Negative: they involve a number of test biases; a small but significant number of test-takers are assessed neither fairly nor accurately; fosters extrinsic motivation; multiple intelligences are not considered; there is a danger of test-driven learning and teaching; in general, performance is not directly assessed.
82. Test bias Standardized tests involve many test biases (language, culture, race, gender, learning styles). The National Center for Fair and Open Testing has collected claims of test bias (in reading texts, listening stimuli) from teachers, parents, students, and legal consultants. Standardised tests promote logical-mathematical and verbal-linguistic intelligence to the virtual exclusion of the other, contextualised, integrative intelligences (some learners may need to be assessed with interviews, portfolios, samples of work, demonstrations, observation reports: more formative assessment rather than summative). That would solve test-bias problems, but it is difficult to control for bias in standardized items. Those who use standardised tests for gate-keeping purposes, with few if any other assessments, would do well to consider multiple measures before attributing infallible predictive power to standardised tests. Test-driven learning and teaching This is another consequence of standardized testing. When students know that one single measure of performance will determine their lives, they are less likely to take positive attitudes toward learning (extrinsic motivation, not intrinsic). Ts are also affected by test-driven policies: they are under pressure to make sure their sts excel in the exam, at the risk of ignoring other objectives in the curriculum. A more serious effect has been to punish schools in lower-socioeconomic neighbourhoods.
83. ETHICAL ISSUES: CRITICAL LANGUAGE TESTING One of the by-products of the rapidly growing testing industry is the danger of an abuse of power. 'Tests represent a social technology deeply embedded in education, government and business; tests are most powerful as they are often the single indicators for determining the future of individuals' (Shohamy). Standards, specified by client educational institutions, bring with them certain ethical issues surrounding the gate-keeping nature of standardized tests. Teachers can demonstrate standards in their teaching. Teachers can be assessed through their classroom performance. Performance can be detailed with 'indicators': examples of evidence that the teacher can meet a part of a standard. Indicators are more than 'how-to' statements (they are complex evidence of performance). Performance-based assessment is integrated (not a checklist or discrete assessments). Each assessment has performance criteria against which performance can be measured. Performance criteria identify to what extent the teacher meets the standard. Student learning is at the heart of the teacher's performance.
84. 6 ASSESSING LISTENING
85. OBSERVING THE PERFORMANCE OF THE FOUR SKILLS 1. Two interacting concepts: performance and observation. Sometimes the performance does not indicate true competence: a bad night's rest, illness, an emotional distraction, test anxiety, a memory block, or other student-related reliability factors can interfere. One important principle for assessing a learner's competence is to consider the fallibility of the results of a single performance, such as that produced in a test. A form of measurement that involves performances and contexts should be designed along the following lines: several tests that are combined to form an assessment (for listening, the tasks are designed to assess the candidate's ability to process forms of spoken English); a single test with multiple test tasks to account for learning styles and performance variables; in-class and extra-class graded work; alternative forms of assessment (e.g., journal, portfolio, conference, observation, self-assessment, peer-assessment).
86. Multiple measures give a more reliable & valid assessment than a single measure. Can we observe the process of performing, or a product? 1. Receptive skills -- listening performance: the process of listening is invisible and inaudible, a process of internalizing meaning from the auditory signals being transmitted to the ear and brain; neither the process nor a product can be directly observed. 2. The productive skills allow us to hear and see the process as it is performed; writing yields a permanent product, the written piece, but unless speech is recorded, there is no permanent observable product for speaking. THE IMPORTANCE OF LISTENING Listening has often played second fiddle to its counterpart, speaking, and it is rare to find just a listening test. Listening is often implied as a component of speaking. Oral production ability (other than monologues, speeches, reading aloud and the like) is only as good as one's listening comprehension. Input in the aural-oral mode accounts for a large proportion of successful language acquisition.
87. BASIC TYPES OF LISTENING For an effective test, designing appropriate assessment tasks in listening begins with the specification of objectives, or criteria. The following processes flash through your brain: 1. Recognize speech sounds and hold a temporary 'imprint' of them in short-term memory. 2. Simultaneously determine the type of speech event. 3. Use (bottom-up) linguistic decoding skills and/or (top-down) background schemata to bring a plausible interpretation to the message and assign a literal and intended meaning to the utterance. (Jeremy Harmer, p. 305, makes a related point about the value of activating students' schemata.) 4. In most cases, delete the exact linguistic form in which the message was originally received in favor of conceptually retaining important or relevant information in long-term memory.


88. Four commonly identified types of listening performance: 1. Intensive: listening for perception of the components. Teachers use audio material on tape or hard disk when they want their students to practice listening skills. 2. Responsive. 3. Selective. 4. Extensive: extensive listening usually takes place outside the classroom, and material for it can be obtained from a number of sources.
89. Micro- and Macroskills Microskills: attending to smaller bits and chunks, in more of a bottom-up process. Discriminate among the sounds of English; retain chunks of language of different lengths in short-term memory; recognize stress patterns, words in stressed/unstressed positions, rhythmic structure, intonation contours, and their role in signaling information; recognize reduced forms of words; distinguish word boundaries, recognize the core of a word, and interpret word order patterns and their significance; process speech at different rates of delivery; process speech containing pauses, errors, corrections, and other performance variables; recognize grammatical word classes (nouns, verbs, etc.), systems (e.g., tense, agreement, pluralization), patterns, rules, and elliptical forms; detect sentence constituents and distinguish between major and minor constituents; recognize that a particular meaning may be expressed in different grammatical forms; recognize cohesive devices in spoken discourse.
90. Macroskills: focusing on larger elements involved in a top-down approach. Recognize the communicative functions of utterances, according to situations, participants, and goals; infer situations, participants, and goals using real-world knowledge; from events, ideas, and so on described, predict outcomes, infer links and connections between events, deduce causes and effects, and detect such relations as main idea, supporting idea, new information, given information, generalization, and exemplification; distinguish between literal and implied meanings; use facial, kinesic, body-language, and other nonverbal clues to decipher meanings; develop and use a battery of listening strategies, such as detecting key words, guessing the meaning of words from context, appealing for help, and signaling comprehension or lack thereof.
91. What Makes Listening Difficult 1. Clustering: chunking into phrases, clauses, constituents. 2. Redundancy: repetitions, rephrasing, elaborations and insertions. 3. Reduced Forms: understanding reduced forms that may not have been part of the learner's past experience in classes where only formal 'textbook' language has been presented. 4. Performance variables: hesitations, false starts, corrections, diversions. 5. Colloquial Language: idioms, slang, reduced forms, shared cultural knowledge. 6. Rate of Delivery: keeping up with the speed of delivery, processing automatically as the speaker continues. 7. Stress, Rhythm, and Intonation: correctly understanding the prosodic elements of spoken language, which is more difficult than understanding the smaller phonological bits and pieces. 8. Interaction: negotiation, clarification, attending signals, turn-taking, maintenance, termination.
92. Designing Assessment Tasks • Recognizing Phonological and Morphological Elements Phonemic pair, consonants. Test-takers hear: He's from California. Test-takers read: A. He's from California. B. She's from California. Phonemic pair, vowels. Test-takers hear: Is he living? Test-takers read: A. Is he leaving? B. Is he living?
93. Morphological pair, -ed ending. Test-takers hear: I missed you very much. Test-takers read: A. I missed you very much. B. I miss you very much. Stress pattern in can't. Test-takers hear: My girlfriend can't go to the party. Test-takers read: A. My girlfriend can go to the party. B. My girlfriend can't go to the party. One-word stimulus. Test-takers hear: vine. Test-takers read: A. vine B. wine
94. • Paraphrase Recognition Sentence paraphrase. Test-takers hear: Hello, my name is Keiko. I come from Japan. Test-takers read: A. Keiko is comfortable in Japan. B. Keiko wants to come to Japan. C. Keiko is Japanese. D. Keiko likes Japan. Dialogue paraphrase. Test-takers hear: Man: Hi, Maria, my name is George. Woman: Nice to meet you, George. Are you American? Man: No, I'm Canadian. Test-takers read: A. George lives in the United States. B. George is American. C. George comes from Canada. D. Maria is Canadian.
95. Designing Assessment Tasks • Appropriate response to a question. Test-takers hear: How much time did you take to do your homework? Test-takers read: A. In about an hour. B. About an hour. C. About $10. D. Yes, I did. • Open-ended response to a question. Test-takers hear: How much time did you take to do your homework? Test-takers write or speak: __________________________________
96. Designing Assessment Tasks: Selective Listening The test-taker listens to a limited quantity of aural input and must discern some specific information. Listening Cloze (cloze dictation or partial dictation): listening cloze tasks require the test-taker to listen to a story, monologue or conversation and simultaneously read a written text in which selected words or phrases have been deleted. One potential weakness of the listening cloze technique: such tasks may simply become reading comprehension tasks. Test-takers who are asked to listen to a story with periodic deletions in the written version may not need to listen at all, yet may still be able to respond with the appropriate word or phrase. Information Transfer: aurally processed information must be transferred to a visual representation, e.g., labelling a diagram, identifying an element in a picture, completing a form, or showing routes on a map. Chart Filling: test-takers see a chart of Lucy's daily schedule and fill in the schedule. Sentence Repetition: the test-taker must retain a stretch of language long enough to reproduce it, and then must respond with an oral repetition of that stimulus.
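To see how a listening cloze sheet differs from plain dictation, here is a minimal sketch of preparing one: the test-taker would hear the full transcript while reading the blanked version. Deleting every nth word is only one illustrative selection scheme; real test writers choose target words deliberately, partly to guard against the reading-only weakness noted above.

```python
# Minimal sketch of preparing a listening-cloze sheet: selected words
# are blanked in the written version of a transcript the test-taker
# will hear in full. The every-nth-word rule is illustrative only.

import re

def make_listening_cloze(transcript: str, every_nth: int = 7):
    words = transcript.split()
    blanked, answers = [], []
    for i, word in enumerate(words, start=1):
        core = re.sub(r"[^\w'-]+$", "", word)  # keep trailing punctuation
        if i % every_nth == 0 and core:
            answers.append(core)
            blanked.append(word.replace(core, "_" * 8, 1))
        else:
            blanked.append(word)
    return " ".join(blanked), answers

text = ("Lucy gets up at six, eats a quick breakfast, and catches "
        "the seven fifteen bus to her office in the city center.")
sheet, key = make_listening_cloze(text)
print(sheet)
print("Answer key:", key)
```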
97. DESIGNING ASSESSMENT TASKS: EXTENSIVE LISTENING Dictation: test-takers hear a passage, typically 50-100 words, recited three times. First reading: natural speed, no pauses; test-takers listen for the gist. Second reading: slowed speed, a pause at each break; test-takers write. Third reading: natural speed; test-takers check their work. Communicative Stimulus-Response Tasks: test-takers are presented with a stimulus monologue or conversation and then asked to respond to a set of comprehension questions. First, test-takers hear the instructions and the dialogue or monologue; second, they read the multiple-choice comprehension questions and items, then choose the correct one. Authentic Listening Tasks: Buck (2001, p. 92): 'Every test requires some components of communicative language ability, and no test covers them all. Similarly, every task shares some characteristics with target-language tasks, and no test is completely authentic.'
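Scoring a dictation can be made mechanical once a rubric is fixed. The sketch below implements one deliberately simple scheme, the percentage of the original words reproduced in order; actual rubrics are usually finer-grained (e.g., distinguishing spelling slips from comprehension errors), so treat this only as an illustration.

```python
# Minimal sketch of one simple dictation-scoring scheme: percentage
# of words from the original passage reproduced in order. Real
# rubrics weight different error types; this is illustrative only.

from difflib import SequenceMatcher

def score_dictation(original: str, transcription: str) -> float:
    orig = original.lower().split()
    trans = transcription.lower().split()
    matcher = SequenceMatcher(None, orig, trans)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(orig)

passage = "The train to the city leaves at nine every morning"
attempt = "The train to city leaves at nine in morning"
print(f"{score_dictation(passage, attempt):.0f}% of words reproduced")  # 80%
```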
98. Alternatives to assess comprehension in a truly communicative context Note-taking: listening to a lecture and writing down the important ideas. Disadvantage: scoring is time-consuming. Advantages: mirrors real classroom situations; fulfills the criteria of cognitive demand, communicative language & authenticity. Editing: editing a written stimulus against an aural stimulus. Interpretive tasks: paraphrasing a story or conversation; potential stimuli include song lyrics, poetry, radio and TV news reports, etc. Retelling: listen to a story & simply retell it, either orally or in writing, to show full comprehension. Difficulties: scoring and reliability; validity, cognitive and communicative ability, and authenticity are well incorporated into the task. Interactive listening (face-to-face conversations).
99. Chapter-7 Assessing Speaking
100. Challenges of testing speaking: 1- The interaction of speaking and listening 2- Elicitation techniques 3- Scoring BASIC TYPES OF SPEAKING 1. Imitative (parrot back): testing the ability to imitate a word, phrase, or sentence; pronunciation is tested. Examples: word, phrase, sentence repetition. 2. Intensive: the purpose is producing short stretches of oral language, designed to demonstrate competence in a narrow band of grammatical, phrasal, lexical, or phonological relationships (stress / rhythm / intonation). 3. Responsive (interacting with the interlocutor): includes interaction and tests comprehension, but at the somewhat limited level of very short conversations, standard greetings, small talk, simple requests and comments, and the like. 4. Interactive: the difference between responsive and interactive speaking is the length and complexity of the interaction, which includes multiple exchanges and/or multiple participants. 5. Extensive (monologue): extensive oral production tasks include speeches, oral presentations, and story-telling, during which the opportunity for oral interaction from listeners is either highly limited (perhaps to nonverbal responses) or ruled out altogether.
101. Micro- and Macroskills of Speaking The microskills of speaking refer to producing small chunks of language such as phonemes, morphemes, words and phrasal units. The macroskills involve the speaker's focus on larger elements such as fluency, discourse, function, style, cohesion, nonverbal communication and strategic options. Macroskills: 1. Appropriately accomplish communicative functions according to situations, participants, and goals. 2. Use appropriate styles, registers, implicature, redundancies, pragmatic conventions, conversation rules, floor-keeping and -yielding, interrupting, and other sociolinguistic features in face-to-face conversations. 3. Convey links and connections between events, and communicate such relations as focal and peripheral ideas, events and feelings, new information and given information, generalization and exemplification. 4. Convey facial features, body language, and other nonverbal cues along with verbal language. 5. Develop and use a battery of speaking strategies, such as emphasizing key words, rephrasing, providing a context for interpreting the meaning of words, appealing for help, and accurately assessing how well your interlocutor is understanding you.
102. Microskills: 1. Produce differences among English phonemes and allophonic variants. 2. Produce chunks of language of different lengths. 3. Produce English stress patterns, words in stressed and unstressed positions, rhythmic structure, and intonation contours. 4. Produce reduced forms of words and phrases. 5. Use an adequate number of lexical units (words) to accomplish pragmatic purposes. 6. Produce fluent speech at different rates of delivery. 7. Monitor one's own oral production and use various devices (pauses, fillers, self-corrections, backtracking) to enhance the clarity of the message. 8. Use grammatical word classes (nouns, verbs, etc.), systems (tense, agreement, pluralization), word order, patterns, rules, and elliptical forms. 9. Produce speech in natural constituents: in appropriate phrases, pause groups, breath groups, and sentence constituents. 10. Express a particular meaning in different grammatical forms. 11. Use cohesive devices in spoken discourse.
103. Three important issues as you set out to design tasks: 1. No speaking task is capable of isolating the single skill of oral production; concurrent involvement of the additional performance of aural comprehension, and possibly reading, is usually necessary. 2. Eliciting the specific criterion you have designated for a task can be tricky, because beyond the word level, spoken language offers a number of productive options to test-takers; make sure your elicitation prompt achieves its aims as closely as possible. 3. It is important to carefully specify scoring procedures for a response so that you ultimately achieve as high a reliability index as possible. Interaction between speaking and listening or reading is unavoidable. Interaction effect: the impossibility of testing speaking in isolation. Elicitation techniques: to elicit the specific criterion we expect from test-takers. Scoring: to achieve reliability.
104. Designing Assessment Tasks: Imitative Speaking This means paying more attention to pronunciation, especially suprasegmentals, in an attempt to help learners be more comprehensible. Repetition tasks are acceptable as long as they are not allowed to occupy a dominant role in an overall oral production assessment, and as long as a negative washback effect is avoided. In a simple repetition task, test-takers repeat the stimulus, whether it is a pair of words, a sentence, or perhaps a question (to test for intonation production). Word repetition task: scoring specifications must be clear in order to avoid reliability breakdowns. A common form of scoring simply uses a 2- or 3-point system for each response. Scoring scale for repetition tasks: 2 = acceptable pronunciation; 1 = comprehensible, partially correct pronunciation; 0 = silence, or seriously incorrect pronunciation. The longer the stretch of language, the more possibility for error, and therefore the more difficult it becomes to assign a point system to the text.
105. PHONEPASS TEST Research on the PhonePass test has supported the construct validity of its repetition tasks, not just for phonological ability but for discourse and overall oral production ability. The PhonePass test elicits computer-assisted oral production over a telephone: test-takers read aloud, repeat sentences, say words, and answer questions. Test-takers are directed to telephone a designated number and listen for directions. The test has five sections. Part A: testees read aloud selected sentences from among those printed on the test sheet. Part B: testees repeat sentences dictated over the phone. Part C: testees answer questions with a single word or a short phrase of two or three words. Part D: testees hear three word groups in random order and link them into a correctly ordered sentence. Part E: testees have 30 seconds to talk about their opinion of a topic that is dictated over the phone. Scores are calculated by a computerized scoring template and reported back to the test-taker within minutes. Pronunciation, reading fluency, repeat accuracy and fluency, and listening vocabulary are the sub-skills scored. The scoring procedure has been validated against human scoring with extraordinarily high reliabilities and correlation statistics.
106. Designing Assessment Tasks: Intensive Speaking Test-takers are prompted to produce short stretches of discourse (no more than a sentence) through which they demonstrate linguistic ability at a specified level of language. Intensive tasks may also be described as limited-response tasks, mechanical tasks, or what classroom pedagogy would label controlled responses. Directed Response Tasks: the administrator elicits a particular grammatical form or a transformation of a sentence. Such tasks are clearly mechanical and not communicative (a possible drawback), but they do require minimal processing of meaning in order to produce the correct grammatical output (a practical advantage). Read-Aloud Tasks (to improve pronunciation and fluency): these extend beyond the sentence level, up to a paragraph or two. The task is easily administered by selecting a passage that incorporates the test specs and by recording the testee's output; scoring is easy because all of the test-taker's oral production is controlled. While reading aloud offers certain practical advantages (predictable output, practicality, reliability in scoring), there are several drawbacks: reading aloud is somewhat inauthentic, in that we seldom read anything aloud to someone else in the real world, with the exception of a parent reading to a child.
107. Sentence / Dialogue Completion Tasks and Oral Questionnaires (to produce omitted lines or words in a dialogue appropriately): test-takers read a dialogue in which one speaker's lines have been omitted. Test-takers are first given time to read through the dialogue to get its gist and to think about appropriate lines to fill in. An advantage of this technique lies in its moderate control of the output of the test-taker (a practical advantage). One disadvantage is its reliance on literacy and an ability to transfer easily from written to spoken English (a possible drawback). Another disadvantage is the contrived, inauthentic nature of the task (a drawback). Picture-Cued Tasks (to elicit oral production by using pictures): one of the more popular ways to elicit oral language performance at both intensive and extensive levels is a picture-cued stimulus that requires a description from the test-taker. Assessment of oral production may be stimulated through a more elaborate picture (a practical advantage). Maps are another visual stimulus that can be used to assess the language forms needed to give directions and specify locations (a practical advantage).
108. Scoring may be problematic depending on the expected performance. Scoring scale for intensive tasks: 2 = comprehensible; acceptable target form. 1 = comprehensible; partially correct target form. 0 = silence, or seriously incorrect target form. Translation (of Limited Stretches of Discourse) (translating from the native language into the target language): test-takers are given a native-language word, phrase, or sentence and are asked to translate it. As an assessment procedure, the advantage of translation lies in its control of the output of the test-taker, which of course means that scoring is more easily specified.
109. Designing Assessment Tasks: Responsive Speaking Assessment involves brief interactions with an interlocutor, differing from intensive tasks in the increased creativity given to the test-taker, and from interactive tasks in the somewhat limited length of utterances. Question and Answer: question-and-answer tasks can consist of one or two questions from an interviewer, or they can make up a portion of a whole battery of questions and prompts in an oral interview. The first question is intensive in its purpose: it is a display question intended to elicit a predetermined correct response. Questions at the responsive level tend to be genuine referential questions, in which the test-taker is given more opportunity to produce meaningful language in response. Test-takers respond with a few sentences at most. Test-takers may also respond with questions: a potentially tricky form of oral production assessment involves more than one test-taker with an interviewer. With two students in an interview context, both test-takers can ask questions of each other.
110. Giving Instructions and Directions The technique is simple: the administrator poses the problem, and the test-taker responds. Scoring is based primarily on comprehensibility and secondarily on other specified grammatical or discourse categories. Paraphrasing: test-takers read or hear a number of sentences and produce a paraphrase. Advantages: such tasks elicit short stretches of output and perhaps tap into the testee's ability to practice conversational conciseness by reducing the output/input ratio. If you use short paraphrasing tasks as an assessment procedure, it's important to pinpoint the objective of the task clearly. In this case, the integration of listening and speaking is probably more at stake than simple oral production alone. TEST OF SPOKEN ENGLISH (TSE) The TSE is a 20-minute audio-taped test of oral language ability within an academic or professional environment. The scores are also used for selecting and certifying health professionals such as physicians, nurses, pharmacists, physical therapists, and veterinarians. The tasks on the TSE are designed to elicit oral production in various discourse categories rather than in selected phonological, grammatical, or lexical targets.
111. Designing Assessment Tasks: Interactive Speaking Tasks include longer interactive discourse (interviews, role plays, discussions, games). Interview: a test administrator and a test-taker sit down in a direct face-to-face exchange and proceed through a protocol of questions and directives. The interview is then scored on accuracy in pronunciation and/or grammar, vocabulary usage, fluency, pragmatic appropriateness, task accomplishment, and even comprehension. Placement interviews are designed to get a quick spoken sample from a student to verify placement into a course. Four stages: 1. Warm-up (small talk): the interviewer directs mutual introductions, helps the testee become comfortable, and apprises the testee of the format, allaying anxieties (no scoring). 2. Level check: the interviewer stimulates the testee to respond using expected or predicted forms and functions. This stage gives the interviewer a picture of the testee's extroversion, readiness to speak, and confidence. Linguistic target criteria are scored in this phase. 3. Probe: probe questions and prompts challenge testees to reach the heights of their ability, to extend beyond the limits of the interviewer's expectation, through difficult questions. 4. Wind-down: a short period during which the interviewer encourages the testee to relax with easy questions, setting the testee at ease.
112. The success of an oral interview will depend on: *clearly specifying the administrative procedures of the assessment (practicality); *focusing the questions and probes on the purpose of the assessment (validity); *appropriately eliciting an optimal amount and quality of oral production from the test-taker (biased for best performance); *creating a consistent, workable scoring system (reliability).
113. Role Play Role playing is a popular pedagogical activity in communicative language teaching classes. Within the constraints set forth by the guidelines, it frees students to be somewhat creative in their linguistic output. While a role play can be controlled or 'guided' by the interviewer, this technique takes test-takers beyond the simple intensive and responsive levels to a level of creativity and complexity that approaches real-world pragmatics. Scoring presents the usual issues in any task that elicits somewhat unpredictable responses from test-takers. Discussions and Conversations As formal assessment devices, discussions and conversations with and among students are difficult to specify and even more difficult to score. But as informal techniques to assess learners, they offer a level of authenticity and spontaneity that other assessment techniques may not provide. Assessing the performance of participants through scores or checklists should be carefully designed to suit the objectives of the observed discussion. Discussion is an integrative task, so it is also advisable to give some cognizance to comprehension performance in evaluating learners.
114. Games Among informal assessment devices are a variety of games that directly involve language production. Assessment games: 1. 'Tinkertoy' game (Lego blocks) 2. Crossword puzzles 3. Information-gap grids 4. City maps ORAL PROFICIENCY INTERVIEW (OPI) The best-known oral interview format is the Oral Proficiency Interview. The OPI is the result of a historical progression of revisions under the auspices of several agencies, including the Educational Testing Service and the American Council on the Teaching of Foreign Languages (ACTFL). The OPI is carefully designed to elicit pronunciation, fluency and integrative ability, sociolinguistic and cultural knowledge, grammar, and vocabulary. Performance is judged by the examiner to be at one of ten possible levels on the ACTFL-designated proficiency guidelines for speaking: Superior; Advanced-high, -mid, -low; Intermediate-high, -mid, -low; Novice-high, -mid, -low.
115. Designing Assessment Tasks: Extensive Speaking Extensive speaking involves complex, relatively lengthy stretches of discourse: variations on monologues, with minimal verbal interaction. Oral Presentations: it would not be uncommon to be called on to present a report, a paper, a marketing plan, a sales idea, a design for a new product, or a method. Once again, the rules for effective assessment must be invoked: a) specify the criterion, b) set appropriate tasks, c) elicit optimal output, d) establish practical, reliable scoring procedures. Scoring is the key assessment challenge. Picture-Cued Story-Telling: one technique for eliciting oral production is through visual stimuli: pictures, photographs, diagrams, and charts. Consider a picture or series of pictures as a stimulus for a longer story or description. Criteria for scoring need to be clear about what it is you are hoping to assess.
116. Retelling a Story or News Event In this type of task, test-takers hear or read a story or news event that they are asked to retell. The objectives in assigning such a task vary from listening comprehension of the original to production of a number of oral discourse features (communicating sequences and relationships of events, stress and emphasis patterns, 'expression' in the case of a dramatic story), fluency, and interaction with the hearer. Scoring should meet the intended criteria. Translation (of Extended Prose): longer texts are presented for the test-taker to read in the native language and then translate into English (dialogues, directions for assembly of a product, a synopsis of a story or play or movie, directions on how to find something on a map, and other genres). The advantage of translation is the control of the content, vocabulary, and, to some extent, the grammatical and discourse features. The disadvantage is that translation of longer texts is a highly specialized skill for which some individuals obtain post-baccalaureate degrees. Criteria for scoring should take into account not only the purpose in stimulating a translation but also the possibility of errors that are unrelated to oral production ability.
117. 8 ASSESSING READING
118. TYPES (GENRES) OF READING Academic reading: reference material, textbooks, theses, essays, papers, test directions, editorials and opinion writing. Job-related reading: messages, letters/emails, memos. Personal reading: newspapers, magazines, letters, emails, cards, invitations, schedules (train, bus).
119. Microskills: Discriminate among the distinctive graphemes and orthographic patterns of English. Retain chunks of language of different lengths in short-term memory. Process writing at an efficient rate of speed to suit the purpose. Recognize a core of words, and interpret word order patterns and their significance. Recognize grammatical word classes (nouns, verbs, etc.), systems (tense, agreement, pluralization), patterns, rules and elliptical forms. Recognize cohesive devices in written discourse and their role in signaling the relationship between and among clauses.
120. Macroskills: Recognize the rhetorical forms of written discourse and their significance for interpretation. Recognize the communicative functions of written texts, according to form and purpose. Infer context that is not explicit by using background knowledge. From described events, ideas, etc., infer links and connections between events, deduce causes and effects, and detect such relations as main idea, supporting idea, new information, generalization, and exemplification. Distinguish between literal and implied meanings. Detect culturally specific references and interpret them in a context of the appropriate cultural schemata. Develop and use a battery of reading strategies, such as scanning and skimming, detecting discourse markers, guessing the meaning of words from the context, and activating schemata for the interpretation of texts.
121. Some principal strategies for reading comprehension: Identify your purpose in reading a text. Apply spelling rules and conventions for bottom-up decoding. Use lexical analysis to determine meaning. Guess at meaning when you aren't certain. Skim the text for the gist and for main ideas. Scan the text for specific information (names, dates, key words). Use silent reading techniques for rapid processing. Use marginal notes, outlines, charts, or semantic maps for understanding and retaining information. Distinguish between literal and implied meanings. Capitalize on discourse markers to process relationships.
122. TYPES OF READING Perceptive: involves attending to the components of larger stretches of discourse: letters, words, punctuation, and other graphemic symbols. Selective: largely an artifact of assessment formats; uses picture-cued tasks, matching, true/false, multiple-choice, etc. Interactive: the task is to identify relevant features (lexical, symbolic, grammatical, and discourse) within texts of moderately short length, with the objective of retaining the information that is processed. Extensive: the purposes of assessment usually are to tap into a learner's global understanding of a text, as opposed to asking test-takers to 'zoom in' on small details. Top-down processing is assumed for most extensive tasks.
123. PERCEPTIVE READING Reading Aloud: the test-taker reads items aloud, one by one, in the presence of an administrator. Written response: reproduce the probe in writing; evaluation of the test-taker's response must be carefully treated. Multiple-choice: choosing one of four or five possible answers. Picture-Cued Items: test-takers are shown a picture and written text and are given one of a number of possible tasks to perform.
124. SELECTIVE READING The test designer focuses on formal aspects of language (lexical, grammatical, and a few discourse features). This category includes what many incorrectly think of as testing 'vocabulary and grammar'. Multiple-Choice (for Form-Focused Criteria): items may have little context, but might serve as a vocabulary or grammar check. Matching Tasks: the most frequently appearing criterion in matching procedures is vocabulary. Editing Tasks: editing for grammatical or rhetorical errors is a widely used test method for assessing linguistic competence in reading. Picture-Cued Tasks: read a sentence or passage and choose one of four pictures that is described; or read a series of sentences or definitions, each describing a labeled part of a picture or diagram. Gap-Filling Tasks: create completion items where test-takers read part of a sentence and then complete it by writing a phrase.
125. INTERACTIVE READING Cloze Tasks: the ability to fill in gaps in an incomplete image (visual, auditory, or cognitive) and supply (from background schemata) omitted details. Impromptu Reading Plus Comprehension Questions: hardly any reading assessment is without some component involving impromptu reading and responding to questions. Short-Answer Tasks: the age-old short-answer format following reading passages. Editing (Longer Texts): the technique has been applied successfully to longer passages of 200 to 300 words. Advantages: 1st, authenticity; 2nd, the task simulates proofreading one's own essay; 3rd, it can be connected to a specific curriculum. Scanning: a strategy used by all readers to find relevant information in a text. Ordering Tasks: variations on these can serve as an assessment of overall global understanding of a story and of the cohesive devices that signal the order of events or ideas. Information Transfer (Reading Charts, Maps, Graphs, Diagrams): such media presuppose the reader's schemata for interpreting them and are accompanied by oral or written discourse to convey, clarify, question, argue, and debate, among other linguistic functions.
126. EXTENSIVE READING Involves longer texts than we have been dealing with up to this point. Skimming Tasks: the process of rapid coverage of reading matter to determine its gist or main idea. Summarizing and Responding: making a summary of the text and giving a response to it. Note-Taking and Outlining: a teacher, perhaps in one-on-one conferences with students, can use student notes/outlines as indicators of the presence or absence of effective reading strategies, and thereby point learners in positive directions.
127. UNIT 9: ASSESSING WRITING
128. GENRES OF WRITING Academic writing: papers and general subject reports; essays, compositions; academically focused journals; short-answer test responses; technical reports (e.g., lab reports); theses, dissertations. Job-related writing: messages; letters/emails; memos (e.g., interoffice); reports (e.g., job evaluations, project reports); schedules, labels, signs; advertisements, announcements; manuals. Personal writing: letters, emails, greeting cards, invitations; messages, notes, calendar entries, shopping lists, reminders; financial documents (e.g., checks, tax forms, loan applications); forms, questionnaires, medical reports, immigration documents; diaries, personal journals; fiction (e.g., short stories, poetry).


129. MICROSKILLS AND MACROSKILLS OF WRITING Micro-skills: Produce graphemes and orthographic patterns of English. Produce writing at an efficient rate of speed to suit the purpose. Produce an acceptable core of words and use appropriate word order patterns. Use acceptable grammatical systems (tense, agreement), patterns and rules. Express a particular meaning in different grammatical forms. Use cohesive devices in written discourse. Macro-skills: Use the rhetorical forms and conventions of written discourse. Appropriately accomplish the communicative functions of written texts according to form and purpose. Convey links and connections between events, and communicate such relations as main idea, supporting idea, new information, generalization, exemplification. Distinguish between literal and implied meanings when writing. Correctly convey culturally specific references in the context of the written text. Develop & use writing strategies, such as accurately assessing the audience's interpretation, using prewriting devices, writing with fluency in the first drafts, using paraphrases and synonyms, soliciting feedback, and using feedback for revising and editing.
130. Types of Writing Performance Imitative Writing: assess the ability to spell correctly & perceive phoneme/grapheme correspondences; form rather than meaning (letters, words, punctuation, brief sentences, mechanics of writing). Intensive Writing: produce appropriate vocabulary within a context and correct grammatical features in a sentence; more form than meaning, but meaning and context are of some importance (collocations, idioms, correctness, appropriateness). Responsive Writing: connect sentences & create a logically connected text of 2 or 3 paragraphs; discourse conventions with strong emphasis on context and meaning (limited discourse level, connecting sentences logically). Extensive Writing: manage all the processes of writing, for all purposes, to write longer texts (essays, papers, theses); processes of writing (strategies of writing).
131. IMITATIVE WRITING Tasks in Handwriting Letters, Words, and Punctuation Copying (bit __ / bet __ / bat __): copy the words given in the spaces provided. Listening cloze selection tasks: write the missing words in the blanks, selecting them according to what is heard; a combination of dictation with a written text. Purpose: to give practice in writing. Picture-cued tasks: write the word the picture represents; make sure that pictures are not ambiguous. Form completion tasks: complete the blanks in simple forms, e.g., name, address, phone number; make sure that students have practiced filling out such forms. Converting numbers/abbreviations to words: either write out the numbers or convert the abbreviations to words; more reading than writing, so specify the criterion; low authenticity, but a reliable method to stimulate handwritten English.
132. Spelling Tasks and Detecting Phoneme-Grapheme Correspondences Spelling Tests: write words that are dictated, or choose words that have been heard or spoken. Scoring: correct spelling. Picture-Cued Tasks: write the words that are displayed in pictures, e.g., boot-book, read-reed, bit-bite; choose items according to your test purpose. Multiple-Choice Techniques: choose and write the word with the correct spelling to fit the given sentence. Items are better if they have a writing component; the addition of homonyms makes the task more challenging; this clashes with reading, so be careful. Purpose: to assess the ability to spell words correctly and to process phoneme-grapheme correspondences. Matching Phonetic Symbols: write out in the regular alphabet the correctly spelled word given in phonetic symbols. Since the Latin alphabet and phonetic alphabet symbols are different from each other, this works well.
133. INTENSIVE (CONTROLLED) WRITING Dictation: writing what is heard aurally; listening & correct spelling & punctuation. Dicto-comp: re-writing a paragraph in one's own words after hearing it 2 or 3 times; listening & vocabulary & spelling & punctuation. Grammatical transformation: making grammatical transformations by changing or combining forms of language; grammatical competence; easy to administer & practical & reliable; no meaningful value, and even with context, no authenticity. Picture-cued: 1. short sentences, 2. picture description, 3. picture sequence description; reading nonverbal means & grammar & spelling & vocabulary; reading-writing integration; scoring is problematic when pictures are not clear.
134. Vocabulary assessment: either defining or using a word in a sentence; assessing collocations and derived morphology; vocabulary & grammar; less authentic (using a word in a sentence?). Ordering: ordering/re-ordering a scrambled set of words; if verbal, intensive speaking; if written, intensive writing; reading and grammar; appealing for those who like word games and puzzles, but inauthentic; needs practicing in class; both reading and writing. Short answer and sentence completion: answering or asking questions about the given statements / writing 2 or 3 sentences using the given prompts; reading & writing; scoring on a 2-1-0 scale is appropriate.


135. 1. AUTHENTICITY (face and content validity): The teacher becomes less an instructor and more a coach or facilitator. Assessment is formative, so (+) washback outweighs practicality and reliability. 2. SCORING: both how sts string words together and what they say. 3. TIME: no time constraints gives freedom for drafts before the finished product. Questioned issue: is the timed impromptu format a valid method of writing assessment?
136. RESPONSIVE AND EXTENSIVE WRITING 1. Paraphrasing Its importance: to say something in one's own words, to avoid plagiarism, to offer some variety in expression. Test-takers' task: paraphrasing sentences or paragraphs with these purposes in mind. Assessment type: informal and formative; positive washback. Scoring: conveying a similar message is primary; discourse, grammar and vocabulary are secondary. 2. Guided question and answer Its importance: to provide the benefits of guiding test-takers without dictating the form of the output. Test-takers' task: answering a series of guided questions that serve as an outline of the emergent written text. Assessment type: informal and formative. Scoring: either on a holistic scale or an analytical one.
137. 3. Paragraph Construction Tasks Topic Sentence Writing: the presence or absence of a topic sentence; the effectiveness of the topic sentence. Topic Development in a Paragraph: the clarity of expression; the logic of the sequence; unity and cohesion; overall effectiveness. Multi-Paragraph Essay: addressing the topic / main idea / purpose; organizing supporting ideas; using appropriate details for supporting ideas; facility and fluency in language use; demonstrating syntactic variety. 4. Strategic Options Free writing, outlining, drafting and revising are strategies which help writers create effective texts. Writers need to know their subject, purpose and audience in order to write; developing main and supporting ideas is central to essay writing. Some tasks commonly addressed in academic writing courses are compare/contrast, problem/solution, pros/cons and cause/effect. Assessment of tasks in an academic writing course could be formative & informal. Knowing the conventions & opportunities of a genre will help one write effectively; every genre of writing requires different conventions.
138. Test of Written English (TWE®) Time allocated: 30-minute time limit / no preparation ahead of time. Prepared by: a panel of experts. Scoring: the mean of 2 independent ratings based on a holistic scoring scale. Number of raters: 2 trained raters working independently. Limitations: inauthentic / not real-life / puts test-takers into an artificially time-constrained context; inappropriate for instructional purposes. Strengths: serves administrative purposes. Follow 6 steps to be successful: Carefully identify the topic. Plan your supporting ideas. In the introductory paragraph, restate the topic and state the organizational plan of the essay. Write effective supporting paragraphs (show transitions, include a topic sentence, specify details). Restate your position and summarize in the concluding paragraph. Edit sentence structure and rhetorical expression.
139. SCORING METHODS FOR RESPONSIVE AND EXTENSIVE WRITING Holistic Scoring Definition: assigning a single score to represent a general overall assessment. Purpose of use: appropriate for administrative purposes, such as admission into an institution or placement in a course. Advantages: quick scoring; high inter-rater reliability; scores easily interpreted by lay persons; emphasizes the strengths of the written piece; applicable to many different disciplines. Disadvantages: no washback potential; masks the differences across the sub-skills within each score; not applicable to all genres; needs trained evaluators to use the scale accurately.
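The high inter-rater reliability claimed for holistic scoring is usually checked statistically. Here is a minimal sketch, with invented scores, of one common check: the Pearson correlation between two raters' holistic scores on the same set of essays (requires Python 3.10+ for statistics.correlation).

```python
# Minimal sketch: checking inter-rater reliability of holistic scores
# with a Pearson correlation between two raters. Scores are invented.

from statistics import correlation  # available in Python 3.10+

rater_a = [5, 4, 6, 3, 5, 2, 6, 4]
rater_b = [5, 4, 5, 3, 6, 2, 6, 3]

r = correlation(rater_a, rater_b)
print(f"inter-rater correlation r = {r:.2f}")  # close to 1.0 = high agreement
```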
140. Primary Trait Scoring Definition: assigning a score based on the effectiveness of the text in achieving its purpose (accuracy, clarity, description, expression of opinion). Purpose of use: to focus on the principal function of the text. Advantages: practical; allows both the writer and the scorer to focus on function/purpose. Disadvantage: does not break the text down into subcategories and give separate ratings for each.
141. Analytic Scoring Definition: rating a written text on several separate scales (e.g., organization, content, grammar, vocabulary, mechanics), each of which receives its own score. Purpose of use: classroom instructional purposes. Advantages: more washback into the further stages of learning; diagnoses both the weaknesses and the strengths of the writing. Disadvantage: lower practicality, since scorers have to attend to details within each sub-score.
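An analytic scheme can still be reported as a single number by weighting the sub-scales. The sketch below is illustrative only: the category names and weights are assumptions, not a fixed standard, and the diagnostic value lies in the sub-score profile rather than the composite.

```python
# Minimal sketch of an analytic scoring profile: separate sub-scores
# (on a 1-5 scale here) combined into a weighted composite. Category
# names and weights are illustrative assumptions, not a fixed rubric.

SCALES = {"organization": 0.20, "content": 0.30, "grammar": 0.20,
          "vocabulary": 0.15, "mechanics": 0.15}

def analytic_composite(subscores: dict[str, int]) -> float:
    assert set(subscores) == set(SCALES), "rate every category"
    return sum(SCALES[name] * score for name, score in subscores.items())

essay = {"organization": 4, "content": 3, "grammar": 4,
         "vocabulary": 5, "mechanics": 3}
print(f"composite = {analytic_composite(essay):.2f} / 5")  # 3.70 / 5
# The profile itself (not just the composite) is what gives washback.
```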
142. BEYOND SCORING: RESPONDING TO EXTENSIVE WRITING Here the writer discusses the process approach to writing and how assessment takes place within it. Many educators advocate a process approach to writing, which pays attention to the various stages that any piece of writing goes through. By spending time with learners on pre-writing phases, editing, re-drafting and finally producing a finished version of their work, a process approach aims to get to the heart of the various skills that most writers employ. Types of responding: self, peer, and teacher responding. Assessment type: informal / formative. Washback: potential positive washback. Role of the assessor: guide / facilitator.
143. GUIDELINES FOR ASSESSING STAGES OF WRITTEN COMPOSITION Initial stages Focus: meaning, main idea, and organization. Ignore: grammatical and lexical errors / minor errors. Indicate (but do not correct): global errors. Later stages Focus: fine-tuning toward a final version. Indicate: problems related to cohesion, documentation, and citation.
144. CHAPTER 10: BEYOND TESTS: ALTERNATIVES IN ASSESSMENT
145. Characteristics of Alternative Assessment require students to perform, create, produce, or do
something; use real-world contexts or simulations; are non-intrusive in that they extend the day-to-day
classroom activities; allow students to be assessed on what they normally do in class every day; use tasks
that represent meaningful instructional activities; focus on processes as well as products; tap into higher-level thinking and problem-solving skills; provide information about both the strengths and weaknesses of
students; are multi-culturally sensitive when properly administered; ensure that people, not machines, do
the scoring, using human judgment; encourage open disclosure of standards and rating criteria; and call
upon teachers to perform new instructional and assessment roles.
146. DILEMMA OF MAXIMIZING BOTH PRACTICALITY AND WASHBACK Large-scale standardized tests: one-shot performances; timed; multiple-choice; decontextualized; norm-referenced; foster extrinsic motivation; highly practical, reliable instruments that minimize time and money; offer much practicality and reliability; cannot offer much washback or authenticity. Alternative assessments: open-ended in their time orientation and format; contextualized to a curriculum; referenced to the criteria (objectives) of that curriculum; likely to build intrinsic motivation; demand considerable time and effort; offer much authenticity and washback.
147. The dilemma of maximizing both practicality and washback The principal purpose of this chapter is to examine some of the alternatives in assessment that are markedly different from formal tests. Formal tests, especially large-scale standardized tests, tend to be one-shot performances that are timed, multiple-choice, decontextualized, and norm-referenced, and that foster extrinsic motivation. On the other hand, tasks like portfolios, journals, conferences, interviews, and self-assessment are open-ended in their time orientation and format, contextualized to a curriculum, referenced to the criteria (objectives) of that curriculum, and likely to build intrinsic motivation.
148. PORTFOLIOS One of the most popular alternatives in assessment, especially within a framework of communicative language teaching, is portfolio development. Portfolios include materials such as essays and compositions in draft and final forms; reports and project outlines; poetry and creative prose; artwork, photos, and newspaper or magazine clippings; audio and/or video recordings of presentations, demonstrations, etc.; journals, diaries, and other personal reflections; tests, test scores, and written homework exercises; notes on lectures; and self- and peer-assessments (comments and checklists).
149. Successful portfolio development will depend on following a number of steps and guidelines: 1. State objectives clearly. 2. Give guidelines on what materials to include. 3. Communicate assessment criteria to students. 4. Designate time within the curriculum for portfolio development. 5. Establish periodic schedules for review and conferencing. 6. Designate an accessible place to keep portfolios. 7. Provide positive washback when giving final assessments.
150. JOURNALS A journal is a log or account of one's thoughts, feelings, reactions, assessments, ideas, or progress toward goals, usually written with little attention to structure, form, or correctness. Categories or purposes in journal writing include the following: a. language learning logs b. grammar journals c. responses to readings d. strategy-based learning logs e. self-assessment reflections f. diaries of attitudes, feelings, and other affective factors g. acculturation logs
151. CONFERENCES AND INTERVIEWS Conferences Conferencing is not limited to drafts of written work; it also includes portfolios and journals. Conferences assume that the teacher plays the role of a facilitator and guide, not that of an administrator of a formal assessment. Interviews An interview may have one or more of several possible goals, in which the teacher assesses the student's oral production, ascertains a student's needs before designing a course or curriculum, or seeks to discover a student's learning styles and preferences. One overriding principle of effective interviewing centers on the nature of the questions that will be asked.
152. OBSERVATIONS In order to carry out classroom observation, it is of course important to take the following steps: 1. Determine the specific objectives of the observation. 2. Decide how many students will be observed at one time. 3. Set up the logistics for making unnoticed observations. 4. Design a system for recording observed performances. 5. Plan how many observations you will make.
153. SELF AND PEER ASSESSMENT Five categories of self- and peer assessment: 1. Direct assessment of performance: a student typically monitors him- or herself in either oral or written production and renders some kind of evaluation of that performance. 2. Indirect assessment of performance: indirect assessment targets larger slices of time with a view to rendering an evaluation of general ability, as opposed to one specific, relatively time-constrained performance. 3. Metacognitive assessment for setting goals: some kinds of evaluation are more strategic in nature, with the purpose not just of reviewing past performance or competence but of setting goals and keeping an eye on the process of their pursuit. 4. Socioaffective assessment: yet another type of self- and peer assessment comes in the form of methods of examining affective factors in learning. Such assessment is quite different from looking at and planning linguistic aspects of acquisition. 5. Student-generated tests: a final type of assessment, not usually classified strictly as self- or peer assessment, is the technique of engaging students in the process of constructing tests themselves.
154. GUIDELINES FOR SELF AND PEER ASSESSMENT Self- and peer assessment are among the best possible formative types of assessment and possibly the most rewarding. Four guidelines will help teachers bring this intrinsically motivating task into the classroom successfully: 1. Tell students the purpose of the assessment. 2. Define the task clearly. 3. Encourage impartial evaluation of performance or ability. 4. Ensure beneficial washback through follow-up tasks. A TAXONOMY OF SELF- AND PEER ASSESSMENT TASKS It is helpful to consider a variety of tasks within each of the four skills (listening, speaking, reading, and writing). An evaluation of self- and peer assessment according to our classic principles of assessment yields a pattern that is quite consistent with the other alternatives in assessment analyzed in this chapter. Practicality can reach a moderate level with such procedures as checklists and questionnaires.
155. CHAPTER 11: GRADING AND STUDENT EVALUATION
156. GUIDELINES FOR SELECTING GRADING CRITERIA It is essential for all components of grading to be consistent with an institutional philosophy and/or regulations (see below for a further discussion of this topic). All of the components of a final grade need to be explicitly stated in writing to students at the beginning of a term of study, with a designation of percentages or weighting figures for each component. If your grading system includes items (d) through (g) in the questionnaire above (improvement, behavior, effort, motivation), it is important for you to recognize their subjectivity. But this should not give you an excuse to avoid converting such factors into observable and measurable results. Finally, consider allocating relatively small weights to items (c) through (h) so that a grade primarily reflects achievement. A designation of 5 percent to 10 percent of a grade to such factors will not mask strong achievement in a course.
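To make the weighting arithmetic concrete, here is a minimal sketch of a final-grade calculation in which achievement components dominate and the more subjective factors are held to the 5-10 percent range suggested above. All component names, weights, and scores are invented for illustration.

```python
# Sketch: a weighted final grade in which achievement measures
# dominate and subjective factors get small weights. All names,
# weights, and scores here are illustrative assumptions.

components = {           # (weight, student's 0-100 score)
    "midterm":      (0.30, 84),
    "final_exam":   (0.35, 78),
    "assignments":  (0.25, 91),
    "effort":       (0.05, 95),   # subjective factor, small weight
    "improvement":  (0.05, 88),   # subjective factor, small weight
}

final = sum(w * s for w, s in components.values())
print(round(final, 1))  # 84.4 -- achievement still drives the grade
```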
157. CALCULATING GRADES: ABSOLUTE AND RELATIVE GRADING ABSOLUTE GRADING: If you pre-specify standards of performance on a numerical point system, you are using an absolute system of grading. For example, having established points for a midterm test, points for a final exam, and points accumulated for the semester, you might adhere to the specifications in the table below. The key to making an absolute grading system work is to be painstakingly clear on competencies and objectives, and on the tests, tasks, and other assessment techniques that will figure into the formula for assigning a grade.
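A short sketch of the cut-off logic behind an absolute system: each letter corresponds to a fixed, pre-specified threshold that does not shift with class performance. The thresholds below are hypothetical, not taken from the book's table.

```python
# Sketch: absolute grading with fixed, pre-specified cut-offs.
# The thresholds are hypothetical, not the book's specifications.

CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]

def absolute_grade(points: float) -> str:
    for threshold, letter in CUTOFFS:
        if points >= threshold:
            return letter
    return "F"

print(absolute_grade(83))  # B -- the same score always earns the same grade
```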
158. RELATIVE GRADING: It is more commonly used than absolute grading. It has the advantage of
allowing your own interpretation and of adjusting for unpredicted ease or difficulty of a test. Relative
grading is usually accomplished by ranking students in order of performance (percentile ranks) and
assigning cut-off points for grades. An older, relatively uncommon method of relative grading is what has
been called grading "on the curve," a term that comes from the normal bell curve of normative data plotted
on a graph.
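By contrast, here is a sketch of relative grading: students are ranked and letters are assigned by percentile cut-offs, so the same raw score can earn different grades in different classes. The percentile cut-offs and the tie-free sample data are hypothetical.

```python
# Sketch: relative grading by percentile rank. The cut-off
# percentiles are hypothetical; a real scheme (and tie handling)
# would be set by the teacher.

def relative_grades(scores: dict[str, float]) -> dict[str, str]:
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    grades = {}
    for i, student in enumerate(ranked):
        percentile = 1 - i / n          # top student ~ 1.0
        if percentile > 0.80:
            grades[student] = "A"
        elif percentile > 0.50:
            grades[student] = "B"
        elif percentile > 0.20:
            grades[student] = "C"
        else:
            grades[student] = "D"
    return grades

print(relative_grades({"Ana": 72, "Bo": 88, "Cem": 64, "Didi": 91, "Eli": 55}))
# {'Didi': 'A', 'Bo': 'B', 'Ana': 'B', 'Cem': 'C', 'Eli': 'D'}
```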
159. TEACHERS' PERCEPTIONS OF APPROPRIATE GRADE DISTRIBUTIONS Most teachers bring to a test or a course evaluation an interpretation of estimated appropriate distributions, follow that interpretation, and make minor adjustments to compensate for such matters as unexpected difficulty. What is surprising, however, is that teachers' preconceived notions of their own standards for grading often do not match their actual practice. INSTITUTIONAL EXPECTATIONS AND CONSTRAINTS For many institutions letter grading is foreign, but point systems (100 points, or percentages) are common. Some institutions refuse to employ either a letter grade or a numerical system of evaluation and instead offer narrative evaluations of Ss. This preference for more individualized evaluations is often a reaction to the overgeneralization of letter and numerical grading.
160. CROSS-CULTURAL FACTORS AND THE QUESTION OF DIFFICULTY A number of variables bear on the issue. In many cultures, it is unheard of to ask a student to self-assess performance: Ts assign a grade, and nobody questions the teacher's criteria. The measure of a good teacher is one who can design a test so difficult that no student could achieve a perfect score; the fact that students fall short of such marks of perfection is a demonstration of the teacher's superior knowledge. As a corollary, grades of A are reserved for a highly select few, and students are delighted with Bs. One single final examination is the accepted determinant of a student's entire course grade, and the notion of a teacher's preparing students to do their best on a test is an educational contradiction. In some cultures a "hard" test is a good test, but in others, a good test results in a distribution like the one in the bar graph for a "great bunch": a large proportion of As and Bs, a few Cs, and maybe a D or an F for the "deadbeats" in the class.
161. How do you gauge such difficulty as you design a classroom test that has not had the luxury of piloting and pre-testing? The answer is complex. It is usually a combination of a number of possible factors: experience as a teacher (with appropriate intuition); adeptness at designing feasible tasks; special care in framing items that are clear and relevant; mirroring in-class tasks that students have mastered; variation of tasks on the test itself; reference to prior tests in the same course; a thorough review and preparation for the test; knowledge of your students' collective abilities; and a little bit of luck.
162. WHAT DO LETTER GRADES "MEAN"? Typically, institutional manuals for teachers and students will list the following descriptors of letter grades: A: excellent B: good C: adequate D: inadequate/unsatisfactory F: failing/unacceptable The overgeneralization implicit in letter grading underscores the meaninglessness of the adjectives typically cited as descriptors of those letters. Is there a solution to their gate-keeping role? 1. Every teacher who uses letter grades or a percentage score to provide an evaluation, whether a summative, end-of-course assessment or a formal assessment procedure, should a. use a carefully constructed system of grading, b. assign grades on the basis of explicitly stated criteria, and c. base criteria on the objectives of the course or assessment procedure(s). 2. Educators everywhere must work to persuade the gatekeepers of the world that letter/numerical evaluations are simply one side of a complex representation of a student's ability. Alternatives to letter grading are essential considerations.
163. ALTERNATIVES TO LETTER GRADING For assessment of a test, paper, report, extra-class exercise, or other formal, scored task whose primary objective is to offer formative feedback, the possibilities beyond a simple number or letter include a teacher's marginal and/or end comments; a teacher's written reaction to a student's self-assessment of performance; a teacher's review of the test in the next class period; peer assessment of performance; self-assessment of performance; and a teacher's conference with the student. For summative assessment of a student at the end of a course, those same assessments can be made, perhaps in modified forms: a teacher's marginal and/or end-of-exam/paper/project comments; a T's summative written evaluative remarks on a journal, portfolio, or other tangible product; a T's written reaction to a student's self-assessment of performance in a course; a completed summative checklist of competencies, with comments; narrative evaluations of general performance on key objectives; and a teacher's conference with the student.
164. A more detailed look is now appropriate for a few of the summative alternatives to grading, particularly self-assessment, narrative evaluations, checklists, and conferences. 1. Self-assessment. Self-assessment of end-of-course attainment of objectives is recommended through the use of the following: checklists; a guided journal entry that directs the student to reflect on the content and linguistic objectives; an essay that self-assesses; or a teacher-student conference. 2. Narrative evaluations. In protest against the widespread use of letter grades as exclusive indicators of achievement, a number of institutions have at one time or another required narrative evaluations of students. In some instances those narratives replaced grades, and in others they supplemented them. (pp. 296-297) Advantages: individualization; evaluation of multiple objectives of a course; face validity; washback potential. Disadvantages: cannot be quantified by admissions and transcript-evaluation offices; impractical and time-consuming; Ss' paying little attention to them; Ts' succumbing to formulaic narratives that follow a template.
165. 3. Checklist evaluations. To compensate for the time-consuming impracticality of narrative evaluation, some programs opt for a compromise: a checklist with brief comments from the teacher, ideally followed by a conference and/or a response from the student. Advantages: increased practicality, reliability, and washback. Teacher time is minimized; uniform measures are applied across all students; some open-ended comments from the teacher are available; and the student responds with his or her own goals (in light of the results of the checklist and teacher comments). Note: when the checklist format is accompanied, as in this case, by letter grades as well, virtually none of the disadvantages of narrative evaluations remain, with only a small chance that some individualization may be slightly reduced. 4. Conferences. Perhaps enough has been said about the virtues of conferencing. You already know that the impracticality of scheduling sessions with students is offset by its washback benefits.

25

166. SOME PRINCIPLES AND GUIDELINES FOR GRADING AND EVALUATION You should now understand that grading is not necessarily based on a universally accepted scale, grading is sometimes subjective and context-dependent, grading of tests is often done on the "curve," grades reflect a teacher's philosophy of grading, grades reflect an institutional philosophy of grading, cross-cultural variation in grading philosophies needs to be understood, grades often conform, by design, to a teacher's expected distribution of students across a continuum, tests do not always yield an expected level of difficulty, letter grades may not "mean" the same thing to all people, and alternatives to letter grades or numerical scores are highly desirable as additional indicators of achievement.
167. With those characteristics of grading and evaluation in mind, the following principled guidelines should help you be an effective grader and evaluator of student performance: Develop an informed, comprehensive personal philosophy of grading that is consistent with your philosophy of teaching and evaluation. Ascertain an institution's philosophy of grading and, unless otherwise negotiated, conform to that philosophy (so that you are not out of step with others). Design tests that conform to appropriate institutional and cultural expectations of the difficulty that Ss should experience. Select appropriate criteria for grading and their relative weighting in calculating grades. Communicate criteria for grading to Ss at the beginning of the course and at subsequent grading periods (mid-term, final). Triangulate letter grade evaluations with alternatives that are more formative and that give more washback.
