
ADMINISTRATION, SCORING AND REPORTING

Introduction
Administering a test is usually the simplest phase of the testing process. There are, however, some common problems associated with test administration that can affect scores. Careful planning can help the teacher avoid or minimize such difficulties. When giving tests, it is important that everything possible be done to obtain valid results. Cheating, poor testing conditions, and test anxiety, as well as errors in test scoring procedures, contribute to invalid test results. Many of these factors can be controlled by practicing good test administration procedures. Practicing these procedures will prove less time consuming and less troublesome than dealing with problems resulting from poor procedures. After administering a test, the teacher's responsibility is to score it or arrange to have it scored. The teacher then interprets the results and uses these interpretations to make grading, selection, placement or other decisions. To interpret test scores accurately, however, the teacher needs to analyze the performance of the test as a whole and of the individual test items, and to use these data to draw valid inferences about student performance. This information also helps faculty prepare for post-test discussions with students about the exam.

Administering A Test
Test administration plays a vital role in enhancing the reliability of test scores. A test should be administered in a congenial environment, strictly according to the planned instructions, to ensure uniform conditions for all the people tested.
Suggestions for administering the test:
- Long announcements should not be made before or during the test.
- Instructions should be given in writing.
- The test administrator should not respond to the individual problems of the examinees.
The steps to be followed in the administration of group tests are:
a) Motivate the students to do their best.
b) Follow the directions closely.
c) Keep time accurately.
d) Record any significant events that might influence test scores.
e) Collect the test materials promptly.

The guiding principle in administering an achievement test is that all students must be given a fair chance to demonstrate their achievement of the learning outcomes being measured. This means a physical and psychological environment conducive to their best efforts, and the control of factors that might interfere with valid measurement. Students will not perform at their best if they are tense and anxious during testing. They should be reassured that the time limits are adequate to allow them to complete the test. This, of course, assumes that the test will be used to improve learning and that the time limits are in fact adequate. Things to avoid while administering a test:
- Do not talk unnecessarily before the test.
- Keep interruptions to a minimum during the test.
- Avoid giving hints to pupils who ask.

Administering Exams
How an exam is administered can affect student performance as much as how the exam was written. Below is a list of general principles to consider when designing and administering examinations.
1. Give complete instructions on how to take the examination, including how much each section counts and the amount of time to spend on each section. This helps students to allocate their efforts wisely.
2. State specifically what aids (e.g. calculator, notebooks) students are allowed to use in the examination room.
3. Use assignments and homework to provide preparation for taking the exams. For example, if the assignments ask all essay questions, it would be inappropriate for the examination to consist of 200 multiple-choice questions.
4. Practice taking the completed test yourself. You should count on students taking about four times the amount of time it takes you to complete the test.
5. For final examinations, structure the test to cover the scope of the entire course. The examination should be comprehensive enough to test adequately the students' learning of the course material. Use a variety of different types of questions on the examination (e.g. multiple-choice, essay, etc.), because some topics are covered more effectively with certain types of questions. Group questions of the same type together when possible.
6. Tell the students what types of questions will be on the test (i.e. essay, multiple-choice, etc.) prior to the examination. Allow students to see past (retired) exams so they know what to expect. For essay exams, explain how students will be evaluated (if appropriate).
7. Provide students with a list of review questions or topics covered on the exam, along with an indication of the relative emphasis on each topic.
8. Give detailed study suggestions.
9. Indicate how much the examination will count toward determining the final grade.

Importance Of Test Administration


Consistency
Standardized tests are designed to be administered under consistent procedures so that the test-taking experience is as similar as possible across examinees. This similar experience increases the fairness of the test as well as making examinees' scores more directly comparable. Typical guidelines related to test administration locations state that all sites should be comfortable, and should have good lighting, ventilation and handicap accessibility. Interruptions and distractions, such as excessive noise, should be prevented. The time limits that have been established should be adhered to for all test administrations.
Test security
Test security consists of methods designed to prevent cheating, as well as to protect the test items and content from being exposed to future test-takers. Test administration procedures related to test security may begin as early as the registration procedure. Many exam programs restrict examinees from registering for a test unless they meet certain eligibility criteria. When examinees arrive at the test site, additional provisions for test security include verifying each examinee's identification and restricting the materials (such as photographic or communication devices) that an examinee is allowed to bring into the test administration. If the exam program uses multiple parallel test forms, these may be distributed in a spiral fashion in order to prevent one examinee from being able to copy from another (form A is distributed to the first examinee, form B to the second examinee, form A to the third examinee, etc.). The test proctors should also remain attentive throughout the test administration to prevent cheating and other security breaches. When testing is complete, all test-related materials should be carefully collected from the examinees before they depart.

Summary
The use of orderly, standardized test administration procedures is beneficial to examinees. In particular, administration procedures designed to promote consistent conditions for all examinees increase the exam program's fairness. Test administration procedures related to security protect the integrity of the test items. In both of these cases, the standardization of test administration procedures prevents some examinees from being unfairly advantaged over other examinees.
How many questions should I give?
It is important to allow your students enough time to complete the exam comfortably and reasonably. Inevitably this will mean you must make some choices about which questions you will ask. As a rough guide, allow:
o One minute per objective-type question
o Two minutes for a short answer requiring one sentence
o Five to ten minutes for a longer short answer
o Ten minutes for a problem that would take you two minutes to answer
o Fifteen minutes for a short, focused essay
o Thirty minutes for an essay of more than one to two pages

You should add ten minutes or so to allow for the distribution and collection of the exam.
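These rules of thumb make it easy to estimate the total length of a planned exam. The short sketch below (in Python) totals the per-item minutes listed above for a set of hypothetical question counts; the type names, counts, and the use of the upper bound for longer short answers are illustrative assumptions, not part of the original guidance.

# A minimal sketch for estimating exam length from the rules of thumb above.
# The question counts in the example are hypothetical.
MINUTES_PER_ITEM = {
    "objective": 1,            # one minute per objective-type question
    "one_sentence": 2,         # two minutes for a one-sentence short answer
    "long_short_answer": 10,   # five to ten minutes; the upper bound is used here
    "problem": 10,             # ten minutes for a problem you could solve in two
    "focused_essay": 15,       # fifteen minutes for a short, focused essay
    "long_essay": 30,          # thirty minutes for an essay of more than one to two pages
}

def estimated_minutes(question_counts, admin_overhead=10):
    # Total the per-item time, then add about ten minutes for handing out
    # and collecting the exam, as suggested above.
    total = sum(MINUTES_PER_ITEM[kind] * n for kind, n in question_counts.items())
    return total + admin_overhead

# Example: 20 objective items, 4 one-sentence answers and 1 focused essay.
print(estimated_minutes({"objective": 20, "one_sentence": 4, "focused_essay": 1}))  # -> 53

A total like 53 minutes fits comfortably in a one-hour period; if the estimate exceeds the period, drop or shorten questions rather than rushing students.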

Administering tests
There are several things you should keep in mind to make the experience run as smoothly as possible. Have extra copies of the test on hand, in case you have miscounted or in the event of some other problem.

Minimize interruptions during the exam by reading the directions briefly at the start and refraining from commenting during the exam unless you discover a problem. Periodically write the time remaining on the board. Be alert for cheating, but do not hover over the students and cause a distraction. There are also some steps that you can take to reduce the anxiety that students will inevitably feel leading up to and during an exam. Consider the following:
- Have old exams on file in the department office for students to review.
- Give students practice exams prior to the real test.
- Explain, in advance of the test day, the exam format and rules, and explain how this fits with your philosophy of testing.
- Give students tips on how to study for and take the exam; this is not a test of their test-taking ability, but rather of their knowledge, so help them learn to take tests.
- Have extra office hours and a review session before the test.
- Arrive at the exam site early, and be there yourself (rather than sending a proxy) to communicate the importance of the event.

Recommendations For Improving Test Scores:


1) When a test is announced well in advance, do not wait until the day before to begin studying; spaced practice is more effective than massed practice.
2) Ask the instructor for old copies of the examination to practice with.
3) Ask other students what kinds of tests the instructor usually gives.
4) Don't turn study sessions into social occasions; isolated studying is usually more effective.

5) Don't be too comfortable when studying; lying down is a physical cue for your body to sleep.
6) Study for the type of test which was announced.
7) If you do not know the type (style) of test, study for a free-recall exam.
8) Ask yourself questions about the subject material, read for detail, and recite the material just prior to the test.
9) Try to form the material you are studying into test questions.
10) Read the test directions carefully before beginning the exam. Ask the administrator if anything is unclear or some details are not included.
11) If it is an essay test, think about the question and mentally formulate an answer before you begin writing.
12) Pace yourself while taking the test. Do not try to be the first person finished. Allow enough time to review answers at the end of the session.
13) If you can rule out one wrong answer choice, guess, even if there is a penalty for wrong answers.
14) Skip more difficult items and return to them later, particularly if there are a lot of questions.
15) When time permits, review your answers. Don't be overly eager to hand in your test paper before all the available time has elapsed.

Scoring The Test



The principles of evaluation should be followed in scoring the test; this enhances the objectivity and reliability of the test.
Reliability: the degree of accuracy and consistency with which an exam or test measures what it seeks to measure; the degree of consistency among test scores. A test score is called reliable when we have reasons for believing it to be stable and trustworthy.
Objectivity: a test is objective when the scorer's personal judgment does not affect the scoring. It eliminates the fixed opinion or judgment of the person who scores it. Objectivity is the extent to which independent and competent examiners agree on what constitutes a good answer for each of the elements of a measuring instrument.
Selection-type items

- Prepare stencils when useful.
- When using a stencil with holes, make sure that students have marked only one alternative.
- When a response is wrong, put a red mark through the correct answer.
- Apply the correction-for-guessing formula only when a test is speeded (a sketch of the usual correction follows this list).
- Weight all items the same (doing otherwise seldom makes a difference and only confuses scoring).
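The list above mentions a formula for guessing without stating it. A commonly used correction, assumed here rather than taken from this text, subtracts a fraction of the wrong answers: corrected score = R - W/(k - 1), where R is the number right, W the number wrong, and k the number of options per item. A minimal sketch in Python:

# Conventional correction-for-guessing formula (an assumption of this sketch;
# the text above only says when such a formula should be applied).
def corrected_score(rights, wrongs, options_per_item):
    # Rights minus a penalty for wrongs; omitted items are not penalized.
    return rights - wrongs / (options_per_item - 1)

# Example: 40 right and 8 wrong on a four-option test.
print(corrected_score(40, 8, 4))   # -> 37.33...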

Supply-type items
- Use your carefully developed rubrics.

Utilizing Rubrics as Assessment Tools


What is a rubric? A rubric is a scoring and instructional tool used to assess student performance using a task-specific range or set of criteria. It measures student performance against this pre-determined set of criteria for the task, with levels of performance (i.e. from poor to excellent) described for each criterion. Most rubrics are designed as a one- or two-page document formatted with a table or grid that outlines the learning criteria for a specific lesson, assignment or project.

Rubrics can be created in a variety of forms and levels of complexity, but they all:
- Focus on measuring a stated objective (performance, behavior, or quality)
- Use a range to rate performance
- Contain specific performance characteristics arranged in levels indicating the degree to which a standard has been met.
Two major types of rubrics:
A holistic rubric involves one global, holistic rating, with a single score for an entire product or performance based on an overall impression. These are useful for summative assessment where an overall performance rating is needed, for example, portfolios. A holistic rubric requires the teacher to score the overall process or product as a whole, without judging the component parts separately.
An analytical rubric divides a product or performance into essential traits that are judged separately. Analytical rubrics are usually more useful for day-to-day classroom use since they provide more detailed and precise feedback to the student. With an analytical rubric, the teacher scores the separate, individual parts of the product or performance first, then sums the individual scores to obtain a total score.

Assessing student learning


Rubrics provide instructors with an effective means of learning-centered feedback and evaluation of student work. As instructional tools, rubrics enable students to gauge the strengths and weaknesses of their work and learning. As assessment tools, rubrics enable faculty to provide detailed and informative evaluations of students' work.

Advantages of using rubrics:


- They allow assessment to be more objective and consistent.
- They clarify the instructor's criteria in specific terms.
- They clearly show students how their work will be evaluated and what is expected.
- They promote awareness of the criteria to use when students assess peer performance.
- They provide benchmarks against which to measure progress.
- They reduce the amount of time teachers spend evaluating student work by allowing them to simply circle an item in the rubric.

- They increase students' sense of responsibility for their own work.
Steps of creating rubrics:
1. Define your assignment or project. This is the task you are asking your students to perform.
2. Decide on a scale of performance. These can be a level for each grade (A-F) or three levels (outstanding, acceptable, not acceptable; or Great job, Okay, What happened?). These are listed at the top of the grid.
3. Identify the criteria of the task. These are the observable and measurable characteristics of the task. They are listed in the left-hand column. They can be weighted to convey the relative importance of each.
4. Describe the performance at each level for each criterion. These descriptors indicate what performance looks like at each level. They offer specific feedback. Use samples of student work to help you determine quality work.
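To make the weighted-criteria idea concrete, the sketch below shows one way an analytic rubric could be represented and totalled in Python. The criteria, weights, and level labels are hypothetical examples, not a prescribed rubric.

# A minimal sketch of analytic rubric scoring: each criterion is rated
# separately and the weighted ratings are summed, as described above.
LEVELS = {"outstanding": 3, "acceptable": 2, "not acceptable": 1}

RUBRIC_WEIGHTS = {              # criterion -> weight (relative importance)
    "content accuracy": 2.0,
    "organization": 1.0,
    "use of evidence": 1.5,
}

def analytic_score(ratings):
    # ratings maps each criterion to a level label; returns the weighted total.
    return sum(RUBRIC_WEIGHTS[criterion] * LEVELS[level]
               for criterion, level in ratings.items())

print(analytic_score({"content accuracy": "outstanding",
                      "organization": "acceptable",
                      "use of evidence": "acceptable"}))   # -> 11.0

A holistic rubric, by contrast, would record a single overall level rather than summing the parts.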

Suggestions for use:


- Hand out the rubric with the assignment.
- Return the rubric with the performance descriptors circled.
- Have students develop their own rubrics for a project.
- Have students use the rubric for self-assessment or peer assessment.


Methods Of Scoring In Standardized Tests


Different tests use different methods of scoring based on different needs. The following summarizes the three main categories of test scores: 1. Raw scores, 2. Criterion-referenced scores, 3. Norm-referenced scores (how most standardized tests are scored).

Raw score
  How the score is determined: by counting the number (or calculating a percentage) of correct responses or points earned.
  Uses: often used in teacher-developed assessment instruments.
  Potential drawbacks: scores may be difficult to interpret without knowledge of how performance relates to either a specific criterion or a norm group.

Criterion-referenced score
  How the score is determined: by comparing performance to one or more criteria or standards for success.
  Uses: useful when determining whether specific instructional objectives have been achieved; also useful when determining if basic skills that are prerequisites for other tasks have been learned.
  Potential drawbacks: criteria for assessing mastery of complex skills may be difficult to identify.

Age or Grade Equivalent (norm-referenced)
  How the score is determined: by equating a student's test performance to the average performance of students at a particular age or grade level.
  Uses: useful when explaining norm-referenced test performance to people unfamiliar with standard scores.
  Potential drawbacks: scores are frequently misinterpreted, especially by parents; scores may be inappropriately used as a standard that all students must meet; scores are often inapplicable when achievement at the secondary level or higher is being assessed; they do not give a typical range of performance for students at that age or grade.

Percentile Rank (norm-referenced)
  How the score is determined: by determining the percentage of students at the same age or grade level who obtained lower scores.
  Uses: useful when explaining norm-referenced test performance to people unfamiliar with standard scores.
  Potential drawbacks: scores overestimate differences near the mean and underestimate differences at the extremes.

Standard Score (norm-referenced)
  How the score is determined: by determining how far the performance is from the mean (for the age or grade level) in standard deviation units.
  Uses: useful when describing a student's standing within the norm group.
  Potential drawbacks: scores are not easily understood by people without some knowledge of statistics.

Standard Score
Definition

- Standard score: how far above or below average a student scored.
- The distance is calculated in standard deviation (SD) units (a standard deviation is a measure of spread or variability).
- The mean and standard deviation are those of a particular norm group.
Standard scores are by far the most complicated of the five types of scores, so they deserve a more in-depth look. When looking at the normal distribution, a line is drawn from the highest point on the curve to the x-axis; this point is the mean score. A standard deviation's worth is counted out on each side of the mean and those points are marked. Another standard deviation is counted out and two more points are marked. When the normal distribution is divided up this way, you will always get the same percentage of students scoring in each part. About 68% will score within one standard deviation of the mean (34% in each direction). As you move further from the mean, fewer and fewer students will perform at these scores. A standard score simply tells us where a student scores in relation to this normal distribution, in standard deviation units.

Advantages
Based on the normal curve, which means that:
1. Scores are distributed symmetrically around the mean (average).
2. Each SD represents a fixed (but different) percentage of cases.
3. Almost everyone is included between -3.0 and +3.0 SDs of the mean.
4. The SD allows conversion of very different kinds of raw scores to a common scale that has (a) equal units and (b) can be readily interpreted in terms of the normal curve.
5. When we can assume that scores follow a normal curve (classroom tests usually don't, but standardized tests do), we can translate standard scores into percentiles, which is very useful.

Types of Standard Score


All Standard Scores

- Share a common logic
- Can be translated into each other

Z-Score

- The simplest standard score, and the one on which all others are based.
- Formula: z = (X - M)/SD, where X is the person's score, M is the group's average, and SD is the group's spread (the standard deviation of the scores).
- z is negative for scores that are below average, so z-scores are usually converted into some other system that has all positive numbers.

T-Score
First a z-score is computed. Then a T-score with a mean of 50 and a standard deviation of 10 is applied. T-scores are whole numbers and are never negative.

- Normally distributed standard scores with M = 50, SD = 10.
- Can be obtained from z-scores: T = 50 + 10(z)
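Both formulas translate directly into code. The sketch below (Python) computes z-scores for a small set of raw scores and converts them to T-scores; the raw scores are made-up examples.

# z = (X - M) / SD and T = 50 + 10z, applied to hypothetical class scores.
from statistics import mean, pstdev

raw_scores = [62, 75, 81, 58, 90, 70]
M, SD = mean(raw_scores), pstdev(raw_scores)   # group average and spread

for X in raw_scores:
    z = (X - M) / SD        # distance from the mean in SD units
    T = 50 + 10 * z         # same information on a 50/10 scale, never negative
    print(f"raw={X}  z={z:+.2f}  T={T:.1f}")
# The same z also feeds the IQ-metric standard age scores described later
# (e.g., 100 + 15z on the Wechsler scale).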

Normalized Standard Scores


- Start with scores that you want to make conform to the normal curve.
- Get percentile ranks for each score.
- Transform the percentiles into z-scores using a conversion table (I handed one out in class).
- Then transform into any other standard score you want (e.g., T-scores, IQ equivalents).
- Hope that your assumption was right, namely, that the scores really do naturally follow a normal curve. If they don't, your interpretations (say, of equal units) may be somewhat mistaken.

Stanines

- Very simple type of normalized standard score.
- Ranges from 1-9 (the "standard nines").
- Each stanine from 2-8 covers half an SD.
- Stanine 5 = percentiles 40-59 (the middle 20 percent).
- A difference of 2 stanines usually signals a real difference.
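Because stanines are defined by fixed slices of the normal curve, a stanine can be looked up directly from a percentile rank. The cut points used below follow the conventional stanine percentages (4, 7, 12, 17, 20, 17, 12, 7, 4); those exact figures are a standard convention assumed for this sketch rather than stated in the text above.

# A minimal sketch (Python): assigning a stanine (1-9) from a percentile rank.
STANINE_UPPER_PERCENTILES = [4, 11, 23, 40, 60, 77, 89, 96, 100]

def stanine(percentile_rank):
    # Return the stanine whose band contains the given percentile rank.
    for s, upper in enumerate(STANINE_UPPER_PERCENTILES, start=1):
        if percentile_rank <= upper:
            return s
    return 9

print(stanine(50))   # -> 5, the middle stanine
print(stanine(95))   # -> 8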

Strengths
1. Easily explained to students and parents.
2. Normalized, so can compare different tests.
3. Can add stanines to get a composite score.
4. Easily recorded (only one column).

Limitations
1. Like all standard scores, cannot record growth.
2. Crude, but prevents over-interpretation.

IQ Scores
Tests that measure intelligence have a mean of 100 and (for the most part) a standard deviation of 15. Most people will score between 85 and 115. Someone who scores below 70 is typically considered to have an intellectual disability.

Normal-Curve Equivalents (NCE)

- Normally distributed standard scores with M = 50, SD = 21.06.
- Results in scores that go from 1-99.
- Like percentiles, except that they have equal units (this means that they make fewer distinctions in the middle of the curve and more at the extremes).

Standard Age Scores (SAS)


- Normally distributed standard scores put into an IQ metric, where M = 100 and SD = 15 (Wechsler IQ Test) or SD = 16 (Stanford-Binet IQ Test).

Converting among Standard Scores


Easy Convertibility

- All are different ways of saying the same thing.
- All represent equal units at different ranges of scores.
- All can be averaged (among themselves).
- One can easily be converted into another; Figure 19.2 on p. 494 shows how they line up with each other.
- But they are interpretable only when scores are actually normally distributed (standardized test scores usually are).
- Downside: not as easily understood by students and parents as are percentiles.

Using Standard Scores to Examine Profiles


Uses

- You can compare a student's scores on different tests and subtests when you convert all the scores to the same type of standard score, but all the tests must use the same norm group.
- Plotting profiles can show relative strengths and weaknesses.
- Scores should be plotted as confidence bands to illustrate the margin of error; interpret scores as different only when their bands do not overlap.
- Profiles are sometimes plotted separately by male and female (say, on vocational interest tests), but this is a controversial practice.

Tests sometimes come with tabular or narrative reports of profiles

Using Standard Scores to Examine Mastery of Skill Types

Some standardized tests try to provide some criterion-referenced information by providing scores on specific sets of skills (see Figure 19.4 on p. 498). Be very cautious with these; use them as clues only, because each skill area typically has very few items.

Cautions in Interpreting Standardized Test Scores


Scores should be interpreted:
1. With clear knowledge about what the test measures. Don't rely on titles; examine the content (breadth, etc.).
2. In light of other factors (aptitudes, educational experiences, cultural background, health, motivation, etc.) that may have affected test performance.
3. According to the type of decision being made (high or low for what?).
4. As a band of scores rather than a specific value. Always subtract and add 1 SEM from the score to get a range and avoid over-interpretation.
5. In light of all your evidence. Look for corroborating or conflicting evidence.
6. Never rely on a single score to make a big decision.


Marking Versus Grading


Giving marks and grades as a response to students' work is part of a teacher's routine work. Marking refers to assigning marks or points to students' performance against a marking scheme set for a test or an assignment. More often than not, marking and scoring are regarded as part of the normal practice of grading. Brookhart (2004) defines grading as scoring or rating individual assignments. Grading attaches meaning to the score, telling us whether expectations have been exceeded, met or not met. In relation to marking and grading of assessments, the University of Greenwich makes the following helpful points:
1. Assessment is a matter of judgment, not simply computation.
2. Marks and grades are not absolute values, but symbols used by examiners to communicate their judgment of a student's work.
3. Marks and grades provide data for decisions about students' fulfillment of learning outcomes.
Marking and Grading Criteria: Higher education institutions normally use an institution-wide grading scale for undergraduate programmes, whereas postgraduate programmes tend to be graded on a pass/fail basis or a pass/fail/distinction basis. Grading scales tend to incorporate both percentage and letter grading, the latter meaning letters such as A, B, C, etc. The grading scale used at the University of Greenwich is shown below:

Mark on 100 scale    Comments
70+                  Work of exceptional quality
60-69                Work of very good quality
50-59                Work of good quality
40-49                Work of satisfactory standard
30-39                Compensatable fail
0-29                 Fail

Undergraduate grading scales are likely to be similar in other higher education institutions. It is interesting to compare this scale with the percentage equivalents for the class of honours degree:

0-30%          Fail
35-39%         Pass degree
40-49%         Third class honours
50-59%         Lower second class honours
60-69%         Upper second class honours
70% or more    First class honours
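Band scales like the two above are simple range lookups and are easy to encode. The sketch below (Python) maps a percentage mark to the honours classification listed above; the function name is invented, and because the printed scale leaves 31-34% undefined, the sketch treats anything below 35 as a fail (an assumption).

# A minimal sketch mapping a mark on the 100-point scale to the honours
# classification above.
HONOURS_BANDS = [
    (70, "First class honours"),
    (60, "Upper second class honours"),
    (50, "Lower second class honours"),
    (40, "Third class honours"),
    (35, "Pass degree"),
    (0,  "Fail"),   # the 31-34% gap in the printed scale is treated as Fail here
]

def classify(mark):
    # Return the classification for the first band whose lower bound the mark meets.
    for lower_bound, label in HONOURS_BANDS:
        if mark >= lower_bound:
            return label

print(classify(68))   # -> Upper second class honours
print(classify(33))   # -> Fail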

Marks or grades are assigned to students' essays to indicate the degree of achievement they have attained, and there are two systems for assigning grades. Absolute grading gives the student marks for her essay answer depending on how well the essay has met the assessment criteria, and is usually expressed as a percentage or letter, e.g. 60% or B. Relative grading tells the student how his essay answer rated in relation to other students doing the same test, by indicating whether he was average, above average, or below average. Relative grading usually uses a letter scale such as A, B, C, D and F. Some teachers would argue that two grades are the best way of marking, so that students are given either a pass or a fail grade.

This gets over the problem of deciding what constitutes an A or a C grade, but it does reduce the information conveyed by a particular grade, since no discrimination is made between students who pass with a very high level of achievement and those who barely pass at all.
Common methods of grading
a. Letter grades: there is great flexibility in the number of grades that can be adopted, i.e. 3-11. However, 3-point scales may not differentiate well between students of different abilities, and 11-point scales make too fine distinctions and can introduce arbitrariness. The most common scales have 7 and 5 points.
Example of a 7-point grading scale:
O - Outstanding
A - Very good
B - Good
C - Average
D - Below average
E - Poor
F - Very poor
Example of a 5-point grading scale:
A+ - Excellent
A - Good
B - Average
C - Satisfactory
D - Fail
Strengths
- Easy to use.
- Easy to interpret theoretically.
- Provide a concise summary.
Limitations
- Meaning of grades may vary widely.
- Do not describe strengths/weaknesses of students.
b. Number/percentage grades (5, 3, 2, 1, 0) or (98%, 80%, 60%, etc.): the same as letter grades; the only difference is that numbers or percentages are used instead of letters.
Strengths
- Easy to use.
- Easy to interpret theoretically.
- Provide a concise summary.
- May be combined with letter grades.
- More continuous than letter grades.
Limitations
- Meaning of grades may vary widely.
- Do not describe strengths/weaknesses of students.
- Meaning may need to be explained or interpreted.

c. Two-category grades (pass/fail): good for courses that require mastery of learning.



Limitations
- Less reliable.
- Does not contain enough information about students' achievement.
- Provides no indication of the level of learning.
d. Checklists and rating scales: they are more detailed, and since they are so detailed they are cumbersome for teachers to prepare.
Strengths
- Present detailed lists of students' achievements.
- Can be combined with letter grades.
- Good for clinical evaluation.
Limitations
- May become too detailed to easily comprehend.
- Difficult for record keeping.
Uses of grading
1. Describe unambiguously the worth, merit or value of the work accomplished. Grades are intended to communicate the achievement of students.
2. Grades motivate students to learn.
3. Provide information to students for self-evaluation and for analysis of strengths and weaknesses.
4. Grades communicate performance levels to others.
5. Grades help in selecting people for rewards.

6. Communicate the teacher's judgment of the student's progress.
Analytical method of marking (marking scheme)
When using absolute grading against specific criteria, it is useful to use the analytic method of marking. In this method, a marking scheme is prepared in advance and marks are allocated to the specific points of content in the marking specification. It is often difficult to decide how many marks should be given to a particular aspect, but the relative importance of each should be reflected in the allocation. This method has the advantage that it can be more reliable, provided the marker is conscientious, and it will bring to light any errors in the writing of the question before the test is administered.
Global method of marking (structured impressionistic marking)
The global method is also termed structured impressionistic marking, and is best used with relative grading. This method still requires a marking specification, but in this case it serves only as a standard of comparison. The grades used are not usually percentages but a scale, such as excellent/good/average/below average/unsatisfactory; scales can be devised according to preference, but it is important to select examples of answers that serve as standards for each of the points on the scale. The teacher then reads each answer through very quickly and puts it in the appropriate pile, depending on whether it gives the impression of excellent, good, etc. The process is then repeated, and it is much more effective if a colleague is asked to do the second reading. This method is much faster than the analytical one and can be quite effective for large numbers of questions.
Uses of marking

Marking has two distinct stakeholders, the student and the tutor. Both should use marking as a means of raising achievement and attainment.
From the tutor's perspective, marking should:
- Check student understanding.
- Direct future lesson planning and teaching.
- Monitor progress through the collection of marks.
- Help to assess student progress and attainment.
- Set work of appropriate levels.
- Keep the objectives of what and how you teach clear.
- Inform students and parents formatively and summatively.
From the students' perspective, marking should and could help them:
- Identify carelessness.
- Proof-read, i.e. by making them check their work for spelling, punctuation, etc.
- Draft work: students can become actively involved in improving their own work.
- Identify areas of weakness and strength.
- Identify areas that lack understanding and knowledge.
- Become more motivated and value their work.


Scoring Essay Questions


- Prepare an outline of the expected answer in advance.
- Use the scoring method which is most appropriate.
  o Point method: each answer is compared to the ideal answer in the scoring key, and a given number of points is assigned in terms of the adequacy of the answer.
  o Rating method: where the rating method is used, it is desirable to make separate ratings for each characteristic evaluated. That is, answers should be rated separately for organization, comprehensiveness, relevance of ideas, and the like.
- Decide on provisions for handling factors which are irrelevant to the learning outcomes being measured.
  o Legibility of handwriting, spelling, sentence structure, punctuation and neatness: special efforts should be made to keep such factors from influencing our judgment.
- Evaluate all answers to one question before going on to the next question.
  o The halo effect is less likely to form when the answers for a given pupil are not evaluated in continuous sequence.
- Evaluate the answers without looking at the pupil's name.
- If especially important decisions are to be based on the results, obtain two or more independent ratings.
Methods in scoring essay tests
It is critical that the teacher prepare, in advance, a detailed ideal answer. Student papers should be scored anonymously, and all answers to a given item should be scored one at a time, rather than grading each student's test as a whole.
Distracting factors in scoring essay tests
- Handwriting style
- Grammar
- Knowledge of the students

- Neatness
Two ways of scoring essay tests
1. Holistic scoring: in this type, a total score is assigned to each essay item based on the teacher's general impression or overall assessment.
2. Analytic scoring: in this type, the essay is scored in terms of each component.

Disadvantages in scoring essay tests


Carryover effect
A carryover effect occurs when the teacher develops an impression of the quality of the answer from one item and carries it over to the next response. If the student answers one item well, the teacher may be influenced to score subsequent responses at a similarly high level; the same situation may occur with a poor response.
Halo effect
There may be a tendency in evaluating essay items to be influenced by a general impression of the student or feelings about the student, either positive or negative, that create a halo effect when judging the quality of the answers. For instance, the teacher may hold favorable opinions about the student from class or clinical practice and believe that this learner has made significant improvement in the course, which in turn might influence the scoring of the responses.

Scoring Guidelines
These are the descriptions of scoring criteria that the trained readers will follow to determine the score (1-6) for your essay. Papers at each level exhibit all or most of the characteristics described at each score point. Score = 6


Essays within this score range demonstrate effective skill in responding to the task. The essay shows a clear understanding of the task. The essay takes a position on the issue and may offer a critical context for discussion. The essay addresses complexity by examining different perspectives on the issue, or by evaluating the implications and/or complications of the issue, or by fully responding to counterarguments to the writer's position. Development of ideas is ample, specific, and logical. Most ideas are fully elaborated. A clear focus on the specific issue in the prompt is maintained. The organization of the essay is clear: the organization may be somewhat predictable or it may grow from the writer's purpose. Ideas are logically sequenced. Most transitions reflect the writer's logic and are usually integrated into the essay. The introduction and conclusion are effective, clear, and well developed. The essay shows a good command of language. Sentences are varied and word choice is varied and precise. There are few, if any, errors to distract the reader. Score = 5 Essays within this score range demonstrate competent skill in responding to the task. The essay shows a clear understanding of the task. The essay takes a position on the issue and may offer a broad context for discussion. The essay shows recognition of complexity by partially evaluating the implications and/or complications of the issue, or by responding to counterarguments to the writer's position. Development of ideas is specific and logical. Most ideas are elaborated, with clear movement between general statements and specific reasons, examples, and details. Focus on the specific issue in the prompt is maintained. The

organization of the essay is clear, although it may be predictable. Ideas are logically sequenced, although simple and obvious transitions may be used. The introduction and conclusion are clear and generally well developed. Language is competent. Sentences are somewhat varied and word choice is sometimes varied and precise. There may be a few errors, but they are rarely distracting. Score = 4 Essays within this score range demonstrate adequate skill in responding to the task. The essay shows an understanding of the task. The essay takes a position on the issue and may offer some context for discussion. The essay may show some recognition of complexity by providing some response to counterarguments to the writer's position. Development of ideas is adequate, with some movement between general statements and specific reasons, examples, and details. Focus on the specific issue in the prompt is maintained throughout most of the essay. The organization of the essay is apparent but predictable. Some evidence of logical sequencing of ideas is apparent, although most transitions are simple and obvious. The introduction and conclusion are clear and somewhat developed. Language is adequate, with some sentence variety and appropriate word choice. There may be some distracting errors, but they do not impede understanding. Score = 3 Essays within this score range demonstrate some developing skill in responding to the task. The essay shows some understanding of the task. The essay takes a position on the issue but does not offer a context for discussion. The essay may acknowledge

a counterargument to the writer's position, but its development is brief or unclear. Development of ideas is limited and may be repetitious, with little, if any, movement between general statements and specific reasons, examples, and details. Focus on the general topic is maintained, but focus on the specific issue in the prompt may not be maintained. The organization of the essay is simple. Ideas are logically grouped within parts of the essay, but there is little or no evidence of logical sequencing of ideas. Transitions, if used, are simple and obvious. An introduction and conclusion are clearly discernible but underdeveloped. Language shows a basic control. Sentences show a little variety and word choice is appropriate. Errors may be distracting and may occasionally impede understanding. Score = 2 Essays within this score range demonstrate inconsistent or weak skill in responding to the task. The essay shows a weak understanding of the task. The essay may not take a position on the issue, or the essay may take a position but fail to convey reasons to support that position, or the essay may take a position but fail to maintain a stance. There is little or no recognition of a counterargument to the writer's position. The essay is thinly developed. If examples are given, they are general and may not be clearly relevant. The essay may include extensive repetition of the writer's ideas or of ideas in the prompt. Focus on the general topic is maintained, but focus on the specific issue in the prompt may not be maintained. There is some indication of an organizational structure, and some logical grouping of ideas within parts of the essay is apparent. Transitions, if used, are simple and obvious, and they may be inappropriate or misleading. An introduction and conclusion are discernible but minimal. Sentence structure and

word choice are usually simple. Errors may be frequently distracting and may sometimes impede understanding. Score = 1 Essays within this score range show little or no skill in responding to the task. The essay shows little or no understanding of the task. If the essay takes a position, it fails to convey reasons to support that position. The essay is minimally developed. The essay may include excessive repetition of the writer's ideas or of ideas in the prompt. Focus on the general topic is usually maintained, but focus on the specific issue in the prompt may not be maintained. There is little or no evidence of an organizational structure or of the logical grouping of ideas. Transitions are rarely used. If present, an introduction and conclusion are minimal. Sentence structure and word choice are simple. Errors may be frequently distracting and may significantly impede understanding. No Score: Blank, Off-Topic, Illegible, Not in English, or Void.
Guidelines in scoring essay tests to avoid subjectivity
- Decide what factors constitute a good answer before administering an essay question. Explain these factors in the item.
- Read all the answers to a single essay question before reading other questions.
- Reread essay answers a second time after the initial scoring.


Scoring Objective Items


Following are the methods of scoring objective items:
- Scoring key
- Strip key
- Scoring stencil
Scoring key: if the pupils' answers are recorded on the test paper itself, a scoring key is usually obtained by marking the correct answers on a blank copy of the test. The scoring procedure is then simply a matter of comparing the columns of answers on this master copy with the columns of answers on each pupil's paper.
Strip key: a strip key, which consists merely of strips of paper on which columns of answers are recorded, may also be used.
Scoring stencil: where separate answer sheets are used, a scoring stencil is most convenient. This is a blank answer sheet with holes punched where the correct answers should appear.
One of the most important advantages of objective-type tests is ease and accuracy of scoring. The best way to score objective tests is with a test scanner. This technology can speed up scoring and minimize scoring errors. When using a test scanner, a scoring key is prepared on a machine-scorable answer sheet and read by the scanner first. After the scanner reads the scoring key, the student responses are read and stored on the hard disk of an attached computer. A separate program is used to score the student responses by comparing each response to the correct answer on the answer key. When this process is complete, each student's score, along with item analysis information, is printed.
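The comparison step that scanner software performs is easy to illustrate. The sketch below (Python) scores recorded responses against an answer key and would serve equally well for hand scoring with a key; the key, the student responses, and the variable names are invented examples, not the format of any particular scanner.

# A minimal sketch of scoring objective items against a scoring key.
ANSWER_KEY = ["B", "D", "A", "C", "B"]

def raw_score(responses):
    # Count the responses that match the answer key (one point each).
    return sum(1 for given, correct in zip(responses, ANSWER_KEY) if given == correct)

students = {
    "student_01": ["B", "D", "A", "A", "B"],
    "student_02": ["C", "D", "A", "C", "A"],
}

for name, responses in students.items():
    print(name, raw_score(responses))   # -> 4 and 3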


Item Analysis
The procedure used to judge the quality of an item is called item analysis. Item analysis is a post-administration examination of a test. The quality of a test depends upon the individual items of the test, so it is usually desirable to evaluate the effectiveness of each item. Item analysis provides information concerning how well each item in the test functions and tells us about the quality of an item. One primary goal of item analysis is to help improve the test by revising or discarding ineffective items. Another important function is to ascertain what test-takers do and do not know. Item analysis describes the statistical analyses which allow measurement of the effectiveness of individual test items. An understanding of the factors which govern effectiveness (and a means of measuring them) can enable us to create more effective test questions and also to regulate and standardize existing tests. Item analysis helps to find out how difficult each test item is. Similarly, it also helps to show how well an item discriminates between high and low scorers on the test. Item analysis further helps to detect specific technical flaws, and thus provides further information for improving test items and for ascertaining whether the questions/items do their job effectively. A detailed test and item analysis has to be done before a meaningful and scientific inference about the test can be made in terms of its validity, reliability, objectivity and usability. A systematic analysis aims at finding the performance of a group in terms of:


- The central tendency of the marks obtained, e.g. normal/average, positive or negative skewness, high or low values.
- The variability, characterized by the standard deviation (SD), which indicates the nature of the spread of marks: the greater the spread, the greater the value of the standard deviation.
- The coefficient of reliability for the test, indicating the degree of consistency with which the test has measured the students' abilities. A high value means that the test is reliable and produces virtually repeatable scores for the students.
Item analysis is useful in making meaningful interpretations and value judgments about students' performance. A teacher or paper setter comes to know whether the items had the right level of difficulty and whether there was discrimination between more able and less able students. Item analysis also:
o Defines and maintains standards of performance and ensures comparability of standards,
o Helps us to understand the behavior of items,
o Helps us to become better item writers and more scientific, professional and competent teachers.
Item analysis is a process of examining class-wide performance on individual test items. There are three common types of item analysis, which provide teachers with three different types of information:

Difficulty Index - Teachers produce a difficulty index for a test item by calculating the proportion of students in class who got an item correct. (The name of this index is counter-intuitive, as one actually gets a measure of how easy the item is, not the difficulty of the item.) The larger the proportion, the more students have learned the content measured by the item.

Discrimination Index - The discrimination index is a basic measure of the validity of an item. It is a measure of an item's ability to discriminate between those who scored high on the total test and those who scored low. Though there are several steps in its calculation, once computed, this index can be interpreted as an indication of the extent to which overall knowledge of the content area or mastery of the skills is related to the response on an item. Perhaps the most crucial validity standard for a test item is that whether a student gets the item correct should depend on his or her level of knowledge or ability, and not on something else such as chance or test bias.

Analysis of Response Options - In addition to examining the performance of an entire test item, teachers are often interested in examining the performance of individual distractors (incorrect answer options) on multiple-choice items. By calculating the proportion of students who chose each answer option, teachers can identify which distractors are "working" and appear attractive to students who do not know the correct answer, and which distractors are simply taking up space and not being chosen by many students. To eliminate blind guessing which results in a correct answer purely by chance (which hurts the validity of a test item), teachers want as many plausible distractors as is feasible. Analyses of response options allow teachers to fine tune and improve items they may wish to use again with future classes.
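Counting how many students chose each option is the whole computation behind an analysis of response options. The sketch below (Python) tallies choices for a single item; the response letters and the 5% "rarely chosen" threshold are illustrative assumptions.

# A minimal sketch of distractor analysis for one multiple-choice item.
from collections import Counter

responses = list("BBADBCBBDBABBBCB")   # one letter per student (hypothetical)
correct = "B"

counts = Counter(responses)
n = len(responses)
for option in "ABCD":
    share = counts[option] / n
    note = " (keyed answer)" if option == correct else ""
    if option != correct and share < 0.05:
        note = " <- rarely chosen; this distractor may not be working"
    print(f"{option}: {share:.0%}{note}")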


Steps involved in Item Analysis


- For each item, count the number of students in each group who answered the item correctly. For alternate-response items, count the number of students in each group who chose each alternative.
- Award a score to each student. A practical, simple and rapid method is to perforate, on your copy of the answer sheet, the boxes corresponding to the correct answers; by placing the perforated sheet on the student's answer sheet, the raw score can be found almost automatically.

- Ranking in order of merit and identifying high and low groups: arrange the answer sheets from the highest score to the lowest score. Make two groups, i.e., highest scores in one group and lowest scores in the other group (or top and bottom halves).

Calculation of difficulty index of a question


- For each item, the percentage of students who got the item correct is computed; this is called the item difficulty index.
1. D = (R/N) * 100
   R: number of pupils who answered the item correctly.
   N: total number of pupils who attempted the item.


The higher the difficulty index, the easier the item. The difficulty level (facility level) of a test is an index of how easy or difficult the test is; it is the ratio of the average score of a sample of subjects on the test to the maximum possible score on the test, usually expressed as a percentage.
2. Difficulty level = (average score on the test / maximum possible score) * 100
3. Difficulty index = ((H + L)/N) * 100
   H: number of correct answers in the high group.
   L: number of correct answers in the low group.
   N: total number of students in both groups.
4. Find out the facility value of objective tests first.
5. Facility value = (number of students answering the question correctly / number of students who have taken the test) * 100
If the facility value is 70 and above, those are easy questions; if it is below 70, the questions are difficult ones.
Estimating the Discrimination Index (DI)
The discriminating power (validity index) of an item refers to the degree to which a given item discriminates among students who differ sharply in the functions measured by the test as a whole.
Formula 1: DI = (RU - RL) / (N/2)
   RU: number of correct responses from the upper group.
   RL: number of correct responses from the lower group.


   N: total number of pupils who attempted the item.
Questions with high discrimination values are needed for selection purposes.
Formula 2: DI = (No. of HAQ - No. of LAQ) / No. of HAG
   No. of HAQ: number of students in the high-ability group answering the question correctly.
   No. of LAQ: number of students in the low-ability group answering the question correctly.
   No. of HAG: number of students in the high-ability group.

Positive discrimination: if an item is answered correctly by the superior (upper) group but not answered correctly by the inferior (lower) group, the item possesses positive discrimination.
Negative discrimination: an item answered correctly by the inferior (lower) group but not answered correctly by the superior (upper) group possesses negative discrimination.
Zero discrimination: if an item is answered correctly by the same number of superior and inferior examinees, the item cannot discriminate between superior and inferior examinees; thus, the discrimination power of the item is zero.
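Formula 1 above, together with difficulty index 3 (which uses the same upper and lower groups), can be computed directly from the group counts. The sketch below (Python) does both; the counts in the example are hypothetical.

# Group-based difficulty and discrimination indices from the formulas above:
#   difficulty index     = ((H + L)/N) * 100
#   discrimination index = (RU - RL) / (N/2)
# H/RU and L/RL are correct answers in the upper and lower groups,
# and N is the total number of students in both groups.
def difficulty_index(upper_correct, lower_correct, group_size):
    n = 2 * group_size
    return (upper_correct + lower_correct) / n * 100

def discrimination_index(upper_correct, lower_correct, group_size):
    return (upper_correct - lower_correct) / group_size

# Hypothetical item: 30 students per group; 24 upper-group and 12 lower-group
# students answered it correctly.
print(difficulty_index(24, 12, 30))      # -> 60.0 (a moderately easy item)
print(discrimination_index(24, 12, 30))  # -> 0.4  (positive discrimination)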


Item analysis is a general term that refers to the specific methods used in education to evaluate test items, typically for the purpose of test construction and revision. Regarded as one of the most important aspects of test construction and increasingly receiving attention, it is an approach incorporated into item response theory (IRT), which serves as an alternative to classical measurement theory (CMT) or classical test theory (CTT). Classical measurement theory considers a score to be the direct result of a person's true score plus error. It is this error that is of interest, as previous measurement theories have been unable to specify its source. However, item response theory uses item analysis to differentiate between types of error in order to gain a clearer understanding of any existing deficiencies. Particular attention is given to individual test items, item characteristics, the probability of answering items correctly, the overall ability of the test taker, and the degrees or levels of knowledge being assessed. The Purpose Of Item Analysis There must be a match between what is taught and what is assessed. However, there must also be an effort to test for more complex levels of understanding, with care taken to avoid over-sampling items that assess only basic levels of knowledge. Tests that are too difficult (and have an insufficient floor) tend to lead to frustration and deflated scores, whereas tests that are too easy (and have an insufficient ceiling) encourage a decline in motivation and lead to inflated scores.

Tests can be improved by maintaining and developing a pool of valid items from which future tests can be drawn and that cover a reasonable span of difficulty levels. Item analysis helps improve test items and identify unfair or biased items. Results should be used to refine test item wording. In addition, closer examination of items will also reveal which questions were most difficult, perhaps indicating a concept that needs to be taught more thoroughly. If a particular distracter (that is, an incorrect answer choice) is the most often chosen answer, and especially if that distracter positively correlates with a high total score, the item must be examined more closely for correctness. This situation also provides an opportunity to identify and examine common misconceptions among students about a particular concept. In general, once test items have been created, the value of these items can be systematically assessed using several methods representative of item analysis: a) a test item's level of difficulty, b) an item's capacity to discriminate, and c) the item characteristic curve.

Difficulty is assessed by examining the number of persons correctly endorsing the answer. Discrimination can be examined by comparing the number of persons getting a particular item correct with the total test score. Finally, the item characteristic curve can be used to plot the likelihood of answering correctly against the level of success on the test.
Using Item Analysis Results
Item analysis helps judge the worth or quality of a test.

It also:
- Aids in subsequent test revisions.
- Leads to increased skill in test construction.
- Provides diagnostic value and helps in planning future learning activities.
- Provides a basis for discussing test results.
- Supports decisions about the promotion of students to the next higher grade.
- Helps bring about improvement in teaching methods and techniques.
Item Difficulty
Perhaps item difficulty should have been named item easiness; it expresses the proportion or percentage of students who answered the item correctly. Item difficulty can range from 0.0 (none of the students answered the item correctly) to 1.0 (all of the students answered the item correctly). Experts recommend that the average level of difficulty for a four-option multiple-choice test should be between 60% and 80%; an average level of difficulty within this range can be obtained, of course, even when the difficulty of some individual items falls outside this range. If an item has a low difficulty value, say less than .25, there are several possible causes: the item may have been miskeyed; the item may be too challenging relative to the overall level of ability of the class; the item may be ambiguous or not written clearly; or there may be more than one correct answer. Further insight into the cause of a low difficulty value can often be gained by examining the percentage of students who chose each response option. For example, when a high percentage of students chose a single option other than the one that is keyed as correct, it is advisable to check whether a mistake was made on the answer key.
Item Statistics
Item statistics are used to assess the performance of individual test items, on the assumption that the overall quality of a test derives from the quality of its items.

Item Number. This is the question number taken from the student answer sheet. Up to 150 items can be scored on the Standard Answer Sheet (purple).

Mean and S.D. The mean is the "average" student response to an item. It is computed by adding up the number of points earned by all students for the item, and dividing that total by the number of students. The standard deviation, or S.D., is a measure of the dispersion of student scores on that item, that is, it indicates how "spread out" the responses were. The item standard deviation is most meaningful when comparing items which have more than one correct alternative and when scale scoring is used. For this reason it is not typically used to evaluate classroom tests.

Item Difficulty. For items with one correct alternative worth a single point, the item difficulty is simply the percentage of students who answer an item correctly. In this case, it is also equal to the item mean. The item difficulty index ranges from 0 to 100; the higher the value, the easier the question. When an alternative is worth other than a single point, or when there is more than one correct alternative per question, the item difficulty is the average score on that item divided by the highest number of points for any one alternative.

Item difficulty is relevant for determining whether students have learned the concept being tested. It also plays an important role in the ability of an item to discriminate between students who know the tested material and those who do not. The item will have low discrimination if it is so difficult that almost everyone gets it wrong or guesses, or so easy that almost everyone gets it right.

To maximize item discrimination, desirable difficulty levels are slightly higher than midway between chance and perfect scores for the item. (The chance score for five-option questions, for example, is .20 because one-fifth of the students responding to the question could be expected to choose the correct option by guessing.) Ideal difficulty levels for multiple-choice items in terms of discrimination potential are:

Format: Ideal Difficulty
Five-response multiple-choice: 70
Four-response multiple-choice: 74
Three-response multiple-choice: 77
True-false (two-response multiple-choice): 85

Item difficulty is commonly classified as "easy" if the index is 85% or above, "moderate" if it is between 51% and 84%, and "hard" if it is 50% or below.
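The chance score and the midpoint between chance and a perfect score can be worked out directly, as in the hypothetical sketch below; note that the published ideal values above sit slightly higher than this simple midpoint.

```python
# Sketch: chance score and the chance-to-perfect midpoint for k-option items.
def chance_and_midpoint(n_options):
    chance = 1 / n_options           # expected proportion correct from pure guessing
    midpoint = (chance + 1.0) / 2    # halfway between chance and a perfect score
    return chance, midpoint

for k in (5, 4, 3, 2):
    chance, midpoint = chance_and_midpoint(k)
    print(f"{k}-option item: chance = {chance:.2f}, midpoint = {midpoint:.2f}")
```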


Item Discrimination
Item discrimination refers to the ability of an item to differentiate among students on the basis of how well they know the material being tested. Various hand-calculation procedures have traditionally been used to compare item responses to total test scores using high and low scoring groups of students. Computerized analyses provide a more accurate assessment of the discriminating power of items because they take into account the responses of all students rather than just those of high and low scoring groups.

The item discrimination index is the correlation between student responses to a particular item and total scores on all other items on the test; in this application it is equivalent to a point-biserial coefficient. It provides an estimate of the degree to which an individual item is measuring the same thing as the rest of the items. Because the discrimination index reflects the degree to which an item and the test as a whole are measuring a unitary ability or attribute, values of the coefficient will tend to be lower for tests measuring a wide range of content areas than for more homogeneous tests. Item discrimination indices must always be interpreted in the context of the type of test which is being analyzed.

Items with low discrimination indices are often ambiguously worded and should be examined. Items with negative indices should be examined to determine why a negative value was obtained. For example, a negative value may indicate that the item was miskeyed, so that students who knew the material tended to choose an unkeyed, but correct, response option. Tests with high internal consistency consist of items with mostly positive relationships with total test score. In practice, values of the discrimination index will seldom exceed .50 because of the differing shapes of item and total score distributions. Item discrimination is commonly classified as "good" if the index is above .30, "fair" if it is between .10 and .30, and "poor" if it is below .10.
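A hypothetical computation of this index is sketched below: the point-biserial (Pearson) correlation between scores on one item and total scores on all other items, classified with the cut-offs just given. The data and function names are invented for the example.

```python
# Sketch: discrimination index as the correlation between an item and the
# rest of the test (item excluded from the total); sample data are made up.
from statistics import mean, pstdev

def discrimination_index(item_scores, total_scores):
    rest = [t - i for i, t in zip(item_scores, total_scores)]  # total minus this item
    mi, mr = mean(item_scores), mean(rest)
    si, sr = pstdev(item_scores), pstdev(rest)
    if si == 0 or sr == 0:
        return 0.0                                             # no variability: undefined
    cov = mean([(i - mi) * (r - mr) for i, r in zip(item_scores, rest)])
    return cov / (si * sr)

def classify(d):
    return "good" if d > 0.30 else "fair" if d >= 0.10 else "poor"

item = [1, 0, 1, 1, 0, 1, 0, 1]
total = [9, 4, 8, 7, 5, 9, 3, 6]     # total test scores, including this item
d = discrimination_index(item, total)
print(round(d, 2), classify(d))      # roughly 0.79, "good"
```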

Alternate Weight. This column shows the number of points given for each response alternative. For most tests, there will be one correct answer which will be given one point, but ScorePak allows multiple correct alternatives, each of which may be assigned a different weight.

Means. The mean total test score (minus that item) is shown for students who selected each of the possible response alternatives. This information should be looked at in conjunction with the discrimination index; higher total test scores should be obtained by students choosing the correct, or most highly weighted alternative. Incorrect alternatives with relatively high means should be examined to determine why "better" students chose that particular alternative.


Frequencies and Distribution. The number and percentage of students who choose each alternative are reported. The bar graph on the right shows the percentage choosing each response. Frequently chosen wrong alternatives may indicate common misconceptions among the students.
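A small sketch of such a per-alternative summary is shown below, combining the frequencies with the mean rest-score (total minus the item) for students choosing each alternative; the option labels and scores are hypothetical.

```python
# Sketch: frequency, percentage, and mean total-minus-item score per alternative.
from collections import defaultdict

choices  = ["A", "B", "B", "C", "B", "A", "B", "D"]   # alternative chosen by each student
totals   = [5, 9, 8, 4, 7, 6, 9, 3]                   # total test scores
item_pts = [0, 1, 1, 0, 1, 0, 1, 0]                   # points earned on this item ("B" keyed correct)

rest_by_option = defaultdict(list)
for opt, tot, pts in zip(choices, totals, item_pts):
    rest_by_option[opt].append(tot - pts)             # total score minus this item

n = len(choices)
for opt in sorted(rest_by_option):
    rest = rest_by_option[opt]
    print(f"{opt}: n={len(rest)} ({100 * len(rest) / n:.0f}%), "
          f"mean rest-score = {sum(rest) / len(rest):.1f}")
```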

Difficulty and Discrimination Distributions
At the end of the Item Analysis report, test items are listed according to their degrees of difficulty (easy, medium, hard) and discrimination (good, fair, poor). These distributions provide a quick overview of the test and can be used to identify items which are not performing well and which can perhaps be improved or discarded.
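Using the difficulty and discrimination cut-offs quoted earlier, such distributions can be tallied as in the sketch below; the item statistics listed are invented.

```python
# Sketch: tallying items into difficulty and discrimination bands.
from collections import Counter

items = [(92, 0.05), (70, 0.42), (48, 0.28), (88, 0.35), (63, 0.12)]  # (difficulty %, discrimination)

def difficulty_band(p):
    return "easy" if p >= 85 else "medium" if p > 50 else "hard"

def discrimination_band(d):
    return "good" if d > 0.30 else "fair" if d >= 0.10 else "poor"

print(Counter(difficulty_band(p) for p, _ in items))
print(Counter(discrimination_band(d) for _, d in items))
```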

Test Statistics
Two statistics are provided to evaluate the performance of the test as a whole.

Reliability Coefficient. The reliability of a test refers to the extent to which the test is likely to produce consistent scores. The particular reliability coefficient reflects three characteristics of the test:

1. The intercorrelations among the items -- the greater the relative number of positive relationships, and the stronger those relationships are, the greater the reliability. Item discrimination indices and the test's reliability coefficient are related in this regard.

2. The length of the test -- a test with more items will have a higher reliability, all other things being equal.

3. The content of the test -- generally, the more diverse the subject matter tested and the testing techniques used, the lower the reliability.

Reliability coefficients theoretically range in value from zero (no reliability) to 1.00 (perfect reliability). In practice, their approximate range is from .50 to .90 for about 95% of classroom tests.

High reliability means that the questions of a test tended to "pull together." Students who answered a given question correctly were more likely to answer other questions correctly. If a parallel test were developed by using similar items, the relative scores of students would show little change. Low reliability means that the questions tended to be unrelated to each other in terms of who answered them correctly. The resulting test scores reflect peculiarities of the items or the testing situation more than students' knowledge of the subject matter.

As with many statistics, it is dangerous to interpret the magnitude of a reliability coefficient out of context. High reliability should be demanded in situations in which a single test score is used to make major decisions, such as professional licensure examinations. Because classroom examinations are typically combined with other scores to determine grades, the standards for a single test need not be as stringent. The following general guidelines can be used to interpret reliability coefficients for classroom exams:

Reliability: Interpretation
.90 and above: Excellent reliability; at the level of the best standardized tests.
.80 - .90: Very good for a classroom test.
.70 - .80: Good for a classroom test; in the range of most. There are probably a few items which could be improved.
.60 - .70: Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.
.50 - .60: Suggests need for revision of the test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
.50 or below: Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.
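These guidelines can be expressed as a simple lookup, as in the hypothetical sketch below.

```python
# Sketch: mapping a reliability coefficient to the guideline interpretations above.
def interpret_reliability(r):
    if r >= 0.90: return "Excellent; at the level of the best standardized tests"
    if r >= 0.80: return "Very good for a classroom test"
    if r >= 0.70: return "Good for a classroom test; a few items could probably be improved"
    if r >= 0.60: return "Somewhat low; supplement with other measures for grading"
    if r >= 0.50: return "Suggests need for revision, unless the test is quite short"
    return "Questionable reliability; should not contribute heavily to the grade"

print(interpret_reliability(0.76))
```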

The measure of reliability used here is coefficient alpha. This is the general form of the more commonly reported KR-20 and can be applied to tests composed of items with different numbers of points given for different response alternatives. When coefficient alpha is applied to tests in which each item has only one correct answer and all correct answers are worth the same number of points, the resulting coefficient is identical to KR-20.
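A minimal sketch of coefficient alpha is given below; with one-point, single-correct-answer items like these it reduces to KR-20. The score matrix is hypothetical.

```python
# Sketch: coefficient alpha for a small (hypothetical) student-by-item score matrix.
from statistics import pvariance

scores = [            # rows = students, columns = items (1 = correct, 0 = incorrect)
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
]

k = len(scores[0])                                                     # number of items
item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]  # variance of each item
total_var = pvariance([sum(row) for row in scores])                    # variance of total scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))   # about 0.73 for this sample matrix
```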

Standard Error of Measurement. The standard error of measurement is directly related to the reliability of the test. It is an index of the amount of variability in an individual student's performance due to random measurement error. If it were possible to administer an infinite number of parallel tests, a student's score would be expected to change from one administration to the next due to a number of factors. For each student, the scores would form a "normal" (bell-shaped) distribution. The mean of the distribution is assumed to be the student's "true score," and reflects what he or she "really" knows about the subject. The standard deviation of the distribution is called the standard error of measurement and reflects the amount of change in the student's score which could be expected from one test administration to another.

Whereas the reliability of a test always varies between 0.00 and 1.00, the standard error of measurement is expressed in the same scale as the test scores. For example, multiplying all test scores by a constant will multiply the standard error of measurement by that same constant, but will leave the reliability coefficient unchanged. A general rule of thumb to predict the amount of change which can be expected in individual test scores is to multiply the standard error of measurement by 1.5. Only rarely would one expect a student's score to increase or decrease by more than that amount between two such similar tests. The smaller the standard error of measurement, the more accurate the measurement provided by the test.
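The standard textbook relationship between the two statistics, SEM = SD × √(1 − reliability), together with the 1.5 × SEM rule of thumb mentioned above, is sketched below with made-up numbers; that formula is not stated in this handout and is included only as an assumption for illustration.

```python
# Sketch: standard error of measurement from test S.D. and reliability (hypothetical numbers).
import math

test_sd = 8.0          # standard deviation of the total test scores
reliability = 0.75     # reliability coefficient of the same test

sem = test_sd * math.sqrt(1 - reliability)   # expressed in the same scale as the test scores
print(sem)                                   # 4.0
print("score change rarely exceeds about", 1.5 * sem, "points between similar tests")
```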

A CAUTION in Interpreting Item Analysis Results
Each of the various item statistics provides information which can be used to improve individual test items and to increase the quality of the test as a whole. Such statistics must always be interpreted in the context of the type of test given and the individuals being tested. In particular, item analysis data are not synonymous with item validity, for the reasons outlined below.

1. An external criterion is required to accurately judge the validity of test items. By using the internal criterion of total test score, item analyses reflect the internal consistency of items rather than validity.
2. The discrimination index is not always a measure of item quality. There are a variety of reasons an item may have low discriminating power: (a) extremely difficult or easy items will have little ability to discriminate, but such items are often needed to adequately sample course content and objectives; (b) an item may show low discrimination if the test measures many different content areas and cognitive skills. For example, if the majority of the test measures "knowledge of facts," then an item assessing "ability to apply principles" may have a low correlation with total test score, yet both types of items are needed to measure attainment of course objectives.
3. Item analysis data are tentative. Such data are influenced by the type and number of students being tested, the instructional procedures employed, and chance errors. If repeated use of items is possible, statistics should be recorded for each administration of each item.


Summary
In the light of the above discussion, we have reviewed how to administer a test, various suggestions for administering it, the importance of test administration, and recommendations for improving test scores. We also learnt about scoring methods, various standard scores, and marking and grading criteria and their types, and discussed the scoring of essay and objective tests. Finally, we took a detailed look at item analysis, item difficulty, and their uses.

Conclusion
From the above discussion, it can be concluded that sound knowledge of good test administration practice and of the various methods of scoring helps to improve both student performance and the teacher's evaluation skills.



