et al. Chapter 5
Popham, Chapter 3
E & F, Chapter 5
M & L Chapter 12
1. Concepts of Reliability: Overview
2. Different Types of Reliability Estimates
3. Implications for Design of Assessments/Tests
(E&F, p80)
        Time 1   Time 2   Difference
Gerry     20       30        +10
Mary      30       30          0
John      50       40        -10
Kate      60       70        +10
Tom       65       70         +5
[Line charts of the Time 1 and Time 2 scores above for Gerry, Mary, John, Kate, and Tom: the differences are small and the rank order of the students is preserved from Time 1 to Time 2.]
        Time 1   Time 2   Difference
Gerry     20       70        +50
Mary      30       10        -20
John      50       90        +40
Kate      60       30        -30
Tom       65       15        -50
[Line charts of this second set of Time 1 and Time 2 scores: the lines cross sharply, and the rank order of the students at Time 2 bears little relation to Time 1.]
Both sets of observed scores contain error. The first pattern: Reliable! The second pattern: Unreliable!
All test scores contain a certain amount of error. Errors inflate or deflate the observed score, irrespective of the construct to be measured.
Observed test score = True score + Test score errors
Test score errors = Systematic errors + Random errors
(W&G, p59)
Systematic errors:
  Time of the test (e.g., Wed vs. Fri afternoon)
  Time limit
  Timing of notice (noticed vs. unnoticed)
  Reading ability
  Certain words
  Clues
  In-class vs. take-home project
  Parents' help

Random errors:
  Memory
  Variation in motivation
  Concentration from time to time
  Carelessness in marking answers
  Computational errors
  Luck in guessing
  Test anxiety
Theoretical Definition of Test Reliability
(E&F, p72)
Reliability reflects how large or small the amount of random error (measurement error) in the test score is.
Theoretically defined:
  Errors are random.
  Errors are uncorrelated with the true scores.
  Errors within a group cancel each other out, so the average error score is zero.
True scores and error scores are unknown.
(E&F, p72)
Point 1. Reliability is a measure of a test.
Point 2. Reliability estimates require two or more
independent measurements.
Point 3. Reliability indicates the degree of
agreement between the pairs of the scores on the
two measures for each person.
Point 4. A correlation coefficient is used to
estimate the degree of agreement.
Point 5. Reliability of a test is a sample-based
estimate.
Point 6. Reliability estimates make sense when the
items in a test are supposed to measure one
construct/domain.
Point 1. Reliability estimate (typically) is a
measure of a test (not an individual item)
Test scores
Item 1
Item 2
Item 3
Item 4
Item 5
Point 2. Reliability estimates require two or more
independent measurements, obtained from
equivalent tests on each individual of the group.
How high should reliability be? It depends on the purpose (research vs. classroom assessment) and on theoretical expectations of the content being measured:
  IQ tests
  Achievement tests: multiple-choice
  Achievement tests: performance-based
  Noncognitive characteristics (e.g., motivation, attitudes)
Expressed in correlation coefficients: the square of a correlation coefficient gives the percentage of shared true variance.
  Reliability of .90: the two measures have 81% in common; they share just over 4/5 of their variance, and about 1/5 (roughly 20%) is error component.
  Reliability of .80: the repeated measures have 64% of common variance (36% error component: more than 1/3).
  Reliability of .70: the repeated measures have 49% of common variance (51% error component: more than half).
(E&F, p71)
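The shared-variance arithmetic above can be checked in a few lines; a minimal sketch (plain Python, illustrative only; `shared_variance` is a hypothetical helper name), squaring the reliability coefficient as the slides do:

```python
# Shared true variance implied by a reliability coefficient,
# following the slide's rule: r^2 = proportion of shared variance.
def shared_variance(r):
    """Return (shared %, error %) for a reliability coefficient r."""
    shared = r ** 2
    return round(shared * 100), round((1 - shared) * 100)

for r in (0.90, 0.80, 0.70):
    shared, error = shared_variance(r)
    print(f"r = {r:.2f}: {shared}% shared variance, {error}% error")
```

Running this reproduces the three cases listed above (81/19, 64/36, and 49/51).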
Point 5. Reliability of a test is a sample-based
estimate. It is an estimate based on a particular
group of examinees. Reliability will vary depending
on the group who provided the test scores!
Implication:
Reliability of the same test can vary.
The variation can be due to differences in the
relationship between the item difficulty and the
group's ability level.
Item difficulty for a particular group of test-takers: The
more appropriate a test is to the level of ability in the
group, the higher the reliability of the scores it will
yield.
(E&F, p71)
Point 6. Reliability estimates make sense
when the items in a test are supposed to
measure one construct/domain.
The extent to which an assessment accurately
measures what it is designed to measure
(Trice, 2000 p. 29)
The extent to which the scores obtained by
administering a psychometric instrument are
consistent and thereby relatively free from
errors of measurement (Aiken, 2002, p. 42)
The extent to which error is eliminated from
the assessment process (Trice, 2000 p. 29)
When a student receives a score on a test, we
should be able to expect that the score accurately
represents the student's performance level.
We expect a certain amount of variation in test performance, influenced by:
  (1) TIME: from one time to another (test-retest);
  (2) A SAMPLE OF TASKS: from one sample of items to another (alternate forms);
  (3) RATERS: from one rater to another (inter-rater);
  (4) PARTS OF THE TEST: from one part of the same test to another (internal consistency or split-halves).
(W&G, p42, 59)
Influenced by TIME: test-retest reliability
Influenced by A SAMPLE OF TASKS: equivalent forms method
Influenced by PARTS OF THE TESTS: internal consistency / split-half method
This is a true definition of reliability.
Giving the same test a few days apart: the correlation between the first set of scores and the second set is the test-retest reliability estimate.
Obtain the scores on repeated measures for the same individuals; evaluate how consistently the same individuals perform on the same set of items on different occasions.
Test A
= Test A
(E&F, p75)
A correlation higher than .90 is considered excellent for either standardized tests or classroom-assessment tests.
A correlation of about .80 or higher is expected for standardized tests.
A correlation of about .70 or higher is expected for classroom-assessment tests.
If the interval is long, errors of measurement may get confused with real changes.
When reporting test-retest reliability estimates, include the time interval between the two administrations (e.g., 3 months, 60 days, 5 days).
Administering the same test twice is not appealing to most students or teachers, including lack of student interest (psychological resistance).
Students' answers to the second test are not independent of their answers to the first test:
  possibly through student discussion;
  influenced by recall: at the second administration, students might remember what they answered the first time;
  practice effect.
Test A vs. Test B: equivalent forms reliability (also called alternate forms reliability or parallel forms reliability).
Test A divided into Test A-1 and Test A-2: split halves.
Split-halves reliability estimate: approximates the reliability of the whole test.
Because the correlation is based on only half of the total test, a correction is made with the Spearman-Brown formula.
Spearman-Brown Reliability.
(W&G, p62)
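The Spearman-Brown correction can be sketched as follows (illustrative Python; `spearman_brown` is a hypothetical helper name). For two halves, the full-test estimate is 2r / (1 + r):

```python
# Spearman-Brown prophecy formula: estimate the reliability of a test
# k times as long from the reliability (correlation) of one part.
def spearman_brown(r_part, k=2):
    """For split halves, k=2: full-test reliability from the half-test correlation."""
    return k * r_part / (1 + (k - 1) * r_part)

# A half-test correlation of .60 implies a full-test reliability of .75:
print(round(spearman_brown(0.60), 2))  # 0.75
```

Note how the corrected value is always higher than the half-test correlation, since a longer test samples the domain more thoroughly.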
Cronbach, L. J. (1951). Coefficient alpha and the
internal structure of tests. Psychometrika, 16, 297-334.
Cronbach, L. J. (1984). Essentials of psychological testing
(4th ed.). New York: Harper & Row.
Cronbach, Lee
(1916-2001)
Stanford University
Internal consistency methods require that the test be administered only once: a single test, a single administration!
Internal consistency: the items within Test A (Item 1, Item 2, Item 3, Item 4) serve as the repeated measures.
Compare:
  Test A vs. Test B: equivalent forms reliability
  Test A-1 vs. Test A-2: split-halves reliability
Internal Consistency does not use correlation
coefficients.
Coefficient alpha (Cronbach's alpha): Internal
consistency is typically expressed by Coefficient
alpha.
(E&F, p78)
Coefficient alpha (internal consistency):
  .9 or above    Excellent (high-stakes testing)
  .8 to < .9     Good (low-stakes testing)
  .7 to < .8     Acceptable (surveys)
  .6 to < .7     Questionable
  .5 to < .6     Poor
  below .5       Unacceptable
          Item 1  Item 2  Item 3  Item 4  Item 5  Average
Tom          2       2       2       2       2       2
Mary         3       3       3       3       3       3
Jane         4       4       4       4       4       4
Jessica      2       2       2       2       2       2
George       5       5       5       5       5       5
Coefficient alpha = 1

          Item 1  Item 2  Item 3  Item 4  Item 5  Average
Tom         20      20      20      20      20      20
Mary         3       3       3       3       3       3
Jane        40      40      40      40      40      40
Jessica      2       2       2       2       2       2
George      50      50      50      50      50      50
Coefficient alpha = 1
          Item 1  Item 2  Item 3  Item 4  Item 5  Total
Tom          2       4       3       5       2      16
Mary         2       4       3       5       2      16
Jane         2       4       3       5       2      16
Jessica      2       4       3       5       2      16
George       2       4       3       5       2      16

          Item 1  Item 2  Item 3  Item 4  Item 5  Total
Tom          2       4       3       1       2      12
Mary         2       5       3       2       2      14
Jane         2       4       2       3       2      13
Jessica      2       5       3       3       2      15
George       3       4       3       3       3      16
Variance    0.2     0.3     0.2     0.8     0.2     2.5
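From the variances in the last table, coefficient alpha follows directly from the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). A sketch in plain Python using the slide's data (sample variances, n-1 denominator):

```python
# Coefficient alpha computed from the item-score table above.
scores = {
    "Tom":     [2, 4, 3, 1, 2],
    "Mary":    [2, 5, 3, 2, 2],
    "Jane":    [2, 4, 2, 3, 2],
    "Jessica": [2, 5, 3, 3, 2],
    "George":  [3, 4, 3, 3, 3],
}

def sample_var(xs):
    """Sample variance with an n-1 denominator, matching the table."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

k = 5  # number of items
items = list(zip(*scores.values()))            # scores regrouped by item
totals = [sum(row) for row in scores.values()] # each student's total score
item_vars = [sample_var(col) for col in items] # 0.2, 0.3, 0.2, 0.8, 0.2
alpha = (k / (k - 1)) * (1 - sum(item_vars) / sample_var(totals))
print(round(alpha, 2))
```

With the item variances summing to 1.7 and a total-score variance of 2.5, this gives alpha = 1.25 * (1 - 0.68) = 0.40.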
Comparison of methods:
  Test-retest: re-test within a short period; issues: resistance, practice effects; shows stability over time.
  Equivalent forms: requires alternate forms; issue: adequacy of the alternate forms; shows stability across forms. (Equivalent forms given a few days apart: stability over time & forms.)
  Internal consistency: most popular; assumes the items measure one dimension; K-R 20, K-R 21; shows consistency within one administration.
  Split-half: dividing the test into halves (random halves); corrected with the Spearman-Brown formula.
(Gipps, p58)
Inter-rater reliability is used when assessment results are produced by raters, experts, teachers, etc.
It is typically based on agreement between the raters. Also called reader reliability (not test-taker reliability or test reliability).
Essays, Writing, Speaking, Performance Assessment
Assessment results made dichotomously: e.g., when
decisions are made whether a student is a master vs.
non-master or pass vs. fail.
(E&F, p73; Gipps, p71; W&G, p63-64)
Reliability (as inter-rater agreement) is considered in relation to the correctness of classification (e.g., how consistently the raters will classify masters and non-masters).
Expressed by:
  percentages of agreement (classified correctly) between (two or more) raters;
  correlations between multiple sets of ratings by different experts for a single set of tests.
Out of 40 students:

                      Rater 1
                 Pass    Fail    Total
Rater 2  Pass     25       5      30
         Fail      2       8      10
         Total    27      13      40
Out of 40 students:

              Rater 1
          1    2    3    4   Total
Rater 2
    1     8    2    .    .    10
    2     .    7    3    .    10
    3     .    .    9    1    10
    4     .    .    2    8    10
  Total   8    9   14    9    40

% Consistency = Percentage of Agreement = ((8 + 7 + 9 + 8) / 40) × 100% = 80% (W&G, p65-66)
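The percentage-of-agreement computation can be sketched in plain Python (illustrative only; the matrix reconstructs the crosstab above, with zeros for the blank cells):

```python
# Exact percent agreement: sum of the diagonal over the total count.
# Rows are Rater 2's categories, columns are Rater 1's.
table = [
    [8, 2, 0, 0],
    [0, 7, 3, 0],
    [0, 0, 9, 1],
    [0, 0, 2, 8],
]

total = sum(sum(row) for row in table)                 # 40 students
agree = sum(table[i][i] for i in range(len(table)))    # 8 + 7 + 9 + 8 = 32
print(f"{agree / total:.0%}")
```

The diagonal cells are the students both raters placed in the same category, so 32 of 40, or 80%.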
Out of 40 students:

              Rater 1
          1    2    3    4   Total
Rater 2
    1     8    1    1    .    10
    2     2    7    1    .    10
    3     .    1    9    .    10
    4     1    1    .    8    10
  Total  11   10   11    8    40

% Consistency = Percentage of Agreement = ((8 + 7 + 9 + 8) / 40) × 100 = 80% (W&G, p66)
Out of 40 students:

              Rater 1
          1    2    3    4   Total
Rater 2
    1     8    1    1    .    10
    2     2    7    1    .    10
    3     .    1    9    .    10
    4     1    1    .    8    10
  Total  11   10   11    8    40

% Consistency, now counting agreement within one scale point (adjacent categories included):
((8 + 1 + 2 + 7 + 1 + 1 + 9 + 8) / 40) × 100 = 92.5%, about 93% (W&G, p66)
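Counting agreement within one scale point can be sketched the same way (illustrative Python, assuming the reconstruction of the crosstab above, zeros for blanks):

```python
# "Adjacent agreement": count cells whose row and column categories
# differ by at most one scale point, then divide by the total.
table = [
    [8, 1, 1, 0],
    [2, 7, 1, 0],
    [0, 1, 9, 0],
    [1, 1, 0, 8],
]

n = len(table)
total = sum(sum(row) for row in table)
exact = sum(table[i][i] for i in range(n))
adjacent = sum(table[i][j] for i in range(n) for j in range(n)
               if abs(i - j) <= 1)
print(f"exact: {exact / total:.0%}, within one point: {adjacent / total:.1%}")
```

Exact agreement is still 32/40 (80%), but allowing a one-point tolerance picks up 5 more pairs, giving 37/40 (92.5%, about 93%).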
Inter-rater consistency shows the consistency of rating.
It also shows:
  the spread of disagreement;
  evidence of any rating/rater idiosyncrasies;
  whether a judge is consistently lenient or harsh;
  which judge is more lenient/harsh;
  where, and by which judge, the judgment is more lenient/harsh;
  the extent to which leniency/harshness accounts for the disagreements.
(W&G, p65-66)
[Slide residue: several 4×4 rater-by-rater crosstabs contrasting patterns of disagreement: tables where the off-diagonal counts fall mostly on one side of the diagonal (one rater consistently harsher or more lenient) versus tables where the disagreements are scattered with no systematic direction.]
Judge A: Pass 25 (50%), Fail 25 (50%). Judge B: Pass 30 (60%), Fail 20 (40%). n = 50.
The probability that both of them would say "Pass" randomly is 0.50 × 0.60 = 0.30.
The probability that both of them would say "Fail" randomly is 0.50 × 0.40 = 0.20.
Thus the overall probability of random agreement is Pr(e) = 0.30 + 0.20 = 0.50.
                       Judge B
                  Pass       Fail
Judge A  Pass      20          5      25 (50%)
         Fail      10         15      25 (50%)
                 30 (60%)   20 (40%)     50
http://en.wikipedia.org/wiki/Cohen's_kappa
Pr(e) (the probability of random agreement):
Judge A said "Yes" to 25 applicants and "No" to 25 applicants, i.e., "Yes" 50% of the time.
Judge B said "Yes" to 30 applicants and "No" to 20 applicants, i.e., "Yes" 60% of the time.
The probability that both of them would say "Yes" randomly is 0.50 × 0.60 = 0.30.
The probability that both of them would say "No" randomly is 0.50 × 0.40 = 0.20.
Thus the overall probability of random agreement is Pr(e) = 0.3 + 0.2 = 0.5.
Using the formula for Cohen's kappa: kappa = (Pr(a) - Pr(e)) / (1 - Pr(e)) = (0.70 - 0.50) / (1 - 0.50) = 0.40, where Pr(a) = (20 + 15) / 50 = 0.70 is the observed agreement.
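The kappa arithmetic above can be sketched in plain Python (illustrative only; the 2×2 table has Judge A in rows and Judge B in columns, taken from the slide):

```python
# Cohen's kappa: kappa = (Pr(a) - Pr(e)) / (1 - Pr(e)).
table = [
    [20, 5],   # Judge A Pass: Judge B Pass / Fail
    [10, 15],  # Judge A Fail: Judge B Pass / Fail
]

n = sum(sum(row) for row in table)                    # 50 applicants
pr_a = (table[0][0] + table[1][1]) / n                # observed agreement
a_pass = sum(table[0]) / n                            # Judge A's "Pass" rate
b_pass = (table[0][0] + table[1][0]) / n              # Judge B's "Pass" rate
pr_e = a_pass * b_pass + (1 - a_pass) * (1 - b_pass)  # chance agreement
kappa = (pr_a - pr_e) / (1 - pr_e)
print(round(kappa, 2))
```

With Pr(a) = 0.70 and Pr(e) = 0.50 this yields kappa = 0.40: well above chance, but far from perfect agreement.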
Assessing reliability of school marks:
  Inter-rater reliability: different markers score the same piece of work.
  Intra-rater reliability: the same marker scores the same pieces of work on different occasions.
(Gipps, p58)
Marking Reliability
  Double marking
  Control scripts
  Discrepancy check: a third marker is brought in (e.g., if the difference is 3 out of 5)
  Average of the ratings of three judges
  Agreement (e.g., if 2 out of 3 agree)
Subjectivity of the judgments
Assessment criteria being interpreted in the same way by all
judges/raters
Rater/marker bias (conscious or unconscious)
Expectations based on ability level, ethnic origin, social class
Skills, knowledge irrelevant to the construct of interest
Cultural differences in what constitutes competent
performance and what should be valued.
Variety and complexity of the marking scheme
Performance being evaluated according to the same rubric
and standards by all markers.
Structure in the question
Essay type questions vs. more structured questions.
(Gipps, p58, 144)
Inter-rater agreement: whether scores can be obtained in a consistent way, on a single instance of a task.
Generalizability: whether the performance outcome can be generalized across heterogeneous task domains.
1. Concepts of Reliability in Assessment/Tests
2. Different Types of Reliability Estimates
3. Factors influencing Reliability Estimates
4. Implications in Developing Assessment/Tests
1. Test-retest reliability
2. Equivalent (parallel) forms reliability
3. Based on Internal consistency: more practical,
requires only one single test, single administration
a) Coefficient alpha
b) Kuder-Richardson Formula 20/21
c) Split-half reliability
4. Rater-related reliability
a) Inter-rater reliability (agreement between raters on the
same assessment task)
b) Intra-rater reliability (agreement of the same rater's
judgments on different occasions)
(E&F, p73;Gipps, p57)