
Millar et al., Chapter 5
Popham, Chapter 3
E & F, Chapter 5
M & L, Chapter 12
1. Concepts of Reliability: Overview
2. Different Types of Reliability Estimates
3. Implications for Design of Assessments/Tests

How will you know if your assessments are reliable?


How can you demonstrate that your assessments are
reliable?
How do you develop reliable assessments?
Accuracy
When we give marks/test scores to students,
we should be able to say the scores
accurately represent the student's ability on
the matters being tested.
Consistency
How consistently examinees perform on the
same (or a similar) set of tasks.

(E&F, p80)
Time 1 Time 2 Difference

Gerry 20 30 +10
Mary 30 30 0
John 50 40 -10
Kate 60 70 +10
Tom 65 70 +5
[Bar chart: Time 1 and Time 2 scores for Gerry, Mary, John, Kate, and Tom]




[Line chart: each student's score plotted at Time 1 and Time 2; the students keep nearly the same rank order]
Time 1 Time 2 Difference

Gerry 20 70 +50
Mary 30 10 -20
John 50 90 +40
Kate 60 30 -30
Tom 65 15 -50
[Bar chart: Time 1 and Time 2 scores for Gerry, Mary, John, Kate, and Tom (inconsistent case)]




[Line chart: each student's score plotted at Time 1 and Time 2; the lines cross, showing inconsistent performance]
[Diagram: observed scores with a small error component (reliable) vs. observed scores with a large error component (unreliable)]
All test scores contain a certain amount of
error! Errors inflate or deflate the observed
score, irrespective of the construct being
measured.

Observed test score = True score + Test score errors
Test score errors = Systematic errors + Random errors
(W&G, p59)
Systematic errors:
Time of the test (e.g., Wed vs. Fri afternoon)
Time limit
Timing of notice (notice vs. unnoticed)
Reading ability
Certain words
Clues
In-class vs. take-home project
Parents' help

Random errors:
Memory
Variation in motivation
Concentration from time to time
Carelessness in marking answers
Computational errors
Luck in guessing
Test anxiety
Theoretical Definition of Test Reliability

Observed Score X = True Score + Error

Variance of Observed Score X =
Variance of True Score + Variance of Error

Reliability = Variance of True Score / Variance of Observed Scores
(E&F, p72)
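As an illustration only (not from the readings), the sketch below simulates Observed Score X = True Score + Error with arbitrary spreads (a true-score SD of 10 points and a random-error SD of 5 points) and recovers reliability as the ratio of true-score variance to observed-score variance.

import numpy as np

rng = np.random.default_rng(0)

n_students = 10_000
true_scores = rng.normal(loc=50, scale=10, size=n_students)  # assumed true-score spread (SD = 10)
random_error = rng.normal(loc=0, scale=5, size=n_students)   # assumed random error (SD = 5)
observed = true_scores + random_error                        # Observed Score X = True Score + Error

reliability = true_scores.var() / observed.var()
print(round(reliability, 2))  # close to 100 / (100 + 25) = 0.80

With these assumed spreads the ratio lands near 0.80; increasing the random-error spread pushes it down.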
Reliability reflects how large or small the amount of
random error (measurement error) in the test
scores is.

Theoretically defined:
Errors are random.
Errors are uncorrelated with the true scores.
Errors within a group cancel each other out, as
the average error score is zero.
True scores and error scores are unknown.
(E&F, p72)
Point 1. Reliability is a measure of a test.
Point 2. Reliability estimates require two or more
independent measurements.
Point 3. Reliability indicates the degree of
agreement between the pairs of scores on the
two measures for each person.
Point 4. A correlation coefficient is used to
estimate the degree of agreement.
Point 5. Reliability of a test is a sample-based
estimate.
Point 6. Reliability estimates make sense when the
items in a test are supposed to measure one
construct/domain.
Point 1. A reliability estimate is (typically) a
measure of a test (not of an individual item).

Item 1 + Item 2 + Item 3 + Item 4 + Item 5 → Test score

Point 2. Reliability estimates require two or more
independent measurements, obtained from
equivalent tests, for each individual in the group.

Test scores (first measurement) vs. Test scores (second measurement)

(E&F, p64, p71)


Point 3. Reliability indicates the degree of
agreement between the pairs of scores on the
two measures for each person.
Point 4. A correlation coefficient is used to estimate
the degree of agreement between the pairs of
scores on the two measures for each person.

The reliability coefficient is a correlation coefficient
indicating the correlation between two sets of
measurements taken for getting reliability estimates
(W&G, p.60).
The reliability coefficient is a correlation coefficient
between one set of test scores and another set of test
scores on an equivalent test, obtained independently
from the members of the same group (E&F, p. 71).
(E&F, p65-7)
[Scatterplots: relatively high correlation between X and Y vs. hardly any correlation]
Expressed in correlation coefficients: variance in
the ability of test-takers. The wider the range of
ability in the group, the higher the reliability.

How high should reliability be? It depends on the
consequences of the decisions based on the test results.

University entrance examination
High school certificate
Research purposes
Classroom assessment
How high should reliability be? It also depends on
theoretical expectations about the content being measured.

IQ tests
Achievement tests: multiple-choice
Achievement tests: performance-based
Noncognitive characteristics
(e.g., motivation, attitudes)
Expressed in correlation coefficients: the square
of a correlation coefficient gives the percentage of
shared true variance.
Reliability of .90 for a test:
The two measures have 81% of their variance in common,
sharing just over 4/5 of their variance;
about 1/5 (19%) is the error component.
Reliability of .80 for a test:
The repeated measures share 64% of common
variance (36% error component: the error
component is more than 1/3).
Reliability of .70 for a test:
The repeated measures share 49% of common
variance (51% error component: more than half is error).
(E&F, p71)
Point 5. The reliability of a test is a sample-based
estimate. It is an estimate based on a particular
group of examinees. Reliability will vary depending
on the group who provided the test scores!
Implication:
Reliability of the same test can vary.
The variation can be due to differences in the
relationship between the item difficulty and the
group's ability level.
Item difficulty for a particular group of test-takers: the
more appropriate a test is to the level of ability in the
group, the higher the reliability of the scores it will
yield.
(E&F, p71)
Point 6. Reliability estimates make sense
when the items in a test are supposed to
measure one construct/domain.
The extent to which an assessment accurately
measures what it is designed to measure
(Trice, 2000, p. 29)
The extent to which the scores obtained by
administering a psychometric instrument are
consistent and thereby relatively free from
errors of measurement (Aiken, 2002, p. 42)
The extent to which error is eliminated from
the assessment process (Trice, 2000, p. 29)
When a student receives a score on a test, we
should be able to expect that the score accurately
represents the student's performance level.
We expect a certain amount of variation in test
performance, influenced by:
TIME (1): from one time to another (test-retest);
A SAMPLE OF TASKS (2): from one sample of
items to another (alternate forms);
RATERS (3): from one rater to another (inter-rater);
PARTS OF THE TEST (4): from one part of the same
test to another (internal consistency or split-halves).
(W&G, p42, 59)
Influenced by TIME: Test-retest reliability
Influenced by A SAMPLE OF TASKS: Equivalent forms method
Influenced by PARTS OF THE TEST: Internal consistency / Split-half method
Test-retest reliability most directly reflects the definition of reliability.
Give the same test a few days apart.
The correlation between the first set of scores and the second set
is obtained as the test-retest reliability estimate.
Obtain the scores on repeated measures for the same
individuals; evaluate how consistently the same individuals
perform on the same set of items on different occasions.

Test A = Test A
(E&F, p75)
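A minimal sketch of the calculation, using the Time 1 / Time 2 scores from the first table above; the test-retest estimate is simply the Pearson correlation between the two administrations.

import numpy as np

# Time 1 and Time 2 scores for Gerry, Mary, John, Kate, and Tom (from the first table)
time1 = np.array([20, 30, 50, 60, 65])
time2 = np.array([30, 30, 40, 70, 70])

# The test-retest reliability estimate is the Pearson correlation between the two sets
r = np.corrcoef(time1, time2)[0, 1]
print(round(r, 2))   # about 0.91: the students keep nearly the same rank order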
Correlation higher than .90 is
considered excellent for either standardized
tests or classroom-assessment tests.
Correlation about .80 or higher is
expected in standardized tests.
Correlation about .70 or higher is
expected in classroom-assessment tests.
If the interval is long, errors of measurement may get
confused with real changes.
In reporting, include the time interval between the two
administrations when reporting test-retest reliability estimates
(e.g., 3 months, 60 days, 5 days, etc.).
Administering the same test twice is not appealing to most
students or teachers, including lack of student interest
(psychological resistance).
Students' answers to the second test are not independent
of their answers to the first test:
Possibly through student discussion;
Influenced by recall: at the second administration, students
might remember what they answered the first time;
Practice effect.

(E&F, p61, 75)


http://www.youtube.com/watch?v=9hq5jZrFTbE&list=PLDC28DFE1A0C1F79E (4:02)
Two equivalent forms of tests are used.
Involving two tests!!!
Equivalent forms reliability: the degree to which scores
on two tests (designed to be equivalent) are the same.
Correlation is typically used to calculate the degree of
agreement on two equivalent forms of tests.

Test A
Test B
Equivalent forms reliability
Alternate forms reliability
Parallel forms reliability

o Two independent samples of the same test, measuring
the same thing.
o Each item in one test has an equivalent question on the
other form.
o The two forms of the test should be constructed
independently.
Option 1: Administer equivalent forms of the test to the
same group, during the same testing session.
Test A → Test B

Option 2: Administer equivalent forms of the test to the
same group, a few days apart.
Test A → (a few days later) → Test B
This is the more demanding type of reliability estimate!
(W&G, p61)
Obstacles to using this method
Equivalent forms of a test are not always available.
Equivalent forms of a test are not always easy to
produce.
Preparing, developing, and proving equivalent forms of tests
is demanding (the requirements for adequacy of alternate forms are strict).
How do we know that two tests are actually equivalent?
In the classroom setting
Teachers don't have alternate forms of the same test.
It would be hard to persuade students to take similar
tests again (e.g., a few days apart).
(E&F, p76; Gipps, p57)
It requires only a single administration of a
single test. Split a test into two reasonably
equivalent halves, making two
independent test scores.

Test A → Test A-1 and Test A-2
(E&F, p76 ; W&G, p61)


Definition: Split-halves reliability shows the
degree to which two arbitrarily selected
halves of the same test provide the same
results.
It requires separately scoring two sets of
items from the same test and calculating a
correlation coefficient between the two sets of
scores.

(E&F, p76 ; W&G, p61)


Test A → Test A or Test B (two administrations)
Test A → Test A-1 and Test A-2 (one administration, split in half)
Split-halves reliability estimates approximate
the reliability of the whole test.
The correlation is based on only half of the total
test, so a correction is made with the Spearman-Brown
formula (Spearman-Brown reliability):

Full-test reliability = 2 × (half-test correlation) / (1 + half-test correlation)

This formula converts the half-test
correlation to the full-length correlation.
(E&F, p76 ; W&G, p62)
Let's say a reliability estimate from the two halves
of a test was .60. What is the reliability of
the total test?

2 × 0.6 / (1 + 0.6) = 0.75

The Spearman-Brown formula makes it clear that
increasing test length will increase the reliability
of a test.
The formula shows how much reliability will
increase when the test length is doubled.
(W&G, p62)
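A minimal sketch of the correction; the length_factor argument follows the general (prophecy) form of the Spearman-Brown formula, with a factor of 2 being the split-half case, and the printed values reproduce and extend the .60 to .75 example above.

def spearman_brown(r_half, length_factor=2):
    """Spearman-Brown formula: projected reliability when a test is lengthened
    by length_factor; a factor of 2 is the split-half correction."""
    return length_factor * r_half / (1 + (length_factor - 1) * r_half)

print(round(spearman_brown(0.60), 2))                    # 0.75, as in the worked example
print(round(spearman_brown(0.60, length_factor=3), 2))   # about 0.82 if the test were tripled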
Cronbach, L. J. (1951). Coefficient alpha and the
internal structure of tests. Psychometrika, 16, 297-334.
Cronbach, L. J. (1984). Essentials of psychological testing
(4th ed.). New York: Harper & Row.

Cronbach, Lee
(1916-2001)
Stanford University
Internal consistency methods require the
same test to be administered only once, involving
one test!

Test A: Item 1, Item 2, Item 3, Item 4

Internal consistency asks a question about
the relation of all the items to each other.
(Trice, p37)
Test-retest reliability: Test A → Test A
Equivalent forms reliability: Test A → Test B
Split-halves reliability: Test A → Test A-1 and Test A-2
Internal consistency does not use correlation
coefficients.
Coefficient alpha (Cronbach's alpha): internal
consistency is typically expressed by coefficient
alpha.
Technically, alpha is the average of all possible
split-half reliabilities.
Cronbach's alpha ranges from 0 to 1.

(E&F, p78)
Alpha coefficient      Internal consistency
α ≥ 0.9                Excellent (high-stakes testing)
0.8 ≤ α < 0.9          Good (low-stakes testing)
0.7 ≤ α < 0.8          Acceptable (surveys)
0.6 ≤ α < 0.7          Questionable
0.5 ≤ α < 0.6          Poor
α < 0.5                Unacceptable
Item 1 Item 2 Item 3 Item 4 Item 5 Average
Tom 2 2 2 2 2 2

Mary 3 3 3 3 3 3

Jane 4 4 4 4 4 4

Jessica 2 2 2 2 2 2

George 5 5 5 5 5 5

Coefficient alpha = 1
Item 1 Item 2 Item 3 Item 4 Item 5 Average
Tom 20 20 20 20 20 20

Mary 3 3 3 3 3 3

Jane 40 40 40 40 40 40

Jessica 2 2 2 2 2 2

George 50 50 50 50 50 50

Coefficient alpha = 1
Item 1 Item 2 Item 3 Item 4 Item 5 Total

Tom 2 4 3 5 2 16
Mary 2 4 3 5 2 16

Jane 2 4 3 5 2 16

Jessica 2 4 3 5 2 16

George 2 4 3 5 2 16

There is zero variance in the test.


Reliability cannot be calculated.
           Item 1   Item 2   Item 3   Item 4   Item 5   Total
Tom          2        4        3        1        2        12
Mary         2        5        3        2        2        14
Jane         2        4        2        3        2        13
Jessica      2        5        3        3        2        15
George       3        4        3        3        3        16
Variance    0.2      0.3      0.2      0.8      0.2       2.5

Coefficient alpha = 0.40


(E&F, p78)
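A minimal sketch of the alpha calculation for the table above, using sample variances (n - 1 denominator), which reproduces the item variances in the bottom row and the alpha of 0.40.

import numpy as np

# Item scores from the table above: rows = students, columns = items
scores = np.array([
    [2, 4, 3, 1, 2],   # Tom
    [2, 5, 3, 2, 2],   # Mary
    [2, 4, 2, 3, 2],   # Jane
    [2, 5, 3, 3, 2],   # Jessica
    [3, 4, 3, 3, 3],   # George
])

k = scores.shape[1]                          # number of items
item_vars = scores.var(axis=0, ddof=1)       # 0.2, 0.3, 0.2, 0.8, 0.2
total_var = scores.sum(axis=1).var(ddof=1)   # 2.5

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 2))   # 0.40 = 1.25 * (1 - 1.7 / 2.5)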
Variance of total scores
= item variance + item covariance
= how much the scores on the same item vary
across different individuals + how much the
scores on different items vary together within the
same individuals (across items).

Coefficient alpha is estimated directly from
the item variances and the item covariances in the
test.
Alpha is large when (the sum of) the item variances
(variance between individuals on the same item) is
small compared to the total variance in the test (which
is the sum of the item variances and the covariances
between different items).
A high alpha will be produced when the item variances are
small and the total variance and item covariances are
large:
When the total variance comes mostly from the item
covariances rather than the item variances;
When item variances are consistently small across different
items;
When the covariances between pairs of items are high;
When different items are highly related, whereas scores on the
same item do not vary much across individuals.
(E&F, p78-79)
For alpha reliability: the same person getting
the same score across the items in a test is
more important than similarity in the
scores across different people on the same
items in the test.
In that case, the individual items are consistently
telling the same story in terms of
individuals' score differences on the test as a
whole.
Assumptions underlying internal consistency:
Standardized testing mode. Test modes or
contexts should be singular. If classroom assessment
tasks contain a mix of modes and contexts, high internal
consistency cannot be expected.
Unidimensionality. Items in a test need to be
homogeneous and assess a unidimensional skill or
attribute.
Correlational analysis. If a test is designed in a way
that all students are expected to do well, then it is hard
to demonstrate high internal consistency due to the
limited range of scores (correlation techniques).
(Gipps, p58, 60)
Coefficient alpha provides a reliability estimate for a
measure composed of items scored with values other
than 0 and 1:
Essay tests with varying point values;
Attitude scales with more than 2 response options.

When alpha is used to estimate the reliability of a
test that is scored dichotomously (0 or 1), the result is
the same as the Kuder-Richardson Formula 20 (K-R20):
K-R20 is a special case of alpha.
Kuder-Richardson Formula 21 is just a simplified
version of Kuder-Richardson Formula 20, replacing
the proportions in K-R20 with the means in K-R21.
(E&F, p78)
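A minimal sketch (with made-up 0/1 scores, not from the readings) showing that K-R20, computed from the item proportions p and q = 1 - p, equals coefficient alpha when both use the same population (n-denominator) variances.

import numpy as np

# Hypothetical dichotomous (0/1) item scores: rows = students, columns = items
scores = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
])

k = scores.shape[1]
total_var = scores.sum(axis=1).var()    # population (n-denominator) variance of total scores

p = scores.mean(axis=0)                 # proportion of examinees answering each item correctly
kr20 = (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)

item_vars = scores.var(axis=0)          # for 0/1 items this equals p * (1 - p)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

print(round(kr20, 3), round(alpha, 3))  # same value: K-R20 is alpha for dichotomous items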
Test-retest reliability: repeat with the same respondents;
re-test within a short period; limitations: resistance and
practice effects.
Equivalent forms method: alternate forms required;
requirement of alternate forms; limitation: the adequacy of
the alternate forms and the items.
Internal consistency: Cronbach's alpha, the most popular
method; assumption of dimensionality; K-R20 and K-R21.
Split-half method: the measure is divided into two halves
(a random half of the test); Spearman-Brown formula.

From consistency to stability:
Internal consistency / Split-half (consistency)
Equivalent forms (stability across forms)
Test-retest with the same test (stability over time)
Equivalent forms a few days apart (stability over time & forms)

(Gipps, p58)
Inter-rater reliability is used when assessment
results are produced by raters, experts, teachers,
etc.
It is typically based on agreement between the
raters. Also called reader reliability (not test-taker
reliability or test reliability).
Essays, writing, speaking, performance assessment.
Assessment results made dichotomously: e.g., when
decisions are made about whether a student is a master vs.
non-master, or pass vs. fail.
(E&F, p73; Gipps, p71; W&G, p63-64)
Reliability (as inter-rater agreement) is
considered in relation to the correctness of
classification (e.g., how consistently the raters
will classify masters and non-masters).
Expressed by:
Percentages of agreement (classified correctly)
between (two or more) raters.
Correlations between multiple sets of ratings by
different experts for a single set of tests.

(E&F, p73; Gipps, p71; W&G, p63-64)


Rater 1, out of 40 students: Pass 27, Fail 13.
Rater 2, out of 40 students: Pass 30, Fail 10.

Out of 40 students              Rater 1
                          Pass    Fail    Total
Rater 2     Pass           25       5       30
            Fail            2       8       10
            Total          27      13       40

% Consistency = Percentage of Agreement
% Consistency = ((25 + 8) / 40) * 100 = 82.5%, about 83%
(W&G, p63-64)
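A minimal sketch reproducing the percentage-of-agreement calculation from the 2 x 2 table above; the diagonal cells are the cases on which the two raters agree.

import numpy as np

# Rows = Rater 2 (Pass, Fail), columns = Rater 1 (Pass, Fail), from the table above
table = np.array([
    [25, 5],
    [2, 8],
])

percent_agreement = table.trace() / table.sum() * 100   # agreements lie on the diagonal
print(round(percent_agreement, 1))   # 82.5, reported above as about 83%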
Out of 40 students                 Rater 1
                         1      2      3      4     Total
Rater 2        1         8      2      0      0      10
               2         0      7      3      0      10
               3         0      0      9      1      10
               4         0      0      2      8      10
               Total     8      9     14      9      40

% Consistency = Percentage of Agreement
% Consistency = ((8 + 7 + 9 + 8) / 40) * 100 = 80% (W&G, p65-66)
Out of 40 students                 Rater 1
                         1      2      3      4     Total
Rater 2        1         8      1      1      0      10
               2         2      7      1      0      10
               3         0      1      9      0      10
               4         1      1      0      8      10
               Total    11     10     11      8      40

% Consistency (exact agreement) = Percentage of Agreement
% Consistency = ((8 + 7 + 9 + 8) / 40) * 100 = 80% (W&G, p66)

% Consistency (agreement within one category) =
((8 + 1 + 2 + 7 + 1 + 1 + 9 + 8) / 40) * 100 = 92.5%, about 93% (W&G, p66)
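A minimal sketch computing both figures from the second 4 x 4 table above: exact agreement uses only the diagonal cells, and the 93% figure counts ratings that agree within one category (my reading of the cells summed in the slide).

import numpy as np

# Second 4 x 4 table above: rows = Rater 2, columns = Rater 1 (categories 1-4)
table = np.array([
    [8, 1, 1, 0],
    [2, 7, 1, 0],
    [0, 1, 9, 0],
    [1, 1, 0, 8],
])

n = table.sum()
exact = table.trace() / n * 100                                # diagonal cells only

rows, cols = np.indices(table.shape)
within_one = table[np.abs(rows - cols) <= 1].sum() / n * 100   # diagonal plus adjacent categories

print(round(exact, 1), round(within_one, 1))                   # 80.0 and 92.5 (about 93%)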
Inter-rater consistency shows the consistency
of the ratings.
It also shows:
The spread of disagreement;
Evidence of any rating/rater idiosyncrasies;
A judge being consistently lenient or harsh;
Which judge is more lenient/harsh;
Where the judgment is more lenient/harsh, and by which
judge;
The extent to which leniency/harshness accounts for the
disagreements.

(W&G, p65-66)
[Slides: the 4 x 4 rater-agreement tables above, repeated side by side in
several variants to compare where the agreements and disagreements fall
for the two raters.]
Judge A: Pass 25 (50%), Fail 25 (50%); total 50.
Judge B: Pass 30 (60%), Fail 20 (40%); total 50.

The probability that both of them would say "Yes" randomly is 0.50 × 0.60 = 0.30.
The probability that both of them would say "No" randomly is 0.50 × 0.40 = 0.20.
Thus the overall probability of random agreement is Pr(e) = 0.3 + 0.2 = 0.5.

Judge B
Pass Fail
Pass 20 5 25 (50%)
Judge A
10 15 25 (50%)
Fail

30 (60%) 20 (40%) 50
Judge B
Pass Fail
Pass
20 5 25 (50%)
Judge
A
Fail 10 15 25 (50%)

30 (60%) 20 (40%) 50

% Consistency = Percentage of Agreement


% Consistency = (20 + 15) / 50 = 0.70 (70%)
Using the formula for Cohen's Kappa:

Inter-rater reliability (kappa) =
(Percentage of Agreement - Random Agreement) / (1 - Random Agreement)

http://en.wikipedia.org/wiki/Cohen's_kappa

Pr(e) (the probability of random agreement):
Judge A: "Yes" to 25 applicants and "No" to 25
applicants; Judge A says "Yes" 50% of the time.
Judge B: "Yes" to 30 applicants and "No" to 20
applicants; Judge B says "Yes" 60% of the time.
The probability that both of them would say "Yes"
randomly is 0.50 × 0.60 = 0.30.
The probability that both of them would say "No" is
0.50 × 0.40 = 0.20.
Thus the overall probability of random agreement is
Pr(e) = 0.3 + 0.2 = 0.5.
Using the formula for Cohen's Kappa:
Kappa = (0.70 - 0.50) / (1 - 0.50) = 0.20 / 0.50 = 0.40

http://en.wikipedia.org/wiki/Cohen's_kappa
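A minimal sketch reproducing the kappa calculation from the Judge A x Judge B table above.

import numpy as np

# Rows = Judge A (Pass, Fail), columns = Judge B (Pass, Fail), from the table above
table = np.array([
    [20, 5],
    [10, 15],
])

n = table.sum()
p_observed = table.trace() / n                                # 35 / 50 = 0.70
p_chance = (table.sum(axis=1) / n) @ (table.sum(axis=0) / n)  # 0.5*0.6 + 0.5*0.4 = 0.50

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 2))   # (0.70 - 0.50) / (1 - 0.50) = 0.40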
Assessing the reliability of school marks:
Inter-rater reliability: different markers
score the same piece of work.
Intra-rater reliability: the same marker scores
the same pieces of work on different occasions.

(Gipps, p58)
Marking Reliability
Double marking
Control scripts
Discrepancy check: a third marker is brought
in (e.g., if the difference is 3 out of 5)
Average of the ratings of three judges
Agreement (e.g., if 2 out of 3 agree)

Subjectivity of the judgments
Assessment criteria being interpreted in the same way by all
judges/raters
Rater/marker bias (conscious or unconscious)
Expectations based on ability level, ethnic origin, social class
Skills and knowledge irrelevant to the construct of interest
Cultural differences in what constitutes competent
performance and what should be valued
Variety and complexity of the marking scheme
Performance being evaluated according to the same rubric
and standards by all markers
Structure in the question
Essay-type questions vs. more structured questions
(Gipps, p58, 144)
Inter-rater agreement: whether scores can be obtained
in a consistent way on a single instance of a task.

Test-retest: whether the same task/performance/test
produces the same score for a particular individual across
times and places.

Parallel-form: whether assessment of the same
construct can be generalized across parallel tasks.

Generalizability: whether the performance outcome
can be generalized across heterogeneous
task domains.
1. Concepts of Reliability in Assessment/Tests
2. Different Types of Reliability Estimates
3. Factors influencing Reliability Estimates
4. Implications in Developing Assessment/Tests
1. Test-retest reliability
2. Equivalent (parallel) forms reliability
3. Based on Internal consistency: more practical,
requires only one single test, single administration
a) Coefficient alpha
b) Kuder-Richardson Formula 20/21
c) Split-half reliability
4. Rater-related reliability
a) Inter-rater reliability (agreement between raters on the
same assessment task)
b) Intra-rater reliability (agreement of the same rater's
judgments on different occasions)
(E&F, p73;Gipps, p57)
