Tutorial
Purpose: To present a primarily conceptual introduction to item response theory (IRT) and Rasch models for speech-language pathologists (SLPs).

Method: This tutorial introduces SLPs to basic concepts and terminology related to IRT as well as the most common IRT models. The article then continues with an overview of how instruments are developed using IRT and some basic principles of adaptive testing.

Conclusion: IRT is a set of statistical methods that are increasingly used for developing instruments in speech-language pathology. While IRT is not new, its application in speech-language pathology to date has been relatively limited in scope. Several new IRT-based instruments are currently emerging. IRT differs from traditional methods for test development, typically referred to as classical test theory (CTT), in several theoretical and practical ways. Administration, scoring, and interpretation of IRT instruments are different from methods used for most traditional CTT instruments. SLPs will need to understand the basic concepts of IRT instruments to use these tools in their clinical and research work. This article provides an introduction to IRT concepts, drawing on examples from speech-language pathology.

Key Words: item response theory, outcomes measurement, Rasch model
a University of Washington, Seattle; b VA Pittsburgh Healthcare System Geriatric Research Education and Clinical Center, and University of Pittsburgh, PA; c Louisiana State University, Baton Rouge; d VA Puget Sound Medical Center, Seattle, WA
Correspondence to Carolyn Baylor: cbaylor@u.washington.edu
Editor and Associate Editor: Laura Justice
Received September 10, 2010; Revision received February 10, 2011; Accepted April 25, 2011
DOI: 10.1044/1058-0360(2011/10-0079)

Speech-language pathologists (SLPs) rely on a wide range of assessment instruments in their work, such as articulation tests, language batteries, and questionnaires asking clients to rate their communication experiences. These instruments are important for quantifying the performance or perspective of clients, tracking changes in clients over time, and reporting this information succinctly to key stakeholders including the clients, referral sources, and payers. Applied researchers conducting treatment outcomes studies depend on these instruments to document treatment efficacy. SLPs are aware of many questions they should ask about instruments before choosing one to use with a client: Does the instrument address the trait or skill that I need to assess with this client? Does the instrument address skill levels appropriate to my client? Was the instrument developed using population samples reasonably similar to my client? Is the instrument appropriate in terms of age, language, and cultural concerns? Have I done enough background study with the instrument so that I can administer it correctly? One issue that SLPs may not be familiar with is the different theoretical approaches to measurement and how these different approaches affect instrument development, administration, scoring, and interpretation.

Many of the instruments SLPs currently use were developed using the psychometric methods of classical test theory (CTT) and are based on a definition of measurement as "the assignment of numerals to objects or events according to a rule" (Stevens, 1946, p. 677). The four scales associated with this definition are referred to as nominal, ordinal, interval, and ratio scales, descriptions of which are in most introductory statistics textbooks (e.g., Gravetter & Wallnau, 2000). Others have proposed that measurement requires more stringent criteria than the Stevens (1946) definition, in particular that measurement requires a linear, additive scale with equal units (like inches on a ruler) that allow the numbers to be manipulated mathematically (i.e., addition, subtraction, or averaging; Bond & Fox, 2001; Michell, 1997, 2003; Wright, 1999). Item response theory (IRT; Lord & Novick, 1968) and the related set of measurement models based on the work of Georg Rasch (Rasch, 1960, 1980) offer a different set of statistical methods that attempt to address these more stringent definitions of measurement.
There are many differences between CTT and IRT in how instruments are constructed, administered, and interpreted. Even if SLPs are not familiar with the terminology of CTT, they are probably aware of many CTT guidelines. For example, all items in an instrument (or subscale) need to be administered to obtain a valid and reliable score. Longer instruments are generally more reliable than shorter instruments. Scores across clients or test situations cannot be compared unless the same items were administered or the instruments are equated psychometrically. Test scores are interpreted relative to a normative or reference sample, meaning that a client's ability is understood by comparing that client to the normative sample (e.g., percentile rankings).

Many features of IRT differ from those of CTT (Embretson & Reise, 2000). For example, options for adaptive testing allow for only a subset of items from a larger item bank to be administered. This tailors assessment for each client and lowers response burden. These shorter item subsets can be as reliable as longer CTT instruments, resulting in greater measurement efficiency (Cook, O'Malley, & Roddey, 2005). Adaptive testing leads to different item subsets being administered to different clients, or to the same client at different testing points, yet these scores can be directly compared, facilitating comparison across time and clients. Finally, test scores are usually interpreted on the logit scale (discussed below), with the scale representing the difficulty of item content. Comparing a client's performance or response directly to the difficulty level of the content or skill is a more direct method of understanding the client's ability than knowing only where the client ranks in a normative sample.

IRT and Rasch models are not new. They date back to the 1960s and have been in use since then, particularly in educational and aptitude testing. Over the past 10-15 years, IRT has been increasingly applied in psychology and the health sciences, and recently more IRT work has gained momentum in speech-language pathology. Early examples of the application of IRT to instruments relevant for speech-language pathology include the Token Test (Willmes, 1981), the Test of Adolescent/Adult Word Finding (German, 1990), the Peabody Picture Vocabulary Test (Dunn & Dunn, 1997), the Aachen Aphasia Test (Willmes, 2003), and the Rehabilitation Institute of Chicago Evaluation of Right Hemisphere Dysfunction-Revised (Cherney, Halper, Heinemann, & Semik, 1996). However, this work has either tended to be technical in orientation with limited impact on clinical practice or has not capitalized on some of the particular advantages of IRT. Several research groups are now working on a variety of new IRT-based instruments intended for clinical use (Baylor, Yorkston, Eadie, Miller, & Amtmann, 2009; Donovan, Kendall, Young, & Rosenbek, 2008; Donovan, Velozo, & Rosenbek, 2007; Hula, Austermann Hula, & Doyle, 2009; Hula, Doyle, & Austermann Hula, 2010; Justice, Bowles, & Skibbe, 2006; Kendall et al., 2010; Milman et al., 2008), and more are likely to emerge in coming years.

Because IRT and Rasch instruments will become more prevalent in the SLP's toolkit, SLPs need to be familiar with the basic terminology and concepts of IRT. The purpose of this article is to provide an introduction to IRT and Rasch models for clinical SLPs and applied researchers who may be interested in using IRT instruments but are unfamiliar with these measurement models. This tutorial provides an overview of the concepts and terminology, a brief description of how IRT instruments are constructed, and a description of adaptive administration, which is a particular advantage of IRT instruments.

Before proceeding, an explanation of how terminology is used in this article is warranted. In the prior paragraphs, IRT and Rasch have been mentioned together. IRT models were developed in the 1950s and 1960s by Fred Lord and Alan Birnbaum (Birnbaum, 1968; Bock, 1997; Lord & Novick, 1968). For a review, see Bock (1997). Around the same time, the Danish mathematician Georg Rasch independently developed a model that is mathematically identical to the one-parameter IRT model (Rasch, 1960). However, Rasch's work had a different theoretical orientation than that of Lord and colleagues (Wright, 1997). Whereas the latter were more oriented to the practical needs of large-scale educational testing, Rasch was more focused on the principles underlying scientific measurement and on developing psychometric models that would meet the most rigorous standards of scientific measurement.1 There are philosophical and methodological differences between IRT and Rasch models. However, most SLPs will not need to dwell on these differences for their clinical purposes. Because of this, the distinctions between IRT and Rasch will be mentioned only briefly, and the focus will be on the concepts and applications that are common to both IRT and Rasch. For ease of reading, the term IRT is used in this article to encompass both IRT and Rasch-based instruments where they have commonalities.

Footnote 1: The work of Lord and colleagues may be considered "data-driven," in that they developed increasingly complex models to better describe empirical data, while the work of Rasch and his students can be characterized as "model-driven," in that they have sought to develop more restrictive models that would satisfy the requirements of scientific measurement.

Key IRT Concepts

Much of the information presented in this tutorial can be found in introductory texts on IRT or Rasch models as well as in peer-reviewed publications (Bond & Fox, 2001; Embretson & Reise, 2000; Hambleton, Swaminathan, & Rogers, 1991; Hays, Morales, & Reise, 2000; Jette & Haley, 2005; Reeve et al., 2007; Smith & Smith, 2004). For ease of reading, this list will not be referenced repeatedly throughout this text for basic concepts, but additional references will be used for specific points.

Latent Traits

Our discussion of IRT begins with the definition of a latent trait. A latent trait is a characteristic or ability of an individual that is not directly observable but instead must be inferred based on some aspect of a person's performance or presentation. Many areas of communication are latent traits, such as language and cognitive processes. For example, we cannot directly observe a client's auditory comprehension.
represents both client ability and item difficulty for the construct (or trait) estimated by the instrument, in this example, word finding. Lower logit values represent lower word-finding ability and easier items (items that clients are more likely to answer correctly). The y-axis represents the probability of a correct response on the item. The two curves in Figure 1 are called item characteristic curves (ICCs). The ICC for each item demonstrates that as ability increases (x-axis), the probability of responding to the item correctly increases (y-axis). In Figure 1, the two ICCs represent two separate items from the Western Aphasia Battery (WAB; Kertesz, 1982) sentence completion task. The curve on the left is the ICC for the item "The grass is ____ (green)"; the curve on the right is for the item "Where do nurses work?" These data are from a reanalysis of the WAB using a Rasch model (Hula, Donovan, Kendall, & Gonzalez-Rothi, 2010).

In Figure 1, the item difficulty for the item "The grass is ____ (green)" is 1.0 logit. Note that 1.0 on the x-axis corresponds to where the ICC crosses 50% on the y-axis. This means that a person with an ability level of 1.0 logit has a 50% probability of answering the item correctly.5 This item is most sensitive for estimating trait levels at or very near 1.0 logit. This sensitivity is demonstrated by the nonlinear aspect of the ICC. Note how the slope of the ICC is steeper in the region around the item difficulty value and shallower at the ends of the curve. Greater differences in response probability are seen among clients who have trait values closer to the item difficulty value, and smaller changes in response probability across trait levels are seen more distant from the item difficulty value.

Footnote 5: This commonly used interpretation is actually an abbreviated version of the following more technically correct interpretation: 50% of individuals with an ability level of 1.0 will answer this item correctly. Also, given a set of items with difficulty equal to 1.0, a single individual with this ability level will correctly answer 50% of them.

To explore the concept of item difficulty further, compare the two items in Figure 1. The first item ("The grass is ____") has an item difficulty of 1.0 logit, and the second item ("Where do nurses work?") has an item difficulty of 2.0 logits. Changes in item difficulty affect the probability of a correct response. Because the distance between these two items is exactly 1 logit, clients are 2.718 times more likely to respond correctly to the easier (lower logit) item. One advantage of IRT scaling is that if we have an estimate for a client's ability level, we can estimate the probability of that individual answering any given item correctly. Using Figure 1, a client with an ability level of 1.0 logit will respond correctly to items like "The grass is ____" 50% of the time and to items like "Where do nurses work?" approximately 27% of the time. Having this kind of information helps SLPs avoid administering items that are too difficult or too easy and to instead give items that will yield the most information about a client.
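The numbers in this example follow from the logistic form of the Rasch/1-PL model. The equation itself is not reproduced above, so the short sketch below is offered only as an illustration: it assumes the standard Rasch expression, with ability (theta) and item difficulty (b) in logits, and uses the two WAB item difficulties (1.0 and 2.0 logits) described in the text. The function and variable names are ours, not part of any published instrument or software.

```python
import math

def p_correct(theta, b):
    """Rasch/1-PL probability of a correct response, with person ability
    (theta) and item difficulty (b) both expressed in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

ability = 1.0      # client ability estimate in logits
grass_item = 1.0   # "The grass is ____ (green)": difficulty 1.0 logit
nurses_item = 2.0  # "Where do nurses work?": difficulty 2.0 logits

print(round(p_correct(ability, grass_item), 2))   # 0.5
print(round(p_correct(ability, nurses_item), 2))  # about 0.27

# A one-logit difference in difficulty corresponds to an odds ratio of
# e (about 2.718) between the two items for the same client.
odds_grass = p_correct(ability, grass_item) / (1 - p_correct(ability, grass_item))
odds_nurses = p_correct(ability, nurses_item) / (1 - p_correct(ability, nurses_item))
print(round(odds_grass / odds_nurses, 3))         # about 2.718
```

The factor of 2.718 (Euler's number, e) describes the ratio of the odds of success on the two items, which is the conventional way a one-logit difference is interpreted.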
FIGURE 3. Operating characteristic curves for a polytomous item analyzed with a 1-PL IRT model. These curves show the threshold trait levels for responses crossing from one category to another based on theta.

FIGURE 4. Category response curves for a polytomous item. These are the category response curves for the same item depicted in Figure 3. The category response curves show the probability of a client choosing each category dependent upon trait level.
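The figures themselves are not reproduced here, but curves like those described in the captions can be generated from a partial credit (Rasch-family) parameterization of a polytomous item. The sketch below is a generic illustration under that assumption; the step difficulties are hypothetical values chosen only to produce plausible curves, not values taken from any published item.

```python
import math

def category_probabilities(theta, step_difficulties):
    """Category response probabilities for one polytomous item under a
    partial credit (Rasch-family) parameterization. step_difficulties
    holds the step (threshold) parameters in logits; categories run
    from 0 to len(step_difficulties)."""
    # The numerator for category k is exp(sum over the first k steps of
    # (theta - step)); category 0 has an empty sum, so its numerator is 1.
    numerators = [1.0]
    running_sum = 0.0
    for step in step_difficulties:
        running_sum += theta - step
        numerators.append(math.exp(running_sum))
    total = sum(numerators)
    return [n / total for n in numerators]

# Hypothetical step difficulties for a four-category rating item
steps = [-1.5, 0.0, 1.5]
for theta in (-2.0, 0.0, 2.0):
    probs = [round(p, 2) for p in category_probabilities(theta, steps)]
    print(theta, probs)
```

Plotting these probabilities across a range of theta values reproduces the general shape of category response curves: each response category is most probable within a particular band of the trait continuum.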
Rasch (Rasch, 1960)/1-PL
  Parameters (a): item difficulty. Response format (b): dichotomous.
  The Rasch and 1-PL models are mathematically equivalent but stem from different theoretical approaches. This model is described in the text.

Partial Credit Model (Masters, 1982; Masters & Wright, 1996)
  Parameters (a): item difficulty. Response format (b): polytomous.
  Originally designed for items that have multiple steps for which the respondent can get partial credit. Also appropriate for categorical rating scale items. A key feature is that the Partial Credit Model makes no assumptions about the ordering of the response categories, meaning that the order and distance between response categories can vary across items. This allows developers to explore how the response format works for the items and to change the response formats if needed (i.e., add more categories or remove categories).

Rating Scale Model (Andrich, 1978a, 1978b)
  Parameters (a): item difficulty. Response format (b): polytomous.
  Similar to the Partial Credit Model except that this model assumes the response categories are in the same order with the same distance between categories for all items. This does not allow exploration of variations in and function of response categories. Cannot be used if different items have different response formats.

2-PL
  Parameters (a): item difficulty and discrimination. Response format (b): dichotomous.
  This model is described in the text.

Graded Response Model (Samejima, 1969, 1996)
  Parameters (a): item difficulty and discrimination. Response format (b): polytomous.
  Items can have different numbers of response categories or different response formats across the items.

3-PL
  Parameters (a): item difficulty, discrimination, and guessing. Response format (b): dichotomous.
  This model takes into account the possibility of clients answering items correctly due to guessing, particularly that very low ability clients can still score some items correctly due to guessing.

Note. 1-PL = one-parameter logistic; 2-PL = two-parameter logistic; 3-PL = three-parameter logistic.
(a) The parameters correspond to the 1-PL (item difficulty), 2-PL (item difficulty and discrimination), and 3-PL (item difficulty, discrimination, and guessing parameters) models.
(b) Dichotomous response formats have only two possible responses, such as correct/incorrect, true/false, or yes/no. Polytomous response formats are multiple response formats such as Likert rating scales (i.e., strongly agree, slightly agree, slightly disagree, or strongly disagree).
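For dichotomous items, the parameters listed above all enter a single logistic response function. As a rough illustration (not a reproduction of any formula printed here), the sketch below uses the common logistic form in which difficulty shifts the curve along the trait scale, discrimination changes its steepness, and the guessing parameter raises its lower asymptote; setting discrimination to 1 and guessing to 0 recovers the 1-PL/Rasch case.

```python
import math

def irt_probability(theta, difficulty, discrimination=1.0, guessing=0.0):
    """Probability of a correct response to a dichotomous item under a
    logistic IRT model. With discrimination=1 and guessing=0 this reduces
    to the 1-PL/Rasch form; adding discrimination gives the 2-PL; adding a
    nonzero guessing floor gives the 3-PL."""
    core = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return guessing + (1.0 - guessing) * core

theta = 0.0  # hypothetical client ability in logits
print(irt_probability(theta, difficulty=1.0))                      # 1-PL
print(irt_probability(theta, difficulty=1.0, discrimination=2.0))  # 2-PL: steeper curve
print(irt_probability(theta, difficulty=1.0, discrimination=2.0,
                      guessing=0.25))                              # 3-PL: guessing floor
```

The polytomous models in the table (Partial Credit, Rating Scale, and Graded Response) extend these same ideas to items with more than two response categories.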
guide to instrument development, and SLPs interested in taking on this task will need to consult more advanced references and the expert advice of a psychometrician experienced in IRT. However, while many SLPs may not be developing item banks, having insight into this process may be helpful when using IRT instruments. SLPs should be familiar with the basic concepts behind item bank construction to be informed consumers of the instruments. This section may help SLPs interpret information in test manuals, including instructions for administration and scoring.

In some instances, existing instruments developed using CTT methods are reanalyzed using IRT. IRT analyses of these instruments can shed additional light on the measurement properties of the instruments. However, it should be noted that reanalyzing CTT instruments using IRT does not provide all of the advantages of IRT. For example, item banks generated from their beginning through IRT typically contain a large number of items that are specifically chosen from an even larger set of candidate items to meet several criteria that will be discussed below. Test developers can choose the most desirable items from among the candidate set to create the optimal item bank for specific measurement purposes. When reanalyzing existing CTT instruments, the number of items has already been reduced and fixed during the original development of the instrument. If there are problems with the items in an IRT analysis, such as multidimensionality, poor fit to the IRT model, or uneven distribution across the measurement range (all issues to be discussed further below), the test developers have limited options for improving an existing item set because there is no larger candidate item set from which to draw items that better fit the IRT model. Some researchers also use IRT to compile new item sets by combining existing CTT instruments measuring the same construct. This provides a larger pool of items from which to draw for the instrument but still has some of the limitations just described.

Meeting Key IRT Assumptions

When creating an IRT item bank, instrument developers begin by identifying the construct that they are interested in assessing (e.g., reading comprehension or verbal naming). The developers collect a large pool of candidate items that they think reflect that construct. They may use items that already exist in other instruments (with appropriate references to those instruments), or they may write their own items. When developers feel that they have a comprehensive set of candidate items, they will administer the items to a
that clients do indeed seem to answer the item according to what the model would predict based on their trait level. Contrast this with Part B (item: getting your point across when you are upset), in which the pattern does not hold. In Part B, the data follow the predicted model for the upper portion of the curve but deviate notably from the model in the lower section of the curve. This is an example of poor item fit. IRT analysis programs will often provide item fit statistics. Developers typically keep items whose fit statistics fall within a certain recommended range per the IRT model.
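Fit statistics compare each observed response with the response probability the model predicts for that client on that item. The exact statistics reported differ across programs; one commonly reported Rasch-style pair are the infit and outfit mean squares, which the hedged sketch below computes from scratch for a single dichotomous item. The response and probability values are invented for illustration; values near 1.0 indicate responses that behave roughly as the model expects.

```python
def fit_statistics(observed, expected):
    """Infit and outfit mean-square statistics for one dichotomous item.
    observed holds scored responses (0 or 1) from several clients;
    expected holds the model-predicted probability of success for each
    of those same clients."""
    variances = [p * (1 - p) for p in expected]
    squared_residuals = [(x - p) ** 2 for x, p in zip(observed, expected)]
    # Outfit: the average squared standardized residual.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(observed)
    # Infit: squared residuals weighted by the model variance.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

# Invented data for three clients responding to one item
observed = [1, 0, 1]
expected = [0.73, 0.50, 0.27]
print(fit_statistics(observed, expected))
```

In practice these values come from the analysis software rather than hand calculation, and the acceptable range is set by the developers based on the model and the measurement context.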
but substantially less precise estimation for clients outside this region. When constructing an item bank, developers can examine the item information functions to choose those items that provide the most information across the trait range for which the instrument is intended. When forming an item bank or a subset of items from a bank, the information functions from each item can be added together to form a curve that represents the measurement precision across the entire item set. This is the test information function. In test information functions, information (y-axis) values higher than 10 are associated with reliability coefficients of .90 or above (Embretson & Reise, 2000), so measurement is considered to be most reliable in the ability range for which the test information function is above 10 (see the computational sketch at the end of this section).

This section has described some of the key steps in IRT item bank development. Item banks can be changed over time. More items may be added, although this requires gathering responses from a large number of individuals and reanalyzing the entire item bank to ensure that any changes in the measurement properties introduced into the item bank with the new items are identified.

Other Steps in Instrument Development

Before leaving the topic of item bank development, we should note that using IRT in and of itself does not guarantee that an instrument is appropriate and relevant for the measurement purposes at hand. Instrument development requires many steps to ensure that the instrument addresses a relevant and meaningful construct, and estimates it in a valid and reliable manner. These steps should involve a combination of qualitative and quantitative processes, which are described elsewhere (Brod, Tesler, & Christensen, 2009; DeWalt, Rothrock, Yount, & Stone, 2007; Fries, Bruce, & Cella, 2005). The resources of the Patient Reported Outcomes Measurement Information System (www.nihpromis.org), the National Institutes of Health's roadmap initiative for the development of patient-reported outcomes measures, may also be helpful.
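To make the relationship between test information and reliability concrete, the sketch below sums Rasch item information (P times 1 minus P) over a purely hypothetical bank of dichotomous items and converts the total to a standard error and an approximate reliability, assuming the common convention that the trait metric is scaled to a variance of 1. None of the difficulty values corresponds to a real instrument.

```python
import math

def p_correct(theta, b):
    # Rasch/1-PL response probability, as in the earlier sketches
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def test_information(theta, difficulties):
    # For Rasch items, information at theta is P * (1 - P) for each item;
    # the test information is the sum across the items administered.
    return sum(p_correct(theta, b) * (1 - p_correct(theta, b)) for b in difficulties)

# Purely hypothetical bank: 81 dichotomous items spaced evenly from -2 to +2 logits
bank = [-2.0 + i * 0.05 for i in range(81)]

theta = 0.0
info = test_information(theta, bank)
standard_error = 1 / math.sqrt(info)
approximate_reliability = 1 - 1 / info  # information of 10 corresponds to about .90

print(round(info, 1), round(standard_error, 2), round(approximate_reliability, 2))
```

Because information depends on theta, the same bank can be highly reliable for clients near the middle of its difficulty range and much less precise for clients at the extremes, which is the point made above about targeting items to the intended trait range.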
SLPs may also wonder about some of the traditional tests of validity and reliability that they are used to seeing in test manuals. Many of the IRT processes described above provide documentation of reliability and validity. Evidence of unidimensionality, good item fit, and lack of differential item functioning (DIF), for example, all provide support that the instrument is valid and addresses only the intended trait without confounding variables. Developers will want to continue validity and reliability testing to document issues such as comparisons of the new instrument to current gold standard instruments, reliability of the instrument with time and retesting, and sensitivity of the instrument to change, particularly for the purpose of measuring treatment outcomes.

Administering the Item Bank

One of the key advantages of IRT, and one of the key differences between IRT and CTT instruments, is the option of adaptive assessment. CTT instruments typically require that all of the items in the scale and/or component subscales be administered to obtain a valid, reliable, and interpretable score (although many CTT instruments do have stopping rules for establishing basals or ceilings). Instruments based in IRT may be used in the same way, with all clients receiving the same set of test items. However, one of the major advantages of IRT is that it permits administration of different subsets of items to different clients (or to the same client at different times) with the ability to compare these scores directly. One of the most significant advantages of adaptive testing is that it increases measurement efficiency, meaning that a client's trait level can be estimated using fewer items, thus lowering response burden while maintaining measurement precision at a level that is commensurate with, or perhaps even more reliable than, longer fixed-length CTT instruments (Cook, O'Malley, & Roddey, 2005). Several new IRT-based instruments that utilize adaptive testing are on the horizon for speech-language pathology.

Adaptive testing begins with an IRT item bank calibrated using the methods of instrument development described above. Item banks can be of widely varying sizes, from 20 to more than 100 items. Ideally, SLPs want to administer as few items as necessary to reduce the burden on the client, particularly if communication impairments make responding to the items challenging. However, SLPs also want to administer enough items to ensure that they have a reasonably precise estimate of the trait level. Instrument developers may prepackage subsets of items in tailored short forms. The different short forms may be appropriate for different groups of clients, perhaps different severity levels of a communication disorder. The most flexible method of adaptive assessment, however, involves starting with the item bank in its entirety and selecting items based on the individual client's responses.

Administering an adaptive IRT-based instrument is somewhat like an audiologist conducting a hearing examination