Honeyman-Buck
Information Transfer: Free Text vs Structured Format
TABLE 1: Results of Analysis of Variance for the Number of Correctly Answered Questions Per Case
Source | Effect Type | Degrees of Freedom | Mean Square Error | F Value | Pr > F
Format (free text vs structure) | Fixed | 1 | 1.069 | 0.86 | 0.3541
Case ID (case type) | Random | 8 | 5.254 | 4.25 | 0.0001
Subject | Random | 15 | 11.783 | 9.52 | < 0.0001
Order | Fixed | 11 | 1.521 | 1.23 | 0.2734
Format × subject | Random | 14 | 1.580 | 1.28 | 0.2292
Error 1 | NA | 139 | 1.237 | NA | NA
Case type | Fixed | 2 | 0.482 | 0.09 | 0.9153
Error 2 | NA | 7.879 | 5.388 | NA | NA
Note: Pr = probability, NA = not applicable.
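As a reading aid for Table 1, the sketch below recomputes each F value as the ratio of an effect's mean square to an error mean square. The choice of denominator is an inference from the table layout rather than something the text states: case type is divided by Error 2 and every other effect by Error 1. Because the printed mean squares are rounded, the recomputed ratios agree with the printed F values only to within rounding.

```python
# Mean squares copied from Table 1 (number of correctly answered questions).
MS_ERROR_1 = 1.237  # "Error 1" row
MS_ERROR_2 = 5.388  # "Error 2" row; assumed here to be the denominator for case type

# (source, effect mean square, error mean square assumed as denominator)
TABLE_1 = [
    ("Format", 1.069, MS_ERROR_1),
    ("Case ID (case type)", 5.254, MS_ERROR_1),
    ("Subject", 11.783, MS_ERROR_1),
    ("Order", 1.521, MS_ERROR_1),
    ("Format x subject", 1.580, MS_ERROR_1),
    ("Case type", 0.482, MS_ERROR_2),
]


def f_ratio(ms_effect, ms_error):
    """F statistic as the ratio of an effect mean square to its error mean square."""
    return ms_effect / ms_error


f_values = {source: f_ratio(ms, err) for source, ms, err in TABLE_1}

for source, f in f_values.items():
    print(f"{source}: F = {f:.2f}")
```

The same ratios can be formed from the mean squares in Tables 2 and 3. The p values are not recomputed here because that requires the F distribution, which the Python standard library does not provide.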
TABLE 2: Results of Analysis of Variance for the Number of Seconds Taken to Complete Each Case
Source | Effect Type | Degrees of Freedom | Mean Square Error | F Value | Pr > F
Format (free text vs structure) | Fixed | 1 | 3,056 | 0.67 | 0.4129
Case ID (case type) | Random | 8 | 24,503 | 5.41 | < 0.0001
Subject | Random | 15 | 80,762 | 17.82 | < 0.0001
Order | Fixed | 11 | 1,251 | 0.28 | 0.9893
Format × subject | Random | 14 | 2,866 | 0.63 | 0.8343
Error 1 | NA | 139 | 4,531 | NA | NA
Case type | Fixed | 2 | 46,880 | 1.86 | 0.2174
Error 2 | NA | 7.905 | 25,171 | NA | NA
Note: Pr = probability, NA = not applicable.
Statistical Analysis

Each subject’s participation generated a set of 12 experimental results. These consisted of answers to the 10 questions for a case and the time-stamped navigation data. We scored the subject’s answers against a key to obtain the number correct. The start time was subtracted from the final submission time to obtain number of seconds taken to do each case. An efficiency score was then calculated for each case by dividing the number of questions answered correctly by the number of seconds taken to finish the entire case. This result was multiplied by 60 to give the number of correctly answered questions per minute. The time-stamped navigation activity records were processed to obtain two outcomes for each case. The number of times the subject moved back to view the report from the question page was tabulated. This outcome could take any positive integer and will be called report views. The number of answer selections made during each case was counted as well. This outcome was at least 10 and was higher when subjects changed their minds about one or more answers. Thus, there were five outcomes analyzed: number of questions correct, time in seconds taken to complete the case, efficiency score, report views, and answer selections.

We used the Statistical Analysis System (SAS Version 9 for Windows, SAS Institute) for all data manipulation and statistical calculation. For all tests of significance, we set p = 0.05 as the cutoff and used two-tailed alternate hypotheses. The analysis was done on the basis of a balanced incomplete block design. The factors included examination type (3 levels), report format (2 levels), and individual cases (4 levels per type). Thus, there were 24 factor level combinations (treatments), a block size of 12 (cases per subject), and eight blocks (subjects). The experiment was replicated twice with two groups of eight subjects for a total of 16 subjects and 192 experimental units. Summary statistics for the five outcomes were generated, including mean, median, mode, SD, frequency distribution plot, and normal probability plot.

We performed analysis of variance with a general linear model procedure (SAS PROC GLM) to test for differences in each of the five outcomes (percent of questions correct, time taken, efficiency, report views, and answer selections) jointly related to our independent variables. We used the same linear model for each outcome. Fixed effects included report format, case type, and the order that the case was presented to the subject. Random effects included case identity nested within case type, subject identity, and report format crossed with subject identity. The model was specified as follows:

Outcome = β1 format + β2 case type + β3 caseID(case type) + β4 subjectID + β5 format × subjectID + β6 order + Error

Standard F statistics using the type 3 sums of squares and appropriate error terms were used to test the coefficients (β1–β6) against the null hypothesis of no effect (βi = 0). The Duncan procedure was used to perform multiple comparisons of mean number of report views by case type [3]. Since none of the fixed effects was significant in any of the other models, no additional multiple comparisons were performed. The two outcomes relating to subjects’ test-taking habits (report views and answer selections) were correlated within subjects, formats, case types, and overall to obtain Pearson correlation coefficients [4].

Because our results showed equivalence on the main variable of interest (report format) we performed post hoc power analysis. The sample size was initially set with eight subjects each looking at 12 cases for a total of 96 experimental units. We were able to double the planned sample size because many students responded to the request for participation and running the experiments was quite easy due to all subjects’ familiarity with the testing system. Our analysis was based on paired testing between free text and structured format with α = 0.05 and
β = 0.10 (90% power). We used the root mean square error from the analysis of variance output as the estimate of sigma for each outcome.

For the postexperimental survey, Likert-scaled items from the debriefing questionnaires were summarized by calculating the median value, the 10th percentile, and the 90th percentile. The general preference items were enumerated and percentages calculated. The qualitative content of the debriefing focus group was summarized from a transcription of a tape recording made during the session.

Results
Experimental Results

Examination of the response data from the testing system revealed that all 16 subjects submitted valid responses to the 10 questions for each of their 12 cases and that they performed them in the assigned order. The timing data were complete and consistent with the answer responses with no gaps, extra entries, or ambiguities. Thus, there were 192 complete response sets available for analysis. These consisted of timing data and answers to all 10 questions for each of 12 cases completed by our 16 subjects. Each case was shown eight times in its free text form and eight times with the structured format. What follows are univariate statistics on each outcome for the entire data set (n = 192) along with the multifactorial analysis of variance results.

The number of correct responses (score) ranged from two to 10 with a mean of 8.35 and SD of 1.52. The distribution was negatively skewed with median and mode both being nine. None of the fixed effects (format p = 0.35, order p = 0.27, and case type p = 0.92) were significant and there was no interaction between format and subject effects on the score. The majority of the variance (60%) was partitioned between subjects and between cases within case type and R squared for the model was 0.58. The analysis of variance results are reproduced in Table 1.

The time taken to complete the cases ranged from 30 to 707 sec with mean of 351 and SD of 108. The distribution was not skewed with a median of 341 but there was some kurtosis. The single observation at 30 sec was a distinct outlier with the next lowest value being 102 sec (first percentile). None of the fixed effects were significant (format p = 0.41, order p = 0.99, and case type p = 0.22) and there was no interaction between format and subject effects on the time. The majority of the variance (62%) was partitioned between subjects and between cases within case type and R squared for the model was 0.71. The analysis of variance results are reproduced in Table 2.

The efficiency (number of correctly answered questions per minute) ranged from 0.52 to 4.7 with mean of 1.56 and SD of 0.56. The distribution was nearly normal except that the positive tail was longer. This was due to the same outlier (30 sec to complete the case) mentioned above. None of the fixed effects were significant (format p = 0.92, order p = 0.60, and case type p = 0.48) and there was no interaction between format and subject effects on the efficiency. Just under half of the variance (46%) was partitioned between subjects and between cases within case type and R squared for the model was 0.57. The analysis of variance results are reproduced in Table 3.

The number of times that any answer was selected for each case (answer selections) ranged from 10 (obligate floor value) to 19 with a mean of 11.5, SD of 1.68, median of 11, and mode of 10. The distribution was, as expected, not normal and looked more like a Poisson type with a mean of 11.5. We elected to proceed with analysis of variance despite the violation of normality because this outcome was considered to be secondary and the results were relatively uninteresting. The only significant effect was between subjects. Report format and case type had the same mean number across all levels.

The number of times subjects went back to look at the report text (report views) while answering questions ranged from two to 32 with a mean of 12, median of 11, and SD of 6.8. The distribution was near normal, allowing for the discrete nature of the outcome. The analysis of variance showed no difference in the mean number by report type (structured = 13.4, free text = 12.3, p = 0.93). Again, there was considerable variance between subjects (p < 0.0001) with a minimum of 4.7 times per case ranging up to 25 times per case. Interestingly, there was a significant difference between the type of case (p = 0.0016) even though there was no significant difference between the individual cases (p = 0.11). The mean number of times the report was consulted for abdominal CT (13.1) was essentially the same as for abdominal sonography (12.9). However, for head CT, subjects went back to look at the report an average of 11 times. We tested for interaction between case type and subject effects and found none. Therefore, the tendency to look back at head CT reports less frequently than sonography and abdomen CT was shared by all subjects.

Across the entire sample of 192 cases, the correlation between report views and answer selections was weakly positive with a Pearson coefficient of 0.22 (p = 0.002). When we examined the relationship between report views and answer selections for each subject, only three of the 16 had significant correlations. These were all positive (0.58, 0.64, 0.75) and probably accounted for the aggregate correlation. The correlations between report views and answer selections stratified by case type and report format were all weakly positive (0.18 to 0.28). The one exception was the head CT cases, where there was no correlation between numbers of report views and answer selections.
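The five outcomes defined in the Statistical Analysis section can be illustrated with a minimal sketch that derives them from a single case record. The record layout, the event names, and the example values are hypothetical, since the paper does not describe the format of the navigation log. Only the arithmetic (score, elapsed seconds, and score divided by seconds times 60) follows the text.

```python
from dataclasses import dataclass


@dataclass
class CaseRecord:
    """Hypothetical layout for one subject-case record (not the authors' log format)."""
    start_time: float   # seconds, when the case was opened
    submit_time: float  # seconds, when the final answers were submitted
    answers: list       # final answers to the 10 questions
    events: list        # navigation events in order: "view_report" or "select_answer"


def outcomes(record, key):
    """Compute the five per-case outcomes described in the text."""
    score = sum(given == correct for given, correct in zip(record.answers, key))
    seconds = record.submit_time - record.start_time
    if seconds <= 0:
        raise ValueError("submission time must be later than start time")
    return {
        "score": score,                                   # questions answered correctly
        "seconds": seconds,                               # time taken for the case
        "efficiency": score / seconds * 60,               # correct answers per minute
        "report_views": record.events.count("view_report"),
        "answer_selections": record.events.count("select_answer"),
    }


# Made-up example: 9 of 10 correct in 341 seconds, one changed answer, three report views.
key = list("ABCDABCDAB")
record = CaseRecord(
    start_time=0.0,
    submit_time=341.0,
    answers=list("ABCDABCDAA"),
    events=["select_answer"] * 11 + ["view_report"] * 3,
)
result = outcomes(record, key)
print(result)
```

Note that the mean of per-case efficiency scores need not equal the mean score divided by the mean time, which is why efficiency is analyzed as its own outcome.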
TABLE 3: Results of Analysis of Variance for Efficiency of Answering Questions (Score / Time) × (60)
Source | Effect Type | Degrees of Freedom | Mean Square Error | F Value | Pr > F
Format (free text vs structure) | Fixed | 1 | 0.002 | 0.01 | 0.9172
Case ID (case type) | Random | 8 | 1.078 | 6.08 | < 0.0001
Subject | Random | 15 | 1.126 | 6.36 | < 0.0001
Order | Fixed | 11 | 0.148 | 0.84 | 0.6040
Format × subject | Random | 14 | 0.259 | 1.46 | 0.1339
Error 1 | NA | 139 | 0.177 | NA | NA
Case type | Fixed | 2 | 0.881 | 0.80 | 0.4843
Error 2 | NA | 7.915 | 1.108 | NA | NA
Note: Pr = probability, NA = not applicable.
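The post hoc power analysis (α = 0.05, 90% power, sigma taken from the root mean square error) can be illustrated with a rough sketch of a minimum detectable difference. The paper does not give the formula it used, so everything below is an assumption: a two-sided normal approximation for comparing two group means, with 96 experimental units per format and the sigma values reported for score (1.32), time (65 sec), and efficiency (0.43). Under those assumptions the results are of the same size as the detectable differences the authors report (about 0.5 questions, about 30 sec, and about 0.2 questions per minute), but this is not a reproduction of their calculation.

```python
from math import sqrt
from statistics import NormalDist


def detectable_difference(sigma, n_per_group, alpha=0.05, power=0.90):
    """Smallest mean difference detectable under a two-sided normal approximation.

    Assumes two groups of equal size with a common standard deviation sigma.
    This is an illustrative textbook formula, not the authors' stated method.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return (z_alpha + z_power) * sigma * sqrt(2 / n_per_group)


# Root mean square errors (sigma) as reported in the text.
SIGMA = {"score": 1.32, "time_sec": 65.0, "efficiency": 0.43}

# Assumption: 192 experimental units split evenly between the two formats.
N_PER_FORMAT = 96

detectable = {name: detectable_difference(s, N_PER_FORMAT) for name, s in SIGMA.items()}
for name, value in detectable.items():
    print(f"{name}: detectable difference of about {value:.2f}")
```

Comparing these values with the observed differences between formats (0.15 questions, 8.4 sec, 0.02 questions per minute) shows why the authors describe the observed effects as far smaller than what the experiment could detect.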
The head CT cases seemed to elicit a somewhat different pattern of report viewing unrelated to the number of times answers were selected. Considering that the head CT structured format was functionally organized rather than strictly by anatomy, we wanted to be sure there was no interaction between the type of case and our main independent variable of interest, the report format. When we added this interaction term to the analysis of variance models for score, time, and efficiency, we were reassured to find that it was not significant for any of the outcomes. Thus, the finding that report format had no effect on the outcomes is conclusive and holds across case type.

For the post hoc power analysis, the root mean square errors (sigma) were as follows: score = 1.32, time (seconds) = 65, and efficiency = 0.43. As described above, we set α = 0.05 and power to 90%. For score, we could have detected a difference of about 0.5 in the mean of correctly answered questions. The observed mean scores were 8.28 for free text and 8.43 for structured format, with a difference of 0.15. For time to complete each case, we could have detected a difference of about 30 sec. The observed mean times were 355 sec for free text and 347 sec for structured format with a difference of 8.4 sec. Finally, for efficiency of answering questions, we could have detected a difference of about 0.2 questions per minute. The observed efficiency for free text was 1.57 questions per minute and for structured format, it was 1.55 questions per minute for a difference of 0.02 questions per minute. For all three of the main outcomes of interest, the observed effect of report format was far less (approaching an order of magnitude) than the difference our experiment was powered to detect.

Qualitative Results

The debriefing meeting was attended by 15 of 16 subjects. One of the male students was serving in a clinical clerkship at another institution. One of the general questions asked for preference about the report organization. One option to this question was “like a laboratory report.” By this we meant standardized headings in the body with results organized under these headings. The choices and number of responses were as follows. Like a laboratory report (11/15 = 73%), like a newspaper story (1/15 = 7%), and in the current (unstructured) format (3/15 = 20%). We also asked how they would prefer to have uncertainty expressed. The choices and responses were in words (8/15 = 53%), as a semiquantitative scale (3/15 = 20%), and in quantitative terms (2/15 = 13%). The final general question asked about how subjects would respond to an explicitly worded recommendation in a radiology report. The responses were as follows. The subject would be compelled to follow the recommendation (2/15 = 13%), they might be compelled to follow the recommendation (10/15 = 67%), and they would not feel so compelled (3/15 = 20%).

The responses to the Likert-scaled questions about preference between free text (1) and structured format (10) are summarized in Table 4 with median, mode, and range for each one. Clearly, the subjects strongly tended to prefer the structured format for the seven separately articulated domains as well as overall. During the mediated discussion, this perception was reinforced. At least five subjects clearly expressed the opinion that they would like to see all radiology reports formatted in a manner similar to our structured condition (as in Appendix 3). One participant mentioned that the headings should be consistent across all instances of a report type. He said “it might be confusing if some people include biliary system under liver and some people included it under gallbladder, for example.” Others felt that the order of headings should be altered dynamically so that the abnormal findings would be at the top of the report. One potential drawback to the structured format was expressed as follows: “The overall gestalt of the examination might be lost in structure whereas with free text a sense of severity and more acuity can be conveyed.” A corollary opinion that was expressed quite forcefully and frequently was that reports should still have a clearly labeled impression section. The concept of organizing a report like a newspaper story was linked to the need for an impression. Like a news story, subjects wanted to see reports have an equivalent of the lead paragraph in which findings are synthesized and condensed to allow rapid review and to highlight the diagnostic impression.

TABLE 4: Summary of Responses to Format Preference Questions
Question (Preference With Respect To) | Median | Mode | Range
Your accuracy (number correct) | 9 | 10 | 6–10
Your speed in finding answers | 9 | 10 | 3–10
Your efficiency in finding answers | 9 | 10 | 6–10
Your certainty of knowing the answer | 9 | 10 | 5–10
Knowing what was not mentioned in the report | 10 | 10 | 5–10
Negative findings | 9 | 10 | 2–10
Positive findings | 9 | 10 | 7–10
General preference | 10 | 10 | 3–10
Note: Questions were all Likert scaled with 1 = free text, 5 = no preference, and 10 = structured format.

Discussion

The working hypothesis of the experiment was that our subjects would take significantly less time to read a structured report and answer questions about its content than they would with the free text version. We were not certain that subjects would attain higher or even equal scores with the structured version. To allow for this, we introduced a measure of efficiency (correctly answered questions per minute). The pattern of outcomes that we expected was less time for structure, equal to slightly lower scores (accuracy) with structure, and increased efficiency with structure. To our surprise, there were no significant differences on any of the three measures. The planned power to detect differences between free text and structure was achieved and indeed exceeded in the experiment. Furthermore, all hypotheses were tested using two-tailed methods thus allowing for either alternative to the null of equivalence.

We assert that there is no effect of report format on speed, accuracy, or efficiency for our subject population reading the types of reports we presented to them. By extension, we suggest that the same phenomenon may pertain to practicing physicians. If this is true, designers of reporting systems may not need to work so hard to produce old-style narrative documents out of structured elements. At the same time, it would seem that free text reports are not as difficult to read for content as some believe. The choice of report format and structure may be made on the basis of referring physician preference and considering the effect of reading into structure by radiologists. However, we advise those seeking to fundamentally change the way in which radiology reports are created and displayed to proceed with caution in consideration of the following. In a recent article, Ash et al. [5] discussed unintended consequences of overemphasizing structured information entry in health care informatics. They cite evidence from studies in cognitive psychology and sociology that in a shared context, concise, unconstrained, free text communication is the most effective for coordinating work around a complex task [5, 6]. Overly structured data can lead to loss of cognitive focus by clinicians, both during input and review. This can cause clinicians to experience a loss of overview about the case at hand when they have to attend to data contained in many different fields, sometimes on different screens within an interface [7, 8]. Furthermore, the act of writing or dictating in narrative form may be integral to the cognitive processing of the case [9]. Our finding of dissonance between subjects’ preferences concerning report format and their actual performance reading the reports for comprehension confirms that the cognitive issues are complex. In our opinion, tried and true methods of authoring and displaying radiology reports should not be abandoned without considering the consequences.

Our follow-up session and the questionnaires completed by the subjects shed light on reader preference for report format. They all strongly and consistently preferred the structured version to the free text. This preference for structured format was consistent across all seven domains that we asked about with modal values on the 10-point Likert scale all being 10 (prefer structure). Also, the corollary question about general report organization resulted in 73% preferring a “laboratory report” format over the alternatives. The opinions of our subjects are entirely consistent with other workers’ findings with respect to physician preferences about radiology reports. There is a large body of published research detailing the opinion of referring physicians regarding the content and format of radiology reports [10–15]. The terminology differs somewhat but attributes consistently endorsed by consumers (readers) of radiology reports include complete, itemized, and structured. Another element that is commonly preferred by referring clinicians is that the report should contain a complete listing of pertinent negative findings. In aggregate, these opinions seem to militate for a report organization and format like the “laboratory report” option preferred by our subjects.

Selection of senior medical students to serve as subjects proved to be quite successful. Interest and enthusiasm were such that we were able to double the sample size with little effort. Even after the study had closed, numerous students asked to participate. We found the experimental paradigm to be very acceptable to subjects and quite easy to administer. These factors should allow us to easily extend and expand the experiments, using additional cohorts of senior medical students, to address limitations described below.

Perhaps the most important limitation in our study has to do with generalizing our results from senior medical students answering content-specific questions to practicing physicians using radiology reports for clinical decision making. We narrowed the focus of the research to evaluate readability of the documents containing radiology interpretations with respect to the format alone. Medical students’ subsequent experiences during training and practice certainly do lead to differences in many skills and habits. However, we argue that the simple ability to read a passage of text and comprehend its content is already well established by the senior year of medical school.

A difficult design consideration was whether to test subjects with both the free text and structured version of each case. This would have provided even greater power to detect the effect of format on outcomes by virtue of having a directly paired comparison. We think that our choice of a balanced block design, incomplete in the format factor, was valid for two reasons. First, the planned and achieved power of the chosen design allowed us to detect differences between free text and structure that were far less than what we considered to be practically relevant. Second, having subjects see cases twice would have introduced methodologically difficult problems with memory effects.

Another issue is that our subjects had no time constraints or other pressures placed on them during testing. We plan on adding features to the experimental paradigm that will stress a subject’s short-term memory of the material. The structured versions of the reports we used had phrasing and syntax identical to that found in the original narrative versions. In practice, the language and construction of interpretative statements would likely be rather different in structured reports. Readers may be more (or less) able to rapidly comprehend medical content presented in structured format using “telegraphic” constructions such as “LIVER: Negative.”

To address limitations described above and extend the scope of our inferences, we plan on at least three extensions to the experiment using the same cases, questions, and randomization scheme. First, we will remove the button on the question page that allows going back to review the report. Subjects will know this and that they must answer all 10 questions after one reading. There will be no time limit for either reading the report or answering the questions. Second, we will place a time constraint on how long the report is visible before switching to the question page. Again, subjects will know that they cannot go back to review the report while answering questions. Third, we will enable either structured or free text formats to be viewed at the discretion of the subject while they go through a case. The tracking code will record which version(s) they look at and for how long. This will allow us to determine if subjects develop and actually act on a preference for one format or the other as they move through the cases.

Further research will involve psychometric evaluation of the questions themselves. Once the three additional experiments detailed above have been completed, we will have a large number (64) of answers to each of 120 different questions about radiology report content. This should allow us to use standard techniques to assess item difficulty, reliability, various correlations, and discriminatory power. These results will be interesting in their own right by revealing what kinds of questions are challenging for readers to answer. This might guide radiologists in explicitly including phraseology in their reports to address these difficulties. Types of questions that exhibit high levels of variance in the answers given or are poorly correlated with subjects’ overall scores will also be of interest. Given this knowledge about the types of questions that are most reliable and discriminatory, we can redesign the cases and questions to optimize power to detect subtle differences in reader performance. Such information about question content may guide other researchers in their own experiments about readability of medical documents.

To our knowledge, this work is the first experimental evaluation of radiology reports whose primary outcomes are quantitative measures of information transfer to readers of the documents. Based on the results described above, we assert that there is no difference in information transfer efficiency
between free text (narrative style) report format and structured (itemized) reports having the same content. Despite the fact that they performed no better with the structured versions, our subjects clearly preferred them to the free text format.

References
1. Merritt CRB. New president says emphasize signal, delete noise from radiology reports. ARRS Memo: Newsletter of the American Roentgen Ray Society 2004; 15(3):1–8
2. Sistrom CL, Langlotz CP. A framework for improved radiology reporting. J Am Coll Radiol 2005; 2:159–167
3. Duncan DB. t tests and intervals for comparisons suggested by the data. Biometrics 1975; 31:339–359
4. Pearson K. Mathematical contributions to the theory of evolution. III. regression, heredity and panmixia. Phil Trans Royal Soc Ser A 1896; 187:253
5. Ash JS, Berg M, Coiera E. Some unintended consequences of information technology in health care: the nature of patient care information system-related errors. J Am Med Inform Assoc 2004; 11:104–112
6. Garrod S. How groups co-ordinate their concepts and terminology: implications for medical informatics. Methods Inf Med 1998; 37:471–476
7. Patel VL, Kaufman DR. Medical informatics and the science of cognition. J Am Med Inform Assoc 1998; 5:493–502
8. Patel VL, Kushniruk AW. Understanding, navigating and communicating knowledge: issues and challenges. Methods Inf Med 1998; 37:460–470
9. Berg M. Practices of reading and writing: the constitutive role of the patient record in medical work. Sociol Health Illness 1996; 18:499–524
10. Clinger NJ, Hunter TB, Hillman BJ. Radiology reporting: attitudes of referring physicians. Radiology 1988; 169:825–826
11. Gunderman RB, Ambrosius WT, Cohen M. Radiology reporting in an academic children’s hospital: what referring physicians think. Pediatr Radiol 2000; 30:307–314
12. Johnson AJ, Ying J, Swan JS, Williams LS, Applegate KE, Littenberg B. Improving the quality of radiology reporting: a physician survey to define the target. J Am Coll Radiol 2004; 1:497–505
13. Lafortune M, Breton G, Baudouin JL. The radiological report: what is useful for the referring physician? Can Assoc Radiol J 1988; 39:140–143
14. McLoughlin RF, So CB, Gray RR, Brandt R. Radiology reports: how much descriptive detail is enough? AJR 1995; 165:803–806
15. Naik SS, Hanbidge A, Wilson SR. Radiology reports: examining radiologist and clinician preferences regarding style and content. AJR 2001; 176:591–598
APPENDIX 2: Free Format Report of an Abdominal CT Scan as Obtained from the Medical Record
A CT scan of the abdomen and pelvis was performed on 3/3/02 without priors. Routine contiguous images were obtained from the upper abdomen to the proximal femurs without the use of intravenous contrast, following a renal stone protocol.

The visualized lung bases are clear and show no evidence of pulmonary parenchymal nodules or masses.

There are multiple bilateral renal stones. The largest is present within the left renal pelvis and measures 16 × 9 mm. There is a 6 mm stone at the right ureteropelvic junction with mild-to-moderate proximal hydronephrosis. The distal ureter is decompressed. There is perinephric fat stranding adjacent to the right kidney. No additional ureteral or bladder stones are identified. The left kidney is without hydronephrosis or significant perinephric fat stranding. However, there is an exophytic 1.5 cm hypodensity off the inferior pole of the left kidney that likely represents a renal cortical cyst.

Unenhanced examination of the liver is unremarkable without biliary dilatation. Unenhanced examination of the spleen, adrenals, and pancreas is unremarkable. The bowel is not opacified but otherwise is unremarkable. There is a normal-appearing appendix.

There is no mesenteric or retroperitoneal lymphadenopathy. No pathologic osseous lesions are seen.

Impression:
1. Multiple bilateral renal calculi as described above. There is a 6 mm at least partially obstructing calculus at the right ureteropelvic junction that is causing mild-to-moderate proximal hydronephrosis. Additionally, there is perinephric fat stranding adjacent to the right kidney.
2. Cortically based hypodensity within the inferior pole of the left kidney that likely represents a renal cortical cyst.