Non-technical text

Reader: Linguist
    Formal: Also, the authoring tool will help the writer to tag each fragment of text with a particular set of selection features.
    Informal: Also, the authoring tool helps you tag each fragment of text that you write with the selection features that you choose.

Reader: Layperson
    Formal: Also, the writer’s assistant will help the writer to tag each snippet of text with the particular patient features.
    Informal: Also, the writer’s assistant helps you tag each snippet of text that you write with the patient features that you choose.
The key to WebbeDoc’s ability to produce tailored documents by selection from a single master document is the manner of representation of the master document: a WebbeDoc master document has a well-defined structure of ordering relations, rhetorical relations, and other linguistic information, such as coreference links. In the first implementation, the master document was built manually according to our model of a master document, with additional structural constraints imposed so that piecewise selection and recombination would not create any infelicities such as abrupt changes of topic, unnecessary duplications of noun phrases, or unresolvable pronouns.

But to compose a master document of this style and internal complexity required the efforts of computational linguists, rhetoricians, and Web document designers; obviously this is not realistic for the average Web user! In a realistic and usable implementation, WebbeDoc would need an authoring tool and a sentence planner that could work in real-time to repair and polish the selected text; we can’t expect the average Web document author to pre-compile all the possible combinations in advance. Therefore, to develop such a system, a number of research issues must be addressed, including representation of the master document; authoring and knowledge-based document management; and sentence planning for automated post-editing.

The next step: Generation of Web pages by selection and repair

Representing a master document

Text Specification Language, or TSL, is the language used to represent master documents in the parent HealthDoc system. We anticipate that WebbeDoc master documents will have a hybrid representation: part TSL (for the portions that will be subject to syntactic or stylistic repair), part “frozen” English text (for the portions that need never be revised). We have defined TSL to be an extension of the Sentence Plan Language (SPL) that is used by the Penman text generation system (Penman Natural Language Group 1989), whose KPML derivation (Bateman 1995) is used in HealthDoc. An SPL expression is an abstract specification of a sentence, which Penman can convert to the corresponding surface form. This permits expression of the content of the document. The basic SPL structures are annotated with selection and repair information to produce the corresponding TSL representation.
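As a rough illustration of the hybrid representation, a master document can be pictured as a sequence of segments, each of which is either frozen English text or a TSL portion that remains open to repair. The sketch below is only a data-structure outline under stated assumptions: the class names `FrozenText` and `TslSegment`, the `repairable` helper, and the dictionary used as a stand-in for a sentence plan are all hypothetical, and none of it reflects actual TSL or SPL syntax.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class FrozenText:
    """English text that need never be revised."""
    text: str


@dataclass(frozen=True)
class TslSegment:
    """A portion that may be subject to syntactic or stylistic repair."""
    plan: dict          # hypothetical stand-in for an abstract sentence plan (not real SPL)
    annotations: dict   # hypothetical selection and repair annotations
    source_text: str    # the English text the plan was derived from


Segment = Union[FrozenText, TslSegment]


def repairable(document: list) -> list:
    """Return only the segments that a sentence planner could revise."""
    return [seg for seg in document if isinstance(seg, TslSegment)]


# A two-segment hybrid document; both texts below are placeholders.
document = [
    FrozenText("[frozen English text]"),
    TslSegment(
        plan={"placeholder": "abstract sentence plan"},
        annotations={"formality": "informal"},
        source_text="[English text subject to repair]",
    ),
]

print(len(repairable(document)), "of", len(document), "segments are open to repair")
```

Keeping the two kinds of segment as separate types makes it explicit which parts of a document a repair step is allowed to touch.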
The format of the annotations for selection follows the structure of a user model, with annotations organized by personal and demographic category; for example:

    :reader-role (layperson)
    :reader-age (adult)

Other kinds of annotation for selection, such as reading level and preferred style of presentation, will, for the moment, be represented in a similar manner:

    :technical-level (low)
    :formality (informal)

The annotations can be included at any level in the SPL so that the system can make selections at any level of linguistic granularity. As stylistic and pragmatic customization becomes more complex, additional representations will probably be needed.

But this information isn’t enough. We also require the internal discourse structure to be represented explicitly, to guide repairs to the structure of the text. Therefore, TSL contains several kinds of additional annotations, including topic ordering information, coreference links, and rhetorical relations between sentences. In addition to these current kinds of annotations, WebbeDoc’s TSL will contain information on formatting and document presentation that would be marked up for inclusion according to specific user preferences.[8]

[8] Indeed, we anticipate that there will be a distinct “repair” module for document formatting in the sentence planner used with WebbeDoc.

The model of a master document

A master document is constructed according to a formal model; the model that we describe here is the most general, intended for the overall HealthDoc system, which does selection and repair of a master document. (The current version of WebbeDoc, which does generation by selection only, with no repairs involved, uses a more constrained model of a master document.)

We define the general model of a master document (MD) as follows:

An MD has a coherent high-level communicative goal, such as to inform, to command, to persuade, to impress. For example, the purpose of the current WebbeDoc MD is to inform (and impress) the reader about the goals and technical achievements of the HealthDoc project.

An MD has a coherent topic structure, with a division into topics, sub-topics, and so on. The smallest topic unit of an MD at the moment is a sub-sub-topic; however, we believe the form of the “smallest topic unit” will vary with the particular document.

Each sub-topic corresponds to a section of the document that satisfies a more specific communicative goal, such as to justify or elaborate upon. For example, the first sub-topic of the sample WebbeDoc text given above elaborates on the selection specification facility in the authoring tool; the second sub-topic justifies the kind of specialized linguistic knowledge needed by the authoring tool. Essentially, a sub-topic is a semantically coherent piece of the document.

Each sub-topic is a collection of version sets that are connected by ordering relations, rhetorical relations, coreference links, and formatting relations.

A version set is a set of textual variations such that each variation fulfills the same communicative goal, but has a semantic content and pragmatic form tailored to a particular audience. Each variation in a version set is characterized by a logical condition and a semantically coherent piece of text. The logical condition uses terms that range over sets of mutually exclusive features. We interpret “mutual exclusion” to mean that the conditions assigned to the variations in a version set define a clean partition of the set, so that exactly one of the variations must be chosen.

In the example given earlier, the first sub-topic is a singleton version set, sentence (1), while the second version set is made up of the eight sentences shown in table 1, and the third set also contains eight different sentence variations.

Ordering relations may exist between the version sets that make up a sub-topic. These relations indicate the preferred order of the sequence of variations that have been selected to form the working document, and thereby specify the ordering of sub-topics prior to the invocation of the sentence planner.

Preferred order can vary by reader. For example, the author of the WebbeDoc MD might decide that for computational linguists, the sub-topic about the authoring tool’s linguistic intelligence should precede the sub-topic on the selection criteria, but for laypersons, the reverse order would be preferable.

Rhetorical relations may exist between the version sets that make up a sub-topic. The rhetorical relations that we are currently using are taken from Rhetorical Structure Theory (RST) (Mann and Thompson 1988). In the current version of WebbeDoc, the same rhetorical relation must exist between any two members of adjacent version sets.

In the example we have been using, the rhetorical relations are as follows:

Any choice from the second version set (shown in table 1) elaborates upon sentence (1) (the first version set).

Any choice from the third version set justifies any earlier choice from the second version set.
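The version-set part of the model described so far can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the HealthDoc or WebbeDoc implementation: plain dictionaries stand in for TSL annotations, the names `Variation`, `is_partition`, `select`, and `build_working_document` are hypothetical, only two feature sets are modelled, and the text of the third version set is a placeholder. The sketch checks that the conditions in a version set partition the space of readers, selects exactly one variation per set, and applies a reader-dependent ordering.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical feature sets; the values within each set are mutually exclusive.
FEATURES = {
    "reader-role": ("linguist", "layperson"),
    "formality": ("formal", "informal"),
}


@dataclass(frozen=True)
class Variation:
    condition: dict  # feature name -> required value; an empty dict matches every reader
    text: str

    def matches(self, user: dict) -> bool:
        return all(user.get(name) == value for name, value in self.condition.items())


def is_partition(version_set, features=FEATURES) -> bool:
    """True if every possible user model matches exactly one variation."""
    names = list(features)
    for values in product(*(features[name] for name in names)):
        user = dict(zip(names, values))
        if sum(v.matches(user) for v in version_set) != 1:
            return False
    return True


def select(version_set, user: dict) -> Variation:
    """Pick the single variation whose condition holds for this user."""
    matching = [v for v in version_set if v.matches(user)]
    if len(matching) != 1:
        raise ValueError(f"expected exactly one match, found {len(matching)}")
    return matching[0]


def build_working_document(version_sets, order_for_role, user: dict) -> list:
    """Select one variation per version set, in the order preferred for this reader."""
    order = order_for_role[user["reader-role"]]
    return [select(version_sets[name], user).text for name in order]


# The four non-technical variations from the table of sample sentences.
SELECTION = [
    Variation({"reader-role": "linguist", "formality": "formal"},
              "Also, the authoring tool will help the writer to tag each fragment "
              "of text with a particular set of selection features."),
    Variation({"reader-role": "linguist", "formality": "informal"},
              "Also, the authoring tool helps you tag each fragment of text that "
              "you write with the selection features that you choose."),
    Variation({"reader-role": "layperson", "formality": "formal"},
              "Also, the writer's assistant will help the writer to tag each "
              "snippet of text with the particular patient features."),
    Variation({"reader-role": "layperson", "formality": "informal"},
              "Also, the writer's assistant helps you tag each snippet of text "
              "that you write with the patient features that you choose."),
]
# Placeholder standing in for the third version set (linguistic knowledge).
KNOWLEDGE = [Variation({}, "[text of the third version set]")]

VERSION_SETS = {"selection": SELECTION, "knowledge": KNOWLEDGE}
# Reader-dependent ordering, following the example in the text.
ORDER = {
    "linguist": ["knowledge", "selection"],
    "layperson": ["selection", "knowledge"],
}

reader = {"reader-role": "layperson", "formality": "informal"}
for sentence in build_working_document(VERSION_SETS, ORDER, reader):
    print(sentence)
```

Checking the partition property when the master document is authored, rather than when a reader requests a page, means that selection itself can never come up empty or ambiguous.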
Coreference links may be defined between any two version sets. In our example, the following terms, used in the first and second version sets, are coreferential: authoring tool, authoring facility, writer’s workbench, writer’s assistant, and it. (The first two terms are also near-synonyms.)

Formatting information may be defined at each topic and sub-topic level. Formatting information may also be defined between and within version sets, including illustrations, choice of colour, design of layout, and so on.

Authoring a master document

WebbeDoc master documents may be based on the natural-language text of pre-existing material, or they may be created from scratch (or some combination of the two). Either alternative requires the involvement of a human.

The author of a WebbeDoc master document would normally be a professional technical writer or Web-document designer, who will need to understand the nature of customized and customizable texts, but who should not be assumed to have any special knowledge or understanding of TSL or the innards of WebbeDoc. The authoring tool, therefore, should be no more difficult for the author to use than, say, the more-sophisticated features of a typical word processor. The text is therefore written in English, and will be translated to TSL by the authoring tool. (The English source text is retained in the TSL for use in subsequent authoring sessions, for example if the document is updated or amended.)

It is the writer’s job to decide upon the basic elements of the text, the formatting, ordering, rhetorical, and coreferential links between them, and the conditions under which each element should be included in the output. The elements of the text are then typed into the authoring tool in English, and are marked up by the writer with conditions for inclusion, links for cohesion and coreference, and annotations for ordering and formatting of the document layout. An example of the authoring tool’s main interface (depicting part of the sample WebbeDoc master document described earlier) is given in figure 2.

The tool then translates the text into TSL. This is essentially a process of semi-automated parsing, so that whenever an ambiguity cannot be resolved, the writer is queried in an easy-to-understand form. The design and development of the authoring tool and its user interface is part of the current phase of the overall HealthDoc project (fall 1996 to spring 1997). The user interface is being developed by Parsons (1997), while Banks (1997) is implementing the English-to-TSL conversion (for more details on the underlying model of conversion, see DiMarco and Banks (1997)).

Functions of sentence planning and automated post-editing

In general, selecting material from pre-existing text and then editing it to recover coherence and cohesion can involve a wide range of problems in various aspects of sentence planning. For example, both syntactic and semantic aggregation may be needed, as well as chunking of whole and partial propositions. Pronouns and other forms of reference need to be chosen. And, of course, aggregation and sentence restructuring will affect the rhetorical relations between the elements of the text.

Our current work is focusing on the development of two key modules of the sentence planner: for discourse structuring and for aggregation. It is unlikely that every ordering of the blocks of text that are organized into a master document will produce a coherent sequence of selected pieces of text. To ensure that any resulting document makes sense, the discourse module uses the rhetorical relations that hold among the textual units to produce a sequence that is most likely to be coherent. In later work, an additional module will be built to determine the linguistic phrasing of the discourse relation.

The aggregation module eliminates redundancy in TSL expressions by grouping together entities that are arguments of the same rhetorical relation, verbal process, etc. Each aggregation rule recognizes an exact match of some portions of two input TSL expressions and returns a single, fused, expression. The actions of the aggregation module will generally affect the resulting syntactic structure.

A critical problem is the distribution of repair tasks among the planning modules, as there are often strong interactions. The responsibilities of each module and the overlaps between them are an area of on-going research for our sentence-planning group.

Conclusion

The HealthDoc project and its WebbeDoc offspring aim to provide a comprehensive approach to the automated tailoring of both paper documents and Web-based materials. We incorporate explicit user modeling as a basis for the document tailoring, and we take into account user information ranging from simple demographic data to complex pragmatic preferences. We have developed a model of language generation, “generation by selection and repair”, that relies on a “master-document” representation that pre-determines the basic form and content of a text and yet is amenable to editing and revision for customization. The WebbeDoc project aims to provide useful techniques for natural language applications on the Web and to address a number of important issues for research in more-general systems for language generation.
Figure 2: The main interface of the authoring tool
Acknowledgements

The HealthDoc Project is supported by a grant from Technology Ontario, administered by the Information Technology Research Centre. Vic DiCiccio was instrumental in helping us to obtain the grant, and has been invaluable in subsequent administration. Substantial portions of the sections of this paper that described the HealthDoc project and authoring of master documents were written by Graeme Hirst; they are used here with his permission. Some material in the section on the functions of sentence planning was written by Eduard Hovy; it is used here with his permission. We are grateful to Graeme Hirst and Eduard Hovy for many helpful comments on this research and this paper. The other members of the HealthDoc Project have also contributed to the work described here, especially Daniel Marcu, Kim Parsons, and Phil Edmonds. Jonathan Dursi kindly provided help with some of the LaTeX development.

References

Banks, Steven (1997). Master’s thesis. Department of Computer Science, University of Waterloo, expected Spring 1997.

Bateman, John Arnold (1995). “KPML: The KOMET–Penman multilingual linguistic resource development environment.” Proceedings, 5th European Workshop in Natural Language Generation, Leiden, May 1995, 219–222.

Campbell, Marci Kramish; DeVellis, Brenda M.; Strecher, Victor J.; Ammerman, Alice S.; DeVellis, Robert F.; and Sandler, Robert S. (1994). “Improving dietary behavior: The effectiveness of tailored messages in primary care settings.” American Journal of Public Health, 84(5), May 1994, 783–787.

DiMarco, Chrysanne (1990). Computational stylistics for natural language translation. PhD thesis, Department of Computer Science, University of Toronto, 1990. Published as technical report CSRI-239.

DiMarco, Chrysanne and Banks, Steven (1997). “Using subsumption classification on a stylistic hierarchy as the basis of a multi-stage conversion of natural language text to sentence plans.” In preparation.

DiMarco, Chrysanne; Hirst, Graeme; and Stede, Manfred (1993). “The semantic and stylistic differentiation of synonyms and near-synonyms.” Proceedings, AAAI Spring Symposium on Building Lexicons for Machine Translation, Stanford, March 1993, 114–121.

DiMarco, Chrysanne and Hirst, Graeme (1993a). “A computational theory of goal-directed style in syntax.” Computational Linguistics, 19(3), September 1993, 451–499.

DiMarco, Chrysanne and Hirst, Graeme (1993b). “Usage notes as the basis for a representation of near-synonymy for lexical choice.” Proceedings, Ninth Annual Conference of the University of Waterloo Centre for the New Oxford English Dictionary and Text Research, Oxford, September 1993, 33–43.

DiMarco, Chrysanne; Hirst, Graeme; Wanner, Leo; and Wilkinson, John (1995). “HealthDoc: Customizing patient information and health education by medical condition and personal characteristics.” Workshop on Artificial Intelligence in Patient Education, Glasgow, August 1995.

Green, Stephen (1992). “A functional theory of style for natural language generation.” Master’s thesis, Department of Computer Science, University of Waterloo, 1993.

Green, Stephen J. and DiMarco, Chrysanne (1996). “Stylistic decision-making in natural language generation.” In Trends in natural language generation: An artificial intelligence perspective. Giovanni Adorni and Michael Zock (eds.). Springer-Verlag Lecture Notes in Artificial Intelligence (a subseries of Lecture Notes in Computer Science) number 1036, 1996.

Hirst, Graeme (1995). “Near-synonymy and the structure of lexical knowledge.” Working notes, AAAI Symposium on Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity, Stanford University, March 1995, 51–56.

Hovy, Eduard and Wanner, Leo (1996). “Managing sentence planning requirements.” Proceedings, ECAI-96 Workshop on Gaps and Bridges: New Directions in Planning and Natural Language Generation, Budapest, August 1996.

Hoyt, Pat (1993). A goal-directed functionally-based stylistic analyzer. Master’s thesis, Department of Computer Science, University of Waterloo, 1993.

Hoyt, Pat and DiMarco, Chrysanne (1994). “A goal-directed multi-level stylistic analyzer.” Proceedings, 10th Canadian Conference on Artificial Intelligence, Banff, May 1994, 23–30.

Mann, William C. and Thompson, Sandra A. (1988). “Rhetorical Structure Theory: Toward a functional theory of text organization.” Text, 8(3), 1988, 243–281.

Parsons, Kimberley J. (1997). Master’s thesis, Department of Computer Science, University of Waterloo, expected Spring 1997.

Penman Natural Language Group (1989). “The Penman primer”, “The Penman user guide”, and “The Penman reference manual.” Information Sciences Institute, University of Southern California.

Skinner, Celette Sugg; Strecher, Victor J.; and Hospers, Harm (1994). “Physicians’ recommendations for mammography: Do tailored messages make a difference?” American Journal of Public Health, 84(1), January 1994, 43–49.

Strecher, Victor J.; Kreuter, Matthew; Den Boer, Dirk-Jan; Kobrin, Sarah; Hospers, Harm J.; and Skinner, Celette S. (1994). “The effects of computer-tailored smoking cessation messages in family practice settings.” The Journal of Family Practice, 39(3), September 1994, 262–270.

Wanner, Leo and Hovy, Eduard (1996). “The HealthDoc sentence planner.” Proceedings of the Eighth International Workshop on Natural Language Generation, Brighton, UK, June 1996.