JOHN A. BATEMAN
This entry accordingly characterizes the distinct kinds of multimodal corpora currently
envisaged and describes to what extent the prerequisites of selection, prepreparation, and
quantity are met. Different kinds of data raise different kinds of problems. This is important
both for designing tools and for considering combinations of data types, which is always a
central concern when considering multimodal data. Explorations in developing multimodal
corpus work often fail to leverage existing techniques appropriately. In the future,
this will need to be addressed more effectively, since the natural complexity of multimodal
data can easily overwhelm individual efforts.
The broadest distinction is between linear and nonlinear data. This is an essential distinction,
since many discussions of multimodal corpora are situated entirely within one area
(typically the first) rather than the other. Such discussions, and the tools they describe,
accordingly only consider linear data and provide no support for, or even mention
of, nonlinear data.
Linear data include material which is essentially organized to unfold along a single
dimension of actualization. This dimension may either be in space, as in traditional
written-text data, or in time, as in recordings of spoken language. In both cases the single
dimension of organization allows relatively straightforward access to any part of the data:
In the case of written data, we have explicit ordering information about the words and
characters represented (for example, word number 13,219 of War and Peace), while in the
case of spoken language data, we have the time at which some phonetic event occurs. For
this reason, speech corpora generally work with time-stamped data. This continues to apply
when that further information is itself complex, as in video data; even here, the basic
temporal organization remains of considerable benefit when designing corpora and corpus
tools.
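The practical benefit of time-stamping can be sketched in a few lines of code: given annotations stored as (start, end, label) intervals sorted by start time, any moment in a recording can be looked up directly. The tier data and function names here are purely illustrative and are not taken from any particular corpus tool.

```python
from bisect import bisect_right

# Illustrative time-stamped annotations: (start_sec, end_sec, label),
# sorted by start time, as a speech corpus tier might store them.
tier = [(0.0, 0.4, "but"), (0.4, 1.1, "the"), (1.1, 1.9, "weather")]

def annotations_at(tier, t):
    """Return all annotations whose interval covers time t."""
    # Only intervals starting at or before t can possibly cover it.
    i = bisect_right([start for start, _, _ in tier], t)
    return [a for a in tier[:i] if a[1] > t]

print(annotations_at(tier, 0.5))  # -> [(0.4, 1.1, 'the')]
```

Any number of further tiers (gesture, gaze, intonation) can be anchored against the same time line and queried in the same way, which is exactly what makes temporal organization so convenient for corpus tools.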
Nonlinear data include material where there is no single organizing dimension that can
be used for providing access. This generally involves spatially distributed information, such
as that found on pages of documents, printed advertisements, paintings, and so on. As
we shall see, corpora for this area are in a far less developed state. Tools built for linear
data are generally unusable for nonlinear data, although some developments are being
attempted.
Further dimensions that we will combine with the above are usually seen in terms of
sensory channels. For example, the data may be essentially graphically visual, as in video
data or paintings, or acoustic, as in speech corpora, or linguistically visual, as in typography
and text layout. The sensory channel view alone is insufficient for effective corpus design,
however, since each sensory channel can in fact carry very different kinds of information
that need to be distinguished when prepreparing data for inclusion in corpora. Moreover,
a strong interdependence exists between the kinds of corpora that are attempted and the
technological support available for storing and manipulating the required data. There is
also the question of just what kinds of data are considered as sensible targets for investi-
gation at all. Both areas have changed considerably in recent years.
Linear Data
In this section we briefly run through some of the types of data and corresponding corpora
that can be considered to be organized as a one-dimensional unfolding of information.
We will see that this information can itself come to include substantial multimodal
contributions that are no longer one-dimensional; nevertheless, their embedding within a
single dimension of unfolding provides a critical foundation for applying corpus methods
effectively.
multimodal corpus-based approaches 3
above, textual corpora are accordingly annotated with additional information, such as
part-of-speech information, grammatical categories, semantic labeling, and so forth, that
increases their value for linguistic research. When other kinds of data are considered (for
example, searching speech data for acoustic events), the problem grows considerably.
There are many open questions concerning the kinds of annotations and processing tech-
niques that will make multimodal corpora most useful. In short, the further we move away
from corpora consisting of plain written text, the more difficulties and open questions we
are confronted with.
Most corpora nowadays draw on a particular technology for storing and organizing
their data in a form that makes them both accessible and capable of being arbitrarily
enriched with further annotations: This technology is that provided by Extensible Markup
Language (XML). XML provides a straightforward, machine-readable format that allows
information to be annotated, or marked up, as desired; it is an outgrowth of a long history
of research on document markup and is the current recommendation for structured data
of the World Wide Web Consortium (W3C; http://www.w3.org/TR/xml).
An example of a simple XML markup for part of speech is the following, slightly amended
from the British National Corpus (www.natcorp.ox.ac.uk):
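A tag of this kind, reconstructed here from the description that follows (the attribute names c5, hw, pos, and id are assumptions based on common BNC practice), looks roughly like:

```xml
<w id="w345" c5="CJC" hw="but" pos="CONJ">but</w>
```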
Here we can see that an XML expression consists of some data enclosed by a labeled
opening tag and a matching labeled closing tag: in this case <w . . .>, indicating a word,
and </w>, respectively. The raw data are what stands between these opening and closing tags:
in this case the characters "but". Central to the utility of the framework is the fact that
each opening tag can contain predefined attributes and values, giving further information
about the tagged item. In the present case, the opening tag specifies that the tagged material
belongs to the particular word class "CJC", has the headword "but" and the part-of-speech
information "CONJ", and has been given the (unique within the current document)
identifier "w345". In addition, tags can be nested hierarchically, making it possible to
construct arbitrarily complex (but well-formed) structures.
This is an example of in-line markup, where the information about the data is contained
in the same file as the data themselves: the occurrence of the word "but" in some
text, in the present case. In contrast to this, standoff annotation as introduced above works
by putting the information about the data in a separate file and referring to the original
data with a cross-reference to some identifying feature of the data. That identifying feature
can either be something like the explicit identifier given in the example with the "id"
attribute, or draw instead on some logical property of the file, such as the nth character in
the file. Such standoff annotation has two advantages. First, structures can be defined that
do not need to nest properly; this is useful for many kinds of linguistic information
because levels of abstraction such as syntactic structure, intonational phrasing, and
typography are orthogonal and do not necessarily nest one within the other. Second,
arbitrary kinds of information can be added in a modular and still processable fashion. This becomes
increasingly important as we move to corpora that cover multimodal information.
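The contrast can be sketched with Python's standard-library ElementTree; the anno element, its target attribute, and the fragments themselves are invented for illustration and belong to no established scheme.

```python
import xml.etree.ElementTree as ET

# In-line markup: the annotation wraps the data in the same file.
inline = ET.fromstring('<w id="w345" pos="CONJ">but</w>')

# Standoff markup: a separate annotation refers back to the data,
# here via the target id; character offsets would work equally well.
text = ET.fromstring('<text><w id="w345">but</w></text>')
standoff = ET.fromstring('<anno target="w345" pos="CONJ"/>')

# Resolving the standoff annotation back to its data:
target = standoff.get("target")
word = next(w for w in text.iter("w") if w.get("id") == target)
print(word.text, standoff.get("pos"))  # -> but CONJ
```

Because the standoff layer lives in its own file, further layers (intonation, typography, gesture) can be added without ever touching the base data or each other.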
Finally, the real benefit of applying the XML standard to corpus annotation is that it
provides access to powerful search and data manipulation tools, often developed quite
independently of linguistic concerns. Since XML is used as the format of choice for almost
all online data these days, the need to manipulate such data is widespread. As a conse-
quence, search and processing mechanisms that can work with XML are increasingly
provided as standard functionality even for normal Web browsers. This trend will certainly
continue and provides a strong foundation for increasingly powerful corpus manipulation
tools capable of working with any combination of data appropriately organized according
to defined XML schemes. This is also an important prerequisite for moving corpus work
out of proprietary formats and tools and into more open, and hence freely available and
adaptable, resources of use to the community.
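As a sketch of what such standard tooling already offers, Python's standard-library ElementTree supports a subset of XPath for queries of exactly this kind; the tag and attribute names below are invented for the example.

```python
import xml.etree.ElementTree as ET

# A toy fragment of an annotated corpus; the sentence and word
# tags are illustrative, not drawn from any real corpus scheme.
doc = ET.fromstring("""
<s>
  <w pos="CONJ">but</w>
  <w pos="ART">the</w>
  <w pos="N">weather</w>
</s>""")

# The limited XPath support in the standard library already
# allows simple corpus queries: all conjunctions in the sentence.
conjs = doc.findall('.//w[@pos="CONJ"]')
print([w.text for w in conjs])  # -> ['but']
```

Dedicated XML query languages such as XPath and XQuery extend this to corpus-scale searches, entirely independently of any linguistics-specific software.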
Transcription
Approaches to corpus preparation that are designed by linguists generally draw on the
central linguistic operation of transcription. Transcription is the process of enriching basic
language data so that they already include the qualitative classifications considered
appropriate for further theory building. The simplest kind of transcription for spoken data might
involve writing peaks of intonational prominence in capital letters; nowadays this would
be replaced by a complete annotation scheme reflecting some model of intonation. The
relation between transcription and theory building is an intimate one; fundamental dis-
cussion is offered by Ochs (1979).
Media Analysis
Given that the starting point for many of the natural interaction data described above is
provided by video recordings of situations, there is a further natural extension, both of
techniques and of interest, to include not only natural interactions in the wild but also
audiovisual media representations. Here there is an overlap with research performed in media
and communication studies, which also have a long tradition of investigating video data
for socially significant variations and patterns (see Ludes & Herzog, 2004). From the
linguistic perspective, research is pursued that focuses either on the interactions of
individuals presented in the media, as in political interviews or news reporting, or on the
communicative possibilities of the medium itself. The latter is a further extension of the
domain of multimodality to take in the technical resources employed for constructing
audiovisual representations in general, involving such features as camera angles, shots
and transitions between shots, camera distances, framing, color balance, perspective, and
much more. Early approaches to this from a multimodal corpus perspective can be seen
in Thibault (2000), Baldry and Thibault (2005), and Tan (2009). For more general media
representations there is also the international standard MPEG-7 (ISO/IEC 15938: Multimedia
content description interface) for media content: this shows considerable overlaps in aims
but has not so far been considered together with the needs of multimodal corpus design.
For a thorough consideration of film as a multimodal artifact at the document level, however,
see Bateman and Schmidt (2011).
All of the approaches discussed so far depend upon the linearity property of the semiotic
modes involved. In particular, the time-based modes make it possible to combine as many
layers or annotation tracks as required by anchoring them against time stamps; some current
attempts at standardization even require the existence of time stamps as a precondition
for interoperability and translatability of annotations. But semiotic modes relying on the
visual channel are not organized around time; they are organized around space and spatial
relationships (see Kress, 2003).
This presents substantial problems for devising usable markup schemes. A simple geo-
metric representation rarely provides an appropriate level of abstraction for much of what
is happening visually, and the range and scope of the semiotic modes carried by the visual
channel are still only poorly understood. There are extensive categories of the kinds of
information representations that are employed (for example, 2D graphs, animations, films,
written text, photographs, drawings, diagrams, flowcharts, maps, and many more), but
still little theoretically sound organization for their combination. Written text may play
decisive roles for the interpretability of graphics, diagrams may be animated, maps may
be annotated with diagrammatic representations (e.g., contour lines), and so on. Moreover,
there are additional modes of semiotic organization that are essentially higher-order in
that their purpose is precisely to combine information offerings from other modes. This,
for example, is the function of layout as described in detail in Bateman (2008). Whenever
language occurs together with such kinds of information, the interrelation of linguistic and
nonlinguistic presentations raises important theoretical considerations that can only benefit
from corpus-based approaches.
In some domains there are already proposals of a nonlinguistic nature for possible forms
of markup and corpus design; for example, the automatic document recognition community
has large collections of annotated layouts that are used as ground truths for testing recognition
algorithms (Antonacopoulos, Karatzas, & Bridson, 2006). Moreover, geographic information
science has richly layered representations of spatial data underlying maps; art history has
classifications for the content and organization of pictures; and information designers have
characterizations for many kinds of diagrams and graphics. When moving into the collec-
tion of corpus data involving such artifacts, it will therefore be advisable to be aware of
treatments of this kind before defining transcription schemes motivated from within single
disciplinary perspectives. Although at present most linguistically informed approaches to
such artifacts have not moved beyond proposals for transcription, first moves toward full
corpus annotation schemes are described in the GeM framework for static multimodal
nonlinear documents in Bateman (2008, 2009).
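Spatial anchoring of the kind such schemes require can be sketched as follows; the structure is only loosely inspired by area-based approaches such as GeM, and all names and fields here are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class LayoutUnit:
    """A rectangular page region carrying some annotated content."""
    ident: str
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates
    kind: str    # e.g. "heading", "caption", "photograph"

def units_at(units, x, y):
    """Spatial access: all units whose region covers point (x, y)."""
    return [u for u in units
            if u.bbox[0] <= x <= u.bbox[2] and u.bbox[1] <= y <= u.bbox[3]]

page = [
    LayoutUnit("u1", (0, 0, 200, 30), "heading"),
    LayoutUnit("u2", (0, 40, 120, 200), "photograph"),
    LayoutUnit("u3", (0, 190, 120, 210), "caption"),
]
print([u.ident for u in units_at(page, 60, 195)])  # -> ['u2', 'u3']
```

Note that, unlike time-stamped tiers, such regions may overlap and nest freely, which is precisely why tools built around a single temporal dimension cannot simply be reused for page-based data.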
It is already evident that there will be an increasing need for combinations of all the develop-
ments described so far. Many current projects are concerned with the analysis of artifacts
such as Web sites, and these can be characterized most effectively as dynamic layouted
composite documents. Since Web sites include all of the resources of nonlinear, that is,
spatially organized, layout plus the ability to include video and other dynamically displayed
data, all of the kinds of annotations/transcriptions mentioned so far may find application
there.
The move to include such artifacts within corpus-based research will again demand that
tools are able to bring together both linear and nonlinear data types. Moreover, within
each data type there will be many distinct kinds of annotation required. In the long run,
it may be necessary to have distinct corpus levels, or tiers, for each kind of semiotic mode
operative within the data to be analyzed. This then goes considerably beyond distinguishing
sensory channels, as argued in detail in Bateman (2010). Some tools, such as ANVIL (Kipp,
in press), now allow spatial annotations within video data, but this remains exploratory
and subordinated to the temporal organization. Further development is required.
Conclusions
Following on the general acceptance within linguistics of the value of performing linguistic
research with respect to increasingly large collections of naturally occurring texts, that is,
corpora, it has been natural to ask whether these methods and techniques can also be
applied to data that include the other modes of communication accompanying linguistic
acts. This combination of concerns has led to the design, construction, and use of multimodal
corpora for informing research that seeks to place language use in richer, multimodally
constituted contexts. More abstractly, this represents a gradual change in the boundaries
drawn between the linguistic and the paralinguistic: Ever more phenomena that would
previously have been termed paralinguistic, in the sense of accompanying but only weakly
influencing linguistic form and expression, are now being moved into the center of concern
and so demand methods and approaches by which they can be addressed empirically.
There are also further combinations of modes that have so far received very little study
but which are now beginning to be approached using corpus-based methods; for example,
the combination/interaction of language and music, as occurs in narrative film, or the
combination of eye-tracking behavior and language. As corpus-based empirical approaches
move to include such modes of communication, it is crucial that insights from traditions
other than the narrowly linguistic are given due consideration.
Corpus-based approaches therefore need to enter into dialogue with such work and not
assume that straightforward extensions of the linguistic, or simple outsiders' views of
the material, will be sufficient. The advances to be made here are nevertheless considerable,
not only for our understanding of how language works within such multimodal contexts
but also for how other communicative modes function. Linguistic, corpus-based methods
promise much for placing studies of a broad variety of modalities on a firmer empirical
footing, but this can only be done in cooperation and dialogue with those disciplines
where those modalities have already been the principal focus of attention.
SEE ALSO: Analyzing Spoken Corpora; Corpora: Multimodal; Corpora: Specialized; Corpus
Linguistics: Overview; Corpus Linguistics: Quantitative Methods; Multimodal Text Analysis;
Speech Analysis Software; Transcribing Multimodal Interaction
References
Allwood, J., Kopp, S., Grammer, K., Oberzaucher, E. A., & Koppensteiner, M. (2007). The
analysis of embodied communicative feedback in multimodal corpora: A prerequisite for
behavior simulation. Journal on Language Resources and Evaluation, 41(3–4), 255–72.
Antonacopoulos, A., Karatzas, D., & Bridson, D. (2006). Ground truth for layout analysis
performance evaluation. In H. Bunke & A. L. Spitz (Eds.), Proceedings of Document Analysis
Systems (DAS 2006) (Lecture Notes in Computer Science, 3872, pp. 302–11). Berlin, Germany:
Springer.
Baldry, A., & Thibault, P. J. (2005). Multimodal corpus linguistics. In G. Thompson & S. Hunston
(Eds.), System and corpus: Exploring connections (pp. 164–83). London, England: Equinox.
Bateman, J. A. (2008). Multimodality and genre: A foundation for the systematic analysis of multimodal
documents. London, England: Palgrave Macmillan.
Bateman, J. A. (2009). Discourse across semiotic modes. In J. Renkema (Ed.), Discourse, of course:
An overview of research in discourse studies (pp. 55–66). Amsterdam, Netherlands: John
Benjamins.
Bateman, J. A. (2010). The decomposability of semiotic modes. In K. L. OHalloran & B. A. Smith
(Eds.), Multimodal studies: Multiple approaches and domains (Routledge studies in multimodality,
pp. 17–38). London, England: Routledge.
Bateman, J. A., & Schmidt, K-H. (2011). Multimodal film analysis: How films mean. Routledge studies
in multimodality. London, England: Routledge.
Kipp, M. (in press). Multimodal annotation, querying and analysis in ANVIL. In M. Maybury
(Ed.), Multimedia information extraction.
Kipp, M., Neff, M., & Albrecht, I. (2007). An annotation scheme for conversational gestures:
How to economically capture timing and form. Journal on Language Resources and Evaluation,
41(3–4), 325–39.
Kranstedt, A., Kopp, S., & Wachsmuth, I. (2002). MURML: A multimodal utterance representa-
tion markup language for conversational agents (Technical Report 2002/05, SFB 360 Situated
Artificial Communicators, Universität Bielefeld). Retrieved October 17, 2011, from http://
www.sfb360.uni-bielefeld.de/reports/2002/2002-5.html
Kress, G. (2003). Literacy in the new media age. London, England: Routledge.
Ludes, P., & Herzog, O. (Eds.). (2004). The world language of key visuals: Computer sciences, human-
ities, social sciences, Vol. 1: Visual hegemonies: An outline. Münster, Germany: LIT Verlag.
Norris, S. (2002). The implication of visual research for discourse analysis: Transcription beyond
language. Visual Communication, 1(1), 97–121.
Norris, S. (2004). Analyzing multimodal interaction: A methodological framework. London, England:
Routledge.
Ochs, E. (1979). Transcription as theory. In E. Ochs & B. B. Schieffelin (Eds.), Developmental
pragmatics (pp. 43–72). New York, NY: Academic Press.
Schmidt, T., Duncan, S., Ehmer, O., Hoyt, J., Kipp, M., Loehr, D., . . . & Sloetjes, H. (2009). An
exchange format for multimodal annotations. In M. Kipp, J.-C. Martin, P. Paggio, & D. Heylen
(Eds.), Multimodal corpora (pp. 207–21). Berlin, Germany: Springer.
Tan, S. (2009). A systemic functional framework for the analysis of corporate television adver-
tisements. In E. Ventola & A. J. M. Guijarro (Eds.), The world told and the world shown:
Multisemiotic issues (pp. 157–82). Basingstoke, England: Palgrave Macmillan.
Thibault, P. J. (2000). The multimodal transcription of a television advertisement: Theory and
practice. In A. P. Baldry (Ed.), Multimodality and multimediality in the distance learning age
(pp. 311–85). Campobasso, Italy: Palladino Editore.
Thompson, H. S., & McKelvie, D. (1997). Hyperlink semantics for standoff markup of read-only
documents. In Proceedings of SGML Europe '97 (pp. 227–9).
Suggested Readings
Allwood, J. (2008). Multimodal corpora. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics: An
international handbook (pp. 207–25). Berlin, Germany: De Gruyter.
Kipp, M., Martin, J.-C., Paggio, P., & Heylen, D. (Eds.). (2009). Multimodal corpora. Berlin, Germany:
Springer.