
Emotion 2007, Vol. 7, No. 1, 158–171
Copyright 2007 by the American Psychological Association 1528-3542/07/$12.00 DOI: 10.1037/1528-3542.7.1.158

Multimodal Expression of Emotion: Affect Programs or Componential Appraisal Patterns?


Klaus R. Scherer
Swiss Centre for Affective Sciences, University of Geneva

Heiner Ellgring
University of Würzburg

In earlier work, the authors analyzed emotion portrayals by professional actors separately for facial expression, vocal expression, gestures, and body movements. In a secondary analysis of the combined data set for all these modalities, the authors now examine to what extent actors use prototypical multimodal configurations of expressive actions to portray different emotions, as predicted by basic emotion theories claiming that expressions are produced by fixed neuromotor affect programs. Although several coherent unimodal clusters are identified, the results show only 3 multimodal clusters: agitation, resignation, and joyful surprise, with only the latter being specific to a particular emotion. Finding variable expressions rather than prototypical patterns seems consistent with the notion that emotional expression is differentially driven by the results of sequential appraisal checks, as postulated by componential appraisal theories.

Keywords: multimodal emotion expression, facial and vocal expression, affect programs, appraisal theory, component process model

The fact that emotions are almost always accompanied by a motor expression component, communicating the sender's affective reaction through signs in face, voice, gestures, and bodily posture, is universally recognized by scholars in the affective sciences (Davidson, Scherer, & Goldsmith, 2003). The importance of emotional expression is also documented by numerous studies on emotion recognition, especially from facial and vocal cues, showing that observers' ability to correctly infer the underlying emotional state with better-than-chance accuracy is impressive (Juslin & Scherer, 2005; Keltner & Ekman, 2000; Keltner, Ekman, Gonzaga, & Beer, 2003; Russell, Bachorowski, & Fernández-Dols, 2003; Scherer, 1999; Scherer, Johnstone, & Klasmeyer, 2003). Despite this extensive literature, there is only scarce empirical evidence on the nature of the conjoint expressive patterns in the face, the voice, and the body that characterize specific emotions. The main reason is that although the literature abounds with recognition studies, there have been only relatively

Klaus R. Scherer, Swiss Centre for Affective Sciences, University of Geneva, Geneva, Switzerland; Heiner Ellgring, Institute for Psychology, University of Würzburg, Würzburg, Germany. This article, planned for a long time, was to be coauthored by Harald Wallbott, who, as part of the research team from the outset, made major contributions to the overall design, the recording and selection of the stimuli, and the coding of the gestures and bodily movements. His untimely death has made this impossible. We thank a large number of collaborators in the study for their precious contribution, in particular Rainer Banse. This work has been supported by a grant of the Deutsche Forschungsgemeinschaft, by facilities of the Max Planck Institute of Psychiatry in Munich, Germany, and by contributions of the Swiss Center of Affective Sciences (NCCR Affective Sciences). Correspondence concerning this article should be addressed to Klaus R. Scherer, Swiss Centre for Affective Sciences, University of Geneva, rue des Battoirs 7, CH-1205, Geneva, Switzerland. E-mail: klaus.scherer@pse.unige.ch

few production studies measuring the actual behavior patterns shown by an individual experiencing an emotional episode and analyzing the coherence of expressive actions across modalities (for exceptions, see Scherer & Wallbott, 1985, for multichannel analysis of discourse, or, more recently, Cohn et al., 2004, using multimodal data from image tracking of smiles). Among the reasons is the difficulty of recording naturally occurring emotional expressions due to ethical and practical constraints, especially given the need for high-quality recording allowing microanalysis of expressive behavior. In consequence, most of the studies of patterning in expression production have been performed with actors portraying different emotions. Although this is clearly less satisfactory than the study of real emotion expressions, it at least allows one to obtain some information on the production frequency of particular expressive features or feature combinations for specific emotions (see the detailed justification for the use of actor portrayals in Banse & Scherer, 1996; Scherer & Ellgring, 2007). In general, studies of this kind have also been modality specific; that is, they focused on the face (Ekman & Rosenberg, 2005; Keltner et al., 2003), the voice (Ellgring & Scherer, 1996; Juslin & Scherer, 2005; Scherer, 2003; Scherer et al., 2003), and, much more rarely, the body (Wallbott, 1998). The main reason is that microanalytic coding of expression is a complex and time-consuming method, the use of which is generally restricted to laboratories that have made a major investment in a particular technique, mostly modality specific. For this reason, there have not been many attempts to investigate the relationship of expressive features and feature combinations across several modalities.
The rare attempts at multimodal analysis in the field of nonverbal communication have been restricted to a few short utterances, aiming at an analysis of meaning structure rather than being guided by psychobiological emotion models (Pittenger, Hockett, & Danehy, 1960; Scheflen, 1973). Why should one expect multimodal configurations of emotional expressions and what are the theoretical predictions? Both everyday experience and research on nonverbal behavior (Feldman &


Rimé, 1991; Hinde, 1972; Knapp & Hall, 1997) have shown that much of spontaneously occurring emotional expression is multimodal. The main reason for this may be that the large majority of emotional episodes occur in social situations (Scherer & Wallbott, 1994; Scherer, Wranik, Sangsue, Tran, & Scherer, 2004) with ongoing interpersonal interactions generally involving verbal discourse. As soon as there is emotional expression in speech, it is naturally accompanied by appropriate facial expressions and gestures, and even professional speakers and actors find it hard to independently manipulate different channels of communication in emotional expression. Given the highly integrated multimodal production of emotional utterances (see Siegman & Feldstein, 1985), observers react strongly to channel discrepancies, and dissociation is often seen as a sign of pathology or deception (Ekman, 2001; Ekman, O'Sullivan, Friesen, & Scherer, 1991). Given the ubiquity of multimodal integration in verbal and nonverbal communication, it is surprising that most theories of emotion remain silent as to the mechanism that is assumed to underlie integrated multimodal emotion expression. In some sense, this is not surprising, as research on emotional expression has mostly been concentrated on the face. However, general theories of emotion should make predictions about the nature of the response patterning that characterizes different emotions, including multiple expression modalities. Below we outline two such theories making sufficiently precise predictions to allow testing concrete hypotheses: basic emotion models and componential appraisal models. Basic emotion models are based on Tomkins's (1962) interpretation of Darwin's (1872/1998) account of the evolutionary functions of emotions and their expression, represented by theoretical proposals made by Ekman (1972, 1992, 2003) and Izard (1971, 1992).
In this tradition, basic emotions are defined as affect programs, triggered by appropriate eliciting events, that produce emotion-specific response patterns (such as prototypical facial and vocal expressions as well as physiological reactions). Although theorists differ as to the number and nature of basic emotions, anger, joy, sadness, fear, and disgust are generally included. Presumably the neuromotor affect programs underlying these basic emotions (see Scherer & Ellgring, 2007, for a more detailed review of the current theoretical positions) should be highly integrated across response modalities (including autonomic physiological reactions). Although we do not know of any explicit theoretical statements on this issue by protagonists of such theories, we would assume (on the basis of the affect program notion) that they implicitly predict the existence of a limited number of emotion-specific multimodal clusters of expressive actions. Using anger as an example, basic emotion models should predict that the anger program triggered by an anger-producing event will generally produce a multimodal configuration consisting of knitted brows, clenched teeth, raised fist, body leaning forward, and a loud and strident voice. Componential appraisal models have been developed in an attempt to capture the complexity of emotion as a dynamic episode in the life of an organism that involves changes in all of its subsystems (e.g., cognition, motivation, physiological reactions, motor expressions, that is, the components of emotion) in the service of flexible adaptation to events of high relevance and potentially important consequences (adopting a functional approach in the Darwinian tradition; Ellsworth & Scherer, 2003; Scherer, 1984,

2001). The central emotion mechanism proposed by these theories is appraisal (based on the pioneering work of Arnold, 1960, and Lazarus, 1966, 1991), the continuous, recursive evaluation of events in terms of a number of criteria such as novelty, intrinsic pleasantness, goal conduciveness, and normative significance of the event, as well as the coping potential of the organism. In contrast to basic emotion models, componential appraisal theories predict that the appraisal outcomes for these different criteria directly drive response patterning in terms of physiological reactions, motor expression, and action preparation (Ellsworth & Scherer, 2003; Frijda, 1986; Roseman & Smith, 2001; Scherer, 1984, 1992, 1994; Smith, 1989; Smith & Scott, 1997). Thus, anger is expected to be the result of an event being appraised as an obstruction to reaching a goal or satisfying a need, produced by an unfair intentional act of another person, that could be removed by powerful action (with a correspondent response patterning consisting of aggressive action tendencies, involving sympathetic arousal, knitted brows, square mouth with teeth clenched, and loud, strident vocal utterances). Specifically, the component process model (Scherer, 1984, 2001) predicts that the results of stimulus evaluation checks produce direct efferent effects and thus produce emotional response patterns in a sequential cumulative fashion. Detailed predictions are made for the facial, vocal, and gestural responses expected as the result of particular appraisal results (see Scherer, 1987, 2001; Scherer & Ellgring, 2007). In this model, the multimodal organization of expressive action is seen as determined by particular appraisal configurations (which may be constituents of several emotions) rather than by affect programs that are specific to and unique for particular emotions. Furthermore, there is less of an expectation of finding prototypical patterns for specific emotions. 
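The sequential cumulative logic of the component process model can be illustrated with a toy sketch: each appraisal check contributes its own efferent effects, and the observed expression is the accumulation of these effects. The mapping below is purely illustrative (simplified placeholders in the spirit of the predictions in Scherer, 1987, 2001, not the published prediction tables):

```python
# Toy sketch of the component process model's cumulative patterning:
# each sequential appraisal check result adds its own efferent effects.
# The AU lists here are illustrative placeholders, NOT the exact
# predictions published by Scherer (1987, 2001).
APPRAISAL_EFFECTS = {
    ("novelty", "high"): {"AU1", "AU2", "AU5"},              # brow raise, lid raise
    ("goal_conduciveness", "obstructive"): {"AU4", "AU7"},   # frown, lid tighten
    ("coping_potential", "high_power"): {"AU23", "AU25"},    # lip tighten, lips part
}

def predicted_expression(appraisal_results):
    """Accumulate predicted action units over sequential check results."""
    aus = set()
    for check_result in appraisal_results:
        aus |= APPRAISAL_EFFECTS.get(check_result, set())
    return aus

# An anger-like appraisal sequence: novel, goal obstructive, high coping power.
anger_like = [("novelty", "high"),
              ("goal_conduciveness", "obstructive"),
              ("coping_potential", "high_power")]
print(sorted(predicted_expression(anger_like)))
```

The point of the sketch is that the same appraisal configuration (and hence the same partial expression pattern) may be a constituent of several emotions, so variable rather than prototypical patterns are expected.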
Although emotional expressions will be similar for a given emotion to the extent that appraisal results are similar, the component process model predicts that individual differences in appraisal across situations will produce a large set of highly variable expressions for a given emotion. In this article, we present one of the first extensive cross-modal analyses of emotional expression using objective microcoding of behavior to examine these divergent predictions. Specifically, we analyze data from a long-term research project that obtained an extensive set of emotion portrayals by professional stage actors, conducted several judgment studies to determine the ability of observers to recognize the portrayed emotions, and used detailed behavior coding and feature extraction to analyze the expression patterns in several modalities. Earlier articles from our research group have presented the data for individual modalities: Banse and Scherer (1996) for the voice, Wallbott (1998) for gestures and body movements, and Scherer and Ellgring (2007) for facial expression. These studies did not find strong evidence for emotion-specific affect programs in the unimodal expression domains, and we generally conclude that the evidence may seem to privilege a componential appraisal account. In other words, these results could be accounted for by lawful patterning of expressive actions organized by the functional aspects of adaptive information processing or action tendencies produced by specific appraisal results. One could argue that the lack of strong evidence for affect programs in individual expression modalities makes it a futile exercise to search for affect programs in multimodal patterns. However, we do not think that one can make this assumption on an a priori basis. Rather, one would assume that if there are indeed involuntary


emotion-specific affect programs, they should be most evident in multimodal configurations. This assumption is based on the established fact, described earlier, that it is much more difficult to manipulate expressive behavior in an integrated and authentic manner across modalities than within a single modality. Furthermore, although individual elements of expression in single modalities may serve specific functions in discourse (e.g., brow movements used for emphasis; Ekman, 1979), the emotion specificity of affect programs might best come to the fore in integrated multimodal expression configurations. Thus, in this article we examine whether we have found evidence for a large number of emotion-specific multimodal expression clusters (which could be seen as supporting an affect program notion) or, rather, for more local configurations for several emotions that might support the component process model assumption of adaptive responses to specific events as based on appraisal results. In what follows, we first briefly review the methods used to obtain the portrayals, to conduct judgment studies to validate the expressions and choose the most representative tokens, and to code the facial and bodily behavior and extract acoustic features of the vocalizations. The major variables for each modality and some representative results are presented as background. We then describe how the three separate data sets, containing measures of the expressive behavior in different modalities for the same set of actor portrayals, have been combined for configurational analysis. This investigation allows us to examine the question of whether actors use stable multimodal configurations of behaviors to express specific emotions. As established methods for the quantitative analysis of multimodal behavior configurations in emotional expression are largely missing, our approach was necessarily exploratory.
A particular difficulty was posed by the fact that the coding of the behavior in the different modalities is rather different: facial muscle action units for facial expression, types of hand and body movements in gesture analysis, and continuously changing acoustic parameters in vocal expression.

Method

Emotion Portrayals


Actors. Twelve professional stage actors (6 men and 6 women) were recruited in Munich, Germany.1,2 All actors were native speakers of German. They had all graduated from professional acting schools and were regularly employed in radio, TV, and stage work. They were paid for their participation.
Emotions studied. A representative number of different emotions, including members of the same emotion family (anger, fear, sadness, and happiness; see also Ekman, 1992, 2003) with similar emotion quality and different intensity levels, were used in this study. The following 14 emotions were selected: hot anger, cold anger, panic fear, anxiety, despair, sadness, elation, happiness, interest, boredom, shame, pride, disgust, and contempt. The distinctions between the members of the four emotion families can be illustrated with representative scenarios:
Hot anger: Having sublet my apartment, I find an incredible mess after my return, none of the agreements having been kept.
Cold anger: Sharing a flat with two colleagues, I have to again do a chore that was to be shared.

Panic fear: A bus driver losing control of the vehicle, which starts veering off the road.
Anxiety: Having to walk back through a forest late at night.
Despair: On a mountain hiking trail, a friend falls into a crevice and gives no sign of life.
Sadness: Having to give my pet dog away because we cannot keep pets in a new apartment.
Elation: After a long period of searching, I am offered the job I always dreamt of.
Happiness: Meeting an ex-friend whom I am still very fond of and who seems ready to resume the relationship.
Scenarios. To evaluate the effect of differences in antecedent situations or events, two different eliciting scenarios (see examples above) were developed for each of the 14 emotions. To ensure that these short scenarios represented typical antecedent situations for the elicitation of the respective emotion, they were selected from a corpus of situations collected in several large intercultural studies on emotional experience. In these studies, more than 3,800 respondents had described emotional experiences as well as the respective eliciting situations (Scherer, Wallbott, & Summerfield, 1986; Scherer & Wallbott, 1994). For the emotions that had been used in the intercultural questionnaire studies, situations or events that were frequently mentioned as elicitors were chosen as the basis for the development of scenarios. The respective situation descriptions were rewritten in such a way as to render the scenarios stylistically similar across emotions. For the remaining emotions, scenarios were developed on the basis of events reported in the literature.
Standard sentences. To avoid effects of differences in phonemic structure on the acoustic variables, standardized language material was used. The following two standard sentences from an earlier study (Scherer, Banse, Wallbott, & Goldbeck, 1991), composed of phonemes from several Indo-European languages, were used: "Hat sundig pron you venzy" and "Fee gott laish jonkill gosterr."
These meaningless utterances resemble normal speech. Listeners generally have the impression of listening to an unknown foreign language. The audio and video records of these standard utterances served as units of analysis in the data analyses reported in the Results section.
Design. The variables discussed previously were combined in a 14 (emotion) × 6 (actor) × 2 (sex of actor) × 2 (sentence) × 2 (scenario) factorial design. For each cell, two emotion portrayals were recorded, yielding a total of 1,344 voice samples.
Recording of emotion portrayals. Three to 7 days before the recording, the actors received a booklet containing the two eliciting scenarios and labels for each of the 14 emotions and the two standard sentences. They were asked to familiarize themselves with the standard sentences and the desired accentuation and intonation. The recording sessions took place at the Max Planck Institute for Psychiatry in Munich, Germany. At the beginning of the recording session, the actors received a script consisting of 56
1 Rather than refer the reader to detailed methods descriptions in the articles describing the analyses of the separate modalities, we reproduce the essential details here for the reader's convenience.
2 The following description is based on the Method section of an earlier report on these stimuli, in which the vocal expression was studied with the help of digital acoustic analyses (Banse & Scherer, 1996).


pages, one for each of the Emotion × Scenario × Sentence combinations. These pages contained the label of the intended emotion, the text of the scenario, and the standard sentence. The actors were told to imagine each scenario vividly and to start performing when they actually felt the intended emotion. They then acted out the portrayal twice. If they felt that a rendering was not optimal, they could repeat it. There were no time constraints; the whole session was recorded continuously. The portrayals were recorded on audio- and videotape. For audio recording, a high-quality microphone (Sennheiser) and a professional reel-to-reel tape recorder (Revox) were used. The distance and orientation of the actors to the microphone were held constant. The input level of the sound recording was optimized for each actor and kept constant over all emotions. For video recording, two cameras captured the face and the whole body of the actor, respectively. The two video images were mixed to produce a split-screen image and recorded on U-Matic (Sony) and VHS recorders (Panasonic) simultaneously.
Face validity of the portrayals. A necessary requirement for both the decoding and encoding parts of the study was the recording of portrayals of high quality with respect to authenticity and recognizability. The 1,344 emotion portrayals generated by the design described above clearly exceeded the feasible size of a corpus to be used in encoding and decoding studies. In consequence, a limited number of portrayals had to be selected. As in earlier studies, we used the criteria of authenticity and recognizability to effect the selection. We defined authenticity as the likelihood of encountering the respective expression in ordinary life. Recognizability was defined as a credible representation of the target emotion.
To ensure maximal representativeness of the portrayals chosen for the corpus given the factorial design, we decided to select two portrayals for each of the 112 cells in the remaining 14 (emotion) × 2 (sex of actor) × 2 (sentence) × 2 (scenario) design, using the weighted criteria of high authenticity and recognizability. This selection was performed in two steps. First, expert ratings (performed by 12 acting students) were used to screen the large number of portrayals with respect to perceived recognizability (identification of target emotion) and authenticity (probability of encountering the expression in ordinary life). These ratings were performed separately for the audio, the visual, and the combined audiovisual channel. For authenticity, a 6-point scale (1 = very good, 6 = very poor; based on the German school grades) was used; for recognizability, a 4-point scale (1 = clearly recognizable, 4 = not recognizable) was used.
Selection procedure. For each cell of the 14 (emotion) × 2 (sex of actor) × 2 (scenario) × 2 (sentence) factorial design, two items were chosen in such a way as to meet the following criteria (in hierarchical order): (a) a mean recognizability rating of 2 or better and an authenticity rating of 4 or better in the combined audiovisual presentation; (b) mean ratings of recognizability of 3.5 or better in the audio and visual conditions, respectively; (c) two different actors represented in each cell; and (d) mean recognizability ratings of 2 or better in all three judgment conditions. Thus, priority was given to the prototypicality or recognizability of the expressions. For the entire sample of 224 portrayals, only 4 were included that did not meet the first criterion: 3 portrayals for despair and 1 portrayal for boredom. For these 4 portrayals, the

mean score on recognizability in the audiovisual condition amounted to 2.75. Once 224 items had been selected, we decided to add some items that had high quality with regard to the criteria but were ranked next best on the basis of the expert rating. These items were candidates to replace preselected items for the acoustic analyses in case some of the latter showed low recognizability in the recognition study. By adding 4 such borderline items for each of the 14 emotions, a total of 280 portrayals were selected for the recognition study.
Recognition study. Twelve undergraduate psychology students at the University of Giessen, Giessen, Germany (9 women; mean age 22 years) viewed these 280 portrayals and checked 1 of 14 emotion labels on prepared answer sheets. From the total of 280 portrayals, 224 were selected in such a way that each cell of the factorial design contained the two most recognizable portrayals from two different actors. Again, the emphasis was on the prototypicality or recognizability of the chosen expressions. In case of conflict between these two criteria, preference was given to two actors instead of one if the mean recognition rate for that cell did not drop more than 15%.
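The corpus arithmetic running through the Method section can be checked with a few lines. This is only a sketch of the reported counts (factor names follow the text):

```python
# Full factorial recording design: 14 emotions x 6 actors per sex x 2 sexes
# x 2 standard sentences x 2 scenarios, with 2 recorded portrayals per cell.
emotions, actors_per_sex, sexes, sentences, scenarios, takes = 14, 6, 2, 2, 2, 2
total_recorded = emotions * actors_per_sex * sexes * sentences * scenarios * takes
print(total_recorded)  # total voice samples recorded

# Selection collapses the actor factor into sex of actor:
# 14 x 2 x 2 x 2 = 112 cells, with the 2 best portrayals kept per cell.
cells = emotions * sexes * sentences * scenarios
selected = cells * 2
print(cells, selected)  # cells in the reduced design, portrayals selected

# Adding 4 borderline replacement items per emotion for the recognition study:
judged = selected + 4 * emotions
print(judged)  # portrayals shown to the recognition-study judges
```

Running this reproduces the 1,344 recorded portrayals, the 112-cell reduced design with 224 selected portrayals, and the 280 items used in the recognition study.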

Coding of Facial Expression


Facial actions were coded using the Facial Action Coding System (FACS)3 (Ekman & Friesen, 1978). Facial expression was coded from videotapes by using a U-Matic system allowing variable-speed forward and backward playback. For each item (the standard sentence in a video take), the apex of the facial expression was determined by two independent observers (two female technical assistants trained in behavior observation). During a single video take, up to three different expressions could occur: shortly before, during, or immediately after the sentence was spoken. The video frame for the apex of each facial expression was determined on the basis of the SMPTE time code (hours:minutes:seconds:frames) previously inserted into the picture. The interrater variability in finding the apex was in the range of 0 to 5 video frames (at 1/25 s per frame, i.e., up to 200 ms). This procedure allowed FACS coders to immediately go to the relevant points on the tape for the detailed analysis of the facial behavior. In a second step, the facial expression with the maximum intensity was chosen, and the action units (AUs) at this apex time point were coded separately by trained FACS coders, determining facial activity in the upper and lower face. Reliability was checked repeatedly for sample items, resulting in reliabilities ranging from .76 to .83. Because the occurrence and not the duration of AUs was determined, reliability can be regarded as good to excellent (cf. Sayette, Cohn, Wertz, Perrott, & Parrott, 2001). Special AU codes were created to note whether the apex of facial activity occurred before, during, or after the utterance of the standard sentence. In case of the apex during the utterance, AUs
3 FACS is an anatomically based coding system, first suggested by the Swedish anatomist Hjortsjö (1970), in which elements of complex facial actions are coded according to the visible changes on the facial surface. The single elements are noted as AUs, singly or in combination (e.g., frequently found combinations are AU1 + 2, lifting the inner and outer eyebrows, or AU6 + 12, raising the cheeks and the mouth corners in a smile).


were coded only if the facial activity was more intense than was necessary for pronunciation. In more than three quarters of the cases, the apex of facial activity occurred during the utterance (see Table 6 in Scherer & Ellgring, 2007).
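The temporal tolerance of the apex localization described above is easy to verify: at the PAL video rate of 25 frames/s, one frame lasts 40 ms, so a 0-to-5-frame disagreement corresponds to at most 200 ms. A minimal check:

```python
# Convert the reported interrater apex disagreement (0 to 5 video frames)
# into milliseconds, assuming the PAL frame rate of 25 frames per second.
frame_rate = 25                      # frames per second (PAL video)
frame_ms = 1000 / frame_rate         # duration of one frame in ms
max_disagreement_frames = 5
max_disagreement_ms = max_disagreement_frames * frame_ms
print(frame_ms, max_disagreement_ms)
```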

Coding of Bodily and Gestural Movements


Combining both functional and anatomical approaches to categorizing gestures and body movement (Scherer & Wallbott, 1985), an eclectic approach to the coding of body movement and posture was developed (see detailed description in Wallbott, 1998). Trained observers recorded all bodily activity visible in the 224 portrayals in a free format. From this analysis, a preliminary category system was devised and used by two trained, independent coders who coded all 224 takes with the system. After a preliminary reliability check, the category system, as shown in Table 2 in Wallbott (1998), was used. Movements and postures, as well as movement quality judgments (activation, spatial extensiveness, and dynamic acceleration), were coded for each take, but only during the utterance of the respective sentence (which had a mean duration of about 2 to 3 s). Multiple coding was allowed where appropriate. Coders were unaware as to the respective emotions encoded in the different takes. During the coding, the part of the video screen showing the close-up of the actors face was masked. As shown in Table 2 of Wallbott (1998), a high percentage of agreement across independent coders was achieved, ranging between .73 and .99.

geneous subgroups as determined by post hoc testing). The original Wallbott (1998) paper provides a detailed interpretation of these data (see summary in Table 5 of that paper), including the findings of the rated quality of movement dimensions (movement activity, expansiveness/spatial extension, movement dynamics/energy/power). The latter are not used in this article, which focuses on objective behavior measures. Table 3 provides the data obtained in the analysis of the vocal part of the actors' emotion portrayals in the form of profiles of acoustic parameters, objectively measured by digital parameter extraction from the speech signal. The table is adapted from Table 6 in Banse and Scherer (1996). In the interest of space, the data for the spectral parameters extracted from speech segments are not reproduced, as the essential differences contributing to the differentiation of the 14 emotions are contained in the proportion-of-energy summary parameters. Also, the original table reported, in addition to the means, the standard deviations for the parameters, which are omitted here. As the original table did not show ANOVA tests and post hoc comparisons of differences between emotions, the data were reanalyzed and the respective information included in Table 3 to obtain compatibility with the data presentation for the face and the body.4

Selection of Parameters for Multimodal Analysis


For the purposes of the multimodal analysis, a selection of variables used for the analyses of the individual modalities had to be performed to avoid a bias toward particular modalities. Such a bias could be produced by (a) including a substantially larger number of variables in one modality compared with others, (b) including variables for behaviors that are very rarely used (and that are thus unlikely to be combined with others), and (c) including variables that are used too frequently and too indiscriminately (leading to an overestimation of meaningful co-occurrences). For discrete behavior categories (facial expression and gestures), we chose those categories that satisfied the following criteria: (a) occurring sufficiently frequently (in at least 5% of all cases) and (b) differentiating between the different emotions (excluding categories occurring in more than 70% of the portrayals). Using these criteria, the following variables were used in the analyses: facial AUs 1, 2, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 17, 20, 22, 23, 25, 26, 27, and 41 and gesture categories: anatomical codes upper body collapsed, shoulders moving up, head shake,5 arms stretched out in frontal direction, arms stretched out sideways, hands open
4 It should be noted that for the face and the bodily movements, absence–presence (0/1) codes were used. In consequence, the post hoc comparisons are only expressed with respect to significant differences between homogeneous subsets and nonoverlapping members marked by subscript letters for the higher value ranges of the mean frequencies (presence). In the case of the vocal parameters, there is a continuous variable. In consequence, the nonoverlapping members of the homogeneous subsets are marked by a and b for the high range and c for the low range.
5 The head shake category was measured by Wallbott (1998) but not reported in his Table 3. It is included here on the basis of a reanalysis of the data.
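The frequency-based inclusion rule used to select discrete behavior categories (keep a category only if it occurs in at least 5% but not more than 70% of portrayals) can be sketched as a simple filter. The counts and category names below are hypothetical, for illustration only:

```python
# Sketch of the inclusion rule for discrete behavior categories described
# above: keep a category only if its occurrence rate lies between 5% and 70%
# of all portrayals. The counts below are made-up illustrative values.
def select_categories(occurrence, n_portrayals, lo=0.05, hi=0.70):
    """occurrence maps category name -> number of portrayals it appears in."""
    return [cat for cat, count in occurrence.items()
            if lo <= count / n_portrayals <= hi]

counts = {"AU12": 90,        # kept: 90/224 is about 40%
          "AU25": 160,       # dropped: about 71%, too indiscriminate
          "AU43": 6,         # dropped: under 3%, too rare
          "head_shake": 30}  # kept: about 13%
print(select_categories(counts, n_portrayals=224))
```

The same two thresholds guard against opposite biases: rare categories cannot form meaningful co-occurrences, while near-ubiquitous ones inflate apparent co-occurrence.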

Acoustic Analysis of Vocal Parameters


Procedure. The sound recordings were digitized on a 386 Compaq personal computer with a sampling frequency of 16,000 Hz and transferred to a PDP 11/23 computer. An automatic acoustical analysis of the selected emotion portrayals was performed by means of the Giessen Speech Analysis System. The acoustic parameters extracted in this fashion are described in Banse and Scherer (1996). Importantly, all parameters were transformed into z scores, partialing out actor differences (because of the strong individual differences in habitual voice quality).
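As a concrete illustration of this standardization step, the following sketch z-transforms a parameter within each actor, removing habitual between-actor differences such as baseline pitch. This is a minimal reconstruction of the idea, not the actual analysis code of the Giessen Speech Analysis System; the function name and the sample F0 values are hypothetical.

```python
from statistics import mean, pstdev

def zscore_within_actor(values, actor_ids):
    """Z-transform a parameter separately within each actor, partialing out
    habitual between-actor differences (e.g., in baseline voice pitch)."""
    out = [0.0] * len(values)
    for actor in set(actor_ids):
        idx = [i for i, a in enumerate(actor_ids) if a == actor]
        m = mean(values[i] for i in idx)
        sd = pstdev([values[i] for i in idx])
        for i in idx:
            out[i] = (values[i] - m) / sd if sd > 0 else 0.0
    return out

# Hypothetical mean-F0 values (Hz) for two actors with different habitual pitch
f0 = [110, 130, 120, 210, 230, 220]
actors = ["A", "A", "A", "B", "B", "B"]
print([round(z, 2) for z in zscore_within_actor(f0, actors)])
# → [-1.22, 1.22, 0.0, -1.22, 1.22, 0.0]
```

After the transformation, both actors' values lie on a common scale, so a high z score means "high for that speaker" rather than "high in absolute terms."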

Results

Unimodal Expression Patterns


Table 1 shows the results for facial behavior as reported in Scherer and Ellgring (2007). The table provides a list of the AUs (with a brief description), the mean frequencies for each of the 14 emotions, and the results of a one-way analysis of variance (ANOVA) testing for frequency differences over emotions (identifying the emotions in nonoverlapping homogeneous subgroups as determined by post hoc testing). A detailed analysis and interpretation of these results, including a comparison to theoretical predictions made on the basis of the component process model (Scherer, 1984, 2001), is provided in Scherer and Ellgring (2007). Table 2 shows the results of the coding of gestures and body movements, as reported in Table 3 of Wallbott (1998). The table shows the categories of body movements measured, the mean frequencies of use of the respective categories by the actors, and the results of a one-way ANOVA testing for frequency differences over emotions (identifying the emotions in nonoverlapping homogeneous subsets as determined by post hoc testing).


Table 1 Mean Proportion of Actors Using Specific Action Units (AUs) to Encode Specific Emotions
[Table 1 body not reproduced: the cell values were garbled in extraction. Rows are facial AUs 1, 2, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 17, 20, 22, 23, 25, 26, 27, and 41 (inner brow raiser, outer brow raiser, brow lowerer, upper lid raiser, cheek raiser, lid tightener, nose wrinkler, upper lip raiser, nasolabial furrow, lip corner puller, cheek puffer, dimpler, lip corner depressor, chin raiser, lip stretcher, lip funneler, lip tightener, lips part, jaw drops, mouth stretches, lids droop); columns are the 14 actor-portrayed emotions (hot anger, cold anger, panic fear, anxiety, despair, sadness, elated joy, happiness, interest, boredom, shame, pride, disgust, contempt), followed by F and p values.]
Note. F and p values reported are for one-way analyses of variance (dfs = 13, 210). For easier readability, proportions < .10 have been omitted. Means in bold can be considered replications of findings reported in Gosselin, Kirouac, and Doré (1995, Table 3). From "Are Facial Expressions of Emotion Produced by Categorical Affect Programs or Dynamically Driven by Appraisal?" by K. R. Scherer and H. Ellgring, 2007, Emotion, 7, p. 121. Copyright 2007 by the American Psychological Association. Adapted with permission of the authors. a,b,c Nonoverlapping means for different emotions in homogeneous subsets (based on Newman-Keuls post hoc comparisons).

and close, and back of hands pointing forward,6 and functional codes, namely illustrators and self-manipulators (see Friesen, Ekman, & Wallbott, 1979; Scherer & Wallbott, 1985). The selection criteria could not be used for the vocal parameters, which are measured continuously over the complete utterance. Banse and Scherer (1996) analyzed 48 variables, many of them highly intercorrelated. Using the criterion of discrimination mentioned earlier, we included in the multimodal analysis those parameters that were shown by Banse and Scherer to differentiate between the portrayed emotions (see also Juslin & Scherer, 2005): mean fundamental frequency (heard as pitch; mean F0 in Table 3), mean amplitude or vocal energy (heard as loudness; MElog in Table 3), the relative proportion of energy in the upper range of the spectrum (beyond 1 kHz; strong high-frequency energy is heard as sharpness; the inverse of variable PE1000 in Table 3), and speech rate or tempo (as

measured by the duration of vocal segments of the standardized utterance; the inverse of variable DurVo in Table 3). As mentioned earlier, each of the standardized utterances was used as the unit of analysis. As these units are very short, lasting only a few seconds, the number of expressive actions that can occur in this brief span of time is necessarily limited. In addition, as shown in the Method section, most of the expressive activity is focused on the verbal utterance. In consequence, even though the exact timing of the expressive actions has not been determined on a precise time line, it can be assumed that the actions or parameter
6 The back-of-hands-forward category was measured by Wallbott (1998) but not reported in his Table 3. It is included here on the basis of a reanalysis of the data.


Table 2 Mean Frequency of Different Body Movements Over 14 Emotions


[Table 2 body not reproduced: the cell values were garbled in extraction. Rows are the body movement categories (upper body collapsed; shoulders up; shoulders forward; head downward; head backward; head bent sideways; lateral hand/arm movements; arms stretched out frontal; arms stretched out sideways; arms crossed in front of chest; hands opening/closing; back of hands sideways; illustrator; self-manipulator; movement activity; expansiveness/spatial extension; movement dynamics/energy/power); columns are the 14 emotions numbered as in the note, followed by F and p values.]
Note. Decimal points omitted for the last three variables for space reasons. Emotions: 1 = cold anger; 2 = hot anger; 3 = elated joy; 4 = happiness; 5 = disgust; 6 = contempt; 7 = sadness; 8 = despair; 9 = anxiety; 10 = panic fear; 11 = shame; 12 = interest; 13 = pride; 14 = boredom. From "Bodily Expression of Emotion," by H. G. Wallbott, 1998, European Journal of Social Psychology, 28, p. 888. Copyright 1998 by John Wiley & Sons Limited. Reprinted with permission. a,b,c,d,e Nonoverlapping means for emotions in homogeneous subsets based on Newman-Keuls post hoc comparisons.

changes found conjointly for a particular utterance represent an organized multimodal behavior pattern.

Multimodal Expression Patterns


One might first ask to what extent the multimodal data set allows a better discrimination of the 14 emotions than the unimodal data sets. Banse and Scherer (1996) reported a 52.5% classification rate for 16 acoustic parameters, which dropped to 24.7% in a cross-validation design. On the basis of a jackknifing procedure, they estimated that with 16 parameters, one could realistically expect a 40% success rate with this stimulus set. As to the face, Scherer and Ellgring (2007) reported a 52.2% success rate with 21 AUs (they did not run a cross-validation because of data limitations). Wallbott (1998) did not compute a discriminant analysis for the gesture and body movements. With the selected multimodal data set described above, we ran a discriminant analysis including all 38 variables. The overall prediction success was 79.0%, dropping to 49.1% in a cross-validation design. Although this prediction success is much higher than in the two unimodal discriminant analyses reported earlier, this could be due to the larger number of variables. We therefore ran a stepwise discriminant analysis (see Table 4). The following 10 variables were entered in the

order listed until reaching the criterion of insufficient significance for adding a further variable: AU12, high amplitude, AU13, AU14, high F0, AU10, AU7, low speech rate, AU4, and upper body collapsed. Of the original category memberships, 54.0% were correctly predicted, a value that dropped to 50.4% in cross-validation. Thus, with only 10 multimodal variables, a respectable cross-validated prediction success can be achieved, considerably higher than what seems possible with unimodal discrimination. The standardized canonical discriminant function coefficients and functions at group centroids (see Table 5) provide some interesting information on how variables combine in the discriminant functions to disambiguate specific emotions (see data in boldface). Thus, AU12 and AU13 discriminate elated joy, happiness, and pride; high F0 and high amplitude in the voice discriminate the high-excitation emotions hot anger, elated joy, despair, and panic fear; AU14 and slow speech tempo discriminate shame and boredom; upper body collapsed, AU4, and high F0 discriminate emotions produced by appraisal of low control and power (Scherer, 2001), such as disgust, sadness, despair, and panic fear; and, finally, slow speech tempo by itself discriminates sadness from the other emotions.
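The contrast between accuracy on the original (training) cases and cross-validated accuracy can be illustrated with a toy classifier. The sketch below uses a simple nearest-centroid rule as a stand-in for the stepwise discriminant analysis reported above; the feature names, vectors, and two-emotion setup are invented for illustration only.

```python
from statistics import mean

def centroid(rows):
    # Per-feature mean of a class's feature vectors
    return [mean(col) for col in zip(*rows)]

def nearest_centroid_predict(x, centroids):
    # Assign x to the class whose centroid is closest (squared Euclidean)
    dists = {c: sum((a - b) ** 2 for a, b in zip(x, m))
             for c, m in centroids.items()}
    return min(dists, key=dists.get)

def loo_accuracy(data):
    # Leave-one-out cross-validation: refit centroids without the held-out
    # portrayal, then classify it
    items = [(lab, x) for lab, rows in data.items() for x in rows]
    hits = 0
    for i, (held_lab, held_x) in enumerate(items):
        train = {}
        for j, (lab, x) in enumerate(items):
            if j != i:
                train.setdefault(lab, []).append(x)
        cents = {lab: centroid(rows) for lab, rows in train.items()}
        hits += nearest_centroid_predict(held_x, cents) == held_lab
    return hits / len(items)

# Tiny hypothetical portrayals: features are AU12, high F0, slow tempo (0/1)
data = {"joy":     [[1, 1, 0], [1, 1, 0], [1, 0, 0]],
        "sadness": [[0, 0, 1], [0, 0, 1], [0, 1, 1]]}
cents = {lab: centroid(rows) for lab, rows in data.items()}
items = [(lab, x) for lab, rows in data.items() for x in rows]
resub = sum(nearest_centroid_predict(x, cents) == lab
            for lab, x in items) / len(items)
print(resub, loo_accuracy(data))  # → 1.0 1.0
```

With real, noisier data the cross-validated figure is typically lower than the resubstitution figure; that optimism is what the 79.0% versus 49.1% contrast above reflects.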


Table 3 Means for Selected Acoustic Parameters of Vocal Utterances for 14 Emotions
[Table 3 body not reproduced: the cell values were garbled in extraction. Rows are the vocal parameters MF0, P25F0, P75F0, SdF0, MElog, DurArt, DurVo, HammI, DO1000, PE500, and PE1000 (defined in the note); columns are the 14 emotions, followed by F and p values.]

Note. Fundamental frequency: MF0 = mean; P25F0 = 25th percentile; P75F0 = 75th percentile; SdF0 = standard deviation. Energy: MElog = mean. Speech rate: DurArt = duration of articulation periods; DurVo = duration of voiced periods. HammI = Hammarberg index; DO1000 = slope of spectral energy above 1,000 Hz; PE500 = proportion of voiced energy up to 500 Hz; PE1000 = proportion of voiced energy up to 1,000 Hz. From "Acoustic Profiles in Vocal Emotion Expression," by R. Banse and K. R. Scherer, 1996, Journal of Personality and Social Psychology, 70, p. 625. Copyright 1996 by the American Psychological Association. Adapted with permission of the authors. a,b,c Nonoverlapping means for emotions in homogeneous subsets (based on Newman-Keuls post hoc comparisons; a, b = high range; c = low range).

As the purpose of the current analysis is to examine the co-occurrences of facial, vocal, and bodily behaviors in the portrayals of different emotions, we decided to identify stable groupings of behaviors from different expressive modalities and then examine their relative frequency for the different emotions. Although for both facial and bodily movements a binary code was used (0 for absence of a category, 1 for presence in a portrayal), the acoustic parameters are continuous variables. To render the data comparable, we created new binary variables on the basis of the z scores, creating separate variables for low and high values on the respective parameters. Thus, z values smaller than -.5 were coded as 1 for the low version of the parameter, those larger than .5 as 1 for the high version. If the

value was located between -.5 and .5, a 0 was assigned for both versions. Given the dichotomous nature of the variables in all modalities after transformation, we used hierarchical cluster analysis for binary data with the phi coefficient as similarity measure and the average linkage (between-groups) clustering procedure over the data for all emotions. The result is shown in Figure 1. Even though this procedure allows only an exploratory analysis of the emerging patterns rather than a principled hypothesis-testing approach, the results provide some interesting insights. Inspection of the dendrogram shows that at lower distances, only clusters of variables in

Table 4 Classification Results of a Stepwise Multivariate Discriminant Analysis With 10 Variables (in Percentages)
[Table 4 body not reproduced: the classification matrix values were garbled in extraction. Rows and columns are the 14 emotions (cold anger, hot anger, elated joy, happiness, disgust, contempt, sadness, despair, anxiety, panic fear, shame, interest, pride, boredom); cells give the percentage of portrayals of the row emotion classified as the column emotion, with rows summing to 100.]

Note. Variables entered in the following sequence: action unit (AU) 12, high vocal amplitude; AU13, AU14, high vocal fundamental frequency; AU10, AU7, slow speech tempo; AU4, upper body collapsed. Of original grouped cases, 54.0% correctly classified; of cross-validated grouped cases, 50.4% correctly classified.


Table 5 Standardized Canonical Discriminant Function Coefficients and Functions at Group Centroids
[Table 5 body not reproduced: the coefficient values were garbled in extraction. The upper panel gives standardized canonical discriminant function coefficients for the 10 variables (UBSLUMP, AU4, AU7, AU10, AU12, AU13, AU14, HIF0, HIAMP, HIDUR) on Functions 1-10; the lower panel gives the functions at group centroids for the 14 emotions.]
Note. Bold numbers indicate for each function which variables define the function (upper part of table) and which emotions are best discriminated by the respective function (lower part of table). UBSLUMP = upper body slumping; AU = action unit; HIF0 = high fundamental frequency; HIAMP = high amplitude; HIDUR = high duration (slow tempo).

the same modality combine. Thus, in the facial domain, we find the following mono-modality clusters:

AUs 1, 2, 4, 11: This combination of inner and outer brow raiser, brow lowerer, and nasolabial furrow can be interpreted as a signature of attention deployment and concentration (see Scherer & Ellgring, 2007).

AUs 6 and 12: Cheek raiser and lip corner puller (zygomaticus) represent the ubiquitous facial signature for happiness.

AUs 13, 17, and 23: A combination of cheek puffer, chin raiser, and lip tightener that is rarely described in the literature.

In the bodily and gestural modality, only one consistent cluster emerges: arms stretched out in frontal direction, illustrators, hands opening and closing, and, although at a greater distance, shoulders moving up. This cluster seems to represent highly expressive gesturing, illustrating the content of speech utterances.

In the vocal modality, three variables combine for both the low and high versions, respectively:

High F0, high amplitude, and high upper frequency energy (generally indicative of high physiological arousal; see Banse & Scherer, 1996; Juslin & Laukka, 2003).

Low F0, low amplitude, and low upper frequency energy (generally indicative of low physiological arousal, passivity, or a depressed state; see Banse & Scherer, 1996; Juslin & Laukka, 2003).
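The preprocessing and similarity computation behind these clusters can be sketched in a few lines. The binarization rule follows the text (z < -.5 scores the "low" variable, z > .5 the "high" variable), and the phi coefficient is the similarity measure named above; the behavior vectors are hypothetical, and the actual analysis was run in SPSS.

```python
from math import sqrt

def binarize(z):
    # Split a z-scored parameter into separate 'low' and 'high' binary
    # variables: z < -0.5 -> low = 1; z > 0.5 -> high = 1; else both 0
    return (1 if z < -0.5 else 0, 1 if z > 0.5 else 0)

def phi(x, y):
    # Phi coefficient: Pearson correlation for two binary variables
    n = len(x)
    n11 = sum(a and b for a, b in zip(x, y))
    n1 = sum(x)
    m1 = sum(y)
    den = sqrt(n1 * (n - n1) * m1 * (n - m1))
    return (n * n11 - n1 * m1) / den if den else 0.0

print(binarize(-0.8), binarize(0.2), binarize(1.3))  # → (1, 0) (0, 0) (0, 1)

# Two behaviors that tend to co-occur across six portrayals get a high phi
au6  = [1, 1, 0, 0, 1, 0]   # cheek raiser present/absent
au12 = [1, 1, 0, 0, 1, 1]   # lip corner puller present/absent
print(round(phi(au6, au12), 2))  # → 0.71
```

Average-linkage clustering then repeatedly merges the variables (or clusters) with the highest average pairwise phi, which is what produces the unimodal clusters listed above.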

At a higher level of agglomeration, at larger distances, we do find some cross-modal clusters. Thus, the high vocal arousal cluster is joined by AU27 (mouth stretch) and arms stretched sideways (a pattern dominated by arousal-related indices). At a still higher level, one could argue for a supercluster combining the latter, the illustrator cluster, and AU5 (upper lid raiser), yielding a total pattern one might describe as agitation (multimodal agitation). In addition, the positive facial cluster is joined by AU26 (jaw drops), fast speech rate, and head shaking (likely to consist of lateral head movements rather than the NO-emblem shake), a

7 This supercluster reminds one of descriptions of depressive patients' expressions. In a longitudinal study of depressed patients (Ellgring, 1989), control patients showed a relatively homogeneous facial pattern in standardized clinical interviews: AU 1+2 (lifting the eyebrows) and AU 6+12 (smile) dominated. In contrast, there were three depression-specific patterns: one consisting of reduced facial activity and two characterized by negative affective tension, one with a predominance of AU 20 (lip stretch) and AU 24 (lip press), the other with AU 14 (dimpler) and a lack of positive elements (AU 1+2, AU 12; Ellgring, 1989, p. 71ff.). These depression-specific facial expression patterns can be interpreted as indicators of fear, anger, and contempt. Reduced facial activity might be interpreted as lack of interest or apathy, due to an absence of behavioral urge.


Figure 1. Dendrogram resulting from a hierarchical cluster analysis of expressive behaviors in three modalities. The cluster analysis was performed with SPSS using phi coefficient similarity and average linkage/between-groups clustering. An alternation between bold, italic, and normal font is used to separate the different clusters described in the text. Hi up freq energy = high values on Energy in the Upper Frequency Range; Arm stretch side = arms stretch sideways; Low up freq energy = low values on Energy in the Upper Frequency Range; Back hands forward = back of the hands pointing forward.

general pattern that suggests joyful surprise (multimodal joyful surprise). In the lower half of the dendrogram, we find a pattern in which the low vocal arousal cluster combines with, successively, slow speech rate, upper body collapsed, AU14 (dimpler), AU41 (lids droop), back of hands pointing forward, and self-manipulators (the latter two variables may be linked because for many self-manipulators, such as scratching or rubbing frontal parts of the body, the back of the hand points forward). We suggest calling this supercluster multimodal resignation.7


We then computed combined variables for the clusters described above to examine whether specific multimodal clusters occurred more frequently for specific emotions. It should be noted that these cluster analysis results only indicate certain tendencies of the cluster members to co-occur. Thus, for a particular portrayal, a cluster might be represented in a more or less complete form. In consequence, we combined the cluster members by summing the individual variables. A combined cluster variable can thus range from 0 to N, the maximal number of cluster members. One can reasonably argue that if we find higher sums, indicative of more complete cluster representation, for particular emotions, this would indicate that those emotions are likely to be expressed prototypically by the respective cluster, as predicted by affect program models. We then computed one-way ANOVAs across emotions for all of the cluster variables. In each case, the F values reached p < .001, indicating significant differences in the frequency and completeness of the respective cluster in the actor portrayals of the different emotions. To determine which emotions were most frequently expressed with the help of the respective cluster, post hoc comparisons (Newman-Keuls) were computed. The results (means and homogeneous emotion subsets) are shown in Table 6.
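The construction of the combined cluster variables and the subsequent one-way ANOVA can be sketched as follows. The member variables, the portrayal codes, and the two-emotion example are invented for illustration; the actual analyses covered all 14 emotions, 16 portrayals each, and used Newman-Keuls post hoc tests.

```python
from statistics import mean

def cluster_score(portrayal, members):
    # Sum the binary member variables of a cluster for one portrayal:
    # 0 .. N, where higher means a more complete realization of the cluster
    return sum(portrayal[m] for m in members)

def one_way_F(groups):
    # One-way ANOVA F statistic over k groups of cluster scores
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical 'multimodal agitation' members, coded 0/1 per portrayal
members = ["high_F0", "high_amplitude", "AU27", "arms_side"]
hot_anger = [{"high_F0": 1, "high_amplitude": 1, "AU27": 1, "arms_side": 0},
             {"high_F0": 1, "high_amplitude": 1, "AU27": 0, "arms_side": 1}]
boredom   = [{"high_F0": 0, "high_amplitude": 0, "AU27": 0, "arms_side": 0},
             {"high_F0": 0, "high_amplitude": 1, "AU27": 0, "arms_side": 0}]
scores = [[cluster_score(p, members) for p in hot_anger],
          [cluster_score(p, members) for p in boredom]]
print(scores, round(one_way_F(scores), 1))  # → [[3, 3], [0, 1]] 25.0
```

A large F here reflects exactly what the text describes: portrayals of some emotions realize the cluster almost completely, whereas portrayals of others barely touch it.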

Discussion
We first need to examine whether the relative dearth of multimodal patterns identified in this analysis could be a result of artifacts in our procedures. It might be argued that selecting portrayals for authenticity and recognizability restricts the sampling of possible portrayals and may confound a more naturalistic categorization of emotion-specific behavior. However, not selecting for authenticity and recognizability would severely limit the ecological validity of the corpus by including portrayals that judges consider unlikely to be encountered in ordinary life (lack of authenticity) and/or whose target emotions cannot be recognized by observers despite the actors' best efforts to portray them. It is difficult to interpret the meaning or significance of the behavior patterns found in such portrayals. In any case, it is most likely that we would have found even less cross-modal patterning than in the

well-recognized portrayals. Furthermore, it is important to use expressions judged as authentic to minimize the danger of manipulation of single modalities (see the introduction to this article). Another possible artifact could consist of having coded behaviors in different modalities that were not time locked, occurring in different parts of the portrayal (coding AUs for the apex of activation, gestures in terms of presence of a type, and vocal features as means over the complete period of activation). As we did not code the facial and gestural behavior on a precise time line with respect to their onset and offset (see Scherer & Wallbott, 1985, pp. 218-222, for an early proposal), we could not determine precise synchronization of different behaviors (e.g., in terms of joint acceleration). This is planned for a new corpus of actor portrayals that we have recently recorded in our laboratory (Bänziger, Pirker, & Scherer, 2006). However, as the portrayal segments consisted of very brief standardized utterances, often only a single facial expression and a single gesture were observed per segment. In consequence, by taking the short utterances as units of analysis, we can be reasonably certain that the conjoint occurrences of facial expressions, gestures, and vocalization changes are indeed time locked to a specific unit of analysis. We examined whether the notion of emotion-specific expression patterns based on fixed affect programs might find support in actors using multimodal combinations of expressive behaviors to portray specific emotions.
However, the only clear evidence of multimodal emotion specificity was found for joyful surprise, presumably elicited by unexpected positive outcomes that generate an expression of incredulity (head shake and jaw drop), a bursting out of vocalizations (high speech rate), and the ubiquitous signs of happiness or positive affect: true or Duchenne smiles (consisting of a combination of AU6 and AU12; Ekman & Friesen, 1982; Frank, Ekman, & Friesen, 1993). The other two multimodal clusters that we were able to identify across several emotions were labeled multimodal agitation and multimodal resignation, respectively. These labels were chosen specifically to distinguish them from the arousal or activation dimension postulated by dimensional theories of emotion to emphasize that the fundamental

Table 6 Means of Nine Combined Cluster Variables for 14 Portrayed Emotions and Indicators for Homogeneous Subsets Based on Post Hoc Testing of Overall Significant Differences
[Table 6 body not reproduced: the cell values were garbled in extraction. Rows are the nine combined cluster variables (face attention; face positive; face pride; body illustration; voice high arousal; voice low arousal; multimodal agitation; multimodal joyful surprise; multimodal resignation); columns are the 14 portrayed emotions, followed by F and p values.]
Nonoverlapping means for emotions in homogeneous subsets based on Newman-Keuls post hoc comparisons.


factor underlying the differentiation of emotion is to be found in appraisal (see also Scherer, Dan, & Flykt, 2006). One can argue that the emotions characterized by the multimodal agitation expression cluster all involve appraisal of high urgency of adaptive action, in particular hot anger (subgroup denoted by a superscript a in Table 6) as well as panic fear and despair (subgroup denoted by a superscript b in Table 6), including an urge to communicate this emotional reaction and the action tendencies that accompany it. Emotions characterized by the multimodal resignation cluster, on the other hand, seem to be characterized by appraisals of loss of control and resignation, such as sadness, shame, and boredom (subgroup a in Table 6), and thus the absence of behavioral preparation and of the urge to inform one's environment about this state of affairs. This pattern of results, suggesting that certain multimodal expression configurations reflect adaptive behavior patterns following the evaluation of a particular event, seems quite compatible with a component process model explanation of emotion (Scherer, 1984, 2001; Scherer & Ellgring, 2007). This interpretation is supported by the fact that the respective multimodal patterns occur frequently for many different emotions, including positive emotions (elated joy for agitation and quiet happiness for resignation; see Table 6). It seems that these multimodal response patterns are driven by the presence or absence of behavioral urges produced by appraisal rather than triggered by emotion-specific affect programs. The current results can also be explained by the notion of individual specificity, which is compatible with a componential appraisal account. Individuals tend to use only part of the possible expressive elements in the vocal, gestural, and facial display of an emotion. This has been shown for nonverbal behavior during the course of depression (Ellgring, 1989).
In this study, six nonverbal parameters (general facial activity; frequency of AU12, i.e., smiling; the repertoire of facial activity; the amount of gazing at the interviewer; the amount of speaking; and gestural activity) were compared with regard to their frequency and variability in relation to a normal control group. It appeared that about 90% of endogenous and 80% of neurotic depressed patients had reduced values in at least one of these parameters and showed substantial improvements after therapy. However, none of the patients showed these deficits or improvements in all six of the parameters (with the exception of one endogenous patient at the beginning of therapy). Deficits and improvements in at least three of the six parameters were observed in only 20% to 50% of all cases (Ellgring, 1989, p. 150ff.). In general, one can assume that nonverbal elements combine in a logical "or" rather than a logical "and" association (as used in the current statistical analyses), possibly based on a principle of least effort (Zipf, 1949): Only parts of possible configurations may be used by individual actors in a portrayal, as suggested by the assumption of a pars pro toto principle in speech communication (a part stands for the whole; Herrmann, 1982; see also Scherer & Grandjean, 2006). Moreover, there probably is emotion specificity insofar as certain elements of the nonverbal repertoire are more suited than others to express one emotion or one class of emotions. Observers are able to allow for individual specificity and emotion specificity, as they reliably identify rather specific emotions from multimodal nonverbal displays. Like our closest relatives, the chimpanzees, we may have perceptual biases for multimodal cues (Parr, 2004) that play a central role in our ability to recognize emotion signals. In contrast, statistical algorithms searching for common patterns require logical "and" combinations. If elements a and b and c must be jointly present, only higher levels of aggregation, like the dimensions of arousal and pleasantness, or in our case agitation and resignation, emerge.

Conclusion
The current evidence is clearly limited by the nature of the expressive behaviors that have been coded in the three studies that provided the data for the current analysis. One might well argue that the facial expressions, gesture categories, and acoustic features of speech analyzed in these studies are too gross to reflect emotion-specific expression patterns as predicted by affect program models. It is indeed possible that there are very subtle indicators of specific emotions that have so far escaped the attention of researchers specializing in the microcoding and analysis of expressive behavior. One argument in favor of this assumption is that judges do rather well in decoding individual emotions, even in the case of a large number of different emotions (14 in the present study), some being members of the same emotion family. Specific signals, such as blushing and looking away in shame (see Keltner & Buswell, 1997), might occur too infrequently to have made it into the current coding lists. To settle the question, it would seem necessary for protagonists of affect program mechanisms to publish specific hypotheses, taking into account recent modifications of basic emotion theories (see Scherer & Ellgring, 2007). Unfortunately, as described elsewhere (Scherer & Grandjean, 2006), most of the work in this domain relies on judgment studies as the basis for claims concerning production mechanisms. However, the fact that judges are able to differentiate individual emotions cannot readily be used as an argument in favor of emotion-specific expression programs (see the detailed argument in Scherer & Grandjean, 2006).
Apart from the problems surrounding the manifold possibilities of artifacts in judgment procedures (response biases, guessing, exclusion of alternatives; see Banse & Scherer, 1996; Frank & Stennett, 2001; Rosenberg & Ekman, 1995; Russell, 1994), judgmental discrimination can be equally well explained by a component process approach, which suggests that expressive behavior is driven by a detailed pattern of results on a complex profile of appraisal criteria. If different appraisal results produce specific adaptive motor responses, and if unique appraisal profiles produce emotions that are labeled with specific terms, judges should be able to match appraisal-profile-driven expression patterns with specific labels.

The current data set is limited by the fact that actor portrayals were used and that only a relatively small number of actors encoded a fairly large number of emotions. In addition, the use of standardized verbal utterances without meaning may constrain the variety of emotional expressions that are produced by the actors. This is particularly true for nonverbal affect bursts, brief involuntary outbursts of emotional reactions (such as cries, sighs, and sudden inhalations) that are generally highly integrated across modalities (Scherer, 1994). Some of these, such as disgust bursts, might be particularly good examples of emotion-specific multimodal responses generated by affect programs (disgust generally being expressed by a rather unique facial expression). In this case, the component process model prediction would not differ from an affect program explanation, as the sudden involuntary reaction in a disgust burst would be considered the direct efferent effect of novelty and intrinsic unpleasantness evaluations. Standardized verbal utterances may not be the ideal setting in which to study this type of reaction, as the burstlike nature of these responses cannot be extended over an utterance, even one of brief duration.

In consequence, the data reported here can only be suggestive and help inform further work in this area. What is clearly needed to address the issues at stake is the detailed microcoding of a large number of comparable expressions of different real-life emotions, if at all possible with conjoint measurement of appraisal processes. Only this type of data is likely to yield the information necessary to decide this central question in our understanding of the mechanisms underlying emotional expression. Unfortunately, the task is rendered even more difficult by the fact that realistic expressions of everyday emotions may be even more affected by situational constraints, display rules, or other such push factors (Kappas, Hess, & Scherer, 1991; Scherer, 1985) than are actor expressions (see Banse & Scherer, 1996, for an argument in this sense), so that complex experimental designs for induction or systematic observation may be required to control for all potential determinants. Increasing attention to multimodal patterns of expression in recent years, particularly in applied areas such as affective computing, may produce a paradigm shift in the study of emotional expression and encourage more ambitious research programs in this domain.

References

Arnold, M. B. (1960). Emotion and personality: Vol. 1. Psychological aspects. New York: Columbia University Press.
Banse, R., & Scherer, K. R. (1996). Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology, 70, 614–636.
Bänziger, T., Pirker, H., & Scherer, K. R. (2006, May 24–26). GEMEP – GEneva Multimodal Emotion Portrayals: A corpus for the study of multimodal emotional expressions. LREC'06 Workshop on Corpora for Research on Emotion and Affect, Genoa, Italy. In Proceedings of the Fifth Conference on Language Resources and Evaluation (pp. 15–19).
Cohn, J. F., Reed, L. I., Moriyama, T., Xiao, J., Schmidt, K., & Ambadar, Z. (2004, May). Multimodal coordination of facial action, head rotation, and eye motion during spontaneous smiles. Paper presented at the Sixth IEEE International Conference on Automatic Face and Gesture Recognition, Seoul, Korea.
Darwin, C. (1998). The expression of the emotions in man and animals (P. Ekman, Ed.). New York: Oxford University Press. (Original work published 1872)
Davidson, R., Scherer, K. R., & Goldsmith, H. (Eds.). (2003). Handbook of affective sciences. New York: Oxford University Press.
Ekman, P. (1972). Universals and cultural differences in facial expression of emotion. In J. K. Cole (Ed.), Nebraska Symposium on Motivation: Vol. 19 (pp. 207–283). Lincoln: University of Nebraska Press.
Ekman, P. (1979). About brows: Emotional and conversational signals. In M. V. Cranach, K. Foppa, W. Lepenies, & D. Ploog (Eds.), Human ethology (pp. 169–202). Cambridge, England: Cambridge University Press.
Ekman, P. (1992). An argument for basic emotions. Cognition and Emotion, 6, 169–200.
Ekman, P. (2001). Telling lies: Clues to deceit in the marketplace, marriage, and politics (3rd ed.). New York: Norton.
Ekman, P. (2003). Emotions revealed. New York: Times Books.
Ekman, P., & Friesen, W. V. (1978). The facial action coding system: A technique for the measurement of facial movement. Palo Alto, CA: Consulting Psychologists Press.
Ekman, P., & Friesen, W. V. (1982). Felt, false and miserable smiles. Journal of Nonverbal Behavior, 6, 238–252.
Ekman, P., O'Sullivan, M., Friesen, W. V., & Scherer, K. R. (1991). Face, voice and body in detecting deception. Journal of Nonverbal Behavior, 15, 125–135.
Ekman, P., & Rosenberg, E. L. (2005). What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS) (2nd ed.). New York: Oxford University Press.
Ellgring, H. (1989). Nonverbal communication in depression. Cambridge, England: Cambridge University Press.
Ellgring, H., & Scherer, K. R. (1996). Vocal indicators of mood change in depression. Journal of Nonverbal Behavior, 20, 83–110.
Ellsworth, P. C., & Scherer, K. R. (2003). Appraisal processes in emotion. In R. J. Davidson, H. Goldsmith, & K. R. Scherer (Eds.), Handbook of the affective sciences (pp. 572–595). New York: Oxford University Press.
Feldman, R. S., & Rimé, B. (Eds.). (1991). Fundamentals of nonverbal behavior. Cambridge, England: Cambridge University Press.
Frank, M. G., Ekman, P., & Friesen, W. V. (1993). Behavioral markers and recognizability of the smile of enjoyment. Journal of Personality and Social Psychology, 64, 83–93.
Frank, M. G., & Stennett, J. (2001). The forced choice paradigm and the perception of facial expressions of emotion. Journal of Personality and Social Psychology, 80, 75–85.
Friesen, W. V., Ekman, P., & Wallbott, H. (1979). Measuring hand movements. Journal of Nonverbal Behavior, 4, 97–112.
Frijda, N. H. (1986). The emotions. Cambridge, England: Cambridge University Press.
Gosselin, P., Kirouac, G., & Doré, F. Y. (1995). Components and recognition of facial expression in the communication of emotion by actors. Journal of Personality and Social Psychology, 68, 1–14.
Herrmann, T. (1982). Language and situation: The pars pro toto principle. In C. Fraser & K. Scherer (Eds.), Advances in the social psychology of language (pp. 123–158). Cambridge, England: Cambridge University Press.
Hinde, R. (Ed.). (1972). Non-verbal communication. Cambridge, England: Cambridge University Press.
Hjortsjö, C. H. (1970). Man's face and mimic language. Malmö, Sweden: Nordens Boktryckeri.
Izard, C. E. (1971). The face of emotion. New York: Appleton-Century-Crofts.
Izard, C. E. (1992). Basic emotions, relations among emotions, and emotion-cognition relations. Psychological Review, 99, 561–565.
Juslin, P. N., & Laukka, P. (2003). Communication of emotions in vocal expression and music performance: Different channels, same code? Psychological Bulletin, 129, 770–814.
Juslin, P. N., & Scherer, K. R. (2005). Vocal expression of affect. In J. Harrigan, R. Rosenthal, & K. Scherer (Eds.), The new handbook of methods in nonverbal behavior research (pp. 65–135). Oxford, England: Oxford University Press.
Kappas, A., Hess, U., & Scherer, K. R. (1991). Voice and emotion. In R. Feldman & B. Rimé (Eds.), Fundamentals of nonverbal behavior (pp. 200–238). New York: Cambridge University Press.
Keltner, D., & Buswell, B. N. (1997). Embarrassment: Its distinct form and appeasement functions. Psychological Bulletin, 122, 250–270.
Keltner, D., & Ekman, P. (2000). Facial expression of emotion. In M. Lewis & J. Haviland-Jones (Eds.), Handbook of emotions (2nd ed., pp. 236–249). New York: Guilford Press.
Keltner, D., Ekman, P., Gonzaga, G. C., & Beer, J. (2003). Facial expression of emotion. In R. J. Davidson, K. R. Scherer, & H. Goldsmith (Eds.), Handbook of the affective sciences (pp. 415–432). New York: Oxford University Press.
Knapp, M. L., & Hall, J. A. (1997). Nonverbal communication in human interaction. New York: Harcourt Brace Jovanovich.
Lazarus, R. S. (1966). Psychological stress and the coping process. New York: McGraw-Hill.
Lazarus, R. S. (1991). Emotion and adaptation. New York: Oxford University Press.
Parr, L. A. (2004). Perceptual biases for multimodal cues in chimpanzee (Pan troglodytes) affect recognition. Animal Cognition, 7, 171–178.
Pittenger, R., Hockett, C., & Danehy, J. (1960). The first five minutes. Ithaca, NY: Paul Martineau.
Roseman, I. J., & Smith, C. A. (2001). Appraisal theory. In K. Scherer, A. Schorr, & T. Johnstone (Eds.), Appraisal processes in emotion: Theory, methods, research (pp. 3–19). Oxford, England: Oxford University Press.
Rosenberg, E. L., & Ekman, P. (1995). Conceptual and methodological issues in the judgment of facial expressions of emotion. Motivation and Emotion, 19, 111–138.
Russell, J. A. (1994). Is there universal recognition of emotion from facial expression? A review of the cross-cultural studies. Psychological Bulletin, 115, 102–141.
Russell, J. A., Bachorowski, J.-A., & Fernández-Dols, J.-M. (2003). Facial and vocal expressions of emotion. Annual Review of Psychology, 54, 329–349.
Sayette, M. A., Cohn, J. F., Wertz, J. M., Perrott, M. A., & Parrott, D. J. (2001). A psychometric evaluation of the Facial Action Coding System for assessing spontaneous expression. Journal of Nonverbal Behavior, 25, 167–185.
Scheflen, A. (1973). Communicational structure. Bloomington: Indiana University Press.
Scherer, K. R. (1984). On the nature and function of emotion: A component process approach. In K. R. Scherer & P. Ekman (Eds.), Approaches to emotion (pp. 293–318). Hillsdale, NJ: Erlbaum.
Scherer, K. R. (1985). Vocal affect signaling: A comparative approach. In J. Rosenblatt, C. Beer, M.-C. Busnel, & P. J. B. Slater (Eds.), Advances in the study of behavior (Vol. 15, pp. 189–244). New York: Academic Press.
Scherer, K. R. (1987). Toward a dynamic theory of emotion: The component process model of affective states. Geneva Studies in Emotion and Communication, 1, 1–98. Retrieved June 23, 2006, from http://www.unige.ch/fapse/emotion/publications/geneva_studies.html
Scherer, K. R. (1992). What does facial expression express? In K. Strongman (Ed.), International review of studies on emotion (Vol. 2, pp. 139–165). Chichester, England: Wiley.
Scherer, K. R. (1994). Affect bursts. In S. van Goozen, N. E. van de Poll, & J. A. Sergeant (Eds.), Emotions: Essays on emotion theory (pp. 161–196). Hillsdale, NJ: Erlbaum.
Scherer, K. R. (1999). Universality of emotional expression. In D. Levinson, J. Ponzetti, & P. Jorgenson (Eds.), Encyclopedia of human emotions (Vol. 2, pp. 669–674). New York: Macmillan.
Scherer, K. R. (2001). Appraisal considered as a process of multi-level sequential checking. In K. R. Scherer, A. Schorr, & T. Johnstone (Eds.), Appraisal processes in emotion: Theory, methods, research (pp. 92–120). New York: Oxford University Press.
Scherer, K. R. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40, 227–256.
Scherer, K. R. (2004). Feelings integrate the central representation of appraisal-driven response organization in emotion. In A. S. R. Manstead, N. H. Frijda, & A. H. Fischer (Eds.), Feelings and emotions: The Amsterdam Symposium (pp. 136–157). Cambridge, England: Cambridge University Press.
Scherer, K. R., Banse, R., Wallbott, H. G., & Goldbeck, T. (1991). Vocal cues in emotion encoding and decoding. Motivation and Emotion, 15, 123–148.
Scherer, K. R., Dan, E., & Flykt, A. (2006). What determines a feeling's position in three-dimensional affect space? A case for appraisal. Cognition and Emotion, 20, 92–113.
Scherer, K. R., & Ellgring, H. (2007). Are facial expressions of emotion produced by categorical affect programs or dynamically driven by appraisal? Emotion, 7, 113–130.
Scherer, K. R., & Grandjean, D. (2006). Inferences from facial expressions of emotion have many facets. Manuscript submitted for publication.
Scherer, K. R., Johnstone, T., & Klasmeyer, G. (2003). Vocal expression of emotion. In R. J. Davidson, K. R. Scherer, & H. Goldsmith (Eds.), Handbook of the affective sciences (pp. 433–456). New York: Oxford University Press.
Scherer, K. R., & Wallbott, H. G. (1985). Analysis of nonverbal behavior. In T. A. van Dijk (Ed.), Handbook of discourse analysis (Vol. 2, pp. 199–230). London: Academic Press.
Scherer, K. R., & Wallbott, H. G. (1994). Evidence for universality and cultural variation of differential emotion response patterning. Journal of Personality and Social Psychology, 66, 310–328.
Scherer, K. R., Wallbott, H. G., & Summerfield, A. B. (Eds.). (1986). Experiencing emotion: A cross-cultural study. Cambridge, England: Cambridge University Press.
Scherer, K. R., Wranik, T., Sangsue, J., Tran, V., & Scherer, U. (2004). Emotions in everyday life: Probability of occurrence, risk factors, appraisal and reaction patterns. Social Science Information, 43, 499–570.
Siegman, A. W., & Feldstein, S. (Eds.). (1985). Multichannel integrations of nonverbal behavior. Hillsdale, NJ: Erlbaum.
Smith, C. A. (1989). Dimensions of appraisal and physiological response in emotion. Journal of Personality and Social Psychology, 56, 339–353.
Smith, C. A., & Scott, H. S. (1997). A componential approach to the meaning of facial expressions. In J. A. Russell & J. M. Fernández-Dols (Eds.), The psychology of facial expression (pp. 229–254). New York: Cambridge University Press.
Tomkins, S. S. (1962). Affect, imagery, consciousness: Vol. 1. The positive affects. New York: Springer.
Wallbott, H. G. (1998). Bodily expression of emotion. European Journal of Social Psychology, 28, 879–896.
Zipf, G. K. (1949). Human behavior and the principle of least effort. Cambridge, MA: Addison-Wesley.

Received January 6, 2006
Revision received August 2, 2006
Accepted August 3, 2006