In this paper, a real-time system to create a talking head from a video sequence without any user intervention is presented. In the proposed system, a probabilistic approach is presented to decide whether or not extracted facial features are appropriate for creating a three-dimensional (3-D) face model. Automatically extracted two-dimensional facial features from a video sequence are fed into the proposed probabilistic framework before a corresponding 3-D face model is built, to avoid generating an unnatural or nonrealistic 3-D face model. To extract the face shape, we also present a face shape extractor based on an ellipse model controlled by three anchor points, which is accurate and computationally cheap. To create a 3-D face model, a least-squares approach is presented to find the coefficient vector necessary to adapt a generic 3-D model to the extracted facial features. Experimental results show that the proposed system can efficiently build a 3-D face model from a video sequence without any user intervention for various Internet applications, including virtual conferencing and virtual storytelling, that do not require much head movement or high-quality facial animation.
Index Terms—MPEG-4 facial object, probabilistic approach, speech-driven talking
heads, talking heads, virtual face.
I. INTRODUCTION

After the positions of facial components such as the mouth and eyes are known, as shown in Fig. 2(a), using various methods, the proposed face shape extractor is ready to start. We assume that a human face has a homogeneous color distribution. The face shape is modeled as an ellipse

x^2/a^2 + y^2/b^2 = 1

where a is the distance between the x positions of P1 and P2 and b is the distance between the y positions of P1 and P3 (assuming the face shape is symmetric).
3) Add the intensities of the pixels on the ellipse that are lower than the left and right anchor points and record the sum.
4) Move the left and right anchor points up and down to find the parameters of the ellipse that produce the maximum boundary energy for the face shape from an edge image [see Fig. 2(e)] using (1):

E_boundary = Σ_{(x,y) ∈ Ω} E(x, y)    (1)

where E(x, y) is the intensity of the edge image [Fig. 2(e)] and Ω denotes the subset of pixels on the ellipse that are located lower than the left and right anchor points.
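Steps 3) and 4) above amount to a one-dimensional search over the ellipse's vertical placement of the side anchor points, maximizing the summed edge intensity along the lower boundary. A minimal sketch in Python (the helper names, the parametric sampling of the ellipse, and the search range are assumptions for illustration, not the authors' implementation):

```python
import numpy as np

def boundary_energy(edge, cx, cy, a, b, y_anchor, n=360):
    """Sum edge intensities over the lower part of an ellipse.

    edge      : 2-D edge-intensity image, indexed edge[y, x]
    (cx, cy)  : ellipse center
    a, b      : horizontal and vertical semi-axes
    y_anchor  : only pixels below this row (image y grows downward)
                contribute, i.e., the subset Ω in (1)
    """
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    xs = np.round(cx + a * np.cos(t)).astype(int)
    ys = np.round(cy + b * np.sin(t)).astype(int)
    h, w = edge.shape
    keep = (ys > y_anchor) & (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    return float(edge[ys[keep], xs[keep]].sum())

def fit_face_ellipse(edge, cx, cy, a, b0, y_anchor, search=10):
    """Step 4: vary the vertical semi-axis around an initial guess b0
    (moving the side anchor points up/down) and keep the value that
    maximizes the boundary energy."""
    return max(range(b0 - search, b0 + search + 1),
               key=lambda b: boundary_energy(edge, cx, cy, a, b, y_anchor))
```

Because the energy is evaluated only on the lower arc, hair and forehead edges above the anchor points do not bias the fit.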
III. PROBABILITY NETWORKS
Probabilistic approaches have been successfully used to locate human
faces from a scene and to track deformations of local features. Cipolla
et al. proposed a probabilistic framework to combine different facial
features and face groups, achieving a high confidence rate for face
detection from a complicated scene. Huang et al. used a probabilistic
network for local feature tracking by modeling locations and velocities
of selected feature points. In our automated system, a probabilistic
framework is adopted to maximally use facial feature evidence for
deciding correctness of extracted facial features before a 3-D face
model is built. Fig. 3 shows the selected FDPs for the proposed
probabilistic framework. The network hierarchy used in our approach is
shown in Fig. 4, which consists of a facial feature net, a face shape net,
and a topology net. The facial feature net has a mouth net and eye net
as its subnets. The details of each subnet are shown in Fig. 5. In the
networks, each node represents a random variable and each arrow
denotes conditional dependency between two nodes. In a study of face
anthropometry, data are collected by measuring distances and angles
among selected key points from a human face, e.g., corners of eyes,
mouth and ears, to describe the variability of a human face. Based on
the study, we characterize a frontal face by measuring distances and covariances between key points chosen from the study. All nodes in
the proposed probability networks are classified into four groups:
Mouth = [D(8.1, 2.2), D(8.4, 8.3), D(2.3, 8.2)];
Eyes = [D(3.12, 3.7), D(3.12, 3.8), D(3.8, 3.11), D(3.13, 3.9)];
Topology = [D(2.1, 9.15), D(2.1, 3.8), D(9.15, 2.2)];
Face Shape = [D(2.2, 2.1), D(10.7, 10.8)];
where D(P1, P2) is the distance between FDPs P1 and P2 as defined in the MPEG-4 standard. In our network, the
distance between two feature points is defined as a random variable
for each node. For instance, we model D(3.5, 3.6), the distance
between centers of the left and right eyes, and D(2.1, 9.15), the
distance of selected two points FDP 2.1 and FDP 9.15, shown in Fig.
5(b), as a 2-D Gaussian distribution, estimating means, standard
deviations, and correlation coefficients. Fig. 5(c) shows graphical
illustrations of the relationship between two nodes in the proposed
probability networks. For example, the distance between FDP
3.5 and FDP 3.6, and the length between FDP 8.4 and FDP 8.3
(width of the mouth), are modeled as a 2-D Gaussian distribution with parameters (μ1, σ1), (μ2, σ2), and ρ, where D1 and D2 denote the distances between the two selected FDPs, μi and σi denote the mean and standard deviation of Di, respectively, and ρ denotes the correlation coefficient between the two nodes D1 and D2. To model the 2-D Gaussian distributions of D(3.5, 3.6) and the distances of the selected paired points, a database from the face anthropometry study is used in our simulations. The
reason we model probability distributions based on FDP3.5 and FDP3.6
is that the left and right eye centers are the features that can be
detected most reliably and accurately from a video sequence
according to our implementation. The chain rule and conditional
independence relationship are applied to calculate the joint probability
of each network. For
instance, the probability of the face shape net is defined as a joint
probability of all three nodes, D(3.5, 3.6), D(2.2, 2.1), and D(10.7,
10.8), as follows:
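Assuming that D(3.5, 3.6) acts as the root node and that D(2.2, 2.1) and D(10.7, 10.8) are conditionally independent given it (a sketch of the factorization, not necessarily the paper's exact equation), the chain rule gives

P(Face Shape) = P(D(3.5, 3.6)) · P(D(2.2, 2.1) | D(3.5, 3.6)) · P(D(10.7, 10.8) | D(3.5, 3.6)).

A minimal Python sketch of evaluating such a product from pairwise 2-D Gaussian models; the helper names and all numeric parameters below are illustrative placeholders, not the authors' anthropometric data:

```python
import numpy as np

def gauss1d(x, mu, sigma):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def gauss2d(x1, x2, mu1, mu2, s1, s2, rho):
    """Bivariate Gaussian density with correlation coefficient rho."""
    a = (x1 - mu1) / s1
    b = (x2 - mu2) / s2
    z = a * a - 2.0 * rho * a * b + b * b
    return np.exp(-z / (2.0 * (1.0 - rho ** 2))) / (
        2.0 * np.pi * s1 * s2 * np.sqrt(1.0 - rho ** 2))

def cond_gauss(x2, x1, mu1, mu2, s1, s2, rho):
    """P(x2 | x1) for a bivariate Gaussian: itself Gaussian, with a mean
    shifted toward x1 and a standard deviation reduced by sqrt(1 - rho^2)."""
    mu = mu2 + rho * (s2 / s1) * (x1 - mu1)
    s = s2 * np.sqrt(1.0 - rho ** 2)
    return gauss1d(x2, mu, s)

# Joint probability of the face shape net: root node D(3.5, 3.6) times
# the two conditionals.  All numbers are made-up placeholders (pixels).
d_eyes   = 62.0   # observed D(3.5, 3.6), eye-center distance
d_height = 185.0  # observed D(2.2, 2.1)
d_width  = 130.0  # observed D(10.7, 10.8)
p_fs = (gauss1d(d_eyes, 60.0, 5.0)
        * cond_gauss(d_height, d_eyes, 60.0, 180.0, 5.0, 12.0, 0.7)
        * cond_gauss(d_width, d_eyes, 60.0, 125.0, 5.0, 10.0, 0.8))
```

A low value of p_fs relative to the training data would flag the extracted features as unreliable before the 3-D model is built.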
a 3-D model that is quite realistic and good enough for various Internet applications that do not require high-quality facial animation.
ACKNOWLEDGMENT
The authors wish to thank the anonymous reviewers for their valuable
comments.
REFERENCES
[1] W.-S. Lee, M. Escher, G. Sannier, and N. Magnenat-Thalmann, “MPEG-4 compatible faces from orthogonal photos,” in Proc. Int. Conf. Computer Animation, 1999, pp. 186–194.
[2] P. Fua and C. Miccio, “Animated heads from ordinary images: A least-squares approach,” Comput. Vis. Image Understand., vol. 75, no. 3, pp. 247–259, 1999.
[3] F. Pighin, R. Szeliski, and D. H. Salesin, “Resynthesizing facial
animation through 3-D model-based tracking,” in Proc. 7th IEEE Int.
Conf. Computer Vision, vol. 1, 1999, pp. 143–150.
[4] Z. Liu, Z. Zhang, C. Jacobs, and M. Cohen, “Rapid modeling of animated faces from video,” Microsoft Research, Tech. Rep. MSR-TR-2000-11, 2000.
[5] A. C. A. del Valle and J. Ostermann, “3-D talking head customization
by adapting a generic model to one uncalibrated picture,” in Proc. IEEE
Int. Symp. Circuits and Systems, 2001, pp. 325–328.
[6] C. J. Kuo, R.-S. Huang, and T.-G. Lin, “3-D facial model estimation
from single front-view facial image,” IEEE Trans. Circuits Syst. Video
Technol., vol. 12, no. 3, pp. 183–192, Mar. 2002.
[7] L. Moccozet and N. Magnenat Thalmann, “Dirichlet free-form deformations and their application to hand simulation,” in Proc. Computer Animation ’97, 1997, pp. 93–102.
[8] V. Blanz and T. Vetter, “A morphable model for the synthesis of 3-D
faces,” in Computer Graphics, Annu. Conf. Series, SIGGRAPH 1999, pp.
187–194.
[9] E. Cosatto and H. P. Graf, “Photo-realistic talking-heads from image samples,” IEEE Trans. Multimedia, vol. 2, no. 3, pp. 152–163, Jun. 2000.
[10] I.-C. Lin, C.-S. Hung, T.-J. Yang, and M. Ouhyoung, “A speech
driven talking head system based on a single face image,” in Proc. 7th
Pacific Conf. Computer Graphics and Applications, 1999, pp. 43–49.
[11] Ananova. [Online]. Available: http://www.ananova.com/
[12] R.-S. Wang and Y. Wang, “Facial feature extraction and tracking in
video sequences,” in Proc. IEEE Int. Workshop on Multimedia Signal
Processing, 1997, pp. 233–238.
[13] D. Reisfeld and Y. Yeshurun, “Robust detection of facial features by generalized symmetry,” in Proc. 11th IAPR Int. Conf. Pattern Recognition, 1992, pp. 117–120.
CHOI AND HWANG: AUTOMATIC CREATION OF A TALKING HEAD FROM A VIDEO SEQUENCE
A. K. Peters, 1996.
[29] Dual Rate Speech Coder for Multimedia Communications Transmitting at 5.3 and 6.3 kbit/s, ITU-T Recommendation G.723.1, Mar. 1996.
[30] K. H. Choi and J.-N. Hwang, “A real-time system for automatic creation of 3-D face models from a video sequence,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2002, pp. 2121–2124.