Fig. 1. Our face aging method. (a) approximation of the latent vector to reconstruct the input image; (b) switching the age
condition at the input of the generator G to perform face aging.
1. Given an input face image x of age y0, find an optimal latent vector z* which allows us to generate a reconstructed face x̄ = G(z*, y0) as close as possible to the initial one (cf. Figure 1-(a)).

2. Given the target age y_target, generate the resulting face image x_target = G(z*, y_target) by simply switching the age at the input of the generator (cf. Figure 1-(b)).

The first step of the presented face aging method (i.e. input face reconstruction) is the key one. Therefore, in Subsection 2.2, we present our approach to approximately reconstructing an input face, with particular emphasis on preserving the original person's identity in the reconstructed image.

2.1. Age Conditional Generative Adversarial Network

Introduced in [9], a GAN is a pair of neural networks (G, D): the generator G and the discriminator D. G maps vectors z from the noise space N^z with a known distribution p_z to the image space N^x. The generator's goal is to model the distribution p_data of the image space N^x (in this work, p_data is the distribution of all possible face images). The discriminator's goal is to distinguish real face images drawn from the image distribution p_data from synthetic images produced by the generator. Both networks are iteratively optimized against each other in a minimax game (hence the name "adversarial").

The conditional GAN (cGAN) [13, 14] extends the GAN model by allowing the generation of images with certain attributes ("conditions"). In practice, conditions y ∈ N^y can be any information related to the target face image: level of illumination, facial pose or a facial attribute. More formally, cGAN training can be expressed as an optimization of the function v(θ_G, θ_D), where θ_G and θ_D are the parameters of G and D, respectively:

min_{θ_G} max_{θ_D} v(θ_G, θ_D) = E_{x,y∼p_data} [log D(x, y)]
                                + E_{z∼p_z(z), ỹ∼p_y} [log (1 − D(G(z, ỹ), ỹ))]    (1)

The acGAN model proposed in this work uses the same design for the generator G and the discriminator D as in [15]. Following [12], we inject the conditional information at the input of G and at the first convolutional layer of D. acGAN is optimized using the ADAM algorithm [16] for 100 epochs, which takes about one day on a modern GPU. In order to encode a person's age, we have defined six age categories: 0-18, 19-29, 30-39, 40-49, 50-59 and 60+ years old. They have been selected so that the training dataset (cf. Subsection 3.1) contains at least 5,000 examples in each age category. Thus, the conditions of acGAN are six-dimensional one-hot vectors.

2.2. Approximative Face Reconstruction with acGAN

2.2.1. Initial Latent Vector Approximation

Contrary to autoencoders, cGANs do not have an explicit mechanism for the inverse mapping of an input image x with attributes y to a latent vector z, which is necessary for image reconstruction: x = G(z, y). As in [12, 17], we circumvent this problem by training an encoder E, a neural network which approximates the inverse mapping.

In order to train E, we generate a synthetic dataset of 100K pairs (x_i, G(z_i, y_i)), i = 1, ..., 10^5, where z_i ∼ N(0, I) are random latent vectors, y_i ∼ U are random age conditions uniformly distributed over the six age categories, G(z, y) is the generator of the previously trained acGAN, and x_i = G(z_i, y_i) are the synthetic face images. E is trained to minimize the Euclidean distances between the estimated latent vectors E(x_i) and the ground truth latent vectors z_i.

Although GANs are arguably the most powerful generative models today, they cannot exactly reproduce the details of all real-life face images, with their infinite variety of minor facial details, accessories, backgrounds, etc. In general, a natural input face image can only be approximated, rather than exactly reconstructed, with acGAN. Thus, E produces initial latent approximations z0 which are good enough to serve as initializations of our optimization algorithm, explained hereafter.
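The encoder training of Subsection 2.2.1 can be illustrated with a minimal numpy sketch. Everything below is a toy stand-in, not the paper's implementation: a fixed random linear map `A` plays the role of the pre-trained generator G, the dimensions are arbitrary, and a closed-form least-squares fit replaces gradient descent on a neural encoder E. Only the recipe is the same: sample (z_i, y_i), render x_i = G(z_i, y_i), and fit E to minimize ||E(x_i) − z_i||².

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, dim_z, n_ages, dim_x = 10_000, 8, 6, 32  # toy sizes, not the paper's

# Hypothetical stand-in for the pre-trained generator G(z, y): a fixed
# random linear map from (latent vector, one-hot age) to an "image" vector.
A = rng.standard_normal((dim_z + n_ages, dim_x))

def G(z, y):
    return np.concatenate([z, y], axis=-1) @ A

# Synthetic training set: z_i ~ N(0, I), y_i uniform over six one-hot categories.
z = rng.standard_normal((n_pairs, dim_z))
y = np.eye(n_ages)[rng.integers(0, n_ages, size=n_pairs)]
x = G(z, y)

# "Train" the encoder E to minimize ||E(x_i) - z_i||^2; a closed-form
# least-squares fit stands in for gradient descent on a neural network.
W, *_ = np.linalg.lstsq(x, z, rcond=None)

def E(x):
    return x @ W

z0 = E(x)  # initial latent approximations z_0
print("mean L2 error:", float(np.mean(np.linalg.norm(z0 - z, axis=1))))
```

Because the toy generator is linear and injective, this E recovers z almost exactly; with the real acGAN the inversion is only approximate, which is precisely why the optimization of Subsection 2.2.2 is needed on top of z0.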
2.2.2. Latent Vector Optimization

The face aging task, which is the ultimate goal of this work, assumes that while the age of a person must be changed, his/her identity should remain intact. In Subsection 3.3, it is shown that though the initial latent approximations z0 produced by E result in visually plausible face reconstructions, the identity of the original person is lost in about 50% of cases (cf. Table 1). Therefore, the initial latent approximations z0 must be improved.

In [17], the similar problem of image reconstruction enhancement is solved by optimizing the latent vector z to minimize the pixelwise Euclidean distance between the ground truth image x and the reconstructed image x̄. However, in the context of face reconstruction, the described "Pixelwise" latent vector optimization has two clear downsides: firstly, it increases the blurriness of reconstructions, and secondly (and more importantly), it focuses on unnecessary details of input face images which have a strong impact at the pixel level but have nothing to do with a person's identity (like background, sunglasses, hairstyle, moustache, etc.).

Therefore, in this paper, we propose a novel "Identity-Preserving" latent vector optimization approach. The key idea is simple: given a face recognition neural network FR able to recognize a person's identity in an input face image x, the difference between the identities in the original and reconstructed images x and x̄ can be expressed as the Euclidean distance between the corresponding embeddings FR(x) and FR(x̄). Hence, minimizing this distance should improve the identity preservation in the reconstructed image x̄:

z*_IP = argmin_z ||FR(x) − FR(x̄)||_L2    (2)

In this paper, FR is an internal implementation of the "FaceNet" CNN [18]. The generator G(z, y) and the face recognition network FR(x) are differentiable with respect to their inputs, so the optimization problem (2) can be solved using the L-BFGS-B algorithm [19] with backtracking line search. The L-BFGS-B algorithm is initialized with the initial latent approximations z0. Here and below in this work, we refer to the results of the "Pixelwise" and "Identity-Preserving" latent vector optimizations as optimized latent approximations and denote them respectively as z*_pixel and z*_IP. In Subsection 3.3, it is shown both subjectively and objectively that z*_IP better preserves a person's identity than z*_pixel.

3. EXPERIMENTS

3.1. Dataset

Our acGAN has been trained on the IMDB-Wiki cleaned dataset [20], which is a subset of the public IMDB-Wiki dataset [21]. IMDB-Wiki cleaned contains about 120K images, 110K of which have been used for training of acGAN, and the remaining 10K have been used for the evaluation of identity-preserving face reconstruction (cf. Subsection 3.3).

3.2. Age-Conditioned Face Generation

Figure 2 illustrates synthetic faces of different ages generated with our acGAN. Each row corresponds to a random latent vector z and the six columns correspond to the six age conditions y. acGAN perfectly disentangles the image information encoded by the latent vectors z and by the conditions y, making them independent. More precisely, we observe that the latent vectors z encode the person's identity, facial pose, hair style, etc., while y uniquely encodes the age.

Fig. 2. Examples of synthetic images generated by our acGAN using two random latent vectors z (rows) and conditioned on the respective age categories y (columns: 0-18, 19-29, 30-39, 40-49, 50-59, 60+).

In order to objectively measure how well acGAN manages to generate faces belonging to precise age categories, we have used the state-of-the-art age estimation CNN described in [20]. We compare the performances of the age estimation CNN on real images from the test part of IMDB-Wiki cleaned and on 10K synthetic images generated by acGAN. Although the age estimation CNN has never seen synthetic images during training, the resulting mean age estimation accuracy on synthetic images is just 17% lower than on natural ones. This shows that the obtained acGAN can be used for the generation of realistic face images of the required age.

3.3. Identity-Preserving Face Reconstruction and Aging

As explained in Subsection 2.2, we perform face reconstruction (i.e. the first step of our face aging method) in two iterations: firstly, (1) using the initial latent approximations obtained from the encoder E, and then (2) using the optimized latent approximations obtained by either the "Pixelwise" or the "Identity-Preserving" optimization approach. Some examples of original test images and their initial and optimized reconstructions are presented in Figure 3 ((a), (b) and (c), respectively).

It can be seen in Figure 3 that the optimized reconstructions are closer to the original images than the initial ones. However, the choice is more complicated when it comes to the comparison of the two latent vector optimization approaches. On the one hand, "Pixelwise" optimization better reflects superficial face details, such as the hair color in the first line and the beard in the last line. On the other hand, the identity traits (like the form of the head in the second line or the form
Fig. 3. Examples of face reconstruction and aging. (a) original test images, (b) reconstructed images generated using the initial latent approximations z0, (c) reconstructed images generated using the "Pixelwise" and "Identity-Preserving" optimized latent approximations z*_pixel and z*_IP, and (d) aging of the reconstructed images generated using the identity-preserving latent approximations z*_IP and conditioned on the respective age categories y (one per column).
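The identity-preserving optimization of Eq. (2) in Subsection 2.2.2 can be sketched numerically. The sketch below is a toy under loudly stated assumptions: fixed random linear maps `Wg` and `Wf` stand in for the trained generator G and the FaceNet-style network FR (so the gradient can be written by hand via the chain rule), and plain gradient descent replaces the L-BFGS-B algorithm with backtracking line search used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
dim_z, dim_x, dim_e = 8, 32, 16  # latent, "image" and embedding sizes (toy)

# Hypothetical differentiable stand-ins: the paper uses a trained acGAN
# generator G and a FaceNet-style CNN FR; here both are fixed linear maps.
Wg = rng.standard_normal((dim_z, dim_x)) / np.sqrt(dim_z)  # G(z) = z @ Wg
Wf = rng.standard_normal((dim_x, dim_e)) / np.sqrt(dim_x)  # FR(x) = x @ Wf

x = rng.standard_normal(dim_z) @ Wg  # the input "face" x
fr_x = x @ Wf                        # FR(x), constant during optimization

def identity_loss(z):
    """0.5 * ||FR(x) - FR(G(z))||^2, the objective of Eq. (2)."""
    return 0.5 * np.sum((z @ Wg @ Wf - fr_x) ** 2)

def identity_grad(z):
    """Gradient w.r.t. z, chain rule through the linear FR and G."""
    return (z @ Wg @ Wf - fr_x) @ Wf.T @ Wg.T

z = rng.standard_normal(dim_z)  # plays the role of the encoder output z_0
losses = [identity_loss(z)]
for _ in range(300):            # plain gradient descent in place of L-BFGS-B
    z -= 0.05 * identity_grad(z)
    losses.append(identity_loss(z))

print("identity loss:", losses[0], "->", losses[-1])
```

In the full method, the same loop run with L-BFGS-B on the real G and FR maps z0 to z*_IP, which is then fed to step (b) of Figure 1 with a switched age condition; a "Pixelwise" variant would simply replace the embedding residual by the pixel residual x − G(z).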
[5] Bernard Tiddeman, Michael Burt, and David Perrett, "Prototyping and transforming facial textures for perception research," IEEE Computer Graphics and Applications, vol. 21, no. 5, pp. 42–50, 2001.

[6] Ira Kemelmacher-Shlizerman, Supasorn Suwajanakorn, and Steven M Seitz, "Illumination-aware age progression," in Proceedings of Computer Vision and Pattern Recognition, Columbus, USA, 2014.

[7] Jinli Suo, Song-Chun Zhu, Shiguang Shan, and Xilin Chen, "A compositional and dynamic model for face aging," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 3, pp. 385–401, 2010.

[8] Yusuke Tazoe, Hiroaki Gohara, Akinobu Maejima, and Shigeo Morishima, "Facial aging simulator considering geometry and patch-tiled texture," in Proceedings of ACM SIGGRAPH, Los Angeles, USA, 2012.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Proceedings of Advances in Neural Information Processing Systems, Montreal, Canada, 2015.

[10] Diederik P Kingma and Max Welling, "Auto-encoding variational bayes," in Proceedings of International Conference on Learning Representations, Banff, Canada, 2014.

[11] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther, "Autoencoding beyond pixels using a learned similarity metric," in Proceedings of International Conference on Machine Learning, New York, USA, 2016.

[16] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[17] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros, "Generative visual manipulation on the natural image manifold," in Proceedings of European Conference on Computer Vision, Amsterdam, Netherlands, 2016.

[18] Florian Schroff, Dmitry Kalenichenko, and James Philbin, "Facenet: A unified embedding for face recognition and clustering," in Proceedings of Computer Vision and Pattern Recognition, Boston, USA, 2015.

[19] Richard H Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu, "A limited memory algorithm for bound constrained optimization," SIAM Journal on Scientific Computing, vol. 16, no. 5, pp. 1190–1208, 1995.

[20] Grigory Antipov, Moez Baccouche, Sid-Ahmed Berrani, and Jean-Luc Dugelay, "Apparent age estimation from face images combining general and children-specialized deep learning models," in Proceedings of Computer Vision and Pattern Recognition Workshops, Las Vegas, USA, 2016.

[21] Rasmus Rothe, Radu Timofte, and Luc Van Gool, "Dex: Deep expectation of apparent age from a single image," in Proceedings of International Conference on Computer Vision Workshops, Santiago, Chile, 2015.

[22] Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan, "Openface: A general-purpose face recognition library with mobile applications," Tech. Rep. CMU-CS-16-118, CMU School of Computer Science, 2016.