
Enhanced Reweighted MRFs for Efficient Fashion Image Parsing

QIONG WU and PIERRE BOULANGER, University of Alberta

Previous image parsing methods usually model the problem as a conditional random field, which describes a statistical model learned from a training dataset and then labels a query image using the conditional probability. However, fashion items in clothing images have a large variety of layering and configuration, and it is hard to learn a statistical model of features that applies to general cases. In this article, we take fashion images as an example to show how Markov Random Fields (MRFs) can outperform Conditional Random Fields (CRFs) when the application does not follow a certain statistical model learned from the training dataset. We propose a new method for automatically parsing fashion images with high processing efficiency and significantly less training time by applying a modification of MRFs, named reweighted MRF (RW-MRF), which resolves the problem of oversmoothing infrequent labels. We further enhance RW-MRF with an occlusion prior and a background prior to resolve two other common problems in clothing parsing: occlusion and background spill. Our experimental results indicate that the proposed clothing parsing method significantly improves processing time and training time over state-of-the-art methods, while ensuring comparable parsing accuracy and improving the label recall rate.
CCS Concepts: • Computing methodologies → Image segmentation; • Human-centered computing → Interactive systems and tools
Additional Key Words and Phrases: Image parsing, image segmentation, fashion parsing, Markov random field, conditional random field
ACM Reference Format:
Qiong Wu and Pierre Boulanger. 2016. Enhanced re-weighted MRFs for efficient fashion image parsing. ACM
Trans. Multimedia Comput. Commun. Appl. 12, 3, Article 42 (March 2016), 16 pages.
DOI: http://dx.doi.org/10.1145/2890104

1. INTRODUCTION
With the fast growth of fashion-related applications such as e-commerce, clothes recommendation, and retrieval, fashion image parsing, namely detecting and classifying clothing item regions, has drawn much research attention in recent years [Dong et al. 2013; Yamaguchi et al. 2012]. However, all previous research efforts [Liang et al. 2015a, 2015b; Dong et al. 2014; Liu et al. 2015] focus on improving parsing accuracy and do not achieve the high processing and training efficiency required by real-time image tagging applications such as image tagging platforms. For example, EyeDentifyIt [Wu et al. 2014] requires efficient semiautomated and automated image tagging tools to help bloggers tag image content for image monetization purposes. In addition, to the best of our knowledge, no existing method considers the unique characteristics of fashion images. For example, fashion images have large variations in appearance, layering, and occlusion, and it is difficult to learn a single statistical model of features that applies to general cases.


In this work, we aim at providing a real-time fashion parsing method. To this end, we apply a novel Markov Random Field (MRF) formulation, which provides efficient model training and inference. Previous research efforts model the fashion parsing problem using Conditional Random Fields (CRFs), which describe a statistical model learned from a training dataset. However, because fashion items have a large variety of layering and configuration, it is hard to learn a statistical model of features that applies to general cases. For example, shirts alone have an incredibly wide range of appearances, and they can be adjacent to different clothing items, such as shorts, a purse, and a blazer, which have nonuniform features. Such image characteristics increase the difficulty of learning a local/global statistical model that is generally applicable to all fashion images. In addition, training CRF models at different levels of granularity also significantly increases the training time. Therefore, for clothing images, an MRF, which typically formulates a probabilistic generative framework and incorporates local relationships when predicting labels of neighbouring nodes, is more suitable for the clothing parsing problem than the conditional probability model of a CRF. In this article, we take fashion images as an example to show how MRFs can outperform CRFs when the application does not follow a certain statistical model learned from the training dataset.
According to the characteristics of fashion images, we propose a variant of MRFs, named reweighted MRF (RW-MRF), which resolves the problem of oversmoothing infrequent fashion labels caused by unbalanced training data. We further enhance RW-MRF with an occlusion prior and a background prior to resolve two other common problems in clothing parsing: occlusion and background spill. In addition, we integrate the proposed method into a web-based image tagging platform to provide a real-time tagging tool. Our experimental results indicate that the proposed clothing parsing method significantly improves processing and training time over state-of-the-art methods, while ensuring comparable parsing accuracy and improving the label recall rate.
The major contributions of this article can be summarized as follows:
• To the best of our knowledge, this article is the first attempt to explore the reasons causing common problems in clothing parsing, including low parsing accuracy for clothing items having new features, infrequent label prediction, occlusion, and background spill.
• To address infrequent label prediction, we propose a modification of MRFs, named reweighted MRF (RW-MRF), which handles unbalanced labeled data in the training dataset. In RW-MRF, the smoothness term is computed using both local feature contrasts and the predicted unary term.
• We propose an occlusion prior to enhance RW-MRF, which solves the occlusion problem in clothing image parsing. A background prior is also applied to enhance RW-MRF to solve the background spill problem.
• We have developed a real-time web-based image tagging tool which enables users to tag fashion images rapidly, for example, for collecting training datasets and tagging fashion images for image-click-ads purposes.

2. RELATED WORK
2.1. Image Parsing
Image parsing, also called semantic segmentation, refers to the task of segmenting
an image into semantically meaningful regions where each region is labeled according
to a specific object class [Tu et al. 2005]. Existing approaches tackle the problem from various angles: elementary regions (e.g., global, regional, and local), features to describe regions (such as textons [Leung and Malik 2001] or features learned by large-scale deep convolutional neural networks [Zheng et al. 2015]), spatial relationship
cues (such as shape and pose [Winn and Jojic 2005]), incorporation of context [He et al. 2004], and different optimization techniques such as back propagation and graph cut. The most successful algorithms typically use MRFs [Tighe and Lazebnik 2010] or their variant, CRFs.
State-of-the-art image parsing methods usually model the problem in a CRF framework [Yamaguchi et al. 2012; Shotton et al. 2006; He et al. 2004]. A CRF approach models the conditional probability of labels given an image, which will likely depend on structures at different levels of granularity in the image. For example, He et al. [2004] proposed to learn label features at regional levels (a pattern of ground pixels above water pixels) as well as global levels (a rhino/hippo in the water with sky above the horizon) based on a set of labeled images. However, for fashion images, clothing styles vary widely and individual clothing items also have many different appearances. For example, shirts alone have an incredibly wide range of appearances, and they can be adjacent to different clothing items, such as shorts, a purse, and a blazer, which have nonuniform features. Such image characteristics increase the difficulty of learning a local/global statistical model that is generally applicable to all fashion images. In addition, training CRF models at different levels of granularity also significantly increases the training time.
Within this category of work, the goal of our article is to provide an approach for clothing parsing. To the best of our knowledge, this is the first work to explore the superiority of MRFs over CRFs for this specific image parsing problem. After exploring fashion image characteristics, we propose RW-MRF, an MRF framework enhanced with an occlusion prior and a background prior, to improve fashion parsing accuracy and processing/training efficiency.
2.2. Clothing Parsing
The first work in clothing parsing was proposed by Hasan and Hogg [2010], who incorporate a shape prior model into the MRF formulation. Their algorithm considers only four categories: shirt, jacket, tie, and skin. Some previous work incorporated additional information to increase parsing accuracy, such as color-category labels [Liu et al. 2014]. Parselets [Dong et al. 2013] and coparsing [Yang et al. 2014] are based on the assumption that semantic regions have homogeneous appearance. However, this does not hold true for most clothing items, which have large appearance diversity. Yamaguchi et al. [2013] proposed a data-driven approach by building local models from retrieved clothing items. Such data-driven methods are highly limited to the predefined database and cannot be applied to general cases.
The work of Yamaguchi et al. [2012] is most similar to ours. For the input, we both
use the same training dataset. In the query phase, a user needs to provide a list of
labels to be parsed along with the query image. For the output, both works produce
segmented regions, with each being assigned a label. However, our approach differs considerably from that of Yamaguchi et al. [2012]. First, we do not label data at the superpixel level but instead at the region level during the training phase. As mentioned previously, we also incorporate a background prior, an occlusion prior, and a reweighted pairwise term in our framework.
2.3. Real-Time Image Labeling Tool
Studying image tagging tools has been an active research field in recent years. Au-
tomated image recognition technologies rely heavily on large amounts of accurately
labeled training data, which is extremely tedious and costly to collect. Traditional la-
beling tools such as in-house labeling [Fei-Fei et al. 2006] and LabelMe [Russell et al.
2008] are mainly designed for researchers and crowdsourcing workers and do not uti-
lize automated technologies. Markup SVG [Kim et al. 2011] leverages image processing

ACM Trans. Multimedia Comput. Commun. Appl., Vol. 12, No. 3, Article 42, Publication date: March 2016.
42:4 Q. Wu and P. Boulanger

and recognition techniques in a predefined SVG abstraction layer, which is not trivial
to use by nonprofessionals. Researchers often design labeling tools by themselves for
collecting certain kinds of labeled images, such as ImageNet [Deng et al. 2009] and
Fashionista [Yamaguchi et al. 2012]. Such labeling tools are very limited in utilization
beyond collecting particular images. Recently, Wu et al. [2014] proposed a web-based
image labeling tool driven by an image-click-ads user interface, EyeDentifyIt,1 which
motivates general web users to label images that can be utilized for machine learning.
Their work utilizes image processing technologies to reduce human labour involved in
a manual labeling process. However, their tool is not intuitive to use due to the limits
of semiautomated techniques. In this work, we integrate the proposed clothing parsing
method in EyeDentifyIt, so the labeling process can be automated with intuitive inputs
from users. Our method greatly reduces manual labour in an image labeling process.
3. REWEIGHTED MRF MODEL FOR CLOTHING PARSING
This section describes the proposed framework for parsing fashion images, including
the formal mathematical definition.
The processing pipeline is defined as follows:
1. Estimate the pose configuration {x_p} from image I.
2. Obtain superpixels {s_i} from image I.
3. Compute feature vectors from {s_i}.
4. Formulate the problem as an RW-MRF and compute the energy function E(L), including the unary term and the pairwise term, based on the pretrained global model and the feature vectors.
5. Predict the label assignment L* by minimizing E(L).
We describe each step and the formal definition of the RW-MRF model in detail in the following subsections.
3.1. Pose Estimation
Two common preprocessing steps are required when performing fashion image parsing:
pose estimation and superpixel segmentation. Figure 1 shows an example of the pose
estimation and the SLIC segmentation. Most fashion images contain a single person with a relatively simple pose. The person's clothing appearance is in general highly correlated with his/her pose. Under such assumptions, the human pose provides important clues for fashion tag prediction. For example, a belt can only appear at the waist. As in other similar work [Yamaguchi et al. 2012; Liu et al. 2014; Kalantidis et al. 2013], we use the state-of-the-art pose estimation algorithm described in Yang and Ramanan [2011] to compute the locations of 14 joints, such as the head, neck, and left/right knees, represented by the vector X = {x_p} and obtained as in Yang and Ramanan [2011] by:

X = arg max_X P(X | I).    (1)

Because there is a high correlation between human pose and clothing items in the ground-truth training data, the prediction is assumed to exhibit the same correlation. Poor pose estimation can therefore cause erroneous classification predictions. In both MRF and CRF models, predictions computed from feature vectors (which contain estimated pose features) alone are mainly reflected in the data term. Therefore, we employ local region contrasts in our proposed model to compute the pairwise term, which can correct an erroneous data term initialization and obtain more reliable predictions. More details are explained in Section 3.3.

1 http://www.eyedentifyit.com.


Fig. 1. Proposed RW-MRF model pipeline.

3.2. Image Patches and Features


As in many parsing methods [Yamaguchi et al. 2012], a superpixel segmentation is employed as a preprocessing step. We assume that each patch contains similar pixels and that all pixels in one patch belong to the same category, so superpixels can be used as building blocks in our processing pipeline, which greatly reduces the computational cost. A recent fashion parsing method [Yamaguchi et al. 2012] uses a hierarchical segmentation algorithm [Arbelaez et al. 2011], which yields better segmentation seeds (merging similar regions in a hierarchical structure) at a higher computational cost. In contrast, we obtain more regularly gridded superpixel regions {s_i} using the naive but more efficient SLIC algorithm [Achanta et al. 2012]. Because of this suboptimality, corrections need to be applied in follow-up processing steps to obtain comparable parsing accuracy.
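To make the step concrete, a minimal sketch of the superpixel stage in Python, assuming scikit-image is available; the file name and the n_segments/compactness settings are illustrative placeholders, not values from the paper:

    import numpy as np
    from skimage.io import imread
    from skimage.segmentation import slic

    image = imread("query.jpg")              # H x W x 3 RGB query image (placeholder path)
    superpixels = slic(image,
                       n_segments=600,       # target number of patches (illustrative)
                       compactness=10,       # trades color fidelity for grid regularity
                       start_label=0)        # patch ids 0..K-1
    num_patches = superpixels.max() + 1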
Before computing the label set L, a feature vector with five types of features is extracted at each pixel as its global feature representation. These features are as follows:

• RGB color [m × n × 3]: the red, green, and blue color channels of the image, each channel with a value in the range [0, 255];
• CIELAB color [m × n × 3]: a color model adopted by the CIE [Schanda 2007] in 1976 that better describes uniform color spacing, with dimension L for lightness and a and b for the color-opponent dimensions;
• Gabor feature [m × n × 4]: responses of a feature filter used for edge detection;
• Absolute two-dimensional (2D) coordinates [m × n × 2]: the absolute 2D coordinates of each pixel;
• Relative 2D coordinates with respect to each body joint location x_p [m × n × 28]: the 2D coordinates of each pixel relative to each body joint location;

where [m, n] is the size of image I. Each feature is normalized into a 10-bin histogram independently and then concatenated to form the final feature vector of size [m × n × 400], which is then aggregated over superpixel patches, represented as Φ(s_i, X).
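Continuing the sketch above, the per-superpixel histogram aggregation might be implemented as follows; the exact Gabor filter bank and normalization details are assumptions, and the 28 joint-relative channels are omitted for brevity:

    from skimage.color import rgb2gray, rgb2lab
    from skimage.filters import gabor

    def channel_histograms(channels, superpixels, bins=10):
        """Aggregate each pixel-level channel into a per-patch 10-bin histogram."""
        n = superpixels.max() + 1
        feats = []
        for ch in channels:
            ch = np.asarray(ch, dtype=float)
            lo, hi = ch.min(), ch.max()
            idx = np.clip(((ch - lo) / (hi - lo + 1e-9) * bins).astype(int), 0, bins - 1)
            hist = np.zeros((n, bins))
            np.add.at(hist, (superpixels.ravel(), idx.ravel()), 1.0)
            feats.append(hist / hist.sum(axis=1, keepdims=True))
        return np.hstack(feats)                  # n_patches x (len(channels) * bins)

    gray = rgb2gray(image)
    gabors = [gabor(gray, frequency=0.4, theta=t)[0]          # real Gabor responses
              for t in np.linspace(0, np.pi, 4, endpoint=False)]
    ys, xs = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    channels = ([image[..., c] for c in range(3)]             # RGB
                + [rgb2lab(image)[..., c] for c in range(3)]  # CIELAB
                + gabors + [ys, xs])                          # Gabor + absolute 2D
    features = channel_histograms(channels, superpixels)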


3.3. RW-MRF Model


The clothing parsing problem can be formulated as a pixel-level labeling problem. The goal is to assign a clothing tag such as "top," "pants," "skin," "hair," or "null" (background) to each pixel. Because tagging each pixel is not computationally efficient, we simplify the problem by grouping uniform pixels into superpixel regions and reduce the problem to a graph labeling process over a set of superpixels. Let I = {s_i}_{i∈S} denote a fashion image showing a person, where s_i is the data from the ith patch of the superpixel set S. One can formulate the problem as a graph model that finds the solution L*, minimizing an energy function defined as:

L* = arg min_L E(L) = arg min_L (E_data(L) + E_smooth(L)).    (2)

The unary term accounts for the cost of assigning a label to a superpixel patch according to its features. One important feature for clothing parsing is the human pose, denoted by X = {x_p}, where x_p is a set of image coordinates for body joint p, such as the head, arms, legs, and neck. The pairwise term accounts for the cost of assigning a pair of labels to neighbouring patches, which incorporates region contrasts. The unary term and pairwise term can be represented by:

E_data(L) = Σ_{i∈S} ψ_1(l_i | X, I)    (3)

and

E_smooth(L) = Σ_{(i,j)∈V} ψ_2(l_i, l_j | X, I),    (4)

where ψ_1 and ψ_2 are the unary and pairwise terms, respectively, and V is the set of neighbouring patches.
Unary Term: Classifier Potential. With the feature vector Φ(s_i, X) for each superpixel and the parameters θ of the trained model, we model the unary term using the probability of a label assignment for each superpixel patch:

ψ_1(l_i | X, I) = −λ_1 ln P(l_i | Φ(s_i, X), θ).    (5)
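As an illustration, the unary costs could be obtained from such a classifier as sketched below; scikit-learn's logistic regression stands in for liblinear here, and train_features/train_labels are assumed arrays of patch features and labels prepared as in Section 3.2:

    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression(penalty="l2", C=1.0)    # L2-regularized model (theta)
    clf.fit(train_features, train_labels)            # assumed training arrays

    lam1 = 20.0                                      # lambda_1, typical range [19, 23]
    proba = clf.predict_proba(features)              # P(l_i | Phi(s_i, X), theta)
    unary = -lam1 * np.log(proba + 1e-12)            # psi_1 for each patch and label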

Reweighted Pairwise Term: Spatial Smoothness for an Unbalanced Unary Term. The pairwise term favours a piecewise-constant label map. The idea is that neighbouring patches should have similar labels, especially if their colors are close. Such a pairwise prior can reduce the effects of imperfect pose detection. The pairwise term ψ_2(l_i, l_j | X, I) encodes the affinity of neighbouring nodes through edge weights, in such a way that nodes connected by high-weight edges are considered strongly connected and nodes connected by low-weight edges are considered disconnected. The patch affinity is measured by the Euclidean distance between the mean colors (in CIELAB color space) of two patches s_i and s_j, represented as:

w = max(‖I_i − I_j‖, ε),    (6)

where ε is a clipping distance that prevents division by zero in the pairwise term computation (Eq. (7)) and limits the overflow of the total energy E(L) (beyond MAX_ENERGY) in Eq. (2). The pairwise term in our model is defined to measure the


Fig. 2. Comparison of pairwise term of MRF and RW-MRF. (a) The original pairwise term is computed by
only considering the similarity of two neighbouring nodes. (b) The reweighted pairwise term is computed by
considering not only feature similarity but also the unary term of two neighbouring nodes.

reweighted spatial smoothness, represented as:

ψ_2(l_i, l_j | X, I) = λ_2 · [max(ψ_1(l_i | X, I)) + max(ψ_1(l_j | X, I))] / (2w) · exp(−βw),    (7)

where β = (2σ²)^{−1} and σ² is the average square distance between color vectors for adjacent patches in an image.
The pairwise term in Eq. (7) assigns the edge weight between two nodes using the intensity difference reweighted by the data terms of the two nodes. Such a reweighted smoothing term is used to correct the oversmoothing of infrequent labels, such as "necklace" and "purse." As shown in Figure 2, infrequent labels cause low-probability predictions; this is known as the unbalanced data problem. The effect of reweighting is demonstrated in the experimental section.
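Under the reconstruction of Eqs. (6) and (7) above, the reweighted pairwise weights could be sketched as follows; adjacency (the list of neighbouring patch pairs) and mean_lab (the mean CIELAB color per patch) are assumed to come from the superpixel step:

    eps, lam2 = 3.0, 2.0                      # epsilon in [1, 5], lambda_2 in [1, 3]

    dists = np.array([np.linalg.norm(mean_lab[i] - mean_lab[j])
                      for i, j in adjacency])
    w = np.maximum(dists, eps)                # Eq. (6): clipped color distance
    beta = 1.0 / (2.0 * (dists ** 2).mean())  # (2 * sigma^2)^-1 from image statistics

    pairwise = {}
    for (i, j), wij in zip(adjacency, w):
        reweight = (unary[i].max() + unary[j].max()) / (2.0 * wij)
        pairwise[(i, j)] = lam2 * reweight * np.exp(-beta * wij)   # psi_2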
4. ENHANCING PRIORS
4.1. Background Prior (BP)
The background prior was first proposed by Wei et al. [2012] for estimating an object's saliency in an image. It is based on two observations: salient objects do not touch the image boundary (the boundary prior), and backgrounds are continuous and homogeneous (the connectivity prior). It is a basic rule of photographic composition that most photographers will not crop salient objects along the view frame. Because in fashion images garments are usually attached to the human body, which is a salient object, one can apply the same rule to fashion images to distinguish the foreground from the background.
As proposed by Wei et al. [2012], we also assume that all boundary patches are background. As shown in Figure 1, the superpixel patches around the image borders are assigned as background nodes, represented as:

ψ_1(l_i = null | X, I) = 0.    (8)
Applying a background prior has two major advantages. First, it further reduces
the computational complexity for patches on the image borders. Second, it provides a
prior to group homogeneous background regions together in an energy optimization
computation.
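A sketch of this prior, continuing the earlier snippets; NULL_ID is an assumed index for the null (background) label:

    # Patches touching the image border are fixed as background (Eq. (8)).
    border = np.unique(np.concatenate([superpixels[0, :], superpixels[-1, :],
                                       superpixels[:, 0], superpixels[:, -1]]))
    NULL_ID = 0                      # assumed index of the "null" label
    unary[border, NULL_ID] = 0.0     # psi_1(l_i = null | X, I) = 0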
4.2. Occlusion Prior (OP)
In fashion images, because human poses, clothing layering, and configuration have
large variations, very often clothing items (as well as background) are occluded. In


Fig. 3. Illustration of the occlusion problem: (a) Original image. (b) Parsing result using the algorithm of Yamaguchi et al. [2012]. (c) Parsing result with the occlusion prior, using our method.

Fig. 4. Graph power: the NON edges (in red) in G² are computed from G.

such a case, when an object is partially occluded and separated into disconnected regions, these regions may be assigned different labels even when they have similar colors. An example is shown in Figure 3. Because the sweater is partially occluded by the bag and the hair on the shoulders, the sweater regions on the left arm and the right arm are disconnected. Therefore, the sweater label predicted by Yamaguchi et al. [2012] is incomplete (the sweater regions around the arms are predicted as "null"). This is a common problem for both CRF and MRF methods, which consider only neighbouring patches in pairwise terms. Clothing patches that are partially occluded (disconnected) by other objects are not considered neighbouring nodes and are not connected by an edge in the graph modelling process.
We tackle this problem by treating neighbours-of-neighbourhood (NON) relationships between nodes as neighbourhood relationships as well in the MRF. Intuitively, adding edges between NON nodes helps smooth label assignments for object regions that are occluded by other objects. The idea is that NON patches (potentially occluded object parts) should have similar labels if they possess similar features.
Suppose that the neighbourhood set is V; our goal is to find a new neighbourhood set V′ that includes NON, represented as:

V′ = V + NON.    (9)

We use a graph power computation to obtain NON. The kth power G^k of an undirected graph G is defined as another graph that has the same set of vertices but in which two vertices are adjacent when their distance in G is at most k. Figure 4 shows an example of G² from Wikipedia.² For the purpose of image parsing,
we look for NON relationships between nodes, and therefore k = 2. G² is computed by building an adjacency matrix A for the graph; the nonzero entries of A² then give the adjacency matrix of the second power of the graph.
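The NON computation can be sketched with a boolean adjacency matrix, where squaring the matrix exposes the two-hop neighbours:

    n = superpixels.max() + 1
    A = np.zeros((n, n), dtype=bool)
    for i, j in adjacency:                       # first-order neighbours
        A[i, j] = A[j, i] = True

    A2 = (A.astype(int) @ A.astype(int)) > 0     # paths of length two
    np.fill_diagonal(A2, False)                  # drop self-loops from squaring
    non_edges = [(i, j) for i, j in zip(*np.nonzero(A2 & ~A)) if i < j]
    V_prime = list(adjacency) + non_edges        # Eq. (9): V' = V + NON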

5. TRAINING AND INFERENCE


The model in Equation (5) is trained by logistic regression with L2 regularization (the liblinear library [Fan et al. 2008]), which provides the estimate of the probability distribution P(l_i | Φ(s_i, X), θ). Because we use an MRF model, there is no need to train a pairwise model; the pairwise term is computed from local image information. Therefore, the training time is significantly reduced compared to CRF models. We experimentally chose the clipping distance ε and the smoothing parameter, and found the best values for λ_1 and λ_2, by maximizing the cross-validated pixel accuracy on the training data. In our experiments, typical values are ε ∈ [1, 5], the smoothing parameter in [0.1, 0.3], λ_1 ∈ [19, 23], and λ_2 ∈ [1, 3]. The main computational cost comes from the MRF inference step. Alpha-expansion, implemented using the gco-v3.0 library [Boykov and Kolmogorov 2004], is applied to solve the multilabel optimization problem. The computational complexity is O(|S||L|), where |S| is the number of superpixels and |L| is the number of clothing labels.
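For illustration, the parameter selection could be a plain grid search over the typical ranges above; parse, pixel_accuracy, and validation_set are assumed helpers and data wrapping the full pipeline, not part of the paper:

    from itertools import product

    best_params, best_acc = None, -1.0
    for eps, smooth, lam1, lam2 in product([1, 3, 5],        # epsilon
                                           [0.1, 0.2, 0.3],  # smoothing parameter
                                           [19, 21, 23],     # lambda_1
                                           [1, 2, 3]):       # lambda_2
        acc = np.mean([pixel_accuracy(parse(img, eps, smooth, lam1, lam2), gt)
                       for img, gt in validation_set])
        if acc > best_acc:
            best_params, best_acc = (eps, smooth, lam1, lam2), acc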

6. EXPERIMENTAL RESULTS
Several sets of experiments were carried out to quantitatively and qualitatively analyze the performance of the method. The Fashionista dataset [Yamaguchi et al. 2012], containing 685 images annotated with clothing labels, is used to evaluate the performance quantitatively. Results are compared with the state-of-the-art clothing parsing method [Yamaguchi et al. 2012]. All measurements use 10-fold cross validation. Images randomly downloaded from the web are used to evaluate the performance qualitatively, which shows the method's robustness against new image features that are not included in the training dataset.

6.1. Quantitative Comparisons


Two criteria were used in the evaluation: Average Pixel Accuracy (APA) and Mean Average Garment Recall (MAGR), defined, respectively, as:

APA = (1/N) Σ_{i=1}^{N} (# of true-positive pixels in I_i) / (# of total pixels in I_i)    (10)

and

MAGR = (1/N) Σ_{i=1}^{N} (# of true-positive labels in I_i) / (# of true-positive labels in I_i + # of false-negative labels in I_i),    (11)

where N is the number of images in the dataset.
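Under one plausible reading of Eqs. (10) and (11), in which a garment label counts as a true positive if any of its ground-truth pixels receives that label, the two measures could be computed as:

    def apa(preds, truths):
        """Average pixel accuracy over (predicted, ground-truth) label map pairs."""
        return np.mean([(p == t).mean() for p, t in zip(preds, truths)])

    def magr(preds, truths):
        """Mean average garment recall; label-level recall averaged over images."""
        recalls = []
        for p, t in zip(preds, truths):
            present = np.unique(t)                       # labels in the ground truth
            found = [l for l in present if (p[t == l] == l).any()]
            recalls.append(len(found) / len(present))    # TP / (TP + FN)
        return np.mean(recalls)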


As in Yamaguchi et al. [2012], we experimentally chose the best model parameters that maximize the pixel APA in all our experiments. No ground-truth pose information was used in any computation.
6.1.1. Comparison among MRF, RW-MRF, and Enhanced RW-MRF. We compared four different versions of our method and summarize the results in Table I: MRF, reweighted MRF (RW), reweighted MRF with a background prior (RW+BP), and reweighted MRF with both background and occlusion priors (RW+BP+OP). RW

2 https://en.wikipedia.org/wiki/Graph_power.


Table I. Labeling APA and Recall Rate for Using MRF, Reweighted MRF (RW),
Reweighted MRF with Background Prior (RW+BP), and Reweighted MRF
with both Background and Occlusion Prior (RW+BP+OP)
            MRF     RW      RW+BP   RW+BP+OP
Pixel APA   87.3%   88.8%   89.7%   90.5%
MAGR        61.5%   63.0%   62.8%   61.4%

Table II. Performance Comparison: Our Method Compared to the CRF Method
by [Yamaguchi et al. 2012] and a Baseline Labeling
Method                     Pixel APA   MAGR    Training     Processing
ours                       90.5%       63.0%   631.8 sec    5.2 sec
[Yamaguchi et al. 2012]    85.1%       57.2%   4546.7 sec   81.5 sec
baseline                   77.6%       12.8%   N/A          N/A

outperforms MRF on MAGR mainly because the reweighted pairwise term avoids oversmoothing infrequent labels.3 The pixel APA steadily improves with more prior information, with RW+BP+OP reaching the best pixel APA of 90.5%. However, BP and OP also cause a slight drop in MAGR because of potential oversmoothing (loss of labels) of infrequent labels: BP increases the chance of a node being assigned as background, and OP increases the chance of a node being assigned the same label as its NON.
6.1.2. Comparison between CRF and Enhanced RW-MRF. We compared the performance
of our method with the CRF model by Yamaguchi et al. [2012]. Table II summarizes the results of the performance comparison. The baseline method naively predicts all regions to be background, resulting in a 77.6% APA and a 12.8% MAGR. The CRF model by Yamaguchi et al. [2012] obtained an 85.1% APA and a 57.2% MAGR. In comparison, our method obtains a 5.4% gain in pixel APA and a 5.8% gain in MAGR over that CRF model. Since we use an MRF model, we only need to train the global model for the unary term. No pairwise model needs to be trained over pairwise clothing labels and features of neighbouring pairs, as in the CRF model [Yamaguchi et al. 2012]. This gives us an 86.1% improvement in training time. Compared to the expensive hierarchical segmentation algorithm [Arbelaez et al. 2011] used with the CRF model, the robustness of our model allows us to use a simpler segmentation algorithm (SLIC), which improves the processing time by 93.6%. Due to the simplicity of the MRF training model, our method also exceeds other state-of-the-art methods in training time while maintaining or improving APA and MAGR. For example, the method of Dong et al. [2014] reported 87% APA on the Fashionista dataset but takes much longer to train, for example, more than half a day. Deep learning-based methods typically use 10 times more training images (e.g., 6,000 images [Liang et al. 2015a]) and take much longer (e.g., 11 to 12 days [Liu et al. 2015]) to train a model, with a slight improvement in APA (e.g., 91.1% in Liang et al. [2015a]). Efficient processing and training make our method more applicable to real-time applications such as web-based image tagging.

6.2. Qualitative Comparisons


We qualitatively compare the parsing results of the CRF model by Yamaguchi et al. [2012] and our method. Figure 5 shows the parsing results of eight test cases: four images (Figures 5(a) to 5(d)) randomly downloaded from the web and four images (Figures 5(e) to 5(h)) from Fashionista. Figures 5(a) to 5(d) show that our method

3 Infrequent labels are defined as labels covering small regions or appearing at low frequency.


Fig. 5. Comparison of image parsing results in visual quality: in each subfigure, the left image is the parsing
result by Yamaguchi et al. [2012] and the right image is our parsing result.


Fig. 6. Comparison of parsing results in visual quality between the MRF (left) and reweighted MRF (right)
model in each subfigure.

performs robustly on new image features that are not included in the training dataset. In our method, only the data term relies on the training model. The smoothness term is computed from the query image itself, which reduces the effect of the training model on the parsing results. Therefore, our method can use the smoothness term to correct erroneous computations from the training model. Although only four such test cases are displayed here, similar tests were performed on many other images with the same effectiveness. From the results, one can also see that our method is better at preserving label smoothness (in line with region contrasts) as well as retaining infrequent region labels, for example, the hat in Figure 5(c) and the shoes in Figure 5(g).
6.2.1. Improved Results on Infrequent Labels. We also compared the parsing results of the RW-MRF and MRF models. As shown in Figure 6, the necklace in (a), the shoes and skin in (b), the purse in (c), and the socks in (d) are suppressed by the MRF model but well recovered by the RW-MRF model.
6.2.2. Improved Results on Background Spill and Occlusion. In Figure 7, we illustrate the progressive improvements made by RW+BP and RW+BP+OP compared to the existing method by Yamaguchi et al. [2012]. The images in Figures 7(a) and 7(d) are shown in the work by Yamaguchi et al. [2012] as bad examples because of label spill into the background regions. As shown here, these problems can be corrected with RW+BP+OP.

7. REAL-TIME IMAGE LABELING TOOL


The efficient computation of our clothing parsing algorithm allows it to be applied in real-time applications. Using the proposed parsing algorithm, we built a web-based clothing labeling tool to help users label fashion images more efficiently. Figure 8 demonstrates


Fig. 7. Some parsing results for the Fashionista [Yamaguchi et al. 2012] dataset, using the CRF model by
Yamaguchi et al. [2012] and the RW+BP and RW+BP+OP models.


Fig. 8. Image labeling interface with the automated image parsing method (better viewed in color): (a) A user adds a list of fashion items (jacket, pants, and boots in this example) that appear in the image. (b) After computation, parsed regions for all labels are returned as differently coloured polygon regions, with each polygon having equally interpolated points. (c) Labeled regions can be visualized and interacted with in a web-based image-click-ads framework [Wu et al. 2014].

the tool. To label an image, a user just needs to select labels ("hair," "skin," and "null" are added by default) for all items appearing in the image. The image, along with the given labels, is then processed by the proposed parsing method. After computation, the predicted region for each label is returned as a polygon, which can be edited (by dragging polygon points or adding new intermediate points) to improve the region accuracy. After labeling, results can be saved for future data retrieval. As shown in Figure 8(c), images labeled by our tool can be visualized and interacted with through the image-click-ads framework enabled by EyeDentifyIt [Wu et al. 2014].
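The mask-to-polygon step could be sketched with contour tracing as below; the interpolation density and the largest-contour heuristic are assumptions about the tool's internals:

    from skimage import measure

    def label_polygon(label_map, label_id, n_points=30):
        """Return ~n_points equally spaced polygon handles for one predicted label."""
        mask = (label_map == label_id).astype(float)
        contours = measure.find_contours(mask, 0.5)   # lists of (row, col) vertices
        if not contours:
            return None                               # label absent from the image
        contour = max(contours, key=len)              # keep the largest region
        step = max(1, len(contour) // n_points)
        return contour[::step]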
Our labeling tool can be used for different image labeling purposes. For example, researchers and/or crowdsourcing workers can use it to collect labeled images. Compared to state-of-the-art image labeling tools such as LabelMe [Russell et al. 2008] and Markup SVG [Kim et al. 2011], we integrate the image parsing method to automatically label image regions. Our tool labels images in a format enabled by EyeDentifyIt [Wu et al. 2014], so they can easily be utilized for image-click-ads purposes.

8. CONCLUSION AND FUTURE WORK


This article proposes a novel method to parse fashion images into constituent garment regions. By using a background prior, an occlusion prior, and a novel reweighted MRF model, we show that the algorithm can outperform state-of-the-art methods. The background prior initializes the border regions as background nodes, and the occlusion prior builds NON relationships to add more pairwise edges to the graph model; both contribute to increasing parsing accuracy. The pairwise term in our MRF model is redefined by incorporating the data term, acting as a reweighted pairwise term, which improves region prediction for infrequent labels. Our method is also robust to queries of random images whose features do not exist in the training dataset, because we employ local region contrasts in the MRF to correct erroneous data terms computed by the training model.
In future work, we would like to address the limitations of our current method, including reducing the effect of pose detection on parsing results and preventing oversmoothing of regions that have similar features (e.g., hair and skin in Figure 6(a)).


REFERENCES
Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Susstrunk. 2012. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 34, 11 (Nov. 2012), 2274–2282.
Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. 2011. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 33, 5 (May 2011), 898–916.
Yuri Boykov and Vladimir Kolmogorov. 2004. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell. 26, 9 (Sept. 2004), 1124–1137.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR'09.
Jian Dong, Qiang Chen, Xiaohui Shen, Jianchao Yang, and Shuicheng Yan. 2014. Towards unified human parsing and pose estimation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'14). IEEE, Washington, DC, 843–850.
Jian Dong, Qiang Chen, Wei Xia, Zhongyang Huang, and Shuicheng Yan. 2013. A deformable mixture parsing model with parselets. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV'13). IEEE Computer Society, Washington, DC, 3408–3415.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 9 (June 2008), 1871–1874.
Li Fei-Fei, R. Fergus, and P. Perona. 2006. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 28, 4 (April 2006), 594–611.
Basela Hasan and David Hogg. 2010. Segmentation using deformable spatial priors with application to clothing. In Proceedings of the British Machine Vision Conference. BMVA Press, 83.1–83.11.
Xuming He, Richard S. Zemel, and Miguel A. Carreira-Perpinan. 2004. Multiscale conditional random fields for image labeling. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), Vol. 2. IEEE, Washington, DC, II-695.
Yannis Kalantidis, Lyndon Kennedy, and Li-Jia Li. 2013. Getting the look: Clothing recognition and segmentation for automatic product suggestions in everyday photos. In Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval (ICMR'13). ACM, New York, NY, 105–112.
E. Kim, XiaoLei Huang, and Gang Tan. 2011. Markup SVG: An online content-aware image abstraction and annotation tool. IEEE Trans. Multimed. 13, 5 (Oct. 2011), 993–1006.
Thomas Leung and Jitendra Malik. 2001. Representing and recognizing the visual appearance of materials using three-dimensional textons. Int. J. Comput. Vision 43, 1 (2001), 29–44.
Xiaodan Liang, Si Liu, Xiaohui Shen, Jianchao Yang, Luoqi Liu, Jian Dong, Liang Lin, and Shuicheng Yan. 2015a. Deep human parsing with active template regression. IEEE Trans. Pattern Anal. Mach. Intell. 37, 12 (2015), 2402–2414.
Xiaodan Liang, Chunyan Xu, Xiaohui Shen, Jianchao Yang, Si Liu, Jinhui Tang, Liang Lin, and Shuicheng Yan. 2015b. Human parsing with contextualized convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision. 1386–1394.
Si Liu, Jiashi Feng, C. Domokos, Hui Xu, Junshi Huang, Zhenzhen Hu, and Shuicheng Yan. 2014. Fashion parsing with weak color-category labels. IEEE Trans. Multimed. 16, 1 (Jan. 2014), 253–265.
Si Liu, Xiaodan Liang, Luoqi Liu, Xiaohui Shen, Jianchao Yang, Changsheng Xu, Liang Lin, Xiaochun Cao, and Shuicheng Yan. 2015. Matching-CNN meets KNN: Quasi-parametric human parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1419–1427.
Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. 2008. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vision 77, 1–3 (May 2008), 157–173.
Janos Schanda. 2007. Colorimetry: Understanding the CIE System. John Wiley & Sons, New York, NY.
Jamie Shotton, John Winn, Carsten Rother, and Antonio Criminisi. 2006. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proceedings of the 9th European Conference on Computer Vision, Volume Part I (ECCV'06). Springer-Verlag, Berlin, 1–15. DOI:http://dx.doi.org/10.1007/11744023_1
Joseph Tighe and Svetlana Lazebnik. 2010. SuperParsing: Scalable nonparametric image parsing with superpixels. In Computer Vision, ECCV 2010. Springer, Berlin, 352–365.
Zhuowen Tu, Xiangrong Chen, Alan L. Yuille, and Song-Chun Zhu. 2005. Image parsing: Unifying segmentation, detection, and recognition. Int. J. Comput. Vision 63, 2 (2005), 113–140.
Yichen Wei, Fang Wen, Wangjiang Zhu, and Jian Sun. 2012. Geodesic saliency using background priors. In Proceedings of the 12th European Conference on Computer Vision, Volume Part III (ECCV'12). Springer-Verlag, Berlin, 29–42.
John Winn and Nebojsa Jojic. 2005. LOCUS: Learning object classes with unsupervised segmentation. In Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV'05), Vol. 1. IEEE, Washington, DC, 756–763.
Qiong Wu, Rui Gao, Xida Chen, and Pierre Boulanger. 2014. Tagging driven by interactive image discovery: Tagging-tracking-learning. In Proceedings of the 2014 IEEE International Symposium on Multimedia (ISM'14). IEEE, Washington, DC, 179–186.
K. Yamaguchi, M. H. Kiapour, and T. L. Berg. 2013. Paper doll parsing: Retrieving similar styles to parse clothing items. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV'13). IEEE, Washington, DC, 3519–3526.
Kota Yamaguchi, M. Hadi Kiapour, Luis E. Ortiz, and Tamara L. Berg. 2012. Parsing clothing in fashion photographs. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'12). IEEE, Washington, DC, 3570–3577.
Wei Yang, Ping Luo, and Liang Lin. 2014. Clothing co-parsing by joint image segmentation and labeling. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'14). IEEE, Washington, DC, 3182–3189.
Yi Yang and D. Ramanan. 2011. Articulated pose estimation with flexible mixtures-of-parts. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'11). IEEE Computer Society, Washington, DC, 1385–1392.
Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip Torr. 2015. Conditional random fields as recurrent neural networks. arXiv:1502.03240 (2015).

Received August 2015; revised November 2015; accepted September 2015
