Low-Rank Structure Learning via Nonconvex Heuristic Recovery
Yue Deng, Qionghai Dai, Senior Member, IEEE, Risheng Liu, Zengke Zhang,
and Sanqing Hu, Senior Member, IEEE

Abstract—In this paper, we propose a nonconvex framework to learn the essential low-rank structure from corrupted data. Different from traditional approaches, which directly utilize convex norms to measure the sparseness, our method introduces more reasonable nonconvex measurements to enhance the sparsity in both the intrinsic low-rank structure and the sparse corruptions. We will, respectively, introduce how to combine the widely used ℓp norm (0 < p < 1) and the log-sum term into the framework of low-rank structure learning. Although the proposed optimization is no longer convex, it can still be effectively solved by a majorization-minimization (MM)-type algorithm, with which the nonconvex objective function is iteratively replaced by its convex surrogate and the nonconvex problem finally falls into the general framework of reweighted approaches. We prove that the MM-type algorithm converges to a stationary point after successive iterations. The proposed model is applied to solve two typical problems: robust principal component analysis and low-rank representation. Experimental results on low-rank structure learning demonstrate that our nonconvex heuristic methods, especially the log-sum heuristic recovery algorithm, generally perform much better than the convex-norm-based method for both data with higher rank and with denser corruptions.

Index Terms—Compressive sensing, log-sum heuristic, matrix learning, nuclear norm minimization, sparse optimization.

Manuscript received February 12, 2012; revised December 5, 2012; accepted December 7, 2012. Date of publication January 4, 2013; date of current version January 30, 2013. This work was supported in part by the National Basic Research Program under Project 2010CB731800 and the Key Project of the National Science Foundation of China under Grant 61120106003, Grant 61035002, and Grant 61021063.
Y. Deng, Q. Dai, and Z. Zhang are with the Department of Automation, Tsinghua University, Beijing 100084, China (e-mail: yuedeng.thu@gmail.com; qhdai@tsinghua.edu.cn; zzk@tsinghua.edu.cn).
R. Liu is with the Electronic Information and Electrical Engineering Department, Dalian University of Technology, Dalian 116024, China (e-mail: rsliu0705@gmail.com).
S. Hu is with the College of Computer Science, Hangzhou Dianzi University, Hangzhou 310000, China (e-mail: sqhu@hdu.edu.cn).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2012.2235082

NOMENCLATURE

ADM     Alternating direction method.
MM      Majorization minimization.
GPCA    Generalized principal component analysis.
PCP     Principal component pursuit.
LHR     Log-sum heuristic recovery.
LLMC    Locally linear manifold clustering.
LRMR    Low-rank matrix recovery.
LRR     Low-rank representation.
LRSL    Low-rank structure learning.
LSA     Local subspace affinity.
MoG     Mixture of Gaussian.
RANSAC  Random sample consensus.
RPCA    Robust principal component analysis.
SC      Subspace clustering.
SSC     Sparse subspace clustering.

I. INTRODUCTION

LEARNING the intrinsic data structures via matrix analysis [1], [2] has received wide attention in many fields, e.g., neural networks [3], learning systems [4], [5], control theory [6], computer vision [7], [8], and pattern recognition [9]–[11]. There are quite a number of efficient mathematical tools for rank analysis, e.g., principal component analysis (PCA) and singular value decomposition (SVD). However, these typical approaches can only handle some preliminary and simple problems. With the recent progress of compressive sensing [12], a new concept of nuclear norm optimization has emerged in the field of rank minimization [13] and has led to a number of interesting applications, e.g., low-rank structure learning (LRSL) from corruptions.

LRSL is a general model for many practical problems in the communities of machine learning and signal processing, which considers learning data of low-rank structure from sparse errors [14]–[17]. Such a problem can be formulated as P = f(A) + g(E), where P is the corrupted matrix observed in the practical world, A and E are the low-rank matrix and the sparse corruption, respectively, and the functions f(·) and g(·) are both linear mappings. Recovering two variables (i.e., A and E) from just one equation is an ill-posed problem, but it can still be addressed by optimizing

    (P0)    min_{(A,E)}  rank(A) + ||E||_0
            s.t.  P = f(A) + g(E).                                   (1)

In (P0), rank(A) is adopted to describe the low-rank structure of matrix A, and the sparse errors are penalized via ||E||_0, where the ℓ0 norm counts the number of all the nonzero entries in a matrix. (P0) is always referred to as sparse optimization since the rank term and the ℓ0 norm are sparse measurements for matrices and vectors, respectively. However, such sparse optimization is of little use due to the discrete nature of (P0), and the exact solution to it requires an intractable combinatorial search.

A common approach that makes (P0) tractable tries to minimize its convex envelope, where the rank of a matrix is replaced by the nuclear norm and the sparse errors are penalized via the ℓ1 norm, which are the convex envelopes of rank(·) and ℓ0, respectively.
In practical applications, LRSL via the ℓ1 heuristic is powerful enough for many learning tasks with relatively low-rank structure and sparse corruptions. However, when the desired matrix becomes complicated, e.g., it has high intrinsic rank or the corrupting errors become dense, the convex approach may not achieve promising performance.

In order to handle those tough tasks via LRSL, in this paper we take advantage of nonconvex approximation to better enhance the sparseness of signals. The prominent reason that we adopt nonconvex terms is mainly their enhanced sparsity. Geometrically, nonconvex terms generally lie much closer to the essential ℓ0 norm than the convex ℓ1 norm [18].

To fully exploit the merits of nonconvex terms for LRSL, we formulate (P0) as a semidefinite programming (SDP) problem so that LRSL with two different norms will eventually be converted into an optimization with only the ℓ1 norm. Thanks to the SDP formulation, nonconvex terms can be explicitly combined into the paradigm of LRSL, and we will investigate two widely used nonconvex terms in this paper, i.e., the ℓp norm (0 < p < 1) and the log-sum term. Accordingly, two nonconvex models, i.e., ℓp-norm heuristic recovery (pHR) and log-sum heuristic recovery (LHR), will be proposed for corrupted matrix learning. Theoretically, we will analyze the relationship between these two models and reveal that the proposed LHR exhibits the same objective as pHR when p approaches 0+. Therefore, LHR has more powerful sparseness enhancement capability than pHR.

For the sake of accurate solutions, the majorization-minimization (MM) algorithm will be applied to solve the nonconvex heuristic model. The MM algorithm is implemented in an iterative way: it first replaces the nonconvex component of the objective with a convex upper bound, and then it minimizes the constructed surrogate, which makes the nonconvex problem fall exactly into the general paradigm of reweighted schemes. Accordingly, it is possible to solve the nonconvex optimization through a sequence of convex optimizations, and we will prove that, within the MM framework, the nonconvex models finally converge to a stationary point after successive iterations.

We will adapt the nonconvex models to two specific models for practical applications. In a nutshell, they will be used to solve the problems of low-rank matrix recovery (LRMR) and low-rank representation (LRR). In LRMR, nonconvex heuristic models are used to recover a low-rank matrix from sparse corruptions, and their performance will be compared with the benchmark principal component pursuit (PCP) [14] method. In practice, our approach often performs very well in spite of its simplicity. In numerical simulations, the nonconvex models can handle many tough tasks that the typical algorithm fails to handle. For example, the feasible region of LHR is much larger than that of PCP, which implies that it can deal with much denser corruptions and exhibit much higher rank tolerance. The feasible region of PCP is subject to the boundary ρ_PCP + η_PCP = 0.35, where ρ and η are the rank rate and the error rate, respectively. With the proposed LHR model, the feasible boundary can be extended to ρ_LHR + η_LHR = 0.58. These advancements are also verified on two practical applications of shadow removal on face images and video background modeling.

In the second task of LRR, the power of the nonconvex heuristic model will be generalized to LRR for subspace clustering (SC), whose goal is to recover the underlying low-rank correlation of subspaces in spite of noisy disturbances. In order to judge the performance, we will first apply it to motion segmentation in video sequences, which is a benchmark test for SC algorithms. Besides, in order to highlight the robustness of the proposed framework to noises and disturbances, we apply LHR and pHR to stock clustering, that is, to determine a stock's industrial category given its historical price record. In both experiments, the nonconvex heuristic model gains higher clustering accuracy than other state-of-the-art algorithms, and the improvements are especially noticeable on the stock data, which include significant disturbances.

The contribution of this paper is threefold.
1) This paper presents a nonconvex heuristic framework to handle the typical LRSL problem with enhanced sparsity terms. We introduce an MM algorithm to solve the nonconvex LRSL optimization with reweighted schemes, and theoretical justifications are provided to prove that the proposed algorithm converges to a stationary point.
2) The proposed nonconvex heuristic model, especially the LHR method, extends the feasible region of the existing ℓ1-norm-based LRSL algorithm, which implies that it can successfully handle more learning tasks with denser corruptions and higher rank.
3) We apply the LHR model to a new task of stock clustering, which serves to demonstrate that LRSL is not only a powerful tool restricted to the areas of image and vision analysis, but can also be applied to profitable financial problems.

The remainder of this paper is organized as follows. We review previous works in Section II. The background of the typical convex LRSL model and its specific SDP form is provided in Section III. Section IV introduces the general nonconvex heuristic models and discusses how to solve the nonconvex problem by the MM algorithm. We address the LRMR problem and compare the proposed LHR and pHR models with PCP in both simulations and practical applications in Section V. Nonconvex models for LRR and subspace segmentation are discussed in Section VI. Section VII concludes this paper.

II. PREVIOUS WORKS

In this part, we review some related works from the following perspectives. First, we discuss two famous models in LRSL, i.e., LRMR from corruptions and LRR. Then, some previous works about the MM algorithm and reweighted approaches are presented.

A. Low-Rank Structure Learning

1) Low-Rank Matrix Recovery: Sparse learning and rank analysis are now drawing more and more attention in both
the fields of machine learning and signal processing. In [19], sparse learning is incorporated into a nonnegative matrix factorization framework for blind source separation. Besides, it has also been introduced to the typical subspace-based learning framework for face recognition [20]. In addition to the widely used ℓ1-norm-based sparse learning, some other surrogates have been proposed for signal and matrix learning. In [21], the ℓ1/2 norm and its theoretical properties are discussed for sparse signal analysis. In [22], elastic-net-based matrix completion algorithms extend the extensively used elastic-net penalty to matrix cases. A fast algorithm for matrix completion using an ℓ1 filter has been introduced in [23].

Corrupted matrix recovery considers decomposing a low-rank matrix from sparse corruptions, which can be formulated as P = A + E, where A is a low-rank matrix, E is the sparse error, and P is the observed data from real-world devices, e.g., cameras, sensors, and other pieces of equipment. The rank of P is not low, in most scenarios, due to the disturbances of E. How can we recover the low-rank structure of the matrix from gross errors? This interesting topic has been discussed in a number of works, e.g., [14]–[16]. Wright et al. proposed PCP (a.k.a. RPCA) to minimize the nuclear norm of a matrix while penalizing the ℓ1 norm of the errors [15]. PCP can exactly recover the low-rank matrix from sparse corruptions. In some recent works, Ganesh et al. investigated the parameter choosing strategy for PCP from both theoretical justifications and simulations [24]. In this paper, we will introduce reweighted schemes to further improve the performance of PCP. Our algorithm can exactly recover a corrupted matrix from much denser errors and higher rank.

2) LRR: LRR [5] is a robust tool for SC [25], the desired task of which is to classify mixed data into their corresponding subspaces/clusters. The general model of LRR can be formulated as P = PA + E, where P is the original mixed data, A is the affine matrix that reveals the correlations between different pairs of data, and E is the residual of such a representation. In LRR, the affine matrix A is assumed to be of low rank and E is regarded as sparse corruptions. Compared with existing SC algorithms, LRR is much more robust to noises and achieves promising clustering results on public datasets. Fast implementations for LRR solutions are introduced in [26] by iteratively linearizing approaches. In this paper, inspired by LRR, we will introduce the log-sum recovery paradigm to LRR and show that, with the log-sum heuristic, its robustness to corruptions can be further improved.

B. MM Algorithm and Reweighted Approaches

The MM algorithm is widely used in machine learning and signal processing. It is an effective strategy for nonconvex problems in which the hard problem is solved by optimizing a series of easy surrogates. Therefore, most optimizations via the MM algorithm fall into the framework of reweighted approaches.

In the field of machine learning, the MM algorithm has been applied to parameter selection for Bayesian classification [27]. In the area of signal processing, the MM algorithm leads to a number of interesting applications, including wavelet-based processing [28] and total variation minimization [29]. For compressive sensing, the reweighted method was used in the ℓ1 heuristic and led to a number of practical applications, including portfolio management [30] and image processing [18]. The reweighted nuclear norm was first discussed in [31], and the convergence of such an approach has been proven in [32].

Although there are some previous works on reweighted approaches for rank minimization, our approach is quite different. First, this paper considers a new problem of LRSL from corruptions, not the single task of sparse signal or nuclear norm minimization. Besides, existing works on reweighted nuclear norm minimization in [31] and [32] are solved by SDP, which can only handle matrices of relatively small size. In this paper, we will use a first-order numerical algorithm [e.g., the alternating direction method (ADM)] to solve the reweighted problem, which can significantly improve the numerical performance. Due to the distributed optimization strategy, it is possible to generalize the learning capabilities to large-scale matrices.

III. ℓ1-BASED LRSL

In this part, we formulate LRSL as an SDP. With the SDP formulation, it will become apparent that typical LRSL is a kind of general ℓ1 heuristic optimization. This section serves as the background material for the discussions in Section IV.

As stated previously, the basic optimization (P0) is nonconvex and generally impossible to solve, as its solution usually requires an intractable combinatorial search. In order to make (1) tractable, convex alternatives are widely used in a number of works, e.g., [14] and [15]. Among these approaches, one prevalent method tries to replace the rank of a matrix by its convex envelope, i.e., the nuclear norm, while the ℓ0 sparsity is penalized via the ℓ1 norm. Accordingly, by convex relaxation, the problem in (1) can actually be recast as a semidefinite program

    min_{(A,E)}  ||A||_* + ||E||_1
    s.t.  P = f(A) + g(E)                                             (2)

where ||A||_* = Σ_{i=1}^{r} σ_i(A) is the nuclear norm of the matrix, defined as the summation of the singular values of A, and ||E||_1 = Σ_{ij} |E_{ij}| is the ℓ1 norm of a matrix. Although the objective in (2) involves two norms, the nuclear norm and the ℓ1 norm, its essence is based on the ℓ1 heuristic. We will verify this point with the following lemma.

Lemma 3.1: For a matrix X ∈ R^{m×n}, its nuclear norm is equivalent to the following optimization:

    ||X||_* = min_{(Y,Z,X)}  (1/2)[tr(Y) + tr(Z)]
              s.t.  [[Y, X], [Xᵀ, Z]] ⪰ 0                             (3)

where Y ∈ R^{m×m} and Z ∈ R^{n×n} are both symmetric and positive definite. The operator tr(·) means the trace of a matrix and ⪰ denotes positive semidefiniteness.

The proof of Lemma 3.1 may refer to [13] and [30].
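The construction behind Lemma 3.1 is easy to check numerically: with the SVD X = UΣVᵀ, the feasible choice Y = UΣUᵀ, Z = VΣVᵀ makes the block matrix positive semidefinite and attains (1/2)[tr(Y) + tr(Z)] = ||X||_*. The short NumPy sketch below is only an illustrative check of this fact and is not part of the paper's algorithms.

```python
import numpy as np

# Check of Lemma 3.1: with X = U S V^T, take Y = U S U^T and Z = V S V^T.
# Then [[Y, X], [X^T, Z]] is PSD and (tr(Y) + tr(Z)) / 2 equals ||X||_*.
rng = np.random.default_rng(0)
m, n = 6, 4
X = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Y = U @ np.diag(s) @ U.T          # m x m, symmetric PSD
Z = Vt.T @ np.diag(s) @ Vt        # n x n, symmetric PSD

block = np.block([[Y, X], [X.T, Z]])
eigvals = np.linalg.eigvalsh(block)

print("nuclear norm of X          :", s.sum())
print("(tr(Y) + tr(Z)) / 2        :", (np.trace(Y) + np.trace(Z)) / 2)
print("smallest eigenvalue of block:", eigvals.min())   # ~0 up to rounding
```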
According to this lemma, we can replace the nuclear norm in (2) and formulate it in the form

    min_{(Y,Z,A,E)}  (1/2)[tr(Y) + tr(Z)] + ||E||_1
    s.t.  [[Y, A], [Aᵀ, Z]] ⪰ 0
          P = f(A) + g(E).                                            (4)

From Lemma 3.1, we know that both Y and Z are symmetric and positive definite. Therefore, the traces of Y and Z can be expressed as a specific form of the ℓ1 norm, i.e., tr(Y) = ||diag(Y)||_1, where diag(Y) is an operator that keeps only the entries on the diagonal of Y in a vector. Therefore, the optimization in (4) can be expressed as

    min_{X ∈ D}  (1/2)(||diag(Y)||_1 + ||diag(Z)||_1) + ||E||_1        (5)

where X = {Y, Z, A, E} and

    D = { (Y, Z, A, E) : [[Y, A], [Aᵀ, Z]] ⪰ 0, (A, E) ∈ C }.

Here (A, E) ∈ C stands for a convex constraint.

IV. CORRUPTED MATRIX RECOVERY VIA NONCONVEX HEURISTIC

By Lemma 3.1, the convex problem with two norms in (2) has been successfully converted into an optimization with only the ℓ1 norm, and therefore it is called the ℓ1 heuristic. The ℓ1 norm is the convex envelope of the concave ℓ0 norm, but a number of previous research works have indicated the limitation of approximating ℓ0 sparsity with the ℓ1 norm, e.g., [18], [33], and [34]. It is natural to ask, for example, whether a different alternative might not only find a correct solution but also outperform the ℓ1 norm. One natural inspiration is to use nonconvex terms lying much closer to the ℓ0 norm than the convex ℓ1 norm. However, by using nonconvex heuristic terms, two problems inevitably arise: 1) which nonconvex functional is ideal, and 2) how to efficiently solve the nonconvex optimization. In the following two subsections, we will, respectively, address these two problems by introducing the LHR and its reweighted solution.
A. Nonconvex Heuristic Recovery

In this paper, we introduce two nonconvex terms to enhance the sparsity of the model in (5). The first one is the widely used ℓp norm with 0 < p < 1. Intuitively, it lies in the scope between the ℓ0 norm and the ℓ1 norm. Therefore, it is believed to have a better sparse representation ability than the convex ℓ1 norm. We define the general concave ℓp penalty by f_p(X) = Σ_{ij} |X_{ij}|^p, 0 < p < 1. Therefore, by taking it into (5), the following ℓp-norm heuristic recovery (pHR) optimization is obtained:

    (pHR)   H_p(X) = min_{X ∈ D}  (1/2)[f_p(diag(Y)) + f_p(diag(Z))] + f_p(E).      (6)

In the formulation of pHR, obviously, it differs from (5) only in the selection of the sparse norm, where the latter uses the concave ℓp norm instead of the typical ℓ1 norm. Starting from pHR, another nonconvex heuristic model with much sparser penalization can be derived. Obviously, for any p > 0, minimizing the aforementioned pHR is equivalent to

    min_{X ∈ D}  F(X) = (1/p)[ H_p(X) − ((1/2)m + (1/2)n + mn) ]
                      = (1/2) Σ_{i=1}^{m} (|Y_{ii}|^p − 1)/p
                        + (1/2) Σ_{i=1}^{n} (|Z_{ii}|^p − 1)/p
                        + Σ_{i,j} (|E_{ij}|^p − 1)/p.                                (7)

The optimization in (6) is the same as the problem in (7) because the multiplied scalar (1/p) is a positive constant and (1/2)m + (1/2)n + mn is a constant. According to L'Hôpital's rule, we know that lim_{p→0} (x^p − 1)/p = ∂_p(x^p − 1)/∂_p(p) = log x, where ∂_p(f(p)) stands for the derivative of f(p) with respect to p. Accordingly, by taking the limit lim_{p→0+} F(X) in (7), we get the LHR model H_L(X):

    (LHR)   H_L(X) = min_{X ∈ D}  (1/2)[f_L(diag(Y)) + f_L(diag(Z))] + f_L(E).      (8)

For any matrix X ∈ R^{m×n}, the log-sum term is defined as f_L(X) = Σ_{ij} log(|X_{ij}| + ε), where ε > 0 is a small regularization constant.

From (6) and (8), we know that LHR is a particular case of pHR obtained by taking the limit of p at 0+. It is known that when 0 < p < 1, the closer p approaches zero, the stronger the sparse enhancement that the ℓp-based optimization exhibits. We also comment here that when p equals zero, pHR exactly corresponds to the intractable discrete problem in (1). Thus p = 0 and p → 0+ give two different objectives. This finding does not contradict our basic derivation, since when p = 0 or p < 0 the equivalence from (6) to (7) no longer holds. Therefore, LHR only uses a limit to approximate the intractable objective of ℓ0-based recovery. This is also the very reason why we denote a plus in the superscript of zero in the limit p → 0+. LHR exploits the limit of the ℓ0 norm in the objective and is regarded to have much stronger sparsity enhancement capability than general pHR.
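The limiting relation used above is easy to verify numerically: for fixed x > 0, (x^p − 1)/p approaches log x as p → 0+. The snippet below is a simple sanity check of this limit and is not part of the recovery algorithm itself.

```python
import numpy as np

# Sanity check of L'Hopital's rule: (x**p - 1) / p -> log(x) as p -> 0+.
x = np.array([0.05, 0.5, 1.0, 3.0, 20.0])
for p in [0.5, 0.1, 0.01, 0.001]:
    approx = (x**p - 1.0) / p
    err = np.max(np.abs(approx - np.log(x)))
    print(f"p = {p:7.3f}   max |(x^p - 1)/p - log x| = {err:.2e}")
```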
Due to the much more powerful sparseness of LHR, in this paper we advocate the usage of LHR for nonconvex-based LRSL. Therefore, in the remainder of this paper, we will discuss the formulations of LHR for low-rank optimization in detail. Fortunately, thanks to the natural connections between pHR and LHR, the optimization rule of LHR also applies to pHR; we will state their small variations when necessary. The experimental comparisons of pHR and LHR on numerical simulations and practical applications will be given in Sections V and VI.

B. Solving LHR via Reweighted Approaches

Although we have placed a powerful term to enhance the sparsity, it unfortunately also introduces nonconvexity into the objective function. For example, the LHR model is not convex
since the log-function over R_{++} = (0, ∞) is concave. In most cases, a nonconvex problem can be extremely hard to solve. Fortunately, the convex upper bound of f_L(·) can be easily found and is defined by its first-order Taylor expansion. Therefore, we will introduce the MM algorithm to solve the LHR optimization.

The MM algorithm replaces the hard problem by a sequence of easier ones. It proceeds in an expectation-maximization-like fashion by repeating two steps, majorization and minimization, in an iterative way. During the majorization step, it constructs a convex upper bound of the nonconvex objective. In the minimization step, it minimizes this upper bound. According to Appendix A, the first-order Taylor expansion of each component in (8) is well defined. Therefore, we can construct the upper bound of LHR and instead optimize the following problem:

    min_{X ∈ D}  T(X | Ω) = (1/2) tr[(Ω_Y + εI_m)^{-1} Y] + (1/2) tr[(Ω_Z + εI_n)^{-1} Z]
                            + Σ_{ij} (Ω_{E,ij} + ε)^{-1} |E_{ij}| + const.            (9)

In (9), the set X = {Y, Z, A, E} contains all the variables to be optimized and the set Ω = {Ω_Y, Ω_Z, Ω_E} contains all the parameter matrices. The parameter matrices define the points at which the concave function is linearized via the Taylor expansion. See Appendix A for details. At the end of (9), const stands for the constants that are irrelevant to {Y, Z, A, E}. In some previous works on MM algorithms [18], [27], [32], the parameter in the tth iteration is set to the optimal value of X from the last iteration, i.e., Ω = X^t.

To numerically solve the LHR optimization, we remove the constants that are irrelevant to Y, Z, and E in T(X | Ω) and get the new convex objective

    min  (1/2)[tr(W_Y² Y) + tr(W_Z² Z)] + Σ_{ij} (W_E)_{ij} |E_{ij}|

where W_{Y(Z)} = (Ω_{Y(Z)} + εI_{m(n)})^{-1/2} and (W_E)_{ij} = (Ω_{E,ij} + ε)^{-1}, ∀ i, j. It is worth noting that tr(W_Y² Y) = tr(W_Y Y W_Y). Besides, since both W_Y and W_Z are positive definite, the first constraint in (8) is equivalent to

    [[W_Y, 0], [0, W_Z]] [[Y, A], [Aᵀ, Z]] [[W_Y, 0], [0, W_Z]] ⪰ 0.

Therefore, after convex relaxation, the optimization in (8) is now subject to

    min  (1/2)[tr(W_Y Y W_Y) + tr(W_Z Z W_Z)] + ||W_E ⊙ E||_1
    s.t.  [[W_Y Y W_Y, W_Y A W_Z], [(W_Y A W_Z)ᵀ, W_Z Z W_Z]] ⪰ 0
          P = f(A) + g(E).                                             (10)

Here, we apply Lemma 3.1 to (10) once again and rewrite the optimization in (10) as the summation of a nuclear norm and an ℓ1 norm

    min_{(A,E)}  ||W_Y A W_Z||_* + ||W_E ⊙ E||_1
    s.t.  P = f(A) + g(E).                                             (11)

In (11), the operator ⊙ in the error term denotes the component-wise product of two variables, i.e., for W_E and E: (W_E ⊙ E)_{ij} = (W_E)_{ij} E_{ij}. According to [30], we know that Y = UΣUᵀ and Z = VΣVᵀ if we take the SVD A = UΣVᵀ. Accordingly, the weight matrices are W_Y = (UΣUᵀ + εI_m)^{-1/2} and W_Z = (VΣVᵀ + εI_n)^{-1/2}.¹

¹In some cases, the weighting matrices may contain complex numbers due to the inverse operation. In such a condition, we use the approximating matrices W_Y = U(Σ + εI_m)^{-1/2}Uᵀ and W_Z = V(Σ + εI_n)^{-1/2}Vᵀ in LHR.

We should comment here that (11) also applies to the pHR problem; it just uses different weighting matrices, W_Y = diag((UΣUᵀ + εI_m))^{(p−1)/2}, W_Z = diag((VΣVᵀ + εI_n))^{(p−1)/2}, and W_E = [(E_{ij} + ε)^{(p−1)}]. The derivations of these parameter matrices are very similar to the LHR formulations in Appendix IV, so we omit them here.
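For concreteness, the NumPy sketch below computes the weighting matrices of the reweighted subproblem (11) from the current estimates of A and E, following the formulas above and using the symmetric approximation of footnote 1 to stay real-valued. The use of |E_{ij}| in the elementwise weight, the variable names, and the value of the small constant eps are illustrative assumptions, not the authors' code.

```python
import numpy as np

def lhr_weights(A, E, eps=1e-2):
    """One majorization (reweighting) step of LHR, as a sketch.

    Given current estimates A and E, return (W_Y, W_Z, W_E) for the
    convex surrogate  min ||W_Y A W_Z||_* + ||W_E o E||_1.
    W_Y, W_Z use the approximation U (S + eps I)^(-1/2) U^T from footnote 1;
    W_E follows the log-sum derivative 1 / (|E_ij| + eps).
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=True)
    m, n = A.shape
    s_m = np.concatenate([s, np.zeros(m - len(s))])   # pad to m singular values
    s_n = np.concatenate([s, np.zeros(n - len(s))])   # pad to n singular values
    W_Y = U @ np.diag(1.0 / np.sqrt(s_m + eps)) @ U.T
    W_Z = Vt.T @ np.diag(1.0 / np.sqrt(s_n + eps)) @ Vt
    W_E = 1.0 / (np.abs(E) + eps)                     # elementwise weight
    return W_Y, W_Z, W_E

# For A = 0 and E = 0 (first iteration), the weights reduce to scaled
# identities and a constant matrix, i.e., a PCP-like subproblem.
W_Y, W_Z, W_E = lhr_weights(np.zeros((5, 4)), np.zeros((5, 4)))
```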
Here, based on the MM algorithm, we have converted the nonconvex LHR optimization into a sequence of convex reweighted problems. We call (11) the reweighted method since in each iteration we recompute the weight matrix set W and use the updated weights to construct the surrogate convex function. Besides, the objective in (11) is convex, a summation of a nuclear norm and an ℓ1 norm, and can be solved by convex optimization. In the next two sections, the general LHR model will be adapted to two specific models, and we will provide the optimization strategies for those two models, respectively. But before that, we first stop here and extend some theoretical discussions of the LHR model.

C. Theoretical Justifications

In this part, we investigate some theoretical properties of the proposed nonconvex heuristic algorithm with the MM optimization and prove its convergence. For simplicity, we define the objective function in (8) as H(X) and the surrogate function in (9) as T(X | Ω). X is a set containing all the variables and Ω records the parameter matrices. Before discussing the convergence property of LHR, we first provide two lemmas.

Lemma 4.1: If we set Ω^t := X^t, the MM algorithm monotonically decreases the nonconvex objective function H(X), i.e., H(X^{t+1}) ≤ H(X^t).

The proof of this lemma may refer to Appendix B.1. According to Lemma 4.1, we can give Lemma 4.2 to prove the convergence of the LHR iterations.

Lemma 4.2: Let X = {X^0, X^1, ..., X^t} be a sequence generated by the MM framework; after successive iterations, such a sequence converges to the same limit point.

The proof of Lemma 4.2 includes two steps. First, we should prove that there exists a convergent subsequence X^{t_k} → X^*; see the discussions in Appendix B.3 for details. Then, we prove the whole Lemma 4.2 by contradiction in Appendix B.2. Based on the two lemmas proved previously, we can now provide the convergence theorem of the proposed LHR model.

Theorem 4.3: Within the MM framework, the LHR model finally converges to a stationary point.

See Appendix B.4 for the proof. In this part, we have shown that, with the MM algorithm, the LHR model converges to a stationary point. However, it is impossible to
claim that the converged point is the global minimum, since the objective function of LHR is not convex. Fortunately, with a good starting point, we can always find a desirable solution by iterative approaches. In this paper, the solution of the ℓ1 heuristic model is used as the starting point, and it always leads to a satisfactory result.

V. LHR FOR LRMR FROM CORRUPTIONS

In this part, we first apply the LHR model to recover a low-rank matrix from corruption, and its performance is compared with the widely used PCP.

A. Proposed Algorithm

Based on the LHR derivations, the corrupted LRMR problem can be formulated as a reweighted problem

    min_{(A,E)}  ||W_Y A W_Z||_* + ||W_E ⊙ E||_1
    s.t.  P = A + E.                                                  (12)

Because the reweighting matrices are placed inside the nuclear norm, it is impossible to directly get a closed-form solution of the nuclear norm minimization. Therefore, inspired by [5], we introduce another variable J into (12) by adding another equality constraint, and solve

    min  ||J||_* + ||W_E ⊙ E||_1
    s.t.  h1 = P − A − E = 0
          h2 = J − W_Y A W_Z = 0.                                     (13)

Based on this transformation, there is only a single J in the nuclear norm of the objective, so we can directly get its closed-form update rule by [35]. There are quite a number of methods that can be used to solve it, e.g., the proximal gradient (PG) algorithm [36] or ADMs [37]. In this paper, we will introduce the ADM method since it is more effective and efficient. Using the ALM method [38], it is computationally expedient to relax the equalities in (13) and instead solve

    L = ||J||_* + ||W_E ⊙ E||_1 + ⟨C1, h1⟩ + ⟨C2, h2⟩ + (μ/2)(||h1||_F² + ||h2||_F²)   (14)

where ⟨·,·⟩ is the inner product and C1 and C2 are the Lagrange multipliers, which can be updated via the dual ascending method. Equation (14) contains three variables, i.e., J, E, and A. Accordingly, it is possible to solve the problem via a distributed optimization strategy, the ADM. The convergence of the ADM for convex problems has been widely discussed in a number of works [37], [39]. By ADM, the joint optimization can be minimized by four steps: E-minimization, J-minimization, A-minimization, and dual ascending. The detailed derivations are similar to the previous works in [8] and [38], and we omit them here. The whole framework to solve the LHR model for LRMR via reweighted schemes is given in Algorithm 1.² In lines 6 and 7 of the algorithm, s_τ(·) and d_τ(·) are defined as the signal shrinkage operator and the matrix shrinkage operator, respectively [38].

²The optimization for pHR is very similar, obtained by changing the weight matrices.

Algorithm 1 Optimization Strategy of LHR for Corrupted Matrix Recovery
    Input: Corrupted matrix P and parameter λ.
    Initialization: t := 1, E_{ij}^0 := 1, ∀ i, j; W_{Y(Z)}^{(1)} = I_{m(n)}.
    1  repeat
    2      Update the weighting matrices W_E^{(t)}, W_Y^{(t)}, and W_Z^{(t)} according to the current estimation of A and E;
    3      Reset C^0 > 0; μ_0 > 0; ρ > 1; k = 1; A^0 = E^0 = 0;
    4      while not converged do
    5          // Variable updating.
    6          E_{ij}^k = s_{(1/μ_k)(W_E^{(t)})_{ij}} ( (P − A^{k−1} − μ_k^{−1} C1^k)_{ij} ), ∀ i, j;
    7          J^k = d_{1/μ_k} ( W_Y^{(t)} A^{k−1} W_Z^{(t)} + μ_k^{−1} C2^k );
    8          A^k = A^{k−1} + [ W_Y^{(t)} (h1^k + μ_k^{−1} C2^k) W_Z^{(t)} + (h2^k + μ_k^{−1} C1^k) ];
    9          // Dual ascending.
    10         C1^k = C1^{k−1} + μ_k h1^k;
    11         C2^k = C2^{k−1} + μ_k h2^k;
    12         k := k + 1, μ_{k+1} = ρ μ_k;
    13     end
    14     (A^{(t)}, E^{(t)}) = (A^k, E^k);
    15     t := t + 1;
    16 until convergence;
    Output: (A^{(t)}, E^{(t)}).
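The two shrinkage operators appearing in lines 6 and 7 of Algorithm 1 are commonly defined, in the sense of [38], as elementwise soft-thresholding and singular value thresholding. The NumPy sketch below is a generic illustration of these operators, not the authors' implementation; the example values are arbitrary.

```python
import numpy as np

def soft_threshold(X, tau):
    """Signal (elementwise) shrinkage: s_tau(x) = sign(x) * max(|x| - tau, 0).
    tau may be a scalar or an array broadcastable to X, e.g., (W_E)_ij / mu."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_shrink(X, tau):
    """Matrix (singular value) shrinkage d_tau: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Example usage mirroring the roles in Algorithm 1 (illustrative inputs only):
E_update = soft_threshold(np.random.randn(4, 4), 0.1)   # E-minimization step
J_update = svd_shrink(np.random.randn(4, 4), 0.5)       # J-minimization step
```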
B. Simulation Results and Validation

We have explained how to recover a low-rank matrix via LHR in the preceding sections. In this section, we conduct some experiments to test its performance in comparison with robust PCP on numerical simulations.

1) Numerical Simulations: We demonstrate the accuracy of the proposed nonconvex algorithms on randomly generated matrices. For an equivalent comparison, we adopt the same data generating method as in [14]: all the algorithms are performed on square matrices; the ground-truth low-rank matrix (of rank r) with m × n entries, denoted as A*, is generated by an independent random orthogonal model [14]; and the sparse error E* is generated by uniformly sampling the matrix, with error values randomly generated in the range [−100, 100]. The corrupted matrix is generated by P = A* + E*, where A* and E* are the ground truth. For simplicity, we denote the rank rate as ρ = rank(A*)/max{m, n} and the error rate as η = ||E*||_0/(m·n).
Accordingly, it is possible to solve a problem via a distributed error rate as = (E0 /m n).
optimization strategy called the ADM. The convergence of For an equivalent comparison, we use the code in [40]
the ADM for convex problems has been widely discussed in a to solve the PCP problem.3 Reference [14] indicated that
number of works [37], [39]. By ADM, the joint optimization PCP method could exactly recover a low-rank matrix from
can be minimized by four steps such as E-minimization, corruptions within the region of + < 0.35. Here, in order
J-minimization, A-minimization, and dual ascending. to highlight the effectiveness of our LHR model, we directly
The detailed derivations are similar to the previous works consider much difficult tasks, that is, we set + = 0.5.
in [8] and [38] and we omit them here. The whole framework We compare the PCP ( p = 1) model with the proposed
to solve the LHR model for LRMR via reweighted schemes pHR (with p = 1/3 and p = 2/3) and the LHR (can be
is given in Algorithm 1.2 In lines 6 and 7 of the algorithm, regarded as p 0+ ). Each experiment is repeated ten times
s () and d () are defined as signal shrinkage operator and
matrix shrinkage operator, respectively [38]. 3 In [38], Lin et al. provided two solvers, i.e., exact and inexact solvers, to
solve the PCP problem. In this paper, we use the exact solver for PCP because
2 The optimization for pHR is very similar by changing the weight matrices. it performs better than the inexact solver.
TABLE I
EVALUATIONS OF LRMR OF ROBUST PCA AND NONCONVEX HEURISTIC RECOVERY (MEAN ± STD)

Case 1: rank(A*) = 0.4m, ||E*||_0 = 0.1m²

  m    Method    ||A−A*||_F/||A*||_F   rank(A)    ||E||_0          time (s)
  200  PCP       (4.6 ± 1.1)e−1        102 ± 12   21 132           5.9 ± 1.3
  200  pHR2/3    (3.7 ± 0.3)e−2        88 ± 4     4378 ± 211       16.4 ± 3.1
  200  pHR1/3    (1.8 ± 0.3)e−2        83 ± 2     4113 ± 77        13.1 ± 3.7
  200  LHR       (8.1 ± 1.2)e−4        80 ± 0     4000 ± 13        12.7 ± 2.5
  400  PCP       (4.5 ± 1.2)e−1        205 ± 37   82 149           27.4 ± 3.3
  400  pHR2/3    (2.3 ± 0.9)e−2        193 ± 23   15 782 ± 123     73.8 ± 12.3
  400  pHR1/3    (1.2 ± 0.5)e−2        160 ± 0    15 873 ± 115     64.2 ± 13.4
  400  LHR       (2.3 ± 0.7)e−3        160 ± 2    16 038 ± 24      53.4 ± 11.8
  800  PCP       (4.7 ± 1.1)e−1        435 ± 97   336 188          36.2 ± 8.4
  800  pHR2/3    (2.3 ± 0.6)e−2        361 ± 32   63 901 ± 115     103.6 ± 21.5
  800  pHR1/3    (8.7 ± 2.3)e−3        320 ± 0    63 962 ± 36      96.2 ± 27.3
  800  LHR       (1.7 ± 0.3)e−3        320 ± 0    63 999 ± 12      89.3 ± 21.2

Case 2: rank(A*) = 0.1m, ||E*||_0 = 0.4m²

  m    Method    ||A−A*||_F/||A*||_F   rank(A)    ||E||_0          time (s)
  200  PCP       (1.2 ± 0.5)e−1        107 ± 11   23 098           7.4 ± 1.5
  200  pHR2/3    (9.3 ± 0.9)e−3        20 ± 0     16 011 ± 17      16.3 ± 2.6
  200  pHR1/3    (3.6 ± 0.6)e−3        20 ± 0     16 000 ± 0       13.4 ± 3.5
  200  LHR       (1.3 ± 0.7)e−3        20 ± 0     16 031 ± 21      14.1 ± 2.7
  400  PCP       (6.4 ± 2.3)e−1        217 ± 52   89 370           33.2 ± 4.6
  400  pHR2/3    (5.0 ± 1.1)e−3        71 ± 9     64 000 ± 0       63.2 ± 15.7
  400  pHR1/3    (4.0 ± 0.7)e−4        41 ± 3     64 000 ± 0       63.2 ± 12.4
  400  LHR       (1.7 ± 0.5)e−4        40 ± 0     64 000 ± 0       54.3 ± 10.7
  800  PCP       (9.1 ± 1.7)e−2        348 ± 42   355 878          50.1 ± 10.6
  800  pHR2/3    (6.2 ± 0.9)e−3        80 ± 3     257 762 ± 1268   129.2 ± 25.2
  800  pHR1/3    (5.3 ± 1.1)e−3        80 ± 0     256 097 ± 72     119.2 ± 23.6
  800  LHR       (4.1 ± 0.5)e−3        80 ± 0     255 746 ± 133    107.6 ± 25.3

Fig. 1. Feasible region and convergence verifications. (a) Feasible region verification: error rate (η) versus rank rate (ρ), both ranging from 0 to 0.5, showing the feasible/infeasible boundaries of PCP, pHR(2/3), pHR(1/3), and LHR. (b) Convergence verification.

and the mean values and their standard deviations (std) are tabulated in Table I. In the table, ||A − A*||_F/||A*||_F denotes the recovery accuracy, rank denotes the rank of the recovered matrix A, ||E||_0 is the cardinality of the recovered errors, and time records the computational cost in seconds. We do not report the std of the PCP method on the recovered errors because PCP definitely diverges on these tasks and its std does not exhibit significant statistical meaning.

From the results, obviously, compared with PCP, the LHR model can exactly recover the matrix from higher ranks and denser errors. The pHR model can correctly recover the matrix in most cases, but its recovery accuracy is a bit lower than LHR. We also report the processing time in Table I. The computer implementing the experiments is equipped with a 2.3 GHz CPU and 4 GB RAM.

2) Feasible Region: Since the basic optimization involves two terms, i.e., the low-rank matrix and the sparse error, in this part we vary these two variables to test the feasible boundaries of PCP, pHR, and LHR, respectively. The experiments are conducted on 400 × 400 matrices with sparse errors uniformly distributed in [−100, 100]. In the feasible region verification, when the recovery accuracy is larger than 1% (i.e., ||A − A*||_F/||A*||_F > 0.01), it is believed that the algorithm diverges. The two rates ρ and η are varied from 0 to 1 with a step of 0.025. At each test point, all the algorithms are repeated ten times. If the median recovery accuracy is less than 1%, the point is regarded as a feasible point. The feasible regions of these algorithms are shown in Fig. 1(a).

From Fig. 1(a), the feasible region of LHR is much larger than the region of PCP. We reach the same conclusion as made in [14] that the feasible boundary of PCP roughly fits the curve ρ_PCP + η_PCP = 0.35. The boundary of LHR is around the curve ρ_LHR + η_LHR = 0.575. Moreover, on the two sides of the red curve in Fig. 1(a), the boundary equation can even be extended to ρ_LHR + η_LHR = 0.6. Although the performance of pHR is not as good as LHR, it still greatly outperforms PCP. When p = 1/3 and p = 2/3, the boundary equations are subject to ρ_pHR + η_pHR = 0.52 and ρ_pHR + η_pHR = 0.48, respectively. These improvements are reasonable since pHR and LHR
use functionalities that are much closer to the ℓ0 norm. Accordingly, the proposed nonconvex heuristic methods cover a larger feasible region. From this test, it is apparent that the proposed LHR algorithm covers the largest area of the feasible region, which implies that LHR can handle more difficult tasks that robust PCA fails to do.
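The feasibility sweep just described can be organized as a simple grid experiment. The sketch below outlines the protocol (ρ and η varied in steps of 0.025, ten trials per point, a point counted as feasible when the median relative error is below 1%); `solve` and `make_test_matrix` are assumed external routines (for example, any PCP/pHR/LHR solver and a generator like the one sketched earlier), not code from the paper.

```python
import numpy as np

def feasible_region(solve, make_test_matrix, m=400, step=0.025, trials=10):
    """Grid sweep over (rank rate rho, error rate eta).
    solve(P) must return the recovered low-rank matrix; it is an assumed
    external routine (e.g., an LHR or PCP solver)."""
    grid = np.arange(step, 1.0, step)
    feasible = np.zeros((len(grid), len(grid)), dtype=bool)
    for i, rho in enumerate(grid):
        for j, eta in enumerate(grid):
            errs = []
            for t in range(trials):
                P, A_star, _ = make_test_matrix(m, rho, eta, seed=t)
                A_hat = solve(P)
                errs.append(np.linalg.norm(A_hat - A_star, "fro")
                            / np.linalg.norm(A_star, "fro"))
            feasible[i, j] = np.median(errs) < 0.01   # 1% criterion
    return grid, feasible
```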
3) Convergence Verification: Finally, we experimentally verify the convergence of LHR. The experiments are conducted on 400 × 400 matrices with rank equal to 40, and the portions of gross errors are set to 15%, 30%, and 45%, respectively. The experimental results are reported in Fig. 1(b), where the horizontal coordinate denotes the iteration sequence.

The top subfigure in Fig. 1(b) reports the time cost of each iteration. It is interesting to note that the denser the error, the more time is required for one iteration. Besides, the most time-consuming part occurs in the first iteration. During the first iteration, (11) reduces to the typical PCP problem. However, in the second and third iterations, the weight matrices are assigned different values, which makes (11) converge in fewer inner iterations. Therefore, the time cost for each iteration differs in LHR: the first iteration needs many computational resources, while the later ones can be further accelerated owing to the penalty of the weight matrices.

The middle subfigure records the stopping criterion, which is denoted as ||W^{(t+1)} − W^{(t)}||_F / ||W^{(t)}||_F. It is believed that LHR converges when the stopping criterion is less than 1e−5. It is apparent from Fig. 1(b) that LHR can converge in just three iterations with 15% and 30% gross errors, while for the complicated case with 45% errors, LHR converges in four steps. The bottom subfigure shows the recovery accuracy after each iteration. It is obvious that the recovery accuracy increases significantly from the first iteration to the second one. Such an increase verifies the advantage of the reweighted approach derived from LHR.

4) Robustness to Initialization: We discuss the performance of the proposed algorithm with different initialization strategies. In the previous experiments, following the steps in Algorithm 1, both matrices A and E are initialized to zero. In this part, to further verify the effectiveness of LHR, we adopt a random initialization strategy: all the entries in A and E are, respectively, randomly initialized. With such initialization, LHR is implemented on a 400 × 400 matrix with rank = 10 and with 15%, 30%, and 45% corruptions, respectively. Since the algorithm is randomly initialized, we repeat LHR ten times with different initializations on the same matrix. The final recovery accuracy and time costs are reported in Table II with standard deviations.

TABLE II
PERFORMANCE COMPARISONS WITH DIFFERENT INITIALIZATION METHODS

                       Random Initialization               Zero Initialization
  Corruptions     ||A−A*||_F/||A*||_F    time (s)      ||A−A*||_F/||A*||_F    time (s)
  15%             (1.63 ± 0.07)e−5       25.7 ± 3.2    1.58e−5                19.3
  30%             (5.33 ± 0.06)e−5       41.2 ± 4.1    5.17e−5                39.6
  45%             (2.61 ± 0.01)e−4       64.7 ± 7.2    2.37e−4                51.2

From the table, it is apparent that a different initialization method does not make significant differences in the recovery accuracy. The random initialization method returns very consistent accuracy with tiny standard deviations. Meanwhile, the accuracy of random-initialization-based recovery is very similar to the accuracy obtained by zero initialization, which is advocated in Algorithm 1. It is not surprising to see the robustness of LHR to different initializations, since the inner loops of LHR depend on a convex programming. The convex problem has a unique global minimum, and thus it is very robust to the initial points. However, with different initializations, the time costs are different. Generally, zero initialization requires less computational cost than random methods. This is because, with a bad random initialization, each inner programming requires many loops to reach the optimum.

For the weighting matrices, their initial values cannot be arbitrarily set because there is no prior about these two sparse components. An arbitrary random initialization will possibly make the whole programming diverge. Therefore, a good choice is to initialize all the weighting matrices by identity matrices. With such a setting, LHR exactly solves a PCP-like optimization in the first inner programming. After getting the initial guess for the two matrices, in the second and following inner loops, the weighting matrices can be initialized according to the current estimation of A and E.

C. Practical Applications

PCP is a powerful tool for many practical applications. Here, we conduct two practical applications to verify the effectiveness of PCP and LHR on real-world data.

1) Shadow and Specularities Removal From Faces: In computer vision, a diverse set of computational models has been adopted to address many kinds of tasks for face analysis [9], [41]. Among these mathematical models, matrix recovery and sparse learning have been proven to be very robust tools for shadow removal on faces. Following the framework suggested in [14], we stack the faces of the same subject under different lighting conditions as the columns of a matrix P. The experiments are conducted on the extended Yale-B dataset, where each face has a resolution of 192 × 168. Then, the corrupted matrix P is recovered by PCP and LHR, respectively. After recovery, the shadows, specularities, and other reflectances are removed into the error matrix (E) and the clean faces are accumulated in the low-rank matrix (A). The experimental results are provided in Fig. 2, where in each subfigure, from left to right, are the original faces in Yale-B, faces recovered by PCP, faces recovered by pHR (p = 1/3),⁴ and faces recovered by LHR, respectively. It is greatly recommended to enlarge the faces in Fig. 2 to view the details. In Fig. 2(a), when there exist dense shadows on the face image, the effectiveness of LHR becomes apparent in removing the dense shadows distributed on

⁴We only report the result of p = 1/3 here since, in the previous numerical simulations, pHR (p = 1/3) achieves higher recovery accuracy than pHR with p = 2/3.
Fig. 2. Shadow and specularities removal from faces (best viewed on screen). (a) Dense shadow. (b) Shadow texture.

Fig. 3. Benchmark videos for background modeling. In each subfigure, from left to right are original video frames, foreground ground truth, LHR result, and PCP result, respectively. (a) HW (439 frames). (b) Lab (886 frames). (c) Seam (459 frames).

TABLE III
QUANTITATIVE EVALUATION OF PCP AND NONCONVEX HEURISTIC RECOVERY FOR VIDEO SURVEILLANCE

          False Negative Rate (%)        False Positive Rate (%)        Time (m)
  Data    MoG    PCP    pHR    LHR       MoG    PCP    pHR    LHR       PCP    pHR    LHR
  HW      22.2   18.7   16.2   14.3      8.8    7.8    8.2    8.4       13.2   24.7   23.5
  Lab.    15.1   10.1   9.4    8.3       6.7    6.4    6.4    6.1       25.4   45.3   43.7
  Seam    23.5   11.3   10.1   9.2       9.7    6.1    6.5    6.3       11.4   23.2   19.9
the left face. However, in Fig. 2(a), there are no significant differences between the two nonconvex models; both of them achieve sound results. The dense texture removal ability is especially highlighted in Fig. 2(b), where there are significant visual contrasts between the faces recovered by PCP, pHR, and LHR. The face recovered by LHR is much cleaner.

2) Video Surveillance: Background modeling can also be categorized as an LRMR problem, where the backgrounds correspond to the low-rank matrix A and the foregrounds are removed into the error matrix E. We use the videos and ground truth in [42] for quantitative evaluations. The three videos used in this experiment are listed in Fig. 3.

For the sake of computational efficiency, we normalize each image to a resolution of 120 × 160, and all the frames are converted to gray scale. The benchmark videos used here contain too many frames, which leads to a large matrix. It is theoretically feasible to use the two methods for any large matrix recovery. Unfortunately, for practical implementation, large matrices are always beyond the memory limitation of MATLAB. Therefore, for each video, we uniformly divide the large matrix into submatrices with fewer than 200 columns. We recover these submatrices by setting λ = 1/√m, respectively.
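A simple way to implement this chunking is to split the frame matrix column-wise into blocks of at most 200 frames and recover each block separately with λ = 1/√m. The sketch below shows only this bookkeeping; `recover(P, lam)` is an assumed placeholder for the PCP/pHR/LHR solver, not code provided by the paper.

```python
import numpy as np

def recover_in_chunks(frames, recover, max_cols=200):
    """frames: (pixels, num_frames) matrix; recover(P, lam) -> (A, E) is an
    assumed external solver. Each chunk uses lam = 1 / sqrt(m), m = #pixels."""
    m = frames.shape[0]
    lam = 1.0 / np.sqrt(m)
    A_parts, E_parts = [], []
    for start in range(0, frames.shape[1], max_cols):
        P = frames[:, start:start + max_cols]
        A, E = recover(P, lam)
        A_parts.append(A)        # background (low-rank) part of this chunk
        E_parts.append(E)        # foreground (sparse) part of this chunk
    return np.hstack(A_parts), np.hstack(E_parts)
```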
The segmented foregrounds and the ground truth are shown in Fig. 3. From the results, we know that LHR can remove much denser errors from the corrupted matrix than PCP. Such a claim is verified by the three sequences in Fig. 3, where LHR makes a much more complete object recovery from the video. Besides, in Fig. 3(c), it is also apparent that LHR only keeps dense errors in the sparse error term. In the seam sequences, there are obvious illumination changes between different frames. PCP is sensitive to these small variations and thus produces many more small isolated noise parts in the foreground. On the other hand, LHR is much more robust to these local variations and only keeps dense corruptions in the sparse term.

Although there are many advanced techniques for video background modeling, they are not the main concern of this paper. Therefore, without loss of generality, we use the mixture of Gaussian (MoG) [43] as the comparison baseline. In the MoG, five Gaussian components are used to model each pixel in the image. For evaluation, both the false-negative rate (FNR) and the false-positive rate (FPR) are calculated in the sense of foreground detection. These two scores exactly correspond to the Type I and Type II errors in machine learning, whose definitions may refer to [44]. FNR indicates the ability of the method to correctly recover the foreground, and FPR represents the potential of a method to distinguish the background. Both rates are judged by the criterion "the less, the better". The experimental results are tabulated in Table III. We also report the time cost (in minutes) of PCP, pHR, and LHR on these videos, but we omit the time cost of MoG since it can be finished in almost real time.
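Given binary foreground masks, the two scores can be computed directly from the confusion counts. The helper below is a generic illustration (not the evaluation code of [42] or [44]); it assumes `pred` and `gt` are boolean arrays in which True marks foreground pixels.

```python
import numpy as np

def fnr_fpr(pred, gt):
    """False-negative rate and false-positive rate for foreground detection.
    pred, gt: boolean arrays of the same shape (True = foreground)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    fn = np.logical_and(~pred, gt).sum()      # missed foreground pixels
    fp = np.logical_and(pred, ~gt).sum()      # background flagged as foreground
    fnr = fn / max(gt.sum(), 1)               # fraction of foreground missed
    fpr = fp / max((~gt).sum(), 1)            # fraction of background mis-flagged
    return 100.0 * fnr, 100.0 * fpr           # percentages, as in Table III
```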
From the results, PCP and LHR greatly outperform the performance of MoG. Moreover, LHR has lower FNRs than PCP and pHR, which implies that LHR can better detect the foreground than them. However, on the videos highway and seam, the FPR score of LHR is a little worse than PCP and pHR. One possible reason is that there are too many moving shadows in these two videos, where both the objects and the shadows are regarded as errors. In the ground-truth frames, the shadows are regarded as background. LHR can recover much denser errors from a low-rank matrix and thus causes a relatively low FNR score.
VI. LHR FOR LRR

In this part, LHR will be applied to the task of LRR [5], [17] by formulating the constraint as P = PA + E, where the correlation affine matrix A is of low rank and the noises in E are sparse. In the remaining parts of this section, we will first show how to use the joint optimization strategy to solve the LRR problem by the LHR model. Then, two practical applications on motion segmentation and stock clustering will be presented and discussed.

A. Proposed Algorithm

When applying LHR to LRR, we should solve a sequence of convex optimizations of the form

    min  ||W_Y A W_Z||_* + ||W_E ⊙ E||_1
    s.t.  P = PA + E.                                                 (15)

To make the nuclear norm tractable, we add an equality constraint and instead solve

    min  ||J||_* + ||W_E ⊙ E||_1
    s.t.  b1 = P − PA − E = 0
          b2 = J − W_Y A W_Z = 0.                                     (16)

Using the ADM strategy and following derivations similar to those introduced in Section V-A, we can solve the optimization in (16); we directly provide the update rules for each variable in Algorithm 2.

Algorithm 2 Update Rule for the Variables in (16)
    1  E_{ij}^k = s_{(1/μ_k)(W_E^{(t)})_{ij}} ( (P − P A^{k−1} − μ_k^{−1} C1^k)_{ij} ), ∀ i, j;
    2  J^k = d_{1/μ_k} ( W_Y^{(t)} A^{k−1} W_Z^{(t)} + μ_k^{−1} C2^k );
    3  A^k = A^{k−1} + [ W_Y^{(t)} (b1^k + μ_k^{−1} C2^k) W_Z^{(t)} + Pᵀ (b2^k + μ_k^{−1} C1^k) ];
    4  // Dual ascending.
    5  C1^k = C1^{k−1} + μ_k b1^k;
    6  C2^k = C2^{k−1} + μ_k b2^k;
To show that LHR ideally represents low-rank structures from data, experiments on SC are conducted on two datasets. First, we test LHR on a slightly corrupted dataset, i.e., the Hopkins155 motion database. Since the effectiveness of LHR is especially emphasized on data with great corruptions, we also consider one practical application of using LHR for stock clustering.

B. Results and Performance Evaluation

1) Motion Segmentation in Video Sequences: In this part, we apply LHR to the task of motion segmentation on the Hopkins155 dataset [25]. The Hopkins155 database is a benchmark platform for evaluating general SC algorithms; it contains 156 video sequences, each of which has been summarized into a matrix recording 3950 data vectors. The primary task of SC is to categorize each motion into its corresponding subspace, where each video corresponds to a sole clustering task, leading to 156 clustering tasks in total.

For comparison, we compare LHR with LRR as well as other benchmark algorithms for SC. The comparisons include random sample consensus (RANSAC) [45], generalized principal component analysis (GPCA) [46], local subspace affinity (LSA), locally linear manifold clustering (LLMC), and sparse subspace clustering (SSC). RANSAC is a statistical method which clusters data by iteratively distinguishing the data into inliers and outliers. GPCA presents an algebraic method to cluster the mixed data by the normal vectors of the data points. Manifold-based algorithms, e.g., LSA and LLMC, assume that one point and its neighbors span a linear subspace, and they are clustered via spectral embedding. SSC assumes that the affine matrix between data is sparse and segments the data via normalized cut [47].

In LRR [5], Liu et al. introduced two models that, respectively, use the ℓ1 norm and the ℓ2,1 norm to penalize sparse corruptions. In this paper, we only report the results in comparison to the ℓ2,1 norm since it always performs better than the ℓ1 penalty in LRR. In order to provide a thorough comparison with LRR, we strictly follow the steps and the default parameter settings suggested in [5]. For the LHR model, we choose the parameter λ = 0.4. In the experiments of LRR for motion segmentation, some postprocessing is performed on the learned low-rank structure to seek the best clustering accuracy. For example, in LRR, after getting the representation matrix A, an extra PCP processing is implemented on A to enhance the low-rankness, and such postprocessing definitely increases the SC accuracy. However, the main contribution of this paper focuses the LHR model on LRSL, not on the single task of SC. Therefore, we exclude all the postprocessing steps to emphasize the effectiveness of the LRSL model itself. In our results, all the methods are implemented with the same criterion to avoid biased treatments.
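For readers unfamiliar with how cluster labels are obtained from a learned representation matrix, one common recipe for LRR-style methods is to symmetrize |A| into an affinity matrix and apply spectral clustering. The sketch below illustrates this generic step only; it is not necessarily the exact criterion used in our experiments, and the scikit-learn call is simply one convenient implementation of spectral clustering.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def labels_from_representation(A, n_clusters):
    """Turn a representation matrix A (samples x samples) into cluster labels:
    symmetrize |A| into an affinity matrix and run spectral clustering.
    A common recipe for LRR-style methods, shown here only for illustration."""
    W = np.abs(A) + np.abs(A).T                       # symmetric, nonnegative affinity
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            assign_labels="kmeans", random_state=0)
    return sc.fit_predict(W)
```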
Hopkins155 contains two subspace conditions in a video sequence, i.e., two motions or three motions, and thus we report the segmentation errors for two subspaces (Two), for three subspaces (Three), and for both conditions (All) in Table IV.

TABLE IV
MOTION SEGMENTATION ERRORS (MEAN) OF SEVERAL ALGORITHMS ON THE HOPKINS155 MOTION SEGMENTATION DATABASE

  Category    Method    Two    Three   All
  Algebraic   GPCA      11.2   27.7    14.2
  Statistic   RANSAC    8.9    24.1    12.5
  Manifold    LSA       8.7    21.4    11.6
              LLMC      8.1    20.8    10.9
  Sparse      SSC       5.4    15.3    7.6
              LRR       4.7    15.1    6.9
              pHR       4.2    14.4    6.1
              LHR       3.1    13.9    5.6

From the results, we know that sparse-based methods generally outperform the other algorithms for motion segmentation. Among the three sparse methods, LHR gains the best clustering accuracy. However, the accuracy only shows slight improvements over LRR. As indicated in [5], motion data only contain small corruptions, and LRR can already achieve promising performance; with some postprocessing
implementations, the accuracy can even be further improved. Therefore, in order to highlight the effectiveness of LHR on LRR with corrupted data, some more complicated problems will be considered.

2) Stock Clustering: It is not trivial to apply the LHR model to more complicated practical data, where the effectiveness of LHR on corrupted data will be further emphasized. In the practical world, one of the most difficult data structures to analyze is the stock price, which can be greatly affected by company news, rumors, and the global economic atmosphere. Therefore, data mining approaches for financial signals have been proven to be very difficult but, on the other hand, very profitable.

In this paper, we discuss how to use the LRR and LHR models for the interesting, albeit not very lucrative, task of stock clustering based on industrial categories. In many stock exchange centers around the world, stocks are always divided into different industrial categories. For example, on the New York Stock Exchange, IBM and J. P. Morgan are, respectively, categorized into the computer-based system category and the money center banks category. It is generally assumed that stocks in the same category always have similar market performance. This basic assumption is widely used by many hedge funds for statistical arbitrage. In this paper, we consider that stocks in the same industrial category span a subspace, and therefore the goal of stock clustering, a.k.a. stock categorization, is to identify a stock's industrial label from its historical prices.

The experiments are conducted on stocks from two global stock exchange markets in New York and Hong Kong. In each market, we choose ten industrial categories which have the largest market capitalizations. The categories divided by the exchange centers are used as the ground-truth labels. In each category, we only choose the stocks whose market capitalizations are within the top ten ranks of that category. The stock prices on the New York market are obtained from [48] and the stock prices on the Hong Kong market are obtained from [49]. Unfortunately, some historical prices for stocks in [48] are not provided.⁵ Therefore, for the U.S. market, we accumulated 76 stocks divided into 10 classes, where each class contains 7 to 9 stocks; for the Hong Kong market, we obtained 96 stocks spanning 10 classes. For classification, the weekly closing prices from January 7, 2008 to October 31, 2011, covering 200 weeks, are used, because financial experts always look at weekly closing prices to judge the long-term trend of a stock.

⁵For example, in the industrial category of drug manufacturers, it is not possible to get the historical data of CIPILA.LTD from [48], which is the only interface for us to get the stock prices in the U.S.

Fig. 4. Normalized stock prices in NY for the categories aerospace and defense, banks, and wireless communication (three panels, normalized price versus week index 0–200). In each category, lines in different colors represent different stocks (best viewed on screen).
As stated previously, the stock prices may have extreme highs and lows, which cause outliers in the raw data. Besides, the prices of different stocks vary in scale and cannot be evaluated on the same quantitative scale. For the ease of pattern mining, we use the time-based normalization strategy suggested in [50] and [51] to preprocess the stock prices

    p̃(t) = (p(t) − μ(t)) / σ(t)

where p(t) is the price of a certain stock at time t, and μ(t) and σ(t) are, respectively, the average value and the standard deviation of the stock prices in the interval [t − Δ, t]. We plot the normalized stock prices of three categories in Fig. 4. After normalization, we further adopt the PCA method to reduce the dimension of the stocks from R^200 to R^5. Theoretically, the rank of the subspaces after PCA should be 10 − 1 = 9 because the data contain ten subspaces and the rank is degraded by 1 during the PCA implementation. However, in the simulation, we find that the maximal clustering accuracies for both markets are achieved with a PCA dimension of 5.
5 For example, in the industrial category of drug manufactures, it is not reported here are sufficient to verify the effectiveness of
possible to get the historical data of CIPILA.LTD from [48] which is the SC for ten classes categorization. If no intelligent learning
only interface for us to get the stock prices in the U.S. approaches were imposed, the expected accuracy may be only
http://www.paper.edu.cn
394 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 3, MARCH 2013

10%. Although with such bad raw data, the proposed LHR B. Theoretic Proof of the Convergence of LHR
could achieve the accuracy as high as 62% in a definitely 1) Proof of Lemma 4.1: In order to prove the monotonically
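To make the preprocessing above concrete, the following minimal NumPy sketch implements the time-based normalization p̃(t) = (p(t) − μ(t))/σ(t) over a trailing window and the PCA reduction from R^200 to R^5. The window length tau, the data layout (stocks by weekly closing prices), and all function names are illustrative assumptions of ours; neither this paper nor [50], [51] fixes them.

import numpy as np

def normalize_prices(P, tau=20):
    # time-based normalization p~(t) = (p(t) - mu(t)) / sigma(t), where mu(t)
    # and sigma(t) are the mean and standard deviation over the trailing
    # window [t - tau, t].  P has shape (n_stocks, n_weeks); tau is an
    # assumed window length that the paper does not report.
    n_stocks, n_weeks = P.shape
    out = np.zeros_like(P, dtype=float)
    for t in range(n_weeks):
        window = P[:, max(0, t - tau):t + 1]
        mu = window.mean(axis=1)
        sigma = window.std(axis=1) + 1e-8        # guard against zero variance
        out[:, t] = (P[:, t] - mu) / sigma
    return out

def pca_reduce(X, dim=5):
    # project each stock (a row of X) onto the top-`dim` principal components,
    # reducing it from R^200 to R^5 as described above
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T

# toy usage: 76 synthetic stocks with 200 weekly closing prices each
P = 100.0 + np.cumsum(np.random.randn(76, 200), axis=1)
X5 = pca_reduce(normalize_prices(P), dim=5)      # shape (76, 5)

The reduced 5-D representations are then what the SC methods compared in Table V would cluster.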
VII. CONCLUSION

This paper presented an LHR algorithm to learn the essential low-rank structures from corrupted matrices. We introduced an MM algorithm to convert the nonconvex objective function into a series of convex optimizations via reweighted approaches and proved that the solution converges to a stationary point. The general model was then applied to two practical tasks, LRMR and SC. In both models, we gave the solution/update rules for each variable in the joint optimizations via ADM. For the general PCP problem, LHR extended the feasible region to the boundary where the sum of the rank and sparsity ratios reaches 0.58. For the SC problem, LHR achieved state-of-the-art results on motion segmentation and promising results on stock clustering, whose data contain many outliers and uncertainties. However, a limitation of the proposed LHR model lies in the reweighted scheme, which requires solving convex optimizations multiple times; the implementation of LHR is therefore somewhat more time consuming than PCP and LRR. Accordingly, the LHR model is especially recommended for learning the low-rank structure from data with denser corruptions and higher ranks.
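To illustrate the reweighted structure summarized above, the following sketch (ours, not the paper's implementation) shows a generic MM outer loop for low-rank plus sparse recovery: each pass solves a convex weighted PCP-type problem and then refreshes the weights from the current estimate. The inner routine solve_weighted_pcp is a hypothetical placeholder for such a convex solver (e.g., an ADM routine), and the weight updates shown are the generic log-sum heuristic rather than the exact update rules derived in the paper.

import numpy as np

def reweighted_low_rank_recovery(P, solve_weighted_pcp,
                                 delta=1e-2, max_iter=10, tol=1e-6):
    # Illustrative MM outer loop: each pass solves one *convex* weighted
    # PCP-type problem, then refreshes the weights from the current estimate
    # so that small singular values / small entries are penalized more
    # heavily in the next pass.  `solve_weighted_pcp(P, w_sv, W_e)` is a
    # hypothetical inner solver returning (A, E); it is not defined here.
    m, n = P.shape
    w_sv = np.ones(min(m, n))          # weights on the singular values of A
    W_e = np.ones((m, n))              # weights on the entries of E
    A, E = np.zeros((m, n)), np.zeros((m, n))
    for _ in range(max_iter):
        A_new, E_new = solve_weighted_pcp(P, w_sv, W_e)
        s = np.linalg.svd(A_new, compute_uv=False)
        w_sv = 1.0 / (s + delta)       # generic log-sum reweighting
        W_e = 1.0 / (np.abs(E_new) + delta)
        if np.linalg.norm(A_new - A, 'fro') <= tol * max(1.0, np.linalg.norm(P, 'fro')):
            A, E = A_new, E_new
            break
        A, E = A_new, E_new
    return A, E

Each outer pass is convex; the nonconvexity enters only through the data-dependent weight refresh, which is what makes the overall scheme an MM iteration.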
APPENDIX

A. Constructing the Upper Bound of LHR

To see how the MM works for LHR, let us recall the objective function in (8) and perform some simple algebraic operations:

    (1/2)(‖diag(Y)‖_L + ‖diag(Z)‖_L) + ‖E‖_L
      = (1/2)(Σ_i log(Y_{ii} + δ) + Σ_k log(Z_{kk} + δ)) + Σ_{ij} log(|E_{ij}| + δ)
      = (1/2)(log det(Y + δI_m) + log det(Z + δI_n)) + Σ_{ij} log(|E_{ij}| + δ)          (17)

where I_m ∈ R^{m×m} is an identity matrix. It is well known that a concave function is bounded by its first-order Taylor expansion. Therefore, we calculate the convex upper bounds of all the terms in (17). For the term log det(Y + δI_m),

    log det(Y + δI_m) ≤ log det(Φ_Y + δI_m) + tr[(Φ_Y + δI_m)^{-1} (Y − Φ_Y)].          (18)

The inequality in (18) holds for any Φ_Y ⪰ 0. Similarly, for any (Φ_E)_{ij} > 0,

    Σ_{ij} log(|E_{ij}| + δ) ≤ Σ_{ij} ( log[(Φ_E)_{ij} + δ] + (|E_{ij}| − (Φ_E)_{ij}) / ((Φ_E)_{ij} + δ) ).          (19)

We replace each term in (17) with its convex upper bound and define T(X|Φ) as the surrogate function after this convex relaxation.
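Both majorization inequalities are first-order bounds on concave functions and can be sanity-checked numerically. The short script below is an illustrative check added here (it is not part of the paper); it draws random positive semidefinite matrices and random entries and verifies (18) and (19) for a small δ, with Phi and delta mirroring the notation above.

import numpy as np

rng = np.random.default_rng(0)
delta, m = 1e-2, 6

def rand_psd(m):
    B = rng.standard_normal((m, m))
    return B @ B.T                                   # random PSD matrix

# check (18): logdet(Y + dI) <= logdet(Phi + dI) + tr((Phi + dI)^(-1) (Y - Phi))
for _ in range(1000):
    Y, Phi = rand_psd(m), rand_psd(m)
    lhs = np.linalg.slogdet(Y + delta * np.eye(m))[1]
    M = Phi + delta * np.eye(m)
    rhs = np.linalg.slogdet(M)[1] + np.trace(np.linalg.solve(M, Y - Phi))
    assert lhs <= rhs + 1e-9

# check (19): log(|e| + d) <= log(phi + d) + (|e| - phi) / (phi + d) for phi > 0
e   = rng.standard_normal(100000)
phi = np.abs(rng.standard_normal(100000)) + 1e-6
lhs = np.log(np.abs(e) + delta)
rhs = np.log(phi + delta) + (np.abs(e) - phi) / (phi + delta)
assert np.all(lhs <= rhs + 1e-9)
print("both first-order bounds hold on all random samples")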
B. Theoretical Proof of the Convergence of LHR

1) Proof of Lemma 4.1: In order to prove the monotonically decreasing property, we can instead prove

    H(X^{t+1}) ≤ T(X^{t+1} | Φ^t) ≤ T(X^t | Φ^t) = H(X^t).          (20)

We prove (20) by the following three steps.
1) The first inequality follows from the fact that T(X|Φ) is an upper bound of H(X).
2) The second inequality holds since the MM algorithm computes X^{t+1} = arg min_X T(X | Φ^t). The function T(·) is convex; therefore, X^{t+1} is the unique global minimum. This property guarantees that T(X^{t+1} | Φ^t) < T(X | Φ^t) for any X ≠ X^{t+1}, and T(X^{t+1} | Φ^t) = T(X | Φ^t) if and only if X = X^{t+1}.
3) The last equality can easily be verified by expanding T(X^t | Φ^t) and performing some simple algebra. The transformation is straightforward and omitted here.

2) Proof of Lemma 4.2: We give a proof by contradiction. Assume that the sequence X diverges, which means that lim_{t→∞} ‖X^{t+1} − X^t‖_F ≠ 0. According to the discussion in Appendix B.3, there exists a convergent subsequence X^{t_k}, i.e., lim_{k→∞} X^{t_k} = Δ, and, meanwhile, we can construct another convergent subsequence X^{t_k+1} with lim_{k→∞} X^{t_k+1} = Δ′. We assume that Δ ≠ Δ′. Since the convex upper bound T(·|·) is continuous, we get lim_{k→∞} T(X^{t_k+1} | Φ^{t_k}) = T(lim_{k→∞} X^{t_k+1} | Φ^{t_k}) < T(lim_{k→∞} X^{t_k} | Φ^{t_k}) = lim_{k→∞} T(X^{t_k} | Φ^{t_k}). The strict less-than operator < holds because Δ ≠ Δ′; see step 2) in the proof of Lemma 4.1 for details. Therefore, it is straightforward to get the following inequalities: lim_{k→∞} H(X^{t_k+1}) ≤ lim_{k→∞} T(X^{t_k+1} | Φ^{t_k}) < lim_{k→∞} T(X^{t_k} | Φ^{t_k}) = lim_{k→∞} H(X^{t_k}). Accordingly,

    lim_{k→∞} H(X^{t_k+1}) < lim_{k→∞} H(X^{t_k}).          (21)

Besides, it is obvious that the function H(·) in (8) is bounded below, i.e., H(X) > (mn + m + n) log δ. Moreover, as proved in Lemma 4.1, H(X) is monotonically decreasing, which guarantees that lim_{t→∞} H(X^t) exists:

    lim_{k→∞} H(X^{t_k}) = lim_{t→∞} H(X^t) = lim_{t→∞} H(X^{t+1}) = lim_{k→∞} H(X^{t_k+1}).          (22)

Obviously, (22) contradicts (21). Therefore, Δ = Δ′, and we get the conclusion that lim_{t→∞} ‖X^{t+1} − X^t‖_F = 0.

3) Convergence of Subsequences in the Proof of Lemma 4.2: In this part, we discuss the properties of the convergent subsequences that are used in the proof of Lemma 4.2. Since the sequence X^t = {Y^t, Z^t, A^t, E^t} is generated via (8), X^t ∈ D strictly holds. Therefore, all the variables (i.e., Y^t, Z^t, A^t, E^t) in the set X should be bounded. This claim can be easily verified because, if any variable in the set X went to infinity, the constraints in the domain D would not be satisfied.
Accordingly, we know that the sequence X^t is bounded. According to the Bolzano–Weierstrass theorem [52], every bounded sequence has a convergent subsequence. Since X^t is bounded, it is apparent that there exists a convergent subsequence X^{t_k}. Without loss of generality, we can construct another subsequence X^{t_k+1}, which is also convergent. The proof of the convergence of X^{t_k+1} relies on the monotonically decreasing property proved in Lemma 4.1. Since H(·) is monotonically decreasing, it is easy to check that H(X^{t_k}) ≥ H(X^{t_k+1}) ≥ H(X^{t_{k+1}}) ≥ H(X^{t_{k+1}+1}) ≥ H(X^{t_{k+2}}). According to the aforementioned inequalities, we get

    lim_{k→∞} H(X^{t_k}) ≥ lim_{k→∞} H(X^{t_k+1}) ≥ lim_{k→∞} H(X^{t_{k+2}}).          (23)

Since the subsequence X^{t_k} converges, it is obvious that lim_{k→∞} H(X^{t_k}) = lim_{k→∞} H(X^{t_{k+2}}) = ℓ. According to the famous squeeze theorem [53], from (23) we get lim_{k→∞} H(X^{t_k+1}) = H(lim_{k→∞} X^{t_k+1}) = ℓ. Since the function H(·) is monotonically decreasing and X is bounded, the convergence of H(X^{t_k+1}) can be obtained if and only if the subsequence X^{t_k+1} is convergent.

4) Proof of Theorem 4.3: As stated in Lemma 4.2, the sequence generated by the MM algorithm converges to a limit, and here we first prove that this limit is a fixed point. We define the mapping from X^k to X^{k+1} as M(·), and it is straightforward to get lim_{t→∞} X^t = lim_{t→∞} X^{t+1} = lim_{t→∞} M(X^t), which implies that lim_{t→∞} X^t = X* is a fixed point. In the MM algorithm, the upper bound is constructed with a first-order Taylor expansion, so the convex surrogate T(X|Φ) is tangent to H(X) at the expansion point by the property of the Taylor expansion. Accordingly, the gradient vectors of T(X|Φ) and H(X) are equal when evaluated at X*. Besides, we know that ∂T(X|Φ)/∂X = 0 at X = X*, and because T(X|Φ) is tangent to H(X), we directly get ∂H(X)/∂X = 0 at X = X*, which proves that the convergent fixed point X* is also a stationary point of H(·).
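As a concrete one-dimensional illustration of the monotone decrease in Lemma 4.1 and of the fixed-point argument above (a toy example of ours, not taken from the paper), consider H(x) = log(|x| + δ) + (λ/2)(x − b)². Majorizing the concave log term by its first-order expansion, as in (19), turns each MM step into a weighted ℓ1-regularized quadratic whose minimizer is a soft-thresholding of b:

import numpy as np

# toy 1-D instance of the log-sum objective: H(x) = log(|x| + delta) + lam/2 * (x - b)^2
delta, lam, b = 0.1, 10.0, 2.0

def H(x):
    return np.log(abs(x) + delta) + 0.5 * lam * (x - b) ** 2

x = 0.0
for t in range(100):
    w = 1.0 / (abs(x) + delta)               # weight taken from the current iterate
    # the convex surrogate is w*|z| + lam/2*(z - b)^2 (up to a constant);
    # its global minimizer is a soft-thresholding of b with threshold w/lam
    x_new = np.sign(b) * max(abs(b) - w / lam, 0.0)
    assert H(x_new) <= H(x) + 1e-12          # monotone decrease (Lemma 4.1)
    if abs(x_new - x) < 1e-12:               # fixed point reached
        x = x_new
        break
    x = x_new

grad = lam * (x - b) + np.sign(x) / (abs(x) + delta)   # H'(x) away from x = 0
print(f"stopped after {t + 1} steps, x* = {x:.6f}, H'(x*) = {grad:.2e}")

On this scalar instance, H(x^t) decreases at every step and the iteration stops at a point where H' vanishes, mirroring Lemma 4.1 and Theorem 4.3.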
ACKNOWLEDGMENT

The authors would like to thank Q. Zhang for his helpful discussions on the proof of Theorem 4.3.

REFERENCES

[1] X. Li and Y. Pang, "Deterministic column-based matrix decomposition," IEEE Trans. Knowl. Data Eng., vol. 22, no. 1, pp. 145–149, Jan. 2010.
[2] Y. Yuan, X. Li, Y. Pang, X. Lu, and D. Tao, "Binary sparse nonnegative matrix factorization," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 5, pp. 772–777, May 2009.
[3] S. Hu and J. Wang, "Absolute exponential stability of a class of continuous-time recurrent neural networks," IEEE Trans. Neural Netw., vol. 14, no. 1, pp. 35–45, Jan. 2003.
[4] A. Goldberg, X. J. Zhu, B. Recht, J. Sui, and R. Nowak, "Transduction with matrix completion: Three birds with one stone," in Proc. Neural Inform. Process. Syst., 2010, pp. 1–9.
[5] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, "Robust recovery of subspace structures by low-rank representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 171–184, Jan. 2013.
[6] S. Hu and J. Wang, "Quadratic stabilizability of a new class of linear systems with structural independent time-varying uncertainty," Automatica, vol. 37, no. 1, pp. 51–59, 2001.
[7] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, "Robust principal component analysis: Exact recovery of corrupted low-rank matrices," in Proc. Neural Inform. Process. Syst., 2009, pp. 1–9.
[8] Y. Deng, Y. Liu, Q. Dai, Z. Zhang, and Y. Wang, "Noisy depth maps fusion for multiview stereo via matrix completion," IEEE J. Select. Topics Signal Process., vol. 6, no. 5, pp. 566–582, Sep. 2012.
[9] Y. Deng, Q. Dai, and Z. Zhang, "Graph Laplace for occluded face completion and recognition," IEEE Trans. Image Process., vol. 20, no. 8, pp. 2329–2338, Aug. 2011.
[10] Y. Deng, Q. Dai, R. Wang, and Z. Zhang, "Commute time guided transformation for feature extraction," Comput. Vis. Image Understand., vol. 116, no. 4, pp. 473–483, 2012.
[11] R. Liu, Z. Lin, S. Wei, and Z. Su, "Feature extraction by learning Lorentzian metric tensor and its extensions," Pattern Recognit., vol. 43, no. 10, pp. 3298–3306, 2010.
[12] D. Donoho, "Compressed sensing," IEEE Trans. Inform. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
[13] B. Recht, M. Fazel, and P. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Rev., vol. 52, no. 3, pp. 471–501, 2010.
[14] E. J. Candes, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?" J. ACM, vol. 59, no. 3, pp. 1–37, May 2011.
[15] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, "Rank-sparsity incoherence for matrix decomposition," Tech. Rep. 2220, Elect. Eng. Res. Lab., Univ. Texas, Austin, Jun. 2009.
[16] D. Hsu, S. M. Kakade, and T. Zhang, "Robust matrix decomposition with outliers," Tech. Rep. 1011.1518, Dept. Stat., Univ. Pennsylvania, Philadelphia, PA, Dec. 2010.
[17] G. Liu, Z. Lin, and Y. Yu, "Robust subspace segmentation by low-rank representation," in Proc. Int. Conf. Mach. Learn., 2010, pp. 663–670.
[18] E. J. Candes, M. Wakin, and S. Boyd, "Enhancing sparsity by reweighted ℓ1 minimization," J. Fourier Anal. Appl., vol. 14, no. 5, pp. 877–905, 2007.
[19] Z. Yang, Y. Xiang, S. Xie, S. Ding, and Y. Rong, "Nonnegative blind source separation by sparse component analysis based on determinant measure," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 10, pp. 1601–1610, Oct. 2012.
[20] Z. Lai, W. K. Wong, Z. Jin, J. Yang, and Y. Xu, "Sparse approximation to the eigensubspace for discrimination," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 12, pp. 1948–1960, Dec. 2012.
[21] Z. Xu, X. Chang, F. Xu, and H. Zhang, "L1/2 regularization: A thresholding representation theory and a fast solver," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1013–1027, Jul. 2012.
[22] H. Li, N. Chen, and L. Li, "Error analysis for matrix elastic-net regularization algorithms," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 5, pp. 737–748, May 2012.
[23] R. Liu, Z. Lin, S. Wei, and Z. Su, "Solving principal component pursuit in linear time via ℓ1 filtering," Tech. Rep. 1108.5359, Elect. Eng. Res. Lab., Univ. Texas, Austin, Aug. 2011.
[24] A. Ganesh, J. Wright, X. Li, E. J. Candes, and Y. Ma, "Dense error correction for low-rank matrices via principal component pursuit," in Proc. Int. Symp. Inform. Theory, Jun. 2010, pp. 1–5.
[25] R. Vidal, "Subspace clustering," IEEE Signal Process. Mag., vol. 28, no. 2, pp. 52–68, Mar. 2011.
[26] Z. Lin, R. Liu, and Z. Su, "Linearized alternating direction method with adaptive penalty for low-rank representation," in Proc. Neural Inform. Process. Syst., Sep. 2011, pp. 1–7.
[27] C.-S. Foo, C. B. Do, and A. Y. Ng, "A majorization-minimization algorithm for (multiple) hyperparameter learning," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 321–328.
[28] M. Figueiredo, J. Bioucas-Dias, and R. Nowak, "Majorization-minimization algorithms for wavelet-based image restoration," IEEE Trans. Image Process., vol. 16, no. 12, pp. 2980–2991, Dec. 2007.
[29] J. Bioucas-Dias, M. Figueiredo, and J. Oliveira, "Total variation-based image deconvolution: A majorization-minimization approach," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., May 2006, p. 2.
[30] M. Fazel, "Matrix rank minimization with applications," Ph.D. thesis, Stanford Univ., Stanford, CA, 2002.
[31] M. Fazel, H. Hindi, and S. Boyd, "Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices," in Proc. Amer. Control Conf., vol. 3, Jun. 2003, pp. 2156–2162.
[32] K. Mohan and M. Fazel, "Reweighted nuclear norm minimization with application to system identification," in Proc. Amer. Control Conf., 2010, pp. 2953–2959.
[33] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. Royal Statist. Soc. Series B, vol. 67, no. 2, pp. 301–320, 2005.
[34] R. Chartrand and W. Yin, "Iteratively reweighted algorithms for compressive sensing," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Apr. 2008, pp. 3869–3872.
[35] J. Cai, E. Candes, and Z. Shen, "A singular value thresholding algorithm for matrix completion," Preprint, vol. 20, no. 4, pp. 1–27, 2008.
[36] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Image Science, vol. 2, no. 1, pp. 183–202, 2009.
[37] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–123, 2010.
[38] Z. Lin, M. Chen, and Y. Ma, "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices," Aerospace Corp., Los Angeles, CA, Tech. Rep. 1009.5055, Mar. 2011.
[39] M. Lees, "A note on the convergence of alternating direction methods," Math. Comput., vol. 16, no. 77, pp. 70–75, 1963.
[40] Low-rank matrix recovery and completion via convex optimization. (2012) [Online]. Available: http://perception.csl.uiuc.edu/matrix-rank/sample_code.html
[41] J. Suo, S. Zhu, S. Shan, and X. Chen, "A compositional and dynamic model for face aging," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 3, pp. 385–401, Mar. 2010.
[42] C. Benedek and T. Sziranyi, "Bayesian foreground and shadow detection in uncertain frame rate surveillance videos," IEEE Trans. Image Process., vol. 17, no. 4, pp. 608–621, Apr. 2008.
[43] J. K. Suhr, H. G. Jung, G. Li, and J. Kim, "Mixture of Gaussians-based background subtraction for Bayer-pattern image sequences," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 3, pp. 365–370, Mar. 2011.
[44] Type I and type II errors. (2009) [Online]. Available: http://en.wikipedia.org/wiki/type_i_and_type_ii_errors
[45] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, pp. 381–395, Jun. 1981.
[46] R. Vidal, Y. Ma, and S. Sastry, "Generalized principal component analysis (GPCA)," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2003, pp. 621–628.
[47] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000.
[48] Yahoo! Finance. (2012) [Online]. Available: http://finance.yahoo.com
[49] Google Finance. (2012) [Online]. Available: http://www.google.com.hk/finance?q=
[50] M. Gavrilov, D. Anguelov, P. Indyk, and R. Motwani, "Mining the stock market (extended abstract): Which measure is best?" in Proc. 6th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2000, pp. 487–496.
[51] T. Wittman, "Time-series clustering and association analysis of financial data," Elect. Eng. Res. Lab., Univ. Texas, Austin, Tech. Rep. CS 8980, 2002.
[52] Bolzano–Weierstrass theorem. (2012) [Online]. Available: http://en.wikipedia.org/wiki/Bolzano%E2%80%93Weierstrass_theorem
[53] Squeeze theorem. (2012) [Online]. Available: http://en.wikipedia.org/wiki/Squeezetheorem

Yue Deng received the B.E. degree (Hons.) in automatic control from Southeast University, Nanjing, China, in 2008. He is currently pursuing the Ph.D. degree with the Department of Automation, Tsinghua University, Beijing, China.
He was a Visiting Scholar with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, from 2010 to 2011. His current research interests include machine learning, signal processing, and computer vision.
Mr. Deng was a recipient of the Microsoft fellowship in 2010.

Qionghai Dai (SM'05) received the B.S. degree in mathematics from Shanxi Normal University, Xi'an, China, in 1987, and the M.E. and Ph.D. degrees in computer science and automation from Northeastern University, Shenyang, China, in 1994 and 1996, respectively.
He has been a member of the faculty of Tsinghua University, Beijing, China, since 1997. He is currently a Cheung Kong Professor with Tsinghua University and is the Director of the Broadband Networks and Digital Media Laboratory. His current research interests include signal processing, computer vision, and graphics.

Risheng Liu received the B.Sc. degree in mathematics and the Ph.D. degree in computational mathematics from the Dalian University of Technology, Dalian, China, in 2007 and 2012, respectively.
He was a joint Ph.D. student with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, from 2010 to 2012. He is currently a Post-Doctoral Researcher with the Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology. His current research interests include machine learning and computer vision.

Zengke Zhang received the B.S. degree in industrial electrization and automation from Tsinghua University, Beijing, China, in 1970.
He is a Professor with the Department of Automation, Tsinghua University. His current research interests include intelligent control, motion control, system integration, and image processing.

Sanqing Hu (M'05–SM'06) received the B.S. degree from the Department of Mathematics, Hunan Normal University, Hunan, China, the M.S. degree from the Department of Automatic Control, Northeastern University, Shenyang, China, and the Ph.D. degrees from the Department of Automation and Computer-Aided Engineering, The Chinese University of Hong Kong, Kowloon, Hong Kong, and the Department of Electrical and Computer Engineering, University of Illinois, Chicago, in 1992, 1996, 2001, and 2006, respectively.
He was a Research Fellow with the Department of Neurology, Mayo Clinic, Rochester, MN, from 2006 to 2009. From 2009 to 2010, he was a Research Assistant Professor with the School of Biomedical Engineering, Science & Health Systems, Drexel University, Philadelphia, PA. He is currently a Chair Professor with the College of Computer Science, Hangzhou Dianzi University, Hangzhou, China. His current research interests include biomedical signal processing, cognitive and computational neuroscience, neural networks, and dynamical systems. He is the co-author of more than 60 international journal and conference papers.
Dr. Hu is an Associate Editor of four journals, including the IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, the IEEE TRANSACTIONS ON SMC, PART B, the IEEE TRANSACTIONS ON NEURAL NETWORKS, and Neurocomputing. He was a Guest Editor of Neurocomputing's special issue on Neural Networks 2007 and of Cognitive Neurodynamics' special issue on cognitive computational algorithms in 2011. He was the Organizing Committee Co-Chair for ICAST2011 and the Program Chair for ICIST2011, ISNN2011, and IWACI2010. He was the Special Sessions Chair for ICNSC2008, ISNN2009, and IWACI2010. He has served as a member of the program committees of 20 international conferences.