
Optimization-based Data Mining

Techniques with Applications


Proceedings of a Workshop held in Conjunction with
2005 IEEE International Conference on Data Mining
Houston, USA, November 27, 2005
Edited by
Yong Shi
Chinese Academy of Sciences Research Center on Data Technology & Knowledge Economy,
Graduate University of the Chinese Academy of Sciences,
Beijing 100080, China
ISBN 0-9738918-1-5
The papers appearing in this book reflect the authors' opinions and are published in the
interests of timely dissemination based on review by the program committee or volume
editors. Their inclusion in this publication does not necessarily constitute endorsement by
the editors.
© 2005 by the authors and editors of this book.
No part of this work can be reproduced without permission except as indicated by the
Fair Use clause of the copyright law. Passages, images, or ideas taken from this work
must be properly credited in any written or published materials.
ISBN 0-9738918-0-7
Printed by Saint Mary's University, Canada.
CONTENTS
Introduction......II
Novel Quadratic Programming Approaches for Feature Selection
and Clustering with Applications
W. Art Chaovalitwongse......1
Fuzzy Support Vector Classification Based on Possibility Theory
Zhimin Yang, Yingjie Tian, Naiyang Deng.8
DEA-based Classification for Finding Performance Improvement
Direction
Shingo Aoki, Yusuke Nishiuchi, Hiroshi Tsuji......16
Multi-Viewpoint Data Envelopment Analysis for Finding
Efficiency and Inefficiency
Shingo AOKI, Kiyosei MINAMI, Hiroshi TSUJI.....21
Mining Valuable Stocks with Genetic Optimization Algorithm
Lean Yu, Kin Keung Lai and Shouyang Wang.....27
A Comparison Study of Multiclass Classification between
Multiple Criteria Mathematical Programming and Hierarchical
Method for Support Vector Machines
Yi Peng, Gang Kou, Yong Shi, Zhenxing Chen and Hongjin Yang.30
Pattern Recognition for Multimedia Communication Networks
Using New Connection Models between MCLP and SVM
Jing HE, Wuyi YUE, Yong SHI....37
Introduction
For the last ten years, researchers have extensively applied quadratic programming
to classification, best known through V. Vapnik's Support Vector Machine, as well as to various
applications. However, the use of optimization techniques for data separation and
data analysis goes back more than thirty years. According to O. L. Mangasarian,
his group formulated linear programming as a large-margin classifier in the 1960s. In
the 1970s, A. Charnes and W. W. Cooper initiated Data Envelopment Analysis, in which
fractional programming is used to evaluate decision-making units, the economically
representative data in a given training dataset. From the 1980s to the 1990s, F. Glover proposed
a number of linear programming models to solve discriminant problems with small
sample sizes of data. Then, since 1998, the organizer and his colleagues extended this line of
research into classification via multiple criteria linear programming (MCLP) and
multiple criteria quadratic programming (MCQP). All of these methods differ from
statistics, decision tree induction, and neural networks. By now, numerous
scholars around the world are actively working on the use of
optimization techniques to handle data mining problems. This workshop intends to
promote research interest in the connection between optimization and data mining, as well
as in real-life applications, among the growing data mining communities. All seven
papers accepted by the workshop reflect the findings of the researchers in these
interface fields.
Yong Shi
Beijing, China
Novel Quadratic Programming Approaches for Feature Selection and Clustering
with Applications
W. Art Chaovalitwongse
Department of Industrial and Systems Engineering
Rutgers, The State University of New Jersey
Piscataway, New Jersey 08854
Email: wchaoval@rci.rutgers.edu
Abstract
Uncontrolled epilepsy poses a significant burden to
society due to the associated healthcare costs of treating and
controlling the unpredictable and spontaneous occurrence of
seizures. The main objective of this paper is to develop and
apply novel optimization-based data mining approaches
to the study of brain physiology, which might be able to
revolutionize current diagnosis and treatment of epilepsy.
Through quantitative analyses of electroencephalogram
(EEG) recordings, a new data mining paradigm for feature
selection and clustering is developed based on the mathematical
models and optimization techniques proposed in this
paper. The experimental results in this study demonstrate
that the proposed techniques can be used as a feature
(electrode) selection technique to capture seizure precursors.
In addition, the proposed techniques will not only
excavate hidden patterns/relationships in EEGs, but also
give a greater understanding of brain functions (as
well as other complex systems) from a systems perspective.
I. Introduction and Background
Most data mining (DM) tasks fundamentally involve
discrete decisions based on numerical analyses of data
(e.g., the number of clusters, the number of classes, the
class assignment, the most informative features, the outlier
samples, the samples capturing the essential information).
These techniques are combinatorial in nature and can naturally
be formulated as discrete optimization problems. The
goal of most DM tasks naturally lends itself to a discrete
NP-hard optimization problem. Aside from complexity
issues, the massive scale of real-life DM problems is another
difficulty arising in optimization-based DM research.
In this paper, we focus our main application on epilepsy
research. Epilepsy is the second most common brain
disorder after stroke. The most disabling aspect of epilepsy
is the uncertainty of recurrent seizures, which can be
characterized as a chronic medical condition produced
by temporary changes in the electrical function of the
brain. The aim of this research is to develop and apply
a new DM paradigm to predict seizures based on the
study of neurological brain functions through quantitative
analyses of electroencephalograms (EEGs), a tool
for evaluating the physiological state of the brain. Although
EEGs offer excellent spatial and temporal resolution to
characterize rapidly changing electrical activity of brain
activation, it is not an easy task to excavate hidden patterns
or relationships in massive data with properties in time and
space like EEG time series. This paper involves research
activities directed toward the development of mathematical
models and optimization techniques for DM problems. The
primary goal of this paper is to incorporate novel optimization
methods with DM techniques. Specifically, novel
feature selection and clustering techniques are proposed
in this paper. The proposed techniques will enhance the
ability to provide more precise data characterization, more
accurate prediction/classification, and greater understanding
of EEG time series.
A. Feature/Sample Selection
Although the brain is considered to be the largest
interconnected network, neurologists believe that seizures
represent the spontaneous formation of self-organizing
spatiotemporal patterns that involve only some parts (electrodes)
of the brain network. The localization of epileptogenic
zones is one of the proofs of this concept. Therefore,
feature selection techniques have become an essential
tool for selecting the critical brain areas participating in
the epileptogenesis process during seizure development.
In addition, graph theoretical approaches appear to fit
very well as a model of brain structure [12]. Feature
selection will be very useful in selecting/identifying the
brain areas correlated to the pathway to seizure onset.
In general, feature/sample selection is considered to be a
dimensionality reduction technique within the framework
of classification and clustering. This problem can naturally
be defined as a binary optimization problem. The notion
of selecting a subset of variables out of a superset of possible
alternatives naturally lends itself to a combinatorial
(discrete) optimization problem.
In general, depending on the model used to describe the
data, the feature selection problem will end up being a
(non)linear mixed integer programming (MIP) problem.
The most difficult issue in DM problems arises when one
has to deal with spatial and temporal data. It is extremely
critical to be able to identify the best features in a timely
fashion. To overcome this difficulty, the feature selection
problem in seizure prediction research is modeled as a
Multi-Quadratic Integer Programming (MQIP) problem.
MQIP is very difficult to solve. Although many efficient
reformulation-linearization techniques (RLTs) have been
used to linearize QP and nonlinear integer programming
problems [1], [14], additional quadratic constraints make
MQIP problems much more difficult to solve, and current
RLTs fail to solve MQIP problems effectively. A fast and
scalable RLT that can be used to solve MQIPs for feature
selection is herein proposed based on our preliminary studies
in [7], [24]. In addition, a novel framework applying
graph theory to feature selection, which is based on the
preliminary study in [28], is also proposed in this paper.
B. Clustering
The elements and dynamical connections of the brain
dynamics can portray the characteristics of a group of
neurons and synapses or neuronal populations driven by
the epileptogenic process. Therefore, clustering the brain
areas portraying similar structural and functional relationships
will give us insight into the mechanisms of epileptogenesis
and an answer to the question of how seizures
are generated, developed, and propagated, and how they
can be disrupted and treated. The goal of clustering is
to find the best segmentation of raw data into the most
common/similar groups, so in clustering the similarity
measure is the most important property. The difficulty
in clustering arises from the fact that clustering is an
unsupervised learning task, in which the properties and the
expected number of groups (clusters) are not known ahead
of time. The search for the optimal number of clusters is
parametric in nature. The distance-based method is the most
commonly studied clustering technique; it attempts to
identify the best k clusters that minimize the distance of
the points assigned to a cluster from the center of that
cluster. A very well-known example of the distance-based
method is k-means clustering. Another clustering method is
the model-based method, which assumes a functional model
expression that describes each of the clusters and then
searches for the best parameters to fit the cluster model by
optimizing a likelihood measure. k-median clustering is another widely
studied clustering technique, which can be modeled as
a concave minimization problem and reformulated as a
minimization problem of a bilinear function over a polyhedral
set [3]. Although these clustering techniques are well
studied and robust, they still require a priori knowledge of
the data (e.g., the number of clusters, the most informative
features).
II. Data Mining in EEGs
Recent quantitative EEG studies previously reported
in [5], [11], [10], [8], [16], [24] suggest that seizures are
deterministic rather than random, and that it may be possible to
predict the onset of epileptic seizures based on quantitative
analysis of the brain electrical activity through EEGs.
The seizure predictability has also been confirmed by
several other groups [13], [29], [20], [21]. The analysis
proposed in this research was motivated by mathematical
models from chaos theory used to characterize multi-dimensional
complex systems and reduce the dimensionality
of EEGs [19], [31]. These techniques demonstrate
dynamical changes of epileptic activity that involve the
gradual transition from a state of spatiotemporal chaos
to spatial order and temporal chaos [4], [27]. Such a
transition, which precedes seizures for periods on the order
of minutes to hours, is detectable in the EEG by the
convergence in value of chaos measures (i.e., the short-term
maximum Lyapunov exponent, $STL_{max}$) among critical electrode sites on the
neocortex and hippocampus [10]. The T-statistical distance was
proposed to estimate the pair-wise difference (similarity) of
the dynamics of EEG time series between brain electrode
pairs. The T-index measures the degree of convergence
of chaos measures among critical electrode sites. The
T-index at time $t$ between electrode sites $i$ and $j$ is defined as

$$T_{i,j}(t) = \sqrt{N}\,\bigl|E\{STL_{max,i} - STL_{max,j}\}\bigr| \,/\, \sigma_{i,j}(t),$$

where $E\{\cdot\}$ is the sample average of the differences $STL_{max,i} - STL_{max,j}$ estimated over a moving window $w_t(\lambda)$ defined as

$$w_t(\lambda) = \begin{cases} 1 & \text{if } \lambda \in [t-N+1,\ t] \\ 0 & \text{if } \lambda \notin [t-N+1,\ t], \end{cases}$$

where $N$ is the length of the moving window. Then
$\sigma_{i,j}(t)$ is the sample standard deviation of the $STL_{max}$
differences between electrode sites $i$ and $j$ within the
moving window $w_t(\lambda)$. The T-index thus defined follows a
t-distribution with $N-1$ degrees of freedom. A novel feature
selection technique based on optimization, used to
select critical electrode sites minimizing the T-index similarity
measure, was proposed in [4], [24]. The results of that study
demonstrated that spatiotemporal dynamical properties of
EEGs manifest patterns corresponding to specific clinical
states [6], [4], [17], [24]. In spite of promising signs
of seizure predictability, research in epilepsy is still
far from complete. The existence of seizure precursors
remains to be further investigated with respect to parameter
settings, accuracy, sensitivity, and specificity. Essentially, there
is a need for new feature selection and clustering techniques
to systematically identify the brain areas underlying
seizure evolution as well as the epileptogenic zones (the areas
initiating the habitual seizures).
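To make the statistic concrete, the following is a minimal numpy sketch of the pair-wise T-index computation defined above; the synthetic STLmax profiles and the window length N = 60 are illustrative stand-ins for values estimated from real EEG recordings.

```python
import numpy as np

def t_index(stl_i, stl_j):
    """Pair-wise T-index between two STLmax profiles over one moving
    window (here the full length of the inputs plays the role of N)."""
    d = stl_i - stl_j                    # STLmax differences
    n = len(d)                           # window length N
    sigma = d.std(ddof=1)                # sample standard deviation
    return np.sqrt(n) * abs(d.mean()) / sigma

# Example: a T-index matrix for all electrode pairs in one window.
# The profiles are synthetic stand-ins for real EEG-derived STLmax data.
rng = np.random.default_rng(0)
profiles = rng.normal(size=(8, 60))      # 8 electrodes, N = 60 samples
n_el = profiles.shape[0]
T = np.zeros((n_el, n_el))
for i in range(n_el):
    for j in range(i + 1, n_el):
        T[i, j] = T[j, i] = t_index(profiles[i], profiles[j])
```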
III. Feature Selection
The concept of optimization models for feature selection
used to select/identify the brain areas correlated to
the pathway to seizure onset came from the Ising model,
which has been a powerful tool in studying phase transitions in
statistical physics. Such an Ising model can be described by
a graph $G(V, E)$ having $n$ vertices $\{v_1, \dots, v_n\}$, with each
edge $(i, j) \in E$ having a weight (interaction energy) $J_{ij}$.
Each vertex $v_i$ has a magnetic spin variable $\sigma_i \in \{-1, +1\}$
associated with it. An optimal spin configuration of minimum
energy is obtained by minimizing the Hamiltonian

$$H(\sigma) = \sum_{1 \le i \le j \le n} J_{ij}\,\sigma_i \sigma_j \quad \text{over } \sigma \in \{-1, +1\}^n.$$

This problem is equivalent to the combinatorial problem
of quadratic 0-1 programming [15]. This has motivated us
to use quadratic 0-1 (integer) programming to select the
critical cortical sites, where each electrode has only two
states, and to determine the minimal-average T-index state.
In addition, we also introduce extensions of quadratic
integer programming for electrode selection:
Feature Selection via Multi-Quadratic Programming and
Feature Selection via Graph Theory.
A. Feature Selection via Quadratic Integer Programming (FSQIP)
FSQIP is a novel mathematical model for selecting
critical features (electrodes) of the brain network. It
can be modeled as a quadratic 0-1 knapsack problem
whose objective function minimizes the average T-index
(a measure of statistical distance between the mean values
of $STL_{max}$) among electrode sites and whose knapsack
constraint identifies the number of critical cortical sites. A
powerful quadratic 0-1 programming technique proposed
in [25] is employed to solve this problem. Next we will
demonstrate how to reduce a quadratic program with a
knapsack constraint to an unconstrained quadratic 0-1
program. In order to formalize the notion of equivalence,
we propose the following definitions.
Definition 1: We say that problem $P$ is polynomially
reducible to problem $P_0$ if, given an instance $I(P)$ of
problem $P$, we can in polynomial time obtain an instance
$I(P_0)$ of problem $P_0$ such that solving $I(P)$ will solve
$I(P_0)$.

Definition 2: Two problems $P_1$ and $P_2$ are called
equivalent if $P_1$ is polynomially reducible to $P_2$ and
$P_2$ is polynomially reducible to $P_1$.

Consider the following three problems:

$$P_1:\ \min f(x) = x^T A x,\quad x \in \{0,1\}^n,\ A \in \mathbb{R}^{n \times n}.$$

$$\tilde{P}_1:\ \min f(x) = x^T A x + c^T x,\quad x \in \{0,1\}^n,\ A \in \mathbb{R}^{n \times n},\ c \in \mathbb{R}^n.$$

$$\hat{P}_1:\ \min f(x) = x^T A x,\quad x \in \{0,1\}^n,\ A \in \mathbb{R}^{n \times n},\ \sum_{i=1}^n x_i = k,\ \text{where } 0 \le k \le n \text{ is a constant}.$$

Define $A$ as an $n \times n$ T-index pair-wise distance matrix,
and let $k$ be the number of selected electrode sites. Problems
$P_1$, $\tilde{P}_1$, and $\hat{P}_1$ can be shown to be all equivalent by
proving that $P_1$ is polynomially reducible to $\tilde{P}_1$, $\tilde{P}_1$
is polynomially reducible to $P_1$, $\hat{P}_1$ is polynomially
reducible to $P_1$, and $P_1$ is polynomially reducible to
$\hat{P}_1$. For more details, see [4], [6].
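As an illustration of the equivalence between the knapsack-constrained and unconstrained forms, the following brute-force sketch folds the knapsack constraint of $\hat{P}_1$ into the objective as a quadratic penalty $\mu(\sum_i x_i - k)^2$. This is a standard device consistent with the reduction above, not the specific branch-and-bound technique of [25]; the penalty weight mu is an assumed illustrative value, and the enumeration is viable only for tiny n.

```python
import itertools
import numpy as np

def solve_p1_hat(A, k):
    """Brute-force the knapsack-constrained quadratic 0-1 problem
    (P1-hat): min x^T A x  s.t.  sum(x) = k,  x in {0,1}^n."""
    n = A.shape[0]
    best_x, best_f = None, np.inf
    for ones in itertools.combinations(range(n), k):
        x = np.zeros(n)
        x[list(ones)] = 1
        f = x @ A @ x
        if f < best_f:
            best_f, best_x = f, x
    return best_x, best_f

def solve_penalized(A, k, mu):
    """Unconstrained form: fold the knapsack constraint into the
    objective as a quadratic penalty mu*(sum(x)-k)^2. For mu large
    enough, the optimal value coincides with the constrained one."""
    n = A.shape[0]
    best_x, best_f = None, np.inf
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits, dtype=float)
        f = x @ A @ x + mu * (x.sum() - k) ** 2
        if f < best_f:
            best_f, best_x = f, x
    return best_x, best_f

# Synthetic symmetric "T-index" distance matrix with zero diagonal.
rng = np.random.default_rng(1)
D = rng.uniform(0, 10, size=(6, 6))
A = (D + D.T) / 2
np.fill_diagonal(A, 0.0)
x1, f1 = solve_p1_hat(A, k=3)
x2, f2 = solve_penalized(A, k=3, mu=100.0)
assert np.isclose(f1, f2)   # same optimal value once mu dominates
```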
B. Feature Selection via Multi-Quadratic Integer Programming (FSMQIP)
FSMQIP is a novel mathematical model for selecting
critical features (electrodes) of the brain network, which
can be modeled as a MQIP problem given by

$$\min\ x^T A x \quad \text{s.t.}\quad \sum_{i=1}^n x_i = k;\quad x^T C x \ge T_\alpha\, k(k-1);\quad x \in \{0,1\}^n,$$

where $A$ is an $n \times n$ matrix of pairwise similarity of
chaos measures before a seizure, $C$ is an $n \times n$ matrix
of pairwise similarity of chaos measures after a seizure,
and $k$ is the pre-determined number of selected electrodes.
This problem has been proved to be NP-hard in [24].
The objective function minimizes the average T-index
distance (similarity) of chaos measures among the critical
electrode sites. The knapsack constraint identifies the
number of critical cortical sites. The quadratic constraint
ensures the divergence of chaos measures among the
critical electrode sites after a seizure. A novel RLT to
reformulate this MQIP problem as a MIP problem was
proposed in [7], which demonstrated the equivalence of
the following two problems:

$$P_2:\ \min_x f(x) = x^T A x,\ \text{s.t.}\ Bx \ge b,\ x^T C x \ge \alpha,\ x \in \{0,1\}^n,\ \text{where } \alpha \text{ is a positive constant}.$$

$$\tilde{P}_2:\ \min_{x,y,s,z} g(s) = e^T s,\ \text{s.t.}\ Ax - y - s = 0,\ Bx \ge b,\ y \le M(e - x),\ Cx - z \ge 0,\ e^T z \ge \alpha,\ z \le \bar{M}x,\ x \in \{0,1\}^n,\ y_i, s_i, z_i \ge 0,\ \text{where } \bar{M} = \|C\|_\infty \text{ and } M = \|A\|_\infty.$$

Proposition 1: $P_2$ is equivalent to $\tilde{P}_2$ if every entry in
the matrices $A$ and $C$ is non-negative.

Proof: It has been shown in [9], [7] that $P_2$ has an
optimal solution $x^0$ iff there exist $y^0, s^0, z^0$ such that
$(x^0, y^0, s^0, z^0)$ is an optimal solution to $\tilde{P}_2$.
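The semantics of the FSMQIP model can be checked on toy data with a brute-force enumeration (feasible only for small n); the matrices A, C and the threshold alpha below are synthetic stand-ins for the pre- and post-seizure T-index matrices.

```python
import itertools
import numpy as np

def fsmqip_bruteforce(A, C, k, alpha):
    """Enumerate x in {0,1}^n with sum(x) = k and x^T C x >= alpha
    (post-seizure divergence), returning the x minimizing x^T A x
    (pre-seizure convergence). Exponential -- illustration only."""
    n = A.shape[0]
    best_x, best_f = None, np.inf
    for ones in itertools.combinations(range(n), k):
        x = np.zeros(n)
        x[list(ones)] = 1
        if x @ C @ x >= alpha and x @ A @ x < best_f:
            best_x, best_f = x, x @ A @ x
    return best_x, best_f

rng = np.random.default_rng(3)
A = rng.uniform(0, 5, (8, 8)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
C = rng.uniform(0, 5, (8, 8)); C = (C + C.T) / 2; np.fill_diagonal(C, 0)
print(fsmqip_bruteforce(A, C, k=3, alpha=10.0))
```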
C. Feature Selection via Maximum Clique (FSMC)
FSMC is a novel mathematical model based on graph
theory for selecting critical features (electrodes) of the
brain network [9]. The brain connectivity can be rigorously
modeled as a brain graph as follows: consider a
brain network of electrodes as a weighted graph, where
each node represents an electrode and the weights of edges
between nodes represent T-statistical distances of chaos
measures between electrodes. Three possible weighted
graphs are proposed: GRAPH-I is the complete
graph (the graph with all possible edges); GRAPH-II is
the graph induced from the complete one by
deleting edges whose T-index before a seizure is greater
than the T-test confidence level; GRAPH-III is
the graph induced from the complete one by deleting edges
whose T-index before a seizure is greater than the T-test
confidence level or whose T-index after a seizure is less than the
T-test confidence level. Maximum cliques of these graphs will be
investigated, the hypothesis being that a group of physiologically
connected electrodes constitutes the critical largest
connected network of seizure evolution and pathway. The
Maximum Clique Problem (MCP) is NP-hard [26]; therefore,
solving MCPs is not an easy task. Nevertheless,
the RLT in [7] provides a very compact formulation
of the maximum clique problem (MCP). This compact
formulation has theoretical and computational advantages
over traditional formulations, as well as providing tighter
relaxation bounds.

Consider a maximum clique problem defined as follows.
Let $G = G(V, E)$ be an undirected graph where $V = \{1, \dots, n\}$
is the set of vertices (nodes) and $E$ denotes
the set of edges. Assume that there are no parallel edges
(and no self-loops joining the same vertex) in $G$. Denote
an edge joining vertices $i$ and $j$ by $(i, j)$.

Definition 3: A clique of $G$ is a subset $C$ of vertices
with the property that every pair of vertices in $C$ is
connected by an edge; that is, $C$ is a clique if the subgraph
$G(C)$ induced by $C$ is complete.

Definition 4: The maximum clique problem is the problem
of finding a clique set $C$ of maximal cardinality (size)
$|C|$.

The maximum clique problem can be represented in many
equivalent formulations (e.g., an integer programming
problem, a continuous global optimization problem, and
an indefinite quadratic programming problem) [22]. Consider the
following indefinite quadratic programming formulation of
the MCP. Let $A_G = (a_{ij})_{n \times n}$ be the adjacency matrix of $G$
defined by

$$a_{ij} = \begin{cases} 1 & \text{if } (i,j) \in E \\ 0 & \text{if } (i,j) \notin E. \end{cases}$$

The matrix $A_G$ is symmetric and all its eigenvalues are real
numbers. Generally, $A_G$ has positive and negative (and
possibly zero) eigenvalues, and the sum of its eigenvalues is
zero since the main diagonal entries are zero [15]. Consider
the following indefinite QIP problem and MIP problem for
the MCP:
$$P_3:\ \max\ f_G(x) = \sum_{(i,j) \in E} x_i x_j - \tfrac{1}{2} x^T x = \tfrac{1}{2} x^T A x,\ \text{s.t.}\ x \in \{0,1\}^n,$$

where $A = A_G - I$ and $A_G$ is the adjacency matrix of the graph $G$.

$$\tilde{P}_3:\ \min \sum_{i=1}^n s_i,\ \text{s.t.}\ \sum_{j=1}^n a_{ij} x_j - s_i - y_i = 0,\ y_i - M(1 - x_i) \le 0,$$

where $x_i \in \{0,1\}$, $s_i, y_i \ge 0$, and $M = \max_i \sum_{j=1}^n |a_{ij}| = \|A\|_\infty$.

Proposition 2: $P_3$ is equivalent to $\tilde{P}_3$. If $x^*$ solves the
problems $P_3$ and $\tilde{P}_3$, then the set $C$ defined by $C = \{i : x_i^* = 1\}$
is a maximum clique of graph $G$ with $|C| = f_G(x^*)$.

Proof: It has been shown in [9], [7] that $P_3$ has
an optimal solution $x^0$ iff there exist $y^0, s^0$ such that
$(x^0, y^0, s^0)$ is an optimal solution to $\tilde{P}_3$.
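For intuition, here is a small sketch that builds a GRAPH-II-style graph by thresholding a synthetic pre-seizure T-index matrix at an assumed t-test critical value, and extracts the maximum clique by brute force; the RLT-based MIP of [7] would replace this enumeration in practice.

```python
import itertools
import numpy as np

def max_clique(adj):
    """Brute-force maximum clique of a graph given by a boolean
    adjacency matrix. Exponential -- for illustration only, since
    the MCP is NP-hard [26]."""
    n = adj.shape[0]
    for size in range(n, 0, -1):
        for cand in itertools.combinations(range(n), size):
            if all(adj[i, j] for i, j in itertools.combinations(cand, 2)):
                return set(cand)
    return set()

# GRAPH-II sketch: keep an edge only when the pre-seizure T-index is
# below the t-test critical value (i.e., the two sites are entrained).
rng = np.random.default_rng(2)
T_pre = rng.uniform(0, 6, size=(8, 8)); T_pre = (T_pre + T_pre.T) / 2
t_crit = 2.662            # assumed critical value (e.g., ~99%, 60 d.o.f.)
adj = (T_pre < t_crit)
np.fill_diagonal(adj, False)
print(max_clique(adj))
```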
IV. Clustering Techniques
The neurons in the cerebral cortex maintain thousands
of input and output connections with other groups of neurons,
which form a dense network of connectivity spanning
the entire thalamocortical system. Despite this massive
connectivity, cortical networks are exceedingly sparse with
respect to the number of connections present out of all
possible connections. This indicates that brain networks are
not random but form highly specific patterns. Networks in
the brain can be analyzed at multiple levels of scale. Novel
clustering techniques are herein proposed to construct the
temporal and spatial mechanistic basis of the epileptogenic
models based on the brain dynamics of EEGs and to capture
the patterns or hierarchical structure of the brain connectivity
from statistical dependence among brain areas. The
proposed hierarchical clustering techniques, which do not
require a priori knowledge of the data (number of clusters),
include Clustering via Concave Quadratic Programming
and Clustering via MIP with Quadratic Constraint.
A. Clustering via Concave Quadratic Programming (CCQP)
CCQP is a novel clustering mathematical model used
to formulate a clustering problem as a QIP problem [9].
Given $n$ data points to be clustered, we can formulate
a clustering problem as follows:

$$\min_x f(x) = x^T (A - \alpha I)\, x,\quad \text{s.t.}\ x \in \{0,1\}^n,$$

where $A$ is an $n \times n$ Euclidean matrix of pairwise
distances, $I$ is the identity matrix, $\alpha$ is a parameter
adjusting the degree of similarity within a cluster, and $x_i$ is
a 0-1 decision variable indicating whether or not point
$i$ is selected to be in the cluster. Note that $-\alpha I$ is an
offset added to the objective function to avoid the
optimal solution in which all $x_i$ are zero; this would
otherwise happen because every entry $a_{ij}$ of the Euclidean matrix $A$ is positive
and its diagonal is zero. Although this clustering problem
is formulated as a large QIP problem, in instances
where $\alpha$ is large enough to make the quadratic function
concave, the problem can be converted
to a continuous problem (minimizing a concave quadratic
function over a sphere) [9]. The reduction to a continuous
problem is the main advantage of CCQP. This property
holds because a concave function $f : S \to \mathbb{R}$
over a compact convex set $S \subset \mathbb{R}^n$ attains its global
minimum at one of the extreme points of $S$ [15]. Two
equivalent forms of CCQP problems are given by:
$$P_4:\ \min_x f(x) = x^T A x,\ \text{s.t.}\ x \in \{0,1\}^n,\ \text{where } A \text{ is an } n \times n \text{ Euclidean matrix}.$$

$$\tilde{P}_4:\ \min_x \tilde{f}(x) = x^T \tilde{A} x,\ \text{s.t.}\ 0 \le x \le e,\ \text{where } \tilde{A} = A + \lambda I,\ \lambda \text{ is any real number, and } I \text{ is a diagonal matrix}.$$

Proposition 3: $P_4$ is equivalent to $\tilde{P}_4$.
Proof: We will demonstrate that $P_4$ has an optimal
solution $x^0$ iff $x^0$ is an optimal solution to $\tilde{P}_4$ as follows.
If we choose $\lambda$ such that $\tilde{A} = A + \lambda I$ becomes a negative
semidefinite matrix (e.g., $\lambda = -\Lambda$, where $\Lambda$ is the largest
eigenvalue of $A$), then the objective function $\tilde{f}(x)$ becomes
concave and the constraints can be replaced by $0 \le x \le e$.
Thus, the discrete problem $P_4$ is equivalent to the continuous
problem $\tilde{P}_4$ [9].
One of the advantages of CCQP is the ability to systematically
determine the optimal number of clusters. Although
CCQP has to solve m clustering problems iteratively
(where m is the final number of clusters at the termination
of the CCQP algorithm), it is efficient enough to solve large-scale
clustering problems because only one continuous
problem is solved in each iteration, and after each iteration
the problem size becomes significantly smaller [9].
Figure 1 presents the procedure of CCQP.
CCQP
Input: All n unassigned data points in set S
Output: The number of clusters and the cluster assignment for all n data points
WHILE S ≠ ∅ DO
- Construct a Euclidean matrix A from pair-wise distances of data points in S
- Solve CCQP in problem P̃₄
- IF optimal solution xᵢ = 1 THEN remove point i from set S
Fig. 1. Procedure of CCQP algorithm
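The following toy re-implementation of the Fig. 1 loop replaces the continuous concave-QP solve of P̃₄ with brute-force enumeration of P₄ over {0,1}ⁿ, so it is only viable for a handful of points; the parameter alpha plays the role of the similarity-offset parameter α above.

```python
import itertools
import numpy as np
from scipy.spatial.distance import cdist

def ccqp_clusters(points, alpha):
    """Toy CCQP loop (Fig. 1): repeatedly extract one cluster by
    minimizing x^T (A - alpha*I) x over x in {0,1}^n, where A holds
    pairwise Euclidean distances. Brute force stands in for the
    continuous concave-QP solve of problem P4-tilde."""
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        P = points[remaining]
        A = cdist(P, P) - alpha * np.eye(len(remaining))
        best_x, best_f = None, 0.0           # x = 0 gives f = 0
        for bits in itertools.product([0, 1], repeat=len(remaining)):
            x = np.array(bits, dtype=float)
            f = x @ A @ x
            if f < best_f:
                best_f, best_x = f, x
        if best_x is None:                   # no negative-value cluster left
            clusters.append(remaining)
            break
        chosen = [remaining[i] for i in range(len(remaining)) if best_x[i]]
        clusters.append(chosen)
        remaining = [i for i in remaining if i not in chosen]
    return clusters

pts = np.array([[0, 0], [0.2, 0.1], [5, 5], [5.1, 4.9], [0.1, 0.2]])
print(ccqp_clusters(pts, alpha=1.0))   # tight groups come out as clusters
```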
B. Clustering via MIP with Quadratic Constraint (CMIPQC)
CMIPQC is a novel clustering mathematical model
in which a clustering problem is formulated as a
mixed-integer programming problem with a quadratic constraint
[9]. The goal of CMIPQC is to maximize the number
of data points in a cluster such that the similarity
degrees among the data points in the cluster are less
than a pre-determined parameter $\alpha$. This technique can
be incorporated with hierarchical clustering methods as
follows: (a) Initialization: assign all data points to one
cluster; (b) Partition: use CMIPQC to divide the big
cluster into smaller clusters; (c) Repetition: repeat the
partition process until the stopping criterion is reached
or a cluster contains a single point. The novel mathematical
formulation for CMIPQC is given by

$$\max_x \sum_{i=1}^n x_i \quad \text{s.t.}\quad x^T C x \le \alpha,\ x \in \{0,1\}^n,$$

where $n$ is the number of data
points to be clustered, $C$ is an $n \times n$ Euclidean matrix of
pairwise distances, $\alpha$ is a predetermined parameter for the
similarity degree within each cluster, and $x_i$ is a 0-1 decision
variable indicating whether or not point $i$ is selected to be
in the cluster. The objective of this model is to maximize the
number of data points in a cluster such that the
average pairwise distance among those points is less
than $\alpha$. The difficulty of this problem comes from the
quadratic constraint; however, this quadratic constraint can
be efficiently linearized by the RLT described in [7]. The
CMIPQC problem is then much easier to solve, as it can be
reduced to an equivalent MIP problem. Similar to CCQP,
the CMIPQC algorithm can systematically
determine the optimal number of clusters and only needs
to solve m MIP problems (see Figure 2 for the CMIPQC
algorithm). Two equivalent forms of CMIPQC are given
by:
$$P_5:\ \max_x f(x) = \sum_{i=1}^n x_i,\ \text{s.t.}\ x^T C x \le \alpha,\ x \in \{0,1\}^n.$$

$$\tilde{P}_5:\ \max_{x,z} \tilde{f}(x, z) = \sum_{i=1}^n x_i,\ \text{s.t.}\ Cx - z \le \bar{M}(e - x),\ e^T z \le \alpha,\ x \in \{0,1\}^n,\ z_i \ge 0,\ \text{where } \bar{M} = \|C\|_\infty.$$

Proposition 4: $P_5$ is equivalent to $\tilde{P}_5$.

Proof: The proof that $P_5$ has an optimal solution $x^0$
iff there exists $z^0$ such that $(x^0, z^0)$ is an optimal solution
to $\tilde{P}_5$ is very similar to the one in [9], [7], since $P_5$ is a
special case of $P_2$.
CMIPQC
Input: All n unassigned data points in set S
Output: The number of clusters and the cluster assignment for all n data points
WHILE S ≠ ∅ DO
- Construct a Euclidean matrix C from pair-wise distances of data points in S
- Solve CMIPQC in problem P̃₅
- IF optimal solution xᵢ = 1 THEN remove point i from set S
Fig. 2. Procedure of CMIPQC algorithm
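Analogously, a brute-force stand-in for one CMIPQC partition step: find the largest subset whose total pairwise distance satisfies the quadratic constraint (the linearized MIP P̃₅ would replace the enumeration in practice); alpha is an assumed similarity threshold.

```python
import itertools
import numpy as np
from scipy.spatial.distance import cdist

def cmipqc_split(points, alpha):
    """Toy CMIPQC step: return the largest subset of point indices
    whose total pairwise distance x^T C x stays below alpha."""
    n = len(points)
    C = cdist(points, points)
    for size in range(n, 0, -1):
        for cand in itertools.combinations(range(n), size):
            x = np.zeros(n)
            x[list(cand)] = 1
            if x @ C @ x <= alpha:
                return list(cand)
    return []

pts = np.array([[0, 0], [0.1, 0.1], [4, 4], [4.2, 3.9]])
print(cmipqc_split(pts, alpha=0.5))   # picks one tight pair
```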
V. Materials and Methods
The data used in our studies consist of continuous
intracranial EEGs from 3 patients with temporal lobe
epilepsy. FSQIP was previously used to demonstrate the
predictability of epileptic seizures [4]. In this research, we
extend our previous findings of seizure predictability
by using FSMQIP to select the critical cortical sites. The
FSMQIP problem is formulated as a MQIP problem with
an objective function to minimize the average T-index (a
measure of statistical distance between the mean values of
$STL_{max}$) among electrode sites, a knapsack constraint
to identify the number of critical cortical sites [18], and an
additional quadratic constraint to ensure that the optimal
group of critical sites shows divergence in $STL_{max}$
profiles after a seizure. The experiment in this study
tests the hypothesis that FSMQIP can be used to
select critical features (electrodes) that are most likely to
manifest precursor patterns prior to a seizure. The results
of this study will demonstrate that if one can select critical
electrodes that will manifest seizure precursors, it may
be possible to predict a seizure in time to warn of an
impending seizure [6]. To test this hypothesis, we designed
an experiment to compare the probability of detecting
seizure precursor patterns from critical electrodes selected
by FSMQIP with that from randomly selected electrodes.
In this experiment, testing on 3 patients with 20 seizures,
we randomly selected 5,000 groups of electrodes and used
FSMQIP to select the critical electrodes. The experiment
in this study is conducted in the following steps:
1) The estimation of $STL_{max}$ profiles [2], [19], [23],
[30], [31] is used to measure the degree of order or
disorder (chaos) of the EEG signals.
2) FSMQIP selects the critical electrodes based upon the
behavior of $STL_{max}$ profiles before and after each
preceding seizure.
3) A seizure precursor is detected when the
brain dynamics from the critical electrodes manifest a
pattern of transitional convergence in the similarity
degree of chaos. This pattern can be viewed as a
synchronization of the brain dynamics from the critical
electrodes; a minimal sketch of this detection rule follows.
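The sketch below flags a precursor when the average pairwise T-index among the selected critical electrodes drops below a t-test critical value; the threshold 2.662 and the zero-diagonal T-index matrices are assumptions for illustration.

```python
import numpy as np

def precursor_flags(T_windows, critical_sites, t_crit=2.662):
    """T_windows: array (n_windows, n_el, n_el) of pair-wise T-index
    matrices over consecutive windows (zero diagonal assumed).
    Returns a boolean per window: True when the selected sites are
    entrained (a possible seizure precursor)."""
    sel = np.ix_(critical_sites, critical_sites)
    k = len(critical_sites)
    # mean of off-diagonal entries within the selected block
    avg = np.array([Tw[sel].sum() / (k * (k - 1)) for Tw in T_windows])
    return avg < t_crit
```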
VI. Results
The results show that the probability of detecting
seizure precursor patterns from the critical electrodes
selected by FSMQIP is approximately 83%, which is
significantly better than that from randomly selected electrodes
(p-value < 0.07). The histogram of the probability
of detecting seizure precursor patterns from randomly selected
electrodes and from the critical electrodes
is illustrated in Figure 3. The results of this study can be
used as a criterion to pre-select the critical electrode sites
that can be used to predict epileptic seizures.
Fig. 3. Histogram of seizure prediction sensitivities based on randomly selected electrodes versus electrodes selected by the proposed feature selection technique
VII. Conclusions
This paper proposes a theoretical foundation of optimization
techniques for feature selection and clustering
with an application in epilepsy research. Empirical investigations
of the proposed feature selection techniques
demonstrate their effectiveness in selecting the critical brain
areas associated with the epileptogenic process. Thus, advances in feature
selection and clustering techniques will result in the future
development of a novel DM paradigm to predict impending
seizures from multichannel EEG recordings. Prediction is
possible because, for the vast majority of seizures, the
spatio-temporal dynamical features of seizure precursors
are sufficiently similar to those of the preceding seizure.
Mathematical formulations for novel clustering techniques
are also proposed in this paper. These techniques are theoretically
fast and scalable. The results from this preliminary
research suggest that empirical studies of the proposed
clustering techniques should be investigated in future
research.
References
[1] W. Adams and H. Sherali, "Linearization strategies for a class of zero-one mixed integer programming problems," Operations Research, vol. 38, pp. 217-226, 1990.
[2] A. Babloyantz and A. Destexhe, "Low dimensional chaos in an instance of epilepsy," Proc. Natl. Acad. Sci. USA, vol. 83, pp. 3513-3517, 1986.
[3] P. Bradley, O. Mangasarian, and W. Street, "Clustering via concave minimization," in Advances in Neural Information Processing Systems, M. Mozer, M. Jordan, and T. Petsche, Eds. MIT Press, 1997.
[4] W. Chaovalitwongse, "Optimization and dynamical approaches in nonlinear time series analysis with applications in bioengineering," Ph.D. dissertation, University of Florida, 2003.
[5] W. Chaovalitwongse, L. Iasemidis, P. Pardalos, P. Carney, D.-S. Shiau, and J. Sackellares, "Performance of a seizure warning algorithm based on the dynamics of intracranial EEG," Epilepsy Research, vol. 64, pp. 93-133, 2005.
[6] W. Chaovalitwongse, P. Pardalos, L. Iasemidis, D.-S. Shiau, and J. Sackellares, "Applications of global optimization and dynamical systems to prediction of epileptic seizures," in Quantitative Neuroscience, P. Pardalos, J. Sackellares, L. Iasemidis, and P. Carney, Eds. Kluwer, 2003, pp. 1-36.
[7] W. Chaovalitwongse, P. Pardalos, and O. Prokopyev, "Reduction of multi-quadratic 0-1 programming problems to linear mixed 0-1 programming problems," Operations Research Letters, vol. 32(6), pp. 517-522, 2004.
[8] W. Chaovalitwongse, O. Prokopyev, and P. Pardalos, "Electroencephalogram (EEG) time series classification: Applications in epilepsy," Annals of Operations Research, to appear, 2005.
[9] W. A. Chaovalitwongse, "A robust clustering technique via quadratic programming," Department of Industrial and Systems Engineering, Rutgers University, Tech. Rep., 2005.
[10] W. A. Chaovalitwongse, P. Pardalos, L. Iasemidis, D.-S. Shiau, and J. Sackellares, "Dynamical approaches and multi-quadratic integer programming for seizure prediction," Optimization Methods and Software, vol. 20(2-3), pp. 383-394, 2005.
[11] W. Chaovalitwongse, P. Pardalos, L. Iasemidis, J. Sackellares, and D.-S. Shiau, "Optimization of spatio-temporal pattern processing for seizure warning and prediction," U.S. Patent application filed August 2004, Attorney Docket No. 028724150, 2004.
[12] C. Cherniak, Z. Mokhtarzada, and U. Nodelman, "Optimal-wiring models of neuroanatomy," in Computational Neuroanatomy, G. A. Ascoli, Ed. Humana Press, 2002.
[13] C. Elger and K. Lehnertz, "Seizure prediction by non-linear time series analysis of brain electrical activity," European Journal of Neuroscience, vol. 10, pp. 786-789, 1998.
[14] F. Glover, "Improved linear integer programming formulations of nonlinear integer programs," Management Science, vol. 22, pp. 455-460, 1975.
[15] R. Horst, P. Pardalos, and N. Thoai, Introduction to Global Optimization. Kluwer Academic Publishers, 1995.
[16] L. Iasemidis, P. Pardalos, D.-S. Shiau, W. Chaovalitwongse, K. Narayanan, A. Prasad, K. Tsakalis, P. Carney, and J. Sackellares, "Long term prospective on-line real-time seizure prediction," Journal of Clinical Neurophysiology, vol. 116(3), pp. 532-544, 2005.
[17] L. Iasemidis, D.-S. Shiau, W. Chaovalitwongse, J. Sackellares, P. Pardalos, P. Carney, J. Principe, A. Prasad, B. Veeramani, and K. Tsakalis, "Adaptive epileptic seizure prediction system," IEEE Transactions on Biomedical Engineering, vol. 50(5), pp. 616-627, 2003.
[18] L. Iasemidis, D.-S. Shiau, J. Sackellares, and P. Pardalos, "Transition to epileptic seizures: Optimization," in DIMACS Series in Discrete Mathematics and Theoretical Computer Science, D. Du, P. Pardalos, and J. Wang, Eds. American Mathematical Society, 1999, pp. 55-74.
[19] L. Iasemidis, H. Zaveri, J. Sackellares, and W. Williams, "Phase space analysis of EEG in temporal lobe epilepsy," in IEEE Eng. in Medicine and Biology Society, 10th Ann. Int. Conf., 1988, pp. 1201-1203.
[20] B. Litt, R. Esteller, J. Echauz, D. Maryann, R. Shor, T. Henry, P. Pennell, C. Epstein, R. Bakay, M. Dichter, and G. Vachtservanos, "Epileptic seizures may begin hours in advance of clinical onset: A report of five patients," Neuron, vol. 30, pp. 51-64, 2001.
[21] F. Mormann, T. Kreuz, C. Rieke, R. Andrzejak, A. Kraskov, P. David, C. Elger, and K. Lehnertz, "On the predictability of epileptic seizures," Journal of Clinical Neurophysiology, vol. 116(3), pp. 569-587, 2005.
[22] T. Motzkin and E. Strauss, "Maxima for graphs and a new proof of a theorem of Turán," Canadian Journal of Mathematics, vol. 17, pp. 533-540, 1965.
[23] N. Packard, J. Crutchfield, and J. Farmer, "Geometry from a time series," Phys. Rev. Lett., vol. 45, pp. 712-716, 1980.
[24] P. Pardalos, W. Chaovalitwongse, L. Iasemidis, J. Sackellares, D.-S. Shiau, P. Carney, O. Prokopyev, and V. Yatsenko, "Seizure warning algorithm based on spatiotemporal dynamics of intracranial EEG," Mathematical Programming, vol. 101(2), pp. 365-385, 2004.
[25] P. Pardalos and G. Rodgers, "Computational aspects of a branch and bound algorithm for quadratic zero-one programming," Computing, vol. 45, pp. 131-144, 1990.
[26] P. Pardalos and J. Xue, "The maximum clique problem," Journal of Global Optimization, vol. 4, pp. 301-328, 1992.
[27] P. Pardalos, V. Yatsenko, J. Sackellares, D.-S. Shiau, W. Chaovalitwongse, and L. Iasemidis, "Analysis of EEG data using optimization, statistics, and dynamical system techniques," Computational Statistics & Data Analysis, vol. 44(1-2), pp. 391-408, 2003.
[28] O. Prokopyev, V. Boginski, W. Chaovalitwongse, P. Pardalos, J. Sackellares, and P. Carney, "Network-based techniques in EEG data analysis and epileptic brain modeling," in Data Mining in Biomedicine, P. Pardalos and A. Vazacopoulos, Eds. Springer, 2005, to appear.
[29] M. Le Van Quyen, J. Martinerie, M. Baulac, and F. Varela, "Anticipating epileptic seizures in real time by non-linear analysis of similarity between EEG recordings," NeuroReport, vol. 10, pp. 2149-2155, 1999.
[30] P. Rapp, I. Zimmerman, and A. M. Albano, "Experimental studies of chaotic neural behavior: cellular activity and electroencephalographic signals," in Nonlinear Oscillations in Biology and Chemistry, H. Othmer, Ed. Springer-Verlag, 1986, pp. 175-205.
[31] F. Takens, "Detecting strange attractors in turbulence," in Dynamical Systems and Turbulence, Lecture Notes in Mathematics, D. Rand and L. Young, Eds. Springer-Verlag, 1981.
Fuzzy Support Vector Classification Based on Possibility Theory
Zhimin Yang (1), Yingjie Tian (2), Naiyang Deng (3)
(1) College of Economics & Management, China Agriculture University, 100083, Beijing, China
(2) Chinese Academy of Sciences Research Center on Data Technology & Knowledge Economy, 100080, Beijing, China
(3) College of Science, China Agriculture University, 100083, Beijing, China
Abstract
This paper is concerned with fuzzy support vector classification in which the outputs of the training points and the values of the final fuzzy classification function are triangle fuzzy numbers. First, the fuzzy classification problem is formulated as a fuzzy chance constrained programming. Then we transform this programming into its equivalent quadratic programming. As a result, we propose a fuzzy support vector classification algorithm. In order to show the rationality of the algorithm, an example is presented.
Keywords: machine learning, fuzzy support vector classification, possibility measure, triangle fuzzy number
1. INTRODUCTION
Support vector machines (SVMs), proposed by Vapnik, are a powerful tool for machine learning (Vapnik 1995; Vapnik 1998; Cristianini 2000; Mangasarian 1999; Deng 2004) and one of the most interesting topics in this field. Lin and Wang (Lin, 2002) investigated a classification problem with fuzzy information, where the training set is $S = \{(x_1, \tilde{y}_1), \dots, (x_l, \tilde{y}_l)\}$ with outputs $\tilde{y}_j$ ($j = 1, \dots, l$) being fuzzy numbers. This paper studies this problem in a different way. We formulate it as a fuzzy chance constrained programming, and then transform this programming into its equivalent quadratic programming.
We assume that the training points contain complete fuzzy information, i.e. the sum of the positive membership degree and the negative membership degree of each output is 1, and propose a fuzzy support vector classification algorithm. Given an arbitrary test input, the corresponding output obtained by the algorithm is a triangle fuzzy number.
2. FUZZY SUPPORT VECTOR CLASSIFICATION MACHINE
As an extension of the positive symbol 1 and the negative symbol -1, we introduce triangle fuzzy numbers and define the corresponding output by a triangle fuzzy number. For an input of a training point which belongs to the positive class with membership degree $\sigma^+$ ($0.5 \le \sigma^+ \le 1$), the triangle fuzzy number is

$$\tilde{y} = (r_1, r_2, r_3) = \left(\frac{2(\sigma^+)^2 + \sigma^+ - 2}{\sigma^+},\ 2\sigma^+ - 1,\ \frac{2(\sigma^+)^2 - 3\sigma^+ + 2}{\sigma^+}\right),\quad 0.5 \le \sigma^+ \le 1. \qquad (1)$$

Similarly, for an input of a training point which belongs to the negative class with membership degree $\sigma^-$ ($0.5 \le \sigma^- \le 1$), the triangle fuzzy number is

$$\tilde{y} = (r_1, r_2, r_3) = \left(-\frac{2(\sigma^-)^2 - 3\sigma^- + 2}{\sigma^-},\ -(2\sigma^- - 1),\ -\frac{2(\sigma^-)^2 + \sigma^- - 2}{\sigma^-}\right),\quad 0.5 \le \sigma^- \le 1. \qquad (2)$$

Thus we use $(x, \tilde{y})$ to express a training point, where $\tilde{y}$ is a triangle fuzzy number of form (1) or (2). We can also use $(x, \sigma)$ to express a training point, where

$$\sigma = \sigma^+ \text{ for a positive-class point, and } \sigma = -\sigma^- \text{ for a negative-class point.} \qquad (3)$$
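A small sketch, assuming the endpoint formulas in (1) and (2) take the fractional form shown above (inferred from the worked examples in this paper), mapping a signed membership degree σ of (3) to its triangle fuzzy output; the printed checks match ỹ₃ of Section 3 and ỹ₁ of the example below.

```python
def triangle_fuzzy_output(sigma):
    """Triangle fuzzy number (r1, r2, r3) for a training point with
    signed membership degree sigma as in eq. (3): sigma in (0.5, 1]
    for the positive class, sigma in [-1, -0.5) for the negative class."""
    s = abs(sigma)
    r1 = (2 * s**2 + s - 2) / s       # left endpoint, formula (1)
    r2 = 2 * s - 1                    # peak
    r3 = (2 * s**2 - 3 * s + 2) / s   # right endpoint
    if sigma > 0:
        return (r1, r2, r3)
    return (-r3, -r2, -r1)            # mirrored for the negative class (2)

print(triangle_fuzzy_output(0.8))     # -> (0.1, 0.6, 1.1), cf. y~3 in Sec. 3
print(triangle_fuzzy_output(-0.51))   # approx (-1.94, -0.02, 1.90)
```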
The given training set for classification is

$$S = \{(x_1, \tilde{y}_1), \dots, (x_l, \tilde{y}_l)\}, \qquad (4)$$

where $x_j \in \mathbb{R}^n$ is a usual input and $\tilde{y}_j$ ($j = 1, \dots, l$) is a triangle fuzzy number of form (1) or (2). According to (1), (2) and (3), the training set (4) can take another form

$$S_\sigma = \{(x_1, \sigma_1), \dots, (x_l, \sigma_l)\}, \qquad (5)$$

where $x_j$ is the same as in (4), while $\sigma_j$ is as in (3), $j = 1, \dots, l$.

Definition 1: $(x_j, \tilde{y}_j)$ in (4) and $(x_j, \sigma_j)$ in (5) are called fuzzy training points, $j = 1, \dots, l$, and $S$ and $S_\sigma$ are called fuzzy training sets.
Definition 2: A fuzzy training point $(x_j, \tilde{y}_j)$ or $(x_j, \sigma_j)$ is called a fuzzy positive point if it corresponds to (1); similarly, a fuzzy training point $(x_j, \tilde{y}_j)$ or $(x_j, \sigma_j)$ is called a fuzzy negative point if it corresponds to (2).

Note: In this paper, the case $\sigma_j^+ = 0.5$ or $\sigma_j^- = 0.5$ is omitted, because the corresponding triangle fuzzy number $\tilde{y}_j = (-2, 0, 2)$ cannot provide any information.
We rearrange the fuzzy training points in fuzzy training set (4) or (5), such that the new fuzzy training set

$$S = \{(x_1, \tilde{y}_1), \dots, (x_p, \tilde{y}_p), (x_{p+1}, \tilde{y}_{p+1}), \dots, (x_l, \tilde{y}_l)\} \qquad (6)$$

or

$$S_\sigma = \{(x_1, \sigma_1), \dots, (x_p, \sigma_p), (x_{p+1}, \sigma_{p+1}), \dots, (x_l, \sigma_l)\} \qquad (7)$$

has the following property: $(x_t, \tilde{y}_t)$ and $(x_t, \sigma_t)$ are fuzzy positive points ($t = 1, \dots, p$), while $(x_i, \tilde{y}_i)$ and $(x_i, \sigma_i)$ are fuzzy negative points ($i = p+1, \dots, l$).
Definition 3: Suppose a fuzzy training set (6), or equivalently (7), and a confidence level $\lambda$ ($0 < \lambda \le 1$). If there exist $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ such that

$$Pos\{\tilde{y}_j((w \cdot x_j) + b) \ge 1\} \ge \lambda,\quad j = 1, \dots, l, \qquad (8)$$

then the fuzzy training set (6) or (7) is fuzzy linearly separable; moreover, the corresponding fuzzy classification problem is fuzzy linearly separable.

Note: 1) Fuzzy linear separability can be understood, roughly speaking, as meaning that the inputs of fuzzy positive points and fuzzy negative points can be separated at least with possibility degree $\lambda$ ($0 < \lambda \le 1$).

2) Fuzzy linear separability is a generalization of the linear separability of a usual training set. In fact, if $\sigma_t = 1$ ($t = 1, \dots, p$) and $\sigma_i = -1$ ($i = p+1, \dots, l$) in training set (7), the fuzzy training set degenerates to a usual training set, so fuzzy linear separability of a fuzzy training set degenerates to linear separability of a usual training set. Suppose $\sigma_t < 1$ ($t = 1, \dots, p$) or $\sigma_i > -1$ ($i = p+1, \dots, l$); then it is possible that, on one hand, $x_1, \dots, x_p$ and $x_{p+1}, \dots, x_l$ are not linearly separable in the usual meaning while, on the other hand, they are fuzzy linearly separable. For example, consider the case shown in the following figure:
[Figure: three training points on the real line: x₃ = -1 with σ₃⁻ = 1 (y₃ = -1), x₂ = 1 with σ₂⁺ = 1 (y₂ = 1), and x₁ = 2 with fuzzy negative membership degree σ₁⁻.]
Suppose there are three fuzzy training points $(x_1, \tilde{y}_1)$, $(x_2, y_2)$ and $(x_3, y_3)$. The fuzzy training points $(x_2, y_2)$ and $(x_3, y_3)$ are certain, with $\sigma_2^+ = 1$ ($y_2 = 1$) and $\sigma_3^- = 1$ ($y_3 = -1$). The first fuzzy training point $(x_1, \tilde{y}_1)$ is fuzzy, with two possible negative membership degrees $\sigma_1^- = 0.51$ and $\sigma_1^- = 0.6$.

1) $\sigma_1^- = 0.51$. According to (2), the triangle fuzzy number of $(x_1, \sigma_1^-)$ is $\tilde{y}_1 = (-1.94, -0.02, 1.9)$, so the fuzzy training set is $S = \{(x_1, \tilde{y}_1), (x_2, y_2), (x_3, y_3)\}$. Suppose $\lambda = 0.72$ and the classification hyperplane is $x = 0$; then $(w \cdot x_1) + b = 2$, so $Pos\{\tilde{y}_1((w \cdot x_1) + b) \ge 1\} = 0.722 \ge 0.72$, and moreover $Pos\{y_2((w \cdot x_2) + b) \ge 1\} = 1 \ge 0.72$ and $Pos\{y_3((w \cdot x_3) + b) \ge 1\} = 1 \ge 0.72$. Therefore the fuzzy training set $S$ is fuzzy linearly separable at the confidence level $\lambda = 0.72$.

2) $\sigma_1^- = 0.6$. According to (2), the triangle fuzzy number of $(x_1, \sigma_1^-)$ is $\tilde{y}_1 = (-1.53, -0.2, 1.13)$, so the fuzzy training set is $S = \{(x_1, \tilde{y}_1), (x_2, y_2), (x_3, y_3)\}$. Suppose $\lambda = 0.47$ and the classification hyperplane is $x = 0$; then $(w \cdot x_1) + b = 2$, so $Pos\{\tilde{y}_1((w \cdot x_1) + b) \ge 1\} = 0.47 \ge 0.47$, and moreover $Pos\{y_2((w \cdot x_2) + b) \ge 1\} = 1 \ge 0.47$ and $Pos\{y_3((w \cdot x_3) + b) \ge 1\} = 1 \ge 0.47$. Therefore the fuzzy training set $S$ is fuzzy linearly separable at the confidence level $\lambda = 0.47$. Supposing $\lambda = 0.72$, however, we will find no classification hyperplane such that

$$Pos\{\tilde{y}_1((w \cdot x_1) + b) \ge 1\} \ge 0.72. \qquad (9)$$

So the fuzzy training set $S$ is not fuzzy linearly separable at the confidence level $\lambda = 0.72$.
Generally speaking, a possibility measure inequality for a fuzzy event can be equivalently transformed into real inequalities, as shown in (10).

Theorem 1. (8) in Definition 3 is equivalent to the real inequalities shown in (10):

$$\begin{cases} ((1-\lambda)r_{t3} + \lambda r_{t2})((w \cdot x_t) + b) \ge 1, & t = 1, \dots, p, \\ ((1-\lambda)r_{i1} + \lambda r_{i2})((w \cdot x_i) + b) \ge 1, & i = p+1, \dots, l. \end{cases} \qquad (10)$$

Proof: $\tilde{y}_j = (r_{j1}, r_{j2}, r_{j3})$ is a triangle fuzzy number, so $\tilde{y}_j((w \cdot x_j) + b)$ is also a triangle fuzzy number by the operation rules for triangle fuzzy numbers. More concretely, if $(w \cdot x_t) + b > 0$, then

$$\tilde{y}_t((w \cdot x_t) + b) = \bigl(r_{t1}((w \cdot x_t) + b),\ r_{t2}((w \cdot x_t) + b),\ r_{t3}((w \cdot x_t) + b)\bigr),\quad t = 1, \dots, p.$$

For a triangle fuzzy number $\tilde{a} = (r_1, r_2, r_3)$ and an arbitrary given confidence level $\lambda$ ($0 < \lambda \le 1$), we have

$$Pos\{\tilde{a} \le 0\} \ge \lambda \iff (1-\lambda)r_1 + \lambda r_2 \le 0.$$

Therefore, if $(w \cdot x_t) + b > 0$, then

$$Pos\{1 - \tilde{y}_t((w \cdot x_t) + b) \le 0\} \ge \lambda \iff (1-\lambda)\bigl(1 - r_{t3}((w \cdot x_t) + b)\bigr) + \lambda\bigl(1 - r_{t2}((w \cdot x_t) + b)\bigr) \le 0,\quad t = 1, \dots, p,$$

or

$$Pos\{\tilde{y}_t((w \cdot x_t) + b) \ge 1\} \ge \lambda \iff ((1-\lambda)r_{t3} + \lambda r_{t2})((w \cdot x_t) + b) \ge 1,\quad t = 1, \dots, p.$$

Similarly, if $(w \cdot x_i) + b < 0$, then

$$Pos\{\tilde{y}_i((w \cdot x_i) + b) \ge 1\} \ge \lambda \iff ((1-\lambda)r_{i1} + \lambda r_{i2})((w \cdot x_i) + b) \ge 1,\quad i = p+1, \dots, l.$$

Therefore (8) in Definition 3 is equivalent to (10).
In (10), suppose

$$k_t = \frac{1}{(1-\lambda)r_{t3} + \lambda r_{t2}},\ t = 1, \dots, p, \qquad l_i = \frac{1}{(1-\lambda)r_{i1} + \lambda r_{i2}},\ i = p+1, \dots, l; \qquad (11)$$

then (10) can be rewritten:

$$\begin{cases} (w \cdot x_t) + b \ge k_t, & t = 1, \dots, p, \\ (w \cdot x_i) + b \le l_i, & i = p+1, \dots, l. \end{cases}$$
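As a quick numeric check of Theorem 1 (under the assumption that the crisp value (w·x)+b is positive), the following sketch computes the possibility Pos{ỹ((w·x)+b) ≥ 1} directly from the triangle membership and compares it with the equivalent real inequality, on the worked example with σ₁⁻ = 0.6 and λ = 0.47.

```python
def pos_geq_one(tfn, theta):
    """Possibility Pos{ y~ * theta >= 1 } for a triangle fuzzy number
    tfn = (r1, r2, r3) scaled by a crisp value theta > 0."""
    r1, r2, r3 = (r * theta for r in tfn)
    if r2 >= 1:
        return 1.0
    if r3 <= 1:
        return 0.0
    return (r3 - 1) / (r3 - r2)   # membership height at the point 1

# Worked example: y~1 = (-1.53, -0.2, 1.13), (w . x1) + b = 2, lambda = 0.47.
y1 = (-1.53, -0.2, 1.13)
lam = 0.47
lhs = pos_geq_one(y1, 2.0)                        # ~0.47
rhs = ((1 - lam) * y1[2] + lam * y1[1]) * 2.0     # >= 1 iff Pos >= lambda
print(lhs >= lam, rhs >= 1)                       # both True
```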
Definition 4: Consider a fuzzy linearly separable problem with fuzzy training set (6) or (7). The two parallel hyperplanes $(w \cdot x) + b = k^+$ and $(w \cdot x) + b = l^-$ are support hyperplanes for fuzzy training set (6) or (7) if:

$$(w \cdot x_t) + b \ge k_t,\ t = 1, \dots, p,\quad \text{with}\ \min_{t=1,\dots,p}\{(w \cdot x_t) + b\} = k^+,$$
$$(w \cdot x_i) + b \le l_i,\ i = p+1, \dots, l,\quad \text{with}\ \max_{i=p+1,\dots,l}\{(w \cdot x_i) + b\} = l^-,$$

where $k_t$ ($t = 1, \dots, p$) and $l_i$ ($i = p+1, \dots, l$) are the same as in (10)-(11), $k^+ = \min_{t=1,\dots,p}\{k_t\}$ and $l^- = \max_{i=p+1,\dots,l}\{l_i\}$.

The distance between the two support hyperplanes $(w \cdot x) + b = k^+$ and $(w \cdot x) + b = l^-$ is $|k^+ - l^-| / \|w\|$, and we call this distance the margin; $k^+ > 0$ and $l^- < 0$ are constants. Following the essential idea of the support vector machine, our goal is to maximize the margin. At the confidence level $\lambda$ ($0 < \lambda \le 1$), the fuzzy linearly separable problem with fuzzy training set (6) or (7) can be transformed into the fuzzy chance constrained programming with decision variables $(w^T, b)^T$:

$$\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\ Pos\{\tilde{y}_j((w \cdot x_j) + b) \ge 1\} \ge \lambda,\ j = 1, \dots, l, \qquad (12)$$

where $Pos\{\cdot\}$ is the possibility measure of the fuzzy event $\{\cdot\}$.
Theorem 2. At the confidence level $\lambda$ ($0 < \lambda \le 1$), the certain equivalent programming (the usual programming equivalent to (12)) of the fuzzy chance constrained programming (12) is the quadratic programming below:

$$\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\ \begin{cases} ((1-\lambda)r_{t3} + \lambda r_{t2})((w \cdot x_t) + b) \ge 1, & t = 1, \dots, p, \\ ((1-\lambda)r_{i1} + \lambda r_{i2})((w \cdot x_i) + b) \ge 1, & i = p+1, \dots, l. \end{cases} \qquad (13)$$

Proof: The result follows directly from Theorem 1.

Theorem 3. There exists an optimal solution of quadratic programming (13).
Proof: omitted (see Deng 2004).

We will solve the dual programming of quadratic programming (13).
Theorem 4. The dual programming of quadratic programming (13) is the quadratic programming with decision variable $(\beta, \alpha)^T$:

$$\begin{aligned} \min_{\beta,\alpha}\ & \tfrac{1}{2}(A - 2B + C) - \Bigl(\sum_{t=1}^p \beta_t + \sum_{i=p+1}^l \alpha_i\Bigr) \\ \text{s.t.}\ & \sum_{t=1}^p ((1-\lambda)r_{t3} + \lambda r_{t2})\beta_t + \sum_{i=p+1}^l ((1-\lambda)r_{i1} + \lambda r_{i2})\alpha_i = 0, \\ & \beta_t \ge 0,\ t = 1, \dots, p, \\ & \alpha_i \ge 0,\ i = p+1, \dots, l, \end{aligned} \qquad (14)$$

where

$$A = \sum_{t=1}^p \sum_{s=1}^p \beta_t \beta_s ((1-\lambda)r_{t3} + \lambda r_{t2})((1-\lambda)r_{s3} + \lambda r_{s2})(x_t \cdot x_s),$$
$$B = \sum_{t=1}^p \sum_{i=p+1}^l \beta_t \alpha_i ((1-\lambda)r_{t3} + \lambda r_{t2})((1-\lambda)r_{i1} + \lambda r_{i2})(x_t \cdot x_i),$$
$$C = \sum_{i=p+1}^l \sum_{q=p+1}^l \alpha_i \alpha_q ((1-\lambda)r_{i1} + \lambda r_{i2})((1-\lambda)r_{q1} + \lambda r_{q2})(x_i \cdot x_q),$$

$\beta = (\beta_1, \dots, \beta_p)^T \in \mathbb{R}_+^p$, $\alpha = (\alpha_{p+1}, \dots, \alpha_l)^T \in \mathbb{R}_+^{l-p}$, and $(\beta, \alpha)^T$ is the decision variable.

Proof: omitted (see Deng 2004).
Programming (14) is convex. After getting its optimal solution $(\beta^*, \alpha^*)^T = (\beta_1^*, \dots, \beta_p^*, \alpha_{p+1}^*, \dots, \alpha_l^*)^T$, we find an optimal solution $(w^*, b^*)^T$ of the fuzzy coefficient programming (12):

$$w^* = \sum_{t=1}^p ((1-\lambda)r_{t3} + \lambda r_{t2})\beta_t^* x_t + \sum_{i=p+1}^l ((1-\lambda)r_{i1} + \lambda r_{i2})\alpha_i^* x_i,$$

$$b^* = \frac{1}{(1-\lambda)r_{s3} + \lambda r_{s2}} - \sum_{t=1}^p ((1-\lambda)r_{t3} + \lambda r_{t2})\beta_t^* (x_t \cdot x_s) - \sum_{i=p+1}^l ((1-\lambda)r_{i1} + \lambda r_{i2})\alpha_i^* (x_i \cdot x_s),\quad s \in \{s \mid \beta_s^* > 0\},$$

or

$$b^* = \frac{1}{(1-\lambda)r_{q1} + \lambda r_{q2}} - \sum_{t=1}^p ((1-\lambda)r_{t3} + \lambda r_{t2})\beta_t^* (x_t \cdot x_q) - \sum_{i=p+1}^l ((1-\lambda)r_{i1} + \lambda r_{i2})\alpha_i^* (x_i \cdot x_q),\quad q \in \{q \mid \alpha_q^* > 0\}.$$

So we can get the certain optimal classification hyperplane (see Deng 2004):

$$(w^* \cdot x) + b^* = 0,\quad x \in \mathbb{R}^n. \qquad (15)$$
Defining the function $g(x) = (w^* \cdot x) + b^*$, we set

$$\sigma = \sigma(g(x)) = \begin{cases} \varphi_+(g(x)), & 0 \le g(x) < \varphi_+^{-1}(1) \\ 1, & g(x) \ge \varphi_+^{-1}(1) \\ \varphi_-(g(x)), & \varphi_-^{-1}(-1) < g(x) < 0 \\ -1, & g(x) \le \varphi_-^{-1}(-1), \end{cases} \qquad (16)$$

where $\varphi_+^{-1}(u)$ and $\varphi_-^{-1}(u)$ are respectively the inverse functions of $\varphi_+(u)$ and $\varphi_-(u)$. Both $\varphi_+(u)$ and $\varphi_-(u)$ are regression functions (monotone in $u$) obtained in the following way.

Computation of $\varphi_+(u)$: construct the training set of the regression problem

$$\{(g(x_1), \sigma_1), \dots, (g(x_p), \sigma_p)\}. \qquad (17)$$

Using (17) as the training set, and selecting appropriate $\varepsilon > 0$ and $C > 0$, $\varepsilon$-support vector regression with linear kernel is executed.

Computation of $\varphi_-(u)$: construct the training set of the regression problem

$$\{(g(x_{p+1}), \sigma_{p+1}), \dots, (g(x_l), \sigma_l)\}. \qquad (18)$$

Using (18) as the training set, and selecting the same $\varepsilon > 0$ and $C > 0$, $\varepsilon$-support vector regression with linear kernel is executed.

Note: Equation (16) has the following explanation. Consider an input $x'$. It seems natural that the larger $g(x')$ is, the larger the corresponding membership degree of being a fuzzy positive point is; and the smaller $g(x')$ is, the larger the corresponding membership degree of being a fuzzy negative point is. The regression functions $\varphi_+(\cdot)$ and $\varphi_-(\cdot)$ just reflect this idea.
The above discussion leads to the following algorithm.

Algorithm (fuzzy support vector classification)

(1) Given a fuzzy training set (6) or (7), select an appropriate confidence level $\lambda$ ($0 < \lambda \le 1$), a parameter $C > 0$ and a kernel function $K(x, x')$; then construct the quadratic programming

$$\begin{aligned} \min_{\beta,\alpha}\ & \tfrac{1}{2}(A_K - 2B_K + C_K) - \Bigl(\sum_{t=1}^p \beta_t + \sum_{i=p+1}^l \alpha_i\Bigr) \\ \text{s.t.}\ & \sum_{t=1}^p ((1-\lambda)r_{t3} + \lambda r_{t2})\beta_t + \sum_{i=p+1}^l ((1-\lambda)r_{i1} + \lambda r_{i2})\alpha_i = 0, \\ & 0 \le \beta_t \le C,\ t = 1, \dots, p, \\ & 0 \le \alpha_i \le C,\ i = p+1, \dots, l, \end{aligned} \qquad (18)$$

where

$$A_K = \sum_{t=1}^p \sum_{s=1}^p \beta_t \beta_s ((1-\lambda)r_{t3} + \lambda r_{t2})((1-\lambda)r_{s3} + \lambda r_{s2}) K(x_t, x_s),$$
$$B_K = \sum_{t=1}^p \sum_{i=p+1}^l \beta_t \alpha_i ((1-\lambda)r_{t3} + \lambda r_{t2})((1-\lambda)r_{i1} + \lambda r_{i2}) K(x_t, x_i),$$
$$C_K = \sum_{i=p+1}^l \sum_{q=p+1}^l \alpha_i \alpha_q ((1-\lambda)r_{i1} + \lambda r_{i2})((1-\lambda)r_{q1} + \lambda r_{q2}) K(x_i, x_q),$$

$\beta = (\beta_1, \dots, \beta_p)^T \in \mathbb{R}_+^p$, $\alpha = (\alpha_{p+1}, \dots, \alpha_l)^T \in \mathbb{R}_+^{l-p}$, and $(\beta, \alpha)^T$ is the decision variable.

(2) Solve quadratic programming (18) and get its optimal solution $(\beta^*, \alpha^*)^T = (\beta_1^*, \dots, \beta_p^*, \alpha_{p+1}^*, \dots, \alpha_l^*)^T$.

(3) Select $\beta_s^* \in (0, C)$ in $\beta^*$, or $\alpha_q^* \in (0, C)$ in $\alpha^*$, then compute

$$b^* = \frac{1}{(1-\lambda)r_{s3} + \lambda r_{s2}} - \sum_{t=1}^p ((1-\lambda)r_{t3} + \lambda r_{t2})\beta_t^* K(x_t, x_s) - \sum_{i=p+1}^l ((1-\lambda)r_{i1} + \lambda r_{i2})\alpha_i^* K(x_i, x_s)$$

or

$$b^* = \frac{1}{(1-\lambda)r_{q1} + \lambda r_{q2}} - \sum_{t=1}^p ((1-\lambda)r_{t3} + \lambda r_{t2})\beta_t^* K(x_t, x_q) - \sum_{i=p+1}^l ((1-\lambda)r_{i1} + \lambda r_{i2})\alpha_i^* K(x_i, x_q).$$

(4) Construct the function

$$g(x) = \sum_{t=1}^p ((1-\lambda)r_{t3} + \lambda r_{t2})\beta_t^* K(x_t, x) + \sum_{i=p+1}^l ((1-\lambda)r_{i1} + \lambda r_{i2})\alpha_i^* K(x_i, x) + b^*.$$

(5) Consider $\{(g(x_1), \sigma_1), \dots, (g(x_p), \sigma_p)\}$ and $\{(g(x_{p+1}), \sigma_{p+1}), \dots, (g(x_l), \sigma_l)\}$ as training sets respectively, and construct the regression functions $\varphi_+(u)$ and $\varphi_-(u)$ by $\varepsilon$-support vector regression with linear kernel.

(6) According to (1), (2) and (3), transform the function $\sigma = \sigma(g(x))$ in (16) into a triangle fuzzy number $\tilde{y} = \tilde{y}(x)$; this yields the fuzzy optimal classification function.
Note: 1) If the outputs of all fuzzy training points in fuzzy training set (6) or (7) are the real numbers 1 or -1, then the fuzzy training set degenerates to a normal training set, and the fuzzy support vector classification machine degenerates to the support vector classification machine.

2) The selection of the confidence level $\lambda$ ($0 < \lambda \le 1$) in the fuzzy support vector classification machine can be seen as a parameter selection problem, so we can use parameter selection methods such as the LOO error and the LOO error bound (Deng 2004).
3. Numerical Experiments

In order to show the rationality of our algorithm, we give a simple example. Suppose the fuzzy training set contains three fuzzy positive points and three fuzzy negative points. According to (6) and (7), this fuzzy training set can be expressed as

$$S = \{(x_1, \tilde{y}_1), \dots, (x_3, \tilde{y}_3), (x_4, \tilde{y}_4), \dots, (x_6, \tilde{y}_6)\},$$
$$S_\sigma = \{(x_1, \sigma_1), \dots, (x_3, \sigma_3), (x_4, \sigma_4), \dots, (x_6, \sigma_6)\},$$

with inputs

$$x_1 = (2, 2)^T,\ x_2 = (1.7, 2)^T,\ x_3 = (1.5, 1)^T,\ x_4 = (0, 0)^T,\ x_5 = (0.8, 0.5)^T,\ x_6 = (1, 0.5)^T,$$

outputs

$$\tilde{y}_1 = 1 = (1, 1, 1),\ \tilde{y}_2 = 1 = (1, 1, 1),\ \tilde{y}_3 = (0.1, 0.6, 1.1),\ \tilde{y}_4 = -1 = (-1, -1, -1),\ \tilde{y}_5 = -1 = (-1, -1, -1),\ \tilde{y}_6 = (-1.1, -0.6, 0.1),$$

and membership degrees

$$\sigma_1 = 1,\ \sigma_2 = 1,\ \sigma_3 = 0.8,\ \sigma_4 = -1,\ \sigma_5 = -1,\ \sigma_6 = -0.8.$$

Suppose the confidence level $\lambda = 0.8$, $C = 10$ and kernel function $K(x, x') = (x \cdot x')$. We use the algorithm (fuzzy support vector classification) and get the function $g(x) = 2[x]_1 + 2[x]_2 - 4$.
We now establish the function $\sigma = \sigma(g(x))$. Taking $S_1 = \{(4, 1), (3.4, 1), (1, 0.8)\}$ as the regression training set, selecting $\varepsilon = 0.1$, $C = 10$ and a linear kernel, and constructing the support vector regression, we get the regression function $\varphi_+(u) = 0.08u + 0.72$.

Taking $S_2 = \{(-4, -1), (-1.4, -1), (-1, -0.8)\}$ as the regression training set, selecting $\varepsilon = 0.1$, $C = 10$ and a linear kernel, and constructing the support vector regression, we get the regression function $\varphi_-(u) = 0.07u - 0.73$.

So the membership function is

$$\sigma(g(x)) = \begin{cases} 0.08\,g(x) + 0.72, & 0 < g(x) \le 3.50 \\ 1, & g(x) > 3.50 \\ 0.07\,g(x) - 0.73, & -3.86 \le g(x) < 0 \\ -1, & g(x) < -3.86. \end{cases}$$
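A direct transcription of the worked example's classification function and piecewise membership function; the behavior exactly at g(x) = 0 is assigned to the φ₋ branch here, a boundary the example does not pin down.

```python
def g(x):
    # certain optimal classification function of this example
    return 2 * x[0] + 2 * x[1] - 4

def sigma(gx):
    """Piecewise membership function sigma(g(x)) of the worked example:
    phi+(u) = 0.08u + 0.72 on (0, 3.50], phi-(u) = 0.07u - 0.73 on
    [-3.86, 0), saturated at +/-1 outside these ranges."""
    if gx > 3.50:
        return 1.0
    if gx > 0:
        return 0.08 * gx + 0.72
    if gx >= -3.86:
        return 0.07 * gx - 0.73
    return -1.0

print(g((1, 2)), sigma(g((1, 2))))   # 2, 0.88   (test point x7)
print(g((1, 0)), sigma(g((1, 0))))   # -2, -0.87 (test point x8)
```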
Suppose the test points have inputs $x_7 = (1, 2)^T$ and $x_8 = (1, 0)^T$. Through $g(x)$ and $\sigma(g(x))$ we get $g(x_7) = 2 > 0$ with $\sigma(g(x_7)) = 0.88$, and $g(x_8) = -2 < 0$ with $\sigma(g(x_8)) = -0.87$. According to (1), (2) and (3), we can get $\tilde{y}_7 = (0.49, 0.76, 1.03)$ and $\tilde{y}_8 = (-0.9, -0.74, -0.44)$ (triangle fuzzy numbers).
In order to find the relationship and the difference between fuzzy support vector classification and support vector classification, we consider three alternative outputs of the third fuzzy training point in the fuzzy training set S_σ, more concretely σ_3 = 1, σ_3 = −0.8 and σ_3 = −1, while the output of the sixth fuzzy training point is σ_6 = −1. The fuzzy training set S_σ therefore becomes three sets respectively:
S_σ^1 = {(x_1, σ_1), …, (x_3, σ_3), (x_4, σ_4), …, (x_6, σ_6)} with x_3 = (1.5, 1)^T, σ_3 = 1 and x_6 = (1, 0.5)^T, σ_6 = −1; the inputs and outputs of the other fuzzy training points are the same as those in S_σ.
S_σ^2 = {(x_1, σ_1), …, (x_3, σ_3), (x_4, σ_4), …, (x_6, σ_6)} with x_3 = (1.5, 1)^T, σ_3 = −0.8 and x_6 = (1, 0.5)^T, σ_6 = −1; the inputs and outputs of the other fuzzy training points are the same as those in S_σ.
S_σ^3 = {(x_1, σ_1), …, (x_3, σ_3), (x_4, σ_4), …, (x_6, σ_6)} with x_3 = (1.5, 1)^T, σ_3 = −1 and x_6 = (1, 0.5)^T, σ_6 = −1; the inputs and outputs of the other fuzzy training points are the same as those in S_σ.
So we observe the change of the certain optimal classification hyperplane with the variation of the output of the third fuzzy training point:
σ_3 = 1 → σ_3 = 0.8 → σ_3 = −0.8 → σ_3 = −1.    (19)
When all the outputs of the training points in a training set are 1 or −1, the fuzzy training set degenerates to a usual training set, as with S_σ^1 and S_σ^3. At the same time, fuzzy support vector classification degenerates to support vector classification.
Suppose λ = 0.8, C = 10, and the kernel function K(x, x′) = (x · x′). Using the algorithm (fuzzy support vector classification), we get the corresponding certain optimal classification hyperplanes respectively:
L_1: [x]_1 + [x]_2 = 2,  L_2: [x]_1 + [x]_2 = 2.4,
L_3: 0.385[x]_1 + 1.923[x]_2 = 1.4,  L_4: 0.385[x]_1 + 1.923[x]_2 = 1.76,
as shown in the following figure:
(Figure: the certain optimal classification hyperplanes L1, L2, L3 and L4.)
(19) illustrates that the membership degree of the fuzzy training point x_3 is changed: its negative membership degree gets bigger and its positive membership degree gets smaller. The corresponding certain optimal classification hyperplane moves as L_1 → L_2 → L_3 → L_4; thus it can be seen that the result agrees with the intuitive judgment.
References
Cristianini, N. and Shawe-Taylor J.(2000), An
introduction to Support Vector Machines and
Other Kernel-based Learning Methods.
Cambridge University Press.
Deng, N. Y. and Zhu, M. F.(1987), Optimal
Methods, Education Press, Shenyang.
Deng, N. Y. and Tian, Y. J.(2004), The New
Method in Data Mining, Science Press,
Beijing.
Lin, C. F., Wang, S. D.(2002), Fuzzy Support
Vector Machines, IEEE Transactions on
Neural Networks,2.
Liu, B. D.(1998), Random Programming and
Fuzzy Programming, Tsinghua University
Press, Beijing.
Liu, B. et al.(1998) Chance Constrained
Programming with Fuzzy Parameters, Fuzzy
Sets and Systems, (2).
Mangasarian, O. L.(1999), Generalized
Support Vector Machines. Advances in Large
Margin Classifiers, MIT Press, Boston.
Vapnik, V. N.(1995), The Nature of Statistical
Learning Theory, Springer-Verlag, New
York.
Vapnik, V. N. (1998), Statistical Learning
Theory. Wiley, New York.
Yuan, Y. X. and Sun, W. Y.(1997), Optimal
Theories and Methods, Science Press, Beijing.
Zadeh, L. A.(1965), Fuzzy Sets. Information
and Control.
Zadeh, L. A.( 1978), Fuzzy Sets as a Basis for
a Theory of Possibility, Fuzzy Sets and
Systems.
Zhang, W. X. (1995), Foundation of Fuzzy Mathematics, Xi'an Jiaotong University Press, Xi'an.
DEA-based Classification for Finding Performance Improvement Direction
Shingo Aoki, Member, IEEE, Yusuke Nishiuchi, Non-Member, Hiroshi Tsuji, Member, IEEE
Abstract: In order to find the performance improvement direction for a DMU (Decision Making Unit), this paper proposes a new classification technique. The proposed method consists of two stages: (1) DEA (Data Envelopment Analysis) for evaluating DMUs by their inputs/outputs, and (2) GT (Group Technology) for finding clusters among DMUs. A case study of twelve DMUs with two inputs and two outputs shows that the proposed technique obtains four clusters, where each cluster has its own performance improvement direction. This paper also compares the traditional clustering with the proposed clustering.
Index Terms: Data Envelopment Analysis, Clustering methods, Data mining, Decision-making, Linear programming.
I. INTRODUCTION
Under the condition that there are a great number of competitors in a general marketplace, a company should find out its own advantages compared with others and extend them [2]. For this reason, concern with mathematical approaches has been growing [5] [11] [16]. Especially, this paper concentrates on the following issues: (1) characterize each company in the marketplace by its activity and define groups by similarity, and (2) compare a company to others and find the performance improvement direction [3] [4].
As for the former issue, in recent years a lot of cluster analyses have been developed. Cluster analysis is a method for classifying samples which are characterized by multiple property values [5] [6]. It allows us to obtain the common characteristics in a group, in other words, the reason why a sample belongs to a group. However, the traditional analysis calculation regards all property values as appositional. Therefore, it often yields rules which are based on absolute property values, and makes it difficult to find the performance
Manuscript received October 1, 2005, DEA-based Classification for Finding
Performance Improvement Direction.
S. Aoki is with the Graduate School of Engineering, Osaka Prefecture
University, 1-1 Gakuencho, Sakai, Osaka 599-8531, Japan (corresponding
author to provide phone: +81-72-254-9354; fax: +81-72-254-9915; e-mail:
aoki@cs.osakafu-u.ac.jp)
Y. Nishiuchi is with the Graduate School of Engineering, Osaka Prefecture
University, 1-1 Gakuencho, Sakai, Osaka 599-8531, Japan (e-mail:
nisiuti@cs.osakafu-u.ac.jp)
H. Tsuji is with the Graduate School of Engineering, Osaka Prefecture
University, 1-1 Gakuencho, Sakai, Osaka 599-8531, Japan (e-mail:
tsuji@cs.osakafu-u.ac.jp)
improvement direction for each sample.
As for the other issue, DEA has been developed and applied to a variety of managerial and economic problem situations [8]. By comparison with the Pareto optimal solutions, the so-called "efficiency frontier", the performance of DMUs is measured relatively. In other words, however, DEA has considered only the part of the DMUs which form the efficiency frontier. Therefore, little attention has been given to a clustering technique for classifying all DMUs.
In order to improve on these problems, this paper proposes a new classification technique. The proposed method consists of two stages: (1) DEA for evaluating DMUs by their inputs/outputs, and (2) GT for finding clusters among DMUs.
The remaining structure of this paper is organized as follows: Section 2 describes DEA as the basis of this research. Section 3 proposes the DEA-based classification method. Section 4 illustrates a numerical simulation using the proposed method and the traditional method, and discusses the difference between their classification results. Section 5 gives general observations on the two methods. Finally, conclusions and future extensions are summarized in Section 6.
II. DATA ENVELOPMENT ANALYSIS (DEA)
A. An overview on DEA
Data Envelopment Analysis, initiated by Charnes et al. (1978) [7], has been widely applied to efficiency (productivity) analysis, and more than fifteen hundred studies have been performed in the past twenty years [8].
DEA assumes a DMU's activity that uses multiple inputs to yield multiple outputs, and defines the process which changes multiple inputs into multiple outputs as an efficiency score. By comparison with the Pareto optimal solutions, the so-called "efficiency frontier", the efficiency score of a DMU is measured relatively.
B. Efficiency frontier
This section illustrates the efficiency frontier visually using an exercise with a sample data set. In Figure 1, suppose that there are seven DMUs which have one input and two outputs, where the X-axis is the amount of sales (output 1) over the number of shops (input) and the Y-axis is the number of visitors (output 2) over the number of shops (input). So, if a DMU is located in the upper-right region, it shows that the DMU has high productivity.
Line B-C-F-G is the efficiency frontier in Figure 1. The DMUs on this frontier are considered to perform an efficient activity. The other DMUs are considered to perform an inefficient activity, and there is room for them to improve their activities.
For instance, DMU_E's efficiency score equals OE/OE₁. Thus the range of the efficiency score is [0, 1]. The efficiency scores of DMU_B, DMU_C, DMU_F and DMU_G are equal to 1.

(Fig. 1. Graphical description of efficiency measurement: seven DMUs A to G plotted by amount of sales per number of shops and number of visitors per number of shops; line B-C-F-G is the efficiency frontier, and E₁ is the projection of DMU_E onto it.)
C. DEA model
When there are n DMUs (DMU_1, …, DMU_k, …, DMU_n), and each DMU is characterized by its own performance with m inputs (x_1k, x_2k, …, x_mk) and s outputs (y_1k, y_2k, …, y_sk), the DEA model is mathematically expressed by the following formulation [11] [12]:

Minimize θ_k
Subject to θ_k x_ik − Σ_{j=1}^{n} x_ij λ_j ≥ 0 (i = 1, 2, …, m),
Σ_{j=1}^{n} y_rj λ_j ≥ y_rk (r = 1, 2, …, s),
L ≤ Σ_{j=1}^{n} λ_j ≤ U,
λ_j ≥ 0 (j = 1, 2, …, n), θ_k: free.    (1)
In Formulation (1), L and U are the lower and upper bounds of Σ_{j=1}^{n} λ_j. If L = 0 and U = ∞, Formulation (1) is called the CCR model, and if L = U = 1, Formulation (1) is called the BCC model [13] [14] [15]. This paper uses the CCR model.
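For illustration, the following is a minimal sketch of the CCR case of Formulation (1) (L = 0, U = ∞) solved with scipy.optimize.linprog; the function name and matrix layout are our own, not part of the paper.

```python
# Sketch of the CCR model of Formulation (1); the decision vector is
# [theta, lambda_1, ..., lambda_n].
import numpy as np
from scipy.optimize import linprog

def ccr_score(X, Y, k):
    """X: (m, n) input matrix, Y: (s, n) output matrix, k: evaluated DMU."""
    m, n = X.shape
    s = Y.shape[0]
    c = np.zeros(1 + n)
    c[0] = 1.0                                    # minimize theta_k
    # theta * x_ik - sum_j x_ij lambda_j >= 0  ->  -theta x_ik + X lambda <= 0
    A_in = np.hstack([-X[:, [k]], X])
    b_in = np.zeros(m)
    # sum_j y_rj lambda_j >= y_rk  ->  -Y lambda <= -y_rk
    A_out = np.hstack([np.zeros((s, 1)), -Y])
    b_out = -Y[:, k]
    res = linprog(c,
                  A_ub=np.vstack([A_in, A_out]),
                  b_ub=np.concatenate([b_in, b_out]),
                  bounds=[(None, None)] + [(0, None)] * n)
    return res.x[0], res.x[1:]                    # efficiency score, lambdas
```

Here the optimal θ is the efficiency score and the optimal λ vector is the row a_k of the similarity coefficient matrix built in Section III.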
θ_k is the efficiency score, in the manner that θ_k = 1 (100%) means the DMU is efficient, while θ_k < 1 means it is inefficient.
The λ_j (j = 1, 2, …, n) can be considered to form the efficiency frontier for DMU_k. Especially, if λ_j > 0, then DMU_j is on the efficiency frontier. The set of these DMUs is the so-called reference set (R_k) for DMU_k and is expressed as follows:

R_k = {j | λ_j > 0, j = 1, …, n}.    (2)
Using the reference set, this paper re-defines the set R_k as a vector a_k, which is shown as follows:

a_k = {λ_1*, λ_2*, …, λ_n*}.    (3)
In Formulation (1), for instance, when a_k = {λ_1* = … = λ_{v−1}* = 0, λ_v* = 0.7, λ_{v+1}* = … = λ_{w−1}* = 0, λ_w* = 0.3, λ_{w+1}* = … = λ_n* = 0} and θ_k* = 0.85, the reference set of DMU_k is {DMU_V, DMU_W}. In Fig. 2, the point k′ is the nearest point to DMU_k on the efficiency frontier, and the efficiency value of DMU_k is given by the ratio of 0.85 to 1.
(Fig. 2. Reference set for DMU_k: DMU_k is projected onto the point k′ on the efficiency frontier spanned by DMU_v (weight 0.7) and DMU_w (weight 0.3); the efficiency value is 0.85 against 1.0.)
What is important is that this research obtains the segment connecting the origin with k′ not by the researchers' subjectivities but by the intention of making the efficiency of DMU_k as high as possible. The efficiency score of DMU_{k+1} is obtained by replacing k with k+1 in Formulation (1).
III. DEA-BASED CLASSIFICATION METHOD
Let us propose the method, which consists of the following steps:
A: Divide the data set into input items and output items.
B: For each DMU, solve formula (1) to get the efficiency score and the λ_j values. Then we will get a similarity coefficient matrix S.
C: Apply the rank order algorithm to the similarity coefficient matrix. Then we will get clusters.
A. Select input and output items
For the first step, there is a guideline to define a data set, as follows [9]:
1. Each data item is numeric, and its value is more than zero.
2. In order to show the features of the DMUs' activity, the analyst should divide the data set into input items and output items.
3. As for the input items, the analyst should choose the data which are used for investment, such as the amount of capital stock, the number of employees, and the amount of advertising investment.
4. As for the output items, the analyst should choose the data which are used for the return, such as the amount of sales and the number of visitors.
B. Create similarity coefficient matrix
As the second step, the proposed method calculates an efficiency score θ_k for each DMU_k by Formula (1), and a vector a_k by formula (3). Then the proposed method creates the similarity coefficient matrix S as follows:

S = {a_1, a_2, …, a_n}.    (4)
C. Classify DMUs by rank order algorithm
As the last step, DMUs are classified into groups by Group Technology (GT) [18], handling the similarity coefficient matrix S. For this classification, the rank order algorithm by King, J. R. [19] is employed. The rank order algorithm consists of four steps, as follows (a code sketch is given after the steps):

Step 1. Calculate the total weight of each column: w_j = Σ_i M_ij 2^i.
Step 2. Arrange the columns by ascending weight.
Step 3. Calculate the total weight of each row: w_i = Σ_j M_ij 2^j.
Step 4. If the rows are in ascending order by weight, STOP; else arrange the rows by ascending weight and GOTO Step 1.
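As an illustration of these steps, here is a minimal sketch in Python; the binary weighting of rows and columns follows the 2^i / 2^j reading of the weights above, and the function name is our own.

```python
# Sketch of the rank order algorithm [19] on a binary relevance matrix M,
# with each row/column weighted by reading its 0/1 entries as a binary number.
import numpy as np

def rank_order(M, max_iter=100):
    M = np.asarray(M).copy()
    rows, cols = np.arange(M.shape[0]), np.arange(M.shape[1])
    for _ in range(max_iter):
        # Steps 1-2: column weights w_j = sum_i M_ij * 2**i, ascending order.
        w_col = (M * (2 ** np.arange(M.shape[0]))[:, None]).sum(axis=0)
        order = np.argsort(w_col, kind="stable")
        M, cols = M[:, order], cols[order]
        # Steps 3-4: row weights w_i = sum_j M_ij * 2**j; stop once sorted.
        w_row = (M * (2 ** np.arange(M.shape[1]))).sum(axis=1)
        if np.all(np.diff(w_row) >= 0):
            break
        order = np.argsort(w_row, kind="stable")
        M, rows = M[order, :], rows[order]
    return M, rows, cols
```

After convergence, clusters appear as blocks of 1s along the diagonal of the reordered matrix, as in the final state of Fig. 5.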
IV. A CASE STUDY
In order to verify the usefulness of the proposed method, let us illustrate a numerical simulation.
A. A data set
A sample data set is shown in Table I. The data set concerns the performance of 12 DMUs (DMU_A, …, DMU_L), and each DMU has four data items: the number of employees, the number of shops, the number of visitors and the amount of sales.
B. Traditional cluster analysis
B.1. METHOD OF CLUSTERING ANALYSIS. Cluster analysis is an exploratory data analysis method which aims at sorting different objects into groups in such a way that the degree of association between objects is maximal if they belong to the same group and minimal otherwise [5] [20].
Table I. DATA SET FOR NUMERICAL STUDIES

DMU | Number of employees (input) | Number of shops (input) | Number of visitors (K person/month, output) | Amount of sales (M/month, output)
A | 10 | 8 | 23 | 21
B | 26 | 10 | 37 | 32
C | 40 | 15 | 80 | 68
D | 35 | 28 | 76 | 60
E | 30 | 21 | 23 | 20
F | 33 | 10 | 38 | 41
G | 37 | 12 | 78 | 65
H | 50 | 22 | 68 | 77
I | 31 | 15 | 48 | 33
J | 12 | 10 | 16 | 36
K | 20 | 12 | 64 | 23
L | 45 | 26 | 72 | 35
The degree of association is estimated by the distance, which is calculated by Ward's method [21].
Ward's method is distinct from other methods because it uses an analysis-of-variance approach to evaluate the distances between clusters. When a new cluster c is created by combining cluster a and cluster b, the distance between a cluster x and the cluster c is mathematically expressed by the following formulation:

d_xc^2 = ((n_x + n_a)/(n_x + n_c)) d_xa^2 + ((n_x + n_b)/(n_x + n_c)) d_xb^2 − (n_x/(n_x + n_c)) d_ab^2,    (5)

where d_mn is the distance between clusters m and n, and n_m is the number of individuals in cluster m.
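As a small illustration, the update (5) can be transcribed directly; the helper below is a sketch with our own naming, and scipy.cluster.hierarchy.linkage(data, method='ward') applies the same recurrence while also producing the dendrogram of Fig. 3.

```python
# Sketch of the Ward distance update in formula (5): when clusters a and b
# merge into c, the squared distance from any cluster x to c is a weighted
# combination of d(x,a)^2, d(x,b)^2 and d(a,b)^2.
def ward_update(d_xa, d_xb, d_ab, n_x, n_a, n_b):
    n_c = n_a + n_b
    return (((n_x + n_a) * d_xa**2
             + (n_x + n_b) * d_xb**2
             - n_x * d_ab**2) / (n_x + n_c)) ** 0.5
```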
In general, this method is computationally simple, while it tends to create small clusters.
B2. CLASSIFICATION RESULT. The result of classifying the data set with Ward's clustering method is a dendrogram (see Fig. 3); a dendrogram is also called a tree diagram. In Fig. 3, when two individuals have been combined together on the left, the two individuals are considered to belong to the same group. The final number of clusters depends on the position where the dendrogram is cut off. To get four clusters, for example, (A, J, E), (B, F, I), (K, L) and (C, G, D, H) are obtained by cutting the dendrogram at (1) in Fig. 3.
(Fig. 3. Dendrogram by Ward's method over the twelve DMUs, with the distance among DMUs on the horizontal axis; cut-off (1) yields the four clusters described below.)
From this classification result and Table I, the features of each group are considered as follows:
(i) Group (A, J, E) consists of small scale DMUs,
(ii) Group (B, F, I) consists of lower middle scale DMUs,
(iii) Group (K, L) consists of larger middle scale DMUs whose visitor unit price is very low,
(iv) Group (C, G, D, H) consists of large scale DMUs.
Fig. 4 illustrates the classification result obtained by the traditional method.
(Fig. 4. Traditional classification result: the four groups (small scale; lower middle scale; larger middle scale with a very low visitor unit price; large scale) positioned by numbers of employees and shops against number of visitors and amount of sales.)

V. DEA-BASED CLASSIFICATION
This section describes the process of the proposed method.
Step 1: Select inputs and outputs. According to Step 1 in Section 3, the number of employees and the number of shops are selected as input values, and the number of visitors and the amount of sales are selected as output values.
Step 2: Create a similarity coefficient matrix. By Formulations (1), (3) and (4), the similarity coefficient matrix S is obtained as shown in Table II.
TABLE II. SIMILARITY COEFFICIENT MATRIX S

DMU | Efficiency score | a_k (columns A, B, C, D, E, F, G, H, I, J, K, L)
A | 1     | 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
B | 0.674 | 0, 0, 0, 0, 0, 0, 0.404, 0, 0, 0.124, 0.054, 0
C | 0.943 | 0, 0, 0, 0, 0, 0, 0.889, 0, 0, 0.21, 0.113, 0
D | 0.885 | 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.265, 0
E | 0.331 | 0, 0, 0, 0, 0, 0, 0.007, 0, 0, 0.38, 0.256, 0
F | 0.757 | 0, 0, 0, 0, 0, 0, 0.631, 0, 0, 0, 0, 0
G | 1     | 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0
H | 0.755 | 0, 0, 0, 0, 0, 0, 0.789, 0, 0, 0.715, 0, 0
I | 0.638 | 0, 0, 0, 0, 0, 0, 0.276, 0, 0, 0.184, 0.368, 0
J | 1     | 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0
K | 1     | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0
L | 0.556 | 0, 0, 0, 0, 0, 0, 0.103, 0, 0, 0.176, 0.956, 0
Let us note S in Table II. The λ_j values are greater than zero only for DMU_A, DMU_G, DMU_J and DMU_K, which are on the efficiency frontier; for every other DMU the λ_j value equals zero. This means that each DMU is characterized by a combination of the efficient DMUs' features. The proposed method focuses attention on this DEA contribution, and finds the performance improvement direction for each DMU.
Step 3: Classify DMUs by the rank order algorithm. The rank order algorithm applied to the similarity coefficient matrix S generates the classification shown in Fig. 5. The matrix S in Fig. 5 is expressed as follows:
If S_ij > 0, it is considered that there is relevance between DMU_i and DMU_j, and the entry is 1.
If S_ij = 0, it is considered that there is no relevance between DMU_i and DMU_j, and the entry is empty.
(Fig. 5. Classification demonstration by the rank order algorithm: the initial state of the binarized matrix and the final state after reordering.)
Then, four clusters, (A, D), (B, C, E, I, K, L), (F, G, H) and (J), are obtained as shown in Fig. 5. The features of each group are considered as follows:
(i) group (A, D) consists of DMUs which get many visitors and a large amount of sales with few employees and few shops,
(ii) group (B, C, E, I, K, L) consists of DMUs whose employees are clever in marquee,
(iii) group (F, G, H) consists of DMUs which are managed with large-sized shops,
(iv) group (J) consists of a DMU which has many visitors who purchase a lot.
From the above analysis, Fig. 6 is drawn as a conceptual diagram which shows the situation of the classification.
(Fig. 6. Proposed classification result, positioning the four groups along four directions: Brand Side (get large sales with a few visitors), Marquee Side (get many visitors with a few employees), Shop Scale Side (get many visitors and large sales with many employees), and Profit Side (get many visitors and a large amount of sales with a few employees and shops).)
VI. DISCUSSION
From the result of section 4.2, two characteristics of the clustering analysis are considered, as follows:
(a) The classification result is based on the scale of management,
(b) The number of clusters can be assigned according to the purpose.
Therefore, the traditional method does not require preparation in advance. However, it has the demerit that it is difficult to find the performance improvement direction for a DMU, since the classification result is only based on the scale of management.
On the other hand, the DEA-based classification has three characteristics, as follows:
(a) The classification result is based on the direction of management,
(b) The number of groups classified is the same as the number of efficient DMUs,
(c) Every group has at least one efficient DMU.
Since the λ_j values in the similarity coefficient matrix S (Table II) are positive only where the corresponding DMU is efficient, (b) holds. As shown in Fig. 5, since there is one efficient DMU in every classified group, (c) also holds.
Then, the merits and the demerits of the proposed method are described. It is easy to find the performance improvement direction for a DMU. For example, even if a DMU is evaluated as inefficient, it is possible to refer to the features of the efficient DMU which belongs to the same group. However, it is necessary to select the right inputs and outputs in preparation.
VII. CONCLUSIONS AND FUTURE EXTENSIONS
This paper has described issues of the traditional classification method and proposed a new classification method which finds the performance improvement direction. The case study has shown that the classification by cluster analysis was based on the scale of management, while the classification by the proposed method was based on the direction of management.
Future extensions of this research include the following:
(a) Application to a large-scale practical problem,
(b) A method for assigning meaning to the derived groups,
(c) Investigating the reliability of the performance improvement direction,
(d) Establishment of a one-step application of the proposed method.
REFERENCES
[1] Y. Hirose et al, Brand value evaluation paper group report, the Ministry of
Economy, Trade and Industry, 2002.
[2] Y. Hirose et al, Brand value that on-balance-ization is hurried, weekly
economist special issue, Vol.24, 2001.
[3] S. Aoki, Y. Naito, and H. Tsuji, DEA-based Indicator for performance
Improvement, Proceeding of The 2005 International Conference on
Active Media Technology, 2005.
[4] Y. Taniguchi, H. Mizuno and H. Yajima, Visual Decision Support
System, Proceeding of IEEE International Conference on Systems, Man
and Cybernetics (SCM97), 1997, pp.554-558.
[5] S. Miyamoto, Fuzzy sets in information retrieval and cluster analysis,
Kluwer Academic Publishers, Dordrecht: Boston, 1990.
[6] M.R. Anderberg, Cluster analysis for applications, Academic Press, New
York, USA, 1973.
[7] A. Charnes, W.W. Cooper, and E. Rhodes, Measuring the efficiency of
decision-making units, European journal of operational research, vol.2,
1978, pp.429-444.
[8] T. Sueyoshi, Management Efficiency Analysis (in Japanese), Asakura
Shoten Co., Ltd, Tokyo, 2001.
[9] K. Tone, Measurement and Improvement of Management Efficiency (in
Japanese), JUSE Press, Ltd, Tokyo, 1993.
[10] M.J. Farrell, The Measurement of Productive Efficiency, Journal of the
Royal Statical Society, (Series A), vol.120, 1957, pp.253-281.
[11] D.L, Adolphson, G.C. Cornia, and L.C. Walters, A Unified Framework
for Classifying DEA Models, Operational Research 90, edited by
E.E.Bradley, Pergamon Press, 1991, pp.647-657.
[12] A. Boussofiane, R.G. Dyson, and E. Thanassoulis, Invited Review:
Applied Data Envelopment Analysis, European Journal of Operational
Research, vol.52, 1991, pp.1-15.
[13] R. D. Banker, and R.C. Morey, The use of categorical variables in Data
Envelopment Analysis, Management Science vol.32, 1984,
pp.1613-1627
[14] R.D. Banker, A. Charnes, and W.W. Cooper, Some models for
estimating technical and scale inefficiencies in data envelopment
analysis, Management Science, Vol.30, 1984, pp.1078-1092.
[15] R.D. Banker, Estimating Most Productive Scale Size Using Data
Envelopment Analysis, European Journal of Operational Research,
vol.17, 1984, pp.35-44.
[16] W.A. Kamakura, A note on the use of categorical variables in Data
Envelopment Analysis, Management Science, vol.34, 1988,
pp.1273-1276.
[17] J.J. Rousseau, and J. Semple, Categorical outputs in Data Envelopment
Analysis, Management Science, vol.39, 1993, pp.384-386.
[18] J.R. King, V. Nakornchai, Machine-component group formation in
group technology: review and extension, Internat. J. Prod, vol.20, 1982,
pp.117-133.
[19] J.R. King, Machine-Component Grouping in Production Flow Analysis:
An Approach Using a Rank Order Clustering Algorithm, International
Journal of Production Research, vol. 18, 1980, pp.213-232.
[20] J.G. Hirschberg, and D.J. Aigner, A Classification for Medium and
Small Firms by Time-of-Day Electricity Usage, Papers and Proceedings
of the Eight Annual North American Conference of the International
Association of Energy Economists, 1986, pp.253-257.
[21] J. Ward, Hierarchical grouping to optimize an objective function,
Journal of the American Statistical Association, vol.58, 1963,
pp.236-244.
Multi-Viewpoint Data Envelopment Analysis for Finding Efficiency and Inefficiency
Shingo AOKI, Member, IEEE, Kiyosei MINAMI, Non-Member, Hiroshi TSUJI, Member, IEEE
Abstract: This paper proposes a decision support method for measuring productivity efficiency based on DEA (Data Envelopment Analysis). The decision support method, called the Multi-Viewpoint DEA model, integrates the efficiency analysis and the inefficiency analysis, and makes it possible to assess the performance of a DMU (Decision Making Unit) between its strong points and weak points by changing the viewpoint parameter. A case study of twenty-five Japanese baseball players shows that the proposed model is robust with respect to the evaluation value.
Index Terms: Data Envelopment Analysis, Decision-Making, Linear programming, Productivity.
I. INTRODUCTION
DEA [1] is a nonparametric method for finding the relative efficiency of DMUs, each of which is a company responsible for converting multiple inputs into multiple outputs. DEA has been applied to a variety of managerial and economic problem situations in both public and private sectors [5, 9, 13, 14]. DEA summarizes the process which changes multiple inputs into multiple outputs as one evaluation value.
The decision method based on DEA admits two kinds of approaches. One is the efficiency analysis based on the Pareto optimal solution for the aspect only of the strong points [1, 5]. The other is the inefficiency analysis based on the Pareto optimal solution for the aspect only of the weak points [7]. The evaluation values of the two approaches are, however, inconsistent [8]. Moreover, analysts have evaluated DMUs only by one extreme aspect: either the strong points or the weak points. Thus, the traditional two analyses lack flexibility and robustness [17].
In fact, while there are many inputs and outputs in the DEA framework, these items are not fully used in the previous approaches. This type of DEA problem has usually been tackled by multiplier restriction approaches [15] and cone ratio
tackled by multiplier restriction approaches [15] and cone ratio
Manuscript received September 28, 2005, Multi-Viewpoint Data
Envelopment Analysis for Finding Efficiency and Inefficiency.
S. Aoki is with the Graduate School of Engineering, Osaka Prefecture
University, 1-1 Gakuencho, Sakai, Osaka 599-8531, Japan (corresponding
author to provide phone: +81-72-254-9354; fax: +81-72-254-9915; e-mail:
aoki@cs.osakafu-u.ac.jp)
K. Minami is with the Graduate School of Engineering, Osaka Prefecture
University, 1-1 Gakuencho, Sakai, Osaka 599-8531, Japan
H. Tsuji is with the Graduate School of Engineering, Osaka Prefecture
University, 1-1 Gakuencho, Sakai, Osaka 599-8531, Japan (e-mail:
tsuji@cs.osakafu-u.ac.jp)
approaches [16]. While such multiplier restrictions usually reduce the number of zero weights, they often produce an infeasible solution in DEA. Therefore, a new DEA model whose evaluation values are robust is required.
This paper proposes a decision support technique referred to as the Multi-Viewpoint DEA model. The remaining structure of this paper is organized as follows: the next section reviews the traditional DEA models. Section 3 proposes a new model; the proposed model integrates the efficiency analysis and the inefficiency analysis into one mathematical formulation, and allows us to analyze the performance of a DMU from multiple viewpoints between the strong points and the weak points. Section 4 verifies the proposed model through a case study, which shows that the proposed model has two desirable features: (1) robustness of the evaluation value, and (2) unification of the efficiency analysis and the inefficiency analysis. Finally, the conclusion and future study are summarized in Section 5.
II. DEA-BASED EFFICIENCY AND INEFFICIENCY ANALYSES
A. DEA: Data Envelopment Analysis
In order to describe the mathematical structure of the evaluation value, this paper assumes that there are n DMUs (DMU_1, …, DMU_k, …, DMU_n), where each DMU is characterized by m inputs (x_1k, …, x_ik, …, x_mk) and s outputs (y_1k, …, y_rk, …, y_sk). The evaluation value of DMU_k is mathematically formulated by

Evaluation Value of DMU_k = (u_1 y_1k + u_2 y_2k + … + u_s y_sk) / (v_1 x_1k + v_2 x_2k + … + v_m x_mk).    (1)
Here u_r is the multiplier weight given to the r-th output, and v_i is the multiplier weight given to the i-th input. From the analysis concept, there are two decision methods for calculating these weights. One is the efficiency analysis based on the Pareto optimal solution for the aspect only of the strong points [1, 5]. The other is the inefficiency analysis based on the Pareto optimal solution for the aspect only of the weak points [7, 8].
Fig 1 visually represents the difference between the two methods. Suppose that there are nine DMUs which have one input and two outputs, where the X-axis is output 1 over input and the Y-axis is output 2 over input. So, if a DMU is located in the upper-right region, it shows that the DMU has high productivity.
The efficiency analysis finds out the efficiency frontier, which indicates the best practice line (B-C-D-E-F in Fig 1), and evaluates the relative evaluation value by the aspect only of the strong points. On the other hand, the inefficiency analysis finds out the inefficiency frontier, which indicates the worst practice line (B-I-H-G-F in Fig 1), and evaluates the relative evaluation value by the aspect only of the weak points.
(Fig 1. Efficiency analysis and inefficiency analysis: nine DMUs A to I plotted with Output 1/Input on the X-axis and Output 2/Input on the Y-axis; B-C-D-E-F is the efficiency frontier and B-I-H-G-F is the inefficiency frontier.)
B. Efficiency Analysis
The efficiency analysis measures the efficiency level of a specific DMU_k by relatively comparing its performance to the efficiency frontier. This paper is based on the CCR model [1], while there are other models [5, 11]. The efficiency analysis can be mathematically formulated by
Max θ_k^E = Σ_{r=1}^{s} u_r y_rk    (2-1)
s.t. Σ_{i=1}^{m} v_i x_ij − Σ_{r=1}^{s} u_r y_rj ≥ 0 (j = 1, 2, …, n),    (2-2)
Σ_{i=1}^{m} v_i x_ik = 1,    (2-3)
v_i ≥ 0, u_r ≥ 0.    (2)
Here formula (2-2) is a restriction condition requiring that the productivity of every DMU (formula (1)) be 100% or less. The objective function (2-1) represents the maximization of the sum of the virtual outputs of DMU_k, under the normalization that the virtual inputs of DMU_k equal 1 (formula (2-3)). Therefore, the optimal solution (v_i, u_r) represents the most convenient weights for DMU_k. Especially, the optimal objective function value indicates the evaluation value θ_k^E for DMU_k. This evaluation value under the convenient weights is called the efficiency score, in the manner that θ_k^E = 1 (100%) means the state of efficiency, while θ_k^E < 1 (100%) means the state of inefficiency.
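For illustration, here is a minimal sketch of the efficiency analysis (2) with scipy.optimize.linprog; this is our own transcription, since the paper does not provide code.

```python
# Sketch of the multiplier-form efficiency analysis (2): maximize
# sum_r u_r y_rk subject to u.y_j <= v.x_j for all j and v.x_k = 1.
import numpy as np
from scipy.optimize import linprog

def efficiency_score(X, Y, k):
    """X: (m, n) inputs, Y: (s, n) outputs; returns theta_E of DMU k."""
    (m, n), s = X.shape, Y.shape[0]
    c = np.concatenate([np.zeros(m), -Y[:, k]])   # Max as Min of the negative
    A_ub = np.hstack([-X.T, Y.T])                 # u.y_j - v.x_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.concatenate([X[:, k], np.zeros(s)])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (m + s))
    return -res.fun                               # optimal sum_r u_r y_rk
```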
C. Inefficiency analysis
There is another analysis, which measures the inefficiency level of a specific DMU_k based on the Inverted DEA model [7]. The inefficiency analysis can be mathematically formulated by
Min 1/θ_k^IE = Σ_{r=1}^{s} u_r y_rk    (3-1)
s.t. Σ_{r=1}^{s} u_r y_rj − Σ_{i=1}^{m} v_i x_ij ≥ 0 (j = 1, 2, …, n),    (3-2)
Σ_{i=1}^{m} v_i x_ik = 1,    (3-3)
v_i ≥ 0, u_r ≥ 0.    (3)
Again, formula (3-2) is a restriction condition requiring that the productivity of every DMU (formula (1)) be 100% or more. The objective function (3-1) represents the minimization of the virtual outputs of DMU_k, under the normalization that the virtual inputs of DMU_k equal 1 (formula (3-3)). Therefore, the optimal solution (v_i, u_r) represents the most inconvenient weights for DMU_k. Especially, the inverse of the optimal objective function value indicates the inefficiency score, in the manner that θ_k^IE = 1 (100%) means the state of inefficiency, while θ_k^IE < 1 (100%) means the state of efficiency.
D. Requirement for Multi-Viewpoint DEA
As shown in Fig 1, DMU_B and DMU_F are evaluated as being in both the state of efficiency (θ_k^E = 1) and the state of inefficiency (θ_k^IE = 1). This result clearly shows the mathematical difference between the two analyses. For example, DMU_B has the best productivity for Output 2/Input, while it has the worst productivity for Output 1/Input. In the efficiency analysis, the weights of DMU_B are evaluated by the aspect of the strong points; therefore, the weight of Output 2/Input becomes a positive value and the weight of Output 1/Input becomes zero. On the other hand, in the inefficiency analysis, the weights of DMU_B are evaluated by the aspect of the weak points; therefore, the weight of Output 2/Input becomes zero and the weight of Output 1/Input becomes a positive value. This difference in the weight estimation causes the following mathematical problems:
a) No robustness of evaluation value
Both analyses may produce zero weights for most inputs and outputs. A zero weight indicates that the corresponding input or output is not used for the evaluation value. Moreover, if specific input or output items are removed from the analysis, the evaluation value may change greatly [17]. This type of DEA problem is usually tackled by multiplier restriction approaches [15] and cone ratio approaches [16]. Such multiplier restrictions usually reduce the number of zero weights, but these analyses often produce an infeasible solution. The development of a DEA model whose evaluation values are robust is required.
b) Lack of unification between efficiency analysis and
inefficiency analysis
Fundamentally, an efficient DMU cannot be inefficient, while an inefficient DMU cannot be efficient. However, the evaluation values may be inconsistent, as with DMU_B and DMU_F in Fig 1, which are in both states of efficiency and inefficiency. Thus, it is not easy for analysts to understand the difference between evaluation values. A basis for the evaluation value which unifies the efficiency analysis and the inefficiency analysis is required.
III. INTEGRATING EFFICIENT AND INEFFICIENT VIEW
A. Two DEA models based on GP technique
Let us propose a new decision support technique referred to as the Multi-Viewpoint DEA model. The proposed model is a re-formulation of the efficiency analysis and the inefficiency analysis into one mathematical formulation. This paper applies the following formula (4), which adds the variables (d_j^+, d_j^−) to formula (2-2):

Σ_{i=1}^{m} v_i x_ij − Σ_{r=1}^{s} u_r y_rj − d_j^+ + d_j^− = 0 (j = 1, 2, …, n).    (4)
Here d_j^+ indicates the slack variables, and d_j^− indicates the artificial variables. Therefore, the objective function (2-1) can be replaced, using a sufficiently big M, as follows:

Σ_{r=1}^{s} u_r y_rk − M Σ_{j=1}^{n} d_j^−.    (5)
From formula (4) and formula (2-3), the objective function (5) can be rewritten as follows:

Σ_{r=1}^{s} u_r y_rk − M Σ_{j=1}^{n} d_j^− = (Σ_{i=1}^{m} v_i x_ik − d_k^+ + d_k^−) − M Σ_{j=1}^{n} d_j^−
= 1 − d_k^+ + d_k^− − M Σ_{j=1}^{n} d_j^−
= 1 − d_k^+ + (1 − M) d_k^− − M Σ_{j=1, j≠k}^{n} d_j^−.    (6)
Using the GP (Goal Programming) technique, the DEA efficiency model (formula (2)) can be replaced by the following linear programming:

Max 1 − d_k^+ + (1 − M) d_k^− − M Σ_{j=1, j≠k}^{n} d_j^−
s.t. Σ_{i=1}^{m} v_i x_ij − Σ_{r=1}^{s} u_r y_rj − d_j^+ + d_j^− = 0 (j = 1, 2, …, n),
Σ_{i=1}^{m} v_i x_ik = 1,
v_i ≥ 0, u_r ≥ 0, d_j^+ ≥ 0, d_j^− ≥ 0.    (7)

The efficiency score θ_k^E of DMU_k is then

θ_k^E = 1 − d_k^{+*} ( = Σ_{r=1}^{s} u_r* y_rk / Σ_{i=1}^{m} v_i* x_ik, with Σ_{i=1}^{m} v_i* x_ik = 1 ),    (8)

where the superscript * indicates the optimal solution of formula (7).
Let us likewise apply formula (4), with the added variables (d_j^+, d_j^−), to formula (3-2). Note that in the inefficiency analysis d_j^+ indicates the artificial variables and d_j^− indicates the slack variables. Using the GP technique, the inefficiency analysis (formula (3)) can be replaced by the following linear programming:

Min 1 + (M − 1) d_k^+ + d_k^− + M Σ_{j=1, j≠k}^{n} d_j^+
s.t. Σ_{i=1}^{m} v_i x_ij − Σ_{r=1}^{s} u_r y_rj − d_j^+ + d_j^− = 0 (j = 1, 2, …, n),
Σ_{i=1}^{m} v_i x_ik = 1,
v_i ≥ 0, u_r ≥ 0, d_j^+ ≥ 0, d_j^− ≥ 0.    (9)
The inefficiency score θ_k^IE of DMU_k is then

θ_k^IE = 1 / (1 + d_k^{−*}),    (10)

where the superscript * indicates the optimal solution of formula (9).
B. Mathematical integration of the efficiency and
inefficiency model
In order to integrate the two DEA analyses into one formula mathematically, this paper introduces the slack variables. As seen in formulas (7) and (9), both analyses have the same restriction conditions. This paper therefore applies the following formula (11), which combines the objective functions of formulas (7) and (9) with constants (α, β):

α {1 − d_k^+ + (1 − M) d_k^− − M Σ_{j≠k} d_j^−} − β {1 + (M − 1) d_k^+ + d_k^− + M Σ_{j≠k} d_j^+}
= (α − β) − {α + β(M − 1)} d_k^+ + {α(1 − M) − β} d_k^− − M (α Σ_{j≠k} d_j^− + β Σ_{j≠k} d_j^+).    (11)
When formula (11) is divided by the sufficiently big M, it can be developed as follows:

− α (d_k^− + Σ_{j≠k} d_j^−) − β (d_k^+ + Σ_{j≠k} d_j^+) = − α Σ_{j=1}^{n} d_j^− − β Σ_{j=1}^{n} d_j^+.    (12)
Here the constants can be normalized so that α + β = 1, because the constants (α, β) indicate the relative ratios of the efficiency analysis and the inefficiency analysis. Then the proposed model is formulated as the following linear programming:

Max − α Σ_{j=1}^{n} d_j^− − (1 − α) Σ_{j=1}^{n} d_j^+
s.t. Σ_{i=1}^{m} v_i x_ij − Σ_{r=1}^{s} u_r y_rj − d_j^+ + d_j^− = 0 (j = 1, 2, …, n),
Σ_{i=1}^{m} v_i x_ik = 1,
v_i ≥ 0, u_r ≥ 0, d_j^+ ≥ 0, d_j^− ≥ 0,    (13)
where x_ij is the i-th input value of the j-th DMU, y_rj is the r-th output value of the j-th DMU, v_i and u_r are the input and output weights, and d_j^+, d_j^− are the slack variables.
Formula (13) includes the viewpoint parameter α, and allows us to analyze the performance of a DMU by changing the parameter between the strong points (especially, if α = 1 then the optimal solution is the same as that of the efficiency analysis) and the weak points (if α = 0 then the optimal solution is the same as that of the inefficiency analysis). For α = α′, this paper defines the evaluation value θ_k^{MVP,α′} of DMU_k as follows:

θ_k^{MVP,α′} = α′ θ_k^E − (1 − α′) θ_k^IE = α′ (1 − d_k^{+*}) − (1 − α′) (1 / (1 + d_k^{−*})),    (14)

where the superscript * indicates the optimal solution of formula (13).
The first term of formula (14) indicates the evaluation value from the aspect of the strong points, and the second term indicates it from the aspect of the weak points. Therefore, the evaluation value θ_k^{MVP,α′} is measured on the range between −1 (−100%: inefficiency) and 1 (100%: efficiency).
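To make the integrated model concrete, the following is a minimal sketch of the linear program (13) and the evaluation value (14) using scipy.optimize.linprog. The function name, matrix layout and sign conventions follow the reconstruction above and are illustrative rather than the authors' implementation.

```python
# Sketch of the Multi-Viewpoint DEA model (13)-(14); the variable vector is
# [v_1..v_m, u_1..u_s, d+_1..d+_n, d-_1..d-_n], all nonnegative.
import numpy as np
from scipy.optimize import linprog

def mvp_score(X, Y, k, alpha):
    """X: (m, n) inputs, Y: (s, n) outputs, k: evaluated DMU, alpha in [0, 1]."""
    (m, n), s = X.shape, Y.shape[0]
    nv = m + s + 2 * n
    c = np.zeros(nv)
    c[m + s: m + s + n] = 1 - alpha      # (1 - alpha) * sum_j d+_j
    c[m + s + n:] = alpha                # alpha * sum_j d-_j (Max of (13) as Min)
    # sum_i v_i x_ij - sum_r u_r y_rj - d+_j + d-_j = 0 for every j
    A_eq = np.hstack([X.T, -Y.T, -np.eye(n), np.eye(n)])
    b_eq = np.zeros(n)
    norm = np.zeros((1, nv))             # normalization sum_i v_i x_ik = 1
    norm[0, :m] = X[:, k]
    res = linprog(c, A_eq=np.vstack([A_eq, norm]),
                  b_eq=np.concatenate([b_eq, [1.0]]),
                  bounds=[(0, None)] * nv)
    d_plus_k = res.x[m + s + k]
    d_minus_k = res.x[m + s + n + k]
    # evaluation value (14): alpha * theta_E - (1 - alpha) * theta_IE
    return alpha * (1 - d_plus_k) - (1 - alpha) / (1 + d_minus_k)
```

Setting alpha to 1 or 0 reduces the model to the efficiency and inefficiency analyses, which correspond to the α = 1 and α = 0 columns of TABLE II.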
IV. CASE STUDY
A. A data set
The data set used in this paper is illustrated in TABLE I (the source of this data set is the internet site YAHOO! SPORTS (in Japanese), 2005). Twenty-five batters are selected for our performance evaluation. When using the data set, this paper uses "bats" and "walks" as input items, and "singles", "doubles", "triples", "homeruns", "runs batted in" and "steals" as output items.
TABLE I. OFFENSIVE RECORDS OF JAPANESE BASEBALL PLAYERS IN 2005

DMU | Inputs: bats, walks | Outputs: singles, doubles, triples, homeruns, runs batted in, steals
1 577 96 89 37 1 44 120 2
2 452 73 91 19 2 18 70 3
3 498 71 82 25 1 36 91 6
4 574 56 110 34 2 24 89 18
5 503 38 111 29 1 6 51 11
6 473 75 74 21 1 30 89 6
7 431 46 77 27 1 15 63 10
8 552 91 77 31 1 35 100 8
9 569 57 105 42 1 11 73 2
10 529 64 92 22 1 26 75 7
11 420 33 75 27 2 14 75 8
12 530 84 67 24 0 44 108 0
13 549 41 122 25 2 2 34 22
14 633 51 140 19 8 4 45 42
15 580 66 107 27 0 20 88 7
16 544 24 95 28 3 24 79 1
17 473 53 88 20 0 6 40 0
18 526 47 86 28 0 24 71 4
19 559 50 92 22 3 27 90 18
20 559 51 110 24 1 9 62 4
21 452 40 68 19 2 26 84 2
22 580 61 89 23 1 33 94 5
23 542 82 74 18 0 37 100 1
24 503 78 79 20 0 18 74 1
25 424 36 74 18 7 6 39 10
B. Multi-Viewpoint DEA's result
TABLE II shows the evaluation values of the Multi-Viewpoint DEA model. This paper calculates eleven patterns of the viewpoint parameter between α = 1 and α = 0. Especially, if the parameter α is set to 1, the evaluation value θ_k^{MVP,1} is calculated by the efficiency analysis (formula (2)); and if the parameter is set to 0, the evaluation value θ_k^{MVP,0} is calculated by the inefficiency analysis (formula (3)).
1) Efficiency Analysis's Result
This analysis finds that there are 14 batters whose evaluation value is 1 (efficiency). In TABLE I, these batters include DMU_1, which captured the triple crown, and DMU_14, which captured the steal crown in 2005. It is thus understood that DEA evaluates many evaluation axes equally. However, because the evaluation value is estimated only by the aspect of the strongest point of each DMU, the multiplicity of strong points, as with DMU_1, is not considered. Therefore, superiority cannot be established between these batters in this analysis.
2) Inefficiency Analysis's Result
This analysis finds that there are 10 batters whose evaluation value is −1 (inefficiency). Because the evaluation value is estimated only by the aspect of the weak points, these batters include those who have few steals even if they excel in long hits, like DMU_12 and DMU_23. As with the efficiency analysis, superiority cannot be established between these batters.
3) Proposed Model's Result
The proposed model allows us to analyze the performance of a DMU between efficiency and inefficiency. To clarify the change of the evaluation value when the viewpoint parameter α is shifted from 1 to 0, let us focus not on the evaluation value but on the rank. Fig 2 shows the change of rank for four specific batters (DMU_12, DMU_13, DMU_14, DMU_25) which are estimated in both states of efficiency and inefficiency.
a) Robustness of the evaluation value
Although DMU_25 has a high rank in the case α = 1, its rank drops rapidly in the other cases. Considering its strong points in TABLE I, it is understood that it has superiority in the ratio of doubles (output) to bats (input); however, the other ratios are not excellent. That is to say, DMU_25 has a limited strong point. Oppositely, as seen in TABLE II, for batters who are almighty like DMU_1 and DMU_2, the rank does not change easily. Because the proposed model allows us to know whether a DMU has a multiplicity of strong points or a limited strong point, it is possible to evaluate the DMU with robustness.
b) Unification between DEA-efficiency and
DEA-inefficiency model
In the cases α = 1, 0.8, 0.7, 0.4 and 0.2, the rank of DMU_14 changes to 25, 12, 19, 11 and 24; thus, the change of rank is large. As shown in TABLE I, because DMU_14 has a multiplicity of strong points such as singles, triples and steals, it is understood that DMU_14 roughly keeps a high rank. However, this result indicates that the rank does not change linearly from the aspect of the strong points to that of the weak points. Although the efficiency analysis and the inefficiency analysis are integrated into one mathematical formulation, how to assign the viewpoint parameter α still remains.
TABLE II. VIEWPOINT PARAMETER α AND ESTIMATION VALUE (θ_k^MVP)

DMU | α=1 | α=0.9 | α=0.8 | α=0.7 | α=0.6 | α=0.5 | α=0.4 | α=0.3 | α=0.2 | α=0.1 | α=0
1 1 0.787 0.592 0.414 0.226 0.043 -0.138 -0.320 -0.504 -0.695 -0.931
2 1 0.805 0.606 0.422 0.231 0.052 -0.135 -0.315 -0.491 -0.634 -0.988
3 1 0.801 0.605 0.419 0.228 0.048 -0.144 -0.327 -0.502 -0.696 -0.890
4 1 0.803 0.610 0.417 0.226 0.045 -0.143 -0.328 -0.509 -0.661 -0.846
5 1 0.800 0.609 0.412 0.220 0.035 -0.158 -0.350 -0.526 -0.656 -0.881
6 0.980 0.749 0.557 0.373 0.196 0.021 -0.172 -0.355 -0.543 -0.720 -0.933
7 0.989 0.743 0.553 0.381 0.185 0.012 -0.184 -0.377 -0.569 -0.718 -0.909
8 0.981 0.683 0.499 0.319 0.150 -0.034 -0.209 -0.405 -0.604 -0.793 -1
9 1 0.710 0.507 0.353 0.159 -0.023 -0.218 -0.405 -0.603 -0.732 -0.963
10 0.947 0.746 0.559 0.373 0.197 0.012 -0.187 -0.377 -0.555 -0.733 -0.930
11 1 0.803 0.612 0.423 0.228 0.037 -0.163 -0.349 -0.501 -0.674 -0.905
12 1 0.714 0.515 0.330 0.177 -0.020 -0.211 -0.386 -0.578 -0.799 -1
13 1 0.738 0.557 0.392 0.177 -0.009 -0.198 -0.372 -0.541 -0.701 -1
14 1 0.748 0.554 0.405 0.209 0.006 -0.201 -0.376 -0.499 -0.677 -1
15 0.955 0.762 0.563 0.389 0.212 0.022 -0.176 -0.362 -0.550 -0.696 -1
16 1 0.797 0.570 0.398 0.216 0.011 -0.185 -0.377 -0.545 -0.722 -0.922
17 0.851 0.640 0.445 0.277 0.126 -0.054 -0.239 -0.427 -0.622 -0.802 -1
18 0.926 0.716 0.520 0.359 0.165 -0.029 -0.213 -0.409 -0.609 -0.787 -1
19 1 0.758 0.573 0.383 0.182 -0.006 -0.198 -0.385 -0.558 -0.733 -0.935
20 0.926 0.729 0.527 0.373 0.176 -0.012 -0.207 -0.406 -0.587 -0.716 -0.946
21 1 0.764 0.570 0.377 0.176 -0.009 -0.197 -0.382 -0.572 -0.744 -0.961
22 0.934 0.731 0.539 0.353 0.172 -0.017 -0.209 -0.404 -0.588 -0.773 -0.971
23 0.916 0.696 0.499 0.326 0.161 -0.033 -0.215 -0.411 -0.608 -0.800 -1
24 0.849 0.644 0.456 0.276 0.117 -0.069 -0.251 -0.427 -0.619 -0.806 -1
25 1 0.632 0.451 0.294 0.109 -0.071 -0.252 -0.436 -0.624 -0.805 -1
(Fig 2. Rank of four players (No.12, No.13, No.14, No.25) as the viewpoint parameter α varies from 1 to 0; the rank axis runs from 0 to 25.)
V. CONCLUSION
This paper has proposed a new decision support method, called the Multi-Viewpoint DEA model, which integrates the efficiency analysis and the inefficiency analysis into one mathematical formulation. The proposed model allows us to analyze the performance of a DMU by changing the viewpoint parameter between the strong points (especially, if α = 1 then it becomes the efficiency analysis) and the weak points (if α = 0 then it becomes the inefficiency analysis). Regarding twenty-five Japanese baseball players as DMUs, a case study has shown that the proposed model has two desirable features: (a) robustness of the evaluation value, and (b) unification between the efficiency analysis and the inefficiency analysis. For future study, we will analytically compare our method to the traditional approaches [15, 16] and explore how to set the viewpoint parameter.
REFERENCES
[1] A.Charnes, W.W.Cooper, and E.Rhodes, Measuring the efficiency of
decision making units, European Journal of Operational Research, 1978,
Vol.2, pp.429-444.
[2] T. Sueyoshi and S. Aoki, A use of a nonparametric statistic for DEA
frontier shift: the Kruskal and Wallis rank test, OMEGA: The
International Journal of Management Science, Vol.29, No.1, 2001,
pp1-18.
[3] T.Sueyoshi, K.Onishi, and Y.Kinase, A Bench Mark Approach for
Baseball Evaluation, European Journal of Operational Research,
Vol.115, 1999, pp.429-428.
[4] T. Sueyoshi, Y. Kinase and S. Aoki, DEA Duality on Returns to Scale in
Production and Cost Analysis, Proceedings of the Sixth Asia Pacific
Management Conference 2000, 2000, pp1-7.
[5] W. W. Cooper, L. M. Seiford, K. Tone, Data Envelopment Analysis: A
comprehensive text with models, applications, references and
DEA-Solver software, Kluwer Academic Publishers, 2000.
[6] R. Coombs, P. Sabiotti and V. Walsh, Economics and Technological
Change, Macmillan, 1987.
[7] Y. Yamada, T. Matui and M. Sugiyama, "An inefficiency measurement
method for management systems", Journal of Operations Research
Society of Japan, vol. 37, 1994, pp. 158-168 (In Japanese).
[8] Y. Yamada, T. Sueyoshi, M. Sugiyama, T. Nukina and T. Makino The
DEA Method for Japanese Management: The Evaluation of Local
Governmental Investments to the Japanese Economy, Journal of the
Operations Research Society of Japan, Vol.38, No.4, 1995, pp.381-396.
[9] S. Aoki, K. Mishima, H. Tsuji: Two-Staged DEA model with Malmquist
Index for Brand Value Estimation, The 8th World Multiconference on
Systemics, Cybernetics and Informatics, Vol. 10, pp.1-6, 2004.
[10] R. D. Banker, A. Charnes, W. W. Cooper, Some Models for Estimating
Technical and Scale Inefficiencies in Data Envelopment Analysis,
Management Science, Vol.30, 1984, pp.1078-1092.
[11] R. D. Banker, and R. M. Thrall, Estimation of Returns to Scale Using
Data Envelopment Analysis, European Journal of Operational Research,
Vol.62, 1992, pp.74-82.
[12] H. Nakayama, M. Arakawa, Y. B. Yun, Data Envelopment Analysis in
Multicriteria Decision Making,M. Ehrgott and X. Gandibleux (eds.)
Multiple Criteria Optimization: State of the Art Annotated Bibliographic
Surveys, Kluwer Acadmic Publishiers, 2002.
[13] E. W. N. Bernroider, V. Stix , The Evaluation of ERP Systems Using
Data Envelopment Analysis, Information Technology and Organizations,
Idea Group Pub, 2003, pp.283-286.
[14] Y. Zhou, Y. Chen, DEA-based Performance Predictive Design of
Complex Dynamic System Business Process Improvement, Proceeding
of Systems, Man and Cybernetics, 2003. IEEE International Conference,
2003, pp.3008-3013.
[15] R. G. Thompson, L. N. Langemeier, C. T. Lee, and R. M. Thrall, The
Role of Multiplier Bounds in Efficiency Analysis with Application to
Kansas Farming, Journal of Econometrics, Vol.46, 1990, pp.93-108.
[16] W. W. Cooper, W. Quanling and G. Yu, Using Displaced Cone
Representation in DEA models for Nondominated Solutions in
Multiobjective Programming, Systems Science and Mathematical
Sciences, Vol.10, 1997, pp.41-49.
[17] S. Aoki, Y. Naito, and H. Tsuji, DEA-based Indicator for performance
Improvement, Proceeding of The Third International Conference on
Active Media Technology, 2005, pp.327-330.
Mining Valuable Stocks with Genetic Optimization Algorithm
Lean Yu, Kin Keung Lai and Shouyang Wang
Abstract: In this study, we utilize the genetic algorithm (GA) to mine high quality stocks for investment. Given the fundamental financial and price information of stock trading, we attempt to use GA to identify stocks that are likely to outperform the market by having excess returns. To evaluate the efficiency of the GA for stock selection, the return of the equally weighted portfolio formed by the stocks selected by GA is used as the evaluation criterion. Experimental results reveal that the proposed GA for stock selection provides a very flexible and useful tool to assist investors in selecting valuable stocks.
Index Terms: Genetic algorithms; Portfolio optimization; Data mining; Stock selection.
I. INTRODUCTION
In the stock market, investors are often faced with a large number of stocks. A crucial part of their investment decision process is the selection of stocks. From a data-mining perspective, the problem of stock selection is to identify good quality stocks that have the potential to outperform the market by having excess returns in the future. Given the fundamental accounting and price information of stock trading, it is a prediction problem that involves discovering useful patterns or relationships in the data, and applying that information to identify whether a stock is of good quality.
Obviously, this is not an easy task for many investors when they are faced with the enormous number of stocks in the market. With a focus on business computing, applying artificial intelligence to portfolio selection and optimization is one way to meet the challenge. Some research has been presented to solve the asset selection problem. Levin [1] applied an artificial neural network to select valuable stocks. Chu [2] used fuzzy multiple attribute decision analysis to select stocks for a portfolio. Similarly, Zargham [3] used a fuzzy rule-based system to evaluate the listed stocks and realize stock selection. Recently, Fan [4] utilized the support vector machine to train universal
Manuscript received July 30, 2005. This work was supported in part by the
SRG of City University of Hong Kong under Grant No. 7001806.
Lean Yu is with the Institute of Systems Science, Academy of Mathematics
and Systems Science, Chinese Academy of Sciences, Beijing, 100080, China
(e-mail: yulean@amss.ac.cn).
Kin Keung Lai is with the Department of Management Science, City
University of Hong Kong and is also with the College of Business
Administration, Hunan University, 410082, China (phone: 852-2788-8563;
fax:852-2788-8560; e-mail: mskklai@cityu.edu.hk).
Shouyang Wang is with the Institute of Systems Science, Academy of
Mathematics and Systems Science, Chinese Academy of Sciences, Beijing,
100080, China (e-mail: sywang@amss.ac.cn).
feedforward neural networks to perform stock selection.
However, these approaches have some drawbacks in solving the stock selection problem. For example, the fuzzy approaches [2-3] usually lack learning ability, while the neural network approach [1, 4] has an overfitting problem and is often easily trapped in local minima. In order to overcome these shortcomings, GA is used to perform this task; some related typical literature can be found in [5-7].
The main aim of this study is to mine valuable stocks using GA and to test the efficiency of the GA for stock selection. The rest of the study is organized as follows. Section 2 describes the mining process based on the genetic algorithm in detail. Section 3 presents a simulation experiment, and Section 4 concludes the paper.
II. GA-BASED STOCK SELECTION PROCESS
Generally, GA imitates the natural selection process of biological evolution with selection, crossover and mutation; the sequence of the different operations of a genetic algorithm is shown in the left part of Fig. 1. That is, GA is a procedure modeled after genetics and evolution. Genetics provides the chromosomal representation to encode the solution space of the problem, while the evolutionary procedures are designed to search efficiently for attractive solutions to large and complex problems. Usually, GA works in a survival-of-the-fittest fashion by gradually manipulating the potential problem solutions to obtain superior solutions in the population. Optimization is performed in the representation rather than in the problem space directly. To date, GA has become a popular optimization method, as it often succeeds in finding the best optimum by global search, in contrast to most common optimization algorithms. Interested readers are referred to [8-9] for more details.
The aim of this study is to identify the quality of each stock using GA so that investors can choose good ones for investment. Here we use stock ranking to determine the quality of each stock: the stocks with a high rank are regarded as good quality stocks. In this study, some financial indicators of the listed companies are employed to determine and identify the quality of each stock. That is, the financial indicators of the companies are used as input variables, while a score is given to rate the stocks; the output variable is the stock ranking. Throughout the study, four important financial indicators, return on capital employed (ROCE), price/earnings ratio (P/E ratio), earnings per share (EPS) and liquidity ratio, are utilized
in this study. They are defined as follows:
ROCE = (Profit)/(Shareholders equity)*100% (1)
P/E ratio = (stock price)/(earnings per share)*100% (2)
EPS=(Net income)/(The number of ordinary shares) (3)
Liquidity Ratio=(Current Assets)/(Current Liabilities) (4)
When the input variables are determined, we can use GA to distinguish and identify the quality of each stock, as illustrated in Fig. 1.
Fig. 1 Stock selection with genetic algorithm
First of all, a population, which consists of a given number of chromosomes, is initially created by randomly assigning "1" and "0" to all genes. In the case of stock ranking, a gene contains only a single bit string for the status of an input variable. The top right part of Fig. 1 shows a population with four chromosomes; each chromosome includes different genes. In this study, the initial population of the GA is generated by encoding the four input variables. For the testing case of ROCE, we design 8 statuses representing different qualities in terms of different intervals, varying from 0 (extremely poor) to 7 (very good). An example of encoding ROCE is shown in Table I. The other input variables are encoded by the same principle. That is, the binary string of a gene consists of three single bits, as illustrated by Fig. 1.
TABLE I
AN EXAMPLE OF ENCODING ROCE
ROCE value Status Encoding
(-∞, -30%] 0 000
(-30%, -20%] 1 001
(-20%,-10%] 2 010
(-10%,0%] 3 011
(0%, 10%] 4 100
(10%, 20%] 5 101
(20%, 30%] 6 110
(30%, +∞) 7 111
It is worth noting that the 3-digit encoding is used for
simplicity in this study. A 4-digit encoding could also be
adopted, but the computation would be considerably more complex.
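A minimal sketch of the Table 1 encoding, assuming the half-open intervals as listed; the helper name and the decimal representation of percentages are our illustrative choices:

```python
# Bucket a ROCE value into one of eight statuses (Table 1) and emit a
# three-bit gene. Interval boundaries follow Table 1, stated as decimals.
ROCE_BOUNDS = [-0.30, -0.20, -0.10, 0.0, 0.10, 0.20, 0.30]

def encode_roce(value):
    status = sum(value > b for b in ROCE_BOUNDS)   # 0 (extremely poor) .. 7 (very good)
    return format(status, "03b")                   # three-bit gene, e.g. '101'

print(encode_roce(0.15))   # -> '101' (status 5, the (10%, 20%] interval)
```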
The subsequent step is to evaluate the chromosomes
generated by the previous operation using a so-called fitness
function. The design of the fitness function is a crucial
point in using GA, since it determines what the GA should
optimize. Since the output is an estimated stock ranking of the
designated testing companies, an actual stock ranking
must be defined in advance to design the fitness function.
Here we use the annual price return (APR) to rank the listed
stocks. The APR is defined as

$APR_n = \frac{ASP_n - ASP_{n-1}}{ASP_{n-1}}$    (5)

where $APR_n$ is the annual price return for year $n$ and $ASP_n$ is the
annual stock price for year $n$.
Usually, the stocks with a high
annual price return are regarded as good stocks. With the APR
evaluated for each of the $N$ traded stocks, each stock is assigned
a rank $r$ ranging from 1 to $N$, where 1 corresponds to the highest
APR and $N$ to the lowest. For ease of comparison, the rank $r$ is
mapped linearly into a stock ranking from 0 to 7 according to
the following equation:

$ranking = \frac{7\,(N - r)}{N - 1}$    (6)
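A short sketch of Eqs. (5) and (6) end to end, from annual prices to APR, rank, and the 0-7 ranking scale; the prices and APR values below are toy assumptions:

```python
# Eq. (5): annual price return; Eq. (6): linear map of rank r onto 0..7.
import numpy as np

asp = {2002: 10.0, 2003: 12.5}                     # annual stock prices (toy values)
apr = (asp[2003] - asp[2002]) / asp[2002]          # Eq. (5)

aprs = np.array([0.25, -0.10, 0.40, 0.05])         # APRs of N = 4 stocks (toy values)
N = len(aprs)
r = N - aprs.argsort().argsort()                   # rank 1 = highest APR ... N = lowest
ranking = 7 * (N - r) / (N - 1)                    # Eq. (6): rank 1 -> 7, rank N -> 0
print(apr, r, ranking)
```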
III. SIMULATION EXPERIMENT

The experimental data are collected from the Shanghai Stock
Exchange (http://www.sse.com.cn). The sample data span the period
from January 2, 2002 to December 31, 2004. Monthly and
yearly data in this study are obtained from the daily data by
computation. For the simulation, 100 stocks are selected from
the Shanghai A-share market; their stock codes range from
600000 to 600100.
First of all, the company financial information, as the input
variables, is fed into the GA to obtain the derived company
ranking. This output is compared with the actual stock ranking
in terms of APR, as indicated by Equations (5) and (6). In the
GA optimization process, the RMSE between the derived
and the actual ranking of each stock is calculated and serves
as the evaluation function of the GA process. The best
chromosome obtained is used to rank the stocks, and the top n
stocks are chosen for the portfolio. For experimental purposes,
the top 10 and top 20 stocks according to the GA-based ranking
of stock quality are chosen for testing, and each set constructs
a portfolio. For convenience, equally weighted portfolios are
built for comparison purposes.
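A minimal sketch of this evaluation step, with RMSE as the fitness and an equally weighted top-n portfolio; all numbers and function names are illustrative assumptions:

```python
# RMSE between GA-derived and actual APR-based rankings as the fitness,
# then an equally weighted portfolio of the top-n stocks.
import numpy as np

def fitness(derived_ranking, actual_ranking):
    err = np.asarray(derived_ranking) - np.asarray(actual_ranking)
    return np.sqrt(np.mean(err ** 2))              # RMSE; smaller is fitter

def top_n_portfolio_return(derived_ranking, stock_returns, n=10):
    order = np.argsort(derived_ranking)[::-1]      # highest ranking first
    picked = order[:n]                             # top-n stocks form the portfolio
    return np.mean(stock_returns[picked])          # equally weighted return

rng = np.random.default_rng(2)
derived, actual = rng.uniform(0, 7, 100), rng.uniform(0, 7, 100)
returns = rng.normal(0.05, 0.2, 100)               # toy per-stock returns
print(fitness(derived, actual), top_n_portfolio_return(derived, returns))
```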
In order to evaluate the usefulness of the GA optimization,
we compared the net accumulated return generated by the
stocks selected by GA with a benchmark. The benchmark
return is determined by an equally weighted portfolio of all
the stocks available in the experiment. Fig. 2 shows the
results for the different portfolios.
Fig. 2 Accumulated return for different portfolios
From Fig. 2, we can see that the net accumulated return of
the equally weighted portfolio formed by the stocks selected
by GA significantly outperforms the benchmark. In
addition, the performance of the 10-stock portfolio is
better than that of the 20-stock portfolio. As we know,
portfolio construction does not only focus on the expected
return but also on risk minimization. The larger the number
of stocks in the portfolio, the more flexibility the portfolio
has to achieve a composition that avoids risk. However,
selecting good-quality stocks is the prerequisite for obtaining
a good portfolio. That is, although a portfolio with a large
number of stocks can lower the risk to some extent, some
bad-quality stocks may be included in the portfolio, which
hurts the portfolio performance. Meanwhile, this result also
demonstrates that if investors select good-quality stocks, a
portfolio with a large number of stocks does not necessarily
outperform a portfolio with a small number of stocks.
Therefore, it is wise for investors to select a limited number
of good-quality stocks for constructing a portfolio.
IV. CONCLUSIONS
This study uses a genetic optimization algorithm to perform
stock selection for portfolios. Experimental results reveal that
the GA optimization approach is useful for the stock selection
problem and can mine the most valuable stocks for investors.
ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers
for their valuable comments and suggestions. Their comments
have improved the quality of the paper immensely.
REFERENCES
[1] A.U. Levin, Stock selection via nonlinear multi-factor models,
Advances in Neural Information Processing Systems, 1995, pp. 966-972.
[2] T.C. Chu, C.T. Tsao, and Y.R. Shiue, Application of fuzzy multiple
attribute decision making on company analysis for stock selection,
Proceedings of Soft Computing in Intelligent Systems and Information
Processing, 1996, pp. 509-514.
[3] M.R. Zargham and M.R. Sayeh, A web-based information system for
stock selection and evaluation, Proceedings of the First International
Workshop on Advance Issues of E-Commerce and Web-Based
Information Systems, 1999, pp. 81-83.
[4] A. Fan and M. Palaniswami, Stock selection using support vector
machines, Proceedings of International Joint Conference on Neural
Networks, 2001, pp. 1793-1798.
[5] L. Lin, L. Cao, J. Wang, and C. Zhang, The applications of genetic
algorithms in stock market data mining optimization, in Data Mining V,
A. Zanasi, N.F.F. Ebecken, and C.A. Brebbia, Eds. WIT Press, 2004.
[6] S.H. Chen, Genetic Algorithms and Genetic Programming in
Computational Finance. Dordrecht: Kluwer Academic Publishers, 2002.
[7] J. Thomas and K. Sycara, The importance of simplicity and validation in
genetic programming for data mining in financial data, Proceedings of
the Joint AAAI-1999 and GECCO-1999 Workshop on Data Mining with
Evolutionary Algorithms, 1999.
[8] J. H. Holland, Genetic algorithms, Scientific American, 1992, 267, pp.
66-72.
[9] D.E. Goldberg, Genetic Algorithm in Search, Optimization, and
Machine Learning. Addison-Wesley, Reading, MA, 1989.
A Comparison Study of Multiclass Classification between Multiple Criteria
Mathematical Programming and Hierarchical Method for Support Vector
Machines
Yi Peng¹, Gang Kou¹, Yong Shi¹,²,³, Zhenxing Chen¹ and Hongjin Yang²

¹ College of Information Science & Technology, University of Nebraska at Omaha,
Omaha, NE 68182, USA
{ypeng, gkou, zchen}@mail.unomaha.edu
² Chinese Academy of Sciences Research Center on Data Technology & Knowledge Economy,
Graduate University of the Chinese Academy of Sciences, Beijing 100080, China
{yshi, hjyang}@gucas.ac.cn
³ The corresponding author
Abstract

Multiclass classification refers to classifying
data objects into more than two classes. The
purpose of this paper is to compare two
multiclass classification approaches: Multiple
Criteria Mathematical Programming (MCMP)
and the hierarchical method for Support Vector
Machines (SVM). While MCMP considers all
classes at once, SVM was initially designed
for binary classification. Extending SVM from
two-class to multiclass classification is still an
ongoing research issue, and many proposed
approaches use a hierarchical method. In this
paper, we focus on one common hierarchical
method: pairwise classification. We compare
the performance of MCMP and the pairwise SVM
approach using KDD99, a large network intrusion
dataset. Results show that MCMP achieves better
multiclass classification accuracy than pairwise SVM.

Keywords: classification, multi-group
classification, multi-group multiple criteria
mathematical programming (MCMP),
pairwise classification
1. INTRODUCTION
As one of the major data mining
functionalities, classification has broad
applications such as credit card portfolio
management, medical diagnosis, and fraud
detection. Based on historical information,
classification builds classifiers to predict
categorical class labels for unknown data.
Classification methods can be categorized in
various ways; one distinction is between
binary and multiclass classification. Binary
classification, as the name indicates, classifies
data into two classes. Multiclass classification
refers to classifying data objects into more than
two classes. Many real-life applications
require multiclass classification. For example,
a multiclass classifier that is capable of
predicting subtypes of cancer will be more
helpful than a binary classifier that can
only predict cancer or non-cancer.
Researchers have suggested various
multiclass classification methods. Multiple
Criteria Mathematical Programming (MCMP)
and the hierarchical method for Support Vector
Machines (SVM) are two of them. MCMP
and SVM are both based on mathematical
programming, yet no comparison study of the
two has been conducted to date. The
purpose of this paper is to compare these two
multiclass classification approaches. While
MCMP considers all classes at once, SVM
was initially designed for binary classification.
Extending SVM from two-class to multiclass
classification is still an ongoing research issue,
and many proposed approaches are hierarchical.
In this paper, we focus on one common
hierarchical method: pairwise classification. We
first introduce MCMP and pairwise SVM
classification, and then conduct an
experiment to compare their performance
using KDD99, a large network intrusion
dataset.
This paper is structured as follows. The
next section discusses the formulation of the
multiple-group multiple criteria mathematical
programming classification model. The third
section describes the pairwise SVM multiclass
classification method. The fourth section
compares the performance of MCMP and
pairwise SVM using KDD99. The last section
concludes the paper.
2. MULTI-GROUP MULTI-CRITERIA
MATHEMATICAL PROGRAMMING
MODEL
This section introduces an MCMP model
for multiclass classification. Simply speaking,
this method classifies observations into
distinct groups based on two criteria. The
following models represent this concept
mathematically:
Given an $r$-dimensional attribute vector $a = (a_1, \ldots, a_r)$, let
$A_i = (A_{i1}, \ldots, A_{ir}) \in \mathbb{R}^r$ be one of the sample
records, where $i = 1, \ldots, n$ and $n$ represents the total
number of records in the dataset. Suppose $k$ groups
$G_1, G_2, \ldots, G_k$ are predefined, where $G_i \cap G_j = \emptyset$
for $i \neq j$, $1 \le i, j \le k$, and
$A_i \in \{G_1 \cup G_2 \cup \ldots \cup G_k\}$ for $i = 1, \ldots, n$. A series
of boundary scalars $b_1 < b_2 < \ldots < b_{k-1}$ can be set
to separate these $k$ groups; the boundary $b_j$ is
used to separate $G_j$ and $G_{j+1}$. Let
$X = (x_1, \ldots, x_r)^T \in \mathbb{R}^r$ be a vector of real numbers to
be determined. Thus, we can establish the
following linear inequalities (Fisher 1936, Shi
et al. 2001):

$A_i X < b_1$, $\forall A_i \in G_1$;    (1)
$b_{j-1} \le A_i X < b_j$, $\forall A_i \in G_j$;    (2)
$A_i X \ge b_{k-1}$, $\forall A_i \in G_k$;    (3)
$2 \le j \le k-1$, $1 \le i \le n$.
A mathematical function $f$ can be used to
describe the summation of the total overlapping,
while another mathematical function $g$ represents
the aggregation of all distances. The final
classification accuracy of this multi-group
classification problem depends on simultaneously
minimizing $f$ and maximizing $g$. Thus, a generalized
bi-criteria programming method for classification
can be formulated as:

(Generalized Model) Minimize $f$ and Maximize $g$
Subject to: (1), (2) and (3).
To formulate the criteria and complete
constraints for data separation, some variables
need to be introduced. In the classification
problem, $A_i X$ is the score for the $i$th data
record. If an element $A_i \in G_j$ is misclassified
into a group other than $G_j$, let $\alpha_{i,j}^p$
(the $p$-norm of $\alpha_{i,j}$, $p \ge 1$) be the Euclidean
distance from $A_i$ to $b_j$, with $A_i X = b_j + \alpha_{i,j}$,
$1 \le j \le k-1$, and let $\alpha_{i,j-1}^p$ be the
Euclidean distance from $A_i \in G_j$ to $b_{j-1}$, with
$A_i X = b_{j-1} - \alpha_{i,j-1}$, $2 \le j \le k$. Otherwise,
$\alpha_{i,j} = 0$, $1 \le j \le k$, $1 \le i \le n$.
Therefore, the function $f$ of the total overlapping
of data can be represented as

$f = \sum_{j=1}^{k} \sum_{i=1}^{n} \alpha_{i,j}^p$.
If an element $A_i \in G_j$ is correctly
classified into $G_j$, let $\beta_{i,j}^p$ be the
Euclidean distance from $A_i$ to $b_j$, with
$A_i X = b_j - \beta_{i,j}$, $1 \le j \le k-1$, and let
$\beta_{i,j-1}^p$ be the Euclidean distance from
$A_i \in G_j$ to $b_{j-1}$, with
$A_i X = b_{j-1} + \beta_{i,j-1}$, $2 \le j \le k$. Otherwise,
$\beta_{i,j} = 0$, $1 \le j \le k$, $1 \le i \le n$. Thus,
the objective is to maximize the distance
$\beta_{i,j}^p$ from $A_i$ to the boundary if $A_i \in G_1$ or $G_k$,
and to minimize the distance
$\left|\beta_{i,j} - \frac{b_j - b_{j-1}}{2}\right|^p$ from $A_i$ to the middle of the
two adjacent boundaries $b_{j-1}$ and $b_j$ if
$A_i \in G_j$, $2 \le j \le k-1$. So the function $g$ of
the distances of every data record to its class
boundary or boundaries can be represented as

$g = \sum_{j=1 \text{ or } k} \sum_{i=1}^{n} \beta_{i,j}^p - \sum_{j=2}^{k-1} \sum_{i=1}^{n} \left|\beta_{i,j} - \frac{b_j - b_{j-1}}{2}\right|^p$.
Furthermore, to transform the generalized
bi-criteria classification model into a single-
criterion problem, weights $w_\alpha > 0$ and $w_\beta > 0$
are introduced for $f(\alpha)$ and $g(\beta)$,
respectively. The values of $w_\alpha$ and $w_\beta$ can be
pre-defined in the process of identifying the
optimal solution. As a result, the generalized
model can be converted into a single-criterion
mathematical programming model:

(Model 1) Minimize
$w_\alpha \sum_{j=1}^{k} \sum_{i=1}^{n} \alpha_{i,j}^p - w_\beta \left( \sum_{j=1 \text{ or } k} \sum_{i=1}^{n} \beta_{i,j}^p - \sum_{j=2}^{k-1} \sum_{i=1}^{n} \left|\beta_{i,j} - \frac{b_j - b_{j-1}}{2}\right|^p \right)$

Subject to:
$A_i X = b_j + \alpha_{i,j} - \beta_{i,j}$, $1 \le j \le k-1$    (4)
$A_i X = b_{j-1} - \alpha_{i,j-1} + \beta_{i,j-1}$, $2 \le j \le k$    (5)
$\beta_{i,j} \le b_j - b_{j-1}$, $2 \le j \le k$    (a)
$\beta_{i,j} \le b_{j+1} - b_j$, $1 \le j \le k-1$    (b)

where $A_i$, $i = 1, \ldots, n$ are given, $X$ and $b_j$ are
unrestricted, and $\alpha_{i,j}, \beta_{i,j} \ge 0$ for $1 \le i \le n$.
(a) and (b) are defined as such because
the distances from any correctly classified
data record ($A_i \in G_j$, $2 \le j \le k-1$) to the two adjacent
boundaries $b_{j-1}$ and $b_j$ must be less than
$b_j - b_{j-1}$. A better separation of two adjacent
groups may be achieved by the following
constraints instead of (a) and (b), because (c)
and (d) set up a stronger limitation on $\beta_{i,j}$:

$\beta_{i,j} \le (b_j - b_{j-1})/2 + \varepsilon$, $2 \le j \le k$    (c)
$\beta_{i,j} \le (b_{j+1} - b_j)/2 + \varepsilon$, $1 \le j \le k-1$    (d)

where $\varepsilon \in \mathbb{R}^+$ is a small positive real number.
Let $p = 2$; then the objective function in
Model 1 becomes quadratic, and we have:

(Model 2) Minimize
$w_\alpha \sum_{j=1}^{k} \sum_{i=1}^{n} (\alpha_{i,j})^2 - w_\beta \left( \sum_{j=1 \text{ or } k} \sum_{i=1}^{n} (\beta_{i,j})^2 - \sum_{j=2}^{k-1} \sum_{i=1}^{n} \left[ (\beta_{i,j})^2 - (b_j - b_{j-1})\,\beta_{i,j} \right] \right)$    (6)

Subject to: (4), (5), (c) and (d).

Note that the constant $\left(\frac{b_j - b_{j-1}}{2}\right)^2$ is
omitted from (6) without any effect on the
solution.
A version of Model 2 for three
predefined classes is given in Figure 1. The
stars represent group 1 data objects, the black
dots represent group 2 data objects, and the
white circles represent group 3 data objects.

[Figure 1. A Three-Class Model: boundaries $b_1$ and $b_2$ separate G1, G2 and G3, with $A_i X = b_j + \alpha_{i,j} - \beta_{i,j}$ for $j = 1, 2$, and $A_i X = b_{j-1} - \alpha_{i,j-1} + \beta_{i,j-1}$ for $j = 2, 3$.]
Model 2 can be regarded as a weak
separation formula since it allows
overlapping. In addition, a medium
separation formula can also be constructed
on the absolute class boundaries (Model 3)
without any overlapping data. Furthermore, a
strong separation formula that requires a
non-zero distance between the boundaries of
two adjacent groups (Model 4) emphasizes the
non-overlapping characteristic between
adjacent groups.

(Model 3) Minimize (6)
Subject to: (c) and (d)
$A_i X \le b_j - \beta_{i,j}$, $1 \le j \le k-1$
$A_i X \ge b_{j-1} + \beta_{i,j-1}$, $2 \le j \le k$
where $A_i$, $i = 1, \ldots, n$ are given, $X$ and $b_j$ are
unrestricted, and $\alpha_{i,j}, \beta_{i,j} \ge 0$ for $1 \le i \le n$.

(Model 4) Minimize (6)
Subject to: (c) and (d)
$A_i X \le b_j - \alpha_{i,j} - \beta_{i,j}$, $1 \le j \le k-1$
$A_i X \ge b_{j-1} + \alpha_{i,j-1} + \beta_{i,j-1}$, $2 \le j \le k$
where $A_i$, $i = 1, \ldots, n$ are given, $X$ and $b_j$ are
unrestricted, and $\alpha_{i,j}, \beta_{i,j} \ge 0$ for $1 \le i \le n$.

These models can be used in
multiclass classification, and the applicability
of each depends on the nature of the
given dataset. If the adjacent groups in a
dataset do not have any overlapping data,
Model 4 or Model 3 is more appropriate.
Otherwise, Model 2 can generate better
results.
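For illustration, the following sketch optimizes a toy instance of the weak-separation idea behind Model 2 ($p = 2$, $k = 3$). It folds the deviation variables of constraints (4) and (5) into the objective and uses a generic optimizer; the ε-capped constraints (c) and (d) are omitted for brevity, so this is a simplified reading of Model 2, not the authors' implementation:

```python
# Weak-separation MCMP sketch (p = 2, k = 3): misclassification deviations
# (alpha) are penalized, edge-group distances (beta) rewarded, and middle-
# group records pulled toward the midpoint of their two boundaries.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Toy data: three ordered groups of 20 records in 2 attributes.
A = np.vstack([rng.normal(m, 0.5, size=(20, 2)) for m in (0.0, 2.0, 4.0)])
grp = np.repeat([0, 1, 2], 20)               # group index of each record
w_alpha, w_beta = 1.0, 0.5                   # weights for overlap f and distance g

def objective(params):
    X = params[:2] / np.linalg.norm(params[:2])   # projection vector, normalized
    b1, b2 = sorted(params[2:])                   # boundaries b1 <= b2
    total = 0.0
    for s, gi in zip(A @ X, grp):
        if gi == 0:                               # group 1: want s < b1
            alpha, beta = max(s - b1, 0.0), max(b1 - s, 0.0)
            total += w_alpha * alpha**2 - w_beta * beta**2
        elif gi == 2:                             # group 3: want s >= b2
            alpha, beta = max(b2 - s, 0.0), max(s - b2, 0.0)
            total += w_alpha * alpha**2 - w_beta * beta**2
        else:                                     # group 2: want b1 <= s < b2
            alpha = max(b1 - s, 0.0) + max(s - b2, 0.0)
            total += w_alpha * alpha**2
            if b1 <= s < b2:                      # penalize distance to the midpoint
                total += w_beta * (s - 0.5 * (b1 + b2))**2
    return total

res = minimize(objective, x0=[1.0, 0.1, 1.0, 3.0], method="Nelder-Mead")
print(res.x)                                      # X direction and boundaries b1, b2
```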
3. SVM PAIRWISE MULTICLASS
CLASSIFICATION
Statistical Learning Theory was proposed
by Vapnik and Chervonenkis in the 1960s.
The Support Vector Machine (SVM) is one of the
kernel-machine-based statistical learning
methods; it can be applied to various types
of data and can detect the internal relations
among the data objects. Given a set of data,
one can define the kernel matrix to construct
an SVM and compute an optimal hyperplane in
the feature space induced by a kernel
(Vapnik, 1995). There exist different multiclass
training strategies for SVM, such as one-
against-rest classification, one-against-one
(pairwise) classification, and error-correcting
output codes (ECOC).

LIBSVM is a well-known free software
package for support vector classification. We
use the latest version, LIBSVM 2.8, in our
experimental study. This software uses the one-
against-one (pairwise) method for multiclass
SVM (Chang and Lin, 2001). The one-
against-one method was first proposed by
Knerr et al. in 1990. It constructs $\frac{k(k-1)}{2}$
binary SVM classifiers in total, where each
classifier is trained on two distinct classes
of the total $k$ classes (Hsu and Lin, 2002). The
following quadratic program is used $\frac{k(k-1)}{2}$
times to generate the multi-category SVM
classifiers:

Min $(\tau/2)\,\|\psi\|^2 + (1/2)\,\|(X, b)\|^2$
Subject to: $D(AX - eb) \ge e - \psi$, where $e$ is a vector of ones.

After the $\frac{k(k-1)}{2}$ SVM classifiers
are produced, a majority vote strategy is
applied to the $\frac{k(k-1)}{2}$ classifiers: each
classifier has one vote, and every record is
predicted to be in the class with the largest vote.
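A minimal sketch of the one-against-one strategy with majority voting, built from per-pair binary SVMs; scikit-learn and the iris data are our illustrative stand-ins for LIBSVM and KDD99:

```python
# One-against-one (pairwise) multiclass SVM: k(k-1)/2 binary machines,
# each trained on one pair of classes, combined by majority vote.
from itertools import combinations
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)                      # k classes -> k(k-1)/2 classifiers

machines = {}
for a, b in combinations(classes, 2):       # train one binary SVM per class pair
    mask = (y == a) | (y == b)
    machines[(a, b)] = SVC(kernel="linear").fit(X[mask], y[mask])

def predict(x):
    votes = np.zeros(len(classes))
    for clf in machines.values():           # each classifier casts one vote
        votes[int(clf.predict(x.reshape(1, -1))[0])] += 1
    return int(np.argmax(votes))            # class with the largest vote wins

print(predict(X[0]), y[0])
```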
4. EXPERIMENTAL COMPARISON OF
MCMP AND PAIRWISE SVM
The KDD99 dataset was provided by the
Defense Advanced Research Projects Agency
(DARPA) in 1998 for the competitive
evaluation of intrusion detection approaches.
The KDD99 dataset contains 9 weeks of raw TCP
data from a simulation of a typical U.S. Air
Force LAN. A version of this dataset was
used in the 1999 KDD-CUP intrusion detection
contest (Stolfo et al. 2000). After the contest,
KDD99 has become a de facto standard
dataset for intrusion detection experiments.
Besides normal traffic, there are four main
categories of attacks: denial-of-service (DOS);
unauthorized access from a remote machine (R2L);
unauthorized access to local root privileges (U2R);
and surveillance and other probing (Probe).
Because the number of U2R attacks is too
small (52 records), only three types of attacks,
DOS, R2L, and Probe, are used in this
experiment. The KDD99 dataset used in this
experiment has 4898430 records and contains
1071730 distinct records.
MCMP was solved using LINGO 8.0, a software
tool for solving nonlinear models (LINDO
Systems Inc.). LIBSVM version 2.8 (Chang
and Lin, 2001), an integrated software package
which uses the pairwise approach for multiclass
SVM classification, was applied to the KDD99
data, and the classification results of LIBSVM
were compared with MCMP's.
The four-group classification results of
MCMP and LIBSVM on the KDD99 data are
summarized in Table 1 and Table 2,
respectively. The classification results are
displayed as confusion matrices,
which pinpoint the kinds of errors made.
From the confusion matrices in Tables 1 and
2, we observe that: (1) LIBSVM achieves
perfect classification for the training data (100%
accuracy), and the training results of MCMP are
almost perfect (100% accuracy for Probe
and DOS, 99% for Normal and R2L);
(2) contrasting LIBSVM's training accuracies
with its testing accuracies, its performance is
unstable: LIBSVM achieves almost perfect
classification for the Normal class (99.99%
accuracy) but poor performance on the three
attack types (44.48% for Probe, 53.17% for
R2L, and 74.49% for DOS); (3) MCMP has a
stable performance on the testing data: 97.2%
accuracy for Probe, 99.07% for DOS, 88.43%
for R2L, and 97.05% for Normal.
Table 1. MCMP KDD99 Classification Results

Evaluation on training data (400 cases):
(1)  (2)  (3)  (4)  <- classified as   Accuracy   False Alarm Rate
100    0    0    0  (1): Probe          100.00%      0.99%
  0  100    0    0  (2): DOS            100.00%      0.00%
  0    0   99    1  (3): R2L             99.00%      0.00%
  1    0    0   99  (4): Normal          99.00%      1.00%

Evaluation on test data (1071330 cases):
  (1)     (2)    (3)     (4)  <- classified as   Accuracy   False Alarm Rate
13366     216    145      24  (1): Probe           97.20%      7.88%
 1084  244867   1202      14  (2): DOS             99.07%      6.32%
    1       4    795      99  (3): R2L             88.43%     91.86%
   59   16313   7623  788718  (4): Normal          97.05%      0.02%
Table 2. LIBSVM KDD99 Classification Results

Evaluation on training data (400 cases):
(1)  (2)  (3)  (4)  <- classified as   Accuracy   False Alarm Rate
100    0    0    0  (1): Probe          100.00%      0.00%
  0  100    0    0  (2): DOS            100.00%      0.00%
  0    0  100    0  (3): R2L            100.00%      0.00%
  0    0    0  100  (4): Normal         100.00%      0.00%

Evaluation on test data (1071330 cases):
  (1)     (2)    (3)     (4)  <- classified as   Accuracy   False Alarm Rate
 6117     569      0    7065  (1): Probe           44.48%     67.84%
12861  184107      0   50199  (2): DOS             74.49%      0.31%
    0       0    478     421  (3): R2L             53.17%      6.64%
   41       0     34  812638  (4): Normal          99.99%      6.63%
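The per-class Accuracy and False Alarm Rate columns in these tables can be reproduced directly from the confusion matrices: the figures match row-wise accuracy and one minus precision, respectively (our reading of the tables, checked against the MCMP test matrix below):

```python
# Recompute the per-class Accuracy and False Alarm Rate of Table 1's test
# matrix (rows = true class, columns = predicted class).
import numpy as np

M = np.array([[13366,    216,  145,     24],     # Probe
              [ 1084, 244867, 1202,     14],     # DOS
              [    1,      4,  795,     99],     # R2L
              [   59,  16313, 7623, 788718]])    # Normal

for i, name in enumerate(["Probe", "DOS", "R2L", "Normal"]):
    t_pos = M[i, i]
    accuracy = t_pos / M[i].sum()                # share of class i classified correctly
    f_pos = M[:, i].sum() - t_pos                # other classes predicted as class i
    false_alarm = f_pos / (t_pos + f_pos)        # 1 - precision for class i
    print(f"{name}: accuracy {accuracy:.2%}, false alarm {false_alarm:.2%}")
```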
5. CONCLUSION
This is the first study to investigate the
differences between MCMP and pairwise
SVM for multiclass classification using a
large network intrusion dataset. The results
indicate that MCMP achieves better
classification accuracy than pairwise SVM. In
future research, we will focus on the
theoretical differences between these two
multiclass approaches.
References
Bradley, P.S., Fayyad, U.M., Mangasarian,
O.L. (1999) Mathematical programming for
data mining: Formulations and challenges.
INFORMS Journal on Computing, 11, 217-238.
Chang, C. C. and Lin, C. J. (2001) LIBSVM :
a library for support vector machines.
Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Hsu, C. W. and Lin, C. J. (2002) A
comparison of methods for multi-class
support vector machines, IEEE Transactions
on Neural Networks, 13(2), 415-425.
Knerr, S., Personnaz, L., and Dreyfus, G.
(1990), Single-layer learning revisited: A
stepwise procedure for building and training a
neural network, in Neurocomputing:
Algorithms, Architectures and Applications, J.
Fogelman, Ed. New York: Springer-Verlag.
Kou, G., Peng, Y., Shi, Y., Chen, Z. and Chen
X. (2004b) A Multiple-Criteria Quadratic
Programming Approach to Network Intrusion
Detection in Y. Shi, et al (Eds.): CASDMKM
2004, LNAI 3327, Springer-Verlag Berlin
Heidelberg, 145-153.
LINDO Systems Inc., An overview of LINGO
8.0,
http://www.lindo.com/cgi/frameset.cgi?leftlin
go.html;lingof.html.
Stolfo, S.J., Fan, W., Lee, W., Prodromidis, A.
and Chan, P.K. (2000) Cost-based Modeling
and Evaluation for Data Mining With
Application to Fraud and Intrusion Detection:
Results from the JAM Project, DARPA
Information Survivability Conference.
Vapnik, V. N. and Chervonenkis, A. Y. (1964), On
one class of perceptrons, Automation and Remote
Control, 25(1).
Vapnik, V. N. (1995), The Nature of
Statistical Learning Theory, Springer, New
York.
Zhu, D., Premkumar, G., Zhang, X. and Chu,
C.H. (2001) Data Mining for Network
Intrusion Detection: A comparison of
Alternative Methods, Decision Sciences,
Volume 32 No. 4, Fall 2001.
Pattern Recognition for Multimedia Communication
Networks Using New Connection Models
between MCLP and SVM
Jing HE
Institute of Intelligent Information
and Communication Technology
Konan University
Kobe 658-8501, Japan
Email: hejing@gucas.ac.cn
Wuyi YUE
Department of Information Science
and Systems Engineering
Konan University
Kobe 658-8501, Japan
Email: yue@konan-u.jp
Yong SHI
Chinese Academy of Sciences
Research Center on Data Technology
and Knowledge Economy
Beijing 100080, China
Email: yshi@gucas.ac.cn
Abstract: A data mining system for performance evaluation of
multimedia communication networks (MCNs) is a challenging
research and development issue. The data mining system offers
techniques for discovering patterns in voluminous databases. By
dividing the performance data into usual and unusual
categories, we try to find out the category corresponding to the
data mining system. Many pattern recognition algorithms for the
data mining system have been developed and explored in recent
years, such as rough sets, rough-fuzzy hybridization, granular
computing, artificial neural networks, support vector machines
(SVM), and multiple criteria linear programming (MCLP). In
this paper, a new connection model between MCLP and SVM is
employed to identify performance data. In addition to theoretical
foundations, the paper also includes experimental results. Some
real-time and nontrivial examples for MCNs given in this paper
show how MCLP and SVM work and how they can be combined
for use at the same time in practice. The advantages that each
algorithm offers are compared with those of the other methods.
I. INTRODUCTION
A data mining system for performance evaluation of
multimedia communication networks (MCNs) is a challenging
research and development issue. The data mining system offers
techniques for discovering patterns in voluminous databases.
Fraudulent activity costs the telecommunication industry
millions of dollars a year.

It is important to identify potentially fraudulent users and
their typical usage patterns, and to detect their attempts to gain
fraudulent entry in order to perpetrate illegal activity. Several
ways of identifying unusual patterns can be used, such as
multidimensional analysis, cluster analysis and outlier analysis
[1].

By dividing the performance data into usual and
unusual categories, we try to find out the category
corresponding to the data mining system. Many pattern recognition
algorithms for data mining have been developed and explored
in recent years, such as rough sets, rough-fuzzy hybridization,
granular computing, artificial neural networks, support
vector machines (SVM), multiple criteria linear programming
(MCLP) and so on [2].
SVM has been gaining popularity as one of the effective
methods for machine learning in recent years. In pattern
classification problems with two class sets, SVM generalizes
linear classifiers into high dimensional feature spaces through
non-linear mappings. The non-linear mappings are defined
implicitly by kernels in the Hilbert space, which means SVM
may produce non-linear classifiers in the original data space.
The linear classifiers are then optimized to give the maximal
margin separation between the classes [3]-[5].

Research on the linear programming (LP) approach to
classification problems was initiated in [6]-[8]; [9] and [10] applied
the compromise solution of MCLP to deal with the same question.
In [11], an analysis of fuzzy linear programming (FLP) for the
classification of credit card holder behaviors was presented.
During the calculations in [11], we found that,
except for approaches such as MCLP and SVM, many data
mining algorithms try to minimize the influence of outliers
or eliminate them altogether.

In other words, the unusual outliers may be of particular
interest, such as in the case of unusual pattern detection,
where unusual outliers may indicate fraudulent activities. Thus
the identification of usual and unusual patterns is an interesting
data mining task, referred to as pattern recognition.
In this paper, by dividing the performance data
into usual and unusual categories, we try to find out the
category corresponding to the data mining system. The new
pattern recognition model, which connects MCLP and SVM,
is employed to identify performance data.

Some real-time and non-trivial examples for MCNs with
different pattern recognition approaches, such as SVM, LP, and
MCLP, are given to show how the different techniques work
and can be used in practice. The advantages that the different
algorithms offer are compared with each other, and the results of
the comparisons are listed in this paper.

In Section II, we describe the basic formulas of MCLP
and SVM. Connection models between MCLP and SVM are
presented in Section III. The real-time data experiments of
pattern recognition for MCNs are given in Section IV.
Finally, we conclude the paper with a brief summary in Section
V.
II. BASIC FORMULA OF SVM AND MCLP
Support Vector Machines (SVMs) were developed in [3],
[12], and their main features are as follows:
(1) SVM maps the original data set into a high dimensional
feature space by a non-linear mapping implicitly defined
by kernels in the Hilbert space.
(2) SVM finds linear classifiers with the maximal margins on
the feature space.
(3) SVM provides an evaluation of the generalization ability.
A. Hard Margin SVM
We define two classes A and B among the training data
sets $x_i$, $i = 1, \ldots, l$. We use a variable $y_i$, with the two
values 1 and -1, to represent which of the classes A and
B a training data point belongs to: namely, if $x_i \in$ A, then $y_i = 1$;
if $x_i \in$ B, then $y_i = -1$.

Let $w$ be a separating hyperplane parameter and $b$ be
a separating parameter, where $w \in \mathbb{R}^n$ and $n$ is the
attribute size. Then we use a separating hyperplane $w^T x = b$
to separate the samples, where $b$ is a boundary value. From the
above definition, we know that $w^T x_i \ge b$ for $x_i \in$ A and
$w^T x_i < b$ for $x_i \in$ B. Such a method
for separating the samples is called classification.

The separating hyperplane with maximal margins can be
given by solving the problem with the normalization
$y_i(w^T x_i - b) = 1$ at the points with the minimum interior
deviation, as follows:

(M1) Min $\|w\|$ subject to $y_i(w^T x_i - b) \ge 1$, $i = 1, \ldots, l$    (1)

where $\|\cdot\|$ represents a norm function; the training data are given,
and $w$ and $b$ are unrestricted.

Several norms are possible. When $\|\cdot\|_2$ is used, the problem
reduces to quadratic programming, while the problem
with $\|\cdot\|_1$ or $\|\cdot\|_\infty$ reduces to linear programming [13].

The SVM method which can separate the two classes A and
B completely is called the hard margin SVM method. However, the
hard margin SVM method tends to cause over-learning.
The hard margin SVM method with the squared $\ell_2$ norm is given as
follows:

(M2) Min $\frac{1}{2}\|w\|_2^2$ subject to $y_i(w^T x_i - b) \ge 1$, $i = 1, \ldots, l$    (2)

where the training data are given, and $w$ and $b$ are unrestricted.

The aim of machine learning is to predict which class new
patterns belong to on the basis of the given training data set.
B. Soft Margin SVM
The hard margin SVM method is easily affected by noise.
In order to overcome this shortcoming, the soft margin SVM
method is introduced. The soft margin SVM method allows
some slight errors, represented by slack variables
(exterior deviations) $\xi_i \ge 0$, $i = 1, \ldots, l$. Using a trade-off
parameter $C$ between minimizing the norm $\|w\|$ and minimizing the
total slack $\sum_i \xi_i$, we have the soft margin SVM method as follows:

(M3) Min $\frac{1}{2}\|w\|_2^2 + C \sum_i \xi_i$ subject to $y_i(w^T x_i - b) \ge 1 - \xi_i$    (3)

where $C$ and the training data are given; $w$ and $b$ are
unrestricted, and $\xi_i \ge 0$.

It can be seen that the idea of the soft margin SVM method
is the same as that of the linear programming approach to linear
classifiers. This idea was used in an extension by [14]. Not
only exterior deviations but also interior deviations can be
considered in SVM. We then propose various algorithms of
SVM considering both slack variables for misclassified
data points (i.e., exterior deviations) and surplus variables for
correctly classified data points (i.e., interior deviations).

In order to minimize the slackness and to maximize the
surplus, the surplus variable (interior deviation) $\eta_i \ge 0$ is used,
$i = 1, \ldots, l$. The trade-off parameter $C_1$ is used for the
slackness variables, and another trade-off parameter $C_2$ is
used for the surplus variables. Then we have the optimization
problem as follows:

(M4) Min $\frac{1}{2}\|w\|_2^2 + C_1 \sum_i \xi_i - C_2 \sum_i \eta_i$ subject to $y_i(w^T x_i - b) \ge 1 - \xi_i + \eta_i$    (4)

where $C_1$, $C_2$ and the training data are given; $w$ and $b$ are
unrestricted, and $\xi_i, \eta_i \ge 0$.
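In practice, the trade-off of the soft margin method corresponds to the C parameter of common SVM implementations; a minimal sketch with scikit-learn (an illustrative stand-in, not the authors' software):

```python
# The soft margin trade-off in scikit-learn: large C approximates the hard
# margin method, small C tolerates more slack. Data are illustrative.
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
hard_ish = SVC(kernel="linear", C=1e6).fit(X, y)   # nearly hard margin
soft = SVC(kernel="linear", C=0.1).fit(X, y)       # more slack allowed
print(hard_ish.score(X, y), soft.score(X, y))
```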
C. MCLP
For the classification explained in Subsection A, the multiple
criteria linear programming (MCLP) model is used. We
want to determine the best coefficients of the variables,
$w = (w_1, \ldots, w_n)$, obtained from the following Eq. (5), where $n$ is
the attribute size. A boundary value $b$ is used to separate the two
classes A and B:

$w^T x_i \ge b$, $x_i \in$ A, and $w^T x_i < b$, $x_i \in$ B    (5)

where $y_i$ is defined in Subsection A; the training data are given, and
$w$ and $b$ are unrestricted.

Eq. (5) is equal to the following equation:

$y_i (w^T x_i - b) \ge 0$    (6)

where $y_i$ is defined in Subsection A; the training data are given,
and $w$ and $b$ are unrestricted.

Let $\xi_i \ge 0$ denote the exterior deviation, which
is a deviation from the hyperplane $w^T x = b$ for misclassified
points. Similarly, let $\eta_i \ge 0$ denote the interior deviation,
which is a deviation from the hyperplane $w^T x = b$ for correctly
classified points. Our purposes are as follows: (1)
to minimize the maximum exterior deviation (decrease errors
as much as possible); (2) to maximize the minimum interior
deviation (i.e., maximize the margins); (3) to minimize the
weighted sum of exterior deviations (MSD); (4) to maximize
the weighted sum of interior deviations (MMD).

MSD can be written as follows:

Min $\sum_i \xi_i$ subject to $y_i(w^T x_i - b) \ge -\xi_i$, $x_i \in$ A $\cup$ B    (7)

where the training data are given; $w$ and $b$ are unrestricted, and $\xi_i \ge 0$.
Then,

(M5) Min $\sum_i \xi_i$    (8)

where the training data are given, and $w$ and $b$ are unrestricted.

The alternative to the above model is to find MMD as
follows:

Max $\sum_i \eta_i$ subject to $y_i(w^T x_i - b) \ge \eta_i$, $x_i \in$ A $\cup$ B    (9)

where the training data are given; $w$ and $b$ are unrestricted, and $\eta_i \ge 0$.
Then,

(M6) Max $\sum_i \eta_i$    (10)

where the training data are given, and $w$ and $b$ are unrestricted.

[11] applied the compromise solution of multiple criteria
linear programming to minimize the sum of the $\xi_i$ and maximize
the sum of the $\eta_i$ simultaneously. A two-criteria linear programming
model is given as follows:

(M7) Min $\sum_i \xi_i$ and Max $\sum_i \eta_i$ subject to $y_i(w^T x_i - b) \ge \eta_i - \xi_i$    (11)

where the training data are given; $w$ and $b$ are unrestricted, and $\xi_i, \eta_i \ge 0$.

A hybrid model presented in [8] that combines Eq. (8) and
Eq. (10) is given as follows:

Min $\sum_i \xi_i - \sum_i \eta_i$ subject to $y_i(w^T x_i - b) \ge \eta_i - \xi_i$    (12)

where the training data are given; $w$ and $b$ are unrestricted, and $\xi_i, \eta_i \ge 0$.
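A minimal sketch of the MSD model solved as an LP with scipy, for a fixed boundary value b; the data and the fixed-b simplification are our assumptions:

```python
# MSD as an LP: minimize the sum of exterior deviations xi_i for a fixed
# boundary b. Variables are (w, xi_1..xi_n); xi_i >= 0.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
A_cls = rng.normal(2.0, 1.0, size=(30, 3))        # class A records (high scores)
B_cls = rng.normal(0.0, 1.0, size=(30, 3))        # class B records (low scores)
b = 1.0                                           # fixed boundary value
n, r = 60, 3                                      # records, attribute size

# Class A: w.x_i >= b - xi_i  ->  -x_i.w - xi_i <= -b
# Class B: w.x_i <= b + xi_i  ->   x_i.w - xi_i <=  b
G = np.zeros((n, r + n))
h = np.zeros(n)
G[:30, :r], G[30:, :r] = -A_cls, B_cls
G[np.arange(n), r + np.arange(n)] = -1.0
h[:30], h[30:] = -b, b

c = np.concatenate([np.zeros(r), np.ones(n)])     # minimize sum of xi
res = linprog(c, A_ub=G, b_ub=h, bounds=[(None, None)] * r + [(0, None)] * n)
print(res.x[:r])                                  # the coefficient vector w
```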
III. CONNECTION BETWEEN MCLP AND SVM
A. Linearly Separable Examples

It should be noted that the LP of Eq. (8) may yield some
unacceptable solutions, such as the trivial solution $w = 0$, as well as
unbounded solutions in the goal programming approach. Therefore, some
appropriate normality condition must be imposed on $w$ in
order to provide a bounded nontrivial optimal solution. One
such normality condition is $\|w\| = 1$.

If the classification is linearly separable, then using the
normalization $\|w\| = 1$, the separating hyperplane
with the maximal margins can be given by solving the
following problem:

(M8) Max $\eta$ subject to $y_i(w^T x_i - b) \ge \eta$, $\|w\| = 1$    (13)

where $y_i$ and $\eta$ are defined in Section II; the training data are given,
and $w$ and $b$ are unrestricted.

However, this normality condition makes the problem a
non-linear optimization model. Instead of maximizing the minimum
interior deviation in Eq. (13), we can use the following
equivalent formulation with the normalization $y_i(w^T x_i - b) = 1$
at the points with the minimum interior deviation [15].

Theorem. The discrimination problem of Eq. (13) is equivalent
to the formulation used in Eq. (1):

(M1) Min $\|w\|$ subject to $y_i(w^T x_i - b) \ge 1$

where the training data are given, and $w$ and $b$ are unrestricted.
Proof: The above M1 can be rewritten as follows:

Min $\frac{1}{2}\|w\|_2^2$ subject to $y_i(w^T x_i - b) \ge 1$, $i = 1, \ldots, l$    (14)

where $w \in \mathbb{R}^n$ and $n$ is the attribute size; the training data
are given, and $w$ and $b$ are unrestricted.

First notice that any optimal solution to Eq. (1) must satisfy
the constraints with equality at the points of minimum interior
deviation. Otherwise we could rescale the solution to reduce the
norm further,
an impossibility since the objective is minimal at the optimum
in the strictly convex case. Similarly, the constraints are tight at
the optimum of Eq. (14).

Let $(w, b, \eta)$ be an optimal vector for Eq. (13). Then
$(\bar{w}, \bar{b}) = (w/\eta, b/\eta)$ is well defined for Eq. (14).
Assume it is not the optimal solution for Eq. (14), and let
$(w', b')$ be the optimal solution instead. Then $\|w'\| < \|\bar{w}\|$,
and $(w'/\|w'\|, b'/\|w'\|)$ is feasible for
Eq. (13) with margin $1/\|w'\| > 1/\|\bar{w}\| = \eta$
(the constraint is tight at the optimum), in
contradiction with the optimality of $(w, b, \eta)$. Hence $(\bar{w}, \bar{b})$ is
the optimal solution for Eq. (14).

Now let $(w, b)$ be the optimal solution for Eq. (14).
Then $(w/\|w\|, b/\|w\|, 1/\|w\|)$ is
defined for Eq. (13). Again, assume that it is the suboptimal
solution, let $(w', b', \eta')$ be the optimal solution with $\eta' > 1/\|w\|$,
and define $(\bar{w}, \bar{b}) = (w'/\eta', b'/\eta')$. We have $\|\bar{w}\| = \|w'\|/\eta'
= 1/\eta' < \|w\|$, in contradiction with the optimality
of $(w, b)$.

Then M1 and M8 are the same, and the Theorem is proved.
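A small numeric check of the theorem on a toy separable set: solving the min-norm problem (14) and confirming that $1/\|w\|$ equals the geometric margin; data, solver choice, and tolerances are illustrative assumptions:

```python
# Solve (14) on a tiny separable set and check 1/||w|| against the
# geometric margin computed directly from the solution.
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def obj(p):                                   # p = (w1, w2, b)
    return 0.5 * np.dot(p[:2], p[:2])         # (1/2)||w||^2, Eq. (14)

cons = [{"type": "ineq", "fun": (lambda p, i=i: y[i] * (X[i] @ p[:2] - p[2]) - 1)}
        for i in range(len(y))]               # y_i (w.x_i - b) >= 1

res = minimize(obj, x0=[1.0, 0.0, 0.0], constraints=cons, method="SLSQP")
w, b = res.x[:2], res.x[2]
margin = (y * (X @ w - b)).min() / np.linalg.norm(w)
print(1.0 / np.linalg.norm(w), margin)        # both should be 1.0 for this data
```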
B. Linearly Inseparable Examples

As mentioned for Eq. (8), MSD is:

(M5) Min $\sum_i \xi_i$

where $\xi_i$ and $y_i$ are defined in Section II; the training data
are given, and $w$ and $b$ are unrestricted.

According to the Theorem, Eq. (8) can be combined with the
min-norm form of Eq. (1):

(M1) Min $\|w\|$

where the training data are given, and $w$ and $b$ are unrestricted.

Then we use the squared $\ell_2$ norm in Eq. (1). Choosing $C$
as the trade-off parameter between minimizing $\|w\|_2^2$ and minimizing
$\sum_i \xi_i$, we have the formulation for the soft margin SVM
method combining Eq. (8) with Eq. (1) as follows:

Min $\frac{1}{2}\|w\|_2^2 + C \sum_i \xi_i$ subject to $y_i(w^T x_i - b) \ge 1 - \xi_i$    (15)

where $C$ and the training data are given; $w$ and $b$ are unrestricted.

Eq. (15) is the same as the SVM formula in Eq. (3).
IV. PATTERN RECOGNITION FOR MCNS
A. Real-time Experiment Data

A set of attributes for MCNs, such as throughput capacity,
package forwarding rate, response time, connection attempts,
delay time and transfer rate, together with the criteria for unusual
patterns, is designed. In these real-time experiments, the two
classes of training data sets in MCNs are A and B, as
defined in Section II. Class A represents the usual
pattern, and class B represents the unusual pattern.

The purpose of pattern recognition techniques for MCNs is
to find a good classifier through a training data set and use
the classifier to predict all other performance data of MCNs.
The most frequently used pattern recognition approach in the
telecommunication industry is still the two-class separation
technique. The key question of two-class separation is to separate
the unusual patterns, called fraudulent activity, from the usual
patterns, called normal activity. The pattern recognition model is to
identify as many MCNs as possible; this is also known as the
method of detecting a fraudulent list. In this section, a real-
time performance data mart with 65 derived attributes and
1000 records from a major CHINA TELECOM MCN database
is first used to train the different classifiers. Then, the training
solutions are employed to predict the performance of another
5000 MCNs. Finally, the classification results of the different
models are compared with each other.
B. Accuracy Measures

We would like to be able to assess how well the classifier
can recognize usual samples (referred to as positive samples)
and how well it can recognize unusual samples (referred to
as negative samples). The sensitivity and specificity measures
can be used, respectively, for this purpose. In addition, we may
use precision to assess the percentage of samples labeled as
positive that actually are positive. These measures
are defined as follows:

Sensitivity = t_pos / pos
Specificity = t_neg / neg
Precision = t_pos / (t_pos + f_pos)

where t_pos is the number of true positive samples (usual
samples that were correctly classified as such), pos is the
number of positive samples (usual samples), t_neg is the
number of true negative samples (unusual samples that
were correctly classified as such), neg is the number of
negative samples (unusual samples), and f_pos is the number
of false positive samples (unusual samples that were
incorrectly labeled as usual). It can be shown that Accuracy
is a function of Sensitivity and Specificity as follows:

Accuracy = Sensitivity × pos / (pos + neg) + Specificity × neg / (pos + neg)

The higher the four rates (Sensitivity rate, Specificity rate,
Precision rate, Accuracy rate) are, the better the classification
results are.

A threshold in this paper is defined to set against specificity
and precision, depending on the required performance
evaluation of MCNs.
C. Experiment Results
Previous experience with classification tests showed that the
training results of a data set with balanced records (the number
of usual samples equal to the number of unusual samples) may
differ from those of an unbalanced data set (the number of
usual samples not equal to the number of unusual samples).

Given 1000 unbalanced training accounts, where
860 samples are usual and 140 are unusual, models M1
to M8 can be used for testing. M1 to M8 are given in Sections II
and III. Namely, M1 is the SVM model minimizing the norm
$\|w\|$; M2 is the SVM model with the squared-norm objective;
M3 is the soft margin SVM model that also penalizes the exterior
deviations; M4 is the SVM model that minimizes
the slackness and maximizes the surplus; M5 is the linear
programming model minimizing the sum of exterior deviations,
called the MSD model; M6 is the linear programming model
maximizing the sum of interior deviations, called the MMD
model; M7 is the MCLP model; and M8 is the MCLP model using
the normalization. $b$ is the boundary value for each model,
and a range of boundary values $b$ is used to
calculate the results for models M1 to M8.

A well-known commercial software package, Lingo [16], has
been used to perform the training and predicting processes.
The learning results for the unbalanced 1000 records in Sensitivity
and Specificity are shown in Table 1, where the columns labeled
H are the Sensitivity rates for the usual pattern, and the columns
labeled K are the Specificity rates for the unusual pattern.

Table 1: Learning Results of Unbalanced 1000 Records in
Sensitivity and Specificity.
Table 1 shows the learning results of models M1 to M8
for different values of the boundary $b$. If a threshold for
the specificity rate K is predetermined, then M1 and M8 (for
all tested values of $b$), and M3, M4, M6 and M7 (for certain
values of $b$) qualify as better classifiers.
M1 and M8 have the same results for H and K for all values
of $b$.

The best specificity rate model at the threshold in the
learning results for unusual patterns in K is M1/M8.
The order of the learning results for unusual patterns
in specificity K is M8 = M1, M6, M3, M7, M4, M2, M5.
Table 2 shows the predicting results for the unbalanced 5000
records in Precision with models M1 to M8 for different values
of the boundary $b$.

Table 2: Predicting Results of Unbalanced 5000 Records in
Precision.

The Precision rates of models M3, M7 and M4 are as high as
in the learning results. M1 and M8 have the same results for H
and K for all values of $b$. If the threshold for the precision
of pattern recognition is predetermined as 0.9, then M3 and M8
(for certain values of $b$) qualify as better classifiers. The best
model at the threshold in the learning results is M3. The order
of average predicting precision is M3, M7, M4, M2, M5, M6,
M1, M8.
In the data mart of Table 2, M1 and M8 have similar
structures and solution characterizations due to the formulation
presented in Section III. When the classification aims at
higher specificity, M1 or M8 gives better results; when
the classification aims at higher precision, M3, M4 and M7
give better results.
V. CONCLUSION
In this paper, we have proposed a heuristic connection
classification method to recognize unusual patterns in
multimedia communication networks (MCNs). This algorithm is
based on the connection model between multiple criteria linear
programming (MCLP) and support vector machines (SVM).
Although the mathematical modeling is not new, the
framework of the connection configuration is innovative. In addition,
empirical training sets and the prediction results on real-
time MCNs from a major company, CHINA TELECOM, were
listed. Comparison studies have shown that the connection
model combining MCLP and SVM achieves better
learning results with respect to predicting the future
performance patterns of MCNs. The connection model also has a
great deal of potential to be used in various data mining tasks.
Since the connection model is readily implemented by non-
linear programming, any available non-linear programming
package, such as Lingo, can be used to conduct the data
analysis. In the meantime, we have explored other possible
connections between SVM and MCLP. The results of ongoing
projects to solve more complex problems will be reported in
the near future.
ACKNOWLEDGMENT
This work was supported in part by GRANT-IN-AID FOR
SCIENTIFIC RESEARCH (No. 16560350) and MEXT.ORC
(2004-2008), Japan and in part by NSFC (No. 70472074),
China.
REFERENCES
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, An
Imprint of Academic Press, San Francisco, 2003.
[2] S. Pal and P. Mitra, Pattern Recognition Algorithms for Data Mining,
A CRC Press Company, 2004.
[3] V. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York,
1998.
[4] O. Mangasarian, Linear and Nonlinear Separation of Pattern by Linear
Programming, Operations Research, 31(1): 445-453, 1965.
[5] O. Mangasarian. Multisurface Method for Pattern Separation, IEEE
Transactions on Information Theory, IT-14: 801-807, 1968.
[6] N. Freed and F. Glover, Simple but Powerful Goal Programming Models
for Discriminant Problems, European Journal of Operational Research,
7(3): 44-60, 1981.
[7] N. Freed and F. Glover, Evaluating Alternative Linear Programming
Models to Solve the Two-group Discriminant Problem, Decision Science,
17(1): 151-162, 1986.
[8] F. Glover, Improve Linear Programming Models for Discriminant Anal-
ysis, Decision Sciences, 21(3): 771-785, 1990.
[9] G. Kou, X. Liu, Y. Peng, Y. Shi, M. Wise and W. Xu, Multiple Criteria
Linear Programming Approach to Data Mining: Models, Algorithm
Designs and Software Development, Optimization Methods and Software,
18(4): 453-473, 2003.
[10] G. Kou and Y. Shi, Linux based Multiple Linear Programming Clas-
sication Program: Version 1.0, College of Information Science and
Technology, University of Nebraska-Omaha, U.S.A., 2002.
[11] J. He, X. Liu, Y. Shi, W. Xu and N. Yan, Classication of Credit
Cardholder Behavior by using Fuzzy Linear Programming, International
Journal of Information Technology Decision Making, 3(4): 223-229,
2004.
[12] C. Cortes and V. Vapnik, Support Vector Networks, Machine Learning,
20(3): 273-297, 1995.
[13] O. Mangasarian, Arbitrary-Norm Separating Plane, Operations Research
Letters 23, 1999.
[14] K. Bennett and O. Mangasarian, Robust Linear Programming Discrim-
ination of Two Linearly Inseparable Sets, Optimization Methods and
Software, 1: 23-34, 1992.
[15] P. Marcotte and G. Savard, Novel Approaches to the Discrimination
Problem, ZOR - Methods and Models of Operations Research, 36: 517-
545, 1992.
[16] http://www.lindo.com/.
[17] J. He, W. Yue and Y. Shi, Identification Mining of Unusual Patterns
for Multimedia Communication Networks, Abstract Proc. of Autumn
Conference 2005 of Operations Research Society of Japan, 262-263,
2005.
[18] Y. Shi and J. He, Computer-based Algorithms for Multiple Criteria and
Multiple Constraint Level Integer Linear Programming, Computers and
Mathematics with Applications, 49(5): 903-921, 2005.
[19] T. Asada and H. Nakayama, SVM using Multi Objective Linear Pro-
gramming and Goal Programming, T. Tanino, T. Tanaka and M. Inuiguchi
(eds), Multi-objective Programming and Goal Programming, 93-98, 2003.
[20] H. Nakayama and T. Asada, Support Vector Machines Formulated as
Multi Objective Linear Programming, Proc. of ICOTA2001, 1171-1178,
2001.
[21] M. Yoon, Y. B. Yun, and H. Nakayama, A Role of Total Margin in
Support Vector Machines, Proc. of IJCNN03, 7(4): 2049-2053, 2003.
[22] W. Yue, J. Gu and X. Tang, A Performance Evaluation Index System
for Multimedia Communication Networks and Forecasting for Web-based
Network Trafc, Journal of Systems Science and Systems Engineering,
13(1): 78-97, 2002.
[23] J. He, Y. Shi and W. Xu, Classification of Credit Cardholder Behavior
by using Multiple Criteria Non-linear Programming, Proc. of the
International Conference on Data Mining and Knowledge Management,
Lecture Notes in Computer Science series, Springer-Verlag, 2004.
[24] http://www.rulequest.com/see5-info.html/.
[25] http://www.sas.com/.
[26] Y. Shi, M. Wise, M. Luo and Y. Lin, Data Mining in Credit Card
Portfolio Management: a Multiple Criteria Decision Making Approach,
Multiple Criteria Decision Making in the New Millennium, Springer,
Berlin, 2001.
Published by
Department of Mathematics and Computing Science
Technical Report Number: 2005-05, November 2005
ISBN 0-9738918-1-5