
AI in China

Machine Learning:
The State of the Art

Jue Wang and Qing Tao, Institute of Automation, Chinese Academy of Sciences

This article discusses ML's advances in light of theoretical and practical difficulties, analyzing statistical interpretations, algorithm design, and the Rashomon problem.

Internet browsers have facilitated the acquisition of large amounts of information. This technology's development has greatly surpassed that of data analysis, making large chunks of data uncontrollable and unusable. This is one reason why so many people are enthusiastic about machine learning (ML). In fields such as molecular biology and the Internet, if you're concerned only about the ease of gene sorting and distributing information, you don't need ML. However, if you're concerned about advanced problems such as gene functions, understanding information, and security, ML is inevitable.

Many researchers quote Herbert Simon in describing ML:

    Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the same task or tasks drawn from the same population more efficiently and more effectively the next time.1

This is a philosophical and ubiquitous description of ML. But you can't design algorithms and analyze the problem with this definition. Computer scientists and statisticians are more interested in ML algorithms and their performance. In their eyes, ML is

    the process (algorithm) of estimating a model that's true to the real-world problem with a certain probability from a data set (or sample) generated by finite observations in a noisy environment.

This definition involves five key terms:

• The real-world problem. Suppose the true model for this problem is y = F(x), where x is the variables (features and attributes) describing the problem, y is the outcome of x in the problem, and F is the relationship between the variables and the outcome. ML's mission is to estimate a function f(x) that approximates F(x).
• Finite observations (the sample set). All the data in the sample set S = {(x_k, y_k)}, k = 1, ..., n, are recordings of finite independent observations of y = F(x) in a noisy environment.
• Sampling assumption. The sample set is an independent and identically distributed (i.i.d.) sample from an unknown joint distribution function P(x, y).
• Modeling. Consider a parametric model space. The parameters are estimated from the sample set to get an f(x) that's a good approximation of F(x). The approximation measure is the average error for all samples in terms of the distribution P(x, y). We then say that the estimated model f(x) is true to the real-world problem F(x) with a certain probability, and we call f(x) a model of F(x).
• Process (the algorithm). ML algorithms should have some statistical guarantees. Also, the algorithm will transform the modeling problem into a parametric optimization problem in the model space. In most ML approaches, the learned model has a linear dependence on these parameters.
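The "average error in terms of the distribution P(x, y)" can be written compactly. The following is the standard formulation from statistical learning theory rather than notation the authors introduce; L denotes a loss function chosen for the problem:

\[
R(f) = \int L\bigl(y, f(x)\bigr)\,dP(x, y), \qquad
R_{\mathrm{emp}}(f) = \frac{1}{n}\sum_{k=1}^{n} L\bigl(y_k, f(x_k)\bigr).
\]

Because P(x, y) is unknown, an algorithm can only minimize the empirical risk R_emp(f) (or a regularized variant of it) over the model space and must then argue, with some probability over the draw of the sample, that the expected risk R(f) is also small; this is the sense in which f(x) is "true to" F(x) with a certain probability.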




ML involves both fundamental theoretical difficulties and practical difficulties. Theoretical difficulties include these:

• The number of variables in the sample set (dimensions) is generally very large or even huge; compared with it, the sample set is very small.
• The model of the real-world problem is usually highly nonlinear.

Practical difficulties include these:

• Sometimes, the sample set cannot be represented as a vector in a given linear space, and there are specific relations among the data.
• The data's outcome can take many forms (it could be discrete, it could be continuous, it could have an ordered structure, or it even could be missing).
• The data set can be a product of many meaningful models intertwined together.
• There are outliers to models that must not be ignored in some cases, which has inspired many new learning models.

Here, we discuss ML's advances in light of these difficulties.

The Development of ML
The term machine learning has become increasingly popular in recent decades. Other fields have terms with similar meanings, such as data analysis in statistics and pattern classification in pattern recognition. In the early days of AI, ML research used mainly symbolic data, and algorithm design was based on logic.2-6

At about the same time, Frank Rosenblatt proposed the perceptron, a statistical approach based on empirical risk minimization.7 However, this approach remained unrecognized and undeveloped in the following decades. Nevertheless, this statistical-modeling methodology has been highly regarded in pattern classification research and has become fundamental in statistical pattern recognition.8

The real development of statistical learning came after 1986, when David Rumelhart and James McClelland proposed the nonlinear backpropagation algorithm.9 AI, pattern recognition, and statistics researchers became interested in this approach. And, under the impulsion from real-world demands, ML, especially statistical ML, gradually became independent from traditional AI and pattern recognition. Researchers shunned early symbolic ML methods because they lacked good theoretical generalization guarantees. However, because the complexities of real-world data make a ubiquitous learning algorithm impossible, the quality of the data and background knowledge could be the key to ML's success. The intrinsic readability (interpretability) of symbolic methods could allow ML to regain popularity because its ability to simplify data sets provides a possible way humans can understand data quality and make alterations to improve that quality.

The research community is no longer interested in the algorithm design principles of nonlinear backpropagation. But this research reminded people of nonlinear algorithms' importance. Designing nonlinear learning algorithms theoretically and systematically is an important impetus of current statistical ML research.

ML has become an important topic for pattern recognition researchers and others. In 2001, Leo Breiman published "Statistical Modeling: The Two Cultures," which viewed ML as a subcategory of statistics.10 This approach led to many new directions and ideas for ML.

In our view, ML actually has two foundations. The first is statistics. Because ML's objective is to estimate a model from observed data, it must use statistical measures to evaluate the model's performance and estimate the model. ML also needs statistics to filter noise in the data. The second foundation is computer science algorithm design methodologies, for optimizing parameters. In recent years, ML's main objective has been to find a learner linearly dependent on these parameters.

In the past decade, ML has been in a margin era.11,12 Geometrically speaking, the margin of a classifier is the minimal distance of training points from the decision boundary. Generally, a large margin implies good generalization performance.
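For a linear classifier f(x) = w · x + b on training samples (x_k, y_k) with y_k in {-1, +1}, this geometric margin can be written explicitly (a standard textbook formulation, not one spelled out in this article):

\[
\gamma = \min_{1 \le k \le n} \frac{y_k\,(w \cdot x_k + b)}{\lVert w \rVert}.
\]

Maximizing gamma subject to correct classification of the training points is exactly what large-margin methods such as SVMs optimize.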
Anselm Blumer was the first to use VC (Vapnik-Chervonenkis) dimensions and the probably approximately correct (PAC) framework to study learners' generalization ability.13 In 1995, Vladimir Vapnik published the book The Nature of Statistical Learning Theory,11 which included his work on determining PAC bounds of generalization based on the VC dimensions. Blumer's and Vapnik's research serves as the theoretical cornerstones of finite-sample statistical-learning theory. In 1998, John Shawe-Taylor related generalization to the margin of two closed convex sets in the feature space.14

These theories led to the large-margin algorithm design principle, which has three points:

• Generalization is based on finite samples, which is consistent with application scenarios.
• These algorithms deal with nonlinear classification using the kernel trick. This method employs a linear classifier algorithm to solve a nonlinear problem by mapping the original nonlinear observations into a higher-dimensional space (see the sketch after this list).
• The algorithms estimate linear classifiers and are transformed into convex optimization problems. The algorithms' generalization depends on maximizing the margins, with a clear geometric meaning.
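As a rough illustration of the second point, the sketch below (which assumes NumPy and scikit-learn are available and is not code from the article or its references) maps 2D points that are not linearly separable, two concentric circles, into a higher-dimensional space where a plain linear classifier separates them. A kernel SVM obtains the same effect implicitly through inner products.

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.svm import LinearSVC

    # Two concentric circles: not linearly separable in the original 2D space.
    X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

    def feature_map(X):
        """Explicit polynomial-style map (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2).
        A kernel method would use the corresponding inner product instead of
        constructing these coordinates explicitly."""
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

    linear_original = LinearSVC(C=1.0).fit(X, y)
    linear_mapped = LinearSVC(C=1.0).fit(feature_map(X), y)

    print("accuracy in input space:  ", linear_original.score(X, y))
    print("accuracy in feature space:", linear_mapped.score(feature_map(X), y))

The linear classifier fails in the input space and succeeds after the mapping, which is the whole content of the kernel trick; the choice of mapping here is arbitrary and only for illustration.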
We've done margin-related research on improving supervised algorithms and designing unsupervised algorithms.15-18 For example, taking the viewpoint that the desired domain containing the training samples should have as few outliers as possible, we define the margin of one-class problems and derive a support-vector-machine-like framework for solving a new one-class problem with a predefined threshold.15 We've also modified support vector machines (SVMs) by weighting the margin to solve unbalanced problems.16



Meanwhile, Robert Schapire proved that, in the PAC framework, a concept is weakly learnable if and only if it is strongly learnable.19 This means if you build a group of models that have a precision slightly better than a random guess (greater than 50 percent) and combine them correctly, you can have a model with arbitrarily high precision. Afterward, this concept of grouping models was developed further under the margin theory framework, which created a plethora of boosting methods with statistical interpretations via the margin.20,21

However, statisticians have begun to doubt the margin's usefulness as a statistical measure, as we shall see in the next section. ML seems to have entered a new cycle, one characterized by consistency of loss functions with statistical interpretations.22,23

Statistical Interpretations
AI researchers were reluctant to use statistics in ML, with the fate of the original perceptrons serving as a typical example. The perceptron has its deficiencies; for example, it can't deal with linearly inseparable data. However, AI researchers snubbed it mainly because they didn't like models that people can't understand (black boxes). Also, the differing schools of thought in statistics perplexed AI researchers. In addition, the statistical theories at that time assumed that the number of labeled data approaches infinity, which made AI researchers hesitant to base ML theory on statistics.

However, pattern recognition pioneers clearly were aware of statistics' importance. In 1973, Richard Duda and Peter Hart published Pattern Classification and Scene Analysis.8 They used Bayesian decision theory as the basis of the classification problem; that is, they evaluated classification models by the models' deviance from the Bayes classifier. ML researchers now widely acknowledge this criterion. In addition, for Duda and Hart, approximately correct (with a nonzero error rate) model precision was acceptable. This deviates from statistics, which assumes the model should be consistent with the true model when the sample size goes to infinity. This is similar in spirit to PAC theory, which assumes precision with a probability of 1 - delta (delta > 0), instead of a probability of 1. However, it wasn't until 10 years later that Leslie Valiant proposed PAC theory,24 which clearly describes this idea.
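In the PAC framework, "approximately correct with high probability" has a precise reading; the following schematic statement is the standard one and is added here only for concreteness, with epsilon the tolerated error and delta the failure probability:

\[
\Pr_{S \sim P^{\,n}}\bigl[\,\operatorname{err}_P(f_S) \le \epsilon\,\bigr] \ge 1 - \delta,
\]

where f_S is the model the algorithm learns from the i.i.d. sample S and the probability is over the draw of S. Valiant additionally required the algorithm to run in time polynomial in 1/epsilon, 1/delta, and the problem size.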
In fact, in the early '70s, Vapnik proposed the finite-sample statistical theory. Although the theory used VC dimensions to describe the model space's complexity, it neither explained the reasons for using the approximated correctness of models nor provided clues on developing algorithms. So, AI researchers didn't adopt it. Methods from this statistical theory remained unacknowledged until Vapnik's book in 1995 and particularly Shawe-Taylor's research on margin bounds in 1998.

However, many statistics researchers still widely criticized Vapnik's finite-sample statistical theory. Vapnik's argument was mainly that you could use the margin as a statistic to evaluate a model's performance. Here are some of the major criticisms:

    [Schapire and his colleagues] offered an explanation of why Adaboost [Adaptive Boosting, an ML algorithm] works in terms of its ability to reduce the margin. Comparing Adaboost to our optimal arcing algorithm shows that their explanation is not valid and that the answer lies elsewhere. In this situation the VC-type bounds are misleading.25

    The bounds and the theory associated with the AdaBoost algorithms are interesting, but tend to be too loose to be of practical importance. In practice, boosting achieves results far more impressive than the bounds would imply.26

    In the presence of noise many of the known bounds cannot predict well, neither for small nor for large sample sizes.27

    The margin idea mixes the two aspects [the bias and variance] together so that it is not clear which aspect is the main contribution to the success of these so-called margin maximization methods. Moreover, from the margin concept, we are unable to characterize the impact of different loss functions, and we are unable to analyze the closeness of a classifier obtained from convex risk minimization to the optimal Bayes classifier.22

    What is special with the SVM is not the regularization term, but is rather the loss function, that is, the hinge loss. [Yi Lin23] pointed out that the hinge loss is Bayes consistent, that is, the population minimizer of the loss function agrees with the Bayes rule in terms of classification. This is important in explaining the success of the SVM, because it implies that the SVM is trying to implement the Bayes rule.28

Maybe it should be no wonder that a concept such as the margin would generate disputes among statisticians. But how do computer scientists treat margins? Do they believe that you can use them as a statistic in evaluating models? Although computer scientists haven't criticized the margin concept directly (at least we haven't found any related reports), their experiments haven't used the margin as a criterion to evaluate model precision. They still use classic bias (error rates) and variance (cross-validation) as criteria. Computer scientists like to use algorithmic issues to explain why they would use traditional measures. But they've used the margin only as a guide to design algorithms.

Statisticians have now done more in-depth research on margins and have pointed out that for 0-1 loss, the margin as a loss function has some connections with Bayes risk.22 However, research on loss functions is still based on the infinite-sample assumption. Obtaining tighter, more useful PAC generalization bounds remains difficult.
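Lin's Bayes-consistency observation quoted above can be stated concretely. Writing the hinge loss as phi(t) = max(0, 1 - t) and eta(x) = P(y = +1 | x), the function minimizing the expected hinge loss E[phi(y f(x))] is

\[
f^{*}(x) = \operatorname{sign}\bigl(2\eta(x) - 1\bigr),
\]

which is exactly the Bayes classification rule. This is the sense in which the hinge loss, rather than the margin itself, connects SVMs to the Bayes-optimal classifier.22,23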
Algorithm Design
Perceptrons,29 by Marvin Minsky and Seymour Papert, is considered an "evil sword" that stopped research on perceptrons, especially by neural network researchers, from 1980 to 2000. The book proposed two seemingly contradictory principles for ML:

• Algorithms should solve real-world problems instead of toy problems.
• The algorithm's time complexity should be polynomial.




The first implies that algorithms must be able to deal with nonlinear problems; the second implies that the algorithms must be computable. Minsky and Papert implied that because the two principles contradict each other, it might be difficult to design ML methods that don't depend on domain knowledge. After nearly 40 years, these two principles still apply to ML algorithms.

Backpropagation is a nonlinear algorithm. It was a milestone in perceptron-type learning. However, Vapnik recommended going back to linear perceptrons. Philosophically, nonlinear is a general name for unknown things. When you've solved a problem, it means you've found a space on which the problem can be linearly represented. Technically, you need to find a space where nonlinear problems can be linear. This was the basic idea when John von Neumann built mathematical foundations for quantum mechanics in the 1930s.30 Vapnik used this idea when he suggested looking for a map that maps linearly inseparable data to a Hilbert space (a linear inner-product space), which he called the feature space. After mapping, the data could be linearly separable on this space. So, you would need to consider only linear perceptrons here. If you considered margins, you would then have the maximal-margin problem in the feature space.

The nonlinearity problem would seem to have been solved, but it isn't so simple. To make the problem linearly separable, you need to add dimensions. So, the feature space's dimensionality will be much higher than that of the input space. How high should this dimensionality be for the mappings of the convex sets in the input space to have the maximal margin? This question remains unanswered. It's one reason you can't use the margin as a criterion to evaluate model precision and must resort to traditional measures.

Schapire's weak-learnability theorem implied another design principle for learning algorithms: You can obtain high-precision models by combining many low-precision models. This principle is called ensemble learning. Neuroscientist Donald O. Hebb first employed this principle in his multicell ensemble theory, which posits that vision objects are represented by interconnected neuron ensembles. The intent of the term ensemble as it's used nowadays is consistent with Hebb's intent: a high-precision model is represented through many low-precision models.

To design algorithms, you can regard learning problems as an optimization problem on the space spanned by these weak classifiers. This is the basic design principle in popular boosting algorithms.26 This method's biggest advantage is that it automatically reduces dimensionality because the number of weak classifiers is usually far smaller than the input's dimensionality.
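A minimal sketch of this design principle, assuming NumPy is available (this is an illustration of discrete AdaBoost26 with one-feature threshold "stumps" as the weak classifiers, not code taken from the references):

    import numpy as np

    def train_stump(X, y, w):
        """Find the threshold stump (feature, threshold, polarity) with the
        smallest weighted error on labels y in {-1, +1} under sample weights w."""
        n, d = X.shape
        best = (np.inf, 0, 0.0, 1)          # (error, feature, threshold, polarity)
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for polarity in (1, -1):
                    pred = polarity * np.where(X[:, j] <= thr, 1, -1)
                    err = np.sum(w[pred != y])
                    if err < best[0]:
                        best = (err, j, thr, polarity)
        return best

    def adaboost(X, y, rounds=20):
        """Discrete AdaBoost: reweight samples, keep stumps and their votes alpha."""
        n = len(y)
        w = np.full(n, 1.0 / n)
        ensemble = []
        for _ in range(rounds):
            err, j, thr, pol = train_stump(X, y, w)
            err = max(err, 1e-12)                     # avoid division by zero
            alpha = 0.5 * np.log((1.0 - err) / err)   # vote of this weak classifier
            pred = pol * np.where(X[:, j] <= thr, 1, -1)
            w *= np.exp(-alpha * y * pred)            # upweight the mistakes
            w /= w.sum()
            ensemble.append((alpha, j, thr, pol))
        return ensemble

    def predict(ensemble, X):
        score = np.zeros(len(X))
        for alpha, j, thr, pol in ensemble:
            score += alpha * pol * np.where(X[:, j] <= thr, 1, -1)
        return np.sign(score)

    # Tiny demonstration on synthetic data: each stump alone is weak,
    # the weighted vote of many stumps is much stronger.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.where(X[:, 0] + X[:, 1] ** 2 > 0.5, 1, -1)
    model = adaboost(X, y, rounds=30)
    print("training accuracy:", np.mean(predict(model, X) == y))

Note how the final model depends only on the handful of selected stumps, which is the dimensionality-reduction effect described above.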
The Rashomon Effect
Let's look at another problem affecting machine learning. In a liter of water, drop 10 grams of salt or sugar. We can discriminate whether you dropped sugar or salt when we taste it. However, add a liter of water, another liter of water, and so on. Eventually, we wouldn't be able to tell what you added. Such is the effect of dilution.

When dealing with a high-dimensional data set, you'll see a similar effect. For a fixed-size training sample, if the number of variables (features in pattern recognition or attributes in AI) reaches a certain point, the sample will be diluted over the space spanned by these variables. This space will also contain many satisfactory models, meeting a precision requirement for common bias and variance measures (such as cross-validation on the given sample).

However, for real-world problems, maybe only one or a few of those models will be useful. This means that for real-world observations outside the training sample, most of these satisfactory models won't be good. This is the multiplicity-of-models problem. Leo Breiman listed this problem alongside Richard Bellman's curse of dimensionality and Occam's razor and gave it an interesting name, the Rashomon effect, to describe the dilemma that ML faces in this situation.10 (In the Japanese movie Rashomon, four witnesses of an incident give different accounts of what happened.)

In high-dimensional spaces, if the sample is too small, the i.i.d. conditions won't be useful. That is, as the number of variables increases, the given sample will be diluted exponentially over the space that those variables span. In other words, if p is the upper limit of the dimensionality (the number of variables) suitable for the given sample size, then if the dimensionality increases further, the sample size must increase exponentially to make the model useful.

Obviously, if you lower the dimensionality of a high-dimensional observation data set, the needed sample size will drop exponentially. For a long time, ML research didn't focus on dimensionality reduction but used it as an auxiliary method. However, it has long been central to pattern recognition research. This is possibly because image data have very high dimensionality, and you can't build good models without considering dimensionality reduction. In pattern recognition, dimensionality reduction is called feature selection and feature extraction. Feature selection involves selecting a feature subset from a given feature set with a given rule. Feature extraction maps the input space to another space with a lower dimension; the number of features will be smaller in the new space. In this article, we discuss only feature selection.

Many feature selection methods exist.31 Statisticians were the first to regard feature selection as an important ML problem. In 1996, Robert Tibshirani proposed the Lasso (Least Absolute Shrinkage and Selection Operator) algorithm, which is based on optimizing least squares with an L1 constraint.32 ML researchers have recently recognized this algorithm's importance because of their increased awareness of the problem of information sparsity. Lasso is similar to wrapper methods in feature selection in that it targets the objective function of learning (to minimize the classification error or squared error). Unlike many feature selection methods, Lasso doesn't select a feature subset with a heuristic rule. However, it does consider feature selection as a set of linear constraints on the coefficients of the variables. During optimization, if a variable's coefficient decreases to 0, Lasso eliminates the variable from the model, thus performing feature selection.
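In its original constrained form and the equivalent penalized form, the Lasso estimate solves (standard notation, with t or lambda controlling sparsity):

\[
\hat{\beta} = \arg\min_{\beta}\ \sum_{k=1}^{n}\bigl(y_k - x_k^{\top}\beta\bigr)^{2}
\quad \text{subject to} \quad \sum_{j}\lvert\beta_j\rvert \le t,
\qquad \text{or} \qquad
\hat{\beta} = \arg\min_{\beta}\ \sum_{k=1}^{n}\bigl(y_k - x_k^{\top}\beta\bigr)^{2} + \lambda\sum_{j}\lvert\beta_j\rvert.
\]

The corners of the L1 ball are what drive individual coefficients exactly to 0, which is why the optimization itself performs the feature selection.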



On the basis of Lasso, Bradley Efron proposed LARS (Least Angle Regression), which is more interesting.33 To provide feature selection, LARS uses analysis of residuals. Efron proved that with one additional constraint, you can use LARS to solve the Lasso optimization objective. Because LARS is basically a very efficient forward selection method, it's amazing that it can obtain the optimal solution of a numerical optimization problem.

Most research on the Rashomon effect aims to reduce the number of dimensions or the number of variables under certain statistical assumptions and build models on a low-dimensional space. However, in some fields (for example, natural language processing), instead of only a handful of features being meaningful, most features are meaningful. Simple dimensionality reduction might give good models measured on a given training set but be useless in linguistics. In other words, an algorithm will be meaningful in linguistics only if it considers all the features. However, linguists don't like to work on the full feature set. They often start from small parts and then combine the different parts. In this case, ensemble methods might be useful.

Another facet of the Rashomon effect is that the data might come not from one source but from multiple sources. Suppose that every model relies on a subset of the feature set and that different models rely on nonintersecting sets. In this situation, you can directly use feature selection methods. However, to find the best partition of the feature set, you'll have to search all the feature subsets. This method is different from feature selection based on statistical correlation; it must consider not only the correlation among features but also users' specific requirements.34 This is important in personalized applications, such as personalized information retrieval.

A more complicated situation is when the data set is an additive combination of multiple reasonable models. For example, with image data, an image might contain many different objects. In this case, feature extraction might be useful. Many researchers have studied feature extraction for this situation, but most of their results are empirical. We haven't seen remarkable theoretical results, either statistically or through domain-dependent research.

The Rashomon effect is not only an inevitable challenge for machine learning. Residual-analysis methods from classical statistics can deal with low-dimensional data, but when confronted with high-dimensional data, they're of little help. So, the Rashomon effect is also a challenge for statistics.

Other ML Forms
Recall that the fundamental thing in ML is to estimate a function y = F(x). In recent years, researchers have proposed many forms of ML, with the difference being solely how you define x or y.

Different Definitions of y
Suppose that a data set satisfies the attribute-value form; that is, the variable set (x) is determined beforehand. For every observation, each variable must have a value. Different domains, ranges, or explanations lead to totally different ML forms, which need different algorithms.

If you define y on a limited integer set, you have a classification problem. If you also define each variable of x on a limited integer set, you can use symbolic learning methods. If you define part of x's variables as real numbers, you can use statistical learning methods. Of course, you can also consider integers to be real numbers, in which case the problem is statistical.

If you define y as a real number, you have a regression problem. Traditional ML does much less research on regression problems than on classification problems; regression problems are much more important in statistics and certain engineering applications. However, because logistic regression could convert a classification problem to a regression problem, regression will surely attract ML researchers' attention.

If you assign y to only a few samples while others have no value because of labor costs or lack of knowledge, semisupervised learning might be applicable.35 This form of ML designs a class of rules to evaluate y for unknown samples according to the already known samples. This is difficult. On one hand, rules might be related to domain knowledge. On the other hand, you must find rules for samples with known y values and use them as evidence to assign y values to other samples, which means searching in a huge space. So far, no good theoretical framework exists for this form of ML.

If relations or structures exist among samples, which means the output y is related or structural, structural learning is appropriate. For this kind of data, in addition to the labels in y, a new variable p describes the structure, and you can simply write the y-part as (y, p). The simplest, yet most useful, structural-learning method, learning to rank, needs the samples arranged according to certain requirements.36 This method has attracted attention because it's useful for designing search engines. If the y-part denotes different classes of documents such as news or sports, you can write (y, p) as (y, r_k), where r_k is an order for items in the kth class.
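One common pairwise way to formalize this, not necessarily the exact formulation of reference 36 but close in spirit to large-margin ordinal regression, is to ask the learned scoring function f to respect every ordered pair in the training ranking:

\[
\min_{f}\ \sum_{(i, j)\,:\, i \text{ ranked above } j} \max\bigl(0,\ 1 - \bigl(f(x_i) - f(x_j)\bigr)\bigr) + \lambda\,\Omega(f),
\]

so that items that should appear earlier in r_k receive higher scores by a margin; Omega is a regularizer such as the squared norm of f's parameters.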
If for all samples y isn't assigned, the problem involves unsupervised learning, which is closely related to clustering analysis.

Different Definitions of x
If the samples don't satisfy the attribute-value form (that is, the data set is stored in a relational database), the variables have relations among each other. Learning involving such data sets is called relational learning.37 AI first addressed relational learning in the 1970s, but this problem still hasn't been solved. This is important because 60 percent of the data we face, especially economic data, are stored like this.

Current approaches to relational learning are based mainly on inductive logic programming (ILP), which

1. scatters samples according to domain knowledge so that each fragment satisfies the attribute-value form,
2. builds models for each fragment by statistical learning, and
3. joins the models according to domain knowledge.

This is difficult but important for practical application.




In 2000, Science published a group of articles on manifold learning. In this approach, the cognitive process is based on data manifolds.38-40 Roughly speaking, a manifold is a locally coordinated Euclidean space, which means that you can give Euclidean coordinates (via a homeomorphism) on every small part of it. The research reported in Science referred not to the differential manifold's mathematical property but only to its definition; that is, to induce the topological locally coordinate space intuitively, through piecewise linearization. This concept makes sense for cognitive science and provides insight for ML. For ML methods, this kind of learning gives a different explanation from general Euclidean implementations for x in the given data set.
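As a concrete illustration of the piecewise-linearization idea, the short sketch below uses scikit-learn's implementation of locally linear embedding;39 the data set and parameter values are arbitrary choices for illustration, not ones taken from the article.

    import numpy as np
    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    # A 2D sheet rolled up in 3D: globally curved, locally almost flat.
    X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

    # LLE reconstructs each point from its neighbors (the "locally linear" step),
    # then finds 2D coordinates that preserve those reconstruction weights.
    lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
    X_2d = lle.fit_transform(X)

    print(X.shape, "->", X_2d.shape)   # (1500, 3) -> (1500, 2)

The recovered 2D coordinates "unroll" the sheet, giving the data a representation in which Euclidean methods again make sense.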
Some detection applications aim to find special instances or exceptions. In early research on this approach, the goal of finding exceptions was to shorten the description length of the model to improve generality.41,42 Recently, driven by a host of practical challenges such as detecting illegal financial activities or security breaches, finding exceptions in a mass of data has become important.43 Exceptional activities vary and change so rapidly that directly building models with them is practically useless. You need to build models according to requirements and make them the standard for normal behavior; behavior that exceeds the standard would be considered the exception. To do this, we have induced models from data and designed methods to extract exceptions from the data set.44

And Even More Forms
Over the last decade, ML researchers have developed other learning forms such as metric learning45 and multi-instance learning.46 These forms are driven by and derived from practical problems. Their common trait is that the data are so complexly represented that no previous learning framework can handle them. Generally speaking, these forms are all related to domain knowledge.

Increasing needs in fields such as molecular biology, Internet information analysis and processing, and economic data analysis have greatly stimulated ML research. They have also presented many complex problems that can't be solved with traditional ML but that need fresh approaches and new knowledge from other fields.

One fresh approach comes from Breiman's 2001 article in Statistical Science about how a statistician understands ML.10 He advises that statisticians start from practice and pay attention to problems caused by high-dimensional data. He admonishes computer scientists to consider the conditions for using various theories. When processing data, no matter how good or bad the result was, computer scientists had been paying more attention to algorithm design than to statistical analysis. Having realized the deficiency in how they think, computer scientists are now paying more attention to statisticians' understanding of ML, their research on the Rashomon effect, and the discussion about the limitations of Vapnik's finite-sample statistics theory. We also think it's necessary to take a hard look at previous ML research, which was our motivation for this article.

In this article, we haven't discussed traditional ML in AI, and we haven't dwelt on how researchers have neglected symbolic learning methods. We are much concerned with these methods,43,47 but we're not interested in their generalization. In other words, when facing different practical problems, you need to consider not only the models' statistical interpretations but also special examples and how they influence the models and the problem.

Such specificity is also what some fields demand for ML. For example, if you use statistical methods to build an approximate model according to the given sample, will molecular biologists believe this model? What useful thing can they get from it? Maybe the more important thing is to help them read the data in more detail. Although statisticians nowadays are concerned with models' interpretations, researchers in some fields might need more detailed and readable data. These challenges will provide new pathways to symbolic learning methods.

Of course, some researchers believe that such research also belongs to data mining. A common practice of human learning and knowledge management in data mining is to use general rules and exceptions to rules. One crucial issue is to find the right mixture of them. In a previous study, we've considered rule-plus-exception strategies for discovering this type of knowledge.43 We summarized and compared results from psychology, expert systems, genetic algorithms, and ML and data mining, and we examined their implications for knowledge management and discovery. That study establishes a basis for the design and implementation of new algorithms for discovering rule-plus-exception-type knowledge.

Acknowledgments
The Chinese National Basic Research Program (2004CB318103) and Natural Science Foundation of China grant 60835002 supported this research.

References
1. H. Simon, "Why Should Machines Learn?" Machine Learning: An Artificial Intelligence Approach, R. Michalski, J. Carbonell, and T. Mitchell, eds., Tioga Press, 1983, pp. 25-38.
2. R. Solomonoff, "A New Method for Discovering the Grammars of Phrase Structure Language," Proc. Int'l Conf. Information Processing, Unesco, 1959, pp. 285-290.
3. E. Hunt, J. Marin, and P. Stone, Experiments in Induction, Academic Press, 1966.
4. L. Samuel, "Some Studies in Machine Learning Using the Game of Checkers, Part II," IBM J. Research and Development, vol. 11, no. 4, 1967, pp. 601-618.
5. J. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, Mar. 1986, pp. 81-106.



6. Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, 1991.
7. F. Rosenblatt, The Perceptron: A Perceiving and Recognizing Automaton, tech. report 85-4601, Aeronautical Lab., Cornell Univ., 1957.
8. R. Duda and P. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, 1973.
9. D.E. Rumelhart and J.L. McClelland, Parallel Distributed Processing, MIT Press, 1986.
10. L. Breiman, "Statistical Modeling: The Two Cultures," Statistical Science, vol. 16, no. 3, 2001, pp. 199-231.
11. V. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
12. V. Vapnik and A. Chervonenkis, "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities," Theory of Probability and Applications, vol. 16, Jan. 1971, pp. 264-280.
13. A. Blumer et al., "Learnability and the Vapnik-Chervonenkis Dimension," J. ACM, vol. 36, no. 4, 1989, pp. 929-965.
14. J. Shawe-Taylor et al., "Structural Risk Minimization over Data-Dependent Hierarchies," IEEE Trans. Information Theory, vol. 44, no. 5, 1998, pp. 1926-1940.
15. Q. Tao, G.-W. Wu, and J. Wang, "A New Maximum Margin Algorithm for One-Class Problems and Its Boosting Implementation," Pattern Recognition, vol. 38, no. 7, 2005, pp. 1071-1077.
16. Q. Tao et al., "Posterior Probability Support Vector Machines for Unbalanced Data," IEEE Trans. Neural Networks, vol. 16, no. 6, 2005, pp. 1561-1573.
17. Q. Tao, G.-W. Wu, and J. Wang, "Learning Linear PCA with Convex Semi-Definite Programming," Pattern Recognition, vol. 40, no. 10, 2007, pp. 2633-2640.
18. Q. Tao, D.-J. Chu, and J. Wang, "Recursive Support Vector Machines for Dimensionality Reduction," IEEE Trans. Neural Networks, vol. 19, no. 1, 2008, pp. 189-193.
19. R. Schapire, "The Strength of Weak Learnability," Machine Learning, vol. 5, no. 2, 1990, pp. 197-227.
20. Y. Freund and R.E. Schapire, "A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting," J. Computer and System Sciences, vol. 55, no. 1, 1997, pp. 119-139.
21. R.E. Schapire et al., "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods," Annals of Statistics, vol. 26, no. 5, 1998, pp. 1651-1686.
22. T. Zhang, "Statistical Behaviour and Consistency of Classification Methods Based on Convex Risk Minimization," Annals of Statistics, vol. 32, no. 1, 2004, pp. 56-85.
23. Y. Lin, "Support Vector Machines and the Bayes Rule in Classification," Data Mining and Knowledge Discovery, vol. 6, no. 3, 2002, pp. 259-275.
24. L. Valiant, "A Theory of the Learnable," Comm. ACM, vol. 27, no. 11, 1984, pp. 1134-1142.
25. L. Breiman, "Prediction Games and Arcing Algorithms," Neural Computation, vol. 11, no. 7, 1999, pp. 1493-1517.
26. J. Friedman, T. Hastie, and R. Tibshirani, "Additive Logistic Regression: A Statistical View of Boosting," Annals of Statistics, vol. 28, no. 2, 2000, pp. 337-407.
27. I. Steinwart, "Which Data-Dependent Bounds Are Suitable for SVMs?" tech. report, Los Alamos Nat'l Lab., 2002; www.ccs3.lanl.gov/~ingo/pubs.shtml.
28. T. Hastie and J. Zhu, "Comment," Statistical Science, vol. 21, no. 3, 2006, pp. 352-357.
29. M. Minsky and S. Papert, Perceptrons (expanded edition), MIT Press, 1988.
30. J. von Neumann, Mathematical Foundations of Quantum Mechanics, Princeton Univ. Press, 1932.
31. H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers, 1998.
32. R. Tibshirani, "Regression Shrinkage and Selection via the Lasso," J. Royal Statistical Soc.: Series B, vol. 58, no. 1, 1996, pp. 267-288.
33. B. Efron et al., "Least Angle Regression," Annals of Statistics, vol. 32, no. 2, 2004, pp. 407-499.
34. H.L. Liang, W. Jue, and Y. YiYu, "User-Oriented Feature Selection for Machine Learning," Computer J., vol. 50, no. 4, 2007, pp. 421-434.
35. A. Blum and T. Mitchell, "Combining Labeled and Unlabeled Data with Co-training," Proc. 11th Ann. Conf. Computational Learning Theory, ACM Press, 1998, pp. 92-100.
36. R. Herbrich, T. Graepel, and K. Obermayer, "Support Vector Learning for Ordinal Regression," Proc. 9th Int'l Conf. Artificial Neural Networks, IEEE Press, 1999, pp. 97-102.
37. S. Dzeroski and N. Lavrac, eds., Relational Data Mining, Springer, 2001.
38. H. Seung and D. Lee, "The Manifold Ways of Perception," Science, vol. 290, no. 5500, 2000, pp. 2268-2269.
39. S. Roweis and L. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, no. 5500, 2000, pp. 2323-2326.
40. J. Tenenbaum, V. de Silva, and J. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, no. 5500, 2000, pp. 2319-2323.
41. N. Shepard, C. Novland, and H. Jenkins, "Learning and Memorization of Classification," Psychological Monographs, vol. 75, no. 13, 1961, pp. 1-42.
42. M. Nosofsky, J. Palmeri, and C. McKinley, "Rule-Plus-Exception Model of Classification Learning," Psychological Rev., vol. 101, no. 1, 1994, pp. 53-79.
43. Y. Yao et al., "Rule + Exception Strategies for Security Information Analysis," IEEE Intelligent Systems, Sept./Oct. 2005, pp. 52-57.
44. J. Wang et al., "Multilevel Data Summarization from Information System: A Rule + Exception Approach," AI Comm., vol. 16, no. 1, 2003, pp. 17-39.
45. E.P. Xing et al., "Distance Metric Learning, with Application to Clustering with Side-Information," Advances in NIPS, vol. 15, Jan. 2003, pp. 505-512.
46. T.G. Dietterich, R.H. Lathrop, and T. Lozano-Pérez, "Solving the Multiple-Instance Problem with Axis-Parallel Rectangles," Artificial Intelligence, vol. 89, nos. 1-2, 1997, pp. 31-71.
47. W. Zhu and F.-Y. Wang, "Reduction and Axiomization of Covering Generalized Rough Sets," Information Sciences, vol. 152, no. 1, 2003, pp. 217-230.

The Authors
Jue Wang is a professor at the Chinese Academy of Sciences Institute of Automation. His research interests include artificial neural networks, genetic algorithms, multiagent systems, machine learning, and data mining. Wang received his master's from the Chinese Academy of Sciences Institute of Automation. He's an IEEE senior member. Contact him at jue.wang@mail.ia.ac.cn.

Qing Tao is a professor at the Chinese Academy of Sciences Institute of Automation and at the New Star Research Institute of Applied Technology. His research interests are applied mathematics, neural networks, statistical learning theory, support vector machines, and pattern recognition. Tao received his PhD from the University of Science and Technology of China. Contact him at qing.tao@mail.ia.ac.cn or taoqing@gmail.com.

