Machine Learning:
The State of the Art
Jue Wang and Qing Tao, Institute of Automation, Chinese Academy of Sciences

This article discusses ML's advances in light of theoretical and practical difficulties, analyzing statistical interpretations, algorithm design, and the Rashomon problem.

Internet browsers have facilitated the acquisition of large amounts of information. This technology's development has greatly surpassed that of data analysis, making large chunks of data uncontrollable and unusable. This is one reason why so many people are enthusiastic about machine learning (ML). In fields such as molecular biology and the Internet, if you're concerned only about the ease of gene sorting and distributing information, you don't need ML. However, if you're concerned about advanced problems such as gene functions, understanding information, and security, ML is inevitable.

Many researchers quote Herbert Simon in describing ML:

  Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the same task or tasks drawn from the same population more efficiently and more effectively the next time.1

This is a philosophical and ubiquitous description of ML. But you can't design algorithms and analyze the problem with this definition. Computer scientists and statisticians are more interested in ML algorithms and their performance. In their eyes, ML is

  the process (algorithm) of estimating a model that's true to the real-world problem with a certain probability from a data set (or sample) generated by finite observations in a noisy environment.

This definition involves five key terms (a small illustrative sketch follows the list):

• The real-world problem. Suppose the true model for this problem is y = F(x), where x is the variables (features and attributes) describing the problem, y is the outcome of x in the problem, and F is the relationship between the variables and the outcome. ML's mission is to estimate a function f(x) that approximates F(x).
• Finite observations (the sample set). All the data in the sample set S = {(xk, yk), k = 1, ..., n} are recordings of finite independent observations of y = F(x) in a noisy environment.
• Sampling assumption. The sample set is an independent and identically distributed (i.i.d.) sample from an unknown joint distribution function P(x, y).
• Modeling. Consider a parametric model space. The parameters are estimated from the sample set to get an f(x) that's a good approximation of F(x). The approximation measure is the average error for all samples in terms of the distribution P(x, y). We then say that the estimated model f(x) is true to the real-world problem F(x) with a certain probability, and we call f(x) a model of F(x).
• Process (the algorithm). ML algorithms should have some statistical guarantees. Also, the algorithm will transform the modeling problem into a parametric optimization problem in the model space. In most ML approaches, the learned model has a linear dependence on these parameters.
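
To make these five terms concrete, here is a minimal sketch in Python (the true model F, the noise level, and the polynomial model space are hypothetical choices for illustration, not anything the definition prescribes). It draws an i.i.d. sample of noisy observations of y = F(x), estimates an f(x) that is linear in its parameters, and reports the average error over the sample.

    import numpy as np

    rng = np.random.default_rng(0)

    # The real-world problem: a hypothetical true model y = F(x).
    def F(x):
        return np.sin(2 * np.pi * x)

    # Finite observations: n noisy recordings of y = F(x).
    n = 50
    x = rng.uniform(0.0, 1.0, n)          # sampling assumption: i.i.d. draws
    y = F(x) + rng.normal(0.0, 0.1, n)    # noisy environment

    # Modeling: a parametric space of degree-5 polynomials. f(x) is linear
    # in the parameters w, so estimating w becomes a parametric
    # optimization problem (here, ordinary least squares).
    Phi = np.vander(x, 6)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

    def f(x_new):
        return np.vander(x_new, 6) @ w

    # Process: judge the approximation by its average error over the sample.
    print("average squared error:", np.mean((f(x) - y) ** 2))
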
ML involves both fundamental theoretical difficulties and practical difficulties. Theoretical difficulties include these:

• The number of variables in the sample set (dimensions) is generally very large or even huge; compared with it, the sample set is very small.
• The model of the real-world problem is usually highly nonlinear.

Practical difficulties include these:

• Sometimes, the sample set cannot be represented as a vector in a given linear space, and there are specific relations among the data.
• The data's outcome can take many forms (it could be discrete, it could be continuous, it could have an ordered structure, or it even could be missing).
• The data set can be a product of many meaningful models intertwined together.
• There are outliers to models that must not be ignored in some cases, which has inspired many new learning models.

Here, we discuss ML's advances in light of these difficulties.

The Development of ML

The term machine learning has become increasingly popular in recent decades. Other fields have terms with similar meanings, such as data analysis in statistics and pattern classification in pattern recognition. In the early days of AI, ML research used mainly symbolic data, and algorithm design was based on logic.2-6

At about the same time, Frank Rosenblatt proposed the perceptron, a statistical approach based on empirical risk minimization.7 However, this approach remained unrecognized and undeveloped in the following decades. Nevertheless, this statistical-modeling methodology has been highly regarded in pattern classification research and has become fundamental in statistical pattern recognition.8

The real development of statistical learning came after 1986, when David Rumelhart and James McClelland proposed the nonlinear backpropagation algorithm.9 AI, pattern recognition, and statistics researchers became interested in this approach. And, driven by real-world demands, ML, especially statistical ML, gradually became independent from traditional AI and pattern recognition. Researchers shunned early symbolic ML methods because they lacked good theoretical generalization guarantees. However, because the complexities of real-world data make a ubiquitous learning algorithm impossible, the quality of the data and background knowledge could be the key to ML's success. The intrinsic readability (interpretability) of symbolic methods could allow ML to regain popularity, because its ability to simplify data sets provides a possible way humans can understand data quality and make alterations to improve that quality.

The research community is no longer interested in the algorithm design principles of nonlinear backpropagation. But this research reminded people of nonlinear algorithms' importance. Designing nonlinear learning algorithms theoretically and systematically is an important impetus of current statistical ML research.

ML has become an important topic for pattern recognition researchers and others. In 2001, Leo Breiman published "Statistical Modeling: The Two Cultures," which viewed ML as a subcategory of statistics.10 This approach led to many new directions and ideas for ML.

In our view, ML actually has two foundations. The first is statistics. Because ML's objective is to estimate a model from observed data, it must use statistical measures to evaluate the model's performance and estimate the model. ML also needs statistics to filter noise in the data. The second foundation is computer science algorithm design methodologies, for optimizing parameters. In recent years, ML's main objective has been to find a learner linearly dependent on these parameters.

In the past decade, ML has been in a "margin era."11,12 Geometrically speaking, the margin of a classifier is the minimal distance of training points from the decision boundary. Generally, a large margin implies good generalization performance.
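
In symbols, for a linear classifier f(x) = w·x + b with labels yk in {-1, +1}, the geometric margin is min over k of yk(w·xk + b)/||w||. A minimal sketch of this computation (the data set and weight vector below are made up for illustration):

    import numpy as np

    def geometric_margin(w, b, X, y):
        # Minimal signed distance of the training points from the decision
        # boundary w.x + b = 0; positive only if every point lies on the
        # correct side.
        return np.min(y * (X @ w + b)) / np.linalg.norm(w)

    # Made-up 2-D training set with labels in {-1, +1}.
    X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])
    w, b = np.array([1.0, 1.0]), 0.0
    print("margin:", geometric_margin(w, b, X, y))
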
Anselm Blumer was the first to use VC (Vapnik-Chervonenkis) dimensions and the probably approximately correct (PAC) framework to study learners' generalization ability.13 In 1995, Vladimir Vapnik published the book The Nature of Statistical Learning Theory,11 which included his work on determining PAC bounds of generalization based on the VC dimensions. Blumer's and Vapnik's research serves as the theoretical cornerstone of finite-sample statistical-learning theory. In 1998, John Shawe-Taylor related generalization to the margin of two closed convex sets in the feature space.14

These theories led to the large-margin algorithm design principle, which has three points (a short sketch follows the list):

• Generalization is based on finite samples, which is consistent with application scenarios.
• These algorithms deal with nonlinear classification using the kernel trick. This method employs a linear classifier algorithm to solve a nonlinear problem by mapping the original nonlinear observations into a higher-dimensional space.
• The algorithms estimate linear classifiers and are transformed into convex optimization problems. The algorithms' generalization depends on maximizing the margins, which has a clear geometric meaning.
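
A sketch of these three points in action, assuming scikit-learn is available (the circular toy data set is our own invention, not from the article): an RBF-kernel support vector machine implicitly maps the observations into a higher-dimensional feature space and solves a convex maximal-margin problem there.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # A made-up nonlinear problem: label +1 inside a circle, -1 outside.
    # No linear classifier in the input space separates these classes.
    X = rng.uniform(-1.0, 1.0, (200, 2))
    y = np.where((X ** 2).sum(axis=1) < 0.5, 1, -1)

    # The RBF kernel implicitly maps the data into a higher-dimensional
    # feature space; there, SVC solves a convex optimization problem that
    # maximizes the margin of a linear classifier.
    clf = SVC(kernel="rbf", C=1.0).fit(X, y)
    print("training accuracy:", clf.score(X, y))
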
We've done margin-related research on improving supervised algorithms and designing unsupervised algorithms.15-18 For example, taking the viewpoint that the desired domain containing the training samples should have as few outliers as possible, we define the margin of one-class problems and derive a support-vector-machine-like framework for solving a new one-class problem.

Decades ago, Marvin Minsky and Seymour Papert suggested two principles for ML algorithm design:

• The algorithm must be universal.
• The algorithm's time complexity should be polynomial.

The first implies that algorithms must be able to deal with nonlinear problems; the second implies that the algorithms must be computable. Minsky and Papert implied that because the two principles contradict each other, it might be difficult to design ML methods that don't depend on domain knowledge. After nearly 40 years, these two principles still apply to ML algorithms.

Backpropagation is a nonlinear algorithm. It was a milestone in perceptron-type learning. However, Vapnik recommended going back to linear perceptrons. Philosophically, "nonlinear" is a general name for unknown things. When you've solved a problem, it means you've found a space on which the problem can be linearly represented. Technically, you need to find a space where nonlinear problems can be linear. This was the basic idea when John von Neumann built mathematical foundations for quantum mechanics in the 1930s.30 Vapnik used this idea when he suggested looking for a map that takes linearly inseparable data to a Hilbert space (a linear inner-product space), which he called the feature space. After mapping, the data could be linearly separable on this space. So, you would need to consider only linear perceptrons there. If you considered margins, you would then have the maximal-margin problem in the feature space.

The nonlinearity problem would seem to have been solved, but it isn't so simple. To make the problem linearly separable, you need to add dimensions. So, the feature space's dimensionality will be much higher than that of the input space. How high should this dimensionality be for the mappings of the convex sets in the input space to have the maximal margin? This question remains unanswered. It's one reason you can't use the margin as a criterion to evaluate model precision and must resort to traditional measures.

Schapire's weak-learnability theorem implied another design principle for learning algorithms: You can obtain high-precision models by combining many low-precision models. This principle is called ensemble learning. Neuroscientist Donald O. Hebb first employed this principle in his multicell ensemble theory, which posits that vision objects are represented by interconnected neuron ensembles. The intent of the term ensemble as it's used nowadays is consistent with Hebb's intent: a high-precision model is represented through many low-precision models.

To design algorithms, you can regard learning problems as an optimization problem on the space spanned by these weak classifiers. This is the basic design principle in popular boosting algorithms.26 This method's biggest advantage is that it automatically reduces dimensionality because the number of weak classifiers is usually far smaller than the input's dimensionality.
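
The following toy sketch illustrates the principle (a bare-bones AdaBoost-style loop over one-dimensional threshold "stumps"; the data, the threshold grid, and the number of rounds are all arbitrary choices, and this is not any specific published algorithm). Each stump is a low-precision model; their weighted vote is the high-precision ensemble.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy 1-D data: class +1 inside an interval, -1 outside. No single
    # threshold rule (a "stump") classifies it well, but a weighted vote
    # of many stumps can.
    x = rng.uniform(-1.0, 1.0, 200)
    y = np.where(np.abs(x) < 0.5, 1.0, -1.0)

    weights = np.full(x.size, 1.0 / x.size)   # one weight per sample
    ensemble = []                             # list of (threshold, sign, vote)

    for _ in range(20):                       # number of rounds is arbitrary
        # Choose the stump h(x) = s * sign(x - t) with lowest weighted error.
        err, t, s = min(
            (weights[(sg * np.sign(x - th)) != y].sum(), th, sg)
            for th in np.linspace(-1.0, 1.0, 41)
            for sg in (1.0, -1.0)
        )
        err = np.clip(err, 1e-10, 1 - 1e-10)
        vote = 0.5 * np.log((1.0 - err) / err)  # accurate stumps vote louder
        ensemble.append((t, s, vote))
        # Reweight: emphasize the samples this stump misclassified.
        weights *= np.exp(-vote * y * s * np.sign(x - t))
        weights /= weights.sum()

    # The high-precision model: a weighted vote of low-precision stumps.
    scores = sum(v * s * np.sign(x - t) for t, s, v in ensemble)
    print("ensemble training accuracy:", np.mean(np.sign(scores) == y))

With more rounds, the combination approximates the interval boundary well, even though each stump alone is only modestly better than chance.
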
The Rashomon Effect

Let's look at another problem affecting machine learning. In a liter of water, drop 10 grams of salt or sugar. We can discriminate whether you dropped sugar or salt when we taste it. However, add a liter of water, another liter of water, and so on. Eventually, we wouldn't be able to tell what you added. Such is the effect of dilution.

When dealing with a high-dimensional data set, you'll see a similar effect. For a fixed-size training sample, if the number of variables (features in pattern recognition or attributes in AI) reaches a certain point, the sample will be diluted over the space spanned by these variables. This space will also contain many satisfactory models, meeting a precision requirement for common bias and variance measures (such as cross-validation on the given sample).

However, for real-world problems, maybe only one or a few of those models will be useful. This means that for real-world observations outside the training sample, most of these satisfactory models won't be good. This is the multiplicity-of-models problem. Leo Breiman listed this problem alongside Richard Bellman's curse of dimensionality and Occam's razor and gave it an interesting name, the Rashomon effect, to describe the dilemma that ML faces in this situation.10 (In the Japanese movie Rashomon, four witnesses of an incident give different accounts of what happened.)

In high-dimensional spaces, if the sample is too small, the i.i.d. conditions won't be useful. That is, as the number of variables increases, the given sample will be diluted exponentially over the space that those variables span. In other words, if p is the upper limit of the dimensionality (the number of variables) suitable for the given sample size, then if the dimensionality increases further, the sample size must increase exponentially to make the model useful.
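
A quick way to see this dilution numerically (a made-up illustration, not from the article's references): hold the sample size fixed and watch the average nearest-neighbor distance grow as the dimensionality d increases, so the same n points cover the space ever more sparsely.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000                                  # fixed sample size

    for d in (1, 2, 5, 10, 50):
        X = rng.uniform(0, 1, (n, d))         # n i.i.d. points in the unit cube
        G = X @ X.T                           # Gram matrix for squared distances
        sq = np.diag(G)
        D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * G, 0)
        np.fill_diagonal(D2, np.inf)
        nn = np.sqrt(D2.min(axis=1))          # distance to nearest neighbor
        print(f"d={d:3d}  mean nearest-neighbor distance = {nn.mean():.3f}")
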
Obviously, if you lower the dimensionality of a high-dimension observation data set, the needed sample size will drop exponentially. For a long time, ML research didn't focus on dimensionality reduction but used it as an auxiliary method. However, it has long been central to pattern recognition research. This is possibly because image data have very high dimensionality, and you can't build good models without considering dimensionality reduction. In pattern recognition, dimensionality reduction is called feature selection and feature extraction. Feature selection involves selecting a feature subset from a given feature set with a given rule. Feature extraction maps the input space to another space with a lower dimension; the number of features will be smaller in the new space. In this article, we discuss only feature selection.

Many feature selection methods exist.31 Statisticians were the first to regard feature selection as an important ML problem. In 1996, Robert Tibshirani proposed the Lasso (Least Absolute Shrinkage and Selection Operator) algorithm, which is based on optimizing least squares with an L1 constraint.32 ML researchers have recently recognized this algorithm's importance because of their increased awareness of the problem of information sparsity. Lasso is similar to wrapper methods in feature selection in that it targets the objective function of learning (to minimize the classification error or squared error). Unlike many feature selection methods, Lasso doesn't select a feature subset with a heuristic rule. However, it does consider feature selection as part of the learning objective itself.
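
A minimal sketch of Lasso acting as a feature selector, assuming scikit-learn (the synthetic data, in which only 3 of 20 features matter, and the penalty weight alpha are arbitrary choices): the L1 penalty drives most coefficients to exactly zero, and the features with nonzero coefficients are the selected subset.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)

    # Synthetic data: 20 features, but y depends on only 3 of them (sparsity).
    n, d = 100, 20
    X = rng.normal(size=(n, d))
    true_w = np.zeros(d)
    true_w[[0, 5, 12]] = [3.0, -2.0, 1.5]
    y = X @ true_w + rng.normal(0, 0.1, n)

    # L1-penalized least squares: most coefficients collapse to exactly
    # zero, so the surviving features are the selected subset.
    model = Lasso(alpha=0.1).fit(X, y)
    print("selected features:", np.flatnonzero(model.coef_))

Raising alpha selects fewer features; lowering it selects more.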