value. In many instances, data items will have real-valued features. To ask about these, the tree uses yes/no questions of the form “is the value > k?” for some threshold k, where only values that occur in the data need to be tested as possible thresholds. It is also possible to use more complex questions, taking either linear or logical combinations of many features at once5.
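To make the form of these questions concrete, here is a minimal Python sketch (not from the article; the Node structure, feature names and thresholds are illustrative assumptions) showing how a small tree of “is the value > k?” questions classifies a new item by following yes/no answers from the root to a leaf.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """Either a question node ("is item[feature] > threshold?") or, if label is set, a leaf."""
    feature: Optional[str] = None
    threshold: float = 0.0
    yes: Optional["Node"] = None   # child followed when the answer is "yes"
    no: Optional["Node"] = None    # child followed when the answer is "no"
    label: Optional[str] = None    # class label stored at a leaf

def classify(node: Node, item: dict) -> str:
    """Answer the yes/no questions from the root down to a leaf and return its label."""
    while node.label is None:
        node = node.yes if item[node.feature] > node.threshold else node.no
    return node.label

# A toy two-question tree; the feature names and thresholds are made up for illustration.
tree = Node(feature="expression", threshold=2.5,
            yes=Node(label="up"),
            no=Node(feature="conservation", threshold=0.8,
                    yes=Node(label="up"),
                    no=Node(label="down")))

print(classify(tree, {"expression": 1.0, "conservation": 0.9}))  # prints "up"
```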
Decision trees are sometimes more interpretable than other classifiers such as neural networks and support vector machines because they combine simple questions about the data in an understandable way. Approaches for extracting decision rules from decision trees have also been successful1. Unfortunately, small changes in input data can sometimes lead to large changes in the constructed tree. Decision trees are flexible enough to handle items with a mixture of real-valued and categorical features, as well as items with some missing features. They are expressive enough to model many partitions of the data that are not as easily achieved with classifiers that rely on a single decision boundary (such as logistic regression or support vector machines). However, even data that can be perfectly divided into classes by a hyperplane may require a large decision tree if only simple threshold tests are used. Decision trees naturally support classification problems with more than two classes and can be modified to handle regression problems. Finally, once constructed, they classify new items quickly.
Constructing decision trees
Decision trees are grown by adding question nodes incrementally, using labeled training examples to guide the choice of questions1,2. Ideally, a single, simple question would perfectly split the training examples into their classes. If no question exists that gives such a perfect separation, we choose a question that separates the examples as cleanly as possible.

A good question will split a collection of items with heterogeneous class labels into subsets with nearly homogeneous labels, stratifying the data so that there is little variance in each stratum. Several measures have been designed to evaluate the degree of inhomogeneity, or impurity, in a set of items. For decision trees, the two most common measures are entropy and the Gini index. Suppose we are trying to classify items into m classes using a set of training items E. Let $p_i$ (i = 1,…,m) be the fraction of the items of E that belong to class i. The entropy of the probability distribution $(p_i)_{i=1}^{m}$ gives a reasonable measure of the impurity of the set E. The entropy, $-\sum_{i=1}^{m} p_i \log p_i$, is lowest when a single $p_i$ equals 1 and all others are 0, whereas it is maximized when all the $p_i$ are equal. The Gini index2, another common measure of impurity, is computed by $1 - \sum_{i=1}^{m} p_i^2$. This is again zero when the set E contains items from only one class.
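As a concrete illustration of the two impurity measures, the short Python sketch below computes both from the class labels of a set of items (the labels and example sets are arbitrary; this is not code from the article).

```python
from collections import Counter
from math import log2

def class_fractions(labels):
    """The fractions p_i of items belonging to each class."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def entropy(labels):
    """Entropy -sum(p_i log2 p_i): 0 for a pure set, maximal when the classes are equally frequent."""
    return sum(-p * log2(p) for p in class_fractions(labels))

def gini(labels):
    """Gini index 1 - sum(p_i^2): again 0 when all items belong to one class."""
    return 1.0 - sum(p * p for p in class_fractions(labels))

print(entropy(["a"] * 8))              # 0.0, a pure set
print(entropy(["a"] * 4 + ["b"] * 4))  # 1.0, maximally impure for two classes
print(gini(["a"] * 4 + ["b"] * 4))     # 0.5
```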
Given a measure of impurity I, we choose a question that minimizes the weighted average of the impurity of the resulting children nodes. That is, if a question with k possible answers divides E into subsets $E_1, \ldots, E_k$, we choose a question to minimize $\sum_{j=1}^{k} (|E_j|/|E|)\, I(E_j)$. In many cases, we can choose the best question by enumerating all possibilities. If I is the entropy function, then the difference between the entropy of the distribution of the classes in the parent node and this weighted average of the children’s entropy is called the information gain. The information gain, which is expressible via the Kullback-Leibler divergence6, always has a nonnegative value.
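The split-selection rule can be sketched for a single real-valued feature: try every value that occurs in the data as the threshold k and keep the question whose children minimize the weighted average impurity (equivalently, maximize the information gain when entropy is used). The entropy helper is repeated from the sketch above; the data are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of the class fractions, as in the previous sketch."""
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def weighted_impurity(subsets, impurity=entropy):
    """Weighted average impurity sum_j (|E_j|/|E|) * I(E_j) over the children subsets."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * impurity(s) for s in subsets if s)

def best_threshold(values, labels, impurity=entropy):
    """Enumerate the observed values as thresholds k for the question "is the value > k?"."""
    best_k, best_score = None, float("inf")
    for k in sorted(set(values)):
        yes = [lab for v, lab in zip(values, labels) if v > k]
        no = [lab for v, lab in zip(values, labels) if v <= k]
        if not yes or not no:
            continue  # this threshold does not actually split the set
        score = weighted_impurity([yes, no], impurity)
        if score < best_score:
            best_k, best_score = k, score
    gain = impurity(labels) - best_score  # the information gain when impurity is entropy
    return best_k, gain

values = [1.2, 3.4, 2.0, 5.1, 4.8, 0.7]
labels = ["low", "high", "low", "high", "high", "low"]
print(best_threshold(values, labels))  # k = 2.0 separates the classes perfectly; gain = 1.0
```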
We continue to select questions recursively to split the training items into ever-smaller subsets, resulting in a tree. A crucial aspect of applying decision trees is limiting the complexity of the learned trees so that they do not overfit the training examples. One technique is to stop splitting when no question increases the purity of the subsets more than a small amount. Alternatively, we can choose to build out the tree completely until no leaf can be further subdivided. In this case, to avoid overfitting the training data, we must prune the tree by deleting nodes. This can be done by collapsing internal nodes into leaves if doing so reduces the classification error on a held-out set of training examples1. Other approaches, relying on ideas such as minimum description length1,6,7, remove nodes in an attempt to explicitly balance the complexity of the tree with its fit to the training data. Cross-validation on left-out training examples should be used to ensure that the trees generalize beyond the examples used to construct them.
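Both styles of complexity control can be tried with an off-the-shelf implementation. The sketch below assumes scikit-learn is installed and uses a synthetic data set; note that its cost-complexity pruning is related to, but not identical with, the reduced-error pruning on a held-out set described above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Early stopping: refuse any split whose impurity decrease is below a small threshold.
stopped = DecisionTreeClassifier(criterion="entropy", min_impurity_decrease=0.01,
                                 random_state=0).fit(X, y)

# Grow-then-prune: enumerate cost-complexity pruning levels of a fully grown tree and
# keep the level with the best cross-validated accuracy.
path = DecisionTreeClassifier(criterion="entropy",
                              random_state=0).cost_complexity_pruning_path(X, y)
scores = [(cross_val_score(DecisionTreeClassifier(criterion="entropy", ccp_alpha=alpha,
                                                  random_state=0), X, y, cv=5).mean(), alpha)
          for alpha in path.ccp_alphas]
best_score, best_alpha = max(scores)
pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=best_alpha,
                                random_state=0).fit(X, y)
print(stopped.get_n_leaves(), pruned.get_n_leaves(), round(best_score, 3))
```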
Ensembles of decision trees and other variants
Although single decision trees can be excellent classifiers, increased accuracy often can be achieved by combining the results of a collection of decision trees8–10. Ensembles of decision trees are sometimes among the best performing types of classifiers3. Random forests and boosting are two strategies for combining decision trees.

In the random forests8 approach, many different decision trees are grown by a randomized tree-building algorithm. The training set is sampled with replacement to produce a modified training set of equal size to the original but with some training items included more than once. In addition, when choosing the question at each node, only a small, random subset of the features is considered. With these two modifications, each run may result in a slightly different tree. The predictions of the resulting ensemble of decision trees are combined by taking the most common prediction. Maintaining a collection of good hypotheses, rather than committing to a single tree, reduces the chance that a new example will be misclassified by being assigned the wrong class by many of the trees.
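A minimal sketch of those two modifications, assuming scikit-learn's DecisionTreeClassifier as the base learner: max_features="sqrt" supplies the random feature subset considered at each question, while the bootstrap sampling and majority vote are written out explicitly. This illustrates the idea rather than reproducing the reference random forests implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=50, seed=0):
    """Grow n_trees trees, each on a bootstrap sample with a random feature subset per split."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        # Sample the training set with replacement: same size, some items included more than once.
        idx = rng.integers(0, len(X), size=len(X))
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 31)))
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def forest_predict(forest, X):
    """Combine the trees by taking the most common prediction for each item."""
    votes = np.stack([tree.predict(X) for tree in forest]).astype(int)  # (n_trees, n_items)
    return np.array([np.bincount(col).argmax() for col in votes.T])

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
forest = grow_forest(X[:250], y[:250])
print((forest_predict(forest, X[250:]) == y[250:]).mean())  # held-out accuracy of the ensemble
```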
Boosting10 is a machine-learning method used to combine multiple classifiers into a stronger classifier by repeatedly reweighting training examples to focus on the most problematic. In practice, boosting is often applied to combine decision trees. Alternating decision trees11 are a generalization of decision trees that result from applying a variant of boosting to combine weak classifiers based on decision stumps, which are decision trees that consist of a single question. In alternating decision trees, the levels of the tree alternate between standard question nodes and nodes that contain weights and have an arbitrary number of children. In contrast to standard decision trees, items can take multiple paths and are assigned classes based on the weights that the paths encounter. Alternating decision trees can produce smaller and more interpretable classifiers than those obtained from applying boosting directly to standard decision trees.
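Boosting over decision stumps can likewise be tried with scikit-learn (a hedged example on synthetic data; this is ordinary AdaBoost over depth-1 trees, not the alternating decision tree algorithm itself, which scikit-learn does not provide).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=15, random_state=1)

# AdaBoost's default weak learner is a depth-1 decision tree, i.e., a decision stump;
# each round reweights the training examples to focus on those still misclassified.
stump_boosting = AdaBoostClassifier(n_estimators=100, random_state=1)
print(cross_val_score(stump_boosting, X, y, cv=5).mean())
```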
Applications to computational biology
Decision trees have found wide application within computational biology and bioinformatics because of their usefulness for aggregating diverse types of data to make accurate predictions. Here we mention only a few of the many instances of their use.

Synthetic sick and lethal (SSL) genetic interactions between genes A and B occur when the organism exhibits poor growth (or death) when both A and B are knocked out but not when either A or B is disabled individually. Wong et al.12 applied decision trees to predict SSL interactions in Saccharomyces cerevisiae using features as diverse as whether the two proteins interact physically, localize to the same place in the cell or have the same function recorded in a database. They were able to identify a high percentage of SSL interactions with a low false-positive rate. In addition, analysis of the computed trees hinted at several mechanisms underlying SSL interactions.
Computational gene finders use a variety of approaches to determine the correct exon-intron structure of eukaryotic genes. Ab initio gene finders use information inherent in the sequence, whereas alignment-based methods use sequence similarity among related species. Allen et al.13 used decision trees within the JIGSAW system to combine evidence from many different gene finding methods, resulting in an integrated method that is one of the best available ways to find genes in the human genome and the genomes of other species.

Middendorf et al.14 used alternating decision trees to predict whether an S. cerevisiae gene would be up- or downregulated under particular conditions of transcription regulator expression, given the sequence of its regulatory region. In addition to good performance predicting the expression state of target genes, they were able to identify motifs and regulators that appear to control the expression of the target genes.
1. Quinlan, J.R. C4.5: Programs for Machine Learning (Morgan Kaufmann Publishers, San Mateo, CA, USA, 1993).
2. Breiman, L., Friedman, J., Olshen, R. & Stone, C. Classification and Regression Trees (Wadsworth International Group, Belmont, CA, USA, 1984).
3. Caruana, R. & Niculescu-Mizil, A. An empirical comparison of supervised learning algorithms. in Machine Learning, Proceedings of the Twenty-Third International Conference (eds. Cohen, W.W. & Moore, A.) 161–168 (ACM, New York, 2006).
4. Zadrozny, B. & Elkan, C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. in Proceedings of the 18th International Conference on Machine Learning (eds. Brodley, C.E. & Danyluk, A.P.) 609–616 (Morgan Kaufmann, San Francisco, 2001).
5. Murthy, S.K., Kasif, S. & Salzberg, S. A system for induction of oblique decision trees. J. Artif. Intell. Res. 2, 1–32 (1994).
6. MacKay, D.J.C. Information Theory, Inference and Learning Algorithms (Cambridge University Press, Cambridge, UK, 2003).
7. Quinlan, J.R. & Rivest, R.L. Inferring decision trees using the Minimum Description Length Principle. Inf. Comput. 80, 227–248 (1989).
8. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
9. Heath, D., Kasif, S. & Salzberg, S. Committees of decision trees. in Cognitive Technology: In Search of a Human Interface (eds. Gorayska, B. & Mey, J.) 305–317 (Elsevier Science, Amsterdam, The Netherlands, 1996).
10. Schapire, R.E. The boosting approach to machine learning: an overview. in Nonlinear Estimation and Classification (eds. Denison, D.D., Hansen, M.H., Holmes, C.C., Mallick, B. & Yu, B.) 141–171 (Springer, New York, 2003).
11. Freund, Y. & Mason, L. The alternating decision tree learning algorithm. in Proceedings of the 16th International Conference on Machine Learning (eds. Bratko, I. & Džeroski, S.) 124–133 (Morgan Kaufmann, San Francisco, 1999).
12. Wong, S.L. et al. Combining biological networks to predict genetic interactions. Proc. Natl. Acad. Sci. USA 101, 15682–15687 (2004).
13. Allen, J.E., Majoros, W.H., Pertea, M. & Salzberg, S.L. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biol. 7 Suppl. 1, S9 (2006).
14. Middendorf, M., Kundaje, A., Wiggins, C., Freund, Y. & Leslie, C. Predicting genetic regulatory response using classification. Bioinformatics 20, i232–i240 (2004).
15. Chen, X.-W. & Liu, M. Prediction of protein-protein interactions using random decision forest framework. Bioinformatics 21, 4394–4400 (2005).