Online Ensemble Learning
by Nikunj Chandrakant Oza
B.S. (Massachusetts Institute of Technology) 1994 M.S. (University of California, Berkeley) 1998
A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the GRADUATE DIVISION of the UNIVERSITY of CALIFORNIA, BERKELEY
Committee in charge:
Professor Stuart Russell, Chair Professor Michael Jordan Professor Leo Breiman
Fall 2001
The dissertation of Nikunj Chandrakant Oza is approved:
Chair
Date
Date
Date
University of California at Berkeley
2001
Online Ensemble Learning
Copyright 2001 by Nikunj Chandrakant Oza
1
Abstract
Online Ensemble Learning by Nikunj Chandrakant Oza Doctor of Philosophy in Computer Science University of California at Berkeley Professor Stuart Russell, Chair
This thesis presents online versions of the popular bagging and boosting algorithms. We demonstrate theoretically and experimentally that the online versions perform comparably to their original batch counterparts in terms of classi?cation performance. However, our online algorithms yield the typical practical bene?ts of online learning algorithms when the amount of training data available is large. Ensemble learning algorithms have become extremely popular over the last several years because these algorithms, which generate multiple base models using traditional machine learning algorithms and combine them into an ensemble model, have often demonstrated signi?cantly better performance than single models. Bagging and boosting are two of the most popular algorithms because of their good empirical results and theoretical support. However, most ensemble algorithms operate in batch mode, i.e., they repeatedly read and process the entire training set. Typically, they require at least one pass through the training set for every base model to be included in the ensemble. The base model learning algorithms themselves may require several passes through the training set to create each base model. In situations where data is being generated continuously, storing data for batch learning is impractical, which makes using these ensemble learning algorithms impossible. These algorithms are also impractical in situations where the training set is large enough that reading and processing it many times would be prohibitively expensive.
2 This thesis describes online versions of bagging and boosting. Unlike the batch versions, our online versions require only one pass through the training examples in order regardless of the number of base models to be combined. We discuss how we derive the online algorithms from their batch counterparts as well as theoretical and experimental evidence that our online algorithms perform comparably to the batch versions in terms of classi?cation performance. We also demonstrate that our online algorithms have the practical advantage of lower running time, especially for larger datasets. This makes our online algorithms practical for machine learning and data mining tasks where the amount of training data available is very large.
Professor Stuart Russell Dissertation Committee Chair
iii
To my parents.
iv
Contents
List of Figures List of Tables 1 Introduction 2 Background 2.1 Batch Supervised Learning . . . . . . . . . . 2.1.1 Decision Trees and Decision Stumps . 2.1.2 Naive Bayes . . . . . . . . . . . . . 2.1.3 Neural Networks . . . . . . . . . . . 2.2 Ensemble Learning . . . . . . . . . . . . . . 2.2.1 Motivation . . . . . . . . . . . . . . 2.2.2 Ensemble Learning Algorithms . . . 2.3 Online Learning . . . . . . . . . . . . . . . . 2.4 Online Ensemble Learning . . . . . . . . . . 3 Bagging 3.1 The Bagging Algorithm . . . . . . . . . . . 3.2 Why and When Bagging Works . . . . . . 3.3 The Online Bagging Algorithm . . . . . . . 3.4 Convergence of Batch and Online Bagging . 3.4.1 Preliminaries . . . . . . . . . . . . 3.4.2 Main Result . . . . . . . . . . . . . 3.5 Experimental Results . . . . . . . . . . . . 3.5.1 The Data . . . . . . . . . . . . . . 3.5.2 Accuracy Results . . . . . . . . . . 3.5.3 Running Times . . . . . . . . . . . 3.5.4 Online Dataset . . . . . . . . . . . 3.6 Summary . . . . . . . . . . . . . . . . . . vi xi 1 8 8 9 12 12 15 15 18 23 27 34 34 36 38 39 40 45 49 50 52 56 58 63
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
v 4 Boosting 4.1 Earlier Boosting Algorithms . . . . . . . . . . . . . . . . . 4.2 The AdaBoost Algorithm . . . . . . . . . . . . . . . . . . . 4.3 Why and When Boosting Works . . . . . . . . . . . . . . . 4.4 The Online Boosting Algorithm . . . . . . . . . . . . . . . 4.5 Convergence of Batch and Online Boosting for Naive Bayes 4.5.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . 4.5.2 Main Result . . . . . . . . . . . . . . . . . . . . . . 4.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . 4.6.1 Accuracy Results . . . . . . . . . . . . . . . . . . . 4.6.2 Priming . . . . . . . . . . . . . . . . . . . . . . . . 4.6.3 Base Model Errors . . . . . . . . . . . . . . . . . . 4.6.4 Running Times . . . . . . . . . . . . . . . . . . . . 4.6.5 Online Dataset . . . . . . . . . . . . . . . . . . . . 4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 65 66 70 72 76 77 82 86 87 89 91 96 103 104
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
5 Conclusions 106 5.1 Contributions of this Dissertation . . . . . . . . . . . . . . . . . . . . . . . 106 5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 Bibliography 111
vi
List of Figures
?
2.1 2.2 An example of a decision tree. . . . . . . . . . . . . . . . . . . . . . . . . 10
Decision Tree Learning Algorithm. This algorithm takes a training set , attribute set , goal attribute , and default class as inputs and returns a decision tree. denotes the th training example, denotes example ’s value for attribute , and denotes example ’s value for the goal attribute . . . . . . . . . . . . . . . 11 Naive Bayes Learning Algorithm. This algorithm takes a training set and attribute set as inputs and returns a Naive Bayes classi?er. is the number of training examples seen so far, is the number of examples in class , and is the number of examples in class that have as their value for attribute . . 13
2.3
2.4 2.5
An example of a multilayer feedforward perceptron. . . . . . . . . . . . . . 14 An ensemble of linear classi?ers. Each line A, B, and C is a linear classi?er. The boldface line is the ensemble that classi?es new examples by returning the majority vote of A, B, and C. . . . . . . . . . . . . . . . . . . . . . . . 16
Batch Bagging Algorithm and Sampling with Replacement: is the original training set of examples, is the number of base models to be learned, is the base model learning algorithm, the ’s are the classi?cation functions that take a new example as input and return the predicted class from the set of possible classes , is a function that returns each of the integers from to with equal probability, and is the indicator function that returns 1 if event is true and 0 otherwise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 AdaBoost algorithm: is the set of training examples, is the base model learning algorithm, and is the number of base models to be generated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Weighted Majority Algorithm: is the vector of weights corresponding to the predictors, is the latest example to arrive, is the correct classi?cation of example , the are the predictions of the experts . . . . . . . 24
2.6
¤
4
? ? rfff x rv t w?#HH?yHwr us
$
¤
r
2.8
3
2 qV h S h $P S f f f S V ` S ` $P Up@ci?¨ggHHHe0dcba¨8Y
2.7
¤ ?? ?
" #! §
1
1
?
3
?
?
?
?
) '% 0(&$
?
¤
? ¨¤ ? § 4
V ?P XRW
V S 1P 6 F ? FD 7 HUTRQHIHGEC?
$
2
?
2
?
?
?
A9 ? 7 1 B@&86
?
¨? §¤ 5 ?
vii 2.9
Breiman’s blocked ensemble algorithm: Among the inputs, is the training set, is the number of base models to be constructed, and is the size of each base model’s training set ( for ). is the base model learning algorithm, is the number of training examples examined in the process of creating each and is the number of these examples that the ensemble previously constructed base models ( ) misclassi?es. . . . . . . . Test Error Rates: Boosting vs. Blocked Boosting with decision tree base models. Test Error Rates: Online Boosting vs. Blocked Boosting with decision tree base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test Error Rates: Primed Online Boosting vs. Blocked Boosting with decision tree base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . is the set of base models to Online Bagging algorithm. be updated, is the next training example, is the userchosen probability that each example should be included in the next base model’s training set, and is the online base model learning algorithm that takes a base model and training example as inputs and returns the updated base model. . . . . . . . . . . . . . Test Error Rates: Bagging vs. Fixed Fraction Online Bagging with decision tree base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test Error Rates: Online Bagging vs. Fixed Fraction Online Bagging with decision tree base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . is the set of base models to Online Boosting algorithm. is the next training example, and is the online base model be updated, learning algorithm that takes a base model, training example, and its weight as inputs and returns the updated base model. . . . . . . . . . . . . . . . . . . Test Error Rates: Boosting vs. Online Arcx4 with decision tree base models. . Test Error Rates: Online Boosting vs. Online Arcx4 with decision tree base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test Error Rates: Primed Online Boosting vs. Online Arcx4 with decision tree base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.10 2.11 2.12 2.13
2.14 2.15 2.16
2.17 2.18 2.19 3.1
Batch Bagging Algorithm and Sampling with Replacement: is the original training set of examples, is the number of base models to be learned, is the base model learning algorithm, the ’s are the classi?cation functions that take a new example as input and return the predicted class, is a function that returns each of the integers from to with equal probability, and is the indicator function that returns 1 if event is true and 0 otherwise. . . 35
? d3
V S 1P 6 F ? FD 7 jUTRQHIHGEi?
?
3
?
? ?
A9 ? 7 1 B@&8h6
?
3
q 2SfffS ?S ? Y ?UBHHHIU@?? ? d3 ? ` ? ?? 1
q ? 4SfffS ? 4S` 4Y s gTUgHHHQU???fe?
? q ? 4SfffS ? 4S` 4Y s ?TUbHHHQUgE????
4SfffS` U?HHHb?4
¤
A 4
? ??
F
2
D
? ??
V S $ Qca¨P
V S $ Qca¨P
V ?P kRTW
2
. 27 . 28 . 28 . 29
. 30 . 31 . 31
. 31 . 32 . 32 . 33
viii 3.2
The Batch Bagging Algorithm in action. The points on the left side of the ?gure represent the original training set that the bagging algorithm is called with. The three arrows pointing away from the training set and pointing toward the three sets of points represent sampling with replacement. The base model learning algorithm is called on each of these samples to generate a base model (depicted as a decision tree here). The ?nal three arrows depict what happens when a new example to be classi?ed arrives—all three base models classify it and the class receiving the maximum number of votes is returned. . . . . . . . . . . . . . . . . . . . . . 3.3 Online Bagging Algorithm: is the classi?cation function returned by online bagging, is the latest training example to arrive, and is the online base model learning algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 Test Error Rates: Batch Bagging vs. Online Bagging with decision tree base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5 Test Error Rates: Batch Bagging vs. Online Bagging with Naive Bayes base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6 Test Error Rates: Batch Bagging vs. Online Bagging with decision stump base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.7 Test Error Rates: Batch Neural Network vs. Online Neural Network (one update per example). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.8 Test Error Rates: Batch Bagging vs. Online Bagging with neural network base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.9 Test Error Rates: Batch Neural Network vs. Online Neural Network (10 updates per example). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.10 Test Error Rates: Batch Bagging vs. Online Bagging with neural network base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.11 Basic structure of online learning algorithm used to learn the calendar data. is the current hypothesis, is the latest training example to arrive, and is the online learning algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . .
. 36
. . . . . . . . . . . . . . . . . . . 72
V S $ Qca¨P
? d3
2
?
4.2 4.3
. . . . . . . . . . . . . . . . . . . 67 . . . . . . . . . . . . . . . . . . . 69
base models learned so far, is the online base model learning
3
2 qV h S h $P S f f f S V ` S ` $P Up@ci?¨ggHHHe0dcba¨8Y
4.1
AdaBoost algorithm: is the base model learning algorithm, and generated. . . . . . . . . . . . . . . . . The Batch Boosting Algorithm in action. . Online Boosting Algorithm: is the set of is the latest training example to arrive, and algorithm. . . . . . . . . . . . . . . . .
is the set of training examples, is the number of base models to be
?
4
? l3
? l3
?
$
?
. 39 . 54 . 54 . 56 . 56 . 56 . 58 . 58
. 63
ix 4.4
Illustration of online boosting in progress. Each row represents one example being passed in sequence to all the base models for updating; time runs down the diagram. Each base model (depicted here as a tree) is generated by updating the base model above it with the next weighted training example. In the upper left corner (point “a” in the diagram) we have the ?rst training example. This example updates the ?rst base model but is still misclassi?ed after training, so its weight is increased (the rectangle “b” used to represent it is taller). This example with its higher weight updates the second base model which then correctly classi?es it, so its weight decreases (rectangle “c”). . . . . . . . . . . . . . . . . . . . . . . Learning curves for Car Evaluation dataset. . . . . . . . . . . . . . . . . . . Test Error Rates: Batch Boosting vs. Online Boosting with decision tree base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test Error Rates: Batch Boosting vs. Online Boosting with Naive Bayes base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test Error Rates: Batch Boosting vs. Online Boosting with decision stump base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test Error Rates: Batch Boosting vs. Online Boosting (one update per example) with neural network base models. . . . . . . . . . . . . . . . . . . . . . . . Test Error Rates: Batch Boosting vs. Online Boosting (10 updates per example) with neural network base models. . . . . . . . . . . . . . . . . . . . . . . . Test Error Rates: Batch Boosting vs. Primed Online Boosting with decision tree base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test Error Rates: Online Boosting vs. Primed Online Boosting with decision tree base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test Error Rates: Batch Boosting vs. Primed Online Boosting with Naive Bayes base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test Error Rates: Online Boosting vs. Primed Online Boosting with Naive Bayes base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test Error Rates: Batch Boosting vs. Primed Online Boosting with decision stump base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test Error Rates: Online Boosting vs. Primed Online Boosting with decision stump base models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test Error Rates: Batch Boosting vs. Primed Online Boosting with neural network base models (one update per example). . . . . . . . . . . . . . . . . . . . . Test Error Rates: Online Boosting vs. Primed Online Boosting with neural network base models (one update per example). . . . . . . . . . . . . . . . . . Test Error Rates: Batch Boosting vs. Primed Online Boosting with neural network base models (10 updates per example). . . . . . . . . . . . . . . . . . . . . Test Error Rates: Online Boosting vs. Primed Online Boosting with neural network base models (10 updates per example). . . . . . . . . . . . . . . . . . Naive Bayes Base Model Errors for Synthetic1 Dataset . . . . . . . . . . . . Naive Bayes Base Model Errors for Synthetic2 Dataset . . . . . . . . . . . .
4.5 4.6 4.7 4.8 4.9 4.10 4.11 4.12 4.13 4.14 4.15 4.16 4.17 4.18 4.19 4.20 4.21 4.22
. 74 . 76 . 89 . 89 . 91 . 91 . 91 . 95 . 95 . 95 . 95 . 96 . 96 . 96 . 96 . 97 . 97 . 97 . 97
x 4.23 4.24 4.25 4.26 4.27 4.28 4.29 4.30
Naive Bayes Base Model Errors for Synthetic3 Dataset . . . . Naive Bayes Base Model Errors for Census Income Dataset . . Naive Bayes Base Model Errors for Forest Covertype Dataset . Neural Network Base Model Errors for Synthetic1 Dataset . . Neural Network Base Model Errors for Synthetic2 Dataset . . Neural Network Base Model Errors for Synthetic3 Dataset . . Neural Network Base Model Errors for Census Income Dataset Neural Network Base Model Errors for Forest Covertype Dataset
. . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
97 97 98 98 98 98 98 98
xi
List of Tables
The datasets used in our experiments. For the Census Income dataset, we have given the sizes of the supplied training and test sets. For the remaining datasets, we have given the sizes of the training and test sets in our ?vefold crossvalidation runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . for in Synthetic Datasets . . . . . . 3.2 3.3 Results (fraction correct): batch and online bagging (with Decision Trees) . 3.4 Results (fraction correct): batch and online bagging (with Naive Bayes) . . 3.5 Results (fraction correct): batch and online bagging (with decision stumps) 3.6 Results (fraction correct): batch and online bagging (with neural networks). The column “Neural Network” gives the average test set performance of running backpropagation with 10 epochs on the entire training set. “Online Neural Net” is the result of running backpropagation with one update step per training example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.7 Results (fraction correct): online algorithms (with neural networks trained with 10 update steps per training example). . . . . . . . . . . . . . . . . . 3.8 Running Times (seconds): batch and online bagging (with Decision Trees) . 3.9 Running Times (seconds): batch and online bagging (with Naive Bayes) . . 3.10 Running Times (seconds): batch and online bagging (with decision stumps) 3.11 Running times (seconds): batch and online bagging (with neural networks). (1) indicates one update per training example and (10) indicates 10 updates per training example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.12 Results (fraction correct) on Calendar Apprentice Dataset . . . . . . . . . . 4.1 3.1
51 51 52 53 55
4.2
Results (fraction correct): batch and online boosting (with Decision Trees). Boldfaced/italicized results are signi?cantly better/worse than single decision trees. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 Results (fraction correct): batch and online boosting (with Naive Bayes). Boldfaced/italicized results are signi?cantly better/worse than single Naive Bayes classi?ers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
? ? v?? ? v ~  z T~ ?BTbv Q}{y
x wv u s q p on yHb` e' o trk' i?m
57 59 60 61 61
62 64
xii 4.3 Results (fraction correct): batch and online boosting (with decision stumps). Boldfaced/italicized results are signi?cantly better/worse than single decision stumps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Results (fraction correct): batch algorithms (with neural networks). Boldfaced/italicized results are signi?cantly better/worse than single neural networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Results (fraction correct): online algorithms (with neural networks trained with one update step per training example). Boldfaced/italicized results are signi?cantly better/worse than single online neural networks. Results marked “ ” are signi?cantly different from single batch neural networks (Table 4.4). Results marked “ ” are signi?cantly different from batch boosting (Table 4.4). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Results (fraction correct): online algorithms (with neural networks trained with 10 update steps per training example). Boldfaced/italicized results are signi?cantly better/worse than single online neural networks. Results marked “ ” are signi?cantly different from single batch neural networks (Table 4.4). Results marked “ ” are signi?cantly different from batch boosting (Table 4.4). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Running Times (seconds): batch and online boosting (with Decision Trees) Running Times (seconds): batch and online boosting (with Naive Bayes) . . Running Times (seconds): batch and online boosting (with decision stumps) Running times (seconds): batch algorithms (with neural networks) . . . . . Running times (seconds): online algorithms (with neural networks–one update per example) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Running times (seconds): online algorithms (with neural networks–ten updates per example) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Results (fraction correct) on Calendar Apprentice Dataset . . . . . . . . . .
90
4.4
92
4.5
4.6
4.7 4.8 4.9 4.10 4.11 4.12 4.13
? ?
? ?
93
94 99 100 101 101 102 102 105
xiii
Acknowledgements
This Ph.D. thesis represents a great deal more than the intellectual exercise that it was intended to represent. This thesis, and my Ph.D. program as a whole re?ects my growth both intellectually and as a person. Many people have played signi?cant roles in my growth during this period and earlier. First and foremost I give thanks to my parents. I think my former chess coach, Richard Shorman, said it best, “They clearly put a lot of tender love and care into you.” I could not agree more. They have always been there for me and they are much more than I could ask for in a family. My uncle “Uttammama” (Uttam Oza) has been a part of my life since I was one year old. From patiently allowing me to read storybooks to him when I was two years old to making many suggestions related to my applications to undergraduate and graduate schools, he has always been there and it is impossible to enumerate everything he has done. During the time I have known him, he has added to his family. He and my aunt Kalindimami, and cousins Anand and Anuja have formed a second immediate family that have always been there. My extended families on my mom and dad’s sides both here and in India have always expressed great con?dence and pride in me, which has helped me immensely in my work and in life in general. I made many friends while I was here. First and foremost, I thank Huma Dar. I tell people that she is my former classmate, former of?cemate, friend, political discussionmate, Berkeley older sister, and parole of?cer. I have known Mehul Patel since my undergraduate days at MIT and he continues to be a good friend here at Berkeley and a link to the past. Marco van Akkeren and I spent much time discussing graduate student life. Marco, Megumi Yamato, and I had many discussions and I especially enjoyed our Sunday evening jaunts to Ben and Jerry’s. Michael Anshelevich was also part of our group at International House and he helped me with one of the proofs in this thesis. Several current and past members of RUGS (Russell’s Unusual Graduate Students) have been helpful to me. Geoff Zweig, Jeff Forbes, Tim Huang, Ron Parr, Archie Russell, and John Binder were quite helpful to me when I was doing my Master’s research. Dean Grannes, Eric Fosler, and Phil McLauchlan have helped me in many ways and I appreciate the chance that I had to work with them. My of?cemates over the years have helped me with my work and added muchneeded humor to the of?ce environment: David Blei, Huma
xiv Dar, Daishi Harada, Andrew Ng, Magnus Stensmo, and Andy Zimdars. I bugged Andrew Ng and David Blei often to pick the probability components of their brains. Andrew Ng was especially helpful to me with one of the proofs in this thesis. Thanks guys! Stuart Russell was my research advisor and always will be an advisor to me. He gave me the freedom to explore the ?eld to decide what I wanted to work on. In my explorations I learned much about the ?eld beyond my thesis topic, and I know that this will help me throughout my career. He patiently helped me through my explorations and took much time to discuss ideas with me. I credit most of my intellectual growth during my time at Berkeley to him. The funding that he gave me allowed me to work without worrying about the ?nancial side of life. Mike Jordan, Leo Breiman, and Joe Hellerstein served on my qualifying exam committee and were generous with their time in helping me sort out exactly what I wanted to work on. Jitendra Malik has been helpful to me during my time here in many ways including serving on my Master’s report committee. During my time here, California MICRO and Schlumberger provided fellowship support for two years and the NSF Graduate Research Training Program in Spatial Cognition provided me with funding for two years. The NSF program helped me broaden my horizons by enabling me to learn about and conduct research in Cognitive Science and Psychology. A number of administrators were a great help to me while I was here. Sheila Humphreys, Mary Byrnes, Winnie Wang, Peggy Lau, and Kathryn Crabtree helped smooth out many bumps during my time at Berkeley. I taught a C++ programming course in the Spring 1998 semester at City College of San Francisco. I thoroughly enjoyed the chance to run my own course. I became friends with Louie Giambattista while I was there and he remains a good friend to this day. As a graduate student I worked during one summer at General Electric Corporate Research and Development with Tomek Strzalkowski. Working with him was a great pleasure and he has helped me after that summer in many ways. I also spent two summers at NASA Ames Research Center with Kagan Tumer. He gave me much freedom to explore what I wanted to work on while I was there and took me under his wing as a research “little brother.” Ron Genise taught me algebra and geometry when I was in junior high school. He also designed a course in which we used a geometry exploration software that he and his son designed to explore concepts of geometry. He gave me my ?rst taste of what research was
xv like, and I was hooked. His dedication to teaching inspired me and he has served as my mentor ever since I took his courses. Roger Poehlmann is my chess coach. His dedication to chess and his enjoyment of teaching has not only helped me improve as a tournament chess player but has helped me in my work by providing me a needed diversion. I look forward to continuing to work with him. Prof. Michael Rappa was my freshman advisor at MIT and both he and his wife Lynda have been friends ever since. They have helped me in more ways than I can count and I feel honored to know two such special people.
1
Chapter 1 Introduction
A supervised learning task involves constructing a mapping from input data (normally described by several features) to the appropriate outputs. In a classi?cation learning task, each output is one or more classes to which the input belongs. In a regression learning task, each output is some other type of value to be predicted. In supervised learning, a set of training examples—examples with known output values—is used by a learning algorithm to generate a model. This model is intended to approximate the mapping between the inputs and outputs. For example, we may have data consisting of examples of credit card holders. In a classi?cation learning task, our goal may be to learn to predict whether a person will default on his/her credit card based on the other information supplied. Each example may correspond to one credit card holder, listing that person’s various features of interest (e.g., average daily credit card balance, other credit cards held, and frequency of late payment) as well as whether the person defaulted on his/her credit card. A learning algorithm would use the supplied examples to generate a model that approximates the mapping between each person’s supplied ?nancial information and whether or not he defaults on his credit card. This model can then be used to predict whether an applicant for a new credit card will default and; therefore, to decide whether to issue that person a card. A regression learning task’s goal may be to predict a person’s average daily credit card balance based on the other variables. In this case, a learning algorithm would generate a model that approximates the mapping from each person’s ?nancial information to the average daily balance. The generalization performance of a learned model (how closely the target outputs and the
2 model’s predicted outputs agree for patterns that have not been presented to the learning algorithm) would provide an indication of how well the model has learned the desired mapping. Many learning algorithms generate one model (e.g., a decision tree or neural network) that can be used to make predictions for new examples. Ensembles—also known as combiners, committees, or turnkey methods—are combinations of several models whose individual predictions are combined in some manner (e.g., averaging or voting) to form a ?nal prediction. For example, one can generate three neural networks from a supplied training set and combine them into an ensemble. The ensemble could classify a new example by passing it to each network, getting each network’s prediction for that example, and returning the class that gets the maximum number of votes. Many researchers have demonstrated that ensembles often outperform their base models (the component models of the ensemble) if the base models perform well on novel examples and tend to make errors on different examples (e.g., (Breiman, 1993; Oza & Tumer, 1999; Tumer & Oza, 1999; Wolpert, 1992)). and consider a new example . If all three networks always agree, then whenever To see why, let us de?ne is incorrect, and to be the three neural networks in the previous example
will also be incorrect, so that the incorrect class will get the is incorrect,
majority of the votes and the ensemble will also be incorrect. On the other hand, if the and may be correct, so that the ensemble will return the correct class by majority
if the base models’ errors are independent, then the probability that the ensemble makes is precisely , where is a
network example, if all the networks have an error rate of 0.3 and make independent errors, then the probability that the ensemble misclassi?es a new example is 0.21. Even better than base models that make independent errors would be base models that are somewhat anticorrelated. For example, if no two networks make a mistake on the same example, then the ensemble’s performance will be perfect because if one network misclassi?es an example, then the remaining two networks will correct the error. Two of the most popular ensemble algorithms are bagging (Breiman, 1994) and boost
x ? v ?eQ??@???? ?n? y? ? ? ??
?? ?
an error is the probability that more than
base models misclassify the example. This random variable. In our three
h??? ? ~ ?
?
vote. More precisely, if an ensemble has
base models having an error rate
x ?n ? ? ?
x Cn ` ? ?
networks tend to make errors on different examples, then when
x ?n ` ? ?
?
? v v E? ? ? b` ? ?
x ?@E? ?n ?
x ? ? ? ?n d ?????m
x Cn ? ? ?
x ?@E? ?n ?
and
3 ing (Freund & Schapire, 1996). Given a training set, bagging generates multiple bootstrapped training sets and calls the base model learning algorithm with each of them to training set by repeatedly ( times) selecting one of the yield a set of base models. Given a training set of size , bootstrapping generates a new
of them have equal probability of being selected. Some training examples may not be selected at all and others may be selected multiple times. A bagged ensemble classi?es a new example by having each of its base models classify the example and returning the class that receives the maximum number of votes. The hope is that the base models generated from the different bootstrapped training sets disagree often enough that the ensemble performs better than the base models. Boosting is a more complicated algorithm that generates a sequence of base models. Boosting maintains a probability distribution over the training set. Each base model is generated by calling the base model learning algorithm with the training set weighted by the current probability distribution. Then, the base model is tested on the training set. Those examples that are misclassi?ed have their weights increased so that their weights represent half the total weight of the training set. The remaining half of the total training set weight is allocated for the correctly classi?ed examples, which have their weights decreased accordingly. This new probability distribution and the training set are used to generate the next base model. Intuitively, boosting increases the weights of previously misclassi?ed examples, thereby focusing more of the base model learning algorithm’s attention on these hardtolearn examples. The hope is that subsequent base models correct the mistakes of the previous models. Bagging and boosting are popular among ensemble methods because of their strong theoretical motivations (Breiman, 1994; Freund & Schapire, 1997) and the good experimental results obtained with them (Freund & Schapire, 1996; Bauer & Kohavi, 1999; Dietterich, 2000). Most ensemble learning algorithms including bagging and boosting are batch algorithms. That is, they repeatedly read and process the entire set of training examples. They typically require at least one pass through the data for each base model to be generated. Often, the base model learning algorithms themselves require multiple passes through the
If the base model learning algorithm cannot accept weighted training sets, then one can generate a bootstrapped training set according to the weight distribution and pass that to the learning algorithm instead.
?
training set to create each model. Suppose
is the number of base models to be incorpo
`
?
?
examples at random, where all
?
?
4
algorithm needs to pass through the training set to create a base model. In this case, bagthrough the training set—for each base model, it needs passes to create the base model
and one pass to test the base model on the training set. We would prefer to learn the entire ensemble in an online fashion, i.e., using only one pass through the entire training set. This would make ensemble methods practical for use when data is being generated continuously so that storing data for batch learning is impractical. Online ensemble learning would also be helpful in data mining tasks in which the datasets are large enough that multiple passes through the datasets would require a prohibitively long training time. In this thesis, we present online versions of bagging and boosting. Speci?cally, we discuss how our online algorithms mirror the techniques that bagging and boosting use to generate multiple distinct base models. We also present theoretical and experimental evidence that our online algorithms succeed in this mirroring, often obtaining classi?cation performance comparable to their batch counterparts in less time. Our online algorithms are demonstrated to be more practical with larger datasets. In Chapter 2, we present the relevant background for this thesis. We ?rst review batch supervised learning in general and the base models that we use. We discuss what has been done in the areas of ensemble learning and online learning, which are the two essential components of this thesis. We also describe the little work that has been done in online ensemble learning. We discuss in more detail the motivation for ensemble learning and the various ensemble learning algorithms. We compare these algorithms in terms of how they bring about diversity among their constituent or base models, which is necessary in order for ensembles to achieve better performance. We also discuss online learning, including the motivation for it and the various online learning algorithms. We discuss the Weighted Majority (Littlestone & Warmuth, 1994) and Winnow (Littlestone, 1988) algorithms in more detail because, like online ensemble learning algorithms, they maintain several models and update them in an online manner. However, Weighted Majority and Winnow are designed to perform not much worse than the best of their constituent models, whereas our online ensemble algorithms, like typical batch ensemble methods, are designed to yield performance better than any of their base models. We discuss how this difference comes about.
x ¤?C?? ~ ? ?n
?
? ??
ging requires
passes through the training set. Boosting requires
?
rated into the ensemble and
is the average number of times that the base model learning passes
5 In Chapter 3, we discuss the bagging algorithm in great detail. We then derive our online bagging algorithm which has two requirements. The ?rst requirement is an online learning algorithm to produce the base models. The second requirement is a method of mirroring bagging’s method of producing multiple distinct base models. In particular, bagis often unavailable when learning online. Our online bagging algorithm avoids this retraining example to update each base model, where is a suitable Poisson random variging’s bootstrap sampling method, which we discussed earlier, requires knowing quirement. It simulates the bootstrap sampling process by sending , which
copies of each new
able. We prove that the sampling distribution used in online bagging converges to what is used in batch bagging. We then prove that the ensemble returned by our online bagging algorithm converges to that returned by the batch algorithm given the same training set if the base model learning algorithms are proportional and if they return classi?ers that converge toward each other as the size of the training set grows. By proportional, we mean that the batch and online base model learning algorithms return the same hypothesis given training sets where the relative proportion of every example to every other example is the same. The size of the bootstrap training set should not matter. For example, changing the training set by duplicating every example should not change the hypothesis returned by the base model learning algorithm. This is true for decision trees, decision stumps, and Naive Bayes classi?ers, which are three of the base models that we experiment with in this thesis. However, this is not true for neural networks, which we also experiment with. We discuss the results of our experiments comparing the generalization performances of online bagging and batch bagging on many real and synthetic datasets of varying sizes. We see that batch and online bagging mostly performed comparably to one another when using decision trees, decision stumps, and Naive Bayes classi?ers as the base models. These base models satisfy the conditions we described in the last paragraph. When using neural networks as the base models, online bagging’s performance suffered, especially on the smaller datasets. On larger datasets; however, the loss was small enough that it may be acceptable given the reduced running time. Additionally, we discuss experiments with our online bagging algorithm on a domain in which data is arriving continuously and the algorithm needs to supply a prediction for each example as it arrives. Online bagging with decision trees performed best on this problem and improved upon single decision trees.
?
?
?
6 In Chapter 4, we discuss the boosting algorithm (speci?cally, AdaBoost) in detail and derive our online boosting algorithm from boosting. Just as with bagging, it appears that examples their proper weights. Once again, we avoid the requirement of knowing we require foreknowledge of , the size of the training set, in order to give the training
suitable Poisson random variables to approximate boosting’s process of reweighting the training set. We prove that our online boosting algorithm’s performance becomes comparable to that of the batch boosting algorithm when using Naive Bayes base models. We also review our experimental results comparing the performances of online boosting and batch boosting on the same datasets on which we compare online and batch bagging. We ?nd that, in some cases, our online boosting algorithm signi?cantly underperforms batch boosting with small training sets. Even when the training set is large, our online algorithm may underperform initially before ?nally catching up to the batch algorithm. This characteristic is common with online algorithms because they do not have the luxury of viewing the training set as a whole the way batch algorithms do. Online boosting’s performance is especially worse when given a lossy online learning algorithm to create its base models. We experiment with “priming” our online boosting algorithm by running it in batch mode for some initial subset of the training set and running in online mode for the remainder of the training set. Priming is demonstrated to improve online boosting, especially when the online base model learning algorithm is lossy. We also compare the running times of our algorithms. In most cases, online boosting was faster than batch boosting while in other cases online boosting was slower. However, primed online boosting was almost always fastest among the boosting algorithms. We also compare online boosting and batch boosting in more detail by comparing the errors of their base models, which are important in determining the overall behavior of the boosting algorithms. Additionally, we discuss experiments with our online boosting algorithm on the same online domain that we experiment with in Chapter 3. In Chapter 5, we summarize the contributions of this thesis and discuss future work. This thesis delivers online versions of bagging and boosting that allow one to obtain the high accuracy of these methods when the amount of training data available is such that repeatedly examining this data, as bagging and boosting require, is impractical. Our theoretical and empirical results give us the con?dence that our online algorithms achieve this
?
?
using
7 goal.
8
Chapter 2 Background
In this chapter, we ?rst introduce batch supervised learning in general as well as the speci?c algorithms that we use in this thesis. We then discuss the motivation for and existing work in the area of ensemble learning. We introduce the bagging and boosting algorithms because the online algorithms that we present in this thesis are derived from them. We then introduce and discuss online learning. We then describe the work that has been done in online ensemble learning, thereby motivating the work presented in this thesis.
2.1
Batch Supervised Learning
examples or instances. It is assumed that there is a distribution
from which each training example is drawn independently. The th example is of the form value to be predicted. In a classi?cation problem, represents one or more classes to which
value to be predicted. In the previous chapter, we gave the example of a classi?cation problem in which we want to learn how to predict whether a credit card applicant is likely to default on his credit card given certain information such as average daily credit card balance, other credit cards held, and frequency of late payment. In this problem, each of the examples in the training set would represent one current credit card holder for whom we
¨
?
the example represented by
belongs. In a regression problem,
is some other type of
¨
¤
¤
¨
¤
?
, where
is a vector of values of several features or attributes and
represents the
§
¤
?
?
ing set consists of
?
?
A batch supervised learning algorithm
?
takes a training set
as its input. The train
¤
x jv Cn ¨ ? ¤ ¤
?
9 know whether he has defaulted up to now or not. If an applicant has an average daily credit card balance of $1500, ?ve other credit cards, and pays late ten percent of the time, and has The output of a supervised learning algorithm is a hypothesis unknown mapping from the inputs to the outputs. In our example, not defaulted so far, then he may be represented as .
mation about the credit card applicant to a prediction of whether the applicant will default. In the experiments that we present in this thesis, we have a test set—a set of examples that ples in are assumed to be independent and identically distributed (i.i.d.) draws from the as the proportion of test cases that misclassi?es: we use to test how well the hypothesis same distribution predicts the outputs on new examples. The exam
otherwise.
In this thesis, we use batch supervised learning algorithms for decision trees, decision stumps, Naive Bayes, and neural networks. We de?ne each of these now.
2.1.1
Decision Trees and Decision Stumps
A decision tree is a tree structure consisting of nonterminal nodes, leaf nodes, and arcs. An example of a decision tree that may be produced in our example creditcard domain is depicted in Figure 2.1. Each nonterminal node represents a test on an attribute value. In the example decision tree, the top node (called the root node) tests whether the example to be classi?ed has an income attribute value that is less than $50000. If it does not, then we go down the right arc, where we see a leaf node that says “NO.” A leaf node contains the classi?cation to be returned if we reach it. In this case, if the income is at least $50000, then we do not examine any other attribute values and we predict that the person will not default on his credit card. If the income is less than $50000, then we go down the left arc to the next nonterminal node, which tests whether the average daily checking account balance is less than $500. If it is, then we predict that the person will default on his credit card,
?
x in ?
°
?
where
is the test set and
is the indicator function—it returns 1 if
x ±p ? n ? ?¨ ?x ?l?n
?
°d??) % s s? ? § ? ~
?
on the test set
is true and 0
?
?
from which the examples in
were drawn. We measure the error of
x ?? bx q ~ v ? v dq ?#0n ? v q ? ~n ? ?
x ¨ ? p ?jv ?n
that approximates the
would map from infor
?
§
?
10
Income < $50000 YES NO
Average daily checking balance < $500 YES NO
NO
YES
NO
Figure 2.1: An example of a decision tree.
otherwise we predict that he will not default. Note that this tree never examines a person’s total savings. Decision trees are constructed in a topdown manner. A decision tree learning algorithm (ID3) is shown in Figure 2.2. If all the examples are of the same class, then the algorithm just returns a leaf node of that class. If there are no attributes left with which to construct a nonterminal node, then the algorithm has to return a leaf node. It returns a leaf node labeled with the class most frequently seen in the training set. If none of these conditions is true, then the algorithm ?nds the one attribute value test that comes closest to splitting the entire training set into groups such that each group only contains examples of one class. When such an attribute is selected, the training set is split according to that attribute. That is, for have value for the chosen attribute. The learning algorithm is called recursively on each each value of the attribute, a training set is constructed such that all the examples in
of these training sets.
? c?
? c?
?
?
11
if
is the same for all ,
return a leaf node labeled
else
Set
for each value
for each example
for each value
return
Figure 2.2: Decision Tree Learning Algorithm. This algorithm takes a training set , attribute
set , goal attribute , and default class as inputs and returns a decision tree. denotes the th training example, denotes example ’s value for attribute , and denotes example ’s value for the goal attribute .
A decision stump is a decision tree that is restricted to having only one nonterminal node. That is, the decision tree algorithm is used to ?nd the one attribute test that comes closest to splitting all the training data into groups of examples of the same class. One branch is constructed for each value of the attribute, and a leaf node is constructed at the end of that branch. The training set is split into groups according to the value of that attribute and each group is attached to the appropriate leaf. The class label most frequently seen in each group is the class label assigned to the corresponding leaf.
?
?
?
¤ ?
¨¤ ? §
FF 6 g??D F?F??eg? 6D ? ? FF 6 g??D V ? H×??#S ? D¨??C?E76@1gd3 g?6 Big?0b????h?eg? P 7 F FF 7 9? ???F ? s F F 6D ? ? ? ?D ¤ ? s ?? ? ¨¤ ? § ? ? ?? ¤ ? ?s ? D ? FF 6 g??D V ? P FD ?? 6DD D?F F ? 9 9 4 ? s X#S 8GTe????? gg?? BhB?i?? ?
to be a nonterminal node with test . of attribute ,
,
Add example
to set
.
of attribute ,
Add a branch to .
labeled
with subtree
V? b?s
return a leaf node labeled
¨¤ ? §
P W ` ????? ??dB??#? ?¤ ? ? ? ? ? ? ?
?
?
?
? ¨¤ ? § ?
? s ? ?h? ??
else if
` § ? q ?? ??SBfHfHfHSI?US?@Y·d? ? ? ? ?
.
Decision Tree Learning( , , )
?
??? §¤
,
.
.
?
12
2.1.2
Naive Bayes
Bayes’s theorem tells us how to optimally predict the class of an example. For an
same for all classes; therefore, we ignore it. Now we can return the class that maximizes
This is known as the Naive Bayes classi?er. The algorithm that we use is shown in Figber of training examples seen so far, is estimated by is estimated by is the number of examples in class , and as their value for attribute . , and attribute values
. The algorithm returns a classi?cation function that returns, for an that maximizes
example , the class
Every factor in this equation estimates its corresponding factor in Equation 2.1.
2.1.3
Neural Networks
The multilayer perceptron is the most common neural network representation. It is often depicted as a directed graph consisting of nodes and arcs—an example is shown in
?
" #! §
? ` ?' ? ? ? ?? à ? ?
¨
and, for all classes
)' e?% ?
¨
is the number of examples in class
having
" #! §
x ?¨ p s? 0(% ? ?g? ?m e(% ? )' )' p ' n x?¨ p i{m y ?n ? ¨
?
ure 2.3. For each training example, we just increment the appropriate counts:
?x g¨ p
x ? ?? {m p n
` ' x ?n )' s? 0(% ? k?? {m ? ?¨ p ??m p ' n ? ? ?? à
?
¨ á h " ?áh #! ? eEh á Eh
y
?
)' e(% ?
have
as their th attribute value. Estimating
is unnecessary because it is the
¨
x ?¨ p ? b0(% ? ?g? ?m s)' p ' n
For example,
?
for all classes
and all possible values of all attributes
is estimated from a training set.
would be the fraction of class training examples that
x ?¨ p ? ge?% ? kg? ?m s)' p ' n
attribute value of example . Each of the probabilities
and
(2.1)
is the num
y
)' 0(% ?
x ¨ p
' g? x ?¨ p i{m ?n ? b) % ? p s
x ?¨ p
s? ? ?? ?m p n
then we can rewrite
as
¤
¤
n ? ? ?m ` ?? ?? ' ?
?
o
De?ne
to be the set of attributes. If all the attributes are independent given the class, , where is the th
?
x ?¨ p
x ? ?? ?m p n ?n s? ? ?? {m ?¨ p ??m p n x
¨
example , we should predict the class
that maximizes
?n p x ? ? s? ¨ p ??m p
?
13
NaiveBayesLearning( , ) Increment Increment for
return
Figure 2.3: Naive Bayes Learning Algorithm. This algorithm takes a training set
Figure 2.4. Each column of nodes is a layer. The leftmost layer is the input layer. The inputs of an example to be classi?ed are entered into the input layer. The second layer is the hidden layer. The third layer is the output layer. Information ?ows from the input layer to the hidden layer to the output layer via a set of arcs. Note that the nodes within a layer are not directly connected. In our example, every node in one layer is connected to every node in the next layer, but this is not required in general. Also, a neural network can have more or less than one hidden layer and can have any number of nodes in each hidden layer. Each noninput node, its incoming arcs, and its single outgoing arc constitute a neuron, which is the basic computational element of a neural network. Each incoming arc multiplies the value coming from its origin node by the weight assigned to that arc and sends the result to the destination node. The destination node adds the values presented to it by all the incoming arcs, transforms it with a nonlinear activation function (to be described later), and then sends the result along the outgoing arc. For example, the output of a hidden node
in our example neural network is
?
?
layer of nodes to unit
in the next layer (so
is the weight on the arc that goes from
ó
?
) ? ` ¨%§ ¤
ó
where
is the weight on the arc in the th layer of arcs that goes from unit in the th
??` ? ?¤ ? ?? ¨¤ ) ? ` ¨%§ ¤ ? ?? ? ?
" #! §
set as inputs and returns a Naive Bayes classi?er. far, is the number of examples in class , and have as their value for attribute .
and attribute is the number of training examples seen so is the number of examples in class that
?
á Eh ` ?' éh ? B??#???¨t4 è ? ? ? ? ?? ? s V $P ? á Eh " ?á #! ? eEh
Increment
í?
ì ?p Hê ?
? V S $ ?Qca¨P 1
for each training example
?
"#! § q?#BfHfHfHSIU@·y1 ?S ?S ? Y ?
?
?
,
) H? (¨%§ò ¤
)e(%a$ ' ? ?
?
? Hê
14
x1 z1 y1 x2 z2 y2 x3 z3 y3 x4 z4
inputs
hidden units
outputs
Figure 2.4: An example of a multilayer feedforward perceptron.
activation function is the sigmoid function:
The output of an output node
is
the inputs. Neural networks used for classi?cation problems typically have one output per class. The example neural network depicted in Figure 2.4 is of this type. The outputs lie example presented to it is a member of that output’s corresponding class. Therefore, the class corresponding to the highest output value is returned as the prediction. Neural network learning performs nonlinear regression given a training set. The most widely used method for setting the weights in a neural network is the backpropagation
? ?~
v q?
in the range
?
where
is the number of hidden units. The outputs are clearly nonlinear functions of
. Each output value is a measure of the network’s con?dence that the
?
x yùUnl÷?Bep~ ?? ? ~
? R¤ ê ) ? ? ?%§ ¤
? ?`
? ? ?¤ ? ì ?üúp ? ¨ ?
? y ?x in ì
ì
input unit to hidden unit ) and
is a nonlinear activation function. A commonly used
?¨
?
?
15 algorithm (Bryson & Ho, 1969; Rumelhart, Hinton, & Williams, 1986). For each training example in the training set, its inputs are presented to the input layer of the network and the predicted outputs are calculated. The difference between each predicted output and the corresponding target output is calculated. This error is then propagated back through the network and the weights on the arcs of the networks are adjusted so that if the training example is presented to the network again, then the error would be less. The learning algorithm typically cycles through the training set many times—each cycle is called an epoch in the neural network literature.
2.2
2.2.1
Ensemble Learning
Motivation
As we discussed in Chapter 1, ensembles are combinations of multiple base models, each of which may be a traditional machine learning model such as a decision tree or Naive Bayes classi?er. When a new example is to be classi?ed, it is presented to the ensemble’s base models and their outputs are combined in some manner (e.g., voting or averaging) to yield the ensemble’s prediction. Intuitively, we would like to have base models that perform well and do not make highlycorrelated errors. We can see the intuition behind this point graphically in Figure 2.5. The goal of the learning problem depicted in the ?gure is to separate the positive examples (’+’) from the negative examples (’’). The ?gure depicts an ensemble of three linear classi?ers. For example, line C classi?es examples above it as negative examples and examples below it as positive examples. Note that none of the three lines separates the positive and negative examples perfectly. For example, line C misclassi?es all the positive examples in the top half of the ?gure. Indeed, no straight line can separate the positive examples from the negative examples. However, the ensemble of three lines, where each line gets one vote, correctly classi?es all the examples—for every example, at least two of the three linear classi?ers correctly classi?es it, so the majority is always correct. This is the result of having three very different linear classi?ers in the ensemble. This example clearly depicts the need to have base models whose errors are not highly correlated. If all the linear classi?ers make mistakes on the same examples
16
                  

+B 
+
A
 C +
++ ++ + + ++ ++ + ++ + + ++ + + + + ++ + + ++ ++ ++ + + + + ++ + + ++ + +++ + + +  + + + + +    + + ++ + + +      +               +       + ++ ++ +  +  + + ++ ++ ++ + + + + ++ + ++ + + + + + + ++ + + ++ ++ ++ + + + + ++ + ++ + + ++ + ++ ++ + +
Figure 2.5: An ensemble of linear classi?ers. Each line A, B, and C is a linear classi?er. The boldface line is the ensemble that classi?es new examples by returning the majority vote of A, B, and C. (for example if the ensemble consisted of three copies of line A), then a majority vote over the lines would also make mistakes on the same examples, yielding no performance improvement. Another way of explaining the superior performance of the ensemble is that the class of ensemble models has greater expressive power than the class of individual base models. We pointed out earlier that, in the ?gure, no straight line can separate the positive examples from the negative examples. Our ensemble is a piecewise linear classi?er (the bold line), which is able to perfectly separate the positive and negative examples. This is because the class of piecewise linear classi?ers has more expressive power than the class of single linear classi?er. The intuition that we have just described has been formalized (Tumer & Ghosh, 1996; Tumer, 1996). Ensemble learning can be justi?ed in terms of the bias and variance of the learned model. It has been shown that, as the correlations of the errors made by the base models decrease, the variance of the error of the ensemble decreases and is less than the
17
? ? ??' ¤
variance of the error of any single base model. If be obtained),
? ? ??' ¤ § ¨? '
is the average additional error of
the base models (beyond the Bayes error, which is the minimum possible error that can is the additional error of an ensemble that computes the average of the
base models’ outputs, and is the average correlation of the errors of the base models, then Tumer and Ghosh (1996) have shown that
of the errors made by the base models is made clear by this equation. If the base models be the same and the ensemble would not yield any improvement. If the base models’ errors relative to the base models’ errors. It is possible to do even better by having base models Ensemble learning can be seen as a tractable approximation to full Bayesian learning. In full Bayesian learning, the ?nal learned model is a mixture of a very large set of models— typically all models in a given family (e.g., all decision stumps). If we are interested in ?nal learned model is predicting some quantity , and we have a set of models
Full Bayesian learning combines the explanatory power of all the models (
Bayesian learning is intractable because it uses a very large (possibly in?nite) set of models. Ensembles can be seen as approximating full Bayesian learning by using a mixture of a lihoods ( ). Ensemble learning lies between traditional learning with single models small set of the models having the highest posterior probabilities (
and full Bayesian learning in that it uses an intermediate number of models.
x s?
x s?
¤
?n ?{m
by the posterior probability of the models given the training set (
x s? ??m ?n ¤
). However, full
) or highest like
?
x C?m ?n ? x s? i{m ¤ p x s? ??m x s? i{m ¤ p x s? i{m ?n ?n ?n ?n ¤ ¤ x ??m x s? C?m ¤ ?n ?n ? ? ¤ ¤
¤
?n ?{m
¤
?
`
?
p
with anticorrelated errors. If
, then the ensemble’s error would be zero.
and a training set , then the
) weighted
?
?`
q rp
are independent, then
?
~ p
always agree, then
x s? ?{m ?n ¤
?
where
is the number of base models in the ensemble. The effect of the correlations ; therefore, the errors of the ensemble and the base models would , which means the ensemble’s error is reduced by a factor of
v ??' ¤ ? ?
x ?ù ~
? ?n ? ?a?~
? ?? p ?? ¨' ¤ § '
18
2.2.2
Ensemble Learning Algorithms
The ensemble example shown in Figure 2.5 is an arti?cial example. We normally cannot expect to get base models that make mistakes on completely separate parts of the input space and ensembles that classify all the examples correctly. However, there are many algorithms that attempt to generate a pool of base models that make errors that are as uncorrelated as possible. Methods such as bagging (Breiman, 1994), boosting (Freund & Schapire, 1996) , and crossvalidation partitioning (Krogh & Vedelsby, 1995; Tumer & Ghosh, 1996) promote diversity by presenting each base model with a different subset of training examples or different weight distributions over the examples. The method of errorcorrecting output codes (Dietterich & Bakiri, 1995) presents each base model with the same training inputs but different labels—for each base model, the algorithm constructs a random partitioning of the labels into two new labels. The training data with new labels are used to train the base models. Input Decimation (Oza & Tumer, 1999, 2001; Tumer & Oza, 1999) and Stochastic Attribute Selection Committees (SASC) (Zheng & Webb, 1998) instead promote diversity by presenting each base model with a different subset of input features. SASC presents each base model (the number of base models is determined by hand) with a random subset of features. Input Decimation uses as many base models as there are classes and trains each base model using a subset of features having maximum correlation with the presence or absence of its corresponding class. However, in both SASC and Input Decimation, all training patterns are used with equal weight to train all the base models. The methods we have just discussed are distinguished by their methods of training the base models. We can also distinguish methods by the way they combine their base models. Majority voting is one of the most basic methods of combining (Battiti & Colla, 1994; Hansen & Salamon, 1990) and is the method used in bagging. If the classi?ers provide probability values, simple averaging is an effective combining method and has received a lot of attention (Lincoln & Skrzypek, 1990; Perrone & Cooper, 1993; Tumer & Ghosh, 1996). Weighted averaging has also been proposed and different methods for computing the weights of the classi?ers have been examined (Benediktsson, Sveinsson, Ersoy, &
We explain bagging and boosting in more detail later in this chapter.
`
?
19 Swain, 1994; Hashem & Schmeiser, 1993; Jacobs, 1995; Jordan & Jacobs, 1994; Lincoln & Skrzypek, 1990; Merz, 1999). Boosting uses a weighted averaging method where each base model’s weight is proportional to its classi?cation accuracy. The combining schemes described so far are linear combining techniques, which have been mathematically analyzed in depth (Breiman, 1994; Hashem & Schmeiser, 1993; Perrone & Cooper, 1993; Tumer & Ghosh, 1996). There are also nonlinear ensemble schemes include rankbased combining (AlGhoneim & Vijaya Kumar, 1995; Ho, Hull, & Srihari, 1994), beliefbased methods (Rogova, 1994; Xu, Krzyzak, & Suen, 1992; Yang & Singh, 1994), and orderstatistic combiners (Tumer & Ghosh, 1998; Tumer, 1996). In this thesis, the ensemble methods that we use are bagging and boosting, which we explain now. Bagging Bootstrap Aggregating (bagging) generates multiple bootstrap training sets from the original training set and uses each of them to generate a classi?er for inclusion in the ensemble. The algorithms for bagging and doing the bootstrap sampling (sampling with replacement) are shown in Figure 2.6. To create a bootstrap training set from a training examples. Each example has probability from 1 to
and adds the th training example to the bootstrap training set . Clearly,
some of the original training examples will not be selected for inclusion in the bootstrap bootstrap training sets and then generate classi?ers using each of them. Bagging returns a number of votes from the base models . In bagging, the
sets that are created are likely to have some differences. If these differences are enough to reasonably good, then the ensemble will probably perform better than the base models individually. (Breiman, 1996a) de?nes models as unstable if differences in their training
?
induce noticeable differences among the
base models while leaving their performances
?
¨
? ? BB? ? g` ? v? ? ?v v
x ??? ?n
function
that classi?es new examples by returning the class
that gets the maximum bootstrap training
?
training set and others will be chosen one time or more. In bagging, we create
?
?
algorithm shown in Figure 2.6 does exactly this—
?? B?~
?
?
?
set of size
, we perform
Multinomial trials, where in each trial, we draw one of the of being drawn in each trial. The second times, the algorithm chooses a number
?
such
20
Return
Sample With Replacement( , ) For
? B t6 ? V?#S?GP?6bFI?jFcDEC? ?98&7@h6}?6 7 A ? 1 s #SBHHfHI?U??s?? ff S S ? s ?×
Add
Return .
is the original training set of examples, is the number of base models to be learned, is the base model learning algorithm, the ’s are the classi?cation functions that take a new example as input and return the predicted class from the set of possible classes , is a function that returns each of the integers from to with equal probability, and is the indicator function that returns 1 if event is true and 0 otherwise.
Figure 2.6: Batch Bagging Algorithm and Sampling with Replacement:
sets tend to induce signi?cant differences in the models and stable if not. Another way of stating this is that bagging does more to reduce the variance in the base models than the bias, so bagging performs best relative to its base models when the base models have high variance and low bias. He notes that decision trees are unstable, which explains why bagged decision trees often outperform individual decision trees; however, Naive Bayes classi?ers are stable, which explains why bagging with Naive Bayes classi?ers tends not to improve upon individual Naive Bayes classi?ers. Boosting The AdaBoost algorithm, which is the boosting algorithm that we use, generates a sequence of base models with different weight distributions over the training set. The Admodel learning algorithm , and the number of base models that we wish to combine. aBoost algorithm is shown in Figure 2.7. Its inputs are a set of training examples, a base
?
V ?P XRW V S 1P 6 F ? FD 7 A 9 ? 7 1 HUTR?bIjcEC? ?8&@h6
?
3
?
? VQ}?E¨P ?PTW ` ? ? ? B? ?B??#??E¨P 4 s V $ 4 ? ? ?? ? s V $ ? !¤ ? V P ?s 4 3 ? ?? ? ? V ·#S PjDEbFgHFH?81RcFi 4GDi? ??81 s 7 A ? ? F ? A ? ? 2SfffS ?S ? UBHHHIU?s A
, to .
For each
?
?
5
?
?
?
1
2
¤
4
2
Bagging( ,
?
) ,
?
21
AdaBoost( For
Initialize
otherwise
the base model learning algorithm, and
AdaBoost was originally designed for twoclass classi?cation problems; therefore, for this explanation we will assume that there are two possible classes. However, AdaBoost is regularly used with a larger number of classes. The ?rst step in Adaboost is to construct an is one that assigns equal weight to all initial distribution of weights over the training set. The ?rst distribution in AdaBoost training examples. We now enter the loop in the , we calculate its error on the training set misclassi?es.
than what we would achieve through randomly guessing the class)—if this condition is not satis?ed, then we stop and return the ensemble consisting of the previouslygenerated base ing examples as follows. Examples that were correctly classi?ed by
Q RP H IG F
have their weights
If cannot take a weighted training set, then one can call it with a training set generated by sampling with replacement from the original training set according to the distribution .
? E
models. If this condition is satis?ed, then we calculate a new distribution
` ?
?~ ?
? `?
We require that
(this is the weak learning assumption—the error should be less
` ?
itself, which is just the sum of the weights of the training examples that
`?
` ?
ing set. After getting back a hypothesis
` E
?
algorithm. To construct the ?rst base model, we call
with distribution
over the train
over the train
3
2 qV h S h $P S f f f S V ` S ` $P Up@ci?¨ggHHHe0dcba¨8Y
Figure 2.7: AdaBoost algorithm:
is the set of training examples, is the number of base models to be generated.
?
?
f
) DA A ) B?
Output the ?nal hypothesis:
}s V ¨P $
? ? ? ??s 2 ? A ? " #? 4 ? #% 2 Vt¨P ? 1 3) 0 % 0( & ? s 7 4 ? 1 ) ' ? ?% $ #&? V #?U?@c?¨g?HHHe0dcb&¨8?P ?s 4 ?S qV h S h $P S f f f S V ` S ` $P Y 3 ? ? ? 2SfffS ?S ? UBHHHIU?s A q×#SBfHfHfHSI?US@?·7 ? Y ? "? s V 7 0#?lt¨P ` ? 2 3S q $P S f f f S $P US #?UV h cS h ¨ggHHHeV ` cS ` ¨8Y ?
) for all . : . Calculate the error of If then, set and abort this loop. : Update distribution
.
4
if
?
)
)A` CB?
? ?? ? ?? ? s V $ B? B??U???¨P B4 ?9 ) % ) ` ?R ? 0 '( ? ? !¤
? ) `A 5 V 7 ? V 7P ` % ? @98 6t¨P ? ?s t¨?` u ? ? ? ` E
7
is
?
22
their weights reduced and misclassi?ed examples have their weights increased. Speci?examples that correctly classi?ed have their total weight reduced to . We construct under
returned by AdaBoost is a function that takes a new example as input and returns the class weight is , which is proportional to the base model’s accuracy on the weighted
training set presented to it. Clearly, the heart of AdaBoost is the distribution updating step. The idea behind it misclassi?ed examples. The misclassi?ed examples’ weights are multiplied by start out having total weight , but their weights are multiplied by
)A` DB? C`) A
adjustment is that the next base model will be generated by a weak learner (i.e., the base by the previous base model will have to be learned. Boosting does more to reduce bias than variance. For this reason, boosting tends to improve upon its base models most when they have high bias and low variance. Examples of such models are Naive Bayes classi?ers and decision stumps (decision trees of depth one). Boosting’s bias reduction comes from the way it adjusts its distribution over the training set. The examples that a base model misclassi?es have their weights increased, causing the base model learning algorithm to focus more on those examples. If the base model learning algorithm is biased against certain training examples, those examples get more weight, yielding the possibility of correcting this bias. However, this method of adjusting the training set distribution causes boosting to have dif?culty when the training data is noisy (Dietterich, 2000). Noisy examples are normally dif?cult to learn. Because of this, the weights assigned to the noisy examples tend to be higher than for the other
?~ ?
model will have error less than
); therefore, at least some of the examples misclassi?ed
?
the sum of their weights decreases to
. The point of this weight
)
`% ?
`
? ù ~ p ) DB` ? ` % ? x ?y#n )A T ?
?
the sum of their weights is increased to
. The correctly classi?ed examples , therefore,
?
`
?
is as follows. We can see from the algorithm that
?
? p D`) A ? T U?
?
that gets the maximum weighted vote over the
?
?
? ù ??~
? E
and the new distribution
base models in this fashion. The ensemble base models, where each base model’s
is the sum of the weights of the , so that
? ?
then go into the next iteration of the loop to construct base model
using the training set
? E
? E
h~ ?
?~ ?
` ?
cally, examples that
misclassi?ed have their total weight increased to
?~ ?
? `?
x
) DA )A DB?
` ?
? ` n ì C?
?
by
. Note that, because of our condition
, correctly classi?ed examples have under and . We
` ?
)
`% ?
multiplied by
S 3A`
SA 3B` ?
. Examples that were misclassi?ed by
have their weights multiplied
23 examples, often causing boosting to focus too much on those noisy examples and over?t the data.
2.3
Online Learning
Online learning is the area of machine learning concerned with learning each training example once (perhaps as it arrives) and never examining it again. Online learning is necessary when data arrives continuously so that it may be impractical to store data for batch learning or when the dataset is large enough that multiple passes through the dataset would a new training example . The algorithm returns a new hypothesis that is updated to take too long. An online learning algorithm takes as its input a current hypothesis and
re?ect the new example. Clearly, an online learning algorithm can be used wherever a batch algorithm is required by simply calling the online algorithm once for each example in the training set. A lossless online learning algorithm is an algorithm that returns a hypothesis identical to what its corresponding batch algorithm would return given the same training set. Some researchers have developed online algorithms for learning traditional machine learning models such as decision trees—in this thesis, we use the lossless online decision tree learning algorithm of (Utgoff, Berkman, & Clouse, 1997). Given an existing decision tree and a new example, this algorithm adds the example to the example sets at the appropriate nonterminal and leaf nodes and then con?rms that all the attributes at the nonterminal nodes and the class at the leaf node are still the best. If any of them are not, then they are updated and the subtrees below them are rechecked as necessary. Naive Bayes model learning is performed the same way in online and batch modes, so online learning is clearly lossless. Batch neural network learning is often performed by making multiple passes (known in the literature as epochs) through the data with each training example processed one at a time. So neural networks can be learned online by simply making one pass through the data. However, there would clearly be some loss associated with only making one pass through the data. Two online algorithms that have been wellanalyzed in the theoretical machine learn
?
? ?
x ¨ ? jv ?n
24
WeightedMajority( ) For If ,
.
, then return 1, else return 0.
If the target output For , If
is not available, then exit. .
then
sponding to the predictors, is the latest example to arrive, example , the are the predictions of the experts .
ing literature are the Weighted Majority Algorithm (Littlestone & Warmuth, 1994) and the Winnow Algorithm (Littlestone, 1988) (see (Blum, 1996) for a brief review of these algorithms). Both the Weighted Majority and Winnow algorithms maintain weights on several predictors and increase or decrease their weights depending on whether the individual predictors correctly classify or misclassify, respectively, the training example currently being considered. For example, Figure 2.8 contains the Weighted Majority algorithm as listed in (Blum, 1996). The ?rst step in the algorithm is an initialization step that is only performed once—it sets the weights for each training example as it arrives. First, the algorithm calculates the predictions of each predictor on the new example . The ensemble’s prediction is then returned—it is the class that gets the maximum total weighted vote over all the predictors. Each predictor that makes a mistake on that example has its weight reduced in half. Weighted Majority and Winnow have shown promise in the few empirical tests that have been performed (e.g., (Armstrong, Freitag, Joachims, & Mitchell, 1995; Blum, 1997)). These algorithms have also been proven to perform not much worse than the best individual predictor. For expredictor that makes at most mistakes, then the Weighted Majority algorithm will make
f
of all the predictors to 1. The remaining steps are performed once
ample, given a sequence of training examples and the pool of predictors
4 ¤ ? ? #HHf x r v r fs rff t
r
Figure 2.8: Weighted Majority Algorithm:
? ?s
V
q 2SfffS ? Y ? úU?HHH@{?? ? $ ?
Initial conditions: For all
,
¤
.
? V }s W dec¤ b ` a¤ 2SfffS ? s U?HHH??? W ? W ? V V ` ? ? C¤ ? ' 4 ¤ ¤ XY? 'C¤ V $ 4 ?¨P ?s ¤ ¤ 2SfffS ? s U?HHH??? ¤ ? 2 ¤ $
$
is the vector of weights correis the correct classi?cation of
, if there is a
25
would expect the bad predictors to make many more mistakes than the good predictors, leading to much lower weights for the bad predictors. Eventually only the good predictors would in?uence the prediction of the entire model. Work in universal prediction (Merhav & Feder, 1998; Singer & Feder, 1999) has yielded algorithms that produce combined predictors that also are proven in the worst case to perform not much worse than the best individual predictor. Additionally, Singer and Feder (1999) discuss a universal linear prediction algorithm that produces a weighted mixture of sequential linear predictors. Speci?cally, universal prediction is concerned with the problem of so far. We would like to use a method that minimizes the difference between our prediction can be used, where tion) and for is the order (the number of past observations used to make a predicobservations and predictions:
? ? i? h? ? p~
and the observed value
. A linear predictor of the form
are coef?cients obtained by minimizing the sum of . Using a thorder linear predictor requires us
to select a particular order , which is normally very dif?cult. This motivates the use of a universal predictor each of the different sequential linear predictors of orders 1 through some
? ?? q
, which yields a performanceweighted combination of the outputs of :
where
We can compare the universal predictor to the full Bayesian model shown in Equation 2.2. the predictive performance of the thorder model, just as combined.
?
of the performance of hypothesis
. These measures are used to weight the models being
x s? ??m ?n
¤
? ? i!?
ò
is the universal predictor’s version of
, i.e.,
?
is a normalized measure of is a normalized measure
?
x 0x ? v Cn ` v ? ? ? Ul?B? ` ? ? t ? ùn ÷ ? ? ` ? v ?q x ex ò ? v ?n ` v ? ? ? #l?B? ? ùn ÷ ? ? ` q
` ? x i? ? ò ? &i? hCn ? ? ù ? ? ? ? q ??
v i? ? ò ? i!?? ò ? ?
÷
x s? i{m ?n ¤
q
÷ ? ó ù ? ò r ` ò ù ? ? ? ` ? ? t ? x Bey? h? ` § v g ? r t ay? h?n ` ? v ? ?p ? x x? hq ? ? ? ~ eù ?
squared differences between the previous
r g ` ?x??ù?? h? ? ? ? r t up i? Bs? q ) ` ? § w% ? ?r v ?
ùT?y? ?hCn ` ? t ` ? v? ?
ù ? v ? ? ? ? ? ~? l!?h? BBv ? h? v ??h?
~ ?? ù
? ?`
?
??
ò
v
? ? ? p i? ???
v? ? ?v ~  z ? ÷ bBTbv Q?e?
¤
p
p
ó
? ? i? h?
predicting the th observation
given the
g
q
? ? i!?
x ò ? v ?n v ? ?
q
ò
?
x ? ? s o ?ì C?8g s ??n ÷
r g ) ` ? § w% v ?
at most
mistakes, where
is a constant. This is not surprising, since we
observations
seen
?
? ? i!?
? ? i? h? q
ò
?
26 The universal predictor is a special case of the full Bayesian model because of the special structure of the predictors being combined. Speci?cally, the sequential linear predictors used in a universal predictor have much in common, so that they can be computed in a timerecursive and orderrecursive manner (see (Singer & Feder, 1999) for the algorithm). This recursive nature means that the complexity of the universal prediction algorithm is only general in that its models need not have such a structure—the only requirement is that the fore, the complexity of the full Bayesian learner could, in the worst case, be the number of models multiplied by the complexity of learning each model. Most ensembles also do not have such a structure among their base models. For example, in ensembles of decision trees or neural networks, there is no recursive structure among the different instances of the models; therefore the complexity of the learning algorithm is the number of models multiplied by the complexity of each base model’s learning algorithm. The reader may have noticed that Weighted Majority, Winnow, and the universal prediction algorithm just described may be thought of as ensemble learning algorithms because they combine multiple predictors. Ensemble learning algorithms generally perform better than their base models; therefore, one may wonder why Weighted Majority, Winnow, and the universal prediction algorithm are only proven to have error that is at most slightly worse than the best predictor. This is because they do not attempt to bring about diversity among their predictors the way ensemble algorithms do. Many ensemble algorithms require their base models to use different input features, outputs, training examples, distributions over training examples, or initial parameters to bring about diversity. On the other hand, Weighted Majority, Winnow, and the universal prediction algorithm do not assume that there are any differences in their predictors. Without any way to reduce the correlations in the errors made by the predictors, these algorithms are unable to guarantee performance better than any individual predictors.
?
~ p x s?
?
models
in the hypothesis class be mutually exclusive such that
¤
?n ??m ¤ t
?
?
, where
is the total length of the sequence . The full Bayesian learner is more . There
¤
x ? ?n ?
27
Breiman( , for
return
is the training set, is the number of base models to be constructed, and is the size of each base model’s training set ( for ). is the base model learning algorithm, is the number of training examples examined in the process of creating each and is the number of these examples that the ensemble previously constructed base models ( ) misclassi?es.
2.4
Online Ensemble Learning
There has been some recent work on learning ensembles in an online fashion. Breiman (1999) devised a “blocked” online boosting algorithm that trains the base models using consecutive subsets of training examples of some ?xed size. The algorithm’s pseudocode is given value or may allow it to grow up to the maximum possible which is at most to create each base model. For the ?rst base model, the ?rst the algorithm draws the next training example from in Figure 2.9. The user may choose is the original training set and —the number of base models—to be some ?xed
, where
is the userchosen number of training examples used th base model for
training examples in the ,
and classi?es it by unweighted vot
~ ù r??
ing over the
base models generated so far. If the example is misclassi?ed, then it
~
?
?
?
?
?
training set
are selected. To generate a training set for the
2
Figure 2.9: Breiman’s blocked ensemble algorithm: Among the inputs,
?
? B s?
s?
?
D
?
?
V s V $ 4P Q}?E¨P ?W ` ? ? ? B??#}?E¨P 4 ?? ? ?? ? s V $ ? ? !¤ V P ?s 4 3 ? ? ? ? v ???fdü?g? ?¨P ?? h?wd???¨P ?? F ? ? ? V ? A ? F ? ?f ? s V A ? § v ?gGP ?? F s V? ? § ? ?s A ? ) % ?? § ` V S $ Qca¨P ? ? ? ? ) ? % ??§ ? ? ? ? ?aF F V S $ Qca¨P ? ? b ? ?? ? ?? ? V b??s?VE¨P 4?TW ` ? Q? B??#?s ?s A $ P ? ` ?¤ ¤ ` ? ?? ? ? ??D D
.
If
or
then add else add
to
and
to
with probability
if
then
,
else
.
`
? ?? F
4SfffS` UbHHHbE4
? ?
?
V S $ ?ct¨P
Get the next training example
b
in . ,
,
?
?
?s@? ? ? ? ?? ??s Sd??s?cSd??s?F D ? ? 2 ffS ? US?fHHH?s A 2 ?
, Initialize Do until
?
?
?
3
q 2SfffS ?S ? úU?HHHIU@Y ?
?
)
,
.
,
, where
.
A
? ?
?
28
50 45 40 35 30 25 20 15 10 5 0 0 5 10 15 20 25 30 35 40 45 50 Boosting 50 45 40 35 30 25 20 15 10 5 0 0 5 10 15 20 25 30 35 40 45 50 Online Boosting
Blocked Boosting
Figure 2.10: Test Error Rates: Boosting
vs. Blocked Boosting with decision tree base models.
Figure 2.11: Test Error Rates: Online Boosting vs. Blocked Boosting with decision tree base models.
Blocked Boosting
probability proportional to the fraction of training examples drawn for this model that are of smoothing). This method of selecting examples is done in order to train each base model using a training set in which half the examples have been correctly classi?ed by the ensemble consisting of the previous base models and half have been misclassi?ed. This process at which time the base model learning algorithm is called with to get base model . of selecting examples is done until examples have been selected for inclusion in
. Breiman’s algorithm returns a function that classi?es a new example by returning the
Breiman discusses experiments with his algorithm using decision trees as base models and test error decreased more rapidly when using larger subsets of training examples. However, each training example is only used to update one base model and each base model is only this is suf?cient to achieve performance comparable to a typical batch ensemble algorithm in which all the training examples are available to generate all of the base models. Figure 2.10 shows a scatterplot of the results of comparing AdaBoost to Breiman’s blocked boosting algorithm on the ?rst eight UCI datasets (Table 3.1) used in the experitrained with
ranging from 100 to 800. His experiments with one synthetic dataset showed that the
examples, which is a relatively small number. It is generally not clear when
? ? ?? ? H` ? v? ? ? v v
class that receives the maximum number of votes over the base models
?
?
?
?
x ~ ù
? ? in ?? ?
?
?
?
?
? ? @??
misclassi?ed (
) and the previous base model’s error
(done for the purpose
?
?
is included in the new base model’s training set
; otherwise it is included in
?
?
with a
,
?
?
?
? ?
?
29
50 45 40 35 30 25 20 15 10 5 0 0 5 10 15 20 25 30 35 40 45 50 Primed Online Boosting
Figure 2.12: Test Error Rates: Primed Online Boosting vs. Blocked Boosting with decision tree base models.
ments described in Chapter 3 and Chapter 4. This scatterplot, like the other ones in this thesis, compares the error on the test set of two algorithms. Every point represents one dataset. The diagonal line contains the points at which the errors of the two algorithms are equal. In Figure 2.10, the points above the line represent experiments in which boosting had a lower test set error than blocked boosting. Points below the line represent experiments in which blocked boosting had the lower test set error. Figure 2.11 compares the online boosting algorithm we present in Chapter 4 with the blocked boosting algorithm. Figure 2.12 shows the results of comparing our primed online boosting algorithm presented in Chapter 4 with the blocked algorithm. In our primed algorithm, we train with the lesser of the ?rst 20% of the training set or the ?rst 10000 training examples in batch mode and train with the remainder in online mode. In the blocked boosting algorithm, each decision tree had 84 training examples, so we used algorithm. Fern and Givan (2000) present both an online bagging algorithm and online boosting algorithm. Their online bagging algorithm is shown in Figure 2.13. This algorithm selects user ?xes in advance. is an online base model learning algorithm that takes a current each new training example to update each base model with some probability that the was trained with 100 examples (
Blocked Boosting
boosting and primed online boosting algorithms perform better than the blocked boosting
÷
q ?
p
? q dq ~ ?
p
? ?
) except for the Promoters dataset, which only . Overall, AdaBoost and both our online
? ?
30
for
hypothesis and training example as input and returns a new hypothesis updated with the as the base models, their online bagging algorithm never performed signi?cantly better than a single decision tree. With low values of , the ensembles’ decision trees are quite diverse because their training sets tend to be very different; however, each tree gets too the trees to get enough training data to perform well, but their training sets have enough in common to yield low diversity among the trees and little performance gain from combining. Figure 2.14 shows the results of comparing batch bagging to Fern and Givan’s bagging algorithm on several UCI datasets. Figure 2.15 gives the results of comparing our online them the best results in their tests. However, we allowed their algorithm to use decision trees with no depth limit in order to allow for a fair comparison. As we mentioned earlier, bagging tends to work best with base models having high variance and low bias. Therefore, using decision trees with no depth limit would tend to work best. Both batch bagging and our online bagging algorithm perform comparably to Fern and Givan’s algorithm. Fern and Givan’s online boosting algorithm is an online version of Arcx4 (Breiman, with a training example, that example is given weight , where 1996b). Arcx4 is similar to AdaBoost except that when a base model of previous base models
? ??
that have misclassi?ed it. The pseudocode for the
online algorithm is shown in Figure 2.16. In this algorithm, each example is given weight
?
q ?p ÷
bagging algorithm from Chapter 3 to Fern and Givan’s algorithm with
, which gave
is presented is the number
÷
few training examples, causing each of them to perform poorly. Higher values of
?
?
? d?
? y~
÷
new example. In experiments with various settings for
? d3 ? q ? 4SfffS ? 4S` 4Y s gTUgHHHQU???¤e?
Figure 2.13: Online Bagging algorithm.
is the set of base models to be updated, is the next training example, is the userchosen probability that each example should be included in the next base model’s training set, and is the online base model learning algorithm that takes a base model and training example as inputs and returns the updated base model.
÷
VV S $P c?ct¨gS ?Bd3 4P ? 4 ? b ? ? 2SfffS ? U?HHH?s A ? d3 ? Qca¨P ??U?HHHQUb??Y ·? V S $ q ? 4SfffS ? 4S` 4 s
, With probability , do .
OnlineBag(
,
, ,
)
`
? ??
v? ?? v ? ?Bg` ?
V S $ ?ct¨P
and depthlimited decision trees
allow
31
50 45 40 35 30 25 20 15 10 5 0 0 5 10 15 20 25 30 35 40 45 50 Bagging 50 45 40 35 30 25 20 15 10 5 0 0 5 10 15 20 25 30 35 40 45 50 Online Bagging
Fern Givan Bagging
Figure 2.14: Test Error Rates: Bagging vs.
Fixed Fraction Online Bagging with decision tree base models.
Figure 2.15: Test Error Rates: Online Bagging vs. Fixed Fraction Online Bagging with decision tree base models.
Initialize for
if
algorithm that takes the current hypothesis, a training example, and its weight as input and returns a new hypothesis updated to re?ect the new example with the supplied weight. This algorithm was tested on four machine learning datasets, three of which are part of the UCI Machine Learning Repository (Blake, Keogh, & Merz, 1999), and several branch prediction problems from computer architecture. The main goal of their work was to apply ensembles to branch prediction and similar resourceconstrained online domains. For this reason, they allowed their algorithm a ?xed amount of memory and examined the tradeoff between having a larger number of shallow decision trees and having a smaller number of deep decision trees. Their results suggest that, given limited memory, a boosted ensemble
?
to update each base model just like Arcx4. Here,
s
k l l l #mmmk
Figure 2.16: Online Boosting algorithm.
x w xik v Cu
is the set of base models to be updated, is the next training example, and is the online base model learning algorithm that takes a base model, training example, and its weight as inputs and returns the updated base model.
? ? ? d?&?
? ? ?Bo
s q k ? e ? ? ? Bo Yo
? e ?fV ? V S $ q ? 4SfffS ? 4S` 4 d3 Qca¨P ??UgHHHhQUbE?Y s ?
OnlineArcx4(
,
,
)
.
,
then
if}~rf{ {  ? V ? i k x w y60¨sik
.
.
Fern Givan Bagging
x
v Cu k
x w sze x ` Cu so u v p
p u s q xo ?trb xo p n k l l l k i e ?pmmm?jhg
is an online base model learning
32
50 45 40 35 30 25 20 15 10 5 0 0 5 10 15 20 25 30 35 40 45 50 Boosting 50 45 40 35 30 25 20 15 10 5 0 0 5 10 15 20 25 30 35 40 45 50 Online Boosting
Online Arcx4
Figure 2.17: Test Error Rates: Boosting
vs. Online Arcx4 with decision tree base models.
Figure 2.18: Test Error Rates: Online Boosting vs. Online Arcx4 with decision tree base models.
with a greater number of smaller decision trees is generally superior to one with fewer large trees. They often achieved results much better than a single decision tree, but did not compare their results to any batch ensemble algorithms including Arcx4 and AdaBoost. We compared AdaBoost and our original and primed online boosting algorithms to their online Arcx4 algorithm and show the results in Figures 2.17, 2.18, and 2.19, respectively. We allowed their online Arcx4 algorithm to use decision trees without depth limits but with pruning just as we did with AdaBoost and our online boosting algorithm. Batch boosting performs better than online Arcx4. Our online boosting algorithm and online Arcx4 perform comparably. Our primed online boosting algorithm outperformed online Arcx4 slightly. Our approach to online bagging and online boosting is different from the methods described in that we focus on reproducing the advantages of bagging and boosting in an online setting. Like the batch versions of bagging and boosting and most other batch ensemble algorithms, our algorithms make all the training data available for training all the base models and still obtain enough diversity among the base models to yield good ensemble performance.
Online Arcx4
33
50 45 40 35 30 25 20 15 10 5 0 0 5 10 15 20 25 30 35 40 45 50 Primed Online Boosting
Figure 2.19: Test Error Rates: Primed Online Boosting vs. Online Arcx4 with decision tree base models.
Online Arcx4
34
Chapter 3 Bagging
In this chapter, we ?rst describe the bagging (Breiman, 1994) algorithm and some of the theory behind it. We then derive the online bagging algorithm. Finally, we compare the performances of the two algorithms theoretically and experimentally.
3.1
The Bagging Algorithm
As we explained in Chapter 2, ensemble methods perform best when they create base models that are different from one another. Bootstrap Aggregating (bagging) (Breiman, 1994), does this by drawing multiple bootstrap training sets from the original training set and using each of these to construct a base model. Because these bootstrap training sets are likely to be different, we expect the base models to have some differences. The algorithms for bagging and doing the bootstrap sampling (sampling with replacement) are shown in Figure 3.1. Figure 3.2 depicts the bagging algorithm in action. To create a bootstrap trainin each trial, we draw one of the
?
to the bootstrap training set . Clearly, some of the original training examples will not be selected for inclusion in the bootstrap training set and others will be chosen one or more
?
times. In bagging, we create
such bootstrap training sets and then generate classi?ers
?
?
?
times, the algorithm chooses a number
?
from 1 to
and adds the th training example
?
ing drawn in each trial. The second algorithm shown in Figure 3.1 does exactly this—
?
examples. Each example has probability
? ??
?
?
ing set from the original training set of size
, we perform
multinomial trials where, of be
35
Sample With Replacement( , ) For ,
x????i ? ??i??? k ? ? ? ? ? u ? ? ? e f??
Return . is the original training set of examples, is the number of base models to be learned, is the base model learning algorithm, the ’s are the classi?cation functions that take a new example as input and return the predicted class, is a function that returns each of the integers from to with equal probability, and is the indicator function that returns 1 if event is true and 0 otherwise.
?
Figure 3.1: Batch Bagging Algorithm and Sampling with Replacement:
§ hq x ? Du ? x ? k m?B? ? ??i??? ? ? ? ? ? u n ? o ?
using each of them. In Figure 3.2, the set of three arrows on the left (which have “Sample The next set of arrows depicts calling the base model learning algorithm on these three si?es new examples by returning the class
? ? ?? ?
w/ Replacement” above them) depicts sampling with replacement three times ( bootstrap samples to yield three base models. Bagging returns a function the maximum number of votes from the base models
? ? ? ? ? ? ? ?????? ? #?
).
that clasthat gets
. In Figure 3.2, three
base models vote for the class. In bagging, the ences among the
?
bootstrap training sets that are created
are likely to have some differences. If these differences are enough to induce some differbase models while leaving their performances reasonably good, then, as described in Chapter 2, the ensemble is likely to perform better than the base models.
?
out of the set of possible classes
? ? ?
?
?
? ?? I?t?
?
g ? ? ? ?&??? ?
?
Add
?
to .
x w e sz?x
Return
x ? k ??¤? ?0?mm&Di? ? ? ? g ? ? ?? ?
? ? ? v Cu p o u BI6p ?
u
·????#?y????????x ? ± ° ? ? e v ?? ¨ Cu !?o x ? e u § p mtq p o ?? ?? ??&?e ?? ? g ? ? ? p ? o ? ? ?? ??
n k l l l k ? k i e ??mmm???jhg
For each
g ? ? ? p&??? ? e ? ? k l l l k ? k i e ??mmm???j??
n
Bagging( ,
?
) ,
36
Sample w/ Replacement
++ + + + ++++ + + + ++++++++++ + ++++ +++++++++++ ++ + + + + + ++ + + + ++++ + + ++++++ + + + + +++++++ + +
Learn Base Models Vote Final Answer= Plurality Vote
+ + ++++ + + + ++++++ + + + +++++++ + +
+ + ++++ + + ++++++ + + + + +++++++ + +
Figure 3.2: The Batch Bagging Algorithm in action. The points on the left side of the ?gure represent the original training set that the bagging algorithm is called with. The three arrows pointing away from the training set and pointing toward the three sets of points represent sampling with replacement. The base model learning algorithm is called on each of these samples to generate a base model (depicted as a decision tree here). The ?nal three arrows depict what happens when a new example to be classi?ed arrives—all three base models classify it and the class receiving the maximum number of votes is returned.
3.2
Why and When Bagging Works
It is wellknown in the ensemble learning community that bagging is more helpful when the base model learning algorithm is unstable, i.e., when small changes in the training set lead to large changes in the hypothesis returned by the learning algorithm (Breiman, 1996a). This is consistent with what we discussed in Chapter 2: an ensemble needs to have base models that perform reasonably well but are nevertheless different from one another. Bagging is not as helpful with stable base model learning algorithms because they tend to return similar base models in spite of the differences among the bootstrap training sets. Because of this, the base models almost always vote the same way, so the ensemble returns the same prediction as almost all of its base models, leading to almost no improvement over the base models. draws from some distribution drawn from distribution
? ? ? ?? ? ?? ???
, we can apply a learning algorithm and get a predictor is classi?ed correctly is
× ? ?? ?
, then the probability that
? ? ??? ? ×? ? ? ?? ??u?3??
× ??
that returns a predicted class given a new example . If
?
is a new example
?
?
Breiman illustrates this as follows. Given a training set
consisting of
independent
? ? ?? ?????
37
?à ?à ?à ? ? ? ? ?
where 1,2,. . . ,C is the set of possible classes to which an example can belong. Let us de?ne
? ?
i.e., the probability over the set training sets, is
?
of randomly drawn training sets that the predictor predicts is classi?ed correctly, averaged over randomly drawn
? ×
class , then the probability that
aggregating the predictors constructed on all the possible training sets (bagging obviously approximates this). We can write the probability of correct classi?cation of the aggregated
?
classi?er as
ù ? ? ÷ ? ? ? ? ? ??? y¨?3? rtI?? ?à ? ? ??
The improvement in classi?cation performance that we get by aggregating compared to not
? ?
aggregating is
ù ? ? ÷ ? ? ? ? ? ??? ????? ? u?I?? ?à ? ? ? ? ?z???
Equation 3.1 shows that aggregating especially helps with unstable base classi?ers. If the classi?ers are too stable, i.e., the classi?ers induced by the various training sets tend and will be quite low. With unstable base classi?ers,
? ? ? U??? ? ? ? è× s?? ? ? ? è× x3?
to agree, then
? ù ? ? ÷ ? ? ??y¨?3? ?
will tend to be close to 0 or 1, in which case
? ?? ?? ?
is close to
? × x3? ? ?
will tend to be away from 0 or 1 leading to higher values of
, i.e., a greater bene?t
?
?
? ? ò? ? ? ? ? ?? ?? ? ? ? Iy3?t¨I? s??C?I? s??
?
?
? ? ò? ? ? ? ? ?? ? ? ? ? ? ? ú Iy3?t?I? x3?¨píüI? p??
? ? ú I? p??
? ? ? ? ? ? ú 6??ìI? p??
ù ? ? ÷ ? ? ? ? ?? ? ??y?&íìI?a??
de?nes an aggregated classi?er
, i.e., the predictor constructed by
?
?
×
? ??
? ? ? ú üI? p??
?
? ?? ? I?ó??
where
is the probability that an example
is drawn under distribution
?
? ? ? ?? ? ? ? ? ? ? ? ? p??3?¨p??? s?á?? ? ? ? ? píì? ? m? ?
? ? ? ?? ? ? ? ? ? ? ? ? p??3?¨pí?? ?è× s??
? ? ò? ? ? ? ? ?? ? ? ? ? Iy3??¨I? x3?¨I? s??
×? ?? ê ???¤t?
?
? ×? ?? ????? I?? ? ? ? ? ?I?? ?t?I??
? ? ? ? éè× x3? ? ?
. Breiman
(3.1)
38 from aggregation. Clearly aggregation is often impossible because we typically cannot obtain all possible training sets. Bagging approximates aggregation by drawing a relatively small number of bootstrap training sets. Nevertheless, Equation 3.1 still demonstrates that the more diverse the base models in terms of their predictions, the more bagging improves upon them in terms of classi?cation performance.
3.3
The Online Bagging Algorithm
random draws
?
& ??
Bagging seems to require that the entire training set be available at all times because, for each base model, sampling with replacement is done by performing over the entire training set. However, we are able to avoid this requirement as follows. We noted earlier in this chapter that, in bagging, each original training example may be replicated zero, one, two, or more times in each bootstrap training set because the sampling each of the original training examples where
? ? ?
? ? ?
is done with replacement. Each base model’s bootstrap training set contains
copies of
which is the binomial distribution. Knowing this, instead of sampling with replacement by set in order, one example at a time and draw each example a random number each training example arrives, for each of the base models, we could choose example
? ?
according to Equation 3.2. If one has an online base model learning algorithm then, as according to Equation 3.2 and use the learning algorithm to update the base model with the new times. This would simulate sampling with replacement but allow us to keep —the
?
just one training example in memory at any given time—the entire training set does not have to be available. However, in many online learning scenarios, we do not know number of training examples—because training data continually arrives. This means that we cannot use Equation 3.2 to choose the number of draws for each training example. tribution of
?
' (
$ # ! %?"
? ? ¤ ???
? ? ???
tends to a Poisson(1) distribution:
?
However, as
?
performing
random draws over the entire training set, one could just read the training times
, which is reasonable in an online learning scenario, the dis. Now that we
§ ??
§ ??
?
? § ¨¤
?
?
(3.2)
? ? ¤ ìY??
? ? ???
39
ure 3.3): as each training example is presented to our algorithm, for each base model, examples are classi?ed the same way as in bagging: unweighted voting over the models. Online bagging is a good approximation to batch bagging to the extent that their sampling methods generate similar distributions of bootstrap training sets and their base model learning algorithms produce similar hypotheses when trained with similar distributions over training examples. We examine this issue in the next section.
?
? ?
choose
and update the base model with that example
3.4
Convergence of Batch and Online Bagging
In this section, we prove that, under certain conditions, the classi?cation function returned by online bagging converges to that returned by batch bagging as the number of base models and the number of training examples tends to in?nity. This means that the classi?cation performances of the ensembles returned by batch and online bagging also converge. The overall structure of our argument is as follows. After mentioning some known de?nitions and theorems, we ?rst prove that the bootstrap sampling distribution used in online bagging converges to that of batch bagging. Then we prove that the distribution over base models under online bagging converges to the analogous distribution under batch bagging subject to certain conditions on the base model learning algorithms which we discuss. We
?
have removed the dependence on
? aCA@9?7e5 ? B 6 8 8 ú 6 ? ?
, we can perform online bagging as follows (see Figtimes. New base
s tq
Figure 3.3: Online Bagging Algorithm:
?
is the classi?cation function returned by online bagging, is the latest training example to arrive, and is the online base model learning algorithm.
x w e x??x
· ? ? ? ± ° ? ? e ??0? ?? ????tx
Return
? ? ? v Cu p o u BIap ?
x ? ??k
p ou s q
e
p o
v Cu
1
Do
1
Set
according to times
.
.
x #i ?#4??? 2 ? ? 3 3? u n k l l l k ? k i ? ) ?pmmm?????0g
For each base model
,(
?
p xo
? k ??
OnlineBagging(
) ) in ,
40 ?nally establish the convergence of the classi?cation functions returned by online and batch bagging.
3.4.1
Preliminaries
In this section, we state some standard de?nitions (Grimmett & Stirzaker, 1992) and prove some essential lemmas.
× × ú ?
De?nition 1 A sequence of random variables
×
y s ?%? q R s A? q
vectors. The following is Scheff? ’s Theorem (Billingsley, 1995). We state the version for dise crete distributions because that is all we need.
Y dGA? c Y Y
Theorem 1 De?ne for
Y
to be a sequence of distributions and
h h U ige
Corollary 1 Given the same conditions as in Scheffe’s Theorem, we have
?
H ? ? ? e ? Y ? ? e w????tQ6???
We now prove the convergence of the sampling distributions used in online and batch ging algorithms. Use
? ? ?
bagging. Let
?
? ? ` ? ú a 6 B ú ? ` ? "?P??@?b???w?
h ? ? ???h
for all
.
denote the training set presented to the batch bagging and online bagto denote the multinomial distribution with
?
B
H ? ? ? e ? Y ? ? e w??y??Q6y"?
s q p ? ? e ? ? e? ü??tY %? rrìy"?
?
?
Y?
Y?
?
?
d ? c G%? u vt
d ? c GA? u xt
s q Y %? rp
and , i.e.,
Y
? e? Y y"tf? ??? ? e
?
Y
a ú b?`
bution such that
?
for all values
for all . Then we have
?
?
? ? ? ? ?????
?
? 0?
R
×
probability to
for all
where
is the number of elements in all the
, where
R
×
R
×
? ? U ?Vú ? ? ? ? ? W #???X? ? R × R ×
quence of random variables
?
(the th elements of
) converges in
to be another distriis the set of support
R
×
? ? ? ? ?????
R
×
? #?
? ? ? S ???? ? T?
R
? ? 9? S ×
verges in probability to a vector
of random variables (written as
D
?
R
×
×
De?nition 2 A sequence of vectors
whose elements are random variables con) if the se
QB
H
? E F ? mIPj×
?
× m?? ? ?
H F IGE
? ? ? ? ?????
? #?
×
×
random variable
(written
) if, for all
,
?
? ? ? ? ?????
? 0?
converges in probability to a as .
D ??
41 trials where, in each trial, each of
? ?
a W ? Re
possible elements is chosen with probability
Sampling with replacement (bootstrapping) in the batch bagging algorithm is done by perexamples from
? ?? ` ?ú a 6 Bú ?` ? ??b?C?b???w?
online bagging’s bootstrap sampling scheme. We now show that the bootstrap sampling method of online bagging is equivalent to ples, each of which has probability
. Note that this is the same as the bootstrap sampling distribution for
e l
batch bagging except that the number of multinomial trials is not ?xed. This lemma makes
? g
?? 3??
p
e l
Re
and each element of
?
is distributed according to
? k3?? g ? ?
p
?
5
Re
of being drawn. Therefore
p s
n?? m d n ?? m d
?
?
p
s
?
?
?
?
? ? ` ? ú a 6 B ú ? ` ? ? ? p?P??@????o??¨?
?
g k?
performing
multinomial trials where each trial yields one of the
training exam
Re
bootstrap training set is an integer. De?ne
to be the probability of obtaining
s
?
is a vector of integers so that the number of times each example from
? Re ? ? s s
is included in the under
g R e k?
of the bootstrapped training set, is not ?xed. Clearly, we would need to have that
p
s
g h?
? ? ?
up 20% of the training set each, and
was left out. However
, which is the total size
? ? ?b? ?
? ? ? y3? ? ? ?
? ? ?
then 40% of the bootstrapped training set is copies of
, and
,
, and
make
?
H
H
H
?
H
H
?
?
Re
?
? i?
?
example, if we have ?ve training examples (
e f
) and
W ??
W ?? W ?? ? j? ? ?? B 6 8 8 ú 6 ? 5 g ~3aCA@9?7?¨h?
p
is distributed according to a
distribution, where
s
. For ,
Re
? á@@@??7? ? B 6 8 8ú 6 ? ? ? ?? B 6 8 8 ú 6 ~3aCA@9?7?
number of examples drawn has a
distribution. Therefore, each element of
p
s
?
?
tion. Since there are
training examples, there are
such trials; therefore, the total
? á@A@9?d? ? B 6 8 8 ú 6
each training example is chosen a number of times according to a
?
Re
?
? ? ` ? ú a 6 B ú q ? p?P??@?b?r¨?
Re
De?ne
to be the online bagging version of
p
§
p
scheme.
s
Re
De?ne
to be the probability of obtaining
under batch bagging’s bootstrapping . Recall that, under online bagging, distribu
? ? ?
? ? y?? ?
§
? ? ? x3? ? ? ?
? Re ? ? § § ? ? ?
copies of
, and one copy of each of
,
, and
. Example
W
?
, This means that, out of the ?ve examples in
? ?
p
?
H
H
H
H
H
?
Re
?
?
?
H
?
W
?
? ??
?
p
§
?
(
?
), then one possible value for
is
W ??
W ??
W ??
? ??
p
§
?? ` ?ú a 6 Bú ?` ? 3?P??@?b???w?
5
Re
Therefore,
?
?
?
the bootstrapped training set
? p §
of the
th base model when sampling with replacement. . For example, if we have ?ve training examples . Given these, we have , there are two was left out.
? ?? ? ú
resents the fraction of the trials in which the th original training example
ú p
is drawn into
e
ú
?
Re
. De?ne
?
to be a vector of length
where the th element
?§ p
, each of which has probability
p § ? ?
of being drawn. This distribution is rep
?
?
?
forming
independent multinomial trials where each trial yields one of the
training
?
.
42 our subsequent proofs easier.
?
? ? ` ? ú a 6 B p?P??@??ú § § q ? ? r¨f? g a
Proof: We prove this by showing that the probability generating functions (Grimmett & Stirzaker, 1992) for the two distributions are the same. The probability generating funcgenerating function for a ing random variable is
m
As mentioned above, the distribution Bernoulli trials where
? ìk? ? g
involves perform
. By standard results (Grimmett &
Stirzaker, 1992), we can obtain the generating function for this distribution by composition:
? ? d8 ? ? ?
§
but this is the generating function for a wanted to show.
random variable, which is what we
the original training set increases. Speci?cally, these distributions converge in probability. distribution vectors over the training set for
? ? ?e ??
e p
base models. The elements of these average
?s p
e
Proof: In the batch version of bagging, to generate each base model’s bootstrapped ample is
otherwise
H
?
B
if example
is drawn on the th trial for the
?
?
? U z??
and
? ?
, th model
U B
? ??
?
U
a
. We de?ne the following indicator variables, for
? ? ? ? ? W #???X? ? ? ? è?
? ? ? ? ? W p????? ?
?
training set, we perform
? ? ? ? ? ? W #???B?? ?
independent trials where the probability of drawing each ex,
?? e
?? e
??
?
Lemma 2 As
and/or
,
D
p
? ?Ce ? ??
vectors are
and
?I?6p ?
Re
?
?
?
p
?
?? Ce
?§ p
? ? Iap ?
Re
?
?
p
? ??
?
p
?? Ce
De?ne
and
, which are the averages of the bootstrap .
.
?
to the number of base models) increases and as the number of training examples
p s ? ? I6p ? ? ? p § ? ? I6p ? ? ?
?
under online and batch bagging converge as the number of training sets
(corresponding in
Re
We next show that the distributions of the bootstrapped training set proportion vectors
? 8 ??Y
?
? Y c}?
?
? ??
? ? ? ` ? ú a 6 B {?b?C?bú ~¨i?h??? q ? ? ? g ?? p n ?? m ? ? # $ d ? ? 8 á???  7w ? cY {?b?C?bú q ? ? ` ? ú a 6 B
? ??
& S m !? ? x ?$ ? btbtz?wü??? 8 ? v ? Y ? y ? ? 8
?
?
?
?
?
? ?
? ?? B 6 8 8 ú 6 y3á@A@9?d?
?
? á@A@9?d? ? B 6 8 8 ú 6 ?
?
? ?tz?y 8 ? Y ?
?
?
? b?
Y ? z?y
D
w
? ? ? 8 ????? ? ?
?
5 g k?
?
? S ?9?
?
? ??

w ??
&
? v? B 6 8 8ú 6 I?á@AC??7?
tion for a
random variable is
?
?? 3??
&
?s
p 5 ut×
Lemma 1
if and only if
. The probability .
? ?
n ?? m d
?s
? aCA@9?d?s× ? B 6 8 8 ú 6 ? 5 ?
$
.
?s
? ??
?s
D
w
×
?
43 Clearly,
p ? I? ? ? ? ? ? ?? ?
. Therefore, we have
?
? ? ???
We have
? ?
? ??
(Grimmett & Stirzaker, 1992). As mentioned earlier, in online bagging, we can recast the bootstrap sampling process independent multinomial trials where the probability of drawing each
× ? X? p ? ?? B 6 8 8 ú 6 ? ~3aCA@9?7u5 g ? ?
g ? ? Pd
training example is except that
?
B
a
? ??
?
×? ???
. Clearly,
?
For online bagging, let us de?ne
?g ? ? ? ? ? ? W #???B?? ? ?
U ??
?
? X?
p
?
as performing
and
. for all , , and . The
the same way that we did for batch bagging
?? e
?? e
??
? z?
?
Now, we show that, as
? ? Pd ?? e
or
,
, which implies that
D
?
D
?
?? e
therefore,
, where
is a vector of length
where every element is 1.
? ?? ? e
?
?
Therefore, by the Weak Law of Large Numbers, as
?
D
or
,
?
D
?
§
?
?
?
?
?
?
? í?
?
?
? ? p
?
?
? p
? ? ? Iap
? ? ? I6p
?
p
?
?
?
? ? ? I6p
B
that consists of draws of example number
? ? ??? e ?
?
?
? ? ? x?
?
?
?
? ??? ? ??x? ? ? e ? ? ?
? ? e á? ?? ? ?? ?
?
?
process
times. Over
models, the average fraction of the bootstrapped training set is
?
Our bagged ensemble consists of
base models, so we do the above bootstrapping
?
?
B
model’s bootstrapped training set that consists of draws of example number
? X?
is
? p
a
?
?
§ ??
? ??
?
p
?
× ? ?? x?
?
?
?
?
?
?
?
?
?
?
B
?
?
?
a
? ? ??
p
p
? × I? ?
? × I? ?
?
?
?
?
? ?? ?
?
? U?
?
? s?
?
?
?
? ??
p
for all
,
, and . The fraction of the
th base
×? ???
×
p
;
§
? ? ?
§
? ?
?
? ? B ?w? g
§ ??
? ?
?
¤
?? ??? ? ? ?
?
B
? I? ? ? ? B ? ?
d
? ? ?
?
?
? ? ? ?
? X?
? B ??? g
? ?? ??? I? ?
d
?
p
×? ? ?Yx?
?
B
? ?
B ? ? B ?w? g 3?? I? ? ??
?
? ?? e p
? X?
p ? ? ?
d
? I? ? ?
? ×? ? ?Yx? ? ? ?? p ? ? ?? g 3?? B ??
? B w? g
B n ? w? g 3?? ?? ? B ??
?
?
??
? ??
p
? ?e g × I? ? ? ? ?
e
d
?
? x?
? e l p
n ??
?
? }?
g
??
? ??
?
? }?
g
? &?
just the variance of a constant which is 0. So we only have to worry about the ?rst term.
?
? X?
? ??
p
?e g × I? ? ? ?
? ?k@? ?g ? × ? I? ? ? X? p ? ? ?I? ? ? ? x? ? g &?
?
?
? }?
p
? e g ? ? x? × I? ? ? ? ?
?
?
? X?
p
? I? ? ?
?
? ? ? ? ? ?
? ? ? ? ? ?
? ??
n ? f? g 3?? ?? ? B ??
?
? ??
p
? ? d B n × ?? ? B ? f? g 3?? ?? ? B ?? ?
d
?
Bw? g
? &?
? ??
p
?e g × I? ? ? ?
? ?? ? ?
Bf? g
n ?? 3?? ?? ? d
? ? ??? e
?e g × I? ? e p l ? ?
p
? ? ?
p
? I? ?
fraction of the
th base model’s bootstrapped training set that consists of draws of example
?
? p
a
number
is
×
?
?
where, for
convenience in this derivation—one can de?ne this to be any value from 0 to 1 and it would
, we are de?ning
? xi? ? H
B?
×
p
not matter in the long run since
. Therefore, we have
as
??
H
? H ? g ? ? ?yr?h3??
Let us look at the second term ?rst. Since
? ??
? ??
Clearly, we would want
We also have, by standard results (Grimmett & Stirzaker, 1992),
, because with
H t? g
?
H ??x?? g ? ? H
? @?
×
would be no multinomial trials, so
p
Continuing the above derivation, we have
H r?
d
×
e
and the variance of a constant is 0. . . This is done merely for , the second term is , there 44
? ? ?x?
p
?e g × I? ? ? ?
?
?
? x?
?
? ??
?
? ? ?X?
p
? e g ? ? x? × I? ? ? ? ?
?
H ?? B
B
45 So we have
? ?
? §
models, the average fraction of the bootstrapped training set consisting of draws of example is
? ? p ? ? ? I6p ? ? ? ? ? ?? e
B
We have
? ?
? ??
Now that we have established the convergence of the sampling distributions, we go on to demonstrate the convergence of the bagged ensembles themselves.
3.4.2
?
Main Result
? ? ? §
Re
. The second step is to create a hypothesis by calling a batch learning algorithm on . In ’s to represent the base models in the batch bagging to be the analogous function returned by an online learning
? § ? ? ? ? s ?
our proof of convergence, we use
? ??
8 Re
? ??
?
algorithm. De?ne
?
ú
Re
?
ú
?
of the same size as
in which, for all ,
copies of the th example in
are placed in
?
?
?
?
that classi?es a new example
as follows. First, draw a bootstrap training set
? ?? R e
? ??
?
?
?
? U U¨ú
Re
H
and
for all
? ? ? ? ? W p????? ?
?
Let
?
Re
be a training set with
?
¤ ?? ¤
examples and let
be a vector of length
where from
. De?ne
to be a function
?? e
?? e
? ? zV
?? e
which means that
. As mentioned earlier, this implies that
.
? ?? e
D
?
?
Therefore, by the Weak Law of Large Numbers, as
?
D
or
,
?
D
?
?
§
?
?
?
?
?
?
?
?
?
?
¤
?
? p
p
?? ? 3?s?
? ? ? Iap
?
?
?
?
? ? ? ì? ? ? ? ??x? e ? ? ? ? ? ? ? ? ? ì? ? ? ? ?? ? e
?
?
?
We have
base models, so we repeat the above bootstrap process
?
?
?
?
? ?
¤
? ? X?
p
? e g ? ? s? × I? ? ?
?
?
?
times. Over
base
,
?
? I? ?
p
46 algorithm. We use
? ? s ? s
Recall that the size of the bootstrap training set is not ?xed in online bagging. For this reason, we include an additional input , which is the size of the bootstrap training set. of copies of each example from
? ?? ?
included in the bootstrap training set must be an integer.
? ? ? ? ?u¤?? ? § ? p
Re
and online base model learning algorithms because there are bootstrap training sets that online bagging may produce that batch bagging cannot produce. In particular, the boottraining set
?
ging, the probability of getting the null hypothesis for a base model tends to 1. However, For our proof of convergence, we require that the batch and online base model learning algorithms be proportional.
? ?? ?
8 Re
This clearly means that our online bagging algorithm is assumed to use an online base
Online learning algorithms do not need to be called with the entire training set at once. We just notate it this way for convenience and because, to make the proofs easier, we recast the online bagging algorithm’s online sampling process as an of?ine sampling process in our ?rst lemma.
§
?
online algorithm that produced
are proportional learning algorithms.
?
s
8
Re
for all
and , then we say that the batch algorithm that produced
§
? ??
?
? ??
8
Re
? ??
?
De?nition 3 Let ,
Re
, and
be as de?ned above. If
and the
?
?
?
? ??
s
? ??
?
?
? ??
?
?
? ??
s
?
. In this case, clearly
does not converge to
?
?
§
? ? ? ?¨I??
? 3?
?
??
?
?
s
? ?? R e ?? ? ? § ? ?
H ? ? ? ? ?ü¨I??
?
have exactly
n
examples. In this case, as
,
?
n
s
?
learning algorithms return some null hypothesis
n
?
that the bootstrap training set is of size
tends to 0. Therefore, suppose the base model if the bootstrap training set does not , i.e., under online bag.
?
. This is not true of online bagging—in fact, as
?
strap training sets produced by batch bagging are always of the same size
?
?
as
and
. However, this is clearly not true for all batch
as the original , the probability
? ? ¨???
8
Re
? ??
? 3?
?
ging converges to that returned by batch bagging), we need to have
? ? ¨?? ? § ?
Re
?
? s ?
s
s
? ??
? ??
?
? ???
? ??
In order to show that
§ §
(i.e., that the ensemble returned by online bag
? ? ???
8
Re
? ??
? ??
?
? ? ¨??
Re
? ??
? 3?
?
?
?
§
?
? ?? R e ?? ? ? § ? ?
? ??
Re
and
induce distributions over the base models
and
Re
, which is the analogous function returned by online bagging. The distributions over
? ? s ? s s ? § ? § § s
§ ? ? á??
8
Re
? ??
? ?? ?
p
? ? ù ? ? ÷ ? ? ? ? ?0? ??y??íá??
? ??
?
base models given a training set . De?ne
? ? ??
?
function returned by the batch bagging algorithm when asked to return an ensemble of
? ? s ? p s ? ? Iap ? ? ? s
? ??
? ?? ?
p
? ? ù ? ? ÷ ? ? ? ? ?#? ??y¨??¤??
De?ne
§
?I?6p ?
Re 8
?
? ?? B 6 8 8ú 6 ? 5 ~?á@@@??7??8
Recall that
. Clearly,
8
?
?
§
?
’s to represent the base models in our online bagging algorithm.
must be a vector of integers because the number , which is the classi?cation
.
? ??
? ?? §
?
?
47 model learning algorithm that is lossless relative to the base model learning algorithm used in batch bagging. However, our assumption is actually somewhat stronger. We require that particular, we assume that the size of the bootstrapped training set does not matter—only the proportions of every training example relative to every other training example matter. example in the current bootstrapped training set
à ? ?
e
, then note that would be the same for would be the same. We assume
?
Re
both
?
and
and, of course, the original training set
that our base model learning algorithms would return the same hypothesis if called with stumps, and Naive Bayes classi?ers because they only depend on the relative proportions of training examples having different attribute and class values. However, this assumption is not true for neural networks and other models generated using gradientdescent learning steps as training with cases. particular, all bootstrap training sets drawn under batch bagging are of size example, if online bagging creates a bootstrap training set of size cannot draw . That is,
§ ? u?
Re
, so we would not expect to get the same hypothesis in these two that one cannot get for
s ? § s ?
, so for all . For
s
Re
, i.e., the set of that can be obtained under online bagging but not under batch bagging. . We can see why as follows. We have, for all
? ? ?? ? yR e ? s ?
?? ±?? ? Re
We might be worried if our base model learning algorithms return some null hypothesis for ,
? ? ? ? ?íá?? ?
8
? ? áxR e ?
? ? h U ? ? ? s???R e ?ì?h
? ??
? ? ì??
?
?
Re
Re
?
?
?
? ??
? ??
U g?
s
§
? 3? ?
?yR e
?
§
?
?? s %? q
s %? q
±?? ?
? ? ? ìI??
? ? ? ìI??
Re
?
?
s
§
?? ???
?? ???
Re
possible values of
? H
and
, respectively and
. De?ne
§
h
h
s
h h ? ruh ? C§ H F ??? R e ?
?
while
. De?ne
and
s
§
s
s
H ¨?
Re
?
would be a vector of integers. If
s ? Re ? ? s § s
Re
is not a vector of integers, then clearly batch bagging to be the set of
?
? 3?
, then
? u?
Re
?
s
?
Re
possible
,
is a vector of integers. However, this is not true for all possible
Re
Re
One may worry that it is possible to get values for
§ §
?
algorithms. For example, training with
à
?
?
as they would if called with
. This assumption is true for decision trees, decision
would give us twice as many gradientdescent
?
For example, if we were to create a new bootstrapped training set
à ? à ?
by repeating each
?
our base model learning algorithms return the same hypothesis given the same
8
and . In
?
? h U ?°?R e
?
. In
48 as
?
. We can rewrite these as follows:
? Y? ? ? a??
? I??
tions clearly have discontinuities because they return a class which is discretevalued. However, given Lemma 4, we only require that our classi?cation functions course, obtaining such convergence requires that
H F ?dE
be bounded away from a deci? ?F ?
for all in an neighborhood around . This requirement is clearly
e E Re
related to the notion of stability that we discussed in Section 3.2. Decision trees and neural networks are unstable while Naive Bayes classi?ers and decision stumps are stable; theretrees and neural networks than in case of Naive Bayes classi?ers and decision stumps; therefore, we can expect convergence of online bagging to batch bagging to be slower with unstable base models than with stable ones.
? ??
8 Re Re
fore, small changes in
are more likely to cross decision boundaries in case of decision
denote the distributions over base models under batch and online bagging, reand , which are distributions that converge in
? ?? ? ?? ? ? ?? ? ?? ?
? ??
?? ?R e
?? ?t?
?
using
draws from
? ? ??
?
??? CR e ??? ? ?? ? ? ? ? ?? CR e ??? ?? ??
spectively. Clearly,
. Since
and
are created
? ??
?? CR e
?? ???
?
?
s
? ??
8
Re
?
? ??
?
§
?
Proof: Let us de?ne
? ? ?? ?
?? CR e
. Let
?
?
?
?
?
s
? ? ??? R e ?? ? ? § ? ?
?
?
?? CR e ??? ?? ?
? ? ??? R e ?t? ?? ? ?
D
??
?
?
as
and
for all . and
? ??
?
?
? ??
e
?? ?d?
converge in probability to some classi?er
? ?? ? ? ?? ? § ?
as
, then
?
? ??
?
s
?
8
?
Re
?
? ??
8
Re
? ??
Theorem 2 If
for all and and if
and
?
?
?
s
? ?? R e ?? ? ? § ? ?
?
sion boundary. That is, for every
? ?? ?
e
, there must exist an
such that for all
s
?
? ??
e
?? ?d?
s
? ??
?
e
?
? ?? ?
?
?
?
s
? ? ? ?á?? R e ?? ? ? § ? ?
?
? ??
?? ? ? ? ?díá?? R e ?d? ?? ? ?
8
Re
? ??
?
and
converge in probability to some classi?er
as
? ?? R e ?? ? ? § ? ?
?
?
?
? R e a? ?
? R e a? ?
?? R e
?
?
?
?? R e
since
,
§
D
s
D
s
?
Re
in
do not yield dramatic changes in the prediction performance. It is generally true that if is a continuous function. Our classi?cation func
? Y?
? ? ? üI??
?? ???
and
. We clearly require some smoothness condition whereby small changes
. Of ,
? Y?
? ? ? óI??
?? 3??
? ??
? ? ? óI??
?? ???
ond term in the equation for
? §
may prevent convergence of
?
s
? h U ?°?R e
If our base model learning algorithms return some null hypothesis for
? s
, then the sec
?
8
?
Re
?
? ??
s
? ? ?? ? ?R e ?
s ?
y s%? q
?? ?
? ?? ? ? Y?
? ? á??
? ? á?? R e
?
8
? ?
Re
? ?
? ?? ? ??
s §
? ? 3? ? xR e ? ? ? ?? ? xR e ? §
s ? ?
y ?s
y s
? ? sA? q
s A? q
? ?
tY? ? tY? ?
? ? ? üI?? ? ? ? üI?? ? §
? s
?? ??? ?? ??? ?? ???
49 probability, we immediately get
? ? ? ? ?? § ?
D
To summarize, we have proven that the classi?cation function of online bagging conexamples
?
tend to in?nity if the base model learning algorithms are proportional and if . We noted that
decision trees, decision stumps, and Naive Bayes classi?ers are proportional, but gradientdescent learning algorithms such as backpropagation for neural networks typically are not proportional. Base model convergence and therefore the convergence of online bagging to batch bagging would tend to be slower with unstable base models such as decision trees and neural networks than with Naive Bayes and decision stumps. This is clearly related to the notion of stability that is important to the performance of bagging. We now compare online bagging and batch bagging experimentally.
3.5
Experimental Results
We now discuss the results of our experiments that compare the performances of bagging, online bagging, and the base model learning algorithms. We used four different types of base models in both batch and online bagging: decision trees, decision stumps, Naive Bayes classi?ers, and neural networks. For decision trees, we reimplemented the ITI learning algorithm (Utgoff et al., 1997) in C++. ITI allows decision trees to be learned in batch and online mode. The online algorithm is lossless. The batch and online Naive Bayes learning algorithms are essentially identical, so the online algorithm is clearly lossless. We implemented both batch and lossless online learning algorithms for decision stumps. As we mentioned before, the learning algorithms for decision trees, decision stumps, and Naive Bayes classi?ers are also proportional. For neural networks, we implemented the backpropagation algorithm. In the batch ensemble algorithms, we ran neural network learning for ten epochs. In the online ensemble algorithms, we can run through the training set only once; however, we can vary the number of update steps per example. We present results with one update step and ten update steps per training example. We get worse results with
?
the base models themselves converge to the same classi?er as
?
verges to that of batch bagging as the number of base models
? ??
? ??
?
? ??
? s
.
and the number of training
50 these online methods than with the multiepoch batch method, and we will see how much loss this leads to in our ensemble algorithms. We ran all of our ensemble algorithms to generate ensembles of 100 base models. We ran all of our experiments on Dell 6350 computers having 600 MegaHertz Pentium III processors.
3.5.1
The Data
We tested our algorithms on nine UCI datasets (Blake et al., 1999), two datasets (Census Income and Forest Covertype) from the UCI KDD archive (Bay, 1999), and three synthetic datasets. These are batch datasets, i.e., there is no natural order in the data. With these datasets, we use our learning algorithms to generate a hypothesis using a training set and then test the hypothesis using a separate test set. We also performed experiments with an online dataset, in that the data arrives as a sequence and the learning algorithm is expected to generate a prediction for each example immediately upon arrival. The algorithm is then given the correct answer which is used to incrementally update the current hypothesis. We describe this set of experiments in Section 3.5.4. We give the sizes and numbers of attributes and classes of the batch datasets in Table 3.1. Every dataset except Census Income was supplied as just one dataset, so we tested the batch algorithms by performing ten runs of ?vefold cross validation on all the datasets except Census Income, for which we just ran with the supplied training and test set. In general fold cross validation consists of learning algorithm
B
times such that, on the th run, the th group is used as the test set and
ú ú
the remaining groups are combined to form the training set. After learning with the training our ten runs of ?vefold cross validation generated 50 results which we averaged to get the results for our batch algorithms. Because online algorithms are often sensitive to the order in which training examples are supplied, for each training set that we generated, we ran our online algorithms with ?ve different orders of the examples in that training set. Therefore, the online algorithms’ results are averages over 250 runs. All three of our synthetic datasets have two classes. The prior probability of each class is 0.5, and every attribute except the last one is conditionally dependent upon the class and
B
set, the accuracy on the th test set is recorded. This process yields
ú
B
randomly dividing the data into
nearly equal sized groups of examples and running the
B
results. Therefore,
51 Table 3.1: The datasets used in our experiments. For the Census Income dataset, we have given the sizes of the supplied training and test sets. For the remaining datasets, we have given the sizes of the training and test sets in our ?vefold crossvalidation runs. Training Set Promoters 84 Balance 500 Breast Cancer 559 German Credit 800 Car Evaluation 1382 Chess 2556 Mushroom 6499 Nursery 10368 Connect4 54045 Synthetic1 80000 Synthetic2 80000 Synthetic3 80000 Census Income 199523 Forest Covertype 464809 Data Set Test Inputs Set 22 57 125 4 140 9 200 20 346 6 640 36 1625 22 2592 8 13512 42 20000 20 20000 20 20000 20 99762 39 116203 54 Classes 2 3 2 2 4 2 2 5 3 2 2 2 2 7
0.8 0.9
0.2 0.1
the next attribute. We set up the attributes this way because the Naive Bayes model only represents the probabilities of each attribute given the class, and we wanted data that is not realizable by a single Naive Bayes classi?er so that bagging and boosting are more likely to yield improvements over a single classi?er. The probabilities of each attribute except the depends quite strongly on
? ? ?? ? n 9?
. . In Synthetic2, these proba?
? 9? n
bilities are 0.1 and 0.8, while in Synthetic3, these are 0.01 and 0.975, respectively. These
? H z??
H ? rì?
?
? H Ir?
? ? "??
and
?
?
? ? p??
The only difference between the three synthetic datasets is
? ?
? n 9? H ? ? H rìxr?
. In Synthetic1,
? ?
?
?
? ? r?
? U h?
? H Ir?
? ?
last one (
for
) are as shown in Table 3.2. Note that each attribute
? ?
?
?
? ? T?
H r?
? U ?·? ?
? ? T?
? ? ? H w? ? ? H x?? p?? ? ? ?
? A?
?
? ? ? H r? cr? p?? ? ? ?
Table 3.2:
for
? A?
? ? ? ? ? W 0????? ?
? ? X0?
in Synthetic Datasets
? ? ? ? ? W #????? ?
? ? "??
52 Table 3.3: Results (fraction correct): batch and online bagging (with Decision Trees) Dataset Decision Tree Bagging Online Bagging Promoters 0.7837 0.8504 0.8613 Balance 0.7664 0.8161 0.8160 Breast Cancer 0.9531 0.9653 0.9646 German Credit 0.6929 0.7445 0.7421 Car Evaluation 0.9537 0.9673 0.9679 Chess 0.9912 0.9938 0.9936 Mushroom 1.0 1.0 1.0 Nursery 0.9896 0.9972 0.9973
differences lead to varying dif?culties for learning Naive Bayes classi?ers and; therefore, different performances of single classi?ers and ensembles. We tested our algorithms with all four base models on these synthetic datasets even though they were designed with Naive Bayes classi?ers in mind.
3.5.2
Accuracy Results
Tables 3.3, 3.4, 3.5, 3.6, and 3.7 show the accuracies of the single model, bagging, and online bagging with decision trees, Naive Bayes, decision stumps, neural networks with one update step per training example, and neural networks with ten updates per training example, respectively. In case of decision trees, Naive Bayes, and decision stumps, the batch and online single model results are the same because the online learning algorithms for these models are lossless. Therefore, we only give one column of single model results. In case of neural networks, we show the batch and online single neural network results separately because online neural network learning is lossy. Boldface entries represent cases while italicized entries represent cases when the ensemble algorithm signi?cantly underperformed relative to a single model. Entries for a batch or online algorithm that have a “*” next to them represent cases for which it signi?cantly outperformed its online or batch counterpart, respectively. The results of running decision tree learning and batch and online bagging with de? zH H ? ?f?
when the ensemble algorithm signi?cantly (ttest,
?
) outperformed a single model
53 Table 3.4: Results (fraction correct): batch and online bagging (with Naive Bayes) Dataset Naive Bayes Promoters 0.8774 Balance 0.9075 Breast Cancer 0.9647 German Credit 0.7483 Car Evaluation 0.8569 Chess 0.8757 Mushroom 0.9966 Nursery 0.9031 Connect4 0.7214 Synthetic1 0.4998 Synthetic2 0.7800 Synthetic3 0.9251 CensusIncome 0.7630 Forest Covertype 0.6761 Bagging Online Bagging 0.9067 0.9665 0.748 0.8532 0.8759* 0.9966 0.9029 0.7212 0.4996 0.7801 0.9251 0.7637 0.6762
? ?
0.9072 0.9661 0.7483 0.8547 0.9966 0.7216 0.4997 0.7800 0.9251 0.7636 0.6762
? ? C? ? v? ? ? ? ? ???x?
cision tree base models are shown in Table 3.3. We ran decision tree learning and the ensemble algorithms with decision tree base models on only the eight smallest datasets in our collection because the ITI algorithm is too expensive to use with larger datasets in online mode. Batch bagging and online bagging performed comparably to each other. This can be seen in Figure 3.4, a scatterplot of the error rates of batch and online boosting on the test sets—each point represents one dataset. Points above the diagonal line represent datasets for which the error of online bagging was higher than that of batch bagging and points below the line represent datasets for which online bagging had lower error. Batch and online bagging signi?cantly outperformed decision trees on all except the Mushroom dataset, which is clearly so easy to learn that one decision tree was able to achieve perfect classi?cation performance on the test set, so there was no room for improvement. Table 3.4 gives the results of running Naive Bayes classi?ers and batch and online
This is because, as explained in Chapter 2, when a decision tree is updated online, the tests at each node of the decision tree have to be checked to con?rm that they are still the best tests to use at those nodes. If any tests have to be changed then the subtrees below that node may have to be changed. This requires running through the appropriate training examples again since they have to be assigned to different nodes in the decision tree. Therefore the decision trees must store their training examples, which is clearly impractical when the training set is large.
?
? ??
? ? %x? ?
? ? ? ? 4Pzx? ?
?
54
50
50
Online Bagging
20 10 0 0 10 20 30 40 50 Batch Bagging
Figure 3.4: Test Error Rates: Batch Bagging
vs. Online Bagging with decision tree base models.
Figure 3.5: Test Error Rates: Batch Bagging
vs. Online Bagging with Naive Bayes base models.
bagging with Naive Bayes base models. The bagging algorithms performed comparably to each other (Figure 3.5) and mostly performed comparably to the single models. We expected this because of the stability of Naive Bayes, which we discussed in Chapter 2. Table 3.5 shows the results of running decision stumps and bagging and online bagging with decision stumps. Bagging and online bagging only signi?cantly outperformed decision stumps for the two smallest datasets—for the remaining datasets, the bagging algorithms performed comparably to decision stumps. From both the table and Figure 3.6, we can see that batch and online bagging with decision stumps performed comparably to each other. Just as with Naive Bayes classi?er learning, decision stump learning is known to be stable (Breiman, 1996a), so once again we do not have enough diversity among the base models to yield signi?cantly better performance for bagging and online bagging. Table 3.6 shows the results of running neural network learning and bagging and online bagging with neural network base models. For the online results in this table, the neural networks were trained using only one update step per training example. Recall that, with our batch algorithms, we trained each neural network by cycling through the dataset 10 times (epochs). We can see from both the table and Figure 3.7 that online neural network learning often performed much worse than batch neural network learning; therefore, it is not surprising that online bagging often performed much worse than batch bagging (Figure 3.8), especially on the smaller datasets. Unfortunately, online bagging also did not
?
30
Online Bagging
40
40 30 20 10 0 0 10 20 30 40 50 Batch Bagging
?
55 Table 3.5: Results (fraction correct): batch and online bagging (with decision stumps) Dataset Decision Stump Promoters 0.7710 Balance 0.5989 Breast Cancer 0.8566 German Credit 0.6862 Car Evaluation 0.6986 Chess 0.6795 Mushroom 0.5617 Nursery 0.4184 Connect4 0.6581 Synthetic1 0.5002 Synthetic2 0.8492 Synthetic3 0.9824 CensusIncome 0.9380 Forest Covertype 0.6698 Bagging Online Bagging 0.8041 0.8113 0.7170 0.7226 0.8557 0.8564 0.6861 0.6862 0.6986 0.6986 0.6798 0.6792 0.5617 0.5617 0.4185 0.4177 0.6581 0.6581 0.4996 0.4994 0.8492 0.8492 0.9824 0.9824 0.9380 0.9380 0.6698 0.6698
improve upon online neural networks to the extent that batch bagging improved upon batch neural networks. This dif?culty is remedied for the case of the online algorithms where the neural networks were trained with ten update steps per training example. As shown in Table 3.7, online bagging performed signi?cantly better than online neural networks most of the time. The batch algorithms continued to perform signi?cantly better than the online algorithms with 10 updates per example (Figure 3.9 and Figure 3.10), but not to as large an extent as in the one update case. In summary, with base models for which we have proportional learning algorithms (decision tree, decision stump, Naive Bayes), online bagging achieved comparable classi?cation performance to batch bagging. As expected, both bagging algorithms improved signi?cantly upon decision trees because decision tree learning is unstable, and did not improve signi?cantly upon Naive Bayes and decision stumps which are stable learning algorithms. With neural networks, the poorer performances of the online base models led to poorer performance of the online bagging algorithm. However, especially in larger datasets, the loss in performance due to online neural network learning was low, which led to low loss in the performance of online bagging. This small performance reduction may be ac
56
70 60
Online Bagging
50 40 30 20 10 0 0 10 20 30 40 50 60 70 Batch Bagging
Figure 3.6: Test Error Rates: Batch Bagging
vs. Online Bagging with decision stump base models.
50 45 40 35 30 25 20 15 10 5 0 0 5 10 15 20 25 30 35 40 45 50 Neural Network 50 45 40 35 30 25 20 15 10 5 0 0 5 10 15 20 25 30 35 40 45 50 Batch Bagging
Online Neural Network
Figure 3.7: Test Error Rates: Batch Neural
Network vs. Online Neural Network (one update per example).
Figure 3.8: Test Error Rates: Batch Bagging
vs. Online Bagging with neural network base models.
ceptable given the great reduction in running time. We examine the running times of the algorithms in the next section.
3.5.3
Running Times
The running times of all the algorithms on all the datasets explained in the previous section are shown in Tables 3.8, 3.9, 3.10, and 3.11. The running times with decision trees, decision stumps, and Naive Bayes classi?ers are generally comparable and quite low relative to the running times for boosting that we will examine in the next chapter.
Online Bagging
?
?
57 Table 3.6: Results (fraction correct): batch and online bagging (with neural networks). The column “Neural Network” gives the average test set performance of running backpropagation with 10 epochs on the entire training set. “Online Neural Net” is the result of running backpropagation with one update step per training example. Neural Network Promoters 0.8982* Balance 0.9194* Breast Cancer 0.9527* German Credit 0.7469 Car Evaluation 0.9422* Chess 0.9681* Mushroom 1* Nursery 0.9829* Connect4 0.8199* Synthetic1 0.7217* Synthetic2 0.8564* Synthetic3 0.9824 CensusIncome 0.9519 Forest Covertype 0.7573* Dataset Bagging Online Neural Net 0.9036* 0.5018 0.9210* 0.6210 0.9561* 0.9223 0.7495* 0.7479 0.9648* 0.8452 0.9827* 0.9098 1* 0.9965 * 0.9220 0.8399* 0.7717 0.7326* 0.6726 0.8584* 0.8523 0.9824 0.9824 0.9533* 0.9520 0.7787* 0.7381
? ? ? ? ???v?
Online Bagging 0.5509 0.6554 0.9221 0.8461 0.9277 0.9959
? ? z? ?????? ? ? ? ? v? ? ? ? ? ? 4z?{? ?
The running times of online bagging with tenupdate neural networks are lower than the running times of batch bagging largely because each neural network in batch bagging runs through the training set 10 times, so bagging overall runs through the training set 1000 times, whereas online bagging only runs through the training set once. This makes an especially big difference on larger datasets where fewer passes through the training set means less data being swapped between the computer’s cache, main memory, and virtual memory. Not surprisingly, online bagging with one update per neural network ran much faster than batch bagging because online bagging not only passed through the dataset fewer times than batch bagging, but had substantially fewer backpropagation updates per training example (one update per base model) than batch bagging (ten updates per base model). Clearly, the main advantage that online bagging has over batch bagging is the same as the advantage that online learning algorithms generally have over batch learning algorithms— the ability to incrementally update their hypotheses with new training examples. Given a
?
0.6850 0.8524 0.9824 0.9514 0.7416
58
50 45 40 35 30 25 20 15 10 5 0 0 5 10 15 20 25 30 35 40 45 50 Neural Network 50 45 40 35 30 25 20 15 10 5 0 0 5 10 15 20 25 30 35 40 45 50 Batch Bagging
Online Neural Network
Figure 3.9: Test Error Rates: Batch Neural
Network vs. Online Neural Network (10 updates per example).
Figure 3.10: Test Error Rates: Batch Bagging vs. Online Bagging with neural network base models.
?xed training set, batch learning algorithms often run faster than online learning algorithms because batch algorithms can often set their parameters once and for all by viewing the entire training set at once while online algorithms have to update their parameters once for every training example. However, in situations where new training examples are arriving continually, batch learning algorithms generally require that the current hypothesis be thrown away and the entire previouslylearned training set plus the new examples be learned. This often requires far too much time and is impossible in cases where there is more data than can be stored. The ability to incrementally update learned models is most critical in situations in which data is continually arriving and a prediction must be returned for each example as it arrives. We describe such a scenario in the next section.
3.5.4
Online Dataset
In this section, we discuss our experiments with a dataset that represents a true online learning scenario. This dataset is from the Calendar APprentice (CAP) project (Mitchell, Caruana, Freitag, McDermott, & Zabowski, 1994). The members of this project designed a personal calendar program that helps users keep track of their meeting schedules. The software provides all the usual functionality of calendar software such as adding, deleting, moving, and copying appointments. However, this software has a learning component that learns user preferences of the meeting time, date, location, and duration. After the
Online Bagging
?
59 Table 3.7: Results (fraction correct): online algorithms (with neural networks trained with 10 update steps per training example). Neural Network Promoters 0.8982* Balance 0.9194* Breast Cancer 0.9527* German Credit 0.7469* Car Evaluation 0.9422* Chess 0.9681* Mushroom 1* Nursery 0.9829* Connect4 0.8199* Synthetic1 0.7217* Synthetic2 0.8564* Synthetic3 0.9824* CensusIncome 0.9519* Forest Covertype 0.7573* Dataset Bagging Online Neural Net 0.9036* 0.8036 0.9210* 0.8965 0.9561* 0.9020 0.7495* 0.7062 0.9648* 0.8812 0.9827* 0.9023 1* 0.9995 * 0.9411 0.8399* 0.7042 0.7326* 0.6514 0.8584* 0.8345 0.9824 0.9811 0.9533* 0.9487 0.7787* 0.6974
? ? ? ? ???v?
Online Bagging 0.9002 0.8987 0.7209 0.8877 0.9185 0.9396 0.7451 0.6854 0.8508 0.9824 0.9487 0.7052
? ? ? ? z?zv? ? ? ? ? ? ???? ?
user enters various attributes of a meeting to be set up (e.g., the number of attendees), the software may, for example, suggest to the user that the duration of the meeting should be 30 minutes. The user may then either accept that suggestion or may enter a different duration. The user’s selection is used to make a training example that is then used to update the software’s current hypothesis of what the user’s preferred duration is as a function of the other meeting attributes. The software maintains one hypothesis for each value to be predicted (meeting time, date, location, and duration). In CAP, the hypothesis is a set of rules extracted from a decision tree constructed by a procedure similar to ID3. Every night, a new decision tree is constructed using 120 randomlyselected appointments from the user’s previous 180 appointments (this number was chosen through trial and error). A set of rules is extracted from this decision tree by converting each path in the decision tree into an ifthen rule, where the “if” portion is the conjunction of the tests at each nonterminal node along the path, and the “then” portion is the class at the leaf node of the path. Preconditions of these rules are removed if the removal does not result in reduced accuracy over the previous 180 appointments. Any new rules that are duplicates of previouslygenerated rules
?
60 Table 3.8: Running Times (seconds): batch and online bagging (with Decision Trees) Dataset Decision Tree Bagging Online Bagging Promoters 0.16 1.82 1.88 Balance 0.18 1.66 1.96 Breast Cancer 0.18 1.88 2.28 German Credit 0.22 8.98 9.68 Car Evaluation 0.3 4.28 4.28 Chess 0.98 20.4 20.22 Mushroom 1.12 22.6 23.68 Nursery 1.22 29.68 32.74
are removed. The remaining new rules are then sorted with the previouslygenerated rules based on their accuracy over the previous 180 appointments. For the entire next day, the software uses these rules to give the user advice. Speci?cally, in the sorted list of rules, the ?rst rule whose preconditions match the known attributes of the meeting to be scheduled is used to give the user a suggestion. The dataset that we experimented with contains one user’s 1790 appointments over a twoyear period. Every appointment has 32 input attributes and four possible target classes. We ran our base model learning algorithms and our online ensemble algorithms on this dataset with the aim of predicting two different targets: the desired duration of the meeting (of which there are 13 possible values) and the day of the week (of which there are six possible values—the user chose to never schedule meetings for Sunday) . The basic structure of our online method of learning is given in Figure 3.11. That is, for each training example, we ?rst predict the desired target using the current hypothesis. This prediction is the suggestion that would be supplied to the user if our learning algorithm was incorporated in the calendar software. We then obtain the desired value of the target—in our case, we obtain it from the dataset but in general this would be obtained from the user. We then check whether our current hypothesis correctly predicted the target and update our running total of the number of errors made so far if necessary. Then we use the online learning algorithm to update our current hypothesis with this new example.
? ?
We ran every algorithm separately for each target.
61
Table 3.9: Running Times (seconds): batch and online bagging (with Naive Bayes) Dataset Naive Bayes Promoters 0.02 Balance 0.02 Breast Cancer 0.02 German Credit 0.02 Car Evaluation 0.04 Chess 0.42 Mushroom 0.38 Nursery 0.86 Connect4 6.92 Synthetic1 7.48 Synthetic2 5.94 Synthetic3 4.58 CensusIncome 56.6 Forest Covertype 106.0
? ?
Bagging Online Bagging 0.2 0.22 0.1 0.1 0.14 0.32 0.14 0.38 0.34 0.44 1.02 1.72 2.14 3.28 1.82 3.74 33.98 42.04 45.6 64.16 44.78 74.84 44.98 56.2 131.8 157.4 371.8 520.2
Table 3.10: Running Times (seconds): batch and online bagging (with decision stumps) Dataset Decision Stump Promoters 0.2 Balance 0.14 Breast Cancer 0.1 German Credit 0.36 Car Evaluation 0.1 Chess 1.46 Mushroom 0.8 Nursery 0.26 Connect4 5.94 Synthetic1 3.8 Synthetic2 3.82 Synthetic3 4.06 CensusIncome 51.4 Forest Covertype 176.64 Bagging Online Bagging 0.2 0.3 0.14 0.2 0.28 0.3 0.46 0.56 0.14 0.28 15 2.98 2.04 2.68 19.64 5.1 53.66 33.72 26.6 28.02 30.26 28.34 29.36 36.04 124.2 131.8 594.34 510.58
62 Table 3.11: Running times (seconds): batch and online bagging (with neural networks). (1) indicates one update per training example and (10) indicates 10 updates per training example. Neural Bagging Online Network Net(1) Promoters 2.58 442.74 0.1 Balance 0.12 12.48 0.02 Breast Cancer 0.12 8.14 0.06 German Credit 0.72 73.64 0.1 Car Evaluation 0.6 36.86 0.1 Chess 1.72 166.78 0.38 Mushroom 7.68 828.38 1.2 Nursery 9.14 1118.98 1.54 Connect4 2337.62 156009.3 356.12 Synthetic1 142.02 15449.58 17.38 Synthetic2 300.96 24447.2 17.74 Synthetic3 203.82 17672.84 24.12 CensusIncome 4221.4 201489.4 249 Forest Covertype 2071.36 126518.76 635.48 Dataset Online Online Bag(1) Net(10) 32.42 2.34 1.48 0.14 0.94 0.18 7.98 0.68 3.92 0.46 20.86 1.92 129.18 6.64 140.02 9.22 28851.42 1133.78 2908.48 149.34 3467.9 124.22 2509.6 117.54 23765.8 1405.6 17150.24 805.04 Online Bag(10) 334.56 11.7 6.58 63.5 36.82 159.8 657.48 1004.8 105035.76 16056.14 13327.66 12469.1 131135.2 73901.86
The results of predicting the day of the week and the desired meeting duration using the base classi?ers in isolation and as part of online bagging are shown in Table 3.12. They model learning algorithms and the online bagging algorithms with these base models. The accuracy on the th example is measured to be 1 if the hypothesis learned online using the given in the table are the average accuracies over all the examples in the dataset. To provide a basis for comparison, the learning algorithm used in the CAP project had an average accuracy of around 0.50 for day of the week and around 0.63 for meeting duration. Decision stumps, Naive Bayes classi?ers, and neural networks as well as the bagging algorithms with these as base models were not competitive. Decision tree learning was competitive (average accuracy 0.5101 for day of the week and 0.6905 for meeting duration) and online bagging with decision trees performed relatively well (0.5536 and 0.7453) on these tasks. This is consistent with the CAP designers’ decision to use decision
B ? ??
are the results of running the algorithm shown in Figure 3.11 with
replaced by the base
? B
previous
examples classi?es the th example correctly, and 0 otherwise. The values
B
?
63
Obtain the desired target value .
? } ?63
if
, then
tree learning in their program. Our online algorithm clearly has a major bene?t over their method: we are able to learn from all the training data rather than having to use trial and error to select a window of past examples to learn from, which is the only way to make most batch algorithms practical for this type of problem in which data is continually being generated.
3.6
Summary
In this chapter, we ?rst reviewed the bagging algorithm and discussed the conditions under which it tends to work well relative to single models. We then derived an online bagging algorithm. We proved the convergence of the ensemble generated by the online bagging algorithm to that of batch bagging subject to certain conditions. Finally we compared the two algorithms empirically on several “batch” datasets of various sizes and illustrated the performance of online bagging in a domain in which data is generated continuously.
? ?
?
current hypothesis,
is the latest training example to arrive, and
?
Figure 3.11: Basic structure of online learning algorithm used to learn the calendar data.
?
? ? ?? ? ?
× ?? ? ?v??? ?
Give suggestion
? ?? ? ?? ?
 ¤3 ? ? ?? ? ?
× 9×
? ? 9?? ? ?
OnlineLearning(
) . .
? ? ??3
Initial condition:
? ? ?? ? ?
.
? ?? ? ? ?v??{? ? ??? ? 
is the is the online learning algorithm.
64
Table 3.12: Results (fraction correct) on Calendar Apprentice Dataset Decision Trees Target Day Duration Naive Bayes Target Day Duration Decision Stumps Target Day Duration
Single Model 0.5101 0.6905
Online Bagging 0.5536 0.7453
Single Model 0.4520 0.1508
Online Bagging 0.3777 0.1335
Single Model 0.1927 0.2626
Online Bagging 0.1994 0.2553
Neural Networks (one update per example) Target Single Model Day 0.3899 Duration 0.5106 Neural Networks (ten updates per example) Target Day Duration
Online Bagging 0.2972 0.4615
Single Model 0.4028 0.5196
Online Bagging 0.4380 0.5240
65
Chapter 4 Boosting
In this chapter, we ?rst describe the boosting algorithm AdaBoost (Freund & Schapire, 1996, 1997) and some of the theory behind it. We then derive our online boosting algorithm. Finally, we compare the performances of the two algorithms theoretically and experimentally.
4.1
Earlier Boosting Algorithms
The ?rst boosting algorithm (Schapire, 1990) was designed to convert a weak PAClearning algorithm into a strong PAClearning algorithm (see (Kearns & Vazirani, 1994) for a detailed explanation of the PAC learning model and Schapire’s original boosting algorithm). In the PAC (Probably Approximately Correct) model of learning, a learner has from a ?xed but unknown probability distribution and if the example
?
?
concept that the learner is trying to learn. The concept is drawn from some concept class expected to return a hypothesis
?? ? ? ? ? ? ?I?t3? 4? !
which hopefully has small error, which
?
concept disagree on an example drawn randomly from the same distribution used to generate the training examples. A strong PAClearning algorithm is an algorithm that, given
? H t?à F H t°E F
any
,
, and access to training examples drawn randomly from
? ? ?? ?I?&?
?
is de?ned to be
—this is the probability that the hypothesis and true
H t?I??? ? ? ??
is in the concept and
? ?
H C?
if not. The learner is
?
×
over the domain
?
?
? ? ?? ?I???
? ??
access to a set of labeled examples of the form
?
where each
is chosen randomly , and is the target
× ? hI?
?
? ? ?? ?I???
?
, returns a
66 hypothesis having error at most with probability at least
à E
in a running time polynomial in
,
, the complexity of the target concept, and some
appropriate measure of the dimensionality of the example space. A weak PAClearning algorithm is the same as a strong PAClearning algorithm except that the set of acceptable may be selected.
? H F u?á
values for may be restricted. More precisely, there must exist some
W
A boosting algorithm is formally de?ned to be an algorithm that converts a weak learning algorithm into a strong learning algorithm. That is, it boosts a learning algorithm that performs slightly better than random chance on a twoclass problem into an algorithm that performs very well on that problem. Schapire’s original boosting algorithm (Schapire, 1990) is a recursive algorithm. At the bottom of the recursion, it combines three hypotheses generated by the weak learning algorithm. The error of this combination is provably lower than the errors of the weak hypotheses. Three of these combinations are then constructed and combined to form a combination with still lower error. Additional levels in the recursion are constructed until the desired error bound is reached. Freund (Freund, 1995) later devised the “boostbymajority” algorithm which, like the original algorithm, combines many weak hypotheses, but it combines them all at the same level, i.e., there is no multilevel hierarchy as there is for the original boosting algorithm. However, this algorithm has learning algorithm performs better than random chance) has to be known in advance. The AdaBoost algorithm eliminates this requirement. In fact, it is called Adaboost because it adapts to the actual errors of the hypotheses returned by the weak learning algorithm. As explained in Chapter 2, AdaBoost is an ensemble learning algorithm that combines a set of base models made diverse by presenting the base model learning algorithm with different distributions over the training set. We explain how this happens now.
á
one practical de?ciency: the value of
as de?ned above (the amount by which the weak
4.2
The AdaBoost Algorithm
The AdaBoost algorithm, which is the boosting algorithm that we use, generates a sequence of base models with different weight distributions over the training set. The Ad
à g?
?
. The algorithm must do this
? ??
? ??
E
such that any
? ?? ?
á ??
E
W
? ?? ?
U GE
67
otherwise
? ?
aBoost algorithm is shown in Figure 4.1. Figure 4.2 depicts AdaBoost in action. Its inputs of base models that we wish to combine. AdaBoost was originally designed for twoclass classi?cation problems; therefore, for this explanation we will assume that there are two possible classes. However, AdaBoost is regularly used with a larger number of classes. ing set which assigns equal weight to all
H
training examples. For example, a set of 10
?
training examples with weight
each is depicted in Figure 4.2 in the leftmost column
§
where every box, which represents one training example, is of the same size. We now enter set weighted by
?
If cannot take a weighted training set, then one can call it with a training set generated by sampling with replacement from the original training set according to the distribution .
? ??
?
E
misclassi?es. We require that
? o? §
(this is the weak learning assumption—the error
?
training set itself, which is just the sum of the weights of the training examples that
W
?
E
?
. After getting back a hypothesis
? ?? ?
, we calculate its error
?
?
the loop in the algorithm. To construct the ?rst base model, we call
? ? ?
?
The ?rst step in Adaboost is to construct an initial distribution of weights
? ? ??
over the train
with the training on the
?
?
?
?
are a set of
training examples, a base model learning algorithm
§
?
the base model learning algorithm, and
, and the number
?
is the set of training examples, is the number of base models to be generated.
× ? ?? ? ? ? ? ? v??XXXr×
?
Figure 4.1: AdaBoost algorithm:
§
?
?
? ?
?
? ? pD? ? ?& ? ?
Output the ?nal hypothesis:
! j$ ? ?? ? é
? ?×
?? é vêz?
if
? ?
?
& ??
?
?$ ? ?
?
?
ù?oCê??oC? ÷ × ?? é ? ? × ? ? üú
? v?? ?? ? ? ?
?
· ?0? ?? ????f?v? ? ? ? ± ° ? ? ? × ?
é ê?
Update distribution
? ? è ? °????
?
é ? ê?
? ? ? ó é ??iì?í
If
then,
set
and abort this loop. : for all ,
× ?? é oCêê?
?
?
? ! ?j$
·
? é í ? é ???ìP?
Calculate the error of
ò? ?&
? ? ?? ?? × é ? Gê??
?
×
?
?
?v???(?X?X?Xr× ? ? v?3? ?iz? ?? ? ? ? é § ? ? ? ? ? ? ? ? ? ? ? ? ? ~4XXX??%??è
For
:
? ? ? ? ? ? ? ? ? ? ) ?4XXX??%C?g?
Initialize
for all
. .
?
? ~?
? ??
×
?
?? ? ? ? ? ? v??XXXr×
AdaBoost(
§
?
?
???????loC? ? × ? ? ? v?? ?? ? ? ?
) .
?? !?¨
?
is
68 should be less than what we would achieve through randomly guessing the class). If this condition is not satis?ed, then we stop and return the ensemble consisting of the previouslygenerated base models. If this condition is satis?ed, then we calculate a new distribution
&
si?ed examples have their weights reduced and misclassi?ed examples have their weights under
?
under
. In our example in Figure 4.2, the ?rst base model misclassi?ed the ?rst three to
H
three misclassi?ed examples’ weights are increased from
(the heights of the
W
top three boxes have increased in the ?gure from the ?rst column to the second column to re?ect this), which means the total weight of the misclassi?ed examples is now
H
seven correctly classi?ed examples’ weights are decreased from of the correctly classi?ed examples is now also training set and the new distribution
? ? ?
W
to
(the heights
of the remaining seven boxes have decreased in the ?gure), which means the total weight culating model
W
. Returning to our algorithm, after cal?
. The point of this weight adjustment is that base
?
will be generated by a weak learner (i.e., the base model will have error less than will have to be learned. base models in this fashion.
? ?
); therefore, at least some of the examples misclassi?ed by
We construct
The ensemble returned by AdaBoost is a function that takes a new example as input each base model’s weight is , which is proportional to the base model’s accuracy
? ?
and returns the class that gets the maximum weighted vote over the
? ?
base models, where
on the weighted training set presented to it. According to Freund and Schapire, this method of combining is derived as follows. If we have a twoclass problem, then given an instance rule we should choose the class
? ? I?? ? ?? ? ? ? ?? ?3??
?
? ? ? ?I??
?
?
? ? ? ? ?????
? ? I??
?
??
?
? ? ?? í?3??
F
? ? ? ?I??
?
over
?
if
? ??
?
U a
?
? ? ? ? I?? é ?
and base model predictions
?
for
, by the Bayes optimal decision
?
, we go into the next iteration of the loop to construct base model
?
using the
H
? ??
?
?
?
? ? ??
E
training examples and correctly classi?ed the remaining ones; therefore,
? B?
?
?? ¤??
? ? B??
? ? B??
? ? ? ? #???? ?
? ??
? ?
?
and examples that
correctly classi?ed have their total weight reduced to . The
?
increased. Speci?cally, examples that
? ? ? ?
W W
misclassi?ed have their total weight increased to
?
E
?
?
?
?
? ? ? ? ?????
? ? 6 ??p`
?
weights multiplied by
. Note that, because of our condition
, correctly clas
?
W
? ??
?
?
have their weights multiplied by
? ?
. Examples that were misclassi?ed by
have their
. The
?
over the training examples as follows. Examples that were correctly classi?ed by
? ? ? ?
$
?
?
?
?
? ? ?? ? ?? ? ?? ?
69
Training Examples
1/2
.....
1/2
Weighted Combination
Figure 4.2: The Batch Boosting Algorithm in action. By Bayes’s rule, we can rewrite this as
? ? ? ? ? ? ? ? ? ?? ?I?? ?¨?I??? ? ? ? I?? ??? ?? ? ?p??????? ? ? ? I?? 3??? ? ?? ? ? ? ? ? ? é i? E
F
Since the denominator is the same for all classes, we disregard it. Assume that the errors of the different base models are independent of one another and of the target concept. That the predictions of the other base models. Then, we get
? é i? E ? ? ?& ? ?? ? ? ?VI?? é ?
is, assume that the event
is conditionally independent of the actual label
é E
?
factor), high accuracies (
?
?
want to choose the class
? ? ? ?? ?í?3??
that has the best combination of high prior probability (the ) of models that vote for class (those for which
?
? ??
?? ? ? ?I??
é
?? ? ????
é E
where
and
is the actual label. This intuitively makes sense: we
?
? ? ? ? ?????
?
?
¨ ??
?&
! j$
?
?
§
? ??
! j$
? ?? ?3??
é
§
? ??
é E
é
¨ ??
é E
ò? ?&
F ! ?$ ?
?
?
§
? ??
ò? b&
? ?
! j$
é
§
? ??
? ?? ? ? ?@?I?? ? ? ?
é
?
? ?? ?3??
?¨?I??? ? ? ? I?? 3?? ?? ? ????????? ? ? ? ?? ? ? ? ?? I?? ??¨? í?3?? ? ? ? ? ? ?? ?3?? ?
?
? ? ? ? ?????
and
70 ), and high errors ( with
?
n
Taking logarithms and replacing logs of products with sums of logs, we get
?
é i? E
which is the method that AdaBoost uses to choose the classi?cation of a new example.
4.3
Why and When Boosting Works
The question of why boosting performs as well as it has in experimental studies performed so far (e.g., (Freund & Schapire, 1996; Bauer & Kohavi, 1999)) has not been precisely answered. However, there are characteristics that appear systematically in experimental studies and can be seen just from the algorithm. Unlike bagging, which is largely a variance reduction method, boosting appears to reduce both bias and variance. After a base model is trained, misclassi?ed training examples have their weights increased and correctly classi?ed examples have their weights decreased for the purpose of training the next base model. Clearly, boosting attempts to correct the bias of the most recently constructed base model by focusing more attention on the examples that it misclassi?ed. This ability to reduce bias enables boosting to work especially well with highbias, lowvariance base models such as decision stumps and Naive Bayes classi?ers. One negative aspect of boosting is the dif?culty that it has with noisy data. Noisy data is generally dif?cult to learn; therefore, in boosting, many base models tend to misclassify
?
If there are more than two classes, one can simply choose the class
?
é i? E
that maximizes
é E
§
é E
? é ?? E
? ?
é E
÷
t
?
¨ ??
¨ ??
§
E
?&
?&
! j$
! ?$
é E
§
? ??
?
? ??
?
? ?
é
é
?
÷
F
F
? ?? ????
t
é i? E
§
?
?&
é ?? E
é E
! ?$
é E
?
? ?? ?
?
?
? ?
?
é
?&
E i?
! ?$
÷
t
§
? ??
?
?
é
?&
! j$
?
? ??
?
é
? ?? ?3??
replace
and
with
. Dividing by the
?
n
? ??
). If we add the trivial base model
? ? ?
that always predicts class
’s, we get
?
?
) for models that vote against class
(those for which , then we can
é E
? ? ? ? é?I?? é ? ? ?? ? ? udI?? é ?
71 noisy training examples, causing their weights to increase too much. This causes the base model learning algorithm to focus too much on the noisy examples at the expense of the remaining examples and overall performance. There is a wellknown theorem by Freund and Schapire proving that the training error of the boosted ensemble decreases exponentially in the number of base models. Theorem 3 Suppose the weak learning algorithm WeakLearn, when called by AdaBoost, the ?nal hypothesis
?
One would think that increasing the number of base models so much that the training error goes to zero would lead to over?tting due to the excessive size of the ensemble model. However, several experiments have demonstrated that adding more and more base models even after the training error has gone to zero continues to reduce the test error (Drucker & Cortes, 1996; Quinlan, 1996; Breiman, 1998). There has been some more recent work (Schapire, Freund, Bartlett, & Lee, 1997, 1998) that attempts to explain this phenomenon in terms of the distribution of margins of the training examples, where the margin of an example is the total weighted vote for the correct class minus the maximum total weighted vote received by any incorrect class. That is, it is claimed that adding more base models to the ensemble tends to increase the distribution of margins over the training set. For the training error to reach zero, one only needs a positive margin on all the training examples; however, even after this is achieved, boosting continues to increase the margins, thereby increasing the separation between the examples in the different classes. However, this explanation has been shown experimentally to be incomplete (Breiman, 1997). A theoretical explanation for boosting’s seeming immunity to over?tting has not yet been obtained and is an active area of research. However, this immunity and boosting’s good performance in experiments makes it one of the most popular ensemble methods, which is why we now devise an online version.
? é i? E
? é E
é
returned by AdaBoost is bounded above by
?
? ?
?? ??
? ??
?I? ?? ?
W
? ?E
E
E
E
generates hypotheses with errors
?? ?m¨
. Then the error
?
?
¨
? ? ?
? #????? #? ? ? ? ? ? ?
of .
72
à
then
else
4.4
The Online Boosting Algorithm
Just like bagging, boosting seems to require that the entire training set be available at all times for every base model to be generated. In particular, at each iteration of boosting, we call the base model learning algorithm on the entire weighted training set and calculate the error of the resulting base model on the entire training set. We then use this error to adjust the weights of all the training examples. However, we have devised an online version of boosting that is similar in principle to our online bagging algorithm. Recall that the online bagging algorithm assigns each train
? ?
latest training example to arrive, and
×
? ? ov?
is the set of base models learned so far, is the online base model learning algorithm.
?
Figure 4.3: Online Boosting Algorithm:
? ? ?
? ? pD?
! ?$
é
· ? ? ± ° ? ? ? × ?? ??? ?? ????f?v??
Return
?
?
?
à
?&
? ??
à
To classify new examples:
4 ! ? 6 7&?? x ? ? 53} ! x 0 2 ?  é 1?y ? ? x ?y }f9í ! 0 ! ? )}1y? ?é (} ?é !
! ? 6 x ? ? 4 x 3} ! 0 2 ?  é 1y? ? ? x y? }f9í ! 0# ! ? # 8} 1y? $?é 8} %?é ! ?
$
?
à
× ?? é ? ?vP??
If
× 9×
?
? ?? ? é ? ?v?z{? ? ??z? ? ? é
à
1
Do
?
times
× ?
& !? ? ? 3 3? '%l?#4???
2
1
Set
according to
.
? ? ? ? ? ? ? ? ? ? ) ~(XXX??%??0è
é z?
For each base model
,(
?
? ? ?V
!
Set the example’s “weight”
×
? ?? ?v?? ? ?f ? ?
OnlineBoosting(
) .
) in ,
. is the
? ??
é
!
? ? ? ? ? ? ? ? ? ? ) ~(XXX??%??0è ?
Initial conditions: For all
,
# !? ? $?é "???
?
?
.
73 ing example a Poisson parameter
v
training example by the batch bagging algorithm. When the batch boosting algorithm genbagging, so the online boosting algorithm assigns each example a Poisson parameter
? ?
erates the ?rst base model, every training example is assigned a weight
just like online bagging. For subsequent base models, online boosting updates the Poisson parameter for each training example in a manner very similar to the way batch boosting updates the weight of each training example—increasing it if the example is misclassi?ed and decreasing it if the example is correctly classi?ed. The pseudocode of our online boosting algorithm is given in Figure 4.3. Because and the associated parameters
?
(these are the sums of the weights of the correctly classi?ed and misclassi?ed examples, algorithm ters
v
. Then the algorithm goes into a loop, in which one base model is updated in each , the online base model learning algorithm,
¤
same factor that is used by AdaBoost for misclassi?ed examples. We then go into the secnew updated weight . We repeat this process for all
?
v
base models. The ?nal ensemble
returned has the same form as in AdaBoost, i.e., it is a function that takes a new example and returns the class that gets the maximum weighted vote over all the base models, where
? Y?
? ??
?
ond iteration of the loop to update the second base model
with example
?
?
?
?
v
E
misclassi?es. Then we calculate
and update
by multiplying it by
?
?
?
v
example , then we increment
?
, which is the sum of the weights of the examples that , which is the and its
?
# $??
?
by the same factor
? ?
that we do in AdaBoost. On the other hand, if
?
v
?
fraction of the total examples that
&
has misclassi?ed. We then update
E
classi?es correctly. We then calculate
? ? ? ?
?
v
correctly. If it does, we update
? ?
, which is the sum of the weights of the examples that which, just like in boosting, is the weighted by multiplying it misclassi?es
?
. We then see if the updated
??
has learned the example, i.e., whether
?
?
? ? à
$
? ?
call
? ? ??
times with base model
?
? v? B 6 8 8ú 6 I?á@@@??7?
¤
iteration. For the ?rst iteration, we choose
according to the
distribution, and and example classi?es it
? ??
? ??
C $@
v
and
. The algorithm starts by assigning the training example
?
9
classi?cation function that is composed of updated base models
? ??
? ??
and a new labeled training example
?
?
respectively, for each of the
? ?
base models), as well as an online base model learning . The algorithm’s output is a new and associated paramethe “weight”
?
v
v ? ? ???
C@ %Av
and
?
# $??
?
? ? ? ? p????
9
our algorithm is an online algorithm, its inputs are the current set of base models
? ?v
# $??
? ??
?
to correspond to the weight
? ??
à
??
v
? ? ? ? ????? ??
à
v ? ???
B@ $Av
? ?
assigned to each just like batch
? ?
? ? ? ? ? ????0?
B D@
?
? Y?
? zv
? &? ?
74
Training Examples
a b c
d
...
Weighted Combination
Figure 4.4: Illustration of online boosting in progress. Each row represents one example being
passed in sequence to all the base models for updating; time runs down the diagram. Each base model (depicted here as a tree) is generated by updating the base model above it with the next weighted training example. In the upper left corner (point “a” in the diagram) we have the ?rst training example. This example updates the ?rst base model but is still misclassi?ed after training, so its weight is increased (the rectangle “b” used to represent it is taller). This example with its higher weight updates the second base model which then correctly classi?es it, so its weight decreases (rectangle “c”).
the weighted training set presented to it. Figure 4.4 illustrates our online boosting algorithm in action. Each row depicts one training example updating all the base models. For example, at the upper left corner (rectangle “a” in the diagram) we have the ?rst training example. The ?rst base model (the tree to the right of the “a” rectangle) is updated with example “a,” however the ?rst base model still misclassi?es that example; therefore the example’s weight is increased. This is
V WU T UT RP H G F 1'SQI?%E
each base model’s vote is
, which is proportional to the base model’s accuracy on
75 depicted as rectangle “b” which is taller to indicate that the example now has more weight. The second base model is now updated with the same training example but with its new higher weight. The second base model correctly classi?es it, therefore its weight is reduced (depicted by rectangle “c”). This continues until all the base models are updated, at which time we throw away this training example and pick up the next example (rectangle “d” in the ?gure). We then update all the base models with this new example, increasing and decreasing this example’s weight as necessary. Each column of base models (depicted in the diagram as trees) is actually the same base model being incrementally updated by each new training example. One area of concern is that, in AdaBoost, an example’s weight is adjusted based on the performance of a base model on the entire training set, whereas in online boosting, the weight adjustment is based on the base model’s performance only on the examples seen earlier. To see why this may be an issue, consider running AdaBoost and online boosting 10000 examples before being tested on, say, the tenth training example. In online boosting, is generated from only the ?rst ten examples before being tested on the tenth example. in online boosting may be presented with different weights for the tenth example. This in Clearly, we may expect the two on a training set of size 10000. In AdaBoost, the ?rst base model is generated from all
each algorithm, and so on.
We will see in Section 4.6 that this is a problem that often leads to online boosting performing worse than batch boosting, and sometimes signi?cantly so. Online boosting is especially likely to suffer a large loss initially when the base models have been trained with very few examples, and the algorithm may never recover. Even when online boosting’s performance ends up comparable to that of batch boosting, one may obtain learning curves such as what is shown in Figure 4.5. The learning curve clearly shows that online boosting performs poorly relative to batch boosting for smaller numbers of examples and then ?nally catches up to batch boosting by the time the entire training set has been learned. To alleviate this problem with online boosting, we implemented primed online boosting, which trains with some initial part of the training set in batch mode and then trains with the remainder in online mode. The hope is to reduce some of the loss that online boosting
a `X
may, in turn, lead to very different weights for the tenth example when presented to
Y X
’s to be very different; therefore,
P X
P X
P X Y `X
in AdaBoost and
76
Car Evaluation with Decision Trees 1 0.95
Fraction Correct
0.9 0.85 0.8 0.75 0.7 0.65 0.6 0 200 400 600 800 1000 1200 1400 Number of Training Examples Boosting Online Boosting Primed Online Boosting
Figure 4.5: Learning curves for Car Evaluation dataset.
suffers initially by training in batch mode with a reasonable initial portion of the training set. We would like to get results such as what is shown for “Primed Online Boosting” in Figure 4.5, which stays quite close to batch boosting for all numbers of training examples and ends with an accuracy better than what online boosting without priming achieves.
4.5
Convergence of Batch and Online Boosting for Naive Bayes
To demonstrate the convergence of online and batch boosting, we do a proof by induction over the base models. That is, for the base case, we demonstrate that if both batch and online boosting are given the same training set, then the ?rst base model returned by online th base models converge then the st base models converge. In the process, we show boosting converges to that returned by batch boosting as . Then we show that if the
that the base model errors also converge. This is suf?cient to show that the classi?cation functions returned by online boosting and batch boosting converge. We ?rst discuss some preliminary points: de?nitions and lemmas that will be useful to us. Then we discuss the proof of convergence.
e db c
i g )hf
f
? ?V V V P ? j ? n w ?H  ~ o w ?H  ? ?H h'Au %?rrxp $???{n ?w Dwl ? xb ? ? ?w? ? ? irj ?? P P ? t "jh?'n?wu V $H?r~orwxp V %H}? n{ Qw ?l ? ??? ???Au V $}rrxp V %}?{ ? ?  ?  ? ?j ? w ?H  ~ o w ?H  ? V V V P e db ? c h'AAu %?rrxp $?'{ Qw c j ? n w ?H  ~ o w ?H  ?
, then and as . . , there exists an . . We have , .
?? ? ?w ?? ? ? ??l ? ?nH ? ? "rml ??j
,
V d %H??{ P ?w ? ?  ? e hb c ? qtttq? ? )b ?vQQ?rq i ???
), and
4.5.1
Lemma 3 If
mett & Stirzaker, 1992), so we state them without proof.
Lemma 3, Corollary 2, and Corollary 3 are standard results (Billingsley, 1995; Grim
Preliminaries
and
are sequences of random variables such that
yc w ?xp
? ?c Y ¤?? ??
u?p?y?wu?wsp c u ?u yc w p yc w ??xp V q V u %?? y c u q x%?? pH w w pH u ?u yc w ttt Y q QvQq Au rP u ttt Y q QQQq sp rP p
, then and , then
Corollary 2 If
Corollary 3 If
and
,
, and
for any continuous function
for all , then
p y c xp w ? ? ? ? y c ?? ?
?
? ? w ?5u
u y c u w
p w ?y c xp
Lemma 4 If
and
are discrete random variables and
, then
d w pH ex%??
f
e z? c V e cv? i c utsrx%ml ? d p o w p