What is random in Random Forest?
My favorite interview question is “What is random in Random Forest?”. I ask it of anyone who claims to have used Random Forest (RF) in her work. The most common answer I get is that Random Forests are so called because each tree in the forest is built from a randomly selected sample of the data. While this answer is partially correct, there are other, more important reasons for prefixing the term “Random” to “Forest”.
The genesis of “Random Forest” is a seminal paper that Leo Breiman wrote in 2001. In this paper, Breiman introduced the term Random Forest to denote a new class of algorithms (note the plural here) built on several machine learning ideas gaining popularity at that time. All these ideas had two points of similarity: one, they had some element of randomness, and two, they were based on Decision Trees. In his paper, Breiman mentions two prior works as his inspiration. One of those is by Amit and Geman (“Shape Quantization and Recognition with Randomized Trees”), of which Breiman says, “This latter paper has been influential in my thinking”. The other important paper he refers to is “The Random Subspace Method for Constructing Decision Forests” by Tin Kam Ho.
Training Decision Trees (DTs) has always been a challenge, as they tend to overfit. Several attempts to tackle this problem used the idea of randomness to improve generalizability. One idea that was quite popular was bootstrap aggregation (also known as bagging), in which multiple trees are trained: each tree is fit on a sample of the data, and the prediction is a majority vote over all the trees. This method reduces the variance of a decision tree by incorporating randomness. The papers by Amit and Geman and by Ho, however, incorporate randomness by training each tree on a random subspace of the data: each tree is trained on only a limited number of features, and these features are chosen at random for each tree.
Implementation with code
We can build further intuition by trying these ideas out in Python. Starting from the root idea of randomness, we will implement a bagged tree. Sklearn's implementation of bagged trees is available via the generic sklearn.ensemble.BaggingClassifier API. For this demonstration, I have used the Wisconsin breast cancer dataset; further details on this dataset are available here.
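The exact loading and cleanup steps live in the GitHub repo; as a rough stand-in, the sketch below uses the copy of the dataset bundled with sklearn and holds out a test set (the 80/20 split and the random_state are illustrative choices, not necessarily what the repo does):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the Wisconsin breast cancer dataset bundled with sklearn;
# the cleanup steps used in the post are on GitHub.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set for a final check; the ensembles below also
# report an out-of-bag (OOB) error, which needs no held-out data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```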
The complete code is available on GitHub; this post reproduces only the snippets relevant to the discussion, as lightly simplified sketches rather than the exact notebook code.
Bagged Decision Trees
After some basic cleanup of the data (please check GitHub for details), I model the data using bagged decision trees. Three parameters are relevant to the discussion in this post. ‘max_samples’ is the fraction of the data used for each tree in the ensemble; in the example below, we use only 30% of the data for each tree and fit 10 trees (n_estimators=10). The second parameter, ‘max_features’, is the fraction of features to use in each tree; here we want all the features and therefore set it to 1.0. The final parameter, ‘bootstrap’, denotes our sampling strategy: bootstrap=True means sampling with replacement.
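Concretely, that configuration might look something like this (the random_state and the use of sklearn's bundled copy of the data are illustrative, so the exact OOB number can differ from the 7.2% reported below):

```python
from sklearn.ensemble import BaggingClassifier

# X_train, y_train come from the loading snippet above.
# BaggingClassifier grows a decision tree as its default base estimator,
# so this is an ensemble of 10 bagged decision trees.
bagged_trees = BaggingClassifier(
    n_estimators=10,   # number of trees in the ensemble
    max_samples=0.3,   # each tree sees a 30% sample of the rows
    max_features=1.0,  # ...but every feature
    bootstrap=True,    # rows are drawn with replacement
    oob_score=True,    # score each tree on the rows it never saw
    random_state=42,
)
bagged_trees.fit(X_train, y_train)
print(f"OOB error: {1 - bagged_trees.oob_score_:.1%}")
```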
With this setup, the bagged decision trees give an error of 7.2% on the out-of-bag (OOB) data.
Random Subspace & Random Forest
The random subspace method is the general technique of choosing a random subset of features from the data to fit each estimator. When combined with bagged decision trees, the random subspace method gives rise to Random Forests. There are more sophisticated extensions of the random subspace method, for example creating a subspace from linear combinations of the original features, but RFs work remarkably well with plain random selection of features.
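As an aside, the random subspace idea on its own can be sketched with the same BaggingClassifier API: keep every row and hand each tree a random subset of the features (the 30% feature fraction and the random_state here are illustrative choices):

```python
from sklearn.ensemble import BaggingClassifier

# X_train, y_train, X_test, y_test come from the loading snippet above.
# Random subspace flavor: every tree sees all the rows, but only a
# random 30% subset of the features.
subspace_trees = BaggingClassifier(
    n_estimators=10,
    max_samples=1.0,   # every tree gets all the rows...
    max_features=0.3,  # ...but only 30% of the features, chosen at random
    bootstrap=False,   # no row resampling at all
    random_state=42,
)
subspace_trees.fit(X_train, y_train)
print(f"Test accuracy: {subspace_trees.score(X_test, y_test):.1%}")
```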
sklearn has a direct API for Random Forest, and the code below depicts its use (complete code on GitHub). Note that there is no equivalent of the ‘max_samples’ parameter from the bagged trees: the RF implementation does not even let us train on sub-samples of the data (recent sklearn releases have added a max_samples option, but by default each tree still gets a bootstrap sample the same size as the training set). That is a strong hint that random feature selection, not row sampling, is the primary reason RFs are so called.
The RF implementation lets us control two sets of parameters: the first set shapes the forest, and the second is specific to the individual trees in the forest. ‘n_estimators’ is the number of trees our forest will have. ‘max_features’ plays a role similar to the bagged-trees parameter, but at a finer grain: it is the number of features considered when looking for the best split at each node of each tree. Setting max_features='auto' considers sqrt(p) features at every split (where p is the number of features in the original data). The final parameter of interest is ‘bootstrap’; although this RF does not let us sub-sample the data to build trees, it does allow bootstrapping. Each bootstrap sample has the same size as the original data, yet every tree ends up being trained on a slightly different dataset.
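Putting these parameters together, a Random Forest along these lines matches the description above (max_features="sqrt" is the modern spelling of the sqrt(p) behavior that 'auto' used to select; the random_state is illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, X_test, y_test come from the loading snippet above.
# Random Forest: bootstrapped rows plus a fresh random subset of
# features considered at every split of every tree.
rf = RandomForestClassifier(
    n_estimators=10,      # number of trees in the forest
    max_features="sqrt",  # consider sqrt(p) features at each split
    bootstrap=True,       # each tree gets a bootstrap sample of the rows
    oob_score=True,       # estimate error on the out-of-bag rows
    random_state=42,
)
rf.fit(X_train, y_train)
print(f"OOB error: {1 - rf.oob_score_:.1%}")
print(f"Test accuracy: {rf.score(X_test, y_test):.1%}")
```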
RFs are a powerful tool for classification. In many of my classification projects, I start with an RF with default parameter settings and get very good results without doing anything else. RFs are insensitive to feature scales, handle noisy or correlated data well, and are remarkably resistant to overfitting. And all of these benefits accrue mainly from the random selection of features.