Random Forest is an ensemble model built from decision trees through the bagging technique. It is used for classification and regression tasks in supervised learning (although it can be applied in unsupervised settings too).
Yes, it’s a bit more technical than the definition in the previous post, isn’t it? Well, let’s explain the main concept behind the Random Forest algorithm.
Bagging (Bootstrap Aggregation), Boosting … What the hell?
When we want to ensemble, that is, group different models to create a more complex machine learning model, we normally use one of the following techniques:
- Bagging (Bootstrap Aggregating)
- Boosting
- Stacking
Bagging algorithms aim to reduce the variance of complex models that overfit the training data. In contrast, boosting is an approach to increase the expressiveness of models that suffer from high bias, that is, models that underfit the training data.
On the other hand, the stacking technique is used to combine the predictions of several other models: we first train different models on the available data, and then another algorithm combines their outputs to produce the final prediction.
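The three techniques can be sketched side by side. The snippet below is a minimal illustration in Python's scikit-learn (chosen here for brevity; the post itself discusses R packages, and all dataset sizes and hyperparameters are arbitrary):

```python
# Minimal sketch of the three ensemble techniques using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: deep trees trained on bootstrap samples, predictions averaged.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=42)

# Boosting: shallow trees trained sequentially, each one focusing on the
# examples the previous trees got wrong.
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=50, random_state=42)

# Stacking: a meta-model (here logistic regression) learns how to combine
# the predictions of the base models.
stacking = StackingClassifier(
    estimators=[("bag", bagging), ("boost", boosting)],
    final_estimator=LogisticRegression())

for name, model in [("bagging", bagging), ("boosting", boosting),
                    ("stacking", stacking)]:
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))
```

Each of the three models exposes the same fit/predict interface, which is exactly what makes them interchangeable building blocks.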
Bagging technique explanation
Also, the following video explains the process step by step:
It’s important to keep in mind that the random forest algorithm is actually pretty similar to bagged trees, but not identical.
The key difference is that, when splitting each node, a random forest selects a random subset of the features and uses the best splitting feature from that subset, unlike bagging, where all features are considered for splitting a node. The parameter that controls the size of this subset is called mtry.
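This single difference is easy to see in code. Below is a sketch in Python's scikit-learn, where the role of mtry is played by the max_features parameter (the dataset and settings are purely illustrative):

```python
# Bagged trees vs. random forest: the only difference is per-split
# feature subsampling (mtry in R, max_features in scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Bagged trees: every split considers all 20 features.
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                           random_state=0).fit(X, y)

# Random forest: each split considers only a random subset of features
# (sqrt(20) ≈ 4 here), which decorrelates the individual trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
```

Because the trees in the forest see different feature subsets at each split, they disagree more with each other, and averaging their votes cancels out more of the individual errors.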
Why should I use a random forest? I already have my decision trees!
Let’s compare the Random Forest with the decision tree model:
Pros => The Random Forest algorithm significantly reduces the variance of the model compared with the decision tree algorithm, since a simple decision tree can overfit: some data points are predicted really well while others are not.
Cons => It’s a “black box” model. It normally achieves much higher accuracy than a simple decision tree, but it’s a bit trickier to explain its conclusions to the business side. If it’s mandatory to understand the model’s conclusions in detail, it tends to be set aside or, at least, presented along with a more interpretable algorithm like a simple decision tree.
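The variance-reduction advantage can be checked with a quick experiment. The sketch below, again in Python's scikit-learn for illustration, compares a single unpruned tree against a forest using cross-validation (the dataset is synthetic and the settings arbitrary):

```python
# A single unpruned tree typically overfits; a random forest averages many
# decorrelated trees and generalizes better on held-out data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=15, n_informative=5,
                           random_state=1)

tree_cv = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=5)
forest_cv = cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=5)

print("tree  :", round(tree_cv.mean(), 3))
print("forest:", round(forest_cv.mean(), 3))
```

On noisy data like this, the forest's cross-validated accuracy is normally noticeably higher than the single tree's.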
Fortunately, we are already seeing different approaches to extract the common rules from the underlying decision trees of a Random Forest, reducing the “black box” factor of the algorithm.
Random Forest Parameters
There are several parameters to configure when we want to model with the Random Forest algorithm:
- depth => Maximum depth of the trees.
- ntrees => Number of trees to be used to build the Random Forest.
- maxnodes => Maximum number of terminal nodes that each tree in the forest can have.
- mtry => Number of variables randomly sampled as candidates at each split.
The bigger the parameter values, the more complex the random forest will be: more trees, with more depth and more variables considered at each split, which should, in principle, lead to better predictive models.
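The parameter names above come from the R ecosystem. As a sketch, here is how each one maps to its scikit-learn equivalent in Python (the mapping of names is real; the values are purely illustrative):

```python
# The four R parameters mapped to RandomForestClassifier arguments.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,   # ntrees: number of trees in the forest
    max_depth=8,        # depth: maximum depth of each tree
    max_leaf_nodes=32,  # maxnodes: maximum number of terminal nodes per tree
    max_features=3,     # mtry: variables sampled as candidates at each split
    random_state=0,
).fit(X, y)
```

After fitting, the individual trees are available in forest.estimators_, so the constraints (depth, leaf count) can be inspected tree by tree.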
Random Forest Libraries
The typical packages used in R to model with the Random Forest algorithm are e1071 and randomForest. However, there is a newer implementation called ranger that reduces the execution time significantly. We encourage you to try it!
Besides, those who usually work with the caret package to develop machine learning models won’t have any problem, since these three packages are fully supported.
References & further reading
We attach some interesting links for those who want to continue reading about the topic in more depth: