Random Forest – Modeling The Titanic Voyage with R

David Carrasco | Data Science

What’s a Random Forest?

Random Forest is a supervised machine learning algorithm, normally used for classification and regression tasks. It consists of an ensemble (a group) of simple decision tree models that together predict the value of a target variable from a set of input variables. Its main advantage over a single decision tree is that it reduces overfitting to the training data by decreasing the variance of the resulting model.

The algorithm is implemented in most popular programming languages (R, Python, Java, C …) and is even available for distributed computing frameworks like Spark or H2O.

In this first post, we walk through a fairly straightforward use case: using the Random Forest algorithm in R to predict whether a passenger survived the Titanic voyage.

In a second post, we will look at the magic behind the Random Forest algorithm and why it’s so widely used in the data science industry.

Random Forest Classification

Random Forest Titanic Example

We are going to use the randomForest package to create the model. Take a look at its CRAN page for more information.

First of all, we are going to load the randomForest library:
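
A minimal sketch of that step. The install check is an addition for convenience, not something the original post necessarily did:

```r
# Install the package once if it is missing, then attach it.
if (!requireNamespace("randomForest", quietly = TRUE)) {
  install.packages("randomForest")
}
library(randomForest)
```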

Our goal is to predict whether a person survives or dies, using the famous Titanic dataset. We are going to study the structure of the dataset, define our target variable, and create a model that can predict new data points in the future.

First, we have to get the CSV dataset from the internet to explore the data:

Loading the dataset:

Let’s take a look at the structure of the Titanic training data:
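
The loading and inspection steps can be sketched roughly as follows. The download location is not given here, so the sketch writes a three-row stand-in CSV to a temporary file. The column layout (the one commonly used for the Titanic training file) and the rows themselves are assumptions made purely for illustration, and `train_path` and `train` are hypothetical names:

```r
# Stand-in for the downloaded file: three made-up rows that only mimic
# the assumed column layout. With the real data, point `train_path` at
# the downloaded CSV (read.csv also accepts a URL).
train_path <- tempfile(fileext = ".csv")
writeLines(c(
  "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked",
  "1,0,3,\"Doe, Mr. John\",male,22,1,0,A1,7.25,,S",
  "2,1,1,\"Roe, Mrs. Jane\",female,38,1,0,B2,71.28,C85,C",
  "3,1,3,\"Poe, Miss. Ann\",female,,0,0,C3,7.92,,S"
), train_path)

# Keep strings as character for now; factors are created during cleaning.
train <- read.csv(train_path, stringsAsFactors = FALSE)

# Show each column's name, type and first values.
str(train)
```

Note that the empty Age cell is read as NA, which is the kind of gap the preparation step has to handle.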

We are going to apply the following transformations to prepare the data:

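  • Remove unnecessary variables: Name, Parch, Ticket, Cabin
  • Fill NAs in the Age field with -1 (Name is removed in the previous step, so the numeric fill can only apply to a numeric field such as Age)
  • Fill NAs in the Fare field with the mean fare
  • Fill missing values in the Embarked field with a valid embarkation value
  • Convert the Sex and Embarked fields to factors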

We are going to use our custom function CleanData for that:
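
The original body of CleanData is not reproduced here, so the following is only a sketch of what such a function could look like. It makes two assumptions: the -1 fill is applied to Age (Name is dropped, so it cannot hold that value), and missing Embarked entries receive the most frequent port. The `toy` data frame is invented just to exercise the function:

```r
CleanData <- function(data) {
  # Drop the variables the text lists as unnecessary.
  drop_cols <- c("Name", "Parch", "Ticket", "Cabin")
  data <- data[, !(names(data) %in% drop_cols), drop = FALSE]

  # Missing ages get the sentinel value -1 (assumed target of the -1 fill).
  data$Age[is.na(data$Age)] <- -1

  # Missing fares get the mean of the observed fares.
  data$Fare[is.na(data$Fare)] <- mean(data$Fare, na.rm = TRUE)

  # Missing or empty ports get the most frequent port (assumption).
  missing_port <- is.na(data$Embarked) | data$Embarked == ""
  top_port <- names(which.max(table(data$Embarked[!missing_port])))
  data$Embarked[missing_port] <- top_port

  # Categorical fields become factors.
  data$Sex <- factor(data$Sex)
  data$Embarked <- factor(data$Embarked)
  data
}

# Invented rows, only to show the function at work.
toy <- data.frame(
  Survived = c(0, 1, 1),
  Pclass   = c(3, 1, 3),
  Name     = c("A", "B", "C"),
  Sex      = c("male", "female", "female"),
  Age      = c(22, NA, 26),
  SibSp    = c(1, 1, 0),
  Parch    = c(0, 0, 0),
  Ticket   = c("T1", "T2", "T3"),
  Fare     = c(7.25, NA, 7.75),
  Cabin    = c("", "C85", ""),
  Embarked = c("S", "", "S"),
  stringsAsFactors = FALSE
)
str(CleanData(toy))
```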

Now the data is ready for the randomForest algorithm. We have to define the following parameters:

  • x, formula = a data frame or a matrix of predictors, or a formula describing the model to be fitted (for the print method, a randomForest object). In our case, we use the training data set after applying the CleanData function.
  • y = a response vector. If it’s a factor, classification is assumed; otherwise regression is assumed. If the response vector is omitted, randomForest runs in unsupervised mode. In our case, we use the Survived field as the response.
  • ntree = number of trees to grow. This should not be a small number, to ensure that every input row gets predicted at least a few times. In our case, we use 100 trees to train the model.
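
As a sketch of how these three parameters fit together, the call below trains a classifier with 100 trees. Because the real CSV is not bundled with this text, the predictors and labels are random stand-ins (so the fitted model is meaningless), and `train_clean`, `survived` and `rf` are hypothetical names:

```r
library(randomForest)

# Random stand-in for the cleaned training data (NOT the real Titanic
# data); the column names mirror the fields discussed in the text.
set.seed(42)
n <- 200
train_clean <- data.frame(
  Pclass   = sample(1:3, n, replace = TRUE),
  Sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age      = round(runif(n, 1, 80)),
  SibSp    = sample(0:3, n, replace = TRUE),
  Fare     = round(runif(n, 5, 250), 2),
  Embarked = factor(sample(c("C", "Q", "S"), n, replace = TRUE))
)

# The response must be a factor so that classification is assumed.
survived <- factor(sample(0:1, n, replace = TRUE))

# x = predictors, y = response, ntree = number of trees to grow.
rf <- randomForest(x = train_clean, y = survived, ntree = 100)

# Printing the object summarises the fitted model.
print(rf)
```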

The result is a randomForest object that we can explore. Printing it shows the main information about the model we have just created.

Now, we are going to check the confusion matrix on the test set:
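
A confusion matrix is simply a cross-tabulation of predicted against actual classes. The sketch below uses eight invented labels in place of real predictions; with a fitted model, the predicted vector would come from a call such as `predict(rf, newdata = test_clean)`, where `rf` and `test_clean` are hypothetical names for the model and a cleaned test set with known outcomes:

```r
# Invented labels for eight held-out passengers (0 = died, 1 = survived).
actual    <- factor(c(0, 0, 1, 1, 0, 1, 0, 1), levels = c(0, 1))
predicted <- factor(c(0, 1, 1, 1, 0, 0, 0, 1), levels = c(0, 1))

# Rows are the predictions, columns are the true outcomes.
conf_matrix <- table(Predicted = predicted, Actual = actual)
print(conf_matrix)
```

The diagonal cells count correct predictions; the off-diagonal cells count the two kinds of error.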

Finally, let’s compute the accuracy of the model to get an idea of how good it is:
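
Accuracy is the share of correct predictions, that is, the sum of the diagonal of the confusion matrix divided by the total count. A small sketch, where `accuracy` is a hypothetical helper and the counts in `toy_matrix` are invented and do not reproduce this article's result:

```r
# Fraction of predictions that fall on the diagonal (correct ones).
accuracy <- function(conf_matrix) {
  sum(diag(conf_matrix)) / sum(conf_matrix)
}

# Invented counts: 3 + 5 correct predictions out of 10.
toy_matrix <- matrix(
  c(3, 1, 1, 5),
  nrow = 2,
  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1"))
)
accuracy(toy_matrix)  # 0.8
```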

In our run, the model reached an accuracy of 83%.

Conclusion

At this point, we have to decide whether that figure meets our initial goals or whether we need to go back to the data preparation phase.
We would probably have to consider new input variables, new transformations of the data, or other machine learning models instead of Random Forest that could improve the metrics of the final model.
