I have used random forests quite often in my research. Recently I have been thinking about a small change to random forests that I call "Half Random Forest". Instead of choosing a new set of candidate variables at each split, the variables are chosen only once per tree. The main advantage I see in this approach is that each tree relies on only a few variables, which helps when some of the variables are missing.

How it works

During training you pick N features for each tree. The training data is then filtered so that only items with values for all of the selected features remain. The selected features are stored with each tree, and the tree is trained on the filtered data.
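
To make the preparation step concrete, here is a minimal Python sketch. It is not the code from my repository, just an illustration under a few assumptions: missing values are represented as None, features are drawn uniformly at random, and the names pick_features, filter_complete_rows and prepare_trees are made up for this example. The actual tree learner is left out; each returned "plan" holds the data you would hand to whatever decision tree implementation you use.

```python
import random


def pick_features(n_total, n_per_tree, rng):
    """Choose the feature indices that one tree is allowed to use."""
    return sorted(rng.sample(range(n_total), n_per_tree))


def filter_complete_rows(rows, labels, features):
    """Keep only the rows that have a value (not None) for every selected feature."""
    kept = [
        (row, label)
        for row, label in zip(rows, labels)
        if all(row[f] is not None for f in features)
    ]
    return [row for row, _ in kept], [label for _, label in kept]


def prepare_trees(rows, labels, n_trees, n_per_tree, seed=0):
    """Build one training plan per tree: its features plus its filtered data."""
    rng = random.Random(seed)
    n_total = len(rows[0])
    plans = []
    for _ in range(n_trees):
        features = pick_features(n_total, n_per_tree, rng)
        sub_rows, sub_labels = filter_complete_rows(rows, labels, features)
        # Project each remaining row onto the selected features only.
        projected = [[row[f] for f in features] for row in sub_rows]
        plans.append({"features": features, "X": projected, "y": sub_labels})
    return plans
```

Each plan's "features" list has to be stored together with the trained tree, because it is needed again at prediction time to decide whether the tree can be used for a given item.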

To classify a new item, you then decide based only on the trees that don't rely on any of its missing values.
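
The prediction step can be sketched in a similar way. Again this is only an illustration with assumptions of my own: each tree is represented as a dict with a "features" list and a "predict" callable (both names are hypothetical), missing values are None, and the remaining trees are combined by a simple majority vote.

```python
from collections import Counter


def usable_trees(trees, item):
    """Return the trees whose features are all present (not None) in the item."""
    return [
        tree for tree in trees
        if all(item[f] is not None for f in tree["features"])
    ]


def predict(trees, item):
    """Majority vote over the usable trees; None if no tree can be used."""
    votes = Counter(
        tree["predict"]([item[f] for f in tree["features"]])
        for tree in usable_trees(trees, item)
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Returning None when no tree is usable is just one possible choice; you could also fall back to the most frequent class in the training data.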

The procedure leads to more correlated trees, and of course if too many values are missing the decision is based on only a few trees (or even none). For typical use cases enough trees should remain, though, and you can always train more trees.
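
To get a feeling for how many trees remain, you can do a quick back-of-the-envelope calculation. The sketch below assumes that each tree's N features are drawn uniformly at random without replacement from M features in total (the function name usable_fraction is made up for this example). Under that assumption, a tree is usable for an item with m missing values exactly when all N of its features fall among the M - m present ones.

```python
from math import comb


def usable_fraction(n_features_total, n_missing, n_per_tree):
    """Probability that a randomly drawn tree avoids all missing features.

    Assumes the tree's features were drawn uniformly without replacement.
    """
    present = n_features_total - n_missing
    # comb(present, n_per_tree) is 0 when fewer features are present than needed.
    return comb(present, n_per_tree) / comb(n_features_total, n_per_tree)


# Example: 10 features in total, 2 of them missing, 3 features per tree.
fraction = usable_fraction(10, 2, 3)
expected_trees = 100 * fraction  # expected number of usable trees out of 100
```

With 10 features, 2 missing values and 3 features per tree, the fraction is 56/120, so a bit less than half of the trees would be expected to vote. The fraction drops quickly as N grows, which is one reason to keep the number of features per tree small.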


The first results I got when I experimented with a simple implementation were quite promising. Sadly, when I tested my approach against a normal Random Forest in which I replaced the missing values with the mean, the Random Forest outperformed my approach. So it seems my approach does not work that well. However, when I find the time I would still like to run tests with a second data set.

The implementation is available on GitHub here if anyone wants to try it out.