'Random forest is worse than linear regression? It it normal and what is the reason?
I am trying to use machine learning to predict a dataset. It is a regression problem with 180 input features and 1 continuously-valued output. I try to compare deep neural networks, random forest regression, and linear regression.
As I expect, 3-hidden-layer deep neural networks outperform other two approaches with a root mean square error (RMSE) of 0.1. However, I unexpected to see that random forest even performs worse than linear regression (RMSE 0.29 vs. 0.27). In my expectation, the random forest can discover more complex dependencies between features to decrease error. I have tried to tune the parameters of random forest (number of trees, maximum features, max_depth, etc.). I also tried different K-cross validation, but the performance is still less than linear regression.
I searched online, and one answer says linear regression may perform better if features have a smooth, nearly linear dependence on the covariates. I do not fully get the point because if that is the case, should not deep neural networks give much performance gain?
I am struggling to give an explanation. Under what situation, random forest is worse than linear regression, but deep neural networks can perform much better?
Solution 1:[1]
If your features explain linear relation to the target variable then a Linear Model usually performs well than a Random Forest Model. It totally depends on the linear relations between your features.
That said, Linear models are not superior or the Random Forest is any inferior one.
Try scaling and transforming the data using MinMaxScaler()
from scikit-learn
to see if the linear model improves further
Pro Tips
If linear model is working like a charm you need to ask your self Why? and How? And get into the basics of both the models to understand why it worked on your data. These questions will lead you to feature engineer better. And as a matter of fact, Kaggle Grand Masters do use Linear Models in stacking to get that top 1% score by capturing the linear relations in the dataset.
So at the end of the day, linear models could wonders too.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |