Python sklearn RandomForestClassifier non-reproducible results
I've been using sklearn's random forest and tried to compare several models. Then I noticed that the random forest gives different results even with the same seed. I tried it both ways: random.seed(1234) as well as the forest's built-in random_state=1234 parameter. In both cases I get non-repeatable results. What have I missed?
# 1
random.seed(1234)
RandomForestClassifier(max_depth=5, max_features=5, criterion='gini', min_samples_leaf=10)
# or 2
RandomForestClassifier(max_depth=5, max_features=5, criterion='gini', min_samples_leaf=10, random_state=1234)
Any ideas? Thanks!!
EDIT: Adding a more complete version of my code
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

seed = 1234
clf = RandomForestClassifier(max_depth=60, max_features=60,
                             criterion='entropy',
                             min_samples_leaf=3, random_state=seed)
# As described, I tried random_state in several ways; still different results
clf = clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
predicted_prob = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(np.array(y_test), predicted_prob)
auc = metrics.auc(fpr, tpr)
print(auc)
EDIT: It's been quite a while, but I think using a RandomState instance might solve the problem. I haven't tested it yet myself, but if you're reading this, it's worth a shot. Also, it is generally preferable to use np.random.RandomState instead of random.seed().
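A minimal sketch of that suggestion (not tested in the original post): two estimators given freshly seeded np.random.RandomState instances with the same seed should train identically. The make_classification data here is a synthetic stand-in for the real training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real training data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Two fresh RandomState instances seeded identically produce the same
# random stream, hence identical forests
clf_a = RandomForestClassifier(max_depth=5, random_state=np.random.RandomState(1234))
clf_b = RandomForestClassifier(max_depth=5, random_state=np.random.RandomState(1234))
clf_a.fit(X, y)
clf_b.fit(X, y)
print(all(clf_a.predict(X) == clf_b.predict(X)))  # True
```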
Solution 1:[1]
First make sure that you have the latest versions of the needed modules (e.g. scipy, numpy, etc.). When you type random.seed(1234) (under from numpy import *), you use the numpy generator. When you use the random_state parameter of RandomForestClassifier, there are several options: int, RandomState instance, or None.
From the scikit-learn docs:
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used by np.random.
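To illustrate two of those options (a sketch, again with make_classification as stand-in data): an int seed is stored on the estimator and is reproducible by itself, while None falls back to the global np.random singleton, so reseeding that singleton before each fit also reproduces results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# int: the seed lives on the estimator, reproducible on its own
a = RandomForestClassifier(random_state=42).fit(X, y).predict(X)
b = RandomForestClassifier(random_state=42).fit(X, y).predict(X)
print((a == b).all())  # True

# None: the global np.random generator is used, so it must be
# reseeded before every fit to get the same draws
np.random.seed(1234)
c = RandomForestClassifier(random_state=None).fit(X, y).predict(X)
np.random.seed(1234)
d = RandomForestClassifier(random_state=None).fit(X, y).predict(X)
print((c == d).all())  # True
```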
A way to use the same generator in both cases is the following; it gives reproducible results (the same in both cases).
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from numpy import *

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
random.seed(1234)
clf = RandomForestClassifier(max_depth=2)
clf.fit(X, y)
# random.seed(1234) returns None, so random_state falls back to the
# (just reseeded) global numpy generator -- the same one clf used
clf2 = RandomForestClassifier(max_depth=2, random_state=random.seed(1234))
clf2.fit(X, y)
Check if the results are the same:
all(clf.predict(X) == clf2.predict(X))
#True
Check after running the same code five times:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from numpy import *

for i in range(5):
    X, y = make_classification(n_samples=1000, n_features=4,
                               n_informative=2, n_redundant=0,
                               random_state=0, shuffle=False)
    random.seed(1234)
    clf = RandomForestClassifier(max_depth=2)
    clf.fit(X, y)
    clf2 = RandomForestClassifier(max_depth=2, random_state=random.seed(1234))
    clf2.fit(X, y)
    print(all(clf.predict(X) == clf2.predict(X)))
Results:
True
True
True
True
True
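One caveat worth noting (my addition, not part of the original answer): a RandomState instance is mutable, and the estimator draws from it during fit. Reusing the same instance for two fits consumes the stream, so the second model generally differs; create a fresh instance (or reseed) per fit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Sharing one instance: the second fit starts from an advanced stream,
# so m2 need not match m1
shared = np.random.RandomState(1234)
m1 = RandomForestClassifier(max_depth=2, random_state=shared).fit(X, y)
m2 = RandomForestClassifier(max_depth=2, random_state=shared).fit(X, y)

# Fresh instances per fit: identical streams, identical models
fresh1 = RandomForestClassifier(max_depth=2, random_state=np.random.RandomState(1234)).fit(X, y)
fresh2 = RandomForestClassifier(max_depth=2, random_state=np.random.RandomState(1234)).fit(X, y)
print(all(fresh1.predict(X) == fresh2.predict(X)))  # True
```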
Solution 2:[2]
OK, what solved it eventually was reinstalling the conda environment. I'm still not sure why the different results happened. Thanks!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |
Solution 2 | Ruslan