Why doesn't the sum of "value" equal the number of "samples" in scikit-learn RandomForestClassifier?
I built a random forest with RandomForestClassifier and plotted its decision trees. What does the field "value" (pointed to by the red arrows) mean? And why doesn't the sum of the two numbers in the brackets equal the number of "samples"? In other examples I have seen, the sum of the two numbers in the brackets does equal the number of "samples". Why not in my case?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("Dataset.csv")
df.drop(['Flow ID', 'Inbound'], axis=1, inplace=True)
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)
# use .loc to avoid pandas chained-assignment warnings
df.loc[df.Label == 'BENIGN', 'Label'] = 0
df.loc[df.Label == 'DrDoS_LDAP', 'Label'] = 1
Y = df["Label"].values.astype('int')
X = df.drop(labels=["Label"], axis=1)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5)
model = RandomForestClassifier(n_estimators=20)
model.fit(X_train, Y_train)
Accuracy = model.score(X_test, Y_test)

for i in range(len(model.estimators_)):
    fig = plt.figure(figsize=(15, 15))
    # feature_names should match the columns of X (Label was dropped)
    tree.plot_tree(model.estimators_[i], feature_names=X.columns,
                   class_names=['Benign', 'DDoS'])
    plt.savefig('.\\TheForest\\T' + str(i))
Solution 1
Nice catch.
Although undocumented, this is due to the bootstrap sampling taking place by default in a Random Forest model (see my answer in Why is Random Forest with a single tree much better than a Decision Tree classifier? for more on the RF algorithm details and its difference from a mere "bunch" of decision trees).
Let's see an example with the iris data:
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
iris = load_iris()
rf = RandomForestClassifier(max_depth = 3)
rf.fit(iris.data, iris.target)
tree.plot_tree(rf.estimators_[0]) # take the first tree
The result here is similar to what you report: for every node except the lower-right one, sum(value) does not equal samples, as would be the case for a "simple" decision tree.
A cautious observer would have noticed something else which seems odd here: while the iris dataset has 150 samples:
print(iris.DESCR)
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
and the base node of the tree should include all of them, the samples shown for the first node are only 89.
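That smaller count is exactly what bootstrap arithmetic predicts. A quick back-of-the-envelope sketch (plain Python, no scikit-learn involved):

```python
# Probability that a given row is drawn at least once when sampling
# n rows with replacement from n rows (the bootstrap):
n = 150
p_in = 1 - (1 - 1/n) ** n   # tends to 1 - 1/e ≈ 0.632 for large n

print(round(p_in, 3))       # 0.633
print(round(p_in * n))      # 95 expected unique rows, so 89 is plausible
```

So on average a bootstrap sample of the iris data contains roughly 95 distinct rows; a root node showing 89 is well within normal variation.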
Why is that, and what exactly is going on here? To see, let us fit a second RF model, this time without bootstrap sampling (i.e. with bootstrap=False):
rf2 = RandomForestClassifier(max_depth = 3, bootstrap=False) # no bootstrap sampling
rf2.fit(iris.data, iris.target)
tree.plot_tree(rf2.estimators_[0]) # take again the first tree
Well, now that we have disabled bootstrap sampling, everything looks "nice": the sum of value in every node equals samples, and the base node indeed contains the whole dataset (150 samples).
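This can also be confirmed programmatically, without eyeballing the plots, by reading n_node_samples for the root node of a fitted tree (a small sketch; the random_state values are assumptions added here for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# Default bootstrap=True: the root node counts only the unique bootstrap rows
rf = RandomForestClassifier(max_depth=3, random_state=0)
rf.fit(iris.data, iris.target)
root_boot = rf.estimators_[0].tree_.n_node_samples[0]

# bootstrap=False: every tree sees the full dataset
rf2 = RandomForestClassifier(max_depth=3, bootstrap=False, random_state=0)
rf2.fit(iris.data, iris.target)
root_full = rf2.estimators_[0].tree_.n_node_samples[0]

print(root_boot)  # fewer than 150
print(root_full)  # 150
```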
So, the behavior you describe is indeed due to bootstrap sampling: because it draws samples with replacement, each individual decision tree of the ensemble ends up with duplicate samples. These duplicates are not reflected in the samples value of the tree nodes, which displays the number of unique samples; they are, however, reflected in the node value.
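The same bookkeeping can be mimicked with plain NumPy, independent of scikit-learn internals (a sketch; the seed is an arbitrary assumption): over 150 iris rows drawn with replacement, the per-class histogram of all draws (what value reflects in the plots above) sums to 150, while the unique-row count (what samples reports) is smaller.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
n = len(iris.target)                   # 150

rng = np.random.default_rng(42)
boot_idx = rng.integers(0, n, size=n)  # bootstrap: n draws with replacement

# "samples": unique rows reaching the (root) node
n_unique = np.unique(boot_idx).size

# "value": per-class counts over *all* draws, duplicates included
value = np.bincount(iris.target[boot_idx], minlength=3)

print(n_unique)     # fewer than 150
print(value.sum())  # 150: value counts duplicates, samples does not
```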
The situation is completely analogous for an RF regression model, as well as for a bagging classifier.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow