How to use warm_start

I'd like to use the warm_start parameter to add training data to my random forest classifier. I expected it to be used like this:

clf = RandomForestClassifier(...)
clf.fit(get_data())
clf.fit(get_more_data(), warm_start=True)

But the warm_start parameter is a constructor parameter. So do I do something like this?

clf = RandomForestClassifier()
clf.fit(get_data())
clf = RandomForestClassifier(warm_start=True)
clf.fit(get_more_data())

That makes no sense to me. Won't the new call to the constructor discard previous training data? I think I'm missing something.



Solution 1:[1]

The basic pattern (taken from Miriam's answer):

clf = RandomForestClassifier(warm_start=True)
clf.fit(get_data())
clf.fit(get_more_data())

would be the correct usage API-wise.

But there is an issue here.

The docs say the following:

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.

This means that the only thing warm_start can do for you is add new decision trees. All the previous trees seem to be left untouched!

Let's check this with some sources:

n_more_estimators = self.n_estimators - len(self.estimators_)

if n_more_estimators < 0:
    raise ValueError('n_estimators=%d must be larger or equal to '
                     'len(estimators_)=%d when warm_start==True'
                     % (self.n_estimators, len(self.estimators_)))

elif n_more_estimators == 0:
    warn("Warm-start fitting without increasing n_estimators does not "
         "fit new trees.")

This tells us that you need to increase the number of estimators before calling fit again!
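The behaviour described by that source excerpt can be checked with a minimal sketch like the one below. The synthetic dataset and the variable names are assumptions made for illustration; only RandomForestClassifier, n_estimators, warm_start and estimators_ come from the discussion above.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=5, warm_start=True, random_state=0)
clf.fit(X, y)
n_after_first_fit = len(clf.estimators_)

# Second fit without raising n_estimators: a warning is issued, no trees are added.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clf.fit(X, y)
n_after_second_fit = len(clf.estimators_)

# Raising n_estimators first lets the next fit train the extra trees.
clf.n_estimators += 5
clf.fit(X, y)
n_after_third_fit = len(clf.estimators_)

print(n_after_first_fit, n_after_second_fit, n_after_third_fit)
print([str(w.message) for w in caught])
```

The tree count should only change on the third call, after n_estimators has been raised.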

I have no idea what kind of usage sklearn expects here. I'm not sure whether fitting, increasing internal variables, and fitting again is correct usage, but I somehow doubt it (especially as n_estimators is not a public class variable).

Your basic approach (with this library and this classifier) is probably not a good idea for out-of-core learning. I would not pursue this further.

Solution 2:[2]

Just to add to @sascha's excellent answer, this hacky method works:

rf = RandomForestClassifier(n_estimators=1, warm_start=True)
rf.fit(X_train, y_train)
rf.n_estimators += 1
rf.fit(X_train, y_train)
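To see that the earlier trees really are kept as-is by this approach, one can hold on to references to them and compare after the second fit. This is a small sketch on synthetic data; the dataset and the names old_trees and same_objects are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

rf = RandomForestClassifier(n_estimators=3, warm_start=True, random_state=0)
rf.fit(X, y)
old_trees = list(rf.estimators_)  # references to the first three trees

rf.n_estimators += 2
rf.fit(X, y)

# zip stops after the three old trees; each should be the identical object.
same_objects = all(a is b for a, b in zip(old_trees, rf.estimators_))
print(len(rf.estimators_), same_objects)
```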

Solution 3:[3]

from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

### RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10, warm_start=True)
# iris is sorted by class, so take every third sample to keep all classes in each batch
rfc.fit(X[0::3], y[0::3])
print(rfc.score(X, y))
rfc.n_estimators += 10
rfc.fit(X[1::3], y[1::3])
print(rfc.score(X, y))
rfc.n_estimators += 10
rfc.fit(X[2::3], y[2::3])
print(rfc.score(X, y))

The difference between warm_start and partial_fit is as follows.

When fitting an estimator repeatedly on the same dataset, but for multiple parameter values (such as to find the value maximizing performance as in grid search), it may be possible to reuse aspects of the model learnt from the previous parameter value, saving time. When warm_start is true, the existing fitted model attributes are used to initialise the new model in a subsequent call to fit. Note that this is only applicable for some models and some parameters, and even some orders of parameter values. For example, warm_start may be used when building random forests to add more trees to the forest (increasing n_estimators) but not to reduce their number.

partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and model parameters stay fixed.

There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For classification, all data in a sequence of warm_start calls to fit must include samples from each class.
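For contrast, here is a minimal sketch of the partial_fit style, where the data changes between calls and the model parameters stay fixed. RandomForestClassifier has no partial_fit, so this uses SGDClassifier as a stand-in; the synthetic dataset and the batch size of 100 are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
classes = np.unique(y)

sgd = SGDClassifier(random_state=0)
for start in range(0, len(X), 100):
    batch = slice(start, start + 100)
    # The full set of classes is required on the first call; repeating it is allowed.
    sgd.partial_fit(X[batch], y[batch], classes=classes)

print(sgd.score(X, y))
```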

Solution 4:[4]

All warm_start does boils down to preserving the state of the previous training run.

It differs from partial_fit in that the idea is not to learn incrementally on small batches of data, but rather to reuse a trained model in its previous state. Namely, the difference between a regular call to fit and one with warm_start=True is that the estimator state is not cleared (see _clear_state):

if not self.warm_start:
    self._clear_state()

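which, among other attributes, resets the estimators (this excerpt appears to come from the gradient boosting implementation; the forest code resets its list of trees in a similar way):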

if hasattr(self, 'estimators_'):
    self.estimators_ = np.empty((0, 0), dtype=np.object)

So with warm_start=True, each subsequent call to fit does not reinitialize the trainable parameters; instead it starts from their previous state and adds new estimators to the model.

This means that one could do:

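from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid1 = {'bootstrap': [True, False],
         'max_depth': [10, 20, 30, 40, 50, 60],
         # 'auto' (an old alias of 'sqrt' for classifiers) is not accepted by recent releases
         'max_features': ['sqrt', 'log2'],
         'min_samples_leaf': [1, 2, 4],
         'min_samples_split': [2, 5, 10]}

# GridSearchCV tries every combination in param_grid
rf_grid_search1 = GridSearchCV(estimator=RandomForestClassifier(),
                               param_grid=grid1,
                               cv=3)
rf_grid_search1.fit(X_train, y_train)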

Then fit a model on the best parameters and set warm_start=True:

rf = RandomForestClassifier(**rf_grid_search1.best_params_, warm_start=True)
rf.fit(X_train, y_train)

Then we could run a grid search over, say, n_estimators only:

grid2 = {'n_estimators': [200, 400, 600, 800, 1000]}
rf_grid_search2 = GridSearchCV(estimator=rf,
                               param_grid=grid2,
                               cv=3)
rf_grid_search2.fit(X_train, y_train)

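The intended advantage is that the estimators are already fit with the previous parameter setting, so each subsequent call to fit starts from the previous state and we only analyze whether adding new estimators benefits the model. Note, however, that GridSearchCV clones the estimator before fitting each candidate, so the trees fitted above are not carried into the search; to actually reuse them, increase n_estimators on rf itself and call fit again.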
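A manual loop is one way to make sure the already-fitted trees are reused when comparing values of n_estimators, because the same estimator object is refit each time. The sketch below is an illustration under assumptions: the synthetic data, the hold-out split, the candidate values and the names scores and best_n are not from the answer above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data and a hold-out split, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(warm_start=True, random_state=0)
scores = {}
for n in [50, 100, 150, 200]:
    rf.set_params(n_estimators=n)
    rf.fit(X_train, y_train)  # trains only the trees not yet in the forest
    scores[n] = rf.score(X_val, y_val)

best_n = max(scores, key=scores.get)
print(scores, best_n)
```

Each pass trains only the additional trees rather than a whole new forest, which is where the time saving comes from. The candidate values must be visited in increasing order, since warm_start cannot shrink the forest.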

Solution 5:[5]

As @sascha pointed out, the previously fitted trees are untouched, and you need to add new estimators before calling fit again. He seemed unsure whether n_estimators should be changed directly. The API provides a function called set_params() which allows this. Here's how I've done it in the past:

import random

# INPUT, OUTPUT (parallel sequences) and regressor (an unfitted forest) are assumed to exist already
# get either 80% of the data or 1300 samples, whichever is smaller
n_samples = min(int(len(INPUT) * 0.80), 1300)
training_data = random.sample(list(zip(INPUT, OUTPUT)), n_samples)
# re-split our random sample of tuples into 2 lists
train_in = [i for i, _ in training_data]
train_out = [o for _, o in training_data]
# first fit
regressor.fit(train_in, train_out)
# get current estimators times a randomly chosen factor between 1.1 and 1.6 (there is a tidier way to write this)
n_current = len(regressor.estimators_)
est = int(n_current * random.choice([1.1, 1.3, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6, 1.1, 1.11, 1.13, 1.1, 1.11, 1.13]))
print('Planting additional trees...', est - n_current)
regressor = regressor.set_params(n_estimators=est, warm_start=True)
# second fit: only the new trees are trained
regressor.fit(train_in, train_out)
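For readers who want to try the set_params() approach without the surrounding variables, here is a self-contained sketch. The synthetic regression data, the starting size of 20 trees and the fixed growth factor of 1.5 are assumptions chosen for illustration, not part of the answer above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=200, n_features=6, random_state=0)

regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X, y)  # first fit

# Ask for 50% more trees and switch warm_start on before refitting.
n_current = len(regressor.estimators_)
n_target = int(n_current * 1.5)
regressor.set_params(n_estimators=n_target, warm_start=True)
regressor.fit(X, y)  # only the additional trees are trained

print(n_current, len(regressor.estimators_))
```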

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Sergey Makarevich
Solution 3 manish Prasad
Solution 4
Solution 5 dominic muscatella