Python scikit-learn pipelines (no transformation on features)

I am running different machine learning models on my data set. I am using sklearn pipelines to try different transforms on the numeric features to evaluate if one transformation gives better results. The basic structure I am using is simple:

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[('stdscaler', StandardScaler()), ('clf', RandomForestClassifier())])

pipe.fit(X_train, y_train)

I am trying a bunch of transformations, but I also want to test the scenario where no transformation is performed on the numeric feature set (i.e. the features are used as is). Is there a way to include that within the pipeline? Something like:

pipe = Pipeline(steps=[('do nothing', do_nothing()), ('clf', RandomForestClassifier())])


Solution 1:[1]

Yes, you can simply do

pipe = Pipeline(steps=[('clf', RandomForestClassifier())])
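
If you would rather keep a named step in place (for instance, to swap transformers in and out during a grid search), scikit-learn also accepts the string 'passthrough' as a pipeline step, which leaves the features untouched. The following is a minimal sketch, assuming scikit-learn 0.21 or later; the synthetic data set and the step name "scaler" are stand-ins chosen for illustration, not part of the original question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the real training data (an assumption for this sketch).
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline(steps=[
    ("scaler", "passthrough"),  # placeholder step: features are used as is
    ("clf", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# Treat the transformer itself as a hyperparameter, including "no transform".
param_grid = {"scaler": [StandardScaler(), MinMaxScaler(), "passthrough"]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)
```

This way the "do nothing" case is evaluated under the same cross-validation splits as the other candidates, rather than in a separate pipeline.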

Alternatively, if there is a custom base transformation that you almost always want, and it has its own hyperparameters or extra functionality, you could write a small transformer class along these lines (a somewhat contrived example, just to give the idea):

from sklearn.base import TransformerMixin

class Transform(TransformerMixin):
    def __init__(self, **kwargs):
        print(kwargs)
        self.hyperparam = kwargs

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.hyperparam.get("square"):  # .get() avoids a KeyError when no kwargs were passed
            X = [x**2 for x in X]
        # elif "other_transform" in self.hyperparam:
            # pass_val = self.hyperparam['other_transform']
            # X = other_transform(X, pass_val)
        return X  # default to no transform if no hyperparameter is provided as argument of Transform()

pass_pipe = Pipeline(steps=[('do nothing', Transform()),
                            ('clf', RandomForestClassifier())])
square_pipe = Pipeline(steps=[('square', Transform(square=True)),
                              ('clf', RandomForestClassifier())])
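
If the transformer is going to be used with tools that clone estimators (such as GridSearchCV or cross_val_score), scikit-learn's conventions call for explicit keyword parameters in __init__ and for inheriting from BaseEstimator, so that get_params and set_params work. Here is a sketch under those assumptions; the class name OptionalSquare and the toy array are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class OptionalSquare(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: squares the features if square=True, else does nothing."""

    def __init__(self, square=False):
        # Store the parameter unchanged under the same name, as get_params expects.
        self.square = square

    def fit(self, X, y=None):
        # Nothing to learn from the data.
        return self

    def transform(self, X):
        X = np.asarray(X)
        return X ** 2 if self.square else X


X = np.array([[1.0, -2.0], [3.0, 4.0]])
print(OptionalSquare().fit_transform(X))             # features unchanged
print(OptionalSquare(square=True).fit_transform(X))  # element-wise squares
print(clone(OptionalSquare(square=True)).get_params())
```

Because the parameter is explicit, it can also be tuned from a pipeline with a grid key of the form "stepname__square".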

The approach above makes the transforms mutually exclusive: each pipeline applies one or the other. If you instead had several transforms and wanted to apply them in a particular order, implementing callbacks would probably be the right way to go. For reference, look at how that kind of thing is implemented in popular libraries such as scikit-learn, PyTorch, or fastai.
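
Within scikit-learn itself, one simple way to apply several transforms in a fixed order is to list them as consecutive pipeline steps, since a Pipeline runs its steps in sequence. The sketch below is only an illustration: the choice of np.square followed by StandardScaler and the synthetic data are assumptions, and slicing a pipeline requires scikit-learn 0.21 or later.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-in for the real training data (an assumption for this sketch).
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)

ordered_pipe = Pipeline(steps=[
    ("square", FunctionTransformer(np.square)),  # applied first
    ("stdscaler", StandardScaler()),             # applied to the squared features
    ("clf", RandomForestClassifier(n_estimators=20, random_state=0)),
])
ordered_pipe.fit(X_train, y_train)

# Every step before the final estimator can also be applied on its own.
X_transformed = ordered_pipe[:-1].transform(X_train)
print(X_transformed.shape)
```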

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 queise