Get intermediate data state in scikit-learn Pipeline
Given the following example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline
import pandas as pd
pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    ("nmf", NMF())
])
data = pd.DataFrame([["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]]).T
data.columns = ["test"]
pipe.fit_transform(data.test)
I would like to get the intermediate data state in a scikit-learn pipeline corresponding to the tf_idf output (after fit_transform on tf_idf but before NMF), i.e. the NMF input. In other words, the result should be the same as applying
TfidfVectorizer().fit_transform(data.test)
I know that pipe.named_steps["tf_idf"] returns the intermediate transformer, but with this method I only get the parameters of the transformer, not the data.
Solution 1:[1]
As @Vivek Kumar suggested in the comment and as I answered here, I find a debug step that prints information or writes intermediate dataframes to csv useful:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator
class Debug(BaseEstimator, TransformerMixin):
    def transform(self, X):
        print(X.shape)
        self.shape = X.shape  # keep the shape so it can be inspected later
        # what other output you want
        return X

    def fit(self, X, y=None, **fit_params):
        return self
pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    ("debug", Debug()),
    ("nmf", NMF())
])
data = pd.DataFrame([["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]]).T
data.columns = ["test"]
pipe.fit_transform(data.test)
Edit
I have now added state to the debug transformer. You can access the shape, as in the answer by @datasailor, with:
pipe.named_steps["debug"].shape
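If the shape alone is not enough, the same idea can be extended to keep the intermediate matrix itself. The following is only a sketch: the class name `KeepData`, the attribute `data_`, and the NMF settings (`n_components=2`, `random_state=0`) are assumptions made for illustration, not part of the original answer or of scikit-learn.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class KeepData(BaseEstimator, TransformerMixin):
    """Hypothetical pass-through step that remembers the last data it saw."""

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        self.data_ = X  # store a reference to the intermediate data
        return X


data = pd.DataFrame(
    {"test": ["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]}
)

pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    ("keep", KeepData()),
    ("nmf", NMF(n_components=2, random_state=0)),  # assumed settings
])
pipe.fit_transform(data.test)

# Sparse tf-idf matrix exactly as it was handed to NMF
tfidf_out = pipe.named_steps["keep"].data_
print(tfidf_out.shape)
```

Note that `data_` is overwritten on every call to `transform`, so it always reflects the most recent data that passed through the pipeline.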
Solution 2:[2]
As far as I understand, you want to get the transformed training data. The transformer in pipe.named_steps["tf_idf"] has already been fitted, so just use this fitted model to transform the training data again:
pipe.named_steps["tf_idf"].transform(data.test)
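As a quick sanity check, the sketch below compares this with a standalone TfidfVectorizer on the example data from the question. The NMF settings (`n_components=2`, `random_state=0`) and the variable names are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

data = pd.DataFrame(
    {"test": ["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]}
)

pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    ("nmf", NMF(n_components=2, random_state=0)),  # assumed settings
])
pipe.fit(data.test)

# Re-apply the already fitted first step to the training data
tfidf_from_pipe = pipe.named_steps["tf_idf"].transform(data.test)

# Reference: a fresh vectorizer fitted outside the pipeline
tfidf_standalone = TfidfVectorizer().fit_transform(data.test)

same = np.allclose(tfidf_from_pipe.toarray(), tfidf_standalone.toarray())
print(tfidf_from_pipe.shape, same)
```

This costs one extra transform pass over the data, which is usually acceptable for inspection but may matter for large corpora.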
Solution 3:[3]
I've created a gist for this. Essentially, from Python 3.2 onward, a context manager lets you collect the intermediate results into a dict with the names of the pipeline transformers as keys.
with intermediate_transforms(pipe):
    Xt = pipe.transform(X)
    intermediate_results = pipe.intermediate_results__
This is accomplished via the function below, but see my gist for more documentation.
import contextlib
from functools import partial
from sklearn.pipeline import Pipeline

@contextlib.contextmanager
def intermediate_transforms(pipe: Pipeline):
    # Our temporary overload of Pipeline._transform() method.
    # https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/pipeline.py
    def _pipe_transform(self, X):
        Xt = X
        for _, name, transform in self._iter():
            Xt = transform.transform(Xt)
            self.intermediate_results__[name] = Xt
        return Xt

    if not isinstance(pipe, Pipeline):
        raise ValueError(f'"{pipe}" must be a Pipeline.')
    pipe.intermediate_results__ = {}
    _transform_before = pipe._transform
    pipe._transform = partial(_pipe_transform, pipe)  # Monkey-patch our _pipe_transform method.
    try:
        yield pipe  # Release our patched object to the context
    finally:
        # Restore, even if the body of the with-block raised
        pipe._transform = _transform_before
        delattr(pipe, 'intermediate_results__')
Solution 4:[4]
I'm not sure exactly what your use case is, but one simple solution is this:
# get feature values by transforming x for each step, except the classifier
# (the pipeline must already be fitted)
x_intermediate = data.test
for step in pipe.steps[:-1]:
    x_intermediate = step[1].transform(x_intermediate)
print(x_intermediate)  # input to the final step
Good luck-
Tony
Solution 5:[5]
Here's what I use:
def fit_transform_step(pipe, X, y=None, step_name=None):
    if step_name not in pipe.named_steps:
        raise ValueError(f"step not in Pipeline: {step_name}")
    Xt = X
    for k, v in pipe.steps:
        if v != 'passthrough':
            Xt = v.fit_transform(Xt, y)
        if k == step_name:
            break
    return Xt
Call it like this:
tf_idf_out = fit_transform_step(pipe, data.test, step_name='tf_idf')
Solution 6:[6]
Using slicing: model[:-1].transform(X), where model is the Pipeline object. Note that you need to call model.fit(X_train, y_train) on your pipeline object first.
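Applied to the example from the question, slicing might look like the sketch below. It assumes a scikit-learn version recent enough to support Pipeline slicing. The NMF settings (`n_components=2`, `random_state=0`) are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

data = pd.DataFrame(
    {"test": ["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]}
)

pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    ("nmf", NMF(n_components=2, random_state=0)),  # assumed settings
])
pipe.fit(data.test)  # fit first; the slice reuses the fitted steps

# Sub-pipeline with every step except the last one (here: only tf_idf)
nmf_input = pipe[:-1].transform(data.test)

# Same result as calling the fitted tf_idf step directly
direct = pipe.named_steps["tf_idf"].transform(data.test)
matches = np.allclose(nmf_input.toarray(), direct.toarray())
print(nmf_input.shape, matches)
```

The slice is itself a Pipeline that shares the fitted estimators with the original rather than copying them, so nothing is refitted.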
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | CodeZero |
Solution 3 | user394430 |
Solution 4 | |
Solution 5 | Hans Bouwmeester |
Solution 6 | cs_stackX |