FeatureUnion vs ColumnTransformer?

What is the difference between FeatureUnion() and ColumnTransformer() in sklearn?

Which should I use if I want to build a supervised model with features of mixed data types (categorical, numeric, unstructured text), where I need to combine separate pipelines?

source: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html

source: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html



Solution 1:[1]

According to the sklearn documentation:

FeatureUnion: Concatenates results of multiple transformer objects. This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.

ColumnTransformer: Applies transformers to columns of an array or pandas DataFrame. This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

So, FeatureUnion applies different transformers to the whole of the input data and then combines the results by concatenating them.
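As a rough illustration of that behaviour, here is a minimal sketch. The random data and the choice of PCA plus TruncatedSVD are assumptions made for the example only; the point is that both transformers receive all five input columns and their outputs are stacked side by side.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.pipeline import FeatureUnion

# Toy numeric data (assumed for illustration): 20 samples, 5 features.
rng = np.random.RandomState(0)
X = rng.rand(20, 5)

# Each transformer is fitted on the WHOLE of X.
union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("svd", TruncatedSVD(n_components=2)),
])

# The outputs are concatenated column-wise: 2 + 2 = 4 output columns.
X_union = union.fit_transform(X)
print(X_union.shape)
```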

ColumnTransformer, on the other hand, applies different transformers to different subsets of the whole input data, and again concatenates the results.
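A minimal sketch of this for mixed data types follows. The DataFrame and its column names (`age`, `city`, `review`) are made up for the example, and the particular transformers are just one reasonable choice.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with one numeric, one categorical and one text column.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Paris", "Lima", "Paris", "Oslo"],
    "review": ["good product", "bad service", "good service", "bad product"],
})

# Each transformer only sees the columns listed next to it.
ct = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    # A plain string (not a list) selects a 1-D column, which is what
    # text vectorizers expect as input.
    ("txt", TfidfVectorizer(), "review"),
])

# 1 scaled column + 3 one-hot columns + 4 TF-IDF terms.
X_ct = ct.fit_transform(df)
print(X_ct.shape)
```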

For the case you describe, ColumnTransformer should be the first step. Then, once all the columns have been converted to numeric features, you could use FeatureUnion to transform them further, e.g. by combining PCA and SelectKBest.
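One way this two-stage idea could look is sketched below. The toy data, the column names, the number of components, `k`, and the final LogisticRegression are all assumptions for illustration. `sparse_threshold=0` is set so that ColumnTransformer returns a dense array, since PCA has not always accepted sparse input.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data and binary target.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "city": ["Paris", "Lima", "Paris", "Oslo", "Lima", "Oslo"],
    "review": [
        "good product", "bad service", "good service",
        "bad product", "good good product", "bad bad service",
    ],
})
y = [1, 0, 1, 0, 1, 0]

# Step 1: per-column preprocessing, everything becomes numeric.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
        ("txt", TfidfVectorizer(), "review"),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Step 2: both branches see the full numeric matrix from step 1.
combine = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("kbest", SelectKBest(f_classif, k=3)),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("combine", combine),
    ("clf", LogisticRegression()),
])
model.fit(df, y)
pred = model.predict(df)
```

The classifier ends up with 2 PCA components plus 3 selected columns, i.e. 5 features per row.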

Finally, you could certainly use FeatureUnion in place of ColumnTransformer, but each branch would need a column/type selector that feeds only the columns of interest to the next transformer down the pipeline, as explained here: https://ramhiser.com/post/2018-04-16-building-scikit-learn-pipeline-with-pandas-dataframe/

However, ColumnTransformer does exactly that and in a simpler way.
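To make the comparison concrete, here is a small sketch of both approaches on the same made-up DataFrame. The `select_columns` helper is hypothetical (it is not part of sklearn); it wraps a FunctionTransformer that picks columns from a DataFrame, which is one possible way to write the selector step.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical data with one numeric and one categorical column.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Paris", "Lima", "Paris", "Oslo"],
})


def select_columns(cols):
    """Hypothetical helper: a transformer that keeps only `cols`."""
    return FunctionTransformer(lambda X: X[cols])


# FeatureUnion version: every branch starts with an explicit selector.
union = FeatureUnion([
    ("num", make_pipeline(select_columns(["age"]), StandardScaler())),
    ("cat", make_pipeline(select_columns(["city"]), OneHotEncoder())),
])

# ColumnTransformer version: the column selection is built in.
ct = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(), ["city"]),
])


def to_dense(X):
    # Either result may come back sparse, so normalise before comparing.
    return X.toarray() if sparse.issparse(X) else np.asarray(X)


X_union = to_dense(union.fit_transform(df))
X_ct = to_dense(ct.fit_transform(df))
same = np.allclose(X_union, X_ct)
print(X_union.shape, X_ct.shape, same)
```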

Solution 2:[2]

Both are used to combine independent transformers into a single transformer. By independent I mean transformers that don't need to be executed sequentially: they are executed in parallel, and the output of each one is merged at the end.

The main difference is that each transformer in a FeatureUnion object gets the whole dataset as input, while each transformer in a ColumnTransformer object gets only part of the data as input.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Community
Solution 2