sklearn LDA GridSearchCV with Pipeline

from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('reduce_dim', LinearDiscriminantAnalysis()),          # dimensionality reduction step
                 ('classify', LogisticRegression(solver='liblinear'))])  # 'liblinear' supports both l1 and l2 penalties
param_grid = [{'classify__penalty': ['l1', 'l2'],
               'classify__C': [0.05, 0.1, 0.3, 0.6, 0.8, 1.0]}]

gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='roc_auc', n_jobs=3)
gs.fit(data, label)

I have a question about using Pipeline with GridSearchCV. I first use LDA to reduce the dimensionality, and I want to understand the order of operations inside GridSearchCV with a Pipeline: is it split train/test -> LDA -> fit & predict, or LDA -> split train/test -> fit & predict?



Solution 1:[1]

Part 1

First of all, the Pipeline defines the steps that will be carried out, in order.

In your case, LinearDiscriminantAnalysis is applied first and LogisticRegression second.
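As a rough sketch (not the actual Pipeline source code) of what those two steps do when the pipeline is fitted and then used for prediction, assuming `data` and `label` are the arrays from the question and using one illustrative split:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# one illustrative split; `data` and `label` come from the question
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.2)

lda = LinearDiscriminantAnalysis()
clf = LogisticRegression(solver='liblinear')

# pipe.fit(X_train, y_train) roughly does:
X_train_red = lda.fit_transform(X_train, y_train)   # 'reduce_dim' step: fit + transform
clf.fit(X_train_red, y_train)                        # 'classify' step: fit on the reduced data

# pipe.predict(X_test) roughly does:
X_test_red = lda.transform(X_test)                   # transform only -- LDA is NOT refitted on test data
y_pred = clf.predict(X_test_red)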

Part 2

In

gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='roc_auc', n_jobs=3)

you have set cross-validation to cv=5.

This number defines the number of folds ((Stratified)KFold is used here, since the final step is a classifier): the data is automatically split 5 times into a training fold and a validation fold, and each time the whole analysis that the Pipeline defines is fitted on the training fold and evaluated on the validation fold. A per-fold sketch follows below.
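To make the order explicit, here is a simplified sketch of what GridSearchCV does internally for one parameter combination (assuming `data` and `label` are NumPy arrays with binary labels, and reusing `pipe` from the question); the real implementation differs in details but follows this pattern:

import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5)
fold_scores = []
for train_idx, test_idx in cv.split(data, label):                  # 1. split first
    fold_pipe = clone(pipe).set_params(classify__penalty='l2',     # 2. fresh pipeline with one
                                       classify__C=0.1)            #    candidate parameter setting
    fold_pipe.fit(data[train_idx], label[train_idx])               # 3. LDA + LR fitted on the training fold only
    proba = fold_pipe.predict_proba(data[test_idx])[:, 1]          # 4. scored on the held-out fold
    fold_scores.append(roc_auc_score(label[test_idx], proba))

print(np.mean(fold_scores))   # averaged, then compared across parameter settings to pick the best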

Bottom line: with a Pipeline inside GridSearchCV you get the first case (split train/test -> LDA -> fit & predict), and this is the methodologically preferable order, because the LDA projection is learned from the training fold only and never sees the held-out data.
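For contrast, a sketch of the second, leaky order versus what the Pipeline gives you (again reusing `pipe`, `data`, and `label` from the question; how much the scores differ depends on the data):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Leaky variant: LDA is fitted on ALL rows before cross-validation, so every
# "test" fold has already influenced the projection it is evaluated on.
data_reduced = LinearDiscriminantAnalysis().fit_transform(data, label)
leaky_scores = cross_val_score(LogisticRegression(solver='liblinear'),
                               data_reduced, label, cv=5, scoring='roc_auc')

# Pipeline variant: LDA is refitted inside every training fold, so no leakage.
honest_scores = cross_val_score(pipe, data, label, cv=5, scoring='roc_auc')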

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: seralouk