How to apply StandardScaler in Pipeline in scikit-learn (sklearn)?
In the example below,
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=4)),
    ('clf', SVC(kernel='linear', C=1))])
param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2)
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))
I am using StandardScaler(); is this the correct way to apply it to the test set as well?
Solution 1:
Yes, this is the right way to do it, but there is a small mistake in your code. Let me break this down for you.
When you use the StandardScaler as a step inside a Pipeline, scikit-learn will internally do the job for you.
What happens can be described as follows:
- Step 0: The data are split into TRAINING data and TEST data according to the cv parameter that you specified in the GridSearchCV.
- Step 1: The scaler is fitted on the TRAINING data.
- Step 2: The scaler transforms the TRAINING data.
- Step 3: The models are fitted/trained using the transformed TRAINING data.
- Step 4: The scaler is used to transform the TEST data.
- Step 5: The trained models predict using the transformed TEST data.
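To make these steps concrete, here is a minimal sketch of what happens internally for a single split (illustrative only: the iris data, the single train/test split, and PCA(n_components=2) are assumptions; GridSearchCV repeats this for every fold and every parameter combination):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)        # Step 0: split the data

scaler = StandardScaler().fit(X_tr)                                    # Step 1: fit the scaler on TRAINING data only
X_tr_scaled = scaler.transform(X_tr)                                   # Step 2: transform TRAINING data
pca = PCA(n_components=2).fit(X_tr_scaled)
clf = SVC(kernel='linear', C=1).fit(pca.transform(X_tr_scaled), y_tr)  # Step 3: train on transformed TRAINING data

X_te_scaled = scaler.transform(X_te)                                   # Step 4: transform TEST data with the training statistics
print(clf.predict(pca.transform(X_te_scaled)))                         # Step 5: predict on transformed TEST data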
Note: You should be using grid.fit(X, y) and NOT grid.fit(X_train, y_train), because GridSearchCV will automatically split the data into training and testing data (this happens internally).
Use something like this:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=4)),
    ('clf', SVC(kernel='linear', C=1))])
param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X, y)  # X, y: your full feature matrix and labels
print(grid.best_score_)
print(grid.cv_results_)
Once you run this code (when you call grid.fit(X, y)), you can access the outcome of the grid search in the result object returned by grid.fit(). The best_score_ attribute provides the best score observed during the optimization procedure, and best_params_ describes the combination of parameters that achieved the best results.
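For example, continuing from the fit above:
print(grid.best_params_)     # parameter combination with the best mean cross-validation score
print(grid.best_estimator_)  # the pipeline refitted on all of X, y with those parameters (refit=True is the default)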
IMPORTANT EDIT 1: If you want to keep a validation set from the original dataset, use this:
from sklearn.model_selection import train_test_split

X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = train_test_split(
    X, y, test_size=0.15, random_state=1)
Then use:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)
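After the search finishes, the held-out split can then serve as a final check; a one-line sketch:
print(grid.score(X_future_validation, y_future_validation))  # evaluate the refitted best pipeline on data the search never saw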
Solution 2:
Quick answer: Your methodology is correct.
Although the above answer is very good, I would just like to point out some subtleties:
best_score_ [1] is the best cross-validation metric, not the generalization performance of the model [2]. To evaluate how well the best found parameters generalize, you should call score on the test set, as you have done. It is therefore necessary to start by splitting the data into training and test sets, fit the grid search only on X_train, y_train, and then score it with X_test, y_test [2].
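A minimal sketch of that workflow, reusing pipe and param_grid from Solution 1 (X, y assumed as before):
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X_train, y_train)         # cross-validation happens inside the training set only
print(grid.best_score_)            # best mean CV score, used for model selection
print(grid.score(X_test, y_test))  # generalization estimate on data the search never touched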
Deep Dive:
A threefold split of the data into training set, validation set, and test set is one way to prevent overfitting of the parameters during grid search. GridSearchCV, on the other hand, uses cross-validation within the training set instead of a separate validation set, but this does not replace the test set. This can be verified in [2] and [3].
References:
[1] GridSearchCV
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow