'Does sklearn LogisticRegressionCV use all data for final model
I was wondering how the final model (i.e. decision boundary) of LogisticRegressionCV in sklearn was calculated. So say I have some Xdata and ylabels such that
Xdata # shape of this is (n_samples,n_features)
ylabels # shape of this is (n_samples,), and it is binary
and now I run
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(Cs=[1.0],cv=5)
clf.fit(Xdata,ylabels)
This is looking at just one regularization parameter and 5 folds in the CV. So clf.scores_
will be a dictionary with one key with a value that is an array with shape (n_folds,1). With these five folds you can get a better idea of how the model performs.
However, I'm confused about what you get from clf.coef_
(and I'm assuming the parameters in clf.coef_
are the ones used in clf.predict
). I have a few options I think it could be:
- The parameters in
clf.coef_
are from training the model on all the data - The parameters in
clf.coef_
are from the best scoring fold - The parameters in
clf.coef_
are averaged across the folds in some way.
I imagine this is a duplicate question, but for the life of me I can't find a straightforward answer online, in the sklearn documentation, or in the source code for LogisticRegressionCV. Some relevant posts I found are:
Solution 1:[1]
You are mistaking between hyper-parameters and parameters. All scikit-learn estimators which have CV in the end, like LogisticRegressionCV
, GridSearchCV
, or RandomizedSearchCV
tune the hyper-parameters.
Hyper-parameters are not learnt from training on the data. They are set prior to learning assuming that they will contribute to optimal learning. More information is present here:
Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. Typical examples include C, kernel and gamma for Support Vector Classifier, alpha for Lasso, etc.
In case of LogisticRegression, C
is a hyper-parameter which describes the inverse of regularization strength. The higher the C, the less regularization is applied on the training. Its not that C
will be changed during training. It will be fixed.
Now coming to coef_
. coef_
contains coefficient (also called weights) of the features, which are learnt (and updated) during the training. Now depending on the value of C (and other hyper-parameters present in contructor), these can vary during the training.
Now there is another topic on how to get the optimum initial values of coef_
, so that the training is faster and better. Thats optimization. Some start with random weights between 0-1, others start with 0, etc etc. But for the scope of your question, that is not relevant. LogisticRegressionCV is not used for that.
This is what LogisticRegressionCV does:
- Get the values of different
C
from constructor (In your example you passed 1.0). - For each value of
C
, do the cross-validation of supplied data, in which the LogisticRegression will befit()
on training data of the current fold, and scored on the test data. The scores from test data of all folds are averaged and that becomes the score of the currentC
. This is done for allC
values you provided, and theC
with the highest average score will be chosen. - Now the chosen
C
is set as the finalC
and LogisticRegression is again trained (by callingfit()
) on the whole data (Xdata,ylabels
here).
Thats what all the hyper-parameter tuners do, be it GridSearchCV, or LogisticRegressionCV, or LassoCV etc.
The initializing and updating of coef_
feature weights is done inside the fit()
function of the algorithm which is out of scope for the hyper-parameter tuning. That optimization part is dependent on the internal optimization algorithm of the process. For example solver
param in case of LogisticRegression
.
Hope this makes things clear. Feel free to ask if still any doubt.
Solution 2:[2]
You have the parameter refit=True
by default. On the docs you can read:
If set to True, the scores are averaged across all folds, and the coefs and the C that corresponds to the best score is taken, and a final refit is done using these parameters. Otherwise the coefs, intercepts and C that correspond to the best scores across folds are averaged.
So if refit=True
the CV model is retrained using all the data.
When it says the final refit is done using these parameters it is talking about the C regularization parameter. So it uses the C that gives the best
average score across the K folds.
When refit=False
it retrieves you the best model in cross validation.
So if you trained 5 folds, you will get the model (coeff + C + intercept), trained on 4 folds of data, which gave the best score on its fold test set.
I agree that the documetation here is not very clear but averaging C values and coefficients does not really make much sense
Solution 3:[3]
I just took a look at the source code. It seems for refit = True, they just selected the best hyperparameter (C and l1_ratio) and retrain the model with all the data.
for refit = False:
It seems they do average the hyperparameters, see the blow source code:
best_indices = np.argmax(scores, axis=1)
...
best_indices_C = best_indices % len(self.Cs_)
self.C_.append(np.mean(self.Cs_[best_indices_C]))
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | user1269942 |
Solution 3 | Xin Niu |