How to extract coefficients from fitted pipeline for penalized logistic regression?

I have a set of training data that consists of X, which is a set of n columns of data (features), and Y, which is a single column containing the target variable.

I am trying to train my model with logistic regression using the following pipeline:

import sklearn.pipeline
from sklearn.linear_model import LogisticRegression

pipeline = sklearn.pipeline.Pipeline([
    # note: C has no effect when penalty='none'
    ('logistic_regression', LogisticRegression(penalty='none', C=10))
])

My goal is to obtain the values of each of the n coefficients corresponding to the features, under the assumption of a linear model (y = coeff_0 + coeff_1*x1 + ... + coeff_n*xn).

What I tried was to train this pipeline on my data with model = pipeline.fit(X, Y). So I think that I now have the model that contains the coefficients I want. However, I don't know how to access them. I'm looking for something like model.best_params_('logistic_regression').

Does anyone know how to extract the fitted coefficients from a model like this?



Solution 1:[1]

Have a look at the scikit-learn documentation for Pipeline; this example is inspired by it:

from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.pipeline import Pipeline
# generate some data to play with
X, y = make_classification(n_informative=5, n_redundant=0, random_state=42)
# ANOVA SVM-C
anova_filter = SelectKBest(f_regression, k=5)
clf = svm.SVC(kernel='linear')
anova_svm = Pipeline([('anova', anova_filter), ('svc', clf)])
anova_svm.set_params(anova__k=10, svc__C=.1).fit(X, y)
# access coefficients
print(anova_svm['svc'].coef_)

model.coef_ does the job; .best_params_ is usually associated with GridSearchCV, i.e. hyperparameter optimization.

In your specific case, try: model['logistic_regression'].coef_.
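As an added illustration (not part of the original answer), here is a minimal sketch applying this to the asker's pipeline; make_classification stands in for the asker's unseen X and Y:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# stand-in data for the asker's X (n feature columns) and Y (target column)
X, Y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=0)

pipeline = Pipeline([
    # penalty=None requires scikit-learn >= 1.2; older versions spell it 'none'
    ('logistic_regression', LogisticRegression(penalty=None))
])
model = pipeline.fit(X, Y)

# coef_ holds coeff_1..coeff_n and intercept_ holds coeff_0 from the question
print(model['logistic_regression'].coef_)
print(model.named_steps['logistic_regression'].intercept_)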

Solution 2:[2]

An example of getting the coefficients from a pipeline:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([('lr', LogisticRegression(penalty='l2', C=10))])
pipeline.fit(X, y)

pipeline['lr'].coef_

array([[-0.42923513,  2.08235619, -4.28084811, -1.97174699],
       [ 1.06321671, -0.08077595, -0.46911772, -2.3221883 ],
       [-0.63398158, -2.00158024,  4.74996583,  4.29393529]])
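
To map each coefficient back to its feature name, you can pair coef_ with the dataset's feature names. This is an addition to the answer above, reusing its fitted pipeline:

import pandas as pd
from sklearn.datasets import load_iris

# one row of coefficients per iris class, one column per feature
coef_table = pd.DataFrame(pipeline['lr'].coef_, columns=load_iris().feature_names)
print(coef_table)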

Solution 3:[3]

Here is how to visualize the coefficients and measure model accuracy. I used baby weight, height, and gestation period to predict preterm birth.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([('lr', LogisticRegression(penalty='l2', C=10))])

# df is a DataFrame with feature columns 'bwt_lbs', 'height_ft',
# 'gestation_wks' and the binary target 'PreTerm'
scaler = StandardScaler()
X = scaler.fit_transform(df[['bwt_lbs', 'height_ft', 'gestation_wks']])
y = np.array(df['PreTerm'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

pipeline.fit(X_train, y_train)
y_pred_prob = pipeline.predict_proba(X_test)
predictions = pipeline.predict(X_test)
print(predictions)

# distribution of predicted classes
sns.countplot(x=predictions)
plt.show()

print(pipeline['lr'].coef_)
print(pipeline['lr'].intercept_)
print('Coefficients close to zero will contribute little to the end result')

# errors over the full dataset (training and test rows combined)
num_err = np.sum(y != pipeline.predict(X))
print("Number of errors:", num_err)

def my_loss(y, w):
    # sum of squared differences between true and predicted labels
    s = 0
    for i in range(y.size):
        y_i_true = y[i]
        y_i_pred = w[i]
        s = s + (y_i_true - y_i_pred) ** 2
    return s

print("Loss:", my_loss(y_test, predictions))

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob[:, 1])

plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

accuracy = round(pipeline['lr'].score(X_train, y_train) * 100, 2)
print("Model Accuracy={accuracy}".format(accuracy=accuracy))

cm = confusion_matrix(y_test, predictions)
print(cm)
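
A common refinement of the snippet above, offered here as a hedged sketch rather than part of the original answer, is to put the scaler inside the pipeline so that scaling parameters are learned from the training split only. The DataFrame below is a hypothetical stand-in with the answer's column names:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical stand-in for the answer's df
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'bwt_lbs': rng.normal(7, 1, 200),
    'height_ft': rng.normal(1.6, 0.1, 200),
    'gestation_wks': rng.normal(39, 2, 200),
    'PreTerm': rng.integers(0, 2, 200),
})

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(penalty='l2', C=10)),
])
X = df[['bwt_lbs', 'height_ft', 'gestation_wks']]
y = df['PreTerm']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
pipe.fit(X_train, y_train)

# coefficients are on the standardized scale, so their magnitudes are comparable
print(pipe['lr'].coef_, pipe['lr'].intercept_)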

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Venkatachalam
Solution 2:
Solution 3: Golden Lion