How to extract coefficients from fitted pipeline for penalized logistic regression?
I have a set of training data that consists of X, which is a set of n columns of data (features), and Y, which is one column of target variable.
I am trying to train my model with logistic regression using the following pipeline:
pipeline = sklearn.pipeline.Pipeline([
('logistic_regression', LogisticRegression(penalty = 'none', C = 10))
])
My goal is to obtain the values of each of the n coefficients corresponding to the features, under the assumption of a linear model (y = coeff_0 + coeff_1*x1 + ... + coeff_n*xn).
What I tried was to train this pipeline on my data with model = pipeline.fit(X, Y). So I think that I now have the model that contains the coefficients that I want. However, I don't know how to access them. I'm looking for something like model.best_params_('logistic_regression').
Does anyone know how to extract the fitted coefficients from a model like this?
Solution 1:[1]
Have a look at the scikit-learn documentation for Pipeline; this example is inspired by it:
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.pipeline import Pipeline
# generate some data to play with
X, y = make_classification(n_informative=5, n_redundant=0, random_state=42)
# ANOVA SVM-C (f_classif is the ANOVA F-test for classification targets)
anova_filter = SelectKBest(f_classif, k=5)
clf = svm.SVC(kernel='linear')
anova_svm = Pipeline([('anova', anova_filter), ('svc', clf)])
anova_svm.set_params(anova__k=10, svc__C=.1).fit(X, y)
# access coefficients
print(anova_svm['svc'].coef_)
The fitted estimator's coef_ attribute does the job; .best_params_ is usually associated with GridSearchCV, i.e. hyperparameter optimization.
In your specific case try: model['logistic_regression'].coef_
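Applied to the pipeline from the question, this could look like the sketch below. The data here is a synthetic stand-in generated with make_classification (an assumption, since the original X and Y are not shown), and the estimator uses the default L2 penalty rather than the question's penalty setting. Note that for logistic regression the linear combination of coefficients and features gives the log-odds, not y itself, which the last lines check against decision_function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the asker's data (assumption): 4 features, binary target.
X, Y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

pipeline = Pipeline([("logistic_regression", LogisticRegression(C=10))])
model = pipeline.fit(X, Y)  # fit() returns the pipeline itself

# Index the pipeline by step name to reach the fitted estimator.
lr = model["logistic_regression"]
coefficients = lr.coef_    # shape (1, n_features) for a binary target
intercept = lr.intercept_  # shape (1,), plays the role of coeff_0

# The linear model describes the log-odds: coeff_0 + coeff_1*x1 + ... + coeff_n*xn
log_odds = X @ coefficients.ravel() + intercept[0]
matches = np.allclose(log_odds, lr.decision_function(X))

print("coefficients:", coefficients)
print("intercept:", intercept)
print("linear model reproduces decision_function:", matches)
```

model.named_steps['logistic_regression'] returns the same fitted estimator, if you prefer that spelling.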
Solution 2:[2]
Example of getting the coefficients from a pipeline:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
X, y = load_iris(return_X_y=True)
pipeline = Pipeline([('lr', LogisticRegression(penalty = 'l2',
C = 10))])
pipeline.fit(X, y)
pipeline['lr'].coef_
array([[-0.42923513, 2.08235619, -4.28084811, -1.97174699],
[ 1.06321671, -0.08077595, -0.46911772, -2.3221883 ],
[-0.63398158, -2.00158024, 4.74996583, 4.29393529]])
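With three classes, coef_ has one row per class and one column per feature, so the raw array can be hard to read. A minimal sketch of labelling it, assuming the feature_names and target_names that load_iris() provides (max_iter=1000 is my addition to avoid convergence warnings, so the numbers may differ slightly from the output above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

iris = load_iris()
pipeline = Pipeline([("lr", LogisticRegression(C=10, max_iter=1000))])
pipeline.fit(iris.data, iris.target)

lr = pipeline.named_steps["lr"]  # same object as pipeline["lr"]

# Row i of coef_ belongs to lr.classes_[i]; column j belongs to feature j.
coef_by_class = {
    str(iris.target_names[cls]): dict(zip(iris.feature_names, row))
    for cls, row in zip(lr.classes_, lr.coef_)
}

for cls_name, coefs in coef_by_class.items():
    print(cls_name, coefs)
```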
Solution 3:[3]
Here is how to visualize the coefficients and measure model accuracy. I used the baby weight, height, and gestation period to predict preterm birth.
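import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# df is a pandas DataFrame with columns bwt_lbs, height_ft, gestation_wks and PreTerm
pipeline = Pipeline([('lr', LogisticRegression(penalty='l2',C=10))])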
scaler=StandardScaler()
#X=np.array(df['gestation_wks']).reshape(-1,1)
X=scaler.fit_transform(df[['bwt_lbs','height_ft','gestation_wks']])
y=np.array(df['PreTerm'])
X_train,X_test, y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=123)
pipeline.fit(X_train,y_train)
y_pred_prob=pipeline.predict_proba(X_test)
predictions=pipeline.predict(X_test)
print(predictions)
sns.countplot(x=predictions, orient='h')
plt.show()
#print(predictions[:,0])
print(pipeline['lr'].coef_)
print(pipeline['lr'].intercept_)
print('Coefficients close to zero will contribute little to the end result')
num_err = np.sum(y != pipeline.predict(X))
print("Number of errors:", num_err)
def my_loss(y,w):
    s = 0
    for i in range(y.size):
        # Get the true and predicted target values for example 'i'
        y_i_true = y[i]
        y_i_pred = w[i]
        s = s + (y_i_true - y_i_pred)**2
    return s
print("Loss:",my_loss(y_test,predictions))
fpr, tpr, thresholds = roc_curve(y_test,y_pred_prob[:,1])
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
accuracy=round(pipeline['lr'].score(X_train, y_train) * 100, 2)
print("Model Accuracy={accuracy}".format(accuracy=accuracy))
cm=confusion_matrix(y_test,predictions)
print(cm)
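One possible variation, sketched below, is to put the StandardScaler inside the pipeline so it is fitted on the training split only, and then rank the coefficients by absolute size. This is not the code above: the data is synthetic (make_classification), and the feature names are only borrowed from the answer's columns as hypothetical labels. Because the features are standardized, the coefficient magnitudes are roughly comparable across features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical labels borrowed from the answer; the data itself is synthetic.
feature_names = ["bwt_lbs", "height_ft", "gestation_wks"]
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

# Scaling lives in the pipeline, so it only ever sees the training split.
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("lr", LogisticRegression(C=10))])
pipeline.fit(X_train, y_train)

# Binary target: coef_ has shape (1, n_features), so flatten it.
coefs = pipeline["lr"].coef_.ravel()
order = np.argsort(-np.abs(coefs))  # largest absolute coefficient first
ranked = [(feature_names[i], float(coefs[i])) for i in order]

test_accuracy = pipeline.score(X_test, y_test)  # accuracy on held-out data

for name, value in ranked:
    print(f"{name}: {value:+.3f}")
print("Test accuracy:", round(test_accuracy * 100, 2))
```

Calling score on the whole pipeline (rather than on pipeline['lr']) applies the scaler to X_test before predicting.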
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Venkatachalam |
Solution 2 | |
Solution 3 | Golden Lion |