Metrics F1 warning zero division

I want to calculate the F1 score of my models, but I receive a warning, get a 0.0 F1-score, and I don't know what to do.

here is the source code:

def model_evaluation(models):

    for name, model in models.items():
        classifier = Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('clf', model),
        ])
        classifier.fit(X_train, y_train)
        predictions = classifier.predict(X_test)
        print("Accuracy Score of", name, ":", metrics.accuracy_score(y_test, predictions))
        print(metrics.classification_report(y_test, predictions))
        print(metrics.f1_score(y_test, predictions, average="weighted",
                               labels=np.unique(predictions), zero_division=0))
        print("---------------", "\n")


dlist = {
    "KNeighborsClassifier": KNeighborsClassifier(3),
    "LinearSVC": LinearSVC(),
    "MultinomialNB": MultinomialNB(),
    "RandomForest": RandomForestClassifier(max_depth=5, n_estimators=100),
}

model_evaluation(dlist)

And here is the result:

Accuracy Score of KNeighborsClassifier :  0.75
              precision    recall  f1-score   support

not positive       0.71      0.77      0.74        13
    positive       0.79      0.73      0.76        15

    accuracy                           0.75        28
   macro avg       0.75      0.75      0.75        28
weighted avg       0.75      0.75      0.75        28

0.7503192848020434
--------------- 

Accuracy Score of LinearSVC :  0.8928571428571429
              precision    recall  f1-score   support

not positive       1.00      0.77      0.87        13
    positive       0.83      1.00      0.91        15

    accuracy                           0.89        28
   macro avg       0.92      0.88      0.89        28
weighted avg       0.91      0.89      0.89        28

0.8907396950875212
--------------- 

Accuracy Score of MultinomialNB :  0.5357142857142857
              precision    recall  f1-score   support

not positive       0.00      0.00      0.00        13
    positive       0.54      1.00      0.70        15

    accuracy                           0.54        28
   macro avg       0.27      0.50      0.35        28
weighted avg       0.29      0.54      0.37        28

0.6976744186046512
--------------- 

C:\Users\Cey\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
Accuracy Score of RandomForest :  0.5714285714285714
              precision    recall  f1-score   support

not positive       1.00      0.08      0.14        13
    positive       0.56      1.00      0.71        15

    accuracy                           0.57        28
   macro avg       0.78      0.54      0.43        28
weighted avg       0.76      0.57      0.45        28

0.44897959183673475
--------------- 

Can someone tell me what to do? I only receive this message when using the "MultinomialNB()" classifier.


Second:

When extending the dictionary with the Gaussian classifier (GaussianNB()), I receive this error message:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

What should I do here?



Solution 1:[1]

Can someone tell me what to do? I only receive this message when using the "MultinomialNB()" classifier

The warning indicates that a specific label is never predicted when using MultinomialNB, which makes the precision and F-score for that label undefined (ill-defined); sklearn sets the missing values to 0.0 and warns. This is explained here.
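A minimal sketch of what triggers the warning, using hypothetical toy data (not the asker's dataset): when one label is never predicted, precision for that label is 0/0, and sklearn substitutes 0.0.

```python
from sklearn.metrics import f1_score, precision_score

# Toy example: the classifier never predicts "not positive"
y_true = ["not positive", "positive", "positive", "not positive"]
y_pred = ["positive", "positive", "positive", "positive"]

# Precision for "not positive" = 0 true positives / 0 predicted samples:
# undefined, so it is reported as the zero_division value (here 0.0).
print(precision_score(y_true, y_pred, pos_label="not positive", zero_division=0))  # 0.0
print(f1_score(y_true, y_pred, pos_label="not positive", zero_division=0))         # 0.0
```

Without `zero_division=0`, the same calls emit the UndefinedMetricWarning seen in the question.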

When extending the dictionary by using the Gausian classifier (GaussianNB()) I receive this error message: TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

As per this question, the error is quite explicit: TfidfVectorizer returns a sparse matrix, which GaussianNB cannot accept as input. So the way I see it, you either avoid GaussianNB, or you add an intermediate transformer to turn the sparse matrix into a dense one, which I wouldn't advise given that it is the result of a tf-idf vectorization (the dense matrix can be very large).
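If you do want to try the "intermediate transformer" route, a minimal sketch looks like this (the `to_dense` step name and the toy documents are my own illustration, not from the question):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Convert the sparse tf-idf output to a dense array before GaussianNB.
classifier = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('to_dense', FunctionTransformer(lambda X: X.toarray(), accept_sparse=True)),
    ('clf', GaussianNB()),
])

# Hypothetical toy data, just to show the pipeline fits without the TypeError.
docs = ["good movie", "bad movie", "great film", "terrible film"]
labels = ["positive", "not positive", "positive", "not positive"]
classifier.fit(docs, labels)
print(classifier.predict(["good film"]))
```

Keep in mind this materialises the full dense matrix in memory, which is exactly why the answer above advises against it for tf-idf features.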

Solution 2:[2]

Building on the question UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples (main credit goes there) and @yatu's answer, I could at least find a workaround for the warning:

UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use zero_division parameter to control this behavior. _warn_prf(average, modifier, msg_start, len(result))

Quote from sklearn.metrics.f1_score in the Notes at the bottom:

When true positive + false positive == 0, precision is undefined. When true positive + false negative == 0, recall is undefined. In such cases, by default the metric will be set to 0, as will f-score, and UndefinedMetricWarning will be raised. This behavior can be modified with zero_division.

Thus, you cannot avoid the warning condition itself if some label in your data is never predicted (true positives + false positives == 0 for that label). What you can do is suppress the warning by passing zero_division to the functions mentioned in the quote. Note that zero_division only controls the substituted value: with zero_division=0 the undefined metric is reported as 0, with zero_division=1 it is reported as 1.

from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred, zero_division=0)
print('Precision score: {0:0.2f}'.format(precision))

recall = recall_score(y_test, y_pred, zero_division=0)
print('Recall score: {0:0.2f}'.format(recall))

f1 = f1_score(y_test, y_pred, zero_division=0)
print('f1 score: {0:0.2f}'.format(f1))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 yatu
Solution 2