Invalid labels for classification logistic regression model in PySpark on Databricks

I am using the Spark ML library for a classification problem with logistic regression.

I have vectorized the input features and created training and test datasets.

While fitting the model, I get an invalid labels error.

[Screenshot: invalid labels issue]

The training dataset is shown below. My input features are in Independent_features and my target is Category_con.

[Screenshot: training dataset]



Solution 1:[1]

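Name the columns features and label instead of Independent_features and Category_con when creating your vectors. These are the default featuresCol and labelCol values for Spark ML estimators such as LogisticRegression.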
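As a rough sketch of the two ways to line up the column names, the snippet below either renames the question's columns to the Spark ML defaults or passes the original names to the estimator explicitly. The helper names rename_for_defaults and build_lr_with_explicit_columns are hypothetical, and the second function assumes a Spark session is already running.

```python
# Mapping from the question's column names to the Spark ML default names.
COLUMN_RENAMES = {"Independent_features": "features", "Category_con": "label"}


def rename_for_defaults(df):
    """Return df with its columns renamed to the names estimators use by default."""
    for old, new in COLUMN_RENAMES.items():
        df = df.withColumnRenamed(old, new)
    return df


def build_lr_with_explicit_columns():
    """Alternative: keep the original names and tell the estimator about them."""
    # Imported here because constructing the estimator needs an active Spark session.
    from pyspark.ml.classification import LogisticRegression

    return LogisticRegression(
        featuresCol="Independent_features",
        labelCol="Category_con",
    )
```

Either approach only fixes the column lookup. If the values in the label column are themselves not acceptable to the classifier, they still need to be recoded.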

Solution 2:[2]

For the labels, you would need to reduce them to just 3 categories. From the error message, it looks like you might have 6. You can use conditional replacement to group or bin the categories, like below:

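from pyspark.sql.functions import col, when
# firstLabel and secondLabel are placeholders for the numeric labels to assign
train_df = train_df.withColumn('label', when(col('Category_con') == firstCondition, firstLabel).when(col('Category_con') == secondCondition, secondLabel).otherwise(lastCondition))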
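Before refitting, it may help to check which distinct label values are actually in the column. The sketch below is plain Python and rests on the assumption that the classifier expects labels that are non-negative whole numbers (0, 1, 2, ...); the helper name find_invalid_labels is hypothetical.

```python
def find_invalid_labels(values):
    """Return the values that are not non-negative whole numbers.

    Assumption: valid classification labels look like 0.0, 1.0, 2.0, ...
    """
    invalid = []
    for v in values:
        # bool is a subclass of int, so exclude it explicitly.
        is_number = isinstance(v, (int, float)) and not isinstance(v, bool)
        if not is_number or v < 0 or not float(v).is_integer():
            invalid.append(v)
    return invalid


# Possible usage against the training data (requires a Spark session):
# distinct = [row["label"] for row in train_df.select("label").distinct().collect()]
# print(find_invalid_labels(distinct))
```

An empty result means every label passed this simple check. Anything returned is a candidate for the conditional replacement described above.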

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 MichiganMadeLearner