Python label encoding: Decision tree classification

I'm really new to Python and am trying to run a decision tree model with the code below:

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd
import sklearn as skl


data_forecast = pd.read_excel("./Forcast_data_Analytics.xlsx")

x = data_forecast[['Name','Power', 'FirstEventID','AlleventIds']]
y = data_forecast[['Possible_fix','Changes_Required']]

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.8)

classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Sample data:

Name       Power      FirstEventID      AlleventIds         Possible_fix        Changes_Required
India      I3000       10130-1           10130-1, 134-00     yes                 Bug Fix

Can I do decision tree classification without label encoding, or do I need to encode my data before passing it to the classifier?

What is the best way to do this? I want to treat everything as a string and encode it. After classification, I also want to decode the values.

I tried the below encoding method, which did not work:

from sklearn.preprocessing import LabelEncoder
vals = np.array(data_forecast)
LabelEncoder = LabelEncoder()
integer_encoded = LabelEncoder.fit_transform(vals)

Error:

Exception has occurred: ValueError
y should be a 1d array, got an array of shape (59, 23) instead.

What is the right way to do this? How do I encode and decode my labels and use them?



Solution 1:[1]

The question is old, but I'll try to help; it may be useful for someone else.

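The error is simple, and it happens before any data reaches the classifier. LabelEncoder expects one single column (a 1-dimensional array), but it was given the whole table, hence the shape (59, 23) in the message. The target has a similar problem: y should be one single column, and you passed 2 here: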

y = data_forecast[['Possible_fix','Changes_Required']]
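As a minimal sketch of the difference, assuming a tiny made-up frame that reuses two column names from the question (the rows are invented for illustration): double brackets return a 2-D DataFrame, single brackets return a 1-D Series, and LabelEncoder accepts only the 1-D form.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for illustration; only the column names come from the question.
data_forecast = pd.DataFrame({
    "Possible_fix": ["yes", "no", "yes"],
    "Changes_Required": ["Bug Fix", "Config", "Bug Fix"],
})

y_2d = data_forecast[["Possible_fix", "Changes_Required"]]  # DataFrame, shape (3, 2)
y_1d = data_forecast["Possible_fix"]                        # Series, shape (3,)

# LabelEncoder works on a single column and can map the integers back.
target_encoder = LabelEncoder()
y_encoded = target_encoder.fit_transform(y_1d)
y_decoded = target_encoder.inverse_transform(y_encoded)

# Passing the 2-D frame raises the same kind of ValueError as in the question.
try:
    LabelEncoder().fit_transform(y_2d)
    raised = False
except ValueError:
    raised = True
```

Choosing a single target column, or encoding each target separately, avoids the error.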

As for the encoding part, I'm no specialist, but what has worked for me is to load the data as a DataFrame "df" and then split off the feature columns as df2 for X:

df2 = df.loc[:, df.columns != 'col_class']

And encode only X:

from sklearn.preprocessing import LabelEncoder
X = df2.apply(LabelEncoder().fit_transform)
y = df['col_class']
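One caveat with the apply one-liner is that the fitted encoder is not kept, so the integers cannot be decoded afterwards. A minimal sketch of one way around that is to keep one LabelEncoder per column. It assumes a made-up DataFrame df with a target column named col_class; the rows, the encoders dict and the target_encoder name are hypothetical, for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up data for illustration; 'col_class' is the target column.
df = pd.DataFrame({
    "Name": ["India", "India", "Spain", "Spain"],
    "Power": ["I3000", "I2000", "I3000", "I2000"],
    "col_class": ["yes", "no", "yes", "no"],
})

df2 = df.loc[:, df.columns != "col_class"]

# One fitted encoder per feature column, kept in a dict so each can be inverted.
encoders = {col: LabelEncoder().fit(df2[col]) for col in df2.columns}
X = pd.DataFrame(
    {col: enc.transform(df2[col]) for col, enc in encoders.items()},
    index=df2.index,
)

# Separate encoder for the 1-D target.
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(df["col_class"])

classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X, y)

# Decode predictions and one feature column back to the original strings.
y_pred = target_encoder.inverse_transform(classifier.predict(X))
names_decoded = encoders["Name"].inverse_transform(X["Name"])
```

The scikit-learn documentation describes LabelEncoder as intended for the target y. For feature columns, OrdinalEncoder from sklearn.preprocessing handles a whole 2-D table in one call and also offers inverse_transform, so it may be a better fit here.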

Hope it helps.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 André