Getting a ValueError: how to use string data in model.fit with DecisionTreeClassifier in Jupyter?

This is the code:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
dataset = pd.read_csv("emotion.csv")
X = dataset.drop(columns = ["mood"])
y = dataset['mood']
model = DecisionTreeClassifier()
model.fit(X,y)
model.predict([["i am sad"]])

And this is the error:


ValueError: could not convert string to float: 'oh yeah'

Any help would be appreciated.



Solution 1:[1]

Feature engineering

You can't pass raw string features to DecisionTreeClassifier. The tree has to be trained on data represented with numbers. ValueError: could not convert string to float: 'oh yeah' means that scikit-learn tried to convert the values in your data to floats for you and failed on the string 'oh yeah'.
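To see where the error comes from, here is a minimal sketch that reproduces it. The toy DataFrame is a hypothetical stand-in for emotion.csv (only the column name "mood" and the two sample strings come from the question); the rest is made up for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for emotion.csv.
toy = pd.DataFrame({
    "text": ["oh yeah", "i am sad", "what a day", "leave me alone"],
    "mood": ["happy", "sad", "happy", "sad"],
})

try:
    # The feature column holds raw strings, so fitting fails.
    DecisionTreeClassifier().fit(toy[["text"]], toy["mood"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Note that the string labels in "mood" are not the problem: the classifier accepts string targets. Only the feature matrix has to be numeric.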

If you have categorical features, process them with an encoding method such as label encoding or one-hot encoding. Many sources cover categorical feature encoding methods in more detail.
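As a rough illustration of one-hot encoding, the sketch below encodes two categorical columns before fitting the tree. The columns "weather" and "weekday" and all values are hypothetical and not part of the question's data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical categorical data (not from the question).
df = pd.DataFrame({
    "weather": ["sunny", "rainy", "sunny", "cloudy", "rainy", "sunny"],
    "weekday": ["mon", "mon", "sat", "sat", "sun", "sun"],
    "mood": ["happy", "sad", "happy", "happy", "sad", "happy"],
})

# Turn each category into its own 0/1 column; categories unseen
# during fit are encoded as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(df[["weather", "weekday"]])
y = df["mood"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# New rows must go through the same fitted encoder before predict.
new_row = pd.DataFrame({"weather": ["rainy"], "weekday": ["mon"]})
prediction = model.predict(encoder.transform(new_row))
print(prediction)
```

The important part is that the same fitted encoder is reused at prediction time, so the columns line up with what the tree saw during training.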

If you have text features, use feature extraction methods to generate new numeric features from them, for instance TF-IDF. There is plenty of material on this too, e.g. the feature extraction overview in the scikit-learn documentation.
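For the situation in the question, a minimal sketch could chain TfidfVectorizer and the tree in a pipeline, so that predict accepts raw strings. The four training sentences and their labels are hypothetical; only "oh yeah" and "i am sad" appear in the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for emotion.csv.
texts = ["i am sad", "oh yeah", "i feel great today", "this is awful"]
moods = ["sad", "happy", "happy", "sad"]

# The vectorizer turns each string into a numeric TF-IDF vector,
# and the tree is trained on those vectors.
model = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(random_state=0))
model.fit(texts, moods)

# Pass a flat list of strings, not a nested list like [["i am sad"]].
prediction = model.predict(["i am sad"])
print(prediction)
```

With a pipeline, the vectorizer fitted on the training text is applied automatically at prediction time, which avoids calling transform by hand.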

Example

Here is an example of how to work with data that consists only of text. I suggest studying the TF-IDF technique and the TfidfVectorizer documentation page to better understand what is happening.

Data: https://github.com/dair-ai/emotion_dataset/blob/master/README.md

Code:

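import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Load the pickled dataset (columns 'text' and 'emotions').
with open('merged_training.pkl', 'rb') as file:
    data = pickle.load(file)

# Convert raw text into numeric TF-IDF features: ignore terms that occur
# in fewer than 5 documents and keep at most 1000 terms.
vectorizer = TfidfVectorizer(min_df=5, max_features=1000)
X = vectorizer.fit_transform(data['text'])
y = data['emotions']

model = DecisionTreeClassifier(max_depth=10, random_state=13)
model.fit(X, y)

# Accuracy on the training data itself, not on held-out data.
accuracy_score(y, model.predict(X))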

Output:

0.3708941025745605

Please note that this is just a starting point and there is a lot to improve. For instance:

  • Separate the data into training and validation sets
  • Preprocess the input text
  • Try other models, such as logistic regression or boosting
  • Tune hyperparameters
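A couple of these improvements can be sketched as follows: hold out a validation set and compare the tree with logistic regression. The texts and labels are hypothetical toy data; with the real dataset you would pass data['text'] and data['emotions'] instead, and the scores on such a tiny sample are not meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; replace with the real text and label columns.
texts = [
    "i am sad", "this is awful", "i feel terrible", "what a bad day",
    "oh yeah", "i feel great today", "this is wonderful", "what a good day",
]
labels = ["sad", "sad", "sad", "sad", "happy", "happy", "happy", "happy"]

# Hold out 25% of the data for validation, keeping class proportions.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=13
)

candidates = {
    "decision tree": DecisionTreeClassifier(max_depth=10, random_state=13),
    "logistic regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, classifier in candidates.items():
    # The vectorizer is fitted on the training split only.
    pipeline = make_pipeline(TfidfVectorizer(lowercase=True), classifier)
    pipeline.fit(X_train, y_train)
    # Accuracy on the held-out validation split.
    scores[name] = pipeline.score(X_val, y_val)

print(scores)
```

Fitting the vectorizer inside the pipeline keeps validation text out of the vocabulary and IDF statistics, so the validation score is a fairer estimate than accuracy measured on the training data.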

I hope this gives you an idea of what you can do with text data.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1