Getting a ValueError: how to use string data in model.fit with DecisionTreeClassifier in Jupyter?
This is the code:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
dataset = pd.read_csv("emotion.csv")
X = dataset.drop(columns = ["mood"])
y = dataset['mood']
model = DecisionTreeClassifier()
model.fit(X,y)
model.predict([["i am sad"]])
And this is the error:
ValueError: could not convert string to float: 'oh yeah'
Any help would be appreciated.
Solution 1:[1]
Feature engineering
You can't use raw string features in DecisionTreeClassifier. You have to train your decision tree on data represented with numbers. ValueError: could not convert string to float: 'oh yeah' means that the decision tree tried to convert the values in your data to floats for you, but did not succeed.
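To see where the message comes from, here is a minimal sketch that mimics the conversion with Python's built-in float(). The helper name try_float is made up for illustration, and scikit-learn performs the conversion internally on whole arrays rather than through this exact call.

```python
# Hypothetical helper: mimics the string-to-float conversion that fails inside fit().
def try_float(value):
    try:
        return float(value)
    except ValueError as err:
        # Return the error text instead of raising, so both cases can be printed.
        return f"ValueError: {err}"

print(try_float("3.14"))     # a numeric string converts fine
print(try_float("oh yeah"))  # free text cannot be converted
```

Any cell in X that holds free text fails in the same way, which is why the features must be turned into numbers before calling fit.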
If you have categorical features, process them with an encoding method, for instance label encoding or one-hot encoding. You can read more about categorical feature encoding methods in many different sources.
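As a rough illustration of the categorical case, the sketch below one-hot encodes a single column with scikit-learn's OneHotEncoder before fitting the tree. The toy DataFrame and its "weather" column are invented for this example and are not part of the question's data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical feature and a label column.
df = pd.DataFrame({
    "weather": ["sunny", "rainy", "sunny", "cloudy"],
    "mood": ["happy", "sad", "happy", "sad"],
})

# One numeric 0/1 column per category; unseen categories become all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(df[["weather"]])
y = df["mood"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# New rows must go through the same fitted encoder before predict().
new = pd.DataFrame({"weather": ["rainy"]})
prediction = model.predict(encoder.transform(new))
print(prediction)
```

One-hot encoding suits categories with no natural order. For ordered categories, an ordinal encoding may be a better fit.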
If you have text features, use feature extraction methods to generate new features from them, for instance TF-IDF. Again, you can find many materials on it, e.g. the methods overview in the scikit-learn documentation.
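To make the idea concrete, here is a small sketch of what TF-IDF produces for a handful of sentences. The three example strings are made up, and the vectorizer is used with its default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus, only to show the shape of the result.
texts = ["i am sad", "oh yeah", "i am so happy"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # sparse matrix: one row per text, one column per word

print(X.shape)
print(sorted(vectorizer.vocabulary_))  # the words that became columns
```

With the default tokenizer, single-character tokens such as "i" are dropped, so they do not get a column. Each row of X is a numeric vector that a decision tree can consume.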
Example
Here is an example of how to work with data represented only as text. I suggest you study the TF-IDF technique and the TfidfVectorizer documentation page to better understand what is happening.
Data: https://github.com/dair-ai/emotion_dataset/blob/master/README.md
Code:
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Load the pickled dataset; the code below uses its 'text' and 'emotions' columns
with open('merged_training.pkl', 'rb') as file:
    data = pickle.load(file)

# Turn each text into a numeric TF-IDF vector
vectorizer = TfidfVectorizer(min_df=5, max_features=1000)
X = vectorizer.fit_transform(data['text'])
y = data['emotions']

model = DecisionTreeClassifier(max_depth=10, random_state=13)
model.fit(X, y)

# Accuracy measured on the same data the model was trained on
accuracy_score(y, model.predict(X))
Output:
0.3708941025745605
Please note that this is just a starting point and there are many things to improve. For instance:
- Separate train and validation data sets
- Preprocess input text
- Try other models, such as logistic regression or boosting
- Tune hyperparameters
I hope this gives you an idea of what you can do with text data.
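The first suggestion, plus predicting on a raw string as in the question, could be sketched as follows. The toy texts and labels are invented stand-ins for the real dataset, and wrapping the vectorizer and the tree in a pipeline is one possible design, not the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the real dataset.
texts = ["i am sad", "i feel so sad today", "this is sad news", "sad and lonely",
         "oh yeah", "i am happy", "so happy today", "happy and excited"]
labels = ["sadness"] * 4 + ["joy"] * 4

# Hold out a validation set so accuracy is not measured on the training data.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=13, stratify=labels
)

# The pipeline applies TF-IDF first, then the tree, for both fit and predict.
pipeline = make_pipeline(
    TfidfVectorizer(),
    DecisionTreeClassifier(max_depth=10, random_state=13),
)
pipeline.fit(X_train, y_train)

val_accuracy = accuracy_score(y_val, pipeline.predict(X_val))
print(val_accuracy)

# Raw strings can be passed directly; the pipeline vectorizes them internally.
prediction = pipeline.predict(["i am sad"])
print(prediction)
```

Because the vectorizer lives inside the pipeline, it is fitted only on the training split, and new text goes in as a flat list of strings rather than a nested list.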
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |