RuntimeError: Found dtype Long but expected Float when fine-tuning using Trainer API
I'm trying to fine-tune a BERT model for sentiment analysis (classifying text as positive/negative) with the Hugging Face Trainer API. My dataset has two columns, Text and Sentiment, and it looks like this:
Text Sentiment
This was good place 1
This was bad place 0
Here is my code:
from datasets import load_dataset
from datasets import load_dataset_builder
from datasets import Dataset
import datasets
import transformers
from transformers import TrainingArguments
from transformers import Trainer
dataset = load_dataset('csv', data_files='./train/test.csv', sep=';')
tokenizer = transformers.BertTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
model = transformers.BertForSequenceClassification.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1", num_labels=1)
def tokenize_function(examples):
    return tokenizer(examples["Text"], truncation=True, padding='max_length')
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.rename_column('Sentiment', 'label')
tokenized_datasets = tokenized_datasets.remove_columns('Text')
training_args = TrainingArguments("test_trainer")
trainer = Trainer(
    model=model, args=training_args, train_dataset=tokenized_datasets['train']
)
trainer.train()
Running this throws an error:
Variable._execution_engine.run_backward(
RuntimeError: Found dtype Long but expected Float
The error may come from the dataset itself, but can I fix it in my code somehow? I searched the internet, and this error seems to have been solved before by "converting tensors to float", but how would I do that with the Trainer API? Any advice is highly appreciated.
Some reference:
https://discuss.pytorch.org/t/run-backward-expected-dtype-float-but-got-dtype-long/61650/10
Solution 1:[1]
Most likely, the problem is with the loss function. This can be fixed by setting up the model correctly, mainly by telling it which loss to use: BertForSequenceClassification decides the loss from its problem_type and num_labels settings (sketched below).
Your problem has binary labels and should therefore be framed as a single-label classification problem. The code you have shared, however, passes num_labels=1 without a problem_type, so it is inferred as a regression problem; regression uses MSE loss with float targets, which explains the error about expecting Float but finding Long for the target labels.
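For reference, this is roughly the inference logic (a paraphrased sketch of BertForSequenceClassification's behaviour; exact details may differ between transformers versions):
import torch

def inferred_problem_type(num_labels: int, labels: torch.Tensor) -> str:
    # Paraphrase of how the model picks a loss when config.problem_type is unset.
    if num_labels == 1:
        return "regression"                    # MSELoss -> float targets
    if labels.dtype in (torch.long, torch.int):
        return "single_label_classification"   # CrossEntropyLoss -> long targets
    return "multi_label_classification"        # BCEWithLogitsLoss -> float targets

# num_labels=1 with integer labels lands in the regression branch,
# which is where "Found dtype Long but expected Float" comes from.
print(inferred_problem_type(1, torch.tensor([1, 0])))  # -> "regression"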
You need to pass the correct problem type.
model = transformers.BertForSequenceClassification.from_pretrained(
    "TurkuNLP/bert-base-finnish-cased-v1",
    num_labels=1,
    problem_type="single_label_classification"
)
This will make the model use a BCE-style loss. BCE loss expects float targets, so you also have to cast the labels to float, which you can do with the datasets API.
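For example, a minimal sketch of the cast (it assumes the column has already been renamed to label, as in the question's code):
# Cast the integer labels to float so a loss with float targets accepts them.
tokenized_datasets = tokenized_datasets.map(lambda ex: {"label": float(ex["label"])})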
The other way would be to treat it as a multi-class classification problem with cross-entropy (CE) loss. For that, just fixing num_labels should be fine:
model = transformers.BertForSequenceClassification.from_pretrained(
    "TurkuNLP/bert-base-finnish-cased-v1",
    num_labels=2,
)
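With num_labels=2 the model uses cross-entropy loss, which expects integer class indices, so the existing 0/1 labels can stay as Long and no casting is needed.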
Solution 2:[2]
Here I am assuming that you are trying to do single-label classification, that is, to predict a single result instead of multiple results.
But the loss function you are using (I don't know which one, but it is probably BCE) expects a vector as the label.
So either you convert your labels to vectors, as suggested in the comments, or you replace the loss function with cross-entropy loss and set the number of labels to 2 (or however many classes you have). Both approaches will work.
If you want to train your model as a multi-label classifier, you can convert your labels to one-hot vectors using sklearn.preprocessing:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
import numpy as np

dataset = pd.read_csv("filename.csv", encoding="utf-8")

# Encode the labels as integers first
enc_labels = LabelEncoder()
int_encoded = enc_labels.fit_transform(np.array(dataset["Sentiment"].to_list()))

# Then one-hot encode them (newer scikit-learn versions use sparse_output=False)
onehot_encoder = OneHotEncoder(sparse=False)
int_encoded = int_encoded.reshape(len(int_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(int_encoded)

# Write the one-hot vector back into the dataframe row by row
for index, cat in dataset.iterrows():
    dataset.at[index, 'Sentiment'] = onehot_encoded[index]
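If you go this one-hot route, the label vectors also need to be float, and the model has to be told to use a BCE-style loss. A minimal sketch, assuming the same checkpoint as in the question and transformers imported as above:
# Hypothetical setup for vector (one-hot) labels: BCEWithLogitsLoss via problem_type.
# The one-hot labels must be cast to float32 for this to work.
model = transformers.BertForSequenceClassification.from_pretrained(
    "TurkuNLP/bert-base-finnish-cased-v1",
    num_labels=2,
    problem_type="multi_label_classification",
)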
Solution 3:[3]
You could cast your data.
If you have it in pandas format, you could do:
df['column_name'] = df['column_name'].astype(float)
If you have it in Hugging Face datasets format, you could do something like this:
from datasets import load_dataset, Value, ClassLabel

dataset = load_dataset('glue', 'mrpc', split='train')

# Copy the existing features, change the dtypes / label names, then cast
new_features = dataset.features.copy()
new_features["idx"] = Value('int64')
new_features["label"] = ClassLabel(names=['negative', 'positive'])
dataset = dataset.cast(new_features)
Before:
dataset.features
{'idx': Value(dtype='int32', id=None),
'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], id=None),
'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None)}
After:
dataset.features
{'idx': Value(dtype='int64', id=None),
'label': ClassLabel(num_classes=2, names=['negative', 'positive'], id=None),
'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None)}
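Applied to the dataset from the question, the same cast would look roughly like this (a sketch assuming the Sentiment column was already renamed to label and you stick with the float-target setup from Solution 1):
from datasets import Value

# Cast the renamed "label" column to float32 for a float-target loss;
# keep it as an integer type if you instead use num_labels=2 with cross-entropy.
new_features = tokenized_datasets["train"].features.copy()
new_features["label"] = Value("float32")
tokenized_datasets["train"] = tokenized_datasets["train"].cast(new_features)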
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Umang Gupta |
| Solution 2 | Mehmet Çal?ku? |
| Solution 3 | Diego Pereira |