Predicting Sentiment of Raw Text using Trained BERT Model, Hugging Face

I'm classifying the sentiment of Tweets into positive, negative, and neutral classes. I've trained a BERT model using Hugging Face. Now I'd like to make predictions on a dataframe of unlabeled Twitter text, and I'm having difficulty.

I followed this tutorial (https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/) and was able to train a BERT model using Hugging Face.

Here's an example of predicting on raw text. However, it handles only one sentence, and I would like to run it on a column of Tweets: https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/#predicting-on-raw-text

review_text = "I love completing my todos! Best app ever!!!"

encoded_review = tokenizer.encode_plus(
  review_text,
  max_length=MAX_LEN,
  add_special_tokens=True,
  return_token_type_ids=False,
  pad_to_max_length=True,
  return_attention_mask=True,
  return_tensors='pt',
)

input_ids = encoded_review['input_ids'].to(device)
attention_mask = encoded_review['attention_mask'].to(device)
output = model(input_ids, attention_mask)
_, prediction = torch.max(output, dim=1)
print(f'Review text: {review_text}')
print(f'Sentiment  : {class_names[prediction]}')

Review text: I love completing my todos! Best app ever!!!
Sentiment  : positive

Bill's response works. Here's the solution.

def predictionPipeline(text):
  encoded_review = tokenizer.encode_plus(
      text,
      max_length=MAX_LEN,
      add_special_tokens=True,
      return_token_type_ids=False,
      pad_to_max_length=True,
      return_attention_mask=True,
      return_tensors='pt',
    )

  input_ids = encoded_review['input_ids'].to(device)
  attention_mask = encoded_review['attention_mask'].to(device)

  output = model(input_ids, attention_mask)
  _, prediction = torch.max(output, dim=1)

  return class_names[prediction]

df2['prediction'] = df2['cleaned_tweet'].apply(predictionPipeline)


Solution 1:[1]

You can use the same code to predict texts from the dataframe column.

import pandas as pd
import torch

model = ...
tokenizer = ...

def predict(review_text):
    # Tokenize one text and pad it to MAX_LEN, returning PyTorch tensors.
    # Newer transformers versions prefer padding='max_length' over pad_to_max_length.
    encoded_review = tokenizer.encode_plus(
        review_text,
        max_length=MAX_LEN,
        add_special_tokens=True,
        return_token_type_ids=False,
        pad_to_max_length=True,
        return_attention_mask=True,
        return_tensors='pt',
    )

    input_ids = encoded_review['input_ids'].to(device)
    attention_mask = encoded_review['attention_mask'].to(device)
    output = model(input_ids, attention_mask)
    # Index of the highest-scoring class
    _, prediction = torch.max(output, dim=1)
    print(f'Review text: {review_text}')
    print(f'Sentiment  : {class_names[prediction]}')
    return class_names[prediction]


df = pd.DataFrame({
    'texts': ["text1", "text2", "...."]
})

# Apply predict to every row and store the label in a new column
df["sentiments"] = df.apply(lambda l: predict(l.texts), axis=1)
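
Calling the model once per row works, but it can be slow on a large dataframe. One option is to feed the texts to the model in chunks. The following is only a minimal sketch of that chunking logic: `predict_in_batches` and `batch_fn` are hypothetical names that do not come from the tutorial, and `fake_batch_fn` is a stand-in so the example runs without a trained model.

```python
from typing import Callable, List, Sequence


def predict_in_batches(
    texts: Sequence[str],
    batch_fn: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> List[str]:
    """Run batch_fn over texts in chunks and return labels in input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    labels: List[str] = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        chunk_labels = batch_fn(chunk)
        # Guard against a batch function that drops or duplicates rows
        if len(chunk_labels) != len(chunk):
            raise ValueError("batch_fn must return one label per text")
        labels.extend(chunk_labels)
    return labels


# Stand-in for the real model call (assumption, for illustration only)
def fake_batch_fn(chunk: List[str]) -> List[str]:
    return ["positive" if "love" in text else "neutral" for text in chunk]


tweets = ["I love this app", "It is an app", "love it", "meh", "ok"]
labels = predict_in_batches(tweets, fake_batch_fn, batch_size=2)
print(labels)
```

With the real model, `batch_fn` would tokenize the whole chunk at once, move the tensors to `device`, run the model (ideally under `torch.no_grad()` with the model in eval mode), and map each row's highest-scoring index to `class_names`. The result could then be assigned with something like `df["sentiments"] = predict_in_batches(df["texts"].tolist(), batch_fn)`.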

Solution 2:[2]

Bill's answer is great, but running the code raised an error on my end as of 2022/05.

TypeError: torch.max received an invalid combination of arguments - got 
(numpy.ndarray, dim=int), but expected one of: (torch.FloatTensor source)
(torch.FloatTensor source, torch.FloatTensor other) didn’t match because some of the keywords were incorrect: dim 
(torch.FloatTensor source, int dim) 
(torch.FloatTensor source, int dim, bool keepdim)

It seems the structure of the model output has changed. It is no longer a single tensor but a tuple containing the tensor and some other objects.

Changing from torch.max(output, dim=1) to torch.max(output[0], dim=1) solves this issue. See ref: https://discuss.pytorch.org/t/how-to-solve-this-torch-max-error/106432
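
Since the return type depends on how the model class is written and on the library version, it may be safer to normalize the output before calling torch.max. The helper below is a sketch under that assumption: `extract_logits` is a hypothetical name, and it assumes the scores are either the output itself, the first element of a tuple, or a `logits` attribute (as on Hugging Face sequence classification outputs). The stand-in objects are plain Python so the example runs without a model.

```python
from types import SimpleNamespace
from typing import Any


def extract_logits(output: Any) -> Any:
    """Return the class scores from a model output of unknown structure."""
    # Output objects that expose the scores as an attribute
    if hasattr(output, "logits"):
        return output.logits
    # Tuple outputs: by assumption the scores come first
    if isinstance(output, tuple):
        return output[0]
    # Otherwise assume the output already is the score tensor
    return output


# Stand-in for a score tensor (assumption, for illustration only)
scores = object()

print(extract_logits(scores) is scores)
print(extract_logits((scores, "other stuff")) is scores)
print(extract_logits(SimpleNamespace(logits=scores)) is scores)
```

In the prediction function this would be used as `_, prediction = torch.max(extract_logits(output), dim=1)`.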

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Bill
Solution 2 Wei Mintao