Python: 'utf-8' codec can't decode byte 0xe0

import re

dictionary = dict()

for line in open('Group14.csv', encoding="utf8"):
    line = line.strip()

    # Raw strings keep backslash sequences such as \w and \s intact for the regex engine
    date = re.findall(r'(\w+\s\w+\s\d+)\s\d+\S\d+\S\d+\s\S+\s(\d+)', line)
    tweet = re.findall(r',(.*)', line)
    #print(date[0], tweet[0])
    for key, value in dictionary.items():
        if tweet[0] in dictionary.values():
            dictionary[date[0]] += 1
        else:
            dictionary[date[0]] = tweet[0]
print(dictionary)

I want to read data from Group14.csv and remove extra whitespace. Then I want to loop through the second column and apply a cleaning condition to each cell. If the condition is true, print that cell together with the adjacent column 1 cell. If it is false, skip the line.

Then I want to write the cleaned data, with both columns, to another CSV file.

Note: the first column is the Twitter date and the second is the tweet.



Solution 1:[1]

As Serge Ballesta pointed out in the question comments:

Your input file is likely to be in a non UTF8 encoding, probably latin1... 0xe0 is latin1 code for à

I had the same issue and that was it.
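To see why the decoder complains, here is a minimal sketch with made-up sample bytes (not taken from Group14.csv). Under latin1 the byte 0xe0 is the single character à, while UTF-8 treats 0xe0 as the lead byte of a three-byte sequence and rejects it when the following bytes are not continuation bytes.

```python
# Hypothetical sample: "à la carte" as latin1 bytes, so à is the single byte 0xe0.
raw = b"\xe0 la carte"

# latin1 maps every byte to exactly one character, so this always succeeds.
decoded = raw.decode("latin1")

# UTF-8 expects continuation bytes after 0xe0; a space (0x20) is not one,
# so decoding raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print(err)
```

Note that latin1 never fails to decode, so a successful read only shows the file is readable, not that latin1 is the correct encoding. It is worth checking that accented characters look right afterwards.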

See the solution in this related SO question: "Pandas.read_csv() with special characters (accents) in column names".

In brief, you can specify the encoding when reading the CSV file. This should work:

# ...everything before

for line in open('Group14.csv', encoding="latin1"): # change the encoding

# ...everything after
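For the rest of the task described in the question (filter on the second column, then write both columns to another CSV), a rough sketch using the standard csv module might look like the following. Several things here are assumptions rather than anything from the question: the names `keep_tweet` and `clean_csv`, the placeholder cleaning condition, and the expectation that the file is comma-separated with the date first and the tweet second, with any tweet containing commas being quoted.

```python
import csv


def keep_tweet(tweet):
    # Placeholder cleaning condition (an assumption): keep non-empty tweets.
    # Replace this with the real rule.
    return bool(tweet)


def clean_csv(src_path, dst_path, src_encoding="latin1"):
    """Copy rows whose second column passes keep_tweet; return how many were kept."""
    kept = 0
    with open(src_path, encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            # Strip surrounding whitespace from both columns
            date, tweet = row[0].strip(), row[1].strip()
            if keep_tweet(tweet):
                writer.writerow([date, tweet])
                kept += 1
    return kept
```

It could be called as `clean_csv('Group14.csv', 'Group14_clean.csv')`, where the output filename is only an example. The output is written as UTF-8, so later reads no longer depend on guessing the source encoding.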

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Sushi2all