Python 'utf-8' codec can't decode byte 0xe0
import re
dictionary = dict()
for line in open('Group14.csv', encoding="utf8"):
    line = line.strip()
    date = re.findall('(\w+\s\w+\s\d+)\s\d+\S\d+\S\d+\s\S+\s(\d+)', line)
    tweet = re.findall(',(.*)', line)
    #print(date[0], tweet[0])
    for key, value in dictionary.items():
        if tweet[0] in dictionary.values():
            dictionary[date[0]] += 1
        else:
            dictionary[date[0]] = tweet[0]
print(dictionary)
I want to read data from a single file, Group14.csv, and remove extra whitespace. For the second column in Group14.csv, I want to loop through it and run a cleaning condition: if it is true, print that cell together with the adjacent column 1 cell; if it is false, skip the line.
Then I want to output my cleaned data, with both columns, to another CSV file.
Note: the first column is the Twitter date and the second is the tweet.
Solution 1:[1]
As Serge Ballesta pointed out in the question comments:
Your input file is likely to be in a non UTF8 encoding, probably latin1... 0xe0 is latin1 code for à
I had the same issue and that was it.
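To see why that one byte trips the utf-8 codec, here is a minimal sketch. The sample word and the `try_decode` helper are made up for illustration; they are not from the question's file.

```python
# Hypothetical sample: "voilà" encoded as Latin-1, where à is the single byte 0xe0.
raw = "voilà".encode("latin1")

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are not valid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# In UTF-8, 0xe0 announces a three-byte sequence, so a lone 0xe0 cannot be decoded.
utf8_result = try_decode(raw, "utf-8")
# Latin-1 maps every byte to a character, so the same bytes decode cleanly.
latin1_result = try_decode(raw, "latin1")
print(utf8_result, latin1_result)
```

Note that latin1 never raises on any byte sequence, so a successful decode does not by itself prove the file really is Latin-1; it is worth checking that the accented characters look right afterwards.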
See the solution in this related SO question: Pandas.read_csv() with special characters (accents) in column names?
In brief, you can define the encoding when reading the csv file. This should work:
# ...everything before
for line in open('Group14.csv', encoding="latin1"):  # change the encoding
    # ...everything after
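For the second half of the question (writing the cleaned rows to another CSV), the same encoding change can be combined with the standard `csv` module. This is only a rough sketch under assumptions: the `clean_csv` function, the `keep` predicate standing in for the cleaning condition, the output file name, and the sample rows are all hypothetical, and it assumes any tweet containing a comma is quoted in the file.

```python
import csv
import os
import tempfile

def clean_csv(src_path, dst_path, keep, src_encoding="latin1"):
    """Copy two-column rows from src_path to dst_path when keep(tweet) is true.

    keep is a hypothetical predicate standing in for the asker's cleaning condition.
    Returns the number of rows written.
    """
    written = 0
    # newline="" is what the csv module documentation recommends for file objects.
    with open(src_path, encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            date = " ".join(row[0].split())   # collapse extra whitespace
            tweet = " ".join(row[1].split())
            if keep(tweet):
                writer.writerow([date, tweet])
                written += 1
    return written

# Demo with made-up data written as Latin-1 (à is stored as the byte 0xe0).
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "Group14.csv")
dst_file = os.path.join(workdir, "Group14_clean.csv")
with open(src_file, "w", encoding="latin1", newline="") as f:
    f.write("Mon Jan 01 10:00:00 +0000 2018,  voilà   un tweet \r\n")
    f.write("Tue Jan 02 11:00:00 +0000 2018,\r\n")

# Example condition: keep only rows whose tweet is not empty.
count = clean_csv(src_file, dst_file, keep=lambda tweet: tweet != "")
print(count)
```

The output is written as UTF-8 here so that later reads with `encoding="utf8"` work; pass whichever condition you actually need as `keep`.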
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Sushi2all |