Unrecognized character in header of csv
import csv

def readCSV(filename, begin_date="01/07/2020", end_date="30/09/2020"):
    file = open(filename)
    csvreader = csv.reader(file)
    header = []
    header = next(csvreader)

if __name__ == '__main__':
    raw_load_data = readCSV("Total_load_2020.csv")
    raw_forecast_data = readCSV("Total_load_forecast_2020.csv")
The data is a CSV file (downloaded online) and looks like this:
RowDate,RowTime,TotalLoadForecast
01/01/2020,00:00,8600.52
01/01/2020,00:15,8502.06
01/01/2020,00:30,8396.45
...
But the output contains some weird characters (not present in the data):
['ï»¿RowDate', 'RowTime', 'TotalLoad']
['ï»¿RowDate', 'RowTime', 'TotalLoadForecast']
Of course, I can easily remove them. But why does that happen in the first place?
Solution 1:[1]
Yes, that's a BOM, U+FEFF BYTE ORDER MARK. OP's file is probably encoded as UTF-8, but OP appears to be decoding it as CP-1252.
I say that because the three-byte sequence for a UTF-8-encoded BOM is \xEF\xBB\xBF, which appears as ï»¿ when (wrongly?) decoded as CP-1252:
Encoding | Representation (hexadecimal) | Representation (decimal) | Bytes as CP1252 characters |
---|---|---|---|
UTF-8 | EF BB BF | 239 187 191 | ï»¿ |
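The same mismatch can be reproduced in a few lines of Python. This is only a minimal sketch: the header string is taken from the question, and the variable names are made up for illustration. It decodes the same bytes three ways to show where the stray characters come from.

```python
import codecs

# The UTF-8 encoding of U+FEFF is the three bytes EF BB BF.
bom_bytes = codecs.BOM_UTF8
raw = bom_bytes + b"RowDate,RowTime,TotalLoadForecast"

# CP-1252 maps each of the three bytes to its own character.
as_cp1252 = raw.decode("cp1252")

# UTF-8 combines the three bytes into the single code point U+FEFF.
as_utf8 = raw.decode("utf-8")

# utf-8-sig decodes as UTF-8 and drops a leading BOM if one is present.
as_utf8_sig = raw.decode("utf-8-sig")

print(repr(as_cp1252[:3]))
print(repr(as_utf8[0]))
print(as_utf8_sig.split(",")[0])
```

Only the third decoding leaves the first column name clean without any manual stripping.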
Here's how to mock up OP's data with a leading BOM, from a BSD shell:
% echo -e '\xEF\xBB\xBFRowDate,RowTime,TotalLoadForecast
01/01/2020,00:00,8600.52
01/01/2020,00:15,8502.06
01/01/2020,00:30,8396.45' > sample.csv
and confirm it's there with less sample.csv:
<U+FEFF>RowDate,RowTime,TotalLoadForecast
01/01/2020,00:00,8600.52
01/01/2020,00:15,8502.06
01/01/2020,00:30,8396.45
sample.csv (END)
Less is correctly interpreting the three UTF-8 bytes as the Unicode code point U+FEFF.
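If less isn't available, a similar check can be done from Python by looking at the raw bytes. The following is a rough sketch, not part of the original answer: has_utf8_bom is a hypothetical helper name, and the two temporary files exist only to make the example self-contained.

```python
import codecs
import os
import tempfile

def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM bytes."""
    with open(path, "rb") as f:  # binary mode: no decoding is applied
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8

sample = "RowDate,RowTime,TotalLoadForecast\n01/01/2020,00:00,8600.52\n"
tmp_dir = tempfile.mkdtemp()
with_bom = os.path.join(tmp_dir, "with_bom.csv")
without_bom = os.path.join(tmp_dir, "without_bom.csv")

# Writing with utf-8-sig prepends a BOM; plain utf-8 does not.
with open(with_bom, "w", newline="", encoding="utf-8-sig") as f:
    f.write(sample)
with open(without_bom, "w", newline="", encoding="utf-8") as f:
    f.write(sample)

print(has_utf8_bom(with_bom))
print(has_utf8_bom(without_bom))
```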
If OP still needs to read this file as CP-1252, they can try with the following... but I think they'll get errors because it doesn't actually seem like it is CP-1252:
import csv

with open('sample.csv', 'r', newline='', encoding='cp1252') as f:
    # Read the first 3 characters (one per byte when decoding as CP-1252)
    leading_bytes = f.read(3)
    if leading_bytes != 'ï»¿':
        f.seek(0)  # Not a BOM, reset stream to beginning of file
    else:
        pass  # skip BOM

    reader = csv.reader(f)
    for row in reader:
        print(row)
But, I really think this file should be decoded as UTF-8:
# Pass the encoding explicitly: the default depends on the platform and locale settings
with open('sample.csv', 'r', newline='', encoding='utf-8') as f:
    # Read the first (decoded) Unicode code point
    first_unicode_char = f.read(1)
    if first_unicode_char != '\ufeff':
        f.seek(0)  # Not a BOM, reset stream to beginning of file
or, let Python handle the guesswork and eliminate a BOM if it exists, with the utf_8_sig decoder:
with open('sample.csv', 'r', newline='', encoding='utf_8_sig') as f:
    reader = csv.reader(f)
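To illustrate that utf_8_sig is safe whether or not the file actually starts with a BOM, here is a small self-contained sketch. The read_header helper and the temporary file names are assumptions made for the example; the sample row comes from the question.

```python
import csv
import os
import tempfile

def read_header(path):
    """Return the first CSV row, ignoring a leading UTF-8 BOM if present."""
    with open(path, "r", newline="", encoding="utf-8-sig") as f:
        return next(csv.reader(f))

body = b"RowDate,RowTime,TotalLoadForecast\r\n01/01/2020,00:00,8600.52\r\n"
tmp_dir = tempfile.mkdtemp()
paths = {}

# Write the same content twice: once with a BOM prefix, once without.
for name, prefix in (("bom", b"\xef\xbb\xbf"), ("plain", b"")):
    paths[name] = os.path.join(tmp_dir, name + ".csv")
    with open(paths[name], "wb") as f:
        f.write(prefix + body)

header_bom = read_header(paths["bom"])
header_plain = read_header(paths["plain"])
print(header_bom)
print(header_plain)
```

Both calls should yield the same header, so the decoder can be used without first checking how the file was produced.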
Solution 2:[2]
Just change the line "file = open(filename)" to "file = open(filename, encoding='utf_8_sig')":
def readCSV(filename, begin_date="01/07/2020", end_date="30/09/2020"):
    file = open(filename, encoding='utf_8_sig')
    csvreader = csv.reader(file)
    header = []
    header = next(csvreader)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | DM Equinox |