Unrecognized character in header of CSV

import csv

def readCSV(filename, begin_date="01/07/2020", end_date="30/09/2020"):
    file = open(filename)
    csvreader = csv.reader(file)
    header = []
    header = next(csvreader)
    print(header)  # the header row as parsed

if __name__ == '__main__':
    raw_load_data = readCSV("Total_load_2020.csv")
    raw_forecast_data = readCSV("Total_load_forecast_2020.csv")

The data is a CSV file (downloaded online) and looks as follows:

RowDate,RowTime,TotalLoadForecast
01/01/2020,00:00,8600.52
01/01/2020,00:15,8502.06
01/01/2020,00:30,8396.45
...

But the output contains some weird characters that are not in the data:

['ï»¿RowDate', 'RowTime', 'TotalLoad']
['ï»¿RowDate', 'RowTime', 'TotalLoadForecast']

Of course, I can easily remove them. But why does this happen in the first place?



Solution 1:[1]

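Yes, that's a BOM, U+FEFF BYTE ORDER MARK. OP's file is probably encoded as UTF-8, but OP appears to be decoding it as CP-1252. (open() without an encoding argument uses the platform's locale encoding, which is often CP-1252 on Windows.)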

I say that because the three-byte sequence for a UTF-8-encoded BOM is \xEF\xBB\xBF, which appears as ï»¿ when (wrongly) decoded as CP-1252:

Encoding  Representation (hexadecimal)  Representation (decimal)  Bytes as CP1252 characters
UTF-8     EF BB BF                      239 187 191               ï»¿
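
The same three bytes can be checked from Python. This is only a small sketch using the standard codecs module; the variable names are mine, not OP's:

```python
import codecs

bom_bytes = codecs.BOM_UTF8                  # the three bytes EF BB BF

as_cp1252 = bom_bytes.decode("cp1252")       # each byte becomes its own character
as_utf8 = bom_bytes.decode("utf-8")          # the three bytes become one code point
as_utf8_sig = bom_bytes.decode("utf-8-sig")  # a leading BOM is dropped

print(repr(as_cp1252))
print(repr(as_utf8))
print(repr(as_utf8_sig))
```

Decoded as CP-1252 the result is three separate characters, decoded as UTF-8 it is the single code point U+FEFF, and decoded as utf-8-sig it is an empty string.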

Here's how to mock up OP's data with a leading BOM, from a BSD shell:

% echo -e '\xEF\xBB\xBFRowDate,RowTime,TotalLoadForecast
01/01/2020,00:00,8600.52
01/01/2020,00:15,8502.06
01/01/2020,00:30,8396.45' > sample.csv

and confirm it's there with less sample.csv:

<U+FEFF>RowDate,RowTime,TotalLoadForecast
01/01/2020,00:00,8600.52
01/01/2020,00:15,8502.06
01/01/2020,00:30,8396.45
sample.csv (END)

Less is correctly interpreting the three UTF-8 bytes as the Unicode code-point U+FEFF.
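
If echo -e or less behave differently on your system, roughly the same mock-up and check can be done in Python. This is a sketch under assumptions: the helper names (write_sample_with_bom, has_utf8_bom) are hypothetical, and the file is written to a temporary directory rather than the current one:

```python
import codecs
import os
import tempfile

SAMPLE_ROWS = (
    "RowDate,RowTime,TotalLoadForecast\n"
    "01/01/2020,00:00,8600.52\n"
    "01/01/2020,00:15,8502.06\n"
    "01/01/2020,00:30,8396.45\n"
)

def write_sample_with_bom(path):
    """Write the sample rows as UTF-8, preceded by the UTF-8 BOM bytes."""
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + SAMPLE_ROWS.encode("utf-8"))

def has_utf8_bom(path):
    """Return True if the file starts with the bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8

# Assumption: a temporary directory stands in for wherever sample.csv lives.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
write_sample_with_bom(sample_path)
print(has_utf8_bom(sample_path))
```

Reading in binary mode sidesteps the decoding question entirely, so the check does not depend on the platform's default encoding.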

If OP still needs to read this file as CP-1252, they can try the following, but I think they'll get errors, because the file doesn't actually seem to be CP-1252:

import csv

with open('sample.csv', 'r', newline='', encoding='cp1252') as f:
    # Read the first 3 characters (in CP-1252, one character per byte)
    leading_chars = f.read(3)

    if leading_chars != 'ï»¿':
        f.seek(0)  # Not a BOM, reset stream to beginning of file

    reader = csv.reader(f)
    for row in reader:
        print(row)

But, I really think this file should be decoded as UTF-8:

with open('sample.csv', 'r', newline='', encoding='utf-8') as f:  # be explicit: the default is platform-dependent
    # Read the first (decoded) Unicode code point
    first_unicode_char = f.read(1)

    if first_unicode_char != '\ufeff':
        f.seek(0)  # Not a BOM, reset stream to beginning of file

    reader = csv.reader(f)
    for row in reader:
        print(row)

or, let Python handle the guesswork and eliminate a BOM if it exists, with the utf_8_sig decoder:

with open('sample.csv', 'r', newline='', encoding='utf_8_sig') as f:
    header = next(csv.reader(f))  # first field is 'RowDate', with no BOM prefix
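
To see the three decodings side by side, here is a small self-contained comparison. It is a sketch, not OP's code: read_header is a hypothetical helper, and the file is a temporary mock-up with a leading BOM:

```python
import codecs
import csv
import os
import tempfile

def read_header(path, encoding):
    """Return the first CSV row of the file, decoded with the given encoding."""
    with open(path, "r", newline="", encoding=encoding) as f:
        return next(csv.reader(f))

# Assumption: a temporary file with a leading BOM stands in for OP's download.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8)
    f.write(b"RowDate,RowTime,TotalLoadForecast\n01/01/2020,00:00,8600.52\n")

headers = {enc: read_header(path, enc) for enc in ("cp1252", "utf-8", "utf-8-sig")}
for enc, header in headers.items():
    print(enc, [ascii(field) for field in header])
```

With cp1252 the first field carries the three stray characters, with utf-8 it carries a single invisible U+FEFF, and only utf-8-sig yields a plain 'RowDate'.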

Solution 2:[2]

Just change the line "file = open(filename)" to "file = open(filename, encoding='utf_8_sig')":

def readCSV(filename, begin_date="01/07/2020", end_date="30/09/2020"):
    file = open(filename, encoding='utf_8_sig')
    csvreader = csv.reader(file)
    header = []
    header = next(csvreader)
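
As a possible variation on the function above, the file can be opened in a with block so it is closed automatically, and the parsed data can be returned to the caller. This is a sketch under assumptions: read_csv and its return shape are hypothetical, the date parameters are left out because the original code does not show how they are used, and the test files are temporary mock-ups:

```python
import codecs
import csv
import os
import tempfile

def read_csv(filename, encoding="utf-8-sig"):
    """Return (header, rows); a leading UTF-8 BOM is dropped if present."""
    with open(filename, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    return header, rows

# Assumption: two temporary files, one with a BOM and one without.
body = b"RowDate,RowTime,TotalLoadForecast\n01/01/2020,00:00,8600.52\n"
tmp_dir = tempfile.mkdtemp()
with_bom = os.path.join(tmp_dir, "with_bom.csv")
without_bom = os.path.join(tmp_dir, "without_bom.csv")
with open(with_bom, "wb") as f:
    f.write(codecs.BOM_UTF8 + body)
with open(without_bom, "wb") as f:
    f.write(body)

header_a, rows_a = read_csv(with_bom)
header_b, rows_b = read_csv(without_bom)
print(header_a == header_b)
```

Because utf-8-sig only skips a BOM when one is present, the same call works for files saved with or without it.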

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1
Solution 2  DM Equinox