Python Pandas ParserError: Error tokenizing data. C error with Very Large Dataset

I am new to python so thank you for your patience with me.

I am in the process of converting a very large txt file to a csv file in python so I can use it in mysql. There are 14549030 records within the file. I am running into this error:

ParserError: Error tokenizing data. C error: Expected 19 fields in line 13297995, saw 22 

This happens when I try to use pandas to manipulate the file. I am running this to convert the CSV to a DataFrame:

import pandas as pd
import pymysql

print('convert CSV to dataframe')
data = pd.read_csv ('mydata.csv', delimiter=',', header=None)
df = pd.DataFrame(data)

When I look at the line compared to the others, it looks like it has the same number of fields and the same formatting, so I am not sure what the issue is or why it is being read as having more fields than the others (and it looks like there are around 50,000 lines that are causing this error). Is it a memory issue? Any help fixing the error is appreciated. Thank you!

To give some context: the text file's column separators were || instead of commas, and one of the columns had names that were separated by commas. I had help successfully converting || into commas.



Solution 1:[1]

pd.read_csv already returns a DataFrame. So, in your code, "data" is a DataFrame, and the line df = pd.DataFrame(data) just builds a new DataFrame from an existing one, which is redundant.

Your problem happened somewhere before the code you're showing. It is probably not a memory issue: after converting the || into commas, some lines now have more columns (22) than they are supposed to (19), because the data in one of your columns contains commas itself.
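To check this diagnosis before changing anything, you could count how many fields each line splits into. This is only a minimal sketch: the helper name field_count_histogram and the three-row sample are made up for illustration, not taken from your file.

```python
import csv
import io
from collections import Counter


def field_count_histogram(lines, delimiter=","):
    """Return a Counter mapping 'number of fields' to 'number of rows'."""
    reader = csv.reader(lines, delimiter=delimiter)
    return Counter(len(row) for row in reader)


# Hypothetical sample: the second row has a name column that contains commas,
# so it splits into more fields than the other rows.
sample = io.StringIO("1,Smith,NY\n2,Doe, Jane, Jr.,CA\n3,Lee,TX\n")
histogram = field_count_histogram(sample)
print(histogram)
```

On the real file you would pass an open file object instead, for example open('mydata.csv', newline=''). If most rows report 19 fields and a minority report more, the extra commas inside the name column are the likely cause.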

Maybe you can try to read the original .txt file with pandas using || as the separator. pandas treats any separator longer than one character as a regular expression, and | is a regex metacharacter, so it has to be escaped: try \|\| in a raw string, or \\|\\| in a regular string.
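A minimal sketch of that first option, using a small in-memory sample in place of the original .txt file (the sample rows are invented). A regex separator makes pandas use the Python parsing engine, which is slower than the C engine, so on a file of this size it may take a while.

```python
import io

import pandas as pd

# Hypothetical stand-in for the original ||-separated text file.
sample = io.StringIO("1||Doe, Jane||CA\n2||Smith, John, Jr.||NY\n")

# r"\|\|" is a regex matching two literal pipe characters.
# engine="python" is stated explicitly because regex separators need it.
df = pd.read_csv(sample, sep=r"\|\|", engine="python", header=None)
print(df)
```

The commas inside the name column are kept as part of the value, because a comma is no longer a separator here.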

Or, when you transform the txt to csv, you could enclose the values that contain commas in some special character, like quotation marks ", and then pass that character to the quotechar parameter of read_csv.
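A sketch of that second option, assuming each record sits on one line and no value contains || itself. The function name convert_pipes_to_csv and the sample data are hypothetical. Python's csv.writer adds the quotation marks for you around any field that contains a comma.

```python
import csv
import io

import pandas as pd


def convert_pipes_to_csv(src, dst):
    """Split each line of src on '||' and write it to dst as a CSV row."""
    writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL)
    for line in src:
        writer.writerow(line.rstrip("\r\n").split("||"))


# Hypothetical stand-in for the original ||-separated text file.
src = io.StringIO("1||Doe, Jane||CA\n2||Lee||TX\n")
dst = io.StringIO()
convert_pipes_to_csv(src, dst)
csv_text = dst.getvalue()

# The quoted field is read back as a single column.
df = pd.read_csv(io.StringIO(csv_text), header=None, quotechar='"')
print(csv_text)
```

With real files, open the output with newline='' so the csv module controls the line endings. The double quote is already the default quotechar in read_csv, so it is passed here only to make the connection explicit.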

By the way, a csv file is just a plain text file with a particular formatting. So, it doesn't make sense to use pandas if you have previously edited your .txt file to replace the || with commas. Just follow the same procedure to enclose the special values with commas in a way that mysql can parse properly (probably quotation marks).

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ignatius Reilly