Pandas read_csv not reading a file properly: not splitting into proper columns

So I'm trying to read in this dataset from Kaggle.

https://www.kaggle.com/gmadevs/atp-matches-dataset#atp_matches_2016.csv

I'm using pandas' read_csv function to do so, but it isn't splitting the columns properly. I've tried this code

df_2016 = pd.read_csv("Path/to/file/atp_matches_2016.csv")

Printing the data frame gives me this, though:

                                                                                                                                         tourney_id  ... l_bpFaced
2016-M020 Brisbane Hard 32.0 A 20160104.0 300.0 105683.0 4.0 NaN Milos Raonic  R 196.0 CAN 25.021218 14.0 2170.0 103819.0 1.0  NaN    Roger Federer  ...       NaN
                                          299.0 103819.0 1.0 NaN Roger Federer R 185.0 SUI 34.406571 3.0  8265.0 106233.0 8.0  NaN    Dominic Thiem  ...       NaN
                                          298.0 105683.0 4.0 NaN Milos Raonic  R 196.0 CAN 25.021218 14.0 2170.0 106071.0 7.0  NaN    Bernard Tomic  ...       NaN
                                          297.0 103819.0 1.0 NaN Roger Federer R 185.0 SUI 34.406571 3.0  8265.0 105777.0 NaN  NaN  Grigor Dimitrov  ...       NaN
                                          296.0 106233.0 8.0 NaN Dominic Thiem R NaN   AUT 22.335387 20.0 1600.0 105227.0 3.0  NaN      Marin Cilic  ...       NaN

Why is it having a problem splitting the columns?

I'm expecting an output of this, which is what I got for every year except 2016 and 2017 for some reason.

  tourney_id tourney_name surface  ...  l_SvGms l_bpSaved  l_bpFaced
0   2015-329        Tokyo    Hard  ...     10.0       2.0        5.0
1   2015-329        Tokyo    Hard  ...     13.0      12.0       19.0
2   2015-329        Tokyo    Hard  ...     18.0       9.0       11.0
3   2015-329        Tokyo    Hard  ...     13.0       4.0        8.0
4   2015-329        Tokyo    Hard  ...     10.0       1.0        5.0

The actual csv file looks to be in good shape and in a format identical to the other years. I also tried to specify the columns with the names parameter of read_csv, but that gives me the same output.



Solution 1:[1]

The safest way I can think of is to read the csv twice:

import pandas as pd

# Read the data without the header row, then drop the all-NaN columns
# produced by the trailing commas.
rows = pd.read_csv('path/to/atp_matches_2016.csv', skiprows=[0], header=None)
rows = rows.dropna(axis=1, how='all')

# Re-read just the header line to recover the column names.
rows.columns = pd.read_csv('path/to/atp_matches_2016.csv', nrows=0).columns
print(rows.head(5))

Output:

  tourney_id tourney_name surface  draw_size tourney_level  tourney_date  \
0  2016-M020     Brisbane    Hard       32.0             A    20160104.0   
1  2016-M020     Brisbane    Hard       32.0             A    20160104.0   
2  2016-M020     Brisbane    Hard       32.0             A    20160104.0   
3  2016-M020     Brisbane    Hard       32.0             A    20160104.0   
4  2016-M020     Brisbane    Hard       32.0             A    20160104.0 



   match_num  winner_id  winner_seed winner_entry  ... w_bpFaced l_ace  l_df  \
0      300.0   105683.0          4.0          NaN  ...       1.0   7.0   3.0   
1      299.0   103819.0          1.0          NaN  ...       1.0   2.0   4.0   
2      298.0   105683.0          4.0          NaN  ...       4.0  10.0   3.0   
3      297.0   103819.0          1.0          NaN  ...       1.0   8.0   2.0   
4      296.0   106233.0          8.0          NaN  ...       2.0  11.0   2.0   

  l_svpt  l_1stIn  l_1stWon  l_2ndWon  l_SvGms  l_bpSaved l_bpFaced  
0   61.0     34.0      25.0      14.0     10.0        3.0       5.0  
1   55.0     31.0      18.0       9.0      8.0        2.0       6.0  
2   84.0     54.0      41.0      16.0     12.0        2.0       2.0  
3  104.0     62.0      46.0      21.0     16.0        8.0      11.0  
4   98.0     52.0      41.0      27.0     15.0        7.0       8.0  

Solution 2:[2]

Try this

import csv
import pandas as pd

with open('Path/to/file/atp_matches_2016.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    rows = list(reader)

df_2016 = pd.DataFrame(rows)

This creates an object that operates like a regular reader but maps the information in each row to a dict whose keys are given by the optional fieldnames parameter. The fieldnames parameter is a sequence whose elements are associated, in order, with the fields of the input data; these elements become the keys of the resulting dictionary. If fieldnames is omitted, the values in the first row of the file are used as the field names. If a row has more fields than the fieldnames sequence, the remaining data is added as a list keyed by the value of restkey, which is exactly what happens with trailing commas. If a row has fewer fields than the fieldnames sequence, the remaining keys take the value of the optional restval parameter.
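To make the restkey behavior concrete, here is a minimal sketch using a small in-memory sample (the data string and the "extra" key name are hypothetical, not from the actual Kaggle file): a row with trailing commas has more fields than the header, and the surplus empty fields get collected under restkey instead of corrupting the named columns.

```python
import csv
import io

# Hypothetical sample mimicking a data row padded with trailing commas.
data = "a,b,c\n1,2,3,,\n"

# Surplus fields beyond the header are collected under the "extra" key.
reader = csv.DictReader(io.StringIO(data), restkey="extra")
rows = list(reader)
print(rows[0])
```

The two trailing empty fields end up in rows[0]["extra"], leaving the a/b/c columns clean.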

Solution 3:[3]

If your .csv file has empty columns, you can use the argument keep_default_na=False to have read_csv interpret the empty columns as empty strings.

df_2016 = pd.read_csv("Path/to/file/atp_matches_2016.csv", keep_default_na=False)

Solution 4:[4]

One approach is to use the header row, which you know contains the correct columns, to restrict which columns are read:

cols = pd.read_csv('Path\\to\\file\\atp_matches_2016.csv', nrows=1).columns
df = pd.read_csv('Path\\to\\file\\atp_matches_2016.csv', usecols=cols)

Output:

df.head()

  tourney_id tourney_name surface  draw_size tourney_level    ...     l_1stWon  l_2ndWon  l_SvGms  l_bpSaved l_bpFaced
0  2016-M020     Brisbane    Hard       32.0             A    ...         25.0      14.0     10.0        3.0       5.0
1  2016-M020     Brisbane    Hard       32.0             A    ...         18.0       9.0      8.0        2.0       6.0
2  2016-M020     Brisbane    Hard       32.0             A    ...         41.0      16.0     12.0        2.0       2.0
3  2016-M020     Brisbane    Hard       32.0             A    ...         46.0      21.0     16.0        8.0      11.0
4  2016-M020     Brisbane    Hard       32.0             A    ...         41.0      27.0     15.0        7.0       8.0

[5 rows x 49 columns]

Solution 5:[5]

There are too many commas at the end of each line in the CSV file. You can see this when you open the CSV file in Notepad. Just find and replace all of these trailing commas with nothing in Notepad, and then the loading will work fine. I checked it. You can use this solution if the other solutions don't work.
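The same find-and-replace can be done programmatically instead of in Notepad. Below is a minimal sketch on an in-memory sample (the two data rows are hypothetical stand-ins for the real file): strip the trailing commas from each line, then hand the cleaned text to pandas.

```python
import io
import pandas as pd

# Hypothetical sample: data rows padded with trailing commas,
# while the header line is not.
raw = "tourney_id,surface\n2016-M020,Hard,,\n2016-M021,Clay,,\n"

# Strip trailing commas from each line before parsing.
cleaned = "\n".join(line.rstrip(",") for line in raw.splitlines())
df = pd.read_csv(io.StringIO(cleaned))
print(df)
```

For the real file you would read it with open(), clean the lines the same way, and write it back out (or wrap the cleaned string in io.StringIO as above).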

Solution 6:[6]

Remove "" (double quotes from your CSV file first and then import the data to the cell).

How to remove double quotes:

  1. Open the CSV file in Jupyter
  2. Click on Edit
  3. Choose Replace All
  4. Replace all "" with blank
  5. Import the file: df = pd.read_csv("x/y/z.csv")

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Chris
Solution 2: Shahad
Solution 3: Julian Drago
Solution 4: Todd Burus
Solution 5: moys
Solution 6: RiveN