Pandas read_csv not reading a file properly. Not splitting into proper columns
So I'm trying to read in this dataset from Kaggle.
https://www.kaggle.com/gmadevs/atp-matches-dataset#atp_matches_2016.csv
I'm using pandas' read_csv function to do so, but it isn't splitting the columns properly. I've tried this code
df_2016 = pd.read_csv("Path/to/file/atp_matches_2016.csv")
The printed out data frame gives me this though
tourney_id ... l_bpFaced
2016-M020 Brisbane Hard 32.0 A 20160104.0 300.0 105683.0 4.0 NaN Milos Raonic R 196.0 CAN 25.021218 14.0 2170.0 103819.0 1.0 NaN Roger Federer ... NaN
299.0 103819.0 1.0 NaN Roger Federer R 185.0 SUI 34.406571 3.0 8265.0 106233.0 8.0 NaN Dominic Thiem ... NaN
298.0 105683.0 4.0 NaN Milos Raonic R 196.0 CAN 25.021218 14.0 2170.0 106071.0 7.0 NaN Bernard Tomic ... NaN
297.0 103819.0 1.0 NaN Roger Federer R 185.0 SUI 34.406571 3.0 8265.0 105777.0 NaN NaN Grigor Dimitrov ... NaN
296.0 106233.0 8.0 NaN Dominic Thiem R NaN AUT 22.335387 20.0 1600.0 105227.0 3.0 NaN Marin Cilic ... NaN
Why is it having a problem splitting the columns?
I'm expecting an output of this, which is what I got for every year except 2016 and 2017 for some reason.
tourney_id tourney_name surface ... l_SvGms l_bpSaved l_bpFaced
0 2015-329 Tokyo Hard ... 10.0 2.0 5.0
1 2015-329 Tokyo Hard ... 13.0 12.0 19.0
2 2015-329 Tokyo Hard ... 18.0 9.0 11.0
3 2015-329 Tokyo Hard ... 13.0 4.0 8.0
4 2015-329 Tokyo Hard ... 10.0 1.0 5.0
The actual csv file looks to be in good shape and in a format identical to the other years. I also tried to specify the columns explicitly with the names parameter of read_csv, but that gives me the same output.
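One quick way to diagnose this kind of misalignment (a minimal sketch using hypothetical sample text, not the actual Kaggle file) is to count the fields in each row: if the data rows carry stray trailing commas, they report more fields than the header, which is exactly what makes pandas shift the columns.

```python
import csv

# Hypothetical sample mimicking the reported problem: a clean header
# row followed by data rows that end in stray trailing commas.
sample = (
    "tourney_id,tourney_name,surface\n"
    "2016-M020,Brisbane,Hard,,,\n"
    "2016-M020,Brisbane,Hard,,,\n"
)

# Count fields per row; a mismatch with the header points to stray
# delimiters in the file rather than a pandas bug.
rows = list(csv.reader(sample.splitlines()))
field_counts = [len(r) for r in rows]
print(field_counts)  # [3, 6, 6] — data rows have twice as many fields
```

If the counts disagree like this, the fixes below (dropping all-NaN columns, restricting usecols, or stripping the trailing commas) all target the same root cause.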
Solution 1:[1]
The safest way I can think of is to read the csv twice:
rows = pd.read_csv('path/to/atp_matches_2016.csv', skiprows=[0], header=None)  # skip the header line
rows = rows.dropna(axis=1, how='all')  # drop columns that only contain NaNs
rows.columns = pd.read_csv('path/to/atp_matches_2016.csv', nrows=0).columns
print(rows.head(5))
Output:
tourney_id tourney_name surface draw_size tourney_level tourney_date \
0 2016-M020 Brisbane Hard 32.0 A 20160104.0
1 2016-M020 Brisbane Hard 32.0 A 20160104.0
2 2016-M020 Brisbane Hard 32.0 A 20160104.0
3 2016-M020 Brisbane Hard 32.0 A 20160104.0
4 2016-M020 Brisbane Hard 32.0 A 20160104.0
match_num winner_id winner_seed winner_entry ... w_bpFaced l_ace l_df \
0 300.0 105683.0 4.0 NaN ... 1.0 7.0 3.0
1 299.0 103819.0 1.0 NaN ... 1.0 2.0 4.0
2 298.0 105683.0 4.0 NaN ... 4.0 10.0 3.0
3 297.0 103819.0 1.0 NaN ... 1.0 8.0 2.0
4 296.0 106233.0 8.0 NaN ... 2.0 11.0 2.0
l_svpt l_1stIn l_1stWon l_2ndWon l_SvGms l_bpSaved l_bpFaced
0 61.0 34.0 25.0 14.0 10.0 3.0 5.0
1 55.0 31.0 18.0 9.0 8.0 2.0 6.0
2 84.0 54.0 41.0 16.0 12.0 2.0 2.0
3 104.0 62.0 46.0 21.0 16.0 8.0 11.0
4 98.0 52.0 41.0 27.0 15.0 7.0 8.0
Solution 2:[2]
Try this:
import csv
import pandas as pd

with open('Path/to/file/atp_matches_2016.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    df_2016 = pd.DataFrame(reader)
DictReader creates an object which operates like a regular reader but maps the information in each row to a dict whose keys are given by the optional fieldnames parameter. The fieldnames parameter is a sequence whose elements are associated with the fields of the input data in order; these elements become the keys of the resulting dictionary. If fieldnames is omitted, the values in the first row of the file will be used as the fieldnames. If a row has more fields than fieldnames, the remaining data is put in a list keyed by the value of restkey. If a row has fewer fields than fieldnames, the missing keys take the value of the optional restval parameter.
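The restkey behavior described above is what makes DictReader tolerant of the extra trailing fields. A minimal sketch with hypothetical in-memory data (the "extra" key name is an arbitrary choice, not part of the file):

```python
import csv
import io

# Hypothetical sample with trailing commas, like the problematic file.
data = io.StringIO(
    "tourney_id,tourney_name,surface\n"
    "2016-M020,Brisbane,Hard,,\n"
)

# Fields beyond the header land in a list stored under restkey,
# leaving the named columns correctly aligned.
reader = csv.DictReader(data, restkey="extra")
row = next(reader)
print(row["tourney_id"])  # 2016-M020
print(row["extra"])       # ['', ''] — the stray trailing fields
```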
Solution 3:[3]
If your .csv file has empty columns, you can pass the argument keep_default_na=False to have read_csv interpret the empty fields as empty strings instead of NaN.
df_2016 = pd.read_csv("Path/to/file/atp_matches_2016.csv", keep_default_na=False)
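The effect is easy to see on a small hypothetical in-memory CSV (not the Kaggle file):

```python
import io
import pandas as pd

# Hypothetical CSV with an empty field in column b.
csv_text = "a,b,c\n1,,3\n"

# With keep_default_na=False the empty field stays an empty string
# instead of being converted to NaN.
df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(df["b"].iloc[0] == "")  # True
```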
Solution 4:[4]
One approach is to use the header row, which you know contains the correct columns, to restrict which columns pandas reads. As such:
cols = pd.read_csv('Path\\to\\file\\atp_matches_2016.csv', nrows=1).columns
df = pd.read_csv('Path\\to\\file\\atp_matches_2016.csv', usecols=cols)
Output:
df.head()
tourney_id tourney_name surface draw_size tourney_level ... l_1stWon l_2ndWon l_SvGms l_bpSaved l_bpFaced
0 2016-M020 Brisbane Hard 32.0 A ... 25.0 14.0 10.0 3.0 5.0
1 2016-M020 Brisbane Hard 32.0 A ... 18.0 9.0 8.0 2.0 6.0
2 2016-M020 Brisbane Hard 32.0 A ... 41.0 16.0 12.0 2.0 2.0
3 2016-M020 Brisbane Hard 32.0 A ... 46.0 21.0 16.0 8.0 11.0
4 2016-M020 Brisbane Hard 32.0 A ... 41.0 27.0 15.0 7.0 8.0
[5 rows x 49 columns]
Solution 5:[5]
There are too many commas at the end of each line in the CSV file. You can see this when you open the CSV file in Notepad. Just find & replace all of these trailing commas with nothing in Notepad, and the file will then load fine. I checked it. You can use this solution if the other solutions don't work.
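The same fix can be scripted instead of done by hand in Notepad. A minimal sketch on hypothetical in-memory lines (for a real file, you would read the lines, strip them, and write them back out):

```python
# Hypothetical lines mimicking the broken file: the data row ends in
# stray commas that create phantom columns.
raw = [
    "tourney_id,tourney_name,surface",
    "2016-M020,Brisbane,Hard,,,",
]

# Strip trailing commas from every line before handing the text to pandas.
cleaned = [line.rstrip(",") for line in raw]
print(cleaned[1])  # 2016-M020,Brisbane,Hard
```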
Solution 6:[6]
Remove the stray double quotes ("") from your CSV file first, and then import the data.
How to remove the double quotes:
- Open the CSV file in Jupyter
- Click on Edit
- Choose Replace All
- Replace all "" with blank
Then import the data:
df = pd.read_csv("x/y/z.csv")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Chris |
| Solution 2 | Shahad |
| Solution 3 | Julian Drago |
| Solution 4 | Todd Burus |
| Solution 5 | moys |
| Solution 6 | RiveN |