Removing NA values from a DataFrame in Python 3.4
import pandas as pd
import statistics
df=print(pd.read_csv('001.csv',keep_default_na=False, na_values=[""]))
print(df)
I am using this code to create a data frame that has no NA values. I have a couple of CSV files and I want to calculate the mean of one of the columns, sulfate. This column has many 'NA' values, which I am trying to exclude. Even after using the above code, the 'NA' values aren't excluded from the data frame. Please suggest a fix.
Solution 1:[1]
I think you should import the .csv file as it is and then manipulate the data frame. Then, you can use any of the methods below.
foo[foo.notnull()]
or
foo.dropna()
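As a minimal sketch of this approach applied to the question's data: it assumes the column is named sulfate and that missing entries appear in the file as the literal text NA. The sample rows are invented for illustration and stand in for 001.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for '001.csv'; the values are invented for illustration.
csv_text = (
    "Date,sulfate,nitrate\n"
    "2003-01-01,NA,NA\n"
    "2003-01-02,2.0,0.5\n"
    "2003-01-03,4.0,NA\n"
)

# With the default settings, read_csv parses the text 'NA' as NaN.
df = pd.read_csv(io.StringIO(csv_text))

# Drop the missing entries from the column, then average what is left.
sulfate = df['sulfate'].dropna()
sulfate_mean = sulfate.mean()
print(sulfate_mean)
```

Series.mean() skips NaN by default, so the explicit dropna() is mainly there to make the intent visible. Passing keep_default_na=False with na_values=[""], as in the question, tells read_csv to treat only empty fields as missing, which is a likely reason the 'NA' entries survived there.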
Solution 2:[2]
Method 1: Use pandas notnull
df[['A','C']].apply(lambda x: my_func(x) if np.all(pd.notnull(x[1])) else x, axis=1)
Method 2:
df = df[np.isfinite(df['EPS'])]
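To make the difference between the np.isfinite filter and a notnull filter concrete, here is a small sketch. The data is invented; only the column name 'EPS' comes from the line above.

```python
import numpy as np
import pandas as pd

# Hypothetical data; 'EPS' is the column name used in the answer.
df = pd.DataFrame({
    'EPS': [1.5, np.nan, np.inf, 2.0],
    'name': ['a', 'b', 'c', 'd'],
})

# np.isfinite is False for NaN and for +/-inf, so both kinds of rows go.
finite = df[np.isfinite(df['EPS'])]

# notnull is False only for missing values, so the inf row stays.
not_null = df[df['EPS'].notnull()]

print(finite.index.tolist())
print(not_null.index.tolist())
```

np.isfinite only works on numeric columns, so for a column that may hold strings the notnull form is the safer choice.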
Method 3: Using dropna
In [24]: df = pd.DataFrame(np.random.randn(10,3))
In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan;
In [26]: df
Out[26]:
          0         1         2
0       NaN       NaN       NaN
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN
In [27]: df.dropna() #drop all rows that have any NaN values
Out[27]:
          0         1         2
1  2.677677 -1.466923 -0.750366
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295
Solution 3:[3]
I got the same error until I added axis=0 and how='any'.
df=df.dropna(axis=0, how='any')
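For reference, a short sketch of how these dropna arguments behave on a small, made-up frame. The subset variant is not part of the answer above; it is included because the question only cares about one column.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [4.0, 5.0, np.nan],
})

# Keep only rows that contain no NaN in any column.
any_missing = df.dropna(axis=0, how='any')

# Drop only rows in which every column is NaN.
all_missing = df.dropna(axis=0, how='all')

# Look for NaN in column 'a' only.
only_a = df.dropna(subset=['a'])

print(len(any_missing), len(all_missing), len(only_a))
```

axis=0 and how='any' are the documented defaults of dropna, so the part of the line above that most likely matters is the assignment back to df: dropna returns a new DataFrame and leaves the original unchanged unless inplace=True is passed.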
Solution 4:[4]
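# Assumption: 'columns' is meant to be every column of df
columns=df.columns
# Count how many times the placeholder '?' appears in each column
columsMissng=[]
for i in columns:
    c=df.loc[df[i] == '?', i].count()
    columsMissng.append((i,c))
# Collect the columns whose placeholder count exceeds the threshold
dropcolumsMissng=[]
for i in columsMissng:
    if i[1]>20000:
        dropcolumsMissng.append(i[0])
newDF=df.drop(columns=dropcolumsMissng)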
In place of '?' you can put any value you want to count, and in place of the 20000 in if i[1]>20000: you can put your own threshold, such as 50% of the rows.
In case you want to remove columns with too many 'NaN' values:
# Count NaN per column as total rows minus non-null entries
c=newDF.columns.values
dropcolumsMissng=[]
for i in c:
    num_nans = len(newDF) - newDF[i].count()
    if num_nans>20000:
        dropcolumsMissng.append(i)
newDF=newDF.drop(columns=dropcolumsMissng)
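The same idea can also be written without an explicit loop. This is only a sketch under assumptions: the frame is made up, and max_nan_fraction is a hypothetical threshold expressed as a fraction of rows rather than the absolute count of 20000 used above.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    'x': [1.0, np.nan, np.nan, np.nan],
    'y': [1.0, 2.0, np.nan, 4.0],
    'z': [1.0, 2.0, 3.0, 4.0],
})

# Hypothetical threshold: drop columns where more than half the rows are NaN.
max_nan_fraction = 0.5

# isna() gives a boolean frame; its column-wise mean is the NaN fraction.
nan_fraction = df.isna().mean()

# Keep only the columns at or below the threshold.
newDF = df.loc[:, nan_fraction <= max_nan_fraction]

print(newDF.columns.tolist())
```

A fraction-based threshold keeps working when the number of rows changes, which an absolute count does not.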
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Pang |
Solution 2 | Community |
Solution 3 | Yuca |
Solution 4 | AVIK DUTTA |