df.isna().sum() is not working on the Titanic dataset
I tried the Titanic model on Kaggle, and it is weird that isna().sum() outputs wrong information.
import os
import pandas as pd
import numpy as np
import statsmodels.api as sm
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())
worksheet = gc.open('titanic_train').sheet1
titanic = worksheet.get_all_records()
titanic = pd.DataFrame(titanic)
titanic
titanic.info()
titanic.isna().sum()
The output is shown below.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 891 non-null object
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 891 non-null object
11 Embarked 891 non-null object
dtypes: float64(1), int64(5), object(6)
memory usage: 83.7+ KB
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
dtype: int64
It says the NaN count is 0, but there are several NaN values in Age and Embarked. Why can't it detect NaN? Is it because of the dtype?
Solution 1:[1]
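It is doing this because there are no NaNs in the DataFrame. Notice the info() output: every column, including Age, Cabin and Embarked, reports 891 non-null values, so there is nothing for isna() to count.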
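One plausible reason for the missing NaNs, which the question does not confirm, is that the blank spreadsheet cells arrived as empty strings rather than NaN. That would also explain why Age has dtype object in the info() output. The sketch below uses a small made-up list of records (a hypothetical stand-in for what the worksheet returned) to show that isna() ignores empty strings and that replacing them with np.nan restores the counts.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the records read from the sheet;
# blank cells are assumed to come back as empty strings.
records = [
    {"PassengerId": 1, "Age": 22, "Embarked": "S"},
    {"PassengerId": 2, "Age": "", "Embarked": "C"},
    {"PassengerId": 3, "Age": 26, "Embarked": ""},
]
sample = pd.DataFrame(records)

# Empty strings are ordinary values, so nothing is reported as missing.
before = sample.isna().sum()

# Turn the empty strings into real missing values.
cleaned = sample.replace("", np.nan)
after = cleaned.isna().sum()

print(before)
print(after)
```

If this is the cause, applying the same replace call to the titanic DataFrame before calling isna().sum() should give non-zero counts for the columns that have blank cells.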
Solution 2:[2]
It is because your pandas version is 1.2.4. When I downgraded to 0.24 or another lower version, I got the NaN values.
Solution 3:[3]
I imported the data in Google Colab as well and get the following output when running df.isna().sum():
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Did you make any column conversions? For instance, converting the Age column to strings (which leaves it with dtype object) turns any np.nan values into the string "nan", which pandas does not recognise as a missing value.
df["Age"] = df["Age"].astype(str)
df["Age"].isna().sum()
# output: 0
You can check for any "nan" values with this:
df["Age"].str.contains("nan").any()
# output: True
Converting them back to np.nan will solve the issue:
df["Age"] = df["Age"].replace("nan", np.nan)
df["Age"].isna().sum()
# output: 177
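As an alternative for a numeric column such as Age, pd.to_numeric with errors="coerce" converts anything that cannot be parsed as a number into NaN and gives the column a float dtype again. The sketch below runs on a small made-up Series, not the real dataset, and is only meant to illustrate the idea.

```python
import pandas as pd

# Made-up Age values stored as strings, including a "nan" string
# and an empty string, neither of which isna() treats as missing.
ages = pd.Series(["22.0", "nan", "", "38.0"], name="Age")
missing_before = int(ages.isna().sum())

# Coerce to numbers; values that are not numeric become NaN.
numeric_ages = pd.to_numeric(ages, errors="coerce")
missing_after = int(numeric_ages.isna().sum())

print(missing_before, missing_after, numeric_ages.dtype)
```

Unlike the string replacement above, this also restores a numeric dtype, so operations such as mean() work on the column afterwards. It is not suitable for text columns like Embarked, where every value would be coerced to NaN.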
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Divyessh |
Solution 2 | Mufseera |
Solution 3 | Akis Hadjimpalasis |