How to handle seaborn pairplot errors when the dataset has NaN values?

I have a pandas DataFrame whose first column holds categorical data and whose remaining columns are numeric. Several rows contain NaN values and zeros in various columns, although no row is entirely blank.

Rows that contain a NaN still hold valuable data in their other columns, and columns that contain a NaN still hold valuable data in their other rows.

The problem is that sns.pairplot does not ignore NaN values when computing the correlations and raises errors (such as division by zero, string to float conversion, etc.).

I have seen some people suggest the fillna() method, but I am hoping someone knows a more elegant way to do this, one that does not require spending numerous hours fixing the plot, axes, filters, etc. afterwards. I didn't like that workaround.

It is similar to what this person has reported:
https://github.com/mwaskom/seaborn/issues/1699

ZeroDivisionError: 0.0 cannot be raised to a negative power

Here is the sample dataset: [image of the sample dataset]



Solution 1:[1]

Seaborn's PairGrid class will allow you to create your desired plot. PairGrid is much more flexible than sns.pairplot. Every PairGrid has three sections: the upper triangle, the lower triangle and the diagonal.

For each part, you can define a customized plotting function. The upper and lower triangle sections can take any plotting function that accepts two arrays of features (such as plt.scatter) as well as any associated keywords (e.g. marker). The diagonal section accepts a plotting function that has a single feature array as input (such as plt.hist) in addition to the relevant keywords.

For your purpose, you can filter out the NaNs in your customized function(s):

from sklearn import datasets
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

data = datasets.load_iris()
iris = pd.DataFrame(data.data, columns=data.feature_names)

# break iris dataset to create NaNs
iris.iat[1, 0] = np.nan
iris.iat[4, 0] = np.nan
iris.iat[4, 2] = np.nan
iris.iat[5, 2] = np.nan

# create customized scatterplot that first filters out NaNs in feature pair
def scatterFilter(x, y, **kwargs):
    # keep only the rows where both features are present
    interimDf = pd.concat([x, y], axis=1)
    interimDf.columns = ['x', 'y']
    interimDf = interimDf.dropna()

    # PairGrid makes the target panel the current axes before calling this function
    plt.plot(interimDf.x.values, interimDf.y.values, 'o', **kwargs)
    
# Create an instance of the PairGrid class.
# (`height` replaced the older `size` argument in seaborn 0.9)
grid = sns.PairGrid(data=iris, vars=list(iris.columns), height=4)

# Map a scatter plot to the upper triangle
grid = grid.map_upper(scatterFilter, color='darkred')

# Map a histogram to the diagonal
grid = grid.map_diag(plt.hist, bins=10, edgecolor='k', color='darkred')

# Map the same NaN-filtered scatter plot to the lower triangle
grid = grid.map_lower(scatterFilter, color='darkred')

This will yield the following plot:

Iris Seaborn PairPlot
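
The same filtering idea can be applied to the diagonal. Rather than relying on how plt.hist treats NaNs, the sketch below drops them explicitly before drawing. The names finite_values and hist_filter, the small DataFrame, and the Agg backend (chosen only so the snippet runs without a display) are assumptions made for illustration and are not part of seaborn or of the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def finite_values(x):
    """Return the non-NaN entries of a single feature as a NumPy array."""
    return pd.Series(x).dropna().to_numpy()

def hist_filter(x, **kwargs):
    """Hypothetical diagonal function: filter NaNs, then draw a histogram."""
    plt.hist(finite_values(x), **kwargs)

# Made-up data with a NaN in each column
df = pd.DataFrame({
    "u": [1.0, 2.0, np.nan, 4.0, 5.0],
    "v": [np.nan, 1.5, 2.5, 3.5, 4.5],
})

grid = sns.PairGrid(df)
grid.map_diag(hist_filter, bins=5, edgecolor="k", color="darkred")
```

In the answer's code, hist_filter would take the place of plt.hist in the map_diag call.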

PairGrid also allows you to draw contour plots, annotate the panels with descriptive statistics, etc. For more details, see here.
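
As one possible illustration of annotating panels with a descriptive statistic, the sketch below writes a Pearson correlation into each upper-triangle panel, computed only on the rows where both features are present. The helper names pairwise_corr and annotate_corr, the toy DataFrame, and the Agg backend are hypothetical choices for this example, not part of seaborn's API.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def pairwise_corr(x, y):
    """Pearson r using only the rows where both x and y are not NaN."""
    pair = pd.DataFrame({
        "x": np.asarray(x, dtype=float),
        "y": np.asarray(y, dtype=float),
    }).dropna()
    return pair["x"].corr(pair["y"])

def annotate_corr(x, y, **kwargs):
    """Hypothetical panel function: print r in the corner of the current axes."""
    # kwargs (e.g. color, label supplied by PairGrid) are accepted but unused
    r = pairwise_corr(x, y)
    plt.gca().annotate(f"r = {r:.2f}", xy=(0.05, 0.9), xycoords="axes fraction")

# Made-up data: b is exactly 2*a and c is exactly 7 - a wherever both are present
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [2.0, np.nan, 6.0, 8.0, 10.0, 12.0],
    "c": [6.0, 5.0, 4.0, np.nan, 2.0, 1.0],
})

grid = sns.PairGrid(df)
grid.map_upper(annotate_corr)
```

Because the filtering happens per pair, a NaN in one column only removes that row from the panels involving that column, which keeps the valid data elsewhere in the row.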

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Gino Mempin