'How to fill null values in python datatable?

Pandas library has a really good function call .fillna() which can be used to fill null values

df = df.fillna(0)

I am using Datatable Library for my new assignment because it is very fast to load and work with huge data in Datatable.

Does such a function fillna exist in Datatable library of python?

Or what alternative do we have to fill missing values in Datatable?



Solution 1:[1]

As I stated in the comments, at the moment, there is no built in function that replicates all of pandas' fillna; this post looks at how some parts of fillna can be replicated in datatable, and the obvious limitations. Note also that you can raise a feature request on github page for this and other features that you may be interested in:

The example used here is from pandas fillna page and tweaked a bit to illustrate some important quirks:

import pandas as pd
import numpy as np
from datatable import dt, f, ifelse, update

Datatable:

DT = dt.Frame(
{
    "A": [None, 3.0, None, None],
    "B": [2, 4, None, 3],
    "C": [np.nan, np.nan, np.nan, np.nan],
    "D": [0, 1, 5, 4],
})

DT

          A      B        C      D
   | float64  int32  float64  int32
-- + -------  -----  -------  -----
 0 |      NA      2       NA      0
 1 |       3      4       NA      1
 2 |      NA     NA       NA      5
 3 |      NA      3       NA      4
[4 rows x 4 columns]

Pandas:

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                   columns=list('ABCD'))

df

    A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

In Pandas:

df.fillna(0)
    A   B   C   D
0   0.0 2.0 0.0 0
1   3.0 4.0 0.0 1
2   0.0 0.0 0.0 5
3   0.0 3.0 0.0 4

That was easy. One way to replicate this in datatable is to use the replace function:

DT.replace({None:0})

   |       A      B        C      D
   | float64  int32  float64  int32
-- + -------  -----  -------  -----
 0 |      NA      2       NA      0
 1 |       3      4       NA      1
 2 |      NA      0       NA      5
 3 |      NA      3       NA      4
[4 rows x 4 columns]

Notice that only the null values in the integer column was replaced; one caveat of the replace function is that it will only replace similar types (integer for integer, float for float, and object for object). So, the only way (at the moment) to replace null values for float is to pass a float value:

DT.replace({None:0.0})

DT
Out[15]: 
   |       A      B        C      D
   | float64  int32  float64  int32
-- + -------  -----  -------  -----
 0 |       0      2        0      0
 1 |       3      4        0      1
 2 |       0     NA        0      5
 3 |       0      3        0      4
[4 rows x 4 columns]

Note that the null fill was done twice(once for the integer column, the second time for the float columns). It is very fast, but it would be nice to do it just once like fillna in Pandas does. The only other option I know of at the moment is to use a dictionary comprehension:

DT[:,update(**{key: ifelse(f[key]==None,
                              0, 
                              f[key]) 
    for key in DT.names})]

DT
Out[20]: 
   |       A      B        C      D
   | float64  int32  float64  int32
-- + -------  -----  -------  -----
 0 |       0      2        0      0
 1 |       3      4        0      1
 2 |       0      0        0      5
 3 |       0      3        0      4
[4 rows x 4 columns]

Let's look at another example from the Pandas fillna page, this time with a dictionary:

values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 2.0 1
2   0.0 1.0 2.0 5
3   0.0 3.0 2.0 4

Again, relatively easy from Pandas. In datatable, you have to run a dictionary comprehension:

DT[:,update(**{key: ifelse(f[key]==None, 
                              value, 
                              f[key]) 
  for key, value in values.items()})]

DT
Out[25]: 
   |       A      B        C      D
   | float64  int32  float64  int32
-- + -------  -----  -------  -----
 0 |       0      2        2      0
 1 |       3      4        2      1
 2 |       0      1        2      5
 3 |       0      3        2      4
[4 rows x 4 columns]

Speedwise, for the dictionary operation above, on my PC, datatable returns

23.5 µs ± 671 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

While Pandas returns :

665 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So, yes, datatable is significantly faster.

Note also that pandas has ffill and bfill, which are quite useful but are notably missing in datatable and cant be easily replicated with the existing functions(not that I am aware of). And in fillna you can fill across rows with the axis parameter.

Hope this helps a bit. Again, you can visit the github issues page or the discussions page to air your opinions and raise feature requests.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1