Data quality check using Dask

I have two Dask dataframes and I am performing data quality checks. Each dataframe has at least 50 columns, but for this question, here is a sample assumed to be part of the 20 GB of data.

import numpy as np
import pandas as pd
import dask.dataframe as dd

# first dataframe
df1 = pd.DataFrame({'fname1': ['dwayne', 'peter', 'dead', 'wonder'],
                    'lname1': ['rock', 'pan', 'pool', 'boy'],
                    'entrydate1': ['31DEC2021', '22JAN2022', np.nan, '15DEC2025']})

ddf1 = dd.from_pandas(df1, npartitions=2)

# second dataframe
df2 = pd.DataFrame({'fname2': ['bruce', 'peter', 'dead', np.nan],
                    'lname2': ['banner', 'pan', 'pool', 'boy'],
                    'entrydate2': ['31DEC2021', np.nan, '15DEC2025', '25JAN2018']})

ddf2 = dd.from_pandas(df2, npartitions=2)

I would like to compare the fname, lname, and entrydate columns. I already have the initial counts and missing-value counts; what I would like to do next is:

  1. merge the two Dask dataframes into ddf3
  2. create a new column ['comparefnames'] that compares the values of fname1 and fname2, i.e. 1 if fname1 == fname2, else 0

ddf3 = ddf1.join(ddf2) has been running for more than 4 hours and is still not done.

I am using the book "Data Science with Python and Dask", but I cannot seem to make it work.

The last thing I tried:


compare_fname = [ddf3['fname1'] == ddf3['fname2']]

ddf3['comparefnames'] = if ddf3[compare_fname] = TRUE then 1 else 0.

I know my code above will not work in Dask; maybe I am still lost. Please help. Thanks



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
