Data quality check using Dask
I have two Dask DataFrames and I am performing data quality checks on them. Each dataframe has at least 50 columns, but for this question, here is a small sample standing in for part of the roughly 20 GB of data.
# first dataframe
import numpy as np
import pandas as pd
import dask.dataframe as dd

df1 = pd.DataFrame({'fname1': ['dwayne', 'peter', 'dead', 'wonder'],
                    'lname1': ['rock', 'pan', 'pool', 'boy'],
                    'entrydate1': ['31DEC2021', '22JAN2022', np.nan, '15DEC2025']})
ddf1 = dd.from_pandas(df1, npartitions=2)
# second dataframe
df2 = pd.DataFrame({'fname2': ['bruce', 'peter', 'dead', np.nan],
                    'lname2': ['banner', 'pan', 'pool', 'boy'],
                    'entrydate2': ['31DEC2021', np.nan, '15DEC2025', '25JAN2018']})
ddf2 = dd.from_pandas(df2, npartitions=2)
I would like to compare the fname, lname, and entrydate columns. I already have the initial counts and missing-value counts; what I would like to do next is:
- merge the two Dask DataFrames into ddf3
- create a new column 'comparefnames' that compares the values of fname1 and fname2: 1 if fname1 == fname2, else 0
ddf3 = ddf1.join(ddf2) has been running for more than 4 hours and is still not done.
I am using the book "Data Science with Python and Dask", but I cannot seem to make it work.
The last thing I tried was:
compare_fname = [ddf3['fname1'] == ddf3['fname2']]
ddf3['comparefnames'] = if ddf3[compare_fname] = TRUE then 1 else 0.
I know my code above will not work in Dask; it is only pseudocode for what I want. Maybe I am still lost. Please help. Thanks.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow