'How can I reshape a Pandas DataFrame to show if certain values in a column are present by day?

I'm writing a script to query a database and return a DataFrame, df, that looks like this:

       cost data_source process_date   revenue
0  0.796895           A   2016-10-01  0.133634
1  0.222457           B   2016-10-02  0.111921
2  0.980005           C   2016-10-03  0.353150
3  0.332533           D   2016-10-04  0.686957
4  0.554089           E   2016-10-05  0.798600

The objective is to check if a data_source from the following Series of all_data_sources is missing from the DataFrame, df, shown above:

0    A
1    B
2    C
3    D
4    E
5    F
6    G
dtype: object

The script should return a new DataFrame which shows by day (Index) which all_data_sources(columns) are missing from the queried df. The closest I've gotten in terms of the format I want is using a pivot table:

source_by_date = df.pivot_table(values = "cost", index = "process_date",
                            columns = "data_source",
                            aggfunc = np.mean)

Which returns a new df looking like the one I would want but with the wrong values:

data_source          A         B         C         D         E
process_date                                                  
2016-10-01    0.796895       NaN       NaN       NaN       NaN
2016-10-02         NaN  0.222457       NaN       NaN       NaN
2016-10-03         NaN       NaN  0.980005       NaN       NaN
2016-10-04         NaN       NaN       NaN  0.332533       NaN
2016-10-05         NaN       NaN       NaN       NaN  0.554089

Instead of returning the mean of cost for each data source I want it to return True if the data source is missing and False if it isn't missing (or the other way around).

It doesn't have to be a pivot table. This is just the closest I have gotten to what I want.

The other problem with my current path is that it will just take into account the data_sources actually returned from the query which may not include the complete list from the Series all_data_sources.



Solution 1:[1]

you can also use get_dummies() method:

In [79]: x = pd.get_dummies(df.set_index('process_date').data_source)

In [80]: x
Out[80]:
              A  B  C  D  E
process_date
2016-10-01    1  0  0  0  0
2016-10-02    0  1  0  0  0
2016-10-03    0  0  1  0  0
2016-10-04    0  0  0  1  0
2016-10-05    0  0  0  0  1

In [81]: new = pd.DataFrame(columns=all_data_sources.ix[~all_data_sources.isin(df.data_source)].values, index=x.index).fillna(0)

In [82]: pd.concat([x, new], axis=1).astype(bool)
Out[82]:
                  A      B      C      D      E      F      G
process_date
2016-10-01     True  False  False  False  False  False  False
2016-10-02    False   True  False  False  False  False  False
2016-10-03    False  False   True  False  False  False  False
2016-10-04    False  False  False   True  False  False  False
2016-10-05    False  False  False  False   True  False  False

Data:

In [77]: all_data_sources
Out[77]:
0    A
1    B
2    C
3    D
4    E
5    F
6    G
dtype: object

In [78]: df
Out[78]:
       cost data_source process_date   revenue
0  0.796895           A   2016-10-01  0.133634
1  0.222457           B   2016-10-02  0.111921
2  0.980005           C   2016-10-03  0.353150
3  0.332533           D   2016-10-04  0.686957
4  0.554089           E   2016-10-05  0.798600

Some explanations:

In [83]: all_data_sources.ix[~all_data_sources.isin(df.data_source)]
Out[83]:
5    F
6    G
dtype: object

In [84]: pd.DataFrame(columns=all_data_sources.ix[~all_data_sources.isin(df.data_source)].values, index=x.index).fillna(0)
Out[84]:
              F  G
process_date
2016-10-01    0  0
2016-10-02    0  0
2016-10-03    0  0
2016-10-04    0  0
2016-10-05    0  0

Solution 2:[2]

You can group the data_source by date and check for each group if all the keys are in the data_source based on which built up a logical Series (this works for python 3 with the dictionary comprehension):

import pandas as pd
(df.groupby('process_date')['data_source']
   .apply(lambda g: pd.Series({k: k in g.values for k in all_data_sources}))
   .unstack(level = 1))

enter image description here

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 MaxU - stop genocide of UA
Solution 2 Psidom