How can I reshape a Pandas DataFrame to show if certain values in a column are present by day?
I'm writing a script to query a database and return a DataFrame, df, that looks like this:
cost data_source process_date revenue
0 0.796895 A 2016-10-01 0.133634
1 0.222457 B 2016-10-02 0.111921
2 0.980005 C 2016-10-03 0.353150
3 0.332533 D 2016-10-04 0.686957
4 0.554089 E 2016-10-05 0.798600
The objective is to check whether a data_source from the following Series, all_data_sources, is missing from the DataFrame df shown above:
0 A
1 B
2 C
3 D
4 E
5 F
6 G
dtype: object
The script should return a new DataFrame which shows, by day (index), which of the all_data_sources (columns) are missing from the queried df. The closest I've gotten to the format I want is a pivot table:
import numpy as np

source_by_date = df.pivot_table(values="cost", index="process_date",
                                columns="data_source", aggfunc=np.mean)
which returns a new DataFrame that looks like the one I want, but with the wrong values:
data_source A B C D E
process_date
2016-10-01 0.796895 NaN NaN NaN NaN
2016-10-02 NaN 0.222457 NaN NaN NaN
2016-10-03 NaN NaN 0.980005 NaN NaN
2016-10-04 NaN NaN NaN 0.332533 NaN
2016-10-05 NaN NaN NaN NaN 0.554089
Instead of returning the mean cost for each data source, I want it to return True if the data source is missing and False if it isn't (or the other way around).
It doesn't have to be a pivot table; this is just the closest I have gotten to what I want.
The other problem with my current approach is that it only takes into account the data_sources actually returned by the query, which may not include the complete list from the Series all_data_sources.
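For reference, the asker's pivot approach can be extended to cover both issues at once by reindexing the columns to the full source list and flagging missing cells with isna(). The following is a sketch that also reconstructs the sample data shown above, not code from the original question:

```python
import pandas as pd

# Sample data reconstructed from the question (values illustrative).
df = pd.DataFrame({
    "cost": [0.796895, 0.222457, 0.980005, 0.332533, 0.554089],
    "data_source": list("ABCDE"),
    "process_date": pd.to_datetime(["2016-10-01", "2016-10-02",
                                    "2016-10-03", "2016-10-04", "2016-10-05"]),
    "revenue": [0.133634, 0.111921, 0.353150, 0.686957, 0.798600],
})
all_data_sources = pd.Series(list("ABCDEFG"))

# Pivot as in the question, then reindex to the complete source list;
# columns F and G are all-NaN, so isna() marks them True (missing).
pivot = df.pivot_table(values="cost", index="process_date",
                       columns="data_source", aggfunc="mean")
missing = pivot.reindex(columns=all_data_sources.values).isna()
```

Here missing is True where a data source did not appear on a given day, which is one of the two orientations the question allows.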
Solution 1:[1]
You can also use the get_dummies() method:
In [79]: x = pd.get_dummies(df.set_index('process_date').data_source)
In [80]: x
Out[80]:
A B C D E
process_date
2016-10-01 1 0 0 0 0
2016-10-02 0 1 0 0 0
2016-10-03 0 0 1 0 0
2016-10-04 0 0 0 1 0
2016-10-05 0 0 0 0 1
In [81]: new = pd.DataFrame(columns=all_data_sources.loc[~all_data_sources.isin(df.data_source)].values, index=x.index).fillna(0)
In [82]: pd.concat([x, new], axis=1).astype(bool)
Out[82]:
A B C D E F G
process_date
2016-10-01 True False False False False False False
2016-10-02 False True False False False False False
2016-10-03 False False True False False False False
2016-10-04 False False False True False False False
2016-10-05 False False False False True False False
Data:
In [77]: all_data_sources
Out[77]:
0 A
1 B
2 C
3 D
4 E
5 F
6 G
dtype: object
In [78]: df
Out[78]:
cost data_source process_date revenue
0 0.796895 A 2016-10-01 0.133634
1 0.222457 B 2016-10-02 0.111921
2 0.980005 C 2016-10-03 0.353150
3 0.332533 D 2016-10-04 0.686957
4 0.554089 E 2016-10-05 0.798600
Some explanations:
In [83]: all_data_sources.loc[~all_data_sources.isin(df.data_source)]
Out[83]:
5 F
6 G
dtype: object
In [84]: pd.DataFrame(columns=all_data_sources.loc[~all_data_sources.isin(df.data_source)].values, index=x.index).fillna(0)
Out[84]:
F G
process_date
2016-10-01 0 0
2016-10-02 0 0
2016-10-03 0 0
2016-10-04 0 0
2016-10-05 0 0
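Since .ix has been removed from recent pandas versions, the concat-with-empty-frame step can be replaced by a single reindex. This is a hedged rewrite of the answer above, not the answerer's original code; it assumes the same sample data:

```python
import pandas as pd

# Same sample frame as in the answer's data section (reconstructed).
df = pd.DataFrame({
    "data_source": list("ABCDE"),
    "process_date": pd.to_datetime(["2016-10-01", "2016-10-02",
                                    "2016-10-03", "2016-10-04", "2016-10-05"]),
})
all_data_sources = pd.Series(list("ABCDEFG"))

# One-hot encode the sources per day, then reindex to the full column
# list: never-seen sources (F, G) are filled with 0 before the bool cast.
x = pd.get_dummies(df.set_index("process_date")["data_source"])
result = x.reindex(columns=all_data_sources.values, fill_value=0).astype(bool)
```

As in the original answer, result is True where a source is present and False where it is missing.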
Solution 2:[2]
You can group data_source by date and, for each group, check whether each key from all_data_sources is present in it, building up a boolean Series per group (this works in Python 3 via the dictionary comprehension):
import pandas as pd
(df.groupby('process_date')['data_source']
.apply(lambda g: pd.Series({k: k in g.values for k in all_data_sources}))
.unstack(level = 1))
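A related sketch reaches the same shape with pd.crosstab instead of groupby/apply. This is an editorial addition, not part of either answer, and assumes the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "data_source": list("ABCDE"),
    "process_date": pd.to_datetime(["2016-10-01", "2016-10-02",
                                    "2016-10-03", "2016-10-04", "2016-10-05"]),
})
all_data_sources = pd.Series(list("ABCDEFG"))

# Count occurrences per (day, source), extend the columns to the full
# source list (absent sources filled with 0), and cast to present/absent.
present = (pd.crosstab(df["process_date"], df["data_source"])
             .reindex(columns=all_data_sources.values, fill_value=0)
             .astype(bool))
```

Unlike the groupby version, crosstab never materializes a Python dict per group, which can matter on larger frames.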
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | MaxU - stop genocide of UA
Solution 2 | Psidom