'Pivot dataframe with duplicate index by aggregating per group
I am facing the following dataframe.
Date | Security | Field | Value | |
---|---|---|---|---|
0 | 2022-05-03 08:00:12.394000 | CFI2Z2 | VALUE | 83.3 |
1 | 2022-05-03 08:00:12.394000 | CFI2Z2 | VOLUME | 1 |
2 | 2022-05-03 08:00:12.460000 | TRNLTTFFVMc1 | VALUE | 83.4 |
3 | 2022-05-03 08:00:12.460000 | TRNLTTFFVMc1 | VOLUME | 1 |
4 | 2022-05-03 08:00:12.460000 | TRNLTTFFVMc1 | VALUE | 83.23 |
3 | 2022-05-03 08:00:12.460000 | TRNLTTFFVMc1 | VOLUME | 2 |
Note that the column "Date" is not unique. Neither globally or and not per group.
I am trying to achieve the following transformation.
('TRNLTTFFVMc1', 'VALUE') | ('TRNLTTFFVMc1', 'VOLUME') | ('CFI2Z2', 'VALUE') | ('CFI2Z2', 'VOLUME') | |
---|---|---|---|---|
2022-05-03 08:00:12.394000 | nan | nan | 83.3 | 1 |
2022-05-03 08:00:12.460000 | 83,27 | 3 | nan | nan |
Note that the output columns index is a MultiIndex.
MultiIndex([('TRNLTTFFVMc1', 'VALUE'),
('TRNLTTFFVMc1', 'VOLUME'),
( 'CFI2Z2', 'VALUE'),
( 'CFI2Z2', 'VOLUME')],
)
For groups with multiple VOLUME/PRICE pairs I would like to calculate the Volume-Weighted Average Price.
The problem is that I can't just pivot, because the date is not unique. I am not sure how to proceed.
Solution 1:[1]
Given the brief discussion, here is my answer. I tried to make it more concise but I could not. If someone has a better option, I'm all ears!
Here is the data:
df = pd.DataFrame(
{'Date': pd.to_datetime(['2022-05-03 08:00:12.394000', '2022-05-03 08:00:12.394000',
'2022-05-03 08:00:12.460000', '2022-05-03 08:00:12.460000',
'2022-05-03 08:00:12.460000', '2022-05-03 08:00:12.460000']),
'Security': ['CFI2Z2', 'CFI2Z2', 'TRNLTTFFVMc1', 'TRNLTTFFVMc1', 'TRNLTTFFVMc1', 'TRNLTTFFVMc1'],
'Field': ['VALUE', 'VOLUME', 'VALUE', 'VOLUME', 'VALUE', 'VOLUME'],
'Value': [83.3, 1, 83.4, 1, 83.23, 2]
}
)
Date Security Field Value
0 2022-05-03 08:00:12.394 CFI2Z2 VALUE 83.30
1 2022-05-03 08:00:12.394 CFI2Z2 VOLUME 1.00
2 2022-05-03 08:00:12.460 TRNLTTFFVMc1 VALUE 83.40
3 2022-05-03 08:00:12.460 TRNLTTFFVMc1 VOLUME 1.00
4 2022-05-03 08:00:12.460 TRNLTTFFVMc1 VALUE 83.23
5 2022-05-03 08:00:12.460 TRNLTTFFVMc1 VOLUME 2.00
First, calculate the aggregate values for each (Date, Security)
pair:
def calculate_aggregates(fields):
avg_val = (np.sum(fields[0::2]*fields[1::2]))/(tot:=np.sum(fields[1::2]))
return (avg_val, tot)
values = np.concatenate(df.groupby(['Date', 'Security']).apply(lambda df: calculate_aggregates(df['Value'].to_numpy())))
Then, assign the values to each of the groups and unstack the index twice to create the desired column MultiIndex
:
df_result = df.groupby(['Date', 'Field', 'Security']).first().assign(Value=values).unstack(-1).unstack(-1)
Then, df_result
is:
Security CFI2Z2 TRNLTTFFVMc1
Field VALUE VOLUME VALUE VOLUME
Date
2022-05-03 08:00:12.394 83.3 1.0 NaN NaN
2022-05-03 08:00:12.460 NaN NaN 83.286667 3.0
The assumption this code makes to calculate the aggregates is that the (VALUE, VOLUME)
pairs are coming always in this order within the groups. It could be changed with some checks in the groupby
function if this is not the case.
Solution 2:[2]
Solution with MultiIndex Series
, working well if all pairs with VOLUME
and VALUE
per columns Date,Security
.
#created Multiindex Series
s = df.set_index(['Date','Security', 'Field'])['Value']
#selected rows by values in level2
VAL = s.xs('VALUE', level=2)
VOL = s.xs('VOLUME', level=2)
#count VOLUMEs
COUNT = VOL.groupby(level=[0,1]).sum()
#because pairs multiply working, divide sum and append VOL like new column
#then reshape and swap MultiIndex in columns
df = (VAL.mul(VOL)
.groupby(level=[0,1]).sum().div(COUNT).to_frame('VALUE')
.assign(VOLUME=COUNT)
.unstack()
.swaplevel(0,1,axis=1)
.sort_index(axis=1))
print (df)
Security CFI2Z2 TRNLTTFFVMc1
VALUE VOLUME VALUE VOLUME
Date
2022-05-03 08:00:12.394 83.3 1.0 NaN NaN
2022-05-03 08:00:12.460 NaN NaN 83.286667 3.0
Details:
print ((VAL.index == VOL.index).all())
True
print (VAL.mul(VOL))
Date Security
2022-05-03 08:00:12.394 CFI2Z2 83.30
2022-05-03 08:00:12.460 TRNLTTFFVMc1 83.40
TRNLTTFFVMc1 166.46
Name: Value, dtype: float64
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 |