Pivot dataframe with duplicate index by aggregating per group

I have the following dataframe.

Date Security Field Value
0 2022-05-03 08:00:12.394000 CFI2Z2 VALUE 83.3
1 2022-05-03 08:00:12.394000 CFI2Z2 VOLUME 1
2 2022-05-03 08:00:12.460000 TRNLTTFFVMc1 VALUE 83.4
3 2022-05-03 08:00:12.460000 TRNLTTFFVMc1 VOLUME 1
4 2022-05-03 08:00:12.460000 TRNLTTFFVMc1 VALUE 83.23
5 2022-05-03 08:00:12.460000 TRNLTTFFVMc1 VOLUME 2

Note that the column "Date" is not unique, neither globally nor per group.

I am trying to achieve the following transformation.

('TRNLTTFFVMc1', 'VALUE') ('TRNLTTFFVMc1', 'VOLUME') ('CFI2Z2', 'VALUE') ('CFI2Z2', 'VOLUME')
2022-05-03 08:00:12.394000 nan nan 83.3 1
2022-05-03 08:00:12.460000 83.286667 3 nan nan

Note that the output columns index is a MultiIndex.

MultiIndex([('TRNLTTFFVMc1',  'VALUE'),
            ('TRNLTTFFVMc1', 'VOLUME'),
            (      'CFI2Z2',  'VALUE'),
            (      'CFI2Z2', 'VOLUME')],
           )

For groups with multiple VALUE/VOLUME pairs I would like to calculate the Volume-Weighted Average Price (VWAP).

The problem is that I can't just pivot, because the date is not unique. I am not sure how to proceed.
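
For reference, here is a minimal reproduction of the error a plain pivot raises on such data (a sketch, assuming a pandas version whose pivot accepts a list of columns):

import pandas as pd

# Minimal reproduction: two (Date, Security, Field) keys repeat,
# so a plain pivot cannot reshape the frame.
df = pd.DataFrame(
    {'Date': pd.to_datetime(['2022-05-03 08:00:12.460000'] * 4),
     'Security': ['TRNLTTFFVMc1'] * 4,
     'Field': ['VALUE', 'VOLUME', 'VALUE', 'VOLUME'],
     'Value': [83.4, 1, 83.23, 2]}
)

try:
    df.pivot(index='Date', columns=['Security', 'Field'], values='Value')
except ValueError as exc:
    print(exc)  # e.g. "Index contains duplicate entries, cannot reshape"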



Solution 1:[1]

Given the brief discussion, here is my answer. I tried to make it more concise but I could not. If someone has a better option, I'm all ears!

Here is the data:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Date': pd.to_datetime(['2022-05-03 08:00:12.394000', '2022-05-03 08:00:12.394000', 
                             '2022-05-03 08:00:12.460000', '2022-05-03 08:00:12.460000', 
                             '2022-05-03 08:00:12.460000', '2022-05-03 08:00:12.460000']),
     'Security': ['CFI2Z2', 'CFI2Z2', 'TRNLTTFFVMc1', 'TRNLTTFFVMc1', 'TRNLTTFFVMc1', 'TRNLTTFFVMc1'],
     'Field': ['VALUE', 'VOLUME', 'VALUE', 'VOLUME', 'VALUE', 'VOLUME'],
     'Value': [83.3, 1, 83.4, 1, 83.23, 2]
    }
)
                     Date      Security   Field  Value
0 2022-05-03 08:00:12.394        CFI2Z2   VALUE  83.30
1 2022-05-03 08:00:12.394        CFI2Z2  VOLUME   1.00
2 2022-05-03 08:00:12.460  TRNLTTFFVMc1   VALUE  83.40
3 2022-05-03 08:00:12.460  TRNLTTFFVMc1  VOLUME   1.00
4 2022-05-03 08:00:12.460  TRNLTTFFVMc1   VALUE  83.23
5 2022-05-03 08:00:12.460  TRNLTTFFVMc1  VOLUME   2.00

First, calculate the aggregate values for each (Date, Security) pair:

def calculate_aggregates(fields):
    # Within each group the values alternate VALUE, VOLUME, so the even
    # positions are prices and the odd positions are volumes
    # (VWAP = sum(price * volume) / sum(volume)).
    avg_val = np.sum(fields[0::2] * fields[1::2]) / (tot := np.sum(fields[1::2]))
    return (avg_val, tot)

values = np.concatenate(
    df.groupby(['Date', 'Security']).apply(lambda g: calculate_aggregates(g['Value'].to_numpy()))
)
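
For the TRNLTTFFVMc1 group this works out to (83.4 * 1 + 83.23 * 2) / (1 + 2) ≈ 83.286667 with a total volume of 3, and for the CFI2Z2 group simply (83.3, 1.0), so values is [83.3, 1.0, 83.286667, 3.0] in the sorted (Date, Security) group order.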

Then, assign the values to each of the groups and unstack the index twice to create the desired column MultiIndex:

df_result = (df.groupby(['Date', 'Field', 'Security']).first()
               .assign(Value=values)
               .unstack(-1)
               .unstack(-1))

Then, df_result is:

Security                CFI2Z2        TRNLTTFFVMc1       
Field                    VALUE VOLUME        VALUE VOLUME
Date                                                     
2022-05-03 08:00:12.394   83.3    1.0          NaN    NaN
2022-05-03 08:00:12.460    NaN    NaN    83.286667    3.0
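
If the exact column order from the question is needed (TRNLTTFFVMc1 before CFI2Z2), the columns can be reordered afterwards, for example:

df_result = df_result.reindex(columns=[('TRNLTTFFVMc1', 'VALUE'), ('TRNLTTFFVMc1', 'VOLUME'),
                                       ('CFI2Z2', 'VALUE'), ('CFI2Z2', 'VOLUME')])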

The assumption this code makes when calculating the aggregates is that the (VALUE, VOLUME) pairs always come in this order within each group. If that is not the case, it could be handled with some checks in the groupby function, as in the sketch below.
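
For example, a variant of the helper that selects rows by Field instead of relying on positional alternation could look like this (a sketch, not part of the original answer; it still assumes each VALUE row has a matching VOLUME row in the same relative order within the group):

import numpy as np

def calculate_aggregates_by_field(group):
    # Select prices and volumes by the Field column instead of by position.
    val = group.loc[group['Field'] == 'VALUE', 'Value'].to_numpy()
    vol = group.loc[group['Field'] == 'VOLUME', 'Value'].to_numpy()
    tot = vol.sum()
    return (np.sum(val * vol) / tot, tot)

values = np.concatenate(
    df.groupby(['Date', 'Security']).apply(calculate_aggregates_by_field)
)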

Solution 2:[2]

A solution with a MultiIndex Series, which works well if every (Date, Security) group contains matching VOLUME and VALUE pairs.

# create a MultiIndex Series
s = df.set_index(['Date','Security', 'Field'])['Value']

# select rows by the Field level (level 2)
VAL = s.xs('VALUE', level=2)
VOL = s.xs('VOLUME', level=2)

# total VOLUME per (Date, Security)
COUNT = VOL.groupby(level=[0,1]).sum()

# multiply VALUE by VOLUME, sum per (Date, Security) and divide by the total volume (VWAP),
# then append the total volume as a new column, reshape and swap the MultiIndex levels in the columns
df = (VAL.mul(VOL)
         .groupby(level=[0,1]).sum().div(COUNT).to_frame('VALUE')
         .assign(VOLUME=COUNT)
         .unstack()
         .swaplevel(0,1,axis=1)
         .sort_index(axis=1))
print (df)
Security                CFI2Z2        TRNLTTFFVMc1       
                         VALUE VOLUME        VALUE VOLUME
Date                                                     
2022-05-03 08:00:12.394   83.3    1.0          NaN    NaN
2022-05-03 08:00:12.460    NaN    NaN    83.286667    3.0

Details:

print ((VAL.index == VOL.index).all())
True

print (VAL.mul(VOL))
Date                     Security    
2022-05-03 08:00:12.394  CFI2Z2           83.30
2022-05-03 08:00:12.460  TRNLTTFFVMc1     83.40
                         TRNLTTFFVMc1    166.46
Name: Value, dtype: float64
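
Dividing that grouped sum by COUNT then gives the volume-weighted averages seen in the final result (83.3 and 83.286667):

print (VAL.mul(VOL).groupby(level=[0,1]).sum().div(COUNT))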

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution sources:
[1] Solution 1
[2] Solution 2