'Map range from 2 columns based on overlapping range in another Pandas dataframe and sum values for same range
I have two datasets (df1 and df2) of values with a certain range (Start and End) in both of them.
I would like to annotate the first one (df1) with values from column Num of the corresponding overlapping range of values (Start/End) on df2.
Example: The first row in df1 ranges from 0-2300000, since 2300000 is lower than the End in first row in df2 and the whole range 0-2300000 is overlapping with the range of 62920-121705338, it would be annotated with Num 3. Similarly, also the row 2 of df1 has range 2300000-5400000 overlapping with the range of 62920-121705338, row 2 would also be annotated with Num 3.
However, in the case of the last row of df1, the range contains two rows from df2, so there needs to output the sum in Num in the last two rows of df2.
The desired output would be df3
df1.head()
|Start |End |Tag |
|---------|---------|-------|
|0 |2300000 |gneg45 |
|2300000 |5400000 |gpos25 |
|143541857|200000000|gneg34 |
df2.head()
| Start | End | Num |
|---------|---------|--------|
|62920 |121705338| 3 |
|143541857|147901334| 2 |
|147901760|151020217| 5 |
df3 =
|Start |End |Num |
|---------|---------|-------|
|0 |2300000 |3 |
|2300000 |5400000 |3 |
|143541857|200000000|7 |
I tried pandas merge creating a key and query based on a range of columns, but nothing really worked.
Thanks in advance!!
Solution 1:[1]
From your description, you are looking for overlapping range in df1 and df2 in order for df1 to take the Num value from df2.
To formulate the condition of overlapping range condition, let's illustrate as follows the opposite condition of non-overlapping range:
Either:
|<-------------->|
df2.Start .df2.End
|<------------->|
df1.Start df1.End
or:
|<-------------->|
df2.Start .df2.End
|<------------->|
df1.Start df1.End
This non-overlapping range condition can be formulated as:
Either (df1.End < df2.Start) or (df1.Start > df2.End)
Therefore, the overlapping range condition, being the opposite, is the negation of the above conditions, that is:
~ ((df1.End < df2.Start) | (df1.Start > df2.End))
which is equivalent to:
(df1.End >= df2.Start) & (df1.Start <= df2.End)
[Note: we deduce overlapping condition by considering the opposite and get the negation because the overlapping conditions have more scenarios. There are 4 cases: (1) df1 covering the entire df2 range and more; (2) df1 being entirely contained within the df2 range; (3) overlapping on left end only; (4) overlapping on right end only. We can simplify the logics by our approach.]
Solution 1: Simple Solution for small dataset
Step 1: For small dataset, you can cross join df1 and df2 by .merge(), then filter by the overlapping condition using .query(), as follows:
df3 = (df1.merge(df2, how='cross', suffixes=('_df1', '_df2'))
.query('(End_df1 >= Start_df2) & (Start_df1 <= End_df2)')
.rename({'Start_df1': 'Start', 'End_df1': 'End'}, axis=1)
[['Start', 'End', 'Num']]
)
If your Pandas version is older than 1.2.0 (released in December 2020) and does not support merge with how='cross', you can use:
df3 = (df1.assign(key=1).merge(df2.assign(key=1), on='key', suffixes=('_df1', '_df2')).drop('key', axis=1)
.query('(End_df1 >= Start_df2) & (Start_df1 <= End_df2)')
.rename({'Start_df1': 'Start', 'End_df1': 'End'}, axis=1)
[['Start', 'End', 'Num']]
)
Intermediate result:
print(df3)
Start End Num
0 0 2300000 3
3 2300000 5400000 3
7 143541857 200000000 2
8 143541857 200000000 5
Step 2: Sum up the Num values for same range (same Start and End) by .groupby() and .sum():
df3 = df3.groupby(['Start', 'End'])['Num'].sum().reset_index()
Result:
print(df3)
Start End Num
0 0 2300000 3
1 2300000 5400000 3
2 143541857 200000000 7
Solution 2: Numpy Solution for large dataset
For large dataset and performance is a concern, you can use numpy broadcasting (instead of cross join and filtering) to speed up the execution time:
Step 1:
d1_S = df1.Start.to_numpy()
d1_E = df1.End.to_numpy()
d2_S = df2.Start.to_numpy()
d2_E = df2.End.to_numpy()
# filter for overlapping range condition and get the respective row indexes of `df1`, `df2` in `i` and `j`
i, j = np.where((d1_E[:, None] >= d2_S) & (d1_S[:, None] <= d2_E))
df3 = pd.DataFrame(
np.column_stack([df1.values[i], df2.values[j]]),
columns=df1.columns.append(df2.columns + '_df2')
)
Intermediate result:
print(df3)
Start End Tag Start_df2 End_df2 Num_df2
0 0 2300000 gneg45 62920 121705338 3
1 2300000 5400000 gpos25 62920 121705338 3
2 143541857 200000000 gneg34 143541857 147901334 2
3 143541857 200000000 gneg34 147901760 151020217 5
Step 2: Sum up the Num values for same range (same Start and End) by .groupby() and .sum():
df3 = df3.groupby(['Start', 'End'])['Num_df2'].sum().reset_index(name='Num')
Result:
print(df3)
Start End Num
0 0 2300000 3
1 2300000 5400000 3
2 143541857 200000000 7
Solution 2:[2]
Building off the logic from @SeaBean, one option is with conditional_join from pyjanitor, followed by a groupby:
# pip install pyjanitor
import pandas as pd
import janitor
(
df1
.conditional_join(
# add suffix here
# to avoid MultiIndex, which happens
# if the columns overlap
df2.add_suffix('_y'),
# column from left, column from right, comparator
('Start', 'End_y', '<='),
('End', 'Start_y', '>='))
.rename(columns={'Num_y':'Num'})
.groupby(['Start', 'End'], as_index = False)
.Num
.sum()
)
Start End Num
0 0 2300000 3
1 2300000 5400000 3
2 143541857 200000000 7
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | sammywemmy |
