'Map range from 2 columns based on overlapping range in another Pandas dataframe and sum values for same range

I have two datasets (df1 and df2) of values with a certain range (Start and End) in both of them.

I would like to annotate the first one (df1) with values from column Num of the corresponding overlapping range of values (Start/End) on df2.

Example: The first row in df1 ranges from 0-2300000, since 2300000 is lower than the End in first row in df2 and the whole range 0-2300000 is overlapping with the range of 62920-121705338, it would be annotated with Num 3. Similarly, also the row 2 of df1 has range 2300000-5400000 overlapping with the range of 62920-121705338, row 2 would also be annotated with Num 3.

However, in the case of the last row of df1, the range contains two rows from df2, so there needs to output the sum in Num in the last two rows of df2.

The desired output would be df3

df1.head()

|Start    |End      |Tag    |
|---------|---------|-------|
|0        |2300000  |gneg45 |   
|2300000  |5400000  |gpos25 |
|143541857|200000000|gneg34 |

df2.head()

| Start   |   End   |  Num   |
|---------|---------|--------|
|62920    |121705338|  3     |   
|143541857|147901334|  2     |
|147901760|151020217|  5     |


df3 = 

|Start    |End      |Num    |
|---------|---------|-------|
|0        |2300000  |3      |   
|2300000  |5400000  |3      |
|143541857|200000000|7      |

I tried pandas merge creating a key and query based on a range of columns, but nothing really worked.

Thanks in advance!!



Solution 1:[1]

From your description, you are looking for overlapping range in df1 and df2 in order for df1 to take the Num value from df2.

To formulate the condition of overlapping range condition, let's illustrate as follows the opposite condition of non-overlapping range:

Either:

                                         |<-------------->| 
                                      df2.Start       .df2.End
           |<------------->| 
        df1.Start       df1.End

or:

                 |<-------------->| 
              df2.Start       .df2.End
                                             |<------------->| 
                                          df1.Start       df1.End

This non-overlapping range condition can be formulated as:

Either (df1.End < df2.Start) or (df1.Start > df2.End)

Therefore, the overlapping range condition, being the opposite, is the negation of the above conditions, that is:

~ ((df1.End < df2.Start) | (df1.Start > df2.End))

which is equivalent to:

(df1.End >= df2.Start) & (df1.Start <= df2.End)

[Note: we deduce overlapping condition by considering the opposite and get the negation because the overlapping conditions have more scenarios. There are 4 cases: (1) df1 covering the entire df2 range and more; (2) df1 being entirely contained within the df2 range; (3) overlapping on left end only; (4) overlapping on right end only. We can simplify the logics by our approach.]

Solution 1: Simple Solution for small dataset

Step 1: For small dataset, you can cross join df1 and df2 by .merge(), then filter by the overlapping condition using .query(), as follows:

df3 = (df1.merge(df2, how='cross', suffixes=('_df1', '_df2'))
          .query('(End_df1 >= Start_df2) & (Start_df1 <= End_df2)')
          .rename({'Start_df1': 'Start', 'End_df1': 'End'}, axis=1)
          [['Start', 'End', 'Num']]
      )

If your Pandas version is older than 1.2.0 (released in December 2020) and does not support merge with how='cross', you can use:

df3 = (df1.assign(key=1).merge(df2.assign(key=1), on='key', suffixes=('_df1', '_df2')).drop('key', axis=1)
          .query('(End_df1 >= Start_df2) & (Start_df1 <= End_df2)')
          .rename({'Start_df1': 'Start', 'End_df1': 'End'}, axis=1)
          [['Start', 'End', 'Num']]
      )

Intermediate result:

print(df3)

       Start        End  Num
0          0    2300000    3
3    2300000    5400000    3
7  143541857  200000000    2
8  143541857  200000000    5

Step 2: Sum up the Num values for same range (same Start and End) by .groupby() and .sum():

df3 = df3.groupby(['Start', 'End'])['Num'].sum().reset_index()

Result:

print(df3)

       Start        End  Num
0          0    2300000    3
1    2300000    5400000    3
2  143541857  200000000    7

Solution 2: Numpy Solution for large dataset

For large dataset and performance is a concern, you can use numpy broadcasting (instead of cross join and filtering) to speed up the execution time:

Step 1:

d1_S = df1.Start.to_numpy()
d1_E = df1.End.to_numpy()
d2_S = df2.Start.to_numpy()
d2_E = df2.End.to_numpy()

# filter for overlapping range condition and get the respective row indexes of `df1`, `df2` in `i` and `j`
i, j = np.where((d1_E[:, None] >= d2_S) & (d1_S[:, None] <= d2_E))

df3 = pd.DataFrame(
          np.column_stack([df1.values[i], df2.values[j]]),
          columns=df1.columns.append(df2.columns + '_df2')
      )

Intermediate result:

print(df3)

       Start        End     Tag  Start_df2    End_df2 Num_df2
0          0    2300000  gneg45      62920  121705338       3
1    2300000    5400000  gpos25      62920  121705338       3
2  143541857  200000000  gneg34  143541857  147901334       2
3  143541857  200000000  gneg34  147901760  151020217       5

Step 2: Sum up the Num values for same range (same Start and End) by .groupby() and .sum():

df3 = df3.groupby(['Start', 'End'])['Num_df2'].sum().reset_index(name='Num')

Result:

print(df3)

       Start        End  Num
0          0    2300000    3
1    2300000    5400000    3
2  143541857  200000000    7

Solution 2:[2]

Building off the logic from @SeaBean, one option is with conditional_join from pyjanitor, followed by a groupby:

# pip install pyjanitor
import pandas as pd
import janitor
(
df1
.conditional_join(
    # add suffix here
    # to avoid MultiIndex, which happens
    # if the columns overlap
    df2.add_suffix('_y'), 
    # column from left, column from right, comparator
    ('Start', 'End_y', '<='), 
    ('End', 'Start_y', '>='))
.rename(columns={'Num_y':'Num'})
.groupby(['Start', 'End'], as_index = False)
.Num
.sum()
)
       Start        End  Num
0          0    2300000    3
1    2300000    5400000    3
2  143541857  200000000    7

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 sammywemmy