'Pandas: return rows that have two matching columns commonality
I am trying to write a commonality script which will return rows in a pandas dataframe that have two matching columns, and also will sum up the number of rows with matches into a new column
OPERATION and MACHINE are the columns to match
Input:
BATCH | OPERATION | MACHINE | DATE |
---|---|---|---|
1A | 4000 | Printer1 | 01-Jan-22 |
1A | 2000 | Fax1 | 02-Jan-22 |
1B | 4000 | Printer2 | 03-Jan-22 |
1B | 2000 | Phone1 | 04-Jan-22 |
1C | 2000 | Phone2 | 05-Jan-22 |
1A | 2000 | Phone2 | 06-Jan-22 |
Output: just the last two rows in this example with OPERATION =2000 and MACHINE = Phone2
BATCH | OPERATION | MACHINE | DATE | MATCHES |
---|---|---|---|---|
1C | 2000 | Phone2 | 05-Jan-22 | 2 |
1A | 2000 | Phone2 | 06-Jan-22 | 2 |
I have tried several merge and duplicated df commands but can't get it work.
Solution 1:[1]
df[df.groupby(['OPERATION', 'MACHINE'])['BATCH'].transform('count') > 1].assign(MATCHES = df.groupby(['OPERATION', 'MACHINE'])['BATCH'].transform('count'))
BATCH OPERATION MACHINE DATE MATCHES
4 1C 2000 Phone2 05-Jan-22 2
5 1A 2000 Phone2 06-Jan-22 2
OR
df['MATCHES'] = df.groupby(['OPERATION', 'MACHINE'])['BATCH'].transform('count')
df[df['MATCHES']>1]
Note: one may be faster than other depending on the source dataframe
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Jonathan Leon |