'Add together numbers next to duplicate strings that occur across columns for each row
I have a dataset that looks like this:
Col1 | Col2 | Col3 | Col4 |
---|---|---|---|
Carbon Dioxide 32 | Carbon Monoxide 30 | Hydrogen 12 | Carbon Monoxide 12 |
Carbon 12 | Carbon Dioxide 20 | Iron 3 | Carbon 6 |
Sulfur 81 | Sulfur 20 | Sulfur Dioxide 4 | Sulfur 8 |
Sulfur 81 | Sulfur 20 | Carbon 4 | Sulfur 8 |
Carbon 81 | Sulfur 20 | Carbon 4 | Sulfur 8 |
For every row, if there are duplicate strings before each number across the columns, I want to add the numbers next to the text, and rearrange the table, so I get this:
Col1 | Col2 | Col3 | Col4 |
---|---|---|---|
Carbon Dioxide 32 | Carbon Monoxide 42 | Hydrogen 12 | (blank) |
Carbon 18 | Carbon Dioxide 20 | Iron 3 | (blank) |
Sulfur 109 | Sulfur Dioxide 4 | (blank) | (blank) |
Sulfur 109 | Carbon 4 | (blank) | (blank) |
Carbon 85 | Sulfur 28 | (blank) | (blank) |
Then I want to pick the row which has the largest number next to it, so I get:
Col1 |
---|
Carbon Monoxide 42 |
Carbon Dioxide 20 |
Sulfur 109 |
Sulfur 109 |
Carbon 85 |
Also, please note, my actual data frame is pretty big (25K+ rows) and I don't know all the possible elements/compounds beforehand.
How can achieve the above in Python?
Any help will be appreciated. Thanks in advance!
#Also, here is the python code for generating the first table above for convenience:
df = pd.DataFrame({
'Col1': ['Carbon Dioxide 32', 'Carbon 12', 'Sulfur 81', 'Sulfur 81', 'Carbon 81'],
'Col2': ['Carbon Monoxide 30', 'Carbon Dioxide 20', 'Sulfur 20', 'Sulfur 20', 'Sulfur 20'],
'Col3': ['Hydrogen 12', 'Iron 3', 'Sulfur Dioxide 4', 'Carbon 4', 'Carbon 4'],
'Col4': ['Carbon Monoxide 12', 'Carbon 6', 'Sulfur 8', 'Sulfur 8', 'Sulfur 8']
})
Solution 1:[1]
Use:
df = pd.DataFrame({
'Col1': ['Carbon Dioxide 32', 'Carbon 12', 'Sulfur 81', 'Sulfur 81', 'Carbon 81'],
'Col2': ['Carbon Monoxide 30', 'Carbon Dioxide 20', 'Sulfur 20', 'Sulfur 20', 'Sulfur 20'],
'Col3': ['Hydrogen 12', 'Iron 3', 'Sulfur Dioxide 4', 'Carbon 4', 'Carbon 4'],
'Col4': ['Carbon Monoxide 12', 'Carbon 6', 'Sulfur 8', 'Sulfur 8', 'Sulfur 8']
})
temp = df.agg(lambda x: (x.str.split().str[:-1].str.join(' ').values, x.str.split().str[-1].values)).T.explode([0,1])
temp.index = np.array([f"c{i}" for i in range(int(len(temp)/4))]*4).flatten()
temp[1]=temp[1].astype(int)
temp2 = temp.groupby([temp.index, 0])[1].sum()
temp2.max(level=1)
temp2.reset_index().sort_values(1, ascending = False).groupby('level_0')[[0,1]].apply(lambda x: x.iloc[0])
Output:
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | keramat |