How to create new columns based on the values of other columns that may contain numbers or NaN?
I have a few dataframes that I'm merging based on known, populated fields. The resulting dataframe will always contain a set of columns, but may or may not have values for some of the columns.
Here is a snippet:
df = df.merge(
how="left",
right=ins_df,
left_on=["warehouse", "date"],
right_on=["ware_id", "c_date"],
).merge(
how="left",
right=ware_df,
left_on=["warehouse", "date"],
right_on=["warehouse_code", "warehouse_date"],
)
The resulting dataframe has several columns; let's call them A, B, C, D, and E.
I need to create new columns F, G, and H. F should be calculated as A + B - C, G as max(A + B + C, E - D), and H as C * D, and so on.
This would be relatively simple, but I am running into a hiccup because I don't know how to handle NaNs. When we merge, ins_df or ware_df sometimes has no matching values, so we get NaN values in df. I can't apply one df-wide default for NaN, because the right treatment depends on the calculation: in some we want to treat a NaN as 0, in others as -1, and in others we don't want to populate the field at all if one of its inputs is NaN. For example, for H, if C or D is NaN, we don't want to calculate H. But for G, if C is NaN, we want to treat it as 0.
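To illustrate how these NaN values arise, here is a minimal sketch of a left merge where one key has no match. The key column names come from the snippet above; the sample values and the value columns `A` and `B` are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only the key column names come from the snippet above.
df = pd.DataFrame(
    {
        "warehouse": ["W1", "W2"],
        "date": ["2022-01-01", "2022-01-01"],
        "A": [8, 5],
    }
)
ins_df = pd.DataFrame(
    {
        "ware_id": ["W1"],  # no row for W2
        "c_date": ["2022-01-01"],
        "B": [4],
    }
)

merged = df.merge(
    ins_df,
    how="left",
    left_on=["warehouse", "date"],
    right_on=["ware_id", "c_date"],
)

# The W2 row has no match in ins_df, so its B value is NaN,
# and the integer column B is upcast to float64.
print(merged[["warehouse", "A", "B"]])
```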
Is there a simple way to do this in my calculations? For example, something like this:
df['G'] = max(df.get('A', default=0) + df.get('B', default=1) + df.get('C', default=0), df.get('E', default=-1))
df['H'] = df['C'] * df['D'] if ['C', 'D'] in df.columns else 0
Solution 1:[1]
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
{
"A": [8, 5, 2],
"B": [4, pd.NA, 3],
"C": [1, 1, 8],
"D": [pd.NA, 7, 2],
"E": [5, 5, pd.NA],
}
)
Here is one way to do it:
# For G, if A is NaN, we want to treat it as 0, etc.
df["G"] = pd.concat(
[df["A"].fillna(0) + df["B"].fillna(1) + df["C"].fillna(0), df["E"].fillna(-1)],
axis=1,
).max(axis=1)
# For H, if C or D is NaN, we don't want to calculate H
df["H"] = df["C"] * df["D"]
print(df)
# Output
A B C D E G H
0 8 4 1 <NA> 5 13 <NA>
1 5 <NA> 1 7 5 7 7
2 2 3 8 2 <NA> 13 16
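The same idea can be sketched with float NaN, which is what a left merge typically produces, and extended to F. Two points are worth noting about the attempt in the question: `df.get(key, default=...)` only falls back to the default when the column itself is missing, not when individual cells are NaN, and Python's built-in `max` does not compare two Series element-wise. The sketch below uses `fillna` per column and `np.maximum` instead; the fill values chosen for F are an assumption, since the question does not state them.

```python
import numpy as np
import pandas as pd

# Toy data with float NaN instead of pd.NA.
df = pd.DataFrame(
    {
        "A": [8.0, 5.0, 2.0],
        "B": [4.0, np.nan, 3.0],
        "C": [1.0, 1.0, 8.0],
        "D": [np.nan, 7.0, 2.0],
        "E": [5.0, 5.0, np.nan],
    }
)

# F = A + B - C. Treating every NaN as 0 here is an assumption;
# choose the fill value each calculation actually needs.
df["F"] = df["A"].fillna(0) + df["B"].fillna(0) - df["C"].fillna(0)

# G = element-wise max of two expressions, each with its own defaults
# (same defaults as in the answer above).
left = df["A"].fillna(0) + df["B"].fillna(1) + df["C"].fillna(0)
right = df["E"].fillna(-1)
df["G"] = np.maximum(left, right)

# H = C * D. NaN propagates through arithmetic, so H stays NaN
# whenever C or D is NaN; no fill is needed.
df["H"] = df["C"] * df["D"]

print(df)
```

Because each `fillna` call applies to a single column inside a single expression, the original columns keep their NaN values and every calculation can use its own defaults.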
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Laurent |