'How to spilt each column into more columns in a given dataframe
I have over 100 columns of the week and for each column of the week, I want to proportion it into days and assign row-specific values to each row over 7 new columns. Like this
I am new to python, I know I need a while loops and for loops but now sure how to go about doing this. Can anyone help?
Based on previous advice I had from this forum, the below work for Week1, can someone advise me on how to loop each week for weeks 2, 3, 4 to nth week?
import pandas as pd
df = pd.DataFrame({"Week1": [9, 30, 35, 65],"Week2": [20, 10, 25, 55],"Week3": [19, 35, 40, 15],"Week4": [7, 10, 70, 105]})
# define which columns need to be created
# this will be the range between 1 and the maximum of the Total Number column
columns_to_fill = ["col" + str(i) for i in range(1, 8)]
# columns_to_fill = [ col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, .... , col28 ]
# now, go through each row of your dataframe
for indx, row in df.iterrows():
# and for each column in the new columns to be filled
# check if the number is smaller or equal than the row's Total Number
# if it is smaller, fill the column with 1
# else fill the column with 0
for number, column in enumerate(columns_to_fill):
if number + 1 <= row["Week1"]:
df.loc[indx, column] = 1
else:
df.loc[indx, column] = 0
# now check if there is a remainder
remainder = row["Week1"] - 7
# while remainder is greater than 0
# we need to continue adding +1 to the columns
while remainder > 0:
for number, column in enumerate(columns_to_fill):
if number + 1 <= remainder:
df.loc[indx, column] += 1
else:
continue
# update remainder
remainder = remainder - 7
Solution 1:[1]
Here is a vectorized option, first repeat
each row 7 times (number of days a week), and add an extra index level with set_index
being the day number.
_df = (
df.loc[df.index.repeat(7)]
.set_index(np.array(list(range(1,8))*len(df)), append=True)
)
print(_df.head(10))
# Week1 Week2 Week3 Week4
# 0 1 9 20 19 7
# 2 9 20 19 7
# 3 9 20 19 7
# 4 9 20 19 7
# 5 9 20 19 7
# 6 9 20 19 7
# 7 9 20 19 7
# 1 1 30 10 35 10
# 2 30 10 35 10
# 3 30 10 35 10
now calculate the result of the entire division with //7
, then add the rest where needed using the modulo %
that you can compare with the extra index level created as it is the day number.
# entire division
res = _df//7
# add the rest where needed
res += (_df%7 >= _df.index.get_level_values(1).to_numpy()[:, None]).astype(int)
print(res)
# Week1 Week2 Week3 Week4
# 0 1 2 3 3 1
# 2 2 3 3 1
# 3 1 3 3 1
# 4 1 3 3 1
# 5 1 3 3 1
# 6 1 3 2 1
# 7 1 2 2 1
# 1 1 5 2 5 2
# 2 5 2 5 2
# 3 4 2 5 2
Finally, reshape and rename columns if wanted.
# reshape the result
res = res.unstack()
# rename the columns if you don't want multiindex
res.columns = [f'{w}_col{i}' for w, i in res.columns]
print(res)
# Week1_col1 Week1_col2 Week1_col3 Week1_col4 Week1_col5 Week1_col6 \
# 0 2 2 1 1 1 1
# 1 5 5 4 4 4 4
# 2 5 5 5 5 5 5
# 3 10 10 9 9 9 9
# Week1_col7 Week2_col1 Week2_col2 Week2_col3 Week2_col4 Week2_col5 \
# 0 1 3 3 3 3 3
# 1 4 2 2 2 1 1
# ...
and you can still join
to your original dataframe
res = df.join(res)
Solution 2:[2]
If you want ot use while/for loops, this iterates over the original data frame. The data frame rows can be of any length and have any number of header elements. The sub-header can have any number of elements (1D).
#Import
import numpy as np
import pandas as pd
#Example data frame and sub-header.
df = pd.DataFrame({"Week1": [9, 30, 35, 65],"Week2": [20, 10, 25, 55],"Week3": [19, 35, 40, 15],"Week4": [7, 10, 70, 105]})
subHeader = ['day1','day2','day3','day4','day5','day6','day7']
#Sort data frame and sub header.
df = df.reindex(sorted(df.columns), axis=1)
subHeader.sort()
#Extract relevant variables.
cols = df.shape[1]
rows = df.shape[0]
subHeadLen = len(subHeader)
mainHeader = list(df.columns)
meanHeadLen = len(mainHeader)
#MultiIndex main header with sub-header.
header = pd.MultiIndex.from_product([mainHeader,subHeader], names=['Week','Day'])
#Hold vals in temporary matrix.
mat = np.zeros((rows,meanHeadLen*subHeadLen))
#Iterate over data frame weeks. For every value in each row distribute over matrix indices by incrementing elements daily.
for col in range(cols):
for val in range(rows):
while df.iat[val,col] > 0:
for subVal in range(subHeadLen):
if df.iat[val,col] > 0:
mat[val][col*subHeadLen + subVal] = mat[val][col*subHeadLen + subVal] + 1
df.iat[val,col] = df.iat[val,col] - 1
#Final data frame.
df2 = pd.DataFrame(mat,columns=header)
print(df2)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Ben.T |
Solution 2 |