Slicing a dataframe using matches to build a new dataframe with Pandas?
I am trying to write code that takes a dataframe, finds every occurrence of "Start: " in one column, and builds a 'slice' for each occurrence. The first row of a slice is the "Start: " match, the last row is the "End: " row whose text matches the string after "Start: ", and every row in between is included.
I want the slices collected in a new dataframe, with a blank row separating each slice.
My current approach works, but it becomes ridiculously slow when I scale my sheet up to larger sizes (750,000 rows).
I'm unsure how else to approach the problem, or how to make it fast enough that large dataframes don't slow it to a crawl.
My df:
The part of my code where I think the issue (or bad methodology) is:
new_df = pd.DataFrame({}, columns = df.columns)
new_df = new_df.append(pd.Series(), ignore_index = True)
for value in list_of_commands:
    if 'Start: ' in value:
        value_to_match = value[6:]
        idx_start = df[df[col_name_source].str.contains(value_to_match, na = False)].first_valid_index()
        idx_end = df[df[col_name_source].str.contains(value_to_match, na = False)].last_valid_index()
        new_df = pd.concat([new_df, df.loc[idx_start:idx_end, :]])
        new_df = new_df.append(pd.Series(), ignore_index = True)
Whole program:
import pandas as pd
sheets_index = [
    ('Numbers_', '0'), ('Numbers_', '1'), ('Numbers_', '2'), ('Numbers_', '3'),
    ('Numbers_1', '0'), ('Numbers_1', '1'), ('Numbers_1', '2'), ('Numbers_1', '3'),
    ('Numbers_TEST', '0'), ('Numbers_TEST', '1'), ('Numbers_TEST', '2'), ('Numbers_TEST', '3'),
    ('Numbers_TEST', '4'), ('Numbers_TEST', '5'), ('Numbers_TEST', '6'), ('Numbers_TEST', '7'), ('Numbers_TEST', '8')
]
index = pd.MultiIndex.from_tuples(sheets_index, names=['Id1','Id2'])
df = pd.DataFrame(
    {
        'TYPE': ['AA','aa','Aa','aA','DD','dd','Dd','dD','11','AA','11','aa','11','Aa','11','aA','11'],
        'DATE': ['BB','bb','Bb','bB','CC','cc','Cc','cC','22','BB','22','bb','22','Bb','22','bB','22'],
        'OTHER': ['CC','cc','Cc','cC','BB','bb','Bb','bB','33','CC','33','cc','33','Cc','33','cC','33'],
        'SOURCE': ['DD','dd','Dd','dD','AA','aa','Aa','aA','XX','Start: Test_function1','Start: Test_function2','dd','','End: Test_function1','','zz','End: Test_function2']
    },
    index=index
)
command_list = ["AA", "dd", "DD"]
warning_list = ["Dd", "dD"]
ingenium_list = ["CC", "BB"]
col_name_type = 'TYPE'
col_name_other = 'OTHER'
col_name_source = 'SOURCE'
df_filtered_command = df[df[col_name_type].isin(command_list)]
df_filtered_warnings = df[df[col_name_type].isin(warning_list)]
df_filtered_other = df[df[col_name_other].isin(ingenium_list)]
df_final_command = df_filtered_command[(df_filtered_command[col_name_source].str.endswith('001', na=False)) |
                                       (df_filtered_command[col_name_source].str.contains("a"))]
list_of_commands = df[col_name_source].dropna().tolist()
new_df = pd.DataFrame({}, columns = df.columns)
new_df = new_df.append(pd.Series(), ignore_index = True)
for value in list_of_commands:
    if 'Start: ' in value:
        value_to_match = value[6:]
        idx_start = df[df[col_name_source].str.contains(value_to_match, na = False)].first_valid_index()
        idx_end = df[df[col_name_source].str.contains(value_to_match, na = False)].last_valid_index()
        new_df = pd.concat([new_df, df.loc[idx_start:idx_end, :]])
        new_df = new_df.append(pd.Series(), ignore_index = True)
print(f'\n {new_df} \n')
Solution 1:[1]
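pd.concat() can be expensive! Every call copies all the rows accumulated so far, so calling it inside the loop repeats that work for each slice. Try populating a list of dfs and doing one pd.concat() at the end.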
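As a rough sketch of that suggestion (not the asker's exact code): the function name `collect_slices` and the small `demo` frame are made up for illustration. It also assumes a plain substring match (`regex=False`) and slices by row position rather than by index label. `DataFrame.append` was removed in pandas 2.0, so the blank separator rows are built as a one-row dataframe and concatenated along with the slices.

```python
import pandas as pd


def collect_slices(df, col_name_source):
    """Gather each Start/End slice in a list, then concatenate once."""
    # One all-empty row, reused as the separator between slices.
    blank_row = pd.DataFrame([[None] * len(df.columns)], columns=df.columns)
    pieces = [blank_row]

    source = df[col_name_source]
    for value in source.dropna():
        if "Start: " in value:
            # Text after the prefix, e.g. "Test_function1".
            value_to_match = value[len("Start: "):]
            # Assumption: plain substring match, computed once per label.
            mask = source.str.contains(value_to_match, na=False, regex=False)
            positions = mask.to_numpy().nonzero()[0]
            # Positional slice from the first match to the last match, inclusive.
            pieces.append(df.iloc[positions[0]:positions[-1] + 1])
            pieces.append(blank_row)

    # A single concat instead of one per slice.
    return pd.concat(pieces, ignore_index=True)


# Hypothetical sample data, shaped like the SOURCE column in the question.
demo = pd.DataFrame(
    {
        "TYPE": ["11", "AA", "11", "aa", "11", "Aa"],
        "SOURCE": ["XX", "Start: f1", "Start: f2", "dd", "End: f1", "End: f2"],
    }
)
result = collect_slices(demo, "SOURCE")
print(result)
```

This still scans the column once per "Start: " row, so it only removes the repeated concatenation cost. If the scans themselves turn out to dominate at 750,000 rows, looking up the start and end positions in a single pass would be the next thing to try.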
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Mose Wintner |