Create a column for each first directory of a path and fill the column with each last directory of the same path

This dataset represents a collection of image information. Each image has some tags that are stored very badly. In particular, I have a dataframe with a column ('tags_path') that holds, for each observation, a single string listing several comma-separated paths, like this:

df['tags_path'][0]
',/SITUATION/Group Photo,/CONTENT YEAR/Years 2020/2022,/FRAMEWORK/Otherframeworks/Tracks,/PERSON/Editor/Mark,/PERSON/Co-Editor/Paul,/PERSON/Protagonist/Charles,/SITUATION/Victory,/SITUATION/Portrait,'

As you can see, there are several paths in this string. The first directory of each path is the tag category, while the last directory is the tag name. For example, the above observation contains:

SITUATION->['Group Photo', 'Victory', 'Portrait']
CONTENT YEAR->['2022']
FRAMEWORK->['Tracks']
PERSON->['Mark', 'Paul', 'Charles']

I would like to create a column in the dataframe for each "tag category" (SITUATION, CONTENT YEAR, FRAMEWORK, etc.) that contains its own list of tags. So far I have managed to create an empty column for every unique tag category in my dataset like this:

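import re

# Wrap each string in commas so that every path is preceded by ','
df['tags_path'] = ',' + df['tags_path'] + ','
# Capture the first directory of every path, e.g. ',/SITUATION'
tags = [re.findall(r',/[a-zA-Z .]+', str(path)) for path in df['tags_path']]
# Strip the leading ',/' and keep the unique category names
flat_tags_columns = [x[2:] for x in {item for sublist in tags for item in sublist}]
for col in flat_tags_columns:
    df[col] = 0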

Now I need to fill the columns with the respective tags. Thanks.

python pandas

Solution 1:^[1]

With the following toy dataframe:

from pathlib import Path
import pandas as pd

df = pd.DataFrame(
    {
        "tags_path": [
            ",/SITUATION/Group Photo,/CONTENT YEAR/Years 2020/2022,/FRAMEWORK/Otherframeworks/Tracks,/PERSON/Editor/Mark,/PERSON/Co-Editor/Paul,/PERSON/Protagonist/Charles,/SITUATION/Victory,/SITUATION/Portrait,",
            ",/SITUATION/Group Photo,/CONTENT YEAR/Years 2020/2021,/FRAMEWORK/Otherframeworks/Tracks,/PERSON/Editor/Peter,/PERSON/Co-Editor/John,/PERSON/Protagonist/Charly,/SITUATION/Victory,/SITUATION/Portrait,",
        ]
    }
)

I suggest a different approach that takes advantage of the pathlib module in the Python standard library for dealing with path-like objects:

def process(tag):
    """Helper function which extracts columns names and values as lists.
    """
    paths = [Path(item) for item in tag.split(",")]
    data = {str(path.parts[1]): [] for path in paths if path.parts}
    for path in paths:
        try:
            data[str(path.parts[1])].append(path.name)
        except IndexError:
            pass
    return [list(data.keys()), list(data.values())]
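
To see why the helper reads parts[1] for the category and name for the tag, here is a small check on a single path. It is only a sketch: it uses PurePosixPath instead of Path (my assumption, so that the result does not depend on the operating system), and the variable names are illustrative.

```python
from pathlib import PurePosixPath

# One path taken from the toy dataframe above
path = PurePosixPath("/CONTENT YEAR/Years 2020/2022")

parts = path.parts        # the root '/' comes first, then each directory
category = path.parts[1]  # first directory after the root
tag = path.name           # last component of the path

# The empty strings produced by the leading and trailing commas
# give a path with no parts, which the helper skips.
empty = PurePosixPath("")

print(parts)
print(category, "->", tag)
print(empty.parts)
```

Note that a path without a leading slash has no root entry, so parts[1] would then be the second directory rather than the category.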


# Temporary columns
df["temp"] = df["tags_path"].apply(process)
df[["columns", "values"]] = pd.DataFrame(df["temp"].tolist(), index=df.index)

# Add final columns (this assumes every row has the same categories, in the same order, as row 0)
df[df["columns"][0]] = pd.DataFrame(df["values"].tolist(), index=df.index)

# Cleanup
df = df.drop(columns=["temp", "columns", "values"])

print(df)
# Output
                                           tags_path  \
0  ,/SITUATION/Group Photo,/CONTENT YEAR/Years 20...   
1  ,/SITUATION/Group Photo,/CONTENT YEAR/Years 20...   

                          SITUATION CONTENT YEAR FRAMEWORK  \
0  [Group Photo, Victory, Portrait]       [2022]  [Tracks]   
1  [Group Photo, Victory, Portrait]       [2021]  [Tracks]   

                  PERSON  
0  [Mark, Paul, Charles]  
1  [Peter, John, Charly]
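
If the rows do not all contain the same categories, a possible variation is to build one dictionary per row and let pandas align the columns. This is a sketch rather than part of the answer above: tags_by_category is a hypothetical helper name, the two-row dataframe is made up for illustration, and plain string splitting replaces pathlib so that a path missing its leading slash is still handled.

```python
from collections import defaultdict

import pandas as pd


def tags_by_category(tags_path):
    """Map each category (first directory) to its list of tag names (last directory)."""
    grouped = defaultdict(list)
    for raw in str(tags_path).split(","):
        # Drop empty pieces caused by leading slashes or stray commas
        parts = [piece for piece in raw.strip().split("/") if piece]
        if len(parts) >= 2:
            grouped[parts[0]].append(parts[-1])
    return dict(grouped)


# Illustrative data: the rows have different categories,
# and one path lacks its leading slash.
df = pd.DataFrame(
    {
        "tags_path": [
            ",/SITUATION/Group Photo,/PERSON/Editor/Mark,PERSON/Protagonist/Charles,",
            ",/FRAMEWORK/Otherframeworks/Tracks,/SITUATION/Victory,",
        ]
    }
)

# One dict per row; pandas creates a column per key and fills gaps with NaN
expanded = pd.DataFrame(df["tags_path"].apply(tags_by_category).tolist(), index=df.index)
result = df.join(expanded)
print(result)
```

Categories missing from a row end up as NaN instead of a list, so downstream code may need to handle both cases.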

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Laurent
