'Create a column for each first directory of a path and fill the column with each last directory of the same path
This dataset represents a collection of image information. Each image has some tags that are stored very badly. In particular I have a dataframe with a column ('tags_path') which is a string representing a list of several paths for each observation of my dataset like this:
df['tags_path'][0]
',/SITUATION/Group Photo,/CONTENT YEAR/Years 2020/2022,/FRAMEWORK/Otherframeworks/Tracks,/PERSON/Editor/Mark,/PERSON/Co-Editor/Paul,PERSON/Protagonist/Cherles,/SITUATION/Victory,/SITUATION/Portrait,'
as you can see there are several paths in this string, each first directory of each path represents the category of the tag while each last directory represents the tag name. For example in the above observation we have:
SITUATION->['Group Photo', 'Victory', 'Potrait']
CONTENT YEAR->['2022']
FRAMEWORK->['Tracks']
PERSON->['Mark', 'Paul', 'Charles']
I would like to create a column in the dataframe for each "tag category" (SITUATION, CONTENT-YEAR, FRAMEWORK, ecc...) which contains their own list of tags. Since now i managed to create an empty column for all unique tag cotegories of my dataset like this:
df['tags_path'] = ','+df['tags_path']+','
tags = [re.findall(r',/[a-zA-Z .]+', str(df.loc[i, 'tags_path'])) for i in range(len(df))]
flat_tags_columns = [x[2:] for x in list(set([item for sublist in tags for item in sublist]))]
for i in flat_tags_columns:
df[i] = 0
Now i need to fill the columns with the respective tags. Thanks.
Solution 1:[1]
With the following toy dataframe:
from pathlib import Path
import pandas as pd
df = pd.DataFrame(
{
"tags_path": [
",/SITUATION/Group Photo,/CONTENT YEAR/Years 2020/2022,/FRAMEWORK/Otherframeworks/Tracks,/PERSON/Editor/Mark,/PERSON/Co-Editor/Paul,/PERSON/Protagonist/Charles,/SITUATION/Victory,/SITUATION/Portrait,",
",/SITUATION/Group Photo,/CONTENT YEAR/Years 2020/2021,/FRAMEWORK/Otherframeworks/Tracks,/PERSON/Editor/Peter,/PERSON/Co-Editor/John,/PERSON/Protagonist/Charly,/SITUATION/Victory,/SITUATION/Portrait,",
]
}
)
I suggest a different approach, taking advantage of Python standard library Pathlib module for dealing with pathlike objects:
def process(tag):
"""Helper function which extracts columns names and values as lists.
"""
paths = [Path(item) for item in tag.split(",")]
data = {str(path.parts[1]): [] for path in paths if path.parts}
for path in paths:
try:
data[str(path.parts[1])].append(path.name)
except IndexError:
pass
return [[col for col in data.keys()], [value for value in data.values()]]
# Temporary columns
df["temp"] = df["tags_path"].apply(process)
df[["columns", "values"]] = pd.DataFrame(df["temp"].tolist(), index=df.index)
# Add final columns
df[df["columns"][0]] = pd.DataFrame(df["values"].tolist(), index=df.index)
# Cleanup
df = df.drop(columns=["temp", "columns", "values"])
print(df)
# Output
tags_path \
0 ,/SITUATION/Group Photo,/CONTENT YEAR/Years 20...
1 ,/SITUATION/Group Photo,/CONTENT YEAR/Years 20...
SITUATION CONTENT YEAR FRAMEWORK \
0 [Group Photo, Victory, Portrait] [2022] [Tracks]
1 [Group Photo, Victory, Portrait] [2021] [Tracks]
PERSON
0 [Mark, Paul, Charles]
1 [Peter, John, Charly]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Laurent |