'Pandas: How to add a grouping variable based upon another column?

I have a dataframe with some id's and some dates. I want to be able to group the id's by their change in date to create a generalized "grouping_variable". In r I would do it like this:

df <- tibble(id = c(rep("1", 4), rep("2", 4), rep("3", 4)),
             dates = as_date(c('2022-02-07', '2022-02-07', '2022-02-08', '2022-02-08',
                         '2022-02-09', '2022-02-09', '2022-02-10', '2022-02-10', 
                         '2022-02-11', '2022-02-11', '2022-02-11', '2022-02-11')))

df <- df %>% group_by(id) %>% mutate(grouping_var = match(dates, unique(dates)))

basically, this code groups by the id, and then within the groups, each unique date is assigned a value, and then value is then joined with the actual date, which results in a column with these values: 1 1 2 2 1 1 2 2 1 1 1 1

In Python/ pandas I can't find an equivalent to the match function. Does anyone know how to do that?

Here is some sample data in Python:

d = {'user' : ["1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3"],
    'dates' : ['2022-02-07', '2022-02-07', '2022-02-08', '2022-02-08',
                         '2022-02-09', '2022-02-09', '2022-02-10', '2022-02-10', 
                         '2022-02-11', '2022-02-11', '2022-02-11', '2022-02-11'],
    'hoped_for_output' : [1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1]}

example_df = pd.DataFrame(data = d)

Many thanks!



Solution 1:[1]

We may use factorize after grouping by 'user'

d['hoped_for_output'] = d.groupby(['user'])['dates'].transform(lambda x: pd.factorize(x)[0]) + 1

-output

d
 user   dates   hoped_for_output
0   1   2022-02-07  1
1   1   2022-02-07  1
2   1   2022-02-08  2
3   1   2022-02-08  2
4   2   2022-02-09  1
5   2   2022-02-09  1
6   2   2022-02-10  2
7   2   2022-02-10  2
8   3   2022-02-11  1
9   3   2022-02-11  1
10  3   2022-02-11  1
11  3   2022-02-11  1

data

d = pd.DataFrame(d)

Solution 2:[2]

May be interesting to see it being implemented with datar?

>>> from datar.all import f, c, rep, as_date, tibble, group_by, mutate, match, unique
>>> df = tibble(
...     id=c(rep("1", 4), rep("2", 4), rep("3", 4)),
...     dates=as_date(
...         c(
...             "2022-02-07",
...             "2022-02-07",
...             "2022-02-08",
...             "2022-02-08",
...             "2022-02-09",
...             "2022-02-09",
...             "2022-02-10",
...             "2022-02-10",
...             "2022-02-11",
...             "2022-02-11",
...             "2022-02-11",
...             "2022-02-11",
...         )
...     ),
... )
>>> df >> group_by(f.id) >> mutate(grouping_var=match(f.dates, unique(f.dates)) + 1)
         id            dates  grouping_var
   <object> <datetime64[ns]>       <int64>
0         1       2022-02-07             1
1         1       2022-02-07             1
2         1       2022-02-08             2
3         1       2022-02-08             2
4         2       2022-02-09             1
5         2       2022-02-09             1
6         2       2022-02-10             2
7         2       2022-02-10             2
8         3       2022-02-11             1
9         3       2022-02-11             1
10        3       2022-02-11             1
11        3       2022-02-11             1
[TibbleGrouped: id (n=3)]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Panwen Wang