'Filter Pandas dataframe by the number of user-item interactions

I'm given a Pandas dataframe with columns user_id, item_id, timestamp . Suppose that there is only one user_id - item_id interaction, i.e., there exist only one timestamp with this particular interaction:

     user_id  item_id   timestamp
  0      1       2         123
  1      1       3         145
  2      4       6         123
  3      5       7         198

Given a parameter threshold I filter out those user_ids that appear <= threshold number of times:

data = data.groupby("user_id").filter(lambda x: len(x) > threshold)

and get (for threshold = 1):

     user_id  item_id   timestamp
  0      1       2         123
  1      1       3         145

because only user "1" has interacted more than threshold items.

Now, suppose that there may be multiple user_id - item_id interactions, i.e., there could be several timestamps with a particular interaction:

     user_id  item_id   timestamp
  0      1       2         123
  1      1       2         145
  2      4       6         123
  3      4       7         198

What would be the most elegant (the fastest) way to filter out those users that have <= threshold number of unique interactions? The desired output would then be:

     user_id  item_id   timestamp
  0      4       6         123
  1      4       7         198

(because user "1" has interacted only with 1 item, and user "4" remains there, because he has interacted with 2 items).

One way I thought of (not that elegant, huh?):

data_cold = data.groupby('user_id').agg({'item_id':lambda x: x.nunique()})
data_cold = data_cold.reset_index()
data_cold = data_cold[data_cold.item_id > threshold]
data = data[data['user_id'].isin(data_cold.user_id)]

Solution 1:^[1]

if I understand correctly, you could group by two columns and then count groups with size(). After that, the unique interactions will have count of 1 and can be filtered if wanted:

data = data.groupby(["user_id", "item_id"]).size().reset_index(name = 'counts')
uniques = data[data['counts'] == 1]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1

'Filter Pandas dataframe by the number of user-item interactions

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]