Probability of selling the same items again in a pandas dataframe

I need to know the probability of selling similar items together, based on a sales history formatted like this:

pd.DataFrame({"sale_id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
              "item": ["A", "B", "C", "A", "C", "A", "D", "E", "C", "B"],
              "qty": [1, 4, 3, 2, 8, 3, 6, 5, 12, 9]})

sale_id item    qty
1       A       1
1       B       4
1       C       3
2       A       2
2       C       8
3       A       3
3       D       6
3       E       5
3       C       12
4       B       9

I want to build a matrix like this:

[image: correlation-style matrix of item-pair probabilities]

I have tried pivoting the data frame and using pd.DataFrame.corr() with a custom callable, but I ran out of RAM by calling:

pd.pivot_table(df, index="sale_id", columns="item")

The actual dataframe that I'm using is 700,000 rows long and has 20,000 different items.
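On the sample above, that pivot looks like the sketch below; on the real data it becomes one row per sale with roughly 20,000 mostly-NaN columns, which is presumably where the RAM goes:

import pandas as pd

df = pd.DataFrame({"sale_id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
              "item": ["A", "B", "C", "A", "C", "A", "D", "E", "C", "B"],
              "qty": [1, 4, 3, 2, 8, 3, 6, 5, 12, 9]})

# one row per sale_id, one column per item, qty where sold, NaN elsewhere
wide = pd.pivot_table(df, index="sale_id", columns="item")
print(wide)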



Solution 1:[1]

I set out to find a purely Pandas way of doing things, and ended up also optimizing Pavel's method quite a lot.

Before anything else I do:

import pandas as pd
from itertools import combinations

df = pd.DataFrame({"sale_id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
              "item": ["A", "B", "C", "A", "C", "A", "D", "E", "C", "B"],
              "qty": [1, 4, 3, 2, 8, 3, 6, 5, 12, 9]})

# quantity is irrelevant to the co-purchase probability
df.drop('qty', axis=1, inplace=True)

Then, timing/memory tracing begins:

  1. Pandas-only method:
# count, for every sorted item pair, how many sales contain both items
values = df.groupby('sale_id')['item'].agg(lambda x: [*combinations(sorted(x), 2)]).explode().value_counts()
# divide by the total number of sales; pairs never sold together get 0
denom = df['sale_id'].nunique()
output = (values/denom).reindex(combinations(df['item'].unique(), 2), fill_value=0)
output

Output, Timing and Memory Usage:

(A, B)    0.25
(A, C)    0.75
(A, D)    0.25
(A, E)    0.25
(B, C)    0.25
(B, D)    0.00
(B, E)    0.00
(C, D)    0.25
(C, E)    0.25
(D, E)    0.25
Name: item, dtype: float64

475 µs ± 13.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

(End, Peak) Memory: (5442, 16440)
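If you want the matrix layout from the question, a quick sketch (assuming the df and output objects from above) spreads the pair probabilities into a symmetric item-by-item frame:

items = sorted(df['item'].unique())
mat = pd.DataFrame(0.0, index=items, columns=items)   # all-zero square frame
for (a, b), p in output.items():                      # output's index holds the item pairs
    mat.loc[a, b] = p
    mat.loc[b, a] = p
mat

The diagonal is left at 0, since the pair counts never include an item with itself.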
  2. Optimized version of Pavel's method:

The only differences are selecting just the item column from the groupby (so a Series is produced rather than a DataFrame), using agg instead of transform, and dropping the set, which is no longer necessary after agg.

grp = df.groupby('sale_id')['item'].agg(lambda x: ''.join(x))  # one string of item codes per sale
purchases = grp.apply(lambda x: ''.join(x)).unique()           # unique purchase strings
unique_items = df.item.unique()
res = {}
for c in combinations(unique_items, 2):
    c = set(c)
    res[frozenset(c)] = 0
    for i in purchases:
        if c.intersection(i) == c:
            res[frozenset(c)] += 1
for k, v in res.items():
    res[k] = v / purchases.shape[0]
res

Output, Timing and Memory Usage:

{frozenset({'A', 'B'}): 0.25,
 frozenset({'A', 'C'}): 0.75,
 frozenset({'A', 'D'}): 0.25,
 frozenset({'A', 'E'}): 0.25,
 frozenset({'B', 'C'}): 0.25,
 frozenset({'B', 'D'}): 0.0,
 frozenset({'B', 'E'}): 0.0,
 frozenset({'C', 'D'}): 0.25,
 frozenset({'C', 'E'}): 0.25,
 frozenset({'D', 'E'}): 0.25}

276 µs ± 6.59 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

(End, Peak) Memory: (7643, 19402)
  3. Original method:
grp = df.groupby('sale_id').transform(lambda x: ''.join(x))
purchases = grp['item'].apply(lambda x: ''.join(set(x))).unique()
unique_items = df.item.unique()
res = {}
for c in combinations(unique_items, 2):
    c = set(c)
    res[frozenset(c)] = 0
    for i in purchases:
        if c.intersection(i) == c:
            res[frozenset(c)] += 1
for k, v in res.items():
    res[k] = v / purchases.shape[0]
res

Timing and Memory Usage:

1.01 ms ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

(End, Peak) Memory: (15847, 27630)

Timing was done with %timeit in a Jupyter notebook, and memory was traced with tracemalloc.
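For reference, a minimal sketch of the kind of tracemalloc harness this implies (not the exact cells used); tracemalloc.get_traced_memory() returns the current and peak traced allocations in bytes:

import tracemalloc

tracemalloc.start()
# ... run one of the snippets above here ...
end, peak = tracemalloc.get_traced_memory()   # (current bytes, peak bytes)
tracemalloc.stop()
print(end, peak)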

Solution 2:[2]

I believe the standard algorithm for collaborative filtering goes something like this:

  • First, group the data by sale_id and combine the values in the item column.
  • Then, for each group, create a set of the items that were bought together.
  • Finally, create every possible combination of existing items as a set and take its intersection with each actual item set.

This is what it looks like for me. It should have linear space complexity, and I'm sure it can still be improved, but it works.

from itertools import combinations
import pandas as pd

df = pd.DataFrame({"sale_id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
              "item": ["A", "B", "C", "A", "C", "A", "D", "E", "C", "B"],
              "qty": [1, 4, 3, 2, 8, 3, 6, 5, 12, 9]})

# we don't care about quantity
df = df.loc[:, ['sale_id', 'item']]

# Get all the unique sets of items sold (each sale becomes a string of item
# codes; the set/string intersection below works because the items here are
# single characters)
grp = df.groupby('sale_id').transform(lambda x: ''.join(x))
purchases = grp['item'].apply(lambda x: ''.join(set(x))).unique()

# create all possible two-item pairs, then iterate over them
# adding 1 to the value of dictionary when the purchase
# matches the combination
unique_items = df.item.unique()
res = {}

for c in combinations(unique_items, 2):
    c = set(c)
    res[frozenset(c)] = 0
    for i in purchases:
        if c.intersection(i) == c:
            res[frozenset(c)] += 1

# get percentages
for k, v in res.items():
    res[k] = v / purchases.shape[0]

Output:

{frozenset({'A', 'B'}): 0.25,
 frozenset({'A', 'C'}): 0.75,
 frozenset({'A', 'D'}): 0.25,
 frozenset({'A', 'E'}): 0.25,
 frozenset({'B', 'C'}): 0.25,
 frozenset({'B', 'D'}): 0.0,
 frozenset({'B', 'E'}): 0.0,
 frozenset({'C', 'D'}): 0.25,
 frozenset({'C', 'E'}): 0.25,
 frozenset({'D', 'E'}): 0.25}
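A single pair can then be looked up by its frozenset key, for example:

res[frozenset(('A', 'C'))]   # 0.75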

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1
[2] Solution 2: pavel