I want to speed up a nested loop that builds a DataFrame of keyword co-occurrence counts (keywords appearing together with other keywords).

Using pandas, I have a df that is 14000 rows by 56 columns (keywords). I also have a keyword list (full_keys) of 1406 items and an all-zero DataFrame (called key_frame) that uses the keyword list as both index and columns (so 1406x1406).

I want to iterate through each row of the df and, for each keyword in the row, increment the cell in key_frame at the intersection of that keyword and each other keyword that appears in the row.

import pandas as pd

key_frame = pd.DataFrame(columns=full_keys, index=full_keys).fillna(0)  # create empty 1406x1406 df of zeros

for i, r in keywords.iterrows():  # iterate through each row in df
  for index, rows in key_frame.iterrows():  # iterate through blank df to tabulate
    if index in list(r):  # if the index (which is a keyword) appears in the row...
      for x in r:  # ...increment the tally at each intersection loc[index, x]
        key_frame.loc[index, x] += 1

I also have this variant, identical to the above except for the commented line:

key_frame_2 = pd.DataFrame(columns=full_keys, index=full_keys).fillna(0)

for i, r in keywords.iterrows():
  for x in full_keys:  # iterate through the list rather than key_frame, it's...faster?
    if x in list(r):
      for keyword in r:
        key_frame_2.loc[x, keyword] += 1

Feeling rusty and would love some pointers! I know there is a better way to do this.



Solution 1: [1]

Assuming that the keywords data frame only contains elements from full_keys:

>>> full_keys = ["A", "B", "C", "D", "E"]
>>> keywords
   0  1  2
0  B  A  C
1  C  B  D
2  E  C  D
3  A  D  E
4  E  A  D
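
For anyone who wants to follow along, one way to construct this sample data is shown below (this setup snippet is my own addition, not part of the original answer):

import pandas as pd

full_keys = ["A", "B", "C", "D", "E"]
keywords = pd.DataFrame([
    ["B", "A", "C"],
    ["C", "B", "D"],
    ["E", "C", "D"],
    ["A", "D", "E"],
    ["E", "A", "D"],
])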

you can first create a Series that contains, for each row in keywords, a list of all the pairs of keywords in that row (in both orderings, and including self-pairs like (B, B)). We achieve that using itertools.product:

import itertools
key_frame = keywords.apply(lambda row: list(itertools.product(row, repeat=2)), axis=1)
>>> key_frame
0    [(B, B), (B, A), (B, C), (A, B), (A, A), (A, C...
1    [(C, C), (C, B), (C, D), (B, C), (B, B), (B, D...
2    [(E, E), (E, C), (E, D), (C, E), (C, C), (C, D...
3    [(A, A), (A, D), (A, E), (D, A), (D, D), (D, E...
4    [(E, E), (E, A), (E, D), (A, E), (A, A), (A, D...
dtype: object
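
To see what itertools.product(row, repeat=2) produces for a single row, here is a quick illustration on a short two-keyword row (my own example, for brevity):

>>> import itertools
>>> list(itertools.product(["B", "A"], repeat=2))
[('B', 'B'), ('B', 'A'), ('A', 'B'), ('A', 'A')]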

Then, flatten those lists to get a Series of all pairs, and count the pairs (this basically gives you a one-dimensional histogram of keyword pairs; note that a diagonal pair like (D, D) is counted once for every row that contains D, assuming no keyword repeats within a row):

key_frame = key_frame.explode().value_counts()
>>> key_frame
(D, D)    4
(A, A)    3
(D, E)    3
...
(C, E)    1
(A, C)    1
(A, B)    1
dtype: int64

Now, to turn this into a two-dimensional histogram of keywords, we use unstack(), which requires that the index be a MultiIndex, instead of an index of tuples:

key_frame.index = pd.MultiIndex.from_tuples(key_frame.index)
key_frame = key_frame.unstack()
>>> key_frame
     A    B    C    D    E
A  3.0  1.0  1.0  2.0  2.0
B  1.0  2.0  2.0  1.0  NaN
C  1.0  2.0  3.0  2.0  1.0
D  2.0  1.0  2.0  4.0  3.0
E  2.0  NaN  1.0  3.0  3.0

At this stage the key_frame DataFrame contains all the non-zero entries of the DataFrame you want. This might already be good enough, but in order to fill in missing columns and rows, and to replace the NaN entries (pairs that never co-occur in any row) with zeros, we can use reindex() and fillna():

key_frame = key_frame.reindex(full_keys, columns=full_keys).fillna(0)
>>> key_frame
     A    B    C    D    E
A  3.0  1.0  1.0  2.0  2.0
B  1.0  2.0  2.0  1.0  0.0
C  1.0  2.0  3.0  2.0  1.0
D  2.0  1.0  2.0  4.0  3.0
E  2.0  0.0  1.0  3.0  3.0

This should give you a DataFrame identical to the one produced by your code, but at a fraction of the time: running your code took around 100x as long as this solution on the same keywords DataFrame (with 100 rows and 10 columns).
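
For convenience, the whole pipeline can be collected into a single function. The following is only a sketch: the name count_cooccurrences is my own, and it assumes, as above, that every value in keywords appears in full_keys:

import itertools
import pandas as pd

def count_cooccurrences(keywords: pd.DataFrame, full_keys: list) -> pd.DataFrame:
    # One list of ordered keyword pairs per row (including self-pairs).
    pairs = keywords.apply(lambda row: list(itertools.product(row, repeat=2)), axis=1)
    # Flatten to one pair per element, then count how often each pair occurs.
    counts = pairs.explode().value_counts()
    # unstack() needs a MultiIndex rather than a flat index of tuples.
    counts.index = pd.MultiIndex.from_tuples(counts.index)
    # Pivot to a 2-D table, add any missing keywords, and zero out NaNs.
    return counts.unstack().reindex(full_keys, columns=full_keys).fillna(0)

key_frame = count_cooccurrences(keywords, full_keys)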

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
[1] Solution 1: Orb