Counting the number of co-occurrences of words for a specified vocabulary and within a specified radius?

I have a vocabulary V = ["anarchism", "originated", "term", "abuse"], and a list of words test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'abuse', 'the', 'english', 'term', 'anarchism'].

I'd like to count, for each pair of words in the vocabulary V, the number of times they co-occur within a radius R in the list of words test. If R=5, say, then for each occurrence of a vocabulary word in test we look 5 words to its left and 5 words to its right, and count how many times each word in V occurs within that window.

For example, let's take the first word in V, "anarchism". The word "anarchism" occurs first and last in test. For the first occurrence, we look 5 words to the left (i.e. nothing) and 5 words to the right ('originated', 'as', 'a', 'term', 'of'). Are any of these "anarchism"? No. For the last occurrence of "anarchism", we look 5 words to the left ('diggers', 'abuse', 'the', 'english', 'term') and 5 words to the right (again, nothing). Hence "anarchism" does not occur within a radius of 5 words of itself, so the (0, 0) entry of the output matrix, corresponding to ("anarchism", "anarchism"), is 0. However, the word "originated" occurs once within 5 words of "anarchism", so the (0, 1) entry, i.e. the ("anarchism", "originated") cell of the output matrix, is 1. Similarly, the word "term" occurs once within radius 5 of the first occurrence of "anarchism" and once within radius 5 of the second occurrence of "anarchism", so the (0, 2) entry of the output is 2. We continue in this way for each word in the vocabulary V.
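
To make the window lookup in this example concrete, here is a minimal sketch using list slicing. The helper name window is my own (hypothetical, not from any library), and it assumes the window is simply cut off at the ends of the list:

```python
def window(words, i, radius):
    """Return (left, right): up to `radius` words on each side of index i."""
    # max(0, ...) keeps the left slice from wrapping around to the end of the list
    left = words[max(0, i - radius):i]
    # slicing past the end of a list is safe in Python; it just returns fewer items
    right = words[i + 1:i + 1 + radius]
    return left, right


test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first',
        'used', 'against', 'early', 'working', 'class', 'radicals', 'including',
        'the', 'diggers', 'abuse', 'the', 'english', 'term', 'anarchism']

# First occurrence of "anarchism" (index 0): nothing to the left
print(window(test, 0, 5))
# ([], ['originated', 'as', 'a', 'term', 'of'])

# Last occurrence of "anarchism" (last index): nothing to the right
print(window(test, len(test) - 1, 5))
# (['diggers', 'abuse', 'the', 'english', 'term'], [])
```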

The resulting output is therefore a 4x4 matrix (since there are 4 words in V), and it is symmetric, since for example the counts of co-occurrences for ("anarchism", "originated") are the same as ("originated", "anarchism").

For this example, the output (e.g. as a NumPy array) looks like:

0 1 2 1
1 0 1 1
2 1 0 2
1 1 2 0

The rows and columns are indexed in the same order as the entries of V. How can I implement this in Python?
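
For reference, here is a rough, unoptimized sketch of the counting I have in mind, written with plain nested lists so it has no dependencies. The function name cooccurrence_matrix is my own placeholder. It assumes that every occurrence of a vocabulary word is treated as a window centre, so if the same word appeared twice within the radius, that pair would be counted once from each occurrence on the diagonal (this does not happen in the example above):

```python
def cooccurrence_matrix(words, vocab, radius):
    """Count, for each pair of vocab words, how often they lie within `radius` positions."""
    index = {w: k for k, w in enumerate(vocab)}   # word -> row/column number
    n = len(vocab)
    counts = [[0] * n for _ in range(n)]
    for i, w in enumerate(words):
        if w not in index:
            continue                              # only vocab words act as window centres
        lo = max(0, i - radius)
        hi = min(len(words), i + radius + 1)
        for j in range(lo, hi):
            # skip the centre word itself; ignore neighbours outside the vocabulary
            if j != i and words[j] in index:
                counts[index[w]][index[words[j]]] += 1
    return counts


V = ["anarchism", "originated", "term", "abuse"]
test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first',
        'used', 'against', 'early', 'working', 'class', 'radicals', 'including',
        'the', 'diggers', 'abuse', 'the', 'english', 'term', 'anarchism']

counts = cooccurrence_matrix(test, V, 5)
for row in counts:
    print(*row)
# 0 1 2 1
# 1 0 1 1
# 2 1 0 2
# 1 1 2 0
```

Passing the nested list to numpy.array should give the array form. This loop does work proportional to len(test) * R, so I'd also be interested in anything faster for long texts.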



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
