Counting number of co-occurrences of words for a specified vocabulary and within a specified radius?
I have a vocabulary V = ["anarchism", "originated", "term", "abuse"] and a list of words test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'abuse', 'the', 'english', 'term', 'anarchism'].
I'd like to count the number of co-occurrences, within a radius R in the list of words test, between each pair of words in the vocabulary V. If R=5, say, then for each occurrence of a vocabulary word in test we look 5 words to its left and 5 words to its right, and count the number of times each word in V occurs within that radius of 5.
For example, let's take the first word in V, "anarchism". The word "anarchism" occurs first and last in test. For the first occurrence, we look 5 words to the left (i.e. nothing) and 5 words to the right ('originated', 'as', 'a', 'term', 'of'). Are any of these "anarchism"? No. For the last occurrence of "anarchism", we look 5 words to the left ('diggers', 'abuse', 'the', 'english', 'term') and 5 words to the right (again, nothing). Hence "anarchism" does not occur within a radius of 5 words of itself, so the (0, 0) entry of the output matrix, corresponding to ("anarchism", "anarchism"), is 0. However, the word "originated" occurs once within 5 words of "anarchism", so the (0, 1) entry, i.e. the ("anarchism", "originated") cell of the output matrix, is 1. Similarly, the word "term" occurs once within radius 5 of the first occurrence of "anarchism" and once within radius 5 of the second occurrence, so the (0, 2) entry of the output is 2. We continue in this way for each word in the vocabulary V.
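The window lookup in this walkthrough can be sketched as below. This is only an illustration: the helper name window_around is made up here, and it assumes the radius excludes the center word itself and is simply cut off at the ends of the list.

```python
def window_around(words, i, radius):
    """Return the words up to `radius` positions left and right of words[i]."""
    left = words[max(0, i - radius):i]      # clipped at the start of the list
    right = words[i + 1:i + 1 + radius]     # slicing clips at the end automatically
    return left, right


test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first',
        'used', 'against', 'early', 'working', 'class', 'radicals', 'including',
        'the', 'diggers', 'abuse', 'the', 'english', 'term', 'anarchism']

# Show the neighbourhood of every occurrence of "anarchism" for R = 5.
for i, word in enumerate(test):
    if word == "anarchism":
        left, right = window_around(test, i, 5)
        print(i, left, right)
```

For the two occurrences (positions 0 and 21) this prints the same left and right neighbours listed in the walkthrough.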
The resulting output is therefore a 4x4 matrix (since there are 4 words in V), and it is symmetric, since for example the count of co-occurrences for ("anarchism", "originated") is the same as for ("originated", "anarchism").
For this example, the output (e.g. numpy array) looks like:
0 | 1 | 2 | 1 |
1 | 0 | 1 | 1 |
2 | 1 | 0 | 2 |
1 | 1 | 2 | 0 |
Each row and column corresponds to the respective entry of V. How can I implement this in Python?
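One possible direct approach, given as a minimal sketch rather than a definitive answer: walk over test, and for every position holding a vocabulary word, scan the R positions on either side and increment the matching cell. The function name cooccurrence_matrix is an assumption for illustration, and the result is a plain list of lists (it can be wrapped with numpy.array if an array is wanted).

```python
def cooccurrence_matrix(words, vocab, radius):
    """For each occurrence of a vocab word, count vocab words within `radius` positions."""
    index = {w: k for k, w in enumerate(vocab)}   # word -> row/column number
    n = len(vocab)
    counts = [[0] * n for _ in range(n)]
    for i, center in enumerate(words):
        if center not in index:
            continue
        start = max(0, i - radius)                # clip the window at the list ends
        stop = min(len(words), i + radius + 1)
        for j in range(start, stop):
            neighbour = words[j]
            if j != i and neighbour in index:     # skip the center word itself
                counts[index[center]][index[neighbour]] += 1
    return counts


V = ["anarchism", "originated", "term", "abuse"]
test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first',
        'used', 'against', 'early', 'working', 'class', 'radicals', 'including',
        'the', 'diggers', 'abuse', 'the', 'english', 'term', 'anarchism']

matrix = cooccurrence_matrix(test, V, 5)
for row in matrix:
    print(row)
# [0, 1, 2, 1]
# [1, 0, 1, 1]
# [2, 1, 0, 2]
# [1, 1, 2, 0]
```

This runs in O(len(test) * R) time. Note one consequence of following the description literally: if the same vocabulary word appears twice within the radius, the diagonal cell is incremented once from each occurrence, so that pair contributes 2. The example above has no such case, so its diagonal stays 0.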
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow