'Pairwise comparison algorithm with time complexity better than O(n**2)

I have around 500,000 arrays of 10 words i.e. 500,000 word 10-grams. For every 10-gram, I need to know in which positions, if any, the remaining 499,999 10-grams have identical elements:

a = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']

b = ['A', 'M', 'C', 'M', 'E', 'M', 'G', 'M', 'I', 'M']

...

z = ['R', 'R', 'R', 'R', 'R', 'F', 'G', 'H', 'I', 'J']

If we use a 1 for positions where the two arrays contain the same word and a 0 for positions where they contain different words, the intersection of a with b would be represented as [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]; the intersection of a with z would be represented as [0, 0, 0, 0, 0, 1, 1, 1, 1, 1], etc.

Can we do better than a naive O(n**2) algorithm, i.e. one for loop within another for loop?



Solution 1:[1]

Fun question :)

So I have an idea and I think it is O(n*log(n) + n) where +n asymptotically irrelevant.

So I would propose something as follows:

tuple_len = 10
min_value = 1
max_value = 10
number_of_entries = 100
l = [[j] + [randint(min_value,max_value) for i in range(tuple_len)] for j in range(number_of_entries)]

Base set:

[[0, 9, 10, 3, 6, 3, 10, 9, 7, 8, 4],
 [1, 2, 3, 6, 7, 9, 2, 5, 10, 6, 10],
 [2, 5, 4, 10, 8, 5, 9, 2, 7, 4, 3],
 [3, 5, 9, 4, 5, 5, 3, 10, 1, 4, 4],
 [4, 9, 10, 9, 10, 9, 10, 6, 1, 6, 2],
 [5, 5, 6, 3, 6, 9, 5, 8, 3, 1, 1],
 [6, 9, 7, 5, 5, 5, 2, 1, 2, 3, 6],
 [7, 2, 6, 9, 10, 5, 6, 7, 3, 7, 5],
 [8, 6, 8, 9, 3, 7, 1, 2, 9, 8, 10],
 [9, 7, 5, 7, 2, 1, 3, 7, 1, 2, 9],
 [10, 1, 4, 4, 3, 6, 9, 6, 3, 3, 8],
 [11, 8, 3, 10, 10, 5, 9, 7, 3, 4, 5],
...]

So I just used numbers for convenience and added the position in list as first value.

I propose to sort the dataset for each column of the data in turn, where sorting is O(n*log(n)), and then add the positional value of all entries with the same value to a set, which is O(n). The result looks something like:

[{6, 18, 24, 26},
 {22, 34},
 {1, 6, 19, 31, 57, 58},
 {1, 9, 18},
...}

This can be interpreted as Entry 6, 18, 24 and 26 have the same value in position 1. Checking if two entries correspond is Ò(1) with

true if (a in match_set) and (b in match_set) else false

Code example below:

match_sets = [set() for i in range(tuple_len)]


for position in range(tuple_len):
    l = sorted(l, key= lambda x: x[position+1])
    last_value = l[0][position+1]
    for entry in range(number_of_entries):
        if l[entry][position + 1] == last_value:
            match_sets[position].add(l[entry][0])
            last_value = l[entry][position + 1]
        

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 FloLie