Fastest way in numpy to get distance of product of n pairs in array
I have N number of points, for example:
A = [2, 3]
B = [3, 4]
C = [3, 3]
.
.
.
And they're in an array like so:
arr = np.array([[2, 3], [3, 4], [3, 3]])
I need as output all pairwise distances in BFS (Breadth First Search) order to track which distance is which, like: A->B, A->C, B->C. For the above example data, the result would be [1.41, 1.0, 1.0].
EDIT: I have to accomplish it with numpy or core libraries.
Solution 1:[1]
As an alternative to, but similar to, ddejohn's answer, we can use np.triu_indices, which returns just the upper-triangular indices of the matrix and may be more memory-efficient:
np.linalg.norm(arr - arr[:, None], axis=-1)[np.triu_indices(arr.shape[0], 1)]
This doesn't need extra steps such as flattening and a second indexing pass. Its performance is similar to the aforementioned answer for large data (e.g., you can check it with arr = np.random.rand(10000, 2) on Colab, which takes about 4.6 s for both); it may beat np.triu and flatten on larger data.
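To see why the one-liner above returns the distances in A->B, A->C, B->C order, here is a minimal sketch that splits it into steps on the question's three points. The `labels` list is a hypothetical addition, used only for printing.

```python
import numpy as np

arr = np.array([[2, 3], [3, 4], [3, 3]])  # points A, B, C from the question
labels = ["A", "B", "C"]                   # hypothetical names, for display only

# arr[:, None] has shape (3, 1, 2); subtracting it from arr (shape (3, 2))
# broadcasts to shape (3, 3, 2): one difference vector per ordered pair.
diff = arr - arr[:, None]

# Taking the norm over the last axis gives the full symmetric (3, 3) matrix.
full = np.linalg.norm(diff, axis=-1)

# Indices strictly above the diagonal (k=1), returned in row-major order.
i, j = np.triu_indices(arr.shape[0], 1)
dists = full[i, j]

for a, b, d in zip(i, j, dists):
    print(f"{labels[a]}->{labels[b]}: {d:.2f}")
```

Keeping `i` and `j` around is one way to track which distance belongs to which pair, since `dists[k]` is the distance between `arr[i[k]]` and `arr[j[k]]`.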
I tested the memory usage once with memory-profiler, but it should be checked again if memory usage is important to you (I'm not sure about the result).
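For anyone who wants to repeat such a check, here is a rough sketch that uses the standard library's tracemalloc instead of memory-profiler (a swapped-in tool, not what was used above). It assumes a NumPy version that reports its array buffers to tracemalloc; `peak_bytes` and `with_triu_indices` are hypothetical helper names.

```python
import tracemalloc

import numpy as np


def peak_bytes(func, *args):
    """Return the peak traced allocation (in bytes) while func(*args) runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def with_triu_indices(arr):
    # The one-liner from this answer, wrapped so it can be measured.
    full = np.linalg.norm(arr - arr[:, None], axis=-1)
    return full[np.triu_indices(arr.shape[0], 1)]


arr = np.random.rand(500, 2)
peak = peak_bytes(with_triu_indices, arr)
print(f"peak: {peak / 1e6:.1f} MB")
```

The number reported depends on the NumPy version and platform, so it is best used to compare approaches on the same machine rather than as an absolute figure.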
Update:
I tried limiting the calculations to just the upper triangle, which sped the code up by roughly 2 to 3 times on the larger arrays tested (for small arrays the loop is slower; see the benchmarks). As the array size grows, the performance difference between this loop and the previous methods using np.triu_indices or np.triu grows and becomes more obvious:
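ind = np.arange(arr.shape[0] - 1)    # row indices 0 .. n-2
sub_ind = ind + 1                    # first column of each row's upper-triangle slice
result = np.zeros(sub_ind.sum())     # n*(n-1)/2 distances in total
j = 0
for i in range(ind.shape[0]):
    # distances from point i to every later point i+1 .. n-1
    result[j:j+ind[-1-i]+1] = np.linalg.norm(arr[ind[i]] - arr[sub_ind[i]:], axis=-1)
    j += ind[-1-i]+1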
Also, this way, memory consumption is reduced by at least ~4x. So, this method makes it possible to work on larger arrays, and more quickly.
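As a sanity check, the loop can be wrapped in a function and compared against the full-matrix approach. This is a minimal sketch with the index bookkeeping simplified; the names `pairwise_upper_loop` and `pairwise_upper_full` are hypothetical.

```python
import numpy as np


def pairwise_upper_loop(arr):
    """Row-by-row upper-triangle distances, without building the full matrix."""
    n = arr.shape[0]
    result = np.empty(n * (n - 1) // 2)
    j = 0
    for i in range(n - 1):
        count = n - 1 - i  # pairs (i, i+1), ..., (i, n-1)
        result[j:j + count] = np.linalg.norm(arr[i] - arr[i + 1:], axis=-1)
        j += count
    return result


def pairwise_upper_full(arr):
    """Full (n, n) distance matrix, then keep the strict upper triangle."""
    full = np.linalg.norm(arr - arr[:, None], axis=-1)
    return full[np.triu_indices(arr.shape[0], 1)]


rng = np.random.default_rng(0)
pts = rng.random((50, 2))
same = np.allclose(pairwise_upper_loop(pts), pairwise_upper_full(pts))
print(same)
```

Both functions return the distances in the same row-major order, so the results can be compared element by element.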
Benchmarks:
# arr = np.random.rand(100, 2)
100 loops, best of 5: 459 µs per loop (ddejohn's --> np.triu & flatten)
100 loops, best of 5: 528 µs per loop (mine --> np.triu_indices)
100 loops, best of 5: 1.42 ms per loop (This method)
--------------------------------------
# arr = np.random.rand(1000, 2)
10 loops, best of 5: 49.9 ms per loop
10 loops, best of 5: 49.7 ms per loop
10 loops, best of 5: 30.4 ms per loop (~x1.7) The fastest
--------------------------------------
# arr = np.random.rand(10000, 2)
2 loops, best of 5: 4.56 s per loop
2 loops, best of 5: 4.6 s per loop
2 loops, best of 5: 1.85 s per loop (~x2.5) The fastest
Solution 2:[2]
If you can use it, SciPy has a function for this:
In [2]: from scipy.spatial.distance import pdist
In [3]: pdist(arr)
Out[3]: array([1.41421356, 1.        , 1.        ])
Solution 3:[3]
Here's a numpy-only solution (fair warning: it requires a lot of memory, unlike pdist)...
dists = np.triu(np.linalg.norm(arr - arr[:, None], axis=-1)).flatten()
dists = dists[dists != 0]
Demo:
In [4]: arr = np.array([[2, 3], [3, 4], [3, 3], [5, 2], [4, 5]])
In [5]: pdist(arr)
Out[5]:
array([1.41421356, 1.        , 3.16227766, 2.82842712, 1.        ,
       2.82842712, 1.41421356, 2.23606798, 2.23606798, 3.16227766])
In [6]: dists = np.triu(np.linalg.norm(arr - arr[:, None], axis=-1)).flatten()
In [7]: dists = dists[dists != 0]
In [8]: dists
Out[8]:
array([1.41421356, 1.        , 3.16227766, 2.82842712, 1.        ,
       2.82842712, 1.41421356, 2.23606798, 2.23606798, 3.16227766])
Timings (with the solution above wrapped in a function called triu):
In [9]: %timeit pdist(arr)
7.27 µs ± 738 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [10]: %timeit triu(arr)
25.5 µs ± 4.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
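One caveat with the `dists != 0` filter in the numpy-only solution: it selects by value, so if two input points coincide, their genuine zero distance is dropped along with the lower triangle. A possible workaround, sketched here as an assumption rather than as part of the original answer, is to select by position with a boolean mask.

```python
import numpy as np

pts = np.array([[2, 3], [3, 4], [2, 3]])  # first and last points coincide
n = pts.shape[0]

full = np.linalg.norm(pts - pts[:, None], axis=-1)

# Value-based filter: removes the lower triangle and diagonal, but also the
# genuine zero distance between points 0 and 2.
by_value = np.triu(full).flatten()
by_value = by_value[by_value != 0]

# Position-based mask: True strictly above the diagonal, whatever the value.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
by_position = full[mask]

print(len(by_value), len(by_position))
```

Boolean-mask indexing walks the matrix in row-major order, so `by_position` keeps the same pair order as the other approaches.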
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | ddejohn |
Solution 3 |