'Correctness of chunked cosine distance / similarity methods using just Numpy
I'm trying to implement chunked cosine similarity / distance computation using just numpy
.
The cosine1
method is taken from this SO post and adapted slightly for broadcasting since I'm doing chunked computation.
The cosine2
method is the intuitive method using the simple definition of the angle between two vectors.
The cosine3
method is scipy.spatial.distance.cdist(A, B, 'cosine')
.
NOTE
: cosine1
and cosine2
are computing similarity while cosine3
is computing distance...I left them as is since I found it easier to compare output due to rounding differences when subtracting from 1.
cosine2
and cosine3
corroborate the correctness of the other and the chunked computation returns the same result as the single-pass operation. I don't want to depend on scipy
so I'm leaning towards implementing cosine2
unless someone points out some issue with it.
What I really want to know is:
(1) Why does cosine1
produce the wrong answer despite it being a generally approved of method in the aforementioned SO post? My hunch is that I'm messing up the broadcasting changes added to support chunked computation.
(2) Why does the chunked computation of cosine1
produce results inconsistent with the single-pass computation while every other method produces identical output between the chunked and single-pass computation?
(3) Is cosine2
correct?
Code:
import numpy as np
from scipy.spatial import distance
np.set_printoptions(linewidth=999)
def compute_similarities(A, B, method="euclidean", chunk=0):
return Methods()(A, B, method=method, chunk=chunk)
class Methods():
def __init__(self):
pass
def euclidean(self, a, b):
return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
def cosine1(self, A, B):
similarity = np.dot(B, A.T)
square_mag = np.diag(similarity)
inv_square_mag = 1 / square_mag
inv_square_mag[np.isinf(inv_square_mag)] = 0
inv_mag = np.sqrt(inv_square_mag)
cosine = similarity * inv_mag[np.newaxis, :] # something funny here...
return cosine.T * inv_mag[:, np.newaxis] # ...or here
def cosine2(self, A, B):
return np.dot(A, B.T) / (np.linalg.norm(A, axis=1)[:, np.newaxis] * np.linalg.norm(B, axis=1)[np.newaxis, :])
def cosine3(self, A, B):
return distance.cdist(A, B, 'cosine')
def __call__(self, A, B, method='euclidean', chunk=0):
dist = getattr(self, method)
a = np.array(A)
b = a if np.array_equal(A, B) else np.array(B)
chunks = a.shape[0] // (chunk if chunk else 1)
if not (chunk and chunks):
return dist(a, b)
dists = np.zeros((a.shape[0], b.shape[0]))
for i in range(chunks):
_a = a[(i*chunk):(i*chunk)+chunk]
dists[(i*chunk):(i*chunk)+chunk] = dist(_a, b)
if a.shape[0] % chunk != 0:
dists[(i*chunk)+chunk:] = dist(a[(i*chunk)+chunk:], b)
return dists
if __name__ == '__main__':
A = np.random.uniform(0, 1, (4, 3))
methods = ['euclidean', 'cosine1', 'cosine2', 'cosine3']
for method in methods:
print(f"{'-'*10} {method} {'-'*(79-len(method))}\none chunk...")
print(compute_similarities(A, A, method=method, chunk=A.shape[0]), end='\n\n')
print("two chunks...")
print(compute_similarities(A, A, method=method, chunk=A.shape[0] // 2), end='\n\n\n')
Output:
---------- euclidean ----------------------------------------------------------------------
one chunk...
[[0. 0.92371109 0.60656914 0.83801003]
[0.92371109 0. 0.65194097 0.48798021]
[0.60656914 0.65194097 0. 0.70142247]
[0.83801003 0.48798021 0.70142247 0. ]]
two chunks...
[[0. 0.92371109 0.60656914 0.83801003]
[0.92371109 0. 0.65194097 0.48798021]
[0.60656914 0.65194097 0. 0.70142247]
[0.83801003 0.48798021 0.70142247 0. ]]
---------- cosine1 ------------------------------------------------------------------------
one chunk...
[[1. 0.81289633 1.27765755 0.53644275]
[0.49578501 1. 0.9536029 0.690402 ]
[0.64123831 0.78471929 1. 0.58098557]
[0.59124435 1.24763468 1.2758628 1. ]]
two chunks...
[[1. 0.81289633 1.27765755 0.53644275]
[0.49578501 1. 0.9536029 0.690402 ]
[1. 1.22375609 1.55948261 0.90603689]
[0.4738922 1. 1.02262531 0.80151668]]
---------- cosine2 ------------------------------------------------------------------------
one chunk...
[[1. 0.63483999 0.90514252 0.56317736]
[0.63483999 1. 0.86504947 0.92809993]
[0.90514252 0.86504947 1. 0.86096334]
[0.56317736 0.92809993 0.86096334 1. ]]
two chunks...
[[1. 0.63483999 0.90514252 0.56317736]
[0.63483999 1. 0.86504947 0.92809993]
[0.90514252 0.86504947 1. 0.86096334]
[0.56317736 0.92809993 0.86096334 1. ]]
---------- cosine3 ------------------------------------------------------------------------
one chunk...
[[0.00000000e+00 3.65160008e-01 9.48574836e-02 4.36822638e-01]
[3.65160008e-01 0.00000000e+00 1.34950528e-01 7.19000737e-02]
[9.48574836e-02 1.34950528e-01 0.00000000e+00 1.39036660e-01]
[4.36822638e-01 7.19000737e-02 1.39036660e-01 1.11022302e-16]]
two chunks...
[[0.00000000e+00 3.65160008e-01 9.48574836e-02 4.36822638e-01]
[3.65160008e-01 0.00000000e+00 1.34950528e-01 7.19000737e-02]
[9.48574836e-02 1.34950528e-01 0.00000000e+00 1.39036660e-01]
[4.36822638e-01 7.19000737e-02 1.39036660e-01 1.11022302e-16]]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|