'How do I check which rows of one small array exists in another larger one?

How do I check which rows of one small array exists in another larger one?

Given the following setup:

final_batch = np.emtpy((batch_size,2))
batch_size = 4
a = np.array(range(10))
b = np.array(range(10,20))
edges = np.array([[0,11],[0,12],[1,11],[1,12],[0,17]])


c1 = np.random.choice(a,batch).reshape(-1,1)
c2 = np.random.choice(b,batch).reshape(-1,1)
samples = np.append(c1,c2,axis=1)

Now there can exist dubplicates in samples and edges, I want to keep making np.random.choice and only add them to final_batch IF they don't already exist in edges. The simple way to do this would be to just take them 1 by 1 in a loop

while len(final_batch)<batch_size+1:
    c1 = np.random.choice(a,1).reshape(-1,1)
    c2 = np.random.choice(b,1).reshape(-1,1)
    if not np.isin(c1,c2).any():
        final_batch = np.append(final_batch,np.append(c1,c2,axis=1),axis=0)    

final_batch = final_batch[1:]

But all of a,b and edges can be huge and batch size will be 10k, but as it's way faster to sample many elements at once I wanted to see if there is a faster way. Something like

while len(final_batch)<batch_size+1:
     c1 = np.random.choice(a,batch).reshape(-1,1)
     c2 = np.random.choice(b,batch).reshape(-1,1)
     samples = np.append(c1,c2,axis=1)
     full_batch.append(samples NOT IN edges)
     

Note that c1 and c2 are mutually exclusive, so I feel like I should be able to use this somehow.



Solution 1:[1]

If I understand your question, you are looking for something like

samples = np.empty((10, 2), dtype=int)
samples[:,0] = np.random.choice(a, 10)
samples[:,1] = np.random.choice(b, 10)
new_indices = (samples != edges[:,None]).any(axis=2).all(axis=0)
new_samples = samples[new_indices]

Meaning I generate 10 new samples, then I look whether they match edges. This is not optimal in operation number, as I continue checking for equality even after I found a match, but this is vectorized with numpy, which is usually faster than stopping as soon as you can.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1