What's the fastest way of finding a random index in a Python list, a large number of times?

What's the best (fastest) way of extracting a random value from a list, a large number (>1M) of times?

I am currently in a situation where I have a graph represented as an adjacency list, whose inner lists can have vastly different lengths (in the range [2, potentially 100k]).

I need to iterate over this list to generate random walks, so my current solution is along the lines of the following (a code sketch appears after the list):

  1. Get random node
  2. Pick a random index from that node's adjacency list
  3. Move to the new node
  4. Go to 2
  5. Repeat until the random walk is as long as needed
  6. Go to 1
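
In code, the loop looks something like this (a simplified sketch with placeholder names, not my actual implementation):

import random

# toy graph as an adjacency list: graph[node] is the list of neighbor ids
graph = [[1, 2], [0, 2], [0, 1, 3], [2]]

def random_walk(graph, start, walk_length):
    walk = [start]
    node = start
    for _ in range(walk_length - 1):
        neighbors = graph[node]
        # step 2: pick a random index into the node's adjacency list
        node = neighbors[int(random.random() * len(neighbors))]
        walk.append(node)
    return walk

# steps 1 and 6: start each walk at a random node
walks = [random_walk(graph, random.randrange(len(graph)), 10) for _ in range(5)]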

This works well enough when the graph isn't too big; however, I am now working with a graph that contains >440k nodes, with a very large variance in the number of edges each node has.

The function I am using at the moment to extract the random index is

node_neighbors[int(random.random() * number_neighbors_of_node)]

This sped up the computation over my previous implementation; however, it is still unacceptably slow for my purposes.

The number of neighbors of a node can range from 2 to tens of thousands, I cannot remove small nodes, and I have to generate tens of thousands of random walks in this environment.

Profiling shows that most of the generation time is spent computing these indices, so I'm looking for a method that reduces that time. However, if it's possible to sidestep the index lookup entirely by modifying the algorithm, that would also be great.

Thanks!

EDIT: out of curiosity, I tested four variants of the same code using timeit, and these are the results:

setup='''
import numpy as np
import random

# generate a random adjacency list, nodes have a number of neighbors between 2 and 10000

l = [list(range(random.randint(2, 10000))) for _ in range(10000)]
'''

for _ in range(1000):    
    v = l[random.randint(0, 10000-1)] # Get a random node adj list 
    vv = v[random.randint(0, len(v)-1)] # Find a random neighbor in v

0.29709450000001425

for _ in range(1000):    
    v = l[random.randint(0, 10000-1)]
    # np.random.choice converts v to an array on each call (hence the slowness);
    # indexing with the chosen value only works because v == list(range(len(v)))
    vv = v[np.random.choice(v)]

26.760767499999986

for _ in range(1000):    
    v = l[random.randint(0, 10000-1)]
    vv = v[int(random.random()*(len(v)))]

0.19086300000000733

for _ in range(1000):    
    v = l[random.randint(0, 10000-1)]
    vv = v[int(random.choice(v))] # value doubles as a valid index because v == list(range(len(v)))

0.24351880000000392


Solution 1:[1]

Your solution (sol3) is already the fastest by a larger margin than your tests suggest. I adapted the performance measurements to eliminate the arbitrary selection of nodes in favour of a path traversal that will be much closer to your stated goal.

Here are the improved performance tests and results. I added sol5() to see if pre-computing a list of random values would make a difference (I was hoping numpy would vectorize that, but it didn't go any faster).

Setup

import numpy as np
import random

# generate a random adjacency list, nodes have a number of neighbors between 2 and 10000

nodes     = [list(range(random.randint(2, 10000))) for _ in range(10000)]
pathLen   = 1000

Solutions

def sol1():
    node = nodes[0]
    for _ in range(pathLen):
        # the random index doubles as the next node id because each list is range(n)
        node = nodes[random.randint(0, len(node)-1)] # move to a random neighbor

def sol2():
    node = nodes[0]
    for _ in range(pathLen):
        node = nodes[np.random.choice(node)]

def sol3():
    node = nodes[0]
    for _ in range(pathLen):
        node = nodes[int(random.random()*(len(node)))]

def sol4():
    node = nodes[0]
    for _ in range(pathLen):
        node = nodes[int(random.choice(node))]

def sol5():
    node = nodes[0]
    for rng in np.random.random_sample(pathLen):
        node = nodes[int(rng*len(node))]

Measurements

from timeit import timeit
count = 100

print("sol1",timeit(sol1,number=count))
print("sol2",timeit(sol2,number=count))
print("sol3",timeit(sol3,number=count))
print("sol4",timeit(sol4,number=count))
print("sol5",timeit(sol5,number=count))

sol1 0.12516996199999975
sol2 30.445685411
sol3 0.03886452900000137
sol4 0.1244026900000037
sol5 0.05330073100000021

numpy is not very good at manipulating matrices that have variable dimensions (like your neighbor lists), but perhaps a way to accelerate the process is to vectorize the next node selection. By assigning a random float to each node in a numpy array, you can use that to navigate between nodes until your path comes back to an already visited node. Only then would you need to generate a new random value for that node. Presumably, and depending on the path length, there would be a relatively small number of these "collisions".
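
Here is a rough sketch of that caching idea (my illustration only, not benchmarked; the names nodeRng and walkCached are made up). The point is that the floats are generated in one vectorized call, and only "collisions" need a scalar redraw:

import numpy as np
import random

nodes   = [list(range(random.randint(2, 10000))) for _ in range(10000)]
nodeRng = np.random.random_sample(len(nodes)) # one float per node, drawn in a single vectorized call

def walkCached(start, stepCount):
    pos = start
    seen = set()
    for _ in range(stepCount):
        if pos in seen:                 # collision: this node needs a fresh value
            nodeRng[pos] = random.random()
        seen.add(pos)
        neighbors = nodes[pos]
        pos = neighbors[int(nodeRng[pos] * len(neighbors))]
    return pos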

Using the same idea, and leveraging numpy's vectorizing, you can make several traversals in parallel by creating a matrix of node identifiers (columns) with each row being a parallel traversal.

To illustrate this, here's a function that advances multiple "ants" on their individual random paths through the nodes:

import numpy as np
import random

nodes   = [list(range(random.randint(2, 10000))) for _ in range(10000)]
nbLinks = np.array(list(map(len,nodes)),dtype=int)            # number of neighbors per node
npNodes = np.array([nb+[-1]*(10000-len(nb)) for nb in nodes]) # fixed-size rows for numpy

def moveAnts(antCount=12,stepCount=8,antPos=None,allPaths=False):
    if antPos is None:
        antPos = np.random.choice(len(nodes),antCount)
    paths = antPos[:,None]

    for _ in range(stepCount):
        nextIndex = np.random.random_sample(size=(antCount,))*nbLinks[antPos]
        antPos    = npNodes[antPos,nextIndex.astype(int)]
        if allPaths:
            paths = np.append(paths,antPos[:,None],axis=1)
        
    return paths if allPaths else antPos

Example: 12 ants advancing randomly by 8 steps from random starting locations

print(moveAnts(12,8,allPaths=True))

"""
    [[8840 1302 3438 4159 2983 2269 1284 5031 1760]
     [4390 5710 4981 3251 3235 2533 2771 6294 2940]
     [3610 2059 1118 4630 2333  552 1375 4656 6212]
     [9238 1295 7053  542 6914 2348 2481  718  949]
     [5308 2826 2622   17   78  976   13 1640  561]
     [5763 6079 1867 7748 7098 4884 2061  432 1827]
     [3196 3057   27  440 6545 3629  243 6319  427]
     [7694 1260 1621  956 1491 2258  676 3902  582]
     [1590 4720  772 1366 2112 3498 1279 5474 3474]
     [2587  872  333 1984 7263  168 3782  823    9]
     [8525  193  449  982 4521  449 3811 2891 3353]
     [6824 9221  964  389 4454  720 1898  806   58]]
"""

Performance is not better for individual ants, but in parallel the per-ant time is much better:

from timeit import timeit
count = 100

antCount  = 100
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)

t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)

print(t) # 0.9538277329999989 / 100 --> 0.009538277329999989 per ant

[EDIT] I thought of a better array model for the variable-sized rows and came up with an approach that will not waste memory in a (mostly empty) matrix of fixed dimension. The approach is to use a 1D array to hold the links of all nodes consecutively, and two additional arrays to hold the starting position and number of neighbours of each node. This data structure turns out to run even faster than the fixed-size 2D matrix.

import numpy as np
import random

nodes     = [list(range(random.randint(2, 10000))) for _ in range(10000)]
links     = np.array(list(n for neighbors in nodes for n in neighbors))
linkCount = np.array(list(map(len,nodes)),dtype=int)    # number of neighbors for each node
firstLink = np.insert(np.add.accumulate(linkCount),0,0) # index of first link for each node
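# (illustration, not part of the original answer: with this layout, node i's
#  neighbors are the slice links[firstLink[i] : firstLink[i] + linkCount[i]])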



def moveAnts(antCount=12,stepCount=8,antPos=None,allPaths=False):
    if antPos is None:
        antPos = np.random.choice(len(nodes),antCount)
    paths = antPos[:,None]

    for _ in range(stepCount):
        nextIndex = np.random.random_sample(size=(antCount,))*linkCount[antPos]
        antPos    = links[firstLink[antPos]+nextIndex.astype(int)]
        if allPaths:
            paths = np.append(paths,antPos[:,None],axis=1)
        
    return paths if allPaths else antPos

from timeit import timeit
count = 100

antCount  = 100
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)

t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)

print(t) # 0.7157810379999994 / 100 --> 0.007157810379999994 per ant

The performance "per ant" improves as you add more of them, up to a point (roughly 10x faster than sol3):

antCount  = 1000
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)

t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)

print(t,t/antCount) #3.9749405650000007, 0.0039749405650000005 per ant

antCount  = 10000
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)

t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)

print(t,t/antCount) #32.688697579, 0.0032688697579 per ant

Solution 2:[2]

@Alain T. already answers the specific question about graphs.

My answer will address the generic question: "What's the fastest way of finding a random index in a Python list, a large number of times?", i.e. generating a random subsample of N elements from a given list or NumPy array.

We will evaluate the following methods:

## Returns Python list
random_subset = random.sample(data, N)  

## Returns numpy.ndarray
random_selection = np.random.choice(data, N)  

## Obtain a single random idx using random.randrange, N times, used to index data
random_selection = [data[idx] for idx in (random.randrange(0, len(data)) for _ in range(N))]  

## Obtain a single random idx using np.random.randint, N times, used to index data
random_selection = [data[idx] for idx in (np.random.randint(0, len(data)) for _ in range(N))]

## Create an NumPy array of `N` random indexes using np.random.randint, used to index data
random_selection = [data[idx] for idx in np.random.randint(0, len(data), N)]

Assume we have 1MM data values (the exact values do not matter).

data_nparray = np.random.random(int(1e6))  
data_list = list(data_nparray)  

As it turns out, when evaluating the fastest way to generate a series of random elements, the answer changes depending on whether the data is stored as a list or a NumPy array.

Random selection from a Python list or tuple

[plot: average selection time (µs) vs. number of selected elements N, for a Python list]

  • The plot above shows the increase in time (microseconds) when randomly selecting elements from a Python list.
    • For <50 elements, random.sample is the fastest (i.e. random.sample(data, N))
    • For >= 50 elements, it is fastest to create an array of N random indexes using np.random.randint (i.e. [data[idx] for idx in np.random.randint(0, len(data), N)])
    • np.random.choice is extremely slow here, as it converts the data into a NumPy array each time. However, for >100,000 selections, it becomes the fastest option.
    • You can probably get away with using random.sample throughout.
    • The same results were found to hold for Python tuples as well.

Random selection from a NumPy array

[plot: average selection time (µs) vs. number of selected elements N, for a NumPy array]

  • The plot above shows the increase in time (microseconds) when randomly selecting elements from a NumPy array.
    • random.sample does not work with NumPy arrays, only lists. Converting 1MM elements from a NumPy array to a list itself takes ~23,000 microseconds on average, so we will skip random.sample.
    • For <10 elements, it is fastest to select N random indexes one-by-one using random.randrange (i.e. [data[idx] for idx in (random.randrange(0, len(data)) for _ in range(N))])
    • For 10 to 50 elements, it is fastest to create an array of N random indexes using np.random.randint (i.e. [data[idx] for idx in np.random.randint(0, len(data), N)])
    • For >50 elements, it is fastest to use np.random.choice (i.e. np.random.choice(data, N)) which also returns a NumPy array, not a list.
    • You can probably get away with using random.randrange for <25 elements, then switching to np.random.choice (see the helper sketch below).
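
For convenience, a hypothetical helper encoding that hybrid rule could look like this (random_selection is my own naming, and the threshold of 25 is just the rough crossover observed in these benchmarks, not a universal constant):

import random
import numpy as np

def random_selection(data, n, threshold=25):
    # hypothetical helper: per-index random.randrange for small n,
    # vectorized np.random.choice (returns an ndarray) for larger n
    if n < threshold:
        return [data[random.randrange(len(data))] for _ in range(n)]
    return np.random.choice(data, n)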

Code to reproduce

import time
import random
import numpy as np
import pandas as pd
from IPython.display import display

def time_fn(fn, num_runs=7, num_loops=1_000):
    run_times = [] 
    for num_run_i in range(num_runs):
        start = time.time()
        for _ in range(num_loops):
            fn()
        end = time.time()
        run_times.append((end-start)/num_loops)
    return run_times

def create_times_df_row(data, N, fn, fn_name, num_runs=7, num_loops=1_000):
    try:
        run_times = time_fn(fn, num_runs=num_runs, num_loops=num_loops)
        return [
            {
                'N': N,
                'function': fn_name,
                'run_time_avg': np.mean(run_times),
                'run_time_std': np.std(run_times),
                'num_runs': num_runs,
                'num_loops': num_loops,
                'input_type': str(type(data)),
                'return_type': str(type(fn())),
            }
        ]
    except Exception as e:
        print(e)
    return []

data_nparray = np.random.random(int(1e4))
data_list = list(data_nparray)
all_times_df = []
for N in list(range(1, 10)) + list(range(10, 25, 5)) + list(range(25, 100, 25)) + list(range(100, 1000+1, 100)):
    times_df = []
    for data in [data_nparray, data_list]:
        num_loops = 1_000
        if N >= 100:
            num_loops = 100
        if N >= 1_000:
            num_loops = 10
        times_df.extend(create_times_df_row(
            data=data,
            N=N,
            fn=lambda: np.random.choice(data, N),
            fn_name='np.random.choice',
            num_loops=num_loops if isinstance(data, np.ndarray) else 10,
        ))
        times_df.extend(create_times_df_row(
            data=data,
            N=N,
            fn=lambda: random.sample(data, N),
            fn_name='random.sample',
            num_loops=num_loops,
        ))
        times_df.extend(create_times_df_row(
            data=data,
            N=N,
            fn=lambda: [data[idx] for idx in (random.randrange(0, len(data)) for _ in range(N))],
            fn_name='random.randrange',
            num_loops=num_loops,
        ))
        times_df.extend(create_times_df_row(
            data=data,
            N=N,
            fn=lambda: [data[idx] for idx in (np.random.randint(0, len(data)) for _ in range(N))],
            fn_name='np.random.randint (single idx)',
            num_loops=num_loops,
        ))
        times_df.extend(create_times_df_row(
            data=data,
            N=N,
            fn=lambda: [data[idx] for idx in np.random.randint(0, len(data), N)],
            fn_name='np.random.randint (idx array)',
            num_loops=num_loops,
        ))
    print(N, end=' ')
    times_df = pd.DataFrame(times_df)
    display(times_df)
    all_times_df.append(times_df)
    
all_times_df = pd.concat(all_times_df).reset_index(drop=True)


import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

FIGSIZE = (12.0, 9.0)

# Increase the plot size and adjust some display options:
plt.figure(figsize=FIGSIZE)
sns.set(rc={
    'figure.figsize':FIGSIZE, 
    'axes.facecolor':'white', 
    'figure.facecolor':'white', 
    'grid.color': 'gainsboro',
})
sns.set_context("talk")

def time_to_ms(row):
    row['run_time_avg'] = 1000*row['run_time_avg']
    row['run_time_std'] = 1000*row['run_time_std']
    return row

def time_to_us(row):
    row['run_time_avg'] = 1e6*row['run_time_avg']
    row['run_time_std'] = 1e6*row['run_time_std']
    return row

cmap = dict(zip(all_times_df['function'].unique(), sns.color_palette('tab10')))

xticks = [1, 2, 3, 5, 7, 10, 15, 25, 50, 100, 200, 500, 1000,]


## Plot for Python list
line_plot = sns.lineplot(
    data=all_times_df.query(f'''input_type == "<class 'list'>"''').apply(time_to_us, axis=1),
    x='N', y='run_time_avg', hue='function', linewidth=3, palette=cmap,
)
scatter_plot = sns.scatterplot(
    data=all_times_df.query(f'''input_type == "<class 'list'>"''').apply(time_to_us, axis=1),
    x='N', y='run_time_avg', hue='function', linewidth=3, palette=cmap,
    legend=False,
)

line_plot.set_xscale("log")
line_plot.set_yscale("log")
plt.xticks(xticks)
line_plot.get_xaxis().set_major_formatter(matplotlib.ticker.ScalarFormatter())
plt.xlabel("Size")
plt.ylabel("Time (µs)")
line_plot.set_title("Random selection from a Python list")
plt.legend(bbox_to_anchor=(0.5,0.5))
plt.ylim(7e-1, 2e5)
display(line_plot)


## Plot for Numpy array
line_plot = sns.lineplot(
    data=all_times_df.query(f'''input_type == "<class 'numpy.ndarray'>"''').apply(time_to_us, axis=1),
    x='N', y='run_time_avg', hue='function', linewidth=3, palette=cmap,
    # legend=False,
)
sns.scatterplot(
    data=all_times_df.query(f'''input_type == "<class 'numpy.ndarray'>"''').apply(time_to_us, axis=1),
    x='N', y='run_time_avg', hue='function', linewidth=3, palette=cmap,
    legend=False,
)
line_plot.set_xscale("log")
line_plot.set_yscale("log")
plt.xticks(xticks)
line_plot.get_xaxis().set_major_formatter(matplotlib.ticker.ScalarFormatter())

plt.xlabel("Size")
plt.ylabel("Time (µs)")
line_plot.set_title("Random selection from a NumPy array")
plt.legend(bbox_to_anchor=(1, 1))
plt.ylim(7e-1, 2e5)
display(line_plot)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1 by Alain T. (Stack Overflow)
[2] Solution 2 (Stack Overflow)