'What's the fastest way of finding a random index in a Python list, a large number of times?
What's the best (fastest) way of extracting a random value from a list, a large number (>1M) of times?
I am currently in a situation where I have a graph represented as an adjacency list, whose inner lists can have vastly different length (in the range [2, potentially 100k]).
I need to iterate over this list to generate random walks, so my current solution is along the lines of
- Get random node
- Pick a random index from that node's adjacency list
- Move to the new node
- Go to 2
- Repeat until the random walk is as long as needed
- Go to 1
This works well enough when the graph isn't too big, however now I am working with a graph that contains >440k nodes, with a very large variance in the number of edges each node has.
The function I am using at the moment to extract the random index is
node_neighbors[int(random.random() * number_neighbors_of_node)]
This sped up the computation over my previous implementation, however it's still unacceptably slow for my purposes.
The number of neighbors of a node can go from 2 to tens of thousands, I cannot remove small nodes, I have to generate tens of thousands of random walks in this environment.
From profiling the code, most of the generation time is spent looking for these indexes, so I'm looking for a method that can reduce the time taken to do so. However, if it's possible to sidestep it entirely by modifying the algorithm, that would also be great.
Thanks!
EDIT: out of curiosity, I tested three variants of the same code using timeit
and these are the results:
setup='''
import numpy as np
import random
# generate a random adjacency list, nodes have a number of neighbors between 2 and 10000
l = [list(range(random.randint(2, 10000))) for _ in range(10000)]
'''
for _ in range(1000):
v = l[random.randint(0, 10000-1)] # Get a random node adj list
vv = v[random.randint(0, len(v)-1)] # Find a random neighbor in v
0.29709450000001425
for _ in range(1000):
v = l[random.randint(0, 10000-1)]
vv = v[np.random.choice(v)]
26.760767499999986
for _ in range(1000):
v = l[random.randint(0, 10000-1)]
vv = v[int(random.random()*(len(v)))]
0.19086300000000733
for _ in range(1000):
v = l[random.randint(0, 10000-1)]
vv = v[int(random.choice(v))]
0.24351880000000392
Solution 1:[1]
Your solution (sol3) is already the fastest by a larger margin than your tests suggest. I adapted the performance measurements to eliminate the arbitrary selection of nodes in favour of a path traversal that will be much closer to your stated goal.
Here are the improved performance tests and result. I added sol5() to see if pre-computing a list of random values would make a difference (I was hoping numpy would vectorize that but it didn't go any faster).
Setup
import numpy as np
import random
# generate a random adjacency list, nodes have a number of neighbors between 2 and 10000
nodes = [list(range(random.randint(2, 10000))) for _ in range(10000)]
pathLen = 1000
Solutions
def sol1():
node = nodes[0]
for _ in range(pathLen):
node = nodes[random.randint(0, len(node)-1)] # move to a random neighbor
def sol2():
node = nodes[0]
for _ in range(pathLen):
node = nodes[np.random.choice(node)]
def sol3():
node = nodes[0]
for _ in range(pathLen):
node = nodes[int(random.random()*(len(node)))]
def sol4():
node = nodes[0]
for _ in range(pathLen):
node = nodes[int(random.choice(node))]
def sol5():
node = nodes[0]
for rng in np.random.random_sample(pathLen):
node = nodes[int(rng*len(node))]
Measurements
from timeit import timeit
count = 100
print("sol1",timeit(sol1,number=count))
print("sol2",timeit(sol2,number=count))
print("sol3",timeit(sol3,number=count))
print("sol4",timeit(sol4,number=count))
print("sol5",timeit(sol5,number=count))
sol1 0.12516996199999975
sol2 30.445685411
sol3 0.03886452900000137
sol4 0.1244026900000037
sol5 0.05330073100000021
numpy is not very good at manipulating matrices that have variable dimensions (like your neighbor lists) but perhaps a way to accelerate the process is to vectorize the next node selection. By assigning a random float to each node in a numpy array, you can use that to navigate between nodes until your path comes back to an already visited node. Only then would you need to generate a new random value for that node. Presumably, and depending on the path length, there would be a relatively small number of these "collisions".
Using the same idea, and leveraging numpy's vectorizing, you can make several traversals in parallel by creating a matrix of node identifiers (columns) with each row being a parallel traversal.
To illustrate this, here's a function that advances multiple "ants" on their individual random paths through the nodes:
import numpy as np
import random
nodes = [list(range(random.randint(2, 10000))) for _ in range(10000)]
nbLinks = np.array(list(map(len,nodes)),dtype=np.int) # number of neighbors per node
npNodes = np.array([nb+[-1]*(10000-len(nb)) for nb in nodes]) # fixed sized rows for numpy
def moveAnts(antCount=12,stepCount=8,antPos=None,allPaths=False):
if antPos is None:
antPos = np.random.choice(len(nodes),antCount)
paths = antPos[:,None]
for _ in range(stepCount):
nextIndex = np.random.random_sample(size=(antCount,))*nbLinks[antPos]
antPos = npNodes[antPos,nextIndex.astype(np.int)]
if allPaths:
paths = np.append(paths,antPos[:,None],axis=1)
return paths if allPaths else antPos
Example: 12 ants advancing randomly by 8 steps from random starting locations
print(moveAnts(12,8,allPaths=True))
"""
[[8840 1302 3438 4159 2983 2269 1284 5031 1760]
[4390 5710 4981 3251 3235 2533 2771 6294 2940]
[3610 2059 1118 4630 2333 552 1375 4656 6212]
[9238 1295 7053 542 6914 2348 2481 718 949]
[5308 2826 2622 17 78 976 13 1640 561]
[5763 6079 1867 7748 7098 4884 2061 432 1827]
[3196 3057 27 440 6545 3629 243 6319 427]
[7694 1260 1621 956 1491 2258 676 3902 582]
[1590 4720 772 1366 2112 3498 1279 5474 3474]
[2587 872 333 1984 7263 168 3782 823 9]
[8525 193 449 982 4521 449 3811 2891 3353]
[6824 9221 964 389 4454 720 1898 806 58]]
"""
Performance is not better on individual ants, but in parallel the time per-ant is much better
from timeit import timeit
count = 100
antCount = 100
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)
t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)
print(t) # 0.9538277329999989 / 100 --> 0.009538277329999989 per ant
[EDIT] I thought of a better array model for the variable sized rows and came up with an approach that will not waste memory in a (mostly empty) matrix of fixed dimension. The approach is to use a 1D array to hold the links of all nodes consecutively and two additional arrays to hold the starting position and number of neighbours. This data structure turns out to run even faster than the fixed sized 2D matrix.
import numpy as np
import random
nodes = [list(range(random.randint(2, 10000))) for _ in range(10000)]
links = np.array(list(n for neighbors in nodes for n in neighbors))
linkCount = np.array(list(map(len,nodes)),dtype=np.int) # number of neighbors for each node
firstLink = np.insert(np.add.accumulate(linkCount),0,0) # index of first link for each node
def moveAnts(antCount=12,stepCount=8,antPos=None,allPaths=False):
if antPos is None:
antPos = np.random.choice(len(nodes),antCount)
paths = antPos[:,None]
for _ in range(stepCount):
nextIndex = np.random.random_sample(size=(antCount,))*linkCount[antPos]
antPos = links[firstLink[antPos]+nextIndex.astype(np.int)]
if allPaths:
paths = np.append(paths,antPos[:,None],axis=1)
return paths if allPaths else antPos
from timeit import timeit
count = 100
antCount = 100
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)
t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)
print(t) # 0.7157810379999994 / 100 --> 0.007157810379999994 per ant
The performance "per ant" improves as you add more of them, up to a point (roughly 10x faster than sol3):
antCount = 1000
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)
t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)
print(t,t/antCount) #3.9749405650000007, 0.0039749405650000005 per ant
antCount = 10000
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)
t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)
print(t,t/antCount) #32.688697579, 0.0032688697579 per ant
Solution 2:[2]
@Alain T. already answers the specific question about graphs.
My answer will address the generic question: "What's the fastest way of finding a random index in a Python list, a large number of times?", meaning generating random subsample of N
elements from a given list or NumPy array.
We will evaluate the following methods:
## Returns Python list
random_subset = random.sample(data, N)
## Returns numpy.ndarray
random_selection = np.random.choice(data, N)
## Obtain a single random idx using random.randrange, N times, used to index data
random_selection = [data[idx] for idx in (random.randrange(0, len(data)) for _ in range(N))]
## Obtain a single random idx using using np.random.randint, N times, used to index data
random_selection = [data[idx] for idx in (np.random.randint(0, len(data)) for _ in range(N))]
## Create an NumPy array of `N` random indexes using np.random.randint, used to index data
random_selection = [data[idx] for idx in np.random.randint(0, len(data), N)]
Assume we have 1MM data values (the exact values do not matter).
data_nparray = np.random.random(int(1e6))
data_list = list(data_nparray)
As it turns out, when evaluating the fastest way to generate a series of random elements, the answer changes if data stored as a list or a NumPy array.
Random selection from a Python list or tuple
- The plot above shows the increase time (microseconds) when randomly selecting elements from a Python list.
- For <50 elements,
random.sample
is the fastest (i.e.random.sample(data, N)
) - For >= 50 elements, it is fastest to create an array of
N
random indexes usingnp.random.randint
(i.e.[data[idx] for idx in np.random.randint(0, len(data), N)]
) np.random.choice
is extremely slow here, as it converts the data into a NumPy array each time. However, for >100,000 selections, it becomes the fastest option.- You can probably get away with using
random.sample
throughout. - The same results were found to hold for Python tuples as well.
- For <50 elements,
Random selection from a NumPy array
- The plot above shows the increase time (microseconds) when randomly selecting elements from a NumPy array
random.sample
does not work with NumPy arrays, only lists. Converting 1MM elements from a NumPy array to a list itself takes ~23,000 microseconds on average, so we will skiprandom.sample
.- For <10 elements, it is fastest to select
N
random indexes one-by-one usingrandom.randrange
(i.e.[data[idx] for idx in (random.randrange(0, len(data)) for _ in range(N))]
) - For 10 to 50 elements, it is fastest to create an array of
N
random indexes usingnp.random.randint
(i.e.[data[idx] for idx in np.random.randint(0, len(data), N)]
) - For >50 elements, it is fastest to use
np.random.choice
(i.e.np.random.choice(data, N)
) which also returns a NumPy array, not a list. - You can probably get away with using
random.randrange
for <25 elements, then switching tonp.random.choice
.
Code to reproduce
import time
import random
import numpy as np
import pandas as pd
from IPython.display import display
def time_fn(fn, num_runs=7, num_loops=1_000):
run_times = []
for num_run_i in range(num_runs):
start = time.time()
for _ in range(num_loops):
fn()
end = time.time()
run_times.append((end-start)/num_loops)
return run_times
def create_times_df_row(data, N, fn, fn_name, num_runs=7, num_loops=1_000):
try:
run_times = time_fn(fn, num_runs=num_runs, num_loops=num_loops)
return [
{
'N': N,
'function': fn_name,
'run_time_avg': np.mean(run_times),
'run_time_std': np.std(run_times),
'num_runs': num_runs,
'num_loops': num_loops,
'input_type': str(type(data)),
'return_type': str(type(fn())),
}
]
except Exception as e:
print(e)
return []
data_nparray = np.random.random(int(1e4))
data_list = list(data_nparray)
all_times_df = []
for N in list(range(1, 10)) + list(range(10, 25, 5)) + list(range(25, 100, 25)) + list(range(100, 1000+1, 100)):
times_df = []
for data in [data_nparray, data_list]:
num_loops = 1_000
if N >= 100:
num_loops = 100
if N >= 1_000:
num_loops = 10
times_df.extend(create_times_df_row(
data=data,
N=N,
fn=lambda: np.random.choice(data, N),
fn_name='np.random.choice',
num_loops=num_loops if isinstance(data, np.ndarray) else 10,
))
times_df.extend(create_times_df_row(
data=data,
N=N,
fn=lambda: random.sample(data, N),
fn_name='random.sample',
num_loops=num_loops,
))
times_df.extend(create_times_df_row(
data=data,
N=N,
fn=lambda: [data[idx] for idx in (random.randrange(0, len(data)) for _ in range(N))],
fn_name='random.randrange',
num_loops=num_loops,
))
times_df.extend(create_times_df_row(
data=data,
N=N,
fn=lambda: [data[idx] for idx in (np.random.randint(0, len(data)) for _ in range(N))],
fn_name='np.random.randint (single idx)',
num_loops=num_loops,
))
times_df.extend(create_times_df_row(
data=data,
N=N,
fn=lambda: [data[idx] for idx in np.random.randint(0, len(data), N)],
fn_name='np.random.randint (idx array)',
num_loops=num_loops,
))
print(N, end=' ')
times_df = pd.DataFrame(times_df)
display(times_df)
all_times_df.append(times_df)
all_times_df = pd.concat(all_times_df).reset_index(drop=True)
from __future__ import division
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import time
from typing import *
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Assignment Constants
RANDOM_STATE = 10
FIGSIZE = (12.0, 9.0)
#### Use the following line before plt.plot(....) to increase the plot size ####
plt.figure(figsize=FIGSIZE)
### Adjust some plot display options:
sns.set(rc={
'figure.figsize':FIGSIZE,
'axes.facecolor':'white',
'figure.facecolor':'white',
'grid.color': 'gainsboro',
})
sns.set_context("talk")
def time_to_ms(row):
row['run_time_avg'] = 1000*row['run_time_avg']
row['run_time_std'] = 1000*row['run_time_std']
return row
def time_to_us(row):
row['run_time_avg'] = 1e6*row['run_time_avg']
row['run_time_std'] = 1e6*row['run_time_std']
return row
cmap = dict(zip(all_times_df['function'].unique(), sns.color_palette('tab10')))
xticks = [1, 2, 3, 5, 7, 10, 15, 25, 50, 100, 200, 500, 1000,]
## Plot for Python list
line_plot = sns.lineplot(
data=all_times_df.query(f'''input_type == "<class 'list'>"''').apply(time_to_us, axis=1),
x='N', y='run_time_avg', hue='function', linewidth=3, palette=cmap,
)
scatter_plot = sns.scatterplot(
data=all_times_df.query(f'''input_type == "<class 'list'>"''').apply(time_to_us, axis=1),
x='N', y='run_time_avg', hue='function', linewidth=3, palette=cmap,
legend=False,
)
line_plot.set_xscale("log")
line_plot.set_yscale("log")
plt.xticks(xticks)
line_plot.get_xaxis().set_major_formatter(matplotlib.ticker.ScalarFormatter())
plt.xlabel("Size")
plt.ylabel("Time (µs)")
line_plot.set_title("Random selection from a Python list")
plt.legend(bbox_to_anchor=(0.5,0.5))
plt.ylim(7e-1, 2e5)
display(line_plot)
## Plot for Numpy array
line_plot = sns.lineplot(
data=all_times_df.query(f'''input_type == "<class 'numpy.ndarray'>"''').apply(time_to_us, axis=1),
x='N', y='run_time_avg', hue='function', linewidth=3, palette=cmap,
# legend=False,
)
sns.scatterplot(
data=all_times_df.query(f'''input_type == "<class 'numpy.ndarray'>"''').apply(time_to_us, axis=1),
x='N', y='run_time_avg', hue='function', linewidth=3, palette=cmap,
legend=False,
)
line_plot.set_xscale("log")
line_plot.set_yscale("log")
plt.xticks(xticks)
line_plot.get_xaxis().set_major_formatter(matplotlib.ticker.ScalarFormatter())
plt.xlabel("Size")
plt.ylabel("Time (µs)")
line_plot.set_title("Random selection from a NumPy array")
plt.legend(bbox_to_anchor=(1, 1))
plt.ylim(7e-1, 2e5)
display(line_plot)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 |