What's the computational complexity of .iloc[] in pandas dataframes?
I'm trying to understand the execution complexity of the iloc function in pandas.
I read in the following Stack Exchange thread (Pandas DataFrame search is linear time or constant time?) that:
"accessing single row by index (index is sorted and unique) should have runtime O(m) where m << n_rows"
suggesting that iloc runs in O(m) time. What is m (linear, log, constant, ...)?
Some experiments I ran:
>>> import pandas as pd
>>> a = pd.DataFrame([[1,3,4],[1,4,5],[2,2,3],[2,3,4]], columns=['a','b','c'])
>>> a = a.set_index('a').sort_index()
>>> a
   b  c
a
1  3  4
1  4  5
2  2  3
2  3  4
>>> a.iloc[[0,1,2,3]]
   b  c
a
1  3  4
1  4  5
2  2  3
2  3  4
So iloc clearly works with positional offsets and not with the integer-based index (column a). Even if we delete a few rows at the top, the iloc offset-based lookup still works correctly:
>>> a.drop([1]).iloc[[0,1]]
   b  c
a
2  2  3
2  3  4
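For contrast, here is a minimal sketch (not part of the original post) of the same frame accessed by label with .loc versus by position with .iloc; .loc keys on the values in index column a, while .iloc only counts row positions:
>>> a.loc[[2]]      # label-based: every row whose index value is 2
   b  c
a
2  2  3
2  3  4
>>> a.iloc[[0]]     # position-based: the first row, whatever its label
   b  c
a
1  3  4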
So why isn't the iloc offset lookup running in a time comparable to NumPy arrays, when each column is simply a NumPy array that can be accessed in constant time (a few operations)? And what is its complexity?
UPDATE:
I tried to compare the efficiency of pandas vs. NumPy on a 10000000x2 matrix, incrementing a value per row in a DataFrame df and in an array arr, with and without a for loop:
# Initialization
import numpy as np
import pandas as pd

SIZE = 10000000
arr = np.ones((SIZE, 2), dtype=np.uint32)
df = pd.DataFrame(arr)

# numpy, no for-loop
arr[range(SIZE), 1] += 1

# pandas, no for-loop
df.iloc[range(SIZE), 1] += 1

# numpy, for-loop
for i in range(SIZE):
    arr[i, 1] += 1

# pandas, for-loop
for i in range(SIZE):
    df.iloc[i, 1] += 1
| Method | Execution time |
|---|---|
| numpy, no for-loop | 7 seconds |
| pandas, no for-loop | 24 seconds |
| numpy, with for-loop | 27 seconds |
| pandas, with for-loop | > 2 hours |
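For reference, here is a minimal timing harness (my own sketch; the original post does not state how the measurements were taken) for the two vectorized cases:
import time
import numpy as np
import pandas as pd

SIZE = 10_000_000  # same shape as the experiment above

def timed(label, fn):
    # Run fn once and report wall-clock time
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.1f} s")

arr = np.ones((SIZE, 2), dtype=np.uint32)
df = pd.DataFrame(np.ones((SIZE, 2), dtype=np.uint32))

def numpy_vectorized():
    arr[range(SIZE), 1] += 1

def pandas_vectorized():
    df.iloc[range(SIZE), 1] += 1

timed("numpy, no for-loop", numpy_vectorized)
timed("pandas, no for-loop", pandas_vectorized)
# The for-loop variants can be timed the same way; the pandas one is
# impractically slow at this SIZE (hours, as the table above reports).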
Solution 1:[1]
There likely isn't one answer for the runtime complexity of iloc. The method accepts a huge range of input types, and that flexibility necessarily comes with costs. These costs are likely to include both large constant factors and non-constant costs that are almost certainly dependent on the way in which it is used.
One way to sort of answer your question is to step through the code in the two cases.
Indexing with range
First, indexing with range(SIZE). Assuming df is defined as you did, you can run:
import pdb
pdb.run('df.iloc[range(SIZE), 1]')
and then step through the code (with pdb's s command) to follow the path. Ultimately, this arrives at this line:
self._values.take(indices)
where indices is an ndarray of integers constructed from the original range, and self._values is the source ndarray of the data frame.
There are two things to note about this. First, the range is materialized into an ndarray, which means you have a memory allocation of at least SIZE elements. So... that's going to cost you some time :). I don't know how the indexing happens in NumPy itself, but given the time measurements you've produced, it's possible that there is no (or a much smaller) allocation happening.
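As a rough illustration of the first point (my own sketch, not part of the original answer), just materializing a range of that size is already a sizable allocation:
import numpy as np

SIZE = 10_000_000
indices = np.asarray(range(SIZE))  # roughly what happens to the range before take() is called
print(indices.nbytes)              # tens of megabytes of index integers before any data is touched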
The second thing to note is that numpy.take makes a copy. You can verify this by looking at the .flags attribute of the object returned from calling this method, which indicates that it owns its data and is not a view onto the original. (Also note that np.may_share_memory returns False.) So... there's another allocation there :).
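A quick way to verify both claims yourself (a minimal sketch in plain NumPy):
import numpy as np

values = np.arange(10)
taken = values.take([0, 1, 2])

print(taken.flags['OWNDATA'])              # True: take() returned a fresh copy that owns its buffer
print(np.may_share_memory(values, taken))  # False: no memory shared with the source array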
Take home: it's not obvious that there's any non-linear runtime here, but there are clearly large constant factors. Multiple allocations are probably the big killer, but the complex branching logic in the call tree under the .iloc property surely doesn't help.
Indexing in a loop
The code taken in this path is much shorter. It pretty quickly arrives here:
return self.obj._get_value(*key, takeable=self._takeable)
The really crappy runtime here is probably due to that tuple-unpacking: it repeatedly unpacks and repacks key as a tuple on each loop iteration. (Note that key is already the tuple (i, 1), so that sucks. It's the cost of accepting a generic iterable.)
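One hedged way to test whether this per-call overhead is really the culprit (my sketch, not from the original answer) is to compare .iloc against .iat, pandas' scalar-only positional accessor, which bypasses most of the generic-indexer handling:
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((1_000_000, 2), dtype=np.uint32))

# Scalar positional read through the generic .iloc machinery
t_iloc = timeit.timeit(lambda: df.iloc[500, 1], number=100_000)

# The same read through .iat, which only accepts scalar positions
t_iat = timeit.timeit(lambda: df.iat[500, 1], number=100_000)

print(f".iloc: {t_iloc:.2f} s   .iat: {t_iat:.2f} s   (100k lookups each)")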
Runtime analysis
In any case, we can get an estimate of the actual runtime of your particular use case by profiling. The following script generates a log-spaced list of array sizes from 10 to 10^8, indexes each with a range, and prints out the time it takes to run the __getitem__ method. (There are only two such calls in the tree, so it's easy to see which is the one we care about.)
import pandas as pd
import numpy as np
import cProfile
import pstats

sizes = [10 ** i for i in range(1, 9)]

for size in sizes:
    df = pd.DataFrame(data=np.zeros((size, 2)))
    with cProfile.Profile() as pr:
        pr.enable()
        df.iloc[range(size), 1]   # the indexing operation being profiled
        pr.disable()
    stats = pstats.Stats(pr)
    print(size)
    stats.print_stats("__getitem__")   # report only the __getitem__ calls
Once the output gets above the minimum resolution, you can see pretty clear linear behavior here:
| Size | Runtime (s) |
|---|---|
| 10000 | 0.002 |
| 100000 | 0.021 |
| 1000000 | 0.206 |
| 10000000 | 2.145 |
| 100000000 | 24.843 |
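As a quick sanity check on the linearity (back-of-the-envelope arithmetic from the table above, not from the original answer), the per-element cost stays roughly constant:
# Runtime divided by size for each row of the table above
measurements = [(10_000, 0.002), (100_000, 0.021), (1_000_000, 0.206),
                (10_000_000, 2.145), (100_000_000, 24.843)]
for size, runtime in measurements:
    print(f"{size:>11,}: {runtime / size * 1e9:.0f} ns per element")
# Every size lands around 200-250 ns per element, which is what O(n) predicts.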
So I'm not sure what sources you're referring to that talk about nonlinear runtime of indexing. They could be out of date, or considering a different code path than the one using range.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |