Why can't I iterate twice over the same data?

Why can't I iterate twice over the same iterator?

# data is an iterator.

for row in data:
    print("doing this one time")

for row in data:
    print("doing this two times")

This prints "doing this one time" a few times, since data is non-empty. However, it does not print "doing this two times". Why does iterating over data work the first time, but not the second time?



Solution 1:[1]

It's because data is an iterator, and you can consume an iterator only once. For example:

lst = [1, 2, 3]
it = iter(lst)

next(it)  # => 1
next(it)  # => 2
next(it)  # => 3
next(it)  # raises StopIteration

If we traverse data with a for loop, that final StopIteration is what makes the loop exit the first time. If we try to iterate over it again, the very first call to next() raises StopIteration, because the iterator has already been consumed, so the loop body never runs.
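To make that concrete, here is a rough sketch of what a for loop does behind the scenes. The helper name manual_for is made up for illustration and is not part of Python; it only approximates the real loop machinery.

```python
def manual_for(iterable, seen):
    """Hypothetical helper: approximately what `for item in iterable` does."""
    it = iter(iterable)           # for an iterator, iter() returns the same object
    while True:
        try:
            item = next(it)
        except StopIteration:     # the signal that silently ends a for loop
            break
        seen.append(item)


data = iter([1, 2, 3])

first_pass = []
manual_for(data, first_pass)      # consumes all three elements

second_pass = []
manual_for(data, second_pass)     # data is already exhausted

print(first_pass)   # [1, 2, 3]
print(second_pass)  # []
```

The second call hits StopIteration on its first next(), which is why the loop body in the question never runs the second time.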


Now for the second question: What if we do need to traverse the iterator more than once? A simple solution would be to save all the elements to a list, which can be traversed as many times as needed. For instance, if data is an iterator:

data = list(data)

That is alright as long as there are few elements in the list. However, if there are many elements, it's a better idea to create independent iterators using tee():

import itertools
it1, it2 = itertools.tee(data, 2) # create as many as needed

Now we can loop over each one in turn:

for e in it1:
    print("doing this one time")

for e in it2:
    print("doing this two times")

Solution 2:[2]

Iterators (e.g. from calling iter, from generator expressions, or from generator functions which yield) are stateful and can only be consumed once, as explained in Óscar López's answer. However, that answer's recommendation to use itertools.tee(data) instead of list(data) for performance reasons is misleading.

In most cases, where you want to iterate through the whole of data and then iterate through the whole of it again, tee takes more time and uses more memory than simply consuming the whole iterator into a list and then iterating over it twice. tee may be preferred if you will only consume the first few elements of each iterator, or if you will alternate between consuming a few elements from one iterator and then a few from the other.
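A small sketch of the two situations described above. The generator numbers() is a hypothetical stand-in for data, and this only illustrates usage, not relative performance.

```python
import itertools


def numbers():
    """Hypothetical stand-in for `data`: a one-shot stream of values."""
    yield from range(5)


# Case 1: two complete passes over the data. A list is the simple choice.
items = list(numbers())
total = sum(items)        # first pass
largest = max(items)      # second pass

# Case 2: interleaved consumption. tee only has to buffer the gap
# between the two iterators, which here is a single element.
a, b = itertools.tee(numbers())
next(b, None)             # move b one step ahead of a
pairs = list(zip(a, b))   # consecutive pairs

print(total, largest)     # 10 4
print(pairs)              # [(0, 1), (1, 2), (2, 3), (3, 4)]
```

Note that once tee has been called, the original iterator should not be advanced directly, or the tee'd iterators will miss those items.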

Solution 3:[3]

Once an iterator is exhausted, it will not yield any more.

>>> it = iter([3, 1, 2])
>>> for x in it: print(x)
...
3
1
2
>>> for x in it: print(x)
...
>>>
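By contrast, looping over the list itself works any number of times, because each for loop asks the list for a fresh iterator. A short sketch of the difference (variable names are illustrative only):

```python
lst = [3, 1, 2]

# Each call to iter() on a list returns a new, independent iterator.
it1 = iter(lst)
it2 = iter(lst)
fresh_each_time = it1 is not it2

# Calling iter() on an iterator returns that same iterator, state included.
it = iter(lst)
same_object = iter(it) is it

first = [x for x in lst]     # first pass over the list
second = [x for x in lst]    # second pass still sees every element

print(fresh_each_time, same_object)  # True True
print(first, second)                 # [3, 1, 2] [3, 1, 2]
```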

Solution 4:[4]

How to loop over an iterator twice?

It is impossible! (Explained later.) Instead, do one of the following:

  • Collect the iterator into something that can be looped over multiple times.

    items = list(iterator)
    
    for item in items:
        ...
    

    Downside: This costs memory.

  • Create a new iterator. Creating the iterator object itself is usually very cheap.

    for item in create_iterator():
        ...
    
    for item in create_iterator():
        ...
    

    Downside: Iteration itself may be expensive (e.g. reading from disk or network).

  • Reset the "iterator". For example, with file iterators:

    with open(...) as f:
        for item in f:
            ...
    
        f.seek(0)
    
        for item in f:
            ...
    

    Downside: Most iterators cannot be "reset".
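One way to get the "create a new iterator" option without calling a factory by hand is to wrap the data source in a class whose __iter__ is a generator function. The class Countdown below is a hypothetical example, not something from the answer above.

```python
class Countdown:
    """Hypothetical re-iterable: every for loop gets a brand-new iterator."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # A generator function: each call returns a new generator object
        # that starts counting from self.start again.
        n = self.start
        while n > 0:
            yield n
            n -= 1


countdown = Countdown(3)

first = list(countdown)    # calls countdown.__iter__() once
second = list(countdown)   # calls it again, so nothing is "used up"

print(first)   # [3, 2, 1]
print(second)  # [3, 2, 1]
```

The object itself is an iterable, not an iterator, so it keeps no position between loops.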


Philosophy of an Iterator

The world is divided into two categories:

  • Iterable: A for-loopable data structure that holds data. Examples: list, tuple, str.
  • Iterator: A pointer to some element of an iterable.

If we were to define a sequence iterator, it might look something like this:

from collections.abc import Sequence

class SequenceIterator:
    index: int
    items: Sequence  # Sequences can be randomly indexed via items[index].

    def __next__(self):
        """Increment index, and return the latest item."""
The important thing here is that typically, an iterator does not store any actual data inside itself.
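A quick way to see that a list iterator points into the list rather than holding its own copy: mutate the list after creating the iterator. This is a small illustrative sketch using the built-in list iterator.

```python
items = [1, 2, 3]
it = iter(items)       # the iterator refers to `items`; it does not copy it

first = next(it)       # 1
items.append(4)        # change the underlying list mid-iteration
rest = list(it)        # the iterator sees the appended element

print(first)  # 1
print(rest)   # [2, 3, 4]
```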

Iterators usually model a temporary "stream" of data. That data source is consumed by the process of iteration. This is a good hint as to why one cannot loop over an arbitrary source of data more than once. We need to open a new temporary stream of data (i.e. create a new iterator) to do that.

Exhausting an Iterator

What happens when we extract items from an iterator, starting with the current element of the iterator, and continuing until it is entirely exhausted? That's what a for loop does:

iterable = "ABC"
iterator = iter(iterable)

for item in iterator:
    print(item)

Let's support this functionality in SequenceIterator by telling the for loop how to extract the next item:

class SequenceIterator:
    def __next__(self):
        item = self.items[self.index]
        self.index += 1
        return item

Hold on. What if index goes past the last element of items? We should raise a safe exception for that:

class SequenceIterator:
    def __next__(self):
        try:
            item = self.items[self.index]
        except IndexError:
            raise StopIteration  # Safely says, "no more items in iterator!"
        self.index += 1
        return item

Now, the for loop knows when to stop extracting items from the iterator.

What happens if we now try to loop over the iterator again?

iterable = "ABC"
iterator = iter(iterable)

# In terms of our SequenceIterator model: iterator.index == 0

for item in iterator:
    print(item)

# iterator.index == 3

for item in iterator:
    print(item)

# iterator.index == 3

Since the second loop starts from the current iterator.index, which is 3, it does not have anything else to print and so iterator.__next__ raises the StopIteration exception, causing the loop to end immediately.
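Putting the pieces together, here is a runnable sketch of the class. The __init__ and __iter__ methods are additions that the snippets above leave out; __iter__ returning self is what lets a for loop accept the iterator directly.

```python
class SequenceIterator:
    """Minimal sketch of an iterator over any indexable sequence."""

    def __init__(self, items):
        self.items = items
        self.index = 0

    def __iter__(self):
        return self                        # an iterator is its own iterator

    def __next__(self):
        try:
            item = self.items[self.index]
        except IndexError:
            raise StopIteration from None  # no more items
        self.index += 1
        return item


iterator = SequenceIterator("ABC")

first = list(iterator)                     # consumes A, B, C
index_after_first = iterator.index         # 3
second = list(iterator)                    # starts at index 3: nothing left

print(first)              # ['A', 'B', 'C']
print(index_after_first)  # 3
print(second)             # []
```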

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 davidsbro
Solution 2
Solution 3 falsetru
Solution 4