How to start iterating a file at a specific line?

I’m iterating through a file’s lines with enumerate(), and sometimes I need to restart the iteration at a specific line, so I tried test_file.seek(). For example, to start iterating the file again at line 10, I call test_file.seek(10):

test_file.seek(10)

for i, line in enumerate(test_file):
    …

Yet test_file always keeps iterating from the very first line (line 0). What could I be doing wrong? Why isn’t seek() working? Any better implementations would be appreciated as well.

Thank you in advance; I will be sure to upvote/accept an answer.



Solution 1:[1]

Ordinary files are sequences of characters, at the file system level and as far as Python is concerned; there's no low-level way to jump to a particular line. The seek() method counts the offset in bytes, not lines. (In principle, an explicit seek offset should only be used if the file was opened in binary mode. On a text file, the only offsets guaranteed to work are 0 and values previously returned by tell(); any other offset is "undefined behavior", since logical characters can take more than one byte.)
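To make the bytes-versus-lines point concrete, here is a minimal sketch. It uses an in-memory io.BytesIO object as a stand-in for a file opened in binary mode; the contents are made up for illustration.

```python
import io

# Hypothetical stand-in for a file opened in binary mode.
data = io.BytesIO(b"zeroth line\nfirst line\nsecond line\n")

data.seek(10)       # moves 10 *bytes* from the start, not 10 lines
rest = data.read()  # everything from byte offset 10 onward

# Offset 10 is the final "e" of "zeroth line", i.e. still inside line 0.
print(rest)
```

The read resumes in the middle of the first line, which is why seek(10) cannot be used as "go to line 10".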

Your only option if you want to skip a number of lines is to read and discard them. Since iterating over a file object fetches it one line at a time, a compact way to get your code to work is with itertools.islice():

from itertools import islice

skipped = islice(test_file, 10, None)  # Skip 10 lines, i.e. start at index 10
for i, line in enumerate(skipped, 11):  # 11 is the 1-based number of the first line kept
    print(i, line, end="")
    ...
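Since the question is about starting the iteration again partway through the file, the same idea can be wrapped in a small helper. This is only a sketch: lines_from is a hypothetical name, it uses 0-based indices as in the question, and it assumes the file object is seekable so that it can be rewound with seek(0).

```python
import io
from itertools import islice

def lines_from(f, start):
    """Return an iterator of (index, line) pairs from 0-based line `start`."""
    f.seek(0)  # rewind; offset 0 is always a valid seek target, even in text mode
    return enumerate(islice(f, start, None), start)

# Demo on an in-memory text file with 15 lines: "line 0" ... "line 14".
demo = io.StringIO("".join(f"line {n}\n" for n in range(15)))
first_pass = list(demo)                   # consume the whole file once
second_pass = list(lines_from(demo, 10))  # start again at index 10

for i, line in second_pass:
    print(i, line, end="")
```

The lines before `start` are still read and discarded on every call; the helper only hides that work.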

Solution 2:[2]

A native Python way of doing this would be to use zip to consume the unneeded lines.

with open("text.txt","r") as test_file:
    for _ in zip(range(10), test_file): pass
    for i, line in enumerate(test_file,start=10):
        print(i, line)

Solution 3:[3]

Personally, I would just use an if statement. It's rudimentary, perhaps, but at least it is very easy to understand.

with open("file") as fp:
    for i, line in enumerate(fp):
        if i >= 10:
            pass  # do stuff with line

Edit: regarding islice, the comparisons done in "Python fastest access to line in file" are better than anything I am capable of. Combined with the itertools manual (https://docs.python.org/2/library/itertools.html), I doubt you'd need much more.

Solution 4:[4]

The only way the seek method is going to help you is if all the lines in the file are the same length, you know that length ahead of time, and your file is either binary or at least ASCII-only text (i.e. no variable-width characters allowed). Then you really could do

test_file.seek(10 * (length_of_line + 1), os.SEEK_SET)

This is because seek moves the internal file pointer by a fixed number of bytes, not lines. The +1 above accounts for the newline character. You would likely have to make it +2 for a file that uses Windows-style \r\n line terminators.

This will not work if your file is non-ASCII, because lines that are the same length in characters may contain different numbers of bytes, so the computed offset can land in the wrong place.
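As a rough illustration of the fixed-width case, the sketch below builds a made-up in-memory file in which every line is exactly 8 ASCII bytes plus a "\n" terminator, then jumps straight to line index 10. LINE_LENGTH and the "row NNNN" contents are assumptions chosen for the demo.

```python
import io
import os

LINE_LENGTH = 8  # assumed: every line has exactly 8 bytes before the "\n"

# Hypothetical binary file of 20 fixed-width lines: b"row 0000\n", b"row 0001\n", ...
test_file = io.BytesIO(b"".join(b"row %04d\n" % n for n in range(20)))

# Jump over 10 whole lines; the +1 accounts for each newline byte.
test_file.seek(10 * (LINE_LENGTH + 1), os.SEEK_SET)
first_line = test_file.readline()
print(first_line)
```

With a real file the same arithmetic applies only if it is opened in binary mode, so that no newline translation or decoding changes the byte counts.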

There are a few legitimate ways you can skip the first 10 lines:

  1. Read the whole file into a list and discard the first 10 lines:

    with open(...) as test_file:
        test_data = list(test_file)[10:]
    

    Now test_data contains all the lines in your file besides the first 10.

  2. Discard lines from the file as you read it in a for loop using enumerate:

    with open(...) as test_file:
        for lineno, line in enumerate(test_file):
            if lineno < 10:
                continue
            # Do something with the line
    

    This method has the advantage of not storing the unnecessary lines. This is functionally similar to using itertools.islice as some of the other answers suggest.

  3. Use some really arcane low-level stuff to actually read 10 newline characters from the file before proceeding normally. You may have to specify the encoding of the file up-front for this to work correctly with text I/O, but it should work out-of-the-box for ASCII files (see here for more details):

    newline_count = 10
    with open(..., encoding='utf-8') as test_file:
        while newline_count > 0:
            next_char = test_file.read(1)
            if next_char == '\n':
                newline_count -= 1
        # You have skipped 10 lines, so process normally here.
    

    This option is not particularly robust. It does not handle the case where there are fewer than 10 lines gracefully, and it re-implements the internal machinery of the built-in file iterator very crudely. The only possible advantage it offers is that it does not buffer entire lines like the iterator does.
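For comparison, here is one possible way to skip lines that also copes with files shorter than the requested count. It is a sketch rather than a definitive implementation: skip_lines is a hypothetical helper built on itertools.islice, and the in-memory files are made up for the demo.

```python
import io
from itertools import islice

def skip_lines(f, n):
    """Advance `f` past up to `n` lines and return how many were actually skipped."""
    return sum(1 for _ in islice(f, n))

# A file with fewer than 10 lines: the skip simply stops at end of file.
short = io.StringIO("a\nb\nc\n")
skipped_short = skip_lines(short, 10)
remaining_short = list(short)

# A file with 12 lines: exactly 10 are skipped, 2 remain.
longer = io.StringIO("".join(f"{n}\n" for n in range(12)))
skipped_long = skip_lines(longer, 10)
remaining_long = list(longer)

print(skipped_short, remaining_short)
print(skipped_long, remaining_long)
```

The return value lets the caller detect a short file instead of looping forever or raising.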

Solution 5:[5]

You can't use seek() to get to the beginning of a particular line unless you know the byte offset of the first character of the desired line.
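If the same file has to be revisited many times, one option is to record those byte offsets once and reuse them. The sketch below assumes binary mode (shown with an in-memory io.BytesIO) and a hypothetical helper named build_line_offsets; building the index still requires one full read of the file.

```python
import io

def build_line_offsets(f):
    """Return a list whose entry k is the byte offset at which line k starts."""
    offsets = []
    f.seek(0)
    while True:
        pos = f.tell()        # offset of the line about to be read
        if not f.readline():  # empty result means end of file
            break
        offsets.append(pos)
    return offsets

# Hypothetical file contents: b"line 0\n" ... b"line 14\n".
f = io.BytesIO(b"".join(b"line %d\n" % n for n in range(15)))
offsets = build_line_offsets(f)

f.seek(offsets[10])     # jump directly to the start of line index 10
line_10 = f.readline()
print(line_10)
```

readline() is used instead of a for loop so that tell() can be called between lines.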

One simple way to do it would be to use the islice() iterator in the itertools module.

For example, say you had a very big input file that looked like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
...

Sample code:

from __future__ import print_function
from itertools import islice

with open('test_file.txt') as test_file:
    for i, line in enumerate(islice(test_file, 9, None), 10):
        print('line #{}: {}'.format(i, line), end='')

Output:

line #10: 10
line #11: 11
line #12: 12
line #13: 13
line #14: 14
line #15: 15
line #16: 16
line #17: 17
line #18: 18
line #19: 19
line #20: 20
line #21: 21
line #22: 22
...

Note that islice() counts from zero, which is why its first argument was 9 and not 10. Also, this is not as fast as seek() would be, because islice() actually reads all the lines until it gets to the one where you want to start.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 Community
Solution 4
Solution 5