Read all bookmarks from a PDF document and create a dictionary with PageNumber and Title of the bookmark

I am trying to read a PDF document using Python with the PyPDF2 package. The objective is to read all the bookmarks in the PDF and construct a dictionary with the page numbers of the bookmarks as keys and the titles of the bookmarks as values.

There is not much help on the internet on how to achieve this except for this article. The code posted in it doesn't work, and I am not enough of a Python expert to correct it. PyPDF2's reader object has a property named outlines which gives you a list of all bookmark objects, but the bookmarks carry no page numbers, and traversing the list is difficult because there are no apparent parent/child relationships between bookmarks.

Below is my code to read a PDF document and inspect the outlines property.

import PyPDF2

reader = PyPDF2.PdfFileReader('SomeDocument.pdf')

print(reader.numPages)
print(reader.outlines[1][1])

python-3.x pypdf2

Solution 1:^[1]

The parent/child relationships are preserved by having the lists nested in each other. This sample code will display bookmarks recursively as an indented table of contents:

import PyPDF2


def show_tree(bookmark_list, indent=0):
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call with increased indentation
            show_tree(item, indent + 4)
        else:
            print(" " * indent + item.title)


reader = PyPDF2.PdfFileReader("[your filename]")

show_tree(reader.getOutlines())
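
To make the nesting convention concrete without needing a PDF at hand, here is a minimal sketch that flattens a nested outline into (depth, bookmark) pairs. The `Bookmark` namedtuple and the `walk_outline` name are hypothetical stand-ins for illustration; they are not part of PyPDF2, and only the `.title` attribute is assumed.

```python
from collections import namedtuple

# Hypothetical stand-in for a PyPDF2 bookmark object; only .title is used.
Bookmark = namedtuple("Bookmark", "title")


def walk_outline(items, depth=0):
    """Yield (depth, bookmark) pairs from a nested outline list."""
    for item in items:
        if isinstance(item, list):
            # A nested list holds the children of the bookmark before it.
            yield from walk_outline(item, depth + 1)
        else:
            yield depth, item


# Sample data shaped like the nested lists described above.
sample = [
    Bookmark("Chapter 1"),
    [Bookmark("Section 1.1"), Bookmark("Section 1.2")],
    Bookmark("Chapter 2"),
]

flat = [(depth, bookmark.title) for depth, bookmark in walk_outline(sample)]
for depth, title in flat:
    print("    " * depth + title)
```

With a real document, the list returned by reader.getOutlines() would take the place of `sample`.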

I don't know how to retrieve the page numbers. I tried with a few files, and the page attribute of a Destination object is always an instance of IndirectObject, which doesn't seem to contain any information about page number.

UPDATE:

There is a getDestinationPageNumber method to get page numbers from Destination objects. Modified code to create your desired dictionary:

import PyPDF2


def bookmark_dict(bookmark_list):
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call
            result.update(bookmark_dict(item))
        else:
            result[reader.getDestinationPageNumber(item)] = item.title
    return result


reader = PyPDF2.PdfFileReader("[your filename]")

print(bookmark_dict(reader.getOutlines()))

However, note that you will overwrite and lose some values if there are multiple bookmarks on the same page (dictionary keys must be unique).
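
One way around the overwriting problem is to store a list of titles per page instead of a single title. The following is a sketch under assumptions, not a tested drop-in: `bookmarks_by_page` and the `Bookmark` stand-in are hypothetical names, and the page lookup is passed in as a callable so the example runs without a PDF.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a bookmark; the real objects come from PyPDF2.
Bookmark = namedtuple("Bookmark", "title page")


def bookmarks_by_page(items, page_of, result=None):
    """Map each page number to the list of bookmark titles pointing at it."""
    if result is None:
        result = defaultdict(list)
    for item in items:
        if isinstance(item, list):
            # Recurse into child bookmarks, filling the same dictionary.
            bookmarks_by_page(item, page_of, result)
        else:
            result[page_of(item)].append(item.title)
    return result


sample = [
    Bookmark("Intro", 0),
    [Bookmark("Scope", 0), Bookmark("Terms", 1)],
    Bookmark("Methods", 3),
]

# Here the page is read from the stand-in; with a real reader you would
# pass reader.getDestinationPageNumber as page_of instead.
pages = dict(bookmarks_by_page(sample, page_of=lambda b: b.page))
print(pages)
```

Titles keep their document order within each page, so no bookmark is lost when several share a page.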

Solution 2:^[2]

Edit: PyPDF2 is not dead! I'm the new maintainer. See mportes' answer.

Here is my old answer:

Here is how you do it with PyMuPDF and type annotations:

from typing import Dict

import fitz  # pip install pymupdf


def get_bookmarks(filepath: str) -> Dict[int, str]:
    # WARNING! One page can have multiple bookmarks!
    bookmarks = {}
    with fitz.open(filepath) as doc:
        toc = doc.get_toc()  # [[lvl, title, page, …], …]; named getToC() in older PyMuPDF releases
        for level, title, page in toc:
            bookmarks[page] = title
    return bookmarks


print(get_bookmarks("my.pdf"))

Solution 3:^[3]

@myrmica provides the correct answer. The function needs some additional error handling to handle a situation where a bookmark is defective. I've also added 1 to the page numbers because they are zero-based.

import PyPDF2

def bookmark_dict(bookmark_list):
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call
            result.update(bookmark_dict(item))
        else:
            try:
                result[reader.getDestinationPageNumber(item) + 1] = item.title
            except Exception:
                # skip bookmarks whose destination page cannot be resolved
                pass
    return result

reader = PyPDF2.PdfFileReader("[your filename]")

print(bookmark_dict(reader.getOutlines()))
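
Silently skipping defective bookmarks can hide problems, so it may help to record which ones were dropped. The sketch below is only an illustration: `safe_bookmark_pages`, `Bookmark`, and `fake_page_of` are hypothetical names, and a missing page is simulated with a ValueError because the exact failure of a real lookup depends on the file and the library version.

```python
from collections import namedtuple

# Hypothetical stand-in for a bookmark; page=None simulates a defective one.
Bookmark = namedtuple("Bookmark", "title page")


def safe_bookmark_pages(items, page_of):
    """Return ([(one_based_page, title), ...], [titles that could not be resolved])."""
    found, skipped = [], []
    for item in items:
        if isinstance(item, list):
            # Recurse into child bookmarks and merge their results.
            sub_found, sub_skipped = safe_bookmark_pages(item, page_of)
            found.extend(sub_found)
            skipped.extend(sub_skipped)
            continue
        try:
            # The lookup is zero-based, so add 1 for human-readable pages.
            found.append((page_of(item) + 1, item.title))
        except Exception:
            skipped.append(item.title)
    return found, skipped


def fake_page_of(bookmark):
    """Simulated page lookup that fails for bookmarks without a page."""
    if bookmark.page is None:
        raise ValueError("no destination page")
    return bookmark.page


sample = [Bookmark("Cover", 0), [Bookmark("Broken", None)], Bookmark("Index", 9)]
found, skipped = safe_bookmark_pages(sample, fake_page_of)
print(found)
print("skipped:", skipped)
```

With the reader from the code above, you would pass reader.getDestinationPageNumber as `page_of`. Newer PyPDF2 and pypdf releases rename these calls to PdfReader, reader.outline and reader.get_destination_page_number, so the names may need adjusting for your installed version.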

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2
Solution 3	shawmat
