Read all bookmarks from a PDF document and create a dictionary with PageNumber and Title of the bookmark
I am trying to read a PDF document in Python with the PyPDF2 package. The objective is to read all the bookmarks in the PDF and build a dictionary with the bookmarks' page numbers as keys and their titles as values.
There is not much help online on how to achieve this apart from this article. The code posted in it doesn't work, and I am not enough of a Python expert to correct it. PyPDF2's reader object has a property named outlines that gives a list of all bookmark objects, but the bookmarks carry no page numbers, and traversing the list is a little difficult because there appear to be no parent/child relationships between bookmarks.
Below is my code to read a PDF document and inspect the outlines property.
import PyPDF2
reader = PyPDF2.PdfFileReader('SomeDocument.pdf')
print(reader.numPages)
print(reader.outlines[1][1])
Solution 1:[1]
The parent/child relationships are preserved by having the lists nested in each other. This sample code will display bookmarks recursively as an indented table of contents:
import PyPDF2

def show_tree(bookmark_list, indent=0):
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call with increased indentation
            show_tree(item, indent + 4)
        else:
            print(" " * indent + item.title)

reader = PyPDF2.PdfFileReader("[your filename]")
show_tree(reader.getOutlines())
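To make the parent/child point concrete, here is a minimal sketch that pairs each bookmark title with the title of its parent. It assumes, as the nesting described above suggests, that a nested list holds the children of the bookmark immediately before it. The function name `parent_pairs` and the `SimpleNamespace` stand-ins are hypothetical; real items would come from `reader.getOutlines()`.

```python
from types import SimpleNamespace


def parent_pairs(bookmark_list, parent=None):
    """Return (title, parent_title) pairs; parent_title is None at the top level."""
    pairs = []
    last = parent  # title of the most recent bookmark seen at this level
    for item in bookmark_list:
        if isinstance(item, list):
            # Assumption: a nested list belongs to the bookmark just before it.
            pairs.extend(parent_pairs(item, last))
        else:
            pairs.append((item.title, parent))
            last = item.title
    return pairs


# Stand-in objects for illustration only (not real PyPDF2 Destination objects).
outline = [
    SimpleNamespace(title="Chapter 1"),
    [SimpleNamespace(title="Section 1.1"), SimpleNamespace(title="Section 1.2")],
    SimpleNamespace(title="Chapter 2"),
]
pairs = parent_pairs(outline)
for title, parent_title in pairs:
    print(title, "<-", parent_title)
```

The only attribute the sketch reads is `title`, so it should work unchanged on a real outline list if the nesting assumption holds.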
I don't know how to retrieve the page numbers. I tried with a few files, and the page attribute of a Destination object is always an instance of IndirectObject, which doesn't seem to contain any information about page number.
UPDATE:
There is a getDestinationPageNumber method to get page numbers from Destination objects. Modified code to create your desired dictionary:
import PyPDF2

def bookmark_dict(bookmark_list):
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call
            result.update(bookmark_dict(item))
        else:
            # uses the module-level reader defined below
            result[reader.getDestinationPageNumber(item)] = item.title
    return result

reader = PyPDF2.PdfFileReader("[your filename]")
print(bookmark_dict(reader.getOutlines()))
However, note that you will overwrite and lose some values if there are multiple bookmarks on the same page (dictionary keys must be unique).
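One way around the overwriting is to store a list of titles per page instead of a single title. The following is a minimal sketch of that idea, not part of the original answer: `bookmarks_by_page` and its `page_of` parameter are hypothetical names, and the stand-in objects carry a made-up `page` attribute so the example runs without a PDF. With PyPDF2 you would presumably pass `reader.getDestinationPageNumber` as `page_of`.

```python
from collections import defaultdict
from types import SimpleNamespace


def bookmarks_by_page(bookmark_list, page_of):
    """Map each page number to a list of every bookmark title on that page."""
    result = defaultdict(list)

    def walk(items):
        for item in items:
            if isinstance(item, list):
                walk(item)  # descend into nested child bookmarks
            else:
                result[page_of(item)].append(item.title)

    walk(bookmark_list)
    return dict(result)


# Stand-in data: the `page` attribute and the lambda are hypothetical.
fake_outline = [
    SimpleNamespace(title="Intro", page=0),
    [SimpleNamespace(title="Scope", page=0), SimpleNamespace(title="Terms", page=1)],
]
grouped = bookmarks_by_page(fake_outline, lambda item: item.page)
print(grouped)
```

Titles are appended in outline order, so each list keeps the bookmarks in the order they appear in the table of contents.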
Solution 2:[2]
Edit: PyPDF2 is not dead! I'm the new maintainer. See mportes' answer.
Here is my old answer:
Here is how you do it with PyMuPDF and type annotations:
from typing import Dict

import fitz  # pip install pymupdf

def get_bookmarks(filepath: str) -> Dict[int, str]:
    # WARNING! One page can have multiple bookmarks!
    bookmarks = {}
    with fitz.open(filepath) as doc:
        # Newer PyMuPDF releases name this method get_toc()
        toc = doc.getToC()  # [[lvl, title, page, …], …]
        for level, title, page in toc:
            bookmarks[page] = title
    return bookmarks

print(get_bookmarks("my.pdf"))
Solution 3:[3]
@myrmica provides the correct answer. The function needs some additional error handling to handle a situation where a bookmark is defective. I've also added 1 to the page numbers because they are zero-based.
import PyPDF2

def bookmark_dict(bookmark_list):
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call
            result.update(bookmark_dict(item))
        else:
            try:
                # +1 converts the zero-based page index to a 1-based page number
                result[reader.getDestinationPageNumber(item) + 1] = item.title
            except Exception:
                # skip defective bookmarks whose page cannot be resolved
                pass
    return result

reader = PyPDF2.PdfFileReader("[your filename]")
print(bookmark_dict(reader.getOutlines()))
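The camelCase names used above come from older PyPDF2 releases. As far as I know, later releases and the pypdf successor expose the same functionality as `PdfReader`, an `outline` property, and `get_destination_page_number`. The sketch below assumes those names and only relies on the reader having that property and that method. It also records defective bookmarks instead of silently dropping them. The name `bookmark_pages` and the `FakeReader` stand-in are hypothetical and exist only so the example runs without a PDF.

```python
from types import SimpleNamespace


def bookmark_pages(reader):
    """Return ({1-based page: title}, [titles whose page could not be resolved])."""
    pages, skipped = {}, []

    def walk(items):
        for item in items:
            if isinstance(item, list):
                walk(item)  # nested child bookmarks
                continue
            try:
                # Assumed to return a zero-based index.
                index = reader.get_destination_page_number(item)
            except Exception:
                index = None
            if index is None or index < 0:
                skipped.append(item.title)
            else:
                pages[index + 1] = item.title

    walk(reader.outline)
    return pages, skipped


class FakeReader:
    """Hypothetical stand-in that mimics the two members used above."""

    outline = [
        SimpleNamespace(title="A", page=0),
        [SimpleNamespace(title="B", page=2)],
        SimpleNamespace(title="Bad", page=None),
    ]

    def get_destination_page_number(self, item):
        if item.page is None:
            raise ValueError("defective bookmark")
        return item.page


pages, skipped = bookmark_pages(FakeReader())
print(pages, skipped)

# With a real file (assuming pypdf is installed):
# from pypdf import PdfReader
# pages, skipped = bookmark_pages(PdfReader("[your filename]"))
```

Passing the reader as an argument avoids the module-level `reader` variable that the functions above depend on.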
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | shawmat |