tocPDF
CLI tool for generating a missing PDF outline from the table of contents found in the PDF.
Introduction
Imagine this: you have just become fascinated by a new topic in quantum physics, say perturbation theory, and you find that the reference textbook everyone recommends is, luckily, available as a free PDF download. Excited to start your journey in this field, you open the PDF and find a nicely structured table of contents (ToC) in which every relevant subject has its own dedicated section. Having found the section you want to read first, you are eager to jump to it. But wait! For some reason, the outline that comes with the PDF is empty, so you have to navigate to the page manually using the page number from the ToC. Even worse, since the PDF contains a Preface chapter, the page numbering of the book only starts on PDF page 42, so you cannot simply use the printed page number to jump to the desired section. Instead, you need to add an offset to it, which makes navigating the PDF even less convenient.
Solution
Not to worry, I wrote a Python command line tool, tocPDF, to solve precisely this issue. It extracts the ToC found in the PDF and populates the outline of the PDF with it.
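To give a feel for the extraction step, here is a minimal sketch of turning one line of ToC text into a title and a printed page number. It is not tocPDF's actual parsing code: the function name `parse_toc_line` and the assumed line shape (title, optional dot leaders, trailing Arabic page number) are hypothetical and only for illustration.

```python
import re

# Assumed ToC line shape: a title, optional dot leaders, then an Arabic page number.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.]*(?P<page>\d+)\s*$")


def parse_toc_line(line):
    """Return (title, printed_page), or None if the line has no trailing page number."""
    match = TOC_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("title").strip(), int(match.group("page"))


if __name__ == "__main__":
    print(parse_toc_line("1.2 Perturbation theory . . . . . 42"))
    print(parse_toc_line("Preface ix"))  # Roman numeral, so no match in this sketch
```

Real ToC layouts vary a lot (multi-line titles, Roman numerals, two-column pages), which is one reason a single pattern like this will not fit every PDF.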
Installation
The project is hosted on GitHub and can be installed using pip:
pip install tocPDF
Usage
tocPDF requires the PDF page numbers corresponding to the start and end of the ToC, as well as the offset to the first page of the book (the first one numbered with an Arabic numeral).
tocPDF --start 3 --end 6 --offset 9 example.pdf
tocPDF -s 3 -e 6 -o 9 example.pdf
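The role of the offset can be sketched as simple arithmetic. The snippet below assumes that the offset is the number of PDF pages that precede printed page 1, so that a printed page number plus the offset gives the 1-based PDF page. That convention and the helper name `pdf_page_for` are assumptions made for illustration; check the result on your own PDF before relying on it.

```python
def pdf_page_for(printed_page, offset):
    """Map a printed (book) page number to a 1-based PDF page number.

    Assumption: `offset` counts the PDF pages that come before printed page 1.
    """
    return printed_page + offset


if __name__ == "__main__":
    # Hypothetical ToC entries: (title, printed page number).
    entries = [("Introduction", 1), ("Perturbation theory", 42)]
    for title, printed_page in entries:
        print(title, "-> PDF page", pdf_page_for(printed_page, offset=9))
```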
For certain PDFs, this offset is not constant across the entire document because of additional unnumbered or missing pages. This is particularly common between the parts of a textbook, where a blank unnumbered page is often inserted. In these cases, naively using the original offset without accounting for the additional shift leads to a mismatch between the page number in the outline and the true page number of the section. tocPDF can address this issue by verifying that the offset for each section is still accurate. This is done by checking that the expected page number of a section matches the one parsed from the PDF. This feature can be enabled by passing the -m/--missing_pages flag:
tocPDF -s 3 -e 6 -o 9 -m example.pdf
Note that this additional verification step will slow down the generation of the outline.
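The idea behind this check can be sketched as follows. This is not tocPDF's implementation: the function `locate_sections`, the `max_shift` limit, and the list of printed page labels are all hypothetical. In this sketch, `printed_labels[i]` stands for the page number printed on PDF page `i + 1`, or `None` for an unnumbered page, and only extra pages (forward shifts) are handled.

```python
def locate_sections(entries, printed_labels, offset, max_shift=5):
    """Return (title, pdf_page) pairs, growing the offset when extra pages appear.

    entries:        list of (title, printed_page) taken from the ToC
    printed_labels: printed_labels[i] is the number printed on PDF page i + 1,
                    or None if that page carries no number
    offset:         initial offset, assumed to mean pdf_page = printed_page + offset
    """
    located = []
    for title, printed_page in entries:
        for shift in range(max_shift + 1):
            candidate = printed_page + offset + shift  # 1-based PDF page
            if (candidate <= len(printed_labels)
                    and printed_labels[candidate - 1] == printed_page):
                offset += shift  # keep the corrected offset for later sections
                break
        # If no page within max_shift matched, fall back to the current offset.
        located.append((title, printed_page + offset))
    return located


if __name__ == "__main__":
    # Two unnumbered front-matter pages, then a blank page between printed pages 3 and 4.
    labels = [None, None, 1, 2, 3, None, 4, 5]
    print(locate_sections([("Part I", 1), ("Part II", 4)], labels, offset=2))
```

Reading the printed number from each candidate page is the expensive part, which is consistent with the slowdown mentioned above.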
ToC Extraction
tocPDF works with a number of backends for extracting the ToC from the PDF:
- pdfplumber (default)
- pypdf
- tika (requires Java)
These can be selected using the -p/--parser flag:
tocPDF -s 3 -e 6 -o 9 -m -p pypdf example.pdf
Some parsers handle certain ToC layouts better than others. If you are not happy with the results, try the other parsers.
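When comparing parsers, it can help to generate the command for each backend from a script. The sketch below only builds the argument lists, using the flags documented above. The helper `build_command` is hypothetical and is not part of tocPDF. Running the commands (for example with `subprocess.run`) is left out on purpose, since how each run writes its output should be checked first.

```python
PARSERS = ("pdfplumber", "pypdf", "tika")  # backends listed above


def build_command(pdf_path, start, end, offset, parser="pdfplumber", missing_pages=False):
    """Build the tocPDF argument list for one parser (illustrative helper)."""
    if parser not in PARSERS:
        raise ValueError(f"unknown parser: {parser}")
    command = ["tocPDF", "-s", str(start), "-e", str(end), "-o", str(offset)]
    if missing_pages:
        command.append("-m")
    command += ["-p", parser, pdf_path]
    return command


if __name__ == "__main__":
    for name in PARSERS:
        print(" ".join(build_command("example.pdf", 3, 6, 9, parser=name)))
```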