Scraping two-column PDFs

I am trying to scrape the text of hundreds of PDFs for a project.

The PDFs have title pages, headers, footers and a two-column layout. I have tried the packages pdftools and tabulizer, but each has its advantages and disadvantages:

  • The pdf_text() function from pdftools reads the PDFs correctly, with only some encoding issues that can be fixed manually, but it does not take the two-column structure into account. It also produces a character vector with one element per page.
  • Conversely, the extract_text() function from tabulizer handles the two-column structure nicely, but in many cases it produces incorrect results (example below). It also returns a character vector of length one containing the text of the entire PDF document (see the sketch after this list for a comparison of the two return shapes).
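
For illustration, here is a minimal sketch of the difference in return shapes; the file path is a placeholder, not one of my actual documents:

# Placeholder path: any multi-page, two-column PDF
file <- "example.pdf"

# pdftools: one character string per page, but the two-column layout
# is not preserved (columns get merged line by line)
txt_pdftools <- pdftools::pdf_text(file)
length(txt_pdftools)   # equals the number of pages

# tabulizer: the column layout is reconstructed, but by default the
# whole document comes back as a single string
txt_tabulizer <- tabulizer::extract_text(file)
length(txt_tabulizer)  # 1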

Based on another post on Stack Overflow, I built the following function. It relies on tabulizer, since that package handles the two-column structure of the PDFs, and it outputs a vector with each page stored in a separate element:

get_text <- function(url) {
  # Get the number of pages of the PDF
  p <- tabulizer::get_n_pages(url)
  # Extract the text page by page; passing `pages` makes extract_text()
  # return one element per page
  txt <- tabulizer::extract_text(url, pages = seq(1, p))
  # Output: character vector containing all pages
  return(txt)
}

While it works fine in general, there are some PDFs that are not read correctly. For example,

get_text(url = "https://aplikace.mvcr.cz/sbirka-zakonu/ViewFile.aspx?type=c&id=3592")

Instead of the correct words and numbers (which contain Czech letters), something like "\001\002\r\n\b\a\004 \006\t\n\r\n% .\005 \t\031\033 * ." is displayed. This does not happen for all PDFs, however. Furthermore, please note that pdftools reads this document correctly (ignoring the two columns).
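
As a quick check, this is roughly how I compared the two readers on that document; pdftools accepts the URL directly here, otherwise one could download the file to a temporary path first:

# pdftools on the same document: the Czech characters come through intact,
# but the two columns are merged line by line
chk <- pdftools::pdf_text("https://aplikace.mvcr.cz/sbirka-zakonu/ViewFile.aspx?type=c&id=3592")
cat(substr(chk[1], 1, 500))  # first 500 characters of page 1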

Can anybody help me with this problem or explain why it occurs?

Thank you very much in advance!


