Extracting exact table data from PDF
I am trying to extract each row of my table from a pdf file I created before.
The problem is that empty cells, which I expected to be saved as 'null', are ignored and not even read as space characters.
I extract the content from my PDF via this method:
    public final ArrayList<String> extractLines(final File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            PDFTextStripper strip = new PDFTextStripper();
            String txt = strip.getText(doc);
            // Split on \n or \r\n so no trailing '\r' is left on each line (e.g. on Windows).
            String[] arr = txt.split("\\r?\\n");
            return new ArrayList<>(Arrays.asList(arr));
        }
    }
Is it even possible to extract the data with the whitespace preserved?
If so, can PDFBox do it, or do I need a different method?
EDIT:
I cannot get traprange to work. A simple test:
    File e = new File("C:/Users/Test/Downloads/a.pdf");
    List<Table> t = new PDFTableExtractor().setSource(e).extract();
    System.out.println(t.get(0).toString());
Only gives me:
Could it have to do with the form of my table?
My table:
Solution 1:[1]
I came up with my own solution.
Since I have a 2D ArrayList, each inner list contains one row of the table.
I now save the position of the non-empty cell in each row (only one cell per row is non-empty at any time).
I store these positions in a metadata field of the PDF and read that field back to restore them.
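The answer does not show its code, so the following is only a minimal sketch of the idea, assuming exactly one non-empty cell per row as described. The class name `CellPositionCodec`, the comma-separated format, and the metadata key mentioned in the comment are all hypothetical. The resulting string could be written with PDFBox's `PDDocumentInformation.setCustomMetadataValue` and read back with `getCustomMetadataValue`; that part is left out so the sketch depends only on the standard library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: encodes which column holds the non-empty cell of each row. */
final class CellPositionCodec {

    /** Returns e.g. "1,0,2": per row, the index of the first non-empty cell (-1 if none). */
    static String encode(List<List<String>> table) {
        List<String> parts = new ArrayList<>();
        for (List<String> row : table) {
            int filled = -1;
            for (int col = 0; col < row.size(); col++) {
                String cell = row.get(col);
                if (cell != null && !cell.isEmpty()) {
                    filled = col;
                    break;
                }
            }
            parts.add(Integer.toString(filled));
        }
        // Assumption: this string would be stored under a custom key such as
        // "cellPositions" via doc.getDocumentInformation().setCustomMetadataValue(...).
        return String.join(",", parts);
    }

    /** Rebuilds the rows from the extracted values (one per row) and the stored positions. */
    static List<List<String>> decode(String encoded, List<String> values, int columnCount) {
        List<List<String>> table = new ArrayList<>();
        if (encoded.isEmpty()) {
            return table;
        }
        String[] positions = encoded.split(",");
        if (positions.length != values.size()) {
            throw new IllegalArgumentException("one value per row is required");
        }
        for (int i = 0; i < positions.length; i++) {
            // Start with a row of empty cells, then place the single value.
            List<String> row = new ArrayList<>(Collections.nCopies(columnCount, ""));
            int col = Integer.parseInt(positions[i]);
            if (col >= 0) {
                row.set(col, values.get(i));
            }
            table.add(row);
        }
        return table;
    }
}
```

With this sketch, the lines returned by `extractLines` would be passed as `values`, and the stored string restores the column each value belongs to.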
Solution 2:[2]
The task needs a custom algorithm. Please check this solution for a custom PDFTableStripper.
Another great solution, implemented by Tho, can be found at traprange. It can extract the null data of a particular cell.
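Neither linked implementation is reproduced here, so the following is only a sketch of the general idea such custom algorithms tend to rely on: use the x-coordinate of each piece of text to decide which column it belongs to, so that a column with no text stays an empty cell. The `ColumnBinner` and `Fragment` names and the column boundaries are hypothetical. With PDFBox, the coordinates could presumably be collected by subclassing `PDFTextStripper`, overriding `writeString(String, List<TextPosition>)`, and reading `TextPosition.getXDirAdj()`; the sketch itself uses only the standard library.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: assigns text fragments of one table row to columns by x-position. */
final class ColumnBinner {

    /** A piece of text and the x-coordinate where it starts (illustrative type). */
    static final class Fragment {
        final String text;
        final float x;

        Fragment(String text, float x) {
            this.text = text;
            this.x = x;
        }
    }

    /**
     * Builds one row. columnStarts holds the left edge of each column in ascending
     * order (an assumption: the layout of the table is known in advance).
     */
    static String[] toRow(List<Fragment> fragments, float[] columnStarts) {
        if (columnStarts.length == 0) {
            throw new IllegalArgumentException("at least one column is required");
        }
        String[] row = new String[columnStarts.length];
        Arrays.fill(row, ""); // columns without text remain empty cells
        for (Fragment f : fragments) {
            int col = 0;
            // Pick the right-most column whose left edge is at or before the fragment.
            for (int i = 0; i < columnStarts.length; i++) {
                if (f.x >= columnStarts[i]) {
                    col = i;
                }
            }
            row[col] = row[col].isEmpty() ? f.text : row[col] + " " + f.text;
        }
        return row;
    }
}
```

For a row with text only at x = 50 and x = 310 and column edges at 40, 180, and 300, the middle cell comes back as an empty string instead of being dropped.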
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Dahlin |
| Solution 2 | Abdul Alim Shakir |