tabula-py not reading all rows for PDFs with alternating row colors when lattice is set to True
I am trying to extract all rows from the PDF attached here.
Here is the code I used:
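import pandas as pd
from tabula import read_pdf

def parse_latticepdf_pages(pdf):
    # Read every page in lattice mode, restricted to a fixed area
    pages = read_pdf(
        pdf,
        pages="all",
        guess=False,
        lattice=True,
        silent=True,
        area=[43, 5, 568, 774],
        pandas_options={'header': None}
    )
    # read_pdf returns a list of DataFrames; stack them into one
    return pd.concat(pages)

parse_latticepdf_pages(pdf="file.pdf")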
The output contains only the rows with a grey background. It does not contain the rows with a white background. How do I get all rows regardless of their background color?
Note: Initially I tried stream=True, but that caused other problems: each line of text appears as a separate row, and it is impossible to group the rows as needed. Hence, I set lattice=True. Also, the issue is the same whether or not multiple_tables is enabled.
I would appreciate any help regarding this. Thank you!
Solution 1:[1]
I'm not sure what's happening, but I confirmed that it works with the multiple_tables=False option, as follows:
In [41]: tabula.read_pdf(fname, pages=1, lattice=True, area = [43, 5, 568, 774], multiple_tables=False)
Out[41]:
[ Issued Date Permit No. ... Proposed Use Valuation
0 4/1/2019 P025361-032119 ... New office and restroom addition to existing\r... $45,000.00
1 4/12/2019 P025502-041219 ... Isolate chapel from fire damaged area 4000 sq.... $1,000.00
2 4/12/2019 P025487-041019 ... Interior finish-out for new meat market 2500\r... $35,000.00
3 4/15/2019 P025520-041519 ... New 8-unit apartment building 10,800 sq. ft. $350,000.00
4 4/25/2019 P025101-020719 ... New Five Story Hotel 93,501 sq. ft. $12,327,000.00
5 4/9/2019 P025475-040919 ... Mobile Home Placement 1216 sq. ft. $1,250.00
6 4/9/2019 P025477-040919 ... Mobile Home Placement 1216 sq. ft. $1,250.00
7 4/9/2019 P025479-040919 ... Mobile Home Placement 1216 sq. ft. $1,250.00
8 4/8/2019 P025459-040519 ... Build a carport. $1,000.00
[9 rows x 7 columns]]
It might cause another issue with pages="all", though.
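If pages="all" does turn out to be a problem, one possible workaround is to read the pages one at a time with multiple_tables=False and concatenate the results. The following is only a minimal sketch, not a tested fix for this PDF: the helper name read_lattice_pages, the page_numbers argument, and the optional reader argument are hypothetical names introduced here for illustration, and the area is simply the one from the question, assumed to be the same on every page.

```python
import pandas as pd


def read_lattice_pages(pdf, page_numbers, reader=None):
    """Read each page separately with multiple_tables=False and stack the results."""
    if reader is None:
        # Imported lazily so the helper can also be tried with a stub reader.
        from tabula import read_pdf as reader

    frames = []
    for page in page_numbers:
        result = reader(
            pdf,
            pages=page,
            lattice=True,
            multiple_tables=False,
            area=[43, 5, 568, 774],  # area taken from the question
        )
        # Depending on the tabula-py version, the result is either a single
        # DataFrame or a list of DataFrames; handle both.
        if isinstance(result, pd.DataFrame):
            frames.append(result)
        else:
            frames.extend(result)
    # Stack the per-page tables and renumber the rows
    return pd.concat(frames, ignore_index=True)
```

It could be called as read_lattice_pages("file.pdf", page_numbers=[1, 2, 3]), where the page numbers have to be supplied by the caller. Reading page by page keeps each call identical to the single-page call shown above, at the cost of starting the extraction once per page.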
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | chezou |