Tabula read pdf - CalledProcessError
I am using tabula to read tables from a PDF. The documents I'm extracting data from are really large, so I'm using a for loop to go through the pages one at a time:
from tabula import read_pdf

for i in range(45, endofdoc):
    df = read_pdf('D:\\XXXXX.pdf', pages=i, pandas_options={'header': None}, java_options="-Xmx512m")
This has worked for many of the files. For the file I'm currently working on, it worked until page 195. On that page it raises the error pasted below. Interestingly enough, the next page works again. I checked the PDF, and the format of this page is no different from any of the other ones. What could be going wrong? And more importantly, how can I fix it? Thanks in advance!
File "C:\Users\Kirsten\anaconda3\lib\subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
CalledProcessError: Command '['java', '-Xmx512m', '-Dfile.encoding=UTF8', '-jar', 'C:\\Users\\Kirsten\\anaconda3\\lib\\site-packages\\tabula\\tabula-1.0.5-jar-with-dependencies.jar', '--pages', '195', '--guess', '--format', 'JSON', 'D:\\XXXXX.pdf']' returned non-zero exit status 1.
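The traceback reports only the exit status, while the Java-side message that explains the failure goes to stderr. One way to see that message is to rerun the same command outside tabula-py and capture its output. The sketch below is only an illustration: `run_and_capture` is a hypothetical helper, not part of tabula-py, and the command passed to it would be the list printed in the traceback above.

```python
import subprocess


def run_and_capture(cmd):
    """Run cmd and return (exit code, stdout text, stderr text) without raising."""
    completed = subprocess.run(
        cmd,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output to str
        errors="replace",     # do not fail on bytes that cannot be decoded
    )
    return completed.returncode, completed.stdout, completed.stderr
```

Calling it with the command list from the traceback should return the non-zero exit code together with whatever tabula-java wrote to stderr for page 195.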
My versions:
Python version:
3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]
Java version:
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) Client VM (build 25.181-b13, mixed mode, sharing)
tabula-py version: 2.3.0
platform: Windows-10-10.0.19044-SP0
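Since the failure is limited to a single page, one possible workaround is to catch the exception per page so the loop keeps going and the failing pages are recorded for later inspection. The following is a minimal sketch under assumptions: `read_pages` and `read_with_tabula` are hypothetical names introduced here, and the sketch only isolates the failing page; it does not explain or repair whatever makes page 195 fail.

```python
import subprocess


def read_pages(read_page, pages):
    """Call read_page(page) for every page and keep going when one page fails.

    Returns (results, failures): results maps page -> whatever read_page
    returned, failures maps page -> the CalledProcessError that was raised.
    """
    results, failures = {}, {}
    for page in pages:
        try:
            results[page] = read_page(page)
        except subprocess.CalledProcessError as err:
            failures[page] = err
    return results, failures


def read_with_tabula(page):
    """Hypothetical wrapper around the call from the question (needs tabula-py)."""
    from tabula import read_pdf  # imported lazily so read_pages has no dependency

    return read_pdf('D:\\XXXXX.pdf', pages=page,
                    pandas_options={'header': None}, java_options="-Xmx512m")
```

With these in place, `results, failures = read_pages(read_with_tabula, range(45, endofdoc))` would process every page, and `failures` would list the pages that need a closer look. If the exception carries captured output, `failures[195].stderr` may contain the Java error message.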
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow