Getting a 'CalledProcessError ... returned non-zero exit status 1' when running tabula.read_pdf() on Python 3.6

I have tried all possible options. Please help

I get the following error when calling tabula's read_pdf() in Python:

CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', 'D:\\Transfer_Folder\\WPy-3661\\python-3.6.6.amd64\\lib\\site-packages\\tabula\\tabula-1.0.3-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'JSON', '--outfile', 'C:\\Users\\guptac\\AppData\\Local\\Temp\\tmpqu_wgv1o', 'C:/Users/guptac/Desktop/1409.3215.pdf']' returned non-zero exit status 1.

Running tabula.environment_info() gives the following:

Python version:
    3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)]
Java version:
    java version "1.8.0_221"
Java(TM) SE Runtime Environment (build 1.8.0_221-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.221-b11, mixed mode)
tabula-py version: 1.4.0
platform: Windows-10-10.0.17763-SP0
uname:
    uname_result(system='Windows', node='Guptacdt02', release='10', version='10.0.17763', machine='AMD64', processor='Intel64 Family 6 Model 158 Stepping 9, GenuineIntel')
linux_distribution: ('MSYS_NT-10.0-WOW', '2.10.0', '')
mac_ver: ('', ('', '', ''), '')

I have already tried keeping my script (untitled0.py, which contains the code shown below) and the PDF file I am trying to process together on my Desktop. On Stack Overflow I found a myriad of suggestions, such as switching from Java 8 to Java 7 or upgrading the tabula package (mine is already the latest version, and my Java is up to date as well). A comment on another post said to keep the code and the PDF in the same directory, which I did, but to no avail. The same error message keeps coming back.

import tabula

df = tabula.read_pdf('C:/Users/guptac/Desktop/1409.3215.pdf', pages='all', encoding='utf-8', multiple_tables=True)

Expected results: I should be able to extract multiple tables from the PDF document I pass to tabula.read_pdf().

Update: Running the command as shown here: https://github.com/chezou/tabula-py/issues/93 does not help either. See the error below.


D:\Transfer_Folder\WPy-3661\python-3.6.6.amd64\Lib\site-packages>java -jar'D:\\Transfer_Folder\\WPy-3661\\python-3.6.6.amd64\\lib\\site-packages\\tabula\\tabula-1.0.3-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'JSON', '--outfile', 'C:\\Users\\guptac\\AppData\\Local\\Temp\\tmpi1dv4lz7', '1409.3215.pdf'
Unrecognized option: -jar'D:\\Transfer_Folder\\WPy-3661\\python-3.6.6.amd64\\lib\\site-packages\\tabula\\tabula-1.0.3-jar-with-dependencies.jar',
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

Update:

The document I am using was downloaded from: https://arxiv.org/pdf/1409.3215.pdf

Update: I have also looked at the solution posted here, but could not quite understand what was being suggested:

https://github.com/chezou/tabula-py/issues/60

Update:

I have given up on tabula and am using camelot instead. Much better.



Solution 1:[1]

Sorry for the late response. I know you already use camelot instead of tabula-py, but this is just FYI for people who find this topic.

This was a Windows-specific issue introduced in v1.4.0. tabula-py v1.4.1 should fix it: https://github.com/chezou/tabula-py/releases/tag/v1.4.1
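
For anyone who wants to check whether their install is the affected release, here is a minimal sketch. It assumes, going only by this answer, that exactly v1.4.0 carries the Windows regression. The helper names parse_version and is_affected_release are made up for illustration and are not part of tabula-py.

```python
import platform

# Assumption taken from the answer above: the regression shipped in 1.4.0
# and was fixed in 1.4.1, so only this exact release is treated as affected.
AFFECTED_RELEASE = (1, 4, 0)


def parse_version(version):
    """Turn a plain 'X.Y.Z' string into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def is_affected_release(version, system=None):
    """Return True when `version` is the affected release and the OS is Windows.

    `system` defaults to the current platform; it is a parameter only so the
    check can be exercised from any machine.
    """
    if system is None:
        system = platform.system()
    return system == "Windows" and parse_version(version) == AFFECTED_RELEASE
```

The version string can be read from the "tabula-py version" line printed by tabula.environment_info(). If the check comes back True, running pip install --upgrade tabula-py should pull in a release with the fix.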

Solution 2:[2]

For anyone facing the same problem in the future, try:

tabula.read_pdf(file, pages='all', java_options=['-Dfile.encoding=UTF8', '-Djava.awt.headless=true'])

The -Djava.awt.headless=true option fixed the error "Can't connect to X11 window server using ':0' as the value of the DISPLAY variable", which I only found by manually printing the stderr of the subprocess.run call that launches the JAR file.
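
To see the underlying Java message rather than just the CalledProcessError summary, something along these lines may help. It is a rough sketch, not tabula-py's own code: the function names build_tabula_command and run_and_capture are hypothetical, and the command-line flags are simply copied from the traceback in the question. Note that JVM system properties need the leading "-D".

```python
import subprocess


def build_tabula_command(jar_path, pdf_path, out_path, java_options=None):
    """Assemble the java command as a list, mirroring the one in the traceback."""
    if java_options is None:
        # System properties must start with "-D"; without the dash,
        # java does not parse the token as an option at all.
        java_options = ["-Dfile.encoding=UTF8", "-Djava.awt.headless=true"]
    return (
        ["java"]
        + list(java_options)
        + ["-jar", jar_path,
           "--pages", "all", "--guess", "--format", "JSON",
           "--outfile", out_path, pdf_path]
    )


def run_and_capture(command):
    """Run `command` and return (exit_code, stderr_text) instead of raising."""
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,  # keep the error text instead of losing it
    )
    return completed.returncode, completed.stderr.decode("utf-8", errors="replace")
```

Passing the JAR, PDF and output paths from the traceback to build_tabula_command, then printing the second value returned by run_and_capture, should show what Java actually complained about.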

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 chezou
Solution 2 async await