Pytesseract: Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata
I am trying to use pytesseract on Jupyter Notebook.
- Windows 10 x64
- Running Jupyter Notebook (Anaconda3, Python 3.6.1) with administrative privileges
- The working directory containing the TIFF file is on a different drive (Z:)
When I run the following code:
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'
tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en', config=tessdata_dir_config))
I get the following error:
TesseractError Traceback (most recent call last)
<ipython-input-37-c1dcbc33cde4> in <module>()
11 # tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
12
---> 13 print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en'))
14 # print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))
C:\Users\cpcho\AppData\Local\Continuum\Anaconda3\lib\site-packages\pytesseract\pytesseract.py in image_to_string(image, lang, boxes, config)
123 if status:
124 errors = get_errors(error_string)
--> 125 raise TesseractError(status, errors)
126 f = open(output_file_name, 'rb')
127 try:
TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata')
I found these two references helpful but I am missing something: https://github.com/madmaze/pytesseract/issues/50 https://github.com/madmaze/pytesseract/issues/64
Thank you for your time on this!
Solution 1:[1]
From your post, I see two possible issues.
First, all the trained language data should be saved in the location given by TESSDATA_PREFIX, a Windows environment variable, which is C:\Program Files (x86)\Tesseract-OCR\tessdata in your case.
Second, the tesseract trained English data is named eng.traineddata (i.e. lang='eng', not 'en') unless you modified its name. Refer to the Tesseract Data Files documentation for more information.
In addition, when opening the image with Image.open() for pytesseract to read, you may include the full file path (e.g. 'z:\\path\\to\\image') if the image file cannot be located.
Hope this helps.
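Both points can be checked before calling pytesseract by confirming that the language file really exists where Tesseract will look. The following is only a minimal sketch: the helper name find_traineddata is hypothetical (it is not part of pytesseract), and it assumes TESSDATA_PREFIX points at the tessdata folder as described above.

```python
import os
from pathlib import Path


def find_traineddata(lang, tessdata_dir=None):
    """Return the path of <lang>.traineddata if it exists, else None.

    Hypothetical helper, not part of pytesseract. When tessdata_dir is
    not given, the TESSDATA_PREFIX environment variable is used instead.
    """
    base = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    if not base:
        return None
    candidate = Path(base) / f"{lang}.traineddata"
    return candidate if candidate.is_file() else None
```

On a default install that ships only eng.traineddata, a lookup for 'en' should come back as None while 'eng' returns a path, which matches the error in the question.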
Solution 2:[2]
I faced the same problem. I tried every solution I could find on Google, without success. Finally, I solved the problem by replacing
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'
with
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'
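If you would rather not edit the separators by hand, the standard library can produce the forward-slash form. This is just a sketch of the same substitution using pathlib; whether it resolves the error on a given machine is not guaranteed.

```python
from pathlib import PureWindowsPath

# Path from the question, written as a raw string so the backslashes
# do not need to be escaped.
backslash_path = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"

# as_posix() returns the same path with forward slashes.
forward_slash_path = PureWindowsPath(backslash_path).as_posix()
print(forward_slash_path)
```

The resulting string can then be assigned to pytesseract.pytesseract.tesseract_cmd.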
Solution 3:[3]
If you don't want to set an environment variable, you can pass the tessdata location as an argument instead.
For example:
First, do your imports
import pytesseract
from PIL import Image
And now configure pytesseract
pytesseract.pytesseract.tesseract_cmd = "C:/path_to_your_tesseract.exe"
tessdata_dir_config = '--tessdata-dir "C:/path_to_your_tessdata_folder"'
image = Image.open("C:/path_to_your_image")  # placeholder path, like the two above
pytesseract.image_to_string(image, config=tessdata_dir_config)
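The steps above can be wrapped in a single function. This is a sketch under stated assumptions: the names tessdata_config and ocr_with_tessdata are made up for illustration, every path passed in is a placeholder you supply, and lang="eng" assumes the default English data file is installed.

```python
def tessdata_config(tessdata_dir):
    """Build the --tessdata-dir option string for pytesseract's config argument."""
    return '--tessdata-dir "{}"'.format(tessdata_dir)


def ocr_with_tessdata(image_path, tesseract_exe, tessdata_dir, lang="eng"):
    """Run OCR on one image, pointing Tesseract at an explicit tessdata folder."""
    # Imported here so tessdata_config() can be used without these packages.
    from PIL import Image
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = tesseract_exe
    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=tessdata_config(tessdata_dir)
        )
```

Keeping the option string in its own small function makes it easy to print and inspect the exact value being handed to Tesseract when the error persists.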
Solution 4:[4]
On day 1 everything worked; on day 2 I got this error, while on a second computer everything still worked. Five hours later I found the answer:
Copy 'eng.traineddata' from "C:\Program Files\Tesseract-OCR\tessdata" to "C:\Program Files\Tesseract-OCR".
It works.
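The same copy can be scripted. A minimal sketch, assuming you have write permission to the install folder (under Program Files this typically requires an elevated session); the function name copy_traineddata is hypothetical.

```python
import shutil
from pathlib import Path


def copy_traineddata(install_dir, lang="eng"):
    """Copy <lang>.traineddata from install_dir/tessdata up into install_dir."""
    install_dir = Path(install_dir)
    source = install_dir / "tessdata" / f"{lang}.traineddata"
    # shutil.copy2 returns the path of the newly created file.
    return Path(shutil.copy2(source, install_dir))
```

For the folder named in this answer, the call would be copy_traineddata(r"C:\Program Files\Tesseract-OCR").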
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | thewaywewere |
Solution 2 | Isma |
Solution 3 | sam |
Solution 4 | ???????? ?????? |