'Pytesseract: Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata

I am trying to use pytesseract on Jupyter Notebook.

  • Windows 10 x64
  • Running Jupyter Notebook (Anaconda3, Python 3.6.1) with administrative privilege
  • The work directory containing TIFF file is in different drive (Z:)

When I run the following code:

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'

print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en', config = tessdata_dir_config))

I get the following error:

TesseractError                            Traceback (most recent call last)
<ipython-input-37-c1dcbc33cde4> in <module>()
     11 # tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
     12 
---> 13 print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en'))
     14 # print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

C:\Users\cpcho\AppData\Local\Continuum\Anaconda3\lib\site-packages\pytesseract\pytesseract.py in image_to_string(image, lang, boxes, config)
    123         if status:
    124             errors = get_errors(error_string)
--> 125             raise TesseractError(status, errors)
    126         f = open(output_file_name, 'rb')
    127         try:

TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata')

I found these two references helpful but I am missing something: https://github.com/madmaze/pytesseract/issues/50 https://github.com/madmaze/pytesseract/issues/64

Thank you for your time on this!



Solution 1:[1]

From your post, observed two possible issues.

  1. All the trained language data should be saved in TESSDATA_PREFIX, a Windows environmental variable, which is at C:\Program Files (x86)\Tesseract-OCR\tessdata in your case.

  2. The tesseract trained English data is named eng.traineddata (i.e. 'eng') unless you modified its name. Refer to this Tesseract Data Files for more information.

In addition, for pytesseract to read the image file Image.open(), you may include the full file path (e.g. 'z:\\path\\to\\image') if the image file is unable to locate.

Hope to this.

Solution 2:[2]

I faced the same problem. I tried all solutions on Google, without success. Finally, I solved the problem by replacing.

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe' 

with

pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'.

Solution 3:[3]

If you don't want to set environment variable you can pass as an argument as well

For example:

First, do your imports

    import pytessetact
    from PIL import Image

And now configure pytesseract

    pytesseract.pytesseract.tesseract_cmd = "C:/path_to_your_tesseract.exe"
    tessdata_dir_config = '--tessdata-dir "C:/path_to_your_tessdata_folder"'

    pytesseract.image_to_string(image, config=tessdata_dir_config)

Solution 4:[4]

Day 1 -all works; Day 2 -this error; on second computer all works... 5 hours later: ===i find ANSWER in my mind===

From "C:\Program Files\Tesseract-OCR\tessdata" copy 'eng.traineddata' to "C:\Program Files\Tesseract-OCR"

its work =\

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 thewaywewere
Solution 2 Isma
Solution 3 sam
Solution 4 ???????? ??????