Tesseract OCR

Build tesseract

Prerequisites^[1]:

brew install libgif libjpeg libpng libtiff zlib

# If brew fails with:
#   Error: xz: undefined method `deny_network_access!' for Formulary::FormulaNamespaceeddce1918855a2fb5cf7427fd5438072::Xz:Class
# then comment out this line in the xz formula: `deny_network_access! [:build, :postinstall]`

Install leptonica^[2]:

# https://stackoverflow.com/questions/40067547/glibtool-on-macbook
brew install libtool automake

git clone https://github.com/DanBloomberg/leptonica.git
cd leptonica
./autogen.sh
./configure
make
sudo make install
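
Before building Tesseract, it can help to confirm that the freshly installed Leptonica is visible to pkg-config, since Tesseract's configure step looks it up that way. The sketch below is only an illustration: the function name `check_leptonica` is made up for this note, and the `/usr/local/lib/pkgconfig` location is an assumption based on the default `make install` prefix.

```shell
# Hypothetical helper: report the Leptonica version that pkg-config can see.
check_leptonica() {
    if ! command -v pkg-config >/dev/null 2>&1; then
        echo "pkg-config not found" >&2
        return 1
    fi
    # Assumption: Leptonica was installed under the default /usr/local prefix,
    # so its lept.pc file lives in /usr/local/lib/pkgconfig.
    PKG_CONFIG_PATH="/usr/local/lib/pkgconfig:${PKG_CONFIG_PATH:-}" \
        pkg-config --modversion lept
}

check_leptonica || echo "Leptonica is not visible to pkg-config yet"
```

If the check fails even though `sudo make install` succeeded, exporting `PKG_CONFIG_PATH` with the directory that contains `lept.pc` before running Tesseract's `./configure` is the usual thing to try.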

Install tesseract-ocr:

git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
./autogen.sh
./configure
make
sudo make install
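
A quick way to check the result of the build is to ask the installed binary for its version. This is a minimal sketch; `check_tesseract` is a hypothetical helper name, and it assumes the install prefix (typically `/usr/local/bin`) is already on `PATH`.

```shell
# Hypothetical helper: confirm the tesseract binary is on PATH and print its version line.
check_tesseract() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH" >&2
        return 1
    fi
    # Older releases print the version to stderr, so merge both streams.
    tesseract --version 2>&1 | head -n 1
}

check_tesseract || echo "Build or install step probably did not complete"
```

Once this works, `tesseract --list-langs` shows which languages the binary can currently find. The list is expected to be empty or to report an error until trained data has been installed.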

Trained data

There are three sets of official .traineddata files, trained at Google, for Tesseract versions 4.00 and above. Each set is published in its own repository.

tessdata_fast (Sep 2017): best “value for money” in speed vs. accuracy; integer models.
tessdata_best (Sep 2017): best results on Google’s eval data; slower; float models. These are the only models that can be used as a base for fine-tune training.
tessdata (Nov 2016 and Sep 2017): includes the legacy Tesseract models from 2016. The LSTM models have been updated with integer versions of the tessdata_best LSTM models. (The Cube-based legacy Tesseract models for Hindi, Arabic, etc. have been deleted.)

git clone https://github.com/tesseract-ocr/tessdata_best.git
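
To try the cloned models without copying them into the system tessdata directory, the `--tessdata-dir` option can point Tesseract at the clone. The following is a sketch under assumptions: `TESSDATA_DIR`, `ocr_with_best`, and `sample.png` are hypothetical names chosen for this example, and the clone is assumed to sit in the current directory.

```shell
# Assumption: tessdata_best was cloned into the current working directory.
TESSDATA_DIR="${TESSDATA_DIR:-$PWD/tessdata_best}"

# Hypothetical helper: OCR one image with a model from TESSDATA_DIR and
# print the recognized text to standard output.
ocr_with_best() {
    image="$1"
    lang="${2:-eng}"
    if [ ! -f "$TESSDATA_DIR/$lang.traineddata" ]; then
        echo "missing $TESSDATA_DIR/$lang.traineddata" >&2
        return 1
    fi
    # "stdout" as the output base sends the text to the terminal instead of a file.
    tesseract "$image" stdout --tessdata-dir "$TESSDATA_DIR" -l "$lang"
}

# Example call (sample.png is a placeholder image name):
# ocr_with_best sample.png eng
```

The helper checks for the .traineddata file first, so a wrong path fails with a readable message before Tesseract is invoked.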

[1] https://stackoverflow.com/questions/33659458/tesseract-image-issue

[2] https://github.com/DanBloomberg/leptonica
