A few weeks ago we announced the first release of the tesseract package: a high quality OCR engine in R. We have now released an update with extra features.
Installing Training Data
As explained in the first post, the tesseract system is powered by language specific training data. By default only English training data is installed. Version 1.3 adds utilities to make it easier to install additional training data.
# Download French training data
tesseract_download("fra")
Note that this function is not needed on Linux. Here you should install training data via your system package manager instead. For example on Debian/Ubuntu:
sudo apt-get install tesseract-ocr-fra
And on Fedora/CentOS you use:
sudo yum install tesseract-langpack-fra
Use tesseract_info() to see which training data are currently installed.
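As a rough sketch of how this check could be scripted, the snippet below compares a list of wanted languages against what is installed. The helper missing_languages() is hypothetical (it is not part of the tesseract package), and the code assumes that tesseract_info() returns a list whose available element holds the installed language codes.

```r
# Hypothetical helper (not part of tesseract): which of the wanted language
# codes are absent from the installed set?
missing_languages <- function(wanted, installed) {
  setdiff(wanted, installed)
}

# Assumption: tesseract_info() returns a list whose `available` element
# holds the codes of the installed training data.
if (requireNamespace("tesseract", quietly = TRUE)) {
  installed <- tesseract::tesseract_info()$available
  for (lang in missing_languages(c("eng", "fra"), installed)) {
    # Install with tesseract_download(lang), or via the system package
    # manager on Linux.
    message("Training data not installed: ", lang)
  }
}
```

The sketch only reports what is missing, so it is safe to run on any platform before deciding how to install the data.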
OCR Engine Parameters
Tesseract supports many parameters to fine-tune the OCR engine. For example, you can limit the set of characters that can be recognized.
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))
ocr("image.png", engine = engine)
In the example above, Tesseract will only consider numeric characters. If you know in advance the data is numeric (for example an accounting spreadsheet) such options can tremendously improve the accuracy.
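Once the engine is restricted to digits, the text returned by ocr() can be turned into numbers with a little base R. The following is a minimal sketch: parse_ocr_numbers() is a hypothetical helper, and the input string is a made-up stand-in for OCR output rather than a real result.

```r
# Hypothetical helper: split OCR text on whitespace and convert each token
# to a number. Tokens that are not valid numbers become NA.
parse_ocr_numbers <- function(text) {
  tokens <- unlist(strsplit(trimws(text), "[[:space:]]+"))
  tokens <- tokens[nzchar(tokens)]
  suppressWarnings(as.numeric(tokens))
}

# Made-up stand-in for the string ocr() might return from a digits-only image.
text <- "120 45\n7 3000\n"
values <- parse_ocr_numbers(text)
print(values)
```

Any NA in the result flags a token that needs manual review, which is a cheap sanity check on the recognition quality.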
Magick Images
Tesseract now automatically recognizes images from the awesome magick package (our R wrapper to ImageMagick). This can be useful for preprocessing images before passing them to tesseract.
library(magick)
library(tesseract)
image <- image_read("http://jeroenooms.github.io/files/dog_hq.png")
image <- image_crop(image, "1700x100+50+150")
cat(ocr(image))
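Cropping is only one possible preprocessing step. The sketch below chains two others, upscaling and grayscale conversion, in a hypothetical helper called preprocess_for_ocr(). The 200% scale factor is an arbitrary starting point, the built-in ImageMagick "logo:" image is used only to avoid a download, and the code assumes a magick version in which image_convert() accepts a type argument.

```r
library(magick)

# Hypothetical helper: upscale the image and convert it to grayscale before
# OCR. The default scale is an arbitrary choice, not a package recommendation.
preprocess_for_ocr <- function(image, scale = "200%") {
  image <- image_resize(image, scale)
  image_convert(image, type = "Grayscale")
}

# "logo:" is ImageMagick's built-in test image; it needs no network access.
image <- image_read("logo:")
prepared <- preprocess_for_ocr(image)
print(image_info(prepared))

# Run OCR only if the tesseract package is installed.
if (requireNamespace("tesseract", quietly = TRUE)) {
  cat(tesseract::ocr(prepared))
}
```

Whether these steps help depends on the input, so it is worth comparing the OCR output with and without them.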
We plan to add more integration between magick and tesseract in future versions.