
Tesseract 4 is here! State of the art OCR in R!

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers].

Last week Google and friends released the new major version of their OCR system: Tesseract 4. This release builds upon 2+ years of hard work and has completely overhauled the internal OCR engine. From the tesseract wiki:

Tesseract 4.0 includes a new neural network-based recognition engine that delivers significantly higher accuracy (on document images) than the previous versions, in return for a significant increase in required compute power. On complex languages however, it may actually be faster than base Tesseract.

We have now also updated the R package tesseract to ship with the new Tesseract 4 on MacOS and Windows. It uses the new engine by default, and the results are extremely impressive! Recognition is much more accurate than before, even without manually enhancing the image quality.

Updating

The binary R package for Windows and MacOS can be installed directly from CRAN:

install.packages("tesseract")
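To confirm that the update took effect, it can help to check which version of the Tesseract library the package is linked against. The following is a minimal sketch, assuming the tesseract package is installed; the `requireNamespace()` guard and the variable name `info` are only there for illustration.

```r
# Look up the Tesseract library version and the installed languages.
# `info` stays NULL if the tesseract package is not installed.
info <- NULL
if (requireNamespace("tesseract", quietly = TRUE)) {
  info <- tesseract::tesseract_info()
  cat("Tesseract version:", info$version, "\n")
  cat("Installed languages:", paste(info$available, collapse = ", "), "\n")
}
```

After a successful update on Windows or MacOS, the reported version should start with 4.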

Tesseract 4 uses a new training data format, so if you had previously installed custom training data, you might need to download it again:

# If you want to OCR French text:
library(tesseract)
tesseract_download('fra')

Training data for eng (default) is included with the R package already.
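Once the training data for a language is installed, it can be selected by creating an engine with `tesseract()` and passing it to `ocr()`. The helper below is a sketch of that workflow, not part of the package: the function name `ocr_with_language` and the file name in the usage comment are hypothetical.

```r
# Hypothetical helper: OCR an image in a given language, downloading the
# training data first if it is not installed yet.
ocr_with_language <- function(image, lang = "fra") {
  installed <- tesseract::tesseract_info()$available
  if (!lang %in% installed) {
    tesseract::tesseract_download(lang)
  }
  # Create an engine for the requested language and run OCR with it
  engine <- tesseract::tesseract(language = lang)
  tesseract::ocr(image, engine = engine)
}

# Usage, assuming a scanned French page saved as "page-fr.png":
# cat(ocr_with_language("page-fr.png", lang = "fra"))
```

Checking `tesseract_info()$available` first avoids downloading the same training data on every call.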

Testing

Let’s take a simple example from last month’s blog post about ocr’ing bird drawings from the natural history collection. We use the magick package to preprocess the image (crop the area of interest). The image_ocr() function is a magick wrapper for tesseract::ocr().

library(magick)
image_read("https://jeroen.github.io/images/birds.jpg") %>%
  image_crop('1200x300+100+1700') %>%
  image_ocr() %>%
  cat()

H Grenveld. del Watherby & 0°
CLIMACTERIS PICUMNUS.
(BROWN TREE CREEPER).

Tesseract has perfectly detected the hand-written species name (both the Latin and English name), and has also found and nearly perfectly predicted the tiny author names. These results would be a very good basis for post-processing and automatic classification. For example, we could match these results against known species and authors, as explained in the original blog post.
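As a rough illustration of such post-processing, the OCR lines can be compared against a list of known species names using edit distance. This is only a sketch using base R's `adist()`, not the method from the original blog post; the reference list `known_species` and the helper `normalise()` are made up for the example.

```r
# OCR output from the example above, one element per line
# (\u00b0 is the degree sign that Tesseract returned)
ocr_lines <- c("H Grenveld. del Watherby & 0\u00b0",
               "CLIMACTERIS PICUMNUS.",
               "(BROWN TREE CREEPER).")

# Hypothetical reference list of known species (illustration only)
known_species <- c("Climacteris picumnus",
                   "Climacteris erythrops",
                   "Cormobates leucophaea")

# Upper-case the text and drop everything except letters and spaces
normalise <- function(x) {
  trimws(gsub("[^A-Z ]", "", toupper(x)))
}

# Edit distance between every OCR line (rows) and every known name (columns)
distances <- adist(normalise(ocr_lines), normalise(known_species))

# Pick the line/species pair with the smallest distance
best <- which(distances == min(distances), arr.ind = TRUE)[1, ]
matched_line <- ocr_lines[best[["row"]]]
matched_species <- known_species[best[["col"]]]

cat("Line:", matched_line, "\nSpecies:", matched_species, "\n")
```

Because the comparison is based on edit distance rather than exact equality, a name with one or two misread characters would still end up closest to the right entry in the reference list.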
