Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
Since my last blog on Tesseract-OCR I have been playing around casually with it to see what it is capable of doing. Tesseract supports optical character recognition for over 100 languages. That, together with how straightforward it is to use from R, inspired me to try using it for Hebrew text.
The last time I publicly explored anything to do with the Hebrew language and letters was when I wrote an R package for calculating Hebrew Gematrias. While it has remained untouched for years now, it's still usable and you can check it out on my Github here.
In this blog I explore two pages of the Noam Elimelech and examine the words that precede each asterisk. For context, the Noam Elimelech is a collection of teachings by the 18th century Rabbi, Rabbi Elimelech of Lizhensk ztvk”l zy”a. There are a number of asterisks placed across the text, and the reason for their placement is largely unexplained. While it is beyond the scope of this blog to go too deep into the specifics, I share how I extracted the text which preceded each asterisk.
The text I used can be accessed here. While the methods highlighted here can be extended to the entire text, this blog is just a proof of concept. As such, I limit the scope to two pages of the text.
The Code
Since the text that I’m using has two columns per page, it needs to be cropped by column before OCR is applied. Prior to that, the `.pdf` files need to be converted to `.png` format. The workflow is thus:
- Converting the `.pdf` file to `.png` format (`pdftools::pdf_convert()`)
- Reading the created `.png` file and cropping it (`magick::image_read()` and `magick::image_crop()`)
- Employing Tesseract-OCR to extract the text (`tesseract::ocr()`). (While there are functions in the `magick` package that accomplish this, I found that its Tesseract-OCR wrapper does not fare as well as using Tesseract-OCR directly with the `tesseract` package. I thus used the `magick` package for cropping the text area and `tesseract` for the OCR work.)
- Doing the relevant text cleaning and extracting the words before each asterisk by using regular expressions.
library(tidyverse)
library(magick)
library(tesseract)

# Convert the PDFs to PNG at 1000 dpi; pdf_convert() returns the names of the files it writes
noamElimelech <- c("NoamElimelech_Bechukosai_1.pdf",
                   "NoamElimelech_Bechukosai_2.pdf") %>%
  sapply(function(x) pdftools::pdf_convert(x, dpi = 1000)) %>%
  unname()

# Each column: read the page, crop to the column, OCR with the Hebrew model
# (the "heb" training data must be installed, e.g. via tesseract_download("heb")),
# then split the recognised text into lines

# Page 1, left column
noamElimelech_Left_1 <- noamElimelech[1] %>%
  image_read() %>%
  image_crop("0x11243+4050+1600") %>%
  ocr(engine = tesseract("heb")) %>%
  str_split("\\n") %>%
  unlist()

# Page 2, left column
noamElimelech_Left_2 <- noamElimelech[2] %>%
  image_read() %>%
  image_crop("0x11310+4300+1300") %>%
  ocr(engine = tesseract("heb")) %>%
  str_split("\\n") %>%
  unlist()

# Page 1, right column
noamElimelech_Right_1 <- noamElimelech[1] %>%
  image_read() %>%
  image_crop("4050x8800+0+1200") %>%
  ocr(engine = tesseract("heb")) %>%
  str_split("\\n") %>%
  unlist()

# Page 2, right column
noamElimelech_Right_2 <- noamElimelech[2] %>%
  image_read() %>%
  image_crop("4400x2500+0+1200") %>%
  ocr(engine = tesseract("heb")) %>%
  str_split("\\n") %>%
  unlist()
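The crop strings above follow ImageMagick's geometry notation: width x height, followed by the x and y offsets of the top-left corner, all in pixels of the rendered image. As a minimal sketch, the helper below rebuilds two of those strings from their parts and shows how a measurement in inches maps to pixels at the 1000 dpi used for the conversion. The names `crop_geometry` and `inches_to_px` are hypothetical and not part of `magick`.

```r
# Hypothetical helper (not part of magick): build a "WxH+X+Y" geometry string.
crop_geometry <- function(width, height, x_off = 0, y_off = 0) {
  sprintf("%dx%d+%d+%d",
          as.integer(width), as.integer(height),
          as.integer(x_off), as.integer(y_off))
}

# Convert a length in inches to pixels at a given rendering resolution.
inches_to_px <- function(inches, dpi = 1000) {
  as.integer(round(inches * dpi))
}

# Rebuild the page 1 crop strings from named parts.
left_page_1  <- crop_geometry(0, 11243, x_off = 4050, y_off = 1600)
right_page_1 <- crop_geometry(4050, 8800, x_off = 0, y_off = 1200)

print(left_page_1)
print(right_page_1)
print(inches_to_px(4.05))
```

Naming the four numbers makes it easier to see which value to adjust when a column comes out cut off. The width of 0 in the left-column strings is passed through unchanged; it appears that ImageMagick treats a zero dimension as "extend to the edge of the image", but that is worth confirming on your own pages.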
For the regular expressions, I looked up how to use them with Hebrew text and learned that Hebrew letters fall in the Unicode range `\u0590-\u05fe` (see here). Additionally, to deal with the apostrophes that are common in abbreviations, I made sure they are treated as part of a word when extracting, and I replaced double quotes with apostrophes beforehand.
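# Combine the columns into one string and turn double quotes into apostrophes.
# Note: collapse = "" joins the lines with no separator between them.
noamElimelech_text <- c(noamElimelech_Left_1, noamElimelech_Right_1,
                        noamElimelech_Left_2, noamElimelech_Right_2) %>%
  paste(collapse = "") %>%
  str_replace_all('"', "'")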
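One detail of the joining step is worth a small illustration: `paste(collapse = "")` glues the OCR lines together with nothing in between, so a word ending one line and the word starting the next become a single token. The sketch below uses made-up English lines (not the actual OCR output) to compare this with `collapse = " "`, and also shows the quote replacement in base R. Whether this accounts for any of the fused words in the results is an assumption I have not verified.

```r
# Made-up OCR lines for illustration; the real input is Hebrew text from tesseract.
ocr_lines <- c("first line ends here", "next \"line\" starts * here")

# Joining with an empty separator fuses "here" and "next" into one token.
joined_tight <- paste(ocr_lines, collapse = "")

# Joining with a single space keeps the words on either side of the break apart.
joined_spaced <- paste(ocr_lines, collapse = " ")

# Base R equivalent of str_replace_all('"', "'"): double quotes become apostrophes.
normalised <- gsub('"', "'", joined_spaced, fixed = TRUE)

print(joined_tight)
print(joined_spaced)
print(normalised)
```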
The regular expression I use extracts the two words preceding each asterisk, which gives better context. I would have to spend more time learning about text analysis to take this blog beyond demonstrating text extraction.
There are some spaces skipped and letters misread by `tesseract`, but the result is nevertheless interesting.
# Match two Hebrew words (apostrophes allowed) followed by " * ", then drop the asterisk
words_before_asterisks <- noamElimelech_text %>%
  str_extract_all("([\\u0590-\\u05fe[']]{1,} [\\u0590-\\u05fe[']]{1,})( \\* )") %>%
  unlist() %>%
  str_remove_all("\\*") %>%
  trimws()

words_before_asterisks
 [1] "במה פעמים" "טובה גדולה" "והצדיק מבטל" "כן יקום"
 [5] "ממילא בטל" "של בע'פ" "להפכם לרחמים" "להשפיע לכם"
 [9] "ויבולהמלשון יובל" "בל' נסתר" "בעבודה כלל" "שבולכם פה"
[13] "והתחברות יחד" "הוא להיפך" "ברצונם ותשוקתם" "להפכם לרחמים"
[17] "של אדם" "רע ונהפכולרחמים" "דהיינו ממטהלמעל'" "על אחרים"
[21] "תשוב' שלימ'"
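To make the pattern easier to follow, here is a minimal sketch on a short made-up string assembled from a few of the words in the output. It uses base R with a lookahead instead of `str_extract_all()` followed by `str_remove_all()`, so it is an alternative formulation rather than the code I ran, and the variable names are only for illustration.

```r
# Toy input written with \u escapes so the script stays ASCII-only.
tova   <- "\u05d8\u05d5\u05d1\u05d4"          # first word of result [2]
gedola <- "\u05d2\u05d3\u05d5\u05dc\u05d4"    # second word of result [2]
bel    <- "\u05d1\u05dc'"                     # abbreviation ending in an apostrophe
nistar <- "\u05e0\u05e1\u05ea\u05e8"          # second word of result [10]
toy_text <- paste(tova, gedola, "*", bel, nistar, "*", tova)

# One "word": one or more characters from the Hebrew range, or apostrophes.
hebrew_word <- "[\\x{0590}-\\x{05FE}']+"

# Two words separated by a space, matched only when " * " follows.
# The lookahead checks for the asterisk without including it in the match.
pattern <- paste0(hebrew_word, " ", hebrew_word, "(?= \\* )")

matches <- regmatches(toy_text, gregexpr(pattern, toy_text, perl = TRUE))[[1]]
print(matches)
```

Because the lookahead leaves the asterisk out of the match, no separate removal or trimming step is needed. The final word in the toy string is not returned, since no asterisk follows it.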
Conclusion
It's very cool to see how well Tesseract-OCR works. While some characters were misclassified and some spaces were missed, I still managed to get the text before the asterisks. These findings may hold some hints as to why the asterisks are placed where they are, but explaining any of that is beyond the scope of this blog and my present qualifications!
If you know of available training data that fares better for character recognition than what I used, please let me know!
Thank you for reading this blog!