My {cdcfluview} package started tossing errors on CRAN just over a week ago when the CDC added an extra parameter to one of the hidden API endpoints that the package wraps. After a fairly hectic set of days since said NOTE came, I had time this morning to poke at a fix. There are a lot of tests, so after a successful debugging session I was awaiting CRAN checks on various remotes as well as README builds, and figured I’d keep in practice with another, nascent package of mine, {swiftr}, which makes it dead simple to build R functions from Swift code, in similar fashion to what Rcpp::cppFunction() does for C/C++ code.
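For readers who have not used it, here is a minimal sketch of the Rcpp::cppFunction() pattern that {swiftr} mirrors. It assumes the {Rcpp} package and a working C++ toolchain are installed; the add_ints() function is purely illustrative and not part of any package mentioned here.

```r
# Assumption: {Rcpp} and a C++ compiler are available on this machine.
if (requireNamespace("Rcpp", quietly = TRUE)) {

  # Compile a C++ function from a string and expose it as an R function.
  # add_ints() is a made-up name used only for this illustration.
  Rcpp::cppFunction('
    int add_ints(int x, int y) {
      return x + y;
    }
  ')

  print(add_ints(2L, 3L))
  ## [1] 5
}
```

The idea is the same in both packages: hand over source code as a string, get back a callable R function.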
macOS comes with a full set of machine learning/AI libraries/frameworks that definitely have “batteries included” (i.e. you can almost just make one function call to get 90-95% of what you want without even training new models). One of these is text extraction from Apple’s computer vision framework, Vision. I thought it’d be a fun and quick “wait mode” distraction to wrap a VNRecognizeTextRequest() call and use it from R.
To show how capable the default model is, I pulled a semi-complex random image from DDG’s image search:
Let’s build the function (you need to be on macOS for this; exposition inline):
library(swiftr) # github.com/hrbrmstr/swiftr

swift_function(
  code = '
import Foundation
import CoreImage
import Cocoa
import Vision

@_cdecl ("detect_text")
public func detect_text(path: SEXP) -> SEXP {

  // turn R string into Swift String so we can use it
  let fileName = String(cString: R_CHAR(STRING_ELT(path, 0)))

  var res: String = ""
  var out: SEXP = R_NilValue

  // get image into the right format
  if let ciImage = CIImage(contentsOf: URL(fileURLWithPath:fileName)) {

    let context = CIContext(options: nil)

    if let img = context.createCGImage(ciImage, from: ciImage.extent) {

      // setup computer vision request
      let requestHandler = VNImageRequestHandler(cgImage: img)

      // start recognition
      let request = VNRecognizeTextRequest()
      do {
        try requestHandler.perform([request])

        // if we have results
        if let observations = request.results as? [VNRecognizedTextObservation] {

          // paste them together
          let recognizedStrings = observations.compactMap { observation in
            observation.topCandidates(1).first?.string
          }
          res = recognizedStrings.joined(separator: "\\n")
        }
      } catch {
        debugPrint("\\(error)")
      }
    }
  }

  res.withCString { cstr in out = Rf_mkString(cstr) }

  return(out)
}
')
The detect_text() function is now available in R, so let’s see how it performs on that image of signs:
detect_text(path.expand("~/Data/signs.jpeg")) %>%
  stringi::stri_split_lines() %>%
  unlist()
##  [1] "BEWILDERED" "UNCLEAR"    "nAZEU"      "UNCERTAIN"  "VISA"       "INSURE"
##  [7] "ATED"       "MUDDLED"    "LOsT"       "DISTRACTED" "PERPLEXED"  "CONFUSED"
## [13] "PUZZLED"
It works super-fast and gets far more correct than I would have expected.
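Since the Swift code joins every recognized string with a newline and hands back a single R string, the split step does not strictly need {stringi}. A minimal base R sketch of the same post-processing follows; the ocr_result value is a stand-in built from the first few results above, not a fresh call to detect_text().

```r
# Stand-in for the single newline-joined string detect_text() returns
# (values copied from the output above, not produced by a live call).
ocr_result <- "BEWILDERED\nUNCLEAR\nnAZEU"

# Split on literal newlines; strsplit() returns a list, so take element 1.
lines <- strsplit(ocr_result, "\n", fixed = TRUE)[[1]]

# Drop surrounding whitespace and any empty entries.
lines <- trimws(lines)
lines <- lines[nzchar(lines)]

print(lines)
## [1] "BEWILDERED" "UNCLEAR"    "nAZEU"
```

Unlike stringi::stri_split_lines(), this only splits on "\n", which should be enough here because that is the separator the Swift code uses.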
Toy examples aside, it also works pretty well (as one would expect) on “real” text images, such as this example from the Tesseract test suite:
detect_text(path.expand("~/Data/tesseract/news.3B/0/8200_006.3B.tif")) %>%
  stringi::stri_split_lines() %>%
  unlist()
##  [1] "Tobacco chiefs still refuse to see the truth abou"
##  [2] "even of America's least conscionable"
##  [3] "The tobacco industry would like to promote"
##  [4] "men sat together in Washington last"
##  [5] "under the conditions they are used.'"
##  [6] "week to do what they do best: blow"
##  [7] "the specter of prohibition."
##  [8] "panel\" of toxicologists as \"not hazardous"
##  [9] "smoke at the truth about cigarettes."
## [10] "'If cigarettes are too dangerous to be sold,"
## [11] "then ban them. Some smokers will obey the"
## [12] "People not paid by the tobacco companies"
## [13] "aren't so sure. The list includes several"
## [14] "The CEOs of the nation's largest tobacco"
## [15] "firms told congressional panel that nicotine"
## [16] "law, but many will not. People will be selling"
## [17] "iS not addictive, that they are unconvinced"
## [18] "cigarettes out of the trunks of cars, cigarettes"
## [19] "substances the government does not allow in"
## [20] "foods or classifies as potentially toxic. They"
## [21] "that smoking causes lung cancer or any other"
## [22] "made by who knows who, made of who knows include ammonia, a pesticide called"
## [23] "illness, and that smoking is no more harmful"
## [24] "what,\" said James Johnston of R.J. Reynolds."
## [25] "than drinking coffee or eating Twinkies."
## [26] "It's a ruse. He knows cigarettes are not"
## [27] "methoprene, and ethyl furoate, which has"
## [28] "They said these things with straight taces."
## [29] "going to be banned, at leasi not in his lifetime."
## [30] "caused liver damage in rats."
## [31] "The list \"begs a number of important"
## [32] "They said them in the face of massive"
## [33] "STEVE WILSON"
## [34] "What he really fears are new taxes, stronger"
## [35] "questions about the safety of these additives,\""
## [36] "scientific evidence that smoking is responsible"
## [37] "anti-smoking campaigns, further smoking"
## [38] "said a joint statement from the American"
## [39] "for more than 400,000 deaths every year."
## [40] "restrictions, limits on secondhand smoke and"
## [41] "Rep. Henry Waxman, D-Calif., put that"
## [42] "Republic Columnist"
## [43] "Lung, Cancer and Heart associations. The"
## [44] "limits on tar and nicotine."
## [45] "statement added that substances safe to eat"
## [46] "frightful statistic another way:"
## [47] "Collectively, these steps can accelerate the"
## [48] "\"Imagine our nation's outrage if two fully"
## [49] "He and the others played dumb for the"
## [50] "current 5 percent annual decline in cigarette"
## [51] "aren't necessarily safe to inhale."
## [52] "The 50-page list can be obtained free by"
## [53] "loaded jumbo jets crashed each day, killing all"
## [54] "entire six hours, but really didn't matter."
## [55] "use and turn the tobacco business from highly"
## [56] "calling 1-800-852-8749."
## [57] "aboard. That's the same number of Americans"
## [58] "The game i nearly over, and the tobacco"
## [59] "profitable to depressed."
## [60] "Johnson's comment about cigarettes \"made"
## [61] "Here are just the 44 ingredients that start"
## [62] "that cigarettes kill every 24 hours.'"
## [63] "executives know it."
## [64] "with the letter \"A\":"
## [65] "The CEOs were not impressed."
## [66] "The hearing marked a turning point in the"
## [67] "of who knows what\" was comical."
## [68] "Acetanisole, acetic acid, acetoin,"
## [69] "\"We have looked at the data."
## [70] "It does"
## [71] "nation's growing aversion to cigarettes. No"
## [72] "The day before the hearing, the tobacco"
## [73] "acetophenone,6-acetoxydihydrotheaspirane,"
## [74] "not convince me that smoking causes death,\""
## [75] "2-acetyl-3-ethylpyrazine, 2-acetyl-5-"
## [76] "said Andrew Tisch of the Lorillard Tobacco"
## [77] "longer hamstrung by tobacco-state seniority"
## [78] "companies released a long-secret list of 599"
## [79] "Co."
## [80] "and the deep-pocketed tobacco lobby,"
## [81] "methylfuran, acetylpyrazine, 2-acetylpyridine,"
## [82] "Congress is taking aim at cigarette makers."
## [83] "additives used in cigarettes. The companies"
## [84] "said all are certified by an \"independent"
## [85] "3-acetylpyridine, 2-acetylthiazole, aconitic"
(You can compare that on your own with the Tesseract results.)
FIN
{cdcfluview} checks are done, and the fixed functions are back on CRAN! Just in time to close out this post.
If you’re on macOS, definitely check out the various ML/AI frameworks Apple has to offer via Swift and have some fun playing with integrating them into R (or build some small, command line utilities if you want to keep Swift and R apart).
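If you go the command line route, the R side can stay very thin. The following is only a sketch under stated assumptions: detect_text.swift is a hypothetical standalone script (not something shipped with {swiftr}) that takes an image path as its argument and prints one recognized string per line, and the swift executable is assumed to be on the PATH.

```r
# Hypothetical wrapper: shells out to a Swift script and returns its output lines.
# `cmd` and `script` are assumptions for illustration; adjust to your own utility.
ocr_via_cli <- function(image_path, cmd = "swift", script = "detect_text.swift") {
  # Quote the expanded path so spaces in file names survive the shell.
  args <- c(script, shQuote(path.expand(image_path)))
  # stdout = TRUE captures standard output as a character vector, one element per line.
  system2(cmd, args = args, stdout = TRUE)
}

# Usage, assuming the hypothetical script exists in the working directory:
# ocr_via_cli("~/Data/signs.jpeg")
```

Because system2() already returns one element per output line, no extra splitting step is needed with this approach.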