hyphenatr (what may be my smallest package ever) has just hit CRAN. It, well, hyphenates words using libhyphen (a.k.a. libhnj). There are no external dependencies (i.e. no brew install, apt-get, et al. required) and it compiles on everything CRAN supports, including Windows.
I started coding this to see if it could be a “poor dude’s ‘syllabifier’” (NOTE: “dude” is gender agnostic and I am fully aware of the proper NLP terms to use, but it’s way more fun to make up words) to make it easier to turn my Zellingach project from earlier in the year into a generalized package. In short, TeX hyphenation rules (which are what libhyphen and, hence, hyphenatr use) don’t generalize to separating all syllables since, for instance, you really wouldn’t want to leave some trailing syllables hanging apart from their siblings (mostly for typographic reasons). Rather than let my investigation work be for naught, you get a package!
What’s in the box?
hyphenatr ships with support for 39 language hyphenation rules. Here’s proof:
library(hyphenatr)

list_dicts()
#>  [1] "af_ZA"  "bg_BG"  "ca"     "cs_CZ"  "da_DK"  "de"     "de_AT"  "de_CH"
#>  [9] "de_DE"  "el_GR"  "en_GB"  "en_US"  "es_ANY" "et_EE"  "fr"     "gl"
#> [17] "hr_HR"  "hu_HU"  "is"     "it_IT"  "lt"     "lt_LT"  "lv_LV"  "nb_NO"
#> [25] "nl_NL"  "nn_NO"  "pl_PL"  "pt_BR"  "pt_PT"  "ro_RO"  "ru_RU"  "sh"
#> [33] "sk_SK"  "sl_SI"  "sr"     "sv"     "te_IN"  "uk_UA"  "zu_ZA"
Where underscores are present, the name is languagecode_COUNTRYCODE; otherwise it’s just the language code. You can switch which dictionary is in use with switch_dict(). en_US is the default because I’m a lazy, narcissistic American. You can read about those files here, and I followed Dirk Eddelbuettel’s model in AsioHeaders, keeping all individual copyrights & author credits intact (open source attribution is not as easy as you might think).
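Switching back and forth is just a pair of calls; a quick sketch of the pattern (dictionary codes come from list_dicts() above, output omitted):

switch_dict("en_GB")   # use the British English patterns
hyphenate("labour")
switch_dict("en_US")   # back to the default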
By default, hyphenatr will stick a = where hyphens can go (this is the libhyphen default). You can change that to anything else (examples below) or you can ask hyphenatr to just return a split vector (i.e. components of the word split at hyphenation points).
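In call form, those three modes all hang off the simplify argument; a minimal sketch (outputs omitted here since full runs follow below):

hyphenate("hyphenation")                    # default: "=" at each break point
hyphenate("hyphenation", simplify = "-")    # any other separator string
hyphenate("hyphenation", simplify = FALSE)  # character vectors split at the break points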
How does it work?
You call hyphenate() on a vector of words. On my demure 13″ MacBook Pro it takes ~24ms to process 10,000 words. Don’t believe me? Give this a go on your own system:
library(hyphenatr)
library(microbenchmark)

dat <- readLines(system.file("extdata/top10000en.txt", package="hyphenatr"))

microbenchmark(out1 <- hyphenate(dat))
#> Unit: milliseconds
#>                    expr      min       lq     mean   median       uq      max neval
#>  out1 <- hyphenate(dat) 20.77134 22.16768 23.70809 23.65906 24.73395 30.21601   100
I extracted some of the results of that to give an idea of what you get back:
out1[500:550]
#>  [1] "got" "fam=ily" "pol=icy" "in=vestors" "record" "loss"
#>  [7] "re=ceived" "April" "Ex=change" "code" "graph=ics" "agency"
#> [13] "in=creased" "man=ager" "keep" "look" "of=ten" "de=signed"
#> [19] "Euro=pean" "earn=ings" "en=vi=ron=ment" "July" "job" "third"
#> [25] "wa=ter" "net" "banks" "an=a=lysts" "strong" "party"
#> [31] "econ=omy" "away" "dol=lar" "taken" "de=vel=oped" "con=tinue"
#> [37] "al=low" "Mi=crosoft" "key" "ei=ther" "se=cu=rity" "project"
#> [43] "agreed" "though" "Ja=pan" "rather" "coun=tries" "plant"
#> [49] "along" "Ap=ple" "ac=tion"
It’s a tad slower if you want separated vectors back (~30ms) but I think you’ll find that mode more useful if you do plan on using the package:
microbenchmark(out2 <- hyphenate(dat, simplify=FALSE))
#> Unit: milliseconds
#>                                      expr      min       lq     mean   median       uq      max neval
#>  out2 <- hyphenate(dat, simplify = FALSE) 26.32844 28.27894 29.26569 29.13235 29.80986 33.21204   100

jsonlite::toJSON(out2[530:540], pretty=TRUE)
#> [
#>   ["econ", "omy"],
#>   ["away"],
#>   ["dol", "lar"],
#>   ["taken"],
#>   ["de", "vel", "oped"],
#>   ["con", "tinue"],
#>   ["al", "low"],
#>   ["Mi", "crosoft"],
#>   ["key"],
#>   ["ei", "ther"],
#>   ["se", "cu", "rity"]
#> ]
As I stated earlier, you can use whatever separator you want, but you’ll pay the price as that’ll take an excruciating ~31ms for this word list:
microbenchmark(out3 <- hyphenate(dat, simplify="-"))
#> Unit: milliseconds
#>                                    expr      min       lq     mean  median       uq     max neval
#>  out3 <- hyphenate(dat, simplify = "-") 26.22136 28.04543 29.82251 30.0245 31.20909 36.4886   100

out3[500:550]
#>  [1] "got" "fam-ily" "pol-icy" "in-vestors" "record" "loss"
#>  [7] "re-ceived" "April" "Ex-change" "code" "graph-ics" "agency"
#> [13] "in-creased" "man-ager" "keep" "look" "of-ten" "de-signed"
#> [19] "Euro-pean" "earn-ings" "en-vi-ron-ment" "July" "job" "third"
#> [25] "wa-ter" "net" "banks" "an-a-lysts" "strong" "party"
#> [31] "econ-omy" "away" "dol-lar" "taken" "de-vel-oped" "con-tinue"
#> [37] "al-low" "Mi-crosoft" "key" "ei-ther" "se-cu-rity" "project"
#> [43] "agreed" "though" "Ja-pan" "rather" "coun-tries" "plant"
#> [49] "along" "Ap-ple" "ac-tion"
If you’re processing text for use in HTML, you could use this package to add “soft hyphens” (the &shy; entity, a.k.a. U+00AD) to the words, but now we’re dangerously close to a nigh intolerable ~40ms for 10,000 words:
microbenchmark(out4 <- hyphenate(dat, simplify="­"))
#> Unit: milliseconds
#>                                    expr      min       lq    mean   median       uq      max neval
#>  out4 <- hyphenate(dat, simplify = "­") 28.57537 29.78537 31.6346 31.31182 33.16067 37.89471   100

out4[500:550]
#>  [1] "got" "fam­ily" "pol­icy" "in­vestors"
#>  [5] "record" "loss" "re­ceived" "April"
#>  [9] "Ex­change" "code" "graph­ics" "agency"
#> [13] "in­creased" "man­ager" "keep" "look"
#> [17] "of­ten" "de­signed" "Euro­pean" "earn­ings"
#> [21] "en­vi­ron­ment" "July" "job" "third"
#> [25] "wa­ter" "net" "banks" "an­a­lysts"
#> [29] "strong" "party" "econ­omy" "away"
#> [33] "dol­lar" "taken" "de­vel­oped" "con­tinue"
#> [37] "al­low" "Mi­crosoft" "key" "ei­ther"
#> [41] "se­cu­rity" "project" "agreed" "though"
#> [45] "Ja­pan" "rather" "coun­tries" "plant"
#> [49] "along"
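From there, getting the soft-hyphenated words into a page is plain string assembly; a minimal sketch (the paste() call is just base R, nothing hyphenatr-specific):

# reassemble the soft-hyphenated words into a run of HTML-ready text;
# browsers only display a hyphen where a line actually breaks
html_text <- paste(out4[500:550], collapse = " ")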
As stated, it works with other languages:
switch_dict("de_DE")

hyphenate("tägelîch")
#> [1] "tä=gelîch"
(I picked that word at random from the internet. If it’s a “bad” word, I equally randomly place the blame on @sooshie).
Moving right along
If you hit any snags, drop an issue on GitHub. If you have any hyphenation language rules (in the supported “LibreOffice” format), please submit a PR (both including the file and updating inst/COPYRIGHTS).
I cannot conclude w/o giving special thanks to Edwin de Jonge & Gergely Daróczi for language testing.
Well, I really can’t conclude without impersonating a Dalek:
cat(toupper(hyphenate("Exterminate!", simplify=" - ")))
#> EX - TER - MI - NATE!