hyphenatr (what may be my smallest package ever) has just hit CRAN. It, well, hyphenates words using libhyphen (a.k.a. libhnj). There are no external dependencies (i.e. no brew install, apt-get, et al. required) and it compiles on everything CRAN supports, including Windows.
I started coding this to see if it could be a “poor dude’s ‘syllabifier’” (NOTE: “dude” is gender agnostic and I am fully aware of the proper NLP terms to use, but it’s way more fun to make up words) to make it easier to turn my Zellingach project from earlier in the year into a generalized package. In short, TeX hyphenation rules (which are what libhyphen and, hence, hyphenatr use) don’t generalize to separating all syllables since, for instance, you really wouldn’t want to leave some trailing syllables hanging apart from their siblings (mostly for typographic reasons). Rather than let my investigation work be for naught, you get a package!
What’s in the box?
hyphenatr ships with support for 39 language hyphenation rules. Here’s proof:
library(hyphenatr)

list_dicts()
#>  [1] "af_ZA"  "bg_BG"  "ca"     "cs_CZ"  "da_DK"  "de"     "de_AT"  "de_CH"
#>  [9] "de_DE"  "el_GR"  "en_GB"  "en_US"  "es_ANY" "et_EE"  "fr"     "gl"
#> [17] "hr_HR"  "hu_HU"  "is"     "it_IT"  "lt"     "lt_LT"  "lv_LV"  "nb_NO"
#> [25] "nl_NL"  "nn_NO"  "pl_PL"  "pt_BR"  "pt_PT"  "ro_RO"  "ru_RU"  "sh"
#> [33] "sk_SK"  "sl_SI"  "sr"     "sv"     "te_IN"  "uk_UA"  "zu_ZA"
Where underscores are present, the name is languagecode_COUNTRYCODE; otherwise it’s just the language code. You can switch which dictionary is in use with switch_dict(). en_US is the default because I’m a lazy, narcissistic American. You can read about those files here, and I followed Dirk Eddelbuettel’s model in AsioHeaders, keeping all individual copyrights & author credits intact (open source attribution is not as easy as you might think).
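Switching back and forth is just a pair of calls; a quick sketch of the pattern (dictionary codes come from list_dicts() above, output omitted):

switch_dict("en_GB")   # use the British English patterns
hyphenate("labour")
switch_dict("en_US")   # back to the default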
By default, hyphenatr will stick a = where hyphens can go (this is the libhyphen default). You can change that to anything else (examples below) or you can ask hyphenatr to just return a split vector (i.e. components of the word split at hyphenation points).
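In call form, those three modes all hang off the simplify argument; a minimal sketch (outputs omitted here since full runs follow below):

hyphenate("hyphenation")                    # default: "=" at each break point
hyphenate("hyphenation", simplify = "-")    # any other separator string
hyphenate("hyphenation", simplify = FALSE)  # character vectors split at the break points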
How does it work?
You call hyphenate() on a vector of words. On my demure 13″ MacBook Pro it takes ~24ms to process 10,000 words. Don’t believe me? Give this a go on your own system:
library(hyphenatr)
library(microbenchmark)

dat <- readLines(system.file("extdata/top10000en.txt", package="hyphenatr"))

microbenchmark(out1 <- hyphenate(dat))
#> Unit: milliseconds
#>                    expr      min       lq     mean   median       uq      max neval
#>  out1 <- hyphenate(dat) 20.77134 22.16768 23.70809 23.65906 24.73395 30.21601   100
I extracted some of the results of that to give an idea of what you get back:
out1[500:550]
#>  [1] "got" "fam=ily" "pol=icy" "in=vestors" "record" "loss"
#>  [7] "re=ceived" "April" "Ex=change" "code" "graph=ics" "agency"
#> [13] "in=creased" "man=ager" "keep" "look" "of=ten" "de=signed"
#> [19] "Euro=pean" "earn=ings" "en=vi=ron=ment" "July" "job" "third"
#> [25] "wa=ter" "net" "banks" "an=a=lysts" "strong" "party"
#> [31] "econ=omy" "away" "dol=lar" "taken" "de=vel=oped" "con=tinue"
#> [37] "al=low" "Mi=crosoft" "key" "ei=ther" "se=cu=rity" "project"
#> [43] "agreed" "though" "Ja=pan" "rather" "coun=tries" "plant"
#> [49] "along" "Ap=ple" "ac=tion"
It’s a tad slower if you want separated vectors back (~30ms) but I think you’ll find that mode more useful if you do plan on using the package:
microbenchmark(out2 <- hyphenate(dat, simplify=FALSE))
#> Unit: milliseconds
#>                                      expr      min       lq     mean   median       uq      max neval
#>  out2 <- hyphenate(dat, simplify = FALSE) 26.32844 28.27894 29.26569 29.13235 29.80986 33.21204   100

jsonlite::toJSON(out2[530:540], pretty=TRUE)
#> [
#>   ["econ", "omy"],
#>   ["away"],
#>   ["dol", "lar"],
#>   ["taken"],
#>   ["de", "vel", "oped"],
#>   ["con", "tinue"],
#>   ["al", "low"],
#>   ["Mi", "crosoft"],
#>   ["key"],
#>   ["ei", "ther"],
#>   ["se", "cu", "rity"]
#> ]
As I stated earlier, you can use whatever separator you want, but you’ll pay the price as that’ll take an excruciating ~31ms for this word list:
microbenchmark(out3 <- hyphenate(dat, simplify="-"))
#> Unit: milliseconds
#>                                    expr      min       lq     mean  median       uq     max neval
#>  out3 <- hyphenate(dat, simplify = "-") 26.22136 28.04543 29.82251 30.0245 31.20909 36.4886   100

out3[500:550]
#>  [1] "got" "fam-ily" "pol-icy" "in-vestors" "record" "loss"
#>  [7] "re-ceived" "April" "Ex-change" "code" "graph-ics" "agency"
#> [13] "in-creased" "man-ager" "keep" "look" "of-ten" "de-signed"
#> [19] "Euro-pean" "earn-ings" "en-vi-ron-ment" "July" "job" "third"
#> [25] "wa-ter" "net" "banks" "an-a-lysts" "strong" "party"
#> [31] "econ-omy" "away" "dol-lar" "taken" "de-vel-oped" "con-tinue"
#> [37] "al-low" "Mi-crosoft" "key" "ei-ther" "se-cu-rity" "project"
#> [43] "agreed" "though" "Ja-pan" "rather" "coun-tries" "plant"
#> [49] "along" "Ap-ple" "ac-tion"
If you’re processing text for use in HTML, you could use this package to add “soft hyphens” (the &shy; entity, a.k.a. U+00AD) to the words, but now we’re dangerously close to a nigh intolerable ~40ms for 10,000 words:
microbenchmark(out4 <- hyphenate(dat, simplify="­"))
#> Unit: milliseconds
#>                                    expr      min       lq    mean   median       uq      max neval
#>  out4 <- hyphenate(dat, simplify = "­") 28.57537 29.78537 31.6346 31.31182 33.16067 37.89471   100

out4[500:550]
#>  [1] "got" "fam­ily" "pol­icy" "in­vestors"
#>  [5] "record" "loss" "re­ceived" "April"
#>  [9] "Ex­change" "code" "graph­ics" "agency"
#> [13] "in­creased" "man­ager" "keep" "look"
#> [17] "of­ten" "de­signed" "Euro­pean" "earn­ings"
#> [21] "en­vi­ron­ment" "July" "job" "third"
#> [25] "wa­ter" "net" "banks" "an­a­lysts"
#> [29] "strong" "party" "econ­omy" "away"
#> [33] "dol­lar" "taken" "de­vel­oped" "con­tinue"
#> [37] "al­low" "Mi­crosoft" "key" "ei­ther"
#> [41] "se­cu­rity" "project" "agreed" "though"
#> [45] "Ja­pan" "rather" "coun­tries" "plant"
#> [49] "along"
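From there, getting the soft-hyphenated words into a page is plain string assembly; a minimal sketch (the paste() call is just base R, nothing hyphenatr-specific):

# reassemble the soft-hyphenated words into a run of HTML-ready text;
# browsers only display a hyphen where a line actually breaks
html_text <- paste(out4[500:550], collapse = " ")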
As stated, it works with other languages:
switch_dict("de_DE")

hyphenate("tägelîch")
#> [1] "tä=gelîch"
(I picked that word at random from the internet. If it’s a “bad” word, I equally randomly place the blame on @sooshie).
Moving right along
If you hit any snags, drop an issue on GitHub. If you have any hyphenation language rules (in the supported “LibreOffice” format), please submit a PR (both including the file and updating inst/COPYRIGHTS).
I cannot conclude w/o giving special thanks to Edwin de Jonge & Gergely Daróczi for language testing.
Well, I really can’t conclude without impersonating a Dalek:
cat(toupper(hyphenate("Exterminate!", simplify=" - ")))
#> EX - TER - MI - NATE!