In my recent post I promised to present the most interesting features of the stringi package in more detail.
Here's one such jolly feature. Many LaTeX users may find it very useful.
Loading a text file with encoding auto-detection
Here's a LaTeX document consisting of a Polish poem. Probably, most of you wouldn't have been able to guess the file's character encoding if I hadn't left some hints. But it's OK, we have a little challenge.
Let's use some (currently experimental) stringi functions to guess the file's encoding.
First of all, we should read the file as a raw vector (after all, every text file is just a sequence of bytes).
library(stringi)
# experimental function (as per stringi_0.2-5):
download.file("http://www.rexamine.com/manual_upload/powrot_taty_latin2.tex",
   destfile = "powrot_taty_latin2.tex")
file <- stri_read_raw("powrot_taty_latin2.tex")
head(file, 15)
## [1] 25 25 20 45 4e 43 4f 44 49 4e 47 20 3d 20 49
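The bytes printed by head(file, 15) are plain ASCII, so they can be decoded by hand to see the hint mentioned earlier. The snippet below is only an illustration: it retypes those 15 bytes instead of reading the file, and header_bytes is a name made up for this example.

```r
# The 15 bytes shown in the output above, typed in by hand
header_bytes <- as.raw(c(0x25, 0x25, 0x20, 0x45, 0x4e, 0x43, 0x4f, 0x44,
                         0x49, 0x4e, 0x47, 0x20, 0x3d, 0x20, 0x49))

# All of them are below 0x80, so rawToChar() is safe here
rawToChar(header_bytes)
## [1] "%% ENCODING = I"
```

So the file starts with a LaTeX comment that names its encoding, which is the hint that the automatic detection below does not rely on.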
Let's try to detect the file's character encoding automatically.
stri_enc_detect(file)[[1]] # experimental function
## $Encoding
## [1] "ISO-8859-2" "ISO-8859-1" "ISO-8859-9"
##
## $Language
## [1] "pl" "pt" "tr"
##
## $Confidence
## [1] 0.46 0.19 0.07
Encoding detection is, at best, an imprecise operation using statistics and heuristics. ICU suggests that we are most probably dealing with Polish text in ISO-8859-2 (a.k.a. latin2) here. What a coincidence: it's true.
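Since the detector only returns ranked guesses, it may be worth checking the confidence before trusting the top candidate. Here is a minimal sketch of that idea. The helper guess_encoding(), its 0.3 threshold and the "UTF-8" fallback are assumptions made for this example, not part of stringi.

```r
library(stringi)

# Hypothetical helper (not part of stringi): return the top-ranked guess,
# or `fallback` when the detector's confidence is below `min_conf`.
guess_encoding <- function(bytes, min_conf = 0.3, fallback = "UTF-8") {
  best <- stri_enc_detect(bytes)[[1]]
  conf <- best$Confidence[1]
  if (length(conf) == 0L || is.na(conf) || conf < min_conf) {
    return(fallback)
  }
  best$Encoding[1]
}

# Polish sample text, written with \u escapes so that this script stays ASCII
sample_text <- "P\u00f3jd\u017acie, o dziatki, p\u00f3jd\u017acie wszystkie razem"
sample_raw <- stri_conv(sample_text, "UTF-8", "ISO-8859-2", to_raw = TRUE)[[1]]

guess_encoding(sample_raw)                  # may well be wrong for so short an input
guess_encoding(sample_raw, min_conf = 1.1)  # confidence never exceeds 1, so: fallback
```

The threshold is arbitrary. For the poem above, the top guess had a confidence of only 0.46 and was still correct, so any cut-off should be tuned to the kind of files you expect.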
Let's re-encode the file. Our target encoding will be UTF-8, as it is a "superset" of all 8-bit encodings in the sense that it can represent every character they contain. We really love portable code:
file <- stri_conv(file, stri_enc_detect(file)[[1]]$Encoding[1], "UTF-8")
file <- stri_split_lines1(file) # split a string into text lines
print(file[22:28]) # text sample
## [1] ",,Pójdźcie, o dziatki, pójdźcie wszystkie razem"
## [2] ""
## [3] "Za miasto, pod słup na wzgórek,"
## [4] ""
## [5] "Tam przed cudownym klęknijcie obrazem,"
## [6] ""
## [7] "Pobożnie zmówcie paciórek."
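The same read, convert and split steps can be tried without downloading anything. The sketch below builds its own small ISO-8859-2 file in a temporary location (the contents and all variable names are made up for illustration). It passes the source encoding explicitly, because auto-detection is unreliable on such a short input.

```r
library(stringi)

# Two lines of Polish text, using \u escapes so the script itself is ASCII
original <- c("Za miasto, pod s\u0142up na wzg\u00f3rek,",
              "Pobo\u017cnie zm\u00f3wcie paci\u00f3rek.")

# Encode to ISO-8859-2 bytes and write them to a temporary file
latin2_bytes <- stri_conv(paste(original, collapse = "\n"),
                          "UTF-8", "ISO-8859-2", to_raw = TRUE)[[1]]
tmp <- tempfile(fileext = ".tex")
writeBin(latin2_bytes, tmp)

# Read the raw bytes back, convert them to UTF-8, and split into lines
bytes <- stri_read_raw(tmp)
text  <- stri_conv(bytes, "ISO-8859-2", "UTF-8")
lines <- stri_split_lines1(text)

all(lines == original)  # the round trip should preserve the text
unlink(tmp)
```

In ISO-8859-2 every character takes exactly one byte, so length(bytes) equals the number of characters in the text, including the newline.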
Of course, if we knew a priori that the file is in ISO-8859-2, we'd just call:
file <- stri_conv(readLines("http://www.rexamine.com/manual_upload/powrot_taty_latin2.tex"), "ISO-8859-2", "UTF-8")
So far so good.
Word count
LaTeX word counting is quite a complicated task, and there are many possible approaches to it. Most often, they rely on running external tools (which may be a bit inconvenient for some users). Personally, I've always been most satisfied with the output produced by the Kile LaTeX IDE for the KDE desktop environment.
As not everyone has Kile installed, I decided to grab Kile's algorithm (the power of open source!) and make some not-too-invasive stringi-specific tweaks, and here we are:
stri_stats_latex(file)
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds
##          2283           335           576           461            32
##        Envirs
##             2
Some other aggregates are also available (they are meaningful for any text file, not just LaTeX):
stri_stats_general(file)
##       Lines LinesNEmpty       Chars CharsNWhite
##         232         122        3308        2930
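Both functions return named vectors, so individual statistics can be picked out by name. Here's a small sketch on a made-up three-line LaTeX fragment. The fragment and the variable names are assumptions for illustration, and the LaTeX counts are left unstated because they depend on the Kile-derived heuristic.

```r
library(stringi)

# A made-up LaTeX fragment, one element per line of the "file"
doc <- c("\\section{Intro}",
         "",
         "Lorem \\textbf{ipsum} dolor sit amet.")

latex_stats   <- stri_stats_latex(doc)
general_stats <- stri_stats_general(doc)

latex_stats[["Words"]]          # number of words, as judged by the heuristic
general_stats[["Lines"]]        # 3: one per element of `doc`
general_stats[["LinesNEmpty"]]  # 2: the empty string is not counted
```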
Finally, here's the word count for my R programming book (in Polish). Importantly, each chapter is stored in a separate .tex file (there are 30 files), so "clicking out" the answer in Kile would be a bit problematic:
apply(
   sapply(
      list.files(path="~/Publikacje/ProgramowanieR/rozdzialy/",
         pattern=glob2rx("*.tex"), recursive=TRUE, full.names=TRUE),
      function(x)
         stri_stats_latex(readLines(x))
   ), 1, sum)
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds        Envirs
##        718755        458403        281989        120202         37055          6119
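To try the same aggregation without a book at hand, here is a self-contained sketch. It writes two tiny, made-up chapter files to a temporary directory (the directory and file names are hypothetical) and uses rowSums() as an equivalent of apply(..., 1, sum).

```r
library(stringi)

# Create two tiny, made-up chapter files in a temporary directory
chapter_dir <- file.path(tempdir(), "chapters_demo")
dir.create(chapter_dir, showWarnings = FALSE)
writeLines(c("\\chapter{One}", "Some text here."),
           file.path(chapter_dir, "ch1.tex"))
writeLines(c("\\chapter{Two}", "A bit more text follows."),
           file.path(chapter_dir, "ch2.tex"))

tex_files <- list.files(chapter_dir, pattern = glob2rx("*.tex"),
                        recursive = TRUE, full.names = TRUE)

# One column per file, one row per statistic
per_file <- sapply(tex_files, function(x) stri_stats_latex(readLines(x)))

# Sum each statistic across all files
totals <- rowSums(per_file)
totals

unlink(chapter_dir, recursive = TRUE)
```

Keeping the intermediate per_file matrix around also makes it easy to see which chapter contributes most to the total.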
Notably, my publisher was satisfied with the above estimate. 🙂
Next time we'll take a look at ICU's very powerful transliteration services.
More information
For more information check out the stringi package website and its online documentation.
For bug reports and feature requests visit our GitHub profile.
Any comments and suggestions are warmly welcome.