Escaping from character encoding hell in R on Windows
Note: the title of this post was inspired by this question on stackoverflow.
This section gives the basic facts and recommendations for importing files with arbitrary encoding on Windows. The issues described here by and large do not apply on Mac or Linux; they are specific to running R on Windows.
If you are on a deadline and just need to get the job done, this section should be all you need. Additional background and discussion are presented in later sections.
To read a text file with non-ASCII encoding into R you should a) determine the encoding, b) read it in such a way that the information is re-encoded into UTF-8, and c) ignore the bug in the data.frame print method on Windows. Hopefully the encoding is specified in the documentation that accompanied your data. If not, you can guess the encoding using the stri_read_raw and stri_enc_detect functions in the stringi package. You can ensure that the information is re-encoded to UTF-8 by using the readr package.
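The guessing step can be sketched as follows, assuming the stringi package is installed. The sample text and temporary file are made up for illustration. Detection is statistical, and with a file this short the top guess may well be wrong, so treat the result as a ranked list of candidates rather than an answer.

```r
library(stringi)

# Hypothetical sample: a tiny Japanese CSV written to disk as Shift-JIS bytes.
# The \u escapes keep this script itself pure ASCII.
txt <- "No.,\u767a\u884c\u65e5\n1,2015\u5e7409\u670825\u65e5\n"
path <- tempfile(fileext = ".csv")
writeBin(stri_encode(txt, from = "UTF-8", to = "Shift_JIS", to_raw = TRUE)[[1]], path)

# Read the file as raw bytes (no decoding is attempted at this point).
bytes <- stri_read_raw(path)

# Ask for candidate encodings; the first list element describes our input.
guesses <- stri_enc_detect(bytes)[[1]]
print(guesses)
```

Whatever name comes out on top can then be tried as the encoding value when reading the file; if the result still looks wrong, try the next candidate.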
For example, I have two versions of a file containing numbers and Japanese characters: japanese_utf8.csv is encoded in UTF-8, and japanese_shiftjis.csv is encoded in SHIFT-JIS. We can read these files as follows on any platform (Windows, Linux, Mac):
    library(readr)
    options(stringsAsFactors = FALSE)
    read_csv("japanese_utf8.csv", locale = locale(encoding = "UTF-8"))
    read_csv("japanese_shiftjis.csv", locale = locale(encoding = "SHIFT-JIS"))

        No. 発行日 朝夕刊 面名 ページ
    1 00001 2015年09月25日 週刊 週刊朝日 022
    2 00002 2015年09月25日 週刊 週刊朝日 018
    3 00003 2015年09月21日 朝刊 3総合 003
        No. 発行日 朝夕刊 面名 ページ
    1 00001 2015年09月25日 週刊 週刊朝日 022
    2 00002 2015年09月25日 週刊 週刊朝日 018
    3 00003 2015年09月21日 朝刊 3総合 003
On Windows there is a bug in print.data.frame that causes data.frames with UTF-8 encoded columns to be displayed incorrectly in non-UTF-8 locales. Running the above example on Windows produces this:
        No. <U+767A><U+884C><U+65E5> <U+671D><U+5915><U+520A> <U+9762><U+540D> <U+30DA><U+30FC><U+30B8>
    1 00001 2015<U+5E74>09<U+6708>25<U+65E5> <U+9031><U+520A> <U+9031><U+520A><U+671D><U+65E5> 022
    2 00002 2015<U+5E74>09<U+6708>25<U+65E5> <U+9031><U+520A> <U+9031><U+520A><U+671D><U+65E5> 018
    3 00003 2015<U+5E74>09<U+6708>21<U+65E5> <U+671D><U+520A> 3<U+7DCF><U+5408> 003
        No. <U+767A><U+884C><U+65E5> <U+671D><U+5915><U+520A> <U+9762><U+540D> <U+30DA><U+30FC><U+30B8>
    1 00001 2015<U+5E74>09<U+6708>25<U+65E5> <U+9031><U+520A> <U+9031><U+520A><U+671D><U+65E5> 022
    2 00002 2015<U+5E74>09<U+6708>25<U+65E5> <U+9031><U+520A> <U+9031><U+520A><U+671D><U+65E5> 018
    3 00003 2015<U+5E74>09<U+6708>21<U+65E5> <U+671D><U+520A> 3<U+7DCF><U+5408> 003
which looks terrible but does not actually indicate a problem. The information is encoded correctly, but due to a long-standing bug it is displayed incorrectly. You can check that the values are correct by (ab)using print.listof on the data.frame, e.g.,
    print.listof(read_csv("japanese_shiftjis.csv", locale = locale(encoding = "SHIFT-JIS")))

    No. :
    [1] "00001" "00002" "00003"

    発行日 :
    [1] "2015年09月25日" "2015年09月25日" "2015年09月21日"

    朝夕刊 :
    [1] "週刊" "週刊" "朝刊"

    面名 :
    [1] "週刊朝日" "週刊朝日" "3総合"

    ページ :
    [1] "022" "018" "003"
To recap:
- Regardless of platform (Windows, Mac, Linux), use the readr package to read data into R. This will re-encode the contents of the file to UTF-8 for you.
- Make sure you specify the encoding using the locale argument as shown in the example above.
- Ignore the ugly print.data.frame bug and use print.listof to check that your data was imported correctly.
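The recap can be bundled into one small function. This is only a sketch: read_csv_utf8 is a hypothetical helper name, not part of readr, the sample file is made up, and every column is read as text to keep the example short. It assumes R 3.3.0 or later for validUTF8.

```r
library(readr)

# Hypothetical helper: read a CSV in a known encoding and warn if any
# character column did not come back as valid UTF-8.
read_csv_utf8 <- function(path, encoding) {
  df <- read_csv(path,
                 locale = locale(encoding = encoding),
                 col_types = cols(.default = col_character()))
  ok <- vapply(df, function(col) all(validUTF8(col)), logical(1))
  if (!all(ok)) {
    warning("columns with invalid UTF-8: ",
            paste(names(ok)[!ok], collapse = ", "))
  }
  df
}

# Made-up sample file, written to disk as Shift-JIS bytes.
txt <- "No.,\u9762\u540d\n00001,\u9031\u520a\u671d\u65e5\n"
path <- tempfile(fileext = ".csv")
writeBin(iconv(txt, from = "UTF-8", to = "SHIFT-JIS", toRaw = TRUE)[[1]], path)

x <- read_csv_utf8(path, "SHIFT-JIS")
print.listof(x)  # sidesteps the data.frame print method
```

Passing the encoding explicitly, rather than guessing it inside the helper, keeps a wrong guess from silently producing garbage.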
Those wishing for more details about this issue can read on.
What is the problem?
The problem is that the basic R functions for reading and writing data from and to files do not work in any reasonable way on Windows. If you are struggling with this you are not alone! There are numerous questions on stackoverflow, blog posts (e.g., this one by Rolf Fredheim, and another by Huidong Tian), and anguished mailing list posts. Thinking of the person-hours wasted on this issue over the years almost brings a tear to my eye.
Let’s try it, using some simplified data from a project I worked on last year. For illustration I’ve created two files containing a mix of English letters, numbers, and Japanese characters. I saved one version with UTF-8 encoding, and another with SHIFT-JIS. On Linux we can read both files easily, provided only that we correctly specify the encoding if the file is not already encoded in UTF-8:
    read.csv("japanese_utf8.csv")

      No. 発行日 朝夕刊 面名 ページ
    1 1 2015年09月25日 週刊 週刊朝日 22
    2 2 2015年09月25日 週刊 週刊朝日 18
    3 3 2015年09月21日 朝刊 3総合 3

    read.csv("japanese_shiftjis.csv", fileEncoding = "SHIFT-JIS")

      No. 発行日 朝夕刊 面名 ページ
    1 1 2015年09月25日 週刊 週刊朝日 22
    2 2 2015年09月25日 週刊 週刊朝日 18
    3 3 2015年09月21日 朝刊 3総合 3
On Windows things are much more difficult. Using read.csv with the default options does not work because read.csv assumes that the encoding of the file matches the Windows locale settings:
    read.csv("japanese_utf8.csv")

      No. ç.ºè.Œæ.. æœ.å..å.Š é..å.. ペãƒ.ã..
    1 1 2015å¹´09月25æ—¥ 週刊 週刊æœæ—¥ 22
    2 2 2015å¹´09月25æ—¥ 週刊 週刊æœæ—¥ 18
    3 3 2015å¹´09月21æ—¥ æœåˆŠ 3ç·åˆ 3
Telling R that the file is encoded in UTF-8 is not a general solution, because read.csv will then try to re-encode from UTF-8 to the native encoding, which may or may not work depending on the contents of the file. On my system, trying to read a UTF-8 encoded file containing Japanese characters with the fileEncoding argument falls flat on its face:
    read.csv("japanese_utf8.csv", fileEncoding = "UTF-8")

    [1] No. X
    <0 rows> (or 0-length row.names)
    Warning messages:
    1: In read.table(file = file, header = header, sep = sep, quote = quote, :
      invalid input found on input connection 'japanese_utf8.csv'
    2: In read.table(file = file, header = header, sep = sep, quote = quote, :
      incomplete final line found by readTableHeader on 'japanese_utf8.csv'
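The failure above comes from the re-encoding step, which can be reproduced in isolation with iconv. A small sketch, using latin1 as a stand-in for code page 1252 (an assumption; the two differ in a handful of characters, none of which matter here). The exact behaviour on unconvertible input can vary with the platform's iconv implementation.

```r
# \u escapes give UTF-8 strings without any non-ASCII source text.
ascii_text    <- "2015-09-25"
japanese_text <- "\u767a\u884c\u65e5"

# Plain ASCII survives a trip into a Western European encoding.
iconv(ascii_text, from = "UTF-8", to = "latin1")

# Japanese has no representation there. On most platforms iconv()
# signals this by returning NA rather than a converted string.
converted <- iconv(japanese_text, from = "UTF-8", to = "latin1")
is.na(converted)

# Supplying a substitute makes the loss visible instead of returning NA.
iconv(japanese_text, from = "UTF-8", to = "latin1", sub = "?")
```

This is the sense in which fileEncoding "may or may not work depending on the contents of the file": the conversion succeeds only when every character exists in the native encoding.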
Finally, we might try the encoding argument rather than fileEncoding. This simply marks the strings with the specified encoding:
    read.csv("japanese_utf8.csv", encoding = "UTF-8")

      No. X.U.767A..U.884C..U.65E5. X.U.671D..U.5915..U.520A. X.U.9762..U.540D. X.U.30DA..U.30FC..U.30B8.
    1 1 2015<U+5E74>09<U+6708>25<U+65E5> <U+9031><U+520A> <U+9031><U+520A><U+671D><U+65E5> 22
    2 2 2015<U+5E74>09<U+6708>25<U+65E5> <U+9031><U+520A> <U+9031><U+520A><U+671D><U+65E5> 18
    3 3 2015<U+5E74>09<U+6708>21<U+65E5> <U+671D><U+520A> 3<U+7DCF><U+5408> 3
This kind of works, though you wouldn’t know it from the output. As mentioned above, there is a bug in the print.data.frame function that prevents UTF-8 encoded text from displaying correctly. We can use another print method to see that the column values have been read in correctly:
    print.listof(read.csv("japanese_utf8.csv", encoding = "UTF-8"))

    No. :
    [1] 1 2 3

    X.U.767A..U.884C..U.65E5. :
    [1] "2015年09月25日" "2015年09月25日" "2015年09月21日"

    X.U.671D..U.5915..U.520A. :
    [1] "週刊" "週刊" "朝刊"

    X.U.9762..U.540D. :
    [1] "週刊朝日" "週刊朝日" "3総合"

    X.U.30DA..U.30FC..U.30B8. :
    [1] 22 18 3
Unfortunately there are two problems with this: first, the names of the columns have not been correctly encoded, and second, this will only work if your input data is in UTF-8 in the first place. Trying to apply this strategy to our SHIFT-JIS encoded file will not work at all because we cannot mark strings with arbitrary encoding, only with UTF-8 [1]. Trying to mark the string as SHIFT-JIS will silently fail:
    print.listof(read.csv("japanese_shiftjis.csv", encoding = "SHIFT-JIS"))

    No. :
    [1] 1 2 3

    X...s.ú :
    [1] "2015”N09ŒŽ25“ú" "2015”N09ŒŽ25“ú" "2015”N09ŒŽ21“ú"

    X....Š. :
    [1] "TŠ§" "TŠ§" "’©Š§"

    X.Ê.. :
    [1] "TŠ§’©“ú" "TŠ§’©“ú" "‚R‘‡"

    ƒy..ƒW :
    [1] 22 18 3
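The silent failure is easier to see on a single string. The sketch below uses only base R; it illustrates that an encoding mark is a label attached to the bytes, and that a label R does not recognise is quietly dropped rather than rejected.

```r
x <- "\u9031\u520a"          # \u escapes produce a string marked as UTF-8
Encoding(x)

y <- x
Encoding(y) <- "SHIFT-JIS"   # not a mark R knows about: no error, no warning
Encoding(y)                  # the mark has fallen back to "unknown"

# Changing the mark never changes the bytes; only iconv() converts them.
identical(charToRaw(x), charToRaw(y))
```

So the encoding argument can only ever label bytes that are already UTF-8 (or latin1); it cannot convert anything.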
Ouch! Why is this so hard? Can we make it suck less?
Encoding in R
Basically R gives you two ways of handling character encoding: you can use the default encoding of your OS, or you can use UTF-8 [1]. On OS X and Linux these options are often the same, since the default OS encoding is usually UTF-8; this is a great advantage because just about everything can be represented in UTF-8. On Windows there is no such luck. On my Windows 7 machine the default is “Windows code page 1252”; many characters (such as Japanese) cannot be represented in code page 1252. If I want to work with Japanese text in R on Windows I have two options: I can change my locale to Japanese, or I can convert strings to UTF-8 and mark them as such.
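To find out which situation a given session is in, base R can report its native encoding. A minimal sketch; the values depend entirely on the machine, so no output is shown. Note that recent R releases for Windows (4.2 and later) may report UTF-8 here, in which case much of the trouble described in this post should not arise.

```r
# Which encoding does this session treat as native?
info <- l10n_info()
info[["UTF-8"]]     # TRUE when the native encoding is UTF-8
info[["Latin-1"]]   # TRUE for a Latin-1 locale
info$codepage       # Windows only (NULL elsewhere): the active code page

# The locale string the native encoding is derived from.
Sys.getlocale("LC_CTYPE")
```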
In some ways just changing your locale to one that can accommodate the data you are working with is the simplest approach. Again, on Mac and Linux the locale usually specifies UTF-8 encoding, so no changes are needed; things should just work as you would expect them to. On Windows we can change the locale to match the data we are working with using the Sys.setlocale function. This sometimes works well; for example, we can read our UTF-8 and SHIFT-JIS encoded data on Windows as follows:
    Sys.setlocale("LC_ALL", "English_United States.932")
    read.csv("japanese_shiftjis.csv")
    read.csv("japanese_utf8.csv", fileEncoding = "UTF-8")

    [1] "LC_COLLATE=English_United States.932;LC_CTYPE=English_United States.932;LC_MONETARY=English_United States.932;LC_NUMERIC=C;LC_TIME=English_United States.932"
      No. 発行日 朝夕刊 面名 ページ
    1 1 2015年09月25日 週刊 週刊朝日 22
    2 2 2015年09月25日 週刊 週刊朝日 18
    3 3 2015年09月21日 朝刊 3総合 3
      No. 発行日 朝夕刊 面名 ページ
    1 1 2015年09月25日 週刊 週刊朝日 22
    2 2 2015年09月25日 週刊 週刊朝日 18
    3 3 2015年09月21日 朝刊 3総合 3
This works fine until we want to read some other kind of text in the same R session, and then we are right back to the same old problem. Another issue with this method is that it does not work in RStudio unless the locale is set on startup; you cannot change the locale of a running session in RStudio.
Because the Sys.setlocale method only works for a subset of languages in any given session, our best bet is to read and store everything in UTF-8 (and make sure it is marked as such). It is not convenient to do this using the read.table family of functions in R, but it is possible with some care:
    x <- read.csv("japanese_shiftjis.csv",
                  encoding = "UTF-8",
                  check.names = FALSE # otherwise R will mangle the names
                  )
    charcols <- !sapply(x, is.numeric)
    x[charcols] <- lapply(x[charcols], iconv, from = "SHIFT-JIS", to = "UTF-8")
    names(x) <- iconv(names(x), from = "SHIFT-JIS", to = "UTF-8")
    print.listof(x)

    No. :
    [1] 1 2 3

    発行日 :
    [1] "2015年09月25日" "2015年09月25日" "2015年09月21日"

    朝夕刊 :
    [1] "週刊" "週刊" "朝刊"

    面名 :
    [1] "週刊朝日" "週刊朝日" "3総合"

    ページ :
    [1] 22 18 3
OK, it works, but honestly that is too much work for something as simple as reading a .csv file into R. As suggested at the beginning of this post, a better strategy is to use the readr package because it will do the conversion to UTF-8 for you:
    # the locale argument belongs to read_csv, not to print.listof
    print.listof(read_csv("arabic_utf-8.csv", locale = locale(encoding = "UTF-8")))
    print.listof(read_csv("japanese_utf8.csv", locale = locale(encoding = "UTF-8")))
    print.listof(read_csv("japanese_shiftjis.csv", locale = locale(encoding = "SHIFT-JIS")))

    X5 :
    [1] "1895-01-02" "1895-01-07" "1895-01-16"

    X8 :
    [1] "اصلى" "اصلى" "اصلى"

    X12 :
    [1] "وقائع" "وقائع" "وقائع"

    No. :
    [1] "00001" "00002" "00003"

    発行日 :
    [1] "2015年09月25日" "2015年09月25日" "2015年09月21日"

    朝夕刊 :
    [1] "週刊" "週刊" "朝刊"

    面名 :
    [1] "週刊朝日" "週刊朝日" "3総合"

    ページ :
    [1] "022" "018" "003"

    No. :
    [1] "00001" "00002" "00003"

    発行日 :
    [1] "2015年09月25日" "2015年09月25日" "2015年09月21日"

    朝夕刊 :
    [1] "週刊" "週刊" "朝刊"

    面名 :
    [1] "週刊朝日" "週刊朝日" "3総合"

    ページ :
    [1] "022" "018" "003"
Files
Here are the example data files and code needed to run the examples in this post.
Footnotes:
[1] We can also mark strings as encoded in latin1, but that is not useful if you take my advice and store everything in UTF-8.