More powerful iconv in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The R function iconv converts between character string encodings, for example, from the locale dependent encoding to UTF-8:
> iconv("foo", to="UTF-8") [1] "foo"
However, R has long-running trouble with embedded null characters ('\0') in strings. Hence, if we try to convert to an encoding that permits embedded null characters, iconv will fail:
> iconv("foo", to="UTF-16") Error in iconv("foo", to = "UTF-16") : embedded nul in string: '\xff\xfef\0o\0o\0'
The ‘embedded nul’ error is thrown by mkCharLenCE, after the real conversion is complete. The converted string exists in memory, though not in a form that R can currently represent as a STRSXP. Hence the error when passed to mkCharLenCE.
The issue of embedded null characters has been discussed previously on the R mailing lists (see this thread), but I don’t think this is the issue here. The point here is that the C implementation of iconv operates on binary data, not necessarily null terminated C strings. Hence, in order to fully utilize the iconv mechanism, the R-level iconv ought to accept and return objects that can handle arbitrary binary data, i.e.of type RAWSXP, in addition to character vectors.
To this end, I’ve written a small patch (13 lines w/o documentation) against the current R-devel sources (r52328) that allows the R-level iconv to accept an argument of type RAWSXP, in addition to character vectors. Now, when a raw object is passed to iconv, no character substitution is performed, the arguments sub and mark are ignored, and a raw object is returned. However, rather than returning NA (NA does not exist for RAWSXPs) when conversions are invalid or incomplete, a partially converted object is returned. The following patch doesn’t touch any of the code associated with STRSXPs, nor affect the behavior of iconv when a character vector is passed.
Once compiled into R the new iconv will operate on raw vectors. Continuing with our example:
> bar <- iconv(charToRaw("foo"), to="UTF-16") > bar [1] ff fe 66 00 6f 00 6f 00 > rawToChar(iconv(bar, from="UTF-16")) [1] "foo"
The patch code is listed below, and also available here R-devel-iconv-0.0.patch. P.S. Thanks to Tal Galili for recommending the GeSHi plugin for wordpress, it worked out nicely for the R and patch (lang="diff") code in this post, though I prefer the more subtle coloring in the patch code.
Index: src/library/base/R/New-Internal.R =================================================================== --- src/library/base/R/New-Internal.R (revision 52328) +++ src/library/base/R/New-Internal.R (working copy) @@ -239,7 +239,7 @@ iconv <- function(x, from = "", to = "", sub = NA, mark = TRUE) { - if(!is.character(x)) x <- as.character(x) + if(!is.character(x) && !is.raw(x)) x <- as.character(x) .Internal(iconv(x, from, to, as.character(sub), mark)) } Index: src/main/sysutils.c =================================================================== --- src/main/sysutils.c (revision 52328) +++ src/main/sysutils.c (working copy) @@ -548,16 +548,17 @@ int mark; const char *from, *to; Rboolean isLatin1 = FALSE, isUTF8 = FALSE; + Rboolean isRawx = (TYPEOF(x) == RAWSXP); - if(TYPEOF(x) != STRSXP) - error(_("'x' must be a character vector")); + if(TYPEOF(x) != STRSXP && !isRawx) + error(_("'x' must be a character vector or raw")); if(!isString(CADR(args)) || length(CADR(args)) != 1) error(_("invalid '%s' argument"), "from"); if(!isString(CADDR(args)) || length(CADDR(args)) != 1) error(_("invalid '%s' argument"), "to"); if(!isString(CADDDR(args)) || length(CADDDR(args)) != 1) error(_("invalid '%s' argument"), "sub"); - if(STRING_ELT(CADDDR(args), 0) == NA_STRING) sub = NULL; + if(STRING_ELT(CADDDR(args), 0) == NA_STRING || isRawx) sub = NULL; else sub = translateChar(STRING_ELT(CADDDR(args), 0)); mark = asLogical(CAD4R(args)); if(mark == NA_LOGICAL) @@ -584,7 +585,7 @@ PROTECT(ans = duplicate(x)); R_AllocStringBuffer(0, &cbuff); /* 0 -> default */ for(i = 0; i < LENGTH(x); i++) { - si = STRING_ELT(x, i); + si = isRawx ? x : STRING_ELT(x, i); top_of_loop: inbuf = CHAR(si); inb = LENGTH(si); outbuf = cbuff.data; outb = cbuff.bufsize - 1; @@ -622,7 +623,7 @@ goto next_char; } - if(res != -1 && inb == 0) { + if(res != -1 && inb == 0 && !isRawx) { cetype_t ienc = CE_NATIVE; nout = cbuff.bufsize - 1 - outb; @@ -632,7 +633,13 @@ } SET_STRING_ELT(ans, i, mkCharLenCE(cbuff.data, nout, ienc)); } - else SET_STRING_ELT(ans, i, NA_STRING); + else if(!isRawx) SET_STRING_ELT(ans, i, NA_STRING); + else { + nout = cbuff.bufsize - 1 - outb; + UNPROTECT(1); + PROTECT(ans = allocVector(RAWSXP, nout)); + memcpy(RAW(ans), cbuff.data, nout); + } } Riconv_close(obj); R_FreeStringBuffer(&cbuff);
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.