More powerful iconv in R
The R function iconv converts character strings between encodings, for example, from the locale-dependent encoding to UTF-8:
> iconv("foo", to="UTF-8")
[1] "foo"
However, R has long-standing trouble with embedded null characters ('\0') in strings. Hence, if we try to convert to an encoding whose byte representation contains embedded null characters, iconv will fail:
> iconv("foo", to="UTF-16")
Error in iconv("foo", to = "UTF-16") :
embedded nul in string: '\xff\xfef\0o\0o\0'
The ‘embedded nul’ error is thrown by mkCharLenCE, after the real conversion is complete. The converted string exists in memory, though not in a form that R can currently represent as an element of a character vector (STRSXP), so mkCharLenCE raises the error when the converted bytes are passed to it.
The issue of embedded null characters has been discussed previously on the R mailing lists (see this thread), but I don’t think this is the issue here. The point here is that the C implementation of iconv operates on binary data, not necessarily null-terminated C strings. Hence, in order to fully utilize the iconv mechanism, the R-level iconv ought to accept and return objects that can handle arbitrary binary data, i.e. of type RAWSXP, in addition to character vectors.
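To make the distinction concrete, here is a minimal sketch in plain R showing that a raw vector holds the UTF-16 bytes without complaint, while turning the same bytes into a character string fails. The byte values are the little-endian UTF-16 encoding of "foo" written out by hand (without the byte order mark), and the variable names are only for illustration.

```r
# UTF-16LE bytes for "foo", written by hand: each ASCII byte is followed by 0x00.
utf16le <- as.raw(c(0x66, 0x00, 0x6f, 0x00, 0x6f, 0x00))

# A raw vector stores the null bytes like any other byte.
n_nul <- sum(utf16le == as.raw(0))

# Converting the same bytes to a character string triggers the
# 'embedded nul' error; capture the message instead of stopping.
msg <- tryCatch(rawToChar(utf16le),
                error = function(e) conditionMessage(e))

print(n_nul)
print(msg)
```

The raw vector is simply a container of bytes, which is why it is a natural input and output type for a conversion routine that works on binary data.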
To this end, I’ve written a small patch (13 lines, excluding documentation) against the current R-devel sources (r52328) that allows the R-level iconv to accept an argument of type RAWSXP, in addition to character vectors. With the patch, when a raw object is passed to iconv, no character substitution is performed, the arguments sub and mark are ignored, and a raw object is returned. Because NA does not exist for RAWSXPs, an invalid or incomplete conversion returns a partially converted object rather than NA. The patch doesn’t touch any of the code associated with STRSXPs, nor does it affect the behavior of iconv when a character vector is passed.
Once compiled into R, the new iconv will operate on raw vectors. Continuing with our example:
> bar <- iconv(charToRaw("foo"), to="UTF-16")
> bar
[1] ff fe 66 00 6f 00 6f 00
> rawToChar(iconv(bar, from="UTF-16"))
[1] "foo"
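For readers who cannot rebuild R from source, a similar raw-in, raw-out round trip can be sketched on an unpatched build. This is only a sketch under assumptions: it requires a version of R whose iconv provides the toRaw argument and accepts a list of raw vectors as x, and iconv_raw is a hypothetical helper name that is not part of the patch. It also differs from the patch in one respect: a failed conversion yields NULL here rather than a partially converted raw vector. UTF-16LE is used instead of UTF-16 so that the output does not depend on whether the platform's iconv emits a byte order mark.

```r
# Hypothetical helper (not part of the patch): convert a raw vector between
# encodings and return a raw vector, using stock iconv with toRaw = TRUE.
iconv_raw <- function(x, from = "", to = "") {
  stopifnot(is.raw(x))
  # iconv accepts a list of raw vectors and, with toRaw = TRUE,
  # returns a list of raw vectors (NULL for a failed conversion).
  iconv(list(x), from = from, to = to, toRaw = TRUE)[[1]]
}

bar <- iconv_raw(charToRaw("foo"), from = "UTF-8", to = "UTF-16LE")
back <- rawToChar(iconv_raw(bar, from = "UTF-16LE", to = "UTF-8"))

print(bar)
print(back)
```

Wrapping the raw vector in a list keeps the whole byte sequence together as a single element, which mirrors how the patch treats a raw argument as one unit to convert.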
The patch code is listed below, and is also available here: R-devel-iconv-0.0.patch. P.S. Thanks to Tal Galili for recommending the GeSHi plugin for WordPress. It worked out nicely for the R and patch (lang="diff") code in this post, though I prefer the more subtle coloring in the patch code.
Index: src/library/base/R/New-Internal.R
===================================================================
--- src/library/base/R/New-Internal.R (revision 52328)
+++ src/library/base/R/New-Internal.R (working copy)
@@ -239,7 +239,7 @@
iconv <- function(x, from = "", to = "", sub = NA, mark = TRUE)
{
- if(!is.character(x)) x <- as.character(x)
+ if(!is.character(x) && !is.raw(x)) x <- as.character(x)
.Internal(iconv(x, from, to, as.character(sub), mark))
}
Index: src/main/sysutils.c
===================================================================
--- src/main/sysutils.c (revision 52328)
+++ src/main/sysutils.c (working copy)
@@ -548,16 +548,17 @@
int mark;
const char *from, *to;
Rboolean isLatin1 = FALSE, isUTF8 = FALSE;
+ Rboolean isRawx = (TYPEOF(x) == RAWSXP);
- if(TYPEOF(x) != STRSXP)
- error(_("'x' must be a character vector"));
+ if(TYPEOF(x) != STRSXP && !isRawx)
+ error(_("'x' must be a character vector or raw"));
if(!isString(CADR(args)) || length(CADR(args)) != 1)
error(_("invalid '%s' argument"), "from");
if(!isString(CADDR(args)) || length(CADDR(args)) != 1)
error(_("invalid '%s' argument"), "to");
if(!isString(CADDDR(args)) || length(CADDDR(args)) != 1)
error(_("invalid '%s' argument"), "sub");
- if(STRING_ELT(CADDDR(args), 0) == NA_STRING) sub = NULL;
+ if(STRING_ELT(CADDDR(args), 0) == NA_STRING || isRawx) sub = NULL;
else sub = translateChar(STRING_ELT(CADDDR(args), 0));
mark = asLogical(CAD4R(args));
if(mark == NA_LOGICAL)
@@ -584,7 +585,7 @@
PROTECT(ans = duplicate(x));
R_AllocStringBuffer(0, &cbuff); /* 0 -> default */
for(i = 0; i < LENGTH(x); i++) {
- si = STRING_ELT(x, i);
+ si = isRawx ? x : STRING_ELT(x, i);
top_of_loop:
inbuf = CHAR(si); inb = LENGTH(si);
outbuf = cbuff.data; outb = cbuff.bufsize - 1;
@@ -622,7 +623,7 @@
goto next_char;
}
- if(res != -1 && inb == 0) {
+ if(res != -1 && inb == 0 && !isRawx) {
cetype_t ienc = CE_NATIVE;
nout = cbuff.bufsize - 1 - outb;
@@ -632,7 +633,13 @@
}
SET_STRING_ELT(ans, i, mkCharLenCE(cbuff.data, nout, ienc));
}
- else SET_STRING_ELT(ans, i, NA_STRING);
+ else if(!isRawx) SET_STRING_ELT(ans, i, NA_STRING);
+ else {
+ nout = cbuff.bufsize - 1 - outb;
+ UNPROTECT(1);
+ PROTECT(ans = allocVector(RAWSXP, nout));
+ memcpy(RAW(ans), cbuff.data, nout);
+ }
}
Riconv_close(obj);
R_FreeStringBuffer(&cbuff);