
Pitfall of XML package: to know the cause


This is the sequel to the previous report “issues specific to cp932 locale, Japanese Shift-JIS, on Windows”.  In this report, I will dig into the issues more deeply to find out what exactly is happening.

1. Where it occurs

I knew the issues depend on the node texts, because the very same script, run on Windows, parsed another table in the same html source without trouble.

# Windows
library(XML)
src <- 'http://www.taiki.pref.ibaraki.jp/data.asp'

t2 <- iconv(as.character(
        readHTMLTable(src, which=4, trim=T, header=F, 
          skip.rows=2:48, encoding='shift-jis')[1,1]
      ), from='utf-8', to='shift-jis')
> t2 # bad
[1] NA

s2 <- iconv(as.character(
        readHTMLTable(src, which=6, trim=T, header=F, 
          skip.rows=1, encoding='shift-jis')[2,2]
      ), from='utf-8', to='shift-jis')
> s2 # good
[1] "北茨城中郷"

Knowing how the two html parts differ means knowing where the issue occurs.

Let’s look at the html source with primitive functions, instead of using the package XML.

con <- url(src, encoding='shift-jis')
x <- readLines(con)
close(con)

I know two useful keywords to locate the positions of t2 and s2 above: 2016 and 101, respectively.

# for t2
> grep('2016', x)
[1] 120 133 141 148 160 161

> x[119:121]
[1] "\t\t\t<td class=\"title\">"
[2] "\t\t\t\t最新の観測情報&nbsp;&nbsp;(2016年1月17日&nbsp;&nbsp;8時)"
[3] "\t\t\t</td>"

# for s2
> grep('101', x)
[1] 181

> x[181:182]
[1] "\t\t\t\t\t\t<td class=\"td1\">101</td>"
[2] "\t\t\t\t\t\t<td class=\"td2\">北茨城中郷</td>"

Note that only x[182] is for s2; x[181] was merely used to find the position.  Apparently, the differences between t2 and s2 are:

  1. t2 includes &nbsp;.
  2. t2 spreads over multiple lines.

Because I want to know the exact content of a single node (html element), the three lines of t2 must be joined together.

paste(x[119:121], collapse='\r\n')

Pasting the lines together with the newline code \r\n may be the answer, but a more exact procedure is better.

Binary functions are elegant tools that can handle an html source as the binary data served by the web server, regardless of client platforms and locales.

con <- url(src, open='rb')
skip <- readBin(con, what='raw', n=5009)
xt2 <- readBin(con, what='raw', n=92)
skip <- readBin(con, what='raw', n=3013)
xs2 <- readBin(con, what='raw', n=31)
close(con)

This time I cheated, taking the byte positions of interest from the prior result x.

# t2 begins after
> sum(nchar(x[1:118], type='bytes') + 2) + 
  nchar(sub('<.*$', '', x[119]), type='bytes')
[1] 5009
# t2 length
> sum(nchar(x[119:121], type='bytes') + 2) - 2 - 
  nchar(sub('<.*$', '', x[119]), type='bytes')
[1] 92
# s2 begins after
> sum(nchar(x[1:181], type='bytes') + 2) + 
  nchar(sub('<.*$', '', x[182]), type='bytes')
[1] 8114
# s2 length
> nchar(x[182], type='bytes') - 
  nchar(sub('<.*$', '', x[182]), type='bytes')
[1] 31
# s2 from the end of t2
> 8114 - 5009 - 92
[1] 3013

Variables xt2 and xs2 hold what I want as binary (raw) vectors.

# Windows
> rawToChar(xt2)
[1] "<td class=\"title\">\r\n\t\t\t\t最新の観測情報&nbsp;&nbsp;(2016年1月17日&nbsp;&nbsp;8時)\r\n\t\t\t</td>"
> rawToChar(xs2)
[1] "<td class=\"td2\">北茨城中郷</td>"

Compare inside texts of these nodes.

t2: \r\n\t\t\t\t最新の観測情報&nbsp;&nbsp;(2016年1月17日&nbsp;&nbsp;8時)\r\n\t\t\t
s2: 北茨城中郷

Maybe a text including control codes (\r, \n, \t) and/or html entities (&nbsp;) is unsafe, and a text made up only of printable Kanji characters is safe.  So, the issues must occur around these special characters.

Before digging further, I want to introduce two nice binary functions that can be used without cheating about byte positions.  Function getBinaryURL in package RCurl retrieves the whole binary data from a web page.  Function grepRaw locates the positions of a specific string in a binary vector.

library(RCurl)
x <- getBinaryURL(src)

> str(x)
 raw [1:35470] 0d 0a 3c 68 ...

> grepRaw('2016', x, all=FALSE)
[1] 5062

> x[5000:5100]
  [1] 09 3c 74 72 3e 0d 0a 09 09 09 3c 74 64 20 63 6c 61 73 73 3d 22 74 69 74
 [25] 6c 65 22 3e 0d 0a 09 09 09 09 8d c5 90 56 82 cc 8a cf 91 aa 8f ee 95 f1
 [49] 26 6e 62 73 70 3b 26 6e 62 73 70 3b 81 69 32 30 31 36 94 4e 31 8c 8e 31
 [73] 37 93 fa 26 6e 62 73 70 3b 26 6e 62 73 70 3b 38 8e 9e 81 6a 0d 0a 09 09
 [97] 09 3c 2f 74 64

> rawToChar(x[5000:5100])
[1] "t<tr>rnttt<td class="title">rntttt最新の観測情報&nbsp;&nbsp;(2016年1月17日&nbsp;&nbsp;8時)rnttt</td"

2. What is the cause

Let’s check out what happens when an html has an html entity (&nbsp;) or space characters (\r, \n, \t).  I’m going to use a minimal html to compare the responses of package XML on Mac and Windows.

library(XML)

> xmlValue(xmlRoot(htmlParse('<html>ABC</html>', asText=T)))
[1] "ABC"

2-1. No-Break Space (U+00A0, &nbsp;)

# Mac
> xmlValue(xmlRoot(htmlParse(
  '<html>&nbsp;</html>', asText=T)))
[1] " " # good

> iconv(xmlValue(xmlRoot(htmlParse(
  '<html>&nbsp;</html>', asText=T))), from='utf-8', to='shift-jis')
[1] NA # bad

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>&nbsp;</html>', asText=T))))
[1] c2 a0 # good

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>&nbsp;あ</html>', asText=T, encoding='utf-8'))))
[1] c2 a0 e3 81 82 # good

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>&nbsp;\xe3\x81\x82</html>', asText=T, encoding='utf-8'))))
[1] c2 a0 e3 81 82 # good

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>&nbsp;\x82\xa0</html>', asText=T, encoding='shift-jis'))))
[1] c2 a0 e3 81 82 # good

# Windows
> xmlValue(xmlRoot(htmlParse(
  '<html>&nbsp;</html>', asText=T)))
[1] "ツxa0" # nonsense; putting utf-8 characters on shift-jis terminal

> iconv(xmlValue(xmlRoot(htmlParse(
  '<html>&nbsp;</html>', asText=T))), from='utf-8', to='shift-jis')
[1] NA # bad

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>&nbsp;</html>', asText=T))))
[1] c2 a0 # good

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>&nbsp;あ</html>', asText=T, encoding='shift-jis'))))
[1] c2 a0 e3 81 82 # good

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>&nbsp;\xe3\x81\x82</html>', asText=T, encoding='utf-8'))))
[1] c2 a0 e3 81 82 # good

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>&nbsp;\x82\xa0</html>', asText=T, encoding='shift-jis'))))
[1] c2 a0 e3 81 82 # good

As shown above, function xmlValue always returns a utf-8 string, and the result is exactly the same on both Mac and Windows, regardless of the difference in locales.  An &nbsp; is converted to a u+00a0 (\xc2\xa0 in utf-8).  An error occurs when iconv converts the utf-8 characters into shift-jis, on both Mac and Windows.  So, this is not an issue of xmlValue.

The issue can be simplified into an issue of iconv.

# Mac and Windows
> iconv('\u00a0', from='utf-8', to='shift-jis', sub='byte')
[1] "<c2><a0>" # bad

As shown above, function iconv fails to convert u+00a0 into shift-jis.  Because Mac users usually do not convert characters into shift-jis, the issue is in practice specific to Windows.
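
If a readable result matters more than the lost character, a crude workaround is possible.  This is my own sketch, not part of the original report: the sub argument of iconv replaces the bytes it cannot convert, so the rest of the text survives instead of the whole string becoming NA.

# u+00a0 is two bytes in utf-8, so it becomes two spaces, but there is no NA
iconv('\u3042\u00a0\u3042', from='utf-8', to='shift-jis', sub=' ')
# without sub, the whole string would be lost as NA
iconv('\u3042\u00a0\u3042', from='utf-8', to='shift-jis')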

Perhaps I found the background of the cause.  According to the list of JIS X 0213 non-Kanji at Wikipedia, the No-Break Space was not defined in JIS X 0208 and was added in JIS X 0213 in 2004.  This means u+00a0 is included in the latest extended shift-jis (shift_jis-2004), but not in the conventional shift-jis.  Because Windows code page 932 (cp932) was defined on top of the conventional shift-jis (JIS X 0208), cp932 is not compatible with JIS X 0213.  In contrast, Mac uses shift_jis-2004 (JIS X 0213).

# Mac
> charToRaw(iconv('\u00a0', from='utf-8', to='shift_jisx0213', sub=' '))
[1] 85 41 # good

When the explicit version of shift-jis is specified, iconv successfully converts u+00a0 into shift_jis-2004.

But, Windows fails with the message:

unsupported conversion from 'utf-8' to 'shift_jisx0213' 
in codepage 932.

Actually, the issue is not in iconv itself, but in the differences between versions of the JIS standard.
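
As a quick check of that interpretation (my addition), iconvlist() shows which encoding names the local iconv build actually knows.

# which shift-jis family encodings does this build of iconv know?
grep('jis', iconvlist(), ignore.case=TRUE, value=TRUE)
# on Mac the list includes SHIFT_JISX0213; on a codepage-932 Windows it
# typically does not, which matches the error message above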

2-2. trim

In the following tests, the Japanese Hiragana character “あ”, whose binary code is “e3 81 82” in utf-8 and “82 a0” in shift-jis, was used.

# Mac
> xmlValue(xmlRoot(htmlParse(
  '<html>a</html>', asText=T, encoding='shift-jis')), trim=T)
[1] "a" # good. ascii

> xmlValue(xmlRoot(htmlParse(
  '<html>\ta</html>', asText=T, encoding='shift-jis')), trim=T)
[1] "a" # good. ascii, trim

> charToRaw(xmlValue(xmlRoot(htmlParse(iconv(
  '<html>あ</html>', from='utf-8', to='shift-jis'), asText=T, encoding='shift-jis')), trim=T))
[1] e3 81 82 # good. shift-jis

> charToRaw(xmlValue(xmlRoot(htmlParse(iconv(
  '<html>\tあ</html>', from='utf-8', to='shift-jis'), asText=T, encoding='shift-jis')), trim=F))
[1] 09 e3 81 82 # good. shift-jis, trim=FALSE

> charToRaw(xmlValue(xmlRoot(htmlParse(iconv(
  '<html>\tあ</html>', from='utf-8', to='shift-jis'), asText=T, 
  encoding='shift-jis')), trim=T))
[1] e3 81 82 # good. shift-jis, trim

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>\t\x82\xa0</html>', asText=T, encoding='shift-jis')), trim=T))
[1] e3 81 82 # good. shift-jis, trim

> charToRaw(xmlValue(xmlRoot(htmlParse(iconv(
  '<html>a\tあ</html>', from='utf-8', to='shift-jis'), asText=T, 
  encoding='shift-jis')), trim=T))
[1] 61 09 e3 81 82 # good. shift-jis, trim=TRUE but is not required

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>\tあ</html>', asText=T, encoding='utf-8')), trim=T))
[1] e3 81 82 # good. utf-8, trim

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>\t\xe3\x81\x82</html>', asText=T, encoding='utf-8')), trim=T))
[1] e3 81 82 # good. utf-8, trim

# Windows
> xmlValue(xmlRoot(htmlParse(
  '<html>a</html>', asText=T, encoding='shift-jis')), trim=T)
[1] "a" # good. ascii

> xmlValue(xmlRoot(htmlParse(
  '<html>\ta</html>', asText=T, encoding='shift-jis')), trim=T)
[1] "a" # good. ascii, trim

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>あ</html>', asText=T, encoding='shift-jis')), trim=T))
[1] e3 81 82 # good. shift-jis

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>\tあ</html>', asText=T, encoding='shift-jis')), trim=F))
[1] 09 e3 81 82 # good. shift-jis, trim=FALSE

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>\tあ</html>', asText=T, encoding='shift-jis')), trim=T))
[1] e7 b8 ba e3 a0 bc e3 b8 b2 # bad. shift-jis, trim

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>\t\x82\xa0</html>', asText=T, encoding='shift-jis')), trim=T))
[1] e7 b8 ba e3 a0 bc e3 b8 b2 # bad. shift-jis, trim

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>a\tあ</html>', asText=T, encoding='shift-jis')), trim=T))
[1] 61 09 e3 81 82 # good. shift-jis, trim=TRUE but is not required

> charToRaw(xmlValue(xmlRoot(htmlParse(iconv(
  '<html>\tあ</html>', from='shift-jis', to='utf-8'), asText=T, 
  encoding='utf-8')), trim=T))
[1] e3 81 82 # good. utf-8, trim

> charToRaw(xmlValue(xmlRoot(htmlParse(
  '<html>\t\xe3\x81\x82</html>', asText=T, 
  encoding='utf-8')), trim=T))
[1] e3 81 82 # good. utf-8, trim

Mac passed all tests.

In contrast, a bad case was found on Windows: when the text consisted of Japanese characters (あ) and space characters (\t), when the option trim=TRUE was specified, and when removal of the space characters was truly required.  The result was something unreadable.  ASCII and utf-8 encodings were safe.

A key point here may be the difference in regular expression behaviour across locales: utf-8 for Mac and shift-jis (cp932) for Windows.

# Mac
> charToRaw(gsub('\\s', '', '\tあ'))
[1] e3 81 82 # good

# Windows
> charToRaw(gsub('\\s', '', iconv('\tあ', from='shift-jis', to='utf-8')))
[1] e7 b8 ba e3 a0 bc e3 b8 b2 # bad

This result matches the tests of xmlValue exactly.  So, what trim=TRUE in the package XML does may be to apply R’s regular expressions (which depend on the locale) to the internal string (which is always utf-8).  Because the regular expression engine on Japanese Windows assumes the national locale (cp932), it is unsafe when applied to the international encoding (utf-8).

Additionally, the result of the utf-8 trim test was good on Windows.  This indicates that there are some procedures to handle locale differences in package XML, and that this bad case slips through them by mistake.
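
Until the package handles this case, a locale-independent trim is possible by hand.  The following is a minimal sketch of my own (not the package’s fix): get the value untrimmed, then strip only ASCII whitespace byte-wise, so the regular expression never touches the multibyte utf-8 sequences.

v <- xmlValue(xmlRoot(htmlParse(
  '<html>\tあ</html>', asText=T, encoding='shift-jis')), trim=F)
# utf-8 continuation bytes are all >= 0x80, so a byte-wise ASCII whitespace
# class can never break a multibyte character
charToRaw(gsub('^[ \t\r\n]+|[ \t\r\n]+$', '', v, useBytes=TRUE))
# expected: e3 81 82, even on a cp932 Windows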

3. Another way

Thanks to Jeroen for the comment.  CRAN package xml2 is free from the 2nd issue.

# Windows
> library(xml2)
> charToRaw(xml_text(read_xml(
  '<html>\tあ</html>', 
  encoding='shift-jis'), trim=T))
[1] e3 81 82 # good. shift-jis, trim=TRUE

Its trim=TRUE works well on Windows with shift-jis encoded html, and the result is independent of the read_xml option as_html=TRUE or FALSE.  So we can use the package xml2 as an alternative to the package XML.
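
As a rough end-to-end sketch (my own, assuming the same URL and the same <td class="title"> node as in section 1), xml2 can also read the page directly.

library(xml2)
doc <- read_html('http://www.taiki.pref.ibaraki.jp/data.asp', encoding='shift-jis')
node <- xml_find_first(doc, '//td[@class="title"]')
xml_text(node, trim=TRUE)  # the title text, without the tabs and newlines

Note that the &nbsp; entities still come back as u+00a0 characters, so the first issue from section 2-1 is unchanged; only the trim issue is avoided.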

