Pitfall of XML package: to know the cause
This is the sequel to the previous report “issues specific to cp932 locale, Japanese Shift-JIS, on Windows”. In this report, I will dig deeper into the issues to find out exactly what is happening.
1. Where it occurs
I knew the issues depend on the node text, because the very same script runs fine on Windows when it parses another table in the same html source.
    # Windows
    library(XML)
    src <- 'http://www.taiki.pref.ibaraki.jp/data.asp'
    t2 <- iconv(as.character(
      readHTMLTable(src, which=4, trim=T, header=F, skip.rows=2:48, encoding='shift-jis')[1,1]
    ), from='utf-8', to='shift-jis')
    > t2 # bad
    [1] NA
    s2 <- iconv(as.character(
      readHTMLTable(src, which=6, trim=T, header=F, skip.rows=1, encoding='shift-jis')[2,2]
    ), from='utf-8', to='shift-jis')
    > s2 # good
    [1] "北茨城中郷"
Knowing how the two html parts differ tells us where the issue occurs.
Let’s look at the html source with base functions, instead of using the package XML.
    con <- url(src, encoding='shift-jis')
    x <- readLines(con)
    close(con)
I know two useful keywords to locate the positions of t2 and s2 above: 2016 and 101, respectively.
    # for t2
    > grep('2016', x)
    [1] 120 133 141 148 160 161
    > x[119:121]
    [1] "\t\t\t<td class=\"title\">"
    [2] "\t\t\t\t最新の観測情報&nbsp;&nbsp;（2016年1月17日&nbsp;&nbsp;8時）"
    [3] "\t\t\t</td>"
    # for s2
    > grep('101', x)
    [1] 181
    > x[181:182]
    [1] "\t\t\t\t\t\t<td class=\"td1\">101</td>"
    [2] "\t\t\t\t\t\t<td class=\"td2\">北茨城中郷</td>"
Note that only x[182] is for s2 and x[181] was used to find the position. Apparently, the differences between t2 and s2 are:
- t2 includes &nbsp;.
- t2 spreads over multiple lines.
Because I want to know the exact content of a single node (html element), the three lines of t2 must be joined together.
    paste(x[119:121], collapse='\r\n')
Pasting with a newline code \r\n may be the answer, but a more exact procedure is better.
Binary functions are elegant tools that handle an html source as binary data, exactly as the web server offers it, regardless of client platform and locale.
    con <- url(src, open='rb')
    skip <- readBin(con, what='raw', n=5009)
    xt2 <- readBin(con, what='raw', n=92)
    skip <- readBin(con, what='raw', n=3013)
    xs2 <- readBin(con, what='raw', n=31)
    close(con)
This time I cheated and took the byte positions of interest from the prior result x.
    # t2 begins after
    > sum(nchar(x[1:118], type='bytes') + 2) + nchar(sub('<.*$', '', x[119]), type='bytes')
    [1] 5009
    # t2 length
    > sum(nchar(x[119:121], type='bytes') + 2) - 2 - nchar(sub('<.*$', '', x[119]), type='bytes')
    [1] 92
    # s2 begins after
    > sum(nchar(x[1:181], type='bytes') + 2) + nchar(sub('<.*$', '', x[182]), type='bytes')
    [1] 8114
    # s2 length
    > nchar(x[182], type='bytes') - nchar(sub('<.*$', '', x[182]), type='bytes')
    [1] 31
    # s2 from the end of t2
    > 8114 - 5009 - 92
    [1] 3013
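The offset arithmetic can be checked on a small in-memory example. This is only a sketch: the three-line document and the names `txt`, `doc`, `offset` and `node` are hypothetical, and it assumes CRLF line endings, as on the page above.

```r
# Hypothetical three-line document; ASCII only, so it behaves the same in any locale
txt <- c("<tr>", "\t<td class=\"title\">", "\ttext</td>")

# Rebuild the byte stream as a server would send it, with CRLF after every line
doc <- charToRaw(paste0(paste(txt, collapse = "\r\n"), "\r\n"))

# Bytes before the '<' on line 2:
# every earlier line plus 2 bytes for its CRLF, plus the leading whitespace of line 2
offset <- sum(nchar(txt[1], type = "bytes") + 2) +
  nchar(sub("<.*$", "", txt[2]), type = "bytes")

# Length of the opening tag on line 2, without its leading whitespace
len <- nchar(txt[2], type = "bytes") - nchar(sub("<.*$", "", txt[2]), type = "bytes")

# Skip 'offset' bytes, then take 'len' bytes
node <- rawToChar(doc[(offset + 1):(offset + len)])
node
```

The `+ 2` term is the part that readLines hides: it drops the line terminators, so they have to be added back when counting bytes.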
Variables xt2 and xs2 have what I want as binary (raw) vectors.
    # Windows
    > rawToChar(xt2)
    [1] "<td class=\"title\">\r\n\t\t\t\t最新の観測情報&nbsp;&nbsp;（2016年1月17日&nbsp;&nbsp;8時）\r\n\t\t\t</td>"
    > rawToChar(xs2)
    [1] "<td class=\"td2\">北茨城中郷</td>"
Compare the inner texts of these nodes.
    t2: \r\n\t\t\t\t最新の観測情報&nbsp;&nbsp;（2016年1月17日&nbsp;&nbsp;8時）\r\n\t\t\t
    s2: 北茨城中郷
Maybe a text including control codes (\r, \n, \t) and/or html entities (&nbsp;) is unsafe, and a text made up of printable Kanji characters only is safe. So, the issues must occur at these special characters.
Before digging further, I want to introduce two nice binary functions that can be used without cheating on byte positions. Function getBinaryURL in package RCurl can get the whole binary data from a web page. Function grepRaw can locate the positions of specific characters in a binary vector.
    library(RCurl)
    x <- getBinaryURL(src)
    > str(x)
     raw [1:35470] 0d 0a 3c 68 ...
    > grepRaw('2016', x, all=FALSE)
    [1] 5062
    > x[5000:5100]
     [1] 09 3c 74 72 3e 0d 0a 09 09 09 3c 74 64 20 63 6c 61 73 73 3d 22 74 69 74
    [25] 6c 65 22 3e 0d 0a 09 09 09 09 8d c5 90 56 82 cc 8a cf 91 aa 8f ee 95 f1
    [49] 26 6e 62 73 70 3b 26 6e 62 73 70 3b 81 69 32 30 31 36 94 4e 31 8c 8e 31
    [73] 37 93 fa 26 6e 62 73 70 3b 26 6e 62 73 70 3b 38 8e 9e 81 6a 0d 0a 09 09
    [97] 09 3c 2f 74 64
    > rawToChar(x[5000:5100])
    [1] "\t<tr>\r\n\t\t\t<td class=\"title\">\r\n\t\t\t\t最新の観測情報&nbsp;&nbsp;（2016年1月17日&nbsp;&nbsp;8時）\r\n\t\t\t</td"
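With grepRaw, a whole node can be cut out of the raw vector without counting lines. The following is a minimal sketch on a hypothetical in-memory page (the variable names and the tiny page are made up for illustration); on the real page, `page` would be the result of getBinaryURL.

```r
# Hypothetical page: ASCII markup around the Shift-JIS bytes of "あ" (82 a0)
open_tag <- "<td class=\"td2\">"
page <- c(charToRaw(paste0("<tr>\r\n\t", open_tag)),
          as.raw(c(0x82, 0xa0)),
          charToRaw("</td>\r\n</tr>"))

# Position of the first byte of the opening tag
start <- grepRaw(open_tag, page, fixed = TRUE)

# First closing tag that comes after the opening tag
ends <- grepRaw("</td>", page, fixed = TRUE, all = TRUE)
end_tag <- ends[ends > start][1]

# Raw bytes between the two tags, still in the server's encoding
inner <- page[(start + nchar(open_tag, type = "bytes")):(end_tag - 1)]
inner
```

Because the search pattern is pure ASCII and the match is byte-wise (fixed=TRUE), the locale of the client does not affect the positions found.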
2. What is the cause
Let’s check out what happens when an html has an html entity (&nbsp;) or spaces (\r, \n, \t). I’m going to use a minimal html to compare the responses of package XML on Mac and Windows.
    library(XML)
    > xmlValue(xmlRoot(htmlParse('<html>ABC</html>', asText=T)))
    [1] "ABC"
2-1. No-Break Space (U+00A0, &nbsp;)
    # Mac
    > xmlValue(xmlRoot(htmlParse('<html>&nbsp;</html>', asText=T)))
    [1] " " # good
    > iconv(xmlValue(xmlRoot(htmlParse('<html>&nbsp;</html>', asText=T))), from='utf-8', to='shift-jis')
    [1] NA # bad
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;</html>', asText=T))))
    [1] c2 a0 # good
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;あ</html>', asText=T, encoding='utf-8'))))
    [1] c2 a0 e3 81 82 # good
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;\xe3\x81\x82</html>', asText=T, encoding='utf-8'))))
    [1] c2 a0 e3 81 82 # good
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;\x82\xa0</html>', asText=T, encoding='shift-jis'))))
    [1] c2 a0 e3 81 82 # good

    # Windows
    > xmlValue(xmlRoot(htmlParse('<html>&nbsp;</html>', asText=T)))
    [1] "ﾂ\xa0" # nonsense; putting utf-8 characters on shift-jis terminal
    > iconv(xmlValue(xmlRoot(htmlParse('<html>&nbsp;</html>', asText=T))), from='utf-8', to='shift-jis')
    [1] NA # bad
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;</html>', asText=T))))
    [1] c2 a0 # good
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;あ</html>', asText=T, encoding='shift-jis'))))
    [1] c2 a0 e3 81 82 # good
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;\xe3\x81\x82</html>', asText=T, encoding='utf-8'))))
    [1] c2 a0 e3 81 82 # good
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;\x82\xa0</html>', asText=T, encoding='shift-jis'))))
    [1] c2 a0 e3 81 82 # good
As shown above, function xmlValue always returns a utf-8 string and the result is exactly the same on both Mac and Windows, regardless of the difference of locales. An &nbsp; is converted to a u+00a0 (\xc2\xa0 in utf-8). An error occurs when iconv converts utf-8 characters into shift-jis on both Mac and Windows. So, this is not an issue of xmlValue.
The issue can be simplified into an issue of iconv.
    # Mac and Windows
    > iconv('\u00a0', from='utf-8', to='shift-jis', sub='byte')
    [1] "<c2><a0>" # bad
As shown above, function iconv fails to convert u+00a0 into shift-jis. Because Mac people usually do not convert characters into shift-jis, the issue in practice is specific to Windows.
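When a longer text fails to convert, it helps to find out which characters are responsible. The helper below is a hypothetical sketch, not part of any package: it splits a string into single characters and reports those for which iconv returns NA. The demonstration converts to ASCII only because that target is available everywhere; for the case in this report, the target would be 'shift-jis', and the result then depends on the platform's iconv.

```r
# Hypothetical helper: characters of 's' that iconv cannot convert to encoding 'to'
find_unconvertible <- function(s, to) {
  # One element per character, each marked as UTF-8
  chars <- intToUtf8(utf8ToInt(enc2utf8(s)), multiple = TRUE)
  # iconv returns NA for every element it cannot convert (sub is left at NA)
  converted <- iconv(chars, from = "UTF-8", to = to)
  chars[is.na(converted)]
}

# Portable demonstration with an ASCII target
bad <- find_unconvertible("a\u00a0b", to = "ASCII")

# Print code points rather than glyphs, so the output is readable in any locale
sprintf("U+%04X", vapply(bad, utf8ToInt, integer(1)))

# For this report one would call, for example:
# find_unconvertible(text, to = "shift-jis")
```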
Perhaps I found the background of the cause. According to a list of JIS X 0213 non-Kanji at wikipedia, No-Break Space was not defined in JIS X 0208 and was added in JIS X 0213 in year 2004. This means u+00a0 is included in the latest extended shift-jis (shift_jis-2004), but not in the conventional shift-jis. Because Windows code page 932 (cp932) was modeled on the conventional shift-jis (JIS X 0208), cp932 is not compatible with JIS X 0213. In contrast, Mac uses shift_jis-2004 (JIS X 0213).
    # Mac
    > charToRaw(iconv('\u00a0', from='utf-8', to='shift_jisx0213', sub=' '))
    [1] 85 41 # good
When the explicit version of shift-jis is specified, iconv successfully converts u+00a0 into shift_jis-2004.
But Windows fails with the message:
unsupported conversion from 'utf-8' to 'shift_jisx0213' in codepage 932.
Actually, the issue is not one of iconv, but of differences between versions of the JIS code.
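One possible workaround, which is my own assumption and not something tested in this report, is to replace each no-break space with an ordinary space before converting to the locale encoding. The helper name `nbsp_to_space` is hypothetical.

```r
# Hypothetical helper: swap every no-break space (U+00A0) for an ASCII space
nbsp_to_space <- function(s) {
  gsub("\u00a0", " ", enc2utf8(s), fixed = TRUE)
}

cleaned <- nbsp_to_space("a\u00a0b")
charToRaw(cleaned)

# The cleaned string can then be handed to iconv, e.g. on Windows:
# iconv(cleaned, from = "UTF-8", to = "cp932")
```

This changes the text slightly (the space is no longer non-breaking), so it only fits cases where that distinction does not matter.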
2-2. trim
In the following tests, a Japanese Hiragana character “あ”, whose binary code is “e3 81 82” in utf-8 and “82 a0” in shift-jis, was used.
    # Mac
    > xmlValue(xmlRoot(htmlParse('<html>a</html>', asText=T, encoding='shift-jis')), trim=T)
    [1] "a" # good. ascii
    > xmlValue(xmlRoot(htmlParse('<html>\ta</html>', asText=T, encoding='shift-jis')), trim=T)
    [1] "a" # good. ascii, trim
    > charToRaw(xmlValue(xmlRoot(htmlParse(iconv('<html>あ</html>', from='utf-8', to='shift-jis'), asText=T, encoding='shift-jis')), trim=T))
    [1] e3 81 82 # good. shift-jis
    > charToRaw(xmlValue(xmlRoot(htmlParse(iconv('<html>\tあ</html>', from='utf-8', to='shift-jis'), asText=T, encoding='shift-jis')), trim=F))
    [1] 09 e3 81 82 # good. shift-jis, trim=FALSE
    > charToRaw(xmlValue(xmlRoot(htmlParse(iconv('<html>\tあ</html>', from='utf-8', to='shift-jis'), asText=T, encoding='shift-jis')), trim=T))
    [1] e3 81 82 # good. shift-jis, trim
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>\t\x82\xa0</html>', asText=T, encoding='shift-jis')), trim=T))
    [1] e3 81 82 # good. shift-jis, trim
    > charToRaw(xmlValue(xmlRoot(htmlParse(iconv('<html>a\tあ</html>', from='utf-8', to='shift-jis'), asText=T, encoding='shift-jis')), trim=T))
    [1] 61 09 e3 81 82 # good. shift-jis, trim=TRUE but is not required
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>\tあ</html>', asText=T, encoding='utf-8')), trim=T))
    [1] e3 81 82 # good. utf-8, trim
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>\t\xe3\x81\x82</html>', asText=T, encoding='utf-8')), trim=T))
    [1] e3 81 82 # good. utf-8, trim

    # Windows
    > xmlValue(xmlRoot(htmlParse('<html>a</html>', asText=T, encoding='shift-jis')), trim=T)
    [1] "a" # good. ascii
    > xmlValue(xmlRoot(htmlParse('<html>\ta</html>', asText=T, encoding='shift-jis')), trim=T)
    [1] "a" # good. ascii, trim
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>あ</html>', asText=T, encoding='shift-jis')), trim=T))
    [1] e3 81 82 # good. shift-jis
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>\tあ</html>', asText=T, encoding='shift-jis')), trim=F))
    [1] 09 e3 81 82 # good. shift-jis, trim=FALSE
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>\tあ</html>', asText=T, encoding='shift-jis')), trim=T))
    [1] e7 b8 ba e3 a0 bc e3 b8 b2 # bad. shift-jis, trim
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>\t\x82\xa0</html>', asText=T, encoding='shift-jis')), trim=T))
    [1] e7 b8 ba e3 a0 bc e3 b8 b2 # bad. shift-jis, trim
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>a\tあ</html>', asText=T, encoding='shift-jis')), trim=T))
    [1] 61 09 e3 81 82 # good. shift-jis, trim=TRUE but is not required
    > charToRaw(xmlValue(xmlRoot(htmlParse(iconv('<html>\tあ</html>', from='shift-jis', to='utf-8'), asText=T, encoding='utf-8')), trim=T))
    [1] e3 81 82 # good. utf-8, trim
    > charToRaw(xmlValue(xmlRoot(htmlParse('<html>\t\xe3\x81\x82</html>', asText=T, encoding='utf-8')), trim=T))
    [1] e3 81 82 # good. utf-8, trim
Mac passed all tests.
In contrast, a bad case was found on Windows when three conditions held together: the text consisted of Japanese characters (あ) and space characters (\t), the option trim=TRUE was specified, and removal of space characters was truly required. The result was something unreadable. Ascii and utf-8 encodings were safe.
A point here may be the difference of regular expression behavior by locale: utf-8 for Mac and shift-jis for Windows.
    # Mac
    > charToRaw(gsub('\\s', '', '\tあ'))
    [1] e3 81 82 # good
    # Windows
    > charToRaw(gsub('\\s', '', iconv('\tあ', from='shift-jis', to='utf-8')))
    [1] e7 b8 ba e3 a0 bc e3 b8 b2 # bad
This result matches exactly with the tests of xmlValue. So, what trim=TRUE in the package XML does may be to apply R’s regular expression (which depends on locale) to the internal string (always utf-8). The regular expression working on Japanese Windows is safe for strings in the national locale (cp932), but unsafe for strings in the international encoding (utf-8).
Additionally, the result of the utf-8 trim test was good on Windows. This indicates there are some procedures to handle locale differences in package XML, and the bad case slips through by mistake.
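If the regular expression is the weak point, one way to sidestep it is to call xmlValue with trim=FALSE and trim the returned string at the byte level. The sketch below is an assumption of mine, not how package XML works internally; the helper name `trim_utf8_bytes` is hypothetical. It relies on a property of utf-8: every byte of a multi-byte character is 0x80 or above, so the ASCII whitespace bytes can never appear inside a character.

```r
# Hypothetical helper: strip leading and trailing tab, LF, CR and space, byte by byte
trim_utf8_bytes <- function(s) {
  b <- charToRaw(s)
  is_ws <- as.integer(b) %in% c(9L, 10L, 13L, 32L)
  keep <- which(!is_ws)
  if (length(keep) == 0L) return("")
  out <- rawToChar(b[min(keep):max(keep)])
  # Assumption: the input was utf-8, which is what xmlValue returned in the tests above
  Encoding(out) <- "UTF-8"
  out
}

charToRaw(trim_utf8_bytes("\t\u3042"))   # e3 81 82
```

No regular expression and no locale-dependent function is involved, so the result should not depend on the platform.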
3. Another way
Thanks to Jeroen for the comment. CRAN package xml2 is free from the 2nd issue.
    # Windows
    library(xml2)
    > charToRaw(xml_text(read_xml('<html>\tあ</html>', encoding='shift-jis'), trim=T))
    [1] e3 81 82 # good. shift-jis, trim=TRUE
Its trim=TRUE works well on Windows with shift-jis encoded html, and the result is independent of the read_xml option as_html=TRUE or FALSE. So we can use the package xml2 as an alternative to the package XML.