Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A couple of my collaborators have had trouble using read_html() from the rvest package to access this Wikipedia page. Specifically, they have been getting errors like this:
Error in utils::type.convert(out[, i], as.is = TRUE, dec = dec) : invalid multibyte string at '<e2><94>'
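The two bytes quoted in the message hint at what is going on. As an illustration only, and assuming the offending character is something like the box-drawing character U+2500 (the error does not say which character it actually was), this small sketch shows that the UTF-8 encoding of such a character begins with exactly those bytes:

```r
# Assumption: U+2500 is used purely as an example of a character whose
# UTF-8 encoding starts with the bytes e2 94. The real character on the
# page may be a different one.
ch <- intToUtf8(0x2500)

# charToRaw() exposes the underlying bytes of the string.
bytes <- as.character(charToRaw(ch))
print(bytes)
```

A session that is not reading text as UTF-8 has no valid interpretation for a byte sequence like this, which is consistent with the "invalid multibyte string" complaint.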
Since I couldn’t reproduce these errors on my machine, the problem appeared to be related to their particular setup. Looking at their locale provided a clue:
> Sys.getlocale()
[1] "LC_COLLATE=Korean_Korea.949;LC_CTYPE=Korean_Korea.949;LC_MONETARY=Korean_Korea.949;LC_NUMERIC=C;LC_TIME=Korean_Korea.949"
whereas on my machine I have:
> Sys.getlocale()
[1] "LC_CTYPE=en_ZA.UTF-8;LC_NUMERIC=C;LC_TIME=en_ZA.UTF-8;LC_COLLATE=en_ZA.UTF-8;LC_MONETARY=en_ZA.UTF-8;LC_MESSAGES=en_ZA.UTF-8;LC_PAPER=en_ZA.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_ZA.UTF-8;LC_IDENTIFICATION=C"
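The full locale string is long, so when comparing machines it may be easier to look at the character-type category alone. A minimal sketch using only base R (the variable names are mine, not from the original setup):

```r
# Ask for the character-type category only, rather than the whole locale.
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() is part of base R; its "UTF-8" element is TRUE when the
# session is running in a UTF-8 locale.
is_utf8 <- l10n_info()[["UTF-8"]]

cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 session:", is_utf8, "\n")
```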
The document that they were trying to scrape is encoded in UTF-8, which appears in my locale but not in theirs (theirs uses code page 949 instead). Perhaps changing the locale will sort out the problem? Since the en_ZA locale is a bit of an acquired taste (unless you’re South African, in which case it’s de rigueur!), the following should resolve the problem:
> Sys.setlocale("LC_CTYPE", "en_US.UTF-8")
This might produce a warning stating that the request cannot be honoured by your system. Do not fear, all is not lost. Try the following (which seems to work almost universally!):
> Sys.setlocale("LC_ALL", "English")
Try scraping again. Your issues should be resolved.
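The two attempts above can also be wrapped into one step. The sketch below is a hypothetical helper (set_first_locale is my own name, not part of any package) that tries each locale name in turn and keeps the first one the operating system accepts. It relies on Sys.setlocale() returning an empty string, with a warning, when a request is refused. As a simplification it applies a single category to every candidate, whereas the commands above use LC_CTYPE for the first attempt and LC_ALL for the second.

```r
# Hypothetical helper (not from the original post): try each candidate
# locale name and return the first one that the system accepts.
set_first_locale <- function(candidates, category = "LC_CTYPE") {
  for (loc in candidates) {
    # Sys.setlocale() returns "" (and warns) when the request is refused,
    # so the warning is suppressed and the return value is checked instead.
    result <- suppressWarnings(Sys.setlocale(category, loc))
    if (isTRUE(nzchar(result))) {
      return(result)
    }
  }
  ""  # none of the candidates was accepted
}

# Try the UTF-8 locale first, then fall back to the "English" name.
chosen <- set_first_locale(c("en_US.UTF-8", "English"))
cat("Locale now in use:", chosen, "\n")
```

If chosen comes back empty, neither name is available on that machine and the locale is left as it was.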
The post Web Scraping and “invalid multibyte string” appeared first on Exegetic Analytics.