The tolower() function throws an error when it cannot interpret its input as valid text in the declared encoding, which is a common occurrence when analysing social media data that contains emoticons.
Emoticons (emoji) are the pictographic symbols commonly used on mobile phones; not every platform can represent them, so they sometimes arrive as escape sequences or invalid bytes.
For example, when converting tweets sent to @delta (Delta Air Lines) to lower case, I got the following error:
Error in tolower(text) : invalid input '@ActualALove: First time I've seen a foot-rest in first class! Oh @Delta, how I love thee \ud83d\ude0a✈\ud83d\udc78 http://t.co/noKI9CiM' in 'utf8towcs'
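When I looked up the actual tweet, the escaped sequences turned out to be emoticons.
The two Unicode characters that weren’t recognised were \ud83d\ude0a (U+1F60A, SMILING FACE WITH SMILING EYES) and \ud83d\udc78 (U+1F478, PRINCESS). Each is written in the error message as a UTF-16 surrogate pair rather than as a single code point.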
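To check which character a pair such as \ud83d\ude0a stands for, the two halves can be combined with the standard UTF-16 surrogate formula. The helper name surrogate_to_codepoint below is my own, purely for illustration:

```r
# Hypothetical helper: combine a UTF-16 high/low surrogate pair
# into the Unicode code point it encodes.
surrogate_to_codepoint <- function(high, low) {
  0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
}

smile    <- surrogate_to_codepoint(0xD83D, 0xDE0A)
princess <- surrogate_to_codepoint(0xD83D, 0xDC78)

sprintf("U+%X", smile)     # "U+1F60A"
sprintf("U+%X", princess)  # "U+1F478"
```

Looking those code points up in the Unicode charts gives the two names above.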
Gaston Sanchez has posted a solution to this problem in his blog Data Analysis Visually Enforced. I’ve used the code and it works well. When I have time, I’ll extend it to replace the offending characters instead of returning NA for the entire string.
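I haven't reproduced Gaston's code here. Going only by the behaviour described above (NA for the whole string on failure), the general idea can be sketched by wrapping tolower() in tryCatch(). The second function is one possible way to do the extension I mentioned: it assumes the input is meant to be UTF-8 and uses iconv() with sub = "" to drop the bytes it cannot convert. Both function names are my own, and how iconv() treats invalid input can vary by platform, so treat this as a starting point rather than a tested fix.

```r
# Hypothetical wrapper: lower-case each string on its own and
# return NA for any element where tolower() raises an error.
safe_tolower <- function(x) {
  vapply(x, function(s) {
    tryCatch(tolower(s), error = function(e) NA_character_)
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical extension: remove bytes that cannot be converted as
# UTF-8 (assumes UTF-8 input) and then lower-case what remains.
clean_tolower <- function(x) {
  cleaned <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
  tolower(cleaned)
}

tweets <- c("Oh @Delta, how I LOVE thee", "First Class!")
safe_tolower(tweets)
clean_tolower(tweets)
```

With safe_tolower() a bad tweet costs you the whole string; with clean_tolower() you keep the text and lose only the characters that could not be converted.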