I wrote a little while back about using Microsoft Cognitive Services APIs with R to detect the language of pieces of text and then run sentiment analysis on them. I wasn’t too happy with some of the code, as it was very inelegant. I knew I could code better than I had, especially as I’ve been doing a lot more work with purrr recently. However, the rewrite had sat in drafts for a while. Then David Smith kindly posted about the process I used, which meant I had to get this nicer version of my code out ASAP!
Get the complete code in this gist.
Prerequisites
Setup
    library(httr)
    library(jsonlite)
    library(dplyr)
    library(purrr)

    cogapikey <- "XXX"

    text <- c("is this english?",
              "tak er der mere kage",
              "merci beaucoup",
              "guten morgen",
              "bonjour",
              "merde",
              "That's terrible",
              "R is awesome")

    # Put data in an object that converts to the expected schema for the API
    # (tibble() replaces the deprecated data_frame())
    tibble(text = text) %>%
      mutate(id = row_number()) -> textdf

    textdf %>%
      list(documents = .) -> mydata
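To see what "converts to the expected schema" means in practice, here is a minimal sketch of how jsonlite serialises a list that wraps a data frame. The two-row `docs` data frame is a made-up stand-in for `textdf`, not part of the original code.

```r
library(jsonlite)

# Hypothetical two-row stand-in for textdf
docs <- data.frame(
  text = c("guten morgen", "R is awesome"),
  id = 1:2,
  stringsAsFactors = FALSE
)

# A data frame inside a named list is written as an array of
# row objects under that name, i.e. one {text, id} object per row
payload <- toJSON(list(documents = docs))
print(payload)

# Reading it back recovers the same rows
parsed <- fromJSON(payload)
```

Because jsonlite writes data frames row by row by default, adding a column to `textdf` later (such as `language`) simply adds a field to each document object.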
Language detection
We need to identify the most likely language for each bit of text so that we can pass it along to the sentiment analysis API, which needs it to give decent results.
    cogapi <- "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/languages?numberOfLanguagesToDetect=1"

    cogapi %>%
      POST(add_headers(`Ocp-Apim-Subscription-Key` = cogapikey),
           body = toJSON(mydata)) -> response

    # Process response
    response %>%
      content() %>%
      flatten_df() %>%
      select(detectedLanguages) %>%
      flatten_df() -> respframe

    textdf %>%
      mutate(language = respframe$iso6391Name) -> textdf
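The response processing can be tried without an API key by working on a mock of the parsed response. This is only a sketch: `mock_response` is hypothetical, and its `documents`/`errors` layout is an assumption about what `content()` returns; only the `detectedLanguages` and `iso6391Name` names come from the code above. It also uses purrr's list-style extractor as an alternative to the `flatten_df()` chain, rather than reproducing it.

```r
library(purrr)

# Hypothetical parsed response (assumed shape, not captured from the API)
mock_response <- list(
  documents = list(
    list(id = "1", detectedLanguages = list(
      list(name = "English", iso6391Name = "en", score = 1))),
    list(id = "2", detectedLanguages = list(
      list(name = "Danish", iso6391Name = "da", score = 1)))
  ),
  errors = list()
)

# A list passed to map_chr() acts as a nested extractor:
# field "detectedLanguages", then element 1, then field "iso6391Name"
languages <- map_chr(mock_response$documents,
                     list("detectedLanguages", 1, "iso6391Name"))

# A single string extracts a field by name
ids <- map_chr(mock_response$documents, "id")
```

Extracting by name and position like this keeps the link between each document and its language explicit, which can help if the service ever returns documents in a different order.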
Sentiment analysis
With an ID, text, and a language code, we can now request the sentiment of our text be analysed.
    # New info
    mydata <- list(documents = textdf)

    # New endpoint
    cogapi <- "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment"

    # Construct a request
    cogapi %>%
      POST(add_headers(`Ocp-Apim-Subscription-Key` = cogapikey),
           body = toJSON(mydata)) -> response

    # Process response
    response %>%
      content() %>%
      flatten_df() %>%
      mutate(id = as.numeric(id)) -> respframe

    # Combine
    textdf %>%
      left_join(respframe, by = "id") -> textdf
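The `as.numeric(id)` step matters for the final join. The sketch below assumes the API hands the ids back as strings, which the conversion in the code above suggests but which is not confirmed here; the small `textdf_demo` and `resp_demo` tables are invented for illustration.

```r
library(dplyr)

# Hypothetical inputs: ids are integers on our side
textdf_demo <- tibble(
  text = c("merci beaucoup", "merde"),
  id = 1:2,
  language = "fr"
)

# Assumed shape of the processed response: ids arrive as strings
resp_demo <- tibble(
  score = c(0.81, 0.05),
  id = c("1", "2")
)

# Convert the ids to numbers first so both key columns are compatible
joined <- textdf_demo %>%
  left_join(resp_demo %>% mutate(id = as.numeric(id)), by = "id")
```

Naming the key with `by = "id"` also silences the message dplyr prints when it has to guess the join columns.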
And… et voilà! A multi-language dataset with the language identified and the sentiment scored, using purrr for easier-to-read code.
Using purrr with APIs makes code nicer and more elegant as it really helps interact with hierarchies from JSON objects. I feel much better about this code now!
Original
id | language | text | score |
---|---|---|---|
1 | en | is this english? | 0.2852910 |
2 | da | tak er der mere kage | NA |
3 | fr | merci beaucoup | 0.8121097 |
4 | de | guten morgen | NA |
5 | fr | bonjour | 0.8118965 |
6 | fr | merde | 0.0515683 |
7 | en | That’s terrible | 0.1738841 |
8 | en | R is awesome | 0.9546152 |
Revised
text | id | language | score |
---|---|---|---|
is this english? | 1 | en | 0.2265771 |
tak er der mere kage | 2 | da | 0.7455934 |
merci beaucoup | 3 | fr | 0.8121097 |
guten morgen | 4 | de | 0.8581840 |
bonjour | 5 | fr | 0.8118965 |
merde | 6 | fr | 0.0515683 |
That’s terrible | 7 | en | 0.0068665 |
R is awesome | 8 | en | 0.9973871 |
Interestingly, the scores for English have not stayed the same – for instance, Microsoft now sees “R is awesome” in a much more positive light. It’s also great to see German and Danish are now supported!
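The change between the two runs can be checked directly from the two tables above. This is a small sketch that copies the scores by hand into a tibble; the column names `original`, `revised`, and `change` are my own.

```r
library(dplyr)

# Scores copied from the Original and Revised tables
scores <- tibble(
  id = 1:8,
  language = c("en", "da", "fr", "de", "fr", "fr", "en", "en"),
  original = c(0.2852910, NA, 0.8121097, NA,
               0.8118965, 0.0515683, 0.1738841, 0.9546152),
  revised  = c(0.2265771, 0.7455934, 0.8121097, 0.8581840,
               0.8118965, 0.0515683, 0.0068665, 0.9973871)
)

# Keep only rows scored in both runs whose score actually moved
changes <- scores %>%
  mutate(change = revised - original) %>%
  filter(!is.na(change), change != 0)

print(changes)
```

Only the English rows remain after the filter: the French scores are identical in both runs, and Danish and German had no original score to compare against.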
The post Using purrr with APIs – revamping my code appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.