[This article was first published on R – Gradient Metrics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
As part of our road to detecting metaphors we got stuck on a simple problem: compound nouns.
If you take the sentence:
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
series of immigration policy changes
Series modifies changes in reference to immigration policy, which is a compound noun. “Series of changes” is not what we would consider metaphorical usage, but our detector would label “series of immigration” as potentially metaphorical, given its strangeness. Identifying compound nouns, and identifying which specific word is being modified (and is thus the “target” concept in the metaphor), is critical to improving performance. But, we realized we didn’t want to throw this extra information out. Enter collocations:“a sequence of words or terms that co-occur more often than would be expected by chance”
Using the same corpus that we’ve been using (which contains news articles, social media posts, and TV transcripts), we calculated the most prominent collocations containing “immigrant”, “immigration”, and “migration”.prefix | suffix | prefix frequency | suffix frequency | co-occurrence |
chain | immigration | 16271 | 24547 | 8548 |
undocumented | immigrant | 23145 | 127600 | 13645 |
migration | crisis | 24547 | 30606 | 1980 |
illegal | immigrant | 61791 | 127600 | 18759 |
comprehensive | immigration | 9550 | 261664 | 4930 |
migration | visa | 24547 | 23667 | 1041 |
immigration | custom | 261664 | 16135 | 6669 |
unaccompanied | immigrant | 7945 | 127600 | 1547 |
immigration | reform | 261664 | 42233 | 16839 |
testify | immigrant | 13444 | 127600 | 2288 |
In graphical form, here’s what the information looks like:
The blue lines indicate the word is a prefix to the key source words (e.g. “chain migration”, “unaccompanied immigrant”), and the green lines indicate it is a suffix (e.g. “immigration policy”, “illegal reform”.)
What stands out to us is how little overlap there is in the collocations that overlap between the three words (except “illegal”, which is highly related to all three). This is especially surprising between “migration” and “immigration” which are both abstract nouns.
To leave a comment for the author, please follow the link and comment on their blog: R – Gradient Metrics.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.