[This article was first published on Stats and things, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Occasionally, I need to pick out names (first name, last name) from text. These days, the text I’m working with is usually tweets. Anyhow, I didn’t see any solution out there (that worked for me) when I developed this, so hopefully it can be a starting point for somebody else with similar needs…
First, I start out with a list of names from the Census Bureau. I downloaded male first names, female first names, and last names and saved them as variables in R. I take out some of the names as “exceptions” that screw up my process, since they also show up as ordinary words: names like “In”, “An”, “Chi”, “So”, and so on.
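A minimal sketch of that setup, under some assumptions: the short vectors below are stand-ins for the real census lists (which are far longer and would be read from the downloaded files), the names are stored in upper case, and the variable names are made up for illustration.

```r
# Stand-ins for the downloaded census lists (assumed to be upper case).
male_first   <- c("JAMES", "JOHN", "AN", "CHI")
female_first <- c("MARY", "LINDA", "SO", "IN")
last_names   <- c("SMITH", "JOHNSON", "IN", "SO")

# Names that double as ordinary words and cause false matches.
exceptions <- c("IN", "AN", "CHI", "SO")

# Merge the two first-name lists, then drop the exceptions from both lists.
first_names <- setdiff(union(male_first, female_first), exceptions)
last_names  <- setdiff(last_names, exceptions)

print(first_names)
print(last_names)
```

The exception list is something to grow over time as new false matches turn up in the text.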
Then, I split my target text up into bigrams, that is, adjacent pairs of words in the original text…
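One way to build the bigrams in base R is sketched below. The function name `get_bigrams` is hypothetical, and the sketch assumes simple whitespace tokenization, so punctuation stays attached to the words.

```r
# Split text on whitespace and pair each word with the word that follows it.
get_bigrams <- function(text) {
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]           # drop any empty tokens
  if (length(words) < 2) {
    return(character(0))                  # no pairs to form
  }
  # head(words, -1) is every word but the last; tail(words, -1) every word but the first
  paste(head(words, -1), tail(words, -1))
}

tweet <- "I saw John Smith today"
print(get_bigrams(tweet))
```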
This returns every pair of words in the tweet. From here, I can look through each of these bigrams for names. To make the search for names a little easier, I throw out any bigram in which either word doesn’t start with a capital letter.
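The capitalization filter could be sketched with a regular expression like this. The example bigrams are hard-coded for illustration, and the pattern assumes each bigram is exactly two whitespace-separated words.

```r
# Example bigrams; in practice these would come from the tweet text.
bigrams <- c("I saw", "saw John", "John Smith", "Smith today")

# Match bigrams where both words begin with an upper-case letter.
cap_pattern <- "^[[:upper:]][^[:space:]]*[[:space:]]+[[:upper:]][^[:space:]]*$"
cap_bigrams <- bigrams[grepl(cap_pattern, bigrams)]

print(cap_bigrams)
```

Note that a sentence-initial word such as “I” is capitalized too, so this filter narrows the candidates but does not identify names on its own.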
Now that I have a list of bigrams in which both words start with capital letters, I can compare the words to the name list to see if they are names. I start with the last name. If the second word in the bigram doesn’t appear in the last name list, we can stop… there’s no need to check the first name. If the second word is a last name, then we check the first word against the first names list. If both of these check out, we have ourselves a name. Here’s the code for that…
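A minimal sketch of that check might look like the following. The function name `is_name` and the tiny name lists are made up for illustration, and the comparison assumes the name lists are stored in upper case.

```r
# Tiny stand-ins for the census name lists (assumed upper case).
first_names <- c("JOHN", "MARY")
last_names  <- c("SMITH", "JOHNSON")

# TRUE if the bigram looks like "First Last" according to the name lists.
is_name <- function(bigram, first_names, last_names) {
  words <- strsplit(bigram, "[[:space:]]+")[[1]]
  if (length(words) != 2) {
    return(FALSE)
  }
  # Check the last name first; if it fails, skip the first-name lookup.
  if (!(toupper(words[2]) %in% last_names)) {
    return(FALSE)
  }
  toupper(words[1]) %in% first_names
}

candidates <- c("John Smith", "Cook County", "Mary Poppins")
hits <- vapply(candidates, is_name, logical(1),
               first_names = first_names, last_names = last_names)
found <- candidates[hits]
print(found)
```

Checking the last name first means most bigrams are rejected after a single lookup.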
The full code for this can be found here… https://github.com/corynissen/cook-county-tweet-dashboard/blob/master/cctweets/findNames.R
I have tried the openNLP package for this and couldn’t get it to work reliably and quickly, so I made my own. If you have any suggestions on how to do this better, let me know!
Follow me on twitter… https://twitter.com/corynissen
To leave a comment for the author, please follow the link and comment on their blog: Stats and things.