Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I don’t normally work with personally identifiable information such as email addresses. However, the recent Ashley Madison data dump got me thinking about how I’d examine a data set made up of email addresses. Which characteristics of an email address would I want to extract, and how would I do that in R? Here’s some quick R code that extracts the local part (everything before the @), the host, and the top-level domain, and flags whether each address contains a digit or an underscore. From there, we can summarize the data by any of those characteristics.
library(stringr)  # provides str_c() and str_detect()

df <- data.frame(email = c("one@gkn.com", "two132@wern.com", "three@fu.com", "four@huo.com", "five@hoi.net",
                           "ten@hoinse.com", "four99@huo.com", "two@wern.gov", "f_ive@hoi.com", "six@ihoio.gov"),
                 stringsAsFactors = FALSE)

df$one   <- sub("@.*$", "", df$email)   # local part: everything before the "@"
df$two   <- sub('.*@', '', df$email)    # host: everything after the "@"
df$three <- sub('.*\\.', '', df$email)  # top-level domain: everything after the last "."

# Flag addresses that contain any digit
num <- c(0:9); num
num_match <- str_c(num, collapse = "|"); num_match
df$num_yn <- as.numeric(str_detect(df$email, num_match))

# Flag addresses that contain an underscore
und <- c("_"); und
und_match <- str_c(und, collapse = "|"); und_match
df$und_yn <- as.numeric(str_detect(df$email, und_match))

> df
             email    one        two three num_yn und_yn
1      one@gkn.com    one    gkn.com   com      0      0
2  two132@wern.com two132   wern.com   com      1      0
3     three@fu.com  three     fu.com   com      0      0
4     four@huo.com   four    huo.com   com      0      0
5     five@hoi.net   five    hoi.net   net      0      0
6   ten@hoinse.com    ten hoinse.com   com      0      0
7   four99@huo.com four99    huo.com   com      1      0
8     two@wern.gov    two   wern.gov   gov      0      0
9    f_ive@hoi.com  f_ive    hoi.com   com      0      1
10   six@ihoio.gov    six  ihoio.gov   gov      0      0
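As a rough illustration of the summarizing step mentioned earlier, here is a minimal sketch in base R. It rebuilds the same sample addresses so that it runs on its own. The column names `host`, `tld`, and `has_digit` are my own labels for illustration (they correspond roughly to `two`, `three`, and `num_yn` above), and the particular summaries are just one possible choice.

```r
# Sample addresses copied from the example above.
emails <- c("one@gkn.com", "two132@wern.com", "three@fu.com", "four@huo.com",
            "five@hoi.net", "ten@hoinse.com", "four99@huo.com", "two@wern.gov",
            "f_ive@hoi.com", "six@ihoio.gov")

df <- data.frame(email = emails, stringsAsFactors = FALSE)

# Illustrative column names (assumptions, not from the original code).
df$host      <- sub(".*@", "", df$email)    # everything after the "@"
df$tld       <- sub(".*\\.", "", df$host)   # everything after the last "."
df$has_digit <- grepl("[0-9]", df$email)    # TRUE if the address contains a digit

# Number of addresses per top-level domain, most common first.
tld_counts <- sort(table(df$tld), decreasing = TRUE)
print(tld_counts)

# Share of addresses containing a digit, within each top-level domain.
digit_share <- tapply(df$has_digit, df$tld, mean)
print(digit_share)

# Hosts that appear more than once in the data.
host_counts  <- table(df$host)
repeat_hosts <- names(host_counts[host_counts > 1])
print(repeat_hosts)
```

On this toy data, `com` accounts for 7 of the 10 addresses and `huo.com` is the only host that appears twice. With a real data set, the same `table()` and `tapply()` calls would scale to whatever grouping column is of interest.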
What about you? If you regularly work with email addresses and have some useful insights for the rest of us, please leave a comment below. How do you usually approach a data set that is just a large list of email addresses?