Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
When addressing somebody unknown to you who has an uncommon name, e.g. in an email, you might not know whether this person is male or female. In this post, we make it a fun little project to let R help us with that, so read on!
Of course, R cannot figure out the gender just by looking at a name; we need some data! A very impressive dataset can be found here: Gender by Name Data Set.
This dataset contains nearly one hundred fifty thousand first/given names of male and female babies. The source datasets come from government authorities:
- US: Baby Names from Social Security Card Applications – National Data, 1880 to 2019
- UK: Baby names in England and Wales Statistical bulletins, 2011 to 2018
- Canada: British Columbia 100 Years of Popular Baby names, 1918 to 2018
- Australia: Popular Baby Names, Attorney-General’s Department, 1944 to 2019
NB: Because of the origin of the data, the categories here are strictly binary (male/female) and not gender-diverse.
We can now write a simple R function which formats the output a little bit and provides us with percentage values in case the name is used for both genders:
    name_gender_data <- read.csv("data/name_gender_dataset.csv") # change path accordingly

    name_gender <- function(name) {
      data <- name_gender_data[name_gender_data$Name == name, 1:3]
      data <- cbind(data[1:2], round(data[3] / sum(data[3]), 3) * 100)
      colnames(data) <- c("Name", "Gender", "Percent")
      rownames(data) <- NULL
      data
    }
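To see where the percentages come from, here is a minimal sketch of the same calculation on a tiny made-up table. The counts are invented for illustration (they are not taken from the dataset), and the column name `Count` is an assumption; the function above only relies on the third column holding the numbers.

```r
# Toy stand-in for the real dataset; the counts are invented for illustration.
toy <- data.frame(
  Name   = c("Charlie", "Charlie", "Elle"),
  Gender = c("M", "F", "F"),
  Count  = c(869, 131, 50),   # assumed column name, hypothetical values
  stringsAsFactors = FALSE
)

# Keep only the rows for one name, as name_gender() does.
rows <- toy[toy$Name == "Charlie", 1:3]

# Share of each gender among all entries for that name, in percent.
percent <- round(rows$Count / sum(rows$Count), 3) * 100
percent
```

Dividing by `sum(rows$Count)` is what makes the values for one name add up to 100, so a name that appears with only one gender always comes out at 100 percent.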
I, of course, start by trying the function on my own name:

    name_gender("Holger")
    ##     Name Gender Percent
    ## 1 Holger      M     100
Now, how about a name not everybody might know the gender of, “Emre”:
    name_gender("Emre")
    ##   Name Gender Percent
    ## 1 Emre      M     100
Same with “Elle”:
    name_gender("Elle")
    ##   Name Gender Percent
    ## 1 Elle      F     100
How about names that are given to both genders, like “Charlie”:
    name_gender("Charlie")
    ##      Name Gender Percent
    ## 1 Charlie      M    86.9
    ## 2 Charlie      F    13.1
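If you only want the more likely gender for such a name, one option is to keep just the row with the largest share. The following is a sketch under assumptions: `likely_gender` is a hypothetical helper that is not part of the original post, the toy counts are invented, and the column name `Count` is assumed.

```r
# Toy stand-in for the real dataset; the counts are invented for illustration.
toy <- data.frame(
  Name   = c("Charlie", "Charlie", "Elle"),
  Gender = c("M", "F", "F"),
  Count  = c(869, 131, 50),   # assumed column name, hypothetical values
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent gender for a name and its share in percent.
# Assumes the name occurs in the data at least once.
likely_gender <- function(name, data) {
  rows <- data[data$Name == name, ]
  top <- which.max(rows$Count)   # index of the row with the highest count
  data.frame(
    Name    = name,
    Gender  = rows$Gender[top],
    Percent = round(rows$Count[top] / sum(rows$Count), 3) * 100
  )
}

likely_gender("Charlie", toy)
```

Keeping the share next to the gender matters here: a result close to 50 percent is a hint that the name alone does not tell you much.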
And, as the last example, what happens when the name is not included in the data:
    name_gender("nobody")
    ## [1] Name    Gender  Percent
    ## <0 rows> (or 0-length row.names)
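Note that the comparison with `==` is case-sensitive, so a lower-case input such as "holger" would also come back empty. A possible variant, sketched here with invented toy data, compares names in lower case and returns `NA` instead of an empty data frame. `lookup_gender` and the column name `Count` are assumptions for illustration, not part of the original function.

```r
# Toy stand-in for the real dataset; the counts are invented for illustration.
toy <- data.frame(
  Name   = c("Charlie", "Charlie", "Elle"),
  Gender = c("M", "F", "F"),
  Count  = c(869, 131, 50),   # assumed column name, hypothetical values
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive lookup that returns NA for unknown names.
lookup_gender <- function(name, data) {
  rows <- data[tolower(data$Name) == tolower(name), ]
  if (nrow(rows) == 0) {
    return(NA_character_)   # name not found in the data
  }
  as.character(rows$Gender[which.max(rows$Count)])
}

lookup_gender("elle", toy)    # matches "Elle" despite the lower-case input
lookup_gender("nobody", toy)  # not in the table, so NA
```

Returning `NA` makes it easy to run the lookup over a whole vector of names, e.g. with `vapply()`, and to filter out the unknown ones afterwards.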
I hope that you enjoyed this little project and that it will prove helpful. Do you have other ideas about what to do with this dataset? Leave them in the comments!