The Gender of Big Data
[This article was first published on Ripples, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
When I grow up I want to be a dancer (Carmen, my beautiful daughter)
The presence of women in positions of responsibility at Big Data companies is far from parity: while approximately 50% of the world's population are women, only 7% of the CEOs of the Top 100 Big Data companies are. Like it or not, technology seems to be a guy thing.
To run this experiment, I scraped the list of Big Data companies from here. I also used a very interesting package called genderizeR, which predicts gender from first names (more info here).
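The script in this post keeps only predictions with a probability above 0.9 that are backed by more than 15 records. As a rough sketch of what that filter does, here is a base R version run on a made-up data frame. The column names (name, gender, probability, count) are assumed to mirror what findGivenNames() returns, and every value is invented for illustration:

```r
# Hypothetical predictions; names and numbers are invented for illustration
predictions <- data.frame(
  name        = c("adam", "jane", "robin", "kim"),
  gender      = c("male", "female", "male", "female"),
  probability = c(0.99, 0.98, 0.62, 0.95),
  count       = c(1200, 900, 300, 10),
  stringsAsFactors = FALSE
)

# Same thresholds as the script: keep confident, well-supported predictions only
confident <- subset(predictions, probability > 0.9 & count > 15)

# Number of retained names per predicted gender
totals <- table(confident$gender)
print(totals)

# Share of each gender among the retained names, in percent
shares <- round(100 * prop.table(totals), 1)
print(shares)
```

Names that fail either threshold ("robin" on probability, "kim" on count in this toy data) are simply dropped, so the final percentages describe only the executives whose gender could be predicted with confidence.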
Here is the code:
library(rvest)
library(stringr)
library(dplyr)
library(genderizeR)
library(ggplot2)
library(googleVis)

# Build the URLs of the three CRN 2015 Big Data 100 slide shows
paste0("http://www.crn.com/slide-shows/data-center/300076704/2015-big-data-100-business-analytics.htm/pgno/0/", 1:45) %>%
  c(., paste0("http://www.crn.com/slide-shows/data-center/300076709/2015-big-data-100-data-management.htm/pgno/0/", 1:30)) %>%
  c(., paste0("http://www.crn.com/slide-shows/data-center/300076740/2015-big-data-100-infrastructure-tools-and-services.htm/pgno/0/", 1:25)) -> webpages

# Scrape the company name and its top executive from each slide
results=data.frame()
for(x in webpages)
{
  read_html(x) -> page  # download each page only once
  page %>% html_nodes("p:nth-child(1)") %>% .[[2]] %>% html_text() -> Company
  page %>% html_nodes("p:nth-child(2)") %>% .[[1]] %>% html_text() -> Executive
  results=rbind(results, data.frame(Company, Executive))
}
results=data.frame(lapply(results, as.character), stringsAsFactors=FALSE)

# Manual correction of row 74
results[74,]=c("Trifacta", "Top Executive: CEO Adam Wilson")

# Strip job titles and keep the first remaining word as the given name
# (backslashes must be doubled inside an R string)
results %>%
  mutate(Name=gsub("Top|\\bExec\\S*|\\bCEO\\S*|President|Founder|and|Co-Founder|:", "", Executive)) %>%
  mutate(Name=word(str_trim(Name))) -> results

# Predict gender and keep only confident predictions
results %>%
  select(Name) %>%
  findGivenNames() %>%
  filter(probability > 0.9 & count > 15) %>%
  as.data.frame() -> data

# Count executives by predicted gender and draw a doughnut chart
data %>% group_by(gender) %>% summarize(Total=n()) -> dat
doughnut=gvisPieChart(dat,
  options=list(
    width=450, height=450,
    legend="{ position: 'bottom', textStyle: {fontSize: 10}}",
    chartArea="{left:25,top:50}",
    title='TOP 100 BIG DATA COMPANIES 2015 Gender of CEOs',
    colors="['red','blue']",
    pieHole=0.5),
  chartid="doughnut")
plot(doughnut)
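To see what the name-cleaning step does without hitting the website, here is a small sketch that applies the same pattern to a few sample strings. The strings are hypothetical (only the Trifacta one appears in the script), and base R's trimws() and strsplit() stand in for str_trim() and word() so the snippet needs no packages:

```r
# Hypothetical executive strings, in the format the scraper is expected to return
executives <- c(
  "Top Executive: CEO Adam Wilson",
  "Top Executive: Co-Founder and CEO Jane Doe",
  "Top Executive: CEO Sandra Smith"
)

# Same pattern as the script: drop job titles and the colon
pattern <- "Top|\\bExec\\S*|\\bCEO\\S*|President|Founder|and|Co-Founder|:"
stripped <- trimws(gsub(pattern, "", executives))

# Keep the first remaining word as the given name
# (base R stand-in for stringr::word)
first_names <- vapply(strsplit(stripped, "\\s+"), `[`, character(1), 1)
print(first_names)
```

The first two strings reduce to "Adam" and "Jane". The third one comes out as "Sra", because the bare `and` alternative also matches inside "Sandra". Names mangled this way are unlikely to survive the probability and count filter, so they silently drop out of the totals. Anchoring that alternative with word boundaries (`\\band\\b`) would be one possible way to avoid it.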
To leave a comment for the author, please follow the link and comment on their blog: Ripples.