Post also available with code executed inline at rpubs.com.
O’Reilly recently published the results of a survey of Strata Conference attendees covering tool usage and salary. The entire survey is available for download. In the survey results, R was heralded as second only to SQL as a tool used by conference attendees. A chart from the survey appeared in this post and elsewhere online.
These two technologies overlap a bit but are highly complementary. SQL can be used to quickly extract data from relational databases and to filter, order, and summarize it. SQL queries can be executed from R itself, or run in another language to produce a CSV file that R then imports. R can then perform additional filtering, ordering, and summarizing, and can be used for more sophisticated analysis, reshaping of data, and presentation in a final form.
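As a minimal sketch of that hand-off (assuming a hypothetical SQLite database survey.db containing a tools table – the file, table, and column names here are illustrative only), a query can be run through DBI and used directly, or a CSV produced by a query run elsewhere can be read in:

# Hypothetical example: query a SQLite database from R via DBI/RSQLite.
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), 'survey.db')
tool.counts <- dbGetQuery(con, "
  SELECT tool, COUNT(*) AS respondents
  FROM tools
  GROUP BY tool
  ORDER BY respondents DESC")
dbDisconnect(con)

# Alternatively, read a CSV produced by a query run outside of R:
# tool.counts <- read.csv('tool_counts.csv')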
As part of an in-progress R screencast, I wanted to speculate a bit about the most common “clusters” of technologies that are popular among R users (at least among the Strata Conference respondents). Although the raw data from the survey is not available, the chart cited in the survey results includes enough information to do a bit of additional analysis. I reconstructed the original graph as a starting point, with the intention of splitting out the data and non-data roles into faceted bar charts. This would make the usage reported among non-data respondents a bit clearer.
So the first step was to replicate the original plot with a few cosmetic and editorial updates – the misspelled “Respodents” label does not appear in the new version. This involved the use of reshape2 and ggplot2.
library(reshape2)
library(ggplot2)
With these available, I created the data frame by combining a few vectors containing the data of interest.
data.science.tools <- as.data.frame(t(rbind(
  c('All Respondents',
    'SQL', 'R', 'Python', 'Excel', 'Hadoop', 'Java',
    'Network/Graph', 'JavaScript', 'Tableau', 'D3',
    'Mahout', 'Ruby', 'SAS/SPSS'),
  c(57, 42, 33, 26, 25, 23, 17, 16, 7, 15, 8, 7, 5, 9),
  c(43, 29, 10, 15, 11, 12, 17, 4, 13, 4, 5, 6, 6, 2)
)))
names(data.science.tools) <- c('DataTool', 'Data', 'NonData')
At this point, the results match up with what appeared in the chart from the O’Reilly report. The numbers represent the percentage of respondents who use the given tool.
data.science.tools
DataTool Data NonData
1 All Respondents 57 43
2 SQL 42 29
3 R 33 10
4 Python 26 15
5 Excel 25 11
6 Hadoop 23 12
7 Java 17 17
8 Network/Graph 16 4
9 JavaScript 7 13
10 Tableau 15 4
11 D3 8 5
12 Mahout 7 6
13 Ruby 5 6
14 SAS/SPSS 9 2
The data is easier to deal with if reshaped using melt.
data.science.tools.df <- melt(
  data.science.tools,
  id.vars = 'DataTool',
  variable.name = 'Role',
  value.name = 'Respondents'
)
The resulting data frame:
data.science.tools.df
DataTool Role Respondents
1 All Respondents Data 57
2 SQL Data 42
3 R Data 33
4 Python Data 26
5 Excel Data 25
6 Hadoop Data 23
7 Java Data 17
8 Network/Graph Data 16
9 JavaScript Data 7
10 Tableau Data 15
11 D3 Data 8
12 Mahout Data 7
13 Ruby Data 5
14 SAS/SPSS Data 9
15 All Respondents NonData 43
16 SQL NonData 29
17 R NonData 10
18 Python NonData 15
19 Excel NonData 11
20 Hadoop NonData 12
21 Java NonData 17
22 Network/Graph NonData 4
23 JavaScript NonData 13
24 Tableau NonData 4
25 D3 NonData 5
26 Mahout NonData 6
27 Ruby NonData 6
28 SAS/SPSS NonData 2
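Before converting, it is worth confirming what melt produced. Because rbind coerced the mixed character and numeric vectors to character when the data frame was built, the Respondents column is not yet numeric (on older versions of R, where stringsAsFactors defaults to TRUE, it may even be a factor). An optional structure check makes this visible:

# Optional sanity check: Respondents arrives as character (or factor
# on older R), since rbind() coerced the mixed vectors.
str(data.science.tools.df)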
Convert the Respondents column into the required numeric type. Going through as.character() first guards against the classic factor gotcha, where as.numeric() on a factor returns the underlying level codes rather than the values:

data.science.tools.df$Respondents <- as.numeric(
  as.character(data.science.tools.df$Respondents)
)
Create the original chart:

ggplot(data = data.science.tools.df,
       aes(x = reorder(DataTool,
                       Respondents,
                       function(x) max(x)),
           y = Respondents,
           fill = Role)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  theme(axis.title.y = element_blank())
Now create the faceted version:

ggplot(data = data.science.tools.df,
       aes(x = reorder(DataTool,
                       Respondents,
                       function(x) max(x)),
           y = Respondents,
           fill = Role)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  facet_grid(. ~ Role) +
  theme(axis.title.y = element_blank())
Those in the non-data role appear to come largely from a more traditional software development/programming background. The top tool in use after SQL is Java, followed by Python and JavaScript. Hadoop, as a Java-based framework, is closely related. Excel is used more than R among this group, which suggests a fascinating opportunity for R. Spreadsheets are and will remain useful, but anyone involved in data munging and analysis can benefit from R. As has been oft-trumpeted, scripted R programs are far more controlled and disciplined than clicking around in a spreadsheet, and they promote reproducible, less error-prone results. Ruby ranks a bit higher among the non-data users than among the data users, and SAS/SPSS usage is minimal, which also fits with a programmer audience.
To get a closer look at the “NonData” role:

ggplot(data = data.science.tools.df[
         data.science.tools.df$Role == 'NonData', ],
       aes(x = reorder(DataTool,
                       Respondents,
                       function(x) max(x)),
           y = Respondents)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  theme(axis.title.y = element_blank())
A number of tools are conspicuously lacking in the survey:
- Microsoft programming tools are completely absent.
- Command-line utilities (such as awk, sed, and sqlite3) do not appear.
- Perl is missing as well.
It would also be interesting to see related data about the respondents that undoubtedly impacts the results (mathematical proficiency, design abilities, typical data stores/database types accessed, and the typical audience for summarized data).
As I have been reviewing literature and educational resources on R, I have developed a stronger opinion that R, though a remarkably powerful functional programming language, has not been presented well to a programming audience. Most introductions to R are more palatable to statisticians and others who have data analysis to complete but are not strongly aligned with programmer culture and expectations. The fact that so many R packages are in essence full-fledged DSLs has further complicated R’s presentation. As I mentioned in my previous post, Hadley’s new book and RStudio are significant inroads that highlight R in a more programmer-friendly way. And the involvement of programmers at the Strata Conference and similar events will increase its visibility and accessibility as well.