A few months ago I decided to apply word frequency analysis ideas to R code. My idea was simple: the functions that one should invest the most effort into learning are precisely those functions that are used most frequently in real world code. In fact, this simple idea can be applied to spoken languages as valuably as to programming languages: the words that will most improve your ability to understand day-to-day speech in a foreign language are precisely the words that occur most frequently in day-to-day speech. Our language classrooms tend to screw this up by emphasizing words like “spoon” over “dude”, even though people my age will use the word “dude” many more times a day than they’ll use “spoon”.1 In large part, this is because the usefulness of a word tends to be confounded with its respectability when you learn a language in an academic setting. This over-emphasis on respectability is why you’re never taught curse words, even though they’re as practically useful as the majority of other words.
But I digress. Returning to R, the attached CSV data set contains the results of my word frequency analysis. To produce this data set, I spidered all of the source code for every R package on CRAN, split the resulting corpus into lexical tokens, and counted the occurrences of each token. To make things simpler, I decided to treat every token followed by a parenthesized expression as a function, even though this gives me some syntactical units like `if` in my list of functions.
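For the curious, here's a minimal sketch of that counting step, assuming the package sources are already unpacked under a local directory; the regex for R names is my own simplification, and it makes no attempt to skip comments or string literals:

```r
# Minimal sketch of the counting step: find every R name followed by an
# opening parenthesis across all .R files under a directory of unpacked
# CRAN package sources, then tally the matches.
count_calls <- function(dir) {
  files <- list.files(dir, pattern = "\\.R$", recursive = TRUE,
                      full.names = TRUE)
  tokens <- unlist(lapply(files, function(f) {
    code <- paste(readLines(f, warn = FALSE), collapse = "\n")
    # An R name (letters, digits, dots, underscores) followed by "(",
    # possibly with whitespace in between.
    regmatches(code, gregexpr("[A-Za-z.][A-Za-z0-9._]*(?=\\s*\\()",
                              code, perl = TRUE))[[1]]
  }))
  counts <- sort(table(tokens), decreasing = TRUE)
  data.frame(Function = names(counts), Occurrences = as.integer(counts))
}
```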
I think the resulting raw data set is probably as useful as any summary I could give of it, but I'll highlight a few obvious results for those interested.
Function call frequencies in R, unsurprisingly, seem to follow Zipf's law, or some variety thereof: on a log-log plot of rank against frequency, the points fall close to a straight line.
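If you'd like to reproduce that plot from the attached data, something like the following should work; the file name and the column names (Function, Occurrences) are my assumptions about the CSV:

```r
# Rank-frequency plot on log-log axes: a roughly straight line is the
# classic signature of a Zipf-style power law.
freqs <- read.csv("r_function_frequencies.csv")  # hypothetical file name
freqs <- freqs[order(-freqs$Occurrences), ]
plot(seq_len(nrow(freqs)), freqs$Occurrences, log = "xy",
     xlab = "Rank (log scale)", ylab = "Occurrences (log scale)")
```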
And here are the top 25 most frequently used functions in R, including some syntactical units:
| Function Name | Occurrences in Corpus |
|---|---|
| `if` | 333421 |
| `c` | 143123 |
| `function` | 137490 |
| `length` | 109847 |
| `paste` | 62906 |
| `cat` | 56199 |
| `return` | 56001 |
| `stop` | 54161 |
| `is.null` | 53575 |
| `list` | 51862 |
| `log` | 49479 |
| `for` | 48950 |
| `rep` | 39090 |
| `names` | 34439 |
| `sum` | 31475 |
| `as.integer` | 29417 |
| `matrix` | 29344 |
| `is.na` | 20230 |
| `dim` | 20202 |
| `max` | 19995 |
| `nrow` | 19789 |
| `as.double` | 19705 |
| `attr` | 18802 |
| `t` | 17889 |
|  | 17461 |
Some fun extensions to this work would be to use this frequency data set as a standard for the abnormality of code style: you could compare an individual programmer's code to the standard frequency data to see which functions such-and-such a programmer tends to overuse and underuse. I personally tend to underuse `stop` and I never use `attr` at all.
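Here's a rough sketch of how that comparison might work, reusing the hypothetical count_calls() from the sketch above; a positive log-ratio means you call a function more often than the corpus does:

```r
# Compare one programmer's call frequencies against the corpus standard.
corpus <- read.csv("r_function_frequencies.csv")
mine   <- count_calls("~/my_r_code")

merged <- merge(corpus, mine, by = "Function",
                suffixes = c(".corpus", ".mine"))
# Normalize raw counts to proportions so corpus size doesn't matter.
merged$p.corpus  <- merged$Occurrences.corpus / sum(merged$Occurrences.corpus)
merged$p.mine    <- merged$Occurrences.mine / sum(merged$Occurrences.mine)
# Positive log-ratio: overused relative to the corpus; negative: underused.
merged$log.ratio <- log(merged$p.mine / merged$p.corpus)
merged[order(merged$log.ratio), c("Function", "log.ratio")]
```

Note that functions you never call at all drop out of the inner join, so a more complete treatment would use an outer join with some smoothing.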
Another interesting project would be to use this data set as input to a text classifier that would attempt to predict the author of a piece of R code based on token frequency information, in line with Mosteller and Wallace's famous analysis of the Federalist Papers or anti-spam programs.
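A bare-bones version of such a classifier could be a multinomial naive Bayes over call counts; this sketch assumes you've already tallied one row of token counts per candidate author over a shared vocabulary:

```r
# Multinomial naive Bayes over function-call counts. author_counts is a
# matrix with one named row per candidate author and one column per token;
# unknown is the count vector for the disputed piece of code, with tokens
# in the same order.
predict_author <- function(author_counts, unknown) {
  # Laplace smoothing so a token unseen in one author's code doesn't
  # zero out that author's entire likelihood.
  smoothed  <- author_counts + 1
  log_probs <- log(smoothed / rowSums(smoothed))
  # Log-likelihood of the unknown counts under each author's distribution.
  scores <- log_probs %*% unknown
  rownames(author_counts)[which.max(scores)]
}
```

Dividing each row by its sum turns an author's counts into a probability distribution over tokens, and the matrix product sums the per-token log-probabilities weighted by how often each token appears in the disputed code.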
1. To this day, I still mix up the words for “spoon” (“cuchara”) and “knife” (“cuchillo”) in Spanish, though I never make any mistakes with slang words like “dude” (“tío”).