Fishing for packages in CRAN
by Joseph Rickert
It is incredibly challenging to keep up to date with R packages. As of today (6/16/15), there are 6,789 listed on CRAN. Of course, the CRAN Task Views are probably the best resource for finding what's out there. A tremendous amount of work goes into maintaining and curating these pages, and we should all be grateful for the expertise, dedication and efforts of the task view maintainers. But R continues to grow at a tremendous rate. (Have a look at the growth curve in Bob Muenchen's 5/22/15 post R Now Contains 150 Times as Many Commands as SAS.) CRANberries, a site that tracks new packages and package updates, indicates that over the last few months the list of R packages has been growing by about 100 packages per month. How can anybody hope to keep current?
So, on any given day, expect that finding out which R packages pertain to a particular topic will require some work. What follows is a beginner's guide to fishing for packages in CRAN. This example looks for "Bayesian" packages using some simple web page scraping and elementary text mining.
The Bayesian Inference Task View lists 144 packages. This is probably everything that is really important, but let's see what else is to be found that has anything at all to do with Bayesian inference. In the first block of code, R's available.packages() function fetches the list of packages available to my Windows PC. (This is an extremely interesting function and I don't do justice to it here.) Then, this list is used to scrape the package descriptions from the individual package web pages. The loop takes some time to run, so I saved the package descriptions both in a csv file and in a .RData workspace.
    library(svTools)
    library(RCurl)
    library(tm)
    #-----------------------------------------
    # TWO HELPER FUNCTIONS
    # Function to get package description from CRAN package page
    getDesc <- function(package){
      l1 <- regexpr("</h2>",package)
      ind1 <- as.integer(l1[[1]]) + 9
      l2 <- regexpr("Version",package)
      ind2 <- as.integer(l2[[1]]) - (46 + nchar("package"))
      desc <- substring(package,ind1,ind2)
      return(desc)
    }
    # Function to get CRAN package page
    getPackage <- function(name){
      url <- paste("http://cran.r-project.org/web/packages/",name,"/index.html",sep="")
      txt <- getURL(url,ssl.verifypeer=FALSE)
      return(txt)
    }
    #--------------------------------------------
    # SCRAPE PACKAGE DATA FROM CRAN
    # Get the list of R packages
    packages <- as.data.frame(available.packages())
    head(packages)
    dim(packages)
    pkgNames <- rownames(packages)
    rm(packages) # Don't need this any more
    # Pre-allocate one description slot per package
    pkgDesc <- character(length(pkgNames))
    for (i in seq_along(pkgNames)){
      pkgDesc[i] <- getDesc(getPackage(pkgNames[i]))
    }
    length(pkgDesc) #6598
    #----------------------------------------------
    # SOME HOUSEKEEPING
    # cranP <- data.frame(pkgNames,pkgDesc)
    # write.csv(cranP,"C:/DATA/CRAN/CRAN_pkgs_6_15_15")
    # save.image("pkgs.RData")
    # load("pkgs.RData")
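The fixed character offsets in getDesc() depend on the exact layout of the CRAN page, so they may stop working if the page changes. As a hedged alternative, here is a minimal sketch that takes whatever sits between the closing `</h2>` tag and the start of the first table. The function name getDescByTags, the toy page string, and the assumption that the description sits between those two tags are all mine, not something CRAN guarantees.

```r
# Hypothetical stand-in for the text returned by getPackage(); the real CRAN
# page layout may differ, so the tag positions are an assumption.
page <- paste0(
  "<html><body>",
  "<h2>toyPkg: A Toy Package</h2>",
  "<p>Fits toy Bayesian models\nto toy data.</p>",
  "<table><tr><td>Version:</td><td>0.1</td></tr></table>",
  "</body></html>"
)

# Take the text between the closing </h2> tag and the start of the table,
# then drop any remaining tags and collapse whitespace.
getDescByTags <- function(page) {
  start <- as.integer(regexpr("</h2>", page, fixed = TRUE))
  end   <- as.integer(regexpr("<table", page, fixed = TRUE))
  if (start < 0 || end < 0 || end < start) return(NA_character_)
  raw <- substring(page, start + nchar("</h2>"), end - 1)
  txt <- gsub("<[^>]+>", " ", raw)        # strip HTML tags
  txt <- gsub("[[:space:]]+", " ", txt)   # collapse runs of whitespace
  trimws(txt)
}

desc <- getDescByTags(page)
print(desc)
```

Returning NA when a tag is missing makes it easy to spot pages that did not parse, instead of silently storing a garbled substring.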
When I did this a few days ago, 6,598 packages were available. The next section of code turns the vector of package descriptions into a document corpus and creates a document term matrix with a row for each package and a column for each of 20,781 terms. Taking the transpose of the term matrix makes it easier to see what is going on. The matrix is extremely sparse: in the small portion shown in the code comments, only a single 1 shows up, and all of the terms are pretty much useless. Removing the sparse terms cuts the matrix down to only 372 terms.
    # SOME SIMPLE TEXT MINING
    # Make a corpus out of package descriptions
    pCorpus <- VCorpus(VectorSource(pkgDesc))
    pCorpus
    inspect(pCorpus[1:3])
    # Function to prepare corpus
    prepC <- function(corpus){
      c <- tm_map(corpus, stripWhitespace)
      c <- tm_map(c,content_transformer(tolower))
      c <- tm_map(c,removeWords,stopwords("english"))
      c <- tm_map(c,removePunctuation)
      c <- tm_map(c,removeNumbers)
      return(c)}
    pCorpusPrep <- prepC(pCorpus)
    #------------------------------------------------------------
    # Create the document term matrix
    dtm <- DocumentTermMatrix(pCorpusPrep)
    dtm
    # <<DocumentTermMatrix (documents: 6598, terms: 20781)>>
    # Non-/sparse entries: 142840/136970198
    # Sparsity           : 100%
    # Maximal term length: 83
    # Weighting          : term frequency (tf)
    # Work with the transpose to list keywords as rows
    inspect(t(dtm[100:105,90:105]))
    #               Docs
    # Terms          100 101 102 103 104 105
    #   accomodated    0   0   0   0   0   0
    #   accompanied    0   0   0   0   0   0
    #   accompanies    0   0   0   0   0   0
    #   accompany      0   0   0   0   0   0
    #   accompanying   0   0   0   0   0   0
    #   accomplished   0   0   0   0   0   0
    #   accomplishes   0   0   0   0   0   0
    #   accordance     0   0   0   0   0   0
    #   according      0   0   1   0   0   0
    #   accordingly    0   0   0   0   0   0
    #   accordinglyp   0   0   0   0   0   0
    #   account        0   0   0   0   0   0
    #   accounted      0   0   0   0   0   0
    #   accounting     0   0   0   0   0   0
    #   accountp       0   0   0   0   0   0
    #   accounts       0   0   0   0   0   0
    # Reduce the number of sparse terms
    dtms <- removeSparseTerms(dtm,0.99)
    dim(dtms)
    # 6598 372
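To make the sparse-term step less of a black box, here is a small base R sketch of the idea on four made-up descriptions. It is not tm's implementation. The helper dropSparse and the toy documents are hypothetical, and the rule it uses (keep a term only if it appears in more than a fraction 1 - sparse of the documents) is my understanding of what the sparse argument means. Under that reading, a threshold of 0.99 with 6,598 descriptions would keep terms that occur in more than about 66 of them.

```r
# Four made-up package descriptions (illustration only)
docs <- c(
  "bayesian inference for regression models",
  "tools for regression and plotting",
  "bayesian networks",
  "fast plotting of regression results"
)

# Split each description into lower-case words
tokens <- strsplit(tolower(docs), "[^a-z]+")
terms  <- sort(unique(unlist(tokens)))

# Rows are documents, columns are terms, entries are term counts
toyDtm <- t(sapply(tokens, function(w) table(factor(w, levels = terms))))
rownames(toyDtm) <- seq_along(docs)

# Keep only terms that appear in more than (1 - sparse) of the documents
dropSparse <- function(m, sparse) {
  docFreq <- colSums(m > 0)   # number of documents containing each term
  m[, docFreq > nrow(m) * (1 - sparse), drop = FALSE]
}

dim(toyDtm)
small <- dropSparse(toyDtm, 0.6)
colnames(small)
```

With only four documents the cut is drastic, but the same trade-off applies at CRAN scale: a keyword survives only if enough descriptions mention it.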
I am pretty much counting on some luck here, hoping that "Bayesian" will be one of the remaining 372 terms. This last bit of code finds 229 packages associated with the keyword "Bayesian".
    # Find the Bayesian packages
    dtmsT <- t(dtms)
    keywords <- row.names(dtmsT)
    bi <- which(keywords == "bayesian") # Find the index of an interesting keyword
    bayes <- as.matrix(dtmsT)[bi,] # as.matrix() avoids the console printing that inspect() does
    bayes_packages_index <- names(bayes[bayes==1])
    # Here are the "Bayesian" packages
    bayes_packages <- pkgNames[as.numeric(bayes_packages_index)]
    length(bayes_packages) #229
    # Here are the descriptions of the "Bayesian" packages
    bayes_pkgs_desc <- pkgDesc[bayes==1]
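One thing to keep in mind about the lookup: with term frequency weighting, the test bayes==1 selects descriptions in which the keyword appears exactly once, so a description that mentions it twice is skipped. The sketch below shows the difference on a hypothetical count matrix. The package names and counts are made up, and switching the real code to a "> 0" test would likely change the total of 229.

```r
pkgs <- c("pkgA", "pkgB", "pkgC", "pkgD")   # hypothetical package names

# Hypothetical term counts: one row per term, one column per package
counts <- matrix(
  c(1, 0, 2, 0,
    0, 1, 0, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("bayesian", "regression"), as.character(1:4))
)

hits <- counts["bayesian", ]

# Packages whose description uses the keyword exactly once
exactlyOnce <- pkgs[as.numeric(names(hits[hits == 1]))]

# Packages whose description uses the keyword at least once
atLeastOnce <- pkgs[as.numeric(names(hits[hits > 0]))]

print(exactlyOnce)
print(atLeastOnce)
```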
Here is the list of packages found.
Not all of these "fish" are going to be worth keeping, but at least we have reduced the search to something manageable. In 10 or 15 minutes of fishing you might catch something interesting.