Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
(This article has been adapted to the latest version of the rvest package.)
A large proportion of R's power can be attributed to its enormous number of extension packages, many of which are published on CRAN.
These packages cover a wide range of fields. In this post, I'll show you how to use R to scrape the titles of all CRAN packages from the web page and find out which keywords are the most popular.
To minimize the effort, we try our best to avoid reinventing the wheel and to get an answer as quickly as possible. We only use existing packages to do all the work.
Here is our toolbox that is useful in this task:
rvest: Scrape from the web page by selector
rlist: Quickly perform mapping and filtering in functional style
pipeR: Pipe all operations at high performance
First, we equip our R environment with these tools.
library(rvest)
library(rlist)
library(pipeR)
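To get a feel for how pipeR and rlist work together before applying them to the real page, here is a minimal sketch on two made-up titles. The titles and the 14-character threshold are assumptions chosen only for illustration; the functions used (list.map, list.filter and the %>>% operator) are the same ones the article relies on.

```r
library(rlist)
library(pipeR)

# Two made-up titles, used only for illustration
titles <- list("Tools for Data", "Bayesian Models")

long_titles <- titles %>>%
  list.map(tolower(.)) %>>%          # apply tolower() to each element
  list.filter(nchar(.) > 14L) %>>%   # keep elements longer than 14 characters
  unlist                             # flatten the list into a character vector

print(long_titles)
```

Inside list.map and list.filter the dot stands for the current element, which is how the later pipeline refers to each node.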
Then we download and parse the web page.
url <- "http://cran.r-project.org/web/packages/available_packages_by_date.html"
page <- html(url)
Now page is a parsed HTML document object that is well structured and ready to query. We need the text in the third column of the table. Here we use XPath to locate it; a CSS selector would do the same job.
The following code is written in a fluent style using pipelines.
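To illustrate that XPath and CSS selectors can reach the same cells, here is a small sketch that parses an inline HTML string instead of the live CRAN page. It assumes a recent rvest release (1.0.0 or later), where read_html() is used in place of html() and html_elements() accepts either a css or an xpath argument. The table contents and the names pkgA and pkgB are made up.

```r
library(rvest)

# A tiny stand-in for the CRAN table (made-up rows, not real packages)
sample_html <- '<table>
  <tr><th>Date</th><th>Package</th><th>Title</th></tr>
  <tr><td>2015-01-01</td><td>pkgA</td><td>Tools for Data Analysis</td></tr>
  <tr><td>2015-01-02</td><td>pkgB</td><td>Bayesian Regression Models</td></tr>
</table>'

sample_page <- read_html(sample_html)

# Third cell of every row, selected with XPath
by_xpath <- sample_page %>%
  html_elements(xpath = "//tr/td[3]") %>%
  html_text(trim = TRUE)

# The same cells, selected with a CSS selector
by_css <- sample_page %>%
  html_elements(css = "td:nth-child(3)") %>%
  html_text(trim = TRUE)

print(by_xpath)
print(by_css)
```

Both selectors skip the header row here because its cells are th elements rather than td.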
words <- page %>>%
  html_node("//tr//td[3]//text()", xpath = TRUE) %>>% # select the 3rd column
  list.map( # map each node to ...
    # 1. get the trimmed text in the XML node
    XML::xmlValue(.) %>>%
    # 2. split the text by non-word-letters
    strsplit("[^a-zA-Z]") %>>%
    # 3. put everything together in vector
    unlist(use.names = FALSE) %>>%
    # 4. lower all words
    tolower %>>%
    # 5. filter words with more than 3 letters to be meaningful
    list.filter(nchar(.) > 3L)) %>>%
  # put everything in a large character vector
  unlist %>>%
  # create a table of word count
  table %>>%
  # sort the table descending
  sort(decreasing = TRUE) %>>%
  # take out the first 100 elements
  head(100) %>>%
  # print out the results
  print

data analysis models with functions
864 718 484 404 371
package regression estimation model based
336 308 273 249 238
using tools from bayesian linear
235 225 194 173 169
methods time interface multivariate statistical
169 168 160 133 124
test generalized clustering tests series
114 112 105 105 104
inference statistics random distribution selection
101 101 100 97 96
modeling spatial algorithm multiple simulation
89 89 87 87 82
mixed method likelihood distributions modelling
81 78 77 76 73
network sets classification mixture sampling
72 70 68 67 64
effects robust sparse survival variable
63 63 60 60 60
high fitting gene function optimization
58 57 57 56 56
graphical testing networks files nonparametric
55 55 54 52 52
plots sample dimensional genetic multi
52 52 51 51 51
utilities visualization implementation density matrix
51 51 50 49 49
hierarchical lasso learning markov correlation
48 48 48 48 47
dynamic plot prediction censored
47 47 47 46 46
datasets gaussian response adaptive association
45 45 45 44 44
binary design least normal system
44 44 43 43 43
fast functional point analyses confidence
42 42 42 41 41
experiments graphics objects population process
41 41 41 41 41
The work is done, in 12 lines, in only a little more than 2 seconds!
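For readers who want to trace the counting steps without the pipeline packages, here is a minimal base-R sketch of the same idea. The three titles are made up for illustration, so the counts have nothing to do with the CRAN results; the splitting pattern and the more-than-3-letters filter follow the article's code.

```r
# Three made-up titles standing in for the scraped column
titles <- c("Tools for Data Analysis",
            "Bayesian Data Models",
            "Regression Models for Data")

# 1. split each title on non-letter characters and flatten the result
words <- unlist(strsplit(titles, "[^a-zA-Z]"), use.names = FALSE)

# 2. lower-case everything so "Data" and "data" count as one word
words <- tolower(words)

# 3. keep only words with more than 3 letters (also drops empty strings)
words <- words[nchar(words) > 3L]

# 4. count occurrences and sort the counts in decreasing order
counts <- sort(table(words), decreasing = TRUE)

print(head(counts, 3))
```

The short word "for" is dropped by the length filter, which is the same effect that keeps filler words out of the article's top-100 table.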
If you want to know more about these packages, please visit their project pages. I hope you can do more amazing things in your own work.