by Joseph Rickert
The installr package has some really nice functions for working with the daily package download logs for the RStudio CRAN mirror, which RStudio graciously makes available at http://cran-logs.rstudio.com/. The following code uses the download_RStudio_CRAN_data() function to download a month's worth of .gz compressed daily log files into the test3 directory and then uses the function read_RStudio_CRAN_data() to read all of these files into a data frame. (The portion of the status output shown indicates the files being read in one at a time.) Next, the function most_downloaded_packages() shows that the six most downloaded packages for the month were Rcpp, stringr, ggplot2, stringi, magrittr and plyr.
# CODE TO DOWNLOAD LOG FILES FROM RSTUDIO CRAN MIRROR,
# FIND MOST DOWNLOADED PACKAGES AND PLOT DOWNLOADS
# FOR SELECTED PACKAGES
# -----------------------------------------------------------------
library(installr)
library(ggplot2)
library(data.table)   # for downloading
# -----------------------------------------------------------------
# Read data from RStudio site
RStudio_CRAN_dir <- download_RStudio_CRAN_data(START = '2015-05-15', END = '2015-06-15',
                                               log_folder = "C:/DATA/test3")
# Read .gz compressed files from local directory
RStudio_CRAN_data <- read_RStudio_CRAN_data(RStudio_CRAN_dir)
#> RStudio_CRAN_data <- read_RStudio_CRAN_data(RStudio_CRAN_dir)
#Reading C:/DATA/test3/2015-05-15.csv.gz ...
#Reading C:/DATA/test3/2015-05-16.csv.gz ...
#Reading C:/DATA/test3/2015-05-17.csv.gz ...
#Reading C:/DATA/test3/2015-05-18.csv.gz ...
#Reading C:/DATA/test3/2015-05-19.csv.gz ...
#Reading C:/DATA/test3/2015-05-20.csv.gz ...
#Reading C:/DATA/test3/2015-05-21.csv.gz ...
#Reading C:/DATA/test3/2015-05-22.csv.gz ...

dim(RStudio_CRAN_data)
# [1] 8055660      10

# Find the most downloaded packages
pkg_list <- most_downloaded_packages(RStudio_CRAN_data)
pkg_list
#    Rcpp  stringr  ggplot2  stringi magrittr     plyr
#  125529   115282   103921   103727   102083    97183

lineplot_package_downloads(names(pkg_list), RStudio_CRAN_data)

# Look at plots for some packages
barplot_package_users_per_day("checkpoint", RStudio_CRAN_data)
#$total_installations
#[1] 359

barplot_package_users_per_day("Rcpp", RStudio_CRAN_data)
#$total_installations
#[1] 23832
The function lineplot_package_downloads() produces a multiple time series plot of daily downloads for the most downloaded packages:
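If you want more control over the appearance of this plot, the same time series can be computed by hand with data.table and ggplot2 (both loaded above). This is a minimal sketch, assuming RStudio_CRAN_data has the standard log columns date and package, and reusing pkg_list from the code above; the counts are simply the number of log rows per package per day.

# A rough hand-rolled equivalent of lineplot_package_downloads()
# (sketch; assumes RStudio_CRAN_data and pkg_list exist as created above)
top_pkgs <- names(pkg_list)
dt <- as.data.table(RStudio_CRAN_data)
daily <- dt[package %in% top_pkgs,
            .(downloads = .N),                      # count log rows per day and package
            by = .(date = as.Date(date), package)]
ggplot(daily, aes(x = date, y = downloads, color = package)) +
  geom_line() +
  labs(x = "Date", y = "Downloads",
       title = "Daily downloads from the RStudio CRAN mirror")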
The barplot_package_users_per_day() function provides per-package download plots. Here we contrast downloads for Revolution Analytics' checkpoint package and for Rcpp.
Downloads for the checkpoint package look fairly uniform over the month. checkpoint is a relatively new, specialized package for dealing with reproducibility issues, and the download pattern probably reflects users discovering it. Rcpp, on the other hand, is essential to an enormous number of other R packages. The right-skewed plot most likely represents the tail end of the download cycle that started after Rcpp was upgraded on 5/1/15.
All of this works well for small amounts of data. However, because read_RStudio_CRAN_data() puts everything into a single in-memory data frame, longer time periods are a problem with the 6GB of RAM on my laptop. So, after downloading the files covering the period 5/28/14 to 5/28/15 to my laptop, I first converted the .gz compressed logs to .csv files:
# Convert .gz compressed files to .csv files
in_names <- list.files("C:/DATA/RStudio_logs_1yr_gz", pattern="*.csv.gz", full.names=TRUE)
out_names <- sapply(strsplit(in_names, ".g", fixed = TRUE), "[[", 1)
length(in_names)

for(i in 1:length(in_names)){
  df <- read.csv(in_names[i])                  # read.csv() decompresses the .gz file transparently
  write.csv(df, out_names[i], row.names=FALSE) # write it back out as a plain .csv
}
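If you have a recent version of data.table (already loaded above), the conversion loop can be sped up with fread() and fwrite(). This is just a sketch: fwrite() requires data.table 1.9.8 or later, and fread() only reads .gz files directly in newer releases (with the R.utils package installed); otherwise fall back to read.csv() as in the loop above.

# Faster .gz to .csv conversion with data.table (sketch; see caveats above)
in_names  <- list.files("C:/DATA/RStudio_logs_1yr_gz",
                        pattern = "\\.csv\\.gz$", full.names = TRUE)
out_names <- sub("\\.gz$", "", in_names)
for (i in seq_along(in_names)) {
  dt <- fread(in_names[i])      # read one compressed daily log
  fwrite(dt, out_names[i])      # write it back out as a plain .csv
}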
I then used the external memory algorithms in Revolution R Enterprise to work with the data on disk. First, rxImport() brings all of the .csv files into a single .xdf file stored on my laptop. (Note that the rxGetInfo() output below shows that the file has over 90 million rows.) Then the very efficient rxCube() function is used to tabulate the package counts.
# REVOSCALER CODE TO IMPORT A YEAR'S WORTH OF DATA
data_dir <- "C:/DATA/RStudio_logs_1yr"
in_names <- list.files(data_dir, pattern="*.csv.gz", full.names=TRUE)
out_names <- sapply(strsplit(in_names, ".g", fixed = TRUE), "[[", 1)
#----------------------------------------------------
# Import to .xdf file
# Establish the column classes for the variables
colInfo <- list(
  list(name = "date",      type = "character"),
  list(name = "time",      type = "character"),
  list(name = "size",      type = "integer"),
  list(name = "r_version", type = "factor"),
  list(name = "r_arch",    type = "factor"),
  list(name = "r_os",      type = "factor"),
  list(name = "package",   type = "factor"),
  list(name = "version",   type = "factor"),
  list(name = "country",   type = "factor"),
  list(name = "ip_id",     type = "integer"))

num_files <- length(out_names)
out_file <- file.path(data_dir, "RStudio_logs_1yr")
append = FALSE
for(i in 1:num_files){
  rxImport(inData = out_names[i], outFile = out_file, colInfo = colInfo,
           append = append, overwrite = TRUE)
  append = TRUE
}

# Look at a summary of the imported data
rxGetInfo(out_file)
#File name: C:\DATA\RStudio_logs_1yr\RStudio_logs_1yr.xdf
#Number of observations: 90200221
#Number of variables: 10

# Long form tabulation
cube1 <- rxCube(~ package, data = out_file)
# Computation time: 5.907 seconds.
cube1 <- as.data.frame(cube1)
sort1 <- rxSort(cube1, decreasing = TRUE, sortByVars = "Counts")
#Time to sort data file: 0.078 seconds
write.csv(head(sort1, 100), "Top_100_Packages.csv")
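For readers without access to Revolution R Enterprise, a comparable tabulation can be built up one daily file at a time with data.table, so that only a single day's log is ever in memory. This is a sketch that assumes the same directory of uncompressed .csv log files used above; the output file name is just illustrative.

# Count downloads per package, one daily .csv file at a time, then sum
csv_files <- list.files("C:/DATA/RStudio_logs_1yr",
                        pattern = "\\.csv$", full.names = TRUE)
counts_list <- lapply(csv_files, function(f)
  fread(f, select = "package")[, .(Counts = .N), by = package])
pkg_counts <- rbindlist(counts_list)[, .(Counts = sum(Counts)), by = package]
setorder(pkg_counts, -Counts)      # sort from most to least downloaded
head(pkg_counts, 10)
write.csv(head(pkg_counts, 100), "Top_100_Packages_dt.csv", row.names = FALSE)

The per-file aggregation keeps memory use modest at the cost of more I/O; the rxCube() call above does essentially the same counting against the single .xdf file.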
Here are the download counts for the top 100 packages for the period 5/28/14 to 5/28/15.
You can download this data here: Top_100_Packages.csv