Exploding, Impacting: looking at bioRxiv preprint view dynamics with R
One of the joys of posting a preprint is seeing that people are viewing, downloading and (hopefully) reading your paper. On bioRxiv you can check out the statistics for your paper in the metrics tab.
We posted a preprint recently and it clocked up over 1,000 views in the first day or so. This made me wonder: is that a lot of views or not? How does it compare to other preprints in our category? I wrote some code to find out. It turns out our paper was the third most viewed cell biology preprint in September. My co-authors and I were very grateful for the interest!
Anyway, this post is about the code and how to get the necessary data to look at preprint view metrics. I’ll use it to look at the more interesting question of preprint view dynamics in our category “Cell Biology” this year. Read on for the R stuff, or skip to the findings.
tl;dr a cell biology preprint gets 50% of its lifetime views in the first 4-5 days.
Outline
We'll use rbiorxiv to get a list of preprints of interest. Then we'll use this list to scrape the metrics data with rvest. Finally, we'll wrangle the data and make some plots with ggplot and friends.
There are three metrics summarised for each month: abstract, full text and PDF. Abstract is usually the biggest number for two reasons. First, unless a reader clicks on a link to the PDF or the full text specifically, the landing page is the abstract, so a typical reader views the abstract and then clicks through to the PDF or full text. Second, the full text takes a day or so to render, so full-text views are usually the lowest number.
The final thing to note is that if a preprint is posted on the last day of the month, it will have one day’s worth of metrics, whereas a preprint posted at the beginning of the month will have a full month’s worth – more on this later.
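To make that imbalance concrete, here's a minimal sketch using lubridate (a core tidyverse package); the posting date is just an illustrative example:

library(lubridate)

# days of metrics a preprint accrues in its posting month:
# from the (example) posting date to the last day of that month, inclusive
posting_date <- as.Date("2024-09-17")
as.numeric(ceiling_date(posting_date, "month") - days(1) - posting_date) + 1
#> 14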
The code
# load packages
library(rbiorxiv)
library(tidyverse)
library(rvest)

# get metadata for all preprints posted in 2024 so far
df <- biorxiv_content(from = "2024-01-01", to = "2024-09-17", limit = "*", format = "df")

# filter for category "cell biology"
df_all <- subset(df, df$category == "cell biology")

# select version 1 only
df_all <- subset(df_all, df_all$version == 1)

# convert the jatsxml column to a metrics url by swapping ".source.xml"
# for ".article-metrics" at the end of each link
df_all$url <- gsub(".source.xml", ".article-metrics", df_all$jatsxml)
This gives us URLs to scrape for all first-version preprints in the cell biology category in 2024. Note that subsequent versions of a preprint show the same metrics data, so we select the first version to reduce the amount of scraping.
# for every url in df_all, get the metrics
# create an empty data frame first to store the results
usage_metrics_df <- data.frame()

# loop through each url and extract the usage metrics
for (i in 1:nrow(df_all)) {
  ex_paper <- tryCatch(
    read_html(df_all$url[i]),
    error = function(e) {
      message(paste("Error reading URL:", df_all$url[i]))
      return(NULL)
    }
  )
  if (is.null(ex_paper)) {
    next
  }
  # the metrics table sits in a .highwire-stats node; skip the paper if the
  # table is missing or doesn't have the expected four columns
  usage_metrics <- tryCatch(
    ex_paper %>%
      html_node(".highwire-stats") %>%
      html_table(),
    error = function(e) NULL
  )
  if (is.null(usage_metrics) || ncol(usage_metrics) != 4) {
    next
  }
  # clean up the table: name the columns, strip thousands separators,
  # and parse the month (the year defaults to the current year, which is
  # fine for 2024-only data)
  usage_metrics_clean <- usage_metrics %>%
    rename(month = 1, abstract_views = 2, full_views = 3, pdf_views = 4) %>%
    mutate(across(c(abstract_views, full_views, pdf_views),
                  ~ as.numeric(gsub(",", "", .)))) %>%
    mutate(month = as.Date(paste0("01-", month), format = "%d-%B"))
  # add the DOI to the cleaned metrics
  usage_metrics_clean$doi <- df_all$doi[i]
  # add the cleaned metrics to the results
  usage_metrics_df <- rbind(usage_metrics_df, usage_metrics_clean)
}

# merge with df_all to get the title and authors
final_df <- merge(df_all, usage_metrics_df, by = "doi", all.y = TRUE)
final_df$total_views <- final_df$abstract_views + final_df$full_views + final_df$pdf_views
OK, now we have all the metrics data. Let’s combine views of abstract, full text and PDF by summing them to get an idea of how many clicks a preprint received.
# for each doi, sum the views columns over all months
final_sum_df <- final_df %>%
  group_by(doi) %>%
  summarise(
    abstract_views = sum(abstract_views, na.rm = TRUE),
    full_views = sum(full_views, na.rm = TRUE),
    pdf_views = sum(pdf_views, na.rm = TRUE),
    total_views = sum(total_views, na.rm = TRUE),
    date = as.Date(date[1])
  )

ggplot(final_sum_df, aes(x = date, y = total_views)) +
  geom_point(alpha = 0.25) +
  labs(x = "Posting date", y = "Total views") +
  scale_y_log10(limits = c(100, 30000)) +
  theme_classic(9)
ggsave("Output/Plots/total_views.png", width = 8, height = 6)
Insights
For each preprint, the total views received over all months is typically between 1,000 and 5,000 (note the log scale). The values are quite steady from January to June and there's a fall-off in September. More on this below, but it is a clue to the half-life of interest in a preprint. Another thing to note is that there is a dip in total views around the 4th of July, which I assume is because the US (a major fraction of the audience here) is away from work. Again, this is a hint about a preprint's half-life.
# the column "month" shows the first of the month, as yyyy-mm-01 string, # convert to show last day of that month final_df$month <- as.Date(final_df$month, format = "%Y-%m-%d") # add one month and then take away one day final_df$month <- as.Date(format(as.Date(final_df$month, format = "%Y-%m-%d") + months(1) - days(1), "%Y-%m-%d")) # calculate the number of days between the date and the month final_df$days_from_month <- as.numeric(final_df$month - as.Date(final_df$date)) # also make a column where "date" is converted to the first of that month final_df$month_date <- as.Date(format(as.Date(final_df$date, format = "%Y-%m-%d"), "%Y-%m-01")) # change month_date to be the name of the month & factor properly final_df$month_date <- factor( month.abb[as.numeric(format(final_df$month_date, "%m"))], levels = month.abb) ggplot(final_df, aes(x = days_from_month, y = total_views, group = doi)) + geom_path(alpha = 0.10) + labs(x = "Days since posting", y = "Total Views") + scale_y_log10() + facet_wrap(~month_date) + theme_classic(9) ggsave("Output/Plots/views_days_facet.png", width = 8, height = 6)
We can look at each preprint's metrics by month and plot them against the days since posting. The initial dynamics are similar regardless of when in the year a preprint was posted: views are high in the first month and then drop away.
In the final code block we’ll dig into these initial dynamics.
# for final_df, calculate the total views for each doi and each month as a
# fraction of the lifetime total for that doi:
# take the doi and total_views columns from final_sum_df, merge with final_df,
# then normalise total_views by the lifetime total (total_views_y)
final_sum_df2 <- final_sum_df %>%
  select(doi, total_views) %>%
  rename(total_views_y = total_views)
final_alt_df <- merge(final_df, final_sum_df2, by = "doi", all.x = TRUE)
final_alt_df$normalised_total_views <- final_alt_df$total_views / final_alt_df$total_views_y

ggplot(final_alt_df, aes(x = days_from_month, y = normalised_total_views, group = doi)) +
  geom_path(alpha = 0.10) +
  labs(x = "Days since posting", y = "Fraction of total views") +
  geom_vline(xintercept = 30, linetype = "dashed") +
  theme_classic(9)
ggsave("Output/Plots/views_days_frac.png", width = 8, height = 6)

# generate the mean of normalised_total_views by days since posting
final_alt_df %>%
  group_by(days_from_month) %>%
  summarise(mean_normalised_total_views = mean(normalised_total_views, na.rm = TRUE)) %>%
  ggplot(aes(x = days_from_month, y = mean_normalised_total_views)) +
  geom_point(alpha = 0.25) +
  labs(x = "Days since posting", y = "Mean fraction of total views") +
  lims(y = c(0, 1)) +
  theme_classic(9)
ggsave("Output/Plots/views_days_frac_avg.png", width = 8, height = 6)

# final_df has the total views for each article per month;
# first turn that into a cumulative sum for each paper
# (arrange by month within each doi so the cumulative sum runs in time order)
final_df_cumsum <- final_df %>%
  arrange(doi, month) %>%
  group_by(doi) %>%
  mutate(cumsum_total_views = cumsum(total_views))
final_alt_df2 <- merge(final_df_cumsum, final_sum_df, by = "doi", all.x = TRUE)
final_alt_df2$normalised_cumsum_total_views <-
  final_alt_df2$cumsum_total_views / final_alt_df2$total_views.y

final_alt_df2 %>%
  group_by(days_from_month) %>%
  summarise(mean_normalised_total_views = mean(normalised_cumsum_total_views, na.rm = TRUE)) %>%
  ggplot(aes(x = days_from_month, y = mean_normalised_total_views)) +
  geom_point(alpha = 0.5) +
  labs(x = "Days since posting", y = "Mean fraction of total views") +
  lims(y = c(0, 1)) +
  theme_classic(9)
ggsave("Output/Plots/views_days_frac_cumsum_avg.png", width = 8, height = 6)
When we look at the total views as a fraction of the preprint's lifetime total, we can see that a preprint gathers about 75% of its lifetime views in the first month! After that, each following month's metrics decline dramatically as the average preprint bumbles along, picking up its final views.
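As a sanity check on that figure, here's a short sketch reusing final_alt_df from above (not part of the original workflow): take the posting-month row for each preprint and average its fraction of lifetime views.

# mean fraction of lifetime views accrued in the posting month:
# the posting-month row is the one with the smallest days_from_month per doi
final_alt_df %>%
  group_by(doi) %>%
  slice_min(days_from_month, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  summarise(mean_first_month_frac = mean(normalised_total_views, na.rm = TRUE))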
Plotted a different way, looking at the cumulative views as a fraction of the lifetime total, we can see that a preprint gets half of its lifetime views in the first 4-5 days after posting.
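To put a number on that half-life, a small sketch (reusing final_alt_df2 from above) finds the first day at which the mean cumulative fraction crosses 0.5:

# first "days since posting" value where the mean cumulative fraction of
# lifetime views reaches 50%
final_alt_df2 %>%
  group_by(days_from_month) %>%
  summarise(mean_frac = mean(normalised_cumsum_total_views, na.rm = TRUE)) %>%
  filter(mean_frac >= 0.5) %>%
  slice_min(days_from_month, n = 1)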
There are some caveats here. We're only looking at cell biology, and only the first nine months of 2024. On the other hand, cell biology is a large category, so it should have a good range of hot papers and less popular work, but it's not so large that it has distinct subdomains (as neuroscience does) that may affect preprint dynamics.
Obviously, with the attention on preprints being almost entirely immediate, posting close to major holidays – or maybe even close to the weekend – is likely to affect how many people see your preprint.
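The weekend idea could be checked with the data we already have. Here's a rough sketch (reusing final_sum_df from above; this comparison wasn't part of the analysis) of lifetime views by posting day of the week:

# compare lifetime views by posting day of week
# note: weekdays() is locale-dependent; the levels below assume an English locale
final_sum_df %>%
  mutate(weekday = factor(weekdays(date),
                          levels = c("Monday", "Tuesday", "Wednesday",
                                     "Thursday", "Friday", "Saturday", "Sunday"))) %>%
  group_by(weekday) %>%
  summarise(n = n(), median_views = median(total_views, na.rm = TRUE))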
—
The post title comes from the Belaire album “Exploding, Impacting”.