Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
Todays mission have been to visualize publication frequencies over the years for different psychotherapies using PubMed and R. I choose to concentrate on Psychodynamic Therapy (PDT), Cognitive Behavior Therapy (CBT), Acceptance and Commitment Therapy (ACT) and Mindfulness. I know there’s is a R function to query PubMed, but I didn’t use it and instead searched PubMed manually and saved the result as four separate csv-files.
PubMed search filter
I used a really rudimentary search filter:
Writing the R code
So, after I’d saved the data files from PubMed to my computer, I switched over the R and loaded the necessary packages.
library(stringr) # used for str_extract() library(ggplot2) # used for plotting library(plyr) # used for count()
Then I wrote a short function to extract year and frequency from the csv-file. Since publication year wasn’t under it’s own header I used str_extract to extract that information.
# create function named scrape.pubmed scrape.pubmed <- function() { # we only need column 5 results <- pubmed_result[5] # the header is repeated regularly, so I remove those rows. results <- results[results$Details != "ShortDetails",] # The remaning field contain journal name in addition to pub. year # The general expression catches values 1900--1999 OR 2000--2019 # Most rows had year information at the end of the string, # but not all rows had this. Therefore I used this regular expression to # minimize mismatches. year <- str_extract(results, "(19[0-9]{2}|20[0-1][0-9])") # Use the count()-function to group years together. # I didn't want values from 2012, hence the < 2012 addition. count <- count(year[year < 2012]) # Return the value of count return(count) }
Now that I have a function to collect my data, I loaded the different csv-files I saved from PubMed and collect year data from them.
# Get CBT data pubmed_result <- read.csv("~/Downloads/pubmed_cbt.csv", row.names = NULL) cbt.count <- scrape.pubmed() # Get PDT data pubmed_result <- read.csv("~/Downloads/pubmed_pdt.csv", row.names = NULL) pdt.count <- scrape.pubmed() # Get mindfulness data pubmed_result <- read.csv("~/Downloads/pubmed_mindfulness.csv", row.names = NULL) mindfulness.count <- scrape.pubmed() # Get ACT data pubmed_result <- read.csv("~/Downloads/pubmed_ACT.csv", row.names = NULL) act.count <- scrape.pubmed()
Now I have 4 variables who look like this:
> head(cbt.count, 10) x freq 1 1976 19 2 1977 2 3 1978 1 4 1979 1 5 1980 12 6 1981 9 7 1982 12 8 1983 15 9 1984 14 10 1985 16
I could combine them into one data.frame but I choose not to. Instead I plot them individually on different layers with ggplot.
# It doesn't matter in which order you enter the variables, I # started with cbt.count. For some reason ggplot wouldn't accept # the syntax without 'group = 1'. ggplot(cbt.count, aes(x, freq, group = 1, color = "CBT")) + # line layer for cbt.count geom_line() + # new layer for pdt.count geom_line(data = pdt.count, aes(x, freq, color = "PDT")) + # new layer for mindfulness.count geom_line(data = mindfulness.count, aes(x, freq, color = "Mindfulness")) + # new layer for act.clunt geom_line(data = act.count, aes(x, freq, color = "ACT"))+ # This line both control the groups colors and labels on the legend. scale_color_manual("Orientation", values= c("CBT" = "blue", "PDT" = "red", "Mindfulness" = "green", "ACT" = "magenta")) + ylab("Publications") + # y-label xlab("Year") + # x-label # break up x-axis from 1900 to 2010 by ticks of 10 scale_x_discrete(breaks = seq(1900,2010, 10)) + # add plot title opts(title = "Pubmed citations by year")
Results
We can see that PDT really had a decline in the early 90′s and the CBT started it’s rise and that it really sky rocketed in the 21th century. PDT also had an increase in the 21th century but it got a heck of a lot of work to do before it can match CBT’s publication output. The plot also show that mindfulness had a surge and that it’s output now is on par with PDT.
I’ve made some updates to this code in a more recent post:
Ggplot2, PubMed citation frequency and DSM-IV Axis I disorders by year
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.