Plotting Scopus article-level citation data in R

[This article was first published on The 20% Statistician, and kindly contributed to R-bloggers.]



The Royal Society has decided to publish journal citation distributions. This makes sense. The journal impact factor is a single number trying to summarize a distribution, but it’s almost always better to plot your data. Some have been hopeful that visualizing such distributions will make it clear what a troublesome statistic the journal impact factor is, and hope that other journals will also be open with their data.

I want to point out that all this data is readily available to anyone who has access to Scopus, and at the bottom of this post I’ll share the R code to create these plots yourself.

Go to Scopus, and search for any journal you’d like. Here, I’ll illustrate this process with a search for the journal Psychological Science, which has ISSN 0956-7976. You can search for any range of years, but Scopus will only allow you to export 2000 cases at once. I limited this search to issues from 2010-2015. Due to copyright reasons, I cannot share the Scopus data I downloaded.


Then, select all results, and export ‘all available information’ as a .csv file, as illustrated in the animation below.
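As a minimal sketch of what loading the exported file into R might look like: the file name `scopus.csv` and the helper `read_scopus()` are hypothetical, and the column names `Year` and `Cited by` are assumptions about the export format, so check them against the header of your own file. Empty citation cells are treated as zero citations, which is also an assumption.

```r
# Hypothetical file name; use whatever name you gave the Scopus export.
scopus_file <- "scopus.csv"

# Read a Scopus export and keep only publication year and citation count.
# Assumes the export has columns named "Year" and "Cited by".
read_scopus <- function(path) {
  # check.names = FALSE keeps "Cited by" from being renamed to "Cited.by"
  raw <- read.csv(path, stringsAsFactors = FALSE, check.names = FALSE)
  cites <- suppressWarnings(as.numeric(raw[["Cited by"]]))
  cites[is.na(cites)] <- 0  # assumption: an empty cell means no citations yet
  data.frame(Year = as.integer(raw[["Year"]]), Citations = cites)
}

# Only attempt to read the file if it is actually there.
if (file.exists(scopus_file)) {
  citations <- read_scopus(scopus_file)
  str(citations)
}
```

Keeping only the two columns needed for the plots makes the later steps easier to follow, and the rest of the export stays available in the original file.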


Now that you have the data, plotting the citations is straightforward, and can be done with the code below (the plots in this blog post look a bit different from the output of the code, but the data is the same). For example, here is the distribution of citations for Psychological Science for the years 2010-2015. The tail is so long that I cut off the x-axis at 200 citations. Three papers are cited more than 200 times, most notably Simmons, Nelson, & Simonsohn (2011), with 662 citations.
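A rough sketch of such a plot in base R, with the x-axis cut off at 200 citations, could look like this. The citation counts are toy values made up for illustration (they are not the Scopus data), and the bin width of 10 is an arbitrary choice.

```r
# Toy citation counts for illustration only (not the Scopus data).
cites <- c(0, 1, 3, 4, 8, 15, 17, 25, 40, 90, 230, 410)

x_max <- 200                      # cut-off for the x-axis
shown <- cites[cites <= x_max]    # papers that fit on the plot
n_hidden <- sum(cites > x_max)    # papers in the tail that is cut off

# Histogram of the papers at or below the cut-off, in bins of 10 citations.
h <- hist(shown,
          breaks = seq(0, x_max, by = 10),
          main = "Citation distribution (toy data)",
          xlab = "Citations",
          ylab = "Number of papers")

message(n_hidden, " paper(s) with more than ", x_max, " citations not shown")
```

Counting the papers beyond the cut-off separately keeps the plot readable without silently dropping the most-cited papers from the description.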



The data is clearly skewed, and older papers have obviously had more time to accumulate citations. The differences between the means:

        Year    Mean
1       2010    34.551724
2       2011    25.329167
3       2012    18.460465
4       2013    12.055016
5       2014    6.176471
6       2015    1.814815

and medians:

        Year    Median
1       2010    25
2       2011    17
3       2012    15
4       2013    8
5       2014    4
6       2015    1

are obvious. You would probably exclude extreme outliers when analyzing your own data, but journals obviously like to keep them in because they boost the impact factor, even though they are not representative.
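Tables like the ones above can be computed with `aggregate()`. The sketch below uses a small made-up data frame with assumed column names `Year` and `Citations`, not the Scopus data, and it also shows how a single heavily cited paper pulls the mean away from the median.

```r
# Toy data for illustration only: one heavily cited paper in 2010.
citations <- data.frame(
  Year      = c(2010, 2010, 2010, 2011, 2011, 2011),
  Citations = c(10, 20, 300, 2, 4, 6)
)

# Mean and median number of citations per publication year.
yearly_mean   <- aggregate(Citations ~ Year, data = citations, FUN = mean)
yearly_median <- aggregate(Citations ~ Year, data = citations, FUN = median)
names(yearly_mean)[2]   <- "Mean"
names(yearly_median)[2] <- "Median"

# Combine both summaries into one table, one row per year.
summary_by_year <- merge(yearly_mean, yearly_median, by = "Year")
print(summary_by_year)
```

In this toy example the 2010 mean is 110 while the median is 20, because the single paper with 300 citations dominates the mean.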

Feel free to play around with the script, and link to your plots in the comments below, or tweet them to me at @Lakens.



