All The Right Friends II: clustering papers using Google Scholar data

Stephen Royle

1 month ago

[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers].

In a previous post, I looked at how Google Scholar ranks co-authors. While I had the data available I wondered whether paper authorship could be used in other ways.

A few months back, John Cook posted about using the Jaccard index to compare jazz albums. The idea is to look at the players on two albums and measure the overlap. I wondered whether a similar approach could be used for papers.

From John Cook’s post:

There were four musicians who played on both Kind of Blue and Round About Midnight: Miles Davis, Cannonball Adderley, John Coltrane, and Paul Chambers.

There were 9 musicians who performed on either Kind of Blue or Round About Midnight. Since 4 played on both albums, the Jaccard index comparing the personnel on the two albums is 4/9.
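As a rough illustration of the arithmetic in the quote, the sketch below computes the Jaccard index for two personnel lists in R. Only the four shared musicians are named in the quote; the remaining entries (`kob_only_*`, `ram_only_*`) are placeholders made up to bring the total to nine, and the split of three and two between the albums is an assumption. The helper `jaccard_index()` is likewise just a name chosen for this example.

```r
# Jaccard index: size of the intersection divided by size of the union
jaccard_index <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# the four musicians named in the quote as playing on both albums
shared <- c("Miles Davis", "Cannonball Adderley", "John Coltrane", "Paul Chambers")

# placeholder names (hypothetical) standing in for the five other musicians
kind_of_blue <- c(shared, "kob_only_1", "kob_only_2", "kob_only_3")
round_about_midnight <- c(shared, "ram_only_1", "ram_only_2")

index <- jaccard_index(kind_of_blue, round_about_midnight)
index      # 4/9, about 0.444
1 - index  # the corresponding Jaccard distance, 5/9
```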

Analysing the authorship of papers would let us find the Jaccard distance (one minus the Jaccard index) between each pair of papers. These distances could then be used to group papers with a clustering method. If that works, it would demonstrate that topics can be inferred from authorship composition. The cool thing is that papers could then be classified without examining the scientific content of the work!
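To make the idea concrete before touching real data, here is a minimal sketch using five invented papers. The author labels (`a` to `g`), the paper names and the helper `jaccard_distance()` are all hypothetical. Author `a` appears on every paper, mimicking the owner of a Scholar profile. The sketch only shows that base R's `hclust()` and `cutree()` can separate two groups of papers from author overlap alone.

```r
# toy author lists for five hypothetical papers (letters stand in for authors)
toy_papers <- list(
  p1 = c("a", "b", "c"),
  p2 = c("a", "b", "d"),
  p3 = c("a", "c", "d"),
  p4 = c("a", "e", "f"),
  p5 = c("a", "e", "f", "g")
)

# Jaccard distance: 1 minus (shared authors / all authors)
jaccard_distance <- function(x, y) {
  1 - length(intersect(x, y)) / length(union(x, y))
}

# fill a square matrix with the pairwise distances
n <- length(toy_papers)
toy_dist <- matrix(0, nrow = n, ncol = n,
                   dimnames = list(names(toy_papers), names(toy_papers)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    toy_dist[i, j] <- jaccard_distance(toy_papers[[i]], toy_papers[[j]])
  }
}

# hierarchical clustering on the distances, then cut the tree into two groups
toy_tree <- hclust(as.dist(toy_dist))
toy_groups <- cutree(toy_tree, k = 2)
toy_groups
```

With these toy lists, `p1` to `p3` should fall into one group and `p4` and `p5` into the other, because papers within each group share more authors with each other than with the other group.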

The code

Just like last time we’ll look at my papers, rather than Albert Einstein’s!

library(scholar)
library(viridisLite)
library(pheatmap)

# functions ----
# this function will simplify the author names to ease comparison
simplify_authors <- function(x) {
  # split the name on spaces, then keep the first initial and the surname
  s <- unlist(strsplit(x, " "))
  paste(substr(s[1], 1, 1), s[length(s)])
}

# Jaccard distance between the author vectors of two papers
compute_jaccard <- function(ii, jj) {
  # number of unique authors across both papers
  denom <- length(union(ii, jj))
  # number of unique authors who are on both papers
  num <- length(intersect(ii, jj))
  # the distance is 1 minus the Jaccard index
  1 - (num / denom)
}

# script ----
# scholar ID
id <- "PBcP8-oAAAAJ"
# get the papers for this profile
papers <- get_publications(id)
# this has 6 authors max, let's get the missing authors
papers$authors <-  papers$author

for(i in 1 : nrow(papers)) {
  if(grepl("...", papers$author[i], fixed = TRUE)) {
    papers$authors[i] <- get_complete_authors(id, papers$pubid[i])
  }
}

# now let's look at distance of authorships
# we'll get rid of papers without cluster id (abstracts etc)
papers <- papers[!is.na(papers$cid),]
# we need to flush special characters from the authors column to allow comparison
strings <- data.frame(ff = c("‐","é","ü","’"),
                      rr = c("-","e","u",""))
papers$coauthors <- papers$authors
for(i in 1:nrow(strings)) {
  papers$coauthors <- gsub(strings$ff[i],strings$rr[i],papers$coauthors, fixed = TRUE)
}

# we need some labels for the plot, so we'll get those at the same time
papers$labels <- ""
# a matrix to hold the distances
diss_mat <- matrix(nrow = nrow(papers), ncol = nrow(papers))

for(i in 1 : nrow(papers)) {
  aus_i <- unlist(strsplit(papers$coauthors[i],", "))
  aus_i <- sapply(aus_i,simplify_authors)
  aus_i <- tolower(aus_i)
  for(j in 1 : nrow(papers)) {
    aus_j <- unlist(strsplit(papers$coauthors[j],", "))
    aus_j <- sapply(aus_j,simplify_authors)
    aus_j <- tolower(aus_j)
    # compute jaccard
    diss_mat[i,j] <- compute_jaccard(aus_i,aus_j)
  }
  # generate paper label
  s <- unlist(strsplit(aus_i[1]," "))
  papers$labels[i] <- paste0(s[length(s)],"_",papers$year[i])
}
# now store the labels that we made
rownames(diss_mat) <- papers$labels
colnames(diss_mat) <- papers$labels

# to make the hierarchical clustering we will do this:
ee <- diss_mat
ee[upper.tri(ee)] <- NA
ee <- as.dist(ee, diag = TRUE)
hh <- hclust(ee)

png(filename = "Output/Plots/jaccard.png", bg = "white", width = 1800, height = 900, units = "px")
pheatmap(diss_mat,
         cluster_rows = hh,
         color = inferno(n = 256),
         cutree_cols = 7,
         fontsize = 14)
dev.off()
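
The name handling is the part most likely to go wrong, so it is worth checking in isolation. The sketch below repeats `simplify_authors()` from the script and applies it to two spellings of the same author. The example strings are an assumption about what the abbreviated and complete author lists look like, not output captured from Google Scholar.

```r
# same logic as simplify_authors() in the script: first initial plus surname
simplify_authors <- function(x) {
  s <- unlist(strsplit(x, " "))
  paste(substr(s[1], 1, 1), s[length(s)])
}

# assumed formats: abbreviated list entry versus complete author name
short_form <- "SJ Royle"
long_form <- "Stephen J Royle"

tolower(simplify_authors(short_form))  # "s royle"
tolower(simplify_authors(long_form))   # "s royle"

# both spellings collapse to the same key, so they count as one author
same_key <- identical(simplify_authors(short_form), simplify_authors(long_form))
same_key
```

The trade-off is that two different people who share a first initial and a surname would also collapse to the same key and be counted as one author.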

The plot

Plot showing a heatmap of distances between pairs of papers.

The result is a heatmap of pairwise distances together with a hierarchical clustering of the papers. The papers are cut into 7 clusters (an arbitrary choice). The clusters make sense to me and since I know the content of the papers, I can confirm that this method does largely group the papers thematically (with some limitations). Two examples:

The right-most cluster consists entirely of papers on P2X receptor trafficking
To the left of that are papers using EM to look at mitotic spindles

The limitation is that all of my single-author papers are clustered together (over on the left), yet these are on different topics. With only one author each, their author sets are identical, so there is no further information to distinguish them.

Verdict

This was a fun idea to explore. Cutting out single-author papers before performing the analysis would probably improve the result. The approach likely only works well for scholars with many papers (maybe >20). Another possibility would be to enrich the dataset by retrieving the papers of all of the primary author’s co-authors, to strengthen the contribution of scientific themes to the clusters.
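
A minimal sketch of the single-author filtering step, using an invented data frame in place of `papers`. It assumes, as in the script above, that authors in the `coauthors` column are separated by `", "`, so a single-author paper is one that splits into exactly one name. The names and years in `toy` are hypothetical.

```r
# invented stand-in for the papers data frame
toy <- data.frame(
  coauthors = c("A Author",
                "A Author, B Writer",
                "A Author",
                "A Author, C Scribe, B Writer"),
  year = c(2010, 2012, 2015, 2018),
  stringsAsFactors = FALSE
)

# count the authors on each paper by splitting on the separator
n_authors <- lengths(strsplit(toy$coauthors, ", ", fixed = TRUE))

# two single-author papers by the same person have identical author sets,
# so their Jaccard distance is 1 - 1/1 = 0 whatever their topics are
single <- toy$coauthors[n_authors == 1]
single_dist <- 1 - length(intersect(single[1], single[2])) /
  length(union(single[1], single[2]))
single_dist

# keep only the multi-author papers for the distance matrix
multi <- toy[n_authors > 1, ]
nrow(multi)
```

In the script, the same test could be applied to `papers$coauthors` before `diss_mat` is built.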

—

The post title comes from “All The Right Friends” by R.E.M.
