All The Right Friends II: clustering papers using Google Scholar data
In a previous post, I looked at how Google Scholar ranks co-authors. While I had the data available I wondered whether paper authorship could be used in other ways.
A few months back, John Cook posted about using the Jaccard index to compare jazz albums. The idea is to look at the players on two jazz albums and examine the overlap. I wondered whether a similar approach could be used for papers.
From John Cook’s post:
There were four musicians who played on both Kind of Blue and Round About Midnight: Miles Davis, Cannonball Adderly, John Coltrane, and Paul Chambers.
There were 9 musicians who performed on either Kind of Blue or Round About Midnight. Since 4 played on both albums, the Jaccard index comparing the personnel on the two albums is 4/9.
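The arithmetic in that quote can be sketched in a few lines of R. The player names below are placeholders (assumptions chosen only so that 4 players are shared and 9 distinct players appear overall); they are not the real line-ups, and `jaccard_index` is a name made up for this illustration.

```r
# Jaccard index: size of the intersection divided by size of the union
jaccard_index <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Placeholder personnel lists (hypothetical, not the real albums)
set_a <- c("p1", "p2", "p3", "p4", "p5", "p6", "p7")
set_b <- c("p1", "p2", "p3", "p4", "p8", "p9")

idx <- jaccard_index(set_a, set_b)
print(idx)      # 4 shared out of 9 distinct players
print(1 - idx)  # the corresponding Jaccard distance
```

Subtracting the index from 1 turns a similarity into a distance, which is the form a clustering method needs.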
Analysing the authorship of papers would let us compute the Jaccard distance (one minus the Jaccard index) between each pair of papers. These distances could then be used to group papers with a clustering method. If that works, it would suggest that topics can be inferred from authorship composition. The cool thing is that papers could then be classified without examining the scientific content of the work!
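As a rough illustration of the idea, here is a minimal sketch on four made-up papers. The author names, the `toy_papers` list and the `jaccard_distance` helper are all assumptions for this example and do not come from the Scholar data.

```r
# Toy author lists (hypothetical, not real papers)
toy_papers <- list(
  p1 = c("a smith", "b jones"),
  p2 = c("a smith", "b jones", "c lee"),
  p3 = c("d patel", "e kim"),
  p4 = c("d patel", "e kim", "a smith")
)

# Jaccard distance: 1 minus (shared authors / all distinct authors)
jaccard_distance <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

# fill a square matrix with the distance for every pair of papers
n <- length(toy_papers)
d <- matrix(0, n, n, dimnames = list(names(toy_papers), names(toy_papers)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- jaccard_distance(toy_papers[[i]], toy_papers[[j]])
  }
}

# hierarchical clustering on the distances, then cut the tree into 2 groups
hc <- hclust(as.dist(d))
groups <- cutree(hc, k = 2)
print(groups)
```

In this toy set, p1 and p2 share most of their authors, as do p3 and p4, so cutting the tree into two groups separates those two pairs.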
The code
Just like last time we’ll look at my papers, rather than Albert Einstein’s!
```r
library(scholar)
library(viridisLite)
library(pheatmap)

# functions ----

# this function will simplify the author names to ease comparison
simplify_authors <- function(x) {
  s <- unlist(strsplit(x, " "))
  t <- paste(substr(s[1], 1, 1), s[length(s)])
  return(t)
}

compute_jaccard <- function(ii, jj) {
  # we want all authors from both papers
  all <- c(ii, jj)
  # number of unique authors in this set
  denom <- length(unique(all))
  # authors who are on both papers
  kk <- ii[ii %in% jj]
  num <- length(kk)
  return(1 - (num / denom))
}

# script ----

# scholar ID
id <- "PBcP8-oAAAAJ"
# get the papers for this profile
papers <- get_publications(id)
# this has 6 authors max, let's get the missing authors
papers$authors <- papers$author
for(i in 1 : nrow(papers)) {
  if(grepl("...", papers$author[i], fixed = TRUE)) {
    papers$authors[i] <- get_complete_authors(id, papers$pubid[i])
  }
}

# now let's look at distance of authorships
# we'll get rid of papers without cluster id (abstracts etc)
papers <- papers[!is.na(papers$cid),]
# we need to flush special characters from the authors column to allow comparison
strings <- data.frame(ff = c("‐","é","ü","’"),
                      rr = c("-","e","u",""))
papers$coauthors <- papers$authors
for(i in 1:nrow(strings)) {
  papers$coauthors <- gsub(strings$ff[i], strings$rr[i], papers$coauthors, fixed = TRUE)
}
# we need some labels for the plot, so we'll get those at the same time
papers$labels <- ""
# a matrix to hold the distances
diss_mat <- matrix(nrow = nrow(papers), ncol = nrow(papers))

for(i in 1 : nrow(papers)) {
  aus_i <- unlist(strsplit(papers$coauthors[i], ", "))
  aus_i <- sapply(aus_i, simplify_authors)
  aus_i <- tolower(aus_i)
  for(j in 1 : nrow(papers)) {
    aus_j <- unlist(strsplit(papers$coauthors[j], ", "))
    aus_j <- sapply(aus_j, simplify_authors)
    aus_j <- tolower(aus_j)
    # compute jaccard
    diss_mat[i,j] <- compute_jaccard(aus_i, aus_j)
  }
  # generate paper label
  s <- unlist(strsplit(aus_i[1], " "))
  papers$labels[i] <- paste0(s[length(s)], "_", papers$year[i])
}

# now store the labels that we made
rownames(diss_mat) <- papers$labels
colnames(diss_mat) <- papers$labels

# to make the hierarchical clustering we will do this:
ee <- diss_mat
ee[upper.tri(ee)] <- NA
ee <- as.dist(ee, diag = T)
hh <- hclust(ee)

png(filename = "Output/Plots/jaccard.png", bg = "white", width = 1800, height = 900, units = "px")
pheatmap(diss_mat, cluster_rows = hh, color = inferno(n = 256), cutree_cols = 7, fontsize = 14)
dev.off()
```
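To see what the `simplify_authors` helper does to a name, here is a small standalone check. The helper is copied from the script; the author names are hypothetical and chosen to show three ways the same person might be written.

```r
# Same helper as in the post: first initial plus the last word of the name
simplify_authors <- function(x) {
  s <- unlist(strsplit(x, " "))
  paste(substr(s[1], 1, 1), s[length(s)])
}

# Hypothetical author names (not taken from any real paper)
raw <- c("Jane Q Doe", "JQ Doe", "Jane Doe")

# apply the helper to each name, then lower-case as the script does
simplified <- tolower(vapply(raw, simplify_authors, character(1), USE.NAMES = FALSE))
print(simplified)
```

All three spellings collapse to the same string, which is what makes the author lists comparable. The trade-off is that two different people who share a first initial and a surname would also be treated as one author.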
The plot
The result is a heatmap and hierarchical clustering of pairs of papers. The papers are grouped into 7 clusters (an arbitrary choice). The clusters make sense to me and, since I know the content of the papers, I can confirm that this method does largely group the papers thematically (with some limitations). Two examples:
- The right-most cluster contains only papers on P2X receptor trafficking
- To the left of that are papers using EM to look at mitotic spindles
The limitation is that all of my single-author papers cluster together (over on the left), even though they are on different topics. They share an identical author list, so the Jaccard distance between any two of them is 0 and there is nothing further to tell them apart.
Verdict
This was a fun idea to explore. Cutting out single-author papers before performing the analysis would probably improve the result. The approach likely only works well for scholars with many papers (maybe >20). Another possibility would be to enrich the dataset by retrieving the papers of all of the primary author’s co-authors, so that scientific themes contribute more to the clusters.
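Dropping the single-author papers could look something like the sketch below. It runs on a made-up data frame that mimics the `coauthors` column built in the script; the rows, the `labels` values and the names `n_authors` and `multi` are assumptions for illustration, and I have not re-run the analysis this way.

```r
# Hypothetical data frame mimicking the `coauthors` column from the script
papers <- data.frame(
  labels = c("solo_a", "pair_a", "solo_b", "trio_a"),
  coauthors = c("A Author",
                "A Author, B Writer",
                "A Author",
                "A Author, C Scribe, B Writer"),
  stringsAsFactors = FALSE
)

# count the authors on each paper by splitting on the ", " separator
n_authors <- lengths(strsplit(papers$coauthors, ", ", fixed = TRUE))

# keep only the papers with more than one author
multi <- papers[n_authors > 1, , drop = FALSE]
print(multi$labels)
```

In the real script this filter would need to go before the distance matrix is built, so that `diss_mat` and the labels stay the same size.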
—
The post title comes from “All The Right Friends” by R.E.M.