All The Right Friends: how does Google Scholar rank co-authors?
On a scientist’s Google Scholar page, there is a list of co-authors in the sidebar. I’ve often wondered how Google determines in what order these co-authors appear.
The list of co-authors on a primary author’s page is not exhaustive. It only lists co-authors who also have a Google Scholar profile. They must also be suggested to the primary author, who then has to accept the suggestion for them to be listed on the page. Finally, the profile page only displays the first 20 co-authors; any further co-authors can be seen by clicking “View All”. As I understand it, there is a limit on the number of co-authors a primary author is allowed to have, but I currently have 40 and haven’t hit it yet. The co-authors are ranked somehow, and the top 20 are displayed in the sidebar on the right of the primary author’s profile page.
How does Google Scholar rank these co-authors? Let’s use R to find out!
We’ll make use of the scholar package to get the data. The primary author used in the package vignette is Albert Einstein, but he doesn’t have any co-authors on Google Scholar; so we’ll use my data instead.
library(scholar)
library(dplyr)
library(ggplot2)
library(zoo)

# use a Google Scholar ID here
id <- "PBcP8-oAAAAJ"
# retrieve all the info from the page
l <- get_profile(id)
# retrieve details of all of the primary author's papers
papers <- get_publications(id)
# sadly this only has 6 authors max for each paper, let's get the missing authors
# (as.character() guards against the column arriving as a factor)
papers$authors <- as.character(papers$author)
# we'll use a loop to call get_complete_authors() if required
# we can test for this because papers with missing authors have an ellipsis as final author
for (i in seq_len(nrow(papers))) {
  if (grepl("...", papers$author[i], fixed = TRUE)) {
    papers$authors[i] <- get_complete_authors(id, papers$pubid[i])
  }
}
At this point we have the primary author’s info, and a nice data frame of all of the primary author’s papers with number of cites and co-authors per paper.
We need to match up the Scholar co-authors to the authors in the data frame. This involves a bit of manipulation because a Scholar co-author can enter their name in any format!
# authors in the data frame have names like "JD Bloggs"
# we need names like "j bloggs" to match efficiently
# get character vector of all authors from the data frame
all_authors <- unlist(strsplit(papers$authors, ", "))

# use this function to remove middle initials that we don't need
simplify_authors <- function(x) {
  s <- unlist(strsplit(x, " "))
  # first initial plus the last word (taken to be the surname)
  t <- paste(substr(s[1], 1, 1), s[length(s)])
  return(t)
}
all_authors <- sapply(all_authors, simplify_authors)

# Now, let's count the frequency of each co-author
count_coau <- data.frame(au = tolower(all_authors)) %>%
  group_by(au) %>%
  count()

# get the Scholar co-authors from `l`
scholar_coau <- l$coauthors
# here I manually added in the other Scholar co-authors from the View All modal
# and again put them into the correct format
scholar_coau <- sapply(scholar_coau, simplify_authors)

# make a data frame of the Scholar co-authors and their rank
scholar_df <- data.frame(coauthors = tolower(scholar_coau),
                         rank = seq(1, length(scholar_coau)))
# now merge with the data frame with the paper count by author
compare_df <- merge(scholar_df, count_coau,
                    by.x = "coauthors", by.y = "au", sort = FALSE)
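The name-matching step above can be tried out on a toy example. The sketch below uses made-up names (not taken from any real profile) and a hypothetical helper, `simplify_author()`, that mirrors the idea of keeping only the first initial and the last word; it is an illustration under those assumptions, not the exact code used for the results in this post.

```r
# Hypothetical names, chosen only to illustrate the normalisation step
simplify_author <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")[[1]]
  # first initial of the first token, plus the last token as the surname
  tolower(paste(substr(parts[1], 1, 1), parts[length(parts)]))
}

paper_style   <- c("JD Bloggs", "A Other", "JD Bloggs")  # as listed on papers
profile_style <- c("Joe D. Bloggs", "Ann Other")         # as typed on profiles

paper_keys   <- vapply(paper_style, simplify_author, character(1), USE.NAMES = FALSE)
profile_keys <- vapply(profile_style, simplify_author, character(1), USE.NAMES = FALSE)

# count papers per normalised key, then look up each profile name
paper_counts <- table(paper_keys)
matched <- paper_counts[profile_keys]
print(matched)
```

This kind of key is deliberately crude: two different people who share a first initial and surname collapse to the same key, and multi-word surnames are reduced to their last word.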
First, let’s see whether Scholar co-author rank is determined by the number of papers co-authored with the primary author.
# plot the number of papers as a function of rank
ggplot(compare_df, aes(x = rank, y = n)) +
  geom_point() +
  lims(x = c(0, NA), y = c(0, NA)) +
  labs(x = "Scholar Rank", y = "Co-authored Papers") +
  theme_bw()
# so rank is not determined by number of papers
Number of co-authored papers correlates with rank, but doesn’t determine it.
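One way to put a number on "correlates but doesn’t determine" is a Spearman rank correlation, which only looks at ordering. The sketch below uses made-up paper counts (the real values live in `compare_df`, which needs a live Scholar query), so treat it as an illustration of the check rather than a result.

```r
# Toy data: ranks 1..8 and made-up paper counts (not from any real profile)
toy <- data.frame(rank = 1:8, n = c(12, 9, 10, 4, 6, 3, 2, 1))

# Spearman correlation compares orderings only
rho <- cor(toy$rank, toy$n, method = "spearman")

# if paper count alone fixed the rank, sorting by count would give exactly -1
perfect <- cor(toy$rank, sort(toy$n, decreasing = TRUE), method = "spearman")

print(c(observed = rho, if_determined = perfect))
```

A value close to, but not equal to, -1 is what "correlated but not determined" looks like: most of the ordering agrees, yet some neighbours are swapped.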
If rank is not determined (only) by number of co-authored papers, let’s look at the total citations that each co-author shares with the primary author.
compare_df$cites <- 0
for (i in seq_len(nrow(compare_df))) {
  total_cites <- 0
  au <- compare_df$coauthors[i]
  for (j in seq_len(nrow(papers))) {
    aus <- unlist(strsplit(papers$authors[j], ", "))
    aus <- tolower(sapply(aus, simplify_authors))
    # exact match on the simplified name; a pattern match via grepl()
    # would also count e.g. "j smithson" when looking for "j smith"
    if (au %in% aus) {
      total_cites <- total_cites + papers$cites[j]
    }
  }
  compare_df$cites[i] <- total_cites
}

# plot the total citations as a function of rank
ggplot(compare_df, aes(x = rank, y = cites)) +
  geom_point() +
  lims(x = c(0, NA), y = c(0, NA)) +
  labs(x = "Scholar Rank", y = "Co-cites") +
  theme_bw()
Co-cites do not match the rank either.
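For anyone who wants to check the co-cite totals by hand, the same quantity can be computed without nested loops by expanding the papers into one row per (paper, author) pair and summing. This is a sketch on a made-up paper list; `papers_toy` and `key()` are hypothetical names introduced here for illustration.

```r
# Toy paper list: author strings in the "JD Bloggs" style, with made-up cites
papers_toy <- data.frame(
  authors = c("A Smith, JD Bloggs",
              "JD Bloggs, C Jones, A Smith",
              "A Smith, B Smithson"),
  cites   = c(10, 5, 7),
  stringsAsFactors = FALSE
)

# normalise "JD Bloggs" to "j bloggs"
key <- function(x) {
  p <- strsplit(x, " ")[[1]]
  tolower(paste(substr(p[1], 1, 1), p[length(p)]))
}

# one row per (paper, author) pair
author_list <- strsplit(papers_toy$authors, ", ")
long <- data.frame(
  au    = unlist(lapply(author_list, function(a) {
    vapply(a, key, character(1), USE.NAMES = FALSE)
  })),
  cites = rep(papers_toy$cites, lengths(author_list)),
  stringsAsFactors = FALSE
)

# total citations of the papers each author appears on
co_cites <- tapply(long$cites, long$au, sum)
print(co_cites)
```

Because the sum is keyed on the whole normalised name, "a smith" and "b smithson" are kept apart, which a substring search would not guarantee.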
Of course, it is possible that the ranking is done by some complex method, e.g. the number of co-authors that the co-author has. But if we assume the ranking is done using only information on the primary author’s page, how can it be done?
Let’s look at a graph of co-cites and number of papers.
ggplot(compare_df, aes(x = n, y = cites, colour = rank)) +
  geom_point() +
  scale_colour_gradient(low = "red", high = "blue") +
  lims(x = c(0, NA), y = c(0, NA)) +
  labs(x = "Papers", y = "Co-cites") +
  theme_bw()
This graph shows that the distance from the origin roughly scales inversely with rank.
If we take the log2 transform of the co-cites, we can see this more clearly.
ggplot(compare_df, aes(x = n, y = log2(cites), colour = rank)) +
  geom_point() +
  scale_colour_gradient(low = "red", high = "blue") +
  lims(x = c(0, NA), y = c(0, NA)) +
  labs(x = "Papers", y = "Co-cites (log2)") +
  theme_bw()
If we use the Manhattan distance from the origin, that is, the number of papers plus the log2-scaled number of citations, we get something approximating the ranking!
compare_df$distance <- compare_df$n + log2(compare_df$cites)

ggplot(compare_df, aes(x = rank, y = distance)) +
  geom_point() +
  lims(x = c(0, NA), y = c(0, NA)) +
  labs(x = "Rank", y = "Distance") +
  theme_bw()
This approximation works well. It’s not perfect. There are authors whose distance is not in ranked order with their neighbours. On closer inspection it seems that the number of co-authored papers is not accurate, or perhaps zero-cited papers are excluded from the paper count.
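To make the scoring concrete, here is a small worked example of the distance (papers plus log2 of co-cites) on made-up numbers. The co-author labels and values are hypothetical, and the score is only the approximation described in this post, not a confirmed Google Scholar algorithm.

```r
# Made-up co-authors: n = shared papers, cites = summed citations of those papers
toy <- data.frame(
  coauthor = c("a", "b", "c", "d"),
  n        = c(10, 3, 6, 2),
  cites    = c(64, 1024, 256, 0),
  stringsAsFactors = FALSE
)

# distance = papers + log2(co-cites); log2(0) is -Inf, so a co-author
# with no shared citations ends up at the bottom of the list
toy$distance <- toy$n + log2(toy$cites)

# larger distance means a better (smaller) predicted rank
toy$predicted_rank <- rank(-toy$distance, ties.method = "min")
print(toy[order(toy$predicted_rank), ])
```

Note how "c" outranks "b" here despite having a quarter of the citations: on the log2 scale each extra paper counts as much as a doubling of co-cites.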
I tried this simple strategy on a few other primary authors and could replicate their co-authors’ rank order. I’m not certain this is the algorithm used but it certainly seems simple enough to be readily computed on each profile page.
So the total co-citations and the number of co-authored papers appear to be what is used to compute the rank of co-authors.
Not every co-author is a Scholar co-author. Some don’t have accounts, for example. Knowing how the ranking is done, we can ask which lucky co-authors could slot into the top co-author spots on my page, if they made an account!
# get a list of all co-authors
unique_authors <- unique(all_authors)
# exclude current Scholar co-authors (compare in lower case to be safe)
unique_authors <- unique_authors[!(tolower(unique_authors) %in% tolower(scholar_coau))]
# generate a data frame of these authors in the correct format with blank ranking
temp_df <- data.frame(coauthors = tolower(unique_authors), rank = 0)
# merge to find the number of co-authored papers
other_df <- merge(temp_df, count_coau,
                  by.x = "coauthors", by.y = "au", sort = FALSE)
# remove single paper coauthors for ease
other_df <- other_df[other_df$n > 1, ]

# retrieve the number of co-citations for each co-author
other_df$rank <- 0
other_df$cites <- 0
for (i in seq_len(nrow(other_df))) {
  total_cites <- 0
  au <- other_df$coauthors[i]
  for (j in seq_len(nrow(papers))) {
    aus <- unlist(strsplit(papers$authors[j], ", "))
    aus <- tolower(sapply(aus, simplify_authors))
    # exact match on the simplified name
    if (au %in% aus) {
      total_cites <- total_cites + papers$cites[j]
    }
  }
  other_df$cites[i] <- total_cites
}

# get the distance used for ranking
other_df$distance <- other_df$n + log2(other_df$cites)
# bind with the original data frame so that we can see where the new co-authors slot in
all_df <- rbind(compare_df, other_df)
# order by distance
all_df <- all_df[order(all_df$distance, decreasing = TRUE), ]
# remove primary author (should have most cites and papers!)
all_df <- all_df[-1, ]
# mark out non-Scholar co-authors
all_df$new <- ifelse(all_df$rank == 0, 1, 0)
# make a new column for interpolated rank
all_df$interrank <- ifelse(all_df$rank == 0, NA, all_df$rank)
# na.rm = FALSE keeps leading/trailing NAs so the length still matches the data frame
all_df$interrank <- na.approx(all_df$interrank, na.rm = FALSE)

# plot the result - limit to original top ten
ggplot(all_df, aes(x = interrank, y = distance, colour = as.factor(new))) +
  geom_point() +
  lims(x = c(0, 10), y = c(0, NA)) +
  labs(x = "Rank", y = "Distance") +
  theme_bw() +
  theme(legend.position = "none")
Scholar co-authors are shown in salmon while co-authors without a Scholar profile are shown in teal. From this plot, we can see that the coveted third place is up for grabs, along with the new 5th place. If everyone made a Scholar account, the person currently in 6th place would be pushed down into 10th.
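The "interpolated rank" used for the x-axis can be seen in isolation with a toy vector. The sketch below uses base R’s `approx()` instead of `zoo::na.approx()` so that it runs without extra packages; the ranks are made up, with `NA` standing for a co-author who has no Scholar profile.

```r
# Made-up list already ordered by distance: known Scholar ranks,
# with NA for co-authors who do not have a profile
known_rank <- c(1, 2, NA, 3, NA, NA, 4)

idx  <- seq_along(known_rank)
have <- !is.na(known_rank)

# linear interpolation between the neighbouring known ranks
inter <- approx(idx[have], known_rank[have], xout = idx)$y
print(inter)
```

A non-Scholar co-author sitting between ranks 2 and 3 gets 2.5, which places them between their neighbours on the plot. By default `approx()` returns `NA` outside the range of known points, whereas `zoo::na.approx()` by default drops leading and trailing `NA`s, which shortens the result.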
I don’t imagine for one minute that anyone would be motivated to sign up just to make it onto the sidebar of my page, but the exercise was an interesting way to see who my “closest” co-authors are.
—
The post title comes from “All The Right Friends” by R.E.M. The version I have is on a Best Of… compilation. I have many songs with “Friends” in the title but this seemed appropriate since the co-author sidebar is over on the right of the Scholar profile page.