I’m always curious to see who is citing my one paper. Turns out I actually have two papers, and the more cited of the two (with 19 citations, which sounds paltry but for me is quite exciting) is not the one I’d have expected, by any stretch of the imagination. But my second paper has only been in print for about six months. So I began to wonder: how do I, a third-year grad student, stack up against other grad students, professors, and ecology rock stars? What is the key to having your work well cited?
Since I’ve been learning Python, I realized I could figure this out pretty easily. I wrote a Python script that runs names through Google Scholar, looks for a user profile (if one exists), goes to the profile page, and extracts the data from the “Citations by Year” chart. This chart shows how many times your papers were cited in each year (it’s not cumulative). I’d attach the Python script if I felt like spending the time figuring out how to upload a text file to WordPress (but I don’t).
I ran 20 names through the program and downloaded the number of citations per year for each. I plotted them out in R, ran some super basic analyses, and found a pattern that is surprisingly consistent in form but highly variable in rate: the number of citations per year increased allometrically (i.e., as a power law) for every author, but the slopes varied significantly among authors. In the graphs, values are plotted as lag since the date of first citation (with the first year set to 1).
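For the curious, here’s a minimal sketch of the kind of log-log fit I mean, assuming the scraped data have been read into R as a data frame (I’ll call it citeDat, a placeholder name) with the name, lag, and cites columns the Python script below produces. This is illustrative, not the exact model I ran:

# citeDat is assumed: one row per author-year, with columns name, lag, and cites
# An allometric (power-law) relationship is linear on the log-log scale,
# so fit a line and let the slope vary by author via the interaction term
fit <- lm(log(cites + 0.1) ~ log(lag + 1) * name, data = citeDat)
anova(fit)  # a significant log(lag + 1):name interaction means slopes differ among authors

The +1 and +0.1 offsets mirror the ones in the plotting code below, to keep the logs finite when a year has zero citations.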
FYI: Getting the y-axis to display ticks in exponential form was a bit tricky.
# ggplot2 for the plot itself, scales for the log transforms and axis labels
library(ggplot2)
library(scales)

# p is a ggplot object already built from the citation data (columns name, lag, cites)
p + geom_point(aes(lag + 1, cites + 0.1, color = name), size = 3, show_guide = FALSE) +
  scale_y_continuous(trans = log2_trans(),
                     breaks = c(0.1, 1, 10, 100, 1000),
                     labels = trans_format('log10', math_format(10^.x))) +
  scale_x_continuous(trans = log2_trans()) +
  ylab('Number of Citations per Year') +
  xlab('Years Since First Citation') +
  theme(axis.title = element_text(size = 14),
        axis.text = element_text(size = 12, color = 'black'))
I was also curious to know whether there were differences between male and female ecologists, so I separated the data out by sex. The allometric relationship still holds for females, but the number of citations per year increases more rapidly for females than it does for males.
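Again, just a sketch of how one might formalize that comparison, assuming a sex column coded by hand for each author (the script doesn’t produce this column, and this isn’t the exact test I ran):

# sex is a hypothetical hand-coded column ('M'/'F') added to citeDat
fit_sex <- lm(log(cites + 0.1) ~ log(lag + 1) * sex, data = citeDat)
summary(fit_sex)  # the log(lag + 1):sex term asks whether the log-log slope differs by sex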
Anyway, I thought this was interesting (I was more interested in getting my Python script to work, which it did). Thoughts?
I’m also willing to share the author list with the curious, but I kept it hidden to avoid insulting people who aren’t on the list (it was really a random sample of people I know and people whose papers I’ve read, which explains the slight bias towards marine ecology and insects).
UPDATE: I had a request for the Python script, so here it is.
from urllib import FancyURLopener
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import re

# Make a new opener w/ a browser header so Google allows it
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

myopener = MyOpener()

scholar_url = 'http://scholar.google.com/scholar?q=(query)&btnG=&hl=en&as_sdt=0%2C10'

# Pull the text between two substrings of the google chart URL
def findBetween(s, first, last):
    start = s.index(first) + len(first)
    end = s.index(last)
    return s[start:end]

# Pull the text from a substring to the end of the string
def getX(s):
    start = s.index('chxl=0:') + len('chxl=0:')
    return s[start:]

# The function to actually get the chart data
def scholarCiteGet(link):
    # Navigate to and parse the user profile
    citLink2 = link.get('href')
    s2 = 'http://scholar.google.com' + citLink2
    socket = myopener.open(s2)
    wsource2 = socket.read()
    socket.close()
    soup2 = BeautifulSoup(wsource2)

    # Find the chart image and encode the string URL
    chartImg = soup2.find_all('img')[2]
    chartSrc = chartImg['src'].encode('utf-8')

    # Get the chart y-data from the URL
    chartD = findBetween(chartSrc, 'chd=t:', '&chxl')
    chartD = chartD.split(',')
    chartD = [float(i) for i in chartD]
    chartD = np.array(chartD)

    # Get the chart y-conversion (data are scaled as a percent of the axis max)
    ymax = findBetween(chartSrc, '&chxr=', '&chd')
    ymax = ymax.split(',')
    ymax = [float(i) for i in ymax]
    ymax = np.array(ymax[-1])
    chartY = ymax / 100 * chartD

    # Get the chart x-data (years), then express them as lag since first citation
    chartX = getX(chartSrc)
    chartX = chartX.split('|')
    chartX = int(chartX[1])
    chartX = np.arange(chartX, 2014)
    chartTime = chartX - chartX[0]

    # Put the data together and return a dataframe
    name = soup2.title.string.encode('utf-8')
    name = name[:name.index(' - Google')]
    d = {'name': name, 'year': chartX, 'lag': chartTime, 'cites': chartY}
    citeData = pd.DataFrame(d)
    return citeData

def scholarNameGet(name):
    # Navigate and parse the google scholar search results for the given name
    name2 = name.replace(' ', '%20')
    s1 = scholar_url.replace('(query)', name2)
    socket = myopener.open(s1)
    wsource1 = socket.read()
    socket.close()
    soup1 = BeautifulSoup(wsource1)

    # Get the link to the user profile
    citText = soup1.find_all(href=re.compile('/citations?'))
    if 'mauthors' in str(citText):
        citLink = citText[2]
        return scholarCiteGet(citLink)
    else:
        citLink = citText[1]
        # If the link is to a user profile... get the data
        if 'User profiles' in str(citLink):
            return scholarCiteGet(citLink)
        # If not, return 'no data'
        else:
            d = {'name': name, 'year': 'No Data', 'lag': 'No Data', 'cites': 'No Data'}
            return pd.DataFrame(d, index=[0])

# Run scholarNameGet for each name and stack the results into one dataframe
finalDat = pd.DataFrame()

# Insert list of names here
sciNames = []

for name in sciNames:
    a = scholarNameGet(name)
    finalDat = pd.concat([finalDat, a])

plotDat = finalDat.pivot(index='lag', columns='name', values='cites')
plotDat = plotDat.replace('No Data', np.nan)
plotDat.plot()