UCLA Statistics: Analyzing Thesis/Dissertation Lengths

Posted on September 29, 2010 by Ryan Rosario in R bloggers | 0 Comments

[This article was first published on Byte Mining » R, and kindly contributed to R-bloggers.]

As I am working on my dissertation and piecing together a mess of notes, code and output, I am wondering to myself “how long is this thing supposed to be?” I am definitely not in this to win the prize for longest dissertation. I just want to say my piece, make my point and move on. I’ve heard that the shortest dissertation in my program was 40 pages (not true). I heard from someone at another school that their dissertation was over 300 pages. I am not holding myself to a strict limit, but I wanted a rough guideline. As a disclaimer, this blog post is more “fun” than “business.” This was just an analysis that I was interested in, and I felt that it was worth sharing since it combined Python, web scraping, R and ggplot2. It is not meant to be a thorough analysis of dissertation lengths or of the academic quality of the Department.

The UCLA Department of Statistics publishes most of its M.S. theses and Ph.D. dissertations on a website. It is not complete, especially for the earlier years, but it is a good enough population for my use.

Using this web page, I was able to extract information about each thesis submitted for publishing on this website: advisor name, work title, year completed, and level (M.S. or Ph.D.). Student name was removed for some anonymity, although anyone can easily perform this analysis manually. The scraping part was easy enough but was only half the battle. I also had to somehow extract the length of each manuscript. To do this, I visited the directory for each manuscript (organized by paper ID number), downloaded it to a temporary directory, and used the Python library pyPdf to extract the number of pages in the document. I must note that the number of pages returned by pyPdf is the number of raw pages in the PDF document, not the number of pages of writing excluding references, appendices, figures etc. I also manually corrected inconsistencies, such as name formatting, use of nicknames, and misspellings. For example, “Thomas Ferguson” was standardized to “Thomas S. Ferguson.” In the event that two advisor names were given, only the full-time Statistics professor’s name was retained. If both names were full-time Statistics faculty members, only the first one was chosen. Sorry about that.
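
The manual name cleanup described above could be partly automated with a lookup table. The following is a minimal sketch rather than the code used for this post: `ALIASES` and `standardize_advisor` are hypothetical names, and the only mapping taken from the post is the “Thomas Ferguson” example. Names are uppercased to match what the scraper emits.

```python
# Hypothetical alias table: uppercased variant -> canonical uppercased name.
# Only the Ferguson entry comes from the post; a real table is built by hand.
ALIASES = {
    "THOMAS FERGUSON": "THOMAS S. FERGUSON",
}


def standardize_advisor(raw_name):
    """Collapse runs of whitespace, uppercase, and apply known aliases."""
    cleaned = " ".join(raw_name.split()).upper()
    return ALIASES.get(cleaned, cleaned)


# A variant with stray spacing maps to the canonical form.
print(standardize_advisor("  Thomas   Ferguson "))
# A name without an alias entry is only cleaned and uppercased.
print(standardize_advisor("Jane  Doe"))
```

Nicknames and misspellings still have to be spotted by eye before they can be added to the table, so this only makes the corrections repeatable.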

Naturally, I wanted to use a plot to see the distribution of thesis and dissertation lengths, but the one produced by base graphics was terrible:

This hideous graphic gives rise to some questions…

What does the bar less than 50 represent? Just length less than 50? (sarcasm)
What does the bar greater than 200 represent? Just length greater than 200? (sarcasm)
And how do I represent the obvious difference in length of manuscript by degree objective?

Although I respect the field of visualization, I am not huge on it, and I am usually content with the basics. This is one case where I had to step up my viz a notch. I had not used ggplot2 so there was no better time to learn. I will not attempt to explain what I am doing with the graphics, as there are already plenty of tutorials and write-ups from experts on the matter. Just look and be amazed…or just look. I wanted to give ggplot2 a spin, so I whipped this up as an example.

library(ggplot2)
qplot(Pages, data=these, main="Thesis/Dissertation Lengths\nUCLA Department of Statistics") + geom_histogram(aes(fill=Level))

Wow! Now it is obvious what each bar represents, and we can easily see the difference in lengths of M.S. theses and Ph.D. dissertations. M.S. theses were typically around 50 pages, and Ph.D. dissertations were typically about 110 pages with a long right tail. We can also see what the tick labels represent, and the mesh grid gives a visual cue as to what the intermediate tick labels would be. We also see that there were two M.S. theses that were unusually long, at 135 and 140 pages. Their titles were Time Series Analysis of Air Pollution in the City of Bakersfield, California and Analysis of Interstate Highway 5 Hourly Traffic via Functional Linear Models, respectively. If you are from California, you can imagine why.
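
To make explicit what each stacked bar counts, here is a small Python sketch of the binning behind a histogram split by degree level. It is an illustration under assumptions, not what ggplot2 does internally: the bin width of 10 pages, the helper names `bin_label` and `stacked_counts`, and the sample records are all made up.

```python
from collections import Counter


def bin_label(pages, width=10):
    """Return the half-open bin [lo, hi) that a page count falls into."""
    lo = (pages // width) * width
    return (lo, lo + width)


def stacked_counts(records, width=10):
    """Count manuscripts per (bin, level); records are (level, pages) pairs."""
    return Counter((bin_label(pages, width), level) for level, pages in records)


# Made-up records for illustration only; these are not the scraped data.
sample = [("MS", 48), ("MS", 52), ("MS", 55),
          ("PHD", 108), ("PHD", 112), ("PHD", 190)]

counts = stacked_counts(sample)
for (lo, hi), level in sorted(counts):
    print("[%d, %d) %s: %d" % (lo, hi, level, counts[((lo, hi), level)]))
```

Each printed row corresponds to one colored segment of one bar, which is exactly the information the base graphics histogram left ambiguous.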

We can see that there is not much variance among lengths of Master's theses and much higher variance for Ph.D. dissertations. I hypothesized that there was an advisor and year effect. Based on hearsay, I had an idea of which advisors yielded the longest and shortest dissertations. My hunch does in fact appear to be true, but I am withholding those results. What I will say is that there does not seem to be a “pattern.” It does not seem that the more accomplished professors yield longer (or shorter) dissertations. It also does not seem that certain fields, like Vision or Genetics, yield longer or shorter dissertations as a group.

The following is a boxplot of the length of Ph.D. dissertations for your entertainment.

But how has the length of dissertations changed over time? Or has it not?

qplot(Year, Pages, data=phd, main="Dissertation Lengths over Time\nUCLA Department of Statistics") + geom_smooth()

This plot is beautiful, and interesting. It seems to suggest that overall, the mean length of a dissertation fell sharply between 1996 and 2000. However, there is a sample size effect here and there is not enough information to claim that there was in fact a drop during this period. If there was in fact a decrease in dissertation length, there could be several reasons. The Department became independent from the Department of Mathematics in 1998. Perhaps the academic climate was changing and dissertations were becoming shorter. Or, it could be that the Department of Mathematics historically had longer dissertations, and once the Department split off, its requirements drifted away from those of Mathematics. I emphasize the word mean because a better statistic here is the median, since dissertation lengths do not follow a normal distribution; rather, they follow a right-skewed distribution. Still, using the median does not account for the sample size effect.
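
The difference between the mean and the median under right skew is easy to see on a toy sample. The page counts below are made up for illustration and are not the scraped data.

```python
import statistics

# Made-up, right-skewed page counts: most values sit near 110,
# with two long dissertations in the right tail.
pages = [95, 100, 105, 110, 112, 118, 125, 140, 210, 320]

mean_pages = statistics.mean(pages)
median_pages = statistics.median(pages)

# The long right tail pulls the mean well above the median.
print("mean:  ", mean_pages)
print("median:", median_pages)
```

In this toy sample only the two longest manuscripts exceed the mean, which is why the median is the more representative "typical" length here.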

From 2000 to 2006, dissertation lengths seem to have leveled off. Then from 2006 to 2010, it appears that dissertation lengths increased. Not so fast though! Note that the number of dissertations filed from 2006 to 2010 is much larger than the number submitted in other periods of equal length, so this bump is likely due to the number of observations. Based on my understanding of Department history, I believe that there probably was a decrease through the early years of the program as the Department established its own separate expectations. This may hold practically, but does it hold statistically?
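
One way to keep the sample size effect in view is to tabulate, for each year, how many dissertations there are alongside their median length. This is a minimal sketch; `summarize_by_year` is a hypothetical helper and the records are made up, not the real data.

```python
from collections import defaultdict
import statistics


def summarize_by_year(records):
    """Return {year: (n, median_pages)} from an iterable of (year, pages)."""
    by_year = defaultdict(list)
    for year, pages in records:
        by_year[year].append(pages)
    return {year: (len(values), statistics.median(values))
            for year, values in sorted(by_year.items())}


# Made-up records for illustration only.
records = [(1997, 180), (2001, 120),
           (2008, 105), (2008, 130), (2008, 150),
           (2009, 110), (2009, 160)]

summary = summarize_by_year(records)
for year, (n, median_pages) in summary.items():
    print(year, "n=%d" % n, "median=%s" % median_pages)
```

A year with n=1 contributes a single manuscript to the curve, so its apparent level should be read with far less confidence than a year with many observations.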

geom_smooth() adds a curve to the plot representing a locally smoothed average of the data. It is not a trend line! geom_smooth() also adds some type of margin of error around this smoothing line (I admit that I have not looked deeply into the internals of ggplot2). If we interpret the margin of error loosely as a confidence interval, we can draw a statistical conclusion from this graph. Recall that a basic one-sample confidence interval with the population standard deviation known is

$\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$

If we are given a value $\mu_0$ and it falls within the confidence interval, we cannot rule out that the true parameter $\mu$ is $\mu_0$. Take $\mu_0=130$ pages. If we take the shaded region to be a confidence interval around $\mu$, then we see that it is possible that $\mu = 130$ pages throughout the time period I studied. To make a long story short, it is possible that the length of dissertations has remained constant over time.
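
The interval formula and the "does $\mu_0$ fall inside?" check can be worked through numerically. The sketch below uses hypothetical inputs ($\bar{x} = 120$, $\sigma = 40$, $n = 16$) chosen only to make the arithmetic easy to follow; they are not estimates from the data, and `z_interval` and `is_plausible` are illustrative helper names.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(xbar, sigma, n, confidence=0.95):
    """One-sample z interval: xbar +/- z* * sigma / sqrt(n)."""
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z_star * sigma / sqrt(n)
    return (xbar - half_width, xbar + half_width)


def is_plausible(mu0, interval):
    """True when mu0 lies inside the interval, so it cannot be ruled out."""
    lo, hi = interval
    return lo <= mu0 <= hi


# Hypothetical inputs, not estimates from the thesis data.
interval = z_interval(xbar=120.0, sigma=40.0, n=16)
print("interval:", interval)
print("mu0 = 130 plausible?", is_plausible(130, interval))
```

With these inputs the half-width is about 19.6 pages, so 130 falls inside the interval. Shrinking $n$ widens the interval, which is the sample size effect in a different guise.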

So what is the purpose of this analysis? There is no purpose. It was just my curiosity, and I thought that some of the coding was worth sharing.

With that said, after this extensive analysis, my goal is 110-115 pages.

Some open questions for readers:

1. How can I add the line $y = \mu_0 = 130$ to my time series plot?
2. What, in fact, does the shaded area represent (if it is not a margin of error forming a poor man’s confidence interval)?
3. Is it possible to change the measurement function in geom_smooth() from mean to median (or something else)?
4. Given 1-3 above, how can I also add jitter and alpha blending to the points? (I tried to do it but encountered errors)
5. Is there a better way to visualize this time series, given the sample size issue, without throwing out those dates?

Scraper code:

	#!/usr/bin/env python

	'''
	theses.py

	Created on September 28, 2010
	@author: Ryan R. Rosario
	@contact: <first name> @ stat //DOT/ ucla //DOT/ edu

	'''

	#NOTE: This is Python 2 code (urllib2, print >>, pyPdf).
	from pyPdf import PdfFileWriter, PdfFileReader #for dealing with PDF files.
	import urllib2, re, os, sys, warnings

	warnings.simplefilter("always")

	def scrapeMetadata():
	    '''
	    Extract information about theses from main theses.stat.ucla.edu site.
	    '''
	    url = 'http://theses.stat.ucla.edu/index_body.php?sort=year&order=DESC&' + \
	          'where=&count=115&limit=115&position=0&collapse_search=true&' + \
	          'collapse_sort=true&display_author_string=&display_title_string='
	    contents = urllib2.urlopen(url).read()
	    #Regexes that extract the ID, year, level and advisor for each paper.
	    #ASSUMPTION: Extracted values are all in the same order.
	    ids = re.compile(
	        'Paper\#:</td>.*?<td class="element-(?:1|2)L" align="left" ' \
	        'valign="top"><br/>.?(\d+).?<br/>', re.DOTALL)
	    years = re.compile(
	        'Year:</td>.*?<td class="element-(?:1|2)L" align="left" ' \
	        'valign="top">(?:<br/>)?.?(\d{4}).?<br/>', re.DOTALL)
	    levels = re.compile(
	        'Level:</td>.*?<td class="element-(?:1|2)L" align="left" ' \
	        'valign="top">(?:<br/>)?.?([A-Za-z\.]+).?<br/>', re.DOTALL)
	    advisors = re.compile(
	        'Advisor:</td>.*?<td class="element-(?:1|2)L" align="left" ' \
	        'valign="top">(?:<br/>)?.?([A-Za-z\. -]+).?<br/>', re.DOTALL)
	    #Strip whitespace and convert advisor name to uppercase.
	    advisors = [a.strip().upper() for a in
	                re.findall(advisors, contents)]
	    #Convert M.S. Ph.D. to MS and PHD
	    levels = [a.replace('.','').upper() for a in re.findall(levels, contents)]
	    ids = re.findall(ids, contents)
	    years = re.findall(years, contents)
	    return advisors, levels, ids, years

	def parsePDFfile(id):
	    '''
	    Finds and downloads the manuscript and determines its length. File is
	    deleted afterwards.

	    Arguments:
	    id -- paper number.

	    WARNING: Internal function. Does not check that the ID number is valid.
	    '''
	    url = 'http://theses.stat.ucla.edu/%s' % id
	    contents = urllib2.urlopen(url).read()
	    thesis = re.compile('<a href="(.*?\.pdf)">')
	    #Try to find a link to download the thesis file.
	    try:
	        filen = re.findall(thesis, contents)[0]
	    except IndexError:
	        #Some are listed, but have no document attached.
	        return "NA"
	    url += '/%s' % filen
	    #Dump the thesis to a file.
	    pdf = urllib2.urlopen(url).read()
	    TMP = open('/tmp/' + filen, "wb")
	    TMP.write(pdf)
	    TMP.close()
	    #Read the file and count the number of pages.
	    input = PdfFileReader(file('/tmp/' + filen, "rb"))
	    pages = input.getNumPages()
	    os.unlink('/tmp/' + filen)
	    return pages

	def output(data):
	    '''
	    Print data to CSV stream.

	    Arguments:
	    data -- a row/list of data to be printed.
	    '''
	    for row in data:
	        print >> sys.stdout, ','.join(map(str, row))
	    return

	def main():
	    warnings.filterwarnings("ignore", category=DeprecationWarning)
	    #Scrape the advisors names, degree level, ID number and graduation year
	    #from theses.stat.ucla.edu
	    advisors, levels, paper_nos, years = scrapeMetadata()
	    pagecounts = []
	    for paper_no in paper_nos:
	        pagecount = parsePDFfile(paper_no)
	        pagecounts.append(pagecount)
	    #Collect all data together.
	    data = zip(paper_nos, years, levels, advisors, pagecounts)
	    #Print to STDOUT in CSV format.
	    output(data)
	    return

	if __name__ == "__main__":
	    main()
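
The scraper prints headerless CSV rows in the order paper number, year, level, advisor, page count, with the string "NA" where no PDF was attached. Before analysis those rows need to be read back and typed. The following is a minimal sketch in Python 3 (unlike the Python 2 scraper above); the field names in `FIELDS`, the `read_theses` helper, and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Hypothetical column names, in the order produced by zip() in main().
FIELDS = ["paper_no", "year", "level", "advisor", "pages"]


def read_theses(stream):
    """Parse the scraper's CSV output; "NA" page counts become None."""
    rows = []
    for record in csv.DictReader(stream, fieldnames=FIELDS):
        pages = record["pages"]
        record["pages"] = None if pages == "NA" else int(pages)
        record["year"] = int(record["year"])
        rows.append(record)
    return rows


# Made-up rows for illustration only; in practice, pass an open file.
sample = io.StringIO(
    "101,2009,PHD,JANE Q. DOE,118\n"
    "102,2010,MS,JANE Q. DOE,NA\n"
)
rows = read_theses(sample)
with_pdf = [r for r in rows if r["pages"] is not None]
print(len(rows), "rows,", len(with_pdf), "with a page count")
```

Turning "NA" into `None` up front keeps manuscripts without a PDF from being silently treated as numbers later on.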
