UCLA Statistics: Analyzing Thesis/Dissertation Lengths
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
As I work on my dissertation and piece together a mess of notes, code and output, I keep wondering, “how long is this thing supposed to be?” I am definitely not in this to win the prize for longest dissertation. I just want to say my piece, make my point and move on. I’ve heard that the shortest dissertation in my program was 40 pages (not true). I heard from someone at another school that their dissertation was over 300 pages. I am not holding myself to a strict limit, but I wanted a rough guideline.
As a disclaimer, this blog post is more “fun” than “business.” This was just an analysis that I was interested in and felt was worth sharing, since it combined Python, web scraping, R and ggplot2. It is not meant to be a thorough analysis of dissertation lengths or of the academic quality of the Department.
The UCLA Department of Statistics publishes most of its M.S. theses and Ph.D. dissertations on a website. It is not complete, especially for the earlier years, but it is a good enough population for my use.
Using this web page, I was able to extract information about each thesis submitted for publishing on this website: advisor name, work title, year completed, and level (M.S. or Ph.D.). Student name was removed for
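As a rough illustration of this kind of extraction, here is a minimal Python sketch using only the standard library. The table markup, the `ThesisTableParser` class, and the sample rows are all hypothetical: the real page layout is not shown in this post, so treat this as one possible approach under those assumptions rather than the scraper actually used.

```python
from html.parser import HTMLParser


class ThesisTableParser(HTMLParser):
    """Collect the text of each <td> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a <td> cell.
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Header rows contain only <th> cells, so they stay empty and are skipped.
            if self._row:
                self.rows.append([cell.strip() for cell in self._row])
            self._row = None


# Hypothetical markup and placeholder values, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Advisor</th><th>Title</th><th>Year</th><th>Level</th></tr>
  <tr><td>Advisor A</td><td>Example Title One</td><td>2008</td><td>Ph.D.</td></tr>
  <tr><td>Advisor B</td><td>Example Title Two</td><td>2009</td><td>M.S.</td></tr>
</table>
"""

parser = ThesisTableParser()
parser.feed(SAMPLE_HTML)

fields = ("advisor", "title", "year", "level")
theses = [dict(zip(fields, row)) for row in parser.rows]
for thesis in theses:
    thesis["year"] = int(thesis["year"])  # store the year as a number

print(theses)
```

Each record ends up as a small dictionary keyed by the four fields listed above, which is easy to write out as CSV for R to read.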
Naturally, I wanted to use a plot to see the distribution of thesis and dissertation lengths, but the one produced by base graphics was terrible:
This hideous graphic gives rise to some questions…
- What does the bar less than 50 represent? Just length less than 50? (sarcasm)
- What does the bar greater than 200 represent? Just length greater than 200? (sarcasm)
- And how do I represent the obvious difference in length of manuscript by degree objective?
Although I respect the field of visualization, I am not huge on it, and I am usually content with the basics. This is one case where I had to step up my viz a notch. I had not used ggplot2 before, so there was no better time to give it a spin. I will not attempt to explain what I am doing with the graphics, as there are already plenty of tutorials and write-ups from experts on the matter. Just look and be amazed…or just look.
library(ggplot2)
qplot(Pages, data=these, main="Thesis/Dissertation Lengths\nUCLA Department of Statistics") + geom_histogram(aes(fill=Level))
Wow! Now it is obvious what each bar represents, and we can easily see the difference in lengths between M.S. theses and Ph.D. dissertations. M.S. theses were typically around 50 pages, and Ph.D. dissertations were typically about 110 pages with a long right tail. We can also see what the tick labels represent, and the grid lines give a visual clue as to what the intermediate tick values would be. We also see that two M.S. theses were unusually long, at 135 and 140 pages respectively. Their titles were
We can see that there is not much variance among the lengths of M.S. theses and much higher variance among Ph.D. dissertations. I hypothesized that there was an advisor effect and a year effect. Based on hearsay, I had an idea of which advisors yielded the longest and shortest dissertations. My hunch does in fact appear to be true, but I am withholding those results. What I will say is that there does not seem to be a “pattern.” It does not seem that the more accomplished professors yield longer (or shorter) dissertations. It also does not seem that certain fields, like Vision or Genetics, yield longer or shorter dissertations as a group.
The following is a boxplot of the length of Ph.D. dissertations for your entertainment.
But how has the length of dissertations changed over time? Or has it not?
qplot(Year, Pages, data=phd, main="Dissertation Lengths over Time\nUCLA Department of Statistics") + geom_smooth()
This plot is beautiful, and interesting. It
From 2000 to 2006, dissertation lengths seem to have leveled off. Then from 2006 to 2010, it appears that dissertation lengths increased. Not so fast though! Note that the number of dissertations filed from 2006 to 2010 is much larger than the number submitted in other periods of equal length, so this bump is likely due to the number of observations. Based on my understanding of Department history, I believe that there probably was a decrease through the early years of the program as the Department established its own separate expectations. This may hold practically, but does it hold statistically?
geom_smooth() adds a curve to the plot representing a moving average over the data. It is not a trend line! geom_smooth() also adds some type of margin of error around this smoothing line (I admit that I have not looked deeply into the internals of ggplot2). If we interpret the margin of error loosely as a confidence interval, we can draw a statistical conclusion from this graph. Recall that a basic one-sample confidence interval for the mean, with population standard deviation $\sigma$ known, is
$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
If we are given a value $\mu_0$ and it falls within the confidence interval, we must conclude that the true parameter $\mu$ could possibly be $\mu_0$. Take a particular value of $\mu_0$ pages. If we take the shaded region to be a confidence interval around the mean length $\mu$, then we see that it is possible that $\mu = \mu_0$ pages throughout the time period I studied. To make a long story short, it is possible that the length of dissertations has remained constant over time.
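To make the interval argument concrete, here is a small Python sketch of the known-standard-deviation interval (sample mean plus or minus a normal quantile times sigma over the square root of n) and the check of whether a candidate value falls inside it. The page counts, the value of `sigma`, and the candidate `mu_0` are all made up for illustration; none of them come from the scraped data.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_interval(sample, sigma, confidence=0.95):
    """Return (lower, upper) for the mean, assuming sigma is known."""
    n = len(sample)
    xbar = fmean(sample)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin


# Made-up page counts, for illustration only (not the scraped data).
pages = [95, 102, 110, 118, 125, 131, 104, 99, 140, 116]
sigma = 15.0  # assumed known population standard deviation
mu_0 = 110    # hypothetical candidate mean length

lower, upper = z_interval(pages, sigma)
plausible = lower <= mu_0 <= upper
print(f"95% CI: ({lower:.1f}, {upper:.1f}); mu_0 inside: {plausible}")
```

If the candidate value lies inside the interval, the data do not rule it out. That is the same loose reasoning applied to the shaded band in the plot.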
So what is the purpose of this analysis? There is no purpose. I was just curious, and I thought that some of the code was worth sharing.
With that said, after this extensive analysis, my goal is 110-115 pages.
Some open questions for readers:
- How can I add the line to my time series plot?
- What, in fact, does the shaded area represent (if it is not a margin of error forming a poor man’s confidence interval)?
- Is it possible to change the measurement function in geom_smooth() from mean to median (or something else)?
- Given 1-3 above, how can I also add jitter and alpha blending to the points? (I tried to do it but encountered errors)
- Is there a better way to visualize this time series, given the sample size issue, without throwing out those dates?
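On the questions about medians and sample size, one option that does not depend on ggplot2 is to tabulate the count and median per year before plotting, so that sparse years are visible as sparse. The sketch below is only a suggestion. The `summarize_by_year` helper and the `(year, pages)` records are hypothetical and are not taken from the scraped data.

```python
from collections import defaultdict
from statistics import median


def summarize_by_year(records):
    """Group (year, pages) pairs by year; return {year: (count, median_pages)}."""
    by_year = defaultdict(list)
    for year, pages in records:
        by_year[year].append(pages)
    return {
        year: (len(values), median(values))
        for year, values in sorted(by_year.items())
    }


# Made-up records, for illustration only.
records = [
    (2001, 98), (2001, 150),
    (2006, 105),
    (2008, 90), (2008, 120), (2008, 160), (2008, 112),
]

summary = summarize_by_year(records)
for year, (count, med) in summary.items():
    print(f"{year}: n={count}, median={med}")
```

Printing the count next to each median makes it easier to judge how much weight a given year deserves, without dropping any dates.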
Scraper code: