Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
David Hume – poster boy of the Scottish Enlightenment, contemporary of Adam Smith, and famous essayist – was embarrassingly nowhere to be found on my radar. The knowledge gap became apparent after working with an Ivy League philosopher-turned-scientist for 10 months. To remedy the situation, I decided to combine actually reading Hume with Data Science (DS) in the form of Text Mining (TM) of Hume’s works, which can be sourced via the gutenbergr library, a wrapper for Project Gutenberg’s API.
Why
Philosophical texts represent a particularly intellectually stimulating way to practice TM, given:
- the complexity of the ideas being expressed, combined with the historical differences in language usage, offers greater challenges in attaining meaningful knowledge extraction compared to commonplace business applications (e.g. customer feedback comments, news articles, etc.)
- quantitative techniques such as these are a new and rapidly developing addition to the scientific study of philosophy, as well as the closely related area of analytic philosophy
- network extensibility: this is a good example use case for TM both within a single author and work and across a network of philosophers (which is beyond the scope of this article)
What
For the purposes of this initial blog article and the minimum viable product (MVP) launch of the related domain (hume.xyz), I focused on the following features:
1. Create an interactive web (Shiny) app to allow anyone to conduct TM or potentially other DS tasks on the works of David Hume
2. Provide some quantitative analytics of the text (e.g. frequency table)
3. Provide some interactive visualizations (cluster analysis, word cloud, etc.)
4. Train supervised and unsupervised FastText models
Future work could even scale to include social media and other sources of discussion by the general public and critics.
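To make the frequency-table feature concrete, here is a minimal base-R sketch of a 1-gram count. It is not the app's actual code: the function name `word_freq`, the tokenizing regex, and the toy input lines are all assumptions made for illustration.

```r
# Hypothetical helper: count how often each word occurs in a character vector.
word_freq <- function(lines) {
  # Lower-case, then split on anything that is not a letter or apostrophe
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  words <- words[nzchar(words)]                    # drop empty tokens
  counts <- sort(table(words), decreasing = TRUE)  # most frequent first
  data.frame(word = names(counts),
             n = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy lines standing in for the downloaded text (made up for this example)
toy_lines <- c("Reason is the slave of the passions",
               "the passions move us and reason serves")
freq <- word_freq(toy_lines)
head(freq)
```

A real run would pass the text column of the downloaded works instead of `toy_lines`, and would probably remove stop words before counting.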
How
The Shiny app uses shinydashboard to keep the UI looking coherent without having to spend much time on it.
I didn’t see a point in constantly re-pulling the data from the Internet, so I wrote some control flow logic to handle the downloading and reloading:
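    library(gutenbergr)

    # repull is a logical flag set elsewhere: TRUE re-downloads, FALSE uses the cache
    if (repull) {
      # Look up the works for Gutenberg author ID 1440 and download their text
      gw   <- as.data.frame(gutenberg_works(gutenberg_author_id == 1440))
      hume <- gutenberg_download(gw$gutenberg_id)
      # Cache both objects; the else branch reads hume.RDS, so it must be saved too
      saveRDS(hume, "hume.RDS")
      saveRDS(gw, "gw.RDS")
    } else {
      hume <- readRDS("hume.RDS")
      gw   <- readRDS("gw.RDS")
    }
    # Replace carriage returns with spaces and drop newlines in the titles
    gw$title <- gsub("\n", "", gsub("\r", " ", gw$title))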
I put each output in a tab with interactive controls in a conditional panel. The values are displayed on the condition that the user has selected the corresponding tab.
    fluidRow(
      tabsetPanel(
        tabPanel("Original Text", value = 1, verbatimTextOutput("value1")),
        tabPanel("Table: 1-gram Frequencies", dataTableOutput("table1")),
        tabPanel("Viz: 3D Network (slow)", value = 6, plotOutput("plot5")),       # slow! fix me
        tabPanel("Viz: Cluster Analysis (slow)", value = 5, plotOutput("plot4")), # slow! fix me
        tabPanel("Viz: Hierarchical Clustering", value = 4, plotOutput("plot3")),
        tabPanel("Viz: Histogram", value = 2, plotOutput("plot1")),
        tabPanel("Viz: Word Cloud", value = 3, plotOutput("plot2")),
        tabPanel("About", value = 0, verbatimTextOutput("About")),
        id = "conditionedPanels"
      )
    )
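The conditional-panel side of this is not shown above, so here is a rough sketch of how a control could be tied to one tab. The slider, its input ID `bins`, and the surrounding layout are hypothetical; only the `conditionedPanels` ID and the tab values come from the code above.

```r
library(shiny)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      # Shown only while the Histogram tab (value = 2) is selected.
      # The condition is a JavaScript expression evaluated in the browser.
      conditionalPanel(
        condition = "input.conditionedPanels == 2",
        # Hypothetical control, not taken from the app
        sliderInput("bins", "Number of bins", min = 5, max = 50, value = 20)
      )
    ),
    mainPanel(
      tabsetPanel(
        id = "conditionedPanels",
        tabPanel("Original Text", value = 1, verbatimTextOutput("value1")),
        tabPanel("Viz: Histogram", value = 2, plotOutput("plot1"))
      )
    )
  )
)
```

Because the `tabsetPanel` has an `id`, the `value` of the selected tab is exposed as `input$conditionedPanels` on the server and as `input.conditionedPanels` in the condition string.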
Right now I’m dealing with some performance issues on the K-means cluster analysis and network graph (fine locally, just slow on the Internet) but this should be solved imminently by either a switch to better hosting or re-writing that part of the code.
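For readers unfamiliar with the cluster analysis step, the sketch below runs k-means on a tiny term-count matrix in base R. It is an illustration under stated assumptions, not the app's code: the four toy documents, the choice of two clusters, and all variable names are made up.

```r
# Toy documents (made up); names become the row names of the matrix
docs <- c(a = "reason passion reason",
          b = "passion reason passion",
          c = "tax trade money",
          d = "money trade tax trade")

tokens <- strsplit(docs, " ")
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per vocabulary word
dtm <- t(vapply(tokens,
                function(w) as.numeric(table(factor(w, levels = vocab))),
                numeric(length(vocab))))
colnames(dtm) <- vocab

set.seed(1)                                  # k-means starts from random centers
km <- kmeans(dtm, centers = 2, nstart = 10)  # keep the best of 10 random starts
km$cluster
```

On real text the matrix has one column per distinct term, so its size grows quickly with the vocabulary.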
For now, the first three features listed above are being served via ShinyApps.io. The full code behind the site is available on GitHub, and I will keep building on it iteratively as time allows. The code for the fourth feature (the FastText models) is also on GitHub, but I’ll hold off on deploying it online until I switch to serving the site from other hosting (soon).
Next Steps
A lot of improvements could be made to the existing output – stem completion, improving plot aesthetics, profiling load times, etc. Actually, I don’t really like the default look of the app much and the load times are quite poor. What’s nice is that it’s readily forkable to a proper development or production server for open or proprietary usage. I hope this will be the first small iteration in an evolving project, which I will then fork myself for just such proprietary usage. All are encouraged to submit pull requests or fork.
Since time may be limited, I’m thinking of leaving it ugly for now and moving on to looking further at relationships between novels, other authors citing or cited by Hume, parts-of-speech tagging and N-gram analysis.
For a primer on n-gram TM in R, check out this tutorial on N-gram word clouds.
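As a quick illustration of the n-gram idea (this is not the linked tutorial's code), word n-grams can be built in base R by sliding a window over the token sequence. The function name `ngrams` and the toy input are assumptions for this sketch.

```r
# Hypothetical helper: return all word n-grams from a character vector of lines.
ngrams <- function(lines, n = 2) {
  # Treat the lines as one running text, lower-case it, and tokenize
  words <- unlist(strsplit(tolower(paste(lines, collapse = " ")), "[^a-z']+"))
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  # Paste each window of n consecutive words into a single string
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy lines (made up for this example)
bigrams <- ngrams(c("the passions move reason", "and the passions win"), n = 2)
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
bigram_counts
```

The resulting counts can be fed to a word cloud or frequency table in the same way as single-word counts.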
-JDM