Single Word Analysis of Early 19th Century Poetry Using tidytext
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
When reading poetry, it feels as if you are reading emotion. A good poet can use words to bring to their reader whichever emotions they choose. To do this, these poets often have to look deep within themselves, using their own lives as fuel. For this reason you would expect that poets and as a result, their poetry, would be influenced by the events happening around them. If certain events caused the poet’s view of the world to be an unhappy one, you’d expect their poetry to be unhappy, and of course the same for the opposite.
In this analysis I attempt to show this relation between world events and the poets of those times.
Gathering the required data – this being poems and their dates – proved to be a very difficult task. I had initially hoped to be able to show an analysis of the entire 20th century (this may still happen when I find more time and/or sources), but in the end I settled for the years between 1890 – 1925. I managed to collect an overall number of 97 poems (a few of which are excluded from the analysis due to being written in 1888 or 1889) from 6 of the most well known poets of the time. These poets being:
- William Yeats
- T. S. Eliot
- Robert Frost
- Wallace Stevens
- E. E. Cummings
- D. H. Lawrence
I will only be going through the preparation code for Yeats. The rest is similar but not exactly the same, and can be found in my github repository (https://github.com/Laurence-Sonnenberg/single_word_poetry_analysis) if anyone would like to re-create anything I have done. The actual analysis is done using all poets/poems combined.
Firstly I load the packages I’m going to need. I have included a small function that other people might find useful.
The website I use for scraping has a page for each poet which lists the poems available – then it has a seperate page for each of the poems in the list, along with their dates. I have chosen this website specifically because having each poem dated is a must for my analysis. I run an initial scrape using rvest
to get the names which are then cleaned so they can be used in the urls to get the poems and dates. Some names are not exactly the same as their url versions, so they need to be manually changed.
Next I need to scrape the website for the poems and their dates. To do this I use a function that takes a poem name – and again using rvest
– scrapes for the given poem and its date, then cleans the text and waits before returning. The wait is so that I don’t send too many requests to the website in quick succession.
I then use the previous function in a for loop which loops through each poems name and scrapes for that particular poem and its date. With each loop I also add the data frame of poem and date to a list. Once the for loop has completed I rbind
all the data frames together. A complete data frame of all poems and dates will be needed for the next step.
I then combine the names from the original scrape, the dates, and the poems into a single data frame so they will be ready for me to work with. I also take a look at all the text, checking for and correcting any errors that could negatively affect my analysis. In this case I have found an error in a date.
After this, we have a dataset of dimension 50x3 with column headings poemName
, poem
and date
.
The next function I have written takes a row index, and using the combined data frame of poems, names and dates, along with tidytext
, tokenizes the poem found at that index into single words. It then uses anti_join
from dplyr
, along with the stop_words
dataset from tidytext
, to create a tibble – this tibble will now be missing a list of stop words which will not be helpful in the analysis. This list of words for each poem is then used to create a sentiment score. This is done using inner_join
and the sentiment
dataset filtered on the bing lexicon, again from tidytext
. The function calculates the sentiment score value for the poem depending on the number of negative and positive words in the list – negative words get a value of -1 and positive get a value of 1, these values are then summed and the result returned as a percentage of the number of words in that particular poem.
Now, using the previous function and sapply
I get the score for each poem and create a data frame with the poems score and its date. These scores will be used later to show the change in sentiment for each year.
The next function I have written takes a particular sentiment, which in this case can also be called an emotion, from a number of sentiments found in the NRC lexicon of the sentiment
dataset; it then again creates a tibble of words not including the stop words, and using semi_join
and the sentiment
dataset filtered on the NRC lexicon and the particular sentiment given, returns the sum of the number of words relating to that sentiment. These sums will be used to show which emotions were high (or low) in a particular year.
Using the above function and sapply
, I get the sums for each different sentiment in my sentiments list; this is done within a for loop, which loops through each poem, creating a data frame with each row consisting of a sentiment, the sum of words for that sentiment, the name of the poem the sentiment relates to, and the poems date. I then add this data frame to a list which will be used in my next step.
Once the for loop has completed I rbind
all the data frames in the list so that I can work with all of them together as a whole.
Now all the preparation code is done and we are ready to begin analyzing. For ease of use I have created a function for each poet to run all the code shown previously; the functions return both the bing and the NRC data frames, ready to be used. These functions are run but will not be shown here.
Firstly I will need to call each poet’s function and assign the returned data frames to variables so that I can work with them as needed.
Having collected all the poems from our five poets, we plot the count of poems per year for each of our selected sample:
From this we see that Yeats dominates the earlier period in our sample, while Frost plays a large part in the second half of the sample of text.
Next I will join all the bing scores for all poems using rbind
, I do this so that I can work with all the data at once. I then group by date and get a mean score for each year in the complete data frame. This returns a tibble of scores and dates which I can now plot.
I use multiplot so that I can show two different plots side by side.
The 0 values in the bar plot are not all actual 0 values but rather years where no information is available. When looking at the plots we can see a steady decline between the period 1890 - 1900, then due to a lack of data very little can be seen during the period from 1900 - 1915, after which there is again a massive decline between 1915 and 1920 and again a small drop between 1920 and 1925.
We can also look at the individuals poets, to determine who of them portrayed their sorrow the most through their poems. To do this we plot the average score per poet.
From the plot we see that Lawrence has the most negative mean score and with poems such as giorno dei morti or day of the dead, it comes as no suprise!
Finally, let’s do the NRC analysis. Firstly, I join together all the poet’s NRC data frames using rbind
, again so I can work with all the data at once. I then group them by date and sentiment, sum the sentiment values for each poem and use this sum to get percentage values for each sentiment in each year. Once I have done this the data frame is ready to be plotted.
I use facet_wrap
on the date so that I can display all the plots for all the years at once.
Looking at these plots we can see how sentiment changes through the years of available data. During the early 1890s we see equal percentages of anticipation, fear, joy, sadness and trust, while later we see anticipation, joy and trust. In 1903 joy takes the lead, but in 1910 and 1915 we see anticipation and trust increasing with fear and joy equal. In 1916 fear and sadness are high and 1917 anger, fear, sadness and trust are notibly high. From 1918 - 1919 we see a change from negative to positive with joy taking a big lead in 1919 and again in 1920. 1921 sees sadness again with others coming in with mostly equal percentages, and 1922 anticipation, joy and trust.
I am not a historian but fortunately my sister majored in history, so I asked her to send me a timeline of influencial world events between the years this analysis is based on. I have used this timeline to explain the findings.
From both the bing and NRC analysis it can be seen that there was a very apparent negative trend during the years from 1915 - 1920, this is very possibly due to a number of world events, namely; World War I during the years from 1914 - 1918, the influenza pandemic which lasted from 1918 - 1919, and the Russian Revolution in 1917. Another negative trend seems to appear at the end of the 19th Century, this time it is possibly due to the Boer War, which lasted from 1899 - 1902 and had repurcussions across Europe, as well as the Indian famine which killed an estimated one million people during the years 1899 and 1900.
A few things to be mentioned:
- As we all know the pop artists of the time will always move with popular trends, so it could be that the poets chosen, being the most popular during the years this analysis is based on, may have been more likely to be influenced by the events around them
- Due to the surprisingly difficult task of finding poetry with enough information, this analysis has been based on only a small number of poems
Taking the above into account, it is clear more work needs to be done; though the findings from this brief analysis do show a possible relation between poetry and the events of the time.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.