Text Mining and The Danish Immigration Debate
I have wanted to learn how to do text mining in R for a while now. I have played around a bit with the tm package before, but nothing really serious. Being a Danish expat who might soon move back to Denmark with a Swiss girlfriend and child, I find the Danish immigration debate of great interest. So naturally, like any quantitatively oriented scholar, I thought it would be interesting to somehow evaluate the Danish debate on immigration and, at the same time, gain some experience with text mining. I wrote a small Python script that logged on to one of the biggest newspapers in Denmark (Politikken.dk), ran a search on immigration, and downloaded all articles that the search returned.
The script returned 971 articles published in the period 2001 to 2012. I had another small script extract the date, category, and content from each article and store them in a CSV file.
For the analysis, only articles in the categories debate, politics, and opinion were selected, which narrowed the data down to 205 articles. The next step is to create a corpus object with the tm package to store the content of the articles.
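The category filter can be sketched in a few lines of plain Python. This is only an illustration, not the script I used: the column names (date, category, content), the English category labels, and the sample rows are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical category labels; the real CSV layout is not shown in this post.
KEEP = {"debate", "politics", "opinion"}

def select_articles(csv_file, keep=KEEP):
    """Return the rows whose category is in `keep` (case-insensitive)."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["category"].strip().lower() in keep]

# Toy stand-in for the scraped file, not real data.
sample = io.StringIO(
    "date,category,content\n"
    "2005-03-01,Debate,Tekst om indvandring\n"
    "2007-06-12,Sport,Tekst om fodbold\n"
    "2011-09-30,politics,Tekst om regeringen\n"
)
selected = select_articles(sample)
print(len(selected))
```

With a real file, the same function would take the handle returned by open(path, newline="", encoding="utf-8").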
Once this is done, the data cleaning can begin. The first step is to clean the corpus by making all words lower case, removing punctuation, and removing stopwords. Stopwords are commonly occurring words that do not carry any meaning on their own (e.g. the, is, at). I would have liked to stem the corpus as well; however, the stemming function in the tm package does not seem to work with Danish (if anyone knows how to stem Danish words with the tm package, let me know!). Finally, a document-term matrix was created, and words occurring very infrequently were removed.
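To make the cleaning steps concrete, here is a minimal sketch in plain Python of what they amount to. It is not the tm code behind this post: the stopword list is a tiny made-up sample, the three documents are toy sentences, and the min_docs cut-off is an assumed stand-in for whatever sparsity threshold one would actually pick.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"og", "i", "at", "er", "en", "det", "den", "til", "på", "de"}

def clean(text, stopwords=STOPWORDS):
    """Lower-case, replace punctuation with spaces, split, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # \w is Unicode-aware, so æ, ø, å survive
    return [w for w in text.split() if w not in stopwords]

def document_term_matrix(docs, min_docs=2):
    """Return (terms, rows) where rows[i][j] counts terms[j] in docs[i].

    Terms that appear in fewer than `min_docs` documents are dropped.
    """
    counts = [Counter(clean(d)) for d in docs]
    doc_freq = Counter(t for c in counts for t in c)
    terms = sorted(t for t, n in doc_freq.items() if n >= min_docs)
    rows = [[c[t] for t in terms] for c in counts]
    return terms, rows

# Toy documents, not taken from the article data.
docs = [
    "Indvandring er et emne i Danmark.",
    "Regeringen taler om indvandring, og Danmark lytter.",
    "Danmark og regeringen.",
]
terms, rows = document_term_matrix(docs)
print(terms)
print(rows)
```

Each row is one document and each column one surviving term, which is the shape the later frequency and correlation steps work on.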
As a first look at the data, I found the ten words that occurred most often in the articles:
And here is the translation:
- Danish : English
- Danmark: Denmark
- Ved: at
- Dansk: Danish
- Siger: says
- Danske: Danish
- Indvandring: immigration
- Nye: new
- Kun: only
- Folkepart: people's party
- Regeringen: government
It is obvious from the plot that stemming the words would have been useful; however, we can still see an interesting pattern. Not surprisingly, Denmark, two forms of Danish, and immigration appear on the list. What is interesting is that people's party and government appear as well. A very prominent immigrant-hostile party is named the Danish People's Party, and it was a supporting party of the previous liberal government. Hence, when discussing immigration in Politikken over the last 11 years, the focus has been on Denmark (doh!), the government, and the supporting Danish People's Party.
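Ranking the most frequent words boils down to summing each term's column in the document-term matrix and sorting. A minimal plain-Python sketch, with a made-up three-term matrix standing in for the real one (the counts are invented for illustration):

```python
def top_terms(terms, rows, n=10):
    """Return the n (term, total count) pairs with the highest totals.

    rows[i][j] is the count of terms[j] in document i; ties are broken
    alphabetically so the result is deterministic.
    """
    totals = [sum(col) for col in zip(*rows)]
    ranked = sorted(zip(terms, totals), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:n]

# Toy matrix: three documents, three terms (not real data).
terms = ["danmark", "indvandring", "regeringen"]
rows = [
    [2, 1, 0],
    [1, 1, 1],
    [3, 0, 1],
]
print(top_terms(terms, rows, n=2))
```

Because the words are not stemmed, inflected forms such as dansk and danske are counted as separate terms here, just as in the list above.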
If we examine the ten words that correlate the most with immigration, we get the following list:
- Danish : English
- Udtalt: stated, explicit
- Nødvendigvis: necessarily
- Etnisk: ethnic
- Perspektiv: perspective
- Fremmer: promote
- Forestillinger: fancies
- Gift: married
- Fænomen: phenomenon
- Omkring: around
- Sider: sides
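One way to read "correlates with immigration" is the Pearson correlation, across documents, between the count column of the target word and the count column of every other word. The sketch below assumes that reading; it is plain Python on an invented toy matrix, not the code that produced the list above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def associations(terms, rows, target, n=10):
    """Return up to n (term, correlation) pairs, most correlated with `target` first."""
    cols = dict(zip(terms, zip(*rows)))  # term -> tuple of per-document counts
    scores = [(t, pearson(cols[target], c)) for t, c in cols.items() if t != target]
    return sorted(scores, key=lambda pair: -pair[1])[:n]

# Toy matrix: four documents, three terms (not real data).
terms = ["etnisk", "indvandring", "sport"]
rows = [
    [1, 2, 0],
    [0, 0, 3],
    [2, 4, 0],
    [0, 1, 1],
]
result = associations(terms, rows, "indvandring")
print(result)
```

Applying pearson to every pair of columns instead of a single target gives a full correlation matrix of the kind usually drawn as a bubble plot.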
To see how the words correlate with each other and with immigration, the figure below shows the correlation matrix. The larger a bubble is, the larger the correlation. Blue means a positive and red a negative correlation:
I was surprised to see how many of the words carry a positive connotation (for me at least), since the consensus seems to be that the immigration debate in Denmark is very hostile. However, the articles did come from a paper that is slightly to the left of the middle. Since immigration from Islamic countries has been a hot topic in Danish politics, I decided to examine the correlation matrix of the top ten words that correlate with Islam. The words are the following:
- Danish : English
- Studeret: studied
- Ødelægger: destroys
- Israel: Israel
- Vold: violence
- Islams: Islam's
- Establishment: establishment
- Krig: war
- Læse: read
- Kvarterer: neighborhoods
- Muslimske: Muslim
The correlation matrix is shown below:
These results are more in line with what I expected to find: the presence of many words with strong negative connotations. I am surprised, however, to see that Israel is one of the words. This can maybe be explained by the fact that the Danish People's Party is a strong supporter of Israel.
So what can we conclude from this? Well, the coverage of immigration in Politikken seems to be centered on mostly positive words, hence immigration as such is not seen as a contentious issue. However, the debate around Islam is centered on negative words, thus immigration from Islamic countries is a very contentious issue.
As usual here is the data and R code: