Text Mining on Wine Description
Here is an example of text mining with correspondence analysis.
Within the context of research into the characteristics of the wines from Chenin vines in the Loire Valley (French wines), a set of 10 dry white wines from Touraine was studied: 5 Touraine Protected Appellation of Origin (AOC) from Sauvignon vines, and 5 Vouvray AOC from Chenin vines.
These wines were described by 12 professionals. The instructions were: for each wine, give one or more words which, in your opinion, characterise the sensory aspects of the wine. These data were brought together in a table with the wines as rows and the words as columns, where the general term Xij is the number of times that a word j was associated with a wine i (data are available here).
This contingency table has been analysed using Correspondence Analysis (CA) to provide an image summarising the diversity of the wines. Prior to the analysis, the least frequently used words were removed and a number of “neighbouring” words were grouped together (for example, sweet, smooth, and syrupy, all of which refer to the same perception, that of the sweet taste of the wine).
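As a minimal sketch of how such a table can be built, the snippet below cross-tabulates raw citations with base R's `table()`. The wine and word labels are made up for illustration; they are not the study's data.

```r
# Hypothetical raw answers: one row per (wine, word) citation.
# Wine and word labels are invented for illustration only.
citations <- data.frame(
  wine = c("W1", "W1", "W1", "W2", "W2", "W3", "W3", "W3"),
  word = c("fresh", "citrus", "fresh", "sweet", "sweet", "oak", "fresh", "oak")
)

# Cross-tabulate: cell (i, j) counts how often word j was cited for wine i.
counts <- table(citations$wine, citations$word)
print(counts)

# Margins: number of words cited per wine, and citations per word overall.
words_per_wine <- rowSums(counts)
cites_per_word <- colSums(counts)
```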
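The preprocessing described above might look like the following in base R. This is only a sketch on a made-up table: the counts, the grouping, and the `min_count` threshold are assumptions, not the choices made for the actual study.

```r
# Toy wines x words count matrix (values are made up).
counts <- matrix(
  c(5, 2, 1, 0, 1,
    0, 3, 2, 1, 0,
    4, 0, 0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("W1", "W2", "W3"),
                  c("fresh", "sweet", "smooth", "syrupy", "musty"))
)

# Group "neighbouring" words: sum their columns into a single column.
sweet_group <- c("sweet", "smooth", "syrupy")
grouped <- cbind(
  counts[, !(colnames(counts) %in% sweet_group), drop = FALSE],
  sweet = rowSums(counts[, sweet_group, drop = FALSE])
)

# Drop words cited fewer than `min_count` times overall (assumed threshold).
min_count <- 3
filtered <- grouped[, colSums(grouped) >= min_count, drop = FALSE]
print(filtered)
```

Grouping keeps the total number of citations unchanged, whereas filtering rare words discards the few citations they carry.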
CA is implemented using the following commands:
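library(FactoMineR)
# Read the contingency table; the first column holds the row names
wine = read.table("http://factominer.free.fr/bookV2/wine.csv", header=TRUE, row.names=1, sep=";", check.names=FALSE)
# Column 11 and row 31 are treated as supplementary: they are projected onto the axes but do not help build them
res.ca = CA(wine, col.sup=11, row.sup=31)
summary(res.ca)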
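For readers curious about what a correspondence analysis computes, here is a base R sketch on a made-up table. It uses the standard formulation (singular value decomposition of the standardised residuals from the independence model); it is not FactoMineR's source code, and the counts are invented.

```r
# Toy contingency table (made-up counts), wines as rows and words as columns.
N <- matrix(
  c(8, 1, 0, 3,
    1, 6, 2, 3,
    0, 2, 7, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("W1", "W2", "W3"), c("fresh", "sweet", "oak", "fruity"))
)

P  <- N / sum(N)    # correspondence matrix (relative frequencies)
r  <- rowSums(P)    # row masses
cm <- colSums(P)    # column masses

# Standardised residuals from the independence model r %o% cm.
S   <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
dec <- svd(S)

eig <- dec$d^2      # inertia (eigenvalue) of each dimension

# Principal coordinates of rows and columns on each dimension.
row_coord <- diag(1 / sqrt(r))  %*% dec$u %*% diag(dec$d)
col_coord <- diag(1 / sqrt(cm)) %*% dec$v %*% diag(dec$d)
rownames(row_coord) <- rownames(N)
rownames(col_coord) <- colnames(N)

# Percentage of inertia carried by each dimension.
print(round(100 * eig / sum(eig), 1))
```

The total inertia `sum(eig)` equals the chi-square statistic of the table divided by its grand total. In the FactoMineR result, the corresponding quantities are stored in `res.ca$eig`, `res.ca$row$coord`, and `res.ca$col$coord`.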
We can comment on the graph by saying that there are 3 poles of wines:
- Aubuissières Silex (6), characterised by sweet (cited 11 times), is the only wine to contain more than trace-level residual sugars. This characteristic, unusual for a dry wine, stands out: it is only rarely cited for the other wines (never more than twice for one wine) and accounts for over a third of the words associated with this wine. The graph also highlights the wine’s lack of character, but because this term was cited only 3 times for this wine, we have ranked it second (among other things, this characteristic is really a lack of a characteristic and is therefore less evocative).
- Aubuissières Marigny (7) + Fontainerie Coteaux (10). These two wines were mainly characterised by the term oak, woody, which was cited 7 and 5 times for them, respectively, whereas it was used only 3 times elsewhere. This description can, of course, be linked to the fact that these two wines are the only two to have been cask aged. According to this plane, foreign flavour best characterises these wines, but we chose to place it second because of the low frequency of this term (4), even though it was cited for these two wines alone. It should also be noted that ageing wine in casks does not lead only to positive characteristics.
- The five Touraine wines (Sauvignon; 1–5). Characterising these wines was more difficult. The terms lots of character, fresh, delicate, discrete, and citrus were cited for these wines, which seems to fit with the traditional image of a Sauvignon wine, according to which this vine yields fresh, flavoursome wines. We can also add two more marginal characteristics: musty (and little character, respectively), cited 8 times (4 times, respectively), and which are never used to describe the Sauvignon wines.
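Statements such as "accounts for over a third of the words associated with this wine" compare a wine's profile with the average profile. The sketch below shows one way to compute this in base R; the counts and the wine labels A, B, C are made up and are not the study's data.

```r
# Made-up counts, wines as rows and words as columns (not the study's data).
counts <- matrix(
  c(11, 3, 2, 4,
     1, 9, 6, 4,
     2, 8, 7, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("sweet", "fresh", "citrus", "oak"))
)

# Row profiles: share of each word among the words cited for a wine.
row_profiles <- prop.table(counts, margin = 1)

# Average profile: share of each word across all wines.
avg_profile <- colSums(counts) / sum(counts)

# Ratios above 1 mark words over-represented for a wine
# relative to the average profile.
ratio <- sweep(row_profiles, 2, avg_profile, "/")
print(round(ratio, 2))
```

A word that is both frequent for a wine and rare elsewhere gets a high ratio, which is the pattern that pulls that wine and that word towards the same region of the CA plane.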
Once these three poles are established, we can go on to qualify the dimensions. The first distinguishes the Sauvignons from the Chenin wines based on freshness and flavour. The second opposes the cask-aged Chenin wines (with an oak flavour) to the one containing residual sugar (with a sweet flavour).
Having determined these outlines, the term lack of character, which was only used for wines 6 and 8, seems to appear in the right place, i.e., far from the wines which could be described as flavoursome, whether the flavour be due to the Sauvignon vines or from being aged in oak casks.
Finally, this plane offers an image of the Touraine white wines, according to which the Sauvignons are similar to one another and the Chenins are more varied. From a viticulturist’s point of view, this analysis identifies the marginal characteristics of the Chenin vine. In practice, this vine yields rather varied wines which seem particularly different from the Sauvignons, as the latter are somewhat similar to one another and rather typical.
You can find a complete description of this data in the book Exploratory Multivariate Data Analysis by Example Using R (Husson, Lê, Pagès).
Here are some materials: a video on another example of text mining, a video to better understand the CA method, and this video to see how to run CA with the R package FactoMineR.
You can also enroll in this MOOC.