In a previous post it was shown, using literature data, that liking of apples is related to juiciness. However, some questions remained:
- Is the relation linear or slightly curved?
- The variation in liking around CJuiciness is large. Are more explanatory variables needed?
- So, what drives CJuiciness?
Error in average scores
The paper of Peneau et al. uses Tukey's post hoc test to examine the differences between the products, performed within each week. In the data each score carries letter codes: two products that share a letter are not significantly different. First the data are used to determine how large a difference must be before the test declares it significant.
library(xlsReadWrite)
library(plyr)
library(ggplot2)
library(locfit)
#get data as before
datain <- read.xls('condensed.xls')
#remove storage conditions
datain <- datain[-grep('bag|net', datain$Products, ignore.case=TRUE),]
#create week variable from the second part of the product name
datain$week <- sapply(strsplit(as.character(datain$Products), '_'),
  function(x) x[[2]])
#function to extract pairwise numerical differences and significances
extract.diff <- function(descriptor) {
  diffs1 <- ldply(c('W1','W2'), function(Week) {
    # numerical scores: strip the significance letters
    value <- as.numeric(gsub('[[:alpha:]]', '',
      datain[,descriptor]))[datain$week==Week]
    # significance letters: strip the digits and decimal points
    sig.dif <- gsub('[[:digit:].]', '', datain[,descriptor])[datain$week==Week]
    # all pairwise differences between the scores
    dif.mat <- outer(value, value, '-')
    # 1 if two products share a letter (not significantly different), else 0
    sig.mat <- outer(sig.dif, sig.dif, function(X, Y) {
      sapply(1:length(X), function(i) {
        g1 <- grep(paste('[', X[i], ']', sep=''), Y[i])
        as.numeric(length(g1) > 0)
      })
    })
    data.frame(dif.val=as.vector(dif.mat), dif.sig=as.vector(sig.mat),
      Week=Week, descriptor=descriptor)
  })
  diffs1
}
likedif <- extract.diff('CLiking')
likedif <- likedif[likedif$dif.val>=0,]
g <- ggplot(likedif, aes(dif.val,dif.sig))
g + geom_jitter(aes(colour=Week),position=position_jitter(height=.05))
The plot shows that a difference somewhere between 0.24 and 0.26 is enough to be significant. For Juiciness the pattern is the same:
likedif <- extract.diff('CJuiciness')
likedif <- likedif[likedif$dif.val>=0,]
g <- ggplot(likedif, aes(dif.val,dif.sig))
g + geom_jitter(aes(colour=Week),position=position_jitter(height=.05))
For Juiciness a difference of 0.24 is enough to be significant. All in all, a difference of 0.24 can be used as the error for both variables.
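Rather than reading the break point off the jittered plots, it can be extracted directly from the data frame. A minimal sketch, assuming the coding above where dif.sig is 1 when two products share a letter (not significantly different) and 0 when they do not:
# largest difference that is still non-significant
max(likedif$dif.val[likedif$dif.sig == 1])
# smallest difference that is significant
min(likedif$dif.val[likedif$dif.sig == 0 & likedif$dif.val > 0])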
Liking vs. Juiciness
The plot with errors is easy to make. For completeness a local fit is added.
#numerical copy of the data: strip the significance letters from all scores
dataval <- datain
vars <- names(dataval)[-1]
for (descriptor in vars) {
  dataval[,descriptor] <- as.numeric(gsub('[[:alpha:]]', '', dataval[,descriptor]))
}
l1 <- locfit(CLiking ~ lp(CJuiciness,nn=1),data=dataval)
topred <- data.frame(CJuiciness=seq(3.6,4.8,.1))
topred$CLiking <- predict(l1,topred)
g <- ggplot(dataval,aes(CJuiciness,CLiking))
g <- g + geom_point() + geom_errorbar(aes(ymin=CLiking-.24,ymax=CLiking+.24))
g <- g + geom_errorbarh(aes(xmin=CJuiciness-.24,xmax=CJuiciness+.24))
g <- g + geom_line(data=topred, colour='blue')
g
Both the local fit and the errors suggest that curvature is worth pursuing. On top of that, a linear relation implies that any increase in juiciness is good, whereas in general an optimum level is expected. Compare with sugar: if you like three lumps of sugar, two is disliked as not sweet enough, while four is too sweet. Hence, again every reason for curvature.
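A quick formal check of the curvature is to compare a linear fit with a quadratic one. A minimal sketch on the dataval data frame defined above; a small p-value in the F-test supports the quadratic term:
# linear versus quadratic fit of liking on juiciness
lin <- lm(CLiking ~ CJuiciness, data=dataval)
quad <- lm(CLiking ~ CJuiciness + I(CJuiciness^2), data=dataval)
anova(lin, quad)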
Regarding the inclusion of extra variables, the data show that the two products at CJuiciness 4.1 are almost significantly different in liking. Given that this significance is from Tukey's HSD, a difference at the one point with much lower liking is probably relevant. On the other hand, the data are fairly well described by curvature, so one extra explanatory variable should be enough.
Adding an extra explaining variable
According to the previous calculation, the two prime candidates for a second explanatory variable are CSweetness and CMealiness. In general one would expect juicy apples not to be mealy, so there is reason to avoid CMealiness. Nevertheless, both are investigated.
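A correlation matrix gives a quick impression of how the candidates relate to liking and to each other; a minimal sketch, assuming all four columns are present in dataval:
# pairwise correlations between liking and the candidate predictors
round(cor(dataval[, c('CLiking','CJuiciness','CSweetness','CMealiness')]), 2)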
l1 <- locfit(CLiking ~ lp(CJuiciness,CSweetness,nn=1.3),data=dataval)
topred <- expand.grid(CJuiciness=seq(3.8,4.6,.1),CSweetness=seq(3.2,3.9,.1))
topred$CLiking <- predict(l1,topred)
v <- ggplot(topred, aes(CJuiciness, CSweetness, z = CLiking))
v <- v + stat_contour(aes(colour= ..level..) )
v + geom_point(data=dataval, stat='identity', position='identity', aes(CJuiciness,CSweetness))
l1 <- locfit(CLiking ~ lp(CJuiciness,CMealiness,nn=1.3),data=dataval)
topred <- expand.grid(CJuiciness=seq(3.8,4.6,.1),CMealiness=seq(1.4,2.1,.1))
topred$CLiking <- predict(l1,topred)
v <- ggplot(topred, aes(CJuiciness, CMealiness, z = CLiking))
v <- v + stat_contour(aes(colour= ..level..) )
v + geom_point(data=dataval, stat='identity', position='identity', aes(CJuiciness,CMealiness))
The link between CMealiness and CJuiciness is quite strong. It is also clear that CMealiness does not explain the large difference in liking at CJuiciness 4.1. Hence CSweetness is chosen. Not the best of statistical reasons, but all in all it feels like the better model.
Simplified linear model
Finally, even though I like the local model, it is more convenient to use a simple linear model. After removal of non-significant terms, only three terms are left: CJuiciness, CJuiciness^2 and CSweetness.
l1 <- lm(CLiking ~ CJuiciness*CSweetness + I(CJuiciness^2) + I(CSweetness^2),data=dataval)
summary(l1)
Call:
lm(formula = CLiking ~ CJuiciness * CSweetness + I(CJuiciness^2) +
I(CSweetness^2), data = dataval)
Residuals:
Min 1Q Median 3Q Max
-0.12451 -0.02432 0.01002 0.02591 0.06914
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.5326 19.8267 -0.329 0.7475
CJuiciness 6.7662 4.0414 1.674 0.1199
CSweetness -2.5410 8.7807 -0.289 0.7772
I(CJuiciness^2) -0.8638 0.3564 -2.424 0.0321 *
I(CSweetness^2) 0.1264 0.9538 0.133 0.8968
CJuiciness:CSweetness 0.3321 0.6574 0.505 0.6225
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.05827 on 12 degrees of freedom
Multiple R-squared: 0.9107, Adjusted R-squared: 0.8735
F-statistic: 24.48 on 5 and 12 DF, p-value: 6.589e-06
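The removal of non-significant terms was done by hand below; as a cross-check it can also be automated, for example with backward selection via step(). A minimal sketch; note that step() selects on AIC, so it need not agree exactly with a p-value based reduction:
# backward elimination from the full second-order model
l.full <- lm(CLiking ~ CJuiciness*CSweetness + I(CJuiciness^2) + I(CSweetness^2),
  data=dataval)
step(l.full, direction='backward')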
l1 <- lm(CLiking ~ CJuiciness+CSweetness + I(CJuiciness^2) ,data=dataval)
summary(l1)
Call:
lm(formula = CLiking ~ CJuiciness + CSweetness + I(CJuiciness^2),
data = dataval)
Residuals:
Min 1Q Median 3Q Max
-0.125337 -0.023834 0.004955 0.024922 0.087851
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.3609 5.1169 -2.807 0.01400 *
CJuiciness 8.5997 2.3715 3.626 0.00275 **
CSweetness -0.2683 0.1104 -2.430 0.02916 *
I(CJuiciness^2) -0.9428 0.2795 -3.373 0.00455 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0547 on 14 degrees of freedom
Multiple R-squared: 0.9082, Adjusted R-squared: 0.8885
F-statistic: 46.17 on 3 and 14 DF, p-value: 1.654e-07
topred <- expand.grid(CJuiciness=seq(3.8,4.6,.05),CSweetness=seq(3.2,3.9,.05))
topred$CLiking <- predict(l1,topred)
v <- ggplot(topred, aes(CJuiciness, CSweetness, z = CLiking))
v <- v + stat_contour(aes(colour= ..level..) )
v + geom_point(data=dataval, stat='identity', position='identity', aes(CJuiciness,CSweetness))
The resulting plot shows quite some difference from the local fit, but mostly in regions without data.
Hence it is concluded that liking of apples depends mainly on Juiciness, and somewhat on Sweetness. Above a certain Juiciness there is no further gain to be made, and lower sweetness gives better liking.
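The quadratic term implies an optimum level of juiciness; for a parabola b*x + c*x^2 the vertex lies at -b/(2*c). A small sketch using the coefficients of the reduced model l1:
# optimum CJuiciness from the fitted quadratic
b <- coef(l1)['CJuiciness']
cc <- coef(l1)['I(CJuiciness^2)']
-b/(2*cc)  # roughly 4.56, at the upper edge of the observed range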
Discussion
It is a bit disappointing that the error in CJuiciness and CSweetness is not incorporated in the model. Unfortunately, this is easier said than done. The keyword here is Total Least Squares, also named Deming regression. Unfortunately, such methods are only viable when just two variables are regressed, and curvature is also outside their scope.
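For the simple linear two-variable case, Deming regression is easy to do by hand. A minimal sketch for CLiking against CJuiciness, assuming equal error variances in both variables (which seems reasonable here, as both carry the same 0.24 error):
# orthogonal (Deming) regression of CLiking on CJuiciness,
# error variance ratio delta = 1
x <- dataval$CJuiciness
y <- dataval$CLiking
sxx <- var(x); syy <- var(y); sxy <- cov(x, y)
slope <- (syy - sxx + sqrt((syy - sxx)^2 + 4*sxy^2))/(2*sxy)
intercept <- mean(y) - slope*mean(x)
c(intercept=intercept, slope=slope)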
In addition, this leads to the question: what is error? The 'error' between the scores consists of different parts: differences between slices of apple, differences in sensory perception, and different ways of scoring. Regarding the model, differences between slices and in sensory perception count towards liking, while scoring error does not. Ideally these would be split, but that is rather a tall order. Two questions remain:
- what drives CJuiciness?
- what drives CSweetness?