Distribution of car weights
[This article was first published on Wiekvoet, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Two weeks ago I described car data, among which weight distribution of cars in Netherlands. At that time it was purely plots. In the mean time I decided I wanted to model trends. As a first step of that, I decided to fit distributions for these data. I soon found out that most of the tools seem to be organized around individual objects as units. So, the 22 categories of weight might be extended to 100 000 records or so, one for each car. That is clearly not convenient. So, I brewed a MCMC algorithm for the Bayesian estimation of the distribution without all that. As I did most of this while on vacation, I had minimum access to books/internet, so it may be a bit clunky.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Data reading
Nothing new here, except I added manually a formal endpoint for the final weight category; 3500 kg. This number chosen because this is the maximum weight you may drive with a normal drivers license here.weight1 <- read.csv2('Motorvoertuigen__per_010613140907.csv',na.strings='-')
weight2 <- weight1[!is.na(weight1$Waarde),]
weight2$BuildYear <- as.numeric(sub('Bouwjaar ','',as.character(weight2$Bouwjaren)))
weight2$RefYear <- as.numeric(sub(', 1 januari','',as.character(weight2$Peildatum)))
weightcats <- levels(weight2$Onderwerpen_2)
weightcats <- gsub('en meer','and more',weightcats)
levels(weight2$Onderwerpen_2) <- weightcats
lweightcats <- as.numeric(gsub('( |-).*$','',weightcats))
uweightcats <- as.numeric(gsub('(^[0-9]+-)|([ [:alpha:]]+$)','',weightcats))
uweightcats[uweightcats==2451] <- 3500
weight3 <- weight2[weight2$RefYear==2001,c(2,6)]
weight3$lower <- lweightcats[weight3$Onderwerpen_2]
weight3$upper <- uweightcats[weight3$Onderwerpen_2]
weight3
Onderwerpen_2 Waarde lower upper
14 0-450 kg 95 0 450
196 451-550 kg 33 451 550
378 551-650 kg 262 551 650
560 651-750 kg 19950 651 750
742 751-850 kg 58752 751 850
924 851-950 kg 81650 851 950
1106 951-1050 kg 57738 951 1050
1288 1051-1150 kg 91793 1051 1150
1470 1151-1250 kg 104933 1151 1250
1652 1251-1350 kg 89506 1251 1350
1834 1351-1450 kg 30814 1351 1450
2016 1451-1550 kg 19517 1451 1550
2198 1551-1650 kg 13828 1551 1650
2380 1651-1750 kg 5512 1651 1750
2562 1751-1850 kg 4131 1751 1850
2744 1851-1950 kg 1558 1851 1950
2926 1951-2050 kg 1769 1951 2050
3108 2051-2150 kg 430 2051 2150
3290 2151-2250 kg 205 2151 2250
3472 2251-2350 kg 89 2251 2350
3654 2351-2450 kg 137 2351 2450
3836 2451 kg and more 631 2451 3500
Jumping through mixture space
I had decided the model will consist of three or four different normal distributions with different weights. These weights obviously sum to 1, so it is not obvious how to jump in this space while keeping to the rule that a jump from A to B, must be just as probable as a jump from B to A. It would also be nice to have weights between 0 and 1. In the end I decided to do this per mixture component. Take one component, move it to logit space, take the jump, move it back. Shrink/inflate all others proportionally to achieve sum of 1. The plot below shows a trace through mixture space.
nspacejump <- function(x,i,sd) {
xold <- arm::logit(x[i])
xnew <- arm::invlogit(xold + rnorm(1,0,sd))
x <- x/sum(x[-i])*(1-xnew)
x[i] <- xnew
x
}
x <- rep(1/3,3)
xmat <- matrix(0,nrow=30000,ncol=3)
index <- 1
for (irow in 1:10000) {
for (icol in 1:3) {
xmat[index,] <- x
x <- nspacejump(x,icol,.1)
index <- index + 1
}
}
plot(x=c(1,-1,0,1),y=c(0,0,sqrt(3),0),type=’l’)
lines(x=xmat[,1]-xmat[,2],y=xmat[,3]*sqrt(3),col=’blue’)
The model
It appeared as final solution to be most convenient to estimate expected proportions for all weight categories and examine how well proportions fit. Priors for mean weight of cars, 0 to 3500, standard deviation of weight of cars normal(300,300), for weight of normal distributions uniform over mixture space. Obviously only the standard deviation is implemented, it functions to avoid a solution with one normal distribution having enormous standard deviation, which seems not realistic.
curldev <- function(counts,curmodel) {
dsd <- sum(dnorm(curmodel$sd,300,300,log=TRUE))
prop.exp <- matrix(0,nrow=nrow(counts),ncol=nrow(curmodel))
for (i in 1:nrow(curmodel)) {
prop.exp[,i] <- curmodel$w[i]*(pnorm(counts$upper,
curmodel$mean[i],curmodel$sd[i])
–pnorm(counts$lower,
curmodel$mean[i],curmodel$sd[i])
–pnorm(counts$lower,
curmodel$mean[i],curmodel$sd[i]))
}
finprop <- rowSums(prop.exp)
dsd + dmultinom(counts$Waarde,prob=finprop,log=TRUE)
}
Fitting
It is not a very smart algorithm. It took maybe 60000 iterations or more before I had the feeling it had converged and was sampling from the final distribution. Luckily it is just small data in this setup. For examining convergence I made plots of deviation, means and standard deviations over iterations. The third time I did the calculation I got lazy, a 100 000 run as burn in subsequently 10 000 iterations for results..
curmodel <- data.frame(
mean=c(1000,1100,1200,1500),
sd=c(200,200,200,200),
w=c(.25,.25,.25,.25)
)
cnow <- curldev(weight3,curmodel)
ndist <- nrow(curmodel)
niter <- 10000 # you need more, say 100 000 for convergence
result <- matrix(nrow=ndist*niter,ncol=3*ndist+1)
index <- 1
for (iter in 1:niter) {
for (dist in 1:ndist) {
nowmodel <- curmodel
result[index,] <- c(unlist(curmodel),cnow)
curmodel$mean[dist] <- curmodel$mean[dist] + rnorm(1,0,1)
curmodel$sd[dist] <- curmodel$sd[dist] * runif(1,1/1.01,1.01)
curmodel$w <- nspacejump(curmodel$w,dist,.1)
cnew <- curldev(weight3,curmodel)
# cat(c(cnow,cnew,’\n’))
if (exp(cnew-cnow) > runif(1) ) cnow <- cnew
else curmodel <- nowmodel
index <- index + 1
}
}
curmodel
mean sd w
1 864.3793 87.4235 0.31131942
2 1792.4924 650.7524 0.00655972
3 1188.4141 113.8444 0.51915833
4 1407.4449 234.6751 0.16296252
Result
First a table of fit of the final MCMC sample. It looks like the result is close to observed value. I also added a plot of the distributions where I thinned to 1 in 100 MCMC samples. Hence the red line looks a bit bold. Integral is a constant used to ensure the surface under the sample lines stays constant and is in line to the observed values.
rescounts <- function(counts,curmodel) {
prop.exp <- matrix(0,nrow=nrow(counts),ncol=nrow(curmodel))
for (i in 1:nrow(curmodel)) {
prop.exp[,i] <- curmodel$w[i]*
(pnorm(counts$upper,curmodel$mean[i],curmodel$sd[i])
(pnorm(counts$upper,curmodel$mean[i],curmodel$sd[i])
-pnorm(counts$lower,curmodel$mean[i],curmodel$sd[i]))
}
finprop <- rowSums(prop.exp)
cbind(finprop*sum(counts$Waarde),counts$Waarde)
}
rescounts(weight3,curmodel)
[,1] [,2]
[1,] 65.92410 95
[2,] 71.68709 33
[3,] 1349.28494 262
[4,] 16246.30388 19950
[5,] 62352.31007 58752
[6,] 78830.29975 81650
[7,] 58194.45510 57738
[8,] 86826.12802 91793
[9,] 112381.22602 104933
[10,] 79122.57374 89506
[11,] 36005.92395 30814
[12,] 17873.04781 19517
[13,] 11812.20054 13828
[14,] 7602.84587 5512
[15,] 4219.25155 4131
[16,] 2033.40298 1558
[17,] 903.35197 1769
[18,] 423.51988 430
[19,] 247.82985 205
[20,] 183.80917 89
[21,] 152.41480 137
[22,] 579.90848 631
integral <- sum((weight3$upper-weight3$lower)*weight3$Waarde)
plot(c(Waarde,Waarde[nrow(weight3)])*100/c(upper-lower,3500-2451) ~
c(lower,3500),data=weight3,
type=’s’,xlim=c(0,3500),ylim=c(0,1.2e5),
xlab=’Weight’,ylab=’Frequency’)
for(i in seq(1,10000,100)) {
c(lower,3500),data=weight3,
type=’s’,xlim=c(0,3500),ylim=c(0,1.2e5),
xlab=’Weight’,ylab=’Frequency’)
for(i in seq(1,10000,100)) {
x <- seq(0,3500,10)
y <- rep(0,length(x))
newmodel <- data.frame(mean = result[i,1:4],
sd=result[i,5:8],
w=result[i,9:12])
for (ir in 1:nrow(newmodel))
y <- y + newmodel$w[ir]*dnorm(x,newmodel$mean[ir],newmodel$sd[ir])
lines(x=x,y=y/sum(y)/10*integral,col=’red’)
}
To leave a comment for the author, please follow the link and comment on their blog: Wiekvoet.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.