Last week, in our Inequality course, we looked at data. We started with some simulated data, just a few observations,
> library("ineq") > load(url("http://freakonometrics.free.fr/income_5.RData")) > (income=sort(income)) [1] 19233 23707 53297 61667 218662
How could we say that there is inequality in this sample? If we look at the share of wealth owned by the poorest, the poorest person (1 out of 5) owns 5% of the total; the bottom two (2 out of 5) own 11%, etc.
> income[1]/sum(income)
[1] 0.05107471
> sum(income[1:2])/sum(income)
[1] 0.1140305
> sum(income[1:3])/sum(income)
[1] 0.2555648
> sum(income[1:4])/sum(income)
[1] 0.4193262
If we plot those values, we get the Lorenz curve,
> plot(Lc(income))
> points(c(0:5)/5,c(0,cumsum(income)/sum(income)),pch=19,col="blue")
Now, what if we had 500 observations (and not only 5)? In that case, a natural tool to visualize those data (or, to be more specific, their distribution) is the histogram,
> load(url("http://freakonometrics.free.fr/income_500.RData")) > summary(income) Min. 1st Qu. Median Mean 3rd Qu. Max. 2191 23830 42750 77010 87430 2003000 > hist(log(income),probability=TRUE,col="light blue",border="white") > lines(density(log(income)),col="red") > u=seq(6,15,length=251) lines(u,dnorm(u,mean(log(income)),sd(log(income))),col="blue")
To visualize our sample, we used a histogram, but not on the income itself: on the logarithm of the income (because of some outliers, a histogram of the raw incomes shows almost nothing). Now, it is possible to compute the Gini index to get some information about inequality,
> gini=function(x){
+ n=length(x)
+ mu=mean(x)
+ g=2/(n*(n-1)*mu)*sum((1:n)*sort(x))-(n+1)/(n-1)
+ return(g)}
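As a sanity check (a sketch that was not in the original post), this hand-coded estimator can be compared with the Gini() function from the ineq package; with the small-sample correction (the corr argument), the two formulas should coincide,

> Gini(income,corr=TRUE)   # bias-corrected Gini from the ineq package
> gini(income)             # hand-coded version above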
The problem is that, in practice, having an index without any confidence interval can be meaningless. To compute a confidence interval, we can use a bootstrap procedure,
> boot=function(x,f,b=500){
+ n=length(x)
+ F=rep(NA,b)
+ for(s in 1:b){
+ idx=sample(1:n,size=n,replace=TRUE)
+ F[s]=f(x[idx])}
+ return(F)}
> G=boot(income,gini,1000)
> hist(G,col="light blue",border="white",probability=TRUE)
The red segment is the 90% confidence interval,
> quantile(G,c(.05,.95))
       5%       95% 
0.4954235 0.5743917 
We also add a blue line, the Gaussian density with the mean and standard deviation of the bootstrap sample,
> segments(quantile(G,.05),1,quantile(G,.95),1,
+ col="red",lwd=2)
> u=seq(.4,.65,length=251)
> lines(u,dnorm(u,mean(G),sd(G)),
+ col="blue")
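For comparison (a sketch, not in the original code), since the bootstrap distribution looks roughly Gaussian, a normal-approximation interval should be close to the percentile interval above,

> mean(G)+qnorm(c(.05,.95))*sd(G)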
Another popular tool is the Pareto plot, where we plot the logarithm of the empirical survival function against the logarithm of the income,
> n=length(income)
> x=log(sort(income))
> y=log((n:1)/n)
> plot(x,y)
If the points were on a straight line, it would mean that incomes could be modeled with a Pareto distribution. Obviously, that is not the case here.
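Indeed, for a Pareto distribution the survival function is S(x) = (xm/x)^a, so log S(x) = a*log(xm) - a*log(x), i.e. a straight line with slope -a on this plot. As a small sketch (not in the original post), a rough estimate of the tail index can be obtained by regressing on the upper part of the Pareto plot,

> idx = which(x > quantile(x,.9))   # keep the top 10% of (log) incomes
> reg = lm(y[idx] ~ x[idx])
> abline(reg,col="blue",lty=2)      # add the fitted line to the Pareto plot
> -coef(reg)[2]                     # minus the slope, a crude estimate of the tail index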
We have seen previously how to get the Lorenz curve. Actually, it is also possible to draw the Lorenz curve of a parametric distribution, e.g. log-normal ones,
> plot(Lc(income))
> lines(Lc.lognorm,param=1.5,col="red")
> lines(Lc.lognorm,param=1.2,col="red")
> lines(Lc.lognorm,param=.8,col="red")
Here, it sounds reasonable to claim that a log-normal distribution would be a good fit, but maybe not a Pareto distribution,
> plot(Lc(income))
> lines(Lc.pareto,param=2,col="red")
> lines(Lc.pareto,param=1.5,col="red")
> lines(Lc.pareto,param=1.2,col="red")
Actually, it is possible to fit some parametric distributions. Observe that sometimes we have to change the scale, and use thousands of dollars instead of dollars,
> library(MASS)
> fitdistr(income/1e3,"gamma")
      shape          rate    
  1.0812757769   0.0140404379 
 (0.0604530180) (0.0009868055)
Now, consider two distributions, a Gamma one and a log-normal one (both fitted by maximum likelihood),
> (fit_g <- fitdistr(income/1e2,"gamma"))
      shape          rate    
  1.0812757769   0.0014040438 
 (0.0473722529) (0.0000544185)
> (fit_ln <- fitdistr(income/1e2,"lognormal"))
    meanlog       sdlog   
  6.11747519   1.01091329 
 (0.04520942) (0.03196789)
We can visualize the densities
> u=seq(0,2e5,length=251)
> hist(income,breaks=seq(0,2005000,by=5000),
+ col=rgb(0,0,1,.5),border="white",
+ xlim=c(0,2e5),probability=TRUE)
> v_g <- dgamma(u/1e2, fit_g$estimate[1],
+ fit_g$estimate[2])/1e2
> v_ln <- dlnorm(u/1e2, fit_ln$estimate[1],
+ fit_ln$estimate[2])/1e2
> lines(u,v_g,col="red",lwd=2)
> lines(u,v_ln,col=rgb(1,0,0,.4),lwd=2)
Here, it looks like the log-normal is a good candidate. We can also plot the cumulative distribution functions
> x <- sort(income)
> y <- (1:500)/500
> plot(x,y,type="s",col="black",xlim=c(0,250000))
> v_g <- pgamma(u/1e2, fit_g$estimate[1],
+ fit_g$estimate[2])
> v_ln <- plnorm(u/1e2, fit_ln$estimate[1],
+ fit_ln$estimate[2])
> lines(u,v_g,col="red",lwd=2)
> lines(u,v_ln,col=rgb(1,0,0,.4),lwd=2)
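To compare the two fits numerically (a small sketch, not part of the original post), one can rely on the AIC: fitdistr objects have a logLik method, so AIC() can be used directly,

> AIC(fit_g, fit_ln)   # the lower value indicates the preferred fit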
Now, consider a more realistic situation, where we do not have individual observations from surveys, but only binned data,
> load(url("http://freakonometrics.free.fr/income_binned.RData")) > head(income_binned) low high number mean std_err 1 0 4999 95 3606 964 2 5000 9999 267 7686 1439 3 10000 14999 373 12505 1471 4 15000 19999 350 17408 1368 5 20000 24999 329 22558 1428 6 25000 29999 337 27584 1520 > tail(income_binned) low high number mean std_err 46 225000 229999 10 228374 1197 47 230000 234999 13 232920 1370 48 235000 239999 11 236341 1157 49 240000 244999 14 242359 1474 50 245000 249999 11 247782 1487 51 250000 Inf 228 395459 189032
There is a new package, binequality, to model that kind of data,
> library(binequality)
> n <- nrow(income_binned)
> fit_LN <- fitFunc(ID=rep("Fake Data",n),
+ hb=income_binned[,"number"],
+ bin_min=income_binned[,"low"],
+ bin_max=income_binned[,"high"],
+ obs_mean=income_binned[,"mean"],
+ ID_name="Country", distribution=LNO,
+ distName="LNO")
Time difference of 2.101471 secs for LNO fit across 1 distributions
Here, we can fit a log-normal distribution (see "Methods for estimating inequality from binned incomes" for more details about the methodology),
> N=income_binned$number
> y2=N/sum(N)/diff(income_binned$low)
> u=seq(min(income_binned$low),
+ max(income_binned$low),length=101)
> v=dlnorm(u,fit_LN$parameters[1],
+ fit_LN$parameters[2])
> plot(u,v,col="blue",type="l",lwd=2)
> for(i in 1:(n-1)) rect(income_binned$low[i],0,
+ income_binned$high[i],y2[i],col=rgb(1,0,0,.2),
+ border="white")
Here, on the histogram (since we have binned data, it is natural to plot a histogram), we can see that the fitted log-normal distribution is very good. Actually, to be honest, the data were simulated from a log-normal distribution, so it makes sense…
> N <- income_binned$number
> y1 <- cumsum(N)/sum(N)
> u <- seq(min(income_binned$low),
+ max(income_binned$low),length=101)
> v <- plnorm(u,fit_LN$parameters[1],
+ fit_LN$parameters[2])
> plot(u,v,col="blue",type="l",lwd=2)
> for(i in 1:(n-1)) rect(income_binned$low[i],0,
+ income_binned$high[i], y1[i],col=rgb(1,0,0,.2))
> for(i in 1:(n-1)) rect(income_binned$low[i],
+ y1[i],income_binned$high[i],c(0,y1)[i],
+ col=rgb(1,0,0,.4))
For the cumulative distribution function, I consider both the worst-case scenario (everyone in a bin gets the lowest possible income) and the best case (everyone gets the highest possible income).
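Those two bounds can also be written explicitly as step functions (a small sketch, not in the original code): if everyone is at the lower limit of the bin, the empirical distribution function jumps as early as possible (an upper bound for the cdf), while if everyone is at the upper limit, it jumps as late as possible (a lower bound),

> p = cumsum(N)/sum(N)
> F_up  = stepfun(income_binned$low, c(0,p))                    # everyone at the lower limit of each bin
> F_low = stepfun(income_binned$high[1:(n-1)], c(0,p[1:(n-1)])) # everyone at the upper limit (open last bin dropped)
> plot(F_up, do.points=FALSE, col="red", xlim=c(0,250000))
> plot(F_low, add=TRUE, do.points=FALSE, col="red", lty=2)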
It is also possible to fit all the standard distributions of the GB2 (generalized beta of the second kind) family,
> fits=run_GB_family(ID=rep("Fake Data",n), + hb=income_binned[,"number"], + bin_min=income_binned[,"low"], + bin_max=income_binned[,"high"], + obs_mean=income_binned[,"mean"], + ID_name="Country")
To get the best candidate, look at
> fits$fit.filter[,c("gini","aic","bic")]
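For instance (a sketch, not in the original post), the row with the smallest AIC gives the preferred distribution,

> fits$fit.filter[which.min(fits$fit.filter$aic),]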
Now, that was fine, but those were simulated data. What about real data? Let us look at the distribution of income in the US,
> data = read.table(
+ "http://freakonometrics.free.fr/us_income.txt",
+ sep=",",header=TRUE)
> head(data)
    low  high number_1000s  mean std_err
1     0  4999         4245  1249      50
2  5000  9999         5128  7923      30
3 10000 14999         7149 12389      28
4 15000 19999         7370 17278      26
5 20000 24999         7037 22162      27
6 25000 29999         6825 27185      28
> n <- nrow(data)
> fit_LN <- fitFunc(ID=rep("US",n),
+ hb=data[,"number_1000s"],
+ bin_min=data[,"low"],
+ bin_max=data[,"high"],
+ obs_mean=data[,"mean"],
+ ID_name="Country",
+ distribution=LNO, distName="LNO")
Time difference of 0.1855791 secs for LNO fit across 1 distributions
Again, I try to fit a log-normal distribution
> N=data$number_1000s
> y2=N/sum(N)/diff(data$low)
> u=seq(min(data$low),
+ max(data$low),length=101)
> v=dlnorm(u,fit_LN$parameters[1],
+ fit_LN$parameters[2])
> plot(u,v,col="blue",type="l",lwd=2)
> for(i in 1:(n-1)) rect(data$low[i],
+ 0,data$high[i],y2[i],col=rgb(1,0,0,.2))
But here, the fit is rather poor. Again, we can fit all the standard GB2 distributions,
> fits=run_GB_family(ID=rep("US",n),
+ hb=data[,"number_1000s"],
+ bin_min=data[,"low"],
+ bin_max=data[,"high"],
+ obs_mean=data[,"mean"],
+ ID_name="Country")
We can get the Gini index, the AIC and the BIC,
> fits$fit.filter[,c("gini","aic","bic")]
       gini      aic      bic
1  4.413431 825368.5 825407.3
2  4.395080 825598.8 825627.9
3  4.451881 825495.7 825524.8
4  4.480850 825881.7 825910.8
5  4.417276 825323.6 825352.7
6  4.922122 832408.2 832427.6
7  4.341036 827065.2 827084.6
8  4.318667 826112.8 826132.2
9        NA 831054.2 831073.6
10       NA       NA       NA
to see that the best distribution seems to be a Generalized Gamma.