Estimating Gini coefficient when we only have mean income by decile by @ellis2013nz
Income inequality data
Ideally the Gini coefficient to estimate inequality is based on original household survey data with hundreds or thousands of data points. Often this data isn’t available due to access restrictions from privacy or other concerns, and all that is published is some kind of aggregate measure. Some aggregations include the income at the 80th percentile divided by that at the 20th (or 90 and 10); the number of people at the top of the distribution whose combined income is equal to that of everyone else; or the income of the top 1% as a percentage of all income. I wrote a little more about this in one of my earliest blog posts.
One way aggregated data are sometimes presented is as the mean income in each decile or quintile. This is not the same as the actual quantile values themselves, which are the boundary between categories. The 20th percentile is the value of the 20/100th person when they are lined up in increasing order, whereas the mean income of the first quintile is the mean of all the incomes of a “bin” of everyone from 0/100 to 20/100, when lined up in order.
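To make the distinction concrete, here is a minimal base-R sketch on made-up data. The simulated `incomes` vector and its log-normal parameters are assumptions for illustration only, not values from any survey. It contrasts the 20th percentile (a boundary value) with the mean of the first quintile bin.

```r
# Made-up incomes, for illustration only
set.seed(42)
incomes <- sort(rlnorm(1000, meanlog = 6, sdlog = 0.8))

# The 20th percentile: the boundary between the first and second quintile bins
p20 <- unname(quantile(incomes, probs = 0.2))

# The mean of the first quintile bin: the average income of the poorest 20%
bottom_quintile_mean <- mean(incomes[1:200])

c(p20 = p20, bottom_quintile_mean = bottom_quintile_mean)
```

The bin mean sits below the boundary, because every income in the first bin is at or below the 20th percentile.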
To explore estimating Gini coefficients from this type of binned data I used data from the wonderful Lakner-Milanovic World Panel Income Distribution database, which is available for free download. This useful collection contains the mean income by decile bin of many countries from 1985 onwards – the result of some careful and doubtless very tedious work with household surveys from around the world. This is an amazing dataset, and amongst other purposes it can be used (as Milanovic and co-authors have pioneered dating back to his World Bank days) in combination with population numbers to estimate “global inequality”, treating everyone on the planet as part of a single economic community regardless of national boundaries. But that’s for another day.
Here’s R code to download the data (in Stata format) and grab the first ten values, which happen to represent Angola in 1995. These particular data are based on consumption, which in poorer economies is often more sensible to measure than income:
Here are the resulting 10 numbers; they also appear as column x in the comparison table further down.
And this is the Lorenz curve:
Those graphics were drawn with this code:
Calculating Gini directly from deciles?
Now, I could just treat these 10 deciles as a sample of 10 representative people (each observation after all represents exactly 10% of the population) and calculate the Gini coefficient directly. But my hunch was that this would underestimate inequality, because of the straight lines in the Lorenz curve above which are a simplification of the real, more curved, reality.
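The "straight lines" point can be made concrete with a short base-R sketch. It assumes the ten decile means are the values printed in column x of the comparison table later in this post, builds the piecewise-linear Lorenz curve from them, and computes the Gini coefficient as one minus twice the area under that curve. The variable names are my own.

```r
# Decile means for Angola 1995, copied from column x of the table in this post
x <- c(174, 287, 373, 450, 538, 653, 785, 967, 1303, 2528)

# Lorenz curve points: cumulative population share vs cumulative consumption share
pop_share  <- (0:10) / 10
cons_share <- c(0, cumsum(x) / sum(x))

# Area under the piecewise-linear Lorenz curve (trapezoid rule)
area <- sum(diff(pop_share) * (head(cons_share, -1) + tail(cons_share, -1)) / 2)

# Gini = 1 - 2 * (area under the Lorenz curve)
gini_direct <- 1 - 2 * area
round(gini_direct, 4)  # 0.4019
```

This agrees with the 0.401936 reported by `weighted.gini(x)` further down. Because a true Lorenz curve is convex and bows below these straight segments, the area under the real curve is smaller, so the real Gini is at least this large.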
To investigate this issue, I started by creating a known population of 10,000 income observations from a Burr distribution, which is a flexible, continuous non-negative distribution often used to model income. That looks like this:
Then I divided the data up into between 2 and 100 bins, took the means of the bins, and calculated the Gini coefficient of those bin means. Doing this for 10 bins is the equivalent of calculating a Gini coefficient directly from decile data such as in the Lakner-Milanovic dataset. I got this result, which shows that when you have only the means of 10 bins, you underestimate inequality slightly:
Here’s the code for that little simulation. I make myself a little function to bin data and return the mean values of the bins in a tidy data frame, which I’ll need for later use too:
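As a rough stand-in under stated assumptions (not the original listing), the experiment can be sketched in base R as follows. The population here is log-normal rather than Burr, because a Burr generator is not part of base R. `gini_simple` and `bin_means` are hypothetical helpers that return plain vectors rather than a tidy data frame, and only bin counts that divide 10,000 evenly are used so that every bin has the same size.

```r
# Stand-in population: log-normal draws (an assumption; the post used a Burr distribution)
set.seed(123)
population <- rlnorm(10000, meanlog = 6.4, sdlog = 0.8)

# Gini coefficient of equally weighted observations, via the rank formula
gini_simple <- function(y) {
  y <- sort(y)
  n <- length(y)
  sum((2 * seq_len(n) - n - 1) * y) / (n * sum(y))
}

# Hypothetical helper: means of k equal-sized bins of the sorted data
bin_means <- function(y, k) {
  y <- sort(y)
  bins <- rep(seq_len(k), each = length(y) / k)
  as.numeric(tapply(y, bins, mean))
}

bin_counts  <- c(2, 5, 10, 20, 50, 100)
gini_binned <- sapply(bin_counts, function(k) gini_simple(bin_means(population, k)))
gini_full   <- gini_simple(population)

data.frame(bins = bin_counts,
           gini_from_bin_means = gini_binned,
           shortfall = gini_full - gini_binned)
```

Averaging within a bin throws away the inequality inside that bin, so the Gini of the bin means can never exceed the Gini of the full data, and the shortfall shrinks as the bins get finer.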
A better method for Gini from deciles?
Maybe I should have stopped there; after all, there is hardly any difference between 0.32 and 0.34; probably much less than the sampling error from the survey. But I wanted to explore if there were a better way. The method I chose was to:
- choose a log-normal distribution that would generate (close to) the 10 decile averages we have;
- simulate individual-level data from that distribution; and
- estimate the Gini coefficient from that simulated data.
I also tried this with a Burr distribution but the results were very unstable. The log-normal approach was quite good at generating data whose 10 bin means were very similar to the original data, and it gave plausible Gini coefficients just slightly higher than those calculated directly from the bins’ means.
Here’s how I did that:
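One way to sketch these three steps in base R is shown below. It is an illustration under assumptions, not the original code. Instead of random draws it uses an evenly spaced grid of log-normal quantiles as the "simulated" population, so that the optimiser sees a smooth loss. The least-squares loss, the starting values, and the helper names `pseudo_population`, `decile_means` and `gini_simple` are my own choices.

```r
# Decile means for Angola 1995, copied from column x of the table in this post
x <- c(174, 287, 373, 450, 538, 653, 785, 967, 1303, 2528)

# Gini coefficient of equally weighted observations, via the rank formula
gini_simple <- function(y) {
  y <- sort(y)
  n <- length(y)
  sum((2 * seq_len(n) - n - 1) * y) / (n * sum(y))
}

# Deterministic stand-in for simulation: a grid of log-normal quantiles
pseudo_population <- function(meanlog, sdlog, n = 10000) {
  qlnorm(ppoints(n), meanlog = meanlog, sdlog = sdlog)
}

# Means of 10 equal-sized bins of the sorted data
decile_means <- function(y) {
  y <- sort(y)
  bins <- rep(1:10, each = length(y) / 10)
  as.numeric(tapply(y, bins, mean))
}

# Step 1: choose the log-normal whose decile means are closest to x
loss <- function(par) {
  sim <- pseudo_population(par[1], exp(par[2]))  # exp() keeps sdlog positive
  sum((decile_means(sim) - x)^2)
}
start <- c(mean(log(x)), log(sd(log(x))))
fit <- optim(start, loss)  # Nelder-Mead by default

# Step 2: generate individual-level data from the fitted distribution
simulated <- pseudo_population(fit$par[1], exp(fit$par[2]))

# Step 3: estimate the Gini coefficient from that data
fitted_means <- decile_means(simulated)
gini_fitted  <- gini_simple(simulated)
gini_direct  <- gini_simple(x)

cbind(fitted_means = round(fitted_means), x)
c(gini_fitted = gini_fitted, gini_direct = gini_direct)
```

A squared-error loss on the raw decile means gives the upper deciles the most influence; a loss on the log scale would weight the bins more evenly and may give a somewhat different estimate.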
And here are the results. The first table shows the means of the bins in my simulated log-normal population (mean) compared to the original data for Angola’s actual deciles in 1995 (x). The next two values, 0.415 and 0.402, are the Gini coefficients from the simulated and original data respectively:
> cbind(bin_avs(simulated2), x)
   bin_number         bin_breaks      mean    x
1           1         (40.6,222]  165.0098  174
2           2          (222,308]  266.9120  287
3           3          (308,393]  350.3674  373
4           4          (393,487]  438.9447  450
5           5          (487,589]  536.5547  538
6           6          (589,719]  650.7210  653
7           7          (719,887]  795.9326  785
8           8     (887,1.13e+03] 1000.8614  967
9           9 (1.13e+03,1.6e+03] 1328.3872 1303
10         10  (1.6e+03,1.3e+04] 2438.4041 2528
> weighted.gini(simulated2)$Gini
          [,1]
[1,] 0.4145321
>
> # compare to the value estimated directly from the data:
> weighted.gini(x)$Gini
         [,1]
[1,] 0.401936
As would be expected from my earlier simulation, the Gini coefficient from the estimated underlying log-normal distribution is very slightly higher than that calculated directly from the means of the decile bins.
Applying this method to the Lakner-Milanovic inequality data
I rolled up this approach into a function to convert means of deciles into Gini coefficients and applied it to all the countries and years in the World Panel Income Distribution data. Here are the results, first over time:
... and then as a snapshot:
Neither of these is great as a polished data visualisation, but it’s difficult data to present in a static snapshot, and will do for these illustrative purposes.
Here’s the code for that function (which depends on a function defined earlier) and for drawing those charts. Drawing on the convenience of Hadley Wickham’s dplyr and ggplot2, it’s easy to do this on the fly, and in the code below I calculate the Gini coefficients twice, once for each chart. Technically this is wasteful, but with modern computers this isn’t a big deal even though there is quite a bit of computationally intensive work going on under the hood; the code below takes only a minute or so to run.
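For readers who want something lighter than the simulation approach, the following base-R sketch wraps a deciles-to-Gini conversion into one function and applies it row by row. It swaps the simulation step for two closed-form log-normal results: the mean of each decile bin is 10 * exp(meanlog + sdlog^2 / 2) * (pnorm(z_upper - sdlog) - pnorm(z_lower - sdlog)), and the Gini coefficient is 2 * pnorm(sdlog / sqrt(2)) - 1. The function names are hypothetical, and the second row of the toy matrix is made up purely to show the row-wise call; it is not data from the Lakner-Milanovic database.

```r
# Closed-form decile means of a log-normal distribution
lnorm_decile_means <- function(meanlog, sdlog) {
  z <- qnorm((0:10) / 10)                     # decile boundaries on the z scale
  overall_mean <- exp(meanlog + sdlog^2 / 2)
  10 * overall_mean * diff(pnorm(z - sdlog))  # mean within each 10% bin
}

# Hypothetical converter: fit a log-normal to 10 decile means, return its Gini
deciles_to_gini <- function(decile_means) {
  loss <- function(par) {
    sum((lnorm_decile_means(par[1], exp(par[2])) - decile_means)^2)
  }
  start <- c(mean(log(decile_means)), log(sd(log(decile_means))))
  fit <- optim(start, loss)                   # Nelder-Mead by default
  sdlog <- exp(fit$par[2])
  2 * pnorm(sdlog / sqrt(2)) - 1              # Gini of a log-normal
}

# Row 1: Angola 1995 decile means, copied from column x of the table in this post
angola_1995 <- c(174, 287, 373, 450, 538, 653, 785, 967, 1303, 2528)
# Row 2: MADE-UP flatter distribution (each decile pulled halfway to the mean)
flatter_toy <- (angola_1995 + mean(angola_1995)) / 2

toy <- rbind(angola_1995 = angola_1995, flatter_toy = flatter_toy)
ginis <- apply(toy, 1, deciles_to_gini)
ginis
```

With a real data frame of country-year rows, the same function could be called once per group and the results stored, so each chart reuses the stored values instead of recomputing them.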
There we go - deciles to Gini fun with world inequality data!