Importing the New Zealand Income Survey SURF
The quest for income microdata
For a separate project, I’ve been looking for source data on income and wealth inequality. Not aggregate data like Gini coefficients or the percentage of income earned by the bottom 20% or top 1%, but the sources used to calculate those things. Because it’s sensitive personal financial data, drawn either from surveys or from tax records, it’s not easy to get hold of, particularly if you want to do it in the comfort of your own home and don’t have the time or motivation to fill in application forms. So far, what I’ve got is a simulated unit record file (SURF) from the New Zealand Income Survey, published by Statistics New Zealand for educational and instructional purposes. It’s a simplified, simulated version of the original survey, and it lets me make graphics like this one:
This plot shows the density of income from all sources for four of the most prominent ethnicity types in New Zealand. New Zealand official statistics allow people to identify with more than one ethnicity, which means there is some double counting (more on that below). Three things leap out at me from this chart:
- The density curves of the different ethnic groups are very similar visually
- People of Maori and Pacific Peoples ethnicity have proportionately more $300-$400 per week earners than Europeans and Asians, giving their distributions a noticeable lean towards the lower incomes
- Weekly income is bimodal, bunching up at $345 per week and $825 per week. Actually, this image is misleading in that respect; in reality the distribution is trimodal, with a large bunch of people on zero income (and some negative) who aren’t shown on this plot because of the logarithmic scale. So you could say that, for New Zealanders receiving any income at all, the distribution is bimodal.
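That hidden spike at and below zero is easy to confirm directly from the raw file rather than the plot; a minimal sketch, assuming the `nzis` data frame created by the import code later in this post:

```r
# tally the spike at and below zero that the log scale hides
sum(nzis$income == 0)    # respondents with exactly zero weekly income
sum(nzis$income < 0)     # respondents reporting a net loss
mean(nzis$income <= 0)   # proportion invisible on the log-scale density plot
```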
Where’s that bimodal distribution coming from? The obvious candidate is part time workers, and this is seen when we plot income versus hours worked in the next image:
(That interesting diagonal line below which there are very few points is the minimum wage per hour)
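As a rough cross-check on that reading (a sketch only: it assumes the `dtf2` hours/income data frame queried further down, and takes the 2011 New Zealand adult minimum wage to be $13.00 an hour, which you should verify):

```r
# implied hourly rate for people reporting positive hours and income;
# if the diagonal really is the minimum wage, few points should fall below it
worked <- subset(dtf2, hours > 0 & income > 0)
implied_rate <- worked$income / worked$hours
mean(implied_rate < 13)   # proportion below the assumed $13/hour floor
```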
Statistics New Zealand warn that while this simulated data is an accurate representation of the actual New Zealand population, it shouldn’t be used for precise statistics, so for now I won’t draw particularly strong conclusions about anything. A simulated unit record file is a great way of solving the confidentiality problem, but in the end it has been created by a statistical model. There’s a risk that interesting inferences are just picking up something implicit in the model that wasn’t taken into account when it was put together. That’s not likely to be the case for the basic distribution of the income values, but we’ll note the exploratory finding for now and move on.
A box plot is better for examining the full range of reported ethnicities but not so good for picking up the bi-modal subtlety. It should also be noted that both these graphs delete people who made losses (negative income) in a given week – the data show income from all sources:
Loading the NZIS SURF into a database
Here’s how I drew the graphs and estimated the two modes. I think I’ll want to re-use this data quite often, so it’s worthwhile putting it in a database that’s accessible from different projects, without having to move a bunch of data files around in Windows Explorer. The way Statistics New Zealand have released the file, with codings rather than values for dimensions like ethnicity and region, also makes a database a good way to make the data analysis-ready.

After importing the data, the first significant job is to deal with that pesky ethnicity variable. In the released version of the data, respondents with two ethnicities have both code numbers joined together, e.g. 12 means both European (1) and Maori (2). To get around this I split the data into two fact tables: one with a single row per respondent holding most of the data, and a second just for ethnicity, with one or two rows per respondent. Here’s how I do that with a combination of {dplyr} and {tidyr}:
```r
library(dplyr)
library(RODBC)
library(tidyr)
library(mbie)   # for AskCreds, which is alternatively directly available at
# https://github.com/nz-mbie/mbie-r-package/blob/master/pkg/R/Creds.R

# imports, clean up, and save to database the data from
# http://www.stats.govt.nz/tools_and_services/microdata-access/nzis-2011-cart-surf.aspx
url <- "http://www.stats.govt.nz/~/media/Statistics/services/microdata-access/nzis11-cart-surf/nzis11-cart-surf.csv"
nzis <- read.csv(url)

#-----------------fact tables------------------
# Create a main table with a primary key
f_mainheader <- nzis %>%
   mutate(survey_id = 1:nrow(nzis))

# we need to normalise the multiple ethnicities, currently concatenated into a single variable
cat("max number of ethnicities is", max(nchar(nzis$ethnicity)), "\n")

f_ethnicity <- f_mainheader %>%
   select(ethnicity, survey_id) %>%
   mutate(First = substring(ethnicity, 1, 1),
          Second = substring(ethnicity, 2, 2)) %>%
   select(-ethnicity) %>%
   gather(ethnicity_type, ethnicity_id, -survey_id) %>%
   filter(ethnicity_id != "")

# drop the original messy ethnicity variable and tidy up names on main header
f_mainheader <- f_mainheader %>%
   select(-ethnicity) %>%
   rename(region_id = lgr,
          sex_id = sex,
          agegrp_id = agegrp,
          qualification_id = qualification,
          occupation_id = occupation)
```
The second step is to re-create the dimension tables that turn the codes (e.g. 1 and 2) into meaningful values (European and Maori). Statistics New Zealand provide these, but unfortunately in an Excel workbook that’s easier for humans than for computers to link up to the data. There aren’t too many of them, so it’s easy enough to code them by hand, which the next chunk of code does:
```r
#-----------------dimension tables------------------
# all drawn from the data dictionary available at the first link given above
d_sex <- data_frame(sex_id = 1:2, sex = c("male", "female"))

d_agegrp <- data_frame(agegrp_id = seq(from = 15, to = 65, by = 5)) %>%
   mutate(agegrp = ifelse(agegrp_id == 65, "65+",
                          paste0(agegrp_id, "-", agegrp_id + 4)))

d_ethnicity <- data_frame(ethnicity_id = c(1, 2, 3, 4, 5, 6, 9),
                          ethnicity = c(
                             "European",
                             "Maori",
                             "Pacific Peoples",
                             "Asian",
                             "Middle Eastern/Latin American/African",
                             "Other Ethnicity",
                             "Residual Categories"))

d_occupation <- data_frame(occupation_id = 1:10,
                           occupation = c(
                              "Managers",
                              "Professionals",
                              "Technicians and Trades Workers",
                              "Community and Personal Service Workers",
                              "Clerical and Administrative Workers",
                              "Sales Workers",
                              "Machinery Operators and Drivers",
                              "Labourers",
                              "Residual Categories",
                              "No occupation"))

d_qualification <- data_frame(qualification_id = 1:5,
                              qualification = c(
                                 "None",
                                 "School",
                                 "Vocational/Trade",
                                 "Bachelor or Higher",
                                 "Other"))

d_region <- data_frame(region_id = 1:12,
                       region = c("Northland", "Auckland", "Waikato", "Bay of Plenty",
                                  "Gisborne / Hawke's Bay", "Taranaki", "Manawatu-Wanganui",
                                  "Wellington", "Nelson/Tasman/Marlborough/West Coast",
                                  "Canterbury", "Otago", "Southland"))
```
The final step in the data cleaning is to save all of our tables to a database, create some indexes so they work nice and fast, and join them up in an analysis-ready view. In the below I use an ODBC (open database connectivity) connection to a MySQL server called "PlayPen". R plays nicely with databases; set it up and forget about it.
```r
#---------------save to database---------------
creds <- AskCreds("Credentials for someone who can create databases")

PlayPen <- odbcConnect("PlayPen_prod", uid = creds$uid, pwd = creds$pwd)
try(sqlQuery(PlayPen, "create database nzis11"))
sqlQuery(PlayPen, "use nzis11")

# fact tables.  These take a long time to load up with sqlSave (which adds one row at a time)
# but it's easier (quick and dirty) than creating a table and doing a bulk upload from a temp
# file.  Any bigger than this you'd want to bulk upload though - took 20 minutes or more.
sqlSave(PlayPen, f_mainheader, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, f_ethnicity, addPK = TRUE, rownames = FALSE)
# addPK = TRUE adds a primary key on the fly in this case; all other tables
# have their own, already created in R.

# dimension tables
sqlSave(PlayPen, d_sex, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_agegrp, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_ethnicity, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_occupation, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_qualification, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_region, addPK = FALSE, rownames = FALSE)

#----------------indexing----------------------
sqlQuery(PlayPen, "ALTER TABLE f_mainheader ADD PRIMARY KEY(survey_id)")
sqlQuery(PlayPen, "ALTER TABLE d_sex ADD PRIMARY KEY(sex_id)")
sqlQuery(PlayPen, "ALTER TABLE d_agegrp ADD PRIMARY KEY(agegrp_id)")
sqlQuery(PlayPen, "ALTER TABLE d_ethnicity ADD PRIMARY KEY(ethnicity_id)")
sqlQuery(PlayPen, "ALTER TABLE d_occupation ADD PRIMARY KEY(occupation_id)")
sqlQuery(PlayPen, "ALTER TABLE d_qualification ADD PRIMARY KEY(qualification_id)")
sqlQuery(PlayPen, "ALTER TABLE d_region ADD PRIMARY KEY(region_id)")

#---------------create an analysis-ready view-------------------
# In Oracle we'd use a materialized view, which MySQL can't do.  But
# the below is fast enough anyway:
sql1 <- "CREATE VIEW vw_mainheader AS
         SELECT sex, agegrp, occupation, qualification, region, hours, income
         FROM f_mainheader a
         JOIN d_sex b ON a.sex_id = b.sex_id
         JOIN d_agegrp c ON a.agegrp_id = c.agegrp_id
         JOIN d_occupation e ON a.occupation_id = e.occupation_id
         JOIN d_qualification f ON a.qualification_id = f.qualification_id
         JOIN d_region g ON a.region_id = g.region_id"

sqlQuery(PlayPen, sql1)
```
Average weekly income in NZ 2011 by various slices and dices
Whew, that's out of the way. Next post that I use this data I can go straight to the database. We're now in a position to check our data against the summary totals provided by Statistics New Zealand, who say this SURF can be treated as a simple random sample. That means each point gets an identical individual weight, which we can estimate from the summary tables in their data dictionary: each person in the sample represents about 117 people in the population (in the tables below I report population figures in thousands, to match the Statistics New Zealand summaries).
Statistics New Zealand doesn't provide region and occupation summary statistics, and the qualification summaries they provide use a more detailed classification than is in the actual SURF. But for the other categories - sex, age group, and the tricky ethnicity - my results match theirs, so I know I haven't mangled the data.
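As a sanity check on the weight itself, the 0.11744 multiplier used in the queries below can be backed out from the sex summary table that follows (a sketch using only numbers reported in this post):

```r
# individual weight = population total (in thousands) / sample size
pop_total_thousands <- 1787.1 + 1674.0   # female + male population totals
sample_size <- 15217 + 14254             # female + male sample counts
pop_total_thousands / sample_size        # ~0.11744 thousand, i.e. ~117 people each
```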
sex | Mean | Sample | Population |
---|---|---|---|
female | 611 | 15217 | 1787.10 |
male | 779 | 14254 | 1674.00 |
```r
tab1 <- sqlQuery(PlayPen,
   "SELECT
      sex,
      ROUND(AVG(income))          AS Mean,
      COUNT(1)                    AS Sample,
      ROUND(COUNT(1) * .11744, 1) AS Population
    FROM vw_mainheader
    GROUP BY sex")
```
agegrp | Mean | Sample | Population |
---|---|---|---|
15-19 | 198 | 2632 | 309.10 |
20-24 | 567 | 2739 | 321.70 |
25-29 | 715 | 2564 | 301.10 |
30-34 | 796 | 2349 | 275.90 |
35-39 | 899 | 2442 | 286.80 |
40-44 | 883 | 2625 | 308.30 |
45-49 | 871 | 2745 | 322.40 |
50-54 | 911 | 2522 | 296.20 |
55-59 | 844 | 2140 | 251.30 |
60-64 | 816 | 1994 | 234.20 |
65+ | 421 | 4719 | 554.20 |
```r
tab2 <- sqlQuery(PlayPen,
   "SELECT
      agegrp,
      ROUND(AVG(income))          AS Mean,
      COUNT(1)                    AS Sample,
      ROUND(COUNT(1) * .11744, 1) AS Population
    FROM vw_mainheader
    GROUP BY agegrp")
```
qualification | Mean | Sample | Population |
---|---|---|---|
School | 564 | 7064 | 829.60 |
None | 565 | 6891 | 809.30 |
Other | 725 | 1858 | 218.20 |
Vocational/Trade | 734 | 8435 | 990.60 |
Bachelor or Higher | 955 | 5223 | 613.40 |
```r
# qualification summary in data dictionary uses a different classification to that in the data
tab3 <- sqlQuery(PlayPen,
   "SELECT
      qualification,
      ROUND(AVG(income))          AS Mean,
      COUNT(1)                    AS Sample,
      ROUND(COUNT(1) * .11744, 1) AS Population
    FROM vw_mainheader
    GROUP BY qualification
    ORDER BY Mean")
```
occupation | Mean | Sample | Population |
---|---|---|---|
No occupation | 257 | 10617 | 1246.90 |
Labourers | 705 | 2154 | 253.00 |
Residual Categories | 726 | 24 | 2.80 |
Community and Personal Service Workers | 745 | 1734 | 203.60 |
Sales Workers | 800 | 1688 | 198.20 |
Clerical and Administrative Workers | 811 | 2126 | 249.70 |
Technicians and Trades Workers | 886 | 2377 | 279.20 |
Machinery Operators and Drivers | 917 | 1049 | 123.20 |
Professionals | 1105 | 4540 | 533.20 |
Managers | 1164 | 3162 | 371.30 |
```r
# occupation summary not given in data dictionary
tab4 <- sqlQuery(PlayPen,
   "SELECT
      occupation,
      ROUND(AVG(income))          AS Mean,
      COUNT(1)                    AS Sample,
      ROUND(COUNT(1) * .11744, 1) AS Population
    FROM vw_mainheader
    GROUP BY occupation
    ORDER BY Mean")
```
region | Mean | Sample | Population |
---|---|---|---|
Bay of Plenty | 620 | 1701 | 199.80 |
Taranaki | 634 | 728 | 85.50 |
Waikato | 648 | 2619 | 307.60 |
Southland | 648 | 637 | 74.80 |
Manawatu-Wanganui | 656 | 1564 | 183.70 |
Northland | 667 | 1095 | 128.60 |
Nelson/Tasman/Marlborough/West Coast | 680 | 1253 | 147.20 |
Otago | 686 | 1556 | 182.70 |
Gisborne / Hawke's Bay | 693 | 1418 | 166.50 |
Canterbury | 701 | 4373 | 513.60 |
Auckland | 720 | 9063 | 1064.40 |
Wellington | 729 | 3464 | 406.80 |
```r
# region summary not given in data dictionary
tab5 <- sqlQuery(PlayPen,
   "SELECT
      region,
      ROUND(AVG(income))          AS Mean,
      COUNT(1)                    AS Sample,
      ROUND(COUNT(1) * .11744, 1) AS Population
    FROM vw_mainheader
    GROUP BY region
    ORDER BY Mean")
```
ethnicity | Mean | Sample | Population |
---|---|---|---|
Residual Categories | 555 | 85 | 10.00 |
Maori | 590 | 3652 | 428.90 |
Pacific Peoples | 606 | 1566 | 183.90 |
Middle Eastern/Latin American/African | 658 | 343 | 40.30 |
Asian | 678 | 3110 | 365.20 |
European | 706 | 22011 | 2585.00 |
Other Ethnicity | 756 | 611 | 71.80 |
```r
tab6 <- sqlQuery(PlayPen,
   "SELECT
      ethnicity,
      ROUND(AVG(income))          AS Mean,
      COUNT(1)                    AS Sample,
      ROUND(COUNT(1) * .11744, 1) AS Population
    FROM f_mainheader m
    JOIN f_ethnicity e ON m.survey_id = e.survey_id
    JOIN d_ethnicity d ON e.ethnicity_id = d.ethnicity_id
    GROUP BY ethnicity
    ORDER BY Mean")
```
Graphics showing density of income
Finally, here's the code that drew the charts we started with, showing the distribution of weekly income for New Zealanders of different ethnicities.
```r
library(showtext)
library(RODBC)
library(ggplot2)
library(scales)   # for dollar
library(dplyr)

font.add.google("Poppins", "myfont")
showtext.auto()

# download individual level data from database including ethnicity join
dtf <- sqlQuery(PlayPen,
   "SELECT ethnicity, income
    FROM f_mainheader m
    JOIN f_ethnicity e ON m.survey_id = e.survey_id
    JOIN d_ethnicity d ON e.ethnicity_id = d.ethnicity_id")

# density plot of income by ethnicity
dtf %>%
   filter(ethnicity %in% c("Asian", "European", "Maori", "Pacific Peoples")) %>%
   ggplot(aes(x = income, colour = ethnicity)) +
   geom_density(size = 1.1) +
   scale_x_log10("Weekly income from all sources", label = dollar,
                 breaks = c(10, 100, 345, 825, 10000)) +
   theme_minimal(base_family = "myfont") +
   theme(legend.position = "bottom") +
   scale_colour_brewer("", palette = "Set1")

# boxplot of income by ethnicity
dtf %>%
   ggplot(aes(y = income, x = ethnicity, colour = ethnicity)) +
   geom_boxplot() +
   geom_rug() +
   scale_y_log10("Weekly income from all sources", label = dollar,
                 breaks = c(10, 100, 1000, 10000)) +
   theme_minimal() +
   labs(x = "") +
   coord_flip() +
   scale_colour_brewer(palette = "Set2") +
   theme(legend.position = "none")

# scatter plot of joint density of hours and income
dtf2 <- sqlQuery(PlayPen, "SELECT hours, income FROM vw_mainheader")

ggplot(dtf2, aes(x = hours, y = income)) +
   geom_jitter(alpha = 0.05) +
   scale_x_log10("Hours worked", breaks = c(1, 10, 20, 40, 80)) +
   scale_y_log10("Weekly income from all sources", label = dollar,
                 breaks = c(10, 100, 345, 825, 10000)) +
   theme_minimal(base_family = "myfont")

# how did I choose to mark $345 and $825 on the scale?
# ripped off (and improved) from
# http://stackoverflow.com/questions/27418461/calculate-the-modes-in-a-multimodal-distribution-in-r
find_modes <- function(data, ...) {
   dens <- density(data, ...)
   y <- dens$y
   modes <- NULL
   for (i in 2:(length(y) - 1)) {
      if ((y[i] > y[i - 1]) & (y[i] > y[i + 1])) {
         modes <- c(modes, i)
      }
   }
   if (length(modes) == 0) {
      modes <- "This is a monotonic distribution"
   }
   return(dens$x[modes])
}

x <- dtf$income
x[x < 1] <- 1   # not interested in negative income just now

# where are those modes?
exp(find_modes(log(x)))
```
Edited 18 August 2015 to add the scatter plot of the joint density of hours and income.