Interesting Uses of censo2017 a Year After Publishing

rOpenSci - open tools for open science

20 hours ago

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

See the Spanish version of this blog post: Usos Interesantes de censo2017 un Año Después de su Publicación

Summary

This post is about the surprising uses I’ve noticed and the questions about the censo2017 R package, a tool for accessing the Chilean census 2017 data, I’ve gotten since it was peer-reviewed through rOpenSci one year ago. The original post about the package one year ago didn’t cover the different examples I present here, including a Python port of the R package.

Organizing the census data

Three years ago, I had to complete an assignment that required me to extract data from Windows-only software in DVD format, which got very complicated.

I needed to access REDATAM files and obtain a few population summaries with specific software for that format. To my surprise, the task got incredibly challenging, and I exported the data to SQL for more accessible data extraction.

What I initially organized in PostgreSQL, ended up being organized in a local (embedded) database, which is more convenient for end-users and can be installed with two lines of code.

Installation in R:

remotes::install_github("ropensci/censo2017")
censo2017::censo_descargar()

Installation in Python:

pip install git+https://github.com/pachadotdev/censo2017python.git#egg=censo2017
import censo2017
censo2017.descargar()

Unexpected use I: Substandard housing (in R)

We can use the census data to determine where substandard housing is common, leading to valuable insights for public policy. That is something I never contemplated when creating the software.

What kind of housing information can we look for in the census? To keep it simple, let’s explore the p01 variable, which we can look at in the variables_codificacion, a table that I created to organize my work.

library(censo2017)
library(dplyr)
con <- censo_conectar()
tbl(con, "variables") %>%
filter(variable == "p01") %>%
collect() %>%
kable()

tabla	variable	descripcion	tipo	rango
viviendas	p01	Tipo de Vivienda	integer	1 – 10

tbl(con, "variables_codificacion") %>%
filter(variable == "p01") %>%
select(valor, descripcion) %>%
collect() %>%
kable()

valor	descripcion
1	Casa
2	Departamento en Edificio
3	Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras)
4	Pieza en Casa Antigua o en Conventillo
5	Mediagua, Mejora, Rancho o Choza
6	Móvil (Carpa, Casa Rodante o Similar)
7	Otro Tipo de Vivienda Particular
8	Vivienda Colectiva
9	Operativo Personas en Tránsito (No Es Vivienda)
10	Operativo Calle (No Es Vivienda)
11	Valor Perdido
0	No Aplica

The variable refers to “Tipo de Vivienda” (Housing Type) and contains ten numeric values that mean:

Value	Description
1	House
2	Apartment Building
3	Traditional Indigenous Dwelling (Ruka¹ or Others)
4	Bedroom in an Old House or in a Conventillo²
5	Half Hut, Improvement, Ranch or Shack
6	Mobile (Tent, Mobile Home or Similar)
7	Other Type of Private Housing
8	Collective Housing
9	Persons in Transit Operative (Not a Dwelling)
10	Street Operative (Not Housing)
11	Missing Value
0	Not applicable

traditional Mapuche house type↩︎
A building designed to house as many people as possible in a small space at a low cost.↩︎

In this example, I consider dwellings coded as 5 (Half Hut, Improvement, Ranch or Shack / Mediagua, Mejora, Rancho o Choza) as substandard. Because of the census data organization, to get the share of these types of housing per region, we need to:

Create the code for each political division (i.e., region) from the geographical code in the zones table.
Join the zones (“zonas”) table with the households (“viviendas”) table to match each unit to its location (i.e., household ID 100 is in the “Los Ríos (14th)” region).
Group by p01 and count the number of units.

For example, if we want to look at the 14th region specifically, one way to do this could be:

tbl(con, "zonas") %>%
mutate(region = str_sub(geocodigo, 1, 2)) %>%
filter(region == 14) %>%
inner_join(tbl(con, "viviendas")) %>%
group_by(p01) %>%
count() %>%
collect() %>%
kable()

p01	n
1	144081
2	4793
3	44
4	494
5	2187
6	35
7	1875
8	467
9	6
10	8

We can improve this a little bit with an optional join and a step to extract the percentages:

tbl(con, "zonas") %>%
mutate(region = str_sub(geocodigo, 1, 2)) %>%
filter(region == "14") %>%
inner_join(tbl(con, "viviendas")) %>%
group_by(p01) %>%
count() %>%
left_join(
tbl(con, "variables_codificacion") %>%
filter(variable == "p01") %>%
select(p01 = valor, p01_desc = descripcion)
) %>%
select(p01, p01_desc, n) %>%
arrange(-n) %>%
ungroup() %>%
mutate(p = round(100 * n / sum(n), 2)) %>%
collect() %>%
kable()

p01	p01_desc	n	p
1	Casa	144081	93.57
2	Departamento en Edificio	4793	3.11
5	Mediagua, Mejora, Rancho o Choza	2187	1.42
7	Otro Tipo de Vivienda Particular	1875	1.22
4	Pieza en Casa Antigua o en Conventillo	494	0.32
8	Vivienda Colectiva	467	0.30
3	Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras)	44	0.03
6	Móvil (Carpa, Casa Rodante o Similar)	35	0.02
10	Operativo Calle (No Es Vivienda)	8	0.01
9	Operativo Personas en Tránsito (No Es Vivienda)	6	0.00

For the 14th region, housing quality seems not to be a problem. What if we wonder for the number of people living in each of these unit types? This would need additional steps to match people (“personas”) to homes (“hogares”), and homes to households.

tbl(con, "zonas") %>%
mutate(region = str_sub(geocodigo, 1, 2)) %>%
filter(region == "14") %>%
inner_join(tbl(con, "viviendas")) %>%
inner_join(tbl(con, "hogares")) %>%
inner_join(tbl(con, "personas")) %>%
group_by(p01) %>%
count() %>%
left_join(
tbl(con, "variables_codificacion") %>%
filter(variable == "p01") %>%
select(p01 = valor, p01_desc = descripcion)
) %>%
select(p01, p01_desc, n) %>%
arrange(-n) %>%
ungroup() %>%
mutate(p = round(100 * n / sum(n), 2)) %>%
collect() %>%
kable()

p01	p01_desc	n	p
1	Casa	359343	93.38
2	Departamento en Edificio	9139	2.37
8	Vivienda Colectiva	7905	2.05
5	Mediagua, Mejora, Rancho o Choza	3980	1.03
7	Otro Tipo de Vivienda Particular	3008	0.78
4	Pieza en Casa Antigua o en Conventillo	727	0.19
9	Operativo Personas en Tránsito (No Es Vivienda)	494	0.13
3	Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras)	96	0.02
10	Operativo Calle (No Es Vivienda)	84	0.02
6	Móvil (Carpa, Casa Rodante o Similar)	61	0.02

How does the previous result compare with the capitol (13th region)?

tbl(con, "zonas") %>%
mutate(region = str_sub(geocodigo, 1, 2)) %>%
filter(region == "13") %>%
inner_join(tbl(con, "viviendas")) %>%
inner_join(tbl(con, "hogares")) %>%
inner_join(tbl(con, "personas")) %>%
group_by(p01) %>%
count() %>%
left_join(
tbl(con, "variables_codificacion") %>%
filter(variable == "p01") %>%
select(p01 = valor, p01_desc = descripcion)
) %>%
select(p01, p01_desc, n) %>%
arrange(-n) %>%
ungroup() %>%
mutate(p = round(100 * n / sum(n), 2)) %>%
collect() %>%
kable()

p01	p01_desc	n	p
1	Casa	5265416	74.03
2	Departamento en Edificio	1613157	22.68
8	Vivienda Colectiva	78763	1.11
4	Pieza en Casa Antigua o en Conventillo	72142	1.01
5	Mediagua, Mejora, Rancho o Choza	52710	0.74
7	Otro Tipo de Vivienda Particular	17335	0.24
9	Operativo Personas en Tránsito (No Es Vivienda)	9113	0.13
10	Operativo Calle (No Es Vivienda)	2740	0.04
3	Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras)	865	0.01
6	Móvil (Carpa, Casa Rodante o Similar)	567	0.01

From the previous summarised data we can see that in the capitol, 52,710 individuals are living in substandard housing (Mediagua, Mejora, Rancho o Choza).

We shouldn’t forget to close the database connection.

censo_desconectar()

Unexpected use II: Water source (in Python)

The motivation for this example is to show that the goal of organizing the census data was not to promote R, but to show some of the benefits of having the data organized in an open format, regardless of the programming language used for the analysis. For me it’s easier to complete this example in R, but I wanted to include those using this census data and Python.

The variable for the water source is p05. The analysis is very similar to the previous one, and we start by looking at its description. This particular example is relevant, as it allows to quantify the number of people that might be exposed to water contamination in a certain area.

import censo2017
import duckdb
con = duckdb.connect(database = censo2017.archivo_sql())
con.execute("SELECT * FROM variables_codificacion WHERE variable = 'p05'").df()
## tabla variable valor descripcion
## 0 viviendas p05 1 Red Pública
## 1 viviendas p05 2 Pozo o Noria
## 2 viviendas p05 3 Camión Aljibe
## 3 viviendas p05 4 Río, Vertiente, Estero, Canal, Lago, Etc.
## 4 viviendas p05 99 Valor Perdido
## 5 viviendas p05 98 No Aplica

Value	Description
1	Public Network
2	Well or Noria
3	Cistern Truck
4	River, Stream, Estuary, Canal, Lake, Etc.
99	Missing Value
98	Not Applicable

By using this variable, we can know the number of people for each water source in the same way as the previous case.

con.execute("""
SELECT "p05", "p05_desc", "n",
ROUND((100.0 * "n") / SUM("n") OVER (), CAST(ROUND(2.0, 0) AS INTEGER)) AS "p"
FROM (
SELECT "LHS"."p05" AS "p05", "n", "p05_desc"
FROM (
SELECT "p05", COUNT(*) AS "n"
FROM (
SELECT *, "LHS"."hogar_ref_id" AS "hogar_ref_id"
FROM (
SELECT *, "LHS"."vivienda_ref_id" AS "vivienda_ref_id"
FROM (
SELECT *, "LHS"."zonaloc_ref_id" AS "zonaloc_ref_id"
FROM (
SELECT *
FROM (
SELECT *, SUBSTR("geocodigo", 1, 2) AS "region"
FROM "zonas"
) "q01"
WHERE ("region" = '14')
) "LHS"
INNER JOIN "viviendas" AS "RHS"
ON ("LHS"."zonaloc_ref_id" = "RHS"."zonaloc_ref_id")
) "LHS"
INNER JOIN "hogares" AS "RHS"
ON ("LHS"."vivienda_ref_id" = "RHS"."vivienda_ref_id")
) "LHS"
INNER JOIN "personas" AS "RHS"
ON ("LHS"."hogar_ref_id" = "RHS"."hogar_ref_id")
) "q02"
GROUP BY "p05"
) "LHS"
LEFT JOIN (
SELECT "valor" AS "p05", "descripcion" AS "p05_desc"
FROM "variables_codificacion"
WHERE ("variable" = 'p05')
) "RHS"
ON ("LHS"."p05" = "RHS"."p05")
) "q03"
ORDER BY -"n"
""").df()
## p05 p05_desc n p
## 0 1 Red Pública 299464 77.82
## 1 4 Río, Vertiente, Estero, Canal, Lago, Etc. 38796 10.08
## 2 2 Pozo o Noria 32150 8.35
## 3 98 No Aplica 8483 2.20
## 4 3 Camión Aljibe 3558 0.92
## 5 99 Valor Perdido 2386 0.62

Where we see that for the 14th region, 22.18% of the population doesn’t have access to the public tap water network. The advantage of this approach is that here we used pure SQL for querying, which would work in any language where we can pass SQL queries. The disadvantage is not having a tool such as dplyr and the need to learn the SQL syntax.

We shouldn’t forget to close the database connection.

con.close()

Concluding remarks

The examples shown here were simplified. Most of the inquiries I get are about data summaries at the sub-communal level, for example how to count the occurrences of a categorical variable in a neighborhood, as we did in our example of water access (i.e., well or public network). By using censo2017 we can extract useful information for evidence-based public policy when it comes to, for example, prioritizing budget decisions.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.