Site icon R-bloggers

Comparing CENSO 2017 versions for Python and R

[This article was first published on Pachá, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Motivation

Language wars are pretty counterproductive, in my opinion. Different languages have different strengths. Here I’ll show the equivalence of Python and R for a particular context, which is creating and querying a DuckDB (embedded SQL) database.

A simple exercise with the census

The general idea here is to explore the question labelled “p15” (Highest Educational Attainment Obtained) to obtain how many people with completed high school we have in each region in Chile (and also how many with college diploma, etc). To do this we use the library CENSO 2017 to perform three steps.

Here I won’t use pandas, dplyr or anything besides the CENSO 2017 library and DuckDB (i.e., I’ll run standard SQL queries on different languages).

Step 1

The download function will delete any previous version of the database that exists. It needs to be run only once. This will create a DuckDB (SQL) database.

In Python we would type:

import censo2017
# I'll skip this as I've already downloaded the DB
# censo2017.descargar()

In R we would type:

library(censo2017)
# I'll skip this as I've already downloaded the DB
# censo_descargar()

Step 2

Once the tables have been imported, it is possible to read selectively from SQL. For this you have to open a connection. Let’s look, for example, at the variable “p15” in the table “variables”.

We can also see the coding of the variable in the table “variables_coding”.

In Python we would type:

import duckdb
con = duckdb.connect(database = censo2017.archivo_sql())

# read the table variables in the connection and filter for the variable p15
con.execute("SELECT * FROM variables WHERE variable = 'p15'").df()

con.execute("SELECT * FROM variables_codificacion WHERE variable = 'p15'").df()

In R we would type:

library(duckdb)
con <- censo_conectar()

# read the table variables in the connection and filter for the variable p15
dbGetQuery(con, "SELECT * FROM variables WHERE variable = 'p15'")
##      tabla variable                       descripcion    tipo  rango
## 1 personas      p15 Nivel del Curso Más Alto Aprobado integer 1 - 14
dbGetQuery(con, "SELECT * FROM variables_codificacion WHERE variable = 'p15'")
##       tabla variable valor
## 1  personas      p15     1
## 2  personas      p15     2
## 3  personas      p15     3
## 4  personas      p15     4
## 5  personas      p15     5
## 6  personas      p15     6
## 7  personas      p15     7
## 8  personas      p15     8
## 9  personas      p15     9
## 10 personas      p15    10
## 11 personas      p15    11
## 12 personas      p15    12
## 13 personas      p15    13
## 14 personas      p15    14
## 15 personas      p15    99
## 16 personas      p15    98
##                                                   descripcion
## 1                                 Sala Cuna o Jardín Infantil
## 2                                                   Prekínder
## 3                                                      Kínder
## 4                                      Especial o Diferencial
## 5                                            Educación Básica
## 6                   Primaria o Preparatoria (Sistema Antiguo)
## 7                                        Científico-Humanista
## 8                                         Técnica Profesional
## 9                               Humanidades (Sistema Antiguo)
## 10 Técnica Comercial, Industrial/Normalista (Sistema Antiguo)
## 11                                Técnico Superior (1-3 Años)
## 12                                 Profesional (4 o Más Años)
## 13                                                   Magíster
## 14                                                  Doctorado
## 15                                              Valor Perdido
## 16                                                  No Aplica

Step 3

To get detailed information for each region in relation to question “p15”, we need to think of the REDATAM data as a tree, and we need to join “zones” with “dwellings” by zone ID, then join “dwellings” with “households” by household ID, and then “households” with “persons” by household ID. This is done quickly with the DuckDB backend.

In Python we would type:

con.execute("""
SELECT "region", "p15", COUNT(*) AS "n"
FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "vivienda_ref_id", "LHS"."hogar_ref_id" AS "hogar_ref_id", "p15"
FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "LHS"."vivienda_ref_id" AS "vivienda_ref_id", "hogar_ref_id"
FROM (SELECT "region", "geocodigo", "LHS"."zonaloc_ref_id" AS "zonaloc_ref_id", "vivienda_ref_id"
FROM (SELECT SUBSTR(CAST("geocodigo" AS TEXT), 1, 2) AS "region", "geocodigo", "zonaloc_ref_id"
FROM "zonas") "LHS"
INNER JOIN (SELECT "zonaloc_ref_id", "vivienda_ref_id"
FROM "viviendas") "RHS"
ON ("LHS"."zonaloc_ref_id" = "RHS"."zonaloc_ref_id")
) "LHS"
INNER JOIN (SELECT "vivienda_ref_id", "hogar_ref_id"
FROM "hogares") "RHS"
ON ("LHS"."vivienda_ref_id" = "RHS"."vivienda_ref_id")
) "LHS"
INNER JOIN (SELECT "hogar_ref_id", "p15"
FROM "personas") "RHS"
ON ("LHS"."hogar_ref_id" = "RHS"."hogar_ref_id")
) "q01"
GROUP BY "region", "p15"
""").df()

con.close()

In R we would type:

dbGetQuery(con,
'
SELECT "region", "p15", COUNT(*) AS "n"
FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "vivienda_ref_id", "LHS"."hogar_ref_id" AS "hogar_ref_id", "p15"
FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "LHS"."vivienda_ref_id" AS "vivienda_ref_id", "hogar_ref_id"
FROM (SELECT "region", "geocodigo", "LHS"."zonaloc_ref_id" AS "zonaloc_ref_id", "vivienda_ref_id"
FROM (SELECT SUBSTR(CAST("geocodigo" AS TEXT), 1, 2) AS "region", "geocodigo", "zonaloc_ref_id"
FROM "zonas") "LHS"
INNER JOIN (SELECT "zonaloc_ref_id", "vivienda_ref_id"
FROM "viviendas") "RHS"
ON ("LHS"."zonaloc_ref_id" = "RHS"."zonaloc_ref_id")
) "LHS"
INNER JOIN (SELECT "vivienda_ref_id", "hogar_ref_id"
FROM "hogares") "RHS"
ON ("LHS"."vivienda_ref_id" = "RHS"."vivienda_ref_id")
) "LHS"
INNER JOIN (SELECT "hogar_ref_id", "p15"
FROM "personas") "RHS"
ON ("LHS"."hogar_ref_id" = "RHS"."hogar_ref_id")
) "q01"
GROUP BY "region", "p15"
')
##     region p15       n
## 1       02   8   91434
## 2       02  12   90268
## 3       02   7  143032
## 4       02  11   47194
## 5       02   5  124812
## 6       02  13    5559
## 7       02  10    4368
## 8       02   3   13215
## 9       02   9    8624
## 10      02   1   15229
## 11      02   2    8276
## 12      02   6    9650
## 13      02  99   17310
## 14      02  98   25522
## 15      02  14     926
## 16      02   4    2115
## 17      04   5  200593
## 18      04   8   94853
## 19      04   7  174461
## 20      04   4    3138
## 21      04   1   22028
## 22      04  12   90387
## 23      04  98   35237
## 24      04   9   13045
## 25      04  11   46222
## 26      04   3   17503
## 27      04   6   20393
## 28      04  99   16395
## 29      04  10    5581
## 30      04  13    4689
## 31      04   2   12182
## 32      04  14     879
## 33      05   5  385822
## 34      05   8  206365
## 35      05  12  261433
## 36      05   2   25514
## 37      05   1   45020
## 38      05   7  437271
## 39      05  11  146031
## 40      05  99   39275
## 41      05   9   56672
## 42      05  98   67641
## 43      05  13   19779
## 44      05   3   35048
## 45      05  10   13428
## 46      05   6   64143
## 47      05   4    8754
## 48      05  14    3706
## 49      06  98   43612
## 50      06   7  216205
## 51      06   4    3652
## 52      06   8   97540
## 53      06   5  280294
## 54      06  12   87637
## 55      06  11   55721
## 56      06   9   15419
## 57      06   2   13940
## 58      06   6   31162
## 59      06  99   17166
## 60      06   1   21878
## 61      06  10    5315
## 62      06   3   19316
## 63      06  13    5105
## 64      06  14     593
## 65      07   5  344576
## 66      07   8  132030
## 67      07  12  103048
## 68      07   1   28442
## 69      07  98   53032
## 70      07   3   21535
## 71      07  11   48702
## 72      07   7  215422
## 73      07   2   15126
## 74      07  99   20235
## 75      07   9   15516
## 76      07   6   31180
## 77      07  13    6277
## 78      07   4    4099
## 79      07  10    4666
## 80      07  14    1064
## 81      08  98   93119
## 82      08   6   59126
## 83      08   7  452048
## 84      08   8  221096
## 85      08   5  574214
## 86      08   2   29822
## 87      08  12  266035
## 88      08  13   17396
## 89      08   9   35567
## 90      08  10   17685
## 91      08  11  133748
## 92      08   3   41866
## 93      08  99   33154
## 94      08   1   50751
## 95      08   4    8402
## 96      08  14    3385
## 97      09   5  297155
## 98      09  98   48775
## 99      09   8  128149
## 100     09   1   25307
## 101     09  99   20664
## 102     09   9   14812
## 103     09  12  106787
## 104     09   3   20446
## 105     09   6   28229
## 106     09   2   13546
## 107     09   4    3075
## 108     09  10    5223
## 109     09  13    7139
## 110     09  14    1279
## 111     12   5   37128
## 112     12   8   26787
## 113     12  12   23590
## 114     12  99    2552
## 115     12   7   33728
## 116     12   6    5135
## 117     12   1    4994
## 118     12   9    3531
## 119     12  11   14908
## 120     12  98    4708
## 121     12   2    2233
## 122     12   3    3099
## 123     12  10    1612
## 124     12   4     559
## 125     12  13    1640
## 126     12  14     329
## 127     13  11  528487
## 128     13   5 1481406
## 129     13   7 1483547
## 130     13   8  935530
## 131     13   2  100801
## 132     13   3  139178
## 133     13   1  183756
## 134     13   9  169800
## 135     13   6  179888
## 136     13  98  271405
## 137     13  12 1253567
## 138     13  10   48549
## 139     13   4   30603
## 140     13  99  147127
## 141     13  13  138589
## 142     13  14   20575
## 143     15  11   16166
## 144     15   7   51052
## 145     15   8   36713
## 146     15   5   47977
## 147     15  12   29780
## 148     15   1    6584
## 149     15  13    1745
## 150     15  98    9112
## 151     15   6    5560
## 152     15   3    5288
## 153     15  99    5041
## 154     15  14     313
## 155     15   9    4081
## 156     15  10    2102
## 157     15   2    3643
## 158     15   4     911
## 159     09   7  188733
## 160     09  11   47905
## 161     01  10    2820
## 162     01  11   21102
## 163     01  12   44598
## 164     01   5   74514
## 165     01   8   52147
## 166     01   9    5177
## 167     01   6    5581
## 168     01   1   11359
## 169     01   7   74636
## 170     01   3    8107
## 171     01  98   13134
## 172     01  13    2719
## 173     01   2    5680
## 174     01  99    7518
## 175     01  14     307
## 176     01   4    1159
## 177     03   5   71114
## 178     03   8   39741
## 179     03  12   33499
## 180     03   6    7247
## 181     03   1    9232
## 182     03  10    2123
## 183     03   3    6910
## 184     03   7   66530
## 185     03  98   13341
## 186     03   9    4741
## 187     03   2    4904
## 188     03  11   19049
## 189     03  13    1322
## 190     03  99    4980
## 191     03   4    1203
## 192     03  14     232
## 193     10  12   85685
## 194     10   8   92592
## 195     10   5  277949
## 196     10   2   11913
## 197     10  98   35307
## 198     10   7  175494
## 199     10   4    2956
## 200     10   6   24175
## 201     10  11   46644
## 202     10   1   22099
## 203     10  10    3339
## 204     10   9   10656
## 205     10  99   16213
## 206     10   3   17412
## 207     10  13    5533
## 208     10  14     741
## 209     14   8   46746
## 210     14   5  116214
## 211     14   1   10002
## 212     14   7   81971
## 213     14  12   46332
## 214     14   3    7987
## 215     14   6   11309
## 216     14  11   18316
## 217     14   2    5338
## 218     14  98   17689
## 219     14  99    8688
## 220     14   9    6294
## 221     14  10    2362
## 222     14  14    1012
## 223     14   4    1471
## 224     14  13    3106
## 225     11   7   23583
## 226     11   8   10449
## 227     11   2    1713
## 228     11   5   32578
## 229     11  98    4404
## 230     11  12   11659
## 231     11  13     781
## 232     11   1    3695
## 233     11   9     967
## 234     11  11    6376
## 235     11  99    1728
## 236     11  14     118
## 237     11  10     263
## 238     11   6    2062
## 239     11   4     383
## 240     11   3    2399
dbDisconnect(con, shutdown = T)
To leave a comment for the author, please follow the link and comment on their blog: Pachá.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.