Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Motivation
Language wars are pretty counterproductive, in my opinion. Different languages have different strengths. Here I’ll show the equivalence of Python and R for a particular context, which is creating and querying a DuckDB (embedded SQL) database.
A simple exercise with the census
The general idea here is to explore the question labelled “p15” (Highest Educational Attainment Obtained) to obtain how many people with completed high school we have in each region in Chile (and also how many with college diploma, etc). To do this we use the library CENSO 2017 to perform three steps.
Here I won’t use pandas, dplyr or anything besides the CENSO 2017 library and DuckDB (i.e., I’ll run standard SQL queries on different languages).
Step 1
The download function will delete any previous version of the database that exists. It needs to be run only once. This will create a DuckDB (SQL) database.
In Python we would type:
import censo2017 # I'll skip this as I've already downloaded the DB # censo2017.descargar()
In R we would type:
library(censo2017) # I'll skip this as I've already downloaded the DB # censo_descargar()
Step 2
Once the tables have been imported, it is possible to read selectively from SQL. For this you have to open a connection. Let’s look, for example, at the variable “p15” in the table “variables”.
We can also see the coding of the variable in the table “variables_coding”.
In Python we would type:
import duckdb con = duckdb.connect(database = censo2017.archivo_sql()) # read the table variables in the connection and filter for the variable p15 con.execute("SELECT * FROM variables WHERE variable = 'p15'").df() con.execute("SELECT * FROM variables_codificacion WHERE variable = 'p15'").df()
In R we would type:
library(duckdb) con <- censo_conectar() # read the table variables in the connection and filter for the variable p15 dbGetQuery(con, "SELECT * FROM variables WHERE variable = 'p15'") ## tabla variable descripcion tipo rango ## 1 personas p15 Nivel del Curso Más Alto Aprobado integer 1 - 14 dbGetQuery(con, "SELECT * FROM variables_codificacion WHERE variable = 'p15'") ## tabla variable valor ## 1 personas p15 1 ## 2 personas p15 2 ## 3 personas p15 3 ## 4 personas p15 4 ## 5 personas p15 5 ## 6 personas p15 6 ## 7 personas p15 7 ## 8 personas p15 8 ## 9 personas p15 9 ## 10 personas p15 10 ## 11 personas p15 11 ## 12 personas p15 12 ## 13 personas p15 13 ## 14 personas p15 14 ## 15 personas p15 99 ## 16 personas p15 98 ## descripcion ## 1 Sala Cuna o Jardín Infantil ## 2 Prekínder ## 3 Kínder ## 4 Especial o Diferencial ## 5 Educación Básica ## 6 Primaria o Preparatoria (Sistema Antiguo) ## 7 Científico-Humanista ## 8 Técnica Profesional ## 9 Humanidades (Sistema Antiguo) ## 10 Técnica Comercial, Industrial/Normalista (Sistema Antiguo) ## 11 Técnico Superior (1-3 Años) ## 12 Profesional (4 o Más Años) ## 13 Magíster ## 14 Doctorado ## 15 Valor Perdido ## 16 No Aplica
Step 3
To get detailed information for each region in relation to question “p15”, we need to think of the REDATAM data as a tree, and we need to join “zones” with “dwellings” by zone ID, then join “dwellings” with “households” by household ID, and then “households” with “persons” by household ID. This is done quickly with the DuckDB backend.
In Python we would type:
con.execute(""" SELECT "region", "p15", COUNT(*) AS "n" FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "vivienda_ref_id", "LHS"."hogar_ref_id" AS "hogar_ref_id", "p15" FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "LHS"."vivienda_ref_id" AS "vivienda_ref_id", "hogar_ref_id" FROM (SELECT "region", "geocodigo", "LHS"."zonaloc_ref_id" AS "zonaloc_ref_id", "vivienda_ref_id" FROM (SELECT SUBSTR(CAST("geocodigo" AS TEXT), 1, 2) AS "region", "geocodigo", "zonaloc_ref_id" FROM "zonas") "LHS" INNER JOIN (SELECT "zonaloc_ref_id", "vivienda_ref_id" FROM "viviendas") "RHS" ON ("LHS"."zonaloc_ref_id" = "RHS"."zonaloc_ref_id") ) "LHS" INNER JOIN (SELECT "vivienda_ref_id", "hogar_ref_id" FROM "hogares") "RHS" ON ("LHS"."vivienda_ref_id" = "RHS"."vivienda_ref_id") ) "LHS" INNER JOIN (SELECT "hogar_ref_id", "p15" FROM "personas") "RHS" ON ("LHS"."hogar_ref_id" = "RHS"."hogar_ref_id") ) "q01" GROUP BY "region", "p15" """).df() con.close()
In R we would type:
dbGetQuery(con, ' SELECT "region", "p15", COUNT(*) AS "n" FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "vivienda_ref_id", "LHS"."hogar_ref_id" AS "hogar_ref_id", "p15" FROM (SELECT "region", "geocodigo", "zonaloc_ref_id", "LHS"."vivienda_ref_id" AS "vivienda_ref_id", "hogar_ref_id" FROM (SELECT "region", "geocodigo", "LHS"."zonaloc_ref_id" AS "zonaloc_ref_id", "vivienda_ref_id" FROM (SELECT SUBSTR(CAST("geocodigo" AS TEXT), 1, 2) AS "region", "geocodigo", "zonaloc_ref_id" FROM "zonas") "LHS" INNER JOIN (SELECT "zonaloc_ref_id", "vivienda_ref_id" FROM "viviendas") "RHS" ON ("LHS"."zonaloc_ref_id" = "RHS"."zonaloc_ref_id") ) "LHS" INNER JOIN (SELECT "vivienda_ref_id", "hogar_ref_id" FROM "hogares") "RHS" ON ("LHS"."vivienda_ref_id" = "RHS"."vivienda_ref_id") ) "LHS" INNER JOIN (SELECT "hogar_ref_id", "p15" FROM "personas") "RHS" ON ("LHS"."hogar_ref_id" = "RHS"."hogar_ref_id") ) "q01" GROUP BY "region", "p15" ') ## region p15 n ## 1 02 8 91434 ## 2 02 12 90268 ## 3 02 7 143032 ## 4 02 11 47194 ## 5 02 5 124812 ## 6 02 13 5559 ## 7 02 10 4368 ## 8 02 3 13215 ## 9 02 9 8624 ## 10 02 1 15229 ## 11 02 2 8276 ## 12 02 6 9650 ## 13 02 99 17310 ## 14 02 98 25522 ## 15 02 14 926 ## 16 02 4 2115 ## 17 04 5 200593 ## 18 04 8 94853 ## 19 04 7 174461 ## 20 04 4 3138 ## 21 04 1 22028 ## 22 04 12 90387 ## 23 04 98 35237 ## 24 04 9 13045 ## 25 04 11 46222 ## 26 04 3 17503 ## 27 04 6 20393 ## 28 04 99 16395 ## 29 04 10 5581 ## 30 04 13 4689 ## 31 04 2 12182 ## 32 04 14 879 ## 33 05 5 385822 ## 34 05 8 206365 ## 35 05 12 261433 ## 36 05 2 25514 ## 37 05 1 45020 ## 38 05 7 437271 ## 39 05 11 146031 ## 40 05 99 39275 ## 41 05 9 56672 ## 42 05 98 67641 ## 43 05 13 19779 ## 44 05 3 35048 ## 45 05 10 13428 ## 46 05 6 64143 ## 47 05 4 8754 ## 48 05 14 3706 ## 49 06 98 43612 ## 50 06 7 216205 ## 51 06 4 3652 ## 52 06 8 97540 ## 53 06 5 280294 ## 54 06 12 87637 ## 55 06 11 55721 ## 56 06 9 15419 ## 57 06 2 13940 ## 58 06 6 31162 ## 59 06 99 17166 ## 60 06 1 21878 ## 61 06 10 5315 ## 62 06 3 19316 ## 63 06 13 5105 ## 64 06 14 593 ## 65 07 5 344576 ## 66 07 8 132030 ## 67 07 12 103048 ## 68 07 1 28442 ## 69 07 98 53032 ## 70 07 3 21535 ## 71 07 11 48702 ## 72 07 7 215422 ## 73 07 2 15126 ## 74 07 99 20235 ## 75 07 9 15516 ## 76 07 6 31180 ## 77 07 13 6277 ## 78 07 4 4099 ## 79 07 10 4666 ## 80 07 14 1064 ## 81 08 98 93119 ## 82 08 6 59126 ## 83 08 7 452048 ## 84 08 8 221096 ## 85 08 5 574214 ## 86 08 2 29822 ## 87 08 12 266035 ## 88 08 13 17396 ## 89 08 9 35567 ## 90 08 10 17685 ## 91 08 11 133748 ## 92 08 3 41866 ## 93 08 99 33154 ## 94 08 1 50751 ## 95 08 4 8402 ## 96 08 14 3385 ## 97 09 5 297155 ## 98 09 98 48775 ## 99 09 8 128149 ## 100 09 1 25307 ## 101 09 99 20664 ## 102 09 9 14812 ## 103 09 12 106787 ## 104 09 3 20446 ## 105 09 6 28229 ## 106 09 2 13546 ## 107 09 4 3075 ## 108 09 10 5223 ## 109 09 13 7139 ## 110 09 14 1279 ## 111 12 5 37128 ## 112 12 8 26787 ## 113 12 12 23590 ## 114 12 99 2552 ## 115 12 7 33728 ## 116 12 6 5135 ## 117 12 1 4994 ## 118 12 9 3531 ## 119 12 11 14908 ## 120 12 98 4708 ## 121 12 2 2233 ## 122 12 3 3099 ## 123 12 10 1612 ## 124 12 4 559 ## 125 12 13 1640 ## 126 12 14 329 ## 127 13 11 528487 ## 128 13 5 1481406 ## 129 13 7 1483547 ## 130 13 8 935530 ## 131 13 2 100801 ## 132 13 3 139178 ## 133 13 1 183756 ## 134 13 9 169800 ## 135 13 6 179888 ## 136 13 98 271405 ## 137 13 12 1253567 ## 138 13 10 48549 ## 139 13 4 30603 ## 140 13 99 147127 ## 141 13 13 138589 ## 142 13 14 20575 ## 143 15 11 16166 ## 144 15 7 51052 ## 145 15 8 36713 ## 146 15 5 47977 ## 147 15 12 29780 ## 148 15 1 6584 ## 149 15 13 1745 ## 150 15 98 9112 ## 151 15 6 5560 ## 152 15 3 5288 ## 153 15 99 5041 ## 154 15 14 313 ## 155 15 9 4081 ## 156 15 10 2102 ## 157 15 2 3643 ## 158 15 4 911 ## 159 09 7 188733 ## 160 09 11 47905 ## 161 01 10 2820 ## 162 01 11 21102 ## 163 01 12 44598 ## 164 01 5 74514 ## 165 01 8 52147 ## 166 01 9 5177 ## 167 01 6 5581 ## 168 01 1 11359 ## 169 01 7 74636 ## 170 01 3 8107 ## 171 01 98 13134 ## 172 01 13 2719 ## 173 01 2 5680 ## 174 01 99 7518 ## 175 01 14 307 ## 176 01 4 1159 ## 177 03 5 71114 ## 178 03 8 39741 ## 179 03 12 33499 ## 180 03 6 7247 ## 181 03 1 9232 ## 182 03 10 2123 ## 183 03 3 6910 ## 184 03 7 66530 ## 185 03 98 13341 ## 186 03 9 4741 ## 187 03 2 4904 ## 188 03 11 19049 ## 189 03 13 1322 ## 190 03 99 4980 ## 191 03 4 1203 ## 192 03 14 232 ## 193 10 12 85685 ## 194 10 8 92592 ## 195 10 5 277949 ## 196 10 2 11913 ## 197 10 98 35307 ## 198 10 7 175494 ## 199 10 4 2956 ## 200 10 6 24175 ## 201 10 11 46644 ## 202 10 1 22099 ## 203 10 10 3339 ## 204 10 9 10656 ## 205 10 99 16213 ## 206 10 3 17412 ## 207 10 13 5533 ## 208 10 14 741 ## 209 14 8 46746 ## 210 14 5 116214 ## 211 14 1 10002 ## 212 14 7 81971 ## 213 14 12 46332 ## 214 14 3 7987 ## 215 14 6 11309 ## 216 14 11 18316 ## 217 14 2 5338 ## 218 14 98 17689 ## 219 14 99 8688 ## 220 14 9 6294 ## 221 14 10 2362 ## 222 14 14 1012 ## 223 14 4 1471 ## 224 14 13 3106 ## 225 11 7 23583 ## 226 11 8 10449 ## 227 11 2 1713 ## 228 11 5 32578 ## 229 11 98 4404 ## 230 11 12 11659 ## 231 11 13 781 ## 232 11 1 3695 ## 233 11 9 967 ## 234 11 11 6376 ## 235 11 99 1728 ## 236 11 14 118 ## 237 11 10 263 ## 238 11 6 2062 ## 239 11 4 383 ## 240 11 3 2399 dbDisconnect(con, shutdown = T)
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.