Data Wrangling with dbplyr
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
This is the second post in the series R & Databases. You can find the links to the first post of this series below:
In this post, we will learn to query data from a database using dplyr.
Libraries, Code & Data
We will use the following libraries in this post:
All the data sets used in this post can be found here and code can be downloaded from here.
Connect to Database
Let us connect to an in memory SQLite database using dbConnect()
.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
We will copy the mtcars
data to the database so that we can use it for running
dplyr statements.
dplyr::copy_to(con, mtcars)
Reference Copied Data Frame
In order to use dplyr functions, we need to reference the table in the database using
tbl()
.
mtcars2 <- dplyr::tbl(con, "mtcars") mtcars2 ## # Source: table<mtcars> [?? x 11] ## # Database: sqlite 3.22.0 [:memory:] ## mpg cyl disp hp drat wt qsec vs am gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 ## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 ## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 ## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 ## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 ## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 ## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 ## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 ## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 ## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 ## # ... with more rows
Query Data
We will look at some simple examples. Let us start by selecting mpg
, cyl
and drat
columns from mtcars2
.
select(mtcars2, mpg, cyl, drat) ## # Source: lazy query [?? x 3] ## # Database: sqlite 3.22.0 [:memory:] ## mpg cyl drat ## <dbl> <dbl> <dbl> ## 1 21 6 3.9 ## 2 21 6 3.9 ## 3 22.8 4 3.85 ## 4 21.4 6 3.08 ## 5 18.7 8 3.15 ## 6 18.1 6 2.76 ## 7 14.3 8 3.21 ## 8 24.4 4 3.69 ## 9 22.8 4 3.92 ## 10 19.2 6 3.92 ## # ... with more rows
We can filter data as well. Filter all the rows from mtcars2
where mpg
is
greater than 25.
filter(mtcars2, mpg > 25) ## # Source: lazy query [?? x 11] ## # Database: sqlite 3.22.0 [:memory:] ## mpg cyl disp hp drat wt qsec vs am gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 ## 2 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 ## 3 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 ## 4 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 ## 5 26 4 120. 91 4.43 2.14 16.7 0 1 5 2 ## 6 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
Time to do some grouping and summarizing. Let us compute the average mileage for different types of cylinders.
mtcars2 %>% group_by(cyl) %>% summarise(mileage = mean(mpg)) ## Warning: Missing values are always removed in SQL. ## Use `AVG(x, na.rm = TRUE)` to silence this warning ## # Source: lazy query [?? x 2] ## # Database: sqlite 3.22.0 [:memory:] ## cyl mileage ## <dbl> <dbl> ## 1 4 26.7 ## 2 6 19.7 ## 3 8 15.1
Show Query
If you want to view the SQL query generated in the above step, use show_query()
or explain()
.
mileages <- mtcars2 %>% group_by(cyl) %>% summarise(mileage = mean(mpg)) dplyr::show_query(mileages) ## Warning: Missing values are always removed in SQL. ## Use `AVG(x, na.rm = TRUE)` to silence this warning ## <SQL> ## SELECT `cyl`, AVG(`mpg`) AS `mileage` ## FROM `mtcars` ## GROUP BY `cyl` dplyr::explain(mileages) ## Warning: Missing values are always removed in SQL. ## Use `AVG(x, na.rm = TRUE)` to silence this warning ## <SQL> ## SELECT `cyl`, AVG(`mpg`) AS `mileage` ## FROM `mtcars` ## GROUP BY `cyl` ## ## Warning: Missing values are always removed in SQL. ## Use `AVG(x, na.rm = TRUE)` to silence this warning ## <PLAN> ## addr opcode p1 p2 p3 p4 p5 comment ## 1 0 Init 0 43 0 00 NA ## 2 1 SorterOpen 1 2 0 k(1,B) 00 NA ## 3 2 Integer 0 5 0 00 NA ## 4 3 Integer 0 4 0 00 NA ## 5 4 Null 0 8 8 00 NA ## 6 5 Gosub 7 40 0 00 NA ## 7 6 OpenRead 0 2 1 2 00 NA ## 8 7 Rewind 0 15 0 00 NA ## 9 8 Column 0 1 10 00 NA ## 10 9 RealAffinity 10 0 0 00 NA ## 11 10 Column 0 0 11 00 NA ## 12 11 RealAffinity 11 0 0 00 NA ## 13 12 MakeRecord 10 2 12 00 NA ## 14 13 SorterInsert 1 12 0 00 NA ## 15 14 Next 0 8 0 01 NA ## 16 15 OpenPseudo 2 12 2 00 NA ## 17 16 SorterSort 1 42 0 00 NA ## 18 17 SorterData 1 12 2 00 NA ## 19 18 Column 2 0 9 00 NA ## 20 19 Compare 8 9 1 k(1,B) 00 NA ## 21 20 Jump 21 25 21 00 NA ## 22 21 Move 9 8 1 00 NA ## 23 22 Gosub 6 34 0 00 NA ## 24 23 IfPos 5 42 0 00 NA ## 25 24 Gosub 7 40 0 00 NA ## 26 25 Column 2 1 13 00 NA ## 27 26 AggStep0 0 13 2 avg(1) 01 NA ## 28 27 Column 2 0 1 00 NA ## 29 28 Integer 1 4 0 00 NA ## 30 29 SorterNext 1 17 0 00 NA ## 31 30 Gosub 6 34 0 00 NA ## 32 31 Goto 0 42 0 00 NA ## 33 32 Integer 1 5 0 00 NA ## 34 33 Return 6 0 0 00 NA ## 35 34 IfPos 4 36 0 00 NA ## 36 35 Return 6 0 0 00 NA ## 37 36 AggFinal 2 1 0 avg(1) 00 NA ## 38 37 Copy 1 14 1 00 NA ## 39 38 ResultRow 14 2 0 00 NA ## 40 39 Return 6 0 0 00 NA ## 41 40 Null 0 1 3 00 NA ## 42 41 Return 7 0 0 00 NA ## 43 42 Halt 0 0 0 00 NA ## 44 43 Transaction 1 0 2 0 01 NA ## 45 44 Goto 0 1 0 00 NA
Collect Data
Now, some interesting facts. When working with databases, dplyr never pulls data into R unless you explicitly ask for it. In the previous example, dplyr will not do anything until you ask for the mileages data. It generates the SQL and only pulls down a few rows when you try to print mileages
.
So how do we pull all the data and store it for further analysis? collect()
will pull all
the data and store it in a tibble and you can use it for any further analysis.
dplyr::collect(mileages) ## Warning: Missing values are always removed in SQL. ## Use `AVG(x, na.rm = TRUE)` to silence this warning ## # A tibble: 3 x 2 ## cyl mileage ## <dbl> <dbl> ## 1 4 26.7 ## 2 6 19.7 ## 3 8 15.1
References
Up Next..
In the next post, we will learn basic SQL commands.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.