Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This is common case when working with data that your source is a remote database. Usual ways to cope this when using R is either to load all the data into R or to perform the heaviest joins and aggregations with SQL before loading the data. Both of them have cons: the former one is limited by the memory capacity and may be very slow and the later forces you to use two technologies thus is more complicated and prone to errors. Solution to these problems is to use dplyr
with dbplyr
to communicate with database backend. This allows user to write dplyr
code that is translated to SQL and executed at database server. One can say that this combines advantages of the two standard solutions and gets rid of their disadvantages.
This is the first part of R with remote databases series. For other parts follow the tag databases.
The reader is assumed to know basics of dplyr
and SQL. If you want to practice dplyr
first there is great series of exercises Data wrangling: Transforming available. For quick introduction to dplyr
with database backend I recommend this vignette.
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Load libraries: dplyr
, dbplyr
, DBI
, RSQLite
, nycflights13
. Create a connection to temporal in-memory SQLite database (the database will be created on-the-fly so you do not have to take care of this).
Exercise 2
Upload data sets nycflights13::flights
and nycflights13::planes
to the database as non temporal tables.
Exercise 3
List names of columns in flights
table (hint: There is a function in DBI
package for this).
Exercise 4
Use SQL query to count number of flights per carrier and pull it to a local tibble.
Exercise 5
Do the same thing using dplyr
verbs instead of SQL.
Exercise 6
Calculate number of flights, mean and total distance per plane, discard records with NA
at tailnum
column and save it to temporal table.
- import data into R in several ways while also beeing able to identify a suitable import tool
- use SQL code within R
- And much more
Exercise 7
List all the tables in the database (hint: There is a function in DBI
package for this).
Exercise 8
Use head()
, tail()
and nrow()
to investigate table planes
. Do you understand why the latter two do not work?
Exercise 9
Join the table from exercise 6 with planes
table and without pulling the data to local frame find manufacturers and models of 10 planes with the highest total flown distance.
Exercise 10
Check what is the actual SQL query generated by the code you have created in exercise 9.
Related exercise sets:
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.