Site icon R-bloggers

R with remote databases Exercises (Part-2)

[This article was first published on R-exercises, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


This is common case when working with data that your source is a remote database. Usual ways to cope this when using R is either to load all the data into R or to perform the heaviest joins and aggregations with SQL before loading the data. Both of them have cons: the former one is limited by the memory capacity and may be very slow and the later forces you to use two technologies thus is more complicated and prone to errors. Solution to these problems is to use dplyr with dbplyr to communicate with database backend. This allows user to write dplyr code that is translated to SQL and executed at database server. One can say that this combines advantages of the two standard solutions and gets rid of their disadvantages.

This is the second part of R with remote databases series. For other parts follow the tag databases.

The reader is assumed to know basics of dplyr and SQL. If you want to practice dplyr first there is great series of exercises Data wrangling: Transforming available. For quick introduction to dplyr with database backend I recommend this vignette.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Load libraries: dplyr, dbplyr, DBI, RSQLite, nycflights13. Create a connection to temporal in-memory SQLite database and load to it two copies of data set nycflights13::flights – one with no indexes and another with index at carrier.

Exercise 2
Compare execution time for counting a number of flights per carrier for indexed and not indexed table. (HINT: You can use st <- Sys.time(); **code**; Sys.time - st or microbenchmark package.)

Exercise 3
Compare query plans for queries from Exercise 2.

Exercise 4
Check how dbplyr translates as.character to SQL.

Exercise 5
Check how dbplyr translates substr to SQL depending on values of parameters. What is the relation between R and SQL version?

Exercise 6
Check how dbplyr translates ^ to SQL. Check how does it work with SQLite.

Exercise 7
Check how dbplyr translates mean to SQL. What happens when you specify na.rm or trim parameter?

< aside class='stb-icon'>
Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:
  • import data into R in several ways while also beeing able to identify a suitable import tool
  • use SQL code within R
  • And much more

Exercise 8
Calculate mean value of dep_delay per carrier. Compare results for the table in the database and for local one.

Exercise 9
In database calculate number of destination airports with airport code starting with ‘A’. Use wildcard matching.

Exercise 10
In database concatenate origin and dest to one column separated by ‘-‘.

Related exercise sets:

  1. R with remote databases Exercises (Part-1)
  2. Descriptive Analytics-Part 5: Data Visualisation (Categorical variables)
  3. Descriptive Analytics-Part 4 : Data Manipulation
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.