SQL for Data Science – Part 2
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
This is the fourth post in the series R & Databases. You can find the links to the other two posts of this series below:
In this post, we will learn to
- aggregate data
- order data
- and group data
Libraries, Code & Data
We will use the following libraries in this post:
All the data sets used in this post can be found here and code can be downloaded from here.
Set Up
ecom <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/web.csv') con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") copy_to(con, ecom)
Aggregate
Let us combine the aggregate statements with WHERE statement to filter data.
- SUM() : returns the total sum of a numeric column
dbGetQuery(con, "SELECT SUM(n_visit) FROM ecom") ## SUM(n_visit) ## 1 4972 dbGetQuery(con, "SELECT SUM(n_visit) FROM ecom WHERE n_visit > 5") ## SUM(n_visit) ## 1 3574
- AVG() : returns the average value of a numeric column
dbGetQuery(con, "SELECT AVG(n_visit) FROM ecom") ## AVG(n_visit) ## 1 4.972 dbGetQuery(con, "SELECT AVG(n_visit) FROM ecom WHERE country LIKE 'P%'") ## AVG(n_visit) ## 1 5.079137
- MAX() : returns the largest value of the selected column
dbGetQuery(con, "SELECT MAX(n_visit) FROM ecom") ## MAX(n_visit) ## 1 10 dbGetQuery(con, "SELECT MAX(n_visit) FROM ecom WHERE device == 'tablet'") ## MAX(n_visit) ## 1 10
- MIN() : returns the smallest value of the selected column
dbGetQuery(con, "SELECT MIN(n_visit) FROM ecom") ## MIN(n_visit) ## 1 0 dbGetQuery(con, "SELECT MIN(n_visit) FROM ecom WHERE duration BETWEEN 600 AND 900") ## MIN(n_visit) ## 1 0
Alias
SQL aliases are used to give a table, or a column in a table, a temporary name. They are often used to make column names more readable. An alias only exists for the duration of the query. Below are a few examples:
dbGetQuery(con, "SELECT AVG(n_visit) AS avg_mobile FROM ecom WHERE device == 'mobile'") ## avg_mobile ## 1 5.479651 dbGetQuery(con, "SELECT MAX(n_visit) AS max_visit FROM ecom") ## max_visit ## 1 10 dbGetQuery(con, "SELECT MIN(duration) AS min_duration FROM ecom") ## min_duration ## 1 10
Order By
The ORDER BY keyword is used to sort the records in ascending or descending order. By default, the records are sorted in ascending order. Use the DESC keyword if you want to sort the records in descending order,
dbGetQuery(con, "SELECT * FROM ecom ORDER BY country") ## id referrer device bouncers n_visit n_pages duration ## 1 232 social laptop 0 8 2 60 ## 2 299 yahoo laptop 0 10 18 180 ## 3 570 social laptop 1 2 1 274 ## 4 677 direct tablet 1 10 1 682 ## country purchase order_items order_value ## 1 Afghanistan 0 0 0 ## 2 Afghanistan 0 0 0 ## 3 Afghanistan 0 0 0 ## 4 Afghanistan 0 0 0 ## [ reached getOption("max.print") -- omitted 996 rows ] dbGetQuery(con, "SELECT * FROM ecom ORDER BY duration") ## id referrer device bouncers n_visit n_pages duration ## 1 236 yahoo tablet 1 5 1 10 ## 2 615 social laptop 1 1 1 10 ## 3 392 yahoo laptop 0 0 1 12 ## 4 688 social mobile 1 2 1 12 ## country purchase order_items order_value ## 1 Poland 0 0 0 ## 2 Finland 0 0 0 ## 3 Indonesia 0 0 0 ## 4 Botswana 0 0 0 ## [ reached getOption("max.print") -- omitted 996 rows ] dbGetQuery(con, "SELECT * FROM ecom ORDER BY n_visit DESC") ## id referrer device bouncers n_visit n_pages duration ## 1 1 google laptop 1 10 1 693 ## 2 7 yahoo mobile 1 10 1 75 ## 3 8 direct mobile 1 10 1 908 ## 4 29 google mobile 1 10 1 338 ## country purchase order_items order_value ## 1 Czech Republic 0 0 0 ## 2 Bangladesh 0 0 0 ## 3 Indonesia 0 0 0 ## 4 Russia 0 0 0 ## [ reached getOption("max.print") -- omitted 996 rows ]
Group By
The GROUP BY statement is used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result by one or more columns.
dbGetQuery(con, "SELECT device, count(*) AS visits FROM ecom GROUP BY device ORDER by visits DESC") ## device visits ## 1 mobile 344 ## 2 tablet 331 ## 3 laptop 325 dbGetQuery(con, "SELECT device, MAX(duration) AS max_duration FROM ecom GROUP BY device ORDER by max_duration DESC") ## device max_duration ## 1 tablet 999 ## 2 laptop 997 ## 3 mobile 994
Summary
In this post, we learnt to
- aggregate data
- order data
- and group data
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.