Window Functions in SQL vs. R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Both MS SQL Server and R’s dplyr
package offer window functions for different operations on a dataset. What are window functions? Window functions operate on a set of rows or “window” and return a value for each row, whether that’s something like a row number or percentile rank, or an aggregate result from a min, max, average, or sum.
The most common problem I have that window functions solve is when I have a set of records that I need to unduplicate. Let’s say I have student enrollment records where an individual student may have more than one enrollment; and let’s say I want to grab only the latest enrollment record from each student. In this case, each student would be a partition or window in which I could create a function to label the latest record. Once the latest record is identified, I could then pull just that single row.
SQL
How would I do this in SQL? Below I connect to my MS SQL Server instance and print a very simple table of student enrollment records. You can see there are more than one per studentID
. How would I grab the latest from each per the date
column?
library(tidyverse) library(DBI) library(odbc) library(knitr) con <- dbConnect(odbc::odbc(), "nicodemus", database = "Demo") enroll <- as_tibble(dbGetQuery(con, "select * from Demo.dbo.tbl_Enroll")) enroll ## # A tibble: 11 x 3 ## studentID enrollmentType date ## <chr> <chr> <dttm> ## 1 1000 D 2019-02-01 00:00:00 ## 2 1000 P 2019-01-20 00:00:00 ## 3 1001 D 2019-02-02 00:00:00 ## 4 1001 P 2018-12-01 00:00:00 ## 5 1001 P 2018-12-30 00:00:00 ## 6 1002 D 2019-01-15 00:00:00 ## 7 1002 D 2018-11-30 00:00:00 ## 8 1003 P 2018-10-31 00:00:00 ## 9 1003 P 2019-02-28 00:00:00 ## 10 1003 D 2018-09-30 00:00:00 ## 11 1003 P 2019-01-01 00:00:00
In the SQL below, I am using ROW_NUMBER()
to make this happen. I define the window by using the partition by
clause and then I use order by
to tell SQL in what I want the rows numbered. Here, I want desc
because I want the rows labeled 1
to be the latest enrollment record per student.
select * , ROW_NUMBER() over (partition by studentID order by [date] desc) as rowNum --SQL window function syntax from Demo.dbo.tbl_Enroll ## # A tibble: 11 x 4 ## studentID enrollmentType date rowNum ## <chr> <chr> <dttm> <S3: integer64> ## 1 1000 D 2019-02-01 00:00:00 1 ## 2 1000 P 2019-01-20 00:00:00 2 ## 3 1001 D 2019-02-02 00:00:00 1 ## 4 1001 P 2018-12-30 00:00:00 2 ## 5 1001 P 2018-12-01 00:00:00 3 ## 6 1002 D 2019-01-15 00:00:00 1 ## 7 1002 D 2018-11-30 00:00:00 2 ## 8 1003 P 2019-02-28 00:00:00 1 ## 9 1003 P 2019-01-01 00:00:00 2 ## 10 1003 P 2018-10-31 00:00:00 3 ## 11 1003 D 2018-09-30 00:00:00 4
How would I then grab the latest record? Create an outer query that just selects where rowNum = 1
.
select * from ( select * , ROW_NUMBER() over (partition by studentID order by [date] desc) as rowNum from Demo.dbo.tbl_Enroll ) t where rowNum = 1 --outer query to get just the latest record per student. ## # A tibble: 4 x 4 ## studentID enrollmentType date rowNum ## <chr> <chr> <dttm> <S3: integer64> ## 1 1000 D 2019-02-01 00:00:00 1 ## 2 1001 D 2019-02-02 00:00:00 1 ## 3 1002 D 2019-01-15 00:00:00 1 ## 4 1003 P 2019-02-28 00:00:00 1
R
R’s dplyr package also has window functions to do the same operation. What happens when I just apply rownumber()
to the data?
enroll %>% mutate(rowNum = row_number()) ## # A tibble: 11 x 4 ## studentID enrollmentType date rowNum ## <chr> <chr> <dttm> <int> ## 1 1000 D 2019-02-01 00:00:00 1 ## 2 1000 P 2019-01-20 00:00:00 2 ## 3 1001 D 2019-02-02 00:00:00 3 ## 4 1001 P 2018-12-01 00:00:00 4 ## 5 1001 P 2018-12-30 00:00:00 5 ## 6 1002 D 2019-01-15 00:00:00 6 ## 7 1002 D 2018-11-30 00:00:00 7 ## 8 1003 P 2018-10-31 00:00:00 8 ## 9 1003 P 2019-02-28 00:00:00 9 ## 10 1003 D 2018-09-30 00:00:00 10 ## 11 1003 P 2019-01-01 00:00:00 11
Great, it gave the row number of each row in the dataset. What happens when I put the column I want partitioned inside row_number()
?
enroll %>% mutate(rowNum = row_number(studentID)) ## # A tibble: 11 x 4 ## studentID enrollmentType date rowNum ## <chr> <chr> <dttm> <int> ## 1 1000 D 2019-02-01 00:00:00 1 ## 2 1000 P 2019-01-20 00:00:00 2 ## 3 1001 D 2019-02-02 00:00:00 3 ## 4 1001 P 2018-12-01 00:00:00 4 ## 5 1001 P 2018-12-30 00:00:00 5 ## 6 1002 D 2019-01-15 00:00:00 6 ## 7 1002 D 2018-11-30 00:00:00 7 ## 8 1003 P 2018-10-31 00:00:00 8 ## 9 1003 P 2019-02-28 00:00:00 9 ## 10 1003 D 2018-09-30 00:00:00 10 ## 11 1003 P 2019-01-01 00:00:00 11
That did nothing. This is without reading the documentation obviously. What that did was just give a row number for the studentID
vector regardless of duplication; I could have put any of the columns inside the function and it would return the same result. Not what we want. What we want to do is put studentID
in a group_by
clause before mutating.
enroll %>% group_by(studentID) %>% #adding the group_by clause to partition by studentID. mutate(rowNum = row_number()) ## # A tibble: 11 x 4 ## # Groups: studentID [4] ## studentID enrollmentType date rowNum ## <chr> <chr> <dttm> <int> ## 1 1000 D 2019-02-01 00:00:00 1 ## 2 1000 P 2019-01-20 00:00:00 2 ## 3 1001 D 2019-02-02 00:00:00 1 ## 4 1001 P 2018-12-01 00:00:00 2 ## 5 1001 P 2018-12-30 00:00:00 3 ## 6 1002 D 2019-01-15 00:00:00 1 ## 7 1002 D 2018-11-30 00:00:00 2 ## 8 1003 P 2018-10-31 00:00:00 1 ## 9 1003 P 2019-02-28 00:00:00 2 ## 10 1003 D 2018-09-30 00:00:00 3 ## 11 1003 P 2019-01-01 00:00:00 4
Sweet! But wait, this just listed the row number for each studentID
in whatever order it came in. We want it to list 1
for the lastest record per the date
column.
enroll %>% group_by(studentID) %>% mutate(rowNum = row_number(desc(date))) %>% #specify how to list row numbers here; desc for latest first. filter(rowNum == 1) #lastly, filter to just the latest record. ## # A tibble: 4 x 4 ## # Groups: studentID [4] ## studentID enrollmentType date rowNum ## <chr> <chr> <dttm> <int> ## 1 1000 D 2019-02-01 00:00:00 1 ## 2 1001 D 2019-02-02 00:00:00 1 ## 3 1002 D 2019-01-15 00:00:00 1 ## 4 1003 P 2019-02-28 00:00:00 1
Nice! I really wrote this to learn the nuances for how to do the same operation I do in SQL in R for my particular use case. The vignette provided by the tidyverse team did not really suit my needs in this instance, but others my find it helpful as further reading. Cheers!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.