Reshaping data frames using pivot functions from {tidyr} and tally from {dplyr}
[This article was first published on Ronan's #TidyTuesday blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
In this post, the San Francisco Rentals data set is used to demonstrate data reshaping in R. This involves changing the number of columns and rows in a data frame to fit a given use case. A data frame is made more tall or narrow by decreasing the number of columns, and wider by increasing the number of columns. The three reshaping methods covered in this article are:
- Making a data frame more narrow by summarising variables using group_by() and tally()
- Making a data frame wider with pivot_wider()
- “Lengthening” a data frame with pivot_longer()
Data frames created with these methods were used to make two plots:
- Count of construction permits by type per street
- Annual construction by type per San Francisco county
Setup
Loading the R libraries and data set.
# Loading libraries library(tidytuesdayR) library(tidyverse) library(tidytext) library(ggthemes) # Loading data tt <- tt_load("2022-07-05") Downloading file 1 of 3: `rent.csv` Downloading file 2 of 3: `sf_permits.csv` Downloading file 3 of 3: `new_construction.csv`
Reshaping a data frame by summarising variables
# Printing a summary of the San Francisco (SF) permits data frame tt$sf_permits # A tibble: 86,103 × 44 permit_number permit_type permit_type_definiti… permit_creation_da… <dbl> <dbl> <chr> <dttm> 1 2000010368 3 additions alteration… 2000-01-03 00:00:00 2 2000010353 6 demolitions 2000-01-03 00:00:00 3 2000010498 3 additions alteration… 2000-01-04 00:00:00 4 2000010484 3 additions alteration… 2000-01-04 00:00:00 5 2000010480 3 additions alteration… 2000-01-04 00:00:00 6 2000010475 3 additions alteration… 2000-01-04 00:00:00 7 2000010476 3 additions alteration… 2000-01-04 00:00:00 8 2000010474 3 additions alteration… 2000-01-04 00:00:00 9 2000010479 3 additions alteration… 2000-01-04 00:00:00 10 20000104173 3 additions alteration… 2000-01-04 00:00:00 # … with 86,093 more rows, and 40 more variables: block <chr>, # lot <chr>, street_number <dbl>, street_number_suffix <chr>, # street_name <chr>, street_suffix <chr>, unit <dbl>, # unit_suffix <chr>, description <chr>, status <chr>, # status_date <dttm>, filed_date <dttm>, issued_date <dttm>, # completed_date <dttm>, first_construction_document_date <dttm>, # structural_notification <chr>, … # Printing a summary of the shape of the data frame paste("tt$sf_permits has", nrow(tt$sf_permits), "rows and", ncol(tt$sf_permits), "columns.") [1] "tt$sf_permits has 86103 rows and 44 columns." # Creating a tall/narrow data set of permits per street permits_per_street <- tt$sf_permits %>% # Selecting variables/columns to keep select(permit_type_definition, street_name, permit_number) %>% # Grouping the permit numbers by type and street name for counting group_by(permit_type_definition, street_name) %>% # Counting/tallying the number of permits by type per street tally() # Printing a summary of the permits per street data frame permits_per_street # A tibble: 3,053 × 3 # Groups: permit_type_definition [4] permit_type_definition street_name n <chr> <chr> <int> 1 additions alterations or repairs 01st 196 2 additions alterations or repairs 02nd 763 3 additions alterations or repairs 03rd 778 4 additions alterations or repairs 04th 338 5 additions alterations or repairs 05th 223 6 additions alterations or repairs 06th 347 7 additions alterations or repairs 07th 199 8 additions alterations or repairs 08th 252 9 additions alterations or repairs 08th Ti 1 10 additions alterations or repairs 09th 301 # … with 3,043 more rows # Printing a summary of the shape of the data frame paste("permits_per_street has", nrow(permits_per_street), "rows and", ncol(permits_per_street), "columns.") [1] "permits_per_street has 3053 rows and 3 columns."
Reshaping a data frame to make it wider with the {tidyr} function pivot wider
# Creating a wider copy of the permits per street data frame permits_per_street_wider <- permits_per_street %>% # Pivoting the street names wider (creating a column for each street) and # selecting the "n" variable for the values in this data frame pivot_wider(names_from = street_name, values_from = n) # Printing the wider permits per street data frame permits_per_street_wider # A tibble: 4 × 1,588 # Groups: permit_type_definition [4] permit_type_defini… `01st` `02nd` `03rd` `04th` `05th` `06th` `07th` <chr> <int> <int> <int> <int> <int> <int> <int> 1 additions alterati… 196 763 778 338 223 347 199 2 demolitions 16 17 72 8 7 17 24 3 new construction 9 9 26 8 3 4 13 4 new construction w… NA 3 48 3 4 8 7 # … with 1,580 more variables: `08th` <int>, `08th Ti` <int>, # `09th` <int>, `10th` <int>, `11th` <int>, `12th` <int>, # `13th` <int>, `13th Ti` <int>, `14th` <int>, `15th` <int>, # `16th` <int>, `17th` <int>, `18th` <int>, `19th` <int>, # `20th` <int>, `21st` <int>, `22nd` <int>, `23rd` <int>, # `24th` <int>, `25th` <int>, `25th North` <int>, `26th` <int>, # `27th` <int>, `28th` <int>, `29th` <int>, `2nd` <int>, … # Printing a summary of the shape of the data frame paste("permits_per_street_wider has", nrow(permits_per_street_wider), "rows and", ncol(permits_per_street_wider), "columns.") [1] "permits_per_street_wider has 4 rows and 1588 columns."
Reshaping a data frame to make it more narrow with the {tidyr} function pivot longer
# Printing a summary of the new construction data frame tt$new_construction # A tibble: 261 × 10 cartodb_id the_geom the_geom_webmerca… county year totalproduction <dbl> <lgl> <lgl> <chr> <dbl> <dbl> 1 1 NA NA Alame… 1990 3601 2 2 NA NA Alame… 1991 226 3 3 NA NA Alame… 1992 2652 4 4 NA NA Alame… 1993 3049 5 5 NA NA Alame… 1994 2617 6 6 NA NA Alame… 1995 3515 7 7 NA NA Alame… 1996 3179 8 8 NA NA Alame… 1997 4591 9 9 NA NA Alame… 1998 6022 10 10 NA NA Alame… 1999 5601 # … with 251 more rows, and 4 more variables: sfproduction <dbl>, # mfproduction <dbl>, mhproduction <dbl>, source <chr> # Printing a summary of the shape of the data frame paste("tt$new_construction has", nrow(tt$new_construction), "rows and", ncol(tt$new_construction), "columns.") [1] "tt$new_construction has 261 rows and 10 columns." # Creating a taller/more narrow subset of production type per county production_per_county <- tt$new_construction %>% # Selecting variables/columns from tt$new_construction select(county, year, totalproduction, sfproduction, mfproduction,mhproduction) %>% # "Lengthening" the data frame by selecting columns to be pivoted to a longer format pivot_longer(cols = c(totalproduction, sfproduction, mfproduction, mhproduction)) %>% # Creating a copy of the "name" column to the more descriptive "production_type", as the # pivoted columns all describe types of production, and removing the original "name" # column mutate(production_type = name, name = NULL) %>% # Changing "production_type" from a character to a factor variable, with more # descriptive factor levels mutate(production_type = fct_recode(production_type, "Total" = "totalproduction", "Single family" = "sfproduction", "Multi family" = "mfproduction", "Mobile home" = "mhproduction")) # Printing a summary of the production per county data frame production_per_county # A tibble: 1,044 × 4 county year value production_type <chr> <dbl> <dbl> <fct> 1 Alameda County 1990 3601 Total 2 Alameda County 1990 2166 Single family 3 Alameda County 1990 1378 Multi family 4 Alameda County 1990 57 Mobile home 5 Alameda County 1991 226 Total 6 Alameda County 1991 -236 Single family 7 Alameda County 1991 395 Multi family 8 Alameda County 1991 67 Mobile home 9 Alameda County 1992 2652 Total 10 Alameda County 1992 2018 Single family # … with 1,034 more rows # Printing a summary of the shape of the data frame paste("production_per_county has", nrow(production_per_county), "rows and", ncol(production_per_county), "columns.") [1] "production_per_county has 1044 rows and 4 columns."
Plotting permit type counts per street using a tidy data frame of value counts
# Plotting the top 20 streets with the total number of each permit category permits_per_street %>% slice_max(order_by = n, n = 20) %>% mutate(street_name = reorder_within(street_name, n, permit_type_definition)) %>% ggplot(aes(x = n, y = street_name, fill = permit_type_definition)) + geom_col(show.legend = FALSE) + scale_y_reordered() + theme_solarized_2() + facet_wrap(~permit_type_definition, ncol = 2, scales = "free") + labs(title = "Count of construction permits by type per street", x = "Tally", y = "Street name")
Plotting annual construction per San Francisco county using a data frame created with pivot longer
# Plotting the annual construction by type per San Francisco county production_per_county %>% ggplot(aes(x = year, y = value, colour = fct_reorder2(production_type, year, value))) + geom_line() + theme_clean() + facet_wrap(~county, scales = "free") + scale_colour_brewer(palette = "Dark2") + scale_x_continuous(breaks = seq(min(production_per_county$year), max(production_per_county$year), 8)) + geom_vline(xintercept = 2008, linetype = 2, colour = "red", size = 0.4) + labs(colour = "Production type", x = "Year", y = "Units", title = "Annual construction by type per San Francisco county", subtitle = "Red vertical line marks 2008")
To leave a comment for the author, please follow the link and comment on their blog: Ronan's #TidyTuesday blog.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.