This post is a follow-up to my previous post, "Identifying R Functions & Packages Used in GitHub Repos", which introduced funspotr.
funspotr can also be applied to gists:
By functions or packages used…? https://t.co/kbSLOpQZLF
— Bryan Shalloway (@brshallo) January 22, 2022
A problem I bumped into was that most of Chelsea’s gists don’t actually have .R or .Rmd extensions, so my approach skipped most of her snippets. I wanted to parse my own gists but ran into a related problem: most of my GitHub gist code snippets are saved as .md files, so knitr::purl() won’t work.
In this post I…
- create a function to extract code chunks from simple .md files
- parse the functions and packages in my code using funspotr.
Parsing code
First I used funspotr to get a table with all of my gists.
    library(dplyr)
    library(purrr)
    library(stringr)
    library(funspotr)

    brshallo_gists <- funspotr::github_gists("brshallo")

    brshallo_gists
    ## # A tibble: 97 x 2
    ##    contents                                  urls
    ##    <chr>                                     <chr>
    ##  1 funspotr-gists-cmparlettpelleriti-ex.R    https://gist.githubusercontent.com~
    ##  2 custom-ggplot-and-labels.R                https://gist.githubusercontent.com~
    ##  3 stratified-sampling-parameter-estimates.R https://gist.githubusercontent.com~
    ##  4 grouped-nested-t-test.md                  https://gist.githubusercontent.com~
    ##  5 benchmark-cdf-methods.md                  https://gist.githubusercontent.com~
    ##  6 split-group-nest-join.md                  https://gist.githubusercontent.com~
    ##  7 weighted-t-test-tidied.md                 https://gist.githubusercontent.com~
    ##  8 cdf_density.R                             https://gist.githubusercontent.com~
    ##  9 if_all-if_any-examples.R                  https://gist.githubusercontent.com~
    ## 10 weighted-grouped-bootstrap-simulation.md  https://gist.githubusercontent.com~
    ## # ... with 87 more rows
Parsing R files
funspotr is already set up to parse all the unique functions and packages from R or Rmd files.
    r_gists <- brshallo_gists %>%
      filter(funspotr:::str_detect_r_rmd(contents))

    r_gists_parsed <- funspotr::github_spot_funs(custom_urls = r_gists)

    r_gists_unnested <- r_gists_parsed %>%
      funspotr::unnest_github_results()

    r_gists_unnested
    ## # A tibble: 474 x 5
    ##    funs                  pkgs      in_multiple_pkgs contents      urls
    ##    <chr>                 <chr>     <lgl>            <chr>         <chr>
    ##  1 library               base      FALSE            funspotr-gis~ https://gist.~
    ##  2 github_gists          funspotr  FALSE            funspotr-gis~ https://gist.~
    ##  3 filter                dplyr     TRUE             funspotr-gis~ https://gist.~
    ##  4 str_detect_r_rmd      (unknown) FALSE            funspotr-gis~ https://gist.~
    ##  5 github_spot_funs      funspotr  FALSE            funspotr-gis~ https://gist.~
    ##  6 unnest_github_results funspotr  FALSE            funspotr-gis~ https://gist.~
    ##  7 library               base      FALSE            custom-ggplo~ https://gist.~
    ##  8 ggplot                ggplot    FALSE            custom-ggplo~ https://gist.~
    ##  9 aes                   ggplot    FALSE            custom-ggplo~ https://gist.~
    ## 10 geom_point            ggplot    FALSE            custom-ggplo~ https://gist.~
    ## # ... with 464 more rows
Parsing markdown files
To parse my .md files, I wrote a function, extract_code_md(), that…
- reads in a file
- extracts the text in code chunks
- saves it to a temporary file
- returns the file path of the temporary file
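    # keep only the even-indexed elements of x
    subset_even <- function(x) x[!seq_along(x) %% 2]

    extract_code_md <- function(file_path){

      lines <- readr::read_file(file_path) %>%
        # split on code fences; the even-indexed pieces are the code chunks
        stringr::str_split("```.*", simplify = TRUE) %>%
        subset_even() %>%
        stringr::str_flatten("\n## new chunk \n")

      # write the extracted code to a temporary .R file and return its path
      file_output <- tempfile(fileext = ".R")
      writeLines(lines, file_output)
      file_output
    }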
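To make the even-index idea concrete, here is a minimal base R sketch on a made-up markdown string (the toy content and the variable names are assumptions for illustration, not taken from my gists). Splitting on the fences yields pieces that alternate prose, code, prose, code, so the even-numbered pieces are the code, provided the text starts with prose and every fence is closed.

```r
# Build the fence with strrep() so this example can itself live in a code block.
fence <- strrep("`", 3)

# Toy markdown document (hypothetical content).
md_text <- paste(
  c("Some prose.",
    paste0(fence, "r"), "library(dplyr)", fence,
    "More prose.",
    paste0(fence, "r"), "x <- 1", fence,
    "Closing prose."),
  collapse = "\n"
)

# Split on each fence plus whatever follows it on the same line (e.g. the "r" tag).
pieces <- strsplit(md_text, paste0(fence, "[^\n]*"))[[1]]

# Pieces alternate prose / code, so the even-indexed ones are the code chunks.
code_chunks <- trimws(pieces[seq_along(pieces) %% 2 == 0])
code_chunks
```

Note that this relies on the document beginning with prose; a file that opens directly with a fence would need its own check.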
I map extract_code_md() over all the .md gists and then parse the resulting files using funspotr.
    # display output was weird here so just copied code and set eval = FALSE
    md_gists <- brshallo_gists %>%
      filter(!funspotr:::str_detect_r_rmd(contents))

    md_gists_local <- md_gists %>%
      # name urls because that's what funspotr::github_spot_funs() expects
      mutate(urls = map_chr(urls, extract_code_md))

    md_gists_parsed <- funspotr::github_spot_funs(custom_urls = md_gists_local)

    md_gists_unnested <- md_gists_parsed %>%
      funspotr::unnest_github_results()

    md_gists_unnested
    ## # A tibble: 1,061 x 5
    ##    funs           pkgs    in_multiple_pkgs contents                 urls
    ##    <chr>          <chr>   <lgl>            <chr>                    <chr>
    ##  1 library        base    FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  2 require        base    FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  3 install_github remotes FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  4 na.omit        stats   FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  5 t.test         stats   FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  6 tidy           broom   FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  7 pull           dplyr   FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  8 group_by       dplyr   FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  9 summarise      dplyr   FALSE            grouped-nested-t-test.md "C:\\Users\~
    ## 10 list           base    FALSE            grouped-nested-t-test.md "C:\\Users\~
    ## # ... with 1,051 more rows
Note that I’m assuming all the code snippets are R code.
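If that assumption did not hold, one option would be to keep only the chunks whose opening fence is tagged as R. The following is a rough base R sketch of that idea; extract_r_chunks() is a hypothetical helper (it is not part of funspotr and not what I used above), and it assumes fences are balanced and start at the beginning of a line.

```r
# Hypothetical helper: return the body of each fenced chunk tagged as R.
extract_r_chunks <- function(md_lines) {
  fence <- strrep("`", 3)
  fence_idx <- which(startsWith(md_lines, fence))

  # Assumption: fences are balanced, so they alternate open / close.
  stopifnot(length(fence_idx) %% 2 == 0)
  opens  <- fence_idx[seq_along(fence_idx) %% 2 == 1]
  closes <- fence_idx[seq_along(fence_idx) %% 2 == 0]

  # Treat a chunk as R if its opening fence is followed by "r" or "{r".
  is_r <- grepl("^`{3}\\s*\\{?[rR]\\b", md_lines[opens], perl = TRUE)
  r_opens  <- opens[is_r]
  r_closes <- closes[is_r]

  # Collect the lines strictly between each opening and closing fence.
  vapply(seq_along(r_opens), function(i) {
    n_body <- r_closes[i] - r_opens[i] - 1
    paste(md_lines[r_opens[i] + seq_len(n_body)], collapse = "\n")
  }, character(1))
}

# Toy input (made up): two R chunks and one Python chunk.
fence <- strrep("`", 3)
md_lines <- c(
  "Intro",
  paste0(fence, "r"), "x <- 1", "y <- 2", fence,
  "Text",
  paste0(fence, "python"), "print('hi')", fence,
  paste0(fence, "{r}"), "mean(x)", fence
)

r_chunks <- extract_r_chunks(md_lines)
r_chunks
```

Untagged fences are dropped by this sketch, which would be too strict for gists where I left the language tag off.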
Binding files together
I bind these files together and then arrange them based on the initial order in brshallo_gists.
    gists_unnested <- bind_rows(
      r_gists_unnested,
      md_gists_unnested
    ) %>%
      # got this arranging by a vector trick from SO:
      # https://stackoverflow.com/questions/52216341/how-to-sort-rows-of-a-data-frame-based-on-a-vector-using-dplyr-pipe
      arrange(match(contents, brshallo_gists$contents)) %>%
      # add back-in links to url's where files are rather than urls column being
      # local paths for .md snippets
      select(-urls) %>%
      left_join(brshallo_gists, by = "contents")

    gists_unnested %>%
      DT::datatable(rownames = FALSE,
                    class = 'cell-border stripe',
                    filter = 'top',
                    escape = FALSE,
                    options = list(pageLength = 20))
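The arrange(match(...)) step is the part that restores the original gist order. As a small illustration of why it works, here is the base R equivalent on made-up data (the file names and rows are assumptions for the example): match() turns each file name into its position in the reference vector, and sorting by that position reproduces the reference order.

```r
# Reference order, standing in for brshallo_gists$contents (toy values).
ref_order <- c("b.md", "a.R", "c.md")

# Toy parsed results, currently grouped by file type rather than gist order.
results <- data.frame(
  contents = c("a.R", "a.R", "b.md", "c.md", "b.md"),
  funs     = c("library", "filter", "t.test", "tidy", "pull"),
  stringsAsFactors = FALSE
)

# match() gives each row's position in ref_order; order() sorts by it.
# Ties keep their original relative order.
sorted <- results[order(match(results$contents, ref_order)), ]
sorted
```

In the pipeline above, arrange(match(contents, brshallo_gists$contents)) does the same thing inside a dplyr chain.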