
Identifying R Functions & Packages in GitHub Gists

[This article was first published on rstats on Bryan Shalloway's Blog, and kindly contributed to R-bloggers].
    This post is a follow-up to my previous post, Identifying R Functions & Packages Used in GitHub Repos, which introduced funspotr.

    funspotr can also be applied to gists:

    By functions or packages used…? https://t.co/kbSLOpQZLF

    — Bryan Shalloway (@brshallo) January 22, 2022

    A problem I bumped into was that most of Chelsea’s gists don’t actually have .R or .Rmd extensions, so my approach skipped most of her snippets. I wanted to parse my own gists but ran into a related problem: most of my GitHub gist code snippets are saved as .md files, so knitr::purl() won’t work on them.

    In this post I…

    1. create a function to extract code chunks from simple .md files
    2. parse the functions and packages in my code using funspotr.

    Parsing code

    First I used funspotr to get a table with all of my gists.

    library(dplyr)
    library(purrr)
    library(stringr)
    library(funspotr)
    
    brshallo_gists <- funspotr::github_gists("brshallo")
    
    brshallo_gists
    ## # A tibble: 97 x 2
    ##    contents                                  urls                               
    ##    <chr>                                     <chr>                              
    ##  1 funspotr-gists-cmparlettpelleriti-ex.R    https://gist.githubusercontent.com~
    ##  2 custom-ggplot-and-labels.R                https://gist.githubusercontent.com~
    ##  3 stratified-sampling-parameter-estimates.R https://gist.githubusercontent.com~
    ##  4 grouped-nested-t-test.md                  https://gist.githubusercontent.com~
    ##  5 benchmark-cdf-methods.md                  https://gist.githubusercontent.com~
    ##  6 split-group-nest-join.md                  https://gist.githubusercontent.com~
    ##  7 weighted-t-test-tidied.md                 https://gist.githubusercontent.com~
    ##  8 cdf_density.R                             https://gist.githubusercontent.com~
    ##  9 if_all-if_any-examples.R                  https://gist.githubusercontent.com~
    ## 10 weighted-grouped-bootstrap-simulation.md  https://gist.githubusercontent.com~
    ## # ... with 87 more rows

    Parsing R files

    funspotr is already set up to parse all the unique functions and packages from R or Rmd files.

    r_gists <- brshallo_gists %>% 
      filter(funspotr:::str_detect_r_rmd(contents))
    
    r_gists_parsed <- funspotr::github_spot_funs(custom_urls = r_gists)
    
    r_gists_unnested <- r_gists_parsed %>% 
      funspotr::unnest_github_results()
    r_gists_unnested
    ## # A tibble: 474 x 5
    ##    funs                  pkgs      in_multiple_pkgs contents      urls          
    ##    <chr>                 <chr>     <lgl>            <chr>         <chr>         
    ##  1 library               base      FALSE            funspotr-gis~ https://gist.~
    ##  2 github_gists          funspotr  FALSE            funspotr-gis~ https://gist.~
    ##  3 filter                dplyr     TRUE             funspotr-gis~ https://gist.~
    ##  4 str_detect_r_rmd      (unknown) FALSE            funspotr-gis~ https://gist.~
    ##  5 github_spot_funs      funspotr  FALSE            funspotr-gis~ https://gist.~
    ##  6 unnest_github_results funspotr  FALSE            funspotr-gis~ https://gist.~
    ##  7 library               base      FALSE            custom-ggplo~ https://gist.~
    ##  8 ggplot                ggplot    FALSE            custom-ggplo~ https://gist.~
    ##  9 aes                   ggplot    FALSE            custom-ggplo~ https://gist.~
    ## 10 geom_point            ggplot    FALSE            custom-ggplo~ https://gist.~
    ## # ... with 464 more rows
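The filter above relies on funspotr:::str_detect_r_rmd(), an unexported helper whose implementation is not shown in this post. As a rough stand-in, and assuming it does little more than test the file extension, the same split could be sketched in base R. The name is_r_or_rmd() and the file name report.Rmd are hypothetical.

```r
# Hypothetical stand-in for funspotr:::str_detect_r_rmd().
# Assumption: the helper only checks for a .R or .Rmd extension (any case).
is_r_or_rmd <- function(file_names) {
  grepl("\\.(r|rmd)$", file_names, ignore.case = TRUE)
}

file_names <- c("cdf_density.R", "grouped-nested-t-test.md", "report.Rmd")
r_like <- is_r_or_rmd(file_names)

file_names[r_like]   # files that can be parsed directly
file_names[!r_like]  # files whose code chunks must be extracted first
```

The negated filter is what later selects the .md gists, so every file lands in exactly one of the two groups.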

    Parsing markdown files

    To parse my .md files, I wrote a function, extract_code_md(), that:

    • reads in a file
    • extracts the text in code chunks
    • saves it to a temporary file
    • returns the file path of the temporary file

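    # keep the even-numbered elements of a vector
    subset_even <- function(x) x[seq_along(x) %% 2 == 0]
    
    # extract the code chunks from a .md file into a temporary .R file
    extract_code_md <- function(file_path){
      
      lines <- readr::read_file(file_path) %>% 
        # split on fence lines; pieces alternate prose, code, prose, code, ...
        stringr::str_split("```.*", simplify = TRUE) %>%
        subset_even() %>% 
        stringr::str_flatten("\n## new chunk \n")
      
      file_output <- tempfile(fileext = ".R")
      writeLines(lines, file_output)
      file_output
    }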
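To see why keeping the even-numbered pieces works, here is a small base R sketch of the same idea on an inline string. It swaps stringr::str_split() for strsplit(..., perl = TRUE) so that it runs without extra packages, and the sample markdown text is made up for illustration.

```r
fence <- strrep("`", 3)  # three backticks, built programmatically

# Made-up markdown text containing two fenced chunks.
md_text <- paste(
  "Some intro text",
  paste0(fence, "r"),
  "x <- 1",
  fence,
  "More prose",
  paste0(fence, "r"),
  "y <- x + 1",
  fence,
  sep = "\n"
)

# Splitting on every fence line yields alternating pieces:
# prose, code, prose, code, ... so the even positions hold the code.
pieces <- strsplit(md_text, paste0(fence, ".*"), perl = TRUE)[[1]]
code_chunks <- trimws(pieces[seq_along(pieces) %% 2 == 0])
code_chunks
```

This only holds when every opening fence has a matching closing fence; an unbalanced fence would shift prose into the even positions.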

    I map extract_code_md() over all the .md gists and then parse the resulting files using funspotr.

    # display output was weird here so just copied code and set eval = FALSE
    md_gists <- brshallo_gists %>% 
      filter(!funspotr:::str_detect_r_rmd(contents))
    
    md_gists_local <- md_gists %>% 
      # name the column urls because that's what funspotr::github_spot_funs() expects
      mutate(urls = map_chr(urls, extract_code_md))
    
    md_gists_parsed <- funspotr::github_spot_funs(custom_urls = md_gists_local)
    
    md_gists_unnested <- md_gists_parsed %>% 
      funspotr::unnest_github_results()
    md_gists_unnested
    ## # A tibble: 1,061 x 5
    ##    funs           pkgs    in_multiple_pkgs contents                 urls        
    ##    <chr>          <chr>   <lgl>            <chr>                    <chr>       
    ##  1 library        base    FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  2 require        base    FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  3 install_github remotes FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  4 na.omit        stats   FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  5 t.test         stats   FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  6 tidy           broom   FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  7 pull           dplyr   FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  8 group_by       dplyr   FALSE            grouped-nested-t-test.md "C:\\Users\~
    ##  9 summarise      dplyr   FALSE            grouped-nested-t-test.md "C:\\Users\~
    ## 10 list           base    FALSE            grouped-nested-t-test.md "C:\\Users\~
    ## # ... with 1,051 more rows

    Note that I’m assuming all the code snippets are R code.
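If some snippets were written in other languages, one option would be to keep only the chunks whose opening fence is tagged as R. The following is a minimal base R sketch of that idea and is not part of funspotr: extract_r_chunks() is a hypothetical helper, the sample lines are made up, and it assumes fences start at the beginning of a line and are balanced.

```r
fence <- strrep("`", 3)  # three backticks

# Hypothetical helper: return the code lines of chunks tagged r or {r ...}.
extract_r_chunks <- function(md_lines) {
  fence_idx <- which(startsWith(md_lines, fence))
  # Pair fences up as (open, close); assumes an even number of fences.
  opens  <- fence_idx[c(TRUE, FALSE)]
  closes <- fence_idx[c(FALSE, TRUE)]
  # The language tag is whatever follows the opening fence.
  tags <- trimws(sub(fence, "", md_lines[opens], fixed = TRUE))
  is_r <- grepl("^\\{?r\\b", tags, ignore.case = TRUE, perl = TRUE)
  # Collect the lines strictly between each kept pair of fences.
  chunks <- Map(
    function(from, to) md_lines[from + seq_len(to - from - 1)],
    opens[is_r], closes[is_r]
  )
  as.character(unlist(chunks, use.names = FALSE))
}

md_lines <- c(
  "Intro",
  paste0(fence, "r"), "library(dplyr)", fence,
  "A shell command:",
  paste0(fence, "bash"), "ls -la", fence,
  paste0(fence, "{r}"), "mean(1:3)", fence
)

r_code <- extract_r_chunks(md_lines)
r_code
```

Untagged fences are dropped by this sketch, so it would discard snippets that the simpler approach above keeps; which behavior is preferable depends on how consistently the gists are tagged.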

    Binding files together

    I bind these results together and then arrange the rows to follow the original order in brshallo_gists.

    gists_unnested <- bind_rows(
      r_gists_unnested,
      md_gists_unnested
    ) %>% 
      # got this arranging by a vector trick from SO:
      # https://stackoverflow.com/questions/52216341/how-to-sort-rows-of-a-data-frame-based-on-a-vector-using-dplyr-pipe
      arrange(match(contents, brshallo_gists$contents)) %>% 
      # restore the original gist URLs: for .md snippets the urls column
      # currently holds local temp-file paths
      select(-urls) %>% 
      left_join(brshallo_gists, by = "contents")
    
    gists_unnested %>% 
      DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                filter = 'top',
                escape = FALSE,
                options = list(pageLength = 20))
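The arrange(match(...)) step is what restores the original ordering. A small base R sketch of the same trick, using made-up data, shows how match() turns each value into its position in a reference vector, which can then be sorted on.

```r
# Reference order (made-up file names standing in for brshallo_gists$contents).
ref_order <- c("b.md", "a.R", "c.md")

# Rows bound together in a different order, as after bind_rows().
parsed <- data.frame(
  contents = c("a.R", "a.R", "b.md", "c.md"),
  funs     = c("library", "filter", "t.test", "tidy"),
  stringsAsFactors = FALSE
)

# match() gives each row's position in ref_order; order() sorts rows by it.
positions <- match(parsed$contents, ref_order)
reordered <- parsed[order(positions), ]
reordered$contents
```

Rows that share a file name keep their relative order, because order() leaves ties in their original sequence.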
