repoRter.nih: a convenient R interface to the NIH RePORTER Project API
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
The US National Institute of Health (NIH) received funding of approximately $42 billion in fiscal year 2022; $31 billion (72%) of this was awarded by the NIH in the form of research grant funding to hospitals, medical colleges, non-profits, businesses, and other organizations based in the U.S. and abroad.[https://nexus.od.nih.gov/all/2021/04/21/fy-2020-by-the-numbers-extramural-investments-in-research] The NIH maintains a publicly available database called “RePORTER” to track this substantial flow of grant funding and makes it available to the public via a web-based query interface as well as an API.
“The NIH RePORTER APIs is designed to programmatically expose relevant scientific awards data from both NIH and non-NIH federal agencies for the consumption of project teams or external 3rd party applications to support reporting, data analysis, data integration or to satisfy other business needs as deemed pertinent.”
–NIH RePORTER v2 API Documentation
This data can have significant value for many audiences, including researchers, investors, industry, watchdogs/public advocates, and R users. But constructing queries and retrieving results programmatically involves some coding overhead which can be a challenge for those not familiar with RESTful APIs and JSON; it takes some effort even for those who are. The repoRter.nih
package aims to simplify this task for the typical analyst scripting in R.
Getting Started
Installation
This package (latest stable release) can be installed from CRAN the usual way:
install.packages("repoRter.nih")
The current dev version can be installed from github, on the dev
branch:
devtools::install_github('bikeactuary/repoRter.nih@dev')
I welcome R developers more capable than myself to collaborate on improving the source code, documentation, and unit testing in this package.
Basic Workflow
library(repoRter.nih)
The make_req()
method is used to generate a valid JSON request object. The req can subsequently be passed to the RePORTER Project API and results retrieved via the get_nih_data()
method.
Generating the request:
# all projects funded by the Paycheck Protection Act, Coronavirus Response and # Relief Act, and American Rescue Plan, in fiscal year 2021 req <- make_req(criteria = list(fiscal_years = 2021, covid_response = c("C4", "C5", "C6"))) ## This is your JSON payload: ## { ## "criteria": { ## "fiscal_years": [ ## 2021 ## ], ## "covid_response": [ ## "C4", ## "C5", ## "C6" ## ], ## "use_relevance": false, ## "include_active_projects": false, ## "exclude_subprojects": false, ## "multi_pi_only": false, ## "newly_added_projects_only": false, ## "sub_project_only": false ## }, ## "offset": 0, ## "limit": 500 ## } ## ## If you receive a non-200 API response, compare this formatting (boxes, braces, quotes, etc.) to the 'Complete Payload' schema provided here: ## https://api.reporter.nih.gov/?urls.primaryName=V2.0#/Search/post_v2_projects_search
Sending the request and retrieving results:
res <- get_nih_data(req) ## Retrieving first page of results (up to 500 records) class(res) ## [1] "tbl_df" "tbl" "data.frame"
A tibble is returned containing 43 columns. This data is not flat – several columns are nested data.frame
s and list
s (of variable length vectors and data.frame
s of varying height).
res %>% glimpse(width = getOption("cli.width")) ## Rows: 251 ## Columns: 43 ## $ appl_id <int> 10255113, 10425707, 10403857, 10258548, 10439~ ## $ subproject_id <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~ ## $ fiscal_year <int> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 202~ ## $ project_num <chr> "3P20GM104417-07S1", "3P20GM104417-08S1", "3R~ ## $ project_serial_num <chr> "GM104417", "GM104417", "ES028615", "ES028615~ ## $ organization <df[,17]> <data.frame[26 x 17]> ## $ award_type <chr> "3", "3", "3", "3", "7", "3", "1", "3", "~ ## $ activity_code <chr> "P20", "P20", "R01", "R01", "U01", "R01", "R0~ ## $ award_amount <int> 1115953, 681188, 300000, 1609765, 877287, 348~ ## $ is_active <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TR~ ## $ project_num_split <df[,7]> <data.frame[26 x 7]> ## $ principal_investigators <list> [<data.frame[1 x 7]>], [<data.frame[1 x 7]>],~ ## $ contact_pi_name <chr> "ADAMS, ALEXANDRA K.", "ADAMS, ALEXANDRA K~ ## $ program_officers <list> [<data.frame[1 x 4]>], [<data.frame[1 x 4]>]~ ## $ agency_ic_admin <df[,3]> <data.frame[26 x 3]> ## $ agency_ic_fundings <list> [<data.frame[1 x 5]>], [<data.frame[1 x 5]>],~ ## $ cong_dist <chr> "MT-00", "MT-00", "IL-01", "IL-01", "MA-08",~ ## $ spending_categories <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~ ## $ project_start_date <chr> "2020-11-17T05:00:00Z", "2021-09-01T04:00:00~ ## $ project_end_date <chr> "2022-08-31T04:00:00Z", "2022-10-31T04:00:00Z~ ## $ organization_type <df[,3]> <data.frame[26 x 3]> ## $ full_foa <chr> "PA-20-135", "PAR-18-264", "PA-20-272", "PA-2~ ## $ full_study_section <df[,6]> <data.frame[26 x 6]> ## $ award_notice_date <chr> "2020-11-17T05:00:00Z", "2021-09-21T04:00:00Z~ ## $ is_new <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL~ ## $ mechanism_code_dc <chr> "RC", "RC", "RP", "RP", "RP", "RP", "RP", ~ ## $ core_project_num <chr> "P20GM104417", "P20GM104417", "R01ES028615", ~ ## $ terms <chr> "Adult ; 21+ years old ; Adult Human ; adu~ ## $ pref_terms <chr> "2019-nCoV;Adult;Affect;Agricultural Workers;~ ## $ abstract_text <chr> "Project Summary\nThe COVID-19 pandemic has d~ ## $ project_title <chr> "Center for American Indian and Rural Health ~ ## $ phr_text <chr> "Project Narrative\nWorking with our Latino c~ ## $ spending_categories_desc <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~ ## $ agency_code <chr> "NIH", "NIH", "NIH", "NIH", "NIH", "NIH", "NI~ ## $ covid_response <list> "C4", "C6", "C6", "C4", "C4", "C6", "C6", "C6~ ## $ arra_funded <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", ~ ## $ budget_start <chr> "2020-11-17T05:00:00Z", "2021-09-01T04:00:00Z~ ## $ budget_end <chr> "2022-08-31T04:00:00Z", "2022-10-31T04:00:00Z~ ## $ cfda_code <chr> "310", "859", "113", "310", "310", "855", "85~ ## $ funding_mechanism <chr> "Research Centers", "Research Centers", "Non~ ## $ direct_cost_amt <int> 1006607, 616778, 297064, 1560464, 569403, 207~ ## $ indirect_cost_amt <int> 109346, 64410, 2936, 49301, 307884, 140955, 1~ ## $ project_detail_url <chr> "https://reporter.nih.gov/project-details/102~
Criteria-Field Translation
A dataset (nih_fields
) is provided with this package to assist in translating between field names used in the payload criteria
, column names in the return data, and field names used in the include_fields
, exclude_fields
, and sort_field
arguments.
data("nih_fields") nih_fields %>% print ## # A tibble: 43 x 5 ## payload_name response_name include_name return_class mod_ind ## <chr> <chr> <chr> <chr> <int> ## 1 appl_ids appl_id ApplId integer 1 ## 2 <NA> subproject_id SubprojectId character 0 ## 3 fiscal_years fiscal_year FiscalYear integer 1 ## 4 project_nums project_num ProjectNum character 1 ## 5 serial_num project_serial_num ProjectSerialNum character 1 ## 6 <NA> organization Organization data.frame 0 ## 7 award_types award_type AwardType character 1 ## 8 activity_codes activity_code ActivityCode character 1 ## 9 award_amount_range award_amount AwardAmount integer 1 ## 10 include_active_projects is_active IsActive logical 1 ## # ... with 33 more rows
Some fields can not be used as filtering criteria
– these will show NA
in the payload_name
column.
Generating Requests
Most of the detail (and function documentation) is around the many parameters available in RePORTER to filter/search project records. Let's get into some of the capabilities.
Default Request
If no arguments are supplied, the default behavior of make_req()
is to generate a request for all projects funded in fiscal_years = lubridate::year(Sys.Date())
. Limiting requests to a single year is often necessary (depending on additional filtering criteria used) due to a RePORTER restriction that a maximum of 10K records may be returned from any result set. There are currently ~2.6M projects in the database going back to fiscal year 1985, and each fiscal year tends to have 70-100K projects, so the 10K limit can be restrictive to the user wanting a broad search.
req <- make_req() ## This is your JSON payload: ## { ## "criteria": { ## "fiscal_years": [ ## 2022 ## ], ## "use_relevance": false, ## "include_active_projects": false, ## "exclude_subprojects": false, ## "multi_pi_only": false, ## "newly_added_projects_only": false, ## "sub_project_only": false ## }, ## "offset": 0, ## "limit": 500 ## } ## ## If you receive a non-200 API response, compare this formatting (boxes, braces, quotes, etc.) to the 'Complete Payload' schema provided here: ## https://api.reporter.nih.gov/?urls.primaryName=V2.0#/Search/post_v2_projects_search
The method prints a helpful message to the console in addition to returning the JSON. Set message = FALSE
if you wish to suppress this message.
Limiting Data Retrieved
You can limit both the width and height of the result set retrieved from the API.
Fields
We probably will not need to fetch every field every time. The include_fields
argument is provided to specify a limited set of fields to be returned. Alternatively, fields may be excluded using exclude_fields
.
Records (projects)
This package provides the ability to retrieve only a limited number of result pages via the max_pages
argument. This can be useful for developing/testing your queries (and for reducing time to render package documentation). Each page has a record count equal to limit
– so setting max_pages = 5
with the default limit = 500
(the maximum permitted by RePORTER) in make_req()
will result in up to 2,500 total records returned.
Ex. 1 – Limiting results and selecting fields
data("nih_fields") fields <- nih_fields %>% filter(response_name %in% c("appl_id", "subproject_id", "project_title", "fiscal_year", "award_amount", "is_active", "project_start_date")) %>% pull(include_name) req <- make_req(include_fields = fields, limit = 500, message = FALSE) # default res <- get_nih_data(query = req, max_pages = 1) ## Retrieving first page of results (up to 500 records) ## max_pages set to 1 by user. Result set contains 32 pages. Only partial results will be retrieved. res %>% glimpse(width = getOption("cli.width")) ## Rows: 500 ## Columns: 7 ## $ appl_id <int> 10335890, 10292904, 10400390, 10330448, 10356098, 1~ ## $ subproject_id <chr> "5790", NA, NA, NA, NA, "7956", NA, NA, "8318", NA,~ ## $ fiscal_year <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 202~ ## $ award_amount <int> 244676, NA, 38717, 47003, 440201, 162193, 347296, 3~ ## $ is_active <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~ ## $ project_start_date <chr> "2020-02-28T05:00:00Z", "2018-10-01T04:00:00Z", "20~ ## $ project_title <chr> "Project 2: Investigating the role of PAH exposures~
Some Vanilla Criteria
Many criteria are passed as vectors within the criteria
list argument. We will cover some of the most useful examples:
Ex. 2 – Organization search
We can refine our query results by providing filtering criteria to make_req()
, and by extension to the API. Suppose we want all currently active projects, funded in fiscal years 2017 through 2021, with a specific organization in mind (though we don't know exactly how its name will appear in RePORTER):
req <- make_req(criteria = list( fiscal_years = 2010:2011, include_active_projects = TRUE, org_names = c("Yale", "New Haven") ), include_fields = c("Organization", "FiscalYear", "AwardAmount"), message = FALSE)
Here we are asking for any orgs containing the strings “yale” or “new haven” (ignoring case) – there are implied wildcards on either end of the strings we provide. This is the same as org_name LIKE '%yale%' OR org_name LIKE '%new haven%'
in a SQL WHERE clause.
res <- get_nih_data(req, max_pages = 1) ## Retrieving first page of results (up to 500 records) ## max_pages set to 1 by user. Result set contains 8 pages. Only partial results will be retrieved. res %>% glimpse(width = getOption("cli.width")) ## Rows: 500 ## Columns: 3 ## $ fiscal_year <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 202~ ## $ organization <df[,17]> <data.frame[26 x 17]> ## $ award_amount <int> 484894, 421533, 329765, 308745, 209375, 178489, 15000~
Notice the column organization
is a nested data frame – it has 17 columns and always a single record. Setting flatten_result = TRUE
in the call to get_nih_data()
will flatten all such return fields, prefixing the original field name and returning with clean names (see janitor::clean_names()
).
res <- get_nih_data(req, max_pages = 1, flatten_result = TRUE) ## Retrieving first page of results (up to 500 records) ## max_pages set to 1 by user. Result set contains 8 pages. Only partial results will be retrieved. res %>% glimpse(width = getOption("cli.width")) ## Rows: 500 ## Columns: 19 ## $ fiscal_year <int> 2022, 2022, 2022, 2022, 2022, 2022, 202~ ## $ award_amount <int> 484894, 421533, 329765, 308745, 209375,~ ## $ organization_org_name <chr> "YALE UNIVERSITY", "YALE UNIVERSITY", "~ ## $ organization_city <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~ ## $ organization_country <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~ ## $ organization_org_city <chr> "NEW HAVEN", "NEW HAVEN", "NEW HAVEN", ~ ## $ organization_org_country <chr> "UNITED STATES", "UNITED STATES", "UNIT~ ## $ organization_org_state <chr> "CT", "CT", "CT", "CT", "CT", "CT", "CT~ ## $ organization_org_state_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~ ## $ organization_dept_type <chr> "INTERNAL MEDICINE/MEDICINE", "INTERNAL~ ## $ organization_fips_country_code <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~ ## $ organization_org_duns <chr> "043207562", "043207562", "043207562", ~ ## $ organization_org_ueis <chr> "FL6GV84CKN57", "FL6GV84CKN57", "FL6GV8~ ## $ organization_primary_duns <chr> "043207562", "043207562", "043207562", ~ ## $ organization_primary_uei <chr> "FL6GV84CKN57", "FL6GV84CKN57", "FL6GV8~ ## $ organization_org_fips <chr> "US", "US", "US", "US", "US", "US", "US~ ## $ organization_org_ipf_code <chr> "9420201", "9420201", "9420201", "94202~ ## $ organization_org_zipcode <chr> "065208327", "065208327", "065208327", ~ ## $ organization_external_org_id <int> 9420201, 9420201, 9420201, 9420201, 942~
Most users will prefer the flattened format above. It looks like Yale is busy, but it is not the only org matching our search.
res %>% group_by(organization_org_name) %>% summarise(project_count = n()) ## # A tibble: 2 x 2 ## organization_org_name project_count ## <chr> <int> ## 1 UNIVERSITY OF NEW HAVEN 1 ## 2 YALE UNIVERSITY 499
The org_names_exact_match
criteria can be used as an alternative when we know the exact org name as it appears in RePORTER, if we want only that org's projects returned.
Ex. 3 – Geographic search
We can also filter projects by the geographic location (country/state/city) of the applicant organization.
## A valid request but probably not what we want req <- make_req(criteria = list( fiscal_years = 2010:2011, include_active_projects = TRUE, org_cities = "New Haven", org_states = "WY" ), include_fields = c("Organization", "FiscalYear", "AwardAmount"), message = FALSE ## suppress printed message ) res <- get_nih_data(req, max_pages = 5, flatten_result = TRUE) ## Retrieving first page of results (up to 500 records) ## Done - 0 records returned. Try a different search criteria.
Multiple criteria are usually connected by logical “AND” – there are no orgs based in the city of New Haven in Wyoming state (because it doesn't exist.)
req <- make_req(criteria = list( fiscal_years = 2015, include_active_projects = TRUE, org_states = "WY" ), include_fields = c("ApplId", "Organization", "FiscalYear", "AwardAmount"), sort_field = "AwardAmount", sort_order = "desc", message = FALSE) res <- get_nih_data(req, flatten_result = TRUE) ## Retrieving first page of results (up to 500 records) res %>% glimpse(width = getOption("cli.width")) ## Rows: 98 ## Columns: 20 ## $ appl_id <int> 8884461, 8898483, 10147717, 10201479, 1~ ## $ fiscal_year <int> 2015, 2015, 2021, 2021, 2021, 2021, 201~ ## $ award_amount <int> 4957554, 3521553, 3418046, 2638712, 200~ ## $ organization_org_name <chr> "WYOMING STATE DEPARTMENT OF HEALTH", "~ ## $ organization_city <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~ ## $ organization_country <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~ ## $ organization_org_city <chr> "CHEYENNE", "LARAMIE", "LARAMIE", "LARA~ ## $ organization_org_country <chr> "UNITED STATES", "UNITED STATES", "UNIT~ ## $ organization_org_state <chr> "WY", "WY", "WY", "WY", "WY", "WY", "WY~ ## $ organization_org_state_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~ ## $ organization_dept_type <chr> NA, "PHARMACOLOGY", "PHARMACOLOGY", "VE~ ## $ organization_fips_country_code <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~ ## $ organization_org_duns <chr> "809915796", "069690956", "069690956", ~ ## $ organization_org_ueis <chr> "JP1QRJYYJG73", "FDR5YF2K32X5", "FDR5YF~ ## $ organization_primary_duns <chr> "809915796", "069690956", "069690956", ~ ## $ organization_primary_uei <chr> "JP1QRJYYJG73", "FDR5YF2K32X5", "FDR5YF~ ## $ organization_org_fips <chr> "US", "US", "US", "US", "US", "US", "US~ ## $ organization_org_ipf_code <chr> "9408801", "9412601", "9412601", "94126~ ## $ organization_org_zipcode <chr> "820020001", "820712000", "820712000", ~ ## $ organization_external_org_id <int> 9408801, 9412601, 9412601, 9412601, 941~
Why are there projects from more recent years than 2015? Because the include_active_projects
flag adds in active projects that match all criteria aside from fiscal_years
(this appears to be the intended behavior by RePORTER).
Ex. 3 – Coronavirus/Covid-19 research
We already provided one example of this search criteria above. Let's mix it up and request all Covid response projects.
## all projects funded by the Paycheck Protection Act, Coronavirus Response and Relief Act, ## and American Rescue Plan, in fiscal year 2021 req <- make_req(criteria = list(covid_response = c("All")), include_fields = nih_fields %>% filter(payload_name %in% c("award_amount_range", "covid_response")) %>% pull(include_name)) ## This is your JSON payload: ## { ## "criteria": { ## "covid_response": [ ## "All" ## ], ## "use_relevance": false, ## "include_active_projects": false, ## "exclude_subprojects": false, ## "multi_pi_only": false, ## "newly_added_projects_only": false, ## "sub_project_only": false ## }, ## "include_fields": [ ## "AwardAmount", ## "CovidResponse" ## ], ## "offset": 0, ## "limit": 500 ## } ## ## If you receive a non-200 API response, compare this formatting (boxes, braces, quotes, etc.) to the 'Complete Payload' schema provided here: ## https://api.reporter.nih.gov/?urls.primaryName=V2.0#/Search/post_v2_projects_search res <- get_nih_data(req) ## Retrieving first page of results (up to 500 records) ## Retrieving results 501 to 1000 of 2572 ## Retrieving results 1001 to 1500 of 2572 ## Retrieving results 1501 to 2000 of 2572 ## Retrieving results 2001 to 2500 of 2572 ## Retrieving results 2501 to 2572 of 2572 res$covid_response %>% class() ## [1] "list" res$covid_response[[1]] ## [1] "Reg-CV"
covid_response
is a nested list (with character vectors of variable length) within the return tibble. We can use flatten_result = TRUE
here – elements of each vector will be collapsed to a single string delimited by “;”, massaging the list to a single character vector.
## all projects funded by the Paycheck Protection Act, Coronavirus Response and Relief Act, ## and American Rescue Plan, in fiscal year 2021 req <- make_req(criteria = list(covid_response = c("All")), message = FALSE) res <- get_nih_data(req, flatten_result = TRUE) ## Retrieving first page of results (up to 500 records) ## Retrieving results 501 to 1000 of 2572 ## Retrieving results 1001 to 1500 of 2572 ## Retrieving results 1501 to 2000 of 2572 ## Retrieving results 2001 to 2500 of 2572 ## Retrieving results 2501 to 2572 of 2572 unique(res$covid_response) ## [1] "Reg-CV" "CV" "C3" "C4" "C6" "C6;Reg-CV" "C5" ## [8] "C5;Reg-CV" "C4;Reg-CV"
Some projects are being funded from multiple sources. Summarizing all Covid-related project awards:
library(ggplot2) res %>% left_join(covid_response_codes, by = "covid_response") %>% mutate(covid_code_desc = case_when(!is.na(fund_src) ~ paste0(covid_response, ": ", fund_src), TRUE ~ paste0(covid_response, " (Multiple)"))) %>% group_by(covid_code_desc) %>% summarise(total_awards = sum(award_amount) / 1e6) %>% ungroup() %>% arrange(desc(covid_code_desc)) %>% mutate(prop = total_awards / sum(total_awards), csum = cumsum(prop), ypos = csum - prop/2 ) %>% ggplot(aes(x = "", y = prop, fill = covid_code_desc)) + geom_bar(stat="identity") + geom_text_repel(aes(label = paste0(dollar(total_awards, accuracy = 1, suffix = "M"), "\n", percent(prop, accuracy = .01)), y = ypos), show.legend = FALSE, nudge_x = .8, size = 3, color = "grey25") + coord_polar(theta ="y") + theme_void() + theme(legend.position = "right", legend.title = element_text(colour = "grey25"), legend.text = element_text(colour="blue", size=6, face="bold"), plot.title = element_text(color = "grey25"), plot.caption = element_text(size = 6)) + labs(caption = "Data Source: NIH RePORTER API v2") + ggtitle("Legislative Source for NIH Covid Response Project Funding")
A second dataset is provided to translate the covid_response
codes; it includes both the long-form and a shorter version of the source name.
data("covid_response_codes") covid_response_codes %>% print ## # A tibble: 6 x 3 ## covid_response funding_source fund_src ## <chr> <chr> <chr> ## 1 Reg-CV NIH regular appropriations funding NIH Reg Appropriations ## 2 CV Coronavirus Preparedness and Response Supplemental App~ Coronav Prep & Repons~ ## 3 C3 CARES Act (Coronavirus Aid, Relief, and Economic Secur~ CARES Act ## 4 C4 Paycheck Protection Program and Health Care Enhancemen~ PPP & Health Care Enh~ ## 5 C5 Coronavirus Response and Relief Supplemental Appropria~ Coronav Response & Re~ ## 6 C6 American Rescue Plan Act of 2021 American Rescue Plan
Additional Resources
The full vignette contains a few more advanced examples, including boolean search functionality on the project title, abstract, and tagged project terms. It also includes an example which allows the user to retrieve complete result sets above the 10,000 record limit through by applying some basic statistics and a little programming.
The RePORTER web interface and official API documentation are useful for getting familiar with available search parameters
h/t to Chris whose code on github was all I could find existing in R and served as a starting point for this work
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.