readr::problems() returns tidy data!
A handy little trick I picked up today when using readr.
Some background: I needed a mapping between ZIP Code Tabulation Areas (ZCTAs) and counties (to link to some urban/rural data). The Census Bureau provides a CSV-style table that includes information about each ZCTA (e.g., size, population, area by land/water type) and the FIPS codes for the state and county.
However, when I load that data using the readr package:
library(tidyverse)

zcta_to_county_mapping <- read_csv("http://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt") %>%
  select(ZCTA5, STATE, COUNTY) %>%
  mutate(STATE = as.numeric(STATE), COUNTY = as.numeric(COUNTY))
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   ZCTA5 = col_character(),
##   COUNTY = col_character(),
##   COAREA = col_double(),
##   COAREALAND = col_double(),
##   ZPOPPCT = col_double(),
##   ZHUPCT = col_double(),
##   ZAREAPCT = col_double(),
##   ZAREALANDPCT = col_double(),
##   COPOPPCT = col_double(),
##   COHUPCT = col_double(),
##   COAREAPCT = col_double(),
##   COAREALANDPCT = col_double()
## )
## See spec(...) for full column specifications.
## Warning: 1592 parsing failures.
##  row        col   expected     actual
## 1303 ZAREA      an integer 3298386447
## 1303 ZAREALAND  an integer 3032137295
## 1304 AREAPT     an integer 2429735568
## 1304 AREALANDPT an integer 2262437812
## 1304 ZAREA      an integer 3298386447
## .... .......... .......... ..........
## See problems(...) for more details.
It produces a warning. Looking at the few rows it returned, it seems likely that the errors are coming from overflow: read_csv() guessed that the variable was of type int (4 bytes, max value of \(2^{31} - 1\) or 2,147,483,647), but some of these values are huge. I looked up a few of them and saw that they were all occurring in large, unpopulated areas. One of them (ZIP code 04462) is described by UnitedStatesZipCodes.org as covering “an extremely large land area compared to other ZIP codes in the United States.”
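To see the overflow in isolation, here is a minimal base R sketch using one of the values from the warning above. The variable names are mine, and this only illustrates the integer range limit; it is not a claim about how readr parses values internally.

```r
# Largest value R's 32-bit integer type can hold: 2^31 - 1
int_max <- .Machine$integer.max

# One of the ZAREA values reported in the parsing failures above
zarea <- "3298386447"

# Converting to integer fails: R returns NA (and warns about the range)
as_int <- suppressWarnings(as.integer(zarea))

# A double holds the value without trouble
as_dbl <- as.numeric(zarea)

print(int_max)
print(as_int)
print(as_dbl > int_max)
```

The value is well above the integer maximum, which is consistent with readr reporting that it expected "an integer" for that cell.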
So that seems like the source of the issue, but there were 1,592 failures! I want to make sure those failures never affect the variables that I’m interested in. I noticed the warning message says to use problems() to see more details. I did as I was told, expecting something about as useful as the results of warnings(), but was pleased to get back a tbl_df!
Checking to make sure the errors didn’t affect my variables of interest (ZCTA5, STATE and COUNTY) was as easy as:
problems(zcta_to_county_mapping) %>%
  filter(col %in% c("ZCTA5", "STATE", "COUNTY"))
## # A tibble: 0 × 4
## # ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>
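The same check can be tried without downloading anything. The sketch below uses a made-up two-row table (the first row's ZAREA value is invented for illustration) and forces ZAREA to integer so that a failure is guaranteed. One assumption to flag: depending on the readr version, the `col` field of problems() may hold a column name or a column position, so the sketch normalises it to names before comparing. If your version returns positions, a filter on names like the one above would match nothing, so it is worth checking what `col` contains.

```r
library(readr)

# Hypothetical two-row table shaped like the Census file; the second
# ZAREA value (taken from the warning above) is too large for an integer
csv_text <- "ZCTA5,ZAREA\n00601,116410\n04462,3298386447\n"

# Force ZAREA to integer so the overflow shows up as a parsing failure
small <- read_csv(
  I(csv_text),
  col_types = cols(ZCTA5 = col_character(), ZAREA = col_integer())
)

probs <- problems(small)

# Assumption: `col` may be a name or a position depending on the readr
# version, so convert positions to column names before comparing
affected <- if (is.numeric(probs$col)) names(small)[probs$col] else probs$col

# Tally failures per column, then check the column of interest
print(table(affected))
print(any(affected == "ZCTA5"))
```

Declaring the large columns as col_double() in col_types instead would likely avoid the failures altogether, at the cost of writing out the specification by hand.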
I love when tools make life easier! Even the error handling returns tidy data!