Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A new package isdparser is
on CRAN. isdparser
was in part liberated from rnoaa,
then improved. We'll use isdparser
in rnoaa
soon.
isdparser
does not download files for you from NOAA's ftp servers. The
package focuses on parsing the files, which are variable length ASCII strings
stored line by line, where each line has some mandatory data, and any amount
of optional data.
The data is great, and includes for example, wind speed and direction, temperature, cloud data, sea level pressure, and more. Includes data from approximately 35,000 stations worldwide, though best coverage is in North America/Europe/Australia. Data go all the way back to 1901, and are updated daily.
However, the data is not fun to parse, warranting an packge to deal with the parsing.
Installation
install.packages("isdparser")
If binaries aren't available, try from source:
install.packages("isdparser", type = "source")
or from GitHub:
devtools::install_github("ropenscilabs/isdparser")
library(isdparser) library(dplyr)
Parse individual lines
If you want to parse individual lines, use isd_parse_line()
First, let's get a ISD file. There's a few that come with the package:
path <- system.file('extdata/024130-99999-2016.gz', package = "isdparser")
Read in the file
lns <- readLines(path, encoding = "latin1")
Parse a line
isd_parse_line(lns[1]) #> # A tibble: 1 × 42 #> total_chars usaf_station wban_station date time date_flag #> <dbl> <chr> <chr> <date> <chr> <chr> #> 1 54 024130 99999 2016-01-01 0000 4 #> # ... with 36 more variables: latitude <dbl>, longitude <dbl>, #> # type_code <chr>, elevation <dbl>, call_letter <chr>, quality <chr>, #> # wind_direction <dbl>, wind_direction_quality <chr>, wind_code <chr>, #> # wind_speed <dbl>, wind_speed_quality <chr>, ceiling_height <chr>, #> # ceiling_height_quality <chr>, ceiling_height_determination <chr>, #> # ceiling_height_cavok <chr>, visibility_distance <chr>, #> # visibility_distance_quality <chr>, visibility_code <chr>, #> # visibility_code_quality <chr>, temperature <dbl>, #> # temperature_quality <chr>, temperature_dewpoint <dbl>, #> # temperature_dewpoint_quality <chr>, air_pressure <dbl>, #> # air_pressure_quality <chr>, #> # AW1_present_weather_observation_identifier <chr>, #> # AW1_automated_atmospheric_condition_code <chr>, #> # AW1_quality_automated_atmospheric_condition_code <chr>, #> # N03_original_observation <chr>, N03_original_value_text <chr>, #> # N03_units_code <chr>, N03_parameter_code <chr>, REM_remarks <chr>, #> # REM_identifier <chr>, REM_length_quantity <chr>, REM_comment <chr>
By default you get a tibble back, but you can ask for a list in return instead.
Parsing by line allows the user to decide how to apply parsing across lines,
whether it be lapply
style, or for loop, etc.
Parse entire files
You can also parse entire ISD files.
isd_parse(path) #> # A tibble: 2,601 × 42 #> total_chars usaf_station wban_station date time date_flag #> <dbl> <chr> <chr> <date> <chr> <chr> #> 1 54 024130 99999 2016-01-01 0000 4 #> 2 54 024130 99999 2016-01-01 0100 4 #> 3 54 024130 99999 2016-01-01 0200 4 #> 4 54 024130 99999 2016-01-01 0300 4 #> 5 54 024130 99999 2016-01-01 0400 4 #> 6 39 024130 99999 2016-01-01 0500 4 #> 7 54 024130 99999 2016-01-01 0600 4 #> 8 39 024130 99999 2016-01-01 0700 4 #> 9 54 024130 99999 2016-01-01 0800 4 #> 10 54 024130 99999 2016-01-01 0900 4 #> # ... with 2,591 more rows, and 36 more variables: latitude <dbl>, #> # longitude <dbl>, type_code <chr>, elevation <dbl>, call_letter <chr>, #> # quality <chr>, wind_direction <dbl>, wind_direction_quality <chr>, #> # wind_code <chr>, wind_speed <dbl>, wind_speed_quality <chr>, #> # ceiling_height <chr>, ceiling_height_quality <chr>, #> # ceiling_height_determination <chr>, ceiling_height_cavok <chr>, #> # visibility_distance <chr>, visibility_distance_quality <chr>, #> # visibility_code <chr>, visibility_code_quality <chr>, #> # temperature <dbl>, temperature_quality <chr>, #> # temperature_dewpoint <dbl>, temperature_dewpoint_quality <chr>, #> # air_pressure <dbl>, air_pressure_quality <chr>, #> # AW1_present_weather_observation_identifier <chr>, #> # AW1_automated_atmospheric_condition_code <chr>, #> # AW1_quality_automated_atmospheric_condition_code <chr>, #> # N03_original_observation <chr>, N03_original_value_text <chr>, #> # N03_units_code <chr>, N03_parameter_code <chr>, REM_remarks <chr>, #> # REM_identifier <chr>, REM_length_quantity <chr>, REM_comment <chr>
Optionally, you can print progress:
isd_parse(path, progress = TRUE) #> # A tibble: 2,601 × 42 #> total_chars usaf_station wban_station date time date_flag #> <dbl> <chr> <chr> <date> <chr> <chr> #> 1 54 024130 99999 2016-01-01 0000 4 #> 2 54 024130 99999 2016-01-01 0100 4 #> 3 54 024130 99999 2016-01-01 0200 4 #> 4 54 024130 99999 2016-01-01 0300 4 #> 5 54 024130 99999 2016-01-01 0400 4 #> 6 39 024130 99999 2016-01-01 0500 4 #> 7 54 024130 99999 2016-01-01 0600 4 #> 8 39 024130 99999 2016-01-01 0700 4 #> 9 54 024130 99999 2016-01-01 0800 4 #> 10 54 024130 99999 2016-01-01 0900 4 #> # ... with 2,591 more rows, and 36 more variables: latitude <dbl>, #> # longitude <dbl>, type_code <chr>, elevation <dbl>, call_letter <chr>, #> # quality <chr>, wind_direction <dbl>, wind_direction_quality <chr>, #> # wind_code <chr>, wind_speed <dbl>, wind_speed_quality <chr>, #> # ceiling_height <chr>, ceiling_height_quality <chr>, #> # ceiling_height_determination <chr>, ceiling_height_cavok <chr>, #> # visibility_distance <chr>, visibility_distance_quality <chr>, #> # visibility_code <chr>, visibility_code_quality <chr>, #> # temperature <dbl>, temperature_quality <chr>, #> # temperature_dewpoint <dbl>, temperature_dewpoint_quality <chr>, #> # air_pressure <dbl>, air_pressure_quality <chr>, #> # AW1_present_weather_observation_identifier <chr>, #> # AW1_automated_atmospheric_condition_code <chr>, #> # AW1_quality_automated_atmospheric_condition_code <chr>, #> # N03_original_observation <chr>, N03_original_value_text <chr>, #> # N03_units_code <chr>, N03_parameter_code <chr>, REM_remarks <chr>, #> # REM_identifier <chr>, REM_length_quantity <chr>, REM_comment <chr>
There's a parallel option as well, coming in handy with the larger ISD files:
isd_parse(path, parallel = TRUE)
Visualize the data
Make better date + time
df <- res %>% rowwise() %>% mutate( datetime = as.POSIXct(strptime(paste(date, paste0(substring(time, 1, 2), ":00:00")), "%Y-%m-%d %H:%M:%S")) ) %>% ungroup
viz
# removing some outliers (obs, look into more for serious use) library(ggplot2) ggplot(df[df$temperature < 100,], aes(datetime, temperature)) + geom_point() + theme_grey(base_size = 18)
Future work
I plan to improve performance via profiling and swapping out slower code for faster, as well as possibly dropping down to C++.
There was already a featur request for asking for fields of interest instead of getting all fields, so that's on the list.
Do try out isdparser
. Let us know of any bugs, and any feature requests!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.