Historical Weather Data
I’m building a model which requires historical weather data from a selection of locations in South Africa. In this post I demonstrate the process of acquiring the data and doing some simple processing.
I need data for three locations: Brookes and Goje (in KwaZulu-Natal) and Hlangalane (in the Eastern Cape).
    # A tibble: 3 × 4
      name       region          lat   lon
      <chr>      <chr>         <dbl> <dbl>
    1 Brookes    KwaZulu-Natal -29.6  29.8
    2 Goje       KwaZulu-Natal -28.3  31.2
    3 Hlangalane Eastern Cape  -31.0  28.6
Here are those locations on a map. They are sufficiently far apart that we would expect them to have different weather histories.
Data Acquisition
I’m getting the data using Weather API. The business plan gives me access to data going back to the beginning of 2010. I like to mix things up, so I’ll hit the API from Python and then use R to do the processing.
The API key is stored in an environment variable.
    import os

    API_KEY = os.getenv("WEATHER_API_KEY")
Define the date range.
    import pandas as pd

    DATE_MIN = "2021-08-01"
    DATE_MAX = "2022-08-01"

    DATES = pd.date_range(start=DATE_MIN, end=DATE_MAX)
Create a function for retrieving the data and writing it to a file. There will be one JSON file per location and date.
    import re
    import time

    import requests

    def weather_history(name, region):
        location = name + ", " + region
        # Lower-case slug for the file name, e.g. "goje-kwazulu-natal".
        slug = re.sub("[, ]+", "-", location.lower())
        for date in DATES:
            date = date.date()
            URL = f"http://api.weatherapi.com/v1/history.json?key={API_KEY}&q={location}&dt={date}"
            response = requests.get(URL)
            with open(f"{date}-{slug}.json", "wt") as fid:
                fid.write(response.text)
            # Pause between requests.
            time.sleep(5)
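The location string contains a comma and a space, so it may be safer to URL-encode the query than to interpolate it directly into the URL. The sketch below builds the same request URL using only the standard library. The helper name build_history_url is hypothetical (it is not part of the original code) and "YOUR_KEY" is a placeholder; the endpoint and the key, q and dt parameters are taken from the function above.

```python
from urllib.parse import urlencode

# Endpoint as used in weather_history() above.
BASE_URL = "http://api.weatherapi.com/v1/history.json"


def build_history_url(api_key, location, date):
    """Return the history URL with URL-encoded query parameters.

    Hypothetical helper for illustration; not part of the original post.
    """
    query = urlencode({"key": api_key, "q": location, "dt": date})
    return f"{BASE_URL}?{query}"


# "YOUR_KEY" is a placeholder, not a real API key.
url = build_history_url("YOUR_KEY", "Goje, KwaZulu-Natal", "2021-08-01")
print(url)
```

When using requests, passing the same dictionary via the params argument of requests.get() should give equivalent encoding.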
Now retrieve the data.
weather_history("Goje", "KwaZulu-Natal")
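It's worth estimating how long this call takes before running it. A rough calculation, assuming the date range and the five second pause defined above, and ignoring the time taken by the requests themselves:

```python
from datetime import date

DATE_MIN = date(2021, 8, 1)
DATE_MAX = date(2022, 8, 1)
SLEEP_SECONDS = 5  # Pause used in weather_history().

# Both end points are included, as with pd.date_range().
n_requests = (DATE_MAX - DATE_MIN).days + 1
min_minutes = n_requests * SLEEP_SECONDS / 60

print(n_requests)   # 366
print(min_minutes)  # 30.5
```

So each location needs 366 requests and at least half an hour of waiting.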
Repeat for the other locations.
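One way to do this is to loop over the three locations from the table at the top of the post. This is a minimal sketch: LOCATIONS, location_slug() and download_all() are names introduced here for illustration. The slug logic mirrors weather_history(), so the file names can be previewed without calling the API.

```python
import re

# Names and regions from the table of locations.
LOCATIONS = [
    ("Brookes", "KwaZulu-Natal"),
    ("Goje", "KwaZulu-Natal"),
    ("Hlangalane", "Eastern Cape"),
]


def location_slug(name, region):
    """Build the file-name slug in the same way as weather_history()."""
    location = name + ", " + region
    return re.sub("[, ]+", "-", location.lower())


def download_all(fetch, locations=LOCATIONS):
    """Call fetch(name, region) for each location, in order."""
    for name, region in locations:
        fetch(name, region)


# Preview the slugs without touching the API.
slugs = [location_slug(name, region) for name, region in LOCATIONS]
print(slugs)
```

Calling download_all(weather_history) would then retrieve the data for all three locations in turn.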
Data Processing
We’ll need a function for loading the JSON data into R. The data are nested, so we’ll include some code to unwrap and rectangle the data.
    library(jsonlite)
    # Provides %>%, as_tibble(), select(), mutate(), map_dfr() and unnest().
    library(tidyverse)

    prepare_weather <- function(path) {
      weather <- read_json(path)

      weather$location %>%
        as_tibble() %>%
        # Drop time fields that relate to data acquisition (download) time.
        select(-starts_with("localtime")) %>%
        mutate(
          hours = weather$forecast$forecastday %>%
            map_dfr(function(day) {
              map_dfr(day$hour, function(hour) {
                hour$condition <- NULL
                hour
              })
            }) %>%
            select(-ends_with("epoch")) %>%
            select(-matches("_(mph|f|in|miles)$")) %>%
            select(-matches("^(will_it|chance_of)_")) %>%
            rename_with(~ sub("_c$", "", .), matches("_c$")) %>%
            rename(precip = precip_mm, pressure = pressure_mb) %>%
            list()
        )
    }
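The column handling in prepare_weather() drops epoch, imperial and forecast-probability fields, then strips unit suffixes. To make those rules concrete, here is a rough Python sketch that applies the same patterns to a list of column names. The function clean_columns() is hypothetical, and the list of raw names is an assumed subset of the hourly fields (inferred from the patterns in the R code), not a complete listing of what the API returns.

```python
import re


def clean_columns(names):
    """Apply the drop and rename rules from prepare_weather() to column names."""
    kept = [
        name for name in names
        if not name.endswith("epoch")
        and not re.search(r"_(mph|f|in|miles)$", name)
        and not re.search(r"^(will_it|chance_of)_", name)
    ]
    # Explicit renames take priority; otherwise strip a trailing "_c".
    renames = {"precip_mm": "precip", "pressure_mb": "pressure"}
    return [renames.get(name, re.sub(r"_c$", "", name)) for name in kept]


# Assumed subset of hourly field names (illustrative only).
raw = [
    "time_epoch", "time", "temp_c", "temp_f", "wind_mph", "wind_kph",
    "wind_dir", "pressure_mb", "pressure_in", "precip_mm", "precip_in",
    "humidity", "cloud", "will_it_rain", "chance_of_rain",
]
cleaned = clean_columns(raw)
print(cleaned)
```

What survives is a set of metric columns with short names.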
Let’s read the data for Goje on 1 August 2021.
    (goje <- prepare_weather("2021-08-01-goje-kwazulu-natal.json"))

    # A tibble: 1 × 7
      name  region        country        lat   lon tz_id               hours
      <chr> <chr>         <chr>        <dbl> <dbl> <chr>               <list>
    1 Goje  KwaZulu-Natal South Africa -28.3  31.2 Africa/Johannesburg <tibble>
The hours list column contains the hourly weather data. Let’s take a quick look. We’ll pull out only a few columns that are relevant to the model (there are many more!).
    goje %>%
      unnest(cols = hours) %>%
      # Use appropriate time zone when converting to date/time type.
      mutate(time = as.POSIXct(time, "%Y-%m-%d %H:%M", tz = unique(tz_id))) %>%
      select(time, temp, wind_kph, wind_dir, pressure, precip, humidity, cloud)

    # A tibble: 24 × 8
       time                 temp wind_kph wind_dir pressure precip humidity cloud
       <dttm>              <dbl>    <dbl> <chr>       <dbl>  <dbl>    <int> <int>
     1 2021-08-01 00:00:00  16.6     17.3 NNE          1026      0       78     0
     2 2021-08-01 01:00:00  16.2     16.7 NNE          1025      0       76     0
     3 2021-08-01 02:00:00  15.9     16.1 NNE          1025      0       74     0
     4 2021-08-01 03:00:00  15.5     15.5 N            1024      0       71     0
     5 2021-08-01 04:00:00  15.6     15   N            1024      0       68     1
     6 2021-08-01 05:00:00  15.6     14.5 N            1024      0       64     2
     7 2021-08-01 06:00:00  15.7     14   N            1023      0       61     2
     8 2021-08-01 07:00:00  16.9     13.3 N            1023      0       55     5
     9 2021-08-01 08:00:00  18.2     12.6 N            1023      0       50     7
    10 2021-08-01 09:00:00  19.4     11.9 N            1023      0       45     9
    11 2021-08-01 10:00:00  21.4     11.5 NNE          1023      0       42     9
    12 2021-08-01 11:00:00  23.5     11.2 NNE          1022      0       38     9
    13 2021-08-01 12:00:00  25.5     10.8 NE           1021      0       35     8
    14 2021-08-01 13:00:00  25.6     11.8 NE           1020      0       38     6
    15 2021-08-01 14:00:00  25.6     12.7 NE           1019      0       40     3
    16 2021-08-01 15:00:00  25.7     13.7 ENE          1018      0       43     0
    17 2021-08-01 16:00:00  24.5     13.8 ENE          1018      0       48     0
    18 2021-08-01 17:00:00  23.3     13.9 NE           1018      0       53     0
    19 2021-08-01 18:00:00  22.1     14   NE           1018      0       58     0
    20 2021-08-01 19:00:00  21.1     12   ENE          1019      0       60     0
    21 2021-08-01 20:00:00  20.1     10   E            1019      0       61     0
    22 2021-08-01 21:00:00  19.1      7.9 ESE          1020      0       63     0
    23 2021-08-01 22:00:00  19.2      8.8 SSE          1021      0       64     0
    24 2021-08-01 23:00:00  19.2      9.6 SSW          1021      0       65     0
We’ll wrap up with a few plots of daily aggregated data. First the total daily precipitation.
And finally the daily temperature (the solid line is the average and the ribbon gives the range).
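The aggregation behind plots like these is simple: group the hourly records by date, sum the precipitation, then take the mean, minimum and maximum temperature. The processing in this post is done in R, so treat the following as a rough Python sketch of the same idea rather than the code used for the plots. The daily_summary() function is hypothetical and the sample input is four of the hourly rows for Goje shown earlier.

```python
from collections import defaultdict
from statistics import mean


def daily_summary(records):
    """Aggregate hourly (time, temp, precip) records by calendar date."""
    by_day = defaultdict(list)
    for time, temp, precip in records:
        day = time[:10]  # "YYYY-MM-DD" prefix of the time stamp.
        by_day[day].append((temp, precip))

    summary = {}
    for day, values in sorted(by_day.items()):
        temps = [temp for temp, _ in values]
        summary[day] = {
            "precip_total": sum(precip for _, precip in values),
            "temp_mean": mean(temps),
            "temp_min": min(temps),
            "temp_max": max(temps),
        }
    return summary


# Four of the hourly rows for Goje on 1 August 2021.
hours = [
    ("2021-08-01 00:00", 16.6, 0),
    ("2021-08-01 06:00", 15.7, 0),
    ("2021-08-01 12:00", 25.5, 0),
    ("2021-08-01 18:00", 22.1, 0),
]
summary = daily_summary(hours)
print(summary["2021-08-01"])
```

The minimum and maximum give the edges of the ribbon, while the mean gives the solid line.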
These data are going to be particularly useful for our models.