I needed an offline copy of an economic calendar covering all of the major international economic events. After grubbing around the internet I found the Economic Calendar on Myfxbook, which had everything that I needed.
Here’s a screenshot of that calendar.
This seemed like a good candidate for some simple web scraping. However, the big orange button at the bottom was an indication of some minor challenges afoot. If you wanted to get the full calendar then you’d need to press this button repeatedly to retrieve additional pages of data. And, for the purpose of web scraping, this would need to be automated.
In addition there are a couple of modal dialogs that pop up on the page when you first visit the site. I’d like to get those out of the way too.
Choice of Tools
I had to choose between either (1) diagnosing the API behind the site or (2) running a browser tool to automate interaction with the site.
After taking a look at the network requests going back and forth between my browser and the server I concluded that the second approach would be best. My preferred tools for this are either Selenium or Playwright. Recently I have been leaning towards the latter.
Implementation
The scraper script (implemented in Python) ultimately consisted of a few components:
- spin up Playwright and launch a Chromium instance;
- navigate to the calendar URL;
- deal with the various popups on initial page launch (probably not strictly necessary but good to emulate real user interaction);
- keep on smashing (with liberal pauses) the button until all data has been retrieved; and
- parse the resulting table, then dump to CSV.
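The steps above could be sketched roughly as follows. This is a minimal illustration rather than the actual scraper: the URL and all CSS selectors are hypothetical placeholders (the post does not give them), and the helper name `click_while_visible` is invented for this example. The same helper is reused for closing popups and for pressing the "more" button, since both amount to clicking something until it goes away.

```python
import csv

# Hypothetical placeholders: the real URL and selectors are not given in the post.
CALENDAR_URL = "https://example.com/economic-calendar"
MORE_BUTTON = "#show-more"      # assumed selector for the big orange button
POPUP_CLOSE = ".modal .close"   # assumed selector for modal close buttons
TABLE_ROWS = "table tbody tr"   # assumed selector for the calendar rows


def click_while_visible(page, selector, pause_ms=1000, max_clicks=500):
    """Click the first match of `selector` until it is gone or hidden.

    Returns the number of clicks made. `max_clicks` guards against an
    element that never disappears.
    """
    clicks = 0
    while clicks < max_clicks:
        target = page.locator(selector)
        if target.count() == 0 or not target.first.is_visible():
            break
        target.first.click()
        page.wait_for_timeout(pause_ms)  # liberal pause between clicks
        clicks += 1
    return clicks


def scrape(url=CALENDAR_URL, out_path="calendar.csv"):
    """Load the full calendar in headless Chromium and dump the rows to CSV."""
    # Imported here so the helper above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Get the modal dialogs out of the way, then load every page of data.
        click_while_visible(page, POPUP_CLOSE, pause_ms=500, max_clicks=5)
        click_while_visible(page, MORE_BUTTON)
        # One list of cell texts per table row (no header row is written here).
        rows = [row.locator("td").all_inner_texts()
                for row in page.locator(TABLE_ROWS).all()]
        browser.close()

    with open(out_path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return len(rows)
```

Calling `scrape()` with real selectors would write one CSV row per table row. The click cap and the pause length are arbitrary choices for the sketch and would need tuning against the live site.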
After developing and testing I deployed this as a job that’s run daily via GitLab CI/CD.
Results
The resulting CSV file contains all of the columns shown in the screenshot above. Here, for example, I load the file into R and display the first 20 records.
    library(dplyr)

    calendar <- read.csv("calendar.csv") |>
      rename(iso = currency) |>
      mutate(date = strptime(date, "%Y-%m-%d %H:%M:%S")) |>
      select(-previous, -consensus, -actual)

    head(calendar, n = 20)

       date                iso event                                 impact
     1 2024-10-02 00:00:00 CNY National Day Golden Week              None
     2 2024-10-02 00:01:00 AUD CoreLogic Dwelling Prices MoM (Sep)   None
     3 2024-10-02 05:00:00 JPY Consumer Confidence (Sep)             High
     4 2024-10-02 07:00:00 EUR Unemployment Change (Sep)             High
     5 2024-10-02 07:00:00 EUR Tourist Arrivals YoY (Aug)            Low
     6 2024-10-02 07:15:00 EUR ECB Guindos Speech                    High
     7 2024-10-02 08:00:00 EUR Unemployment Rate (Aug)               High
     8 2024-10-02 09:00:00 EUR Retail Sales YoY (Aug)                Low
     9 2024-10-02 09:00:00 EUR Unemployment Rate (Aug)               Low
    10 2024-10-02 09:00:00 EUR Unemployment Rate (Aug)               High
    11 2024-10-02 09:00:00 GBP 5-Year Treasury Gilt Auction          Low
    12 2024-10-02 09:30:00 EUR 10-Year Bund Auction                  Medium
    13 2024-10-02 09:30:00 EUR ECB Lane Speech                       Low
    14 2024-10-02 09:45:00 EUR ECB Buch Speech                       Low
    15 2024-10-02 10:00:00 EUR Unemployment Rate (Sep)               Low
    16 2024-10-02 10:10:00 EUR 3-Month Bill Auction                  Low
    17 2024-10-02 10:10:00 EUR 6-Month Bill Auction                  Low
    18 2024-10-02 10:30:00 EUR Budget Balance (Aug)                  Low
    19 2024-10-02 11:00:00 USD MBA Mortgage Refinance Index (Sep/27) Low
    20 2024-10-02 11:00:00 USD MBA Purchase Index (Sep/27)           Low
In the interests of brevity I have omitted the previous, consensus, and actual columns; however, these are included in the CSV data. You can slice and dice these data as required. For example, here are the high impact events during the first two trading days of November 2024.
    calendar |>
      filter(
        impact == "High",
        date >= "2024-11-01",
        date < "2024-11-05"
      )

       date                iso event                              impact
     1 2024-11-01 01:45:00 CNY Caixin Manufacturing PMI (Oct)     High
     2 2024-11-01 08:00:00 EUR Unemployment Rate (Oct)            High
     3 2024-11-01 08:30:00 CHF procure.ch Manufacturing PMI (Oct) High
     4 2024-11-01 09:00:00 EUR S&P Global Manufacturing PMI (Oct) High
     5 2024-11-01 09:30:00 GBP S&P Global Manufacturing PMI (Oct) High
     6 2024-11-01 12:30:00 USD Nonfarm Payrolls Private (Oct)     High
     7 2024-11-01 12:30:00 USD U-6 Unemployment Rate              High
     8 2024-11-01 12:30:00 USD Non Farm Payrolls (Oct)            High
     9 2024-11-01 12:30:00 USD Unemployment Rate (Oct)            High
    10 2024-11-01 13:30:00 CAD S&P Global Manufacturing PMI (Oct) High
    11 2024-11-01 13:45:00 USD S&P Global Manufacturing PMI (Oct) High
    12 2024-11-01 14:00:00 USD ISM Manufacturing PMI (Oct)        High
    13 2024-11-04 08:15:00 EUR HCOB Manufacturing PMI (Oct)       High
    14 2024-11-04 08:45:00 EUR HCOB Manufacturing PMI (Oct)       High
    15 2024-11-04 08:50:00 EUR HCOB Manufacturing PMI (Oct)       High
    16 2024-11-04 08:55:00 EUR HCOB Manufacturing PMI (Oct)       High
    17 2024-11-04 09:00:00 EUR HCOB Manufacturing PMI (Oct)       High
    18 2024-11-04 22:00:00 AUD Judo Bank Services PMI (Oct)       High
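Since the scraper itself is written in Python, the same slice can also be taken there. The sketch below uses only the standard library and assumes the CSV has a header row with the column names used above (date, currency, event, impact, previous, consensus, actual) and the same timestamp format. The function name `high_impact_events` is invented for this example.

```python
import csv
from datetime import datetime


def high_impact_events(lines, start, end):
    """Return rows with impact == "High" and start <= date < end.

    `lines` is any iterable of CSV lines (for example an open file)
    whose first line is the header row.
    """
    selected = []
    for row in csv.DictReader(lines):
        when = datetime.strptime(row["date"], "%Y-%m-%d %H:%M:%S")
        if row["impact"] == "High" and start <= when < end:
            selected.append(row)
    return selected
```

Opening calendar.csv with `open("calendar.csv", newline="")` and passing the file handle together with `datetime(2024, 11, 1)` and `datetime(2024, 11, 5)` should select the same window as the R filter above.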
The CSV file with these data can be downloaded here and will be updated daily.