[This article was first published on free range statistics - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Motivation
So, there’s been a recent flurry of attention to the stock market. It prompted last week’s post. But it also reminded me to bubble up to higher in my priority list a project to create a database of daily security prices and short positions from open data sources.
Yahoo Finance is an excellent data source and the quantmod R package provides a very convenient way of accessing it, but fundamentally its API is oriented around looking up a single security and examining its price and volume trends. This is fine for many purposes, but for analysis across securities one wants to start with a generic question like “which stocks have had the highest growth rate in price or volume”. This requires a data set of all available security prices, rather than looking them up one at a time.
It is in fact possible to create the dataset I want for these purposes from scratch from original sources in a few hours, but as I’m thinking of using this multiple times I want to persist it somehow to avoid duplicate work and hitting the original data sources more than needed. Caveat – I think this is allowed with the data sources I’m working with but it’s frustratingly difficult to find out for sure. If I’m wrong, let me know and I’ll pull down this post.
A database is the obvious solution. In this blog post I build such a database using SQLite, which works well enough for my single-user use case. At work we’d use SQL Server for a job like this but part of my motivation for today was to push me into a different zone and work with a less familiar tool, to remind myself of which things are idiosyncratic to which database systems.
The data model
For my purposes I am happy with daily data. The data I want to store has an obvious grain of date-security ie one row for each day for each security that I have data for. This implies just two dimensions – date and security (or product, as I call it below, following ASIC’s terminology for their published data on short positions). The facts that I’m interested in are the open, high, low, close and adjusted price for each day; the volume of transactions; the short position; the total float; and the short position as a proportion of the total float. The price and volume data can come from Yahoo Finance, and the short positions data can be downloaded from the Australian Securities and Investment Commission, who collect self-reports from short sellers.
My initial plan was just two tables:
one for the product dimension with unchanging (or slowly changing) information like its ASX ticker, the ticker on Yahoo Finance, its latest name with various alternative approaches to punctuation
one for the facts, with separate columns for the nine facts (open price, volume, etc) listed in the previous paragraph.
I think this is indeed probably the correct structure for a Kimball-style dimensionally modelled analytical datamart. However, it proved unpleasant to write the extract-transform-load for my two data sources into that wide fact table. It would have involved a lot of UPDATE operations to add data from one source to columns for rows that are partly populated by the other). SQLite in particular does not really support UPDATE in combination with a JOIN and getting around this would have been awkward. So to simplify things I normalised the data one step further and made my fact table more “long and thin”. This meant adding a variable dimension so that when extra data comes in from another source I am just adding new rows to the data for a subset of variables rather than filling in empty columns for existing rows.
Here’s how the data model looks:
I used the excellent (and free) DBeaver universal database tool to make that diagram, and to develop any non-trivial SQL code used in this blog.
And this is the SQL that creates it. I’ve used GO (in the style of Microsoft T-SQL) to separate each command, not because SQLite understands it (it doesn’t) but because I have R functions that do, which I’ll be using in a minute.
Post continues below SQL code
Populating with data
I used R for accessing the origin data from the web and sending SQL commands to set up the database. Here’s the first chunk of code that creates the empty database, runs the SQL above to set up tables, and makes an initial dump of ASX listed companies into the d_products table. I adapted some of this and subsequent code from this blog post by Michael Plazzer. I’m not sure how definitive is that list of ASX listed companies referred to in the below code.
Post continues below R code
Loading the short positions data
The short positions data from the ASIC website includes many products that aren’t in the list of listed companies on the ASX site. In general, I want to be able to update my list of products/securities. Getting data from Yahoo Finance, where I specify a security ticker code and then get the data, won’t let me do this (unless I tried codes at random). Because of all this, in my initial bulk upload I do the ASIC data first, hoping (without really checking how it happens, which of course I would for a more formal use) that this will surface new (or old) securities that aren’t in the spreadsheet I downloaded from the ASX.
The code below is in two chunks. It downloads all the CSVs of short positions data from ASIC, taking care not to re-download data it already has. Each CSV represents one day of three facts on each product. Then (somewhat more complex), it reads all the CSVs one at a time (if it hasn’t already processed this particular CSV); identifies missing products/securities which it then adds to the d_products table; then populates the fact table with the facts for all the products it’s found in this particular CSV, having matched them to the product_id field in the d_products table.
Other than the three-card shuffle with adding new products to d_products as it goes, and some annoying complications with different formats and encoding of the CSV files on the ASIC page (see comments in the code), this is fairly straightforward data-wrangling stuff for R.
This took three or four hours each for the two bits of functionality (bulk download and bulk import) to run.
Loading the price and volumes data
Next step is to get some data from Yahoo Finance on price and volumes. Overall, this is more straightforward. The quantmod R package describes the functionality I’m about to use as “essentially a simple wrapper to the underlying Yahoo! finance site’s historical data download”.
I’ve tried in the code below to make this updateable, so in future I can run the same code without downloading all the historical data again. But I haven’t fully tested this; it’s more a working prototype than production-ready code (and I wouldn’t use SQLite for production in this case). But here’s code that works, at least for now. It gets all the Australian security ticker names in the format used by Yahoo (finishing with .AX for the ASX) and their matching product_id values for my database; finds the latest data data is available in the database; downloads anything additional to that; normalises it into long format and uploads it to the fact table in the database.
This took a couple of hours to run (I didn’t time it precisely).
Updating the product dimension with some summary data
The final steps in the extract-transform-load process are some convenience additions to the database. First, I update the d_products table which has a latest_observed column in it with the most recent observation for each product:
Finally, I want to create a wider version of the data, closer to my original idea of a fact table with one row per product-date combination, and nine fact columns (for open price, volume, short position, etc). In another database I would use an indexed or materialized view for this sort of thing, but SQLite doesn’t support that. I tried making a view (basically a stored query) but its performance was too slow. So I created a whole new table that will need to be created from scratch after each update of the data. This isn’t as disastrous as it sounds - an indexed view does something similar in terms of disk space, and it only takes a minute or so to run this. And it is a convenient table to have.
So here’s the final step in this whole data upload and update process, creating that wide table from scratch. Note the clunky (to R or Python users who are used to things like spread() or pivot_wider()) way that SQL pivots a table wide, with that use of the SUM(CASE WHEN ...) pattern. It looks horrible, but it works (so long as you know in advance all the column names you are trying to make in the wider version):
Exploratory analysis
Phew, now for the fun bit. But I’m going to leave substantive analysis of this for another post, as this is already long enough! I’ll just do two things here.
First, let’s look at a summary of how many data points we’ve got in the database
variable
n
number_products
number_dates
open
880775
1873
8491
high
880775
1873
8491
low
880775
1873
8491
close
880775
1873
8491
volume
880775
1873
8491
adjusted
880775
1873
8491
short_positions
1318329
3048
2694
total_product_in_issue
1318326
3047
2694
short_positions_prop
1318204
3046
2694
That all looks as expected. In total we have about 9 million observations. There are many securities with short positions reported to ASIC that I couldn’t find prices and volumes for in Yahoo Finance, which is interesting and worth looking into, but not astonishing.
That table was created with this SQL (and a bit of R sugar around using knitr and kableExtra, not shown):
Finally, some real analysis. What can we do with this database? Here’s an example of the sort of thing that’s possible with this asset that wasn’t earlier. This is an answer to my hypothetical question I started with - what are the most traded and fastest growing (in price) securities on the ASX?
That chart was created with this code, which has three substantive bits:
an SQL query (and R code to send it to the databse) that grabs the data we need, averaged by year for each product, from the wide table defined above
a little function purely to change the upper/lower case status of product names for the chart
ggplot2 code to draw the chart.
La voila. Coming soon in a future blog post - exploring short positions of securities on the ASX.
Related
To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.