`TenK` is an R package aimed at simplifying the collection of SEC 10-K annual reports. It offers the following features:
- Robust scraping and parsing of reports using the rvest package
- Resolves FTP urls to their HTML counterparts, which increases the speed of retrieving the documents and adds a lot of useful metadata.
- Cleans and returns either full reports or just the business description for each report.
This document introduces basic usage of the `TenK` package.
A copy of this documentation is available via R in PDF format. To view it, execute `vignette("TenK")` in your R console.
1. Package information
- Package name: TenK
- Version: 0.01
- Documentation
- Report an issue
1.1 Known issues
`TenK` can correctly scrape approximately 90% of all business descriptions. When scraping fails, it is usually for one of the following reasons:
- The business description has been omitted
- The business description is located somewhere at the end of the document
- The report uses unconventional paragraph styles (this will result in the program being unable to find the description and returning “NA”).
- When the text was extracted from the HTML document, certain paragraphs got “squished” together, which throws off the program (e.g. “part 1 item 1” becomes “part1item1”).
You may observe that words appear to be “squished” together, like so:
This is due to the way `rvest` extracts text from the page and mainly affects paragraph headers. I am looking for a way to fix this.
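The effect can be reproduced on a toy fragment. The sketch below is an illustration only, not code from `TenK`: when `html_text()` is called on a parent node, the text of adjacent block elements is concatenated without a separator. Extracting each paragraph node separately and joining the pieces is one possible workaround; whether it holds up on real 10-K markup is an assumption.

```r
library(rvest)

# Toy HTML fragment standing in for a 10-K section header (illustrative only)
page <- xml2::read_html("<div><p>Part 1</p><p>Item 1</p></div>")

# Text of the whole parent node: adjacent paragraphs are glued together
squished <- html_text(html_nodes(page, "div"))

# Text per paragraph node, joined with an explicit separator
separated <- paste(html_text(html_nodes(page, "p")), collapse = "\n")

print(squished)
print(separated)
```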
1.2 How does TenK work?
The main function in this package, `TenK_process`, takes as its input a URL belonging to a 10-K report. The URL can point either to the FTP or the HTML version of the report. If the user passes an FTP URL, `TenK_process` automatically determines the HTML version and collects useful metadata. If the user passes an HTML URL, `TenK_process` also collects metadata and returns the scraped text. Currently, `TenK_process` returns either the full 10-K report or the business description section.
The figure below schematically outlines this process
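To make the FTP-to-HTML step more concrete, here is a minimal sketch of the URL rewriting involved. It is not the implementation used by `TenK_process`: the helper name `resolve_edgar_url()` is hypothetical, the example URL is made up, and the sketch assumes EDGAR's conventional layout, in which a filing stored at `edgar/data/<CIK>/<accession>.txt` has its filing index page under `Archives/edgar/data/<CIK>/<accession without dashes>/`. Fetching that index page and locating the 10-K document and metadata on it is left out.

```r
# Hypothetical helper (not part of TenK): derive the CIK, the accession
# number and the filing index URL from an EDGAR full-text filing URL.
resolve_edgar_url <- function(url) {
  pattern <- "edgar/data/([0-9]+)/([0-9]{10}-[0-9]{2}-[0-9]{6})\\.txt$"
  m <- regmatches(url, regexec(pattern, url))[[1]]
  if (length(m) != 3) {
    stop("URL does not match edgar/data/<CIK>/<accession>.txt")
  }
  cik <- m[2]
  accession <- m[3]
  # Assumed EDGAR convention: the index folder is the accession number
  # with its dashes removed
  index_url <- sprintf(
    "https://www.sec.gov/Archives/edgar/data/%s/%s/%s-index.htm",
    cik, gsub("-", "", accession, fixed = TRUE), accession
  )
  list(CIK = cik, ARC = accession, Index.url = index_url)
}

# Made-up URL that only illustrates the format
resolved <- resolve_edgar_url(
  "ftp://ftp.sec.gov/edgar/data/1234567/0001234567-16-000001.txt"
)
str(resolved)
```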
2. Installing and loading the package
To install the `TenK` package, execute:
You can then load the package as follows:
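Both steps might look roughly like the sketch below. It assumes the package is installed from a GitHub repository with `devtools`; the repository path is a placeholder that has to be replaced with the location given on the package's documentation page, and the install lines are commented out so that nothing is installed by accident.

```r
# Placeholder (assumption): replace with the actual repository of the package
repo <- "<github-user>/TenK"

if (!requireNamespace("TenK", quietly = TRUE)) {
  # install.packages("devtools")    # uncomment if devtools is missing
  # devtools::install_github(repo)  # uncomment once `repo` is filled in
  message("TenK is not installed; see the commented lines above.")
} else {
  library(TenK)
}
```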
3. In-built data sets
For the years 2013-2016, `TenK` provides datasets containing FTP URLs for each 10-K filing. These can be queried as follows:
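A minimal sketch, assuming the dataset for 2013 is named `filings10K2013` (the name of its help topic) and that the package is installed. The column layout is not assumed here, so the example only inspects the structure.

```r
if (requireNamespace("TenK", quietly = TRUE)) {
  # Load the 2013 filings into the current environment
  data("filings10K2013", package = "TenK")
  str(filings10K2013)             # column names and types
  print(head(filings10K2013, 3))  # first few filings
}
```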
For more information about these data, execute `?filings10K2013` in the R console.
These data were scraped from the ‘master.idx’ file using the following script:
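A rough sketch of such a script follows. It is not necessarily the one used to build the datasets. It assumes the pipe-delimited layout of EDGAR's `master.idx` (a free-text preamble, a header row, a dashed rule, then one filing per line) and that FTP URLs are formed by prefixing the `Filename` column with `ftp://ftp.sec.gov/`. The helper name `parse_master_idx()`, the output column names and the sample rows are all made up for illustration.

```r
# Hypothetical helper: keep the 10-K rows of a master.idx file and
# build an FTP URL for each of them.
parse_master_idx <- function(lines) {
  header <- grep("^CIK\\|Company Name\\|Form Type", lines)[1]
  if (is.na(header)) stop("header row not found")
  body <- tail(lines, -(header + 1))  # drop preamble, header and dashed rule
  idx <- read.table(
    text = body, sep = "|", quote = "", comment.char = "",
    colClasses = "character", stringsAsFactors = FALSE,
    col.names = c("CIK", "company_name", "form_type", "date_filed", "filename")
  )
  tenk <- idx[idx$form_type == "10-K", ]
  tenk$FTPurl <- paste0("ftp://ftp.sec.gov/", tenk$filename)
  tenk
}

# Made-up sample in the master.idx layout
sample_idx <- c(
  "Description:           Master Index of EDGAR Dissemination Feed",
  "Anonymous FTP:         ftp://ftp.sec.gov/edgar/",
  "",
  "CIK|Company Name|Form Type|Date Filed|Filename",
  "--------------------------------------------------------------------------------",
  "1234567|EXAMPLE CORP|10-K|2013-02-20|edgar/data/1234567/0001234567-13-000001.txt",
  "7654321|SAMPLE HOLDINGS INC|8-K|2013-02-21|edgar/data/7654321/0007654321-13-000002.txt"
)

filings <- parse_master_idx(sample_idx)
print(filings)
```

In practice `lines` would come from `readLines()` on a downloaded index file.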
4. Retrieving 10-K reports
To retrieve a report, use the `TenK_process` function. It has the following parameters:
- URL: (character) FTP or HTML url of the 10-K report
- metadata: (boolean) If FALSE, the function will not return any metadata other than the 10-K HTML url and the report. Defaults to TRUE.
- meta_list: (list) List containing the metadata fields you want to query. If empty, all metadata will be returned.
- retrieve: (character) Return either the full report (“ALL”) or just the business description (“BD”)
4.1 Retrieve business description with all metadata
Retrieving all metadata plus the report is straightforward. This is demonstrated in the code block below.
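A sketch of such a call, using only the parameters listed above. The URL is a placeholder rather than a real filing, so the call is skipped until a real URL is substituted and the package is installed.

```r
# Placeholder: substitute the FTP URL of a real 10-K filing,
# for example one taken from the built-in datasets.
ftp_url <- "ftp://ftp.sec.gov/edgar/data/<CIK>/<accession-number>.txt"
is_placeholder <- grepl("<", ftp_url, fixed = TRUE)

if (!is_placeholder && requireNamespace("TenK", quietly = TRUE)) {
  res <- TenK::TenK_process(
    URL       = ftp_url,
    metadata  = TRUE,    # return metadata (the documented default)
    meta_list = list(),  # empty list: return all metadata fields
    retrieve  = "BD"     # business description only
  )
  print(names(res))      # inspect which fields came back
}
```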
You can retrieve a similar result when using a direct HTML url:
As you can see, this query does not return the ‘FTPurl’ field.
The names of these results correspond to the following metadata:
Variable | Description | Optional (yes/no) |
---|---|---|
CIK | Central Index Key (CIK) of the company. CIK numbers are unique identifiers that the SEC assigns to all entities and individuals that file disclosure documents. (source: https://www.sec.gov/investor/pubs/edgarguide.htm) | No |
ARC | SEC accession number. The accession number is a unique number that EDGAR assigns to each submission as the submission is received. You cannot use accession numbers to filter for types of filings. (source: https://www.sec.gov/investor/pubs/edgarguide.htm) | No |
Index.url | Company filings index url. For an example, see: https://goo.gl/n8jUXm | Yes |
company_name | Name of the company | Yes |
filing_date | Date on which the report was filed to the SEC | Yes |
date_accepted | Date/time on which the SEC accepted the report | Yes |
period_report | Fiscal year to which the report belongs. | Yes |
htm10kurl | URL pointing to the HTML version of the report. | Yes |
htm10kinfo | Meta information about the HTML version of the report. Contains file name, report type, file size and file extension | No |
FTPurl | URL pointing to the FTP version of the report | No |
report | Either the business description or the full report | No |
Optional fields can be manually selected/deselected.
4.2 Retrieve business descriptions with selected metadata
If you want to select optional metadata fields, you can do so by passing a list to the ‘meta_list’ parameter. This is demonstrated in the code block below.
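A sketch of such a call follows. How field names are encoded inside `meta_list` is an assumption here (an unnamed list of character strings); check `?TenK_process` for the exact form. The URL is again a placeholder, so the call is skipped until it is replaced.

```r
# Placeholder: substitute the URL of a real 10-K filing
report_url <- "ftp://ftp.sec.gov/edgar/data/<CIK>/<accession-number>.txt"

# Assumption: optional fields are requested by name as character strings
wanted_fields <- list("company_name", "filing_date")

if (!grepl("<", report_url, fixed = TRUE) &&
    requireNamespace("TenK", quietly = TRUE)) {
  res_selected <- TenK::TenK_process(
    URL       = report_url,
    meta_list = wanted_fields,
    retrieve  = "BD"
  )
  print(names(res_selected))
}
```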
4.3 Retrieve business description without metadata
If you don’t desire any metadata, you can turn this off by setting the ‘metadata’ parameter to FALSE:
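For example (placeholder URL again, so the call is skipped until a real one is filled in):

```r
# Placeholder: substitute the URL of a real 10-K filing
bare_url <- "ftp://ftp.sec.gov/edgar/data/<CIK>/<accession-number>.txt"

if (!grepl("<", bare_url, fixed = TRUE) &&
    requireNamespace("TenK", quietly = TRUE)) {
  # metadata = FALSE: only the 10-K HTML URL and the report are returned
  res_bare <- TenK::TenK_process(URL = bare_url, metadata = FALSE, retrieve = "BD")
  print(names(res_bare))
}
```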
5. Storing the results
There are several ways in which you can store the results of the `TenK_process` function. Here, I’ll outline four ways to do this.
5.1 JavaScript Object Notation (JSON)
You can save R lists (this is what `TenK_process` returns) as JSON files:
You can load the data as follows:
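A minimal round trip covering both steps, sketched with the `jsonlite` package (an assumption; other JSON packages would work too). The list below is a made-up stand-in for a `TenK_process` result, so the example runs without network access.

```r
library(jsonlite)

# Stand-in for a TenK_process() result; all values are made up
res <- list(
  CIK          = "1234567",
  company_name = "EXAMPLE CORP",
  filing_date  = "2016-02-20",
  report       = "Item 1. Business. Example Corp makes examples."
)

path <- file.path(tempdir(), "tenk_result.json")

# auto_unbox = TRUE writes length-one vectors as JSON scalars, not arrays
write_json(res, path, auto_unbox = TRUE)

# Read the file back as a plain R list
loaded <- read_json(path)
str(loaded)
```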
5.2 PostgreSQL
PostgreSQL is a stable, fast and flexible SQL database. It can store large text fields, such as a full 10-K report, and scales to terabytes of data.
After installing PostgreSQL, you can use the `RPostgreSQL` package to store and retrieve data.
5.2.1 Creating a table
The first step is to create a table with field names and data types. The example below does this for all metadata. In your own case, you may want to delete some of these fields if you don’t require them.
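A sketch of what such a table definition could look like. The table name `tenk`, the lower-cased column names and the column types are assumptions chosen for illustration; `htm10kinfo` is left out because it holds several values per report and would need to be flattened first. The primary key on `htm10kurl` is an optional design choice that makes the database reject duplicate reports.

```r
# Assumed schema: lower-case names avoid identifier quoting in PostgreSQL
create_sql <- "
  CREATE TABLE IF NOT EXISTS tenk (
    cik           text,
    arc           text,
    index_url     text,
    company_name  text,
    filing_date   date,
    date_accepted timestamp,
    period_report date,
    htm10kurl     text PRIMARY KEY,
    ftpurl        text,
    report        text
  );"

# Hypothetical helper: run the statement on an open DBI connection
create_tenk_table <- function(con) {
  DBI::dbExecute(con, create_sql)
}

# Usage (connection details are placeholders):
# con <- RPostgreSQL::dbConnect(DBI::dbDriver("PostgreSQL"),
#                               dbname = "mydb", host = "localhost",
#                               port = 5432, user = "me", password = "secret")
# create_tenk_table(con)
```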
5.2.2 Writing data to the table
Once you’ve created your PostgreSQL table, you can append data to it by using the `dbWriteTable` function:
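A sketch of the append step. It assumes a table named `tenk` with lower-case column names, so the result list is first turned into a one-row data frame with matching names. The helper names and the sample values are made up.

```r
# Stand-in for a TenK_process() result; all values are made up
res <- list(
  CIK          = "1234567",
  Index.url    = "https://www.sec.gov/placeholder-index.htm",
  company_name = "EXAMPLE CORP",
  htm10kurl    = "https://www.sec.gov/placeholder-10k.htm",
  report       = "Item 1. Business. Example Corp makes examples."
)

# Hypothetical helper: list -> one-row data frame with table-style names
as_tenk_row <- function(res) {
  res$htm10kinfo <- NULL  # nested field, not stored in this sketch
  row <- as.data.frame(res, stringsAsFactors = FALSE)
  names(row) <- gsub(".", "_", tolower(names(row)), fixed = TRUE)
  row
}

# Hypothetical helper: append the row to an existing table
store_record <- function(con, res, table = "tenk") {
  DBI::dbWriteTable(con, table, as_tenk_row(res),
                    append = TRUE, row.names = FALSE)
}

row <- as_tenk_row(res)
str(row)
```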
The `htm10kurl` field effectively functions as a unique ID for each report. As such, it is convenient to use it as a way to check if a given record already exists in a table:
Before you store the record, you can run the check:
If the function returns TRUE, the record already exists. If it returns FALSE, you can go ahead and store the record.
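One way to write such a check is sketched below. The helper names and the table name `tenk` are assumptions. `DBI::sqlInterpolate()` is used so that the URL is quoted safely before it is placed in the query.

```r
# Hypothetical helper: build the lookup query for a given report URL
build_exists_query <- function(con, url, table = "tenk") {
  DBI::sqlInterpolate(
    con,
    paste0("SELECT COUNT(*) AS n FROM ", table, " WHERE htm10kurl = ?url"),
    url = url
  )
}

# Hypothetical helper: TRUE if a record with this URL is already stored
record_exists <- function(con, url, table = "tenk") {
  DBI::dbGetQuery(con, build_exists_query(con, url, table))$n > 0
}

# Usage, given an open connection `con` and a result list `res`:
# if (!record_exists(con, res$htm10kurl)) {
#   # store the record here
# }
```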
5.2.3 Querying data from the database
To query data from the database, you can use `dbReadTable`:
As you can see, the data types (which we set when creating the table) are also imported into R:
5.3 MongoDB
MongoDB is a NoSQL database that excels at storing large documents and unstructured data (i.e. data that does not fit a row/column layout).
After installing MongoDB on your system, you can send and load data using the `rmongodb` package.
5.3.1 Creating a database
You don’t need to explicitly create a MongoDB database; you simply start using it ad hoc. Note that in MongoDB, a namespace is the combination of the database and the collection (a collection is similar to an SQL table).
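As a small sketch: the database name `tenk` and the collection name `reports` are made up, and a MongoDB server on the local machine is assumed. The connection is only attempted when `rmongodb` is installed.

```r
db         <- "tenk"     # hypothetical database name
collection <- "reports"  # hypothetical collection name

# A namespace is written as "<database>.<collection>"
ns <- paste(db, collection, sep = ".")

if (requireNamespace("rmongodb", quietly = TRUE)) {
  mongo <- rmongodb::mongo.create(host = "127.0.0.1")  # assumes a local server
  print(rmongodb::mongo.is.connected(mongo))
}
```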
5.3.2 Storing a record
To store a record in the database, you can use `mongo.insert`:
Once again, it is a good idea to check if the record already exists in the database. You can do this as follows:
You can then call it like this:
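Both steps can be sketched as a pair of small helpers. The helper names are hypothetical, and `mongo` and `ns` stand for an open `rmongodb` connection and a namespace string. The check counts the documents whose `htm10kurl` matches the record about to be stored.

```r
# Hypothetical helper: TRUE if a document with this URL is already stored
mongo_record_exists <- function(mongo, ns, url) {
  query <- rmongodb::mongo.bson.from.list(list(htm10kurl = url))
  rmongodb::mongo.count(mongo, ns, query) > 0
}

# Hypothetical helper: insert the result list unless it is already present
store_if_new <- function(mongo, ns, res) {
  if (mongo_record_exists(mongo, ns, res$htm10kurl)) {
    return(FALSE)  # nothing stored
  }
  rmongodb::mongo.insert(mongo, ns, rmongodb::mongo.bson.from.list(res))
}

# Usage, given a connection `mongo`, a namespace `ns` and a result list `res`:
# store_if_new(mongo, ns, res)
```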
5.3.3 Retrieving a record
To retrieve a record, you can use `mongo.find.one()` or `mongo.find.all()`.
Note that, unlike with PostgreSQL, the data does not automatically have the right data type. This is a drawback of schema-less databases like MongoDB.
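In practice this means converting fields by hand after loading. The sketch below assumes that dates were stored as `YYYY-MM-DD` strings; both helper names and the sample record are made up.

```r
# Hypothetical helper: fetch one record by its report URL as an R list
fetch_record <- function(mongo, ns, url) {
  query <- rmongodb::mongo.bson.from.list(list(htm10kurl = url))
  found <- rmongodb::mongo.find.one(mongo, ns, query)
  if (is.null(found)) return(NULL)  # no matching document
  rmongodb::mongo.bson.to.list(found)
}

# Hypothetical helper: restore types that the database does not enforce
restore_types <- function(rec) {
  rec$filing_date <- as.Date(rec$filing_date)  # assumes "YYYY-MM-DD" strings
  rec
}

# Made-up record as it might come back from the database
raw_rec   <- list(company_name = "EXAMPLE CORP", filing_date = "2016-02-20")
typed_rec <- restore_types(raw_rec)
str(typed_rec)
```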
5.4 RData
An RData file is a flexible and secure way to store R objects in a highly compressed file on disk. You can save your results as follows:
To load the data, execute the following:
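A minimal round trip covering both steps, using base R only. The list is a made-up stand-in for a `TenK_process` result and the file is written to a temporary directory.

```r
# Stand-in for a TenK_process() result; all values are made up
res <- list(
  CIK          = "1234567",
  company_name = "EXAMPLE CORP",
  report       = "Item 1. Business. Example Corp makes examples."
)

path <- file.path(tempdir(), "tenk_result.RData")

save(res, file = path)  # write the object, under its own name, to disk
rm(res)                 # remove it from the workspace

load(path)              # restores the object as `res`
str(res)
```

Note that `load()` restores objects under the names they were saved with, so it can overwrite an existing `res` in your workspace.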