Working with Statistics Canada Data in R, Introduction
Forward to Working with Statistics Canada Data in R, Part 1.
This is the Introduction to the series on working with Statistics Canada data in the R language. The goal of the series is to provide examples, accompanied by detailed explanations, of working with Statistics Canada data in R. I’d also love to see more economists, policy analysts, and social scientists using R in their work, so I’ll do my best to make this easy for people without STEM degrees.
Data Types
Statistics Canada data is routinely used for economic and policy analysis, as well as for social science research, journalism, and many other applications. Readers are expected to have some basic R skills.
For the purposes of this series, let’s assume that there are three main types of StatCan data:
- Statistics Canada Data, previously known as Canadian Socio-economic Information Management System (CANSIM),
- National census data, and
- Geographic data provided in a multitude of formats that can be used by GIS software: ArcGIS shapefiles (.shp), Geography Markup Language files (.gml), MapInfo files (.tab), etc.
The “Working with Statistics Canada Data in R” series will thus consist of this Introduction and three main parts, one for each data type.
This is not an official classification of data types available from Statistics Canada. The classification into CANSIM, census, and geographic data is for convenience only, and is loosely based on the key tools used for StatCan data retrieval and processing in R.
The Tools You Need
To be more specific, cansim is the package designed to retrieve CANSIM data, and cancensus is the package for retrieving census data. Further data processing will be done with the tidyverse meta-package (a package that installs and loads a collection of other packages), which offers some of the most powerful data manipulation tools currently available. GIS data is a more complex matter, but at the very minimum you will need the sf, tmap, and units packages. Just like the R language itself, all of these are free and open source. I am not in any way associated with the authors of any of the above packages; I just use them a lot in my work.
Note that although CANSIM has recently been renamed to Statistics Canada Data, I will be using the historical name CANSIM throughout this series to distinguish the data obtained from Statistics Canada Data proper from other kinds of StatCan data, i.e. census and geographic data (see how confusing this can get?).
Finally, here’s the code that installs the minimum suite of packages required to run the examples in this series. Note that you might be unable to install sf and units right away, since they depend on system libraries that are usually not available “out of the box”. More on sf and units installation in the upcoming “Working with Statistics Canada Geospatial Data” post.
install.packages(c("cansim", "cancensus", "tidyverse", "tmap"))
# install.packages(c("sf", "units"))
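If you rerun the setup often, it may help to check which of these packages are already present before installing anything. Below is a minimal sketch of that idea using only base R. The helper name missing_packages is hypothetical (it is not part of any of the packages above), and the split into "core" and "geo" groups simply mirrors the two install lines.

```r
# Packages used in this series. sf and units are kept separate because
# they depend on system libraries that may need to be installed first.
core_pkgs <- c("cansim", "cancensus", "tidyverse", "tmap")
geo_pkgs  <- c("sf", "units")

# Hypothetical helper: return the subset of `pkgs` that is not installed.
# requireNamespace() returns FALSE (without an error) when a package
# cannot be found; as a side effect it loads the namespaces it does find.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Report what is missing; nothing is installed at this point.
message("Missing core packages: ",
        paste(missing_packages(core_pkgs), collapse = ", "))
message("Missing geospatial packages: ",
        paste(missing_packages(geo_pkgs), collapse = ", "))

# To install only what is missing, uncomment the next line:
# install.packages(missing_packages(core_pkgs))
```

The install call is left commented out on purpose, so that running the sketch only reports the state of your library and does not change it.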
Continue to Working with Statistics Canada Data in R, Part 1.
The post Working with Statistics Canada Data in R, Introduction appeared first on Data Enthusiast's Blog.