This post talks about my workflow for getting started with a new data analysis project using the ProjectTemplate package.
Overview of ProjectTemplate
ProjectTemplate is an R Package which facilitates data analysis, encourages good data analysis habits, and standardises many data analytic steps. After many years of refining a data analysis workflow in R, I realised that I’d basically converged on something similar to ProjectTemplate anyway. However, my approach was not quite as systematic, and it took more effort than necessary to get started on a new project. Thus, since late 2013, I’ve been using ProjectTemplate to organise my R data analysis projects.
While I have found ProjectTemplate to be an excellent tool, I realised that when I created a new data analysis project based on ProjectTemplate, I was repeatedly making a large number of customisations to the initial set of files and folders. Thus, I’ve now set up a repository to store these customisations so that I can get started on a new data analysis project more efficiently. The purpose of this post is to document these modifications.
This post assumes a reasonable knowledge of R and ProjectTemplate. If you’re not familiar with ProjectTemplate, you could check out the ProjectTemplate website, focusing particularly on the Getting Started section. If you’re really keen, you could also watch an hour-long video on ProjectTemplate, RStudio, and GitHub.
General setup
I have a copy of my customised version of the ProjectTemplate directory and file structure on GitHub in the AnglimModifiedProjectTemplate repository. Specifically, it has:
- modifications to `global.dcf`, as described below
- a blank `readme.md`
- a couple of directories removed that I don’t use (e.g., `diagnostics`, `logs`, `profiling`)
- an initial `rmd` file in the `reports` directory with the customisations mentioned below
- an `.Rproj` RStudio project file to enable easy launching of RStudio
- an additional `output` directory for storing tabular, text, and other output
Thus, whenever I want to start a new data analysis project, I can download and extract the zip file of the repository on GitHub.
After creating a project folder, the following steps can then be skipped when using my customised template.
- Open RStudio and create an RStudio project in the existing directory
- Create the `ProjectTemplate` folder structure with `library(ProjectTemplate); create.project()`
- Move the ProjectTemplate files into the folder
- Modify `global.dcf`
- Set up rmd reports
I also document below a few additional points about subsequent steps including:
- Setting up the data directory
- Updating the readme file
- Setting up the git repository
Modifying global.dcf
My preferred starting `global.dcf` settings are
data_loading: on
cache_loading: off
munging: on
logging: off
load_libraries: on
libraries: psych, lattice, Hmisc
as_factors: off
data_tables: off
A little explanation:
- `as_factors`: I do quite a bit of string processing, particularly on data and on output tables. I find the automatic conversion of strings into factors to be a really annoying feature. Thus, setting this to `off` is my preferred setting.
- `load_libraries`: I always have additional libraries, so it makes sense to have this `on`.
- `libraries`: There are many common packages that I use, but I almost always make use of the above comma-separated list of packages.
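As a quick sanity check, settings like those above can be parsed with base R's `read.dcf()`. The following is a minimal sketch, assuming the settings are written one per line; the temporary file and the variable names (`dcf_path`, `config`, `packages`) are illustrative and not part of ProjectTemplate.

```r
# Illustrative only: write the settings to a temporary file and parse them.
settings <- c(
  "data_loading: on",
  "cache_loading: off",
  "munging: on",
  "logging: off",
  "load_libraries: on",
  "libraries: psych, lattice, Hmisc",
  "as_factors: off",
  "data_tables: off"
)
dcf_path <- tempfile(fileext = ".dcf")
writeLines(settings, dcf_path)

# read.dcf() returns a character matrix with one column per field.
config <- read.dcf(dcf_path)

# Split the comma-separated package list into a character vector.
packages <- trimws(strsplit(config[1, "libraries"], ",")[[1]])
print(packages)
```

Checking the parsed values this way can catch a mistyped field name before it silently falls back to a default.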
Setup rmd files
Basics of such files
I generally create a couple of `rmd` files in the `reports` directory (if you’re unfamiliar with RMarkdown, see this earlier post on RMarkdown). The first line in the first chunk is always:
```{r}
library(ProjectTemplate); load.project()
```
This loads everything required to get started with the project.
RMarkdown in reports
In ProjectTemplate, you would typically store RMarkdown documents in the `reports` directory. However, if you then try to compile that file in RStudio, you will find that RStudio treats the directory that contains the RMarkdown file as the working directory. In order to ensure that the working directory is the same as the project directory, add the following text to the top of your RMarkdown file.
`r opts_knit$set(root.dir='..')`
Explanation
- A backtick followed by r, closed by another backtick, delimits inline R code; these general RMarkdown options need to be in this format and not in a standard RMarkdown code chunk.
- `opts_knit$set()` is the knitr function for setting these general options.
- `'..'` sets the working directory one level above the default, i.e. to the parent of the directory that contains the RMarkdown file.
Setup data folder
ProjectTemplate automatically names resulting data.frames with a name based on the file name. This is convenient. However, it is often the case that the file names need to be changed from some raw data supplied, or it may be that the original data format is not perfectly suited for importing. In that case, I store the raw data in a separate folder called `raw-data` and then export or create a copy in the desired format with the desired name in the `data` folder.
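The raw-data step can be sketched in base R as follows. This is only an illustration under stated assumptions: the project root is a temporary directory, and the file names (`Export 2014-03 FINAL.csv`, `survey.csv`) are hypothetical stand-ins for whatever was supplied and whatever name you want the resulting data.frame to have.

```r
# Illustrative project root in a temporary directory (hypothetical paths).
project_dir <- file.path(tempdir(), "example-project")
raw_dir <- file.path(project_dir, "raw-data")
data_dir <- file.path(project_dir, "data")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for a raw file supplied with an awkward name.
raw_file <- file.path(raw_dir, "Export 2014-03 FINAL.csv")
write.csv(data.frame(id = 1:3, score = c(10, 12, 9)), raw_file,
          row.names = FALSE)

# Copy into data/ under the name the data.frame should be based on.
clean_file <- file.path(data_dir, "survey.csv")
copied <- file.copy(raw_file, clean_file, overwrite = TRUE)
```

Keeping the untouched original in `raw-data` means the renaming step can always be repeated from scratch.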
Overriding default data import options
Some data files cannot be imported using the default data import rules. Of course, you can change the file to comply with the rules. Alternatively, I think the standard solution is to add a file in the `lib` directory (e.g., `data-override.r`) that imports the data files. Give the imported data file the same name that ProjectTemplate would.
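A minimal sketch of what such an override might contain is shown below. The file `responses.txt` and its semicolon-delimited, decimal-comma layout are hypothetical; here the file is created in a temporary directory so that the example runs on its own, whereas in a real project the `read.table()` call would point at the actual file.

```r
# Stand-in for a file the default import rules would not handle
# (hypothetical example: semicolon-delimited with decimal commas).
raw_file <- file.path(tempdir(), "responses.txt")
writeLines(c("id;score", "1;3,5", "2;4,0"), raw_file)

# In a real project this call would live in lib/data-override.r.
# The object name matches the file name, as ProjectTemplate's would.
responses <- read.table(raw_file, header = TRUE, sep = ";", dec = ",",
                        stringsAsFactors = FALSE)
str(responses)
```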
Update readme
I rename the file to `README.md` to make it clear that it is a markdown formatted file. I can then add a little information about the project.
Setup git repository
If using GitHub, I create a new repository on GitHub.
Output folder
A common workflow for me is to generate tables, text, and figure output from the script, which is then incorporated into a manuscript document. While I really like Sweave and RMarkdown, I often find it more practical to write a manuscript in Microsoft Word. I use the `output` folder to store tabular output, standard text output, and figures.
In the case of tabular output, there is the task of ensuring the table is formatted appropriately (e.g., desired number of decimal places, cell alignment, cell borders, cell merging, etc.). I typically find this easiest to do in Excel. Thus, I have a file called `output-processing.xlsx`. I import the tabular data into this file and apply relevant formatting. This can then be incorporated into the manuscript. Here are a few more notes about table conversion in MS Word.
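Writing a table into the output folder might look like the sketch below. It assumes a temporary directory as the project root and uses the built-in `mtcars` data purely as a placeholder; the file name `group-means.csv` is illustrative.

```r
# Illustrative output directory under a temporary project root.
output_dir <- file.path(tempdir(), "example-project", "output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

# Placeholder table: mean mpg by number of cylinders.
group_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
group_means$mpg <- round(group_means$mpg, 2)

# Write a plain CSV that can be imported into a spreadsheet for formatting.
table_file <- file.path(output_dir, "group-means.csv")
write.csv(group_means, table_file, row.names = FALSE)
```

A CSV keeps the script output separate from the formatting, so the table can be regenerated without redoing the layout by hand from scratch.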