Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Use Thor’s hammer to get data scientists ready to work faster than you ever thought possible. Thor’s hammer needs just 5 pieces, and you and your new employees are good to go.
Welcome to your new office! Let’s take a look at your computer, with your supervisor next to you saying:
Here is our working folder. Click through it and you’ll find out what we are doing. There is a list of software tools you’ll need to work with. Please install them. In case of any questions, ask Jamie.
Does this sound familiar to you? So what is so special about this if you are talking to people who use R? Actually nothing. But the text might go like this:
Here is our working folder. Folder A contains some really useful scripts; we call them hammers. Folder B contains some packages we started; we call them scissors. You’ll find some more packages we started on our GitHub; these are the saws. We also have a package server, so better install stuff from there. You can choose the R version you want to work with. Jamie has the most recent one, ask him how he does it. It would be great if you could use RStudio and get some common packages into it; maybe Jamie has a list of his favorite packages. Your most important project is on our GitHub, so please look through the folder structure. In case of any questions, come to me, ask Jamie or check the wiki.
Yeah cool. These guys have a wiki. At least I can look something up. OK, they have three different places for R-packages and nobody knows which environment to work with, but I can handle this. Let’s go to Jamie and start.
When I saw systems like that for the first time, it really drove me crazy. Today’s biostatisticians, data scientists or software developers cost you more than 150 USD/hour. So you basically waste at least two days if your startup structure is bad. The structure described above does not seem bad, but it is. Say you need 30 minutes to get Jamie’s R environment, 3 hours to install it, 5–6 hours to look through package folders, 5 hours or more to read the wiki and 2 hours to get access to GitHub, and you are still missing some system dependencies. That is at least 15.5 hours, which costs you a minimum of 2,325 USD. For this amount you can get a better computer or have free coffee for the whole year. So what can you do about it?
- A pre-set up IDE installation — saves 2 hours
- A standard R environment — saves 3 hours
- A list of tools + a nice tutorial on installation — saves 5–6 hours
- A fixed and standardized folder structure OR standardized project names — saves your life
- A collection of vignettes — saves at least 5 hours
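As a quick check of the 2,325 USD figure above, the estimate can be reproduced in a few lines of R. The vector names are made up for illustration, and the lower bound of each range from the text is used.

```r
# Onboarding effort in hours, taken from the estimate in the text
# (lower bound of each range).
hours <- c(
  get_environment        = 0.5,
  install_environment    = 3,
  browse_package_folders = 5,
  read_wiki              = 5,
  github_access          = 2
)
hourly_rate_usd <- 150

total_hours    <- sum(hours)                    # 15.5 hours
total_cost_usd <- total_hours * hourly_rate_usd # 2325 USD

total_cost_usd
```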
IDE installation
The integrated development environment (IDE) is the place to work for the guys in your office. It contains the connection to the source-code control system, the code editor, the console to run code, basically everything. You do not want somebody to waste time on getting this set up. So please have an install script that
- installs the IDE
- installs the extensions needed for your source code control systems and all links to them
- installs all system components needed to work with this IDE (which you hopefully know from your older projects)
- installs a list of bookmarks to your important folders
- sets up everything in a pre-defined folder or at least the extension repository. You can use a folder like:
C:\company_tools\IDEs\ourIDE
This IDE installation script will set up everybody in your department with the same IDE installed in the same place. So you won’t hear questions like “How do I access GitHub?” and the new people won’t have to call you because it says “It’s not possible to access version control without the following system components: XXX, XXX, XXX”.
In case the IDE crashes on your co-worker’s first day, you know where to look for it. This allows you to check for missing plugins and missing entries in the system PATH. You may think this saves you just 5 minutes, but each of the requests I mentioned takes 5 minutes. Three simple questions, and 15 minutes and 75 USD (0.25 hours * 150 USD/hour * 2 people) are gone.
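A check for missing PATH entries could look roughly like the sketch below. The function name `missing_path_entries` is hypothetical, and the folder in the example is the one suggested above. It only compares strings and does not verify that the folders exist.

```r
# Return the directories from `required_dirs` that are not on the PATH.
# `path` defaults to the PATH of the current R session.
missing_path_entries <- function(required_dirs, path = Sys.getenv("PATH")) {
  entries <- strsplit(path, .Platform$path.sep, fixed = TRUE)[[1]]
  # Ignore trailing slashes or backslashes when comparing
  strip_trailing <- function(x) sub("[/\\\\]+$", "", x)
  required_dirs[!strip_trailing(required_dirs) %in% strip_trailing(entries)]
}

# Example: is the company IDE folder on the PATH of this session?
missing_path_entries("C:\\company_tools\\IDEs\\ourIDE")
```

An empty result means every required folder was found on the PATH.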
A standard R environment
People working with R know how hard it is to share code with a co-worker. They have different package versions installed, they have the R environment in a different folder on their hard drive, they may even have a different R version. All these troubles will occur if you do not use packrat, RStudio Connect or RCloud for each of your projects. And they will definitely occur even if you do. But should this happen on the first day a co-worker starts? Of course not.
So please give them a pre-defined R version with a bunch of packages pre-installed. I recommend having at least 50, maybe 100, of the packages you use on a daily basis installed. Decide on one R version that every co-worker needs to have installed with ~100 packages. Have an install script that installs it in a pre-defined folder. Something like
C:\company_tools\R\R-Versions\R-3.4.0-company
Please store the whole install script and the pre-compiled packages in a company repository. The install process should not take forever, and not everything needs to be recompiled. All people in your office should at least have the same OS, which allows you to store everything pre-compiled. This turns the whole R installation into a copy-and-paste process, which makes it really fast.
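The core of such an install script might look like the following sketch. The repository URL, the package list and all function names are placeholders, not real company settings. It checks the R version and installs only the packages that are still missing.

```r
# Hypothetical company settings; replace with your own values.
required_r_version <- "3.4.0"
company_repo       <- "https://packages.example.com/cran"  # placeholder URL
required_packages  <- c("data.table", "ggplot2", "knitr")  # illustrative subset

# Packages from `required` that are not yet in `installed`.
missing_packages <- function(required,
                             installed = rownames(utils::installed.packages())) {
  setdiff(required, installed)
}

# TRUE if the running R matches the company standard version.
has_standard_r <- function(required = required_r_version,
                           current = getRversion()) {
  current == required
}

# Install only what is missing, from the company repository.
install_standard_packages <- function(required = required_packages,
                                      repos = company_repo) {
  to_install <- missing_packages(required)
  if (length(to_install) > 0) {
    utils::install.packages(to_install, repos = repos)
  }
  invisible(to_install)
}

if (!has_standard_r()) {
  message("Please switch to R ", required_r_version, " before you continue.")
}
```

Calling `install_standard_packages()` afterwards fills the gaps. With pre-compiled packages on the repository this should take minutes rather than hours.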
A list of tools + tutorial
In some R developer teams, people will want you to have a bunch of tools installed. For example, in my group we use:
- MiKTeX
- ImageMagick
- Ghostscript
- LibreOffice
- Pandoc
- Java>1.8
- git
- qpdf
So if it is clear that sooner or later your co-worker will need these tools, please provide the guy with an installation script that downloads the most recent version, or a version you defined, for each tool and installs it.
Sometimes people will need admin rights to install all of these. So instead of requesting them for each single installation, they request them once and install all the stuff you told them to have.
If an install script is not suitable for you, write an entry in your wiki or a package vignette. This will contain all the steps needed to get the tools ready to work.
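Whichever way the tools get installed, a small check helps to confirm that they are really available. The sketch below uses `Sys.which()` to look for one executable per tool. The executable names are assumptions and differ between operating systems, so adjust them to your setup.

```r
# Executable names are assumptions and differ between operating systems
# (for example, Ghostscript is "gs" on Linux/macOS but "gswin64c" on Windows).
required_tools <- c(
  MiKTeX      = "pdflatex",
  ImageMagick = "magick",
  Ghostscript = "gs",
  LibreOffice = "soffice",
  Pandoc      = "pandoc",
  Java        = "java",
  git         = "git",
  qpdf        = "qpdf"
)

# Named logical vector: TRUE if the executable is found on the PATH.
tools_available <- function(tools = required_tools) {
  paths <- Sys.which(tools)
  found <- !is.na(paths) & nzchar(paths)
  stats::setNames(found, names(tools))
}

status <- tools_available()
missing_tools <- names(status)[!status]
if (length(missing_tools) > 0) {
  message("Please install: ", paste(missing_tools, collapse = ", "))
}
```

This only tells you whether an executable is found on the PATH. It does not check versions, for example the Java version.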
A fixed and standardized folder structure OR standardized project names
In R there are a lot of different working styles when it comes to folders and projects. The two major working styles I have observed are:
- people working on github and having RStudio projects
- people working on shared drives/TFS/SVN and having sub-folder structures
1. For people who work with project names, it is important to be able to find a project, even if it was started years ago. Additionally, you do not want to cause conflicts with your co-workers because two projects have the same name. Moreover, a project should not contain all of the code developed in your department. Maybe it needs just 50 lines of code. So please define the following:
- What is the naming convention for your projects e.g. username_task_month_year
- What is the general size of your project e.g. one R-package, one script file, one script file + one data folder
- Who is allowed to work on one project
- Where do you list all projects e.g. a wiki, a specific website, a ticket management system
and the most important part: WRITE IT DOWN and make it available to everybody. If you have decided on these 4 parts, write it down, make it a working policy and kick people’s asses if they do not follow the rules. Else you’ll end up in chaos.
This will really help the new guy on his first day when he has to find one of Jamie’s projects that deals with clustering, as it might be called:
jamie123_clustering_patient_data_january_2017
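A naming convention is easier to enforce if it can be checked automatically. Here is a minimal sketch for the username_task_month_year example, assuming lower-case names, a spelled-out English month and a four-digit year. The function name `is_valid_project_name` is hypothetical.

```r
# Pattern for <username>_<task words>_<month>_<year>, all lower case.
month_pattern <- paste(tolower(month.name), collapse = "|")
project_name_pattern <- paste0(
  "^[a-z0-9]+_",             # username
  "([a-z0-9]+_)+",           # task, one or more words
  "(", month_pattern, ")_",  # month, spelled out in English
  "[0-9]{4}$"                # four-digit year
)

is_valid_project_name <- function(name) {
  grepl(project_name_pattern, name)
}

is_valid_project_name("jamie123_clustering_patient_data_january_2017")  # TRUE
is_valid_project_name("Clustering final v2")                            # FALSE
```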
2. For people who like folders and folder structures, I guess it is a bit easier. You may want to be able to find things after years, too. If you want to standardize the development of R packages, research projects, test collections and report projects, all projects of one type should look the same. A guy who developed one R package in your company should be able to look into a second one and understand it in minutes. So please define the following:
- What shall be the name of a certain project folder e.g. typeOfProject_name_month
- Which sub-folders are needed for an R-package e.g. a change log folder, a releases folder, a test folder, a README.md file
- Do you need separate folders for running projects vs packages? Shall they be stored at different places?
- Are there any folders that need to exist for storing extensions, plugins or libraries?
and the most important part: WRITE IT DOWN and make it available to everybody. If you have decided on these 4 parts, write it down, make it a working policy and kick people’s asses if they do not follow the rules. Else you’ll end up in chaos.
I’m sorry that I was repeating myself, but I really needed to make my point.
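A folder policy is followed more reliably when nobody has to create the folders by hand. The sketch below builds a project folder named typeOfProject_name_month with the sub-folders mentioned above. The function name and the default sub-folder names are assumptions for illustration.

```r
# Create a standard project folder below `root` and return its path.
# Folder names follow the examples in the text; adjust to your own policy.
create_project_skeleton <- function(root, type, name, month,
                                    subfolders = c("changelog", "releases", "tests")) {
  project_dir <- file.path(root, paste(type, name, month, sep = "_"))
  if (dir.exists(project_dir)) {
    stop("Project folder already exists: ", project_dir)
  }
  dir.create(project_dir, recursive = TRUE)
  for (folder in subfolders) {
    dir.create(file.path(project_dir, folder))
  }
  # Start the README with the project name as its heading
  writeLines(paste("#", name), file.path(project_dir, "README.md"))
  invisible(project_dir)
}

# Example: build a package skeleton in a temporary directory
project <- create_project_skeleton(tempfile("projects_"),
                                   "package", "thorshammer", "january")
list.files(project)
```

The function refuses to overwrite an existing folder, so two projects cannot silently end up under the same name.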
Your collection of vignettes
This is a bit R specific. But instead of writing Confluence or wiki entries, I really like the idea of using vignettes + pkgdown. For any development project, wikis are a nice tool to look things up. But inside a wiki your code does not run. If you write vignettes to show your co-workers how certain things have to be done, you can check your own work. The code you write to document your installation scripts, your standard R environment or even your folder structure has to work. Every single line of code can be executed inside the vignette.
Additionally, vignettes are well known to R developers, and they know they can access them via vignette(). Moreover, the pkgdown package allows you to put all of this information on a website. This makes it a wiki again.
I also recommend writing such vignettes about your standard way on “How to build a package”, “How to document a function call in R code”, “How to generate a knitr report with the right design” and so on. If you do all this in a nice and comprehensive way, your co-worker won’t need to talk to Jamie, he’ll read your wiki. Instead of two people working, you’ll just need one.
Of course the new guy should still have a coffee with Jamie to know what’s up 😉
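To lower the barrier for writing such vignettes, a template can be generated from R. The sketch below writes a minimal R Markdown vignette with the usual knitr header. The function name and the file name are hypothetical, and the single code chunk is only a placeholder.

```r
# Write a minimal R Markdown vignette skeleton to `path`.
write_vignette_skeleton <- function(path, title) {
  fence <- strrep("`", 3)  # chunk delimiter, built at run time
  lines <- c(
    "---",
    paste0("title: \"", title, "\""),
    "output: rmarkdown::html_vignette",
    "vignette: >",
    paste0("  %\\VignetteIndexEntry{", title, "}"),
    "  %\\VignetteEngine{knitr::rmarkdown}",
    "  %\\VignetteEncoding{UTF-8}",
    "---",
    "",
    "The chunk below is executed every time the vignette is built.",
    "",
    paste0(fence, "{r}"),
    "R.version.string",
    fence
  )
  writeLines(lines, path)
  invisible(path)
}

vignette_file <- write_vignette_skeleton(
  file.path(tempdir(), "how-to-build-a-package.Rmd"),
  "How to build a package"
)
```

Placed in the vignettes folder of a package, such a file is rebuilt together with the package, and `pkgdown::build_site()` renders it as an article on the package website.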
Something left for pros
If you really want to make it easy for your co-workers, write an R-package that contains all this. It can contain the install scripts, the vignettes, functions for basic processes, a function that builds default folder structures, maybe even functions that start tickets for newbies at the global IT service desk. Put it all in a package and call it: thorshammer
This package will be the most important tool for the first hour of your co-worker!
Final words
I guess if you follow my instructions, the whole process of setting up a co-worker takes an hour or maybe less. You could start the day like this:
Hey, welcome. This is your PC. Please get admin rights, then download the thorshammer package. It will be the one tool you need today. First run the welcome bash script. While the script runs, you can read in the wiki how it sets up your folder structure and the IDE and how you can work with our version control. We have some basic tutorials in thorshammer on how we do things, so please use it and see how we nail things down. After you have nailed down a few planks, please have a coffee with Jamie, who will tell you what’s up next.
Enjoy your coffee.
How to get people R ready in an hour — Thor's Hammer was originally published in Data Driven Investor on Medium, where people are continuing the conversation by highlighting and responding to this story.