Configuring Azure and RStudio for text analysis
I just finished teaching Computer-Assisted Content Analysis at the IQMR summer school at Syracuse. With three lectures and three labs, the problem every year is getting the right R packages onto people’s machines. In particular, anything that involves compilation – and when you’re using quanteda, readtext, and stm, that’s lots of things – is going to be trouble. Over the years, R and the various operating systems it has to live in have got a lot better about this, but ultimately the best solution is… not to do it at all. That is, to run everything on somebody else’s computers, excuse me, ‘in the cloud’. When students access an appropriately provisioned RStudio Server through their browsers they’re good to go from Lab one.
This post is about how to set all that up.
For the last few years, I’ve been using Amazon’s AWS infrastructure – big shout to James Lo, who helped me set it up the first time around. It was a bit involved, but it definitely worked. Meanwhile Microsoft was polishing its competitor, Azure, so this year, after confirming I didn’t have to actually run Windows anywhere, I decided I’d give that a try instead.
tl;dr It’s pretty good. You should try it too.
It took about a weekend to get all the details right, probably because, while I’m pretty familiar with Unix, I’m no devops expert. Also, some text-related R packages need quite specific stuff on the server before they’ll work. Hopefully these instructions will save you some fraction of that time. They borrow shamelessly from Colin Gillespie’s excellent blog post from last year, but are pitched at less knowledgeable folk, i.e. previous me.
We’ll start by configuring an Azure virtual machine, then set up R and RStudio, install some useful text analysis packages, and finally add users.
Configuring an Azure VM
First you have to sign up for Azure. This is a bit tedious, but your Skype / MS Live login credentials should work.
Say “yes I’d like to do the 30 day trial of Azure”. Eventually you’ll be presented with a ‘portal’ web page with a Dashboard. There’s an icon on the left side called Virtual Machines. Press it and the ‘Create virtual machine’ button.
I’m going to choose Ubuntu 17.10 (Artful Aardvark) from the Ubuntu Server collection, because… who knows what all these distributions are? The last time I seriously ran Linux it was all Redhat 9 (apart from that fateful Gentoo episode that we don’t talk about). Anyway, press ‘Create’ in the panel that appears on the right.
Let’s start with a ‘Basics’ tab. I’ll call it rstudio, set my username to conjugateprior, switch the authentication to ‘Password’ (don’t @ me, I’m trying to minimize the steps).
Now to invent an enormous and fiendishly hard-to-guess password that’s so witty that it’s a tragedy no one can ever know it, and add it to your password manager and the password box, in that order. You do have a password manager, don’t you?
We’ll need a ‘Resource group’ though I’m not entirely sure why. In the screen shot I’m using one I made earlier. You can provide some arbitrary name and leave the button on ‘Create new’.
Next we’ll pick a machine size. We can pick a smallish one at first because it’s easy to supersize it. Right now, I’ll just use something about the size of my desktop.
Now to the settings. Most of the defaults seem OK, but we should open some ports: SSH so we can log in and HTTP for regular web access. A bit later we’ll also open 8787 because that’s where RStudio Server listens.
Apparently we’ll need a ‘Diagnostics storage account’. Just give it a name. I’ve no idea what one of these is, but now we’ve got one. That should be enough to get going.
The ‘Summary’ tells us that the VM itself will be about 20c per hour, although there will be storage costs on top of that. Still, it’s a free trial with $200 of stuff thrown in, so we’ll probably survive.
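For a rough sense of how far the credit goes, here’s a back-of-the-envelope sketch using the figures above (20c per hour and $200 of credit). It ignores the storage costs, and hours_of_credit is just a made-up helper name.

```shell
#!/bin/sh
# hours_of_credit is a hypothetical helper: whole hours of VM time that a
# credit (in dollars) buys at a given price (in cents per hour).
# Usage: hours_of_credit CREDIT_DOLLARS CENTS_PER_HOUR
hours_of_credit() {
    echo $(( $1 * 100 / $2 ))
}

hours=$(hours_of_credit 200 20)
# Integer division, so the day count is rounded down.
echo "$hours hours, or $(( hours / 24 )) full days of continuous running"
```

So at that price the credit would outlast the 30 day trial even if the machine never stopped, storage aside.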
Here comes the machine. It takes about 5 minutes to get going.
When it arrives there’s an ‘Overview’ screen with graphs and suchlike representing the state of the virtual machine. So let’s talk to it. Over in the top right is the IP address we need (where it says ‘Public IP address’).
Now pull up a terminal window. That is: open Terminal if you’re on a Mac, and do the ‘CMD’ thing to launch the weird black box if you’re on Windows. Now to log in. If the IP address is 23.56.19.11 (it isn’t) then that would be
ssh conjugateprior@23.56.19.11
and we’re in. Time to update the system:
sudo apt-get update
sudo apt-get upgrade
That was the system. Now for R.
Set up R
This page describes how to get R onto an Ubuntu system, but for our purposes the instructions are in slightly the wrong order. The first thing we need to read is the part about ‘Secure APT’. I need the public key of the person who signs the Ubuntu R distribution before we can get it. This key-grabbing incantation should work first time:
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
Now we tell APT where to get R from by opening up the apt/sources.list file. Let’s do it in nano because that’s built in and I hate vi (keep not @-ing me, folks).
sudo nano /etc/apt/sources.list
At the bottom of this file we add the line
deb https://cloud.r-project.org/bin/linux/ubuntu artful/
to get R 3.4, then save and exit (control-O, then control-X).
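If you’d rather not open an editor at all, the same line can be appended from the command line. Here’s a minimal sketch, assuming a made-up helper called add_apt_source (it is not part of apt). It only appends the line if it isn’t already in the file, so running it twice does no harm. The SUDO variable is also my invention, there so the function can be tried out on a scratch file without root.

```shell
#!/bin/sh
# add_apt_source is a hypothetical helper, not part of apt.
# Usage: add_apt_source LINE FILE
# Appends LINE to FILE unless an identical line is already present.
# SUDO (an assumption of this sketch) names the command used to gain
# write access; it defaults to sudo.
add_apt_source() {
    line=$1
    file=$2
    # -q: quiet, -x: match whole lines only, -F: treat LINE as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" | "${SUDO:-sudo}" tee -a "$file" > /dev/null
    fi
}
```

On the server that would be called as add_apt_source 'deb https://cloud.r-project.org/bin/linux/ubuntu artful/' /etc/apt/sources.list.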
Back at the command line, it’s time to update again
sudo apt-get update
and install base R
sudo apt-get install r-base
Normally this is the point that we install R’s development tools, but it seems that on this version of Ubuntu we don’t have to. Nice. I do want some other system stuff though because some of our text packages will depend on them. The comments tell you what they do
sudo apt-get install libxml2 libxml2-dev # XML
sudo apt-get install libcairo2-dev # graphics device
sudo apt-get install libssl-dev libcurl4-openssl-dev # web stuff
sudo apt-get install libapparmor-dev # needed by sys package apparently
sudo apt-get install libpoppler-cpp-dev # text conversion, needed by readtext
Set up RStudio
Right, now back to the Portal to open up a port for RStudio. On the left side there is an icon for ‘Network settings’. Click on that and then on ‘Add inbound port rule’.
Now our inbound ports should be 80 (HTTP), 22 (SSH), and 8787 (RStudio’s port) plus three Azure thingies that we’ll leave well alone.
Back to the terminal. I want to install an RStudio Server. The instructions for that are here but here’s what they currently amount to:
sudo apt-get install gdebi-core
wget https://download2.rstudio.org/rstudio-server-1.1.453-amd64.deb
sudo gdebi rstudio-server-1.1.453-amd64.deb
Don’t forget to type ‘y’ when asked (N is the default).
RStudio should install and then start, so we’ll give it a moment before opening up a browser at the IP address in the portal on port 8787. That would be
http://23.56.19.11:8787
Is there an RStudio login page there? There is.
Now, since nobody can remember IP addresses, we’d better give this thing a name. Once more into the Portal, dear friends.
In the ‘Overview’, over on the right there is a ‘DNS name’ link called ‘configure’, press it and provide a name.
Now the RStudio login address is:
http://iqmr.eastus.cloudapp.azure.com:8787
which is still a bit of a mouthful but at least won’t change whenever we resize the machine. In my limited experimentation it seems like this address won’t always work for ssh though, so it’s as well to have the Portal open showing the current IP if you, like me, have the digit span of a chipmunk.
The RStudio server should now be running. If you want to stop it from the command line
sudo rstudio-server stop
and replace stop with start to get it going again.
Now for the hard part: installing things system-wide and adding users.
Installing R packages
Back on the command line:
sudo R
to get into R, then at the R prompt type
.libPaths()
There should be a system path first on the list, probably /usr/local/lib/R/site-library. If that’s not there, add it as a lib argument to all lines below. But it should be there.
Now to install a bunch of useful packages. We could do them all at once, but it’s better to do them one by one to check they all actually go through
install.packages("quanteda") # all things text
install.packages("stm") # topic models
install.packages("rvest") # scraping the web
install.packages("dplyr") # needs no introduction
install.packages("devtools") # to install stuff from github, 'remotes' might be lighter
install.packages("rmarkdown") # create vignettes and suchlike
install.packages("xtable") # needed so quanteda's overridden View function works
install.packages("caTools") # because RStudio seems to need this to compile Rmd files
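Once you’re back out of R, one way to double-check that everything really went through is to ask R about each package from the shell. This is only a sketch: missing_r_packages is a made-up helper name, and the RSCRIPT variable is an assumption of mine that lets you point it at a different Rscript. It prints the names of any packages R can’t load, and nothing at all if they’re all fine.

```shell
#!/bin/sh
# missing_r_packages is a hypothetical helper. It prints the name of every
# package given as an argument that R cannot load, and prints nothing if
# all of them are available.
# RSCRIPT (an assumption of this sketch) defaults to Rscript.
missing_r_packages() {
    for pkg in "$@"; do
        # requireNamespace() returns FALSE instead of stopping when a
        # package is absent, so the exit status tells us what happened.
        "${RSCRIPT:-Rscript}" -e "quit(save = 'no', status = if (requireNamespace('$pkg', quietly = TRUE)) 0 else 1)" \
            > /dev/null 2>&1 || printf '%s\n' "$pkg"
    done
}
```

For the list above that would be missing_r_packages quanteda stm rvest dplyr devtools rmarkdown xtable caTools, and silence is the good outcome.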
That should keep us going for the show. Come on it’s time to go add users.
Add Users
On Ubuntu one can add users in bulk from a file with lines in the right format. If you have a quick peek at /etc/passwd you can see what it looks like (there’s an x where we’ll have a password in the file). So we can easily construct such a file from a list of names with lines of the kind:
boo:boo_and_her_secretpassword:::boo,,,:/home/boo:/bin/bash
Now, in theory, the newusers command can just be given this file. In practice it tends to segfault for longish files, for reasons known only to the Ubuntu devs. I worked around that by having a Python script repeatedly write a one-line file and then call newusers in a system call. This is annoying, but if you’ve ever had to configure a laptop soundcard you barely feel it. If you’re not expecting a lot of users you can add them one by one, e.g. by typing
sudo adduser boo
and filling in her details. Either way, each user will be able to log in to the RStudio server as long as the virtual machine is running. They will also be able to install packages into their local R directories, but maybe encourage them to ask you for packages because it’s better to install them systemwide.
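For the bulk route, the one-line-file workaround can also be sketched in shell. To be clear, this is not the Python script I actually used, just the same idea under a few assumptions: the helper names are made up, the input is one "username password" pair per line, and passwords contain no colons. The NEWUSERS variable is my invention too, there so you can do a dry run without touching the system.

```shell
#!/bin/sh
# make_newusers_line is a hypothetical helper: it prints one line in the
# format shown above, with the UID and GID fields left empty so the system
# chooses them.
# Usage: make_newusers_line USERNAME PASSWORD
make_newusers_line() {
    printf '%s:%s:::%s,,,:/home/%s:/bin/bash\n' "$1" "$2" "$1" "$1"
}

# add_users_one_by_one reads "username password" pairs from standard input
# and gives newusers a fresh one-line file for each pair.
# NEWUSERS (an assumption of this sketch) defaults to "sudo newusers".
add_users_one_by_one() {
    while read -r name password; do
        [ -n "$name" ] || continue   # skip blank lines
        tmp=$(mktemp) || return 1
        make_newusers_line "$name" "$password" > "$tmp"
        # Left unquoted on purpose so "sudo newusers" splits into two words;
        # stdin is redirected so the command cannot eat the rest of the list.
        ${NEWUSERS:-sudo newusers} "$tmp" < /dev/null
        rm -f "$tmp"
    done
}
```

Setting NEWUSERS=cat first and running add_users_one_by_one < users.txt (where users.txt is whatever you called your list) just prints the lines that would be handed to newusers, which is a cheap way to check the format.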
OK. We now have RStudio in the cloud, a bunch of users, and an R setup suitable for doing quite a lot of fun things with text. Maybe log in to RStudio as boo to check it all works. When you get tired, or you decide it’s Bedtime for Bank Balance, just roll over to the Portal and press the big stop button.