Simple automated web-scraping with R CMD BATCH and Task Scheduler
With a mixture of R’s command-line tool, a batch file, and the Windows Task Scheduler, a simple automated web-scraper can be built.
Invoking R at the command-line
It is possible to invoke R from the Windows command-line by entering the full path name of the executable, such as
C:\"Program Files"\R\R-3.3.0\bin\R --vanilla
The option --vanilla is shorthand for --no-save, --no-restore, --no-site-file, --no-init-file, and --no-environ; in short, it tells R not to load any files at startup and not to ask whether to save the workspace image upon exit. If you were to invoke R from within the bin/ directory, you could enter the much simpler command
R --vanilla
In the spirit of keeping things simple, if you add the bin/ directory to your PATH variable, you can use the simpler command to invoke R no matter what your current directory is. Here’s how to add that bin/ directory to your PATH variable:
- Press the Windows Key.
- Type systempropertiesadvanced (all one word) and press enter. Alternatively, type sysdm.cpl, press enter, and click the “Advanced” tab.
- Click “Environment Variables”.
- Select “Path” and click “Edit”.
- Place the cursor at the very end of the “Variable value” field.
- Type the path to the bin/ directory, preceded by a ; (entries in the PATH variable are delimited by ;). Here’s an example of what I typed: ;C:\Program Files\R\R-3.3.0\bin\
- Click all the “OK” buttons until you have exited.
Congratulations. You can now invoke R from anywhere within the command-line.
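One way to confirm the change is to ask R itself what is on the PATH. The sketch below is a minimal check and not part of the original setup; the helper name on_path is hypothetical, and the directory is the one from my example above, so adjust it to your R version.

```r
# Hypothetical helper: is `dir` one of the entries in the PATH variable?
on_path <- function(dir, path = Sys.getenv("PATH")) {
  # PATH entries are separated by ";" on Windows and ":" elsewhere.
  entries <- strsplit(path, .Platform$path.sep, fixed = TRUE)[[1]]
  # Normalise slashes, trailing separators, and case before comparing.
  clean <- function(p) tolower(sub("/+$", "", gsub("\\\\", "/", p)))
  clean(dir) %in% clean(entries)
}

# Directory from the example above (an assumption about your install).
print(on_path("C:/Program Files/R/R-3.3.0/bin"))

# Sys.which() reports which R executable, if any, the shell would find.
print(Sys.which("R"))
```

Note that a command-line window opened before the change will still have the old PATH, so run the check from a freshly started session.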
The BATCH tool
The above invocation launches R interactively in the command-line window, just as though you were using the console in RStudio or the R GUI. However, the command line also offers several R CMD “tools” that are not meant to be called directly from a GUI.
One such tool, BATCH, allows the user to run R files at the command-line (similar to using source() in the interactive GUI). The command
R --vanilla CMD BATCH file.R file.Rout
will execute file.R and save the output to file.Rout, assuming your working directory is the directory that contains file.R.
The .Rout file, if not given, is created in the same directory as the .R file and is given the same name but with extension .Rout. In the above example, once R CMD BATCH has finished executing file.R, it calls proc.time() and inserts the returned value in the .Rout file, giving an indication of how long it took to execute the file. Warning messages and errors are also written to the .Rout file.
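To see what ends up in an .Rout file without touching your own scripts, here is a small sketch that writes a throwaway file.R to a temporary directory and runs it in batch mode from an R session. The script contents and directory are made up for illustration, and the call uses the argument order shown in R's help for BATCH (options after BATCH) rather than the order I typed above.

```r
# Write a tiny, made-up script to a temporary directory.
work_dir <- tempfile("batch-demo-")
dir.create(work_dir)
writeLines(c("x <- 1:10", "print(sum(x))"), file.path(work_dir, "file.R"))

# Use the R executable that belongs to the current session.
r_exe <- file.path(R.home("bin"), "R")

# Run from inside the script's directory, as the text above assumes.
old_wd <- setwd(work_dir)
status <- system2(r_exe, c("CMD", "BATCH", "--vanilla", "file.R", "file.Rout"))
setwd(old_wd)

# The .Rout file holds the echoed commands, their output, and the timing.
rout <- readLines(file.path(work_dir, "file.Rout"))
cat(rout, sep = "\n")
```

The last lines of the printed output should show the proc.time() call and its result.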
About batch files
Instead of repeatedly typing an R CMD BATCH command to run an R file, you can store the command in a batch file and execute it from there. Batch files, which have extension .bat, are plain text files whose content can be read and executed by the shell. These files can be created and edited using any text editing program (including RStudio).
Here is a batch file based on the above example:
@echo off
R --vanilla CMD BATCH file.R file.Rout
where:
- @echo off = do not print the lines of code.
- The directory that the batch file is saved to and executed from is the same directory that contains file.R; if not, then change the working directory or specify the full file path.
Windows Task Scheduler
The Windows Task Scheduler allows users to schedule various types of tasks. One such task that can be scheduled is the execution of a batch file.
Using the GUI, it is possible to schedule an R file to execute daily by telling the scheduler to run a batch file which runs an R CMD BATCH command to execute that R file. Using the Task Scheduler GUI is a straightforward process:
- Press the Windows Key, type either taskschd.msc or “task scheduler”, and press enter to open the program.
- Click on “Create Task”.
- Assign a name and give a description.
- Create a new trigger and action to execute a batch file on a daily basis.
- Select additional conditions and settings as needed (such as “Wake to run” and “Run task as soon as possible after a scheduled start is missed”).
There are other features you can use such as “Hidden” or “Run whether user is logged on or not”, but the above should be good enough.
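I use the GUI in this post, but Windows also ships a command-line front end to the scheduler, schtasks, which could create a comparable daily task. The sketch below only builds and prints such a command; the task name, batch file path, start time, and the CREATE_TASK opt-in variable are all assumptions made for illustration, and the task is created only if you explicitly ask for it on Windows.

```r
# Hypothetical task name, batch file location, and start time.
task_name <- "Rig count scraper"
bat_file  <- "C:\\Users\\Luke\\Documents\\R\\rigcount.bat"
start_at  <- "06:00"

# Arguments for a task that runs the batch file once a day.
args <- c(
  "/Create",
  "/SC", "DAILY",
  "/TN", shQuote(task_name, type = "cmd"),
  "/TR", shQuote(bat_file, type = "cmd"),
  "/ST", start_at
)

# Print the command so it can be reviewed before anything is created.
cat("schtasks", args, "\n")

# Only create the task on Windows, and only when explicitly requested
# through the (made-up) CREATE_TASK environment variable.
if (.Platform$OS.type == "windows" &&
    identical(Sys.getenv("CREATE_TASK"), "yes")) {
  system2("schtasks", args)
}
```

Settings such as “Wake to run” are not covered by this sketch and would still need to be set in the GUI.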
Putting it all together
I have taken some web-scraping code from a previous post on scraping North Dakota rig count data, modified it, and saved it in a file called rigcount.data.R. You can find the modified code below, plus some caveats about writing R files that are executed by R CMD BATCH, at the end of this post.
Here is all that is needed to create a simple automated web-scraper based on rigcount.data.R:
- Create a batch file to execute rigcount.data.R. The batch file will run in the C:\Windows\System32 directory, so be sure to change the directory to where your R file is located, such as
  @echo off
  cd %USERPROFILE%\R\
  R --vanilla CMD BATCH rigcount.data.R rigcount.data.Rout
- Use the task scheduler to create a task that will execute the above batch file on a daily basis.
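If you would rather generate the batch file than type it, the first step could be sketched in R as follows. The helper write_batch_file is hypothetical, and it writes an absolute directory with cd /d (which also switches drives) in place of the %USERPROFILE% path I used above; treat it as one possible layout rather than the only one.

```r
# Hypothetical helper: write a .bat file that runs `script` in batch mode.
write_batch_file <- function(script, bat = sub("\\.[Rr]$", ".bat", script)) {
  script_dir  <- normalizePath(dirname(script), winslash = "\\",
                               mustWork = FALSE)
  script_name <- basename(script)
  out_name    <- sub("\\.[Rr]$", ".Rout", script_name)
  lines <- c(
    "@echo off",
    # Move to the script's directory before invoking R.
    sprintf('cd /d "%s"', script_dir),
    sprintf("R --vanilla CMD BATCH %s %s", script_name, out_name)
  )
  writeLines(lines, bat)
  invisible(bat)
}

# Example with a temporary directory standing in for your own R folder.
bat <- write_batch_file(file.path(tempdir(), "rigcount.data.R"))
bat_lines <- readLines(bat)
cat(bat_lines, sep = "\n")
```

The helper does not check that the R file exists; it only writes the three lines of the batch file next to where the script is expected to be.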
There you have it. With a scheduled task to execute the batch file, you have just created a simple automated web-scraper.
rigcount.data.R
Because you are executing an R file in batch mode, there will be a few changes to how R normally works when used with a program such as RStudio (which redirects standard input and output among other things).
- The library path to your %USERPROFILE%\R directory that is normally available when using RStudio will not be seen when using R CMD BATCH. That is why, before calling library(), it is necessary to specify that path, as in my case:
  .libPaths("C:/Users/Luke/Documents/R/win-library/3.3")
- When using write.csv() to create a new CSV file within RStudio, you normally don’t need to create and connect to that file. Using R CMD BATCH, however, you will need to do this, such as
  fname <- "C:/Users/Luke/Documents/R/newFile.csv"
  file.create(fname)
  fcon <- file(fname, open = "w")
  write.csv(some.object, fname, row.names = FALSE)
  close(fcon)
Here is the code for rigcount.data.R:
# Scrape Rig Count Data ---------------------------------------------------

# Load Dependencies
.libPaths("C:/Users/Luke/Documents/R/win-library/3.3")
library(rvest)

# Set today's date; to be used in file name.
today <- Sys.Date()

# Create and load URL; scrape table nodes and attributes ("summary").
url <- "https://www.dmr.nd.gov/oilgas/riglist.asp"
html <- url %>% read_html()
table <- html %>% html_nodes("table")
table.summary <- table %>% html_attr("summary")

# Find the table with rig count data, which is called "results".
table.filter <- grep("results", table.summary)
rig.table <- table[table.filter] %>% html_table()

# Extract the table from the list; find and apply the header to the table.
rig.table <- rig.table[[1]]
rig.table.header <- table[table.filter] %>%
  html_nodes("thead") %>%
  html_nodes("th") %>%
  html_text()
colnames(rig.table) <- rig.table.header

# Add "Publication Date" and make it the first column.
rig.table[ncol(rig.table) + 1L] <- today
names(rig.table)[ncol(rig.table)] <- "Publication Date"
rig.table <- rig.table[, c(ncol(rig.table), 1:(ncol(rig.table) - 1L))]

# Write table to CSV file.
fname <- paste0(getwd(), "/", today, ".csv")
file.create(fname)
fcon <- file(fname, open = "w")
write.csv(rig.table, fname, row.names = FALSE)
close(fcon)
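The “Publication Date” step is the least obvious part of the script, so here is the same logic applied to a toy data frame that needs no network access. The column names, values, and fixed date are invented for illustration and do not come from the real rig count table.

```r
# Toy stand-in for the scraped table (made-up columns and values).
rig.table <- data.frame(
  Rig = c("Rig 1", "Rig 2"),
  County = c("County A", "County B"),
  stringsAsFactors = FALSE
)

# Fixed date so the example is reproducible; the script uses Sys.Date().
today <- as.Date("2016-06-01")

# Append the date as a new last column; the single value is recycled.
rig.table[ncol(rig.table) + 1L] <- today
names(rig.table)[ncol(rig.table)] <- "Publication Date"

# Reorder: the last column first, followed by the original columns.
rig.table <- rig.table[, c(ncol(rig.table), 1:(ncol(rig.table) - 1L))]

print(names(rig.table))
print(rig.table)
```

The column index vector c(ncol, 1:(ncol - 1)) is what moves the new column to the front while leaving the order of the others unchanged.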