Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
School of Data had a recent post how to copy “every item” from a multi-page list. While their post did provide a neat hack, their “words of warning” are definitely missing some items and the overall methodology can be improved upon with some basic R scripting.
First, the technique they outlined relies heavily on how parameters are passed and handled by the server the form is connected to. The manual technique is not guaranteed to work across all types of forms nor even those with a “count” popup. I can see this potentially frustrating many budding data janitors.
Second, this particular technique and example really centers around jQuery DataTables. While their display style can be highly customized, it’s usually pretty easy to determine if they are being used both visually:
(i.e. by the controls & style of the controls available) and in the source:
The URLs might be local or on a common content delivery network, but it should be pretty easy to determine when a jQuery DataTable is in use. Once you do, you should also be able to tell if it’s calling out to a URL for some JSON to populate the structure.
Here, I just used Chrome’s Developer Tools to look a the responses coming back from the server. That’s a pretty ugly GET
request, but we can see the query parameters a bit better if we scroll down:
These definitely track well with the jQuery DataTable server-side documentation so we should be able to use this to our advantage to avoid the pitfalls of overwhelming the browser with HTML entities and doing cut & paste to save out the list.
Getting the Data With R
The R code to get this same data is about as simple as it gets. All you need is the data source URL, with a modified length
query parameter. After that’s it’s just a few lines of code:
library(httr) library(jsonlite) library(dplyr) # for glimpse url <- "http://www.allflicks.net/wp-content/themes/responsive/processing/processing_us.php?draw=1&columns%5B0%5D%5Bdata%5D=box_art&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=title&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=year&columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=rating&columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=category&columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=available&columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=director&columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B7%5D%5Bdata%5D=cast&columns%5B7%5D%5Bname%5D=&columns%5B7%5D%5Bsearchable%5D=true&columns%5B7%5D%5Borderable%5D=true&columns%5B7%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B7%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=5&order%5B0%5D%5Bdir%5D=desc&start=0&length=7448&search%5Bvalue%5D=&search%5Bregex%5D=false&movies=true&shows=true&documentaries=true&rating=netflix&_=1431945465056" resp <- GET(url) |
Normally we would be able to do:
content(resp, as="parsed") |
but this server did not set the Content-Type
of the response well, so we have to do it by hand with the jsonlite
package:
recs <- fromJSON(content(resp, as="text")) |
The recs
variable is now an R list with a structure that (thankfully) fully represents the expected server response:
## List of 4 ## $ draw : int 1 ## $ recordsTotal : int 7448 ## $ recordsFiltered: int 7448 ## $ data :'data.frame': 7448 obs. of 9 variables: ## ..$ box_art : chr [1:7448] "<img src="http://cdn1.nflximg.net/images/9159/12119159.jpg" width="55" alt="Thumbnail">" "<img src="http://cdn1.nflximg.net/images/6195/20866195.jpg" width="55" alt="Thumbnail">" "<img src="http://cdn1.nflximg.net/images/3735/2243735.jpg" width="55" alt="Thumbnail">" "<img src="http://cdn0.nflximg.net/images/2668/21112668.jpg" width="55" alt="Thumbnail">" ... ## ..$ title : chr [1:7448] "In the Bedroom" "Wolfy: The Incredible Secret" "Bratz: Diamondz" "Tinker Bell and the Legend of the NeverBeast" ... ## ..$ year : chr [1:7448] "2001" "2013" "2006" "2015" ... ## ..$ rating : chr [1:7448] "3.3" "2.5" "3.6" "4" ... ## ..$ category : chr [1:7448] "<a href="http://www.allflicks.net/category/thrillers/">Thrillers</a>" "<a href="http://www.allflicks.net/category/children-and-family-movies/">Children & Family Movies</a>" "<a href="http://www.allflicks.net/category/children-and-family-movies/">Children & Family Movies</a>" "<a href="http://www.allflicks.net/category/children-and-family-movies/">Children & Family Movies</a>" ... ## ..$ available: chr [1:7448] "17 May 2015" "17 May 2015" "17 May 2015" "17 May 2015" ... ## ..$ cast : chr [1:7448] "Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa Tomei, William Mapother, William Wise, Celia Weston, Karen Allen, Frank T. Well"| __truncated__ "Rafael Marin, Christian Vandepas, Gerald Owens, Yamile Vasquez, Pilar Uribe, James Carrey, Rebecca Jimenez, Joshua Jean-Baptist"| __truncated__ "Olivia Hack, Soleil Moon Frye, Tia Mowry-Hardrict, Dionne Quan, Wendie Malick, Lacey Chabert, Kaley Cuoco, Charles Adler" "Ginnifer Goodwin, Mae Whitman, Rosario Dawson, Lucy Liu, Pamela Adlon, Raven-Symoné, Megan Hilty" ... ## ..$ director : chr [1:7448] "Todd Field" "Éric Omond" "Mucci Fassett, Nico Rijgersberg" "Steve Loter" ... ## ..$ id : chr [1:7448] "60022258" "70302834" "70053695" "80028529" ... |
We see there is a data.frame
in there with the expected # of records. We can also use glimpse
from dplyr
to see the data table a bit better:
glimpse(recs$data) ## Observations: 7448 ## Variables: ## $ box_art (chr) "<img src="http://cdn1.nflximg.net/images/9159/12... ## $ title (chr) "In the Bedroom", "Wolfy: The Incredible Secret", ... ## $ year (chr) "2001", "2013", "2006", "2015", "1993", "2013", "2... ## $ rating (chr) "3.3", "2.5", "3.6", "4", "3.5", "3.1", "3.3", "4.... ## $ category (chr) "<a href="http://www.allflicks.net/category/thril... ## $ available (chr) "17 May 2015", "17 May 2015", "17 May 2015", "17 M... ## $ cast (chr) "Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa T... ## $ director (chr) "Todd Field", "Éric Omond", "Mucci Fassett, Nico R... ## $ id (chr) "60022258", "70302834", "70053695", "80028529", "8... |
Now, we can use that in any R workflow or use write it out as a CSV (or other format) for other workflows to use. No browsers were crashed and we have code we run again to scrape the site (i.e. when the add more movies to the database) vs a manual cut & paste workflow.
Many of the concepts in this post can be applied to other data table displays (i.e. those not based on jQuery DataTable), but you’ll have to get comfortable with the developer tools view of your favorite browser.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.