Continued from Part 1.
Part 2: Digging into the Kelley Blue Book
The only thing better than a bit of data is a lot of data. Now that we can grab KBB values for a given trim of a given model in a given year, we set our ambitions higher: automating the collection of these values for all trims of a model over a set of years. To do so, let’s back up and recall how we got to the KBB results page:
Let’s suppose we’re still set on the Honda Accord and are considering the last ten model years. Going with “Search by: Year, Make & Model”, we get to the following self-explanatory screen:
Choosing (2005, Honda, Accord) pushes us to the following address: http://www.kbb.com/used-cars/honda/accord/2005/. There, we are reminded that the KBB reports different values for retail, certified retail, private sellers, and trade-ins:
Let’s go with “Private Party Value” for now; we end up at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value. We’re now presented with a plethora of different trims, enough to make us nostalgic for Henry Ford:
Start with the “DX Sedan 4D”. We arrive at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/equipment?id=846. If the previous screen didn’t freak us out, this one definitely should; but if we ignore the options at the bottom (which are set to their standard values for the given model year and trim), we’re left with the important parameters: the choice of automatic or manual transmission and the mileage (and the ZIP code, which I’ll discuss later).
I can’t drive stick, so I’m not particularly worried about changing the transmission from its default of Automatic. But if you wanted to, note that choosing Automatic with default options and 10,000 miles pushes you to http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/condition?id=846&mileage=10000 whereas choosing Manual, 5-Spd with the same options and mileage gives http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/condition?id=846&equipment=35014|true&mileage=10000.
Either way, we end up at a completely pointless page: no matter what you select, the results page gives values for all conditions.
Say we select “Good”. The results page for the Automatic is located at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=good&id=846&mileage=10000 and the results page for the Manual, 5-Spd is located at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=good&id=846&equipment=35014|true&mileage=10000. If we want, we can tear off the “condition” field, in which case the default condition, Excellent, is highlighted.
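For later automation, it’s handy to be able to assemble these results-page URLs in R. Here’s a minimal sketch, assuming the URL scheme above stays fixed; the helper name buildKBBReportURL and its arguments are mine, purely for illustration:

## hypothetical helper: paste together a pricing-report URL from the pieces
## identified above (trim id, mileage, condition, and, only when we stray from
## the defaults, an equipment string like '35014|true' for the manual)
buildKBBReportURL <- function(id, mileage, condition = "good", equipment = NULL) {
    equipment.part <- if (is.null(equipment)) "" else sprintf("&equipment=%s", equipment)
    sprintf("http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=%s&id=%s%s&mileage=%i",
            condition, id, equipment.part, mileage)
}

## e.g., the Automatic DX Sedan 4D at 10,000 miles in Good condition
buildKBBReportURL(id = 846, mileage = 10000)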
So, if we want to grab results for a bunch of different years and trims, we need to figure out the id=846 part of the URL (and possibly the equipment=35014|true part if we’re after a manual transmission). Again, it’s time for Firebug. Back up to the trim selection page at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value and load up Firebug. If we examine the links for the various trims, we see that the links for the available trims are contained within a div with id='UCPathTrim'.
The next step is to write some R code to parse the trim selection page and pull out the available trims and their corresponding id values. This will make use of some of the core functionality of the XML package.
The XML package and HTML documents
In the last post, we used the function readHTMLTable from the XML package to read the results from a webpage into an R data.frame. At the time, there was little mention of the technical details; now, we’re moving beyond convenient functions and into the great unknown.
The XML package, written by Professor Duncan Temple Lang of UC Davis, is a wrapper for libxml2. The package website, hosted by The Omega Project for Statistical Computing, is at http://www.omegahat.org/RSXML/, and the package listing on CRAN is located at http://cran.r-project.org/web/packages/XML/index.html.
At its core, the XML package is meant for parsing XML and HTML documents into tree structures and selecting, extracting, or otherwise manipulating branches or nodes of those trees. Take a look at the HTML tab of Firebug again (on http://www.kbb.com/used-cars/honda/accord/2005/private-party-value), and note that the webpage consists of a tree of HTML tags. At the root, there’s an html node with children head and body; within the body branch are nodes defining the structure of the document, including a branch descending from a div node (<div class="modCBox UCPathModule" id="UCPathTrim">), which contains a branch descending from a span node (<span class="sectContent">) with leaf nodes like <a class="link_circle_arrow_blue" href="/used-cars/honda/accord/2005/private-party-value/equipment?id=846"> Accord DX Sedan 4D</a>.
Now, moving to R, we’ll look at the tree produced by the XML package for this document. The first section of code should be fairly straightforward:
## download the webpage
kbbHTML <- readLines("http://www.kbb.com/used-cars/honda/accord/2005/private-party-value")

## load the XML package and parse the downloaded document
require(XML)
kbbTree <- htmlTreeParse(kbbHTML, asText = TRUE)

## get the root ('html') node
kbbRoot <- xmlRoot(kbbTree)
Each node object (class XMLNode) is also a list containing its immediate children as node objects.
> ## print the child nodes ('head' and 'body')
> print(summary(kbbRoot))
     Length Class   Mode
head 14     XMLNode list
body 19     XMLNode list
Thus, we can get the body of the document:
## select the 'body' child node using the usual R list element extraction syntax
kbbBody <- kbbRoot[["body"]]
Within the body, there’s a bunch of child nodes (the same ones we see in Firebug, of course):
> ## print the child nodes of the 'body'
> print(summary(kbbBody))
         Length Class          Mode
script   1      XMLNode        list
script   1      XMLNode        list
div      4      XMLNode        list
comment  0      XMLCommentNode list
script   0      XMLNode        list
script   1      XMLNode        list
script   1      XMLNode        list
script   1      XMLNode        list
noscript 1      XMLNode        list
comment  0      XMLCommentNode list
comment  0      XMLCommentNode list
script   0      XMLNode        list
div      2      XMLNode        list
script   0      XMLNode        list
script   1      XMLNode        list
comment  0      XMLCommentNode list
script   1      XMLNode        list
noscript 1      XMLNode        list
comment  0      XMLCommentNode list
Either by looking at the tree in Firebug or using summaries of the tree in R, we can identify the div node we’re looking for and access the corresponding node object in R:
## select our 'div id="UCPathTrim"...' node; instead of using node
## names (like 'div'), which aren't necessarily unique here, we use
## indices (we want the first child of the first child of the second
## child of the second child of the third child of 'body')
divUCPathTrim <- kbbBody[[3]][[2]][[2]][[1]][[1]]

> ## print the child nodes
> print(summary(divUCPathTrim))
     Length Class       Mode
h2   1      XMLNode     list
text 0      XMLTextNode list
span 9      XMLNode     list
We can then access the trim links, which are the leaf nodes of the span node under divUCPathTrim. Printing an XMLNode object outputs the raw HTML.
> ## print the HTML of the first of the link leaf nodes (children of the 'span' node)
> print(divUCPathTrim[["span"]][[1]])
<a href="/used-cars/honda/accord/2005/private-party-value/equipment?id=846" class="link_circle_arrow_blue">Accord DX Sedan 4D</a>
To get the node contents (here, the trim label), we use the xmlValue function:
> ## print the *contents* of this leaf node
> print(xmlValue(divUCPathTrim[["span"]][[1]]))
[1] "Accord DX Sedan 4D"
To get the link target (the ‘href’ attribute), we use the xmlAttrs function:
> ## print the 'href' attribute of this leaf node
> print(xmlAttrs(divUCPathTrim[["span"]][[1]])[["href"]])
[1] "/used-cars/honda/accord/2005/private-party-value/equipment?id=846"
There’s an easier way to select a set of nodes and apply functions over this set. To do so, we must learn a bit of XPath.
XPath
XPath is a query language for selecting sets of nodes from XML or XML-like documents (like HTML webpages). A nice quick introduction to XPath syntax is the w3schools.com article XPath Syntax. Open it in a tab, read it, and come back.
Done? Good. If we’re super lazy, we can use Firebug to generate an XPath expression to select a given node—just right click on the node and choose “Copy XPath”. Here’s the XPath expression for the second of the nine trim links:
/html/body/div/div[2]/div[2]/div/div/span/a[2]
To select all of the nine trim links, we simply chop off the “[2]” on the end (match all a nodes that are children of that span):
/html/body/div/div[2]/div[2]/div/div/span/a
If we want a short XPath expression, we can instead use something like this:
//div[@id = 'UCPathTrim']//a
That is, we select all a nodes that descend from any div node with attribute id='UCPathTrim'. In XPath syntax, “//nodename” selects descendant nodes named nodename, while “/nodename” selects child nodes named nodename (immediate descendants). Using double forward slashes allows us to skip specifying intermediate nodes. Expressions within square brackets are predicates: conditions, evaluated as booleans, that determine whether a node is included in the selection.
Is there any advantage to using one expression over the other? So long as the structure of the webpage doesn’t change, both will work; however, if the order of the nodes in the document changes, the former expression will fail, but the latter will continue to work (it selects on the div id attribute rather than its position in the document). Similarly, if the div id changes but the document structure otherwise remains unchanged (this is unlikely, but might happen if they messed around with their CSS styling or something), the former would continue working but the latter would fail.
We can create a fancier XPath expression using XPath functions that will continue to work so long as the KBB URL scheme stays the same. Since the rest of the code will depend on this remaining constant, our XPath expression should only fail at the same time as the rest of our code. A list of XPath functions can be found here. We’ll use the function contains(x, y), which returns true if string x contains string y (else false). Our XPath expression is:
//a[contains(@href, 'used-cars/honda/accord/2005/private-party-value/equipment')]
This selects all links with target URLs containing ‘used-cars/honda/accord/2005/private-party-value/equipment’.
getNodeSet and xpathApply
To use XPath with the XML package, we need to parse the document a little differently. You see, the XML package can either parse the document into a tree structure of R objects (as we did above, using htmlTreeParse) or into a tree structure of pointers to C-level objects. In the latter case, the parsed structure is maintained as lower-level objects in memory, and is not immediately accessible in R. Indeed, incorrectly accessing the parsed document object can cause R to crash. However, parsing the document into this C-level structure internal to libxml2 permits the use of XPath expressions. For more, do help("xmlParse").
In practice, using XPath expressions with the XML package is fairly simple. We parse the document with htmlParse instead of htmlTreeParse, and select sets of nodes corresponding to XPath expressions using getNodeSet. We can then lapply or sapply over the resulting nodeset. If we only need to apply a single function, we can instead use xpathApply to apply a function to an XPath-defined set directly.
## parse the downloaded document to an XMLInternalDocument
kbbInternalTree <- htmlParse(kbbHTML, asText = TRUE)

## select nodes matching our XPath expression
xpath.expression <- "//a[contains(@href,'/used-cars/honda/accord/2005/private-party-value/equipment')]"
trim.nodes <- getNodeSet(doc = kbbInternalTree, path = xpath.expression)

> ## the result is of class "XMLNodeSet", a list of 9 externalptr
> ## objects of class "XMLInternalElementNode"
> print(summary(trim.nodes))
      Length Class                  Mode
 [1,] 1      XMLInternalElementNode externalptr
 [2,] 1      XMLInternalElementNode externalptr
 [3,] 1      XMLInternalElementNode externalptr
 [4,] 1      XMLInternalElementNode externalptr
 [5,] 1      XMLInternalElementNode externalptr
 [6,] 1      XMLInternalElementNode externalptr
 [7,] 1      XMLInternalElementNode externalptr
 [8,] 1      XMLInternalElementNode externalptr
 [9,] 1      XMLInternalElementNode externalptr

> ## we can now lapply or sapply over this list object
> print(lapply(trim.nodes, function(x) c(xmlValue(x), xmlAttrs(x)[["href"]])))
[[1]]
[1] " Accord DX Sedan 4D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=846"

[[2]]
[1] " Accord EX Coupe 2D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=863"

[[3]]
[1] " Accord EX Sedan 4D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=859"

[[4]]
[1] " Accord EX-L Coupe 2D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=263736"

[[5]]
[1] " Accord EX-L Sedan 4D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=263737"

[[6]]
[1] " Accord Hybrid Sedan 4D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=868"

[[7]]
[1] " Accord LX Coupe 2D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=856"

[[8]]
[1] " Accord LX Sedan 4D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=850"

[[9]]
[1] " Accord LX Special Edition Coupe 2D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=867"
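If all we want are the labels and the link targets, xpathApply lets us skip the intermediate node set entirely. A quick sketch, reusing kbbInternalTree and xpath.expression from above:

## apply a function directly to the nodes matched by the XPath expression
trim.labels <- unlist(xpathApply(kbbInternalTree, xpath.expression, xmlValue))
trim.hrefs  <- unlist(xpathApply(kbbInternalTree, xpath.expression,
                                 function(x) xmlAttrs(x)[["href"]]))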
Putting it all together
I’m getting tired, so let’s jump ahead to a complete function that retrieves all of the trims for a given year. If you’ve read and understood everything above, you should be able to figure out how the function works without much trouble (with the possible exception of the gsub call that builds the XPath expression, which needlessly uses regular expressions). Go wild with help(...) until it all makes sense.
getKBBYearTrims <- function(prefix, year, type = "private-party-value") {
    require(XML)
    kbbTrimPageURL <- sprintf("%s%i/%s", prefix, year, type)
    cat("Loading", kbbTrimPageURL, "\n")
    x <- readLines(kbbTrimPageURL)
    g <- htmlParse(x, asText = TRUE)
    xpath <- gsub("([http:/w.]+kbb\\.com/)(.*)",
                  "//a[contains(@href, '\\2/equipment')]",
                  kbbTrimPageURL)
    cat("XPath expression is:", xpath, "\n")
    trims <- getNodeSet(doc = g, path = xpath)
    trimlabels <- sapply(trims, xmlValue)
    trimids <- sapply(trims, function(node)
        sub(".*id=([[:digit:]]+)$", "\\1", xmlAttrs(node)[["href"]]))
    trimtable <- data.frame(year = year, trim = trimlabels, id = trimids,
                            stringsAsFactors = FALSE)
    return(trimtable)
}
The function works great for 2005 Accords:
> ## print trims and ids for 2005 Honda Accords
> print(getKBBYearTrims(prefix = "http://www.kbb.com/used-cars/honda/accord/", year = 2005))
Loading http://www.kbb.com/used-cars/honda/accord/2005/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2005/private-party-value/equipment')]
  year                                trim     id
1 2005                  Accord DX Sedan 4D    846
2 2005                  Accord EX Coupe 2D    863
3 2005                  Accord EX Sedan 4D    859
4 2005                Accord EX-L Coupe 2D 263736
5 2005                Accord EX-L Sedan 4D 263737
6 2005              Accord Hybrid Sedan 4D    868
7 2005                  Accord LX Coupe 2D    856
8 2005                  Accord LX Sedan 4D    850
9 2005 Accord LX Special Edition Coupe 2D    867
The following function wraps getKBBYearTrims to return a data.frame of trims for a set of model years.
## wrapper: collect the trims for each year in 'years' and stack the results
getKBBTrims <- function(prefix, years, type = "private-party-value") {
    kbbTrimList <- lapply(years, function(year) getKBBYearTrims(prefix, year, type))
    kbbTrims <- do.call('rbind', kbbTrimList)
    return(kbbTrims)
}
Using it, we can try getting the trims for a series of model years:
> ## print trims and ids for years 2003 to 2007
> accord.trims <- getKBBTrims(prefix = "http://www.kbb.com/used-cars/honda/accord/", years = 2003:2007)
Loading http://www.kbb.com/used-cars/honda/accord/2003/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2003/private-party-value/equipment')]
Loading http://www.kbb.com/used-cars/honda/accord/2004/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2004/private-party-value/equipment')]
Loading http://www.kbb.com/used-cars/honda/accord/2005/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2005/private-party-value/equipment')]
Loading http://www.kbb.com/used-cars/honda/accord/2006/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2006/private-party-value/equipment')]
Loading http://www.kbb.com/used-cars/honda/accord/2007/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2007/private-party-value/equipment')]
> print(accord.trims)
   year                                trim     id
1  2003                  Accord DX Sedan 4D   2488
2  2003                  Accord EX Coupe 2D   2496
3  2003                  Accord EX Sedan 4D   2498
4  2003                Accord EX-L Coupe 2D 263731
5  2003                Accord EX-L Sedan 4D 263730
6  2003                  Accord LX Coupe 2D   2495
7  2003                  Accord LX Sedan 4D   2492
8  2004                  Accord DX Sedan 4D   2664
9  2004                  Accord EX Coupe 2D   2671
10 2004                  Accord EX Sedan 4D   2676
11 2004                Accord EX-L Coupe 2D 263735
12 2004                Accord EX-L Sedan 4D 263734
13 2004                  Accord LX Coupe 2D   2669
14 2004                  Accord LX Sedan 4D   2663
15 2005                  Accord DX Sedan 4D    846
16 2005                  Accord EX Coupe 2D    863
17 2005                  Accord EX Sedan 4D    859
18 2005                Accord EX-L Coupe 2D 263736
19 2005                Accord EX-L Sedan 4D 263737
20 2005              Accord Hybrid Sedan 4D    868
21 2005                  Accord LX Coupe 2D    856
22 2005                  Accord LX Sedan 4D    850
23 2005 Accord LX Special Edition Coupe 2D    867
24 2006                  Accord EX Coupe 2D    741
25 2006                  Accord EX Sedan 4D    739
26 2006                Accord EX-L Coupe 2D 263727
27 2006                Accord EX-L Sedan 4D 263726
28 2006              Accord Hybrid Sedan 4D    744
29 2006                  Accord LX Coupe 2D    736
30 2006                  Accord LX Sedan 4D    734
31 2006                  Accord SE Sedan 4D    738
32 2006                  Accord VP Sedan 4D    737
33 2007                  Accord EX Coupe 2D  83835
34 2007                  Accord EX Sedan 4D  83834
35 2007                Accord EX-L Coupe 2D 263674
36 2007                Accord EX-L Sedan 4D 263675
37 2007              Accord Hybrid Sedan 4D  83836
38 2007                  Accord LX Coupe 2D  83833
39 2007                  Accord LX Sedan 4D  83829
40 2007                  Accord SE Sedan 4D  83832
41 2007                  Accord VP Sedan 4D  83827
Everything works great. What a shock.
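To close the loop on the original goal, here’s a sketch (not run) of how one might combine accord.trims with the pricing-report URL scheme from earlier and readHTMLTable from Part 1. The function name getKBBValueTables, the default mileage, and the decision to return every table readHTMLTable finds on each results page are all my own assumptions, purely for illustration:

## for each trim, build the pricing-report URL (per the scheme above) and read
## whatever tables the results page contains, as we did by hand in Part 1
getKBBValueTables <- function(trims, prefix, mileage = 100000,
                              condition = "good", type = "private-party-value") {
    require(XML)
    lapply(seq_len(nrow(trims)), function(i) {
        url <- sprintf("%s%i/%s/pricing-report?condition=%s&id=%s&mileage=%i",
                       prefix, trims$year[i], type, condition, trims$id[i], mileage)
        readHTMLTable(url)
    })
}

## e.g. (not run):
## accord.values <- getKBBValueTables(accord.trims, prefix = "http://www.kbb.com/used-cars/honda/accord/")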