How to buy a used car with R (part 2)
Continued from Part 1.
Part 2: Digging into the Kelley Blue Book
The only thing better than a bit of data is a lot of data. Now that we can grab KBB values for a given trim of a given model in a given year, we set our ambitions higher: automating the collection of these values for all trims of a model over a set of years. To do so, let’s back up and recall how we got to the KBB results page:
Let’s suppose we’re still set on the Honda Accord and are considering the last ten model years. Going with “Search by: Year, Make & Model”, we get to the following self-explanatory screen:
Choosing (2005, Honda, Accord) pushes us to the following address: http://www.kbb.com/used-cars/honda/accord/2005/. There, we are reminded that the KBB reports different values for retail, certified retail, private sellers, and trade-ins:
Let’s go with “Private Party Value” for now; we end up at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value. We’re now presented with a plethora of different trims, enough to make us nostalgic for Henry Ford:
Start with the “DX Sedan 4D”. We arrive at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/equipment?id=846. If the previous screen didn’t freak us out, this one definitely should. But if we ignore the options at the bottom (which are set to their standard values for the given model year and trim), we’re left with the important parameters: the choice of automatic or manual transmission and the mileage (and the ZIP code, which I’ll discuss later).
I can’t drive stick, so I’m not particularly worried about changing the transmission from its default of Automatic. But if you wanted to, note that choosing Automatic with default options and 10,000 miles pushes you to http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/condition?id=846&mileage=10000 whereas choosing Manual, 5-Spd with the same options and mileage gives http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/condition?id=846&equipment=35014|true&mileage=10000.
Either way, we end up at a completely pointless condition-selection page: no matter which condition you select, the results page gives values for all conditions.
Say we select “Good”. The results page for the Automatic is located at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=good&id=846&mileage=10000 and the results page for the Manual, 5-Spd is located at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=good&id=846&equipment=35014|true&mileage=10000. If we want, we can tear off the “condition” field, in which case the default condition, Excellent, is highlighted.
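The addresses above follow a regular pattern, so they can be assembled in R rather than by hand. What follows is a minimal sketch in base R; the helper name buildKBBPricingURL and its argument names are made up for illustration, and it assumes the URL scheme observed above is still in use.

```r
## Hypothetical helper (not part of any package): assemble pricing-report
## URLs from the pieces seen in the addresses above. This assumes the KBB
## URL scheme described in this post; if the site changes, so must this.
buildKBBPricingURL <- function(prefix, year, id, mileage,
                               type = "private-party-value",
                               condition = "good",
                               equipment = NULL) {
    ## the equipment field only appears for non-default options,
    ## e.g. "35014|true" for the 5-speed manual on the 2005 DX Sedan
    equipmentPart <- if (is.null(equipment)) "" else paste0("&equipment=", equipment)
    sprintf("%s%d/%s/pricing-report?condition=%s&id=%d%s&mileage=%d",
            prefix, as.integer(year), type, condition, as.integer(id),
            equipmentPart, as.integer(mileage))
}

accordPrefix <- "http://www.kbb.com/used-cars/honda/accord/"
automaticURL <- buildKBBPricingURL(accordPrefix, 2005, id = 846, mileage = 10000)
manualURL <- buildKBBPricingURL(accordPrefix, 2005, id = 846, mileage = 10000,
                                equipment = "35014|true")
## sprintf is vectorized, so several trim ids can be handled in one call
severalURLs <- buildKBBPricingURL(accordPrefix, 2005, id = c(846, 863), mileage = 10000)
print(automaticURL)
```

Because sprintf recycles its arguments, passing a vector of ids returns one URL per trim, which is convenient once the ids have been collected.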
So, if we want to grab results for a bunch of different years and trims, we need to figure out the id=846 part of the URL (and possibly the equipment=35014|true part if we’re after a manual transmission). Again, it’s time for Firebug. Back up to the trim selection page at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value and load up Firebug. If we examine the links for the various trims, we see that they are all contained within a div with id='UCPathTrim'.
The next step is to write some R code to parse the trim selection page and pull out the available trims and their corresponding id values. This will make use of some of the core functionality of the XML package.
The XML package and HTML documents
In the last post, we used the function readHTMLTable from the XML package to read the results from a webpage into an R data.frame. At the time, there was little mention of the technical details; now, we’re moving beyond convenient functions and into the great unknown.
The XML package, written by Professor Duncan Temple Lang of UC Davis, is a wrapper for libxml2. The package website, hosted by The Omega Project for Statistical Computing, is at http://www.omegahat.org/RSXML/, and the package listing on CRAN is located at http://cran.r-project.org/web/packages/XML/index.html.
At its core, the XML package is meant for parsing XML and HTML documents into tree structures and selecting and extracting or otherwise manipulating branches or nodes of the trees. Take a look at the HTML tab of Firebug again (on http://www.kbb.com/used-cars/honda/accord/2005/private-party-value), and note that the webpage consists of a tree of HTML tags. At its root, there’s an html node, with children head and body; within the body branch are nodes defining the structure of the document, including a branch descending from the div node with id='UCPathTrim'.
Now, moving to R, we’ll look at the tree produced by the XML package for this document. The first section of code should be fairly straightforward:
Each node object (class XMLNode) is also a list containing its immediate children as node objects. Thus, we can get the body of the document:
Within the body, there’s a bunch of child nodes (the same ones we see in Firebug, of course):
Either by looking at the tree in Firebug or using summaries of the tree in R, we can identify the div node we’re looking for and access the corresponding node object in R:
We can then access the trim links, which are the leaf nodes of the span node under divUCPathTrim. Printing an XMLNode object outputs the raw HTML. To get the node contents (here, the trim label), we use the xmlValue function:
To get the link target (the ‘href’ attribute), we use the xmlAttrs function:
There’s an easier way to select a set of nodes and apply functions over this set. To do so, we must learn a bit of XPath. XPath is a query language for selecting sets of nodes from XML or XML-like documents (like HTML webpages). A nice quick introduction to XPath syntax is the w3schools.com article XPath Syntax. Open it in a tab, read it, and come back. Done? Good.
If we’re super lazy, we can use Firebug to generate an XPath expression to select a given node: just right click on the node and choose “Copy XPath”.
Here’s the XPath expression for the second of the nine trim links:
To select all of the nine trim links, we simply chop off the “[2]” on the end (match all a nodes that are children of that span):
If we want a short XPath expression, we can instead use something like this:
That is, we select all a nodes that descend from any div node with attribute id='UCPathTrim'. In XPath syntax, “//nodename” selects descendant nodes named nodename while “/nodename” selects child nodes named nodename (immediate descendants). Using double forward slashes allows us to skip specifying intermediate nodes. Expressions within brackets are conditions, evaluated to booleans, specifying whether a node should or should not be included.
Is there any advantage to using one expression over the other? So long as the structure of the webpage doesn’t change, both will work; however, if the order of the nodes in the document changes, the former expression will fail, but the latter will continue to work (it selects on the div id attribute rather than its position in the document). Similarly, if the div id changes but the document structure otherwise remains unchanged (this is unlikely, but might happen if they messed around with their CSS styling or something), the former would continue working but the latter would fail.
We can create a fancier XPath expression using XPath functions that will continue to work so long as the KBB URL scheme stays the same. Since the rest of the code will depend on this remaining constant, our XPath expression should only fail at the same time as the rest of our code. A list of XPath functions can be found here. We’ll use the function contains(x, y), which returns true if string x contains string y (else false). Our XPath expression is:
This selects all links with target URLs containing ‘used-cars/honda/accord/2005/private-party-value/equipment’.
To use XPath with the XML package, we need to parse the document a little differently.
You see, the XML package can either parse the document into a tree structure of R objects (as we did above, using htmlTreeParse) or into a tree structure of pointers to C-level objects. In the latter case, the parsed structure is maintained as lower-level objects in memory, and is not immediately accessible in R. Indeed, incorrectly accessing the parsed document object can cause R to crash. However, parsing the document into this C-level structure internal to libxml2 permits the use of XPath expressions. For more, do help("xmlParse").
In practice, using XPath expressions with the XML package is fairly simple. We parse the document with htmlParse instead of htmlTreeParse, and select sets of nodes corresponding to XPath expressions using getNodeSet. We can then lapply or sapply over the resulting nodeset. If we only need to apply a single function, we can instead use xpathApply to apply a function to an XPath-defined set directly.
I’m getting tired, so let’s jump ahead to a complete function that retrieves all of the trims for a given year. If you’ve read and understood everything above, you should be able to figure out how the function works without much trouble (with the possible exception of the XPath expression, which needlessly uses regular expressions). Go wild with help(...) until it all makes sense.
The function works great for 2005 Accords:
The following function wraps getKBBYearTrims to return a data.frame of trims for a set of model years.
Using it, we can try getting the trims for a series of model years:
Everything works great. What a shock.
## download the webpage
kbbHTML <- readLines("http://www.kbb.com/used-cars/honda/accord/2005/private-party-value")
## load the XML package and parse the downloaded document
require(XML)
kbbTree <- htmlTreeParse(kbbHTML, asText = TRUE)
## get the root ('html') node
kbbRoot <- xmlRoot(kbbTree)
> ## print the child nodes ('head' and 'body')
> print(summary(kbbRoot))
Length Class Mode
head 14 XMLNode list
body 19 XMLNode list
## select the 'body' child node using the usual R list element extraction syntax
kbbBody <- kbbRoot[["body"]]
> ## print the child nodes of the 'body'
> print(summary(kbbBody))
Length Class Mode
script 1 XMLNode list
script 1 XMLNode list
div 4 XMLNode list
comment 0 XMLCommentNode list
script 0 XMLNode list
script 1 XMLNode list
script 1 XMLNode list
script 1 XMLNode list
noscript 1 XMLNode list
comment 0 XMLCommentNode list
comment 0 XMLCommentNode list
script 0 XMLNode list
div 2 XMLNode list
script 0 XMLNode list
script 1 XMLNode list
comment 0 XMLCommentNode list
script 1 XMLNode list
noscript 1 XMLNode list
comment 0 XMLCommentNode list
## select our 'div id="UCPathTrim"...' node; instead of using node
## names (like 'div'), which aren't necessarily unique here, we use
## indices (we want the first child of the first child of the second
## child of the second child of the third child of 'body')
divUCPathTrim <- kbbBody[[3]][[2]][[2]][[1]][[1]]
> ## print the child nodes
> print(summary(divUCPathTrim))
Length Class Mode
h2 1 XMLNode list
text 0 XMLTextNode list
span 9 XMLNode list
> ## print the HTML of the first of the link leaf nodes (children of the 'span' node)
> print(divUCPathTrim[["span"]][[1]])
<a href="/used-cars/honda/accord/2005/private-party-value/equipment?id=846" class="link_circle_arrow_blue">Accord DX Sedan 4D</a>
> ## print the *contents* of this leaf node
> print(xmlValue(divUCPathTrim[["span"]][[1]]))
[1] "Accord DX Sedan 4D"
> ## print the 'href' attribute of this leaf node
> print(xmlAttrs(divUCPathTrim[["span"]][[1]])[["href"]])
[1] "/used-cars/honda/accord/2005/private-party-value/equipment?id=846"
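The live page may no longer look like it did when the listing above was produced, so here is a small self-contained sketch of the same tree walk. The HTML string is a made-up stand-in for the trim selection page (only the div id, link targets, and labels are taken from the output above), and children are selected by name rather than by index because this toy document has no repeated node names along the path.

```r
library(XML)

## A tiny stand-in for the trim selection page (made up for illustration;
## the real page wraps this div in much more markup).
miniHTML <- paste0(
    '<html><head><title>Trims</title></head><body>',
    '<div id="UCPathTrim"><h2>Choose a trim</h2><span>',
    '<a href="/used-cars/honda/accord/2005/private-party-value/equipment?id=846">Accord DX Sedan 4D</a>',
    '<a href="/used-cars/honda/accord/2005/private-party-value/equipment?id=863">Accord EX Coupe 2D</a>',
    '</span></div></body></html>')

## parse into a tree of R-level XMLNode objects and walk down by child name
miniRoot <- xmlRoot(htmlTreeParse(miniHTML, asText = TRUE))
trimSpan <- miniRoot[["body"]][["div"]][["span"]]

## apply xmlValue and xmlAttrs over all children of the span at once
trimLabels <- unname(sapply(xmlChildren(trimSpan), xmlValue))
trimLinks <- unname(sapply(xmlChildren(trimSpan),
                           function(node) xmlAttrs(node)[["href"]]))
print(data.frame(label = trimLabels, href = trimLinks))
```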
XPath
/html/body/div/div[2]/div[2]/div/div/span/a[2]
/html/body/div/div[2]/div[2]/div/div/span/a
//div[@id = 'UCPathTrim']//a
//a[contains(@href, 'used-cars/honda/accord/2005/private-party-value/equipment')]
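To see how the three styles of expression differ in robustness, the following sketch runs them against two made-up miniature pages: one laid out like the original, and one where an extra div has been inserted ahead of the trim links. The page contents and the helper countMatches are hypothetical, and the positional expression is shortened to fit the toy document; only the div id and link targets follow the real page.

```r
library(XML)

## Shared fragment holding two trim links (a made-up miniature of the page)
trimFragment <- paste0(
    '<div><div id="UCPathTrim"><span>',
    '<a href="/used-cars/honda/accord/2005/private-party-value/equipment?id=846">DX</a>',
    '<a href="/used-cars/honda/accord/2005/private-party-value/equipment?id=863">EX</a>',
    '</span></div></div>')
pageOriginal <- paste0('<html><body>', trimFragment,
                       '<a href="/about">About</a></body></html>')
## same content, but a banner div now comes first in the body
pageReordered <- paste0('<html><body><div>Banner</div>', trimFragment,
                        '<a href="/about">About</a></body></html>')

xpaths <- c(
    positional = "/html/body/div[1]/div/span/a",
    byDivId    = "//div[@id = 'UCPathTrim']//a",
    byHref     = "//a[contains(@href, 'private-party-value/equipment')]")

## hypothetical helper: number of nodes an expression selects in a page
countMatches <- function(xpath, html) {
    doc <- htmlParse(html, asText = TRUE)
    length(getNodeSet(doc, xpath))
}

matchesOriginal <- sapply(xpaths, countMatches, html = pageOriginal)
matchesReordered <- sapply(xpaths, countMatches, html = pageReordered)
print(rbind(original = matchesOriginal, reordered = matchesReordered))
```

On the reordered page the positional expression should come up empty, while the two attribute-based expressions should still find both links.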
getNodeSet and xpathApply
## parse the downloaded document to an XMLInternalDocument
kbbInternalTree <- htmlParse(kbbHTML, asText = TRUE)
## select nodes matching our XPath expression
xpath.expression <- "//a[contains(@href,'/used-cars/honda/accord/2005/private-party-value/equipment')]"
trim.nodes <- getNodeSet(doc = kbbInternalTree,
                         path = xpath.expression)
> ## the result is of class "XMLNodeSet", a list of 9 externalptr
> ## objects of class "XMLInternalElementNode"
> print(summary(trim.nodes))
Length Class Mode
[1,] 1 XMLInternalElementNode externalptr
[2,] 1 XMLInternalElementNode externalptr
[3,] 1 XMLInternalElementNode externalptr
[4,] 1 XMLInternalElementNode externalptr
[5,] 1 XMLInternalElementNode externalptr
[6,] 1 XMLInternalElementNode externalptr
[7,] 1 XMLInternalElementNode externalptr
[8,] 1 XMLInternalElementNode externalptr
[9,] 1 XMLInternalElementNode externalptr
> ## we can now lapply or sapply over this list object
> print(lapply(trim.nodes, function(x) c(xmlValue(x), xmlAttrs(x)[["href"]])))
[[1]]
[1] " Accord DX Sedan 4D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=846"
[[2]]
[1] " Accord EX Coupe 2D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=863"
[[3]]
[1] " Accord EX Sedan 4D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=859"
[[4]]
[1] " Accord EX-L Coupe 2D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=263736"
[[5]]
[1] " Accord EX-L Sedan 4D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=263737"
[[6]]
[1] " Accord Hybrid Sedan 4D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=868"
[[7]]
[1] " Accord LX Coupe 2D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=856"
[[8]]
[1] " Accord LX Sedan 4D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=850"
[[9]]
[1] " Accord LX Special Edition Coupe 2D"
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=867"
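When only a single function needs to be applied, xpathSApply (the simplifying sibling of xpathApply) does the selection and the apply in one step. A minimal sketch on a made-up miniature document follows; the HTML string is an assumption standing in for the downloaded page, with leading spaces in the labels to mimic the output above.

```r
library(XML)

## made-up miniature of the trim selection page
miniHTML <- paste0(
    '<html><body><div id="UCPathTrim"><span>',
    '<a href="/used-cars/honda/accord/2005/private-party-value/equipment?id=846"> Accord DX Sedan 4D</a>',
    '<a href="/used-cars/honda/accord/2005/private-party-value/equipment?id=863"> Accord EX Coupe 2D</a>',
    '</span></div></body></html>')
miniDoc <- htmlParse(miniHTML, asText = TRUE)
trimXPath <- "//a[contains(@href, 'private-party-value/equipment')]"

## node contents, with surrounding whitespace stripped
trimLabels <- trimws(xpathSApply(miniDoc, trimXPath, xmlValue))
## a single attribute per node; extra arguments are passed on to xmlGetAttr
trimHrefs <- xpathSApply(miniDoc, trimXPath, xmlGetAttr, "href")
print(cbind(trimLabels, trimHrefs))
```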
Putting it all together
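getKBBYearTrims <- function(prefix, year, type = "private-party-value") {
    require(XML)
    ## build the trim selection page URL, e.g. <prefix>2005/private-party-value
    kbbTrimPageURL <- sprintf("%s%i/%s", prefix, year, type)
    cat("Loading", kbbTrimPageURL, "\n")
    x <- readLines(kbbTrimPageURL)
    ## parse to an XMLInternalDocument so that XPath can be used
    g <- htmlParse(x, asText = TRUE)
    ## strip everything up to and including 'kbb.com/' from the URL and wrap
    ## the remaining path in a contains() condition on the 'href' attribute
    xpath <- gsub("([http:/w.]+kbb\\.com/)(.*)", "//a[contains(@href, '\\2/equipment')]", kbbTrimPageURL)
    cat("XPath expression is:", xpath, "\n")
    trims <- getNodeSet(doc = g, path = xpath)
    ## labels are the link contents; ids are the digits after 'id=' in each href
    trimlabels <- sapply(trims, xmlValue)
    trimids <- sapply(trims, function(node) sub(".*id=([[:digit:]]+)$", "\\1", xmlAttrs(node)[["href"]]))
    trimtable <- data.frame(year = year,
                            trim = trimlabels,
                            id = trimids,
                            stringsAsFactors = FALSE)
    return(trimtable)
}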
> ## print trims and ids for 2005 Honda Accords
> print(getKBBYearTrims(prefix = "http://www.kbb.com/used-cars/honda/accord/", year = 2005))
Loading http://www.kbb.com/used-cars/honda/accord/2005/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2005/private-party-value/equipment')]
year trim id
1 2005 Accord DX Sedan 4D 846
2 2005 Accord EX Coupe 2D 863
3 2005 Accord EX Sedan 4D 859
4 2005 Accord EX-L Coupe 2D 263736
5 2005 Accord EX-L Sedan 4D 263737
6 2005 Accord Hybrid Sedan 4D 868
7 2005 Accord LX Coupe 2D 856
8 2005 Accord LX Sedan 4D 850
9 2005 Accord LX Special Edition Coupe 2D 867
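The XPath expression inside getKBBYearTrims is built with a more elaborate regular expression than it needs. As a sketch of a plainer alternative in base R, the two helpers below (buildTrimXPath and extractTrimId, both hypothetical names) strip the scheme and host from the page URL and pull the id out of a link target. The id pattern is also loosened so that it tolerates further query fields after the id, which the original pattern does not need to handle.

```r
## Hypothetical helper: turn a trim selection page URL into the XPath
## expression used to find its trim links.
buildTrimXPath <- function(trimPageURL) {
    ## drop the scheme and host, keeping only the path
    path <- sub("^https?://[^/]+/", "", trimPageURL)
    sprintf("//a[contains(@href, '%s/equipment')]", path)
}

## Hypothetical helper: extract the digits of the 'id' query field,
## whether or not other fields follow it.
extractTrimId <- function(href) {
    sub(".*[?&]id=([[:digit:]]+).*", "\\1", href)
}

exampleXPath <- buildTrimXPath("http://www.kbb.com/used-cars/honda/accord/2005/private-party-value")
exampleIds <- extractTrimId(c(
    "/used-cars/honda/accord/2005/private-party-value/equipment?id=846",
    "/used-cars/honda/accord/2005/private-party-value/condition?id=846&mileage=10000"))
print(exampleXPath)
print(exampleIds)
```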
getKBBTrims <- function(prefix, years, type = "private-party-value") {
    ## fetch the trims for each year, passing the requested value type through
    kbbTrimList <- lapply(years, function(year) getKBBYearTrims(prefix, year, type))
    kbbTrims <- do.call('rbind', kbbTrimList)
    return(kbbTrims)
}
> ## print trims and ids for years 2003 to 2007
> accord.trims <- getKBBTrims(prefix = "http://www.kbb.com/used-cars/honda/accord/", years = 2003:2007)
Loading http://www.kbb.com/used-cars/honda/accord/2003/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2003/private-party-value/equipment')]
Loading http://www.kbb.com/used-cars/honda/accord/2004/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2004/private-party-value/equipment')]
Loading http://www.kbb.com/used-cars/honda/accord/2005/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2005/private-party-value/equipment')]
Loading http://www.kbb.com/used-cars/honda/accord/2006/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2006/private-party-value/equipment')]
Loading http://www.kbb.com/used-cars/honda/accord/2007/private-party-value
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2007/private-party-value/equipment')]
> print(accord.trims)
year trim id
1 2003 Accord DX Sedan 4D 2488
2 2003 Accord EX Coupe 2D 2496
3 2003 Accord EX Sedan 4D 2498
4 2003 Accord EX-L Coupe 2D 263731
5 2003 Accord EX-L Sedan 4D 263730
6 2003 Accord LX Coupe 2D 2495
7 2003 Accord LX Sedan 4D 2492
8 2004 Accord DX Sedan 4D 2664
9 2004 Accord EX Coupe 2D 2671
10 2004 Accord EX Sedan 4D 2676
11 2004 Accord EX-L Coupe 2D 263735
12 2004 Accord EX-L Sedan 4D 263734
13 2004 Accord LX Coupe 2D 2669
14 2004 Accord LX Sedan 4D 2663
15 2005 Accord DX Sedan 4D 846
16 2005 Accord EX Coupe 2D 863
17 2005 Accord EX Sedan 4D 859
18 2005 Accord EX-L Coupe 2D 263736
19 2005 Accord EX-L Sedan 4D 263737
20 2005 Accord Hybrid Sedan 4D 868
21 2005 Accord LX Coupe 2D 856
22 2005 Accord LX Sedan 4D 850
23 2005 Accord LX Special Edition Coupe 2D 867
24 2006 Accord EX Coupe 2D 741
25 2006 Accord EX Sedan 4D 739
26 2006 Accord EX-L Coupe 2D 263727
27 2006 Accord EX-L Sedan 4D 263726
28 2006 Accord Hybrid Sedan 4D 744
29 2006 Accord LX Coupe 2D 736
30 2006 Accord LX Sedan 4D 734
31 2006 Accord SE Sedan 4D 738
32 2006 Accord VP Sedan 4D 737
33 2007 Accord EX Coupe 2D 83835
34 2007 Accord EX Sedan 4D 83834
35 2007 Accord EX-L Coupe 2D 263674
36 2007 Accord EX-L Sedan 4D 263675
37 2007 Accord Hybrid Sedan 4D 83836
38 2007 Accord LX Coupe 2D 83833
39 2007 Accord LX Sedan 4D 83829
40 2007 Accord SE Sedan 4D 83832
41 2007 Accord VP Sedan 4D 83827
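Things will not always work this smoothly: a single missing or malformed year page would stop the whole loop. One way to guard against that, sketched here in base R, is to wrap each per-year fetch in tryCatch and drop the failures before binding. The names collectTrims and stubFetch are hypothetical, and the stub returns placeholder rows instead of contacting KBB so that the example runs offline.

```r
## Hypothetical wrapper: apply a per-year fetch function, skipping years
## that raise an error instead of aborting the whole run.
collectTrims <- function(years, fetchYear) {
    perYear <- lapply(years, function(year) {
        tryCatch(fetchYear(year),
                 error = function(e) {
                     message("Skipping ", year, ": ", conditionMessage(e))
                     NULL
                 })
    })
    ## discard failed years, then stack the remaining data.frames
    perYear <- Filter(Negate(is.null), perYear)
    do.call(rbind, perYear)
}

## Stand-in for a real fetcher: fails for 2004 and otherwise returns a
## one-row placeholder table (the id "0" is a dummy value, not a real KBB id).
stubFetch <- function(year) {
    if (year == 2004) stop("page not available")
    data.frame(year = year, trim = "Accord LX Sedan 4D", id = "0",
               stringsAsFactors = FALSE)
}

stubTrims <- collectTrims(2003:2005, stubFetch)
print(stubTrims)
```

In real use, the fetch function could be something like function(year) getKBBYearTrims(prefix, year), with prefix defined as in the calls above.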