Site icon R-bloggers

How to buy a used car with R (part 1)

[This article was first published on Dan Knoepfle's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’m in the process of buying a used car. Since I enjoy making these decisions as complicated as possible, I’ve written some R code to scrape relevant websites for informative data. I’ve written this up as a blog entry because I think it’s a decent example of how one might use the XML package and Firebug to quickly and easily bring data from websites into R.

Part 1: Scraping the surface of the Kelley Blue Book

In the past, the first resource a used car buyer looking for price information might have turned to was the Kelley Blue Book; now, this information is available for free at KBB.com:

Finding the data with Firebug

For now, I’m going to skip ahead to the page containing the kind of information that we want; later, I’ll back up and go through the process of getting to that page and detail how I wrote some simple functions automating queries for different parameters.

Here’s http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage=10000, giving the KBB private party value for a 2005 Honda Accord DX Sedan with automatic transmission, standard options, and 10,000 miles:

To get at the data we want, we need to identify where it is located in the structure of the page. While one can do this by simply reading the HTML source code, Firebug makes things much simpler. Load up Firebug and go to the HTML tab. Click the Inspect Element button (or go to the Firebug menu and choose Inspect Element); as you mouse-over elements on the page, you’ll notice that the corresponding tag in the HTML element tree is opened and highlighted. In the screenshot below, I’ve clicked on the value for the Excellent condition:

Examining the HTML tree in the Firebug display, we can see that all of the information we’re interested in is contained in a table with id ‘priceCondition’. Similarly, if you’re using Google Chrome, you can accomplish the same thing with the Developer Tools. Below, Firefox is on the left and Chrome is on the right:

Parsing the web with the XML package

The XML package includes a convenient function called readHTMLTable to grab the data from the table we identified earlier. We can simply give it the URL of the page and it returns a list containing each of the page’s tables as an R object (converting them to data.frame by default).

kbbURL <- "http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage=10000"
require(XML)
kbbTables <- readHTMLTable(kbbURL)

With this minimal amount of effort, we’re most of the way to what we’re after:

> print(kbbTables)
$priceCondition
  Condition\r\n                \r\n   Value
1                                      
2                         Excellent $12,340
3                              Good $11,665
4                              Fair $10,565

By explicitly specifying the header, skipping the first two rows, and extracting the ‘priceCondition’ data.frame itself, we’re left with the raw data we are interested in:

kbbTable <- readHTMLTable(doc = kbbURL,
                          header = c("Condition","Value"),
                          skip.rows = c(1,2))[["priceCondition"]]


> print(kbbTable)
  Condition   Value
1 Excellent $12,340
2      Good $11,665
3      Fair $10,565

Now, if we take a look at the URL we’re using, http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage=10000, it should be apparent that fetching these values for any given mileage won’t be any trouble. The following code gets the KBB values for 10,000 mile increments from 10,000 to 150,000 miles:

kbbURLPrefix <- "http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage="
kbbValuesList <- lapply(seq(10000,150000,by=10000), function(m) {
  readHTMLTable(doc = paste(kbbURLPrefix,m,sep=""),
                header = c("Condition","Value"),
                skip.rows = c(1,2))[["priceCondition"]]
})


> length(kbbValuesList)
[1] 15
> head(kbbValuesList,2)
[[1]]
  Condition   Value
1 Excellent $12,340
2      Good $11,665
3      Fair $10,565

[[2]]
  Condition   Value
1 Excellent $11,965
2      Good $11,290
3      Fair $10,190

Finally, we can convert the list into one big data.frame and augment it with the corresponding mileages and the model year. This leaves us with a nice data.frame from which we can extract whatever information we desire.

kbbValues <- do.call('rbind', kbbValuesList)
kbbValues$Mileage <- rep(seq(10000,150000,by=10000), each = 3)
kbbValues$Year <- 2005


> head(kbbValues)
  Condition   Value Mileage Year
1 Excellent $12,340   10000 2005
2      Good $11,665   10000 2005
3      Fair $10,565   10000 2005
4 Excellent $11,965   20000 2005
5      Good $11,290   20000 2005
6      Fair $10,190   20000 2005
> print(kbbValues[which(kbbValues$Condition == "Excellent"),c("Mileage","Value")])
   Mileage   Value
1    10000 $12,340
4    20000 $11,965
7    30000 $11,565
10   40000 $11,140
13   50000 $10,740
16   60000 $10,265
19   70000  $9,740
22   80000  $9,190
25   90000  $8,640
28  100000  $9,440
31  110000  $7,640
34  120000  $7,190
37  130000  $6,765
40  140000  $6,190
43  150000  $5,965

Graphing our results with ggplot

Our last trick for the day is a simple one: take the data and make a pretty picture. Having collected the KBB values for different conditions and mileages, it is straightforward to construct a plot of value versus mileage for each condition.

First, however, we need to convert the kbbValues$Value column from its current human-readable state (a factor with levels like “$10,265”) into a more natural form for analysis. A quick bit of regular expressions magic using gsub does the trick, and we’re left with a nice column of numbers:

kbbValues$Value <- as.numeric(gsub("[$,]","",kbbValues$Value))


> kbbValues$Value
 [1] 12340 11665 10565 11965 11290 10190 11565 10890  9790 11140 10465  9365
[13] 10740 10065  8965 10265  9590  8490  9740  9065  7965  9190  8515  7415
[25]  8640  7965  6865  9440  8765  7665  7640  6965  5865  7190  6515  5415
[37]  6765  6090  4990  6190  5515  4415  5965  5290  4190

Use of ggplot is a subject best left for another time. Here, it’s as simple as:

require(ggplot2)
ggplot(kbbValues, aes(x = Mileage, y = Value, color = Condition, group = Condition)) + geom_line()

This gives us the following beautiful plot:

Wait, what?

So, where did that peak at 100,000 miles come from?

Well, looking back, it’s clear that it’s present in the raw data in kbbValues. If we check the original page (http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage=100000), however, the values don’t match. What happened?

The culprit, and a correction, are in the code below:

kbbURLPrefix <- "http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage="
kbbValuesList <- lapply(seq(10000,150000,by=10000), function(m) {
  currentURL <- sprintf("%s%i",kbbURLPrefix,m)
  cat(currentURL,"\n") # print debug info so we catch these errors!
  readHTMLTable(doc = currentURL,
                # The following converts m to character using as.character,
                # but as.character(100000) returns "1e+05"
                # doc = paste(kbbURLPrefix,m,sep=""),
                header = c("Condition","Value"),
                skip.rows = c(1,2))[["priceCondition"]]
})

Using the corrected procedure, we are rewarded with a nice, smooth graph:

To leave a comment for the author, please follow the link and comment on their blog: Dan Knoepfle's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.