I’m in the process of buying a used car. Since I enjoy making these decisions as complicated as possible, I’ve written some R code to scrape relevant websites for informative data. I’ve written this up as a blog entry because I think it’s a decent example of how one might use the XML package and Firebug to quickly and easily bring data from websites into R.
Part 1: Scraping the surface of the Kelley Blue Book
In the past, the first resource a used car buyer looking for price information might have turned to was the Kelley Blue Book; now, this information is available for free at KBB.com:
Finding the data with Firebug
For now, I’m going to skip ahead to the page containing the kind of information that we want; later, I’ll back up and go through the process of getting to that page and detail how I wrote some simple functions automating queries for different parameters.
Here’s http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage=10000, giving the KBB private party value for a 2005 Honda Accord DX Sedan with automatic transmission, standard options, and 10,000 miles:
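Before moving on, it's worth noting how much of the query lives right in that URL: the make, model, and year appear in the path, while the condition, a trim id, and the mileage are query parameters. As a rough sketch (this helper is my own, not part of the original workflow, and the id still has to be looked up for each trim), building such URLs programmatically might look like:

# Hypothetical helper for assembling a KBB pricing-report URL from its parts.
# The 'id' argument appears to identify the specific trim (846 here) and is
# assumed to be known in advance.
kbbPricingURL <- function(make, model, year, id, mileage, condition = "excellent") {
  sprintf("http://www.kbb.com/used-cars/%s/%s/%d/private-party-value/pricing-report?condition=%s&id=%d&mileage=%d",
          make, model, year, condition, id, mileage)
}

kbbPricingURL("honda", "accord", 2005, id = 846, mileage = 10000)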
To get at the data we want, we need to identify where it is located in the structure of the page. While one can do this by simply reading the HTML source code, Firebug makes things much simpler. Load up Firebug and go to the HTML tab. Click the Inspect Element button (or go to the Firebug menu and choose Inspect Element); as you mouse-over elements on the page, you’ll notice that the corresponding tag in the HTML element tree is opened and highlighted. In the screenshot below, I’ve clicked on the value for the Excellent condition:
Examining the HTML tree in the Firebug display, we can see that all of the information we’re interested in is contained in a table with id ‘priceCondition’. Similarly, if you’re using Google Chrome, you can accomplish the same thing with the Developer Tools. Below, Firefox is on the left and Chrome is on the right:
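If you'd like to confirm from within R that the table Firebug highlighted is really there, a quick sanity check (my addition, using functions from the same XML package introduced below) is to parse the page and query for the table by its id:

# Optional check: parse the page and look for the table with id 'priceCondition'.
require(XML)
kbbDoc <- htmlParse("http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage=10000")
getNodeSet(kbbDoc, "//table[@id='priceCondition']")  # should return a single table node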
Parsing the web with the XML package
The XML package includes a convenient function called readHTMLTable to grab the data from the table we identified earlier. We can simply give it the URL of the page and it returns a list containing each of the page’s tables as an R object (converting them to data.frame by default).
kbbURL <- "http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage=10000"
require(XML)
kbbTables <- readHTMLTable(kbbURL)
With this minimal amount of effort, we’re most of the way to what we’re after:
> print(kbbTables)
$priceCondition
  Condition\r\n \r\n    Value
1
2           Excellent $12,340
3                Good $11,665
4                Fair $10,565
By explicitly specifying the header, skipping the first two rows, and extracting the ‘priceCondition’ data.frame itself, we’re left with the raw data we are interested in:
kbbTable <- readHTMLTable(doc = kbbURL,
                          header = c("Condition", "Value"),
                          skip.rows = c(1, 2))[["priceCondition"]]

> print(kbbTable)
  Condition   Value
1 Excellent $12,340
2      Good $11,665
3      Fair $10,565
Now, if we take a look at the URL we’re using, http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage=10000, it should be apparent that fetching these values for any given mileage won’t be any trouble. The following code gets the KBB values for 10,000 mile increments from 10,000 to 150,000 miles:
kbbURLPrefix <- "http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage="
kbbValuesList <- lapply(seq(10000, 150000, by = 10000), function(m) {
  readHTMLTable(doc = paste(kbbURLPrefix, m, sep = ""),
                header = c("Condition", "Value"),
                skip.rows = c(1, 2))[["priceCondition"]]
})

> length(kbbValuesList)
[1] 15
> head(kbbValuesList, 2)
[[1]]
  Condition   Value
1 Excellent $12,340
2      Good $11,665
3      Fair $10,565

[[2]]
  Condition   Value
1 Excellent $11,965
2      Good $11,290
3      Fair $10,190
Finally, we can convert the list into one big data.frame and augment it with the corresponding mileages and the model year. This leaves us with a nice data.frame from which we can extract whatever information we desire.
kbbValues <- do.call('rbind', kbbValuesList)
kbbValues$Mileage <- rep(seq(10000, 150000, by = 10000), each = 3)
kbbValues$Year <- 2005

> head(kbbValues)
  Condition   Value Mileage Year
1 Excellent $12,340   10000 2005
2      Good $11,665   10000 2005
3      Fair $10,565   10000 2005
4 Excellent $11,965   20000 2005
5      Good $11,290   20000 2005
6      Fair $10,190   20000 2005

> print(kbbValues[which(kbbValues$Condition == "Excellent"), c("Mileage", "Value")])
   Mileage   Value
1    10000 $12,340
4    20000 $11,965
7    30000 $11,565
10   40000 $11,140
13   50000 $10,740
16   60000 $10,265
19   70000  $9,740
22   80000  $9,190
25   90000  $8,640
28  100000  $9,440
31  110000  $7,640
34  120000  $7,190
37  130000  $6,765
40  140000  $6,190
43  150000  $5,965
Graphing our results with ggplot
Our last trick for the day is a simple one: take the data and make a pretty picture. Having collected the KBB values for different conditions and mileages, it is straightforward to construct a plot of value versus mileage for each condition.
First, however, we need to convert the kbbValues$Value column from its current human-readable state (a factor with levels like “$10,265”) into a more natural form for analysis. A quick bit of regular-expression magic with gsub does the trick (gsub coerces the factor to character, so as.numeric can then be applied directly), and we’re left with a nice column of numbers:
kbbValues$Value <- as.numeric(gsub("[$,]", "", kbbValues$Value))

> kbbValues$Value
 [1] 12340 11665 10565 11965 11290 10190 11565 10890  9790 11140 10465  9365
[13] 10740 10065  8965 10265  9590  8490  9740  9065  7965  9190  8515  7415
[25]  8640  7965  6865  9440  8765  7665  7640  6965  5865  7190  6515  5415
[37]  6765  6090  4990  6190  5515  4415  5965  5290  4190
Use of ggplot is a subject best left for another time. Here, it’s as simple as:
require(ggplot2)
ggplot(kbbValues, aes(x = Mileage, y = Value, color = Condition, group = Condition)) +
  geom_line()
This gives us the following beautiful plot:
Wait, what?
So, where did that peak at 100,000 miles come from?
Well, looking back, it’s clear that it’s present in the raw data in kbbValues. If we check the original page (http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage=100000), however, the values don’t match. What happened?
The culprit, and a correction, are in the code below:
kbbURLPrefix <- "http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=excellent&id=846&mileage="
kbbValuesList <- lapply(seq(10000, 150000, by = 10000), function(m) {
  currentURL <- sprintf("%s%i", kbbURLPrefix, m)
  cat(currentURL, "\n")  # print debug info so we catch these errors!
  readHTMLTable(doc = currentURL,
                # The commented-out line below was the culprit: paste converts
                # m to character using as.character, but as.character(100000)
                # returns "1e+05"
                # doc = paste(kbbURLPrefix, m, sep = ""),
                header = c("Condition", "Value"),
                skip.rows = c(1, 2))[["priceCondition"]]
})
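If you want to see the quirk in isolation, it reproduces at the console with nothing KBB-specific involved (this little demonstration is mine, not from the original post):

as.character(100000)                 # "1e+05" -- scientific notation sneaks in
paste("mileage=", 100000, sep = "")  # "mileage=1e+05" -- paste uses as.character
sprintf("mileage=%i", 100000)        # "mileage=100000" -- the fix used above
format(100000, scientific = FALSE)   # "100000" -- another way to avoid it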
Using the corrected procedure, we are rewarded with a nice, smooth graph: