How accurate is Next Bus II: wrangling data
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
- Goal: determine how accurate Next Bus predictions are
- Step 1: Create a timeseries
- Step 2: Flag arrivals and departures
- Step 2: Filter
- Step 3: Calculate error in Next Bus predictions
- Step 4: Write out data to csv
In a the previous post we extracted Next Bus predictions from the Wmata API.
If that’s too much or uninteresting to you, start here. Our extracted data looks something like this:
time Minutes VehicleID DirectionText RouteID TripID 2014-08-04 18:43:33.525000 17 6506 South to Federal Triangle 64 6464186 2014-08-04 18:43:33.525000 37 7228 South to Federal Triangle 64 6464187 2014-08-04 18:43:33.525000 56 7235 South to Federal Triangle 64 6464188 2014-08-04 18:43:33.525000 80 7231 South to Federal Triangle 64 6464189 2014-08-04 18:43:43.868000 17 6506 South to Federal Triangle 64 6464186 2014-08-04 18:43:43.868000 37 7228 South to Federal Triangle 64 6464187
Goal: determine how accurate Next Bus predictions are
I did this analysis in R, however it could be easily be done in Python or one of many languages. It could even be done in JavaScript. However, doing some data magic in R first will improve performance of the d3 visualizations. It also allows for much more rapid prototyping, exploration and analysis.
Step 1: Create a timeseries
Wmata conveniently provides with us a TripID for each unique bus trip. The only problem is they’re not actually unique.
The simplest fix I found was the create a unique identifier for bus trips by concatenating TripID
and VehicleID
.
There are certainly more complex ways that use a time dimension to determine unique trips, but this worked for me.
## reading in Nextbus data df <- read.delim('/data/bus64_8Aug2014.txt', sep='|', header=T, stringsAsFactors=F) ## setting up dates df$time <- as.POSIXct(df$time, origin='EST') df$hour <- as.numeric(strftime(df$time, '%H')) ## creating unique ID for bus trips df$TripID <- as.character(df$TripID) df$ID <- paste(df$TripID, df$VehicleID, sep="_") df <- df[order(df$ID, df$time),]
Step 2: Flag arrivals and departures
Next Bus doesn’t actually tell us when a bus arrives. We need to determine this from the time series we collect. This is one of the reasons I collect predictions from the API every 10 seconds. There is probably a more computationally efficient method for doing this using vectorized functions, but this works fine.
## marking arrival and departure dates for Trips depart <- by(df$time, df$ID, min) arrive <- by(df$time, df$ID, max) df$departure <- 0; df$arrival <- 0; df$est <- 0 for(i in 1:nrow(depart)-1) { tname <- names(depart)[i] df$departure[df$ID==tname & df$time==depart[i]] <- 1 df$arrival[df$ID==tname & df$time==arrive[i]] <- 1 }
This gives us something like this:
time Minutes VehicleID DirectionText RouteID TripID hour ID departure arrival 2014-08-12 12:35:36 1 6470 South to Federal Triangle 64 6464164 12 6464164_6470 0 0 2014-08-12 12:35:46 1 6470 South to Federal Triangle 64 6464164 12 6464164_6470 0 0 2014-08-12 12:35:57 0 6470 South to Federal Triangle 64 6464164 12 6464164_6470 0 0 2014-08-12 12:36:38 0 6470 South to Federal Triangle 64 6464164 12 6464164_6470 0 0 2014-08-12 12:36:48 0 6470 South to Federal Triangle 64 6464164 12 6464164_6470 0 0 2014-08-12 12:36:58 0 6470 South to Federal Triangle 64 6464164 12 6464164_6470 0 1 2014-08-07 10:56:06 98 6482 South to Federal Triangle 64 6464164 10 6464164_6482 1 0 2014-08-07 10:56:16 98 6482 South to Federal Triangle 64 6464164 10 6464164_6482 0 0 2014-08-07 10:56:26 98 6482 South to Federal Triangle 64 6464164 10 6464164_6482 0 0
Assumption 1: When Next Bus says a bus is arriving (Minutes==0
) multiple times, I take the latest prediction as the actual arrival time.
In other words, I take the last prediction where Minutes==0
before the bus disappears off your Next Bus app as the arrival time.
Step 2: Filter
Assumption 2: Remove bus trips that never arrive. There are some ghost buses out there. It’s impossible to determine the error in a prediction if you don’t know the outcome (true arrival time), so these trips are removed.
# removing buses that never arrive TripsThatArrive <- df$ID[df$arrival==1 & df$Minutes==0] df <- df[df$ID %in% TripsThatArrive,]
Step 3: Calculate error in Next Bus predictions
Now that we know the arrival times for each bus trip (df$arrival
), the prediction in minutes until arrival (df$Minutes
)
and the time the prediction was made (df$time
), we can figure calculate the prediction error (df$err
) and actual time until arrival (df$est
).
# finding actual prediction-to-arrival times df$est <- 0 for(i in 1:nrow(depart)) { tname <- names(depart)[i] df$est[df$ID==tname] <- (df$time[df$ID==tname & df$arrival==1] - df$time[df$ID==tname])/60 } # finding prediction errors df$err <- df$est - df$Minutes
Step 4: Write out data to csv
Easy enough. This is the data that will feed the d3 visualizations built in the next post.
write.table(df, file='/Users/ajb/Documents/github/nextbus/data/cleanTrips.csv', row.names=F, sep=',')
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.