Batch Geocoding with R and Google Maps
I recently needed to geocode a large number of addresses (roughly 60,000) in Ireland as part of a visualisation of the Irish property market. Geocoding is straightforward in R using the geocode() function from the ggmap library, which uses Google’s Geocoding API to turn text addresses into latitude and longitude pairs.
There is a usage limit on the geocoding service for free users of 2,500 addresses per IP address per day. This hard limit cannot be overcome without using a new IP address or paying for a business account. To ease the pain of restarting an R process every 2,500 addresses (that is, every day), I’ve built a script that geocodes addresses up to the API query limit each day, with a few handy features:
- Once it hits the geocoding limit, it patiently waits for Google’s servers to let it proceed.
- The script pings Google once per hour during the down time to start geocoding again as soon as possible.
- A temporary file containing the current data state is maintained during the process. Should the script be interrupted, it will resume from where it left off once any problems with the data or connection have been rectified.
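The wait-and-retry behaviour described in the first two points can be sketched in isolation. The snippet below is a minimal illustration rather than the script itself: make_fake_geocoder() and query_with_wait() are hypothetical names, the address is made up, and the fake geocoder stands in for Google's servers so the loop can be run without network access or quota. The wait is set to zero seconds here, whereas the script waits an hour between attempts.

```r
# Hypothetical stand-in for a geocoding call: it reports OVER_QUERY_LIMIT
# for the first `failures` attempts and "OK" afterwards.
make_fake_geocoder <- function(failures = 2) {
  attempts <- 0
  function(address) {
    attempts <<- attempts + 1
    if (attempts <= failures) {
      list(status = "OVER_QUERY_LIMIT")
    } else {
      list(status = "OK", address = address)
    }
  }
}

# Keep re-sending the same query until the service stops reporting
# OVER_QUERY_LIMIT, sleeping for wait_seconds between attempts.
query_with_wait <- function(address, query_fn, wait_seconds = 60 * 60) {
  reply <- query_fn(address)
  waits <- 0
  while (identical(reply$status, "OVER_QUERY_LIMIT")) {
    Sys.sleep(wait_seconds)
    waits <- waits + 1
    reply <- query_fn(address)
  }
  reply$waits <- waits
  reply
}

fake_geocode <- make_fake_geocoder(failures = 2)
reply <- query_with_wait("1 Main Street, Dublin, Ireland", fake_geocode,
                         wait_seconds = 0)
print(reply$status)
print(reply$waits)
```

Because the same address is retried after each pause, no address is skipped when the limit is hit part-way through the list.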
The R script assumes that you are starting with a dataset contained in a single *.csv file, “input.csv”, with the addresses in the “address” column. Feel free to use or modify it to suit your own needs!
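To make the expected input concrete, here is a small sketch that builds a file with that layout. The two addresses and the price column are invented for illustration, and the file is written to R's temporary directory (an assumption made so that nothing in your working directory is overwritten); for a real run the file would sit next to the script.

```r
# Invented sample rows: only the "address" column is required by the script,
# any other columns are carried through untouched.
sample_input <- data.frame(
  address = c("1 Main Street, Dublin", "2 High Street, Cork"),
  price = c(250000, 180000),
  stringsAsFactors = FALSE
)

# Write the file to a temporary location and read it back to check the layout.
input_path <- file.path(tempdir(), "input.csv")
write.csv(sample_input, input_path, row.names = FALSE)

check <- read.csv(input_path, stringsAsFactors = FALSE)
print(names(check))
print(paste0(check$address, ", Ireland"))
```

Column names are case-sensitive in R, so the header in the file has to match the name the script looks up.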
Comments are included where possible:
    # Geocoding script for large list of addresses.
    # Shane Lynn 10/10/2013

    # load up the ggmap library
    library(ggmap)

    # get the input data
    infile <- "input"
    data <- read.csv(paste0('./', infile, '.csv'))

    # get the address list, and append "Ireland" to the end to increase accuracy
    # (change or remove this if your addresses already include a country etc.)
    addresses <- data$address
    addresses <- paste0(addresses, ", Ireland")

    # define a function that will process Google's server responses for us.
    getGeoDetails <- function(address){
      # use the geocode function to query Google's servers
      geo_reply <- geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
      # now extract the bits that we need from the returned list
      answer <- data.frame(lat=NA, long=NA, accuracy=NA, formatted_address=NA, address_type=NA, status=NA)
      answer$status <- geo_reply$status

      # if we are over the query limit - want to pause for an hour
      while(geo_reply$status == "OVER_QUERY_LIMIT"){
        print("OVER QUERY LIMIT - Pausing for 1 hour at:")
        time <- Sys.time()
        print(as.character(time))
        Sys.sleep(60*60)
        geo_reply <- geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
        answer$status <- geo_reply$status
      }

      # return NAs if we didn't get a match:
      if (geo_reply$status != "OK"){
        return(answer)
      }
      # else, extract what we need from the Google server reply into a dataframe:
      answer$lat <- geo_reply$results[[1]]$geometry$location$lat
      answer$long <- geo_reply$results[[1]]$geometry$location$lng
      if (length(geo_reply$results[[1]]$types) > 0){
        answer$accuracy <- geo_reply$results[[1]]$types[[1]]
      }
      answer$address_type <- paste(geo_reply$results[[1]]$types, collapse=',')
      answer$formatted_address <- geo_reply$results[[1]]$formatted_address

      return(answer)
    }

    # initialise a dataframe to hold the results
    geocoded <- data.frame()
    # find out where to start in the address list (if the script was interrupted before):
    startindex <- 1
    # if a temp file exists - load it up and count the rows!
    tempfilename <- paste0(infile, '_temp_geocoded.rds')
    if (file.exists(tempfilename)){
      print("Found temp file - resuming from index:")
      geocoded <- readRDS(tempfilename)
      # the saved rows are already done, so resume at the next address
      startindex <- nrow(geocoded) + 1
      print(startindex)
    }

    # Start the geocoding process - address by address. geocode() function takes care of query speed limit.
    # (the if stops seq() from counting backwards when every address is already done)
    if (startindex <= length(addresses)){
      for (ii in seq(startindex, length(addresses))){
        print(paste("Working on index", ii, "of", length(addresses)))
        # query the google geocoder - this will pause here if we are over the limit.
        result <- getGeoDetails(addresses[ii])
        print(result$status)
        result$index <- ii
        # append the answer to the results file.
        geocoded <- rbind(geocoded, result)
        # save temporary results as we are going along
        saveRDS(geocoded, tempfilename)
      }
    }

    # now we add the latitude and longitude to the main data
    data$lat <- geocoded$lat
    data$long <- geocoded$long
    data$accuracy <- geocoded$accuracy

    # finally write it all to the output files
    saveRDS(data, paste0("../data/", infile ,"_geocoded.rds"))
    write.table(data, file=paste0("../data/", infile ,"_geocoded.csv"), sep=",", row.names=FALSE)
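The temp-file mechanism is easier to follow when separated from the geocoding. The sketch below is an illustration under stated assumptions, not part of the script: process_from_checkpoint() and its stop_after argument are hypothetical, the "work" is just upper-casing letters, and the checkpoint lives in a temporary file. It simulates an interruption after two items and then resumes.

```r
# Hypothetical checkpoint file and work list, for illustration only.
checkpoint <- tempfile(fileext = ".rds")
items <- c("a", "b", "c", "d", "e")

# Process items one by one, saving progress after each. On start-up, any
# rows already in the checkpoint are treated as done. stop_after simulates
# an interruption after that many items in the current run.
process_from_checkpoint <- function(items, checkpoint, stop_after = Inf) {
  done <- data.frame()
  if (file.exists(checkpoint)) {
    done <- readRDS(checkpoint)
  }
  start <- nrow(done) + 1
  if (start > length(items)) {
    return(done)  # nothing left to do
  }
  for (i in start:length(items)) {
    if (i - start >= stop_after) break
    row <- data.frame(index = i, value = toupper(items[i]),
                      stringsAsFactors = FALSE)
    done <- rbind(done, row)
    saveRDS(done, checkpoint)
  }
  done
}

first_run <- process_from_checkpoint(items, checkpoint, stop_after = 2)
second_run <- process_from_checkpoint(items, checkpoint)
print(nrow(first_run))
print(second_run)
```

The resume point is one past the number of saved rows, so the second run neither repeats the last saved item nor skips the next one.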
Let me know if you find a use for the script, or if you have any suggestions for improvements.
Please be aware that it is against the Google Geocoding API terms of service to geocode addresses without displaying them on a Google map. Please see the terms of service for more details on usage restrictions.