
Geocoding 18 million addresses with PostGIS Tiger Geocoder

[This article was first published on From Learning and Evolution to Data Science, and kindly contributed to R-bloggers.]

Summary

Background

I needed to geocode a large number of addresses for my Red Cross Smoke Alarm Project. The NFIRS data contain 18 million addresses across 9 years of records, and I would like to geocode all of them.

Possible Approaches

I did some research on the possible options:

PostGIS works on both Windows and Linux, and Enigma.io has shared their automated Tiger Geocoder setup tool for Linux. However, the Tiger database itself needs 105 GB of space and I didn't have a Linux box for that (the Amazon AWS free tier only allows 30 GB of storage), so I decided to install PostGIS on Windows and experiment with everything first.

Windows Setup

I needed to install the PostgreSQL server, the PostGIS extension, and the Tiger geocoder extension. There is a very detailed installation guide for PostGIS on Windows; I'll just add some notes from my experience:
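With the server installed, the extensions are enabled from an SQL session. A minimal sketch (run once, connected to the database that will hold the Tiger data):

```sql
CREATE EXTENSION postgis;
CREATE EXTENSION fuzzystrmatch;          -- fuzzy string matching, required by the geocoder
CREATE EXTENSION postgis_tiger_geocoder; -- creates the tiger schema and geocoding functions
```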

With the server and extensions installed, I needed to load the Tiger data. The Tiger geocoder provides functions that generate loader scripts for you, which download Tiger data from the Census FTP site and set up the database. The official documentation didn't provide enough information for me, so I had to search and tweak a lot. At first I tried the commands from the SQL query tool but they didn't show any result. Later I solved this problem with hints from this guide, although it was written for Ubuntu.
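For reference, the script-generating calls look roughly like this; the state list is just an example, and the output of each call is the text of a batch script to save and run:

```sql
-- generate the nation-level loader script (county and state lookup tables)
SELECT Loader_Generate_Nation_Script('windows');

-- generate per-state loader scripts, e.g. for DC and Virginia
SELECT Loader_Generate_Script(ARRAY['DC', 'VA'], 'windows');
```

On Linux the platform argument is 'sh' instead of 'windows'.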

splitStates.py
__author__ = 'draco'
# Split the all-states loader script into separate scripts, one per state.
# Beforehand, replace every "set TMPDIR=..." line in the loader script with
# ":: ---- end state ----\nset TMPDIR=...", then delete the first
# ":: ---- end state ----" line.
# Modify the input file path and output folder for your case.

with open("e:\\Data\\all_states.bat", "r") as text_file:
    lines = text_file.readlines()
print(len(lines))

sep = ":: ---- end state ----\n"
file_no = 1
temp = ""
for line in lines:
    if line != sep:
        temp += line
    else:
        # flush the accumulated chunk as one per-state batch file
        with open("e:\\Data\\" + str(file_no).zfill(2) + ".bat", "w") as state_file:
            state_file.write(temp)
        temp = line
        file_no += 1
# the last state's chunk is not followed by a separator, so flush it here
if temp:
    with open("e:\\Data\\" + str(file_no).zfill(2) + ".bat", "w") as state_file:
        state_file.write(temp)

Linux Setup

After I moved the PostgreSQL database to a regular hard drive because of the storage limit, geocoding performance was very low. Fortunately I got generous support from DataKind with their AWS resources, so I could run the geocoding task on an Amazon EC2 server. I wanted to test everything as comprehensively as possible before deploying an expensive EC2 instance, so I decided to do everything with the Amazon EC2 free tier first. The free tier only allows 30 GB of storage and 1 GB of RAM, but I could test with 2 states first.
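To sanity-check a loaded state before scaling up, a single-address query like the following can be used (the address here is just an example):

```sql
SELECT g.rating,                     -- match quality; 0 is a perfect match
       pprint_addy(g.addy) AS address,
       ST_X(g.geomout) AS lon,
       ST_Y(g.geomout) AS lat
FROM geocode('1600 Pennsylvania Ave NW, Washington, DC 20500', 1) AS g;
```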

I used the Ansible playbook from Enigma to set up the AWS EC2 instance. Here are some notes:

After lots of experimentation I had my batch geocoding workflow ready, and then I started to set up the full-scale server with the DataKind resources.

Geocoding Script and Workflow Setup

I’ll discuss the geocoding script and my workflow for batch geocoding in the next post.

Version History
