Geocoding 18 million addresses with PostGIS Tiger Geocoder
Summary
- This post covers the background, possible approaches, and the Windows and Linux environment setup for my geocoding task.
- See more details about the script and workflow in the next post.
Background
I found I needed to geocode lots of addresses for my Red Cross Smoke Alarm Project. The NFIRS data contain 18 million addresses across 9 years, and I would like to
- verify all the addresses, because many inputs have quality problems;
- map street addresses to coordinates, so we can map them and do more geospatial analysis;
- map street addresses to census blocks, so we can link NFIRS data to other public data like the census data of the American Community Survey (ACS) and the American Housing Survey (AHS).
Possible Approaches
I did some research on the possible options:
- Online services. Most free online APIs have usage limits, and paid services would be too expensive for my task. Surprisingly, the FCC has an API to map coordinates to census blocks that doesn't mention a limit, but it cannot geocode street addresses to coordinates.
- This company provides a service on Amazon EC2 for a fee. They have more information about their setup on GitHub. What I did is actually a similar approach, but in a totally DIY way.
- Set up your own geocoder. Tiger geocoder is a PostGIS extension that uses Census TIGER data to geocode addresses.
PostGIS works on both Windows and Linux, and Enigma.io has shared an automated Tiger geocoder setup tool for Linux. However, the Tiger database itself needs 105 GB of space and I didn't have a Linux box for that (the Amazon AWS free tier only allows 30 GB of storage), so I decided to install PostGIS on Windows and experiment with everything first.
Windows Setup
I needed to install the PostgreSQL server, the PostGIS extension and the Tiger geocoder extension. This is a very detailed installation guide for PostGIS on Windows. I'll just add some notes from my experience:
- It's best to install the database on an SSD. My first setup was on an SSD with only two states' data, and the geocoding performance was pretty good. Then I needed to download all the states, so I had to move the database to a regular hard drive according to this guide (note that the data folder path value cannot have a trailing backslash, otherwise the PostgreSQL service will just fail). After that, the geocoding performance dropped considerably.
- pgAdmin is easy to use. I used SQL Query and View Data (or View Top 100 Rows if the table is huge) a lot. The Explain Analyze function in the SQL Query tool is also very intuitive.
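The SQL form of that last feature works in any client by prefixing a query. A trivial sketch (the `tiger.state` table only exists once the national data described below is loaded):

```sql
-- show the actual execution plan and timing for a spatial query
EXPLAIN ANALYZE
SELECT count(*)
FROM tiger.state
WHERE ST_Contains(the_geom, ST_SetSRID(ST_MakePoint(-77.0, 38.9), 4269));
```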
With the server and extensions installed, I needed to load the Tiger data. The Tiger geocoder provides functions that generate scripts to download Tiger data from the Census FTP site and set up the database. The official documentation didn't provide enough information for me, so I had to search and tweak a lot. At first I tried the commands from the SQL Query tool, but it didn't show any result. Later I solved this problem with hints from this guide, although it was written for Ubuntu.
- You need to install 7z.exe and the Windows version of wget, and record their paths.
- Create a directory for downloads. PostgreSQL needs to have permission for that folder; I just created it at the same level as the PostgreSQL database folder and gave the user group `Authenticated Users` full control of both. If you write a SQL `COPY` command to read a CSV file from some other folder that doesn't have this permission, you can get a permission denied error.
- Start pgAdmin, connect to the GIS database you created during installation, and run the psql tool from pgAdmin. Input `\a` and `\t` to set up the output format first, set the output file with `\o nation_generator.bat`, then run `SELECT loader_generate_nation_script('windows');` to generate the script that loads the national tables. The output is a file with the name you gave to `\o`, located in the same folder as psql.exe, which should be the PostgreSQL bin folder; the full session is collected in the sketch below.
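Collected in one place, the psql session for the national script looks like this (a sketch; the output file name is just whatever you pass to `\o`, and `\a` plus `\t` switch to unaligned, tuples-only output so the file contains nothing but the script):

```
\a
\t
\o nation_generator.bat
SELECT loader_generate_nation_script('windows');
\o
```

The state loader script described below is generated the same way, with a different output file and `loader_generate_script()` in place of the nation function.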
- Technically, you can enter parameters specific to your system in the tables `loader_variables` and `loader_platform` under the `tiger` schema. However, after I entered my parameters, only the staging folder (i.e. where to download data to) was picked up in the generated script. My guess is that a file path with spaces needs to be properly escaped and quoted: the script-generating function reads from the database and then writes to a file, which means the file path goes through several different internal representations, making the escaping and quoting more complicated. I just replaced the default parameters with mine in the generated script afterwards. Update: I found this answer later; I probably should have used the `SET` command instead of directly editing the table columns. Anyway, replacing the settings afterwards still works, but you need to double-check them.
- All the parameters are listed in the first section of the generated script, and `cd your_stage_folder` is used several times throughout the script. You need to edit the parameters in the first section and make sure the stage folder is correct in all places. After the national data is loaded by running the script, you can specify the states you want to load. The Tiger database actually supports 56 states/regions; you can list them with
`select stusps, name from tiger.state order by stusps;`
- Start psql again, go through similar steps, and run `SELECT loader_generate_script(ARRAY['VA','MD'], 'windows');`, putting the abbreviations of the states you want in the array. Note that if you copy the query results, they will be quoted with double quotes by default, but you need single quotes in SQL; you can change this pgAdmin output setting under Options - Query tool - Results grid.
- The generated script has one section per state, and each section sets its parameters at the beginning. You need to replace the parameters and the `cd your_stage_folder` lines with the correct values. An editor that supports multi-line search and replace makes this much easier.
- I didn't want to load 56 states with one script: if anything went wrong, it would be bothersome to restart from the last stopping point. I wanted to split the big script into 56 scripts, one per state. I searched for a while and didn't find a tool to do this, so I just wrote a Python script.
First, add a marker to the script to separate the states. I replaced all occurrences of

```
set TMPDIR=e:\data\gisdata\temp\\
```

with

```
:: ---- end state ----
set TMPDIR=e:\data\gisdata\temp\\
```

then deleted the `:: ---- end state ----` marker on the first line. This makes the marker appear at the end of each state section. Note that `::` is a comment symbol in DOS batch files, so it will not interfere with the script. Then I ran the following Python script to split it by state.
```python
__author__ = 'draco'
# Split the all-states loader script into separate scripts, one per state.
# Beforehand, replace every "set TMPDIR=..." line with
# ":: ---- end state ----\nset TMPDIR=..." and delete the first
# ":: ---- end state ----" line, so the marker ends each state section.
# Modify the input file path and output folder for your case.

text_file = open("e:\\Data\\all_states.bat", "r")
lines = text_file.readlines()
text_file.close()
print(len(lines))

sep = ":: ---- end state ----\n"
file_no = 1
temp = ""
for line in lines:
    if line != sep:
        temp += line
    else:
        # write out the section collected so far, then start the next one
        state_file = open("e:\\Data\\" + str(file_no).zfill(2) + ".bat", "w")
        state_file.write(temp)
        state_file.close()
        temp = line
        file_no += 1
# the last state section has no separator after it, so write it out here
if temp:
    state_file = open("e:\\Data\\" + str(file_no).zfill(2) + ".bat", "w")
    state_file.write(temp)
    state_file.close()
```
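After running each state's script, a quick sanity check helps before moving on. A minimal sketch, assuming the loader's standard table naming (for example, `va_edges` for Virginia under the `tiger_data` schema):

```sql
-- spot-check that a state's data actually landed
SELECT count(*) FROM tiger_data.va_edges;
-- rebuild any indexes the loading process missed
SELECT install_missing_indexes();
```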
Linux Setup
After I moved the PostgreSQL database to a regular hard drive because of the storage limit, the geocoding performance was very low. Fortunately I got generous support from DataKind with their AWS resources, so I could run the geocoding task on an Amazon EC2 server. I wanted to test everything as comprehensively as possible before deploying an expensive EC2 instance, so I decided to do everything with the Amazon EC2 free tier first. The free tier only allows 30 GB of storage and 1 GB of RAM, but I could test with 2 states first.
I used the ansible playbook from Enigma to set up the AWS EC2 instance. Here are some notes:
- Be careful when using the AWS free tier: anything not explicitly marked as free tier could cost money.
- For a test server with the national tables and 2 states' data, the free tier t2.micro instance is actually enough: you can normalize addresses, map census blocks for all states, and geocode addresses for those 2 states (see the sketch after this list). Actually, my 2-state free server with 1 GB of RAM was much faster than the t2.large 8 GB server with all states loaded. I'll discuss this in more detail in another post about the geocoding script.
- I used Bitvise as my SSH and SFTP tool, since it seemed to be an enhanced PuTTY.
- I either use nano as the editor on the AWS instance (for an error with nano history), or just download the file to be edited over SFTP, edit it, then upload it back (sometimes I need to convert the script with `dos2unix`). It's much easier to edit multiple places with Sublime.
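To give a flavor of what the test server runs, here is a minimal sketch of geocoding one address and mapping a coordinate to its census block. The address is only an example and requires that state's data to be loaded, and the `tabblock` column names assume the standard Tiger loader schema:

```sql
-- geocode one address: rating 0 is an exact match, larger is worse
SELECT g.rating,
       ST_X(g.geomout) AS lon,
       ST_Y(g.geomout) AS lat,
       pprint_addy(g.addy) AS normalized_address
FROM geocode('6100 Main St, Houston, TX 77005', 1) AS g;

-- map a coordinate to its census block (TIGER geometries use SRID 4269)
SELECT tabblock_id
FROM tiger.tabblock
WHERE ST_Contains(the_geom, ST_SetSRID(ST_MakePoint(-95.4, 29.7), 4269));
```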
After lots of experimentation I had my batch geocoding workflow ready, so I started to set up the full-scale server with the DataKind resources.
- Interestingly, `sudo` didn't work on the t2.large instance. I searched and found it's because the private IP is not in the hosts file. The problem can be solved by adding the machine name to the hosts file, but how can you edit the hosts file without `sudo`? Finally I used this to solve the problem:

```
sudo passwd root
su root
nano /etc/hosts
su ubuntu
```
- The command for running the ansible playbook from the Enigma repo uses `\` to extend one line into multiple lines. My first try didn't copy the newline after each `\` correctly (because I was using a Firefox extension that automatically copies on select) and the command couldn't run, but the error message was so misleading that I didn't realize this was the cause.
- Although the Tiger database takes 105 GB in the end, the downloading and provisioning need more space for temporary files. My first attempt with 120 GB of storage filled up during provisioning, so I had to start again with 180 GB.
- GNU Screen lets you use the same terminal window to switch between tasks or even split windows, so you can leave a process running while detached from the screen. It's essential for running and controlling time-consuming tasks. Here is a cheat sheet and a quick list.
- I enabled the color prompt of bash by uncommenting `#force_color_prompt=yes` in `~/.bashrc`. When you need to scroll through a long command-line history or read many lines of output, a colored prompt separates the commands from the output well.
- You may need to use psql a lot, so I placed a `.pgpass` file in my user folder (each line has the format `hostname:port:database:username:password`; change the file's permission with `chmod 0600 ~/.pgpass`). I also set several other options in the `.psqlrc` file in the user folder, including a color prompt, timing on, vertical output, etc.:

```
\timing
\x auto
\set COMP_KEYWORD_CASE upper
\set PROMPT1 '%[%033[1;33;40m%]%n@%/%R%[%033[0m%]%# '
```
I was not satisfied with the geocoding performance, so I experimented with tuning the PostgreSQL configuration. This post and this guide helped me most. The average time needed to geocode one address in a 200-record sample dropped from 320 ms to 100 ms.
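I won't repeat the guides here, but the changes were of this general shape. This is a sketch only: the values below are illustrative for a machine with around 8 GB of RAM, not the exact settings I used, and on PostgreSQL versions before 9.4 you would edit postgresql.conf directly instead of using `ALTER SYSTEM`:

```sql
-- raise the conservative memory defaults; shared_buffers needs a restart
ALTER SYSTEM SET shared_buffers = '2GB';         -- roughly 1/4 of RAM is a common rule
ALTER SYSTEM SET work_mem = '128MB';             -- per sort/hash operation
ALTER SYSTEM SET maintenance_work_mem = '512MB'; -- helps index builds during loading
ALTER SYSTEM SET effective_cache_size = '6GB';   -- a planner hint, not an allocation
```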
Geocoding Script And Work Flow Setup
I'll discuss the geocoding script and my workflow for batch geocoding in the next post.
Version History
- 2015-11-17 : First version.
- 2016-05-11 : Added Summary.