By Earl F Glynn | Franklin Center
The Montana Secretary of State sells two files of voter registration records for a fee of $1000:
- Statewide Voter File
- Voter History File
The Montana Secretary of State provides information about the two files on this page.
I wrote an R script to scrutinize these two Montana voter registration files, looking for data quality issues and looking for evidence of how well the files are maintained by the Montana county clerks.
Were the 230 voters sharing the same Jan. 1, 1980 birthday part of a rare event, a clerical error, a computer glitch, or possible voter fraud in 1998?
MT.R script
The MT.R script scrutinizes the Montana voter registration files.
Briefly this is what the R script does:
- Reads the Statewide Voter File with all fields as strings.
- Computes a cross-tabulation of voter status by county and writes the results to a file. This PDF version of the file shows the number of active and inactive voters by county. (The inactive voters are a measure of voter bloat). Another PDF shows the Voter_Status_Reason for each Voter_Status category.
- Computes frequency counts of each field in the voter file and writes results to a separate file by field. See frequency count files here after reviewing changes to these files.
- Creates a summary of metadata about each data field, including statistics on field size and number of records with the field defined and missing.
- Reads the Voter History File with all fields as strings.
- Compares IDs in the History File to IDs in the Voter File to show there was history for voters not in the voter file. (This was likely caused when the voter file was purged but the history file was not.)
- Creates frequency counts of fields: ballot_county and election_type (see this page for interpretation of election_types), and voting_method (1=Absentee, 2=Polling Place, 3=Provisional, 4=Vote by Mail).
- Cleans up and converts Election_date fields from character to a date format YYYY-MM-DD. (Sorting these dates as strings would put them in chronological order.) Creates frequency counts by election_date. (Note the earliest election_date is for the 1984 general election. Voter history reporting and completeness likely vary by county.)
- Uses R’s aggregate function to count the number of ballots cast per voter (number of history records) and to determine the Election_date of the most recent ballot cast.
- Uses R’s merge function (much like an SQL join) to connect the history data, Ballots and LastElection by voter ID.
- Uses R’s merge function to add voter history stats (Ballots, LastElection) to each Voter File record.
- Selects a subset of Voter File fields for voters who last voted in 2006 or earlier (11,445 records). Here is a subset list of 64 voters who last cast ballots in 1992 or earlier — 20+ years ago.
- Computes a cross-tabulation of voter status by county for voters who last voted in 2006 or earlier. This is to learn if those who have not voted in some time have already been tagged as “inactive.”
Additional analysis of the voter file and history may be added in the future.
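The aggregate and merge steps in the list above can be sketched on toy data. This is a minimal illustration, not the MT.R script itself: the data values are invented, and the column names Voter_ID, County, and Voter_Status are assumptions made for the example (Ballots, LastElection, and Election_date follow the description above).

```r
# Toy stand-ins for the two files; all values are invented for illustration.
voters <- data.frame(
  Voter_ID     = c("1", "2", "3", "4"),
  County       = c("Park", "Park", "Wibaux", "Wibaux"),
  Voter_Status = c("Active", "Inactive", "Active", "Inactive"),
  stringsAsFactors = FALSE
)
history <- data.frame(
  Voter_ID      = c("1", "1", "2", "3", "3", "3"),
  Election_date = c("2008-11-04", "2010-11-02", "2004-11-02",
                    "1996-11-05", "2006-11-07", "2011-11-08"),
  stringsAsFactors = FALSE
)

# Ballots cast per voter = number of history records per voter ID.
ballots <- aggregate(history["Election_date"], by = history["Voter_ID"],
                     FUN = length)
names(ballots)[2] <- "Ballots"

# Most recent ballot: YYYY-MM-DD strings sort chronologically, so max() works.
last <- aggregate(history["Election_date"], by = history["Voter_ID"],
                  FUN = max)
names(last)[2] <- "LastElection"

# Join the two summaries by voter ID, then attach them to the voter records.
# all.x = TRUE keeps voters with no history (their stats are NA).
stats    <- merge(ballots, last, by = "Voter_ID")
combined <- merge(voters, stats, by = "Voter_ID", all.x = TRUE)

# Voters whose last ballot was in 2006 or earlier, cross-tabulated by status.
stale <- subset(combined, !is.na(LastElection) & LastElection <= "2006-12-31")
status_by_county <- table(stale$County, stale$Voter_Status)

print(combined)
print(status_by_county)
```

Using all.x = TRUE in the second merge is a design choice for this sketch: it keeps voters with no history in the result so they can be counted separately.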
Statewide Voter File
CharCount. I use a CharCount utility to look at a new file to learn what is in it statistically at the ASCII character level. This graphic shows the results for the Montana Voter file.
CharCount shows 633,241 x’0a’ (line feed) and 633,241 x’0d’ (carriage return) characters in the file. These paired line endings are expected in ASCII files created under Windows.
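A byte-level tally of this kind can be approximated in base R. The sketch below is not the CharCount utility; it counts bytes in a small in-memory sample, and the sample text is invented for illustration.

```r
# Two Windows-style lines (CR LF endings), tab-delimited; text is invented.
sample_text <- "ID\tNAME\r\n1\tSMITH\r\n"
bytes <- charToRaw(sample_text)

# Tally how often each byte value occurs, labelled in hex like x'0a'.
byte_counts <- table(sprintf("x'%02x'", as.integer(bytes)))

# With consistent Windows line endings, the LF and CR counts match.
n_lf  <- sum(bytes == as.raw(0x0a))
n_cr  <- sum(bytes == as.raw(0x0d))
n_tab <- sum(bytes == as.raw(0x09))

print(byte_counts)

# For a real file, the bytes could be read with readBin(), for example:
# bytes <- readBin(path, what = "raw", n = file.size(path))
```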
The 22.8 million tab (x’09’) characters are not surprising since the file is tab-delimited. But some simple math shows a minor problem: 633,241 lines * 36 tabs/line = 22,796,676 tabs, while CharCount found only 22,796,675, one fewer than expected.
This “missing” tab was at the end of the first line. Oddly, though, the header was not really missing a tab; every other row had an extra one.
The header row of column labels used a tab as a separator and had no tab at the end of the line. The remaining voter lines used a tab as a terminator and ended each field with a tab. This minor problem caused an extra null field in R that was simply deleted by the MT.R script after fixing problems with the column names.
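The tab arithmetic and the trailing-tab cleanup can be sketched as follows. This is one possible approach on an invented three-line sample, not necessarily how MT.R handles it; the column names ID, FIRSTNAME, and COUNTY are placeholders.

```r
# Arithmetic from the text: 36 tabs on every line would give this many tabs.
expected_tabs <- 633241 * 36
observed_tabs <- 22796675            # count reported by CharCount
tab_shortfall <- expected_tabs - observed_tabs

# Toy file: header uses tabs as separators, data rows use them as terminators.
file_lines <- c("ID\tFIRSTNAME\tCOUNTY",
                "1\tANN\tPARK\t",
                "2\tBOB\tWIBAUX\t")

# Count tabs per line; the header has one fewer than the data rows.
tabs_per_line <- lengths(regmatches(file_lines,
                                    gregexpr("\t", file_lines, fixed = TRUE)))

# Read the data rows without the header, keeping every field as a string.
rows <- read.delim(text = file_lines[-1], header = FALSE,
                   colClasses = "character", quote = "")

# Drop columns that are entirely empty (the artifact of the trailing tab),
# then apply the column names taken from the header line.
is_empty <- vapply(rows, function(col) all(is.na(col) | col == ""), logical(1))
clean <- rows[, !is_empty, drop = FALSE]
names(clean) <- strsplit(file_lines[1], "\t", fixed = TRUE)[[1]]

print(tabs_per_line)
print(clean)
```

Reading the header separately avoids relying on how the import function reacts when the header has fewer fields than the data rows.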
The other “problem” characters identified by CharCount include x’89’ (1), x’bd’ (105), x’c2’ (105), and x’c3’ (1). A pair of x’89’/x’c3’ characters was related to an accent mark in a French name. The 105 x’bd’/x’c2’ pairs seem to be related to a single character used for the symbol “1/2” in addresses. These are all benign problem characters, but they may not display or print consistently.
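These byte pairs are consistent with two-byte UTF-8 sequences: in UTF-8, x'c3' x'89' encodes É (U+00C9) and x'c2' x'bd' encodes ½ (U+00BD). The source does not state the file's encoding, so treating it as UTF-8 is an assumption. A short check in R:

```r
# UTF-8 encodes each of these characters as a two-byte sequence.
e_acute_upper <- charToRaw(intToUtf8(0x00c9))   # bytes c3 89
one_half      <- charToRaw(intToUtf8(0x00bd))   # bytes c2 bd
print(e_acute_upper)
print(one_half)

# Going the other way: decode a byte pair found in the file as UTF-8.
pair <- rawToChar(as.raw(c(0xc2, 0xbd)))
Encoding(pair) <- "UTF-8"
print(pair)
```

If a viewer or printer assumes a single-byte encoding instead, each pair shows up as two unrelated characters, which would explain inconsistent display.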
Descriptive Statistics. One approach to learn more about fields in a file is to look at simple descriptive statistics.
Many programs make all sorts of conversion assumptions and silently impose import “rules” that usually work. Before letting these automatic import conversions take place, I normally scrutinize a file treating all fields as character strings. I defer conversions to numeric fields or dates until I understand any problems in the data.
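The difference between automatic conversion and an all-strings import is easy to see on a small sample. The sketch below uses invented records; the column names are taken from fields mentioned in this article, except VOTER_ID, which is a placeholder.

```r
# Toy CSV; the leading zeros are the interesting part. Values are invented.
csv_text <- c("VOTER_ID,RA_ZIP_CODE,DOB",
              "000123,00000,1/1/1980 12:00:00 AM",
              "004567,59047,7/6/1951 12:00:00 AM")

# Default import: R guesses column types, so ID and ZIP become integers
# and their leading zeros are lost.
guessed <- read.csv(text = csv_text)

# Character import: every field arrives exactly as written in the file.
as_text <- read.csv(text = csv_text, colClasses = "character")

str(guessed)
str(as_text)
```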
What are the values seen in each field and are they appropriate?
The Counts directory shows a separate file of frequency counts of values observed for each field.
These frequency count files are in CSV format. They are sometimes best viewed in a plain-text editor, since Excel imposes its own rules on import and can distort the original character data.
Often “problem” values appear at the beginning or end of the frequency count files.
For example, some “problem” first names appear at the beginning of the 02-FIRSTNAME.csv file:
"FIRSTNAME","Count"
"",1
"""CONNIE""",1
"""NOKA""",1
"(NONE)",1
"`MICHELE",1
"A",281
One person has a blank first name. One person has “(NONE)” as a first name. Two others appear to have nicknames highlighted by quotation marks, and a third has a stray leading backquote.
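A frequency count file of this shape can be produced with table() and write.csv(). The sketch below uses an invented list of names and sorts by byte value so that blanks and punctuation land at the top; the sort order used by MT.R is not stated, so that choice is an assumption.

```r
# Invented sample of first names, including a few odd values.
first_names <- c("ANN", "A", "", "\"CONNIE\"", "(NONE)", "ANN", "A", "BOB")

# Frequency count of every distinct value, as a two-column data frame.
counts <- as.data.frame(table(FIRSTNAME = first_names),
                        stringsAsFactors = FALSE)
names(counts)[2] <- "Count"

# Order by byte value so blanks and punctuation sort ahead of letters.
counts <- counts[order(counts$FIRSTNAME, method = "radix"), ]

# write.csv doubles embedded quotation marks, so a name stored with literal
# quotes around it is written with three quote characters on each side.
write.csv(counts, row.names = FALSE)
```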
Data Quality. The MT.R script created a summary file with some metadata about each of the data fields in the statewide voter file of 633,240 voters. Comments about data problems from reviewing the frequency count information in the Counts directory were included in this summary.
The biggest data quality problem was the missing RA_CITY and RA_ZIP_CODE fields (RA = residence address) for 11,274 voters.
Several minor problems were found in various other fields.
One voter from Flathead County is missing a FirstName field and a voter from Missoula County has a FirstName of “(NONE)”. Perhaps both are accurate and are not data problems.
Some names seem to have crept into the NamesSuffix field and some numbers have crept into the MiddleName field.
A few ZIP codes are outside the expected range for Montana. Some were from a ZIP code apparently shared with South Dakota, but some were invalid (00000) or completely out of range (e.g., 82725). [Validation of all the ZIP codes within the Montana 59001-59937 range was not done.]
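A range check like the one described can be done while the ZIP codes are still strings. The sketch below uses invented values and the 59001-59937 range quoted above; it only tests the range and does not confirm that a ZIP code is actually assigned.

```r
# Invented sample; 59001-59937 is the Montana range quoted in the text.
zips <- c("59047", "00000", "82725", "59937", "5904", "")

# A well-formed ZIP is exactly five digits; check that before converting.
well_formed <- grepl("^[0-9]{5}$", zips)
zip_num <- ifelse(well_formed, suppressWarnings(as.integer(zips)), NA_integer_)

# In range only if well formed and numerically inside the Montana range.
in_mt_range <- well_formed & zip_num >= 59001 & zip_num <= 59937

print(data.frame(zips, well_formed, in_mt_range))
```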
The date-only DOB (Date of Birth) field should not have a time subfield. It’s unclear why a time subfield is needed for EFF_REGN_DATE and why only a few are a time other than “12:00:00 AM”.
Many voters at same address. The frequency counts for the RESIDENCEADDRESS field show a number of examples of 10 or more voters at the same address. These could be checked to see if there is an explanation for so many voters at the same address.
Voter History File
The CharCount utility showed the Montana Voter History File consisted of normal ASCII characters.
The history file had 5,285,042 history records for 602,650 unique voter IDs.
On average each voter has about 9 history records. Some voters have no history, but voter David Bertelsen from Wibaux County cast 41 ballots from 1996 through 2011.
About 93% of the 633,240 voters in the voter file have voter history records, but there were 11,329 voters with history that were not in the voter file. Apparently, when maintenance is performed on the Voter File, the History File is not updated.
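The ID comparison amounts to a set difference. The sketch below shows the idea on invented IDs and then repeats the arithmetic with the counts reported above; the variable names are placeholders.

```r
# Invented IDs standing in for the two files.
voter_ids   <- c("1", "2", "3", "4", "5")
history_ids <- c("1", "1", "2", "4", "4", "9", "9", "8")

# History IDs with no matching voter record (orphaned history).
orphans <- setdiff(unique(history_ids), voter_ids)

# Share of voters that have at least one history record.
share_with_history <- mean(voter_ids %in% history_ids)

# The same arithmetic with the counts reported in the text.
in_both <- 602650 - 11329            # unique history IDs also in the voter file
share_reported <- in_both / 633240   # about 0.93

print(orphans)
print(share_with_history)
print(round(share_reported, 2))
```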
Like the Voter File, the dates in the History File have unneeded time subfields that are mostly “12:00:00 AM”.
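Stripping the time subfield and converting to YYYY-MM-DD can be sketched as below. The raw format shown (month/day/year followed by a time) is an assumption based on the “12:00:00 AM” and 1/1/1998 examples in this article; the actual file layout may differ.

```r
# Assumed raw format: month/day/year followed by a time subfield.
raw_dates <- c("11/3/1998 12:00:00 AM", "1/1/1998 12:00:00 AM",
               "11/6/1984 12:00:00 AM")

# Drop everything after the first space, then parse the date part.
date_part <- sub(" .*$", "", raw_dates)
parsed    <- as.Date(date_part, format = "%m/%d/%Y")

# Store as YYYY-MM-DD strings; these sort chronologically as plain text.
iso <- format(parsed, "%Y-%m-%d")
print(sort(iso))
```

Any value that fails to parse comes back as NA, so counting NAs after the conversion is a quick way to find malformed dates.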
Summaries below show how many Montana voters have not voted in years.
Rare event, clerical error, computer glitch, or possible voter fraud in 1998?
A second look at the frequency counts for voter date of birth shows that 230 people have the birthday Jan. 1, 1980.
The chart below shows only about 20 to 40 voters in Montana share a given birthday, so the 230 is statistically quite unlikely. Based on the chart, the 230 number is about 180 too high for any given date from Jan 1, 1900 to the present.
The chart shows an anomaly with 96 voters having a Jan. 1, 1901 birthday, which was likely a code for “unknown birthday.” This is common in many voter registration lists in other states, especially when many county files are merged into a single state file.
Except for Jan. 1, 1980 and the 1901 anomaly, the highest observed counts among voters were 62 with a birthday of July 6, 1951 and 58 with a birthday on May 11, 1984 or Dec. 1, 1954. Again, this shows how unlikely the 230 number is.
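Finding over-represented birth dates is a frequency count followed by a comparison against a typical value. The sketch below uses invented dates, and the threshold of three times the median is arbitrary and chosen only for illustration; it is not the method used to produce the chart.

```r
# Invented DOB values; one date is deliberately over-represented.
dob <- c(rep("1980-01-01", 12), rep("1951-07-06", 3),
         rep("1984-05-11", 2), "1975-03-09", "1962-10-30")

# Count voters per birth date and list the most common dates first.
dob_counts <- sort(table(dob), decreasing = TRUE)
print(head(dob_counts, 3))

# Flag dates whose count is far above the typical (median) count.
# The factor of 3 is an arbitrary threshold chosen for this sketch.
typical <- median(as.vector(dob_counts))
suspect <- names(dob_counts)[as.vector(dob_counts) > 3 * typical]
print(suspect)
```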
Some statistics suggest the irregularities occurred in Livingston city; Park, Big Horn and Jefferson Counties; House Districts 41, 42, 62; Senate Districts 21, 31. But it’s unclear how that information might be significant or connected. Was there a common processing problem in these areas, or was there possibly an effort to get more voters for some reason?
This likely happened in 1998 based on the most common (modal) effective registration date of 1/1/1998 among these 230, when they would have been 18 years old.
Voter history for these 230 voters shows 127 voted in the 1998 federal general election on 11/3/1998.
Jan. 1 birthdays were too common in a 2010 Missouri House race, but the Missouri case involved a large migration of Somalis to the Missouri district.
Too many birthdates for a given day may be an indication of the duplication of a number of voter registration application forms.
Voter Bloat Summary
- Inactive Voters
Almost 1 in 4 Montana voters are “inactive.”
- Voters who have not voted in years
This list shows all 11,445 Montana voters who have no voter history for casting ballots since 2006.
Almost 80% of these 11,445 voters are marked as “inactive” in the voter file. This means many of these voters may be purged as part of the NVRA process, possibly in the near future, though it could take years.
This summary shows the breakdown by Montana County.
The table above shows Lewis and Clark County has 2,147 of the 9,111 inactive voters on the list of 11,445 voters from 2006 or before.
Other Questions
- Why is there such a range in the number of voters in Montana House Districts? House District 24 has 3,571 voters while House District 67 has 9,868 voters.
- Why is there such a range in the number of voters in Montana Senate Districts? Senate District 12 has 8,605 voters while Senate District 35 has 17,967 voters.
NOTE (Sept. 12, 2012)
The county codes given online by the Montana Secretary of State appear to have switched Cascade and Carter counties:
Carter County should be before Cascade County alphabetically, which is more evidence the online page is wrong.
I discovered this switch when US Census and voter registration data for Carter and Cascade counties did not match well on charts.
Carter County is fairly small with about 1,160 people while Cascade County is much larger with about 81,000 people. If the two counties were approximately the same size the error would not have been discovered.
Related
- Montana: Comparison of Registered Voter Counts to Census Voting Age Population, Watchdog Labs, Sept. 19, 2012.
- Linda McCulloch: Audits show voter fraud nonexistent in Montana, Billings Gazette, Sept. 9, 2012.