Ack! Duplicates in the Data!

[This article was first published on Data and Analysis with R, at Work, and kindly contributed to R-bloggers.]

As I mentioned in a previous post, I compiled the data set that I’m currently working on in PostgreSQL.  To get this massive data set, I had to write a query that was massive by dint of the number of LEFT JOINs in it.  Today I caught myself wondering if I had remembered to add DISTINCT to the SELECT clause of my query, as that would prevent any obvious duplicates.  I looked and, whoops, I hadn’t.  It turns out that I didn’t have too many duplicates (only 135 out of about 395,000 rows), but I was still bothered.
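If you’re ever unsure whether a data set picked up duplicate keys, a quick check in R (on a toy data frame standing in for the real one) might look like this:

```r
# Toy stand-in for the exported data set; the real one had ~395,000 rows
big.dataset <- data.frame(ID = c(1, 2, 2, 3))
# duplicated() flags every occurrence of an ID after its first one
sum(duplicated(big.dataset$ID))  # 1 duplicated row here
```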

I took a look at the duplicates and found a small batch of them in the main name and contact table: records sharing the same ID number, where one record had actual address and contact info while the other was pretty bare (in other words, missing information in various contact fields).  Although each affected record had only one duplicate, the many LEFT JOINs that I did multiplied those duplicates into the roughly 135 duplicate rows I found.
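To see how a join multiplies duplicates like that, here’s a toy illustration (made-up tables, not the real schema). In R, merge() with all.x = TRUE behaves like a LEFT JOIN:

```r
# One ID appears twice in the contact table: a full record and a bare one
contacts <- data.frame(ID = c(1, 1), email = c("a@x.com", NA))
# Three child rows match that ID in a joined table
orders <- data.frame(ID = c(1, 1, 1), order_no = c(10, 11, 12))
# The single duplicated ID turns 3 matching rows into 6 joined rows
joined <- merge(contacts, orders, by = "ID", all.x = TRUE)
nrow(joined)  # 6
```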

I really didn’t want to export the data again from the database and redo all the calculations I had made, so I had to find some way of cleaning up all these duplicates from within R.  Thankfully, I had turned the many contact fields in the data into dummy variables in view of the data mining that I had to do later on (I love dummy variables!).  This then made it possible to distinguish duplicates based on whether they had more or less fields filled out, enabling me to remove them, as shown in the script below:
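For what it’s worth, a minimal sketch of that dummy-coding step (with invented field names, not the real contact columns) is just an emptiness check per field:

```r
# Hypothetical contact fields with some blanks and NAs
contact <- data.frame(email = c("a@x.com", NA),
                      phone = c("", "555-1234"),
                      stringsAsFactors = FALSE)
# 1 if the field has anything in it, 0 if it's NA or an empty string
dummies <- ifelse(is.na(contact) | contact == "", 0, 1)
```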

# These column numbers represent fields with name/contact info that I've
# marked with 1s and 0s depending on whether or not there's anything in
# the field.
bio_cols = c(5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,23,24,25,26)
# Now we get the row numbers of all the records with duplicate IDs
dupe_id_rows = which(duplicated(big.dataset$ID))
# Now we get the actual ID values that were duplicated
dupe_ids = big.dataset$ID[duplicated(big.dataset$ID)]
# The following line gives us a sum of the number of fields, for each
# record, that have information in them
info_sums = apply(big.dataset[dupe_id_rows,bio_cols],1,sum)
# Next we build a data frame that glues this information together. The
# goal is to isolate those rows representing records with missing info,
# and then delete them from the larger data set
a = data.frame(sums=info_sums,ids=dupe_ids,rows=dupe_id_rows)
# Now we order the data frame by the ID values
a = a[order(a[,2]),]
# Here we get the mean number of fields, for each ID value, filled out
# between duplicates that have more information, and duplicates that have
# missing information
a$avg = rep(tapply(a$sums, a$ids, mean), tapply(a$sums, a$ids, length))
# Here we make a column that marks a row for deletion (with a 1 instead
# of a 0) if the sum of fields that have information in them is less
# than the mean number for that ID value. Those rows that are marked
# are the duplicates that are missing info.
a$del = ifelse(a$sums < a$avg,1,0)
# The following is a full list of unique row numbers from the bigger
# dataset that constitute duplicates that are missing information.
rows_to_del = a$rows[a$del == 1]
# Here we delete those rows from the bigger dataset
big.dataset = big.dataset[-rows_to_del,]
# Since all remaining dupes are exact duplicates, this line removes them
big.dataset = big.dataset[!duplicated(big.dataset$ID),]
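To sanity-check the logic, the script above can be run end to end on a small made-up data frame (toy IDs and dummy columns, not the real data). Here ID 2 appears three times after the joins, and one of its flagged copies has no contact info:

```r
# Toy stand-in for big.dataset; email/phone are already dummy-coded
big.dataset <- data.frame(ID    = c(1, 2, 2, 2, 3),
                          email = c(1, 1, 1, 0, 1),
                          phone = c(1, 1, 1, 0, 0))
bio_cols <- c(2, 3)
dupe_id_rows <- which(duplicated(big.dataset$ID))        # rows 3 and 4
dupe_ids <- big.dataset$ID[duplicated(big.dataset$ID)]
info_sums <- apply(big.dataset[dupe_id_rows, bio_cols], 1, sum)
a <- data.frame(sums = info_sums, ids = dupe_ids, rows = dupe_id_rows)
a <- a[order(a[, 2]), ]
a$avg <- rep(tapply(a$sums, a$ids, mean), tapply(a$sums, a$ids, length))
a$del <- ifelse(a$sums < a$avg, 1, 0)
rows_to_del <- a$rows[a$del == 1]                        # row 4, the bare copy
big.dataset <- big.dataset[-rows_to_del, ]
big.dataset <- big.dataset[!duplicated(big.dataset$ID), ]
nrow(big.dataset)  # 3 rows left, one per ID
```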

Well, it worked!  I no longer have a problem with duplication.  Hopefully I’ll remember to use DISTINCT in my SQL queries in the future :)

