Site icon R-bloggers

Spring Cleaning Data: 3 of 6- The Little but Big Correction

[This article was first published on OutLie..R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Building on the previous posts (post 1 & post 2) I found there were 12 instances with the type of credit where there was a “Primary*” which means the lender borrowed twice in the same day, in the 2010 q4 data. It would seem simple enough in Excel, use the filter, find the 12 instances, delete the “*” and be done. For R this turned out to be much more difficult to do. Mainly because the “*” is an operator used in multiplication, so it is hard to get rid of. What seemed to be a minor correction turned into something much bigger. Why would I go through all the trouble?

Mainly because I know that there will be cases where I will be dealing with data much larger than this one, that will not fit into Excel. Where I will have to be able to make corrections in R because I will not be able to do it otherwise. Plus, it is a good way to hone my data manipulation skills.

I used the following, but failed doing it (suggestions for improvement are welcomed):

  • Gsub()- did not work because of the ‘*’
  • Variable[condition]<-replacement- looking for a numerical value, which there is none
  • Ifelse()- did not find the difference between Primary and Primary*
  • Used the stringer package to try and delete the last character, but did not work

So what I did was use the subset() function to flush the problem variables out. Knowing there was 12 I replaced the wrong variables with the correction, using the rep() function. Then I made sure it was a factor using the as.factor().


#Correcting the Primary Credit* 
#Step 1 isolate the rows of data
tmp<-subset(x=dw.2010.q4, 
   subset=type.credit=='Primary Credit*')
 
#Step 2 change the column of data to the correct label
tmp$type.credit<-rep('Primary Credit', 12)
 
#Step 3 make sure the data is a factor
tmp$type.credit<-as.factor(tmp$type.credit)
summary(tmp)

I then erased the problem variables from the original data using the file[-which(file$variable==’Primary Credit*’),]. Note the use of the ‘-‘ before the which to remove those rows. Then I added the cleaned up rows to the data. I am not particularly interested in keeping the data in any particular order as the date will keep things in chronological order.

#Step 4 remove the Primary Credit* 
#rows of data from original dataframe
dw.2010.q4<-dw.2010.q4[-which(
   dw.2010.q4$type.credit=='Primary Credit*'),]
 
#Step 5 add the corrected data back into dataframe
dw.2010.q4<-rbind(dw.2010.q4, tmp)
 
#Check to see if correct
summary(dw.2010.q4)
To leave a comment for the author, please follow the link and comment on their blog: OutLie..R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.