Site icon R-bloggers

RObservations #4 Using Base R to Clean Data

[This article was first published on r – bensstats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A friend of mine had some data which was mixed with character strings and was interested in looking at the numeric data only; Because the data set was quite large, cleaning it manually wasn’t viable.

Besides for being too great of a task to do manually – tampering with the raw data can be very dangerous if you don’t have a way to track the changes or know what was done to it. This is why data cleaning should always be handled with a script and/or with a pipeline established without tampering with the raw data itself.

In this blog we’re going to look at a quick trick that I found useful for cleaning data frames on a large scale using base R and some understanding of data structures in R.

The Problem

Assuming character data in this data frame is not meaningful and we are only interested in numeric data. Here’s a “toy example” of how the data might look.

# Sample Data Set
df<-data.frame(a=c(1,"a",2.4),b=c("c",0.1,"b"))

df


##     a   b
## 1   1   c
## 2   a 0.1
## 3 2.4   b

Seeing this we can construct a matrix of TRUE/FALSE values that can tell us which variables contain character strings- and which do not.

Setting up a TRUE/FALSE matrix

Obviously, the Regular Expressions can get more complicated, but for this example- we will keep it simple to convey the concept.

tfvals<-as.matrix(as.data.frame(lapply(df, function(x) grepl("[A-z]",x))))

tfvals


##          a     b
## [1,] FALSE  TRUE
## [2,]  TRUE FALSE
## [3,] FALSE  TRUE

We use the grepl() function to give us logical values in the columns of our data set to determine which data is not meaningful to us by use of regular expressions. Because the result returned is a list- we need to convert this back to a data frame, which in turn needs to be converted into a matrix.

This gives us a nice and neat matrix of the TRUE/FALSE values that we will be able to use to clean our data set!

For R users who are more comfortable with the pipe operator (%>%) this solution can be rewritten as:

#' Import the magrittr library

library(magrittr)

tfvals <- df %>%
  lapply(function(x)grepl("[A-z]", x)) %>%
  as.data.frame() %>%
  as.matrix()

tfvals


##          a     b
## [1,] FALSE  TRUE
## [2,]  TRUE FALSE
## [3,] FALSE  TRUE

Which is also much easier to look at and requires less deciphering by others to read. But feel free to make whatever is easier for your environment and tastes (that comment is going to get some lash back!)

Cleaning and Formatting the Data

The actual cleaning of the data can now be done with one line.

df[tfvals]<-NA

# And there you have it!
df


##      a    b
## 1    1 <NA>
## 2 <NA>  0.1
## 3  2.4 <NA>

Now, to make sure we have our data in numeric format, we will coerce all of our data into numeric format by first coercing it to character format.

newdf<-data.frame(lapply(df,function(x) as.numeric(as.character(x))))

newdf


##     a   b
## 1 1.0  NA
## 2  NA 0.1
## 3 2.4  NA

The reason why we did this is because our data frame values are read as character data. Additionally as a safety precaution- if you are using an older version of R your data frame may still may have the stringsAsFactors argument still set to TRUE (yes, I know the newer versions of R have now set stringsAsFactors=FALSE as default in data frames, but hear me out); Thus we first coerce all the data that we have from the factor class to character, and from character to numeric data.

This can alternatively be rewritten with pipes (although, in this case I find using base R easier to read).

newdf <- df %>%
  lapply(function(x)
    x %>%
      as.character %>%
      as.numeric) %>%
  data.frame

newdf


##     a   b
## 1 1.0  NA
## 2  NA 0.1
## 3 2.4  NA

And there you have it!

Conclusion

I found a problem like this all to commonplace; more often than not- a uncleaned data set is handed over to you and knowing how to clean it with minimal effort is crucial to working efficiently. This is what inspired me to write about this.

Additionally, I’m really happy that my blog has been getting more reach via Social Media and R-Bloggers. With that, I am thrilled to have a lot of experienced R-Programmers who get to check out my work.

So my question for you (yes, YOU!)- how would you deal with this problem? Let me know in the comments!

Did you like this content? Be sure to never miss an update and Subscribe!

To leave a comment for the author, please follow the link and comment on their blog: r – bensstats.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.