
Separating and Trimming Messy Data the Tidy Way


When working with table data from the scientific or patent literature, it is extremely common to find that columns contain concatenated data. That is, they contain multiple entries with a semicolon as a separator. Data of this type is not tidy (Wickham 2014). What we commonly want to do is separate the data out as the basis for counting. However, extra white space can have a major impact on any counts of this data if it is not recognised and dealt with. We will go through this step by step using a simple example and then scale up to a real-world example.

Here is a simple example of a table containing a column with concatenated data.

library(tidyverse)
library(knitr)        # provides kable()
library(kableExtra)   # provides kable_styling()

messy <- tibble::tibble(messy = c("this is not the; messiest", 
                                  "messy data column; in the world", 
                                  "it's just; a; tribute"))

If we try to count this:

messy %>% 
  count(messy) %>% 
  kable() %>% 
  kable_styling()
messy                              n
it’s just; a; tribute              1
messy data column; in the world    1
this is not the; messiest          1

We get three results, when what we actually want is a count of the individual data points inside the concatenated column. We can handle this easily with the tidyr::separate_rows function in R.

messy %>% 
  separate_rows(messy, sep = ";") %>% 
  count(messy) %>% 
  kable() %>% 
  kable_styling()
messy                n
a                    1
in the world         1
messiest             1
tribute              1
it’s just            1
messy data column    1
this is not the      1

Simples… as our friendly local meerkat might say. But let’s look at what happens if we double up our entries and imagine that two different people had written out the same thing.

messy <- tibble::tibble(messy = c("this is not the; messiest", 
                                  "messy data column; in the world", 
                                  "it's just; a; tribute", 
                                  " this is not the; messiest", 
                                  " messy data column;  in the world", 
                                  "it's just; a; tribute"))

Now let’s try again. We are innocently expecting a count of 2 for the repeated words and phrases.

messy %>% 
  separate_rows(messy, sep = ";") %>% 
  count(messy) %>% 
  kable() %>% 
  kable_styling()
messy                n
in the world         1
a                    2
in the world         1
messiest             2
messy data column    1
this is not the      1
tribute              2
it’s just            2
messy data column    1
this is not the      1

This is not adding up correctly because, while the two versions appear to be identical, there are subtle differences… involving spaces introduced by our mysterious second person. This is a simple case, so maybe you spotted them. In R a good first step is to inspect the data with str(); in Python you would reach for the equivalent inspection tools; in Excel you are mostly left staring very hard at the screen.

str(messy)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6 obs. of  1 variable:
##  $ messy: chr  "this is not the; messiest" "messy data column; in the world" "it's just; a; tribute" " this is not the; messiest" ...

It’s hard to read, but we can see the white space at the start of " this is not the; messiest". This helps make the point that when dealing with thousands of data points extra white space can be really hard to spot in R, Excel or anything else.

We can see this more clearly with a quick logical test in R of whether two strings are identical.

"this is messy" == " this is messy"
## [1] FALSE

The reason for this is that when testing whether the strings are identical, R (like any other language) matches all the characters, including white space.
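
A quick way to make hidden white space visible is to lean on how R prints character vectors, together with nchar. This is a minimal sketch using the strings from the test above:

x <- c("this is messy", " this is messy")
print(x)  # the surrounding quotes make the leading space visible
## [1] "this is messy"  " this is messy"
nchar(x)  # 13 versus 14 characters: the extra character is the space
## [1] 13 14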

If you work with text-based data, extra white space appears all the time after separation, mainly in the form of leading white space, as we will see below.
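
To see where that leading white space comes from, here is a minimal sketch using one of the strings from our toy data. Splitting on the bare semicolon leaves the space that followed each separator attached to the start of the next element:

stringr::str_split("it's just; a; tribute", ";")[[1]]
## [1] "it's just" " a"        " tribute"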

The solution is simple: we trim the white space on both sides. In R we can do this using either the stringr function str_trim or the base R function trimws. We’ll use stringr here because it is a reminder of how useful this tidyverse package is.

messy %>% separate_rows(messy, sep = ";") %>%
  mutate(messy = str_trim(messy, side = "both")) %>% 
  count(messy) %>% 
  kable() %>% 
  kable_styling()
messy                n
a                    2
in the world         2
it’s just            2
messiest             2
messy data column    2
this is not the      2
tribute              2

We use a call to dplyr::mutate and then a call to stringr::str_trim to trim the white space on both sides of the separated strings, and then overwrite the column in place.

Both str_trim and the base R trimws have arguments for where to trim white space. In the case of stringr it is side = and with trimws it is which =. As these are general functions there may be situations where you will want to trim either the leading (left) or the trailing (right) spaces. If you are working with metadata from the scientific literature (such as Web of Science or Crossref) or with patent data my recommendation is to always trim on both sides unless you have a good reason not to.
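
As a quick illustration of those arguments, here is a minimal sketch with a made-up padded string:

str_trim("  padded  ", side = "left")   # remove leading white space only
## [1] "padded  "
trimws("  padded  ", which = "right")   # remove trailing white space only
## [1] "  padded"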

We now have a piece of code that will work for just about anything where white space is left over, and we would normally want to turn it into a function that we can use over and over again. One reason the tidyverse set of packages is so popular is that the packages are so easy to use. But if we try to put the code above into a function, it won’t work.

fun <- function(df, col, sep){
  df %>% tidyr::separate_rows(col, sep = sep) %>% 
    dplyr::mutate(col = stringr::str_trim(col, side = "both")) %>% 
    dplyr::count(col)
}

fun(messy, messy, sep = ";")  # this fails with the error shown below

We will get an error message:

Error: col must evaluate to column positions or names, not a list

If we try quoting the “messy” column it appears to work, but instead counts the number of entries. We can go around the houses… and go slightly bananas in the process… trying to fix this, only to run into mysterious problem after problem. The reason for this is that dplyr and tidyr use non-standard evaluation (tidy evaluation), so R does not know how to evaluate the bare column name passed into our function. We need to start getting to grips with tidy evaluation to get our code to work in a function. A whole bunch of very useful resources on that have been compiled by Mara Averick here. One solution, bearing in mind that there may well be a better one, is this.

separate_rows_trim <- function(df, col, sep){
  col <- enquo(col)  # capture the bare column name as a quosure
  df %>% tidyr::separate_rows(!!col, sep = sep) %>%  # !! unquotes col for tidyr
    dplyr::mutate(!!col := stringr::str_trim(!!col, side = "both")) %>%  # := assigns to the unquoted name
    dplyr::count(!!col := !!col, sort = TRUE) %>% 
    tidyr::drop_na(!!col)  # drop any NA rows left over after separation
}

In this case we use bang bang !! to tell R when to evaluate col, with a bit of help from the := operator from rlang. To actually get to grips with tidy evaluation I recommend Mara Averick’s compilation here. For a much deeper dive and a highly illuminating read, try the metaprogramming chapters of Hadley Wickham’s forthcoming 2nd edition of Advanced R. For the moment we can move on.
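
To see the quoting pattern in isolation, here is a minimal sketch using a hypothetical helper called count_col (not part of the workflow above) and the built-in mtcars data. enquo captures the bare column name, and !! unquotes it where dplyr expects a column:

library(dplyr)

count_col <- function(df, col){
  col <- enquo(col)    # capture the bare column name as a quosure
  df %>% count(!!col)  # unquote it where dplyr expects a column
}

count_col(mtcars, cyl)
##   cyl  n
## 1   4 11
## 2   6  7
## 3   8 14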

Let’s try again.

separate_rows_trim(messy, messy, sep = ";") %>% 
  kable() %>% 
  kable_styling()
messy                n
a                    2
in the world         2
it’s just            2
messiest             2
messy data column    2
this is not the      2
tribute              2

We now have a reusable function.

This toy example introduces the importance of trimming white space when working with data that has been separated out. Otherwise bad things will happen when you start to count. To finish off let’s use some real world data from a patent dataset to illustrate this.

Scaling Up

This article is part of work in progress on the WIPO Patent Analytics Handbook. Patent data is simultaneously really well organised and really messy… with many concatenated columns containing data of varying lengths. In addition, a single data set will often compile records from different patent databases. This leads to the same problem we encountered above, where a mysterious second person types exactly the same thing in a slightly different way. This is really common with name fields such as applicants or inventors.

Here we will use the drones dataset, a new work-in-progress dataset of patent data involving drone technology. As it’s a big dataset we will just use the applicants table, which has 18,970 rows.

If you would like to explore the drones dataset try this. First make sure you have devtools.

install.packages("devtools")

Then install from github with:

devtools::install_github("poldham/drones")

We will use the applicants table. This contains a column called applicants_cleaned that I have already largely cleaned up.

library(drones)
applicants %>% 
  select(applicants_cleaned)
## # A tibble: 18,970 x 1
##    applicants_cleaned                         
##    <chr>                                      
##  1 SHENZHEN HUBSAN TECHNOLOGY CO. LTD.        
##  2 Intel Corporation                          
##  3 YOKOGAWA ELECTRIC CORPORATION              
##  4 NETWORK PERFORMANCE RESEARCH GROUP LLC     
##  5 NETWORK PERFORMANCE RESEARCH GROUP LLC     
##  6 Choi Hae-Yong                              
##  7 Ziva Corporation                           
##  8 WiTricity Corporation                      
##  9 Thales                                     
## 10 International Business Machines Corporation
## # ... with 18,960 more rows

We can only see one case of a semicolon in this preview, but we can quickly get an idea of how many there are in total with str_count.

applicants %>% 
  select(applicants_cleaned) %>% 
  str_count(., pattern = ";") %>%  # the tibble is coerced to character, so this returns one total
  kable()
x
5916
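
The same total can be computed a little more explicitly by counting the separators per row and summing. This is a sketch that should return the same figure; na.rm = TRUE guards against missing applicant entries:

applicants %>% 
  summarise(total = sum(str_count(applicants_cleaned, ";"), na.rm = TRUE))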

Creating a count of the number of separators per record reveals that the maximum number of semicolons in a single record is 20, and that two records share that maximum.

applicants %>% 
  select(applicants_cleaned) %>% 
  mutate(sepcount = str_count(applicants_cleaned, ";")) %>% 
  drop_na(applicants_cleaned) %>% 
  filter(sepcount == max(sepcount)) %>% 
  select(sepcount) %>% 
  kable()
sepcount
20
20
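
As a quick cross-check, the same maximum can be pulled out in a single line. This is a sketch of an alternative, not part of the original workflow:

max(str_count(applicants$applicants_cleaned, ";"), na.rm = TRUE)
## [1] 20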

Let’s try counting the data both ways and then join the results together. We’ll limit this to the top ten.

library(drones)
df1 <- applicants %>% 
  separate_rows(applicants_cleaned, sep = ";") %>%
  drop_na(applicants_cleaned) %>% 
  count(applicants_cleaned, sort = TRUE) %>% 
  rename(messy = n) %>% 
  .[1:10,] %>% 
  mutate(applicants_cleaned = str_trim(applicants_cleaned, side = "both"))

df2 <- separate_rows_trim(applicants, applicants_cleaned, sep = ";") %>% 
  drop_na(applicants_cleaned) %>% 
  rename(tidy = n) %>% 
  .[1:10,]

If we join these two tables together we will be able to calculate the differences between them.

df3 <- merge(df1, df2, by = "applicants_cleaned") %>% 
  arrange(desc(tidy)) %>%
  mutate(percent = (tidy - messy) / tidy * 100) %>% 
  mutate(percent = formatC(percent, digits = 2))
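
As an aside, if you prefer to stay within dplyr, an inner_join should produce the same table as the base merge used above (a sketch under that assumption):

df3 <- dplyr::inner_join(df1, df2, by = "applicants_cleaned") %>% 
  arrange(desc(tidy)) %>% 
  mutate(percent = (tidy - messy) / tidy * 100) %>% 
  mutate(percent = formatC(percent, digits = 2))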

df3 %>% 
  kable("html", escape = F) %>% 
  kable_styling(position = "center") %>% 
  column_spec(4, width = "8em")
applicants_cleaned                            messy   tidy   percent
QUALCOMM Incorporated                           483    498   3
Thales                                          322    382   16
HON HAI PRECISION INDUSTRY CO LTD               345    345   0
QINGHUA UNIV                                    343    343   0
Samsung Electronics Co. Ltd.                    207    213   2.8
International Business Machines Corporation     184    193   4.7
THE BOEING COMPANY                              177    181   2.2
GOOGLE INC.                                     165    167   1.2
Elwha LLC                                       161    166   3
SONY CORPORATION                                144    148   2.7

We can see from the percentage scores that there is significant variance in the counts, with a maximum of 16% in the case of Thales. The reason this matters, whether using patent data or data from the scientific literature, is that any counts that do not recognise the white space problem will be wrong… and often quite seriously wrong. Typically with patent data the most observable change is movement in the top rankings. But where precision in counting is important, such as capturing all documents linked to a company in a highly competitive field, it can really matter.

Round Up

When working with data from the scientific literature or patent data in spreadsheets or data frames, we will always want to separate out the data in order to count it, whether in R, Python or using tools such as Open Refine. Separating data onto new rows is, however, only one step: trimming white space is the key step that gets us to accurate counts.

So ends this episode of “fun with white space and semicolons”. Thanks for reading!

References

Ren, Kun, and Kenton Russell. 2016. Formattable: Create ’Formattable’ Data Structures. https://CRAN.R-project.org/package=formattable.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). Foundation for Open Access Statistics. https://doi.org/10.18637/jss.v059.i10.

———. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.

———. 2018. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, and Lionel Henry. 2018. Tidyr: Easily Tidy Data with ’Spread()’ and ’Gather()’ Functions. https://CRAN.R-project.org/package=tidyr.

Zhu, Hao. 2018. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.
