Data wrangling tricks from the R4DS slack

R | Discindo

19 hours ago

[This article was first published on R | Discindo, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Every now and then there is a question on the R4DS slack that produces a large number of replies. This case was particularly interesting because the user that had the issue deleted all the original content, the question, the clarifications and so on.

I thought it would be interesting to document this, because it may be useful to other people.

The question was posted with regards to some genes data, and the task was to filter out some of the genes that have missing values. But generally speaking the question can be summarized like this:

How to filter a dataset in a such way that all observations from a group are dropped if the group doesn’t meet certain criteria?

These are our sample data:

dat <- data.frame(
groups = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
features = c("c", "d", "e", "e", "e", "d", "d", "f", "g"),
col1 = c(1, 2, 5, NA, 5, NA, 6, 7, NA),
col2 = c(1, 2, 4, 4, NA, 5, 6, NA, 7),
col3 = c(1, 3, 4, NA, 5, NA, 4, 7, 8)
)
dat
## groups features col1 col2 col3
## 1 a c 1 1 1
## 2 a d 2 2 3
## 3 a e 5 4 4
## 4 b e NA 4 NA
## 5 b e 5 NA 5
## 6 b d NA 5 NA
## 7 c d 6 6 4
## 8 c f 7 NA 7
## 9 c g NA 7 8

We want to keep only group a since it is the only one where none of the observations have missing data.

There is a case of complete observations in group c as well, but it is only one (on row 7), so we don’t want to keep that group.

A simple pipeline to do this would be:

library(dplyr)
library(tidyr)
dat |>
drop_na() |>
group_by(groups) |>
filter(n() == 3)
## # A tibble: 3 × 5
## # Groups: groups [1]
## groups features col1 col2 col3
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 a c 1 1 1
## 2 a d 2 2 3
## 3 a e 5 4 4

After providing this solution, I thought about the different scenarios where a similar recipe may be needed. For example, what if the data is too big and, data.table would be preferred? Or what if the pipeline should be part of a bigger project using python for example?

More R

So here are the possible solutions with other libraries. First data.table.

library(data.table)
dat_dt <- setDT(dat)
dat_grp <-
dat_dt[complete.cases(dat_dt),][, .N, by = groups][N == 3]
dt_result <- dat_dt[dat_grp, on = "groups"]
dt_result
## groups features col1 col2 col3 N
## 1: a c 1 1 1 3
## 2: a d 2 2 3 3
## 3: a e 5 4 4 3

Most of the time I find data.table unreadable, and luckily, a tidytable solution is also possible.

dat |>
tidytable::drop_na() |>
tidytable::group_by(groups) |>
tidytable::filter(dplyr::n() == 3)
## # A tidytable: 3 × 5
## # Groups: groups
## groups features col1 col2 col3
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 a c 1 1 1
## 2 a d 2 2 3
## 3 a e 5 4 4

Python

On to python. Here is a simple chain in pandas that does the same.

import pandas as pd
dat_py = r.dat
(dat_py.dropna(axis = 0, how = "any")
.groupby("groups")
.filter(lambda x: len(x) == 3))
## groups features col1 col2 col3
## 0 a c 1.0 1.0 1.0
## 1 a d 2.0 2.0 3.0
## 2 a e 5.0 4.0 4.0

The interesting bit is that drop_na() and dropna() behave differently. In tidyr, the function assumes that it is checking all of the columns by default, while in pandas arguments have to be called to check if any of the columns have missing values.

SQL

And, how about some SQL? It is possible, of course, that your data may be in a database and not in a flat file. Here is a sqlite solution.

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "dat", dat)
query <- "SELECT t1.* FROM dat AS t1
INNER JOIN (
SELECT groups, COUNT(*) as cnt
FROM dat
WHERE col1 IS NOT NULL AND col2 IS NOT NUll AND col3 IS NOT NULL
GROUP BY groups
) AS t2
ON t1.groups = t2.groups
WHERE t2.cnt = 3;"
res <- dbSendQuery(con, query)
dbFetch(res)
## groups features col1 col2 col3
## 1 a c 1 1 1
## 2 a d 2 2 3
## 3 a e 5 4 4

Summary

This was a quick post that provides answers with different tools to the question: How to filter a dataset in a such way that all observations from a group are dropped if the group doesn’t meet certain criteria?

We saw how to do it in tidyverse and in data.table. What is missing is base R. Sorry about that! However there are solutions in pandas and SQL. Hopefully something can be useful to somebody.

To leave a comment for the author, please follow the link and comment on their blog: R | Discindo.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

More R

Python

SQL

Summary

Related