Tidy DataFrames but not Tibbles

Posted on August 10, 2024 by Jonathan Carroll in R bloggers | 0 Comments

[This article was first published on rstats on Irregularly Scheduled Programming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A while ago (2019 seems so long ago now) I started working on something I thought was interesting but which never really got any traction. It has potential once more, so it’s about time I wrote up what it does and why I think it’s a useful idea. I’m going to talk about using the {dplyr} package on some data with rows and columns, but we’re not talking about data.frames or tibbles…

Okay, I probably do need to talk about them a little to start with.

I’m also taking this opportunity to reinforce my own understanding of all these pieces – I find the best way to learn something is to teach it.

`data.frame`

One of the basic building blocks of data in R is the data.frame – technically a list of vectors (vectors must have the same ‘type’ but lists don’t have this restriction) where the vectors are all the same length. An example looks like this

d <- data.frame(x = 2:5, y = LETTERS[2:5], z = 4:7+2i)
d
##   x y    z
## 1 2 B 4+2i
## 2 3 C 5+2i
## 3 4 D 6+2i
## 4 5 E 7+2i

If we unpack that a bit, it is just a list of vectors

unclass(d)
## $x
## [1] 2 3 4 5
## 
## $y
## [1] "B" "C" "D" "E"
## 
## $z
## [1] 4+2i 5+2i 6+2i 7+2i
## 
## attr(,"row.names")
## [1] 1 2 3 4

and there we see one more feature; they have an attribute row.names. By default these are just numbers (encoded as strings) which we can get to with the rownames() function

rownames(d)
## [1] "1" "2" "3" "4"

If you’ve read any guides to using R you will likely have seen the mtcars dataset which features these row names prominently - here’s the first 4 rows and columns

mtcars[1:4, 1:4]
##                 mpg cyl disp  hp
## Mazda RX4      21.0   6  160 110
## Mazda RX4 Wag  21.0   6  160 110
## Datsun 710     22.8   4  108  93
## Hornet 4 Drive 21.4   6  258 110

We can extract those row names

rownames(mtcars[1:4, 1:4])
## [1] "Mazda RX4"      "Mazda RX4 Wag"  "Datsun 710"     "Hornet 4 Drive"

This “row name” data sort of sits off to the side of the data object, but yes, they could live “in” the data just fine as their own vector. That’s the approach that the tidyverse takes with tibbles. There’s one other benefit to these that is often overlooked, though.

In the above examples I’ve extracted some rows of the mtcars data.frame with indices, but the [ method works just as well with row identifiers (strings)…

mercs <- grep("Merc", rownames(mtcars), value = TRUE)
mercs
## [1] "Merc 240D"   "Merc 230"    "Merc 280"    "Merc 280C"   "Merc 450SE" 
## [6] "Merc 450SL"  "Merc 450SLC"
mtcars[mercs, 1:4]
##              mpg cyl  disp  hp
## Merc 240D   24.4   4 146.7  62
## Merc 230    22.8   4 140.8  95
## Merc 280    19.2   6 167.6 123
## Merc 280C   17.8   6 167.6 123
## Merc 450SE  16.4   8 275.8 180
## Merc 450SL  17.3   8 275.8 180
## Merc 450SLC 15.2   8 275.8 180

You’re probably more familiar with this when subsetting columns, in which case you’re using the column names in exactly the same way

metrics <- c("mpg", "hp", "drat")
mtcars[mercs, metrics]
##              mpg  hp drat
## Merc 240D   24.4  62 3.69
## Merc 230    22.8  95 3.92
## Merc 280    19.2 123 3.92
## Merc 280C   17.8 123 3.92
## Merc 450SE  16.4 180 3.07
## Merc 450SL  17.3 180 3.07
## Merc 450SLC 15.2 180 3.07

As we’ll see, this becomes very convenient when we have meaningful identifiers in both dimensions.

{tibble}

A tibble is almost exactly like a data.frame except that the print method shows it a bit differently

tbl <- tibble::as_tibble(mtcars[1:4, 1:4])
tbl
## # A tibble: 4 × 4
##     mpg   cyl  disp    hp
##   <dbl> <dbl> <dbl> <dbl>
## 1  21       6   160   110
## 2  21       6   160   110
## 3  22.8     4   108    93
## 4  21.4     6   258   110

and the row names are gone. That’s by design, and enforced quite strongly - if I tried to add them back, {tibble} complains, and refuses

rownames(tbl) <- rownames(mtcars)[1:4]
## Warning: Setting row names on a tibble is deprecated.
tbl
## # A tibble: 4 × 4
##     mpg   cyl  disp    hp
## * <dbl> <dbl> <dbl> <dbl>
## 1  21       6   160   110
## 2  21       6   160   110
## 3  22.8     4   108    93
## 4  21.4     6   258   110

The world moves on, though.

If you have row names that you really want to keep, {tibble} has a way to get them into their own column so that they’ll be preserved in a tibble

tbl_rows <- mtcars[6:10, 1:5] |> 
  tibble::rownames_to_column() |> 
  tibble::as_tibble()
tbl_rows
## # A tibble: 5 × 6
##   rowname      mpg   cyl  disp    hp  drat
##   <chr>      <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Valiant     18.1     6  225    105  2.76
## 2 Duster 360  14.3     8  360    245  3.21
## 3 Merc 240D   24.4     4  147.    62  3.69
## 4 Merc 230    22.8     4  141.    95  3.92
## 5 Merc 280    19.2     6  168.   123  3.92

To get the named subsetting behaviour described earlier, we have to specify the "rowname" column in the output and do some filtering on it to get the rows we want

tbl_rows[tbl_rows$rowname %in% mercs, c("rowname", metrics)]
## # A tibble: 3 × 4
##   rowname     mpg    hp  drat
##   <chr>     <dbl> <dbl> <dbl>
## 1 Merc 240D  24.4    62  3.69
## 2 Merc 230   22.8    95  3.92
## 3 Merc 280   19.2   123  3.92

Of course, there’s a better way to subset a tibble…

{dplyr}

In terms of interacting with these objects, {dplyr} is great - it’s reminiscent of SQL syntax (SELECT, GROUP BY, … look, it’s not 1:1, but it’s somewhat familiar) and makes for a really clean pipeline of data processing - it got in early supporting the {magrittr} pipe and encouraging a ‘data as the first argument’ construction that meant piping from one function to another went smoothly.

There’s not much in {dplyr} that you couldn’t do with a bunch of base functions if you were so inclined - and some people have gone as far as proving that with non-dplyr packages with matching functions but base internals, forgoing {dplyr}’s ability to connect to a database with the exact same code you would use on a local tibble ({dbplyr} translates the R calls into native SQL syntax behind the scenes) for some more debuggable code, e.g. {poorman}.

{dplyr} functions mostly return the same class as the input; if you mutate a data.frame you’ll get back a data.frame (which wasn’t always the case, but is less surprising now)

mtcars[1:5, 1:6] |> 
  dplyr::mutate(pwr = hp / wt)
##                    mpg cyl disp  hp drat    wt      pwr
## Mazda RX4         21.0   6  160 110 3.90 2.620 41.98473
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 38.26087
## Datsun 710        22.8   4  108  93 3.85 2.320 40.08621
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 34.21462
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 50.87209

One feature that tibble adds on top of data.frame is the concept of “groups”. For this reason, if you perform a grouping on a data.frame you’ll get a tibble because that’s the only supported option

# slice(1) on a data.frame returns 1 row of a data.frame
mtcars |> 
  dplyr::slice(1)
##           mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4
# slice(1) on a grouped tibble returns 1 row _per group_ 
mtcars |> 
  dplyr::group_by(gear, am) |> 
  dplyr::slice(1)
## # A tibble: 4 × 11
## # Groups:   gear, am [4]
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
## 2  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
## 3  21       6  160    110  3.9   2.62  16.5     0     1     4     4
## 4  26       4  120.    91  4.43  2.14  16.7     0     1     5     2

{tibble} stores this information in a special "groups" attribute, and ensures that {dplyr} functions are aware of it. It also prints this information above the object (plus the number of resulting ‘groups’).

Is there anything it can’t do?

A Whole New World

That ended up being a lot of background, glad to see you’re still with me. That’s {dplyr} which can take either a data.frame or a tibble to do lots of useful operations on data. If, like me prior to joining the world of computational biology, you thought that was all there was to it, the next part comes as a bit of an eye opener.

Just like in a video game where all of a sudden the map is revealed to be twice as large as you knew it, there’s an entire world of R packages beyond CRAN, mainly targeted at computational biology/bioinformatics, known as Bioconductor. This is well supported by the R project itself, but you could spend your life writing R analyses and never use it if you aren’t working on bioinfo stuff.

I don’t think I’d used anything from Bioconductor before I joined a large biotech - incidentally the one at which Robert Gentleman himself was a senior director in bioinformatics and computational biology until just a couple of years before I joined. Over the next few years, though, I’d become very familiar with the packages on offer in that space.

In the world of sequencing data, the humble data.frame just doesn’t have the right amount of structure - rectangular data is great, but no matter which way you pivot (wide or long) you’re going to have some trouble representing all the different qualities and quantities associated with a sample; for example representing the level of expression of some gene for some samples belonging to some subject, where both the subject and the sample have metadata, as does the experiment itself, as do the genes. Keeping in mind that this might involve a thousand samples and many tens of thousands of genes, a flat table just won’t cut it.

A better option is to create a different structure, one which keeps the different components (sample metadata, gene metadata) aligned and together, but not in the same table. Several of these have evolved over time, but a current favourite of the field is the SummarizedExperiment structure. This organises the components in a way that they can be stored and interrogated without having to perform a dozen filters to get to the corner of data you’re interested in.

If we think of this like a database, we have a table for the actual measurements, another for the sample metadata, and another for the metadata on the things being observed (e.g. genes). The problem is, we need to define keys that link the different tables together. If our measurements were a giant matrix of values, e.g.

expression_data <- matrix(c(2, 3, 1, 0, 9, 5, 3, 4, 5), ncol = 3)
expression_data
##      [,1] [,2] [,3]
## [1,]    2    0    3
## [2,]    3    9    4
## [3,]    1    5    5

how would we know that the second column is for sample_Y and the third row is for gene_DEF? The matrix itself is numeric, so we can’t just add a new row/column to store the sample/gene information, which is character.

No points for guessing that the intro above was for a reason - those attributes are now really useful.

library(charcuterie)
rownames(expression_data) <- paste0("gene_", c("ABC", "DEF", "GHI"))
colnames(expression_data) <- paste0("sample_", chars("XYZ"))
expression_data
##          sample_X sample_Y sample_Z
## gene_ABC        2        0        3
## gene_DEF        3        9        4
## gene_GHI        1        5        5

Much clearer! Even better, now we have a way to link up the metadata in either dimension.

For the samples, we can create a data.frame with the relevant metadata for each sample, and label each sample row with the relevant ID

sample_metadata <- data.frame(name = c("Jim", "Jack", "Joe"), age = c(51, 53, 52))
rownames(sample_metadata) <- paste0("sample_", chars("XYZ"))
sample_metadata
##          name age
## sample_X  Jim  51
## sample_Y Jack  53
## sample_Z  Joe  52

Now we can imagine taking this metadata and sort of turning it sideways so that the labels align like interlocking puzzle pieces. These run across the columns, so we can call them the “column data” of the final structure. Similarly, we can do the same for the genes, perhaps annotating the chromosome they’re on, and other identifiers, and use these as the “row data”.

This is the basic structure of the SummarizedExperiment, but all is not well.

Bioconductor leans heavily into the S4 object model, making use of multiple dispatch as well as object validation. For example, if I wanted to create a range of fake genomic locations I could use

gr <- GenomicRanges::GRanges("chrX", IRanges::IRanges(1:6, width = 6:1))
gr
## GRanges object with 6 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrX       1-6      *
##   [2]     chrX       2-6      *
##   [3]     chrX       3-6      *
##   [4]     chrX       4-6      *
##   [5]     chrX       5-6      *
##   [6]     chrX         6      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Now, and this is just a toy example, if I wanted to use some of these in a data.frame…

d <- data.frame(a = 1:6, b = gr)
d
##   a b.seqnames b.start b.end b.width b.strand
## 1 1       chrX       1     6       6        *
## 2 2       chrX       2     6       5        *
## 3 3       chrX       3     6       4        *
## 4 4       chrX       4     6       3        *
## 5 5       chrX       5     6       2        *
## 6 6       chrX       6     6       1        *

Hmmm… it looks like it unfolded and did something strange. It was expecting a vector, not … whatever gr is.

Not to worry, though - the Bioconductor ecosystem defines a DataFrame object that does know how to deal with S4 classes just fine (there will be a lot of R noise here, I can’t help that - lots of masking)

library(S4Vectors)
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
DataFrame(a = 1:6, b = gr)
## DataFrame with 6 rows and 2 columns
##           a         b
##   <integer> <GRanges>
## 1         1  chrX:1-6
## 2         2  chrX:2-6
## 3         3  chrX:3-6
## 4         4  chrX:4-6
## 5         5  chrX:5-6
## 6         6    chrX:6

You might say - “hey, that looks familiar! It’s kind of like a tibble” and you’re right; the types of each column are shown at the top, and in the case of the GRanges object, it’s a non-atomic type, handled correctly. There’s important differences compared to a tibble, though. Importantly, we can add row names to this just fine

DataFrame(a = 1:6, b = gr, row.names = paste0("range", 1:6))
## DataFrame with 6 rows and 2 columns
##                a         b
##        <integer> <GRanges>
## range1         1  chrX:1-6
## range2         2  chrX:2-6
## range3         3  chrX:3-6
## range4         4  chrX:4-6
## range5         5  chrX:5-6
## range6         6    chrX:6

If we break the object, it will let us know because S4 performs object validation

a <- DataFrame(x = 1:3, y = LETTERS[1:3])
a
## DataFrame with 3 rows and 2 columns
##           x           y
##   <integer> <character>
## 1         1           A
## 2         2           B
## 3         3           C
validObject(a)
## [1] TRUE
# look away, I'm not supposed to do this
names(a@listData) <- NULL
a
## DataFrame with 3 rows and 2 columns
##   c("1", "2", "3") c("A", "B", "C")
##          <integer>      <character>
## 1                1                A
## 2                2                B
## 3                3                C
validObject(a)
## Error in validObject(a): invalid class "DFrame" object: 
##     column names should not be NULL

This is the right class for our sample metadata

as(sample_metadata, "DataFrame")
## DataFrame with 3 rows and 2 columns
##                 name       age
##          <character> <numeric>
## sample_X         Jim        51
## sample_Y        Jack        53
## sample_Z         Joe        52

Perfect!

If we look at an actual example of a SummarizedExperiment, e.g. from the {airway} package which captures read counts per gene for an airway smooth muscle cell lines RNA-Seq experiment

library("airway")
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
data("airway")
se <- airway
se
## class: RangedSummarizedExperiment 
## dim: 63677 8 
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
##   ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

we see a lot of information - the “dimensions” of this object is 63677 genes measured for 8 samples. The ‘rownames’ element shows the IDs of these genes which will link the ‘rowData’ to the counts assay rows, and the ‘colnames’ element shows the sample IDs which will link the ‘colData’ to the counts assay columns.

Pulling this apart…

cd <- colData(se)
cd
## DataFrame with 8 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862  N61311     untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863  N61311     trt      untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611    trt      untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611    trt      untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011    trt      untrt SRR1039521        98
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039520  SRX384357 SRS508579 SAMN02422683
## SRR1039521  SRX384358 SRS508580 SAMN02422677

we see the sample metadata includes whether or not the sample was treated, and some details about the experiment. Similarly, the gene metadata details the various IDs and genomic locations.

rowData(se)[1:3, ]
## DataFrame with 3 rows and 10 columns
##                         gene_id   gene_name  entrezid   gene_biotype
##                     <character> <character> <integer>    <character>
## ENSG00000000003 ENSG00000000003      TSPAN6        NA protein_coding
## ENSG00000000005 ENSG00000000005        TNMD        NA protein_coding
## ENSG00000000419 ENSG00000000419        DPM1        NA protein_coding
##                 gene_seq_start gene_seq_end    seq_name seq_strand
##                      <integer>    <integer> <character>  <integer>
## ENSG00000000003       99883667     99894988           X         -1
## ENSG00000000005       99839799     99854882           X          1
## ENSG00000000419       49551404     49575092          20         -1
##                 seq_coord_system      symbol
##                        <integer> <character>
## ENSG00000000003               NA      TSPAN6
## ENSG00000000005               NA        TNMD
## ENSG00000000419               NA        DPM1

Tying these together is the actual assay data

assay(se)[1:5, 1:5]
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003        679        448        873        408       1138
## ENSG00000000005          0          0          0          0          0
## ENSG00000000419        467        515        621        365        587
## ENSG00000000457        260        211        263        164        245
## ENSG00000000460         60         55         40         35         78

It all works really nicely - there are also additional places to store metadata about the experiment shared across samples, and metadata about the columns themselves

mcols(colData(se)) <- DataFrame(
  position = paste0("column ", 1:9), 
  is_ID = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)
mcols(colData(se))
## DataFrame with 9 rows and 2 columns
##               position     is_ID
##            <character> <logical>
## SampleName    column 1      TRUE
## cell          column 2      TRUE
## dex           column 3     FALSE
## albut         column 4     FALSE
## Run           column 5      TRUE
## avgLength     column 6     FALSE
## Experiment    column 7      TRUE
## Sample        column 8      TRUE
## BioSample     column 9      TRUE

Filtering a DataFrame

Now we get to a really interesting part - we want to do some analysis on this data.

First, we want to look at the samples which were untreated. A DataFrame uses the same [ subsetting mechanism as a data.frame so we could do

cd[cd$dex == "untrt", c("SampleName", "avgLength")]
## DataFrame with 4 rows and 2 columns
##            SampleName avgLength
##              <factor> <integer>
## SRR1039508 GSM1275862       126
## SRR1039512 GSM1275866       126
## SRR1039516 GSM1275870       120
## SRR1039520 GSM1275874       101

or even

subset(cd, dex == "untrt", select = c("SampleName", "avgLength"))
## DataFrame with 4 rows and 2 columns
##            SampleName avgLength
##              <factor> <integer>
## SRR1039508 GSM1275862       126
## SRR1039512 GSM1275866       126
## SRR1039516 GSM1275870       120
## SRR1039520 GSM1275874       101

and those work fine.

What can often happen is you want the data for a specific sample. Sure, that ID might be encoded in a column, but typically the sample identifiers are used as the rownames of the colData, so we can get the data for sample “SRR1039512” with

cd["SRR1039512", ]
## DataFrame with 1 row and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039512  SRX384349 SRS508571 SAMN02422678

What would be really nice, though, is to be able to use {dplyr}!

Let’s try…

dplyr::filter(cd, dex == "untrt")
## Error in UseMethod("filter"): no applicable method for 'filter' applied to an object of class "c('DFrame', 'DataFrame', 'SimpleList', 'RectangularData', 'List', 'DataFrame_OR_NULL', 'Vector', 'list_OR_List', 'Annotated', 'vector_OR_Vector')"

Nope, it’s not defined for this class.

One of the most frequent workarounds for this is to convert this DataFrame to a tibble, at which point, yes, {dplyr} should work

tbl_untrt <- dplyr::filter(tibble::as_tibble(cd), dex == "untrt")
tbl_untrt
## # A tibble: 4 × 9
##   SampleName cell    dex   albut Run       avgLength Experiment Sample BioSample
##   <fct>      <fct>   <fct> <fct> <fct>         <int> <fct>      <fct>  <fct>    
## 1 GSM1275862 N61311  untrt untrt SRR10395…       126 SRX384345  SRS50… SAMN0242…
## 2 GSM1275866 N052611 untrt untrt SRR10395…       126 SRX384349  SRS50… SAMN0242…
## 3 GSM1275870 N080611 untrt untrt SRR10395…       120 SRX384353  SRS50… SAMN0242…
## 4 GSM1275874 N061011 untrt untrt SRR10395…       101 SRX384357  SRS50… SAMN0242…

That works, but there are two problems - both of them are that we now have a tibble, firstly because it now has no row names, and secondly, as a consequence, we can’t “put it back” into the SummarizedExperiment.

We have the same problem with the gene information - we can convert to tibble and use {dplyr}, but we’re stuck there

dplyr::filter(tibble::as_tibble(rowData(airway)), symbol == "CD38")
## # A tibble: 1 × 10
##   gene_id   gene_name entrezid gene_biotype gene_seq_start gene_seq_end seq_name
##   <chr>     <chr>        <int> <chr>                 <int>        <int> <chr>   
## 1 ENSG0000… CD38            NA protein_cod…       15779898     15854853 4       
## # ℹ 3 more variables: seq_strand <int>, seq_coord_system <int>, symbol <chr>

Critically, tibble suffers from the same lack of S4 support as regular data.frames, so if there are any of those columns, this workflow will break

d <- tibble::tibble(a = 1)
d
## # A tibble: 1 × 1
##       a
##   <dbl>
## 1     1
d$grX <- GenomicRanges::GRanges("chrX", IRanges::IRanges(1:6, width = 6:1))
## Error in `$<-`:
## ! Assigned data `GenomicRanges::GRanges("chrX", IRanges::IRanges(1:6,
##   width = 6:1))` must be a vector.
## Caused by error in `vec_size()`:
## ! `x` must be a vector, not a <GRanges> object.

What else can we do?

Of Course I Have a Proposed Solution

Let’s create a gnarly DataFrame with S4 columns and row names

m <- mtcars[, c("cyl", "hp", "am", "gear", "disp")]
d <- as(m, "DataFrame")
d
## DataFrame with 32 rows and 5 columns
##                         cyl        hp        am      gear      disp
##                   <numeric> <numeric> <numeric> <numeric> <numeric>
## Mazda RX4                 6       110         1         4       160
## Mazda RX4 Wag             6       110         1         4       160
## Datsun 710                4        93         1         4       108
## Hornet 4 Drive            6       110         0         3       258
## Hornet Sportabout         8       175         0         3       360
## ...                     ...       ...       ...       ...       ...
## Lotus Europa              4       113         1         5      95.1
## Ford Pantera L            8       264         1         5     351.0
## Ferrari Dino              6       175         1         5     145.0
## Maserati Bora             8       335         1         5     301.0
## Volvo 142E                4       109         1         4     121.0
d$grX <- GenomicRanges::GRanges("chrX", IRanges::IRanges(1:32, width=10))
d$grY <- GenomicRanges::GRanges("chrY", IRanges::IRanges(1:32, width = 10))
d$nl <- IRanges::NumericList(lapply(d$gear, function(n) round(rnorm(n), 2)))
d
## DataFrame with 32 rows and 8 columns
##                         cyl        hp        am      gear      disp        grX
##                   <numeric> <numeric> <numeric> <numeric> <numeric>  <GRanges>
## Mazda RX4                 6       110         1         4       160  chrX:1-10
## Mazda RX4 Wag             6       110         1         4       160  chrX:2-11
## Datsun 710                4        93         1         4       108  chrX:3-12
## Hornet 4 Drive            6       110         0         3       258  chrX:4-13
## Hornet Sportabout         8       175         0         3       360  chrX:5-14
## ...                     ...       ...       ...       ...       ...        ...
## Lotus Europa              4       113         1         5      95.1 chrX:28-37
## Ford Pantera L            8       264         1         5     351.0 chrX:29-38
## Ferrari Dino              6       175         1         5     145.0 chrX:30-39
## Maserati Bora             8       335         1         5     301.0 chrX:31-40
## Volvo 142E                4       109         1         4     121.0 chrX:32-41
##                          grY                    nl
##                    <GRanges>         <NumericList>
## Mazda RX4          chrY:1-10    1.94,0.80,1.02,...
## Mazda RX4 Wag      chrY:2-11  0.90, 1.52,-0.46,...
## Datsun 710         chrY:3-12    1.64,0.47,0.66,...
## Hornet 4 Drive     chrY:4-13      0.06, 1.93,-1.11
## Hornet Sportabout  chrY:5-14     -1.01,-0.77,-0.04
## ...                      ...                   ...
## Lotus Europa      chrY:28-37 -0.22, 0.66,-1.10,...
## Ford Pantera L    chrY:29-38 -1.38, 0.74, 1.24,...
## Ferrari Dino      chrY:30-39 -0.81, 1.10, 0.85,...
## Maserati Bora     chrY:31-40    1.59,0.65,1.32,...
## Volvo 142E        chrY:32-41    1.07,0.78,0.00,...

and add some more metadata about the colData columns

mcols(colData(se)) <- DataFrame(
  position = paste0("column ", 1:9), 
  is_ID = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)
mcols(colData(se))
## DataFrame with 9 rows and 2 columns
##               position     is_ID
##            <character> <logical>
## SampleName    column 1      TRUE
## cell          column 2      TRUE
## dex           column 3     FALSE
## albut         column 4     FALSE
## Run           column 5      TRUE
## avgLength     column 6     FALSE
## Experiment    column 7      TRUE
## Sample        column 8      TRUE
## BioSample     column 9      TRUE

My solution was to write the {dplyr} verbs in a way that was compatible with DataFrame but which avoids passing through a tibble wherever possible. That became {DFplyr} and it’s been sitting around on GitHub not doing much for the last 5 years.

I wrote a first version of this package despite being quite naive to the actual implementation of DataFrame, so I’m very grateful to Michael Lawrence and Hervé Pagès who helped improve compatibility and pointed out bugs with what I’d done.

Time to take it for a tour

library(DFplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following objects are masked from 'package:GenomicRanges':
## 
##     intersect, setdiff, union
## The following object is masked from 'package:GenomeInfoDb':
## 
##     intersect
## The following objects are masked from 'package:IRanges':
## 
##     collapse, desc, intersect, setdiff, slice, union
## The following object is masked from 'package:matrixStats':
## 
##     count
## The following objects are masked from 'package:S4Vectors':
## 
##     first, intersect, rename, setdiff, setequal, union
## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Warning: replacing previous import 'S4Vectors::rename' by 'dplyr::rename' when
## loading 'DFplyr'
## 
## Attaching package: 'DFplyr'
## The following objects are masked from 'package:dplyr':
## 
##     count, desc
## The following objects are masked from 'package:IRanges':
## 
##     desc, slice
## The following object is masked from 'package:matrixStats':
## 
##     count
## The following object is masked from 'package:stats':
## 
##     filter
colData(se)
## DataFrame with 8 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862  N61311     untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863  N61311     trt      untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611    trt      untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611    trt      untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011    trt      untrt SRR1039521        98
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039520  SRX384357 SRS508579 SAMN02422683
## SRR1039521  SRX384358 SRS508580 SAMN02422677
cd <- colData(se)

One thing I added was a custom ‘label’ if you’re using this package within RStudio, in which case the Environment pane will show cd as “Formal class DataFrame (dplyr-compatible)”. That’s not a change in the object itself, but rather all DataFrame objects will now show this, because {dplyr} verbs should work on them.

Now I can just filter the colData as if it was a data.frame or a tibble, but the result is still a DataFrame

treated <- filter(cd, dex == "trt")
treated
## DataFrame with 4 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039509 GSM1275863  N61311       trt    untrt SRR1039509       126
## SRR1039513 GSM1275867  N052611      trt    untrt SRR1039513        87
## SRR1039517 GSM1275871  N080611      trt    untrt SRR1039517       126
## SRR1039521 GSM1275875  N061011      trt    untrt SRR1039521        98
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039521  SRX384358 SRS508580 SAMN02422677

Importantly, the row names and metadata are completely preserved, relevant to this filtered subset

rownames(treated)
## [1] "SRR1039509" "SRR1039513" "SRR1039517" "SRR1039521"
mcols(treated)
## DataFrame with 9 rows and 2 columns
##               position     is_ID
##            <character> <logical>
## SampleName    column 1      TRUE
## cell          column 2      TRUE
## dex           column 3     FALSE
## albut         column 4     FALSE
## Run           column 5      TRUE
## avgLength     column 6     FALSE
## Experiment    column 7      TRUE
## Sample        column 8      TRUE
## BioSample     column 9      TRUE
exp <- select(cd, Experiment, Sample, BioSample)
exp
## DataFrame with 8 rows and 3 columns
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039520  SRX384357 SRS508579 SAMN02422683
## SRR1039521  SRX384358 SRS508580 SAMN02422677
mcols(exp)
## DataFrame with 3 rows and 2 columns
##               position     is_ID
##            <character> <logical>
## Experiment    column 7      TRUE
## Sample        column 8      TRUE
## BioSample     column 9      TRUE

I can add more sample data if I have it

cdq <- mutate(cd, quality = runif(nrow(cd)))
cdq
## DataFrame with 8 rows and 10 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862  N61311     untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863  N61311     trt      untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611    trt      untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611    trt      untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011    trt      untrt SRR1039521        98
##            Experiment    Sample    BioSample   quality
##              <factor>  <factor>     <factor> <numeric>
## SRR1039508  SRX384345 SRS508568 SAMN02422669  0.386436
## SRR1039509  SRX384346 SRS508567 SAMN02422675  0.566096
## SRR1039512  SRX384349 SRS508571 SAMN02422678  0.327839
## SRR1039513  SRX384350 SRS508572 SAMN02422670  0.744911
## SRR1039516  SRX384353 SRS508575 SAMN02422682  0.331616
## SRR1039517  SRX384354 SRS508576 SAMN02422673  0.728328
## SRR1039520  SRX384357 SRS508579 SAMN02422683  0.626803
## SRR1039521  SRX384358 SRS508580 SAMN02422677  0.248757
mcols(cdq)
## DataFrame with 10 rows and 2 columns
##               position     is_ID
##            <character> <logical>
## SampleName    column 1      TRUE
## cell          column 2      TRUE
## dex           column 3     FALSE
## albut         column 4     FALSE
## Run           column 5      TRUE
## avgLength     column 6     FALSE
## Experiment    column 7      TRUE
## Sample        column 8      TRUE
## BioSample     column 9      TRUE
## quality             NA        NA

and since no metadata about this new column was available, it has a blank row in the mcols entry.

If I perform an operation that doesn’t make sense in terms of row names, they will be removed

cdq |> 
  group_by(dex) |> 
  summarise(avg_quality = mean(quality))
##     dex avg_quality
## 1   trt   0.5720231
## 2 untrt   0.4181736

but rearranging them is fine

cdq |> 
  arrange(desc(quality)) |> 
  slice(1:4)
## DataFrame with 4 rows and 10 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039513 GSM1275867  N052611    trt      untrt SRR1039513        87
## SRR1039517 GSM1275871  N080611    trt      untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039509 GSM1275863  N61311     trt      untrt SRR1039509       126
##            Experiment    Sample    BioSample   quality
##              <factor>  <factor>     <factor> <numeric>
## SRR1039513  SRX384350 SRS508572 SAMN02422670  0.744911
## SRR1039517  SRX384354 SRS508576 SAMN02422673  0.728328
## SRR1039520  SRX384357 SRS508579 SAMN02422683  0.626803
## SRR1039509  SRX384346 SRS508567 SAMN02422675  0.566096

One thing that DataFrame lacks - as does data.frame - is the concept of groups. {tibble} resolved this by adding a "groups" attribute, so I did the same in {DFplyr} - I don’t own {S4Vectors}, so I had to mask some functions in order to build in this support. I can now perform grouped operations on a DataFrame and have the grouping information printed like {tibble} does

cd |> 
  group_by(dex) 
## DataFrame with 8 rows and 9 columns
## Groups:  dex 
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862  N61311     untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863  N61311     trt      untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611    trt      untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611    trt      untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011    trt      untrt SRR1039521        98
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039520  SRX384357 SRS508579 SAMN02422683
## SRR1039521  SRX384358 SRS508580 SAMN02422677
cd |> 
  group_by(dex) |> 
  summarise(meanAvgLength = mean(avgLength))
##     dex meanAvgLength
## 1   trt        109.25
## 2 untrt        118.25

If I try to “put back” the subset colData I get an error, because there is assay data for samples we dropped

colData(se)
## DataFrame with 8 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862  N61311     untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863  N61311     trt      untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611    trt      untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611    trt      untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011    trt      untrt SRR1039521        98
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039520  SRX384357 SRS508579 SAMN02422683
## SRR1039521  SRX384358 SRS508580 SAMN02422677
treated
## DataFrame with 4 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039509 GSM1275863  N61311       trt    untrt SRR1039509       126
## SRR1039513 GSM1275867  N052611      trt    untrt SRR1039513        87
## SRR1039517 GSM1275871  N080611      trt    untrt SRR1039517       126
## SRR1039521 GSM1275875  N061011      trt    untrt SRR1039521        98
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039521  SRX384358 SRS508580 SAMN02422677
colData(se) <- treated
## Error in `colData<-`(`*tmp*`, value = new("DFrame", rownames = c("SRR1039509", : nrow of supplied 'colData' must equal ncol of object

But as long as I have some metadata for all the samples, it’s fine - cdq was the version of colData with quality information, and you can see it’s incorporated on the last line of this output

colData(se) <- cdq
se
## class: RangedSummarizedExperiment 
## dim: 63677 8 
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
##   ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(10): SampleName cell ... BioSample quality

Storing S4 classes isn’t all we can do - performing {dplyr} operations on S4 columns is perfectly fine; here, adding more S4 columns based on others

myrange <- GenomicRanges::GRanges("chrX", IRanges::IRanges(2:5, width = 10))

d <- mutate(d, quality = runif(32), overlaps = GenomicRanges::countOverlaps(grX, myrange))
d <- mutate(d, nearest = plyranges::join_nearest(grX, myrange))
d
## DataFrame with 32 rows and 11 columns
##                         cyl        hp        am      gear      disp        grX
##                   <numeric> <numeric> <numeric> <numeric> <numeric>  <GRanges>
## Mazda RX4                 6       110         1         4       160  chrX:1-10
## Mazda RX4 Wag             6       110         1         4       160  chrX:2-11
## Datsun 710                4        93         1         4       108  chrX:3-12
## Hornet 4 Drive            6       110         0         3       258  chrX:4-13
## Hornet Sportabout         8       175         0         3       360  chrX:5-14
## ...                     ...       ...       ...       ...       ...        ...
## Lotus Europa              4       113         1         5      95.1 chrX:28-37
## Ford Pantera L            8       264         1         5     351.0 chrX:29-38
## Ferrari Dino              6       175         1         5     145.0 chrX:30-39
## Maserati Bora             8       335         1         5     301.0 chrX:31-40
## Volvo 142E                4       109         1         4     121.0 chrX:32-41
##                          grY                    nl  overlaps   quality
##                    <GRanges>         <NumericList> <integer> <numeric>
## Mazda RX4          chrY:1-10    1.94,0.80,1.02,...         4  0.144992
## Mazda RX4 Wag      chrY:2-11  0.90, 1.52,-0.46,...         4  0.726243
## Datsun 710         chrY:3-12    1.64,0.47,0.66,...         4  0.146000
## Hornet 4 Drive     chrY:4-13      0.06, 1.93,-1.11         4  0.695177
## Hornet Sportabout  chrY:5-14     -1.01,-0.77,-0.04         4  0.921556
## ...                      ...                   ...       ...       ...
## Lotus Europa      chrY:28-37 -0.22, 0.66,-1.10,...         0  0.371289
## Ford Pantera L    chrY:29-38 -1.38, 0.74, 1.24,...         0  0.420842
## Ferrari Dino      chrY:30-39 -0.81, 1.10, 0.85,...         0  0.890471
## Maserati Bora     chrY:31-40    1.59,0.65,1.32,...         0  0.395305
## Volvo 142E        chrY:32-41    1.07,0.78,0.00,...         0  0.532688
##                      nearest
##                    <GRanges>
## Mazda RX4          chrX:1-10
## Mazda RX4 Wag      chrX:2-11
## Datsun 710         chrX:3-12
## Hornet 4 Drive     chrX:4-13
## Hornet Sportabout  chrX:5-14
## ...                      ...
## Lotus Europa      chrX:28-37
## Ford Pantera L    chrX:29-38
## Ferrari Dino      chrX:30-39
## Maserati Bora     chrX:31-40
## Volvo 142E        chrX:32-41

or filtering rows

d |> 
  arrange(desc(quality)) |> 
  filter(overlaps > 0) |> 
  slice(1:4)
## DataFrame with 4 rows and 11 columns
##                         cyl        hp        am      gear      disp        grX
##                   <numeric> <numeric> <numeric> <numeric> <numeric>  <GRanges>
## Merc 280C                 6       123         0         4     167.6 chrX:11-20
## Merc 450SL                8       180         0         3     275.8 chrX:13-22
## Merc 450SE                8       180         0         3     275.8 chrX:12-21
## Hornet Sportabout         8       175         0         3     360.0  chrX:5-14
##                          grY                    nl  overlaps   quality
##                    <GRanges>         <NumericList> <integer> <numeric>
## Merc 280C         chrY:11-20 -0.08, 0.55,-0.59,...         4  0.990185
## Merc 450SL        chrY:13-22      1.23,-0.35, 0.04         2  0.979190
## Merc 450SE        chrY:12-21      2.60, 0.98,-1.50         3  0.958658
## Hornet Sportabout  chrY:5-14     -1.01,-0.77,-0.04         4  0.921556
##                      nearest
##                    <GRanges>
## Merc 280C         chrX:11-20
## Merc 450SL        chrX:13-22
## Merc 450SE        chrX:12-21
## Hornet Sportabout  chrX:5-14

You don’t necessarily need to extract the DataFrame into a variable, either - working with the extraction directly could be cleaner in some cases

count(colData(se), cell)
## DataFrame with 4 rows and 2 columns
##       cell         n
##   <factor> <integer>
## 1  N052611         2
## 2  N061011         2
## 3  N080611         2
## 4  N61311          2

Other Ways to Filter

Using {dplyr} syntax on a DataFrame is very convenient, but that’s only one component of a SummarizedExperiment - it would be great to be able to extend this further and do something like

filter(airway, rows = ..., cols = ...)

It is possible to do something like this - if I generate IDs of the ‘rows’ and ‘columns’ I want to subset to, I can pass these in to a [ extraction

untrt <- colData(airway) |> filter(dex == "untrt") |> rownames()
untrt
## [1] "SRR1039508" "SRR1039512" "SRR1039516" "SRR1039520"
genes <- rowData(airway)[c(1:3, 8:10), ] |> rownames()
genes
## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000001036"
## [5] "ENSG00000001084" "ENSG00000001167"
airway_sub <- airway[genes, untrt]
airway_sub
## class: RangedSummarizedExperiment 
## dim: 6 4 
## metadata(1): ''
## assays(1): counts
## rownames(6): ENSG00000000003 ENSG00000000005 ... ENSG00000001084
##   ENSG00000001167
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(4): SRR1039508 SRR1039512 SRR1039516 SRR1039520
## colData names(9): SampleName cell ... Sample BioSample
dim(airway_sub)
## [1] 6 4
colData(airway_sub)
## DataFrame with 4 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862  N61311     untrt    untrt SRR1039508       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039520  SRX384357 SRS508579 SAMN02422683
rowData(airway_sub)
## DataFrame with 6 rows and 10 columns
##                         gene_id   gene_name  entrezid   gene_biotype
##                     <character> <character> <integer>    <character>
## ENSG00000000003 ENSG00000000003      TSPAN6        NA protein_coding
## ENSG00000000005 ENSG00000000005        TNMD        NA protein_coding
## ENSG00000000419 ENSG00000000419        DPM1        NA protein_coding
## ENSG00000001036 ENSG00000001036       FUCA2        NA protein_coding
## ENSG00000001084 ENSG00000001084        GCLC        NA protein_coding
## ENSG00000001167 ENSG00000001167        NFYA        NA protein_coding
##                 gene_seq_start gene_seq_end    seq_name seq_strand
##                      <integer>    <integer> <character>  <integer>
## ENSG00000000003       99883667     99894988           X         -1
## ENSG00000000005       99839799     99854882           X          1
## ENSG00000000419       49551404     49575092          20         -1
## ENSG00000001036      143815948    143832827           6         -1
## ENSG00000001084       53362139     53481768           6         -1
## ENSG00000001167       41040684     41067715           6          1
##                 seq_coord_system      symbol
##                        <integer> <character>
## ENSG00000000003               NA      TSPAN6
## ENSG00000000005               NA        TNMD
## ENSG00000000419               NA        DPM1
## ENSG00000001036               NA       FUCA2
## ENSG00000001084               NA        GCLC
## ENSG00000001167               NA        NFYA

Notice that this new object now has only 6 genes and 4 samples (the original has 63677 genes and 8 samples).

This sort of filtering is even nicer in MultiAssayExperiment where multiple SummarizedExperiments (or similar objects) are aggregated into a super object where the colData may be relevant to samples which appear in multiple assays.

MultiAssayExperiment data layout

MultiAssayExperiment has some of the best written vignettes I’ve seen, and I thoroughly recommend reading them if you have any need to use this format

library(MultiAssayExperiment)

(building the object not shown)

mae <- MultiAssayExperiment(experiments = ExpList, 
                            colData = colDat,
                            sampleMap = sampMap
)

mae
## A MultiAssayExperiment object of 4 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 4:
##  [1] Affy: SummarizedExperiment with 5 rows and 4 columns
##  [2] Methyl450k: matrix with 5 rows and 5 columns
##  [3] RNASeqGene: matrix with 5 rows and 4 columns
##  [4] GISTIC: RangedSummarizedExperiment with 5 rows and 3 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files
colData(mae)
## DataFrame with 4 rows and 2 columns
##                 sex       age
##         <character> <integer>
## Jack              M        38
## Jill              F        39
## Bob               M        40
## Barbara           F        41

Subsetting the different dimensions (rows, columns, assays) can be done with the following format

myMultiAssay[rows, columns, assays]

and since mae$x is defined to extract a colData column, I can subset this object to measurements of two genes in any assay for just the male subjects

mae[c("ENST00000355076", "ENST00000383323"), mae$sex == "M", ]
## A MultiAssayExperiment object of 4 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 4:
##  [1] Affy: SummarizedExperiment with 2 rows and 2 columns
##  [2] Methyl450k: matrix with 2 rows and 3 columns
##  [3] RNASeqGene: matrix with 1 rows and 2 columns
##  [4] GISTIC: RangedSummarizedExperiment with 1 rows and 2 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

All of that brings us to the tidyOmics project which aims to create tidy analysis packages for omics data. One of the packages there is tidySummarizedExperiment which is a reframing of SummarizedExperiment in a ‘tidy’ sense

# remotes::install_github("stemangiola/tidySummarizedExperiment")
tidySummarizedExperiment::pasilla
## # A SummarizedExperiment-tibble abstraction: 102,193 × 5
## # Features=14599 | Samples=7 | Assays=counts
##    .feature    .sample counts condition type      
##    <chr>       <chr>    <int> <chr>     <chr>     
##  1 FBgn0000003 untrt1       0 untreated single_end
##  2 FBgn0000008 untrt1      92 untreated single_end
##  3 FBgn0000014 untrt1       5 untreated single_end
##  4 FBgn0000015 untrt1       0 untreated single_end
##  5 FBgn0000017 untrt1    4664 untreated single_end
##  6 FBgn0000018 untrt1     583 untreated single_end
##  7 FBgn0000022 untrt1       0 untreated single_end
##  8 FBgn0000024 untrt1      10 untreated single_end
##  9 FBgn0000028 untrt1       0 untreated single_end
## 10 FBgn0000032 untrt1    1446 untreated single_end
## # ℹ 40 more rows

If you’re already familiar with MultiAssayExperiment then you may find this somewhat familiar to the longFormat output there

longFormat(mae)
## DataFrame with 80 rows and 5 columns
##           assay     primary         rowname     colname     value
##     <character> <character>     <character> <character> <numeric>
## 1          Affy        Jack ENST00000294241      array1       101
## 2          Affy        Jack ENST00000355076      array1       102
## 3          Affy        Jack ENST00000383706      array1       103
## 4          Affy        Jack ENST00000234812      array1       104
## 5          Affy        Jack ENST00000383323      array1       105
## ...         ...         ...             ...         ...       ...
## 76       GISTIC        Jill ENST00000135411       samp2         0
## 77       GISTIC        Jill ENST00000135412       samp2         1
## 78       GISTIC        Jill ENST00000135413       samp2         0
## 79       GISTIC        Jill ENST00000135414       samp2         1
## 80       GISTIC        Jill ENST00000383323       samp2         0

Summary

That was a bit long winded, but I’ve been wanting to spell out my love for rownames which comes from this Bioconductor approach for some time. {DFplyr} is a bit of a mess internally, but I believe it has promise to be useful, so I’m interested in use-cases, failure modes, etc… The tidyOmics project defines a lot of new structures which need to interact with DataFrames and I’m reasonably sure that {DFplyr} can help with that.

If you have comments, suggestions, or improvements - or if you think {DFplyr} could be useful to you, feel free to use the comment section below, or hit me up on Mastodon.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.3 (2024-02-29)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2024-08-11
##  pandoc   3.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package                  * version     date (UTC) lib source
##  abind                      1.4-5       2016-07-21 [1] CRAN (R 4.3.3)
##  airway                   * 1.22.0      2023-10-26 [1] Bioconductor
##  Biobase                  * 2.62.0      2023-10-24 [1] Bioconductor
##  BiocBaseUtils              1.4.0       2023-10-24 [1] Bioconductor
##  BiocGenerics             * 0.48.1      2023-11-01 [1] Bioconductor
##  BiocIO                     1.12.0      2023-10-24 [1] Bioconductor
##  BiocParallel               1.36.0      2023-10-24 [1] Bioconductor
##  Biostrings                 2.70.3      2024-03-13 [1] Bioconductor 3.18 (R 4.3.3)
##  bitops                     1.0-7       2021-04-24 [1] CRAN (R 4.3.2)
##  blogdown                   1.19        2024-02-01 [1] CRAN (R 4.3.3)
##  bookdown                   0.36        2023-10-16 [1] CRAN (R 4.3.2)
##  bslib                      0.6.1       2023-11-28 [3] CRAN (R 4.3.2)
##  cachem                     1.0.8       2023-05-01 [3] CRAN (R 4.3.0)
##  callr                      3.7.3       2022-11-02 [3] CRAN (R 4.2.2)
##  charcuterie              * 0.0.2       2024-08-07 [1] local
##  cli                        3.6.1       2023-03-23 [1] CRAN (R 4.3.3)
##  codetools                  0.2-19      2023-02-01 [4] CRAN (R 4.2.2)
##  colorspace                 2.1-0       2023-01-23 [1] CRAN (R 4.3.3)
##  crayon                     1.5.2       2022-09-29 [3] CRAN (R 4.2.1)
##  data.table                 1.15.0      2024-01-30 [3] CRAN (R 4.3.2)
##  DelayedArray               0.28.0      2023-10-24 [1] Bioconductor
##  devtools                   2.4.5       2022-10-11 [1] CRAN (R 4.3.2)
##  DFplyr                   * 0.0.1.9000  2024-08-11 [1] local
##  digest                     0.6.34      2024-01-11 [3] CRAN (R 4.3.2)
##  dplyr                    * 1.1.4       2023-11-17 [3] CRAN (R 4.3.2)
##  ellipsis                   0.3.2       2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate                   0.23        2023-11-01 [3] CRAN (R 4.3.2)
##  fansi                      1.0.6       2023-12-08 [1] CRAN (R 4.3.3)
##  fastmap                    1.1.1       2023-02-24 [3] CRAN (R 4.2.2)
##  fs                         1.6.3       2023-07-20 [3] CRAN (R 4.3.1)
##  generics                   0.1.3       2022-07-05 [3] CRAN (R 4.2.1)
##  GenomeInfoDb             * 1.38.8      2024-03-15 [1] Bioconductor 3.18 (R 4.3.3)
##  GenomeInfoDbData           1.2.11      2024-06-21 [1] Bioconductor
##  GenomicAlignments          1.38.2      2024-01-16 [1] Bioconductor 3.18 (R 4.3.3)
##  GenomicRanges            * 1.54.1      2023-10-29 [1] Bioconductor
##  ggplot2                    3.5.1       2024-04-23 [1] CRAN (R 4.3.3)
##  glue                       1.7.0       2024-01-09 [1] CRAN (R 4.3.3)
##  gtable                     0.3.5       2024-04-22 [1] CRAN (R 4.3.3)
##  htmltools                  0.5.7       2023-11-03 [3] CRAN (R 4.3.2)
##  htmlwidgets                1.6.2       2023-03-17 [1] CRAN (R 4.3.2)
##  httpuv                     1.6.12      2023-10-23 [1] CRAN (R 4.3.2)
##  httr                       1.4.7       2023-08-15 [3] CRAN (R 4.3.1)
##  icecream                   0.2.1       2023-09-27 [1] CRAN (R 4.3.2)
##  IRanges                  * 2.36.0      2023-10-24 [1] Bioconductor
##  jquerylib                  0.1.4       2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite                   1.8.8       2023-12-04 [3] CRAN (R 4.3.2)
##  knitr                      1.45        2023-10-30 [3] CRAN (R 4.3.2)
##  later                      1.3.1       2023-05-02 [1] CRAN (R 4.3.2)
##  lattice                    0.22-5      2023-10-24 [4] CRAN (R 4.3.1)
##  lazyeval                   0.2.2       2019-03-15 [1] CRAN (R 4.3.2)
##  lifecycle                  1.0.4       2023-11-07 [1] CRAN (R 4.3.3)
##  magrittr                   2.0.3       2022-03-30 [1] CRAN (R 4.3.3)
##  Matrix                     1.6-5       2024-01-11 [4] CRAN (R 4.3.3)
##  MatrixGenerics           * 1.14.0      2023-10-24 [1] Bioconductor
##  matrixStats              * 1.1.0       2023-11-07 [1] CRAN (R 4.3.2)
##  memoise                    2.0.1       2021-11-26 [3] CRAN (R 4.2.0)
##  mime                       0.12        2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI                     0.1.1.1     2018-05-18 [1] CRAN (R 4.3.2)
##  MultiAssayExperiment     * 1.28.0      2023-10-24 [1] Bioconductor
##  munsell                    0.5.1       2024-04-01 [1] CRAN (R 4.3.3)
##  pillar                     1.9.0       2023-03-22 [1] CRAN (R 4.3.3)
##  pkgbuild                   1.4.2       2023-06-26 [1] CRAN (R 4.3.2)
##  pkgconfig                  2.0.3       2019-09-22 [1] CRAN (R 4.3.3)
##  pkgload                    1.3.3       2023-09-22 [1] CRAN (R 4.3.2)
##  plotly                     4.10.4      2024-01-13 [1] CRAN (R 4.3.3)
##  plyranges                  1.22.0      2023-10-24 [1] Bioconductor
##  prettyunits                1.2.0       2023-09-24 [3] CRAN (R 4.3.1)
##  processx                   3.8.3       2023-12-10 [3] CRAN (R 4.3.2)
##  profvis                    0.3.8       2023-05-02 [1] CRAN (R 4.3.2)
##  promises                   1.2.1       2023-08-10 [1] CRAN (R 4.3.2)
##  ps                         1.7.6       2024-01-18 [3] CRAN (R 4.3.2)
##  purrr                      1.0.2       2023-08-10 [3] CRAN (R 4.3.1)
##  R6                         2.5.1       2021-08-19 [1] CRAN (R 4.3.3)
##  Rcpp                       1.0.11      2023-07-06 [1] CRAN (R 4.3.2)
##  RCurl                      1.98-1.14   2024-01-09 [1] CRAN (R 4.3.3)
##  remotes                    2.4.2.1     2023-07-18 [1] CRAN (R 4.3.2)
##  restfulr                   0.0.15      2022-06-16 [1] CRAN (R 4.3.3)
##  rjson                      0.2.21      2022-01-09 [1] CRAN (R 4.3.3)
##  rlang                      1.1.4       2024-06-04 [1] CRAN (R 4.3.3)
##  rmarkdown                  2.25        2023-09-18 [3] CRAN (R 4.3.1)
##  Rsamtools                  2.18.0      2023-10-24 [1] Bioconductor
##  rstudioapi                 0.15.0      2023-07-07 [3] CRAN (R 4.3.1)
##  rtracklayer                1.62.0      2023-10-24 [1] Bioconductor
##  S4Arrays                   1.2.1       2024-03-04 [1] Bioconductor 3.18 (R 4.3.3)
##  S4Vectors                * 0.40.2      2023-11-23 [1] Bioconductor 3.18 (R 4.3.3)
##  sass                       0.4.8       2023-12-06 [3] CRAN (R 4.3.2)
##  scales                     1.3.0       2023-11-28 [1] CRAN (R 4.3.2)
##  sessioninfo                1.2.2       2021-12-06 [1] CRAN (R 4.3.2)
##  shiny                      1.7.5.1     2023-10-14 [1] CRAN (R 4.3.2)
##  SparseArray                1.2.4       2024-02-11 [1] Bioconductor 3.18 (R 4.3.3)
##  stringi                    1.8.3       2023-12-11 [3] CRAN (R 4.3.2)
##  stringr                    1.5.1       2023-11-14 [3] CRAN (R 4.3.2)
##  SummarizedExperiment     * 1.32.0      2023-10-24 [1] Bioconductor
##  tibble                     3.2.1       2023-03-20 [1] CRAN (R 4.3.3)
##  tidyr                      1.3.1       2024-01-24 [3] CRAN (R 4.3.2)
##  tidyselect                 1.2.0       2022-10-10 [3] CRAN (R 4.2.1)
##  tidySummarizedExperiment   1.15.1      2024-08-06 [1] Github (stemangiola/tidySummarizedExperiment@4b8a4e1)
##  ttservice                  0.4.1       2024-06-07 [1] CRAN (R 4.3.3)
##  urlchecker                 1.0.1       2021-11-30 [1] CRAN (R 4.3.2)
##  usethis                    3.0.0       2024-07-29 [1] CRAN (R 4.3.3)
##  utf8                       1.2.4       2023-10-22 [1] CRAN (R 4.3.3)
##  vctrs                      0.6.5       2023-12-01 [1] CRAN (R 4.3.3)
##  viridisLite                0.4.2       2023-05-02 [1] CRAN (R 4.3.3)
##  withr                      3.0.0       2024-01-16 [1] CRAN (R 4.3.3)
##  xfun                       0.41        2023-11-01 [3] CRAN (R 4.3.2)
##  XML                        3.99-0.16.1 2024-01-22 [1] CRAN (R 4.3.3)
##  xtable                     1.8-4       2019-04-21 [1] CRAN (R 4.3.2)
##  XVector                    0.42.0      2023-10-24 [1] Bioconductor
##  yaml                       2.3.8       2023-12-11 [3] CRAN (R 4.3.2)
##  zlibbioc                   1.48.2      2024-03-13 [1] Bioconductor 3.18 (R 4.3.3)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.3
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

To leave a comment for the author, please follow the link and comment on their blog: rstats on Irregularly Scheduled Programming.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Tidy DataFrames but not Tibbles

`data.frame`

{tibble}

{dplyr}

A Whole New World

Filtering a DataFrame

Of Course I Have a Proposed Solution

Other Ways to Filter

Summary

Related

data.frame

{tibble}

{dplyr}

A Whole New World

Filtering a DataFrame

Of Course I Have a Proposed Solution

Other Ways to Filter

Summary

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

`data.frame`

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)