Tidy DataFrames but not Tibbles
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A while ago (2019 seems so long ago now) I started working on something I
thought was interesting but which never really got any traction. It has
potential once more, so it’s about time I wrote up what it does and why I think
it’s a useful idea. I’m going to talk about using the {dplyr} package on some
data with rows and columns, but we’re not talking about data.frame
s or
tibble
s…
Okay, I probably do need to talk about them a little to start with.
I’m also taking this opportunity to reinforce my own understanding of all these pieces – I find the best way to learn something is to teach it.
data.frame
One of the basic building blocks of data in R is the data.frame
– technically
a list of vectors (vectors must have the same ‘type’ but lists don’t have this
restriction) where the vectors are all the same length. An example looks like
this
d <- data.frame(x = 2:5, y = LETTERS[2:5], z = 4:7+2i) d ## x y z ## 1 2 B 4+2i ## 2 3 C 5+2i ## 3 4 D 6+2i ## 4 5 E 7+2i
If we unpack that a bit, it is just a list of vectors
unclass(d) ## $x ## [1] 2 3 4 5 ## ## $y ## [1] "B" "C" "D" "E" ## ## $z ## [1] 4+2i 5+2i 6+2i 7+2i ## ## attr(,"row.names") ## [1] 1 2 3 4
and there we see one more feature; they have an attribute row.names
. By default
these are just numbers (encoded as strings) which we can get to with the rownames()
function
rownames(d) ## [1] "1" "2" "3" "4"
If you’ve read any guides to using R you will likely have seen the mtcars
dataset
which features these row names prominently - here’s the first 4 rows and columns
mtcars[1:4, 1:4] ## mpg cyl disp hp ## Mazda RX4 21.0 6 160 110 ## Mazda RX4 Wag 21.0 6 160 110 ## Datsun 710 22.8 4 108 93 ## Hornet 4 Drive 21.4 6 258 110
We can extract those row names
rownames(mtcars[1:4, 1:4]) ## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive"
This “row name” data sort of sits off to the side of the data object, but yes,
they could live “in” the data just fine as their own vector. That’s the approach
that the tidyverse takes with tibble
s. There’s one other benefit to these that
is often overlooked, though.
In the above examples I’ve extracted some rows of the mtcars
data.frame
with
indices, but the [
method works just as well with row identifiers (strings)…
mercs <- grep("Merc", rownames(mtcars), value = TRUE) mercs ## [1] "Merc 240D" "Merc 230" "Merc 280" "Merc 280C" "Merc 450SE" ## [6] "Merc 450SL" "Merc 450SLC" mtcars[mercs, 1:4] ## mpg cyl disp hp ## Merc 240D 24.4 4 146.7 62 ## Merc 230 22.8 4 140.8 95 ## Merc 280 19.2 6 167.6 123 ## Merc 280C 17.8 6 167.6 123 ## Merc 450SE 16.4 8 275.8 180 ## Merc 450SL 17.3 8 275.8 180 ## Merc 450SLC 15.2 8 275.8 180
You’re probably more familiar with this when subsetting columns, in which case you’re using the column names in exactly the same way
metrics <- c("mpg", "hp", "drat") mtcars[mercs, metrics] ## mpg hp drat ## Merc 240D 24.4 62 3.69 ## Merc 230 22.8 95 3.92 ## Merc 280 19.2 123 3.92 ## Merc 280C 17.8 123 3.92 ## Merc 450SE 16.4 180 3.07 ## Merc 450SL 17.3 180 3.07 ## Merc 450SLC 15.2 180 3.07
As we’ll see, this becomes very convenient when we have meaningful identifiers in both dimensions.
{tibble}
A tibble
is almost exactly like a data.frame
except that the print method shows
it a bit differently
tbl <- tibble::as_tibble(mtcars[1:4, 1:4]) tbl ## # A tibble: 4 × 4 ## mpg cyl disp hp ## <dbl> <dbl> <dbl> <dbl> ## 1 21 6 160 110 ## 2 21 6 160 110 ## 3 22.8 4 108 93 ## 4 21.4 6 258 110
and the row names are gone. That’s by design, and enforced quite strongly - if I tried to add them back, {tibble} complains, and refuses
rownames(tbl) <- rownames(mtcars)[1:4] ## Warning: Setting row names on a tibble is deprecated. tbl ## # A tibble: 4 × 4 ## mpg cyl disp hp ## * <dbl> <dbl> <dbl> <dbl> ## 1 21 6 160 110 ## 2 21 6 160 110 ## 3 22.8 4 108 93 ## 4 21.4 6 258 110
The world moves on, though.
If you have row names that you really want to keep, {tibble} has a way to get them
into their own column so that they’ll be preserved in a tibble
tbl_rows <- mtcars[6:10, 1:5] |> tibble::rownames_to_column() |> tibble::as_tibble() tbl_rows ## # A tibble: 5 × 6 ## rowname mpg cyl disp hp drat ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Valiant 18.1 6 225 105 2.76 ## 2 Duster 360 14.3 8 360 245 3.21 ## 3 Merc 240D 24.4 4 147. 62 3.69 ## 4 Merc 230 22.8 4 141. 95 3.92 ## 5 Merc 280 19.2 6 168. 123 3.92
To get the named subsetting behaviour described earlier, we have to specify the
"rowname"
column in the output and do some filtering on it to get the rows we
want
tbl_rows[tbl_rows$rowname %in% mercs, c("rowname", metrics)] ## # A tibble: 3 × 4 ## rowname mpg hp drat ## <chr> <dbl> <dbl> <dbl> ## 1 Merc 240D 24.4 62 3.69 ## 2 Merc 230 22.8 95 3.92 ## 3 Merc 280 19.2 123 3.92
Of course, there’s a better way to subset a tibble
…
{dplyr}
In terms of interacting with these objects, {dplyr} is great - it’s reminiscent
of SQL syntax (SELECT
, GROUP BY
, … look, it’s not 1:1, but it’s somewhat
familiar) and makes for a really clean pipeline of data processing - it got in
early supporting the {magrittr} pipe and encouraging a ‘data as the first
argument’ construction that meant piping from one function to another went
smoothly.
There’s not much in {dplyr} that you couldn’t do with a bunch of base functions
if you were so inclined - and some people have gone as far as proving that with
non-dplyr packages with matching functions but base internals, forgoing {dplyr}’s
ability to connect to a database with the exact same code you would use on a
local tibble
({dbplyr} translates the R calls into native SQL syntax behind the
scenes) for some more debuggable code, e.g. {poorman}.
{dplyr} functions mostly return the same class as the input; if you mutate
a
data.frame
you’ll get back a data.frame
(which wasn’t always the case, but
is less surprising now)
mtcars[1:5, 1:6] |> dplyr::mutate(pwr = hp / wt) ## mpg cyl disp hp drat wt pwr ## Mazda RX4 21.0 6 160 110 3.90 2.620 41.98473 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 38.26087 ## Datsun 710 22.8 4 108 93 3.85 2.320 40.08621 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 34.21462 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 50.87209
One feature that tibble
adds on top of data.frame
is the concept of “groups”.
For this reason, if you perform a grouping on a data.frame
you’ll get a tibble
because that’s the only supported option
# slice(1) on a data.frame returns 1 row of a data.frame mtcars |> dplyr::slice(1) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4 # slice(1) on a grouped tibble returns 1 row _per group_ mtcars |> dplyr::group_by(gear, am) |> dplyr::slice(1) ## # A tibble: 4 × 11 ## # Groups: gear, am [4] ## mpg cyl disp hp drat wt qsec vs am gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 ## 2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 ## 3 21 6 160 110 3.9 2.62 16.5 0 1 4 4 ## 4 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
{tibble} stores this information in a special "groups"
attribute, and ensures
that {dplyr} functions are aware of it. It also prints this information above
the object (plus the number of resulting ‘groups’).
Is there anything it can’t do?
A Whole New World
That ended up being a lot of background, glad to see you’re still with me. That’s
{dplyr} which can take either a data.frame
or a tibble
to do lots of useful
operations on data. If, like me prior to joining the world of computational
biology, you thought that was all there was to it, the next part comes as a bit of
an eye opener.
Just like in a video game where all of a sudden the map is revealed to be twice as large as you knew it, there’s an entire world of R packages beyond CRAN, mainly targeted at computational biology/bioinformatics, known as Bioconductor. This is well supported by the R project itself, but you could spend your life writing R analyses and never use it if you aren’t working on bioinfo stuff.
I don’t think I’d used anything from Bioconductor before I joined a large biotech - incidentally the one at which Robert Gentleman himself was a senior director in bioinformatics and computational biology until just a couple of years before I joined. Over the next few years, though, I’d become very familiar with the packages on offer in that space.
In the world of sequencing data, the humble data.frame
just doesn’t have the
right amount of structure - rectangular data is great, but no matter which way
you pivot (wide or long) you’re going to have some trouble representing all the
different qualities and quantities associated with a sample; for example representing
the level of expression of some gene for some samples belonging to some subject,
where both the subject and the sample have metadata, as does the experiment itself,
as do the genes. Keeping in mind that this might involve a thousand samples and
many tens of thousands of genes, a flat table just won’t cut it.
A better option is to create a different structure, one which keeps the different
components (sample metadata, gene metadata) aligned and together, but not in the
same table. Several of these have evolved over time, but a current favourite of
the field is the SummarizedExperiment
structure. This organises the components
in a way that they can be stored and interrogated without having to perform a
dozen filters to get to the corner of data you’re interested in.
If we think of this like a database, we have a table for the actual measurements, another for the sample metadata, and another for the metadata on the things being observed (e.g. genes). The problem is, we need to define keys that link the different tables together. If our measurements were a giant matrix of values, e.g.
expression_data <- matrix(c(2, 3, 1, 0, 9, 5, 3, 4, 5), ncol = 3) expression_data ## [,1] [,2] [,3] ## [1,] 2 0 3 ## [2,] 3 9 4 ## [3,] 1 5 5
how would we know that the second column is for sample_Y and the third row is for gene_DEF? The matrix itself is numeric, so we can’t just add a new row/column to store the sample/gene information, which is character.
No points for guessing that the intro above was for a reason - those attributes are now really useful.
library(charcuterie) rownames(expression_data) <- paste0("gene_", c("ABC", "DEF", "GHI")) colnames(expression_data) <- paste0("sample_", chars("XYZ")) expression_data ## sample_X sample_Y sample_Z ## gene_ABC 2 0 3 ## gene_DEF 3 9 4 ## gene_GHI 1 5 5
Much clearer! Even better, now we have a way to link up the metadata in either dimension.
For the samples, we can create a data.frame
with the relevant metadata for
each sample, and label each sample row with the relevant ID
sample_metadata <- data.frame(name = c("Jim", "Jack", "Joe"), age = c(51, 53, 52)) rownames(sample_metadata) <- paste0("sample_", chars("XYZ")) sample_metadata ## name age ## sample_X Jim 51 ## sample_Y Jack 53 ## sample_Z Joe 52
Now we can imagine taking this metadata and sort of turning it sideways so that the labels align like interlocking puzzle pieces. These run across the columns, so we can call them the “column data” of the final structure. Similarly, we can do the same for the genes, perhaps annotating the chromosome they’re on, and other identifiers, and use these as the “row data”.
This is the basic structure of the SummarizedExperiment
, but all is not well.
Bioconductor leans heavily into the S4 object model, making use of multiple dispatch as well as object validation. For example, if I wanted to create a range of fake genomic locations I could use
gr <- GenomicRanges::GRanges("chrX", IRanges::IRanges(1:6, width = 6:1)) gr ## GRanges object with 6 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chrX 1-6 * ## [2] chrX 2-6 * ## [3] chrX 3-6 * ## [4] chrX 4-6 * ## [5] chrX 5-6 * ## [6] chrX 6 * ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Now, and this is just a toy example, if I wanted to use some of these in a
data.frame
…
d <- data.frame(a = 1:6, b = gr) d ## a b.seqnames b.start b.end b.width b.strand ## 1 1 chrX 1 6 6 * ## 2 2 chrX 2 6 5 * ## 3 3 chrX 3 6 4 * ## 4 4 chrX 4 6 3 * ## 5 5 chrX 5 6 2 * ## 6 6 chrX 6 6 1 *
Hmmm… it looks like it unfolded and did something strange. It was expecting a
vector, not … whatever gr
is.
Not to worry, though - the Bioconductor ecosystem defines a DataFrame
object
that does know how to deal with S4 classes just fine (there will be a lot of R
noise here, I can’t help that - lots of masking)
library(S4Vectors) ## Loading required package: stats4 ## Loading required package: BiocGenerics ## ## Attaching package: 'BiocGenerics' ## The following objects are masked from 'package:stats': ## ## IQR, mad, sd, var, xtabs ## The following objects are masked from 'package:base': ## ## anyDuplicated, aperm, append, as.data.frame, basename, cbind, ## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, ## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, ## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, ## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, ## table, tapply, union, unique, unsplit, which.max, which.min ## ## Attaching package: 'S4Vectors' ## The following object is masked from 'package:utils': ## ## findMatches ## The following objects are masked from 'package:base': ## ## expand.grid, I, unname DataFrame(a = 1:6, b = gr) ## DataFrame with 6 rows and 2 columns ## a b ## <integer> <GRanges> ## 1 1 chrX:1-6 ## 2 2 chrX:2-6 ## 3 3 chrX:3-6 ## 4 4 chrX:4-6 ## 5 5 chrX:5-6 ## 6 6 chrX:6
You might say - “hey, that looks familiar! It’s kind of like a tibble
” and
you’re right; the types of each column are shown at the top, and in the case
of the GRanges
object, it’s a non-atomic type, handled correctly. There’s
important differences compared to a tibble
, though. Importantly, we can add
row names to this just fine
DataFrame(a = 1:6, b = gr, row.names = paste0("range", 1:6)) ## DataFrame with 6 rows and 2 columns ## a b ## <integer> <GRanges> ## range1 1 chrX:1-6 ## range2 2 chrX:2-6 ## range3 3 chrX:3-6 ## range4 4 chrX:4-6 ## range5 5 chrX:5-6 ## range6 6 chrX:6
If we break the object, it will let us know because S4 performs object validation
a <- DataFrame(x = 1:3, y = LETTERS[1:3]) a ## DataFrame with 3 rows and 2 columns ## x y ## <integer> <character> ## 1 1 A ## 2 2 B ## 3 3 C validObject(a) ## [1] TRUE # look away, I'm not supposed to do this names(a@listData) <- NULL a ## DataFrame with 3 rows and 2 columns ## c("1", "2", "3") c("A", "B", "C") ## <integer> <character> ## 1 1 A ## 2 2 B ## 3 3 C validObject(a) ## Error in validObject(a): invalid class "DFrame" object: ## column names should not be NULL
This is the right class for our sample metadata
as(sample_metadata, "DataFrame") ## DataFrame with 3 rows and 2 columns ## name age ## <character> <numeric> ## sample_X Jim 51 ## sample_Y Jack 53 ## sample_Z Joe 52
Perfect!
If we look at an actual example of a SummarizedExperiment
, e.g. from the {airway}
package which captures read counts per gene for an airway smooth muscle cell lines
RNA-Seq experiment
library("airway") ## Loading required package: SummarizedExperiment ## Loading required package: MatrixGenerics ## Loading required package: matrixStats ## ## Attaching package: 'MatrixGenerics' ## The following objects are masked from 'package:matrixStats': ## ## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, ## colCounts, colCummaxs, colCummins, colCumprods, colCumsums, ## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, ## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats, ## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds, ## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, ## colWeightedMeans, colWeightedMedians, colWeightedSds, ## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet, ## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, ## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, ## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins, ## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, ## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, ## rowWeightedMads, rowWeightedMeans, rowWeightedMedians, ## rowWeightedSds, rowWeightedVars ## Loading required package: GenomicRanges ## Loading required package: IRanges ## Loading required package: GenomeInfoDb ## Loading required package: Biobase ## Welcome to Bioconductor ## ## Vignettes contain introductory material; view with ## 'browseVignettes()'. To cite Bioconductor, see ## 'citation("Biobase")', and for packages 'citation("pkgname")'. ## ## Attaching package: 'Biobase' ## The following object is masked from 'package:MatrixGenerics': ## ## rowMedians ## The following objects are masked from 'package:matrixStats': ## ## anyMissing, rowMedians data("airway") se <- airway se ## class: RangedSummarizedExperiment ## dim: 63677 8 ## metadata(1): '' ## assays(1): counts ## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492 ## ENSG00000273493 ## rowData names(10): gene_id gene_name ... seq_coord_system symbol ## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521 ## colData names(9): SampleName cell ... Sample BioSample
we see a lot of information - the “dimensions” of this object is 63677 genes measured for 8 samples. The ‘rownames’ element shows the IDs of these genes which will link the ‘rowData’ to the counts assay rows, and the ‘colnames’ element shows the sample IDs which will link the ‘colData’ to the counts assay columns.
Pulling this apart…
cd <- colData(se) cd ## DataFrame with 8 rows and 9 columns ## SampleName cell dex albut Run avgLength ## <factor> <factor> <factor> <factor> <factor> <integer> ## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 ## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 ## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 ## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 ## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 ## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 ## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 ## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 ## Experiment Sample BioSample ## <factor> <factor> <factor> ## SRR1039508 SRX384345 SRS508568 SAMN02422669 ## SRR1039509 SRX384346 SRS508567 SAMN02422675 ## SRR1039512 SRX384349 SRS508571 SAMN02422678 ## SRR1039513 SRX384350 SRS508572 SAMN02422670 ## SRR1039516 SRX384353 SRS508575 SAMN02422682 ## SRR1039517 SRX384354 SRS508576 SAMN02422673 ## SRR1039520 SRX384357 SRS508579 SAMN02422683 ## SRR1039521 SRX384358 SRS508580 SAMN02422677
we see the sample metadata includes whether or not the sample was treated, and some details about the experiment. Similarly, the gene metadata details the various IDs and genomic locations.
rowData(se)[1:3, ] ## DataFrame with 3 rows and 10 columns ## gene_id gene_name entrezid gene_biotype ## <character> <character> <integer> <character> ## ENSG00000000003 ENSG00000000003 TSPAN6 NA protein_coding ## ENSG00000000005 ENSG00000000005 TNMD NA protein_coding ## ENSG00000000419 ENSG00000000419 DPM1 NA protein_coding ## gene_seq_start gene_seq_end seq_name seq_strand ## <integer> <integer> <character> <integer> ## ENSG00000000003 99883667 99894988 X -1 ## ENSG00000000005 99839799 99854882 X 1 ## ENSG00000000419 49551404 49575092 20 -1 ## seq_coord_system symbol ## <integer> <character> ## ENSG00000000003 NA TSPAN6 ## ENSG00000000005 NA TNMD ## ENSG00000000419 NA DPM1
Tying these together is the actual assay data
assay(se)[1:5, 1:5] ## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 ## ENSG00000000003 679 448 873 408 1138 ## ENSG00000000005 0 0 0 0 0 ## ENSG00000000419 467 515 621 365 587 ## ENSG00000000457 260 211 263 164 245 ## ENSG00000000460 60 55 40 35 78
It all works really nicely - there are also additional places to store metadata about the experiment shared across samples, and metadata about the columns themselves
mcols(colData(se)) <- DataFrame( position = paste0("column ", 1:9), is_ID = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE) ) mcols(colData(se)) ## DataFrame with 9 rows and 2 columns ## position is_ID ## <character> <logical> ## SampleName column 1 TRUE ## cell column 2 TRUE ## dex column 3 FALSE ## albut column 4 FALSE ## Run column 5 TRUE ## avgLength column 6 FALSE ## Experiment column 7 TRUE ## Sample column 8 TRUE ## BioSample column 9 TRUE
Filtering a DataFrame
Now we get to a really interesting part - we want to do some analysis on this data.
First, we want to look at the samples which were untreated. A DataFrame
uses the
same [
subsetting mechanism as a data.frame
so we could do
cd[cd$dex == "untrt", c("SampleName", "avgLength")] ## DataFrame with 4 rows and 2 columns ## SampleName avgLength ## <factor> <integer> ## SRR1039508 GSM1275862 126 ## SRR1039512 GSM1275866 126 ## SRR1039516 GSM1275870 120 ## SRR1039520 GSM1275874 101
or even
subset(cd, dex == "untrt", select = c("SampleName", "avgLength")) ## DataFrame with 4 rows and 2 columns ## SampleName avgLength ## <factor> <integer> ## SRR1039508 GSM1275862 126 ## SRR1039512 GSM1275866 126 ## SRR1039516 GSM1275870 120 ## SRR1039520 GSM1275874 101
and those work fine.
What can often happen is you want the data for a specific sample. Sure, that ID
might be encoded in a column, but typically the sample identifiers are used as
the rownames of the colData
, so we can get the data for sample “SRR1039512” with
cd["SRR1039512", ] ## DataFrame with 1 row and 9 columns ## SampleName cell dex albut Run avgLength ## <factor> <factor> <factor> <factor> <factor> <integer> ## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 ## Experiment Sample BioSample ## <factor> <factor> <factor> ## SRR1039512 SRX384349 SRS508571 SAMN02422678
What would be really nice, though, is to be able to use {dplyr}!
Let’s try…
dplyr::filter(cd, dex == "untrt") ## Error in UseMethod("filter"): no applicable method for 'filter' applied to an object of class "c('DFrame', 'DataFrame', 'SimpleList', 'RectangularData', 'List', 'DataFrame_OR_NULL', 'Vector', 'list_OR_List', 'Annotated', 'vector_OR_Vector')"
Nope, it’s not defined for this class.
One of the most frequent workarounds for this is to convert this DataFrame
to
a tibble
, at which point, yes, {dplyr} should work
tbl_untrt <- dplyr::filter(tibble::as_tibble(cd), dex == "untrt") tbl_untrt ## # A tibble: 4 × 9 ## SampleName cell dex albut Run avgLength Experiment Sample BioSample ## <fct> <fct> <fct> <fct> <fct> <int> <fct> <fct> <fct> ## 1 GSM1275862 N61311 untrt untrt SRR10395… 126 SRX384345 SRS50… SAMN0242… ## 2 GSM1275866 N052611 untrt untrt SRR10395… 126 SRX384349 SRS50… SAMN0242… ## 3 GSM1275870 N080611 untrt untrt SRR10395… 120 SRX384353 SRS50… SAMN0242… ## 4 GSM1275874 N061011 untrt untrt SRR10395… 101 SRX384357 SRS50… SAMN0242…
That works, but there are two problems - both of them are that we now have a
tibble
, firstly because it now has no row names, and secondly, as a consequence,
we can’t “put it back” into the SummarizedExperiment
.
We have the same problem with the gene information - we can convert to tibble
and use {dplyr}, but we’re stuck there
dplyr::filter(tibble::as_tibble(rowData(airway)), symbol == "CD38") ## # A tibble: 1 × 10 ## gene_id gene_name entrezid gene_biotype gene_seq_start gene_seq_end seq_name ## <chr> <chr> <int> <chr> <int> <int> <chr> ## 1 ENSG0000… CD38 NA protein_cod… 15779898 15854853 4 ## # ℹ 3 more variables: seq_strand <int>, seq_coord_system <int>, symbol <chr>
Critically, tibble
suffers from the same lack of S4 support as regular data.frame
s,
so if there are any of those columns, this workflow will break
d <- tibble::tibble(a = 1) d ## # A tibble: 1 × 1 ## a ## <dbl> ## 1 1 d$grX <- GenomicRanges::GRanges("chrX", IRanges::IRanges(1:6, width = 6:1)) ## Error in `$<-`: ## ! Assigned data `GenomicRanges::GRanges("chrX", IRanges::IRanges(1:6, ## width = 6:1))` must be a vector. ## Caused by error in `vec_size()`: ## ! `x` must be a vector, not a <GRanges> object.
What else can we do?
Of Course I Have a Proposed Solution
Let’s create a gnarly DataFrame
with S4 columns and row names
m <- mtcars[, c("cyl", "hp", "am", "gear", "disp")] d <- as(m, "DataFrame") d ## DataFrame with 32 rows and 5 columns ## cyl hp am gear disp ## <numeric> <numeric> <numeric> <numeric> <numeric> ## Mazda RX4 6 110 1 4 160 ## Mazda RX4 Wag 6 110 1 4 160 ## Datsun 710 4 93 1 4 108 ## Hornet 4 Drive 6 110 0 3 258 ## Hornet Sportabout 8 175 0 3 360 ## ... ... ... ... ... ... ## Lotus Europa 4 113 1 5 95.1 ## Ford Pantera L 8 264 1 5 351.0 ## Ferrari Dino 6 175 1 5 145.0 ## Maserati Bora 8 335 1 5 301.0 ## Volvo 142E 4 109 1 4 121.0 d$grX <- GenomicRanges::GRanges("chrX", IRanges::IRanges(1:32, width=10)) d$grY <- GenomicRanges::GRanges("chrY", IRanges::IRanges(1:32, width = 10)) d$nl <- IRanges::NumericList(lapply(d$gear, function(n) round(rnorm(n), 2))) d ## DataFrame with 32 rows and 8 columns ## cyl hp am gear disp grX ## <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges> ## Mazda RX4 6 110 1 4 160 chrX:1-10 ## Mazda RX4 Wag 6 110 1 4 160 chrX:2-11 ## Datsun 710 4 93 1 4 108 chrX:3-12 ## Hornet 4 Drive 6 110 0 3 258 chrX:4-13 ## Hornet Sportabout 8 175 0 3 360 chrX:5-14 ## ... ... ... ... ... ... ... ## Lotus Europa 4 113 1 5 95.1 chrX:28-37 ## Ford Pantera L 8 264 1 5 351.0 chrX:29-38 ## Ferrari Dino 6 175 1 5 145.0 chrX:30-39 ## Maserati Bora 8 335 1 5 301.0 chrX:31-40 ## Volvo 142E 4 109 1 4 121.0 chrX:32-41 ## grY nl ## <GRanges> <NumericList> ## Mazda RX4 chrY:1-10 1.94,0.80,1.02,... ## Mazda RX4 Wag chrY:2-11 0.90, 1.52,-0.46,... ## Datsun 710 chrY:3-12 1.64,0.47,0.66,... ## Hornet 4 Drive chrY:4-13 0.06, 1.93,-1.11 ## Hornet Sportabout chrY:5-14 -1.01,-0.77,-0.04 ## ... ... ... ## Lotus Europa chrY:28-37 -0.22, 0.66,-1.10,... ## Ford Pantera L chrY:29-38 -1.38, 0.74, 1.24,... ## Ferrari Dino chrY:30-39 -0.81, 1.10, 0.85,... ## Maserati Bora chrY:31-40 1.59,0.65,1.32,... ## Volvo 142E chrY:32-41 1.07,0.78,0.00,...
and add some more metadata about the colData
columns
mcols(colData(se)) <- DataFrame( position = paste0("column ", 1:9), is_ID = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE) ) mcols(colData(se)) ## DataFrame with 9 rows and 2 columns ## position is_ID ## <character> <logical> ## SampleName column 1 TRUE ## cell column 2 TRUE ## dex column 3 FALSE ## albut column 4 FALSE ## Run column 5 TRUE ## avgLength column 6 FALSE ## Experiment column 7 TRUE ## Sample column 8 TRUE ## BioSample column 9 TRUE
My solution was to write the {dplyr} verbs in a way that was compatible with DataFrame
but which avoids passing through a tibble
wherever possible. That became {DFplyr}
and it’s been sitting around on GitHub
not doing much for the last 5 years.
I wrote a first version of this package despite being quite naive to the actual
implementation of DataFrame
, so I’m very grateful to Michael Lawrence and
Hervé Pagès who helped improve compatibility and pointed out bugs with what I’d
done.
Time to take it for a tour
library(DFplyr) ## Loading required package: dplyr ## ## Attaching package: 'dplyr' ## The following object is masked from 'package:Biobase': ## ## combine ## The following objects are masked from 'package:GenomicRanges': ## ## intersect, setdiff, union ## The following object is masked from 'package:GenomeInfoDb': ## ## intersect ## The following objects are masked from 'package:IRanges': ## ## collapse, desc, intersect, setdiff, slice, union ## The following object is masked from 'package:matrixStats': ## ## count ## The following objects are masked from 'package:S4Vectors': ## ## first, intersect, rename, setdiff, setequal, union ## The following objects are masked from 'package:BiocGenerics': ## ## combine, intersect, setdiff, union ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union ## Warning: replacing previous import 'S4Vectors::rename' by 'dplyr::rename' when ## loading 'DFplyr' ## ## Attaching package: 'DFplyr' ## The following objects are masked from 'package:dplyr': ## ## count, desc ## The following objects are masked from 'package:IRanges': ## ## desc, slice ## The following object is masked from 'package:matrixStats': ## ## count ## The following object is masked from 'package:stats': ## ## filter colData(se) ## DataFrame with 8 rows and 9 columns ## SampleName cell dex albut Run avgLength ## <factor> <factor> <factor> <factor> <factor> <integer> ## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 ## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 ## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 ## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 ## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 ## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 ## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 ## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 ## Experiment Sample BioSample ## <factor> <factor> <factor> ## SRR1039508 SRX384345 SRS508568 SAMN02422669 ## SRR1039509 SRX384346 SRS508567 SAMN02422675 ## SRR1039512 SRX384349 SRS508571 SAMN02422678 ## SRR1039513 SRX384350 SRS508572 SAMN02422670 ## SRR1039516 SRX384353 SRS508575 SAMN02422682 ## SRR1039517 SRX384354 SRS508576 SAMN02422673 ## SRR1039520 SRX384357 SRS508579 SAMN02422683 ## SRR1039521 SRX384358 SRS508580 SAMN02422677 cd <- colData(se)
One thing I added was a custom ‘label’ if you’re using this package within RStudio,
in which case the Environment pane will show cd
as “Formal class DataFrame (dplyr-compatible)”.
That’s not a change in the object itself, but rather all DataFrame
objects will
now show this, because {dplyr} verbs should work on them.
Now I can just filter
the colData
as if it was a data.frame
or a tibble
,
but the result is still a DataFrame
treated <- filter(cd, dex == "trt") treated ## DataFrame with 4 rows and 9 columns ## SampleName cell dex albut Run avgLength ## <factor> <factor> <factor> <factor> <factor> <integer> ## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 ## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 ## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 ## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 ## Experiment Sample BioSample ## <factor> <factor> <factor> ## SRR1039509 SRX384346 SRS508567 SAMN02422675 ## SRR1039513 SRX384350 SRS508572 SAMN02422670 ## SRR1039517 SRX384354 SRS508576 SAMN02422673 ## SRR1039521 SRX384358 SRS508580 SAMN02422677
Importantly, the row names and metadata are completely preserved, relevant to this filtered subset
rownames(treated) ## [1] "SRR1039509" "SRR1039513" "SRR1039517" "SRR1039521" mcols(treated) ## DataFrame with 9 rows and 2 columns ## position is_ID ## <character> <logical> ## SampleName column 1 TRUE ## cell column 2 TRUE ## dex column 3 FALSE ## albut column 4 FALSE ## Run column 5 TRUE ## avgLength column 6 FALSE ## Experiment column 7 TRUE ## Sample column 8 TRUE ## BioSample column 9 TRUE exp <- select(cd, Experiment, Sample, BioSample) exp ## DataFrame with 8 rows and 3 columns ## Experiment Sample BioSample ## <factor> <factor> <factor> ## SRR1039508 SRX384345 SRS508568 SAMN02422669 ## SRR1039509 SRX384346 SRS508567 SAMN02422675 ## SRR1039512 SRX384349 SRS508571 SAMN02422678 ## SRR1039513 SRX384350 SRS508572 SAMN02422670 ## SRR1039516 SRX384353 SRS508575 SAMN02422682 ## SRR1039517 SRX384354 SRS508576 SAMN02422673 ## SRR1039520 SRX384357 SRS508579 SAMN02422683 ## SRR1039521 SRX384358 SRS508580 SAMN02422677 mcols(exp) ## DataFrame with 3 rows and 2 columns ## position is_ID ## <character> <logical> ## Experiment column 7 TRUE ## Sample column 8 TRUE ## BioSample column 9 TRUE
I can add more sample data if I have it
cdq <- mutate(cd, quality = runif(nrow(cd))) cdq ## DataFrame with 8 rows and 10 columns ## SampleName cell dex albut Run avgLength ## <factor> <factor> <factor> <factor> <factor> <integer> ## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 ## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 ## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 ## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 ## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 ## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 ## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 ## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 ## Experiment Sample BioSample quality ## <factor> <factor> <factor> <numeric> ## SRR1039508 SRX384345 SRS508568 SAMN02422669 0.386436 ## SRR1039509 SRX384346 SRS508567 SAMN02422675 0.566096 ## SRR1039512 SRX384349 SRS508571 SAMN02422678 0.327839 ## SRR1039513 SRX384350 SRS508572 SAMN02422670 0.744911 ## SRR1039516 SRX384353 SRS508575 SAMN02422682 0.331616 ## SRR1039517 SRX384354 SRS508576 SAMN02422673 0.728328 ## SRR1039520 SRX384357 SRS508579 SAMN02422683 0.626803 ## SRR1039521 SRX384358 SRS508580 SAMN02422677 0.248757 mcols(cdq) ## DataFrame with 10 rows and 2 columns ## position is_ID ## <character> <logical> ## SampleName column 1 TRUE ## cell column 2 TRUE ## dex column 3 FALSE ## albut column 4 FALSE ## Run column 5 TRUE ## avgLength column 6 FALSE ## Experiment column 7 TRUE ## Sample column 8 TRUE ## BioSample column 9 TRUE ## quality NA NA
and since no metadata about this new column was available, it has a blank row in
the mcols
entry.
If I perform an operation that doesn’t make sense in terms of row names, they will be removed
cdq |> group_by(dex) |> summarise(avg_quality = mean(quality)) ## dex avg_quality ## 1 trt 0.5720231 ## 2 untrt 0.4181736
but rearranging them is fine
cdq |> arrange(desc(quality)) |> slice(1:4) ## DataFrame with 4 rows and 10 columns ## SampleName cell dex albut Run avgLength ## <factor> <factor> <factor> <factor> <factor> <integer> ## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 ## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 ## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 ## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 ## Experiment Sample BioSample quality ## <factor> <factor> <factor> <numeric> ## SRR1039513 SRX384350 SRS508572 SAMN02422670 0.744911 ## SRR1039517 SRX384354 SRS508576 SAMN02422673 0.728328 ## SRR1039520 SRX384357 SRS508579 SAMN02422683 0.626803 ## SRR1039509 SRX384346 SRS508567 SAMN02422675 0.566096
One thing that DataFrame
lacks - as does data.frame
- is the concept of groups.
{tibble} resolved this by adding a "groups"
attribute, so I did the same in {DFplyr} -
I don’t own {S4Vectors}, so I had to mask some functions in order to build in this
support. I can now perform grouped operations on a DataFrame
and have the
grouping information printed like {tibble} does
cd |> group_by(dex) ## DataFrame with 8 rows and 9 columns ## Groups: dex ## SampleName cell dex albut Run avgLength ## <factor> <factor> <factor> <factor> <factor> <integer> ## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 ## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 ## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 ## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 ## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 ## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 ## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 ## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 ## Experiment Sample BioSample ## <factor> <factor> <factor> ## SRR1039508 SRX384345 SRS508568 SAMN02422669 ## SRR1039509 SRX384346 SRS508567 SAMN02422675 ## SRR1039512 SRX384349 SRS508571 SAMN02422678 ## SRR1039513 SRX384350 SRS508572 SAMN02422670 ## SRR1039516 SRX384353 SRS508575 SAMN02422682 ## SRR1039517 SRX384354 SRS508576 SAMN02422673 ## SRR1039520 SRX384357 SRS508579 SAMN02422683 ## SRR1039521 SRX384358 SRS508580 SAMN02422677 cd |> group_by(dex) |> summarise(meanAvgLength = mean(avgLength)) ## dex meanAvgLength ## 1 trt 109.25 ## 2 untrt 118.25
If I try to “put back” the subset colData
I get an error, because there is
assay data for samples we dropped
colData(se) ## DataFrame with 8 rows and 9 columns ## SampleName cell dex albut Run avgLength ## <factor> <factor> <factor> <factor> <factor> <integer> ## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 ## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 ## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 ## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 ## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 ## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 ## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 ## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 ## Experiment Sample BioSample ## <factor> <factor> <factor> ## SRR1039508 SRX384345 SRS508568 SAMN02422669 ## SRR1039509 SRX384346 SRS508567 SAMN02422675 ## SRR1039512 SRX384349 SRS508571 SAMN02422678 ## SRR1039513 SRX384350 SRS508572 SAMN02422670 ## SRR1039516 SRX384353 SRS508575 SAMN02422682 ## SRR1039517 SRX384354 SRS508576 SAMN02422673 ## SRR1039520 SRX384357 SRS508579 SAMN02422683 ## SRR1039521 SRX384358 SRS508580 SAMN02422677 treated ## DataFrame with 4 rows and 9 columns ## SampleName cell dex albut Run avgLength ## <factor> <factor> <factor> <factor> <factor> <integer> ## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 ## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 ## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 ## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 ## Experiment Sample BioSample ## <factor> <factor> <factor> ## SRR1039509 SRX384346 SRS508567 SAMN02422675 ## SRR1039513 SRX384350 SRS508572 SAMN02422670 ## SRR1039517 SRX384354 SRS508576 SAMN02422673 ## SRR1039521 SRX384358 SRS508580 SAMN02422677 colData(se) <- treated ## Error in `colData<-`(`*tmp*`, value = new("DFrame", rownames = c("SRR1039509", : nrow of supplied 'colData' must equal ncol of object
But as long as I have some metadata for all the samples, it’s fine - cdq
was the
version of colData
with quality information, and you can see it’s incorporated
on the last line of this output
colData(se) <- cdq se ## class: RangedSummarizedExperiment ## dim: 63677 8 ## metadata(1): '' ## assays(1): counts ## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492 ## ENSG00000273493 ## rowData names(10): gene_id gene_name ... seq_coord_system symbol ## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521 ## colData names(10): SampleName cell ... BioSample quality
Storing S4 classes isn’t all we can do - performing {dplyr} operations on S4 columns is perfectly fine; here, adding more S4 columns based on others
myrange <- GenomicRanges::GRanges("chrX", IRanges::IRanges(2:5, width = 10)) d <- mutate(d, quality = runif(32), overlaps = GenomicRanges::countOverlaps(grX, myrange)) d <- mutate(d, nearest = plyranges::join_nearest(grX, myrange)) d ## DataFrame with 32 rows and 11 columns ## cyl hp am gear disp grX ## <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges> ## Mazda RX4 6 110 1 4 160 chrX:1-10 ## Mazda RX4 Wag 6 110 1 4 160 chrX:2-11 ## Datsun 710 4 93 1 4 108 chrX:3-12 ## Hornet 4 Drive 6 110 0 3 258 chrX:4-13 ## Hornet Sportabout 8 175 0 3 360 chrX:5-14 ## ... ... ... ... ... ... ... ## Lotus Europa 4 113 1 5 95.1 chrX:28-37 ## Ford Pantera L 8 264 1 5 351.0 chrX:29-38 ## Ferrari Dino 6 175 1 5 145.0 chrX:30-39 ## Maserati Bora 8 335 1 5 301.0 chrX:31-40 ## Volvo 142E 4 109 1 4 121.0 chrX:32-41 ## grY nl overlaps quality ## <GRanges> <NumericList> <integer> <numeric> ## Mazda RX4 chrY:1-10 1.94,0.80,1.02,... 4 0.144992 ## Mazda RX4 Wag chrY:2-11 0.90, 1.52,-0.46,... 4 0.726243 ## Datsun 710 chrY:3-12 1.64,0.47,0.66,... 4 0.146000 ## Hornet 4 Drive chrY:4-13 0.06, 1.93,-1.11 4 0.695177 ## Hornet Sportabout chrY:5-14 -1.01,-0.77,-0.04 4 0.921556 ## ... ... ... ... ... ## Lotus Europa chrY:28-37 -0.22, 0.66,-1.10,... 0 0.371289 ## Ford Pantera L chrY:29-38 -1.38, 0.74, 1.24,... 0 0.420842 ## Ferrari Dino chrY:30-39 -0.81, 1.10, 0.85,... 0 0.890471 ## Maserati Bora chrY:31-40 1.59,0.65,1.32,... 0 0.395305 ## Volvo 142E chrY:32-41 1.07,0.78,0.00,... 0 0.532688 ## nearest ## <GRanges> ## Mazda RX4 chrX:1-10 ## Mazda RX4 Wag chrX:2-11 ## Datsun 710 chrX:3-12 ## Hornet 4 Drive chrX:4-13 ## Hornet Sportabout chrX:5-14 ## ... ... ## Lotus Europa chrX:28-37 ## Ford Pantera L chrX:29-38 ## Ferrari Dino chrX:30-39 ## Maserati Bora chrX:31-40 ## Volvo 142E chrX:32-41
or filtering rows
d |> arrange(desc(quality)) |> filter(overlaps > 0) |> slice(1:4) ## DataFrame with 4 rows and 11 columns ## cyl hp am gear disp grX ## <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges> ## Merc 280C 6 123 0 4 167.6 chrX:11-20 ## Merc 450SL 8 180 0 3 275.8 chrX:13-22 ## Merc 450SE 8 180 0 3 275.8 chrX:12-21 ## Hornet Sportabout 8 175 0 3 360.0 chrX:5-14 ## grY nl overlaps quality ## <GRanges> <NumericList> <integer> <numeric> ## Merc 280C chrY:11-20 -0.08, 0.55,-0.59,... 4 0.990185 ## Merc 450SL chrY:13-22 1.23,-0.35, 0.04 2 0.979190 ## Merc 450SE chrY:12-21 2.60, 0.98,-1.50 3 0.958658 ## Hornet Sportabout chrY:5-14 -1.01,-0.77,-0.04 4 0.921556 ## nearest ## <GRanges> ## Merc 280C chrX:11-20 ## Merc 450SL chrX:13-22 ## Merc 450SE chrX:12-21 ## Hornet Sportabout chrX:5-14
You don’t necessarily need to extract the DataFrame
into a variable, either -
working with the extraction directly could be cleaner in some cases
count(colData(se), cell) ## DataFrame with 4 rows and 2 columns ## cell n ## <factor> <integer> ## 1 N052611 2 ## 2 N061011 2 ## 3 N080611 2 ## 4 N61311 2
Other Ways to Filter
Using {dplyr} syntax on a DataFrame
is very convenient, but that’s only one
component of a SummarizedExperiment
- it would be great to be able to extend
this further and do something like
filter(airway, rows = ..., cols = ...)
It is possible to do something like this - if I generate IDs of the ‘rows’ and
‘columns’ I want to subset to, I can pass these in to a [
extraction
untrt <- colData(airway) |> filter(dex == "untrt") |> rownames() untrt ## [1] "SRR1039508" "SRR1039512" "SRR1039516" "SRR1039520" genes <- rowData(airway)[c(1:3, 8:10), ] |> rownames() genes ## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000001036" ## [5] "ENSG00000001084" "ENSG00000001167" airway_sub <- airway[genes, untrt] airway_sub ## class: RangedSummarizedExperiment ## dim: 6 4 ## metadata(1): '' ## assays(1): counts ## rownames(6): ENSG00000000003 ENSG00000000005 ... ENSG00000001084 ## ENSG00000001167 ## rowData names(10): gene_id gene_name ... seq_coord_system symbol ## colnames(4): SRR1039508 SRR1039512 SRR1039516 SRR1039520 ## colData names(9): SampleName cell ... Sample BioSample dim(airway_sub) ## [1] 6 4 colData(airway_sub) ## DataFrame with 4 rows and 9 columns ## SampleName cell dex albut Run avgLength ## <factor> <factor> <factor> <factor> <factor> <integer> ## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 ## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 ## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 ## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 ## Experiment Sample BioSample ## <factor> <factor> <factor> ## SRR1039508 SRX384345 SRS508568 SAMN02422669 ## SRR1039512 SRX384349 SRS508571 SAMN02422678 ## SRR1039516 SRX384353 SRS508575 SAMN02422682 ## SRR1039520 SRX384357 SRS508579 SAMN02422683 rowData(airway_sub) ## DataFrame with 6 rows and 10 columns ## gene_id gene_name entrezid gene_biotype ## <character> <character> <integer> <character> ## ENSG00000000003 ENSG00000000003 TSPAN6 NA protein_coding ## ENSG00000000005 ENSG00000000005 TNMD NA protein_coding ## ENSG00000000419 ENSG00000000419 DPM1 NA protein_coding ## ENSG00000001036 ENSG00000001036 FUCA2 NA protein_coding ## ENSG00000001084 ENSG00000001084 GCLC NA protein_coding ## ENSG00000001167 ENSG00000001167 NFYA NA protein_coding ## gene_seq_start gene_seq_end seq_name seq_strand ## <integer> <integer> <character> <integer> ## ENSG00000000003 99883667 99894988 X -1 ## ENSG00000000005 99839799 99854882 X 1 ## ENSG00000000419 49551404 49575092 20 -1 ## ENSG00000001036 143815948 143832827 6 -1 ## ENSG00000001084 53362139 53481768 6 -1 ## ENSG00000001167 41040684 41067715 6 1 ## seq_coord_system symbol ## <integer> <character> ## ENSG00000000003 NA TSPAN6 ## ENSG00000000005 NA TNMD ## ENSG00000000419 NA DPM1 ## ENSG00000001036 NA FUCA2 ## ENSG00000001084 NA GCLC ## ENSG00000001167 NA NFYA
Notice that this new object now has only 6 genes and 4 samples (the original has 63677 genes and 8 samples).
This sort of filtering is even nicer in MultiAssayExperiment
where multiple
SummarizedExperiments
(or similar objects) are aggregated into a super object
where the colData
may be relevant to samples which appear in multiple assays.
MultiAssayExperiment
has some of the best written vignettes
I’ve seen, and I thoroughly recommend reading them if you have any need to use
this format
library(MultiAssayExperiment)
(building the object not shown)
mae <- MultiAssayExperiment(experiments = ExpList, colData = colDat, sampleMap = sampMap ) mae ## A MultiAssayExperiment object of 4 listed ## experiments with user-defined names and respective classes. ## Containing an ExperimentList class object of length 4: ## [1] Affy: SummarizedExperiment with 5 rows and 4 columns ## [2] Methyl450k: matrix with 5 rows and 5 columns ## [3] RNASeqGene: matrix with 5 rows and 4 columns ## [4] GISTIC: RangedSummarizedExperiment with 5 rows and 3 columns ## Functionality: ## experiments() - obtain the ExperimentList instance ## colData() - the primary/phenotype DataFrame ## sampleMap() - the sample coordination DataFrame ## `$`, `[`, `[[` - extract colData columns, subset, or experiment ## *Format() - convert into a long or wide DataFrame ## assays() - convert ExperimentList to a SimpleList of matrices ## exportClass() - save data to flat files colData(mae) ## DataFrame with 4 rows and 2 columns ## sex age ## <character> <integer> ## Jack M 38 ## Jill F 39 ## Bob M 40 ## Barbara F 41
Subsetting the different dimensions (rows, columns, assays) can be done with the following format
myMultiAssay[rows, columns, assays]
and since mae$x
is defined to extract a colData
column, I can subset this
object to measurements of two genes in any assay for just the male subjects
mae[c("ENST00000355076", "ENST00000383323"), mae$sex == "M", ] ## A MultiAssayExperiment object of 4 listed ## experiments with user-defined names and respective classes. ## Containing an ExperimentList class object of length 4: ## [1] Affy: SummarizedExperiment with 2 rows and 2 columns ## [2] Methyl450k: matrix with 2 rows and 3 columns ## [3] RNASeqGene: matrix with 1 rows and 2 columns ## [4] GISTIC: RangedSummarizedExperiment with 1 rows and 2 columns ## Functionality: ## experiments() - obtain the ExperimentList instance ## colData() - the primary/phenotype DataFrame ## sampleMap() - the sample coordination DataFrame ## `$`, `[`, `[[` - extract colData columns, subset, or experiment ## *Format() - convert into a long or wide DataFrame ## assays() - convert ExperimentList to a SimpleList of matrices ## exportClass() - save data to flat files
All of that brings us to the tidyOmics project which
aims to create tidy analysis packages for omics data. One of the packages there
is tidySummarizedExperiment
which is a reframing of SummarizedExperiment
in
a ‘tidy’ sense
# remotes::install_github("stemangiola/tidySummarizedExperiment") tidySummarizedExperiment::pasilla ## # A SummarizedExperiment-tibble abstraction: 102,193 × 5 ## # Features=14599 | Samples=7 | Assays=counts ## .feature .sample counts condition type ## <chr> <chr> <int> <chr> <chr> ## 1 FBgn0000003 untrt1 0 untreated single_end ## 2 FBgn0000008 untrt1 92 untreated single_end ## 3 FBgn0000014 untrt1 5 untreated single_end ## 4 FBgn0000015 untrt1 0 untreated single_end ## 5 FBgn0000017 untrt1 4664 untreated single_end ## 6 FBgn0000018 untrt1 583 untreated single_end ## 7 FBgn0000022 untrt1 0 untreated single_end ## 8 FBgn0000024 untrt1 10 untreated single_end ## 9 FBgn0000028 untrt1 0 untreated single_end ## 10 FBgn0000032 untrt1 1446 untreated single_end ## # ℹ 40 more rows
If you’re already familiar with MultiAssayExperiment
then you may find this
somewhat familiar to the longFormat
output there
longFormat(mae) ## DataFrame with 80 rows and 5 columns ## assay primary rowname colname value ## <character> <character> <character> <character> <numeric> ## 1 Affy Jack ENST00000294241 array1 101 ## 2 Affy Jack ENST00000355076 array1 102 ## 3 Affy Jack ENST00000383706 array1 103 ## 4 Affy Jack ENST00000234812 array1 104 ## 5 Affy Jack ENST00000383323 array1 105 ## ... ... ... ... ... ... ## 76 GISTIC Jill ENST00000135411 samp2 0 ## 77 GISTIC Jill ENST00000135412 samp2 1 ## 78 GISTIC Jill ENST00000135413 samp2 0 ## 79 GISTIC Jill ENST00000135414 samp2 1 ## 80 GISTIC Jill ENST00000383323 samp2 0
Summary
That was a bit long winded, but I’ve been wanting to spell out my love for rownames
which comes from this Bioconductor approach for some time. {DFplyr} is a bit of a
mess internally, but I believe it has promise to be useful, so I’m interested in
use-cases, failure modes, etc… The tidyOmics project defines a lot of new structures
which need to interact with DataFrame
s and I’m reasonably sure that {DFplyr} can
help with that.
If you have comments, suggestions, or improvements - or if you think {DFplyr} could be useful to you, feel free to use the comment section below, or hit me up on Mastodon.
devtools::session_info()
## ─ Session info ─────────────────────────────────────────────────────────────── ## setting value ## version R version 4.3.3 (2024-02-29) ## os Pop!_OS 22.04 LTS ## system x86_64, linux-gnu ## ui X11 ## language (EN) ## collate en_AU.UTF-8 ## ctype en_AU.UTF-8 ## tz Australia/Adelaide ## date 2024-08-11 ## pandoc 3.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown) ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date (UTC) lib source ## abind 1.4-5 2016-07-21 [1] CRAN (R 4.3.3) ## airway * 1.22.0 2023-10-26 [1] Bioconductor ## Biobase * 2.62.0 2023-10-24 [1] Bioconductor ## BiocBaseUtils 1.4.0 2023-10-24 [1] Bioconductor ## BiocGenerics * 0.48.1 2023-11-01 [1] Bioconductor ## BiocIO 1.12.0 2023-10-24 [1] Bioconductor ## BiocParallel 1.36.0 2023-10-24 [1] Bioconductor ## Biostrings 2.70.3 2024-03-13 [1] Bioconductor 3.18 (R 4.3.3) ## bitops 1.0-7 2021-04-24 [1] CRAN (R 4.3.2) ## blogdown 1.19 2024-02-01 [1] CRAN (R 4.3.3) ## bookdown 0.36 2023-10-16 [1] CRAN (R 4.3.2) ## bslib 0.6.1 2023-11-28 [3] CRAN (R 4.3.2) ## cachem 1.0.8 2023-05-01 [3] CRAN (R 4.3.0) ## callr 3.7.3 2022-11-02 [3] CRAN (R 4.2.2) ## charcuterie * 0.0.2 2024-08-07 [1] local ## cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.3) ## codetools 0.2-19 2023-02-01 [4] CRAN (R 4.2.2) ## colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.3) ## crayon 1.5.2 2022-09-29 [3] CRAN (R 4.2.1) ## data.table 1.15.0 2024-01-30 [3] CRAN (R 4.3.2) ## DelayedArray 0.28.0 2023-10-24 [1] Bioconductor ## devtools 2.4.5 2022-10-11 [1] CRAN (R 4.3.2) ## DFplyr * 0.0.1.9000 2024-08-11 [1] local ## digest 0.6.34 2024-01-11 [3] CRAN (R 4.3.2) ## dplyr * 1.1.4 2023-11-17 [3] CRAN (R 4.3.2) ## ellipsis 0.3.2 2021-04-29 [3] CRAN (R 4.1.1) ## evaluate 0.23 2023-11-01 [3] CRAN (R 4.3.2) ## fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.3) ## fastmap 1.1.1 2023-02-24 [3] CRAN (R 4.2.2) ## fs 1.6.3 2023-07-20 [3] CRAN (R 4.3.1) ## generics 0.1.3 2022-07-05 [3] CRAN (R 4.2.1) ## GenomeInfoDb * 1.38.8 2024-03-15 [1] Bioconductor 3.18 (R 4.3.3) ## GenomeInfoDbData 1.2.11 2024-06-21 [1] Bioconductor ## GenomicAlignments 1.38.2 2024-01-16 [1] Bioconductor 3.18 (R 4.3.3) ## GenomicRanges * 1.54.1 2023-10-29 [1] Bioconductor ## ggplot2 3.5.1 2024-04-23 [1] CRAN (R 4.3.3) ## glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.3) ## gtable 0.3.5 2024-04-22 [1] CRAN (R 4.3.3) ## htmltools 0.5.7 2023-11-03 [3] CRAN (R 4.3.2) ## htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.2) ## httpuv 1.6.12 2023-10-23 [1] CRAN (R 4.3.2) ## httr 1.4.7 2023-08-15 [3] CRAN (R 4.3.1) ## icecream 0.2.1 2023-09-27 [1] CRAN (R 4.3.2) ## IRanges * 2.36.0 2023-10-24 [1] Bioconductor ## jquerylib 0.1.4 2021-04-26 [3] CRAN (R 4.1.2) ## jsonlite 1.8.8 2023-12-04 [3] CRAN (R 4.3.2) ## knitr 1.45 2023-10-30 [3] CRAN (R 4.3.2) ## later 1.3.1 2023-05-02 [1] CRAN (R 4.3.2) ## lattice 0.22-5 2023-10-24 [4] CRAN (R 4.3.1) ## lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.3.2) ## lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.3) ## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.3) ## Matrix 1.6-5 2024-01-11 [4] CRAN (R 4.3.3) ## MatrixGenerics * 1.14.0 2023-10-24 [1] Bioconductor ## matrixStats * 1.1.0 2023-11-07 [1] CRAN (R 4.3.2) ## memoise 2.0.1 2021-11-26 [3] CRAN (R 4.2.0) ## mime 0.12 2021-09-28 [3] CRAN (R 4.2.0) ## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.2) ## MultiAssayExperiment * 1.28.0 2023-10-24 [1] Bioconductor ## munsell 0.5.1 2024-04-01 [1] CRAN (R 4.3.3) ## pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.3) ## pkgbuild 1.4.2 2023-06-26 [1] CRAN (R 4.3.2) ## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.3) ## pkgload 1.3.3 2023-09-22 [1] CRAN (R 4.3.2) ## plotly 4.10.4 2024-01-13 [1] CRAN (R 4.3.3) ## plyranges 1.22.0 2023-10-24 [1] Bioconductor ## prettyunits 1.2.0 2023-09-24 [3] CRAN (R 4.3.1) ## processx 3.8.3 2023-12-10 [3] CRAN (R 4.3.2) ## profvis 0.3.8 2023-05-02 [1] CRAN (R 4.3.2) ## promises 1.2.1 2023-08-10 [1] CRAN (R 4.3.2) ## ps 1.7.6 2024-01-18 [3] CRAN (R 4.3.2) ## purrr 1.0.2 2023-08-10 [3] CRAN (R 4.3.1) ## R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.3) ## Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.3.2) ## RCurl 1.98-1.14 2024-01-09 [1] CRAN (R 4.3.3) ## remotes 2.4.2.1 2023-07-18 [1] CRAN (R 4.3.2) ## restfulr 0.0.15 2022-06-16 [1] CRAN (R 4.3.3) ## rjson 0.2.21 2022-01-09 [1] CRAN (R 4.3.3) ## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.3.3) ## rmarkdown 2.25 2023-09-18 [3] CRAN (R 4.3.1) ## Rsamtools 2.18.0 2023-10-24 [1] Bioconductor ## rstudioapi 0.15.0 2023-07-07 [3] CRAN (R 4.3.1) ## rtracklayer 1.62.0 2023-10-24 [1] Bioconductor ## S4Arrays 1.2.1 2024-03-04 [1] Bioconductor 3.18 (R 4.3.3) ## S4Vectors * 0.40.2 2023-11-23 [1] Bioconductor 3.18 (R 4.3.3) ## sass 0.4.8 2023-12-06 [3] CRAN (R 4.3.2) ## scales 1.3.0 2023-11-28 [1] CRAN (R 4.3.2) ## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.2) ## shiny 1.7.5.1 2023-10-14 [1] CRAN (R 4.3.2) ## SparseArray 1.2.4 2024-02-11 [1] Bioconductor 3.18 (R 4.3.3) ## stringi 1.8.3 2023-12-11 [3] CRAN (R 4.3.2) ## stringr 1.5.1 2023-11-14 [3] CRAN (R 4.3.2) ## SummarizedExperiment * 1.32.0 2023-10-24 [1] Bioconductor ## tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.3) ## tidyr 1.3.1 2024-01-24 [3] CRAN (R 4.3.2) ## tidyselect 1.2.0 2022-10-10 [3] CRAN (R 4.2.1) ## tidySummarizedExperiment 1.15.1 2024-08-06 [1] Github (stemangiola/tidySummarizedExperiment@4b8a4e1) ## ttservice 0.4.1 2024-06-07 [1] CRAN (R 4.3.3) ## urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.2) ## usethis 3.0.0 2024-07-29 [1] CRAN (R 4.3.3) ## utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.3) ## vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.3) ## viridisLite 0.4.2 2023-05-02 [1] CRAN (R 4.3.3) ## withr 3.0.0 2024-01-16 [1] CRAN (R 4.3.3) ## xfun 0.41 2023-11-01 [3] CRAN (R 4.3.2) ## XML 3.99-0.16.1 2024-01-22 [1] CRAN (R 4.3.3) ## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.2) ## XVector 0.42.0 2023-10-24 [1] Bioconductor ## yaml 2.3.8 2023-12-11 [3] CRAN (R 4.3.2) ## zlibbioc 1.48.2 2024-03-13 [1] Bioconductor 3.18 (R 4.3.3) ## ## [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.3 ## [2] /usr/local/lib/R/site-library ## [3] /usr/lib/R/site-library ## [4] /usr/lib/R/library ## ## ──────────────────────────────────────────────────────────────────────────────
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.