sqldf and grouping rows in R
[This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In R, you can treat tables (or data.frames as they are called in R) as SQL tables. That means you can query them as you would query a database with SQL commands. This is particularly useful 1) if you know SQL, hahah:) 2) if you have large tables with millions of rows. In R, querying a database will be much faster than iterating through the rows of a million row table.
Say you have a table of exon locations and RPKM value for each exon and you want to get the total RPKM value for each transcript:
>head(rnaEx)
ex_rank ex_name tx_chr tx_start tx_end score len strand
1 heart_k25:1002070u:2.9 chrX 101864664 101864760 1314.937 65 +
2 heart_k25:1002070u:2.9 chrX 101864664 101864760 0.000 5 +
1 heart_k25:1002622u:3.0 chrX 96534438 96534487 1055.568 50 –
# load the library that can treat tables as SQL tables
>require(sqldf)
# query the table with SQL command: group by transcripts and sum up the scores
>int.res=sqldf(“select ex_name,tx_chr,tx_start,tx_
>head(int.res)
ex_name tx_chr tx_start tx_end strand sum_score len
heart_k25:100000u:2.5 chrX 73370385 73370481 + 2448.919 97 heart_k25:1000027u:5.5 chr1 93550994 93551053 – 2352.406 60
heart_k25:1000076u:2.5 chr5 15764153 15764204 – 1140.013 52 heart_k25:1000078:2.6 chr11 116415816 116415975 + 3787.981 160
Check here for more info on sqldf:
Similar functionality also exists in R through another package called plyr. Although, I feel like sqldf will be faster for large data.frames, I have no evidence on that at the moment. plyr is described here and here, and check below for a quick intro:
To leave a comment for the author, please follow the link and comment on their blog: Recipes, scripts and genomics.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.