RTCGA factory of R packages – Quick Guide
Yesterday a new version of R was released: R 3.3.0 (codename Supposedly Educational). This enabled Bioconductor (yes, not all packages are distributed on CRAN) to release its new version, 3.3. This means that all packages held on Bioconductor that were under rapid and vivid development have been moved to stable release versions and can now be easily installed. This happens once or twice a year. On that date I finished work on the RTCGA package and released, on Bioconductor, the RTCGA Factory of R Packages. Read this quick guide to find out more about this R toolkit for biostatistics, which uses data from The Cancer Genome Atlas study.
About The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing – http://cancergenome.nih.gov/.
Our team converted selected datasets from this study into a few separate packages that are hosted on Bioconductor. These R packages make the selected datasets easier to access and manage. Datasets in RTCGA packages are large and cover complex relations between clinical outcomes and genetic background.
To use RTCGA, install the package by following the instructions on its Bioconductor home page.
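As a minimal sketch of the installation step, assuming a current R installation where the BiocManager package is the Bioconductor installer (at the time of Bioconductor 3.3 the equivalent entry point was the biocLite script), the package can be installed like this:

```r
# Assumption: BiocManager is the installer used by current Bioconductor releases.
# Install it from CRAN if it is not already available.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install RTCGA from Bioconductor and attach it.
BiocManager::install("RTCGA")
library(RTCGA)
```

The Bioconductor home page of the package remains the authoritative source for the exact command that matches your R version.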
Check, Download and Read Data
Packages from the RTCGA factory will be useful for at least three audiences: biostatisticians who work with cancer data; researchers working on large-scale algorithms, for whom RTCGA data will be a perfect testing ground; and teachers presenting data analysis methods on real data problems.
TCGA releases various datasets over time for different cohorts, which are determined by cancer type. One can check:
infoTCGA(): the cohort codes and the counts for each cohort in the TCGA project,
checkTCGA('Dates'): the release dates of TCGA datasets,
checkTCGA('DataSets', cancerType = "BRCA"): the names of TCGA datasets for the current release date and the given cohort.
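The three checks can be run together as in the sketch below. It assumes internet access and that the TCGA data portal queried by RTCGA is reachable; the variable names are only illustrative, and the date argument is passed explicitly here to pin the release used in this guide.

```r
library(RTCGA)

# Cohort codes and the number of samples available for each cohort.
cohort_info <- infoTCGA()
head(cohort_info)

# Release dates of TCGA datasets.
release_dates <- checkTCGA("Dates")
tail(release_dates)

# Dataset names for the BRCA cohort, pinned to the 2015-11-01 release.
brca_datasets <- checkTCGA("DataSets", cancerType = "BRCA", date = "2015-11-01")
head(brca_datasets)
```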
With that knowledge we are able to download specific datasets from the TCGA study, for example the datasets that have the string Merge_Clinical.Level_1 in their name for the BRCA cohort (breast carcinoma) and the 2015-11-01 release date.
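A download of this kind can be sketched with the downloadTCGA function from RTCGA. The destination directory name `data` is a hypothetical choice, and the call needs internet access:

```r
library(RTCGA)

# Hypothetical destination directory for the downloaded files.
dest_dir <- "data"
dir.create(dest_dir, showWarnings = FALSE)

# Download the datasets whose names contain "Merge_Clinical.Level_1"
# for the BRCA cohort and the 2015-11-01 release date.
downloadTCGA(
  cancerTypes = "BRCA",
  dataSet     = "Merge_Clinical.Level_1",
  destDir     = dest_dir,
  date        = "2015-11-01"
)

# Inspect what was downloaded and extracted.
list.files(dest_dir, recursive = TRUE)
```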
For specific datasets (8 types) we have prepared the readTCGA function, which reads a dataset into a tidy format using the data.table::fread function. For expression datasets we also convert the column types to numeric values.
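The sketch below reads a previously downloaded clinical dataset with readTCGA. It assumes the files were extracted under a hypothetical `data` directory and that the clinical file name ends with `clinical_clin_format.txt`; both are assumptions that may need adjusting to what the download actually produced.

```r
library(RTCGA)

# Assumption: the clinical dataset was downloaded and extracted under "data".
dest_dir <- "data"

# Assumption: the clinical file name ends with "clinical_clin_format.txt".
clinical_path <- list.files(
  dest_dir,
  pattern    = "clinical_clin_format\\.txt$",
  recursive  = TRUE,
  full.names = TRUE
)
stopifnot(length(clinical_path) >= 1)

# Read the first matching file into a tidy data frame.
brca_clinical <- readTCGA(path = clinical_path[1], dataType = "clinical")
dim(brca_clinical)
```

On some systems the extracted directory names are long enough that shortening them before reading is worthwhile.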
Prepared Available Datasets
For the most popular dataset types we have prepared data packages that provide various genetic information for the 2015-11-01 TCGA release date. You can read about those datasets and install them from Bioconductor.
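As an illustration, two of the data packages, RTCGA.clinical and RTCGA.rnaseq, can be installed and inspected as follows. This is a sketch that assumes BiocManager is available as the Bioconductor installer; the choice of these two packages is only an example.

```r
# Assumption: BiocManager is already installed.
BiocManager::install(c("RTCGA.clinical", "RTCGA.rnaseq"))

# Browse the documentation index of a data package.
help(package = "RTCGA.clinical")

# Attach the package and look at the size of one of its datasets.
library(RTCGA.clinical)
dim(BRCA.clinical)
```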
Those datasets can be converted to Bioconductor formats with the convertTCGA function. The full documentation, prepared with staticdocs, is available at http://rtcga.github.io/RTCGA/staticdocs/.
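A conversion can be sketched as below. It assumes that the RTCGA.rnaseq data package and Biobase are installed, and that convertTCGA treats its input as expression data by default; check the function's help page for the arguments supported by your installed version.

```r
library(RTCGA)
library(RTCGA.rnaseq)

# Convert the BRCA RNA-seq dataset to a Bioconductor object.
# Assumption: expression data is the default data type of convertTCGA.
brca_bioc <- convertTCGA(BRCA.rnaseq)

# Inspect the class of the result.
class(brca_bioc)
```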
Manipulate and Visualize Data
For the prepared datasets we have provided functions to manipulate the data and to visualize the results of statistical procedures such as Principal Component Analysis (based on ggbiplot) or Kaplan-Meier estimates of survival curves (based on the elegant survminer package). Check a few examples below.
Survival Curves
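A minimal sketch of Kaplan-Meier curves for two cohorts, assuming the RTCGA.clinical data package is installed. The choice of the BRCA and OV cohorts and of the admin.disease_code column as the grouping variable is only an example.

```r
library(RTCGA)
library(RTCGA.clinical)

# Extract survival times and vital status for two cohorts,
# keeping the disease code as a grouping variable.
surv_info <- survivalTCGA(
  BRCA.clinical, OV.clinical,
  extract.cols = "admin.disease_code"
)
head(surv_info)

# Plot Kaplan-Meier estimates of the survival curves per cohort,
# with the p-value of the log-rank test added to the plot.
km_plot <- kmTCGA(
  surv_info,
  explanatory.names = "admin.disease_code",
  pval = TRUE
)
print(km_plot)
```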
PCA Biplot
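A PCA biplot of gene expression for several cohorts can be sketched as follows. This assumes that RTCGA.rnaseq and dplyr are installed, and that the installed RTCGA version provides expressionsTCGA (which adds a dataset column) and pcaTCGA with the arguments used here; the three cohorts are an arbitrary example, and the computation can take a while because of the number of genes.

```r
library(RTCGA)
library(RTCGA.rnaseq)
library(dplyr)

# Combine RNA-seq expression data from three cohorts and keep only
# primary tumor samples (code "01" at positions 14-15 of the barcode).
expr_cancer <- expressionsTCGA(BRCA.rnaseq, OV.rnaseq, HNSC.rnaseq) %>%
  dplyr::rename(cohort = dataset) %>%
  dplyr::filter(substr(bcr_patient_barcode, 14, 15) == "01")

# Draw the PCA biplot with points grouped by cohort.
pca_plot <- pcaTCGA(expr_cancer, group.names = "cohort")
plot(pca_plot)
```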
For more visualization examples, visit the RTCGA project website. If you notice any bugs or have any comments, please open an issue in the project's repository or post a comment below.