Create a GENCODE transcript database in R
[This article was first published on Chitka, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The following gist will help the researchers in creating the gencode transcript database using the bioconductor packages. I am assuming that the user’s computer has preinstalled packages “GenomicRanges” and “GenomicFeatures”. Following script has the following information:
- loads the needs bioconductor packages
- gives information about creating the intermediate files needed for generating the database
- brief explanation about each step in the procedure
- create the transcript database, saving and loading when needed
- extract information for each feature (gene, cds,transcript,exon,intron,intergenic regions) as ‘GRanges’ object, ‘sort’ when needed.
- saves all the extracted features into combined object to be loaded in future
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
# Create the TranscriptDb object from gtf file | |
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
# load the library for creating the gencode transcript database and extracting and processing the features | |
library("GenomicFeatures") | |
library("GRanges") | |
# Get the chromosome info as a dataframe | |
# One can use the script fethChromsomeSize from UCSC to the get info, filter it to remove information from | |
# non-std chromosomes | |
# Add the header "chrom length is_circular" additional column to say that none of the chromosomes are circular. | |
# chrom length is_circular | |
# chr1 249250621 FALSE | |
# chr2 243199373 FALSE | |
# chr3 198022430 FALSE | |
chrom.info=read.table(file="/home/kalyankpy/hg19/hg19.size", header=T) | |
# Create the transcriptdb from the gencode gtf file | |
# Download the latest gencode comprehensive gtf file from gencode website | |
gencode<-makeTranscriptDbFromGFF("/home/kalyankpy/Downloads/gencode.v19.annotation_2.gtf", | |
format="gtf", exonRankAttributeName="exon_number", chrominfo=chrom.info, | |
dataSource=paste("ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz"), | |
species="Homo Sapies") | |
# Save the transcriptDb object as a sql database object | |
saveDb(gencode, file="/home/kalyankpy/reg_elements/gencode_human_v19.sqlite") | |
loadDb("/home/kalyankpy/reg_elements/gencode_human_v19.sqlite") | |
#### Create GRAnges objects for the each feature | |
genes.gencode<-sort(genes(gencode)) | |
intergenic.gencode<-gaps(genes.gencode) | |
transcript.gencode<-sort(transcripts(gencode)) | |
cds.gencode<-sort(cds(gencode)) | |
exons.gencode<-sort(exons(gencode)) | |
introns.gencode<-sort(unlist(intronsByTranscript(gencode))) | |
# save the combined obj for easy loading in future. | |
save(genes.gencode,intergenic.gencode,transcript.gencode,cds.gencode,exons.gencode,introns.gencode, file="/home/kalyankpy/reg_elements/gencode_all_features.rda") | |
# Load the database in future | |
load("/home/kalyankpy/reg_elements/gencode_all_features.rda") | |
# Alternatively save all the above objects into a single 'GenomicRangesList' | |
annotation.all.features=GenomicRangesList(genes.gencode,intergenic.gencode,transcript.gencode,cds.gencode,exons.gencode,introns.gencode) | |
# save this list of features | |
save(annotation.all.features, file="/home/kalyankpy/reg_elements/gencode_all_features_list.rda") | |
# Load in future | |
load("/home/kalyankpy/reg_elements/gencode_all_features_list.rda") |
To leave a comment for the author, please follow the link and comment on their blog: Chitka.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.