How to read BSMAP methylation ratio files into R via methylKit
[This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
BSMAP is an aligner for bisulfite sequencing reads. It outputs aligned reads as well as methylation ratios per base (via methratio.py script). The methylation ratios can be read into R via methylKit package and regular methylKit analysis can be performed using the BSMAP data.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
First, you need to separate BSMAP methylation ratio file based on methylation context. Cs in CpG, CHH and CHG context should be separated into different files. This can be achieved with a perl or awk one-liner. If you want to analyze Cs from all contexts then of course you don’t need to separate the file. For example, to get all Cs in the CpG context you will need to do something like following:
awk ‘($3==”-” && $4~/^.{1}CG/ ) || ($3==”+” && $4~/^.{2}CG/)’ BSMAPexample.txt > CpG.txt
Here is how methylation ratio file from BSMAP looks like (file includes only Cs in CpG context):
chr pos strand context ratio total_C methy_C CI_lower CI_upper chr1 3121589 + CGCGT 0.000 56 0 0.000 0.064 chr1 3121597 + ATCGG 0.000 56 0 0.000 0.064 chr1 3121599 + GTCGT 0.000 56 0 0.000 0.064 chr1 3121605 + CTCGG 0.000 56 0 0.000 0.064 chr1 3121606 + TGCGC 0.000 56 0 0.000 0.064 chr1 3121607 + GGCGC 0.000 56 0 0.000 0.064 chr1 3121611 + CTCGA 0.000 56 0 0.000 0.064 chr1 3121614 + TACGC 0.000 56 0 0.000 0.064 chr1 3121631 + CTCGT 0.000 56 0 0.000 0.064
You can read the methylation ratio file by using “pipeline” argument in read() function. You need to provide a list of column numbers corresponding to chr,start,end,strand,coverage and ratio of methylation. Actually, you can read any generic methylation ratio or percentage file using this method. The file needs to have the location information (chr,start,end and strand), coverage information and methylation percentage or ratio information.
Thanks to Maxime Caron for sharing the BSMAP methylation ratio file.
Reference for methylKit:
Altuna Akalin, Matthias Kormaksson, Sheng Li, Francine E. Garrett-Bakelman, Maria E. Figueroa, Ari Melnick, Christopher E. Mason.(2012). “methylKit: A comprehensive R package for the analysis of genome-wide DNA methylation profiles.“ Genome Biology , 13:R87.
PS: For the newer releases of BSMAP (~v2.73), you may need to coerce the CT counts to an integer. BSMAP ratio extractor may output effective CT counts as float but they should be coerced to integer before they are read in by methylKit. methylKit (>= v0.5.6) will automatically coerce those values to nearest integer. However, you can also use awk to coerce float to integer values. The example below coerces the 6th column of the the file to integer
awk ‘{OFS=”\t”;print $1,$2,$3,$4,$5,int($6);}’ sample.BSMAP.txt
To leave a comment for the author, please follow the link and comment on their blog: Recipes, scripts and genomics.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.