Simple plots reveal interesting artifacts
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’ve recently been working with methylation data; specifically, from the Illumina Infinium HumanMethylation450 bead chip. It’s a rather complex array which uses two types of probes to determine the methylation state of DNA at ~ 485 000 sites in the genome.
The Bioconductor project has risen to the challenge with a (somewhat bewildering) variety of packages to analyse data from this chip. I’ve used lumi, methylumi and minfi, but will focus here on just the latter.
Update: a related post from Oliver Hofmann (@fiamh).
The data came to me in a standard directory and file structure (created, I think, using BeadStudio), as raw IDAT files with a sample sheet. So, reading the data into R was straightforward:
library(minfi) library(IlluminaHumanMethylation450kmanifest) baseDir <- "path/to/datadir" targets <- read.450k.sheet(baseDir) RGset <- read.450k.exp(base = baseDir, targets = targets)
This creates an RGChannelSet:
RGChannelSet (storageMode: lockedEnvironment) assayData: 622399 features, 108 samples element names: Green, Red protocolData: none phenoData sampleNames: 6057841162_R01C01 6057841162_R02C01 ... 6264488081_R06C02 (108 total) varLabels: Sample_Name Sample_Well ... filenames (9 total) varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation: IlluminaHumanMethylation450k
The minfi package provides a function to convert the red/green channel values to raw intensity measurements representing methylated or unmethylated:
raw <- preprocessRaw(RGset)
That results in a MethylSet:
MethylSet (storageMode: lockedEnvironment) assayData: 485577 features, 108 samples element names: Meth, Unmeth protocolData: none phenoData sampleNames: 6057841162_R01C01 6057841162_R02C01 ... 6264488081_R06C02 (108 total) varLabels: Sample_Name Sample_Well ... filenames (9 total) varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation: IlluminaHumanMethylation450k Preprocessing Raw (no normalization or bg correction) minfi version 1.0.0 Manifest version 0.2.0
The functions getMeth() and getUnmeth() extract the methylated/unmethylated intensities, respectively, to a matrix:
meth <- getMeth(raw) dim(meth) ## [1] 485577 108
The MethylSet object, raw, has the same structure as the more familiar eSet, so we can extract phenotype data using pData():
names(pData(raw)) ## [1] "Sample_Name" "Sample_Well" "Sample_Plate" "Sample_Group" "Pool_ID" ## [6] "Array" "Slide" "Basename" "filenames"
It turns out that for my data, the 108 samples were run in 9 groups of 12 and are ordered as such in the sample sheet: that is samples 1-12 = Group1, 13-24 = Group2 and so on, up to 97-108 = Group9. It also turns out that the variable pData(raw)$Array is ordered in groups of 6. So for the first 18 samples:
head(pData(raw)$Array, 18) ## [1] "R01C01" "R02C01" "R03C01" "R04C01" "R05C01" "R06C01" "R01C02" "R02C02" ## [9] "R03C02" "R04C02" "R05C02" "R06C02" "R01C01" "R02C01" "R03C01" "R04C01" ## [17] "R05C01" "R06C01"
Keep those groups and orderings in mind, as we generate a boxplot of raw methylated intensities for the 108 samples:
library(RColorBrewer) cols <- brewer.pal(9, "Set3") groups.uniq <- unique(pData(raw)$Sample_Group) m <- match(pData(raw)$Sample_Group, groups.uniq) boxplot(meth, col = cols[m])
Result on the right.
Interesting and evidently, something “non-random” is going on. It looks as though we have 2 artifacts to consider. First, for these samples, groups 6-9 differ from groups 1-5. More interestingly, within groups, we see a cyclic signal where intensity rises gradually across the first 6 arrays, then falls and rises again for the second set of 6 arrays.
I love that when I tweet about this technical, esoteric stuff, I get instant and useful responses. Thanks for following: @iameltonjohn, @tweetingtechno and @fiamh.
Take-home message: very simple plots can be informative. Now to figure out what it all means.
Filed under: bioinformatics, R, research diary, statistics Tagged: illumina 450k, methylation, microarray
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.