Metabolic Phenotyping Protocol Part 3
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
If you aren’t familiar with ChemoSpec
, you might wish to look at the introductory vignette first.
In this series of posts we are following the protocol as described in the printed publication closely (Blaise et al. 2021). The authors have also provided a Jupyter notebook. This is well worth your time, even if Python is not your preferred language, as there are additional examples and discussion for study.
Read in the Data
Load the Spectra
object we created in Part 2 so we can summarize it.
library("ChemoSpec") load("Worms2.RData") # restores the 'Worms2' Spectra object sumSpectra(Worms2)
C. elegans metabolic phenotyping study (Blaise 2007) There are 133 spectra in this set. The y-axis unit is intensity. The frequency scale runs from 8.9995 to 5e-04 ppm There are 8600 frequency values. The frequency resolution is 0.001 ppm/point. This data set is not continuous along the frequency axis. Here are the data chunks: beg.freq end.freq size beg.indx end.indx 1 8.9995 5.0005 -3.999 1 4000 2 4.5995 0.0005 -4.599 4001 8600 The spectra are divided into 4 groups: group no. color symbol alt.sym 1 Mut_L2 28 #FB0D16FF 0 m2 2 Mut_L4 33 #FFC0CBFF 15 m4 3 WT_L2 32 #511CFCFF 1 w2 4 WT_L4 40 #2E94E9FF 16 w4 *** Note: this is an S3 object of class 'Spectra'
Exploratory Data Analysis, Con’t.
If you recall in Part 2 we removed five samples. Let’s re-run PCA without these samples and show the key plots. We will simply report these here without much discussion; they are pretty much as expected.
c_pca <- c_pcaSpectra(Worms2, choice = "autoscale")
plotScree(c_pca)
p <- plotScores(Worms2, c_pca, pcs = 1:2, ellipse = "rob", tol = 0.02) p
p <- plotScores(Worms2, c_pca, pcs = 2:3, ellipse = "rob", leg.loc = "bottomleft", tol = 0.02) p
One thing the published protocol does not explicitly discuss is an inspection of the loadings, but it is covered in the Jupyter notebook. The loadings are useful in order to see if any particular frequencies are driving the separation of the samples in the score plot. Let’s plot the loadings (Figure 4). Remember that these data were autoscaled, and hence all frequencies, including noisy frequencies, will contribute to the separation. If we had not scaled the data, these plots would look dramatically different.
p <- plotLoadings(Worms2, c_pca, loads = 1:2) p
The s-plot is another very useful way to find peaks that are important in separating the samples (Figure 5); we can see that the peaks around 1.30-1.32, 1.47-1.48, and 3.03-3.07 are important drivers of the separation in the score plot. Having discovered this, one can investigate the source of those peaks.
p <- sPlotSpectra(Worms2, c_pca, tol = 0.001) p
Supervised Analysis with PLS-DA
ChemoSpec
carries out exploratory data analysis, which is an unsupervised process. The next step in the protocol is PLS-DA (partial least squares - discriminant analysis). I have written about ChemoSpec
+ PLS here if you would like more background on plain PLS. However, PLS-DA is a technique that combines data reduction/variable selection along with classification. We’ll need the mixOmics
package (F et al. (2017)) package for this analysis; note that loading it replaces the plotLoadings
function from ChemoSpec
.
library("mixOmics")
Loading required package: MASS
Loading required package: lattice
Loaded mixOmics 6.20.0 Thank you for using mixOmics! Tutorials: http://mixomics.org Bookdown vignette: https://mixomicsteam.github.io/Bookdown Questions, issues: Follow the prompts at http://mixomics.org/contact-us Cite us: citation('mixOmics')
Attaching package: 'mixOmics'
The following object is masked from 'package:ChemoSpec': plotLoadings
Figure 6 shows the score plot; the results suggest that classification and modeling may be successful. The splsda
function carries out a single sparse computation. One computation should not be considered the ideal answer; a better approach is to use cross-validation, for instance the bootsPLS
function in the bootsPLS
package (Rohart, Le Cao, and Wells (2018) which uses splsda
under the hood). However, that computation is too time-consuming to demonstrate here.
X <- Worms2$data Y <- Worms2$groups splsda <- splsda(X, Y, ncomp = 8)
plotIndiv(splsda, col.per.group = c("#FB0D16FF", "#FFC0CBFF", "#511CFCFF", "#2E94E9FF"), title = "sPLS-DA Score Plot", legend = TRUE, ellipse = TRUE)
To estimate the number of components needed, the perf
function can be used. The results are in Figure 7 and suggest that five components are sufficient to describe the data.
perf.splsda <- perf(splsda, folds = 5, nrepeat = 5) plot(perf.splsda)
At this point, we have several ideas of how to proceed. Going forward, one might choose to focus on accurate classification, or on determining which frequencies should be included in a predictive model. Any model will need to refined and more details extracted. The reader is referred to the case study from the mixOmics folks which covers these tasks and explains the process.
This post was created using ChemoSpec
version 6.1.3 and ChemoSpecUtils
version 1.0.0.
References
Reuse
Citation
@online{hanson2022, author = {Bryan Hanson}, title = {Metabolic {Phenotyping} {Protocol} {Part} 3}, date = {2022-05-01}, url = {http://chemospec.org/posts/2022-05-01-Protocol-Pt3/2022-05-01-Protocol-Pt3.html}, langid = {en} }
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.