Estimate Gene Diversity

fabiomarroni

11 years ago

[This article was first published on Fabio Marroni's Blog » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I provide here an R function to estimate gene diversity of diallelic sites (e.g. SNPs), given allele frequencies at each segregating site.
The function takes three input parameters:

maf: a numeric value (or vector) representing minor allele frequency at each site. Default is 0.5
nreads: size of each resampling experiment. Default is 10000.
nreplicates: Number of resampling experiments to conduct. Default is 200.

The function outputs the median value of estimated gene diversity, together with upper and lower 95% confidence limits.

I tested the function against results provided by the software package Arlequin and found excellent correlation (r>0.95). However, additional testing is welcome!

Once you obtain gene diversity you can easily calculate nucleotide diversity by summing gene diversity over the segregating sites and dividing by the length of your sequence.

References:
Nei, M. and Li, W.H., (1979) Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc Natl Acad Sci U S A, 76(10), 5269-5273.
Tajima, F., (1989) Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Adv Genet, 123(3), 585-595.

If you use this function in a published work, please refer to:
Marroni F, Pinosio S, Di Centa E, Jurman I, Boerjan W, Felice N, Cattonaro F, Morgante M., (2011) Large scale detection of rare variants via pooled multiplexed next generation sequencing: towards next generation Ecotilling. Plant J. doi: 10.1111/j.1365-313X.2011.04627.x.

egd<-function(maf=0.5,nreads=10000,nreplicates=200)
{
total<-rep(NA,nreplicates)
  for(aaa in 1:nreplicates)
  {
  pino1<-sample(c("a","b"),replace=T,size=nreads,prob=c(maf,1-maf))
  pino2<-sample(c("a","b"),replace=T,size=nreads,prob=c(maf,1-maf))
  total[aaa]<-sum(pino1!=pino2)/nreads
  }
  gene.div.median<-median(total)
  gene.div.ucl<-quantile(total,probs=0.975)
  gene.div.lcl<-quantile(total,probs=0.025)
  gd<-c(gene.div.median,gene.div.ucl,gene.div.lcl)
  names(gd)<-c("gene.div.median","gene.div.ucl","gene.div.lcl")
  gd
}

To leave a comment for the author, please follow the link and comment on their blog: Fabio Marroni's Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.