Simulating Allele Counts in a population using R
[This article was first published on Doodling with Data, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This post is inspired by the Week 7 lectures of the Coursera course “Introduction to Genetics and Evolution” (I highly recommend this course for anyone interested in genetics, BTW.) Professor Noor uses a Univ Washington software called AlleleA1 for trying out scenarios.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
We can just as well use R to get an intuitive feel for how Alleles and Genotypes propagate or die out in populations.
Basic Scenario
There are N individuals in an isolated island. Say, we are interested in two specific Alleles (Big “A”, or small “a”). This in turn means that they can have 3 types of genotypes: AA, Aa or aa. The individuals mate in pairs, and produce two offspring and die out. (Thus the total population remains the same generation after generation.)The genotype of the offspring depends on those of the parents. A ‘gamete’ has only one parental allele, depending on what the parent’s genotype was. AA type parent can only product gamete type A, aa parent can only produce gamete type a, but Aa can produce either type of gamete.
A Punnett square of parents gametes to offspring’s genotypes.
| A | a
—————-
A | AA | Aa
a | Aa | aa
With these simple rules, we can use R Simulation scripts to observe what happens to the Allele Frequencies over generations. (The goal here is to learn to use R for Monte Carlo simulations.)
Writing the R Script from scratch
I toyed around with the idea of using character strings for the genotypes and the alleles. But then I realized that are only three types and I could just as easily represent them with the numbers 1, 2, 3 as a simple R vector.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Genotype AA = 1, Aa = 2 and aa = 3 | |
# Gamete A =1 and a = 0 | |
#Uniformly distribute AA, Aa and aa | |
start.pop <- sample(1:3, kStartPop, replace=TRUE) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#Given a parent individual, get one of their Alleles in the gamete | |
getGamete <- function(indiv) { | |
if (indiv == 1) return(1) #AA | |
if (indiv == 3) return(0) #aa | |
#if Parent is Aa, the gamete is one binomial trial with prob.big.A | |
if (indiv == 2) return(rbinom(1, size=1, prob.big.A)) #Aa | |
} | |
#Two parental Gametes combine to form a zygote | |
combineGametes <- function(mg,dg) { | |
if ((mg == 1) && (dg == 1)) return(1) #AA | |
else if ((mg == 0) && (dg == 0)) return(3) #aa | |
else return(2) #Aa | |
} | |
#Given two individuals, get an offspring for next generation | |
getOffspring <- function(mom, dad) { | |
momGam <- getGamete(mom) | |
dadGam <- getGamete(dad) | |
return(combineGametes(momGam, dadGam)) | |
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
getNextGen <- function(x) { | |
nextgen <- list() | |
for(i in seq(1, kStartPop, by=2)) { | |
firstborn <- getOffspring(x[i], x[i+1]) | |
secondborn <- getOffspring(x[i], x[i+1]) | |
nextgen[i] <- firstborn | |
nextgen[i+1] <- secondborn | |
} | |
return(unlist(nextgen)) | |
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
simulationOneTrial <- function(start.pop, knumGenerations, trial.index) { | |
df.allele <- NULL | |
df.gen <- NULL | |
#Keep track of the individuals in each generation | |
df.gen <- rbind(df.gen, start.pop) | |
x <- start.pop | |
for ( gen in 1:knumGenerations) { | |
#print(unlist(nex)) | |
#print(paste("Gen:", gen)) | |
numA <- calcAllelesInGeneration(x) | |
df.gen <- rbind(df.gen, x) | |
df.allele <- rbind(df.allele, c(gen, trial.index, numA)) | |
nex <- getNextGen(x) | |
x <- sample(nex) #shuffle the population for breeding | |
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#Bookkeeping | |
calcAllelesInGeneration <- function(x) { | |
AA = sum(x==1) | |
AB = sum(x==2) | |
BB = sum(x==3) | |
#print(unlist(x)) | |
#print(c(AA, AB, BB)) | |
num.A <- (2 * AA) + AB | |
num.B <- (2 * BB) + AB | |
return(c(num.A, num.B)) | |
} | |
###plotting function | |
plotAllelesWithTime <- function(df, num.trials) { | |
colorRange<-colorRampPalette(c(rgb(0,0,1), rgb(1,0.7,0) )) | |
p <- ggplot(df, aes(x= Generation, y= value, group=Trial, color=factor(Trial))) + geom_line() | |
p <- p + scale_colour_manual(values = colorRange(num.trials), | |
name="Trial") | |
p <- p + labs(title = "Allele A Frequencies Across Generations (2011) ") | |
p <- p + ylab("Number of \"A\" Allele in the Population") | |
return(p) | |
} |

Using this simple Monte Carlo “toy” we can develop quite a bit of intuition.
For small starting populations, either the big A or the small a allele takes over the entire population fairly quickly. Given large enough number of generations, invariably one of the alleles gets wiped out.

As one example, we can see that even a small increase in the probability of Allele A to be 0.53 (up from 0.5) makes it take over quite dramatically.
Conversely, setting it to any value under 0.5 means that the Big A allele gets wiped out of the entire population.
The entire R script can be found here. You can download the code and try playing with various starting scenarios, changing the starting population counts, generations and probabilities.
References:
- Coursera.org (Introduction to Genetics and Evolution by Md. Noor, Week 7 lectures)
- AlleleA1 software at Univ Washington
To leave a comment for the author, please follow the link and comment on their blog: Doodling with Data.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.