BDA3 Chapter 1 Exercise 3
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
BDA3 Chapter 1 Exercise 3
Here’s my solution to exercise 3, chapter 1, of Gelman’s Bayesian Data Analysis (BDA), 3rd edition. There are solutions to some of the exercises on the book’s webpage.
Suppose a particular gene for eye colour has two alleles: a dominant X and a recessive x allele. Having xx gives you blue eyes, otherwise you have brown eyes. Suppose also that the proportion of blue-eyed people is p2, and the proportion of heterozygotes is 2p(1–p). There are 3 questions to answer:
- What is the probability of a brown-eyed child of brown-eyed parents being a heterozygote?
- If such a heterozygote, Judy, has n brown-eyed children with a random heterozygote, what’s the probability that Judy is a heterozygote?
- Under the conditions of part 2, what is the probability that Judy’s first grandchild has blue eyes?
Simulation
Let’s first set up some data with which we can verify the results via simulation.
Data
We’ll simulate a large population of individuals where the probability of the recessive allele is 0.2.
set.seed(11146) N <- 5000000 p <- 0.2 alleles <- c('x', 'X') weights <- c(p, 1 - p) df <- tibble(id = 1:N %>% as.character()) %>% mutate( allele1 = sample(alleles, N, prob = weights, replace = TRUE), allele2 = sample(alleles, N, prob = weights, replace = TRUE), genotype = if_else(allele1 == allele2, 'homozygote', 'heterozygote'), eye_colour = if_else(allele1 == 'x' & allele2 == 'x', 'blue', 'brown') )
id | allele1 | allele2 | genotype | eye_colour |
---|---|---|---|---|
1 | X | X | homozygote | brown |
2 | x | X | heterozygote | brown |
3 | X | x | heterozygote | brown |
4 | X | X | homozygote | brown |
5 | x | X | heterozygote | brown |
6 | X | X | homozygote | brown |
This has the correct distribution of alleles, since p2≈ 0.04 and (1−p)2≈ 0.64.
allele_distribution <- df %>% group_by(allele1, allele2) %>% tally() %>% mutate(frac = n / sum(n))
allele1 | allele2 | n | frac |
---|---|---|---|
x | x | 200006 | 0.2000018 |
x | X | 800015 | 0.7999982 |
X | x | 800802 | 0.2002016 |
X | X | 3199177 | 0.7997984 |
This also has the correct distribution of eye colours.
eye_colour_distribution <- df %>% group_by(eye_colour) %>% tally() %>% mutate(frac = n / sum(n))
eye_colour | n | frac |
---|---|---|
blue | 200006 | 0.0400012 |
brown | 4799994 | 0.9599988 |
The genotype distribution is also correct, since 2p(1−p)≈ 0.32.
genotype_distribution <- df %>% group_by(genotype) %>% tally() %>% mutate(frac = n / sum(n))
genotype | n | frac |
---|---|---|
heterozygote | 1600817 | 0.3201634 |
homozygote | 3399183 | 0.6798366 |
Reproduction
Let’s also define a couple of functions to simulate reproduction within our population. The pair
function matches up random individuals from the first table with random individuals from the second.
pair <- function(df1, df2) { inner_join( df1 %>% select(-matches('\\.(x|y)$')) %>% select(matches('^(id|allele|genotype|eye)')) %>% ungroup() %>% sample_frac(size = 1) %>% mutate(row = row_number()), df2 %>% select(-matches('\\.(x|y)$')) %>% select(matches('^(id|allele|genotype|eye)')) %>% ungroup() %>% sample_frac(size = 1) %>% mutate(row = row_number()), by = 'row' ) %>% select(-row) %>% return() }
The reproduce
function then randomly generates a child from the paired individuals.
reproduce <- function(pairs, n=1) { pairs %>% crossing(child = 1:n) %>% mutate( # the variables x and y indicate the allele taken from parent x and y, respectively x = rbinom(n(), 1, 0.5) + 1, y = rbinom(n(), 1, 0.5) + 1, allele1 = if_else(x == 1, allele1.x, allele2.x), allele2 = if_else(y == 1, allele1.y, allele2.y), genotype = if_else(allele1 == allele2, 'homozygote', 'heterozygote'), eye_colour = if_else(allele1 == 'x' & allele2 == 'x', 'blue', 'brown'), id = paste(id.x, id.y, child, sep = '-') ) %>% return() }
The kids
table then represents the next generation from random mating within the entire population.
kids <- df %>% pair(df) %>% reproduce()
allele1.x | allele2.x | allele1.y | allele2.y | allele1 | allele2 | x | y |
---|---|---|---|---|---|---|---|
X | X | X | x | X | X | 2 | 1 |
X | x | X | X | x | X | 2 | 1 |
X | X | X | x | X | x | 1 | 2 |
X | X | X | X | X | X | 1 | 1 |
X | x | X | x | X | X | 1 | 1 |
x | X | X | X | x | X | 1 | 1 |
The parent attributes are contained in the kids
table, with the .x
suffix for one parent and .y
for the other.
Part 1
We’ll use A to stand for the allele combination, e.g. XX, or Xx=xX, and E for eye colour. The subscripts i=1,2 will be used for each of the two parents, and the absence of subscripts will indicate the variable for the child. We need to calculate the probability that the child is heterogenous given that they are brown-eyed with brown-eyed parents:
P(A=Xx∣E,E1,E2=B).
It will be easier to calculate this if we can rewrite it as a probability conditional only on A∙-variables. First note that
P(A,A1,A2)=P(A∣A1,A2)P(A1,A2)=P(A∣A1,A2)P(A1)P(A2)
using the chain rule and the assumption of random mating. Therefore,
P(A=Xx∣E∙=B)=P(E∙=B∣A=Xx)⋅P(A=Xx)P(E∙=B)=∑a1,a2P(E∙=B∣A=Xx,A1=a1,A2=a2)⋅P(A=Xx∣A1=a1,A2=a2)⋅P(A1=a1)⋅P(A2=a2)∑a,a1,a2P(E∙=B∣A=a,A1=a1,A2=a2)⋅P(A=a∣A1=a1,A2=a2)⋅P(A1=a1)⋅P(A2=a2),
where the numerator is marginalised over possible values of A1 and A2, and the denominator additionally over A.
The factors involving E∙ are either 1 or 0, depending only on whether the given combination of alleles can give rise to brown eyes or not, respectively. Moreover, P(Ai=XX)=(1–p)2 and P(Ai=Xx)=2p(1–p), where the case Ai=xx is impossible conditional on everybody having brown eyes. The only non-trivial calculations now involve P(A=a∣A1=a1,A2=a2):
P(A=Xx∣A1=Xx,A2=Xx)=12P(A=Xx∣A1=Xx,A2=XX)=12P(A=Xx∣A1=XX,A2=XX)=0P(A=XX∣A1=Xx,A2=Xx)=14P(A=XX∣A1=Xx,A2=XX)=12P(A=XX∣A1=XX,A2=XX)=1,
as can be verified by inspection.
Now let’s plug in these values into the formula for the desired probability. The numerator is
P(A=Xx∣E∙=B)=12⋅(2p(1–p))2+12⋅2⋅2p(1–p)(1–p)2+0⋅(1–p)4(12+14)⋅4p2(1–p)2+(12+12)⋅4p(1–p)3+(0+1)⋅(1–p)4=(1–p)2(1–p)22p2+2p(1–p)3p2+4p(1–p)+(1–p)2=2p2+2p–2p23p2+4p–4p2+1+p2–2p=2p1+2p,
as required. This is approximately 2p for small p, and is approximatily 12 for large p.
Part 1 simulation
To condition on brown-eyed children from brown-eyed parents, we can just filter the kids
table. Such a child is called judy
in this exercise.
judy <- kids %>% filter( eye_colour.x == 'brown', eye_colour.y == 'brown', eye_colour == 'brown' ) judy_genotypes <- judy %>% group_by(genotype) %>% tally() %>% mutate(frac = n / sum(n))
genotype | n | frac |
---|---|---|
heterozygote | 1280529 | 0.2858268 |
homozygote | 3199559 | 0.7141732 |
This is very close to the theoretical value of 2p1+2p≈ 0.286.
Part 2
Denote by EC∙=B the condition that all of Judy’s children have brown eyes, and by Ap=a the condition that Judy’s partner has allele combination a. Then
P(A=Xx∣E∙=B=EC∙,Ap=Xx)=P(A=Xx∣E∙=B,Ap=Xx)⋅P(EC∙=B∣E∙=B,Ap=Xx=A)P(EC∙=B∣E∙=B,Ap=Xx)=P(A=Xx∣E∙=B)⋅P(EC∙=B∣Ap=Xx=A)∑aP(EC∙=B∣E∙=B,Ap=Xx,A=a)⋅P(A=a∣E∙=B,Ap=Xx)=P(A=Xx∣E∙=B)⋅P(EC∙=B∣Ap=Xx=A)∑aP(EC∙=B∣Ap=Xx,A=a)⋅P(A=a∣E∙=B)=2p1+2p⋅(34)nP(EC∙=B∣Ap=Xx=A)⋅P(A=Xx∣E∙=B)+P(EC∙=B∣Ap=Xx,A=XX)⋅P(A=XX∣E∙=B)=2p1+2p⋅(34)n2p1+2p⋅(34)n+11+2p=2p⋅(34)n2p⋅(34)n+1=2p⋅3n2p⋅3n+4n,
where we have used conditional independence several times for the probability of the child’s alleles given the parents’ alleles. As n→∞, this probability shrinks to 0.
Part 2 simulation
To simulate part 2, we need to pair judy
with heterozygotes from the general population, then filter for those children with brown eyes.
judy_kids <- df %>% filter(genotype == 'heterozygote') %>% pair(judy) %>% reproduce() %>% ungroup() %>% filter(eye_colour == 'brown')
Amongst judy_kids
, Judy’s attributes have the .y
suffix. Given the above conditions, the probability of her possible genotypes are then:
judy_genotypes_posterior <- judy_kids %>% group_by(genotype.y) %>% tally() %>% mutate(frac = n / sum(n))
genotype.y | n | frac |
---|---|---|
heterozygote | 343962 | 0.2313911 |
homozygote | 1142534 | 0.7686089 |
This is close to the theoretical value of 6p6p+4≈ 23.1%.
Part 3
Let’s introduce some notation. Let Ag be the alleles of Judy’s first grandchild, the child of c with alleles Ac whose partner has alleles Apc. We wish to calculate P(Ag=xx∣E∙=B=EC∙,Ap=Xx).
First note that
$$ P(Ac=Xx∣E∙=EC∙=B,Ap=Xx,A=a)=P(EC∙=B∣E∙=B,Ac=Xx=Ap,A=a)⋅P(Ac=Xx∣E∙=B,Ap=Xx,A=a,A)P(EC∙=B∣E∙=B,Ap=Xx,A=a)=P(EC∙=B∣E∙=B,Ac=Xx=Ap,A=a)⋅0.5P(EC∙=B∣E∙=B,Ap=Xx,A=a)={(34)n−1⋅12(34)nif A=Xx1⋅12/1othewrise={23if A=Xx12othewrise.$$
Thus,
$$ P(Ac=Xx∣E∙=B=EC∙,Ap=Xx)=∑aP(Ac=Xx∣E∙=B=EC∙,Ap=Xx,A=a)⋅P(A=a∣E∙=B=EC∙,Ap=Xx)=23⋅2p⋅(34)n2p⋅(34)n+1+12⋅12p⋅(34)n+1=p(34)n−1+0.52p⋅(34)n+1,$$
which converges to 12 as n→∞.
The probability that Judy’s first grandchild is a homozygote can then be calculated by marginalising over the allele combinations of the child and their partner:
$$ P(Ag=xx∣E∙=B=EC∙,Ap=Xx)=∑ac,apcP(Ag=xx∣E∙=B=EC∙,Ap=Xx,Ac=ac,Apc=apc)⋅P(Ac=ac,Apc=apc∣E∙=B=EC∙,Ap=Xx)=∑ac,apcP(Ag=xx∣Ac=ac,Apc=apc)⋅P(Apc=apc)⋅P(Ac=ac∣E∙=B=EC∙,Ap=Xx)=∑apcP(Ag=xx∣Ac=Xx,Apc=apc)⋅P(Apc=apc)⋅P(Ac=Xx∣E∙=B=EC∙,Ap=Xx)=p(34)n−1+0.52p⋅(34)n+1⋅∑apcP(Ag=xx∣Ac=Xx,Apc=apc)⋅P(Apc=apc)=p(34)n−1+0.52p⋅(34)n+1⋅(P(Ag=xx∣Ac=Xx,Apc=Xx)⋅P(Apc=Xx)+P(Ag=xx∣Ac=Xx,Apc=xx)⋅P(Apc=xx))=p(34)n−1+0.52p⋅(34)n+1⋅(14⋅2p(1–p)+12⋅p2)=p(34)n−1+0.52p⋅(34)n+1⋅p2,$$
since
the grandchild can only be blue-eyed if the (brown-eyed) child has at least one x-allele, i.e. the child is Xx;
A and Ap are independent by the random mating assumption; and
Ac and Apc are independent by the random mating assumption.
As n→∞, this probability converges to p4.
Part 3 simulation
To simulate Judy’s grandkids, we pair up judy_kids
with members of the general population.
judy_grandkids <- judy_kids %>% pair(df) %>% reproduce() judy_grandkids %>% summarise(mean(eye_colour == 'blue')) %>% pull() %>% signif(3) [1] 0.054
The above fraction of grandkids with blue eyes is consistent with the theoretical value of 4p+0.56p+4p2≈ 0.0538.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.