Minimax Estimation and Identity Testing of Markov Chains
We briefly review the two classical problems of distribution estimation and identity testing (in the context of property testing), then propose to extend them to a Markovian setting. We will see that the sample complexity depends not only on the number of states, but also on the stationary and mixing properties of the chains.
The distribution setting
Estimation/Learning. A fundamental problem in statistics is to
estimate a probability distribution from independent samples. Consider
an alphabet $\mathcal{X}$ of size $d$, and draw $X_1, X_2, \dots, X_n$ independently from a distribution $\mu$ over $\mathcal{X}$. How large must the sample size $n$ be in order to obtain a good estimator $\hat{\mu}(X_1, \dots, X_n)$ of $\mu$? To make the question precise, we choose a notion of distance $\rho$ between distributions (here, the total variation distance), pick two small numbers $\varepsilon, \delta > 0$, and say, for instance, that an estimator is good when, with probability at least $1 - \delta$ over the random choice of the sample, the estimator is $\varepsilon$-close to the true distribution. Framed as a (probably approximately correct) minimax problem, the sample complexity of the problem is defined as
$$n_0(\varepsilon, \delta) = \min\left\{ n \in \mathbb{N} : \min_{\hat{\mu}} \max_{\mu} \mathbb{P}\left( \rho(\hat{\mu}(X_1, \dots, X_n), \mu) > \varepsilon \right) < \delta \right\}.$$
In total variation, this complexity is known to scale as $d/\varepsilon^2$ (up to the dependence on $\delta$): roughly speaking, each of the $d$ symbols must be observed enough times for its empirical frequency to stabilize.
Identity testing. A different problem arises when the practitioner has access to independent data sampled from some unknown distribution $\mu$, together with the full description of a reference distribution $\bar{\mu}$. They must then decide whether the data was sampled from $\bar{\mu}$, or from some distribution that is at least $\varepsilon$-far from $\bar{\mu}$ (a composite alternative). To better compare with the previous problem, we keep the total variation metric. We briefly note that the case of an unknown distribution closer than $\varepsilon$ to $\bar{\mu}$, but not equal to it, is not among the choices: we require separation of the hypotheses.
One can come up with a simple “testing-by-learning” approach to solve the problem. First estimate the unknown distribution to precision $\varepsilon/2$ using the sample, and then verify whether the estimator is within $\varepsilon/2$ of $\bar{\mu}$. However, identity testing corresponds to a binary, seemingly easier question. As such, we would expect its sample complexity to be lower than that of the estimation problem. Among the first to investigate the problem in the case where $\bar{\mu}$ is the uniform distribution was Paninski (2008), who showed that the complexity grows roughly as the square root of the alphabet size, i.e. much more economically than estimation. For this uniformity testing problem, several procedures are known to achieve the minimax rate (although not all of them work in every parameter regime). One can for instance count the number of collisions in the sample, count the number of bins appearing exactly once in the sample, or even compute a chi-square statistic. In fact, the complexity for the worst-case $\bar{\mu}$ has been pinned down (see Diakonikolas et al. (2017)) to
$$n_0(\varepsilon, \delta) \asymp \frac{\sqrt{d \log(1/\delta)} + \log(1/\delta)}{\varepsilon^2}.$$
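To give a flavor of the collision idea, here is a hedged R sketch of a collision-based uniformity tester. The acceptance threshold is a simplified midpoint choice for illustration, not the tuned constant from the literature, and the sketch ignores the failure probability $\delta$.

```r
# Collision-based uniformity tester (sketch). Under the uniform
# distribution, a pair of samples collides with probability 1 / d, so the
# expected number of colliding pairs is choose(n, 2) / d; a distribution
# eps-far from uniform inflates the per-pair probability to at least
# (1 + 4 * eps^2) / d. We threshold halfway between the two.
uniformity_test <- function(x, d, eps) {
  n          <- length(x)
  counts     <- tabulate(x, nbins = d)
  collisions <- sum(choose(counts, 2))  # pairs landing in the same bin
  threshold  <- choose(n, 2) * (1 + 2 * eps^2) / d
  collisions <= threshold               # TRUE: consistent with uniform
}

set.seed(2)
uniformity_test(sample.int(1000, size = 2000, replace = TRUE),
                d = 1000, eps = 0.25)
```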
The Markovian setting
We depart from independence and increase the richness of the process by
considering the Markovian setting. More specifically, we wish to perform
statistical inference from sampling a “single trajectory” of a Markov
chain, i.e. a sequence of observations $X_1, X_2, \dots, X_m$ where $X_1$ is drawn from an unknown, arbitrary initial distribution, and the transition probabilities governing $X_t \to X_{t+1}$ are collected in a row-stochastic matrix $P$. This is a challenging but faithful representation of a process outside the control of the scientist, one that cannot, for example, be restarted or set to a particular state. For simplicity of
the exposition below, we will assume that $P$ is ergodic and time reversible, but the approach and results can be extended to the periodic or non-reversible cases (see the concluding remarks). To measure the distance between Markov chains, we consider here the matrix infinity norm between the respective transition matrices,
$$\rho(P, P') = \|P - P'\|_\infty = \max_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} \left| P(x, x') - P'(x, x') \right|.$$
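In R, this distance amounts to one line over the rows of the two matrices (the function name is ours):

```r
# Matrix infinity norm between two transition matrices over the same states:
# the largest L1 distance between corresponding rows.
rho <- function(P, Q) max(rowSums(abs(P - Q)))
```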
Estimation/Learning. Finite sample analyses of the problem are relatively recent, with the first treatments by Hao, Orlitsky, and Pichapati (2018) in expectation and by Wolfer and Kontorovich (2019) in the PAC framework.
We estimate the unknown $P$ with the empirical tally matrix, where $\hat{P}(x, x')$ is computed by counting transitions from $x$ to $x'$ and dividing by the number of visits to $x$ (modulo some mild smoothing). Tighter upper and lower bounds on the sample complexity were later obtained (see Wolfer and Kontorovich (2021)):
$$c \left( \frac{d}{\pi_\star \varepsilon^2} + \frac{d \log d}{\gamma_\star} \right) \leq m_0(\varepsilon, \delta) \leq C \left( \frac{d + \log(1/(\varepsilon \delta))}{\pi_\star \varepsilon^2} + \frac{\log(1/(\pi_\star \delta))}{\pi_\star \gamma_\star} \right),$$
where $\pi_\star$ is the minimum stationary probability of the chain and $\gamma_\star$ its absolute spectral gap, whose inverse controls the mixing time.
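Below is a minimal R sketch of the tally estimator on a single trajectory, together with a small simulation; the additive smoothing constant `alpha` and all function names are illustrative choices of ours, not the exact scheme from the paper.

```r
# Empirical tally estimator of a transition matrix from a single trajectory.
# alpha > 0 is a mild additive smoothing (an illustrative choice) so that
# rarely visited states still yield a valid conditional distribution.
estimate_transition_matrix <- function(traj, d, alpha = 1e-3) {
  counts <- matrix(alpha, nrow = d, ncol = d)
  for (t in seq_len(length(traj) - 1)) {
    counts[traj[t], traj[t + 1]] <- counts[traj[t], traj[t + 1]] + 1
  }
  counts / rowSums(counts)  # row x: transition tallies / visits to state x
}

# Simulate a single trajectory of length m from the kernel P.
simulate_chain <- function(P, m, x1 = 1) {
  x <- integer(m)
  x[1] <- x1
  for (t in 2:m) x[t] <- sample.int(ncol(P), 1, prob = P[x[t - 1], ])
  x
}

set.seed(3)
P     <- matrix(c(0.8, 0.2,
                  0.3, 0.7), nrow = 2, byrow = TRUE)
traj  <- simulate_chain(P, m = 10000)
P_hat <- estimate_transition_matrix(traj, d = 2)
max(rowSums(abs(P - P_hat)))  # the matrix infinity norm from above
```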
We witness a stark difference between the iid and Markovian settings: a phase transition appears in the sample complexity. Roughly speaking, and taking $\varepsilon$ to be a constant, when the mixing time of the chain is larger than the number of states, the second term becomes the dominant one, i.e. the difficulty of estimation is chiefly controlled by how well we can navigate among the states. When the chain mixes more quickly, moving from state to state is not the bottleneck, and the harder task is the estimation of the respective conditional distributions. An additional caveat is that the chain will generally not visit all states equally often in the long run. As a result, the sample complexity necessarily depends on the proportion $\pi_\star$ of the trajectory spent in the least visited state. The astute reader may notice that when the mixing time is small and the stationary distribution is uniform ($P$ is doubly stochastic), we recover, up to logarithmic factors, the complexity of $d^2/\varepsilon^2$ intuited earlier: $d$ conditional distributions, each learnable with roughly $d/\varepsilon^2$ samples.
Identity testing. For the Markov chain identity testing problem (still with respect to the matrix infinity norm), we rely on a two-stage testing procedure. First, we verify that we visited all states about $\sqrt{d}/\varepsilon^2$ times. As a second step, we perform identity testing of all conditional distributions. The Markov property ensures that the transition observations out of each state are drawn independently from the corresponding conditional distribution, so that we can rely on one of the known iid testing procedures as a black box. We obtain the following upper and lower bounds on the sample complexity (see Wolfer and Kontorovich (2020)):
$$c \left( \frac{\sqrt{d}}{\bar{\pi}_\star \varepsilon^2} + \frac{d}{\bar{\gamma}_\star} \right) \leq m_0(\varepsilon, \delta) \leq C \left( \frac{\sqrt{d}}{\bar{\pi}_\star \varepsilon^2} \log \frac{d}{\varepsilon \delta} + \frac{\log(1/(\bar{\pi}_\star \delta))}{\bar{\pi}_\star \bar{\gamma}_\star} \right),$$
where $\bar{\pi}_\star$ and $\bar{\gamma}_\star$ denote the minimum stationary probability and absolute spectral gap of the reference chain $\bar{P}$.
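The skeleton of the two-stage procedure might look as follows in R. This is only a schematic sketch: the visit threshold and the plug-in row comparison stand in for the tuned constants and for an optimal iid identity tester used as the black box.

```r
# Two-stage identity tester for a Markov chain against a reference kernel
# P_ref (sketch). Stage 1 checks coverage: under the reference chain, every
# state would be visited often enough with high probability, so poor
# coverage is itself evidence against identity. Stage 2 tests each
# conditional distribution against the corresponding reference row.
identity_test_chain <- function(traj, P_ref, eps) {
  d      <- nrow(P_ref)
  from   <- traj[-length(traj)]
  to     <- traj[-1]
  visits <- tabulate(from, nbins = d)
  if (any(visits < sqrt(d) / eps^2)) {
    return(FALSE)  # stage 1: some state insufficiently covered
  }
  for (x in seq_len(d)) {
    emp <- tabulate(to[from == x], nbins = d) / visits[x]
    # stand-in iid tester: plug-in L1 comparison at radius eps / 2
    # (the matrix infinity norm is an L1 norm per row)
    if (sum(abs(emp - P_ref[x, ])) > eps / 2) return(FALSE)
  }
  TRUE  # no conditional distribution was flagged as far from the reference
}
```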
Figure 1: Topology of the class of Markov chains achieving the lower bound in $d/\gamma_\star$.

Figure 2: Example of the class in Figure 1 with $d = 9$, $\tau \in \{-1, 1\}^3$ and $0 < \eta \ll 1$. The mixing time of the chain is of order $1/\eta$, and is decoupled from the proximity parameter $\varepsilon$.
Perhaps surprisingly, the bounds only depend on the properties $\bar{\pi}_\star$, $\bar{\gamma}_\star$ of the reference stochastic matrix. As a matter of fact, the unknown chain need not even be irreducible for the procedure to work. A quadratic speed-up also appears in the bounds, affecting the dominant term when the chain mixes faster than $\sqrt{d}$.
It is instructive to inspect the class of reversible Markov chains that achieves the lower bound in $d/\gamma_\star$ (see Figure 1 and Figure 2). Every element consists of an “inner clique” and an “outer rim”. The inner clique can be made arbitrarily sticky, in the sense that the chain only moves from one state to another within the clique with a small probability $\eta$, which has an overall slowing effect on the mixing time of the chain. On the other hand, from any state of the clique, the chain has at least constant (but parametrizable) probability of reaching one of the two connected states in the outer rim. In order to distinguish between two chains for which some inner node has $\varepsilon$-different transition probabilities towards the rim, the trajectory necessarily has to visit a large fraction of the states.
Related work and closing comments. As mentioned earlier in this note, the results are valid for a substantially larger class of chains. In the non-reversible setting, the absolute spectral gap can readily be replaced with the pseudo-spectral gap of Paulin (2015), and if the chain is irreducible (but possibly periodic), one can instead perform inference on the lazy process governed by $(P + I)/2$, which can be simulated with an additional fair coin (sketched below). More recently, Chan, Ding, and Li (2021) investigated the Markov chain minimax learning and testing problems by considering cover times instead of mixing times.
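Concretely, the lazy kernel is immediate to form, and the fair coin gives a direct way to simulate it from the original dynamics (a minimal sketch, with illustrative function names):

```r
# Lazy version of a kernel P: stay put with probability 1/2, otherwise move
# according to P. The extra fair coin makes the chain aperiodic.
lazy_kernel <- function(P) (P + diag(nrow(P))) / 2

# Equivalently, one lazy step simulated from the original dynamics:
lazy_step <- function(P, x) {
  if (runif(1) < 0.5) x else sample.int(ncol(P), 1, prob = P[x, ])
}
```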
We end this post by stressing that the minimax rates are tied to the choice of metric. We refer the interested reader to Daskalakis, Dikkala, and Gravin (2018), Cherapanamjeri and Bartlett (2019), and Fried and Wolfer (2022), who analyze the identity testing problem with respect to a different notion of discrepancy between Markov chains. Although the last-mentioned framework can so far only handle reversible chains, it advantageously leads to minimax rates that are independent of mixing or hitting times.