
Supertree methods are tree reconstruction techniques that combine several smaller gene trees (possibly on different sets of species) to build a larger species tree. The question of interest is whether the reconstructed supertree converges to the true species tree as the number of gene trees increases (that is, the consistency of supertree methods). In this paper, we are particularly interested in the convergence rate of the maximum likelihood supertree. Previous studies on the maximum likelihood supertree approach often formulate the question of interest as a discrete problem and focus on reconstructing the correct topology of the species tree. Aiming to reconstruct both the topology and the branch lengths of the species tree, we propose an analytic approach for analyzing the convergence of the maximum likelihood supertree method. Specifically, we consider each tree as one point of a metric space and prove that the distance between the maximum likelihood supertree and the species tree converges to zero at a polynomial rate under some mild conditions. We further verify these conditions for the popular exponential error model of gene trees.
Citation: Vu Dinh, Lam Si Tung Ho. Convergence of maximum likelihood supertree reconstruction[J]. AIMS Mathematics, 2021, 6(8): 8854-8867. doi: 10.3934/math.2021513
[1] | Naif Alotaibi, A. S. Al-Moisheer, Ibrahim Elbatal, Salem A. Alyami, Ahmed M. Gemeay, Ehab M. Almetwally . Bivariate step-stress accelerated life test for a new three-parameter model under progressive censored schemes with application in medical. AIMS Mathematics, 2024, 9(2): 3521-3558. doi: 10.3934/math.2024173 |
[2] | Abdulhakim A. Al-Babtain, Amal S. Hassan, Ahmed N. Zaky, Ibrahim Elbatal, Mohammed Elgarhy . Dynamic cumulative residual Rényi entropy for Lomax distribution: Bayesian and non-Bayesian methods. AIMS Mathematics, 2021, 6(4): 3889-3914. doi: 10.3934/math.2021231 |
[3] | Ali A. Al-Shomrani . An improvement in maximum likelihood estimation of the Burr XII distribution parameters. AIMS Mathematics, 2022, 7(9): 17444-17460. doi: 10.3934/math.2022961 |
[4] | Shenglan Peng, Zikang Wan . Truncation point estimation of truncated normal samples and its applications. AIMS Mathematics, 2022, 7(10): 19083-19104. doi: 10.3934/math.20221048 |
[5] | Xue Hu, Haiping Ren . Statistical inference of the stress-strength reliability for inverse Weibull distribution under an adaptive progressive type-Ⅱ censored sample. AIMS Mathematics, 2023, 8(12): 28465-28487. doi: 10.3934/math.20231457 |
[6] | A. M. Abd El-Raheem, Ehab M. Almetwally, M. S. Mohamed, E. H. Hafez . Accelerated life tests for modified Kies exponential lifetime distribution: binomial removal, transformers turn insulation application and numerical results. AIMS Mathematics, 2021, 6(5): 5222-5255. doi: 10.3934/math.2021310 |
[7] | Peng Wang, Jiajun Huang, Weijia He, Jingqi Zhang, Fan Guo . Maximum likelihood DOA estimation based on improved invasive weed optimization algorithm and application of MEMS vector hydrophone array. AIMS Mathematics, 2022, 7(7): 12342-12363. doi: 10.3934/math.2022685 |
[8] | M. G. M. Ghazal, Yusra A. Tashkandy, Oluwafemi Samson Balogun, M. E. Bakr . Exponentiated extended extreme value distribution: Properties, estimation, and applications in applied fields. AIMS Mathematics, 2024, 9(7): 17634-17656. doi: 10.3934/math.2024857 |
[9] | M. R. Irshad, S. Aswathy, R. Maya, Amer I. Al-Omari, Ghadah Alomani . A flexible model for bounded data with bathtub shaped hazard rate function and applications. AIMS Mathematics, 2024, 9(9): 24810-24831. doi: 10.3934/math.20241208 |
[10] | Ahmed R. El-Saeed, Ahmed T. Ramadan, Najwan Alsadat, Hanan Alohali, Ahlam H. Tolba . Analysis of progressive Type-Ⅱ censoring schemes for generalized power unit half-logistic geometric distribution. AIMS Mathematics, 2023, 8(12): 30846-30874. doi: 10.3934/math.20231577 |
Supertree methods are tree reconstruction techniques that combine several smaller gene trees (possibly on different sets of species) to build a larger species tree. The question of interest is whether the reconstructed supertree converges to the true species tree as the number of gene trees increases (that is, the consistency of supertree methods). In this paper, we are particularly interested in the convergence rate of the maximum likelihood supertree. Previous studies on the maximum likelihood supertree approach often formulate the question of interest as a discrete problem and focus on reconstructing the correct topology of the species tree. Aiming to reconstruct both the topology and the branch lengths of the species tree, we propose an analytic approach for analyzing the convergence of the maximum likelihood supertree method. Specifically, we consider each tree as one point of a metric space and prove that the distance between the maximum likelihood supertree and the species tree converges to zero at a polynomial rate under some mild conditions. We further verify these conditions for the popular exponential error model of gene trees.
High-throughput sequencing is making large collections of sequences available to researchers at a low cost. These genomic data represent a broad spectrum of life and motivates studies of the problem of reconstructing large phylogenetic trees using statistical methods. Those data, however, also come from different sources, cover different genomic regions on which the evolutionary processes happen very differently, and may not be collected on the same set of species. Thus, combining trees on different, overlapping sets of species into a "supertree" has become a popular approach for reconstructing large species trees. Over the years, several methods of supertree reconstruction have been developed [8], and analyses of supertrees continue to play more and more important roles in the search for answers of many fundamental evolutionary questions.
While supertree methods have been of great interest in phylogenetics, little is known about their theoretical properties, especially in non-asymptotic settings. In cases when the individual trees are gene trees and the set of taxa are the same across the input trees, several statistical consistent methods have been derived [5,7,11,14,16,17,18,20,29]. However, most proofs of statistical consistency have been analyzed under the condition that each of the gene trees can be estimated accurately, and do not provide a guarantee of robustness to gene tree estimation errors [25]. This is a critical concern because when the species tree is constructed using sequences taken from large genomic regions, those regions have a high chance of involving some recombination, which violates one of the main assumptions of the multi-species coalescent model. On the other hand, limiting analyses to short regions increase gene tree estimation error (GTEE) and various summary methods had impaired accuracy when the error was high [10,19]. It is later proved in [24] that when the sequence length of each locus is bounded and the gene tree cannot be estimated reliably, most summary methods that estimate the species tree by combining gene trees are not statistically consistent.
Because some coalescent-based summary methods sometimes produce less accurate estimates than concatenation [3,19,22], seemingly as a result of GTEE, the question of whether provable guarantees can be established in the presence of GTEE (for both species trees and supertree) naturally arises. In the context of species tree estimation, [25] establish statistical consistency of the Rooted Triplet Consensus method and the Maximum Pseudo-likelihood for Estimating Species Trees method [18] under GTEE and provide bounds on the sampling complexity for these methods to construct the correct species tree with high probability.
The maximum likelihood (ML) supertree method is proposed by [28] based on a probability model that permits "errors" in gene tree topologies and allows the species tree to be estimated even if there is topological conflict amongst gene trees. [28] shows that ML estimate of the species tree is topologically consistent under fairly general conditions and also shows that the method of Matrix Representation with Parsimony [2] may be inconsistent under these same conditions. However, while the ML estimate is topologically consistent, no results on the convergence rate of the estimator are obtained. In this paper, we propose a new analytic approach to study the convergence of the ML supertree method. Based on embedding trees into a metric space, we establish the conditions for which the convergence rate of the ML supertree can be obtained. We verify these conditions for the popular exponential error model of gene trees, thus obtain a polynomial convergence rate for the ML supertree to the true species tree under this model.
In this paper, the term phylogenetic tree refers to a tree T with leaves labeled by a set of species. Each branch of T is associated with a non-negative branch length. A tree is said to be resolved if it is bifurcating and all branch lengths are positive. Given an unrooted phylogenetic tree T on a finite set L of species, any subset L′ of L induces a phylogenetic tree on L′, denoted T|L′, which is the subtree of T that connects the species in L′ only.
We regard the tree space as a metric space (T,d) where T is the set of all phylogenetic trees with branch lengths bounded from above by a positive constant g0 and d is a continuous metric. For simplicity of presentation, we will assume that d is the BHV distance on the set of trees with n species [4], but note that the analysis of the paper can be extended to any continuous and locally-Euclidean distance, including the branch-score distance [15] and the AGPS distance [1].
To describe trees that are "near" to each other, the BHV distance use the class of nearest neighbor interchange (NNI) moves [23]. An NNI move is defined as a transformation that collapses an interior branch to zero and then expands the resulting degree 4 vertex into a branch in a different way. The BHV space models the set of trees T on n species as a cubical complex consisting of a collection of orthants, each isomorphic to R2n−3≥0. Each orthant of T corresponds uniquely to a tree topology, and the coordinates in each orthant parameterize the branch lengths for the corresponding tree. The adjacent orthants of the complex with the same dimension correspond to NNI-adjacent trees.
The BHV space is equipped with a natural metric distance: the shortest path lying in the BHV space between the points. If two points lie in the same orthant, this distance is the usual Euclidean distance. If two points are in different orthants, they can be joined by a sequence of straight segments, with each segment lying in a single orthant. We can then measure the length of the path by adding up the lengths of the segments. The distance between the two trees T1 and T2 on the BHV space is defined as the minimum of the lengths of such segmented paths joining the two points.
Throughout the paper, we assume that the evolution of the species has not involved reticulate processes, and there exists an underlying "true species tree" in T, denoted by T∗. Furthermore, T∗ is a resolved tree with leaf set L∗. In a supertree reconstruction problem, we observe a sequence of gene trees (T1,T2,…,Tk), where Ti has leaf set Li, and wish to combine these trees into a phylogenetic tree ˆTk on the union of the leaf sets
ˆLk=∪ki=1Li⊂L∗. |
In this paper, we consider a tree-generating probability model (PT)T∈T such that for each set of species L⊂L∗, PLT(⋅) is a distribution of trees forming by species in L. We assume that the observed gene trees {Ti}ki=1 are independently distributed according to {PLiT∗(⋅)}ki=1 respectively. In other words, the joint probability of the observed gene trees is
PT∗(T1,T2,…,Tk)=k∏i=1PLiT∗(Ti). |
Remark 2.1. We note that the tree-generating probability model PT∗ may not necessarily be nested. That is, given two nested sets of leaves L1⊂L2⊂L∗, although the probability PL2T∗ (which is used to generate trees with leaf set L2) also induces a natural distribution on the set of trees with leaf set L1, this probability PL2T∗|L1 may not be the same as PL1T∗.
Given the observed gene trees (T1,T2,…,Tk), the ML supertree is defined as
ˆTk=argmaxT∈Tℓk(T) |
where ℓk(T) denotes the log-likelihood function
ℓk(T)=k∑i=1logPLiT(Ti). |
To enable theoretical analyses of the ML supertree method, we make the following assumptions.
Assumption 2.1 (Weak covering property). The sequence of subsets of L∗ satisfies the weak covering property: there exists c1>0,γ>1/2, and K>0 such that for each subset L of taxa from L∗ of size 4,
1kγ|{i≤k:L⊂Li}|≥c1∀k≥K. |
Here, |A| denotes the number of elements in the set A.
We note that the weak covering property ensures that as the number of gene trees increases, all quartets (subtrees with 4 leaves) of T∗ are visited with enough frequency to enable a reliable estimate for each of the quartets. Assumption 2.1 is a direct generalization of the covering property introduced in [28], which can be obtained from Assumption 2.1 by setting γ=1.
Assumption 2.2 (Model identifiability). For all L⊂L∗, the distribution PLT∗(⋅) is identifiable. That is, if d(T∗|L,T|L)>0, then KL(PLT∗,PLT)>0.
Assumption 2.2 guarantees that it is at least possible to reconstruct the restriction T∗|L of T∗ to the subset L with a complete knowledge of PLT∗. This assumption is similar to, but distinct from, the condition of basic centrality introduced by [28], which requires that for all subsets of L⊂L∗ of size 4,
PLT∗[T∗|L]≥(1+η)PLT∗[T′] |
for all trees T′ on leaf set L that are different from T∗|L, and where η>0:
● On one hand, the basic centrality condition implies that if d(T∗|L,T|L)>0, the modes of the distributions PLT∗ and PLT are different. From this, we can deduce that KL(PLT∗,PLT)>0. Thus, our identifiability assumption is somewhat weaker than this condition. Assumption 2.2 also does not impose any assumption on the family of distribution themselves and thus can be applied to a wider class of probabilistic models.
● On the other hand, the basic centrality condition only concerns leaf sets of size 4, while Assumption 2.2 impose restrictions on leaf sets of all sizes. However, we note that conditions on subset of leaves (such as the basic centrality condition) only work under the implicit assumption that there are some connection between PL2T|L1 and PL1T for L1⊂L2. Since our framework does not assume any nested structure in the probability model, Assumption 2.2 is more appropriate.
Finally, we impose the following regularity conditions on the tree-generating probability model:
Assumption 2.3. (Regularity)
(a) For all T∈T, L⊂L∗, and any tree T′ with leaf set L,
i. PLT(T′)>0,
ii. logPLT(T′) is a locally-Lipchitz function with respect to T, and the Lipchitz constant does not depend on T′.
(b) There exist c2,c3>0, m≥2 such that for any leaf set L and tree T, if d(T∗,T)≤c2, then
KL(PLT∗,PLT)≥c3d(T∗|L,T|L)m. |
Remark 2.2. If for each tree T′ with leaf set L, the probability density function PLT(T′) is an analytic function with respect to T in a neighborhood of T∗, then Assumption 2.3(b) holds.
Proof. Note that the function EPT∗[logPLT(T′)] (with respect to T) is analytic in a neighborhood U of T∗. Define
A={T∈U:ET′∼PT∗[logPLT∗(T′)]=ET′∼PT∗[logPLT(T′)]}. |
For all T∈A, we have
KL(PLT∗,PLT)=ET′∼PT∗[logPLT∗(T′)]−ET′∼PT∗[logPLT(T′)]=0. |
By Assumption 2.2, we conclude that T|L=T∗|L for all T∈A. Applying Łojasiewicz inequality [13, Theorem 1], we deduce that there exists c3>0 and m≥2 such that for any T in the neighborhood,
EPT∗[logPLT∗(T′)]−EPT∗[logPLT(T′)]≥c3d(T,A)m≥c3d(T∗|L,T|L)m. |
This implies the result.
Throughout this paper, we will assume that Assumptions 2.1, 2.2, and 2.3 hold. The main point of Assumption 2.1 is to guarantee that every subset of 4 species is sufficiently represented by the data. It is worth noticing that 4 is the smallest number of species that carries the information of the topology of an unrooted tree. Therefore, if a set of 4 species is only found in few trees, we will not be able to reconstruct their phylogenetic relationship. While Assumption 2.2 is technical, its goal is to guarantee that there is a unique "true" supertree. Finally, Assumption 2.3 ensures that the model has enough information to distinguish between the "true" supertree and any alternative with substantial accuracy.
In the next section, we will establish the following convergence rate of the ML supertree.
Theorem 1. Under Assumptions 2.1, 2.2, and 2.3, for any δ>0, there exist constants m>0, Cδ>0 and Kδ>0 such that for all k≥Kδ, we have
d(ˆTk,T∗)≤Cδ(logkk(γ−1/2))1/m |
with probability at least 1−δ.
Theorem 1 establishes that under fairly mild regularity conditions, the ML supertree is consistent and has a polynomial convergence rate. Notably, the result holds for all identifiable family of locally-analytic distributions (Remark 2.2). The degree of this (polynomial) convergence rate depends on the sampling scheme of the leaves (characterized by the covering coefficient γ) and on the geometry of the probabilistic model (characterized by the constant m in Assumption 2.3(b)). We further note that Assumption 2.3(b)) is only required for establishing the convergence rate of the ML estimator and the absence of this condition does not affect the proof of consistency.
Corollary 1. Under Assumptions 2.1, 2.2, and 2.3(a), the ML supertree is consistent.
Remark 2.3. Theorem 1 is still valid if the universal Assumption 2.3(b) is replaced by the following condition:
There exist c2,c3>0, m≥2 such that if d(T∗,T)≤c2, then
KL(PLiT∗,PLiT)≥c3d(T∗|Li,T|Li)m∀i∈N. |
Remark 2.4. We note that the bound of the convergence rate in Theorem 1 is not sharp.
To enable the analysis of convergence, we define
Rk(T)=1kℓk(T∗)−1kℓk(T) |
and
Rk(T):=EPT∗[Rk(T)]=1k(EPT∗[ℓk(T∗)]−EPT∗[ℓk(T)])=1kKL(PT∗,PT). |
We refer to Rk and EPT∗[Rk(T)] as the empirical risk function and the expected risk function, respectively.
An intuitive argument for the consistency of the ML estimator can be described as follows. As the number of gene trees increases, we have
|Rk(T)−Rk(T)|→0 |
with sufficiently high probability. Therefore, the ML estimator will converge to the optimal value of the risk function, which is attained at the true species tree T∗. This simple argument is formalized by a lower bound of the expected risk function (Section 3.1) and a uniform concentration bound on the deviation of the empirical risk function and its expectation (Section 3.2).
Lemma 3.1. There exist a neighborhood U of T∗ and c4>0 such that
EPT∗[ℓk(T∗)−ℓk(T)]≥c4kγd(T∗,T)m∀T∈U. |
Proof. We note that
EPT∗[ℓk(T∗)−ℓk(T)]=KL(PT∗,PT). |
Consider an arbitrary leaf set L⊂L∗ of size 4 and define I={i≤k:L⊂Li}. By the weak covering property, we have |I|≥c1kγ. Thus,
KL(PT∗,PT)≥∑i∈IKL(PLiT∗,PLiT)≥∑i∈Ic3d(T∗|Li,T|Li)m≥c1c3kγ(2n−3)m/2d(T∗|L,T|L)m |
for all T in some neighborhood U around T∗. Here, the second inequality comes from Assumption 2.3(b).
On the other hand, [9] (Lemma 6.2 (i)) proved that for some leaf set L⊂L∗ of size 4, we have
d(T∗|L,T|L)≥1(2n−3)d(T∗,T). |
We deduce that
KL(PT∗,PT)≥1(2n−3)m/2c1c3kγd(T∗,T)m |
Lemma 3.2. For any leaf set L and any x>0, there exists Cx>0 such that
KL(PLT∗,PLT)≥Cx |
when d(T∗|L,T|L)≥x.
Proof. Assume that there exists a sequence of tree {Ti} such that d(T∗|L,Ti|L)≥x and KL(PLT∗,PLTi)→0. Since the tree space is compact, we can extract a sub-sequence {Tij} of {Ti} that converges to some tree T0. We deduce that KL(PLT∗,PLT0)=0 and T∗|L≠T0|L. This contradicts Assumption 2.2.
Lemma 3.3. Let V be a neighborhood of T∗. There exist cV>0 such that for any T∉V, we have
KL(PT∗,PT)≥cVkγ. |
Proof. By [9] (Lemma 6.2 (i)), there exists L⊂L∗ of size 4 such that
d(T∗|L,T|L)≥1(2n−3)d(T∗,T)≥CV>0 |
since T∉V. Define I={i≤k:L⊂Li}, we note that for all i∈I
d(T∗|Li,T|Li)≥(2n−3)1/2d(T∗|L,T|L)≥C′V, |
which implies KL(PLiT∗,PLiT)≥C″V for some C″V>0 by Lemma 3.2.
By the weak covering property, we have |I|≥c1kγ. Thus,
KL(PT∗,PT)≥∑i∈IKL(PLiT∗,PLiT)≥c1kγC″V |
which completes the proof.
Lemma 3.4 (Concentration bound). For any δ>0, k≥3, there exists c5(δ) such that
|Rk(T)−EPT∗[Rk(T)]|≤c5logk√k∀T, |
with probability at least 1−δ.
Proof. Since the functions logPLiT(Ti) is locally Lipschitz with respect to T (Assumption 2.3(a)) and number of species is finite, there exists C1>0 such that
|logPLiT(Ti)−logPLiT′(Ti)|≤C1d(T,T′)∀i,Ti,T,T′. |
Since the tree space is compact, C1d(T,T′) is bounded by a constant C2. Using Hoeffding's inequality [12], we obtain
P[|Rk(T)−EPT∗[Rk(T)]|≥x√k]≤2exp(−2x2kC22). |
On the other hand, we have |Rk(T)−Rk(T′)|≤C1d(T,T′) and |EPT∗[Rk(T)]−EPT∗[Rk(T′)]|≤C1d(T,T′). Thus, if we define the events
A(x,k,T)={|Rk(T)−EPT∗[Rk(T)]|≥x2√k} |
and
B(x,k,T)={∃T′:d(T,T′)≤x4C1√kand|Rk(T′)−EPT∗[Rk(T′)]|≥x√k} |
then B(x,k,T)⊂A(x,k,T). Note that the total number of balls of radius x/(4C1√k) required to cover the tree space is bounded above by
Cn(4g0C1√kx)2n−3(2n−3)!! |
where g0 is the upper bound of branch lengths and Cn is a constant depending on the number of species. To obtain the desired inequality, we will chose x such that
Cn(4g0C1√kx)2n−3(2n−3)!!×2exp(−2x2kC22)≤δ |
which can be done with x=C(δ,n,C2,g0)logk.
First, we establish that the ML estimator is consistent.
By Lemma 3.4, we have
R(ˆTk)≤Rk(ˆTk)+c5log(k)√k≤c5log(k)√k |
with probability at least 1−δ, since ˆTk is the maximizer of the empirical risk function.
On the other hand, let V be a neighborhood of T∗, by Lemma 3.3, we have
EPT∗[Rk(T′)]≥cV1k1−γ∀T′∉V. |
Since γ>1/2, there exists Kδ,V such that for k≥Kδ,V, we have
c5log(k)√k≤cV1k1−γ |
We deduce that for k≥Kδ,V, ˆTk∈V with probability at least 1−δ. This proves that the ML estimator is consistent.
Next, we will derive the convergence rate of the ML supertree. From Lemma 3.1, there exist a neighborhood U of T∗ and c4,m>0 such that
EPT∗[Rk(T)]=1kEPT∗[ℓk(T∗)−ℓk(T)]≥c41k1−γ d(T∗,T)m∀T∈U. |
For k≥Kδ,U, we have ˆTk∈U with probability at least 1−δ. By Lemma 3.4, with probability 1−2δ, we obtain
c41k1−γ d(T∗,ˆTk)m≤Rk(ˆTk)−Rk(ˆTk)≤c5log(k)√k. |
Here,
Rk(ˆTk)=1kℓk(T∗)−1kℓ(ˆTk)≤0 |
because ˆTk is the maximizer of Rk. This completes the proof.
In this section, we consider the exponential model, a simple model of gene tree estimation errors in which the probability of observing a given tree decreases exponentially with its distance from the species tree. For a detailed discussion on its validity and applicability, we refer the readers to [28] and the references therein.
Suppose d is some metric on the set of trees, in the exponential model, the probability of reconstructing any tree T′ with a leaf set L, when T∗ is the generating tree is proportional to an exponentially decaying function of the distance from T′ to T∗|L:
PLT∗(T′)=αT∗,Lexp(−βLd(T′,T∗|L)). |
Here βL is a constant that depends only on the set of leaves L, while αT∗,L is the normalizing constant to ensure that PLT∗ is a density function.
In this section, we verify the identifiability and regularity conditions for the exponential model when d is any continuous tree distance such that if T and T′ have the same topology, then d(T,T′) is the Euclidean distance.
Theorem 2. Under the exponential model with Assumptions 2.1, for any δ>0, Cδ>0 and Kδ>0 such that for all k≥Kδ,
d(ˆTk,T∗)≤Cδ(logkk(γ−1/2))1/m |
with probability at least 1−δ, where
m=4supi|Li|−4. |
Proof. First, we note that for all T′ with leaf set L, PLT(T′) is a positive and locally Lipschitz function. To obtain an upper bound on the convergence rate for the ML supertree, we need to verify Assumption 2.2 and Assumption 2.3(b)/Remark 2.3.
Identifiability. Since it is sufficient to prove either KL(PLT∗,PLT)>0 or KL(PLT,PLT∗)>0, we can assume that αT∗,L≥αT,L.
We denote
Ac={T′:βLd(T′,T∗|L)≤c} |
and pick c small enough such that βLd(T′,T|L)≥2c. We note that in Ac,
exp(−βLd(T′,T∗|L))≥exp(−βLd(T′,T|L)). |
Since αT∗,L≥αT,L, we have
αT∗,Lexp(−βLd(T′,T∗|L))−αT,Lexp(−βLd(T′,T|L))≥αT,L[exp(−βLd(T′,T∗|L))−exp(−βLd(T′,T∗|L))]. |
By Pinsker's inequality,
KL(PLT∗,PLT)≥2dTV(PLT∗,PLT)2≥2α2T,L[∫Acexp(−βLd(T′,T∗|L))−exp(−βLd(T′,T|L))]2≥2α2T,Lμ(Ac)2(ec−e2c)2>0, |
which establishes the identifiability of the exponential model. Here, μ is the Lebesgue measure.
Regularity. Consider a fixed tree T∈T and leaf set L, we first assume that αT∗,L≥αT,L. If we define
A={T′:βLd(T′,T∗|L)≤13d(T|L,T∗|L))}, |
then
βLd(T′,T|L)≥23d(T|L,T∗|L)∀T′∈A. |
Using the same argument as above, we have
dTV(PLT∗,PLT)≥αT,Liμ(A)[e−13d(T|L,T∗|L)−e−23d(T|L,T∗|L)]. |
It can be verified that there exists C>0 such that for all x∈[0,C], we have
e−x−e−2x≥x2. | (4.1) |
Moreover since d is Euclidean inside each orthant, if d(T,T∗) is small enough, we have
μ(A)=C2|L|−3(d(T|L,T∗|L)3βL)2|L|−3. | (4.2) |
Thus, let W be a neighborhood in the same topology of T∗ such that
αT,L≥12αT∗,L∀T∈W, |
and both Eqs (4.1) and (4.2) are satisfied, we have
KL(PLT∗,PLT)≥Cd(T,T∗|L)4|L|−4∀T∈W, |
for some constant C>0 independent of T.
For the case when αT∗,L<αT,L we define
A′={T′:βLd(T′,T∗|L)≤13d(T′|L,T|L))}, |
and the argument proceeds similarly. This validates Assumption 2.3(b) for exponential model with m=4n−4. However, if we use the regularity condition in Remark 2.3, we can obtain the result with
m=4supi|Li|−4. |
We apply our method to reconstruct the yeast species-tree from 106 genes of eight yeast species obtained from the R package phangorn [27]. Specifically, we reconstruct the gene-tree for each gene using UPGMA (unweighted pair group method with arithmetic mean) method implemented in phangorn. To replicate the setting of this paper, we randomly remove two species from each reconstructed gene-tree. In other words, the inputs are 106 reconstructed gene-trees, each has 6 species (see Figure 1 for examples).
We apply our method to estimate the full species-tree using the exponential model with BHV distance. The R function dist.multiPhylo in the package distory [6] is used to compute the BHV distance. The reconstructed species-tree (see Figure 2) is consistent with the well-known result of [26]. This confirms the applicability of our method to reliably estimate supertrees.
In this paper, we propose a novel analytic approach to analyze the convergence of the ML supertree method. Instead of focusing on reconstructing the correct discrete topology of the species tree as in previous studies [25,28, e.g.], we employ a continuous model of the tree space and analyze the ML estimator on this metric space, aiming at recovering both the topology and the branch lengths of the species tree. This framework enables us to use tools from statistical learning and information theory to establish the convergence rate of the ML estimator and at the same time, to weaken the conditions to obtain consistency and convergence of the estimator. Our weak covering property is an extension of the classical covering property [28] and provides a considerable relaxation on the sampling schemes for supertree estimation. Our identifiability condition is also more intuitive and generalizable than the well-known basic centrality condition and does not impose constraints on the shape of the probabilistic model of gene tree estimation errors. Our information-theoretical approach to analyze statistical estimator on tree spaces is of independent interest and can be extended to other problems in phylogenetics.
There are several avenues for future directions for this work. The first direction is extending our results to other practical models of phylogenetic errors, including the multiple-coalescent model (along with a detailed model of the effects of short sequence length on the accuracy in estimating the individual gene trees). Second, while our result provides a polynomial bound on the convergence rate, the power of the convergence (characterized by the geometric constant m) is not sharp. A sharper bound of the convergence rate would be of great interest to the field (from both theoretical and applied perspective) and would require further understanding of the tree-generating probabilistic model.
LSTH was supported by startup funds from Dalhousie University, the Canada Research Chairs program, the NSERC Discovery Grant RGPIN-2018-05447, and the NSERC Discovery Launch Supplement DGECR-2018-00181. VD was supported by a startup fund from University of Delaware and National Science Foundation grant DMS-1951474.
The author declares no conflict of interest in this paper.
[1] |
N. Amenta, M. Godwin, N. Postarnakevich, K. S. John, Approximating geodesic tree distance, Inform. Process. Lett., 103 (2007), 61-65. doi: 10.1016/j.ipl.2007.02.008
![]() |
[2] |
B. R. Baum, Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees, Taxon, 41 (1992), 3-10. doi: 10.2307/1222480
![]() |
[3] |
M. S. Bayzid, T. Warnow, Naive binning improves phylogenomic analyses, Bioinformatics, 29 (2013), 2277-2284. doi: 10.1093/bioinformatics/btt394
![]() |
[4] |
L. J. Billera, S. P. Holmes, K. Vogtmann, Geometry of the space of phylogenetic trees, Adv. Appl. Math., 27 (2001), 733-767. doi: 10.1006/aama.2001.0759
![]() |
[5] |
D. Bryant, R. Bouckaert, J. Felsenstein, N. A. Rosenberg, A. RoyChoudhury, Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis, Mol. Biol. Evol., 29 (2012), 1917-1932. doi: 10.1093/molbev/mss086
![]() |
[6] | J. Chakerian, S. Holmes, DISTORY: Distance between phylogenetic histories. R package version, 1 (2013). |
[7] |
J. Chifman, L. Kubatko, Quartet inference from SNP data under the coalescent model, Bioinformatics, 30 (2014), 3317-3324. doi: 10.1093/bioinformatics/btu530
![]() |
[8] | J. A. Cotton, M. Wilkinson, Majority-rule supertrees, Syst. biol., 56 (2007), 445-452. |
[9] | V. Dinh, L. S. T. Ho, M. A. Suchard, F. A. Matsen IV, Consistency and convergence rate of phylogenetic inference via regularization, Ann. Stat., 46 (2018), 1481. |
[10] |
J. Gatesy, M. S. Springer, Phylogenetic analysis at deep timescales: unreliable gene trees, bypassed hidden support, and the coalescence/concatalescence conundrum, Mol. Phylogenet. Evol., 80 (2014), 231-266. doi: 10.1016/j.ympev.2014.08.013
![]() |
[11] | J. Heled, A. J. Drummond, Bayesian inference of species trees from multilocus data, Mol. Biol. Evol., 27 (2009), 570-580. |
[12] |
W. Hoeffding, Probability inequalities for sums of bounded random variables, J. Am. Stat. Assoc., 58 (1963), 13-30. doi: 10.1080/01621459.1963.10500830
![]() |
[13] | S. Ji, J. Kollár, B. Shiffman, A global Łojasiewicz inequality for algebraic varieties, T. Am. Math. Soc., 329 (1992), 813-818. |
[14] |
L. S. Kubatko, B. C. Carstens, L. L. Knowles, STEM: species tree estimation using maximum likelihood for gene trees under coalescence, Bioinformatics, 25 (2009), 971-973. doi: 10.1093/bioinformatics/btp079
![]() |
[15] | M. K. Kuhner, J. Felsenstein, A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates, Mol. Biol. Evol., 11 (1994), 459-468. |
[16] |
B. R. Larget, S. K. Kotha, C. N. Dewey, C. Ané, BUCKy: gene tree/species tree reconciliation with bayesian concordance analysis, Bioinformatics, 26 (2010), 2910-2911. doi: 10.1093/bioinformatics/btq539
![]() |
[17] |
L. Liu, L. Yu, Estimating species trees from unrooted gene trees, Syst. Biol., 60 (2011), 661-667. doi: 10.1093/sysbio/syr027
![]() |
[18] |
L. Liu, L. Yu, S. V. Edwards, A maximum pseudo-likelihood approach for estimating species trees under the coalescent model, BMC Evol. Biol., 10 (2010), 302. doi: 10.1186/1471-2148-10-302
![]() |
[19] |
S. Mirarab, M. S. Bayzid, B. Boussau, T. Warnow, Statistical binning enables an accurate coalescent-based estimation of the avian tree, Science, 346 (2014), 1250463. doi: 10.1126/science.1250463
![]() |
[20] |
S. Mirarab, R. Reaz, M. S. Bayzid, T. Zimmermann, M. S. Swenson, T. Warnow, ASTRAL: genome-scale coalescent-based species tree estimation, Bioinformatics, 30 (2014), i541-i548. doi: 10.1093/bioinformatics/btu462
![]() |
[21] | E. Mossel, S. Roch, Incomplete lineage sorting: consistent phylogeny estimation from multiple loci, IEEE ACM T. Comput. Bi., 7 (2008), 166-171. |
[22] | S. Patel, R. T. Kimball, E. L. Braun, Error in phylogenetic estimation for bushes in the tree of life, Journal of Phylogenetics & Evolutionary Biology, (2013). |
[23] |
D. F. Robinson, Comparison of labeled trees with valency three, J. Comb. Theory B, 11 (1971), 105-119. doi: 10.1016/0095-8956(71)90020-7
![]() |
[24] |
S. Roch, M. Nute, T. Warnow, Long-branch attraction in species tree estimation: inconsistency of partitioned likelihood and topology-based summary methods, Syst. biol., 68 (2019), 281-297. doi: 10.1093/sysbio/syy061
![]() |
[25] |
S. Roch, T. Warnow, On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods, Syst. Biol., 64 (2015), 663-676. doi: 10.1093/sysbio/syv016
![]() |
[26] |
A. Rokas, B. L. Williams, N. King, S. B. Carroll, Genome-scale approaches to resolving incongruence in molecular phylogenies, Nature, 425 (2003), 798-804. doi: 10.1038/nature02053
![]() |
[27] |
K. P. Schliep, phangorn: phylogenetic analysis in r, Bioinformatics, 27 (2011), 592-593. doi: 10.1093/bioinformatics/btq706
![]() |
[28] |
M. Steel, A. Rodrigo, Maximum likelihood supertrees, Syst. Biol., 57 (2008), 243-250. doi: 10.1080/10635150802033014
![]() |
[29] |
P. Vachaspati, T. Warnow, ASTRID: accurate species trees from internode distances, BMC genomics, 16 (2015), 1-13. doi: 10.1186/1471-2164-16-1
![]() |