Renewing Felsenstein’s Bootstrap

Renewing Felsenstein’s Phylogenetic Bootstrap in the Era of Big Data

Just published online in Nature:

Renewing Felsenstein’s Phylogenetic Bootstrap in the Era of Big Data

F. Lemoine, J.-B. Domelevo Entfellner, E. Wilkinson, D. Correia, M. Dávila Felipe, T. De Oliveira, O. Gascuel*

* Correspondence:  

The phylogenetic bootstrap was proposed by Joseph Felsenstein more than 30 years ago. This method, based on resampling and replications, is used extensively to assess the robustness of phylogenetic inferences. Its usefulness, simplicity and interpretability made it extremely popular in evolutionary studies, to the point that it is generally required for publication of phylogenies. Felsenstein’s article has been cited more than 35,000 times and is ranked in the top 100 of the most cited scientific papers of all time. In 2017, it was cited more than 2,000 times.

However, it is commonly acknowledged that Felsenstein’s bootstrap is not appropriate for large datasets containing hundreds or thousands of taxa, which are now common thanks to high-throughput sequencing technologies. While such datasets generally contain a lot of phylogenetic information, Felsenstein’s bootstrap proportions (FBP) tend to be low, especially when the tree is inferred from a single gene or only a few genes. This degradation follows from the core methodology of Felsenstein’s bootstrap. A bootstrap branch must exactly match a branch in the original tree estimate to count toward the bootstrap support of that branch. A difference of just one taxon is enough for the bootstrap branch to be counted as absent, even though it is nearly identical to the original branch. The standard approach is to remove “rogue” (phylogenetically unstable) taxa and relaunch the analysis, but this is statistically questionable and computationally expensive. Moreover, with large trees, inferred branches are likely to contain errors and a large fraction of taxa may be unstable, even in the absence of model misspecification of any sort, and without long branches.
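The all-or-nothing matching described above can be illustrated with a minimal sketch. It assumes that each tree is given simply as a list of bipartitions, each written as the set of taxa on one side of a branch; the names `canonical` and `fbp` and the toy data are hypothetical and are not taken from any published tool.

```python
def canonical(side, taxa):
    """Return the side of the bipartition that excludes a fixed reference taxon,
    so that {A,B,C} and {D,E,F} are recognised as the same branch."""
    side = frozenset(side)
    reference = min(taxa)  # arbitrary but fixed choice of reference taxon
    return frozenset(taxa) - side if reference in side else side


def fbp(branch, bootstrap_trees, taxa):
    """Fraction of bootstrap trees that contain exactly the same bipartition."""
    target = canonical(branch, taxa)
    hits = sum(
        any(canonical(b, taxa) == target for b in tree)
        for tree in bootstrap_trees
    )
    return hits / len(bootstrap_trees)


# Toy data (assumed for illustration): six taxa, three bootstrap trees.
taxa = {"A", "B", "C", "D", "E", "F"}
bootstrap_trees = [
    [{"A", "B"}, {"A", "B", "C"}, {"A", "B", "C", "D"}],  # contains ABC|DEF
    [{"A", "B"}, {"A", "B", "D"}, {"A", "B", "C", "D"}],  # C and D swapped
    [{"A", "C"}, {"D", "E", "F"}, {"E", "F"}],            # contains ABC|DEF
]
support = fbp({"A", "B", "C"}, bootstrap_trees, taxa)
```

In this toy case the support is 2/3: the second bootstrap tree differs from the reference branch by a single misplaced taxon, yet it contributes nothing to the count.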

In this article, we propose a new version of the phylogenetic bootstrap, in which the presence of original branches in bootstrap trees is measured using a gradual “transfer” distance, as opposed to the binary presence/absence index of the original version. This distance is normalized to the [0, 1] range and averaged over all bootstrap trees. We thus obtain the “transfer bootstrap expectation” (TBE), which replaces the branch presence frequency of FBP (i.e. the expectation of a 0/1 function) with the expectation of a nearly continuous function. By construction, TBE supports are at least as high as FBP supports, and the difference is substantial for deep branches. When combined with consistent tree estimation, TBE rarely supports poor branches. Our results with mammal, HIV and simulated data sets clearly demonstrate its usefulness, especially with deep branches and large trees, where branches known to be essentially correct are supported by TBE but not by FBP. Importantly, TBE supports are easily interpreted as fractions of unstable taxa, and the ability of TBE to identify the most unstable taxa (e.g. recombinant HIV sequences) makes it possible to study them further, understand why they are phylogenetically unstable, and revise the branch supports. TBE computation and other phylogenetic tools are available from
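A minimal sketch of the transfer idea is given below. It rests on assumptions that the text above does not spell out. The transfer distance between two bipartitions is taken to be the number of taxa that must be moved to make them identical. The normalizing constant is taken to be p - 1, where p is the size of the smaller side of the reference branch. Each tree is given as a list of bipartitions, each written as the set of taxa on one side. The names `transfer_distance` and `tbe` are hypothetical and do not refer to any published tool.

```python
def transfer_distance(side_a, side_b, n_taxa):
    """Number of taxa to move so that the two bipartitions become identical.
    Either side of a bipartition may be supplied, hence the min()."""
    mismatches = len(frozenset(side_a) ^ frozenset(side_b))
    return min(mismatches, n_taxa - mismatches)


def tbe(branch, bootstrap_trees, taxa):
    """Average normalized transfer support of `branch` over the bootstrap trees."""
    branch = frozenset(branch)
    n = len(taxa)
    p = min(len(branch), n - len(branch))  # size of the smaller side
    if p < 2:
        raise ValueError("branch must be internal (at least 2 taxa per side)")
    total = 0
    for tree in bootstrap_trees:
        closest = min(transfer_distance(branch, b, n) for b in tree)
        # Capping at p - 1 is equivalent to also comparing against the
        # trivial one-taxon branches, which every tree contains.
        total += min(closest, p - 1)
    return 1 - total / (len(bootstrap_trees) * (p - 1))


# Toy data (assumed for illustration): six taxa, three bootstrap trees.
taxa = {"A", "B", "C", "D", "E", "F"}
bootstrap_trees = [
    [{"A", "B"}, {"A", "B", "C"}, {"A", "B", "C", "D"}],  # exact match
    [{"A", "B"}, {"A", "B", "D"}, {"A", "B", "C", "D"}],  # one taxon away
    [{"A", "C"}, {"D", "E", "F"}, {"E", "F"}],            # exact match
]
support = tbe({"A", "B", "C"}, bootstrap_trees, taxa)
```

For the branch ABC|DEF in this toy case, the transfer distances to the three bootstrap trees are 0, 1 and 0, so the support is 1 - 1/(3 × 2) = 5/6. A strict presence/absence count on the same data would give 2/3.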

FIGURE: Comparison of Felsenstein (FBP) and transfer (TBE) bootstrap supports on the simian subtree, extracted from a large mammal phylogeny (FastTree, using 1,449 COI-5P sequences). All simian taxa are included, along with two non-simians: one rogue taxon (Maxomys rajah, detected as rogue by TBE) and one stable but erroneous taxon with a partial sequence (Canis adustus). This simian tree is very close to the NCBI taxonomy, except for these two additional sequences. However, FBP supports very few clades, while TBE reveals the major groups (e.g. New World and Old World Monkeys).

Olivier Gascuel

Prof., CNRS & Institut Pasteur

My main research interest is in the field of evolution and phylogeny. My focus is on the mathematical and computational tools and concepts, which form an essential basis of evolutionary studies. I (co)authored several software programs, some widely diffused and well cited, notably BioNJ, SeaView and PhyML. I also coauthored probabilistic models to describe protein evolution (e.g. LG), simple methods to select phylogenetic models (SMS) and test branches (aLRT), and fast dating algorithms (LSD). My main applications are related to viruses, HIV, and molecular epidemiology.