Statistical mechanics of pan-genomes: A peek into the Grand Tinkerer’s workshop

To advance our understanding of how evolution shapes prokaryote pan-genomes we need new concepts and models. Using metabolic reactions, we introduced a sampling approach that disentangles the drivers of pan-genome evolution and provides a mechanistic interpretation for gene frequencies.

Our study started with an informal talk in a small room of the Theoretical Ecology department in Utrecht (NL). We were joined by colleagues who were investigating microbial patterns in a variety of biological systems, ranging from virtual bugs(1) to “real” large metagenomic datasets(2). Our subjects varied widely, but usually revolved around a healthy balance between theory and biological data.

A recent “Trends in Microbiology” paper triggered the debate(3). In it, Young presented a useful analogy: “Bacteria are smartphones and mobile genes are Apps”. Modern smartphones have operating systems, which are like bacterial core genes: needed for basic functionality and shared by all the recent phones from a manufacturer. They also have “optional” apps, which are like bacterial accessory genes, providing additional functionality and a personalized user experience. Are some operating systems more “pluggable” than others (i.e. are new “apps” more compatible and more easily installed)? Is core-genome pluggability related to a species’ niche breadth? How about the genes? Are some of them interoperable (i.e. can they be easily installed in any operating system)? Are the interoperable genes “useful”, or are they selfish genetic elements that manage to stick around?

“Bacteria are smartphones and mobile genes are Apps”. A useful analogy by Young (2016)(3).

In the midst of these questions, the conversation soon shifted towards a recent unsettled debate in the literature. Using the same concept of effective population size, two studies reached exactly opposite conclusions:

  • “pangenomes are the result of adaptive, not neutral, evolution”(4)

  • “accessory gene turnover is for a large part dictated by neutral evolution”(5)

For some context, note that genetic variation, either neutral or adaptive, can spread from the current to future generations (i.e. get fixed). The number of generations needed to fix a variant depends on the effective population size. Effective population sizes are usually different from the actual number of individuals (the census population). For instance, humans are estimated to have a small effective population size, in the range of thousands to tens of thousands, in contrast with their census population of nearly eight billion individuals(6). Bacterial species usually have larger effective population sizes (typically > 10^8 (7)).

Two very important consequences derive from the effective population size:

  • The larger the effective population size, the sharper the eyesight of natural selection. In large effective populations, even small selective advantages are easily fixed. For instance, a preference for a codon that allows slightly more efficient translation quickly spreads in a bacterial lineage(8), whereas the same would require countless human generations. McInerney et al. (2017) concluded that, because of large effective populations, “horizontal gene transfer (HGT) genes are largely—though not always—adaptive, and the presence of pangenomes is typically an adaptive phenomenon”. They also argued that the high HGT rates commonly observed in bacterial lineages are a hallmark of the adaptive value of HGT: if HGT were even slightly deleterious, one would expect the large population sizes to quickly reduce its rates(4).


  • The larger the effective population size, the more frequent are neutral variations. Neutral or nearly neutral variations take many generations to get fixed in large effective populations. As a result, there are many “drifting” variants and high genetic diversity in each generation. Andreani et al. (2017) found a significant positive correlation between the number of these drifting variants and the rates of HGT, concluding that HGT rates are driven by neutral evolution(5).
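
Both consequences can be made concrete with the standard diffusion approximation for the fixation probability of a new mutation in a haploid Wright-Fisher population (Kimura's formula). This is a textbook result, not something taken from the two studies above; the function names and the example values of s and Ne are illustrative assumptions.

```python
import math

def fixation_probability(s: float, n_e: float) -> float:
    """Approximate fixation probability of a single new mutant (haploid).

    Diffusion approximation: u = (1 - exp(-2s)) / (1 - exp(-2*Ne*s)),
    starting from frequency 1/Ne. For s == 0 it reduces to the neutral 1/Ne.
    """
    if s == 0:
        return 1.0 / n_e
    exponent = -2.0 * n_e * s
    if exponent > 700:  # strongly deleterious: exp() would overflow, u ~ 0
        return 0.0
    return math.expm1(-2.0 * s) / math.expm1(exponent)

def selection_visibility(s: float, n_e: float) -> float:
    """Fixation probability relative to a neutral mutation, u / (1/Ne)."""
    return fixation_probability(s, n_e) * n_e

# Illustrative values only: a tiny selective advantage of s = 1e-6.
for n_e in (1e4, 1e8):
    print(f"Ne={n_e:.0e}: {selection_visibility(1e-6, n_e):.2f}x neutral")
```

With these assumed numbers, the same tiny advantage is practically indistinguishable from neutral at Ne = 10^4 (about 1x the neutral fixation probability) but roughly 200 times more likely to fix than a neutral variant at Ne = 10^8. The neutral fixation probability itself, 1/Ne, shrinks as Ne grows, which is why neutral variants linger as drifting polymorphisms in large populations.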


Adaptive views tend to be favored in the pan-genome literature, particularly in empirical studies. Core genes are commonly considered an essential and irreplaceable part of the lineage’s operating system, whereas accessory genes are considered to correspond to some recent short-term adaptation. This view is supported by numerous examples where HGT provided the perfect recipe for a lineage to grow on a new substrate, evade an antibiotic, or even jump from being a harmless environmental organism to a deadly pathogen(9). A gene’s frequency would then correspond to its adaptive value (highly abundant = highly beneficial), and different frequencies, proportionally, to different niches or environmental opportunities (e.g., a low frequency corresponding to a recent or rare niche opportunity). Wouldn’t it be fascinating if we could connect the genomics of a lineage to its ecology by simply correlating the pan-genomic gene frequencies to gene functions?

Unfortunately, such a naïve expectation misplaces the driving forces of pan-genome evolution. It reproduces some of the pitfalls that were labeled by Gould & Lewontin as the “adaptationist programme” or the “Panglossian paradigm”(10). The name refers to Prof. Pangloss, a character from Voltaire’s Candide, or The Optimist, for whom “all is for the best” in the “best of all possible worlds”: “Our noses were made to carry spectacles, so we have spectacles. Legs were clearly intended for breeches, and we wear them.” The adaptationist programme “proceeds by breaking an organism into unitary ‘traits’ and proposing an adaptive story for each considered separately”(10). Such a programme risks confusing the “spandrels” for the “dome”. Spandrels, even if magnificently and harmoniously adorned, “set the quadripartite symmetry of the dome above”(10) and are really the consequence of architectural constraints that emerge when one attempts to build the cathedral’s dome. Like the spandrels, the agents of evolutionary change are better explained by constraints and opportunities at a higher level.

What really changes when we shift the lens of natural selection from a single lineage evolving through vertical inheritance (such as Lenski’s famous E. coli experiment) to the pan-genome, i.e. when we consider multiple lineages possibly connected through the exchange of genes via horizontal gene transfer? In the light of HGT, natural selection and the drivers of molecular evolution take on a whole new meaning, and new phenomena need to be considered:

  • Sex & gene pools. Not all operating systems have access to all apps. The more similar two operating systems are, the more likely one is to find apps that work on both of them. HGT commonly operates through mechanisms that favor genetic similarity (e.g. phage-host affinity, homologous recombination), creating a common gene pool (app store) for similar genomes and “sexually” isolating others. Unlike in sexually reproducing organisms, this isolation is gradual and noisy. It is common to find related strains that are more isolated with regard to some genes than distantly related strains are, and we often find promiscuous genetic elements that cross virtually any species barrier. Nevertheless, the similarity dependence of HGT introduces a weak but still significant barrier to gene flow.


  • Tinkering. The analogy of operating systems and apps used so far can easily be mistaken for a cleverly engineered system. However, randomly shopping for genes from a fortuitous gene pool is completely different from intentionally visiting the app store for your much-needed apps. The functions of newly installed genes need to somehow fit in. Even if natural selection can do a good job of filtering the useful and eliminating the junk, it can only work with what is available. It does not function as the grand engineer, but rather as the grand tinkerer. In the words of François Jacob: “a tinkerer who does not know exactly what he is going to produce but uses whatever he finds around him whether it be pieces of strings, fragments of wood, or old cardboards; (…) to produce some kind of workable object. (…) What he ultimately produces is generally related to no special project, and it results from a series of contingent events, of all opportunities he had to enrich his stock with leftovers”(11). The natural MacGyver.

  • Historical accidents. Tinkering often leads to an unexpected dependency on history: “tinkerers who tackle the same problem are likely to end up with different solutions”(11). Often multiple solutions to the same problem exist. Also, the hierarchical nature of biology favors historically contingent structures. For instance, the functional metabolic network is something more than the collection of individual reactions. The products of one reaction are the substrates of others, creating interdependent reactions that work together to provide an organism’s energy and biomass. Even if a novel metabolic gene finds its way into a genome, its expression and function depend on the preexisting machinery, setting a particular path that depends on the lineage’s history.

You can read in our paper how we used the set of genome-encoded metabolic reactions, the reactome, to understand gene frequencies in pan-genomes(12). In order to get a statistical peek into the mechanics of gene frequency distribution, we built an in silico system with three important ingredients of pan-genome evolution:

  • The genotype: represented by the presence or absence of metabolic reactions (proxies for genes). One can imagine a “genotype tape” of zeros and ones. Each entry in the tape represents the presence or absence of a gene. The size of the tape corresponds to the number of genes (reactions) in the pan-genome. “Mutations” represent a switch from zero to one when a gene is inserted, or from one to zero when it is deleted. Each evolutionary lineage is represented by its own tape (genome).


  • The environment: all organisms wander around in an ever-changing world. We modeled this variable environment as a random, uniform distribution of metabolite compositions, representing a large, unbiased set of environments that a lineage could have encountered in its evolutionary path.


  • The phenotype: a common way to represent phenotype in metabolic models is by taking the flux over the biomass equation. Accordingly, we implemented the organismal phenotype in a specific environment as the maximum amount of biomass that could be produced by the genotype-encoded metabolic reactions given the corresponding metabolite composition.
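
The three ingredients can be sketched in a few lines of Python. This is a toy illustration, not the implementation from the paper: the four-reaction network, the metabolite names, and the function names are hypothetical, and the phenotype is simplified to a yes/no reachability check for "biomass" rather than a maximized biomass flux.

```python
import random

# Hypothetical toy pan-genome: each reaction maps substrates -> products.
REACTIONS = [
    ({"glucose"}, {"pyruvate"}),
    ({"lactate"}, {"pyruvate"}),
    ({"pyruvate"}, {"biomass"}),
    ({"acetate"}, {"biomass"}),
]
NUTRIENTS = ["glucose", "lactate", "acetate"]

def mutate(tape, rng):
    """Flip one random entry: 0 -> 1 is a gene gain, 1 -> 0 a gene loss."""
    new_tape = list(tape)
    i = rng.randrange(len(new_tape))
    new_tape[i] = 1 - new_tape[i]
    return new_tape

def random_environment(rng):
    """Each nutrient is present with probability 0.5 (uniform over subsets)."""
    return {m for m in NUTRIENTS if rng.random() < 0.5}

def can_grow(tape, environment):
    """Boolean stand-in for the phenotype: is 'biomass' reachable?"""
    available = set(environment)
    changed = True
    while changed:  # keep firing reactions until nothing new is produced
        changed = False
        for present, (substrates, products) in zip(tape, REACTIONS):
            if present and substrates <= available and not products <= available:
                available |= products
                changed = True
    return "biomass" in available

tape = [1, 0, 1, 0]                    # glucose -> pyruvate -> biomass
print(can_grow(tape, {"glucose"}))     # True
print(can_grow(tape, {"acetate"}))     # False
```

A real metabolic model would replace `can_grow` with an optimization over reaction fluxes, but the tape, the randomly drawn environment, and a genotype-plus-environment-to-phenotype map are the same three pieces.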


With these three ingredients in hand, we can represent both potential and realized genotypes for a given pan-genome (see the cartoon for illustration). The easiest are the realized, or natural, genotypes: those we find in a genome database. We take the genomes that belong to the same family, map their metabolic genes, and construct their genotype tapes. By piling up these tapes into a table, we can calculate the gene frequencies as they have been realized by the bacteria in nature. An assumption is that the number of genomes in the database is sufficiently large and represents a relatively unbiased sample of the natural biodiversity of the family(2).
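
Piling up tapes and reading off frequencies amounts to taking column means of a presence/absence table. A minimal sketch with made-up tapes follows; the labels "core", "accessory", and "absent" and their cut-offs are illustrative assumptions.

```python
def gene_frequencies(tapes):
    """Column-wise mean of a genomes x genes presence/absence table."""
    if not tapes:
        raise ValueError("need at least one genotype tape")
    n_genomes = len(tapes)
    return [sum(column) / n_genomes for column in zip(*tapes)]

def label(frequency):
    """Assumed labels: present in every genome, in some, or in none."""
    if frequency == 1.0:
        return "core"
    if frequency == 0.0:
        return "absent"
    return "accessory"

# Four hypothetical genomes (rows) over three reactions (columns).
tapes = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]
freqs = gene_frequencies(tapes)
print(freqs)                      # [1.0, 0.5, 0.25]
print([label(f) for f in freqs])  # ['core', 'accessory', 'accessory']
```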


The pan-genome encoded reactions, here represented by the different facial features, allow us to look beyond the extant organisms (realized) into the potential organisms (or metabolic reaction networks) that could be formed by HGT.

The potential genotypes are the ones that could exist, i.e. whose reactions are encoded within the pan-genome gene pool. To obtain a statistical sample of potential genotypes across the alternative environments, we added two important further constraints. First, the reactions encoded by a valid genotype must collectively support growth. Second, to reflect the rapid loss of genes, we require a genotype to be irreducible, meaning that the removal of any single reaction impairs growth. In a single environment, there are often many ways to fulfill these two conditions, so we statistically sample a representative number of these irreducible genotypes.
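
One simple way to obtain an irreducible genotype is to start from the full gene pool and delete reactions in random order, undoing any deletion that abolishes growth. The sketch below uses this random-pruning idea on a hypothetical four-reaction network with a reachability-based growth check. It is an assumption-laden stand-in, not the sampling procedure of the paper (which is based on elementary flux modes), and random pruning does not guarantee a uniform sample over all irreducible genotypes.

```python
import random

# Hypothetical toy pan-genome: index -> (substrates, products).
REACTIONS = [
    ({"glucose"}, {"pyruvate"}),
    ({"lactate"}, {"pyruvate"}),
    ({"pyruvate"}, {"biomass"}),
    ({"acetate"}, {"biomass"}),
]

def can_grow(active, environment):
    """True if 'biomass' is reachable using only the active reactions."""
    available = set(environment)
    changed = True
    while changed:
        changed = False
        for i in active:
            substrates, products = REACTIONS[i]
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return "biomass" in available

def sample_irreducible(environment, rng):
    """Prune the full pan-genome in random order; None if it cannot grow."""
    active = set(range(len(REACTIONS)))
    if not can_grow(active, environment):
        return None
    order = list(active)
    rng.shuffle(order)
    for r in order:
        # Keep the deletion only if the remaining reactions still grow.
        if can_grow(active - {r}, environment):
            active.discard(r)
    return frozenset(active)

environment = {"glucose", "acetate"}
samples = [sample_irreducible(environment, random.Random(seed))
           for seed in range(20)]
print(set(samples))  # subsets drawn from {0, 2} (via pyruvate) and {3} (acetate)
```

A single pass is enough here because growth in this toy model is monotone: adding reactions never prevents growth, so a reaction that was needed when it was tested remains needed in the smaller final set.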

The realized genotypes found in nature are not irreducible like the sampled potential genotypes. Most bacterial genomes are ready to thrive in diverse and changing environmental conditions, thus they commonly encode a combination of irreducible modules. Furthermore, HGT is an active and ongoing process. Many reactions could be from recent insertion events that have not found a function or are in the process of being lost; others could have frequencies that correspond to fixation by stochastic and neutral evolution; still others could be part of selfish genetic elements that are of no use to the host’s growth phenotype.

In the light of this genomic flexibility, we asked: what does the frequency of a gene in the pan-genome mean for its function?

Contrasting the potential and the realized pan-genomes suggests some answers to this question. First of all, the realized frequency on its own is arguably meaningless since it’s masked by the neutral dynamics and other confounding forces. But gene frequency in the potential irreducible genotypes is highly meaningful and correlates with life history and biological function. We can illustrate this by comparing what it means to be a “core” or a “shell” gene in the potential pan-genomes to its meaning in the realized pan-genome.

Core genes in the potential pan-genome are indispensable in all cases and in all environments; they are a proper subset of the realized core genes. The remaining realized core genes have widely distributed potential frequencies, reflecting a range of neutral and adaptive forces. For instance, many genes have a realized frequency of 100% (i.e. are found in all the genomes sequenced to date) but can be removed from the metabolic networks in many environments without impairing growth.

Genes with intermediate frequencies, i.e. “shell” genes in the potential pan-genomes, are genes that are required only under some conditions. To expose these conditions, it is useful to distinguish intermediate frequencies within and across environments. Within an environment, intermediate frequencies reflect genes encoding alternative reaction routes to the same phenotype, whereas across environments they reflect adaptation to specific components of the environment.
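
The within/across distinction can be read directly from per-environment frequencies in a sample of irreducible genotypes. The sketch below uses hand-written, hypothetical samples; the reaction names, the labels, and the order in which the rules are applied are illustrative assumptions, and a reaction that fits several categories simply receives the first matching label.

```python
def within_environment_frequencies(samples, reaction):
    """Frequency of a reaction among the irreducible genotypes of each environment."""
    return {
        env: sum(reaction in genotype for genotype in genotypes) / len(genotypes)
        for env, genotypes in samples.items()
    }

def interpret(samples, reaction):
    """Assumed, simplified reading of the per-environment frequencies."""
    freqs = list(within_environment_frequencies(samples, reaction).values())
    if all(f == 1.0 for f in freqs):
        return "potential core"
    if any(0.0 < f < 1.0 for f in freqs):
        return "alternative route within an environment"
    if any(f == 1.0 for f in freqs):
        return "environment-specific"
    return "never required"

# Hypothetical irreducible genotypes, grouped by the environment they grow in.
samples = {
    "glucose": [{"glc_to_pyr", "pyr_to_bio"}],
    "lactate": [{"lac_to_pyr", "pyr_to_bio"}],
    "glucose+acetate": [{"glc_to_pyr", "pyr_to_bio"}, {"ace_to_bio"}],
}
print(within_environment_frequencies(samples, "ace_to_bio"))
# {'glucose': 0.0, 'lactate': 0.0, 'glucose+acetate': 0.5}
print(interpret(samples, "ace_to_bio"))  # alternative route within an environment
print(interpret(samples, "lac_to_pyr"))  # environment-specific
```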

Taken together, we used the statistical distribution of reactions in the potential genotypes to predict aspects of the environment where a lineage lives and has evolved, helping us expose the inner workings of horizontal gene transfer(12), the Grand Tinkerer of pan-genome evolution.


  1. van Dijk B, Hogeweg P, Doekes HM, Takeuchi N. Slightly beneficial genes are retained by bacteria evolving DNA uptake despite selfish elements. Elife. 2020;9:e56801.
  2. von Meijenfeldt FAB, Hogeweg P, Dutilh BE. On specialists and generalists: niche range strategies across the tree of life [Internet]. bioRxiv; 2022 [cited 2022 Sep 10]. p. 2022.07.21.500953. Available from:
  3. Young JPW. Bacteria Are Smartphones and Mobile Genes Are Apps. Trends Microbiol. 2016 Dec 1;24(12):931–2.
  4. McInerney JO, McNally A, O’Connell MJ. Why prokaryotes have pangenomes. Nat Microbiol. 2017 Mar 28;2(4):1–5.
  5. Andreani NA, Hesse E, Vos M. Prokaryote genome fluidity is dependent on effective population size. ISME J. 2017 Jul;11(7):1719–21.
  6. Park L. Effective population size of current human population. Genet Res. 2011 Apr;93(2):105–14.
  7. Bobay LM, Ochman H. Factors driving effective population size and pan-genome evolution in bacteria. BMC Evol Biol. 2018 Oct 12;18:153.
  8. McInerney JO. Prokaryotic Genome Evolution as Assessed by Multivariate Analysis of Codon Usage Patterns. Microb Comp Genomics. 1997 Jan;2(1):89–97.
  9. Vibrio cholerae and cholera: out of the water and into the host | FEMS Microbiology Reviews | Oxford Academic [Internet]. [cited 2022 Sep 10]. Available from:
  10. Gould SJ, Lewontin RC, Maynard Smith J, Holliday R. The spandrels of San Marco and the Panglossian paradigm: a critique of the adaptationist programme. Proc R Soc Lond B Biol Sci. 1979 Sep 21;205(1161):581–98.
  11. Jacob F. Evolution and Tinkering. Science. 1977 Jun 10;196(4295):1161–6.
  12. Garza DR, von Meijenfeldt FAB, van Dijk B, Boleij A, Huynen MA, Dutilh BE. Nutrition or nature: using elementary flux modes to disentangle the complex forces shaping prokaryote pan-genomes. BMC Ecol Evol. 2022 Aug 16;22(1):101.

