Overview

Scientific discussion meeting organised by Professor Mark Achtman FRS, Professor Kathryn Holt and Professor David Aanensen.

The comparative genomics of microbial pathogens from all domains of life has become a big data problem. Databases already contain >200,000 assembled genomes from single species but adequate tools to reveal population structures are in their infancy. This meeting brought together world-class bioinformaticians with experts on bacterial and viral genomes to illustrate multiple approaches to solving this challenge.

A related journal issue has been published in Philosophical Transactions of the Royal Society B.

Attending the event

This meeting has taken place. You can watch the recording here.

Enquiries: contact the Scientific Programmes team

Organisers

Schedule


Chair

13:50-14:00
Open

Speakers

14:00-14:30
EnteroBase: Hierarchical clustering of >600,000 bacterial genomes

Abstract

The number of sets of genomic short read sequences in the public domain has exploded since 2012. EnteroBase (https://enterobase.warwick.ac.uk/) has assembled >600,000 draft genomes from short reads (from SRA or uploaded by users), assigned allelic designations to sequences from the core genome (cgMLST), and clustered the resulting sequence types at multiple levels (hierarchical clustering; HierCC, PMID: 33823553) for the genera Salmonella, Escherichia/Shigella, Streptococcus, Clostridioides, Vibrio and Yersinia (PMID: 31809257, 32726198, 33055096, 33614977). One HierCC level is a complete replacement for classical taxonomy or ANI because it automatically and reliably identifies species/sub-species. In several genera, two other HierCC levels correspond to ST Complexes and Super-Lineages, which are the predominant population structures in these genera. HierCC levels with even higher resolution are proving useful for tracking transmissions and single-source outbreaks of gastrointestinal disease. HierCC topologies are also consistent with trees based on the presence or absence of all genes in the pan-genome. These conclusions will be illustrated with specialised case studies.
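The idea of clustering cgMLST profiles at several levels can be illustrated with a toy example. The sketch below is not the HierCC implementation: it assumes plain single-linkage clustering on pairwise allelic distances, treats an allele value of 0 as a missing call, and uses invented profiles and thresholds. The function names and the `HC<threshold>` labels are illustrative only.

```python
from itertools import combinations


def allelic_distance(a, b):
    """Count loci where both profiles have a called allele and the alleles differ.

    Assumption: an allele value of 0 means "missing" and is ignored.
    """
    return sum(1 for x, y in zip(a, b) if x and y and x != y)


def single_linkage_clusters(profiles, threshold):
    """Assign each genome a cluster label at one distance threshold.

    Two genomes share a cluster if they are joined by a chain of pairwise
    distances that are all <= threshold (single linkage). The label is the
    alphabetically smallest genome name in the cluster.
    """
    names = sorted(profiles)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if allelic_distance(profiles[a], profiles[b]) <= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                # Keep the smaller name as the root so labels are stable.
                parent[max(ra, rb)] = min(ra, rb)
    return {n: find(n) for n in names}


def hierarchical_levels(profiles, thresholds):
    """Cluster the same profiles at several thresholds, one level per threshold."""
    return {f"HC{t}": single_linkage_clusters(profiles, t) for t in thresholds}


# Hypothetical five-locus profiles; real cgMLST schemes have thousands of loci.
toy_profiles = {
    "g1": [1, 1, 1, 1, 1],
    "g2": [1, 1, 1, 1, 2],
    "g3": [1, 1, 2, 2, 2],
    "g4": [5, 6, 7, 8, 9],
}

if __name__ == "__main__":
    for level, labels in hierarchical_levels(toy_profiles, [0, 1, 2]).items():
        print(level, labels)
```

Because single-linkage clusters at a low threshold are always nested inside those at a higher threshold, the levels form a hierarchy. The all-pairs comparison used here is quadratic in the number of genomes and would not scale to hundreds of thousands of genomes without further engineering.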

Speakers

14:30-15:00
Analysing bacterial population structure from millions of genomes

Abstract

In less than a decade, bacterial population genomics has progressed from sequencing dozens of strains to sequencing thousands of strains in a single study. There are now >250,000 genomes available even for a single bacterial species and the number of genomes is expected to continue to increase rapidly given the advances in sequencing technology and widespread genomic surveillance initiatives. The biological insights enabled by population genomics are particularly important in evolutionary epidemiology, as the genome sequences provide high-resolution data for the estimation of transmission and evolutionary dynamics, including the horizontal transfer of virulence and resistance elements. Professor Corander will discuss statistical and computational techniques that are amenable to rapidly analysing population structure in data consisting of millions of whole genomes.

Speakers

15:00-15:30
Discussion

Speakers


Chair

16:00-16:30
Beyond the S. aureus comet: what tree shapes occur in large bacterial genomic data?

Abstract

When methicillin-resistant Staphylococcus aureus (MRSA) arose and disseminated widely, some phylogenetic trees of MRSA-containing types of Staphylococcus aureus had a distinctive 'comet' shape, with a 'comet head' of recently-adapted resistant isolates in the context of a 'comet tail' that was predominantly drug sensitive. Placing an isolate in the context of such a 'comet' helped public health laboratories interpret local data within the broader setting of S aureus evolution. In this work Professor Colijn and her colleagues ask what other tree shapes, analogous to the MRSA comet, are present in bacterial WGS datasets. They extract trees from large bacterial genomic datasets, visualise them as images, and cluster the images. They find nine major groups of tree images, including the 'comet', star-like phylogenies, 'barbell' phylogenies and other shapes, and comment on the evolutionary and epidemiological stories these shapes might illustrate.

Speakers

16:30-17:00
Genome-scale metabolic network reconstructions of hundreds of diverse Escherichia coli strains reveal strain-specific adaptations and evolutionary trajectories

Abstract

Bottom-up approaches to systems biology rely on constructing a mechanistic basis for the biochemical and genetic processes that underlie cellular functions. Genome-scale network reconstructions of metabolism are built from all known metabolic reactions and metabolic genes in a target organism. A network reconstruction can be converted into a mathematical format and thus lend itself to mathematical analysis. Genome-scale models (GEMs) enable a systems approach to characterise the pan and core metabolic capabilities of the E coli species. The models have been used to systematically analyse growth capabilities in more than 650 different growth-supporting environments as well as to predict strain-specific auxotrophies. In this work, genome-scale models were constructed for more than 300 representative strains of E coli across all 295 HC1100 levels. The models were used to study E coli metabolic diversity and speciation on a large scale. The results show that unique strain-specific metabolic capabilities correspond to pathotypes and environmental niches. Genome-scale analysis of multiple strains of a species can thus be used to define the metabolic essence of a microbial species and delineate growth differences that shed light on the adaptation process to a particular microenvironment.
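The notions of core, pan and strain-specific metabolic capabilities can be made concrete with set operations over the reactions present in each strain's model. The sketch below is a simplification under stated assumptions: each strain is reduced to a set of reaction identifiers, and the strain names and reaction labels are invented placeholders rather than data from the work described. It does not simulate growth, which would require the full mathematical model.

```python
def core_and_pan(strain_reactions):
    """Return (core, pan) reaction sets across all strains.

    core: reactions present in every strain's reconstruction.
    pan:  reactions present in at least one strain's reconstruction.
    """
    sets = [set(reactions) for reactions in strain_reactions.values()]
    if not sets:
        return set(), set()
    return set.intersection(*sets), set.union(*sets)


def strain_specific(strain_reactions):
    """Return, for each strain, the reactions found in no other strain."""
    unique = {}
    for strain, reactions in strain_reactions.items():
        others = set()
        for other_strain, other_reactions in strain_reactions.items():
            if other_strain != strain:
                others |= set(other_reactions)
        unique[strain] = set(reactions) - others
    return unique


# Hypothetical reaction content for three strains (placeholder labels).
toy_models = {
    "strain_1": {"rxn_A", "rxn_B", "rxn_C"},
    "strain_2": {"rxn_A", "rxn_B", "rxn_D"},
    "strain_3": {"rxn_A", "rxn_B"},
}

if __name__ == "__main__":
    core, pan = core_and_pan(toy_models)
    print("core:", sorted(core))
    print("pan:", sorted(pan))
    print("strain-specific:", strain_specific(toy_models))
```

In this toy input the core contains the two reactions shared by all strains, while the strain-specific sets isolate the reactions that could, in a real analysis, be compared against pathotype or niche.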

Speakers

17:00-17:30
New methods with high accuracy and scalability for large-scale phylogenetic estimation

Abstract

The estimation of phylogenetic trees for individual genes or multi-locus datasets is a basic part of much biological research. In order to enable large trees to be computed, Disjoint Tree Mergers (DTMs) have been developed; these methods operate by dividing the input sequence dataset into disjoint sets, constructing trees on each subset, and then combining the subset trees (using auxiliary information) into a tree on the full dataset. DTMs have been used to advantage for multi-locus species tree estimation, enabling highly accurate species trees at reduced computational effort, compared to leading species tree estimation methods. The talk will show that DTMs can be used to improve the accuracy and speed of species tree estimation methods (eg, ASTRAL) as well as gene tree estimation methods (eg, RAxML), thus enabling these methods to run efficiently on much larger datasets than currently possible, and without the need for high-performance computing platforms or massive parallelism. These methods are available in open source form on GitHub.

Speakers

17:30-18:00
Discussion

Speakers


Chair

13:00-13:30
Opening the door to studying nucleotide-resolution genetic variation in bacterial pan-genomes

Abstract

When we study evolution of a bacterial species, we use different models, depending on what we want to achieve or infer. One approach is to reduce to single nucleotide polymorphism (SNP) variation in the 'core genome' (presumably inherited vertically) to study phylogeography or to study an outbreak. In focusing on SNPs (and invariant sites), it has been possible for researchers to build a range of sophisticated phylogenetic models. However, once we try to incorporate genome organisation, chromosomal rearrangements, movement of plasmids, transposons or phage, the modelling problem is far harder. The question of how to properly model bacterial genetic variation is wide open and extremely challenging. A prerequisite for any solution to this is a decision on how to describe the variation in the first place – you cannot model variation until you represent it. Note that this is true even if you have perfect genome assemblies: even if it were possible to multiple sequence align them, this would not really help with how to notice that a SNP at one position in one genome is 'the same' as a SNP somewhere else in another. This talk will cover a solution to this representation problem, showing how it is possible to represent the pan-genome of a species as a network of 'floating' graphs, representing the ensemble of known variation in orthology blocks (using genes and intergenic regions, but this could be done for mobile elements also). In doing so it becomes possible to discover and describe genetic variation at fine (SNP/indel) and coarse (gene order) level, and to compare diverse cohorts of genomes across the full pan-genome.

Speakers

13:30-14:00
How the interplay between mobile elements shapes bacterial genomes

Abstract

Horizontal gene transfer driven by self-mobilisable genetic elements allows the acquisition of complex adaptive traits and their transmission to subsequent generations. Transfer speeds up evolutionary processes as exemplified by the acquisition of virulence traits in emerging infectious agents and by antibiotic resistance in many human pathogens. Transfer is also costly because the vectors of horizontal transfer compete within genomes, have their own mobile elements and are often deadly. As a result, genomes are repositories of multiple defense systems from hosts and from mobile elements that interact in complex ways to drive gene flow in communities. The combination of evolutionary genomics and sequence analysis is now opening up these processes to show how they bring into the genome a constant flux of novel genes that favour the establishment and the invention of novel functions. 

Speakers

14:00-14:30
Diversification and adaptation of human skin bacteria during health and disease

Speakers

14:30-15:00
Discussion

Speakers


Chair

15:30-16:00
A scalable analytical approach from bacterial genomes to epidemiology

Abstract

Recent years have seen a remarkable increase in the practicality of sequencing whole genomes from large numbers of bacterial isolates. The availability of this data source has huge potential to deliver new insights into the evolution and epidemiology of bacterial pathogens, but the analytical methodology has been lagging behind the sequencing technology. Here Professor Didelot presents a step-by-step approach for such genomic epidemiology analyses, from bacterial genomes to epidemiological interpretations. A central component of this approach is the dated phylogeny, which is a phylogenetic tree with branch lengths measured in units of time. The construction of dated phylogenies from bacterial genomic data needs to account for the disruptive effect of recombination on phylogenetic relationships, and Professor Didelot describes how this can be achieved. Dated phylogenies can then be used to perform fine-scale or large-scale epidemiological analyses, depending on the proportion of cases for which genomes are available. A key feature of this approach is computational scalability, and in particular the ability to process hundreds or thousands of genomes within a matter of hours. This is a clear advantage of the step-by-step approach described here. Professor Didelot discusses other advantages and disadvantages of the approach, as well as potential improvements and avenues for future research.
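A dated phylogeny requires some way of relating genetic distance to time. One commonly used preliminary check, shown here as an illustration and not as the method presented in this talk, is a root-to-tip regression: the genetic distance from the root to each sampled genome is regressed against its sampling date, the slope estimates the substitution rate, and the date at which the fitted line reaches zero distance estimates the age of the root. The sketch assumes a strict clock, ignores recombination, and uses invented numbers; the function name is illustrative.

```python
def root_to_tip_regression(sampling_dates, root_to_tip_distances):
    """Ordinary least-squares fit of root-to-tip distance against sampling date.

    Returns (rate, intercept, root_date):
      rate      - slope, in substitutions per site per unit of time
      intercept - fitted distance at date 0
      root_date - date at which the fitted distance is zero, or None when the
                  slope is not positive (no usable temporal signal)
    """
    n = len(sampling_dates)
    if n < 2 or n != len(root_to_tip_distances):
        raise ValueError("need at least two matched (date, distance) pairs")

    mean_x = sum(sampling_dates) / n
    mean_y = sum(root_to_tip_distances) / n
    sxx = sum((x - mean_x) ** 2 for x in sampling_dates)
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(sampling_dates, root_to_tip_distances)
    )

    rate = sxy / sxx
    intercept = mean_y - rate * mean_x
    root_date = -intercept / rate if rate > 0 else None
    return rate, intercept, root_date


if __name__ == "__main__":
    # Hypothetical isolates: sampling year and root-to-tip distance.
    dates = [2000.0, 2005.0, 2010.0]
    distances = [0.010, 0.015, 0.020]
    rate, intercept, root_date = root_to_tip_regression(dates, distances)
    print(f"rate={rate:.4g} per site per year, root date={root_date:.1f}")
```

A weak or negative slope suggests that the data carry little temporal signal, in which case a dated tree built from them should be interpreted with caution.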

Speakers

16:00-16:30
Pathogenwatch and data tools to bridge genomics and epidemiology for public health

Speakers

16:30-17:00
Unlocking Typhi genomics data to inform public health policy

Abstract

Typhoid fever is a systemic infection caused by Salmonella enterica serovar Typhi (S Typhi). Antimicrobials are the mainstay of typhoid disease control, and effective antimicrobial therapy can reduce the rate of complications from 10–30% down to 1%. A new conjugate vaccine has recently been pre-qualified by WHO and national immunisation programs are currently being considered by many countries where the disease is endemic; however, data on disease burden, pathogen populations and antimicrobial resistance (AMR) are scarce in most such settings. Where typhoid surveillance is undertaken, namely for routine surveillance of travel-related infections in high income countries and burden studies in low income countries, whole genome sequencing (WGS) has been widely adopted as the primary method for characterisation of S Typhi isolates. WGS data can provide insights into pathogen diversity and transmission dynamics, as well as the emergence, dissemination and prevalence of AMR, much of which has relevance to understanding disease in settings other than those directly sampled (including regional trends, and country-of-acquisition for travel cases). However, the resulting data are not readily accessible to public health decision makers. To fill this gap we are developing an interactive dashboard (TyphiNET, http://typhi.net), which aims to provide a window into genome-derived surveillance information for non-genomics experts.
The dashboard relies on critical infrastructure that is being developed alongside it, including (i) a community-driven effort to publicly share S Typhi sequence and source data in a manner that facilitates downstream aggregation for public health surveillance (the Global Typhoid Genomics Consortium, https://www.typhoidgenomics.org/); (ii) the GenoTyphi genotyping scheme, which provides simple, stable, phylogenetically informative nomenclature to facilitate reporting and communication about pathogen variants; and (iii) Typhi Pathogenwatch, a public genomic epidemiology platform that provides uniform identification of genotypes and AMR determinants from genome data (in addition to whole-genome-based clustering), which is then fed into the TyphiNET dashboard.

Speakers

17:00-17:30
Discussion

Speakers