The utility of SARS-CoV-2 genomic data for informative clustering under different epidemiological scenarios and sampling

Infect Genet Evol. 2023 Sep:113:105484. doi: 10.1016/j.meegid.2023.105484. Epub 2023 Jul 31.

Abstract

Objectives: Clustering pathogen sequence data is a common practice in epidemiology to gain insights into the genetic diversity and evolutionary relationships among pathogens. It can identify groups of cases with a shared transmission history and common origin, as well as transmission hotspots. Motivated by the experience of clustering SARS-CoV-2 cases using whole genome sequence data during the COVID-19 pandemic to aid public health investigation, we investigated how differences in epidemiology and sampling can influence the composition of the clusters that are identified.

Methods: We performed genomic clustering on simulated SARS-CoV-2 outbreaks produced with different transmission rates and levels of genomic diversity, along with varying the proportion of cases sampled.

Results: In single outbreaks with a low transmission rate, decreasing the sampling fraction resulted in multiple, separate clusters being identified because intermediate cases in transmission chains were missed. Outbreaks simulated with a high transmission rate were more robust to changes in the sampling fraction and largely resulted in a single cluster that included all sampled outbreak cases. When considering multiple outbreaks in a sampled jurisdiction seeded by different introductions, low genomic diversity between introduced cases caused outbreaks to be merged into large clusters. If the transmission rate, the sampling fraction, and the diversity between introductions were all low, a combination of the spurious break-up of outbreaks and the linking of closely related cases in different outbreaks resulted in clusters that may appear informative but did not reflect the true underlying population structure. Conversely, genomic clusters matched the true population structure when there was relatively high diversity between introductions and a high transmission rate.
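
The two effects described in the Results can be illustrated with a toy example. The abstract does not state which clustering algorithm, distance threshold, or simulation model was used, so everything below is an assumption made for illustration only: a linear transmission chain in which each case gains exactly one new SNP, single-linkage clustering on pairwise SNP distance, and a 2-SNP threshold. The names `simulate_chain`, `cluster`, and `THRESHOLD` are hypothetical and do not come from the paper.

```python
from itertools import combinations

THRESHOLD = 2  # assumed SNP threshold for linking two cases (not from the paper)

def simulate_chain(intro, n_cases, tag):
    """Toy linear transmission chain: each case inherits its infector's
    mutations and gains exactly one new SNP (a simplifying assumption)."""
    genome = frozenset(intro)
    cases = {}
    for i in range(n_cases):
        if i > 0:
            genome = genome | {f"{tag}{i}"}
        cases[(tag, i)] = genome
    return cases

def cluster(cases, threshold):
    """Single-linkage clustering: link any two cases whose genomes differ
    by at most `threshold` SNPs, then return the connected components."""
    parent = {c: c for c in cases}

    def find(c):
        # Follow parent pointers to the component root, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in combinations(cases, 2):
        # SNP distance = number of mutations present in one genome but not the other.
        if len(cases[a] ^ cases[b]) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for c in cases:
        groups.setdefault(find(c), []).append(c)
    return sorted(groups.values())

# Effect 1: sparse sampling of one outbreak breaks it into several clusters.
outbreak_a = simulate_chain({"ref"}, 12, "A")
sampled = {k: g for k, g in outbreak_a.items() if k[1] % 4 == 0}
n_full = len(cluster(outbreak_a, THRESHOLD))
n_sparse = len(cluster(sampled, THRESHOLD))

# Effect 2: two outbreaks merge when their introductions are genetically close.
close_b = simulate_chain({"ref", "x1"}, 12, "B")  # 1 SNP from A's introduction
far_b = simulate_chain({"ref", "x1", "x2", "x3", "x4", "x5"}, 12, "B")  # 5 SNPs away
n_close = len(cluster({**outbreak_a, **close_b}, THRESHOLD))
n_far = len(cluster({**outbreak_a, **far_b}, THRESHOLD))

print(n_full, n_sparse, n_close, n_far)  # prints "1 3 1 2"
```

With full sampling the single outbreak forms one cluster, while keeping only every fourth case leaves three singletons because the unsampled intermediates no longer bridge the gaps. Two outbreaks collapse into one cluster when their introductions differ by a single SNP, and stay separate when the introductions are five SNPs apart.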

Conclusion: Differences in epidemiology and sampling can impact our ability to identify genomic clusters that describe the underlying population structure. These findings can help to guide recommendations for the use of pathogen clustering in public health investigations.

Keywords: Bioinformatics; Epidemiology; Infectious diseases; Mathematical modelling; Phylogenetics; SARS-CoV-2.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • COVID-19* / epidemiology
  • Cluster Analysis
  • Disease Outbreaks
  • Genomics
  • Humans
  • Pandemics
  • SARS-CoV-2* / genetics