Semi-automated assembly of high-quality diploid human reference genomes

Erich D Jarvis; Giulio Formenti; Arang Rhie; Andrea Guarracino; Chentao Yang; Jonathan Wood; Alan Tracey; Francoise Thibaud-Nissen; Mitchell R Vollger; David Porubsky; Haoyu Cheng; Mobin Asri; Glennis A Logsdon; Paolo Carnevali; Mark J P Chaisson; Chen-Shan Chin; Sarah Cody; Joanna Collins; Peter Ebert; Merly Escalona; Olivier Fedrigo; Robert S Fulton; Lucinda L Fulton; Shilpa Garg; Jennifer L Gerton; Jay Ghurye; Anastasiya Granat; Richard E Green; William Harvey; Patrick Hasenfeld; Alex Hastie; Marina Haukness; Erich B Jaeger; Miten Jain; Melanie Kirsche; Mikhail Kolmogorov; Jan O Korbel; Sergey Koren; Jonas Korlach; Joyce Lee; Daofeng Li; Tina Lindsay; Julian Lucas; Feng Luo; Tobias Marschall; Matthew W Mitchell; Jennifer McDaniel; Fan Nie; Hugh E Olsen; Nathan D Olson; Trevor Pesout; Tamara Potapova; Daniela Puiu; Allison Regier; Jue Ruan; Steven L Salzberg; Ashley D Sanders; Michael C Schatz; Anthony Schmitt; Valerie A Schneider; Siddarth Selvaraj; Kishwar Shafin; Alaina Shumate; Nathan O Stitziel; Catherine Stober; James Torrance; Justin Wagner; Jianxin Wang; Aaron Wenger; Chuanle Xiao; Aleksey V Zimin; Guojie Zhang; Ting Wang; Heng Li; Erik Garrison; David Haussler; Ira Hall; Justin M Zook; Evan E Eichler; Adam M Phillippy; Benedict Paten; Kerstin Howe; Karen H Miga; Human Pangenome Reference Consortium

doi:10.1038/s41586-022-05325-5

Nature. 2022 Nov;611(7936):519-531. doi: 10.1038/s41586-022-05325-5. Epub 2022 Oct 19.

Authors

Erich D Jarvis^#¹,², Giulio Formenti^#³, Arang Rhie⁴, Andrea Guarracino⁵, Chentao Yang⁶, Jonathan Wood⁷, Alan Tracey⁷, Francoise Thibaud-Nissen⁸, Mitchell R Vollger⁹, David Porubsky⁹, Haoyu Cheng¹⁰,¹¹, Mobin Asri¹², Glennis A Logsdon⁹, Paolo Carnevali¹³, Mark J P Chaisson¹⁴, Chen-Shan Chin¹⁵, Sarah Cody¹⁶, Joanna Collins⁷, Peter Ebert¹⁷, Merly Escalona¹⁸, Olivier Fedrigo¹⁹, Robert S Fulton¹⁶, Lucinda L Fulton¹⁶, Shilpa Garg²⁰, Jennifer L Gerton²¹, Jay Ghurye²², Anastasiya Granat²³, Richard E Green¹², William Harvey⁹, Patrick Hasenfeld²⁴, Alex Hastie²⁵, Marina Haukness¹², Erich B Jaeger²³, Miten Jain¹², Melanie Kirsche²⁶, Mikhail Kolmogorov²⁷, Jan O Korbel²⁴, Sergey Koren⁴, Jonas Korlach²⁸, Joyce Lee²⁵, Daofeng Li²⁹,³⁰, Tina Lindsay¹⁶, Julian Lucas¹², Feng Luo³¹, Tobias Marschall¹⁷, Matthew W Mitchell³², Jennifer McDaniel³³, Fan Nie³⁴, Hugh E Olsen¹², Nathan D Olson³³, Trevor Pesout¹², Tamara Potapova²¹, Daniela Puiu³⁵, Allison Regier³⁶, Jue Ruan³⁷, Steven L Salzberg³⁵, Ashley D Sanders³⁸, Michael C Schatz²⁶, Anthony Schmitt³⁹, Valerie A Schneider⁸, Siddarth Selvaraj³⁹, Kishwar Shafin¹², Alaina Shumate³⁵, Nathan O Stitziel¹⁶,²⁹,⁴⁰, Catherine Stober²⁴, James Torrance⁷, Justin Wagner³³, Jianxin Wang³⁴, Aaron Wenger²⁸, Chuanle Xiao⁴¹, Aleksey V Zimin³⁵, Guojie Zhang⁴², Ting Wang¹⁶,²⁹,³⁰, Heng Li¹⁰, Erik Garrison⁴³, David Haussler⁴⁴,⁴⁵, Ira Hall⁴⁶, Justin M Zook³³, Evan E Eichler⁴⁴,⁹, Adam M Phillippy⁴, Benedict Paten¹², Kerstin Howe⁴⁷, Karen H Miga⁴⁸; Human Pangenome Reference Consortium

Affiliations

¹ Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA. ejarvis@rockefeller.edu.
² Howard Hughes Medical Institute, Chevy Chase, MD, USA. ejarvis@rockefeller.edu.
³ Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA. gformenti@rockefeller.edu.
⁴ Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
⁵ Genomics Research Centre, Human Technopole, Viale Rita Levi-Montalcini, Milan, Italy.
⁶ BGI-Shenzhen, Shenzhen, China.
⁷ Tree of Life, Wellcome Sanger Institute, Cambridge, UK.
⁸ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
⁹ Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.
¹⁰ Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.
¹¹ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
¹² UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.
¹³ Chan Zuckerberg Initiative, Redwood City, CA, USA.
¹⁴ Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.
¹⁵ Foundation for Biological Data Science, Belmont, CA, USA.
¹⁶ McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA.
¹⁷ Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany.
¹⁸ Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA.
¹⁹ Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA.
²⁰ Department of Biology, University of Copenhagen, Copenhagen, Denmark.
²¹ Stowers Institute for Medical Research, Kansas City, MO, USA.
²² Dovetail Genomics, Scotts Valley, CA, USA.
²³ Illumina, Inc., San Diego, CA, USA.
²⁴ European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany.
²⁵ Bionano Genomics, San Diego, CA, USA.
²⁶ Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
²⁷ Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA.
²⁸ Pacific Biosciences, Menlo Park, CA, USA.
²⁹ Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA.
³⁰ The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA.
³¹ School of Computing, Clemson University, Clemson, SC, USA.
³² Coriell Institute for Medical Research, Camden, NJ, USA.
³³ Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA.
³⁴ Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China.
³⁵ Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.
³⁶ DNAnexus, Mountain View, CA, USA.
³⁷ Agricultural Genomics Institute, Chinese Academy of Agricultural Sciences, Shenzhen, China.
³⁸ Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany.
³⁹ Arima Genomics, San Diego, CA, USA.
⁴⁰ Cardiovascular Division, John T. Milliken Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, USA.
⁴¹ State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China.
⁴² Center for Evolutionary & Organismal Biology, Zhejiang University School of Medicine, Hangzhou, China.
⁴³ Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA.
⁴⁴ Howard Hughes Medical Institute, Chevy Chase, MD, USA.
⁴⁵ Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA, USA.
⁴⁶ Yale School of Medicine, New Haven, CT, USA.
⁴⁷ Tree of Life, Wellcome Sanger Institute, Cambridge, UK. kj2@sanger.ac.uk.
⁴⁸ UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA. khmiga@ucsc.edu.

^# Contributed equally.

Abstract

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society¹,². However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals³,⁴. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome⁵. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity⁶. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yields the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Chromosome Mapping* / standards
Chromosomes, Human / genetics
Diploidy*
Genetic Variation / genetics
Genome, Human* / genetics
Genomics* / methods
Genomics* / standards
Haplotypes / genetics
High-Throughput Nucleotide Sequencing / methods
High-Throughput Nucleotide Sequencing / standards
Humans
Reference Standards
Sequence Analysis, DNA / methods
Sequence Analysis, DNA / standards
