All posts by Laurent Francioli

gnomAD v3.0

Laurent Francioli & Daniel MacArthur

We are thrilled to announce the release of gnomAD v3, a catalog containing 602M SNVs and 105M indels based on the whole-genome sequencing of 71,702 samples mapped to the GRCh38 build of the human reference genome. By increasing the number of whole genomes almost 5-fold from gnomAD v2.1, this release represents a massive leap in analysis power for anyone interested in non-coding regions of the genome or in coding regions poorly captured by exome sequencing. 

In addition, gnomAD v3 adds new diversity – for instance, by almost doubling the number of African-American samples we had in gnomAD v2 (exomes and genomes combined), and also including our first set of allele frequencies for the Amish population.

Here’s the population breakdown in this release:

Population Description Genomes
afr African/African-American 21,042
ami Amish 450
amr Latino/Admixed American 6,835
asj Ashkenazi Jewish 1,662
eas East Asian 1,567
fin Finnish 5,244
nfe Non-Finnish European 32,299
sas South Asian 1,526
oth Other (population not assigned) 1,077
Total 71702

In this blog post we wanted to lay out how we generated this new data set, specifically:

  • A new method, implemented in Hail, that allows the joint genotyping of tens of thousands of whole genomes
  • The approach we took for quality control and filtering of samples
  • The approach we took for quality control and filtering of variants

But first we’d like to say thanks to the many, many people who made this release possible. First, the 141 members of the gnomAD consortium – principal investigators from around the world who have generously shared their exome and genome data with the gnomAD team and the scientific community. Their commitment to open and collaborative science is at the core of gnomAD; this resource would not exist without them. And of course, we would like to thank the hundreds of thousands of research participants whose data are included in gnomAD for their dedication to science. 

The Broad Institute’s Genomics Platform sequenced over 80% of the genomes of this gnomAD release. We thank them for this important contribution to gnomAD, and for storage and computing resources essential to the project. The Broad’s Data Sciences Platform also played a crucial role – the Data Gen team for producing the uniform gvcfs required for joint calling, and Eric Banks and Laura Gauthier for their work on the immense callset that forms the core of gnomAD. We also owe an enormous debt of gratitude to the Hail team, especially Chris Vittal, for creating the novel gvcf combiner described below; a callset of this scale would have been impossible without his engineering. 

Konrad Karczewski, Laura Gauthier, Grace Tiao and Mike Wilson made absolutely critical contributions to the quality control and analysis process. In addition, the Broad’s Office of Research Subject Protection were key contributors to the management of gnomAD’s regulatory compliance. 

The amazing gnomAD browser is the creation of Nick Watts and Matt Solomonson, and there was considerable work required for this version to scale to the size of this dataset and move all of our annotations onto GRCh38. And finally, we wanted to thank Jessica Alföldi, who has contributed in myriad ways to the development of this resource, including a critical role in communicating with principal investigators, assembling the sample list, and gathering the permissions required for this release to be possible.

Continue reading gnomAD v3.0

gnomAD v2.1

Laurent Francioli, Grace Tiao, Konrad Karczewski, Matthew Solomonson and Nick Watts

We are delighted to announce the release of gnomAD v2.1! This new release of gnomAD is based on the same underlying callset as gnomAD v2.0.2, but has the following improvements and new features:

  • An awesome new browser
  • Per-gene loss-of-function constraint
  • Improved sample and variant filtering processes
  • Allele frequencies in sub-continental populations in Europe and East Asia
  • Allele frequencies computed for the following subsets of the data:
    • Controls-only (no cases from common disease case/control studies)
    • Samples not assessed for a neurological phenotype
    • Samples that were not part of a cancer cohort
    • Samples that are not part of the Trans-Omics for Precision Medicine (TOPMed)-BRAVO dataset
  • New annotations for each variant
    • Filtering allele frequency using Poisson 95% and 99% CI, per population
    • Age histogram of heterozygous and homozygous carriers

gnomAD v2.1 comprises a total of 16mln SNVs and 1.2mln indels from 125,748 exomes, and 229mln SNVs and 33mln indels from 15,708 genomes. In addition to the 7 populations already present in gnomAD 2.0.2, this release now breaks down the non-Finnish Europeans and East Asian populations further into sub-populations. The population breakdown is detailed below. Continue reading gnomAD v2.1