It is an absolute pleasure to announce the official release of the gnomAD manuscript package. In a set of seven papers, published in Nature, Nature Medicine, and Nature Communications, we describe a wide variety of different approaches to exploring and understanding the patterns of genetic variation revealed by exome and genome sequences from 141,456 humans.
Publication announcements always feel a little strange in this new era of open science. In our case, the underlying gnomAD data set has been publicly fully available for browsing and downloading since October 2016, and we’ve had the preprints available online since early 2019. However, it’s undeniable that there is something deeply gratifying about seeing these pieces of science revealed in their final, concrete form.
For me this package has a particular significance – it represents the culmination of seven and a half years of work with a phenomenal team at the Broad Institute, and marks my transition to a new role in Australia, and the handover of the gnomAD project to new leadership. So I wanted to spend some time in this post reflecting on the history of the project that became gnomAD, the people who’ve made it possible, and where things will go from here. Continue reading The gnomAD papers→
We are thrilled to announce the release of gnomAD v3, a catalog containing 602M SNVs and 105M indels based on the whole-genome sequencing of 71,702 samples mapped to the GRCh38 build of the human reference genome. By increasing the number of whole genomes almost 5-fold from gnomAD v2.1, this release represents a massive leap in analysis power for anyone interested in non-coding regions of the genome or in coding regions poorly captured by exome sequencing.
In addition, gnomAD v3 adds new diversity – for instance, by almost doubling the number of African-American samples we had in gnomAD v2 (exomes and genomes combined), and also including our first set of allele frequencies for the Amish population.
Ryan Collins, Harrison Brand, Daniel MacArthur, and Mike Talkowski
The first gnomAD structural variant (SV) callset is now available via the gnomAD website and integrated directly into the gnomAD Browser.
This initial gnomAD SV callset includes nearly a half-million distinct SVs across seven SV mutational classes and 13 subclasses of complex SVs detected in 14,891 genomes spanning four major global populations. In the publicly released callset and gnomAD browser, you can find site, frequency, and annotation data for ~445k SVs from 10,738 unrelated genomes with appropriate consent to allow the release of this information. In this post we summarize how we created this new call set, and some important practical considerations when using it. You can get more details, including callset generation and analyses, in the full gnomAD-SV preprint available on bioRxiv.
Allele frequencies in sub-continental populations in Europe and East Asia
Allele frequencies computed for the following subsets of the data:
Controls-only (no cases from common disease case/control studies)
Samples not assessed for a neurological phenotype
Samples that were not part of a cancer cohort
Samples that are not part of the Trans-Omics for Precision Medicine (TOPMed)-BRAVO dataset
New annotations for each variant
Filtering allele frequency using Poisson 95% and 99% CI, per population
Age histogram of heterozygous and homozygous carriers
gnomAD v2.1 comprises a total of 16mln SNVs and 1.2mln indels from 125,748 exomes, and 229mln SNVs and 33mln indels from 15,708 genomes. In addition to the 7 populations already present in gnomAD 2.0.2, this release now breaks down the non-Finnish Europeans and East Asian populations further into sub-populations. The population breakdown is detailed below.
Well, it is time once again for the American Society of Human Genetics Meeting – this year being held in San Diego, Oct. 16-20. Here’s a guide to the latest science that the MacArthur lab, the Broad Institute Rare Disease Group, and our close affiliates, will be presenting at the meeting.
Executive summary: the NIH is seeking comments on a new proposed policy on genomic data sharing. While there is much to like about the new policy, we are very concerned about the proposed requirement for a click-through agreement on all aggregate genomic resources (which would include heavily-used databases such as ExAC and gnomAD). Our draft response to the Request for Comments is below. If you agree with our concern, please consider replying to the Request for Comments yourself, using the template text at the end of this post if useful.
Today, we are pleased to announce the formal release of the genome aggregation database (gnomAD). This release comprises two callsets: exome sequence data from 123,136 individuals and whole genome sequencing from 15,496 individuals. Importantly, in addition to an increased number of individuals of each of the populations in ExAC, we now additionally provide allele frequencies across over 5000 Ashkenazi Jewish (ASJ) individuals. The population breakdown is detailed in the table below. Continue reading The genome Aggregation Database (gnomAD)→
Today we are celebrating the official publication of the Exome Aggregation Consortium (ExAC) paper in Nature – marking the end of a phase in this project that has involved most of the members of my lab (and many, many others beyond) for a large chunk of the last few years. This official publication is an opportune time to reflect on how ExAC came to be, and the impact it’s had on us and the wider community.
First, some background
Exome sequencing is a very cost-effective approach that allows us to look with high resolution at just the 1-2% of the human genome that codes for protein – these are the parts we understand the best, and also the parts where the vast majority of severe disease-causing mutations are found. Because exome sequencing is so powerful it’s been applied to tens of thousands of patients with rare, severe diseases such as muscular dystrophy and epilepsy. However, a key challenge when sequencing patients is that everyone carries tens of thousands of genetic changes, and we need a database of “normal” variation that tells us which of those changes are seen in healthy people, and how common they are.
As we were organizing analyses for the ExAC flagship paper, we were inspired by Titus Brown’s manifesto on reproducibility. As with most collaborative papers, we had a core team of analysts, each responsible for their subset of analyses and figure panels. Part of the way through, we realized there were many shared features that we wanted to keep consistent across analyses, particularly variant filtering strategies and annotations. We made the decision to organize the code in a central github repository, which the team members could access and edit as analyses progressed. Today, we are happy to release this code, which you can use to reproduce every figure in the paper!