gnomAD v3.0

Laurent Francioli & Daniel MacArthur

We are thrilled to announce the release of gnomAD v3, a catalog containing 602M SNVs and 105M indels based on the whole-genome sequencing of 71,702 samples mapped to the GRCh38 build of the human reference genome. By increasing the number of whole genomes almost 5-fold from gnomAD v2.1, this release represents a massive leap in analysis power for anyone interested in non-coding regions of the genome or in coding regions poorly captured by exome sequencing. 

In addition, gnomAD v3 adds new diversity – for instance, by almost doubling the number of African-American samples we had in gnomAD v2 (exomes and genomes combined), and also including our first set of allele frequencies for the Amish population.

Here’s the population breakdown in this release:

Population Description Genomes
afr African/African-American 21,042
ami Amish 450
amr Latino/Admixed American 6,835
asj Ashkenazi Jewish 1,662
eas East Asian 1,567
fin Finnish 5,244
nfe Non-Finnish European 32,299
sas South Asian 1,526
oth Other (population not assigned) 1,077
Total 71,702

In this blog post we wanted to lay out how we generated this new data set, specifically:

  • A new method, implemented in Hail, that allows the joint genotyping of tens of thousands of whole genomes
  • The approach we took for quality control and filtering of samples
  • The approach we took for quality control and filtering of variants
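The key to joint-calling tens of thousands of genomes is that the new combiner merges inputs hierarchically rather than all at once. The toy sketch below, in plain Python rather than Hail, illustrates only the tree-structured merge pattern; `merge`, `tree_combine`, and the branch factor are hypothetical stand-ins, not the actual implementation.

```python
# Toy sketch of hierarchical ("tree") combining: inputs are merged in small
# batches, and the intermediate results are merged again, so no single step
# has to touch all samples at once. merge() here just concatenates lists.

def merge(batch):
    out = []
    for item in batch:
        out.extend(item)
    return out

def tree_combine(inputs, branch_factor=4):
    level = [[s] for s in inputs]       # one "gVCF" per sample
    while len(level) > 1:
        level = [merge(level[i:i + branch_factor])
                 for i in range(0, len(level), branch_factor)]
    return level[0]

samples = [f"S{i}" for i in range(100)]
combined = tree_combine(samples)
print(len(combined))   # 100: all input samples survive the hierarchical merge
```

The point of the tree shape is that each merge step only ever sees `branch_factor` inputs, which keeps memory and I/O bounded regardless of cohort size.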

But first we’d like to say thanks to the many, many people who made this release possible. First, the 141 members of the gnomAD consortium – principal investigators from around the world who have generously shared their exome and genome data with the gnomAD team and the scientific community. Their commitment to open and collaborative science is at the core of gnomAD; this resource would not exist without them. And of course, we would like to thank the hundreds of thousands of research participants whose data are included in gnomAD for their dedication to science. 

The Broad Institute’s Genomics Platform sequenced over 80% of the genomes of this gnomAD release. We thank them for this important contribution to gnomAD, and for storage and computing resources essential to the project. The Broad’s Data Sciences Platform also played a crucial role – the Data Gen team for producing the uniform gvcfs required for joint calling, and Eric Banks and Laura Gauthier for their work on the immense callset that forms the core of gnomAD. We also owe an enormous debt of gratitude to the Hail team, especially Chris Vittal, for creating the novel gvcf combiner described below; a callset of this scale would have been impossible without his engineering. 

Konrad Karczewski, Laura Gauthier, Grace Tiao and Mike Wilson made absolutely critical contributions to the quality control and analysis process. In addition, the Broad’s Office of Research Subject Protection was a key contributor to the management of gnomAD’s regulatory compliance.

The amazing gnomAD browser is the creation of Nick Watts and Matt Solomonson, and there was considerable work required for this version to scale to the size of this dataset and move all of our annotations onto GRCh38. And finally, we wanted to thank Jessica Alföldi, who has contributed in myriad ways to the development of this resource, including a critical role in communicating with principal investigators, assembling the sample list, and gathering the permissions required for this release to be possible.


Structural variants in gnomAD

Ryan Collins, Harrison Brand, Daniel MacArthur, and Mike Talkowski

The first gnomAD structural variant (SV) callset is now available via the gnomAD website and integrated directly into the gnomAD Browser.

This initial gnomAD SV callset includes nearly a half-million distinct SVs across seven SV mutational classes and 13 subclasses of complex SVs detected in 14,891 genomes spanning four major global populations. In the publicly released callset and gnomAD browser, you can find site, frequency, and annotation data for ~445k SVs from 10,738 unrelated genomes with appropriate consent to allow the release of this information.

In this post we summarize how we created this new callset, along with some important practical considerations for using it. You can find more details, including callset generation and analyses, in the full gnomAD-SV preprint available on bioRxiv.

Overview of the gnomAD-SV cohort and callset

We jointly analyzed 14,891 genomes, 14,216 (95.5%) of which passed initial data quality thresholds, using a multi-algorithm consensus approach (details below) to construct the initial gnomAD SV callset. These samples were all sequenced with 150bp Illumina reads to an average coverage of 32X, and were aligned to the GRCh37/hg19 reference genome. Like previous gnomAD releases, this cohort was aggregated across various population genetic and complex disease studies, and most (57.4%) of the gnomAD-SV cohort overlaps with the genomes included in the current (v2.1) gnomAD SNV and indel callset (for more details, see the supplementary information for the gnomAD-SV preprint).

We subdivided samples into four continental populations and one “other” category, as follows:

Distribution of samples by population in gnomAD-SV

In the gnomAD Browser and the released VCF, we provide variant frequencies separately for each of the four major populations as well as the “other” category.

As with the SNV & indel callset, the subset of 10,738 genomes represented in the gnomAD browser have been screened to remove first-degree relatives and samples with documented pediatric or severe disease. Thus, the publicly available SV data from these 10,738 genomes represents a relatively diverse collection of unrelated individuals that should have rates of most severe diseases equivalent to, if not lower than, the general population.  

The gnomAD SV callset

In gnomAD, our working definition of SVs includes all genomic rearrangements involving at least 50bp of DNA, which can be categorized into mutational classes based on their breakpoint signature(s) and/or changes in copy number.

Classes of SV considered in gnomAD-SV

The mutational classes of SVs can be subdivided into two groups:

  • Unbalanced SVs are rearrangements that result in gains or losses of genomic DNA (also known as copy number variants, or CNVs)
  • Balanced SVs are rearrangements that do not involve changes in copy number

Additionally, each SV can be described as canonical or complex:

  • Canonical SVs are rearrangements that involve a single distinct breakpoint signature and/or changes in copy number
  • Complex SVs are rearrangements that involve two or more distinct breakpoint signatures and/or changes in copy number
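The taxonomy above can be sketched as a small classifier. This is purely illustrative: the function, its arguments, and the labels are hypothetical, and the real callset assigns classes from breakpoint evidence rather than from pre-computed fields.

```python
# Illustrative sketch (not gnomAD code) of the SV taxonomy described above.

def classify_sv(svlen, n_breakpoint_signatures, copy_number_change):
    """Return (size_ok, balance, complexity) for a candidate SV."""
    size_ok = abs(svlen) >= 50          # gnomAD considers rearrangements >= 50bp
    balance = "unbalanced (CNV)" if copy_number_change != 0 else "balanced"
    complexity = "complex" if n_breakpoint_signatures >= 2 else "canonical"
    return size_ok, balance, complexity

# A 5kb deletion with a single breakpoint signature:
print(classify_sv(-5000, 1, -1))   # (True, 'unbalanced (CNV)', 'canonical')
```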

Across the full gnomAD-SV dataset, we discovered a total of 498,257 distinct SVs appearing in one or more genomes, 382,460 of which (76.8%) were completely resolved and appeared in a subset of 12,549 unrelated genomes used in the formal analyses presented in the gnomAD SV preprint. We categorized SVs into the following mutational classes (note: numbers correspond to the final set of SVs from 12,549 unrelated genomes used in the gnomAD SV preprint):

Summary of variants catalogued in gnomAD-SV by mutational class

A main aim of producing this gnomAD-SV callset was to supplement existing resources by characterizing which genes in the genome tolerate alteration by SVs, and specifically which kinds of rearrangement those genes can tolerate.

We annotated each SV for multiple potential genic effects using an approach described in the gnomAD SV preprint. In brief, this approach considers SV size, class, position, and overlap with exons from canonical transcripts of Gencode v19 protein-coding genes, summarized graphically below for four main categories of SV-gene interactions:

Criteria used in gnomAD-SV for annotating genic consequences of SVs
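As a toy illustration of this interval logic, the sketch below assigns a consequence from SV and gene coordinates. It is a deliberate simplification of the preprint’s criteria: the function, its rules, and the labels are hypothetical, and the real annotation also applies class-specific rules per transcript.

```python
# Simplified sketch of the kind of interval logic used to annotate SV-gene
# consequences; real criteria are described in the gnomAD-SV preprint.

def genic_consequence(sv_start, sv_end, sv_class, gene_start, gene_end, exons):
    overlaps_gene = sv_start < gene_end and sv_end > gene_start
    if not overlaps_gene:
        return "intergenic"
    overlaps_exon = any(sv_start < e_end and sv_end > e_start
                        for e_start, e_end in exons)
    spans_gene = sv_start <= gene_start and sv_end >= gene_end
    if sv_class == "DEL" and overlaps_exon:
        return "predicted loss-of-function"
    if sv_class == "DUP" and spans_gene:
        return "copy gain"
    if overlaps_exon:
        return "other exonic"
    return "intronic"

# A deletion removing an exon of a hypothetical gene spanning 10,000-20,000:
print(genic_consequence(12_000, 13_000, "DEL", 10_000, 20_000,
                        [(12_500, 12_700)]))   # predicted loss-of-function
```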

SVs in the gnomAD Browser

One of the most noteworthy aspects of the gnomAD-SV release is the direct integration of SVs into the gnomAD Browser, an effort led by the talented duo of developers, Nick Watts and Matt Solomonson.

In the near future, we intend to produce a short tutorial video covering these features. For now, you can start using the SV features right away with these two easy steps:

1. Navigate to your favorite gene or region using the search bar:

2. Select the SV dataset by clicking the toggle button in the top-right part of the screen:

Once you’ve activated SV mode in the gnomAD Browser, you can interact with the data in multiple ways, described below.

The variant track displays all SVs in the current view range at their corresponding coordinates, with each SV on its own row.

In the variant track, certain symbols are assigned to specific classes of SV, such as a triangle for insertions (denoting the position at which sequence was inserted), or a lightning bolt for breakpoints of chromosomal translocations or unresolved breakends.

The variant table provides selected metadata for each variant displayed in the variant track, including allele frequencies, consequences, number of homozygotes, coordinates, and size.

All columns in the table are sortable by clicking the column headers.

There are three main options for filtering the variant track & table:

  1. Filter by consequence by selecting any combination of consequences in the top row of checkboxes below the variant track
  2. Filter by SV class by clicking any combination of buttons in the bottom row below the variant track  
  3. Filter by keyword by entering that keyword into the search field

For more details about each of these filters, there are help text pop-ups available by clicking the “?” next to each row of filter buttons.

We provide two options for coloring the SV data: color-by-consequence and color-by-class. You can toggle between these modes using the button directly above the variant track on the right side of the page. The current color key will be reflected by the colors used for the two sets of filter buttons found directly below the variant track.

By default, low-quality variants – such as unresolved breakends – are not displayed in the variant track or table. You can include these variants using the checkbox labeled “Include filtered variants” directly above the variant table.

Both the variant track and table are synchronized: any filter, sort, scroll, or highlight applied to one will synchronously be applied to the other. For example, hovering over a single row in the variant table will draw a dashed box highlighting that SV in the table, and will also highlight that same variant and its position in the track (and vice versa).

To find more information about an individual variant, you can click on the variant ID in the table or on the variant itself in the track to access the variant page. This page features additional details on the variant, including which algorithms and evidence types contributed to its discovery, its allele frequencies across continental populations, and a full list of all predicted genic consequences.

If you’re ever unsure about what a given SV represents or what its functional consequences mean, visit that variant’s page and click on any of the “?” help buttons for more information. This is particularly useful for complex SVs, where each help popup gives a graphical representation of the alternate allele structure.

Multiallelic CNVs, which are sites in the genome that have a variable range of copy numbers across individuals in the population, have an extra feature on their variant pages: the distribution of copy numbers in the gnomAD-SV cohort. While we do not phase these regions or resolve their individual underlying copy-number haplotypes, we provide each individual’s aggregate copy number at the locus, summed across both chromosomes. This is represented as a histogram, with a mouse-over tooltip showing how many individuals are present at each copy state.
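The copy-number histogram shown on these pages can be illustrated in a few lines of Python; the per-individual copy numbers below are invented, not gnomAD data.

```python
from collections import Counter

# Illustrative sketch: summarizing a multiallelic CNV site as a histogram of
# aggregate (summed across both chromosomes) copy numbers per individual.
copy_numbers = [2, 2, 3, 2, 4, 1, 2, 3, 2, 2]   # one total CN per individual
histogram = Counter(copy_numbers)

for cn in sorted(histogram):
    print(f"CN={cn}: {histogram[cn]} individuals")   # e.g. "CN=2: 6 individuals"
```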

Practical considerations when using the gnomAD-SV callset

Below, we list a few important rules-of-thumb to keep in mind when using the gnomAD-SV callset:

  • Breakpoint resolution is variable: the resolution of breakpoints varies by the type(s) of evidence contributing to the call, which are marked in the INFO field of the VCF and displayed in the gnomAD browser. Broadly, this means:
    • Variants with split read evidence are accurate within ~tens of bases
    • Variants with discordant read-pair evidence (but no split read evidence) are accurate within ~tens-to-hundreds of bases
    • Variants with read depth evidence (but no split read or read-pair evidence) are accurate within ~hundreds of bases
  • Other algorithms and technologies may capture variants missed by gnomAD: although short-read WGS is capable of detecting a broad spectrum of SV classes, it has notable limitations compared to long-read WGS and other approaches. Furthermore, while the four algorithms (Manta, DELLY, MELT, and cn.MOPS) capture a significant fraction of SVs accessible to short-read WGS by comparison to most previous studies, they will certainly not capture all SVs (for instance, small [<300bp] duplications are a known limitation of the current gnomAD callset). Primary sequence context is the most important factor to consider when evaluating SVs ascertained by short-read WGS (such as the data used to build gnomAD-SV). For instance, short-read WGS has limited sensitivity to detect SVs in regions of low-complexity and highly repetitive sequence by comparison to technologies like long-read WGS (i.e., PacBio, Nanopore), optical mapping, or linked-read WGS. In practice, we find short-read WGS performs poorest in regions of segmental duplication (also known as low-copy repeats) or simple repeats.
  • Gene consequence annotations are predictions: please note that the gene consequence annotations we generated for gnomAD-SV are predictions, and may not always hold true on a site-by-site basis. Further, we are unable to subdivide our gene consequence annotations for SVs into “low-confidence” and “high-confidence” subsets, such as is performed by LOFTEE for SNVs and indels. Thus, we request that you please carefully inspect each putatively functional SV before interpretation, particularly in the context of its exonic overlap and affected transcripts.
  • Manual review of all variants has not been conducted: given the size of this dataset, it is not feasible to curate each variant call. Thus, it is expected that some SVs included in this release represent false and/or misrepresented variant calls, consistent with the range of error rate estimates reported in the gnomAD-SV preprint (~2-12%; see Supplementary Table 4). If you find a variant that appears to be a false or misrepresented call, please report it.
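The breakpoint-resolution rules of thumb above can be summarized as a small lookup. The evidence codes (SR, PE, RD) follow common SV-caller conventions, and the returned tier strings are illustrative rather than values stored in the gnomAD-SV VCF.

```python
# Rough sketch of the rule-of-thumb tiers above: pick an approximate breakpoint
# uncertainty from the evidence types recorded for a call.

def approx_breakpoint_uncertainty(evidence):
    """evidence: set of strings among {'SR', 'PE', 'RD'} (split reads,
    discordant paired-end reads, read depth)."""
    if "SR" in evidence:
        return "~tens of bases"
    if "PE" in evidence:
        return "~tens to hundreds of bases"
    if "RD" in evidence:
        return "~hundreds of bases"
    return "unknown"

print(approx_breakpoint_uncertainty({"PE", "RD"}))   # ~tens to hundreds of bases
```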

gnomAD SV discovery methods

To discover and genotype SVs in this cohort, we extended a multi-algorithm consensus framework from a previous study. You can find technical details on this approach in the methods of Werling et al. (2018) and in the gnomAD-SV preprint.

In this new iteration of the SV pipeline, now dubbed the gnomAD SV Discovery Pipeline, we have configured all methods to be executed on Google Cloud via the Workflow Description Language (WDL) and the Cromwell Execution Engine. For gnomAD-SV, these methods were deployed on FireCloud (now known as Terra), a user-friendly interface for massively parallel computing and managing large bioinformatics workflows in the cloud.

At a high level, the general approach taken by the gnomAD SV Discovery Pipeline can be summarized by the following five steps:

  • Executes four independent SV detection algorithms (Manta, DELLY, MELT, and cn.MOPS) per sample
  • Integrates raw outputs across algorithms and samples
  • Filters calls based on four classes of direct evidence available from raw WGS data (anomalous paired-end reads, read depth, split reads, and B-allele SNP frequencies)
  • Genotypes all candidate breakpoints across all samples
  • Resolves complex SVs, compound heterozygous sites, and multiallelic CNVs; performs sex-aware allosome genotype adjustment; and applies many additional steps to refine variant calls
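As a toy sketch of the cross-algorithm integration step, the snippet below clusters candidate calls of the same SV class by 50% reciprocal overlap, a common heuristic for matching SV calls between algorithms. The threshold and data layout here are illustrative; the pipeline’s actual clustering criteria are described in Werling et al. (2018) and the gnomAD-SV preprint.

```python
# Toy consensus integration: cluster interval calls by 50% reciprocal overlap.

def reciprocal_overlap(a, b):
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    if end <= start:
        return 0.0
    overlap = end - start
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))

calls = [("manta", 1000, 2000), ("delly", 1050, 2100), ("melt", 9000, 9500)]
clusters = []
for algo, start, end in calls:
    for cluster in clusters:
        if any(reciprocal_overlap((start, end), (s, e)) >= 0.5
               for _, s, e in cluster):
            cluster.append((algo, start, end))
            break
    else:
        clusters.append([(algo, start, end)])

print(len(clusters))   # 2 clusters: one supported by Manta+DELLY, one by MELT
```

Calls supported by multiple algorithms (like the first cluster above) can then be weighted more heavily during downstream filtering.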

We have provided all components of this pipeline as WDLs, which are available via the FireCloud Methods Repository. Additional details on code availability can be found in the gnomAD-SV supplement.

For the SV callset described in the gnomAD-SV preprint and the callset released via the gnomAD Browser, we performed extensive QC, applied a series of additional filters, and conducted multiple benchmarking analyses. You can read more about the process of producing the gnomAD-SV callset in the supplement of the gnomAD-SV preprint.


The creation of gnomAD-SV would not have been possible without the PIs, researchers, and participants of each study that contributed data towards this overall aggregation effort. We would also like to thank the members of the Talkowski Laboratory, the MacArthur Laboratory, the Analytical and Translational Genetics Unit (ATGU) at MGH, and the Broad-SV Group, all of whom provided valuable insight and feedback throughout the process. Specifically, we are grateful to Nick Watts and Matt Solomonson, who developed the new SV features for the gnomAD Browser. Additional acknowledgements, including grants and grant numbers, are listed in the gnomAD-SV preprint.

gnomAD v2.1

Laurent Francioli, Grace Tiao, Konrad Karczewski, Matthew Solomonson and Nick Watts

We are delighted to announce the release of gnomAD v2.1! This new release of gnomAD is based on the same underlying callset as gnomAD v2.0.2, but has the following improvements and new features:

  • An awesome new browser
  • Per-gene loss-of-function constraint
  • Improved sample and variant filtering processes
  • Allele frequencies in sub-continental populations in Europe and East Asia
  • Allele frequencies computed for the following subsets of the data:
    • Controls-only (no cases from common disease case/control studies)
    • Samples not assessed for a neurological phenotype
    • Samples that were not part of a cancer cohort
    • Samples that are not part of the Trans-Omics for Precision Medicine (TOPMed)-BRAVO dataset
  • New annotations for each variant
    • Filtering allele frequency using Poisson 95% and 99% CI, per population
    • Age histogram of heterozygous and homozygous carriers
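To illustrate the idea behind the filtering allele frequency, the sketch below computes the lower bound of a 95% Poisson confidence interval on the observed allele count by binary search, then converts it to a frequency. This is a simplified stand-in for the concept, not gnomAD’s actual implementation.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - sum(math.exp(-lam) * lam**i / math.factorial(i)
                     for i in range(k))

def filtering_allele_frequency(ac, an, alpha=0.05):
    """Lower bound of the (1 - alpha) Poisson CI on AC, as a frequency."""
    if ac == 0:
        return 0.0
    lo, hi = 0.0, float(ac)       # the lower CI bound lies below the MLE
    for _ in range(60):           # binary search for P(X >= ac) = alpha / 2
        mid = (lo + hi) / 2
        if poisson_sf(ac, mid) < alpha / 2:
            lo = mid
        else:
            hi = mid
    return lo / an

faf = filtering_allele_frequency(ac=10, an=20_000)
print(f"{faf:.6f}")   # below the raw allele frequency of 0.0005
```

The practical use is conservative variant filtering: a variant is discounted as too common for a given disease model only when even this lower bound exceeds the disease-specific frequency threshold.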

gnomAD v2.1 comprises a total of 16 million SNVs and 1.2 million indels from 125,748 exomes, and 229 million SNVs and 33 million indels from 15,708 genomes. In addition to the seven populations already present in gnomAD 2.0.2, this release further breaks down the non-Finnish European and East Asian populations into sub-populations. The population breakdown is detailed below.

The MacArthur lab and Rare Disease Group at ASHG 2018

Well, it is time once again for the American Society of Human Genetics meeting – this year held in San Diego, Oct. 16-20. Here’s a guide to the latest science that the MacArthur lab, the Broad Institute Rare Disease Group, and our close affiliates will be presenting at the meeting.

Wednesday 9am: Konrad Karczewski, a post-doc in the MacArthur lab, will present an analysis of the distribution of loss-of-function tolerance across human genes, using the Genome Aggregation Database (gnomAD) data. Konrad will also announce the release of the gnomAD 2.1 dataset, a refined and heavily updated version of gnomAD v2, as well as the release of the new gnomAD browser (Ballroom 20A – upper level)

Wednesday 9:45am: Julia Goodrich, a post-doc in the MacArthur lab, will present an assessment of the penetrance of Mendelian disease variants using data from over 40,000 exomes collected by the AMP-T2D and T2D-GENES consortia (Ballroom 20A – upper level)

Wednesday 2pm: Clara Williamson, a clinical project coordinator in the Rare Disease Group, will present poster #1323 – a description of the Rare Genomes Project – a direct-to-patient study that aims to engage US families affected by rare disease directly in the research process (Exhibit hall – ground level)

Wednesday 3pm: Grace Tiao, a computational biologist in the MacArthur lab, will present poster #1464, which describes the comprehensive quality control pipeline used for the genetic data of the 141,456 individuals of the newly-released gnomAD 2.1 dataset (Exhibit hall – ground level)

Wednesday 6:15pm: Tim Poterba, a software engineer in the Neale lab, will present the open-source Hail framework for scalable genomic analysis, a framework absolutely essential for (among many other things!) the production and analysis of the gnomAD dataset (Room 6D – upper level)


Thursday 9am: Anne O’Donnell-Luria, the associate director of our Rare Disease Group, will present the history, methods and results of the Centers for Mendelian Genomics, a network of American centers for gene discovery in Mendelian disorders (Ballroom 20D – upper level)

Thursday 12pm: Nicky Whiffin, a postdoc in James Ware’s lab and an affiliate of the MacArthur lab, will present an analysis of variation in uORFs (small upstream open reading frames) in the gnomAD genomes dataset (Ballroom 20A – upper level)

Thursday 2pm: Ben Weisburd, a software engineer in the MacArthur lab, will present poster #1411, which describes seqr, a web-based tool for the analysis of rare disease patient genetic data used for all of our diagnoses within the Center for Mendelian Genomics and Rare Genomes Project (Exhibit hall – ground level)

Thursday 3pm: Zaheer Valivullah, a genomic variant analyst in the Rare Disease Group, will present poster #1198, describing new, potentially pathogenic variants implicated in neuromuscular disease (Exhibit hall – ground level)

Thursday 3pm: Kristen Laricchia, an associate computational biologist in the MacArthur lab, will present poster #1402, which describes her analysis of variation in the mitochondrial DNA of the gnomAD genomes dataset (Exhibit hall – ground level)


Friday 2pm: Mike Wilson, an associate computational biologist in the MacArthur lab, will present poster #1607, which describes the computational pipeline the lab uses to assess the quality of exomes and genome sequence data from rare disease families, before it passes into our gene discovery and diagnosis processes (Exhibit hall – ground level)

Friday 2pm: Katherine Chao, a genomic variant analyst in the Rare Disease Group, will present poster #1637, describing the use of GATK’s germline CNV caller to make new diagnoses for rare disease families using exome sequencing data (Exhibit hall – ground level)

Friday 2pm: Qingbo Wang, a graduate student in the MacArthur lab, will present poster #2741, where he describes an analysis of multi-nucleotide variants in the gnomAD 2.1 dataset, as well as in rare disease family genetic data (Exhibit hall – ground level)

Friday 3pm: Matt Solomonson, a software engineer in the MacArthur lab, will present poster #1628, which describes the re-engineering of the gnomAD browser to handle larger data sets and add cool new features (e.g. transcript structure and ClinVar variants), as well as the extension of the browser framework to the display of results from case-control datasets (Exhibit hall – ground level)


Saturday 9am: Beryl Cummings, a graduate student in the MacArthur lab, will present a transcript-level expression metric, which can be used to understand the phenotypic result of variation in differentially-expressed exons – and has turned out to be a powerful way to prioritize variants in a variety of diseases (Room 6C – upper level)

Response to “Proposal to Update Data Management of Genomic Summary Results Under the NIH Genomic Data Sharing Policy”

Executive summary: the NIH is seeking comments on a new proposed policy on genomic data sharing. While there is much to like about the new policy, we are very concerned about the proposed requirement for a click-through agreement on all aggregate genomic resources (which would include heavily-used databases such as ExAC and gnomAD). Our draft response to the Request for Comments is below. If you agree with our concern, please consider replying to the Request for Comments yourself, using the template text at the end of this post if useful.

Draft response to Request for Comments
We would like to applaud the NIH for moving in the right direction with its new “Proposal to Update Data Management of Genomic Summary Results Under the NIH Genomic Data Sharing Policy”. The rapid and open sharing of summary statistics from aggregate genomic data brings tremendous benefit to the scientific community, and the potential harms of such sharing are largely theoretical. Our own experience with the ExAC and gnomAD public variant frequency databases has shown that the benefits to academic, clinical and pharmaceutical scientists from sharing of aggregate data are substantial: The browsers have had over 10 million page views by over 200,000 unique users from 166 countries in the past three years, and have been used by diagnostic laboratories in the analysis of >50,000 rare disease families. Even greater value will arise as a result of broader sharing of aggregate statistics as empowered by the new policy.

However, we are still very concerned by one aspect of the new Genomic Summary Results Data Sharing Policy – the creation of a new tier of access, rapid-access, which requires a click-through agreement to gain access to summary statistics. These concerns can be summarized as follows: (1) Click-throughs make programmatic access to data-sets challenging; (2) they greatly complicate or prevent multiple important types of re-use of the data; and (3) they are highly unlikely to deter anyone with genuine malicious intent. Overall, our position is that click-through agreements are a security fig leaf that gives the impression of extra protection, but actually do no good – and can do non-trivial harm. And we would like to emphasize that ExAC and gnomAD, along with other aggregate data sharing sites such as the Exome Variant Server, do not and never have had click-through agreements, and to the best of our knowledge no harm has ever come to participants as a result.

To explain those points in a bit more detail:

  1. It is critical for summary statistic resources such as gnomAD that we allow access through programmatic interfaces (APIs) so that people can query them using software (e.g. pull out just a targeted slice of the data) and perform automated queries (e.g. pull out the frequency of a specific variant when a user loads a different web page about that variant). Most implementations of click-through agreements will prevent or greatly complicate any form of programmatic access. There are possible technical workarounds, but all of them result in some kind of barrier to programmatic access.
  2. Probably the single biggest obstacle created by click-through agreements is that they prevent or substantially complicate data re-use. Right now anyone can download the complete list of frequencies from gnomAD and load it up in another website, or use it to build other useful web services (the complete ExAC data set has been downloaded thousands of times). With any kind of click-through agreement they either couldn’t do that at all, or would have to incorporate the same agreement in their usage policy, which may be incompatible with their proposed usage.
  3. Most importantly, click-through agreements do nothing to prevent the types of usage that are most likely to be harmful. It is worth noting that ExAC and gnomAD have existed on the web for almost 3 years and have been accessed more than 10 million times without us being aware of a single incident that has any risk of harming participants. The vast majority of users are simply interested in using the data in their research. The theoretical bad actor who is interested in malicious usage is extremely unlikely to be dissuaded by a click-through agreement, nor does the click-through agreement offer any real after-the-fact protection if a malicious actor decides to do harm.

In summary, click-through agreements will degrade or destroy programmatic access and data reuse, without having any meaningful effect on participant safety. Any policy that advocates for click-through agreements as a solution should spell out explicitly exactly what types of misuse the click-through will prevent, and should justify the barriers to data usage that would result.

We believe it would be a mistake to incorporate click-through agreements into any NIH-wide policy. Instead, we suggest that the NIH require clear wording about the responsible use of aggregate data (such as avoiding reidentification) on all websites sharing aggregate genetic data, perhaps with a link on every page, but with no click-through barrier. This would provide a reasonable balance between serving the needs of the research community and protecting the public trust.


Daniel MacArthur.

A request for gnomAD users and supporters
For any member of the ExAC/gnomAD community who agrees that the public sharing of summary statistics is both harmless to participants and of great benefit to science, we urge you to read the new policy proposal here, and to respond to the NIH’s Request for Comment here by October 20th.

Feel free to edit or use the text below:

I am writing as an avid user of the ExAC and gnomAD databases. [Please provide a brief description of your use of these resources and their benefits to your research]

I believe that the new “Proposal to Update Data Management of Genomic Summary Results Under the NIH Genomic Data Sharing Policy” is a step in the right direction – there is no evidence that controlled-access of summary statistics prevents any harm to participants, and the open access to variant frequency data through ExAC and gnomAD has been very important to my research.

However, I am concerned about the proposal to create a new rapid-access category for summary statistics that would require the use of click-through agreements. These agreements make it difficult to reuse summary statistics and to access data programmatically. Most importantly, there is no evidence that they prevent harm to participants. A wide variety of summary statistics have been publicly available without click-through agreements for many years, including ExAC and gnomAD, and no harm of any kind has ever been done to any participant whose data is aggregated in those summary statistics in that time.

I urge the NIH to modify this proposal, and to designate summary statistics as open access, with the exception of communities and populations who believe that they are especially vulnerable to harm from possible reidentification.


[name here]

The genome Aggregation Database (gnomAD)

Konrad Karczewski and Laurent Francioli

Today, we are pleased to announce the formal release of the genome aggregation database (gnomAD). This release comprises two callsets: exome sequence data from 123,136 individuals and whole-genome sequence data from 15,496 individuals. Importantly, in addition to an increased number of individuals in each of the populations in ExAC, we now provide allele frequencies across more than 5,000 Ashkenazi Jewish (ASJ) individuals. The population breakdown is detailed in the table below.

Population Description Genomes Exomes Total
AFR African/African American 4,368 7,652 12,020
AMR Admixed American 419 16,791 17,210
ASJ Ashkenazi Jewish 151 4,925 5,076
EAS East Asian 811 8,624 9,435
FIN Finnish 1,747 11,150 12,897
NFE Non-Finnish European 7,509 55,860 63,369
SAS South Asian 0 15,391 15,391
OTH Other (population not assigned) 491 2,743 3,234
Total 15,496 123,136 138,632

While the data from gnomAD have been available on the beta gnomAD browser for several months (since our announcement at ASHG last October), this release corresponds to a far more carefully filtered data set that is now available through the gnomAD browser, as well as a sites VCF and a Hail sites VDS for download and analysis.

In this blog post, we first describe the major changes from ExAC that will be apparent to users. We then dig into the details of the sample and variant QC done on the gnomAD data. It is important to note that while the alignment and variant calling process was similar to that of ExAC, the sample and variant QC were fairly different. In particular, the vast majority of processing and QC analyses were performed using Hail. This scalable framework was crucial for processing such large datasets in a reasonable amount of time, allowing much more rapid exploration of the data. We’d like to extend a special thanks to the Hail team for their support throughout the process, especially Tim Poterba and Cotton Seed.

Important usage notes

These few notes describe important changes from the ExAC dataset.

Changes to the browser

The gnomAD browser is very similar to the ExAC browser, with a few modifications to support the integration of genome data. The coverage plot now has a green line displaying genome coverage. In the variant table, a source column indicates whether the variant belongs to the exome callset, the genome callset, or both. Variants present in both datasets have combined summary statistics (allele count, allele frequency, etc.). We have added checkboxes for specifying which data to display on the page: users can choose to include or exclude SNPs, indels, exome variants, genome variants, or filtered (non-PASS) variants, and the table statistics update accordingly. If filtered variants are included, their data source icons are shown with reduced opacity and a dotted border to indicate lower reliability, and a variant’s filter status (e.g. AC0, RF) can be seen by hovering over the icon. Use this feature with caution: including filtered variants adds some poor-quality variants (and their summary statistics) to the table.


On the variant page, the population frequency table has been modified such that exome and genome population statistics can be included or excluded. By default, filtered variants are not included in the population table but can be added by clicking the checkbox.

There is a new “report variant” button on each variant page. Clicking this button opens a submission form where users can express concerns about a variant (such as low quality or clinical implausibility). Users can also request more information about variant carriers. Generally phenotype data is not available for samples in gnomAD as many of the individuals who have contributed data to gnomAD were not fully consented for phenotype data sharing. In the future, it may become available for a small subset of samples with appropriate consents. This is an experimental feature. Due to limited personnel, the gnomAD team may not be able to respond to your submission.

Finally, read data can now be viewed for both exome and genome data (restricted to exome calling intervals only), with up to 3 IGV.js tracks provided for each sample type (heterozygous, homozygous, and hemizygous). If there are more than 3 individuals for one of these sample types, the 3 individuals shown are those that have the highest GQ.

Allele count default is high quality genotypes only

We have changed the default allele count to include only genotypes passing our adjusted threshold (GQ >= 20, DP >= 10, and, newly added for this release, allele balance > 0.2 for heterozygous genotypes); this count was previously provided as AC_Adj. The raw allele counts are now available in AC_raw. All sites with no high-quality genotype (AC = 0) are marked as filtered using the AC0 filter.
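
As an illustration, the adjusted-threshold logic can be sketched in Python; the function name and arguments are ours, not from the gnomAD pipeline:

```python
def is_adj(gq, dp, alt_depth=None, is_het=False):
    """High-quality ("adj") genotype check: GQ >= 20, DP >= 10,
    and allele balance > 0.2 for heterozygous genotypes."""
    if gq < 20 or dp < 10:
        return False
    if is_het:
        # Allele balance: fraction of reads supporting the alt allele.
        return alt_depth / dp > 0.2
    return True
```

The default AC then counts only genotypes for which this check passes; AC_raw counts all of them.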

No chromosome Y in genomes

There is no chromosome Y coverage or calls in the gnomAD genomes in this release.

Changes in the VCF

By and large, the release VCFs look and feel similar to the ExAC v1 VCF. Exomes and genomes are distributed separately. We have removed the Het fields, which were hard to parse, and instead provide genotype counts (GC): a G-indexed array of counts of individuals per genotype. As mentioned in the Variant QC section, we have fundamentally switched from a site-level filtering strategy to an allele-specific filtering strategy. For this reason, we now have a field named AS_FilterStatus in the INFO column that specifies the filtering status for each allele separately. If any allele is PASS, then the entire site is PASS (unless it is in an LCR or SEGDUP region, or fails the InbreedingCoeff filter). Finally, we provide a Hail VDS for users wishing to perform analyses in Hail.
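
For users parsing the INFO column themselves, a minimal sketch of extracting a per-allele filter status might look like this. We assume a comma-delimited entry per alt allele, with multiple filters for one allele separated by "|"; check the release VCF header for the authoritative encoding:

```python
def parse_as_filter_status(info):
    """Parse a hypothetical AS_FilterStatus entry from a VCF INFO string
    into a list of filter lists, one per alt allele."""
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    raw = fields.get("AS_FilterStatus", "")
    return [entry.split("|") for entry in raw.split(",")] if raw else []
```

For example, an INFO string of `AC=5;AS_FilterStatus=PASS,RF|AC0` would yield one passing allele and one allele failing both the RF and AC0 filters.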

Alignment and variant QC

Alignment and variant calling were performed on exomes and genomes separately using a standardized BWA/Picard pipeline against human genome build 37, followed by joint variant calling across each callset using GATK HaplotypeCaller (v3.4). Up to this point, the pipeline is very similar to the ExAC pipeline; however, after the variant calling step there are a number of differences worth noting here.

Sample QC process

The sample filtering process was similar to that of ExAC: we first excluded samples based on sequencing and raw variant call quality metrics, then removed related individuals so as to keep only samples more distant than 2nd degree relatives. From this set, we assigned ancestry, and finally excluded samples based on raw variant call QC metrics stratified by sequencing platform and ancestry. These steps are detailed below.

Hard filtering

Sequencing and alignment quality metrics were computed using Picard tools and raw quality metrics were computed using GATK, PLINK, and Hail. We have excluded samples based on the following thresholds:

  • Both exomes and genomes
    • High contamination: freemix > 5%
    • Excess chimeric reads: > 5% chimeric reads
    • Excess heterozygosity: F < -0.05
  • Exomes only
    • No coverage: a set of common test SNPs on chromosome 20 had no coverage
    • Some TCGA samples (e.g. known tumor) were removed
    • Ambiguous sex: fell outside of:
      • Male: Proportion of heterozygous sites on chromosome X (non-PAR regions) < 0.2 and chromosome Y coverage > 0.1x
      • Female:  Proportion of heterozygous sites on chromosome X (non-PAR regions) > 0.2 and chromosome Y coverage < 0.1x
  • Genomes only
    • Low depth: Mean depth < 15x
    • Small insert size: Median insert size < 250bp
    • Ambiguous sex: fell outside of:
      • Male: chromosome X F-stat > 0.8
      • Female: chromosome X F-stat < 0.4
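
The shared and genome-specific checks above can be sketched as a simple function; the field names are made up for illustration and stand in for the Picard/GATK metrics (sex checks are omitted since they differ between exomes and genomes):

```python
def hard_filter_fails(sample, data_type):
    """Return the list of hard filters a sample fails.
    `sample` is a dict of QC metrics; `data_type` is "exome" or "genome"."""
    fails = []
    if sample["freemix"] > 0.05:             # contamination estimate
        fails.append("contamination")
    if sample["pct_chimeras"] > 0.05:        # chimeric read fraction
        fails.append("chimeric_reads")
    if sample["f_inbreeding"] < -0.05:       # excess heterozygosity
        fails.append("excess_heterozygosity")
    if data_type == "genome":
        if sample["mean_depth"] < 15:
            fails.append("low_depth")
        if sample["median_insert_size"] < 250:
            fails.append("small_insert_size")
    return fails
```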

Relatedness filtering

We used KING to infer relatedness amongst individuals that passed the hard-filter thresholds. KING was run on all samples (exomes and genomes) together on a set of well-behaved sites selected as follows:

  • Autosomal, bi-allelic SNVs only (excluding CpG sites)
  • Call rate across all samples > 99%
  • Allele frequency > 0.1%
  • LD-pruned to r2 = 0.1
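
The site-selection criteria (excluding the separate LD-pruning step) can be written as a simple predicate; the field names here are illustrative, not from the pipeline:

```python
def good_site(site):
    """Hypothetical check for a well-behaved site: autosomal bi-allelic
    non-CpG SNV with call rate > 99% and allele frequency > 0.1%."""
    return (
        site["autosomal"]
        and site["type"] == "snv"
        and site["biallelic"]
        and not site["cpg"]
        and site["call_rate"] > 0.99
        and site["allele_freq"] > 0.001
    )
```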

We then selected the largest set of samples such that no two individuals are 2nd degree relatives or closer. When choosing which of two related samples to keep, we used the following criteria:

  • Genomes had priority over exomes
  • Within genomes
    • PCR free had priority over PCR+
    • In a parent/child pair, the parent was kept
    • Ties broken by the sample with highest mean coverage
  • Within exomes
    • Exomes sequenced at the Broad Institute had priority over others
    • More recently sequenced samples were given priority (when information available)
    • In a parent/child pair, the parent was kept
    • Ties broken by the sample with the higher percent bases above 20x
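
One way to encode these priorities is as a sort key, so that the preferred member of a related pair is simply the maximum. This is a simplified sketch, not the exact pipeline logic (for instance, the exome-specific tie-breakers are collapsed into a single coverage score):

```python
def keep_priority(sample):
    """Hypothetical ranking key: tuples compare element by element, so
    earlier fields dominate later ones."""
    return (
        sample["data_type"] == "genome",   # genomes beat exomes
        sample.get("pcr_free", False),     # PCR-free beats PCR+
        sample.get("is_parent", False),    # parents beat children
        sample.get("coverage_score", 0.0), # final tie-breaker
    )
```

Within each related pair, the sample with the higher key is kept.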

Ancestry assignment

We computed principal components (PCs) on the same well-behaved bi-allelic autosomal SNVs described above, across all unrelated samples (both exomes and genomes). We then leveraged a set of 53,044 samples of known ancestry to train a random forests classifier using the PCs as features. We assigned each sample the ancestry predicted by the random forest model when the probability of that ancestry was > 90%. All other samples were assigned the other ancestry (OTH). In addition, the 34 South Asian (SAS) samples among the genomes were assigned to OTH as well, due to their low number.
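
The assignment rule itself is simple. A sketch, taking one sample's class probabilities as produced by the trained classifier:

```python
def assign_ancestry(probs, threshold=0.9):
    """Return the most probable population if its probability exceeds
    the threshold, otherwise the catch-all "oth" label."""
    pop = max(probs, key=probs.get)
    return pop if probs[pop] > threshold else "oth"
```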

Stratified QC filtering

Our final sample QC step was to exclude any sample falling more than 4 median absolute deviations (MADs) from the median of any of the following metrics:

  • number of deletions
  • number of insertions
  • number of SNVs
  • insertion : deletion ratio
  • transition : transversion (TiTv) ratio
  • heterozygous : homozygous ratio
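
The 4-MAD rule for a single metric within one population/platform stratum can be sketched in a few lines of dependency-free Python:

```python
def _median(xs):
    s = sorted(xs)
    m = len(s) // 2
    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2

def mad_outliers(values, n_mads=4):
    """Return the values lying more than n_mads median absolute
    deviations (MADs) from the median."""
    med = _median(values)
    mad = _median([abs(v - med) for v in values])
    return [v for v in values if abs(v - med) > n_mads * mad]
```

In practice this is applied per metric, per population/platform group, and a sample failing in any metric is excluded.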

Because these metrics are sensitive to population and sequencing platform, the distributions were computed separately for each population/platform group. The populations used are those described above. The platforms were:

  • Genome, PCR free
  • Genome, PCR+
  • Exome Agilent
  • Exome ICE
  • Exome ICE, 150bp read-length
  • Exome, Nimblegen
  • Exome, unknown capture kit (for these, the samples were removed if their metrics fell outside of the extreme bounds of all other exome datasets)

In some cases, certain projects had individuals that were sequenced on multiple platforms. To disentangle the platforms used, we calculated the missingness rate per individual per capture platform and performed principal components analysis on these values. The first 3 PCs displayed signatures of capture platform (PC1 distinguished the Agilent platform from Illumina, PC2 distinguished samples on the Nimblegen platform, and PC3 separated 76 bp reads from 150 bp reads), so these PCs were used in a decision-tree model to assign a platform to samples from projects with multiple platforms.

Variant QC

For the variant QC process, we used the final set of high quality release samples as well as additional family members forming trios (1,529 trios in exomes, 412 in genomes) that were filtered at the relatedness phase, in order to compute metrics such as transmission and Mendelian violations. Variant QC was performed on the exomes and genomes separately but using the same pipeline.

The variant QC process departed significantly from the ExAC pipeline: this time around, we decided (after several weeks of assessment) not to use the standard GATK Variant Quality Score Recalibration (VQSR) approach for filtering variants, which turns out not to perform especially well once sample sizes exceed 100,000. Instead, we developed a combination of a random forest classifier and hard filters, described below, which performs much better for the very rare variants uncovered by gnomAD.

Random forests classifier

One of the major limitations of the ExAC dataset (and, indeed, virtually all other large-scale call sets) was the fact that the VQSR filter was applied at the site (locus) level rather than the variant (allele) level. For instance, if a (say, artifactual) indel was found at the same site as a common SNP, the quality metrics would be combined. In this case, the quality metrics for the indel might be artificially inflated (or decreased for the SNP). This issue has become increasingly problematic with larger numbers of samples, and thus a higher density of variation – in the gnomAD dataset, 22.3% of the variants in exomes and 12.8% of the variants in genomes occur at multi-allelic sites. To address these issues, we used an allele-specific random forest classifier, implemented in Hail / PySpark, trained with 500 trees of maximum depth 5. The features and labels for the classifier are described below.

As the filtering process was performed on exomes and genomes separately, users will notice that for some variants, we have 2 filter statuses which may be discordant in some cases. In particular, there are 144,941 variants filtered out in exomes, but passing filters in genomes and 290,254 variants for the reverse. We’ll be investigating these variants in future analyses, but for now users should just treat them with caution.

Random forests features

While our variant call VCF only contained site-level metrics, Hail enabled us to compute genotype-driven allele-specific metrics. The table below summarizes the features used in the random forests model, along with their respective weights in the model generated.

Feature Description Site / Allele Specific Genomes Feature Weight Exomes Feature Weight
Allele type SNV, Indel, complex Allele 0.00123 0.0025
DP_MEDIAN Median depth across variant carriers Allele 0.0449 0.0513
GQ_MEDIAN Median genotype quality across variant carriers Allele 0.1459 0.1520
DREF_MEDIAN Median reference dosage across variant carriers Allele 0.4154 0.3327
AB_MEDIAN Median allele balance among heterozygous carriers Allele 0.2629 0.3559
Number of alleles Total number of alleles present at the site Site 0.0037 0.0031
Mixed site Whether more than one allele type is present at the site Site 0.0003 0.0001
Overlapping spanning deletion Whether there is a spanning deletion (STAR_AC > 0) that overlaps the site Site 0.003 0.0004
InbreedingCoeff Inbreeding coefficient Site* 0.0116 0.0369
MQRankSum Z-score from Wilcoxon rank sum test of Alt vs. Ref read mapping qualities Site* 0.0054 0.0067
ReadPosRankSum Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias Site* 0.0097 0.0101
SOR Symmetric Odds Ratio of 2×2 contingency table to detect strand bias Site* 0.0958 0.0482

*: These features should ideally be computed for each allele separately but they were generated for each site in our callset and it was impractical to re-generate them for each allele.

Note that since random forests do not tolerate missing data, we naively imputed all missing values using the median value for that feature. In particular, this was done for metrics that are only calculated among heterozygous individuals (MQRankSum, ReadPosRankSum, allele balance), while the handful of variants missing other features were simply discarded.
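
A minimal sketch of this median imputation for one feature, where rows are dicts and None marks a missing value:

```python
def impute_median(rows, feature):
    """Replace missing (None) values of one feature with the median of
    the observed values; returns a new list of rows."""
    observed = sorted(r[feature] for r in rows if r[feature] is not None)
    m = len(observed) // 2
    med = observed[m] if len(observed) % 2 else (observed[m - 1] + observed[m]) / 2
    return [dict(r, **{feature: med}) if r[feature] is None else r for r in rows]
```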

Random forests training examples

Our true positive set was composed of: SNVs present on the Omni2.5 genotyping array and found in 1000 Genomes (available in the GATK bundle), indels present in the Mills and Devine indel dataset (available in the GATK bundle), and transmitted singletons (variants found in two and only two individuals, which were a parent-offspring pair). Our false positive set comprised variants failing traditional GATK hard filters (QD < 2 || FS > 60 || MQ < 30).
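
A sketch of the resulting labeling scheme, using hypothetical field names; variants matching neither rule are left unlabeled and do not contribute to training:

```python
def training_label(v):
    """Label a variant for random forest training: "TP" for known-good
    sets and transmitted singletons, "FP" for GATK hard-filter failures."""
    if v.get("in_omni") or v.get("in_mills") or v.get("transmitted_singleton"):
        return "TP"
    if v["QD"] < 2 or v["FS"] > 60 or v["MQ"] < 30:
        return "FP"
    return None
```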

Random forests filtering threshold

In order to set a threshold for the PASS / RF filter in the release, we used the following metrics to determine a cutoff on the random forests model output, to build what we believed to be a high quality set of variants:

  • Precision / recall against two well characterized samples:
    • NA12878 from genome in a bottle
    • CHM1_CHM13: a mixture of DNA (est. 50.7% / 49.3%) from two haploid CHM cell lines, deep sequenced with PacBio and de novo assembled, available here. Note that indels of size 1 were removed because of PacBio-specific problems.
  • Number of singleton Mendelian violations in the trios (variants with allele count 1 across all samples, where the single copy is found in a child)
  • Singleton transition : transversion (TiTv) ratio for SNVs
  • Singleton insertion : deletion ratio for indels

For exomes, our filtration process removes 18.5% of SNVs (RF probability >= 0.1) and 22.8% of indels (RF probability >= 0.2). For genomes, we filtered 11.0% of SNVs (RF probability >= 0.4) and 30.9% of indels (RF probability >= 0.4).
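
Reading the thresholds above as the RF probability at or above which a variant is removed, the per-callset cutoffs can be encoded as a small lookup table (names are ours, for illustration):

```python
# RF probability at or above which a variant is filtered, per callset
# and variant type, following the cutoffs quoted in the text.
RF_CUTOFFS = {
    ("exome", "snv"): 0.1,
    ("exome", "indel"): 0.2,
    ("genome", "snv"): 0.4,
    ("genome", "indel"): 0.4,
}

def rf_filtered(data_type, variant_type, rf_prob):
    return rf_prob >= RF_CUTOFFS[(data_type, variant_type)]
```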

Hard filters

In addition to the random forests filter described above, we also filtered all variants falling in low-complexity (LCR filter) and segmental duplication (SEGDUP filter) regions, as well as variants with a low inbreeding coefficient (< -0.3; InbreedingCoeff filter). Finally, we filtered all variants where no individual had a high-quality non-reference genotype (GQ >= 20, DP >= 10, AB > 0.2 for heterozygotes; AC0 filter).
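
A sketch of these site-level hard filters, with illustrative field names; a variant carrying any of these labels is filtered regardless of its random forest score:

```python
def variant_hard_filters(v):
    """Return the list of site-level hard filters a variant fails:
    LCR, SEGDUP, InbreedingCoeff (< -0.3), and AC0 (no high-quality
    non-reference genotype)."""
    fails = []
    if v["in_lcr"]:
        fails.append("LCR")
    if v["in_segdup"]:
        fails.append("SEGDUP")
    if v["inbreeding_coeff"] < -0.3:
        fails.append("InbreedingCoeff")
    if v["ac_adj"] == 0:
        fails.append("AC0")
    return fails
```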


Any data set of this scale requires a huge team to make it possible. We want to thank a few specific people here, while also acknowledging the many, many people behind the scenes at the Broad Institute who built the infrastructure and resources to enable this work.

Massive thanks, as always, to everyone who allowed their data to be used in this project. The gnomAD consortium now includes more than 100 principal investigators, who are listed here. We’re grateful to all of them for allowing their data to be used for this project.

Jessica Alföldi played a critical role in assembling the sample list and gathering the permissions required to ensure that each and every sample in the project was authorized to be included – a seriously daunting task! Fengmei Zhao contributed to the assembly of the underlying data, and Kristen Laricchia and Monkol Lek played important roles in sample and variant quality control.

As described above, we’ve made substantial changes to the browser for the gnomAD release (with more cool stuff coming soon!) – this is almost entirely thanks to Matthew Solomonson, who’s done a brilliant job on short timelines to make the system work. Ben Weisburd also contributed, and generated all of the read visualizations you can see on the variant pages.

The pipelines and infrastructure required to align and call variants, and the actual generation of the massive exome and genome call sets, are entirely the work of the heroic men and women of the Broad’s Data Sciences Platform. Many people deserve thanks here, but we wanted to single out Eric Banks, Kathleen Tibbetts, Charlotte Tolonen, Yossi Farjoun, Laura Gauthier, Dave Shiga, Jose Soto, George Grant, Sam Novod, Valentin Ruano-Rubio, and Bradley Taylor.

To complete quality control (many, many times!) on a call set of this scale in a timely fashion we needed serious computational grunt. We benefited enormously from the availability of Hail, an open-source Spark-based framework for genomic analysis – put simply, we would have been screwed without it. Massive thanks to the Hail team for building such a tremendous resource, as well as for their incredible responsiveness to each new bug report and feature request, especially Tim Poterba and Cotton Seed. We also needed good hardware to run the analysis on: over the course of this work we moved back and forth between a Cray Urika and the Google Cloud. Thanks to the Cray team for their hard work whenever the machine went down and we needed it back up fast.

And finally, special thanks to the Broad Institute Genomics Platform as a whole, who generated 90% of the data used in gnomAD, and also provided critical computing resources required for this work.

Announcing the Exome Aggregation Consortium paper

Today we are celebrating the official publication of the Exome Aggregation Consortium (ExAC) paper in Nature – marking the end of a phase in this project that has involved most of the members of my lab (and many, many others beyond) for a large chunk of the last few years. This official publication is an opportune time to reflect on how ExAC came to be, and the impact it’s had on us and the wider community.

First, some background

Exome sequencing is a very cost-effective approach that allows us to look with high resolution at just the 1-2% of the human genome that codes for protein – these are the parts we understand the best, and also the parts where the vast majority of severe disease-causing mutations are found. Because exome sequencing is so powerful it’s been applied to tens of thousands of patients with rare, severe diseases such as muscular dystrophy and epilepsy. However, a key challenge when sequencing patients is that everyone carries tens of thousands of genetic changes, and we need a database of “normal” variation that tells us which of those changes are seen in healthy people, and how common they are.

The goal of this project was to create such a database, at massive scale. Briefly, we pulled together DNA sequencing data from the largest collection of humans ever assembled – more than 60,000 people in total, from more than two dozen different disease projects, generously donated by researchers from around the world – to look at the distribution of genetic variation across the exome, and to make it publicly available for anyone to use. This project became known as the Exome Aggregation Consortium, or ExAC.

This project involved assembling nearly a petabyte (a thousand terabytes) of raw sequencing data and putting all of it through the same processing pipeline to produce a set of variant calls that were consistent across the whole project. At the end of that process we produced a summary file that is basically a list of all 10 million variants we discovered in the project and how common they are across different populations – and made it publicly available back in 2014. That resource has now received over 5.2 million page views from researchers around the world, mostly for the interpretation of genetic changes found in rare disease patients.

In this paper, we describe how this resource can be used in a variety of ways to understand human genetic variation, particularly focusing on the very, very rare DNA changes – those seen in less than one in a thousand people – most of which had never been observed until this project. We show that our new database is much more powerful than existing resources for filtering the variants found in rare disease patients, making it a much more effective tool for clinical labs to use in diagnosis. And importantly, we also show that we are able to use this database to identify genes that have fewer variants than we would expect to see by chance – basically, identifying genes that are intolerant of variation in the general population, meaning they are much more likely to be involved in causing disease.

I’m not going to describe these findings in detail, because the paper is open access and you can read it yourself – there is also a wonderful summary in Nature by Jay Shendure. However, I did want to highlight a few key things that you can’t learn from the paper, including the story of how this project actually happened.

How ExAC came about

In 2012 my (brand new) lab started sequencing exomes from patients with rare muscle diseases. Two things we found almost immediately were that (1) we desperately needed to interpret our variants in the context of large collections of variation from the “normal” population, and (2) the existing resources simply weren’t sufficient for this job. Both 1000 Genomes and the Exome Variant Server were fantastic resources, but neither was large enough to provide insight into the extremely rare variants we were seeing in our patients.

At around the same time, a few key things fell into place. Firstly, Mark DePristo’s GATK team at the Broad Institute had fancy new software that could, at least in theory, allow the generation of unified variant calls across tens of thousands of samples, and they were keen to try it out at scale. Secondly, it became clear that there was a critical mass of colleagues at the Broad Institute who had in combination sequenced over 20,000 exomes, and who were willing to make that data available for a large shared effort. And finally, I had been joined in my lab by a remarkably talented postdoc, Monkol Lek, who had the informatic background required to coordinate and analyze a large-scale sequencing resource. And so we set about building one.

Over the next 18 months, we worked closely with the Broad’s sequence production team to produce at least five call sets, starting with a pilot run of just over 13,000 samples. In each case we encountered either intractable computational scaling problems, or quality control issues in the final call set. Over the same time the number of samples continued to grow, and we became steadily more ambitious with each failure. In late 2013 we made an attempt at variant calling across nearly 57,000 samples, which took many months; depressingly, it produced a product with unusably high error rates. In early 2014 I was genuinely unsure whether this project would ever be successful.

We were ultimately saved by some truly remarkable engineering work on the Broad’s production variant-calling pipeline, spearheaded by a team guided by Tim Fennell and Eric Banks. Out of the ashes of the failed 57K call set rose a new, much larger effort, using a new pipeline (for aficionados, this marked the switch from UnifiedGenotyper to HaplotypeCaller) – and to our delight, this new tool proved both enormously faster and much more accurate than its predecessor. In June 2014 we suddenly had a call set that passed every quality control test we threw at it. And while the technical delays had been agonizing, that 18-month cloud turned out to have a massive silver lining: over this period of delays the number of exomes available to us had grown, staggeringly, to more than 90,000 samples. This June call set, after removing nearly a third of samples (for good reasons, outlined below), became what is now the public release of ExAC.

The awesome consortium

It’s worth being very explicit about this: ExAC didn’t sequence any exomes. The resource only exists because two dozen investigators from more than 20 consortia (see the full list here) chose to make their raw data freely available to create a resource for the benefit of the wider community. I am personally still absolutely astonished that so many of our colleagues were not only willing, but actively enthusiastic about making their data available to the world, and I’m forever indebted to them for this.

I won’t name every single one of these wonderful individuals here (again, here’s the full list) but I did want to single out a few people who really went above and beyond to support the resource, even when it was barely more than an idea, especially David Altshuler, Mark Daly, Sek Kathiresan, Mark McCarthy and Pat Sullivan. Every one of these people stood up for ExAC multiple times. As a junior faculty member proposing a project that was basically insanely ambitious, I have been incredibly lucky to have support from such a group.

There are many other awesome people in this story. We’ll get back to them later.

Going public

So, in June 2014 we had the biggest collection of exome sequencing data that had ever been assembled. One thing that was never in question was that we wanted to get this data set out to the community as quickly as possible – that was the whole purpose of ExAC – and so we set ourselves the task of preparing for a public launch at the American Society of Human Genetics meeting in October, where I was scheduled to give a talk. Over the next couple of months, increasingly stringent QC run by Monkol and other members of the team convinced us that this was a call set of excellent quality, so everything was in place for an October announcement.

But we didn’t simply want to dump a raw text file of sequence variants on the web – we needed to make this resource available in a way that anyone could use without needing bioinformatics expertise. And so I asked another postdoc in the lab, Konrad Karczewski, if he could put his considerable web development experience to work in building such a website.

And so, in the months leading up to ASHG, the now-iconic ExAC browser took shape. This process reached its climax at approximately 3am on October 18th, the morning of my talk, in a San Diego AirBnB, where Konrad, Brett Thomas and Ben Weisburd worked to rescue the website from a catastrophic database failure while I sat nearby frantically editing my slides and preparing for the possibility of a completely non-functional website. Magically, everything worked at the last minute – and the site has now been stable for almost two years and 5.2 million page views, testament to Konrad’s development of a resource that was both robust and user-friendly. Last year, Ben Weisburd added a new feature that allowed users to view the raw read support for individual variants – which profoundly changed my experience of the data set, and has been very popular with users.

Analysis time

Once the chaos of the ASHG launch was over, we got back to work on actually making sense of this massive data set. The next year was incredibly fun, as we crawled through the intricacies of the data set and figured out what we could learn from it about human variation, biology and disease.

There are few greater joys as a PI than watching a team of fantastic young scientists work together on an exciting project – and this project was intensely collaborative, involving most of the members of my lab as well as many others. Monkol of course played a key role in many analyses. Konrad Karczewski led the analyses on mutational recurrence and loss-of-function variants. Eric Minikel led much of the work surrounding the clinical interpretation of ExAC variants, with strong support from Anne O’Donnell Luria and James Ware. And Kaitlin Samocha, a ludicrously talented graduate student from Mark Daly’s lab, drove the analysis of loss-of-function constraint that ended up becoming a headline result for the paper. Along the way, many other trainees and scientists contributed in ways that strengthened the project. Tens of thousands of messages were posted in our lab’s #exac_papers Slack channel, including dozens of versions of every figure that ended up in the paper; and slowly, magically, an unstructured Google Doc morphed into a real scientific manuscript.

(Other papers came together, too. We published Eric’s analysis of variation in the PRNP gene back in February – and today, simultaneous with the main ExAC paper, ExAC collaborators have published work showing the discovery of copy number variants across the ExAC individuals as well as the use of ExAC to understand the impact of variants in severe heart muscle disease.)

Open everything

One of the things I’m proudest of about ExAC is the way we systematically set about making our data, tools and results available to the public as quickly and easily as possible. The data set was pushed out to the public, with no access restrictions, through Konrad’s browser basically as soon as we had a version we felt we could trust. All of the software we used to produce the ExAC paper figures is freely available, thanks to the hard work and dedication of Konrad Karczewski, Eric Minikel and others in cleaning up the code and preparing it for release.

We also made our analysis results available as a preprint back in November last year – and the final Nature paper, with many improvements thanks to reviewers and a wonderful editor, is fully open access.

The benefit of this approach, beyond the warm fuzzy feeling of releasing stuff that other people can use, is that we got rapid feedback from the community – this was extremely useful in spotting glitches in the data, as well as errors in the manuscript preprint. We intend to continue with this open approach for the project, and we hope this can also serve as a model for other consortia.

What comes next?

The work of ExAC is far from over. Later this year we’ll be announcing the release of the second version of this resource, which we hope will exceed 120,000 exomes. We’ll also be releasing a whole-genome version of the resource that provides insight into variants outside protein-coding regions. And most importantly, the science will continue: my group, our collaborators, and researchers and clinicians around the world will continue to use this resource to diagnose rare disease patients, to understand the distribution of variation across the human population, and to explore human biology and disease.

Special thanks

I’ve thanked a lot of people in the post above, which is only appropriate, but there are many more who deserve it. Huge numbers of people were involved in putting this resource together.

One important thing to note about ExAC is that it was effectively unfunded for most of its existence. During this time, the immense costs of data storage and computing were covered (invisibly) by institutional support from the Broad, as were many of the personnel who contributed to the project. Many people in the team now known as the Data Sciences Platform at the Broad were involved in building the pipeline to make this call set possible, running it (over and over again!) on enormous numbers of samples, and helping us to make sense of the results – especially including Amy Levy-Moonshine, Khalid Shakir, Ryan Poplin, Laura Gauthier, Valentin Ruano-Rubio, Yossi Farjoun, Kathleen Tibbetts, Charlotte Tolonen, Tim Fennell, and Eric Banks. The Broad’s Genomics Platform generated 90% of the sequencing data in ExAC, and helped in many ways to make the project possible – huge thanks to Stacey Gabriel for her constant support. Namrata Gupta and Christine Stevens played major roles in assembling massive lists of sequenced individuals, and seeking data usage permissions. Andrea Saltzmann and Stacey Donnelly were our ethics gurus, who helped us get the IRB permissions needed for this resource to go live.

In September 2014 we received a U54 grant from NIDDK that partially defrays the ongoing cost of the project, but we are still absolutely dependent on the amazing and ongoing support of this Broad team. The sheer amount of resources and time donated to this effort is testament to the Broad’s genuine commitment to producing resources that help other people; I couldn’t be prouder to be part of this organization today.

And finally, I wanted to give special thanks to two incredibly patient spouses – my wife Ilana and Monkol’s wife Angela – for tolerating our prolonged absences and frequent distraction over the last few years. We appreciate it more than we say.

Reproduce all the figures!: a user’s guide to ExAC, part 2

As we were organizing analyses for the ExAC flagship paper, we were inspired by Titus Brown’s manifesto on reproducibility. As with most collaborative papers, we had a core team of analysts, each responsible for their subset of analyses and figure panels. Part of the way through, we realized there were many shared features that we wanted to keep consistent across analyses, particularly variant filtering strategies and annotations. We made the decision to organize the code in a central github repository, which the team members could access and edit as analyses progressed. Today, we are happy to release this code, which you can use to reproduce every figure in the paper!


How to use

The code and miscellaneous data are available on GitHub; clone the repository and move into it:

git clone
cd exac_2015

Then, start R and run the following. Warning: this process requires a decent amount of memory, ideally a machine with at least 16 GB of RAM (newer MacBook Pros are fine, but not MacBook Airs). On Linux and other machines with stricter memory policies, we recommend 32 GB or even 64 GB to allow for overhead during loading.

exac = load_exac_data()
## [1] "Using locally loaded exac file. Run exac = load_exac_data(reload=T) to update to newest file."
## [1] "Step (1/6): Loading data..."
## [1] "Now processing data (2/6): Determining sequence contexts..."
## [1] "(3/6): Computing allele frequencies..."
## [1] "(4/6): Calculating bad regions of the genome..."
## [1] "(5/6): Determining what to _use_..."
## [1] "(6/6): Parsing SIFT and PolyPhen scores, and separating functional categories..."
## [1] "Done! Here you go. Took:"
##    user  system elapsed 
## 857.437  17.924 878.059

This will take about 10-15 minutes to load, plus a ~800M download the first time you run it. exac is then a data frame of all ~10M variants in the dataset, with over 100 annotations per variant. Each variant is its own row, as opposed to a typical VCF, where multi-allelic variants are combined into a single line. Note that these are unfiltered variant calls: make sure to filter to either PASS sites or, better yet, our criteria for high-quality calls (given by the column use; see below for details):

# PASS-only sites
filtered_calls = subset(exac, filter == 'PASS')

# or, better yet, the high-quality subset flagged by the `use` column
use_data = subset(exac, use)

You can view the table using head() or with dplyr’s tbl_df, which formats the output nicely:

## Source: local data frame [10,195,872 x 116]
##    chrom   pos   ref   alt    id                     filter
##    (chr) (int) (chr) (chr) (chr)                      (chr)
## 1      1 13372     G     C     .                       PASS
## 2      1 13380     C     G     . VQSRTrancheSNP99.60to99.80
## 3      1 13382     C     G     . VQSRTrancheSNP99.60to99.80
## 4      1 13402     G     C     . VQSRTrancheSNP99.60to99.80
## 5      1 13404     G     A     . VQSRTrancheSNP99.60to99.80
## 6      1 13404     G     T     . VQSRTrancheSNP99.60to99.80
## 7      1 13417     C CGAGA     .                       PASS
## 8      1 13418     G     A     .             AC_Adj0_Filter
## 9      1 13426     A     G     . VQSRTrancheSNP99.80to99.90
## 10     1 13438     C     A     . VQSRTrancheSNP99.80to99.90
## ..   ...   ...   ...   ...   ...                        ...
## Variables not shown: ac (int), ac_afr (int), ac_amr (int),
##   ac_adj (int), ac_eas (int), ac_fin (int), ac_hemi (int),
##   ac_het (int), ac_hom (int), ac_nfe (int), ac_oth (int),
##   ac_sas (int), ac_male (int), ac_female (int),
##   ac_consanguineous (int), ac_popmax (int), an (int), an_afr
##   (int), an_amr (int), an_adj (int), an_eas (int), an_fin
##   (int), an_nfe (int), an_oth (int), an_sas (int), an_male
##   (int), an_female (int), an_consanguineous (int), an_popmax
##   (int), hom_afr (int), hom_amr (int), hom_eas (int), hom_fin
##   (int), hom_nfe (int), hom_oth (int), hom_sas (int),
##   hom_consanguineous (int), clinvar_measureset_id (int),
##   clinvar_conflicted (int), clinvar_pathogenic (int),
##   clinvar_mut (chr), popmax (chr), k1_run (chr), k2_run
##   (chr), k3_run (chr), doubleton_dist (dbl), inbreedingcoeff
##   (dbl), esp_af_popmax (dbl), esp_af_global (dbl), esp_ac
##   (int), kg_af_popmax (dbl), kg_af_global (dbl), kg_ac (int),
##   vqslod (dbl), gene (chr), feature (chr), symbol (chr),
##   canonical (chr), ccds (chr), exon (chr), intron (chr),
##   consequence (chr), hgvsc (chr), amino_acids (chr), sift
##   (chr), polyphen (chr), lof_info (chr), lof_flags (chr),
##   lof_filter (chr), lof (chr), context (chr), ancestral
##   (chr), coverage_median (dbl), coverage_mean (dbl),
##   coverage_10 (dbl), coverage_15 (dbl), coverage_20 (dbl),
##   coverage_30 (dbl), pos_id (chr), indel (lgl),
##   bases_inserted (int), transition (lgl), cpg (lgl),
##   alt_is_ancestral (lgl), af_popmax (dbl), af_global (dbl),
##   singleton (lgl), maf_global (dbl), daf_global (dbl),
##   daf_popmax (dbl), pos_bin (fctr), bin_start (dbl), sector
##   (chr), bad (lgl), use (lgl), lof_use (lgl), sift_word
##   (chr), sift_score (dbl), polyphen_word (chr),
##   polyphen_score (dbl), category (chr)

A previous blog post by Monkol Lek, A Guide to the Exome Aggregation Consortium (ExAC) Data Set, described some of the challenges in exploring the data. With the dataset loaded into R, we can dive into a few more tips and tricks. First, you’ll notice that there are many pre-computed annotations per variant, in particular:

Allele frequency

One set of questions that always comes up concerns the allele counts (AC) and allele numbers (AN). The raw counts (ac and an) are the total number of chromosomes carrying this allele and the total number of chromosomes that could be called (whether reference or alternate), respectively; the allele frequency is thus ac/an. However, for more stringent applications, we adjusted the allele count to only include individuals with genotype quality (GQ) >= 20 and depth (DP) >= 10, which we term ac_adj (and correspondingly, an_adj for the total number of chromosomes from individuals meeting those criteria, regardless of genotype call).

In this table, af_global is then ac_adj/an_adj, and the ac and an values for each population are likewise restricted to adjusted counts (i.e. ac_eas includes only individuals of East Asian descent who meet these criteria). Additionally, we provide the maximum allele frequency across populations as af_popmax and the corresponding population as popmax, which is useful for filtering variants in Mendelian gene discovery. Finally, we provide hom_* for all populations: the count of individuals homozygous for this allele.
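As a toy illustration of the relationship between raw and adjusted counts (using a hypothetical three-variant data frame, not the real ExAC table):

```r
# Hypothetical mini-table standing in for a few columns of `exac`
df = data.frame(
  ac     = c(5, 120, 1),        # raw allele counts
  an     = c(1000, 1000, 1000), # raw called chromosomes
  ac_adj = c(4, 118, 0),        # counts restricted to GQ >= 20 & DP >= 10 genotypes
  an_adj = c(900, 950, 880)
)

# af_global is computed from the adjusted counts, not the raw ones
df$af_global = df$ac_adj / df$an_adj

# A variant with ac_adj == 0 has no high-quality carrier, so its
# adjusted frequency is 0 even though its raw ac is nonzero
df$af_global
```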

Functional annotations

The output from VEP is parsed for each allele and, in the default table, summarized using the most severe consequence across all transcripts. You can run load_exac_data('canonical'), for instance, to load annotations for only the canonical transcripts, or pass transcript=T or gene=T to load a data frame with one entry for every transcript or gene, respectively, for each variant.

Loss-of-function (LoF) annotations of variants from LOFTEE are provided in the lof, lof_filter, and lof_flags columns, and the highest-confidence variants are indicated as lof_use. PolyPhen and SIFT annotations are parsed into words (e.g. benign or possibly_damaging) and scores.

There are also annotations from ClinVar, intersected with the output of Eric Minikel’s ClinVar parsing tool. A query for pathogenic variants without conflicting evidence may use clinvar_pathogenic == 1 & clinvar_conflicted == 0 & clinvar_mut == 'ALT' (note the vectorized &, not the scalar &&, when filtering a data frame).
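A minimal sketch of such a query on a hypothetical mini-table (invented values, not real ClinVar rows), using base R’s subset():

```r
# Hypothetical rows standing in for the ClinVar columns of `exac`
df = data.frame(
  clinvar_pathogenic = c(1, 1, 0, 1),
  clinvar_conflicted = c(0, 1, 0, 0),
  clinvar_mut        = c('ALT', 'ALT', 'ALT', 'REF')
)

# Vectorized & combines the three conditions element-wise
pathogenic = subset(df, clinvar_pathogenic == 1 &
                        clinvar_conflicted == 0 &
                        clinvar_mut == 'ALT')
nrow(pathogenic)  # only the first row survives all three filters
```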

Other annotations

Finally, there are numerous convenience annotations for quick analysis, including logical values for indel (with the number of bases inserted or deleted in bases_inserted), cpg, transition, and singleton, as well as various coverage metrics: mean, median, and the proportion of individuals covered at a given depth (e.g. coverage_30 is the proportion of individuals covered at 30X). The 3-base context around SNPs is provided in the context column (warning: this is currently broken for indels and, in any case, less useful there).
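For example, the coverage columns make it easy to restrict an analysis to well-covered sites. A quick sketch on hypothetical values (the 0.9 threshold is just an illustration, not a recommendation from the paper):

```r
# Hypothetical coverage columns for three variants
df = data.frame(
  coverage_mean = c(45.2, 12.1, 60.8),
  coverage_20   = c(0.98, 0.35, 0.99)  # proportion of individuals at >= 20X
)

# Keep variants where at least 90% of individuals were covered at 20X
well_covered = subset(df, coverage_20 >= 0.9)
nrow(well_covered)  # 2 of the 3 toy variants pass
```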

As mentioned above, the use column is a crucial one for filtering to the highest-quality variants, which includes the following criteria:

  • Filter status of “PASS” (from VQSR)
  • Adjusted allele count (ac_adj, see above) > 0 (i.e. there is at least one individual with at least GQ >= 20 and DP >= 10)
  • At this locus, at least 80% of individuals (regardless of whether they have the variant) met the high-quality adjusted status as above (in code, exac$an_adj > .8*max(exac$an_adj))
  • Does not fall in the top 10 most-multiallelic kilobases of the genome (see Multi-allelic enriched regions)
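The four criteria above can be sketched as a single logical flag. This is a toy reconstruction on a hypothetical four-variant table, not the repository’s actual code for computing use:

```r
# Hypothetical mini-table with the columns the criteria depend on
df = data.frame(
  filter = c('PASS', 'PASS', 'VQSRTrancheSNP99.60to99.80', 'PASS'),
  ac_adj = c(3, 0, 10, 7),
  an_adj = c(120000, 119000, 121000, 50000),
  bad    = c(FALSE, FALSE, FALSE, FALSE)
)

an_cutoff = 0.8 * max(df$an_adj)   # 80% of the best-called site

df$use = df$filter == 'PASS' &     # 1. VQSR PASS
         df$ac_adj > 0 &           # 2. at least one high-quality carrier
         df$an_adj > an_cutoff &   # 3. well-genotyped locus
         !df$bad                   # 4. outside the multi-allelic hotspots
which(df$use)  # only the first toy variant passes all four criteria
```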

With the data, there are many analyses that you can then do, including:

Recreate the figures from the paper

This one’s easy. Just run the following, which will take about an hour. (Warning: this will use a lot of memory and CPU, and has been known to crash R, particularly for some of the more computationally expensive figures, such as Figures 2E and 3F.)


Depending on your system, it may be easier to open the script and run sections of the code independently.

Multi-allelic enriched regions

Early on after we released the ExAC dataset to the public, Nicole Deflaux at Google performed a nice analysis identifying chromosome 14 as a hotspot of multi-allelic variants. We wanted to trace the source of this issue, so we used ggbio to plot the multi-allelics across the genome.

Install ggbio if it’s not already installed:
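The install command itself isn’t shown here. ggbio is a Bioconductor package, so the exact route depends on your Bioconductor version (biocLite was the historical entry point; BiocManager is the current one). A hedged sketch using the current mechanism:

```r
# Install ggbio from Bioconductor only if it's missing
if (!requireNamespace("ggbio", quietly = TRUE)) {
  if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  BiocManager::install("ggbio")
}
```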


Then, use the genome_plotting.R script where we’ve written functions to calculate and plot the density of multi-allelic variants (for now, limiting to quad-allelic and up) per kilobase across the genome.

multiallelic_density = get_multiallelic_density(exac, resolution=1000)


The two regions highlighted in light blue look interesting (chr14, as before, as well as chr2), so we can dig into them. The top 10 regions, ordered by frequency, are:

multiallelic_density %>% 
  arrange(desc(freq)) %>% 
  head(10)
##    chrom bin_start   bin_end freq
## 1     14 106330000 106331000  109
## 2      2  89160000  89161000  103
## 3     14 106329000 106330000   76
## 4     14 107178000 107179000   63
## 5     17  18967000  18968000   46
## 6     22  23223000  23224000   44
## 7      1 152975000 152976000   42
## 8      2  89161000  89162000   39
## 9     14 107179000 107180000   38
## 10    17  19091000  19092000   38

If we look these up, we’ll find that these regions include the immunoglobulin kappa (IGK, chr2), heavy (IGH, chr14), lambda variable (IGLV, chr22), and joining (IGKJ1, chr2) loci, as well as some difficult-to-call regions of the genome. In ExAC, we’ve added a column, named bad for lack of a better term, indicating whether a variant falls into these 10 regions; such variants are then excluded from the high-quality use set.

dbSNP rs number versus allele frequency

We can also do some fun analyses that didn’t make the cut. One easy hypothesis to test was whether the rs number (the number in a dbSNP RSID) correlates with allele frequency. We would expect a negative correlation, as rs numbers are assigned sequentially and common variants should have been discovered earlier. Of course, the correlation will not be perfect: rare variants in the first individuals sequenced, as well as disease variants, will have been identified early and assigned low rs numbers.

First, let’s get the rsid column into a numeric form and subset to only those in dbSNP:

library(dplyr)     # for mutate() and filter()
library(magrittr)  # for the %<>% compound-assignment pipe
exac %<>% mutate(rsid = as.numeric(gsub('rs', '', id)))
rsid_variants = filter(exac, ! & af_global > 0)

The rs number is negatively correlated with allele frequency, as expected.

cor.test(rsid_variants$rsid, rsid_variants$af_global)
##  Pearson's product-moment correlation
## data:  rsid_variants$rsid and rsid_variants$af_global
## t = -536.42, df = 1108700, p-value = 0
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4554037 -0.4524481
## sample estimates:
##        cor 
## -0.4539271
ggplot(rsid_variants) + aes(x = rsid, y = af_global) + geom_point() + geom_smooth(method="lm", se=F) + ylab('Global Allele Frequency')


It’s a bit tough to see the correlation there, but it’s promising: later RSIDs do appear to have lower allele frequencies than earlier ones. Our fit line doesn’t look great, though, so let’s take the log of the allele frequency and visualize it with smoothScatter():

smoothScatter(rsid_variants$rsid, log10(rsid_variants$af_global), xlab='RSID', ylab='Global allele frequency (log10)', col='white')
abline(lm(log10(rsid_variants$af_global) ~ rsid_variants$rsid))


The largest blob in the bottom right corresponds to variants submitted by the NHLBI Exome Sequencing Project, from which many exomes are included in ExAC.

We encourage anyone to continue exploring the ExAC dataset and we hope that this repository helps others reproduce and expand on the analyses we’ve done. We welcome any and all comments, here and on the github page. Enjoy!

A personal journey to quantify disease risk

We have a new paper out today in Science Translational Medicine that describes our application of the massive Exome Aggregation Consortium database to understanding the variation in one specific gene: the PRNP gene, which encodes the prion protein.

This project was a special one for a number of reasons. Firstly, there’s an incredibly strong personal motivation behind this work, which you can read much more about in a blog post by lead author Eric Minikel. Secondly, it’s a clear demonstration of the way in which we can use large-scale reference databases to interpret genetic variation, including flagging some variants as non-causal or having mild effects. Thirdly, as discussed in the accompanying perspective by Robert Green and colleagues, this work is already having clinical impact by changing the diagnosis for people with families affected by prion disease. And finally, the discovery of “knockout” variants in PRNP in healthy individuals is tantalizing evidence that inhibiting this gene in mutation carriers is likely to be a safe therapeutic approach.

The paper is of course open access, so you can read the details yourself. Huge congratulations to Eric for pulling this paper together!