Genomic data aggregation

Assembling a catalogue of human genetic variation

In 2014, together with collaborators from a wide range of disease-specific research consortia, we assembled and reprocessed the world’s largest collection of human exome data, the Exome Aggregation Consortium (ExAC) collection, providing unprecedented resolution of the patterns of genetic variation in human protein-coding genes. We released ExAC as a public dataset of aggregate allele frequencies from 60,706 individuals. In 2016, we released a massive update of this reference database, more than doubling the number of samples and giving it a new name: the Genome Aggregation Database (gnomAD). The ExAC and gnomAD websites have been accessed over 20 million times since October 2014, and the datasets have become the default reference for clinical diagnostic labs worldwide. The gnomAD data are released publicly for the benefit of the wider biomedical community, with no publication restrictions or embargoes.

To find out more about our research in this area, you can read the ExAC manuscript in Nature or our set of gnomAD preprints: the flagship gnomAD paper focusing on loss-of-function (LoF) variants, our transcript annotation paper, a perspective on the use of LoF variants in drug discovery, a specific example of this in LRRK2, and analyses of structural variation, multi-nucleotide variants, and 5’ upstream open reading frames in the gnomAD dataset. You can also read our blog post about the creation and quality control of the current gnomAD dataset, and find the code for the gnomAD browser and the gnomAD quality control pipeline on GitHub.

We are currently working on expanding gnomAD once more, adding more exomes and more genomes from more populations, and on realigning everything to the hg38 reference (yes, finally).