Announcing the Exome Aggregation Consortium paper

Today we are celebrating the official publication of the Exome Aggregation Consortium (ExAC) paper in Nature – marking the end of a phase in this project that has involved most of the members of my lab (and many, many others beyond) for a large chunk of the last few years. This official publication is an opportune time to reflect on how ExAC came to be, and the impact it’s had on us and the wider community.

First, some background

Exome sequencing is a very cost-effective approach that allows us to look with high resolution at just the 1-2% of the human genome that codes for protein – these are the parts we understand the best, and also the parts where the vast majority of severe disease-causing mutations are found. Because exome sequencing is so powerful it’s been applied to tens of thousands of patients with rare, severe diseases such as muscular dystrophy and epilepsy. However, a key challenge when sequencing patients is that everyone carries tens of thousands of genetic changes, and we need a database of “normal” variation that tells us which of those changes are seen in healthy people, and how common they are.

The goal of this project was to create such a database, at massive scale. Briefly, we pulled together DNA sequencing data from the largest collection of humans ever assembled – more than 60,000 people in total, from more than two dozen different disease projects, generously donated by researchers from around the world – to look at the distribution of genetic variation across the exome, and to make it publicly available for anyone to use. This project became known as the Exome Aggregation Consortium, or ExAC.

This project involved assembling nearly a petabyte (a thousand terabytes) of raw sequencing data and putting all of that through the same processing pipeline to produce a set of variant calls that were consistent across the whole project. At the end of that process we produced a summary file, which is basically a list of all of the 10 million variants we discovered in the project and how common they are across different populations – and made that publicly available back in 2014. That resource has now received over 5.2 million page views from researchers around the world, mostly for the interpretation of genetic changes found in rare disease patients.

In this paper, we describe how this resource can be used in a variety of ways to understand human genetic variation, particularly focusing on the very, very rare DNA changes – those seen in less than one in a thousand people – most of which had never been observed until this project. We show that our new database is much more powerful than existing resources for filtering the variants found in rare disease patients, making it a much more effective tool for clinical labs to use in diagnosis. And importantly, we also show that we are able to use this database to identify genes that have fewer variants than we would expect to see by chance – basically, identifying genes that are intolerant of variation in the general population, meaning they are much more likely to be involved in causing disease.
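To make those two ideas a little more concrete, here is a minimal Python sketch. It is not the code or the statistical model used in the paper; it only illustrates the general logic under stated assumptions. The variant identifiers, the `reference_counts` mapping, the function names and the `max_af` cutoff of 0.001 (an allele-frequency stand-in for "one in a thousand") are all hypothetical, and the observed/expected ratio is a deliberately simplified proxy for the constraint analysis described in the paper.

```python
def allele_frequency(allele_count, allele_number):
    """Fraction of sequenced chromosomes carrying the variant (0.0 if none were sequenced)."""
    if allele_number == 0:
        return 0.0
    return allele_count / allele_number


def filter_rare(patient_variants, reference_counts, max_af=0.001):
    """Keep patient variants that are rare or absent in the reference database.

    reference_counts maps a variant ID to (allele_count, allele_number).
    Variants missing from the reference are treated as frequency 0.0 and kept.
    The max_af threshold is an illustrative assumption, not a recommended value.
    """
    rare = []
    for variant_id in patient_variants:
        allele_count, allele_number = reference_counts.get(variant_id, (0, 0))
        if allele_frequency(allele_count, allele_number) < max_af:
            rare.append(variant_id)
    return rare


def observed_expected_ratio(observed, expected):
    """Simplified constraint proxy: observed variant count divided by the expected count.

    Values well below 1.0 suggest a gene has fewer variants than expected by chance.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    return observed / expected


# Hypothetical example data: variant ID -> (allele_count, allele_number)
reference_counts = {
    "1-100-A-G": (5000, 120000),  # common in the reference population
    "2-200-C-T": (3, 120000),     # very rare in the reference population
}
patient_variants = ["1-100-A-G", "2-200-C-T", "3-300-G-A"]

candidates = filter_rare(patient_variants, reference_counts)
print(candidates)
print(observed_expected_ratio(observed=2, expected=20.0))
```

In this toy example the common variant is filtered out, while the very rare variant and the one never seen in the reference remain as candidates for follow-up. The larger the reference database, the more confidently a variant's absence or rarity can be interpreted.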

I’m not going to describe these findings in detail, because the paper is open access and you can read it yourself – there is also a wonderful summary in Nature by Jay Shendure. However, I did want to highlight a few key things that you can’t learn from the paper, including the story of how this project actually happened.

How ExAC came about

In 2012 my (brand new) lab started sequencing exomes from patients with rare muscle diseases. Two things we found almost immediately were that (1) we desperately needed to interpret our variants in the context of large collections of variation from the “normal” population, and (2) the existing resources simply weren’t sufficient for this job. Both 1000 Genomes and the Exome Variant Server were fantastic resources, but neither was large enough to provide insight into the extremely rare variants we were seeing in our patients.

At around the same time, a few key things fell into place. Firstly, Mark DePristo’s GATK team at the Broad Institute had fancy new software that could, at least in theory, allow the generation of unified variant calls across tens of thousands of samples, and they were keen to try it out at scale. Secondly, it became clear that there was a critical mass of colleagues at the Broad Institute who had in combination sequenced over 20,000 exomes, and who were willing to make that data available for a large shared effort. And finally, I had been joined in my lab by a remarkably talented postdoc, Monkol Lek, who had the informatic background required to coordinate and analyze a large-scale sequencing resource. And so we set about building one.

Over the next 18 months, we worked closely with the Broad’s sequence production team to produce at least five call sets, starting with a pilot run of just over 13,000 samples. In each case we encountered either intractable computational scaling problems, or quality control issues in the final call set. Over the same time the number of samples continued to grow, and we became steadily more ambitious with each failure. In late 2013 we made an attempt at variant calling across nearly 57,000 samples, which took many months; depressingly, it produced a product with unusably high error rates. In early 2014 I was genuinely unsure whether this project would ever be successful.

We were ultimately saved by some truly remarkable engineering work on the Broad’s production variant-calling pipeline, spearheaded by a team guided by Tim Fennell and Eric Banks. Out of the ashes of the failed 57K call set rose a new, much larger effort, using a new pipeline (for aficionados, this marked the switch from UnifiedGenotyper to HaplotypeCaller) – and to our delight, this new tool proved both enormously faster and much more accurate than its predecessor. In June 2014 we suddenly had a call set that passed every quality control test we threw at it. And while the technical delays had been agonizing, that 18-month cloud turned out to have a massive silver lining: over this period of delays the number of exomes available to us had grown, staggeringly, to more than 90,000 samples. This June call set, after removing nearly a third of samples (for good reasons, outlined below), became what is now the public release of ExAC.

The awesome consortium

It’s worth being very explicit about this: ExAC didn’t sequence any exomes. The resource only exists because two dozen investigators from more than 20 consortia (see the full list here) chose to make their raw data freely available to create a resource for the benefit of the wider community. I am personally still absolutely astonished that so many of our colleagues were not only willing, but actively enthusiastic about making their data available to the world, and I’m forever indebted to them for this.

I won’t name every single one of these wonderful individuals here (again, here’s the full list) but I did want to single out a few people who really went above and beyond to support the resource, even when it was barely more than an idea, especially David Altshuler, Mark Daly, Sek Kathiresan, Mark McCarthy and Pat Sullivan. Every one of these people stood up for ExAC multiple times. As a junior faculty member proposing a project that was basically insanely ambitious, I have been incredibly lucky to have support from such a group.

There are many other awesome people in this story. We’ll get back to them later.

Going public

So, in June 2014 we had the biggest collection of exome sequencing data that had ever been assembled. One thing that was never in question was that we wanted to get this data set out to the community as quickly as possible – that was the whole purpose of ExAC – and so we set ourselves the task of preparing for a public launch at the American Society of Human Genetics meeting in October, where I was scheduled to give a talk. Over the next couple of months, increasingly stringent QC run by Monkol and other members of the team convinced us that this was a call set of excellent quality, so everything was in place for an October announcement.

But we didn’t simply want to dump a raw text file of sequence variants on the web – we needed to make this resource available in a way that anyone could use without needing bioinformatics expertise. And so I asked another postdoc in the lab, Konrad Karczewski, if he could put his considerable web development experience to work in building such a website.

And so, in the months leading up to ASHG, the now-iconic ExAC browser took shape. This process reached its climax at approximately 3am on October 18th, the morning of my talk, in a San Diego AirBnB, where Konrad, Brett Thomas and Ben Weisburd worked to rescue the website from a catastrophic database failure while I sat nearby frantically editing my slides and preparing for the possibility of a completely non-functional website. Magically, everything worked at the last minute – and the site has now been stable for almost two years and 5.2 million page views, testament to Konrad’s development of a resource that was both robust and user-friendly. Last year, Ben Weisburd added a new feature that allowed users to view the raw read support for individual variants – which profoundly changed my experience of the data set, and has been very popular with users.

Analysis time

Once the chaos of the ASHG launch was over, we got back to work on actually making sense of this massive data set. The next year was incredibly fun, as we crawled through the intricacies of the data set and figured out what we could learn from it about human variation, biology and disease.

There are few greater joys as a PI than watching a team of fantastic young scientists work together on an exciting project – and this project was intensely collaborative, involving most of the members of my lab as well as many others. Monkol of course played a key role in many analyses. Konrad Karczewski led the analyses on mutational recurrence and loss-of-function variants. Eric Minikel led much of the work surrounding the clinical interpretation of ExAC variants, with strong support from Anne O’Donnell Luria and James Ware. And Kaitlin Samocha, a ludicrously talented graduate student from Mark Daly’s lab, drove the analysis of loss-of-function constraint that ended up becoming a headline result for the paper. Along the way, many other trainees and scientists contributed in ways that strengthened the project. Tens of thousands of messages were posted in our lab’s #exac_papers Slack channel, including dozens of versions of every figure that ended up in the paper; and slowly, magically, an unstructured Google Doc morphed into a real scientific manuscript.

(Other papers came together, too. We published Eric’s analysis of variation in the PRNP gene back in February – and today, simultaneous with the main ExAC paper, ExAC collaborators have published work showing the discovery of copy number variants across the ExAC individuals as well as the use of ExAC to understand the impact of variants in severe heart muscle disease.)

Open everything

One of the things I’m proudest of about ExAC is the way we systematically set about making our data, tools and results available to the public as quickly and easily as possible. The data set was pushed out to the public, with no access restrictions, through Konrad’s browser basically as soon as we had a version we felt we could trust. All of the software we used to produce the ExAC paper figures is freely available, thanks to the hard work and dedication of Konrad Karczewski, Eric Minikel and others in cleaning up the code and preparing it for release.

We also made our analysis results available as a preprint back in November last year – and the final Nature paper, with many improvements thanks to reviewers and a wonderful editor, is fully open access.

The benefit of this approach, beyond the warm fuzzy feeling of releasing stuff that other people can use, is that we got rapid feedback from the community – this was extremely useful in spotting glitches in the data, as well as errors in the manuscript preprint. We intend to continue with this open approach for the project, and we hope this can also serve as a model for other consortia.

What comes next?

The work of ExAC is far from over. Later this year we’ll be announcing the release of the second version of this resource, which we hope will exceed 120,000 exomes. We’ll also be releasing a whole-genome version of the resource that provides insight into variants outside protein-coding regions. And most importantly, the science will continue: my group, our collaborators, and researchers and clinicians around the world will continue to use this resource to diagnose rare disease patients, to understand the distribution of variation across the human population, and to explore human biology and disease.

Special thanks

I’ve thanked a lot of people in the post above, which is only appropriate, but there are many more who deserve it. Huge numbers of people were involved in putting this resource together.

One important thing to note about ExAC is that it was effectively unfunded for most of its existence. During this time, the immense costs of data storage and computing were covered (invisibly) by institutional support from the Broad, as were many of the personnel who contributed to the project. Many people in the team now known as the Data Sciences Platform at the Broad were involved in building the pipeline to make this call set possible, running it (over and over again!) on enormous numbers of samples, and helping us to make sense of the results – especially including Amy Levy-Moonshine, Khalid Shakir, Ryan Poplin, Laura Gauthier, Valentin Ruano-Rubio, Yossi Farjoun, Kathleen Tibbetts, Charlotte Tolonen, Tim Fennell, and Eric Banks. The Broad’s Genomics Platform generated 90% of the sequencing data in ExAC, and helped in many ways to make the project possible – huge thanks to Stacey Gabriel for her constant support. Namrata Gupta and Christine Stevens played major roles in assembling massive lists of sequenced individuals, and seeking data usage permissions. Andrea Saltzmann and Stacey Donnelly were our ethics gurus, who helped us get the IRB permissions needed for this resource to go live.

In September 2014 we received a U54 grant from NIDDK that partially defrays the ongoing cost of the project, but we are still absolutely dependent on the amazing and ongoing support of this Broad team. The sheer amount of resources and time donated to this effort is testament to the Broad’s genuine commitment to producing resources that help other people; I couldn’t be prouder to be part of this organization today.

And finally, I wanted to give special thanks to two incredibly patient spouses – my wife Ilana and Monkol’s wife Angela – for tolerating our prolonged absences and frequent distraction over the last few years. We appreciate it more than we say.

About dgmacarthur

I'm a geneticist working on the interpretation of human genomes. I'm also an advocate of people having free access to their own genetic information.

2 thoughts on “Announcing the Exome Aggregation Consortium paper”

  1. Thanks for the completion of your “insanely ambitious” resource! Looking forward to even greater versions of this wonderful set. Best wishes.

