
About dgmacarthur

I'm a geneticist working on the interpretation of human genomes. I'm also an advocate of people having free access to their own genetic information.

Response to “Proposal to Update Data Management of Genomic Summary Results Under the NIH Genomic Data Sharing Policy”

Executive summary: the NIH is seeking comments on a new proposed policy on genomic data sharing. While there is much to like about the new policy, we are very concerned about the proposed requirement for a click-through agreement on all aggregate genomic resources (which would include heavily-used databases such as ExAC and gnomAD). Our draft response to the Request for Comments is below. If you agree with our concern, please consider replying to the Request for Comments yourself, using the template text at the end of this post if useful.

Draft response to Request for Comments
We would like to applaud the NIH for moving in the right direction with its new “Proposal to Update Data Management of Genomic Summary Results Under the NIH Genomic Data Sharing Policy”. The rapid and open sharing of summary statistics from aggregate genomic data brings tremendous benefit to the scientific community, and the potential harms of such sharing are largely theoretical. Our own experience with the ExAC and gnomAD public variant frequency databases has shown that the benefits to academic, clinical and pharmaceutical scientists from sharing of aggregate data are substantial: the browsers have had over 10 million page views by over 200,000 unique users from 166 countries in the past three years, and have been used by diagnostic laboratories in the analysis of >50,000 rare disease families. Even greater value will arise from the broader sharing of aggregate statistics that the new policy enables.

However, we are still very concerned by one aspect of the new Genomic Summary Results Data Sharing Policy – the creation of a new tier of access, rapid-access, which requires a click-through agreement to gain access to summary statistics. These concerns can be summarized as follows: (1) click-throughs make programmatic access to datasets challenging; (2) they greatly complicate or prevent multiple important types of re-use of the data; and (3) they are highly unlikely to deter anyone with genuine malicious intent. Overall, our position is that click-through agreements are a security fig leaf that gives the impression of extra protection but actually does no good – and can do non-trivial harm. And we would like to emphasize that ExAC and gnomAD, along with other aggregate data sharing sites such as the Exome Variant Server, have never had click-through agreements, and to the best of our knowledge no harm has ever come to participants as a result.

To explain those points in a bit more detail:

  1. It is critical for summary statistic resources such as gnomAD that we allow access through programmatic interfaces (APIs) so that people can query them using software (e.g. pull out just a targeted slice of the data) and perform automated queries (e.g. pull out the frequency of a specific variant when a user loads a different web page about that variant). Most implementations of click-through agreements will prevent or greatly complicate any form of programmatic access. There are possible technical workarounds, but all of them result in some kind of barrier to programmatic access.
  2. Probably the single biggest obstacle created by click-through agreements is that they prevent or substantially complicate data re-use. Right now anyone can download the complete list of frequencies from gnomAD and load it up in another website, or use it to build other useful web services (the complete ExAC data set has been downloaded thousands of times). With any kind of click-through agreement they either couldn’t do that at all, or would have to incorporate the same agreement in their usage policy, which may be incompatible with their proposed usage.
  3. Most importantly, click-through agreements do nothing to prevent the types of usage that are most likely to be harmful. It is worth noting that ExAC and gnomAD have existed on the web for almost 3 years and have been accessed more than 10 million times, without us being aware of a single incident that carried any risk of harming participants. The vast majority of users are simply interested in using the data in their research. The theoretical bad actor intent on malicious usage is extremely unlikely to be dissuaded by a click-through agreement, nor does such an agreement offer any real after-the-fact protection if a malicious actor decides to do harm.
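To make point (1) concrete, here is a minimal sketch of the kind of programmatic query that a click-through gate would break. gnomAD exposes a public GraphQL API; the query fields and dataset identifier below are illustrative assumptions and may not match the current schema exactly.

```python
# Sketch of a programmatic variant-frequency lookup against the gnomAD
# GraphQL endpoint. The field names and dataset id are illustrative
# assumptions, not a specification of the current API schema.
import json
import urllib.request

GNOMAD_API = "https://gnomad.broadinstitute.org/api"  # public endpoint

def build_variant_query(variant_id, dataset="gnomad_r2_1"):
    """Construct a GraphQL request for one variant's allele counts."""
    return {
        "query": """
        query ($variantId: String!, $dataset: DatasetId!) {
          variant(variantId: $variantId, dataset: $dataset) {
            exome { ac an }
            genome { ac an }
          }
        }""",
        "variables": {"variantId": variant_id, "dataset": dataset},
    }

def fetch_variant(variant_id):
    """POST the query; a click-through gate would break exactly this step."""
    payload = json.dumps(build_variant_query(variant_id)).encode()
    req = urllib.request.Request(
        GNOMAD_API, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (performs a live HTTP request, so not executed here):
#   result = fetch_variant("1-55516888-G-GA")
#   print(result["data"]["variant"])
```

This is the pattern behind automated queries such as "fetch the frequency of this variant when a user opens a page about it": a plain HTTP request issued by software, with no human present to click anything.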

In summary, click-through agreements will degrade or destroy programmatic access and data reuse, without having any meaningful effect on participant safety. Any policy that advocates for click-through agreements as a solution should spell out explicitly exactly what types of misuse the click-through will prevent, and should justify the barriers to data usage that would result.

We believe it would be a mistake to incorporate click-through agreements into any NIH-wide policy. Instead, we suggest that the NIH require clear wording about the responsible use of aggregate data (such as avoiding reidentification) on all websites sharing aggregate genetic data, perhaps with a link on every page, but with no click-through barrier. This would provide a reasonable balance between serving the needs of the research community and protecting the public trust.


Daniel MacArthur.

A request for gnomAD users and supporters
For any member of the ExAC/gnomAD community who agrees that the public sharing of summary statistics is both harmless to participants and of great benefit to science, we urge you to read the new policy proposal here, and to respond to the NIH’s Request for Comment here by October 20th.

Feel free to edit or use the text below:

I am writing as an avid user of the ExAC and gnomAD databases. [Please provide a brief description of your use of these resources and their benefits to your research]

I believe that the new “Proposal to Update Data Management of Genomic Summary Results Under the NIH Genomic Data Sharing Policy” is a step in the right direction – there is no evidence that controlled access to summary statistics prevents any harm to participants, and open access to variant frequency data through ExAC and gnomAD has been very important to my research.

However, I am concerned about the proposal to create a new rapid-access category for summary statistics that would require the use of click-through agreements. These agreements make it difficult to reuse summary statistics and to access data programmatically. Most importantly, there is no evidence that they prevent harm to participants. A wide variety of summary statistics, including ExAC and gnomAD, have been publicly available without click-through agreements for many years, and in that time no harm of any kind has come to any participant whose data is aggregated in those resources.

I urge the NIH to modify this proposal, and to designate summary statistics as open access, with the exception of communities and populations who believe that they are especially vulnerable to harm from possible reidentification.


[name here]

Announcing the Exome Aggregation Consortium paper

Today we are celebrating the official publication of the Exome Aggregation Consortium (ExAC) paper in Nature – marking the end of a phase in this project that has involved most of the members of my lab (and many, many others beyond) for a large chunk of the last few years. This official publication is an opportune time to reflect on how ExAC came to be, and the impact it’s had on us and the wider community.

First, some background

Exome sequencing is a very cost-effective approach that allows us to look with high resolution at just the 1-2% of the human genome that codes for protein – these are the parts we understand the best, and also the parts where the vast majority of severe disease-causing mutations are found. Because exome sequencing is so powerful it’s been applied to tens of thousands of patients with rare, severe diseases such as muscular dystrophy and epilepsy. However, a key challenge when sequencing patients is that everyone carries tens of thousands of genetic changes, and we need a database of “normal” variation that tells us which of those changes are seen in healthy people, and how common they are.

The goal of this project was to create such a database, at massive scale. Briefly, we pulled together DNA sequencing data from the largest collection of humans ever assembled – more than 60,000 people in total, from more than two dozen different disease projects, generously donated by researchers from around the world – to look at the distribution of genetic variation across the exome, and to make it publicly available for anyone to use. This project became known as the Exome Aggregation Consortium, or ExAC.

This project involved assembling nearly a petabyte (a thousand terabytes) of raw sequencing data and putting all of it through the same processing pipeline to produce a set of variant calls that were consistent across the whole project. At the end of that process we produced a summary file that is essentially a list of all 10 million variants we discovered in the project, and how common they are across different populations – and made that publicly available back in 2014. That resource has now received over 5.2 million page views by researchers around the world, mostly for the interpretation of genetic changes found in rare disease patients.
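That summary file follows the standard VCF sites format: one line per variant, with allele counts (AC) and allele numbers (AN), overall and per population, packed into the INFO column. A toy sketch of how a user might compute frequencies from such a file – the variant line below is simplified and invented, and real release files carry many more INFO fields:

```python
# Toy illustration of reading allele frequencies from an ExAC-style
# sites VCF line. The example line is invented; real files contain many
# more INFO fields, but per-population AC/AN (e.g. AC_AFR/AN_AFR)
# follow this general pattern.

def parse_info(info_field):
    """Parse a semicolon-delimited VCF INFO field into a dict."""
    out = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        out[key] = value
    return out

def allele_frequency(info, pop=None):
    """Allele count / allele number, overall or for one population."""
    suffix = f"_{pop}" if pop else ""
    ac = int(info["AC" + suffix])
    an = int(info["AN" + suffix])
    return ac / an if an else 0.0

# Simplified VCF line: CHROM POS ID REF ALT QUAL FILTER INFO
line = "22\t46615880\t.\tT\tC\t1771.54\tPASS\tAC=6;AN=121412;AC_AFR=5;AN_AFR=10406"
fields = line.split("\t")
info = parse_info(fields[7])

print(f"global AF = {allele_frequency(info):.2e}")
print(f"AFR AF    = {allele_frequency(info, 'AFR'):.2e}")
```

A frequency computed this way is exactly what a diagnostic lab checks when deciding whether a variant seen in a patient is too common in the general population to cause a rare, severe disease.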

In this paper, we describe how this resource can be used in a variety of ways to understand human genetic variation, particularly focusing on the very, very rare DNA changes – those seen in less than one in a thousand people – most of which had never been observed until this project. We show that our new database is much more powerful than existing resources for filtering the variants found in rare disease patients, making it a much more effective tool for clinical labs to use in diagnosis. And importantly, we also show that we are able to use this database to identify genes that have fewer variants than we would expect to see by chance – basically, identifying genes that are intolerant of variation in the general population, meaning they are much more likely to be involved in causing disease.
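The constraint idea in that last sentence can be shown with a toy calculation. This is not the paper's actual model, which derives expected variant counts per gene from per-base mutation rates and sequencing coverage; the gene names and counts below are invented purely to illustrate the logic of comparing observed with expected variation.

```python
import math

# Toy sketch of gene constraint: compare the number of variants observed
# in a gene against the number a neutral mutational model would predict,
# and flag genes with a large deficit. NOT the ExAC paper's actual model;
# gene names and counts here are invented for illustration.

def constraint_z(observed, expected):
    """Chi-squared-style deviation score: positive means fewer variants
    than expected, i.e. the gene appears intolerant of variation."""
    return (expected - observed) / math.sqrt(expected)

genes = {
    "GENE_A": (95, 100.0),  # roughly as many variants as expected
    "GENE_B": (20, 100.0),  # strong deficit: candidate for intolerance
}

for name, (obs, exp) in genes.items():
    print(f"{name}: obs/exp = {obs / exp:.2f}, Z = {constraint_z(obs, exp):+.1f}")
```

The intuition is simple: if a gene has accumulated far fewer variants in 60,000 people than chance alone predicts, natural selection is probably removing those variants, which makes the gene a strong candidate for involvement in disease.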

I’m not going to describe these findings in detail, because the paper is open access and you can read it yourself – there is also a wonderful summary in Nature by Jay Shendure. However, I did want to highlight a few key things that you can’t learn from the paper, including the story of how this project actually happened.

How ExAC came about

In 2012 my (brand new) lab started sequencing exomes from patients with rare muscle diseases. Two things we found almost immediately were that (1) we desperately needed to interpret our variants in the context of large collections of variation from the “normal” population, and (2) the existing resources simply weren’t sufficient for this job. Both 1000 Genomes and the Exome Variant Server were fantastic resources, but neither was large enough to provide insight into the extremely rare variants we were seeing in our patients.

At around the same time, a few key things fell into place. Firstly, Mark DePristo’s GATK team at the Broad Institute had fancy new software that could, at least in theory, allow the generation of unified variant calls across tens of thousands of samples, and they were keen to try it out at scale. Secondly, it became clear that there was a critical mass of colleagues at the Broad Institute who had in combination sequenced over 20,000 exomes, and who were willing to make that data available for a large shared effort. And finally, I had been joined in my lab by a remarkably talented postdoc, Monkol Lek, who had the informatics background required to coordinate and analyze a large-scale sequencing resource. And so we set about building one.

Over the next 18 months, we worked closely with the Broad’s sequence production team to produce at least five call sets, starting with a pilot run of just over 13,000 samples. In each case we encountered either intractable computational scaling problems, or quality control issues in the final call set. Over the same time the number of samples continued to grow, and we became steadily more ambitious with each failure. In late 2013 we made an attempt at variant calling across nearly 57,000 samples, which took many months; depressingly, it produced a product with unusably high error rates. In early 2014 I was genuinely unsure whether this project would ever be successful.

We were ultimately saved by some truly remarkable engineering work on the Broad’s production variant-calling pipeline, spearheaded by a team guided by Tim Fennell and Eric Banks. Out of the ashes of the failed 57K call set rose a new, much larger effort, using a new pipeline (for aficionados, this marked the switch from UnifiedGenotyper to HaplotypeCaller) – and to our delight, this new tool proved both enormously faster and much more accurate than its predecessor. In June 2014 we suddenly had a call set that passed every quality control test we threw at it. And while the technical delays had been agonizing, that 18-month cloud turned out to have a massive silver lining: over this period of delays the number of exomes available to us had grown, staggeringly, to more than 90,000 samples. This June call set, after removing nearly a third of samples (for good reasons, outlined below), became what is now the public release of ExAC.

The awesome consortium

It’s worth being very explicit about this: ExAC didn’t sequence any exomes. The resource only exists because two dozen investigators from more than 20 consortia (see the full list here) chose to make their raw data freely available to create a resource for the benefit of the wider community. I am personally still absolutely astonished that so many of our colleagues were not only willing, but actively enthusiastic about making their data available to the world, and I’m forever indebted to them for this.

I won’t name every single one of these wonderful individuals here (again, here’s the full list) but I did want to single out a few people who really went above and beyond to support the resource, even when it was barely more than an idea, especially David Altshuler, Mark Daly, Sek Kathiresan, Mark McCarthy and Pat Sullivan. Every one of these people stood up for ExAC multiple times. As a junior faculty member proposing a project that was basically insanely ambitious, I have been incredibly lucky to have support from such a group.

There are many other awesome people in this story. We’ll get back to them later.

Going public

So, in June 2014 we had the biggest collection of exome sequencing data that had ever been assembled. One thing that was never in question was that we wanted to get this data set out to the community as quickly as possible – that was the whole purpose of ExAC – and so we set ourselves the task of preparing for a public launch at the American Society of Human Genetics meeting in October, where I was scheduled to give a talk. Over the next couple of months, increasingly stringent QC run by Monkol and other members of the team convinced us that this was a call set of excellent quality, so everything was in place for an October announcement.

But we didn’t simply want to dump a raw text file of sequence variants on the web – we needed to make this resource available in a way that anyone could use without needing bioinformatics expertise. And so I asked another postdoc in the lab, Konrad Karczewski, if he could put his considerable web development experience to work in building such a website.

And so, in the months leading up to ASHG, the now-iconic ExAC browser took shape. This process reached its climax at approximately 3am on October 18th, the morning of my talk, in a San Diego AirBnB, where Konrad, Brett Thomas and Ben Weisburd worked to rescue the website from a catastrophic database failure while I sat nearby frantically editing my slides and preparing for the possibility of a completely non-functional website. Magically, everything worked at the last minute – and the site has now been stable for almost two years and 5.2 million page views, testament to Konrad’s development of a resource that was both robust and user-friendly. Last year, Ben Weisburd added a new feature that allowed users to view the raw read support for individual variants – which profoundly changed my experience of the data set, and has been very popular with users.

Analysis time

Once the chaos of the ASHG launch was over, we got back to work on actually making sense of this massive data set. The next year was incredibly fun, as we crawled through the intricacies of the data set and figured out what we could learn from it about human variation, biology and disease.

There are few greater joys as a PI than watching a team of fantastic young scientists work together on an exciting project – and this project was intensely collaborative, involving most of the members of my lab as well as many others. Monkol of course played a key role in many analyses. Konrad Karczewski led the analyses on mutational recurrence and loss-of-function variants. Eric Minikel led much of the work surrounding the clinical interpretation of ExAC variants, with strong support from Anne O’Donnell Luria and James Ware. And Kaitlin Samocha, a ludicrously talented graduate student from Mark Daly’s lab, drove the analysis of loss-of-function constraint that ended up becoming a headline result for the paper. Along the way, many other trainees and scientists contributed in ways that strengthened the project. Tens of thousands of messages were posted in our lab’s #exac_papers Slack channel, including dozens of versions of every figure that ended up in the paper; and slowly, magically, an unstructured Google Doc morphed into a real scientific manuscript.

(Other papers came together, too. We published Eric’s analysis of variation in the PRNP gene back in February – and today, simultaneous with the main ExAC paper, ExAC collaborators have published work showing the discovery of copy number variants across the ExAC individuals as well as the use of ExAC to understand the impact of variants in severe heart muscle disease.)

Open everything

One of the things I’m proudest of about ExAC is the way we systematically set about making our data, tools and results available to the public as quickly and easily as possible. The data set was pushed out to the public, with no access restrictions, through Konrad’s browser basically as soon as we had a version we felt we could trust. All of the software we used to produce the ExAC paper figures is freely available, thanks to the hard work and dedication of Konrad Karczewski, Eric Minikel and others in cleaning up the code and preparing it for release.

We also made our analysis results available as a preprint back in November last year – and the final Nature paper, with many improvements thanks to reviewers and a wonderful editor, is fully open access.

The benefit of this approach, beyond the warm fuzzy feeling of releasing stuff that other people can use, is that we got rapid feedback from the community – this was extremely useful in spotting glitches in the data, as well as errors in the manuscript preprint. We intend to continue with this open approach for the project, and we hope this can also serve as a model for other consortia.

What comes next?

The work of ExAC is far from over. Later this year we’ll be announcing the release of the second version of this resource, which we hope will exceed 120,000 exomes. We’ll also be releasing a whole-genome version of the resource that provides insight into variants outside protein-coding regions. And most importantly, the science will continue: my group, our collaborators, and researchers and clinicians around the world will continue to use this resource to diagnose rare disease patients, to understand the distribution of variation across the human population, and to explore human biology and disease.

Special thanks

I’ve thanked a lot of people in the post above, which is only appropriate, but there are many more who deserve it. Huge numbers of people were involved in putting this resource together.

One important thing to note about ExAC is that it was effectively unfunded for most of its existence. During this time, the immense costs of data storage and computing were covered (invisibly) by institutional support from the Broad, as were many of the personnel who contributed to the project. Many people in the team now known as the Data Sciences Platform at the Broad were involved in building the pipeline to make this call set possible, running it (over and over again!) on enormous numbers of samples, and helping us to make sense of the results – especially including Amy Levy-Moonshine, Khalid Shakir, Ryan Poplin, Laura Gauthier, Valentin Ruano-Rubio, Yossi Farjoun, Kathleen Tibbetts, Charlotte Tolonen, Tim Fennell, and Eric Banks. The Broad’s Genomics Platform generated 90% of the sequencing data in ExAC, and helped in many ways to make the project possible – huge thanks to Stacey Gabriel for her constant support. Namrata Gupta and Christine Stevens played major roles in assembling massive lists of sequenced individuals, and seeking data usage permissions. Andrea Saltzmann and Stacey Donnelly were our ethics gurus, who helped us get the IRB permissions needed for this resource to go live.

In September 2014 we received a U54 grant from NIDDK that partially defrays the ongoing cost of the project, but we are still absolutely dependent on the amazing and ongoing support of this Broad team. The sheer amount of resources and time donated to this effort is testament to the Broad’s genuine commitment to producing resources that help other people; I couldn’t be prouder to be part of this organization today.

And finally, I wanted to give special thanks to two incredibly patient spouses – my wife Ilana and Monkol’s wife Angela – for tolerating our prolonged absences and frequent distraction over the last few years. We appreciate it more than we say.

A personal journey to quantify disease risk

We have a new paper out today in Science Translational Medicine that describes our application of the massive Exome Aggregation Consortium database to understanding the variation in one specific gene: the PRNP gene, which encodes the prion protein.

This project was a special one for a number of reasons. Firstly, there’s an incredibly strong personal motivation behind this work, which you can read much more about in a blog post by lead author Eric Minikel. Secondly, it’s a clear demonstration of the way in which we can use large-scale reference databases to interpret genetic variation, including flagging some variants as non-causal or having mild effects. Thirdly, as discussed in the accompanying perspective by Robert Green and colleagues, this work is already having clinical impact by changing the diagnosis for people with families affected by prion disease. And finally, the discovery of “knockout” variants in PRNP in healthy individuals is tantalizing evidence that inhibiting this gene in mutation carriers is likely to be a safe therapeutic approach.

The paper is of course open access, so you can read the details yourself. Huge congratulations to Eric for pulling this paper together!

Guidelines for finding genetic variants underlying human disease

This post by Daniel MacArthur and Chris Gunter was first posted on Genomes Unzipped.

New DNA sequencing technologies are rapidly transforming the diagnosis of rare genetic diseases, but they also carry a risk: by allowing us to see all of the hundreds of “interesting-looking” variants in a patient’s genome, they make it potentially easy for researchers to spin a causal narrative around genetic changes that have nothing to do with disease status. Such false positive reports can have serious consequences: incorrect diagnoses, unnecessary or ineffective treatment, and reproductive decisions (such as embryo termination) based on spurious test results. In order to minimize such outcomes the field needs to decide on clear statistical guidelines for deciding whether or not a variant is truly causally linked with disease.

In a paper in Nature this week we report the consensus statement from a workshop sponsored by the National Human Genome Research Institute, on establishing guidelines for assessing the evidence for variant causality.