Konrad Karczewski and Laurent Francioli
Today, we are pleased to announce the formal release of the genome aggregation database (gnomAD). This release comprises two callsets: exome sequence data from 123,136 individuals and whole genome sequencing from 15,496 individuals. Importantly, in addition to an increased number of individuals of each of the populations in ExAC, we now additionally provide allele frequencies across over 5000 Ashkenazi Jewish (ASJ) individuals. The population breakdown is detailed in the table below.
|OTH||Other (population not assigned)||491||2,743||3,234|
While the data from gnomAD have been available on the beta gnomAD browser for several months (since our announcement at ASHG last October), this release corresponds to a far more carefully filtered data set that is now available through the gnomAD browser, as well as a sites VCF and a Hail sites VDS for download and analysis at http://gnomad.broadinstitute.org/downloads.
In this blog post, we first describe the major changes from ExAC that will be apparent to users. We then dig into the details of the sample and variant QC done on the gnomAD data. It is important to note that while the alignment and variant calling process was similar to that of ExAC, the sample and variant QC were fairly different. In particular, the vast majority of processing and QC analyses were performed using Hail. This scalable framework was crucial for processing such large datasets in a reasonable amount of time, allowing for exploration of the data at a much more rapid scale. We’d like to extend a special thanks to the Hail team for their support throughout the process, especially Tim Poterba and Cotton Seed.
Important usage notes
These few notes describe important changes from the ExAC dataset.
Changes to the browser
The gnomAD browser is very similar to the ExAC browser with a few modifications to support integration of genome data. The coverage plot now has a green line to display genome coverage. In the variant table, a
source column indicates whether the variant belongs to the exome callset (), genome callset (), or both callsets ( and ). Variants present in both datasets have combined summary statistics (allele count, allele frequency, etc). We have added checkboxes for specifying which data to display on the page. Users can choose to include or exclude SNPs, indels, exome variants, genome variants, or filtered (non-pass) variants and the table statistics update accordingly. If filtered (non-pass) variants are included, the data source icons will be distinguished with reduced opacity and a dotted border to indicate lower reliability (). Variants present in both datasets have combined summary statistics (allele count, allele frequency, etc). We have added checkboxes for specifying which data to display on the page. A variant’s filter status can be discovered by hovering over the icon (e.g.
RF). Be careful when including filtered variants: some poor quality variants (and summary statistics) will be added to the table. This feature should be used with caution.
On the variant page, the population frequency table has been modified such that exome and genome population statistics can be included or excluded. By default, filtered variants are not included in the population table but can be added by clicking the checkbox.
There is a new “report variant” button on each variant page. Clicking this button opens a submission form where users can express concerns about a variant (such as low quality or clinical implausibility). Users can also request more information about variant carriers. Generally phenotype data is not available for samples in gnomAD as many of the individuals who have contributed data to gnomAD were not fully consented for phenotype data sharing. In the future, it may become available for a small subset of samples with appropriate consents. This is an experimental feature. Due to limited personnel, the gnomAD team may not be able to respond to your submission.
Finally, read data can now be viewed for both exome and genome data (restricted to exome calling intervals only), with up to 3 IGV.js tracks provided for each sample type (heterozygous, homozygous, and hemizygous). If there are more than 3 individuals for one of these sample types, the 3 individuals shown are those that have the highest GQ.
Allele count default is high quality genotypes only
We have changed the default allele count to include only genotypes passing our adjusted threshold (
GQ >= 20,
DP >= 10, and have now added: allele balance > 0.2 for heterozygote genotypes), which was previously in
AC_Adj. The raw allele counts are now available in
AC_raw. All sites with no high quality genotype (
AC = 0) are marked as filtered using the
No chromosome Y in genomes
There is no chromosome Y coverage / calls in the gnomAD genomes, because reasons.
Changes in the VCF
By and large, the release VCFs look and feel similar to the ExAC v1 VCF. Exomes and genomes are distributed separately. We have removed the Het fields as they were hard to parse and instead provide genotype counts (GC), or a G-indexed array of counts of individuals per genotype. As mentioned in the Variant QC section, we have fundamentally switched from a site-level filtering strategy to an allele-specific filtering strategy. For this reason, we now have a field named
AS_FilterStatus in the
INFO column that specifies the filtering status for each allele separately. If any allele is
PASS, then the entire site is
PASS (unless it is in a
SEGDUP region, or fails the
InbreedingCoeff filter). Finally, we provide VDS format for users wishing to perform analyses in Hail.
Alignment and variant QC
The alignment and variant calling was performed on exomes and genomes separately using a standardized BWA-Picard pipeline on the human genome build 37, followed by joint variant calling across each whole callset using GATK HaplotypeCaller (v3.4). Up to this point, the pipeline used is very similar to the ExAC pipeline; however, after the variant calling step, there are a number of differences that are worth noting here.
Sample QC process
The sample filtering process was similar to that of ExAC, where we first excluded samples based on sequencing and raw variant call quality metrics, then removed related individuals as to keep only samples more distant than 2nd degree relatives. From this set, we assigned ancestry and finally excluded samples based on raw variant call QC metrics stratified by sequencing platform and ancestry. These steps are detailed below.
Sequencing and alignment quality metrics were computed using Picard tools and raw quality metrics were computed using GATK, PLINK, and Hail. We have excluded samples based on the following thresholds:
- Both exomes and genomes
- High contamination: freemix > 5%
- Excess chimeric reads: > 5% chimeric reads
- Excess heterozygosity: F < -0.05
- Exomes only
- No coverage: a set of common test SNPs on chromosome 20 had no coverage
- Some TCGA samples (e.g. known tumor) were removed
- Ambiguous sex: fell outside of:
- Male: Proportion of heterozygous sites on chromosome X (non-PAR regions) < 0.2 and chromosome Y coverage > 0.1x
- Female: Proportion of heterozygous sites on chromosome X (non-PAR regions) > 0.2 and chromosome Y coverage < 0.1x
- Genomes only
- Low depth: Mean depth < 15x
- Small insert size: Median insert size < 250bp
- Ambiguous sex: fell outside of:
- Male: chromosome X F-stat > 0.8
- Female: chromosome X F-stat < 0.4
We used KING to infer relatedness amongst individuals that passed the hard filters thresholds. KING was run on all samples (exomes and genomes) together on a set of well behaved sites selected as follows:
- Autosomal, bi-allelic SNVs only (excluding CpG sites)
- Call rate across all samples > 99%
- Allele frequency > 0.1%
- LD-pruned to r2 = 0.1
We then selected the largest set of samples such that no two individuals are 2nd degree relative or closer. When we had to select between two samples to be kept in, we used the following criteria to select which one to keep:
- Genomes had priority over exomes
- Within genomes
- PCR free had priority over PCR+
- In a parent/child pair, the parent was kept
- Ties broken by the sample with highest mean coverage
- Within exomes
- Exomes sequenced at the Broad Institute had priority over others
- More recently sequenced samples were given priority (when information available)
- In a parent/child pair, the parent was kept
- Ties broken by the sample with the higher percent bases above 20x
We computed principal components (PCs) on the same well-behaved bi-allelic autosomal SNVs as described above on all unrelated samples (both exomes and genomes). We then leveraged a set of 53,044 samples for which we knew the ancestry to train a random forests classifier using the PCs as features. We then assigned ancestry to all samples for which the probability of that ancestry was > 90% according to the random forest model. All other samples, were assigned the other ancestry (OTH). In addition, the 34 south asian (SAS) samples among the genomes were assigned to other as well due to their low number.
Stratified QC filtering
Our final sample QC step was to exclude any sample falling outside 4 median absolute deviation (MAD) from the median of any of the given metrics:
- number of deletions
- number of insertions
- number of SNVs
- insertion : deletion ratio
- transition : transversion (TiTv) ratio
- heterozygous : homozygous ratio
Because these metrics are sensitive to population and sequencing platform, the distributions were computed separately for each population / platform group. The population used are those described above. The platforms were:
- Genome, PCR free
- Genome, PCR+
- Exome Agilent
- Exome ICE
- Exome ICE, 150bp read-length
- Exome, Nimblegen
- Exome, unknown capture kit (for these, the samples were removed if their metrics fell outside of the extreme bounds of all other exome datasets)
In some cases, certain projects had individuals that were sequenced on multiple platforms. To disentangle the platforms used, we calculated the missingness rate per individual per capture platform, and performed principal components analysis using these values. The first 3 PCs displayed signatures of capture platform (PC1 distinguished the Agilent platform from Illumina, PC2 distinguished samples on the Nimblegen platform, while PC3 represented 76 bp reads vs 150 bp reads), so these were used in a decision tree model to apply to those samples that fell into these projects with multiple platforms.
For the variant QC process, we used the final set of high quality release samples as well as additional family members forming trios (1,529 trios in exomes, 412 in genomes) that were filtered at the relatedness phase, in order to compute metrics such as transmission and Mendelian violations. Variant QC was performed on the exomes and genomes separately but using the same pipeline.
The variant QC process departed significantly from the ExAC pipeline, since this time round we decided (after several weeks of assessment) not to use the standard GATK Variant Quality Score Recalibration (VQSR) approach for filtering variants, which turns out not to perform especially well once sample sizes exceed 100,000. Instead, we developed a combination of a random forest classifier and hard filters, described below, which performs much better for the very rare variants uncovered by gnomAD.
Random forests classifier
One of the major limitations of the ExAC dataset (and, indeed, virtually all other large-scale call sets) was the fact that the VQSR filter was applied at the site (locus) level rather than the variant (allele) level. For instance, if a (say, artifactual) indel was found at the same site as a common SNP, the quality metrics would be combined. In this case, the quality metrics for the indel might be artificially inflated (or decreased for the SNP). This issue has become increasingly problematic with larger number of samples, and thus a higher density of variation – in the gnomAD dataset 22.3% of the variants in exomes and 12.8% of the variants in genomes occur at multi-allelic sites. To address these issues, we used an allele-specific random forest classifier trained implemented in Hail / pyspark with 500 trees of max. depth 5. The features and labels for the classifier are described below.
As the filtering process was performed on exomes and genomes separately, users will notice that for some variants, we have 2 filter statuses which may be discordant in some cases. In particular, there are 144,941 variants filtered out in exomes, but passing filters in genomes and 290,254 variants for the reverse. We’ll be investigating these variants in future analyses, but for now users should just treat them with caution.
Random forests features
While our variant calls VCF only contained site-specific metrics, Hail enabled us to compute some genotype-driven allele-specific metrics. The table below summarizes the features used in the random forests model, along with their respective weight in the model generated.
|Feature||Description||Site / Allele Specific||Genomes Feature Weight||Exomes Feature Weight|
|Allele type||SNV, Indel, complex||Allele||0.00123||0.0025|
||Median depth across variant carriers||Allele||0.0449||0.0513|
||Median genotype quality across variant carriers||Allele||0.1459||0.1520|
||Median reference dosage across variant carriers||Allele||0.4154||0.3327|
||Median allele balance among heterozygous carriers||Allele||0.2629||0.3559|
|Number of alleles||Total number of alleles present at the site||Site||0.0037||0.0031|
|Mixed site||Whether more than one allele type is present at the site||Site||0.0003||0.0001|
|Overlapping spanning deletion||Whether there is a spanning deletion (
||Z-score from Wilcoxon rank sum test of Alt vs. Ref read mapping qualities||Site*||0.0054||0.0067|
||Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias||Site*||0.0097||0.0101|
||Symmetric Odds Ratio of 2×2 contingency table to detect strand bias||Site*||0.0958||0.0482|
*: These features should ideally be computed for each allele separately but they were generated for each site in our callset and it was impractical to re-generate them for each allele.
Note that since random forests doesn’t tolerate missing data, we have naively imputed all missing values using the median value for that feature. In particular, this was done for metrics that are only calculated among heterozygous individuals (MQRankSum, ReadPosRankSum, allele balance), while the handful of variants missing other features were simply discarded.
Random forests training examples
Our true positive set was composed of: SNVs present on the Omni2.5 genotyping array and found in 1000 Genomes (available in the GATK bundle), indels present in the Mills and Devine indel dataset (available in the GATK bundle), and transmitted singletons (variants found in two and only two individuals, which were a parent-offspring pair). Our false positive set comprised of variants failing traditional GATK hard filters:
QD < 2 ||
FS > 60 ||
MQ < 30).
Random forests filtering threshold
In order to set a threshold for the
RF filter in the release, we used the following metrics to determine a cutoff on the random forests model output, to build what we believed to be a high quality set of variants:
- Precision / recall against two well characterized samples:
- Number of singleton Mendelian violations in the trios (N=1 amongst all samples and is found in a child)
- Singleton transition : transversion (TiTv) ratio for SNVs
- Singleton insertion : deletion ratio for indels
For exomes, our filtration process removes 18.5% of SNVs (RF probability >= 0.1) and 22.8% of indels (RF probability >= 0.2). For genomes, we filtered 11.0% of SNVs (RF probability >= 0.4) and 30.9% of indels (RF probability >= 0.4).
In addition to the random forests filter described above, we also filtered all variants falling in low complexity (
LCR filter) and segmental duplication (
SEGDUP filter) regions. We also filtered variants with low Inbreeding coefficient (< -0.3;
InbreedingCoeff filter). Finally, we filtered all variants where no individual had a high quality non-reference genotype (
GQ >= 20,
DP >= 10,
AB > 0.2 for heterozygotes;
Any data set of this scale requires a huge team to make it possible. We want to thank a few specific people here, while also acknowledging the many, many people behind the scenes at the Broad Institute who built the infrastructure and resources to enable this work.
Massive thanks, as always, to everyone who allowed their data to be used in this project. The gnomAD consortium now includes more than 100 principal investigators, who are listed here. We’re grateful to all of them for allowing their data to be used for this project.
Jessica Alföldi played a critical role in assembling the sample list and gathering the permissions required to ensure that each and every sample in the project was authorized to be included – a seriously daunting task! Fengmei Zhao contributed to the assembly of the underlying data, and Kristen Laricchia and Monkol Lek played important roles in sample and variant quality control.
As described above, we’ve made substantial changes to the browser for the gnomAD release (with more cool stuff coming soon!) – this is almost entirely thanks to Matthew Solomonson, who’s done a brilliant job on short timelines to make the system work. Ben Weisburd also contributed, and generated all of the read visualizations you can see on the variant pages.
The pipelines and infrastructure required to align and call variants, and the actual generation of the massive exome and genome call sets, are entirely the work of the heroic men and women of the Broad’s Data Sciences Platform. Many people deserve thanks here, but we wanted to single out Eric Banks, Kathleen Tibbetts, Charlotte Tolonen, Yossi Farjoun, Laura Gauthier, Dave Shiga, Jose Soto, George Grant, Sam Novod, Valentin Ruano-Rubio, and Bradley Taylor.
To complete quality control (many, many times!) on a call set of this scale in a timely fashion we needed serious computational grunt. We benefited enormously from the availability of Hail, an open-source Spark-based framework for genomic analysis – put simply, we would have been screwed without it. Massive thanks to the Hail team for building such a tremendous resource, as well as for their incredible responsiveness to each new bug report and feature request, especially Tim Poterba and Cotton Seed. We also needed good hardware to run the analysis on: over the course of this work we moved back and forth between a Cray Urika and the Google Cloud. Thanks to the Cray team for their hard work whenever the machine went down and we needed it back up fast.
And finally, special thanks to the Broad Institute Genomics Platform as a whole, who generated 90% of the data used in gnomAD, and also provided critical computing resources required for this work.
Today we are celebrating the official publication of the Exome Aggregation Consortium (ExAC) paper in Nature – marking the end of a phase in this project that has involved most of the members of my lab (and many, many others beyond) for a large chunk of the last few years. This official publication is an opportune time to reflect on how ExAC came to be, and the impact it’s had on us and the wider community.
First, some background
Exome sequencing is a very cost-effective approach that allows us to look with high resolution at just the 1-2% of the human genome that codes for protein – these are the parts we understand the best, and also the parts where the vast majority of severe disease-causing mutations are found. Because exome sequencing is so powerful it’s been applied to tens of thousands of patients with rare, severe diseases such as muscular dystrophy and epilepsy. However, a key challenge when sequencing patients is that everyone carries tens of thousands of genetic changes, and we need a database of “normal” variation that tells us which of those changes are seen in healthy people, and how common they are.
The goal of this project was to create such a database, at massive scale. Briefly, we pulled together DNA sequencing data from the largest collection of humans ever assembled – more than 60,000 people in total, from more than two dozen different disease projects, generously donated by researchers from around the world – to look at the distribution of genetic variation across the exome, and to make it publicly available for anyone to use. This project became known as the Exome Aggregation Consortium, or ExAC.
This project involved assembling nearly a petabyte (a thousand terabytes) of raw sequencing data and putting all of that through the same processing pipeline to produce a set of variant calls that were the same across the whole project. At the end of that process we produced a summary file, that basically is a list of all of the 10 million variants we discovered in the project, and how common they are across different populations – and made that publicly available back in 2014. That resource has now been received over 5.2 million page views by researchers around the world, mostly for the interpretation of genetic changes found in rare disease patients.
In this paper, we describe how this resource can be used in a variety of ways to understand human genetic variation, particularly focusing on the very, very rare DNA changes – those seen in less than one in a thousand people – most of which had never been observed until this project. We show that our new database is much more powerful than existing resources for filtering the variants found in rare disease patients, making it a much more effective tool for clinical labs to use in diagnosis. And importantly, we also show that we are able to use this database to identify genes that have fewer variants than we would expect to see by chance – basically, identifying genes that are intolerant of variation in the general population, meaning they are much more likely to be involved in causing disease.
I’m not going to describe these findings in detail, because the paper is open access and you can read it yourself – there is also a wonderful summary in Nature by Jay Shendure. However, I did want to highlight a few key things that you can’t learn from the paper, including the story of how this project actually happened.
How ExAC came about
In 2012 my (brand new) lab started sequencing exomes from patients with rare muscle diseases. Two things we found almost immediately were that (1) we desperately needed to interpret our variants in the context of large collections of variation from the “normal” population, and (2) the existing resources simply weren’t sufficient for this job. Both 1000 Genomes and the Exome Variant Server were fantastic resources, but neither was large enough to provide insight into the extremely rare variants we were seeing in our patients.
At around the same time, a few key things fell into place. Firstly, Mark DePristo’s GATK team at the Broad Institute had fancy new software that could, at least in theory, allow the generation of unified variant calls across tens of thousands of samples, and they were keen to try it out at scale. Secondly, it became clear that there was a critical mass of colleagues at the Broad Institute who had in combination sequenced over 20,000 exomes, and who were willing to make that data available for a large shared effort. And finally, I had been joined in my lab by a remarkably talented postdoc, Monkol Lek, who had the informatic background required to coordinate and analyze a large-scale sequencing resource. And so we set about building one.
Over the next 18 months, we worked closely with the Broad’s sequence production team to produce at least five call sets, starting with a pilot run of just over 13,000 samples. In each case we encountered either intractable computational scaling problems, or quality control issues in the final call set. Over the same time the number of samples continued to grow, and we became steadily more ambitious with each failure. In late 2013 we made an attempt at variant calling across nearly 57,000 samples, which took many months; depressingly, it produced a product with unusably high error rates. In early 2014 I was genuinely unsure whether this project would ever be successful.
We were ultimately saved by some truly remarkable engineering work on the Broad’s production variant-calling pipeline, spearheaded by a team guided by Tim Fennell and Eric Banks. Out of the ashes of the failed 57K call set rose a new, much larger effort, using a new pipeline (for aficionados, this marked the switch from UnifiedGenotyper to HaplotypeCaller) – and to our delight, this new tool proved both enormously faster and much more accurate than its predecessor. In June 2014 we suddenly had a call set that passed every quality control test we threw at it. And while the technical delays had been agonizing, that 18-month cloud turned out to have a massive silver lining: over this period of delays the number of exomes available to us had grown, staggeringly, to more than 90,000 samples. This June call set, after removing nearly a third of samples (for good reasons, outlined below), became what is now the public release of ExAC.
The awesome consortium
It’s worth being very explicit about this: ExAC didn’t sequence any exomes. The resource only exists because two dozen investigators from more than 20 consortia (see the full list here) chose to make their raw data freely available to create a resource for the benefit of the wider community. I am personally still absolutely astonished that so many of our colleagues were not only willing, but actively enthusiastic about making their data available to the world, and I’m forever indebted to them for this.
I won’t name every single of these wonderful individuals here (again, here’s the full list) but I did want to single out a few people who really went above and beyond to support the resource, even when it was barely more than an idea, especially David Altshuler, Mark Daly, Sek Kathiresan, Mark McCarthy and Pat Sullivan. Every one of these people stood up for ExAC multiple times. As a junior faculty member proposing a project that was basically insanely ambitious, I have been incredibly lucky to have support from such a group.
There are many other awesome people in this story. We’ll get back to them later.
So, in June 2014 we had the biggest collection of exome sequencing data that had ever been assembled. One thing that was never in question was that we wanted to get this data set out to the community as quickly as possible – that was the whole purpose of ExAC – and so we set ourselves the task of preparing for a public launch at the American Society of Human Genetics meeting in October, where I was scheduled to give a talk. Over the next couple of months, increasingly stringent QC run by Monkol and other members of the team convinced us that this was a call set of excellent quality, so everything was in place for an October announcement.
But we didn’t simply want to dump a raw text file of sequence variants on the web – we needed to make this resource available in a way that anyone could use without needing bioinformatics expertise. And so I asked another postdoc in the lab, Konrad Karczewski, if he could put his considerable web development experience to work in building such a website.
And so, in the months leading up to ASHG, the now-iconic ExAC browser took shape. This process reached its climax at approximately 3am on October 18th, the morning of my talk, in a San Diego AirBnB, where Konrad, Brett Thomas and Ben Weisburd worked to rescue the website from a catastrophic database failure while I sat nearby frantically editing my slides and preparing for the possibility of a completely non-functional website. Magically, everything worked at the last minute – and the site has now been stable for almost two years and 5.2 million page views, testament to Konrad’s development of a resource that was both robust and user-friendly. Last year, Ben Weisburd added a new feature that allowed users to view the raw read support for individual variants – which profoundly changed my experience of the data set, and has been very popular with users.
Once the chaos of the ASHG launch was over, we got back to work on actually making sense of this massive data set. The next year was incredibly fun, as we crawled through the the intricacies of the data set and figured out what we could learn from it about human variation, biology and disease.
There are few greater joys as a PI than watching a team of fantastic young scientists work together on an exciting project – and this project was intensely collaborative, involving most of the members of my lab as well as many others. Monkol of course played a key role in many analyses. Konrad Karczewski led the analyses on mutational recurrence and loss-of-function variants. Eric Minikel led much of the work surrounding the clinical interpretation of ExAC variants, with strong support from Anne O’Donnell Luria and James Ware. And Kaitlin Samocha, a ludicrously talented graduate student from Mark Daly’s lab, drove the analysis of loss-of-function constraint that ended up becoming a headline result for the paper. Along the way, many other trainees and scientists contributed in ways that strengthened the project. Tens of thousands of messages were posted in our lab’s #exac_papers Slack channel, including dozens of versions of every figure that ended up in the paper; and slowly, magically, an unstructured Google Doc morphed into a real scientific manuscript.
(Other papers came together, too. We published Eric’s analysis of variation in the PRNP gene back in February – and today, simultaneous with the main ExAC paper, ExAC collaborators have published work showing the discovery of copy number variants across the ExAC individuals as well as the use of ExAC to understand the impact of variants in severe heart muscle disease.)
One of the things I’m proudest of about ExAC is the way we systematically set about making our data, tools and results available to the public as quickly and easily as possible. The data set was pushed out to the public, with no access restrictions, through Konrad’s browser basically as soon as we had a version we felt we could trust. All of the software we used to produce the ExAC paper figures is freely available, thanks to the hard work and dedication of Konrad Karczewski, Eric Minikel and others in cleaning up the code and preparing it for release.
We also made our analysis results available as a preprint back in November last year – and the final Nature paper, with many improvements thanks to reviewers and a wonderful editor, is fully open access.
The benefit of this approach, beyond the warm fuzzy feeling of releasing stuff that other people can use, is that we got rapid feedback from the community – this was extremely useful in spotting glitches in the data, as well as errors in the manuscript preprint. We intend to continue with this open approach for the project, and we hope this can also serve as a model for other consortia.
What comes next?
The work of ExAC is far from over. Later this year we’ll be announcing the release of the second version of this resource, which we hope will exceed 120,000 exomes. We’ll also be releasing a whole-genome version of the resource that provides insight into variants outside protein-coding regions. And most importantly, the science will continue: my group, our collaborators, and researchers and clinicians around the world will continue to use this resource to diagnose rare disease patients, to understand the distribution of variation across the human population, and to explore human biology and disease.
I’ve thanked a lot of people in the post above, which is only appropriate, but there are many more who deserve it. Huge numbers of people were involved in putting this resource together.
One important thing to note about ExAC is that it was effectively unfunded for most of its existence. During this time, the immense costs of data storage and computing were covered (invisibly) by institutional support from the Broad, as were many of the personnel who contributed to the project. Many people in the team now known as the Data Sciences Platform at the Broad were involved in building the pipeline to make this call set possible, running it (over and over again!) on enormous numbers of samples, and helping us to make sense of the results – especially including Amy Levy-Moonshine, Khalid Shakir, Ryan Poplin, Laura Gauthier, Valentin Ruano-Rubio, Yossi Farjoun, Kathleen Tibbetts, Charlotte Tolonen, Tim Fennell, and Eric Banks. The Broad’s Genomics Platform generated 90% of the sequencing data in ExAC, and helped in many ways to make the project possible – huge thanks to Stacey Gabriel for her constant support. Namrata Gupta and Christine Stevens played major roles in assembling massive lists of sequenced individuals, and seeking data usage permissions. Andrea Saltzmann and Stacey Donnelly were our ethics gurus, who helped us get the IRB permissions needed for this resource to go live.
In September 2014 we received a U54 grant from NIDDK that partially defrays the ongoing cost of the project, but we are still absolutely dependent on the amazing and ongoing support of this Broad team. The sheer amount of resources and time donated to this effort is testament to the Broad’s genuine commitment to producing resources that help other people; I couldn’t be prouder to be part of this organization today.
And finally, I wanted to give special thanks to two incredibly patient spouses – my wife Ilana and Monkol’s wife Angela – for tolerating our prolonged absences and frequent distraction over the last few years. We appreciate it more than we say.
As we were organizing analyses for the ExAC flagship paper, we were inspired by Titus Brown’s manifesto on reproducibility. As with most collaborative papers, we had a core team of analysts, each responsible for their subset of analyses and figure panels. Part of the way through, we realized there were many shared features that we wanted to keep consistent across analyses, particularly variant filtering strategies and annotations. We made the decision to organize the code in a central github repository, which the team members could access and edit as analyses progressed. Today, we are happy to release this code, which you can use to reproduce every figure in the paper!
How to use
The code and miscellaneous data are available at https://github.com/macarthur-lab/exac_2015.
git clone email@example.com:macarthur-lab/exac_2015.git cd exac_2015
Then, start R, and run the following. Warning: this process requires a decent amount of memory, ideally on a machine that has at least 16G RAM (newer Macbook Pros are fine, but not Macbook Airs). On Linux and machines with stricter memory policies, we recommend 32G or even 64G to allow for overhead in loading.
source('exac_constants.R') exac = load_exac_data()
##  "Using locally loaded exac file. Run exac = load_exac_data(reload=T) to update to newest file." ##  "Step (1/6): Loading data..." ##  "Now processing data (2/6): Determining sequence contexts..." ##  "(3/6): Computing allele frequencies..." ##  "(4/6): Calculating bad regions of the genome..." ##  "(5/6): Determining what to _use_..." ##  "(6/6): Parsing SIFT and PolyPhen scores, and separating functional categories..." ##  "Done! Here you go. Took:" ## user system elapsed ## 857.437 17.924 878.059
This will take about 10-15 mins to load, plus a ~800M download the first time you run it. exac is then the data frame of ALL ~10M variants in the dataset with over ~100 annotations per variant. Each variant is now its own entry, as opposed to a typical VCF where multi-allelic variants are combined into a single line. Note that these are unfiltered variant calls. Make sure to filter to either PASS sites, or better yet, our criteria for high-quality calls (given by the column
use, see below for details):
filtered_calls = subset(exac, filter == 'PASS')
use_data = subset(exac, use)
You can view the table using
head() or with dplyr’s
tbl_df, which formats the output nicely:
## Source: local data frame [10,195,872 x 116] ## ## chrom pos ref alt id filter ## (chr) (int) (chr) (chr) (chr) (chr) ## 1 1 13372 G C . PASS ## 2 1 13380 C G . VQSRTrancheSNP99.60to99.80 ## 3 1 13382 C G . VQSRTrancheSNP99.60to99.80 ## 4 1 13402 G C . VQSRTrancheSNP99.60to99.80 ## 5 1 13404 G A . VQSRTrancheSNP99.60to99.80 ## 6 1 13404 G T . VQSRTrancheSNP99.60to99.80 ## 7 1 13417 C CGAGA . PASS ## 8 1 13418 G A . AC_Adj0_Filter ## 9 1 13426 A G . VQSRTrancheSNP99.80to99.90 ## 10 1 13438 C A . VQSRTrancheSNP99.80to99.90 ## .. ... ... ... ... ... ... ## Variables not shown: ac (int), ac_afr (int), ac_amr (int), ## ac_adj (int), ac_eas (int), ac_fin (int), ac_hemi (int), ## ac_het (int), ac_hom (int), ac_nfe (int), ac_oth (int), ## ac_sas (int), ac_male (int), ac_female (int), ## ac_consanguineous (int), ac_popmax (int), an (int), an_afr ## (int), an_amr (int), an_adj (int), an_eas (int), an_fin ## (int), an_nfe (int), an_oth (int), an_sas (int), an_male ## (int), an_female (int), an_consanguineous (int), an_popmax ## (int), hom_afr (int), hom_amr (int), hom_eas (int), hom_fin ## (int), hom_nfe (int), hom_oth (int), hom_sas (int), ## hom_consanguineous (int), clinvar_measureset_id (int), ## clinvar_conflicted (int), clinvar_pathogenic (int), ## clinvar_mut (chr), popmax (chr), k1_run (chr), k2_run ## (chr), k3_run (chr), doubleton_dist (dbl), inbreedingcoeff ## (dbl), esp_af_popmax (dbl), esp_af_global (dbl), esp_ac ## (int), kg_af_popmax (dbl), kg_af_global (dbl), kg_ac (int), ## vqslod (dbl), gene (chr), feature (chr), symbol (chr), ## canonical (chr), ccds (chr), exon (chr), intron (chr), ## consequence (chr), hgvsc (chr), amino_acids (chr), sift ## (chr), polyphen (chr), lof_info (chr), lof_flags (chr), ## lof_filter (chr), lof (chr), context (chr), ancestral ## (chr), coverage_median (dbl), coverage_mean (dbl), ## coverage_10 (dbl), coverage_15 (dbl), coverage_20 (dbl), ## coverage_30 (dbl), pos_id (chr), indel (lgl), ## bases_inserted (int), transition (lgl), cpg (lgl), ## alt_is_ancestral (lgl), af_popmax (dbl), af_global (dbl), ## singleton (lgl), maf_global (dbl), daf_global (dbl), ## daf_popmax (dbl), pos_bin (fctr), bin_start (dbl), sector ## (chr), bad (lgl), use (lgl), lof_use (lgl), sift_word ## (chr), sift_score (dbl), polyphen_word (chr), ## polyphen_score (dbl), category (chr)
A previous blog post by Monkol Lek, A Guide to the Exome Aggregation Consortium (ExAC) Data Set, described some of the challenges in exploring the data. With the dataset loaded into R, we can dive into a few more tips and tricks. First, you’ll notice that there are many annotations per variant pre-calculated, in particular:
In particular, one set of questions that always comes up is regarding the allele counts (AC) and allele numbers (AN). The raw counts (
an) refer to the total number of chromosomes with this allele, and total that were able to be called (whether reference or alternate), respectively. Thus, the allele frequency is
ac/an. However, for more stringent applications, we adjusted the allele count to only include individuals with genotype quality (GQ) >= 20 and depth (DP) >= 10, which we term
ac_adj (and correspondingly,
an_adj for total number of chromosomes from individuals with those criteria regardless of genotype call).
In this table,
af_global is then
an from each population is filtered to only adjusted counts (i.e.
ac_EAS includes only individuals of East Asian descent that meet this criteria). Additionally, we provide a maximum allele frequency across all populations as
af_popmax and the corresponding population as
popmax, which is useful for filtering variants for Mendelian gene discovery. Finally, we have
hom_* for all populations, which is the count of homoyzgous individuals for this allele.
The output from VEP is parsed for each allele and, in the default table, summarized using the most severe consequence across all transcripts. You can run
load_exac_data('canonical') for instance to load the annotations for only the canonical transcripts, or pass
gene=T to load in a data frame with one entry for every transcript or gene, respectively, for each variant.
Loss-of-function (LoF) annotations of variants from LOFTEE are provided in the
lof_flags columns, and the highest-confidence variants are indicated as
lof_use. PolyPhen and SIFT annotations are parsed into words (e.g.
possibly_damaging) and scores.
There are also annotations from ClinVar, intersected with output of Eric Minikel’s Clinvar parsing tool. A query for pathogenic variants without conflicting evidence may use
clinvar_pathogenic == 1 && clinvar_conflicted == 0 && clinvar_mut == 'ALT'.
Finally, there are numerous convenience annotations for quick analysis, including logical values for
indel (as well as the number of bases inserted or deleted in
singleton, various coverage metrics (mean, median, and proportion covered at a given depth – e.g.
coverage_30 indicates the proportion of individuals covered at 30X). The 3-base context around SNPs is provided in the
context column (warning: this is currently broken for indels, and in any case, not as useful).
As mentioned above, the
use column is a crucial one for filtering to the highest-quality variants, which includes the following criteria:
- Filter status of “PASS” (from VQSR)
- Adjusted allele count (ac_adj, see above) > 0 (i.e. there is at least one individual with at least GQ >= 20 and DP >= 10)
- At this locus, at least 80% of individuals (regardless of whether they have the variant) met the high-quality adjusted status as above (in code,
exac$an_adj > .8*max(exac$an_adj))
- Does not fall in the top 10 most-multiallelic kilobases of the genome (see Multi-allelic enriched regions)
With the data, there are many analyses that you can then do, including:
Recreate the figures from the paper
This one’s an easy one. Just run the following, which will take ~an hour. (Warning: will take lots of memory and CPU and has been known to crash R, particularly for some more computationally expensive figures, such as Figures 2E and 3F).
Depending on your system, it may be easier to open the script and run sections of the code independently.
Multi-allelic enriched regions
Early on after we released the ExAC dataset to the public, Nicole Deflaux at Google performed a nice analysis identifying chromosome 14 as a hotspot of multi-allelic variants (full analysis here: https://github.com/deflaux/codelabs/blob/exac/R/ExAC-Analysis/ExAC-Analysis.md). We wanted to trace the source of this issue, so we used
ggbio to plot the multi-allelics across the genome.
Install ggbio if its not already installed:
Then, use the
genome_plotting.R script where we’ve written functions to calculate and plot the density of multi-allelic variants (for now, limiting to quad-allelic and up) per kilobase across the genome.
library(ggbio) source('analysis/supplement/genome_plotting.R') multiallelic_density = get_multiallelic_density(exac, resolution=1000) suppressMessages(plot_multiallelic_density(multiallelic_density))
The 2 regions highlighted in light blue look interesting (chr14 as before, as well as chr2), so we can dig into those regions. The top 10 regions ordered by frequency are:
multiallelic_density %>% arrange(desc(freq)) %>% head(10)
## chrom bin_start bin_end freq ## 1 14 106330000 106331000 109 ## 2 2 89160000 89161000 103 ## 3 14 106329000 106330000 76 ## 4 14 107178000 107179000 63 ## 5 17 18967000 18968000 46 ## 6 22 23223000 23224000 44 ## 7 1 152975000 152976000 42 ## 8 2 89161000 89162000 39 ## 9 14 107179000 107180000 38 ## 10 17 19091000 19092000 38
If we look these up, we’ll find that these regions include immunoglobulin kappa (IGK, chr2), heavy (IGH, chr14), lambda variable (IGLV, chr22), and joining (IGKJ1, chr2), as well as some difficult-to-call regions of the genome. In ExAC, we’ve added a column for whether a variant falls into these 10 regions of the genome, named
bad for lack of a better phrase, and these variants are then excluded from the high-quality
dbSNP rs number versus allele frequency
We can also do some fun analyses that didn’t make the cut. One easy hypothesis we had was whether rs number (the numbers in a dbSNP RSID identifier) would correlate with allele frequency. We would expect a negative correlation, as rs numbers are sequentially assigned, and common variants should have been discovered earlier on. Of course, this will not be a perfect correlation, as rare variants in the first individuals sequenced, as well as disease variants, will be identified earlier and have lower rs numbers.
First, let’s get the rsid column into a numeric form and subset to only those in dbSNP:
library(magrittr) library(tidyr) exac %<>% mutate(rsid = as.numeric(gsub('rs', '', id))) rsid_variants = filter(exac, !is.na(rsid) & af_global > 0)
The rs number is negatively correlated with allele frequency, as expected.
## ## Pearson's product-moment correlation ## ## data: rsid_variants$rsid and rsid_variants$af_global ## t = -536.42, df = 1108700, p-value = 0 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## -0.4554037 -0.4524481 ## sample estimates: ## cor ## -0.4539271
library(ggplot2) ggplot(rsid_variants) + aes(x = rsid, y = af_global) + geom_point() + geom_smooth(method="lm", se=F) + ylab('Global Allele Frequency')
It’s a bit tough to see the correlation there, but it’s promising: later RSIDs do appear to have lower allele frequencies than earlier ones. Our fit line doesn’t seem too great, so let’s use the log of the allele frequency and use
smoothScatter() to visualize it:
smoothScatter(rsid_variants$rsid, log10(rsid_variants$af_global), xlab='RSID', ylab='Global allele frequency (log10)', col='white') abline(lm(log10(rsid_variants$af_global) ~ rsid_variants$rsid))
The largest blob in the bottom right corresponds to variants submitted by the NHLBI Exome Sequencing Project, from which many exomes are included in ExAC.
We encourage anyone to continue exploring the ExAC dataset and we hope that this repository helps others reproduce and expand on the analyses we’ve done. We welcome any and all comments, here and on the github page. Enjoy!
We have a new paper out today in Science Translational Medicine that describes our application of the massive Exome Aggregation Consortium database to understanding the variation in one specific gene: the PRNP gene, which encodes the prion protein.
This project was a special one for a number of reasons. Firstly, there’s an incredibly strong personal motivation behind this work, which you can read much more about in a blog post by lead author Eric Minikel. Secondly, it’s a clear demonstration of the way in which we can use large-scale reference databases to interpret genetic variation, including flagging some variants as non-causal or having mild effects. Thirdly, as discussed in the accompanying perspective by Robert Green and colleagues, this work is already having clinical impact by changing the diagnosis for people with families affected by prion disease. And finally, the discovery of “knockout” variants in PRNP in healthy individuals is tantalizing evidence that inhibiting this gene in mutation carriers is likely to be a safe therapeutic approach.
The paper is of course open access, so you can read the details yourself. Huge congratulations to Eric for pulling this paper together!
Here’s a guide to the latest science that the MacArthur lab (and close affiliates!) will be presenting at the American Society of Human Genetics Meeting in Baltimore Oct 6-10, 2015.
|Wednesday 3:45p||Eric Minikel will present an evaluation of reportedly disease-causing variants in the prion protein gene (PRNP) by comparing allele frequencies in 60,706 ExAC and 531,575 23andMe individuals to a case series of 16,025 prion disease patients. The analysis reveals a spectrum of penetrance and, importantly for therapeutics, a tolerance for loss-of-function variation.
Room 316, Level 3
|Wednesday 4:00p||Kaitlin Samocha, a graduate student in Mark Daly’s lab who’s been working with us on ExAC analyses, will present Evaluation of the regional variability of missense constraint in 60,000 exomes. We have previously focused on identifying genes that are intolerant of missense. With the large number of individuals in the Exome Aggregation Consortium data, we expect enough missense variants per gene to search for regions within genes that are intolerant of missense variation.
Room 316, Level 3
|Wednesday 4:15p||Taru Tukiainen has been exploring the landscape X chromosome inactivation across human tissues and will present the findings from the analysis of 1500 single cells and population-level data in her talk at the regulatory variation session.
Ballroom III, Level 4
|Wednesday 5:00p||Daniel Birnbaum will present poster 1735, Improving the annotation of splice-disrupting loss-of-function variants. Much work has been done on both modeling sequence motifs essential to splicing and predicting when sequence variants will disrupt splicing. However, this work has not been applied to the functional annotation of variant call sets. Here, we incorporate splice variant prediction into the Ensembl’s Variant Effect Predictor (VEP) annotation framework.
Exhibit Hall, Level 1
|Thursday 11:00a||Graduate student Beryl Cummings will present poster 2931, Improving genetic diagnoses in Mendelian disease with whole genome and RNA sequencing. The poster describes the value of both technologies for detecting and functionally validating variants that are invisible to or interpretable by current standard diagnostic methods for patients with neuromuscular disorders.
Exhibit Hall, Level 1
|Thursday 5:00p||Former lab member Andrew Hill (University of Washington), will present work on phased interpretation of multi-nucleotide polymorphisms in ExAC completed with current MacArthur Lab graduate student Beryl Cummings.
Ballroom III, Level 4
|Saturday 10:30a||Daniel MacArthur will present an overall summary of the latest news from the Exome Aggregation Consortium: new scientific results, new features in the ExAC browser, and plans for expansion to over 100,000 exomes.
Room 309, Level 3
We live in an amazing time to do human genetics. Over the last five years, thanks to impressive advances in DNA sequencing technology, the research community has collected sequencing data on genetic variation from over 200,000 samples. This provides us, for the first time, with the ability to study genetic variants at very low frequencies in the general population. However, in order to perform this research it’s critical that these genetic data be brought together and analyzed in the same way to ensure that the genetic changes that we find are real, and not the artifacts of differences in sequencing technology or analytical pipelines.
This goal is what drives the Exome Aggregation Consortium (ExAC), an international coalition of investigators with a focus on data from exome sequencing – an approach that allows us to focus variant discovery on the regions of the genome that encode proteins, known collectively as the exome. To date the Consortium has accumulated and jointly analyzed exome data from nearly 92,000 individuals, and has prepared a publicly accessible data set spanning 61,486 of these individuals for use as a global “reference set”. While the individuals in the reference set aren’t necessarily healthy – many have adult-onset diseases such as type 2 diabetes and schizophrenia – we have removed individuals with severe pediatric diseases, making this (we believe) a reasonable comparison data set for childhood-onset Mendelian diseases.
On October 20th at the American Society of Human Genetics (ASHG) conference we announced release 0.1 of the ExAC data set in two forms, as a browser and a downloadable raw data file. This was not just a massive data release but also a massive collaborative effort, which is detailed here. Four weeks after the release, the ExAC browser has received over 120,000 page views from over 17,000 unique users, and the raw data has been downloaded by over 150 organizations. The annotation tools ANNOVAR and ATAV have provided updates that have incorporated the ExAC data and the developers of Combined Annotation Dependent Depletion (CADD) have provided corresponding CADD scores. The commercial tools from GoldenHelix and GeneTalk have also incorporated the ExAC data. As the lead analyst on the project for over 2 years, I’ve been thrilled with the response it has received and the kind words and valuable feedback from the research community.
This practical guide, which uses two example genes FBN1 and MECP2 is aimed at general users and how they can access information using the ExAC Browser.
John Belmont commented on Nature News and Twitter that the FBN1 gene associated with Marfan syndrome has 11 subjects with Loss of function (LoF) mutations. If these are true disease causing variants then it fits roughly with the 1 in 5000 incidence of this disease.
The FBN1 LoF variants can be directly viewed on the ExAC browser by searching FBN1 or clicking on this link and then clicking on the LoF button. Things to note on the FBN1 gene summary page:
- The coverage plot (affectionately called the Guilin plots ) is of the canonical transcript, using the Ensembl definition. This may not necessarily correspond to the clinically relevant transcript.
- The genomic coordinates uses GRCh37 and NOT from the recently released GRCh38
- Variant sites with multiple alleles are represented on separate rows
- The functional annotation and corresponding protein consequence is from the most severe impact amongst the transcripts and may not affect all transcripts. As it is summarized from multiple transcripts, the amino acid position can sometimes appear out of order.
- Allele Number is the number of chromosomes so is twice the number of individuals (maximum 2*61,486). Due to the nature of exome capture and quality thresholds applied, this will not always be at the maximum.
- All variant data displayed in the table can be downloaded as a CSV text file and opened in Excel to restore the columns and rows.
Amongst the 10 Loss of Function variants in FBN1
Using the stop gained 15-48719948-G-C variant page is a good example to highlight important features:
- The histogram of Depth and Genotype Quality (GQ) for individuals with the allele or click on the full site metrics check box to display the histogram for all individuals with genotype calls (including those that are homozygous reference).
- The stop gain variant does not affect all transcripts. The variant results in a missense change in the canonical transcript (ENST00000316623). In fact the canonical transcript has only 8/10 LoF mutations.
One of the upcoming features we are developing for the ExAC browser is the ability to view the sequencing reads from the reconstructed BAMs produced by the Genome Analysis Toolkit (GATK) Haplotype Caller using the –bamOutput option. For the splice acceptor variant 15-48760301-T-C, this is particular useful to not only show the reads/bases supporting the SNP calls but also the reference sequence context and whether the acceptor site is the canonical (i.e. ends in [T/C]AG).
Note: FBN1 is on the reverse strand
Loss of Function and ClinVar variants
The LoF variants can be viewed by either following this link or searching MECP2 then clicking on the LoF button. In MECP2 there are 6 LoF variants. The stop gained variant X-153296689-G-A has an allele count of 68 with 20 homozygous individuals. Currently the ExAC data set is not sex aware and does not differentiate between hemizygous males and homozygous females. An upcoming feature is to calculate these numbers correctly for variants on the X chromosome. The sex of each individual in ExAC was determined by heterozygosity on the X chromosome and normalized chromosome Y coverage.
Differentiating males and females from exome sequencing data, using chrX heterozygosity (X axis) and coverage on the Y chromosome (Y axis). Males form a cluster on the left, females on the bottom right. A small number of unassigned individuals are also visible, some of whom are probable Klinefelter cases.
Now for the stop gained variant of interest, all 20 homozygous individuals are actually hemizygous males. Similar to the FBN1 example, the stop gained annotation only affects 1/3 transcripts while the other two (including the canonical) have a missense (p.Thr197Met, p.Thr209Met) annotation. According to ClinVar this variant is a missense variant and classified as benign. The LoF variants X-153296104-TCAGG-T and X-153296112-AGGTGGGG-A with homozygous individuals are also due to hemizygous males.
The variant X-153295997-C-T is an example of a pathogenic ClinVar variant in MECP2 that claims to be associated with neonatal severe encephalopathy in males. The 4 homozygous individuals in ExAC are actually 4 hemizygous males. It was soon later argued to be a rare variant rather than pathogenic but still remains classed as pathogenic in ClinVar!
Finally for a pathogenic variant in ClinVar not found on the ExAC browser with genomic coordinates X-153296806. The coverage data also provided for download shows that this site has adequate coverage for variants to be detected.
tabix -h Panel.chrX.coverage.txt.gz X:153296806-153296806
Fraction of samples at various coverage.
Looking more deeply at the insertion/deletions (indels) that result in frameshifts
Another advantage of having the ability to view sequencing reads is that users can now look at the reliability of the more difficult indels and other complex variant calls. The Exome Variant Server (EVS) is a fantastic resource for the research community but did not have features for researchers to scrutinize indel variant calls. This was particularly concerning for researchers when publishing on novel disease gene. In the case of a recently published paper on LMOD3, for instance, the presence of homozygous frameshift indels in EVS greatly concerned our collaborators; it was only through careful scrutiny of the raw data for these variants that we were able to reassure them that these were genotyping errors.
Firstly, let’s take a look at the 28 bp deletion X-153296090-CGGAGCTCTCGGGCTCAGGTGGAGGTGGG-C in MECP2 resulting in a frameshift variant.
Now let’s see the reads for a 1 bp insertion X-153296070-A-AG from a heterozygous female.
Both of these frameshift mutations appear real and may cause intellectual disability, so why do they exist in a data set of individuals without severe diseases? I propose three possible reasons:
- There is an obvious drop in coverage where all the LoFs in MECP2 have accumulated. This may indicate a region difficult to capture or sequence and perhaps also challenging to detect variants.
- The shorter protein coding transcript ENST00000407218 avoids all but 1 LoF mutation (X-153296689-G-A) and may rescue some function lost in the larger isoforms.
- Lastly, the LoFs are towards the end of the gene and may result in a much milder phenotype.
Investigating which of these possibilities may be contributing will require further detailed analysis. We welcome comments from MECP2 researchers regarding the LoF mutations in ExAC.
Tri and Quad allelic SNPs
Ending on an interesting point resulting from larger and larger data sets. The assumption that common variants remain bi-allelic is no longer valid, as with each new individual added there is a possibility of finding a new allele at a site where a bi-allelic variant is present. For example, the variant site rs2063690 is now a quad-allelic SNP – in other words, every possible base is present at this site in at least one individual in our data set! Furthermore, the figures below shows three individuals who are heterozygous for the reference and each of the alternate alleles, while the last individual is heterozygous for two alternate alleles.
Heterozygous G/C (ref/alt)
Heterozygous G/A (ref/alt)
Heterozygous G/T (ref/alt)
Heterozygous C/A (alt/alt)
There is increasing urgency for the development of tools that deal appropriately with these mult-iallelic sites – approximately ~7% of ExAC sites are now multi-allelic, and that fraction will grow as our sample size increases. That high rate of multiallelism shouldn’t be surprising, by the way – the ExAC data set now (staggeringly) contains one variant every six bases on average, so it’s not a shock to see many cases where variant locations overlap.
We’ve been gratified to see the rapid and positive response of the community to the ExAC data set. We still have plenty of work to do, though – and we’d love to get your feedback. If you have issues with the data set or the website, please drop us an email. For website bugs or feature requests you can also lodge a Github issue.
Many thanks to Daniel MacArthur for comments/feedback, writing introduction and final thoughts!
Several members of the MacArthur lab will be presenting at the American Society of Human Genetics Meeting in San Diego Oct 18-22, 2014. Here’s a rundown of what not to miss.
|Sunday 11:30a||Eric Minikel and his wife Sonia Vallabh will speak in invited session #6: Crowdsourced Genetics, chronicling their own story of becoming scientists after getting some bad genetic news, and discussing how they’ve engaged with crowds through blogging and crowdfunding.
Room 6AB, Upper Level.
|Sunday 4:00p||Eric Minikel will present poster 1255S on using allele frequency information from 63,000 exomes to analyze disease penetrance and loss-of-function tolerance in his favorite dominant disease gene, PRNP.
Exhibit Hall, Poster 1255S.
|Monday 8:00a||Konrad Karczewski will introduce the Human Knockout Project and the LOFTEE tool that he has developed to identify true loss-of-function variants.
Hall B1, Ground Level.
|Monday 10:30a||Daniel MacArthur will describe the Exome Aggregation Consortium (ExAC) analysis of over 61,000 exomes, and announce the public release of allele frequency data from this massive data-set.
Room 6AB, Upper Level.
|Monday 2:00p||Karol Estrada and collaborator Anthony Day-Williams will present poster 773M, a joint effort between the MacArthur Lab and Biogen Idec to use whole genome sequencing to identify causal variants in an ultra-rare autoimmune disease, neuromyelitis optica (NMO).
Exhibit Hall, Poster 773M.
|Monday 4:30p||Honorary lab member Kaitlin Samocha (a graduate student in Mark Daly’s lab) recently developed methods for computing missense constraint [Samocha 2014]. Now, with the vastly greater statistical power of the ExAC dataset, she will speak about applying this concept to loss-of-function constraint.
Room 20A, Upper Level.
|Monday 5:30p||Taru Tukiainen will show how pooled and single-cell RNA-seq data can be combined to provide new insights on X chromosome inactivation in her talk in the epigenomics session.
Room 30, Upper Level.
The exome era turns five years old this fall, with the anniversary of the “Targeted capture and massively parallel sequencing of 12 human exomes” [Ng 2009]. Born at a time when whole genome sequencing cost about a quarter million dollars, exome sequencing provided the cost savings that made sequencing large numbers of individuals for research feasible. Here in the MacArthur lab, our rare disease gene discovery projects rely on whole exome sequencing (WES) as a first pass on unsolved Mendelian families. If we can’t find a causal variant, we perform WES on additional family members. If we run out of family members and still haven’t found a likely causal variant, only then do we turn to whole genome sequencing (WGS), often alongside functional analyses such as RNA sequencing from muscle tissue.
WGS has remained a last resort for a few reasons: it’s still a bit more expensive than WES, it produces more data we have to process and store, and despite some notable recent advances [Ritchie 2014, Kircher 2014] our ability to interpret the pathogenicity of variants in the 99% of the genome that doesn’t code for protein is still a far cry from what we can do with exons. But we still turn to WGS for two main reasons: (1) its superior ability to call structural variations by covering the breakpoints, and (2) its more reliable coverage of the exome itself.
We now have 11 individuals from Mendelian disease families on whom we’ve performed both standard WES and very, very high-quality WGS, which gives us a chance to quantify exactly what we’re missing in the WES. The question of what we’re missing in WES is one input to our periodic re-evaluation of whether WES is still the best value for research dollars or if we should be turning to WGS right away.
In this post, I’ll explore the “callability” of these 11 WES samples: how much of the targeted exome is covered at a depth where one could hope to call variants, and how many variants are missed, and why? We’ll should say up front that this isn’t a product comparison – the WGS on these samples is of much higher quality than is standard these days, and our exomes are comparatively out of date, so this post cannot directly inform the “WES vs. WGS” decision.
We have 11 individuals from Mendelian muscle disease families on whom both WES and WGS were performed. Again, this is not designed to be a totally fair comparison, and the WES and WGS datasets differ on several dimensions:
|Capture||None||Agilent SureSelect Exome|
Many of the limitations we see in our analysis below are not actually inherent in exome capture but rather relate to PCR amplification or read length. And here are a couple of other caveats: some of the samples were deliberately WGS sequenced at 30x coverage and others at 60x coverage, so the “platinum standard” to which we’re comparing the WES data is a bit of a moving target. Finally, we are only considering one rather outdated exome capture technology (Agilent SureSelect v2) and not a survey of available products; the Broad, for instance, has recently moved to a substantially better capture technology (Illumina’s ICE technology), which we now use as standard for our rare disease patients. We have been working on performing a fairer and more comprehensive comparison of WES and WGS, and the analysis in this post may be considered as a pilot analysis.
The code used for this analysis is here; in the post I’ll only present the results. Briefly, I analyzed (1) coverage depth, and (2) called variants.
For the coverage depth analysis, I used BEDtools to create four sets of genomic intervals:
- The Broad Exome v2.8 targets (32.95 Mb)
- A 10-bp buffer around the Broad Exome (excluding the exome itself)
- The subset of the Broad Exome that intersects with 63 known Mendelian muscle disease genes that we know and love
- A 10-bp buffer around the above muscle disease gene exons (excluding the exons themselves)
Note that in this analysis I did not consider all Gencode intervals, as Agilent SureSelect does not even attempt to target some of these intervals. In exome sequencing, it goes without saying that you’re missing most of what you do not target; the question we want to address here is how much of what we do target we are still missing. (The question of relative coverage of all genic regions is a topic for a separate post.)
For each genomic interval, in WGS and WES, I then ran GATK DepthOfCoverage for every combination of minimum MAPQ of 0, 1, 10, or 20, and minimum base pair quality of 0, 1, 10 or 20. In other words, I computed the read depth across every base of these genomic intervals, stratified by mapping quality and by base pair quality.
For the called variant analysis, I used VCF files produced by GATK HaplotypeCaller. (An important caveat which I’ll repeat later is that the HaplotypeCaller calls were done in 2013 on a now-outdated GATK version and appear to contain a few bugs which may have since been fixed. When we repeat some of these analyses on a larger dataset we will use a newer callset.) I subsetted the variant calls to the Broad Exome and the samples of interest using GATK SelectVariants. I converted them to tables, decomposing multi-allelic sites into separate rows using GATK VariantsToTable, and then converting each allele to its minimal representation. The WES samples had been joint-called with a few hundred other samples, necessitating some further filtering at the analysis stage which I’ll describe below.
I’ll start with some descriptive statistics and then get gradually into the analysis. Although the samples were sequenced to a variety of mean exonic depths in both WES and WGS, it is categorically true across all 11 samples that the mean exonic depth of the WES data is higher than the WGS data:
Again, recall that some WGS samples were sequenced with a goal of 30x coverage and others 60x, hence some of the variability in the height of the yellow bars. As you’d expect, the WGS BAMs have similar coverage in a 10bp buffer around all exons (or muscle exons) as they do in the exons themselves, while the WES has reduced coverage in the ± 10bp buffer regions. In the few samples with the deepest WGS sequencing (e.g. S11), WGS is actually deeper than WES in the buffer regions.
As mentioned above, I stratified the coverage calculations by minimum mapping quality of the read (MAPQ). The above plots are for all reads regardless of MAPQ. When I stratify by MAPQ, I find that just requiring MAPQ ≥ 1 loses you about 3% of your depth in WES and 2% of your depth in WGS. Being even stricter and requiring MAPQ ≥ 10 or 20 doesn’t lose you any depth at all. This is true regardless of the target set (the different lines, denoted at right). The below plot shows just sample S11 as an example, but the same is true across the 11 samples. A MAPQ of 0 indicates that the read’s alignment is ambiguous – it fits in more than one place in the genome. So the conclusion here is that a few percent of reads, in both WES and WGS, have MAPQ 0, but other than that, virtually all reads have a high MAPQ. This plot shows one sample, but they all look pretty similar to this:
The picture when stratifying by base quality (BQ) rather than MAPQ is only slightly more interesting, as seen below. Virtually all bases in the WES data have quality > 20, so quality thresholds up to 20 really have no hit on depth (flat blue lines). In the whole genomes, on the other hand, requiring base quality of 10 or 20 loses almost a third of the depth in the one sample shown here (yellow curves). Other samples were qualitatively similar though some of them were less dramatic. My suspicion is that this is not due to it being WGS per se, but rather the fact that 250 bp reads are near the limit of what next generation sequencing is presently capable of, so that getting such long reads involves some sacrifice in base pair quality at the ends of the read. Again, this plot shows one individual, but is typical of all of them:
All of the above plots are concerning mean coverage across the entire target set. A different metric that might be more pertinent to callability is the percent of the target that is covered at 20x depth or better. Therefore I computed for each of the 11 samples what percent of the Broad Exome is covered at 20x or better, varying the minimum MAPQ threshold but applying no base quality cutoff. As the plot below shows, all but one of the WGS samples has > 95% of the exome covered at ≥ 20x when all reads are included, while the WES samples cluster closer to 90%. In both cases, you drop below the 20x threshold in about 1% of the exome by requiring that reads have MAPQ > 0. Requiring MAPQ of 10 or 20 doesn’t change things much.
The picture is remarkably similar when the threshold you’re trying to achieve is 1x instead of 20x:
All 11 samples are plotted in blue and yellow above — you just can’t see them because they overlap perfectly. In all 11 WGS samples, 100% of the Broad Exome is covered with at least 1 read, but only 99% is covered with at least 1 read of MAPQ > 0. In all 11 WES samples, 99% of the Broad Exome is covered by one read, but only 98% by one read of MAPQ >0. Interpretation: about 1% of the exome is just utterly unmappable even with 250bp reads, and 2% is unmappable with 76bp reads.
The comparable picture 20x+ depth stratifying by base quality is more predictable: as above, the WGS but not WES samples here take a hit at the higher base quality cutoffs:
So far I’ve established that base quality only matters in my WGS samples, and MAPQ only differs between 0 and > 0. I’ve also established that these trends are pretty consistent across my different target sets. Therefore, I’m going to simplify and from here on out, I’ll only look at the Broad Exome, and I’ll either filter on MAPQ ≥ 1 and BQ ≥ 20, or I won’t filter at all. Fixing these variables lets us look more quantitatively at the distribution of coverage depth across the exome, instead of looking at mean coverage or at the percent covered at ≥ 20x. Here is a plot of the cumulative distribution function of depth in the 11 samples, in the exome, without any quality filters:
Remarkably, with just one exception, the WGS samples stay very close to 100% coverage up until about 20x depth, and then each at some point drops off sharply. The WES samples, in contrast, drop off very steadily. That’s bad. The ideal sequencing scheme would have every last base covered at exactly, say, 30x depth – enough to call any variant with high confidence – and then no bases covered at ≥ 31x depth – more isn’t useful. As expected, WGS is much closer to this ideal than WES. Phrased differently, WES has much higher variance of coverage. This is probably due chiefly to capture efficiency, and possibly to some extent due to GC bias during PCR amplification. This is why, even though our samples had much lower mean coverage in WGS than WES, the WGS provides us (with one exception) with a larger proportion of the exome covered at ≥ 20x.
The picture is similar when we filter for MAPQ ≥ 1 and BQ ≥ 20 (below). The WGS coverage suffers a bit, particularly due to the base quality threshold, so that some of the WGS now have fewer bases at 20x coverage than WES. At 10x, though, WGS is still the clear winner.
With MAPQ > 0 and BQ ≥ 20, the WGS samples have an average of 98.4% of the exome covered at ≥ 10x, while WES only have 93.5%. So if you consider 10x the minimum to have decent odds of calling a variant, then the combination of longer reads (thus greater mappability) and more uniform coverage in these WGS samples is bringing an additional 5% of the exome into the callable zone.
Surprisingly, though we think of muscle disease genes as being enriched for unmappable repetitive regions, the above figure is nearly identical when only muscle disease gene exons are included. As expected, the advantage of WGS over WES is even clearer in the ±10bp around exons, though the shape of the curves is similar.
Moving on, let’s talk about variants called in the WES vs. WGS datasets. I decomposed the variant calls into separate alt alleles, so I’ll speak in terms of alleles rather than sites. This table provides a few descriptive statistics on alleles called when considering only the Broad exome with no buffer:
|WES only||WES and WGS||WGS only|
|PASS alleles called||712||40078||2120|
|of which at least one non-homref genotype has GQ > 20||451||2025|
Although more alleles were unique to WES than to WGS, most of these were filtered variants. Moreover, even among the PASS variants in the WES set, many PASSed only because the WES samples had been joint-called with a few hundred other samples, while none of the 11 individuals considered here actually had an alternate allele call with GQ > 20. So basically, we have about 40,000 PASS alleles shared by both sets, 500 good-quality PASS alleles unique to WES and 2,000 unique to WGS. The Venn diagram for variants of decent quality looks something like this:
The figure for WGS is precisely in line with expectation: as shown above, WGS brings an additional ~5% of the exome up to a “callable” threshold, so it makes sense that it produces about 5% more alt allele calls.
As an aside, of the alleles called in both datasets, WES and WGS called the exact same genotype for all 11 individuals in 91% of alleles. At the other 9% of alleles, WGS had the greater alt allele count 67% of the time. This suggests that even when WES and WGS can both detect an allele’s existence, WGS may have greater power to call the allele in more samples.
Next I wanted to dig into the variant calls in a more anecdotal way. Which types of situations are better called by WGS and which are better called by WES? What are the error modes of each dataset? To get at this question, I looked at IGV screenshots of both the WES and WGS reads at sites where a variant was called in one dataset but not the other. I looked at all such discordant sites in muscle disease gene exons (16 sites), and then intended to look at a random sampling of 10 WES-only and 10 WGS-only alt alleles exome-wide, though I ended up doing this a few separate times as I tweaked the filters above. In the final run, with the PASS and genotype quality filters described above in place before the random sampling, I did not notice an obvious difference in the proportion of genotype calls that I concluded looked correct on IGV after viewing reads from both datasets. 6 of the 10 WES-only alleles looked like the correct genotype call had indeed been made based on WES, and symmetrically, 6 of the 10 WGS-only allele calls looked correct. Of the 40% that I didn’t deem exactly correct, most were suggestive of some sort of non-reference allele at the site, it just wasn’t exactly as called by HaplotypeCaller. The number of samples I looked at clearly isn’t enough to be getting into any more numbers, but this exercise was anecdotally very rich. I found it quite illuminating as to some of the error modes of genotype calling and the relative strengths and weaknesses of WGS and WES, and of long reads vs. short reads.
Based on all the screenshots I looked at, I would break the discordantly genotyped sites into 7 categories. What follows is a list of these categories, with one example for each, and a description of whether WGS or WES performed better and why. You can click the image to view at full resolution.
1. more consistent read depth without capture
Based on the coverage analysis above, which indicated that depth is much more variable in WES than WGS, I expected to see lots of these cases, and I did. Here there are only two reads in WES, and both happen to support a C>A SNP, so it was called in WES. My guess is that’s just bad luck – base errors happen in sequencing, and with only two reads, sometimes they’ll happen in the same spot on two reads. In WGS, there was much more depth and no read support for this SNP, so it wasn’t called.
I chose this example because it’s interesting to note that the variable depth in WES doesn’t just lead to missed genotypes, it also leads to false positives. As you’d expect, though, there were more cases where a real variant was called in WGS, and was missed in WES due to having only 0 or 1 or 2 reads. Here is one such example if you’re curious.
2. longer reads, better mapping quality
In WGS, this was called as a homozygous alt genotype. Most, but not all, reads support the alt allele; if you look closely, almost 100% of the gray (MAPQ > 0) reads do support the alt allele, while the ones that don’t are almost exclusively white (MAPQ of 0). In the WES, by contrast, there are no gray reads – all the reads here aligned with a MAPQ of 0. That’s because this region is similar enough to other regions that some (not all) 250bp reads can align uniquely, while no 76bp reads can align uniquely. Not only was this not seen as homozygous in the WES data, the variant wasn’t even called at all.
3. longer reads spanning short repeats
A different advantage of longer reads is their ability to span short tandem repeats or homopolymer runs. I looked at these screenshots because the WES data had called a 3bp in-frame deletion that was absent from the WGS calls. On very close inspection, here’s what happened. RRLLLF became RLLLLF – meaning that an R changed to an L, or an R was deleted and an L duplicated, however you prefer to think of it. The WGS data called a MNP in which tandem T>G and C>A mutations turned an R (TCG) into an L (GAG), a reasonable interpretation. In WES, almost half of reads supported that same interpretation, but a subset of reads (6 of them) failed to span the LLL repeat, so they didn’t see that there were 4 Ls instead of 3, and this resulted in a call of a 3-bp deletion of the R. If you look closely, two of these reads spanned into the F after the Ls, and had mismatched bases as a result. The longer reads in WGS, because they spanned the entire repeat region, gave the more correct call.
4. better distribution of position in read without capture
In these screenshots, a homopolymer run of As has contracted by 3 As compared to reference. In WGS, this was called as a left-aligned TAAA>T deletion, as it should be. In WES, no reads spanned the polyA tract, and so no variant was called. At first I thought this was just an issue of read length, but in fact, the polyA tract is plenty short enough to be spanned by 76bp reads in WES. My suspicion is that this relates more to capture. The capture probe for this exon in the Agilent baits file ends at the beginning of the polyA:
2 47641406 47641559 + target_20209
So perhaps it has captured fragments centered further left, such that only their tail ends protrude into the polyA area.
5. dense SNPs or MNPs interfering with capture
Although it appears intronic on the IGV track at bottom, this region is actually baited in the Agilent exome capture kit:
5 140186771 140189171 + target_50250
Thus making this a fair comparison. There are 7 non-ref bases in just a 12-bp stretch, making this a big MNP or a non-contiguous series of SNPs, depending on your perspective. They’re quite clearly all on the same haplotype. In WGS, they achieve about a 50% allelic balance, and were called as separate heterozygous SNPs. In WES, the allelic balance is way off, with only two alt reads. I see two possible explanations. The density of mismatches may be enough to throw off alignment of a 76bp read but not a 250bp read, or it may be that so many SNPs so close together is enough to reduce the affinity of the DNA for capture probes, introducing bias where the ref allele dominates.
6. the curse of too much depth
This site in DUX4L2 is way out in the telomere at the end of the q end of chromosome 10. This is what’s known as a “collapsed repeat” – there are many near-identical tandem copies of this sequence in the human genome, but only one copy of the repeat is present in the reference genome. As a result there are a ton of reads here. The depth bar in the WGS screenshot goes up to 13,756 and in WES it goes to 837. While there is clearly a lot going on here, I’d say to be safe one might want to avoid calling any SNPs at all in this region. In WGS, however, several SNPs were called in this region. In WES, a different set of not-very-credible SNPs was called in this region, though in WES, most of them were filtered, whereas in WGS, they mysteriously achieved PASS scores. If you hover over the depth barplot for any one base, you’ll see they all have a mix of different bases:
This seemed to occur at sites with a massive amount of depth, yet the allelic depth reported by HaplotypeCaller was actually very low – in the individual in the above screenshot, the AD was listed as 2,1. This relates to the next category.
7. genotype calls without obvious read support
A surprisingly common category in the screenshots I looked at was cases where the reads appeared to provide no support whatsoever for the variant that was called. In WGS, a GGT>G deletion was called, and in WES it wasn’t called, but no reads in either dataset seemed to support the variant. On further inspection, the call in WGS was made with AD 6,0 and GQ 41.
Now, HaplotypeCaller does perform local realignment at complex sites, with the result that sometimes an IGV screenshot may not reflect the alignment on which the calls were based. However, in the example above, HaplotypeCaller recorded 0 alt alleles, so that can’t explain this call. One caveat for the analysis here is that the variant calls I’m analyzing were performed on a now-outdated version of GATK HaplotypeCaller in 2013 and so may reflect bugs that have since been fixed.
Although calls like this with no read support are probably just technical errors, I added this category to the list to remind us that not all of the ~2000 WGS-only alleles and ~500 WES-only alleles reflect differences in sequencing technology – some probably just reflect random error.
Here I’ve compared WGS and WES data for the same 11 individuals. The libraries differ in depth, read length and PCR amplification in addition to exome capture, so the comparison will reflect all of these differences. WGS was able to cover 98.4% of the exome at a “callable” depth and quality, while WES covered only 93.5% at the same threshold. This probably reflects the more uniform distribution of coverage in our WGS data due to lack of capture and lack of PCR amplification, as well as the superior mappability of the longer reads. Consistent with WGS covering an additional ~5% of the exome, about 5% of exonic alt allele calls in WGS calls were unique to WGS and were not called in WES. Only about 1% of WES reads were unique to WES. Manual examination of the sites that were discordantly genotyped by the two technologies revealed several explanations. WGS gave superior variant calls where its longer reads gave better mappability or spanned short tandem repeats, where its coverage was more uniform either in total depth or in distribution of the position of a variant within a read, and where it gave more even allelic balance in loci where a high density of SNPs may have interfered with exome capture. WES and WGS both gave spurious variant calls at sites of extraordinarily high depth, and many calls in both datastes were simply not supported by reads at all.
This post originally appeared on CureFFI.org
Here in the MacArthur lab we do a lot of Mendelian genetics – sifting through exomes of patients with rare neuromuscular diseases to try to find the causal variant. A few years ago, it was common practice that when looking for a rare disease-causing variant one would automatically rule out any variants that were already in dbSNP. But as dbSNP itself has grown and grown to include more and more rare disease variants, that’s become a better and better way to miss your causal variant. A more nuanced approach is to compare the variants in your sample to reference populations – 1000 Genomes, ESP, or the new 86K exome dataset our lab is working on – and limit your search to very rare variants, say, those below 1% or 0.1% frequency in the reference population.
Doing that requires that you be able to find your variant in the reference population if it’s there. As we’ve been using this approach for a while now, we’ve noticed a common error mode. Say we have a dataset of a few hundred joint-called exomes, and a patient has a potentially interesting frameshift variant that looks something like this Continue reading Converting genetic variants to their minimal representation