The first million genes are the hardest to make(r)

Matthew MacManes
Sep 17, 2019

Everyone knows that gene annotations are critical for enabling the analysis of a new genome. Knowing something about the full complement of genes, and their order along chromosomes, is exceptionally powerful, and allows direct comparisons of genomes across the tree of life. To begin to understand this, we set out to create a set of annotations for all the mammals at the DNA Zoo.

tl;dr: the entire set of >1 million protein-coding genes, spanning 67 mammalian genomes, can be found here (see also Wasabi mirror). Each species’ folder contains a set of transcripts, proteins, and a gff3 file. The same files can also be found in the data release folders associated with each individual assembly.
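If you want to check the gene count for a species yourself, counting the `gene` rows in its gff3 file is usually enough. Here is a minimal sketch, assuming the annotations follow the standard nine-column GFF3 layout and label genes with the feature type `gene`; the function name and the toy records are made up for illustration and are not taken from the released files.

```python
from typing import Iterable


def count_gene_features(lines: Iterable[str], feature_type: str = "gene") -> int:
    """Count rows of a GFF3 file whose third column equals feature_type."""
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        # GFF3 files may embed sequences after a ##FASTA directive; stop there.
        if line.startswith("##FASTA"):
            break
        # Skip blank lines, comments, and other directives.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) == 9 and cols[2] == feature_type:
            count += 1
    return count


# Toy records (hypothetical), just to show the expected input shape.
example = [
    "##gff-version 3",
    "chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=gene1-mRNA-1;Parent=gene1",
    "chr1\tmaker\texon\t100\t900\t.\t+\t.\tID=exon1;Parent=gene1-mRNA-1",
    "chr2\tmaker\tgene\t50\t700\t.\t-\t.\tID=gene2",
    "##FASTA",
    ">chr1",
    "ACGT",
]
print(count_gene_features(example))
```

For a real file, pass an open file handle instead of the list, e.g. `count_gene_features(open(path))`, where `path` points at the gff3 you downloaded.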

A few important stats. On average, we identified 16,445 protein-coding genes per species. We do know that we’re missing between 8% and 10% of genes, based on gene content analysis in BUSCO. We have a clear analytical path to recover most of these (currently missing) genes, so stay tuned!

In total, we recovered 1,101,834 protein-coding genes. 98.1% of these genes are contained in about 22,156 orthogroups (sets of genes that share a common evolutionary history, specifically orthology or paralogy). What does this mean? Although we may be missing some genes (more on this below), with <2% of genes being “unassigned,” we are not predicting a bunch of junk, i.e., our false positive rate is low. How is this defined, you might ask? First, let’s assert that if a given transcript is found in other (or even many other) species by OrthoFinder2, then it is biological in nature, rather than a technical artifact of the annotation process. This leaves us with about 2% of transcripts. What are these? Each one is either novel (evolved in one species and found in no other in our dataset) or an artifact, i.e., a false positive. Surely we do observe evolutionary novelty, but what fraction of this 2% is novelty is unclear. Let’s be safe and say that all of it is artifactual, so that our false positive rate on calling transcripts is at most 2%.
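The arithmetic behind these numbers is simple enough to check by hand. The sketch below uses only the figures quoted in this post (total genes, number of genomes, and the 98.1% assigned fraction); the variable names are ours, and the gene counts derived from the percentage are approximate because the percentage itself is rounded.

```python
# Figures quoted in the post.
TOTAL_GENES = 1_101_834
N_GENOMES = 67
ASSIGNED_FRACTION = 0.981  # share of genes placed in an orthogroup

# Mean number of protein-coding genes per species.
mean_genes_per_species = TOTAL_GENES / N_GENOMES

# Approximate number of genes left outside any orthogroup.
assigned_genes = round(TOTAL_GENES * ASSIGNED_FRACTION)
unassigned_genes = TOTAL_GENES - assigned_genes

# Conservative bound: treat every unassigned gene as a false positive.
false_positive_upper_bound = unassigned_genes / TOTAL_GENES

print(f"mean genes per species: {mean_genes_per_species:.0f}")
print(f"unassigned genes (approx.): {unassigned_genes}")
print(f"false positive rate, upper bound: {false_positive_upper_bound:.1%}")
```

Note that this is an upper bound on one kind of error only: it says nothing about genes that are missing, which is what the BUSCO estimate addresses.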

What did we do?

Maker-based annotation: A key constraint is that we wanted a standardized gene annotation in each species, but we don’t have transcriptome data for every species. As such, we devised a strategy to leverage the power of homology, along with the fact that we have extensive knowledge of gene content in mammals.

Specifically, we elected to use the mammal-specific subset of UniProtKB/Swiss-Prot, a manually annotated, non-redundant protein sequence database. We believe this is a reasonable first approach, given the broad taxonomic and genic coverage of the genomic dataset. The Swiss-Prot subset used is available here.

So, critically, the annotations contained here are based on the mammalian subset of Swiss-Prot. This does a pretty good job and identifies the vast majority of protein-coding genes. However, note that we did not endeavor to annotate noncoding transcripts; that’s something we will be doing in the future.

To reproduce these runs, see https://github.com/macmanes-lab/dnazoo_annotation.

Each annotation took between 18 and 35 hours to run across 48 cores, for a total of about 80,000 core-hours.
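As a quick sanity check on the compute budget, here is a back-of-the-envelope sketch. It assumes each of the 67 genomes was annotated once on 48 cores (the post does not say whether any runs were repeated), and the variable names are ours.

```python
# Figures quoted in the post.
N_GENOMES = 67
CORES_PER_RUN = 48
MIN_HOURS, MAX_HOURS = 18, 35
REPORTED_CORE_HOURS = 80_000

# Assumption: one run per genome, all cores in use for the whole run.
core_slots = N_GENOMES * CORES_PER_RUN

# Bounds if every run took the shortest or the longest reported time.
low_estimate = core_slots * MIN_HOURS
high_estimate = core_slots * MAX_HOURS

# Mean wall-clock hours per run implied by the reported total.
implied_mean_hours = REPORTED_CORE_HOURS / core_slots

print(f"range: {low_estimate:,} to {high_estimate:,} core-hours")
print(f"implied mean run time: {implied_mean_hours:.1f} hours")
```

Under that assumption, the reported total of about 80,000 core-hours sits inside the range and corresponds to an average run of roughly 25 hours.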

Orthogroups: In addition to the annotations, we aimed to generate orthogroups using OrthoFinder2 (Emms, Kelly, bioRxiv, 2018); this is incredibly useful information for comparative biologists. We’ve even included a tree, see below. This tree was constructed within OrthoFinder using the default settings (e.g., using FastTree), and based on 1,023 orthogroups. Note that there may be some inconsistencies in the topology, as it’s our first stab at this, and further refinements are upcoming! One obvious way to improve the robustness of the tree is to apply a more appropriate model of sequence evolution for each partition, e.g., using IQ-TREE, but this is very time consuming.

Phylogenetic tree based on 1,023 orthogroups and constructed by OrthoFinder2 (Emms, Kelly, 2018). Scale bar refers to a phylogenetic distance of 0.04 nucleotide substitutions per site.

What’s next?

1. We know there are some genes that we’ve missed, and we have been developing a better approach that recovers them without adding too much run time. In addition to this, our current approach is missing ncRNA. Stay tuned, because the v2 annotation will contain these critically important elements!

2. Tell us what you want! What would make this even more useful? Let us know!

3. Is your favorite gene missing? Let us know and we can see where it went.
