Announcing the release of updated (version 2) genome annotations, plus the initial release of 39 newly annotated DNA Zoo genomes!
tl;dr: the entire set of >2.9 million protein-coding genes spanning 109 mammalian genomes can be found here (see also Wasabi mirror). This set is much improved over the version 1 annotations, with the fraction of missing mammalian BUSCOs down to 5% (from 10%). We’ve called on average 28,141 genes per species (min 22,417 in Eidolon helvum, max 45,707 in Saimiri boliviensis). 95.3% of genes are assigned to 74,713 orthogroups, and 9,165 species-specific genes have been assigned to 2,609 orthogroups.
All protein files, transcripts, and GFF3 files can be found in the data release folders associated with each individual assembly. OrthoFinder summary files are found here, while the file that contains the orthogroups is found here.
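If you want to tally genes per species from the orthogroups file yourself, here is a minimal sketch. It assumes the file follows the usual OrthoFinder tab-separated layout (first column is the orthogroup ID, then one column per species, with gene IDs in each cell separated by commas); the species and gene names in the example are made up for illustration and are not taken from the release.

```python
import csv
import io


def genes_per_species(tsv_text):
    """Count gene IDs per species column in an orthogroups-style TSV."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    species = header[1:]  # first column holds the orthogroup ID
    counts = {name: 0 for name in species}
    for row in reader:
        for name, cell in zip(species, row[1:]):
            # An empty cell means the species has no gene in this orthogroup.
            genes = [g.strip() for g in cell.split(",") if g.strip()]
            counts[name] += len(genes)
    return counts


# Hypothetical two-species example, not real DNA Zoo data.
example = (
    "Orthogroup\tspecies_A\tspecies_B\n"
    "OG0000000\tgeneA1, geneA2\tgeneB1\n"
    "OG0000001\t\tgeneB2, geneB3\n"
)

print(genes_per_species(example))
```

For the real file, read its contents with `open(path).read()` and pass the text in; check the header first, since the column layout above is an assumption.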
What did we do differently?
Remember that in the first attempt, we used genes contained in the Swiss-Prot reference. For this update, in brief, we added more reference material: non-coding RNAs and transcript evidence from other species, with a focus on adding coverage for carnivores, rodents, and primates. This additional transcript information has dramatically improved our ability to detect genes in genomes.
See this blog post for information on the original version 1 genome annotations.
Each annotation took between 48 and 72 hours to run across 80 cores, for a total of about 450,000 core-hours!
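As a rough sanity check on that total, here is the back-of-the-envelope arithmetic. It assumes all 109 genomes were each annotated once on 80 cores, which is our simplification for this sketch rather than an exact record of the runs.

```python
N_GENOMES = 109   # assumption: every genome in the set was run once
CORES = 80


def core_hours(hours_per_run, n_genomes=N_GENOMES, cores=CORES):
    """Total core-hours if every run takes the same wall-clock time."""
    return hours_per_run * cores * n_genomes


low = core_hours(48)    # every run at the fast end
high = core_hours(72)   # every run at the slow end

# Average wall-clock hours per run implied by a 450,000 core-hour total.
implied_hours = 450_000 / (N_GENOMES * CORES)

print(f"range: {low:,} to {high:,} core-hours")
print(f"implied average run time: {implied_hours:.1f} hours")
```

Under these assumptions the quoted total sits near the low end of the range, which would mean most runs finished closer to 48 hours than to 72.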
The phylogeny of these 109 mammals (plus the ostrich, used as an outgroup) was computed using OrthoFinder 2.4.0. The image is below, and the Newick text file is here.
We can still do better, but for this we need RNA-seq data! If you have transcriptome data for any of the DNA Zoo genomes and would like to share it, I’d be happy to update the annotation! This would really help us improve both the completeness and the accuracy of the annotations.
Is your favorite gene missing? Let us know and we can see where it went.