Everyone knows that gene annotations are critical for enabling the analysis of a new genome. Knowing something about the full complement of genes, and their order along chromosomes, is exceptionally powerful, and allows direct comparisons of genomes across the tree of life. To begin to explore this, we set out to create a set of annotations for all the mammals at the DNA Zoo.


tl;dr, the entire set of >1 million protein-coding genes spanning 67 mammalian genomes, can be found here (see also Wasabi mirror). Each species’ folder contains a set of transcripts, proteins, and a gff3 file. The same files can also be found in data release folders associated with each individual assembly.
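
As a quick sanity check on a downloaded annotation, one might tally the feature types in a species' gff3 file. The sketch below is only an illustration: it assumes the standard nine-column, tab-separated GFF3 layout with the feature type in column 3, and the function name and the example records (including the `maker` source label and the IDs) are made up for demonstration rather than taken from the release.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(gff3_lines: Iterable[str]) -> Counter:
    """Count GFF3 feature types (column 3), skipping comments and any FASTA block."""
    counts: Counter = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded sequence section may follow the annotations
        if not line or line.startswith("#"):
            continue  # blank lines, pragmas, and comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        counts[fields[2]] += 1
    return counts


# Hypothetical records, for illustration only.
example_lines = [
    "##gff-version 3",
    "\t".join(["chr1", "maker", "gene", "100", "900", ".", "+", ".", "ID=gene1"]),
    "\t".join(["chr1", "maker", "mRNA", "100", "900", ".", "+", ".", "ID=mrna1;Parent=gene1"]),
    "\t".join(["chr1", "maker", "CDS", "150", "850", ".", "+", "0", "ID=cds1;Parent=mrna1"]),
    "\t".join(["chr2", "maker", "gene", "50", "400", ".", "-", ".", "ID=gene2"]),
    "\t".join(["chr2", "maker", "mRNA", "50", "400", ".", "-", ".", "ID=mrna2;Parent=gene2"]),
]

counts = count_feature_types(example_lines)
print(counts["gene"], counts["mRNA"], counts["CDS"])
```

For a real file, the same function can be fed an open file handle, e.g. `count_feature_types(open(path))`, where `path` points at the gff3 file of interest.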


A few important stats. On average, we identified 16,445 protein-coding genes per species. We do know that we’re missing between 8% and 10% of genes, based on gene content analysis in BUSCO. We have a clear analytical path to recover most of these (currently missing) genes, so stay tuned!


In total, we recovered 1,101,834 protein-coding genes. 98.1% of these genes are contained in about 22,156 orthogroups (sets of genes that share a common evolutionary history, specifically orthology or paralogy). What does this mean? Although we may be missing some genes (more on this below), with <2% of genes being “unassigned,” we are not predicting a bunch of junk, i.e., our false positive rate is low. How is this defined, you might ask. First, let’s assert that if a given transcript is found in other (or even many other) species by OrthoFinder2, then it is biological in nature, rather than a technical artifact of the annotation process. This leaves us with about 2% of transcripts: what are these? Each is either novel (evolved in a single species and no other in our dataset) or an artifact (a false positive). We surely do observe evolutionary novelty, but what fraction of this 2% is genuine novelty is unclear. Let’s be safe and say that it is all artifactual, and that our false positive rate on calling transcripts is at most 2%.
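
The arithmetic behind these figures is simple enough to check by hand, but a small sketch may make the reasoning explicit. It uses only numbers quoted in this post (1,101,834 genes, 67 genomes, 98.1% assigned to orthogroups). The function name is hypothetical, and treating every unassigned gene as an artifact is the deliberately conservative assumption described above, not a measured error rate.

```python
from typing import Tuple


def unassigned_upper_bound(total_genes: int, assigned_fraction: float) -> Tuple[float, int]:
    """Return (fraction, approximate count) of genes not placed in any orthogroup.

    Under the conservative assumption that every unassigned gene is an
    artifact, the fraction is an upper bound on the false positive rate.
    """
    unassigned_fraction = 1.0 - assigned_fraction
    return unassigned_fraction, round(total_genes * unassigned_fraction)


TOTAL_GENES = 1_101_834   # protein-coding genes across all species (from the post)
N_SPECIES = 67            # mammalian genomes annotated (from the post)
ASSIGNED_FRACTION = 0.981 # share of genes placed in orthogroups (from the post)

fp_bound, n_unassigned = unassigned_upper_bound(TOTAL_GENES, ASSIGNED_FRACTION)
mean_genes_per_species = TOTAL_GENES / N_SPECIES

print(f"unassigned genes: ~{n_unassigned} ({fp_bound:.1%} of the total)")
print(f"mean genes per species: {mean_genes_per_species:.0f}")
```

Because the 98.1% figure is itself rounded, the unassigned count should be read as approximate.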


What did we do?


Maker-based annotation: A key constraint is that we wanted a standardized gene annotation in each species, but we don’t have transcriptome data for every species. As such, we devised a strategy to leverage the power of homology, along with the fact that we have extensive knowledge of gene content in mammals.


Specifically, we elected to use the mammal-specific subset of UniProtKB/Swiss-Prot, a manually annotated, non-redundant protein sequence database. We believe this is a reasonable first approach, given the broad taxonomic and genic coverage of the genomic dataset. The Swiss-Prot subset used is available here.


So, critically, the annotations contained here are based on Swiss-Prot mammals. This does a pretty good job and identifies the vast majority of protein-coding genes. However, note that we did not endeavor to annotate noncoding transcripts – that’s something we will be doing in the future.



Each annotation took between 18 and 35 hours to run across 48 cores, for a total of about 80,000 core-hours.
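
As a rough consistency check on the compute budget, the quoted per-run range can be turned into bounds on the total. This is a back-of-the-envelope sketch that assumes one run per genome for the 67 genomes and all 48 cores busy for the full wall-clock time; the variable names are ours.

```python
N_SPECIES = 67          # genomes annotated (from the post)
CORES_PER_RUN = 48      # cores used per annotation run (from the post)
MIN_HOURS, MAX_HOURS = 18, 35   # wall-clock range per run (from the post)
REPORTED_TOTAL = 80_000         # approximate total core-hours (from the post)

# Bounds if every run took the shortest or the longest quoted time.
low = N_SPECIES * CORES_PER_RUN * MIN_HOURS
high = N_SPECIES * CORES_PER_RUN * MAX_HOURS

# Average wall-clock time per run implied by the reported total.
implied_mean_hours = REPORTED_TOTAL / (N_SPECIES * CORES_PER_RUN)

print(f"total core-hours between {low} and {high}")
print(f"implied mean run time: {implied_mean_hours:.1f} hours")
```

The reported total sits inside these bounds and implies an average run of roughly a day per genome.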


Orthogroups: In addition to the annotations, we aimed to generate orthogroups using OrthoFinder2 (Emms, Kelly, bioRxiv, 2018) – this is incredibly useful information for comparative biologists. We’ve even included a tree, see below. This tree was constructed within OrthoFinder using the default settings (e.g., using FastTree), and based on 1,023 orthogroups. Note that there may be some inconsistencies in the topology, as it’s our first stab at this, and further refinements are upcoming! One obvious way to improve the robustness of the tree is to apply a more appropriate model of sequence evolution for each partition, e.g., using IQ-TREE, but this is very time-consuming.

Phylogenetic tree based on 1,023 orthogroups and constructed by OrthoFinder2 (Emms, Kelly, 2018). Scale bar refers to a phylogenetic distance of 0.04 nucleotide substitutions per site.

What’s next?


1. We know there are some genes that we’ve missed, and we have been developing a better approach, without sacrificing too much time. In addition to this, our current approach is missing ncRNA - stay tuned, because v2 annotation will contain these critically important elements!


2. Tell us what you want. What would make this even more useful? Let us know!


3. Is your favorite gene missing? Let us know and we can see where it went.

 
 
 

The canyon mouse (Peromyscus crinitus) is native to North America [1]. Its preferred habitat is arid, rocky desert, making it a great model organism for studying adaptation to desert environments. Gaining a deeper understanding of desert adaptation (e.g., osmoregulation and water metabolism) is important for conservation, climate change studies, and human health (for instance, understanding kidney disease).


In collaboration with the MacManes lab at the University of New Hampshire, today we share the chromosome-length genome assembly for the canyon mouse. The draft genome assembly was generated using 10X data by Matthew MacManes, Anna Tigano and Jocelyn Colella at the University of New Hampshire. The fibroblast culture for Hi-C library preparation came from the archive collected at the Texas Medical Center.


Check out below how the new genome assembly compares to a publicly available genome of a close relative, the prairie deer mouse (P. maniculatus, ~5MY divergence [2]). The genome assembly, HU_Pman_2.1, was shared by J.-M. Lassance and H.E. Hoekstra at Harvard University and Howard Hughes Medical Institute, here. We also include a comparison to the golden hamster chromosome-length genome assembly (Mesocricetus auratus, upgrade of genome assembly MesAur1.0), a rodent from the same family from the DNA Zoo collection (Cricetidae, ~20MY divergence [3]).

Whole-genome alignment plots between the chromosome-length canyon mouse genome assembly (pecr10X_v2_HiC) and the assemblies of the prairie deer mouse (HU_Pman_2.1) and the golden hamster (MesAur1.0_HiC).

It is worth noting that the sample used for Hi-C library preparation proved to have a polymorphic karyotype. The fasta shared today represents one of these karyotypes, the one most consistent with an individual animal used to create the draft genome assembly. We are now working to sequence more canyon mice to figure out if this polymorphism is a feature or a bug. So, stay tuned for more info on this, and for more Peromyscus genomes and data!

 
 
 

The blue wildebeest aka common wildebeest or brindled gnu (Connochaetes taurinus) is a large African antelope from the Bovidae (cow, goat and sheep) family. The blue wildebeest is currently widespread: the population is estimated to be around 1.5 million, and is stable. At least in part this population success is thought to be brought about by management-controlled translocations in private game farms, reserves and conservancies [1].


Today, we continue our survey into ruminant genomics by sharing a chromosome-length blue wildebeest genome assembly. This is once again based on the recent Science paper by Chen, Qiu, Jiang, Wang, Lin, Li et al. (See our previous posts for the Chinese muntjac and gerenuk based on the same work.) Thank you, SeaWorld, for the sample used to generate the Hi-C data and create the upgrade!


We take this opportunity to further our comparison of Bovidae genomes, below, through their alignment to the genome assembly of cattle, from (Zimin et al., Genome Biol. 2009). This is the first Bovidae genome in our collection whose chromosome count differs from the rest: the assembly suggests (independently but in agreement with published data) 2n=58 for the blue wildebeest and 2n=60 for the other 3 Bovidae assemblies shared by DNA Zoo (bison, sable antelope and gerenuk).

Whole-genome alignments of the four Bovidae genomes to the cow reference Bos_taurus_UMD_3.1.1. The species included are: bison (Bison_UMD1.0_HiC), sable antelope (Sable_antelope_masurca.scf_HiC), gerenuk (GRK_HiC) and the blue wildebeest (BWD_HiC). The orange circle highlights a change in the position of the bit corresponding to cow chromosome #25 in the blue wildebeest genome as compared to all other genomes in the family: both cow chromosome #25 and #2 align to the chromosome labeled #1 in the blue wildebeest, suggesting a fusion.

This chromosome count change is brought about by a fusion of a bit corresponding to chromosome #25 in the cow (highlighted in the image above) to the bit corresponding to cow chromosome #2. (The fused chromosome is labeled #1 in the new wildebeest genome assembly.) The fusion is obvious in the assembly data, along with a few other smaller rearrangements. It is worth noting that the 2;25 fusion has been previously mapped using G-banded karyotyping, by Cynthia Steiner and colleagues at the San Diego Zoo Institute for Conservation Research. See image below from their paper (Steiner et al., Journal of Heredity 2014)!

Figure 1B from (Steiner et al., Journal of Heredity, 2014): G-banded karyotype of the blue wildebeest. Autosomal arms are numbered according to standard karyotype of cattle. (The highlight (by DNA Zoo) shows the fused chromosome.)
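
The chromosome-count bookkeeping behind the 2;25 fusion can be written out as a toy model. This is a simplified sketch, not a description of the assembly pipeline: it assumes a cattle-like starting set of 29 autosomes plus one sex chromosome per haploid set (2n=60), uses cattle chromosome numbers as labels, and treats the fusion as homozygous. The `fuse` helper is hypothetical.

```python
from typing import List


def fuse(haploid_set: List[str], a: str, b: str) -> List[str]:
    """Return a new haploid set in which chromosomes a and b are joined into one."""
    if a not in haploid_set or b not in haploid_set:
        raise ValueError(f"both {a!r} and {b!r} must be present in the set")
    remaining = [c for c in haploid_set if c not in (a, b)]
    return remaining + [f"{a};{b}"]


def diploid_number(haploid_set: List[str]) -> int:
    """2n for a haploid set, counting the sex chromosomes as one pair."""
    return 2 * len(haploid_set)


# Assumed cattle-like haploid set: autosomes 1..29 plus a sex chromosome.
cattle_like = [str(i) for i in range(1, 30)] + ["X"]

# Joining the segments that correspond to cattle chromosomes 2 and 25.
wildebeest_like = fuse(cattle_like, "2", "25")

print(diploid_number(cattle_like), "->", diploid_number(wildebeest_like))
```

One fusion removes one chromosome from the haploid set, which is why a single event is enough to account for the drop from 2n=60 to 2n=58.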

 
 
 

© 2018-2022 by the Aiden Lab.
