
DNA Zoo China completes a genome assembly upgrade for the endangered forest musk deer


The forest musk deer (Moschus berezovskii) is a species of particular importance to Chinese ecology, biodiversity conservation, economy, and medicine. Like other musk deer, the forest musk deer has long been hunted for its musk, and illegal hunting continues today. Hunting pressure, habitat loss, and disease in captive animals have led to a significant population decline. Today, the forest musk deer is a Class I protected species in China, listed in CITES Appendix II and classified as endangered by the IUCN [1].


To facilitate conservation efforts in China, we have recently started DNA Zoo China at ShanghaiTech University. If you would like to work together on genome sequencing and assembly of interesting species in China, please reach out to me, Dr. Lichun Jiang (jianglch@shanghaitech.edu.cn).


As its inaugural effort, DNA Zoo China today releases an upgraded genome assembly for the forest musk deer. Starting from the draft genome of (Fan et al., 2018), we used Hi-C to assemble the project's first chromosome-level forest musk deer genome. We are very grateful to Dr. Suwen Zhao from ShanghaiTech University and to ShangHai ChongMing DongPing YuanShe XunYang Co., Ltd. for providing us with the sample for Hi-C library preparation!


The chromosome-length assembly is shared here. See below how the 29 chromosomes of the forest musk deer relate to the 30 chromosomes of cattle (Zimin et al., Genome Biology, 2009). Note the fusion of cow chromosomes 26 and 28 in the musk deer.

Whole-genome alignment plot between the chromosomes of the new forest musk deer genome assembly (ls35.final.genome_HiC) and those of the domestic cow (Bos_taurus_UMD_3.1.1, from Zimin et al., 2009).
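A fusion like this can also be spotted numerically rather than by eye. The sketch below is a hypothetical illustration, not the procedure used to make the plot: it assumes alignment blocks are available as (query chromosome, target chromosome, aligned bases) tuples, and it flags query chromosomes whose aligned bases are split across more than one target chromosome. The chromosome names, block sizes, and the 20% threshold are all invented for the example.

```python
from collections import defaultdict


def major_partners(blocks, min_fraction=0.2):
    """Map each query chromosome to the target chromosomes that account for
    at least `min_fraction` of its aligned bases."""
    aligned = defaultdict(lambda: defaultdict(int))
    for query_chrom, target_chrom, aligned_bp in blocks:
        aligned[query_chrom][target_chrom] += aligned_bp
    partners = {}
    for query_chrom, per_target in aligned.items():
        total = sum(per_target.values())
        partners[query_chrom] = sorted(
            t for t, bp in per_target.items() if bp / total >= min_fraction
        )
    return partners


# Toy alignment blocks: (musk deer chromosome, cow chromosome, aligned bp).
# Names and lengths are invented for illustration only.
toy_blocks = [
    ("md_chrA", "cow_26", 40_000_000),
    ("md_chrA", "cow_28", 35_000_000),
    ("md_chrA", "cow_3", 500_000),  # small off-target hit, filtered out
    ("md_chrB", "cow_1", 150_000_000),
]
partners = major_partners(toy_blocks)
# Query chromosomes with more than one major partner are fusion candidates.
fused = {q: t for q, t in partners.items() if len(t) > 1}
print(fused)
```

A chromosome with two major partners is only a candidate; whether it reflects a fusion in one lineage or a fission in the other needs an outgroup to decide.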

 
 
 

Everyone knows that gene annotations are critical for enabling the analysis of a new genome. Knowing something about the full complement of genes, and their order along chromosomes, is exceptionally powerful and allows direct comparisons of genomes across the tree of life. To begin to understand this, we set out to create a set of annotations for all the mammals at the DNA Zoo.


tl;dr: the entire set of >1 million protein-coding genes spanning 67 mammalian genomes can be found here (see also Wasabi mirror). Each species’ folder contains a set of transcripts, proteins, and a GFF3 file. The same files can also be found in the data release folders associated with each individual assembly.
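For readers who want to sanity-check a download, here is a minimal sketch that tallies feature types in a GFF3 file. It assumes the standard nine-column, tab-separated GFF3 layout with the feature type in column 3, and that genes are recorded as `gene` features. The function name and the toy records are illustrative and are not part of the DNA Zoo pipeline.

```python
import io
from collections import Counter


def count_feature_types(gff3_lines):
    """Count features by type (column 3) in an iterable of GFF3 lines."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        counts[fields[2]] += 1
    return counts


# Toy two-gene example (made-up coordinates, for illustration only).
example = io.StringIO(
    "##gff-version 3\n"
    "chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1\n"
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n"
    "chr1\tmaker\tgene\t2000\t2900\t.\t-\t.\tID=gene2\n"
)
counts = count_feature_types(example)
print(counts["gene"])
```

To run it on a real file, pass an open file handle instead of the toy `StringIO` object; the function reads line by line, so it does not load the whole file into memory.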


A few important stats. On average, we identified 16,445 protein-coding genes per species. We do know that we’re missing between 8% and 10% of genes, based on gene content analysis in BUSCO. We have a clear analytical path to recover most of these (currently missing) genes, so stay tuned!


In total, we recovered 1,101,834 protein-coding genes. 98.1% of these genes are contained in about 22,156 orthogroups (sets of genes that share a common evolutionary history, specifically orthology or paralogy). What does this mean? Although we may be missing some genes (more on this below), with <2% of genes being “unassigned,” we are not predicting a bunch of junk, i.e., our false positive rate is low. How is this defined, you might ask. First, let’s assert that if a given transcript is found in other (or even many other) species by OrthoFinder2, then it is biological in nature rather than a technical artifact of the annotation process. This leaves us with 2% of transcripts. What are these? Each is either novel (evolved in a single species and in no other in our dataset) or an artifact, i.e., a false positive. We surely do observe evolutionary novelty, but what fraction of this 2% is novelty is unclear. Let’s be safe and say that it is all artifactual, and that our false positive rate on calling transcripts is at most 2%.
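The arithmetic behind these figures is easy to reproduce. The short sketch below uses only the numbers quoted in the post. Treating every unassigned gene as a false positive is the conservative assumption stated above, so the result is an upper bound rather than a measured error rate.

```python
total_genes = 1_101_834    # protein-coding genes recovered (from the post)
species = 67               # mammalian genomes annotated (from the post)
assigned_fraction = 0.981  # share of genes placed in orthogroups (from the post)

mean_genes_per_species = total_genes / species
unassigned_fraction = 1 - assigned_fraction
unassigned_genes = round(total_genes * unassigned_fraction)

print(f"mean genes per species: {mean_genes_per_species:.0f}")
print(f"upper bound on false positive rate: {unassigned_fraction:.1%}")
print(f"approx. unassigned genes: {unassigned_genes}")
```

The mean works out to roughly 16,445 genes per species, matching the figure quoted earlier, and the unassigned share is 1.9%, which the text rounds up to 2%.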


What did we do?


Maker-based annotation: A key constraint is that we wanted a standardized gene annotation in each species, but we don’t have transcriptome data for every species. As such, we devised a strategy to leverage the power of homology, along with the fact that we have extensive knowledge of gene content in mammals.


Specifically, we elected to use the mammal-specific subset of UniProtKB/Swiss-Prot, a manually annotated, non-redundant protein sequence database. We believe this is a reasonable first approach, given the broad taxonomic and genic coverage of the genomic dataset. The Swiss-Prot subset used is available here.


So, critically, the annotations contained here are based on Swiss-Prot mammals. This does a pretty good job and identifies the vast majority of protein-coding genes. However, note that we did not endeavor to annotate noncoding transcripts; that’s something we will be doing in the future.



Each annotation took between 18 and 35 hours to run across 48 cores, for a total of about 80,000 core-hours.
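As a rough consistency check on the compute budget, the sketch below brackets the total from the per-run figures. It assumes one annotation run per species (67 runs in all) and that every run used all 48 cores for its full duration. Neither assumption is stated explicitly in the post.

```python
species = 67                   # annotated genomes (from the post)
cores = 48                     # cores per annotation run (from the post)
hours_min, hours_max = 18, 35  # wall-clock range per run (from the post)

low = species * cores * hours_min   # every run at the fast end
high = species * cores * hours_max  # every run at the slow end
implied_mean_hours = 80_000 / (species * cores)

print(low, high)
print(f"implied mean run time: {implied_mean_hours:.1f} h")
```

Under these assumptions the total falls between 57,888 and 112,560 core-hours, so the quoted figure of about 80,000 corresponds to a mean run time of roughly 25 hours.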


Orthogroups: In addition to the annotations, we generated orthogroups using OrthoFinder2 (Emms, Kelly, bioRxiv, 2018); this is incredibly useful information for comparative biologists. We’ve even included a tree, see below. This tree was constructed within OrthoFinder using the default settings (e.g., using FastTree) and is based on 1,023 orthogroups. Note that there may be some inconsistencies in the topology, as it’s our first stab at this, and further refinements are upcoming! One obvious way to improve the robustness of the tree is to apply a more appropriate model of sequence evolution for each partition, e.g., using IQ-TREE, but this is very time consuming.

Phylogenetic tree based on 1,023 orthogroups and constructed by OrthoFinder2 (Emms, Kelly, 2018). Scale bar refers to a phylogenetic distance of 0.04 nucleotide substitutions per site.
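For anyone who wants to explore the orthogroups directly, here is a minimal sketch that picks out single-copy orthogroups from an orthogroup table. It assumes a tab-separated layout with one row per orthogroup, one column per species, and comma-separated gene IDs in each cell. The species names and gene IDs are invented. The post does not say how the 1,023 orthogroups behind the tree were chosen, so the single-copy filter is an illustrative choice and not necessarily the criterion OrthoFinder applied.

```python
import csv
import io


def single_copy_orthogroups(tsv_handle):
    """Return IDs of orthogroups with exactly one gene in every species column."""
    reader = csv.reader(tsv_handle, delimiter="\t")
    header = next(reader)  # first column holds the orthogroup ID
    n_species = len(header) - 1
    selected = []
    for row in reader:
        if not row:
            continue
        # Pad short rows: trailing empty species cells may be missing.
        cells = row[1:] + [""] * (n_species - len(row[1:]))
        gene_counts = [
            len([g for g in cell.split(",") if g.strip()]) for cell in cells
        ]
        if all(count == 1 for count in gene_counts):
            selected.append(row[0])
    return selected


# Toy table with hypothetical species and gene names.
toy = io.StringIO(
    "Orthogroup\tspeciesA\tspeciesB\tspeciesC\n"
    "OG0000000\ta1, a2\tb1\tc1\n"
    "OG0000001\ta3\tb2\tc2\n"
    "OG0000002\ta4\t\tc3\n"
)
result = single_copy_orthogroups(toy)
print(result)
```

In the toy table only the second orthogroup passes: the first has two genes in one species, and the third is absent from one species.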

What’s next?


1. We know there are some genes that we’ve missed, and we have been developing a better approach that does not sacrifice too much time. In addition, our current approach is missing ncRNA. Stay tuned, because the v2 annotation will contain these critically important elements!


2. Tell us what you want! What would make this even more useful? Let us know!


3. Is your favorite gene missing? Let us know and we can see where it went.

 
 
 

The canyon mouse (Peromyscus crinitus) is native to North America [1]. Its preferred habitat is arid, rocky desert, making it a great model organism for studying desert adaptation. Gaining a deeper understanding of desert adaptation (e.g., osmoregulation and water metabolism) is important for conservation, climate change studies, and human health (for instance, understanding kidney disease).


In collaboration with the MacManes lab at the University of New Hampshire, today we share the chromosome-length genome assembly for the canyon mouse. The draft genome assembly was generated using 10X data by Matthew MacManes, Anna Tigano and Jocelyn Colella at the University of New Hampshire. The fibroblast culture for Hi-C library preparation came from the archive collected at the Texas Medical Center.


Check out below how the new genome assembly compares to a publicly available genome of a close relative, the prairie deer mouse (P. maniculatus, ~5MY divergence [2]). The genome assembly, HU_Pman_2.1, was shared by J.-M. Lassance and H.E. Hoekstra at Harvard University and Howard Hughes Medical Institute, here. We also include a comparison to the golden hamster chromosome-length genome assembly (Mesocricetus auratus, upgrade of genome assembly MesAur1.0), a rodent from the same family from the DNA Zoo collection (Cricetidae, ~20MY divergence [3]).

Whole-genome alignment plots between the chromosome-length canyon mouse genome assembly (pecr10X_v2_HiC) and the assemblies of the prairie deer mouse (HU_Pman_2.1) and the golden hamster (MesAur1.0_HiC).

It is worth noting that the sample used for Hi-C library preparation proved to have a polymorphic karyotype. The fasta shared today represents one of these karyotypes, the one most consistent with an individual animal used to create the draft genome assembly. We are now working to sequence more canyon mice to figure out if this polymorphism is a feature or a bug. So, stay tuned for more info on this, and for more Peromyscus genomes and data!

 
 
 


© 2018-2022 by the Aiden Lab.
