Mouse-Geneformer: A Deep Learning Model for Mouse Single-Cell Transcriptome and Its Cross-Species Utility

Keita Ito, Tsubasa Hirakawa, Shuji Shigenobu, Hironobu Fujiyoshi, Takayoshi Yamashita
Journal: bioRxiv - Bioinformatics
Publication date: 2024-09-18
DOI: 10.1101/2024.09.09.611960 (https://doi.org/10.1101/2024.09.09.611960)

Abstract

Deep learning techniques are increasingly used to analyze large-scale single-cell RNA sequencing (scRNA-seq) data, offering valuable insights from complex transcriptome datasets. Geneformer, a model built on a Transformer Encoder architecture and pre-trained on human scRNA-seq datasets, has demonstrated remarkable success in human transcriptome analysis. However, given the prominence of the mouse, Mus musculus, as a primary mammalian model in biological and medical research, there is an acute need for a mouse-specific version of Geneformer. In this study, we developed a mouse-specific Geneformer (mouse-Geneformer) by constructing a large transcriptome dataset consisting of 21 million mouse scRNA-seq profiles and pre-training Geneformer on this dataset. Mouse-Geneformer effectively models the mouse transcriptome and, upon fine-tuning for downstream tasks, enhances the accuracy of cell type classification. In silico perturbation experiments using mouse-Geneformer successfully identified disease-causing genes that have been validated in in vivo experiments. These results demonstrate the feasibility of analyzing mouse data with mouse-Geneformer and highlight the robustness of the Geneformer architecture, which is applicable to any species for which large-scale transcriptome data are available. Furthermore, we found that mouse-Geneformer can analyze human transcriptome data in a cross-species manner. After ortholog-based gene name conversion, analysis of human scRNA-seq data using mouse-Geneformer, followed by fine-tuning with human data, achieved cell type classification accuracy comparable to that obtained with the original human Geneformer. In in silico simulation experiments on human disease models, we obtained results similar to those of the human Geneformer for the myocardial infarction model but only partially consistent results for the COVID-19 model, which represents a condition specific to humans (laboratory mice are not susceptible to SARS-CoV-2). These findings suggest the potential for cross-species application of the Geneformer model while emphasizing the importance of species-specific models for capturing the full complexity of disease mechanisms. Although the original Geneformer is tailored to humans, human research could still benefit from mouse-Geneformer because its training data include samples that are ethically or technically inaccessible in humans, such as embryonic tissues and certain disease models. Additionally, this cross-species approach indicates potential use for non-model organisms, for which obtaining large-scale single-cell transcriptome data is challenging.
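
The ortholog-based gene name conversion mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name `convert_profile`, the three-entry ortholog table, and the placeholder gene `HUMAN_ONLY_GENE` are assumptions made purely for illustration, and the abstract does not say how genes without an ortholog or many-to-one mappings are handled. The sketch assumes unmapped genes are dropped and counts are summed when several human genes map to the same mouse gene.

```python
from typing import Dict, List, Tuple

# Illustrative human -> mouse ortholog table (assumption: a real table would be
# much larger and would come from an ortholog resource chosen by the user).
HUMAN_TO_MOUSE: Dict[str, str] = {
    "TP53": "Trp53",
    "GAPDH": "Gapdh",
    "ACTB": "Actb",
}


def convert_profile(
    human_counts: Dict[str, float],
    ortholog_map: Dict[str, str],
) -> Tuple[Dict[str, float], List[str]]:
    """Rename human genes to their mouse orthologs.

    Returns the converted profile and the list of human genes that had no
    ortholog in the table (assumption: such genes are simply dropped).
    """
    converted: Dict[str, float] = {}
    dropped: List[str] = []
    for gene, count in human_counts.items():
        mouse_gene = ortholog_map.get(gene)
        if mouse_gene is None:
            dropped.append(gene)
            continue
        # Assumption: counts are summed if several human genes share one ortholog.
        converted[mouse_gene] = converted.get(mouse_gene, 0.0) + count
    return converted, dropped


if __name__ == "__main__":
    # Hypothetical single-cell profile keyed by human gene symbol.
    cell = {"TP53": 3.0, "GAPDH": 10.0, "HUMAN_ONLY_GENE": 1.0}
    mouse_like, missing = convert_profile(cell, HUMAN_TO_MOUSE)
    print(mouse_like)
    print(missing)
```

After a conversion of this kind, a human profile is expressed in the mouse gene vocabulary and could in principle be passed to a mouse-trained model. Genes that are dropped carry no signal into the model, which may be one reason a cross-species analysis captures some disease models only partially.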