Genos: A Human-Centric Genomic Foundation Model
Authors: Adi Lin, Bin Xie, Cheng Ye, Cheng Wang, Duoyuan Chen, Ercheng Wang, Fanfeng Lu, Guirong Xue, Haiqiang Zhang, Jiajie Zhan, Jianfeng Zhang, Jiangshuan Pang, Jianqiang Liang, Jiawei Lin, Jiaxin Ma, Jie Hu, Jing Ma, Jinni Dong, Jiongzhen Li, Junchen Liu, Junhong Chen, Junyou Li, Kai Ding, Kaiwen Deng, Kui Chen, Lihui Wang, Longqi Liu, Ling Guo, Liwen Xiong, Luhao Yang, Ming Cheng, Nanning Chen, Renzhong Chen, Shanxin Sun, Shaoshuai Li, Shicheng Chen, Shiping Liu, Siwei Xie, Suyan Liu, Tao Zhou, Wangyang Tang, Weiqiang Zhang, Xianyue Jiang, Xianzhi Qi, Xin Jin, Xinjiang Tan, Xinyue Hu, Xun Xu, Xuyang Feng, Yafei Lu, Yifan Gao, Yong Shang, Youzhe He, Yue Yuan, Yufan Wang, Yuqi Liu, Zhan Xiao, Zhangyuan Meng, Zhaorong Li, Zhe Zhao, Zheng Yang, Zilin Wang
Journal: GigaScience (Impact Factor 11.8; JCR Q1, Multidisciplinary Sciences)
Publication date: 2025-10-22
DOI: 10.1093/gigascience/giaf132 (https://doi.org/10.1093/gigascience/giaf132)
Abstract
The rapid expansion of human genomic data demands foundation models that can handle ultra-long sequences and capture population diversity. Existing models commonly fall short on both counts, lacking human-specific representation and clinical inference efficiency. Here, we introduce Genos (Genos-1.2B/Genos-10B), a human-centric genomic foundation model engineered for million-base-pair sequence modeling. Genos uses a large-scale Mixture of Experts (MoE) architecture optimized for a 1 Mb context and is trained on high-quality human de novo assemblies from datasets such as HPRC and HGSVC, which represent diverse global populations. A suite of optimization strategies ensures training stability and improves computational efficiency; together, these reduce costs and make million-base-pair context modeling feasible. Functionally, Genos performs analysis at single-nucleotide resolution and dynamically simulates the cascade effects of non-coding variants on RNA expression profiles. In comprehensive evaluations, Genos uniformly surpasses state-of-the-art models on critical human genomics benchmarks and demonstrates robust Omics-Text cross-modal diagnostic capabilities. We present a systematic technical evaluation and validation of Genos's architecture, training convergence, and performance across standard benchmarks. This work provides a reliable technical blueprint and performance benchmark for developing the next generation of high-efficiency genomic foundation models. Genos model weights, inference code, and usage documentation are publicly available on GitHub (https://github.com/BGI-HangzhouAI/Genos) and Hugging Face Hub (https://huggingface.co/BGI-HangzhouAI), with additional cloud services accessible via BGI DCS Cloud, all released under the MIT License.
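To make the Mixture of Experts idea mentioned in the abstract concrete, the following is a minimal, generic sketch of sparse top-k expert routing for a single token. It is not Genos's implementation: the abstract does not state the routing scheme, the number of experts, or the value of k, so the function names (`top_k_route`, `moe_layer`), the four toy "scaling" experts, and k=2 are all illustrative assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are the softmax probabilities of the selected experts,
    renormalized so that they sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


def moe_layer(token_vec, router_weights, experts, k=2):
    """Apply a sparse MoE layer to one token embedding.

    router_weights: one weight vector per expert; its dot product with the
        token gives that expert's routing logit.
    experts: callables mapping a vector to a vector of the same length.
    Only the k selected experts are evaluated (the source of the compute saving).
    """
    logits = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    output = [0.0] * len(token_vec)
    for idx, weight in top_k_route(logits, k):
        expert_out = experts[idx](token_vec)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


def make_scaling_expert(factor):
    """Hypothetical toy expert: multiplies every component by a constant."""
    return lambda vec: [factor * v for v in vec]


# Toy setup (all values are assumptions chosen for illustration).
experts = [make_scaling_expert(f) for f in (0.5, 1.0, 2.0, 3.0)]
rng = random.Random(0)
router_weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
token = [0.2, -0.1, 0.7]

routed = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)  # picks experts 1 and 3
mixed = moe_layer(token, router_weights, experts, k=2)
```

In this sketch each token activates only k of the available experts, so the per-token compute grows with k rather than with the total expert count. That is the general reason MoE designs are attractive for very long contexts; how Genos itself realizes this is described in the paper, not here.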
About the journal:
GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.