General Intelligence Framework to Predict Virus Adaptation Based on a Genome Language Model
Shu-Yang Jiang, Shi-Shun Zhao, Jun-Qing Wei, Sen Zhang, Zhongpeng Zhao, Yigang Tong, Wei Liu, Jianwei Wang, Tao Jiang, Jing Li
Research, volume 8, article 0871 (Journal Article, eCollection). Published 2025-09-30.
DOI: 10.34133/research.0871 (https://doi.org/10.34133/research.0871)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12480747/pdf/
Abstract
Most human viral pandemics are caused by viruses of animal origin that have adapted to humans. Inferring adaptation from viral genes or their encoded protein sequences is challenging, particularly when the data labels available for modeling are inadequate or the input sequence to be predicted is incomplete. Here, we developed a semi-supervised General Intelligence framework to predict Virus Adaptation based on Language-model-embedded protein sequences (GIVAL) for blind input of virus sequences. The language model in GIVAL, named virus Bidirectional Encoder Representations from Transformers (vBERT), was pretrained for embedding using hidden Markov model-contextualized tokens of viral protein sequences. vBERT outperformed prevalent pretrained models such as DNABERT-2, proteinBERT, ESM-2, Transformer, and Word2Vec in distinguishing viral proteins with labels of varying granularity, such as serotypes and single phenotype-altering mutations. The semi-supervised GIVAL achieved higher accuracy in virus adaptation prediction and better fault tolerance to raw labels in the training dataset, overcoming the obstacles of modeling with insufficient labels and predicting blind input. GIVAL was applicable to adaptation prediction for diverse viruses. For influenza A viruses (IAVs), higher human adaptation was predicted for equine-origin H3N8 IAVs and for bovine H5N1 IAVs with simulated mutations. For coronaviruses, GIVAL predicted that 2 recently reported MERS-CoV-like virus variants show an adaptation shift in receptor binding, from the Middle East respiratory syndrome-related coronavirus (MERS-CoV) receptor to the severe acute respiratory syndrome coronavirus receptor. For monkeypox viruses, GIVAL quantified an incremental adaptation shift of viral variants, matching the rise in human monkeypox cases. In summary, GIVAL provides a generally intelligent framework for predicting virus adaptation from genotype, with the potential to extend to more genotype-to-phenotype prediction scenarios.
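To make the idea of semi-supervised prediction on embedded sequences concrete, the following is a minimal sketch of self-training (pseudo-labeling), and not the authors' GIVAL implementation. Everything in it is an assumption for illustration: `kmer_embed` is a toy k-mer count embedding standing in for a learned embedding such as vBERT, the classifier is a simple nearest-centroid rule, and the names `self_train`, `predict`, and the `margin` threshold are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_embed(seq, k=2):
    """Toy stand-in for a learned embedding: L2-normalized k-mer counts."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    norm = sqrt(sum(v * v for v in counts.values())) or 1.0
    return {kmer: v / norm for kmer, v in counts.items()}


def cosine(a, b):
    """Dot product of two sparse vectors that are already L2-normalized."""
    return sum(v * b.get(key, 0.0) for key, v in a.items())


def fit_centroids(labeled):
    """Return one L2-normalized centroid per label from (sequence, label) pairs."""
    sums = {}
    for seq, label in labeled:
        total = sums.setdefault(label, Counter())
        for key, v in kmer_embed(seq).items():
            total[key] += v
    centroids = {}
    for label, total in sums.items():
        norm = sqrt(sum(v * v for v in total.values())) or 1.0
        centroids[label] = {key: v / norm for key, v in total.items()}
    return centroids


def rank_labels(centroids, seq):
    """Labels sorted by similarity to the sequence, best first."""
    emb = kmer_embed(seq)
    return sorted(((cosine(emb, c), label) for label, c in centroids.items()),
                  reverse=True)


def self_train(labeled, unlabeled, margin=0.1, rounds=5):
    """Grow the labeled set with confidently pseudo-labeled sequences.

    A sequence is pseudo-labeled only when its best label beats the
    runner-up by at least `margin` (a hypothetical confidence rule).
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        centroids = fit_centroids(labeled)
        remaining = []
        for seq in pool:
            ranked = rank_labels(centroids, seq)
            if len(ranked) > 1 and ranked[0][0] - ranked[1][0] >= margin:
                labeled.append((seq, ranked[0][1]))
            else:
                remaining.append(seq)
        if len(remaining) == len(pool):  # nothing new was labeled; stop early
            break
        pool = remaining
    return fit_centroids(labeled)


def predict(centroids, seq):
    """Predict the label whose centroid is most similar to the sequence."""
    return rank_labels(centroids, seq)[0][1]


# Toy data: two labeled sequences plus two unlabeled ones.
model = self_train(
    labeled=[("AAAAAAAA", "human"), ("CCCCCCCC", "avian")],
    unlabeled=["AAAAAACA", "CCCCACCC"],
)
print(predict(model, "AAAACAAA"))
```

The sketch keeps the embedding and the classifier separate, so a stronger embedding could be swapped in without touching the self-training loop. The toy sequences and the `human`/`avian` labels are invented and carry no biological meaning.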
About the journal:
Research serves as a global platform for academic exchange, collaboration, and technological advancements. This journal welcomes high-quality research contributions from any domain, with open arms to authors from around the globe.
Comprising fundamental research in the life and physical sciences, Research also highlights significant findings and issues in engineering and applied science. The journal proudly features original research articles, reviews, perspectives, and editorials, fostering a diverse and dynamic scholarly environment.