{"title":"基于序列-结构预训练的抗体表征学习大语言模型。","authors":"Mingze Yin, Hanjing Zhou, Jialu Wu, Yiheng Zhu, Yuxuan Zhan, Zitai Kong, Hongxia Xu, Chang-Yu Hsieh, Jintai Chen, Tingjun Hou, Jian Wu","doi":"10.34133/research.0721","DOIUrl":null,"url":null,"abstract":"<p><p>Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19. Recent advancements in biomedical language models have shown the great potential to interpret complex biological structures and functions. However, existing antibody-specific models have a notable limitation that they lack explicit consideration for antibody structural information, despite the fact that both 1-dimensional sequence and 3-dimensional structure carry unique and complementary insights into antibody behavior and functionality. This paper proposes the <b>S</b>equence-<b>S</b>tructure multi-level pre-trained <b>A</b>ntibody <b>L</b>anguage <b>M</b>odel (S<sup>2</sup>ALM), combining holistic sequential and structural information in one unified, generic antibody foundation model. We construct a hierarchical pre-training paradigm incorporated with 2 customized multi-level training objectives to facilitate the modeling of comprehensive antibody representations. S<sup>2</sup>ALM's representation space uncovers inherent functional binding mechanisms, biological evolution properties, and structural interaction patterns. Pre-trained over 75 million sequences and 11.7 million structures, S<sup>2</sup>ALM can be adopted for diverse downstream tasks: accurately predicting antigen-antibody binding affinities, precisely distinguishing B cell maturation stages, identifying antibody crucial binding positions, and specifically designing novel coronavirus-binding antibodies. Remarkably, S<sup>2</sup>ALM outperforms well-established and renowned baselines and sets new state-of-the-art performance across extensive antibody-specific understanding and generation tasks. S<sup>2</sup>ALM's ability to model comprehensive and generalized representations further positions its potential to advance real-world therapeutic antibody development, potentially addressing unmet academic, industrial, and clinical needs.</p>","PeriodicalId":21120,"journal":{"name":"Research","volume":"8 ","pages":"0721"},"PeriodicalIF":10.7000,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12364524/pdf/","citationCount":"0","resultStr":"{\"title\":\"S<sup>2</sup>ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning.\",\"authors\":\"Mingze Yin, Hanjing Zhou, Jialu Wu, Yiheng Zhu, Yuxuan Zhan, Zitai Kong, Hongxia Xu, Chang-Yu Hsieh, Jintai Chen, Tingjun Hou, Jian Wu\",\"doi\":\"10.34133/research.0721\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19. Recent advancements in biomedical language models have shown the great potential to interpret complex biological structures and functions. 
However, existing antibody-specific models have a notable limitation that they lack explicit consideration for antibody structural information, despite the fact that both 1-dimensional sequence and 3-dimensional structure carry unique and complementary insights into antibody behavior and functionality. This paper proposes the <b>S</b>equence-<b>S</b>tructure multi-level pre-trained <b>A</b>ntibody <b>L</b>anguage <b>M</b>odel (S<sup>2</sup>ALM), combining holistic sequential and structural information in one unified, generic antibody foundation model. We construct a hierarchical pre-training paradigm incorporated with 2 customized multi-level training objectives to facilitate the modeling of comprehensive antibody representations. S<sup>2</sup>ALM's representation space uncovers inherent functional binding mechanisms, biological evolution properties, and structural interaction patterns. Pre-trained over 75 million sequences and 11.7 million structures, S<sup>2</sup>ALM can be adopted for diverse downstream tasks: accurately predicting antigen-antibody binding affinities, precisely distinguishing B cell maturation stages, identifying antibody crucial binding positions, and specifically designing novel coronavirus-binding antibodies. Remarkably, S<sup>2</sup>ALM outperforms well-established and renowned baselines and sets new state-of-the-art performance across extensive antibody-specific understanding and generation tasks. S<sup>2</sup>ALM's ability to model comprehensive and generalized representations further positions its potential to advance real-world therapeutic antibody development, potentially addressing unmet academic, industrial, and clinical needs.</p>\",\"PeriodicalId\":21120,\"journal\":{\"name\":\"Research\",\"volume\":\"8 \",\"pages\":\"0721\"},\"PeriodicalIF\":10.7000,\"publicationDate\":\"2025-08-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12364524/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Research\",\"FirstCategoryId\":\"103\",\"ListUrlMain\":\"https://doi.org/10.34133/research.0721\",\"RegionNum\":1,\"RegionCategory\":\"综合性期刊\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q1\",\"JCRName\":\"Multidisciplinary\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Research","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.34133/research.0721","RegionNum":1,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"Multidisciplinary","Score":null,"Total":0}
S2ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning.
Antibodies safeguard our health through precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in numerous diseases, including COVID-19. Recent advances in biomedical language models have shown great potential for interpreting complex biological structures and functions. However, existing antibody-specific models share a notable limitation: they do not explicitly consider antibody structural information, even though the 1-dimensional sequence and the 3-dimensional structure each carry unique and complementary insights into antibody behavior and functionality. This paper proposes the Sequence-Structure multi-level pre-trained Antibody Language Model (S2ALM), which combines holistic sequence and structural information in one unified, generic antibody foundation model. We construct a hierarchical pre-training paradigm with 2 customized multi-level training objectives to facilitate the modeling of comprehensive antibody representations. S2ALM's representation space uncovers inherent functional binding mechanisms, biological evolution properties, and structural interaction patterns. Pre-trained on over 75 million sequences and 11.7 million structures, S2ALM can be adopted for diverse downstream tasks: accurately predicting antigen-antibody binding affinities, precisely distinguishing B cell maturation stages, identifying crucial antibody binding positions, and specifically designing novel coronavirus-binding antibodies. Remarkably, S2ALM outperforms well-established baselines and sets new state-of-the-art performance across extensive antibody-specific understanding and generation tasks. S2ALM's ability to model comprehensive and generalized representations further positions it to advance real-world therapeutic antibody development, potentially addressing unmet academic, industrial, and clinical needs.
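To make the sequence-structure pre-training idea concrete, the sketch below shows a toy joint objective: a transformer encoder over antibody sequences trained with a masked-residue (sequence-level) loss plus a pairwise residue-distance (structure-level) loss. Everything here, including the class name ToySeqStructModel, the vocabulary size, the choice of losses, and their equal weighting, is an illustrative assumption; the abstract does not specify S2ALM's actual architecture or training objectives.

```python
# Illustrative sketch only: joint sequence-structure pre-training for an
# antibody language model. All names, dimensions, and objectives are assumed;
# they are not taken from the S2ALM paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 25     # 20 amino acids + special tokens (assumed)
MASK_ID = 24   # assumed mask-token index
D_MODEL = 256

class ToySeqStructModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # positional encoding omitted for brevity
        self.mlm_head = nn.Linear(D_MODEL, VOCAB)      # sequence-level head: masked-residue prediction
        self.dist_head = nn.Linear(2 * D_MODEL, 1)     # structure-level head: pairwise distance regression

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))           # (B, L, D)
        logits = self.mlm_head(h)                      # (B, L, VOCAB)
        # Build pairwise features by concatenating residue embeddings i and j.
        B, L, D = h.shape
        hi = h.unsqueeze(2).expand(B, L, L, D)
        hj = h.unsqueeze(1).expand(B, L, L, D)
        dist_pred = self.dist_head(torch.cat([hi, hj], dim=-1)).squeeze(-1)  # (B, L, L)
        return logits, dist_pred

def pretraining_loss(model, tokens, true_tokens, mask, true_dist):
    """Combine a masked-LM loss (sequence) with a distance-regression loss (structure)."""
    logits, dist_pred = model(tokens)
    mlm = F.cross_entropy(logits[mask], true_tokens[mask])
    struct = F.mse_loss(dist_pred, true_dist)
    return mlm + struct   # equal weighting assumed for illustration

if __name__ == "__main__":
    B, L = 2, 32
    true_tokens = torch.randint(0, 20, (B, L))
    mask = torch.rand(B, L) < 0.15                     # mask ~15% of residues
    tokens = true_tokens.masked_fill(mask, MASK_ID)
    true_dist = torch.rand(B, L, L) * 20.0             # placeholder pairwise distances (Å)
    model = ToySeqStructModel()
    loss = pretraining_loss(model, tokens, true_tokens, mask, true_dist)
    print(float(loss))
```

The two heads share one encoder, so gradients from the structural objective shape the same representation used for sequence recovery; this is the general intuition behind combining sequence and structure signals in one foundation model, independent of how S2ALM implements it.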
About the journal:
Research serves as a global platform for academic exchange, collaboration, and technological advancement. The journal welcomes high-quality contributions from any domain and from authors around the globe.
Covering fundamental research in the life and physical sciences, Research also highlights significant findings and issues in engineering and applied science. The journal features original research articles, reviews, perspectives, and editorials, fostering a diverse and dynamic scholarly environment.