BFGTP: A BERT-Guided Two-Stage Molecular Representation Learning Framework for Toxicity Prediction

IF 6.7; Region 2 (Medicine); JCR Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS)

IEEE Journal of Biomedical and Health Informatics. Pub Date: 2025-04-01. DOI: 10.1109/JBHI.2025.3556766

Kaimiao Hu, Yuan He, Jianguo Wei, Changming Sun, Jie Geng, Leyi Wei, Ran Su


Citations: 0

Abstract

Accurate prediction of molecular toxicity is vital for drug development. Most mainstream methods rely on fingerprints or graph-based feature extraction, but the emergence of large language models (LLMs) offers new prospects for molecular representation learning in toxicity prediction. Although several studies have attempted to leverage LLMs to integrate molecular sequence data for pretraining molecular representations, certain limitations remain. Current LLM-based approaches usually rely solely on class embedding features, overlooking the rich information in the sequence embeddings. Moreover, integrating pre-trained molecular representations with multi-modal molecular data may further enhance performance in toxicity prediction. To address these challenges, we propose BFGTP, a BERT-guided two-stage molecular representation learning framework for toxicity prediction. First, we design independent encoders for molecular descriptions of three modalities, where a fingerprint encoder with dual-level attention mechanisms integrates multi-category fingerprints. Then, a two-stage guidance strategy is introduced to fully utilize the prior knowledge of LLMs, employing contrastive learning to align and fuse the tri-modal representations and knowledge distillation to align the predicted value distributions. BFGTP ultimately combines fingerprint and graph representations to predict molecular toxicity. Experiments on seven toxicity datasets show that BFGTP outperforms the baselines, achieving the highest AUC on five datasets and the best average performance across five evaluation metrics. Ablation studies, t-SNE visualization, and a case study confirm the effectiveness of BFGTP's components and its ability to capture meaningful molecular representations.
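The abstract names two guidance signals, contrastive alignment of representations and knowledge distillation of predicted distributions, without giving their exact form. The sketch below is only an illustration of those two generic techniques, not the authors' implementation: it assumes a symmetric InfoNCE loss for alignment and a temperature-softened binary KL divergence for distillation. The function names, the temperatures, and the "student"/"teacher" roles (for example, a fingerprint or graph encoder guided by a BERT encoder) are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(z):
    # Numerically stable row-wise log-softmax.
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def info_nce(student, teacher, temperature=0.1):
    """Symmetric InfoNCE loss (assumed form of the contrastive alignment).

    Row i of `student` and row i of `teacher` embed the same molecule;
    all other rows in the batch act as negatives.
    """
    s = l2_normalize(student)
    t = l2_normalize(teacher)
    logits = s @ t.T / temperature
    idx = np.arange(len(s))
    loss_s2t = -log_softmax(logits)[idx, idx].mean()
    loss_t2s = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_s2t + loss_t2s)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def distillation_loss(student_logits, teacher_logits, temperature=2.0, eps=1e-12):
    """Mean binary KL(teacher || student) on temperature-softened probabilities
    (assumed form of the distillation term for a toxic / non-toxic label)."""
    p = sigmoid(np.asarray(teacher_logits) / temperature)
    q = sigmoid(np.asarray(student_logits) / temperature)
    kl = p * np.log((p + eps) / (q + eps)) \
        + (1 - p) * np.log((1 - p + eps) / (1 - q + eps))
    return kl.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher_emb = rng.normal(size=(8, 16))  # stand-in for LLM embeddings
    student_emb = teacher_emb + 0.01 * rng.normal(size=(8, 16))
    print("aligned loss: ", info_nce(student_emb, teacher_emb))
    print("shuffled loss:", info_nce(student_emb[::-1], teacher_emb))
    print("distillation: ", distillation_loss([2.0, -1.0], [-2.0, 1.0]))
```

In a setup like this, the alignment loss falls as paired embeddings of the same molecule move closer than mismatched pairs, and the distillation loss reaches zero when the student's softened predictions match the teacher's.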


Source journal

IEEE Journal of Biomedical and Health Informatics (COMPUTER SCIENCE, INFORMATION SYSTEMS; COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS)

CiteScore: 13.60

Self-citation rate: 6.50%

Articles published: 1151

About the journal: IEEE Journal of Biomedical and Health Informatics publishes original papers presenting recent advances where information and communication technologies intersect with health, healthcare, life sciences, and biomedicine. Topics include acquisition, transmission, storage, retrieval, management, and analysis of biomedical and health information. The journal covers applications of information technologies in healthcare, patient monitoring, preventive care, early disease diagnosis, therapy discovery, and personalized treatment protocols. It explores electronic medical and health records, clinical information systems, decision support systems, medical and biological imaging informatics, wearable systems, body area/sensor networks, and more. Integration-related topics like interoperability, evidence-based medicine, and secure patient data are also addressed.