Kaimiao Hu, Yuan He, Jianguo Wei, Changming Sun, Jie Geng, Leyi Wei, Ran Su
BFGTP: A BERT-Guided Two-Stage Molecular Representation Learning Framework for Toxicity Prediction
IEEE Journal of Biomedical and Health Informatics, published 2025-04-01. DOI: 10.1109/JBHI.2025.3556766
Citation count: 0
Abstract
Accurate prediction of molecular toxicity is vital for drug development. Most mainstream methods rely on fingerprints or graph-based feature extraction, but the emergence of large language models (LLMs) offers new prospects for molecular representation learning in toxicity prediction. Although several studies attempt to leverage LLMs to integrate molecular sequence data for pretraining molecular representations, certain limitations remain. Current LLM-based approaches usually rely solely on class embedding features, overlooking the rich information in sequence embeddings. Moreover, integrating pre-trained molecular representations with multi-modal molecular data may further enhance performance in toxicity prediction. To address these challenges, we propose BFGTP, a BERT-guided two-stage molecular representation learning framework for toxicity prediction. First, we design independent encoders for molecular descriptions of three modalities, where the fingerprint encoder, with dual-level attention mechanisms, effectively integrates multi-category fingerprints. Then, a two-stage guide strategy is introduced to fully utilize the prior knowledge of LLMs, employing contrastive learning to align and fuse the tri-modal representations and knowledge distillation to align predicted value distributions. BFGTP ultimately combines fingerprint and graph representations to predict molecular toxicity. Experiments on seven toxicity datasets show that BFGTP outperforms baselines, achieving the highest AUC on five datasets and the best average performance across five evaluation metrics. Ablation studies, t-SNE visualization, and a case study confirm the effectiveness of BFGTP's components and its ability to capture meaningful molecular representations.
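The abstract names two guidance mechanisms, contrastive alignment of representations and knowledge distillation of predicted distributions, without giving their loss functions. The sketch below is only an illustration of those two generic techniques, not the authors' implementation. It assumes a symmetric InfoNCE loss for the contrastive stage and a temperature-softened binary KL divergence for the distillation stage. The function names, temperatures, embedding sizes, and random stand-in data are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(z):
    # Numerically stable row-wise log-softmax.
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def info_nce(anchor, other, temperature=0.1):
    # Symmetric InfoNCE (assumed loss): row i of both inputs is the same molecule,
    # so the diagonal of the similarity matrix holds the positive pairs.
    a, b = l2_normalize(anchor), l2_normalize(other)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def distillation_loss(student_logits, teacher_logits, temperature=2.0, eps=1e-12):
    # Binary KL(teacher || student) on temperature-softened toxicity probabilities
    # (assumed form of "aligning predicted value distributions").
    p = sigmoid(teacher_logits / temperature)
    q = sigmoid(student_logits / temperature)
    kl = p * np.log((p + eps) / (q + eps)) \
        + (1 - p) * np.log((1 - p + eps) / (1 - q + eps))
    return kl.mean()

# Random stand-ins for per-molecule embeddings (8 molecules, 16 dimensions).
rng = np.random.default_rng(0)
seq_emb = rng.normal(size=(8, 16))                   # hypothetical sequence (BERT) embeddings
fp_emb = seq_emb + 0.05 * rng.normal(size=(8, 16))   # fingerprint embeddings, nearly aligned
graph_emb = rng.normal(size=(8, 16))                 # graph embeddings, not yet aligned

aligned = info_nce(seq_emb, fp_emb)
unaligned = info_nce(seq_emb, graph_emb)

teacher = np.array([2.0, -1.0, 0.5])   # hypothetical teacher logits
student = np.array([-2.0, 1.0, 0.0])   # hypothetical student logits
kd = distillation_loss(student, teacher)
```

In this toy setup the contrastive loss is lower for the nearly aligned pair than for the unrelated pair, and the distillation loss is zero only when student and teacher logits coincide. In a real pipeline both terms would be minimized by gradient descent in an autodiff framework rather than evaluated once in NumPy.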
About the journal:
IEEE Journal of Biomedical and Health Informatics publishes original papers presenting recent advances where information and communication technologies intersect with health, healthcare, life sciences, and biomedicine. Topics include acquisition, transmission, storage, retrieval, management, and analysis of biomedical and health information. The journal covers applications of information technologies in healthcare, patient monitoring, preventive care, early disease diagnosis, therapy discovery, and personalized treatment protocols. It explores electronic medical and health records, clinical information systems, decision support systems, medical and biological imaging informatics, wearable systems, body area/sensor networks, and more. Integration-related topics like interoperability, evidence-based medicine, and secure patient data are also addressed.