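Toward Explainable Carcinogenicity Prediction: An Integrated Cheminformatics Approach and Consensus Framework for Possibly Carcinogenic Chemicals.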

IF 5.3, Zone 2 (Chemistry), Q1 CHEMISTRY, MEDICINAL

Journal of Chemical Information and Modeling, Pub Date: 2025-09-12, DOI: 10.1021/acs.jcim.5c01873

Huynh Anh Duy, Tarapong Srisongkram

Citations: 0

Abstract

A carcinogenicity assessment of possibly carcinogenic chemicals (International Agency for Research on Cancer: IARC class 2B) was conducted using a consensus framework constructed from three complementary machine learning models: BiLSTM with MACCS fingerprints, LightGBM with RDKit descriptors, and Random Forest (RF) with E-state features. These models were developed and rigorously evaluated on benchmark carcinogenicity data sets, with LightGBM emerging as the top performer (accuracy = 0.800, MCC = 0.615, AUROC = 0.882, sensitivity = 0.739, specificity = 0.857). Consistent and competitive performance was also observed for RF and BiLSTM, affirming the reliability of individual predictions. Notably, LightGBM maintained strong generalization ability on independent human carcinogen test sets from IARC and IRIS (accuracy = 0.753, MCC = 0.535, AUROC = 0.842). For the ISSCAN internal test set, the top three models achieved MCC values ranging from 0.564 to 0.615, with AUROC scores between 0.858 and 0.882. For the human carcinogen test set, the top three models attained MCC values from 0.335 to 0.535 and AUROC scores ranging from 0.785 to 0.842. The consensus model was subsequently applied to 47 within-domain compounds from the 2B category, classifying them into 16 potential carcinogens, 8 presumed noncarcinogens, and 23 cases with inconclusive results. To uncover structural correlates, a SHAP-based interpretation of the BiLSTM model was performed, revealing discriminative molecular features including MACCS fingerprint keys and core Bemis-Murcko scaffolds associated with predicted carcinogenicity. To support practical applications, a freely accessible web server for carcinogenicity assessment has been developed and is available at https://carcinogenicity-predictor.streamlit.app.
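The three-way outcome described above (potential carcinogen, presumed noncarcinogen, inconclusive) can be illustrated with a small voting sketch. The abstract does not state the exact consensus rule, so this example assumes a unanimous-agreement rule: a compound receives a definite label only when all three models agree, and any disagreement is reported as inconclusive. The function names, label strings, and compound identifiers are hypothetical and are not taken from the paper or its web server.

```python
from collections import Counter
from typing import Dict

# Hypothetical label strings mirroring the three outcome groups in the abstract.
CARCINOGEN = "potential carcinogen"
NONCARCINOGEN = "presumed noncarcinogen"
INCONCLUSIVE = "inconclusive"


def consensus_call(votes: Dict[str, int]) -> str:
    """Combine binary model votes (1 = carcinogen, 0 = noncarcinogen).

    Assumed rule (not confirmed by the source): a definite label requires
    unanimous agreement; any disagreement yields an inconclusive result.
    """
    if not votes:
        raise ValueError("at least one model vote is required")
    labels = set(votes.values())
    if not labels <= {0, 1}:
        raise ValueError("votes must be 0 or 1")
    if labels == {1}:
        return CARCINOGEN
    if labels == {0}:
        return NONCARCINOGEN
    return INCONCLUSIVE


def summarize(predictions: Dict[str, Dict[str, int]]) -> Counter:
    """Count how many compounds fall into each consensus group."""
    return Counter(consensus_call(v) for v in predictions.values())


# Made-up votes for three placeholder compounds (illustration only).
example_predictions = {
    "compound_A": {"BiLSTM": 1, "LightGBM": 1, "RF": 1},
    "compound_B": {"BiLSTM": 0, "LightGBM": 0, "RF": 0},
    "compound_C": {"BiLSTM": 1, "LightGBM": 0, "RF": 1},
}
example_summary = summarize(example_predictions)
```

Under this assumed rule, a split vote such as the one for `compound_C` is never forced into a class. A majority-vote rule would be an alternative design that leaves no compound inconclusive, which would not match the three-group result reported in the abstract.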

Source journal

Journal of Chemical Information and Modeling (Chemistry: Multidisciplinary Chemistry)

CiteScore: 9.80

Self-citation rate: 10.70%

Articles published: 529

Review time: 1.4 months

Journal description: The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery. Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field. As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.