UMAP-based clustering split for rigorous evaluation of AI models for virtual screening on cancer cell lines*

IF 5.7 · Zone 2 (Chemistry) · JCR Q1 · CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics Pub Date : 2025-06-10 DOI:10.1186/s13321-025-01039-8

Qianrong Guo, Saiveth Hernandez-Hernandez, Pedro J. Ballester


Citations: 0

Abstract



Virtual Screening (VS) of large compound libraries using Artificial Intelligence (AI) models is a highly effective approach for early drug discovery. Data splitting is crucial for benchmarking the performance of such AI models. Traditional random data splits often place structurally similar molecules in both the training and test sets, which conflicts with the reality of VS libraries, which typically contain structurally diverse compounds. To tackle this challenge, scaffold split, which groups molecules by shared core structure, and Butina clustering, which clusters molecules by chemotype, have long been used. However, we show that these methods still introduce high similarities between training and test sets, leading to overestimated model performance. Our study examined four representative AI models across 60 NCI-60 datasets, each comprising approximately 33,000–54,000 molecules tested on a different cancer cell line. Each dataset was split in four ways: random, scaffold, Butina clustering, and the more realistic Uniform Manifold Approximation and Projection (UMAP) clustering. Using Linear Regression, Random Forest, Transformer-CNN, and GEM, we trained a total of 8400 models and evaluated them under the four splitting methods. These comprehensive results show that the UMAP split provides the most challenging and realistic benchmark for model evaluation, followed by Butina splits, then scaffold splits, closely followed by random splits. Consequently, we recommend using UMAP splits instead of overly optimistic Butina splits, and especially instead of scaffold splits, for molecular property prediction, including VS. Lastly, we illustrate how misaligned ROC AUC is with VS goals, despite its common use. The code and datasets for reproducibility are available at https://github.com/Rong830/UMAP_split_for_VS and archived at https://zenodo.org/records/14736486 .
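The scaffold, Butina, and UMAP splits named in the abstract share one idea: molecules are grouped first, and whole groups are then assigned to either the training or the test set. The sketch below illustrates only that assignment step. It assumes the cluster labels have already been computed by some method, and the function name `cluster_split` and its parameters are hypothetical, not taken from the authors' repository.

```python
import random
from collections import defaultdict


def cluster_split(cluster_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no cluster spans train and test.

    cluster_ids[i] is the cluster label of molecule i. This is an assumed
    input: the labels could come from scaffolds, Butina clusters, or
    clusters found in a UMAP embedding.
    """
    clusters = defaultdict(list)
    for index, label in enumerate(cluster_ids):
        clusters[label].append(index)

    labels = sorted(clusters)  # fixed order so the shuffle is reproducible
    random.Random(seed).shuffle(labels)

    target = test_fraction * len(cluster_ids)
    train, test = [], []
    for label in labels:
        # Fill the test set with whole clusters until it reaches the target size.
        if len(test) < target:
            test.extend(clusters[label])
        else:
            train.extend(clusters[label])
    return sorted(train), sorted(test)


# Toy example: 10 molecules in 4 clusters.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
train_idx, test_idx = cluster_split(cluster_ids, test_fraction=0.3, seed=1)
train_clusters = {cluster_ids[i] for i in train_idx}
test_clusters = {cluster_ids[i] for i in test_idx}
print("train:", train_idx, "test:", test_idx)
```

Because clusters are never divided, the test set can slightly overshoot the requested fraction. How hard the resulting benchmark is depends entirely on how the cluster labels were produced, which is the point the abstract makes when comparing the four splitting methods.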
Scientific contribution

This work advances the field by introducing UMAP clustering as a robust splitting method for molecular datasets, improving over traditional methods such as Butina clustering and especially scaffold splits. It offers a new evaluation framework for benchmarking AI models under more realistic conditions, fostering progress in molecular property prediction. The findings also show that ROC AUC, despite its popularity, remains an inappropriate metric for virtual screening (VS), emphasizing the need for context-specific evaluation metrics.
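To make the ROC AUC point concrete, here is a minimal toy sketch. The two rankings are constructed for illustration and are not results from the paper. It compares ROC AUC with the hit rate among the top-ranked molecules, which is closer to what a screening campaign actually uses, since only the best-scored compounds are tested. The helper names are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random active outranks a random inactive."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def hit_rate_at_k(labels, scores, k):
    """Fraction of actives among the k top-scored molecules."""
    ranked = sorted(zip(scores, labels), reverse=True)
    return sum(y for _, y in ranked[:k]) / k


def labels_from_active_ranks(active_ranks, n):
    """Label vector for a ranked list of n molecules (rank 1 is the best score)."""
    return [1 if rank in active_ranks else 0 for rank in range(1, n + 1)]


n = 100  # toy library: 100 molecules, 5 of them active
scores = [n - rank for rank in range(1, n + 1)]  # rank 1 gets the highest score

# Hypothetical model A: three actives at the very top, two at the very bottom.
labels_a = labels_from_active_ranks({1, 2, 3, 99, 100}, n)
# Hypothetical model B: all five actives at ranks 11 to 15.
labels_b = labels_from_active_ranks({11, 12, 13, 14, 15}, n)

auc_a, auc_b = roc_auc(labels_a, scores), roc_auc(labels_b, scores)
top5_a = hit_rate_at_k(labels_a, scores, 5)
top5_b = hit_rate_at_k(labels_b, scores, 5)
print(f"A: AUC={auc_a:.3f}, hit rate@5={top5_a:.1f}")
print(f"B: AUC={auc_b:.3f}, hit rate@5={top5_b:.1f}")
```

In this constructed case model B has the higher ROC AUC, yet none of its five top-ranked molecules is active, while model A places three actives in its top five. ROC AUC weighs the whole ranking equally, whereas VS only acts on the head of the list.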


Source journal

Journal of Cheminformatics (CHEMISTRY, MULTIDISCIPLINARY; COMPUTER SCIENCE, INFORMATION SYSTEMS)

CiteScore: 14.10

Self-citation rate: 7.00%

Articles published: (value missing in source)

Review time: 3 months

About the journal: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases, and molecular modelling; chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases; computer and molecular graphics; computer-aided molecular design; expert systems; QSAR; and data mining techniques.