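Benchmarking Machine Learning Models for HIV-1 Protease Inhibitor Resistance Prediction: Impact of Data Set Construction and Feature Representation.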

IF 5.3 | Zone 2 Chemistry | Q1 CHEMISTRY, MEDICINAL

Journal of Chemical Information and Modeling Pub Date : 2025-09-25 DOI:10.1021/acs.jcim.5c01544

Rocío Lucía Beatriz Riveros Maidana,Lucas de Almeida Machado,Ana Carolina Ramos Guimarães


Citations: 0

Abstract


Benchmarking Machine Learning Models for HIV-1 Protease Inhibitor Resistance Prediction: Impact of Data Set Construction and Feature Representation.

The rapid emergence of drug resistance in viral infections represents a significant global health challenge, threatening the efficacy of treatments for multiple diseases. Machine learning models have emerged as valuable tools for predicting antiviral drug resistance from genomic data, with HIV-1 protease serving as a well-characterized model system due to its extensive experimental data and clinical relevance. Here, we systematically evaluate multiple previously published HIV-1 protease inhibitor (PI) resistance prediction models across three distinct data sets that differ in preprocessing and in how ambiguous sequence positions are handled, and we propose a new preprocessing approach. We tested Steiner's data set (n = 1540) with first-amino-acid selection at ambiguous positions, Shen's expanded data set (n = 500,390) with all possible combinations at ambiguous positions, and our in-house data set (n = 869) with strict exclusion of ambiguous sequences. We compare neural network architectures (Multilayer Perceptron, Bidirectional Recurrent Neural Network, and Convolutional Neural Network), traditional machine learning models (Random Forest and K-Nearest Neighbor), and logistic regression using either zScales physicochemical descriptors or Rosetta energy terms. Sequence expansion preprocessing can artificially increase performance metrics (mean AUC: 0.986-0.999) by creating substantial redundancy (99.6% of the expanded data set consists of duplicated sequences from 2096 unique originals), while our clustering-based validation approach provides a more stringent assessment of model generalizability. Remarkably, our physicochemically informed logistic regression models achieved performance comparable to complex neural networks on challenging test sets (zScales LR: AUC = 0.973; Rosetta LR: AUC = 0.944), while offering superior interpretability. Furthermore, the zScales LR model offered significantly greater computational efficiency (0.007 s/prediction) compared to that of Rosetta LR (776.117 s/prediction). Mutual information analysis revealed distinct complementary resistance mechanisms: the zScales descriptors identified discrete resistance hotspots at positions 10, 46, 54, 71, and 90, while the Rosetta energy terms revealed interconnected energetic networks across structurally adjacent residues, particularly in functionally critical flap regions (positions 46-54). This study demonstrates how data set construction choices directly impact apparent model performance while establishing that well-chosen physicochemical feature representations can match or exceed complex neural networks for HIV-1 PI resistance modeling, offering both accuracy and mechanistic interpretability critical for clinical implementation and drug development.
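The three ways of handling ambiguous positions named in the abstract (first-amino-acid selection, expansion to all combinations, strict exclusion) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The toy isolates, the one-string-per-position encoding (more than one letter meaning an ambiguous call), and all function names are assumptions made for illustration only.

```python
from itertools import product

# Hypothetical toy encoding (an assumption, not the paper's format):
# each isolate is a list with one string per position; a string with
# more than one letter marks an ambiguous call at that position.


def first_amino_acid(seq):
    """Keep only the first listed residue at every position."""
    return "".join(pos[0] for pos in seq)


def expand_all(seq):
    """Enumerate every combination of residues at ambiguous positions."""
    return ["".join(combo) for combo in product(*seq)]


def strict_exclusion(seq):
    """Return the sequence if no position is ambiguous, otherwise None."""
    if any(len(pos) > 1 for pos in seq):
        return None
    return "".join(seq)


def duplicated_fraction(n_expanded, n_original):
    """Share of expanded rows beyond one row per original isolate."""
    return (n_expanded - n_original) / n_expanded


# Three made-up isolates, four positions each.
isolates = [
    ["P", "Q", "I", "T"],
    ["P", "QK", "IV", "T"],
    ["PS", "Q", "ILV", "T"],
]

first = [first_amino_acid(s) for s in isolates]
expanded = [variant for s in isolates for variant in expand_all(s)]
kept = [r for r in (strict_exclusion(s) for s in isolates) if r is not None]

print(len(first), len(expanded), len(kept))
print(duplicated_fraction(len(expanded), len(isolates)))
```

In this toy case, expansion grows the data set while every extra row derives from an isolate already present, which is the kind of redundancy that can leak near-identical sequences between training and test splits. If the reported 99.6% is read as the share of rows beyond one per unique original (an assumption about how the figure was computed), `duplicated_fraction(500390, 2096)` gives about 0.996, consistent with the abstract.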


Source journal

Journal of Chemical Information and Modeling (Chemistry: Multidisciplinary Chemistry)

CiteScore: 9.80

Self-citation rate: 10.70%

Articles published: 529

Review time: 1.4 months

About the journal: The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery. Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field. As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.