{"title":"HIV-1蛋白酶抑制剂耐药性预测的基准机器学习模型:数据集构建和特征表示的影响。","authors":"Rocío Lucía Beatriz Riveros Maidana,Lucas de Almeida Machado,Ana Carolina Ramos Guimarães","doi":"10.1021/acs.jcim.5c01544","DOIUrl":null,"url":null,"abstract":"The rapid emergence of drug resistance in viral infections represents a significant global health challenge, threatening the efficacy of treatments for multiple diseases. Machine learning models have emerged as valuable tools for predicting antiviral drug resistance from genomic data, with HIV-1 protease serving as a well-characterized model system due to its extensive experimental data and clinical relevance. Here, we systematically evaluate multiple previously published HIV-1 protease inhibitor (PI) resistance prediction models across three distinct data sets with different preprocessing and ambiguous sequencing processing strategies and propose a new approach for preprocessing. We tested Steiner's data set (n = 1540) with first-amino-acid selection at ambiguous positions, Shen's expanded data set (n = 500,390) with all possible combinations at ambiguous positions, and our In-house data set (n = 869) with strict exclusion of ambiguous sequences. We compare neural networks architectures (Multilayer Perceptron, Bidirectional Recurrent Neural Network, and Convolutional Neural Network), traditional machine learning models (Random Forest and K-Nearest Neighbor), and logistic regression using either zScales physicochemical descriptors or Rosetta energy terms. Sequence expansion preprocessing can artificially increase performance metrics (mean AUC: 0.986-0.999) by creating substantial redundancy (99.6% of expanded data set consists of duplicated sequences from 2096 unique originals), while our clustering-based validation approach provides a more stringent assessment of model generalizability. 
Remarkably, our physicochemically informed logistic regression models achieved performance comparable to complex neural networks on challenging test sets (zScales LR: AUC = 0.973; Rosetta LR: AUC = 0.944), while offering superior interpretability. Furthermore, the zScales LR model offered significantly greater computational efficiency (0.007 s/prediction) compared to that of Rosetta LR (776.117 s/prediction). Mutual information analysis revealed distinct complementary resistance mechanisms: The zScales descriptors identified discrete resistance hotspots at positions 10, 46, 54, 71, and 90, while the Rosetta energy terms revealed interconnected energetic networks across structurally adjacent residues, particularly in functionally critical flap regions (positions 46-54). This study demonstrates how data set construction choices directly impact apparent model performance while establishing that well-chosen physicochemical feature representations can match or exceed complex neural networks for HIV-1 PI resistance modeling, offering both accuracy and mechanistic interpretability critical for clinical implementation and drug development.","PeriodicalId":44,"journal":{"name":"Journal of Chemical Information and Modeling ","volume":"18 1","pages":""},"PeriodicalIF":5.3000,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Benchmarking Machine Learning Models for HIV-1 Protease Inhibitor Resistance Prediction: Impact of Data Set Construction and Feature Representation.\",\"authors\":\"Rocío Lucía Beatriz Riveros Maidana,Lucas de Almeida Machado,Ana Carolina Ramos Guimarães\",\"doi\":\"10.1021/acs.jcim.5c01544\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The rapid emergence of drug resistance in viral infections represents a significant global health challenge, threatening the efficacy of treatments for multiple diseases. 
Machine learning models have emerged as valuable tools for predicting antiviral drug resistance from genomic data, with HIV-1 protease serving as a well-characterized model system due to its extensive experimental data and clinical relevance. Here, we systematically evaluate multiple previously published HIV-1 protease inhibitor (PI) resistance prediction models across three distinct data sets with different preprocessing and ambiguous sequencing processing strategies and propose a new approach for preprocessing. We tested Steiner's data set (n = 1540) with first-amino-acid selection at ambiguous positions, Shen's expanded data set (n = 500,390) with all possible combinations at ambiguous positions, and our In-house data set (n = 869) with strict exclusion of ambiguous sequences. We compare neural networks architectures (Multilayer Perceptron, Bidirectional Recurrent Neural Network, and Convolutional Neural Network), traditional machine learning models (Random Forest and K-Nearest Neighbor), and logistic regression using either zScales physicochemical descriptors or Rosetta energy terms. Sequence expansion preprocessing can artificially increase performance metrics (mean AUC: 0.986-0.999) by creating substantial redundancy (99.6% of expanded data set consists of duplicated sequences from 2096 unique originals), while our clustering-based validation approach provides a more stringent assessment of model generalizability. Remarkably, our physicochemically informed logistic regression models achieved performance comparable to complex neural networks on challenging test sets (zScales LR: AUC = 0.973; Rosetta LR: AUC = 0.944), while offering superior interpretability. Furthermore, the zScales LR model offered significantly greater computational efficiency (0.007 s/prediction) compared to that of Rosetta LR (776.117 s/prediction). 
Mutual information analysis revealed distinct complementary resistance mechanisms: The zScales descriptors identified discrete resistance hotspots at positions 10, 46, 54, 71, and 90, while the Rosetta energy terms revealed interconnected energetic networks across structurally adjacent residues, particularly in functionally critical flap regions (positions 46-54). This study demonstrates how data set construction choices directly impact apparent model performance while establishing that well-chosen physicochemical feature representations can match or exceed complex neural networks for HIV-1 PI resistance modeling, offering both accuracy and mechanistic interpretability critical for clinical implementation and drug development.\",\"PeriodicalId\":44,\"journal\":{\"name\":\"Journal of Chemical Information and Modeling \",\"volume\":\"18 1\",\"pages\":\"\"},\"PeriodicalIF\":5.3000,\"publicationDate\":\"2025-09-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Chemical Information and Modeling \",\"FirstCategoryId\":\"92\",\"ListUrlMain\":\"https://doi.org/10.1021/acs.jcim.5c01544\",\"RegionNum\":2,\"RegionCategory\":\"化学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"CHEMISTRY, MEDICINAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Chemical Information and Modeling ","FirstCategoryId":"92","ListUrlMain":"https://doi.org/10.1021/acs.jcim.5c01544","RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MEDICINAL","Score":null,"Total":0}
Benchmarking Machine Learning Models for HIV-1 Protease Inhibitor Resistance Prediction: Impact of Data Set Construction and Feature Representation.
The rapid emergence of drug resistance in viral infections represents a significant global health challenge, threatening the efficacy of treatments for multiple diseases. Machine learning models have emerged as valuable tools for predicting antiviral drug resistance from genomic data, with HIV-1 protease serving as a well-characterized model system due to its extensive experimental data and clinical relevance. Here, we systematically evaluate multiple previously published HIV-1 protease inhibitor (PI) resistance prediction models across three data sets that differ in their preprocessing and in how they handle ambiguous sequence positions, and we propose a new preprocessing approach. We tested Steiner's data set (n = 1540) with first-amino-acid selection at ambiguous positions, Shen's expanded data set (n = 500,390) with all possible combinations at ambiguous positions, and our in-house data set (n = 869) with strict exclusion of ambiguous sequences. We compare neural network architectures (Multilayer Perceptron, Bidirectional Recurrent Neural Network, and Convolutional Neural Network), traditional machine learning models (Random Forest and K-Nearest Neighbor), and logistic regression using either zScales physicochemical descriptors or Rosetta energy terms. Sequence expansion preprocessing can artificially increase performance metrics (mean AUC: 0.986-0.999) by creating substantial redundancy (99.6% of the expanded data set consists of duplicated sequences from 2096 unique originals), while our clustering-based validation approach provides a more stringent assessment of model generalizability. Remarkably, our physicochemically informed logistic regression (LR) models achieved performance comparable to complex neural networks on challenging test sets (zScales LR: AUC = 0.973; Rosetta LR: AUC = 0.944), while offering superior interpretability. Furthermore, the zScales LR model was far more computationally efficient (0.007 s/prediction) than the Rosetta LR model (776.117 s/prediction). Mutual information analysis revealed distinct, complementary resistance mechanisms: the zScales descriptors identified discrete resistance hotspots at positions 10, 46, 54, 71, and 90, while the Rosetta energy terms revealed interconnected energetic networks across structurally adjacent residues, particularly in the functionally critical flap region (positions 46-54). This study demonstrates how data set construction choices directly impact apparent model performance and establishes that well-chosen physicochemical feature representations can match or exceed complex neural networks for HIV-1 PI resistance modeling, offering both the accuracy and the mechanistic interpretability critical for clinical implementation and drug development.
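The redundancy concern raised in the abstract can be illustrated with a minimal sketch. It assumes that "all possible combinations at ambiguous positions" means taking the Cartesian product of the residues observed at each position, and it uses a simple split by original isolate as a stand-in for the authors' clustering-based validation, whose details the abstract does not give. The toy sequences, isolate names, and function names are hypothetical.

```python
from itertools import product
import random


def expand_ambiguous(positions):
    """Expand one isolate into every combination of its ambiguous residues.

    `positions` is a list of strings; each string holds the residues observed
    at that position (one character if unambiguous, several if ambiguous).
    """
    return ["".join(combo) for combo in product(*positions)]


def split_by_origin(rows, test_fraction=0.34, seed=0):
    """Split (origin, sequence) rows so that no origin lands on both sides."""
    origins = sorted({origin for origin, _ in rows})
    rng = random.Random(seed)
    rng.shuffle(origins)
    n_test = max(1, round(len(origins) * test_fraction))
    test_origins = set(origins[:n_test])
    train = [row for row in rows if row[0] not in test_origins]
    test = [row for row in rows if row[0] in test_origins]
    return train, test


# Hypothetical toy isolates (not real protease sequences).
isolates = {
    "iso1": ["P", "Q", "IV", "T", "LM"],   # two ambiguous positions
    "iso2": ["P", "Q", "I", "T", "L"],     # no ambiguity
    "iso3": ["P", "QH", "V", "TA", "LM"],  # three ambiguous positions
}

# Expansion multiplies rows: each ambiguous position doubles the count here.
rows = [
    (name, seq)
    for name, positions in isolates.items()
    for seq in expand_ambiguous(positions)
]

train, test = split_by_origin(rows)
print(f"{len(isolates)} isolates expanded to {len(rows)} rows")
print(f"train rows: {len(train)}, test rows: {len(test)}")
```

Rows expanded from the same isolate differ at only a few positions, so a purely random row-level split would tend to place near-identical siblings in both the training and the test set. Splitting by origin, or by sequence cluster as the authors describe, avoids that overlap.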
About the journal:
The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery.
Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field.
As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.