Corrigendum to: A holistic approach towards a generalizable machine learning predictor of cell penetrating peptides

IF 0.9 | Zone 4, Chemistry | JCR Q4: CHEMISTRY, MULTIDISCIPLINARY

Australian Journal of Chemistry Pub Date : 2023-08-11 DOI:10.1071/ch22247_co

Bahaa Ismail, Sarah Jones, John Howl


Citations: 0

Abstract


Corrigendum to: A holistic approach towards a generalizable machine learning predictor of cell penetrating peptides

The development of machine learning (ML) predictors does not necessarily require expansive classifiers and complex feature encoding schemes to achieve the highest accuracy scores. Rather, it requires data pre-processing, feature optimization, and robust evaluation to ensure consistent results and generalizability. Herein, we describe a multi-stage process to develop a reliable ML predictor of cell penetrating peptides (CPPs). We emphasize the challenges of: (i) generating representative datasets with all required pre-processing procedures; (ii) comprehensive and exclusive encoding of peptides using their amino acid composition; (iii) obtaining an optimized feature set using a simple classifier (support vector machine, SVM); (iv) ensuring consistent results; and (v) verifying generalizability at the highest achievable accuracy scores. Two peptide sub-spaces were used to generate the negative examples, which are required, along with the known CPPs, to train the classifier: (i) randomly generated peptides with all amino acid types equally represented and (ii) peptides extracted from receptor proteins. Results indicated that the randomly generated dataset performed perfectly well within its own peptide sub-space but generalized poorly to the other sub-space. Conversely, the dataset extracted from receptor proteins, while achieving lower accuracies, generalized perfectly to the other peptide sub-space. We combined the qualities of these two datasets by averaging their predictions within our final framework. This functional ML predictor, WLVCPP, and associated software and datasets can be downloaded from https://github.com/BahaaIsmail/WLVCPP.
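To make three of the ideas in the abstract concrete (amino acid composition encoding, uniformly random negative peptides, and averaging the predictions of two models), here is a minimal Python sketch. It is not the WLVCPP code: every function name is hypothetical, the plain 20-value composition vector stands in for whatever optimized feature set the authors settled on, and the two scorers are placeholders for classifiers (for example SVMs) trained on the random and the receptor-derived negative sets.

```python
import random
from typing import Callable, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# A scorer maps a feature vector to a CPP score in [0, 1] (assumed interface).
Scorer = Callable[[List[float]], float]


def aa_composition(peptide: str) -> List[float]:
    """Return the fraction of each standard amino acid in the peptide.

    Non-standard letters are ignored in the counts, so the fractions sum
    to 1 only when every residue is one of the 20 standard types.
    """
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    return [peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]


def random_peptide(length: int, rng: random.Random) -> str:
    """Draw each residue uniformly, so all amino acid types are equally likely."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def averaged_score(features: List[float],
                   scorer_random: Scorer,
                   scorer_receptor: Scorer) -> float:
    """Average the scores of the two models, one per negative sub-space."""
    return 0.5 * (scorer_random(features) + scorer_receptor(features))


if __name__ == "__main__":
    rng = random.Random(0)
    negative = random_peptide(12, rng)      # one hypothetical negative example
    features = aa_composition(negative)
    # Placeholder scorers; real ones would be trained classifiers.
    score = averaged_score(features, lambda f: 0.8, lambda f: 0.4)
    print(negative, round(score, 3))
```

Averaging is the simplest way to let the model that generalizes well temper the one that is accurate only within its own sub-space; a weighted combination would be an equally plausible variant, but the abstract only mentions the average.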


Source journal

Australian Journal of Chemistry (Chemistry: Multidisciplinary)

CiteScore: 2.50
Self-citation rate: 0.00%
Publication volume: (no value given)
Review time: 1.3 months

About the journal: Australian Journal of Chemistry - an International Journal for Chemical Science publishes research papers from all fields of chemical science. Papers that are multidisciplinary or address new or emerging areas of chemistry are particularly encouraged. Thus, the scope is dynamic. It includes (but is not limited to) synthesis, structure, new materials, macromolecules and polymers, supramolecular chemistry, analytical and environmental chemistry, natural products, biological and medicinal chemistry, nanotechnology, and surface chemistry. Australian Journal of Chemistry is published with the endorsement of the Commonwealth Scientific and Industrial Research Organisation (CSIRO) and the Australian Academy of Science.
