Benchmarking protein classification algorithms via supervised cross-validation

Journal of Biochemical and Biophysical Methods, 70(6), pp. 1215-1223 · Pub Date: 2008-04-24 · DOI: 10.1016/j.jbbm.2007.05.011

Attila Kertész-Farkas, Somdutta Dhir, Paolo Sonego, Mircea Pacurar, Sergiu Netoteia, Harm Nijveen, Arnold Kuzniar, Jack A.M. Leunissen, András Kocsor, Sándor Pongor


Citations: 23

Abstract

The development and testing of protein classification algorithms are hampered by the fact that the protein universe consists of groups that differ vastly in number of members, average protein size, within-group similarity, and other properties. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates of how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., the selection of test and training sets according to the known subtypes within a database, has previously been used successfully in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was used to construct reduced-size model datasets suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection, which currently includes protein sequence data (both protein domains and entire proteins), protein structure data, and reading-frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms, such as nearest neighbor, support vector machines, artificial neural networks, random forests, and logistic regression, used in conjunction with the sequence comparison algorithms BLAST, Smith-Waterman, and Needleman-Wunsch, as well as the 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic, estimates of classifier performance than random cross-validation schemes do.

The datasets are available at http://hydra.icgeb.trieste.it/benchmark.
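The idea of supervised cross-validation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `supervised_splits`, the `(protein_id, group, subgroup)` record layout, the toy identifiers, and the rule that divides negative subgroups alternately between training and test are all assumptions made for illustration. The only point it demonstrates is that each test set contains positives from a subgroup that was entirely withheld from training.

```python
from collections import defaultdict


def supervised_splits(records, target_group):
    """Yield one (held_out_subgroup, train, test) split per subgroup.

    records: iterable of (protein_id, group, subgroup) tuples.
    Label 1 marks members of target_group, label 0 everything else.
    """
    pos = defaultdict(list)  # subgroup -> ids inside the target group
    neg = defaultdict(list)  # (group, subgroup) -> ids outside it
    for pid, group, subgroup in records:
        if group == target_group:
            pos[subgroup].append(pid)
        else:
            neg[(group, subgroup)].append(pid)

    # Assumption: negative subgroups are divided alternately, in sorted
    # order, so no negative subgroup contributes to both train and test.
    neg_keys = sorted(neg)
    neg_train = [p for k in neg_keys[0::2] for p in neg[k]]
    neg_test = [p for k in neg_keys[1::2] for p in neg[k]]

    for held_out in sorted(pos):
        pos_train = [p for s in sorted(pos) if s != held_out for p in pos[s]]
        if not pos_train:
            continue  # a single-subgroup group leaves no positive training data
        train = [(p, 1) for p in pos_train] + [(p, 0) for p in neg_train]
        test = [(p, 1) for p in pos[held_out]] + [(p, 0) for p in neg_test]
        yield held_out, train, test


# Toy data with made-up identifiers, for illustration only.
records = [
    ("p1", "G1", "a"), ("p2", "G1", "a"), ("p3", "G1", "b"),
    ("p4", "G2", "c"), ("p5", "G2", "d"), ("p6", "G3", "e"),
]
splits = list(supervised_splits(records, "G1"))
for held_out, train, test in splits:
    print(held_out, train, test)
```

With a random k-fold split, members of subgroup `a` would typically appear on both sides of the split, so the classifier would be tested on close relatives of its training examples. Holding out the whole subgroup removes that overlap, which is consistent with the lower performance estimates the abstract reports.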

