Support Vector Machine Classification of Probability Models and Peptide Features for Improved Peptide Identification from Shotgun Proteomics

Sixth International Conference on Machine Learning and Applications (ICMLA 2007). Pub Date: 2007-12-13. DOI: 10.1109/ICMLA.2007.17

C. H. Yamamoto, Maria Cristina Ferreira de Oliveira, M. L. Fujimoto, S. O. Rezende


Citations: 13

Abstract

Mass spectrometry (MS)-based proteomics is a powerful and popular high-throughput process for characterizing the global protein content of a sample. In shotgun proteomics, proteins are typically digested into fragments (peptides) prior to mass analysis, and the presence of a protein is inferred from the identification of its constituent peptides. Accurate proteome characterization therefore depends on the accuracy of this peptide identification step. Database search routines generate predicted spectra for all peptides derived from the known genome information and identify a peptide by matching an experimental spectrum to a predicted one. However, because of problems such as incomplete fragmentation, this process produces a large number of false positives. We present a new scoring algorithm that integrates probabilistic database scoring metrics (from the MSPolygraph program) with physico-chemical properties in a support vector machine (SVM). We demonstrate that this peptide identification classifier SVM (PICS) score is not only more accurate than the single best database scoring metric, but also significantly more accurate than models derived using linear discriminant analysis, a decision tree, or an artificial neural network.
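The general idea of combining a database scoring metric with an additional peptide feature in an SVM can be illustrated with a minimal sketch. It is not the PICS implementation: the abstract does not state the kernel, the feature set, or the training procedure, so the sketch assumes a linear SVM trained by full-batch subgradient descent. The data are synthetic, and the names `make_synthetic_psms`, `train_linear_svm`, `compare_scores`, `db_score`, and `physchem` are hypothetical.

```python
import random


def make_synthetic_psms(n, seed):
    """Generate n synthetic peptide-spectrum matches (illustration only).

    Each row is [db_score, physchem]; label +1 marks a correct match and
    -1 a false positive. Both features and their distributions are invented.
    """
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else -1
        db_score = rng.gauss(1.0 * label, 1.0)  # stand-in for a database scoring metric
        physchem = rng.gauss(1.0 * label, 1.0)  # stand-in for a physico-chemical property
        rows.append([db_score, physchem])
        labels.append(label)
    return rows, labels


def train_linear_svm(rows, labels, lam=0.01, lr=0.1, iters=300):
    """Minimise lam/2 * ||w||^2 + mean hinge loss by subgradient descent."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for x, y in zip(rows, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1.0:  # only margin violators contribute to the hinge term
                for j in range(d):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label equals the true label."""
    hits = sum(1 for x, y in zip(rows, labels) if predict(x) == y)
    return hits / len(rows)


def compare_scores(seed=0):
    """Return (combined SVM accuracy, single-metric accuracy) on held-out data."""
    train_x, train_y = make_synthetic_psms(1000, seed)
    test_x, test_y = make_synthetic_psms(2000, seed + 1)
    w, b = train_linear_svm(train_x, train_y)

    def svm_predict(x):
        return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

    def single_predict(x):
        return 1 if x[0] > 0 else -1  # threshold the database score alone

    return (accuracy(svm_predict, test_x, test_y),
            accuracy(single_predict, test_x, test_y))


svm_acc, single_acc = compare_scores()
print(f"combined SVM accuracy:  {svm_acc:.3f}")
print(f"single-metric accuracy: {single_acc:.3f}")
```

On this toy data the second feature carries independent information about the label, so the combined classifier should separate correct matches from false positives better than a threshold on the database score alone. Whether the same holds on real spectra depends on the features and data used in the paper, which this sketch does not reproduce.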
