A method for improving protein localization prediction from datasets with outliers

2009 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology
Pub Date: 2009-03-30
DOI: 10.1109/CIBCB.2009.4925714

Jiang Tian, Hong Gu, Wenqi Liu

Citations: 1

Abstract

Large-scale genome analysis and drug discovery require automated prediction of protein subcellular localization, and Support Vector Machines (SVMs) address this problem effectively in a supervised manner. However, protein subcellular localization datasets obtained from experiments always contain outliers, which can degrade generalization ability and classification accuracy. To address this issue, we first analyzed the influence of Principal Component Analysis (PCA) on classification performance, and then proposed a hybrid method for predicting protein subcellular localization based on the Weighted Support Vector Machine (WSVM) and PCA. Different weights are assigned to different data points, so that the training algorithm learns the decision boundary according to the relative importance of each point. After dimension reduction of the datasets, kernel-based possibilistic c-means (KPCM) is used to generate the weights, because it yields relatively high values for important data points and low values for outliers. Experimental results on a benchmark dataset are promising and support the effectiveness of the proposed method in terms of prediction accuracy.
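The weighting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Gaussian kernel, a single prototype per class, and a scale parameter `eta` re-estimated on every pass, and the names `gaussian_kernel`, `kpcm_weights`, `sigma`, `m`, and `n_iter` are hypothetical. The abstract does not specify any of these choices.

```python
import math


def gaussian_kernel(x, v, sigma):
    """K(x, v) = exp(-||x - v||^2 / (2 * sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-sq / (2.0 * sigma ** 2))


def kpcm_weights(points, sigma=1.0, m=2.0, n_iter=20):
    """Possibilistic typicality weights in (0, 1] for the points of one class.

    Simplified single-prototype variant (an assumption made for illustration).
    """
    n = len(points)
    dim = len(points[0])
    # Start the prototype at the arithmetic mean of the class.
    v = [sum(p[j] for p in points) / n for j in range(dim)]
    u = [1.0] * n
    for _ in range(n_iter):
        k = [gaussian_kernel(p, v, sigma) for p in points]
        # Kernel-induced squared distance to the prototype: 2 * (1 - K(x, v)).
        d2 = [2.0 * (1.0 - ki) for ki in k]
        um = [ui ** m for ui in u]
        # Scale parameter: typicality-weighted mean distance (assumed heuristic).
        eta = max(sum(a * b for a, b in zip(um, d2)) / sum(um), 1e-12)
        # Typicality decays as the point moves away from the prototype.
        u = [1.0 / (1.0 + (d / eta) ** (1.0 / (m - 1.0))) for d in d2]
        # Pull the prototype toward points that are typical and kernel-close.
        coef = [(ui ** m) * ki for ui, ki in zip(u, k)]
        total = sum(coef)
        v = [sum(c * p[j] for c, p in zip(coef, points)) / total
             for j in range(dim)]
    return u


# Four tightly grouped points and one distant point standing in for an outlier.
points = [(0.0, 0.1), (0.1, 0.0), (-0.1, 0.0), (0.0, -0.1), (3.0, 3.0)]
weights = kpcm_weights(points)
```

In a pipeline like the one the abstract outlines, such weights would be computed after dimension reduction and passed to a weighted SVM as per-sample weights. One way to do this, assuming scikit-learn is used, is the `sample_weight` argument of `SVC.fit`.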

