A method for improving protein localization prediction from datasets with outliers

Jiang Tian, Hong Gu, Wenqi Liu
Published in: 2009 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology
Publication date: 2009-03-30
DOI: 10.1109/CIBCB.2009.4925714 (https://doi.org/10.1109/CIBCB.2009.4925714)
Citations: 1

Abstract

Large-scale genome analysis and drug discovery require an automated method for predicting protein subcellular localization, and Support Vector Machines (SVMs) address this problem effectively in a supervised manner. However, the protein subcellular localization datasets obtained from experiments always contain outliers, which can degrade generalization ability and classification accuracy. To address this issue, we first analyzed the influence of Principal Component Analysis (PCA) on classification performance, and then proposed a hybrid method for predicting protein subcellular localization based on the Weighted Support Vector Machine (WSVM) and PCA. Different weights were assigned to different data points, so that the training algorithm could learn the decision boundary according to the relative importance of the data points. After dimension reduction was applied to the datasets, kernel-based possibilistic c-means (KPCM) was chosen to generate the weights for this algorithm, because it produces relatively high values for important data points and low values for outliers. Experiments on a benchmark dataset show promising results, confirming the effectiveness of the proposed method in terms of prediction accuracy.
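The weighting step described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a Gaussian kernel, a single cluster per class, and a fixed scale parameter `eta` set to the mean kernel distance to the initial center. The function name `kpcm_weights` and the parameters `gamma`, `m`, and `n_iter` are hypothetical choices that do not come from the paper.

```python
import math


def rbf(a, b, gamma):
    """Gaussian kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def kpcm_weights(points, gamma=0.5, m=2.0, n_iter=20):
    """Possibilistic typicality weights in (0, 1] for the points of ONE class.

    Simplified single-cluster sketch (assumption): points far from the
    class prototype in kernel space receive lower weights.
    """
    dim = len(points[0])
    # Start from the plain mean of the class.
    center = [sum(p[j] for p in points) / len(points) for j in range(dim)]

    def kernel_dist2(v):
        # Kernel-induced squared distance: ||phi(x) - phi(v)||^2 = 2 * (1 - K(x, v)).
        return [2.0 * (1.0 - rbf(p, v, gamma)) for p in points]

    # Scale parameter eta: mean kernel distance to the initial center (assumed heuristic).
    d0 = kernel_dist2(center)
    eta = max(sum(d0) / len(d0), 1e-12)

    def typicality(v):
        # Possibilistic membership: u = 1 / (1 + (d^2 / eta)^(1 / (m - 1))).
        return [1.0 / (1.0 + (d / eta) ** (1.0 / (m - 1.0))) for d in kernel_dist2(v)]

    for _ in range(n_iter):
        weights = typicality(center)
        # Prototype update weighted by u^m * K(x, v), so outliers barely move it.
        coef = [(w ** m) * rbf(p, center, gamma) for w, p in zip(weights, points)]
        total = sum(coef)
        center = [sum(c * p[j] for c, p in zip(coef, points)) / total
                  for j in range(dim)]
    return typicality(center)


# Five points forming a tight cluster plus one obvious outlier (toy data).
points = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2), (-0.2, -0.1), (10.0, 10.0)]
weights = kpcm_weights(points)
print([round(w, 3) for w in weights])
```

Weights of this kind could then be supplied as per-sample weights when training an SVM on the PCA-reduced features. For instance, scikit-learn's `SVC.fit` accepts a `sample_weight` argument and `sklearn.decomposition.PCA` can perform the dimension reduction, although the paper does not state which software was used.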