Extracting Features from Protein Sequences Using Chinese Segmentation Techniques for Subcellular Localization

2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology. DOI: 10.1109/CIBCB.2005.1594931

Yang Yang, Bao-Liang Lu


Citations: 12

Abstract


Extracting Features from Protein Sequences Using Chinese Segmentation Techniques for Subcellular Localization

This paper proposes a new method for extracting features from protein sequences to deal with the problem of protein subcellular localization. The idea behind the method arises from Chinese segmentation techniques. We regard the amino acid sequences as text and segment them into words in a non-overlapping way. The words are predefined in a dictionary, which includes valuable words according to some criteria. Every word in the dictionary will be assigned a weight, and a matching strategy called maximum weight product is adopted for segmentation. By recording word frequencies, a given sequence can be converted into a feature vector. To evaluate the effectiveness of the proposed feature extraction method, two different kinds of classifiers are used to predict protein subcellular locations. The experimental results show that our method is superior to existing approaches in classification accuracy and reduces the number of dimensions of feature space at the same time.
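The abstract does not specify how the dictionary is built, how weights are assigned, or how the maximum weight product matching is computed. The Python sketch below is therefore only one plausible reading, not the authors' implementation: it assumes positive word weights, chooses the non-overlapping segmentation whose word weights have the largest product (via dynamic programming over log-weights), and then counts word frequencies over the dictionary. The names `segment`, `feature_vector`, `default_weight`, and the toy dictionary are all hypothetical.

```python
import math
from collections import Counter


def segment(sequence, weights, default_weight=1e-6):
    """Split `sequence` into non-overlapping words whose weights have the
    largest product. Assumption: every weight is positive, and a single
    residue missing from the dictionary is allowed with `default_weight`
    so that every sequence can be segmented."""
    n = len(sequence)
    max_len = max((len(word) for word in weights), default=1)
    best = [-math.inf] * (n + 1)  # best[i]: best log-product for sequence[:i]
    best[0] = 0.0
    back = [0] * (n + 1)          # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sequence[start:end]
            if word in weights:
                weight = weights[word]
            elif len(word) == 1:
                weight = default_weight
            else:
                continue  # multi-residue word not in the dictionary
            score = best[start] + math.log(weight)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end to recover the chosen words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(sequence[start:end])
        end = start
    return words[::-1]


def feature_vector(sequence, weights):
    """Count how often each dictionary word occurs in the segmentation.
    The vector has one entry per dictionary word, in sorted order."""
    vocabulary = sorted(weights)
    counts = Counter(segment(sequence, weights))
    return [counts[word] for word in vocabulary]


# Toy dictionary with made-up weights, for illustration only.
toy_weights = {"MK": 0.5, "KL": 0.9, "M": 0.3, "L": 0.2, "A": 0.4}
print(segment("MKLA", toy_weights))         # ['M', 'KL', 'A']
print(feature_vector("MKLA", toy_weights))  # [1, 1, 0, 1, 0]
```

In this toy case a greedy longest-match from the left would pick `MK` first (product 0.5 × 0.2 × 0.4 = 0.04), whereas the weight-product criterion prefers `M | KL | A` (0.3 × 0.9 × 0.4 = 0.108). The feature vector has one dimension per dictionary word, which is consistent with the abstract's remark that the dictionary controls the dimensionality of the feature space.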
