Extracting Features from Protein Sequences Using Chinese Segmentation Techniques for Subcellular Localization
Yang Yang, Bao-Liang Lu
2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology
DOI: https://doi.org/10.1109/CIBCB.2005.1594931
This paper proposes a new method for extracting features from protein sequences for the problem of protein subcellular localization. The idea behind the method comes from Chinese word segmentation techniques. We treat amino acid sequences as text and segment them into non-overlapping words. The words are predefined in a dictionary, which contains the words judged valuable according to some criteria. Each word in the dictionary is assigned a weight, and a matching strategy called maximum weight product is adopted for segmentation. By recording word frequencies, a given sequence is converted into a feature vector. To evaluate the effectiveness of the proposed feature extraction method, two different kinds of classifiers are used to predict protein subcellular locations. The experimental results show that our method achieves higher classification accuracy than existing approaches while also reducing the dimensionality of the feature space.
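The segmentation and feature-vector steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the dictionary is built, how weights are assigned, or how residues missing from the dictionary are handled. Here the names `segment`, `feature_vector`, and `fallback_weight` are hypothetical, all weights are assumed to be positive, and any single residue absent from the dictionary is assumed to be allowed as a word with a small fallback weight. The sketch reads "maximum weight product" as choosing the non-overlapping segmentation whose word weights have the largest product, found by dynamic programming over log-weights.

```python
import math
from collections import Counter


def segment(sequence, weights, fallback_weight=1e-6):
    """Split `sequence` into non-overlapping words with the largest weight product.

    `weights` maps dictionary words to positive weights (hypothetical values).
    Assumption: a single residue missing from `weights` is still allowed as a
    word, scored with `fallback_weight`, so every sequence can be segmented.
    """
    n = len(sequence)
    max_len = max(map(len, weights), default=1)
    # best[i]: highest sum of log-weights over segmentations of sequence[:i]
    # back[i]: start index of the last word in that best segmentation
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sequence[start:end]
            weight = weights.get(word)
            if weight is None:
                if len(word) != 1:
                    continue  # multi-residue words must be in the dictionary
                weight = fallback_weight
            score = best[start] + math.log(weight)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end to recover the words in order.
    words = []
    i = n
    while i > 0:
        words.append(sequence[back[i]:i])
        i = back[i]
    return words[::-1]


def feature_vector(sequence, weights, vocabulary):
    """Count how often each vocabulary word occurs in the segmentation."""
    counts = Counter(segment(sequence, weights))
    return [counts[word] for word in vocabulary]
```

With the made-up weights `{"A": 0.5, "L": 0.5, "K": 0.4, "AL": 0.6, "LK": 0.1}`, the sequence `"ALK"` is split into `AL` and `K` (product 0.24), which beats `A`, `L`, `K` (0.1) and `A`, `LK` (0.05). Working in log space avoids numerical underflow on long sequences, since a product of many weights below 1 quickly approaches zero.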