New Online Hierarchical Feature Extraction Algorithm for Classification of Protein

M. Kchouk, F. Mhamdi
Published in: 2014 25th International Workshop on Database and Expert Systems Applications (2014-09-07)
DOI: 10.1109/DEXA.2014.20 (https://doi.org/10.1109/DEXA.2014.20)
Citations: 2

Abstract

Feature extraction from biological data is an important task in bioinformatics. The aim of this work is to classify protein sequences automatically. To do so, we use a data mining process: Knowledge Discovery and Data mining (KDD) applied to biological data. We are interested in the first phase of KDD, preprocessing, and within it we focus on the feature extraction step. Feature extraction generates a set of features that is presented to a supervised learning algorithm for classification. The extraction method we adopt is the n-gram method, which extracts features of a fixed length n. In this paper, we propose a hierarchical algorithm for constructing n-grams that yields features of variable length. This extraction algorithm is intended to meet the needs of biologists. Using a linear SVM classifier, experiments on real protein banks show the efficiency of our algorithm, and we compare our work with previous work.
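To make the contrast between fixed-length and variable-length features concrete, here is a minimal Python sketch. The function `ngrams` is the standard sliding-window extraction of length-n substrings. The abstract does not say how the hierarchy is built, so `hierarchical_ngrams` is only an assumed scheme, not the authors' algorithm: it grows features one level at a time and keeps an n-gram only if its prefix survived the previous level and it occurs in at least `min_support` sequences. All function names, parameters, and the toy sequences are hypothetical.

```python
from collections import Counter


def ngrams(sequence: str, n: int) -> list[str]:
    """Return all overlapping substrings of length n, in order."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def hierarchical_ngrams(sequences, max_n=3, min_support=2):
    """Grow variable-length features level by level (ASSUMED scheme).

    Level 1 holds single residues found in at least `min_support` sequences.
    Level n keeps an n-gram only if its (n-1)-length prefix survived
    level n-1 and the n-gram itself meets the same support threshold.
    """
    kept = {}
    previous = None
    for n in range(1, max_n + 1):
        support = Counter()
        for seq in sequences:
            # set() so each sequence counts at most once per n-gram
            support.update(set(ngrams(seq, n)))
        level = {
            gram for gram, count in support.items()
            if count >= min_support
            and (previous is None or gram[:-1] in previous)
        }
        if not level:
            break  # nothing survived, so longer n-grams cannot either
        kept[n] = level
        previous = level
    return kept


def featurize(sequence, kept):
    """Count occurrences of every kept n-gram in one sequence."""
    counts = Counter()
    for n, level in kept.items():
        counts.update(g for g in ngrams(sequence, n) if g in level)
    return counts


if __name__ == "__main__":
    toy = ["MKVLA", "MKVTT", "AKVLA"]  # made-up sequences, not real data
    features = hierarchical_ngrams(toy, max_n=3, min_support=2)
    for seq in toy:
        print(seq, dict(featurize(seq, features)))
```

The count dictionaries returned by `featurize` could be turned into fixed-order vectors and passed to a linear SVM, which is the classifier the abstract names. The support threshold and the prefix rule are illustrative design choices; other pruning criteria would fit the same level-by-level structure.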