Optimized tree-classification algorithm for classification of protein sequences

M. Iqbal, I. Faye, A. Said, Brahim Belhaouari Samir
DOI: 10.1109/ISMSC.2015.7594037
Published in: 2015 International Symposium on Mathematical Sciences and Computing Research (iSMSC)
Publication date: 2015-05-19
Citations: 1

Abstract
Computational intelligence is an active area of research that has been successfully applied to the analysis and modeling of the large volume of biological data accumulated by high-throughput genome sequencing projects. The data consist mainly of DNA, RNA and protein sequences, which are imprecise, incomplete and growing exponentially. Classifying protein sequences into superfamilies can help reveal the structure, function or hidden characteristics of an unknown protein sequence. Classifying protein sequences from primary sequence information alone is a complex and challenging task in the analysis and understanding of sequenced data. Existing classification methods perform well on very limited data; however, the rapid growth of genomic data calls for improved computational methods. In this work, we propose an optimized tree-classification technique that uses a cluster k-nearest-neighbor classification algorithm to classify protein sequences into superfamilies. The proposed technique is alignment-free, and the experimental results show that it outperforms previous state-of-the-art approaches. The best overall classification accuracy achieved is 97-98% on a previously used dataset taken from the well-known UniProtKB database.
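
To make the idea of alignment-free classification concrete, the following is a minimal sketch only, not the authors' method. The abstract does not specify the features, the clustering step or the tree optimization, so this example assumes amino-acid composition as the feature vector and a plain k-nearest-neighbor majority vote as the classifier. The function names, the toy sequences and the labels `fam_A` and `fam_B` are all hypothetical.

```python
from collections import Counter
import math

# The 20 standard amino acids, in a fixed order for the feature vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Alignment-free features: relative frequency of each standard amino acid.

    Assumption: composition features are used here only for illustration;
    the abstract does not say which features the paper uses.
    """
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def knn_predict(train, query_seq, k=3):
    """Plain k-NN majority vote over Euclidean distance between feature vectors.

    train is a list of (sequence, label) pairs. This is ordinary k-NN, not the
    cluster k-NN variant named in the abstract.
    """
    query = composition(query_seq)
    neighbors = sorted(
        (math.dist(composition(seq), query), label) for seq, label in train
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data, not real protein sequences or superfamilies.
train = [
    ("AAAAGGAA", "fam_A"),
    ("AAGAAAGA", "fam_A"),
    ("AGAAAAAG", "fam_A"),
    ("WWYWWYYW", "fam_B"),
    ("YWWWYWYY", "fam_B"),
    ("WYWYWWWY", "fam_B"),
]

print(knn_predict(train, "AAAGAAAA"))
print(knn_predict(train, "WYWWYW"))
```

Because the features depend only on residue counts, no sequence alignment is needed and sequences of different lengths can be compared directly. A real system would need richer features and a labeled dataset such as one drawn from UniProtKB.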