Optimized tree-classification algorithm for classification of protein sequences

2015 International Symposium on Mathematical Sciences and Computing Research (iSMSC). Pub Date: 2015-05-19. DOI: 10.1109/ISMSC.2015.7594037

M. Iqbal, I. Faye, A. Said, Brahim Belhaouari Samir

Citations: 1

Abstract

Computational intelligence is an active area of research that has been successfully applied to the analysis and modeling of the large volume of biological data accumulated by high-throughput genome sequencing projects. The data consist mainly of DNA, RNA and protein sequences, which are imprecise, incomplete and growing exponentially. Classifying protein sequences into superfamilies can help reveal the structure, function or hidden characteristics of an unknown protein sequence. Classifying protein sequences based on primary sequence information is a complex and challenging task in the analysis and understanding of sequenced data. Existing classification methods perform well on very limited data; however, the rapid growth of genomic data calls for improved computational methods. In this work, we propose an optimized tree-classification technique that uses a cluster k-nearest-neighbor classification algorithm to classify protein sequences into superfamilies. The proposed technique is alignment-free, and the experimental results show that it outperforms previous state-of-the-art approaches. The best overall classification accuracy achieved is 97-98% on a previously used dataset taken from the UniProtKB database.
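To make the idea of alignment-free, cluster-based nearest-neighbor classification concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the features, the clustering method, or the tree structure, so everything here is an assumption: amino acid composition stands in for the features, a tiny k-means stands in for the clustering step, and the names `composition`, `kmeans`, `fit` and `predict`, the labels `famA`/`famB`, and the toy sequences are all hypothetical.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Alignment-free feature vector: relative frequency of each amino acid.
    (Assumed feature; the abstract does not say which features are used.)"""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]

def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def kmeans(vectors, n_clusters, iters=10):
    """Tiny deterministic k-means seeded with the first n_clusters vectors."""
    centroids = [list(v) for v in vectors[:n_clusters]]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for v in vectors:
            idx = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
            groups[idx].append(v)
        # Keep the old centroid if a cluster received no points.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def fit(train, clusters_per_class=2):
    """Summarize each class by a few cluster centroids.
    train is a list of (sequence, label); returns a list of (centroid, label)."""
    by_label = {}
    for seq, label in train:
        by_label.setdefault(label, []).append(composition(seq))
    model = []
    for label, vecs in by_label.items():
        for c in kmeans(vecs, min(clusters_per_class, len(vecs))):
            model.append((c, label))
    return model

def predict(model, seq, k=1):
    """Majority vote among the k nearest cluster centroids."""
    x = composition(seq)
    nearest = sorted(model, key=lambda cl: dist(x, cl[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up toy sequences, not taken from UniProtKB.
TRAIN = [
    ("AAALLLAALA", "famA"),
    ("ALALAALLAG", "famA"),
    ("KKEEKKEEKE", "famB"),
    ("EKEKKEEKKD", "famB"),
]

if __name__ == "__main__":
    model = fit(TRAIN, clusters_per_class=1)
    print(predict(model, "LLAALAAL"))  # prints famA
    print(predict(model, "KEKEKEKK"))  # prints famB
```

Comparing a query against a handful of centroids per class, rather than against every training sequence, is one plausible reason to combine clustering with k-nearest-neighbor search as data volumes grow; whether the paper uses the clusters this way is not stated in the abstract.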
