Clustering DNA Sequences of Aspergillus Fumigatus Using Incremental Multiple Medoids

2015 Fifth International Conference on Advances in Computing and Communications (ICACC). Pub Date: 2015-09-01. DOI: 10.1109/ICACC.2015.19

T. Ajayan, P. Sony, Janu R. Panicker, S. Shailesh


Citations: 1

Abstract

Clustering DNA sequences of Aspergillus fumigatus groups a set of sequences into clusters such that similarity among sequences in the same cluster is high, while similarity among sequences in different clusters is low. The main objective of this work is to obtain a more refined clustering technique for analyzing biological data and for grouping DNA sequences into clusters more easily. CD-HIT and DNACLUST are two existing approaches used in bioinformatics for clustering sequences. The major disadvantage of both approaches is that the longest sequence is selected as the cluster representative. Because DNA sequences are enormous in number, traditional clustering algorithms are infeasible for this analysis. To handle such large collections of DNA sequences, a modified version of incremental clustering using multiple medoids is proposed. The key idea is to find multiple representative sequences (medoids) for each cluster in a chunk and to carry out the final DNA analysis on the medoids identified from all the chunks. The main advantage of this incremental clustering is that it uses multiple medoids to represent each cluster in each chunk, which captures the pattern structure more accurately. It not only overcomes the disadvantages of existing techniques but also makes use of the DNA sequence relationships among the identified medoids, which serve as side information to help the final DNA sequence clustering. The proposed incremental approach outperforms existing clustering approaches in terms of clustering accuracy.
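
The chunk-then-merge idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It rests on several assumptions that the abstract does not specify: a Jaccard distance over k-mer sets as the sequence dissimilarity, a plain k-medoids routine with farthest-first initialisation inside each chunk, and hypothetical parameters `chunk_size`, `n_clusters`, and `m` (medoids kept per cluster). The side information among medoids mentioned in the abstract is omitted here; the final pass simply re-clusters the collected medoids.

```python
def kmer_set(seq, k=3):
    """All overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def distance(a, b, k=3):
    """Jaccard distance between k-mer sets (an assumed, alignment-free measure)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return 1.0 - len(ka & kb) / len(union) if union else 0.0


def k_medoids(seqs, n_clusters, n_iter=20):
    """Plain k-medoids with deterministic farthest-first initialisation."""
    medoids = [seqs[0]]
    while len(medoids) < min(n_clusters, len(seqs)):
        # Add the sequence that is farthest from its nearest current medoid.
        medoids.append(max(seqs, key=lambda s: min(distance(s, m) for m in medoids)))
    clusters = []
    for _ in range(n_iter):
        # Assignment step: each sequence joins its nearest medoid.
        clusters = [[] for _ in medoids]
        for s in seqs:
            nearest = min(range(len(medoids)), key=lambda i: distance(s, medoids[i]))
            clusters[nearest].append(s)
        # Update step: the new medoid minimises total distance within its cluster.
        updated = [
            min(c, key=lambda x: sum(distance(x, y) for y in c)) if c else old
            for c, old in zip(clusters, medoids)
        ]
        if updated == medoids:
            break
        medoids = updated
    return [c for c in clusters if c]


def top_medoids(cluster, m):
    """The m members with the smallest total distance to the rest of the cluster."""
    return sorted(cluster, key=lambda x: sum(distance(x, y) for y in cluster))[:m]


def incremental_cluster(seqs, chunk_size, n_clusters, m=2):
    """Cluster chunk by chunk, keep m medoids per cluster, then cluster the medoids."""
    representatives = []
    for start in range(0, len(seqs), chunk_size):
        chunk = seqs[start:start + chunk_size]
        for cluster in k_medoids(chunk, n_clusters):
            representatives.extend(top_medoids(cluster, m))
    # Final pass operates on the collected medoids only, not on all sequences.
    final = k_medoids(representatives, n_clusters)
    label_of = {rep: i for i, group in enumerate(final) for rep in group}
    # Each sequence inherits the label of its nearest representative.
    return [label_of[min(representatives, key=lambda r: distance(s, r))] for s in seqs]


if __name__ == "__main__":
    toy = ["AAAAAAAAAA", "GCGCGCGCGC", "AAAAAAAAAT", "GCGCGCGCGA"]
    print(incremental_cluster(toy, chunk_size=2, n_clusters=2))
```

Keeping several medoids per cluster, rather than one representative such as the longest sequence, lets the final pass see more of each cluster's internal variation while still working on far fewer sequences than the full data set.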



