RESTRAC: REference Sequence Based Space TRAnsformation for Clustering

2017 IEEE International Conference on Data Mining Workshops (ICDMW). Pub Date: 2017-11-01. DOI: 10.1109/ICDMW.2017.66

A. T. Islam, S. Pramanik, Vahid Mirjalili, S. Sural


Citations: 1

Abstract



Effective mining of the large volumes of DNA and RNA fragments produced by next-generation sequencing technologies depends on the availability of efficient analytical tools to process them. An important part of this analysis, given the huge number of fragments, is partitioning them by their level of similarity. In this paper we propose a clustering approach based on space transformation to achieve this partitioning. We use a set of reference sequences to transform each sequence into a point in a multidimensional vector space and perform the clustering in that vector space. We show through extensive analysis that the proposed transformation very closely preserves the clustering properties of the sequences under edit distance. The time for this transformation is linear in the number of sequences. The time saving for clustering is significant because edit distance calculations between two sequences are replaced by vector distance calculations between their corresponding feature vectors. We used agglomerative hierarchical clustering with single and average linkage because these are frequently used by the bioinformatics community. Agglomerative hierarchical clustering runs in time quadratic in the number of sequences, so clustering in the edit space can be prohibitively slow for large numbers of sequences. Greedy heuristic methods exist that cluster much faster, but at the cost of significantly reduced cluster quality. We have applied our method to 16S rRNA fragment datasets obtained from different environmental samples. In these experiments, RESTRAC achieves up to a five-hundred-fold speed-up for single linkage and up to a five-fold speed-up for average linkage while preserving good cluster quality.
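The idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out the exact transformation, so several things here are assumptions: that coordinate i of a sequence's feature vector is its edit distance to reference sequence i, that vectors are compared with Euclidean distance, and that the reference sequences and the cut-off threshold are picked by hand. The names `edit_distance`, `to_vector`, and `single_linkage_labels` are hypothetical, and the toy sequences are invented for illustration; none of this is the authors' implementation.

```python
from itertools import combinations
from math import dist


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def to_vector(seq, references):
    """Assumed transform: coordinate i is the edit distance to reference i."""
    return tuple(edit_distance(seq, ref) for ref in references)


def single_linkage_labels(vectors, threshold):
    """Single-linkage clusters cut at `threshold`.

    Cutting a single-linkage dendrogram at a distance threshold gives the
    connected components of the graph that links every pair of points no
    farther apart than the threshold, so a union-find over all pairs suffices.
    """
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(vectors)), 2):
        # Cheap vector distance (assumed Euclidean) instead of edit distance.
        if dist(vectors[i], vectors[j]) <= threshold:
            parent[find(i)] = find(j)

    return [find(i) for i in range(len(vectors))]


# Toy data: two groups of fragments, one near each hand-picked reference.
sequences = ["ACACACAC", "ACACACAA", "ACCCACAC",
             "GTGTGTGT", "GTGTGTGG", "GTTTGTGT"]
references = ["ACACACAC", "GTGTGTGT"]  # hypothetical choice of references

vectors = [to_vector(s, references) for s in sequences]
labels = single_linkage_labels(vectors, threshold=2.0)
print(vectors)
print(labels)
```

With n sequences and k references, the transform costs n·k edit distance computations, which is linear in n for fixed k, whereas clustering directly in the edit space needs an edit distance for each of the n(n-1)/2 pairs. In this sketch the pairwise loop remains quadratic, but each comparison is a k-dimensional vector distance rather than a dynamic program over two sequences.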
