Accelerating Clustering using Approximate Spanning Tree and Prime Number Based Filter

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Pub Date: 2019-05-20. DOI: 10.1109/IPDPSW.2019.00037

D. Rao, Sutharzan Sreeskandarajan, C. Liang


Citations: 1

Abstract

Motivation: Clustering genomic data, including data generated via high-throughput sequencing, is an important preliminary step for assembly and analysis. However, clustering a large number of sequences is time-consuming.

Methods: In this paper, we discuss algorithmic performance improvements to our existing clustering system, PEACE, via two new approaches: (1) an Approximate Spanning Tree (AST), which is computed much faster than the currently used Minimum Spanning Tree (MST), and (2) a novel Prime Numbers based Heuristic (PNH) for generating features and comparing them to further reduce comparison overheads.

Results: Experiments conducted using a variety of datasets show that the proposed methods significantly improve performance for datasets with large clusters, with only minimal degradation in clustering quality. We also compare our methods against wcd-kaboom, a state-of-the-art clustering software. Our experiments show that AST and PNH underperform wcd-kaboom for datasets that have many small clusters. However, they significantly outperform wcd-kaboom for datasets with large clusters, by a conspicuous ~550x, with comparable clustering quality. The results indicate that the proposed methods hold considerable promise for accelerating clustering of genomic data with large clusters.
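The two ideas named under Methods can be illustrated with a small sketch. The abstract does not give the details of AST or PNH, so everything below is an assumption made for illustration and is not the PEACE implementation: the prime assigned to each nucleotide, the two-segment feature vector, the k-mer Jaccard distance standing in for the expensive sequence comparison, and both thresholds are hypothetical. The sketch shows only the general pattern: a cheap prime-based feature check rejects unlikely pairs before the costly comparison, and every sufficiently similar sequence is attached to the current node at once, rather than adding a single minimum edge at a time as an MST construction would.

```python
from collections import deque
from math import sqrt

# Hypothetical primes per nucleotide (illustrative values, not from the paper).
PRIMES = {"A": 2, "C": 3, "G": 5, "T": 7}


def prime_features(seq, parts=2):
    """Split seq into `parts` segments and sum the nucleotide primes in each."""
    step = max(1, -(-len(seq) // parts))  # ceiling division
    return [sum(PRIMES.get(ch, 0) for ch in seq[i:i + step])
            for i in range(0, step * parts, step)]


def feature_distance(f1, f2):
    """Euclidean distance between two feature vectors (the cheap filter)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def kmer_distance(s1, s2, k=3):
    """Stand-in for the expensive comparison: Jaccard distance of k-mer sets."""
    a = {s1[i:i + k] for i in range(len(s1) - k + 1)}
    b = {s2[i:i + k] for i in range(len(s2) - k + 1)}
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def approximate_spanning_forest(seqs, dist_threshold=0.6, filter_threshold=8.0):
    """Grow one tree per cluster, attaching all close sequences in each pass.

    Returns (parent, cluster, expensive_calls), where parent[i] is the tree
    parent of sequence i (None for a root) and cluster[i] is its cluster id.
    """
    feats = [prime_features(s) for s in seqs]
    unassigned = set(range(len(seqs)))
    parent, cluster = {}, {}
    expensive_calls = 0
    n_clusters = 0
    for start in range(len(seqs)):
        if start not in unassigned:
            continue
        unassigned.discard(start)
        parent[start], cluster[start] = None, n_clusters
        frontier = deque([start])
        while frontier:
            node = frontier.popleft()
            for other in sorted(unassigned):  # sorted() copies, so discard is safe
                if feature_distance(feats[node], feats[other]) > filter_threshold:
                    continue  # rejected by the cheap filter, no expensive call
                expensive_calls += 1
                if kmer_distance(seqs[node], seqs[other]) <= dist_threshold:
                    unassigned.discard(other)
                    parent[other], cluster[other] = node, n_clusters
                    frontier.append(other)
        n_clusters += 1
    return parent, cluster, expensive_calls


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGTT", "TTTTGGGGTA"]
    parent, cluster, calls = approximate_spanning_forest(reads)
    print(parent, cluster, calls)
```

In this toy setup, the larger a cluster is, the faster the pool of unassigned sequences shrinks, which is consistent in spirit with the abstract's observation that the gains are largest for datasets with large clusters.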

