PARMA-CC: Parallel Multiphase Approximate Cluster Combining

Proceedings of the 21st International Conference on Distributed Computing and Networking. Pub Date: 2020-01-04. DOI: 10.1145/3369740.3369785

Amir Keramatian, Vincenzo Gulisano, M. Papatriantafilou, P. Tsigas


Citations: 3

Abstract

Clustering is a common component in data analysis applications. Despite the extensive literature, the continuously increasing volumes of data produced by sensors (e.g. rates of several MB/s from 3D scanners such as LIDAR sensors) and the time-sensitivity of the applications that use the clustering outcomes (e.g. detecting critical situations, which are known to be accuracy-dependent) demand novel approaches that respond faster while coping with large data sets. This is the challenge we address in this paper. We propose an algorithm, PARMA-CC, that complements existing density-based and distance-based clustering methods. PARMA-CC is based on approximate, data-parallel cluster combining: parallel threads compute summaries of the clusters of data (sub)sets and, by combining them, together construct a comprehensive summary of the sets of clusters. By approximating clusters with their respective geometrical summaries, our technique scales well with increased data volumes, and by computing and efficiently combining the summaries in parallel, it enables latency improvements. PARMA-CC combines the summaries using special data structures that enable parallelism through in-place data processing. As we show in our analysis and evaluation, PARMA-CC can complement and outperform well-established methods, with significantly better scalability, while still providing highly accurate results on a variety of data sets, even with skewed data distributions, which cause the traditional approaches to exhibit their worst-case behaviour. In the paper we also describe how PARMA-CC can facilitate time-critical applications through appropriate use of the summaries.
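The abstract's core idea (cluster each data subset independently, reduce every local cluster to a geometric summary, then combine summaries instead of raw points) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: axis-aligned bounding boxes stand in for the paper's geometrical summaries, a toy single-linkage routine stands in for the density-based or distance-based base method, and the combining step is a single sequential pass, whereas the paper combines summaries in parallel with special in-place data structures. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed representations: a point is (x, y); a summary is a bounding box
# (min_x, min_y, max_x, max_y). Neither is taken from the paper.


def _find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def local_clusters(points, eps):
    """Toy base method: link points whose distance is at most eps (O(n^2))."""
    parent = list(range(len(points)))
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            if dx * dx + dy * dy <= eps * eps:
                parent[_find(parent, i)] = _find(parent, j)
    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(_find(parent, i), []).append(p)
    return list(groups.values())


def summarize(cluster):
    """Approximate one cluster by its axis-aligned bounding box."""
    xs = [p[0] for p in cluster]
    ys = [p[1] for p in cluster]
    return (min(xs), min(ys), max(xs), max(ys))


def summarize_partition(points, eps):
    """Work done by one thread: cluster its subset, keep only the summaries."""
    return [summarize(c) for c in local_clusters(points, eps)]


def boxes_close(a, b, eps):
    """True if the gap between two boxes is at most eps (0 when they overlap)."""
    gap_x = max(a[0] - b[2], b[0] - a[2], 0.0)
    gap_y = max(a[1] - b[3], b[1] - a[3], 0.0)
    return gap_x * gap_x + gap_y * gap_y <= eps * eps


def combine(summaries, eps):
    """Single pass: union summaries that are close, return the merged boxes."""
    parent = list(range(len(summaries)))
    for i in range(len(summaries)):
        for j in range(i + 1, len(summaries)):
            if boxes_close(summaries[i], summaries[j], eps):
                parent[_find(parent, i)] = _find(parent, j)
    merged = {}
    for i, box in enumerate(summaries):
        root = _find(parent, i)
        if root in merged:
            m = merged[root]
            box = (min(m[0], box[0]), min(m[1], box[1]),
                   max(m[2], box[2]), max(m[3], box[3]))
        merged[root] = box
    return sorted(merged.values())


def approximate_cluster_combining(points, eps, n_threads):
    """Split the data, summarize each subset in a thread, combine summaries."""
    chunks = [points[i::n_threads] for i in range(n_threads)]
    chunks = [c for c in chunks if c]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        per_chunk = list(pool.map(lambda c: summarize_partition(c, eps), chunks))
    all_summaries = [box for boxes in per_chunk for box in boxes]
    return combine(all_summaries, eps)
```

Because the combining step touches only summaries, its cost depends on the number of local clusters rather than the number of points, which is the scaling argument the abstract makes. Note that CPython threads share the interpreter lock, so this sketch shows the structure of the computation, not a real speedup.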
