DBSCAN on Resilient Distributed Datasets

2015 International Conference on High Performance Computing & Simulation (HPCS). Pub Date: 2015-07-20. DOI: 10.1109/HPCSim.2015.7237086

Irving Cordova, Teng-Sheng Moh


Citations: 45

Abstract


DBSCAN on Resilient Distributed Datasets

DBSCAN is a well-known density-based data clustering algorithm that is widely used because it can find arbitrarily shaped clusters in noisy data. However, DBSCAN is hard to scale, which limits its utility when working with large data sets. Resilient Distributed Datasets (RDDs), on the other hand, are a fast data-processing abstraction created explicitly for in-memory computation on large data sets. This paper presents a new algorithm based on DBSCAN that uses the Resilient Distributed Datasets approach: RDD-DBSCAN. RDD-DBSCAN overcomes the scalability limitations of the traditional DBSCAN algorithm by operating in a fully distributed fashion. The paper also evaluates an implementation of RDD-DBSCAN using Apache Spark, the official RDD implementation.
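For readers unfamiliar with the baseline, the following is a minimal single-machine sketch of classical DBSCAN. It is not the RDD-DBSCAN algorithm from the paper, whose details the abstract does not give. The function names `region_query` and `dbscan`, the parameter names `eps` and `min_pts`, and the sample points are illustrative assumptions. The sketch may help show why the traditional algorithm is hard to scale: each neighborhood query here scans every point, so the total work grows quadratically with the data size.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (i included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    Naive illustration only: every region query is a full scan.
    """
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Hypothetical sample data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, -1]
```

The two dense groups receive cluster ids 0 and 1, and the isolated point is labeled as noise.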

