Parallel k-Means Clustering of Geospatial Data Sets Using Manycore CPU Architectures

2018 IEEE International Conference on Data Mining Workshops (ICDMW), Pub Date: 2018-11-01, DOI: 10.1109/ICDMW.2018.00118

R. Mills, Vamsi Sripathi, J. Kumar, S. Sreepathi, F. Hoffman, W. Hargrove


Citations: 3

Abstract

The increasing availability of high-resolution geospatiotemporal data sets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery and mining of weather, climate, ecological, and other geoscientific data sets fused from disparate sources. Many of the standard tools used on individual workstations are impractical for the analysis and synthesis of data sets of this size; however, new algorithmic approaches that can effectively utilize the complex memory hierarchies and the extremely high levels of parallelism available in state-of-the-art high-performance computing platforms can enable such analysis. Here, we describe pKluster, an open-source tool we have developed for accelerated k-means clustering of geospatial and geospatiotemporal data, and discuss algorithmic modifications and code optimizations we have made to enable it to effectively use parallel machines based on novel CPU architectures—such as the Intel Knights Landing Xeon Phi and Skylake Xeon processors—with many cores and hardware threads, and employing significant single instruction, multiple data (SIMD) parallelism. We outline some applications of the code in ecology and climate science contexts and present a detailed discussion of the performance of the code for one such application, LiDAR-derived vertical vegetation structure classification.
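For readers unfamiliar with the underlying algorithm, the following is a minimal sketch of standard (Lloyd's) k-means, structured so that the assignment step works on independent chunks of points whose partial sums are then reduced. This is the general pattern that makes k-means amenable to many cores; it is not pKluster's implementation, and the abstract does not describe the tool's specific algorithmic modifications. All names here (`sq_dist`, `assign_chunk`, `kmeans`, `n_chunks`) are hypothetical, and the chunks are processed serially for simplicity.

```python
# Illustrative sketch of Lloyd's k-means; NOT pKluster's actual code.
# All function and parameter names are hypothetical.

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_chunk(chunk, centroids):
    """Assign each point in one chunk to its nearest centroid.

    Returns the labels plus per-cluster partial sums and counts,
    which can be combined across chunks in a reduction step.
    """
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    labels = []
    for p in chunk:
        j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        labels.append(j)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return labels, sums, counts

def kmeans(points, centroids, n_chunks=2, max_iter=100):
    """Run k-means from the given initial centroids until labels stop changing."""
    centroids = [list(c) for c in centroids]
    k, dim = len(centroids), len(centroids[0])
    size = max(1, -(-len(points) // n_chunks))  # ceiling division, at least 1
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    labels = []
    for _ in range(max_iter):
        # Chunks are independent of one another; a parallel code could
        # hand each chunk to a separate thread or process.
        results = [assign_chunk(c, centroids) for c in chunks]
        new_labels = [lab for r in results for lab in r[0]]
        if new_labels == labels:
            break  # converged: centroids already match these labels
        # Reduction: combine partial sums and counts into new centroids.
        new_centroids = []
        for j in range(k):
            n = sum(r[2][j] for r in results)
            if n == 0:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
            else:
                new_centroids.append(
                    [sum(r[1][j][d] for r in results) / n for d in range(dim)]
                )
        labels, centroids = new_labels, new_centroids
    return labels, centroids

if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans(pts, [(0.0, 0.0), (10.0, 10.0)]))
```

Because each chunk returns only per-cluster sums and counts, the amount of data combined in the reduction depends on the number of clusters and dimensions, not on the number of points. The inner distance loop is also the natural candidate for SIMD vectorization in a compiled implementation.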
