Implementing a distributed volumetric data analytics toolkit on Apache Spark
Authors: Chao Chen, Yuzhong Yan, Lei Huang, Lijun Qian
Published in: 2017 New York Scientific Data Summit (NYSDS), 2017-08-01
DOI: 10.1109/NYSDS.2017.8085038 (https://doi.org/10.1109/NYSDS.2017.8085038)
Citations: 2
Abstract
The multidimensional array is a fundamental data structure widely used in scientific computing and in many big data analytics applications. Distributed multidimensional arrays have been well studied on High Performance Computing (HPC) platforms; however, little research has been done on widely used big data analytics platforms. In this paper, we present an implementation of the Distributed Multi-dimensional Array Toolkit (DMAT) on top of the Apache Spark big data analytics platform. The toolkit supports several schemes for distributing multidimensional arrays, along with repartitioning, transposition, element access, and data parallelism through a variety of parallel execution templates. This paper introduces the software architecture and implementation of DMAT and studies the performance characteristics of some typical multidimensional array operations under different configurations.
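To make the idea of block distribution and transposition concrete, the following is a minimal sketch in plain Python. It is not DMAT's API: the function names (`to_block`, `distribute`, `transpose`) and the dictionary layout are assumptions made for illustration, with a dictionary keyed by block index standing in for the keyed records a Spark job would hold.

```python
from itertools import product


def to_block(coord, block_shape):
    """Map a global coordinate to (block index, offset inside that block)."""
    block = tuple(c // b for c, b in zip(coord, block_shape))
    local = tuple(c % b for c, b in zip(coord, block_shape))
    return block, local


def distribute(shape, block_shape, value_at):
    """Group every element by its block index (hypothetical stand-in for keyed records)."""
    blocks = {}
    for coord in product(*(range(n) for n in shape)):
        block, local = to_block(coord, block_shape)
        blocks.setdefault(block, {})[local] = value_at(coord)
    return blocks


def transpose(blocks, axes):
    """Permute axes by permuting both the block index and the local offsets."""
    def permute(t):
        return tuple(t[a] for a in axes)

    return {
        permute(block): {permute(local): v for local, v in cells.items()}
        for block, cells in blocks.items()
    }


# A 4 x 6 x 2 volume split into 2 x 3 x 2 blocks gives a 2 x 2 x 1 grid of blocks.
volume = distribute(
    (4, 6, 2), (2, 3, 2),
    value_at=lambda c: c[0] * 100 + c[1] * 10 + c[2],
)

# Move the last axis to the front; each block is relabelled, not merged or split.
swapped = transpose(volume, axes=(2, 0, 1))
```

In this sketch a transposition only relabels block indices and local offsets, so the number of blocks and the contents of each block are unchanged. A real distributed implementation would also have to decide whether blocks move between workers, which this example does not model.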