Latest publications: 2016 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems (PDSW-DISCS)

Klimatic: A Virtual Data Lake for Harvesting and Distribution of Geospatial Data
Tyler J. Skluzacek, K. Chard, Ian T. Foster
DOI: 10.1109/PDSW-DISCS.2016.9
Abstract: Many interesting geospatial datasets are publicly accessible on websites and other online repositories. However, the sheer number of datasets and locations, plus a lack of support for cross-repository search, makes it difficult for researchers to discover and integrate relevant data. We describe here early results from a system, Klimatic, that aims to overcome these barriers to discovery and use by automating the tasks of crawling, indexing, integrating, and distributing geospatial data. Klimatic implements a scalable crawling and processing architecture that uses an elastic container-based model to locate and retrieve relevant datasets and to extract metadata from headers and within files to build a global index of known geospatial data. In so doing, we create an expansive geospatial virtual data lake that records the location, formats, and other characteristics of large numbers of geospatial datasets while also caching popular data subsets for rapid access. A flexible query interface allows users to request data that satisfy supplied type, spatial, temporal, and provider specifications; in processing such queries, the system uses interpolation and aggregation to combine data of different types, data formats, resolutions, and bounds. Klimatic has so far incorporated more than 10,000 datasets from over 120 sources and has been demonstrated to scale well with data size and query complexity.
Citations: 21
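The index-and-filter style of query the abstract describes can be sketched as follows. This is a minimal illustration under assumed structure, not Klimatic's actual API: the record fields, `DatasetRecord`, and `query` are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DatasetRecord:
    """One entry in a hypothetical virtual-data-lake index: it records the
    location and characteristics of a remote dataset, not the data itself."""
    provider: str
    variable: str
    bbox: tuple        # (min_lon, min_lat, max_lon, max_lat)
    start: date
    end: date
    url: str

def query(index, variable, bbox, start, end):
    """Return records whose variable matches and whose spatial and temporal
    extents overlap the request -- the style of type/spatial/temporal
    filtering the abstract attributes to Klimatic's query interface."""
    lo0, la0, lo1, la1 = bbox
    return [r for r in index
            if r.variable == variable
            and r.bbox[0] <= lo1 and r.bbox[2] >= lo0   # longitude overlap
            and r.bbox[1] <= la1 and r.bbox[3] >= la0   # latitude overlap
            and r.start <= end and r.end >= start]      # time overlap

idx = [
    DatasetRecord("NOAA", "temperature", (-10, 40, 10, 60),
                  date(2000, 1, 1), date(2010, 12, 31), "http://example.org/a"),
    DatasetRecord("NOAA", "precipitation", (-10, 40, 10, 60),
                  date(2000, 1, 1), date(2010, 12, 31), "http://example.org/b"),
]
hits = query(idx, "temperature", (0, 45, 5, 50), date(2005, 1, 1), date(2006, 1, 1))
```

Only the temperature record overlaps the requested variable, box, and interval, so `hits` contains a single record.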
Parallel I/O Characterisation Based on Server-Side Performance Counters
S. E. Sayed, M. Bolten, D. Pleiter, W. Frings
DOI: 10.1109/PDSW-DISCS.2016.006
Abstract: Provisioning of high I/O capabilities for high-end HPC architectures is generally considered a challenge. A good understanding of the characteristics of the utilisation of modern I/O systems can help address the increasing performance gap between I/O and computation. In this paper we present results from an analysis of server-side performance counters that had been collected for multiple years on a parallel file system attached to a peta-scale Blue Gene/P system. We developed a set of general performance characterisation metrics, which we applied to this large dataset.
Citations: 1
A Bloom Filter Based Scalable Data Integrity Check Tool for Large-Scale Dataset
Sisi Xiong, Feiyi Wang, Qing Cao
DOI: 10.1109/PDSW-DISCS.2016.13
Abstract: Large-scale HPC applications are becoming increasingly data intensive. At the Oak Ridge Leadership Computing Facility (OLCF), we are observing that the number of files curated under individual projects is reaching as high as 200 million, and project data sizes are exceeding petabytes. These simulation datasets, once validated, often need to be transferred to an archival system for long-term storage or shared with the rest of the research community. Ensuring the data integrity of the full dataset at this scale is paramount but also a daunting task, especially considering that most conventional tools are serial and file-based, unwieldy to use, and unable to scale to meet users' demands. To tackle this challenge, this paper presents the design, implementation, and evaluation of a scalable parallel checksumming tool, fsum, which we developed at OLCF. It is built upon the principles of parallel tree walk and work stealing to maximize parallelism, and it is capable of generating a single, consistent signature for the entire dataset at extreme scale. We also apply a novel Bloom-filter-based technique for aggregating signatures to overcome the signature-ordering requirement. Given the probabilistic nature of the Bloom filter, we provide a detailed error and trade-off analysis. Using multiple datasets from a production environment, we demonstrate that our tool can efficiently handle both very large files and datasets of many small files. Our preliminary tests show that, on the same hardware, it outperforms a conventional tool by as much as 4×. It also exhibits near-linear scaling when provisioned with more compute resources.
Citations: 11
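The order-independent signature aggregation described in the abstract can be sketched as follows. This is a minimal illustration of the underlying idea, not fsum's actual implementation: the filter size, hash count, and class name are invented for the example. Because setting bits in a Bloom filter is commutative, workers can report per-file checksums in any order and still produce one consistent whole-dataset signature.

```python
import hashlib

class BloomAggregate:
    """Order-independent aggregation of per-file checksums via a Bloom filter
    (a sketch; parameters are illustrative, not the paper's choices)."""

    def __init__(self, m_bits=1024, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def add(self, checksum: str):
        # Derive k bit positions from the checksum and set them. Setting
        # bits is commutative, so insertion order does not matter -- this
        # is what removes the signature-ordering requirement.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{checksum}".encode()).digest()
            pos = int.from_bytes(h[:8], "big") % self.m
            self.bits[pos // 8] |= 1 << (pos % 8)

    def signature(self) -> str:
        # One signature for the entire dataset: a hash of the filter state.
        return hashlib.sha256(bytes(self.bits)).hexdigest()

# Two workers processing the same files in different orders agree.
a, b = BloomAggregate(), BloomAggregate()
file_checksums = ["sum-of-file-1", "sum-of-file-2", "sum-of-file-3"]
for c in file_checksums:
    a.add(c)
for c in reversed(file_checksums):
    b.add(c)
```

The probabilistic trade-off the paper analyzes shows up here too: two different checksum sets can, with small probability, set the same bits, so the filter must be sized for the expected number of files.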
Towards Energy Efficient Data Management in HPC: The Open Ethernet Drive Approach
Anthony Kougkas, Anthony Fleck, Xian-He Sun
DOI: 10.1109/PDSW-DISCS.2016.11
Abstract: An Open Ethernet Drive (OED) is a new technology that encloses a low-power processor, a fixed-size memory, and an Ethernet card within a hard drive (HDD or SSD). In this study, we thoroughly evaluate the performance of such a device and the energy required to operate it. The results show, first, that it is a viable solution to offload data-intensive computations to the OED while maintaining reasonable performance, and second, that the energy savings from utilizing this technology are significant, as it consumes only 10% of the power needed by a normal server node. We propose that by using OED devices as storage servers in HPC, we can run a reliable, scalable, cost- and energy-efficient storage solution.
Citations: 5
Scientific Workflows at DataWarp-Speed: Accelerated Data-Intensive Science Using NERSC's Burst Buffer
A. Ovsyannikov, Melissa Romanus, B. V. Straalen, G. Weber, D. Trebotich
DOI: 10.1109/PDSW-DISCS.2016.5
Abstract: Emerging exascale systems have the ability to accelerate time-to-discovery for scientific workflows. However, as these workflows become more complex, the data they generate has grown at an unprecedented rate, making I/O a critical constraint. To address this problem, advanced memory hierarchies, such as burst buffers, have been proposed as intermediate layers between the compute nodes and the parallel file system. In this paper, we utilize the Cray DataWarp burst buffer coupled with in-transit processing mechanisms to demonstrate the advantages of advanced memory hierarchies in preserving traditional coupled scientific workflows. We consider an in-transit workflow that couples simulation of subsurface flows with on-the-fly flow visualization. With respect to the proposed workflow, we study the performance of the Cray DataWarp burst buffer and provide a comparison with the Lustre parallel file system.
Citations: 26
Replicating HPC I/O Workloads with Proxy Applications
J. Dickson, Steven A. Wright, S. Maheswaran, Andy Herdman, Mark C. Miller, S. Jarvis
DOI: 10.1109/PDSW-DISCS.2016.6
Abstract: Large-scale simulation performance depends on a number of components; however, the task of investigation and optimization has long favored computational and communication elements over I/O. Manually extracting the pattern of I/O behavior from a parent application is a useful way to address performance issues on a per-application basis, but developing workflows with some degree of automation and flexibility provides a more powerful approach to tackling current and future I/O challenges. In this paper we describe a workload replication workflow that extracts the I/O pattern of an application and recreates its behavior with a flexible proxy application. We demonstrate how simple lightweight characterization can be translated to provide an effective representation of a physics application, and show how a proxy replication can be used as a tool for investigating I/O library paradigms.
Citations: 14
FatMan vs. LittleBoy: Scaling Up Linear Algebraic Operations in Scale-Out Data Platforms
Luna Xu, Seung-Hwan Lim, A. Butt, S. Sukumar, R. Kannan
DOI: 10.1109/PDSW-DISCS.2016.8
Abstract: Linear algebraic operations such as matrix manipulations form the kernel of many machine learning and other crucial algorithms. Scaling up, as well as scaling out, such algorithms is highly desirable to enable efficient processing over millions of data points. To this end, we present a matrix manipulation approach to effectively scale up each node in a scale-out data-parallel platform such as Apache Spark. Specifically, we enable hardware acceleration for matrix multiplications in a distributed Spark setup without user intervention. Our approach supports both dense and sparse distributed matrices, and provides flexible control of acceleration by matrix density. We demonstrate the benefit of our approach for generalized matrix multiplication operations over large matrices with up to four billion elements. To connect the effectiveness of our approach with machine learning applications, we performed Gramian matrix computation via generalized matrix multiplications. Our experiments show that our approach achieves more than 2× speed-up, and up to 96.1% computation improvement, compared with the state-of-the-art Spark MLlib for dense matrices.
Citations: 3
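The Gramian computation used as the machine-learning benchmark above reduces to a single generalized matrix multiplication, G = AᵀA, where each entry G[i, j] is the dot product of columns i and j of A. A minimal NumPy sketch of that reduction (the dimensions here are arbitrary; the paper's implementation runs the multiplication on distributed Spark matrices with hardware acceleration):

```python
import numpy as np

# Sample data matrix: 1000 observations of 50 features.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 50))

# The Gramian is one generalized matrix multiplication -- the operation
# the paper offloads to accelerated hardware.
G = A.T @ A

# Sanity checks: a Gramian is square, symmetric, and its diagonal holds
# the squared column norms.
assert G.shape == (50, 50)
assert np.allclose(G, G.T)
assert np.allclose(np.diag(G), (A * A).sum(axis=0))
```

Because the whole workload is one matrix product, any speed-up to generalized matrix multiplication translates directly into a speed-up for Gramian-based methods such as covariance estimation.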
Can Non-volatile Memory Benefit MapReduce Applications on HPC Clusters?
Md. Wasi-ur-Rahman, Nusrat S. Islam, Xiaoyi Lu, D. Panda
DOI: 10.1109/PDSW-DISCS.2016.7
Abstract: Modern High-Performance Computing (HPC) clusters are equipped with advanced technological resources that need to be properly utilized to achieve supreme performance for end applications. One such example, Non-Volatile Memory (NVM), provides the opportunity for fast, scalable performance through its DRAM-like performance characteristics. On the other hand, distributed processing engines, such as MapReduce, are continuously being enhanced with features enabling high-performance technologies. In this paper, we present a novel MapReduce framework with an NVRAM-assisted map output spill approach. We have designed our framework on top of the existing RDMA-enhanced Hadoop MapReduce to ensure that both map- and reduce-phase performance enhancements are available to end applications. Our proposed approach significantly enhances map-phase performance, as demonstrated by a wide variety of MapReduce benchmarks and workloads from the Intel HiBench [9] and PUMA [18] suites. Our performance evaluation illustrates that the NVRAM-based spill approach can improve map execution performance by 2.73×, which contributes to an overall execution improvement of 55% for Sort. Our design also yields significant performance benefits for other workloads: 54% for TeraSort, 21% for PageRank, 58% for SelfJoin, etc. To the best of our knowledge, this is the first approach towards leveraging NVRAM in MapReduce execution frameworks for applications on HPC clusters.
Citations: 5
A Generic Framework for Testing Parallel File Systems
Jinrui Cao, Simeng Wang, Dong Dai, Mai Zheng, Yong Chen
DOI: 10.1109/PDSW-DISCS.2016.12
Abstract: Large-scale parallel file systems are of prime importance today. However, despite their importance, their failure-recovery capability is much less studied than that of local storage systems. Recent studies on local storage systems have exposed various vulnerabilities that could lead to data loss under failure events, which raises concern for parallel file systems built on top of them. This paper proposes a generic framework for testing the failure handling of large-scale parallel file systems. The framework captures all disk I/O commands on all storage nodes of the target system to emulate realistic failure states, and checks whether the target system can recover to a consistent state without incurring data loss. We have built a prototype for the Lustre file system. Our preliminary results show that the framework is able to uncover the internal I/O behavior of Lustre under different workloads and failure conditions, which provides a solid foundation for further analyzing the failure recovery of parallel file systems.
Citations: 13