IEEE International Symposium on High-Performance Parallel Distributed Computing: Latest Publications

Design and evaluation of the GeMTC framework for GPU-enabled many-task computing
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600228
Scott J. Krieder, J. Wozniak, Timothy G. Armstrong, M. Wilde, D. Katz, Benjamin Grimmer, Ian T. Foster, I. Raicu
{"title":"Design and evaluation of the gemtc framework for GPU-enabled many-task computing","authors":"Scott J. Krieder, J. Wozniak, Timothy G. Armstrong, M. Wilde, D. Katz, Benjamin Grimmer, Ian T Foster, I. Raicu","doi":"10.1145/2600212.2600228","DOIUrl":"https://doi.org/10.1145/2600212.2600228","url":null,"abstract":"We present the design and first performance and usability evaluation of GeMTC, a novel execution model and runtime system that enables accelerators to be programmed with many concurrent and independent tasks of potentially short or variable duration. With GeMTC, a broad class of such \"many-task\" applications can leverage the increasing number of accelerated and hybrid high-end computing systems. GeMTC overcomes the obstacles to using GPUs in a many-task manner by scheduling and launching independent tasks on hardware designed for SIMD-style vector processing. We demonstrate the use of a high-level MTC programming model (the Swift parallel dataflow language) to run tasks on many accelerators and thus provide a high-productivity programming model for the growing number of supercomputers that are accelerator-enabled. While still in an experimental stage, GeMTC can already support tasks of fine (subsecond) granularity and execute concurrent heterogeneous tasks on 86,000 independent GPU warps spanning 2.7M GPU threads on the Blue Waters supercomputer.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124523592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 43
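The core pattern GeMTC enables, many short independent tasks drained from a shared queue by a large pool of workers (GPU warps, in GeMTC's case), can be sketched in a few lines. The following is a host-side Python illustration of the many-task pattern only, not GeMTC's CUDA runtime; the worker count and task payloads are invented.

```python
# A minimal host-side sketch of the many-task execution pattern GeMTC targets:
# a shared queue of short, independent tasks consumed by a pool of workers
# (standing in for GPU warps). Illustrative Python, not GeMTC's CUDA runtime.
import queue
import threading

def run_many_tasks(tasks, n_workers=8):
    """Drain a queue of independent, fine-grained tasks with n_workers workers."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                fn, arg = q.get_nowait()
            except queue.Empty:
                return
            r = fn(arg)  # short, independent task body
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# Example: 10,000 fine-grained tasks of variable (here trivial) duration.
print(len(run_many_tasks([(lambda x: x * x, i) for i in range(10000)])))
```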
A scalable distributed skip list for range queries
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600712
S. Alam, Humaira Kamal, Alan S. Wagner
{"title":"A scalable distributed skip list for range queries","authors":"S. Alam, Humaira Kamal, Alan S. Wagner","doi":"10.1145/2600212.2600712","DOIUrl":"https://doi.org/10.1145/2600212.2600712","url":null,"abstract":"In this paper we present a distributed, message passing implementation of a dynamic dictionary structure for range queries. The structure is based on a distributed fine-grain implementation of skip lists that can scale across a cluster of multicore machines. Our implementation makes use of the unique features of Fine-Grain MPI and introduces novel algorithms and techniques to achieve scalable performance on a cluster of multicore machines. Unlike concurrent data structures the distributed skip list operations are deterministic and atomic. Range-queries are implemented in a way that parallelizes the operation and takes advantage of the recursive properties of the skip list structure. We report on the performance of the skip list for range-queries, on a medium sized cluster with two hundred cores.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129209453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
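The range query the paper parallelizes rests on the skip list's ability to descend to the first key in the range in O(log n) and then scan level 0 in order. Below is a minimal single-process sketch of that search-plus-scan; the paper's contribution, distributing the structure over Fine-Grain MPI processes, is far more involved and is not shown.

```python
# A minimal single-process skip list with a range query, sketching the data
# structure the paper distributes across MPI processes.
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)

class SkipList:
    MAX_LEVEL = 16

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)
        self.level = 0

    def insert(self, key):
        update = [self.head] * (self.MAX_LEVEL + 1)
        x = self.head
        for i in range(self.level, -1, -1):
            while x.forward[i] and x.forward[i].key < key:
                x = x.forward[i]
            update[i] = x
        lvl = 0
        while random.random() < 0.5 and lvl < self.MAX_LEVEL:
            lvl += 1
        self.level = max(self.level, lvl)
        node = Node(key, lvl)
        for i in range(lvl + 1):
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def range_query(self, lo, hi):
        """Descend to the first key >= lo, then walk level 0 up to hi."""
        x = self.head
        for i in range(self.level, -1, -1):
            while x.forward[i] and x.forward[i].key < lo:
                x = x.forward[i]
        x = x.forward[0]
        out = []
        while x and x.key <= hi:
            out.append(x.key)
            x = x.forward[0]
        return out

sl = SkipList()
for k in random.sample(range(1000), 200):
    sl.insert(k)
print(sl.range_query(100, 120))
```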
Glasswing: accelerating MapReduce on multi-core and many-core clusters
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600706
I. El-Helw, Rutger F. H. Hofman, H. Bal
{"title":"Glasswing: accelerating mapreduce on multi-core and many-core clusters","authors":"I. El-Helw, Rutger F. H. Hofman, H. Bal","doi":"10.1145/2600212.2600706","DOIUrl":"https://doi.org/10.1145/2600212.2600706","url":null,"abstract":"The impact and significance of parallel computing techniques is continuously increasing given the current trend of incorporating more cores in new processor designs. However, many Big Data systems fail to exploit the abundant computational power of multi-core CPUs and GPUs to their full potential. We present Glasswing, a scalable MapReduce framework that employs a configurable mixture of coarse- and fine-grained parallelism to achieve high performance on multi-core CPUs and GPUs. We experimentally evaluated the performance of five MapReduce applications and show that Glasswing outperforms Hadoop on a 64-node multi-core CPU cluster by a factor between 1.8 and 4, and by a factor from 20 to 30 on a 16-node GPU cluster.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126262380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
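The coarse/fine mix Glasswing configures can be illustrated with a toy word count: independent input chunks are processed in parallel across workers (coarse-grained), and each chunk's records are mapped inside the worker (fine-grained). A pure-Python sketch with invented chunk sizes; Glasswing itself runs native CPU and GPU kernels.

```python
# Toy word count mixing coarse-grained parallelism (chunks across processes)
# with fine-grained per-record work inside each chunk.
from collections import Counter
from multiprocessing import Pool

def map_chunk(chunk):
    # Fine-grained work: one map step per record in the chunk.
    c = Counter()
    for line in chunk:
        c.update(line.split())
    return c

def mapreduce(lines, n_workers=4, chunk_size=2):
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with Pool(n_workers) as pool:          # coarse-grained: chunks in parallel
        partials = pool.map(map_chunk, chunks)
    total = Counter()
    for p in partials:                     # reduce: merge partial counts
        total.update(p)
    return total

if __name__ == "__main__":
    data = ["a b a", "b c", "a c c", "b b a"]
    print(mapreduce(data))
```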
ConCORD: easily exploiting memory content redundancy through the content-aware service command
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600214
Lei Xia, Kyle C. Hale, P. Dinda
{"title":"ConCORD: easily exploiting memory content redundancy through the content-aware service command","authors":"Lei Xia, Kyle C. Hale, P. Dinda","doi":"10.1145/2600212.2600214","DOIUrl":"https://doi.org/10.1145/2600212.2600214","url":null,"abstract":"We argue that memory content-tracking across the nodes of a parallel machine should be factored into a distinct platform service on top of which application services can be built. ConCORD is a proof-of-concept system that we have developed and evaluated to test this claim. Our core insight is that many application services can be described as a query over memory content. This insight leads to a core concept in ConCORD, the content-aware service command architecture, in which an application service is implemented as a parametrization of a single general query that ConCORD knows how to execute well. ConCORD dynamically adapts the execution of the query to the amount of redundancy available and other factors. We show that a complex application service (collective checkpointing) can be implemented in only hundreds of lines of code within ConCORD, while performing well.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131865967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
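A query over memory content presupposes content tracking, which is typically done by hashing fixed-size memory blocks so that identical content collapses to one entry. The sketch below shows that idea on invented data with an assumed 4 KB block size; ConCORD's actual service runs distributed across the machine's nodes.

```python
# Sketch of content tracking: hash fixed-size memory blocks so identical
# content across nodes collapses to one entry, then query the redundancy.
import hashlib
from collections import defaultdict

BLOCK = 4096  # bytes per tracked memory block (assumed)

def content_map(memories):
    """memories: {node_id: bytes}. Returns hash -> set of (node, offset)."""
    index = defaultdict(set)
    for node, mem in memories.items():
        for off in range(0, len(mem), BLOCK):
            digest = hashlib.sha1(mem[off:off + BLOCK]).hexdigest()
            index[digest].add((node, off))
    return index

def redundancy(index):
    """Fraction of block instances whose content also exists elsewhere."""
    total = sum(len(locs) for locs in index.values())
    unique = len(index)
    return 1 - unique / total

mems = {0: b"A" * 8192 + b"B" * 4096, 1: b"A" * 4096 + b"C" * 8192}
print(f"redundancy: {redundancy(content_map(mems)):.2f}")
```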
Next generation job management systems for extreme-scale ensemble computing
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600703
Ke Wang, Xiaobing Zhou, Hao Chen, M. Lang, I. Raicu
{"title":"Next generation job management systems for extreme-scale ensemble computing","authors":"Ke Wang, Xiaobing Zhou, Hao Chen, M. Lang, I. Raicu","doi":"10.1145/2600212.2600703","DOIUrl":"https://doi.org/10.1145/2600212.2600703","url":null,"abstract":"With the exponential growth of supercomputers in parallelism, applications are growing more diverse, including traditional large-scale HPC MPI jobs, and ensemble workloads such as finer-grained many-task computing (MTC) applications. Delivering high throughput and low latency for both workloads requires developing a distributed job management system that is magnitudes more scalable than today's centralized ones. In this paper, we present a distributed job launch prototype, SLURM++, which is comprised of multiple controllers with each one managing a partition of SLURM daemons, while ZHT (a distributed key-value store) is used to store the job and resource metadata. We compared SLURM++ with SLURM using micro-benchmarks of different job sizes up to 500 nodes, with excellent results showing 10X higher throughput. We also studied the potential of distributed scheduling through simulations up to millions of nodes.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123524569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 62
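The architecture, several controllers each owning a partition of daemons and coordinating through a key-value store, can be modeled in miniature as below. A plain dict stands in for ZHT, and the partition names, sizes, and borrowing policy are invented for illustration.

```python
# Toy model of the SLURM++ architecture: controllers own partitions of nodes
# and share resource state through a key-value store (a dict standing in for
# ZHT). A controller first tries its own partition, then borrows elsewhere.
kv = {}  # ZHT stand-in: partition_id -> list of free node names

def init_partitions(n_ctrl, nodes_per_part):
    for p in range(n_ctrl):
        kv[p] = [f"p{p}-n{i}" for i in range(nodes_per_part)]

def launch(controller, n_nodes):
    """Allocate n_nodes for a job, preferring the controller's own partition."""
    alloc = []
    for p in [controller] + [p for p in kv if p != controller]:
        while kv[p] and len(alloc) < n_nodes:
            alloc.append(kv[p].pop())
        if len(alloc) == n_nodes:
            return alloc
    # Not enough free nodes anywhere: roll back (single-digit partition ids).
    for node in alloc:
        kv[int(node[1])].append(node)
    return None

init_partitions(n_ctrl=4, nodes_per_part=8)
print(launch(controller=0, n_nodes=10))  # spills over into other partitions
```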
Scalable matrix inversion using MapReduce
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600220
Jingen Xiang, Huangdong Meng, Ashraf Aboulnaga
{"title":"Scalable matrix inversion using MapReduce","authors":"Jingen Xiang, Huangdong Meng, Ashraf Aboulnaga","doi":"10.1145/2600212.2600220","DOIUrl":"https://doi.org/10.1145/2600212.2600220","url":null,"abstract":"Matrix operations are a fundamental building block of many computational tasks in fields as diverse as scientific computing, machine learning, and data mining. Matrix inversion is an important matrix operation, but it is difficult to implement in today's popular parallel dataflow programming systems, such as MapReduce. The reason is that each element in the inverse of a matrix depends on multiple elements in the input matrix, so the computation is not easily partitionable. In this paper, we present a scalable and efficient technique for matrix inversion in MapReduce. Our technique relies on computing the LU decomposition of the input matrix and using that decomposition to compute the required matrix inverse. We present a technique for computing the LU decomposition and the matrix inverse using a pipeline of MapReduce jobs. We also present optimizations of this technique in the context of Hadoop. To the best of our knowledge, our technique is the first matrix inversion technique using MapReduce. We show experimentally that our technique has good scalability, enabling us to invert a 10^5 x 10^5 matrix in 5 hours on Amazon EC2. We also show that our technique outperforms ScaLAPACK, a state-of-the-art linear algebra package that uses MPI.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127615237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
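The mathematical core of the technique, computing A^{-1} from an LU factorization by solving A X = I, is shown below on a single node using SciPy (assumed available); the paper's contribution is partitioning exactly this computation across a pipeline of MapReduce jobs, which the sketch omits.

```python
# Invert a matrix via LU factorization: factor A = P L U once, then solve
# A X = I for X = A^{-1}. Single-node sketch of the math behind the paper's
# MapReduce pipeline.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))

lu, piv = lu_factor(A)                    # A = P L U
A_inv = lu_solve((lu, piv), np.eye(500))  # solve A X = I, column block by block

print(np.allclose(A @ A_inv, np.eye(500), atol=1e-8))
```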
FT-ScaLAPACK: correcting soft errors on-line for ScaLAPACK Cholesky, QR, and LU factorization routines
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600232
Panruo Wu, Zizhong Chen
{"title":"FT-ScaLAPACK: correcting soft errors on-line for ScaLAPACK cholesky, QR, and LU factorization routines","authors":"Panruo Wu, Zizhong Chen","doi":"10.1145/2600212.2600232","DOIUrl":"https://doi.org/10.1145/2600212.2600232","url":null,"abstract":"It is well known that soft errors in linear algebra operations can be detected off-line at the end of the computation using algorithm-based fault tolerance (ABFT). However, traditional ABFT usually cannot correct errors in Cholesky, QR, and LU factorizations because any error in one matrix element will be propagated to many other matrix elements and hence cause too many errors to correct. Although, recently, tremendous progresses have been made to correct errors in LU and QR factorizations, these new techniques correct errors off-line at the end of the computation after errors propagated and accumulated, which significantly complicates the error correction process and introduces at least quadratically increasing overhead as the number of errors increases. In this paper, we present the design and implementation of FT-ScaLAPACK, a fault tolerant version ScaLAPACK that is able to detect, locate, and correct errors in Cholesky, QR, and LU factorizations on-line in the middle of the computation in a timely manner before the errors propagate and accumulate. FT-ScaLAPACK has been validated with thousands of cores on Stampede at the Texas Advanced Computing Center. Experimental results demonstrate that FT-ScaLAPACK is able to achieve comparable performance and scalability with the original ScaLAPACK.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124418635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 51
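Classic ABFT, which FT-ScaLAPACK extends to on-line correction, encodes the matrix with checksum rows and columns so that a single corrupted element shows up as a matched row/column checksum mismatch. Below is a minimal static-matrix sketch (NumPy assumed); the paper's real contribution is maintaining such checksums during Cholesky, QR, and LU so errors are caught before they propagate.

```python
# ABFT sketch: append row/column checksums, inject a soft error, then locate
# and correct it from the checksum mismatches. Detection on a static matrix
# only; FT-ScaLAPACK maintains checksums throughout the factorization.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))

# Encode: a column of row sums and a row of column sums.
enc = np.zeros((7, 7))
enc[:6, :6] = A
enc[:6, 6] = A.sum(axis=1)
enc[6, :6] = A.sum(axis=0)

# Inject a soft error (a silently corrupted value).
enc[2, 4] += 5.0

# Detect and locate: the row and column whose checksums no longer match.
row_err = enc[:6, :6].sum(axis=1) - enc[:6, 6]
col_err = enc[:6, :6].sum(axis=0) - enc[6, :6]
i, j = np.argmax(np.abs(row_err)), np.argmax(np.abs(col_err))
print(f"error at ({i}, {j}), magnitude {row_err[i]:.1f}")  # -> (2, 4), 5.0

# Correct: subtract the discrepancy.
enc[i, j] -= row_err[i]
```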
Seraph: an efficient, low-cost system for concurrent graph processing
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600222
Jilong Xue, Zhi Yang, Zhi Qu, Shian Hou, Yafei Dai
{"title":"Seraph: an efficient, low-cost system for concurrent graph processing","authors":"Jilong Xue, Zhi Yang, Zhi Qu, Shian Hou, Yafei Dai","doi":"10.1145/2600212.2600222","DOIUrl":"https://doi.org/10.1145/2600212.2600222","url":null,"abstract":"Graph processing systems have been widely used in enterprises like online social networks to process their daily jobs. With the fast growing of social applications, they have to efficiently handle massive concurrent jobs. However, due to the inherent design for single job, existing systems incur great inefficiency in memory use and fault tolerance. Motivated by this, in this paper we introduce Seraph, a graph processing system that enables efficient job-level parallelism. Seraph is designed based on a decoupled data model, which allows multiple concurrent jobs to share graph structure data in memory. Seraph adopts a copy-on-write semantic to isolate the graph mutation of concurrent jobs, and a lazy snapshot protocol to generate consistent graph snapshots for jobs submitted at different time. Moreover, Seraph adopts an incremental checkpoint/regeneration model which can tremendously reduce the overhead of checkpointing. We have implemented Seraph, and the evaluation results show that Seraph significantly outperforms popular systems (such as Giraph and Spark) in both memory usage and job completion time, when executing concurrent graph jobs.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126611412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 61
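Seraph's decoupled data model lets concurrent jobs share one in-memory graph while isolating their mutations via copy-on-write. The single-process sketch below illustrates only that semantic, with an invented overlay design; Seraph's actual system also covers lazy snapshots and incremental checkpointing.

```python
# Copy-on-write graph views: jobs share one adjacency structure; a job that
# mutates the graph copies only the vertices it touches into a private
# overlay, leaving other jobs' views intact.
class GraphView:
    def __init__(self, shared_adj):
        self.shared = shared_adj   # structure shared by all jobs
        self.private = {}          # this job's copy-on-write overlay

    def neighbors(self, v):
        return self.private.get(v, self.shared.get(v, []))

    def add_edge(self, u, v):
        # Copy u's adjacency list on first write; shared data stays untouched.
        if u not in self.private:
            self.private[u] = list(self.shared.get(u, []))
        self.private[u].append(v)

shared = {0: [1, 2], 1: [2], 2: []}
job_a, job_b = GraphView(shared), GraphView(shared)
job_a.add_edge(0, 3)
print(job_a.neighbors(0))  # [1, 2, 3] -- job A sees its own mutation
print(job_b.neighbors(0))  # [1, 2]    -- job B's view is isolated
```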
TOP-PIM: throughput-oriented programmable processing in memory
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600213
D. Zhang, N. Jayasena, Alexander Lyashevsky, J. Greathouse, Lifan Xu, Mike Ignatowski
{"title":"TOP-PIM: throughput-oriented programmable processing in memory","authors":"D. Zhang, N. Jayasena, Alexander Lyashevsky, J. Greathouse, Lifan Xu, Mike Ignatowski","doi":"10.1145/2600212.2600213","DOIUrl":"https://doi.org/10.1145/2600212.2600213","url":null,"abstract":"As computation becomes increasingly limited by data movement and energy consumption, exploiting locality throughout the memory hierarchy becomes critical to continued performance scaling. Moving computation closer to memory presents an opportunity to reduce both energy and data movement overheads. We explore the use of 3D die stacking to move memory-intensive computations closer to memory. This approach to processing in memory addresses some drawbacks of prior research on in-memory computing and is commercially viable in the foreseeable future.\u0000 Because 3D stacking provides increased bandwidth, we study throughput-oriented computing using programmable GPU compute units across a broad range of benchmarks, including graph and HPC applications. We also introduce a methodology for rapid design space exploration by analytically predicting performance and energy of in-memory processors based on metrics obtained from execution on today's GPU hardware. Our results show that, on average, viable PIM configurations show moderate performance losses (27%) in return for significant energy efficiency improvements (76% reduction in EDP) relative to a representative mainstream GPU at 22nm technology. At 16nm technology, on average, viable PIM configurations are performance competitive with a representative mainstream GPU (7% speedup) and provide even greater energy efficiency improvements (85% reduction in EDP).","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116653581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 322
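The paper's rapid design-space exploration rests on analytically predicting time and energy rather than simulating. A deliberately crude model in that spirit: kernel time as the maximum of compute time and memory time, with the energy-delay product (EDP) compared between a host GPU and a hypothetical PIM configuration. Every number below is an invented placeholder, not a value from the paper.

```python
# Crude analytical model: execution time as max(compute, memory) and EDP
# (energy * delay) for a host GPU vs. a hypothetical PIM configuration.
def exec_time(flops, bytes_moved, peak_flops, peak_bw):
    return max(flops / peak_flops, bytes_moved / peak_bw)

def edp(flops, bytes_moved, peak_flops, peak_bw, power_w):
    t = exec_time(flops, bytes_moved, peak_flops, peak_bw)
    return (power_w * t) * t  # energy * delay

kernel = dict(flops=1e12, bytes_moved=4e11)  # a memory-bound kernel (invented)

host = edp(**kernel, peak_flops=4e12, peak_bw=3e11, power_w=200)
pim = edp(**kernel, peak_flops=1e12, peak_bw=1e12, power_w=60)  # slower compute, higher BW
print(f"PIM EDP reduction: {(1 - pim / host):.0%}")
```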
Improving energy efficiency of embedded DRAM caches for high-end computing systems
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600216
Sparsh Mittal, J. Vetter, Dong Li
{"title":"Improving energy efficiency of embedded DRAM caches for high-end computing systems","authors":"Sparsh Mittal, J. Vetter, Dong Li","doi":"10.1145/2600212.2600216","DOIUrl":"https://doi.org/10.1145/2600212.2600216","url":null,"abstract":"The number of cores in a single chip in the nodes of high-end computing systems is on rise, due, in part, to a number of constraints, such as power consumption. With this, the size of the last level cache (LLC) has also increased significantly. Since LLCs built with SRAM consume high leakage power, power consumption of LLCs is becoming a significant fraction of processor power consumption. To address this issue, researchers have used embedded DRAM (eDRAM) LLCs which consume low leakage power. However, eDRAM caches consume a significant amount of energy in the form of refresh energy. In this paper, we propose ESTEEM, an energy saving technique for embedded DRAM caches. ESTEEM uses dynamic cache reconfiguration to turn off a portion of the cache to save both leakage and refresh energy. It logically divides the cache sets into multiple modules and turns off possibly different number of ways in each module. Microarchitectural simulations confirm that ESTEEM is effective in improving performance and energy efficiency and provides better results compared to a recently-proposed eDRAM cache energy saving technique, namely Refrint. For single and dual-core simulations, the average energy saving in memory subsystem (LLC+main memory) with ESTEEM is 25.8% and 32.6% respectively, and the average weighted speedup is 1.09x and 1.22x respectively. Additional experiments confirm that ESTEEM works well for a wide-range of system and algorithm parameters.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"280 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123089617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
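ESTEEM's lever is the active-way count per module: fewer active ways mean less leakage and refresh energy but more misses. The toy model below shows only that trade-off; the per-way energies and miss penalty are invented placeholders, not parameters from the paper.

```python
# Toy energy model for way-level cache reconfiguration: turning off eDRAM
# ways saves leakage + refresh energy at the cost of extra miss energy.
LEAK_PER_WAY = 1.0      # leakage energy per active way per interval (a.u., invented)
REFRESH_PER_WAY = 0.6   # refresh energy per active way per interval (a.u., invented)
MISS_ENERGY = 0.05      # main-memory energy per extra miss (a.u., invented)

def interval_energy(active_ways_per_module, extra_misses):
    """Energy for one interval given each module's active-way count."""
    active = sum(active_ways_per_module)
    return active * (LEAK_PER_WAY + REFRESH_PER_WAY) + extra_misses * MISS_ENERGY

full = interval_energy([8, 8, 8, 8], extra_misses=0)
down = interval_energy([8, 4, 2, 4], extra_misses=100)  # modules downsized unevenly
print(f"energy saving: {(1 - down / full):.0%}")
```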