IEEE International Symposium on High-Performance Parallel Distributed Computing: Latest Publications

Design and evaluation of the GeMTC framework for GPU-enabled many-task computing
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600228
Scott J. Krieder, J. Wozniak, Timothy G. Armstrong, M. Wilde, D. Katz, Benjamin Grimmer, Ian T. Foster, I. Raicu
{"title":"Design and evaluation of the gemtc framework for GPU-enabled many-task computing","authors":"Scott J. Krieder, J. Wozniak, Timothy G. Armstrong, M. Wilde, D. Katz, Benjamin Grimmer, Ian T Foster, I. Raicu","doi":"10.1145/2600212.2600228","DOIUrl":"https://doi.org/10.1145/2600212.2600228","url":null,"abstract":"We present the design and first performance and usability evaluation of GeMTC, a novel execution model and runtime system that enables accelerators to be programmed with many concurrent and independent tasks of potentially short or variable duration. With GeMTC, a broad class of such \"many-task\" applications can leverage the increasing number of accelerated and hybrid high-end computing systems. GeMTC overcomes the obstacles to using GPUs in a many-task manner by scheduling and launching independent tasks on hardware designed for SIMD-style vector processing. We demonstrate the use of a high-level MTC programming model (the Swift parallel dataflow language) to run tasks on many accelerators and thus provide a high-productivity programming model for the growing number of supercomputers that are accelerator-enabled. While still in an experimental stage, GeMTC can already support tasks of fine (subsecond) granularity and execute concurrent heterogeneous tasks on 86,000 independent GPU warps spanning 2.7M GPU threads on the Blue Waters supercomputer.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124523592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 43
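The core pattern GeMTC enables, many short independent tasks drained from a shared queue by a large pool of workers (GPU warps, in GeMTC's case), can be sketched in a few lines. The following is a host-side Python illustration of the many-task pattern only, not GeMTC's CUDA runtime; the worker count and task payloads are invented.

```python
# A minimal host-side sketch of the many-task execution pattern GeMTC targets:
# a shared queue of short, independent tasks consumed by a pool of workers
# (standing in for GPU warps). Illustrative Python, not GeMTC's CUDA runtime.
import queue
import threading

def run_many_tasks(tasks, n_workers=8):
    """Drain a queue of independent, fine-grained tasks with n_workers workers."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                fn, arg = q.get_nowait()
            except queue.Empty:
                return
            r = fn(arg)  # short, independent task body
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# Example: 10,000 fine-grained tasks of variable (here trivial) duration.
print(len(run_many_tasks([(lambda x: x * x, i) for i in range(10000)])))
```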
A scalable distributed skip list for range queries
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600712
S. Alam, Humaira Kamal, Alan S. Wagner
{"title":"A scalable distributed skip list for range queries","authors":"S. Alam, Humaira Kamal, Alan S. Wagner","doi":"10.1145/2600212.2600712","DOIUrl":"https://doi.org/10.1145/2600212.2600712","url":null,"abstract":"In this paper we present a distributed, message passing implementation of a dynamic dictionary structure for range queries. The structure is based on a distributed fine-grain implementation of skip lists that can scale across a cluster of multicore machines. Our implementation makes use of the unique features of Fine-Grain MPI and introduces novel algorithms and techniques to achieve scalable performance on a cluster of multicore machines. Unlike concurrent data structures the distributed skip list operations are deterministic and atomic. Range-queries are implemented in a way that parallelizes the operation and takes advantage of the recursive properties of the skip list structure. We report on the performance of the skip list for range-queries, on a medium sized cluster with two hundred cores.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129209453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
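The range query the paper parallelizes rests on the skip list's ability to descend to the first key in the range in O(log n) and then scan level 0 in order. Below is a minimal single-process sketch of that search-plus-scan; the paper's contribution, distributing the structure over Fine-Grain MPI processes, is far more involved and is not shown.

```python
# A minimal single-process skip list with a range query, sketching the data
# structure the paper distributes across MPI processes.
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)

class SkipList:
    MAX_LEVEL = 16

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)
        self.level = 0

    def insert(self, key):
        update = [self.head] * (self.MAX_LEVEL + 1)
        x = self.head
        for i in range(self.level, -1, -1):
            while x.forward[i] and x.forward[i].key < key:
                x = x.forward[i]
            update[i] = x
        lvl = 0
        while random.random() < 0.5 and lvl < self.MAX_LEVEL:
            lvl += 1
        self.level = max(self.level, lvl)
        node = Node(key, lvl)
        for i in range(lvl + 1):
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def range_query(self, lo, hi):
        """Descend to the first key >= lo, then walk level 0 up to hi."""
        x = self.head
        for i in range(self.level, -1, -1):
            while x.forward[i] and x.forward[i].key < lo:
                x = x.forward[i]
        x = x.forward[0]
        out = []
        while x and x.key <= hi:
            out.append(x.key)
            x = x.forward[0]
        return out

sl = SkipList()
for k in random.sample(range(1000), 200):
    sl.insert(k)
print(sl.range_query(100, 120))
```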
Glasswing: accelerating MapReduce on multi-core and many-core clusters
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600706
I. El-Helw, Rutger F. H. Hofman, H. Bal
{"title":"Glasswing: accelerating mapreduce on multi-core and many-core clusters","authors":"I. El-Helw, Rutger F. H. Hofman, H. Bal","doi":"10.1145/2600212.2600706","DOIUrl":"https://doi.org/10.1145/2600212.2600706","url":null,"abstract":"The impact and significance of parallel computing techniques is continuously increasing given the current trend of incorporating more cores in new processor designs. However, many Big Data systems fail to exploit the abundant computational power of multi-core CPUs and GPUs to their full potential. We present Glasswing, a scalable MapReduce framework that employs a configurable mixture of coarse- and fine-grained parallelism to achieve high performance on multi-core CPUs and GPUs. We experimentally evaluated the performance of five MapReduce applications and show that Glasswing outperforms Hadoop on a 64-node multi-core CPU cluster by a factor between 1.8 and 4, and by a factor from 20 to 30 on a 16-node GPU cluster.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126262380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
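The coarse/fine mix Glasswing configures can be illustrated with a toy word count: independent input chunks are processed in parallel across workers (coarse-grained), and each chunk's records are mapped inside the worker (fine-grained). A pure-Python sketch with invented chunk sizes; Glasswing itself runs native CPU and GPU kernels.

```python
# Toy word count mixing coarse-grained parallelism (chunks across processes)
# with fine-grained per-record work inside each chunk.
from collections import Counter
from multiprocessing import Pool

def map_chunk(chunk):
    # Fine-grained work: one map step per record in the chunk.
    c = Counter()
    for line in chunk:
        c.update(line.split())
    return c

def mapreduce(lines, n_workers=4, chunk_size=2):
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with Pool(n_workers) as pool:          # coarse-grained: chunks in parallel
        partials = pool.map(map_chunk, chunks)
    total = Counter()
    for p in partials:                     # reduce: merge partial counts
        total.update(p)
    return total

if __name__ == "__main__":
    data = ["a b a", "b c", "a c c", "b b a"]
    print(mapreduce(data))
```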
ConCORD: easily exploiting memory content redundancy through the content-aware service command
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600214
Lei Xia, Kyle C. Hale, P. Dinda
{"title":"ConCORD: easily exploiting memory content redundancy through the content-aware service command","authors":"Lei Xia, Kyle C. Hale, P. Dinda","doi":"10.1145/2600212.2600214","DOIUrl":"https://doi.org/10.1145/2600212.2600214","url":null,"abstract":"We argue that memory content-tracking across the nodes of a parallel machine should be factored into a distinct platform service on top of which application services can be built. ConCORD is a proof-of-concept system that we have developed and evaluated to test this claim. Our core insight is that many application services can be described as a query over memory content. This insight leads to a core concept in ConCORD, the content-aware service command architecture, in which an application service is implemented as a parametrization of a single general query that ConCORD knows how to execute well. ConCORD dynamically adapts the execution of the query to the amount of redundancy available and other factors. We show that a complex application service (collective checkpointing) can be implemented in only hundreds of lines of code within ConCORD, while performing well.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131865967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
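A query over memory content presupposes content tracking, which is typically done by hashing fixed-size memory blocks so that identical content collapses to one entry. The sketch below shows that idea on invented data with an assumed 4 KB block size; ConCORD's actual service runs distributed across the machine's nodes.

```python
# Sketch of content tracking: hash fixed-size memory blocks so identical
# content across nodes collapses to one entry, then query the redundancy.
import hashlib
from collections import defaultdict

BLOCK = 4096  # bytes per tracked memory block (assumed)

def content_map(memories):
    """memories: {node_id: bytes}. Returns hash -> set of (node, offset)."""
    index = defaultdict(set)
    for node, mem in memories.items():
        for off in range(0, len(mem), BLOCK):
            digest = hashlib.sha1(mem[off:off + BLOCK]).hexdigest()
            index[digest].add((node, off))
    return index

def redundancy(index):
    """Fraction of block instances whose content also exists elsewhere."""
    total = sum(len(locs) for locs in index.values())
    unique = len(index)
    return 1 - unique / total

mems = {0: b"A" * 8192 + b"B" * 4096, 1: b"A" * 4096 + b"C" * 8192}
print(f"redundancy: {redundancy(content_map(mems)):.2f}")
```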
Next generation job management systems for extreme-scale ensemble computing
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600703
Ke Wang, Xiaobing Zhou, Hao Chen, M. Lang, I. Raicu
{"title":"Next generation job management systems for extreme-scale ensemble computing","authors":"Ke Wang, Xiaobing Zhou, Hao Chen, M. Lang, I. Raicu","doi":"10.1145/2600212.2600703","DOIUrl":"https://doi.org/10.1145/2600212.2600703","url":null,"abstract":"With the exponential growth of supercomputers in parallelism, applications are growing more diverse, including traditional large-scale HPC MPI jobs, and ensemble workloads such as finer-grained many-task computing (MTC) applications. Delivering high throughput and low latency for both workloads requires developing a distributed job management system that is magnitudes more scalable than today's centralized ones. In this paper, we present a distributed job launch prototype, SLURM++, which is comprised of multiple controllers with each one managing a partition of SLURM daemons, while ZHT (a distributed key-value store) is used to store the job and resource metadata. We compared SLURM++ with SLURM using micro-benchmarks of different job sizes up to 500 nodes, with excellent results showing 10X higher throughput. We also studied the potential of distributed scheduling through simulations up to millions of nodes.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123524569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 62
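The architecture, several controllers each owning a partition of daemons and coordinating through a key-value store, can be modeled in miniature as below. A plain dict stands in for ZHT, and the partition names, sizes, and borrowing policy are invented for illustration.

```python
# Toy model of the SLURM++ architecture: controllers own partitions of nodes
# and share resource state through a key-value store (a dict standing in for
# ZHT). A controller first tries its own partition, then borrows elsewhere.
kv = {}  # ZHT stand-in: partition_id -> list of free node names

def init_partitions(n_ctrl, nodes_per_part):
    for p in range(n_ctrl):
        kv[p] = [f"p{p}-n{i}" for i in range(nodes_per_part)]

def launch(controller, n_nodes):
    """Allocate n_nodes for a job, preferring the controller's own partition."""
    alloc = []
    for p in [controller] + [p for p in kv if p != controller]:
        while kv[p] and len(alloc) < n_nodes:
            alloc.append(kv[p].pop())
        if len(alloc) == n_nodes:
            return alloc
    # Not enough free nodes anywhere: roll back (single-digit partition ids).
    for node in alloc:
        kv[int(node[1])].append(node)
    return None

init_partitions(n_ctrl=4, nodes_per_part=8)
print(launch(controller=0, n_nodes=10))  # spills over into other partitions
```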
Scalable matrix inversion using MapReduce
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600220
Jingen Xiang, Huangdong Meng, Ashraf Aboulnaga
{"title":"Scalable matrix inversion using MapReduce","authors":"Jingen Xiang, Huangdong Meng, Ashraf Aboulnaga","doi":"10.1145/2600212.2600220","DOIUrl":"https://doi.org/10.1145/2600212.2600220","url":null,"abstract":"Matrix operations are a fundamental building block of many computational tasks in fields as diverse as scientific computing, machine learning, and data mining. Matrix inversion is an important matrix operation, but it is difficult to implement in today's popular parallel dataflow programming systems, such as MapReduce. The reason is that each element in the inverse of a matrix depends on multiple elements in the input matrix, so the computation is not easily partitionable. In this paper, we present a scalable and efficient technique for matrix inversion in MapReduce. Our technique relies on computing the LU decomposition of the input matrix and using that decomposition to compute the required matrix inverse. We present a technique for computing the LU decomposition and the matrix inverse using a pipeline of MapReduce jobs. We also present optimizations of this technique in the context of Hadoop. To the best of our knowledge, our technique is the first matrix inversion technique using MapReduce. We show experimentally that our technique has good scalability, enabling us to invert a 10^5 x 10^5 matrix in 5 hours on Amazon EC2. We also show that our technique outperforms ScaLAPACK, a state-of-the-art linear algebra package that uses MPI.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127615237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
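The mathematical core of the technique, computing A^{-1} from an LU factorization by solving A X = I, is shown below on a single node using SciPy (assumed available); the paper's contribution is partitioning exactly this computation across a pipeline of MapReduce jobs, which the sketch omits.

```python
# Invert a matrix via LU factorization: factor A = P L U once, then solve
# A X = I for X = A^{-1}. Single-node sketch of the math behind the paper's
# MapReduce pipeline.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))

lu, piv = lu_factor(A)                    # A = P L U
A_inv = lu_solve((lu, piv), np.eye(500))  # solve A X = I, column block by block

print(np.allclose(A @ A_inv, np.eye(500), atol=1e-8))
```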
FT-ScaLAPACK: correcting soft errors on-line for ScaLAPACK Cholesky, QR, and LU factorization routines
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600232
Panruo Wu, Zizhong Chen
{"title":"FT-ScaLAPACK: correcting soft errors on-line for ScaLAPACK cholesky, QR, and LU factorization routines","authors":"Panruo Wu, Zizhong Chen","doi":"10.1145/2600212.2600232","DOIUrl":"https://doi.org/10.1145/2600212.2600232","url":null,"abstract":"It is well known that soft errors in linear algebra operations can be detected off-line at the end of the computation using algorithm-based fault tolerance (ABFT). However, traditional ABFT usually cannot correct errors in Cholesky, QR, and LU factorizations because any error in one matrix element will be propagated to many other matrix elements and hence cause too many errors to correct. Although, recently, tremendous progresses have been made to correct errors in LU and QR factorizations, these new techniques correct errors off-line at the end of the computation after errors propagated and accumulated, which significantly complicates the error correction process and introduces at least quadratically increasing overhead as the number of errors increases. In this paper, we present the design and implementation of FT-ScaLAPACK, a fault tolerant version ScaLAPACK that is able to detect, locate, and correct errors in Cholesky, QR, and LU factorizations on-line in the middle of the computation in a timely manner before the errors propagate and accumulate. FT-ScaLAPACK has been validated with thousands of cores on Stampede at the Texas Advanced Computing Center. Experimental results demonstrate that FT-ScaLAPACK is able to achieve comparable performance and scalability with the original ScaLAPACK.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124418635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 51
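Classic ABFT, which FT-ScaLAPACK extends to on-line correction, encodes the matrix with checksum rows and columns so that a single corrupted element shows up as a matched row/column checksum mismatch. Below is a minimal static-matrix sketch (NumPy assumed); the paper's real contribution is maintaining such checksums during Cholesky, QR, and LU so errors are caught before they propagate.

```python
# ABFT sketch: append row/column checksums, inject a soft error, then locate
# and correct it from the checksum mismatches. Detection on a static matrix
# only; FT-ScaLAPACK maintains checksums throughout the factorization.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))

# Encode: a column of row sums and a row of column sums.
enc = np.zeros((7, 7))
enc[:6, :6] = A
enc[:6, 6] = A.sum(axis=1)
enc[6, :6] = A.sum(axis=0)

# Inject a soft error (a silently corrupted value).
enc[2, 4] += 5.0

# Detect and locate: the row and column whose checksums no longer match.
row_err = enc[:6, :6].sum(axis=1) - enc[:6, 6]
col_err = enc[:6, :6].sum(axis=0) - enc[6, :6]
i, j = np.argmax(np.abs(row_err)), np.argmax(np.abs(col_err))
print(f"error at ({i}, {j}), magnitude {row_err[i]:.1f}")  # -> (2, 4), 5.0

# Correct: subtract the discrepancy.
enc[i, j] -= row_err[i]
```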
Seraph: an efficient, low-cost system for concurrent graph processing
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600222
Jilong Xue, Zhi Yang, Zhi Qu, Shian Hou, Yafei Dai
{"title":"Seraph: an efficient, low-cost system for concurrent graph processing","authors":"Jilong Xue, Zhi Yang, Zhi Qu, Shian Hou, Yafei Dai","doi":"10.1145/2600212.2600222","DOIUrl":"https://doi.org/10.1145/2600212.2600222","url":null,"abstract":"Graph processing systems have been widely used in enterprises like online social networks to process their daily jobs. With the fast growing of social applications, they have to efficiently handle massive concurrent jobs. However, due to the inherent design for single job, existing systems incur great inefficiency in memory use and fault tolerance. Motivated by this, in this paper we introduce Seraph, a graph processing system that enables efficient job-level parallelism. Seraph is designed based on a decoupled data model, which allows multiple concurrent jobs to share graph structure data in memory. Seraph adopts a copy-on-write semantic to isolate the graph mutation of concurrent jobs, and a lazy snapshot protocol to generate consistent graph snapshots for jobs submitted at different time. Moreover, Seraph adopts an incremental checkpoint/regeneration model which can tremendously reduce the overhead of checkpointing. We have implemented Seraph, and the evaluation results show that Seraph significantly outperforms popular systems (such as Giraph and Spark) in both memory usage and job completion time, when executing concurrent graph jobs.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126611412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 61
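Seraph's decoupled data model lets concurrent jobs share one in-memory graph while isolating their mutations via copy-on-write. The single-process sketch below illustrates only that semantic, with an invented overlay design; Seraph's actual system also covers lazy snapshots and incremental checkpointing.

```python
# Copy-on-write graph views: jobs share one adjacency structure; a job that
# mutates the graph copies only the vertices it touches into a private
# overlay, leaving other jobs' views intact.
class GraphView:
    def __init__(self, shared_adj):
        self.shared = shared_adj   # structure shared by all jobs
        self.private = {}          # this job's copy-on-write overlay

    def neighbors(self, v):
        return self.private.get(v, self.shared.get(v, []))

    def add_edge(self, u, v):
        # Copy u's adjacency list on first write; shared data stays untouched.
        if u not in self.private:
            self.private[u] = list(self.shared.get(u, []))
        self.private[u].append(v)

shared = {0: [1, 2], 1: [2], 2: []}
job_a, job_b = GraphView(shared), GraphView(shared)
job_a.add_edge(0, 3)
print(job_a.neighbors(0))  # [1, 2, 3] -- job A sees its own mutation
print(job_b.neighbors(0))  # [1, 2]    -- job B's view is isolated
```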
TOP-PIM: throughput-oriented programmable processing in memory
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600213
D. Zhang, N. Jayasena, Alexander Lyashevsky, J. Greathouse, Lifan Xu, Mike Ignatowski
{"title":"TOP-PIM: throughput-oriented programmable processing in memory","authors":"D. Zhang, N. Jayasena, Alexander Lyashevsky, J. Greathouse, Lifan Xu, Mike Ignatowski","doi":"10.1145/2600212.2600213","DOIUrl":"https://doi.org/10.1145/2600212.2600213","url":null,"abstract":"As computation becomes increasingly limited by data movement and energy consumption, exploiting locality throughout the memory hierarchy becomes critical to continued performance scaling. Moving computation closer to memory presents an opportunity to reduce both energy and data movement overheads. We explore the use of 3D die stacking to move memory-intensive computations closer to memory. This approach to processing in memory addresses some drawbacks of prior research on in-memory computing and is commercially viable in the foreseeable future.\u0000 Because 3D stacking provides increased bandwidth, we study throughput-oriented computing using programmable GPU compute units across a broad range of benchmarks, including graph and HPC applications. We also introduce a methodology for rapid design space exploration by analytically predicting performance and energy of in-memory processors based on metrics obtained from execution on today's GPU hardware. Our results show that, on average, viable PIM configurations show moderate performance losses (27%) in return for significant energy efficiency improvements (76% reduction in EDP) relative to a representative mainstream GPU at 22nm technology. At 16nm technology, on average, viable PIM configurations are performance competitive with a representative mainstream GPU (7% speedup) and provide even greater energy efficiency improvements (85% reduction in EDP).","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116653581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 322
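The paper's rapid design-space exploration rests on analytically predicting time and energy rather than simulating. A deliberately crude model in that spirit: kernel time as the maximum of compute time and memory time, with the energy-delay product (EDP) compared between a host GPU and a hypothetical PIM configuration. Every number below is an invented placeholder, not a value from the paper.

```python
# Crude analytical model: execution time as max(compute, memory) and EDP
# (energy * delay) for a host GPU vs. a hypothetical PIM configuration.
def exec_time(flops, bytes_moved, peak_flops, peak_bw):
    return max(flops / peak_flops, bytes_moved / peak_bw)

def edp(flops, bytes_moved, peak_flops, peak_bw, power_w):
    t = exec_time(flops, bytes_moved, peak_flops, peak_bw)
    return (power_w * t) * t  # energy * delay

kernel = dict(flops=1e12, bytes_moved=4e11)  # a memory-bound kernel (invented)

host = edp(**kernel, peak_flops=4e12, peak_bw=3e11, power_w=200)
pim = edp(**kernel, peak_flops=1e12, peak_bw=1e12, power_w=60)  # slower compute, higher BW
print(f"PIM EDP reduction: {(1 - pim / host):.0%}")
```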
Improving energy efficiency of embedded DRAM caches for high-end computing systems
IEEE International Symposium on High-Performance Parallel Distributed Computing | Pub Date: 2014-06-23 | DOI: 10.1145/2600212.2600216
Sparsh Mittal, J. Vetter, Dong Li
{"title":"Improving energy efficiency of embedded DRAM caches for high-end computing systems","authors":"Sparsh Mittal, J. Vetter, Dong Li","doi":"10.1145/2600212.2600216","DOIUrl":"https://doi.org/10.1145/2600212.2600216","url":null,"abstract":"The number of cores in a single chip in the nodes of high-end computing systems is on rise, due, in part, to a number of constraints, such as power consumption. With this, the size of the last level cache (LLC) has also increased significantly. Since LLCs built with SRAM consume high leakage power, power consumption of LLCs is becoming a significant fraction of processor power consumption. To address this issue, researchers have used embedded DRAM (eDRAM) LLCs which consume low leakage power. However, eDRAM caches consume a significant amount of energy in the form of refresh energy. In this paper, we propose ESTEEM, an energy saving technique for embedded DRAM caches. ESTEEM uses dynamic cache reconfiguration to turn off a portion of the cache to save both leakage and refresh energy. It logically divides the cache sets into multiple modules and turns off possibly different number of ways in each module. Microarchitectural simulations confirm that ESTEEM is effective in improving performance and energy efficiency and provides better results compared to a recently-proposed eDRAM cache energy saving technique, namely Refrint. For single and dual-core simulations, the average energy saving in memory subsystem (LLC+main memory) with ESTEEM is 25.8% and 32.6% respectively, and the average weighted speedup is 1.09x and 1.22x respectively. Additional experiments confirm that ESTEEM works well for a wide-range of system and algorithm parameters.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"280 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123089617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
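ESTEEM's lever is the active-way count per module: fewer active ways mean less leakage and refresh energy but more misses. The toy model below shows only that trade-off; the per-way energies and miss penalty are invented placeholders, not parameters from the paper.

```python
# Toy energy model for way-level cache reconfiguration: turning off eDRAM
# ways saves leakage + refresh energy at the cost of extra miss energy.
LEAK_PER_WAY = 1.0      # leakage energy per active way per interval (a.u., invented)
REFRESH_PER_WAY = 0.6   # refresh energy per active way per interval (a.u., invented)
MISS_ENERGY = 0.05      # main-memory energy per extra miss (a.u., invented)

def interval_energy(active_ways_per_module, extra_misses):
    """Energy for one interval given each module's active-way count."""
    active = sum(active_ways_per_module)
    return active * (LEAK_PER_WAY + REFRESH_PER_WAY) + extra_misses * MISS_ENERGY

full = interval_energy([8, 8, 8, 8], extra_misses=0)
down = interval_energy([8, 4, 2, 4], extra_misses=100)  # modules downsized unevenly
print(f"energy saving: {(1 - down / full):.0%}")
```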