2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC): Latest Publications

GROPHECY: GPU performance projection from CPU code skeletons
Jiayuan Meng, V. Morozov, Kalyan Kumaran, V. Vishwanath, T. Uram
{"title":"GROPHECY: GPU performance projection from CPU code skeletons","authors":"Jiayuan Meng, V. Morozov, Kalyan Kumaran, V. Vishwanath, T. Uram","doi":"10.1145/2063384.2063402","DOIUrl":"https://doi.org/10.1145/2063384.2063402","url":null,"abstract":"We propose GROPHECY, a GPU performance projection framework that can estimate the performance benefit of GPU acceleration without actual GPU programming or hardware. Users need only to skeletonize pieces of CPU code that are targets for GPU acceleration. Code skeletons are automatically transformed in various ways to mimic tuned GPU codes with characteristics resembling real implementations. The synthesized characteristics are used by an existing analytical model to project GPU performance. The cost and benefit of GPU development can then be estimated according to the transformed code skeleton that yields the best projected performance. With GROPHECY, users can leap toward GPU acceleration only when the cost-benefit makes sense. The framework is validated using kernel benchmarks and data-parallel codes in legacy scientific applications. The measured performance of manually tuned codes deviates from the projected performance by 17% in geometric mean.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123968579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 102
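The central idea, skeletonizing CPU code so that only the structure relevant to GPU behavior survives, can be pictured with a toy example. The sketch below is a generic illustration under our own assumptions; GROPHECY defines its own skeleton constructs, which this does not reproduce.

```c
/* Illustration only: a CPU kernel and a hand-made "skeleton" that keeps the
 * loop structure and memory-access pattern (what a GPU projection needs)
 * while dropping the real arithmetic. GROPHECY's actual skeleton language
 * is defined in the paper and is not reproduced here. */

/* Original CPU kernel: a 1D 3-point stencil. */
void kernel(const float *in, float *out, int n) {
    for (int i = 1; i < n - 1; i++)
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}

/* Skeleton: same trip count, same 3 loads and 1 store per iteration,
 * placeholder arithmetic only. */
void kernel_skeleton(const float *in, float *out, int n) {
    for (int i = 1; i < n - 1; i++)
        out[i] = in[i - 1] + in[i] + in[i + 1];
}
```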
Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers
N. Maruyama, Tatsuo Nomura, Kento Sato, S. Matsuoka
{"title":"Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers","authors":"N. Maruyama, Tatsuo Nomura, Kento Sato, S. Matsuoka","doi":"10.1145/2063384.2063398","DOIUrl":"https://doi.org/10.1145/2063384.2063398","url":null,"abstract":"This paper proposes a compiler-based programming framework that automatically translates user-written structured grid code into scalable parallel implementation code for GPU-equipped clusters. To enable such automatic translations, we design a small set of declarative constructs that allow the user to express stencil computations in a portable and implicitly parallel manner. Our framework translates the user-written code into actual implementation code in CUDA for GPU acceleration and MPI for node-level parallelization with automatic optimizations such as computation and communication overlapping. We demonstrate the feasibility of such automatic translations by implementing several structured grid applications in our framework. Experimental results on the TSUBAME2.0 GPU-based supercomputer show that the performance is comparable as hand-written code and good strong and weak scalability up to 256 GPUs.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122520924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 173
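For readers unfamiliar with structured grid code, the sequential loop nest below is representative of what such a framework accepts as input and turns into CUDA kernels plus MPI halo exchanges. It is a generic 7-point stencil written for illustration, not the Physis declarative syntax.

```c
/* A plain C 7-point 3D stencil sweep: the kind of user-level computation a
 * framework like Physis parallelizes automatically. Generic illustration,
 * not the Physis API. */
#define IDX(x, y, z, nx, ny) \
    ((size_t)(z) * (ny) * (nx) + (size_t)(y) * (nx) + (size_t)(x))

void diffusion_step(const float *in, float *out,
                    int nx, int ny, int nz, float c0, float c1) {
    for (int z = 1; z < nz - 1; z++)
        for (int y = 1; y < ny - 1; y++)
            for (int x = 1; x < nx - 1; x++)
                out[IDX(x, y, z, nx, ny)] =
                    c0 * in[IDX(x, y, z, nx, ny)] +
                    c1 * (in[IDX(x - 1, y, z, nx, ny)] +
                          in[IDX(x + 1, y, z, nx, ny)] +
                          in[IDX(x, y - 1, z, nx, ny)] +
                          in[IDX(x, y + 1, z, nx, ny)] +
                          in[IDX(x, y, z - 1, nx, ny)] +
                          in[IDX(x, y, z + 1, nx, ny)]);
}
```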
Topology-aware data movement and staging for I/O acceleration on Blue Gene/P supercomputing systems
V. Vishwanath, M. Hereld, V. Morozov, M. Papka
{"title":"Topology-aware data movement and staging for I/O acceleration on Blue Gene/P supercomputing systems","authors":"V. Vishwanath, M. Hereld, V. Morozov, M. Papka","doi":"10.1145/2063384.2063409","DOIUrl":"https://doi.org/10.1145/2063384.2063409","url":null,"abstract":"There is growing concern that I/O systems will be hard pressed to satisfy the requirements of future leadership-class machines. Even current machines are found to be I/O bound for some applications. In this paper, we identify existing performance bottlenecks in data movement for I/O on the IBM Blue Gene/P (BG/P) supercomputer currently deployed at several leadership computing facilities. We improve the I/O performance by exploiting the network topology of BG/P for collective I/O, leveraging data semantics of applications and incorporating asynchronous data staging. We demonstrate the efficacy of our approaches for synthetic benchmark experiments and for application-level benchmarks at scale on leadership computing systems.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132134012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 90
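The flavor of topology-aware aggregation can be sketched with plain MPI: group ranks by a topology key, funnel each group's data to one aggregator, and let only the aggregators touch the file system. The `bridge_id` key, equal-sized contiguous blocks, and the overall shape are assumptions for illustration; the paper's BG/P staging pipeline is considerably more elaborate and asynchronous.

```c
/* Minimal sketch of topology-aware collective I/O, assuming each rank knows
 * a (hypothetical) bridge_id identifying its I/O bridge group and that the
 * group's blocks are equal-sized and contiguous in the file. */
#include <mpi.h>
#include <stdlib.h>

void aggregated_write(MPI_File fh, const char *buf, int nbytes,
                      int bridge_id, MPI_Offset my_offset) {
    MPI_Comm group;
    int grank, gsize;
    /* Group ranks that share an I/O bridge. */
    MPI_Comm_split(MPI_COMM_WORLD, bridge_id, 0, &group);
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);

    /* Funnel everyone's payload to the group aggregator (rank 0). */
    char *agg = NULL;
    if (grank == 0) agg = malloc((size_t)gsize * nbytes);
    MPI_Gather(buf, nbytes, MPI_CHAR, agg, nbytes, MPI_CHAR, 0, group);

    /* Aggregator writes the whole group's data at its lowest file offset. */
    MPI_Offset base;
    MPI_Reduce(&my_offset, &base, 1, MPI_OFFSET, MPI_MIN, 0, group);
    if (grank == 0) {
        MPI_File_write_at(fh, base, agg, gsize * nbytes, MPI_CHAR,
                          MPI_STATUS_IGNORE);
        free(agg);
    }
    MPI_Comm_free(&group);
}
```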
Sustained systems performance monitoring at the U.S. Department of Defense High Performance Computing Modernization Program
P. Bennett
{"title":"Sustained systems performance monitoring at the U.S. Department of Defense High Performance Computing Modernization Program","authors":"P. Bennett","doi":"10.1145/2063348.2063352","DOIUrl":"https://doi.org/10.1145/2063348.2063352","url":null,"abstract":"The U.S. Department of Defense High Performance Computing Modernization Program (HPCMP) has implemented sustained systems performance testing on high performance computing systems in use at DoD Supercomputing Resource Centers. The intent is to monitor performance improvements by updates to the operating system, compiler suites, and numerical and communications libraries, and to monitor penalties arising from security patches. In practice, each system's workload is simulated by appropriate choices of user application codes representative of the HPCMP computational technical areas. Past successes include surfacing an imminent failure of an OST in a Cray XT3, incomplete configuration of a scheduler update on an SGI Altix 4700, performance issues associated with a communications library update for a Linux Networx Advanced Technology Cluster, and intermittent resetting of Intel Nehalem cores to standard mode from turbo mode. This history demonstrates that SSP testing is critical to deliver the highest quality of service to the HPCMP users.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128675464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
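The core mechanic, rerunning representative codes after every system change and flagging slowdowns, reduces to a simple comparison against a baseline. The sketch below is generic; the benchmark names, the 5%-style tolerance, and the reporting format are illustrative assumptions, not the HPCMP procedure.

```c
/* Generic regression check of the kind SSP-style monitoring implies:
 * compare each application code's current runtime against its baseline and
 * flag slowdowns beyond a tolerance. Threshold and output are illustrative. */
#include <stdio.h>

int check_regressions(const char **names, const double *baseline,
                      const double *current, int n, double tol) {
    int bad = 0;
    for (int i = 0; i < n; i++) {
        double ratio = current[i] / baseline[i];   /* > 1 means slower */
        if (ratio > 1.0 + tol) {
            printf("REGRESSION %s: %.1f%% slower than baseline\n",
                   names[i], (ratio - 1.0) * 100.0);
            bad++;
        }
    }
    return bad;  /* number of codes that regressed */
}
```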
Parallel random numbers: As easy as 1, 2, 3
J. Salmon, Mark A. Moraes, R. Dror, D. Shaw
{"title":"Parallel random numbers: As easy as 1, 2, 3","authors":"J. Salmon, Mark A. Moraes, R. Dror, D. Shaw","doi":"10.1145/2063384.2063405","DOIUrl":"https://doi.org/10.1145/2063384.2063405","url":null,"abstract":"Most pseudorandom number generators (PRNGs) scale poorly to massively parallel high-performance computation because they are designed as sequentially dependent state transformations. We demonstrate that independent, keyed transformations of counters produce a large alternative class of PRNGs with excellent statistical properties (long period, no discernable structure or correlation). These counter-based PRNGs are ideally suited to modern multi- core CPUs, GPUs, clusters, and special-purpose hardware because they vectorize and parallelize well, and require little or no memory for state. We introduce several counter-based PRNGs: some based on cryptographic standards (AES, Threefish) and some completely new (Philox). All our PRNGs pass rigorous statistical tests (including TestUOl's BigCrush) and produce at least 264 unique parallel streams of random numbers, each with period 2128 or more. In addition to essentially unlimited parallel scalability, our PRNGs offer excellent single-chip performance: Philox is faster than the CURAND library on a single NVIDIA GPU.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128799226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 242
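The structure of a counter-based PRNG is easy to see in code: there is no evolving state, only a keyed bijective mixing of a counter. The sketch below uses the well-known SplitMix64 finalizer as the mixer purely for illustration; it is not Philox or Threefry, and this keyed toy has not been validated against BigCrush.

```c
/* Toy counter-based generator in the spirit of the paper: output = mix(counter, key).
 * The mixer is the SplitMix64 finalizer, chosen only because it is simple
 * and well known; the paper's generators (Philox, Threefry, ARS) differ. */
#include <stdint.h>

uint64_t cbrng(uint64_t counter, uint64_t key) {
    uint64_t z = counter + key * 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}
```

Parallel use follows directly: thread t draws its i-th sample as `cbrng(i, t)`, so streams need no shared state, no communication, and no per-stream memory, which is exactly why this class maps well to GPUs and vector units.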
GreenSlot: Scheduling energy consumption in green datacenters
Íñigo Goiri, Ryan Beauchea, Kien Le, Thu D. Nguyen, Md. E. Haque, Jordi Guitart, J. Torres, R. Bianchini
{"title":"GreenSlot: Scheduling energy consumption in green datacenters","authors":"Íñigo Goiri, Ryan Beauchea, Kien Le, Thu D. Nguyen, Md. E. Haque, Jordi Guitart, J. Torres, R. Bianchini","doi":"10.1145/2063384.2063411","DOIUrl":"https://doi.org/10.1145/2063384.2063411","url":null,"abstract":"In this paper, we propose GreenSlot, a parallel batch job scheduler for a datacenter powered by a photovoltaic solar array and the electrical grid (as a backup). GreenSlot predicts the amount of solar energy that will be available in the near future, and schedules the workload to maximize the green energy consumption while meet- ing the jobs' deadlines. If grid energy must be used to avoid dead- line violations, the scheduler selects times when it is cheap. Our results for production scientific workloads demonstrate that Green-Slot can increase green energy consumption by up to 117% and decrease energy cost by up to 39%, compared to a conventional scheduler. Based on these positive results, we conclude that green datacenters and green-energy-aware scheduling can have a significant role in building a more sustainable IT ecosystem.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123647407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 319
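The essence of green-energy-aware scheduling is choosing when to run a job so that predicted solar supply covers as much of its demand as possible. The sketch below is a minimal greedy version under assumed inputs (a per-slot solar forecast, a fixed job power draw); GreenSlot's actual scheduler also weighs time-varying grid prices and contention among jobs.

```c
/* Minimal sketch: pick the start slot for a job of `len` consecutive slots
 * drawing `power` watts, due by slot `deadline`, that minimizes the energy
 * drawn from the grid given a solar forecast solar[0..horizon-1] (watts). */
int pick_start_slot(const double *solar, int horizon,
                    int len, double power, int deadline) {
    int best = -1;
    double best_grid = -1.0;
    for (int s = 0; s + len <= horizon && s + len <= deadline; s++) {
        double grid = 0.0;
        for (int t = s; t < s + len; t++) {
            double deficit = power - solar[t];
            if (deficit > 0) grid += deficit;  /* shortfall covered by grid */
        }
        if (best < 0 || grid < best_grid) { best = s; best_grid = grid; }
    }
    return best;  /* -1 if the job cannot meet its deadline */
}
```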
Reducing electricity cost through virtual machine placement in high performance computing clouds
Kien Le, R. Bianchini, Jingru Zhang, Y. Jaluria, J. Meng, Thu D. Nguyen
{"title":"Reducing electricity cost through virtual machine placement in high performance computing clouds","authors":"Kien Le, R. Bianchini, Jingru Zhang, Y. Jaluria, J. Meng, Thu D. Nguyen","doi":"10.1145/2063384.2063413","DOIUrl":"https://doi.org/10.1145/2063384.2063413","url":null,"abstract":"In this paper, we first study the impact of load placement policies on cooling and maximum data center temperatures in cloud service providers that operate multiple geographically distributed data centers. Based on this study, we then propose dynamic load distribution policies that consider all electricity-related costs as well as transient cooling effects. Our evaluation studies the ability of different cooling strategies to handle load spikes, compares the behaviors of our dynamic cost-aware policies to cost-unaware and static policies, and explores the effects of many parameter settings. Among other interesting results, we demonstrate that (1) our policies can provide large cost savings, (2) load migration enables savings in many scenarios, and (3) all electricity-related costs must be considered at the same time for higher and consistent cost savings.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116888755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 221
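Where GreenSlot decides *when* to run work, this paper decides *where*. A toy spatial placement rule conveys the idea: send the next VM to the datacenter with the lowest marginal electricity cost, subject to a cooling constraint. The struct fields and the linear cost and temperature models below are illustrative assumptions, not the paper's model.

```c
/* Toy cost-aware VM placement across geographically distributed datacenters:
 * cheapest marginal electricity wins, unless adding the VM would push the
 * datacenter past its temperature cap. All models here are assumptions. */
typedef struct {
    double price_kwh;     /* current electricity price */
    double watts_per_vm;  /* marginal IT power of one more VM */
    double temp_c;        /* current maximum temperature */
    double temp_cap_c;    /* threshold not to exceed */
    double temp_per_vm;   /* estimated temperature rise per added VM */
} Datacenter;

int place_vm(const Datacenter *dc, int n) {
    int best = -1;
    double best_cost = 0.0;
    for (int i = 0; i < n; i++) {
        if (dc[i].temp_c + dc[i].temp_per_vm > dc[i].temp_cap_c)
            continue;  /* would violate the cooling constraint */
        double cost = dc[i].price_kwh * dc[i].watts_per_vm;
        if (best < 0 || cost < best_cost) { best = i; best_cost = cost; }
    }
    return best;  /* -1 if no datacenter can take the VM */
}
```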
Challenges of HPC monitoring
W. Allcock, Evan Felix, Mike Lowe, Randal Rheinheimer, Joshi Fullop
{"title":"Challenges of HPC monitoring","authors":"W. Allcock, Evan Felix, Mike Lowe, Randal Rheinheimer, Joshi Fullop","doi":"10.1145/2063348.2063378","DOIUrl":"https://doi.org/10.1145/2063348.2063378","url":null,"abstract":"At a recent meeting of monitoring experts from nine large supercomputing centers, there was a broad divergence of opinion on what monitoring in our environment actually is, what ought to be monitored, what technology should be used, etc. Broad consensus can be summarized in a couple of key points: Data management is increasingly a problem. As a result, historical information is rarely kept, or, if kept, rarely accessed. A proliferation of e-mails is ignored, and slow database interfaces are not used. At least some portion of the HPC Monitoring solution at each site can be summarized as \"Scripts written by smart personnel over the years\". An example is the \"Is this node ready to run?\" script developed essentially in isolation at each site. Given this environment of supercomputing centers trying to solve a seemingly simple, common problem with largely divergent technologies, philosophies, and problem definitions, we feel that a public conversation will be of value to the supercomputing community as a whole. This report outlines the general positions with regard to monitoring of five experienced supercomputing personnel, and is intended to be of benefit to the general community by revealing a variety of opinions on the following topics: What do you understand the monitoring of supercomputing to be? What are the most difficult problems in monitoring today, and which of the problems of five years ago have been put to rest? What areas of supercomputer monitoring are you most focused on at your site? Are there any particularly promising technologies you're using? If you could have the vendor community do one thing in this area, what would it be?","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113997120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Scaling lattice QCD beyond 100 GPUs
R. Babich, M. Clark, B. Joó, Guochun Shi, R. Brower, S. Gottlieb
{"title":"Scaling lattice QCD beyond 100 GPUs","authors":"R. Babich, M. Clark, B. Joó, Guochun Shi, R. Brower, S. Gottlieb","doi":"10.1145/2063384.2063478","DOIUrl":"https://doi.org/10.1145/2063384.2063478","url":null,"abstract":"Over the past five years, graphics processing units (GPUs) have had a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations in nuclear and particle physics. While GPUs have been applied with great success to the post-Monte Carlo \"analysis\" phase which accounts for a substantial fraction of the workload in a typical LQCD calculation, the initial Monte Carlo \"gauge field generation\" phase requires capability-level supercomputing, corresponding to O(100) GPUs or more. Such strong scaling has not been previously achieved. In this contribution, we demonstrate that using a multi-dimensional parallelization strategy and a domain-decomposed preconditioner allows us to scale into this regime. We present results for two popular discretizations of the Dirac operator, Wilson-clover and improved staggered, employing up to 256 GPUs on the Edge cluster at Lawrence Livermore National Laboratory.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123479154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 134
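The "multi-dimensional parallelization" amounts to partitioning the 4D space-time lattice across ranks in all four dimensions instead of only one, so that surface-to-volume ratios, and hence halo traffic, stay manageable at high GPU counts. A minimal sketch of such a decomposition with MPI's Cartesian topology routines follows; the paper's actual solver adds the GPU kernels and the domain-decomposed preconditioner on top of this layout.

```c
/* Minimal sketch: partition a periodic 4D lattice (X, Y, Z, T) across all
 * MPI ranks and find each rank's halo-exchange neighbors per dimension. */
#include <mpi.h>

void setup_4d_decomposition(const int global[4]) {
    int nprocs, dims[4] = {0, 0, 0, 0}, periods[4] = {1, 1, 1, 1};
    MPI_Comm cart;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 4, dims);          /* factor ranks over 4 dims */
    MPI_Cart_create(MPI_COMM_WORLD, 4, dims, periods, 1, &cart);

    for (int d = 0; d < 4; d++) {
        int lo, hi;
        MPI_Cart_shift(cart, d, 1, &lo, &hi);  /* halo partners in dim d */
        int local = global[d] / dims[d];       /* local lattice extent */
        (void)lo; (void)hi; (void)local;       /* consumed by the solver */
    }
}
```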
Parallel breadth-first search on distributed memory systems
A. Buluç, Kamesh Madduri
{"title":"Parallel breadth-first search on distributed memory systems","authors":"A. Buluç, Kamesh Madduri","doi":"10.1145/2063384.2063471","DOIUrl":"https://doi.org/10.1145/2063384.2063471","url":null,"abstract":"Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly-tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD MagnyCours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128860772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 230
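The level-synchronous strategy the paper starts from is the standard frontier-based BFS: expand the whole current level, then swap frontiers. The serial skeleton below (over a CSR graph) shows that structure; the paper's distributed variants partition the frontier across ranks, 1D by vertex or 2D over the sparse adjacency matrix, and replace the inner loop with communication rounds.

```c
/* Serial level-synchronous BFS over a CSR graph: dist[v] receives the BFS
 * level of v, or -1 if unreachable. Skeleton of the paper's parallel codes. */
#include <stdlib.h>

void bfs(const int *row_ptr, const int *col_idx, int n, int src, int *dist) {
    int *frontier = malloc(n * sizeof(int));
    int *next = malloc(n * sizeof(int));
    for (int v = 0; v < n; v++) dist[v] = -1;
    dist[src] = 0;
    frontier[0] = src;
    int fsize = 1, level = 0;
    while (fsize > 0) {                       /* one iteration per level */
        int nsize = 0;
        for (int i = 0; i < fsize; i++) {
            int u = frontier[i];
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
                int v = col_idx[e];
                if (dist[v] < 0) {            /* first visit: claim v */
                    dist[v] = level + 1;
                    next[nsize++] = v;
                }
            }
        }
        int *tmp = frontier; frontier = next; next = tmp;
        fsize = nsize;
        level++;
    }
    free(frontier);
    free(next);
}
```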