Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis最新文献

筛选
英文 中文
Beyond homogeneous decomposition: scaling long-range forces on Massively Parallel Systems 超越同质分解:大规模并行系统上的缩放远程力
D. Richards, J. Glosli, B. Chan, M. Dorr, E. Draeger, J. Fattebert, W. D. Krauss, T. Spelce, F. Streitz, M. Surh, John A. Gunnels
{"title":"Beyond homogeneous decomposition: scaling long-range forces on Massively Parallel Systems","authors":"D. Richards, J. Glosli, B. Chan, M. Dorr, E. Draeger, J. Fattebert, W. D. Krauss, T. Spelce, F. Streitz, M. Surh, John A. Gunnels","doi":"10.1145/1654059.1654121","DOIUrl":"https://doi.org/10.1145/1654059.1654121","url":null,"abstract":"With supercomputers anticipated to expand from thousands to millions of cores, one of the challenges facing scientists is how to effectively utilize this ever-increasing number. We report here an approach that creates a heterogeneous decomposition by partitioning effort according to the scaling properties of the component algorithms. We demonstrate our strategy by developing a capability to model hot dense plasma. We have performed benchmark calculations ranging from millions to billions of charged particles, including a 2.8 billion particle simulation that achieved 259.9 TFlop/s (26% of peak performance) on the 294,912 cpu JUGENE computer at the Jülich Supercomputing Centre in Germany. With this unprecedented simulation capability we have begun an investigation of plasma fusion physics under conditions where both theory and experiment are lacking-in the strongly-coupled regime as the plasma begins to burn. Our strategy is applicable to other problems involving long-range forces (i.e., biological or astrophysical simulations). We believe that the flexible heterogeneous decomposition approach demonstrated here will allow many problems to scale across current and next-generation machines.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130937996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 32
A microdriver architecture for error correcting codes inside the Linux kernel Linux内核中用于纠错代码的微驱动架构
A. Brinkmann, D. Eschweiler
{"title":"A microdriver architecture for error correcting codes inside the Linux kernel","authors":"A. Brinkmann, D. Eschweiler","doi":"10.1145/1654059.1654095","DOIUrl":"https://doi.org/10.1145/1654059.1654095","url":null,"abstract":"Coding tasks, such as encryption of data or the generation of failure-tolerant codes, belong to the most computationaly expensive tasks inside the Linux kernel. Their integration into the kernel enables the user to transparently access these functionalities, encrypted hard disks can be used in the same way as unencrypted ones. Nevertheless, Linux as a monolithic kernel is not prepared to support these expensive tasks by accessing modern hardware accelerators, like graphics processing units (GPUs), as the corresponding accelerator libraries, like the CUDA-API for NVIDIA GPUs, only offer user-space APIs. Linux is often used in conjunction with parallel file systems in high performance cluster environments and the tremendous storage growth in these environments leads to the requirement of multi-error correcting codes. Parallel file systems, which often run on a storage cluster, are required to store the calculated results without huge waiting times. Whereas the frontend of such a storage cluster can be build with standard PCs, it is in contrast nearly impossible to build a capable RAID backend with end user hardware up to now. This work investigated the potential of graphic cards for such coding applications like RAID in the Linux kernel. For this purpose, a special microdriver concept (Barracuda) has been designed that can be integrated into Linux without changing kernel APIs. For the investigation of the performance of this concept, the Linux RAID 6-system and the applied Reed-Solomon code have been exemplary extended and studied. The resulting measurements outline opportunities and limitations of our microdriver concept. On the one hand, the concept achieves a speed-up of 72 for complex, 8-failure correcting codes, while no additional speed-up can be generated for simpler, 2-error correcting codes. An example application for Barracuda could therefore be replacement of expensive RAID systems in cluster storage environments.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114392709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
FACT: fast communication trace collection for parallel applications through program slicing 事实:通过程序切片实现并行应用程序的快速通信跟踪收集
Jidong Zhai, Tianwei Sheng, Jiangzhou He, Wenguang Chen, Weimin Zheng
{"title":"FACT: fast communication trace collection for parallel applications through program slicing","authors":"Jidong Zhai, Tianwei Sheng, Jiangzhou He, Wenguang Chen, Weimin Zheng","doi":"10.1145/1654059.1654087","DOIUrl":"https://doi.org/10.1145/1654059.1654087","url":null,"abstract":"A proper understanding of communication patterns of parallel applications is important to optimize application performance and design better communication subsystems. Communication patterns can be obtained by analyzing communication traces. However, existing approaches to generate communication traces need to execute the entire parallel applications on full-scale systems that are time-consuming and expensive. In this paper, we propose a novel technique, called Fact, which can perform FAst Communication Trace collection for large-scale parallel applications on small-scale systems. Our idea is to reduce the original program to obtain a program slice through static analysis, and to execute the program slice to acquire the communication traces. The program slice preserves all the variables and statements in the original program relevant to spatial and volume communication attributes. Our idea is based on an observation that most computation and message contents in message-passing parallel applications are independent of these attributes, and therefore can be removed from the programs for the purpose of communication trace collection. We have implemented Fact and evaluated it with NPB programs and Sweep3D. The results show that Fact can preserve the spatial and volume communication attributes of original programs and reduce resource consumptions by two orders of magnitude in most cases. For example, Fact collects the communication traces of the Sweep3D for 512 processes on a 4-node (32 cores) platform in just 6.79 seconds, consuming 1.25 GB memory, while the original program takes 256.63 seconds and consumes 213.83 GB memory on a 32-node (512 cores) platform. Finally, we present an application of Fact.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115820993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 30
Flexible cache error protection using an ECC FIFO 灵活的缓存错误保护使用ECC FIFO
D. Yoon, M. Erez
{"title":"Flexible cache error protection using an ECC FIFO","authors":"D. Yoon, M. Erez","doi":"10.1145/1654059.1654109","DOIUrl":"https://doi.org/10.1145/1654059.1654109","url":null,"abstract":"We present ECC FIFO, a mechanism enabling two-tiered last-level cache error protection using an arbitrarily strong tier-2 code without increasing on-chip storage. Instead of adding redundant ECC information to each cache line, our ECC FIFO mechanism off-loads the extra information to off-chip DRAM. We augment each cache line with a tier-1 code, which provides error detection consuming limited resources. The redundancy required for strong protection is provided by a tier-2 code placed in off-chip memory. Because errors that require tier-2 correction are rare, the overhead of accessing DRAM is unimportant. We show how this method can save 15-25% and 10-17% of on-chip cache area and power respectively while minimally impacting performance, which decreases by 1% on average across a range of scientific and consumer benchmarks.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132150358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 28
Scalable computation of streamlines on very large datasets 在非常大的数据集上进行流线的可扩展计算
D. Pugmire, H. Childs, C. Garth, Sean Ahern, G. Weber
{"title":"Scalable computation of streamlines on very large datasets","authors":"D. Pugmire, H. Childs, C. Garth, Sean Ahern, G. Weber","doi":"10.1145/1654059.1654076","DOIUrl":"https://doi.org/10.1145/1654059.1654076","url":null,"abstract":"Understanding vector fields resulting from large scientific simulations is an important and often difficult task. Streamlines, curves that are tangential to a vector field at each point, are a powerful visualization method in this context. Application of streamline-based visualization to very large vector field data represents a significant challenge due to the non-local and data-dependent nature of streamline computation, and requires careful balancing of computational demands placed on I/O, memory, communication, and processors. In this paper we review two parallelization approaches based on established parallelization paradigms (static decomposition and on-demand loading) and present a novel hybrid algorithm for computing streamlines. Our algorithm is aimed at good scalability and performance across the widely varying computational characteristics of streamline-based problems. We perform performance and scalability studies of all three algorithms on a number of prototypical application problems and demonstrate that our hybrid scheme is able to perform well in different settings.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128803760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 87
Compact multi-dimensional kernel extraction for register tiling 紧凑的多维核提取寄存器平铺
Lakshminarayanan Renganarayanan, Uday Bondhugula, Salem Derisavi, A. Eichenberger, K. O'Brien
{"title":"Compact multi-dimensional kernel extraction for register tiling","authors":"Lakshminarayanan Renganarayanan, Uday Bondhugula, Salem Derisavi, A. Eichenberger, K. O'Brien","doi":"10.1145/1654059.1654105","DOIUrl":"https://doi.org/10.1145/1654059.1654105","url":null,"abstract":"To achieve high performance on multi-cores, modern loop optimizers apply long sequences of transformations that produce complex loop structures. Downstream optimizations such as register tiling (unroll-and-jam plus scalar promotion) typically provide a significant performance improvement. Typical register tilers provide this performance improvement only when applied on simple loop structures. They often fail to operate on complex loop structures leaving a significant amount of performance on the table. We present a technique called compact multi-dimensional kernel extraction (COMDEX) which can make register tilers operate on arbitrarily complex loop structures and enable them to provide the performance benefits. COMDEX extracts compact unrollable kernels from complex loops. We show that by using COMDEX as a pre-processing to register tiling we can (i) enable register tiling on complex loop structures and (ii) realize a significant performance improvement on a variety of codes.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"178 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121081408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 15
Future scaling of processor-memory interfaces 处理器-内存接口的未来扩展
Jung Ho Ahn, N. Jouppi, C. Kozyrakis, J. Leverich, R. Schreiber
{"title":"Future scaling of processor-memory interfaces","authors":"Jung Ho Ahn, N. Jouppi, C. Kozyrakis, J. Leverich, R. Schreiber","doi":"10.1145/1654059.1654102","DOIUrl":"https://doi.org/10.1145/1654059.1654102","url":null,"abstract":"Continuous evolution in process technology brings energy-efficiency and reliability challenges, which are harder for memory system designs since chip multiprocessors demand high bandwidth and capacity, global wires improve slowly, and more cells are susceptible to hard and soft errors. Recently, there are proposals aiming at better main-memory energy efficiency by dividing a memory rank into subsets. We holistically assess the effectiveness of rank subsetting in the context of system-wide performance, energy-efficiency, and reliability perspectives. We identify the impact of rank subsetting on memory power and processor performance analytically, then verify the analyses by simulating a chip-multiprocessor system using multithreaded and consolidated workloads. We extend the design of Multicore DIMM, one proposal embodying rank subsetting, for high-reliability systems and show that compared with conventional chipkill approaches, it can lead to much higher system-level energy efficiency and performance at the cost of additional DRAM devices.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125775416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 121
Millisecond-scale molecular dynamics simulations on Anton 在安东身上进行毫秒级分子动力学模拟
D. Shaw, R. Dror, J. Salmon, J. P. Grossman, Kenneth M. Mackenzie, Joseph A. Bank, C. Young, Martin M. Deneroff, Brannon Batson, K. Bowers, Edmond Chow, M. Eastwood, D. Ierardi, J. L. Klepeis, J. Kuskin, Richard H. Larson, K. Lindorff-Larsen, P. Maragakis, Mark A. Moraes, S. Piana, Yibing Shan, Brian Towles
{"title":"Millisecond-scale molecular dynamics simulations on Anton","authors":"D. Shaw, R. Dror, J. Salmon, J. P. Grossman, Kenneth M. Mackenzie, Joseph A. Bank, C. Young, Martin M. Deneroff, Brannon Batson, K. Bowers, Edmond Chow, M. Eastwood, D. Ierardi, J. L. Klepeis, J. Kuskin, Richard H. Larson, K. Lindorff-Larsen, P. Maragakis, Mark A. Moraes, S. Piana, Yibing Shan, Brian Towles","doi":"10.1145/1654059.1654126","DOIUrl":"https://doi.org/10.1145/1654059.1654126","url":null,"abstract":"Anton is a recently completed special-purpose supercomputer designed for molecular dynamics (MD) simulations of biomolecular systems. The machine's specialized hardware dramatically increases the speed of MD calculations, making possible for the first time the simulation of biologicl molecules at an atomic level of detail for periods on the order of a millisecond---about two orders of magnitude beyond the previous state of the art. Anton is now running simulations on a timescale at which many critically important, but poorly understood phenomena are known to occur, allowing the observation of aspects of protein dynamics that were previously inaccessible to both computational and experimental study. Here, we report Anton's performance when executing actual MD simulations whose accuracy has been validated against both existing MD software and experimental observations. We also discuss the manner in which novel algorithms have been coordinated with Anton's co-designed, application-specific hardware to achieve these results.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125858991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 101
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信