Latest articles: 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

Parallelization design on multi-core platforms in density matrix renormalization group toward 2-D Quantum strongly-correlated systems
S. Yamada, Toshiyuki Imamura, M. Machida
DOI: 10.1145/2063384.2063467
Abstract: One of the most fascinating issues in modern condensed matter physics is to understand highly-correlated electronic structures and to propose novel device designs for a reduced-carbon-dioxide future. Among the numerical approaches developed for highly-correlated electrons, the density matrix renormalization group (DMRG) is widely accepted as the most promising scheme, surpassing Monte Carlo and exact diagonalization in accuracy and accessible system size. Indeed, DMRG almost perfectly resolves long, one-dimensional chain-like quantum systems. In this paper, we extend the approach toward higher-dimensional systems using high-performance computing techniques. The computational core of DMRG is the diagonalization of a huge, non-uniform sparse matrix. To parallelize this part efficiently, we implement communication step doubling, together with reuse of the mid-point data between the two doubled steps, to avoid the severe bottleneck of the all-to-all communication essential to the diagonalization. The technique succeeds even on clusters of more than 1,000 cores and offers a trustworthy way to explore two-dimensional highly-correlated systems.
Citations: 4
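The computational core the abstract describes, diagonalizing a huge sparse matrix that is affordable only through its matrix-vector product, is typically attacked with a Krylov method. Below is a minimal serial Lanczos sketch of that idea (a generic illustration, not the authors' parallel DMRG code; the test operator is an arbitrary diagonal matrix chosen for the example):

```python
import numpy as np

def lanczos_smallest(matvec, n, k=50, seed=0):
    """Estimate the smallest eigenvalue of a large symmetric operator that
    is accessible only through matrix-vector products, as in DMRG-style
    eigensolvers."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    q_prev, beta = np.zeros(n), 0.0
    alphas, betas = [], []
    for _ in range(k):
        w = matvec(q) - beta * q_prev
        alpha = q @ w
        w -= alpha * q
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        betas.append(beta)
        if beta < 1e-12:
            break
        q_prev, q = q, w / beta
    m = len(alphas)
    # Eigenvalues of the small tridiagonal projection approximate the
    # extremal eigenvalues of the full operator.
    T = np.diag(alphas) + np.diag(betas[:m - 1], 1) + np.diag(betas[:m - 1], -1)
    return np.linalg.eigvalsh(T)[0]

# Toy check: a diagonal operator with a well-separated smallest eigenvalue.
spectrum = np.concatenate(([1.0], np.linspace(5.0, 10.0, 499)))
approx = lanczos_smallest(lambda v: spectrum * v, 500)
```

In the real solver, `matvec` is the distributed Hamiltonian application whose all-to-all communication the paper's step doubling optimizes.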
Server-side I/O coordination for parallel file systems
Huaiming Song, Yanlong Yin, Xian-He Sun, R. Thakur, S. Lang
DOI: 10.1145/2063384.2063407
Abstract: Parallel file systems have become a common component of modern high-end computers, masking the ever-increasing gap between disk access speed and CPU computing power. However, while they work well for certain applications, current parallel file systems cannot effectively handle concurrent I/O requests with data-synchronization needs, even though concurrent I/O is the norm in data-intensive applications. Recognizing that an I/O request does not complete until all involved file servers have completed their parts, we propose a server-side I/O coordination scheme for parallel file systems. The basic idea is to coordinate the file servers to serve one application at a time, reducing completion time while maintaining server utilization and fairness. A window-wide coordination concept is introduced for this purpose. We present the coordination algorithm and an analysis of its average completion time, and we implement a prototype under the PVFS2 file system and MPI-IO environment. Experimental results demonstrate that the scheme reduces average completion time by 8% to 46% and provides higher I/O bandwidth than the default data-access strategies of PVFS2 under heavy I/O workloads. The results also show that the scheme scales well.
Citations: 59
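The window-wide idea, serving one application at a time within each window of requests, can be sketched as a toy reordering of a request stream (a sequential model for illustration only; the paper coordinates real file servers):

```python
from collections import defaultdict

def coordinate(requests, window=4):
    """Reorder a stream of (app_id, request_id) pairs so that, within each
    window, requests are served one application at a time rather than
    interleaved across applications."""
    order = []
    for start in range(0, len(requests), window):
        by_app = defaultdict(list)
        for app, rid in requests[start:start + window]:
            by_app[app].append((app, rid))
        for app in sorted(by_app):    # serve each application back to back
            order.extend(by_app[app])
    return order

interleaved = [("A", 0), ("B", 0), ("A", 1), ("B", 1), ("A", 2), ("B", 2)]
print(coordinate(interleaved, window=4))
```

With a window of 4, the first window's requests are regrouped so both of A's requests complete before B's begin, which is what shortens each application's completion time.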
Improving communication performance in dense linear algebra via topology aware collectives
Edgar Solomonik, A. Bhatele, J. Demmel
DOI: 10.1145/2063384.2063487
Abstract: Recent results have shown that topology-aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that on mesh interconnects, topology-aware mapping also enables the use of highly efficient topology-aware collectives. We map novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer, allowing the algorithms to use optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this fashion. On 16,384 nodes (65,536 cores) of Blue Gene/P, 2.5D algorithms that exploit rectangular collectives are significantly faster than 2D matrix multiplication (MM) and LU factorization, by up to 8.7x and 2.1x, respectively. These speedups come from reduced communication (up to 95.6% less for 2.5D MM relative to 2D MM). We also derive novel LogP-based performance models for rectangular broadcasts and reductions, and use them to model the performance of matrix multiplication and LU factorization on a hypothetical exascale architecture.
Citations: 57
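The communication savings of 2.5D algorithms follow from a standard cost model: replicating the matrices c times cuts the per-processor bandwidth cost from O(n²/√p) to O(n²/√(cp)). A sketch of that asymptotic model (constants dropped; this is not the paper's LogP-based model):

```python
def words_moved_2d(n, p):
    """Per-processor bandwidth cost of classic 2D matrix multiply
    (SUMMA/Cannon), up to constants: O(n^2 / sqrt(p))."""
    return n**2 / p**0.5

def words_moved_25d(n, p, c):
    """2.5D matrix multiply with c replicated copies of the matrices:
    O(n^2 / sqrt(c * p)) words per processor, again up to constants."""
    return n**2 / (c * p)**0.5

n, p = 32768, 16384
for c in (1, 4, 16):
    ratio = words_moved_2d(n, p) / words_moved_25d(n, p, c)
    print(f"c={c:2d}: 2.5D moves {ratio:.1f}x fewer words than 2D")
```

The ratio is √c regardless of n and p, which is why modest extra memory (c copies) buys a substantial communication reduction.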
A scalable eigensolver for large scale-free graphs using 2D graph partitioning
A. Yoo, A. Baker, R. Pearce, V. Henson
DOI: 10.1145/2063384.2063469
Abstract: Eigensolvers are important tools for analyzing and mining useful information from scale-free graphs. Such graphs arise in many applications and can be extremely large. Unfortunately, existing parallel eigensolvers do not scale well on these graphs because of the high communication overhead of the parallel matrix-vector multiplication (MatVec). We develop a MatVec algorithm based on 2D edge partitioning that significantly reduces communication costs and embed it into a popular eigensolver library. The enhanced eigensolver attains a two-orders-of-magnitude performance improvement over the original on a state-of-the-art massively parallel machine. We illustrate its performance by computing eigenvalues of a scale-free graph with 300 million vertices and 5 billion edges, to the best of our knowledge the largest scale-free graph analyzed by any in-memory parallel eigensolver.
Citations: 47
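The essence of 2D partitioning is that each MatVec task owns one block of the matrix and communicates only along its block row and column, so each task has roughly √P partners instead of P. A serial NumPy sketch of the blocked computation (dense for brevity, whereas the paper partitions the edges of a huge sparse graph):

```python
import numpy as np

def spmv_2d_blocks(A, x, q):
    """Matrix-vector product computed as a q-by-q grid of blocks. Block
    (i, j) multiplies its submatrix by segment j of x, and the partial
    results are reduced along each block row — the communication pattern
    that 2D partitioning restricts to sqrt(P)-sized groups."""
    n = A.shape[0]
    assert n % q == 0
    b = n // q
    y = np.zeros(n)
    for i in range(q):
        for j in range(q):   # in a parallel run, block (i, j) is one task
            y[i*b:(i+1)*b] += A[i*b:(i+1)*b, j*b:(j+1)*b] @ x[j*b:(j+1)*b]
    return y

rng = np.random.default_rng(1)
A = (rng.random((8, 8)) < 0.3) * rng.random((8, 8))   # sparse-ish matrix
x = rng.random(8)
assert np.allclose(spmv_2d_blocks(A, x, q=4), A @ x)
```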
End-to-end network QoS via scheduling of flexible resource reservation requests
Sushant Sharma, D. Katramatos, Dantong Yu
DOI: 10.1145/2063384.2063475
Abstract: Modern data-intensive applications move vast amounts of data between locations around the world. To enable predictable and reliable data transfers, next-generation networks allow such applications to reserve network resources for exclusive use. In this paper, we solve an important problem (called SMR3) of accommodating multiple concurrent network reservation requests between a pair of end sites. Given the varying availability of bandwidth within the network, the goal is to accommodate as many reservation requests as possible while minimizing the total time needed to complete the data transfers. We first prove that SMR3 is NP-hard. We then solve it with a polynomial-time heuristic called RRA, which hinges on an efficient mechanism for accommodating a large number of requests iteratively. Finally, numerical results show that RRA constructs schedules that accommodate a significantly larger number of requests than other, seemingly efficient, heuristics.
Citations: 31
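To make the scheduling problem concrete, here is a toy greedy admission loop for flexible requests of the form (volume, deadline) on a single link with per-slot capacity. It is only a stand-in in the spirit of iterative admission heuristics; it is not the paper's RRA algorithm:

```python
def admit_requests(requests, capacity, horizon):
    """Greedily admit (volume, deadline) reservation requests onto a link
    with `capacity` units of bandwidth per time slot, earliest deadline
    first. Bandwidth is committed only if a request fits entirely before
    its deadline — a toy model, not the paper's RRA."""
    free = [capacity] * horizon
    admitted = []
    for volume, deadline in sorted(requests, key=lambda r: r[1]):
        need, plan = volume, []
        for t in range(min(deadline, horizon)):
            take = min(free[t], need)
            if take > 0:
                plan.append((t, take))
                need -= take
            if need == 0:
                break
        if need == 0:            # request fits: commit the planned slots
            for t, take in plan:
                free[t] -= take
            admitted.append((volume, deadline))
    return admitted

reqs = [(10, 2), (10, 4), (25, 4)]
print(admit_requests(reqs, capacity=10, horizon=4))
```

Here the first two requests fill slots 0 and 1, leaving only 20 units before the third request's deadline, so the 25-unit request is rejected rather than partially scheduled.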
Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer
T. Shimokawabe, T. Aoki, T. Takaki, Toshio Endo, A. Yamanaka, N. Maruyama, Akira Nukada, S. Matsuoka
DOI: 10.1145/2063384.2063388
Abstract: The mechanical properties of metals largely depend on their intrinsic internal microstructures, so predicting the patterns in solidified metals is indispensable for developing engineering materials with the expected properties. Phase-field simulation is the most powerful method known for simulating micro-scale dendritic growth during solidification in a binary alloy. A realistic description of solidification, however, requires computing a large number of complex nonlinear terms over a fine-grained grid. Because of this heavy computational demand, previous work on simulating three-dimensional solidification with phase-field methods succeeded only in describing simple shapes. Our new simulation techniques achieve unprecedentedly large scales, sufficient for handling the complex dendritic structures required in materials science. Our simulations on the GPU-rich TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology demonstrate good weak scaling and achieve 1.017 PFlops in single precision for our largest configuration, using 4,000 GPUs along with 16,000 CPU cores.
Citations: 200
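The per-cell work pattern can be illustrated with a minimal phase-field update: one explicit stencil step of the scalar Allen-Cahn equation on a periodic grid. This is a drastic simplification of the paper's binary-alloy dendrite model, shown only to indicate the kind of grid computation involved:

```python
import numpy as np

def allen_cahn_step(phi, dt=0.1, dx=1.0, eps=1.0):
    """One explicit time step of d(phi)/dt = eps*lap(phi) + phi - phi^3
    on a periodic grid — the simplest scalar phase-field model, a toy
    stand-in for the paper's binary-alloy equations."""
    lap = (np.roll(phi, 1, 0) + np.roll(phi, -1, 0) +
           np.roll(phi, 1, 1) + np.roll(phi, -1, 1) - 4.0 * phi) / dx**2
    return phi + dt * (eps * lap + phi - phi**3)

rng = np.random.default_rng(0)
phi = 0.1 * rng.standard_normal((64, 64))   # small random initial noise
for _ in range(200):
    phi = allen_cahn_step(phi)
# The field coarsens into domains near the stable phases phi = +1 and -1.
```

Every cell applies the same nearest-neighbor stencil plus a nonlinear local term, which is exactly the access pattern that maps well onto GPUs.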
BlobCR: Efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots
Bogdan Nicolae, F. Cappello
DOI: 10.1145/2063384.2063429
Abstract: Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtimes and resource demands of such applications, an efficient checkpoint-restart mechanism is paramount. This paper proposes a solution that minimizes the storage space and performance overhead of checkpoint-restart. We leverage virtual machine (VM) disk-image multi-snapshotting and multi-deployment inside checkpoint-restart protocols running at guest level to efficiently capture, and potentially roll back, the complete state of the application, including file-system modifications. Experiments on the G5K testbed show substantial improvement for MPI applications over existing approaches, both when customized checkpointing is available at the application level and when it must be handled at the process level.
Citations: 87
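The general mechanism behind disk-image snapshotting, recording only the blocks written since the last checkpoint and replaying the deltas on rollback, can be sketched as a toy copy-on-write store (an illustration of the idea, not BlobCR's actual protocol):

```python
class SnapshotDisk:
    """Toy copy-on-write disk image: each snapshot records only the blocks
    written since the previous one, so checkpoints stay small and a restore
    replays the delta chain up to the requested version."""
    def __init__(self, nblocks):
        self.base = [0] * nblocks
        self.snapshots = []   # list of {block_index: value} deltas
        self.dirty = {}       # blocks written since the last checkpoint

    def write(self, block, value):
        self.dirty[block] = value

    def checkpoint(self):
        self.snapshots.append(self.dirty)
        self.dirty = {}

    def restore(self, version):
        """Materialize the disk as of snapshot `version` (0-based)."""
        disk = list(self.base)
        for delta in self.snapshots[:version + 1]:
            for block, value in delta.items():
                disk[block] = value
        return disk

d = SnapshotDisk(4)
d.write(0, 7); d.checkpoint()                   # snapshot 0
d.write(0, 9); d.write(2, 5); d.checkpoint()    # snapshot 1
print(d.restore(0), d.restore(1))
```

Snapshot 1 stores only the two blocks it touched, yet restoring it reproduces the full disk state, which is what keeps checkpoint storage proportional to the write set rather than the image size.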
Logjam: A scalable unified log file archiver
N. Cardo
DOI: 10.1145/2063348.2063379
Abstract: Log files are a necessary record of events on any system. However, as systems scale, so does the volume of data captured. To complicate matters, this data can be distributed across all nodes of a system, creating challenges in collecting the files and archiving them consistently. It has become commonplace to develop a custom utility for each system, tailored specifically to it; computer centers with multiple systems end up maintaining one gathering-and-archiving utility per system, and every new log file requires modifying that utility, which costs time and risks introducing errors. This is precisely the purpose of Logjam: once installed, the code requires modification only when new features are needed. A configuration file identifies each log file, where to harvest it, and how to archive it. Adding a new log file is as simple as defining it in the configuration file, and testing can be performed in the production environment.
Citations: 1
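The design principle, adding a log file by editing configuration rather than code, can be sketched as a config-driven harvest loop. The entry names and paths below are invented for illustration, and the abstract does not describe Logjam's actual configuration format:

```python
# Toy config-driven harvester: adding a new log file means adding one entry
# to CONFIG, not changing the code. Paths here are made up for the example.
CONFIG = {
    "syslog": {"source": "/var/log/syslog", "archive": "archive/syslog"},
    "sched":  {"source": "/var/log/sched",  "archive": "archive/sched"},
}

def harvest(config, read, store):
    """Walk the config, read each source, and store it under its archive
    path; `read` and `store` stand in for real file and transfer calls."""
    archived = []
    for name, entry in config.items():
        data = read(entry["source"])
        store(entry["archive"], data)
        archived.append(name)
    return sorted(archived)

# In-memory stand-ins for the filesystem, so the sketch is runnable:
fs = {"/var/log/syslog": "boot ok", "/var/log/sched": "job 1 done"}
store_area = {}
print(harvest(CONFIG, fs.get, store_area.__setitem__))
```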
Scalable fast multipole methods on distributed heterogeneous architectures
Qi Hu, N. Gumerov, R. Duraiswami
DOI: 10.1145/2063384.2063432
Abstract: We fundamentally reconsider the implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture (multicore CPUs with one or more GPU accelerators), as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used inside a time-stepping or iterative loop. Observing that the local summation and the analysis-based translation parts of the FMM are independent, we map them to the GPUs and CPUs, respectively. Careful analysis of the FMM distributes work optimally between the multicore CPUs and the GPU accelerators. We first develop a single-node version, parallelizing the CPU part with OpenMP and the GPU part with CUDA, and present new parallel algorithms for creating FMM data structures together with load-balancing strategies for the single-node and distributed multi-node versions. Our implementation performs the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not previously reported in the literature for such clusters.
Citations: 48
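The split the authors exploit, direct near-field summation versus analysis-based far-field approximation, can be illustrated with a toy 1D potential sum: particles in the same or adjacent cells interact directly (the regular, GPU-friendly part), while each distant cell is collapsed to a single monopole at its center (a crude stand-in for the CPU-side multipole translations):

```python
import numpy as np

def nbody_split(x, q, ncells=8):
    """Toy 1D potential sum with an FMM-flavored split: near cells are
    summed directly, distant cells are approximated by their total charge
    placed at the cell center. Kernel is 1/|r|."""
    cells = np.minimum((x * ncells).astype(int), ncells - 1)
    centers = (np.arange(ncells) + 0.5) / ncells
    qcell = np.array([q[cells == c].sum() for c in range(ncells)])
    phi = np.zeros_like(x)
    for i in range(len(x)):
        for c in range(ncells):
            if abs(c - cells[i]) <= 1:      # near field: direct sum
                for j in np.nonzero(cells == c)[0]:
                    if j != i:
                        phi[i] += q[j] / abs(x[i] - x[j])
            else:                           # far field: monopole approx
                phi[i] += qcell[c] / abs(x[i] - centers[c])
    return phi

rng = np.random.default_rng(2)
x, q = rng.random(50), rng.random(50)
exact = np.array([sum(q[j] / abs(x[i] - x[j]) for j in range(50) if j != i)
                  for i in range(50)])
err = np.abs(nbody_split(x, q) - exact) / exact
```

A real FMM replaces the monopole with a full multipole-to-local translation hierarchy, which is what drives the error down to arbitrary accuracy.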
Virtual I/O caching: Dynamic storage cache management for concurrent workloads
Michael R. Frasca, R. Prabhakar, P. Raghavan, M. Kandemir
DOI: 10.1145/2063384.2063435
Abstract: A leading cause of reduced or unpredictable application performance in distributed systems is contention at the storage layer, where resources are multiplexed among many concurrent data-intensive workloads. We target the shared storage cache, used to alleviate disk I/O bottlenecks, and propose a new caching paradigm that both improves performance and reduces memory requirements for HPC storage systems. We present the virtual I/O cache, a dynamic scheme for managing a limited storage-cache resource. Application behavior and the resulting performance of a chosen replacement policy are observed at run time, and a mechanism mitigates suboptimal behavior to increase cache efficiency. We further use the virtual I/O cache to isolate concurrent workloads and globally manage physical resource allocation toward system-level performance objectives. Evaluating the scheme with twenty I/O-intensive applications and benchmarks, we observe average hit-rate gains above 17% for isolated workloads, as well as cache-size reductions near 80% for equivalent performance levels. Our largest concurrent workload achieved hit-rate gains above 23% and an iso-performance cache reduction of over 80%.
Citations: 9
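The benefit of isolating concurrent workloads can be reproduced in a toy LRU simulation: a small cyclic workload that thrives in its own partition is destroyed when interleaved with a streaming scan in one shared cache. This illustrates the contention phenomenon only; it is not the paper's virtual I/O cache mechanism:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU block cache that counts hits, for comparing cache
    management policies in simulation."""
    def __init__(self, size):
        self.size, self.blocks = size, OrderedDict()
        self.hits, self.refs = 0, 0

    def access(self, key):
        self.refs += 1
        if key in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(key)
        else:
            if len(self.blocks) >= self.size:
                self.blocks.popitem(last=False)   # evict least recently used
            self.blocks[key] = True

    def hit_rate(self):
        return self.hits / self.refs

# Two interleaved workloads: a small cyclic working set and a large scan.
loop = [("loop", i % 4) for i in range(64)]
scan = [("scan", i) for i in range(64)]
mixed = [r for pair in zip(loop, scan) for r in pair]

shared = LRUCache(6)                  # one cache multiplexed by both
for key in mixed:
    shared.access(key)

part_loop, part_scan = LRUCache(4), LRUCache(2)   # isolated partitions
for tag, b in mixed:
    (part_loop if tag == "loop" else part_scan).access((tag, b))

print(shared.hit_rate(), part_loop.hit_rate())
```

Both configurations use six blocks in total, yet the scan thrashes the shared cache to a zero hit rate while the isolated loop partition hits over 93% of the time.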