{"title":"Parallelization design on multi-core platforms in density matrix renormalization group toward 2-D Quantum strongly-correlated systems","authors":"S. Yamada, Toshiyuki Imamura, M. Machida","doi":"10.1145/2063384.2063467","DOIUrl":"https://doi.org/10.1145/2063384.2063467","url":null,"abstract":"One of the most fascinating issues in modern condensed matter physics is to understand highly-correlated electronic structures and propose their novel device designs toward the reduced carbon-dioxide future. Among various developed numerical approaches for highly-correlated electrons, the density matrix renormalization group (DMRG) has been widely accepted as the most promising numerical scheme compared to Monte Carlo and exact diagonalization in terms of accuracy and accessible system size. In fact, DMRG almost perfectly resolves one-dimensional chain like long quantum systems. In this paper, we suggest its extended approach toward higher-dimensional systems by high-performance computing techniques. The computing target in DMRG is a huge non-uniform sparse matrix diagonalization. In order to efficiently parallelize the part, we implement communication step doubling together with reuse of the mid-point data between the doubled two steps to avoid severe bottleneck of all-to-all communications essential for the diagonalization. The technique is successful even for clusters composed of more than 1000 cores and offers a trustworthy exploration way for two-dimensional highly-correlated systems.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"265 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123107271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Server-side I/O coordination for parallel file systems"
Huaiming Song, Yanlong Yin, Xian-He Sun, R. Thakur, S. Lang
DOI: 10.1145/2063384.2063407 (https://doi.org/10.1145/2063384.2063407)
SC '11: 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, November 12, 2011.

Abstract: Parallel file systems have become a common component of modern high-end computers, masking the ever-increasing gap between disk access speed and CPU computing power. However, while working well for certain applications, current parallel file systems lack the ability to handle concurrent I/O requests with data-synchronization needs effectively, even though concurrent I/O is the norm in data-intensive applications. Recognizing that an I/O request does not complete until all file servers involved have completed their parts, we propose a server-side I/O coordination scheme for parallel file systems. The basic idea is to coordinate the file servers so that they serve one application at a time, reducing completion time while maintaining server utilization and fairness. A window-wide coordination concept is introduced for this purpose. We present the coordination algorithm together with an analysis of its average completion time, and we implement a prototype under the PVFS2 file system in an MPI-IO environment. Experimental results demonstrate that the scheme reduces average completion time by 8% to 46% and provides higher I/O bandwidth than the default data-access strategies of PVFS2 for heavy I/O workloads. The results also show that the server-side I/O coordination scheme scales well.
{"title":"Improving communication performance in dense linear algebra via topology aware collectives","authors":"Edgar Solomonik, A. Bhatele, J. Demmel","doi":"10.1145/2063384.2063487","DOIUrl":"https://doi.org/10.1145/2063384.2063487","url":null,"abstract":"Recent results have shown that topology aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that on mesh interconnects, topology aware mapping also allows for the utilization of highly-efficient topology aware collectives. We map novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer. Our mappings allow the algorithms to exploit optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this fashion. On 16,384 nodes (65,536 cores) of Blue Gene/P, 2.5D algorithms that exploit rectangular collectives are sig- nificantly faster than 2D matrix multiplication (MM) and LU factorization, up to 8.7x and 2.1x, respectively. These speed-ups are due to communication reduction (up to 95.6% for 2.5D MM with respect to 2D MM). We also derive LogP- based novel performance models for rectangular broadcasts and reductions. Using those, we model the performance of matrix multiplication and LU factorization on a hypothetical exascale architecture.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129079990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A scalable eigensolver for large scale-free graphs using 2D graph partitioning","authors":"A. Yoo, A. Baker, R. Pearce, V. Henson","doi":"10.1145/2063384.2063469","DOIUrl":"https://doi.org/10.1145/2063384.2063469","url":null,"abstract":"Eigensolvers are important tools for analyzing and mining useful information from scale-free graphs. Such graphs are used in many applications and can be extremely large. Unfortunately, existing parallel eigensolvers do not scale well for these graphs due to the high communication overhead in the parallel matrix-vector multiplication (MatVec). We develop a MatVec algorithm based on 2D edge partitioning that significantly reduces the communication costs and embed it into a popular eigensolver library. We demonstrate that the enhanced eigensolver can attain two orders of magnitude performance improvement compared to the original on a state-of-art massively parallel machine. We illustrate the performance of the embedded MatVec by computing eigenvalues of a scale-free graph with 300 million vertices and 5 billion edges, the largest scale-free graph analyzed by any in-memory parallel eigensolver, to the best of our knowledge.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133138209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End-to-end network QoS via scheduling of flexible resource reservation requests","authors":"Sushant Sharma, D. Katramatos, Dantong Yu","doi":"10.1145/2063384.2063475","DOIUrl":"https://doi.org/10.1145/2063384.2063475","url":null,"abstract":"Modern data-intensive applications move vast amounts of data between multiple locations around the world. To enable predictable and reliable data transfers, next generation networks allow such applications to reserve network resources for exclusive use. In this paper, we solve an important problem (called SMR3) to accommodate multiple and concurrent network reservation requests between a pair of end sites. Given the varying availability of bandwidth within the network, our goal is to accommodate as many reservation requests as possible while minimizing the total time needed to complete the data transfers. First, we prove that SMR3 is an NP-hard problem. Then, we solve it by developing a polynomial-time heuristic called RRA. The RRA algorithm hinges on an efficient mechanism to accommodate large number of requests in an iterative manner. Finally, we show via numerical results that RRA constructs schedules that accommodate significantly larger number of requests compared to other, seemingly efficient, heuristics.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124710782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer"
T. Shimokawabe, T. Aoki, T. Takaki, Toshio Endo, A. Yamanaka, N. Maruyama, Akira Nukada, S. Matsuoka
DOI: 10.1145/2063384.2063388 (https://doi.org/10.1145/2063384.2063388)
SC '11: 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, November 12, 2011.

Abstract: The mechanical properties of metal materials largely depend on their intrinsic internal microstructures. To develop engineering materials with the expected properties, predicting the patterns formed in solidified metals is indispensable. Phase-field simulation is the most powerful method known for simulating micro-scale dendritic growth during solidification in a binary alloy. To achieve a realistic description of solidification, however, phase-field simulation requires computing a large number of complex nonlinear terms over a fine-grained grid. Due to this heavy computational demand, previous work on simulating three-dimensional solidification with phase-field methods succeeded only for simple shapes. Our new simulation techniques achieve unprecedentedly large scales, sufficient for handling the complex dendritic structures required in materials science. Our simulations on the GPU-rich TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology have demonstrated good weak scaling and achieved 1.017 PFlops in single precision for our largest configuration, using 4,000 GPUs along with 16,000 CPU cores.
{"title":"BlobCR: Efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots","authors":"Bogdan Nicolae, F. Cappello","doi":"10.1145/2063384.2063429","DOIUrl":"https://doi.org/10.1145/2063384.2063429","url":null,"abstract":"Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and resource utilization of such applications, an efficient checkpoint-restart mechanism becomes paramount in this context. This paper proposes a solution to the aforementioned challenge that aims at minimizing the storage space and performance overhead of checkpoint-restart. We introduce an approach that leverages virtual machine (VM) disk-image multi-snapshotting and multi-deployment inside checkpoint-restart protocols running at guest level in order to efficiently capture and potentially roll back the complete state of the application, including file system modifications. Experiments on the G5K testbed show substantial improvement for MPI applications over existing approaches, both for the case when customized checkpointing is available at application level and the case when it needs to be handled at process level.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115786039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Logjam: A scalable unified log file archiver","authors":"N. Cardo","doi":"10.1145/2063348.2063379","DOIUrl":"https://doi.org/10.1145/2063348.2063379","url":null,"abstract":"Log files are a necessary record of events on any system. However, as systems scale, so does the volume of data captured. To complicate matters, this data can be distributed across all nodes within the system. This creates challenges in ways to obtain these files as well as archiving them in a consistent manner. It has become commonplace to develop a custom written utility for each system that is tailored specifically to that system. For computer centers that contain multiple systems, each system would have their own respective utility for gathering and archiving log files. Each time a new log file is produced, a modification to the utility is necessary. With each modification, risk of errors could be introduced as well as spending time to introduce that change. This is precisely the purpose of logjam. Once installed, the code only requires modification when new features are required. A configuration file is used to identify each log file as well as where to harvest it and how to archive it. Adding a new log file is as simple as defining it in a configuration file and testing can be performed in the production environment.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121358234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable fast multipole methods on distributed heterogeneous architectures","authors":"Qi Hu, N. Gumerov, R. Duraiswami","doi":"10.1145/2063384.2063432","DOIUrl":"https://doi.org/10.1145/2063384.2063432","url":null,"abstract":"We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map these respectively to the GPUs and CPUs. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single node version where the CPU part is parallelized using OpenMP and the GPU version via CUDA. New parallel algorithms for creating FMM data structures are presented together with load balancing strategies for the single node and distributed multiple-node versions. Our implementation can perform the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not achieved by others in the literature on such clusters.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114107163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Virtual I/O caching: Dynamic storage cache management for concurrent workloads"
Michael R. Frasca, R. Prabhakar, P. Raghavan, M. Kandemir
DOI: 10.1145/2063384.2063435 (https://doi.org/10.1145/2063384.2063435)
SC '11: 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, November 12, 2011.

Abstract: A leading cause of reduced or unpredictable application performance in distributed systems is contention at the storage layer, where resources are multiplexed among many concurrent data-intensive workloads. We target the shared storage cache, used to alleviate disk I/O bottlenecks, and propose a new caching paradigm that both improves performance and reduces memory requirements for HPC storage systems. We present the virtual I/O cache, a dynamic scheme for managing a limited storage cache resource. Application behavior and the corresponding performance of a chosen replacement policy are observed at run time, and a mechanism is designed to mitigate suboptimal behavior and increase cache efficiency. We further use the virtual I/O cache to isolate concurrent workloads and globally manage physical resource allocation toward system-level performance objectives. We evaluate our scheme using twenty I/O-intensive applications and benchmarks. Average hit-rate gains of over 17% were observed for isolated workloads, along with cache-size reductions near 80% at equivalent performance levels. Our largest concurrent workload achieved hit-rate gains of over 23% and a cache-size reduction of over 80% at equivalent performance.