{"title":"Parallelization design on multi-core platforms in density matrix renormalization group toward 2-D Quantum strongly-correlated systems","authors":"S. Yamada, Toshiyuki Imamura, M. Machida","doi":"10.1145/2063384.2063467","DOIUrl":"https://doi.org/10.1145/2063384.2063467","url":null,"abstract":"One of the most fascinating issues in modern condensed matter physics is to understand highly-correlated electronic structures and propose their novel device designs toward the reduced carbon-dioxide future. Among various developed numerical approaches for highly-correlated electrons, the density matrix renormalization group (DMRG) has been widely accepted as the most promising numerical scheme compared to Monte Carlo and exact diagonalization in terms of accuracy and accessible system size. In fact, DMRG almost perfectly resolves one-dimensional chain like long quantum systems. In this paper, we suggest its extended approach toward higher-dimensional systems by high-performance computing techniques. The computing target in DMRG is a huge non-uniform sparse matrix diagonalization. In order to efficiently parallelize the part, we implement communication step doubling together with reuse of the mid-point data between the doubled two steps to avoid severe bottleneck of all-to-all communications essential for the diagonalization. The technique is successful even for clusters composed of more than 1000 cores and offers a trustworthy exploration way for two-dimensional highly-correlated systems.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"265 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123107271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Server-side I/O coordination for parallel file systems"
Huaiming Song, Yanlong Yin, Xian-He Sun, R. Thakur, S. Lang
DOI: 10.1145/2063384.2063407 (https://doi.org/10.1145/2063384.2063407)
SC '11: 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, November 12, 2011.

Abstract: Parallel file systems have become a common component of modern high-end computers, masking the ever-increasing gap between disk access speed and CPU computing power. However, while working well for certain applications, current parallel file systems lack the ability to handle concurrent I/O requests with data-synchronization needs effectively, even though concurrent I/O is the norm in data-intensive applications. Recognizing that an I/O request does not complete until all file servers involved have completed their parts, we propose a server-side I/O coordination scheme for parallel file systems. The basic idea is to coordinate the file servers so that they serve one application at a time, reducing completion time while maintaining server utilization and fairness. A window-wide coordination concept is introduced for this purpose. We present the coordination algorithm together with an analysis of its average completion time, and we implement a prototype under the PVFS2 file system in an MPI-IO environment. Experimental results demonstrate that the scheme reduces average completion time by 8% to 46% and provides higher I/O bandwidth than the default data-access strategies of PVFS2 for heavy I/O workloads. The results also show that the server-side I/O coordination scheme scales well.
{"title":"Improving communication performance in dense linear algebra via topology aware collectives","authors":"Edgar Solomonik, A. Bhatele, J. Demmel","doi":"10.1145/2063384.2063487","DOIUrl":"https://doi.org/10.1145/2063384.2063487","url":null,"abstract":"Recent results have shown that topology aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that on mesh interconnects, topology aware mapping also allows for the utilization of highly-efficient topology aware collectives. We map novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer. Our mappings allow the algorithms to exploit optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this fashion. On 16,384 nodes (65,536 cores) of Blue Gene/P, 2.5D algorithms that exploit rectangular collectives are sig- nificantly faster than 2D matrix multiplication (MM) and LU factorization, up to 8.7x and 2.1x, respectively. These speed-ups are due to communication reduction (up to 95.6% for 2.5D MM with respect to 2D MM). We also derive LogP- based novel performance models for rectangular broadcasts and reductions. Using those, we model the performance of matrix multiplication and LU factorization on a hypothetical exascale architecture.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129079990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A scalable eigensolver for large scale-free graphs using 2D graph partitioning","authors":"A. Yoo, A. Baker, R. Pearce, V. Henson","doi":"10.1145/2063384.2063469","DOIUrl":"https://doi.org/10.1145/2063384.2063469","url":null,"abstract":"Eigensolvers are important tools for analyzing and mining useful information from scale-free graphs. Such graphs are used in many applications and can be extremely large. Unfortunately, existing parallel eigensolvers do not scale well for these graphs due to the high communication overhead in the parallel matrix-vector multiplication (MatVec). We develop a MatVec algorithm based on 2D edge partitioning that significantly reduces the communication costs and embed it into a popular eigensolver library. We demonstrate that the enhanced eigensolver can attain two orders of magnitude performance improvement compared to the original on a state-of-art massively parallel machine. We illustrate the performance of the embedded MatVec by computing eigenvalues of a scale-free graph with 300 million vertices and 5 billion edges, the largest scale-free graph analyzed by any in-memory parallel eigensolver, to the best of our knowledge.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133138209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End-to-end network QoS via scheduling of flexible resource reservation requests","authors":"Sushant Sharma, D. Katramatos, Dantong Yu","doi":"10.1145/2063384.2063475","DOIUrl":"https://doi.org/10.1145/2063384.2063475","url":null,"abstract":"Modern data-intensive applications move vast amounts of data between multiple locations around the world. To enable predictable and reliable data transfers, next generation networks allow such applications to reserve network resources for exclusive use. In this paper, we solve an important problem (called SMR3) to accommodate multiple and concurrent network reservation requests between a pair of end sites. Given the varying availability of bandwidth within the network, our goal is to accommodate as many reservation requests as possible while minimizing the total time needed to complete the data transfers. First, we prove that SMR3 is an NP-hard problem. Then, we solve it by developing a polynomial-time heuristic called RRA. The RRA algorithm hinges on an efficient mechanism to accommodate large number of requests in an iterative manner. Finally, we show via numerical results that RRA constructs schedules that accommodate significantly larger number of requests compared to other, seemingly efficient, heuristics.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124710782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer"
T. Shimokawabe, T. Aoki, T. Takaki, Toshio Endo, A. Yamanaka, N. Maruyama, Akira Nukada, S. Matsuoka
DOI: 10.1145/2063384.2063388 (https://doi.org/10.1145/2063384.2063388)
SC '11: 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, November 12, 2011.

Abstract: The mechanical properties of metal materials largely depend on their intrinsic internal microstructures. To develop engineering materials with the expected properties, predicting the patterns formed in solidified metals is indispensable. Phase-field simulation is the most powerful method known for simulating micro-scale dendritic growth during solidification in a binary alloy. To achieve a realistic description of solidification, however, phase-field simulation requires computing a large number of complex nonlinear terms over a fine-grained grid. Due to this heavy computational demand, previous work on simulating three-dimensional solidification with phase-field methods succeeded only for simple shapes. Our new simulation techniques achieve unprecedentedly large scales, sufficient for handling the complex dendritic structures required in materials science. Our simulations on the GPU-rich TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology have demonstrated good weak scaling and achieved 1.017 PFlops in single precision for our largest configuration, using 4,000 GPUs along with 16,000 CPU cores.
{"title":"BlobCR: Efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots","authors":"Bogdan Nicolae, F. Cappello","doi":"10.1145/2063384.2063429","DOIUrl":"https://doi.org/10.1145/2063384.2063429","url":null,"abstract":"Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and resource utilization of such applications, an efficient checkpoint-restart mechanism becomes paramount in this context. This paper proposes a solution to the aforementioned challenge that aims at minimizing the storage space and performance overhead of checkpoint-restart. We introduce an approach that leverages virtual machine (VM) disk-image multi-snapshotting and multi-deployment inside checkpoint-restart protocols running at guest level in order to efficiently capture and potentially roll back the complete state of the application, including file system modifications. Experiments on the G5K testbed show substantial improvement for MPI applications over existing approaches, both for the case when customized checkpointing is available at application level and the case when it needs to be handled at process level.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115786039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Logjam: A scalable unified log file archiver","authors":"N. Cardo","doi":"10.1145/2063348.2063379","DOIUrl":"https://doi.org/10.1145/2063348.2063379","url":null,"abstract":"Log files are a necessary record of events on any system. However, as systems scale, so does the volume of data captured. To complicate matters, this data can be distributed across all nodes within the system. This creates challenges in ways to obtain these files as well as archiving them in a consistent manner. It has become commonplace to develop a custom written utility for each system that is tailored specifically to that system. For computer centers that contain multiple systems, each system would have their own respective utility for gathering and archiving log files. Each time a new log file is produced, a modification to the utility is necessary. With each modification, risk of errors could be introduced as well as spending time to introduce that change. This is precisely the purpose of logjam. Once installed, the code only requires modification when new features are required. A configuration file is used to identify each log file as well as where to harvest it and how to archive it. Adding a new log file is as simple as defining it in a configuration file and testing can be performed in the production environment.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121358234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable fast multipole methods on distributed heterogeneous architectures","authors":"Qi Hu, N. Gumerov, R. Duraiswami","doi":"10.1145/2063384.2063432","DOIUrl":"https://doi.org/10.1145/2063384.2063432","url":null,"abstract":"We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map these respectively to the GPUs and CPUs. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single node version where the CPU part is parallelized using OpenMP and the GPU version via CUDA. New parallel algorithms for creating FMM data structures are presented together with load balancing strategies for the single node and distributed multiple-node versions. Our implementation can perform the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not achieved by others in the literature on such clusters.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114107163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Virtual I/O caching: Dynamic storage cache management for concurrent workloads"
Michael R. Frasca, R. Prabhakar, P. Raghavan, M. Kandemir
DOI: 10.1145/2063384.2063435 (https://doi.org/10.1145/2063384.2063435)
SC '11: 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, November 12, 2011.

Abstract: A leading cause of reduced or unpredictable application performance in distributed systems is contention at the storage layer, where resources are multiplexed among many concurrent data-intensive workloads. We target the shared storage cache, used to alleviate disk I/O bottlenecks, and propose a new caching paradigm that both improves performance and reduces memory requirements for HPC storage systems. We present the virtual I/O cache, a dynamic scheme for managing a limited storage cache resource. Application behavior and the corresponding performance of a chosen replacement policy are observed at run time, and a mechanism is designed to mitigate suboptimal behavior and increase cache efficiency. We further use the virtual I/O cache to isolate concurrent workloads and globally manage physical resource allocation toward system-level performance objectives. We evaluate our scheme using twenty I/O-intensive applications and benchmarks. Average hit-rate gains of over 17% were observed for isolated workloads, along with cache-size reductions near 80% at equivalent performance levels. Our largest concurrent workload achieved hit-rate gains of over 23% and a cache-size reduction of over 80% at equivalent performance.