Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)最新文献

Workstation capacity tuning using reinforcement learning 使用强化学习的工作站容量调整

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07) Pub Date : 2007-11-16 DOI: 10.1145/1362622.1362666

Aharon Bar-Hillel, Amir Di-Nur, L. Ein-Dor, Ran Gilad-Bachrach, Yossi Ittach

引用次数: 13

Virtual machine aware communication libraries for high performance computing 用于高性能计算的虚拟机感知通信库

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07) Pub Date : 2007-11-16 DOI: 10.1145/1362622.1362635

Wei Huang, Matthew J. Koop, Qi Gao, D. Panda

{"title":"Virtual machine aware communication libraries for high performance computing","authors":"Wei Huang, Matthew J. Koop, Qi Gao, D. Panda","doi":"10.1145/1362622.1362635","DOIUrl":"https://doi.org/10.1145/1362622.1362635","url":null,"abstract":"As the size and complexity of modern computing systems keep increasing to meet the demanding requirements of High Performance Computing (HPC) applications, manageability is becoming a critical concern to achieve both high performance and high productivity computing. Meanwhile, virtual machine (VM) technologies have become popular in both industry and academia due to various features designed to ease system management and administration. While a VM-based environment can greatly help manageability on large-scale computing systems, concerns over performance have largely blocked the HPC community from embracing VM technologies. In this paper, we follow three steps to demonstrate the ability to achieve near-native performance in a VM-based environment for HPC. First, we propose Inter-VM Communication (IVC), a VM-aware communication library to support efficient shared memory communication among computing processes on the same physical host, even though they may be in different VMs. This is critical for multi-core systems, especially when individual computing processes are hosted on different VMs to achieve fine-grained control. Second, we design a VM-aware MPI library based on MVAPICH2 (a popular MPI library), called MVAPICH2-ivc, which allows HPC MPI applications to transparently benefit from IVC. Finally, we evaluate MVAPICH2-ivc on clusters featuring multi-core systems and high performance InfiniBand interconnects. Our evaluation demonstrates that MVAPICH2-ivc can improve NAS Parallel Benchmark performance by up to 11% in VM-based environment on eight-core Intel Clover-town systems, where each compute process is in a separate VM. A detailed performance evaluation for up to 128 processes (64 node dual-socket single-core systems) shows only a marginal performance overhead of MVAPICH2-ivc as compared with MVAPICH2 running in a native environment. This study indicates that performance should no longer be a barrier preventing HPC environments from taking advantage of the various features available through VM technologies.","PeriodicalId":274744,"journal":{"name":"Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115252117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 122

Exploring event correlation for failure prediction in coalitions of clusters 探索集群联盟中故障预测的事件相关性

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07) Pub Date : 2007-11-16 DOI: 10.1145/1362622.1362678

S. Fu, Chengzhong Xu

{"title":"Exploring event correlation for failure prediction in coalitions of clusters","authors":"S. Fu, Chengzhong Xu","doi":"10.1145/1362622.1362678","DOIUrl":"https://doi.org/10.1145/1362622.1362678","url":null,"abstract":"In large-scale networked computing systems, component failures become norms instead of exceptions. Failure prediction is a crucial technique for self-managing resource burdens. Failure events in coalition systems exhibit strong correlations in time and space domain. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to describe spatial correlation. We further utilize the information of application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. We implemented a failure prediction framework, called PREdictor of Failure Events Correlated Temporal-Spatially (hPREFECTs), which explores correlations among failures and forecasts the time-between-failure of future instances. We evaluate the performance of hPREFECTs in both offline prediction of failure by using the Los Alamos HPC traces and online prediction in an institute-wide clusters coalition environment. Experimental results show the system achieves more than 76% accuracy in offline prediction and more than 70% accuracy in online prediction during the time from May 2006 to April 2007.","PeriodicalId":274744,"journal":{"name":"Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114734677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 191

Evaluation of active storage strategies for the lustre parallel file system lustre并行文件系统主动存储策略的评估

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07) Pub Date : 2007-11-16 DOI: 10.1145/1362622.1362660

J. Piernas, J. Nieplocha, Evan Felix

引用次数: 107

A preliminary investigation of a neocortex model implementation on the Cray XD1 新皮层模型在Cray XD1上实现的初步研究

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07) Pub Date : 2007-11-16 DOI: 10.1145/1362622.1362626

Kenneth L. Rice, Christopher N. Vutsinas, Tarek M. Taha

引用次数: 3

High-performance ethernet-based communications for future multi-core processors 面向未来多核处理器的高性能以太网通信

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07) Pub Date : 2007-11-16 DOI: 10.1145/1362622.1362672

M. Schlansker, Nagabhushan Chitlur, E. Oertli, Paul M. Stillwell, L. Rankin, Dennis Bradford, R. Carter, Jayaram Mudigonda, N. Binkert, N. Jouppi

{"title":"High-performance ethernet-based communications for future multi-core processors","authors":"M. Schlansker, Nagabhushan Chitlur, E. Oertli, Paul M. Stillwell, L. Rankin, Dennis Bradford, R. Carter, Jayaram Mudigonda, N. Binkert, N. Jouppi","doi":"10.1145/1362622.1362672","DOIUrl":"https://doi.org/10.1145/1362622.1362672","url":null,"abstract":"Data centers and HPC clusters often incorporate specialized networking fabrics to satisfy system requirements. However, Ethernet's low cost and high performance are causing a shift from specialized fabrics toward standard Ethernet. Although Ethernet's low-level performance approaches that of specialized fabrics, the features that these fabrics provide such as reliable in-order delivery and flow control are implemented, in the case of Ethernet, by endpoint hardware and software. Unfortunately, current Ethernet endpoints are either slow (commodity NICs with generic TCP/IP stacks) or costly (offload engines). To address these issues, the JNIC project developed a novel Ethernet endpoint. JNIC's hardware and software were specifically designed for the requirements of high-performance communications within future data-centers and compute clusters. The architecture combines capabilities already seen in advanced network architectures with new innovations to create a comprehensive solution for scalable and high-performance Ethernet. We envision a JNIC architecture that is suitable for most in-data-center communication needs.","PeriodicalId":274744,"journal":{"name":"Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129600500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 25

Data access history cache and associated data prefetching mechanisms 数据访问历史缓存和相关的数据预取机制

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07) Pub Date : 2007-11-16 DOI: 10.1145/1362622.1362651

Yong Chen, S. Byna, Xian-He Sun

引用次数: 39

Scalable security for petascale parallel file systems 千兆级并行文件系统的可扩展安全性

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07) Pub Date : 2007-11-16 DOI: 10.1145/1362622.1362644

A. Leung, E. L. Miller, S. N. Jones

引用次数: 58

Noncontiguous locking techniques for parallel file systems 并行文件系统的不连续锁定技术

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07) Pub Date : 2007-11-16 DOI: 10.1145/1362622.1362658

A. Ching, W. Liao, A. Choudhary, R. Ross, L. Ward

引用次数: 16

Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability 扩展稳定性超越CPU千年:开尔文-亥姆霍兹不稳定性的微米尺度原子模拟

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07) Pub Date : 2007-11-16 DOI: 10.1145/1362622.1362700

J. Glosli, D. Richards, K. Caspersen, R. Rudd, John A. Gunnels, F. Streitz

{"title":"Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability","authors":"J. Glosli, D. Richards, K. Caspersen, R. Rudd, John A. Gunnels, F. Streitz","doi":"10.1145/1362622.1362700","DOIUrl":"https://doi.org/10.1145/1362622.1362700","url":null,"abstract":"We report the computational advances that have enabled the first micron-scale simulation of a Kelvin-Helmholtz (KH) instability using molecular dynamics (MD). The advances are in three key areas for massively parallel computation such as on BlueGene/L (BG/L): fault tolerance, application kernel optimization, and highly efficient parallel I/O. In particular, we have developed novel capabilities for handling hardware parity errors and improving the speed of interatomic force calculations, while achieving near optimal I/O speeds on BG/L, allowing us to achieve excellent scalability and improve overall application performance. As a result we have successfully conducted a 2-billion atom KH simulation amounting to 2.8 CPU-millennia of run time, including a single, continuous simulation run in excess of 1.5 CPU-millennia. We have also conducted 9-billion and 62.5-billion atom KH simulations. The current optimized ddcMD code is benchmarked at 115.1 TFlop/s in our scaling study and 103.9 TFlop/s in a sustained science run, with additional improvements ongoing. These improvements enabled us to run the first MD simulations of micron-scale systems developing the KH instability.","PeriodicalId":274744,"journal":{"name":"Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125121927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 118