A. Devulapalli, D. Dalessandro, P. Wyckoff, N. Ali, P. Sadayappan
{"title":"Integrating parallel file systems with object-based storage devices","authors":"A. Devulapalli, D. Dalessandro, P. Wyckoff, N. Ali, P. Sadayappan","doi":"10.1145/1362622.1362659","DOIUrl":"https://doi.org/10.1145/1362622.1362659","url":null,"abstract":"As storage systems evolve, the block-based design of today's disks is becoming inadequate. As an alternative, object-based storage devices (OSDs) offer a view where the disk manages data layout and keeps track of various attributes about data objects. By moving functionality that is traditionally the responsibility of the host OS to the disk, it is possible to improve overall performance and simplify management of a storage system. The capabilities of OSDs will also permit performance improvements in parallel file systems, such as further decoupling metadata operations and thus reducing metadata server bottlenecks. In this work we present an implementation of the Parallel Virtual File System (PVFS) integrated with a software emulator of an OSD and describe an infrastructure for client access. Even with the overhead of emulation, performance is comparable to a traditional server-fronted implementation, demonstrating that serverless parallel file systems using OSDs are an achievable goal.","PeriodicalId":274744,"journal":{"name":"Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116732893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data exploration of turbulence simulations using a database cluster","authors":"E. Perlman, R. Burns, Yi Li, C. Meneveau","doi":"10.1145/1362622.1362654","DOIUrl":"https://doi.org/10.1145/1362622.1362654","url":null,"abstract":"We describe a new environment for the exploration of turbulent flows that uses a cluster of databases to store complete histories of Direct Numerical Simulation (DNS) results. This allows for spatial and temporal exploration of high-resolution data that were traditionally too large to store and too computationally expensive to produce on demand. We perform analysis of these data directly on the databases nodes, which minimizes the volume of network traffic. The low network demands enable us to provide public access to this experimental platform and its datasets through Web services. This paper details the system design and implementation. Specifically, we focus on hierarchical spatial indexing, cache-sensitive spatial scheduling of batch workloads, localizing computation through data partitioning, and load balancing techniques that minimize data movement. We provide real examples of how scientists use the system to perform high-resolution turbulence research from standard desktop computing environments.","PeriodicalId":274744,"journal":{"name":"Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115775808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Ohno, Takenori Yamamoto, Tatsunobu Kokubo, Akira Azami, Yuta Sakaguchi, T. Uda, T. Yamasaki, Daisuke Fukata, J. Koga
{"title":"First-principles calculations of large-scale semiconductor systems on the earth simulator","authors":"T. Ohno, Takenori Yamamoto, Tatsunobu Kokubo, Akira Azami, Yuta Sakaguchi, T. Uda, T. Yamasaki, Daisuke Fukata, J. Koga","doi":"10.1145/1362622.1362699","DOIUrl":"https://doi.org/10.1145/1362622.1362699","url":null,"abstract":"First-principles simulations of large-scale semiconductor systems using the PHASE code on the Earth Simulator (ES) demonstrate high performance with respect to the theoretical peak performance. PHASE, designed for vector-parallel systems like the ES, demonstrates excellent parallel efficiency. We simulated an arsenic donor in silicon using up to 8,000 atom unit cell. A sustained peak performance of 14.6 TFlop/s was measured on 3,072 processing elements, which corresponds to 59% of the theoretical peak performance. Preliminary results using 10,648 atom unit cell are also presented.","PeriodicalId":274744,"journal":{"name":"Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130549304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Abts, A. Bataineh, Steve Scott, Greg Faanes, J. Schwarzmeier, E. Lundberg, Tim Johnson, Mike Bye, Gerald Schwoerer
{"title":"The Cray BlackWidow: a highly scalable vector multiprocessor","authors":"D. Abts, A. Bataineh, Steve Scott, Greg Faanes, J. Schwarzmeier, E. Lundberg, Tim Johnson, Mike Bye, Gerald Schwoerer","doi":"10.1145/1362622.1362646","DOIUrl":"https://doi.org/10.1145/1362622.1362646","url":null,"abstract":"This paper describes the system architecture of the Cray BlackWidow scalable vector multiprocessor. The BlackWidow system is a distributed shared memory (DSM) architecture that is scalable to 32K processors, each with a 4-way dispatch scalar execution unit and an 8-pipe vector unit capable of 20.8 Gflops for 64-bit operations and 41.6 Gflops for 32-bit operations at the prototype operating frequency of 1.3 GHz. Global memory is directly accessible with processor loads and stores and is globally coherent. The system supports thousands of outstanding references to hide remote memory latencies, and provides a rich suite of built-in synchronization primitives. Each BlackWidow node is implemented as a 4-way SMP with up to 128 Gbytes of DDR2 main memory capacity. The system supports common programming models such as MPI and OpenMP, as well as global address space languages such as UPC and CAF. We describe the system architecture and microarchitecture of the processor, memory controller, and router chips. We give preliminary performance results and discuss design tradeoffs.","PeriodicalId":274744,"journal":{"name":"Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124558854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tong Wen, Jimmy Su, P. Colella, K. Yelick, N. Keen
{"title":"An adaptive mesh refinement benchmark for modern parallel programming languages","authors":"Tong Wen, Jimmy Su, P. Colella, K. Yelick, N. Keen","doi":"10.1145/1362622.1362676","DOIUrl":"https://doi.org/10.1145/1362622.1362676","url":null,"abstract":"We present an Adaptive Mesh Refinement benchmark for evaluating programmability and performance of modern parallel programming languages. Benchmarks employed today by language developing teams, originally designed for performance evaluation of computer architectures, do not fully capture the complexity of state-of-the-art computational software systems running on today's parallel machines or to be run on the emerging ones from the multi-cores to the peta-scale High Productivity Computer Systems. This benchmark, extracted from a real application framework, presents challenges for a programming language in both expressiveness and performance. It consists of an infrastructure for finite difference calculations on block-structured adaptive meshes and a solver for elliptic Partial Differential Equations built on this infrastructure. Adaptive Mesh Refinement algorithms are challenging to implement due to the irregularity introduced by local mesh refinement. We describe those challenges posed by this benchmark through two reference implementations (C++ /Fortran/MPI and Titanium) and in the context of three programming models.","PeriodicalId":274744,"journal":{"name":"Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123686803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigation of leading HPC I/O performance using a scientific-application derived benchmark","authors":"J. Borrill, L. Oliker, J. Shalf, H. Shan","doi":"10.1145/1362622.1362636","DOIUrl":"https://doi.org/10.1145/1362622.1362636","url":null,"abstract":"With the exponential growth of high-fidelity sensor and simulated data, the scientific community is increasingly reliant on ultrascale HPC resources to handle their data analysis requirements. However, to utilize such extreme computing power effectively, the I/O components must be designed in a balanced fashion, as any architectural bottleneck will quickly render the platform intolerably inefficient. To understand I/O performance of data-intensive applications in realistic computational settings, we develop a lightweight, portable benchmark called MADbench2, which is derived directly from a large-scale Cosmic Microwave Background (CMB) data analysis package. Our study represents one of the most comprehensive I/O analyses of modern parallel filesystems, examining a broad range of system architectures and configurations, including Lustre on the Cray XT3 and Intel Itanium2 cluster; GPFS on IBM Power5 and AMD Opteron platforms; two BlueGene/L installations utilizing GPFS and PVFS2 filesystems; and CXFS on the SGI Altix3700. We present extensive synchronous I/O performance data comparing a number of key parameters including concurrency, POSIX- versus MPI-IO, and unique- versus shared-file accesses, using both the default environment as well as highly-tuned I/O parameters. Finally, we explore the potential of asynchronous I/O and quantify the volume of computation required to hide a given volume of I/O. Overall our study quantifies the vast differences in performance and functionality of parallel filesystems across state-of-the-art platforms, while providing system designers and computational scientists a lightweight tool for conducting further analyses.","PeriodicalId":274744,"journal":{"name":"Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115465322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A user-level secure grid file system","authors":"Ming Zhao, Renato J. O. Figueiredo","doi":"10.1145/1362622.1362683","DOIUrl":"https://doi.org/10.1145/1362622.1362683","url":null,"abstract":"A grid-wide distributed file system provides convenient data access interfaces that facilitate fine-grained cross-domain data sharing and collaboration. However, existing widely-adopted distributed file systems do not meet the security requirements for grid systems. This paper presents a Secure Grid File System (SGFS) which supports GSI-based authentication and access control, end-to-end message privacy, and integrity. It employs user-level virtualization of NFS to provide transparent grid data access leveraging existing, unmodified clients and servers. It supports user and application-tailored security customization per SGFS session, and leverages secure management services to control and configure the sessions. The system conforms to the GSI grid security infrastructure and allows for seamless integration with other grid middleware. A SGFS prototype is evaluated with both file system benchmarks and typical applications, which demonstrates that it can achieve strong security with an acceptable overhead, and substantially outperform native NFS in wide-area environments by using disk caching.","PeriodicalId":274744,"journal":{"name":"Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114889282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Michalakes, J. Hacker, R. Loft, M. O. McCracken, A. Snavely, N. Wright, T. Spelce, B. Gorda, R. Walkup
{"title":"WRF nature run","authors":"J. Michalakes, J. Hacker, R. Loft, M. O. McCracken, A. Snavely, N. Wright, T. Spelce, B. Gorda, R. Walkup","doi":"10.1145/1362622.1362701","DOIUrl":"https://doi.org/10.1145/1362622.1362701","url":null,"abstract":"The Weather Research and Forecast (WRF) model is a limited-area model of the atmosphere for mesoscale research and operational numerical weather prediction (NWP). A petascale problem is a WRF nature run that provides very high-resolution \"truth\" against which more coarse simulations or perturbation runs may be compared for purposes of studying predictability, stochastic parameterization, and fundamental dynamics. We carried out a nature run involving an idealized high resolution rotating fluid on the hemisphere to investigate scales that span the k-3 to k-5/3 kinetic energy spectral transition of the observed atmosphere using 65,536 processors of the BG/L machine at LLNL. We worked through issues of parallel I/O and scalability. The primary result is not just the scalability and high Tflops number, but an important step towards understanding weather predictability at high resolution.","PeriodicalId":274744,"journal":{"name":"Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115198664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Balaji, Wu-chun Feng, S. Bhagvat, D. Panda, R. Thakur, W. Gropp
{"title":"Analyzing the impact of supporting out-of-order communication on in-order performance with iWARP","authors":"P. Balaji, Wu-chun Feng, S. Bhagvat, D. Panda, R. Thakur, W. Gropp","doi":"10.1145/1362622.1362670","DOIUrl":"https://doi.org/10.1145/1362622.1362670","url":null,"abstract":"Due to the growing need to tolerate network faults and congestion in high-end computing systems, supporting multiple network communication paths is becoming increasingly important. However, multi-path communication comes with the disadvantage of out-of-order arrival of packets (because packets may traverse different paths). While modern networking stacks such as the Internet Wide-Area RDMA Protocol (iWARP) over 10-Gigabit Ethernet (10GE) support multi-path communication, their current implementations do not handle out-of-order packets primarily owing to the overhead on in-order communication that it adds. Specifically, in iWARP, supporting out-of-order packets requires every packet to carry additional information causing significant overhead on packets that arrive in-order. Thus, in this paper, we analyze the trade-offs in designing a feature-complete iWARP stack, i.e., one that provides support for out-of-order arriving packets, and thus, multi-path systems, while focusing on the performance of in-order communication. We propose three feature-complete designs of iWARP and analyze the pros and cons of each of these designs using performance experiments based on several micro-benchmarks as well as an iso-surface visual rendering application. Our analysis reveals that the iWARP design providing the best overall performance depends on the particular characteristics of the upper layers and that different designs are optimal based on the metric of interest.","PeriodicalId":274744,"journal":{"name":"Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114278586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Variable latency caches for nanoscale processor","authors":"S. Ozdemir, A. Mallik, J. Ku, G. Memik, Y. Ismail","doi":"10.1145/1362622.1362650","DOIUrl":"https://doi.org/10.1145/1362622.1362650","url":null,"abstract":"Variability is one of the important issues in nanoscale processors. Due to increasing importance of interconnect structures in submicron technologies, the physical location and phenomena such as coupling have an increasing impact on the latency of operations. Therefore, traditional view of rigid access latencies to components wil result in suboptimal architectures. In this paper, we devise a cache architecture with variable access latency. Particularly, we a) develop a non-uniform access level 1 data-cache, b) study the impact of coupling and physical location on level 1 data cache access latencies, and c) develop and study an architecture where the variable latency cache can be accessed while the rest of the pipeline remains synchronous. To find the access latency with different input address transitions and environmental conditions, we first build a SPICE model at a 45nm technology for a cache similar to that of the level 1 data cache of the Intel Prescott architecture. Motivated by the large difference between the worst and best case latencies and the shape of the distribution curve, we change the cache architecture to allow variable latency accesses. Since the latency of the cache is not known at the time of instruction scheduling, we also modify the functional units with the addition of special queues that will temporarily store the dependent instructions and allow the data to be forwarded from the cache to the functional units correctly. Simulations based on SPEC2000 benchmarks show that our variable access latency cache structure can reduce the execution time by as much as 19.4% and 10.7% on average compared to a conventional cache architecture.","PeriodicalId":274744,"journal":{"name":"Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115936172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}