IEEE International Symposium on High-Performance Parallel Distributed Computing: Latest Publications

Exascale opportunities and challenges
K. Yelick
Pub Date: 2011-06-08. DOI: 10.1145/1996130.1996132
Abstract: Despite the availability of petascale systems for scientific computing, demand for computational capability grows unabated, with areas of national and commercial interest including global climate change, alternative energy sources, defense and medicine, as well as basic science. Past growth in the high end has relied on a combination of faster clock speeds and larger systems, but the clock speed benefits of Moore's Law have ended, and 200-cabinet petascale machines are near a practical limit. In future computing systems, performance and energy optimization will be the combined responsibility of hardware and software developers. Since data movement dominates energy use in a computing system, minimizing the movement of data throughout the memory and communication fabric is essential. In this talk I will describe some of the hardware trends and open problems in developing and using an exascale system: in particular, how an energy-constrained design will affect the architecture, which in turn affects algorithms and programming models. In addition to these universal problems, fault resilience is a problem at the high end that will require novel system support, possibly propagating up the software stack to user-level software and algorithms. Overall, the trends in hardware demand that the community undertake a broad set of research activities to sustain the growth in computing performance expected by users.
Citations: 1
Experience of parallelizing cryo-EM 3D reconstruction on a CPU-GPU heterogeneous system
Linchuan Li, Xingjian Li, Guangming Tan, Mingyu Chen, Peiheng Zhang
Pub Date: 2011-06-08. DOI: 10.1145/1996130.1996157
Abstract: Heterogeneous architectures are becoming an important way to build massively parallel computer systems, as witnessed by the CPU-GPU heterogeneous systems ranked in the Top500 list. However, it is a challenge to efficiently utilize the massive parallelism of both applications and architectures on such heterogeneous systems. In this paper we present our experience of exploiting and orchestrating parallelism at the algorithm level to take advantage of the underlying parallelism at the architecture level. A potential petaflops application, cryo-EM 3D reconstruction, is selected as an example. We exploit all possible parallelism in cryo-EM 3D reconstruction, and leverage a self-adaptive dynamic scheduling algorithm to create a proper parallelism mapping between the application and the architecture. The parallelized programs are evaluated on a subsystem of the Dawning Nebulae supercomputer, each node of which is composed of two Intel six-core Xeon CPUs and one Nvidia Fermi GPU. The experiment confirms that hierarchical parallelism is an efficient pattern of parallel programming for utilizing the capabilities of both CPU and GPU in a heterogeneous system. The CUDA kernels run more than 3 times faster than the OpenMP-parallelized ones using 12 cores (threads). Based on the GPU-only version, the hybrid CPU-GPU program further improves the whole application's performance by 30% on average.
Citations: 13
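The self-adaptive dynamic scheduling described above maps independent reconstruction tasks onto whichever processor becomes free, so the faster device naturally absorbs more work. Below is a minimal Python sketch of that work-queue idea; the worker names, per-task costs, and the roughly 3x GPU/CPU speed ratio are illustrative assumptions, not the paper's implementation.

```python
import queue
import threading
import time

def dynamic_schedule(tasks, workers):
    """Greedy work-queue scheduling: each worker (CPU or GPU) pulls the next
    task as soon as it is idle, so no static CPU/GPU partition is needed."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    done = {name: 0 for name, _ in workers}

    def run(name, cost):
        while True:
            try:
                q.get_nowait()
            except queue.Empty:
                return
            time.sleep(cost)  # stand-in for a CUDA kernel or OpenMP region
            done[name] += 1

    threads = [threading.Thread(target=run, args=w) for w in workers]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return done

# Hypothetical per-task costs: the GPU finishes a slab ~3x faster than
# 12 CPU cores, roughly the ratio reported in the paper's experiments.
print(dynamic_schedule(range(40), [("gpu", 0.01), ("cpu", 0.03)]))
```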
Introspective end-system modeling to optimize the transfer time of rate based protocols
V. Ahuja, A. Banerjee, M. Farrens, D. Ghosal, G. Serazzi
Pub Date: 2011-06-08. DOI: 10.1145/1996130.1996140
Abstract: The transmission capacity of today's high-speed networks is often greater than the capacity of an end-system (such as a server or a remote client) to consume the incoming data. The mismatch between the network and the end-system, which can be exacerbated by high end-system workloads, will result in incoming packets being dropped at different points in the packet receiving process. In particular, a packet may be dropped in the NIC, in the kernel ring buffer, and (for rate-based protocols) in the socket buffer. To provide reliable data transfers, these losses require retransmissions, and, if the loss rate is high enough, they result in longer download times. In this paper, we focus on UDP-like rate-based transport protocols, and address the question of how best to estimate the rate at which the end-system can consume data so as to minimize the overall transfer time of a file.
We propose a novel queueing network model of the end-system, which consists of a model of the NIC, a model of the kernel ring buffer and the protocol processing, and a model of the socket buffer from which the application process reads the data. We show that, using simple and approximate queueing models, we can accurately predict the effective end-system bottleneck rate that minimizes the file transfer time. We compare our protocol with PA-UDP, an end-system-aware rate-based transport protocol, and show that our approach performs better, particularly when the packet losses in the NIC and/or the kernel ring buffer are high. We also compare our approach to TCP. Unlike our rate-based scheme, TCP invokes the congestion control algorithm when there are losses in the NIC and the ring buffer. With higher end-to-end delay, this results in significant performance degradation compared to our reliable, end-system-aware rate-based protocol.
Citations: 3
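The heart of the approach is estimating the effective rate at which the end-system can drain packets, so the sender never overruns the NIC, ring buffer, or socket buffer. As a caricature only: in a tandem queue, sustainable throughput is capped by the slowest stage, and sending above that cap gains nothing while turning the excess into drops and retransmissions. The sketch below assumes fixed per-stage service rates; the paper's actual model is a queueing network parameterized by the measured end-system load.

```python
def bottleneck_rate(stage_rates):
    """Sustained throughput of a tandem queue cannot exceed the service
    rate of its slowest stage; faster sending only overflows the buffer
    in front of the bottleneck."""
    return min(stage_rates.values())

def transfer_time(num_pkts, send_rate, stage_rates):
    """Goodput saturates at the bottleneck: packets sent above it are
    dropped and must be retransmitted, so a higher send rate cannot
    shorten the transfer."""
    goodput = min(send_rate, bottleneck_rate(stage_rates))
    return num_pkts / goodput

# Hypothetical service rates (packets/sec) for the three modeled stages:
# NIC, kernel ring buffer + protocol processing, socket buffer reads.
stages = {"nic": 950_000, "kernel": 610_000, "app_read": 480_000}
print(bottleneck_rate(stages))                      # best sustainable rate
print(transfer_time(10_000_000, 480_000, stages))   # seconds at that rate
```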
Wrangler: virtual cluster provisioning for the cloud
G. Juve, E. Deelman
Pub Date: 2011-06-08. DOI: 10.1145/1996130.1996173
Abstract: Cloud computing systems are becoming an important platform for science applications. Infrastructure as a Service (IaaS) clouds provide the capability to provision virtual machines (VMs) on demand with a specific configuration of hardware resources, but they do not provide functionality for managing those resources once provisioned. In order for such clouds to be used effectively for parallel and distributed scientific applications, tools need to be developed that can help users to deploy their applications in the cloud. This paper describes a system we have developed to provision, configure, and manage clusters of virtual machines.
Citations: 41
Six degrees of scientific data: reading patterns for extreme scale science IO
J. Lofstead, Milo Polte, Garth A. Gibson, S. Klasky, K. Schwan, R. Oldfield, M. Wolf, Qing Liu
Pub Date: 2011-06-08. DOI: 10.1145/1996130.1996139
Abstract: Petascale science simulations generate tens of TBs of application data per day, much of it devoted to their checkpoint/restart fault tolerance mechanisms. Previous work demonstrated the importance of carefully managing such output to prevent application slowdown due to IO blocking and resource contention, and to fully exploit the IO bandwidth available to the petascale machine. This paper takes a further step in understanding and managing extreme-scale IO. Specifically, its evaluations seek to understand how to efficiently read data for subsequent data analysis, visualization, checkpoint restart after a failure, and other read-intensive operations. In their entirety, these actions support the 'end-to-end' needs of scientists, enabling the scientific processes being undertaken. Contributions include the following. First, working with application scientists, we define 'read' benchmarks that capture the common read patterns used by analysis codes. Second, these read patterns are used to evaluate different IO techniques at scale to understand the effects of alternative data sizes and organizations in relation to the performance seen by end users. Third, defining the novel notion of a 'data district' to characterize how data is organized for reads, we experimentally compare the read performance seen with the ADIOS middleware's log-based BP format to that seen by the logically contiguous NetCDF or HDF5 formats commonly used by analysis tools. Measurements assess the performance seen across patterns and with different data sizes, organizations, and read process counts. Outcomes demonstrate that high end-to-end IO performance requires data organizations that offer flexibility in data layout and placement on parallel storage targets, including in ways that can make tradeoffs between the performance of data writes and reads.
Citations: 97
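The read patterns such benchmarks capture include restoring an entire checkpoint, pulling one variable over many timesteps, and extracting a plane or subvolume from a 3D field. The NumPy sketch below shows why storage organization matters: with a logically contiguous layout, a full restore is one sequential scan while a slice decomposes into many small strided reads. The file layout and names here are illustrative, not the BP, NetCDF, or HDF5 organizations measured in the paper.

```python
import numpy as np

# Hypothetical output: one 3D field dumped as a contiguous array.
nx, ny, nz = 128, 128, 128
np.save("/tmp/field.npy", np.zeros((nx, ny, nz)))

# Memory-map so each access pattern only touches the bytes it needs,
# mimicking how an analysis code's read cost depends on organization.
data = np.load("/tmp/field.npy", mmap_mode="r")

restart = data[:]                    # checkpoint restart: sequential scan
plane = data[:, :, nz // 2]          # orthogonal slice: nx*ny strided reads
subvol = data[10:20, 10:20, 10:20]   # subvolume: scattered short runs

print(restart.shape, plane.shape, subvol.shape)
```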
Experiences using smaash to manage data-intensive simulations
R. Hudson, Johnny Norris, L. Reid, K. Weide, George Cal Jordan IV, M. Papka
Pub Date: 2011-06-08. DOI: 10.1145/1996130.1996158
Abstract: High-performance scientific computer simulations created with such systems as the University of Chicago's FLASH code generate enormous amounts of data that must be captured, cataloged, and analyzed. Unless this is done formally, monitoring such simulations, tracking and reproducing old ones, and analyzing and archiving their output can be haphazard and idiosyncratic. Smaash, a simulation management and analysis system developed at the University of Chicago and Argonne National Laboratory, seeks to solve some of these problems by offering what approaches a single point of control and analysis, a metadata base, and a set of tools that automate some of what scientists have been doing by hand.
Smaash was designed to be independent of the particular simulation code, and is accessible from many computing platforms. It is automatic and standardized, and was built using open-source software tools. Data security is considered throughout the process, yet users are insulated from onerous verification procedures. Because the system was developed with feedback from scientific users, its user interface reflects how scientists work in their daily life. We describe our system and a typical simulation it was designed to support. We illustrate its utility with several examples describing our experience of freeing scientists from the data manipulation phase to focus on the computational results and the analysis of high-performance computing.
Citations: 3
Just in time: adding value to the IO pipelines of high performance applications with JITStaging
H. Abbasi, G. Eisenhauer, M. Wolf, K. Schwan, S. Klasky
Pub Date: 2011-06-08. DOI: 10.1145/1996130.1996137
Abstract: Large-scale applications are generating a tsunami of data, with understanding driven by finding information hidden within this data. The ever-increasing sizes of output, however, are making it difficult for science users to inspect the data generated by their applications, understand its important properties, and/or organize it for subsequent analysis and visualization. This paper presents JITStager, a software infrastructure with which end users can dynamically customize and thus add value to the output pipelines of their HEC applications. JITStager is able to customize data at scale by leveraging the computational power of both compute nodes and of additional 'data staging' nodes allocated by end users. Using existing, componentized I/O interfaces to decouple the compile-time specification of the program and the run-time customization of the data pipeline, JITStager employs efficient runtime methods for binary code generation and data movement to create custom pipelines for applications' output processes that provide end users with improved insights into the data being produced, without burdening the application's computational performance and without impeding output performance. This paper describes the JITStager architecture, evaluates its performance, and demonstrates the advantages derived from its use with representative HPC applications.
Citations: 72
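Generically, the value-add is letting users splice their own operators into the output path so data is reduced or annotated in transit rather than post-processed from disk. The Python sketch below illustrates only that pipeline-composition idea; JITStager itself works via runtime binary code generation on compute and staging nodes, and the stage names here are made up.

```python
from functools import reduce

def make_pipeline(*stages):
    """Compose user-supplied operators into a single function applied to
    each output chunk as it flows from the simulation toward storage."""
    return lambda chunk: reduce(lambda data, stage: stage(data), stages, chunk)

# Hypothetical user stages: subsample the chunk, then attach summary
# statistics so downstream analysis can skip uninteresting regions.
subsample = lambda values: values[::2]
annotate = lambda values: {"data": values, "lo": min(values), "hi": max(values)}

pipeline = make_pipeline(subsample, annotate)
print(pipeline(list(range(10))))  # {'data': [0, 2, 4, 6, 8], 'lo': 0, 'hi': 8}
```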
VMFlock: virtual machine co-migration for the cloud
S. Al-Kiswany, Dinesh Subhraveti, P. Sarkar, M. Ripeanu
Pub Date: 2011-06-08. DOI: 10.1145/1996130.1996153
Abstract: This paper presents VMFlockMS, a migration service optimized for cross-datacenter transfer and instantiation of groups of virtual machine (VM) images that comprise an application-level solution (e.g., a three-tier web application). We dub these groups of related VM images VMFlocks. VMFlockMS employs two main techniques: first, data deduplication within the VMFlock to be migrated and between the VMFlock and the data already present at the destination datacenter, and, second, accelerated instantiation of the application at the target datacenter after transferring only a partial set of data blocks and prioritization of the remaining data based on previously observed access patterns originating from the running VMs. VMFlockMS is designed to be deployed as a set of virtual appliances which make efficient use of the available cloud resources to locally access and deduplicate the images and data in a distributed fashion, with minimal requirements imposed on the cloud API to access the VM image repository. VMFlockMS provides an incrementally scalable and high-performance migration service. Our evaluation shows that VMFlockMS can reduce the data volumes to be transferred over the network to as low as 3% of the original VMFlock size, enables the complete transfer of the VM images belonging to a VMFlock over a transcontinental link up to 3.5x faster than alternative approaches, and enables booting these VM images with as little as 5% of the compressed VMFlock data available at the destination.
Citations: 141
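The first of the two techniques, deduplication within the flock and against blocks already stored at the destination, amounts to exchanging content fingerprints and shipping only blocks the destination has never seen. A minimal sketch follows; the fixed 4 KB block size and SHA-256 fingerprints are assumptions for illustration, not VMFlockMS's actual parameters.

```python
import hashlib

BLOCK = 4096  # hypothetical fixed block size

def fingerprints(image: bytes) -> dict:
    """Map each fixed-size block of a VM image to its content hash."""
    return {hashlib.sha256(image[i:i + BLOCK]).hexdigest(): image[i:i + BLOCK]
            for i in range(0, len(image), BLOCK)}

def blocks_to_send(flock_images, dest_hashes):
    """Ship each unique block once: dedup across every image in the flock,
    then drop blocks the destination datacenter already stores."""
    unique = {}
    for image in flock_images:
        unique.update(fingerprints(image))
    return {h: b for h, b in unique.items() if h not in dest_hashes}

# Two images sharing a common base; the destination already has the base,
# so only the two divergent blocks cross the wide-area link.
base = bytes(BLOCK * 10)
img_a, img_b = base + b"A" * BLOCK, base + b"B" * BLOCK
dest = set(fingerprints(base))
print(len(blocks_to_send([img_a, img_b], dest)), "blocks to transfer")
```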
Juggle: proactive load balancing on multicore computers
S. Hofmeyr, Juan A. Colmenares, Costin Iancu, J. Kubiatowicz
Pub Date: 2011-06-08. DOI: 10.1145/1996130.1996134
Abstract: We investigate proactive dynamic load balancing on multicore systems, in which threads are continually migrated to reduce the impact of processor/thread mismatches, enhance the flexibility of the SPMD-style programming model, and enable SPMD applications to run efficiently in multiprogrammed environments. We present Juggle, a practical decentralized, user-space implementation of a proactive load balancer that emphasizes portability and usability. Juggle shows performance improvements of up to 80% over static balancing for UPC, OpenMP, and pthreads benchmarks. We analyze the impact of Juggle on parallel applications and derive lower bounds and approximations for thread completion times. We show that results from Juggle closely match theoretical predictions across a variety of architectures, including NUMA and hyper-threaded systems. We also show that Juggle is effective in multiprogrammed environments with unpredictable interference from unrelated external applications.
Citations: 25
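The intuition behind the gain over static balancing: when the thread count does not divide the core count, static placement leaves some cores permanently overloaded, and the whole SPMD computation waits on their threads. Rotating the surplus threads gives every thread an equal share of core time. The toy model below illustrates that bound; it is back-of-the-envelope arithmetic, not Juggle's user-space implementation.

```python
import math

def static_finish(threads, cores, work=1.0):
    """With static placement, cores holding ceil(threads/cores) threads
    time-slice them, and the run ends when those slow threads finish."""
    return work * math.ceil(threads / cores)

def balanced_finish(threads, cores, work=1.0):
    """If surplus threads are migrated round-robin, every thread receives
    an equal fraction of total core time: threads/cores units each."""
    return work * threads / cores

t, c = 13, 8  # hypothetical: 13 SPMD threads on 8 cores
print(static_finish(t, c))    # 2.0   -- overloaded cores dominate
print(balanced_finish(t, c))  # 1.625 -- ideal proactive balancing
```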
HyFlow: a high performance distributed software transactional memory framework
Mohamed M. Saad, B. Ravindran
Pub Date: 2011-06-08. DOI: 10.1145/1996130.1996167
Abstract: We present HyFlow, a distributed software transactional memory (D-STM) framework for distributed concurrency control. HyFlow is a Java framework for D-STM, with pluggable support for directory lookup protocols, transactional synchronization and recovery mechanisms, contention management policies, cache coherence protocols, and network communication protocols. HyFlow exports a simple distributed programming model that excludes locks: using (Java 5) annotations, atomic sections are defined as transactions, in which reads and writes to shared, local, and remote objects appear to take effect instantaneously. No changes are needed to the underlying virtual machine or compiler. We describe HyFlow's architecture and implementation, and report on experimental studies comparing HyFlow against competing models including Java remote method invocation (RMI) with mutual exclusion and read/write locks, distributed shared memory (DSM), and directory-based D-STM. Our studies show that HyFlow outperforms competitors by as much as 40-190% on a broad range of transactional workloads on a 72-node system, with more than 500 concurrent transactions.
Citations: 46
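HyFlow's programming model marks methods as transactions with Java 5 annotations, so reads and writes to shared, local, and remote objects appear to take effect instantaneously. The Python decorator below is a language-shifted illustration of the atomic-section idea only; it serializes commits with one lock for brevity, whereas a real D-STM like HyFlow buffers reads and writes, validates at commit time, and retries on conflict.

```python
import threading

_commit = threading.Lock()  # stand-in for commit-time validation

def atomic(fn):
    """Mark a function as an atomic section: its effects become visible
    all at once. (Sketch only: a global commit lock instead of true STM.)"""
    def wrapper(*args, **kwargs):
        with _commit:
            return fn(*args, **kwargs)
    return wrapper

accounts = {"a": 100, "b": 0}

@atomic
def transfer(src, dst, amount):
    # In a D-STM these accesses would go through versioned object proxies
    # that may resolve, via a directory protocol, to remote nodes.
    if accounts[src] >= amount:
        accounts[src] -= amount
        accounts[dst] += amount

transfer("a", "b", 40)
print(accounts)  # {'a': 60, 'b': 40}
```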