2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing: Latest Publications

Transactional Forwarding: Supporting Highly-Concurrent STM in Asynchronous Distributed Systems

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing Pub Date : 2012-10-24 DOI: 10.1109/SBAC-PAD.2012.36

Mohamed M. Saad, B. Ravindran

Abstract: Distributed software transactional memory (or DTM) is an emerging promising model for distributed concurrency control, as it avoids the problems with locks (e.g., distributed deadlocks), while retaining the programming simplicity of coarse-grained locking. We consider DTM in Herlihy and Sun's data flow distributed execution model, where transactions are immobile and objects dynamically migrate to invoking transactions. To support DTM in this model and ensure transactional properties including atomicity, consistency, and isolation, we develop an algorithm called Transactional Forwarding Algorithm (or TFA). TFA guarantees a consistent view of shared objects between distributed transactions, provides atomicity for object operations, and transparently handles object relocation and versioning using an asynchronous version clock-based validation algorithm. We show that TFA is opaque (its correctness property) and permits strong progressiveness (its progress property). We implement TFA in a Java DTM framework and conduct experimental studies on a 120-node system, executing over 4 million transactions, with more than 1000 active concurrent transactions. Our implementation reveals that TFA outperforms competing distributed concurrency control models including Java RMI with spin locks, distributed shared memory, and directory-based DTM, by as much as 13x (for read-dominant transactions), and competitor DTM implementations by as much as 4x.

Citations: 16

FusedOS: Fusing LWK Performance with FWK Functionality in a Heterogeneous Environment

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing Pub Date : 2012-10-24 DOI: 10.1109/SBAC-PAD.2012.14

Yoonho Park, E. V. Hensbergen, Marius Hillenbrand, T. Inglett, Bryan S. Rosenburg, K. D. Ryu, R. Wisniewski

Abstract: Traditionally, there have been two approaches to providing an operating environment for high performance computing (HPC). A Full-Weight Kernel (FWK) approach starts with a general-purpose operating system and strips it down to better scale up across more cores and out across larger clusters. A Light-Weight Kernel (LWK) approach starts with a new thin kernel code base and extends its functionality by adding more system services needed by applications. In both cases, the goal is to provide end-users with a scalable HPC operating environment with the functionality and services needed to reliably run their applications. To achieve this goal, we propose a new approach, called FusedOS, that combines the FWK and LWK approaches. FusedOS provides an infrastructure capable of partitioning the resources of a multicore heterogeneous system and collaboratively running different operating environments on subsets of the cores and memory, without the use of a virtual machine monitor. With FusedOS, HPC applications can enjoy both the performance characteristics of an LWK and the rich functionality of an FWK through cross-core system service delegation. This paper presents the FusedOS architecture and a prototype implementation on Blue Gene/Q. The FusedOS prototype leverages Linux with small modifications as an FWK and implements a user-level LWK called Compute Library (CL) by leveraging CNK. We present CL performance results demonstrating low noise and show micro-benchmarks running with performance commensurate with that provided by CNK.

Citations: 53

Exploiting Phase-Change Memory in Cooperative Caches

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing Pub Date : 2012-10-24 DOI: 10.1109/SBAC-PAD.2012.11

Luiz E. Ramos, R. Bianchini

Citations: 11

Sparse Fast Fourier Transform on GPUs and Multi-core CPUs

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing Pub Date : 2012-10-24 DOI: 10.1109/SBAC-PAD.2012.34

Jiaxi Hu, Zhaosen Wang, Qiyuan Qiu, Weijun Xiao, D. Lilja

Citations: 13

Efficiently Handling Memory Accesses to Improve QoS in Multicore Systems under Real-Time Constraints

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing Pub Date : 2012-10-24 DOI: 10.1109/SBAC-PAD.2012.16

José Luis March, S. Petit, J. Sahuquillo, H. Hassan, J. Duato

Abstract: Chip multiprocessors (CMPs) are becoming the common choice to implement embedded systems because they achieve a good tradeoff between performance and power. For manufacturability reasons, CMPs usually implement one or several memory controllers, each one shared by a set of cores. Thus, memory requests from distinct cores compete with one another when accessing memory. This means that the memory access latency can vary widely depending on the co-runners and the memory controller scheduling policy, thus leading to unpredictable behavior. This work focuses on the design of a memory controller to support workloads with real-time constraints, both hard real-time (HRT) and soft real-time (SRT) applications. These systems must guarantee the execution of HRT applications while improving the performance of the SRT applications. In this paper we propose two memory controller policies for multicore embedded systems: HR-first and ATR-first. The former prioritizes memory requests of HRT tasks, achieving important energy savings but poor performance for SRT applications. The latter gives priority to those HRT requests that are critical to guarantee schedulability. Results show that the ATR-first policy presents energy consumption similar to that of the HR-first policy while reducing the number of SRT deadline misses by around 49%, on average, and meeting all deadlines in some scenarios.

Citations: 1

Energy-Performance Tradeoffs in Software Transactional Memory

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing Pub Date : 2012-10-24 DOI: 10.1109/SBAC-PAD.2012.19

A. Baldassin, J. P. L. Carvalho, L. A. G. Garcia, R. Azevedo

Abstract: Transactional memory (TM) is a new synchronization mechanism devised to simplify parallel programming, thereby helping programmers to unleash the power of current multicore processors. Although software implementations of TM (STM) have been extensively analyzed in terms of runtime performance, little attention has been paid to an equally important constraint faced by nearly all computer systems: energy consumption. In this work we conduct a comprehensive study of energy and runtime tradeoffs in software transactional memory systems. We characterize the behavior of three state-of-the-art lock-based STM algorithms, along with three different conflict resolution schemes. As a result of this characterization, we propose a DVFS-based technique that can be integrated into the resolution policies so as to improve the energy-delay product (EDP). Experimental results show that our DVFS-enhanced policies are indeed beneficial for applications with high contention levels. Improvements of up to 59% in EDP can be observed in this scenario, with an average EDP reduction of 16% across the STAMP workloads.

Citations: 9

Global Data Re-allocation via Communication Aggregation in Chapel

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing Pub Date : 2012-10-24 DOI: 10.1109/SBAC-PAD.2012.18

Alberto Sanz, R. Asenjo, Juan López, R. Larrosa, A. Navarro, V. Litvinov, Sung-Eun Choi, B. Chamberlain

Abstract: Chapel is a parallel programming language designed to improve the productivity and ease of use of conventional and parallel computers. This language currently delivers suboptimal performance when executing codes that perform global data re-allocation operations on distributed memory architectures. This is mainly due to data communication that is done without aggregation (one message for each remote array element). In this work, we analyze Chapel's standard Block and Cyclic distribution modules and optimize the communication routines for array assignments by performing aggregation. Thanks to the expressive power of Chapel, the compiler and runtime have enough information to do communication aggregation without user intervention. The runtime relies on the low-level GASNet networking layer, whose versions of one-sided bulk put/get routines that support strides are particularly useful for us. Experimental results conducted on Hector (a Cray XE6) and Jaguar (Cray XK6) reveal that the implemented techniques can lead to significant reductions in communication time.

Citations: 21

Network Endpoints for Clusters of SMPs

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing Pub Date : 2012-10-24 DOI: 10.1109/SBAC-PAD.2012.15

Ilie Gabriel Tanase, G. Almási, Hanhong Xue, C. Archer

Abstract: Modern large scale parallel machines feature an increasingly deep hierarchy of interconnections. Individual processing cores employ simultaneous multithreading (SMT) to better exploit functional units, multiple coherent processors are collocated in a node to better exploit links to cache, memory and network (SMP), and multiple nodes are interconnected by specialized low latency/high speed networks. Current trends indicate ever wider SMP nodes in the future. To service these nodes, modern high performance network devices (including Infiniband and all of IBM's recent offerings) offer the ability to sub-divide the network devices' resources among the processing threads. System software, however, lags in exploiting these capabilities, leaving users of e.g., MPI [14], UPC [19] in a bind, requiring complex and fragile workarounds in user programs. In this paper we discuss our implementation of endpoints, the software paradigm central to the IBM PAMI messaging library [3]. A PAMI endpoint is an expression in software of a slice of the network device. System software can service endpoints without serializing the many threads on an SMP by forcing them through a critical section. In the paper we describe the basic guarantees offered by PAMI to the programmer, and how these can be used to enable efficient implementations of high level libraries and programming languages like UPC. We evaluate the efficiency of our implementation on a novel P7IH system with up to 4096 cores, running micro benchmarks designed to find performance deficiencies in the endpoints implementation of both point-to-point and collective functions.

Citations: 5

The Network Adapter: The Missing Link between MPI Applications and Network Performance

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing Pub Date : 2012-10-24 DOI: 10.1109/SBAC-PAD.2012.17

G. Rodríguez, C. Minkenberg, R. Luijten, R. Beivide, P. Geoffray, J. Labarta, M. Valero, Steve Poole

Abstract: Network design aspects that influence cost and performance can be classified according to their distance from the applications, into issues concerning topology, switch technology, link technology, network adapter, and communication library. The network adapter has a privileged position to take decisions with more global information than any other component in the network. It receives feedback from the switches and requests from the communication libraries and applications. Also, compared to a network switch, an adapter has access to significantly more memory (host memory and on-chip memory) and memory bandwidth (which typically exceeds network bandwidth). The potential of the adapter to improve global network performance has not yet been fully exploited. In this work we show a series of noticeable performance improvements (of at least 10% to 15%) for medium-sized message exchanges in typical HPC communication patterns by optimizing message segmentation and packet injection policies that can be implemented inexpensively in an adapter's firmware. We also show that implementing equivalent solutions in the switch (as opposed to the adapter) yields only marginal performance improvements compared with those obtained by controlling the segmentation and injection policy at the adapter, while involving significantly more cost. In addition, enhancing the adapter will lead to less hardware complexity in the switches, thus reducing cost and energy consumption.

Citations: 1

Level-3 BLAS on the TI C6678 Multi-core DSP

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing Pub Date : 2012-10-24 DOI: 10.1109/SBAC-PAD.2012.26

Murtaza Ali, E. Stotzer, Francisco D. Igual, R. V. D. Geijn

Citations: 32