2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing: Latest Papers

Transactional Forwarding: Supporting Highly-Concurrent STM in Asynchronous Distributed Systems
Mohamed M. Saad, B. Ravindran
{"title":"Transactional Forwarding: Supporting Highly-Concurrent STM in Asynchronous Distributed Systems","authors":"Mohamed M. Saad, B. Ravindran","doi":"10.1109/SBAC-PAD.2012.36","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.36","url":null,"abstract":"Distributed software transactional memory (or DTM) is an emerging promising model for distributed concurrency control, as it avoids the problems with locks (e.g., distributed deadlocks), while retaining the programming simplicity of coarse-grained locking. We consider DTM in Herlihy and Sun's data flow distributed execution model, where transactions are immobile and objects dynamically migrate to invoking transactions. To support DTM in this model and ensure transactional properties including atomicity, consistency, and isolation, we develop an algorithm called Transactional Forwarding Algorithm (or TFA). TFA guarantees a consistent view of shared objects between distributed transactions, provides atomicity for object operations, and transparently handles object relocation and versioning using an asynchronous version clock-based validation algorithm. We show that TFA is opaque (its correctness property) and permits strong progressiveness (its progress property). We implement TFA in a Java DTM framework and conduct experimental studies on a 120-node system, executing over 4 million transactions, with more than 1000 active concurrent transactions. 
Our implementation reveals that TFA outperforms competing distributed concurrency control models including Java RMI with spin locks, distributed shared memory, and directory-based DTM, by as much as 13x (for read-dominant transactions), and competitor DTM implementations by as much as 4x.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122126222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
FusedOS: Fusing LWK Performance with FWK Functionality in a Heterogeneous Environment
Yoonho Park, E. V. Hensbergen, Marius Hillenbrand, T. Inglett, Bryan S. Rosenburg, K. D. Ryu, R. Wisniewski
{"title":"FusedOS: Fusing LWK Performance with FWK Functionality in a Heterogeneous Environment","authors":"Yoonho Park, E. V. Hensbergen, Marius Hillenbrand, T. Inglett, Bryan S. Rosenburg, K. D. Ryu, R. Wisniewski","doi":"10.1109/SBAC-PAD.2012.14","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.14","url":null,"abstract":"Traditionally, there have been two approaches to providing an operating environment for high performance computing (HPC). A Full-Weight Kernel (FWK) approach starts with a general-purpose operating system and strips it down to better scale up across more cores and out across larger clusters. A Light-Weight Kernel (LWK) approach starts with a new thin kernel code base and extends its functionality by adding more system services needed by applications. In both cases, the goal is to provide end-users with a scalable HPC operating environment with the functionality and services needed to reliably run their applications. To achieve this goal, we propose a new approach, called FusedOS, that combines the FWK and LWK approaches. FusedOS provides an infrastructure capable of partitioning the resources of a multicore heterogeneous system and collaboratively running different operating environments on subsets of the cores and memory, without the use of a virtual machine monitor. With FusedOS, HPC applications can enjoy both the performance characteristics of an LWK and the rich functionality of an FWK through cross-core system service delegation. This paper presents the FusedOS architecture and a prototype implementation on Blue Gene/Q. The FusedOS prototype leverages Linux with small modifications as a FWK and implements a user-level LWK called Compute Library (CL) by leveraging CNK. 
We present CL performance results demonstrating low noise and show micro-benchmarks running with performance commensurate with that provided by CNK.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124625877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 53
Exploiting Phase-Change Memory in Cooperative Caches
Luiz E. Ramos, R. Bianchini
{"title":"Exploiting Phase-Change Memory in Cooperative Caches","authors":"Luiz E. Ramos, R. Bianchini","doi":"10.1109/SBAC-PAD.2012.11","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.11","url":null,"abstract":"Modern servers require large main memories, which so far have been enabled by improvements in DRAM density. However, the scalability of DRAM is approaching its limit, so Phase-Change Memory (PCM) is being considered as an alternative technology. PCM is denser, more scalable, and consumes lower idle power than DRAM, while exhibiting byte-addressability and access times in the nanosecond range. Unfortunately, PCM is also slower than DRAM and has limited endurance. These characteristics prompted the study of hybrid memory systems, combining a small amount of DRAM and a large amount of PCM. In this paper, we leverage hybrid memories to improve the performance of cooperative memory caches in server clusters. Our approach entails a novel policy that exploits popularity information in placing objects across servers and memory technologies. Our results show that (1) DRAM-only and PCM-only memory systems do not perform well in all cases, and (2) when managed properly, hybrid memories always exhibit the best or close-to-best performance, with significant gains in many cases, without increasing energy consumption.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128948239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Sparse Fast Fourier Transform on GPUs and Multi-core CPUs
Jiaxi Hu, Zhaosen Wang, Qiyuan Qiu, Weijun Xiao, D. Lilja
{"title":"Sparse Fast Fourier Transform on GPUs and Multi-core CPUs","authors":"Jiaxi Hu, Zhaosen Wang, Qiyuan Qiu, Weijun Xiao, D. Lilja","doi":"10.1109/SBAC-PAD.2012.34","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.34","url":null,"abstract":"Given an N-point sequence, finding its k largest components in the frequency domain is a problem of great interest. This problem, which is usually referred to as a sparse Fourier Transform, was recently brought back on stage by a newly proposed algorithm called the sFFT. In this paper, we present a parallel implementation of sFFT on both multi-core CPUs and GPUs using a human voice signal as a case study. Using this example, an estimate of k for the 3dB cutoff points was conducted through concrete experiments. In addition, three optimization strategies are presented in this paper. We demonstrate that the multi-core-based sFFT achieves speedups of up to three times a single-threaded sFFT while a GPU-based version achieves up to ten times speedup. For large scale cases, the GPU-based sFFT also shows its considerable advantages, which is about 40 times speedup compared to the latest out-of-card FFT implementations [2].","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"315 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132879063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Efficiently Handling Memory Accesses to Improve QoS in Multicore Systems under Real-Time Constraints
José Luis March, S. Petit, J. Sahuquillo, H. Hassan, J. Duato
{"title":"Efficiently Handling Memory Accesses to Improve QoS in Multicore Systems under Real-Time Constraints","authors":"José Luis March, S. Petit, J. Sahuquillo, H. Hassan, J. Duato","doi":"10.1109/SBAC-PAD.2012.16","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.16","url":null,"abstract":"Chip multiprocessors (CMPs) are becoming the common choice to implement embedded systems due to they achieve a good tradeoff between performance and power. Because of manufacturability reasons, CMPs use to implement one or several memory controllers, each one shared by a set of cores. Thus, memory requests from distinct cores compete among them when accessing to memory. This means that the memory access latency can widely vary depending on the co-runners and the memory controller scheduling policy, thus yielding to unpredictable behavior. This work focuses on the design of a memory controller to support workloads with real-time constraints, both hard real-time (HRT) and soft real-time (SRT) applications. These systems must guarantee the execution of HRT applications while improving the performance of the SRT applications. In this paper we propose two memory controller policies for multicore embedded systems: HR-first and ATR-first. The former prioritizes memory requests of HRT tasks, achieving important energy savings but poor performance for SRT applications. The latter gives priority to those HRT requests that are critical to guarantee schedulability. 
Results show that the ATR-first policy presents similar energy consumption as the HR-first policy while reducing the number of SRT deadline misses around 49%, on average, and reaching the fulfillment of all deadlines in some scenarios.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127432531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Energy-Performance Tradeoffs in Software Transactional Memory
A. Baldassin, J. P. L. Carvalho, L. A. G. Garcia, R. Azevedo
{"title":"Energy-Performance Tradeoffs in Software Transactional Memory","authors":"A. Baldassin, J. P. L. Carvalho, L. A. G. Garcia, R. Azevedo","doi":"10.1109/SBAC-PAD.2012.19","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.19","url":null,"abstract":"Transactional memory (TM) is a new synchronization mechanism devised to simplify parallel programming, thereby helping programmers to unleash the power of current multicore processors. Although software implementations of TM (STM) have been extensively analyzed in terms of runtime performance, little attention has been paid to an equally important constraint faced by nearly all computer systems: energy consumption. In this work we conduct a comprehensive study of energy and runtime tradeoffs in software transactional memory systems. We characterize the behavior of three state-of-the-art lock-based STM algorithms, along with three different conflict resolution schemes. As a result of this characterization, we propose a DVFS-based technique that can be integrated into the resolution policies so as to improve the energy-delay product (EDP). Experimental results show that our DVFS-enhanced policies are indeed beneficial for applications with high contention levels. 
Improvements of up to 59% in EDP can be observed in this scenario, with an average EDP reduction of 16% across the STAMP workloads.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116285548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Global Data Re-allocation via Communication Aggregation in Chapel
Alberto Sanz, R. Asenjo, Juan López, R. Larrosa, A. Navarro, V. Litvinov, Sung-Eun Choi, B. Chamberlain
{"title":"Global Data Re-allocation via Communication Aggregation in Chapel","authors":"Alberto Sanz, R. Asenjo, Juan López, R. Larrosa, A. Navarro, V. Litvinov, Sung-Eun Choi, B. Chamberlain","doi":"10.1109/SBAC-PAD.2012.18","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.18","url":null,"abstract":"Chapel is a parallel programming language designed to improve the productivity and ease of use of conventional and parallel computers. This language currently delivers suboptimal performance when executing codes that perform global data re-allocation operations on distributed memory architectures. This is mainly due to data communication that is done without aggregation (one message for each remote array element). In this work, we analyze Chapel's standard Block and Cyclic distribution modules and optimize the communication routines for array assignments by performing aggregation. Thanks to the expressive power of Chapel, the compiler and runtime have enough information to do communication aggregation without user intervention. The runtime relies on the low-level GASNet networking layer, whose versions of one-sided bulk put/get routines that support strides are particularly useful for us. 
Experimental results conducted on Hector (a Cray XE6) and Jaguar (Cray XK6) reveal that the implemented techniques can lead to significant reductions in communication time.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123667424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Network Endpoints for Clusters of SMPs
Ilie Gabriel Tanase, G. Almási, Hanhong Xue, C. Archer
{"title":"Network Endpoints for Clusters of SMPs","authors":"Ilie Gabriel Tanase, G. Almási, Hanhong Xue, C. Archer","doi":"10.1109/SBAC-PAD.2012.15","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.15","url":null,"abstract":"Modern large scale parallel machines feature an increasingly deep hierarchy of interconnections. Individual processing cores employ simultaneous multithreading (SMT) to better exploit functional units, multiple coherent processors are collocated in a node to better exploit links to cache, memory and network (SMP), and multiple nodes are interconnected by specialized low latency/high speed networks. Current trends indicate ever wider SMP nodes in the future. To service these nodes, modern high performance network devices (including Infiniband and all of IBM's recent offerings) offer the ability to sub-divide the network devices' resources among the processing threads. System software, however, lags in exploiting these capabilities, leaving users of e.g., MPI[14], UPC[19] in a bind, requiring complex and fragile workarounds in user programs. In this paper we discuss our implementation of endpoints, the software paradigm central to the IBM PAMI messaging library [3]. A PAMI endpoint is an expression in software of a slice of the network device. System software can service endpoints without serializing the many threads on an SMP by forcing them through a critical section. In the paper we describe the basic guarantees offered by PAMI to the programmer, and how these can be used to enable efficient implementations of high level libraries and programming languages like UPC. 
We evaluate the efficiency of our implementation on a novel P7IH system with up to 4096 cores, running micro benchmarks designed to find performance deficiencies in the endpoints implementation of both point-to-point and collective functions.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"464 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129358735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
The Network Adapter: The Missing Link between MPI Applications and Network Performance
G. Rodríguez, C. Minkenberg, R. Luijten, R. Beivide, P. Geoffray, J. Labarta, M. Valero, Steve Poole
{"title":"The Network Adapter: The Missing Link between MPI Applications and Network Performance","authors":"G. Rodríguez, C. Minkenberg, R. Luijten, R. Beivide, P. Geoffray, J. Labarta, M. Valero, Steve Poole","doi":"10.1109/SBAC-PAD.2012.17","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.17","url":null,"abstract":"Network design aspects that influence cost and performance can be classified according to their distance from the applications, into issues concerning topology, switch technology, link technology, network adapter, and communication library. The network adapter has a privileged position to take decisions with more global information than any other component in the network. It receives feedback from the switches and requests from the communication libraries and applications. Also, compared to a network switch, an adapter has access to significantly more memory (host memory and on-chip memory) and memory bandwidth (which typically exceeds network bandwidth). The potential of the adapter to improve global network performance has not yet been fully exploited. In this work we show a series of noticeable performance improvements (of at least 10% to 15%) for medium-sized message exchanges in typical HPC communication patterns by optimizing message segmentation and packet injection policies, that can be implemented in an adapter's firmware inexpensively. We also show that implementing equivalent solutions in the switch (as opposed to the adapter) leads to only marginal performance improvements as the ones obtained by controlling the segmentation and injection policy at the adapter, while involving significantly more cost. 
In addition, enhancing the adapter will lead to less hardware complexity in the switches, thus reducing cost and energy consumption.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121712323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Level-3 BLAS on the TI C6678 Multi-core DSP
Murtaza Ali, E. Stotzer, Francisco D. Igual, R. V. D. Geijn
{"title":"Level-3 BLAS on the TI C6678 Multi-core DSP","authors":"Murtaza Ali, E. Stotzer, Francisco D. Igual, R. V. D. Geijn","doi":"10.1109/SBAC-PAD.2012.26","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.26","url":null,"abstract":"Digital Signal Processors (DSP) are commonly employed in embedded systems. The increase of processing needs in cellular base-stations, radio controllers and industrial/medical imaging systems, has led to the development of multi-core DSPs as well as inclusion of floating point operations while maintaining low power dissipation. The eight-core DSP from Texas Instruments, codenamed TMS320C6678, provides a peak performance of 128 GFLOPS (single precision) and an effective 32 GFLOPS (double precision) for only 10 watts. In this paper, we present the first complete implementation and report performance of the Level-3 Basic Linear Algebra Subprograms (BLAS) routines for this DSP. These routines are first optimized for single core and then parallelized over the different cores using OpenMP constructs. The results show that we can achieve about 8 single precision GFLOPS/watt and 2.2 double precision GFLOPS/watt for General Matrix-Matrix multiplication (GEMM). The performance of the rest of the Level-3 BLAS routines is within 90% of the corresponding GEMM routines.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124624689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 32