2011 International Conference on Parallel Architectures and Compilation Techniques最新文献

筛选
英文 中文
STM2: A Parallel STM for High Performance Simultaneous Multithreading Systems STM2:用于高性能同步多线程系统的并行STM
Gokcen Kestor, R. Gioiosa, T. Harris, O. Unsal, A. Cristal, I. Hur, M. Valero
{"title":"STM2: A Parallel STM for High Performance Simultaneous Multithreading Systems","authors":"Gokcen Kestor, R. Gioiosa, T. Harris, O. Unsal, A. Cristal, I. Hur, M. Valero","doi":"10.1109/PACT.2011.54","DOIUrl":"https://doi.org/10.1109/PACT.2011.54","url":null,"abstract":"Extracting high performance from modern chip multithreading (CMT) processors is a complex task, especially for large CMT systems. Programmers must efficiently parallelize performance-critical software while avoiding deadlocks and race conditions. Transactional memory (TM) is a promising programming model that allows programmers to focus on parallelism rather than maintaining correctness and avoiding deadlock. Software-only implementations (STMs) are especially compelling because they run on commodity hardware, therefore providing high portability. Unfortunately, STM systems usually suffer from high overheads, which may limit their usage especially at scale. In this paper we present STM2, a novel parallel STM designed for high performance, aggressive multithreading systems. STM2 significantly lowers runtime overhead by offloading read-set validation, bookkeeping and conflict detection to auxiliary threads running on sibling hardware threads. Auxiliary threads perform STM operations in parallel with their paired application threads and absorb STM overhead, significantly improving performance. We exploit the fact that, on modern multi-core processors, sets of cores can share L1 or L2 caches. This lets us achieve closer coupling between the application thread and the auxiliary thread (when compared with a traditional multi-processor systems). Our results, performed on an IBM POWER7 machine, a state-of-the-art, aggressive multi-threaded system, show that our approach outperforms several well-known STM implementations. In particular, STM2 shows speedups between 1.8x and 5.2x over the tested STM systems, on average, with peaks up to 12.8x.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116922373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 19
Scalable Proximity-Aware Cache Replication in Chip Multiprocessors 芯片多处理器中可扩展的邻近感知缓存复制
Chongmin Li, Haixia Wang, Y. Xue, Dongsheng Wang, Jian Li
{"title":"Scalable Proximity-Aware Cache Replication in Chip Multiprocessors","authors":"Chongmin Li, Haixia Wang, Y. Xue, Dongsheng Wang, Jian Li","doi":"10.1109/PACT.2011.35","DOIUrl":"https://doi.org/10.1109/PACT.2011.35","url":null,"abstract":"We propose Proximity-Aware cache Replication (PAR), an LLC replication technique that elegantly integrates an intelligent cache replication placement mechanism and a hierarchical directory-based coherence protocol into one cost-effective and scalable design. Simulation results on a 64-core CMP show that PAR can achieve 12% speedup over the baseline shared cache design with SPLASH2 and PARSEC workloads. It also provides around 5% speedup over a couple contemporary approaches with much simpler and scalable support.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126326633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Collaborative Caching for Unknown Cache Sizes 未知缓存大小的协同缓存
Xiaoming Gu
{"title":"Collaborative Caching for Unknown Cache Sizes","authors":"Xiaoming Gu","doi":"10.1109/PACT.2011.50","DOIUrl":"https://doi.org/10.1109/PACT.2011.50","url":null,"abstract":"A number of hardware systems have been built or proposed to provide an interface for software to influence cache management. The combined software-hardware solution is called collaborative caching. Our previous work showed that in theory collaborative caching with LRU and MRU may enable a program to manage cache optimally. In this work we first present a prioritized LRU model. For each memory access, a program specifies a priority, the target cache position for the accessed datum, for all cache sizes. We have proved that the prioritized LRU holds inclusion property. Alternatively, we describe a dynamic cache control scheme based on the associated priority. The limitation of knowing cache size in our LRU-MRU collaborative caching work is removed.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121205192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
An Evaluation of Vectorizing Compilers 向量化编译器的评价
Saeed Maleki, Yaoqing Gao, M. Garzarán, Tommy Wong, D. Padua
{"title":"An Evaluation of Vectorizing Compilers","authors":"Saeed Maleki, Yaoqing Gao, M. Garzarán, Tommy Wong, D. Padua","doi":"10.1109/PACT.2011.68","DOIUrl":"https://doi.org/10.1109/PACT.2011.68","url":null,"abstract":"Most of today's processors include vector units that have been designed to speedup single threaded programs. Although vector instructions can deliver high performance, writing vector code in assembly language or using intrinsics in high level languages is a time consuming and error-prone task. The alternative is to automate the process of vectorization by using vectorizing compilers. This paper evaluates how well compilers vectorize a synthetic benchmark consisting of 151 loops, two application from Petascale Application Collaboration Teams (PACT), and eight applications from Media Bench II. We evaluated three compilers: GCC (version 4.7.0), ICC (version 12.0) and XLC (version 11.01). Our results show that despite all the work done in vectorization in the last 40 years 45-71% of the loops in the synthetic benchmark and only a few loops from the real applications are vectorized by the compilers we evaluated.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124112634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 216
Optimizing Regular Expression Matching with SR-NFA on Multi-Core Systems 基于SR-NFA的多核正则表达式匹配优化
Y. Yang, V. Prasanna
{"title":"Optimizing Regular Expression Matching with SR-NFA on Multi-Core Systems","authors":"Y. Yang, V. Prasanna","doi":"10.1109/PACT.2011.73","DOIUrl":"https://doi.org/10.1109/PACT.2011.73","url":null,"abstract":"Conventionally, regular expression matching (REM) has been performed by sequentially comparing the regular expression (regex) to the input stream, which can be slow due to excessive backtracking (smith:acsac06). Alternatively, the regex can be converted to a deterministic finite automaton (DFA) for efficient matching, which however may require an extremely large state transition table (STT) due to exponential state explosion (meyer:swat71, yu:ancs06). We propose the segmented regex-NFA (SR-NFA) architecture, where the regex is first compiled into modular nondeterministic finite automata (NFA), then partitioned, optimized, and matched efficiently on modern multi-core processors. SR-NFA offers attack-resilient multi-gigabit per second matching throughput, does not suffer from either backtracking or state explosion, and can be rapidly constructed. For regex sets that construct a DFA with moderate state explosion, i.e., on average 200k states in the STT, the proposed SR-NFA is 367k times faster to construct and update and use 23k times less memory than the DFA approach. Running on an 8-core 2.6 GHz Opteron platform, our prototype achieves 2.2 Gbps average matching throughput for regex sets with up to 4,000 SR-NFA states per regex set.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131634169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
No More Backstabbing... A Faithful Scheduling Policy for Multithreaded Programs 不再背后捅刀子……多线程程序的忠实调度策略
K. Pusukuri, Rajiv Gupta, L. Bhuyan
{"title":"No More Backstabbing... A Faithful Scheduling Policy for Multithreaded Programs","authors":"K. Pusukuri, Rajiv Gupta, L. Bhuyan","doi":"10.1109/PACT.2011.8","DOIUrl":"https://doi.org/10.1109/PACT.2011.8","url":null,"abstract":"Efficient contention management is the key to achieving scalable performance for multithreaded applications running on multicore systems. However, contention management policies provided by modern operating systems increase context-switches and lead to performance degradation for multithreaded applications under high loads. Moreover, this problem is exacerbated by the interaction between contention management policies and OS scheduling polices. Time Share (TS) is the default scheduling policy in a modern OS such as Open Solaris and with TS policy, priorities of threads change very frequently for balancing load and providing fairness in scheduling. Due to the frequent ping-ponging of priorities, threads of an application are often preempted by the threads of the same application. This increases the frequency of involuntary context-switches as wells as lock-holder thread preemptions and leads to poor performance. This problem becomes very serious under high loads. To alleviate this problem, in this paper, we present a scheduling policy called Faithful Scheduling (FF), which dramatically reduces context-switches as well as lock-holder thread preemptions. We implemented FF on a 24-core Dell Power Edge R905 server running OpenSolaris.2009.06 and evaluated it using 22 programs including the TATP database application, SPECjbb2005, programs from PARSEC, SPEC OMP, and some micro benchmarks. The experimental results show that FF policy achieves high performance for both lightly and heavily loaded systems. Moreover it does not require any changes to the application source code or the OS kernel.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130350185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 18
Phase-Based Application-Driven Hierarchical Power Management on the Single-chip Cloud Computer 单片云计算机上基于相位的应用驱动分层电源管理
Nikolas Ioannou, M. Kauschke, M. Gries, Marcelo H. Cintra
{"title":"Phase-Based Application-Driven Hierarchical Power Management on the Single-chip Cloud Computer","authors":"Nikolas Ioannou, M. Kauschke, M. Gries, Marcelo H. Cintra","doi":"10.1109/PACT.2011.19","DOIUrl":"https://doi.org/10.1109/PACT.2011.19","url":null,"abstract":"To improve energy efficiency processors allow for Dynamic Voltage and Frequency Scaling (DVFS), which enables changing their performance and power consumption on-the-fly. Many-core architectures, such as the Single-chip Cloud Computer (SCC) experimental processor from Intel Labs, have DVFS infrastructures that scale by having many more independent voltage and frequency domains on-die than today's multi-cores. This paper proposes a novel, hierarchical, and transparent client-server power management scheme applicable to such architectures. The scheme tries to minimize energy consumption within a performance window taking into consideration not only the local information for cores within frequency domains but also information that spans multiple frequency and voltage domains. We implement our proposed hierarchical power control using a novel application-driven phase detection and prediction approach for Message Passing Interface (MPI) applications, a natural choice on the SCC with its fast on-chip network and its non-coherent memory hierarchy. This phase predictor operates as the front-end to the hierarchical DVFS controller, providing the necessary DVFS scheduling points. Experimental results with SCC hardware show that our approach provides significant improvement of the Energy Delay Product (EDP) of as much as 27.2%, and 11.4% on average, with an average increase in execution time of 7.7% over a baseline version without DVFS. These improvements come from both improved phase prediction accuracy and more effective DVFS control of the domains, compared to existing approaches.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130775310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 50
Scalable and Efficient Bounds Checking for Large-Scale CMP Environments 大规模CMP环境的可扩展和高效边界检查
Baik Song An, K. H. Yum, Eun Jung Kim
{"title":"Scalable and Efficient Bounds Checking for Large-Scale CMP Environments","authors":"Baik Song An, K. H. Yum, Eun Jung Kim","doi":"10.1109/PACT.2011.36","DOIUrl":"https://doi.org/10.1109/PACT.2011.36","url":null,"abstract":"We attempt to provide an architectural support for fast and efficient bounds checking for multithread work-loads in chip-multiprocessor (CMP) environments. Bounds information sharing and smart tagging help to perform bounds checking more effectively utilizing the characteristics of a pointer. Also, the BCache architecture allows fast access to the bounds information. Simulation results show that the proposed scheme increases ¥ìPC of memory operations by 29% on average compared to the previous hardware scheme.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122553906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Modeling and Performance Evaluation of TSO-Preserving Binary Optimization 保持tso的二值优化建模与性能评价
Cheng Wang, Youfeng Wu
{"title":"Modeling and Performance Evaluation of TSO-Preserving Binary Optimization","authors":"Cheng Wang, Youfeng Wu","doi":"10.1109/PACT.2011.69","DOIUrl":"https://doi.org/10.1109/PACT.2011.69","url":null,"abstract":"Program optimization on multi-core systems must preserve the program memory consistency. This paper studies TSO-preserving binary optimization. We introduce a novel approach to formally model TSO-preserving binary optimization based on the formal TSO memory model. The major contribution of the modeling is a sound and complete algorithm to verify TSO-preserving binary optimization with O(N2) complexity. We also developed a dynamic binary optimization system to evaluate the performance impact of TSO-preserving optimization. We show in our experiments that, dynamic binary optimization without memory optimizations can improve performance by 8.1%. TSO-preserving optimizations can further improve the performance by 4.8% to a total 12.9%. Without considering the restriction for TSO-preserving optimizations, the dynamic binary optimization can improve the overall performance to 20.4%.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"395 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113998939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
Understanding the Behavior of Pthread Applications on Non-Uniform Cache Architectures 理解非统一缓存架构上Pthread应用程序的行为
Gagandeep S. Sachdev, K. Sudan, Mary W. Hall, R. Balasubramonian
{"title":"Understanding the Behavior of Pthread Applications on Non-Uniform Cache Architectures","authors":"Gagandeep S. Sachdev, K. Sudan, Mary W. Hall, R. Balasubramonian","doi":"10.1109/PACT.2011.26","DOIUrl":"https://doi.org/10.1109/PACT.2011.26","url":null,"abstract":"Future scalable multi-core chips are expected to implement a shared last-level cache (LLC) with banks distributed on chip, forcing a core to incur non-uniform access latencies to each bank. Consequently, high performance and energy efficiency depend on whether a thread's data is placed in local or nearby banks. Using compiler and programmer support, we aim to find an alternative solution to existing high-overhead designs. In this paper, we take existing parallel programs written in Pthreads, and show the performance gap between current static mapping schemes, costly migration schemes and idealized static and dynamic best-case scenarios.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129967907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信