2011 International Conference on Parallel Architectures and Compilation Techniques: Latest Papers, Page 5

STM2: A Parallel STM for High Performance Simultaneous Multithreading Systems

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.54

Gokcen Kestor, R. Gioiosa, T. Harris, O. Unsal, A. Cristal, I. Hur, M. Valero

Abstract: Extracting high performance from modern chip multithreading (CMT) processors is a complex task, especially for large CMT systems. Programmers must efficiently parallelize performance-critical software while avoiding deadlocks and race conditions. Transactional memory (TM) is a promising programming model that allows programmers to focus on parallelism rather than maintaining correctness and avoiding deadlock. Software-only implementations (STMs) are especially compelling because they run on commodity hardware, therefore providing high portability. Unfortunately, STM systems usually suffer from high overheads, which may limit their usage, especially at scale. In this paper we present STM2, a novel parallel STM designed for high-performance, aggressive multithreading systems. STM2 significantly lowers runtime overhead by offloading read-set validation, bookkeeping, and conflict detection to auxiliary threads running on sibling hardware threads. Auxiliary threads perform STM operations in parallel with their paired application threads and absorb STM overhead, significantly improving performance. We exploit the fact that, on modern multi-core processors, sets of cores can share L1 or L2 caches. This lets us achieve closer coupling between the application thread and the auxiliary thread (compared with a traditional multiprocessor system). Our experiments, performed on an IBM POWER7 machine, a state-of-the-art, aggressive multithreaded system, show that our approach outperforms several well-known STM implementations. In particular, STM2 shows speedups between 1.8x and 5.2x over the tested STM systems, on average, with peaks up to 12.8x.

Citations: 19
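
Illustrative sketch (not taken from the paper): the STM2 abstract above describes offloading read-set validation to an auxiliary thread paired with each application thread. The Python below shows only that general idea, assuming a simple per-location version counter. The names `VersionedMemory`, `Transaction`, and `_validate_loop` are hypothetical, and Python's global interpreter lock means this demonstrates the division of work, not the performance benefit.

```python
import queue
import threading


class VersionedMemory:
    """Shared memory in which every location carries a version counter."""

    def __init__(self):
        self.values = {}
        self.versions = {}
        self.lock = threading.Lock()

    def read(self, addr):
        # Return the value together with the version observed at read time.
        with self.lock:
            return self.values.get(addr, 0), self.versions.get(addr, 0)

    def write(self, addr, value):
        # Every write bumps the version, which invalidates earlier readers.
        with self.lock:
            self.values[addr] = value
            self.versions[addr] = self.versions.get(addr, 0) + 1


class Transaction:
    """Application-side handle; validation runs on a paired helper thread."""

    def __init__(self, memory):
        self.memory = memory
        self.read_log = queue.Queue()      # application thread -> helper thread
        self.conflict = threading.Event()  # set by the helper on a stale read
        self.helper = threading.Thread(target=self._validate_loop, daemon=True)
        self.helper.start()

    def read(self, addr):
        # The application thread only logs the read; it does not validate.
        value, version = self.memory.read(addr)
        self.read_log.put((addr, version))
        return value

    def _validate_loop(self):
        # Helper thread: maintain the read set and re-check it on each event.
        read_set = {}
        while True:
            item = self.read_log.get()
            if item is not None:
                addr, version = item
                read_set.setdefault(addr, version)
            if any(self.memory.read(a)[1] != v for a, v in read_set.items()):
                self.conflict.set()
            if item is None:  # sentinel: final validation done, stop
                return

    def commit(self):
        # Ask the helper for a final validation and wait for its verdict.
        self.read_log.put(None)
        self.helper.join()
        return not self.conflict.is_set()


if __name__ == "__main__":
    mem = VersionedMemory()
    mem.write("x", 1)
    tx = Transaction(mem)
    tx.read("x")
    mem.write("x", 2)   # concurrent update invalidates the read
    print(tx.commit())
```

A real STM would also buffer writes, handle aborts and retries, and pin the helper to a sibling hardware thread; none of that is modeled here.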

Scalable Proximity-Aware Cache Replication in Chip Multiprocessors

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.35

Chongmin Li, Haixia Wang, Y. Xue, Dongsheng Wang, Jian Li

Citations: 0

Collaborative Caching for Unknown Cache Sizes

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.50

Xiaoming Gu

Citations: 0

An Evaluation of Vectorizing Compilers

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.68

Saeed Maleki, Yaoqing Gao, M. Garzarán, Tommy Wong, D. Padua

Citations: 216

Optimizing Regular Expression Matching with SR-NFA on Multi-Core Systems

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.73

Y. Yang, V. Prasanna

Abstract: Conventionally, regular expression matching (REM) has been performed by sequentially comparing the regular expression (regex) to the input stream, which can be slow due to excessive backtracking (smith:acsac06). Alternatively, the regex can be converted to a deterministic finite automaton (DFA) for efficient matching, which, however, may require an extremely large state transition table (STT) due to exponential state explosion (meyer:swat71, yu:ancs06). We propose the segmented regex-NFA (SR-NFA) architecture, where the regex is first compiled into modular nondeterministic finite automata (NFA), then partitioned, optimized, and matched efficiently on modern multi-core processors. SR-NFA offers attack-resilient, multi-gigabit-per-second matching throughput, does not suffer from either backtracking or state explosion, and can be rapidly constructed. For regex sets that construct a DFA with moderate state explosion, i.e., on average 200k states in the STT, the proposed SR-NFA is 367k times faster to construct and update and uses 23k times less memory than the DFA approach. Running on an 8-core 2.6 GHz Opteron platform, our prototype achieves 2.2 Gbps average matching throughput for regex sets with up to 4,000 SR-NFA states per regex set.

Citations: 9
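
Illustrative sketch (not the SR-NFA design): the abstract above contrasts backtracking matchers and DFAs with direct NFA matching. The Python below shows the textbook set-of-active-states NFA simulation, which reads each input character once and keeps memory proportional to the number of NFA states. The function name `nfa_match` and the hand-built automaton for `(a|b)*abb` are assumptions made for this example; the paper's segmentation, partitioning, and multi-core mapping are not modeled.

```python
def nfa_match(transitions, start, accepting, text):
    """Return True if the NFA accepts the whole of `text`.

    `transitions` maps (state, character) to the set of successor states.
    All currently reachable states are tracked at once, so the input is
    scanned a single time with no backtracking.
    """
    active = {start}
    for ch in text:
        active = {
            nxt
            for state in active
            for nxt in transitions.get((state, ch), ())
        }
        if not active:  # no state can consume this character
            return False
    return bool(active & accepting)


# Hand-built NFA for the regex (a|b)*abb.
# State 0 loops on any character and also "guesses" the start of "abb".
TRANSITIONS = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
ACCEPTING = {3}

if __name__ == "__main__":
    for sample in ["abb", "aababb", "abab", ""]:
        print(repr(sample), nfa_match(TRANSITIONS, 0, ACCEPTING, sample))
```

The active set never holds more than the four NFA states, whereas a DFA for a large regex set may need a far larger transition table, which is the trade-off the abstract refers to.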

No More Backstabbing... A Faithful Scheduling Policy for Multithreaded Programs

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.8

K. Pusukuri, Rajiv Gupta, L. Bhuyan

Abstract: Efficient contention management is the key to achieving scalable performance for multithreaded applications running on multicore systems. However, contention management policies provided by modern operating systems increase context switches and lead to performance degradation for multithreaded applications under high loads. Moreover, this problem is exacerbated by the interaction between contention management policies and OS scheduling policies. Time Share (TS) is the default scheduling policy in a modern OS such as OpenSolaris, and with the TS policy, priorities of threads change very frequently to balance load and provide fairness in scheduling. Due to the frequent ping-ponging of priorities, threads of an application are often preempted by threads of the same application. This increases the frequency of involuntary context switches as well as lock-holder thread preemptions and leads to poor performance. This problem becomes very serious under high loads. To alleviate this problem, in this paper, we present a scheduling policy called Faithful Scheduling (FF), which dramatically reduces context switches as well as lock-holder thread preemptions. We implemented FF on a 24-core Dell PowerEdge R905 server running OpenSolaris 2009.06 and evaluated it using 22 programs, including the TATP database application, SPECjbb2005, programs from PARSEC and SPEC OMP, and some microbenchmarks. The experimental results show that the FF policy achieves high performance for both lightly and heavily loaded systems. Moreover, it does not require any changes to the application source code or the OS kernel.

Citations: 18

Phase-Based Application-Driven Hierarchical Power Management on the Single-chip Cloud Computer

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.19

Nikolas Ioannou, M. Kauschke, M. Gries, Marcelo H. Cintra

Abstract: To improve energy efficiency, processors allow for Dynamic Voltage and Frequency Scaling (DVFS), which enables changing their performance and power consumption on-the-fly. Many-core architectures, such as the Single-chip Cloud Computer (SCC) experimental processor from Intel Labs, have DVFS infrastructures that scale by having many more independent voltage and frequency domains on-die than today's multi-cores. This paper proposes a novel, hierarchical, and transparent client-server power management scheme applicable to such architectures. The scheme tries to minimize energy consumption within a performance window, taking into consideration not only the local information for cores within frequency domains but also information that spans multiple frequency and voltage domains. We implement our proposed hierarchical power control using a novel application-driven phase detection and prediction approach for Message Passing Interface (MPI) applications, a natural choice on the SCC with its fast on-chip network and its non-coherent memory hierarchy. This phase predictor operates as the front-end to the hierarchical DVFS controller, providing the necessary DVFS scheduling points. Experimental results with SCC hardware show that our approach provides significant improvement of the Energy Delay Product (EDP) of as much as 27.2%, and 11.4% on average, with an average increase in execution time of 7.7% over a baseline version without DVFS. These improvements come from both improved phase prediction accuracy and more effective DVFS control of the domains, compared to existing approaches.

Citations: 50
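
Illustrative arithmetic (not a result reported by the paper): the abstract above quotes an average EDP improvement and an average execution-time increase. Since EDP is energy multiplied by delay, the Python below shows how an energy ratio would follow from those two figures if both applied to one representative run. That is an assumption: averages across benchmarks do not combine exactly this way, so the output is only a rough reading aid. The function names are hypothetical.

```python
def edp(energy_joules, delay_seconds):
    """Energy Delay Product: energy consumed multiplied by execution time."""
    return energy_joules * delay_seconds


def implied_energy_ratio(edp_improvement, time_increase):
    """Energy relative to baseline implied by an EDP gain and a slowdown.

    EDP_new / EDP_base = (E_new / E_base) * (T_new / T_base), hence
    E_new / E_base = (1 - edp_improvement) / (1 + time_increase).
    """
    return (1.0 - edp_improvement) / (1.0 + time_increase)


if __name__ == "__main__":
    # Figures from the abstract: 11.4% average EDP improvement,
    # 7.7% average execution-time increase.
    ratio = implied_energy_ratio(0.114, 0.077)
    print(f"implied energy relative to baseline: {ratio:.3f}")

    # Hypothetical baseline run (100 J, 10 s) used only as a consistency check.
    base = edp(100.0, 10.0)
    scaled = edp(100.0 * ratio, 10.0 * 1.077)
    print(f"EDP change: {(scaled - base) / base:+.1%}")
```

Under that single-run assumption, a slowdown can still improve EDP as long as energy falls by a larger factor than execution time grows.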

Scalable and Efficient Bounds Checking for Large-Scale CMP Environments

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.36

Baik Song An, K. H. Yum, Eun Jung Kim

Citations: 1

Modeling and Performance Evaluation of TSO-Preserving Binary Optimization

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.69

Cheng Wang, Youfeng Wu

Citations: 3

Understanding the Behavior of Pthread Applications on Non-Uniform Cache Architectures

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.26

Gagandeep S. Sachdev, K. Sudan, Mary W. Hall, R. Balasubramonian

Citations: 0