{"title":"Analyzing cache performance bottlenecks of STM applications and addressing them with compiler's help","authors":"Sandya Mannarswamy, R. Govindarajan","doi":"10.1145/1854273.1854345","DOIUrl":"https://doi.org/10.1145/1854273.1854345","url":null,"abstract":"Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs as an alternative to traditional lock based synchronization. However adoption of STM in mainstream software has been quite low due to its considerable overheads and its poor cache/memory performance. In this paper, we perform a detailed study of the cache behavior of STM applications and quantify the impact of different STM factors on the cache misses experienced by the applications. Based on our analysis, we propose a compiler driven Lock-Data Colocation (LDC), targeted at reducing the cache overheads on STM. We show that LDC is effective in improving the cache behavior of STM applications by reducing the dcache miss latency and improving execution time performance.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133386715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable thread scheduling and global power management for heterogeneous many-core architectures","authors":"Jonathan A. Winter, D. Albonesi, C. Shoemaker","doi":"10.1145/1854273.1854283","DOIUrl":"https://doi.org/10.1145/1854273.1854283","url":null,"abstract":"Future many-core microprocessors are likely to be heterogeneous, by design or due to variability and defects. The latter type of heterogeneity is especially challenging due to its unpredictability. To minimize the performance and power impact of these hardware imperfections, the runtime thread scheduler and global power manager must be nimble enough to handle such random heterogeneity. With hundreds of cores expected on a single die in the future, these algorithms must provide high power-performance efficiency, yet remain scalable with low runtime overhead. This paper presents a range of scheduling and power management algorithms and performs a detailed evaluation of their effectiveness and scalability on heterogeneous many-core architectures with up to 256 cores. We also conduct a limit study on the potential benefits of coordinating scheduling and power management and demonstrate that coordination yields little benefit. We highlight the scalability limitations of previously proposed thread scheduling algorithms that were designed for small-scale chip multiprocessors and propose a Hierarchical Hungarian Scheduling Algorithm that dramatically reduces the scheduling overhead without loss of accuracy. Finally, we show that the high computational requirements of prior global power management algorithms based on linear programming make them infeasible for many-core chips, and that an algorithm that we call Steepest Drop achieves orders of magnitude lower execution time without sacrificing power-performance efficiency.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"278 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115600870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-automatic extraction and exploitation of hierarchical pipeline parallelism using profiling information","authors":"Georgios Tournavitis, Björn Franke","doi":"10.1145/1854273.1854321","DOIUrl":"https://doi.org/10.1145/1854273.1854321","url":null,"abstract":"In recent years multi-core computer systems have left the realm of high-performance computing and virtually all of today's desktop computers and embedded computing systems are equipped with several processing cores. Still, no single parallel programming model has found widespread support and parallel programming remains an art for the majority of application programmers. In addition, there exists a plethora of sequential legacy applications for which automatic parallelization is the only hope to benefit from the increased processing power of modern multi-core systems. In the past automatic parallelization largely focused on data parallelism. In this paper we present a novel approach to extracting and exploiting pipeline parallelism from sequential applications. We use profiling to overcome the limitations of static data and control flow analysis enabling more aggressive parallelization. Our approach is orthogonal to existing automatic parallelization approaches and additional data parallelism may be exploited in the individual pipeline stages. The key contribution of this paper is a whole-program representation that supports profiling, parallelism extraction and exploitation. We demonstrate how this enhances conventional pipeline parallelization by incorporating support for multi-level loops and pipeline stage replication in a uniform and automatic way. We have evaluated our methodology on a set of multimedia and stream processing benchmarks and demonstrate speedups of up to 4.7 on a eight-core Intel Xeon machine.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126765487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamically managed multithreaded reconfigurable architectures for chip multiprocessors","authors":"Matthew A. Watkins, D. Albonesi","doi":"10.1145/1854273.1854284","DOIUrl":"https://doi.org/10.1145/1854273.1854284","url":null,"abstract":"Prior work has demonstrated that reconfigurable logic can significantly benefit certain applications. However, recon-figurable architectures have traditionally suffered from high area overhead and limited application coverage. We present a dynamically managed multithreaded reconfigurable architecture consisting of multiple clusters of shared reconfigurable fabrics that greatly reduces the area overhead of reconfigurability while still offering the same power efficiency and performance benefits. Like other shared SMT and CMP resources, the dynamic partitioning of the reconfigurable resource among sharing threads, along with the co-scheduling of threads among different reconfigurable clusters, must be intelligently managed for the full benefits of the shared fabrics to be realized. We propose a number of sophisticated dynamic management approaches, including the application of machine learning, multithreaded phase-based management, and stability detection. Overall, we show that, with our dynamic management policies, multithreaded reconfigurable fabrics can achieve better energy × delay2, at far less area and power, than providing each core with a much larger private fabric. Moreover, our approach achieves dramatically higher performance and energy-efficiency for particular workloads compared to what can be ideally achieved by allocating the fabric area to additional cores.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129364383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and implementation of the PLUG architecture for programmable and efficient network lookups","authors":"Amit Kumar, Lorenzo De Carli, Sung Jin Kim, M. Kruijf, K. Sankaralingam, Cristian Estan, S. Jha","doi":"10.1145/1854273.1854316","DOIUrl":"https://doi.org/10.1145/1854273.1854316","url":null,"abstract":"This paper proposes a new architecture called Pipelined LookUp Grid (PLUG) that can perform data structure lookups in network processing. PLUGs are programmable and through simplicity achieve power efficiency. We draw upon the insights that data structure lookups have natural structure that can be statically determined and exploited. The PLUG execution model transforms data-structure lookups into pipelined stages of computation and associates small code-blocks with data. The PLUG architecture is a tiled architecture with each tile consisting predominantly of SRAMs, a lightweight no-buffering router, and an array of lightweight computation cores. Using a principle of fixed delays in the execution model, the architecture is contention-free and completely statically scheduled thus achieving high energy efficiency. The architecture enables rapid deployment of new network protocols and generalizes as a data-structure accelerator. This paper describes the PLUG architecture, the compiler, and evaluates our RTL prototype PLUG chip synthesized on a 55nm technology library. We evaluate six diverse high-end network processing workloads including IPv4, IPv6, and Ethernet forwarding. We show that at a 55nm technology, a 16-tile PLUG occupies 58mm2, provides 4MB on-chip storage, and sustains a clock frequency of 1 GHz. This translates to 1 billion lookups per second, a latency of 18ns to 219ns, and average power less than 1 watt.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129133678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power and thermal characterization of POWER6 system","authors":"Víctor Jiménez, F. Cazorla, R. Gioiosa, M. Valero, C. Boneti, E. Kursun, Chen-Yong Cher, C. Isci, A. Buyuktosunoglu, P. Bose","doi":"10.1145/1854273.1854281","DOIUrl":"https://doi.org/10.1145/1854273.1854281","url":null,"abstract":"Controlling power consumption and temperature is of major concern for modern computing systems. In this work we characterize thermal behavior and power consumption of an IBM POWER6™-based system. We perform the characterization at several levels: application, operating system, and hardware level, both when the system is idle, and under load. At hardware level, we report a 25% reduction in total system power consumption by using the processor low power mode. We also study the effect of the hardware thread prioritization mechanism provided by POWER6 on different workloads and how this mechanism can be used to limit power consumption. At OS level, we analyze the power reduction techniques implemented in the Linux kernel, such as the tickless kernel and the CPU idle power manager. At application level, we characterize the power consumption and the temperature of two sets of benchmarks (METbench and SPEC CPU2006) and we study the effect of workload characteristics on power consumption and core temperature. From this characterization we derive a model based on performance counters that allows us to predict the total power consumption of the POWER6 system with an average error under 3% for CMP and 5% for SMT. To the best of our knowledge, this is the first power model of a system including CMP+SMT processors. Finally, we show that the static decision on whether to consolidate tasks into the same core/chip, as it is currently done in Linux, can be improved by dynamically considering the low-power capabilities of the underlying architecture and the characteristics of the application (up to 5X improvement in ED2P).","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126446046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The potential of using dynamic information flow analysis in data value prediction","authors":"Walid J. Ghandour, Haitham Akkary, Wes Masri","doi":"10.1145/1854273.1854327","DOIUrl":"https://doi.org/10.1145/1854273.1854327","url":null,"abstract":"Value prediction is a technique to increase parallelism by attempting to overcome serialization constraints caused by true data dependences. By predicting the outcome of an instruction before it executes, value prediction allows data dependent instructions to issue and execute speculatively, hence increasing parallelism when the prediction is correct. In case of a misprediction, the execution is redone with the corrected value. If the benefit from increased parallelism outweighs the misprediction recovery penalty, overall performance could be improved. Enhancing performance with value prediction therefore requires highly accurate prediction methods. Most existing general value prediction techniques are local and future outputs of an instruction are predicted based on outputs from previous executions of the same instruction.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132233020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Raising the level of many-core programming with compiler technology - meeting a grand challenge","authors":"Wen-mei W. Hwu","doi":"10.1145/1854273.1854279","DOIUrl":"https://doi.org/10.1145/1854273.1854279","url":null,"abstract":"Modern GPUs and CPUs are massively parallel, many-core processors. While application developers for these many-core chips are reporting 10X-100X speedup over sequential code on traditional microprocessors, the current practice of many-core programming based on OpenCL, CUDA, and OpenMP puts strain on software development, testing and support teams. According to the semiconductor industry roadmap, these processors could scale up to over 1,000X speedup over single cores by the end of the year 2016. Such a dramatic performance difference between parallel and sequential execution will motivate an increasing number of developers to parallelize their applications. Today, an application programmer has to understand the desirable parallel programming idioms, manually work around potential hardware performance pitfalls, and restructure their application design in order to achieve their performance objectives on many-core processors. In this presentation, I will discuss why advanced compiler functionalities have not found traction with the developer communities, what the industry is doing today to try to address the challenges, and how the academic community can contribute to this exciting revolution.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1996 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130958953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simple and fast biased locks","authors":"N. Vasudevan, Kedar S. Namjoshi, S. Edwards","doi":"10.1145/1854273.1854287","DOIUrl":"https://doi.org/10.1145/1854273.1854287","url":null,"abstract":"Locks are used to ensure exclusive access to shared memory locations. Unfortunately, lock operations are expensive, so much work has been done on optimizing their performance for common access patterns. One such pattern is found in networking applications, where there is a single thread dominating lock accesses. An important special case arises when a single-threaded program calls a thread-safe library that uses locks. An effective way to optimize the dominant-thread pattern is to “bias” the lock implementation so that accesses by the dominant thread have negligible overhead. We take this approach in this work: we simplify and generalize existing techniques for biased locks, producing a large design space with many trade-offs. For example, if we assume the dominant process acquires the lock infinitely often (a reasonable assumption for packet processing), it is possible to make the dominant process perform a lock operation without expensive fence or compare-and-swap instructions. This gives a very low overhead solution; we confirm its efficacy by experiments. We show how these constructions can be extended for lock reservation, re-reservation, and to reader-writer situations.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130756410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MapCG: Writing parallel program portable between CPU and GPU","authors":"Chuntao Hong, Dehao Chen, Wenguang Chen, Weimin Zheng, Haibo Lin","doi":"10.1145/1854273.1854303","DOIUrl":"https://doi.org/10.1145/1854273.1854303","url":null,"abstract":"Graphics Processing Units (GPU) have been playing an important role in the general purpose computing market recently. The common approach to program GPU today is to write GPU specific code with low level GPU APIs such as CUDA. Although this approach can achieve very good performance, it raises serious portability issues: programmers are required to write a specific version of code for each potential target architecture. It results in high development and maintenance cost.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116946296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}