2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)最新文献

A software-SVM-based transactional memory for multicore accelerator architectures with local memory 一种基于软件svm的事务性内存，用于带有本地内存的多核加速器体系结构

2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2010-09-11 DOI: 10.1145/1854273.1854355

Jun Lee, Sangmin Seo, Jaejin Lee

{"title":"A software-SVM-based transactional memory for multicore accelerator architectures with local memory","authors":"Jun Lee, Sangmin Seo, Jaejin Lee","doi":"10.1145/1854273.1854355","DOIUrl":"https://doi.org/10.1145/1854273.1854355","url":null,"abstract":"We propose a software transactional memory (STM) for heterogeneous multicores with small local memory. The heterogeneous multicore architecture consists of a general-purpose processor element (GPE) and multiple accelerator processor elements (APEs). The GPE is typically backed by a deep, on-chip cache hierarchy and hardware cache coherence. On the other hand, the APEs have small, explicitly addressed local memory that is not coherent with the main memory. Programmers of such multicore architectures suffer from explicit memory management and coherence problems. The STM for such multicores can alleviate the burden of the programmer and transparently handle data transfers at run time. Moreover, it makes the programmer free from controlling locks. Our TM is based on an existing software SVM for the accelerator architecture. The software SVM exploits software-managed caches and coherence protocols between the GPE and APEs. We also propose an optimization technique, called abort prediction, for the TM. It blocks a transaction from running until the chance of potential conflicts is eliminated. We implement the TM system and the optimization technique for a single Cell BE processor and evaluate their effectiveness with six compute-intensive benchmark applications.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124388534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

AKULA: A toolset for experimenting and developing thread placement algorithms on multicore systems AKULA:用于在多核系统上实验和开发线程放置算法的工具集

2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2010-09-11 DOI: 10.1145/1854273.1854307

Sergey Zhuravlev, S. Blagodurov, Alexandra Fedorova

{"title":"AKULA: A toolset for experimenting and developing thread placement algorithms on multicore systems","authors":"Sergey Zhuravlev, S. Blagodurov, Alexandra Fedorova","doi":"10.1145/1854273.1854307","DOIUrl":"https://doi.org/10.1145/1854273.1854307","url":null,"abstract":"Multicore processors have become commonplace in both desktop and servers. A serious challenge with multicore processors is that cores share on and off chip resources such as caches, memory buses, and memory controllers. Competition for these shared resources between threads running on different cores can result in severe and unpredictable performance degradations. It has been shown in previous work that the OS scheduler can be made shared-resource-aware and can greatly reduce the negative effects of resource contention. The search space of potential scheduling algorithms is huge considering the diversity of available multicore architectures, an almost infinite set of potential workloads, and a variety of conflicting performance goals. We believe the two biggest obstacles to developing new scheduling algorithms are the difficulty of implementation and the duration of testing. We address both of these challenges with our toolset AKULA which we introduce in this paper. AKULA provides an API that allows developers to implement and debug scheduling algorithms easily and quickly without the need to modify the kernel or use system calls. AKULA also provides a rapid evaluation module, based on a novel evaluation technique also introduced in this paper, which allows the created scheduling algorithm to be tested on a wide variety of workloads in just a fraction of the time testing on real hardware would take. AKULA also facilitates running scheduling algorithms created with its API on real machines without the need for additional modifications. We use AKULA to develop and evaluate a variety of different contention-aware scheduling algorithms. We use the rapid evaluation module to test our algorithms on thousands of workloads and assess their scalability to futuristic massively multicore machines.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121362295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 30

Towards a science of parallel programming 走向并行编程的科学

2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2010-09-11 DOI: 10.1145/1854273.1854277

K. Pingali

引用次数: 4

Twin Peaks: A Software Platform for Heterogeneous Computing on General-Purpose and Graphics Processors 双峰:通用和图形处理器异构计算的软件平台

2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2010-09-11 DOI: 10.1145/1854273.1854302

J. Gummaraju, L. Morichetti, Michael Houston, B. Sander, Benedict R. Gaster, Bixia Zheng

{"title":"Twin Peaks: A Software Platform for Heterogeneous Computing on General-Purpose and Graphics Processors","authors":"J. Gummaraju, L. Morichetti, Michael Houston, B. Sander, Benedict R. Gaster, Bixia Zheng","doi":"10.1145/1854273.1854302","DOIUrl":"https://doi.org/10.1145/1854273.1854302","url":null,"abstract":"Modern processors are evolving into hybrid, heterogeneous processors with both CPU and GPU cores used for generalpurpose computation. Several languages such as Brook, CUDA , and more recently OpenCL are being developed to fully harness the potential of these processors. These languages typically involve the control code running on the CPU and the performance-critical, data-parallel kernel code running on the GPUs. In this paper, we present Twin Peaks, a software platform for heterogeneous computing that executes code originally targeted for GPUs effi ciently on CPUs as well. This permits a more balanced execution between the CPU and GPU, and enables portability of code between these architectures and to CPU-only environments. We propose several techniques in the runtime system to efficiently utilize the caches and functional units present in CPUs. Using OpenCL as a canonical language for heterogeneous computing, and running several experiments on real hardware, we show that our techniques enable GPGPU-style code to execute efficiently on multi core CPUs with minimal runtime overhead. These results also show that for maximum performance, it is beneficial for applications to utilize both CPUs and GPUs as accelerator targets. Categories a nd Subject D escriptors: D.1.3 [Programming Techniques] : Concurrent Programming G eneral Terms: Design , Experimentation, Performance. K eywords: GPGPU, Multicore , OpenCL, Programmability, Runtime.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124374599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 94

An empirical characterization of stream programs and its implications for language and compiler design 流程序的经验表征及其对语言和编译器设计的影响

2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2010-09-11 DOI: 10.1145/1854273.1854319

W. Thies, Saman P. Amarasinghe

{"title":"An empirical characterization of stream programs and its implications for language and compiler design","authors":"W. Thies, Saman P. Amarasinghe","doi":"10.1145/1854273.1854319","DOIUrl":"https://doi.org/10.1145/1854273.1854319","url":null,"abstract":"Stream programs represent an important class of high-performance computations. Defined by their regular processing of sequences of data, stream programs appear most commonly in the context of audio, video, and digital signal processing, though also in networking, encryption, and other areas. In order to develop effective compilation techniques for the streaming domain, it is important to understand the common characteristics of these programs. Prior characterizations of stream programs have examined legacy implementations in C, C++, or FORTRAN, making it difficult to extract the high-level properties of the algorithms. In this work, we characterize a large set of stream programs that was implemented directly in a stream programming language, allowing new insights into the high-level structure and behavior of the applications. We utilize the StreamIt benchmark suite, consisting of 65 programs and 33,600 lines of code. We characterize the bottlenecks to parallelism, the data reference patterns, the input/output rates, and other properties. The lessons learned have implications for the design of future architectures, languages and compilers for the streaming domain.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127038400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 175

Revisiting sorting for GPGPU stream architectures 回顾GPGPU流架构的排序

2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2010-09-11 DOI: 10.1145/1854273.1854344

D. Merrill, A. Grimshaw

引用次数: 162

Exploiting subtrace-level parallelism in clustered processors 利用集群处理器中的减迹级并行性

2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2010-09-11 DOI: 10.1145/1854273.1854349

R. Ubal, J. Sahuquillo, S. Petit, P. López, J. Duato

引用次数: 0

SWEL: Hardware cache coherence protocols to map shared data onto shared caches 将共享数据映射到共享缓存的硬件缓存一致性协议

2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2010-09-11 DOI: 10.1145/1854273.1854331

Seth H. Pugsley, J. Spjut, D. Nellans, R. Balasubramonian

{"title":"SWEL: Hardware cache coherence protocols to map shared data onto shared caches","authors":"Seth H. Pugsley, J. Spjut, D. Nellans, R. Balasubramonian","doi":"10.1145/1854273.1854331","DOIUrl":"https://doi.org/10.1145/1854273.1854331","url":null,"abstract":"Snooping and directory-based coherence protocols have become the de facto standard in chip multi-processors, but neither design is without drawbacks. Snooping protocols are not scalable, while directory protocols incur directory storage overhead, frequent indirections, and are more prone to design bugs. In this paper, we propose a novel coherence protocol that greatly reduces the number of coherence operations and falls back on a simple broadcast-based snooping protocol when infrequent coherence is required. This new protocol is based on the premise that most blocks are either private to a core or read-only, and hence, do not require coherence. This will be especially true for future large-scale multi-core machines that will be used to execute message-passing workloads in the HPC domain, or multiple virtual machines for servers. In such systems, it is expected that a very small fraction of blocks will be both shared and frequently written, hence the need to optimize coherence protocols for a new common case. In our new protocol, dubbed SWEL (protocol states are Shared, Written, Exclusivity Level), the L1 cache attempts to store only private or read-only blocks, while shared and written blocks must reside at the shared L2 level. These determinations are made at runtime without software assistance. While accesses to blocks banished from the L1 become more expensive, SWEL can improve throughput because directory indirection is removed for many common write-sharing patterns. Compared to a MESI based directory implementation, we see up to 15% increased performance, a maximum degradation of 2%, and an average performance increase of 2.5% using SWEL and its derivatives. Other advantages of this strategy are reduced protocol complexity (achieved by reducing transient states) and significantly less storage overhead than traditional directory protocols.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125223504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 59

Criticality-driven superscalar design space exploration 临界驱动的超标量设计空间探索

2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2010-09-11 DOI: 10.1145/1854273.1854308

Sandeep Navada, N. Choudhary, E. Rotenberg

引用次数: 11

Handling the problems and opportunities posed by multiple on-chip memory controllers 处理多个片上存储器控制器带来的问题和机会

2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2010-09-11 DOI: 10.1145/1854273.1854314

M. Awasthi, D. Nellans, K. Sudan, R. Balasubramonian, A. Davis

{"title":"Handling the problems and opportunities posed by multiple on-chip memory controllers","authors":"M. Awasthi, D. Nellans, K. Sudan, R. Balasubramonian, A. Davis","doi":"10.1145/1854273.1854314","DOIUrl":"https://doi.org/10.1145/1854273.1854314","url":null,"abstract":"Modern processors such as Tilera's Tile64, Intel's Nehalem, and AMD's Opteron are migrating memory controllers (MCs) on-chip, while maintaining a large, at memory address space. This trend to utilize multiple MCs will likely continue and a core or socket will consequently need to route memory requests to the appropriate MC via an inter- or intra-socket interconnect fabric similar to AMD's HyperTransport™, or Intel's Quick-Path Interconnect™. Such systems are therefore subject to non-uniform memory access (NUMA) latencies because of the time spent traveling to remote MCs. Each MC will act as the gateway to a particular piece of the physical memory. Data placement will therefore become increasingly critical in minimizing memory access latencies. To date, no prior work has examined the effects of data placement among multiple MCs in such systems. Future chip-multiprocessors are likely to comprise multiple MCs and an even larger number of cores. This trend will increase the memory access latency variation in these systems. Proper allocation of workload data to the appropriate MC will be important in reducing the latency of memory service requests. The allocation strategy will need to be aware of queuing delays, on-chip latencies, and row-buffer hit-rates for each MC. In this paper, we propose dynamic mechanisms that take these factors into account when placing data in appropriate slices of the physical memory. We introduce adaptive first-touch page placement, and dynamic page-migration mechanisms to reduce DRAM access delays for multi-MC systems. These policies yield average performance improvements of 17% for adaptive first-touch page-placement, and 35% for a dynamic page-migration policy.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131490986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 135