2011 International Conference on Parallel Architectures and Compilation Techniques: Latest Publications

Programming Strategies for GPUs and their Power Consumption
Sayan Ghosh, B. Chapman
DOI: 10.1109/PACT.2011.51 (published 2011-10-10)
Abstract: GPUs are slowly becoming ubiquitous devices in high performance computing. Nvidia's newly released version 4.0 of the CUDA API [2] for GPU programming offers multiple ways to program GPUs and emphasizes multi-GPU environments, which are common in modern compute clusters. However, despite the continued progress in FLOP counts, the bane of large-scale computing systems has been increased energy consumption and cooling costs. Since the energy (power × time) of a system has an obvious correlation with the user program, different programming techniques on GPUs could have a relation to the overall system energy consumption.
Citations: 1
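
The abstract ties program-level choices to energy as power × time. As a rough illustration of how that relationship could be measured on an Nvidia GPU, the sketch below samples board power through NVML while a workload runs and integrates the samples into an energy estimate. The workload routine and the 10 ms sampling interval are placeholders chosen for this example, not anything from the paper.

```c
/* Minimal sketch: estimate GPU energy as average sampled power x elapsed time.
 * Assumes NVML is available (link with -lnvidia-ml -lpthread). */
#include <nvml.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

/* Placeholder for the GPU programming strategy under test (e.g., a sequence
 * of CUDA kernel launches); replace with real work. */
static void run_workload(void) { sleep(2); }

static atomic_int done = 0;

static void *workload_thread(void *arg) {
    (void)arg;
    run_workload();
    atomic_store(&done, 1);
    return NULL;
}

int main(void) {
    nvmlDevice_t dev;
    if (nvmlInit() != NVML_SUCCESS ||
        nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) {
        fprintf(stderr, "NVML unavailable\n");
        return 1;
    }

    pthread_t t;
    pthread_create(&t, NULL, workload_thread, NULL);

    double joules = 0.0, seconds = 0.0;
    while (!atomic_load(&done)) {            /* sample board power every 10 ms */
        unsigned int mw = 0;
        if (nvmlDeviceGetPowerUsage(dev, &mw) == NVML_SUCCESS)
            joules += (mw / 1000.0) * 0.01;  /* watts x 0.01 s */
        seconds += 0.01;
        usleep(10000);
    }
    pthread_join(t, NULL);

    printf("~%.1f J over ~%.2f s (avg %.1f W)\n",
           joules, seconds, seconds > 0 ? joules / seconds : 0.0);
    nvmlShutdown();
    return 0;
}
```
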
SymptomTM: Symptom-Based Error Detection and Recovery Using Hardware Transactional Memory
Gulay Yalcin, O. Unsal, A. Cristal, I. Hur, M. Valero
DOI: 10.1109/PACT.2011.39 (published 2011-10-10)
Abstract: Fault tolerance has become an essential concern for processor designers due to increasing transient and permanent fault rates. In this study we propose SymptomTM, a symptom-based error detection technique that recovers from errors by leveraging the abort mechanism of Transactional Memory (TM). To the best of our knowledge, this is the first architectural fault-tolerance proposal using Hardware Transactional Memory (HTM). SymptomTM can recover from 86% and 65% of catastrophic failures caused by transient and permanent errors, respectively, with no performance overhead in error-free executions.
Citations: 12
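
The recovery idea is that a transaction abort doubles as a rollback mechanism once an error symptom is detected. On processors with Intel TSX/RTM that idea can be approximated in software; the sketch below is a conceptual stand-in for the paper's hardware proposal, not its actual mechanism. A detected symptom (here just a placeholder check) triggers _xabort, the hardware discards the speculative updates, and the code retries.

```c
/* Conceptual sketch of symptom-triggered rollback using Intel RTM
 * (compile with -mrtm, run on a TSX-capable CPU). detect_symptom() is a
 * hypothetical placeholder for the paper's hardware symptom detectors
 * (e.g., fatal traps or hangs); this is not SymptomTM itself. */
#include <immintrin.h>
#include <stdio.h>

#define SYMPTOM_ABORT_CODE 0x7B

static int detect_symptom(void) { return 0; }   /* placeholder: no error */

static long counter = 0;

static int run_region_with_recovery(int max_retries) {
    for (int attempt = 0; attempt < max_retries; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            counter += 1;                     /* speculative work in the txn */
            if (detect_symptom())
                _xabort(SYMPTOM_ABORT_CODE);  /* roll back all txn updates */
            _xend();                          /* commit: no symptom observed */
            return 0;
        }
        /* Aborted: hardware discarded the transactional updates; retry. */
    }
    return -1;  /* a real system would fall back to a non-speculative handler */
}

int main(void) {
    if (run_region_with_recovery(3) == 0)
        printf("region committed, counter=%ld\n", counter);
    return 0;
}
```
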
POPS: Coherence Protocol Optimization for Both Private and Shared Data
Hemayet Hossain, S. Dwarkadas, Michael C. Huang
DOI: 10.1109/PACT.2011.11 (published 2011-10-10)
Abstract: As the number of cores in a chip multiprocessor (CMP) increases, the need for larger on-chip caches also increases in order to avoid creating a bottleneck at the off-chip interconnect. Utilization of these CMPs includes combinations of multithreading and multiprogramming, showing a range of sharing behavior, from frequent inter-thread communication to no communication. The goal of CMP cache design is to maximize capacity for a given size while providing as low a latency as possible for the entire range of sharing behavior. In a typical CMP design, the last-level cache (LLC) is shared across the cores and incurs an access latency that is a function of distance on the chip. Sharing helps avoid the need for replicas at the LLC and allows access to the entire on-chip cache space by any core. However, the cost is the increased latency of communication based on where data is mapped on the chip. In this paper, we propose a cache coherence design, called POPS, that provides localized data and metadata access for both shared data (in multithreaded workloads) and private data (predominant in multiprogrammed workloads). POPS achieves its goal by (1) decoupling data and metadata, allowing both to be delegated to local LLC slices for private data and between sharers for shared data, (2) freeing delegated data storage in the LLC for larger effective capacity, and (3) changing the delegation and/or coherence protocol action based on the observed sharing pattern. Our analysis on an execution-driven full-system simulator using multithreaded and multiprogrammed workloads shows that POPS performs 42% (28% without microbenchmarks) better for multithreaded workloads, 16% better for multiprogrammed workloads, and 8% better when a single-threaded application is the only running process, compared to the base non-uniform shared L2 protocol. POPS has the added benefits of reduced on-chip and off-chip traffic and reduced dynamic energy consumption.
Citations: 44
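
POPS adapts its delegation and coherence actions to the observed sharing pattern. A toy software analogue of that classification step, and only of that step, is sketched below: a directory entry remembers the first core to touch a block and promotes the block from private to shared when a different core accesses it. This is an illustration of sharing-pattern tracking in general, not the POPS protocol.

```c
/* Toy classification of cache blocks as private vs. shared, in the spirit of
 * sharing-pattern tracking; illustrative only, not the POPS hardware design. */
#include <stdio.h>

enum state { UNTOUCHED, PRIVATE, SHARED };

struct dir_entry {
    enum state st;
    int owner;          /* valid only while st == PRIVATE */
};

/* Record an access by `core` and return the block's resulting state. */
static enum state on_access(struct dir_entry *e, int core) {
    switch (e->st) {
    case UNTOUCHED: e->st = PRIVATE; e->owner = core; break;
    case PRIVATE:   if (e->owner != core) e->st = SHARED; break;
    case SHARED:    break;
    }
    return e->st;
}

int main(void) {
    struct dir_entry e = { UNTOUCHED, -1 };
    printf("%d\n", on_access(&e, 0));   /* PRIVATE: first touch by core 0 */
    printf("%d\n", on_access(&e, 0));   /* still PRIVATE */
    printf("%d\n", on_access(&e, 3));   /* SHARED: a second core observed */
    return 0;
}
```
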
Correctly Treating Synchronizations in Compiling Fine-Grained SPMD-Threaded Programs for CPU
Ziyu Guo, E. Zhang, Xipeng Shen
DOI: 10.1109/PACT.2011.62 (published 2011-10-10)
Abstract: Automatic compilation for multiple types of devices is important, especially given the current trends towards heterogeneous computing. This paper concentrates on some issues in compiling fine-grained SPMD-threaded code (e.g., GPU CUDA code) for multicore CPUs. It points out some correctness pitfalls in existing techniques, particularly in their treatment of implicit synchronizations. It then describes a systematic dependence analysis specially designed for handling implicit synchronizations in SPMD-threaded programs. By unveiling the relations between inter-thread data dependences and the correct treatment of synchronizations, it presents a dependence-based solution to the problem. Experiments demonstrate that the proposed techniques can resolve the correctness issues in existing compilation techniques and help compilers produce correct and efficient translation results.
Citations: 15
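
A common way to run fine-grained SPMD code on a CPU is to serialize the threads of a block into loops and split those loops at synchronization points, so that every logical thread finishes the pre-barrier work before any post-barrier work starts. The hand-written sketch below shows the shape of such a translation for a hypothetical kernel; it illustrates the general technique the paper analyzes, not the paper's own dependence-based algorithm.

```c
/* Illustration of "thread-loop fission" around a barrier: a CUDA-like kernel
 *     tmp[tid] = in[tid];  __syncthreads();  out[tid] = tmp[N-1-tid];
 * becomes two serial loops over the thread IDs. */
#include <stdio.h>
#define N 8

static void kernel_on_cpu(const int *in, int *out) {
    int tmp[N];                            /* per-block shared-memory analogue */
    for (int tid = 0; tid < N; ++tid)      /* region before __syncthreads() */
        tmp[tid] = in[tid];
    for (int tid = 0; tid < N; ++tid)      /* region after __syncthreads() */
        out[tid] = tmp[N - 1 - tid];
}

int main(void) {
    int in[N], out[N];
    for (int i = 0; i < N; ++i) in[i] = i;
    kernel_on_cpu(in, out);
    for (int i = 0; i < N; ++i) printf("%d ", out[i]);  /* 7 6 5 ... 0 */
    printf("\n");
    return 0;
}
```
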
Probabilistic Models Towards Optimal Speculation of DFA Applications
Zhijia Zhao, Bo Wu
DOI: 10.1109/PACT.2011.53 (published 2011-10-10)
Abstract: Applications based on Deterministic Finite Automata (DFA) are important for many tasks, including lexing in web browsers, routing in networks, decoding in cryptography, and so on. The efficiency of these applications is often critical, but parallelizing them is difficult due to strong dependences among states. Recent years have seen some employment of speculative execution to address that problem. Even though some promising results have been shown, existing designs are all static, lacking the capability to adapt to specific DFA applications and inputs to maximize the speculation benefits. In this work, we initiate an exploration of the inherent relations between the design of speculation schemes and the properties of DFAs and their inputs. After revealing some theoretical findings about these relations, we develop a model-based approach to maximizing the performance of speculatively executed DFA-based applications. Experiments demonstrate that the developed techniques can accelerate speculative executions by integer factors compared to the state-of-the-art techniques.
Citations: 2
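
The baseline idea behind speculative DFA parallelization, which the paper's probabilistic models tune, is to cut the input into chunks, start each chunk from a guessed state, process the chunks in parallel, and re-run a chunk from its true start state if the guess turns out wrong. A minimal sketch with two chunks is shown below; the DFA and the guess policy (always start a chunk in state 0) are made up for illustration.

```c
/* Minimal sketch of speculative DFA execution on a chunked input.
 * The DFA (even/odd count of 'a') and the guess policy are illustrative
 * choices, not from the paper; the chunks are also run serially here. */
#include <stdio.h>
#include <string.h>

static int step(int s, char c) { return (c == 'a') ? 1 - s : s; }

static int run(int s, const char *buf, size_t n) {
    for (size_t i = 0; i < n; ++i) s = step(s, buf[i]);
    return s;
}

int main(void) {
    const char *input = "abcaabba";
    size_t n = strlen(input), half = n / 2;

    /* Chunk 0 runs from the true start state; chunk 1 runs speculatively
     * from a guessed state (0). In a real system they run in parallel. */
    int end0  = run(0, input, half);
    int guess = 0;
    int end1  = run(guess, input + half, n - half);

    /* Verification: if the guess was wrong, re-execute chunk 1. */
    int final = (end0 == guess) ? end1 : run(end0, input + half, n - half);
    printf("final state: %d\n", final);
    return 0;
}
```
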
A Unified Scheduler for Recursive and Task Dataflow Parallelism
H. Vandierendonck, George Tzenakis, Dimitrios S. Nikolopoulos
DOI: 10.1109/PACT.2011.7 (published 2011-10-10)
Abstract: Task dataflow languages simplify the specification of parallel programs by dynamically detecting and enforcing dependencies between tasks. These languages are, however, often restricted to a single level of parallelism. This language design is reflected in the runtime system, where a master thread explicitly generates a task graph and worker threads execute ready tasks and wake up their dependents. Such an approach is incompatible with state-of-the-art schedulers such as the Cilk scheduler, which minimize the creation of idle tasks (the work-first principle) and place all task creation and scheduling off the critical path. This paper proposes an extension to the Cilk scheduler in order to reconcile task dependencies with the work-first principle. We discuss the impact of task dependencies on the properties of the Cilk scheduler. Furthermore, we propose a low-overhead ticket-based technique for dependency tracking and enforcement at the object level. Our scheduler also supports renaming of objects in order to increase task-level parallelism. Renaming is implemented using versioned objects, a new type of hyperobject. Experimental evaluation shows that the unified scheduler is as efficient as the Cilk scheduler when tasks have no dependencies. Moreover, the unified scheduler is more efficient than SMPSs, a particular implementation of a task dataflow language.
Citations: 41
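
The ticket idea the authors mention can be pictured as a per-object pair of counters, much like a ticket lock: a task that will use an object takes the next ticket when it is created, and it becomes ready only when the object's "now serving" counter reaches that ticket. The sketch below is a simplified single-object illustration with C11 atomics, not the paper's unified Cilk-based scheduler.

```c
/* Simplified per-object ticket scheme for ordering dependent tasks,
 * illustrating low-overhead dependency tracking at the object level. */
#include <stdatomic.h>
#include <stdio.h>

struct versioned_obj {
    atomic_uint next_ticket;   /* taken by tasks at creation time */
    atomic_uint now_serving;   /* bumped when a task releases the object */
    int data;
};

static unsigned acquire_ticket(struct versioned_obj *o) {
    return atomic_fetch_add(&o->next_ticket, 1);
}

static int task_ready(struct versioned_obj *o, unsigned ticket) {
    return atomic_load(&o->now_serving) == ticket;
}

static void release_obj(struct versioned_obj *o) {
    atomic_fetch_add(&o->now_serving, 1);
}

int main(void) {
    struct versioned_obj o = { 0, 0, 0 };
    unsigned t0 = acquire_ticket(&o);   /* task A: writes o first */
    unsigned t1 = acquire_ticket(&o);   /* task B: depends on A's write */

    /* A scheduler would only run a task whose ticket is being served. */
    if (task_ready(&o, t0)) { o.data = 42; release_obj(&o); }
    if (task_ready(&o, t1)) { printf("B sees %d\n", o.data); release_obj(&o); }
    return 0;
}
```
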
DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory
Carlos Villavieja, Vasileios Karakostas, L. Vilanova, Yoav Etsion, Alex Ramírez, A. Mendelson, N. Navarro, A. Cristal, O. Unsal
DOI: 10.1109/PACT.2011.65 (published 2011-10-10)
Abstract: Translation Lookaside Buffers (TLBs) are ubiquitously used in modern architectures to cache virtual-to-physical mappings and, as they are looked up on every memory access, are paramount to performance scalability. The emergence of chip multiprocessors (CMPs) with per-core TLBs has brought the problem of TLB coherence to the forefront. TLBs are kept coherent at the software level by the operating system (OS). Whenever the OS modifies page permissions in a page table, it must initiate a coherency transaction among TLBs, a process known as a TLB shootdown. Current CMPs rely on the OS to approximate the set of TLBs caching a mapping and synchronize TLBs using costly Inter-Processor Interrupts (IPIs) and software handlers. In this paper, we characterize the impact of TLB shootdowns on multiprocessor performance and scalability, and present the design of a scalable TLB coherence mechanism. First, we show that both TLB shootdown cost and frequency increase with the number of processors, and we project that software-based TLB shootdowns would thwart the performance of large multiprocessors. We then present a scalable architectural mechanism that couples a shared TLB directory with load/store queue support for lightweight TLB invalidation, thereby eliminating the need for costly IPIs. Finally, we show that the proposed mechanism reduces the fraction of machine cycles wasted on TLB shootdowns by an order of magnitude.
Citations: 101
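
The core structure is a shared directory that records, per tracked translation, which cores may hold it in their TLBs, so an invalidation can be delivered only to that subset instead of interrupting every core. The bookkeeping can be pictured as a sharer bitmask, as in the sketch below; this is a conceptual software illustration, whereas the paper implements the directory in hardware with load/store queue support.

```c
/* Conceptual TLB-directory bookkeeping: one sharer bitmask per tracked page,
 * so a permission change only needs to invalidate cores whose bit is set. */
#include <stdint.h>
#include <stdio.h>

#define NCORES 48        /* e.g., one bit per core of a 48-core CMP */

struct tlb_dir_entry {
    uint64_t vpn;        /* virtual page number being tracked */
    uint64_t sharers;    /* bit i set => core i may cache this translation */
};

static void record_fill(struct tlb_dir_entry *e, int core) {
    e->sharers |= (uint64_t)1 << core;       /* core loaded the mapping */
}

static int invalidate(struct tlb_dir_entry *e) {
    int sent = 0;
    for (int c = 0; c < NCORES; ++c)
        if (e->sharers & ((uint64_t)1 << c)) {
            /* send a lightweight invalidation to core c (no broadcast IPI) */
            ++sent;
        }
    e->sharers = 0;
    return sent;
}

int main(void) {
    struct tlb_dir_entry e = { 0x1234, 0 };
    record_fill(&e, 3);
    record_fill(&e, 17);
    printf("invalidations sent: %d\n", invalidate(&e));  /* 2, not 48 */
    return 0;
}
```
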
Large Scale Verification of MPI Programs Using Lamport Clocks with Lazy Update
Anh Vo, G. Gopalakrishnan, R. Kirby, B. Supinski, M. Schulz, G. Bronevetsky
DOI: 10.1109/PACT.2011.64 (published 2011-10-10)
Abstract: We propose a dynamic verification approach for large-scale message passing programs to locate correctness bugs caused by unforeseen nondeterministic interactions. This approach hinges on an efficient protocol to track the causality between nondeterministic message receive operations and potentially matching send operations. We show that causality tracking protocols that rely solely on logical clocks fail to capture all nuances of MPI program behavior, including the variety of ways in which nonblocking calls can complete. Our approach rests on formally defining the matches-before relation underlying the MPI standard and devising lazy-update, logical-clock-based algorithms that can correctly discover all potential outcomes of nondeterministic receives in practice. The resulting protocol, LLCP, achieves the same coverage as a vector-clock-based algorithm while maintaining good scalability. LLCP allows us to analyze realistic MPI programs involving a thousand MPI processes, incurring only modest overheads in terms of communication bandwidth, latency, and memory consumption.
Citations: 24
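
For reference, the classic Lamport clock rules that the protocol builds on (before adding its MPI-specific lazy updates) are: tick the local clock on each send and stamp the message with it, and on receive set the local clock to the maximum of its value and the stamp, then tick for the receive event. The minimal sketch below shows only that baseline, not the paper's LLCP.

```c
/* Textbook Lamport logical-clock updates, the baseline that lazy-update
 * schemes extend for MPI's nonblocking and nondeterministic operations. */
#include <stdio.h>

struct process { unsigned long clock; };

/* Sender: tick the clock; the result is piggybacked on the message. */
static unsigned long on_send(struct process *p) {
    return ++p->clock;
}

/* Receiver: merge the piggybacked timestamp, then tick for the receive event. */
static void on_receive(struct process *p, unsigned long msg_ts) {
    if (msg_ts > p->clock) p->clock = msg_ts;
    ++p->clock;
}

int main(void) {
    struct process p0 = { 0 }, p1 = { 0 };
    unsigned long ts = on_send(&p0);   /* p0.clock = 1, message carries 1 */
    on_receive(&p1, ts);               /* p1.clock = max(0,1) + 1 = 2 */
    printf("p0=%lu p1=%lu\n", p0.clock, p1.clock);
    return 0;
}
```
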
Improving Run-Time Scheduling for General-Purpose Parallel Code
Alexandros Tzannes, R. Barua, U. Vishkin
DOI: 10.1109/PACT.2011.49 (published 2011-10-10)
Abstract: Today, almost all desktop and laptop computers are shared-memory multicores, but the code they run is overwhelmingly serial. High-level language extensions and libraries (e.g., OpenMP, Cilk++, TBB) make it much easier for programmers to write parallel code than previous approaches (e.g., MPI), in large part thanks to the efficient work-stealing scheduler that allows the programmer to expose more parallelism than the actual hardware parallelism. But when the parallel tasks are too short or too many, the scheduling overheads become significant and hurt performance. Because this happens frequently (e.g., data parallelism, PRAM algorithms), programmers need to manually coarsen tasks for performance by combining many of them into longer tasks. But manual coarsening typically causes overfitting of the code to the input data, platform, and context used to do the coarsening, and harms performance portability. We propose distinguishing between two types of coarsening and using different techniques for each. We then improve on our previous work on Lazy Binary Splitting (LBS), a scheduler that performs the second type of coarsening dynamically but fails to scale on large commercial multicores. Our improved scheduler, Breadth-First Lazy Scheduling (BF-LS), overcomes the scalability issue of LBS and performs much better on large machines.
Citations: 2
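
Lazy splitting amortizes scheduling overhead by working through a parallel-loop range serially and only splitting off work when there is evidence that other workers are idle. The stripped-down serial illustration below shows that decision, using a hypothetical deque-empty check as the idleness signal and a small profitability-check period; the real LBS/BF-LS schedulers make this decision inside a work-stealing runtime.

```c
/* Stripped-down illustration of lazy range splitting: process a loop range
 * serially and only split off the second half when a (placeholder) signal
 * says other workers need work. Not the LBS/BF-LS runtime itself. */
#include <stdio.h>

static int worker_deque_is_empty(void) { return 0; }   /* placeholder signal */
static void enqueue_range(long lo, long hi) {           /* placeholder spawn */
    printf("spawned subtask [%ld,%ld)\n", lo, hi);
}
static void body(long i) { (void)i; /* one loop iteration */ }

#define PPS 4   /* profitability-check period: iterations between checks */

static void process_range(long lo, long hi) {
    while (lo < hi) {
        /* Check for splitting only every PPS iterations rather than on every
         * iteration; this keeps fine-grained loops cheap to schedule. */
        if (worker_deque_is_empty() && hi - lo > PPS) {
            long mid = lo + (hi - lo) / 2;
            enqueue_range(mid, hi);   /* expose parallelism only on demand */
            hi = mid;
        }
        long stop = (lo + PPS < hi) ? lo + PPS : hi;
        for (; lo < stop; ++lo)
            body(lo);
    }
}

int main(void) {
    process_range(0, 100);
    return 0;
}
```
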
An OpenCL Framework for Homogeneous Manycores with No Hardware Cache Coherence
Jun Lee, Jungwon Kim, Junghyun Kim, Sangmin Seo, Jaejin Lee
DOI: 10.1109/PACT.2011.12 (published 2011-10-10)
Abstract: Recently, Intel has introduced a research prototype manycore processor called the Single-chip Cloud Computer (SCC). The SCC is an experimental processor created by Intel Labs. It contains 48 cores in a single chip, and each core has its own L1 and L2 caches without any hardware support for cache coherence. It allows a maximum of 64 GB of external memory that can be accessed by all cores, and each core dynamically maps the external memory into its own address space. In this paper, we introduce the design and implementation of an OpenCL framework (i.e., runtime and compiler) for such manycore architectures with no hardware cache coherence. We have found that the OpenCL coherence and consistency model fits well with the SCC architecture. OpenCL's weak memory consistency model requires a relatively small number of messages and coherence actions to guarantee coherence and consistency between the memory blocks in the SCC. The dynamic memory mapping mechanism enables our framework to preserve the semantics of buffer object operations in OpenCL with a small overhead. We have implemented the proposed OpenCL runtime and compiler and evaluated their performance on the SCC with OpenCL applications.
Citations: 14
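
The buffer object operations whose semantics the runtime must preserve are the standard OpenCL host calls shown below; on the SCC the paper's framework backs them with dynamic memory mapping rather than hardware coherence. This is generic OpenCL 1.x host code, not the paper's runtime internals, with error handling trimmed and no kernel source included.

```c
/* Generic OpenCL host-side buffer operations (the API surface an OpenCL
 * runtime must implement coherently). Link with -lOpenCL. */
#include <CL/cl.h>
#include <stdio.h>

int main(void) {
    cl_platform_id plat;  cl_device_id dev;  cl_int err;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    float host_buf[1024] = { 0 };
    /* Buffer creation, write, and read are the operations whose coherence
     * and consistency the runtime must guarantee across cores. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                sizeof(host_buf), NULL, &err);
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, sizeof(host_buf),
                         host_buf, 0, NULL, NULL);
    /* ... enqueue kernels operating on buf here ... */
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(host_buf),
                        host_buf, 0, NULL, NULL);
    clFinish(q);

    clReleaseMemObject(buf);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    printf("done\n");
    return 0;
}
```
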