2014 23rd International Conference on Parallel Architecture and Compilation (PACT): Latest Publications

SQRL: Hardware accelerator for collecting software data structures
Snehasish Kumar, Arrvindh Shriraman, V. Srinivasan, Dan Lin, J. Phillips
DOI: 10.1145/2628071.2628118 | Published: 2014-08-24
Abstract: Software data structures are a critical aspect of emerging data-centric applications, which makes it imperative to improve the energy efficiency of data delivery. We propose SQRL, a hardware accelerator that integrates with the last-level cache (LLC) and enables energy-efficient iterative computation on data structures. SQRL couples a data-structure-specific LLC refill engine (the Collector) with a compute array of lightweight processing elements (PEs). The Collector exploits knowledge of the compute kernel to (i) run ahead of the PEs in a decoupled fashion to gather data objects and (ii) throttle the fetch rate and adaptively tile the dataset based on its locality characteristics. The Collector also exploits data-structure knowledge to extract memory-level parallelism and eliminate data-structure instructions.
Citations: 21
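The Collector's decoupled run-ahead behavior can be pictured with a toy software analogue. This is a hypothetical sketch, not the paper's hardware: the `run` function and `run_ahead` bound are invented names, and the "collector" is just an iterator filling a bounded buffer ahead of the compute loop, which models throttling.

```python
# Toy model of decoupled access/execute: a collector loop runs ahead of
# compute through a bounded buffer; the buffer size throttles the fetch rate.
from collections import deque

def run(dataset, kernel, run_ahead=4):
    buf = deque()
    fetched = iter(dataset)          # the "collector" walking the data structure
    results, done = [], False
    while not done or buf:
        # Collector side: run ahead of compute until the buffer is full.
        while len(buf) < run_ahead and not done:
            try:
                buf.append(next(fetched))
            except StopIteration:
                done = True
        # PE side: consume one gathered object per step.
        if buf:
            results.append(kernel(buf.popleft()))
    return results

print(run(range(5), lambda x: x * x))  # [0, 1, 4, 9, 16]
```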
Automatic parallelism through macro dataflow in high-level array languages
Pushkar Ratnalikar, A. Chauhan
DOI: 10.1145/2628071.2628131 | Published: 2014-08-24
Abstract: Dataflow computation is a powerful paradigm for parallel computing that is especially attractive on modern machines with multiple avenues for parallelism. However, adopting this model has been challenging, as neither hardware- nor language-based approaches have succeeded except in specialized contexts. We argue that general-purpose array languages, such as MATLAB, are good candidates for automatic translation to macro-dataflow-style execution, where each array operation naturally maps to a macro dataflow operation and the model can be executed efficiently on contemporary multicore architectures. We support our argument with a fully automatic compilation technique that translates MATLAB programs to dynamic dataflow graphs capable of handling unbounded structured control flow. These graphs are executed on multicore machines in an event-driven fashion with the help of a runtime system built on top of Intel's Threading Building Blocks (TBB). By letting each task itself be data-parallel, we are able to leverage existing data-parallel libraries and utilize parallelism at multiple levels. Our experiments on a set of benchmarks show speedups of up to 18× over the original data-parallel code on a machine with two 16-core processors.
Citations: 5
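The macro-dataflow execution model the abstract describes can be sketched in miniature. This is a hypothetical illustration, not the paper's compiler or its TBB runtime: each whole-array operation becomes a graph node that fires, event-driven, once all of its producers have delivered their results.

```python
# Toy macro dataflow: one node per whole-array operation; a node fires
# when its dependency count drops to zero. Each firing could be a task.
from collections import deque

class Node:
    def __init__(self, op, deps):
        self.op = op                # callable taking the dependency results
        self.deps = deps            # nodes this node reads from
        self.pending = len(deps)    # unresolved inputs
        self.consumers = []         # nodes that read this node's result
        self.result = None
        for d in deps:
            d.consumers.append(self)

def execute(nodes):
    """Fire ready nodes in dependency order (event-driven scheduling)."""
    ready = deque(n for n in nodes if n.pending == 0)
    while ready:
        n = ready.popleft()
        n.result = n.op(*(d.result for d in n.deps))
        for c in n.consumers:
            c.pending -= 1
            if c.pending == 0:
                ready.append(c)

# c = (a + b) * 2, expressed as a tiny dataflow graph over whole arrays
a = Node(lambda: [1, 2, 3], [])
b = Node(lambda: [4, 5, 6], [])
s = Node(lambda x, y: [i + j for i, j in zip(x, y)], [a, b])
c = Node(lambda x: [2 * i for i in x], [s])
execute([a, b, s, c])
print(c.result)  # [10, 14, 18]
```

Because each node's `op` is itself a whole-array operation, it could internally be data-parallel, which mirrors the paper's point about exploiting parallelism at multiple levels.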
Cooperative cache scrubbing
Jennifer B. Sartor, W. Heirman, S. Blackburn, L. Eeckhout, K. McKinley
DOI: 10.1145/2628071.2628083 | Published: 2014-08-24
Abstract: Managing the limited resources of power and memory bandwidth while improving performance on multicore hardware is challenging. In particular, more cores demand more memory bandwidth, and multithreaded applications increasingly stress memory systems, leading to more energy consumption. However, we demonstrate that not all memory traffic is necessary. For modern Java programs, 10 to 60% of DRAM writes are useless, because the data on those lines are dead: the program is guaranteed never to read them again. Furthermore, reading memory only to immediately zero-initialize it wastes bandwidth. We propose a software/hardware cooperative solution: the memory manager communicates dead and zero lines with cache scrubbing instructions. We show how scrubbing instructions satisfy MESI cache-coherence protocol invariants and demonstrate them in a Java virtual machine and a multicore simulator. Scrubbing reduces average DRAM traffic by 59%, total DRAM energy by 14%, and dynamic DRAM energy by 57% on a range of configurations. Cooperative software/hardware cache scrubbing thus reduces memory bandwidth demand and improves energy efficiency, two critical problems in modern systems.
Citations: 29
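The core idea, dropping dead dirty lines instead of writing them back, can be modeled with a toy write-back cache. This sketch is invented for illustration (the `Cache` class and its eviction policy are not the paper's design, and real scrubbing is a hardware instruction interacting with MESI, not a Python method): the memory manager's hint lets eviction skip the DRAM write for a dead line.

```python
# Toy write-back cache: software marks a line "dead" (never read again),
# so eviction drops it silently instead of spending a DRAM write on it.
class Cache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = {}          # addr -> dirty flag
        self.dead = set()        # addrs the memory manager promised never to read
        self.dram_writes = 0

    def write(self, addr):
        self.dead.discard(addr)  # a rewritten line is live again
        if addr not in self.lines and len(self.lines) >= self.capacity:
            victim = next(iter(self.lines))          # simplistic FIFO eviction
            if self.lines.pop(victim) and victim not in self.dead:
                self.dram_writes += 1                # dirty and live: write back
        self.lines[addr] = True

    def scrub(self, addr):
        """Memory-manager hint: this line will never be read again."""
        self.dead.add(addr)

c1 = Cache(capacity=2)
for addr in (0, 1, 2, 3):
    c1.write(addr)
print(c1.dram_writes)        # 2: both evicted dirty lines go to DRAM

c2 = Cache(capacity=2)
c2.write(0); c2.write(1)
c2.scrub(0)                  # allocator declares line 0 dead
c2.write(2); c2.write(3)
print(c2.dram_writes)        # 1: the dead line is dropped, not written back
```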
Realm: An event-based low-level runtime for distributed memory architectures
Sean Treichler, Michael A. Bauer, A. Aiken
DOI: 10.1145/2628071.2628084 | Published: 2014-08-24
Abstract: We present Realm, an event-based runtime system for heterogeneous, distributed-memory machines. Realm is fully asynchronous: all runtime actions are non-blocking. Realm supports spawning computations, moving data, and reservations, a novel synchronization primitive. Asynchrony is exposed via a lightweight event system capable of operating without central management. We describe an implementation of Realm that relies on a novel generational event data structure for efficiently handling large numbers of events in a distributed address space. Microbenchmark experiments show that our implementation of Realm approaches the underlying hardware performance limits. We measure the performance of three real-world applications on the Keeneland supercomputer. Our results demonstrate that Realm confers considerable latency hiding to clients, attaining significant speedups over traditional bulk-synchronous and independently optimized MPI codes.
Citations: 71
D2MA: Accelerating coarse-grained data transfer for GPUs
D. Jamshidi, M. Samadi, S. Mahlke
DOI: 10.1145/2628071.2628072 | Published: 2014-08-24
Abstract: To achieve high performance on many-core architectures like GPUs, it is crucial to use the available memory bandwidth efficiently. Currently, it is common to use fast on-chip scratchpad memories, like the shared memory available in GPU shader cores, to buffer data for computation. This buffering, however, has several sources of inefficiency that keep it from making the best use of the available memory resources: shader resources are spent on repeated, regular address calculations; data must be shuffled multiple times through a physically unified on-chip memory; and all threads are forced to synchronize at the pace of the slowest threads to ensure RAW consistency. To address these inefficiencies, we propose Data-Parallel DMA, or D2MA. D2MA is a reimagining of traditional DMA that addresses the challenges of extending DMA to thousands of concurrently executing threads. D2MA decouples address generation from the shader's computational resources, provides a more direct and efficient path for data in global memory to travel into shared memory, and introduces a novel dynamic synchronization scheme that is transparent to the programmer. These advances allow D2MA to achieve speedups as high as 2.29× and to reduce the time spent buffering data by 81% on average.
Citations: 16
Improving performance of streaming applications with filtering and control messages
Peng Li, J. Buhler
DOI: 10.1145/2628071.2671421 | Published: 2014-08-24
Abstract: In streaming computing applications, some data can be filtered out to reduce computation and communication. Filtering, however, may discard information that later stages still need. To recover that lost information, we use control messages, which carry control information rather than input data. The ordering between control messages and input data must be precise to guarantee correct computation. In this paper, we study the use of control messages to suppress data communication, which improves throughput. To ensure precise synchronization between control messages and input data, we propose a credit-based protocol and prove its correctness and safety. Results show that, with the help of control messages, application throughput can be improved in proportion to the filtering ratio.
Citations: 1
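The synchronization problem the abstract raises can be made concrete with a toy credit scheme. All names here (`producer`, `filter_stage`, `consumer`) are invented, and the paper's actual protocol differs in detail; the sketch only shows the principle: a filter that drops items forwards a credit counting the drops, so downstream can still tell exactly where in the original stream a control message belongs.

```python
# Toy credit-based synchronization between filtered data and control messages.
def producer(items, control_at):
    """Yield ('data', x) items, with one ('ctrl', ...) at position control_at."""
    for i, x in enumerate(items):
        if i == control_at:
            yield ('ctrl', 'reset')
        yield ('data', x)

def filter_stage(stream, keep):
    """Drop data failing `keep`, but forward a credit for every run of drops."""
    dropped = 0
    for kind, payload in stream:
        if kind == 'data' and not keep(payload):
            dropped += 1
            continue
        if dropped:
            yield ('credit', dropped)   # downstream: `dropped` items vanished here
            dropped = 0
        yield (kind, payload)
    if dropped:
        yield ('credit', dropped)

def consumer(stream):
    seen = 0        # position in the *original* stream: data plus credits
    out = []
    for kind, payload in stream:
        if kind == 'data':
            seen += 1
            out.append(payload)
        elif kind == 'credit':
            seen += payload
        else:       # control message applies at a precisely known position
            out.append(f'<{payload}@{seen}>')
    return out

stream = filter_stage(producer(range(6), control_at=4), keep=lambda x: x % 2 == 0)
print(consumer(stream))  # [0, 2, '<reset@4>', 4]
```

Even though items 1 and 3 were filtered out, the credits keep the consumer's position counter accurate, so the control message still lands after exactly four original items.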
Shuffling: A framework for lock contention aware thread scheduling for multicore multiprocessor systems
K. Pusukuri, Rajiv Gupta, L. Bhuyan
DOI: 10.1145/2628071.2628074 | Published: 2014-08-24
Abstract: On a cache-coherent multicore multiprocessor system, the performance of a multithreaded application with high lock contention is very sensitive to the distribution of application threads across processors (sockets). This is because the distribution of threads determines the frequency of lock transfers between sockets, which in turn determines the frequency of last-level cache (LLC) misses on the critical path of execution. Since the latency of an LLC miss is high, an increase in LLC misses on the critical path increases both lock-acquisition latency and critical-section processing time. However, thread schedulers in operating systems such as Solaris and Linux are oblivious to the lock contention among the threads of an application and therefore fail to deliver high performance for multithreaded applications. To alleviate this problem, we propose a scheduling framework called Shuffling, which migrates the threads of a multithreaded program across sockets so that threads seeking locks are more likely to find them on the same socket. Shuffling reduces the time threads spend acquiring locks and speeds up shared-data accesses in critical sections, ultimately reducing the execution time of the application. We have implemented Shuffling on a 64-core Supermicro server running Oracle Solaris 11™ and evaluated it on a wide variety of 20 multithreaded programs with high lock contention. Our experiments show that Shuffling reduces execution time by up to 54%, and by 13% on average. Moreover, it requires no changes to the application source code or the OS kernel.
Citations: 16
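The placement policy at the heart of Shuffling can be sketched as a grouping problem. This is a hypothetical illustration only, not the framework's actual migration mechanism (the `shuffle` function and lock labels are invented): threads are grouped by the lock they mostly contend on, and each whole group is placed on one socket so that lock transfers stay on-socket.

```python
# Toy lock-aware placement: co-locate threads contending on the same lock.
from collections import defaultdict

def shuffle(thread_locks, num_sockets):
    """Map thread -> socket, keeping each lock's contention group together.

    thread_locks: {thread_id: lock_id the thread mostly contends on}
    """
    groups = defaultdict(list)
    for tid, lock in thread_locks.items():
        groups[lock].append(tid)
    load = [0] * num_sockets
    placement = {}
    # Biggest contention group first; each whole group goes to the
    # least-loaded socket, so its lock never has to cross sockets.
    for lock, tids in sorted(groups.items(), key=lambda g: -len(g[1])):
        s = load.index(min(load))
        for tid in tids:
            placement[tid] = s
        load[s] += len(tids)
    return placement

threads = {0: 'L1', 1: 'L1', 2: 'L1', 3: 'L2', 4: 'L2', 5: 'L3'}
print(shuffle(threads, num_sockets=2))  # {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

All three `L1` contenders land on socket 0, so the `L1` lock line never migrates between LLCs; the real framework infers these groups from runtime lock-contention measurements rather than explicit labels.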
What is the cost of weak determinism?
Cedomir Segulja, T. Abdelrahman
DOI: 10.1145/2628071.2628099 | Published: 2014-08-24
Abstract: We analyze the fundamental performance impact of enforcing a fixed order of synchronization operations to achieve weakly deterministic execution. Our analysis has three parts, performed on a real system using the SPLASH-2 and PARSEC benchmarks. First, we quantify the impact of various sources of nondeterminism on the execution of data-race-free programs. We find that thread synchronization is the prevalent source of nondeterminism, sometimes affecting program output. Second, we separate the implementation overhead of a system that imposes a specific synchronization order from the cost of enforcing that order. We show that this fundamental cost of determinism is small (a slowdown of 4% on average and 32% in the worst case), and we identify the application characteristics responsible for it. Finally, we evaluate this cost under perturbed execution conditions. We find that demanding determinism when threads face such conditions can cause almost a 2× slowdown.
Citations: 22
LCA: A memory link and cache-aware co-scheduling approach for CMPs
Alexandros-Herodotos Haritatos, G. Goumas, Nikos Anastopoulos, K. Nikas, K. Kourtis, N. Koziris
DOI: 10.1145/2628071.2628123 | Published: 2014-08-24
Abstract: This paper presents LCA, a memory Link- and Cache-Aware co-scheduling approach for CMPs. It is based on a novel application classification scheme that monitors resource utilization across the entire memory hierarchy, from main memory down to the CPU cores. This enables us to predict application interference accurately and to support a co-scheduling algorithm that outperforms state-of-the-art scheduling policies in both throughput and fairness. Because LCA depends only on information collected at runtime by the existing monitoring mechanisms of modern processors, it can easily be incorporated into real-life co-scheduling scenarios with a variety of application features and platform configurations.
Citations: 15
Keynote: Domain-specific models for innovation in analytics
Bob Blainey
DOI: 10.1145/2628071.2635932 | Published: 2014-08-01
Abstract: Big data is a transformational force for businesses and organizations of every stripe. The ability to rapidly and accurately derive insights from massive amounts of data is becoming a critical competitive differentiator, so it is driving continuous innovation among business analysts, data scientists, and computer engineers. Two of the most important success factors for analytic techniques are the ability to develop them quickly and evolve them incrementally to suit changing business needs, and the ability to scale them with parallel computing to process huge collections of data. Unfortunately, these goals are often at odds with each other, because innovation at the algorithm and data-model level requires a combination of domain knowledge and expertise in data analysis, while achieving high scale demands expertise in parallel computing, cloud computing, and even hardware acceleration. In this talk, I examine various approaches to bridging these two goals, with a focus on domain-specific models that simultaneously improve the agility of analytics development and the achievement of efficient parallel scaling.
Citations: 0