2011 International Conference on Parallel Architectures and Compilation Techniques最新文献

筛选
英文 中文
Decoupled Cache Segmentation: Mutable Policy with Automated Bypass 解耦缓存分段:自动绕过可变策略
S. Khan, Daniel A. Jiménez
{"title":"Decoupled Cache Segmentation: Mutable Policy with Automated Bypass","authors":"S. Khan, Daniel A. Jiménez","doi":"10.1109/PACT.2011.45","DOIUrl":"https://doi.org/10.1109/PACT.2011.45","url":null,"abstract":"The least recently used (LRU) replacement policy performs poorly in the last-level cache (LLC) because temporal locality of memory accesses is filtered by first and second level caches. We propose a cache segmentation technique that adapts to cache access patterns by predicting the best number of not-yet-referenced and already-referenced blocks in the cache. The technique is independent from the LRU policy so it can work with less expensive replacement policies. It can automatically detect when to bypass blocks to the CPU with no extra overhead. It outperforms LRU replacement on average by 5.2% with not-recently-used (NRU) replacement and on average by 2.2% with random replacement in a 2MB LLC in a single-core processor with a memory intensive subset of SPEC CPU 2006 benchmarks.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132524041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Row-Buffer Reorganization: Simultaneously Improving Performance and Reducing Energy in DRAMs 行缓冲器重组:同时提高dram的性能和降低能量
N. Gulur, R. Manikantan, R. Govindarajan, M. Mehendale
{"title":"Row-Buffer Reorganization: Simultaneously Improving Performance and Reducing Energy in DRAMs","authors":"N. Gulur, R. Manikantan, R. Govindarajan, M. Mehendale","doi":"10.1109/PACT.2011.34","DOIUrl":"https://doi.org/10.1109/PACT.2011.34","url":null,"abstract":"In this paper, based on the temporal and spatial locality characteristics of memory accesses in multicores, we propose a re-organization of the existing single large row buffer in a DRAM bank into multiple smaller row-buffers. The proposed configuration helps improve the row hit rates and also brings down the energy required for row-activations. The major contribution of this work is proposing such a reorganization without requiring any significant changes to the existing widely accepted DRAM specifications. Our proposed reorganization improves performance by 35.8%, 14.5% and 21.6% in quad, eight and sixteen core workloads along with a 42%, 28% and 31% reduction in DRAM energy. Additionally, we introduce a Need Based Allocation scheme for buffer management that shows additional performance improvement.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"293 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131822126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Decoupled Architectures as a Low-Complexity Alternative to Out-of-order Execution 作为无序执行的低复杂度替代方案的解耦架构
N. Crago, Sanjay J. Patel
{"title":"Decoupled Architectures as a Low-Complexity Alternative to Out-of-order Execution","authors":"N. Crago, Sanjay J. Patel","doi":"10.1109/PACT.2011.28","DOIUrl":"https://doi.org/10.1109/PACT.2011.28","url":null,"abstract":"In this paper we present OUTRIDERHP, a novel implementation of a decoupled architecture that approaches the performance of contemporary out-of-order processors on parallel benchmarks while maintaining low hardware complexity. OUTRIDERHP leverages the compiler to separate a single thread of execution into memory-accessing and memory-consuming streams that can be executed concurrently, which we call strands. We identify loss-of-decoupling events which cripple performance on traditional decoupled architectures, and design OUTRIDERHP to enable extraction of multiple strands and control speculation which provide superior memory and functional unit latency tolerance. OUTRIDERHP outperforms a baseline in-order architecture by 26-220% and Decoupled Access/Execute by 7-172% when executing parallel benchmarks on an 8-core CMP configuration. OUTRIDERHP performs within 15% of higher-complexity out-of-order cores despite not utilizing large physical register files, dynamic scheduling, and register renaming hardware.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130479663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Memory Architecture for Integrating Emerging Memory Technologies 集成新兴存储器技术的存储器体系结构
Kun Fang, Long Chen, Zhao Zhang, Zhichun Zhu
{"title":"Memory Architecture for Integrating Emerging Memory Technologies","authors":"Kun Fang, Long Chen, Zhao Zhang, Zhichun Zhu","doi":"10.1109/PACT.2011.71","DOIUrl":"https://doi.org/10.1109/PACT.2011.71","url":null,"abstract":"Current main memory system design is severely limited by the decades-old synchronous DRAM architecture, which requires the memory controller to track the internal status of memory devices (chips) and schedule the timing of all device operations. This rigidity has become an obstacle of integrating emerging memory technologies such as PCM into existing memory systems, because their timing requirements are vastly different. Furthermore, with the trend of embedding memory controllers into processors, it is crucial to have interoperability among general-purpose processors and diverse memory modules. To address this issue, we propose a new memory architecture framework called universal memory architecture (UniMA). It enables the interoperability by decoupling the scheduling of device operations from memory controller, using a bridge chip at each memory module to perform local scheduling. The new architecture may also help improve memory scalability, power efficiency, and bandwidth as previously proposed decoupled memory organizations. A major focus of this study is to evaluate the performance impact of local scheduling of device operations. We present a prototype implementation of UniMA on top of DDRx memory bus, and then evaluate its efficiency with different workloads. The simulation results show that UniMA actually improves memory system efficiency for memory-intensive workloads due to increased parallelism among memory modules. The overall performance improvement over the conventional DDRx memory architecture is 3.1% on average. The performance of other workloads is reduced slightly, by 1.0% on average, due to a small increase of memory idle latency. In short, the prototype and evaluation demonstrate that it is possible to integrate diverse memory technologies into a single memory architecture with virtually no loss of overall performance.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130687013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 26
Compiler Directed Data Locality Optimization for Multicore Architectures 面向多核架构的编译器定向数据局部性优化
W. Ding, Jithendra Srinivas, M. Kandemir, Mustafa Karaköy
{"title":"Compiler Directed Data Locality Optimization for Multicore Architectures","authors":"W. Ding, Jithendra Srinivas, M. Kandemir, Mustafa Karaköy","doi":"10.1109/PACT.2011.24","DOIUrl":"https://doi.org/10.1109/PACT.2011.24","url":null,"abstract":"This paper presents and evaluates a cache hierarchy-aware code parallelization/mapping and scheduling strategy for multicore architectures. Our proposed parallelization/mapping strategy determines a loop iteration-to-core mapping by taking into account the data access pattern of an application and the on-chip cache hierarchy of a target architecture. The goal of this step is to maximize data locality at each level of caches while minimizing the data dependences across the cores. Our scheduling strategy on the other hand determines a schedule for the iterations assigned to each core in the target architecture, with the goal of satisfying all the data dependences in the code (both intra-core and inter-core) and reducing data reuse distances across the cores that share data. We formulate both parallelization/mapping problem and scheduling problem in a linear algebraic framework and solve them using the Farkas Lemma and the Integer Fourier-Motzkin Elimination. To measure the effectiveness of our schemes, we implemented them in a compiler and tested them using eight multithreaded application programs on a multicore machine. Our results show that the proposed mapping scheme reduces cache miss rates at all levels of the cache hierarchy and improves execution time of applications significantly, compared to alternate approaches, and when supported by scheduling, the improvements in cache miss rates and execution time become much larger.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130690034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Parameterized Micro-benchmarking: An Auto-tuning Approach for Complex Applications 参数化微基准测试:复杂应用的自动调优方法
Wenjing Ma, S. Krishnamoorthy, G. Agrawal
{"title":"Parameterized Micro-benchmarking: An Auto-tuning Approach for Complex Applications","authors":"Wenjing Ma, S. Krishnamoorthy, G. Agrawal","doi":"10.1145/2212908.2212938","DOIUrl":"https://doi.org/10.1145/2212908.2212938","url":null,"abstract":"Auto-tuning has emerged as an important practical method for creating highly optimized code. However, the growing complexity of architectures and applications has resulted in a prohibitively large search space that preclude empirical auto-tuning. Here, we focus on the challenge to auto-tuning presented by applications that require auto-tuning of not just a small number of distinct kernels, but a large number of kernels that exhibit similar computation and memory access characteristics and require optimization over similar problem spaces. We propose an auto-tuning method for tensor contraction functions on GPUs, based on parameterized micro-benchmarks. Using our parameterized micro-benchmarking approach, we obtain a speedup of up to 2 over the version that used default optimizations without auto-tuning.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133580702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
Compiling Dynamic Data Structures in Python to Enable the Use of Multi-core and Many-core Libraries 在Python中编译动态数据结构以启用多核和多核库
Bin Ren, G. Agrawal
{"title":"Compiling Dynamic Data Structures in Python to Enable the Use of Multi-core and Many-core Libraries","authors":"Bin Ren, G. Agrawal","doi":"10.1109/PACT.2011.13","DOIUrl":"https://doi.org/10.1109/PACT.2011.13","url":null,"abstract":"Programmer productivity considerations are increasing the popularity of interpreted languages like Python. At the same time, for applications where performance is important, these languages clearly lack even on uniprocessors. In addition, the use of dynamic data structures in a language like Python makes it very hard to use emerging libraries for enabling the execution on multi-core and many-core architectures. This paper presents a framework for compiling Python to use multi-core and many-core libraries. The key component of our framework involves a suite of algorithms for replacing dynamic and/or nested data structures by arrays, while minimizing unnecessary data copying costs. This involves a novel use of an existing partial redundancy elimination algorithm, development of a new demand-driven interprocedural partial redundancy algorithm, a data flow formulation for determining that the contents of the data structure are of the same type, and a linearization algorithm. We have evaluated our framework using data mining and two linear algebra applications written in pure Python. The key observations were: 1) the code generated by our framework is only 10% to 20% slower compared to the hand-written C code that invokes the same libraries, 2) our optimizations turn out to be significant for improving the performance in most cases, and 3) we outperform interpreted Python and the C++ code generated by an existing tool by one to two orders of magnitude.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115518174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
An Alternative Memory Access Scheduling in Manycore Accelerators 多核加速器中可选的内存访问调度
Yonggon Kim, Hyunseok Lee, John Kim
{"title":"An Alternative Memory Access Scheduling in Manycore Accelerators","authors":"Yonggon Kim, Hyunseok Lee, John Kim","doi":"10.1109/PACT.2011.37","DOIUrl":"https://doi.org/10.1109/PACT.2011.37","url":null,"abstract":"Memory controllers in graphics processing units (GPU) often employ out-of-order scheduling to maximize row access locality. However, this requires complex logic to enable out-of-order scheduling compared with in-order scheduling. To provide a low-cost and low-complexity memory scheduling, we propose an alternative memory scheduling where the memory scheduling is performed not at the destination (i.e., memory controller) but is done at the source (i.e., the cores). We propose two complementary techniques in source-based memory scheduling -- network congestion-aware source throttling and super packets, where multiple request packets are grouped together to create a single super packet. By combing these techniques, the performance across a wide range of application is within 95% of the complex FR-FCFS on average and at significantly lower cost and complexity.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"301 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114477105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
Regulating Locality vs. Parallelism Tradeoffs in Multiple Memory Controller Environments 在多内存控制器环境中调节局部性与并行性的权衡
S. M. Hassan, Dhruv Choudhary, M. Rasquinha, S. Yalamanchili
{"title":"Regulating Locality vs. Parallelism Tradeoffs in Multiple Memory Controller Environments","authors":"S. M. Hassan, Dhruv Choudhary, M. Rasquinha, S. Yalamanchili","doi":"10.1109/PACT.2011.33","DOIUrl":"https://doi.org/10.1109/PACT.2011.33","url":null,"abstract":"The presence of multiple MCs and their integration into the on-chip network fabric creates a highly concurrent system that can support significant levels of memory level parallelism (MLP) across cores. This work exposes the trade-off between DRAM parameters, bank level parallelism (BLP), and row buffer hit rate that exposes the amount of effective BLP that is necessary to approximate a 100% hit rate. We further study how this trade-off can be controlled and propose a class of global (system) and local (within an MC) address mappings that can be tuned to optimize the performance across a set of multiprogrammed benchmarks.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"14 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120904947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
Exploiting Mutual Awareness between Prefetchers and On-chip Networks in Multi-cores 利用多核预取器和片上网络之间的相互感知
Junghoon Lee, Minjeong Shin, H. Kim, John Kim, Jaehyuk Huh
{"title":"Exploiting Mutual Awareness between Prefetchers and On-chip Networks in Multi-cores","authors":"Junghoon Lee, Minjeong Shin, H. Kim, John Kim, Jaehyuk Huh","doi":"10.1109/PACT.2011.27","DOIUrl":"https://doi.org/10.1109/PACT.2011.27","url":null,"abstract":"The unique characteristics of prefetch traffic have not been considered in on-chip network design for multicore architectures. Most prefetchers are often oblivious to the network congestion when generating prefetech requests. In this work, we investigate the interaction between prefetchers and on-chip networks and exploit the synergy of these two components in multi-core architectures. We explore prefetchaware on-chip networks that differentiates between prefetch and demand traffic by prioritizing demand traffic. In addition, we propose prefetch control mechanism based on network congestion. Our evaluations show that the combination of the proposed prefetch-aware router architecture and congestion sensitive prefetch control improves the performance of benchmarks by 11-13% on average, up to 30% on some of the workloads.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122200295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信