IEEE International Symposium on High-Performance Computer Architecture: Latest Publications

BulkCompactor: Optimized deterministic execution via Conflict-Aware commit of atomic blocks
IEEE International Symposium on High-Performance Computer Architecture | Pub Date: 2012-02-25 | DOI: 10.1109/HPCA.2012.6169040
Yuelu Duan, Xing Zhou, Wonsun Ahn, J. Torrellas
Abstract: Recent proposals for determinism-enforcement architectures honor the dependences between threads through a commit step that often becomes a performance bottleneck. Because they commit code blocks (or chunks) in round-robin order, if one chunk gets squashed due to a conflict, its successors also observe a stall. We call this effect transitive squash delay. This paper proposes a novel, high-performance approach to deterministic execution based on Conflict-Aware commit. Rather than committing chunks in strict round-robin order, the idea is to skip those chunks with conflicts and deterministically execute them slightly later. The scheme, called BulkCompactor, largely eliminates transitive squash delay, "compacts" the chunk commits, and substantially speeds up execution. With BulkCompactor, the squash overhead is O(N) rather than O(N²) as in round-robin. We describe BulkCompactor designs for machines with centralized or distributed commit. Finally, a simulation-based evaluation shows that BulkCompactor delivers performance comparable to nondeterministic systems. For example, for 32 processors, BulkCompactor incurs an average execution overhead of 22% over a nondeterministic system, while the round-robin scheme's average overhead is 133%.
Citations: 3
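The conflict-aware commit idea lends itself to a toy timing model (our own sketch: the unit commit cost and the `round_robin`/`conflict_aware` names are assumptions, not the paper's simulator):

```python
def round_robin(n_chunks, squashed):
    """Strict-order commit: a squashed chunk re-executes in place, so it and
    every successor slip by one slot (the transitive squash delay)."""
    t, finish = 0, []
    for c in range(n_chunks):
        if c in squashed:
            t += 1            # re-execution slot; all later chunks wait
        t += 1                # commit slot
        finish.append(t)
    return finish

def conflict_aware(n_chunks, squashed):
    """BulkCompactor-style commit: skip conflicting chunks and commit them
    deterministically at the end; only the squashed chunks are delayed."""
    t, finish = 0, [None] * n_chunks
    for c in range(n_chunks):
        if c not in squashed:
            t += 1
            finish[c] = t
    for c in sorted(squashed):    # deterministic order for the late commits
        t += 2                    # re-execute, then commit
        finish[c] = t
    return finish
```

With eight chunks and one conflict, total commit latency drops under the conflict-aware order because only the conflicting chunk pays for its own squash.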
Balancing DRAM locality and parallelism in shared memory CMP systems
IEEE International Symposium on High-Performance Computer Architecture | Pub Date: 2012-02-25 | DOI: 10.1109/HPCA.2012.6168944
Minseong Jeong, D. Yoon, Dam Sunwoo, Michael B. Sullivan, Ikhwan Lee, M. Erez
Abstract: Modern memory systems rely on spatial locality to provide high bandwidth while minimizing memory device power and cost. The trend of increasing the number of cores that share memory, however, decreases apparent spatial locality because access streams from independent threads are interleaved. Memory access scheduling recovers only a fraction of the original locality because of buffering limits. We investigate new techniques to reduce inter-thread access interference. We propose to partition the internal memory banks between cores to isolate their access streams and eliminate locality interference. We implement this by extending the physical frame allocation algorithm of the OS such that physical frames mapped to the same DRAM bank can be exclusively allocated to a single thread. We compensate for the reduced bank-level parallelism of each thread by employing memory sub-ranking to effectively increase the number of independent banks. This combined approach, unlike memory bank partitioning or sub-ranking alone, simultaneously increases overall performance and significantly reduces memory power consumption.
Citations: 140
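The OS-level idea, frame allocation constrained to per-thread banks, can be sketched as follows (the `bank_of` mapping and class name are hypothetical; real controllers hash several address bits to pick a bank):

```python
NUM_BANKS = 8

def bank_of(frame):
    # Assumed mapping: bank index taken from the low bits of the frame number.
    return frame % NUM_BANKS

class BankPartitionedAllocator:
    """Hands a thread only physical frames that fall in its exclusive banks,
    so access streams from different threads never share a DRAM bank."""
    def __init__(self, num_frames, thread_banks):
        self.free = {b: [] for b in range(NUM_BANKS)}
        for f in range(num_frames):
            self.free[bank_of(f)].append(f)
        self.thread_banks = thread_banks   # thread id -> list of owned banks

    def alloc(self, tid):
        for b in self.thread_banks[tid]:
            if self.free[b]:
                return self.free[b].pop()
        raise MemoryError("no free frame in this thread's banks")
```

Sub-ranking then restores per-thread bank-level parallelism by splitting each rank into more independently addressable units.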
AgileRegulator: A hybrid voltage regulator scheme redeeming dark silicon for power efficiency in a multicore architecture
IEEE International Symposium on High-Performance Computer Architecture | Pub Date: 2012-02-25 | DOI: 10.1109/HPCA.2012.6169034
Guihai Yan, Yingmin Li, Yinhe Han, Xiaowei Li, M. Guo, Xiaoyao Liang
Abstract: The widening gap between the fast-increasing transistor budget and the slow-growing power delivery and system cooling capability calls for novel architectural solutions to boost energy efficiency. Leveraging the surging "dark silicon" area, we propose a hybrid scheme, called "AgileRegulator", that uses both on-chip and off-chip voltage regulators in a multicore system to exploit both coarse-grain and fine-grain power phases. We present two complementary algorithms, Sensitivity-Aware Application Scheduling (SAAS) and Responsiveness-Aware Application Scheduling (RAAS), to maximally realize the energy-saving potential of the hybrid regulator scheme. Experimental results show that the hybrid scheme achieves performance-energy efficiency close to per-core DVFS without imposing much design cost, while the silicon overhead of the scheme is well contained within the "dark silicon". Unlike application-specific schemes based on accelerators, the proposed scheme is a simple and universal solution for trading chip area against energy.
Citations: 51
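A minimal sketch of the sensitivity-style scheduling decision (the threshold rule, function name, and phase-length metric are our illustration; the paper's SAAS/RAAS algorithms are more involved):

```python
def assign_regulators(apps, phase_threshold_us=100.0):
    """Map each application to a regulator type based on how quickly its
    power phases change: fast-changing phases need the agile (but less
    efficient) on-chip regulator, slow phases suit the off-chip one.

    apps: dict of app name -> average power-phase length in microseconds.
    Returns a dict of app name -> 'on-chip' or 'off-chip'."""
    return {
        name: 'on-chip' if phase_len < phase_threshold_us else 'off-chip'
        for name, phase_len in apps.items()
    }
```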
System-level implications of disaggregated memory
IEEE International Symposium on High-Performance Computer Architecture | Pub Date: 2012-02-25 | DOI: 10.1109/HPCA.2012.6168955
Kevin T. Lim, Yoshio Turner, J. R. Santos, Alvin AuYoung, Jichuan Chang, Parthasarathy Ranganathan, T. Wenisch
Abstract: Recent research on memory disaggregation introduces a new architectural building block, the memory blade, as a cost-effective approach for memory capacity expansion and sharing for an ensemble of blade servers. Memory blades augment blade servers' local memory capacity with a second-level (remote) memory that can be dynamically apportioned among blades in response to changing capacity demand, albeit at a higher access latency. In this paper, we build on the prior research to explore the software and systems implications of disaggregated memory. We develop a software-based prototype by extending the Xen hypervisor to emulate a disaggregated memory design wherein remote pages are swapped into local memory on demand upon access. Our prototyping effort reveals that low-latency remote memory calls for a different regime of replacement policies than conventional disk paging, favoring minimal hypervisor overhead even at the cost of using less sophisticated replacement policies. Second, we demonstrate the synergy between disaggregated memory and content-based page sharing. By allowing content to be shared both within and across blades (in local and remote memory, respectively), we find that their combination provides greater workload consolidation opportunity and performance-per-dollar than either technique alone. Finally, we explore a realistic deployment scenario in which disaggregated memory is used to reduce the scaling cost of a memcached system. We show that disaggregated memory can provide a 50% improvement in performance-per-dollar relative to conventional scale-out.
Citations: 194
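The prototype's behavior can be modeled in a few lines (an assumed structure, not the Xen code): remote pages are pulled into a fixed pool of local frames on access, with a deliberately cheap FIFO victim policy reflecting the finding that low remote latency favors minimal replacement overhead:

```python
from collections import deque

class DisaggregatedMemory:
    """Toy two-level memory: a small local pool backed by unbounded remote
    (memory-blade) capacity; touching a remote page swaps it in on demand."""
    def __init__(self, local_frames):
        self.capacity = local_frames
        self.local = set()
        self.fifo = deque()          # eviction order: oldest swapped-in first
        self.remote_accesses = 0

    def touch(self, page):
        if page in self.local:
            return 'local'
        self.remote_accesses += 1
        if len(self.local) >= self.capacity:
            victim = self.fifo.popleft()   # cheap FIFO choice, no scanning
            self.local.discard(victim)     # victim falls back to remote
        self.local.add(page)
        self.fifo.append(page)
        return 'remote'
```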
The case for GPGPU spatial multitasking
IEEE International Symposium on High-Performance Computer Architecture | Pub Date: 2012-02-25 | DOI: 10.1109/HPCA.2012.6168946
Jacob Adriaens, Katherine Compton, N. Kim, M. Schulte
Abstract: The set-top and portable device market continues to grow, as does the demand for more performance under increasing cost, power, and thermal constraints. The integration of Graphics Processing Units (GPUs) into these devices and the emergence of general-purpose computations on graphics hardware enable a new set of highly parallel applications. In this paper, we propose and make the case for a GPU multitasking technique called spatial multitasking. Traditional GPU multitasking techniques, such as cooperative and preemptive multitasking, partition GPU time among applications, while spatial multitasking allows GPU resources to be partitioned among multiple applications simultaneously. We demonstrate the potential benefits of spatial multitasking with an analysis and characterization of General-Purpose GPU (GPGPU) applications. We find that many GPGPU applications fail to utilize available GPU resources fully, which suggests the potential for significant performance benefits using spatial multitasking instead of, or in combination with, preemptive or cooperative multitasking. We then implement spatial multitasking and compare it to cooperative multitasking using simulation. We evaluate several heuristics for partitioning GPU streaming multiprocessors (SMs) among applications and find spatial multitasking shows an average speedup of up to 1.19 over cooperative multitasking when two applications are sharing the GPU. Speedups are even higher when more than two applications are sharing the GPU.
Citations: 188
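One of the partitioning heuristics could look like this sketch (our own proportional-share rule, under the assumptions that each app reports how many SMs it can keep busy and that one SM per app always fits):

```python
def partition_sms(total_sms, demand):
    """Split SMs among co-running apps in proportion to how many each can
    actually utilize, instead of time-slicing the whole GPU.

    demand: dict of app name -> SMs the app can keep busy.
    Returns a dict of app name -> SMs granted (sums to total_sms)."""
    total_demand = sum(demand.values())
    share = {a: max(1, total_sms * d // total_demand)
             for a, d in demand.items()}
    # Hand out SMs lost to integer division, largest demand first.
    leftover = total_sms - sum(share.values())
    for a in sorted(demand, key=demand.get, reverse=True):
        if leftover <= 0:
            break
        share[a] += 1
        leftover -= 1
    return share
```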
Efficient scrub mechanisms for error-prone emerging memories
IEEE International Symposium on High-Performance Computer Architecture | Pub Date: 2012-02-25 | DOI: 10.1109/HPCA.2012.6168941
M. Awasthi, Manjunath Shevgoor, K. Sudan, B. Rajendran, R. Balasubramonian, V. Srinivasan
Abstract: Many memory cell technologies are being considered as possible replacements for DRAM and Flash technologies, both of which are nearing their scaling limits. While these new cells (PCM, STT-RAM, FeRAM, etc.) promise high density, better scaling, and non-volatility, they introduce new challenges. Solutions at the architecture level can help address some of these problems; e.g., prior research has proposed wear-leveling and hard-error tolerance mechanisms to overcome the limited write endurance of PCM cells. In this paper, we focus on the soft-error problem in PCM, a topic that has received little attention in the architecture community. Soft errors in DRAM memories are typically addressed by SECDED support and a scrub mechanism. The scrub mechanism scans the memory looking for single-bit errors and corrects each one before its line experiences a second, uncorrectable error. However, PCM (and other emerging memories) are prone to new sources of soft errors. In particular, multi-level cell (MLC) PCM devices will suffer from resistance drift, which increases the soft-error rate and incurs high overheads for the scrub mechanism. This paper is the first to study the design of architectural scrub mechanisms, especially tailored to the drift phenomenon in MLC PCM. Many of our solutions will also apply to other soft-error-prone emerging memories. We first show that scrub overheads can be reduced with support for strong ECC codes and a lightweight error-detection operation. We then design different scrub algorithms that can adaptively trade off soft and hard errors. Using an approach that combines all proposed solutions, our scrub mechanism yields a 96.5% reduction in uncorrectable errors, a 24.4× decrease in scrub-related writes, and a 37.8% reduction in scrub energy, relative to a basic scrub algorithm used in modern DRAM systems.
Citations: 106
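The baseline scrub loop the paper starts from can be modeled like this (a toy structure: we keep a golden copy of each line and treat Hamming distance 1 as SECDED-correctable, distance 2 or more as detected but uncorrectable):

```python
def hamming(a, b):
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count('1')

def scrub(lines, golden):
    """Scan every memory line; correct single-bit errors in place and count
    lines that accumulated a second flip before the scrubber arrived."""
    corrected, uncorrectable = 0, 0
    for i, word in enumerate(lines):
        d = hamming(word, golden[i])
        if d == 1:
            lines[i] = golden[i]   # SECDED corrects one flipped bit
            corrected += 1
        elif d >= 2:
            uncorrectable += 1     # double-bit error: detected, not corrected
    return corrected, uncorrectable
```

Scrubbing more frequently shrinks the window in which a second flip (or, in MLC PCM, resistance drift) can turn a correctable error into an uncorrectable one, which is exactly the cost/coverage trade-off the paper's adaptive algorithms navigate.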
Accelerating business analytics applications
IEEE International Symposium on High-Performance Computer Architecture | Pub Date: 2012-02-25 | DOI: 10.1109/HPCA.2012.6169044
V. Salapura, T. Karkhanis, P. Nagpurkar, J. Moreira
Abstract: Business text analytics applications have seen rapid growth, driven by the mining of data for various decision-making processes. Regular expression processing is an important component of these applications, consuming as much as 50% of their total execution time. While prior work on accelerating regular expression processing has focused on Network Intrusion Detection Systems, business analytics applications impose different requirements on regular expression processing efficiency. We present an analytical model of accelerators for regular expression processing, which includes memory bus-, I/O bus-, and network-attached accelerators, with a focus on business analytics applications. Based on this model, we advocate the use of vector-style processing for regular expressions in business analytics applications, leveraging the SIMD hardware available in many modern processors. In addition, we show how SIMD hardware can be enhanced to improve regular expression processing even further. We demonstrate a realized speedup better than 1.8 for the entire range of data sizes of interest. In comparison, the alternative strategies deliver only marginal improvement for large data sizes, while performing worse than the SIMD solution for small data sizes.
Citations: 20
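The bit-parallel flavor of "vector-style" pattern scanning can be illustrated with the classic Shift-Or matcher, which updates one word of pattern state per input character (a standard textbook technique used here only for illustration; it is not the paper's SIMD design, and on real hardware the pattern must fit in a machine word):

```python
def shift_or_search(pattern, text):
    """Return the start offsets of exact matches of pattern in text.
    Bit i of the state word is 0 iff pattern[0..i] matches the text
    ending at the current position."""
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        # Clear bit i in the mask for each character of the pattern.
        masks[ch] = masks.get(ch, ~0) & ~(1 << i)
    state = ~0
    hits = []
    for pos, ch in enumerate(text):
        # One shift and one OR advance every partial match in parallel.
        state = (state << 1) | masks.get(ch, ~0)
        if state & (1 << (m - 1)) == 0:
            hits.append(pos - m + 1)
    return hits
```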
CPU-assisted GPGPU on fused CPU-GPU architectures
IEEE International Symposium on High-Performance Computer Architecture | Pub Date: 2012-02-01 | DOI: 10.1109/HPCA.2012.6168948
Yi Yang, Ping Xiang, Mike Mantor, Huiyang Zhou
Abstract: This paper presents a novel approach to utilize the CPU resource to facilitate the execution of GPGPU programs on fused CPU-GPU architectures. In our model of fused architectures, the GPU and the CPU are integrated on the same die and share the on-chip L3 cache and off-chip memory, similar to the latest Intel Sandy Bridge and AMD accelerated processing unit (APU) platforms. In our proposed CPU-assisted GPGPU, after the CPU launches a GPU program, it executes a pre-execution program, which is generated automatically from the GPU kernel using our proposed compiler algorithms and contains memory access instructions of the GPU kernel for multiple thread blocks. The CPU pre-execution program runs ahead of GPU threads because (1) the CPU pre-execution thread contains only the memory fetch instructions from GPU kernels and not the floating-point computations, and (2) the CPU runs at higher frequencies and exploits higher degrees of instruction-level parallelism than GPU scalar cores. We also leverage the prefetcher at the L2 cache on the CPU side to increase the memory traffic from the CPU. As a result, the memory accesses of GPU threads hit in the L3 cache and their latency can be drastically reduced. Since our pre-execution is directly controlled by user-level applications, it enjoys both high accuracy and flexibility. Our experiments on a set of benchmarks show that our proposed pre-execution improves performance by up to 113%, and by 21.4% on average.
Citations: 90
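The division of labor can be sketched as follows (an invented structure: a list of `(op, addr)` events stands in for the GPU kernel, and a set of cache-line numbers for the shared L3; the real compiler pass extracts the address computations from actual kernel code):

```python
CACHE_LINE = 64

def addresses_of(kernel_accesses):
    """The 'pre-execution program': keep only the kernel's load addresses,
    dropping stores and all computation."""
    return [addr for (op, addr) in kernel_accesses if op == 'load']

def run_with_preexecution(kernel_accesses, llc):
    """CPU helper thread touches each line first, so the GPU pass hits."""
    for addr in addresses_of(kernel_accesses):
        llc.add(addr // CACHE_LINE)        # warm the shared last-level cache
    # GPU pass: count loads that now hit in the shared cache.
    return sum((addr // CACHE_LINE) in llc
               for op, addr in kernel_accesses if op == 'load')
```

The point of the stripped-down helper is that it issues addresses faster than the GPU consumes them; in this sequential toy it simply runs to completion first.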