Title: Handling branches in TLS systems with Multi-Path Execution
Authors: Polychronis Xekalakis, Marcelo H. Cintra
DOI: 10.1109/HPCA.2010.5416632
Abstract: Thread-Level Speculation (TLS) has been proposed to facilitate the extraction of parallel threads from sequential applications. Most prior work on TLS has focused on architectural features directly related to supporting the main TLS operations. In this work, we instead investigate how a common microarchitectural feature, namely branch prediction, interacts with TLS. We show that branch prediction is even more important for TLS than it is for sequential execution. Unfortunately, branch prediction for TLS systems is also inherently harder: code partitioning and re-executions of squashed threads pollute the branch history, making it harder for predictors to be accurate. We thus propose to augment the hardware to accommodate Multi-Path Execution (MP) within the existing TLS protocol. Under the MP execution model, all paths following a number of hard-to-predict conditional branches are followed simultaneously. MP execution thus removes branches that would otherwise have been mispredicted, helping the core exploit more ILP. We show that, with only minimal hardware support, the two execution models can be combined into a unified one. Experimental results show that our combined execution model achieves speedups of up to 23.2%, with an average of 9.2%, over an existing state-of-the-art TLS system, and speedups of up to 138%, with an average of 28.2%, compared with MP execution, for a subset of the SPEC2000 integer benchmark suite.

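As a rough illustration of the MP idea, the Python sketch below (not the authors' hardware design) gates path forking on a branch-confidence estimate: only branches that a simple per-PC saturating-counter estimator flags as hard to predict fork both paths, and only when a spare execution context is available. The estimator structure, the thresholds, and the `free_contexts` parameter are illustrative assumptions, not details from the paper.

```python
# Confidence-gated multi-path execution, sketched at a very high level.
# A per-PC saturating counter is bumped on each correct prediction and
# reset on a misprediction; branches whose counter sits below a threshold
# are treated as hard to predict and fork both paths when a context is free.

class ConfidenceEstimator:
    def __init__(self, threshold=3, max_count=7):
        self.counters = {}          # branch PC -> saturating counter
        self.threshold = threshold
        self.max_count = max_count

    def is_confident(self, pc):
        return self.counters.get(pc, 0) >= self.threshold

    def update(self, pc, predicted_correctly):
        if predicted_correctly:
            self.counters[pc] = min(self.counters.get(pc, 0) + 1,
                                    self.max_count)
        else:
            self.counters[pc] = 0   # misprediction: confidence collapses

def paths_to_follow(pc, predicted_taken, free_contexts, estimator):
    """Fork both targets of a hard-to-predict branch; otherwise follow
    only the predicted path."""
    if not estimator.is_confident(pc) and free_contexts > 0:
        return ["taken", "not_taken"]
    return ["taken" if predicted_taken else "not_taken"]
```
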
Title: CHOP: Adaptive filter-based DRAM caching for CMP server platforms
Authors: Xiaowei Jiang, Niti Madan, Li Zhao, Mike Upton, R. Iyer, S. Makineni, D. Newell, Yan Solihin, R. Balasubramonian
DOI: 10.1109/HPCA.2010.5416642
Abstract: As manycore architectures enable a large number of cores on the die, a key challenge that emerges is the availability of memory bandwidth with conventional DRAM solutions. To address this challenge, integrating large DRAM caches that provide as much as 5× the bandwidth and as little as one third the latency of conventional DRAM is very promising. However, organizing and implementing a large DRAM cache is challenging because of two primary tradeoffs: (a) DRAM caches managed at cache-line granularity require an on-chip tag area too large to be practical, and (b) DRAM caches managed at page granularity consume too much bandwidth because the miss rate does not drop enough to offset the bandwidth increase. In this paper, we propose CHOP (Caching HOt Pages) in DRAM caches to address these challenges. We study several filter-based DRAM caching techniques: (a) a filter cache (CHOP-FC) that profiles pages and determines the hot subset of pages to allocate into the DRAM cache, (b) a memory-based filter cache (CHOP-MFC) that spills and fills filter state to improve accuracy and reduce the size of the filter cache, and (c) an adaptive DRAM caching technique (CHOP-AFC) that determines when the filter cache should be enabled and disabled for DRAM caching. We conduct detailed simulations with server workloads to show that our filter-based DRAM caching techniques achieve: (a) over 30% performance improvement on average over previous solutions, (b) orders of magnitude lower tag-area overhead than cache-line-based DRAM caches, and (c) significantly lower memory bandwidth consumption than page-granular DRAM caches.

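As a rough illustration of the filter-cache idea behind CHOP-FC, the Python sketch below profiles page accesses in a small bounded table and reports a page as hot, i.e., worth allocating into the DRAM cache, only after it crosses an access-count threshold. The capacity, threshold, and LRU replacement are illustrative assumptions rather than the paper's parameters.

```python
from collections import OrderedDict

class HotPageFilter:
    """Bounded table of per-page access counters with LRU replacement."""

    def __init__(self, capacity=4096, hot_threshold=8):
        self.counters = OrderedDict()   # page number -> access count
        self.capacity = capacity
        self.hot_threshold = hot_threshold

    def access(self, page):
        """Record one access; return True once `page` qualifies as hot
        and should be allocated into the DRAM cache."""
        count = self.counters.pop(page, 0) + 1
        self.counters[page] = count                 # move to MRU position
        if len(self.counters) > self.capacity:
            self.counters.popitem(last=False)       # evict the LRU entry
        return count >= self.hot_threshold
```
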
Title: ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers
Authors: Yoongu Kim, Dongsu Han, O. Mutlu, Mor Harchol-Balter
DOI: 10.1109/HPCA.2010.5416658
Abstract: Modern chip multiprocessor (CMP) systems employ multiple memory controllers to control access to main memory. The scheduling algorithm employed by these memory controllers has a significant effect on system throughput, so choosing an efficient scheduling algorithm is important. The scheduling algorithm also needs to be scalable: as the number of cores increases, the number of memory controllers shared by the cores should also increase to provide sufficient bandwidth to feed the cores. Unfortunately, previous memory scheduling algorithms are inefficient with respect to system throughput and/or are designed for a single memory controller and do not scale well to multiple controllers, requiring significant fine-grained coordination among them. This paper proposes ATLAS (Adaptive per-Thread Least-Attained-Service memory scheduling), a fundamentally new memory scheduling technique that improves system throughput without requiring significant coordination among memory controllers. The key idea is to periodically order threads based on the service they have attained from the memory controllers so far, and to prioritize, in each period, those threads that have attained the least service. The idea of favoring threads with least attained service is borrowed from the queueing theory literature, where it is known that, in the context of a single-server queue, least-attained-service scheduling is optimal assuming a Pareto (or any decreasing-hazard-rate) workload distribution. After verifying that our workloads have this characteristic, we show that our implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput. Furthermore, since the periods over which we accumulate attained service are long, the controllers coordinate very infrequently to form the thread ordering, making ATLAS scalable to many controllers. We evaluate ATLAS on a wide variety of multiprogrammed SPEC 2006 workloads and systems with 4–32 cores and 1–16 memory controllers, and compare its performance to five previously proposed scheduling algorithms. Averaged over 32 workloads on a 24-core system with 4 controllers, ATLAS improves instruction throughput by 10.8% and system throughput by 8.4% compared to PAR-BS, the best previous CMP memory scheduling algorithm. ATLAS's performance benefit increases as the number of cores increases.

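The Python sketch below captures only the core ranking rule stated in the abstract: at the end of each long period, threads are ordered by total attained memory service, and each controller then independently favors requests from the least-serviced threads. The `Request` type and the service accounting are simplified assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread_id: int
    address: int

def rank_threads(attained_service):
    """attained_service: thread_id -> total memory service received so far.
    Returns thread_id -> rank, with rank 0 the highest priority (the
    thread that has attained the least service)."""
    order = sorted(attained_service, key=attained_service.get)
    return {tid: rank for rank, tid in enumerate(order)}

def pick_request(pending, ranks):
    """Each controller independently serves the pending request whose
    thread holds the best rank; the ranking itself is refreshed only once
    per long period, so controllers rarely need to coordinate."""
    return min(pending, key=lambda req: ranks[req.thread_id])
```
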
Title: High-performance low-Vcc in-order core
Authors: J. Abella, P. Chaparro, X. Vera, J. Carretero, Antonio González
DOI: 10.1109/HPCA.2010.5416630
Abstract: Power density grows with each new technology node, requiring Vcc to scale down, especially in mobile platforms where energy is critical. This paper presents a novel approach to decreasing Vcc while keeping the operating frequency high. Our mechanism is referred to as immediate read after write (IRAW) avoidance. We propose an implementation of the mechanism for an Intel® Silverthorne™ in-order core. Furthermore, we show that our mechanism can be adapted dynamically to provide the highest performance and lowest energy-delay product (EDP) at each Vcc level. Results show that IRAW avoidance increases operating frequency by 57% at 500mV and 99% at 400mV with negligible area and power overhead (below 1%), which translates into large speedups (48% at 500mV and 90% at 400mV) and EDP reductions (to 0.61 of the baseline EDP at 500mV and 0.33 at 400mV).

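The abstract does not spell out the mechanism, so the sketch below is only a guess at the general shape of IRAW avoidance: at low Vcc, a cell read in the cycle immediately after it is written is treated as unreliable, and the read is served by forwarding the just-written value instead of accessing the array. The one-cycle hazard window and the forwarding reaction are assumptions made purely for illustration.

```python
class IRAWBypass:
    """Track the entry written in the previous cycle; a read of that same
    entry is served by forwarding the written data rather than reading the
    (assumed not yet settled) low-Vcc cell. A sketch, not the paper's design."""

    def __init__(self):
        self.last_write = None      # (address, data) from the previous cycle

    def read(self, address, array_read):
        if self.last_write is not None and self.last_write[0] == address:
            return self.last_write[1]   # forward instead of reading the array
        return array_read(address)

    def tick(self, write=None):
        """Advance one cycle; `write` is an optional (address, data) pair."""
        self.last_write = write
```
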
Title: Extreme scale computing: Challenges and opportunities
Authors: J. Torrellas, W. Gropp, J. Moreno, K. Olukotun, Vivek Sarkar
DOI: 10.1145/1837853.1693468
Abstract: An extreme scale system is one that is one thousand times more capable than a current comparable system, with the same power and physical footprint. Intuitively, this means that the power consumption and physical footprint of a current departmental server should be enough to deliver petascale performance, and that a single commodity chip should deliver terascale performance. In this panel, we will discuss the resulting challenges in energy/power efficiency, concurrency and locality, resiliency and programmability, and the research opportunities that may take us to extreme scale systems.

Title: Is hardware innovation over?
Authors: Arvind
DOI: 10.1145/1693453.1693455
Abstract: My colleagues, promotion committees, research funding agencies, and business people often wonder whether there is any need for architecture research. There seems to be no room to dislodge Intel IA-32. Even the number of new Application-Specific Integrated Circuits (ASICs) seems to decline each year because of the ever-increasing development cost. This viewpoint ignores another reality: the future will be dominated by mobile devices such as smart phones and the infrastructure needed to support consumer services on these devices. This is already restructuring the IT industry. To first order, functionality in the mobile world is determined by what can be supported within a 3W power budget. The only way to reduce power by one to two orders of magnitude is via functionally specialized hardware blocks. A fundamental shift is needed in the current design flow of systems-on-a-chip (SoCs) to produce them in a less risky and cost-effective manner. In this talk we present, via examples, a method of designing systems that facilitates the synthesis of complex SoCs from reusable "IP" modules. The technical challenge is to provide a method for connecting modules in a parallel setting so that the functionality and performance of the composite are predictable.

Title: Exascale computing: The challenges and opportunities in the next decade
Authors: T. Agerwala
DOI: 10.1145/1693453.1693454
Abstract: Supercomputing systems have made great strides in recent years as the extensive computing needs of cutting-edge engineering work and scientific discovery have driven the development of more powerful systems. In 2008, the first petaflop machine was released, and historical trends indicate that in ten years we should be at the exascale level. Indeed, various agencies are targeting a computer system capable of 1 exaop (10^18 operations) of computation within the next decade. We believe that applications in many industries will be materially transformed by exascale computers.

Title: A hybrid solid-state storage architecture for the performance, energy consumption, and lifetime improvement
Authors: Guangyu Sun, Yongsoo Joo, Yibo Chen, Dimin Niu, Yuan Xie, Yiran Chen, Hai Helen Li
DOI: 10.1109/HPCA.2010.5416650
Abstract: In recent years, many systems have employed NAND flash memory as a storage device because of its advantages over traditional hard disk drives: higher performance, high density, random access, increasing capacity, and falling cost. On the other hand, the performance of NAND flash memory is limited by its "erase-before-write" requirement. Log-based structures have been used to alleviate this problem by writing updated data to clean space. Prior log-based methods, however, cannot avoid excessive erase operations when there are frequent updates, which quickly consume free pages, especially when some data are updated repeatedly. In this paper, we propose a hybrid architecture for NAND flash memory storage in which the log region is implemented using phase-change random access memory (PRAM). Compared to traditional log-based architectures, it has the following advantages: (1) the PRAM log region allows in-place updating, so it significantly improves the usage efficiency of log pages by eliminating out-of-date log records; (2) it greatly reduces the traffic of reading from NAND flash storage, since the size of the logs loaded for a read operation is decreased; (3) the energy consumption of the storage system is reduced, as the overhead of writing and reading log data decreases with the PRAM log region; and (4) the lifetime of the NAND flash memory is increased because the number of erase operations is reduced. To facilitate the PRAM log region, we propose several management policies. The simulation results show that our proposed methods can substantially improve the performance, energy consumption, and lifetime of NAND flash memory storage.

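To make advantage (1) concrete, the Python sketch below contrasts an in-place-updatable log region with an append-only one; the classes and interfaces are illustrative assumptions, not the paper's design. Repeated writes to the same sector overwrite a single PRAM slot, while an append-only flash log accumulates stale records that every read must scan past and that eventually force erases.

```python
class PramLogRegion:
    """In-place-updatable log (the PRAM case): one slot per sector."""

    def __init__(self):
        self.slots = {}             # sector -> latest data

    def write(self, sector, data):
        self.slots[sector] = data   # overwrite; no stale record remains

    def read(self, sector, flash_read):
        if sector in self.slots:
            return self.slots[sector]
        return flash_read(sector)

class AppendOnlyLogRegion:
    """Append-only log (the flash case): stale records accumulate."""

    def __init__(self):
        self.records = []           # list of (sector, data), oldest first

    def write(self, sector, data):
        self.records.append((sector, data))

    def read(self, sector, flash_read):
        for s, d in reversed(self.records):   # scan for the latest copy
            if s == sector:
                return d
        return flash_read(sector)
```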