HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture: Latest Papers

Handling branches in TLS systems with Multi-Path Execution
Polychronis Xekalakis, Marcelo H. Cintra
Abstract: Thread-Level Speculation (TLS) has been proposed to facilitate the extraction of parallel threads from sequential applications. Most prior work on TLS has focused on architectural features directly related to supporting the main TLS operations. In this work we instead investigate how a common microarchitectural feature, namely branch prediction, interacts with TLS. We show that branch prediction for TLS is even more important than it is for sequential execution. Unfortunately, branch prediction for TLS systems is also inherently harder: code partitioning and re-executions of squashed threads pollute the branch history, making it harder for predictors to be accurate. We thus propose to augment the hardware to accommodate Multi-Path Execution (MP) within the existing TLS protocol. Under the MP execution model, all paths following a number of hard-to-predict conditional branches are followed simultaneously. MP execution thus removes branches that would otherwise have been mispredicted, helping the core to exploit more ILP. We show that, with only minimal hardware support, one can combine these two execution models into a unified one.
Experimental results show that our combined execution model achieves speedups of up to 23.2%, with an average of 9.2%, over an existing state-of-the-art TLS system and speedups of up to 138%, with an average of 28.2%, when compared with MP execution for a subset of the SPEC2000 Int benchmark suite.
DOI: https://doi.org/10.1109/HPCA.2010.5416632
Published: 2010-04-01
Citations: 7
CHOP: Adaptive filter-based DRAM caching for CMP server platforms
Xiaowei Jiang, Niti Madan, Li Zhao, Mike Upton, R. Iyer, S. Makineni, D. Newell, Yan Solihin, R. Balasubramonian
Abstract: As manycore architectures enable a large number of cores on the die, a key challenge that emerges is the availability of memory bandwidth with conventional DRAM solutions. To address this challenge, integration of large DRAM caches that provide as much as 5× higher bandwidth and as low as 1/3rd of the latency (as compared to conventional DRAM) is very promising. However, organizing and implementing a large DRAM cache is challenging because of two primary tradeoffs: (a) DRAM caches at cache line granularity require an on-chip tag area so large that it makes them undesirable and (b) DRAM caches with larger page granularity require too much bandwidth because the miss rate does not reduce enough to overcome the bandwidth increase. In this paper, we propose CHOP (Caching HOt Pages) in DRAM caches to address these challenges. We study several filter-based DRAM caching techniques: (a) a filter cache (CHOP-FC) that profiles pages and determines the hot subset of pages to allocate into the DRAM cache, (b) a memory-based filter cache (CHOP-MFC) that spills and fills filter state to improve the accuracy and reduce the size of the filter cache and (c) an adaptive DRAM caching technique (CHOP-AFC) to determine when the filter cache should be enabled and disabled for DRAM caching.
We conduct detailed simulations with server workloads to show that our filter-based DRAM caching techniques achieve the following: (a) on average over 30% performance improvement over previous solutions, (b) several orders of magnitude lower area overhead in tag space than required for cache-line based DRAM caches, (c) significantly lower memory bandwidth consumption as compared to page-granular DRAM caches.
DOI: https://doi.org/10.1109/HPCA.2010.5416642
Published: 2010-04-01
Citations: 151
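The filter-cache idea in the CHOP abstract (profile pages, then allocate only the hot subset into the DRAM cache) can be illustrated with a small Python sketch. This is not the authors' design: the class name `HotPageFilter`, the access-count threshold, the LRU replacement of filter entries, and the unbounded DRAM cache are all assumptions made for illustration.

```python
from collections import OrderedDict


class HotPageFilter:
    """Toy filter cache: counts accesses per page and flags hot pages.

    All parameters and policies here are illustrative assumptions,
    not details taken from the CHOP paper.
    """

    def __init__(self, capacity: int, hot_threshold: int):
        self.capacity = capacity            # max pages tracked by the filter
        self.hot_threshold = hot_threshold  # accesses needed to count as hot
        self.counters = OrderedDict()       # page -> access count, LRU order
        self.dram_cache = set()             # pages allocated (unbounded here)

    def access(self, page: int) -> bool:
        """Record one access; return True if the DRAM cache serves it."""
        if page in self.dram_cache:
            return True
        count = self.counters.pop(page, 0) + 1
        if count >= self.hot_threshold:
            self.dram_cache.add(page)       # page became hot: allocate it
            return False
        self.counters[page] = count         # re-insert as most recently used
        if len(self.counters) > self.capacity:
            self.counters.popitem(last=False)  # evict least recently used
        return False


f = HotPageFilter(capacity=2, hot_threshold=3)
hits = [f.access(p) for p in [7, 7, 7, 7, 1, 2, 3]]
print(hits)          # [False, False, False, True, False, False, False]
print(f.dram_cache)  # {7}
```

Only page 7 is touched often enough to be allocated, so the pages touched once (1, 2, 3) never consume DRAM cache space or fill bandwidth in this toy model.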
ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers
Yoongu Kim, Dongsu Han, O. Mutlu, Mor Harchol-Balter
Abstract: Modern chip multiprocessor (CMP) systems employ multiple memory controllers to control access to main memory. The scheduling algorithm employed by these memory controllers has a significant effect on system throughput, so choosing an efficient scheduling algorithm is important. The scheduling algorithm also needs to be scalable: as the number of cores increases, the number of memory controllers shared by the cores should also increase to provide sufficient bandwidth to feed the cores. Unfortunately, previous memory scheduling algorithms are inefficient with respect to system throughput and/or are designed for a single memory controller and do not scale well to multiple memory controllers, requiring significant fine-grained coordination among controllers. This paper proposes ATLAS (Adaptive per-Thread Least-Attained-Service memory scheduling), a fundamentally new memory scheduling technique that improves system throughput without requiring significant coordination among memory controllers. The key idea is to periodically order threads based on the service they have attained from the memory controllers so far, and prioritize those threads that have attained the least service over others in each period. The idea of favoring threads with least-attained-service is borrowed from the queueing theory literature, where, in the context of a single-server queue, it is known that least-attained-service optimally schedules jobs, assuming a Pareto (or any decreasing hazard rate) workload distribution.
After verifying that our workloads have this characteristic, we show that our implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput. Furthermore, since the periods over which we accumulate the attained service are long, the controllers coordinate very infrequently to form the ordering of threads, thereby making ATLAS scalable to many controllers. We evaluate ATLAS on a wide variety of multiprogrammed SPEC 2006 workloads and systems with 4–32 cores and 1–16 memory controllers, and compare its performance to five previously proposed scheduling algorithms. Averaged over 32 workloads on a 24-core system with 4 controllers, ATLAS improves instruction throughput by 10.8%, and system throughput by 8.4%, compared to PAR-BS, the best previous CMP memory scheduling algorithm. ATLAS's performance benefit increases as the number of cores increases.
DOI: https://doi.org/10.1109/HPCA.2010.5416658
Published: 2010-04-01
Citations: 429
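The key idea stated in the ATLAS abstract, ordering threads once per period by the service they have attained so far and favoring the least-served threads, can be sketched in Python as follows. The function names, the dictionary-based counters, the alphabetical tie-break, and the way per-controller counts are merged are hypothetical; the abstract does not specify how service is measured or how individual requests are otherwise prioritized, so those details are omitted.

```python
def rank_threads(total_service: dict) -> list:
    """Order threads so the one with the least attained service comes first."""
    return sorted(total_service, key=lambda t: (total_service[t], t))


def end_of_period(total_service: dict, per_controller_service: list) -> list:
    """Merge each controller's local counters, then form the new ranking.

    This is the only point where controllers exchange information,
    mirroring the infrequent coordination described in the abstract.
    """
    for local in per_controller_service:
        for thread, cycles in local.items():
            total_service[thread] = total_service.get(thread, 0) + cycles
    return rank_threads(total_service)


def pick_request(ranking: list, pending: list):
    """Choose the pending (thread, address) request from the top-ranked thread."""
    position = {thread: i for i, thread in enumerate(ranking)}
    return min(pending, key=lambda request: position[request[0]])


total = {"A": 0, "B": 0, "C": 0}
ranking = end_of_period(total, [{"A": 500, "B": 100}, {"A": 300, "C": 200}])
print(ranking)                                          # ['B', 'C', 'A']
print(pick_request(ranking, [("A", 16), ("C", 32)]))    # ('C', 32)
```

Thread A has received the most service across both controllers, so its request waits behind thread C's for the whole next period; no per-request communication between controllers is needed.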
High-Performance low-Vcc in-order core
J. Abella, P. Chaparro, X. Vera, J. Carretero, Antonio González
Abstract: Power density grows in new technology nodes, thus requiring Vcc to scale, especially in mobile platforms where energy is critical. This paper presents a novel approach to decrease Vcc while keeping operating frequency high. Our mechanism is referred to as immediate read after write (IRAW) avoidance. We propose an implementation of the mechanism for an Intel® Silverthorne™ in-order core. Furthermore, we show that our mechanism can be adapted dynamically to provide the highest performance and lowest energy-delay product (EDP) at each Vcc level. Results show that IRAW avoidance increases operating frequency by 57% at 500mV and 99% at 400mV with negligible area and power overhead (below 1%), which translates into large speedups (48% at 500mV and 90% at 400mV) and EDP reductions (0.61 EDP at 500mV and 0.33 at 400mV).
DOI: https://doi.org/10.1109/HPCA.2010.5416630
Published: 2010-04-01
Citations: 6
Extreme scale computing: Challenges and opportunities
J. Torrellas, W. Gropp, J. Moreno, K. Olukotun, Vivek Sarkar
Abstract: An extreme scale system is one that is one thousand times more capable than a current comparable system, with the same power and physical footprint. Intuitively, this means that the power consumption and physical footprint of a current departmental server should be enough to deliver petascale performance, and that a single, commodity chip should deliver terascale performance. In this panel, we will discuss the resulting challenges in energy/power efficiency, concurrency and locality, resiliency and programmability, and the research opportunities that may take us to extreme scale systems.
DOI: https://doi.org/10.1145/1837853.1693468
Published: 2010-01-09
Citations: 0
Is hardware innovation over?
Arvind
Abstract: My colleagues, promotion committees, research funding agencies and business people often wonder if there is a need for any architecture research. There seems to be no room to dislodge Intel IA-32. Even the number of new Application-Specific Integrated Circuits (ASICs) seems to be declining each year, because of the ever-increasing development cost. This viewpoint ignores another reality, which is that the future will be dominated by mobile devices such as smart phones and the infrastructure needed to support consumer services on these devices. This is already restructuring the IT industry. To first order, in the mobile world functionality is determined by what can be supported within a 3W power budget. The only way to reduce power by one to two orders of magnitude is via functionally specialized hardware blocks. A fundamental shift is needed in the current design flow of systems-on-a-chip (SoCs) to produce them in a less risky and more cost-effective manner. In this talk we will present, via examples, a method of designing systems that facilitates the synthesis of complex SoCs from reusable "IP" modules.
The technical challenge is to provide a method for connecting modules in a parallel setting so that the functionality and the performance of the composite are predictable.
DOI: https://doi.org/10.1145/1693453.1693455
Published: 2010-01-09
Citations: 0
Exascale computing: The challenges and opportunities in the next decade
T. Agerwala
Abstract: Supercomputing systems have made great strides in recent years as the extensive computing needs of cutting-edge engineering work and scientific discovery have driven the development of more powerful systems. In 2008, the first petaflop machine was released, and historic trends indicate that in ten years, we should be at the exascale level. Indeed, various agencies are targeting a computer system capable of 1 Exaop (10^18 ops) of computation within the next decade. We believe that applications in many industries will be materially transformed by exascale computers.
DOI: https://doi.org/10.1145/1693453.1693454
Published: 2010-01-09
Citations: 22
A Hybrid solid-state storage architecture for the performance, energy consumption, and lifetime improvement
Guangyu Sun, Yongsoo Joo, Yibo Chen, Dimin Niu, Yuan Xie, Yiran Chen, Hai Helen Li
Abstract: In recent years, many systems have employed NAND flash memory as storage devices because of its advantages of higher performance (compared to the traditional hard disk drive), high density, random access, increasing capacity, and falling cost. On the other hand, the performance of NAND flash memory is limited by its "erase-before-write" requirement. Log-based structures have been used to alleviate this problem by writing updated data to the clean space. Prior log-based methods, however, cannot avoid excessive erase operations when there are frequent updates, which quickly consume free pages, especially when some data are updated repeatedly. In this paper, we propose a hybrid architecture for the NAND flash memory storage, of which the log region is implemented using phase change random access memory (PRAM). Compared to traditional log-based architectures, it has the following advantages: (1) the PRAM log region allows in-place updating so that it significantly improves the usage efficiency of log pages by eliminating out-of-date log records; (2) it greatly reduces the traffic of reading from the NAND flash memory storage since the size of logs loaded for the read operation is decreased; (3) the energy consumption of the storage system is reduced as the overhead of writing and reading log data is decreased with the PRAM log region; (4) the lifetime of NAND flash memory is increased because the number of erase operations is reduced. To facilitate the PRAM log region, we propose several management policies.
The simulation results show that our proposed methods can substantially improve the performance, energy consumption, and lifetime of the NAND flash memory storage.
DOI: https://doi.org/10.1109/HPCA.2010.5416650
Published: 2010-01-01
Citations: 141
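Advantage (1) in the hybrid-storage abstract, that an in-place-updatable log region eliminates out-of-date log records, can be illustrated with a toy Python comparison. The two functions and the page-granular record format are assumptions for illustration only; they model neither the paper's management policies nor real flash or PRAM timing.

```python
def append_only_log(updates):
    """Flash-style log: every update consumes a fresh log record,
    because an existing record cannot be overwritten without an erase."""
    log = []
    for page, data in updates:
        log.append((page, data))
    return log


def in_place_log(updates):
    """In-place-updatable log: a new update overwrites the previous
    record for the same page, so stale records never accumulate."""
    log = {}
    for page, data in updates:
        log[page] = data
    return log


updates = [(3, "a"), (5, "b"), (3, "c"), (3, "d")]
print(len(append_only_log(updates)))  # 4
print(in_place_log(updates))          # {3: 'd', 5: 'b'}
```

With page 3 updated three times, the append-only log holds four records, two of them stale, while the in-place log holds two. In this model, fewer records means the log region fills more slowly and less log data has to be scanned on a read.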