Title: Handling branches in TLS systems with Multi-Path Execution
Authors: Polychronis Xekalakis, Marcelo H. Cintra
DOI: 10.1109/HPCA.2010.5416632
Abstract: Thread-Level Speculation (TLS) has been proposed to facilitate the extraction of parallel threads from sequential applications. Most prior work on TLS has focused on architectural features directly related to supporting the main TLS operations. In this work, we instead investigate how a common microarchitectural feature, namely branch prediction, interacts with TLS. We show that branch prediction is even more important for TLS than it is for sequential execution. Unfortunately, branch prediction for TLS systems is also inherently harder: code partitioning and re-executions of squashed threads pollute the branch history, making it harder for predictors to be accurate. We thus propose to augment the hardware to accommodate Multi-Path Execution (MP) within the existing TLS protocol. Under the MP execution model, all paths following a number of hard-to-predict conditional branches are followed simultaneously. MP execution thus removes branches that would otherwise have been mispredicted, helping the core exploit more ILP. We show that, with only minimal hardware support, the two execution models can be combined into a unified one. Experimental results show that our combined execution model achieves speedups of up to 23.2%, with an average of 9.2%, over an existing state-of-the-art TLS system, and speedups of up to 138%, with an average of 28.2%, compared with MP execution, for a subset of the SPEC2000 integer benchmark suite.

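As a rough illustration of the MP idea, the Python sketch below (not the authors' hardware design) gates path forking on a branch-confidence estimate: only branches that a simple per-PC saturating-counter estimator flags as hard to predict fork both paths, and only when a spare execution context is available. The estimator structure, the thresholds, and the `free_contexts` parameter are illustrative assumptions, not details from the paper.

```python
# Confidence-gated multi-path execution, sketched at a very high level.
# A per-PC saturating counter is bumped on each correct prediction and
# reset on a misprediction; branches whose counter sits below a threshold
# are treated as hard to predict and fork both paths when a context is free.

class ConfidenceEstimator:
    def __init__(self, threshold=3, max_count=7):
        self.counters = {}          # branch PC -> saturating counter
        self.threshold = threshold
        self.max_count = max_count

    def is_confident(self, pc):
        return self.counters.get(pc, 0) >= self.threshold

    def update(self, pc, predicted_correctly):
        if predicted_correctly:
            self.counters[pc] = min(self.counters.get(pc, 0) + 1,
                                    self.max_count)
        else:
            self.counters[pc] = 0   # misprediction: confidence collapses

def paths_to_follow(pc, predicted_taken, free_contexts, estimator):
    """Fork both targets of a hard-to-predict branch; otherwise follow
    only the predicted path."""
    if not estimator.is_confident(pc) and free_contexts > 0:
        return ["taken", "not_taken"]
    return ["taken" if predicted_taken else "not_taken"]
```
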
Title: CHOP: Adaptive filter-based DRAM caching for CMP server platforms
Authors: Xiaowei Jiang, Niti Madan, Li Zhao, Mike Upton, R. Iyer, S. Makineni, D. Newell, Yan Solihin, R. Balasubramonian
DOI: 10.1109/HPCA.2010.5416642
Abstract: As manycore architectures enable a large number of cores on the die, a key challenge that emerges is the availability of memory bandwidth with conventional DRAM solutions. To address this challenge, integrating large DRAM caches that provide as much as 5× the bandwidth and as little as one third the latency of conventional DRAM is very promising. However, organizing and implementing a large DRAM cache is challenging because of two primary tradeoffs: (a) DRAM caches managed at cache-line granularity require an on-chip tag area too large to be practical, and (b) DRAM caches managed at page granularity consume too much bandwidth because the miss rate does not drop enough to offset the bandwidth increase. In this paper, we propose CHOP (Caching HOt Pages) in DRAM caches to address these challenges. We study several filter-based DRAM caching techniques: (a) a filter cache (CHOP-FC) that profiles pages and determines the hot subset of pages to allocate into the DRAM cache, (b) a memory-based filter cache (CHOP-MFC) that spills and fills filter state to improve accuracy and reduce the size of the filter cache, and (c) an adaptive DRAM caching technique (CHOP-AFC) that determines when the filter cache should be enabled and disabled for DRAM caching. We conduct detailed simulations with server workloads to show that our filter-based DRAM caching techniques achieve: (a) over 30% performance improvement on average over previous solutions, (b) orders of magnitude lower tag-area overhead than cache-line-based DRAM caches, and (c) significantly lower memory bandwidth consumption than page-granular DRAM caches.

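As a rough illustration of the filter-cache idea behind CHOP-FC, the Python sketch below profiles page accesses in a small bounded table and reports a page as hot, i.e., worth allocating into the DRAM cache, only after it crosses an access-count threshold. The capacity, threshold, and LRU replacement are illustrative assumptions rather than the paper's parameters.

```python
from collections import OrderedDict

class HotPageFilter:
    """Bounded table of per-page access counters with LRU replacement."""

    def __init__(self, capacity=4096, hot_threshold=8):
        self.counters = OrderedDict()   # page number -> access count
        self.capacity = capacity
        self.hot_threshold = hot_threshold

    def access(self, page):
        """Record one access; return True once `page` qualifies as hot
        and should be allocated into the DRAM cache."""
        count = self.counters.pop(page, 0) + 1
        self.counters[page] = count                 # move to MRU position
        if len(self.counters) > self.capacity:
            self.counters.popitem(last=False)       # evict the LRU entry
        return count >= self.hot_threshold
```
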
Title: ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers
Authors: Yoongu Kim, Dongsu Han, O. Mutlu, Mor Harchol-Balter
DOI: 10.1109/HPCA.2010.5416658
Abstract: Modern chip multiprocessor (CMP) systems employ multiple memory controllers to control access to main memory. The scheduling algorithm employed by these memory controllers has a significant effect on system throughput, so choosing an efficient scheduling algorithm is important. The scheduling algorithm also needs to be scalable: as the number of cores increases, the number of memory controllers shared by the cores should also increase to provide sufficient bandwidth to feed the cores. Unfortunately, previous memory scheduling algorithms are inefficient with respect to system throughput and/or are designed for a single memory controller and do not scale well to multiple controllers, requiring significant fine-grained coordination among them. This paper proposes ATLAS (Adaptive per-Thread Least-Attained-Service memory scheduling), a fundamentally new memory scheduling technique that improves system throughput without requiring significant coordination among memory controllers. The key idea is to periodically order threads based on the service they have attained from the memory controllers so far, and to prioritize, in each period, those threads that have attained the least service. The idea of favoring threads with least attained service is borrowed from the queueing theory literature, where it is known that, in the context of a single-server queue, least-attained-service scheduling is optimal assuming a Pareto (or any decreasing-hazard-rate) workload distribution. After verifying that our workloads have this characteristic, we show that our implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput. Furthermore, since the periods over which we accumulate attained service are long, the controllers coordinate very infrequently to form the thread ordering, making ATLAS scalable to many controllers. We evaluate ATLAS on a wide variety of multiprogrammed SPEC 2006 workloads and systems with 4–32 cores and 1–16 memory controllers, and compare its performance to five previously proposed scheduling algorithms. Averaged over 32 workloads on a 24-core system with 4 controllers, ATLAS improves instruction throughput by 10.8% and system throughput by 8.4% compared to PAR-BS, the best previous CMP memory scheduling algorithm. ATLAS's performance benefit increases as the number of cores increases.

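The Python sketch below captures only the core ranking rule stated in the abstract: at the end of each long period, threads are ordered by total attained memory service, and each controller then independently favors requests from the least-serviced threads. The `Request` type and the service accounting are simplified assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread_id: int
    address: int

def rank_threads(attained_service):
    """attained_service: thread_id -> total memory service received so far.
    Returns thread_id -> rank, with rank 0 the highest priority (the
    thread that has attained the least service)."""
    order = sorted(attained_service, key=attained_service.get)
    return {tid: rank for rank, tid in enumerate(order)}

def pick_request(pending, ranks):
    """Each controller independently serves the pending request whose
    thread holds the best rank; the ranking itself is refreshed only once
    per long period, so controllers rarely need to coordinate."""
    return min(pending, key=lambda req: ranks[req.thread_id])
```
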
Title: High-performance low-Vcc in-order core
Authors: J. Abella, P. Chaparro, X. Vera, J. Carretero, Antonio González
DOI: 10.1109/HPCA.2010.5416630
Abstract: Power density grows with each new technology node, requiring Vcc to scale down, especially in mobile platforms where energy is critical. This paper presents a novel approach to decreasing Vcc while keeping the operating frequency high. Our mechanism is referred to as immediate read after write (IRAW) avoidance. We propose an implementation of the mechanism for an Intel® Silverthorne™ in-order core. Furthermore, we show that our mechanism can be adapted dynamically to provide the highest performance and lowest energy-delay product (EDP) at each Vcc level. Results show that IRAW avoidance increases operating frequency by 57% at 500mV and 99% at 400mV with negligible area and power overhead (below 1%), which translates into large speedups (48% at 500mV and 90% at 400mV) and EDP reductions (to 0.61 of the baseline EDP at 500mV and 0.33 at 400mV).

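The abstract does not spell out the mechanism, so the sketch below is only a guess at the general shape of IRAW avoidance: at low Vcc, a cell read in the cycle immediately after it is written is treated as unreliable, and the read is served by forwarding the just-written value instead of accessing the array. The one-cycle hazard window and the forwarding reaction are assumptions made purely for illustration.

```python
class IRAWBypass:
    """Track the entry written in the previous cycle; a read of that same
    entry is served by forwarding the written data rather than reading the
    (assumed not yet settled) low-Vcc cell. A sketch, not the paper's design."""

    def __init__(self):
        self.last_write = None      # (address, data) from the previous cycle

    def read(self, address, array_read):
        if self.last_write is not None and self.last_write[0] == address:
            return self.last_write[1]   # forward instead of reading the array
        return array_read(address)

    def tick(self, write=None):
        """Advance one cycle; `write` is an optional (address, data) pair."""
        self.last_write = write
```
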
Title: Extreme scale computing: Challenges and opportunities
Authors: J. Torrellas, W. Gropp, J. Moreno, K. Olukotun, Vivek Sarkar
DOI: 10.1145/1837853.1693468
Abstract: An extreme scale system is one that is one thousand times more capable than a current comparable system, with the same power and physical footprint. Intuitively, this means that the power consumption and physical footprint of a current departmental server should be enough to deliver petascale performance, and that a single commodity chip should deliver terascale performance. In this panel, we will discuss the resulting challenges in energy/power efficiency, concurrency and locality, resiliency and programmability, and the research opportunities that may take us to extreme scale systems.

Title: Is hardware innovation over?
Authors: Arvind
DOI: 10.1145/1693453.1693455
Abstract: My colleagues, promotion committees, research funding agencies, and business people often wonder whether there is any need for architecture research. There seems to be no room to dislodge Intel IA-32. Even the number of new Application-Specific Integrated Circuits (ASICs) seems to decline each year because of the ever-increasing development cost. This viewpoint ignores another reality: the future will be dominated by mobile devices such as smart phones and the infrastructure needed to support consumer services on these devices. This is already restructuring the IT industry. To first order, functionality in the mobile world is determined by what can be supported within a 3W power budget. The only way to reduce power by one to two orders of magnitude is via functionally specialized hardware blocks. A fundamental shift is needed in the current design flow of systems-on-a-chip (SoCs) to produce them in a less risky and cost-effective manner. In this talk we present, via examples, a method of designing systems that facilitates the synthesis of complex SoCs from reusable "IP" modules. The technical challenge is to provide a method for connecting modules in a parallel setting so that the functionality and performance of the composite are predictable.

Title: Exascale computing: The challenges and opportunities in the next decade
Authors: T. Agerwala
DOI: 10.1145/1693453.1693454
Abstract: Supercomputing systems have made great strides in recent years as the extensive computing needs of cutting-edge engineering work and scientific discovery have driven the development of more powerful systems. In 2008, the first petaflop machine was released, and historical trends indicate that in ten years we should be at the exascale level. Indeed, various agencies are targeting a computer system capable of 1 exaop (10^18 operations) of computation within the next decade. We believe that applications in many industries will be materially transformed by exascale computers.

Title: A hybrid solid-state storage architecture for the performance, energy consumption, and lifetime improvement
Authors: Guangyu Sun, Yongsoo Joo, Yibo Chen, Dimin Niu, Yuan Xie, Yiran Chen, Hai Helen Li
DOI: 10.1109/HPCA.2010.5416650
Abstract: In recent years, many systems have employed NAND flash memory as a storage device because of its advantages over traditional hard disk drives: higher performance, high density, random access, increasing capacity, and falling cost. On the other hand, the performance of NAND flash memory is limited by its "erase-before-write" requirement. Log-based structures have been used to alleviate this problem by writing updated data to clean space. Prior log-based methods, however, cannot avoid excessive erase operations when there are frequent updates, which quickly consume free pages, especially when some data are updated repeatedly. In this paper, we propose a hybrid architecture for NAND flash memory storage in which the log region is implemented using phase-change random access memory (PRAM). Compared to traditional log-based architectures, it has the following advantages: (1) the PRAM log region allows in-place updating, so it significantly improves the usage efficiency of log pages by eliminating out-of-date log records; (2) it greatly reduces the traffic of reading from NAND flash storage, since the size of the logs loaded for a read operation is decreased; (3) the energy consumption of the storage system is reduced, as the overhead of writing and reading log data decreases with the PRAM log region; and (4) the lifetime of the NAND flash memory is increased because the number of erase operations is reduced. To facilitate the PRAM log region, we propose several management policies. The simulation results show that our proposed methods can substantially improve the performance, energy consumption, and lifetime of NAND flash memory storage.

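To make advantage (1) concrete, the Python sketch below contrasts an in-place-updatable log region with an append-only one; the classes and interfaces are illustrative assumptions, not the paper's design. Repeated writes to the same sector overwrite a single PRAM slot, while an append-only flash log accumulates stale records that every read must scan past and that eventually force erases.

```python
class PramLogRegion:
    """In-place-updatable log (the PRAM case): one slot per sector."""

    def __init__(self):
        self.slots = {}             # sector -> latest data

    def write(self, sector, data):
        self.slots[sector] = data   # overwrite; no stale record remains

    def read(self, sector, flash_read):
        if sector in self.slots:
            return self.slots[sector]
        return flash_read(sector)

class AppendOnlyLogRegion:
    """Append-only log (the flash case): stale records accumulate."""

    def __init__(self):
        self.records = []           # list of (sector, data), oldest first

    def write(self, sector, data):
        self.records.append((sector, data))

    def read(self, sector, flash_read):
        for s, d in reversed(self.records):   # scan for the latest copy
            if s == sector:
                return d
        return flash_read(sector)
```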