Paying to save: Reducing cost of colocation data center via rewards
M. A. Islam, Hasan Mahmud, Shaolei Ren, Xiaorui Wang
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 235-245. DOI: 10.1109/HPCA.2015.7056036

Abstract: Power-hungry data centers face mounting pressure to reduce energy costs. Existing efforts, though numerous, have centered primarily on owner-operated data centers (e.g., Google), leaving another critical segment much less explored: the colocation data center (e.g., Equinix), which rents out physical space to multiple tenants who house their own servers. Colocations face a major barrier to cost efficiency: server power management by individual tenants is uncoordinated. This paper proposes RECO (REward for COst reduction), which uses financial rewards as a lever to shift tenants' power management from uncoordinated to coordinated. RECO pays voluntarily participating tenants for energy reduction such that the colocation operator's overall cost is minimized. RECO incorporates the time-varying operating environment (e.g., cooling efficiency, intermittent renewables), addresses the peak power demand charge, and proactively learns tenants' unknown responses to the offered reward. RECO includes a new feedback-based online algorithm that optimizes the reward without offline knowledge of the far future. We evaluate RECO using both scaled-down prototype experiments and simulations. Our results show that RECO is "win-win": it reduces the colocation operator's overall cost by up to 27% compared to a no-incentive baseline, while tenants receive financial rewards (up to 15% of their colocation costs) for "free", without violating Service Level Agreements.
{"title":"Talus: A simple way to remove cliffs in cache performance","authors":"Nathan Beckmann, Daniel Sánchez","doi":"10.1109/HPCA.2015.7056022","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056022","url":null,"abstract":"Caches often suffer from performance cliffs: minor changes in program behavior or available cache space cause large changes in miss rate. Cliffs hurt performance and complicate cache management. We present Talus,1 a simple scheme that removes these cliffs. Talus works by dividing a single application's access stream into two partitions, unlike prior work that partitions among competing applications. By controlling the sizes of these partitions, Talus ensures that as an application is given more cache space, its miss rate decreases in a convex fashion. We prove that Talus removes performance cliffs, and evaluate it through extensive simulation. Talus adds negligible overheads, improves single-application performance, simplifies partitioning algorithms, and makes cache partitioning more effective and fair.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"27 1","pages":"64-75"},"PeriodicalIF":0.0,"publicationDate":"2015-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84907524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Overcoming far-end congestion in large-scale networks
Jongmin Won, Gwangsun Kim, John Kim, Ted Jiang, Mike Parker, Steve Scott
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 415-427. DOI: 10.1109/HPCA.2015.7056051

Abstract: Accurately estimating congestion for global adaptive routing decisions (i.e., determining whether a packet should be routed minimally or non-minimally) has a significant impact on overall performance in high-radix topologies such as the Dragonfly. Prior work has focused on near-end congestion (congestion at the current router) or downstream congestion (congestion at downstream routers). However, most prior work does not evaluate the impact of far-end congestion, i.e., the apparent congestion caused by the high channel latency between routers. We refer to far-end congestion as phantom congestion because it is not "real" congestion: owing to the long inter-router latency, in-flight packets (and credits) distort congestion information and can lead to inaccurate adaptive routing decisions. In addition, we show how transient congestion occurs as the occupancy of network queues fluctuates due to random traffic variation, even in steady-state conditions; this too causes inaccurate adaptive routing decisions that degrade network performance with lower throughput and higher latency. To overcome these limitations, we propose a history-window-based approach that removes the impact of phantom congestion. We also show how using the average of local queue occupancies and adding an offset largely removes the impact of transient congestion. Our evaluation of adaptive routing in a large-scale Dragonfly network shows that the combination of these techniques nearly matches the performance of an ideal adaptive routing algorithm.
NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules
Amin Farmahini Farahani, Jung Ho Ahn, Katherine Morrow, N. Kim
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 283-295. DOI: 10.1109/HPCA.2015.7056040

Abstract: The energy consumed in transferring data across the processor memory hierarchy constitutes a large fraction of total system energy consumption, and this fraction has steadily increased with technology scaling. In this paper, we propose near-DRAM acceleration (NDA) architectures, which process data using accelerators 3D-stacked on the DRAM devices of off-chip main memory modules. NDA transfers most data through high-bandwidth, low-energy 3D interconnects between accelerators and DRAM devices instead of low-bandwidth, high-energy off-chip interconnects between the processor and DRAM devices, substantially reducing energy consumption and improving performance. Unlike previous near-memory processing architectures, NDA is built upon commodity DRAM devices; apart from inserting through-silicon vias (TSVs) to 3D-interconnect DRAM devices and accelerators, NDA requires minimal changes to commodity DRAM device and standard memory module architectures. This allows NDA to be adopted more easily in both existing and emerging systems. Our experiments demonstrate that, on average, an NDA-based system consumes 46% lower total energy (68% lower data-transfer energy) at 1.67× higher performance than a system that integrates the same accelerator logic within the processor itself.
{"title":"BeBoP: A cost effective predictor infrastructure for superscalar value prediction","authors":"Arthur Perais, André Seznec","doi":"10.1109/HPCA.2015.7056018","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056018","url":null,"abstract":"Up to recently, it was considered that a performance-effective implementation of Value Prediction (VP) would add tremendous complexity and power consumption in the pipeline, especially in the Out-of-Order engine and the predictor infrastructure. Despite recent progress in the field of Value Prediction, this remains partially true. Indeed, if the recent EOLE architecture proposition suggests that the OoO engine need not be altered to accommodate VP, complexity in the predictor infrastructure itself is still problematic. First, multiple predictions must be generated each cycle, but multi-ported structures should be avoided. Second, the predictor should be small enough to be considered for implementation, yet coverage must remain high enough to increase performance. To address these remaining concerns, we first propose a block-based value prediction scheme mimicking current instruction fetch mechanisms, BeBoP. It associates the predicted values with a fetch block rather than distinct instructions. Second, to remedy the storage issue, we present the Differential VTAGE predictor. This new tightly coupled hybrid predictor covers instructions predictable by both VTAGE and Stride-based value predictors, and its hardware cost and complexity can be made similar to those of a modern branch predictor. Third, we show that block-based value prediction allows to implement the checkpointing mechanism needed to provide D-VTAGE with last computed/predicted values at moderate cost. Overall, we establish that EOLE with a 32.8KB block-based D-VTAGE predictor and a 4-issue OoO engine can significantly outperform a baseline 6-issue superscalar processor, by up to 62.2% and 11.2% on average (gmean), on our benchmark set.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"23 1","pages":"13-25"},"PeriodicalIF":0.0,"publicationDate":"2015-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73968415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable communication architecture for network-attached accelerators","authors":"Sarah Neuwirth, Dirk Frey, M. Nüssle, U. Brüning","doi":"10.1109/HPCA.2015.7056068","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056068","url":null,"abstract":"On the road to Exascale computing, novel communication architectures are required to overcome the limitations of host-centric accelerators. Typically, accelerator devices require a local host CPU to configure and operate them. This limits the number of accelerators per host system. Network-attached accelerators are a new architectural approach for scaling the number of accelerators and host CPUs independently. In this paper, the communication architecture for network-attached accelerators is described which enables remote initialization and control of the accelerator devices. Furthermore, an operative prototype implementation is presented. The prototype accelerator node consists of an Intel Xeon Phi coprocessor and an EXTOLL NIC. The EXTOLL interconnect provides new features to enable accelerator-to-accelerator direct communication without a local host. Workloads can be dynamically assigned to CPUs and accelerators at run-time in an N to M ratio. The latency, bandwidth, and performance of the low-level implementation and MPI communication layer are presented. The LAMMPS molecular dynamics simulator is used to evaluate the communication architecture. The internode communication time is improved by up to 47%.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"26 1","pages":"627-638"},"PeriodicalIF":0.0,"publicationDate":"2015-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85721186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAFO: Cost aware flip optimization for asymmetric memories","authors":"R. Maddah, Seyed Mohammad Seyedzadeh, R. Melhem","doi":"10.1109/HPCA.2015.7056043","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056043","url":null,"abstract":"Phase Change Memory (PCM) and spin-transfer torque random access memory (STT-RAM) are emerging as new memory technologies to replace DRAM and NAND flash that are impeded by physical limitations. Programming PCM cells degrades their endurance while programming STT-RAM cells incurs a high bit error rate. Accordingly, several schemes have been proposed to service write requests while programing as few memory cells as possible. Nevertheless, those schemes did not address the asymmetry in programming memory cells that characterizes both PCM and STT-RAM. For instance, writing a bit value of 0 on PCM cells is more detrimental to endurance than 1 while writing a bit value of 1 on STT-RAM cells is more prone to error than 0. In this paper, we propose CAFO as a new cost aware flip reduction scheme. Essentially, CAFO encompasses a cost model that computes the cost of servicing write requests through assigning different costs to each cell that requires programming. Subsequently, CAFO encodes the data to be written into a form that incurs less cost through its cost aware encoding module. Overall, CAFO is capable of cutting down the write cost by up to 65% more than existing schemes.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"43 4 1","pages":"320-330"},"PeriodicalIF":0.0,"publicationDate":"2015-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90020227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Performance Computing and Applications, Second International Conference, HPCA 2009, Shanghai, China, August 10-12, 2009, Revised Selected Papers","authors":"Wu Zhang, Zhangxin Chen, C. Douglas, W. Tong","doi":"10.1007/978-3-642-11842-5","DOIUrl":"https://doi.org/10.1007/978-3-642-11842-5","url":null,"abstract":"","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2010-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83259148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}