{"title":"PABST: Proportionally Allocated Bandwidth at the Source and Target","authors":"Derek Hower, Harold W. Cain, Carl A. Waldspurger","doi":"10.1109/HPCA.2017.33","DOIUrl":"https://doi.org/10.1109/HPCA.2017.33","url":null,"abstract":"Higher integration lowers total cost of ownership (TCO) in the data center by reducing equipment cost and lowering energy consumption. However, higher integration also makes it difficult to achieve guaranteed quality of service (QoS) for shared resources. Unlike many other resources, memory bandwidth cannot be finely controlled by software in existing systems. As a result, many systems running critical, bandwidth-sensitive applications remain underutilized to protect against bandwidth interference. In this paper, we propose a novel hardware architecture allowing practical, software-controlled partitioning of memory bandwidth. Proportionally Allocated Bandwidth at the Source and Target (PABST) precisely controls the bandwidth of applications by throttling request rates at the source and prioritizes requests at the target. We show that PABST is work conserving, such that excess bandwidth beyond the requested allocation will not go unused. For applications sensitive to memory latency, we pair PABST with a simple priority scheme at the memory controller. We show that when combined, the system is able to lower TCO by providing performance isolation across a wide range of workloads, even when co-located with memory-intensive background jobs.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130669356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application-Specific Performance-Aware Energy Optimization on Android Mobile Devices","authors":"Karthik Rao, J. Wang, S. Yalamanchili, Y. Wardi, Handong Ye","doi":"10.1109/HPCA.2017.32","DOIUrl":"https://doi.org/10.1109/HPCA.2017.32","url":null,"abstract":"Energy management is a key issue for mobile devices. On current Android devices, power management relies heavily on OS modules known as governors. These modules are created for various hardware components, including the CPU, to support DVFS. They implement algorithms that attempt to balance performance and power consumption. In this paper we make the observation that the existing governors are (1) general-purpose by nature (2) focused on power reduction and (3) are not energy-optimal for many applications. We thus establish the need for an application-specific approach that could overcome these drawbacks and provide higher energy efficiency for suitable applications. We also show that existing methods manage power and performance in an independent and isolated fashion and that co-ordinated control of multiple components can save more energy. In addition, we note that on mobile devices, energy savings cannot be achieved at the expense of performance. Consequently, we propose a solution that minimizes energy consumption of specific applications while maintaining a user-specified performance target. Our solution consists of two stages: (1) offline profiling and (2) online controlling. Utilizing the offline profiling data of the target application, our control theory based online controller dynamically selects the optimal system configuration (in this paper, combination of CPU frequency and memory bandwidth) for the application, while it is running. Our energy management solution is tested on a Nexus 6 smartphone with 6 real-world applications. We achieve 4 - 31% better energy than default governors with a worst case performance loss of","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114459083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Camouflage: Memory Traffic Shaping to Mitigate Timing Attacks","authors":"Yanqi Zhou, Sameer Wagh, Prateek Mittal, D. Wentzlaff","doi":"10.1109/HPCA.2017.36","DOIUrl":"https://doi.org/10.1109/HPCA.2017.36","url":null,"abstract":"Information leaks based on timing side channels in computing devices have serious consequences for user security and privacy. In particular, malicious applications in multi-user systems such as data centers and cloud-computing environments can exploit memory timing as a side channel to infer a victim's program access patterns/phases. Memory timing channels can also be exploited for covert communications by an adversary. We propose Camouflage, a hardware solution to mitigate timing channel attacks not only in the memory system, but also along the path to and from the memory system (e.g. NoC, memory scheduler queues). Camouflage introduces the novel idea of shaping memory requests' and responses' inter-arrival time into a pre-determined distribution for security purposes, even creating additional fake traffic if needed. This limits untrusted parties (either cloud providers or co-scheduled clients) from inferring information from another security domain by probing the bus to and from memory, or analyzing memory response rate. We design three different memory traffic shaping mechanisms for different security scenarios by having Camouflage work on requests, responses, and bi-directional (both) traffic. Camouflage is complementary to ORAMs and can be optionally used in conjunction with ORAMs to protect information leaks via both memory access timing and memory access patterns. Camouflage offers a tunable trade-off between system security and system performance. We evaluate Camouflage's security and performance both theoretically and via simulations, and find that Camouflage outperforms state-of-the-art solutions in performance by up to 50%.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115888959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Near-Ideal Networks-on-Chip for Servers","authors":"P. Lotfi-Kamran, M. Modarressi, H. Sarbazi-Azad","doi":"10.1109/HPCA.2017.16","DOIUrl":"https://doi.org/10.1109/HPCA.2017.16","url":null,"abstract":"Server workloads benefit from execution on many-core processors due to their massive request-level parallelism. A key characteristic of server workloads is the large instruction footprints. While a shared last-level cache (LLC) captures the footprints, it necessitates a low-latency network-on-chip (NOC) to minimize the core stall time on accesses serviced by the LLC. As strict quality-of-service requirements preclude the use of lean cores in server processors, we observe that even state-of-the-art single-cycle multi-hop NOCs are far from ideal because they impose significant NOC-induced delays on the LLC access latency, and diminish performance. Most of the NOC delay is due to per-hop resource allocation. In this paper, we take advantage of proactive resource allocation (PRA) to eliminate per-hop resource allocation time in single-cycle multi-hop networks to reach a near-ideal network for servers. PRA is undertaken during (1) the time interval in which it is known that LLC has the requested data, but the data is not yet ready, and (2) the time interval in which a packet is stalled in a router because the required resources are dedicated to another packet. Through detailed evaluation targeting a 64-core processor and a set of server workloads, we show that our proposal improves system performance by 12% over the state-of-the-art single-cycle multi-hop mesh NOC.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123252862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NCAP: Network-Driven, Packet Context-Aware Power Management for Client-Server Architecture","authors":"Mohammad Alian, Ahmed H. M. O. Abulila, Lokesh Jindal, Daehoon Kim, N. Kim","doi":"10.1109/HPCA.2017.57","DOIUrl":"https://doi.org/10.1109/HPCA.2017.57","url":null,"abstract":"The rate of network packets encapsulating requests from clients can significantly affect the utilization, and thus performance and sleep states of processors in servers deploying a power management policy. To improve energy efficiency, servers may adopt an aggressive power management policy that frequently transitions a processor to a low-performance or sleep state at a low utilization. However, such servers may not respond to a sudden increase in the rate of requests from clients early enough due to a considerable performance penalty of transitioning a processor from a sleep or low-performance state to a high-performance state. This in turn entails violations of a service level agreement (SLA), discourages server operators from deploying an aggressive power management policy, and thus wastes energy during low-utilization periods. For both fast response time and high energy-efficiency, we propose NCAP, Network-driven, packet Context-Aware Power management for client-server architecture. NCAP enhances a network interface card (NIC) and its driver such that it can examine received and transmitted network packets, determine the rate of network packets containing latency-critical requests, and proactively transition a processor to an appropriate performance or sleep state. To demonstrate the efficacy, we evaluate on-line data-intensive (OLDI) applications and show that a server deploying NCAP consumes 37~61% lower processor energy than a baseline server while satisfying a given SLA at various load levels.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125315531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Near-Optimal Access Partitioning for Memory Hierarchies with Multiple Heterogeneous Bandwidth Sources","authors":"Jayesh Gaur, Mainak Chaudhuri, Pradeep Ramachandran, S. Subramoney","doi":"10.1109/HPCA.2017.46","DOIUrl":"https://doi.org/10.1109/HPCA.2017.46","url":null,"abstract":"The memory wall continues to be a major performance bottleneck. While small on-die caches have been effective so far in hiding this bottleneck, the ever-increasing footprint of modern applications renders such caches ineffective. Recent advances in memory technologies like embedded DRAM (eDRAM) and High Bandwidth Memory (HBM) have enabled the integration of large memories on the CPU package as an additional source of bandwidth other than the DDR main memory. Because of limited capacity, these memories are typically implemented as a memory-side cache. Driven by traditional wisdom, many of the optimizations that target improving system performance have been tried to maximize the hit rate of the memory-side cache. A higher hit rate enables better utilization of the cache, and is therefore believed to result in higher performance. In this paper, we challenge this traditional wisdom and present DAP, a Dynamic Access Partitioning algorithm that sacrifices cache hit rates to exploit under-utilized bandwidth available at main memory. DAP achieves a near-optimal bandwidth partitioning between the memory-side cache and main memory by using a light-weight learning mechanism that needs just sixteen bytes of additional hardware. Simulation results show a 13% average performance gain when DAP is implemented on top of a die-stacked memory-side DRAM cache. We also show that DAP delivers large performance benefits across different implementations, bandwidth points, and capacity points of the memory-side cache, making it a valuable addition to any current or future systems based on multiple heterogeneous bandwidth sources beyond the on-chip SRAM cache hierarchy.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121555293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning","authors":"Linghao Song, Xuehai Qian, Hai Helen Li, Yiran Chen","doi":"10.1109/HPCA.2017.55","DOIUrl":"https://doi.org/10.1109/HPCA.2017.55","url":null,"abstract":"Convolution neural networks (CNNs) are the heart of deep learning applications. Recent works PRIME [1] and ISAAC [2] demonstrated the promise of using resistive random access memory (ReRAM) to perform neural computations in memory. We found that training cannot be efficiently supported with the current schemes. First, they do not consider weight update and complex data dependency in training procedure. Second, ISAAC attempts to increase system throughput with a very deep pipeline. It is only beneficial when a large number of consecutive images can be fed into the architecture. In training, the notion of batch (e.g. 64) limits the number of images can be processed consecutively, because the images in the next batch need to be processed based on the updated weights. Third, the deep pipeline in ISAAC is vulnerable to pipeline bubbles and execution stall. In this paper, we present PipeLayer, a ReRAM-based PIM accelerator for CNNs that support both training and testing. We analyze data dependency and weight update in training algorithms and propose efficient pipeline to exploit inter-layer parallelism. To exploit intra-layer parallelism, we propose highly parallel design based on the notion of parallelism granularity and weight replication. With these design choices, PipeLayer enables the highly pipelined execution of both training and testing, without introducing the potential stalls in previous work. The experiment results show that, PipeLayer achieves the speedups of 42.45x compared with GPU platform on average. The average energy saving of PipeLayer compared with GPU implementation is 7.17x.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116437077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compute Caches","authors":"Shaizeen Aga, Supreet Jeloka, Arun K. Subramaniyan, S. Narayanasamy, D. Blaauw, R. Das","doi":"10.1109/HPCA.2017.21","DOIUrl":"https://doi.org/10.1109/HPCA.2017.21","url":null,"abstract":"This paper presents the Compute Cache architecturethat enables in-place computation in caches. ComputeCaches uses emerging bit-line SRAM circuit technology to repurpose existing cache elements and transforms them into active very large vector computational units. Also, it significantlyreduces the overheads in moving data between different levelsin the cache hierarchy. Solutions to satisfy new constraints imposed by ComputeCaches such as operand locality are discussed. Also discussedare simple solutions to problems in integrating them into aconventional cache hierarchy while preserving properties suchas coherence, consistency, and reliability. Compute Caches increase performance by 1.9× and reduceenergy by 2.4× for a suite of data-centric applications, includingtext and database query processing, cryptographic kernels, and in-memory checkpointing. Applications with larger fractionof Compute Cache operations could benefit even more, asour micro-benchmarks indicate (54× throughput, 9× dynamicenergy savings).","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125920178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cooper: Task Colocation with Cooperative Games","authors":"Qiuyun Llull, Songchun Fan, S. Zahedi, Benjamin C. Lee","doi":"10.1109/HPCA.2017.22","DOIUrl":"https://doi.org/10.1109/HPCA.2017.22","url":null,"abstract":"Task colocation improves datacenter utilization but introduces resource contention for shared hardware. In this setting, a particular challenge is balancing performance and fairness. We present Cooper, a game-theoretic framework for task colocation that provides fairness while preserving performance. Cooper predicts users' colocation preferences and finds stable matches between them. Its colocations satisfy preferences and encourage strategic users to participate inshared systems. Given Cooper's colocations, users' performance penalties are strongly correlated to their contributions to contention, which is fair according to cooperative game theory. Moreover, its colocations perform within 5% of prior heuristics.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131255736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Balancing Performance and Lifetime of MLC PCM by Using a Region Retention Monitor","authors":"Mingzhe Zhang, Lunkai Zhang, Lei Jiang, Zhiyong Liu, F. Chong","doi":"10.1109/HPCA.2017.45","DOIUrl":"https://doi.org/10.1109/HPCA.2017.45","url":null,"abstract":"Multi Level Cell (MLC) Phase Change Memory (PCM) is an enhancement of PCM technology, which provides higher capacity by allowing multiple digital bits to be stored in a single PCM cell. However, the retention time of MLC PCM is limited by the resistance drift problem and refresh operations are required. Previous work shows that there exists a trade-off between write latency and retention—a write scheme with more SET iterations and smaller current provides a longer retention time but at the cost of a longer write latency. Otherwise, a write scheme with fewer SET iterations achieves high performance for writes but requires a greater number of refresh operations due to its significantly reduced retention time, and this hurts the lifetime of MLC PCM. In this paper, we show that only a small part of memory (i.e., hot memory regions) will be frequently accessed in a given period of time. Based on such an observation, we propose Region Retention Monitor (RRM), a novel structure that records and predicts the write frequency of memory regions. For every incoming memory write operation, RRM select a proper write latency for it. Our evaluations show that RRM helps the system improves the balance between system performance and memory lifetime. On the performance side, the system with RRM bridges 77.2% of the performance gap between systems with long writes and systems with short writes. On the lifetime side, a system with RRM achieves a lifetime of 6.4 years, while systems using only long writes and short writes achieve lifetimes of 10.6 and 0.3 years, respectively. Also, we can easily control the aggressiveness of RRM through an attribute called hot threshold. A more aggressively configured RRM can achieve the performance which is only 3.5% inferior than the system using static short writes, while still achieve a lifetime of 5.78 years.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131104208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}