{"title":"PABST: Proportionally Allocated Bandwidth at the Source and Target","authors":"Derek Hower, Harold W. Cain, Carl A. Waldspurger","doi":"10.1109/HPCA.2017.33","DOIUrl":"https://doi.org/10.1109/HPCA.2017.33","url":null,"abstract":"Higher integration lowers total cost of ownership (TCO) in the data center by reducing equipment cost and lowering energy consumption. However, higher integration also makes it difficult to achieve guaranteed quality of service (QoS) for shared resources. Unlike many other resources, memory bandwidth cannot be finely controlled by software in existing systems. As a result, many systems running critical, bandwidth-sensitive applications remain underutilized to protect against bandwidth interference. In this paper, we propose a novel hardware architecture allowing practical, software-controlled partitioning of memory bandwidth. Proportionally Allocated Bandwidth at the Source and Target (PABST) precisely controls the bandwidth of applications by throttling request rates at the source and prioritizes requests at the target. We show that PABST is work conserving, such that excess bandwidth beyond the requested allocation will not go unused. For applications sensitive to memory latency, we pair PABST with a simple priority scheme at the memory controller. We show that when combined, the system is able to lower TCO by providing performance isolation across a wide range of workloads, even when co-located with memory-intensive background jobs.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130669356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application-Specific Performance-Aware Energy Optimization on Android Mobile Devices","authors":"Karthik Rao, J. Wang, S. Yalamanchili, Y. Wardi, Handong Ye","doi":"10.1109/HPCA.2017.32","DOIUrl":"https://doi.org/10.1109/HPCA.2017.32","url":null,"abstract":"Energy management is a key issue for mobile devices. On current Android devices, power management relies heavily on OS modules known as governors. These modules are created for various hardware components, including the CPU, to support DVFS. They implement algorithms that attempt to balance performance and power consumption. In this paper we make the observation that the existing governors are (1) general-purpose by nature (2) focused on power reduction and (3) are not energy-optimal for many applications. We thus establish the need for an application-specific approach that could overcome these drawbacks and provide higher energy efficiency for suitable applications. We also show that existing methods manage power and performance in an independent and isolated fashion and that co-ordinated control of multiple components can save more energy. In addition, we note that on mobile devices, energy savings cannot be achieved at the expense of performance. Consequently, we propose a solution that minimizes energy consumption of specific applications while maintaining a user-specified performance target. Our solution consists of two stages: (1) offline profiling and (2) online controlling. Utilizing the offline profiling data of the target application, our control theory based online controller dynamically selects the optimal system configuration (in this paper, combination of CPU frequency and memory bandwidth) for the application, while it is running. Our energy management solution is tested on a Nexus 6 smartphone with 6 real-world applications. We achieve 4 - 31% better energy than default governors with a worst case performance loss of","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114459083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Camouflage: Memory Traffic Shaping to Mitigate Timing Attacks","authors":"Yanqi Zhou, Sameer Wagh, Prateek Mittal, D. Wentzlaff","doi":"10.1109/HPCA.2017.36","DOIUrl":"https://doi.org/10.1109/HPCA.2017.36","url":null,"abstract":"Information leaks based on timing side channels in computing devices have serious consequences for user security and privacy. In particular, malicious applications in multi-user systems such as data centers and cloud-computing environments can exploit memory timing as a side channel to infer a victim's program access patterns/phases. Memory timing channels can also be exploited for covert communications by an adversary. We propose Camouflage, a hardware solution to mitigate timing channel attacks not only in the memory system, but also along the path to and from the memory system (e.g. NoC, memory scheduler queues). Camouflage introduces the novel idea of shaping memory requests' and responses' inter-arrival time into a pre-determined distribution for security purposes, even creating additional fake traffic if needed. This limits untrusted parties (either cloud providers or co-scheduled clients) from inferring information from another security domain by probing the bus to and from memory, or analyzing memory response rate. We design three different memory traffic shaping mechanisms for different security scenarios by having Camouflage work on requests, responses, and bi-directional (both) traffic. Camouflage is complementary to ORAMs and can be optionally used in conjunction with ORAMs to protect information leaks via both memory access timing and memory access patterns. Camouflage offers a tunable trade-off between system security and system performance. We evaluate Camouflage's security and performance both theoretically and via simulations, and find that Camouflage outperforms state-of-the-art solutions in performance by up to 50%.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115888959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Near-Ideal Networks-on-Chip for Servers","authors":"P. Lotfi-Kamran, M. Modarressi, H. Sarbazi-Azad","doi":"10.1109/HPCA.2017.16","DOIUrl":"https://doi.org/10.1109/HPCA.2017.16","url":null,"abstract":"Server workloads benefit from execution on many-core processors due to their massive request-level parallelism. A key characteristic of server workloads is the large instruction footprints. While a shared last-level cache (LLC) captures the footprints, it necessitates a low-latency network-on-chip (NOC) to minimize the core stall time on accesses serviced by the LLC. As strict quality-of-service requirements preclude the use of lean cores in server processors, we observe that even state-of-the-art single-cycle multi-hop NOCs are far from ideal because they impose significant NOC-induced delays on the LLC access latency, and diminish performance. Most of the NOC delay is due to per-hop resource allocation. In this paper, we take advantage of proactive resource allocation (PRA) to eliminate per-hop resource allocation time in single-cycle multi-hop networks to reach a near-ideal network for servers. PRA is undertaken during (1) the time interval in which it is known that LLC has the requested data, but the data is not yet ready, and (2) the time interval in which a packet is stalled in a router because the required resources are dedicated to another packet. Through detailed evaluation targeting a 64-core processor and a set of server workloads, we show that our proposal improves system performance by 12% over the state-of-the-art single-cycle multi-hop mesh NOC.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123252862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NCAP: Network-Driven, Packet Context-Aware Power Management for Client-Server Architecture","authors":"Mohammad Alian, Ahmed H. M. O. Abulila, Lokesh Jindal, Daehoon Kim, N. Kim","doi":"10.1109/HPCA.2017.57","DOIUrl":"https://doi.org/10.1109/HPCA.2017.57","url":null,"abstract":"The rate of network packets encapsulating requests from clients can significantly affect the utilization, and thus performance and sleep states of processors in servers deploying a power management policy. To improve energy efficiency, servers may adopt an aggressive power management policy that frequently transitions a processor to a low-performance or sleep state at a low utilization. However, such servers may not respond to a sudden increase in the rate of requests from clients early enough due to a considerable performance penalty of transitioning a processor from a sleep or low-performance state to a high-performance state. This in turn entails violations of a service level agreement (SLA), discourages server operators from deploying an aggressive power management policy, and thus wastes energy during low-utilization periods. For both fast response time and high energy-efficiency, we propose NCAP, Network-driven, packet Context-Aware Power management for client-server architecture. NCAP enhances a network interface card (NIC) and its driver such that it can examine received and transmitted network packets, determine the rate of network packets containing latency-critical requests, and proactively transition a processor to an appropriate performance or sleep state. To demonstrate the efficacy, we evaluate on-line data-intensive (OLDI) applications and show that a server deploying NCAP consumes 37~61% lower processor energy than a baseline server while satisfying a given SLA at various load levels.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125315531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Near-Optimal Access Partitioning for Memory Hierarchies with Multiple Heterogeneous Bandwidth Sources","authors":"Jayesh Gaur, Mainak Chaudhuri, Pradeep Ramachandran, S. Subramoney","doi":"10.1109/HPCA.2017.46","DOIUrl":"https://doi.org/10.1109/HPCA.2017.46","url":null,"abstract":"The memory wall continues to be a major performance bottleneck. While small on-die caches have been effective so far in hiding this bottleneck, the ever-increasing footprint of modern applications renders such caches ineffective. Recent advances in memory technologies like embedded DRAM (eDRAM) and High Bandwidth Memory (HBM) have enabled the integration of large memories on the CPU package as an additional source of bandwidth other than the DDR main memory. Because of limited capacity, these memories are typically implemented as a memory-side cache. Driven by traditional wisdom, many of the optimizations that target improving system performance have been tried to maximize the hit rate of the memory-side cache. A higher hit rate enables better utilization of the cache, and is therefore believed to result in higher performance. In this paper, we challenge this traditional wisdom and present DAP, a Dynamic Access Partitioning algorithm that sacrifices cache hit rates to exploit under-utilized bandwidth available at main memory. DAP achieves a near-optimal bandwidth partitioning between the memory-side cache and main memory by using a light-weight learning mechanism that needs just sixteen bytes of additional hardware. Simulation results show a 13% average performance gain when DAP is implemented on top of a die-stacked memory-side DRAM cache. We also show that DAP delivers large performance benefits across different implementations, bandwidth points, and capacity points of the memory-side cache, making it a valuable addition to any current or future systems based on multiple heterogeneous bandwidth sources beyond the on-chip SRAM cache hierarchy.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121555293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning","authors":"Linghao Song, Xuehai Qian, Hai Helen Li, Yiran Chen","doi":"10.1109/HPCA.2017.55","DOIUrl":"https://doi.org/10.1109/HPCA.2017.55","url":null,"abstract":"Convolution neural networks (CNNs) are the heart of deep learning applications. Recent works PRIME [1] and ISAAC [2] demonstrated the promise of using resistive random access memory (ReRAM) to perform neural computations in memory. We found that training cannot be efficiently supported with the current schemes. First, they do not consider weight update and complex data dependency in training procedure. Second, ISAAC attempts to increase system throughput with a very deep pipeline. It is only beneficial when a large number of consecutive images can be fed into the architecture. In training, the notion of batch (e.g. 64) limits the number of images can be processed consecutively, because the images in the next batch need to be processed based on the updated weights. Third, the deep pipeline in ISAAC is vulnerable to pipeline bubbles and execution stall. In this paper, we present PipeLayer, a ReRAM-based PIM accelerator for CNNs that support both training and testing. We analyze data dependency and weight update in training algorithms and propose efficient pipeline to exploit inter-layer parallelism. To exploit intra-layer parallelism, we propose highly parallel design based on the notion of parallelism granularity and weight replication. With these design choices, PipeLayer enables the highly pipelined execution of both training and testing, without introducing the potential stalls in previous work. The experiment results show that, PipeLayer achieves the speedups of 42.45x compared with GPU platform on average. The average energy saving of PipeLayer compared with GPU implementation is 7.17x.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116437077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compute Caches","authors":"Shaizeen Aga, Supreet Jeloka, Arun K. Subramaniyan, S. Narayanasamy, D. Blaauw, R. Das","doi":"10.1109/HPCA.2017.21","DOIUrl":"https://doi.org/10.1109/HPCA.2017.21","url":null,"abstract":"This paper presents the Compute Cache architecturethat enables in-place computation in caches. ComputeCaches uses emerging bit-line SRAM circuit technology to repurpose existing cache elements and transforms them into active very large vector computational units. Also, it significantlyreduces the overheads in moving data between different levelsin the cache hierarchy. Solutions to satisfy new constraints imposed by ComputeCaches such as operand locality are discussed. Also discussedare simple solutions to problems in integrating them into aconventional cache hierarchy while preserving properties suchas coherence, consistency, and reliability. Compute Caches increase performance by 1.9× and reduceenergy by 2.4× for a suite of data-centric applications, includingtext and database query processing, cryptographic kernels, and in-memory checkpointing. Applications with larger fractionof Compute Cache operations could benefit even more, asour micro-benchmarks indicate (54× throughput, 9× dynamicenergy savings).","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125920178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cooper: Task Colocation with Cooperative Games","authors":"Qiuyun Llull, Songchun Fan, S. Zahedi, Benjamin C. Lee","doi":"10.1109/HPCA.2017.22","DOIUrl":"https://doi.org/10.1109/HPCA.2017.22","url":null,"abstract":"Task colocation improves datacenter utilization but introduces resource contention for shared hardware. In this setting, a particular challenge is balancing performance and fairness. We present Cooper, a game-theoretic framework for task colocation that provides fairness while preserving performance. Cooper predicts users' colocation preferences and finds stable matches between them. Its colocations satisfy preferences and encourage strategic users to participate inshared systems. Given Cooper's colocations, users' performance penalties are strongly correlated to their contributions to contention, which is fair according to cooperative game theory. Moreover, its colocations perform within 5% of prior heuristics.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131255736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Balancing Performance and Lifetime of MLC PCM by Using a Region Retention Monitor","authors":"Mingzhe Zhang, Lunkai Zhang, Lei Jiang, Zhiyong Liu, F. Chong","doi":"10.1109/HPCA.2017.45","DOIUrl":"https://doi.org/10.1109/HPCA.2017.45","url":null,"abstract":"Multi Level Cell (MLC) Phase Change Memory (PCM) is an enhancement of PCM technology, which provides higher capacity by allowing multiple digital bits to be stored in a single PCM cell. However, the retention time of MLC PCM is limited by the resistance drift problem and refresh operations are required. Previous work shows that there exists a trade-off between write latency and retention—a write scheme with more SET iterations and smaller current provides a longer retention time but at the cost of a longer write latency. Otherwise, a write scheme with fewer SET iterations achieves high performance for writes but requires a greater number of refresh operations due to its significantly reduced retention time, and this hurts the lifetime of MLC PCM. In this paper, we show that only a small part of memory (i.e., hot memory regions) will be frequently accessed in a given period of time. Based on such an observation, we propose Region Retention Monitor (RRM), a novel structure that records and predicts the write frequency of memory regions. For every incoming memory write operation, RRM select a proper write latency for it. Our evaluations show that RRM helps the system improves the balance between system performance and memory lifetime. On the performance side, the system with RRM bridges 77.2% of the performance gap between systems with long writes and systems with short writes. On the lifetime side, a system with RRM achieves a lifetime of 6.4 years, while systems using only long writes and short writes achieve lifetimes of 10.6 and 0.3 years, respectively. Also, we can easily control the aggressiveness of RRM through an attribute called hot threshold. A more aggressively configured RRM can achieve the performance which is only 3.5% inferior than the system using static short writes, while still achieve a lifetime of 5.78 years.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131104208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}