iPAWS: Instruction-issue pattern-based adaptive warp scheduling for GPGPUs
Minseok Lee, Gwangsun Kim, John Kim, Woong Seo, Yeon-Gon Cho, Soojung Ryu
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2016.7446079

Abstract: Thread or warp scheduling in GPGPUs has been shown to have a significant impact on overall performance. Recently proposed warp schedulers have been based on a greedy warp scheduler where some warps are prioritized over other warps. However, a single warp scheduling policy does not necessarily provide good performance across all types of workloads; in particular, we show that greedy warp schedulers are not necessarily optimal for workloads with inter-warp locality while a simple round-robin warp scheduler provides better performance. Thus, we argue that instead of single, static warp scheduling, an adaptive warp scheduler that dynamically changes the warp scheduler based on the workload characteristics should be leveraged. In this work, we propose an instruction-issue pattern-based adaptive warp scheduler (iPAWS) that dynamically adapts between a greedy warp scheduler and a fair, round-robin scheduler. We exploit the observation that workloads that favor a greedy warp scheduler will have an instruction-issue pattern that is biased towards some warps while workloads that favor a fair, round-robin warp scheduler will tend to issue instructions across all of the warps. Our evaluations show that iPAWS is able to adapt to the more optimal warp scheduler dynamically and achieve performance that is within a few percent of the statically determined, more optimal warp scheduler. We also show that iPAWS can be extended to other warp schedulers, including the cache-conscious wavefront scheduling (CCWS) and Memory Aware Scheduling and Cache Access Re-execution (MASCAR) to exploit the benefits of other warp schedulers while still providing adaptivity in warp scheduling.

A low power software-defined-radio baseband processor for the Internet of Things
Yajing Chen, Shengshuo Lu, Hun-Seok Kim, D. Blaauw, R. Dreslinski, T. Mudge
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2016.7446052

Abstract: In this paper, we define a configurable Software Defined Radio (SDR) baseband processor design for the Internet of Things (IoT). We analyzed the fundamental algorithms in communications systems on IoT devices to enable a microarchitecture design that supports many IoT standards and custom nonstandard communications. Based on this analysis, we propose a custom SIMD execution model coupled with a scalar unit. We introduce several architectural optimizations to this design: streaming registers, variable bit width datapath, dedicated ALUs for critical kernels, and an optimized flexible reduction network. We employ voltage scaling and clock gating to further reduce the power, while more than a 100% time margin has been reserved for reliable operation in the near-threshold region. Together our architectural enhancements lead to a 71× power reduction compared to a classic general purpose SDR SIMD architecture. Our IoT SDR datapath has sub-mW power consumption based on SPICE simulation, and is placed and routed to fit within an area of 0.074 mm^2 in a 28nm process. We implemented several essential elementary signal processing kernels and combined them to demonstrate two end-to-end upper bound systems, 802.15.4-OQPSK and Bluetooth Low Energy. Our full SDR baseband system consists of a configurable SIMD with a control plane MCU and memory. For comparison, the best commercial wireless transceiver consumes 23.8mW for the entire wireless system (digital/RF/analog). We show that our digital system power is below 2mW, in other words only 8% of the total system power. The wireless system is dominated by RF/analog power consumption, thus the price of flexibility that SDR affords is small. We believe this work is unique in demonstrating the value of baseband SDR in the low power IoT domain.

Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM
K. Chang, Prashant J. Nair, Donghyuk Lee, Saugata Ghose, Moinuddin K. Qureshi, O. Mutlu
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2016.7446095

Abstract: This paper introduces a new DRAM design that enables fast and energy-efficient bulk data movement across subarrays in a DRAM chip. While bulk data movement is a key operation in many applications and operating systems, contemporary systems perform this movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel for bulk data movement results in high latency and energy consumption. Prior work proposed to avoid these high costs by exploiting the existing wide internal DRAM bandwidth for bulk data movement, but the limited connectivity of wires within DRAM allows fast data movement within only a single DRAM subarray. Each subarray is only a few megabytes in size, greatly restricting the range over which fast bulk data movement can happen within DRAM. We propose a new DRAM substrate, Low-Cost Inter-Linked Subarrays (LISA), whose goal is to enable fast and efficient data movement across a large range of memory at low cost. LISA adds low-cost connections between adjacent subarrays. By using these connections to interconnect the existing internal wires (bitlines) of adjacent subarrays, LISA enables wide-bandwidth data transfer across multiple subarrays with little (only 0.8%) DRAM area overhead. As a DRAM substrate, LISA is versatile, enabling an array of new applications. We describe and evaluate three such applications in detail: (1) fast inter-subarray bulk data copy, (2) in-DRAM caching using a DRAM architecture whose rows have heterogeneous access latencies, and (3) accelerated bitline precharging by linking multiple precharge units together. Our extensive evaluations show that each of LISA's three applications significantly improves performance and memory energy efficiency, and their combined benefit is higher than the benefit of each alone, on a variety of workloads and system configurations.

{"title":"SLaC: Stage laser control for a flattened butterfly network","authors":"Y. Demir, N. Hardavellas","doi":"10.1109/HPCA.2016.7446075","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446075","url":null,"abstract":"Photonic interconnects have emerged as a promising candidate technology for high-performance energy-efficient on-chip, on-board, and datacenter-scale interconnects. However, the high optical loss of many nanophotonic components coupled with the low efficiency of current laser sources result in exceedingly high total power requirements for the laser. As optical interconnects stay on even during periods of system inactivity, most of this power is wasted, which has prompted research on laser gating. Unfortunately, prior work on laser gating has only focused on low-scalability on-chip photonic interconnects (photonic crossbars), and disrupts the connectivity of the network which renders a high-performance implementation challenging. In this paper we propose SLaC, a laser gating technique that turns on and off redundant paths in a photonic flattened-butterfly network to save laser energy while maintaining high performance and full connectivity. Maintaining full connectivity removes the laser turn-on latency from the critical path and results in minimal performance degradation. SLaC is equally applicable to on-chip, on-board, and datacenter level interconnects. For on-chip and multi-chip applications, SLaC saves up to 67% of the laser energy (43-57% on average) when running real-world workloads. On a datacenter network, SLaC saves 79% of the laser energy on average when running traffic traces collected from university datacenter servers.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129873941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DVFS for NoCs in CMPs: A thread voting approach","authors":"Y. Yao, Zhonghai Lu","doi":"10.1109/HPCA.2016.7446074","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446074","url":null,"abstract":"As the core count grows rapidly, dynamic voltage/frequency scaling (DVFS) in networks-on-chip (NoCs) becomes critical in optimizing energy efficacy in chip multiprocessors (CMPs). Previously proposed techniques often exploit inherent network-level metrics to do so. However, such network metrics may contradictorily reflect application's performance need, leading to power over/under provisioning. We propose a novel on-chip DVFS technique for NoCs that is able to adjust per-region V/F level according to voted V/F levels of communicating threads. Each region is composed of a few adjacent routers sharing the same V/F level. With a voting-based approach, threads seek to influence the DVFS decisions independently by voting for a preferred V/F level that best suits their own performance interest according to their runtime profiled message generation rate and data sharing characteristics. The vote expressed in a few bits is then carried in the packet header and spread to the routers on the packet route. The final DVFS decision is made democratically by a region DVFS controller based on the majority election result of collected votes from all active threads. To achieve scalable V/F adjustment, each region works independently, and the voting-based V/F tuning forms a distributed decision making process. We evaluate our technique with detailed simulations of a 64-core CMP running a variety of multi-threaded PARSEC benchmarks. Compared with a network without DVFS and a network metric (router buffer occupancy) based approach, experimental results show that our voting based DVFS mechanism improves the network energy efficacy measured in MPPJ (million packets per joule) by about 17.9% and 9.7% on average, respectively, and the system energy efficacy measured in MIPJ (million instructions per joule) by about 26.3% and 17.1% on average, respectively.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124546171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Core tunneling: Variation-aware voltage noise mitigation in GPUs
Renji Thomas, Kristin Barber, N. Sedaghati, Li Zhou, R. Teodorescu
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2016.7446061

Abstract: Voltage noise and manufacturing process variation represent significant reliability challenges for modern microprocessors. Voltage noise is caused by rapid changes in processor activity that can lead to timing violations and errors. Process variation is caused by manufacturing challenges in low-nanometer technologies and can lead to significant heterogeneity in performance and reliability across the chip. To ensure correct execution under worst-case conditions, chip designers generally add operating margins that are often unnecessarily conservative for most use cases, which results in wasted energy. This paper investigates the combined effects of process variation and voltage noise on modern GPU architectures. A distributed power delivery and process variation model at functional unit granularity was developed and used to simulate supply voltage behavior in a multicore GPU system. We observed that, just like in CPUs, large changes in power demand can lead to significant voltage droops. We also note that process variation makes some cores much more vulnerable to noise than others in the same GPU. Therefore, protecting the chip against large voltage droops by using fixed and uniform voltage guardbands is costly and inefficient. This paper presents core tunneling, a variation-aware solution for dynamically reducing voltage margins. The system relies on hardware critical path monitors to detect voltage noise conditions and quickly reacts by clock-gating vulnerable cores to prevent timing violations. This allows a substantial reduction in voltage margins. Since clock gating is enabled infrequently and only on the most vulnerable cores, the performance impact of core tunneling is very low. On average, core tunneling reduces energy consumption by 15%.

CATalyst: Defeating last-level cache side channel attacks in cloud computing
Fangfei Liu, Qian Ge, Y. Yarom, Francis X. McKeen, Carlos V. Rozas, G. Heiser, R. Lee
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2016.7446082

Abstract: Cache side channel attacks are serious threats to multi-tenant public cloud platforms. Past work showed how secret information in one virtual machine (VM) can be extracted by another co-resident VM using such attacks. Recent research demonstrated the feasibility of high-bandwidth, low-noise side channel attacks on the last-level cache (LLC), which is shared by all the cores in the processor package, enabling attacks even when VMs are scheduled on different cores. This paper shows how such LLC side channel attacks can be defeated using a performance optimization feature recently introduced in commodity processors. Since most cloud servers use Intel processors, we show how the Intel Cache Allocation Technology (CAT) can be used to provide a system-level protection mechanism to defend from side channel attacks on the shared LLC. CAT is a way-based hardware cache-partitioning mechanism for enforcing quality-of-service with respect to LLC occupancy. However, it cannot be directly used to defeat cache side channel attacks due to the very limited number of partitions it provides. We present CATalyst, a pseudo-locking mechanism which uses CAT to partition the LLC into a hybrid hardware-software managed cache. We implement a proof-of-concept system using Xen and Linux running on a server with Intel processors, and show that LLC side channel attacks can be defeated. Furthermore, CATalyst only causes very small performance overhead when used for security, and has negligible impact on legacy applications.

Energy-efficient address translation
Vasileios Karakostas, Jayneel Gandhi, A. Cristal, M. Hill, K. McKinley, M. Nemirovsky, M. Swift, O. Unsal
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2016.7446100

Abstract: Address translation is fundamental to processor performance. Prior work focused on reducing Translation Lookaside Buffer (TLB) misses to improve performance and energy, whereas we show that even TLB hits consume a significant amount of dynamic energy. To reduce the energy cost of address translation, we first propose Lite, a mechanism that monitors the performance and utility of L1 TLBs, and adaptively changes their sizes with way-disabling. The resulting TLBLite organization opportunistically reduces the dynamic energy spent in address translation by 23% on average with minimal impact on TLB miss cycles. To further reduce the energy and performance overheads of L1 TLBs, we also propose RMMLite that targets the recently proposed Redundant Memory Mappings (RMM) address-translation mechanism. RMM maps most of a process's address space with arbitrarily large ranges of contiguous pages in both virtual and physical address space using a modest number of entries in a range TLB. RMMLite adds to RMM an L1-range TLB and the Lite mechanism. The high hit ratio of the L1-range TLB allows Lite to downsize the L1-page TLBs more aggressively. RMMLite reduces the dynamic energy spent in address translation by 71% on average. Above the near-zero L2 TLB misses from RMM, RMMLite further reduces the overhead from L1 TLB misses by 99%. These proposed designs target current and future energy-efficient memory system design to meet the ever increasing memory demands of applications.

A market approach for handling power emergencies in multi-tenant data center
M. A. Islam, Xiaoqi Ren, Shaolei Ren, A. Wierman, Xiaorui Wang
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA.2016.7446084

Abstract: Power oversubscription in data centers may occasionally trigger an emergency when the aggregate power demand exceeds the capacity. Handling such an emergency requires a graceful power capping solution that minimizes the performance loss. In this paper, we study power capping in a multi-tenant data center where the operator supplies power to multiple tenants that manage their own servers. Unlike owner-operated data centers, the operator lacks control over tenants' servers. To address this challenge, we propose a novel market mechanism based on supply function bidding, called COOP, to financially incentivize and coordinate tenants' power reduction for minimizing total performance loss (quantified in performance cost) while satisfying multiple power capping constraints. We build a prototype to show that COOP is efficient in terms of minimizing the total performance cost, even compared to the ideal but infeasible case that assumes the operator has full control over tenants' servers. We also demonstrate that COOP is "win-win", increasing the operator's profit (through oversubscription) and reducing tenants' cost (through financial compensation for their power reduction during emergencies).

{"title":"Revisiting virtual L1 caches: A practical design using dynamic synonym remapping","authors":"Hongil Yoon, G. Sohi","doi":"10.1109/HPCA.2016.7446066","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446066","url":null,"abstract":"Virtual caches have potentially lower access latency and energy consumption than physical caches because they do not consult the TLB prior to cache access. However, they have not been popular in commercial designs. The crux of the problem is the possibility of synonyms. This paper makes several empirical observations about the temporal characteristics of synonyms, especially in caches of sizes that are typical of L1 caches. By leveraging these observations, the paper proposes a practical design of an L1 virtual cache that (1) dynamically decides a unique virtual page number for all the synonymous virtual pages that map to the same physical page and (2) uses this unique page number to place and look up data in the virtual caches. Accesses to this unique page number proceed without any intervention. Accesses to other synonymous pages are dynamically detected, and remapped to the corresponding unique virtual page number to correctly access data in the cache. Such remapping operations are rare, due to the temporal properties of synonyms, allowing a Virtual Cache with Dynamic Synonym Remapping (VC-DSR) to achieve most of the benefits of virtual caches but without software involvement. Experimental results based on real world applications show that VC-DSR can achieve about 92% of the dynamic energy savings for TLB lookups, and 99.4% of the latency benefits of ideal (but impractical) virtual caches for the configurations considered.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131861677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}