{"title":"Static Bubble: A Framework for Deadlock-Free Irregular On-chip Topologies","authors":"Aniruddh Ramrakhyani, T. Krishna","doi":"10.1109/HPCA.2017.44","DOIUrl":"https://doi.org/10.1109/HPCA.2017.44","url":null,"abstract":"Future SoCs are expected to have irregular on-chip topologies, either at design time due to heterogeneity in the size of core/accelerator tiles, or at runtime due to link/node failures or power-gating of network elements such as routers/router datapaths. A key challenge with irregular topologies is that of routing deadlocks (cyclic dependence between buffers), since conventional XY or turn-model based approaches are no longer applicable. Most prior works in heterogeneous SoC design, resiliency, and power-gating, have addressed the deadlock problem by constructing spanning trees over the physical topology, messages are routed via the root removing cyclic dependencies. However, this comes at a cost of tree construction at runtime, and increased latency and energy for certain flows as they are forced to use non-minimal routes. In this work, we sweep the design space of possible topologies as the number of disconnected components (links/routers) increase, and demonstrate that while most of the resulting topologies are deadlock prone (i.e., have cycles), the injection rates at which they deadlock are often much higher than the injection rates of real applications, making the current solutions highly conservative. We propose a novel framework for deadlock-freedom called Static Bubble, that can be applied at design time to the underlying mesh topology, and guarantees deadlock-freedom for any runtime topology derived from this mesh due to power-gating or failure of router/link. We present an algorithm to augment a subset of routers in any n × m mesh (21 routers in a 64-core mesh) with an additional buffer called static bubble, such that any dependence chain has at least one static bubble. We also present the microarchitecture of a low-cost (less than 1% overhead) FSM at every router to activate one static bubble for deadlock recovery. Static Bubble enhances existing solutions for NoC resiliency and power-gating by providing up to 30% less network latency, 4x more throughput and 50% less EDP.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122152851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"G-Scalar: Cost-Effective Generalized Scalar Execution Architecture for Power-Efficient GPUs","authors":"Zhenhong Liu, S. Gilani, M. Annavaram, N. Kim","doi":"10.1109/HPCA.2017.51","DOIUrl":"https://doi.org/10.1109/HPCA.2017.51","url":null,"abstract":"The GPU has provide higher throughput by integrating more execution resources into a single chip without unduly compromising power efficiency. With the power wall challenge, however, increasing the throughput will require significant improvement in power efficiency. To accomplish this goal, we propose G-Scalar, a cost-effective generalized scalar execution architecture for GPUs in this paper. G-Scalar offers two key advantages over prior architectures supporting scalar execution for only non-divergent arithmetic/logic instructions. First, G-Scalar is more power-efficient as it can also support scalar execution of divergent and special-function instructions, the fraction of which in contemporary GPU applications has notably increased. Second, G-Scalar is less expensive as it can share most of its hardware resources with register value compression, of which adoption has been strongly promoted to reduce high power consumption of accessing the large register file. Compared with the baseline and previous scalar architectures, G-Scalar improves power efficiency by 24% and 15%, respectively, at a negligible cost.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114432229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Processing-in-Memory Enabled Graphics Processors for 3D Rendering","authors":"Chenhao Xie, S. Song, Jing Wang, Wei-gong Zhang, Xin Fu","doi":"10.1109/HPCA.2017.37","DOIUrl":"https://doi.org/10.1109/HPCA.2017.37","url":null,"abstract":"The performance of 3D rendering of GraphicsProcessing Unit that converts 3D vector stream into 2D framewith 3D image effects significantly impacts users gamingexperience on modern computer systems. Due to its hightexture throughput requirement, main memory bandwidthbecomes a critical obstacle for improving the overall renderingperformance. 3D-stacked memory systems such as HybridMemory Cube provide opportunities to significantly overcomethe memory wall by directly connecting logic controllers toDRAM dies. Although recent works have shown promisingimprovement in performance by utilizing HMC to acceleratespecial-purpose applications, a critical challenge of how toeffectively leverage its high internal bandwidth and computingcapability in GPU for 3D rendering remains unresolved. Basedon the observation that texel fetches greatly impact off-chipmemory traffic, we propose two architectural designs to enableProcessing-In-Memory based GPU for efficient 3D rendering. Additionally, we employ camera angles of pixels to controlthe performance-quality tradeoff of 3D rendering. Extensiveevaluation across several real-world games demonstrates thatour design can significantly improve the performance of texturefiltering and 3D rendering by an average of 3.97X (up to 6.4X) and 43% (up to 65%) respectively, over the baseline GPU. Meanwhile, our design provides considerable memory trafficand energy reduction without sacrificing rendering quality.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123525419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding and Optimizing Power Consumption in Memory Networks","authors":"Xun Jian, P. Hanumolu, Rakesh Kumar","doi":"10.1109/HPCA.2017.60","DOIUrl":"https://doi.org/10.1109/HPCA.2017.60","url":null,"abstract":"As the amount of digital data the world generates explodes, data centers and HPC systems that process this big data will require high bandwidth and high capacity main memory. Unfortunately, conventional memory technologies either provide high memory capacity (e.g., DDRx memory) or high bandwidth (GDDRx memory), but not both. Memory networks, which provide both high bandwidth and high capacity memory by connecting memory modules together via a network of point-to-point links, are promising future memory candidates for data centers and HPCs. In this paper, we perform the first exploration to understand the power characteristics of memory networks. We find idle I/O links to be the biggest power contributor in memory networks. Subsequently, we study idle I/O power in more detail. We evaluate well-known circuit-level I/O power control mechanisms such as rapid on off, variable link width, and DVFS. We also adapt prior works on memory power management to memory networks. The adapted schemes together reduce I/O power by 32% and 21%, on average, for big and small networks, respectively. We also explore novel power management schemes specifically targeting memory networks, which yield another 29% and 17% average I/O power reduction for big and small networks, respectively.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129386050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks","authors":"Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, Xiaowei Li","doi":"10.1109/HPCA.2017.29","DOIUrl":"https://doi.org/10.1109/HPCA.2017.29","url":null,"abstract":"Convolutional Neural Networks (CNN) are verycomputation-intensive. Recently, a lot of CNN accelerators based on the CNN intrinsic parallelism are proposed. However, we observed that there is a big mismatch between the parallel types supported by computing engine and the dominant parallel types of CNN workloads. This mismatch seriously degrades resource utilization of existing accelerators. In this paper, we propose aflexible dataflow architecture (FlexFlow) that can leverage the complementary effects among feature map, neuron, and synapse parallelism to mitigate the mismatch. We evaluated our design with six typical practical workloads, it acquires 2-10x performance speedup and 2.5-10x power efficiency improvement compared with three state-of-the-art accelerator architectures. Meanwhile, FlexFlow is highly scalable with growing computing engine scale.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131434379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Controlled Kernel Launch for Dynamic Parallelism in GPUs","authors":"Xulong Tang, Ashutosh Pattnaik, Huaipan Jiang, Onur Kayiran, Adwait Jog, Sreepathi Pai, M. Ibrahim, M. Kandemir, C. Das","doi":"10.1109/HPCA.2017.14","DOIUrl":"https://doi.org/10.1109/HPCA.2017.14","url":null,"abstract":"Dynamic parallelism (DP) is a promising feature for GPUs, which allows on-demand spawning of kernels on the GPU without any CPU intervention. However, this feature has two major drawbacks. First, the launching of GPU kernels can incur significant performance penalties. Second, dynamically-generated kernels are not always able to efficiently utilize the GPU cores due to hardware-limits. To address these two concerns cohesively, we propose SPAWN, a runtime framework that controls the dynamically-generated kernels, thereby directly reducing the associated launch overheads and queuing latency. Moreover, it allows a better mix of dynamically-generated and original (parent) kernels for the scheduler to effectively hide the remaining overheads and improve the utilization of the GPU resources. Our results show that, across 13 benchmarks, SPAWN achieves 69% and 57% speedup over the flat (non-DP) implementation and baseline DP, respectively.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115274417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Hyperdimensional Associative Memory","authors":"M. Imani, Abbas Rahimi, Deqian Kong, T. Simunic, J. Rabaey","doi":"10.1109/HPCA.2017.28","DOIUrl":"https://doi.org/10.1109/HPCA.2017.28","url":null,"abstract":"Brain-inspired hyperdimensional (HD) computing emulates cognition tasks by computing with hypervectors as an alternative to computing with numbers. At its very core, HD computing is about manipulating and comparing large patterns, stored in memory as hypervectors: the input symbols are mapped to a hypervector and an associative search is performed for reasoning and classification. For every classification event, an associative memory is in charge of finding the closest match between a set of learned hypervectors and a query hypervector by using a distance metric. Hypervectors with the i.i.d. components qualify a memory-centric architecture to tolerate massive number of errors, hence it eases cooperation of various methodological design approaches for boosting energy efficiency and scalability. This paper proposes architectural designs for hyperdimensional associative memory (HAM) to facilitate energy-efficient, fast, and scalable search operation using three widely-used design approaches. These HAM designs search for the nearest Hamming distance, and linearly scale with the number of dimensions in the hypervectors while exploring a large design space with orders of magnitude higher efficiency. First, we propose a digital CMOS-based HAM (D-HAM) that modularly scales to any dimension. Second, we propose a resistive HAM (R-HAM) that exploits timing discharge characteristic of nonvolatile resistive elements to approximately compute Hamming distances at a lower cost. Finally, we combine such resistive characteristic with a currentbased search method to design an analog HAM (A-HAM) that results in faster and denser alternative. Our experimental results show that R-HAM and A-HAM improve the energy-delay product by 9.6× and 1347× compared to D-HAM while maintaining a moderate accuracy of 94% in language recognition.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123319547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cooperative Path-ORAM for Effective Memory Bandwidth Sharing in Server Settings","authors":"Rujia Wang, Youtao Zhang, Jun Yang","doi":"10.1109/HPCA.2017.9","DOIUrl":"https://doi.org/10.1109/HPCA.2017.9","url":null,"abstract":"Path ORAM (Oblivious RAM) is a recently proposed ORAM protocol for preventing information leakage from memory access sequences. It receives wide adoption due to its simplicity, practical efficiency and asymptotic efficiency. However, Path ORAM has extremely large memory bandwidth demand, leading to severe memory competition in server settings, e.g., a server may service one application that uses Path ORAM and one or multiple applications that do not. While Path ORAM synchronously and intensively uses all memory channels, the non-secure applications often exhibit low access intensity and large channel level imbalance. Traditional memory scheduling schemes lead to wasted memory bandwidth to the system and large performance degradation to both types of applications. In this paper, we propose CP-ORAM, a Cooperative Path ORAM design, to effectively schedule the memory requests from both types of applications. CP-ORAM consists of three schemes: P-Path, R-Path, and W-Path. P-Path assigns and enforces scheduling priority for effective memory bandwidth sharing. R-Path maximizes bandwidth utilization by proactively scheduling read operations from the next Path ORAM access. W-Path mitigates contention on busy memory channels with write redirection. We evaluate CP-ORAM and compare it to the state-of-the-art. Our results show that CP-ORAM helps to achieve 20% performance improvement on average over the baseline Path ORAM for the secure application in a four-channel server setting.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134117387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reliability-Aware Scheduling on Heterogeneous Multicore Processors","authors":"Ajeya Naithani, Stijn Eyerman, L. Eeckhout","doi":"10.1109/HPCA.2017.12","DOIUrl":"https://doi.org/10.1109/HPCA.2017.12","url":null,"abstract":"Reliability to soft errors is an increasingly important issue as technology continues to shrink. In this paper, we show that applications exhibit different reliability characteristics on big, high-performance cores versus small, power-efficient cores, and that there is significant opportunity to improve system reliability through reliability-aware scheduling on heterogeneous multicore processors. We monitor the reliability characteristics of all running applications, and dynamically schedule applications to the different core types in a heterogeneous multicore to maximize system reliability. Reliability-aware scheduling improves reliability by 25.4% on average (and up to 60.2%) compared to performance-optimized scheduling on a heterogeneous multicore processor with two big cores and two small cores, while de-grading performance by 6.3% only. We also introduce a novel system-level reliability metric for multiprogram workloads on (heterogeneous) multicores. We further show that our reliability-aware scheduler is robust across core count, number of big and small cores, and their frequency settings. The hardware cost in support of our reliability-aware scheduler is limited to 296 bytes per core.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132101304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pilot Register File: Energy Efficient Partitioned Register File for GPUs","authors":"Mohammad Abdel-Majeed, A. Shafaei, Hyeran Jeon, Massoud Pedram, M. Annavaram","doi":"10.1109/HPCA.2017.47","DOIUrl":"https://doi.org/10.1109/HPCA.2017.47","url":null,"abstract":"GPU adoption for general purpose computing hasbeen accelerating. To support a large number of concurrentlyactive threads, GPUs are provisioned with a very large registerfile (RF). The RF power consumption is a critical concern. Oneoption to reduce the power consumption dramatically is touse near-threshold voltage(NTV) to operate the RF. However, operating MOSFET devices at NTV is fraught with stabilityand reliability concerns. The adoption of FinFET devices inchip industry is providing a promising path to operate theRF at NTV while satisfactorily tackling the stability andreliability concerns. However, the fundamental problem of NTVoperation, namely slow access latency, remains. To tackle thischallenge in this paper we propose to build a partitioned RFusing FinFET technology. The partitioned RF design exploitsour observation that applications exhibit strong preference toutilize a small subset of their registers. One way to exploitthis behavior is to cache the RF content as has been proposedin recent works. However, caching leads to unnecessary areaoverheads since a fraction of the RF must be replicated. Furthermore, we show that caching is not efficient as weincrease the number of issued instructions per cycle, which isthe expected trend in GPU designs. The proposed partitionedRF splits the registers into two partitions: the highly accessedregisters are stored in a small RF that switches betweenhigh and low power modes. We use the FinFET's back gatecontrol to provide low overhead switching between the twopower modes. The remaining registers are stored in a largeRF partition that always operates at NTV. The assignment ofthe registers to the two partitions will be based on statisticscollected by the a hybrid profiling technique that combines thecompiler based profiling and the pilot warp profiling techniqueproposed in this paper. The partitioned FinFET RF is able tosave 39% and 54% of the RF leakage and the dynamic energy, respectively, and suffers less than 2% performance overhead.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"261 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115216387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}