Title: The runahead network-on-chip
Authors: Zimo Li, Joshua San Miguel, Natalie D. Enright Jerger
DOI: 10.1109/HPCA.2016.7446076
Abstract: With increasing core counts and higher memory demands from applications, it is imperative that networks-on-chip (NoCs) provide low-latency, power-efficient communication. Conventional NoCs tend to be over-provisioned for worst-case bandwidth demands, leading to ineffective use of network resources and significant power inefficiency; average channel utilization is typically less than 5% in real-world applications. In terms of performance, low-latency techniques often introduce power and area overheads and incur significant complexity in the router microarchitecture. We find that both low latency and power efficiency are possible by relaxing the constraint of lossless communication. This is inspired by internetworking, where best-effort delivery is commonplace. We propose the Runahead NoC, a lightweight, lossy network that provides single-cycle hops. Allowing for lossy delivery enables an extremely simple bufferless router microarchitecture that performs routing and arbitration within the same cycle as link traversal. The Runahead NoC operates either as a power-saver that is integrated into an existing conventional NoC to improve power efficiency, or as an accelerator that is added on top to provide ultra-low-latency communication for select packets. On a range of PARSEC and SPLASH-2 workloads, we find that the Runahead NoC reduces power consumption by 1.81× as a power-saver and improves runtime and packet latency by 1.08× and 1.66× as an accelerator.
Title: Minimal disturbance placement and promotion
Authors: Elvira Teran, Yingying Tian, Zhe Wang, Daniel A. Jiménez
DOI: 10.1109/HPCA.2016.7446065
Abstract: Cache replacement policies often order blocks into distinct positions. A block is placed into a set in some initial position. A re-referenced block is promoted into a higher position while other blocks may move into lower positions. A block in the lowest position is a candidate for replacement. Tree-based PseudoLRU is a well-known space-efficient replacement policy based on representing block positions as distinct paths in a binary tree. We find that a placement or promotion for one block often needlessly disturbs the non-promoted blocks. Guided by the principle of minimal disturbance, i.e., that a policy should disturb the order of non-promoted blocks to the smallest extent possible, we develop a simple modification to PseudoLRU that improves performance over previous techniques while retaining the low cost of PseudoLRU. The result is a minimal disturbance placement and promotion (MDPP) policy. We first give a static formulation of MDPP and show that it outperforms LRU and PseudoLRU and matches the performance of SRRIP for both single-threaded and multi-core workloads. We then give a dynamic formulation that uses dead block prediction for placement and bypass and show that it meets or exceeds state-of-the-art performance with lower overhead. For single-threaded workloads, dynamic MDPP matches the 5.9% speedup over LRU of the state-of-the-art policy SHiP. For multi-core workloads, dynamic MDPP gives a normalized weighted speedup of 14.3% over LRU, compared with SHiP, which yields a speedup of 12.3% over LRU and requires double the storage overhead per set. We show that minimal disturbance policies can reduce the frequency of a costly read-modify-write cycle for replacement state, making them potentially suitable for future work in DRAM caches.
{"title":"Approximating warps with intra-warp operand value similarity","authors":"Daniel Wong, N. Kim, M. Annavaram","doi":"10.1109/HPCA.2016.7446063","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446063","url":null,"abstract":"Value locality, the recurrence of a previously-seen value, has been the enabler of myriad optimization techniques in traditional processors. Value similarity relaxes the constraint of value locality by allowing values to differ in the lowest significant bits where values are micro-architecturally near. With the end of Dennard Scaling and the turn towards massively parallel accelerators, we revisit value similarity in the context of GPUs. We identify a form of value similarity called intra-warp operand value similarity, which is abundant in GPUs. We present Warp Approximation, which leverages intra-warp operand value similarity to trade off accuracy for energy. Warp Approximation dynamically identifies intra-warp operand value similarity in hardware, and executes a single representative thread on behalf of all the active threads in a warp, thereby producing a representative value with approximate value locality. This representative value can then be stored compactly in the register file as a value similar scalar, reducing the read and write energy when dealing with approximate data. With Warp Approximation, we can reduce execution unit energy by 37%, register file energy by 28%, and improve overall GPGPU energy efficiency by 26% with minimal quality degradation.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117302492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CompEx: Compression-expansion coding for energy, latency, and lifetime improvements in MLC/TLC NVM","authors":"Poovaiah M. Palangappa, K. Mohanram","doi":"10.1109/HPCA.2016.7446056","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446056","url":null,"abstract":"Multi-level/triple-level cell non-volatile memories (MLC/TLC NVMs) such as PCM and RRAM are the subject of active research and development as replacement candidates for DRAM, which is limited by its high refresh power and poor scaling potential. Besides the benefits of non-volatility (low refresh power) and improved scalability, MLC/TLC NVMs offer high data density and memory capacity over DRAM. However, the viability of MLC/TLC NVMs is limited primarily due to the high programming energy and latency as well as the low endurance of NVM cells; these are primarily attributable to the iterative program-and-verify procedure necessary for programming the NVM cells. In this paper, we propose compression-expansion (CompEx) coding, a low overhead scheme that synergistically integrates statistical compression with expansion coding to realize simultaneous energy, latency, and lifetime improvements in MLC/TLC NVMs. CompEx coding is agnostic to the choice of compression technique; in this paper, we evaluate CompEx coding using both frequent pattern compression (FPC) and base-delta-immediate (BΔI) compression. CompEx coding integrates FPC/BΔI with (k, m)q `expansion' coding; expansion codes are a class of q-ary linear block codes that encode data using only the low energy states of a q-ary NVM cell. CompEx coding simultaneously reduces energy and latency and improves lifetime for no memory overhead and negligible logic overhead (≈ 10k gates, which is <; 0.1% per NVM module). Our full-system simulations of a system that integrates TLC RRAM show that CompEx coding reduces total memory energy by 53% and write latency by 24%; these improvements translate to a 5.7% improvement in IPC, a 11.8% improvement in main memory bandwidth, and 1.8× improvement in lifetime over classical binary coding using data-comparison write. CompEx coding thus addresses the programming energy/latency and lifetime challenges of MLC/-TLC NVMs that pose a serious technological roadblock to their adoption in high performance computing systems.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132876026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Selective GPU caches to eliminate CPU-GPU HW cache coherence
Authors: Neha Agarwal, D. Nellans, Eiman Ebrahimi, T. Wenisch, John Danskin, S. Keckler
DOI: 10.1109/HPCA.2016.7446089
Abstract: Cache coherence is ubiquitous in shared memory multiprocessors because it provides a simple, high-performance memory abstraction to programmers. Recent work suggests extending hardware cache coherence between CPUs and GPUs to help support programming models with tightly coordinated sharing between CPU and GPU threads. However, implementing hardware cache coherence is particularly challenging in systems with discrete CPUs and GPUs that may not be produced by a single vendor. Instead, we propose selective caching, wherein we disallow GPU caching of any memory that would require coherence updates to propagate between the CPU and GPU, thereby decoupling the GPU from vendor-specific CPU coherence protocols. We propose several architectural improvements to offset the performance penalty of selective caching: aggressive request coalescing, CPU-side coherent caching for GPU-uncacheable requests, and a CPU-GPU interconnect optimization to support variable-size transfers. Moreover, current GPU workloads access many read-only memory pages; we exploit this property to allow promiscuous GPU caching of these pages, relying on page-level protection, rather than hardware cache coherence, to ensure correctness. These optimizations bring a selective caching GPU implementation to within 93% of the performance of a hardware cache-coherent implementation without the need to integrate CPUs and GPUs under a single hardware coherence protocol.
Title: Towards high performance paged memory for GPUs
Authors: Tianhao Zheng, D. Nellans, A. Zulfiqar, M. Stephenson, S. Keckler
DOI: 10.1109/HPCA.2016.7446077
Abstract: Despite industrial investment in both on-die GPUs and next-generation interconnects, the highest-performing parallel accelerators shipping today continue to be discrete GPUs. Connected via PCIe, these GPUs utilize their own privately managed physical memory that is optimized for high bandwidth. These separate memories force GPU programmers to manage the movement of data between the CPU and GPU, in addition to the on-chip GPU memory hierarchy. To simplify this process, GPU vendors are developing software runtimes that automatically page memory in and out of the GPU on demand, reducing programmer effort and enabling computation across datasets that exceed the GPU memory capacity. Because this memory migration occurs over a high-latency, low-bandwidth link (compared to GPU memory), these software runtimes may incur significant performance penalties. In this work, we explore the features needed in GPU hardware and software to close the performance gap between GPU paged memory and legacy programmer-directed memory management. Without modifying the GPU execution pipeline, we show it is possible to largely hide the performance overheads of GPU paged memory, converting an average 2× slowdown into a 12% speedup when compared to programmer-directed transfers. Additionally, we examine the performance impact that GPU memory oversubscription has on application run times, enabling application designers to make informed decisions on how to shard their datasets across hosts and GPU instances.
{"title":"SCsafe: Logging sequential consistency violations continuously and precisely","authors":"Yuelu Duan, David A. Koufaty, J. Torrellas","doi":"10.1109/HPCA.2016.7446069","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446069","url":null,"abstract":"Sequential Consistency Violations (SCV) in relaxed consistency machines cause programs to malfunction and are hard to debug. While there are proposals for detecting and recording SCVs, they are limited in that they end program execution after detecting the first SCV because the program is now non-SC. Therefore, they cannot be used in production runs. In addition, such proposals rely on complicated hardware. To address these problems, this paper proposes the first architecture that detects and logs SCVs in a continuous manner, while retaining SC. In addition, the scheme is precise and uses substantially simpler hardware. The scheme, called SCsafe, operates continously because, after SCV detection and logging, it recovers and resumes execution while retaining SC. As a result, it can be used in production runs. In addition, SCsafe is precise in that it identifies only true SCVs - rather than dependence cycles due to false sharing. Finally, SCsafe's hardware is mostly local to each processor, and uses known recovery techniques. We evaluate SCsafe using simulations of 16-processor multicores with Total Store Order or Release Consistency. In codes with SCVs, SCsafe detects and reports SCVs while enforcing SC during the execution. In codes with few SCVs, it adds a negligible performance overhead. Finally, SCsafe is scalable with the processor count.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123362398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: TABLA: A unified template-based framework for accelerating statistical machine learning
Authors: Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, A. Yazdanbakhsh, J. Kim, H. Esmaeilzadeh
DOI: 10.1109/HPCA.2016.7446050
Abstract: A growing number of commercial and enterprise systems increasingly rely on compute-intensive machine learning (ML) algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. Field-programmable gate arrays (FPGAs) provide a promising path forward to accommodate the needs of machine learning algorithms and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long development cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for a single machine learning algorithm, we present TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and use this commonality to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as a stochastic optimization problem: learning becomes solving an optimization problem using stochastic gradient descent that minimizes an objective function over the training data. The gradient descent solver is fixed, while the objective function changes for different learning algorithms. TABLA provides a template-based framework to accelerate this class of learning algorithms. A developer can therefore specify a learning task by expressing only the gradient of the objective function in our high-level language. TABLA then automatically generates a synthesizable implementation of the accelerator for FPGA realization using a set of hand-optimized templates. We use TABLA to generate accelerators for ten different learning tasks targeted at a Xilinx Zynq FPGA platform. We rigorously compare the benefits of FPGA acceleration to multi-core CPUs (ARM Cortex-A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 19.4× and 2.9× average speedup over the ARM and Xeon processors, respectively. These accelerators provide 17.57×, 20.2×, and 33.4× higher performance-per-watt in comparison to Tegra, GTX 650 Ti, and Tesla, respectively. These benefits are achieved while the programmers write fewer than 50 lines of code.
{"title":"PleaseTM: Enabling transaction conflict management in requester-wins hardware transactional memory","authors":"Sunjae Park, Milos Prvulović, C. Hughes","doi":"10.1109/HPCA.2016.7446072","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446072","url":null,"abstract":"With recent commercial offerings, hardware transactional memory (HTM) has finally become an important tool in writing multithreaded applications. However, current offerings are commonly implemented in a way that keep the coherence protocol unmodified. Data conflicts are recognized by coherence messages sent by the requester to sharers of the cache block (e.g., a write to a speculatively read line), who are then aborted. This tends to abort transactions that have done more work, leading to suboptimal performance. Even worse, this can lead to live-lock situations where transactions repeatedly abort each other. In this paper, we present PleaseTM, a mechanism that allows more freedom in deciding which transaction to abort, while leaving the coherence protocol design unchanged. In PleaseTM, transactions insert plea bits into their responses to coherence requests as a simple payload, and use these bits to inform conflict management decisions. Coherence permission changes are then achieved with normal coherence requests. Our experiments show that this additional freedom can provide on average 43% speedup, with a maximum of 7-fold speedup, on STAMP benchmarks running at 32 threads compared to requester-wins HTM.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116775912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A performance analysis framework for optimizing OpenCL applications on FPGAs","authors":"Ze-ke Wang, Bingsheng He, Wei Zhang, Shunning Jiang","doi":"10.1109/HPCA.2016.7446058","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446058","url":null,"abstract":"Recently, FPGA vendors such as Altera and Xilinx have released OpenCL SDK for programming FPGAs. However, the architecture of FPGA is significantly different from that of CPU/GPU, for which OpenCL is originally designed. Tuning the OpenCL code for good performance on FPGAs is still an open problem, since the existing OpenCL tools and models designed for CPUs/GPUs are not directly applicable to FPGAs. In the paper, we present an FPGA-based performance analysis framework that can shed light on the performance bottleneck and thus guide the code tuning for OpenCL applications on FPGAs. Particularly, we leverage static and dynamic analysis to develop an analytical performance model, which has captured the key architectural features of FPGA abstractions under OpenCL. Then, we provide four programmer-interpretable metrics to quantify the performance potentials of the OpenCL program with input optimization combination for the next optimization step. We evaluate our framework with a number of user cases, and demonstrate that 1) our analytical performance model can accurately predict the performance of OpenCL programs with different optimization combinations on FPGAs, and 2) our tool can be used to effectively guide the code tuning on alleviating the performance bottleneck.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126583411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}