2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)最新文献

筛选
英文 中文
Accordion: Toward soft Near-Threshold Voltage Computing 手风琴:迈向软近阈值电压计算
Ulya R. Karpuzcu, Ismail Akturk, N. Kim
{"title":"Accordion: Toward soft Near-Threshold Voltage Computing","authors":"Ulya R. Karpuzcu, Ismail Akturk, N. Kim","doi":"10.1109/HPCA.2014.6835977","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835977","url":null,"abstract":"While more cores can find place in the unit chip area every technology generation, excessive growth in power density prevents simultaneous utilization of all. Due to the lower operating voltage, Near-Threshold Voltage Computing (NTC) promises to fit more cores in a given power envelope. Yet NTC prospects for energy efficiency disappear without mitigating (i) the performance degradation due to the lower operating frequency; (ii) the intensified vulnerability to parametric variation. To compensate for the first barrier, we need to raise the degree of parallelism - the number of cores engaged in computation. NTC-prompted power savings dominate the power cost of increasing the core count. Hence, limited parallelism in the application domain constitutes the critical barrier to engaging more cores in computation. To avoid the second barrier, the system should tolerate variation-induced errors. Unfortunately, engaging more cores in computation exacerbates vulnerability to variation further. To overcome NTC barriers, we introduce Accordion, a novel, light-weight framework, which exploits weak scaling along with inherent fault tolerance of emerging R(ecognition), M(ining), S(ynthesis) applications. The key observation is that the problem size not only dictates the number of cores engaged in computation, but also the application output quality. Consequently, Accordion designates the problem size as the main knob to trade off the degree of parallelism (i.e. the number of cores engaged in computation), with the degree of vulnerability to variation (i.e. the corruption in application output quality due to variation-induced errors). Parametric variation renders ample reliability differences between the cores. Since RMS applications can tolerate faults emanating from data-intensive program phases as opposed to control, variation-afflicted Accordion hardware executes fault-tolerant data-intensive phases on error-prone cores, and reserves reliable cores for control.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133741707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 18
DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture 死写预测辅助STT-RAM缓存架构
Junwhan Ahn, S. Yoo, Kiyoung Choi
{"title":"DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture","authors":"Junwhan Ahn, S. Yoo, Kiyoung Choi","doi":"10.1109/HPCA.2014.6835944","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835944","url":null,"abstract":"Spin-Transfer Torque RAM (STT-RAM) has been considered as a promising candidate for on-chip last-level caches, replacing SRAM for better energy efficiency, smaller die footprint, and scalability. However, it also introduces several new challenges into last-level cache design that need to be overcome for feasible deployment of STT-RAM caches. Among other things, mitigating the impact of slow and energy-hungry write operations is of the utmost importance. In this paper, we propose a new mechanism to reduce write activities of STT-RAM last-level caches. The key observation is that a significant amount of data written to last-level caches is not actually re-referenced again during the lifetime of the corresponding cache blocks. Such write operations, which we call dead writes, can bypass the cache without incurring extra misses by definition. Based on this, we propose Dead Write Prediction Assisted STT-RAM Cache Architecture (DASCA), which predicts and bypasses dead writes for write energy reduction. For this purpose, we first propose a novel classification of dead writes, which is composed of dead-on-arrival fills, dead-value fills, and closing writes, as a theoretical model for redundant write elimination. On top of that, we present a dead write predictor based on a state-of-the-art dead block predictor. Evaluations show that our architecture achieves an energy reduction of 68% (62%) in last-level caches and an additional energy reduction of 10% (16%) in main memory and even improves system performance by 6% (14%) on average compared to the STT-RAM baseline in a single-core (quad-core) system.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130281837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 92
Transportation-network-inspired network-on-chip Transportation-network-inspired network-on-chip
H. Kim, Gwangsun Kim, S. Maeng, H. Yeo, John Kim
{"title":"Transportation-network-inspired network-on-chip","authors":"H. Kim, Gwangsun Kim, S. Maeng, H. Yeo, John Kim","doi":"10.1109/HPCA.2014.6835943","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835943","url":null,"abstract":"A cost-efficient network-on-chip is needed in a scalable many-core systems. Recent multicore processors have leveraged a ring topology and hierarchical ring can increase scalability but presents different challenges, including higher hop count and global ring bottleneck. In this work, we describe a hierarchical ring topology that we refer to as a transportation-network-inspired network-on-chip (tNoC) that leverages principles from transportation network systems. In particular, we propose a novel hybrid flow control for hierarchical ring topology to scale the topology efficiently. The flow control is hybrid in that the channels are allocated on flit granularity while the buffers are allocated on packet granularity. The hybrid flow control enables a simplified router microarchitecture (to minimize per-hop latency) as router input buffers are minimized and buffers are pushed to the edges, either at the output ports or at the hub routers that interconnect the local rings to the global ring - while still supporting virtual channels to avoid protocol deadlock. We also describe a packet-quota-system (PQS) and a separate credit network that provide congestion management, support prioritized arbitration in the network, and provide support for multiflit packets. A detailed evaluation of a 64-core CMP shows that the tNoC improves performance by up to 21% compared with a baseline, buffered hierarchical ring topology while reducing NoC energy by 51%.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128998622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 17
Sprinkler: Maximizing resource utilization in many-chip solid state disks 洒水器:在多片固态磁盘中最大限度地利用资源
Myoungsoo Jung, M. Kandemir
{"title":"Sprinkler: Maximizing resource utilization in many-chip solid state disks","authors":"Myoungsoo Jung, M. Kandemir","doi":"10.1109/HPCA.2014.6835961","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835961","url":null,"abstract":"Resource utilization is one of the emerging problems in many-chip SSDs. In this paper, we propose Sprinkler, a novel device-level SSD controller, which targets maximizing resource utilization and achieving high performance without additional NAND flash chips. Specifically, Sprinkler relaxes parallelism dependency by scheduling I/O requests based on internal resource layout rather than the order imposed by the device-level queue. In addition, Sprinkler improves flash-level parallelism and reduces the number of transactions (i.e., improves transactionallocality) by over-committing flash memory requests to specific resources. Our extensive experimental evaluation using a cycle-accurate large-scale SSD simulation framework shows that a many-chip SSD equipped with our Sprinkler provides at least 56.6% shorter latency and 1.8 -2.2 times better throughput than the state-of-the-art SSD controllers. Further, it improves overall resource utilization by 68.8% under different I/O request patterns and provides, on average, 80.2% more flash-level parallelism by reducing half of the flash memory requests at runtime.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122502200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 65
A scalable multi-path microarchitecture for efficient GPU control flow 一个可扩展的多路径微架构,用于高效的GPU控制流
Ahmed Eltantawy, Jessica Wenjie Ma, Mike O'Connor, Tor M. Aamodt
{"title":"A scalable multi-path microarchitecture for efficient GPU control flow","authors":"Ahmed Eltantawy, Jessica Wenjie Ma, Mike O'Connor, Tor M. Aamodt","doi":"10.1109/HPCA.2014.6835936","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835936","url":null,"abstract":"Graphics processing units (GPUs) are increasingly used for non-graphics computing. However, applications with divergent control flow incur performance degradation on current GPUs. These GPUs implement the SIMT execution model by serializing the execution of different control flow paths encountered by a warp. This serialization can mask thread level parallelism among the scalar threads comprising a warp thus degrading performance. In this paper, we propose a novel branch divergence handling mechanism that enables interleaved execution of divergent paths within a warp while maintaining immediate postdominator reconvergence. This multi-path microarchitecture decouples divergence and reconvergence tracking by replacing the stack-based structure typically employed to support SIMT execution with two tables: a warp split table and a warp reconvergence table. It also enables reconvergence before the immediate postdominator which is important for efficient execution of unstructured control flow. Evaluated on a set of benchmarks with complex divergent control flow, our proposal achieves up to a 7× speedup with a harmonic mean of 32% over conventional single-path SIMT execution.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"244 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122922792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 33
CREAM: A Concurrent-Refresh-Aware DRAM Memory architecture CREAM:一种并发刷新感知的DRAM内存架构
Zhang Tao, Matthew Poremba, Cong Xu, Guangyu Sun, Yuan Xie
{"title":"CREAM: A Concurrent-Refresh-Aware DRAM Memory architecture","authors":"Zhang Tao, Matthew Poremba, Cong Xu, Guangyu Sun, Yuan Xie","doi":"10.1109/HPCA.2014.6835947","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835947","url":null,"abstract":"As DRAM density keeps increasing, more rows need to be protected in a single refresh with the constant refresh number. Since no memory access is allowed during a refresh, the refresh penalty is no longer trivial and can result in significant performance degradation. To mitigate the refresh penalty, a Concurrent-REfresh-Aware Memory system (CREAM) is proposed in this work so that memory access and refresh can be served in parallel. The proposed CREAM architecture distinguishes itself with the following key contributions: (1) Under a given DRAM power budget, sub-rank-level refresh (SRLR) is developed to reduce refresh power and the saved power is used to enable concurrent memory access; (2) sub-array-level refresh (SALR) is also devised to effectively lower the probability of the conflict between memory access and refresh; (3) In addition, novel sub-array level refresh scheduling schemes, such as sub-array round-robin and dynamic scheduling, are designed to further improve the performance. A quasi-ROR interface protocol is proposed so that CREAM is fully compatible with JEDEC-DDR standard with negligible hardware overhead and no extra pin-out. The experimental results show that CREAM can improve the performance by 12.9% and 7.1% over the conventional DRAM and the Elastic-Refresh DRAM memory, respectively.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130094696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 38
Dynamically detecting and tolerating IF-Condition Data Races 动态检测和容忍if条件数据争用
Shanxiang Qi, A. Muzahid, Wonsun Ahn, J. Torrellas
{"title":"Dynamically detecting and tolerating IF-Condition Data Races","authors":"Shanxiang Qi, A. Muzahid, Wonsun Ahn, J. Torrellas","doi":"10.1109/HPCA.2014.6835923","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835923","url":null,"abstract":"An IF-Condition Invariance Violation (ICIV) occurs when, after a thread has computed the control expression of an IF statement and while it is executing the THEN or ELSE clauses, another thread updates variables in the IF's control expression. An ICIV can be easily detected, and is likely to be a sign of a concurrency bug in the code. Typically, the ICIV is caused by a data race, which we call IF-Condition Data Race (ICR). In this paper, we analyze the data races reported in the bug databases of popular software systems and show that ICRs occur relatively often. Then, we present two techniques to handle ICRs dynamically. They rely on simple code transformations and, in one case, additional hardware help. One of them (SW-IF) detects the races, while the other (HW-IF) detects and prevents them. We evaluate SW-IF and HW-IF using a variety of applica- tions. We show that these new techniques are effective at finding new data race bugs and run with low overhead. Specifically, HW-IF finds 5 new (unreported) race bugs and SW-IF finds 3 of them. In addition, 8-threaded executions of SPLASH-2 codes show that, on average, SW-IF adds 2% execution overhead, while HW-IF adds less than 1%.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128714828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
STM: Cloning the spatial and temporal memory access behavior STM:克隆空间和时间内存访问行为
Amro Awad, Yan Solihin
{"title":"STM: Cloning the spatial and temporal memory access behavior","authors":"Amro Awad, Yan Solihin","doi":"10.1109/HPCA.2014.6835935","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835935","url":null,"abstract":"Computer architects need a deep understanding of clients' workload in order to design and tune the architecture. Unfortunately, many important clients will not share their software to computer architects due to the proprietary or confidential nature of their software. One technique to mitigate this problem is producing synthetic traces (clone) that replicate the behavior of the original workloads. Unfortunately, today there is no universal cloning technique that can capture arbitrary memory access behavior of applications. Existing technique captures only temporal, but not spatial, locality. In order to study memory hierarchy organization beyond caches, such as including prefetchers and translation lookaside buffer (TLB), capturing only temporal locality is insufficient. In this paper, we propose a new memory access behavior cloning technique that captures both temporal and spatial locality. We abbreviate our scheme as Spatio-Temporal Memory (STM) cloning. We propose a new profiling method and statistics that capture stride patterns and transition probabilities. We show how the new statistics enable accurate clone generation that allow clones to be used in place of the original benchmarks for studying the L1/L2/TLB miss rates as we vary the L1 cache, L1 prefetcher, L2 cache, TLB, and page size configurations.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124650354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 27
TSO-CC: Consistency directed cache coherence for TSO TSO- cc:面向TSO的一致性缓存一致性
M. Elver, V. Nagarajan
{"title":"TSO-CC: Consistency directed cache coherence for TSO","authors":"M. Elver, V. Nagarajan","doi":"10.1109/HPCA.2014.6835927","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835927","url":null,"abstract":"Traditional directory coherence protocols are designed for the strictest consistency model, sequential consistency (SC). When they are used for chip multiprocessors (CMPs) that support relaxed memory consistency models, such protocols turn out to be unnecessarily strict. Usually this comes at the cost of scalability (in terms of per core storage), which poses a problem with increasing number of cores in today's CMPs, most of which no longer are sequentially consistent. Because of the wide adoption of Total Store Order (TSO) and its variants in x86 and SPARC processors, and existing parallel programs written for these architectures, we propose TSO-CC, a cache coherence protocol for the TSO memory consistency model. TSO-CC does not track sharers, and instead relies on self-invalidation and detection of potential acquires using timestamps to satisfy the TSO memory consistency model lazily. Our results show that TSO-CC achieves average performance comparable to a MESI directory protocol, while TSO-CC's storage overhead per cache line scales logarithmically with increasing core count.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128420949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 51
A detailed GPU cache model based on reuse distance theory 基于重用距离理论的GPU缓存详细模型
C. Nugteren, Gert-Jan van den Braak, H. Corporaal, H. Bal
{"title":"A detailed GPU cache model based on reuse distance theory","authors":"C. Nugteren, Gert-Jan van den Braak, H. Corporaal, H. Bal","doi":"10.1109/HPCA.2014.6835955","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835955","url":null,"abstract":"As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality system-atically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: (1) the GPU's hierarchy of threads, warps, threadblocks, and sets of active threads, (2) conditional and non-uniform latencies, (3) cache associativity, (4) miss-status holding-registers, and (5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate compared to the GPGPU-Sim simulator.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128467630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 115
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信