Conference Proceedings. The 24th Annual International Symposium on Computer Architecture: Latest Publications

Coherence Controller Architectures For SMP-based CC-NUMA Multiprocessors
Maged M. Michael, Ashwini K. Nanda, B. Lim, M. Scott
DOI: 10.1145/264107.264203
Abstract: Scalable distributed shared-memory architectures rely on coherence controllers on each processing node to synthesize cache-coherent shared memory across the entire machine. The coherence controllers execute coherence protocol handlers that may be hardwired in custom hardware or programmed in a protocol processor within each coherence controller. Although custom hardware runs faster, a protocol processor allows the coherence protocol to be tailored to specific application needs and may shorten hardware development time. Previous research shows that the increase in application execution time due to protocol processors over custom hardware is minimal. With the advent of SMP nodes and faster processors and networks, the tradeoff between custom hardware and protocol processors needs to be reexamined. This paper studies the performance of custom-hardware and protocol-processor-based coherence controllers in SMP-node-based CC-NUMA systems on applications from the SPLASH-2 suite. Using realistic parameters and detailed models of existing state-of-the-art system components, it shows that the occupancy of coherence controllers can limit the performance of applications with high communication requirements, where the execution time using protocol processors can be twice as long as using custom hardware. To gain a deeper understanding of the tradeoff, we investigate the effect of varying several architectural parameters that influence the communication characteristics of the applications and the underlying system on coherence controller performance. We identify measures of applications' communication requirements and their impact on the performance penalty of protocol processors, which can help system designers predict performance penalties for other applications. We also study the potential of improving the performance of hardware-based and protocol-processor-based coherence controllers by separating or duplicating critical components.
Citations: 50
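To make the occupancy argument concrete, the following is a minimal back-of-the-envelope sketch of how controller occupancy caps served protocol-event throughput. All handler occupancies and event rates are illustrative assumptions, not figures from the paper.

```python
# Hedged sketch: occupancy-limited throughput of a coherence controller.
# All numbers are illustrative assumptions, not measurements from the paper.

def served_rate(handler_occupancy_ns, offered_events_per_us):
    """Protocol events per microsecond actually served when each event
    occupies the controller for handler_occupancy_ns nanoseconds."""
    max_rate = 1000.0 / handler_occupancy_ns   # controller's service capacity
    return min(offered_events_per_us, max_rate)

custom_hw_ns = 40.0        # hypothetical custom-hardware handler occupancy
protocol_proc_ns = 120.0   # hypothetical protocol-processor occupancy (~3x slower)

for offered in (2.0, 5.0, 10.0, 20.0):   # offered protocol events per microsecond
    hw = served_rate(custom_hw_ns, offered)
    pp = served_rate(protocol_proc_ns, offered)
    print(f"offered={offered:5.1f}/us  served: hw={hw:5.2f}/us  pp={pp:5.2f}/us")
```

Under these made-up numbers, the protocol processor saturates near 8.3 events/µs while custom hardware keeps up to 25 events/µs, which is the flavor of occupancy limit the paper quantifies for communication-intensive applications.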
The Design And Analysis Of A Cache Architecture For Texture Mapping
Z. S. Hakura, Anoop Gupta
DOI: 10.1145/264107.264152
Abstract: The effectiveness of texture mapping in enhancing the realism of computer generated imagery has made support for real-time texture mapping a critical part of graphics pipelines. Despite a recent surge in interest in three-dimensional graphics from computer architects, high-quality high-speed texture mapping has so far been confined to costly hardware systems that use brute-force techniques to achieve high performance. One obstacle faced by designers of texture mapping systems is the requirement of extremely high bandwidth to texture memory. High bandwidth is necessary since there are typically tens to hundreds of millions of accesses to texture memory per second. In addition, to achieve the high clock rates required in graphics pipelines, low-latency access to texture memory is needed. In this paper, we propose the use of texture image caches to alleviate the above bottlenecks, and evaluate various tradeoffs that arise in such designs. We find that the factors important to cache behavior are (i) the representation of texture images in memory, (ii) the rasterization order on screen, and (iii) the cache organization. Through a detailed investigation of these issues, we explore the best way to exploit locality of reference and determine whether this technique is robust with respect to different scenes and different amounts of texture. Overall, we observe that there is a significant amount of temporal and spatial locality and that the working set sizes are relatively small (at most 16KB) across all cases that we studied. Consequently, the memory bandwidth requirements of a texture cache system are substantially lower (at least three times and as much as fifteen times) than the memory bandwidth requirements of a system which achieves equivalent performance but does not utilize a cache. These results are very encouraging and indicate that caching is a promising approach to designing memory systems for texture mapping.
Citations: 183
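One of the cache-friendly texture representations of the kind the paper evaluates is a blocked ("tiled") layout, where a small square of texels is stored contiguously so that a bilinear filtering footprint tends to hit the same cache lines. The tile size, texel size, and texture width below are illustrative assumptions, not the paper's parameters.

```python
# Hedged sketch of a blocked (tiled) texture layout for cache locality.
# TILE, TEXEL_BYTES, and WIDTH_TILES are made-up example values.

TILE = 8           # assumed 8x8-texel tiles
TEXEL_BYTES = 4    # assumed 32-bit texels
WIDTH_TILES = 128  # assumed 1024-texel-wide texture, i.e. 1024 / 8 tiles per row

def tiled_address(u, v):
    """Map texel (u, v) to a byte address so that each 8x8 neighborhood of
    texels is contiguous in memory and tends to share cache lines."""
    tile_x, in_x = divmod(u, TILE)
    tile_y, in_y = divmod(v, TILE)
    tile_index = tile_y * WIDTH_TILES + tile_x
    texel_in_tile = in_y * TILE + in_x
    return (tile_index * TILE * TILE + texel_in_tile) * TEXEL_BYTES

# The four texels of a bilinear footprint land close together in memory:
print([tiled_address(u, v) for u, v in [(10, 20), (11, 20), (10, 21), (11, 21)]])
```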
On Deadlocks In Interconnection Networks
T. Pinkston, Sugath Warnakulasuriya
DOI: 10.1145/264107.264127
Abstract: Deadlock avoidance-based and deadlock recovery-based routing algorithms have been proposed in recent years without full understanding of the likelihood and characteristics of actual deadlocks in interconnection networks. This work models the interrelationships between routing freedom, message blocking, correlated resource dependencies and deadlock formation. We empirically show that increasing routing freedom, as achieved by allowing unrestricted routing over multiple virtual channels, makes deadlocks highly improbable and reduces the likelihood of other types of correlated message blocking behavior that can degrade performance. Our results further substantiate that recovery-based routing algorithms have a higher potential performance advantage over deadlock avoidance-based routing algorithms which, inherently, allow less routing freedom.
Citations: 87
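A deadlock in this setting shows up as a cycle of correlated resource dependencies: blocked messages holding channels while waiting for channels held by other blocked messages. The following is a minimal sketch of detecting such a cycle in a channel wait-for graph; the graph instance is a made-up example, not data from the paper.

```python
# Hedged sketch: a deadlock manifests as a cycle in a channel wait-for graph.
# The example graphs below are invented for illustration.

def has_cycle(wait_for):
    """wait_for maps a channel to the set of channels that its blocked
    messages are waiting to acquire. A cycle implies a deadlocked set."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {c: WHITE for c in wait_for}

    def dfs(c):
        color[c] = GREY
        for nxt in wait_for.get(c, ()):
            if color.get(nxt, WHITE) == GREY:
                return True                        # back edge -> cycle found
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and dfs(c) for c in wait_for)

# c0 -> c1 -> c2 -> c0 is a cyclic dependency (deadlock); the second graph is not.
print(has_cycle({"c0": {"c1"}, "c1": {"c2"}, "c2": {"c0"}}))   # True
print(has_cycle({"c0": {"c1"}, "c1": {"c2"}, "c2": set()}))    # False
```

Greater routing freedom (more virtual channels, fewer routing restrictions) gives blocked messages more alternative edges to follow, which is why such cycles become far less likely, as the paper shows empirically.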
Data Prefetching On The HP PA-8000
Vatsa Santhanam, Edward H. Gornish, W. Hsu
DOI: 10.1145/264107.264208
Abstract: Memory latency is a major issue for many modern microprocessor based systems, including the Hewlett-Packard PA-8000. Due to its fast clock rate and wide issue capability, cache misses in the PA-8000 are very expensive. The PA-8000 combines out-of-order execution with multiple outstanding memory requests to tolerate memory latency; however, this approach has its limitations. In order to substantially reduce much of the memory latency penalty, the PA-8000 uses software-based data cache prefetching. In this paper, we discuss the implementation of the data prefetch generation algorithm in the Hewlett-Packard Precision Architecture (HP-PA) compiler. We present performance results for SPECfp95 on a PA-8000 system that show speedups, due to data prefetching, of up to 100%.
Citations: 90
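A core decision any software-prefetching compiler pass must make is the prefetch distance: how many loop iterations ahead to issue the prefetch so the data arrives before it is used. The sketch below shows that computation with assumed latency and loop-cost numbers; these are not PA-8000 figures or the HP-PA compiler's actual algorithm.

```python
# Hedged sketch of a prefetch-distance calculation for software data prefetching.
# The 60-cycle latency and 6-cycle loop body are illustrative assumptions.

import math

def prefetch_distance(memory_latency_cycles, cycles_per_iteration):
    """Issue the prefetch this many iterations ahead to hide the miss latency."""
    return math.ceil(memory_latency_cycles / cycles_per_iteration)

d = prefetch_distance(60, 6)
print(d)   # 10 -> the pass would prefetch a[i + 10*stride] inside the loop body
```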
Implementing Multidestination Worms In Switch-based Parallel Systems: Architectural Alternatives And Their Impact
Rajeev Sivaram, C. Stunkel, D. Panda
DOI: 10.1145/264107.264129
Abstract: Multidestination message passing has been proposed as an attractive mechanism for efficiently implementing multicast and other collective operations on direct networks. However, applying this mechanism to switch-based parallel systems is non-trivial. In this paper we propose alternative switch architectures with differing buffer organizations to implement multidestination worms on switch-based parallel systems. First, we discuss issues related to such implementation (deadlock-freedom, replication mechanisms, header encoding, and routing). Next, we demonstrate how an existing central-buffer-based switch architecture supporting unicast message passing can be enhanced to accommodate multidestination message passing. Similarly, implementing multidestination worms on an input-buffer-based switch architecture is discussed. Both of these implementations are evaluated against each other as well as against a software-based scheme using the central buffer organization. Simulation experiments under a range of traffic (multiple multicast, bimodal, varying degree of multicast, and message length) and system size are used for evaluation. The study demonstrates the superiority of the central-buffer-based switch architecture. It also indicates that under bimodal traffic the central-buffer-based hardware multicast implementation affects background unicast traffic less adversely compared to a software-based multicast implementation. Thus, multidestination message passing can easily be applied to switch-based parallel systems to deliver good collective communication performance.
Citations: 63
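Header encoding and replication are two of the issues the paper raises. One common encoding for a multidestination worm is a bit string naming the destination set; at a switch, that header must be partitioned across the output ports through which the named destinations are reached. The sketch below illustrates that partitioning with a made-up 2-port routing table; it is not the paper's encoding or switch design.

```python
# Hedged sketch: splitting a bit-string multidestination header per output port.
# The routing table (port_of_dest) is an invented example.

def split_header(dest_mask, port_of_dest):
    """Group the destinations named in dest_mask (a bit mask) by the output
    port each one is reached through, yielding one sub-header per port."""
    per_port = {}
    dest = 0
    while dest_mask:
        if dest_mask & 1:
            port = port_of_dest[dest]
            per_port[port] = per_port.get(port, 0) | (1 << dest)
        dest_mask >>= 1
        dest += 1
    return per_port

# Destinations 0..7; assume destinations 0-3 exit port 0 and 4-7 exit port 1.
port_of_dest = {d: (0 if d < 4 else 1) for d in range(8)}
print(split_header(0b10110101, port_of_dest))   # {0: 0b101, 1: 0b10110000}
```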
The Agree Predictor: A Mechanism For Reducing Negative Branch History Interference
Eric Sprangle, R. Chappell, M. Alsup, Y. Patt
DOI: 10.1145/264107.264210
Abstract: Deeply pipelined, superscalar processors require accurate branch prediction to achieve high performance. Two-level branch predictors have been shown to achieve high prediction accuracy. It has also been shown that branch interference is a major contributor to the number of branches mispredicted by two-level predictors. This paper presents a new method to reduce the interference problem called agree prediction, which reduces the chance that two branches aliasing the same PHT entry will interfere negatively. We evaluate the performance of this scheme using full traces (both user and supervisor) of the SPECint95 benchmarks. The result is a reduction in the misprediction rate of gcc ranging from 8.62% with a 64K-entry PHT up to 33.3% with a 1K-entry PHT.
Citations: 173
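The key idea is that the PHT counter predicts whether a branch will agree with a per-branch biasing bit rather than predicting taken/not-taken directly, so two branches that alias the same counter interfere less as long as each usually agrees with its own bias. The following is a minimal sketch of that mechanism; the table size, indexing function, and bias-setting policy are illustrative assumptions rather than the paper's exact configuration.

```python
# Hedged sketch of agree prediction: the counter predicts agreement with a
# per-branch bias bit. Sizes and indexing are illustrative assumptions.

PHT_BITS = 10
pht = [2] * (1 << PHT_BITS)   # 2-bit saturating counters, initialized to weak "agree"
bias = {}                     # biasing bit per static branch (here: first outcome seen)

def pht_index(pc, history):
    return (pc ^ history) & ((1 << PHT_BITS) - 1)

def predict(pc, history):
    """Return the predicted direction (True = taken)."""
    agree = pht[pht_index(pc, history)] >= 2
    return agree if bias.get(pc, 1) else not agree

def update(pc, history, taken):
    if pc not in bias:
        bias[pc] = int(taken)             # set the biasing bit on first execution
    idx = pht_index(pc, history)
    agreed = (int(taken) == bias[pc])
    pht[idx] = min(3, pht[idx] + 1) if agreed else max(0, pht[idx] - 1)
```

With this arrangement, two aliasing branches that both mostly follow their own biases push the shared counter in the same ("agree") direction, turning destructive interference into neutral or constructive interference.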
The Energy Efficiency Of IRAM Architectures
R. Fromm, S. Perissakis, N. Cardwell, C. Kozyrakis, B. McGaughy, D. Patterson, Thomas E. Anderson, K. Yelick
DOI: 10.1145/264107.264214
Abstract: Portable systems demand energy efficiency in order to maximize battery life. IRAM architectures, which combine DRAM and a processor on the same chip in a DRAM process, are more energy efficient than conventional systems. The high density of DRAM permits a much larger amount of memory on-chip than a traditional SRAM cache design in a logic process. This allows most or all IRAM memory accesses to be satisfied on-chip. Thus there is much less need to drive high-capacitance off-chip buses, which contribute significantly to the energy consumption of a system. To quantify this advantage we apply models of energy consumption in DRAM and SRAM memories to results from cache simulations of applications reflective of personal productivity tasks on low power systems. We find that IRAM memory hierarchies consume as little as 22% of the energy consumed by a conventional memory hierarchy for memory-intensive applications, while delivering comparable performance. Furthermore, the energy consumed by a system consisting of an IRAM memory hierarchy combined with an energy efficient CPU core is as little as 40% of that of the same CPU core with a traditional memory hierarchy.
Citations: 128
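The style of accounting described in the abstract multiplies per-access energies (on-chip versus high-capacitance off-chip accesses) by access counts from cache simulation. The sketch below illustrates that arithmetic with invented energy and miss-rate numbers; they are not the paper's model parameters or results.

```python
# Hedged sketch of memory-hierarchy energy accounting. The per-access energies
# and hit/miss splits are illustrative assumptions, not the paper's values.

def hierarchy_energy_nj(on_chip_accesses, off_chip_accesses,
                        on_chip_nj=0.5, off_chip_nj=12.0):
    """Total memory-system energy in nanojoules."""
    return on_chip_accesses * on_chip_nj + off_chip_accesses * off_chip_nj

accesses = 10_000_000
conventional = hierarchy_energy_nj(int(accesses * 0.95),  int(accesses * 0.05))
iram_like    = hierarchy_energy_nj(int(accesses * 0.999), int(accesses * 0.001))
print(f"IRAM-like hierarchy uses {100 * iram_like / conventional:.0f}% "
      f"of the conventional hierarchy's memory energy (under assumed parameters)")
```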
Reactive NUMA: A Design For Unifying S-COMA And CC-NUMA
B. Falsafi, D. Wood
DOI: 10.1145/264107.264205
Abstract: This paper proposes and evaluates a new approach to directory-based cache coherence protocols called Reactive NUMA (R-NUMA). An R-NUMA system combines a conventional CC-NUMA coherence protocol with a more recent Simple-COMA (S-COMA) protocol. What makes R-NUMA novel is the way it dynamically reacts to program and system behavior to switch between CC-NUMA and S-COMA and exploit the best aspects of both protocols. This reactive behavior allows each node in an R-NUMA system to independently choose the best protocol for a particular page, thus providing much greater performance stability than either CC-NUMA or S-COMA alone. Our evaluation is both qualitative and quantitative. We first show the theoretical result that R-NUMA's worst-case performance is bounded within a small constant factor (i.e., two to three times) of the best of CC-NUMA and S-COMA. We then use detailed execution-driven simulation to show that, in practice, R-NUMA usually performs better than either a pure CC-NUMA or pure S-COMA protocol, and no more than 57% worse than the best of CC-NUMA and S-COMA, for our benchmarks and base system assumptions.
Citations: 115
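The per-page reactive decision can be pictured as a counter-and-threshold policy: a remote page starts out handled as plain CC-NUMA, and once its blocks have been refetched often enough after eviction, the node switches that page to locally allocated S-COMA handling. The sketch below is a minimal rendering of that idea; the threshold value and exact trigger condition are assumptions for illustration, not the paper's tuned policy.

```python
# Hedged sketch of R-NUMA's per-page protocol choice via a refetch counter.
# REFETCH_THRESHOLD and the trigger condition are illustrative assumptions.

REFETCH_THRESHOLD = 64

class PageState:
    def __init__(self):
        self.protocol = "CC-NUMA"   # remote page starts under the CC-NUMA protocol
        self.refetches = 0

    def on_remote_block_refetch(self):
        """Called when a block of this page is fetched again after eviction."""
        if self.protocol == "CC-NUMA":
            self.refetches += 1
            if self.refetches >= REFETCH_THRESHOLD:
                self.protocol = "S-COMA"   # allocate the page locally from now on

page = PageState()
for _ in range(100):
    page.on_remote_block_refetch()
print(page.protocol)   # "S-COMA": the page proved to be heavily refetched
```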
Effects Of Communication Latency, Overhead, And Bandwidth In A Cluster Architecture
Richard P. Martin, Amin Vahdat, D. Culler, Thomas E. Anderson
DOI: 10.1145/264107.264146
Abstract: This work provides a systematic study of the impact of communication performance on parallel applications in a high performance network of workstations. We develop an experimental system in which the communication latency, overhead, and bandwidth can be independently varied to observe the effects on a wide range of applications. Our results indicate that current efforts to improve cluster communication performance to that of tightly integrated parallel machines result in significantly improved application performance. We show that applications demonstrate strong sensitivity to overhead, slowing down by a factor of 60 on 32 processors when overhead is increased from 3 to 103 µs. Applications in this study are also sensitive to per-message bandwidth, but are surprisingly tolerant of increased latency and lower per-byte bandwidth. Finally, most applications demonstrate a highly linear dependence on both overhead and per-message bandwidth, indicating that further improvements in communication performance will continue to improve application performance.
Citations: 275
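The knobs varied in this study correspond to the parameters of a LogGP-style cost model: latency L, per-message overhead o, and the per-message gap g that bounds injection rate. A minimal sketch of that model is shown below; the parameter values are illustrative (only the 3 µs versus 103 µs overhead sweep is taken from the abstract), and the formula is the textbook LogP accounting, not the paper's apparatus.

```python
# Hedged sketch of a LogGP-style cost for a burst of short messages.
# L, o, g defaults are illustrative assumptions.

def send_time_us(n_messages, L=5.0, o=3.0, g=6.0):
    """Time (us) until the last of n short messages is received: the last
    message is injected after o + (n-1)*max(o, g), then takes L to cross the
    network and o to be handled at the receiver."""
    last_injected = o + (n_messages - 1) * max(o, g)
    return last_injected + L + o

print(send_time_us(100, o=3.0))     # baseline overhead
print(send_time_us(100, o=103.0))   # overhead inflated as in the study's sweep
```

Raising o from 3 to 103 µs dominates every term of the model, which matches the observed strong sensitivity of applications to overhead relative to latency.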
The Mercury Interconnect Architecture: A Cost-effective Infrastructure For High-performance Servers
W. Weber, Stephen Gold, Pat Helland, Takeshi Shimizu, Thomas Wicki, W. Wilcke
DOI: 10.1145/264107.264149
Abstract: This paper presents HAL's Mercury Interconnect Architecture, an interconnect infrastructure designed to link commodity microprocessors, memory, and I/O components into high-performance multiprocessing servers. Both shared-memory and message-passing systems, as well as hybrid systems, are supported by the interconnect. The key attributes of the Mercury Interconnect Architecture are: low latency, high bandwidth, a modular and flexible design, reliability/availability/serviceability (RAS) features, and a simplicity that enables very cost-effective implementations. The first implementation of the architecture links multiple 4-processor Pentium™ Pro-based nodes. In a 4-node (16-processor) shared-memory configuration, this system achieves a remote read latency of just over 1 µs and a maximum interconnect bandwidth of 6.4 GByte/s. Both of these parameters far outpace comparable SCI-based solutions, while using far fewer hardware components.
Citations: 66