The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. Latest articles

Caches and hash trees for efficient memory integrity verification
B. Gassend, G. Suh, Dwaine E. Clarke, Marten van Dijk, S. Devadas
{"title":"Caches and hash trees for efficient memory integrity verification","authors":"B. Gassend, G. Suh, Dwaine E. Clarke, Marten van Dijk, S. Devadas","doi":"10.1109/HPCA.2003.1183547","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183547","url":null,"abstract":"We study the hardware cost of implementing hash-tree based verification of untrusted external memory by a high performance processor. This verification could enable applications such as certified program execution. A number of schemes are presented with different levels of integration between the on-processor L2 cache and the hash-tree machinery. Simulations show that for the best of our methods, the performance overhead is less than 25%, a significant decrease from the 10× overhead of a naive implementation.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127284219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 282
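
The hash-tree verification mentioned in the abstract above can be illustrated with a minimal software sketch. This is not the paper's hardware scheme and leaves out its cache integration; it only shows the basic idea of keeping one trusted root hash and checking a block read from untrusted memory against it. The function names, the use of SHA-256, and the binary tree shape are assumptions made for illustration.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Hash helper (SHA-256 is an assumption; the abstract names no hash)."""
    return hashlib.sha256(data).digest()


def build_tree(blocks):
    """Return the tree as a list of levels: levels[0] = leaf hashes, levels[-1] = [root]."""
    if not blocks or len(blocks) & (len(blocks) - 1):
        raise ValueError("this sketch needs a power-of-two number of blocks")
    levels = [[h(b) for b in blocks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def verify_block(index, block, levels, trusted_root):
    """Recompute the leaf-to-root path using sibling hashes read from untrusted storage."""
    node = h(block)
    for level in levels[:-1]:
        sibling = level[index ^ 1]  # the other child of the same parent
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == trusted_root
```

Each check touches one hash per tree level, which hints at why a naive implementation is expensive and why caching tree nodes on chip could help.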
Tradeoffs in buffering memory state for thread-level speculation in multiprocessors
M. Garzarán, Milos Prvulović, J. Llabería, V. Viñals, Lawrence Rauchwerger, J. Torrellas
{"title":"Tradeoffs in buffering memory state for thread-level speculation in multiprocessors","authors":"M. Garzarán, Milos Prvulović, J. Llabería, V. Viñals, Lawrence Rauchwerger, J. Torrellas","doi":"10.1109/HPCA.2003.1183537","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183537","url":null,"abstract":"Thread-level speculation provides architectural support to aggressively run hard-to-analyze code in parallel. As speculative tasks run concurrently, they generate unsafe or speculative memory state that needs to be separately buffered and managed in the presence of distributed caches and buffers. Such state may contain multiple versions of the same variable. In this paper, we introduce a novel taxonomy of approaches to buffering and managing multi-version speculative memory state in multiprocessors. We also present a detailed complexity-benefit tradeoff analysis of the different approaches. Finally, we use numerical applications to evaluate the performance of the approaches under a single architectural framework. Our key insights are that support for buffering the state of multiple speculative tasks and versions per processor is more complexity-effective than support for merging the state of tasks with main memory lazily. Moreover, both supports can be gainfully combined and, in large machines, their effect is nearly fully additive. Finally, the more complex support for future state in main memory can boost performance when buffers are under pressure, but hurts performance when squashes are frequent.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"262 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132572278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 55
A statistically rigorous approach for improving simulation methodology
J. Yi, D. Lilja, D. Hawkins
{"title":"A statistically rigorous approach for improving simulation methodology","authors":"J. Yi, D. Lilja, D. Hawkins","doi":"10.1109/HPCA.2003.1183546","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183546","url":null,"abstract":"Due to cost, time, and flexibility constraints, simulators are often used to explore the design space when developing new processor architectures, as well as when evaluating the performance of new processor enhancements. However, despite this dependence on simulators, statistically rigorous simulation methodologies are not typically used in computer architecture research. A formal methodology can provide a sound basis for drawing conclusions gathered from simulation results by adding statistical rigor, and consequently, can increase confidence in the simulation results. This paper demonstrates the application of a rigorous statistical technique to the setup and analysis phases of the simulation process. Specifically, we apply a Plackett and Burman design to: (1) identify key processor parameters; (2) classify benchmarks based on how they affect the processor; and (3) analyze the effect of processor performance enhancements. Our technique expands on previous work by applying a statistical method to improve the simulation methodology instead of applying a statistical model to estimate the performance of the processor.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"83 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131349479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 171
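
To make the Plackett and Burman design named in the abstract above more concrete, here is a small sketch of the standard 12-run design for up to 11 two-level factors, with a main-effect estimate per factor. The abstract does not say which design size the authors used, so the 12-run size, the function names, and the response values in the usage note are illustrative assumptions.

```python
# Standard 12-run Plackett-Burman generator row (+1 = high setting, -1 = low setting).
PB12_GENERATOR = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]


def pb12_design():
    """Build the 12 x 11 design: 11 cyclic shifts of the generator plus an all-low run."""
    n = len(PB12_GENERATOR)
    rows = [[PB12_GENERATOR[(j - i) % n] for j in range(n)] for i in range(n)]
    rows.append([-1] * n)
    return rows


def main_effects(design, responses):
    """Effect of a factor = mean response at its high level minus mean at its low level."""
    runs = len(design)
    factors = len(design[0])
    return [
        sum(design[i][j] * responses[i] for i in range(runs)) / (runs / 2)
        for j in range(factors)
    ]
```

In a simulation study, each row would be one simulator configuration and each response a measured metric such as cycle count; factors with the largest absolute effects would be the candidates for "key processor parameters".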
Just say no: benefits of early cache miss determination
G. Memik, Glenn D. Reinman, W. Mangione-Smith
{"title":"Just say no: benefits of early cache miss determination","authors":"G. Memik, Glenn D. Reinman, W. Mangione-Smith","doi":"10.1109/HPCA.2003.1183548","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183548","url":null,"abstract":"As the performance gap between the processor cores and the memory subsystem increases, designers are forced to develop new latency hiding techniques. Arguably, the most common technique is to utilize multi-level caches. Each new generation of processors is equipped with higher levels of memory hierarchy with increasing sizes at each level. In this paper, we propose 5 different techniques that will reduce the data access times and power consumption in processors with multi-level caches. Using the information about the blocks placed into and replaced from the caches, the techniques quickly determine whether an access at any cache level will be a miss. The accesses that are identified to miss are aborted. The structures used to recognize misses are much smaller than the cache structures. Consequently the data access times and power consumption are reduced. Using the SimpleScalar simulator, we study the performance of these techniques for a processor with 5 cache levels. The best technique is able to abort 53.1% of the misses on average in SPEC2000 applications. Using these techniques, the execution time of the applications is reduced by up to 12.4% (5.4% on average), and the power consumption of the caches is reduced by as much as 11.6% (3.8% on average).","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116223556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 54
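
The abstract above describes small structures that track blocks placed into and replaced from a cache so that a certain miss can be recognized early. One plausible way to picture such a structure is a table of counters indexed by a hash of the block address. The sketch below is an assumption for illustration, not one of the paper's five techniques; the class name, table size, and block size are hypothetical.

```python
class MissFilter:
    """Counter table that can prove an access will miss; it never reports a false miss."""

    def __init__(self, num_counters=1024, block_bits=6):
        self.counters = [0] * num_counters
        self.block_bits = block_bits  # 64-byte blocks (assumed)

    def _slot(self, address):
        return (address >> self.block_bits) % len(self.counters)

    def on_fill(self, address):
        """Call when a block is placed into the cache."""
        self.counters[self._slot(address)] += 1

    def on_evict(self, address):
        """Call when a block is replaced from the cache."""
        self.counters[self._slot(address)] -= 1

    def definitely_misses(self, address):
        """A zero counter means no resident block maps here, so the access must miss."""
        return self.counters[self._slot(address)] == 0
```

A nonzero counter only means the block might be present, because different blocks can share a slot, so the normal cache lookup is still needed in that case.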
Dynamic data dependence tracking and its application to branch prediction
Lei Chen, S. Dropsho, D. Albonesi
{"title":"Dynamic data dependence tracking and its application to branch prediction","authors":"Lei Chen, S. Dropsho, D. Albonesi","doi":"10.1109/HPCA.2003.1183525","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183525","url":null,"abstract":"To continue to improve processor performance, microarchitects seek to increase the effective instruction level parallelism (ILP) that can be exploited in applications. A fundamental limit to improving ILP is data dependences among instructions. If data dependence information is available at run-time, there are many uses to improve ILP. Prior published examples include decoupled branch execution architectures and critical instruction detection. In this paper, we describe an efficient hardware mechanism to dynamically track the data dependence chains of the instructions in the pipeline. This information is available on a cycle-by-cycle basis to the microengine for optimizing its performance. We then use this design in a new value-based branch prediction design using available register value information (ARVI). From the use of data dependence information, the ARVI branch predictor has better prediction accuracy than a comparably sized hybrid branch predictor. With ARVI used as the second-level branch predictor, the improved prediction accuracy results in a 12.6% performance improvement on average across the SPEC95 integer benchmark suite.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122261545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33
Performance enhancement techniques for InfiniBand™ Architecture
Eun Jung Kim, K. H. Yum, C. Das, Mazin S. Yousif, J. Duato
{"title":"Performance enhancement techniques for InfiniBand™ Architecture","authors":"Eun Jung Kim, K. H. Yum, C. Das, Mazin S. Yousif, J. Duato","doi":"10.1109/HPCA.2003.1183543","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183543","url":null,"abstract":"The InfiniBand™ Architecture (IBA) is envisioned to be the default communication fabric for future system area networks (SAN). However, the released IBA specification outlines only higher level functionalities, leaving it open for exploring various design alternatives. In this paper we investigate four co-related techniques to provide high and predictable performance in IBA. These are: (i) using the shortest path first (SPF) algorithm for deterministic packet routing; (ii) developing a multipath routing mechanism for minimizing congestion; (iii) developing a selective packet dropping scheme to handle deadlock and congestion; and (iv) providing multicasting support for customized applications. These designs are evaluated using an integrated workload on a versatile IBA simulation testbed. Simulation results indicate that the SPF routing, multipath routing, packet dropping, and multicasting schemes are quite effective in delivering high and assured performance in clusters. One of the major contributions of this research is the IBA simulation testbed, which is an essential tool to evaluate various design tradeoffs.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115587378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
Dynamic voltage scaling with links for power optimization of interconnection networks
L. Shang, L. Peh, N. Jha
{"title":"Dynamic voltage scaling with links for power optimization of interconnection networks","authors":"L. Shang, L. Peh, N. Jha","doi":"10.1109/HPCA.2003.1183527","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183527","url":null,"abstract":"Originally developed to connect processors and memories in multicomputers, prior research and design of interconnection networks have focused largely on performance. As these networks get deployed in a wide range of new applications, where power is becoming a key design constraint, we need to seriously consider power efficiency in designing interconnection networks. As the demand for network bandwidth increases, communication links, already a significant consumer of power now, will take up an ever larger portion of total system power budget. In this paper we motivate the use of dynamic voltage scaling (DVS) for links, where the frequency and voltage of links are dynamically adjusted to minimize power consumption. We propose a history-based DVS policy that judiciously adjusts link frequencies and voltages based on past utilization. Our approach realizes up to 6.3× power savings (4.6× on average). This is accompanied by a moderate impact on performance (15.2% increase in average latency before network saturation and 2.5% reduction in throughput). To the best of our knowledge, this is the first study that targets dynamic power optimization of interconnection networks.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114418159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 490
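
The abstract above mentions a history-based policy that adjusts link frequency and voltage from past utilization but gives no further detail. The following sketch shows one simple shape such a policy could take: a weighted average of recent utilization compared against two thresholds. The frequency levels, thresholds, weight, and class name are all hypothetical and are not taken from the paper.

```python
class HistoryDvsLink:
    """Toy history-based controller for one link (all parameters are assumed)."""

    def __init__(self, levels=(0.25, 0.5, 0.75, 1.0), low=0.3, high=0.7, weight=0.5):
        self.levels = levels          # relative frequency steps
        self.level = len(levels) - 1  # start at full speed
        self.low = low                # step down below this utilization
        self.high = high              # step up above this utilization
        self.weight = weight          # weight given to the newest sample
        self.history = 0.0

    def update(self, utilization):
        """Fold the latest window's utilization into the history, then move one step."""
        self.history = self.weight * utilization + (1 - self.weight) * self.history
        if self.history > self.high and self.level < len(self.levels) - 1:
            self.level += 1
        elif self.history < self.low and self.level > 0:
            self.level -= 1
        return self.levels[self.level]
```

Because the controller reacts to averaged history rather than the latest sample alone, a burst of traffic raises the frequency only after a short delay, which is one way to picture the latency cost the abstract reports.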
Exploring the VLSI scalability of stream processors
Brucek Khailany, W. Dally, S. Rixner, U. Kapasi, John Douglas Owens, Brian Towles
{"title":"Exploring the VLSI scalability of stream processors","authors":"Brucek Khailany, W. Dally, S. Rixner, U. Kapasi, John Douglas Owens, Brian Towles","doi":"10.1109/HPCA.2003.1183534","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183534","url":null,"abstract":"Stream processors are high-performance programmable processors optimized to run media applications. Recent work has shown these processors to be more area- and energy-efficient than conventional programmable architectures. This paper explores the scalability of stream architectures to future VLSI technologies where over a thousand floating-point units on a single chip will be feasible. Two techniques for increasing the number of ALUs in a stream processor are presented: intracluster and intercluster scaling. These scaling techniques are shown to be cost-efficient to tens of ALUs per cluster and to hundreds of arithmetic clusters. A 640-ALU stream processor with 128 clusters and 5 ALUs per cluster is shown to be feasible in 45 nanometer technology, sustaining over 300 GOPS on kernels and providing 15.3× of kernel speedup and 8.0× of application speedup over a 40-ALU stream processor with a 2% degradation in area per ALU and a 7% degradation in energy dissipated per ALU operation.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130133352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 74
Deterministic clock gating for microprocessor power reduction
Hai Helen Li, S. Bhunia, Yiran Chen, T. N. Vijaykumar, K. Roy
{"title":"Deterministic clock gating for microprocessor power reduction","authors":"Hai Helen Li, S. Bhunia, Yiran Chen, T. N. Vijaykumar, K. Roy","doi":"10.1109/HPCA.2003.1183529","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183529","url":null,"abstract":"With the scaling of technology and the need for higher performance and more functionality, power dissipation is becoming a major bottleneck for microprocessor designs. Pipeline balancing (PLB), a previous technique, is essentially a methodology to clock-gate unused components whenever a program's instruction-level parallelism is predicted to be low. However, no nonpredictive methodologies are available in the literature for efficient clock gating. This paper introduces deterministic clock gating (DCG) based on the key observation that for many of the stages in a modern pipeline, a circuit block's usage in a specific cycle in the near future is deterministically known a few cycles ahead of time. Our experiments show an average of 19.9% reduction in processor power with virtually no performance loss for an 8-issue, out-of-order superscalar processor by applying DCG to execution units, pipeline latches, D-Cache wordline decoders, and result bus drivers. In contrast, PLB achieves 9.9% average power savings at 2.9% performance loss.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"212 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115781430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 110
Hierarchical backoff locks for nonuniform communication architectures
Z. Radovic, Erik Hagersten
{"title":"Hierarchical backoff locks for nonuniform communication architectures","authors":"Z. Radovic, Erik Hagersten","doi":"10.1109/HPCA.2003.1183542","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183542","url":null,"abstract":"This paper identifies node affinity as an important property for scalable general-purpose locks. Nonuniform communication architectures (NUCA), for example CC-NUMA built from a few large nodes or from chip multiprocessors (CMP), have a lower penalty for reading data from a neighbor's cache than from a remote cache. Lock implementations that encourage handing over locks to neighbors will improve the lock handover time, as well as the access to the critical data guarded by the lock, but will also be vulnerable to starvation. We propose a set of simple software-based hierarchical backoff locks (HBO) that create node affinity in NUCA. A solution for lowering the risk of starvation is also suggested. The HBO locks are compared with other software-based lock implementations using simple benchmarks, and are shown to be very competitive for uncontested locks while being more than twice as fast for contended locks. An application study also demonstrates superior performance for applications with high lock contention and competitive performance for other programs.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116609803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 73
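
The node-affinity idea in the abstract above can be sketched as a test-and-set style lock whose lock word records the holder's node, so that a waiting thread backs off briefly when the holder is a neighbor and longer when it is remote. This is a simplified illustration under stated assumptions: Python has no hardware compare-and-swap, so an internal `threading.Lock` stands in for it, the fixed backoff values are hypothetical, and the starvation remedy the authors suggest is not modeled.

```python
import threading
import time

FREE = -1  # lock word value when nobody holds the lock


class HboLock:
    """Toy hierarchical backoff lock; the lock word stores the holder's node id."""

    def __init__(self, local_backoff=0.0001, remote_backoff=0.001):
        self._owner_node = FREE
        self._cas_guard = threading.Lock()  # stands in for an atomic compare-and-swap
        self.local_backoff = local_backoff
        self.remote_backoff = remote_backoff

    def _cas(self, expected, new):
        """Set the lock word to `new` if it equals `expected`; return the value seen."""
        with self._cas_guard:
            seen = self._owner_node
            if seen == expected:
                self._owner_node = new
            return seen

    def acquire(self, my_node):
        while True:
            seen = self._cas(FREE, my_node)
            if seen == FREE:
                return
            # Retry sooner when a neighbor holds the lock, so handover tends to stay local.
            time.sleep(self.local_backoff if seen == my_node else self.remote_backoff)

    def release(self):
        with self._cas_guard:
            self._owner_node = FREE
```

Threads on the holder's node retry more often and so tend to win the next handover, which is the affinity effect; the same bias is what makes remote waiters vulnerable to starvation.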