Proceedings Fifth International Symposium on High-Performance Computer Architecture最新文献

筛选
英文 中文
Permutation development data layout (PDDL) 排列发展数据布局(PDDL)
T. Schwarz, J. Steinberg, W. Burkhard
{"title":"Permutation development data layout (PDDL)","authors":"T. Schwarz, J. Steinberg, W. Burkhard","doi":"10.1109/HPCA.1999.744365","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744365","url":null,"abstract":"Declustered data organizations in disk arrays (RAIDs) achieve less-intrusive reconstruction of data after a disk failure. We present PDDL, a new data layout for declustered disk arrays. PDDL layouts exist for a large variety of disk array configurations with a distributed spare disk. PDDL declustered disk arrays have excellent run-time performance under light and heavy workloads. PDDL maximizes access parallelism in the most critical circumstances, namely during reconstruction of data on the spare disk. PDDL occurs minimum address translation overhead compared to all other proposed declustering layouts.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122499204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 17
A scalable cache coherent scheme exploiting wormhole routing networks 利用虫洞路由网络的可扩展缓存相干方案
Yunseok Rhee, Joonwon Lee
{"title":"A scalable cache coherent scheme exploiting wormhole routing networks","authors":"Yunseok Rhee, Joonwon Lee","doi":"10.1109/HPCA.1999.744367","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744367","url":null,"abstract":"Large scale shared memory multiprocessors favor a directory-based cache coherence scheme for its scalability. The directory space needed to record the information for sharers has a complexity of /spl Theta/(N/sup 2/) when a full-mapped vector is used for an N-node system. Though this overhead can be reduced by limiting the directory size assuming that the sharing degree is small, it will experience significant inefficiency when a data is widely shared. In this paper, we propose a new directory scheme and a cache coherence scheme based on it for a mesh interconnection. Deterministic and wormhole routing enables a pointer to represent a set of nodes. Also a message traversing on the mesh performs a broadcast mission to a set of nodes without extra traffic, which can be utilized for the cache coherence problem. Only a slight change on a generic router is needed to implement our scheme. This scheme is also applicable to any k-ary n-cube networks including a mesh.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132774637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Instruction recycling on a multiple-path processor 多路径处理器上的指令回收
S. Wallace, D. Tullsen, B. Calder
{"title":"Instruction recycling on a multiple-path processor","authors":"S. Wallace, D. Tullsen, B. Calder","doi":"10.1109/HPCA.1999.744323","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744323","url":null,"abstract":"Processors that can simultaneously execute multiple paths of execution will only exacerbate the fetch bandwidth problem already plaguing conventional processors. On a multiple-path processor which speculatively executes less likely paths of hard-to-predict branches, the work done along a speculative path is normally discarded if that path is found to be incorrect. Instead, it can be beneficial to keep these instruction traces stored in the processor for possible future use. This paper introduces instruction recycling, where previously decoded instructions from recently executed paths are injected back into the rename stage. This increases the supply of instructions to the execution pipeline and decreases fetch latency. In addition, if the operands have not changed for a recycled instruction, the instruction can bypass the issue and execution stages, benefiting from instruction reuse. Instruction recycling and reuse are examined for a simultaneous multithreading architecture with multiple path execution. It is shown to increase performance by 7% for single-program workloads and by 12% on multiple-program workloads.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129361260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
Using Lamport clocks to reason about relaxed memory models 使用兰波特时钟来推理放松记忆模型
A. Condon, M. Hill, Manoj Plakal, Daniel J. Sorin
{"title":"Using Lamport clocks to reason about relaxed memory models","authors":"A. Condon, M. Hill, Manoj Plakal, Daniel J. Sorin","doi":"10.1109/HPCA.1999.744379","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744379","url":null,"abstract":"Cache coherence protocols of current shared-memory multiprocessors are difficult to verify. Our previous work proposed an extension of Lamport's logical clocks for showing that multiprocessors can implement sequential consistency (SC) with an SGI Origin 2000-like directory protocol and a Sun Gigaplane-like split-transaction bus protocol. Many commercial multiprocessors, however, implement more relaxed models, such as SPARC Total Store Order (TSO), a variant of processor consistency, and Compaq (DEC) Alpha, a variant of weak consistency. This paper applies Lamport clocks to both a TSO and an Alpha implementation. Both implementations are based on the same Sun Gigaplane-like split-transaction bus protocol we previously used, but the TSO implementation places a first-in-first-out write buffer between a processor and its cache, while the Alpha implementation uses a coalescing write buffer. Both write buffers satisfy read requests for pending writes (i.e., do bypassing) without requiring the write to be immediately written to cache. Analysis shows how to apply Lamport clocks to verify TSO and Alpha specifications at the architectural level.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129988616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 38
Memory hierarchy considerations for fast transpose and bit-reversals 快速转置和位反转的内存层次考虑
K. Gatlin, L. Carter
{"title":"Memory hierarchy considerations for fast transpose and bit-reversals","authors":"K. Gatlin, L. Carter","doi":"10.1109/HPCA.1999.744320","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744320","url":null,"abstract":"This paper explores the interplay between algorithm design and a computer's memory hierarchy. Matrix transpose and the bit-reversal reordering are important scientific subroutines which often exhibit severe performance degradation due to cache and TLB associativity problems. We give lower bounds that show for typical memory hierarchy designs, extra data movement is unavoidable. We also prescribe characteristics of various levels of the memory hierarchy needed to perform efficient bit-reversals. Insight gained from our analysis leads to the design of a near optimal bit-reversal algorithm. This Cache Optimal Bit Reverse Algorithm (COBRA) is implemented on the Digital Alpha 21164, Sun Ultrasparc 2, and IBM Power2. We show that COBRA is near optimal with respect to execution time on these machines and performs much better than previous best known algorithms.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"196 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134332488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 24
Distributed modulo scheduling 分布式模调度
M. M. Fernandes, J. Llosa, N. Topham
{"title":"Distributed modulo scheduling","authors":"M. M. Fernandes, J. Llosa, N. Topham","doi":"10.1109/HPCA.1999.744349","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744349","url":null,"abstract":"Wide-issue ILP machines can be built using the VLIW approach as many of the hardware complexities found in superscalar processors can be transferred to the compiler. However, the scalability of VLIW architectures is still constrained by the size and number of ports of the register file required by a large number of functional units. Organizations composed of clusters of a few functional units and small private register files have been proposed to deal with this problem; an approach highly dependent on scheduling and partitioning strategies. The paper presents DMS, an algorithm that integrates modulo scheduling and code partitioning in a single procedure. Experimental results have shown that the algorithm is effective for configurations up to 8 clusters, or even more when targeting vectorizable loops.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128906064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 48
Comparative evaluation of fine- and coarse-grain approaches for software distributed shared memory 软件分布式共享内存的细粒度和粗粒度方法的比较评价
S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. Scales, M. Scott, R. Stets
{"title":"Comparative evaluation of fine- and coarse-grain approaches for software distributed shared memory","authors":"S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. Scales, M. Scott, R. Stets","doi":"10.1109/HPCA.1999.744377","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744377","url":null,"abstract":"Symmetric multiprocessors (SMPs) connected with low-latency networks provide attractive building blocks for software distributed shared memory systems. Two distinct approaches have been used: the fine-grain approach that instruments application loads and stores to support a small coherence granularity, and the coarse-grain approach based on virtual memory hardware that provides coherence at a page granularity. Fine-grain systems offer a simple migration path for applications developed on hardware multiprocessors by supporting coherence protocols similar to those implemented in hardware. On the other hand, coarse-grain systems can potentially provide higher performance through more optimized protocols and larger transfer granularities, while avoiding instrumentation overheads. Numerous studies have examined each approach individually, but major differences in experimental platforms and applications make comparison of the approaches difficult. This paper presents a detailed comparison of two mature systems, Shasta and Cashmere, representing the fine- and coarse-grain approaches, respectively. Both systems are tuned to run on the same commercially available, state-of-the-art cluster of AlphaServer SMPs connected via a Memory Channel network. As expected, our results show that Shasta provides robust performance for applications tuned for hardware multiprocessors, and can better tolerate fine-grain synchronization. In contrast, Cashmere is highly sensitive to fine-grain synchronization, but provides a performance edge for applications with coarse-grain behavior. Interestingly, we found that the performance gap between the systems can often be bridged by program modifications that address coherence and synchronization granularity. In addition, our study reveals some unexpected results related to the interaction of current compiler technology with application instrumentation, and the ability of SMP-aware protocols to avoid certain performance disadvantages of coarse-grain approaches.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132381393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 34
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信