Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture: Latest Publications

Implementation of atomic primitives on distributed shared memory multiprocessors
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386540
Maged M. Michael, M. Scott
Abstract: In this paper we consider several hardware implementations of the general-purpose atomic primitives fetch_and_Φ, compare_and_swap, load_linked, and store_conditional on large-scale shared-memory multiprocessors. These primitives have proven popular on small-scale bus-based machines, but have yet to become widely available on large-scale, distributed shared memory machines. We propose several alternative hardware implementations of these primitives, and then analyze the performance of these implementations for various data sharing patterns. Our results indicate that good overall performance can be obtained by implementing compare_and_swap in the cache controllers, and by providing an additional instruction to load an exclusive copy of a cache line.
Citations: 29

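As an aside on the primitives themselves: the sketch below, in plain C11, shows the semantics of one fetch_and_Φ instance (fetch_and_add) emulated with compare_and_swap. It is illustrative only and says nothing about the hardware implementations the paper evaluates.

```c
/* Software illustration of a fetch_and_Phi primitive (here, fetch_and_add)
 * built from compare_and_swap, using C11 atomics. This sketches the
 * semantics only; the paper studies hardware implementations of these
 * primitives in cache controllers, not this software emulation. */
#include <stdatomic.h>
#include <stdio.h>

int fetch_and_add(atomic_int *addr, int delta)
{
    int old = atomic_load(addr);
    /* Retry until the compare_and_swap succeeds: the classic CAS loop.
     * On failure, 'old' is reloaded with the value currently in memory. */
    while (!atomic_compare_exchange_weak(addr, &old, old + delta)) {
    }
    return old;                      /* value observed before the update */
}

int main(void)
{
    atomic_int counter = 0;
    int before = fetch_and_add(&counter, 5);
    printf("before=%d after=%d\n", before, atomic_load(&counter));
    return 0;
}
```
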
DASC cache
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386548
André Seznec
Abstract: For many microprocessors, cache hit time determines the clock cycle. On the other hand, the cache miss penalty (measured in instruction issue delays) keeps growing. Reconciling a low cache miss ratio with a low cache hit time is therefore an important issue. When caches are virtually indexed, the operating system (or some specific hardware) has to manage data consistency of caches and memory. Unfortunately, reconciling physical indexing of the cache with a low cache hit time is very difficult. In this paper, we propose the Direct-mapped Access Set-associative Check (DASC) cache to address both difficulties. In a DASC cache, the cache array is direct-mapped, so the cache hit time is low; the tag array, however, is set-associative, and the external miss ratio of a DASC cache is the same as the miss ratio of a set-associative cache. When the size of one associativity degree of the tag array is tied to the minimum page size, a virtually indexed but physically tagged DASC cache correctly handles all difficulties associated with cache consistency. Trace-driven simulations show that, for cache sizes in the range of 16 to 64 Kbytes and page sizes in the range of 4 to 8 Kbytes, a DASC cache is a valuable trade-off allowing fast cache hit time and low cache miss ratio while cache consistency management is performed by hardware.
Citations: 20

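To make the access/check split concrete, here is a rough C sketch of such a lookup under my own assumptions about sizes and layout (64-byte lines, a 4-way tag array, invented names); it is not Seznec's exact design.

```c
/* Rough sketch of a DASC-style lookup: the data array is accessed
 * direct-mapped with the virtual index, while a set-associative,
 * physically tagged tag array is checked in parallel. All names and
 * constants are illustrative assumptions, not the paper's design. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE   64u                    /* bytes per cache line              */
#define SETS   64u                    /* tag-array sets                    */
#define WAYS   4u                     /* tag-array associativity           */
#define SLOTS  (SETS * WAYS)          /* direct-mapped data-array slots    */

struct dasc {
    uint32_t tag[SETS][WAYS];         /* physical tags                     */
    bool     valid[SETS][WAYS];
    uint8_t  data[SLOTS][LINE];       /* way w of set s holds slot s+w*SETS */
};

enum result { FAST_HIT, SLOW_HIT, MISS };

enum result dasc_lookup(const struct dasc *c, uint32_t vaddr, uint32_t paddr)
{
    uint32_t set    = (vaddr / LINE) % SETS;         /* page-offset bits   */
    uint32_t va_way = (vaddr / LINE / SETS) % WAYS;   /* slot read speculatively */
    uint32_t ptag   = paddr / LINE / SETS;            /* physical tag       */

    for (uint32_t w = 0; w < WAYS; w++) {
        if (c->valid[set][w] && c->tag[set][w] == ptag) {
            /* Block is present; it is a fast hit only if it sits in the
             * slot that the direct-mapped data access already read. */
            return (w == va_way) ? FAST_HIT : SLOW_HIT;
        }
    }
    return MISS;      /* same external-miss behaviour as a 4-way cache */
}

int main(void)
{
    static struct dasc c;                              /* all entries invalid */
    printf("%d\n", dasc_lookup(&c, 0x1240, 0x5240));   /* prints 2 (MISS)     */
    return 0;
}
```
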
Non-consistent dual register files to reduce register pressure
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386558
J. Llosa, M. Valero, E. Ayguadé
Abstract: The continuous growth in the instruction-level parallelism offered by microprocessors requires a large register file and a large number of ports to access it. This paper presents the non-consistent dual register file, an alternative implementation and management of the register file. Non-consistent dual register files support the bandwidth demands and the high register requirements while penalizing neither access time nor implementation cost. The proposal is evaluated for software-pipelined loops and compared against a unified register file. Empirical results show improvements in performance and a noticeable reduction in the density of memory traffic due to a reduction of the spill code. Spill code can in general increase the minimum initiation interval and decrease loop performance. Additional improvements can be obtained when operations are scheduled with the proposed register file organization in mind.
Citations: 30

A VLSI architecture for computing the tree-to-tree distance
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386530
R. Sastry, N. Ranganathan
Abstract: The distance between two labeled ordered trees α and β is the minimum-cost sequence of editing operations (insertions, deletions, and substitutions) needed to transform α into β such that the predecessor-descendant relation between nodes and the ordering of nodes is not changed. Approximate tree matching has applications in genetic sequence comparison, scene analysis, error recovery and correction in programming languages, and cluster analysis. Edit distance determination is a computationally intensive task, and the design of special-purpose hardware could result in a significant speed-up. This paper describes in detail a VLSI architecture for computing the edit distance between arbitrary ordered trees, based on a parallel, systolic realization of the dynamic programming algorithm proposed by S.Y. Lu (1979). This architecture represents a significant improvement over that described by Sastry and Ranganathan (1994), which restricted the type of trees that could be processed. Two partitioning strategies to process trees of arbitrary sizes and structures on a fixed-size implementation in multiple passes are proposed and analyzed.
Citations: 3

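Lu's tree-to-tree algorithm is too long to reproduce here; as a stand-in, the sketch below shows the classic string edit-distance dynamic program (insertions, deletions, substitutions), the one-dimensional analogue of the recurrence that such systolic arrays parallelize. It is not the tree algorithm itself.

```c
/* Classic edit-distance dynamic program over strings (insert, delete,
 * substitute), shown as a simpler stand-in for the tree-to-tree distance
 * recurrence discussed above; it is not Lu's tree algorithm. Assumes
 * inputs no longer than MAXN characters. */
#include <stdio.h>
#include <string.h>

#define MAXN 128

int edit_distance(const char *a, const char *b)
{
    size_t n = strlen(a), m = strlen(b);
    int d[MAXN + 1][MAXN + 1];

    for (size_t i = 0; i <= n; i++) d[i][0] = (int)i;   /* delete all of a */
    for (size_t j = 0; j <= m; j++) d[0][j] = (int)j;   /* insert all of b */

    for (size_t i = 1; i <= n; i++) {
        for (size_t j = 1; j <= m; j++) {
            int sub = d[i-1][j-1] + (a[i-1] != b[j-1]);  /* substitution */
            int del = d[i-1][j]   + 1;                   /* deletion     */
            int ins = d[i][j-1]   + 1;                   /* insertion    */
            int best = sub < del ? sub : del;
            d[i][j] = best < ins ? best : ins;
        }
    }
    return d[n][m];
}

int main(void)
{
    printf("%d\n", edit_distance("kitten", "sitting"));  /* prints 3 */
    return 0;
}
```
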
Massively parallel array processor for logic, fault, and design error simulation
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386529
Y. Hur, S. Szygenda, E. S. Fehr, G. Ott, Sungho Kang
Abstract: Digital logic, fault, and error simulation of large VLSI circuits is one of the most compute-intensive tasks in digital systems analysis. This paper describes a massively parallel special-purpose array processor, or hardware accelerator, for digital logic, fault, and error simulation. Hardware simulation is a viable approach for simulation of large systems, since simulation time increases rapidly as a function of the size and complexity of the systems to be simulated. In order to reduce cost and achieve high performance, a massively parallel array processor and new algorithms have been introduced. By executing an efficient and direct model of the design on the PE array, the architecture can provide high performance, similar to prototyping. Simulation results show that the hardware accelerator is orders of magnitude faster than software simulation.
Citations: 2

U-cache: a cost-effective solution to synonym problem
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386538
Jesung Kim, Sang Lyul Min, Sanghoon Jeon, ByoungChul Ahn, Deog-Kyoon Jeong, Chong-Sang Kim
Abstract: This paper proposes a cost-effective solution to the synonym problem. In the proposed solution, a minimal hardware addition guarantees correctness while its software counterpart helps improve performance. The key to the solution is the addition of a small physically indexed cache called the U-cache. The U-cache maintains the reverse translation information of only those cache blocks that belong to un-aligned virtual pages, where aligned means that the lower bits of the virtual page number match those of the corresponding physical page number. A U-cache, even with only one entry, ensures correct handling of synonyms. A simple software optimization, in the form of page alignment, helps improve performance. Performance evaluation based on ATUM traces shows that a U-cache with only a few entries performs almost as well as (and in some cases outperforms) a fully-configured hardware-based solution when more than 95% of the pages are aligned.
Citations: 2

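The alignment condition is easy to state in code. A minimal sketch with assumed constants (4-Kbyte pages and a 64-Kbyte cache index; neither figure is taken from the paper):

```c
/* "Aligned" in the sense used above: the low bits of the virtual page
 * number match those of the physical page number, so the cache index bits
 * above the page offset agree for virtual and physical addresses.
 * Page size and index width are illustrative assumptions. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12u                          /* 4-Kbyte pages (assumed)  */
#define INDEX_BITS  16u                          /* 64-Kbyte index (assumed) */
#define COLOR_BITS  (INDEX_BITS - PAGE_SHIFT)    /* page-number bits that
                                                    reach the cache index    */

bool page_is_aligned(uint64_t vaddr, uint64_t paddr)
{
    uint64_t vpn  = vaddr >> PAGE_SHIFT;
    uint64_t ppn  = paddr >> PAGE_SHIFT;
    uint64_t mask = (1u << COLOR_BITS) - 1;
    return (vpn & mask) == (ppn & mask);         /* no synonym hazard if equal */
}

int main(void)
{
    printf("%d\n", page_is_aligned(0x0000f000, 0x0004f000)); /* 1: colors match  */
    printf("%d\n", page_is_aligned(0x0000f000, 0x00052000)); /* 0: colors differ */
    return 0;
}
```
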
Efficient and balanced adaptive routing in two-dimensional meshes
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386550
Jatin Upadhyay, Vara Varavithya, P. Mohapatra
Abstract: In this paper, we present the new concept of a region of adaptivity with respect to various routing algorithms in wormhole networks. Using this concept, we demonstrate that previously proposed routing algorithms, though more adaptive, cause an uneven workload in the network which limits the performance improvement. It is observed that a balanced distribution of traffic has a greater impact on system performance than the adaptivity or efficiency of the algorithm. Based on these motivating factors, we present a new fully adaptive routing algorithm for two-dimensional meshes using one extra virtual channel. The algorithm is more efficient in terms of the number of paths it offers between the source and the destination, and it also distributes the network load more evenly and symmetrically. Simulation results are presented and compared with the results of previously proposed algorithms. It is shown that the proposed algorithm results in much better performance in terms of average network latency and throughput.
Citations: 31

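For readers unfamiliar with adaptivity in meshes, the sketch below shows only the basic idea of choosing among productive output directions at a node; the paper's algorithm additionally prescribes virtual-channel usage for deadlock freedom, which is not modeled here.

```c
/* Minimal sketch of adaptivity in a 2-D mesh: a packet at (cx,cy) headed
 * for (dx,dy) may use any output that reduces the remaining distance.
 * Only the path choice is shown; virtual-channel rules are omitted. */
#include <stdio.h>

enum dir { EAST, WEST, NORTH, SOUTH };

/* Fills 'out' with the productive directions and returns how many there are. */
int productive_dirs(int cx, int cy, int dx, int dy, enum dir out[2])
{
    int n = 0;
    if (dx > cx)      out[n++] = EAST;
    else if (dx < cx) out[n++] = WEST;
    if (dy > cy)      out[n++] = NORTH;
    else if (dy < cy) out[n++] = SOUTH;
    return n;   /* 2 while both offsets are nonzero, then 1, then 0 */
}

int main(void)
{
    enum dir d[2];
    int n = productive_dirs(1, 1, 3, 4, d);
    printf("%d productive directions\n", n);   /* prints 2 */
    return 0;
}
```
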
Fast barrier synchronization in wormhole k-ary n-cube networks with multidestination worms
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386542
D. Panda
Abstract: This paper presents a new approach to implementing fast barrier synchronization in wormhole k-ary n-cubes. The novelty lies in using multidestination messages instead of traditional single-destination messages. Two different multidestination worm types, gather and broadcasting, are introduced to implement the report and wake-up phases of barrier synchronization, respectively. Algorithms for complete and arbitrary-set barrier synchronization are presented using these new worms. It is shown that complete barrier synchronization in a k-ary n-cube system with e-cube routing can be implemented with 2n communication start-ups, as compared to the 2n log2(k) start-ups needed with unicast-based message passing. For the arbitrary-set barrier, an interesting trend is observed where the synchronization cost keeps reducing beyond a certain number of participating nodes.
Citations: 49

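The start-up counts quoted above are easy to compare numerically; the snippet below just evaluates both expressions for an example 8-ary 3-cube (the configuration is my choice, not the paper's).

```c
/* Start-up counts quoted above for a complete barrier on a k-ary n-cube:
 * 2n with multidestination worms versus 2n*log2(k) with unicast messages.
 * The 8-ary 3-cube below is only an example configuration. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    int k = 8, n = 3;                       /* 8-ary 3-cube: 512 nodes   */
    double multidest = 2.0 * n;             /* gather + broadcast phases */
    double unicast   = 2.0 * n * log2(k);   /* unicast-based barrier     */
    printf("multidestination: %.0f start-ups, unicast: %.0f start-ups\n",
           multidest, unicast);             /* 6 vs 18                   */
    return 0;
}
```
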
Implementing register interlocks in parallel-pipeline, multiple instruction queue, superscalar processors
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386559
S. Weiss
Abstract: A dependence on data, control, or resources might cause one instruction to become stalled in a pipeline stage waiting for a preceding instruction to produce a result or release a resource. The pipeline control hardware checks for dependences and prevents the instruction from going to the next pipeline stage if a dependence occurs. We refer to this hardware as interlock logic. The amount and complexity of the interlock logic required to support a ten-plus instruction issue bandwidth is a major concern in the design of the pipeline control hardware. We look specifically at register interlocks in the context of a parallel pipeline with separate dispatch and issue phases, a generalization of the pipeline organization implemented by a number of prominent recent superscalar processors. We describe four implementations of the register interlock logic and a comparison based on the number of logic levels. We also present a high-bandwidth implementation of table-based register renaming.
Citations: 5

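As a behavioural illustration of what register interlock logic decides (not one of the four hardware implementations the paper compares), a scoreboard-style check with made-up structures:

```c
/* Functional sketch of a register interlock: stall an instruction whose
 * source register still has a write outstanding. The structures here are
 * invented for illustration; the paper studies hardware realizations of
 * this check, not this software model. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct instr { int dst, src1, src2; };

static uint64_t pending;                 /* bit r set: register r busy   */

static bool can_issue(struct instr i)
{
    uint64_t need = (1ull << i.src1) | (1ull << i.src2);
    return (pending & need) == 0;        /* interlock if any source busy */
}

static void issue(struct instr i) { pending |=  (1ull << i.dst); }
static void writeback(int dst)    { pending &= ~(1ull << dst);   }

int main(void)
{
    struct instr load = { 3, 1, 1 };     /* r3 <- mem[r1] */
    struct instr add  = { 4, 3, 2 };     /* r4 <- r3 + r2 */

    issue(load);
    printf("add can issue: %d\n", can_issue(add));  /* 0: waits on r3 */
    writeback(3);
    printf("add can issue: %d\n", can_issue(add));  /* 1              */
    return 0;
}
```
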
Abstracting network characteristics and locality properties of parallel systems
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386555
A. Sivasubramaniam, A. Singla, U. Ramachandran, H. Venkateswaran
Abstract: Abstracting features of parallel systems is a technique that has traditionally been used in theoretical and analytical models for program development and performance evaluation. We explore the use of abstractions in execution-driven simulators in order to speed up simulation. In particular, we evaluate abstractions for the interconnection network and locality properties of parallel systems in the context of simulating cache-coherent shared memory (CC-NUMA) multiprocessors. We use the recently proposed LogP model to abstract the network. We abstract locality by modeling a cache at each processing node in the system which is maintained coherent, without modeling the overheads associated with coherence maintenance. Such an abstraction tries to capture the true communication characteristics of the application without modeling any hardware-induced artifacts. Using a suite of applications and three network topologies simulated on a novel simulation platform, we show that the latency overhead modeled by LogP is fairly accurate. On the other hand, the contention overhead can become pessimistic when the applications display sufficient communication locality. Our abstraction for data locality closely models the behavior of the target system over the chosen range of applications. The simulation model which incorporated these abstractions was around 250-300% faster than the simulation of the target machine.
Citations: 10

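For context on the network abstraction: LogP charges each message a send overhead o, a network latency L, and a receive overhead o, and enforces a minimum gap g between successive sends from one processor. The sketch below evaluates that cost model with made-up parameter values; it is not the paper's simulator.

```c
/* LogP-style message cost: overhead o at sender and receiver, latency L in
 * the network, and a minimum gap g between successive sends. Parameter
 * values below are illustrative, not taken from the paper. */
#include <stdio.h>

double one_message(double L, double o)
{
    return o + L + o;                 /* send overhead + latency + receive */
}

double n_messages(int n, double L, double o, double g)
{
    double issue_gap = g > o ? g : o; /* sender limited by max(g, o)       */
    return (n - 1) * issue_gap + one_message(L, o);
}

int main(void)
{
    double L = 10.0, o = 2.0, g = 4.0;        /* cycles, made-up values    */
    printf("1 msg:  %.1f cycles\n", one_message(L, o));        /* 14.0     */
    printf("8 msgs: %.1f cycles\n", n_messages(8, L, o, g));   /* 42.0     */
    return 0;
}
```
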