Proceedings of the 20th Annual International Symposium on Computer Architecture: Latest Publications

Working Sets, Cache Sizes, And Node Granularity Issues For Large-scale Multiprocessors
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1109/ISCA.1993.698542
E. Rothberg, J. Singh, Anoop Gupta
{"title":"Working Sets, Cache Sizes, And Node Granularity Issues For Large-scale Multiprocessors","authors":"E. Rothberg, J. Singh, Anoop Gupta","doi":"10.1109/ISCA.1993.698542","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698542","url":null,"abstract":"The distribution of resources among processors, memory and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines?\u0000In this paper, we explore the above questions based on the characteristics of five important classes of large-scale parallel scientific applications. We first show that all the applications have a hierarchy of well-defined per-processor working sets, whose size, performance impact and scaling characteristics can help determine how large different levels of a multiprocessor's cache hierarchy should be. Then, we use these working sets together with certain other important characteristics of the applications—such as communication to computation ratios, concurrency, and load balancing behavior—to reflect upon the broader question of the granularity of processing nodes in high-performance multiprocessors.\u0000We find that very small caches whose sizes do not increase with the problem or machine size are adequate for all but two of the application classes. Even in the two exceptions, the working sets scale quite slowly with problem size, and the cache sizes needed for problems that will be run in the foreseeable future are small. We also find that relatively fine-grained machines, with large numbers of processors and quite small amounts of memory per processor, are appropriate for all the applications.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134520873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 118
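The working-set hierarchy this paper identifies can be visualized by sweeping cache capacity and looking for knees in the miss-rate curve. Below is a minimal, hypothetical sketch, a fully-associative LRU model over a synthetic two-level trace, not the authors' simulator or applications:

```python
# Hypothetical sketch: estimate working-set sizes by sweeping a
# fully-associative LRU cache over a block-reference trace; knees in the
# miss-rate curve mark working-set levels.
from collections import OrderedDict

def miss_rate(trace, capacity):
    """Miss rate of a fully-associative LRU cache holding `capacity` blocks."""
    cache, misses = OrderedDict(), 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # refresh LRU position on a hit
        else:
            misses += 1
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return misses / len(trace)

# Synthetic trace: a small inner working set touched often (32 blocks)
# and a larger outer one touched rarely (1024 blocks).
trace = ([b for _ in range(100) for b in range(32)] +
         [b for _ in range(2) for b in range(1024)])
for cap in (16, 32, 64, 256, 1024):
    print(f"{cap:5d} blocks -> miss rate {miss_rate(trace, cap):.3f}")
```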
The Detection And Elimination Of Useless Misses In Multiprocessors
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1109/ISCA.1993.698548
M. Dubois, J. Skeppstedt, L. Ricciulli, Krishnan Ramamurthy, P. Stenström
{"title":"The Detection And Elimination Of Useless Misses In Multiprocessors","authors":"M. Dubois, J. Skeppstedt, L. Ricciulli, Krishnan Ramamurthy, P. Stenström","doi":"10.1109/ISCA.1993.698548","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698548","url":null,"abstract":"In this paper we introduce a new classification of misses in shared-memory multiprocessors based on interprocessor communication. We identify the set of essential misses, i.e., the smallest set of misses necessary for correct execution. Essential misses include cold misses and true sharing misses. All other misses are useless misses and can be ignored without affecting the correctness of program execution. Based on the new classification we compare the effectiveness of five different protocols which delay and combine invalidations leading to useless misses. In cache-based systems the protocols are very effective and have miss rates close to the essential miss rate. In virtual shared memory systems the techniques are also effective but leave room for improvements.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127592898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 126
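The essential/useless distinction can be made concrete with a toy trace classifier. The sketch below is a deliberate simplification of the paper's definitions (it tracks only the last writer of each word and each processor's last access to each block, and all names are mine), but it separates cold, true-sharing, and useless (e.g. false-sharing) misses on a small trace:

```python
# Hedged toy classifier, simplified from the paper's definitions: on a miss,
# if the processor has never touched the block it is a cold miss; if the word
# it accesses was written by another processor since its last access to the
# block, it is a true-sharing miss; anything else (e.g. an invalidation caused
# by a write to a *different* word in the block) counts as useless here.
def classify(trace, block_words=4):
    # trace: list of (cpu, op, word) with op in {'R', 'W'}
    holder, last_access, last_write = {}, {}, {}
    counts = {'cold': 0, 'true_sharing': 0, 'useless': 0, 'hit': 0}
    for t, (cpu, op, word) in enumerate(trace):
        blk = word // block_words
        if cpu in holder.get(blk, set()):
            counts['hit'] += 1
        elif (cpu, blk) not in last_access:
            counts['cold'] += 1
        else:
            wt, writer = last_write.get(word, (-1, None))
            if writer not in (None, cpu) and wt > last_access[(cpu, blk)]:
                counts['true_sharing'] += 1
            else:
                counts['useless'] += 1
        last_access[(cpu, blk)] = t
        if op == 'W':
            last_write[word] = (t, cpu)
            holder[blk] = {cpu}                  # invalidate other copies
        else:
            holder.setdefault(blk, set()).add(cpu)
    return counts

trace = [(0, 'W', 0), (1, 'R', 0),   # cold misses for both processors
         (0, 'W', 0), (1, 'R', 0),   # invalidation, then a true-sharing miss
         (0, 'W', 1), (1, 'R', 0)]   # false sharing -> a useless miss
print(classify(trace))
```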
Column-associative Caches: A Technique For Reducing The Miss Rate Of Direct-mapped Caches
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1145/165123.165153
A. Agarwal, S. Pudar
Abstract: Direct-mapped caches are a popular design choice for high-performance processors; unfortunately, direct-mapped caches suffer systematic interference misses when more than one address maps into the same cache set. This paper describes the design of column-associative caches, which minimize the conflicts that arise in direct-mapped accesses by allowing conflicting addresses to dynamically choose alternate hashing functions, so that most of the conflicting data can reside in the cache. At the same time, however, the critical hit access path is unchanged. The key to implementing this scheme efficiently is the addition of a rehash bit to each cache set, which indicates whether that set stores data that is referenced by an alternate hashing function. When multiple addresses map into the same location, these rehashed locations are preferentially replaced. Using trace-driven simulations and an analytical model, we demonstrate that a column-associative cache removes virtually all interference misses for large caches, without altering the critical hit access time.
Citations: 280
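A rough rendering of the column-associative lookup follows. It assumes the bit-flip rehash function f(i) = i XOR (high-order index bit) and stores full block addresses rather than hardware tags, so treat it as an illustration of the control flow, not the authors' design:

```python
# Hedged sketch of a column-associative lookup; field names are illustrative.
class ColumnAssociativeCache:
    def __init__(self, num_sets):
        self.n = num_sets
        self.flip = num_sets >> 1            # XOR mask flipping the high index bit
        self.block = [None] * num_sets       # one block address per set
        self.rbit = [False] * num_sets       # rehash bit per set

    def access(self, addr):
        i = addr % self.n                    # primary, direct-mapped index
        if self.block[i] == addr:
            return 'first-probe hit'         # critical hit path is unchanged
        if self.rbit[i]:                     # set holds only rehashed data, so the
            self.block[i] = addr             # block cannot be in the cache at all:
            self.rbit[i] = False             # replace immediately, no second probe
            return 'miss'
        j = i ^ self.flip                    # alternate hashing function
        if self.block[j] == addr:
            self._swap(i, j)                 # promote so the next access hits fast
            return 'second-probe hit'
        self.block[j] = self.block[i]        # displace the primary occupant into
        self.rbit[j] = True                  # the alternate set, marked rehashed
        self.block[i], self.rbit[i] = addr, False
        return 'miss'

    def _swap(self, i, j):
        self.block[i], self.block[j] = self.block[j], self.block[i]
        self.rbit[i], self.rbit[j] = False, True

cache = ColumnAssociativeCache(num_sets=8)
for a in (3, 11, 3, 11):                     # 3 and 11 conflict in set 3; a plain
    print(a, '->', cache.access(a))          # direct-mapped cache would thrash
```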
Performance Of Cached Dram Organizations In Vector Supercomputers
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1145/165123.165170
W. Hsu, James E. Smith
Abstract: DRAMs containing cache memory are studied in the context of vector supercomputers. In particular, we consider systems where processors have no internal data caches and memory reference streams are generated by vector instructions. For this application, we expect that cached DRAMs can provide high bandwidth at relatively low cost.
We study both DRAMs with a single, long cache line and with smaller, multiple cache lines. Memory interleaving schemes that increase data locality are proposed and studied. The interleaving schemes are also shown to lead to non-uniform bank accesses, i.e., hot banks. This suggests there is an important optimization problem involving methods that increase locality to improve performance, but not so much that hot banks diminish performance. We show that for uniprocessor systems, both types of cached DRAMs work well with the proposed interleave methods. For multiprogrammed multiprocessors, the multiple cache line DRAMs work better.
Citations: 56
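The locality/hot-bank tension the abstract mentions is easy to demonstrate with generic interleaving functions; the scheme names and parameters below are illustrative, not the paper's exact organizations:

```python
# Illustrative sketch: word interleaving spreads consecutive words across
# banks, while block interleaving keeps runs of words in one bank so they can
# hit that DRAM's on-chip cache line. Note what a stride-8 vector stream does
# to each mapping when there are 8 banks.
from collections import Counter

def bank_word(addr, banks):                  # word-interleaved mapping
    return addr % banks

def bank_block(addr, banks, block=16):       # block-interleaved mapping
    return (addr // block) % banks

def histogram(addrs, mapper, banks=8):
    return Counter(mapper(a, banks) for a in addrs)

stream = range(0, 1024, 8)                   # stride-8 vector access stream
print("word  interleave:", dict(histogram(stream, bank_word)))   # one hot bank
print("block interleave:", dict(histogram(stream, bank_block)))  # spread evenly
```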
Multiple Threads In Cyclic Register Windows
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1109/ISCA.1993.698552
Yasuo Hidaka, H. Koike, Hidehiko Tanaka
{"title":"Multiple Threads In Cyclic Register Windows","authors":"Yasuo Hidaka, H. Koike, Hidehiko Tanaka","doi":"10.1109/ISCA.1993.698552","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698552","url":null,"abstract":"Multi-threading is often used to compile logic and functional languages, and implement parallel C libraries. Fine-grain multi-threading requires rapid context switching, which can be slow on architectures with register windows. In past, researchers have either proposed new hardware support for dynamic allocation of windows to threads, or have sacrificed fast procedure calls by fixed allocation of windows to threads. In this paper, a novel window management algorithm, which retains both fast procedure calls and fast context switching, is proposed. The algorithm has been implemented on the SPARC processor by modifying window trap handlers. A quantitative evaluation of the scheme using a multi-threaded application with various concurrency and granularity levels is given. The evaluation shows that the proposed scheme always does better than the other schemes. Some implications for multi-threaded architectures are also presented.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129969784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
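The fixed-versus-dynamic allocation tradeoff can be caricatured in a few lines. This is my own toy model, not the paper's SPARC trap-handler algorithm: threads perform random call/return walks over one cyclic file of 8 windows, one thread runs deeper call chains than the others, and an overflow trap is charged whenever a call cannot obtain a window:

```python
# Toy model (my own simplification): with a fixed partition each thread may
# hold only windows // threads register windows, so a deep call chain traps
# early even while other threads' windows sit idle; with dynamic sharing a
# call traps only when the whole cyclic window file is actually full.
import random

def overflow_traps(windows=8, threads=4, steps=10_000, shared=True):
    rng = random.Random(1)
    limit = windows if shared else windows // threads
    call_p = [0.7, 0.4, 0.4, 0.4]            # thread 0 runs deep call chains
    held = [0] * threads                     # windows currently held per thread
    traps = 0
    for _ in range(steps):
        t = rng.randrange(threads)           # a thread runs: call or return
        if rng.random() < call_p[t] or held[t] == 0:     # procedure call
            if held[t] == limit or (shared and sum(held) == windows):
                traps += 1                   # overflow trap: a window must spill
            else:
                held[t] += 1                 # call takes a fresh window
        else:
            held[t] -= 1                     # procedure return frees a window
    return traps

for shared in (False, True):
    print("shared" if shared else "fixed ", overflow_traps(shared=shared))
```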
Limitations Of Cache Prefetching On A Bus-based Multiprocessor
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1145/165123.165163
D. Tullsen, S. Eggers
Abstract: Compiler-directed cache prefetching has the potential to hide much of the high memory latency seen by current and future high-performance processors. However, prefetching is not without costs, particularly on a multiprocessor. Prefetching can negatively affect bus utilization, overall cache miss rates, memory latencies and data sharing. We simulated the effects of a particular compiler-directed prefetching algorithm, running on a bus-based multiprocessor. We showed that, despite a high memory latency, this architecture is not very well-suited for prefetching. For several variations on the architecture, speedups for five parallel programs were no greater than 39%, and degradations were as high as 7%, when prefetching was added to the workload. We examined the sources of cache misses, in light of several different prefetching strategies, and pinpointed the causes of the performance changes. Invalidation misses pose a particular problem for current compiler-directed prefetchers. We applied two techniques that reduced their impact: a special prefetching heuristic tailored to write-shared data, and restructuring shared data to reduce false sharing, thus allowing traditional prefetching algorithms to work well.
Citations: 77
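Why prefetching can backfire on write-shared data is visible even in a toy model. The following sketch, my own and not the paper's simulation infrastructure, issues prefetches a fixed distance ahead and lets a hypothetical remote processor invalidate every fourth block before it is consumed:

```python
# Toy illustration: software prefetches issued `distance` iterations ahead of
# each demand reference. Without sharing, almost every demand reference hits.
# If a remote write invalidates a block between its prefetch and its use, the
# prefetched copy is useless and the prefetch traffic is pure bus overhead.
def run(invalidated_blocks, distance=8, n=64):
    cache, misses, bus_transfers = set(), 0, 0
    for i in range(n):
        pf = i + distance
        if pf < n and pf not in cache:       # compiler-inserted prefetch
            cache.add(pf)
            bus_transfers += 1
        if i in invalidated_blocks:          # remote write invalidates our copy
            cache.discard(i)
        if i not in cache:                   # demand reference misses
            misses += 1
            bus_transfers += 1
            cache.add(i)
    return misses, bus_transfers

print("private data:      misses=%2d bus=%2d" % run(set()))
print("write-shared data: misses=%2d bus=%2d" % run(set(range(0, 64, 4))))
```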
Cache Write Policies And Performance
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1109/ISCA.1993.698560
N. Jouppi
{"title":"Cache Write Policies And Performance","authors":"N. Jouppi","doi":"10.1109/ISCA.1993.698560","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698560","url":null,"abstract":"This paper investigates issues involving writes and caches. First, tradeoffs on writes that miss in the cache are investigated. In particular, whether the missed cache block is fetched on a write miss, whether the missed cache block is allocated in the cache, and whether the cache line is written before hit or miss is known are considered. Depending on the combination of these polices chosen, the entire cache miss rate can vary by a factor of two on some applications. The combination of no-fetch-on-write and write-allocate can provide better performance than cache line allocation instructions. Second, tradeoffs between write-through and write-back caching when writes hit in a cache are considered. A mixture of these two alternatives, called write caching is proposed. Write caching places a small fully-associative cache behind a write-through cache. A write cache can eliminate almost as much write traffic as a write-back cache.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122418175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 254
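The write-cache idea, a small fully-associative buffer behind a write-through cache, can be sketched as follows. Block size, capacity, and the LRU policy here are my assumptions, chosen only to show how coalescing absorbs write traffic:

```python
# Sketch of a write cache (my own toy: 4-word blocks, LRU, small capacity).
# Repeated writes to a resident block are coalesced, so far fewer writes
# reach memory than with pure write-through.
from collections import OrderedDict

def memory_writes(writes, write_cache_blocks=0, block_words=4):
    wc, traffic = OrderedDict(), 0           # resident dirty blocks, LRU order
    for addr in writes:
        blk = addr // block_words
        if write_cache_blocks == 0:
            traffic += 1                     # write-through: every write goes out
        elif blk in wc:
            wc.move_to_end(blk)              # coalesced: absorbed by write cache
        else:
            wc[blk] = True
            if len(wc) > write_cache_blocks:
                wc.popitem(last=False)       # evict LRU dirty block to memory
                traffic += 1
    return traffic + len(wc)                 # final flush of resident blocks

writes = [0, 1, 2, 3, 0, 1, 2, 3, 16, 17, 0, 1]
print("write-through:           ", memory_writes(writes))
print("with 4-block write cache:", memory_writes(writes, write_cache_blocks=4))
```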
Odd Memory Systems May Be Quite Interesting
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1109/ISCA.1993.698574
André Seznec, J. Lenfant
{"title":"Odd Memory Systems May Be Quite Interesting","authors":"André Seznec, J. Lenfant","doi":"10.1109/ISCA.1993.698574","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698574","url":null,"abstract":"Using a prime number of N of memory banks on a vector processor allows a conflict-free access for any slice of N consecutive elements of a vector stored with a stride not multiple of N.\u0000To reject the use of a prime (or odd) number N of memory banks, it is generally advanced that address computation for such a memory system would require systematic Euclidean Division by the number N. We first show that the well known Chinese Remainder Theorem allows to define a very simple mapping of data onto the memory banks for which address computation does not require any Euclidean Division.\u0000Massively parallel SIMD computers may have several thousands of processors. When the memory on such a machine is globally shared, routing vectors from memory to the processors is a major difficulty; the control for the interconnection network cannot be generally computed at execution time. When the number of memory banks and processors is a product of prime numbers, the family of permutations needed for routing vectors for memory to the processors through the interconnection network have very specific properties. The Chinese Remainder Network presented in the paper is able to execute all these permutations in a single path and may be self-routed.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131764388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
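The paper's starting point, that the Chinese Remainder Theorem yields a division-free mapping onto a prime number of banks, checks out in a few lines. Variable names below are mine, and M is taken as a power of two so the local offset is just the low-order address bits:

```python
# Sketch of the Chinese-Remainder mapping: with N banks and M words per bank,
# gcd(N, M) = 1, address a is stored in bank a mod N at local offset a mod M.
# The CRT guarantees this pair is unique over all N*M addresses, so the
# Euclidean quotient a // N is never needed to compute a word's location.
from math import gcd

N, M = 17, 16                                # prime bank count, power-of-2 depth
assert gcd(N, M) == 1

def crt_map(a):
    return a % N, a % M                      # (bank, local offset)

# Uniqueness: every address gets its own (bank, offset) cell.
assert len({crt_map(a) for a in range(N * M)}) == N * M

# Conflict-free slices: N consecutive elements of a vector with stride s
# (s not a multiple of N) fall in N distinct banks.
for s in (1, 3, 5, 16):
    assert len({crt_map(7 + i * s)[0] for i in range(N)}) == N
print("stride slices are conflict-free across", N, "banks")
```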
The Cedar System And An Initial Performance Study
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1145/285930.286005
D. Kuck, E. Davidson, D. Lawrie, A. Sameh, Chuanqi Zhu
Abstract: In this paper, we give an overview of the Cedar multiprocessor and present recent performance results. These include the performance of some computational kernels and the Perfect Benchmarks®. We also present a methodology for judging parallel system performance and apply this methodology to Cedar, Cray YMP-8, and Thinking Machines CM-5.
Citations: 49