Proceedings of the 20th Annual International Symposium on Computer Architecture: Latest Publications

Working Sets, Cache Sizes, And Node Granularity Issues For Large-scale Multiprocessors
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1109/ISCA.1993.698542
E. Rothberg, J. Singh, Anoop Gupta
{"title":"Working Sets, Cache Sizes, And Node Granularity Issues For Large-scale Multiprocessors","authors":"E. Rothberg, J. Singh, Anoop Gupta","doi":"10.1109/ISCA.1993.698542","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698542","url":null,"abstract":"The distribution of resources among processors, memory and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines?\u0000In this paper, we explore the above questions based on the characteristics of five important classes of large-scale parallel scientific applications. We first show that all the applications have a hierarchy of well-defined per-processor working sets, whose size, performance impact and scaling characteristics can help determine how large different levels of a multiprocessor's cache hierarchy should be. Then, we use these working sets together with certain other important characteristics of the applications—such as communication to computation ratios, concurrency, and load balancing behavior—to reflect upon the broader question of the granularity of processing nodes in high-performance multiprocessors.\u0000We find that very small caches whose sizes do not increase with the problem or machine size are adequate for all but two of the application classes. Even in the two exceptions, the working sets scale quite slowly with problem size, and the cache sizes needed for problems that will be run in the foreseeable future are small. We also find that relatively fine-grained machines, with large numbers of processors and quite small amounts of memory per processor, are appropriate for all the applications.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134520873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 118
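The working-set hierarchy this paper identifies can be visualized by sweeping cache capacity and looking for knees in the miss-rate curve. Below is a minimal, hypothetical sketch, a fully-associative LRU model over a synthetic two-level trace, not the authors' simulator or applications:

```python
# Hypothetical sketch: estimate working-set sizes by sweeping a
# fully-associative LRU cache over a block-reference trace; knees in the
# miss-rate curve mark working-set levels.
from collections import OrderedDict

def miss_rate(trace, capacity):
    """Miss rate of a fully-associative LRU cache holding `capacity` blocks."""
    cache, misses = OrderedDict(), 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # refresh LRU position on a hit
        else:
            misses += 1
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return misses / len(trace)

# Synthetic trace: a small inner working set touched often (32 blocks)
# and a larger outer one touched rarely (1024 blocks).
trace = ([b for _ in range(100) for b in range(32)] +
         [b for _ in range(2) for b in range(1024)])
for cap in (16, 32, 64, 256, 1024):
    print(f"{cap:5d} blocks -> miss rate {miss_rate(trace, cap):.3f}")
```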
The Detection And Elimination Of Useless Misses In Multiprocessors
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1109/ISCA.1993.698548
M. Dubois, J. Skeppstedt, L. Ricciulli, Krishnan Ramamurthy, P. Stenström
{"title":"The Detection And Elimination Of Useless Misses In Multiprocessors","authors":"M. Dubois, J. Skeppstedt, L. Ricciulli, Krishnan Ramamurthy, P. Stenström","doi":"10.1109/ISCA.1993.698548","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698548","url":null,"abstract":"In this paper we introduce a new classification of misses in shared-memory multiprocessors based on interprocessor communication. We identify the set of essential misses, i.e., the smallest set of misses necessary for correct execution. Essential misses include cold misses and true sharing misses. All other misses are useless misses and can be ignored without affecting the correctness of program execution. Based on the new classification we compare the effectiveness of five different protocols which delay and combine invalidations leading to useless misses. In cache-based systems the protocols are very effective and have miss rates close to the essential miss rate. In virtual shared memory systems the techniques are also effective but leave room for improvements.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127592898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 126
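The essential/useless distinction can be made concrete with a toy trace classifier. The sketch below is a deliberate simplification of the paper's definitions (it tracks only the last writer of each word and each processor's last access to each block, and all names are mine), but it separates cold, true-sharing, and useless (e.g. false-sharing) misses on a small trace:

```python
# Hedged toy classifier, simplified from the paper's definitions: on a miss,
# if the processor has never touched the block it is a cold miss; if the word
# it accesses was written by another processor since its last access to the
# block, it is a true-sharing miss; anything else (e.g. an invalidation caused
# by a write to a *different* word in the block) counts as useless here.
def classify(trace, block_words=4):
    # trace: list of (cpu, op, word) with op in {'R', 'W'}
    holder, last_access, last_write = {}, {}, {}
    counts = {'cold': 0, 'true_sharing': 0, 'useless': 0, 'hit': 0}
    for t, (cpu, op, word) in enumerate(trace):
        blk = word // block_words
        if cpu in holder.get(blk, set()):
            counts['hit'] += 1
        elif (cpu, blk) not in last_access:
            counts['cold'] += 1
        else:
            wt, writer = last_write.get(word, (-1, None))
            if writer not in (None, cpu) and wt > last_access[(cpu, blk)]:
                counts['true_sharing'] += 1
            else:
                counts['useless'] += 1
        last_access[(cpu, blk)] = t
        if op == 'W':
            last_write[word] = (t, cpu)
            holder[blk] = {cpu}                  # invalidate other copies
        else:
            holder.setdefault(blk, set()).add(cpu)
    return counts

trace = [(0, 'W', 0), (1, 'R', 0),   # cold misses for both processors
         (0, 'W', 0), (1, 'R', 0),   # invalidation, then a true-sharing miss
         (0, 'W', 1), (1, 'R', 0)]   # false sharing -> a useless miss
print(classify(trace))
```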
Column-associative Caches: A Technique For Reducing The Miss Rate Of Direct-mapped Caches
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1145/165123.165153
A. Agarwal, S. Pudar
Abstract: Direct-mapped caches are a popular design choice for high-performance processors; unfortunately, direct-mapped caches suffer systematic interference misses when more than one address maps into the same cache set. This paper describes the design of column-associative caches, which minimize the conflicts that arise in direct-mapped accesses by allowing conflicting addresses to dynamically choose alternate hashing functions, so that most of the conflicting data can reside in the cache. At the same time, however, the critical hit access path is unchanged. The key to implementing this scheme efficiently is the addition of a rehash bit to each cache set, which indicates whether that set stores data that is referenced by an alternate hashing function. When multiple addresses map into the same location, these rehashed locations are preferentially replaced. Using trace-driven simulations and an analytical model, we demonstrate that a column-associative cache removes virtually all interference misses for large caches, without altering the critical hit access time.
Citations: 280
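A rough rendering of the column-associative lookup follows. It assumes the bit-flip rehash function f(i) = i XOR (high-order index bit) and stores full block addresses rather than hardware tags, so treat it as an illustration of the control flow, not the authors' design:

```python
# Hedged sketch of a column-associative lookup; field names are illustrative.
class ColumnAssociativeCache:
    def __init__(self, num_sets):
        self.n = num_sets
        self.flip = num_sets >> 1            # XOR mask flipping the high index bit
        self.block = [None] * num_sets       # one block address per set
        self.rbit = [False] * num_sets       # rehash bit per set

    def access(self, addr):
        i = addr % self.n                    # primary, direct-mapped index
        if self.block[i] == addr:
            return 'first-probe hit'         # critical hit path is unchanged
        if self.rbit[i]:                     # set holds only rehashed data, so the
            self.block[i] = addr             # block cannot be in the cache at all:
            self.rbit[i] = False             # replace immediately, no second probe
            return 'miss'
        j = i ^ self.flip                    # alternate hashing function
        if self.block[j] == addr:
            self._swap(i, j)                 # promote so the next access hits fast
            return 'second-probe hit'
        self.block[j] = self.block[i]        # displace the primary occupant into
        self.rbit[j] = True                  # the alternate set, marked rehashed
        self.block[i], self.rbit[i] = addr, False
        return 'miss'

    def _swap(self, i, j):
        self.block[i], self.block[j] = self.block[j], self.block[i]
        self.rbit[i], self.rbit[j] = False, True

cache = ColumnAssociativeCache(num_sets=8)
for a in (3, 11, 3, 11):                     # 3 and 11 conflict in set 3; a plain
    print(a, '->', cache.access(a))          # direct-mapped cache would thrash
```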
Performance Of Cached Dram Organizations In Vector Supercomputers
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1145/165123.165170
W. Hsu, James E. Smith
Abstract: DRAMs containing cache memory are studied in the context of vector supercomputers. In particular, we consider systems where processors have no internal data caches and memory reference streams are generated by vector instructions. For this application, we expect that cached DRAMs can provide high bandwidth at relatively low cost.
We study both DRAMs with a single, long cache line and with smaller, multiple cache lines. Memory interleaving schemes that increase data locality are proposed and studied. The interleaving schemes are also shown to lead to non-uniform bank accesses, i.e., hot banks. This suggests there is an important optimization problem involving methods that increase locality to improve performance, but not so much that hot banks diminish performance. We show that for uniprocessor systems, both types of cached DRAMs work well with the proposed interleave methods. For multiprogrammed multiprocessors, the multiple cache line DRAMs work better.
Citations: 56
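The locality/hot-bank tension the abstract mentions is easy to demonstrate with generic interleaving functions; the scheme names and parameters below are illustrative, not the paper's exact organizations:

```python
# Illustrative sketch: word interleaving spreads consecutive words across
# banks, while block interleaving keeps runs of words in one bank so they can
# hit that DRAM's on-chip cache line. Note what a stride-8 vector stream does
# to each mapping when there are 8 banks.
from collections import Counter

def bank_word(addr, banks):                  # word-interleaved mapping
    return addr % banks

def bank_block(addr, banks, block=16):       # block-interleaved mapping
    return (addr // block) % banks

def histogram(addrs, mapper, banks=8):
    return Counter(mapper(a, banks) for a in addrs)

stream = range(0, 1024, 8)                   # stride-8 vector access stream
print("word  interleave:", dict(histogram(stream, bank_word)))   # one hot bank
print("block interleave:", dict(histogram(stream, bank_block)))  # spread evenly
```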
Multiple Threads In Cyclic Register Windows
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1109/ISCA.1993.698552
Yasuo Hidaka, H. Koike, Hidehiko Tanaka
{"title":"Multiple Threads In Cyclic Register Windows","authors":"Yasuo Hidaka, H. Koike, Hidehiko Tanaka","doi":"10.1109/ISCA.1993.698552","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698552","url":null,"abstract":"Multi-threading is often used to compile logic and functional languages, and implement parallel C libraries. Fine-grain multi-threading requires rapid context switching, which can be slow on architectures with register windows. In past, researchers have either proposed new hardware support for dynamic allocation of windows to threads, or have sacrificed fast procedure calls by fixed allocation of windows to threads. In this paper, a novel window management algorithm, which retains both fast procedure calls and fast context switching, is proposed. The algorithm has been implemented on the SPARC processor by modifying window trap handlers. A quantitative evaluation of the scheme using a multi-threaded application with various concurrency and granularity levels is given. The evaluation shows that the proposed scheme always does better than the other schemes. Some implications for multi-threaded architectures are also presented.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129969784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
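The fixed-versus-dynamic allocation tradeoff can be caricatured in a few lines. This is my own toy model, not the paper's SPARC trap-handler algorithm: threads perform random call/return walks over one cyclic file of 8 windows, one thread runs deeper call chains than the others, and an overflow trap is charged whenever a call cannot obtain a window:

```python
# Toy model (my own simplification): with a fixed partition each thread may
# hold only windows // threads register windows, so a deep call chain traps
# early even while other threads' windows sit idle; with dynamic sharing a
# call traps only when the whole cyclic window file is actually full.
import random

def overflow_traps(windows=8, threads=4, steps=10_000, shared=True):
    rng = random.Random(1)
    limit = windows if shared else windows // threads
    call_p = [0.7, 0.4, 0.4, 0.4]            # thread 0 runs deep call chains
    held = [0] * threads                     # windows currently held per thread
    traps = 0
    for _ in range(steps):
        t = rng.randrange(threads)           # a thread runs: call or return
        if rng.random() < call_p[t] or held[t] == 0:     # procedure call
            if held[t] == limit or (shared and sum(held) == windows):
                traps += 1                   # overflow trap: a window must spill
            else:
                held[t] += 1                 # call takes a fresh window
        else:
            held[t] -= 1                     # procedure return frees a window
    return traps

for shared in (False, True):
    print("shared" if shared else "fixed ", overflow_traps(shared=shared))
```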
Limitations Of Cache Prefetching On A Bus-based Multiprocessor
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1145/165123.165163
D. Tullsen, S. Eggers
Abstract: Compiler-directed cache prefetching has the potential to hide much of the high memory latency seen by current and future high-performance processors. However, prefetching is not without costs, particularly on a multiprocessor. Prefetching can negatively affect bus utilization, overall cache miss rates, memory latencies and data sharing. We simulated the effects of a particular compiler-directed prefetching algorithm, running on a bus-based multiprocessor. We showed that, despite a high memory latency, this architecture is not very well-suited for prefetching. For several variations on the architecture, speedups for five parallel programs were no greater than 39%, and degradations were as high as 7%, when prefetching was added to the workload. We examined the sources of cache misses, in light of several different prefetching strategies, and pinpointed the causes of the performance changes. Invalidation misses pose a particular problem for current compiler-directed prefetchers. We applied two techniques that reduced their impact: a special prefetching heuristic tailored to write-shared data, and restructuring shared data to reduce false sharing, thus allowing traditional prefetching algorithms to work well.
Citations: 77
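Why prefetching can backfire on write-shared data is visible even in a toy model. The following sketch, my own and not the paper's simulation infrastructure, issues prefetches a fixed distance ahead and lets a hypothetical remote processor invalidate every fourth block before it is consumed:

```python
# Toy illustration: software prefetches issued `distance` iterations ahead of
# each demand reference. Without sharing, almost every demand reference hits.
# If a remote write invalidates a block between its prefetch and its use, the
# prefetched copy is useless and the prefetch traffic is pure bus overhead.
def run(invalidated_blocks, distance=8, n=64):
    cache, misses, bus_transfers = set(), 0, 0
    for i in range(n):
        pf = i + distance
        if pf < n and pf not in cache:       # compiler-inserted prefetch
            cache.add(pf)
            bus_transfers += 1
        if i in invalidated_blocks:          # remote write invalidates our copy
            cache.discard(i)
        if i not in cache:                   # demand reference misses
            misses += 1
            bus_transfers += 1
            cache.add(i)
    return misses, bus_transfers

print("private data:      misses=%2d bus=%2d" % run(set()))
print("write-shared data: misses=%2d bus=%2d" % run(set(range(0, 64, 4))))
```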
Cache Write Policies And Performance
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1109/ISCA.1993.698560
N. Jouppi
{"title":"Cache Write Policies And Performance","authors":"N. Jouppi","doi":"10.1109/ISCA.1993.698560","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698560","url":null,"abstract":"This paper investigates issues involving writes and caches. First, tradeoffs on writes that miss in the cache are investigated. In particular, whether the missed cache block is fetched on a write miss, whether the missed cache block is allocated in the cache, and whether the cache line is written before hit or miss is known are considered. Depending on the combination of these polices chosen, the entire cache miss rate can vary by a factor of two on some applications. The combination of no-fetch-on-write and write-allocate can provide better performance than cache line allocation instructions. Second, tradeoffs between write-through and write-back caching when writes hit in a cache are considered. A mixture of these two alternatives, called write caching is proposed. Write caching places a small fully-associative cache behind a write-through cache. A write cache can eliminate almost as much write traffic as a write-back cache.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122418175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 254
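The write-cache idea, a small fully-associative buffer behind a write-through cache, can be sketched as follows. Block size, capacity, and the LRU policy here are my assumptions, chosen only to show how coalescing absorbs write traffic:

```python
# Sketch of a write cache (my own toy: 4-word blocks, LRU, small capacity).
# Repeated writes to a resident block are coalesced, so far fewer writes
# reach memory than with pure write-through.
from collections import OrderedDict

def memory_writes(writes, write_cache_blocks=0, block_words=4):
    wc, traffic = OrderedDict(), 0           # resident dirty blocks, LRU order
    for addr in writes:
        blk = addr // block_words
        if write_cache_blocks == 0:
            traffic += 1                     # write-through: every write goes out
        elif blk in wc:
            wc.move_to_end(blk)              # coalesced: absorbed by write cache
        else:
            wc[blk] = True
            if len(wc) > write_cache_blocks:
                wc.popitem(last=False)       # evict LRU dirty block to memory
                traffic += 1
    return traffic + len(wc)                 # final flush of resident blocks

writes = [0, 1, 2, 3, 0, 1, 2, 3, 16, 17, 0, 1]
print("write-through:           ", memory_writes(writes))
print("with 4-block write cache:", memory_writes(writes, write_cache_blocks=4))
```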
Odd Memory Systems May Be Quite Interesting
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1109/ISCA.1993.698574
André Seznec, J. Lenfant
{"title":"Odd Memory Systems May Be Quite Interesting","authors":"André Seznec, J. Lenfant","doi":"10.1109/ISCA.1993.698574","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698574","url":null,"abstract":"Using a prime number of N of memory banks on a vector processor allows a conflict-free access for any slice of N consecutive elements of a vector stored with a stride not multiple of N.\u0000To reject the use of a prime (or odd) number N of memory banks, it is generally advanced that address computation for such a memory system would require systematic Euclidean Division by the number N. We first show that the well known Chinese Remainder Theorem allows to define a very simple mapping of data onto the memory banks for which address computation does not require any Euclidean Division.\u0000Massively parallel SIMD computers may have several thousands of processors. When the memory on such a machine is globally shared, routing vectors from memory to the processors is a major difficulty; the control for the interconnection network cannot be generally computed at execution time. When the number of memory banks and processors is a product of prime numbers, the family of permutations needed for routing vectors for memory to the processors through the interconnection network have very specific properties. The Chinese Remainder Network presented in the paper is able to execute all these permutations in a single path and may be self-routed.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131764388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
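The paper's starting point, that the Chinese Remainder Theorem yields a division-free mapping onto a prime number of banks, checks out in a few lines. Variable names below are mine, and M is taken as a power of two so the local offset is just the low-order address bits:

```python
# Sketch of the Chinese-Remainder mapping: with N banks and M words per bank,
# gcd(N, M) = 1, address a is stored in bank a mod N at local offset a mod M.
# The CRT guarantees this pair is unique over all N*M addresses, so the
# Euclidean quotient a // N is never needed to compute a word's location.
from math import gcd

N, M = 17, 16                                # prime bank count, power-of-2 depth
assert gcd(N, M) == 1

def crt_map(a):
    return a % N, a % M                      # (bank, local offset)

# Uniqueness: every address gets its own (bank, offset) cell.
assert len({crt_map(a) for a in range(N * M)}) == N * M

# Conflict-free slices: N consecutive elements of a vector with stride s
# (s not a multiple of N) fall in N distinct banks.
for s in (1, 3, 5, 16):
    assert len({crt_map(7 + i * s)[0] for i in range(N)}) == N
print("stride slices are conflict-free across", N, "banks")
```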
The Cedar System And An Initial Performance Study
Proceedings of the 20th Annual International Symposium on Computer Architecture | Pub Date: 1993-05-01 | DOI: 10.1145/285930.286005
D. Kuck, E. Davidson, D. Lawrie, A. Sameh, Chuanqi Zhu
Abstract: In this paper, we give an overview of the Cedar multiprocessor and present recent performance results. These include the performance of some computational kernels and the Perfect Benchmarks®. We also present a methodology for judging parallel system performance and apply this methodology to Cedar, Cray YMP-8, and Thinking Machines CM-5.
Citations: 49