{"title":"Data prefetching in multiprocessor vector cache memories","authors":"John W. C. Fu, J. Patel","doi":"10.1145/115952.115959","DOIUrl":"https://doi.org/10.1145/115952.115959","url":null,"abstract":"This paper reports the cache performance of a set of vectorized numerical program from the Perfect Club benchmarks. Using a low cost trace driven simularion technique we show how a non-prefetching vector cache can result in unpredictable performance and how rhis unpredictability makes it difficult to find a good block size. We describe two simple prefetch schemes to reduce the influence of long stride vector accesses and misses due IO block invalidations in mulliprocessor vector caches. These two schemes are shown to have better performance than a non-prefetching cache.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128820537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Classification and performance evaluation of instruction buffering techniques","authors":"L. John, P. T. Hulina, L. D. Coraor, Dhamir N. Mannai","doi":"10.1145/115952.115968","DOIUrl":"https://doi.org/10.1145/115952.115968","url":null,"abstract":"The speed disparity between processor and memory subsystenis has been bridged in many existing large- scale scientific computers and microproc.essors with the help of instruction burners or instruction caches. In this pa.per we 'classify t.hese bulrers into traditional in- struction buffers, conventional inst,ruct.ion caches and prefetch queues, det.ail their prominent. features, and evaluat.e.the percormanre of buffers in srveral cxisting -systems, using trace driven siinulat,ion. We compare ihse srhemes wit,ti a recentl) pro1iose\"d queue-based in- sihction cache nieniory. An implementation indrprn- dent. perforrnaiice metric is proposed for thi. various or- ganizations and used for the evaluat.ions. M'r analyze the simulation results and discuss the eIfec.1 of various paralneters RIIC~ as prefetch threshold, bus width and buffer size on performance.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116992469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An empirical study of the CRAY Y-MP processor using the PERFECT club benchmarks","authors":"S. Vajapeyam, G. Sohi, W. Hsu","doi":"10.1145/115952.115970","DOIUrl":"https://doi.org/10.1145/115952.115970","url":null,"abstract":"Characterization of machines, by studying pro~am usage of their architectural and organizational features, IS art essential ~art of the desi~n recess. ln this aper we re ort Y EL an empimcal study of a smg e processor of t e CRAY Y- P, using as benchmarks long-running scientific applications from the PERFECT Club benchmark set. Since the compiler plays a major mle in determining machine utilization and program execution speed, we compile our benchmarks usin the state-of-the-art Cray Research production FORTRA” compiler. We investigate instruction set usage, operation execution counts, sizes of basic blocks in the prorams, and instruction issue rate. We observe, among other 3“ mgs, that the vectorized fraction of the dynamic rogram % operation count ranges from 4% to %% for our bent marks, Instructions that move values between the scalar registers and corresponding backup registers form a si nificant fraction of the dynamic instruction count. Basic %locks which are more than a hundred instructions in size are significant in number; both small and large basic blocks are important from the point of view of pro ram performance. The E","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"153 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116893766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Instruction level profiling and evaluation of the IBM RS/6000","authors":"Chriss Stephens, B. Cogswell, J. Heinlein, Gregory Palmer, John Paul Shen","doi":"10.1145/115952.115971","DOIUrl":"https://doi.org/10.1145/115952.115971","url":null,"abstract":"This paper reports preliminary results from using goblin, a new instruction level profiling system, to evaluate the IBM RISC System/6000 architecture. The evaluation presented is based on the SPEC benchmark suite. Each SPEC program (except gcc) is processed by goblin to produce an instrumented version. During execution of the instrumented program, profiling routines are invoked which trace the execution of the program. These routines also collect statistics on dynamic instruction mix, branching behavior, and resource utilization: Based on these statistics, the actual performance and the architectural efficiency of the RS/SOOO are evaluated. In order to provide a context for this evaluation, a comparison to the DECStation 3100 is also presented. The entire profiling and evaluation experiment on nine of the ten SPEC programs involves tracing and analyzing over 32 billion instructions on the RS/6000. The evaluation indicates that for the SPEC benchmark suite the architecture of the RS/6000 is well balanced and exhibits impressive performance, especially on the floating-point intensive applications.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122281450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of memory system extensions","authors":"Kai Li, K. Petersen","doi":"10.1109/ISCA.1991.1021602","DOIUrl":"https://doi.org/10.1109/ISCA.1991.1021602","url":null,"abstract":"A traditional memory system for a uniprocessor consists of one or two levels of cache, a main memory and a backing store. One can extend such a memory sys tem by adding inexpensive but slower memories into the memory hierarchy. This paper uses an experimental approach to evaluate two methods of extending a memory system: direct and caching. The direct method adds the slower memory into the memory hierarchy by putting it at the same level as the main memory, allowing the CPU to access the slower memories directly; whereas the caching method puts the slower memory between the main memory and the backing store, using the main memory as a cache for the slower memory. We have implemented both approaches and our experiments indicate that applications with very large data structures can benefit significantly using an extended memory system, and that the direct approach outperforms the caching approach in memory-bound applications.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129553646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Strategies for achieving improved processor throughput","authors":"M. Farrens, A. Pleszkun","doi":"10.1145/115952.115988","DOIUrl":"https://doi.org/10.1145/115952.115988","url":null,"abstract":"Deeply pipelined processors have relatively low issue rates due to dependencies between instructions. In this paper we examine the possibility of interleaving a second stream of instructions into the pipeline, which would issue instructions during the cycles the first stream was unable to. Such an interleaving has the potential to significantly increase the throughput of a processor without seriously imparing the execution of either process. We propose a dynamic interleaving of at most 2 instructions streams, which share the the pipelined functional units of a machine. To support the interleaving of 2 instruction streams a number of interleaving policies are. described and discused. Finally, the amount of improvement in processor throughput is evaluated by simulating the interleaving policies for several machine varianv;.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123527217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reducing memory contention in shared memory multiprocessors","authors":"D. Harper","doi":"10.1145/115952.115960","DOIUrl":"https://doi.org/10.1145/115952.115960","url":null,"abstract":"Reducing Memory Contention in Multiprocessors~ D. T. Harper III Shared Memory Department of Electrical Engineering The University of Texas at Dallas P.O. BOX 830688, NIP 33 Richardson, Texas 75083-0688 USA (214) 690-2893","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126131110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IXM2: a parallel associative processor","authors":"T. Higuchi, T. Furuya, Ken'ichi Handa, Naoto Takahashi, H. Nishiyama, A. Kokubu","doi":"10.1145/115952.115956","DOIUrl":"https://doi.org/10.1145/115952.115956","url":null,"abstract":"This paper describes a parallel associative processor, lXM2, developed mainly for semantic network processing. IXM2 consists of 64 associative processors and 9 network processors, having a total of 256K words of associative memory. The large associative memory enables 65,536 semantic network nodes to be processed in parallel and reduces the order of algorithmic complexity to O( 1) in,basic semantic net operations. It is shown that IXM2 has computing power comparable to that of a Connection Machine. Programming for lXM2 is performed with the knowledge representation language IXL, a superset of Prolog, so that IXM2 can be utilized as a back-end to AI workstations.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130742590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Virtualizing the VAX architecture","authors":"J. S. Hall, Paul T. Robinson","doi":"10.1145/115952.115990","DOIUrl":"https://doi.org/10.1145/115952.115990","url":null,"abstract":"This paper describes modifications to the VAX architecture to support virtual machines. The VAX architecture contains several instructions that are sensitive but not privileged. It is also the first architecture with more than two protection rings to support virtual machines. A technique for mapping four virtual rings onto three physical rings, employing both software and microcode, is described. Differences between the modified and standard VAX architectures are presented. along with a description of the virtual VAX computer.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126860259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multithreading: a revisionist view of dataflow architectures","authors":"G. Papadopoulos, K. R. Traub","doi":"10.1145/115953.115986","DOIUrl":"https://doi.org/10.1145/115953.115986","url":null,"abstract":"Although they are powerful intermediate representations for compilers, pure dataflow graphs are incomplete, and perhaps even undesirable, machine languages. They are incomplete because it is hard to encode critical sections and imperative operations which are essential for the efficient execution of operating system functions, such as resource management. They may be undesirable because they imply a uniform dynamic scheduling policy for all instructions, preventing a compiler from expressing a static schedule which could result in greater run time efficiency, both by reducing redundant operand synchronization, and by using high speed registers to communicate state between instructions. In this paper, we develop a new machine-level programming model which builds upon two previous improvements to the dataflow execution model: sequential scheduling of instructions, and multiported registers for expression temporaries. Surprisingly, these improvements have required almost no architectural changes to explicit token store (ETS) dataflow hardware, only a shift in mindset when reasoning about how that hardware works. Rather than viewing computational progress as the consumption of tokens and the firing of enabled instructions, we instead reason about the evolution of multiple, interacting sequential threads, where forking and joining are extremely efficient. Because this new paradigm has proven so valuable in coding resource management operations and in improving code efficiency, it is now the cornerstone of the Monsoon instruction set architecture and macro assembly language. In retrospect, this suggests that there is a continuum of multithreaded architectures, with pure ETS dataflow and single threaded von Neumann at the extrema. We use this new perspective to better understand the relative strengths and weaknesses of the Monsoon implement ation.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126589183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}