{"title":"Dynamic processor allocation in hypercube computers","authors":"Po-Jen Chuang, N. Tzeng","doi":"10.1145/325164.325110","DOIUrl":"https://doi.org/10.1145/325164.325110","url":null,"abstract":"Recognizing various subcubes in a hypercube computer fully and efficiently is nontrivial because of the specific structure of the hypercube. The authors propose a method that has much less complexity than the multiple-GC strategy in generating the search space, while achieving complete subcube recognition. This method is referred to as a dynamic processor allocation scheme because the search space generated is dependent upon the dimension of the requested subcube dynamically, instead of being predetermined and fixed. The basic idea of this strategy lies in collapsing the binary tree representations of a hypercube successively so that the nodes which form a subcube but are distant would be brought close to each other for recognition. The strategy can be implemented efficiently by using shuffle operations on the leaf node addresses of binary tree representations. Extensive simulation runs are carried out to collect experimental performance measures of interest of different allocation strategies. It is shown from analytic and experimental results that this strategy compares favorably in many situations with any other known allocation scheme capable of achieving complete subcube recognition.","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116311718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The TLB slice-a low-cost high-speed address translation mechanism","authors":"G. Taylor, Peter Davies, M. Farmwald","doi":"10.1145/325164.325161","DOIUrl":"https://doi.org/10.1145/325164.325161","url":null,"abstract":"The MIPS R6000 microprocessor relies on a new type of translation lookaside buffer, called a TLB slice, which is less than one-tenth the size of a conventional TLB and as fast as one multiplexer delay, yet has a high enough hit rate to be practical. The fast translation makes it possible to use a physical cache without adding a translation stage to the processor's pipeline. The small size makes it possible to include address translation on-chip, even in a technology with a limited number of devices. The key idea behind the TLB slice is to have both a virtual tag and a physical tag on a physically indexed cache.","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130988464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An investigation of static versus dynamic scheduling","authors":"C. Love, H. Jordan","doi":"10.1145/325164.325140","DOIUrl":"https://doi.org/10.1145/325164.325140","url":null,"abstract":"Two techniques for instruction scheduling, dynamic and static scheduling, are investigated. A decoupled access execute architecture consists of an execution unit and a memory unit with separate program counters and separate instruction memories. The very long instruction word (VLIW) architecture has only one program counter and relies on the compiler to perform static scheduling of multiple units. To idealize the comparison, the VLIW architecture considered had only two units. The instruction sets and execution times for the two architectures were made as nearly the same as possible. The execution times were compared and analyzed to compare the capabilities of static and dynamic instruction scheduling. Both regular and irregular programs were constructed and optimized by hand for each architecture.","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122647977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"APRIL: a processor architecture for multiprocessing","authors":"A. Agarwal, B. Lim, D. Kranz, J. Kubiatowicz","doi":"10.1145/325164.325119","DOIUrl":"https://doi.org/10.1145/325164.325119","url":null,"abstract":"The architecture of a rapid-context-switching processor called APRIL, with support for fine-grain threads and synchronization, is described. APRIL achieves high single-thread performance and supports virtual dynamic threads. A commercial reduced-instruction-set-computer-(RISC-) based implementation of APRIL and a run-time software system that can switch contexts in about 10 cycles are described. Measurements taken for several parallel applications on an APRIL simulator show that the overhead for supporting parallel tasks based on futures is reduced by a factor of 2 over a corresponding implementation on the Encore Multimax. The scalability of a multiprocessor based on APRIL is explored using a performance model. The authors show that the SPARC-based implementation of APRIL can achieve close to 80% processor utilization with as few as three resident threads per processor in a large-scale cache-based machine with an average base network latency of 55 cycles.","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125360046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive software cache management for distributed shared memory architectures","authors":"J. Bennett, J. Carter, W. Zwaenepoel","doi":"10.1145/325164.325124","DOIUrl":"https://doi.org/10.1145/325164.325124","url":null,"abstract":"An adaptive cache coherence mechanism exploits semantic information about the expected or observed access behavior of particular data objects. The authors contend that, in distributed shared-memory systems, adaptive cache coherence mechanisms will outperform static cache coherence mechanisms. They have examined the sharing and synchronization behavior of a variety of shared-memory parallel programs. It is found that the access patterns of a large percentage of shared data objects fall into a small number of categories for which efficient software coherence mechanisms exist. In addition, the authors have performed a simulation study that provides two examples of how an adaptive caching mechanism can take advantage of semantic information.","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122718618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Maximizing performance in a striped disk array","authors":"Peter M. Chen, D. Patterson","doi":"10.1145/325164.325158","DOIUrl":"https://doi.org/10.1145/325164.325158","url":null,"abstract":"Improvements in disk speeds have not kept up with improvements in processor and memory speeds. One way to correct the resulting speed mismatch is to stripe data across many disks. The authors address how to stripe data to get maximum performance from the disks. Specifically, they examine how to choose the striping unit, that is, the amount of logically contiguous data on each disk. Rules for determining the best striping unit for a given range of workloads are synthesized. It is shown how the choice of striping unit depends on only two parameters: (1) the number of outstanding requests in the disk system at any given time, and (2) the average positioning time*data transfer rate of the disks. The authors derive an equation for the optimal striping unit as a function of these two parameters; they also show how to choose the striping unit without prior knowledge about the workload.","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125819394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The impact of synchronization and granularity on parallel systems","authors":"D. Chen, H. Su, P. Yew","doi":"10.1145/325164.325150","DOIUrl":"https://doi.org/10.1145/325164.325150","url":null,"abstract":"A study is made of the impact of synchronization and granularity on the performance of parallel systems using an execution-driven simulation technique. It is found that, even though there can be a lot of parallelism at the fine-grain level, synchronization and scheduling strategies determine the ultimate performance of the system. Loop-iteration-level parallelism seems to be a more appropriate level when those factors are considered. Barrier synchronization and data synchronization at the loop-iteration level are also studied. It is found that both schemes are needed for a better performance.","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124557859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The performance impact of block sizes and fetch strategies","authors":"S. Przybylski","doi":"10.1145/325164.325135","DOIUrl":"https://doi.org/10.1145/325164.325135","url":null,"abstract":"The interactions between a cache's block size, fetch size, and fetch policy from the perspective of maximizing system-level performance are explored. It has been previously noted that, given a simple fetch strategy, the performance optimal block size is almost always four or eight words. If there is even a small cycle time penalty associated with either longer blocks or fetches, then the performance optimal size is noticeably reduced. In split cache organizations, where the fetch and block sizes of instruction and data caches are all independent design variables, instruction cache block size and fetch size should be the same. For the workload and write-back write policy used in this trace-driven simulation study, the instruction cache block size should be about a factor of 2 greater than the data cache fetch size, which in turn should be equal to or double the data cache block size. The simplest fetch strategy of fetching only on a miss and stalling the CPU until the fetch is complete works well. Complicated fetch strategies do not produce the performance improvements indicated by the accompanying reductions in miss ratios because of limited memory resources and a strong temporal clustering of cache misses. For the environments simulated, the most effective fetch strategy improved performance by between 1.7% and 4.5% over the simplest strategy described above.","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122916996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trace-driven simulations for a two-level cache design of open bus systems","authors":"Hakon O. Bugge, E. Kristiansen, B. O. Bakka","doi":"10.1145/325164.325151","DOIUrl":"https://doi.org/10.1145/325164.325151","url":null,"abstract":"Two-level cache hierarchies will be a design issue in future high-performance CPUs. An evaluation is made of various metrics for data cache designs. A discussion is presented of one- and two-level cache hierarchies. The target is a new 100+ MIPS CPU, but the methods are applicable to any cache design. The basis of this work is a new trace-driven, multiprocess cache simulator. The simulator incorporates a simple priority-based scheduler which controls the execution of the processes. The scheduler blocks a process when a system call is executed. A workload consists of a total of 60 processes, distributed among seven unique programs with about nine instances each. Two open bus systems, Futurebus+ and Scalable Coherent Interface (SCI), that support a coherent memory model, are discussed as the interconnect system for main memory.","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122148503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting beyond static scheduling in a superscalar processor","authors":"Michael D. Smith, M. Lam, M. Horowitz","doi":"10.1145/325164.325160","DOIUrl":"https://doi.org/10.1145/325164.325160","url":null,"abstract":"A superscalar processor that combines the best qualities of static and dynamic instruction scheduling to increase the performance of nonnumerical applications is described. The architecture performs all instruction scheduling statically to take advantage of the compiler's ability to schedule operations across many basic blocks efficiently. Since the conditional branches in nonnumerical code are highly data dependent, the architecture introduces the concept of boosted instructions, that is, instructions that are committed conditionally upon the result of later branch instructions. Boosting effectively removes the dependences caused by branches and makes the scheduling of side-effect instructions as simple as it is for instructions that are side-effect free. For efficiency, boosting is supported in the hardware by shadow structures that temporarily hold the side effects of boosted instructions until the conditional branches that the boosted instructions depend upon are executed.","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127724719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}