{"title":"Global context-based value prediction","authors":"T. Nakra, Rajiv Gupta, M. Soffa","doi":"10.1109/HPCA.1999.744311","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744311","url":null,"abstract":"Various methods for value prediction have been proposed to overcome the limits imposed by data dependencies within programs. Using a value prediction scheme, an instruction's computed value is predicted during the fetch stage and forwarded to all dependent instructions to speed up execution. Value prediction schemes have been based on a local context, predicting values using the values generated by the same instruction. This paper presents techniques that predict the values of an instruction based on a global context, where the behavior of other instructions is used in prediction. The global context includes the path along which an instruction is executed and the values computed by other previously completed instructions. We present techniques that augment conventional last-value and stride predictors with global context information. Experiments performed using path-based techniques with realistic table sizes showed an increase in prediction accuracy of 6.4-8.4% over current prediction schemes. Prediction using values computed by other instructions yielded a further improvement of 7.2% in prediction accuracy over the best path-based predictor.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128495929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lightweight hardware distributed shared memory supported by generalized combining","authors":"Kiyofumi Tanaka, T. Matsumoto, K. Hiraki","doi":"10.1109/HPCA.1999.744339","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744339","url":null,"abstract":"On a large-scale parallel computer system, shared memory provides a general and convenient programming environment. The paper describes a lightweight method for constructing an efficient shared memory system supported by hierarchical coherence management and generalized combining, two techniques that cooperate with each other. We eliminate the following heavyweight and high-cost factors: a large amount of directory memory proportional to the number of processors, a separate memory component for the directory, tag/state information, and a protocol processor. In our method, the amount of memory required for the directory is proportional to the logarithm of the number of processors. This implies that a single word for each memory block is sufficient to cover a massively parallel system and that the access costs of the directory are small. Moreover, our combining technique, generalized combining, does not rely on the accidental events that existing combining networks require, namely, messages meeting each other at a switching node. A switching node can combine succeeding messages with a preceding one even after the preceding message leaves the node. This can increase the rate of successful combining. We have developed a prototype parallel computer, OCHANOMIZ-5, which implements this lightweight distributed shared memory and generalized combining with simple hardware. The results of evaluating the prototype's performance using several programs show that our methodology provides the advantages of parallelization.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127003689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving CC-NUMA performance using Instruction-based Prediction","authors":"S. Kaxiras, J. Goodman","doi":"10.1109/HPCA.1999.744359","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744359","url":null,"abstract":"We propose instruction-based prediction as a means to optimize directory-based cache-coherent NUMA shared memory. Instruction-based prediction observes the behavior of load and store instructions in relation to coherence events and predicts their future behavior. Although this technique is well established in the uniprocessor world, it has not been widely applied to optimizing transparent shared memory. Typically, in this environment, prediction is based on data-block access history (address-based prediction) in the form of adaptive cache coherence protocols. The advantage of instruction-based prediction is that it requires few hardware resources, in the form of small prediction structures per node, to match (or exceed) the performance of address-based prediction. To show the potential of instruction-based prediction we propose and evaluate three optimizations: (i) a migratory sharing optimization, (ii) a wide sharing optimization, and (iii) a producer-consumer optimization based on speculative execution. With execution-driven simulation and a set of nine benchmarks we show that (i) for the first two optimizations, instruction-based prediction, using few predictor entries per node, outpaces address-based schemes, and (ii) for the producer-consumer optimization, which uses speculative execution, low misspeculation rates show promise for performance improvements.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130151498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LAPSES: a recipe for high performance adaptive router design","authors":"A. S. Vaidya, A. Sivasubramaniam, C. Das","doi":"10.1109/HPCA.1999.744375","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744375","url":null,"abstract":"Earlier research has shown that adaptive routing can help improve network performance. However, it has not received adequate attention in commercial routers, mainly due to the additional hardware complexity and the perceived cost and performance degradation that may result from this complexity. These concerns can be mitigated by a cost-effective router design that supports adaptive routing. This paper proposes a three-step recipe for cost-effective, high-performance pipelined adaptive router design: Look-Ahead routing, intelligent Path Selection, and an Economical Storage implementation, together called the LAPSES approach. The first step, look-ahead routing, eliminates a pipeline stage in the router by performing table lookup and arbitration concurrently. Next, three new traffic-sensitive path selection heuristics (LRU, LFU and MAX-CREDIT) are proposed to select one of the available alternate paths. Finally, two techniques for reducing the routing table size of the adaptive router are presented, called meta-table routing and economical storage. The proposed economical storage needs a routing table with only 9 and 27 entries for two- and three-dimensional meshes, respectively. All these design ideas are evaluated on a 16×16 mesh network via simulation. A fully adaptive algorithm and various traffic patterns are used to examine the performance benefits. Performance results show that the look-ahead design as well as the path selection heuristics boost network performance, while the economical storage approach turns out to be an ideal choice compared to the full-table and meta-table options. We believe the router resulting from these three design enhancements can make adaptive routing a viable choice for interconnects.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122183840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware for speculative parallelization of partially-parallel loops in DSM multiprocessors","authors":"Ye Zhang, Lawrence Rauchwerger, J. Torrellas","doi":"10.1109/HPCA.1999.744351","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744351","url":null,"abstract":"Recently, we introduced a novel framework for speculative parallelization in hardware (Y. Zhang et al., 1998). The scheme builds on a software-based run-time parallelization scheme that we proposed earlier (L. Rauchwerger and D. Padua, 1995). The idea is to execute the code (loops) speculatively in parallel. As parallel execution proceeds, extra hardware added to the directory-based cache coherence of the DSM machine detects whether a dependence violation occurs. If such a violation occurs, execution is interrupted, the state is rolled back in software to the most recent safe state, and the code is re-executed serially from that point. The safe state is typically established at the beginning of the loop. Such a scheme is somewhat related to speculative parallelization inside a multiprocessor chip, which also relies on extending the cache coherence protocol to detect dependence violations. Our scheme, however, is targeted to large-scale DSM parallelism. In addition, it does not have some of the limitations of the proposed chip-multiprocessor schemes, which include the need to bound the size of the speculative state to fit in a buffer or L1 cache, and a strict in-order task commit policy that may result in load imbalance among processors. Unfortunately, our scheme has higher recovery costs if a dependence violation is detected, because execution has to backtrack to a safe state that is usually the beginning of the loop. Therefore, the aim of the paper is to extend our previous hardware scheme to effectively handle codes (loops) with a modest number of cross-iteration dependences.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132068495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A performance comparison of homeless and home-based lazy release consistency protocols in software shared memory","authors":"A. Cox, E. D. Lara, Charlie Hu, W. Zwaenepoel","doi":"10.1109/HPCA.1999.744380","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744380","url":null,"abstract":"In this paper, we compare the performance of two multiple-writer protocols based on lazy release consistency. In particular, we compare the performance of Princeton's home-based protocol and TreadMarks' protocol on a 32-processor platform. We found that the performance difference between the two protocols was less than 4% for four out of seven applications. For the three applications on which performance differed by more than 4%, the TreadMarks protocol performed better for two because most of their data were migratory, while the home-based protocol performed better for one. For this one application, the explicit control over the location of data provided by the home-based protocol resulted in a better distribution of communication load across the processors. These results differ from those of a previous comparison of the two protocols. We attribute this difference to (1) a different ratio of memory to network bandwidth on our platform and (2) lazy diffing and request overlapping, two optimizations used by TreadMarks that were not used in the previous study.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134026324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Supporting fine-grained synchronization on a simultaneous multithreading processor","authors":"D. Tullsen, J. Lo, S. Eggers, H. Levy","doi":"10.1109/HPCA.1999.744326","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744326","url":null,"abstract":"This paper proposes and evaluates new synchronization schemes for a simultaneous multithreaded processor. We present a scalable mechanism that permits threads to cheaply synchronize within the processor, with blocked threads consuming no processor resources. We also introduce the concept of lock release prediction, which gains an additional improvement of 40%. Overall, we show that these improvements in synchronization cost enable parallelization of code that could not be effectively parallelized using traditional techniques.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133074837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The impact of link arbitration on switch performance","authors":"Marius Pirvu, L. Bhuyan, N. Ni","doi":"10.1109/HPCA.1999.744368","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744368","url":null,"abstract":"Switch design for interconnection networks plays an important role in the overall performance of multiprocessors and computer networks. In this paper we study the impact of one parameter in the switch design space, link arbitration. We demonstrate that link arbitration can be a determining factor in the performance of current networks. Moreover, we expect increased research focus on arbitration techniques in the future, as switch architectures evolve towards larger numbers of virtual channels and input ports. In the context of a state-of-the-art switch design we use both synthetic workload and execution-driven simulations to compare several arbitration policies. Furthermore, we devise a new arbitration method, Look-Ahead arbitration. Under heavy traffic conditions the Look-Ahead policy provides important improvements over traditional arbitration schemes without a significant increase in hardware complexity. Also, we propose a priority-based policy that is capable of reducing the execution time of parallel applications. Lastly, we enhance the arbitration policies with a supplemental mechanism, virtual channel reservation, intended to alleviate the hot-spot problem.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114725280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient all-to-all broadcast in all-port mesh and torus networks","authors":"Yuanyuan Yang, Jianchao Wang","doi":"10.1109/HPCA.1999.744382","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744382","url":null,"abstract":"All-to-all communication is one of the most dense communication patterns and occurs in many important applications in parallel computing. In this paper, we present a new all-to-all broadcast algorithm in all-port mesh and torus networks. Unlike existing all-to-all broadcast algorithms, the new algorithm takes advantage of overlapping of message switching time and transmission time, and achieves optimal transmission time for all-to-all broadcast. In addition, in most cases, the total communication delay is close to the lower bound of all-to-all broadcast within a small constant range. Finally, the algorithm is conceptually simple, and symmetrical for every message and every node so that it can be easily implemented in hardware and achieves the optimum in practice.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"157 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114385120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Switch cache: a framework for improving the remote memory access latency of CC-NUMA multiprocessors","authors":"R. Iyer, L. Bhuyan","doi":"10.1109/HPCA.1999.744357","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744357","url":null,"abstract":"Cache-coherent non-uniform memory access (CC-NUMA) multiprocessors continue to suffer from remote memory access latencies due to comparatively slow memory technology and data transfer latencies in the interconnection network. We propose a novel hardware caching technique, called switch cache. The main idea is to implement small, fast caches in the crossbar switches of the interconnect medium to capture and store shared data as they flow from the memory module to the requesting processor. This stored data serves as a cache for subsequent requests, greatly reducing the latency of remote memory accesses. The implementation of a cache in a crossbar switch needs to be efficient and robust, yet flexible to changes in the caching protocol. The design and implementation details of a CAche Embedded Switch ARchitecture, CAESAR, using wormhole routing with virtual channels, are presented. Using detailed execution-driven simulations, we find that the CAESAR switch cache is capable of improving the performance of CC-NUMA multiprocessors by reducing the number of reads served at distant remote memories by up to 45% and improving application execution time by as much as 20%. We conclude that switch caches provide a cost-effective solution for designing high-performance CC-NUMA multiprocessors.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127566584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}