ASPLOS VIPub Date : 1994-11-01DOI: 10.1145/195473.195557
S. Carr, K. McKinley, C. Tseng
{"title":"Compiler optimizations for improving data locality","authors":"S. Carr, K. McKinley, C. Tseng","doi":"10.1145/195473.195557","DOIUrl":"https://doi.org/10.1145/195473.195557","url":null,"abstract":"In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs.\u0000To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For over thirty complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve because benchmark programs typically have high hit rates even for small data caches; however, our optimizations significantly improved several programs.","PeriodicalId":140481,"journal":{"name":"ASPLOS VI","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131764068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS VIPub Date : 1994-11-01DOI: 10.1145/195473.195576
J. Laudon, Anoop Gupta, M. Horowitz
{"title":"Interleaving: a multithreading technique targeting multiprocessors and workstations","authors":"J. Laudon, Anoop Gupta, M. Horowitz","doi":"10.1145/195473.195576","DOIUrl":"https://doi.org/10.1145/195473.195576","url":null,"abstract":"There is an increasing trend to use commodity microprocessors as the compute engines in large-scale multiprocessors. However, given that the majority of the microprocessors are sold in the workstation market, not in the multiprocessor market, it is only natural that architectural features that benefit only multiprocessors are less likely to be adopted in commodity microprocessors. In this paper, we explore multiple-context processors, an architectural technique proposed to hide the large memory latency in multiprocessors. We show that while current multiple-context designs work reasonably well for multiprocessors, they are ineffective in hiding the much shorter uniprocessor latencies using the limited parallelism found in workstation environments. We propose an alternative design that combines the best features of two existing approaches, and present simulation results that show it yields better performance for both multiprogrammed workloads on a workstation and parallel applications on a multiprocessor. By addressing the needs of the workstation environment, our proposal makes multiple contexts more attractive for commodity microprocessors.","PeriodicalId":140481,"journal":{"name":"ASPLOS VI","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124070514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS VIPub Date : 1994-11-01DOI: 10.1145/195473.195481
C. Thekkath, H. Levy, Ed Lazowska
{"title":"Separating data and control transfer in distributed operating systems","authors":"C. Thekkath, H. Levy, Ed Lazowska","doi":"10.1145/195473.195481","DOIUrl":"https://doi.org/10.1145/195473.195481","url":null,"abstract":"Advances in processor architecture and technology have resulted in workstations in the 100+ MIPS range. As well, newer local-area networks such as ATM promise a ten- to hundred-fold increase in throughput, much reduced latency, greater scalability, and greatly increased reliability, when compared to current LANs such as Ethernet.\u0000We believe that these new network and processor technologies will permit tighter coupling of distributed systems at the hardware level, and that distributed systems software should be designed to benefit from that tighter coupling. In this paper, we propose an alternative way of structuring distributed systems that takes advantage of a communication model based on remote network access (reads and writes) to protected memory segments.\u0000A key feature of the new structure, directly supported by the communication model, is the separation of data transfer and control transfer. This is in contrast to the structure of traditional distributed systems, which are typically organized using message passing or remote procedure call (RPC). In RPC-style systems, data and control are inextricably linked—all RPCs must transfer both data and control, even if the control transfer is unnecessary.\u0000We have implemented our model on DECstation hardware connected by an ATM network. We demonstrate how separating data transfer and control transfer can eliminate unnecessary control transfers and facilitate tighter coupling of the client and server. This has the potential to increase performance and reduce server load, which supports scaling in the face of an increasing number of clients. For example, for a small set of file server operations, our analysis shows a 50% decrease in server load when we switched from a communications mechanism requiring both control transfer and data transfer, to an alternative structure based on pure data transfer.","PeriodicalId":140481,"journal":{"name":"ASPLOS VI","volume":"145 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114841739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS VIPub Date : 1994-11-01DOI: 10.1145/195473.195553
B. Calder, D. Grunwald
{"title":"Reducing branch costs via branch alignment","authors":"B. Calder, D. Grunwald","doi":"10.1145/195473.195553","DOIUrl":"https://doi.org/10.1145/195473.195553","url":null,"abstract":"Several researchers have proposed algorithms for basic block reordering. We call these branch alignment algorithms. The primary emphasis of these algorithms has been on improving instruction cache locality, and the few studies concerned with branch prediction reported small or minimal improvements. As wide-issue architectures become increasingly popular the importance of reducing branch costs will increase, and branch alignment is one mechanism which can effectively reduce these costs.\u0000In this paper, we propose an improved branch alignment algorithm that takes into consideration the architectural cost model and the branch prediction architecture when performing the basic block reordering. We show that branch alignment algorithms can improve a broad range of static and dynamic branch prediction architectures. We also show that a program performance can be improved by approximately 5% even when using recently proposed, highly accurate branch prediction architectures. The programs are compiled by any existing compiler and then transformed via binary transformations. When implementing these algorithms on a Alpha AXP 21604 up to a 16% reduction in total execution time is achieved.","PeriodicalId":140481,"journal":{"name":"ASPLOS VI","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123968844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS VIPub Date : 1994-11-01DOI: 10.1145/195473.195490
B. Lim
{"title":"Reactive synchronization algorithms for multiprocessors","authors":"B. Lim","doi":"10.1145/195473.195490","DOIUrl":"https://doi.org/10.1145/195473.195490","url":null,"abstract":"Synchronization algorithms that are efficient across a wide range of applications and operating conditions are hard to design because their performance depends on unpredictable run-time factors. The designer of a synchronization algorithm has a choice of protocols to use for implementing the synchronization operation. For example, candidate protocols for locks include test-and-set protocols and queueing protocols. Frequently, the best choice of protocols depends on the level of contention: previous research has shown that test-and-set protocols for locks outperform queueing protocols at low contention, while the opposite is true at high contention.\u0000This paper investigates reactive synchronization algorithms that dynamically choose protocols in response to the level of contention. We describe reactive algorithms for spin locks and fetch-and-op that choose among several shared-memory and message-passing protocols. Dynamically choosing protocols presents a challenge: a reactive algorithm needs to select and change protocols efficiently, and has to allow for the possibility that multiple processes may be executing different protocols at the same time. We describe the notion of consensus objects that the reactive algorithms use to preserve correctness in the face of dynamic protocol changes.\u0000Experimental measurements demonstrate that reactive algorithms perform close to the best static choice of protocols at all levels of contention. Furthermore, with mixed levels of contention, reactive algorithms outperform passive algorithms with fixed protocols, provided that contention levels do not change too frequently. Measurements of several parallel applications show that reactive algorithms result in modest performance gains for spin locks and significant gains for fetch-and-op.","PeriodicalId":140481,"journal":{"name":"ASPLOS VI","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114066536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS VIPub Date : 1994-11-01DOI: 10.1145/195473.195527
B. Bershad, Dennis Lee, T. Romer, J. B. Chen
{"title":"Avoiding conflict misses dynamically in large direct-mapped caches","authors":"B. Bershad, Dennis Lee, T. Romer, J. B. Chen","doi":"10.1145/195473.195527","DOIUrl":"https://doi.org/10.1145/195473.195527","url":null,"abstract":"This paper describes a method for improving the performance of a large direct-mapped cache by reducing the number of conflict misses. Our solution consists of two components: an inexpensive hardware device called a Cache Miss Lookaside (CML) buffer that detects conflicts by recording and summarizing a history of cache misses, and a software policy within the operating system's virtual memory system that removes conflicts by dynamically remapping pages whenever large numbers of conflict misses are detected. Using trace-driven simulation of applications and the operating system, we show that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set associative cache of equivalent size and speed, although with lower hardware cost and complexity.","PeriodicalId":140481,"journal":{"name":"ASPLOS VI","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127378383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}