ASPLOS VI. Pub Date: 1994-11-01. DOI: 10.1145/195473.195538
K. Hayashi, Tsunehisa Doi, T. Horie, Y. Koyanagi, Osamu Shiraki, Nobutaka Imamura, T. Shimizu, H. Ishihata, Tatsuya Shindo

AP1000+: architectural support of PUT/GET interface for parallelizing compiler

Abstract: The scalability of distributed-memory parallel computers makes them attractive candidates for solving large-scale problems. New languages, such as HPF, Fortran D, and VPP Fortran, have been developed to enable existing software to be easily ported to such machines. Many distributed-memory parallel computers have been built, but none of them support the mechanisms required by such languages. We studied the mechanisms required by parallelizing compilers and proposed a new architecture to support them. Based on this proposed architecture, we developed a new distributed-memory parallel computer, the AP1000+, an enhanced version of the AP1000. Using scientific applications in VPP Fortran and C, such as the NAS parallel benchmarks, we simulated the performance of the AP1000+.
ASPLOS VI. Pub Date: 1994-11-01. DOI: 10.1145/195473.195515
C. Thekkath, H. Levy

Hardware and software support for efficient exception handling

Abstract: Program-synchronous exceptions (for example, breakpoints, watchpoints, illegal opcodes, and memory access violations) provide information about exceptional conditions, interrupting the program and vectoring to an operating system handler. Over the last decade, however, programs and run-time systems have increasingly employed these mechanisms as a performance optimization to detect normal and expected conditions. Unfortunately, current architecture and operating system structures are designed for exceptional or erroneous conditions, where performance is of secondary importance, rather than for normal conditions. This has limited the practicality of such hardware-based detection mechanisms.

We propose both hardware and software structures that permit efficient handling of synchronous exceptions by user-level code. We demonstrate a software implementation that reduces exception-delivery cost by an order of magnitude on current RISC processors, and show the performance benefits of that mechanism for several example applications.
ASPLOS VI. Pub Date: 1994-11-01. DOI: 10.1145/195473.195499
V. Karamcheti, A. Chien

Software overhead in messaging layers: where does the time go?

Abstract: Despite improvements in network interfaces and software messaging layers, software communication overhead still dominates the hardware routing cost in most systems. In this study, we identify the sources of this overhead by analyzing software costs of typical communication protocols built atop the active messages layer on the CM-5. We show that up to 50–70% of the software messaging costs are a direct consequence of the gap between specific network features such as arbitrary delivery order, finite buffering, and limited fault-handling, and the user communication requirements of in-order delivery, end-to-end flow control, and reliable transmission. However, virtually all of these costs can be eliminated if routing networks provide higher-level services such as in-order delivery, end-to-end flow control, and packet-level fault-tolerance. We conclude that significant cost reductions require changing the constraints on messaging layers: we propose designing networks and network interfaces which simplify or replace software for implementing user communication requirements.
ASPLOS VI. Pub Date: 1994-11-01. DOI: 10.1145/195473.195579
N. Carter, S. Keckler, W. Dally

Hardware support for fast capability-based addressing

Abstract: Traditional methods of providing protection in memory systems do so at the cost of increased context switch time and/or increased storage to record access permissions for processes. With the advent of computers that support cycle-by-cycle multithreading, protection schemes that increase the time to perform a context switch are unacceptable, but protecting unrelated processes from each other is still necessary if such machines are to be used in non-trusting environments.

This paper examines guarded pointers, a hardware technique that uses tagged 64-bit pointer objects to implement capability-based addressing. Guarded pointers encode a segment descriptor into the upper bits of every pointer, eliminating the indirection and related performance penalties associated with traditional implementations of capabilities. All processes share a single 54-bit virtual address space, and access is limited to the data that can be referenced through the pointers that a process has been issued. Only one level of address translation is required to perform a memory reference. Sharing data between processes is efficient, and protection states are defined to allow fast protected subsystem calls and create unforgeable data keys.
ASPLOS VI. Pub Date: 1994-11-01. DOI: 10.1145/195473.195583
R. Thekkath, S. Eggers

The effectiveness of multiple hardware contexts

Abstract: Multithreaded processors are used to tolerate long memory latencies. By executing threads loaded in multiple hardware contexts, an otherwise idle processor can keep busy, thus increasing its utilization. However, the larger size of a multi-thread working set can have a negative effect on cache conflict misses. In this paper we evaluate the two phenomena together, examining their combined effect on execution time.

The usefulness of multiple hardware contexts depends on program data locality, cache organization, and the degree of multiprocessing. Multiple hardware contexts are most effective on programs that have been optimized for data locality. For these programs, execution time dropped with increasing contexts over widely varying architectures. With unoptimized applications, multiple contexts had limited value: the best performance was seen with only two contexts, and only on uniprocessors and small multiprocessors. The behavior of the unoptimized applications also changed more noticeably with variations in cache associativity and cache hierarchy than that of the optimized programs.

As a mechanism for exploiting program parallelism, an additional processor is clearly better than another context. However, there were many configurations for which adding a few hardware contexts brought as much or more performance than a larger multiprocessor with fewer than the optimal number of contexts.
ASPLOS VI. Pub Date: 1994-11-01. DOI: 10.1145/195473.195485
Rohit Chandra, Scott Devine, Ben Verghese, Anoop Gupta, M. Rosenblum

Scheduling and page migration for multiprocessor compute servers

Abstract: Several cache-coherent shared-memory multiprocessors have been developed that are scalable and offer a very tight coupling between the processing resources. They are therefore quite attractive for use as compute servers for multiprogramming and parallel application workloads. Process scheduling and memory management, however, remain challenging due to the distributed main memory found on such machines. This paper examines the effects of OS scheduling and page migration policies on the performance of such compute servers. Our experiments are done on the Stanford DASH, a distributed-memory cache-coherent multiprocessor. We show that for our multiprogramming workloads consisting of sequential jobs, the traditional Unix scheduling policy does very poorly. In contrast, a policy incorporating cluster and cache affinity along with a simple page-migration algorithm offers up to two-fold performance improvement. For our workloads consisting of multiple parallel applications, we compare space-sharing policies that divide the processors among the applications to time-slicing policies such as standard Unix or gang scheduling. We show that space-sharing policies can achieve better processor utilization due to the operating point effect, but time-slicing policies benefit strongly from user-level data distribution. Our initial experience with automatic page migration suggests that policies based only on TLB miss information can be quite effective, and useful for addressing the data distribution problems of space-sharing schedulers.
ASPLOS VI. Pub Date: 1994-11-01. DOI: 10.1145/195473.195545
J. Larus, Brad Richards, Guhan Viswanathan

LCM: memory system support for parallel language implementation

Abstract: Higher-level parallel programming languages can be difficult to implement efficiently on parallel machines. This paper shows how a flexible, compiler-controlled memory system can help achieve good performance for language constructs that previously appeared too costly to be practical.

Our compiler-controlled memory system is called Loosely Coherent Memory (LCM). It is an example of a larger class of Reconcilable Shared Memory (RSM) systems, which generalize the replication and merge policies of cache-coherent shared-memory. RSM protocols differ in the action taken by a processor in response to a request for a location and the way in which a processor reconciles multiple outstanding copies of a location. LCM memory becomes temporarily inconsistent to implement the semantics of C** parallel functions efficiently. RSM provides a compiler with control over memory-system policies, which it can use to implement a language's semantics, improve performance, or detect errors. We illustrate the first two points with LCM and our compiler for the data-parallel language C**.
ASPLOS VI. Pub Date: 1994-11-01. DOI: 10.1145/195473.195518
P. V. Argade, David K. Charles, C. Taylor

A technique for monitoring run-time dynamics of an operating system and a microprocessor executing user applications

Abstract: In this paper, we present a non-invasive and efficient technique for simulating applications complete with their operating system interaction. The technique involves booting and initiating an application on a hardware development system, capturing the entire state of the application and the microprocessor at a well-defined point in execution, and then simulating the application on microprocessor simulators. Extensive statistics generated by the simulators on the run-time dynamics of the application, the operating system, and the microprocessor enabled us to tune the operating system and the microprocessor architecture and implementation. The results also enabled us to optimize system-level design choices by predicting the performance of the target system. Lastly, the results were used to adjust and refocus the evolution of the architecture of both the operating system and the microprocessor.
ASPLOS VI. Pub Date: 1994-11-01. DOI: 10.1145/195473.195510
M. Upton, T. Huff, T. Mudge, Richard B. Brown

Resource allocation in a high clock rate microprocessor

Abstract: This paper discusses the design of a high clock rate (300 MHz) processor. The architecture is described, and the goals for the design are explained. The performance of three processor models is evaluated using trace-driven simulation. A cost model is used to estimate the resources required to build processors with varying sizes of on-chip memories, in both single- and dual-issue models. Recommendations are then made to increase the effectiveness of each of the models.
ASPLOS VI. Pub Date: 1994-11-01. DOI: 10.1145/195473.195534
D. M. Gallagher, William Y. Chen, S. Mahlke, J. Gyllenhaal, Wen-mei W. Hwu

Dynamic memory disambiguation using the memory conflict buffer

Abstract: To exploit instruction level parallelism, compilers for VLIW and superscalar processors often employ static code scheduling. However, the available code reordering may be severely restricted due to ambiguous dependences between memory instructions. This paper introduces a simple hardware mechanism, referred to as the memory conflict buffer, which facilitates static code scheduling in the presence of memory store/load dependences. Correct program execution is ensured by the memory conflict buffer and repair code provided by the compiler. With this addition, significant speedup over an aggressive code scheduling model can be achieved for both non-numerical and numerical programs.