{"title":"Reducing ownership overhead for load-store sequences in cache-coherent multiprocessors","authors":"J. Nilsson, F. Dahlgren","doi":"10.1109/IPDPS.2000.846053","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.846053","url":null,"abstract":"Parallel programs that modify shared data in a cache-coherent multiprocessor with a write-invalidate coherence protocol create ownership overhead in the form of ownership acquisitions at writes to shared data. This can have a significant impact on performance in a cache-coherent non-uniform memory architecture (NUMA) multiprocessor. By combining a read-request and an ownership acquisition, the write latency and network traffic can potentially be reduced. In this paper we propose a new hardware-based approach for performing this optimization by targeting load-store sequences, which we show is a super-set of migratory sharing. A load-store sequence consists of a global read request followed by a global write action to the same memory location from the same processor without any intervening access to the same block from any other processor. We use detailed simulation with four benchmark programs including one on-line transaction processing (OLTP) workload and operating system execution to examine the effectiveness of the proposed technique. The results show that the technique is able to reduce write-related latency and network traffic more than previous hardware-based techniques, up to twice as much.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122294626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Micro-architectures of high performance, multi-user system area network interface cards","authors":"B. S. Ang, Derek Chiou, L. Rudolph, Arvind","doi":"10.1109/IPDPS.2000.845959","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.845959","url":null,"abstract":"This paper examines two Network Interface Card micro-architectures that support low latency, high bandwidth user level message passing in multi-user environments. The two are at different ends of a design spectrum-the Resident queues design relies completely on hardware, while the Non-resident queues design is heavily firmware driven. Through actual implementation of these designs and simulation-based micro-benchmark studies, we identify issues critical to the performance and functionality of the firmware-based approach. The firmware-based approach offers much flexibility at a moderate performance penalty, while the Resident design has superior performance for the functions it implements. This leads us to conclude that a hybrid design combining complete hardware support for common operations and a firmware implementation of less common functions achieves both high performance and flexibility.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132466080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Load balancing strategies for dense linear algebra kernels on heterogeneous two-dimensional grids","authors":"Olivier Beaumont, Vincent Boudet, F. Rastello, Y. Robert","doi":"10.1109/IPDPS.2000.846065","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.846065","url":null,"abstract":"We study the implementation of dense linear algebra computations, such as matrix multiplication and linear system solvers, on two-dimensional (2D) grids of heterogeneous processors. For these operations, 2D-grids are the key to scalability and efficiency. The uniform block-cyclic data distribution scheme commonly used for homogeneous collections of processors limits the performance of these operations on heterogeneous grids to the speed of the slowest processor. We present and study more sophisticated data allocation strategies that balance the load on heterogeneous 2D-grids with respect to the performance of the processors. The usefulness of these strategies is demonstrated by simulation measurements for a heterogeneous network of workstations.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116475346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An optimal parallel algorithm for computing moments on arrays with reconfigurable optical buses","authors":"C. Wu, S. Horng, Jinn-Fu Lin, Horng-Ren Tsai, Tsrong-Lay Lin","doi":"10.1109/IPDPS.2000.846059","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.846059","url":null,"abstract":"Computing the moments of a two-dimensional (2-D) image involves a significant amount of multiplications and additions in a direct method. In this paper, we use the suffix sums to compute the 2-D moments instead of using a direct method. This method can reduce the number of multiplications tremendously. By integrating the advantages of both optical transmission and electronic computation, the 2-D moments can be computed in constant time on a 2-D array with reconfigurable optical buses (AROB). This result achieves optimal speed-up.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114478950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploration of the spatial locality on emerging applications and the consequences for cache performance","authors":"Martin Kämpe, F. Dahlgren","doi":"10.1109/IPDPS.2000.845978","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.845978","url":null,"abstract":"The performance gap between processors and memory is increasing, making the cache hit-rate paramount for performance. Studies show room for improvement, especially in data caches. The cache effectiveness is dictated by software locality, hence the software behavior directs the cache performance. This paper presents a framework for studying spatial locality. It focuses on the characteristics of the spatial locality in terms of closeness in time and space, to get the amount of accessed sequential data and the potential for cache hits. By using the framework we gain knowledge for improving the cache performance. Our experiment consists of a program driven simulator and 11 important applications. We show a large performance potential in the data cache with up to 75% less miss rate, exploiting spatial locality. In order to investigate where potential bottlenecks are located we make a simple implementation of a scheme to exploit this spatial locality.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125911614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power-aware localized routing in wireless networks","authors":"I. Stojmenovic, Xu Lin","doi":"10.1109/IPDPS.2000.846008","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.846008","url":null,"abstract":"Two metrics where transmission power depends on distance between nodes, and a cost aware metric based on remaining battery power at nodes (assuming constant transmission power), together with corresponding non-localized shortest path routing algorithms, were recently proposed. We define a new power-cost metric based on the combination of both node's lifetime and distance based power metrics. We then propose power, cost, and power-cost GPS based localized routing algorithms, where nodes make routing decisions solely on the basis of location of their neighbors and destination. Power aware localized routing algorithm attempts to minimize the total power needed to route a message between a source and a destination. Cost-aware localized algorithm is aimed at extending battery's worst case lifetime. The combined power-cost algorithm attempts to minimize the total power needed and to avoid nodes with short remaining lifetime. We prove that these localized power, cost, and power-cost efficient routing algorithms are loop-free.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123950559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"De Bruijn isomorphisms and free space optical networks","authors":"D. Coudert, Afonso Ferreira, S. Pérennes","doi":"10.1109/IPDPS.2000.846063","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.846063","url":null,"abstract":"The de Bruijn digraph B(d, D) is usually defined by words of size D on an alphabet of cardinality d, through a cyclic left shift permutation on the words, after which the rightmost symbol is changed. In this paper we show that any digraph defined on words and alphabets of the same size, through an arbitrary permutation on the alphabet and an arbitrary permutation on the word indices, is isomorphic to the de Bruijn, provided that this latter permutation is cyclic. This work is motivated by the next application. It is known that the optical transpose interconnection system from UCSD can implement the de Bruijn interconnections for n nodes, for a fixed d, with O(n) lenses. We show here how to improve this hardware requirement to Θ(√n).","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129567331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A quantitative assessment of thread-level speculation techniques","authors":"P. Marcuello, Antonio González","doi":"10.1109/IPDPS.2000.846040","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.846040","url":null,"abstract":"Speculative thread-level parallelism has been recently proposed as an alternative source of parallelism that can boost the performance for applications where independent threads are hard to find. Several schemes to exploit thread level parallelism have been proposed and significant performance gains have been reported. However, the sources of the performance gains are poorly understood as well as the impact of some design choices. In this work, the advantages of different thread speculation techniques are analyzed as are the impact of some critical issues including the value predictor, the branch predictor, the thread initialization overhead and the connectivity among thread units.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121495498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image layer decomposition for distributed real-time rendering on clusters","authors":"Thu D. Nguyen, J. Zahorjan","doi":"10.1109/IPDPS.2000.846015","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.846015","url":null,"abstract":"We propose a novel work partitioning technique, image layer decomposition (ILD), designed specifically to support distributed real-time rendering on commodity clusters. ILD has several advantages over previous partitioning algorithms for our targeted environment, including its compatibility with the use of hardware graphics accelerators, decoupling of communication bandwidth requirement from scene complexity, and reduced communication bandwidth growth as the system size increases. Furthermore, ILD tries to optimize the rendering of a sequence of frames (of an interactive application) instead of only individual frames. We simulate ILD using traces taken from a VRML viewer. Our results show that ILD can be expected to work well up to moderately sized clusters and to outperform sort-last, a common partitioning approach, because of its smaller communication bandwidth requirement.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131428519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Monotonic counters: a new mechanism for thread synchronization","authors":"J. Thornley, K. Chandy","doi":"10.1109/IPDPS.2000.846037","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.846037","url":null,"abstract":"Only a handful of fundamental mechanisms for synchronizing the access of concurrent threads to shared memory are widely implemented and used. These include locks, condition variables, semaphores, barriers, and monitors. In this paper, we introduce a new synchronization mechanism-monotonic counters-and make a case for its addition to this group. Unlike most other synchronization mechanisms, monotonic counters were designed primarily for multiprocessing, rather than for systems programming. Counters have a very simple definition: a counter object has a nonnegative value, an Increment operation, and a Check operation. Increment atomically increases the counter, and Check suspends until the counter reaches a specified level. We demonstrate that many practical thread synchronization patterns can be expressed more elegantly using counters than with other synchronization mechanisms. Of particular importance, the monotonicity of counters can be used to guarantee deterministic synchronization and the equivalence of multithreaded and sequential execution. In terms of implementation, counters are distinguished from traditional synchronization mechanisms, in that they have a dynamically varying number of thread suspension queues. We give several examples of multithreaded programs that use counter synchronization, and give an implementation of counters on top of locks and condition variables.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130830058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}