{"title":"Optimizing network throughput: optimal versus robust design","authors":"P. López, R. Alcover, J. Duato, L. Zúnica","doi":"10.1109/EMPDP.1999.746644","DOIUrl":"https://doi.org/10.1109/EMPDP.1999.746644","url":null,"abstract":"Interconnection network performance is usually measured in terms of its latency (time required to deliver a message) and throughput (maximum traffic accepted by the network). At first glance, minimizing average message latency is the main designer goal, because average network traffic is usually far from saturation. However, applications can also generate very high peak traffic. In order to deal with such situations, it is important that network throughput is also high. On the other hand, interconnection network performance depends on several parameters. Some of them can be chosen by the designer: routing algorithm, switching technique, topology and node design parameters. However, there are other parameters that cannot be selected by the designer. Among these, there are parameters that depend on the application, such as message size, message destination distribution and message traffic, as well as parameters defined by the customer, such as network size. The network designer can select the design parameters that maximize average performance (optimal design) or the design parameters that achieve good performance under all the feasible combinations of the parameters that cannot be selected by the designer (robust design). Notice that the two alternatives do not always lead to the same parameter configuration. Previously we chose the design parameters of a k-ary n-cube network to optimize latency. In that case, optimal and robust design led to the same choice. In this paper, we obtain these design parameters to optimize network throughput. Unfortunately, there is a discrepancy between the optimal and robust design criteria, the former being the best choice.","PeriodicalId":335983,"journal":{"name":"Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115141150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The split data cache in multiprocessor systems: an initial hit ratio analysis","authors":"J. Sahuquillo, A. Pont","doi":"10.1109/EMPDP.1999.746641","DOIUrl":"https://doi.org/10.1109/EMPDP.1999.746641","url":null,"abstract":"As current first level (L1) data caches are poorly and inefficiently managed, new approaches to achieve better performance in uniprocessor systems have been proposed. The L1 data cache management system is basically the same as it was three decades ago. New organizations have recently been proposed, where two multi-lateral caches are included in the first level in accordance with the data locality where they are stored. The processor simultaneously sends the same memory request to both caches located in L1. These caches work independently and have different organizations. The main objective is to minimize the average data access time. These new organizations will normally increase the hit ratio. Additionally, the chip area occupied by these caches-including the necessary management hardware-is smaller than in a conventional organization. As the proposed cache size is smaller, it can work faster and improve access time at this level. Several authors have studied different approaches around this idea in uniprocessors. In this work we have made extensions for shared memory multiprocessors and studied the advantages.","PeriodicalId":335983,"journal":{"name":"Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114151714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Testing and debugging message passing applications based on the synergy of program and specification executions","authors":"Z. Tsiatsoulis, Y. Cotronis, E. Floros","doi":"10.1109/EMPDP.1999.746668","DOIUrl":"https://doi.org/10.1109/EMPDP.1999.746668","url":null,"abstract":"We outline Ensemble, a design and implementation methodology for composing message passing (MP) applications from program components. We also outline specification composition, directly associated with application composition. We present the integration of specification and implementation of program development. We particularly elaborate on testing and debugging of MP applications based on the synergy of tools for specification simulations with tools for program execution visualisation.","PeriodicalId":335983,"journal":{"name":"Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131835285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel resolution of alternating-line processes by means of pipelining techniques","authors":"David Espadas, M. Prieto, I. Llorente, F. Tirado","doi":"10.1109/EMPDP.1999.746691","DOIUrl":"https://doi.org/10.1109/EMPDP.1999.746691","url":null,"abstract":"The aim of this paper is to present an easy and efficient method to implement alternating-line processes on current parallel computers. First we show how data locality has an important impact on global efficiency, which leads us to the conclusion that one-dimensional compositions are the most convenient ones for 2D problems. Once this is asserted, a parallel algorithm is presented for the solution of the distributed tridiagonal systems along the partitioned direction. The key idea is to pipeline the simultaneous resolution of many systems of equations, not parallelising each resolution separately. This approach presents good numerical and architectural properties, in terms of memory usage and data locality, and high parallel efficiencies are obtained. For the case of alternating-line processes, the election of the optimal decomposition is studied. The experimental results have been obtained on a Cray T3E.","PeriodicalId":335983,"journal":{"name":"Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115700670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The impact of cache organisation on the instruction issue rate of a superscalar processor","authors":"L. Vintan, Cristian Armat, G. Steven","doi":"10.1109/EMPDP.1999.746646","DOIUrl":"https://doi.org/10.1109/EMPDP.1999.746646","url":null,"abstract":"Much of the research on multiple-instruction-issue processor architecture assumes a perfect memory hierarchy and concentrates on increasing the instruction issue rate of the processor either through aggressive out-of-order instruction issue or through static instruction scheduling. In this paper we describe a trace driven simulation tool that we have developed to quantify the impact of the memory hierarchy on the performance of a superscalar processor that we have developed to support static instruction scheduling. We describe some initial studies performed using our simulator. As well as examining the more conventional split cache configurations, we also quantify the performance impact of using a unified cache. Finally, we examine the benefits of using two-level caches and victim caches.","PeriodicalId":335983,"journal":{"name":"Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132159344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance evaluation of the bubble algorithm: benefits for k-ary n-cubes","authors":"C. Carrión, R. Beivide, J. Gregorio","doi":"10.1109/EMPDP.1999.746699","DOIUrl":"https://doi.org/10.1109/EMPDP.1999.746699","url":null,"abstract":"The bubble algorithm evaluated in this paper assures message deadlock freedom in k-ary n-cube networks without using virtual channels. This algorithm is based both on a dimension order routing (DOR) and on a restricted injection policy extended to the dimension changes. An exhaustive comparison between the bubble mechanism and the classical deterministic virtual channels solution is presented here. For that purpose, the message router of both proposals has been designed by using VHDL descriptions and the Synopsys VLSI CAD tool. Additionally, formal models of the routers, based on colored Petri nets, have been carried out together with simulation techniques in order to assure the validation of the results and shorten the design cycle. The performance evaluation of n-dimension tori highlights the benefits of the bubble algorithm as both the temporal delay and the necessary silicon area of the message router are reduced.","PeriodicalId":335983,"journal":{"name":"Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126841085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A framework backbone for software fault tolerance in embedded parallel applications","authors":"Geert Deconinck, M. Truyens, V. D. Florio, W. Rosseel, R. Lauwereins, R. Belmans","doi":"10.1109/EMPDP.1999.746666","DOIUrl":"https://doi.org/10.1109/EMPDP.1999.746666","url":null,"abstract":"The DIR net (detection-isolation-recovery net) is the main module of a software framework for the development of embedded supercomputing applications. This framework provides a set of functional elements, collected in a library, to improve the dependability attributes of the applications (especially the availability). The DIR net enables these functional elements to cooperate and enhances their efficiency by controlling and co-ordinating them. As a supervisor and the main executor of the fault tolerance strategy, it is the backbone of the framework, of which the application developer is the architect. Moreover, it provides an interface to which all detection and recovery tools should conform. Although the DIR net is meant to be used together within this fault tolerance framework, the adopted concepts and design decisions have a more general value, and can be applied in a wide range of parallel systems.","PeriodicalId":335983,"journal":{"name":"Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123030112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic load adaption in LIPS","authors":"Thomas Setz","doi":"10.1109/EMPDP.1999.746702","DOIUrl":"https://doi.org/10.1109/EMPDP.1999.746702","url":null,"abstract":"LIPS is a system for distributed computing using idle-cycles in heterogeneous networks of workstations. Especially data- and compute-intensive applications in the field of cryptography and computer algebra have used the system. The system provides its user with the tuple space based generative communication paradigm of parallel computing as known from the coordination language LINDA. In LIPS, failures (fail stop failures) like crashed machines are handled transparently for the application. Dynamic Load Adaption, meaning removing application processes from machines not being idle any longer and migrating those processes to idle machines is based on the detection of crashed application processes and the (re)start of application processes on an idle machine. The implementation of Dynamic Load Adaption for LIPS applications is easy, because checkpoint generation and the restart from a checkpoint is independent from the other application processes. As the crash of an application process (assuming the machine and the operating system the application process resides survive) can be detected very fast, the used mechanism allows for fast adaption of the applications distribution to changes in the NOW availability.","PeriodicalId":335983,"journal":{"name":"Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115300366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A replicated resource architecture for high performance network service","authors":"C. Allison, M. Bramley, Jose Serrano","doi":"10.1109/EMPDP.1999.746652","DOIUrl":"https://doi.org/10.1109/EMPDP.1999.746652","url":null,"abstract":"Distributed Learning Environments represent the hope that communications and information technology can improve and widen access to education while maintaining and improving its quality. Such environments consist of network applications and services. Good interactive response time is crucial to their success. Slow responses can quickly dissuade teachers and learners alike from investing their time in the use of these services. Responsiveness timings taken across 155 Mb/s IP/ATM networks have exposed traditional monolithic server performance as the main bottleneck in interactive response time. A strategy of providing bigger and faster monolithic server hardware in response to each occurrence of system slow down is not a good solution as it is expensive and inflexible. Cluster computing has proven a successful and cost effective alternative to conventional supercomputing and it would now seem to be appropriate to investigate its application to the problem of high performance network service provision. In order to research this issue a replicated resource architecture has been designed to harness the combined power of multiple independent computers. The architecture is outlined and an initial implementation of its core component, a coherence server, is described. Results are presented which indicate that this approach is viable within the context of Distributed Learning Environments.","PeriodicalId":335983,"journal":{"name":"Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132366022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the performance of nearest-neighbors load balancing algorithms in parallel systems","authors":"A. Cortés, A. Ripoll, M. A. Senar, P. Pons, E. Luque","doi":"10.1109/EMPDP.1999.746661","DOIUrl":"https://doi.org/10.1109/EMPDP.1999.746661","url":null,"abstract":"DASUD (Diffusion Algorithm Searching Unbalanced Domains) is a totally distributed load-balancing algorithm which belongs to the nearest-neighbors class. DASUD detects unbalanced domains (a processor and its immediate neighbors) and corrects this situation by allowing load movements between non-connected processors. DASUD has been evaluated by comparison with two well-known nearest-neighbors load balancing strategies, namely, the GDE (Generalized Dimension Exchange) and the SID (Sender Initiated Diffusion) by considering a large set of initial load distributions. These distributions were applied to ring, torus and hypercube topologies, and the number of processors ranged from 8 to 128. From these experiments we have observed that DASUD outperforms the other strategies used in the comparison as it provides the best trade-off between the balance degree obtained at the final state and the number of iterations required to reach this state.","PeriodicalId":335983,"journal":{"name":"Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115144315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}