{"title":"Mapping instruction sequences onto EPOM-processor arrays: a framework for parallel data processing","authors":"Jean-Paul Theis, Harald Schlimper","doi":"10.1109/HIPC.1998.737977","DOIUrl":"https://doi.org/10.1109/HIPC.1998.737977","url":null,"abstract":"The paper introduces an optimized mapping methodology for mapping instruction sequences (ISs) onto EPOM-processor arrays. The new features of this mapping methodology result from a systematic specification and exploitation of both instruction and processor level parallelism: ultra-low granularity of ISs requires an allocation and scheduling of individual instructions onto the given processor array. Moreover, this mapping methodology is complete in the sense that it considers both array bus-bandwidths and processor resource constraints. The mapping methodology is based on two concepts: 1) instruction sequences (ISs) which represent a generalized form of directed cyclic graphs (DCGs) and allow efficient specification of algorithm parallelism, and whose graph nodes represent instructions from the instruction set of a target processor architecture (J.P. Theis, 1997); 2) the EPOM-processor architecture which represents an optimized target VLIW processor architecture for parallel implementation of ISs (J.P. Theis and L. Thiele, 1996) and is especially suited for parallel image/multimedia processing (J.P. Theis and L. Thiele, 1995). Special attention is paid to the optimization of the mapping process of ISs onto EPOM-processor arrays. Algorithm execution time minimization is used as the optimization goal. The mapping methodology is partially based on integer linear programming and heuristic techniques. The solution time complexity is substantially reduced by developing a two-phase hierarchical model, decoupling processor array allocation from subsequent scheduling. The efficiency of this mapping methodology was validated through experimental results on ISs of well-known algorithm routines.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"6 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125920878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed routing balancing for interconnection network communication","authors":"I. Garcés, Daniel Franco, E. Luque","doi":"10.1109/HIPC.1998.737996","DOIUrl":"https://doi.org/10.1109/HIPC.1998.737996","url":null,"abstract":"An efficient design of the interconnection network is crucial because of its impact on the parallel computer performance. A high speed routing scheme that minimises contention and avoids the formation of hot-spots should be included in the design. We have developed a new method to uniformly balance communication traffic over the interconnection network called distributed routing balancing (DRB) that is based on limited and load-controlled path expansion in order to maintain a low message latency. The method uniformly distributes the communication load between all links of the interconnection network and maintains latency control provided that total bandwidth requirements do not exceed total available link bandwidth in the interconnection network. DRB defines how to create alternative paths to expand single paths (expanded path definition) and when to use them depending on traffic load (expanded path selection carried out by DRB routing). Some conclusions of the experimentation and comparisons with existing methods are given. It is demonstrated that DRB is a method to effectively balance network traffic.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121542938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measurement-based modeling and analysis methodology for characterizing parallel I/O performance","authors":"S. Sharma, R. Iyer","doi":"10.1109/HIPC.1998.738013","DOIUrl":"https://doi.org/10.1109/HIPC.1998.738013","url":null,"abstract":"A parallel I/O characterization methodology that consists of a hierarchical modeling and measurement analysis environment for investigating I/O performance is presented. The methodology is illustrated via a case study of a video server workload running under the parallel I/O file system (PIOFS) of IBM SP/2. The measurements demonstrate that for video server and read-intensive workloads, spreading parallel files across all eight I/O servers improves a client's bandwidth performance by 36-52%. With eight clients, the per-client bandwidth performance increases by only 15%-23%. PIOFS-based default file striping results in degradation of bandwidth performance by as much as 25%.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126706663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extrapolation in distributed adaptive integration","authors":"E. Doncker, Ajay Gupta, Rodger Zanny, J. Maile","doi":"10.1109/HIPC.1998.737975","DOIUrl":"https://doi.org/10.1109/HIPC.1998.737975","url":null,"abstract":"The paper addresses the design of distributed methods which incorporate numerical extrapolation into adaptive multivariate integration, in order to increase the functionality of the integration algorithms. When attempting to deal with singularities, adaptive integration algorithms need a very fine subdivision in the proximity of these \"hot spots\". This is not practical in higher dimensions where a vast number of subregions result. These problems may be alleviated through the use of a suitable extrapolation strategy. We present a strategy which, incorporated as a global extrapolation method in distributed adaptive integration, allows for load balancing.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131907931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comparative study of some network subsystem organizations","authors":"D. Ponomarev, K. Ghose","doi":"10.1109/HIPC.1998.738019","DOIUrl":"https://doi.org/10.1109/HIPC.1998.738019","url":null,"abstract":"The impact of alternative network subsystem designs for realizing low end-to-end latencies and high network throughput in a switched LAN is studied in detail through simulation. These alternatives include choices in the disposition of the network interface card (NIC), DMA priorities and OS services. Our simulation model captures the delays of OS services/software layers, message copying DMAs and, in addition, models non-network related traffic on the I/O and memory buses introduced by paging and on-chip cache misses. In a conventional setup, with the NIC placed on the I/O bus, we show that changing traffic priorities on the memory bus to speed up the transfers between the NIC and the DRAM has little impact on overall latency and network throughput as the offered network traffic increases. Improving the speed of the I/O bus produces some performance gains. These performance gains are shown to be quite limited until message demultiplexing capabilities are added to the NIC. The best performance comes from the use of dual-ported DRAMs, with a dedicated connection between the NIC and the added port.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134130434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A simple optimal list ranking algorithm","authors":"A. Ranade","doi":"10.1109/HIPC.1998.737971","DOIUrl":"https://doi.org/10.1109/HIPC.1998.737971","url":null,"abstract":"We consider the problem of ranking an N element list on a P processor EREW PRAM. Recent work on this problem has shown the importance of grain size. While several optimal O(N/P+log P) time list ranking algorithms are known, Reid-Miller and Blelloch (1994) recently showed that these do not lead to good implementations in practice, because of the fine-grained nature of these algorithms. In Reid-Miller and Blelloch's experiments the best performance was obtained by an O(N/P+log^2 P) time coarse grained randomized algorithm devised by them. We build upon their idea and present a coarse-grained randomized algorithm that runs in time O(N/P+log P), and is thus also optimal. Our algorithm simplifies some of the ideas from [6, 7]-these simplifications might be of interest to implementers.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134315259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance-driven design and redesign of high-speed local area networks","authors":"C. Ravikumar, Dilip R. Pandit, A. Mishra","doi":"10.1109/HIPC.1998.738016","DOIUrl":"https://doi.org/10.1109/HIPC.1998.738016","url":null,"abstract":"Although distributed computing over a network of computers has become a reality, its success mainly depends on the performance of the underlying network. In this paper, we consider the problem of designing a local area network with specified cost and performance constraints. The cost and performance of a local area network (LAN) are directly related to its topology. Using the a priori knowledge of the approximate number of users of the network and the kind of communication traffic that must be supported, the designer can optimize the design of a LAN for superior performance. Design decisions include the number of LAN segments, number of bridges, assignment of users to segments, and the method to interconnect the segments through bridges. In the case of ATM networks, the decisions are regarding the number of ATM switches, the assignment of hosts to switches, and the way to connect switches through cross-connects. While assigning too many users to the same segment may cause large delays due to the sharing of network bandwidth, splitting the LAN into too many segments will increase the cost of the LAN. We report a greedy heuristic algorithm for local area network design. We propose an interesting method to construct good initial solutions to the topology design problem using a heuristic method which is based on the three-opt technique for solving the travelling salesperson problem. Our experimental results indicate that the heuristic algorithm finds good solutions.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115207515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient address sequence generation for two-level mappings in High Performance Fortran","authors":"J. Ramanujam, A. Venkatachar, S. Dutta","doi":"10.1109/HIPC.1998.737981","DOIUrl":"https://doi.org/10.1109/HIPC.1998.737981","url":null,"abstract":"Data-parallel languages like High Performance Fortran allow users to specify mappings of arrays by first aligning elements to abstract Cartesian grids called templates and then distributing the templates across processors. Code generation then includes the generation of the sequence of local addresses accessed on a processor. Address sequence generation for non-unit alignment strides, referred to as the two-level mapping problem, is difficult. We present efficient solutions to the problem of address generation for two-level mapping for general CYCLIC(k) distribution. Our approach involves the construction of pattern tables which incurs negligible runtime overhead compared to other existing solutions for this problem. We use two applications of the integer lattice-based method developed by Thirumalai and Ramanujam (1996) to generate the pattern of accesses using a variety of techniques. Extensive experiments demonstrate that the techniques presented in this paper significantly outperform current solutions to the two-level mapping problem.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128309786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Strategies for parallel implementation of a global spectral atmospheric general circulation model","authors":"R. Nanjundiah","doi":"10.1109/HIPC.1998.738021","DOIUrl":"https://doi.org/10.1109/HIPC.1998.738021","url":null,"abstract":"We discuss the parallel implementation of a global spectral atmospheric general circulation model on a message passing platform. We also discuss strategies that need to be employed to improve performance on parallel machines which will have multiprocessor nodes sharing an intra-node memory space. A brief discussion of the cause of load imbalances and simple methods to reduce the same are also presented.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130922584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Skew-insensitive parallel algorithms for relational join","authors":"K. Alsabti, S. Ranka","doi":"10.1109/HIPC.1998.738010","DOIUrl":"https://doi.org/10.1109/HIPC.1998.738010","url":null,"abstract":"Join is the most important and expensive operation in relational databases. The parallel join operation is very sensitive to the presence of data skew. In this paper we present two new parallel join algorithms for coarse grained machines which work optimally in the presence of an arbitrary amount of data skew. The first algorithm is sort-based and the second is hash-based. Both of these algorithms employ a preprocessing phase to equally partition the work among the processors. These algorithms are shown to be theoretically as well as practically scalable.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117130144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}