{"title":"External adjustment of runtime parameters in Time Warp synchronized parallel simulators","authors":"R. Radhakrishnan, L. Moore, P. Wilsey","doi":"10.1109/IPPS.1997.580905","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580905","url":null,"abstract":"Several optimizations to the Time Warp synchronization protocol for parallel discrete event simulation have been proposed and studied. Many of these optimizations have included some form of dynamic adjustment (or control) of the operating parameters of the simulation (e.g. checkpoint interval, cancellation strategy). Traditionally, dynamic parameter adjustment has been performed at the simulation object level; each simulation object collects measures of its operating behaviors (e.g. rollback frequency, rollback length, etc.) and uses them to adjust its operating parameters. The performance data collection functions and parameter adjustment are overhead costs that are incurred in the expectation of higher throughput. The paper presents a method of eliminating some of these overheads through the use of an external object to adjust the control parameters. That is, instead of inserting code for adjusting simulation parameters in the simulation object, an external control object is defined to periodically analyze each simulation object's performance data and revise that object's operating parameters. An implementation of an external control object in the WARPED Time Warp simulation kernel has been completed. The simulation parameters updated by the implemented control system are checkpoint interval and cancellation strategy (lazy or aggressive). A comparative analysis of three test cases shows that the external control mechanism provides speedups of 5% to 17% over the best performing embedded dynamic adjustment algorithms.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115713743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lower bounds on systolic gossip","authors":"M. Flammini, S. Pérennes","doi":"10.1109/IPPS.1997.580949","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580949","url":null,"abstract":"Gossiping is an information dissemination process in which each processor has a distinct item of information and has to collect all the items possessed by the other processors. We derive lower bounds on the gossiping time of systolic protocols, i.e. protocols constituted by a periodic repetition of simple communication steps. In particular, if we denote by n the number of processors in the network, then for directed networks and for undirected networks in the half-duplex mode any s-systolic gossip protocol takes at least g(s) log₂ n time steps, where g(4)=1.8133, g(6)=1.5310 and g(8)=1.4721. For the case s=4 this result is improved to 2.0218 log₂ n for directed butterflies of degree 2, and we show that the 2.0218 log₂ n and 1.8133 log₂ n lower bounds also hold, respectively, for undirected Butterfly and de Bruijn networks of degree 2 in the full-duplex case. Our results are obtained by means of a new technique relying on two novel concepts in the field: the notion of delay digraph of a systolic protocol and the use of matrix norm methods.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117160888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel simulated annealing: an adaptive approach","authors":"J. Knopman, J. S. Aude","doi":"10.1109/IPPS.1997.580950","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580950","url":null,"abstract":"This paper analyses alternatives for the parallelization of the Simulated Annealing algorithm when applied to the placement of modules in a VLSI circuit, considering the use of PVM on an Ethernet cluster of workstations. It is shown that different parallelization approaches have to be used for high and low temperature values of the annealing process. The algorithm used for low temperatures is an adaptive version of the speculative algorithm proposed in the literature. Within this adaptive algorithm, the number of processors allocated to the solution of the placement problem and the number of moves evaluated per processor between synchronization points change with the temperature. At high temperatures, an algorithm based on the parallel evaluation of independent chains of moves has been adopted. It is shown that results of the same quality as those produced by the serial version can be obtained when shorter length chains are used in the parallel implementation.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121243197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The impact of timing on linearizability in counting networks","authors":"M. Mavronicolas, M. Papatriantafilou, P. Tsigas","doi":"10.1109/IPPS.1997.580978","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580978","url":null,"abstract":"Counting networks form a new class of distributed, low-contention data structures made up of interconnected balancers, and are suitable for solving a variety of multiprocessor synchronization problems that can be expressed as counting problems. A linearizable counting network guarantees that the order of the values it returns respects the real-time order they were requested. Linearizability significantly raises the capabilities of the network, but at a possible price in network size or synchronization support. In this paper, we further pursue the systematic study of the impact of timing on linearizability for counting networks, along a research line initiated by Lynch et al. (1996). We consider two basic timing models: the instantaneous balancer model, in which the transition of a token from an input to an output port of a balancer is modeled as an instantaneous event, and the periodic balancer model, where balancers send out tokens at a fixed rate. We also consider lower and upper bounds on the delays incurred by wires connecting the balancers. We present necessary and sufficient conditions for linearizability in the form of precise inequalities that involve timing parameters and identify structural parameters of the counting network, which may be of more general interest. Our results significantly extend and strengthen previous impossibility and possibility results on linearizability in counting networks (Herlihy et al., 1990; Lynch et al., 1996).","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127437605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementation and results of hypothesis testing from the C³I parallel benchmark suite","authors":"B. V. Voorst, Luiz Pires, R. Jha, Mustafa Muhammad","doi":"10.1109/IPPS.1997.580886","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580886","url":null,"abstract":"This paper describes the implementation of the hypothesis testing benchmark, one of ten kernels from the C³I (Command, Control, Communications and Intelligence) Parallel Benchmark Suite (C³IPBS). The benchmark was implemented and executed on a variety of parallel environments. This paper details the run times obtained with these implementations, and offers an analysis of the results.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125381210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance prediction for complex parallel applications","authors":"J. Brehm, P. Worley","doi":"10.1109/IPPS.1997.580884","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580884","url":null,"abstract":"Today's massively parallel machines are typically message-passing systems consisting of hundreds or thousands of processors. Implementing parallel applications efficiently in this environment is a challenging task, and poor parallel design decisions can be expensive to correct. Tools and techniques that allow the fast and accurate evaluation of different parallelization strategies would significantly improve the productivity of application developers and increase throughput on parallel architectures. This paper investigates one of the major issues in building tools to compare parallelization strategies: determining what type of performance models of the application code and of the computer system are sufficient for a fast and accurate comparison of different strategies. The paper is built around a case study employing the Performance Prediction Tool (PerPreT) to predict performance of the Parallel Spectral Transform Shallow Water Model code (PSTSWM) on the Intel Paragon.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123882379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coherent block data transfer in the FLASH multiprocessor","authors":"J. Heinlein, K. Gharachorloo, Robert P. Bosch, M. Rosenblum, Anoop Gupta","doi":"10.1109/IPPS.1997.580836","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580836","url":null,"abstract":"A key goal of the Stanford FLASH project is to explore the integration of multiple communication protocols in a single multiprocessor architecture. To achieve this goal, FLASH includes a programmable node controller called MAGIC, which contains an embedded protocol processor capable of implementing multiple protocols. In this paper we present a specialized protocol for block data transfer integrated with a conventional cache coherence protocol. Block transfer forms the basis for message passing implementations on top of shared memory, occurs in important workloads such as databases, and is frequently used by the operating system. We discuss the issues that arise in designing a fully integrated protocol and its interactions with cache coherence. Using microbenchmarks, MPI communication primitives, and an application running on the operating system, we compare our protocol with standard bcopy and bcopy augmented with prefetches. Our results show that integrated block transfer can accelerate communication between nodes while off-loading the task from the main processor, utilizing the network more efficiently, and reducing the associated cache pollution. Given the aggressive support for prefetching in FLASH, prefetched bcopy is able to achieve competitive performance in many cases but lacks the other three advantages of our protocol.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123641456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple templates access of trees in parallel memory systems","authors":"V. Auletta, A. D. Vivo, V. Scarano","doi":"10.1109/IPPS.1997.580980","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580980","url":null,"abstract":"Studies the problem of mapping the N nodes of a data structure onto M memory modules so that they can be accessed in parallel by templates, i.e. distinct sets of nodes. In the literature, several algorithms are available for arrays (accessed by rows, columns, diagonals and subarrays) and trees (accessed by subtrees, root-to-leaf paths, etc.). Although some mapping algorithms for arrays allow conflict-free access to several templates at once (e.g. rows and columns), no mapping algorithm is known for efficiently accessing both subtree and root-to-leaf path templates in complete binary trees. We prove that any mapping algorithm that is conflict-free for one of these two templates has Ω(M/log M) conflicts on the other. Therefore, no mapping algorithm can be found that is conflict-free on both templates. We give an algorithm for mapping complete binary trees with N=2^M-1 nodes on M memory modules in such a way that: (a) the number of conflicts for accessing a subtree template or a root-to-leaf path template is O[√(M/log M)], (b) the load (i.e. the ratio between the maximum and minimum number of data items mapped on each module) is 1+o(1), and (c) the time complexity for retrieving the module where a given data item is stored is O(1) if a preprocessing phase of space and time complexity O(log N) is executed, or O(log log N) if no preprocessing is allowed.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121648319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast parallel computation of the polynomial shift","authors":"E. Zima","doi":"10.1109/IPPS.1997.580933","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580933","url":null,"abstract":"Given an n-degree polynomial f(x) over an arbitrary ring, the shift of f(x) by c is the operation which computes the coefficients of the polynomial f(x+c). In this paper, we consider the case when the shift by the given constant c has to be performed several times (repeatedly). We propose a parallel algorithm that is suited to an SIMD architecture to perform the shift in O(1) time if we have O(n²) processor elements available. The proposed algorithm is easy to generalize to multivariate polynomial shifts. The possibility of applying this algorithm to polynomials with coefficients from non-commutative rings is discussed, as well as the bit-wise complexity of the algorithm.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"198 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127929677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel 'Go with the winners' algorithms in the LogP model","authors":"Marcus Peinado, Thomas Lengauer","doi":"10.1109/IPPS.1997.580972","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580972","url":null,"abstract":"The authors parallelize the 'Go with the winners' algorithm of Aldous and Vazirani (1994) and analyze the resulting parallel algorithm in the LogP-model. The main issues in the analysis are load imbalances and communication delays. The result of the analysis is a practical algorithm which, under reasonable assumptions, achieves linear speedup. Finally, they analyze the algorithm for a concrete application: generating models of amorphous chemical structures.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132007214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}