{"title":"Improved Parallel and Sequential Walking Tree Methods for Biological String Alignments","authors":"P. Cull, Tai Hsu","doi":"10.1145/331532.331583","DOIUrl":"https://doi.org/10.1145/331532.331583","url":null,"abstract":"Approximate string matching is commonly used to align genetic sequences (DNA or RNA) to determine their shared characteristics. Most genetic string matching methods are based on the edit-distance model, which does not provide alignments for inversions and translocations. Recently, a heuristic called the Walking Tree Method [2, 3] has been developed to solve this problem. Unlike other heuristics, it can handle more than one level of inversion, i.e., inversions within inversions. Furthermore, it tends to capture the matched strings' genes while other heuristics fail. There are two versions of the original walking tree heuristics: the score version gives only the alignment score, the alignment version gives both the score and the alignment mapping between the strings. The score version runs in quadratic time and uses linear space while the alignment version uses an extra log factor for time and space. In this paper, we will briefly describe the walking tree method and the original sequential and parallel algorithms. We will explain why different parallel algorithms are needed for a network of workstations rather than the original algorithm which worked well on a symmetric multi-processor. Our improved parallel method also led to a quadratic time sequential algorithm that uses less space. We give an example of our parallel method. We describe several experiments that show speedup linear in the number of processors, but eventual drop off in speedup as the communication network saturates. For big enough strings, we found linear speedup for all processors we had available. These results suggest that our improved parallel method will scale up as both the size of the problem and the number of processors increase. We include two figures that use real biological data and show that the walking tree methods can find translocations and inversions in DNA sequences and also discover unknown genes.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121800440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scal-Tool: Pinpointing and Quantifying Scalability Bottlenecks in DSM Multiprocessors","authors":"J. Torrellas, Yan Solihin, V. Lam","doi":"10.1145/331532.331549","DOIUrl":"https://doi.org/10.1145/331532.331549","url":null,"abstract":"Distributed Shared-Memory (DSM) multiprocessors provide an attractive combination of cost-effective commodity architecture and, thanks to the shared-memory abstraction, relative ease of programming. Unfortunately, it is well known that tuning applications for scalable performance in these machines is time-consuming. To address this problem, programmers use performance monitoring tools. However, these tools are often costly to run, especially if highly-processed information is desired. In addition, they usually cannot be used to experiment with hypothetical architecture organizations. In this paper, we present Scal-Tool, a tool that isolates and quantifies scalability bottlenecks in parallel applications running on DSM machines. The scalability bottlenecks currently quantified include insufficient caching space, load imbalance, and synchronization. The tool is based on an empirical model that uses as inputs measurements from hardware event counters in the processor. A major advantage of the tool is that it is quite inexpensive to run: it only needs the event counter values for the application running with a few different processor counts and data set sizes. In addition, it provides ways to analyze variations of several machine parameters.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127849296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory Characteristics of Iterative Methods","authors":"Christian Weiß, Wolfgang Karl, M. Kowarschik, U. Rüde","doi":"10.1145/331532.331563","DOIUrl":"https://doi.org/10.1145/331532.331563","url":null,"abstract":"Conventional implementations of iterative numerical algorithms, especially multigrid methods, merely reach a disappointing small percentage of the theoretically available CPU performance when applied to representative large problems. One of the most important reasons for this phenomenon is that the current DRAM technology cannot provide the data fast enough to keep the CPU busy. Although the fundamentals of cache optimizations are quite simple, current compilers cannot optimize even elementary iterative schemes. In this paper, we analyze the memory and cache behavior of iterative methods with extensive profiling and describe program transformation techniques to improve the cache performance of two- and three-dimensional multigrid algorithms.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127892818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient Communication Architecture for Commodity Supercomputers","authors":"S. Brauss, M. Lienhard, J. Nemecek, A. Gunzinger, M. Naf, M. Frey, M. Heimlicher, A. Huber, Pierre-Alain Muller, R. Paul","doi":"10.1145/331532.331551","DOIUrl":"https://doi.org/10.1145/331532.331551","url":null,"abstract":"The goal of the Swiss-Tx project is to develop, build and install the first Swiss tera-flop supercomputer called Swiss-T2, which is mainly based on commodity parts. Only the communication hardware and communication software is custom-made, because available off-the-shelf products, such as Ethernet with the socket interface, do not offer the necessary bandwidth, latency, and functionality. In this paper, we present a new efficient communication architecture for commodity super-computing called Fast Communication Interface (FCI), and we introduce T-NET, the custom-made high-performance communication hardware for the Swiss-Tx supercomputers. The highlights are low-latency, high-bandwidth, and portability. Portability means that the communication hardware and software is mainly platform independent and that a large number of modern workstations and standard operating systems can be used as they are. A full implementation of the standardized MPI (Message Passing Interface), written entirely on top of FCI, is also available.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124901375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Two-level Thread Management for Fast MPI Execution on Shared Memory Machines","authors":"Kai Shen, Hong Tang, Tao Yang","doi":"10.1145/331532.331581","DOIUrl":"https://doi.org/10.1145/331532.331581","url":null,"abstract":"This paper addresses performance portability of MPI code on multiprogrammed shared memory machines. Conventional MPI implementations map each MPI node to an OS process, which suffers severe performance degradation in multiprogrammed environments. Our previous work (TMPI) has developed compile/run-time techniques to support threaded MPI execution by mapping each MPI node to a kernel thread. However, kernel threads have context switch cost higher than user-level threads and this leads to longer spinning time requirement during MPI synchronization. This paper presents an adaptive two-level thread scheme for MPI to reduce context switch and synchronization cost. This scheme also exposes thread scheduling information at user-level, which allows us to design an adaptive event waiting strategy to minimize CPU spinning and exploit cache affinity. Our experiments show that the MPI system based on the proposed techniques has great performance advantages over the previous version of TMPI and the SGI MPI implementation in multiprogrammed environments. The improvement ratio can reach as much as 161% or even more depending on the degree of multiprogramming.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126795859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Network Performance Tool for Grid Environments","authors":"Craig A. Lee, J. Stepanek, R. Wolski, C. Kesselman, Ian T Foster","doi":"10.1145/331532.331536","DOIUrl":"https://doi.org/10.1145/331532.331536","url":null,"abstract":"In grid computing environments, network bandwidth discovery and allocation is a serious issue. Before their applications are running, grid users will need to choose hosts based on available bandwidth. Running applications may need to adapt to a changing set of hosts. Hence, a tool is needed for monitoring network performance that is integral to the grid environment. To address this need, Gloperf was developed as part of the Globus grid computing toolkit. Gloperf is designed for ease of deployment and makes simple, end-to-end TCP measurements requiring no special host permissions. Scalability is addressed by a hierarchy of measurements based on group membership and by limiting overhead to a small, acceptable, fixed percentage of the available bandwidth. Since this fixed overhead may push host-pair revisit time into the tens-of-hours, we also quantitatively examine the \"trajectory\" of the cost-error trade-off for measurement frequency.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114522739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Performance of Sparse Matrix-Vector Multiplication","authors":"Ali Pinar, M. Heath","doi":"10.1145/331532.331562","DOIUrl":"https://doi.org/10.1145/331532.331562","url":null,"abstract":"Sparse matrix-vector multiplication (SpMxV) is one of the most important computational kernels in scientific computing. It often suffers from poor cache utilization and extra load operations because of memory indirections used to exploit sparsity. We propose alternative data structures, along with reordering algorithms to increase effectiveness of these data structures, to reduce the number of memory indirections. Toledo proposed handling the 1x2 blocks of a matrix separately, doing only one indirection for each block. We propose packing all contiguous nonzeros into a block to reduce the number of memory indirections further. This reduces memory indirections per block to one for the cost of an extra array in storage and a loop during SpMxV. We also propose an algorithm to permute the nonzeros of the matrix into contiguous locations. We state this problem as the traveling salesperson problem and use associated heuristics. Experiments verify the effectiveness of our techniques.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128365911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Electromagnetic Scattering Calculations on the SGI Origin 2000","authors":"J. Ottusch, M. Stalzer, J. Visher, S. Wandzura","doi":"10.1145/331532.331586","DOIUrl":"https://doi.org/10.1145/331532.331586","url":null,"abstract":"We describe the FastScatTM program for electromagnetic scattering calculations and its parallel implementation on the SGI Origin 2000. FastScat recently computed the radar cross section of a sphere having an area of 45,239lambda2 to high accuracy in about a day. This is contrasted with a result for an 354lambda2 sphere reported at Supercomputing '92. Taking both size and accuracy into account, the FastScat result represents an improvement in solution time of over nine orders of magnitude. This improvement was due to systematically focusing on several issues that impact the scalability of electromagnetic scattering calculations.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128769107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Numerical Simulation and Immersive Visualization of Hairpin Vortices","authors":"H. Tufo, P. Fischer, M. Papka, Kristopher J. Blom","doi":"10.1145/331532.331594","DOIUrl":"https://doi.org/10.1145/331532.331594","url":null,"abstract":"To better understand the vortex dynamics of coherent structures in turbulent and transitional boundary layers, we consider direct numerical simulation of the interaction between a flat-plate-boundary-layer flow and an isolated hemispherical roughness element. Of principal interest is the evolution of hairpin vortices that form an interlacing pattern in the wake of the hemisphere, lift away from the wall, and are stretched by the shearing action of the boundary layer. Using animations of unsteady three-dimensional representations of this flow, produced by the vtk toolkit and enhanced to operate in a CAVE virtual environment, we identify and study several key features in the evolution of this complex vortex topology not previously observed in other visualization formats.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130570970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MOM: a Matrix SIMD Instruction Set Architecture for Multimedia Applications","authors":"J. Corbal, R. Espasa, M. Valero","doi":"10.1145/331532.331547","DOIUrl":"https://doi.org/10.1145/331532.331547","url":null,"abstract":"MOM is a novel matrix-oriented ISA paradigm for multimedia applications, based on fusing conventional vector ISAs with SIMD ISAs such as MMX. This paper justifies why MOM is a suitable alternative for the multimedia domain due to its efficiency handling the small matrix structures typically found in most multimedia kernels. MOM leverages a performance boost between 1.3x and 4x over more conventional multimedia extensions (such as MMX and MDMX), which already achieve performance benefits ranging from 1.3x to 15x over conventional Alpha code. Moreover, MOM exhibit a high relative performance for low-issue rates and a high tolerance to memory latency. Both advantages present MOM as an attractive alternative for the embedded domain.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"50 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120923284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}