{"title":"An Efficient Parallel Algorithm for the Multiple Longest Common Subsequence (MLCS) Problem","authors":"Dmitry Korkin, Qingguo Wang, Yi Shang","doi":"10.1109/ICPP.2008.79","DOIUrl":"https://doi.org/10.1109/ICPP.2008.79","url":null,"abstract":"Finding the multiple longest common subsequence (MLCS) is an important problem in the areas of bioinformatics and computational genomics. Approaches that are more efficient than the standard dynamic programming method have been introduced and successfully parallelized for the special case of two sequences. However, the increasing complexity and size of biological data require an efficient method applicable to an arbitrary number of sequences as well as its efficient parallelization. A recently developed dominant points method for the general MLCS problem has shown a significant performance improvement over the dynamic programming method when the number of sequences is larger than two. At the same time, the approach has revealed a strong demand for its parallelization, so that it can be applied to larger families of sequences or to longer sequences. In this paper, we introduce an efficient parallel algorithm to find an MLCS for an arbitrary number of sequences, which is based on the dominant points method. When the number of processors is not greater than the size of the alphabet multiplied by the number of sequences, the parallel algorithm is estimated to have asymptotically linear speedup. We experimentally tested the algorithm using sets of randomly generated sequences over different alphabets as well as the protein sequences from a family of homologous proteins. We found that the performance of the algorithm increases with the number of input sequences and reaches a near-linear speedup for eight sequences.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132787201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing an Efficient Kernel-Level and User-Level Hybrid Approach for MPI Intra-Node Communication on Multi-Core Systems","authors":"Lei Chai, P. Lai, Hyun-Wook Jin, D. Panda","doi":"10.1109/ICPP.2008.16","DOIUrl":"https://doi.org/10.1109/ICPP.2008.16","url":null,"abstract":"The emergence of multi-core processors has made MPI intra-node communication a critical component in high performance computing. In this paper, we use a three-step methodology to design an efficient MPI intra-node communication scheme from two popular approaches: shared memory and OS kernel-assisted direct copy. We use an Intel quad-core cluster for our study. We first run micro-benchmarks to analyze the advantages and limitations of these two approaches, including the impacts of processor topology, communication buffer reuse, process skew effects, and L2 cache utilization. Based on the results and the analysis, we propose topology-aware and skew-aware thresholds to build an optimized hybrid approach. Finally, we evaluate the impact of the hybrid approach on MPI collective operations and applications using IMB, NAS, PSTSWM, and HPL benchmarks. We observe that the optimized hybrid approach can improve the performance of MPI collective operations by up to 60%, and applications by up to 17%.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132202245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Techniques for Transparent Privatization in Software Transactional Memory","authors":"Virendra J. Marathe, Michael F. Spear, M. Scott","doi":"10.1109/ICPP.2008.69","DOIUrl":"https://doi.org/10.1109/ICPP.2008.69","url":null,"abstract":"We address the recently recognized privatization problem in software transactional memory (STM) runtimes, and introduce the notion of partially visible reads (PVRs) to heuristically reduce the overhead of transparent privatization. Specifically, PVRs avoid the need for a \"privatization fence\" in the absence of conflict with concurrent readers. We present several techniques to trade off the cost of enforcing partial visibility with the precision of conflict detection. We also consider certain special-case variants of our approach, e.g., for predominantly read-only workloads. We compare our implementations to prior techniques on a multicore Niagara1 system using a variety of artificial workloads. Our results suggest that while no one technique performs best in all cases, a dynamic hybrid of PVRs and strict in-order commits is stable and reasonably fast across a wide range of load parameters. At the same time, the remaining overheads are high enough to suggest the need for programming model or architectural support.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132739725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Scalable Architecture for Crowd Simulation: Implementing a Parallel Action Server","authors":"G. Vigueras, M. Lozano, C. Perez, J. Orduña","doi":"10.1109/ICPP.2008.20","DOIUrl":"https://doi.org/10.1109/ICPP.2008.20","url":null,"abstract":"Crowd simulation can be considered as a special case of virtual environments where avatars are intelligent agents instead of user-driven entities. These applications require both rendering visually plausible images of the virtual world and managing the behavior of autonomous agents. Although several proposals have focused on the software architectures for these systems, the scalability of crowd simulation is still an open issue. In this paper, we propose a scalable architecture that can manage large crowds of autonomous agents at interactive rates. This proposal consists of enhancing a previously proposed architecture through the efficient parallelization of the action server and the distribution of the semantic database. In this way, the system bottleneck is removed, and new action servers (hosted each one on a new computer) can be added as necessary. The evaluation results show that the proposed architecture is able to fully exploit the underlying hardware platform, regardless of both the number and the kind of computers that form the system. Therefore, this system architecture provides the scalability required for large-scale crowd simulation.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130254272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cellular ANTomata: Food-Finding and Maze-Threading","authors":"A. Rosenberg","doi":"10.1109/ICPP.2008.13","DOIUrl":"https://doi.org/10.1109/ICPP.2008.13","url":null,"abstract":"A model for realizing ant-inspired algorithms that coordinate robots within a fixed, geographically constrained environment is proposed and illustrated. The model, dubbed cellular ANTomata, inverts the relationship between ant-robots and the environment that they navigate: intelligence now resides in the environment rather than in the ants. The cellular ANTomaton model is illustrated via three proof-of-concept problems: having ants \"park\" in the nearest corner; having ants seek \"food items\" (both with and without impenetrable obstacles); having a single ant thread a maze. In all cases, \"unintelligent\" cellular-ANTomata-based ant-robots accomplish goals provably more efficiently than traditional \"intelligent\" ant-robots can; indeed, \"intelligent\" ant-robots cannot park at all! All of the presented algorithms are scalable: they provably work within any finite-size environment.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127711121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MFLUSH: Handling Long-Latency Loads in SMT On-Chip Multiprocessors","authors":"Carmelo Acosta, F. Cazorla, Alex Ramírez, M. Valero","doi":"10.1109/ICPP.2008.48","DOIUrl":"https://doi.org/10.1109/ICPP.2008.48","url":null,"abstract":"Nowadays, there is a clear trend in industry towards employing the growing number of transistors on chip in replicating execution cores (CMP), where each core is simultaneous multithreading (SMT). State-of-the-art high-performance processors like the IBM POWER5 and POWER6 corroborate this CMP+SMT trend. Within each SMT core any of the well-known SMT mechanisms may be applied to face SMT related challenges. Among them, probably the most important issue in an SMT execution pipeline concerns the instruction fetch (IFetch) Policy. The FLUSH IFetch Policy represents a choice for throughput-oriented scenarios. It handles L2 cache misses in order to avoid hardware resource monopolization by any given execution thread, at an additional energy cost due to instruction refetching. However, the new constraints imposed by the CMP+SMT scenario may affect well-known SMT mechanisms, like the FLUSH mechanism. In this paper we revisit the FLUSH mechanism and analyze its application in the emerging CMP+SMT scenario. The included analysis points out the new difficulties to be faced by the FLUSH mechanism in the emerging CMP+SMT scenario. Then we propose a novel IFetch Policy designed to cope with the CMP+SMT scenario: the MFLUSH. We also include a complete evaluation of the MFLUSH policy, both in terms of throughput and energy consumption. Our results indicate that the MFLUSH, specifically designed for the emerging CMP+SMT scenario, succeeds not only in overcoming the specific CMP+SMT constraints but also in allowing a 20% energy consumption reduction without a significant system throughput loss.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127853843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimization of All-to-All Communication on the Blue Gene/L Supercomputer","authors":"Sameer Kumar, Yogish Sabharwal, R. Garg, P. Heidelberger","doi":"10.1109/ICPP.2008.83","DOIUrl":"https://doi.org/10.1109/ICPP.2008.83","url":null,"abstract":"All-to-all communication is a well known performance bottleneck for many applications, such as the ones that use the Fast-Fourier-transform (FFT) algorithm. We analyze the performance of all-to-all communication on the BlueGene/L torus interconnect that has link contention even for all-to-all operations with short messages. We observed that the performance of all-to-all depends on the shape of the processor partition. We present a performance analysis of all-to-all on partitions of various shapes. We then present optimization schemes that substantially improve the performance of all-to-all with short and large messages. In particular, throughput improved from 64% to over 99% of peak on the 65,536 (64 × 32 × 32) node Blue Gene/L machine at the Lawrence Livermore National Lab. We show the impact of the all-to-all performance optimizations in 1-D and 3-D FFT benchmarks. We achieved a performance of over 2.8 TF for the HPC Challenge 1D FFT benchmark with our optimized all-to-all.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123953627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Priority Enforcement via Non-Work-Conserving Scheduling","authors":"J. C. Saez, J. I. Gómez, M. Prieto","doi":"10.1109/ICPP.2008.38","DOIUrl":"https://doi.org/10.1109/ICPP.2008.38","url":null,"abstract":"Current operating system schedulers are not fully aware of multi-core and multi-threaded architectures, and as a result, schedule threads in a way that may cause contention for critical resources such as the last level in the cache memory hierarchy or the memory access bandwidth. This contention has a significant impact on the system productivity and the quality of service that each individual thread gets from the platform, which can widely vary depending on the behavior of its simultaneous co-runners. In this paper we describe the design and implementation of a non-work-conserving framework to schedule threads that tries to improve priority enforcement, based on on-line statistics collected through hardware performance counters. We have implemented our scheme in Linux running on both multicore and SMT processors. For synthetic workloads based on the latest SPEC CPU2006 benchmarks, our framework speeds up high-priority threads by up to 50%, while keeping or even slightly improving the overall system throughput.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"177 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124377995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimized Workflow Orchestration of Database Aggregate Operations on Heterogenous Grids","authors":"W. Mach, E. Schikuta","doi":"10.1109/ICPP.2008.12","DOIUrl":"https://doi.org/10.1109/ICPP.2008.12","url":null,"abstract":"This paper presents an analytical discussion of parallel database aggregate algorithms (such as count, sum, average, etc.) on grids, compares the findings to the classical generalized multiprocessor framework, and describes an optimization algorithm to maximize performance for a heterogeneous environment. In this context we develop a concise and comprehensive analytical model for database sub-queries with parallel aggregate operators. Based on these results, the paper proves that, by smart enhancement exploiting the heterogeneity of the grid, the performance of the algorithms for aggregate operations can be increased remarkably.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121154882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Minimum Traffic Cost and Minimum Response Latency: A Novel Dynamic Query Protocol in Unstructured P2P Networks","authors":"Chen Tian, Hongbo Jiang, Xue Liu, Wenyu Liu, Yi Wang","doi":"10.1109/ICPP.2008.78","DOIUrl":"https://doi.org/10.1109/ICPP.2008.78","url":null,"abstract":"Controlled-flooding algorithms are widely used in unstructured networks. Expanding ring (ER) achieves low response delay, but its traffic cost is huge; dynamic querying (DQ) is known for its desirable behavior in traffic control, but it achieves lower search cost at the price of undesirable latency; enhanced dynamic querying (DQ+) can also reduce the search latency, but it is hard to determine a generally optimal parameter set. In this paper, a novel algorithm named selective dynamic query (SDQ) is proposed. Unlike previous works that awkwardly process floating TTL values, SDQ properly selects an integer TTL value and a set of neighbors to narrow the scope of the next query. Our experiments demonstrate that SDQ provides finer-grained control than other algorithms: its latency is close to the well-known minimum achieved by ER, while its traffic cost is also close to the minimum. To the best of our knowledge, this is the first work capable of achieving the best performance in terms of both response latency and traffic cost. In addition, our experiments also demonstrate that SDQ works well in various network topologies.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116286603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}