PUNAS: A Parallel Ungapped-Alignment-Featured Seed Verification Algorithm for Next-Generation Sequencing Read Alignment
Yuandong Chan, Kai Xu, Haidong Lan, Weiguo Liu, Yongchao Liu, B. Schmidt
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2017. DOI: 10.1109/IPDPS.2017.35 (https://doi.org/10.1109/IPDPS.2017.35)
Abstract: The progress of next-generation sequencing has a major impact on medical and genomic research. This technology can now produce billions of short DNA fragments (reads) in a single run. One of the most demanding computational problems in almost every sequencing pipeline is short-read alignment, i.e., determining where each fragment originated in the original genome. Most current solutions are based on a seed-and-extend approach, where promising candidate regions (seeds) are first identified and subsequently extended in order to verify whether a full high-scoring alignment actually exists in the vicinity of each seed. Seed verification is the main bottleneck in many state-of-the-art aligners, so finding fast solutions is of high importance. We present the parallel ungapped-alignment-featured seed verification (PUNAS) algorithm, a fast filter for effectively removing the majority of false-positive seeds, thus significantly accelerating the short-read alignment process. PUNAS is based on bit-parallelism and takes advantage of the SIMD vector units of modern microprocessors. Our implementation employs a vectorize-and-scale approach supporting multi-core CPUs and many-core Knights Landing (KNL)-based Xeon Phi processors. Performance evaluation reveals that PUNAS is over three orders of magnitude faster than seed verification with the Smith-Waterman algorithm and around one order of magnitude faster than seed verification with the banded version of Myers' bit-vector algorithm. Using a single thread it achieves speedups of up to 7.3, 27.1, and 11.6 over the shifted Hamming distance filter on SSE-, AVX2-, and AVX-512-based CPUs/KNL, respectively. The speed of our framework further scales almost linearly with the number of cores. PUNAS is open-source software available at https://github.com/Xu-Kai/PUNASfilter.
Clustering Throughput Optimization on the GPU
M. Gowanlock, C. Rude, D. M. Blair, Justin D. Li, V. Pankratius
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2017. DOI: 10.1109/IPDPS.2017.17 (https://doi.org/10.1109/IPDPS.2017.17)
Abstract: Large datasets in astronomy and geoscience often require clustering and visualization of phenomena at different densities and scales in order to generate scientific insight. We examine the problem of maximizing clustering throughput for concurrent dataset clustering in spatial dimensions. We introduce a novel hybrid approach that uses GPUs in conjunction with multicore CPUs for algorithmic throughput optimizations. The key idea is to exploit the fast memory on the GPU for index searches and to optimize I/O transfers such that the low-bandwidth host-GPU bottleneck does not have a significant negative performance impact. To achieve this, we derive two distinct GPU kernels that exploit grid-based indexing schemes to improve clustering performance. To overcome limited GPU memory and enable large-dataset clustering, our method is complemented by an efficient batching scheme for transfers between the host and the GPU accelerator. This scheme is robust with respect to both sparse and dense data distributions and intelligently avoids buffer overflows that would otherwise degrade performance, all while minimizing the number of data transfers between the host and GPU. We evaluate our approaches on ionospheric total electron content datasets as well as intermediate-redshift galaxies from the Sloan Digital Sky Survey. Our hybrid approach yields a speedup of up to 50x over the sequential implementation in one of the experimental scenarios, which is respectable for I/O-intensive clustering.
Efficient and Deterministic Scheduling for Parallel State Machine Replication
O. Mendizabal, R. S. D. Moura, F. Dotti, F. Pedone
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2017. DOI: 10.1109/IPDPS.2017.29 (https://doi.org/10.1109/IPDPS.2017.29)
Abstract: Many services used in large-scale web applications should be able to tolerate faults without impacting their performance. State machine replication is a well-known approach to implementing fault-tolerant services, providing high availability and strong consistency. To boost the performance of state machine replication, recent proposals have introduced parallel execution of commands. In parallel state machine replication, incoming commands may or may not depend on other commands that are waiting for execution. Although dependent commands must be processed in the same relative order at every replica to avoid inconsistencies, independent commands can be executed in parallel and benefit from multi-core architectures. Since many application workloads are mostly composed of independent commands, these parallel models promise high throughput without sacrificing strong consistency. The efficient execution of commands in such environments, however, requires effective scheduling strategies. Existing approaches rely on dependency tracking based on pairwise comparison between commands, which introduces scheduling contention. In this paper, we propose a new and highly efficient scheduler for parallel state machine replication. Our scheduler considers batches of commands instead of individual commands. Moreover, each batch of commands is augmented with a compact data structure that encodes the command information needed for dependency analysis. We show, by means of experimental evaluation, that our technique outperforms existing schedulers for parallel state machine replication by a fairly large margin.
Large Scale Manycore-Aware PIC Simulation with Efficient Particle Binning
H. Nakashima, Yoshiki Summura, Keisuke Kikura, Y. Miyake
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2017. DOI: 10.1109/IPDPS.2017.65 (https://doi.org/10.1109/IPDPS.2017.65)
Abstract: We are now developing a manycore-aware implementation of multiprocessed PIC (particle-in-cell) simulation code with automatic load balancing. A key issue of the implementation is how to exploit the wide SIMD mechanism of manycore processors such as Intel Xeon Phi. Our solution is "particle binning" to rank all particles in a cell (voxel) in a chunk of SOA (structure-of-arrays) type one-dimensional arrays so that "particle-push" and "current-scatter" operations on them are efficiently SIMD-vectorized by our compiler. In addition, our sophisticated binning mechanism performs sorting of particles according to their positions "on-the-fly", efficiently coping with occasional "bin overflow" in a fully multithreaded manner. Our performance evaluation with up to 64 nodes of Cray XC30 and XC40 supercomputers, equipped with Xeon Phi 5120D (Knights Corner) and 7250 (Knights Landing) respectively, not only exhibited good parallel performance, but also proved the effectiveness of our binning mechanism.
{"title":"Elastic-Cache: GPU Cache Architecture for Efficient Fine- and Coarse-Grained Cache-Line Management","authors":"Bingchao Li, Ji-zhou Sun, M. Annavaram, N. Kim","doi":"10.1109/IPDPS.2017.81","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.81","url":null,"abstract":"GPUs provide high-bandwidth/low-latency on-chip shared memory and L1 cache to efficiently service a large number of concurrent memory requests (to contiguous memory space). To support warp-wide accesses to L1 cache, GPU L1 cache lines are very wide. However, such L1 cache architecture cannot always be efficiently utilized when applications generate many memory requests with irregular access patterns especially due to branch and memory divergences. In this paper, we propose Elastic-Cache that can efficiently support both fine- and coarse-grained L1 cache-line management for applications with both regular and irregular memory access patterns. Specifically, it can store 32- or 64-byte words in non-contiguous memory space to a single 128-byte cache line. Furthermore, it neither requires an extra tag storage structure nor reduces the capacity of L1 cache since it stores auxiliary tags for fine-grained L1 cache-line managements in sharedmemory space that is not fully used in many applications. Our experiment shows that Elastic-Cache improves the geo-mean performance of applications with irregular memory access patterns by 58% without degrading performance of applications with regular memory access patterns.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"2022 18","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114087177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV Enabled InfiniBand Clusters","authors":"Jie Zhang, Xiaoyi Lu, D. Panda","doi":"10.1109/IPDPS.2017.43","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.43","url":null,"abstract":"High-speed interconnects (e.g. InfiniBand) have been widely deployed on modern HPC clusters. With the emergence of HPC in the cloud, high-speed interconnects have paved their way into the cloud with recently introduced Single Root I/O Virtualization (SR-IOV) technology, which is able to provide efficient sharing of high-speed interconnect resources and achieve near-native I/O performance. However, recent studies have shown that SR-IOV-based virtual networks prevent virtual machine migration, which is an essential virtualization capability towards high availability and resource provisioning. Although several initial solutions have been pro- posed in the literature to solve this problem, our investigations show that there are still many restrictions on these proposed approaches, such as depending on specific network adapters and/or hypervisors, which will limit the usage scope of these solutions on HPC environments. In this paper, we propose a high-performance virtual machine migration framework for MPI applications on SR-IOV enabled InfiniBand clusters. Our proposed method does not need any modification to the hypervisor and InfiniBand drivers and it can efficiently handle virtual machine (VM) migration with SR-IOV IB device. Our evaluation results indicate that the proposed design is able to not only achieve fast VM migration speed but also guarantee the high performance for MPI applications during the migration in the HPC cloud. At the application level, for NPB LU benchmark running inside VM, our proposed design is able to completely hide the migration overhead through the computation and migration overlapping. Furthermore, our proposed design shows good scaling when migrating multiple VMs.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114613208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On Optimizing Distributed Tucker Decomposition for Dense Tensors
Venkatesan T. Chakaravarthy, Jee W. Choi, Douglas J. Joseph, Xing Liu, Prakash Murali, Yogish Sabharwal, D. Sreedhar
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2017. DOI: 10.1109/IPDPS.2017.86 (https://doi.org/10.1109/IPDPS.2017.86)
Abstract: The Tucker decomposition expresses a given tensor as the product of a small core tensor and a set of factor matrices. Our objective is to develop an efficient distributed implementation for the case of dense tensors. The implementation is based on the HOOI (Higher-Order Orthogonal Iteration) procedure, wherein the tensor-times-matrix product forms the core routine. Prior work has proposed heuristics for reducing the computational load and communication volume incurred by the routine. We study the two metrics in a formal and systematic manner and design strategies that are optimal under these two fundamental metrics. Our experimental evaluation on a large benchmark of tensors shows that the optimal strategies provide a significant reduction in load and volume compared to prior heuristics, and provide up to 7x speedup in the overall running time.
Mimir: Memory-Efficient and Scalable MapReduce for Large Supercomputing Systems
Tao Gao, Yanfei Guo, Boyu Zhang, Pietro Cicotti, Yutong Lu, P. Balaji, M. Taufer
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2017. DOI: 10.1109/IPDPS.2017.31 (https://doi.org/10.1109/IPDPS.2017.31)
Abstract: In this paper we present Mimir, a new implementation of MapReduce over MPI. Mimir inherits the core principles of existing MapReduce frameworks, such as MR-MPI, while redesigning the execution model to incorporate a number of sophisticated optimization techniques that achieve similar or better performance with a significant reduction in the amount of memory used. Consequently, Mimir allows significantly larger problems to be executed in memory, achieving large performance gains. We evaluate Mimir with three benchmarks on two high-end platforms to demonstrate its superiority over other frameworks.
ATM: Approximate Task Memoization in the Runtime System
I. Brumar, Marc Casas, Miquel Moretó, M. Valero, G. Sohi
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2017. DOI: 10.1109/IPDPS.2017.49 (https://doi.org/10.1109/IPDPS.2017.49)
Abstract: Redundant computations appear during the execution of real programs. Multiple factors contribute to these unnecessary computations, such as repetitive inputs and patterns, calling functions with the same parameters, or bad programming habits. Compilers minimize unnecessary code with static analysis. However, redundant execution might be dynamic, and there are no current approaches to reduce these inefficiencies. Additionally, many algorithms can be computed with different levels of accuracy. Approximate computing exploits this fact to reduce execution time at the cost of slightly less accurate results. In this case, expert developers determine the desired trade-off between performance and accuracy for each application. In this paper, we present Approximate Task Memoization (ATM), a novel approach in the runtime system that transparently exploits both dynamic redundancy and approximation at the task granularity of a parallel application. Memoization of previous task executions allows predicting the results of future tasks without having to execute them and without losing accuracy. To further increase performance improvements, the runtime system can memoize similar tasks, which leads to task-level approximate computing. By defining how to measure task similarity and correctness, we present an adaptive algorithm in the runtime system that automatically decides whether task approximation is beneficial. When evaluated on a real 8-core processor with applications from different domains (financial analysis, stencil computation, machine learning, and linear algebra), ATM achieves a 1.4x average speedup when applying only memoization techniques. When adding task approximation, ATM achieves a 2.5x average speedup with an average 0.7% accuracy loss (maximum of 3.2%).
Partitioning Low-Diameter Networks to Eliminate Inter-Job Interference
Nikhil Jain, A. Bhatele, Xiang Ni, T. Gamblin, L. Kalé
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2017. DOI: 10.1109/IPDPS.2017.91 (https://doi.org/10.1109/IPDPS.2017.91)
Abstract: On most supercomputers, except some torus-network-based systems, resource managers allocate nodes to jobs without considering the sharing of network resources by different jobs. Such network-oblivious resource allocations result in link sharing among multiple jobs, which can cause significant performance variability and performance degradation for individual jobs. In this paper, we explore low-diameter networks and corresponding node allocation policies that can eliminate inter-job interference. We propose a variation of n-dimensional mesh networks called the express mesh. An express mesh is denser than the corresponding mesh network, has a low diameter independent of the number of routers, and is easily partitionable. We compare the structural properties and performance of the express mesh with other popular low-diameter networks. We present practical node allocation policies for express mesh and fat-tree networks that not only eliminate inter-job interference and performance variability, but also improve overall performance.