{"title":"A GPU-Based Algorithm-Specific Optimization for High-Performance Background Subtraction","authors":"Chulian Zhang, H. Tabkhi, G. Schirner","doi":"10.1109/ICPP.2014.27","DOIUrl":"https://doi.org/10.1109/ICPP.2014.27","url":null,"abstract":"Background subtraction is an essential first stage in many vision applications, differentiating foreground pixels from the background scene, with Mixture of Gaussians (MoG) being a widely used implementation choice. MoG's high computation demand renders a real-time single-threaded realization infeasible. With its pixel-level parallelism, deploying MoG on parallel architectures such as a Graphics Processing Unit (GPU) is promising. However, MoG poses many challenges, having significant control flow (potentially reducing GPU efficiency) as well as a significant memory bandwidth demand. In this paper, we propose a GPU implementation of Mixture of Gaussians (MoG) that surpasses real-time processing for full HD (1080p 60 Hz). This paper describes step-wise optimizations starting from general GPU optimizations (such as memory coalescing and computation & communication overlapping), via algorithm-specific optimizations including control flow reduction and register usage optimization, to windowed optimization utilizing shared memory. For each optimization, this paper evaluates the performance potential and identifies architectural bottlenecks. 
Our CUDA-based implementation improves performance over a sequential implementation by 57×, 97×, and 101× through general, algorithm-specific, and windowed optimizations respectively, without impact on output quality.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127581490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. Subramoni, K. Kandalla, Jithin Jose, K. Tomko, K. Schulz, D. Pekurovsky, D. Panda
{"title":"Designing Topology-Aware Communication Schedules for Alltoall Operations in Large InfiniBand Clusters","authors":"H. Subramoni, K. Kandalla, Jithin Jose, K. Tomko, K. Schulz, D. Pekurovsky, D. Panda","doi":"10.1109/ICPP.2014.32","DOIUrl":"https://doi.org/10.1109/ICPP.2014.32","url":null,"abstract":"Network contention is a significant factor affecting the performance of communication-intensive operations like All-to-all exchanges used for transpose operations of multi-dimensional FFTs on modern supercomputing systems. Over the last decade InfiniBand has become an increasingly popular interconnect for deploying these systems. However, no practical schemes exist that allow the users of these systems to perform these communication operations in a network-topology-aware manner. In this paper we propose multiple schemes to create network-topology-aware communication schedules for All-to-all FFT operations that reduce the volume of contention encountered by the operations. Through careful study and analysis of communication performance we derive critical factors that result in network contention in large-scale InfiniBand clusters. We propose enhancements to our topology discovery service to generate the path matrix in a scalable and efficient manner. Through our techniques, we are able to significantly reduce the amount of network contention observed during the Alltoall / FFT operations. 
The results of our experimental evaluation indicate that our proposed technique is able to deliver up to a 12% improvement in the communication time of P3DFFT at 4,096 processes.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132245089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andra Hugo, A. Guermouche, Pierre-André Wacrenier, R. Namyst
{"title":"A Runtime Approach to Dynamic Resource Allocation for Sparse Direct Solvers","authors":"Andra Hugo, A. Guermouche, Pierre-André Wacrenier, R. Namyst","doi":"10.1109/ICPP.2014.57","DOIUrl":"https://doi.org/10.1109/ICPP.2014.57","url":null,"abstract":"To face the advent of multicore processors and the ever-increasing complexity of hardware architectures, programming models based on DAG-of-tasks parallelism regained popularity in the high-performance scientific computing community. In this context, enabling HPC applications to perform efficiently when dealing with graphs of parallel tasks that could potentially run simultaneously is a great challenge. Even if a uniform runtime system is used underneath, scheduling multiple parallel tasks over the same set of hardware resources introduces many issues, such as undesirable cache flushes or memory bus contention. In this paper, we show how runtime system-based scheduling contexts can be used to dynamically enforce locality of parallel tasks on multicore machines. We extend an existing generic sparse direct solver to use our mechanism and introduce a new decomposition method based on proportional mapping that is used to build the scheduling contexts. We propose a runtime-level dynamic context management policy to cope with the very irregular behaviour of the application. 
A detailed performance analysis shows significant performance improvements of the solver over various multicore hardware.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124075017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lightweight Software Transactions on GPUs","authors":"Anup Holey, Antonia Zhai","doi":"10.1109/ICPP.2014.55","DOIUrl":"https://doi.org/10.1109/ICPP.2014.55","url":null,"abstract":"Graphics Processing Units (GPUs) provide an attractive option for extracting data-level parallelism from diverse applications. However, some applications, although they possess abundant data-level parallelism, exhibit irregular memory access patterns to shared data structures. Porting such applications to GPUs requires synchronization mechanisms such as locks, which significantly increase the programming complexity. Coarse-grained locking, where a single lock controls all shared resources, reduces programming effort but can substantially serialize GPU threads. On the other hand, fine-grained locking, where each data element is protected by an independent lock, facilitates maximum parallelism but requires significant programming effort. To overcome these challenges, we propose to support software transactional memory (STM) on GPUs, achieving performance comparable to fine-grained locking while requiring minimal programming effort. Software-based transactional execution can incur significant runtime overheads due to activities such as detecting conflicts across thousands of GPU threads and managing a consistent memory state. Thus, in this paper we illustrate three lightweight STM designs that are capable of scaling to a large number of GPU threads. 
In our system, programmers simply mark the critical sections in the applications, and the underlying STM support is able to achieve performance comparable to fine-grained locking.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127885760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FEVES: Framework for Efficient Parallel Video Encoding on Heterogeneous Systems","authors":"A. Ilic, S. Momcilovic, N. Roma, L. Sousa","doi":"10.1109/ICPP.2014.11","DOIUrl":"https://doi.org/10.1109/ICPP.2014.11","url":null,"abstract":"Driven by the high-performance computing potential of modern heterogeneous desktop systems and the predominance of video content in general applications, we propose an autonomous unified video encoding framework for hybrid multi-core CPU and multi-GPU platforms. To fully exploit the capabilities of these platforms, the proposed framework integrates simultaneous execution control, automatic data access management, and adaptive scheduling and load balancing strategies to deal with the overall complexity of the video encoding procedure. These strategies treat the collaborative inter-loop encoding as a unified optimization problem to efficiently exploit several levels of concurrency between computation and communication. To support a wide range of CPU and GPU architectures, a specific encoding library is developed with highly optimized algorithms for all inter-loop modules. 
The obtained experimental results show that the proposed framework achieves real-time encoding of full high-definition sequences on state-of-the-art CPU+GPU systems, outperforming individual GPU and quad-core CPU executions by more than 2× and 5×, respectively.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114617445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Pointer Analysis with CFL-Reachability","authors":"Yu Su, Ding Ye, Jingling Xue","doi":"10.1109/ICPP.2014.54","DOIUrl":"https://doi.org/10.1109/ICPP.2014.54","url":null,"abstract":"This paper presents the first parallel implementation of pointer analysis with Context-Free Language (CFL) reachability, an important foundation for supporting demand queries in compiler optimisation and software engineering. Formulated as a graph traversal problem (often with context- and field-sensitivity for desired precision) and driven by queries (issued often in batch mode), this analysis is non-trivial to parallelise. We introduce a parallel solution to the CFL-reachability-based pointer analysis, with context- and field-sensitivity. We exploit its inherent parallelism by avoiding redundant graph traversals with two novel techniques, data sharing and query scheduling. With data sharing, paths discovered in answering a query are recorded as shortcuts so that subsequent queries will take the shortcuts instead of re-traversing its associated paths. With query scheduling, queries are prioritised according to their statically estimated dependences so that more redundant traversals can be further avoided. 
Evaluated using a set of 20 Java programs, our parallel implementation of CFL-reachability-based pointer analysis achieves an average speedup of 16.2X over a state-of-the-art sequential implementation on 16 CPU cores.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127277529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"R-Dedup: Content Aware Redundancy Management for SSD-Based RAID Systems","authors":"Yimo Du, Youtao Zhang, Nong Xiao","doi":"10.1109/ICPP.2014.20","DOIUrl":"https://doi.org/10.1109/ICPP.2014.20","url":null,"abstract":"While high-density SSDs are increasingly adopted in enterprise computing environments, it remains a challenge to meet the high performance and reliability demands of server applications as well as the demands for longer system lifetime and high space utilization in such environments. Existing schemes often address these issues separately. In particular, deduplication schemes improve write performance and SSD lifetime while SSD-based RAID designs improve reliability and read performance. Naive integration of deduplication and RAID results in suboptimal designs. In this paper, we propose R-Dedup, a content-aware redundancy management scheme for SSD-based RAID storage. By combining deduplication with replication, R-Dedup evaluates system performance, reliability, endurance, and space utilization, and dynamically manages replicas to achieve a better trade-off. Our experimental results show that R-Dedup achieves 18% and 20% improvements on read and write performance, respectively, and extends SSD lifetime by 20% with no reliability compromise.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124094651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Samyam Rajbhandari, Akshay Nikam, Pai-Wei Lai, Kevin Stock, S. Krishnamoorthy, P. Sadayappan
{"title":"CAST: Contraction Algorithm for Symmetric Tensors","authors":"Samyam Rajbhandari, Akshay Nikam, Pai-Wei Lai, Kevin Stock, S. Krishnamoorthy, P. Sadayappan","doi":"10.1109/ICPP.2014.35","DOIUrl":"https://doi.org/10.1109/ICPP.2014.35","url":null,"abstract":"Tensor contractions represent the most compute-intensive core kernels in ab initio computational quantum chemistry and nuclear physics. Symmetries in these tensor contractions make them difficult to load balance and scale to large distributed systems. In this paper, we develop an efficient and scalable algorithm to contract symmetric tensors. We introduce a novel approach that avoids data redistribution during contraction of symmetric tensors while also bypassing redundant storage and maintaining load balance. We present experimental results on two parallel supercomputers for several symmetric contractions that appear in the coupled cluster singles and doubles (CCSD) quantum chemistry method. We also present a novel approach to tensor redistribution that can take advantage of parallel hyperplanes when the initial distribution has replicated dimensions, and use collective broadcast when the final distribution has replicated dimensions, making the algorithm very efficient.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"518 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116243571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simulating Big Data Clusters for System Planning, Evaluation, and Optimization","authors":"Zhaojuan Bian, Kebing Wang, Zhihong Wang, Gene Munce, Illia Cremer, Wei Zhou, Qian Chen, Gen Xu","doi":"10.1109/ICPP.2014.48","DOIUrl":"https://doi.org/10.1109/ICPP.2014.48","url":null,"abstract":"With the fast development of big data technologies, IT spending on computer clusters is increasing rapidly as well. In order to minimize the cost, architects must plan big data clusters with careful evaluation of various design choices. Current capacity planning methods are mostly trial-and-error or based on high-level estimation. These approaches, however, are far from efficient, especially with the increasing hardware diversity and software stack complexity. In this paper, we present CSMethod, a novel cluster simulation methodology, to facilitate efficient cluster capacity planning, performance evaluation, and optimization before system provisioning. With our proposed methodology, software stacks are simulated by an abstract yet high-fidelity model. Hardware activities derived from software operations are dynamically mapped onto architecture models for processors, memory, storage, and networking devices. This hardware/software hybrid methodology allows low-overhead, fast, and accurate cluster simulation that can be easily carried out on a standard client platform (desktop or laptop). Our experimental results with six popular Hadoop workloads demonstrate that CSMethod can achieve an average error rate of less than six percent across various software parameters and cluster hardware configurations. We also illustrate the application of the proposed methodology with two real-world use cases: video-streaming service system planning and Terasort cluster optimization. 
All our experiments are run on a commodity laptop with execution speeds faster than native executions on a multi-node high-end cluster.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124907066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Kylix: A Sparse Allreduce for Commodity Clusters","authors":"Huasha Zhao, J. Canny","doi":"10.1109/ICPP.2014.36","DOIUrl":"https://doi.org/10.1109/ICPP.2014.36","url":null,"abstract":"Allreduce is a basic building block for parallel computing. Our target here is \"Big Data\" processing on commodity clusters (mostly sparse power-law data). Allreduce can be used to synchronize models, to maintain distributed datasets, and to perform operations on distributed data such as sparse matrix multiply. We first review a key constraint on cluster communication, the minimum efficient packet size, which hampers the use of direct all-to-all protocols on large networks. Our allreduce network is a nested, heterogeneous-degree butterfly. We show that communication volume in lower layers is typically much less than in the top layer, and that total communication across all layers is only a small constant factor larger than the top layer, which is close to optimal. A chart of network communication volume across layers has a characteristic \"Kylix\" shape, which gives the method its name. For optimum performance, the butterfly degrees also decrease down the layers. Furthermore, to efficiently route sparse updates to the nodes that need them, the network must be nested. While the approach is amenable to various kinds of sparse data, almost all \"Big Data\" sets show power-law statistics, and from the properties of these, we derive methods for optimal network design. 
Finally, we present experiments with Kylix on Amazon EC2, demonstrating significant improvements over existing systems such as PowerGraph and Hadoop.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127721897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}