{"title":"DC-Top-k: A Novel Top-k Selecting Algorithm and Its Parallelization","authors":"Z. Xue, Ruixuan Li, Heng Zhang, X. Gu, Zhiyong Xu","doi":"10.1109/ICPP.2016.49","DOIUrl":"https://doi.org/10.1109/ICPP.2016.49","url":null,"abstract":"Sorting is a basic computational task in Computer Science. As a variant of the sorting problem, top-k selecting have been widely used. To our knowledge, on average, the state-of-the-art top-k selecting algorithm Partial Quicksort takes C(n, k) = 2(n+1)Hn+2n-6k+6-2(n+3-k)Hn+1-k comparisons and about C(n, k)/6 exchanges to select the largest k terms from n terms, where Hn denotes the n-th harmonic number. In this paper, a novel top-k algorithm called DC-Top-k is proposed by employing a divide-and-conquer strategy. By a theoretical analysis, the algorithm is proved to be competitive with the state-of-the-art top-k algorithm on the compare time, with a significant improvement on the exchange time. On average, DC-Top-k takes at most (2-1/k)n+O(klog2k) comparisons and O(klog2k) exchanges to select the largest k terms from n terms. The effectiveness of the proposed algorithm is verified by a number of experiments which show that DC-Top-k is 1-3 times faster than Partial Quicksort and, moreover, is notably stabler than the latter. With an increase of k, it is also significantly more efficient than Min-heap based top-k algorithm (U. S. Patent, 2012). In the end, DC-Top-k is naturally implemented in a parallel computing environment, and a better scalability than Partial Quicksort is also demonstrated by experiments.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124850648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Data Transfer Throughput with Direct Search Optimization","authors":"Prasanna Balaprakash, V. Morozov, R. Kettimuthu, Kalyan Kumaran, Ian T Foster","doi":"10.1109/ICPP.2016.36","DOIUrl":"https://doi.org/10.1109/ICPP.2016.36","url":null,"abstract":"Improving data transfer throughput over high-speed long-distance networks has become increasingly difficult. Numerous factors such as nondeterministic congestion, dynamics of the transfer protocol, and multiuser and multitask source and destination endpoints, as well as interactions among these factors, contribute to this difficulty. A promising approach to improving throughput consists in using parallel streams at the application layer. We formulate and solve the problem of choosing the number of such streams from a mathematical optimization perspective. We propose the use of direct search methods, a class of easy-to-implement and light-weight mathematical optimization algorithms, to improve the performance of data transfers by dynamically adapting the number of parallel streams in a manner that does not require domain expertise, instrumentation, analytical models, or historic data. We apply our method to transfers performed with the GridFTP protocol, and illustrate the effectiveness of the proposed algorithm when used within Globus, a state-of-the-art data transfer tool, on production WAN links and servers. We show that when compared to user default settings our direct search methods can achieve up to 10x performance improvement under certain conditions. We also show that our method can overcome performance degradation due to external compute and network load on source end points, a common scenario at high performance computing facilities.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131468957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Guaranteed Bang for the Buck: Modeling VDI Applications with Guaranteed Quality of Service","authors":"Hao Wen, D. Du, Milan M. Shetti, Doug Voigt, Shanshan Li","doi":"10.1109/ICPP.2016.55","DOIUrl":"https://doi.org/10.1109/ICPP.2016.55","url":null,"abstract":"In cloud environment, most services are provided by virtual machines (VMs). Providing storage quality of service (QoS) for VMs is essential to user experiences while challenging. It first requires an accurate estimate and description of VM requirements, however, people usually describe this via rules of thumb. The problems are exacerbated by the diversity and special characteristics of VMs in a computing environment. This paper chooses Virtual Desktop Infrastructure (VDI), a prevalent and complicated VM application, to characterize QoS requirements of VMs and to guarantee QoS with minimal required resources. We create a model to describe QoS requirements of VDI. We have collected real VDI traces from HP to validate the correctness of the model. Then we generate QoS requirements of VDI and determine bottlenecks. Based on this, we can tell what minimum capability a storage appliance needs in order to satisfy a given VDI configuration and QoS requirements. By comparing with industry experience, we validate our model. And our model can describe more fine-grained VM requirements varying with time and virtual disk types, and provide more confidence on sizing storage for VDI as well.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127639493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AccuracyTrader: Accuracy-Aware Approximate Processing for Low Tail Latency and High Result Accuracy in Cloud Online Services","authors":"Rui Han, Siguang Huang, Fei Tang, Fu-Gui Chang, Jianfeng Zhan","doi":"10.1109/ICPP.2016.39","DOIUrl":"https://doi.org/10.1109/ICPP.2016.39","url":null,"abstract":"Modern latency-critical online services such as search engines often process requests by consulting large input data spanning massive parallel components. Hence the tail latency of these components determines the service latency. To trade off result accuracy for tail latency reduction, existing techniques use the components responding before a specified deadline to produce approximate results. However, they may skip a large proportion of components when load gets heavier, thus incurring large accuracy losses. This paper presents AccuracyTrader that produces approximate results with small accuracy losses while maintaining low tail latency. AccuracyTrader aggregates information of input data on each component to create a small synopsis, thus enabling all components producing initial results quickly using their synopses. AccuracyTrader also uses synopses to identify the parts of input data most related to arbitrary requests' result accuracy, thus first using these parts to improve the produced results in order to minimize accuracy losses. We evaluated AccuracyTrader using workloads in real services. The results show: (i) AccuracyTrader reduces tail latency by over 40 times with accuracy losses of less than 7% compared to existing exact processing techniques, (ii) when using the same latency, AccuracyTrader reduces accuracy losses by over 13 times comparing to existing approximate processing techniques.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123942770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Massively-Parallel Lossless Data Decompression","authors":"Evangelia A. Sitaridi, René Müller, T. Kaldewey, G. Lohman, K. A. Ross","doi":"10.1109/ICPP.2016.35","DOIUrl":"https://doi.org/10.1109/ICPP.2016.35","url":null,"abstract":"Today's exponentially increasing data volumes and the high cost of storage make compression essential for the Big Data industry. Although research has concentrated on efficient compression, fast decompression is critical for analytics queries that repeatedly read compressed data. While decompression can be parallelized somewhat by assigning each data block to a different process, break-through speed-ups require exploiting the massive parallelism of modern multi-core processors and GPUs for data decompression within a block. We propose two new techniques to increase the degree of parallelism during decompression. The first technique exploits the massive parallelism of GPU and SIMD architectures. The second sacrifices some compression efficiency to eliminate data dependencies that limit parallelism during decompression. We evaluate these techniques on the decompressor of the DEFLATE scheme, called Inflate, which is based on LZ77 compression and Huffman encoding. We achieve a 2× speed-up in a head-to-head comparison with several multi core CPU-based libraries, while achieving a 17% energy saving with comparable compression ratios.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124399801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Locality-Aware Laplacian Mesh Smoothing","authors":"G. Aupy, Jeonghyung Park, P. Raghavan","doi":"10.1109/ICPP.2016.74","DOIUrl":"https://doi.org/10.1109/ICPP.2016.74","url":null,"abstract":"In this paper, we propose a novel reordering scheme to improve the performance of a Laplacian Mesh Smoothing (LMS). While the Laplacian smoothing algorithm is well optimized and studied, we show how a simple reordering of the vertices of the mesh can greatly improve the execution time of the smoothing algorithm. The idea of our reordering is based on (i) the postulate that cache misses are a very time consuming part of the execution of LMS, and (ii) the study of the reuse distance patterns of various executions of the LMS algorithm. Our reordering algorithm is very simple but allows for huge performance improvement. We ran it on a Westmere-EX platform and obtained a speedup of 75 on 32 cores compared to the single core execution without reordering, and a gain in execution of 32% on 32 cores compared to state of the art reordering. Finally, we show that we leave little room for a better ordering by reducing the L2 and L3 cache misses to a bare minimum.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125168379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Parallel Algorithms for k-Center Clustering","authors":"J. McClintock, Anthony Wirth","doi":"10.1109/ICPP.2016.22","DOIUrl":"https://doi.org/10.1109/ICPP.2016.22","url":null,"abstract":"The k-center problem is a classic NP-hard clustering question. For contemporary massive data sets, RAM-based algorithms become impractical. Although there exist good algorithms for k-center, they are all inherently sequential. In this paper, we design and implement parallel approximation algorithms for k-center. We observe that Gonzalez's greedy algorithm can be efficiently parallelized in several MapReduce rounds, in practice, we find that two rounds are sufficient, leading to a 4-approximation. In practice, we find this parallel scheme is about 100 times faster than the sequential Gonzalez algorithm, and barely compromises solution quality. We contrast this with an existing parallel algorithm for k-center that offers a 10-approximation. Our analysis reveals that this scheme is often slow, and that its sampling procedure only runs if k is sufficiently small, relative to input size. In practice, it is slightly more effective than Gonzalez's approach, but is slow. To trade off runtime for approximation guarantee, we parameterize this sampling algorithm. We prove a lower bound on the parameter for effectiveness, and find experimentally that with values even lower than the bound, the algorithm is not only faster, but sometimes more effective.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130576877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ensemble Toolkit: Scalable and Flexible Execution of Ensembles of Tasks","authors":"Vivek Balasubramanian, Antons Treikalis, Ole Weidner, S. Jha","doi":"10.1109/ICPP.2016.59","DOIUrl":"https://doi.org/10.1109/ICPP.2016.59","url":null,"abstract":"There are many science applications that require scalable task-level parallelism, support for flexible execution and coupling of ensembles of simulations. Most high-performance system software and middleware, however, are designed to support the execution and optimization of single tasks. Motivated by the missing capabilities of these computing systems and the increasing importance of task-level parallelism, we introduce the Ensemble toolkit which has the following application development features: (i) abstractions that enable the expression of ensembles as primary entities, and (ii) support for ensemble-based execution patterns that capture the majority of application scenarios. Ensemble toolkit uses a scalable pilot-based runtime system that decouples workload execution and resource management details from the expression of the application, and enables the efficient and dynamic execution of ensembles on heterogeneous computing resources. We investigate three execution patterns and characterize the scalability and overhead of Ensemble toolkit for these patterns. We investigate scaling properties for up to O(1000)concurrent ensembles and O(1000) cores and find linear weak and strong scaling behaviour.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116987073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RepEx: A Flexible Framework for Scalable Replica Exchange Molecular Dynamics Simulations","authors":"Antons Treikalis, André Merzky, Haoyuan Chen, Tai-Sung Lee, D. York, S. Jha","doi":"10.1109/ICPP.2016.78","DOIUrl":"https://doi.org/10.1109/ICPP.2016.78","url":null,"abstract":"Replica Exchange (RE) simulations have emerged as an important algorithmic tool for the molecular sciences. Typically RE functionality is integrated into the molecular simulation software package. A primary motivation of the tight integration of RE functionality with simulation codes has been performance. This is limiting at multiple levels. First, advances in the RE methodology are tied to the molecular simulation code for which they were developed. Second, it is difficult to extend or experiment with novel RE algorithms, since expertise in the molecular simulation code is required. The tight integration results in difficulty to gracefully handle failures, and other runtime fragilities. We propose the RepEx framework which is addressing aforementioned shortcomings, while striking the balance between flexibility (any RE scheme) and scalability (several thousand replicas) over a diverse range of HPC platforms. The primary contributions of the RepEx framework are: (i) its ability to support different Replica Exchange schemes independent of molecular simulation codes, (ii) provide the ability to execute different exchange schemes and replica counts independent of the specific availability of resources, (iii) provide a runtime system that has first-class support for task-level parallelism, and (iv) provide a required scalability along multiple dimensions.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126951568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Collision/Conflict-Free Distance-2 Coloring in Wireless Synchronous Broadcast/Receive Tree Networks","authors":"Davide Frey, Hicham Lakhlef, M. Raynal","doi":"10.1109/ICPP.2016.47","DOIUrl":"https://doi.org/10.1109/ICPP.2016.47","url":null,"abstract":"This article is on message-passing systems where communication is (a) synchronous and (b) based on the “broadcast/receive” pair of communication operations. “Synchronous” means that time is discrete and appears as a sequence of time slots (or rounds) such that each message is received in the very same round in which it is sent. “Broadcast/receive” means that during a round a process can either broadcast a message to its neighbors or receive a message from one of them. In such a communication model, no two neighbors of the same process, nor a process and any of its neighbors, must be allowed to broadcast during the same time slot (thereby preventing message collisions in the first case, and message conflicts in the second case). From a graph theory point of view, the allocation of slots to processes is known as the distance-2 coloring problem: a color must be associated with each process (defining the time slots in which it will be allowed to broadcast) in such a way that any two processes at distance at most 2 obtain different colors, while the total number of colors is “as small as possible”. The paper presents a parallel message-passing distance-2 coloring algorithm suited to trees, whose roots are dynamically defined. This algorithm, which is itself collision-free and conflictfree, uses Δ + 1 colors where Δ is the maximal degree of the graph (hence the algorithm is color-optimal). It does not require all processes to have different initial identities, and its time complexity is O(dΔ), where d is the depth of the tree. As far as we know, this is the first distributed distance-2 coloring algorithm designed for the broadcast/receive round-based communication model, which owns all the previous properties. Index Terms-Broadcast/receive communication, Collision, Conflict, Distance-2 graph coloring, Message-passing, Network traversal, Synchronous system, Time slot assignment, Tree network, Wireless network.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124723178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}