{"title":"Near-Optimal Distributed Algorithms for Fault-Tolerant Tree Structures","authors":"M. Ghaffari, M. Parter","doi":"10.1145/2935764.2935795","DOIUrl":"https://doi.org/10.1145/2935764.2935795","url":null,"abstract":"Tree structures such as breadth-first search (BFS) trees and minimum spanning trees (MST) are among the most fundamental graph structures in distributed network algorithms. However, by definition, these structures are not robust against failures and even a single edge's removal can disrupt their functionality. A well-studied concept which attempts to circumvent this issue is Fault-Tolerant Tree Structures, where the tree gets augmented with additional edges from the network so that the functionality of the structure is maintained even when an edge fails. These structures, or other equivalent formulations, have been studied extensively from a centralized viewpoint. However, despite the fact that the main motivations come from distributed networks, their distributed construction has not been addressed before. In this paper, we present distributed algorithms for constructing fault-tolerant BFS and MST structures. The time complexities of our algorithms are nearly optimal in the following strong sense: they almost match even the lower bounds for constructing (basic) BFS and MST trees.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"282 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127551725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Practical Solution to the Cactus Stack Problem","authors":"Chaoran Yang, J. Mellor-Crummey","doi":"10.1145/2935764.2935787","DOIUrl":"https://doi.org/10.1145/2935764.2935787","url":null,"abstract":"Work-stealing is a popular method for load-balancing dynamic multithreaded computations on shared-memory systems. In theory, a randomized work-stealing scheduler can achieve near linear speedup when the computation has sufficient parallelism and requires stack space that is linear in the number of processors. In practice, however, work-stealing runtimes sacrifice interoperability with serial code to achieve these bounds. For example, both Cilk and Cilk++ prohibit a C function from calling a Cilk function. Other work-stealing runtime systems that do not have this restriction either lack a strong time bound, which might cause them to deliver little or no speedup in the worst case, or lack a strong space bound, which might lead to an excessive memory footprint. This problem was previously described as the cactus stack problem. In this paper, we present Fibril, a new multithreading library that supports a fork-join programming model using work-stealing. Fibril solves the cactus stack problem by (1) implementing on a cactus stack that conforms to the calling conventions of serial code and (2) returning unused memory pages of suspended stacks to the operating system to bound consumption of physical memory. Theoretically, Fibril achieves strong bounds on both time and memory usage without sacrificing interoperability with serial code. 
Empirically, Fibril achieves up to 3x the performance of Intel Cilk Plus and up to 8x the performance of Intel Threading Building Blocks for the 12 benchmarks we evaluated.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"201 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124402400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scheduling Parallelizable Jobs Online to Minimize the Maximum Flow Time","authors":"Kunal Agrawal, Jing Li, Kefu Lu, Benjamin Moseley","doi":"10.1145/2935764.2935782","DOIUrl":"https://doi.org/10.1145/2935764.2935782","url":null,"abstract":"In this paper we study the problem of scheduling a set of dynamic multithreaded jobs with the objective of minimizing the maximum latency experienced by any job. We assume that jobs arrive online and the scheduler has no information about the arrival rate, arrival time or work distribution of the jobs. The scheduling goal is to minimize the maximum amount of time between the arrival of a job and its completion --- this goal is referred to in scheduling literature as maximum flow time. While theoretical online scheduling of parallel jobs has been studied extensively, most prior work has focussed on a highly stylized model of parallel jobs called the \"speedup curves model.\" We model parallel jobs as directed acyclic graphs, which is a more realistic way to model dynamic multithreaded jobs. In this context, we prove that a simple First-In-First-Out scheduler is (1+ε)-speed O(1/ε)-competitive for any ε > 0. We then develop a more practical work-stealing scheduler and show that it has a maximum flow time of O(1/ε² max{opt, ln(n)}) for n jobs, with (1+ε)-speed. This result is essentially tight as we also provide a lower bound of Ω(log(n)) for work stealing. In addition, for the case where jobs have weights (typically representing priorities) and the objective is minimizing the maximum weighted flow time, we show a non-clairvoyant algorithm is (1+ε)-speed O(1/ε²)-competitive for any ε > 0, which is essentially the best positive result that can be shown in the online setting for the weighted case due to strong lower bounds without resource augmentation. After establishing theoretical results, we perform an empirical study of work-stealing. 
Our results indicate that, on both real world and synthetic workloads, work-stealing performs almost as well as an optimal scheduler.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116458484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","authors":"C. Scheideler, Seth Gilbert","doi":"10.1145/2935764","DOIUrl":"https://doi.org/10.1145/2935764","url":null,"abstract":"It is my great pleasure to welcome you to the 28th ACM Symposium on Parallelism in Algorithms and Architectures.\n\nThe goal of SPAA is to develop a deeper understanding of parallelism in all its forms, bringing together the theory and practice of parallel computing. This year's program reflects that goal, with a diverse selection of papers at the cutting edge of parallel computing. The program includes 38 regular papers and 14 brief announcements, as well as keynote talks by Michael I. Jordan and Nir Shavit.\n\nTraditional topics in parallelism are well represented at SPAA this year. The program includes papers on parallel algorithms for classical questions (e.g., sorting and graph problems, see Sessions 9 and 14). It includes papers on scheduling parallel computations (see Session 3) and scheduling tasks in parallel systems (see Sessions 6 and 8). The program also includes papers on concurrent data structures (see Session 11), and on parallelism in distributed systems (see Session 13). These topics all have a long history at SPAA.\n\nOver the last several years, the study of parallelism has expanded to include new models of parallel computation (e.g., Map-Reduce, see Session 1), new architectures (e.g., GPUs, see Session 9), new techniques for managing parallelism (e.g., transactional memory, see Session 4), and new types of parallel systems (e.g., programmable matter, see Session 10). These increasingly important topics are represented at SPAA this year. 
\n\nThe best paper award for SPAA 2016 is awarded to a paper focusing on the limitations of certain new models of parallel computation:\nShuffles and Circuits (On Lower Bounds for Modern Parallel Computation) by Tim Roughgarden, Sergei Vassilvitskii and Joshua Wang.\n\nThe authors develop lower bounds on the speed of large-scale parallel computation in a model meant to capture the capabilities of Map-Reduce and Hadoop. They discover an important connection between these computations and polynomials representing boolean functions, and use this fact to show lower bounds for a variety of natural and important problems.\n\nWe would also like to recognize (in no particular order) three finalists for the best paper award:\nRandomized approximate nearest neighbor search with limited adaptivity by Mingmou Liu, Xiaoyin Pan and Yitong Yin.\nRobust and Probabilistic Failure-Aware Placement by Madhukar Korupolu and Rajmohan Rajaraman.\nLock-free Transactions without Aborts for Linked Data Structures by Deli Zhang and Damian Dechev.\n\nThese papers highlight the variety of exciting work in parallelism that is represented at SPAA 2016.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129889445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lock-free Transactions without Rollbacks for Linked Data Structures","authors":"Deli Zhang, D. Dechev","doi":"10.1145/2935764.2935780","DOIUrl":"https://doi.org/10.1145/2935764.2935780","url":null,"abstract":"Non-blocking data structures allow scalable and thread-safe accesses to shared data. They provide individual operations that appear to execute atomically. However, it is often desirable to execute multiple operations atomically in a transactional manner. Previous solutions, such as software transactional memory (STM) and transactional boosting, manage transaction synchronization in an external layer separated from the data structure's own thread-level concurrency control. Although this reduces programming effort, it leads to overhead associated with additional synchronization and the need to roll back aborted transactions. In this work, we present a new methodology for transforming high-performance lock-free linked data structures into high-performance lock-free transactional linked data structures without revamping the data structures' original synchronization design. Our approach leverages the semantic knowledge of the data structure to eliminate the overhead of false conflicts and rollbacks. We encapsulate all operations, operands, and transaction status in a transaction descriptor, which is shared among the nodes accessed by the same transaction. We coordinate threads to help finish the remaining operations of delayed transactions based on their transaction descriptors. When a transaction fails, we recover the correct abstract state by reversely interpreting the logical status of a node. In our experimental evaluation using transactions with randomly generated operations, our lock-free transactional lists and skiplist outperform the transactional boosted ones by 40% on average and as much as 125% for large transactions. They also outperform the alternative STM-based approaches by a factor of 3 to 10 across all scenarios. 
More importantly, we achieve 4 to 6 orders of magnitude fewer spurious aborts than the alternatives.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131217379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust and Probabilistic Failure-Aware Placement","authors":"M. Korupolu, R. Rajaraman","doi":"10.1145/2935764.2935802","DOIUrl":"https://doi.org/10.1145/2935764.2935802","url":null,"abstract":"Motivated by the growing complexity and heterogeneity of modern data centers, and the prevalence of commodity component failures, this paper studies the failure-aware placement problem of placing tasks of a parallel job on machines in the data center with the goal of increasing availability. We consider two models of failures: adversarial and probabilistic. In the adversarial model, each node has a weight (higher weight implying higher reliability) and the adversary can remove any subset of nodes of total weight at most a given bound W and our goal is to find a placement that incurs the least disruption against such an adversary. In the probabilistic model, each node has a probability of failure and we need to find a placement that maximizes the probability that at least K out of N tasks survive at any time. For adversarial failures, we first show that (i) the problems are in Σ2, the second level of the polynomial hierarchy, (ii) a basic variant, that we call RobustFAP, is co-NP-hard, and (iii) an all-or-nothing version of RobustFAP is Σ2-complete. We then give a PTAS for RobustFAP, a key ingredient of which is a solution that we design for a fractional version of RobustFAP. We then study fractional RobustFAP over hierarchies, denoted HierRobustFAP, and introduce a notion of hierarchical max-min fairness and a novel Generalized Spreading algorithm which is simultaneously optimal for all W. These generalize the classical notion of max-min fairness to work with nodes of differing capacities, differing reliability weights and hierarchical structures. Using randomized rounding, we extend this to give an algorithm for integral HierRobustFAP. 
For the probabilistic version, we first give an algorithm that achieves an additive ε approximation in the failure probability for the single level version, called ProbFAP, while giving up a (1 + ε) multiplicative factor in the number of failures. We then extend the result to the hierarchical version, HierProbFAP, achieving an ε additive approximation in failure probability while giving up an (L + ε) multiplicative factor in the number of failures, where L is the number of levels in the hierarchy.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124396529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Brief Announcement: Transactional Data Structure Libraries","authors":"A. Spiegelman, Guy Golan-Gueta, I. Keidar","doi":"10.1145/2935764.2935805","DOIUrl":"https://doi.org/10.1145/2935764.2935805","url":null,"abstract":"We introduce transactions into libraries of concurrent data structures; such transactions can be used to ensure atomicity of sequences of data structure operations. By restricting transactional access to a well-defined set of data structure operations, we strike a balance between the ease-of-programming of transactions and the efficiency of custom-tailored data structures. We exemplify this concept by designing and implementing a library supporting transactions on any number of maps, sets (implemented as skiplists), and queues. Our library offers efficient and scalable transactions, which are an order of magnitude faster than state-of-the-art transactional memory toolkits. Moreover, our approach treats stand-alone data structure operations (like put and enqueue) as first class citizens, and allows them to execute with virtually no overhead, at the speed of the original data structure library.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116912589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Brief Announcement: MIC++: Accelerating Maximal Information Coefficient Calculation with GPUs and FPGAs","authors":"Chao Wang, Xi Li, Aili Wang, Xuehai Zhou","doi":"10.1145/2935764.2935804","DOIUrl":"https://doi.org/10.1145/2935764.2935804","url":null,"abstract":"Discovering relationships and associations between pairs of variables in large data sets has become one of the most significant challenges for bioinformatics scientists. To tackle this problem, the maximal information coefficient (MIC) is widely applied as a measure of the linear or non-linear association between two variables. To improve the performance of MIC calculation, in this work we present MIC++, a parallel approach based on heterogeneous accelerators, including Graphic Processing Unit (GPU) and Field Programmable Gate Array (FPGA) engines, focusing on both coarse-grained and fine-grained parallelism. To evaluate MIC++, we demonstrate its performance on state-of-the-art GPU accelerators and FPGA-based accelerators. Preliminary estimates show that the proposed parallel implementation can achieve speedups of 6X-14X using GPUs and 4X-13X using FPGA-based accelerators.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124686959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Brief Announcement: Energy Optimization of Memory Intensive Parallel Workloads","authors":"Chhaya Trehan, H. Vandierendonck, G. Karakonstantis, Dimitrios S. Nikolopoulos","doi":"10.1145/2935764.2935811","DOIUrl":"https://doi.org/10.1145/2935764.2935811","url":null,"abstract":"Energy consumption is an important concern in modern multicore processors. The energy consumed during the execution of an application can be minimized by tuning the hardware state utilizing knobs such as frequency, voltage etc. The existing theoretical work on energy minimization using Global DVFS (Dynamic Voltage and Frequency Scaling), despite being thorough, ignores the energy consumed by the CPU on memory accesses and the dynamic energy consumed by the idle cores. This article presents an analytical energy-performance model for parallel workloads that accounts for the energy consumed by the CPU chip on memory accesses in addition to the energy consumed on CPU instructions. In addition, the model we present also accounts for the dynamic energy consumed by the idle cores. We present an analytical framework around our energy-performance model to predict the operating frequencies for global DVFS that minimize the overall CPU energy consumption. 
We show how the optimal frequencies in our model differ from the optimal frequencies in a model that does not account for memory accesses.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126077880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Algorithms for Summing Floating-Point Numbers","authors":"M. Goodrich, A. Eldawy","doi":"10.1145/2935764.2935779","DOIUrl":"https://doi.org/10.1145/2935764.2935779","url":null,"abstract":"The problem of exactly summing n floating-point numbers is a fundamental problem that has many applications in large-scale simulations and computational geometry. Unfortunately, due to the round-off error in standard floating-point operations, this problem becomes very challenging. Moreover, all existing solutions rely on sequential algorithms which cannot scale to the huge datasets that need to be processed. In this paper, we provide several efficient parallel algorithms for summing n floating point numbers, so as to produce a faithfully rounded floating-point representation of the sum. We present algorithms in PRAM, external-memory, and MapReduce models, and we also provide an experimental analysis of our MapReduce algorithms, due to their simplicity and practical efficiency.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125467154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}